Monday, May 29, 2023

Parsing a Floating Point String

or, Regular Expressions for Fun and Profit!

Here's a quick note on efficiently traversing and processing a regular expression that doesn't require backtracking. The example regular expression is a typical floating point string with exponential support. The goal is to efficiently parse the string

  • in a single pass

  • looking only one character ahead at any point

  • using no string functions other that to access each individual character


Example:


([-+]?\d+(\.\d*)?)([eE]([-+]?\d+))?

Step 1.

Break the RE into the set of states it represents, and the possible transitions from each state.


For example, at the beginning of the string ("start" state), there are four possible transitions:

  • plus sign (e.g. "+123")

  • minus sign (e.g. "-123")

  • digit (e.g. "123")

  • end-of-string (e.g. "")


It can help to make a first pass that just identifies the parts of your regular expression


sketchviz



and use that to expand to the complete set of states and transitions.

sketchviz

 

(pardon the large gap above, blogspot is messing up)

Step 2


Fill in the actions needed for each state.

At the transition of each state, we may optionally consume one character.  For the start state:

  • for plus or minus, we consume a character, since it is not necessary in any subsequent states.

  • for a digit, we don't consume a character, since we will need that first digit in the "int" state.


In each state, we may optionally perform some processing.  For example, let us assume we have a variable "sign" which will indicate if the scanned string represents a positive or negative value.  We would take these actions:

  • in state sign_positive, set sign = 1;

  • in state sign_negative, set sign = -1;


In the "int" state, we would add the digit to the variable representing the integer portion of the string:

  • int_val = int_val * 10;     // shift left one digit
    int_val = int_val + ordinal_value(current digit)  // add this digit's value


If the next character ever fails to have a transition out of the current state, it represents and error in the source string.  You can keep a character count for precise location, and report exactly what the error is.

  • "+12J5"  --> position 3, "character J is not a legal character"


In General:


  • We traverse the state diagram character by character from the input string.

  • Each state specifies the exact set of characters that can transition to subsequent states.

  • Each transition specifies whether or not to consume a character.

  • Each state optionally has processing to extract information from the current character.

  • (some constructs require backtracking, which we will ignore here).

  • A character without an outbound edge represents badly formed input.

Example Parsing

  • "123": (start) 1 (int) 2 (int) 3 (int) eos (end)

  • "+1.2.3": (start) + (sign_pos) 1 (int) . (dot) e (E) 3 (exp) eos (end)

  • "12345": (start) 1 (int) 2 (int) 3 (int) 4 (int) 1 (int) 6 (int) eos (end)

  • "hello":  (start) h (error)

  • "": (start) eos (end)


Example Code

I did this in C (a) to show there's no fancy functionality needed from your language, (b) because that's how I used to do it when I was a compiler writer and this was making me feel nostalgic, and (c) to have some fun by seeing how people will react to a claim where GOTO and case fall-throughs make for clean, easy to follow code. :) [disclaimer: I haven't timed the difference between the case code and the equivalent if code, but from examining the assembler output it looks like the case code is generating the efficient version of a vector table. In either case, since the code is so small and makes no function calls, it's pretty much guaranteed to execute all in cache for both the code and data.]



#include <stdio.h>


int parse(char *p) {

    printf("%20s -- ", p);


    char *errm;

    char *p0 = p;

    int mantissa_sign = 1;

    int exponent_sign = 1;

    unsigned long int_part = 0;

    unsigned long frac_part = 0;

    unsigned int scale = 0;

    unsigned long exponent = 0;

    int trailing_zeros = 0;

    errm = "unknown error";


    switch (*p) {

    case '-':

        mantissa_sign = -1;

p++;

break;

    case '+':

p++;

break;

    case '0':

    case '1':

    case '2':

    case '3':

    case '4':

    case '5':

    case '6':

    case '7':

    case '8':

    case '9':

    case '.':

break;

    case 0:

goto finished;

    default:

errm = "unexpected char in mantissa_sign";

        goto error;

    }


int_part:

    switch (*p) {

    case '.':

        p++;

break;

    case '0':

    case '1':

    case '2':

    case '3':

    case '4':

    case '5':

    case '6':

    case '7':

    case '8':

    case '9':

        int_part *= 10;

int_part += *p - '0';

p++;

goto int_part;

    case 'e':

    case 'E':

p++;

goto exponent;

    case 0:

goto finished;

    default:

errm = "unexpected char in int_part";

        goto error;

    }


frac_part_leading_zero:

    switch (*p) {

    case '0':

        scale += 1;

p++;

goto frac_part_leading_zero;

    case '1':

    case '2':

    case '3':

    case '4':

    case '5':

    case '6':

    case '7':

    case '8':

    case '9':

goto frac_part;

    case 0:

goto finished;

    case 'e':

    case 'E':

p++;

goto exponent;

    default:

errm = "unexpected char in frac_part_leading_zero";

        goto error;

    }


frac_part:

    switch (*p) {

    case '0':

        trailing_zeros += 1;

        frac_part *= 10;

frac_part += *p - '0';

p++;

goto frac_part;

    case '1':

    case '2':

    case '3':

    case '4':

    case '5':

    case '6':

    case '7':

    case '8':

    case '9':

trailing_zeros = 0;

        frac_part *= 10;

frac_part += *p - '0';

p++;

goto frac_part;

    case 'e':

    case 'E':

p++;

goto exponent_sign;

    case 0:

goto finished;

    default:

errm = "unexpected char in frac_part";

        goto error;

    }


exponent_sign:

    switch (*p) {

    case '-':

        exponent_sign = -1;

p++;

goto exponent;

    case '+':

p++;

goto exponent;

    case '0':

    case '1':

    case '2':

    case '3':

    case '4':

    case '5':

    case '6':

    case '7':

    case '8':

    case '9':

goto exponent;

    case 0:

errm = "unexpected char in exponent_sign";

goto finished;

    }


exponent:

    switch (*p) {

    case '0':

    case '1':

    case '2':

    case '3':

    case '4':

    case '5':

    case '6':

    case '7':

    case '8':

    case '9':

        exponent *= 10;

exponent += *p - '0';

p++;

goto exponent;

    case 0:

goto finished;

    default:

errm = "unexpected char in exponent";

        goto error;

    }


finished:

    for (int i = 0; i < trailing_zeros; i++) {

        frac_part = frac_part / 10;

    }

    printf("signs=%d,%d int=%lu frac=%lu trail=%d scale=%u exp=%lu\n",

            mantissa_sign, exponent_sign,

            int_part, frac_part, trailing_zeros, scale, exponent);

    return 0;


error:

    printf("pos=%ld c=%c,%d error:%s\n", p-p0, *p, *p, errm);

    return 1;

}


int main() {

    parse("+123");

    parse("-123");

    parse("123");

    parse("+123.");

    parse("+123.456");

    parse("-123.456");

    parse("123.456");


    parse("1e0");

    parse("1e1");

    parse("1E0");

    parse("1E1");

    parse("1.23e10");

    parse("1.23e-10");

    parse("");


    parse("123.00456");

    parse("123.004560000");

}