Sarun Tapee

Posted on Jul 22

Context Free Gramma (CFG)

#cpp #learning

A grammar is represented by a production rule which looks like this X -> Y, it means that we can convert X to Y, where X is a non-terminal symbol, and Y could be a terminal symbol, non-terminal symbol, or the combination of both. For example, let's look at the following grammar.

A -> aB
B -> bA | b | ba

Let's label the rule for ease of explanation.

     (1)
A -> aB

     (2)  (3)  (4)
B -> bA | b  | ba

We can try to expand this production rule by starting from the first rule as follwing table.

apply rule No.	result string
	A
1	aB
2	abA
1	abaB
4	ababa

After applying rule number 4, we can not expand the string anymore because it is terminated. And we can say that the string ababa complies with our grammar. If you look at the above grammar, you will notice that it is nothing but the grammar of alternate characters a and b

Implement alternate a b grammar

Let's convert the above grammar to code. Begin with A -> aB

// alternate_a_b.c
parse_status A() {
    PARSE_INIT;
    d = cyylex();
    if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "a") != 0) RETURN_PARSE_ERROR;
    parse_status s = B();
    if (s == PARSE_ERROR) RETURN_PARSE_ERROR;
    return PARSE_SUCCESS;
}

Now let's implement B -> bA | b | ba. Since there are 3 production rules. We need to try each production rule one by one, and if the testing fails, then move to the next production rule. The order of testing matters. If we test rule number 3 before rule number 2, it will match character b and return success without checking rule number 2. So we should order the rule that covers a bigger domain before the rule that is a subdomain. In this case, we should order as such: (2), (4), (3).

Since we need to parse the string according to the rule one by one. If the parsing fails in the middle, we want to jump to the next rule. The do {...} while(0) control structure can make our code more elegant since if the checking fails in the middle, we can break out of the loop and check the next rule.

// rule 1
do {
  // check1
  // check2
  if (fail) break; // move to next rule
  // check3
} while(0);

// rule 2
do {
} while (0);

// rule 3
do {
} while (0);

...

The production rule B -> bA | b | ba will look like the following.

// alternate_a_b.c
...

parse_status B() {
    PARSE_INIT;
    parse_status s;

    do {
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "b") != 0)
            break;
        s = A();
        if (s == PARSE_ERROR) 
            break;
        return PARSE_SUCCESS;
    } while(0);

    RESTORE_CHECK_POINT;

    do {
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "b") != 0) 
            break;
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "a") != 0)
            break;
        return PARSE_SUCCESS;
    } while(0);

    RESTORE_CHECK_POINT;

    do {
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "b") != 0)
            break;
        return PARSE_SUCCESS;
    } while(0);

    RETURN_PARSE_ERROR;
}

We can test our grammar by implementing the main function as follows.

// alternate_a_b.c
...

void parse_alternate_a_b() {
    if (A() == PARSE_ERROR) {
        printf("invalid ab\n");
    } else {
        lex_data d = cyylex();
        yyrewind(1);
        if ((int) d.token_code == (int) PARSER_EOL)
            printf("valid ab\n");
        else 
            printf("valid ab partial\n");
    }
    for (int i = 0; i <= lex_stack.top; i++) {
        lex_data *p = &lex_stack.data[i];
        printf("%s ", p->text);
    }
    printf("\n");
}

// parser.l
...

extern void parse_alternate_a_b();

int main(void) {
    int token_code;

    while (1) {
        printf("input -> ");
        fgets(lex_buffer, sizeof(lex_buffer), stdin);
        stack_reset();
        parse_alternate_a_b();
    }
    return 0;
}

Mathematical Expression

Now we know how to convert a production rule to code. Let's try to convert a mathematical expression to code. The mathematical expression can be written as follows. which can match a simple mathematical expression like a + b * c / (d + f) + 3 - 1.5

E -> E + F | E - F | F
F -> F * T | F / T | T
T -> (E) | <var> | <int> | <double>

If you try to convert the above production rule to code, you will end up with an infinite recursion. For example, E -> E + F the code will look like.

parse_status E() {
  ...
  do {
    s = E(); // this will cause infinite recursion
    if (s == PARSE_ERROR) break;
    d = cyylex();
    if (d.token_code != MATH_PLUS) break;
    s = F();
    if (s == PARSE_ERROR) break;
    return PARSE_SUCCESS;
  } while (0);
  ...
}

There are two ways to solve this issue.

Swap the order
Apply the algorithm to remove left recursion.

After swapping the order, the production rule will become.

E -> F + E | F - E | F
F -> T * F | T / F | T
T -> (E) | <var> | <int> | <double>

Then you can implement the code without suffering from left recursion.

Another approach is to apply an algorithm. The algorithm is as follows.

1. Remove all the rules that have left recursion.
2. Add a new rule and append to the remaining rules.
3. Add the removed rule to the new rule, but delete the first non-terminal symbol that causes left recursion and append the non-terminal symbol of the new rule.
2. Add an empty symbol as a terminal symbol ($) to the new rules.

Example

E -> E + F | E - F | F

Remove all the rules that have left recursion.

E -> F

Add a new rule and append to the remaining rules.

E -> FE'

E'

Add the removed rule to the new rule, but delete the first non-terminal symbol that causes left recursion and append the non-terminal symbol of the new rule.

E -> FE'

E' -> + FE' | - FE'

Add an empty symbol as a terminal symbol ($) to the new rules.

E -> FE'

E' -> + FE' | - FE' | $

After applying the algorithm to remove left recursion, the production rule will become.

E -> FE'
E' -> +FE' | -FE' | $

F -> TF'
F' -> *TF' | /TF' | $

T -> (E) | <var> | <int> | <double>

Example of E' -> +FE' | -FE' | $ implementation is as follows.

// E' -> +FE' | -FE' | $
parse_status E_dash() {
    PARSE_INIT;
    parse_status s;

    do {
        d = cyylex();
        if (d.token_code != MATH_PLUS) break;
        s = F();
        if (s == PARSE_ERROR) break;
        s = E_dash();
        if (s == PARSE_ERROR) break;
        return PARSE_SUCCESS;
    } while(0);
    RESTORE_CHECK_POINT;

    do {
        d = cyylex();
        if (d.token_code != MATH_MINUS) break;
        s = F();
        if (s == PARSE_ERROR) break;
        s = E_dash();
        if (s == PARSE_ERROR) break;
        return PARSE_SUCCESS;
    } while(0);
    RESTORE_CHECK_POINT;

    return PARSE_SUCCESS; // match empty rule
}

DEV Community

Context Free Gramma (CFG)

Implement alternate a b grammar

Mathematical Expression

Example

Top comments (0)