DEV Community

Sarun Tapee
Sarun Tapee

Posted on

Context Free Gramma (CFG)

A grammar is represented by a production rule which looks like this X -> Y, it means that we can convert X to Y, where X is a non-terminal symbol, and Y could be a terminal symbol, non-terminal symbol, or the combination of both. For example, let's look at the following grammar.

A -> aB
B -> bA | b | ba
Enter fullscreen mode Exit fullscreen mode

Let's label the rule for ease of explanation.

     (1)
A -> aB

     (2)  (3)  (4)
B -> bA | b  | ba
Enter fullscreen mode Exit fullscreen mode

We can try to expand this production rule by starting from the first rule as follwing table.

apply rule No. result string
A
1 aB
2 abA
1 abaB
4 ababa

After applying rule number 4, we can not expand the string anymore because it is terminated. And we can say that the string ababa complies with our grammar. If you look at the above grammar, you will notice that it is nothing but the grammar of alternate characters a and b

Implement alternate a b grammar

Let's convert the above grammar to code. Begin with A -> aB

// alternate_a_b.c
parse_status A() {
    PARSE_INIT;
    d = cyylex();
    if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "a") != 0) RETURN_PARSE_ERROR;
    parse_status s = B();
    if (s == PARSE_ERROR) RETURN_PARSE_ERROR;
    return PARSE_SUCCESS;
}

Enter fullscreen mode Exit fullscreen mode

Now let's implement B -> bA | b | ba. Since there are 3 production rules. We need to try each production rule one by one, and if the testing fails, then move to the next production rule. The order of testing matters. If we test rule number 3 before rule number 2, it will match character b and return success without checking rule number 2. So we should order the rule that covers a bigger domain before the rule that is a subdomain. In this case, we should order as such: (2), (4), (3).

Since we need to parse the string according to the rule one by one. If the parsing fails in the middle, we want to jump to the next rule. The do {...} while(0) control structure can make our code more elegant since if the checking fails in the middle, we can break out of the loop and check the next rule.

// rule 1
do {
  // check1
  // check2
  if (fail) break; // move to next rule
  // check3
} while(0);

// rule 2
do {
} while (0);

// rule 3
do {
} while (0);

...

Enter fullscreen mode Exit fullscreen mode

The production rule B -> bA | b | ba will look like the following.

// alternate_a_b.c
...

parse_status B() {
    PARSE_INIT;
    parse_status s;

    do {
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "b") != 0)
            break;
        s = A();
        if (s == PARSE_ERROR) 
            break;
        return PARSE_SUCCESS;
    } while(0);

    RESTORE_CHECK_POINT;

    do {
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "b") != 0) 
            break;
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "a") != 0)
            break;
        return PARSE_SUCCESS;
    } while(0);

    RESTORE_CHECK_POINT;

    do {
        d = cyylex();
        if (d.token_code != MATH_IDENTIFIER || strcmp(d.text, "b") != 0)
            break;
        return PARSE_SUCCESS;
    } while(0);

    RETURN_PARSE_ERROR;
}


Enter fullscreen mode Exit fullscreen mode

We can test our grammar by implementing the main function as follows.

// alternate_a_b.c
...

void parse_alternate_a_b() {
    if (A() == PARSE_ERROR) {
        printf("invalid ab\n");
    } else {
        lex_data d = cyylex();
        yyrewind(1);
        if ((int) d.token_code == (int) PARSER_EOL)
            printf("valid ab\n");
        else 
            printf("valid ab partial\n");
    }
    for (int i = 0; i <= lex_stack.top; i++) {
        lex_data *p = &lex_stack.data[i];
        printf("%s ", p->text);
    }
    printf("\n");
}

// parser.l
...

extern void parse_alternate_a_b();

int main(void) {
    int token_code;

    while (1) {
        printf("input -> ");
        fgets(lex_buffer, sizeof(lex_buffer), stdin);
        stack_reset();
        parse_alternate_a_b();
    }
    return 0;
}


Enter fullscreen mode Exit fullscreen mode

Mathematical Expression

Now we know how to convert a production rule to code. Let's try to convert a mathematical expression to code. The mathematical expression can be written as follows. which can match a simple mathematical expression like a + b * c / (d + f) + 3 - 1.5

E -> E + F | E - F | F
F -> F * T | F / T | T
T -> (E) | <var> | <int> | <double>
Enter fullscreen mode Exit fullscreen mode

If you try to convert the above production rule to code, you will end up with an infinite recursion. For example, E -> E + F the code will look like.

parse_status E() {
  ...
  do {
    s = E(); // this will cause infinite recursion
    if (s == PARSE_ERROR) break;
    d = cyylex();
    if (d.token_code != MATH_PLUS) break;
    s = F();
    if (s == PARSE_ERROR) break;
    return PARSE_SUCCESS;
  } while (0);
  ...
}

Enter fullscreen mode Exit fullscreen mode

There are two ways to solve this issue.

  • Swap the order
  • Apply the algorithm to remove left recursion.

After swapping the order, the production rule will become.

E -> F + E | F - E | F
F -> T * F | T / F | T
T -> (E) | <var> | <int> | <double> 
Enter fullscreen mode Exit fullscreen mode

Then you can implement the code without suffering from left recursion.

Another approach is to apply an algorithm. The algorithm is as follows.

1. Remove all the rules that have left recursion.
2. Add a new rule and append to the remaining rules.
3. Add the removed rule to the new rule, but delete the first non-terminal symbol that causes left recursion and append the non-terminal symbol of the new rule.
2. Add an empty symbol as a terminal symbol ($) to the new rules.
Enter fullscreen mode Exit fullscreen mode

Example

E -> E + F | E - F | F
Enter fullscreen mode Exit fullscreen mode

Remove all the rules that have left recursion.

E -> F
Enter fullscreen mode Exit fullscreen mode

Add a new rule and append to the remaining rules.

E -> FE'

E'
Enter fullscreen mode Exit fullscreen mode

Add the removed rule to the new rule, but delete the first non-terminal symbol that causes left recursion and append the non-terminal symbol of the new rule.

E -> FE'

E' -> + FE' | - FE'
Enter fullscreen mode Exit fullscreen mode

Add an empty symbol as a terminal symbol ($) to the new rules.

E -> FE'

E' -> + FE' | - FE' | $
Enter fullscreen mode Exit fullscreen mode

After applying the algorithm to remove left recursion, the production rule will become.

E -> FE'
E' -> +FE' | -FE' | $

F -> TF'
F' -> *TF' | /TF' | $

T -> (E) | <var> | <int> | <double>
Enter fullscreen mode Exit fullscreen mode

Example of E' -> +FE' | -FE' | $ implementation is as follows.

// E' -> +FE' | -FE' | $
parse_status E_dash() {
    PARSE_INIT;
    parse_status s;

    do {
        d = cyylex();
        if (d.token_code != MATH_PLUS) break;
        s = F();
        if (s == PARSE_ERROR) break;
        s = E_dash();
        if (s == PARSE_ERROR) break;
        return PARSE_SUCCESS;
    } while(0);
    RESTORE_CHECK_POINT;

    do {
        d = cyylex();
        if (d.token_code != MATH_MINUS) break;
        s = F();
        if (s == PARSE_ERROR) break;
        s = E_dash();
        if (s == PARSE_ERROR) break;
        return PARSE_SUCCESS;
    } while(0);
    RESTORE_CHECK_POINT;

    return PARSE_SUCCESS; // match empty rule
}
Enter fullscreen mode Exit fullscreen mode

Top comments (0)