Just as JFlex generates lexers, Jacc generates parsers, but what's the difference? A lexer recognizes words, while a parser recognizes whole sentences; more formally, lexers work with regular grammars and parsers work with context-free grammars.
This is the reason the two are often used together — you first use a lexer to recognize words and pass those words to a parser which is able to determine if the words form a valid sentence.
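To make that division of labor concrete, here is a toy Java sketch (not JFlex or Jacc code; all names here are made up): a crude "lexer" splits the input into words, and a crude "parser" decides whether the word sequence forms a valid sentence, defined here as one or more alphabetic words.

```java
import java.util.List;
import java.util.regex.Pattern;

public class LexParseToy {
    static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // "Lexer": chop the input into words on whitespace.
    static List<String> lex(String input) {
        return List.of(input.trim().split("\\s+"));
    }

    // "Parser": a valid sentence is one or more words, nothing else.
    static boolean parse(List<String> tokens) {
        if (tokens.isEmpty() || tokens.get(0).isEmpty()) return false;
        for (String t : tokens) {
            if (!WORD.matcher(t).matches()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(parse(lex("hello world"))); // true
    }
}
```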
Now, Java being Java, there is a choice of parser generators out there: Antlr is ubiquitous, but there are also CoCo/R, JavaCC, SableCC, Cup, Byacc/J and probably many others. Even the venerable Bison is capable of generating Java parsers. Some parser generators, like Antlr, CoCo/R and JavaCC, don't even need a separate lexer to feed them words; they can generate one of their own!
So why Jacc? Why not?
OK, so here is how JFlex and Jacc work together:
JFlex reads the input as a stream of characters and produces a token for Jacc whenever Jacc asks for one. A token is a string with a meaning. For instance, +, true and 3.14 are all tokens. Some need no value beyond their type: true is a Boolean literal. Others do: 3.14 is a floating-point literal with the value 3.14.
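A token therefore carries two things: a type code and, for some tokens, a semantic value. A hypothetical Java sketch (the class and constant names are illustrative, not part of JFlex or Jacc) might look like this:

```java
public class Token {
    // Illustrative token type codes, in the spirit of the ones Jacc generates.
    public static final int PLUS = 1, TRUE = 2, FLOAT = 3;

    public final int type;        // what kind of token this is
    public final String semantic; // null when the type alone says it all

    public Token(int type, String semantic) {
        this.type = type;
        this.semantic = semantic;
    }

    public static void main(String[] args) {
        Token plus = new Token(PLUS, null);    // "+" needs no value
        Token pi = new Token(FLOAT, "3.14");   // here the value matters
        System.out.println(pi.type + " " + pi.semantic); // prints "3 3.14"
    }
}
```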
As with JFlex, a Jacc file has three distinct sections:
directives section
%%
rules section
%%
additional code section
Jacc puts the list of all the tokens it expects into a separate interface file. You name the interface and list the tokens like this:
%interface ParserTokens
%token X NL
Let’s look at the generated file:
// Output created by jacc on Mon Mar 11 09:54:05 CET 2019
interface ParserTokens {
int ENDINPUT = 0;
int NL = 1;
int X = 2;
int error = 3;
}
It is an interface that doubles as an enumeration. Apart from the two token types we asked for, NL and X, two more are created: one for the end of input and one for errors. Back in the JFlex file you “implement” this “interface”:
%class Lexer
%implements ParserTokens
so that our lexer can see the ENDINPUT, NL, X and error constants. There are a few more things Jacc expects:
- A function that returns integer values representing token types (the constants from ParserTokens, 0 through 3 in our example). Naming that function yylex is a tradition, so let's do that:
%function yylex
%int
- Three more functions: getToken to get the current token code, nextToken to read the next token code and getSemantic to get the current token value:
%{
private int token;
private String semantic;
public int getToken()
{
return token;
}
public String getSemantic()
{
return semantic;
}
public int nextToken()
{
try
{
token = yylex();
}
catch (java.io.IOException e)
{
System.out.println(
"IO exception occurred:\n" + e);
}
return token;
}
%}
Notice that we made the token's semantic value a String, so we also need to declare that in the parser:
%semantic String
For our example we'll just have the lexer recognize the word "word" in any casing (word, Word, wOrd, worD, … , wORD and WORD) plus new lines. We'll ignore whitespace.
x = [wW][oO][rR][dD]
nl = \n | \r | \r\n
space = [ \t]
%%
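The {x} macro corresponds to the Java regex [wW][oO][rR][dD]. As a quick sanity check, independent of JFlex, we can try it with java.util.regex:

```java
import java.util.regex.Pattern;

public class MacroCheck {
    // Same character classes as the {x} macro in the lexer spec.
    static final Pattern X = Pattern.compile("[wW][oO][rR][dD]");

    public static boolean isX(String s) {
        return X.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isX("wOrD"));  // true: any casing matches
        System.out.println(isX("words")); // false: extra characters don't
    }
}
```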
When the lexer finds a word it will return the X token (i.e., 2 from the ParserTokens interface) with the matched text (word, Word, …) as its value; this is what semantic = yytext(); does. When it encounters a new line it will return ENDINPUT (i.e., 0):
{x} { semantic = yytext(); return X; }
{space} { /* Ignore space */ }
{nl} { return ENDINPUT; }
[^] { System.out.println("Error?"); }
That’s it. Now the parser “grammar” in all its glory:
sentence : X { System.out.println("X found: " + $1); }
| sentence X { System.out.println("X found: " + $2); }
;
The grammar is left-recursive, which allows us to have one or more X words in a sentence. $1 and $2 will hold the X semantic value, passed there by the lexer, and we’ll simply print it out.
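Operationally, the left-recursive rule reduces "sentence : X" once for the first word and "sentence : sentence X" once for each word after it, so the actions fire left to right, one per X. A hand-written Java equivalent (the names here are illustrative, not Jacc output) is just a loop:

```java
import java.util.List;

public class SentenceSketch {
    // One "reduction" per X token, left to right; returns how many fired.
    public static int parse(List<String> xs) {
        int count = 0;
        for (String x : xs) {
            System.out.println("X found: " + x);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        parse(List.of("word", "Word", "wOrD"));
    }
}
```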
In the main method we create Lexer and Parser instances and start parsing. One thing to note here is that we have to “prime” the lexer with
parser.lexer.nextToken();
before the parser can use it. For reference, here is the full lexer source
import java.io.*;
%%
%class Lexer
%implements ParserTokens
%function yylex
%int
%{
private int token;
private String semantic;
public int getToken()
{
return token;
}
public String getSemantic()
{
return semantic;
}
public int nextToken()
{
try
{
token = yylex();
}
catch (java.io.IOException e)
{
System.out.println(
"IO exception occurred:\n" + e);
}
return token;
}
%}
x = [wW][oO][rR][dD]
nl = \n | \r | \r\n
space = [ \t]
%%
{x} { semantic = yytext(); return X; }
{space} { /* Ignore space */ }
{nl} { return ENDINPUT; }
[^] { System.out.println("Error?"); }
and that of the parser:
%{
import java.io.*;
%}
%class Parser
%interface ParserTokens
%semantic String
%token X NL
%%
sentence : X { System.out.println("X found: " + $1); }
| sentence X { System.out.println("X found: " + $2); }
;
%%
private Lexer lexer;
public Parser(Reader reader)
{
lexer = new Lexer(reader);
}
public void yyerror(String error)
{
System.err.println("Error: " + error);
}
public static void main(String args[]) throws IOException
{
System.out.println("Interactive evaluation:");
Parser parser = new Parser(
new InputStreamReader(System.in));
parser.lexer.nextToken();
parser.parse();
}
You need to run JFlex on Lexer.flex
jflex Lexer.flex
run Jacc on Parser.jacc
jacc Parser.jacc
and compile the three generated Java files (Lexer.java, Parser.java and ParserTokens.java):
javac *.java
to finally be able to run the parser:
java Parser
Here is an example terminal session:
Interactive evaluation:
word Word wOrD WORD
X found: word
X found: Word
X found: wOrD
X found: WORD
You can find the full source code on GitHub.
Actually this was pretty boring, but we are now free to play with context-free grammars!