DEV Community: Vicente Maldonado

Beautiful Soup Hello World

Vicente Maldonado — Fri, 02 Aug 2019 09:08:11 +0000

Beautiful Soup is a Python library for working with HTML and XML files. You can use it to navigate an HTML document, search it, extract data from it and even change the document structure. Let’s see how it works:

from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
        <title>Beautifulsoup Hello World</title>
    </head>
    <body>
        <h1>Header</h1>
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
        <p>Paragraph 3</p>
    </body>
    </html>
'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.title)
print(soup.title.name)
print(soup.title.text)

print(soup.p.text)

for paragraph in soup.find_all('p'):
    print(paragraph.text)

print(soup.get_text())

It’s a really basic example, but before you can run it you first need to install Beautiful Soup:

pip install beautifulsoup4

While you’re at it, install another library as well:

pip install lxml

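It’s an HTML/XML parser. The script here uses Python’s built-in html.parser, so lxml is optional; to use it, pass 'lxml' instead of 'html.parser' when creating the soup. Don’t worry about it.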

Let’s start — import Beautiful Soup:

from bs4 import BeautifulSoup

Next, we need some HTML to work with:

html = '''
    <html>
    <head>
        <title>Beautifulsoup Hello World</title>
    </head>
    <body>
        <h1>Header</h1>
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
        <p>Paragraph 3</p>
    </body>
    </html>
'''

It is a basic HTML document stored in a Python string. Of course, working with HTML stored in a Python script is not very exciting, but this is a Hello, World, so hey.

Create an instance of the BeautifulSoup class, specifying the HTML document and the parser to be used (I said don’t worry about it):

soup = BeautifulSoup(html, 'html.parser')

Now we have our HTML parsed and stored in a variable named soup and we can play with it:

print(soup.title)

Use soup.title to access the HTML document’s <title> element. This prints:

<title>Beautifulsoup Hello World</title>

Sometimes you don’t want the HTML tag. Use the text attribute:

print(soup.title.text)

to get just the element text:

Beautifulsoup Hello World

Our document has just one <title> element, so Beautiful Soup appropriately returns it when we use soup.title. But the document has three <p> elements (paragraphs), so what happens when we try to pull the same trick?

print(soup.p.text)

It returns the first <p> element in the document:

Paragraph 1

If you want to get all paragraphs in a document, just use find_all():

for paragraph in soup.find_all('p'):
    print(paragraph.text)

find_all() returns all paragraphs in the document and you can iterate over them using a simple for loop.
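For comparison, here is a rough sketch of what collecting the paragraph text looks like with only the standard library’s html.parser module, which is what the 'html.parser' argument refers to. The ParagraphCollector class is a hypothetical helper written for this illustration, not part of Beautiful Soup, and it only handles simple, non-nested paragraphs:

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty paragraph when a <p> tag opens.
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> elements (the header, for example) is ignored.
        if self.in_paragraph:
            self.paragraphs[-1] += data


html = (
    "<body><h1>Header</h1>"
    "<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></body>"
)

collector = ParagraphCollector()
collector.feed(html)
collector.close()

for paragraph in collector.paragraphs:
    print(paragraph)
```

All of that bookkeeping is what the single find_all('p') call saves you from writing.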

This is just scratching the surface with Beautiful Soup. At the end let’s see how simple it is to get all text (and only text) in the document:

print(soup.get_text())

As expected, this prints (with the surrounding whitespace trimmed):

Beautifulsoup Hello World

Header
Paragraph 1
Paragraph 2
Paragraph 3

You can find the full script on my GitHub. ttfn.

Meet the CHICKEN

Vicente Maldonado — Fri, 21 Jun 2019 17:00:38 +0000

No, not him. You see, in the beginning, there was the CHICKEN. And the Felix said, “Let there be eggs!”, and there were eggs.

CHICKEN is a variant of the programming language Scheme. Yes, the one with lots of silly parentheses. No, not that one, that’s Lisp: Lost in Stupid Parentheses. Scheme is even more alien and, believe it or not, even less usable. Or is it?

If you visit the CHICKEN web site you’ll learn that it strives to be simple, portable, extensible, well documented and actively supported. Hmm, let’s see:

Simple: simple it is. I was able to install CHICKEN on my Linux box with a single command. On Windows I had to download the source code and (gasp!) compile it to get the CHICKEN, which amounted to 6 or 7 mouse clicks and about as many keystrokes. Compiling CHICKEN from source is pretty easy if you follow the readme. Also, CHICKEN has recently been upgraded to version 5 and there are three-click installers available for the previous version (4), so it’s fairly safe to assume that a few will pop up for the current version as well. JUST BE PATIENT.

The editor. Ah, Emacs, the cause and the solution to all life’s problems. Yes, Emacs is still the best choice for the paren language family. You can compromise and use VS Code, Eclipse, JEdit or a number of text editors that support syntax highlighting for Scheme. A bit of a minus — not that there is much syntax anyway.

CHICKEN comes with several executables. ‘csi’ is the CHICKEN Scheme interpreter, which starts the REPL, a loop that all Lispers love but few can explain why. ‘csc’ is the CHICKEN Scheme compiler; it produces native executables. Yes, you can use CHICKEN to make standalone executable programs. THIS IS BIG. ‘chicken-install’ is the one you use to install external CHICKEN libraries (they are called “Eggs” of course). ‘chicken-install’ pulls the eggs’ source code from CHICKEN’s central repository, compiles it and makes it available for you to use. This works great most of the time, except that some eggs have external dependencies (the SDL graphics library for instance) that you have to install yourself before installing the eggs in question. Also note that ‘csc’ clashes with ‘csc’, the C Sharp compiler, so on some systems ‘csi’ and ‘csc’ may be called ‘chicken-csi’ and ‘chicken-csc’ instead.

All in all, CHICKEN is simple enough as long as you put in some work, just like any other programming environment, I guess.

Extensible. The eggs. The best part of CHICKEN.

Need a GUI? CHICKEN’s got you (almost) covered. The favorite CHICKEN GUI library seems to be IUP (made by the Lua guys) but I haven’t been able to install it on my Linux box due to the dependencies I mentioned earlier. On Windows, CHICKEN did come pre-packaged with IUP in its previous release (4). On the other hand the Tk (Tkinter, Python, hello!) bindings work very well and if you find that too limiting you can always use Java Swing (and probably JavaFX but I haven’t gotten around to trying that one).

Want to do some Web dev? That’s awful. Haven’t done more than the Hello World with Awful, but it’s there and it works.

Graphics programming? No problem! Databases? Still no problem! Networking?… you get the idea — heck, just look at the list for yourself — it’s impressive.

Well, CHICKEN is extensible and extended. Just keep in mind that it’s a small community that builds libraries for its own needs, so the one thing you are looking for might just be the one that’s missing.

Actively supported. In that case just ask the guys themselves. There is a mailing list called chicken-users. The list is not crazy active but it is very responsive: I posted a couple of questions there recently and received an answer directly from the man who hatched the egg, so to speak. There’s also an IRC channel, but I haven’t visited. Also, don’t be surprised if you run into the people from the CHICKEN team on Stack Overflow and Reddit (over on r/lisp and elsewhere).

Well documented. That’s an understatement. There is:

A Wiki

A getting started guide

…and more.

All in all CHICKEN is highly recommended. Just look at the little fella:

ttfn!

Python Lark Parser introduction

Vicente Maldonado — Tue, 09 Apr 2019 11:46:09 +0000

Lark is a Python parsing library. Unlike parser generators like Yacc, it doesn’t generate a source code file from a grammar; the parser is generated dynamically. Let’s see how it works. You import Lark:



from lark import Lark

then specify the grammar:



grammar = """
start: WORD "," WORD "!"
%import common.WORD
%ignore " "
"""

The grammar can be a Python string or read from a separate file. After that, just create a Lark class instance, initializing it with the grammar:



parser = Lark(grammar)

and you are ready to parse:



def main():
    print(parser.parse("Hello, world!"))
    print(parser.parse("Adios, amigo!"))

if __name__ == '__main__':
    main()

parser.parse returns a Tree instance containing the parse tree:



Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'world')])
Tree(start, [Token(WORD, 'Adios'), Token(WORD, 'amigo')])

That’s it, clean and simple. It’s up to you to decide what to do with the parsed string. Let’s see where we can go from there. Here is an example of a simple arithmetic expression parser:



from lark import Lark

grammar = """
start: add_expr
     | sub_expr

add_expr: NUMBER "+" NUMBER

sub_expr: NUMBER "-" NUMBER

%import common.NUMBER
%ignore " "
"""

The grammar ignores spaces. Also note that the grammar terminals are written in uppercase letters (NUMBER) while the grammar rules are written in lowercase letters (start, add_expr and sub_expr). %import and %ignore are directives. You can find the grammar reference in the Lark documentation. We can import definitions from other grammars, in this case common.lark (which just contains some useful definitions). The above grammar will successfully parse addition and subtraction expressions, like:



1+1
2-1
3 - 2

and nothing else. Next, create the Lark object:



parser = Lark(grammar)

and we are ready to parse:



def main():
    print(parser.parse("1+1"))
    print(parser.parse("2-1"))
    print(parser.parse("3 - 2"))    

if __name__ == '__main__':
    main()

The output is as expected:



Tree(start, [Tree(add_expr, [Token(NUMBER, '1'), Token(NUMBER, '1')])])
Tree(start, [Tree(sub_expr, [Token(NUMBER, '2'), Token(NUMBER, '1')])])
Tree(start, [Tree(sub_expr, [Token(NUMBER, '3'), Token(NUMBER, '2')])])

Note that this example just prints the parse tree as before. Let’s transform it to something more useful:



from lark import Lark, Transformer

grammar = """
start: add_expr
     | sub_expr

add_expr: NUMBER "+" NUMBER -> add_expr

sub_expr: NUMBER "-" NUMBER -> sub_expr

%import common.NUMBER
%ignore " "
"""

add_expr and sub_expr on the right-hand side of the grammar rules (after the ->) are aliases: they name the resulting tree branch, and the transformer method with the same name is applied when the rule is successfully parsed. Let’s write them:



class CalcTransformer(Transformer):

    def add_expr(self, args):
        return int(args[0]) + int(args[1])

    def sub_expr(self, args):
        return int(args[0]) - int(args[1])

For instance, when parsing

2-1

args[0] will contain "2" and args[1] will contain "1". In our transformer functions we convert both to integers and add or subtract them, returning the result. Now create the Lark object:



parser = Lark(grammar, parser='lalr', 
    transformer=CalcTransformer())

To accept a transformer in its constructor like this, the parser needs to be an LALR parser. We are finally ready to parse:



def main():
    print(parser.parse("1+1"))
    print(parser.parse("2-1"))
    print(parser.parse("3 - 2"))

if __name__ == '__main__':
    main()

The output is now:



Tree(start, [2])
Tree(start, [1])
Tree(start, [1])

Better? 1+1 is 2, 2-1 is 1 and 3-2 is also 1.
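To make the bottom-up idea behind the transformer more concrete, here is a small standard-library sketch that mimics it on a hand-built tree. Node, CalcCallbacks and transform are hypothetical stand-ins invented for this illustration; they are not Lark’s classes or API, and Lark’s real implementation differs:

```python
from collections import namedtuple

# Hypothetical stand-in for a parse tree node: a rule name plus its children.
Node = namedtuple("Node", ["rule", "children"])


class CalcCallbacks:
    """One method per rule name, mirroring the CalcTransformer above."""

    def add_expr(self, args):
        return int(args[0]) + int(args[1])

    def sub_expr(self, args):
        return int(args[0]) - int(args[1])


def transform(node, callbacks):
    """Transform the children first, then the node itself (bottom-up)."""
    if not isinstance(node, Node):
        return node  # a leaf, e.g. the string "3"
    args = [transform(child, callbacks) for child in node.children]
    handler = getattr(callbacks, node.rule, None)
    if handler is None:
        # No callback for this rule (such as start): keep the node as it is.
        return Node(node.rule, args)
    return handler(args)


# Hand-built equivalent of the parse tree for "3 - 2".
tree = Node("start", [Node("sub_expr", ["3", "2"])])
result = transform(tree, CalcCallbacks())
print(result)
```

The start node survives because there is no callback for it, which is why the real output above is still wrapped in Tree(start, ...).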

Of course this is just scratching the surface. If you are interested, you can find the full examples on GitHub.

Visitor Pattern in Java

Vicente Maldonado — Sat, 23 Mar 2019 19:04:24 +0000

When I find a concept difficult to understand I try to strip it to bare essentials. This happened to me recently with the visitor pattern so here is my take on it. Of course I will be grateful for any corrections. Here goes.

Let’s say we have three classes derived from a common parent, called A:

abstract class A
{
    public String name;
    abstract void accept(Visitor v);
}

class B that has two objects as components:

class B extends A
{
    public A child1;
    public A child2;

    public B(String name)
    {
        this.name = name;
    }

    @Override
    void accept(Visitor v)
    {
        v.visitB(this);
    }
}

class C that has one component:

class C extends A
{
    public A child;

    public C(String name)
    {
        this.name = name;
    }

    @Override
    void accept(Visitor v)
    {
        v.visitC(this);
    }
}

and class D that has no components

class D extends A
{
    public D(String name)
    {
        this.name = name;
    }

    @Override
    void accept(Visitor v)
    {
        v.visitD(this);
    }
}

All three classes expose a property, name, that lets us distinguish their instances, and a method named accept that allows visitors to visit them. The classes don’t care and don’t need to know what their visitors do. Visitor is an interface:

interface Visitor
{
    public void visitB(B b);
    public void visitC(C c);
    public void visitD(D d);
}

There is a method for each class it visits. Let’s try this out with a visitor implementation that just prints out the name of objects it visited:

class PrintVisitor implements Visitor
{
    public void visitB(B b)
    {
        b.child1.accept(this);
        System.out.println(b.name + " visited.");
        b.child2.accept(this);
    }

    public void visitC(C c)
    {
        System.out.println(c.name + " visited.");
        c.child.accept(this);
    }

    public void visitD(D d)
    {
        System.out.println(d.name + " visited.");
    }
}

The visitor is recursive: visiting a tree node also visits its children. Now let’s make a tree made up from the classes B, C and D:

There are nine objects and eight relations. First, create the objects:

        B f = new B("F");
        B b = new B("B");
        B d = new B("D");

        C g = new C("G");
        C h = new C("H");

        D a = new D("A");
        D c = new D("C");
        D e = new D("E");
        D i = new D("I");

Next, the relations:

        f.child1 = b;
        f.child2 = g;

        b.child1 = a;
        b.child2 = d;

        d.child1 = c;
        d.child2 = e;

        g.child = h;
        h.child = i;

And finally start visiting our tree by visiting its root node:

        PrintVisitor v = new PrintVisitor();
        f.accept(v);

The output is:

A visited.
B visited.
C visited.
D visited.
E visited.
F visited.
G visited.
H visited.
I visited.

If you look at this article, the above code performs a tree traversal, specifically what is called an in-order traversal. Let’s change our visitor class to do a pre-order traversal, where the visitor first displays the node name and then visits its children:

class PrintVisitor implements Visitor
{
    public void visitB(B b)
    {
        System.out.println(b.name + " visited.");
        b.child1.accept(this);
        b.child2.accept(this);
    }

    public void visitC(C c)
    {
        System.out.println(c.name + " visited.");
        c.child.accept(this);
    }

    public void visitD(D d)
    {
        System.out.println(d.name + " visited.");
    }
}

Now the output is:

F visited.
B visited.
A visited.
D visited.
C visited.
E visited.
G visited.
H visited.
I visited.

In post-order traversal the visitor first visits a node’s children and only then displays its name:

class PrintVisitor implements Visitor
{
    public void visitB(B b)
    {
        b.child1.accept(this);
        b.child2.accept(this);
        System.out.println(b.name + " visited.");
    }

    public void visitC(C c)
    {
        c.child.accept(this);
        System.out.println(c.name + " visited.");
    }

    public void visitD(D d)
    {
        System.out.println(d.name + " visited.");
    }
}

Here is the output:

A visited.
C visited.
E visited.
D visited.
B visited.
I visited.
H visited.
G visited.
F visited.

Besides the Wikipedia article I linked at the beginning, there is a nice description of the visitor pattern here. In short:

The visited objects don’t need to know what their visitors do, they just need to accept them.
There needs to be a protocol that lets visited objects and visitors communicate, in our case the Visitor interface.
A visitor uses a separate method for visiting each class (i.e. visitB, visitC and visitD).
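The three points above can also be sketched compactly in Python. This is only an illustration under a few assumptions: the class names mirror the Java ones, but the constructors take the children directly, the method names are snake_case, and the visitor (a hypothetical PreOrderCollector) stores the names in a list instead of printing them as it goes:

```python
class Node:
    """Common parent: every node has a name."""

    def __init__(self, name):
        self.name = name


class B(Node):
    """Node with two children."""

    def __init__(self, name, child1, child2):
        super().__init__(name)
        self.child1 = child1
        self.child2 = child2

    def accept(self, visitor):
        visitor.visit_b(self)


class C(Node):
    """Node with one child."""

    def __init__(self, name, child):
        super().__init__(name)
        self.child = child

    def accept(self, visitor):
        visitor.visit_c(self)


class D(Node):
    """Leaf node."""

    def accept(self, visitor):
        visitor.visit_d(self)


class PreOrderCollector:
    """Records each node's name first, then visits its children."""

    def __init__(self):
        self.visited = []

    def visit_b(self, b):
        self.visited.append(b.name)
        b.child1.accept(self)
        b.child2.accept(self)

    def visit_c(self, c):
        self.visited.append(c.name)
        c.child.accept(self)

    def visit_d(self, d):
        self.visited.append(d.name)


# The same tree as in the Java example, built from the leaves up.
tree = B("F",
         B("B", D("A"), B("D", D("C"), D("E"))),
         C("G", C("H", D("I"))))

collector = PreOrderCollector()
tree.accept(collector)
print(" ".join(collector.visited))
```

The node classes never look inside the visitor; swapping in a different visitor changes the traversal without touching B, C or D.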

(You can find the code on GitHub.)

A Context-Free Grammar Tutorial

Vicente Maldonado — Wed, 13 Mar 2019 10:05:17 +0000

I recently came across a tutorial on context-free grammars with several examples of common patterns you can find in those grammars and I thought to myself — why not implement some of those examples as an exercise?

I first had to choose the language, Java, and the tools to implement the examples with, JFlex and Jacc. The pair is sufficiently similar to Flex and Bison and the grammars hopefully won’t get obscured by implementation.

I got myself familiar with JFlex and Jacc and made three notes about working with them:

Meet JFlex — an intro to JFlex with a small example of a standalone lexer,
Use JFlex to Count Words — another standalone lexer example, this time a bit more involved,
Use JFlex and Jacc Together — how to make JFlex and Jacc cooperate, also with a rudimentary example.

If you are interested in CFGs and parsing, the tutorial I mentioned is not a bad place to get some practical experience implementing them. You can find it here and I also uploaded it to GitHub (hoping not to have broken any copyright laws).

The grammar I used to demonstrate how JFlex and Jacc work together is actually the grammar from section 2.1 of the tutorial (“A grammar for a language that allows a list of X’s”), so I’ve actually already started going through the exercises. Yay.

Using Graphviz you can get a visual representation of your grammars. First export the grammar as a dot file:

jacc -d Parser.jacc

This creates Parser.dot. Then

dot -Tjpg Parser.dot -o Parser.jpg

and you end up with Parser.jpg. It’s very simple:

It is a state machine represented as a directed graph. Creating grammar visual representations is just one of the tools Jacc provides to help you debug your grammars. You can export a grammar to a text file:

jacc -v Parser.jacc

This creates Parser.output:

// Output created by jacc on Wed Mar 13 09:15:29 CET 2019

state 0 (entry on sentence)
    $accept : _sentence $end

    X shift 2
    . error

    sentence goto 1

state 1 (entry on sentence)
    $accept : sentence_$end
    sentence : sentence_X (2)

    $end accept
    X shift 3
    . error

state 2 (entry on X)
    sentence : X_ (1)

    $end reduce 1
    X reduce 1
    . error

state 3 (entry on X)
    sentence : sentence X_ (2)

    $end reduce 2
    X reduce 2
    . error

4 terminals, 1 nonterminals;
2 grammar rules, 4 states;
0 shift/reduce and 0 reduce/reduce conflicts reported.

or into an HTML file:

jacc -h Parser.jacc

This creates ParserMachine.html:

Generated machine for Parser

// Output created by jacc on Wed Mar 13 09:20:03 CET 2019

state 0 (entry on sentence)
    $accept : _sentence $end

    X shift 2
    . error

    sentence goto 1

state 1 (entry on sentence)
    $accept : sentence_$end
    sentence : sentence_X    (2)

    $end accept
    X shift 3
    . error

state 2 (entry on X)
    sentence : X_    (1)

    $end reduce 1
    X reduce 1
    . error

state 3 (entry on X)
    sentence : sentence X_    (2)

    $end reduce 2
    X reduce 2
    . error

4 terminals, 1 nonterminals;
2 grammar rules, 4 states;
0 shift/reduce and 0 reduce/reduce conflicts reported.

You can actually go through the machine states by clicking the links.
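If you would rather step through the machine in code, the listing can be transcribed into a small table-driven driver. The sketch below is a hand-made Python transcription of the four states shown above; the names ACTIONS, GOTO, RULE_LENGTH and accepts are made up for this illustration, and this is not how Jacc itself generates or runs the parser:

```python
# Action table transcribed from the listing: state -> token -> action.
ACTIONS = {
    0: {"X": ("shift", 2)},
    1: {"$end": ("accept", None), "X": ("shift", 3)},
    2: {"$end": ("reduce", 1), "X": ("reduce", 1)},
    3: {"$end": ("reduce", 2), "X": ("reduce", 2)},
}

# "sentence goto 1" in state 0.
GOTO = {0: 1}

# Number of symbols on the right-hand side of each rule:
# rule 1 is "sentence : X", rule 2 is "sentence : sentence X".
RULE_LENGTH = {1: 1, 2: 2}


def accepts(tokens):
    """Return True if the token list is a valid sentence."""
    stack = [0]  # stack of state numbers
    stream = list(tokens) + ["$end"]
    position = 0
    while True:
        action = ACTIONS[stack[-1]].get(stream[position])
        if action is None:
            return False  # the ". error" entry
        kind, argument = action
        if kind == "shift":
            stack.append(argument)
            position += 1
        elif kind == "reduce":
            # Pop one state per right-hand-side symbol, then follow the goto.
            del stack[-RULE_LENGTH[argument]:]
            stack.append(GOTO[stack[-1]])
        else:
            return True  # accept


print(accepts(["X", "X", "X"]))
print(accepts([]))
```

A list of one or more X tokens is accepted; an empty input or any other token hits an error entry.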

Jacc also supports tracing your grammar on sample inputs and embedding custom error productions in grammars, but that’s it for now. ttfn!

Use JFlex and Jacc Together

Vicente Maldonado — Mon, 11 Mar 2019 11:02:45 +0000

Just as JFlex generates lexers, Jacc generates parsers, but what’s the difference? A lexer can recognize words and a parser can recognize whole sentences, or more formally, use lexers to work with regular grammars, and parsers to work with context-free grammars.

This is the reason the two are often used together — you first use a lexer to recognize words and pass those words to a parser which is able to determine if the words form a valid sentence.

Now, Java being Java, there is a choice of parser generators out there: Antlr is ubiquitous, but there are also CoCo/R, JavaCC, SableCC, Cup, Byacc/J and probably many others. Even the venerable Bison is capable of generating Java parsers. Some parsers, like Antlr, CoCo/R and JavaCC don’t even need a separate lexer to feed them words — they can generate one of their own!

So why Jacc? Why not?

OK, so how do JFlex and Jacc work together?

JFlex reads the input as a stream of characters and produces a token for Jacc when Jacc asks for one. A token is a string with a meaning. For instance, +, true and 3.14 are all tokens. Some of them don’t really need a value aside from their type (true is a Boolean literal), but some of them do: 3.14 is a floating-point literal with the value of 3.14.
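As a rough illustration of the “token on demand” idea, here is a tiny regex-based tokenizer in Python. The token names (FLOAT, TRUE, PLUS, SKIP) and the tokens function are assumptions made for this sketch; JFlex generates a far more efficient state machine, but the contract is similar: each request yields the next (type, text) pair:

```python
import re

# Hypothetical token types for the three example tokens, plus whitespace.
TOKEN_SPEC = [
    ("FLOAT", r"\d+\.\d+"),
    ("TRUE", r"true"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokens(text):
    """Yield (type, text) pairs one at a time, skipping whitespace."""
    position = 0
    while position < len(text):
        match = MASTER.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character at position {position}")
        position = match.end()
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()


stream = tokens("3.14 + true")
first = next(stream)  # the consumer asks for one token
rest = list(stream)   # and later for the remaining ones
print(first)
print(rest)
```

The generator does no work until the consumer asks for a token, which is roughly how the parser drives the lexer.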

As with JFlex, a Jacc file has three distinct sections:

directives section
%%
rules section
%%
additional code section

Jacc creates a list of all the tokens it expects in a separate file. You specify the file and the token list like this:

%interface ParserTokens
%token X NL

Let’s look at the generated file:

// Output created by jacc on Mon Mar 11 09:54:05 CET 2019

interface ParserTokens {
    int ENDINPUT = 0;
    int NL = 1;
    int X = 2;
    int error = 3;
}

It is an interface that doubles as an enumeration. Apart from the two token types we asked for, NL and X, two more are created: one for the end of input and one for errors. Back in the JFlex file you “implement” this “interface”:

%class Lexer
%implements ParserTokens

just so our lexer can see the ENDINPUT, NL, X and error constants. There are a few more things Jacc expects:

A function that returns integer values that represent token types (0, 1, 2 or 3 in our example). Naming that function yylex is a tradition, so let’s do that:

%function yylex
%int

Three more functions: getToken to get the current token code, nextToken to read the next token code and getSemantic to get the current token value:

%{

    private int token;
    private String semantic;

    public int getToken()
    {
        return token;
    }

    public String getSemantic()
    {
        return semantic;
    }

    public int nextToken()
    {
        try
        {
            token = yylex();
        }
        catch (java.io.IOException e)
        {
            System.out.println(
                "IO exception occurred:\n" + e);
        }
        return token;
    }

%}

You may notice that we decided to make the token semantic value a String so we also need to indicate that in the parser:

%semantic String

For our example we’ll just have the lexer recognize the following words: word, Word, wOrd, worD, … , wORD and WORD and new lines. We’ll ignore whitespace.

x = [wW][oO][rR][dD]
nl = \n | \r | \r\n
space = [ \t]

%%

When the lexer finds a word it will return the X token (i.e. 2 from the ParserTokens interface) with a value of word, Word, and so on; this is what semantic = yytext(); does. When it encounters a new line it will return ENDINPUT (i.e. 0):

{x} { semantic = yytext(); return X; }
{space} { /* Ignore space */ }
{nl} { return ENDINPUT; }
[^] { System.out.println("Error?"); }

That’s it. Now the parser “grammar” in all its glory:

sentence : X { System.out.println("X found: " + $1); }
    | sentence X { System.out.println("X found: " + $2); }
    ;

The grammar is left-recursive, which allows us to have one or more X words in a sentence. $1 and $2 will hold the X semantic value, passed there by the lexer, and we’ll simply print it out.

In the main method we create Lexer and Parser instances and start parsing. One thing to note here is that we have to “prime” the lexer with

parser.lexer.nextToken();

before the parser can use it. For reference, here is the lexer’s full source:

import java.io.*;

%%

%class Lexer
%implements ParserTokens

%function yylex
%int

%{

    private int token;
    private String semantic;

    public int getToken()
    {
        return token;
    }

    public String getSemantic()
    {
        return semantic;
    }

    public int nextToken()
    {
        try
        {
            token = yylex();
        }
        catch (java.io.IOException e)
        {
            System.out.println(
                "IO exception occurred:\n" + e);
        }
        return token;
    }

%}

x = [wW][oO][rR][dD]
nl = \n | \r | \r\n
space = [ \t]

%%

{x} { semantic = yytext(); return X; }
{space} { /* Ignore space */ }
{nl} { return ENDINPUT; }
[^] { System.out.println("Error?"); }

and that of the parser:

%{

import java.io.*;

%}

%class Parser
%interface ParserTokens

%semantic String

%token X NL

%%

sentence : X { System.out.println("X found: " + $1); }
    | sentence X { System.out.println("X found: " + $2); }
    ;

%%

    private Lexer lexer;

    public Parser(Reader reader)
    {
        lexer = new Lexer(reader);
    }

    public void yyerror(String error)
    {
        System.err.println("Error: " + error);
    }

    public static void main(String args[]) throws IOException
    {
        System.out.println("Interactive evaluation:");

        Parser parser = new Parser(
            new InputStreamReader(System.in));

        parser.lexer.nextToken();
        parser.parse();
    }

You need to compile Lexer.flex:

jflex Lexer.flex

and Parser.jacc:

jacc Parser.jacc

and the three generated Java files (Lexer.java, Parser.java and ParserTokens.java):

javac *.java

to finally be able to run the parser:

java Parser

Here is an example terminal session:

Interactive evaluation:
word Word wOrD WORD
X found: word
X found: Word
X found: wOrD
X found: WORD

You can find the full source code on GitHub.

Actually this was pretty boring, but we are now free to play with context-free grammars!

Use JFlex to Count Words

Vicente Maldonado — Sun, 10 Mar 2019 19:06:27 +0000

In the previous story we got to meet JFlex, a tool for generating lexers in Java. The example lexer was contrived, banal and not all that useful so let’s show that JFlex can be put to good use with a (bit) more useful example: we’ll count words, lines and characters the user enters.

The first part of the JFlex file is the same as in the first example:

import java.io.*;

%%

We just import all the java.io classes. To start with, we’ll need a way to store our word, line and char count:

%{

public int chars = 0;
public int words = 0;
public int lines = 0;

chars, words and lines will become public members of the generated class, accessible to the rest of the code. The main method is much the same as in the first example:

public static void main(String[] args) throws IOException
{
    InputStreamReader reader =
        new InputStreamReader(System.in);

    Lexer lexer = new Lexer(reader);

    lexer.yylex();

    System.out.format(
        "Chars: %d\nWords: %d\nLines: %d\n",
        lexer.chars, lexer.words, lexer.lines);
}

%}

There are two differences though:

There is no infinite loop: the lexer will read from System.in until it reaches the end of input (Ctrl-D on Linux and Ctrl-Z on Windows, I think). This allows the user to enter several lines of text in the terminal.
We don’t use the value yylex() returns.

In the next part:

%class Lexer
%type Integer

%%

The generated Java class will be named Lexer and yylex() will return a Java Integer value. This is only because yylex() needs to return something and it returns an object of type Yytoken by default; javac will complain that the Yytoken type doesn’t exist, because it doesn’t (unless you create it yourself).

Finally, in the lexical rules part:

[a-zA-Z]+ { words++; chars += yytext().length(); }
\n { chars++; lines++; }
. { chars++; }

If you type a word, recognized by the [a-zA-Z]+ regex, the word count will be incremented and the char count will be increased by the length of the entered word. If you press Enter, i.e. \n, both the char count and the line count will be incremented. And if you enter any other character, like * or &, the char count will be incremented.

This allows us to print out the final count of chars, words and lines (back in main):

System.out.format(
    "Chars: %d\nWords: %d\nLines: %d\n",
    lexer.chars, lexer.words, lexer.lines);

Here is the complete file:

import java.io.*;

%%

%{

public int chars = 0;
public int words = 0;
public int lines = 0;

public static void main(String[] args) throws IOException
{
    InputStreamReader reader =
        new InputStreamReader(System.in);

    Lexer lexer = new Lexer(reader);

    lexer.yylex();

    System.out.format(
        "Chars: %d\nWords: %d\nLines: %d\n",
        lexer.chars, lexer.words, lexer.lines);
}

%}

%class Lexer
%type Integer

%%

[a-zA-Z]+ { words++; chars += yytext().length(); }
\n { chars++; lines++; }
. { chars++; }

As in the previous example you need to compile both the JFlex file and the generated Java file:

[johnny@test example1]$ jflex Lexer.flex
[johnny@test example1]$ javac Lexer.java
[johnny@test example1]$ java Lexer

Here is a simple demo terminal session:

The quick brown fox
jumps over the lazy dog.
Chars: 45
Words: 9
Lines: 2
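As a cross-check on those numbers, the three lexical rules can be mimicked with Python’s re module. This is only a sketch: the count function and the group names are made up for the illustration, and Python’s ordered alternation stands in for JFlex’s longest-match rule, which gives the same result for these three patterns:

```python
import re

# One named group per lexical rule, in the same order as in the JFlex file.
RULES = re.compile(r"(?P<word>[a-zA-Z]+)|(?P<newline>\n)|(?P<other>.)")


def count(text):
    """Return (chars, words, lines) the way the three rules would."""
    chars = words = lines = 0
    for match in RULES.finditer(text):
        if match.lastgroup == "word":
            words += 1
            chars += len(match.group())
        elif match.lastgroup == "newline":
            chars += 1
            lines += 1
        else:
            chars += 1
    return chars, words, lines


# The two lines from the demo session, each ended by pressing Enter.
totals = count("The quick brown fox\njumps over the lazy dog.\n")
print("Chars: %d\nWords: %d\nLines: %d" % totals)
```

The totals agree with the terminal session: 45 characters (newlines included), 9 words and 2 lines.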

You can download the full code from GitHub.

Meet JFlex

Vicente Maldonado — Sun, 10 Mar 2019 16:16:16 +0000

JFlex is a scanner generator for Java. A scanner generator will generate a scanner (a.k.a. lexer) for you instead of you having to write one yourself. JFlex is modeled after lex and flex, only it’s written in Java and generates Java lexers, unlike the two older tools.

What is the JFlex workflow?

Create a JFlex source file (*.flex)
Use the JFlex command-line tool to compile the file into a Java file
Use javac to compile the Java file
Invoke the *.class file

et voilà, you have a working scanner/lexer. You can use it as a standalone tool or in combination with other programs — tools like Yacc/Bison commonly expect a scanner to feed them input to work with.

A JFlex source file is made up of three parts:

user code
options and declarations
lexical rules

separated by double percent signs (%%). Here is a simple example:

import java.io.*;

%%

That’s it for the first part. Oddly enough, if you want to add code to the generated Java class you’ll have to include that code in the middle section of the JFlex file (options and declarations). There’s no magic to it: JFlex creates the lexer based on a template, and the code you put in the first section does not end up as a part of the generated class — this is why you put your import statements here (You can go wild and put full Java classes there too but that’s not a very good idea).

The code you do put in the middle section of your JFlex file, on the other hand, does end up as a part of the lexer. Let’s add the main method to the class and make it self-contained:

%{

public static void main(String[] args) throws IOException
{
    InputStreamReader reader =
        new InputStreamReader(System.in);

    Lexer lexer = new Lexer(reader);

    System.out.println("Start lexing");

    while (true)
    {
        System.out.println(lexer.yylex());
    }
}

%}

A lexer that JFlex generates needs to be initialized with a Java Reader. In this case we will accept input from System.in, i.e. standard input. By default JFlex generates a class named Yylex with a function named yylex(). Let’s change that (we are still in the middle section of our JFlex file):

%class Lexer
%type String

%%

This will make JFlex change the class name to Lexer and yylex() will return a Java String instead of a Yytoken — a class we won’t bother creating.

The plan here is to make yylex() return any character we type on our keyboard — this is why we specify its %type as String. Then we’ll just use an infinite loop ( while (true) ) to accept characters and immediately print them out.

Let’s finish with the third section (lexical rules):

[^] { return yytext(); }

[^] will match any character. yytext() will return the character as a string and { return ...} is what yylex() will return so we are done.

Here is the complete file:

import java.io.*;

%%

%{

public static void main(String[] args) throws IOException
{
    InputStreamReader reader =
        new InputStreamReader(System.in);

    Lexer lexer = new Lexer(reader);

    System.out.println("Start lexing");

    while (true)
    {
        System.out.println(lexer.yylex());
    }
}

%}

%class Lexer
%type String

%%

[^] { return yytext(); }

You need to compile it (watch for jflex error messages in output):

[johnny@test example1]$ jflex Lexer.flex

compile the generated Java file:

[johnny@test example1]$ javac Lexer.java

and run it:

[johnny@test example1]$ java Lexer

Here’s an example terminal session:

Start lexing
123
1
2
3

abc 
a
b
c

^C[johnny@test example1]$

Use Ctrl-C to stop the program. Of course this is not very exciting and you don’t need a 570-line Java file (yes, that’s how long the generated lexer is) just to echo characters.

You can download the source code from GitHub.