Antlr is awesome

In this post I will show you how to integrate the Antlr4 parser generator in your Java project to generate a parser and use the parser to parse simple mathematical expressions. It’s a good starting point if you need to do any kind of parsing in your Java project. I’ve also provided a runnable Maven project.

The Grammar

About a year ago I had to implement a query language in a project I was (and currently still am) working on. We’re creating a specialized search engine where users can use a GUI to query the system. Power users need to be able to type in search queries and a requirement for this query language is that it would be both simple and powerful. This led to us ending up with something fairly similar to the where clause in a SQL select statement.

So… Time to break out the parser generator! Antlr stands for ANother Tool for Language Recognition and is a powerful parser generator for reading, processing, executing or translating structured text or binary files. And 'executing' is what we’ll basically be doing in this tutorial! We’re using the latest Antlr4 version, currently 4.5, in our examples.

So let’s take a step back first and explain what Antlr does for you. Antlr takes a grammar and uses this to generate a lexer and a parser. A lexer turns a string of characters into tokens. These tokens are then used by the parser to create an Abstract Syntax Tree (AST). The grammar we’ll be creating is very simple but there are numerous examples available for anything from SQL to JSON to the full Java grammar!

For this example we’ll be using our very simple grammar that can be used to do simple integer calculations:

grammar Simple;

calc: expr EOF;

expr
:BR_OPEN expr BR_CLOSE
|expr TIMES expr
|expr DIV expr
|expr PLUS expr
|expr MINUS expr
|number
;

number: NUMBER;

PLUS:   'plus'  | '+';
MINUS:  'minus' | '-';
TIMES:  'times' | '*';
DIV:    'div'   | '/';

NUMBER: '-'? [0-9]+;
BR_OPEN: '(';
BR_CLOSE: ')';

WS: [ \t\r\n]+ -> skip;

This is basically all you need for Antlr to do it’s magic. The first line is the 'name' of the grammar ('Simple'). This is used to name the parser and lexer classes. We define a 'calculation' (calc) as an expression and an EOF or end of file. An expression can be an expression between brackets, a multiplication, division, summation, substraction or a number. The way this is set up shows the precedence of the nested expressions: expressions between brackets have a higher precedence than anything else, multiplication has a higher precedence than summation, and the last one is 'just' a number. Keep this in mind when creating your own grammar; the lexer and parser use this order to resolve 'conflicts' when parsing your input.

You probably noticed that there’s a mix of lower and upper case words in the grammar. Lower cased words are parser expressions, UPPER CASE words are lexer expressions. It is important to understand that the whole parsing is happening in two phases; the lexer creates the token stream and the parser then builds the AST from this token stream. If your grammar is ambiguous the lexer might interpret a token wrong and then you’ll see the parser complain about recieving an unexpected token while in fact it was the lexing stage where this went wong.

You also see that we define for example the token PLUS as being either a '' or the word 'plus', this is the power of this two stage approach: the parser only sees a 'PLUS' token. The translation of 'plus' and '' was already handled by the lexer. This makes using the AST much easier. So both '10 plus 20' and '10 + 20' are valid and both yield the same result.

A number here is defined as a single integer number (looks familiar? it’s basically a regular expression), the dash (indicating it’s a negative number) is optional. This won’t parse floating point numbers obviously, you would have to add a decimal part to allow this.

Setting up the project

We’ll create a maven Java project for convenience. You can either create it yourself from scratch or just fork my repo and run it. To get Antlr4 working in your project if you create it from scratch you will have to follow these steps: * Create a maven project * Add the Antlr4 dependency and build plugin to your pom.xml * Create a src/main/antlr4 source folder * Create a package in the antlr4 source folder that is the same as the package you want your generated parser to end up in * Add the grammar shown above as Simple.g4

If you have maven integrated in your IDE you should see a target/generated-sources/antlr4 folder being created that will contain a SimpleLexer.java and SimpleParser.java file. The names are <GrammarName>Lexer.java and <GrammarName>Parser.java because they are based on the 'grammar Simple;' line in our grammar. They should also end up in the same package as the one we created for our grammar file. If you use my project it would be 'com.nibado.example.antlr'. If this isn’t done automatically you might have to do a Maven → Update project.

Using the Abstract Syntax Tree

Our goal is to end up with a (generated) parser that can turn any valid expression (such as '10 + 2 * 10 - 5') into an abstract syntax tree. We can then use that AST in our application to (in this case) calculate the result. You can also use this approach to parse a search query or even to implement your own programming language. In this example we’ll just parse a string and calculate the result.

In our project we’ll add a java package to our src/main/java source folder with the same name as our Grammar and Lexer/Parser are using ('com.nibado.example.antlr' in my project). We’ll create our calculator in this package, I’ve called it ExpressionParser.java. It will basically have 2 main methods; a public parse(String) method and a private visit(ExprContext) method that gets called recursively to traverse the AST. The full source of the file can be found here. I will walk you through the source, starting with the parse method:

    /**
     * Parses an expression into an integer result.
     * @param expression the expression to part
     * @return and integer result
     */
    public int parse(final String expression) {
        /*
         * Create a lexer that reads from our expression string
         */
        final SimpleLexer lexer = new SimpleLexer(new ANTLRInputStream(expression));

        /*
         * By default Antlr4 lexers / parsers have an attached error listener
         * that prints errors to stderr. I prefer them to throw an exception
         * instead so I implemented my own error listener which is attached
         * here. I also remove any existing error listeners.
         */
        lexer.removeErrorListeners();
        lexer.addErrorListener(_listener);

        /*
         * The lexer reads characters and lexes them into token stream. The
         * tokens are consumed by the parser that then builds an Abstract
         * Syntax Tree.
         */
        final CommonTokenStream tokens = new CommonTokenStream(lexer);
        final SimpleParser parser = new SimpleParser(tokens);
        parser.removeErrorListeners();
        parser.addErrorListener(_listener);

        /*
         * The ExprContext is the root of our Abstract Syntax Tree
         */
        final ExprContext context = parser.expr();

        /*
         * 'Visit' all the branches of the tree to get our expression result.
         */
        return visit(context);
    }

The parse method sets up the Lexer and Parser, gives them our own error listener (more on that later) and ties them together grabbing the root of the AST (ExprContext). It then calls the visit method to do the actual traversing:

    /*
     * Visits the branches in the expression tree recursively until we hit a leaf.
     */
    private int visit(final ExprContext context) {
        if (context.number() != null) { //Just a number
            return Integer.parseInt(context.number().getText());
        }
        else if (context.BR_CLOSE() != null) { //Expression between brackets
            return visit(context.expr(0));
        }
        else if (context.TIMES() != null) { //Expression * expression
            return visit(context.expr(0)) * visit(context.expr(1));
        }
        else if (context.DIV() != null) { //Expression / expression
            return visit(context.expr(0)) / visit(context.expr(1));
        }
        else if (context.PLUS() != null) { //Expression + expression
            return visit(context.expr(0)) + visit(context.expr(1));
        }
        else if (context.MINUS() != null) { //Expression - expression
            return visit(context.expr(0)) - visit(context.expr(1));
        }
        else {
            throw new IllegalStateException();
        }
    }

As you can see here it’s just a bunch of if-statement that check if certain components are available that identify an ExprContext as a certain type of expression. The last else throws an illegal state exception because according to our grammar this situation where everything is 'null' can’t happen.

If you check the full source you’ll see that we also construct our own custom error listener. This is because by default Antlr attaches it’s own error listener that outputs to stderr. I personally think it makes more sense to throw an exception.

With this in place we can now call new ExpressionParser().parse() with different expressions and we should end up with a valid result. To test this there is a corresponding test class that shows the typical usage for this pattern.

Next steps

This concludes this post! Now that you know how to integrate Antlr4 into your project some ideas to extend the grammar would be: * Implement a modulo operator * Allow the use of floating points * Allow scientific notation * Support BigDecimals

Have fun!