Monday, March 17, 2014

How to skip unwanted input strings in an ANTLR grammar


Question: I want to build a parser for analyzing a large input file, but I don’t need the entire input file, only some parts of it.

For example, the input file may look like this:
bla bla bla bla bla ...

EVENT: e1
type: t1
version: 1
additional-info: abc

EVENT: e2
type: t2
version: 1
uninteresting-info: def

blu blu blu blu blu ...

From this file, all I want is a map of event to type (e1=>t1, e2=>t2). All other information is of no interest to me.


How can I build a simple ANTLR grammar that does this?


Answer:


You can do that by introducing a boolean flag inside your lexer that keeps track of whether an event or type keyword has just been encountered. If it has, the lexer should not skip the word that follows; all other words should be skipped.
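To make the flag idea concrete outside of ANTLR, here is a minimal hand-written sketch in plain Java. It is not the generated lexer and not the grammar from the original answer; the class name `EventTypeScanner` and the method `scan` are made up for illustration, and the sketch assumes that keywords and values are separated by whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the flag technique; not ANTLR-generated code.
class EventTypeScanner {

    // Returns event -> type pairs and ignores every other word in the input.
    static Map<String, String> scan(String src) {
        Map<String, String> result = new LinkedHashMap<>();
        boolean keep = false;        // set right after an "EVENT:" or "type:" keyword
        boolean isEvent = false;     // remembers which keyword set the flag
        String currentEvent = null;  // last event name seen, waiting for its type

        for (String word : src.trim().split("\\s+")) {
            if (keep) {
                // The flag is set, so this word is a value we want to keep.
                if (isEvent) {
                    currentEvent = word;
                } else if (currentEvent != null) {
                    result.put(currentEvent, word);
                    currentEvent = null;
                }
                keep = false;
            } else if (word.equals("EVENT:")) {
                keep = true;
                isEvent = true;
            } else if (word.equals("type:")) {
                keep = true;
                isEvent = false;
            }
            // Any other word (including "prEVENT:" or "EVENTs:") is skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        String src = "bla bla\nEVENT: e1\ntype: t1\nversion: 1\nblu blu\n";
        scan(src).forEach((event, type) ->
            System.out.println("event=" + event + ", type=" + type));
    }
}
```

In an ANTLR lexer the same flag would typically live in the lexer's members section and decide whether a word token is emitted or skipped.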


A small demo:




You can test the parser with the following class:


import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = 
        "bla bla bla bla bla ...  \n" +
        "                         \n" +
        "prEVENT: ...             \n" +
        "EVENTs: ...              \n" +
        "                         \n" +
        "EVENT: e1                \n" +
        "type: t1                 \n" +
        "version: 1               \n" +
        "additional-info: abc     \n" +
        "                         \n" +
        "EVENT: e2                \n" +
        "type: t2                 \n" +
        "version: 1               \n" +
        "uninteresting-info: def  \n" +
        "                         \n" +
        "blu blu blu blu blu ...  \n";
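    // TLexer and TParser are the classes ANTLR generates from T.g.
    TLexer lexer = new TLexer(new ANTLRStringStream(src));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    // Invoke the grammar's entry rule.
    parser.parse();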
  }
}


Generate the lexer and parser, compile everything, and run the test class:
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

which will produce the following output:
event=e1, type=t1
event=e2, type=t2

From page at http://stackoverflow.com/questions/8313722/skipping-parts-of-the-input-file-in-antlr
string literal


Also, your string rule would probably be better off looking like this:


STRING_LITERAL : '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';

In other words, the contents of your string are zero or more of:


any char other than a quote, backslash or line break: ~('"' | '\\' | '\r' | '\n')

or


an escaped quote or backslash: '\\' ('"' | '\\')


OR


STRING : '"' (options{greedy=false;}:( ~('\\'|'"') | ('\\' '"')))* '"';
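
To experiment with the STRING_LITERAL rule without generating a lexer, the same language can be approximated with a `java.util.regex` pattern. This is only a sketch: the class name `StringLiteralDemo` and the method `findLiterals` are hypothetical, and the pattern is a hand translation of the rule, not something ANTLR produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the STRING_LITERAL rule as a regular expression.
class StringLiteralDemo {

    // Regex: "(?:[^"\\\r\n]|\\["\\])*"
    //   [^"\\\r\n]  any char other than a quote, backslash or line break
    //   \\["\\]     an escaped quote or an escaped backslash
    static final Pattern STRING_LITERAL =
        Pattern.compile("\"(?:[^\"\\\\\\r\\n]|\\\\[\"\\\\])*\"");

    // Collects every string literal found in the given text, in order.
    static List<String> findLiterals(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The text is: say "hi" and "a \"quoted\" word"
        String text = "say \"hi\" and \"a \\\"quoted\\\" word\"";
        for (String literal : findLiterals(text)) {
            System.out.println(literal);
        }
    }
}
```

Because the two alternatives never match the same character, the pattern rejects a literal that contains a raw line break or an escape other than an escaped quote or backslash, just as the rule does.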
