parserlib: klex: defining the lexer

A lexer interface and its implementation are generated automatically by m3build by running the command

lex MyLang.l [ -t MyLang.t [ -ti3 MyLangTok.i3 ] ] [ -o MyLangLex.i3 ]

where MyLang.t is a token specification, MyLangTok.i3 is a token interface, MyLang.l is a lexer specification, and MyLangLex.i3 is the generated lexer interface.

lexer specification

A lexer specification is a file with the .l suffix that specifies regular expressions to be converted to tokens. The syntax differs from that of a UNIX .l file in that each regular expression is associated with an expression method, which must be given a name:

%expr {
bare_STRING   RegExp1
quoted_STRING RegExp2
}
Defines the expression methods named bare_STRING and quoted_STRING, which will be called whenever the respective expressions RegExp1 and RegExp2 are matched. See default method construction below.

METHOD1 RegExp1
Same as above; that is, the enclosing %expr{} is optional.
%macro {
MACRO1  RegExp1
MACRO2  RegExp2
}
Defines {MACRO1} to stand for RegExp1, and {MACRO2} to stand for RegExp2.

%macro MACRO RegExp
Alternate syntax, same meaning as above.
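
To make the effect of a macro definition concrete, the following Python sketch models {NAME} substitution as plain text replacement. It is not klex's implementation: the function name expand_macros is hypothetical, wrapping the expansion in parentheses is an assumption borrowed from how flex treats definitions, and nested macros, quoted strings and character classes are deliberately not handled.

```python
import re
from typing import Dict

# Matches {NAME} where NAME looks like an identifier. Repetition counts such
# as {2,5} and the special {%char} form do not match and are left untouched.
MACRO_REF = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_macros(pattern: str, macros: Dict[str, str]) -> str:
    """Replace each {NAME} in pattern with the parenthesised body of NAME."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            return match.group(0)  # unknown name: keep the text as written
        # Assumption: parentheses keep the macro body a single unit, so that
        # a following operator such as + applies to the whole body.
        return "(" + macros[name] + ")"
    return MACRO_REF.sub(substitute, pattern)


macros = {"DIGIT": "[0-9]", "SIGN": "-|\\+"}
print(expand_macros("{SIGN}?{DIGIT}+", macros))
```

Without the added parentheses, a body containing | would change meaning once a postfix operator is applied to the macro reference.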

default method construction

The longest suffix of the method name matching a token name causes that token type to be returned by default. For example, if there is a token named STRING, then any methods named bare_STRING or quoted_STRING will by default be assigned a procedure which returns a new token of type STRING.

The default method named char returns a constant token whose value equals the character code of the first matched character. If a method name is not char and does not have a suffix matching a token name, the default method returns NIL (instructing the lexer to skip the token) and a warning is printed. The warning is not printed, however, if the method is named skip; in that case skipping is assumed to be the desired default behavior.
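
The selection rule described above can be modelled in a few lines of Python. This is a sketch of the rule as stated on this page, not code taken from klex: the function name default_token and the CHAR_DEFAULT marker are hypothetical, and checking the char rule before the suffix rule is an assumption.

```python
from typing import Iterable, Optional, Tuple

CHAR_DEFAULT = "<char>"  # hypothetical marker for the built-in char default


def default_token(method_name: str,
                  token_names: Iterable[str]) -> Tuple[Optional[str], bool]:
    """Return (token, warn) for a method that has no explicit override.

    token is the token name returned by default, or None when the match is
    skipped; warn says whether a warning would be printed.
    """
    if method_name == "char":
        # Assumption: the char rule is checked before the suffix rule.
        return CHAR_DEFAULT, False
    # Keep every non-empty token name that is a suffix of the method name,
    # then pick the longest one.
    candidates = [t for t in token_names if t and method_name.endswith(t)]
    if candidates:
        return max(candidates, key=len), False
    # No matching token: skip the match; only the name "skip" is silent.
    return None, method_name != "skip"


for name in ("bare_STRING", "quoted_STRING", "char", "skip", "whitespace"):
    print(name, default_token(name, ["STRING", "ERROR"]))
```

In this model a method named whitespace is skipped with a warning, while the same method renamed to skip is skipped silently.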

In addition, it is customary to define a token named ERROR, which does not ordinarily match any grammar rules. Thus a lexer specification will typically end with the following three lines:

char        {%char}
skip        [ \t]*
ERROR       [^]
The behavior of any default method can be changed by overriding the method, for example by using ext.

regular expressions

klex supports the following subset of flex regular expressions, in order of decreasing precedence:

x match the character 'x'
. any character (byte) except newline
[xyz] a "character class"; in this case, the pattern matches either an 'x', a 'y', or a 'z'
[abj-oZ] a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
[^A-Z] a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
[^A-Z\n] any character EXCEPT an uppercase letter or a newline
r* zero or more r's, where r is any regular expression
r+ one or more r's
r? zero or one r's (that is, an optional r)
r{2,5} anywhere from two to five r's
r{2,} two or more r's
r{4} exactly 4 r's
{NAME} the expansion of macro NAME
{%char} the %char macro expands to the class of characters which were declared %char in the token interface.
"[xyz]\"foo" the literal string: [xyz]"foo
\X if X is a recognized escape letter (for example 'n' or 't', as used elsewhere on this page), then the ANSI-C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*')
\123 the character with octal value 123
(r) match an r; parentheses are used to override precedence
rs the regular expression r followed by the regular expression s; called "concatenation"
r|s either an r or an s
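
Several of the operators listed above are written the same way in Python's re module, so their meaning and relative precedence can be checked with a short script. This is only an illustration using Python's regex engine, not klex; the helper name matches is hypothetical. The forms {NAME}, {%char} and double-quoted literal strings are lex-family syntax with no Python equivalent and are left out.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result)
EXAMPLES = [
    (r"[abj-oZ]", "k", True),       # 'k' lies in the range j-o
    (r"[abj-oZ]", "c", False),      # 'c' is neither listed nor in the range
    (r"[^A-Z]", "\n", True),        # a negated class still accepts newline
    (r"[^A-Z\n]", "\n", False),     # unless newline is excluded as well
    (r".", "\n", False),            # '.' matches anything except newline
    (r"r{2,5}", "rrr", True),
    (r"r{2,5}", "r", False),
    (r"r{2,}", "rrrrrrr", True),
    (r"r{4}", "rrr", False),
    (r"\123", "S", True),           # octal 123 is the character 'S'
    (r"ab*", "abbb", True),         # '*' binds tighter than concatenation
    (r"ab*", "abab", False),
    (r"(ab)*", "abab", True),       # parentheses override precedence
    (r"ab|cd", "cd", True),         # concatenation binds tighter than '|'
    (r"ab|cd", "abcd", False),
    (r"a(b|c)d", "acd", True),
]

for pattern, text, expected in EXAMPLES:
    assert matches(pattern, text) == expected, (pattern, text)
print("all", len(EXAMPLES), "examples behave as listed")
```

The ab* and ab|cd rows show why the table is ordered by decreasing precedence: repetition applies to the nearest single expression, and alternation applies last.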

lexer interface

A lexer interface is a Modula-3 interface defining a type T that can be passed as an argument to a parser initialization method (or extended using ext), and is itself generated from a lexer specification.

The type T representing a lexer is declared as an opaque subtype of the RdLexer instantiated in the token interface. Hence the following uses are possible:

myLexer := NEW(MyLangLex.T).setRd(rd);
Initialize the new lexer myLexer using the reader rd: Rd.T.

start := myParser.parse(NEW(MyLangLex.T).fromText(text));
Parse text: TEXT, using a new lexer and myParser. The interface which was used to initialize myParser must be compatible with MyLangLex.i3.

No method named init is defined, so that extended lexers remain free to define their own initialization parameters.

see also

M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator

Vern Paxson et al., flex - fast lexical analyzer generator

A. Aho, R. Sethi and J. Ullman, Compilers: Principles, Techniques and Tools



$Id: klex.html,v 1.3 2001/01/08 03:25:31 kp Exp $