parserlib: ktok: defining the token

A lexer and parser are compatible if they import the same token interface. A token interface and its implementation are generated automatically by m3build, which runs the command

tok MyLang.t [ -o MyLangTok.i3 ]

where MyLang.t is a token specification and MyLangTok.i3 is the generated token interface.

token specification

A token specification is a file with the .t suffix that specifies which tokens will be passed from a lexer to a parser. Each line of the file must have one of the following forms:

TOKEN1 TOKEN2 ... The tokens in the given list are extendable types that can be returned by a lexer. The list is optionally preceded by %token.
%const TOKEN1 TOKEN2 The tokens in the given list can be returned by a lexer, but they cannot be extended to contain a value.
%char [chars] The lex-style set of characters enclosed in [] can be returned by a lexer; these tokens behave like %const tokens.
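As an illustration only, a token specification for a small expression language might read as follows (every token name here is invented for the example):

  ID NUMBER
  %const IF THEN ELSE
  %char [+*()]

Here ID and NUMBER are extendable tokens that may carry values, IF, THEN, and ELSE are constant keyword tokens, and the characters +, *, (, and ) are returned as constant single-character tokens.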

By convention, tokens are written in ALL CAPS so as not to be confused with other ParseTypes (see below).

token interface

A token interface is a Modula-3 interface that can be imported by generated lexers and parsers (or extended using ext), and is itself generated from a token specification.

The token interface defines BRANDED OBJECT types with the following subtype relationship:

Token <: ParseType

In addition, each token declared nonconstant in the token specification becomes a subtype of Token.

If a generated parser imports the token interface, then all arguments and return types of parser reduction methods are subtypes of ParseType . If a generated lexer imports the token interface, then each expression method returns a Token .
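As a hedged sketch, a token interface generated from a specification whose nonconstant tokens are ID and NUMBER might declare types along these lines (the type names and exact layout are assumptions for illustration, not the literal ktok output):

  INTERFACE MyLangTok;

  TYPE
    ParseType = BRANDED OBJECT END;
    Token     = ParseType BRANDED OBJECT END;

    (* each nonconstant token is a subtype of Token and can be
       extended to carry a value *)
    ID     = Token BRANDED OBJECT END;
    NUMBER = Token BRANDED OBJECT END;

  END MyLangTok.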

Any lexer (generated or handwritten) must be a subtype of the Lexer type defined in the token interface, which has the following generic form:

  Lexer = OBJECT METHODS
    get(): Token RAISES {Rd.EndOfFile};
    (* get next token, or raise Rd.EndOfFile if token cannot be formed
       from remaining input *)

    unget();
    (* will be called at most once after get(), and only when lookahead is
       required past the last token, i.e. when parsing does not exhaust
       the input *)

    error(message: TEXT);
    (* might print file name, line number, and message, and exit *)
  END;
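For example, a handwritten lexer is simply a subtype of Lexer that overrides these methods. The following sketch assumes a generated token interface named MyLangTok; every other name is invented for the example:

  TYPE
    ToyLexer = MyLangTok.Lexer OBJECT
        (* lexer state, e.g. the input text and current position *)
      OVERRIDES
        get   := Get;    (* scan and return the next MyLangTok.Token,
                            or RAISE Rd.EndOfFile *)
        unget := Unget;  (* push the last token back; called at most once *)
        error := Error;  (* report the message, then exit *)
      END;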
A lexer generated by klex will more specifically be an RdLexer, which provides the following additional methods:

  RdLexer = Lexer OBJECT METHODS
    setRd(rd: Rd.T): RdLexer;
    (* Prepare to read tokens starting at cur(rd).
       After every token, rd is repositioned after that token. *)

    getRd(): Rd.T;
    (* get reader *)

    fromText(t: TEXT): RdLexer;
    (* Calls setRd with a textReader. *)

    rewind();
    (* equivalent to Rd.Seek(rd, 0) followed by setRd *)

    getText(): TEXT;
    (* get TEXT of last token *)

    purge(): INTEGER;
    (* Allow any internally allocated ParseTypes to be garbage collected,
       even if the lexer itself remains in scope. Return number of ParseType
       objects allocated but not discarded (not the number of purged objects).
       Can be called at any time by the thread calling get. *)
  END;
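Putting these methods together, a driver that tokenizes a text might be sketched as follows (MyLangTok for the token interface and MyLangLex.T for the klex-generated lexer are assumed names, not part of parserlib itself):

  VAR lexer := NEW(MyLangLex.T).fromText("1 + 2 * 3");
  BEGIN
    TRY
      LOOP
        WITH tok = lexer.get() DO
          (* TYPECASE tok to dispatch on the token's type;
             lexer.getText() returns the text of this token *)
        END;
      END;
    EXCEPT
      Rd.EndOfFile => (* no further token can be formed *)
    END;
    EVAL lexer.purge(); (* let internally allocated tokens be collected *)
  END;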


$Id: ktok.html,v 1.5 2001/01/08 07:08:02 kp Exp $