Marios Papachristou edited this page Apr 18, 2020 · 1 revision

Lexer

The first stage of a compiler is the lexer. The lexer is responsible for tokenizing the source text and returning the list of identified tokens for further use, usually by a parser. Our lexer is built using SLY and contains rules that match the tokens of the PCL language. More specifically, the PCL lexer defines the keywords, operators, identifier names, constants, ignored characters and comments.

The keywords are defined in a dictionary called keywords, as follows:

keywords = { r'if' : 'IF', r'while' : 'WHILE', ... }

so that they can be separated from ordinary identifiers (names). Each rule is defined as a Python regular expression, and certain tokens are handled by dedicated rule functions. If a rule name starts with ignore_, SLY discards any text that matches its pattern. The ignore attribute is a string containing the individual characters to skip. A rule that needs custom handling for its pattern is defined as a method named after the token:

def TOKEN_NAME(self, t):
	# do something
	return t
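The two kinds of ignore rules described above can be illustrated without SLY. The following is a minimal standard-library sketch, not the project's code: `IGNORE_CHARS`, `IGNORE_COMMENT` and `skip_ignored` are hypothetical names, and the `(* ... *)` comment syntax is an assumption made for the example.

```python
import re

# Individual characters to skip, playing the role of the `ignore` string.
IGNORE_CHARS = ' \t\r'

# A whole pattern to discard, playing the role of an `ignore_`-prefixed rule.
# Assumption: comments are written (* like this *).
IGNORE_COMMENT = re.compile(r'\(\*.*?\*\)', re.DOTALL)


def skip_ignored(text, pos):
    """Return the first position at or after pos that is not ignored."""
    while pos < len(text):
        if text[pos] in IGNORE_CHARS:
            pos += 1
            continue
        m = IGNORE_COMMENT.match(text, pos)
        if m is None:
            break
        pos = m.end()
    return pos


source = '  (* a comment *) x := 6;'
start = skip_ignored(source, 0)
print(source[start:])
```

The character string is checked one character at a time, while the comment rule consumes an entire match, which mirrors the difference between the two mechanisms.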

If the input contains an invalid token, PCLLexerError is raised. The sly.Lexer class tries its rules in the order in which they are defined and does not pick the longest match. For instance, 3.14 is difficult to tokenize correctly: if the integer rule is tried first, the lexer identifies 3 as an integer constant and returns it. Traditional lexer generators such as Flex and ML-Lex instead choose the longest match. To address the issue, special patterns have to be added or (easier) rule functions are written to handle such cases.
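The effect of rule order can be reproduced with Python's `re` module alone, since a regular-expression alternation also takes the first alternative that matches rather than the longest one. This is a sketch of the behaviour, not of the PCL lexer itself: `REAL_CONS` and `first_token` are hypothetical names introduced for the example.

```python
import re

# Integer rule listed first: the alternation stops at the first alternative
# that matches, so only the leading digits are consumed.
int_first = re.compile(r'(?P<INT_CONS>\d+)|(?P<REAL_CONS>\d+\.\d+)')

# Real rule listed first: the more specific pattern gets the first chance,
# and plain integers still fall through to the second alternative.
real_first = re.compile(r'(?P<REAL_CONS>\d+\.\d+)|(?P<INT_CONS>\d+)')


def first_token(pattern, text):
    """Return (token type, matched text) for the token at the start of text."""
    m = pattern.match(text)
    return m.lastgroup, m.group()


print(first_token(int_first, '3.14'))
print(first_token(real_first, '3.14'))
print(first_token(real_first, '42'))
```

Placing the more specific pattern before the more general one is one way to obtain the expected result without longest-match semantics.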

Example

Suppose that we have the following program

program collatz;

var x : integer;

begin
  x := 6;
  while x > 1 do
  begin
    writeInteger(x);
    if x mod 2 = 0 then x := x div 2
    else x := 3 * x + 1;
  end;

end.

Then invoking

pclc.py collatz.pcl --pipeline lex pprint

yields

Token(type='PROGRAM', value='program', lineno=1, index=0)
Token(type='NAME', value='collatz', lineno=1, index=8)
Token(type='SEMICOLON', value=';', lineno=1, index=15)
Token(type='VAR', value='var', lineno=3, index=18)
Token(type='NAME', value='x', lineno=3, index=22)
Token(type='DCOLON', value=':', lineno=3, index=24)
Token(type='INTEGER', value='integer', lineno=3, index=26)
Token(type='SEMICOLON', value=';', lineno=3, index=33)
Token(type='BEGIN', value='begin', lineno=5, index=36)
Token(type='NAME', value='x', lineno=6, index=44)
Token(type='SET', value=':=', lineno=6, index=46)
Token(type='INT_CONS', value='6', lineno=6, index=49)
Token(type='SEMICOLON', value=';', lineno=6, index=50)
Token(type='WHILE', value='while', lineno=7, index=54)
Token(type='NAME', value='x', lineno=7, index=60)
Token(type='GT', value='>', lineno=7, index=62)
Token(type='INT_CONS', value='1', lineno=7, index=64)
Token(type='DO', value='do', lineno=7, index=66)
Token(type='BEGIN', value='begin', lineno=8, index=71)
Token(type='NAME', value='writeInteger', lineno=9, index=81)
Token(type='LPAREN', value='(', lineno=9, index=93)
Token(type='NAME', value='x', lineno=9, index=94)
Token(type='RPAREN', value=')', lineno=9, index=95)
Token(type='SEMICOLON', value=';', lineno=9, index=96)
Token(type='IF', value='if', lineno=10, index=102)
Token(type='NAME', value='x', lineno=10, index=105)
Token(type='MOD', value='mod', lineno=10, index=107)
Token(type='INT_CONS', value='2', lineno=10, index=111)
Token(type='EQUAL', value='=', lineno=10, index=113)
Token(type='INT_CONS', value='0', lineno=10, index=115)
Token(type='THEN', value='then', lineno=10, index=117)
Token(type='NAME', value='x', lineno=10, index=122)
Token(type='SET', value=':=', lineno=10, index=124)
Token(type='NAME', value='x', lineno=10, index=127)
Token(type='DIV', value='div', lineno=10, index=129)
Token(type='INT_CONS', value='2', lineno=10, index=133)
Token(type='ELSE', value='else', lineno=11, index=139)
Token(type='NAME', value='x', lineno=11, index=144)
Token(type='SET', value=':=', lineno=11, index=146)
Token(type='INT_CONS', value='3', lineno=11, index=149)
Token(type='TIMES', value='*', lineno=11, index=151)
Token(type='NAME', value='x', lineno=11, index=153)
Token(type='PLUS', value='+', lineno=11, index=155)
Token(type='INT_CONS', value='1', lineno=11, index=157)
Token(type='SEMICOLON', value=';', lineno=11, index=158)
Token(type='END', value='end', lineno=12, index=162)
Token(type='SEMICOLON', value=';', lineno=12, index=165)
Token(type='END', value='end', lineno=14, index=168)
Token(type='COLON', value='.', lineno=14, index=171)
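In the listing above, lineno is the 1-based line number and index is the 0-based character offset of the token within the whole source text. The sketch below reproduces these fields for the first two declarations of the program using only the standard library. It is an illustration under assumptions, not the SLY-based lexer: the `Token` class, `TOKEN_RE`, `tokenize`, the identifier pattern and the reduced keyword table are hypothetical, and `ValueError` stands in for PCLLexerError.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    type: str
    value: str
    lineno: int
    index: int


# Reduced keyword table, enough for the snippet below.
KEYWORDS = {'program': 'PROGRAM', 'var': 'VAR', 'integer': 'INTEGER'}

# Alternatives are tried left to right, like rules in definition order.
TOKEN_RE = re.compile(
    r'(?P<NEWLINE>\n)'
    r'|(?P<SKIP>[ \t]+)'
    r'|(?P<NAME>[A-Za-z][A-Za-z0-9_]*)'  # assumed identifier shape
    r'|(?P<SEMICOLON>;)'
    r'|(?P<DCOLON>:)'
)


def tokenize(text):
    """Yield tokens with their line number and character offset."""
    lineno, pos = 1, 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f'illegal character {text[pos]!r} at index {pos}')
        kind, value = m.lastgroup, m.group()
        if kind == 'NEWLINE':
            lineno += 1
        elif kind != 'SKIP':
            if kind == 'NAME':
                # Keywords are separated from ordinary names by a table lookup.
                kind = KEYWORDS.get(value, 'NAME')
            yield Token(kind, value, lineno, m.start())
        pos = m.end()


source = 'program collatz;\n\nvar x : integer;\n'
tokens = list(tokenize(source))
for tok in tokens:
    print(tok)
```

For this input the sketch yields the same type, value, lineno and index fields as the first eight lines of the listing; for example, var is reported on line 3 at index 18 because the two preceding newlines count towards the offset.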