Usage
This guide shows how to use alex in real code: creating a lexer, defining rules, tokenizing input, and handling common scenarios.
Minimal working example
from alex import Alex
OPERATORS = ( ("ADD", "+"), ("SUB", "-") )
REGEXPS = ( ("NUM", r"^[0-9]+"), )
lexer = Alex(operators=OPERATORS, regexps=REGEXPS)
lexer.scan('1 + 2 - 123')
print('----(Tokens found)----------')
for token in lexer.tokens:
    print(token)
Output:
----(Tokens found)----------
1: 1 NUM 1
1: 3 ADD +
1: 5 NUM 2
1: 7 SUB -
1: 9 NUM 123
Operators
Operators are fixed strings, defined as a tuple of (name, lexeme) tuples.
Each item has two parts: a name and a lexeme. The lexeme is the exact text to match in the input, and the name is any unique string of your choice.
Example
OPERATORS = (
("PLUS", "+"),
("MINUS", "-"),
("MUL", "*"),
("DIV", "/"),
("DOT", "."),
("COMMA", ","),
("COLON", ":"),
("SEMI", ";"),
("AT", "@"),
("MOD", "%"),
("EQ", "="),
("GT", ">"),
("LT", "<"),
("LP", "("),
("RP", ")"),
("LBR", "["),
("RBR", "]"),
("LCBR", "{"),
("RCBR", "}"),
("OR", "|"),
("XOR", "^"),
("NOT", "~"),
("AND", "&"),
("ADDEQ", "+="),
("SUBEQ", "-="),
("MULEQ", "*="),
("DIVEQ", "/="),
("IDIVEQ", "//="),
("MODEQ", "%="),
("LSHIFT", "<<"),
("RSHIFT", ">>"),
("IDIV", "//"),
("EXP", "**"),
("GE", ">="),
("LE", "<="),
("TYPE", "->"),
("NEQ", "!="),
("EQEQ", "=="),
("OREQ", "|="),
("XOREQ", "^="),
("ANDEQ", "&="),
("WALRUS", ":="),
("EXPEQ", "**="),
("LSHIFTEQ", "<<="),
("RSHIFTEQ", ">>="),
)
Operators are sorted by lexeme length in descending order, so the longest lexemes are matched first.
With the definitions above, the input >>= yields a single RSHIFTEQ token rather than GT, GT, EQ.
Operators are matched before regexps.
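To see why the descending-length order matters, here is a minimal standalone sketch in plain Python. It is not alex's implementation; match_operator is a hypothetical helper that simply tries the candidates in the order given and returns the first one that prefixes the text.

```python
# Standalone illustration; match_operator is a hypothetical helper, not part of alex.
OPERATORS = (("MUL", "*"), ("EXP", "**"), ("EXPEQ", "**="), ("EQ", "="))


def match_operator(text, operators):
    """Return (name, lexeme) of the first operator that prefixes text, or None."""
    for name, lexeme in operators:
        if text.startswith(lexeme):
            return name, lexeme
    return None


# Longest lexemes first, mirroring the ordering described above.
by_length = sorted(OPERATORS, key=lambda op: len(op[1]), reverse=True)

unsorted_result = match_operator("**= 2", OPERATORS)  # stops at the short "*"
sorted_result = match_operator("**= 2", by_length)    # finds the full "**="
print(unsorted_result, sorted_result)
```

In definition order the scan stops at MUL and splits the operator, while the length-sorted order finds EXPEQ in one step.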
Regular expressions
Regular expressions are also defined as a tuple of (name, pattern) tuples.
Each item has two parts: a name and a regular expression. The regular expression describes the text to match in the input, and the name is any unique string of your choice.
Example
REGEXPS = (
("TSTR", r'^f?"""(?:\\.|(?!""").)*?"""|^f?\'\'\'(?:\\.|(?!\'\'\').)*?\'\'\''),
("STR", r'^f?"(?:\\.|[^"\\])*"|^f?\'(?:\\.|[^\'\\])*\''),
("NUM", r"^[0-9]+"),
("REM", "^#[^\n]*"),
("ID", r"^[a-zA-Z_][a-zA-Z_0-9]*"),
)
Regular expressions are tried in the order in which they appear in the tuple, so put more specific patterns (such as TSTR) before more general ones (such as STR).
Every regular expression must start with the ^ character.
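The following standalone sketch illustrates both rules with Python's re module. It is not alex's implementation: first_match is a hypothetical helper, and it assumes that each pattern is applied with re.search to the part of the input that has not been consumed yet. The string patterns are simplified versions of the ones above.

```python
import re


# Standalone illustration; first_match is a hypothetical helper, not part of alex.
# Assumption: each pattern is applied with re.search to the not-yet-consumed input.
def first_match(rest, regexps):
    """Return (name, lexeme) for the first pattern that matches rest, or None."""
    for name, pattern in regexps:
        m = re.search(pattern, rest)
        if m and m.group(0):  # ignore empty matches
            return name, m.group(0)
    return None


STR = ("STR", r'^"(?:\\.|[^"\\])*"')
TSTR = ("TSTR", r'^"""(?:\\.|(?!""").)*?"""')

text = '"""doc""" x'
good_order = first_match(text, (TSTR, STR))  # the whole triple-quoted string
bad_order = first_match(text, (STR, TSTR))   # STR grabs the empty string ""

# Without the ^ anchor a pattern can match further along in the input.
unanchored = first_match("+ 42", (("NUM", r"[0-9]+"),))
anchored = first_match("+ 42", (("NUM", r"^[0-9]+"),))
print(good_order, bad_order, unanchored, anchored)
```

With STR listed first, the triple-quoted string is cut into an empty string followed by leftovers. Without the anchor, the number is reported even though the input starts with a + sign.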
Keywords
If you want to distinguish identifiers from language keywords, you can define a list of keywords.
A keyword token is given the name 'KEYWORD'. This means that no operator or regexp definition may use the name 'KEYWORD'.
Example
from alex import Alex
OPERATORS = ( ("ADD", "+"), ("SUB", "-"), ("EQ", "=") )
REGEXPS = ( ("ID", r"^[a-zA-Z_][a-zA-Z_0-9]*"), ("NUM", r"^[0-9]+") )
KEYWORDS = ['if', 'then']
lexer = Alex(operators=OPERATORS, regexps=REGEXPS, keywords=KEYWORDS)
lexer.scan('if x = 2 then y = 3')
print('----(Tokens found)----------')
for token in lexer.tokens:
    print(token)
The output will be:
----(Tokens found)----------
1: 1 KEYWORD if <---- KEYWORD
1: 4 ID x <---- ID
1: 6 EQ =
1: 8 NUM 2
1: 10 KEYWORD then
1: 15 ID y
1: 17 EQ =
1: 19 NUM 3
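One common way to separate keywords from identifiers is to match a word with the identifier pattern first and then look the whole lexeme up in the keyword list. The sketch below shows that idea in plain Python. Whether alex works this way internally is an assumption, and classify_word is a hypothetical helper.

```python
import re

# Standalone illustration; classify_word is a hypothetical helper, not part of alex.
KEYWORDS = ["if", "then"]
ID_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z_0-9]*")


def classify_word(rest, keywords):
    """Return ('KEYWORD' or 'ID', lexeme) for a word at the start of rest, or None."""
    m = ID_PATTERN.match(rest)
    if m is None:
        return None
    lexeme = m.group(0)
    # Compare the complete lexeme, so "iffy" is not mistaken for "if".
    name = "KEYWORD" if lexeme in keywords else "ID"
    return name, lexeme


results = [classify_word(s, KEYWORDS) for s in ("if x", "then", "iffy = 1", "2")]
print(results)
```

Comparing the complete lexeme rather than a prefix keeps identifiers that merely begin with a keyword, such as iffy, classified as ID.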
Processing flags
skip_unrecognized_chars
This flag can be used to skip all unrecognized characters in the input text. The default value is False.
If, for example, you are only interested in finding all identifier names in an input text and want to skip everything else, you can set skip_unrecognized_chars=True:
from alex import Alex
text = "text = self._read_file(path)\nself.scan(text)"
REGEXPS = ( ("ID", r"^[a-zA-Z_][a-zA-Z_0-9]*"), )
lexer = Alex(regexps=REGEXPS, skip_unrecognized_chars=True)
lexer.scan(text)
print('----(Tokens found)----------')
for token in lexer.tokens:
    print(token)
Output:
----(Tokens found)----------
1: 1 ID text
1: 7 ID self
1: 11 ID _read_file
1: 21 ID path
2: 1 ID self
2: 5 ID scan
2: 9 ID text
treat_unrecognized_chars_as_an_operator
If you want unrecognized characters to appear in the token list instead of being dropped, set this flag to True. Each such character becomes a token named 'UNRECOGNIZED-CHAR'. The default value is False.
from alex import Alex
text = "text = self._read_file(path)\nself.scan(text)"
REGEXPS = ( ("ID", r"^[a-zA-Z_][a-zA-Z_0-9]*"), )
lexer = Alex(regexps=REGEXPS, treat_unrecognized_chars_as_an_operator=True)
lexer.scan(text)
print('----(Tokens found)----------')
for token in lexer.tokens:
    print(token)
Output:
----(Tokens found)----------
1: 1 ID text
1: 6 UNRECOGNIZED-CHAR =
1: 8 ID self
1: 12 UNRECOGNIZED-CHAR .
1: 13 ID _read_file
1: 23 UNRECOGNIZED-CHAR (
1: 24 ID path
1: 28 UNRECOGNIZED-CHAR )
2: 1 ID self
2: 5 UNRECOGNIZED-CHAR .
2: 6 ID scan
2: 10 UNRECOGNIZED-CHAR (
2: 11 ID text
2: 15 UNRECOGNIZED-CHAR )
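The difference between dropping and reporting unrecognized characters can be pictured with a small standalone scanner. This is only a sketch of the idea, not alex's implementation: tiny_scan and its as_operator parameter are hypothetical, only one ID rule is supported, and positions are left out for brevity.

```python
import re

# Standalone illustration; tiny_scan and its parameters are hypothetical, not alex's API.
ID = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")


def tiny_scan(text, as_operator=False):
    """Collect ID tokens; drop or report every other non-space character."""
    tokens, pos = [], 0
    while pos < len(text):
        m = ID.match(text, pos)
        if m:
            tokens.append(("ID", m.group(0)))
            pos = m.end()
            continue
        if as_operator and not text[pos].isspace():
            tokens.append(("UNRECOGNIZED-CHAR", text[pos]))
        pos += 1  # always advance past the unmatched character
    return tokens


text = "self.scan(text)"
skipped = tiny_scan(text)
reported = tiny_scan(text, as_operator=True)
print(skipped)
print(reported)
```

Both modes keep scanning after a character that no rule matches. The only difference is whether that character leaves a token behind.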
scan_python_indents
With this flag set to True (the default is False), the lexer reports the indentation of a line as an 'INDENT' token whose lexeme is the indentation width.
from alex import Alex
text = "    text = self._read_file(path)\n    self.scan(text)"
REGEXPS = (("ID", r"^[a-zA-Z_][a-zA-Z_0-9]*"),)
lexer = Alex(regexps=REGEXPS, skip_unrecognized_chars=True, scan_python_indents=True)
lexer.scan(text)
print('----(Tokens found)----------')
for token in lexer.tokens:
    print(token)
Output:
----(Tokens found)----------
1: 1 INDENT 4
1: 5 ID text
1: 11 ID self
1: 15 ID _read_file
1: 25 ID path
2: 1 INDENT 4
2: 5 ID self
2: 9 ID scan
2: 13 ID text
This means that 'INDENT', like 'KEYWORD', is a reserved name and cannot be used in your own definitions.
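A parser for an indentation-based language usually needs block boundaries rather than raw widths. As a rough sketch, assume you have collected the indentation widths line by line into a plain list (how to read them from the token objects is not covered here). A width stack then turns them into open and close events. The helper block_events is hypothetical and not part of alex.

```python
# Standalone illustration; block_events is a hypothetical helper, not part of alex.
def block_events(widths):
    """Map per-line indentation widths to 'OPEN', 'SAME' and 'CLOSE' events."""
    stack, events = [0], []
    for width in widths:
        if width > stack[-1]:
            # Deeper than the enclosing block: a new block starts.
            stack.append(width)
            events.append("OPEN")
        elif width == stack[-1]:
            events.append("SAME")
        else:
            # Shallower: close blocks until a matching width is found.
            while width < stack[-1]:
                stack.pop()
                events.append("CLOSE")
            if width != stack[-1]:
                raise ValueError(f"inconsistent dedent to width {width}")
    return events


events = block_events([0, 4, 4, 8, 0])
print(events)
```

A dedent that lands on a width that never opened a block, for example [0, 4, 2], raises ValueError, which is similar to how Python itself rejects inconsistent dedents.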