Lexing

A lexer is the part of a programming language implementation that breaks the contents of a program into tokens, which are usually passed on to the parser. The lexer is sometimes referred to as a tokenizer. Thought of as a function, its domain is all valid source strings and its range is all possible lists of tokens.

            +-------+
<string> => | LEXER | => <token>*
            +-------+

A token is made up of two parts: the token name, which gives the token an explicit meaning, and the lexeme, which is the actual string content of the token. These can be thought of as pairs (i.e. (<name> . <lexeme>)). Token names are language specific, but there is a lot of overlap between languages.
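
Since scheme has pairs built in, a token can be modeled directly this way. A minimal sketch (the names make-token, token-name, and token-lexeme are made up for this example, not taken from any library):

    ;; A token is just a (name . lexeme) pair.
    (define (make-token name lexeme) (cons name lexeme))
    (define (token-name token) (car token))
    (define (token-lexeme token) (cdr token))

    (make-token 'id "define")  ; => (id . "define")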

For example, possible tokens of a scheme program include parentheses, ids, strings, and numbers. Given the code (define x 10), a scheme lexer would produce the tokens ((open_paren . "(") (id . "define") (id . "x") (number . "10") (close_paren . ")")). Notice the lexemes are all strings, even for the token identifying the number 10. As far as the lexer is concerned, everything is a string. It doesn’t do anything with the string contents other than identify it and pass it on.

The token names are valuable to later stages of the compilation process because they give each lexeme its purpose, while the lexemes (if needed) give the actual value of the token. The lexeme for the close parenthesis token is meaningless because the close parenthesis token signifies the end of a list in and of itself, but the lexeme of the id token is needed to look up values in the symbol table during compilation/execution. All the id token does is say, “Hey, this is an identifier!” while the lexeme gives that identifier its value. That’s not to say that the ) character is useless, since it’s needed to make the close parenthesis token in the first place, but once the token is made the character itself is no longer needed.
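
To make the example concrete, below is a minimal lexer sketch written in scheme itself. It handles only parentheses, ids, and numbers (string tokens are left out for brevity), assumes an R7RS-style implementation with let-values, and every name in it (lex, span, id-char?) is made up for this sketch rather than taken from any library.

    ;; Split chars into the longest prefix satisfying pred (as a string)
    ;; and the rest of the characters.
    (define (span pred chars)
      (let loop ((chars chars) (acc '()))
        (if (and (pair? chars) (pred (car chars)))
            (loop (cdr chars) (cons (car chars) acc))
            (values (list->string (reverse acc)) chars))))

    ;; An id character is anything that isn't whitespace or a parenthesis.
    (define (id-char? c)
      (not (or (char-whitespace? c)
               (char=? c #\()
               (char=? c #\)))))

    (define (lex str)
      (let loop ((chars (string->list str)) (tokens '()))
        (cond
          ((null? chars) (reverse tokens))
          ;; Whitespace separates tokens but produces none.
          ((char-whitespace? (car chars)) (loop (cdr chars) tokens))
          ((char=? (car chars) #\()
           (loop (cdr chars) (cons '(open_paren . "(") tokens)))
          ((char=? (car chars) #\))
           (loop (cdr chars) (cons '(close_paren . ")") tokens)))
          ;; A run of digits becomes a number token -- still a string.
          ((char-numeric? (car chars))
           (let-values (((lexeme rest) (span char-numeric? chars)))
             (loop rest (cons (cons 'number lexeme) tokens))))
          ;; Everything else becomes an id token.
          (else
           (let-values (((lexeme rest) (span id-char? chars)))
             (loop rest (cons (cons 'id lexeme) tokens)))))))

    (lex "(define x 10)")
    ;; => ((open_paren . "(") (id . "define") (id . "x")
    ;;     (number . "10") (close_paren . ")"))

Note that the sketch never converts the lexeme "10" into the number 10. Classifying the characters and passing the strings along is the whole job; anything more is left to later stages.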