Writing Your Own Toy Compiler Using Flex, Bison and LLVM
Step 1. Lexical Analysis with Flex
This is the simplest step. Given our grammar, we need to break our input down into a list of known tokens. As mentioned before, our grammar has very basic tokens: identifiers, numbers (integers and floats), the mathematical operators, parentheses and braces. Our Flex file, “tokens.l”, is written in Flex's own somewhat specialized syntax and is defined as follows:
Listing of tokens.l:
%{
#include <string>
#include "node.h"
#include "parser.hpp"
#define SAVE_TOKEN yylval.string = new std::string(yytext, yyleng)
#define TOKEN(t) (yylval.token = t)
extern "C" int yywrap() { return 1; } /* 1 = no further input to scan */
%}
%%
[ \t\n] ;
[a-zA-Z_][a-zA-Z0-9_]* SAVE_TOKEN; return TIDENTIFIER;
[0-9]+\.[0-9]* SAVE_TOKEN; return TDOUBLE;
[0-9]+ SAVE_TOKEN; return TINTEGER;
"=" return TOKEN(TEQUAL);
"==" return TOKEN(TCEQ);
"!=" return TOKEN(TCNE);
"<" return TOKEN(TCLT);
"<=" return TOKEN(TCLE);
">" return TOKEN(TCGT);
">=" return TOKEN(TCGE);
"(" return TOKEN(TLPAREN);
")" return TOKEN(TRPAREN);
"{" return TOKEN(TLBRACE);
"}" return TOKEN(TRBRACE);
"." return TOKEN(TDOT);
"," return TOKEN(TCOMMA);
"+" return TOKEN(TPLUS);
"-" return TOKEN(TMINUS);
"*" return TOKEN(TMUL);
"/" return TOKEN(TDIV);
. printf("Unknown token!\n"); yyterminate();
%%
The first section, between %{ and %}, holds C++ code that is copied verbatim into the generated scanner. We use a “SAVE_TOKEN” macro to keep a copy of the text of identifiers and numbers (instead of just the token itself), since ‘yytext’ points into the scanner’s internal buffer and is only valid until the next token is matched, so Bison can’t rely on it. The first rule tells the scanner to skip all whitespace. You’ll also notice that we have some comparison tokens and such. Those are unimplemented for now; feel free to support them in your toy compiler!
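To see why SAVE_TOKEN makes a copy rather than keeping a pointer, here is a minimal sketch in plain C++. It does not use Flex at all: scan_buffer, scan_length, fake_match and save_token are hypothetical stand-ins for yytext, yyleng, a rule match and the SAVE_TOKEN macro, and the real scanner manages its buffer in a more involved way.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-ins for the scanner's state. In a real Flex scanner
// these roles are played by yytext and yyleng.
static char scan_buffer[64];
static int scan_length = 0;

// Pretend to match a token by overwriting the shared buffer, which is
// roughly what happens to yytext each time a new token is matched.
void fake_match(const char *text) {
    std::strncpy(scan_buffer, text, sizeof(scan_buffer) - 1);
    scan_buffer[sizeof(scan_buffer) - 1] = '\0';
    scan_length = static_cast<int>(std::strlen(scan_buffer));
}

// Counterpart of SAVE_TOKEN: take an owned copy of the current match.
// The caller is responsible for deleting the returned string.
std::string *save_token() {
    return new std::string(scan_buffer, scan_length);
}
```

If you call fake_match("foo"), keep both a raw pointer to scan_buffer and the result of save_token(), and then call fake_match("42"), the raw pointer now reads "42" while the saved copy still holds "foo". That is the situation the parser would be in if it looked at yytext after the scanner had moved on.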
So all we’re doing here is defining the tokens and their symbolic names. These symbols (TIDENTIFIER, etc.) will become “terminal symbols” in our grammar. We’re just returning them, but we’ve never defined them. Where are they defined? Why, in the Bison grammar, of course. The parser.hpp file we’ve included will be generated by Bison, and it will declare all of these tokens so that they are available for use here.
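As a rough picture of what the scanner gets from that header, the sketch below shows the kind of declarations it relies on. This is an assumption for illustration only: the real parser.hpp is generated from the grammar, the numeric values are made up, only a few tokens are listed, and the value type is reduced to the two members that tokens.l touches (string and token). The lex_equal function is a hypothetical helper, not part of the scanner.

```cpp
#include <string>

// Hypothetical excerpt: the real parser.hpp is generated by Bison from the
// grammar file, and its exact contents and numeric values will differ.
enum yytokentype {
    TIDENTIFIER = 258,  // illustrative starting value
    TINTEGER,
    TDOUBLE,
    TEQUAL,
    TPLUS
    // ... one entry per token declared in the grammar
};

// The semantic value type. Only the two members used by tokens.l are shown.
union YYSTYPE {
    std::string *string;  // set by SAVE_TOKEN
    int token;            // set by TOKEN(t)
};

// Defined here so the sketch is self-contained; in a real build the
// generated parser defines yylval and the header only declares it.
YYSTYPE yylval;

// What a rule such as  "=" return TOKEN(TEQUAL);  boils down to:
// store the token code in yylval, then return that same code.
#define TOKEN(t) (yylval.token = t)
int lex_equal() {
    return TOKEN(TEQUAL);
}
```

The point is that the token names are ordinary integer constants and yylval is an ordinary variable, so the macros in tokens.l compile only once this header exists.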
We run Flex on this tokens.l file to generate our “tokens.cpp” file, which will be compiled alongside our parser and provides the yylex() function that recognizes all of these tokens. We will run this command later, though, because we need to generate that header file with Bison first!