Writing Your Own Toy Compiler Using Flex, Bison and LLVM

By Loren Segal on September 18th, 2009 at 1:08 PM

The Basic Compiler Recipe

Although you should already pretty much know this, a compiler is really a grouping of three to four components (there are some more sub-components) where data is fed from one to the next in a pipeline fashion. We will be using a different tool to help us build out each of these components. Here is a diagram of each step and the tool we will use:

Compiler Pipeline

You’ll notice the linker step is greyed out. Our toy language won’t be supporting compile time linking (plus most languages don’t do compile time linking anymore anyway). To do our lexing we will be using the open source tool Lex, mostly available these days as Flex. Lexing usually goes hand in hand with semantic parsing, which we’ll be performing with the help of Yacc, better known as Bison. Finally, once semantic parsing is done we can walk over our AST and generate our “bytecode”, or machine code. For this, we’ll be using LLVM, which technically generates intermediate bytecode, but we will be using LLVM’s JIT (Just In Time) compilation to execute this bytecode on our machine.

To summarize, the steps are as follows:

  1. Lexical Analysis with Flex: Split input data into a set of tokens (identifiers, keywords, numbers, brackets, braces, etc.)
  2. Semantic Parsing with Bison: Generate an AST while parsing the tokens. Bison will do most of the legwork here, we just need to define our AST.
  3. Assembly with LLVM: This is where we walk over our AST and generate byte/machine code for each node. As crazy as it sounds, this is probably the easiest step.

Before moving too far along, you should probably consider installing Flex, Bison and LLVM, if you haven’t already. We’re going to need them pretty soon.

Defining Our Grammar

Our grammar is naturally the most central part of our language. From our grammar we inherently allow or disallow functionality. For our toy compiler, we will be using a standard C-like syntax because it’s familiar and simple to parse. The canonical example of our toy language will be the following code:

int do_math(int a) {
  int x = a * 5 + 3
}

do_math(10)

Looks simple enough. We can see it’s typed the same way C is, but there are no semicolons separating statements. You’ll also notice there is no “return” statement in our grammar; this is something you can implement on your own.

There’s also no mechanism to print any results or anything. We will verify the correctness of our programs by inspecting the very pretty instruction listing that LLVM can print of the bytecode we’ve compiled, but more on that later.

Questions? Comments? Follow me on Twitter (@lsegal) or email me.