Writing Your Own Toy Compiler Using Flex, Bison and LLVM
Step 3. Assembling the AST with LLVM
The natural next step in a compiler is to take this AST and turn it into machine code. This means converting each semantic node into the equivalent machine instructions. LLVM makes this very easy for us, because it abstracts the actual instructions into a representation that is itself similar to an AST. This means all we’re really doing is translating from one AST to another.
As you might imagine, this process involves walking our AST from the root node and emitting bytecode for each node we visit. This is where the codeGen method that we defined for our nodes comes in handy. For example, when we visit an NBlock node (semantically representing a collection of statements), we call codeGen on each statement in the list. It looks like this:
Value* NBlock::codeGen(CodeGenContext& context)
{
    StatementList::const_iterator it;
    Value *last = NULL;
    for (it = statements.begin(); it != statements.end(); it++) {
        std::cout << "Generating code for " << typeid(**it).name() << std::endl;
        last = (**it).codeGen(context);
    }
    std::cout << "Creating block" << std::endl;
    return last;
}
We implement this method for all of our nodes, then call it as we go down the tree, implicitly walking us through our entire AST. We keep a new CodeGenContext class around to tell us where to emit our bytecode.
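To make the shape of that walk concrete without pulling in LLVM, here is a minimal sketch of the same idea. Everything in it is hypothetical and not part of our compiler: EmitContext stands in for CodeGenContext, the Num, Add and Block nodes stand in for our real node classes, and the "instructions" are just strings. Each node emits code for its children first, then for itself, and returns the name that holds its value.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for CodeGenContext: it only records what was emitted.
struct EmitContext {
    std::vector<std::string> out;  // emitted pseudo-instructions, in order
    int nextTemp = 0;              // counter used to name temporaries
    std::string fresh() { return "%t" + std::to_string(nextTemp++); }
};

struct Node {
    virtual ~Node() = default;
    // Emits code for this node and returns the name holding its value.
    virtual std::string codeGen(EmitContext& ctx) const = 0;
};

// A literal emits nothing; its value is used directly as an operand.
struct Num : Node {
    long value;
    explicit Num(long v) : value(v) {}
    std::string codeGen(EmitContext&) const override { return std::to_string(value); }
};

struct Add : Node {
    std::unique_ptr<Node> lhs, rhs;
    Add(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) { assert(lhs && rhs); }
    std::string codeGen(EmitContext& ctx) const override {
        std::string l = lhs->codeGen(ctx);  // children first...
        std::string r = rhs->codeGen(ctx);
        std::string t = ctx.fresh();        // ...then this node
        ctx.out.push_back(t + " = add " + l + ", " + r);
        return t;
    }
};

// Like NBlock: generate each statement in order, return the last result.
struct Block : Node {
    std::vector<std::unique_ptr<Node>> statements;
    std::string codeGen(EmitContext& ctx) const override {
        std::string last;
        for (const auto& s : statements) last = s->codeGen(ctx);
        return last;
    }
};

// Builds the tree for (1 + 2) + 3, walks it, and returns what was emitted.
std::vector<std::string> emitExample() {
    std::unique_ptr<Node> inner(new Add(std::unique_ptr<Node>(new Num(1)),
                                        std::unique_ptr<Node>(new Num(2))));
    std::unique_ptr<Node> outer(new Add(std::move(inner),
                                        std::unique_ptr<Node>(new Num(3))));
    Block root;
    root.statements.push_back(std::move(outer));
    EmitContext ctx;
    root.codeGen(ctx);
    return ctx.out;
}
```

Calling emitExample() yields two pseudo-instructions: the inner addition first, then the outer one that consumes its result. The real codeGen methods follow the same pattern, except that they return an LLVM Value* instead of a string.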
A Big Caveat About LLVM
One of the downsides of LLVM is that it’s really hard to find useful documentation. Their online tutorial and other such docs are wildly out of date, and there’s barely any information on the C++ API unless you really dig for it. If you installed LLVM on your own, check the ‘docs’ directory, as it has more up-to-date documentation.
I’ve found the best way to learn LLVM is by example. There are some quick examples of programmatically generating bytecode in the ‘examples’ directory of the LLVM archive as well, and there’s also the LLVM live demo site which can emit C++ API code for a C program as input. This is a great way to find out what instructions something like "int x = 5;" will emit. I used the demo tool to implement most of the nodes.
Without further ado, I’ll be listing both the codegen.h and codegen.cpp files below.
Listing of codegen.h:
#include <map>
#include <stack>
#include <string>
#include <llvm/Module.h>
#include <llvm/Function.h>
#include <llvm/Type.h>
#include <llvm/DerivedTypes.h>
#include <llvm/LLVMContext.h>
#include <llvm/PassManager.h>
#include <llvm/Instructions.h>
#include <llvm/CallingConv.h>
#include <llvm/Bitcode/ReaderWriter.h>
#include <llvm/Analysis/Verifier.h>
#include <llvm/Assembly/PrintModulePass.h>
#include <llvm/Support/IRBuilder.h>
#include <llvm/ModuleProvider.h>
#include <llvm/Target/TargetSelect.h>
#include <llvm/ExecutionEngine/GenericValue.h>
#include <llvm/ExecutionEngine/JIT.h>
#include <llvm/Support/raw_ostream.h>
using namespace llvm;
class NBlock;
/* One entry per entered block: the LLVM block plus its local symbol table */
class CodeGenBlock {
public:
    BasicBlock *block;
    std::map<std::string, Value*> locals;
};
class CodeGenContext {
    std::stack<CodeGenBlock *> blocks;
    Function *mainFunction;
public:
    Module *module;
    CodeGenContext() { module = new Module("main", getGlobalContext()); }
    void generateCode(NBlock& root);
    GenericValue runCode();
    std::map<std::string, Value*>& locals() { return blocks.top()->locals; }
    BasicBlock *currentBlock() { return blocks.top()->block; }
    void pushBlock(BasicBlock *block) { blocks.push(new CodeGenBlock()); blocks.top()->block = block; }
    void popBlock() { CodeGenBlock *top = blocks.top(); blocks.pop(); delete top; }
};
Listing of codegen.cpp:
#include <iostream>
#include <typeinfo>
#include <vector>
#include "node.h"
#include "codegen.h"
#include "parser.hpp"
using namespace std;
/* Compile the AST into a module */
void CodeGenContext::generateCode(NBlock& root)
{
    std::cout << "Generating code...\n";
    /* Create the top level interpreter function to call as entry */
    vector<const Type*> argTypes;
    FunctionType *ftype = FunctionType::get(Type::getVoidTy(getGlobalContext()), argTypes, false);
    mainFunction = Function::Create(ftype, GlobalValue::InternalLinkage, "main", module);
    BasicBlock *bblock = BasicBlock::Create(getGlobalContext(), "entry", mainFunction, 0);
    /* Push a new variable/block context */
    pushBlock(bblock);
    root.codeGen(*this); /* emit bytecode for the toplevel block */
    ReturnInst::Create(getGlobalContext(), bblock);
    popBlock();
    /* Print the bytecode in a human-readable format
       to see if our program compiled properly
     */
    std::cout << "Code is generated.\n";
    PassManager pm;
    pm.add(createPrintModulePass(&outs()));
    pm.run(*module);
}
/* Executes the AST by running the main function */
GenericValue CodeGenContext::runCode() {
    std::cout << "Running code...\n";
    ExistingModuleProvider *mp = new ExistingModuleProvider(module);
    ExecutionEngine *ee = ExecutionEngine::create(mp, false);
    vector<GenericValue> noargs;
    GenericValue v = ee->runFunction(mainFunction, noargs);
    std::cout << "Code was run.\n";
    return v;
}
/* Returns an LLVM type based on the identifier */
static const Type *typeOf(const NIdentifier& type)
{
    if (type.name.compare("int") == 0) {
        return Type::getInt64Ty(getGlobalContext());
    }
    else if (type.name.compare("double") == 0) {
        return Type::getDoubleTy(getGlobalContext());
    }
    return Type::getVoidTy(getGlobalContext());
}
/* -- Code Generation -- */
Value* NInteger::codeGen(CodeGenContext& context)
{
    std::cout << "Creating integer: " << value << std::endl;
    return ConstantInt::get(Type::getInt64Ty(getGlobalContext()), value, true);
}
Value* NDouble::codeGen(CodeGenContext& context)
{
    std::cout << "Creating double: " << value << std::endl;
    return ConstantFP::get(Type::getDoubleTy(getGlobalContext()), value);
}
Value* NIdentifier::codeGen(CodeGenContext& context)
{
    std::cout << "Creating identifier reference: " << name << std::endl;
    if (context.locals().find(name) == context.locals().end()) {
        std::cerr << "undeclared variable " << name << std::endl;
        return NULL;
    }
    return new LoadInst(context.locals()[name], "", false, context.currentBlock());
}
Value* NMethodCall::codeGen(CodeGenContext& context)
{
    Function *function = context.module->getFunction(id.name.c_str());
    if (function == NULL) {
        std::cerr << "no such function " << id.name << std::endl;
        /* Bail out rather than build a call to a NULL function */
        return NULL;
    }
    std::vector<Value*> args;
    ExpressionList::const_iterator it;
    for (it = arguments.begin(); it != arguments.end(); it++) {
        args.push_back((**it).codeGen(context));
    }
    CallInst *call = CallInst::Create(function, args.begin(), args.end(), "", context.currentBlock());
    std::cout << "Creating method call: " << id.name << std::endl;
    return call;
}
Value* NBinaryOperator::codeGen(CodeGenContext& context)
{
    std::cout << "Creating binary operation " << op << std::endl;
    Instruction::BinaryOps instr;
    switch (op) {
        case TPLUS:  instr = Instruction::Add;  break;
        case TMINUS: instr = Instruction::Sub;  break;
        case TMUL:   instr = Instruction::Mul;  break;
        case TDIV:   instr = Instruction::SDiv; break;
        /* TODO comparison */
        default:     return NULL;
    }
    /* Generate the operands first so their instructions are emitted in a
       fixed order (lhs, then rhs); C++ leaves the evaluation order of
       function arguments unspecified */
    Value *left = lhs.codeGen(context);
    Value *right = rhs.codeGen(context);
    return BinaryOperator::Create(instr, left, right, "", context.currentBlock());
}
Value* NAssignment::codeGen(CodeGenContext& context)
{
    std::cout << "Creating assignment for " << lhs.name << std::endl;
    if (context.locals().find(lhs.name) == context.locals().end()) {
        std::cerr << "undeclared variable " << lhs.name << std::endl;
        return NULL;
    }
    return new StoreInst(rhs.codeGen(context), context.locals()[lhs.name], false, context.currentBlock());
}
Value* NBlock::codeGen(CodeGenContext& context)
{
    StatementList::const_iterator it;
    Value *last = NULL;
    for (it = statements.begin(); it != statements.end(); it++) {
        std::cout << "Generating code for " << typeid(**it).name() << std::endl;
        last = (**it).codeGen(context);
    }
    std::cout << "Creating block" << std::endl;
    return last;
}
Value* NExpressionStatement::codeGen(CodeGenContext& context)
{
    std::cout << "Generating code for " << typeid(expression).name() << std::endl;
    return expression.codeGen(context);
}
Value* NVariableDeclaration::codeGen(CodeGenContext& context)
{
    std::cout << "Creating variable declaration " << type.name << " " << id.name << std::endl;
    /* Reserve stack space for the variable and record it in the symbol table */
    AllocaInst *alloc = new AllocaInst(typeOf(type), id.name.c_str(), context.currentBlock());
    context.locals()[id.name] = alloc;
    if (assignmentExpr != NULL) {
        NAssignment assn(id, *assignmentExpr);
        assn.codeGen(context);
    }
    return alloc;
}
Value* NFunctionDeclaration::codeGen(CodeGenContext& context)
{
    /* Build the function signature from the declared argument types */
    vector<const Type*> argTypes;
    VariableList::const_iterator it;
    for (it = arguments.begin(); it != arguments.end(); it++) {
        argTypes.push_back(typeOf((**it).type));
    }
    FunctionType *ftype = FunctionType::get(typeOf(type), argTypes, false);
    Function *function = Function::Create(ftype, GlobalValue::InternalLinkage, id.name.c_str(), context.module);
    BasicBlock *bblock = BasicBlock::Create(getGlobalContext(), "entry", function, 0);
    context.pushBlock(bblock);
    /* Declare each argument as a local variable in the new block.
       Note that this only allocates the locals; the incoming argument
       values are not copied into them here. */
    for (it = arguments.begin(); it != arguments.end(); it++) {
        (**it).codeGen(context);
    }
    block.codeGen(context);
    /* Always emits a void return, whatever the declared return type */
    ReturnInst::Create(getGlobalContext(), bblock);
    context.popBlock();
    std::cout << "Creating function: " << id.name << std::endl;
    return function;
}
There is certainly a great deal to take in here, but this is the part where you should start exploring on your own. I only have a few notes:
- We use a “stack” of blocks in our CodeGenContext class to keep track of the most recently entered block (because instructions are always added to a block).
- We also use this stack to keep a symbol table of the local variables in each block.
- Our toy program only knows about variables in its own scope. To support the idea of “global” contexts you’d need to search upwards through each block in our stack until you found a match to the symbol (rather than simply searching the top symbol table).
- Before entering a block we should push the block and when leaving it we should pop it.
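The last two notes can be sketched outside of LLVM. The following is only an illustration under stated assumptions, not code from our compiler: ScopeStack and ScopeGuard are hypothetical names, and a plain int stands in for the Value* we store per variable. The lookup walks from the innermost scope outwards, which is the "search upwards" idea, and the guard pairs every push with a pop so that a block cannot be left on the stack by accident.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical scope stack; `int` stands in for the Value* kept per variable.
class ScopeStack {
    std::vector<std::map<std::string, int>> scopes;  // back() is the innermost scope
public:
    void push() { scopes.emplace_back(); }
    void pop() {
        assert(!scopes.empty());
        scopes.pop_back();
    }
    // Declares (or overwrites) a variable in the innermost scope only.
    void declare(const std::string& name, int value) {
        assert(!scopes.empty());
        scopes.back()[name] = value;
    }
    // Searches from the innermost scope outwards; returns nullptr if the
    // name is not declared in any enclosing scope.
    const int* lookup(const std::string& name) const {
        for (auto scope = scopes.rbegin(); scope != scopes.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return &found->second;
        }
        return nullptr;
    }
};

// Pushes a scope on construction and pops it on destruction, so the
// push/pop pair stays balanced even on an early return.
struct ScopeGuard {
    ScopeStack& stack;
    explicit ScopeGuard(ScopeStack& s) : stack(s) { stack.push(); }
    ~ScopeGuard() { stack.pop(); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
};
```

With this shape, a variable declared in an inner scope shadows an outer one of the same name, and the outer one becomes visible again once the inner scope is popped. Applying the same idea to CodeGenContext would mean replacing std::stack with a container that can be iterated, since std::stack only exposes its top element.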
The rest of the details are all related to LLVM, and again, I’m hardly an expert on that subject. But at this point, we have all of the code we need to compile our toy language and watch it run.