Writing Your Own Toy Compiler Using Flex, Bison and LLVM
Update (March 19 2010): this article was updated for LLVM 2.6 thanks to a great patch by John Harrison. He rocks!
I’ve always been interested in compilers and languages, but interest only gets you so far. A lot of the concepts of compiler design can easily go way over most programmers’ heads, even the intelligent ones. Needless to say, I’ve tried, without much success, to write a small toy language/compiler before. I’d usually get caught up at the semantic parsing stage. And again, needless to say, this post is mostly inspired by my latest attempt, though this one has been much more successful (so far).
Fortunately over the last few years I’ve been involved in some projects that helped give me perspective and experience on what’s really involved in building a compiler. The other thing I’ve been lucky to have in my corner this time is the help of LLVM, a tool which I’m hardly qualified to talk too much about, but it’s been quite handy in implementing most of the business end (read: complex aspects) of my toy compiler.
So Why Are You Reading This?
Maybe you want to see what I’ve been doing with my time. It’s more likely, however, that you’re probably interested in compilers and languages as I am, and have probably been hitting similar roadblocks. You’ve probably wanted to try this but never found the resources, or did but couldn’t quite follow. The goal of this article is to provide such a resource and explain in a relatively step by step manner how to create the most basic-but-functional compiler from start to “finish”.
I won’t be covering much theory, so if you haven’t brushed up on your BNF grammars, AST data structures and the basic compiler pipeline, I suggest you do so. That said, I plan on keeping this as simple as possible. The goal, of course, is to make this an easy-to-understand introductory resource for people interested but not experienced with compilers.
What You Will End Up With
If you follow this article, you should end up with a language that can define functions, call functions, define variables, assign data to variables and perform basic math operations. It will support two basic types, doubles and integers. Some of the functionality is unimplemented, so you can have the satisfaction of actually implementing some of this stuff yourself and get the hang of writing a compiler with a little help.
Let’s Get Some Questions Out of the Way
1. What languages do I need to know?
The tools we’ll be using are C/C++ based. LLVM is specifically C++ and our toy language will follow suit since there are some niceties of OOP and the STL (C++’s stdlib) that make for fewer lines of code. In addition to C, both Lex and Bison have their own syntax which may seem daunting at first, but I’ll try to explain as much as possible. The grammar we’re dealing with is actually very tiny (~100 LOC), so it should be feasible.
2. Is this really complicated?
Yes and no. There is plenty of stuff going on here and it might seem scary at first, but honestly, this is about as simple as it gets. We will also use a lot of tools to abstract the layers of complexity and make it manageable.
3. How long will it take?
What we will be building took me about 3 days of toying with, but I have a few failed attempts under my belt and those do have impact on my comprehension. Of course, this will be a “follow me” kind of deal, so it should be much shorter for you. To understand completely everything might take a little longer, but you should be able to run through all of this stuff (and hopefully understand a good amount of it) in an afternoon.
So, if you’re ready, let’s get started.