Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part II — Tokenizer For a Simple Program
Creating Tokens For a Hypothetical Yet Working Programming Language to Understand Compilers Better
In the previous part of this series, Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part I — Hades, we covered what we want to accomplish and the general structure of a compiler:
Code => Tokens => Parsed Tokens as Abstract Syntax Tree => Assembly Code CPU understands
It is time to deep-dive into the nitty-gritty and get our hands dirty by implementing a basic tokenizer in C++.
The Hades Tokenizer
Every part of a compiler can be made arbitrarily complex, so we need to decide where to draw the proverbial line in the sand and limit the scope of our implementation.
To start simply, our tokenizer will not handle the edge cases we would expect a real language to support. Instead, we will be satisfied with a system that can recognize the following tokens:
These are enough to support a simple program that prints a numerical value (styx) and sets the exit status of a program (bestow), such as:
hero a = 30;
hero b = 40;
Tokens represent every individual unit of significance in our program. Many things become tokens that you might not expect to; if that is surprising, you are probably already thinking about the Abstract Syntax Tree. Everything from equal signs to semicolons is a token before we build a tree that captures more of the program's logic and expression structure. The tokens form a flat array of values that, on its own, carries no structure we could use to convert it to assembly as is. That will be the job of the AST, which will give us the notions of scope and order of operations.
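To make the idea of a flat token array concrete, here is a minimal sketch of a tokenizer for the two sample lines above. It is not the implementation this series builds: the enum members, the Token struct layout, and the tokenize function are all hypothetical names chosen for illustration, and the set of token kinds is an assumption based only on the keywords and symbols mentioned so far.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds (an assumption; the real TokenType may differ).
enum class TokenType { Hero, Styx, Bestow, Identifier, Number, Equals, Semicolon };

// Hypothetical token layout: a kind plus the exact source characters.
struct Token {
    TokenType type;
    std::string text;
};

// Convert source text into a flat list of tokens. No nesting, no scope:
// structure is deliberately left for the AST stage.
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace separates tokens but is not a token itself
        } else if (std::isalpha(c)) {
            // Read a whole word, then decide: keyword or identifier.
            const std::size_t start = i;
            while (i < source.size() &&
                   std::isalnum(static_cast<unsigned char>(source[i]))) {
                ++i;
            }
            const std::string word = source.substr(start, i - start);
            TokenType type = TokenType::Identifier;
            if (word == "hero") type = TokenType::Hero;
            else if (word == "styx") type = TokenType::Styx;
            else if (word == "bestow") type = TokenType::Bestow;
            tokens.push_back({type, word});
        } else if (std::isdigit(c)) {
            // Read a run of digits as one integer literal.
            const std::size_t start = i;
            while (i < source.size() &&
                   std::isdigit(static_cast<unsigned char>(source[i]))) {
                ++i;
            }
            tokens.push_back({TokenType::Number, source.substr(start, i - start)});
        } else if (c == '=') {
            tokens.push_back({TokenType::Equals, "="});
            ++i;
        } else if (c == ';') {
            tokens.push_back({TokenType::Semicolon, ";"});
            ++i;
        } else {
            // Anything else is outside the deliberately tiny scope.
            throw std::runtime_error(std::string("unexpected character: ") + source[i]);
        }
    }
    return tokens;
}
```

With this sketch, tokenize("hero a = 30;") returns five tokens in source order: Hero, Identifier (a), Equals, Number (30), Semicolon. Note that the equal sign and the semicolon are tokens in their own right, and nothing in the list says which value belongs to which variable; that relationship only appears once a parser builds the tree.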
Token(TokenType type, const…