Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part II — Tokenizer For a Simple Program

Creating Tokens For a Hypothetical Yet Working Programming Language to Understand Compilers Better

Adrian Nenu 😺


In the previous part of this series (Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part I — Hades), we covered what we want to accomplish and the general structure of a compiler:

Code => Tokens => Parsed Tokens as Abstract Syntax Tree => Assembly code the CPU understands

It is time to deep-dive into the nitty-gritty and get our hands dirty by implementing a basic tokenizer in C++.

CPU city — generated by Midjourney

The Hades Tokenizer

Every part of a compiler can be infinitely complex, hence we need to understand where to draw a proverbial line in the sand and limit the scope of our implementation.

To start simply, our tokenizer will not handle the edge cases we would expect a real language to support. Instead, we will be satisfied with a system that can recognize the following tokens:

enum TokenType

These are enough to support a simple program that prints a numerical value (styx) and sets the exit status of the program (bestow), such as:

hero a = 30;
hero b = 40;
styx b;
bestow b;
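To make this concrete, here is a minimal sketch of how a tokenizer could split the program above. It is not the article's actual implementation: the enum members, the Token struct, and the tokenize function are all hypothetical names, assumed only for illustration, and the sketch rejects any character outside this tiny grammar.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds (an assumption, not the article's enum).
enum class TokenType { Hero, Styx, Bestow, Identifier, Number, Equals, Semicolon };

// A token pairs a kind with the source text it was read from.
struct Token {
    TokenType type;
    std::string text;
};

// Split source text into a flat list of tokens.
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isalpha(c)) {
            // Read a whole word, then decide: keyword or identifier.
            const std::size_t start = i;
            while (i < source.size() &&
                   std::isalnum(static_cast<unsigned char>(source[i]))) {
                ++i;
            }
            const std::string word = source.substr(start, i - start);
            TokenType type = TokenType::Identifier;
            if (word == "hero") type = TokenType::Hero;
            else if (word == "styx") type = TokenType::Styx;
            else if (word == "bestow") type = TokenType::Bestow;
            tokens.push_back({type, word});
        } else if (std::isdigit(c)) {
            // Read a run of digits as one number literal.
            const std::size_t start = i;
            while (i < source.size() &&
                   std::isdigit(static_cast<unsigned char>(source[i]))) {
                ++i;
            }
            tokens.push_back({TokenType::Number, source.substr(start, i - start)});
        } else if (c == '=') {
            tokens.push_back({TokenType::Equals, "="});
            ++i;
        } else if (c == ';') {
            tokens.push_back({TokenType::Semicolon, ";"});
            ++i;
        } else {
            throw std::runtime_error(std::string("unexpected character: ") + source[i]);
        }
    }
    return tokens;
}
```

Under these assumptions, calling tokenize on the four-line program above yields 16 tokens: five for each hero declaration and three each for the styx and bestow statements.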

The Token

Tokens represent every individual unit of significance in our program. Some of them may surprise you if you are already thinking in terms of the Abstract Syntax Tree: everything from equal signs to semicolons is a token at this stage, before we build a tree that captures more logic and nested expressions. The tokens form a flat array of values that carries no structure we could convert to assembly directly. Providing that structure is the job of the AST, which gives us the notions of scope and order of operations.

class Token
Token(TokenType type, const…
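As an illustration of how flat the token list is, the sketch below hand-builds the tokens for the statement hero a = 30; and renders them on one line. The Token shape used here (a type plus an optional text value) is an assumption, since the class above is only partially shown, and the names name, describe, and heroStatement are hypothetical helpers.

```cpp
#include <string>
#include <vector>

// Hypothetical token kinds, assumed for illustration only.
enum class TokenType { Hero, Styx, Bestow, Identifier, Number, Equals, Semicolon };

// Assumed shape: a kind plus the text for tokens that carry a value.
struct Token {
    TokenType type;
    std::string value;  // empty for keywords and punctuation
};

// Name of a token kind, for printing.
std::string name(TokenType type) {
    switch (type) {
        case TokenType::Hero:       return "Hero";
        case TokenType::Styx:       return "Styx";
        case TokenType::Bestow:     return "Bestow";
        case TokenType::Identifier: return "Identifier";
        case TokenType::Number:     return "Number";
        case TokenType::Equals:     return "Equals";
        case TokenType::Semicolon:  return "Semicolon";
    }
    return "Unknown";
}

// Render a token list on one line to show that it has no nesting.
std::string describe(const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& token : tokens) {
        if (!out.empty()) out += ' ';
        out += name(token.type);
        if (!token.value.empty()) out += '(' + token.value + ')';
    }
    return out;
}

// Hand-built tokens for the statement `hero a = 30;`.
std::vector<Token> heroStatement() {
    return {
        {TokenType::Hero, ""},
        {TokenType::Identifier, "a"},
        {TokenType::Equals, ""},
        {TokenType::Number, "30"},
        {TokenType::Semicolon, ""},
    };
}
```

Here describe(heroStatement()) produces the single line Hero Identifier(a) Equals Number(30) Semicolon. Nothing in that sequence says that 30 is being assigned to a, which is exactly the relationship the AST will add later.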



Adrian Nenu 😺

Software Engineer @ Google. Photographer and writer on engineering, personal reflection, and creativity