Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part II — Tokenizer For a Simple Program

Creating Tokens For a Hypothetical Yet Working Programming Language to Understand Compilers Better

Adrian Nenu 😺
Nov 16, 2023

In the previous part of this series, Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part I — Hades, we covered what we want to accomplish and the general structure of a compiler:

Code => Tokens => Parsed Tokens as an Abstract Syntax Tree => Assembly code the CPU understands

It is time to deep-dive into the nitty-gritty and get our hands dirty by implementing a basic tokenizer in C++.

CPU city — generated by Midjourney

The Hades Tokenizer

Every part of a compiler can become arbitrarily complex, so we need to decide where to draw a proverbial line in the sand and limit the scope of our implementation.

To kick us off easily, our tokenizer will not handle the edge cases we would expect a real language to support; instead, we will be satisfied with a system that can recognize the following tokens:

enum TokenType
{
    BESTOW,
    STYX,
    NUMBER,
    SEMICOLON,
    HERO,
    IDENTIFIER,
    EQUALS,
    UNKNOWN
};

These are enough to support a simple program that prints a numerical value (styx) and sets the exit status of the program (bestow), such as:

hero a = 30;
hero b = 40;
styx b;
bestow b;
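
For reference, once tokenized, this program should yield the following flat stream of tokens, shown here as TYPE('value') pairs purely for illustration:

HERO('hero') IDENTIFIER('a') EQUALS('=') NUMBER('30') SEMICOLON(';')
HERO('hero') IDENTIFIER('b') EQUALS('=') NUMBER('40') SEMICOLON(';')
STYX('styx') IDENTIFIER('b') SEMICOLON(';')
BESTOW('bestow') IDENTIFIER('b') SEMICOLON(';')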

The Token

Tokens will represent every individual unit of significance in our program. If you are already picturing the Abstract Syntax Tree, there are many things you might not expect to become tokens: everything from equals signs to semicolons will be a token before we build a tree that encapsulates more logic and expression structure. The tokens will be a flat array of values, with nothing in them that would help us convert them to assembly as is. That will be the job of the AST, which will give us the notions of scope and order of operations.

class Token
{
public:
    Token(TokenType type, const std::string &value) : type(type), value(value) {}

    TokenType getType() const { return type; }
    const std::string &getValue() const { return value; }

private:
    TokenType type;
    std::string value;
};

That is how our Token class looks, alongside the previously defined TokenType enum representing the types of tokens.
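
For instance, tokenizing the keyword hero would produce a Token we can query like this (a throwaway snippet for illustration only, not part of the compiler):

// Construct a token for the 'hero' keyword and read it back.
Token heroKeyword(TokenType::HERO, "hero");
// heroKeyword.getType()  -> TokenType::HERO
// heroKeyword.getValue() -> "hero"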

If you like this content, please consider clapping, following and subscribing to my newsletter.

Tokenization Process

With our tokens defined, we can go ahead and read our code as text, one character at a time, and convert it to instances of Token:

#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

class Tokenizer
{
public:
    std::vector<Token> tokens;

    Tokenizer(std::stringstream &code)
    {
        std::string token;
        std::string line;
        while (std::getline(code, line))
        {
            // Skip empty lines and comment lines prefixed with '#'.
            if (line.empty() || line.front() == '#')
                continue;

            for (const char character : line)
            {
                switch (character)
                {
                case ';':
                case '=':
                    // A separator both terminates the token accumulated so
                    // far and is itself a single-character token.
                    addToken(token);
                    addToken(std::string(1, character));
                    token.clear();
                    break;
                default:
                    if (std::isspace(static_cast<unsigned char>(character)))
                    {
                        addToken(token);
                        token.clear();
                    }
                    else if (std::isalnum(static_cast<unsigned char>(character)))
                    {
                        token += character;
                    }
                    break;
                }
            }

            // Flush anything left over so tokens never leak across lines.
            addToken(token);
            token.clear();
        }
    }

private:
    void addToken(const std::string &s)
    {
        // Keyword/symbol lookup table, built once on first use.
        static const std::unordered_map<std::string, TokenType> tokenMap = createTokenMap();

        if (s.empty())
            return;
        auto it = tokenMap.find(s);
        if (it != tokenMap.end())
        {
            tokens.emplace_back(it->second, s);
        }
        else if (is_number(s))
        {
            tokens.emplace_back(TokenType::NUMBER, s);
        }
        else
        {
            tokens.emplace_back(TokenType::IDENTIFIER, s);
        }
    }

    static std::unordered_map<std::string, TokenType> createTokenMap()
    {
        std::unordered_map<std::string, TokenType> map;
        map["hero"] = TokenType::HERO;
        map["bestow"] = TokenType::BESTOW;
        map["styx"] = TokenType::STYX;
        map[";"] = TokenType::SEMICOLON;
        map["="] = TokenType::EQUALS;
        return map;
    }

    static bool is_number(const std::string &s)
    {
        return std::all_of(s.begin(), s.end(),
                           [](unsigned char c) { return std::isdigit(c); });
    }
};

You might spot obvious ways to optimize and streamline this code, and we will do exactly that as we progress through the series; for now, I have kept it this way for clarity while reading rather than as an example of the best way of doing it.

The code of our program arrives as a std::stringstream &code and is processed line by line, character by character. Empty lines and lines prefixed with # are skipped, which gives us simple comments. As we read characters, we accumulate them in the token string until we reach a separator, which tells us to treat what we have gathered so far in token as a standalone token.

Take styx a; as an example: we read styx one character at a time until the space is reached, at which point addToken registers it as a STYX token and appends it to our array of tokens. Next, a is read until ; is reached, which triggers the addition of two more tokens, one for a and one for ;.
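
To see the whole thing run end to end, here is a minimal driver, assuming the TokenType, Token, and Tokenizer definitions above are in scope; tokenTypeName is a small helper added here only for printing and is not part of the compiler itself:

#include <iostream>
#include <sstream>

// Helper for printing; maps each TokenType to a readable name.
static const char *tokenTypeName(TokenType type)
{
    switch (type)
    {
    case TokenType::BESTOW:     return "BESTOW";
    case TokenType::STYX:       return "STYX";
    case TokenType::NUMBER:     return "NUMBER";
    case TokenType::SEMICOLON:  return "SEMICOLON";
    case TokenType::HERO:       return "HERO";
    case TokenType::IDENTIFIER: return "IDENTIFIER";
    case TokenType::EQUALS:     return "EQUALS";
    default:                    return "UNKNOWN";
    }
}

int main()
{
    std::stringstream code("hero a = 30;\nstyx a;");
    Tokenizer tokenizer(code);
    for (const Token &token : tokenizer.tokens)
        std::cout << tokenTypeName(token.getType())
                  << " '" << token.getValue() << "'\n";
    // Prints: HERO 'hero', IDENTIFIER 'a', EQUALS '=', NUMBER '30',
    // SEMICOLON ';', STYX 'styx', IDENTIFIER 'a', SEMICOLON ';'
}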

The following map defines the relationship between the string representation of the tokens and their enum:

static std::unordered_map<std::string, TokenType> createTokenMap()
{
    std::unordered_map<std::string, TokenType> map;
    map["hero"] = TokenType::HERO;
    map["bestow"] = TokenType::BESTOW;
    map["styx"] = TokenType::STYX;
    map[";"] = TokenType::SEMICOLON;
    map["="] = TokenType::EQUALS;
    return map;
}
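
As an aside, the same table could be written more compactly with an initializer list (available since C++11), should you prefer that style:

// Equivalent construction using an initializer list; the behavior is
// identical to createTokenMap() above.
static const std::unordered_map<std::string, TokenType> tokenMap = {
    {"hero", TokenType::HERO},
    {"bestow", TokenType::BESTOW},
    {"styx", TokenType::STYX},
    {";", TokenType::SEMICOLON},
    {"=", TokenType::EQUALS},
};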

Take your time reading through the code to get a feel for what it’s trying to accomplish, and feel free to comment with questions or suggestions!

The Full Code

The Tokenizer code as of now can be found in my GitHub repository dedicated to this project.

Next Steps

In the next article, we will convert the array of tokens we generate into an abstract syntax tree structure, where each node of the tree can have any number of children. The tree will allow us to more accurately represent operations such as assignments or expressions like addition. Moreover, it will be the step before generating assembly code and having an end-to-end running compiler that gives us binaries we can execute on our computer.
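
As a rough teaser of the shape such a node might take (a sketch only, not the final design, which we will work out in the next article):

// Hypothetical node shape: a token plus any number of children.
// (A std::vector of an incomplete type like this is sanctioned in C++17.)
struct ASTNode
{
    Token token;
    std::vector<ASTNode> children;
};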

Stay tuned and let me know what you think!
