Building a Mythological Programming Language Compiler For an x86 CPU (NASM) — Part III — Parser
Creating an Abstract Syntax Tree (AST) From Tokens of a Hyphothethical Yet Working Programming Language
--
In the previous two parts, we have made baby steps towards building up the capability of parsing code (text) into tokens (a flat array of known entities).
It is high time we converted this flat array into a representation that is closer to the intent of our program. Our program is realistically a tree (see the next diagram, and explanation in my first article), with branching operations, rather than just a block of a text or a flat array of tokens. But these intermediary steps were required for an optimal process of reaching this stage.
If you like this content, please consider clapping, following and subscribing to my newsletter.
Abstract Syntax Tree (AST)
The AST will be a more picture-perfect representation of the intent of our code. This is because it will encapsulate scope of operations and if drawn out visually it will be recognisable to a viewer that it is indeed the program we want to run, rather than just a string of characters or an array of words/tokens.
class ASTNode
{
Token token;
std::vector<ASTNode> children;
public:
explicit ASTNode(const Token &token) : token(token) {}
void addChild(const ASTNode &node)
{
children.push_back(node);
}
const Token &getToken() const { return token; }
const std::vector<ASTNode> &getChildren() const { return children; }
};
We can see the ASTNode
will represent each entity in our tree. Each can hold any number of other ASTNodes
, where for example hero a = 40;
will be an ASTNode
for the hero
token, which will then contain the a
, =
, 40
and ;
ASTNodes
and corresponding tokens as children.
In practice you could even just add children to the Token
class we have created previously, and nest a tree of tokens with the same effect. For the purpose of clarity and encapsulation I have decided on separate data-types for each stage of the compiler to allow for future complexity where each step might become much more differentiated from other steps.
Parsing Tokens Into an AST
Here is the code necessary to to convert a list of tokens into an Abstract Syntax Tree in order to create a more contextual representation of our code which we can subsequently leverage to understand what actually executing and converting it to assembly might entail.
class Parser
{
public:
explicit Parser(const std::vector<Token> &tokens) : tokens(tokens), currentIndex(0) {}
ASTNode parse()
{
if (tokens.empty())
{
throw std::runtime_error("No tokens to parse");
}
ASTNode rootNode(Token(TokenType::NUMBER, "root")); // Root node of AST
while (currentIndex < tokens.size())
{
Token currentToken = getNextToken();
if (currentToken.getType() == TokenType::STYX || currentToken.getType() == TokenType::BESTOW)
{
ASTNode statementNode = parseStatement(currentToken);
rootNode.addChild(statementNode);
}
else if (currentToken.getType() == TokenType::HERO)
{
ASTNode statementNode = parseStatement(currentToken);
rootNode.addChild(statementNode);
}
else
{
throw std::runtime_error("Syntax Error: Expected 'styx' or 'bestow'");
}
}
return rootNode;
}
const std::vector<const std::string> getVariables() const { return variables; }
private:
const std::vector<Token> &tokens;
std::vector<const std::string> variables;
size_t currentIndex;
Token getNextToken()
{
if (currentIndex >= tokens.size())
{
throw std::runtime_error("Syntax Error: Unexpected end of input");
}
return tokens[currentIndex++];
}
ASTNode parseStatement(const Token &startToken)
{
ASTNode statementNode(startToken); // This node will represent the statement
Token currentToken = getNextToken();
switch (startToken.getType())
{
case TokenType::HERO:
{
// Expecting a variable name after "hero"
if (currentToken.getType() != TokenType::IDENTIFIER)
{
throw std::runtime_error("Syntax Error: Expected variable name after 'hero'");
}
// Add the variable name as a child of the statement
statementNode.addChild(ASTNode(currentToken));
variables.push_back(currentToken.getValue());
// Now expecting an equals sign
currentToken = getNextToken();
if (currentToken.getType() != TokenType::EQUALS)
{
throw std::runtime_error("Syntax Error: Expected '=' after variable name");
}
// Expecting a number after "="
currentToken = getNextToken();
if (currentToken.getType() != TokenType::NUMBER)
{
throw std::runtime_error("Syntax Error: Expected a number after '='");
}
// Add the number as a child of the statement
statementNode.addChild(ASTNode(currentToken));
// Finally, expect a semicolon
currentToken = getNextToken();
if (currentToken.getType() != TokenType::SEMICOLON)
{
throw std::runtime_error("Syntax Error: Expected ';' after number");
}
break;
}
case TokenType::BESTOW:
{
// Rest of the BESTOW handling code...
// Expecting a variable or a number after "bestow"
if (currentToken.getType() != TokenType::IDENTIFIER && currentToken.getType() != TokenType::NUMBER)
{
throw std::runtime_error("Syntax Error: Expected variable or number after 'bestow'");
}
statementNode.addChild(ASTNode(currentToken));
// Expecting a semicolon after variable/number
currentToken = getNextToken();
if (currentToken.getType() != TokenType::SEMICOLON)
{
throw std::runtime_error("Syntax Error: Expected ';' after 'bestow' statement");
}
break;
}
case TokenType::STYX:
{
// Rest of the STYX handling code...
// Expecting a variable or a number after "styx"
if (currentToken.getType() != TokenType::IDENTIFIER && currentToken.getType() != TokenType::NUMBER)
{
throw std::runtime_error("Syntax Error: Expected variable or number after 'styx'");
}
statementNode.addChild(ASTNode(currentToken));
// Expecting a semicolon after variable/number
currentToken = getNextToken();
if (currentToken.getType() != TokenType::SEMICOLON)
{
throw std::runtime_error("Syntax Error: Expected ';' after 'styx' statement");
}
break;
}
// Handle other cases if necessary
default:
throw std::runtime_error("Syntax Error: Unknown statement type");
}
return statementNode;
}
};
We see our Hero token which expects an identifier afterwards, alongside an equal sign and a number to be assigned to this variable, finishing it all up with a semicolon.
Similiarly, the other instructions are represented so that when they are detected, we attempt to parse the subsequent tokens in a way that makes sense or error if we do not encounter a compatible syntax.
In the first article of this series you can find examples of what we are aiming for the baseline Hades language to support.
Next Steps
We will convert the AST into an Intermediate Representation that will be the last layer of abstraction before converting the IR to NASM Assembly code. The reason for having another layer is that our IR will be agnostic of the operating system the program will eventually be executed on, and in the future one can build multiple ways of converting the IR into assembly, for NASM or other systems. Even if NASM has the ability to generate binaries for many operating systems, it is not going to accept one assembly conversion and execute it on all operating systems. Instead, we will still have to generate the correct code.
I will endavour to provide examples for multiple operating systems but we will begin with MacOS.
Dear Reader,
Please clap, follow and join my newsletter if you like this content and you would like to support 🙇🏻♂️. You can connect with me on X (Twitter), Linkedin, Instagram & Github!
If you liked this article, you may also enjoy:
- Dante’s Code Hell Inferno: the Nine Layers
- The Code Purgatorio: Ascension to Clean Code
- How To Not Be a Run-of-the-Mill Software Engineer
- Plato’s Republic Of Software Engineering: A Philosophical Perspective
- Code 💻: The Modern Driving Skill — Why It’s as Crucial Today as Driving 🚙Was in the Past
- Diversify your Experience as you would your Investment Portfolio!