PlayCode: Implementing an Interpreter in Python

May 12, 2023

The repo for this project is here: github.com/mxsjoberg/playcode

This is a post on implementing my own non-trivial programming language, PlayCode. It's experimental and mostly for fun, so don't take anything in this post too seriously (everything might change, frequently, and randomly).

What is PlayCode? #

The general idea behind PlayCode is to make a playful programming language and to test experimental programming language features. It's not really meant to be used for anything serious, so the implementation needs to be easy to manage, easy to change, and easy to expand.

PlayCode is procedural because knowing-how is more important than knowing-that, i.e. street smarts. It will most likely be dynamically typed because no one has time for types anyway. There will be numbers, strings, lists, if-statements, for-loops, variables, and comments. There will probably also be a lot of things not yet decided.

Below are examples of valid programs.

print 42
x = 42
-- comment
if x < 0:
    print false
else
    print "positive"
end
-- swap
x = 2
y = 3
swap x y
print y == 2 -> true
-- tags
x = 0
@inc x++
goto @inc
print x -> 2

There are two ways to write comments, -- and ->, and the arrow form doubles as an inline annotation (e.g. noting the expected value of an expression) for more readable source code.

Implementation details #

Python is my go-to language for expressing ideas, and recent versions have two features that help here: enumerations (since 3.4) and structural pattern matching, i.e. match statements (since 3.10).

I'll start with a subset of the language, in the spirit of Abdulaziz Ghuloum's An Incremental Approach to Compiler Construction. Not necessarily the smallest possible subset, or adding a single operator at a time, but manageable pieces. This means that there will be a working interpreter for each subset. I might look at code generation later.

Operator precedence #

In this post, the goal is to implement this subset (basically a calculator with print):

# program       ::= PRINT expression
# expression    ::= term ((PLUS | MINUS) term)*
# term          ::= factor ((MULTIPLY | DIVIDE) factor)*
# factor        ::= INTEGER | LPAR expression RPAR

PRINT       = "PRINT"
INTEGER     = "INTEGER"
PLUS        = "+"
MINUS       = "-"
MULTIPLY    = "*"
DIVIDE      = "/"
LPAR        = "("
RPAR        = ")"

RESERVED = [
    "PRINT"
]

I have defined two classes to deal with tokens (using classes makes this more readable later).

from enum import Enum

class TokenType(Enum):
    KEYWORD     = 100
    INTEGER     = 201
    PLUS        = 301
    MINUS       = 302
    MULTIPLY    = 304
    DIVIDE      = 305
    LPAR        = 401
    RPAR        = 402

class Token(object):
    def __init__(self, m_type, m_value):
        self.m_type = m_type
        self.m_value = m_value

    def __repr__(self):
        if self.m_value:
            return f"Token({self.m_type}, '{self.m_value}')"
        else:
            return f"Token({self.m_type})"

Here are examples of valid programs (note that -> starts a comment):

print 42
print 4 + 2 -> 6
print 1 + (2 * 4) - (6 / 2) -> 6

Lexer #

Building a lexer is straightforward: it's the usual read-each-char routine over the source, matching each character against the defined tokens.

def tokenize(source):
    tokens = []
    current_line = 1
    current_char = ''
    current_char_index = 0

    while current_char_index < len(source):
        current_char = source[current_char_index]
        match current_char:
            # ...

    return tokens

I'm skipping whitespace, so the first cases match whitespace characters and advance the lexer accordingly.

case ' ' | '\t' | '\r':
    current_char_index += 1
case '\n':
    current_line += 1
    current_char_index += 1

The next cases match operators, nothing fancy here. Comments start with -: if the next character is another - or a >, the rest of the line is skipped.

case '+':
    tokens.append(Token(TokenType.PLUS, PLUS))
    current_char_index += 1
case '-':
    # comments
    next_char = source[current_char_index + 1] if current_char_index + 1 < len(source) else ''
    if next_char == '-' or next_char == '>':
        current_char_index += 1
        # skip until newline (guarding against end of source)
        while current_char_index < len(source) and source[current_char_index] != '\n':
            current_char_index += 1
    # minus
    else:
        tokens.append(Token(TokenType.MINUS, MINUS))
        current_char_index += 1
case '*':
    tokens.append(Token(TokenType.MULTIPLY, MULTIPLY))
    current_char_index += 1
case '/':
    tokens.append(Token(TokenType.DIVIDE, DIVIDE))
    current_char_index += 1
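The comment rule can be exercised on its own. Here's a stand-alone sketch of the same peek-and-skip logic (not the tokenize function itself, just the pattern it uses):

```python
# Stand-alone sketch of the comment rule: on '-', peek at the next character;
# '--' or '->' skips to the end of the line, otherwise the '-' is kept as minus.
def strip_comments(source):
    out, i = "", 0
    while i < len(source):
        if source[i] == '-' and i + 1 < len(source) and source[i + 1] in '->':
            while i < len(source) and source[i] != '\n':
                i += 1                     # skip until newline (or end of source)
        else:
            out += source[i]
            i += 1
    return out

print(strip_comments("x = 4 - 2 -- a comment\n"))   # x = 4 - 2
```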

The next case matches parentheses, again nothing fancy.

case '(':
    tokens.append(Token(TokenType.LPAR, LPAR))
    current_char_index += 1
case ')':
    tokens.append(Token(TokenType.RPAR, RPAR))
    current_char_index += 1

Then there is the default case, which matches numbers (currently integers) with isdigit and identifiers with isalpha.

case _:
    if current_char.isdigit():
        # ...
    elif current_char.isalpha():
        # ...
    else:
        raise Exception("Unknown character:", current_char)

Numbers can be multi-digit, so each subsequent character that isdigit is concatenated to a number-string.

number = str(current_char)
current_char_index += 1
while current_char_index < len(source) and source[current_char_index].isdigit():
    number += str(source[current_char_index])
    current_char_index += 1
tokens.append(Token(TokenType.INTEGER, number))
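The same scan-while pattern in isolation, with the bounds check placed before the index access so the loop is safe at end of source:

```python
# Stand-alone sketch of multi-digit scanning: consume characters while the
# next one is a digit, checking bounds before indexing into the source.
def scan_number(source, i):
    number = source[i]
    i += 1
    while i < len(source) and source[i].isdigit():
        number += source[i]
        i += 1
    return number, i

print(scan_number("123+4", 0))   # ('123', 3)
```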

Same goes for identifiers, which are also checked against a list of reserved keywords. There are no variables yet, so identifiers not in RESERVED are discarded.

identifier = str(current_char)
current_char_index += 1
while current_char_index < len(source) and source[current_char_index].isalpha():
    identifier += str(source[current_char_index])
    current_char_index += 1
if identifier.upper() in RESERVED:
    tokens.append(Token(TokenType.KEYWORD, identifier.upper()))

The lexer is now complete.

tokens = tokenize("print 1 + (2 * 4) - (6 / 2) -> 6")
for token in tokens: print(token)
# Token(TokenType.KEYWORD, 'PRINT')
# Token(TokenType.INTEGER, '1')
# Token(TokenType.PLUS, '+')
# Token(TokenType.LPAR, '(')
# Token(TokenType.INTEGER, '2')
# Token(TokenType.MULTIPLY, '*')
# Token(TokenType.INTEGER, '4')
# Token(TokenType.RPAR, ')')
# Token(TokenType.MINUS, '-')
# Token(TokenType.LPAR, '(')
# Token(TokenType.INTEGER, '6')
# Token(TokenType.DIVIDE, '/')
# Token(TokenType.INTEGER, '2')
# Token(TokenType.RPAR, ')')

Parser #

I'll implement a recursive descent parser. It starts with parse, which takes the flat list of tokens and returns a nested list of tokens (representing a tree structure for a valid program). It works by recursively calling the non-terminals, i.e. program, expression, term, and factor, until reaching a terminal (currently only INTEGER).

def parse(tokens):
    tree = []
    current_token = None
    current_token_index = 0

    while current_token_index < len(tokens):
        program, current_token_index = parse_program(tokens, current_token_index)
        tree = program

    return tree

Program #

The program production rule is the starting point in the grammar, so parse_program is the first non-terminal called. It matches the current token with PRINT, which is currently the only acceptable statement in the language. It then calls parse_expression to parse the expression as per the grammar rule.

# program ::= PRINT expression
def parse_program(tokens, current_token_index):
    program = []
    current_token = tokens[current_token_index]
    current_token_index += 1

    # PRINT
    if current_token.m_value == PRINT:
        program.append(current_token)
        # expression
        expression, current_token_index = parse_expression(tokens, current_token_index)
        program.append(expression)
    else:
        raise Exception("parse_program", "Unexpected token:", tokens[current_token_index])

    return program, current_token_index

Expression #

The expression, term, and factor rules are primarily there to enforce operator precedence: expression is the lowest-precedence level, followed by term, and then factor, which binds the tightest. parse_expression works by first calling parse_term, which is non-optional, then matching either PLUS or MINUS followed by another parse_term, zero or more times.

# expression ::= term ((PLUS | MINUS) term)*
def parse_expression(tokens, current_token_index):
    expression = []

    # term
    term, current_token_index = parse_term(tokens, current_token_index)
    expression = term

    while current_token_index < len(tokens) and (tokens[current_token_index].m_type == TokenType.PLUS or tokens[current_token_index].m_type == TokenType.MINUS):
        current_token = tokens[current_token_index]
        current_token_index += 1
        match current_token.m_type:
            # PLUS
            case TokenType.PLUS:
                # term
                term, current_token_index = parse_term(tokens, current_token_index)
                expression = [current_token, [expression, term]]
            # MINUS
            case TokenType.MINUS:
                # term
                term, current_token_index = parse_term(tokens, current_token_index)
                expression = [current_token, [expression, term]]
            case _:
                raise Exception("parse_expression", "Unexpected token:", tokens[current_token_index])

    return expression, current_token_index
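Note that the while loop makes these operators left-associative: each new operator token wraps the tree built so far as its left child. A tiny stand-alone sketch of the same folding pattern, using plain strings instead of Token objects:

```python
# Stand-alone sketch of the folding in parse_expression: each operator wraps
# the tree built so far as its left child, giving left associativity.
def fold(first, rest):
    tree = first
    for op, operand in rest:
        tree = [op, [tree, operand]]
    return tree

print(fold("1", [("-", "2"), ("-", "3")]))
# ['-', [['-', ['1', '2']], '3']], i.e. (1 - 2) - 3
```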

Term #

The term rule is like the expression rule but matches MULTIPLY or DIVIDE instead of PLUS or MINUS. It first calls parse_factor, which is non-optional, then the optional part zero or more times.

# term ::= factor ((MULTIPLY | DIVIDE) factor)*
def parse_term(tokens, current_token_index):
    term = []

    # factor
    factor, current_token_index = parse_factor(tokens, current_token_index)
    term = factor

    while current_token_index < len(tokens) and (tokens[current_token_index].m_type == TokenType.MULTIPLY or tokens[current_token_index].m_type == TokenType.DIVIDE):
        current_token = tokens[current_token_index]
        current_token_index += 1
        match current_token.m_type:
            # MULTIPLY
            case TokenType.MULTIPLY:
                # factor
                factor, current_token_index = parse_factor(tokens, current_token_index)
                term = [current_token, [term, factor]]
            # DIVIDE
            case TokenType.DIVIDE:
                # factor
                factor, current_token_index = parse_factor(tokens, current_token_index)
                term = [current_token, [term, factor]]
            case _:
                raise Exception("parse_term", "Unexpected token:", tokens[current_token_index])

    return term, current_token_index

Factor #

The factor rule has the highest precedence and matches either an INTEGER or LPAR, which indicates the start of a parenthesized expression. If LPAR is matched, it calls parse_expression and the process starts over. It finally matches RPAR or raises an error (missing closing parenthesis).

# factor ::= INTEGER | LPAR expression RPAR
def parse_factor(tokens, current_token_index):
    factor = []
    current_token = tokens[current_token_index]
    current_token_index += 1

    match current_token.m_type:
        # INTEGER
        case TokenType.INTEGER:
            factor = current_token
        # LPAR
        case TokenType.LPAR:
            # expression
            expression, current_token_index = parse_expression(tokens, current_token_index)
            factor = expression
            # RPAR
            if current_token_index < len(tokens) and tokens[current_token_index].m_type == TokenType.RPAR:
                current_token_index += 1
            else:
                raise Exception("parse_factor", "Expecting ')':")
        case _:
            raise Exception("parse_factor", "Unexpected token:", tokens[current_token_index])

    return factor, current_token_index

The parser is now complete.
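The print_tree helper used below isn't shown in this post. Here's a minimal sketch of one way to write it (an assumption on my part; the repo's version may differ), including tiny stand-ins for Token and TokenType so it runs on its own:

```python
from enum import Enum

# Minimal stand-ins so this sketch is self-contained; in the interpreter,
# the Token and TokenType classes defined earlier are used instead.
class TokenType(Enum):
    KEYWORD = 100
    INTEGER = 201
    PLUS = 301

class Token:
    def __init__(self, m_type, m_value):
        self.m_type, self.m_value = m_type, m_value
    def __repr__(self):
        return f"Token({self.m_type}, '{self.m_value}')"

# Hypothetical print_tree: a program node is [keyword, child], an operator
# node is [operator, [left, right]], and a leaf is a plain Token.
def print_tree(node, depth=0):
    if isinstance(node, Token):
        print("    " * depth + repr(node))          # leaf token
    elif node[0].m_type == TokenType.KEYWORD:
        print("    " * depth + repr(node[0]))       # program: [keyword, child]
        print_tree(node[1], depth + 1)
    else:
        print("    " * depth + repr(node[0]))       # operator: [op, [left, right]]
        print_tree(node[1][0], depth + 1)
        print_tree(node[1][1], depth + 1)

# print 1 + 2
tree = [Token(TokenType.KEYWORD, "PRINT"),
        [Token(TokenType.PLUS, "+"),
         [Token(TokenType.INTEGER, "1"), Token(TokenType.INTEGER, "2")]]]
print_tree(tree)
```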

tokens = tokenize("print 1 + (2 * 4) - (6 / 2) -> 6")
tree = parse(tokens)
print_tree(tree)
# Token(TokenType.KEYWORD, 'PRINT')
#     Token(TokenType.MINUS, '-')
#             Token(TokenType.PLUS, '+')
#                 Token(TokenType.INTEGER, '1')
#                     Token(TokenType.MULTIPLY, '*')
#                         Token(TokenType.INTEGER, '2')
#                         Token(TokenType.INTEGER, '4')
#             Token(TokenType.DIVIDE, '/')
#                 Token(TokenType.INTEGER, '6')
#                 Token(TokenType.INTEGER, '2')

Interpreter #

The interpreter is the final step. It takes the tree, recursively evaluates it, and returns the result. Nothing more than that yet. It works by matching the left-most node, which should be a keyword, operator, or number, and evaluating the right (children).

In our example program, it matches the keyword (currently only PRINT) and calls interpret on the right node. If it is an operator, then it evaluates the expression on the right node and returns the result. A number simply returns the number.

def interpret(tree):
    result = ''
    node = tree
    if isinstance(node, list):
        left = node[0]
        right = node[1]
    else:
        left = node
        right = None

    match left.m_type:
        case TokenType.KEYWORD if left.m_value == PRINT:
            print(interpret(right))
        case TokenType.PLUS:
            result = int(interpret(right[0])) + int(interpret(right[1]))
        case TokenType.MINUS:
            result = int(interpret(right[0])) - int(interpret(right[1]))
        case TokenType.MULTIPLY:
            result = int(interpret(right[0])) * int(interpret(right[1]))
        case TokenType.DIVIDE:
            result = int(interpret(right[0])) / int(interpret(right[1]))
        case _:
            # NUMBER
            if left.m_value.isdigit():
                return left.m_value
            else:
                raise Exception("interpret", "Unexpected node:", node)

    return result

tokens = tokenize("print 1 + (2 * 4) - (6 / 2) -> 6")
tree = parse(tokens)

interpret(tree)
# 6

It prints 6, which is the correct behavior.
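One detail worth noting: DIVIDE uses Python's true division, but every recursive result is wrapped in int(), so intermediate floats get truncated in larger expressions. The same arithmetic in plain Python:

```python
# int() truncates the intermediate float from true division before the parent
# operator uses it, so with the interpreter as written, "print 1 + (7 / 2)"
# would evaluate to 4, not 4.5.
print(int(7 / 2) + 1)   # 4
```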