May 12, 2023
The repo for this project is here: github.com/mxsjoberg/playcode
This is a post on implementing my own non-trivial programming language, PlayCode. It's experimental and mostly for fun, so don't take anything in this post too seriously (everything might change, frequently, and randomly).
The general idea behind PlayCode is to make a playful programming language and to test experimental programming language features. It's not really meant to be used for anything serious. So, the implementation needs to be easy to manage, easy to change, and easy to expand.
PlayCode is procedural because knowing-how is more important than knowing-that, i.e. street smarts. It will most likely be dynamically typed because no one has time for types anyway. There will be numbers, strings, lists, if-statements, for-loops, variables, and comments. There will probably also be a lot of things not yet decided.
Other features:
Tags @<tag> and go-to tag goto @<tag> (or something similar) to enable any statement to become a function or macro, so that basically every line can be called from anywhere in the program and return itself once.
Swaps swap <a> <b> to swap the values of two variables.
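In Python terms, the swap statement amounts to a parallel assignment. A rough sketch of the intended semantics (my reading of the feature, not a committed design):

```python
# Rough sketch of what `swap x y` is meant to do, in Python terms:
x, y = 2, 3
x, y = y, x  # parallel assignment swaps the values without a temporary
print(y == 2)  # True
```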
Below are examples of valid programs.
print 42
x = 42
-- comment
if x < 0:
print false
else
print "positive"
end
-- swap
x = 2
y = 3
swap x y
print y == 2 -> true
-- tags
x = 0
@inc x++
goto @inc
print x -> 2
There are two ways to include comments, -- and ->, which can also be used as helpers in source for more readable code.
Python is my go-to language for expressing ideas, and it now has the two features I want: enumerations since 3.4 and structural pattern matching (match statements, Python's take on switch) since 3.10.
I'll start with a subset of the language, in the spirit of Abdulaziz Ghuloum's An Incremental Approach to Compiler Construction. Not necessarily the smallest subset, or adding a single operator at a time, but manageable pieces. This means that there will be a working interpreter for each subset. I might look at code generation later.
In this post, the goal is to implement this subset (basically a calculator with print):
# program ::= PRINT expression
# expression ::= term ((PLUS | MINUS) term)*
# term ::= factor ((MULTIPLY | DIVIDE) factor)*
# factor ::= INTEGER | LPAR expression RPAR
PRINT = "PRINT"
INTEGER = "INTEGER"
PLUS = "+"
MINUS = "-"
MULTIPLY = "*"
DIVIDE = "/"
LPAR = "("
RPAR = ")"
RESERVED = [
    "PRINT"
]
I have defined two classes to deal with tokens (using classes makes this more readable later).
from enum import Enum

class TokenType(Enum):
    KEYWORD = 100
    INTEGER = 201
    PLUS = 301
    MINUS = 302
    MULTIPLY = 304
    DIVIDE = 305
    LPAR = 401
    RPAR = 402

class Token(object):
    def __init__(self, m_type, m_value):
        self.m_type = m_type
        self.m_value = m_value

    def __repr__(self):
        if self.m_value:
            return f"Token({self.m_type}, '{self.m_value}')"
        else:
            return f"Token({self.m_type})"
Here are examples of valid programs (note that -> is the start of a comment):
print 42
print 4 + 2 -> 6
print 1 + (2 * 4) - (6 / 2) -> 6
Building a lexer is straightforward. It's just the usual read-each-character routine, matching each character in the source against the defined tokens.
def tokenize(source):
    tokens = []
    current_line = 1
    current_char = ''
    current_char_index = 0
    while current_char_index < len(source):
        current_char = source[current_char_index]
        match current_char:
            # ...
    return tokens
I'm skipping whitespace, so the first cases match whitespace characters and advance the index accordingly.
case ' ' | '\t' | '\r':
    current_char_index += 1
case '\n':
    current_line += 1
    current_char_index += 1
The next cases match operators, nothing fancy here. Comments are first matched with -, and if the next character is another - or >, then the rest of the line is skipped until newline.
case '+':
    tokens.append(Token(TokenType.PLUS, PLUS))
    current_char_index += 1
case '-':
    # comments (guard the lookahead so a trailing '-' doesn't go out of range)
    next_char = source[current_char_index + 1] if current_char_index + 1 < len(source) else ''
    if next_char == '-' or next_char == '>':
        current_char_index += 1
        # skip until newline (or end of source)
        while current_char_index < len(source) and source[current_char_index] != '\n':
            current_char_index += 1
    # minus
    else:
        tokens.append(Token(TokenType.MINUS, MINUS))
        current_char_index += 1
case '*':
    tokens.append(Token(TokenType.MULTIPLY, MULTIPLY))
    current_char_index += 1
case '/':
    tokens.append(Token(TokenType.DIVIDE, DIVIDE))
    current_char_index += 1
The next case matches parentheses, again nothing fancy.
case '(':
    tokens.append(Token(TokenType.LPAR, LPAR))
    current_char_index += 1
case ')':
    tokens.append(Token(TokenType.RPAR, RPAR))
    current_char_index += 1
Then there is the default case, which matches numbers (currently integers) with isdigit and identifiers with isalpha.
case _:
    if current_char.isdigit():
        # ...
    elif current_char.isalpha():
        # ...
    else:
        raise Exception("Unknown character:", current_char)
Numbers can be multi-digit, so each subsequent character that isdigit is concatenated to a number string.
number = str(current_char)
current_char_index += 1
# check the bound before indexing, so a number at end of source doesn't crash
while current_char_index < len(source) and source[current_char_index].isdigit():
    number += str(source[current_char_index])
    current_char_index += 1
tokens.append(Token(TokenType.INTEGER, number))
Same goes for identifiers, which are also checked against the list of reserved keywords. There are no variables yet, so identifiers not in RESERVED are discarded.
identifier = str(current_char)
current_char_index += 1
# same bound-first check as for numbers
while current_char_index < len(source) and source[current_char_index].isalpha():
    identifier += str(source[current_char_index])
    current_char_index += 1
if identifier.upper() in RESERVED:
    tokens.append(Token(TokenType.KEYWORD, identifier.upper()))
The lexer is now complete.
tokens = tokenize("print 1 + (2 * 4) - (6 / 2) -> 6")
for token in tokens: print(token)
# Token(TokenType.KEYWORD, 'PRINT')
# Token(TokenType.INTEGER, '1')
# Token(TokenType.PLUS, '+')
# Token(TokenType.LPAR, '(')
# Token(TokenType.INTEGER, '2')
# Token(TokenType.MULTIPLY, '*')
# Token(TokenType.INTEGER, '4')
# Token(TokenType.RPAR, ')')
# Token(TokenType.MINUS, '-')
# Token(TokenType.LPAR, '(')
# Token(TokenType.INTEGER, '6')
# Token(TokenType.DIVIDE, '/')
# Token(TokenType.INTEGER, '2')
# Token(TokenType.RPAR, ')')
I'll implement a recursive descent parser. It starts with parse, which takes the flat list of tokens and returns a nested list of tokens (representing a tree structure for a valid program). It works by recursively calling the non-terminals, i.e. program, expression, term, and factor, until finding a terminal (currently only INTEGER).
def parse(tokens):
    tree = []
    current_token = None
    current_token_index = 0
    while current_token_index < len(tokens):
        program, current_token_index = parse_program(tokens, current_token_index)
        tree = program
    return tree
The program production rule is the starting point of the grammar, so parse_program is the first non-terminal called. It matches the current token against PRINT, which is currently the only acceptable statement in the language, and then calls parse_expression to parse the expression as per the grammar rule.
# program ::= PRINT expression
def parse_program(tokens, current_token_index):
    program = []
    current_token = tokens[current_token_index]
    current_token_index += 1
    # PRINT
    if current_token.m_value == PRINT:
        program.append(current_token)
        # expression
        expression, current_token_index = parse_expression(tokens, current_token_index)
        program.append(expression)
    else:
        # report the token that failed to match, not the one after it
        raise Exception("parse_program", "Unexpected token:", current_token)
    return program, current_token_index
The expression, term, and factor rules are primarily there to enforce operator precedence: factor binds tightest, followed by term, and then expression. parse_expression works by first calling parse_term, which is non-optional, then matching either of the tokens PLUS or MINUS followed by another parse_term, zero or more times.
# expression ::= term ((PLUS | MINUS) term)*
def parse_expression(tokens, current_token_index):
    expression = []
    # term
    term, current_token_index = parse_term(tokens, current_token_index)
    expression = term
    while current_token_index < len(tokens) and (tokens[current_token_index].m_type == TokenType.PLUS or tokens[current_token_index].m_type == TokenType.MINUS):
        current_token = tokens[current_token_index]
        current_token_index += 1
        match current_token.m_type:
            # PLUS
            case TokenType.PLUS:
                # term
                term, current_token_index = parse_term(tokens, current_token_index)
                expression = [current_token, [expression, term]]
            # MINUS
            case TokenType.MINUS:
                # term
                term, current_token_index = parse_term(tokens, current_token_index)
                expression = [current_token, [expression, term]]
            case _:
                raise Exception("parse_expression", "Unexpected token:", current_token)
    return expression, current_token_index
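One detail worth noting: because the loop folds each new term into the accumulated expression, the resulting tree is left-associative, which is what you want for - and /. A standalone sketch of the same folding pattern (plain values instead of Token objects, not the parser itself):

```python
# fold_left mimics how parse_expression grows its tree: each new
# operand is nested under the accumulated expression on the left
def fold_left(first, rest):
    tree = first
    for op, operand in rest:
        tree = [op, [tree, operand]]
    return tree

def evaluate(tree):
    if isinstance(tree, int):
        return tree
    op, (left, right) = tree
    return evaluate(left) - evaluate(right) if op == '-' else evaluate(left) + evaluate(right)

# 10 - 4 - 3 groups as (10 - 4) - 3 = 3, not 10 - (4 - 3) = 9
print(evaluate(fold_left(10, [('-', 4), ('-', 3)])))  # 3
```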
The term rule is like the expression rule but matches MULTIPLY or DIVIDE instead of PLUS or MINUS. It first calls parse_factor, which is non-optional, then the optional part zero or more times.
# term ::= factor ((MULTIPLY | DIVIDE) factor)*
def parse_term(tokens, current_token_index):
    term = []
    # factor
    factor, current_token_index = parse_factor(tokens, current_token_index)
    term = factor
    while current_token_index < len(tokens) and (tokens[current_token_index].m_type == TokenType.MULTIPLY or tokens[current_token_index].m_type == TokenType.DIVIDE):
        current_token = tokens[current_token_index]
        current_token_index += 1
        match current_token.m_type:
            # MULTIPLY
            case TokenType.MULTIPLY:
                # factor
                factor, current_token_index = parse_factor(tokens, current_token_index)
                term = [current_token, [term, factor]]
            # DIVIDE
            case TokenType.DIVIDE:
                # factor
                factor, current_token_index = parse_factor(tokens, current_token_index)
                term = [current_token, [term, factor]]
            case _:
                raise Exception("parse_term", "Unexpected token:", current_token)
    return term, current_token_index
The factor rule sits at the bottom of the grammar and matches either an INTEGER or LPAR, which indicates the start of a nested expression. If LPAR is matched, it calls parse_expression and the process starts all over again. It then expects a matching RPAR or raises an error (missing closing parenthesis).
# factor ::= INTEGER | LPAR expression RPAR
def parse_factor(tokens, current_token_index):
    factor = []
    current_token = tokens[current_token_index]
    current_token_index += 1
    match current_token.m_type:
        # INTEGER
        case TokenType.INTEGER:
            factor = current_token
        # LPAR
        case TokenType.LPAR:
            # expression
            expression, current_token_index = parse_expression(tokens, current_token_index)
            factor = expression
            # RPAR
            if current_token_index < len(tokens) and tokens[current_token_index].m_type == TokenType.RPAR:
                current_token_index += 1
            else:
                raise Exception("parse_factor", "Expecting ')'")
        case _:
            raise Exception("parse_factor", "Unexpected token:", current_token)
    return factor, current_token_index
The parser is now complete.
tokens = tokenize("print 1 + (2 * 4) - (6 / 2) -> 6")
tree = parse(tokens)
print_tree(tree)
# Token(TokenType.KEYWORD, 'PRINT')
# Token(TokenType.MINUS, '-')
# Token(TokenType.PLUS, '+')
# Token(TokenType.INTEGER, '1')
# Token(TokenType.MULTIPLY, '*')
# Token(TokenType.INTEGER, '2')
# Token(TokenType.INTEGER, '4')
# Token(TokenType.DIVIDE, '/')
# Token(TokenType.INTEGER, '6')
# Token(TokenType.INTEGER, '2')
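The print_tree helper used above isn't shown in the post; a minimal sketch that produces this kind of depth-first (pre-order) listing could look like this:

```python
# minimal pre-order walk: print each leaf, recurse into nested lists
def print_tree(tree):
    if isinstance(tree, list):
        for node in tree:
            print_tree(node)
    else:
        print(tree)

# works on any nested-list tree; plain strings stand in for tokens here
print_tree(['PRINT', ['+', ['1', '2']]])
# PRINT
# +
# 1
# 2
```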
The interpreter is the final step. It takes the tree, recursively evaluates it, and returns the result. Nothing more than that yet. It works by matching the left-most node, which should be a keyword, operator, or number, and evaluating the right (children).
In our example program, it matches the keyword (currently only PRINT) and calls interpret on the right node. If it is an operator, then it evaluates the expression in the right node and returns the result. A number simply returns itself.
def interpret(tree):
    result = ''
    node = tree
    if isinstance(node, list):
        left = node[0]
        right = node[1]
    else:
        left = node
        right = None
    match left.m_type:
        case TokenType.KEYWORD if left.m_value == PRINT:
            print(interpret(right))
        case TokenType.PLUS:
            result = int(interpret(right[0])) + int(interpret(right[1]))
        case TokenType.MINUS:
            result = int(interpret(right[0])) - int(interpret(right[1]))
        case TokenType.MULTIPLY:
            result = int(interpret(right[0])) * int(interpret(right[1]))
        case TokenType.DIVIDE:
            result = int(interpret(right[0])) / int(interpret(right[1]))
        case _:
            # NUMBER
            if left.m_value.isdigit():
                return left.m_value
            else:
                raise Exception("interpret", "Unexpected node:", node)
    return result
tokens = tokenize("print 1 + (2 * 4) - (6 / 2) -> 6")
tree = parse(tokens)
interpret(tree)
# 6
It prints 6, which is the correct behavior.
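One wrinkle to be aware of (my observation, not from the original post): Python's / returns a float, and every recursive result is wrapped in int(), so a division that feeds into another operator gets truncated toward zero:

```python
# 7 / 2 is 3.5 in Python, but int() truncates it to 3, so an expression
# like 1 + 7 / 2 would come out as 4 rather than 4.5 under this scheme
print(int(7 / 2))           # 3
print(int(1) + int(7 / 2))  # 4
```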