Building a MiniJava Compiler
What is MiniJava?
I took an introductory compilers course where the ongoing project was to write a MiniJava compiler in Java. MiniJava is a programming language that is a subset of Java. There are many versions of MiniJava, but the MiniJava version I implemented has the Backus-Naur Form (BNF) defined here.
Phases 1 & 2: Lexical & Syntax Analysis
The first assignment of the course was to write a lexer and parser. Due to the time constraints of the course, the assignment only required the implementation of a parser for a part of MiniJava, as defined by the BNF below. Later assignments used a complete parser for MiniJava generated from the BNF via JavaCC, a parser generator.
In order to implement the parser for the above BNF, a set of terminal symbols is first constructed. The set is used to determine whether tokens from the input string belong to the language. If a token from the input string is not in the language, an error is raised. If all tokens from the input string belong in the language, an array of the input string tokens is passed to the parser.
The parser I implemented is a recursive descent parser, which uses mutual recursion of nonterminal functions to determine whether a set of terminal symbols belongs to the language. Each nonterminal symbol (namely S
, L
, and E
) has a corresponding function that calls functions for other nonterminal symbols and terminal symbols. The eat(String str)
function is called for terminal symbols, where the str
argument is the expected terminal symbol to be consumed. The definition of the eat(String string)
function is as follows.
The crux of the eat
function is that it compares whether the str
argument matches the current token
to be checked. If there’s a match, token
is updated to the next token in the list of input tokens for the next call of eat
; otherwise, an exception is thrown. Now we take a look at the definition of S()
.
The S()
function is a large switch statement for the token currently being processed. Depending on the token passed in, a different sequence of eat
and nonterminal functions is called. Other nonterminal functions, that being L()
and E()
, have the same switch statement structure.
Consider a simple input string of System.out.println(!true)
. Since S
is the start symbol, S()
is called first. S()
calls eat("System.out.println")
, eat("(")
and E()
. E()
then calls eat("!")
and E()
. The second call of E()
reaches the terminal symbol true
, for which eat("true")
is called. Since the second call of E()
terminates, the first call of E()
terminates. The starting S()
function finally calls eat(")")
. After eat(")")
terminates, the starting S()
call terminates, so the input is parsed successfully.
Phase 2: Semantic Analysis
The second assignment was to build the MiniJava type checker. Unlike Java, MiniJava only has integer, integer array, boolean, identifier, and void primitive types. The identifier type consists of names given to classes, functions, and variables. Semantic analysis required two passes of the code: the first pass constructs a symbol table, and the second pass checks whether variables in the symbol table are consistent with type rules.
First Pass: Symbol Table Construction
Consider the following test program.
On the first pass, my implementation constructs a nested hash table data structure as illustrated below. The top-level hash table maps class name to its field and function names. Field names are directly mapped to their type. Function names are mapped to local variables (which include parameters) and return variable, both of which are mapped to a type.
Supertype is a relation on the set of class names. Let the symbol denote the supertype relation. Let and be classes where extends , so . In this case, we say that “t2 is a supertype of t1.” is also a supertype of itself since .
So in the data structure below, we see that the cObj
has a supertype of CClass
. Variables with primitive types are not assigned a supertype.
Second Pass: Type Checking
To traverse the input program, the course provided an abstract syntax tree with a corresponding visitor pattern. Consider the PlusExpression
node, which has children f0
and f2
defined as the left and right operand respectively.
As illustrated on lines 2-3 of the visit
function, the PlusExpression
node’s children f0
and f1
call the accept
method. The this
keyword refers to the visitor object itself. Since f0
and f1
know what their own types are, each accept
method on lines 2-3 are able to use the passed in visitor object to call the visit
function that corresponds with their type. This is known as double dispatch, a mechanism where an object (such as the PlusExpression
node) uses the accept
method to call the corresponding visit
method defined in the program.
For a more visual understanding, consider the diagram above. An example expression that corresponds with the diagram is
(1 + 2) + 3
, where aPlusExpression
node has a child(1 + 2)
of typePlusExpression
and a child3
of typeInteger
. The childPlusExpression
has children1
and2
, both of typeInteger
. Now we investigate how double dispatch is used to identify the entire example expression to typeInteger
.First, the child
PlusExpression
calls itsInteger
children’saccept
functions, both of which returnTypeObject(TypeConstants.Integer)
. So inside the childPlusExpression
’s visit function,lftOp
andrhtOp
are then initialized to the returnedTypeObject
s. Now, we know that the childPlusExpression
also returnsTypeObject(TypeConstants.Integer)
since both its children returnedTypeObject
s withTypeConstants.Integer
(as shown in lines 4-6).The new
TypeObject
is then assigned to thelftOp
variable of the rootPlusExpression
node. Since the rootPlusExpression
node’s other child also returns aTypeObject
withTypeConstants.Integer
, thenTypeObject(TypeConstants.Integer)
is ultimately returned and so we know that the entire example expression is of typeInteger
.
Other visitor patterns behave in a similar fashion as the one for PlusExpression
. Now we investigate the visitor pattern for MessageSend
.
Phase 3: Intermediate Code Generation
In the intermediate code generation phase, MiniJava is translated into the intermediate representation (IR) Sparrow and and SparrowV, which were formally defined by my professor.
Sparrow Generation
Notice, however, that the cField
in CClass
is not set as a field of BClass
, even though BClass
extends from CClass
. Also notice that variables and return types have a property supertypeName
, which are currently all initialized to null
. This is where the second pass comes into play. In addition to updating classes with inherited fields and methods, the second pass also initializes the supertypeName
property. If the type of the variable is a primitive, the supertypeName
is left as null
; otherwise, supertypeName
is set to the name of the variable’s class. For instance after the second pass, the data structure is updated to the following.
{AClass=
{fields: {},
methods: {main={localVars:{args={type:ARRAY, supertypeName:null}},
returnVar:{type:VOID, supertypeName:null}}},
parentClassName: null}
, BClass=
{fields: {},
methods: {bFunc={localVars:{cObj={type:IDENTIFIER, supertypeName:CClass}},
returnVar:{type:INTEGER, supertypeName:null}}},
parentClassName: CClass}
, CClass=
{fields: {cField={type:BOOLEAN, supertypeName:null}},
methods: {cFunc={localVars:{cFuncRet={type:INTEGER, supertypeName:null},
cParam={type:BOOLEAN, supertypeName:null}},
returnVar:{type:INTEGER, supertypeName:null}}},
parentClassName: null}
}