Building a MiniJava Compiler

What is MiniJava?

I took an introductory compilers course where the ongoing project was to write a MiniJava compiler in Java. MiniJava is a programming language that is a subset of Java. There are many versions of MiniJava, but the MiniJava version I implemented has the Backus-Naur Form (BNF) defined here.

Phases 1 & 2: Lexical & Syntax Analysis

The first assignment of the course was to write a lexer and parser. Due to the time constraints of the course, the assignment only required the implementation of a parser for a part of MiniJava, as defined by the BNF below. Later assignments used a complete parser for MiniJava generated from the BNF via JavaCC, a parser generator.

S ::= { L }
      | System.out.println ( E ) ;
      | if ( E ) S else S
      | while ( E ) S
L ::= S L
      | ϵ
E ::= true
      | false
      | ! E

In order to implement the parser for the above BNF, a set of terminal symbols is first constructed. The set is used to determine whether tokens from the input string belong to the language. If a token from the input string is not in the language, an error is raised. If all tokens from the input string belong in the language, an array of the input string tokens is passed to the parser.

The parser I implemented is a recursive descent parser, which uses mutual recursion of nonterminal functions to determine whether a set of terminal symbols belongs to the language. Each nonterminal symbol (namely S, L, and E) has a corresponding function that calls functions for other nonterminal symbols and terminal symbols. The eat(String str) function is called for terminal symbols, where the str argument is the expected terminal symbol to be consumed. The definition of the eat(String string) function is as follows.

private static ArrayList<String> tokens;
private static String token;
private static int tokenIdx = 0;
 
static void eat(String str) throws IOException {
  if (token.equals(str)) {
    ++tokenIdx;
    if (tokenIdx < tokens.size())
      token = tokens.get(tokenIdx);
  } else {
    throw new IOException("Tokens do not match");
  }
}

The crux of the eat function is that it compares whether the str argument matches the current token to be checked. If there’s a match, token is updated to the next token in the list of input tokens for the next call of eat; otherwise, an exception is thrown. Now we take a look at the definition of S().

static void S() throws IOException {
  switch (token) {
    case "{": {
      eat("{");
      L();
      eat("}");
      break;
    }
    case "System.out.println": {
      eat("System.out.println");
      eat("(");
      E();
      eat(")");
      eat(";");
      break;
    }
    case "if": {
      eat("if");
      eat("(");
      E();
      eat(")");
      S();
      eat("else");
      S();
      break;
    }
    case "while": {
      eat("while");
      eat("(");
      E();
      eat(")");
      S();
      break;
    }
    default:
      throw new IOException("Token does not match a terminal symbol");
  }
}

The S() function is a large switch statement for the token currently being processed. Depending on the token passed in, a different sequence of eat and nonterminal functions is called. Other nonterminal functions, that being L() and E(), have the same switch statement structure.

Consider a simple input string of System.out.println(!true). Since S is the start symbol, S() is called first. S() calls eat("System.out.println"), eat("(") and E(). E() then calls eat("!") and E(). The second call of E() reaches the terminal symbol true, for which eat("true") is called. Since the second call of E() terminates, the first call of E() terminates. The starting S() function finally calls eat(")"). After eat(")") terminates, the starting S() call terminates, so the input is parsed successfully.

Phase 2: Semantic Analysis

The second assignment was to build the MiniJava type checker. Unlike Java, MiniJava only has integer, integer array, boolean, identifier, and void primitive types. The identifier type consists of names given to classes, functions, and variables. Semantic analysis required two passes of the code: the first pass constructs a symbol table, and the second pass checks whether variables in the symbol table are consistent with type rules.

First Pass: Symbol Table Construction

Consider the following test program.

class AClass {
  public static void main(String[] args) {
    System.out.println(new BClass().bFunc());
  }
}
 
class BClass extends CClass {
  public int bFunc() {
    CClass cObj;
    cObj = new CClass();
    return cObj.cFunc(false);
  }
}
 
class CClass {
  boolean cField;
  public int cFunc(boolean cParam) {
    int cFuncRet;
    cField = cParam;
    if (cField)
      cFuncRet = 1;
    else
      cFuncRet = 0;
    return cFuncRet;
  }
}

On the first pass, my implementation constructs a nested hash table data structure as illustrated below. The top-level hash table maps class name to its field and function names. Field names are directly mapped to their type. Function names are mapped to local variables (which include parameters) and return variable, both of which are mapped to a type.

Supertype is a relation on the set of class names. Let the \le symbol denote the supertype relation. Let t1t_1 and t2t_2 be classes where t1t_1 extends t2t_2, so t1t2t_1 \le t_2. In this case, we say that “t2 is a supertype of t1.” t2t_2 is also a supertype of itself since t2=t2t2t2t_2 = t_2 \rightarrow t_2 \le t_2.

So in the data structure below, we see that the cObj has a supertype of CClass. Variables with primitive types are not assigned a supertype.

{AClass=
  {fields: {},
   methods: {main={localVars:{args={type:ARRAY, supertypeName:null}},
                   returnVar:{type:VOID, supertypeName:null}}},
   parentClassName: null}
, BClass=
  {fields: {},
   methods: {bFunc={localVars:{cObj={type:IDENTIFIER, supertypeName:CClass}},
                    returnVar:{type:INTEGER, supertypeName:null}}},
   parentClassName: CClass}
, CClass=
  {fields: {cField={type:BOOLEAN, supertypeName:null}},
   methods: {cFunc={localVars:{cFuncRet={type:INTEGER, supertypeName:null},
                               cParam={type:BOOLEAN, supertypeName:null}},
                    returnVar:{type:INTEGER, supertypeName:null}}},
   parentClassName: null}
}

Second Pass: Type Checking

To traverse the input program, the course provided an abstract syntax tree with a corresponding visitor pattern. Consider the PlusExpression node, which has children f0 and f2 defined as the left and right operand respectively.

public TypeObject visit(PlusExpression n, SymbolTable symTable) {
  TypeObject lftOp = n.f0.accept(this, symTable);
  TypeObject rhtOp = n.f2.accept(this, symTable);
  if (lftOp.getType() == TypeConstants.INTEGER
      && rhtOp.getType() == TypeConstants.INTEGER)
    return new TypeObject(TypeConstants.INTEGER);
  else
    printError("Must have integer types for plus expression");
  return null;
}
 
public TypeObject visit(IntegerLiteral n, SymbolTable symTable) {
  return new TypeObject(TypeConstants.INTEGER);
}

As illustrated on lines 2-3 of the visit function, the PlusExpression node’s children f0 and f1 call the accept method. The this keyword refers to the visitor object itself. Since f0 and f1 know what their own types are, each accept method on lines 2-3 are able to use the passed in visitor object to call the visit function that corresponds with their type. This is known as double dispatch, a mechanism where an object (such as the PlusExpression node) uses the accept method to call the corresponding visit method defined in the program.

        PlusExpr
       /        \
   PlusExpr     Int
  /        \
Int        Int 

For a more visual understanding, consider the diagram above. An example expression that corresponds with the diagram is (1 + 2) + 3, where a PlusExpression node has a child (1 + 2) of type PlusExpression and a child 3 of type Integer. The child PlusExpression has children 1 and 2, both of type Integer. Now we investigate how double dispatch is used to identify the entire example expression to type Integer.

First, the child PlusExpression calls its Integer children’s accept functions, both of which return TypeObject(TypeConstants.Integer). So inside the child PlusExpression’s visit function, lftOp and rhtOp are then initialized to the returned TypeObjects. Now, we know that the child PlusExpression also returns TypeObject(TypeConstants.Integer) since both its children returned TypeObjects with TypeConstants.Integer (as shown in lines 4-6).

The new TypeObject is then assigned to the lftOp variable of the root PlusExpression node. Since the root PlusExpression node’s other child also returns a TypeObject with TypeConstants.Integer, then TypeObject(TypeConstants.Integer) is ultimately returned and so we know that the entire example expression is of type Integer.

Other visitor patterns behave in a similar fashion as the one for PlusExpression. Now we investigate the visitor pattern for MessageSend.

public TypeObject visit(MessageSend n, SymbolTable symTable) {
  TypeObject callerObjType = n.f0.accept(this, symTable);
 
  // if invoking object does not have a supertype 
  // OR the class supertype is not in the symbol table
  if (!callerObjType.hasSupertypeName()
      || !symTable.hasClass(callerObjType.getSupertypeName()))
    printError("Must have valid supertype for caller object");
 
  ClassObject callerClass = symTable.getClass(callerObjType.getSupertypeName());
  String calledMethodName = n.f2.f0.toString();
  // if class type of object does not have the method
  if (!callerClass.hasMethod(calledMethodName))
    printError("Must call valid method from caller object");
 
  MethodObject calledMethod = callerClass.getMethod(calledMethodName, symTable);
 
  // calledMethodStack needed since an argument can be a calledMethod
  calledMethodStack.push(calledMethod);
  n.f4.accept(this, symTable);
  calledMethodStack.pop();
 
  return calledMethod.getReturnType();
}

Phase 3: Intermediate Code Generation

In the intermediate code generation phase, MiniJava is translated into the intermediate representation (IR) Sparrow and and SparrowV, which were formally defined by my professor.

Sparrow Generation

Notice, however, that the cField in CClass is not set as a field of BClass, even though BClass extends from CClass. Also notice that variables and return types have a property supertypeName, which are currently all initialized to null. This is where the second pass comes into play. In addition to updating classes with inherited fields and methods, the second pass also initializes the supertypeName property. If the type of the variable is a primitive, the supertypeName is left as null; otherwise, supertypeName is set to the name of the variable’s class. For instance after the second pass, the data structure is updated to the following.

{AClass=
  {fields: {},
   methods: {main={localVars:{args={type:ARRAY, supertypeName:null}},
                   returnVar:{type:VOID, supertypeName:null}}},
   parentClassName: null}
, BClass=
  {fields: {},
   methods: {bFunc={localVars:{cObj={type:IDENTIFIER, supertypeName:CClass}},
                    returnVar:{type:INTEGER, supertypeName:null}}},
   parentClassName: CClass}
, CClass=
  {fields: {cField={type:BOOLEAN, supertypeName:null}},
   methods: {cFunc={localVars:{cFuncRet={type:INTEGER, supertypeName:null},
                               cParam={type:BOOLEAN, supertypeName:null}},
                    returnVar:{type:INTEGER, supertypeName:null}}},
   parentClassName: null}
}