Symbol table introduction
The difference between recognition and simply reading all the characters of a file is that, after recognition, you can identify the various phrases and subphrases of an input sentence. Phrases and subphrases are identified by their structure. The implicit structure of a sentence mirrors the rule invocation graph of its derivation. The result of parsing is usually a highly refined version of the parse tree called an AST where subtree roots are operators and subtree leaves are operations.
In English, this would be like having verbs as subtree roots and nouns as subtree leaves. Consider the following sentence:
Hit thumb with hammer.
A useful AST might look like:

Note that the punctuation and prepositions are mainly for making the input read better, but they do not add any information and are omitted from the intermediate form AST.
Along with the grammatical structure, most language applications need to verify the semantic validity of the input sentence. For example, the following Java code fragment is valid because variable i is defined prior to use:
The semantics might also allow forward references to methods. So, for example, you might see two methods such as the following where the first method calls the second:
void f() { g(); }
void g() { f(); }
How do you know that the reference to variable i and the reference to method g are valid? A data structure called a symbol table records definitions of all relevant symbols such as classes, variables, fields, parameters, methods, and so on. Validating a reference is a matter of asking the symbol table if the entity is visible in the current scope or any enclosing scope.
A scope is a lexically-bounded input construct that usually corresponds to curly braces {...} within programming languages derived from C. These boundaries are typically class boundaries, method boundaries, and nested code block boundaries. The syntax from language to language will vary, but you will encounter some kind of lexically bounded region that defines the scope. Each scope is nicely represented by a hash table that maps the symbol name to an entry in the symbol table. The entry in the symbol table has all sorts of interesting details such as the symbol's definition location in the AST, possibly its type, and what kind of thing it is.
Unfortunately, symbol table rules for looking up symbols (resolving symbols) is not as simple as asking the current scope's hash table for the symbol. Scopes can nest as you have experienced in just about every programming language. For example, a method's scope is nested within a class' scope. A class' scope may be nested within a package or module. Resolving a symbol means looking at a stack of nested, enclosing scopes. The top of the stack (most deeply nested scope) is the current scope in the bottom of the stack (the first scope on the stack) is likely to be the enclosing package. Sometimes the stack of scopes does not follow lexically nested regions. For example, if method foo is not found in a class, the symbol table should look at the superclass in order to resolve that symbol foo.
Goals
The purpose of symbol table management can include:
- checking for undefined, illegal, or mismatched entity references.
- guiding the parse based upon context; e.g., T(i) is a constructor style typecast because T is defined as a type name.
- mapping entity references to a type (user defined or predefined).
- mapping entity references to the code that defines the entity (can help with code motion).
- mapping entity definition to references (can help with code motion).
Common symbol resolution semantics and scoping rules
- symbols in a single scope; e.g., configuration files, BASIC programs
- multiple, nested scopes; e.g., methods within classes
- persistent scopes that have no enclosing scope; e.g., C structs
- persistent scopes with enclosing and/or nested scopes; e.g., class hierarchies, inner classes, packages
- forward references or references to external elements; e.g., classes in a different file
- spaces where the same name can refer equally to a type, a field, and say a method
- overloaded symbols; e.g., method overloading
The next two sections describe spaces and overloaded methods.
Same name, different space
For completeness, note that C struct names are in a different space than variable names which implies that you can have a struct and variable with the same name in the same scope as in:
struct t {
int i;
};
int t=4;
In Java this gets worse: types, methods, and fields can have the same name as they are in different spaces. Also, you can have packages and classes with the same name. For example, consider the following insanity:
package a;
public class a {
}
The fully-qualified name of class a is a.a. So bizarre. Not sure why language designers do this.
There are two solutions. Either you must track separate lists for each space within each scope or you must encode the space information in the name of symbol. This is called name mangling. Name mangling encodes the space into the name such as method_foo and field_x and type_User. The syntax of the symbol reference must then tell you what you are looking for. If you see foo() then do a lookup("method_foo") not lookup("foo").
Same name, same space
One clearly useful feature that allows the same name to appear in the same space is method overloading. Here the method name is the same but the argument number and/or types are different. The obvious solution here is to encode the argument information in the name of the method, again using name mangling.
Statically versus dynamically typed languages
First, let's take a look at dynamically typed languages.
Symbol management in Python
Consider the following assignment statement in Python:
What is the role of the Python compiler for this statement? It turns out that for a dynamically typed language, the role of the compiler is purely to generate code. From Python compiler package
:
The Python compiler package is a tool for analyzing Python source code and generating Python bytecode. The compiler contains libraries to generate an abstract syntax tree from Python source code and to generate Python bytecode from the tree.
The Python compiler is therefore a compiler in the Java compiler sense, going to byte codes rather than machine code. Because there are no type declarations and the compiler is not doing any type inference (guessing the type based upon the statements), there is literally nothing else for the compiler to do. The compiler maps that statement to something like:
You might wonder if the compiler has to allocate space for x before execution of the code. Nope. The interpreter does everything in Python. As a result, you can assign objects of different types to the same variable:
if foo:
x = 1
else:
x = "hi"
The STORE_FAST instruction does the allocation (if x is undefined) and assignment. Even fields of classes are defined on the fly:
is translated to the following bytecodes:
When you define a class, the interpreter creates a symbol table entry and can tell you about it, not the compiler. For example the following class definition:
adds a class symbol to the interpreters symbol table; using the dir function you can ask interpreter for information about the class:
In summary, a dynamically typed language does most of the work in the interpreter and only rudimentary compilation in the compiler. Dynamically typed languages can usually be translated to bytecodes with a single pass. Forward method references are not an issue because the message is simply sent to the appropriate object and the object determines whether there is such a method on-the-fly. One way to look at this is that the first pass is done by the interpreter as it loads the source file and the second pass to resolve references is done and the code executes.
As you might suspect, a statically typed language offloads a great deal of work to the compiler, resulting in a simpler interpreter or even no interpreter at all when it generates machine code directly.
Symbol management in Java
Revisit the concept of a simple assignment, but now consider it in Java. Java is a statically typed language and therefore requires a variable definition:
Like Python, the Java compiler reads in source code, constructs an AST, and generates bytecodes that execute within the Java virtual machine (VM). Java compiler has a lot more work to do than Python's compiler. Java's compiler must build a symbol table with all of the variable, method, and class definitions encountered in a source file. Python does all of this in the interpreter. You can think of the variable definition as "executing" at compile time rather than run-time for a statically typed language like Java.
A compiler for a statically typed language has more to do than just defining variables. It must also check that each reference to a symbol is semantically valid. That means that the symbol must be:
- properly defined within some visible scope
- of an appropriate type
Because the compiler knows about all data definitions, it can generate instructions and space in the constants table of the resulting .class file. (A C compiler would generate instructions to allocate space on the stack for local variables and also would place information in the generated image file about how much global space to allocate).
The following table summarizes what happens when for a dynamically typed language, a statically typed language, and a statically typed language that executes without an interpreter.
| |
Compile-time |
Run-time |
| Python |
Translates source code to bytecodes. |
Interpreter builds a symbol table and verifies the semantic validity of each assignments and other operations, which guarantees type safety. Interpreter executes code. |
| Java |
Builds a symbol table and verifies the semantic validity. Translates source code to bytecodes. Bytecode files contain symbol table information, which is available via reflection API at run-time. |
Interpreter executes code, verifying semantic validity of assignments and other operations. Cannot fool interpreter into doing unsafe operations with a hostile compiler. |
| C |
Builds a symbol table and verifies the semantic validity. Generates assembly code, which is translated to machine code by an assembler. A linker matches up symbol references to code and data definitions. |
Program runs natively on the processor. No symbol introspection available, no type safety guarantee at runtime. |
Sample Symbol Table Management Problems
In this section, we will explore a series of more and more complicated problems regarding the management of input symbols (usually programming language input variables and types).
Configuration file symbol references
Consider a configuration file syntax such as the following:
site main { ... }
site demo { ... }
site meditation { ... }
family jguru {
sites = { main, meditation };
...
}
The jguru family definition indicates that it is composed of two sites: main and meditation . The grammar might be something like:
family
: 'family' ID '{' element+ '}'
;
element
: sites
| ...
;
sites
: 'sites' '=' '{' ID (',' ID)* '}'
;
When you are parsing the jguru family sites specification, how do you know if those sites are valid? In other words, what data structure(s) would you use and what actions would you add to the grammar?
Configuration file symbol references solution
Simple scopes
The C programming language has a fairly straightforward set of scoping rules (for this question, please ignore {{struct}}s and other user-defined types). There is the file scope, the parameter scope, and the local scope. The "kinds" of items to consider for this question are just functions and variables.
For compilation or other translation tasks, you will need to know the type (such as int) of a variable, say x, when you encounter it in the program. This question focuses on how you know what x is. Consider the following input:
int x;
void foo(float x) {
long x;
if ( 1 ) {
double x;
}
}
void bar() {
}
It is clear from the program that the type of x depends on its context. If you reference x within the if-statement, you should discover that x's type is double. From within function bar, x's type is int.
Note that a good compiler will warn you about the parameter and local variable confict:
t.c: warning: declaration of 'x' shadows a parameter
Questions:
- What is the basic strategy for tracking and resolving symbols for this slightly simplified C case?
- What data structure(s) would you use to track symbols and scopes?
- What is the algorithm to resolve x's type? In other words, how do you use your data structures to find the right x?
- When would you update your symbol table and scope data structures? During parse time? After parsing? During a tree walk? When would you then try to resolve symbols to their types?
Simple scopes solution
Data aggregates
Now let's consider how C struct definitions complicate the symbol table picture. The following example illustrates a sample struct definition and a few field access assignments.
struct t {
int x;
float y;
struct {
char *name;
} u;
};
void foo() {
struct t test;
test.x = 1;
test.u.name = "John";
}
Now you have to resolve symbol test.x not just simple names like x as in the previous question.
- How are struct s defined in the symbol table? For example, what do you do with the fields? What do you do with the nested struct definitions?
- How do you resolve test.x to an int ? What about test.u.name ?
- How does the integration of struct s affect the normal scoping and resolution of symbols?
Data aggregates solution
Forward references
What would you do if the family spec appeared before the site specs it references?
family jguru {
sites = { main, meditation };
...
}
site main { ... }
site demo { ... }
site meditation { ... }
This is the same problem encountered when building an assembler. You might reference a label defined later:
...
jump _t1
...
_t1:
...
More commonly today, this is like referencing a method (in Java or any other optical language) that is defined below it lexically in the file.
- What is the basic strategy?
- How does this complicate your data structure(s)?
- What additional algorithms do you need?
- What happens when you have both forward and backward references?
Forward references solution
Object-oriented aggregrates and inheritance
Object-oriented languages such as Java have data aggregrates similar to C structs, but can inherit methods and data fields from superclasses. For example, here is a simple hierarchy:
class Vehicle {
String name = null;
public void start() {...}
public void stop() {...}
}
class Truck extends Vehicle {
int numberOfWheels = 18; public void start() {...} public void go4x4() {...} }
Other code can create new instances and reference these fields:
Truck t = new Truck();
System.out.println(t.numberOfWheels);
t.start();
t.stop();
- How would you modify your data structures for C aggregates to handle class hierarchies (ignoring interfaces and packages)?
- How do you resolve qualified references such as to t.numberOfWheels or t.start()?
- How do you resolve unqualified references to, say name , from within a method of Truck ? In other words, what scopes do you check and in what order?
- What does your parser do as it sees the class definitions and references assuming you are doing symbol table management during the parse?
- What does your parser do if you are doing symbol table management later during a tree walk? What advantages / disadvantages are there when you do symbol table management after the parse in a tree walk?
- Can you ever get rid of your symbol table (that is, before the end of the translation)?
- What is the relationship between the symbol table and the AST?
Object-oriented aggregrates and inheritance solution