It's time to turn our attention to the parsing (syntax analysis) of the input tokens from the scanner. SubC breaks this down into three sections:
- Statement parsing in src/stmt.c
- Declaration parsing in src/decl.c
- Expression parsing in src/expr.c
The parser interacts with the symbol table to do things like create and retrieve the types of variables and function parameters. Not only does the parser do syntax analysis, e.g. to identify this as incorrect:
{ if (2+3] { print(("Hell"o}; ++ ]))
but it also does semantic analysis to detect errors like type mismatches, e.g.
int x[];
int y;
y= x+23;
The parser has the job of calling the code generator to output the assembly code. Some of this can be done directly; some of this is done by constructing abstract syntax trees (ASTs) which are then optimised and flattened by the generic code generator.
Each language has a grammar: the set of acceptable statements in that language, and how that language is structured. In computing languages, this is often expressed in Backus-Naur Form. In the comments for the files src/stmt.c, src/decl.c and src/expr.c, SubC outlines the grammar of the subset of C that it recognises in BNF format.
Here's the first fragment of the grammar from the top of src/stmt.c
compound :=
{ stmt_list }
| { }
stmt_list:
stmt
| stmt stmt_list
A compound statement is a list of statements surrounded by curly brackets, or it is a pair of curly brackets with nothing on the inside. At this point I should touch on the concept of terminal and non-terminal symbols.
A terminal symbol is a symbol in a grammar that cannot be broken into other symbols: they are irreducible. In SubC, the tokens that were recognised by the scanner are the terminal symbols, because they cannot be subdivided.
On the other hand, a non-terminal symbol is a symbol which can be broken into other symbols. There is a production rule in the grammar which describes how a non-terminal can be produced (broken up) into other symbols.
So, for the compound non-terminal, the rule is:
compound :=
{ stmt_list }
| { }
where the curly brackets are the terminal symbols and stmt_list is another non-terminal. How do we know that stmt_list is a non-terminal symbol? Because it has its own production rule:
stmt_list:
stmt
| stmt stmt_list
A statement list is either a single statement, or a statement followed by another statement list. Notice that this rule is recursive: one of the alternatives in the rule has its own non-terminal.
Now, we can replace stmt_list by a single stmt (alternative #1). Therefore, the second alternative allows stmt_list to be two stmts in a row: the first stmt is given, and we know that the stmt_list can be replaced with a single stmt.
Given this, it should be obvious that a stmt_list could also be three stmts in a row:
stmt_list => stmt stmt_list
stmt_list => stmt stmt stmt_list
stmt_list => stmt stmt stmt
Inductively, the original BNF rule for stmt_list allows it to be one or more stmts. This is the power of the BNF notation: it's compact but it permits a very sophisticated grammar to be described.
SubC implements the recognition of the C grammar with a recursive descent parser. Following the recursive nature of the BNF rules, each function in the parser can call other functions, which can themselves call each other including the original function.
Each function has the job of recognising one of the non-terminals in the grammar. To do this, it reads one or more tokens from the input stream. Based on this token, it uses the BNF definition to determine which alternative it should follow and goes down that path. Sometimes that path fails, so the parser might have to back up (i.e. push back a token) to follow a different path. If the tokens do not match any of the paths in the BNF rules, the parser declares a syntax error.
That's enough of the theory of parsing, but I've only touched the surface. At this point, I highly recommend that you go get Nils' book about SubC, Practical Compiler Construction, as it covers the theory in more detail.
We will start our tour of the SubC parser with statement parsing. The
top-level statement parsing function is stmt()
which is in
src/stmt.c. The first thing we need to determine is
how we get to the point where the code in stmt()
is running.
The SubC compiler starts in main()
(in src/main.c)
which interprets the flags and file arguments on the command line.
Every non-flag argument is passed to the compile()
function:
for (i=1; i<argc; i++) {
if (*argv[i] != '-') break;
if (!strcmp(argv[i], "-")) {
compile(NULL, def);
exit(Errors? EXIT_FAILURE: EXIT_SUCCESS);
}
After opening the correct files, compile()
(also in src/main.c) then passes control to program()
:
static void compile(char *file, char *def) {
...
program(file, in, out, def);
program()
(in src/decl.c) generates the assembly prelude
code, scans the first token and calls the top()
function:
// Parse the specified file and generate assembly output
void program(char *name, FILE *in, FILE *out, char *def) {
init(); // Initialise all variables
defarg(def); // Define any -D from the cmd line
Infile = in; // Set the current in/out FILE pointers
Outfile = out;
File = Basefile = name; // Set the input file's name
genprelude(); // Generate the assembly prelude
Token = scan(); // Get the first token from the input
while (XEOF != Token) // Until we hit the end of the file
top();
genpostlude(); // Generate the assembly postlude
// Optionally, dump symbols and stats
if (O_debug & D_GSYM) dumpsyms("GLOBALS", "", 1, Globs);
if (O_debug & D_STAT) stats();
}
The top()
function (also in src/decl.c) parses the various
declarations that can occur at the top of a C program. Eventually it
finds a function declaration followed by a compound statement, i.e.
a pair of '{' ... '}' with zero or more statements inside.
compound()
(in src/stmt.c) gets the token after '{'
and repetetively calls stmt()
until we hit the closing '}':
void compound(int lbr) {
if (lbr) Token = scan(); // Get next token if still on the '{'
while (RBRACE != Token) { // Until we get the '}'
if (eofcheck()) return; // Return on EOF
stmt(); // Parse the statement within
}
Token = scan(); // And get the token after the '}'
}
And so, we finally hit the stmt()
function!
stmt()
implements this BNF rule:
stmt :=
break_stmt
| continue_stmt
| do_stmt
| for_stmt
| if_stmt
| return_stmt
| switch_stmt
| while_stmt
| compound
| ;
| expr ;
where ';' is a terminal and all the other symbols are non-terminals. In other words,
it recognises all the main statements in the C language. Of course, this is too much
for one function to perform, so stmt()
just looks at the current token and uses this
to determine which alternative to follow. It calls other functions to parse these alternatives:
static void stmt(void) {
int lv[LV];
switch (Token) {
case BREAK: break_stmt(); break;
case CONTINUE: continue_stmt(); break;
case DO: do_stmt(); break;
case FOR: for_stmt(); break;
case IF: if_stmt(); break;
case RETURN: return_stmt(); break;
case SWITCH: switch_stmt(); break;
case WHILE: while_stmt(); break;
case LBRACE: compound(1); break;
case SEMI: Token = scan(); break;
// Can't have 'default' or 'case' outside a 'switch' statement
case DEFAULT: wrong_ctx(DEFAULT); break;
case CASE: wrong_ctx(CASE); break;
// Not a statement, try parsing it as an expression followed
// by a semicolon. Flush the insruction queue.
default: expr(lv, 0); semi(); commit(); break;
}
clear(1); // Primary register is now empty
}
Note some syntax error reporting: we can't have a default
or case
keyword because
we are not in a switch
statement at this point in time.
There are functions which are involved in the generation of the output assembly code:
commit()
and clear()
. I'll delay discussion of those for now, but I'll come back
to them soon.
Let's choose to look at compound statements next.
We've already seen the BNF rules for compound statements:
compound :=
{ stmt_list }
| { }
stmt_list:
stmt
| stmt stmt_list
We must look for a '{', then zero or more statements, then a '}'. This is done with a loop that parses statements until the '}' token is received:
void compound(int lbr) {
if (lbr) Token = scan(); // Get next token if still on the '{'
while (RBRACE != Token) { // Until we get the '}'
if (eofcheck()) return; // Return on EOF
stmt(); // Parse the statement within
}
Token = scan(); // And get the token after the '}'
}
Here's the BNF rule for Do ... While loops:
do_stmt := DO stmt WHILE ( expr ) ;
stmt
and expr
are non-terminals, and we will call functions to deal with them.
The rest are terminals, i.e. tokens, and we can read them in from the scanner here.
Here is the code for do_stmt()
:
static void do_stmt(void) {
int ls, lb, lc;
Token = scan(); // Skip the 'do' token
ls = label(); // Get a label for the top of loop
pushbrk(lb = label()); // Get labels for breaks and continues
pushcont(lc = label()); // & push them on the respective stack
genlab(ls); // Generate the top label
stmt(); // Parse the statement
match(WHILE, "'while'"); // Get the 'while' token
lparen(); // Get the '(' token
genlab(lc); // Generate the continue label
rexpr(); // Parse the expression
genbrtrue(ls); // Generate "branch if true to top"
clear(1); // Primary register is now empty
genlab(lb); // Generate the break label
rparen(); // Get the ')' and ';' tokens
semi();
Bsp--; // And pop the break and continue
Csp--; // labels from their stacks
}
You should be able to spot the code that reads and checks the tokens: scan()
,
match()
, lparen()
, rparen()
, semi()
. You should also be able to spot the
calls to the functions that deal with the non-terminals: stmt()
and rexpr()
.
So now let's look at the rest of the code.
We have to translate C code that looks like this:
do {
// Zero or more statements inside the loop
} while (this expression is true);
into assembly code. We need the ability to return to the top of the set of statements if the loop expression is true, or keep going onwards if not.
Even more interesting, C has the break
and continue
keywords. If we see a break
,
we must immediately leave the loop, skipping all the statements and the loop test.
If we see a continue
, we must immediately run the loop's expression; if it's true,
we can execute another loop iteration. If not, we leave the loop.
This is translated into assembly with three labels, the testing of the loop expression and a jump based on the test's result:
start_label:
// All the assembly code for the
// statements inside the loop
continue_label:
test the loop expression
jump to the start_label if the test is true
break_label:
Any break
inside the loop will be translated into an immediate jump to the break_label
,
and any continue
inside the loop will be translated into an immediate jump to the
continue_label
. The loop's expression is evaluated at the bottom and jumps to the
start_label
if true.
With that explained, we can return to the analysis of the do_stmt()
code, but with one
more sidetrack. C allows loops inside loops, so there can be nested labels, e.g.
lvl1_continue_label:
...
lvl2_continue_label:
...
lvl2_break_label:
...
lvl1_break_label:
We thus need a stack to hold the break
and continue
labels for our loops, so that
any specific break
command can be translated to the correct label for its loop.
Returning to the do_stmt()
code, we can see that three assembly labels are generated:
ls
, lb
and lc
for the start_label
, break_label
and continue_label
. Two of
them are pushed onto stacks.
The start_label
is output, followed by the assembly for all the statements inside
the loop. The while
and '(' tokens are parsed. Now the continue_label
is output.
The assembly code for the expression is output by expr()
. The "branch if true"
assembly to the start_label
is output by genbrtrue()
. Now the break_label
is
output along with the parsing of the ')' and ';' tokens. Finally, the two labels
are popped from their stacks as they are no longer needed.
Finally, a quick word about the call to clear(1)
. SubC's generic gode generator
models a hypothetical CPU with a main register called an accumulator. There is only
one accumulator; if it is needed while it is occupied, the accumulator must be
spilled to
free it for another use.
The rexpr()
call calculated the value of the loop's expression and left this value in
the accumulator. The genbrtrue()
call generated the code that used this value, but
it left the accumulator "occupied". So, the call to clear()
marks that the accumulator
is free again for other assembly code to use it.
The BNF rule for While loops is:
while_stmt := WHILE ( expr ) stmt
It's very similar to the Do ... While loop except the loop condition is tested
at the top of the loop, not at the end. So the while_stmt()
function's code
is very similar to the do_stmt()
code:
static void while_stmt(void) {
int lb, lc;
Token = scan(); // Skip the 'while' token
pushbrk(lb = label()); // Get labels for breaks and continues
pushcont(lc = label()); // & push them on the respective stack
genlab(lc); // Generate the continue label
lparen(); // Get the '(' token
rexpr(); // Parse the expression
genbrfalse(lb); // Generate branch if false to end of loop
clear(1); // Primary register is now empty
rparen(); // Get the ')' token
stmt(); // Parse the inner statement
genjump(lc); // Generate jump back to top of loop
genlab(lb); // Generate the end of loop label
Bsp--; // And pop the break and continue
Csp--; // labels from their stacks
}
Given that we already have a stack of labels for the relevant jump points
for both break
and continue
, these are easy to implement.
/*
* break_stmt := BREAK ;
*/
static void break_stmt(void) {
// Read the break
// Error check if nothing on the break stack
// Generate a jump to the location on top of the break stack
// and skip past the semicolon
Token = scan();
if (!Bsp) error("'break' not in loop/switch context", NULL);
genjump(Breakstk[Bsp-1]);
semi();
}
/*
* continue_stmt := CONTINUE ;
*/
static void continue_stmt(void) {
// Read the break
// Error check if nothing on the continue stack
// Generate a jump to the location on top of the continue stack
// and skip past the semicolon
Token = scan();
if (!Csp) error("'continue' not in loop context", NULL);
genjump(Contstk[Csp-1]);
semi();
}
for_stmt :=
FOR ( opt_expr ; opt_expr ; opt_expr ) stmt
opt_expr :=
| expr
For loops in C are interesting for two reasons:
- The three clauses inside the loop parentheses are each optional.
- The generated code isn't actually sequential.
Let me explain the second point by rewriting a For loop as if it was a While loop:
for_stmt :=
FOR ( opt_expr1 ; opt_expr2 ; opt_expr3 ) stmt
opt_expr1;
while ( opt_expr2 is true ) {
stmt;
opt_expr3;
}
As you can see, the code for opt_expr3
comes after stmt
even though we will
parse opt_expr3
comes before stmt
. We're going to have to deal with this
somehow.
One strategy is to generate and "stash" the opt_expr3
code. Then generate all
the other code. Then "unstash" the opt_expr3
code and generate it. We would end
up with this assembly structure:
opt_expr1 code
continue_label:
test the loop expression
jump to the break_label if the test is true
stmt code
opt_expr3 code
jump to the continue_label
break_label:
SubC takes a different strategy which doesn't require the "stashing" of any
assembly ouput. But this forces the opt_expr3
code to be generated before
the stmt
code. So how do we get them to execute in the correct sequence
if they are generated out of order? The answer is to use two more labels and
some jumps:
opt_expr1 code
continue_label:
test the loop expression
jump to the break_label if the test is true
jump to the body_label
expr3_label:
opt_expr3 code
jump to the continue_label
body_label:
stmt code
jump to the expr3_label
break_label:
Yes it's ugly and inefficient, but it works!. So you should now be able to see how the
C code in for_stmt()
implements the above structure:
static void for_stmt(void) {
int ls, lbody, lb, lc;
Token = scan(); // Skip the 'for' token
ls = label(); // Get a label for the top of loop
lbody = label(); // and for the start of the inner statement
pushbrk(lb = label()); // Get labels for breaks and continues
pushcont(lc = label()); // & push them on the respective stack
lparen(); // Get the '(' token
if (Token != SEMI) { // Parse any first expression
rexpr();
clear(1); // Primary register is now empty
}
semi(); // Get the ';' token
genlab(ls); // Generate the top label
if (Token != SEMI) { // Parse any second expression
rexpr();
genbrfalse(lb); // Generate branch if false to break label
clear(1); // Primary register is now empty
}
genjump(lbody);
semi(); // Get the ';' token
genlab(lc); // Generate the continue label
if (Token != RPAREN)
rexpr();
clear(1); // Primary register is now empty
genjump(ls); // Generate jump to start of middle expression
rparen(); // Get the ')' token
genlab(lbody); // Generate label for start of inner statement
stmt(); // Parse the inner statement
genjump(lc); // Generate jump back to third expression code
genlab(lb); // Generate the break label
Bsp--; // And pop the break and continue
Csp--; // labels from their stacks
}
The BNF rule for If statements is:
if_stmt :=
IF ( expr ) stmt
| IF ( expr ) stmt ELSE stmt
which means that the else
clause is optional. This means that we have to translate
this into two different assembly outputs. The first alternative is:
test the expression
jump to the l1_label if the test is false
stmt code // Code when expr is true
l1_label:
The second alternative is:
test the expression
jump to the l1_label if the test is false
stmt code // Code when expr is true
always jump to the l2_label
l1_label:
stmt code // Code when expr is false
l2_label:
This means that we have to generate the l1_label
at the end for one alternative,
but generate the l2_label
at the end for the second alternative. See if you can
spot how this is done in the following C code:
static void if_stmt(void) {
int l1, l2;
Token = scan(); // Skip the 'if' token
lparen(); // Get the '(' token
rexpr(); // Parse the expression
l1 = label(); // Get the label for non-true code
genbrfalse(l1); // Generate branch if false to this label
clear(1); // Primary register is now empty
rparen(); // Get the ')' token
stmt(); // Parse the statement
if (ELSE == Token) { // If we have an 'else' token
l2 = label(); // Get a second label. Swap l1/l2 so the
genjump(l2); // new label is the non-false code. Generate
genlab(l1); // the jump to the end (l2) code. Generate
l1 = l2; // the false code (l1 label).
Token = scan(); // Skip the 'else' token and
stmt(); // Parse the inner statement
}
genlab(l1); // Generate the label for non-true code
}
SubC gives the BNF rules for Switch statements as follows:
switch_stmt :=
SWITCH ( expr ) { switch_block }
switch_block :=
switch_block_stmt
| switch_block_stmt switch_block
switch_block_stmt :=
CASE constexpr :
| DEFAULT :
| stmt
The switch_block
has to contain at least one switch_block_stmt
. However, notice that
a switch_block_stmt
could be just a single stmt
. This means that the BNF grammar
permits the following code:
switch (x) {
printf("Hello\n");
}
Obviously we don't want to permit this, so we'll have to implement semantic analysis to prevent a programmer from doing this.
SubC implements a Switch statement in assembly this way:
test the expression
jump to the ls_label
la: case a code // Including a jump to lb if a break
lb: case b code
ldef: default code
jump to lb
ls: load a pointer to the switch jump table
call code to jump to la, lb or ldef
switch jump table
lb:
As with the For loop, the switch block code is generated sequentially, with appropriate labels and jumps to delay the selection of the correct option until after all the switch blocks.
I'm not going to cover the operation of the code that interprets the switch jump table. This is hand-coded assembly code in src/lib/crt0.s if you want to see it.
We start with switch_stmt()
:
static void switch_stmt(void) {
Token = scan(); // Skip the 'switch' token
lparen(); // Get the '(' token
rexpr(); // Parse the expression
commit(); // Flush the insruction queue
clear(0); // Primary register is now empty, but
// don't touch the instruction queue.
// XXX Why not?
rparen(); // Get the ')' token
if (Token != LBRACE) // Syntax error if '{' not the next token
error("'{' expected after 'switch'", NULL);
switch_block(); // Now parse the switch block itself
}
which does the parsing of the tokens and calls rexpr()
to generate the
code to evaluate the Switch expression. I'm still not sure why clear()
is called with argument 1. With the top of the Switch statement parsed,
we call switch_block()
:
static void switch_block(void) {
int lb, ls, ldflt = 0;
int cval[MAXCASE]; // List of case values and case labels
int clab[MAXCASE];
int nc = 0; // Start with no cases found yet
Token = scan(); // Skip the '{' token
pushbrk(lb = label()); // Get a label for the end of the block
// and add it to the break queue
ls = label(); // Generate a label for the jump table
genjump(ls); // and jump to it
while (RBRACE != Token) { // Until we find the closing '}'
if (eofcheck()) return;
// Error if too many cases
if ((CASE == Token || DEFAULT == Token) && nc >= MAXCASE) {
error("too many 'case's in 'switch'", NULL);
nc = 0;
}
if (CASE == Token) { // Find and skip a 'case' token
Token = scan();
cval[nc] = constexpr(); // Get the case expression
// and generate a label for it
genlab(clab[nc++] = label());
colon(); // Get the ':' token
}
else if (DEFAULT == Token) { // Find and skip a 'default' token
Token = scan();
ldflt = label(); // Get a label for the default case
genlab(ldflt); // and output this label
colon(); // Get the ':' token
}
else
stmt(); // Otherwise parse a statement
}
if (!nc) { // If we have some case statements
if (ldflt) { // Add the default label with
cval[nc] = 0; // value to the value/label lists
clab[nc++] = ldflt;
}
else
error("empty switch", NULL); // Error if no cases
}
genjump(lb); // Output a jump to the end label
genlab(ls); // Generate the jump table label
genswitch(cval, clab, nc, ldflt? ldflt: lb); // Generate the jump table
gentext(); // Go back to the text segment
genlab(lb); // Generate the break label
Token = scan(); // Get the next token
Bsp--; // And pop the break label from the stack
}
The code builds a list of case
values in cval[]
and the
labels that precede the code for these cases in clab[]
.
The while (RBRACE != Token) { ... }
loops until the matching '}' token is
received. For each case
token, The case
value is calculated (by the
compiler, right now) with constexpr()
and stored in cval[]
along with a label in clab[]
. The label is output immediately.
The same is done for default
except that its label is stored in
the ldflt
variable.
For statements, we simply call stmt()
to generate the assembly code.
Thus, a switch block that looks like:
case 'a':
case 'b': printf("Hello\n"); break;
default: printf("world\n");
will be translated to:
L1:
L2: printf("Hello\n") code
jump to lb
Ldef: printf("world\n") code
Once we receive the ending '}' token and fall out of the while (RBRACE != Token)
loop, the code verifies that there was at least one case
or default
label; if
not, we have a semantic error.
The default
label is placed on the end of the list of case
labels.
Right now, the last assembly code generated was probably the default
code.
We know that, coming up next, is the code to deal with the switch jump table.
So output a jump to the break label with genjump(lb)
.
Now it's time to deal with the jump table. Output the label for this code
with genlab(ls)
. Generate the table itself with genswitch()
: I'll cover
this later on. Finally, generate the break
label with genlab(lb)
.
That's the parsing of the control statements done:
- compound statements
- Do .. While loops
- While loops
- For loops
- If/Else statements
- Switch statements
break
andcontinue
src/stmt.c also has a function that deals with the C return
keyword,
but this checks if the value being returned matches the declaration of the function
being returned on. As this requires knowledge of declarations, we will hold this over
until we cover declaration parsing.