CHAPTER 25:

LOW LEVEL PARSING

At this point in our tour of the Small C compiler we investigate the third- and fourth-level parsing functions--the ones that handle statements and local declarations. This is where the big picture begins taking shape. Figure 25-1 illustrates how the major low level parsing functions relate to each another.

Figure 25-1: Primary Low-Level Parsing Functions

As we saw in Chapter 24, the high point in dofunction() is where it calls statement(), since that is where the entire body of a function is parsed. This should seem appropriate since the body of a function is really just a single compound statement.

Statement() in turn calls the lower-level functions shown in Figure 25-1. Declloc() declares local variables. Compound() handles compound statements, while doif(), dowhile(), dodo(), dofor(), and doswitch() parse if, while, do, for, and switch statements respectively. These six functions are special because they each call statement() recursively. Notice that each of these constructs controls one or two dependent statements. And these, in turn, might contain further instances of these six control statements. This nesting could go on to any depth. It will be important to remember that this recursion exists, not only when studying these six functions but when studying all of the functions (except dofunction()) in Figure 25-1 since they all can be reached through recursion.

As with the preceding six functions, the others parse the constructs that their names imply. Doasm() was described in Chapter 24 and so is not repeated here. Neither is doexpr(), since it comprehends the expression analyzer which is has been reserved for Chapter 26.

Third-Level Parsing

Statement()

Each call to statement() parses a single statement. Sequences of statements only occur within a compound statement (or block), so the loop that recognizes statement sequences is to be found in compound().

First, statement() looks for a data declaration, introduced by char, int, unsigned, unsigned char, or unsigned int. Finding one of these, it calls declloc(), passing it the data type (CHR, INT, UCHR, or UINT). It then enforces the presence of a semicolon.

If that fails, it looks for one of the tokens that introduces an executable statement. If that succeeds, the corresponding parsing function is called, and the global variable lastst is set to a value indicating which statement was parsed. Table 25-1 lists these keywords, their parsers, and their lastst values.

Table 25-1: Statement Tokens, Parsers, and Lastst Values

Several cases deserve special comment. First, no value is assigned to lastst following a compound statement. This is because lastst must represent the last elementary statement that is guaranteed to have executed when control reaches the end of the function. Assigning a value after a compound statement would wipe out the desired information. Recall from Chapter 24 that lastst is tested by dofunction(), after parsing a function's body, to decide whether or not to generate a default return. At first glance, it would appear that assigning a value to lastst after each statement would not give reliable results. For instance, a function might end with

		...
		if(a==b) return;
		} 

If this should leave lastst with the value STRETURN then the default RET would not be generated and the consequences would be disastrous. However, that does not happen because the if statement recursively calls statement() to parse the return. Therefore, as the recursion unwinds, lastst is first assigned STRETURN and then STIF.

Labels are handled differently since the colon which identifies them is not a prefix token, but a suffix. In this case dolabel() is called to determine whether or not a label is present and, if so, to handle it. It returns true on success.

Null statements--semicolons--are also special because no parsing is needed. However, errflag is reset to reenable error reporting if it had been disabled previously. Small C has a policy of reporting only the first error message for a simple statement since the cascade of messages which usually follows are more likely to be spurious than correct. In practice, this works quite well.

If no other statement is recognized, it is assumed that the current token is the beginning of an expression, and so statement() calls doexpr(). Recall that in C an isolated expression is a legal statement. If it is not clear at this point how a function is called, then remember that functions are called from within expressions.

Fourth-Level Parsing

Declloc()

Declloc() parses local declarations (see Listing 19-6 for examples). This function is roughly parallel to declglb(), although major differences exist. First, local declarations may appear only at the beginning of a block, before the executable statements. This means that all of the local variables in a block can be allocated on the stack at one time with

		ADD SP,-n

in which n is the total number of bytes being allocated. If a data declaration is encountered after an executable statement, "must declare first in block" is issued.

Since local variables are referenced relative to the base pointer BP, no labels are generated. Rather, the local symbol table contains the offsets from BP to the variables.

Another difference is found in the fact that local variables are not initialized, and so initializers do not have to be parsed. Also, Small C restricts local function declarations to pointers; that is, only function declarations of the form

		(*name )()

are acceptable at the local level. These factors further simplify the task of declloc().

Two other restrictions on local declarations need to be considered. First, local declarations are not allowed in the body of a switch statement. This is because, control enters a switch statement at arbitrary points without passing through the head of a block where the local variables would be allocated. Thus, it would be possible to reference variables that were never allocated. Therefore, attempting to declare locals within a switch statement yields the message "not allowed in switch." An example of such an illegal statement is:

		switch(i) {
			int j;
			case 1: j = i; ...
			...
			}
Control cannot reach the point, before the first case prefix, where j is allocated space on the stack.

The other restriction has a similar basis. Goto statements and locals (declared below the highest-level block) are not allowed in combination--either is allowed, but not both. This is because jumping into a block in which locals are declared would bypass the code that allocates the locals. Local declarations at the highest level in the body of a function present no problem since they precede all executable statements, and so are guaranteed to be allocated. A violation of this restriction is greeted with either "not allowed with goto" or "not allowed with block-locals." An illegal example is:

		func() {
			if(i) {
				int j;
				target: j = 0;
				...
				}
			goto target;
			}

On entry from statement(), the data type (char, int, unsigned, unsigned char, or unsigned int) has already been recognized, bypassed, and passed to this function as an argument. Declloc() then finds itself facing a list of objects to define (reserve storage for). Each iteration of an infinite loop processes the next object in the list. Two statements test for return conditions. At the top of the loop

		if(endst()) return;

tests for the end of the statement (semicolon or end of input) and returns if so. And at the bottom

		if(match(",")==0) return;

returns when an object declaration is not followed by a comma. Between these, declloc():

  1. Calls decl() to determine the identity (POINTER, VARIABLE, ARRAY) and size of the object, saving them in the local variables id and sz. Decl() also fetches and verifies the object's name.
  2. Using id and sz, adds an entry to the symbol table so the object can be referenced.

First, within decl(), a check is made for an open parenthesis, introducing a function pointer declaration--(*...)(). Next * is sought; if present the identity is set to POINTER; otherwise, VARIABLE is assumed. Then, symname() is called to verify and fetch the name. If a legal name is not there, illname() complains. Next, if an open parenthesis was found, the closing parenthesis is bypassed. Then, if another open parenthesis is found, a function pointer is being declared. In that case, failure to supply the leading (* produces "try (*...)()" as an error message. In any case the final closing parenthesis is required.

Next, if the identity is VARIABLE (not POINTER), then a check is made for an open bracket, indicating ARRAY. This overrides the initial assumption of VARIABLE. Next, needsub() evaluates the dimension expression and enforces the closing bracket. Its returned value is the specified dimension--a constant. If the dimension is missing or evaluates to zero, then "need array size" is issued. The array dimension is multiplied by the size of an element to determine the total size of the array. Last of all, the size of the declared object is returned to declloc().

Finally, declloc() calls addsym() to enter the declared object into the local symbol table.

Notice that nothing is done to actually generate code to allocate the declared object. That step is postponed until the first executable statement is reached. The global variable declared accumulates a running total of the number of bytes that have been declared--the number that must be allocated. Notice that it gets reset to zero by compound() at the beginning of a block. Then as each object is declared, it gets incremented by the size of the object. Finally, when statement() sees something other than a declaration, the statement

		gen(ADDSP, csp-declared);

generates code to adjust the stack so as to reserve enough space for the variables. It also sets declared to -1 as a signal to declloc() that further declarations are to be rejected with the message "must declare first in block."

Compound()

Compound() parses compound statements. The essence of this function is captured in the lines

		++ncmp;
		while(match("}") == 0)
			if(eof) {
				error("no final }");
				break;
				}
			else statement();
		if(--ncmp && lastst != STRETURN && lastst != STGOTO)
			gen(ADDSP, savcsp);

The global variable ncmp indicates the level of nesting of compound statements. Notice that it gets increased before the statements between the braces are parsed, and decreased afterward. On entry, the opening brace has already been recognized and bypassed, so the next thing should be the first statement. The loop continues until the closing brace or the end of input text has been reached. Since the program should not end in an unclosed block, "no final }" is issued in that case. With each iteration of the loop, statement() is called recursively. Notice that, since statement() calls compound(), any of these inner statements may itself be compound, causing other instances of this same loop to be activated within a single iteration of this one. A closing brace will terminate the innermost instance of the loop, thereby properly matching the last open brace--a happy consequence of recursion.

Now, what happens before and after the main loop? First, csp and locptr are saved so they can be restored after the block is parsed. The statements

		savcsp = csp;
		savloc = locptr;
accomplish this. Also, declared is set to zero so that local declarations will be enabled.

After parsing the block, the stack may require adjusting to deallocate any locals that were declared, the local symbol table is purged of symbols that were declared in the block, and declared is set to -1 to ensure that locals cannot be declared until another block is opened.

There is no need to generate stack restoring code, however, if the last unconditional statement in the block is a return or goto. In the first case, the stack adjusting code is generated before the return and in the second it is not needed because it cannot be reached. So lastst is checked and

		gen(ADDSP, savcsp);

is performed only if necessary. Whether or not code is generated to deallocate locals, csp is restored to its original value for further parsing.

It would be tempting to purge locals from the symbol table simply by means of

		locptr = savloc;

That would purge everything declared in the current block and leave everything declared in superior blocks. However, it would wipe out labels along with objects. And that would not do since labels must be seen throughout a function's body. Therefore, before restoring locptr, a sweep from savloc to locptr finds each label entry, moves it to savloc and advances savloc over the relocated entry. Only then is

              locptr = savloc;

allowed to discard local objects that were declared in the current block.

		Doif()

This function and the next three are very similar. They do essentially the same things but in different sequences. Therefore, we shall examine this one very closely and then simply describe differences in the others.

The task of doif() is to scan an if statement from left to right, generating assembly code that faithfully represents the statement to the CPU. Figure 25-2 illustrates the general form of the C and assembly code for both forms of the if statement. We shall make one pass through doif() observing how it handles both types of statement--with and without an else clause.

Figure 25-2: Structure of IF and IF/ELSE Statements

First, a label number must be reserved for use in jumping around the first (or only) controlled statement. In Figure 25-2, this label is designated false: although its actual appearance in the output file is _n: where n is the number of the label. The statement

		test(flab1 = getlabel(), YES);

obtains a unique label number for the false label, assigns it to flab1, and calls test() to scan the expression (or list of expressions) in parentheses and generate code to evaluate it and test the result. This code, which appears as EXPR? in Figure 25-2, contains a jump on the false condition to the false label. If the expression evaluates true, however, the jump is not taken and control falls down to whatever follows. Since the next thing in the C syntax is the statement that should execute under the true condition, that statement is parsed by calling statement() to generate code for it (STATEMENT and STATEMENT1 in Figure 25-2). Since it closely relates to expression analysis, the examination of test() has been postponed to Chapter 26.

Now, if the else clause is not present, we only have to generate the false label with

		gen(LABm, flab1);

and return.

However, if an else clause is present, there must be another call to statement(). But we don't want the second statement to execute after the first one, so a jump from the end of the first statement around the second one is needed. Therefore,

		flab2 = getlabel();

reserves another label number for this purpose and

		gen(JMPm, flab2);

generates the jump. Notice, however, that the jump is not always generated. It is not needed if the true statement (or the last statement in it) is a return or a goto. In either case the jump could never be reached, so a bit of optimizing is done here.

Now the false label is generated by

		gen(LABm, flab1);

and the second statement is parsed by calling statement(). Finally, the exit label is generated by

		gen(LABm, flab2);

and the else clause has been parsed.

Notice that calls to statement() are recursive, so any legal statement--even another if--could occur at that point. Try mapping out, as in Figure 25-2, the form of the code generated by

		if(expr1)
			if(expr2) statement1
			else statement2
		else statement3
to show that the else clauses are correctly matched with their antecedents.

As we saw, this is a very simple, straight forward function. It is not difficult to understand if we know what the underlying functions do. This function is typical of the next three; essentially, they differ only in order and complexity, but not in concept.

Dowhile()

Figure 25-3 illustrates the general form of the C and assembly code for the while statement.

Figure 25-3: Structure of a WHILE Statement

Although this function is smaller and more straight forward than the previous one, it does introduce calls to addwhile() and delwhile(). These were discussed in Chapter 20 with the while queue.

Addwhile() establishes a new entry in the while queue to represent the present level of nesting of this control structure. It contains the stack level on entry to the current statement, the number of a target label for continuing the loop, and the number of a target label for breaking out of the loop. These three elements, are also stored in wq[], a local copy of the new while queue entry. Next, the loop label (loop: in Figure 25-3) is generated by

		gen(LABm, wq[WQLOOP]);

Then

		test(wq[WQEXIT], YES);

generates code to evaluate the expression and test the result (EXPR? in Figure 25-3). This contains a jump on the false condition to the exit label. After that, statement() is called recursively to parse the controlled statement. This is followed by an unconditional jump to the loop label. And, finally, the exit label is generated and the current entry in the while queue is deleted.

As we can see in Figure 25-3, the generated code first evaluates the expression in parentheses. If it tests true the controlled statement is executed and the process repeats. When the expression tests false, control goes to the exit point and continues on with whatever follows.

Dodo()

Figure 25-4 illustrates the general form of the C and assembly code for the do statement.

Figure 25-4: Structure of a DO Statement

There is very little difference between this function and dowhile(). Essentially, the order of the expression evaluation and the controlled statement are reversed. The statement is executed first and then the condition is tested. Besides that, there is just the additional requirement of the token while.

Dofor()

Figure 25-5 illustrates the general form of the C and assembly code for the for statement.

Figure 25-5: Structure of a FOR Statement

This function is larger and more complex than the previous ones, but it is really no more difficult to understand, since it contains nothing that we have not already seen. By way of reminder, the proper order for the performance of a for statement is:

  1. Evaluate the first expression.
  2. Evaluate and test the second expression for the exit condition.
  3. Execute the controlled statement.
  4. Evaluate the third expression.
  5. go back to step 2.

As we can see from Figure 25-5, the generated code does just that. Two of the unconditional jumps are necessary because the parts of the for statement are not executed in the order of their appearance in the syntax. Specifically, the third expression is evaluated after the controlled statement. But, since it precedes the statement in the syntax, Small C generates the code for it before the code for the statement. The first two jumps are needed to keep the chronology of events straight.

Any of the three expressions may be omitted, as long as the two semicolons are present. Notice what happens when they are absent. If the first one is missing, there is no code generated for it and the statement begins with the evaluation of the second expression. If the second one is missing, there is no code generated for it or for testing it; and, since the false jump is not there, the missing expression is always interpreted as true. Finally, if the last expression is missing, no code is generated for it so, after the controlled statement executes, the last jump transfers control to the second one which immediately transfers it to the first one. It should be clear why an infinite while loop is preferable.

Doswitch()

Figure 25-6 illustrates the general form of the C and assembly code for a switch statement.

Figure 25-6: Structure of a SWITCH Statement

This is easily the most devious statement that has to be parsed. Since __switch plays a crucial role in the execution of switch statements, reviewing its description in Chapter 18 may be in order at this point.

As with the other statements, code for the various parts of the switch statement is generated in the order of appearance with jumps inserted to keep the order of execution straight. Thus, the expression under test is evaluated first, leaving its value in the primary register (AX). Next a jump skips around the controlled statements to a call to __switch. This routine compares AX to the table of label/value pairs following the call. Each label corresponds to a case statement; its corresponding value is the value of the case expression. The table is terminated with a word containing zero, which cannot possibly be an address. When a match is found--AX equals a case value--a jump is executed from __switch directly to the corresponding label. This is shown in the figure as a jump from a table entry to the target label. If the end of the table is reached without finding a match, __switch simply returns control to the address following the end of the table. In our example, since there is a default statement, a jump to the default label was generated immediately after the table, so control goes there. Had there been no default statement, the jump would not have been generated, and control would have gone to the exit label from which it would have continued on to whatever follows.

Notice that since there is no break following statement1, if the expression evaluates to 15 then both of the first two statements execute. The first of the two jumps to the exit label was generated by the break statement, but the last one is always generated to make control exit properly at the end of the last statement.

With one exception, this function uses the while queue like the previous ones. Since switch statements do not have a continuation point, specifying zero for a loop label prevents a continue statement from seeing this entry in the queue.

Everything between braces (top of Figure 25-6) is generated by a single recursive call to statement() since the body of a switch is really just an ordinary compound statement in which case, default, and break statements are allowed.

Since nesting is possible, there must be multiple instances of three variables--swactive, swdefault, and swnext. When true, swactive allows case and default statements to be accepted. When not zero, swdefault contains the label number of the default prefix. And swnext points to the next available switch table entry. These three variables are declared globally, but are saved locally on each entry to doswitch() and restored on exit. Therefore, they always contain values that are appropriate to any given level of nesting.

Recall from Chapter 20 that the purpose of the switch table is to store label/value pairs that are created by case statements. They must be retained until the end of the body of the switch is reached, at which point they can be generated in the output. Because of nesting, pairs from several different nesting levels may be present in the table at one time. However, the lowest level entries are at the end of the table from which they are released as unnesting occurs. By restoring swnext to its value when doswitch() was entered, only those entries at the lowest level are released.

Docase()

Docase() is a small function. Basically it does three things. It generates a label for the case statement. It evaluates the constant expression associated with the case statement. And it places both in the next available entry of the switch table. As we saw, these label/value pairs accumulate in the table until doswitch() dumps them at the end of the switch statement.

Four errors are caught. If swactive is zero, "not in switch" is issued. If the switch table overflows, "too many cases" is issued. If the case expression does not yield a constant, "must be constant expression" is issued. And if the terminating colon is not found, "missing token" is issued.

Dodefault()

Basically, this function only generates a default label and saves its number in the global variable swdefault, telling doswitch() to generate (after the label/value table) a jump to the default label.

Three errors are detected. If swactive is zero, "not in switch" is issued. If swdefault is already non-zero, "multiple defaults" is issued. And if the terminating colon is not found, "missing token" is issued.

Dobreak()

Dobreak() generates a jump to the exit label of the innermost while, for, do, or switch statement as indicated by the last entry in the while queue. First, it calls readwhile() to obtain a pointer to the last entry in the while queue. If the queue is empty, "out of context" is issued and the jump is not generated since there is no active control statement. That failing, code to adjust the stack back to its original value is generated, after which a jump to the exit label is also generated.

Docont()

Docont() does the same thing as dobreak() except for two differences. First, it targets the loop label instead of the exit label. Then, since the switch statement does not have a loop label, this function must search backward through the while queue for the last entry with a loop label number. This is done by

		while(1) {
			if((ptr = readwhile(ptr)) == 0) return;
			if(ptr[WQLOOP]) break;
			}

Since readwhile() returns the address of the entry preceding the one passed to it, each iteration backs up one entry until an entry with a loop label number is found or readwhile() issues "out of context" and returns zero.

Dogoto()

Dogoto() is a simple function that generates a jump to a designated label in the current function. If the target label name is legal, addlabel() is called to add the name to the local symbol table and associate with it a label number. Since the target label may precede or follow the goto statement, the label may or may not be in the table already. If it is, addlabel() does nothing; otherwise, it adds the label to the table. In either case, it returns the label number which dogoto() passes to gen() as the target label number. With this arrangement, the programmer's label name never appears in the output. A corresponding numbered label acts as a synonym. This prevents "multiply defined" errors from being issued by the assembler when more than one function contains the same label name.

Notice that there is no attempt to adjust the stack before executing the jump. This is because Small C, by disallowing local declarations except prior to executable statements (and labels), guarantees that the stack level will be uniform throughout the function. This exclusion is initiated by the occurrence of the first goto in the function. On the other hand, if a declaration in a nested block appears first, then goto statements are disallowed. In other words, the two are mutually exclusive; whichever appears first is allowed and the other is excluded.

Four errors are possible. If local variables have been declared in a nested block, "not allowed with block-locals" is issued. If the label name is found in the local symbol table, but the entry is not for a label, "not a label" is issued. If the token following the keyword goto is not a legal symbol "bad label" is issued. And if a semicolon does not terminate the statement, "no semicolon" is issued.

Dolabel()

Dolabel() generates labels at the points where they appear in the function. This function must first decide whether or not the current token in the input line is a label. It begins by passing over white space to locate the token in question. It then saves lptr, the source line pointer, so that if the token is not a label, lptr can be restored to prevent further scanning from missing the current token. If the token is not a legal symbol, it is not passed over and false is returned to statement(). However, if it is a legal symbol, it is copied into ssname[] and bypassed in the source line. If the next token is a colon, then the extracted symbol is taken for a label. Addlabel() is called to add it to the symbol table (if not already there), the numeric form of the label is generated in the output, and true is returned. On the other hand, if the symbol is not followed by a colon, bump() is called to move the current position back to the start of the token, and false is returned.

Doreturn()

Doreturn() recognizes a return statement and generates

		MOV SP,BP
		POP BP
		RET

in the output file. The first instruction is generated only if the return occurs within the scope of local variables. In that case, SP must be adjusted back to its value before the first local object was allocated. Also, csp must be adjusted back to its global value of zero. All of this is done by

		gen(RETURN, 0);

Before that, however, the return expression, if present, is evaluated by calling doexpr() (Chapter 26). If no expression is there, no code is generated. In either case, whatever is in AX becomes the return value of the function.

Notice that csp is preserved across the generation of the return code. This is because the return may be executed conditionally, and so more code may follow. In that case, the stack level on entry to the return statement applies to the following code. If the return is not taken, SP is not adjusted and csp is preserved.

Go to Chapter 26 Return to Table of Contents