CHAPTER 26:
EXPRESSION ANALYSIS

We come now to the last remaining task, the parsing of expressions. This is clearly the most difficult part of the compiler. Therefore, it has been separated from the previous chapter for special treatment.

To help us appreciate the overall organization and operation of the expression analyzer, Figure 26-1 presents a graphic view of the major analyzer functions, the operators they recognize, and the grouping of the operators. On the left side are the functions; the most important ones (in terms of the flow of control through the analyzer) are in boxes for emphasis. The basic flow of control is from the top down. To save space, in many cases the connecting lines have been squeezed out. So, where we see function names stacked above one another, there is an implied call from the upper to the lower function. Recursion is evident where lines of control go from a function back to itself or to level1().

Functions level1() through level14() correspond to the precedence levels of the expression operators, such that level1() recognizes the operators at the lowest precedence level and level14() the highest.

To the right of each function, enclosed in boxes, are the operators which the function recognizes. Each set of operators is preceded by an arrow showing the way the operators associate. Notice that the functions for the only three levels that associate from right to left ( level1(), level2(), and level13() ) perform recursive calls to themselves. This causes the analyzer to scan its way to the right most instance of a run of operators at that level (lower level operators being parsed along the way) before generating code for these operators as the recursion unwinds from right to left.

At the bottom of Figure 26-1 is the function primary() which recognizes the operands upon which the operators work. These appear as identifiers (variables, pointers, and functions), constants, or subexpressions in parentheses. A function name stands for the address of the function. If it is followed by a left parenthesis, it is recognized by level14() as a function call; otherwise, its address figures into the expression. Small C is unique in that it interprets an undeclared symbol to be a function name and places it in the global symbol table as such.

As we saw in Chapter 25, there were several points in the syntax where expressions had to be parsed. For example, the if statement has an expression (or list of expressions) in parentheses which must be evaluated and tested at run time. Therefore doif() contained

		test(flab1 = getlabel(), YES);

to parse the expression and generate code for its evaluation and testing. As we can see from Figure 26-1 test() is one of three lead-in functions for the expression analyzer. Another is constexpr() which is called when a constant expression, such as a case value or an array dimension, is expected. Finally, there is doexpr() which attempts to interpret an entire statement as an expression.

Now, what about those unboxed functions--skim(), down(), down1(), and down2()? Basically, they contain logic which used to be imbedded in the level functions. While that arrangement used less stack space and was somewhat faster, it made the compiler considerably larger by replicating virtually identical logic many times. So this version of Small C has extracted this redundant logic into these pipeline functions through which most levels of the analyzer call other levels.

Figure 26-1: Expression Analysis Functions

An Introductory Example

Before launching right into a study of the analyzer, it should help to first get a feel for the overall flow of control by tracing out the parsing of a sample expression. Consider the program

		int i, j, k;
			main() {
			i = j+k/5;
			}

In this case, the expression is a complete statement, so we enter the analyzer from statement() through doexpr() and expression(), at the top of Figure 26-1. Before anything else, level1() calls level2() through down1() which remembers the staging buffer location in case a constant results. In that case, on return, it would discard everything that was generated by lower functions, on the assumption that a higher function will replace it with a single statement that puts the constant in the primary register.

Level2() immediately calls Level3() by way of down1(). Level3() then immediately calls skim() which in turn immediately calls level4() through down1(). Level4() then, likewise, calls level5() through skim() and down1().

Level5() is the first of eight functions which are virtually identical. It immediately calls level6() by way of down() and down1(). This process continues all the way down to level13().

Notice that until now there has been no attempt to scan the expression. In this function, however, since unary operators may precede an operand, attempts are made to match on any of the seven unary operators (Figure 26-1). That failing, level13() calls level14() directly. Level14() then immediately calls primary() which (1) sees the symbol i in the input line, (2) skips over it, (3) finds it in the global symbol table, (4) passes its table address and data type (int) up the line as a side effect, and (5) returns true (meaning that i has yet to be fetched).

At this point, the calls begin to unnest as the level functions, in reverse order, attempt to recognize the operators to which they are sensitive. As each fails, it returns to the next higher level the value that originated in primary(). As each preceding instance of down1() regains control, it merely returns the value from primary() up the line. Likewise, each instance of down() (which is empowered by the level function above to recognize and act on selected operators) does nothing but return the value from primary() up the line. A quick look at the listing for level5() through level12() makes it obvious that all they do is return the value (in this case true) that they receive. This unwinding continues up through levels four and three with the pipeline function skim() finding nothing to do.

Finally, control arrives back in level1() where the assignment operator (=) is recognized and bypassed. Level1() then takes note of the table address and the data type of i and proceeds to call itself; this is because other assignment operators could appear to the right of this one.

Now we descend all the way down the ladder again and in primary() recognize j. Again, control runs back up the ladder. Only, this time, since the next token is a plus sign, something different happens in the instance of down() called by level11(). It recognizes the plus sign. And, seeing that j has yet to be fetched, it generates code to load it into AX. It then bypasses the plus sign and calls level12() again, but this time through down2() and down1(). Down2() first generates code to push AX (containing j) onto the stack. This preserves it until whatever lies to the right of the plus sign has been evaluated. Then it can be popped into BX and added to the right side.

Again, control steps down to primary() where, this time, k is recognized. On the way back up, the following token (a slash), is recognized by the instance of down() beneath level12(). Seeing that k has yet to be fetched, it generates code to load it into AX. It then bypasses the slash and calls level13() again, but this time through down2() and down1(). Before calling down1(), however, down2() generates code to push AX (containing k) onto the stack. This preserves it until whatever lies to the right of the slash has been evaluated, at which time it can be popped and divided by the right side.

Once again, control goes down to primary() where the number 5 is recognized as a constant. It is converted to an internal integer and passed as a side effect back up the line. Finally, primary() returns false meaning that this operand has already been fetched.

At this point, the constant 5 is being passed to higher functions, the values of j and k are on the stack (more correctly, they would be at run time), and the next token is the semicolon that terminates the statement. As control runs back up the ladder, it reaches the instance of down2() beneath level12() where the value 5 (received as a side effect from primary() ) is seen as a constant. And, since BX is not needed to evaluate the constant, the code that pushes k onto the stack is purged from the staging buffer, and in its place

		MOV BX,AX
		MOV AX,5

(p-codes MOVE21 and GETw1n) is generated. Now, with 5 in AX and k in BX, down2() generates

XCHG AX,BX CWD IDIV BX

(p-code DIV12) which performs the division operation, leaving the quotient in AX and the remainder in DX. (The XCHG AX,BX, which is automatically generated by gen() when it sees DIV12, swaps the operands so they will be in the appropriate registers for the 8086 divide instruction. The CWD instruction converts the dividend in AX to a double word by extending its sign through DX. The IDIV BX instruction performs a signed divide of DX,AX by BX.)

After this, control works its way up to the instance of down2() beneath level11() where code is generated that pops j from the stack into BX. Now with k/5 in AX and j in BX, down2() generates

ADD AX,BX

(p-code ADD12) which adds BX to AX, placing the sum in AX.

Next control propagates all the way up to level1() again, where an attempt to recognize further assignment operators fails and control returns to the previous instance of level1() that had called itself. Here code is generated to store AX in memory where i resides, thereby effecting the assignment operation.

Finally, level1() returns to expression() which returns to doexpr() and the analysis is finished. The result is

	MOV AX,_J		fetch j into AX
	PUSH AX			push j on the stack
	MOV AX,_K		fetch k into AX
	MOV BX,AX		move k to BX
	MOV AX,5		load 5 into AX
	XCHG AX,BX		swap AX and BX
	CWD			sign extend AX into DX
	IDIV BX			divide k (DX,AX) by 5 (BX)
	POP BX			pop j into BX
	ADD AX,BX		add j (BX) to k/5 (AX)
	MOV _I,AX		store j+k/5 (AX) into i

which carries out the desired expression evaluation.

It is interesting to see what happens when the default order of precedence is changed by parentheses as in

		i = (j+k)/5

In this case, after scanning the equal sign, control runs down the line to primary() where the left parenthesis is seen, causing it to directly call level1() for the evaluation of j+k. Then, since the right parenthesis is not recognized, control reverts back up to level1() which, thinking it is finished, returns. However, this time primary() receives control, enforces the right parenthesis and returns control back up the ladder, causing analysis to continue. The result is

	MOV AX,_J		fetch j into AX
	PUSH AX			push j onto the stack
	MOV AX,_K		fetch k into AX
	POP BX			pop j into BX
	ADD AX,BX		add j (BX) to k (AX)
	MOV BX,AX		move j+k (AX) to BX
	MOV AX,5		load 5 into AX
	XCHG AX,BX		swap 5 (AX) and j+k (BX)
	CWD			sign extend AX into DX
	IDIV BX			divide j+k (AX) by 5 (BX)
	MOV _I,AX		store (j+k)/5 (AX) in i

Notice how the placement of higher-precedence operators toward the bottom of the evaluation hierarchy and the lower-precedence operators toward the top ensures that operations are performed in the proper order. Also notice how easy it is allow parentheses to alter the default order by making primary() sensitive to them. We simply treat anything between matching parentheses as an entire expression within the context of a larger expression.

Lets draw from this overview some additional concepts. First, notice that the unary operators (level 13) that precede operands are recognized on the way down the hierarchy, while those that follow are recognized on the way up. By "on the way down" I mean before calling the next lower level, and by "on the way up" I mean after returning from the lower level. This is obvious after thinking about it a bit, however it helps to keep this in mind as we study the analyzer functions. Analogously, the binary operators are recognized on the way up from scanning the left operand; however, since they require an operand on the right side, they must descend from the present level back down the hierarchy to evaluate the right hand operand. Then, on return from that, code is generated to effect the binary operation before returning to the next higher level.

This behavior is reflected in Figure 26-1 by the repeated pattern of the down functions connecting levels 5 through 13. Down() starts the descent to the next lower level. Down1() suffices as the pipeline for the left operand, and down2() followed by down1() serves to handle the right operand if an operator at the current level is recognized. So, the main path is the left path, with the right path being used only to descend one level after a binary operator has been detected. As a mnemonic device, these paths are arranged in Figure 26-1 on the left and right under down() according to which operand is being sought. As we shall see later, the mechanism by which information is passed--through side effects--from primary() to higher functions must be duplicated so that information about both operands entering into a binary operation is available to down2() for analysis.

The Lead-in Functions

Having established a feel for the flow of control, we now look into the actual functions which constitute the analyzer. First, we examine the lead-in functions doexpr(), constexpr(), and test().

Doexpr()

Doexpr() is called when statement() cannot recognize a keyword, the assumption being that it must have an expression statement. Doexpr() is a small function that does just two things. In case there is a string of expressions, it checks for a comma after every call to expression() and, finding one, loops back for another call. It also manipulates the staging buffer by calling setstage(), to set the staging buffer to its initial position, before expression() is called. Afterward it calls clearstage() to optimize and dump the buffer to the output file.

Constexpr()

Constexpr() is called when the syntax demands a constant expression. This is often simply a constant--numeric or character--but it may be any legal expression resulting in a constant value. Unlike doexpr(), this function does not expect or handle a list of expressions; so it only looks for one. But, like doexpr(), it sets and clears the staging buffer. There is a major difference, however. When it calls clearstage() it passes a zero for the second argument, meaning that the staging buffer is to be cleared but not dumped. In other words the code generated by analyzing the expression is discarded.

This seemingly strange behavior is understandable when we consider two things. First, constant expressions must sometimes be evaluated for compile-time use (e.g., array dimensions), in which case it would be inappropriate to generate code for run-time execution. And, even when the expression occurs in a run-time context, it is more efficient to generate a single instruction to load the constant value than to generate a set of instructions that perform a stepwise evaluation of the expression.

So, expression() has been written to return two items as side effects: the value of the expression and either true or false depending on whether or not that value is a constant. Constexpr() tests the first of these to see if the expression was indeed constant as expected. If it was not, then "must be constant expression" is issued. Finally, it returns the value of the expression. Of course, this is a bogus value if the message was issued.

Expression()

Expression() is one of two functions that initiate expression analysis by calling level1(). Besides calling level1() and returning the two items mentioned above, it does one very important thing. It declares locally an array of seven integers into which level1() and subordinate functions place vital information about the expression. This is the first instance of the, soon to be familiar, is[] and is2[] arrays. These arrays are the means by which, starting with primary(), information is returned by side effects up the hierarchy. The term is in these array names is not an acronym, but the word. These arrays tell the analyzer what the entire expression or any part of it is. In other words, they carry the properties of whatever has been parsed at every point in the process.

This first instance of the array is for the left leg of the descent through all of the levels. Another (is2[]) is declared--for use with the right operand--at each level where a binary operator is recognized. Before control rises to a higher level, the left array is adjusted to reflect the result of the operation, and the right, being local and having served its purpose, is discarded. At lower levels, the right array appears as a left array, and other right arrays are declared as other binary operators are encountered. Figure 26-2 illustrates the points where these arrays are declared during the analysis of the expression

 		i=j+k/5;

It shows a parse tree for the expression with the left and right arrays written in where they are created.

Figure 26-2: Is[] Arrays for i=j+k/5;

So, what kinds of information is carried in these arrays? We shall now see; however, the usefulness of this information will not become clear until the subordinate analyzer functions are studied. Nevertheless, for an introductory look and for future reference, a summary of the contents of the is[] arrays, is provided in Table 26-1.

Table 26-1: Is[] Array Contents

A brief description of each element follows:

Is[ST] contains the address of the symbol table entry that describes the operand. Its purpose is to allow information from the symbol table to be used in generating code that accesses the operand. Since constants are not represented in the symbol table, however, this element sometimes contains zero. Obviously, this information could have value only with respect to a primary operand. Therefore, when a primary operand combines into some larger entity, this element loses its significance and so is reset to zero.

Is[TI] contains the data type of indirectly referenced objects. These are objects which are referenced by way of an address in a register. This includes (1) function arguments, (2) local objects, and (3) globally declared arrays. In the first two cases, the address is calculated relative to BP, the stack frame pointer. In the third case an array element's address is calculated relative to the label which identifies the array. Static objects other than arrays, on the other hand, are directly referenced by their labels, so they produce zero in this element. In some cases--e.g., an array name without a subscript or a leading ampersand (&)--the address is all that is desired, and so no action is taken based on this information.

Is[TA] contains the data type of an address (pointer, array, &variable); otherwise, zero.

Is[TC] contains the data type (INT or UINT) if the (sub)expression yields a constant value; otherwise, zero. The unsigned designation applies only to values over 32767 that are written without a sign (Chapter 3).

Is[CV] contains the value produced by a constant (sub)expression. This element has alternate uses when it is known that the (sub)expression is not a constant.

Is[OP] contains the p-code that generates the highest binary operator in an expression. Its purpose is to determine, in the case of expressions of the form

left oper zero

(where left is the left subexpression, oper is a binary operator, and zero is a subexpression which evaluates to zero) which form of optimized code to generate. This element is used by test(). Is[SA] contains the staging buffer address where code that evaluates the oper zero part of an expression of the form

left oper zero

is stored. If not zero, it tells test() that better code can be generated and how much of the end of the staging buffer to replace.

Test()

Of the three lead-in functions, test() is the most complicated. Its purpose is to generate code to evaluate and test an expression (or expression list), for true or false. If an expression list is given, the expressions are evaluated from left to right, with the last expression determining the outcome.

This function receives two arguments--a label number for the false branch and a Boolean value indicating whether or not to require parentheses around the expression. As with expression() above, this function calls level1() and so must allocate an initial is[] array.

It then calls need() to enforce an open parenthesis (if told to). Next it falls into a loop setting the staging buffer, calling level1(), and checking for a comma. Each comma means that another expression follows, so it dumps the staging buffer to the output file and continues the loop. After the loop (if told to) it enforces a closing parenthesis.

Finally, before dumping the code for the last expression in the list, test() attempts to do some optimizing. If the expression yields a constant (is[TC] is true) then there is no need for code to evaluate the expression. If the constant value in is[CV] is true, the false jump can never be taken, so test() simply returns without generating anything; actually, clearstage() is called to delete the code that was generated by the last expression. The effect on the program is that control simply goes directly into the portion of the statement that executes under the true condition. However, if the expression is false, an unconditional jump to the indicated label is generated.

Even though the last expression in the list does not yield a constant it may still be optimizable. If its form is

left oper zero

where oper is a relational operator, then instead of generating the normal code for testing the expression,

OR AX,AX JNE $+5 JMP label

(p-code NE10f), it is possible to take advantage of the fact that the 8086 CPU has several instructions for comparing values to zero. By replacing the code for the

oper zero

portion of the expression with these instructions, less and faster code results. So, if is[SA] is nonzero and is[OP] contains the p-code for one of the relational operators, zerojump() is called to replace the unwanted part of the expression with better code. The preferred p-code is passed to zerojump() as an argument.

For an example of the improvement this provides, see Listing 19-23. Table 26-2 shows the original p-codes and the ones that replace them. Most of the substitution p-codes in this table have the form

xx10f

where xx designates the type of comparison (e.g., EQ for equal), 1 refers to the primary register (AX) which is to be tested, 0 designates the value against which it is to be tested, and f means that a jump is generated on the false condition.

Notice that LE12u maps to EQ10f since unsigned values, by definition, cannot be less than zero, making equality with zero the only possibility. Also, GT12u maps to NE10f for a similar reason. The case of GE12u receives the ultimate optimization; no code is generated, since unsigned values are always equal to or greater than zero. Finally, the p-code LT12u maps to an unconditional jump to the false label since unsigned values can never be negative.

Table 26-2: Original and Optimized Relational Test P-codes

The Pipeline Functions

Now we look at what, for lack of a better term, I call the pipeline functions. The ones through which the level functions call each another. Their purpose is to collect in one place logic that would otherwise have been replicated in many level functions. It is helpful to view these functions an extension of the level functions that call them.

Before going into these four functions, however, we shall consider a few techniques which they have in common. First, level3() through level12() do not directly scan for the operators at their precedence levels; they relegate that common task to the underlying instance of skim() or down(). They do this by passing a string containing a list of operators to the called function. These functions then pass it on to nextop() which compares the current token against each operator in the list until it either finds a match or exhausts the list. It returns true on success, and false on failure. It also sets two global variables for use by the caller. Opsize is set to the length of the token, if one is recognized; this is passed by the calling function to bump() when it is time to bypass the token. Opindex is set to designate which operator in the list matched--zero for the first, one for the second, and so on.

Another common technique is the means by which the pipeline functions know which hierarchy level to call next. This is accomplished by having the calling level function pass to the first pipeline function the address of the next level function. It simply passes the function name as an argument. (Recall that a naked function name evaluates to the address of the function.) This is passed from one pipeline function to the next until down1() finally calls the next level with the statement

		k = (*level)(is);

in which k receives the Boolean value indicating whether or not the subexpression parsed by the call is an lvalue, level is the address of the target function, and is is the address of the array for conveying to higher levels the attributes of the subexpression being parsed.

Skim()

This function is only called when the || or && operators are sought--levels 3 and 4 respectively. Essentially this is an embellished version of down() that generates additional code to jump over the remainder of a series of the specified operators when the outcome is known. For instance, the outcome of

		(a > b) || (c == d) || (e != f)

is known if

		(a > b)

is true. The remaining subexpressions need not, indeed must not, be evaluated. The C language guarantees that these trailing subexpressions will not be executed, and many programs depend on this behavior of the logical operators.

Similarly, the outcome of

		(a > b) && (c == d) && (e != f)

is known if

		(a > b)

is false. So the purpose of this function is to skim from left to right across a given level of the expression, ensuring that this rule is enforced. Skim() falls into a loop in which it first calls down1() to parse a subexpression--perhaps only a single primary operand. It then looks to see if the next token is the one being sought (|| or &&). If not, it simply returns the value it got from down1(). If it is, however, three things occur:

the operator is bypassed,
if this is the first instance of the operator in the present series, a number for the dropout label is allocated, and
the function dropout() is called to generate the code that tests for the dropout condition.

Dropout() first, if necessary, generates code to either fetch the subexpression value (an unfetched lvalue), or to load it (a constant) directly with p-code GETw1n. It then generates either EQ10f (for ||) or NE10f (for &&) to jump to the dropout label under the right condition. Remember that these codes produce a jump on the false condition. The specific p-code that applies is determined by the second argument that skim() receives.

Finally, when the series ends:

dropout() is called for the last subexpression,
code is generated to load AX with the outcome (zero or one depending on the fourth argument) when none of the subexpressions indicates otherwise,
an exit label number is allocated and a jump to it is generated,
the dropout label is generated,
code is generated to load AX with the outcome when one of the subexpressions determines it,
the exit label is generated,
appropriate elements of is[] are reset, and
zero (meaning that nothing needs to be fetched) is returned.

For an example of the code generated by this sequence, see Listing 19-18.

Down()

Down() is much simpler than skim(). It is called by each level of the hierarchy that looks for binary operators. It first passes control down the left leg (by way of down1() ) to parse an operand that might be on the left side of a binary operator. On receiving control again (the operand has been parsed), if it sees one of the anticipated operators, it passes control down the right leg (by way of down2() then down1() ) to parse the right hand operand and to generate code for the operation itself. The first call

		k = down1(level, is);

is simple enough and uses only techniques already described. Also, the examination of the current token to see if it is one of the anticipated operators

		if(nextop(opstr)==0) return k;

is simple. If one of the anticipated operators is not found, k is returned to the next higher level.

However, if nextop() returns true, the operator in the list (opstr) indicated by opindex has been identified. In this case, fetch() is called to generate code to load the left operand (if an unfetched lvalue) into the primary register, and control falls into a loop. The loop bypasses the matched operator then calls down2() to evaluate the right operand and generate code for the matched operator. This continues until the operators at the current precedence level (and at the current level of parenthesized nesting) have been exhausted. Finally, zero is returned to the next higher level, indicating that nothing needs to be fetched (down2() fetched the right operand and applied the operator).

Only one thing about this function is not obvious--the way down2() is told which operator is being parsed. To understand this, we must recall that there are two global integer arrays op[] and op2[] which contain p-codes for the signed and unsigned binary operators respectively. Except for seven elements, both arrays are the same. The p-codes are assigned to the array elements according to their level in the parsing hierarchy as indicated in Table 26-3. Within each level they are ordered the same as they appear in the operator lists passed to down(). This is important as we will see shortly. Some operators are not represented in these tables because some level functions do not call down().

Each time down() calls down2(), one of the operators in its list (its first argument) has been recognized, and the corresponding entries (p-codes) from these arrays are passed to it. Now, as we saw, nextop() sets the global integer opindex to designate which operator in the list of operators was found. By design, these lists follow the order in Table 26-3. Finally, notice that down() also receives an argument which is the offset into these arrays of the first operator in the list. By adding this offset opoff to opindex a subscript into the arrays for the matched operator is created. This is how down() determines the p-codes to pass to down2().

Table 26-3: Contents of the Op[] Arrays

One last thing to note about down() is that when an operator has been recognized, down() declares a new is array, called is2[]. It passes this array, along with is[] (which it received as its fourth argument) to down2() which must be able to compare the separate attributes of the left and right operands.

As long as down() keeps recognizing operators in its list (operators at the same precedence level) it keeps on calling down2() in this way. This is what establishes the left to right association of the operators which down() parses. When no more operators in its list are seen it returns zero, indicating that no operand fetch is pending.

Down1()

Not much needs to be said about this function. It receives the address of the target level function, remembers the current position in the staging buffer, calls the target function, and, if is[TC] indicates that a constant value resulted, reverts the staging buffer back to its original position, thereby throwing away the code that was generated. This discarding of the generated code keeps happening with the application of each operator as long as constants result. If an operation does finally produce a non-constant result, then either skim() (through dropout()) or down2() generates one instruction to load the constant before it enters into the operation. If the entire expression yields a constant then level1() generates an instruction to load the value into the primary register--the only code produced by the expression.

Down2()

Down2() is probably the most difficult analyzer function. It cannot possibly make sense without an understanding of the material presented thus far. At this point, though, we should be ready to for it.

Remember that this function is called by down() when a binary operator has been recognized. (An exception is the call from level14() which we shall not consider here.)

Besides the signed and unsigned p-codes mentioned above (oper and oper2) and the target level (level), down2() receives two is arrays--is[] containing the properties of the left operand which has already been parsed and is2[] which will receive the properties the right operand. Is2[] was declared locally in the preceding instance of down() and will be deallocated when it exits. This temporary array is passed to the target function, after which it appears to lower instances of down2() as is[]. Since is2[] is temporary, down2() must indicate in is[] the properties of the subexpression resulting from this binary operation.

Since this function is so large and obscure, a pseudocode version is presented in Listing 26-1. This listing follows the function exactly, making it easy to read the pseudocode as commentary on the function. Further explanation is given below; before that, however, some of the terms, abbreviations, and conventions used in Listing 26-1 need explaining.

The terms primary and secondary refer to the primary register (AX) and secondary register (BX) respectively.
The term constant (abbreviated c) stands for a constant operand which may be the left or right subexpression or the result of the operation.
The term variable (abbreviated v) stands for a variable operand which may be the left or right subexpression.
The term address (abbreviated a) stands for an address operand which may be the left or right subexpression. This could be a pointer, an array name, the result of the address operator (&), or the result of adding to or subtracting from an address.
The term int address (abbreviated ia) stands for an integer address which may be the left or right subexpression. This is the address of an integer, as opposed to a character.
The term left (abbreviated l) stands for the left subexpression. It asserts nothing about its attributes.
The term right (abbreviated r) stands for the right subexpression. It asserts nothing about its attributes.
The term both stands for both left and right subexpressions. It asserts nothing about their attributes.
The term zero (abbreviated z) stands for the constant value zero.
The term op stands for the operator that has been recognized and whose subexpressions are being parsed.
The term result stands for the outcome of the operation in question (l op r).
The term pass refers to the act of passing information up to higher parsing levels by way of side effects in the left is array. Where this occurs, the comment column indicates which element of the array is used.
The plus sign (+) stands for the addition operator.
The minus sign (-) stands for the subtraction operator.
The plus and minus signs together (+-) stand for either an addition or subtraction operator.
The use of parentheses indicates that the symbols within show the form of the subexpression surrounding the operator.
Finally, the use of capitalization refers to the generation of code that performs the indicated actions at run time. Lowercase actions, on the other hand, occur at compile time. The presence of uppercase actions means that the code to accomplish them is being generated (or modified) at the indicated point in the algorithm.

           PSEUDOCODE                                  COMMENTS 
    
     1 save stage current address                       before, start 
     2 assume result will not be (l op z) or (z op r)   is[SA]=0 
     3 if left == constant                              (c op r) 
     4     parse right into PRIMARY 
     5     if left == zero                              (z op r) 
     6         pass current stage address               is[SA]=snext 
     7     if right == int address                      double() 
     8     and op == add or subtract 
     9         double constant                          (2*c +- ia) 
    10     LOAD CONSTANT INTO SECONDARY                 GETw2n 
    11 else                                             (v op r) 
    12     PUSH LEFT ONTO STACK                         PUSH1 
    13     parse right into PRIMARY                     down1() 
    14     if right == constant                         (v op c) 
    15         if right == zero                         (v op z) 
    16             pass original stage address          is[SA]=start 
    17         purge PUSH instruction and adjust csp 
    18         if op == add                             commutative 
    19             if left == int address               double() 
    20                 double constant                  (ia +- 2*c) 
    21             LOAD CONSTANT INTO SECONDARY         GETw2n 
    22         else                                     not commutative 
    23             MOVE PRIMARY TO SECONDARY            MOVE21 
    24             if left == int address               double() 
    25             and op == add or subtract                 
    26                 double constant                  (ia +- 2*c) 
    27             LOAD CONSTANT INTO PRIMARY           GETw1n 
    28     else                                         (v op v) 
    29         POP LEFT INTO SECONDARY                  POP2 
    30         if left == int address                   double() 
    31         and op == add or subtract  
    32         and right == not address 
    33             DOUBLE PRIMARY (RIGHT)               (ia +- 2*v) 
    34         if right == int address                  double() 
    35         and op == add or subtract 
    36         and left == not address     
    37             DOUBLE SECONDARY (LEFT)              (2*v +- ia) 
    38 if op == binary 
    39     if left or right is unsigned 
    40         select unsigned operation                oper=oper2 
    41     if both constants                            (c op c) 
    42         pass constant designation                is[TC] 
    43         pass constant value                      is[CV]=calc() 
    44         purge RIGHT CODE 
    45         if unsigned result 
    46             pass unsigned constant designation   is[TC]=UINT 
    47     else 
    48         pass variable designation                is[TC] 
    49         OPERATION                                gen(oper,0) 
    50         if op == subtract 
    51         and both int addresses                   (ia - ia) 
    52             DIVIDE PRIMARY (RESULT) BY 2         integers between 
    53         pass operator p-code                     is[OP]=oper 
    54     if op == add or subtract 
    55         if both are addresses                    (a +- a) 
    56             pass <result not an address>         is[TA]=0 
    57         else 
    58             if right == address                  (? +- a) 
    59                 pass right table address         is[ST]=is2[ST] 
    60                 pass right indirect data type    is[TI]=is2[TI] 
    61                 pass right address data type     is[TA]=is2[TA] 
    62             else  
    63                 pass left by default 
    64     if left not in symbol table 
    65     or right is in symbol table and is unsigned 
    66         pass right symbol table entry            is[ST]=is2[ST] 
    67     else 
    68         pass left symbol table entry or zero     default

Listing 26-1: Pseudocode Representation of Down2()

Now, with these preliminaries out of the way, we examine Listing 26-1. First, the current stage address is saved so that code generated beyond this point can be purged from the staging buffer. The next line sets is[SA] to zero as a default assumption that the subexpression will not have either of the forms (l op z) or (z op r) which can be optimized within test(). If this assumption proves to be wrong it will be changed later.

Next there is a check (line 3) to see if the left operand is a constant. (Recall that the left operand has already been parsed and is[TC] indicates whether or not it is a constant.) If so, we have a subexpression of the form (c op r) and the code that would have evaluated the constant has already been purged from the staging buffer by down1(). So at this point we know that the left side is a constant and we have its value in is[CV], but no code for it exists yet. The important thing here is that, since there is no code yet for the left operand, there is nothing to save on the stack with a push (as in line 12).

In this case, the right operand is parsed (line 4) to generate code which evaluates it. After that the primary register (at run time) contains its value. Now, before generating code for the operator (line 49), if left is zero, the current stage address (after right, but before left and op are generated) is passed through is[SA] for use by test() in its optimizing efforts. Now, if right is an integer address and the operation is addition or subtraction, the constant in is[CV] is doubled by shifting it left one bit and (doubled or not) code to load it into secondary is generated. This (lines 7-11) all occurs in the source statement

		gen(GETw2n, is[CV] << double(oper, is2, is));

The reason for doubling the constant is that it must displace the address by the indicated number of objects (integers), and there are two bytes per integer. After this, control resumes at line 38.

Lines 12-37 similarly look for other cases requiring special handling; beginning with line 12, the subexpression must be some variant of the form (v op r). Since left is a variable, it must be preserved on the stack (line 12) at run time so that the primary register can be used in evaluating right. Since left must be in the secondary register when the operator is applied, it would be tempting to simply move it from primary to secondary before parsing right. However, right, may require the use of secondary, so we cannot safely do that. Therefore, we push it onto the stack, intending to pop it into secondary just before the operator is applied.

Now right is parsed (line 13), after which is2[] can be tested to see what it produced. If it turns out to be a constant (line 14) lines 15-27 are effective. Furthermore, if the constant is zero (line 15), the stage address on entry to this function is passed up through is[SA] where, as we saw earlier, test() may eventually use it to replace everything generated in this instance of down2() with optimized code. Since left is not a constant, code for it was already in the staging buffer before entering this function, so is[SA] designates the point in the buffer immediately after the calculation of left. Therefore, test() will not purge that code. Now, since right is known to be constant, the push instruction that was just generated is purged from the staging buffer (line 17) since right can be loaded directly into the appropriate register at the right time. Furthermore, since left is currently (at run time) in primary where it was originally calculated, it can simply be left there or moved to secondary, depending on where it is needed; no push/pop sequence is needed. The compiler-relative stack pointer, csp, is also increased by two to account for the fact that the purged push had decreased it by two.

Now, if the operation at hand is commutative, it doesn't matter which register the operands are in when the operator is applied. Thus, there is no need for code to move left to secondary where it would ordinarily need to be; it can stay in primary and right can be loaded directly into secondary. First, however, the constant will need to be doubled if left is an integer address and op is addition. This is all shown in lines 18-21. The only commutative operation recognized is addition.

On the other hand, if op is not commutative (line 22), right (a constant) must be loaded into primary (line 27); but left must first be moved to secondary (lines 23). Also before loading the constant, it may have to be doubled (lines 24-26) as before. Finally, code to load right into primary is generated. From here, control resumes at line 38.

Lines 29-37 show what occurs when both left and right are variable. First, left is popped from the stack into secondary where it needs to be when the operator is applied. (Recall that primary contains right which was just parsed.) Then, if necessary, code to double left or right is generated. This is because the opposite side is an integer address, the doubled side is not an address, and the operation is either addition or subtraction--conditions which are verified by double(). This doubling action differs from that performed on constants, in that it occurs at run time, not compile time. This is done to right in line 33 and to left in line 37.

The remainder of down2() generates the code that actually performs the designated operation; but only if the operator is a binary operator--argument oper is not zero. The purpose of this check is to exclude the simple assignment operator (=) from level1() and the left bracket and left parentheses from level14(). A quick look at these functions will show that they pass zeroes for oper and oper2 in the cases mentioned. The other unary operators (in level13()) do not go through this function and so present no problem. The code for these excluded operators is generated in the level functions themselves. The distinction between the simple assignment (=) and the other assignment operators (like +=) is because the latter operators are really a combination of two operations--first a binary operation, then an assignment.

First, a choice is made between the signed and unsigned versions of the operation. In most cases, these are the same, but in seven cases they are different (Table 26-3). Oper2 is the unsigned operator's p-code. If either operand is unsigned, an unsigned operation is to be performed. This is effected by copying oper2 to oper (initially the signed p-code) which is used later to generate code for the operator. The function nosign() is called once for each operand, to determine whether or not it is unsigned. It is considered to be unsigned if it is (1) an address, (2) an unsigned integer constant (greater than 32767), or (3) an unsigned variable (integer or character).

Now, if both operands are constants calc() is called to calculate (at compile time) the result and pass it up through is[TC] and is[CV] (lines 42-43). Then the code generated by right is purged (line 44). What happened to the code for left, since it too must be purged? The answer is that the lowest instance of down1(), through which left was parsed, purged the code when it saw that left was a constant. So the code for left was not even in the staging buffer when down2() was entered to evaluate right.

At this point, we must make sure that if either constant is unsigned, the result is likewise designated unsigned. Is[TC] indicates this already for left and, since that is also where the result must be indicated, it follows that we need only inspect is2[TC]. If it is found to be unsigned, forcing that designation in is[TC] indicates the true result. There are only two possible values for constant types--INT and UINT. So, if is2[TC] contains UINT the same value is forced into is[TC].

Now, if either side is variable, then a variable result must be calculated (at run time) by code that applies the pertinent operator to the two registers which are now set up properly. This code is generated (line 49) by the statement

		gen(oper, 0);

The passing of the constant or variable designation of the result (lines 42 and 48) is accomplished coincidentally with the decision that the result is or is not a constant (line 41). Therefore, lines 42 and 48 do not occur in the source listing.

Now (at run time) primary contains the result which must be checked for another possible adjustment (lines 50-52). If the current operation is a subtraction and each operand was an address, then the result is interpreted as the number of objects between them. Furthermore, if the addresses point to integers, then the difference must be divided by two (line 52).

Note that senseless combinations of addresses are possible, such as when one address points to integers and the other to characters, or when the addresses refer to different arrays (makes sense only by presuming a knowledge of how the compiler allocates storage). Small C does not try to deal with these, it only verifies that both addresses point to integers and generates the code to adjust the result. This covers the case of elements in a single array, and allows for pointers of the same type to be used in ways that make sense. If the data types are mismatched, no adjusting code is generated.

Finally, if addition or subtraction (the only two sensible address operations) is being performed, we must to tell the higher parsing levels whether or not the result is an address. If both sides are addresses then the result should not be considered an address, so is[TA] is set to zero (line 56). That failing, then one side or the other or perhaps both are not addresses. Here again, since is[], which classifies the result, already has left classified, we only need to inspect right to see if it should override what is already there. Thus, if right is an address, then left must not be, and so we have the form (? +- a) which means that the result must be classified as an address. In that case, as lines 59-61 show, three attributes from right are passed up the hierarchy--ST, TI, and TA. In all other cases, these attributes are passed by default from left. This logic makes sure that the forms (a +- r) and (l +- a) pass the attributes of the address to higher parsing levels.

Note that adding two addresses yields nonsense. Anything we say about it is wrong, so there is no point in going to the trouble of excluding that case; it would only make the compiler larger and it would still produce a lie. Small C declares that such a result is not an address.

The Precedence Levels

Level1()

This function is small but subtle. It recognizes and generates code for the operators at the lowest precedence level, the assignment operators: =, |=, ^=, &=, +=, -=, *=, /=, %=, >>=, and <<=. Since these operators group from right to left the assignments must be made in the reverse order from which the expression is scanned. Therefore, after parsing the left subexpression, if one of these operators is recognized and if the parsed subexpression produced an lvalue, then level1() calls itself again. On parsing the second subexpression, the cycle repeats and continues to repeat until the series of assignments ends. At that point, store() is called to generate code for the right most assignment, then control returns to the previous instance of level1(). Assignment code is again generated and control again returns to the prior instance. This continues until the initial instance of level1() returns to one of the lead-in functions or to primary() or level14() (by way of down1() and down2() ).

As we have already pointed out, level1() first calls level2() to parse whatever comes first (or next) in the expression. Since the lower lying functions do not recognize the assignment operators, control returns when an assignment operator (or the end of the expression or something unrecognizable) is reached. The attributes of the parsed subexpression are then found in is[]. If is[TC] indicates that the subexpression yielded a constant, then no code has been generated yet, so GETw1n is generated to load the constant into the primary register.

Now tests are made for one of the assignment operators. It may be that none are found; if so, control returns to the caller. However, if an assignment operator is recognized, the local variable oper is set to the p-code for the binary operation that is to be performed before assignment. Likewise, oper2 is set to the unsigned version of the same p-code. This identifies the operation to down2(), as we have seen. The simple assignment (=) is a special case, because no other operation is implied; in that case, zero is assigned to oper and oper2. This causes down2() to generate no code for a binary operation; it simply returns control back here, where code for the assignment is generated.

If an assignment operator is recognized, a check is made to ensure that the left subexpression is an lvalue, meaning that assignment is possible and legal. That failing, "must be lvalue" is issued and control returns.

Assuming the target is an lvalue, two of its attributes, ST (address of the symbol table entry) and TI (data type if the reference is indirect), are saved in a local array is3[] for use in making the assignment. This is because is[] will be altered when the right side is evaluated.

Next, two general cases are considered--the case of an indirect reference to the target (by means of an address in the primary register) and the case of a direct reference (by means of a label).

If the target reference is indirect, then the primary register contains its address which must be preserved while the right side is being calculated. This is handled automatically by down2() as we have already seen. However, if a binary operation is being performed, the original value of the target must be fetched into the primary register before the right side is evaluated, and that would destroy the address. So the address is first pushed onto the stack. Then after evaluating the right side and performing the binary operation, the address is popped to the secondary register where the assignment code generated by store() expects to find it. Of course, if only a simple assignment (=) is being performed, the push/fetch/pop sequence is not necessary; in that case, down2() sees to it that the target address migrates to the secondary register.

Direct references are simpler since, in those cases, the code generated by store() makes use of a label to locate the target. If a binary operation applies, as before, the original value of the target must be fetched; however, there is no need to protect a target address as before. So down2() is simply called to generate code that evaluates the right side and applies the operation in question. On the other hand, if the operation is a simple assignment (=) there is no need to fetch the initial value of the target or to go through down2(), so level1() calls itself directly to parse the right side.

Finally, before returning, level1() calls store() to generate code that stores the primary register in memory at the target address. It passes is3[] which contains information store() needs about the target.

Level2()

This level parses the conditional operator (?:). Recall that this operator has the form

Expression1 ? Expression2 : Expression3

The operation of level2() is quite straight forward. First, it calls level3() by way of down1() to parse the first expression. Then, if the next token is not a question mark, we do not have a conditional operator, so control returns to the caller. By calling level3(), assignment and conditional operators are disallowed in the first expression--unless it is enclosed in parentheses, that is.

On the other hand, if this is a conditional operation, then getlabel() is called to reserve a label number for use in jumping around Expression2, and dropout() is called to generate code to perform that jump if Expression1 is false.

Next, Expression2 is parsed by recursively calling level2() through down1(). Notice that since level2() is called, Expression2 may contain the conditional operator, but not assignment operators. Of course, with parentheses, it can include any operator.

Now, if necessary, fetch() obtains the expression's value from memory, or the expression's constant value is loaded directly.

Next, the second character of the operator (the colon) is enforced by

		need(":");

which complains on not finding it.

Having parsed Expression2, it is necessary now to generate an unconditional jump around Expression3 so that it too will not be executed when Expression2 is selected. As before, getlabel() reserves an exit label number. This is then passed to gen() for use in generating the jump.

Now, before parsing Expression3, we generate the target label for the false jump which we generated earlier by calling dropout(). Expression3 is then parsed just as Expression2 was. And, last of all, the exit label is generated.

This completes the processing of the conditional operator; but it is still necessary to decide what attributes to pass up to the calling level. Three arrays--is1[], is2[], and is3[]--are used in parsing the three expressions respectively. Is1[] is received from the caller, so it must convey the attributes of the result back up the line.

The task of determining the correct attributes is complicated by the fact that either of two separate expressions is selected at run time, but we must decide at compile time how to treat the result in the further process of expression evaluation. Because of this, the two expressions must either have similar attributes, or it must be clear which expression's attributes to choose if they differ.

First, if both expressions (2 and 3) yield constants, the compiler assumes that they have different values, since otherwise there would be no sense in writing the conditional operator. Therefore, since a choice is being made at run time between two different constants, the result is designated as variable. Furthermore, the following properties are asserted: it is not an address, nothing is to be fetched indirectly, and no symbol table entry applies. These are all accomplished by zeroing the TC, TA, TI, and SA entries of is1[].

The next two acceptable cases are where either expression yields a constant. In these cases, the result is given the attributes of the non-constant expression. This is based on the assumption that the constant is a special value of whatever the non-constant expression would otherwise have computed. Typically, these might be an address expression and the constant zero--the null address.

The last acceptable possibilities are that both expressions yield either addresses or non-addresses. In these cases, the choice of attributes is arbitrary since both expressions agree--the attributes of the third expression are chosen.

Should all four cases fail, the message "mismatched expressions" is issued.

Level3() through level12()

After the material already covered, this should be easy. These ten functions are essentially alike and very simple. Each one descends to the next lower level in the parsing hierarchy by way of either skim() or down(). In so doing, it passes a string of the operators to be recognized at the current level, the address of the target level function, the address of the is[] array that it received, and various other arguments.

Basically, these function only serve to direct the flow of control through the central part of the expression analyzer. In so doing, they enforce the operator precedence rules. We have already seen the function that calls them (down1()) and the ones which they call (skim() and down()).

Level13()

This function handles the unary operators: ++, --, ~, !, -, *, &, and sizeof(). Listing 26-2 is a pseudocode version of this function.


        PSEUDOCODE                                      COMMENTS 

     1 if op == pre-increment                              ++ 
     2     parse operand recursively                       level13() 
     3     if operand is not an lvalue 
     4          issue "must be an lvalue" 
     5          return <request no fetch> 
     6     generate rINC1 on operand in memory             step() 
     7     return <request no fetch> 
     8 if op == pre-decrement                              -- 
     9     parse operand recursively                       level13() 
    10     if operand is not an lvalue 
    11          issue "must be an lvalue" 
    12          return <request no fetch> 
    13     generate rDEC1 on operand in memory             step() 
    14     return <request no fetch> 
    15 if op == one's complement                           ~ 
    16     parse operand recursively                       level13() 
    17     if operand is an lvalue 
    18          FETCH OPERAND INTO PRIMARY 
    19     generate COM1 
    20     pass ~is[CV] 
    21     pass zero stage address                         is[SA] 
    22     return <request no fetch> 
    23 if op == logical not                                ! 
    24     parse operand recursively                       level13() 
    25     if operand is an lvalue 
    26          FETCH OPERAND INTO PRIMARY 
    27     generate LNEG1 
    28     pass !is[CV] 
    29     pass zero stage address                         is[SA] 
    30     return <request no fetch> 
    31 if op == unary minus                                - 
    32     parse operand recursively                       level13() 
    33     if operand is an lvalue 
    34          FETCH OPERAND INTO PRIMARY 
    35     generate ANEG1 
    36     pass -is[CV] 
    37     pass zero stage address                         is[SA] 
    38     return <request no fetch> 
    39 if op == indirection                                * 
    40     parse operand recursively                       level13() 
    41     if operand is an lvalue 
    42          FETCH OPERAND INTO PRIMARY 
    43     if operand is symbolic 
    44          pass indirect data type from symbol table  is[TI] 
    45     else pass indirect data type integer            is[TI] 
    46     pass zero stage address                         is[SA] 
    47     pass <not address>                              is[TA] 
    48     pass <not constant>                             is[TC] 
    49     pass <do not fetch if function call>            is[CV] 
    50     return <request fetch> 
    51 if op == sizeof                                     sizeof() 
    52     bypass and remember ( 
    53     default size to 0 
    54     if "unsigned" set size to 2 
    55     if "int" set size to 2 
    56     else if "char" set size to 1 
    57     if size != 0 and "char *" set size to 2 
    58     if size == 0 and symbol and in symbol table 
    59          fetch size from symbol table 
    60     else if size = 0 issue "must be object or type"           
    61     bypass ) if there was a ( 
    62     pass <integer constant>                         is[TC] 
    63     pass size as <constant value>                   is[CV] 
    64     pass <not address>                              is[TA] 
    65     pass <not indirect fetch>                       is[TI] 
    66     pass <not in symbol table>                      is[ST] 
    67     return <request no fetch> 
    68 if op == address                                    & 
    69     parse operand recursively                       level13() 
    70     if operand is not an lvalue 
    71          issue "illegal address" 
    72          return <request no fetch> 
    73     pass address data type from symbol table        is[TA] 
    74     if indirect object reference 
    75          return <request no fetch> 
    76     generate POINT1m 
    77     pass indirect data type from symbol table       is[TI] 
    78     return <request no fetch> 
    79 parse operand at higher precedence level            level14() 
    80 if op == post increment                             ++ 
    81     if operand is not an lvalue 
    82          issue "must be an lvalue" 
    83          return <request no fetch> 
    84     generate rINC1                                  step() 
    85     generate rDEC1     
    86     return <request no fetch> 
    87 if op == post decrement                             -- 
    88     if operand is not an lvalue 
    89          issue "must be an lvalue" 
    90          return <request no fetch> 
    91     generate rDEC1                                  step() 
    92     generate rINC1 
    93     return <request no fetch> 
    94 return <fetch request status from below>

Listing 26-2: Pseudocode Representation of Level13()

Although this is a large function, it is quite straight forward and involves a good bit of repetition. It first attempts to recognize the operators that precede an operand (lines 1, 8, 15, 23, 31, 39, 51, and 68). Those operators are:

++...

If an increment operator is found (line 1), the following operand is parsed (line 2) by calling level13() again. This recursive call allows for the fact that these operands group from right to left, and further occurrences of them (and higher precedence operators) may appear within the subexpression that becomes the operand for the current operator. The resulting operand must be an lvalue; otherwise, it cannot be incremented. If it is not (line 3), the message "must be an lvalue" is issued and control returns. On the other hand, if it is an lvalue, code is generated to increment it in memory (line 6). This action also leaves a copy of it in the primary register for use at the current point in the expression. On return, the caller is told (line 7) that there is no need to fetch anything.

The function step() is used to generate code for increment and decrement operators. It accepts, as its first argument, one of the p-codes rINC1 or rDEC1. The second argument is the address of is[] for the lvalue being stepped. And the third argument is either zero or a second p-code--either rDEC1 or rINC1. First, step() calls fetch(), passing it is[], to obtain the original value of the object to be stepped. Next the first p-code is generated to either increment or decrement the object in the primary register. Then store() is called to generate code for storing it back in memory. It is important to realize that this leaves the adjusted value in the primary register. If step() is called, as here, by a prefix operator, the third argument is zero. This means that there is no need to restore the original value of the object in the primary register. In that case, step() simply returns. However, if the last argument is not zero, it is taken as the p-code for a follow-up operation which is generated before returning.

--...

The decrement operation is accomplished by equivalent logic (lines 8-14).

...

If the operation is a one's complement, the operand is first parsed by calling level13() recursively (line 16). Then, if it yields an lvalue (line 17), code is generated to fetch it into the primary register (line 18). Next, code is generated to perform the one's complement (line 19). Then, if the operand is a constant, its one's complement is performed at compile time in is[CV] (line 20), through which the constant value is passed up the line. Since this is a unary operation, the su bexpression acted on by the operator cannot have the form (left op zero) or (zero op right), so is[SA] is set to zero (line 21) preventing test()--if it is the active lead-in function--from trying to optimize the code in the staging buffer. Finally, zero is returned to the caller (line 22), indicating that there is no need to fetch anything.

!... and -...

The logical NOT and unary minus operators receive analogous treatment in lines 23-30 and 31-38 respectively.

*...

The indirection operator can be deceiving because the only code it generates is a fetch (line 42) of what is supposed to be an address, if the operand is an lvalue (line 41). The real work here is in (1) declaring the operand to be the address for an indirect reference (lines 43-45) and (2) asserting that it points to an lvalue (line 50). The latter action guarantees that, at a higher point in the analysis, the operand will be fetched (or possibly accepted as the target for an assignment) and the former action guarantees that it will be referenced indirectly. The data type placed in is[TI] comes from the symbol table if the operand is based on a symbol; otherwise, the type is forced to integer. Finally, before returning, attributes are passed through is[], indicating that the result (after the anticipated fetch) is not an address, is not a constant, is not to be fetched if the reference turns out to be a function call (the function address is already in the primary register), and does not have the form (left op zero) or (zero op right).

sizeof(...)

The sizeof operator must inspect the data type or symbol on its right and return a constant which is the size of the type data specified or the named object. First (lines 54-57) it looks for one of the data type specifications unsigned, unsigned int, unsigned int *, unsigned char, unsigned char *, int, int *, char, or char *. That failing, it tries to interpret the current token as a name in the symbol table (line 58). In the first case, it sets the size directly. In the second case, it takes the size from the symbol table where it was placed when the object was declared. If both cases fail (line 60), then "must be object or type" is issued and zero is returned. Before returning (lines 62-66), it designates the result to be a constant integer and provides its value in is[CV]. Other properties are: not an address, no indirect fetch is pending, and no symbol table reference exists for the returned value. Finally, it returns to the caller, indicating that nothing is to be fetched.

&...

The address operator is handled by first parsing the operand (line 68). The operand must be an lvalue (lines 70-72), since things that do not occupy space in memory (like constants) have no addresses. Next, the data type is copied from the symbol table into is[TA] (line 73), indicating that the result is an address referring to an object of the specified type. Then, if the object (without the address operator) would have been referenced indirectly, its address is already in the primary register, so there is nothing more to do but return to the caller, asserting that there should not be an attempt to fetch the object. If the object would have been referenced directly, however, then its address is loaded into the primary register (line 76), the result is declared to be an indirect reference by copying its data type from the symbol table to is[TI], and control returns to the caller, asserting that nothing is to be fetched.

At this point (line 79), since the previous tests have failed, none of the operators at this level have been recognized, but it is possible that a post increment or decrement exists. So the operand is first parsed by calling level14(). This call parses only a primary object (possibly subscripted and/or a function call) or a subexpression in parentheses. After that, attempts are made to recognize post increment and decrement operations.

...++ and ...--

When these operators are recognized (lines 80 and 87), code to place the operand in the primary register has already been generated so it is only necessary to generate code for the operation in question (lines 84 and 91) and to return the primary register to its original value (lines 85 and 92). As before, zero is returned indicating to the caller that nothing needs to be fetched.

Finally, if none of the operators at this level are recognized, control returns to the caller with whatever fetch request was produced by the parsing of the current subexpression (line 94).

Level14()

Level14() is the last (highest) precedence level before primary(). Level14() recognizes only two operations, subscripting and calling functions. It also handles the case where a function's address is invoked by naming the function without a left parenthesis following. As before, level14() is presented in pseudocode in Listing 26-3.

       PSEUDOCODE                                          COMMENTS 
   
     1 parse primary operand or subexpression              primary() 
     2 advance to next token 
     3 if token == [ or ( 
     4     loop: 
     5         if token == [                               subscript 
     6             if operand not in symbol table 
     7                  issue "can't subscript" 
     8                  bypass subscript expression        skip() 
     9                  enforce ] token 
    10                  return <request no fetch> 
    11             if operand is an address 
    12                  FETCH INTO PRIMARY if necessary 
    13             else 
    14                  issue "can't subscript" 
    15                  set <request no fetch> for return  k=0 
    16             save staging buffer address             setstage() 
    17             set <not constant> in temporary array   is2[TC]=0 
    18             parse subscript expression              level1() 
    19             enforce ] token 
    20             if subscript is a constant              ...[constant] 
    21                  purge subscript code               clearstage() 
    22                  if subscript is not zero 
    23                       if operand is an integer 
    24                            double constant 
    25                            generate GETw2n 
    26                       else generate GETw2n 
    27                       generate ADD12 
    28             else if operand is an integer           ...[variable] 
    29                       generate DBL1 
    30                   generate ADD12 
    31             pass <not an address>                   is[TA]=0 
    32             pass indirect type from symbol table    is[TI] 
    33             set <request fetch> for return          k=1 
    34         else  
    35             if token == (                           function call 
    36                  if not symbol table reference 
    37                       INDIRECT CALL                 callfunc(0) 
    38                  else if not a function reference 
    39                            if <request fetch> 
    40                            and not already fetched  is[CV]==0 
    41                                FETCH OPERAND        fetch() 
    42                            INDIRECT CALL            callfunc(0) 
    43                       else DIRECT CALL              callfunc(ptr) 
    44                  pass <not symbol table reference>  is[ST]=0 
    45                  pass <not constant>                is[TC]=0 
    46                  pass <reference fetched>           is[CV]=0 
    47                  set <request no fetch> for return  k=0 
    48             else return <pre-set fetch request> 
    49     loop back 
    50 if operand is in symbol table and is a function     no ( 
    51     generate POINTm 
    52     pass <not symbol table reference>               is[ST]=0 
    53     return <request no fetch> 
    54 return <fetch request from below>

Listing 26-3: Pseudocode Representation of Level14()

First, since the pertinent operators (left bracket and left parenthesis) trail, the operand (address expression or function name) is parsed first. This is done by calling primary() (line 1).

If the following token is not a left bracket or left a parenthesis (line 3), control drops down to line 50 where the parsed operand is checked for being a function name. If so, then code to load its address is generated (line 51), the symbol table address in is[ST] is zeroed since the reference has already been made, and control returns to the caller with an indication that the operand has already been fetched. But if the operand is not a function name, then nothing needs to be done at this level, so control returns to the caller with whatever fetch request primary() returned.

That leaves the cases where either subscripting or function calling is to be done. In these cases, control falls into a loop in which the left bracket and left parenthesis are recognized separately. The loop repeats until neither is seen, at which point (line 48) control returns with the fetch request from primary() or possibly a pre-set fetch request (lines 15, 33, and 47). By looping, the analyzer allows subscripting to be performed as a part of the calculation of a function's address.

The remaining logic breaks down into two parts--subscript analysis (lines 6-33) and analysis of function calls (lines 35-47).

The subscript operator (line 5) must follow a pointer or array name or an expression based on either and yielding an address. If there is no symbol table reference for the operand (line 6) it cannot be based on a pointer or array name, so the message "can't subscript" (line 7) is issued, the subscript expression is passed over (line 8), the closing bracket is enforced (line 9), and control is returned to the caller (line 10). That failing, if the operand is a pointer (line 11), code to fetch its content is generated (line 12). But if it is not an address, it cannot be subscripted, so "can't subscript" is issued and the fetch request is pre-set to zero.

Now it is time to parse the expression comprising the subscript. In case it turns out to be a constant, the current staging buffer address is saved (line 16) so the generated code can be discarded (line 21). Then, since a new array is2[] will be receiving the attributes of the subscript expression, element TC is initialized to zero (line 17). This will change if the expression does in fact yield a constant. Finally, the expression is parsed by calling level1() by way of down2() (line 18) and the closing bracket is enforced (line 19).

Now, if the subscript is in fact a constant expression (line 20), the code generated for it in the staging buffer is purged by calling clearstage() (line 21). Furthermore, if the constant subscript has the value zero (line 22), no offset to the base address needs to be made, so no code is generated. A non-zero value must produce an offset, however. If the address refers to integers (line 23) then the constant is doubled (line 25) before being loaded into the secondary register. If characters are being referenced, no doubling is needed so the unaltered constant is loaded. Finally, the offset, in the secondary register, is added to the address, in the primary register, to produce the effective address.

On the other hand, if the subscript yields a variable quantity, the code that evaluates the subscript expression is retained. If integers are being referenced (line 28), the doubling takes the form of code that doubles the subscript at run time (line 29). And, as before, code that adds the base address to the offset is generated (line 30).

Now, all that remains is to set up the side effects in is[] and return. Is[TA] is zeroed (line 31) to indicate that the object being referenced is not an address, and the data type is copied from the symbol table to is[TI] (line 32) indicating that the reference will be indirect through the effective address in the primary register. And, finally, before control loops back to look for other operators at the same level, the return value is pre-set to request that the referenced object be fetched (line 33).

Most of the work in parsing function calls is performed in callfunc() which is described next. In this function, however, the processing is simple. If the operand which designates the function is not a name in the symbol table (line 36), then it must be an expression that produces the function's address. In that case, callfunc() is called (line 37) with a zero argument instead of the address of a symbol table entry. This results in an indirect function call. However, if the operand is in the symbol table but is not identified as a function (line 38) it is assumed to be an expression involving variables, arrays, and/or pointers. So, if there is a pending fetch request (lines 39-40), the operand's value is fetched (line 41) before the indirect call is generated (line 42).

Note that the indirection operator in (*func)(); is recognized in level13() where the fetch operation is generated. At that point is[CV] is set to one, indicating to level14() that the fetch is satisfied. This special use of is[CV] is permissible since the indirection operator precludes the possibility of having a constant subexpression, so is[CV] is free to be used in this way. A means of telling level14() whether or not to fetch the function address is needed because the indirection operator always returns true for a fetch request. If it seems strange that level13() is passing something up to level14(), then notice that the indirection operator is preceded with a left parenthesis, so this instance of level13() is reached by way of primary() which in turn was called by level14().

If the operand is a function name in the symbol table, however, callfunc() is called with the address of the symbol table entry as an argument, thus causing a direct call to be generated (line 43).

Now side effects are set up; specifically, is[ST] is zeroed (line 44) to indicate that the value of the function being called has nothing to do with the symbol table, is[TC] is zeroed (line 45) to indicate that the value of the function is variable, is[CV] is zeroed (line 46) to indicate that the function address has already been fetched, and the return value is pre-set to prevent an automatic fetch at a higher level.

Last of all, control loops back to look for other operators at the same precedence level.

Callfunc()

As we have just seen, callfunc() is called at three points in level14(). If its argument is zero, it generates an indirect call through the address in the primary register. But if it is not zero, it specifies the address of the symbol table entry for the function being called, and a direct call to the label bearing the function's name is generated.

This function must generate code for four tasks: (1) to evaluate each argument and push it onto the stack, (2) to provide an argument count to the called function, (3) to call the target function, and (4) to deallocate the arguments from the stack on return.

On entry, the opening parenthesis for the argument list has been recognized and passed over, so the first argument (if there is one) is next. The arguments are processed in a loop that terminates when the right parenthesis (or end of the statement or of the input) is reached. With each iteration, one argument expression is evaluated and pushed.

If an indirect call is being made, then the function address must be preserved as each argument is evaluated. This is done by placing it on the stack temporarily. Then, after evaluation, the argument swaps places with the function address. The steps are:

push the function address on the stack (PUSH1)
evaluate the next argument expression
swap the function address with the argument (SWAP1s)

Expression() is the lead-in function for argument evaluation. Notice that the entire expression analyzer is being called recursively through this function. Each actual argument is an expression in its own right, while the function call, of which it is a part, represents part or all of another expression.

When the arguments have been exhausted, a right parenthesis is enforced. Then, if the target function is not ccargc(), ARGCNTn is generated to load a count of the arguments into the CL register. (Since ccargc() is the function that fetches the count, it would not be a good idea to destroy CL with a count of the arguments passed to it.)

Next, if this is a direct call, CALLm is generated to call the target label. But if it is indirect, CALL1 is generated to call the function pointed to by the primary register.

Finally, ADDSP is generated to restore the stack to its original value before the arguments were pushed onto it.

Primary Operands

Primary()

At the bottom end of the expression analysis hierarchy is primary() which looks for the simplest possible operand reference. This may be a variable name, an array name, a pointer name, a character constant, a string constant, or a subexpression in parentheses.

First, primary() looks for a left parenthesis and, finding one, (1) calls level1() to evaluate the enclosed subexpression, (2) enforces the closing parenthesis, and (3) returns the fetch request from level1().

That failing, the entire is[] array is set to zeroes, thereby establishing the default values for its elements. Further action is only necessary when zero is not appropriate.

At this point, the current token must be a symbol or a constant. So symname() is called to determine if it is a symbol. If so, symname() places it in the local array sname[] and passes over it in the input line. Three possibilities exist for a legal symbol--it may be a local reference, a global reference, or a reference to an undeclared function. First, it is sought in the local symbol table; if that fails, the global table is searched; and, if that fails, it is assumed to be an undeclared function. Searching for locals before globals is critical, since local names must supersede global ones. Also, recall that the local table is searched sequentially backward, so that "newer" locals (the ones that are nested deeper) mask "older" ones. If both searches fail, the symbol is added to the global table as a function. Then, if it never is formally declared, it will be identified to the assembler as an external reference before the compiler quits.

On finding the symbol in the local table, a check is made to verify that it is not a label. If it is, the message "invalid expression" is issued, since labels cannot be referenced in expressions. Control then returns indicating no need for an operand fetch. That failing, we have a proper reference to a local object, so POINT1s is generated which yields

		LEA AX,n[BP]

to calculate in the primary register the address of the object in question. The letter n stands for the object's distance from the stack frame address in BP, and comes from the symbol table. It is negative for local objects and positive for arguments.

Next, is[ST] is set to the address of the symbol table entry and is[TI] is set to the data type, so that the reference will be seen as an indirect one.

Then, if the symbol names an array, is[TA] is set to the data type, indicating that the operand is an address, and control returns, requesting no fetch operation. These actions are appropriate since an unsubscripted array name is supposed produce the address of the array. Of course, subscripting changes this.

However, if the symbol names a pointer, is[TI] is changed to indicate an unsigned integer operand. This is done because the data type in the symbol table refers to the type of objects pointed to, whereas is[TI] refers to the object being referenced indirectly--the pointer. True, a pointer is not an integer, but they are the same size and that is all that matters here. As before, is[TA] is set to the data type, indicating that the operand (when it is fetched) yields an address.

Finally, whether the symbol is a pointer or a variable, control returns, requesting a fetch operation since the primary register contains the address of the object not its value.

Global references are treated somewhat differently since globals are referenced by their labels and because functions must handled.

First, is[ST] is set to the address of the symbol table entry for the recognized symbol.

Next, if the symbol is an array name, three actions occur. POINT1m is generated which yields

MOV AX,OFFSET _array

for loading the array's address into the primary register. Array stands for the name of the array. Then, since any further reference will be indirect, is[TI] is set to the data type of the symbol. And, finally, is[TA] is likewise set to the data type, indicating that an address is in the primary register. Control then returns with no request to fetch an operand.

However, if the symbol is a pointer name, only is[TA] is set to the data type of the symbol.

Then, whether the symbol is a pointer or a variable, control is returned with a request to fetch the operand. Since is[TI] remains zero, the requested fetch will be direct.

If the symbol is not found in the global symbol table, addsym() is called to create an entry for it as a function. At this point the symbol is known to be a function and nothing remains to be done but return requesting no fetch operation. Level14() will take care of calling the function or loading its address.

Finally, if the current token is not a legal symbol, then it must be a constant or else the expression is illegal. So constant() is called to parse it and tell whether or not it was a constant. Last of all, control returns, requesting no fetch operation. Recall that constants are passed back up the line in is[CV] with is[TC] indicating the type of the constant (INT or UINT).

Table 26-4 displays in a matrix the values that primary() returns to higher level functions. The first six columns correspond to the significant elements of the is[] array. The last column shows the value actually returned by primary(); no refers to zero and yes refers to one. The symbol ->st stands for a pointer to the recognized object's entry in the symbol table. The word type refers to the data type of the object, or to which it refers. UINT and INT are the unsigned and signed integer data types as defined in CC.H. And, the unusual looking (U)INT refers to either UINT or INT depending on the value of the constant and whether or not it was written with a minus sign (see Chapter 3). This table should be of considerable value when studying the logic of the higher level analyzer functions.

Table 26-4: Values Returned by Primary()

Constant()

This function is called by primary() when it thinks the current token must be a numeric, character, or string constant. If so, it parses the constant, generates code, and returns true, indicating that a constant was in fact parsed. On failure, it returns false.

Corresponding to the three valid possibilities, constant() calls number(), chrcon(), and then string() looking for a numeric constant, a character constant, or a string constant respectively. Each of these functions returns a non-zero value on success, and zero on failure. If all three fail, false is returned to primary(). These functions generate no code, they just parse the constant and place the result in either is[CV] (number or character constant) or the literal pool (string constant). Constant() then generates the code to reference the result.

The functions number() and chrcon() are essentially alike in their effects, whereas string() is distinct because it yields an offset to the string in the literal pool rather than a constant value.

Number() and chrcon() return the type of the constant found on success; that is, INT or UINT. As we can see from the listing, the return value in these cases, is assigned to is[TC] as the type of the constant, or as zero. This informs the higher parsing levels that a constant was or was not found and, if so, its type. They also receive as their only argument the address of is[CV] where they place the binary integer form of the constant they find.

Now, if number() or chrcon() indicates success GETw1n is generated to load the value into the primary register. This produces

XOR AX,AX

MOV AX,n

depending on whether or not the constant is zero.

String constants are different because they yield an address to the string. Where will the string be placed in the program? Recall from the discussion of the literal pool (Chapter 20) that strings are kept there until the end of the function, at which point the pool is dumped into the data segment. Each compiled function allocates a compiler created label for the dumped pool, and each string is referenced as an offset from that label.

String() receives the address of a local integer offset as an argument which it uses as the target for the string's offset in the literal pool. When string() indicates success, it has put the string in the literal pool (in the compiler, not in the program) and the offset in offset. Constant() then generates POINT1l which produces

MOV AX,OFFSET _m+n

where m is the number of the label for the current function's literal pool and n is the offset from the label to the first character of the string. The assembler produces from this the data segment relative address of the string, and the linker fixes it to an absolute address when all of the modules making up the program are finally joined.

Since a string address is not a constant (it's actual value is not known at compile time), is[TC] and is[CV] are left at zero, as they were initialized by primary().

To summarize, constant() sees to it that a numeric or character constant is placed in is[CV] and is loaded (at run time) into the primary register. It ensures that a string constant is copied into the literal pool, and its address is loaded into the primary register at run time.

Number()

This function looks for a (possibly signed) decimal, hexadecimal, or octal string and converts it to a signed binary integer. First it looks for a leading sign. If a hyphen is seen, minus is set true, otherwise it defaults to false. A plus sign is simply ignored, leaving minus at its default value of false.

Then, if the next character is not a digit, false is returned. That failing, the first digit is inspected to see if it is '0'--the lead-in for octal and hexadecimal numbers.

If not, then each digit found (going from left to right) is converted to its binary equivalent and is added to an accumulator k, which is first multiplied by 10. When the digits are exhausted, k contains the absolute value of the number. No account is taken of overflow, it simply yields a corrupt value.

Before discussing the application of the sign and what is returned to the caller, lets see what happens when the leading character is '0'. In that case, the second character is inspected to see if it is an x (uppercase or lowercase). That being true, a hexadecimal number follows, so the x is bypassed, and a loop is entered which lasts as long as legal hexadecimal digits are found. In each instance, the digit is converted to its binary value and added to k, after multiplying k by 16. Since the alphabetic digits do not immediately follow the numeric digits in the ASCII code set, separate statements are required for each case. The result of this loop is the absolute value in binary. As before, no account is taken of overflow.

If there is no x following the leading zero, then a different loop is entered which lasts as long as legal octal digits are found. It similarly computes the absolute value in k.

After the absolute value has been computed, if minus is true, the negated value of k is stored at the location indicated by the argument, and INT is returned. However, if minus is not true, then k is stored without negation. In that case, its value is tested to see if it should be designated as signed or unsigned. This version of Small C treats constants that do not have a minus sign, and are larger than 32767, as unsigned quantities (Chapter 3). If that condition holds, UINT is returned, otherwise INT.

Chrcon()

This function looks for a character constant of one or two characters enclosed by apostrophes. First, chrcon() looks for an apostrophe; if that fails, it returns false, indicating failure.

Following the leading apostrophe, control falls into a loop which exits when the trailing apostrophe is found. With each iteration an accumulator k is shifted left eight bits and the current character is added to the result. The characters are obtained through a filter function litchar() which recognizes and evaluates escape sequences; so it is possible that two or more characters may reduce to a single character as seen by chrcon(). Since the loop continues until the trailing apostrophe, if more than two characters are present, the leading characters will be lost and only the last two will contribute to the constant.

At the end, the closing apostrophe is bypassed, k is placed into is[CV], and INT is returned.

String()

This function parses quoted strings. First, it looks for the leading quote. If that fails, it returns false. After seeing the quote, it copies the current value of litptr (the offset to the current end of the literal pool where the new string is about to be copied) to the location which it received as an argument. It then falls into a loop which exits when the trailing quote is reached. With each iteration, it filters the next character(s) through litchar() and passes it on to stowlit() for storage in the literal pool. Finally, the closing quote is bypassed, a terminating zero is put at the end of the string in the literal pool, and true is returned.

Stowlit()

Stowlit() places character or integer values in the literal pool. It receives as arguments the value and its size in bytes. If an overflow would occur, it issues "literal queue overflow" and aborts the run. If there is room in the literal pool, however, it places the value (one or two bytes) in the pool at the offset indicated by litptr. It then increments litptr by the size of the value and returns to the caller.

Litchar()

Litchar() returns the current character in the input line after advancing to the next one. If the current character is a backslash, it is taken as the start of an escape sequence. The backslash is bypassed and the following character is tested to see if it is one of n, t, b, or f. If so, that too is passed over and a newline, tab, backspace, or formfeed, respectively, is returned.

That failing, litchar() looks for an octal number to convert to binary. Up to three digits will be recognized. Each succeeding character is tested for being an octal digit. If so, it is converted to binary and added to an accumulator oct which is first shifted left three bits. The loop ceases when no more octal digits remain or three digits have been processed. Finally, oct is returned.

However, if no octal digits are seen, then the character following the backslash is returned as is. This is the case of an escape followed by an unspecified character. Thus, for instance, a sequence of two backslashes would return the second backslash.

Go to Chapter 27 Return to Table of Contents

CHAPTER 26: EXPRESSION ANALYSIS

CHAPTER 26:
EXPRESSION ANALYSIS