CHAPTER 24:

HIGH LEVEL PARSING

A parser might employ any of a number parsing methods of which there are two broad categories--top-down methods and bottom-up methods. For a comparative study of parsing methods, you are referred to references in the Bibliography under the heading Compiler Writing. Specifically, I suggest Chapter 6 of Calingaert[19] for a quick survey, and Tremblay and Sorenson[24], for more detail. Small C uses a popular top down parsing method called recursive descent. If a program were to be successively divided into smaller and smaller components, it would form a tree based on the grammar of the language. At the root would be the entire program, at the interior nodes would be instances of constructs like statements, expressions, and functions; and at the leaves would be tokens.

The recursive descent method starts at the root of such an imaginary tree for the program and looks for a sequence of global constructs. When it recognizes the beginning of one, it then tries to recognize the constituent parts. These, in turn, are broken down into smaller parts, until the smallest lexical units (the tokens) are reached.

In the case of Small C, this activity is accomplished by having a high level function parse() look for the global constructs. Parse(), in turn, calls (descends to) lower-level functions which recognize the next most inclusive constructs, and so on. When the bottom (token level) is reached, control returns back up to the lowest level at which a construct was recognized; from there it descends again to find the next part of the construct. When the construct has been fully parsed, control migrates up to the next higher level at which a construct was recognized, and thence down again. As the parser "walks" this imaginary tree, it emits object code which represents the program in a lower-level "target" language. This process continues until control reverts back to the highest level and there remain no more global constructs.

To repeat, the parse tree for the program is only imaginary, it does not exist as such; the parser only performs a linear scan of the source program. On the other hand, instantaneous points on this imaginary tree do have a real existence in the compiler in the form of nested stack frames for the parsing functions that are active at any point in the process.

Recursion enters the picture because it is implicit in the grammar of the language. For example, a statement might be a compound statement, containing other statements. Therefore, the parsing function statement() might call the lower-level function compound() which in turn calls statement().

The various function calls and returns effectively travel down and up the branches of the parse tree for the program being compiled. Generally speaking, movement down the tree (toward the leaves) corresponds to successful efforts to recognize language constructs, and movement up the tree corresponds to the generation of code based on what has been recognized.

Top-Level Parsing

Parse()

Small C begins parsing in parse() which recognizes the global declarations (data and functions) that constitute a complete program. In addition, as mentioned in Chapter 22, it also recognizes the #include, #define, and #asm directives. Parse() hangs in a loop until eof is true--the end of the last source file has been reached. With each iteration it attempts to recognize the current token as the start of a declaration (int, char, unsigned, extern) or one of the directives #asm, #include, or #define. Should these fail, it assumes it has a function definition. Each of these constructs is further parsed by one of the second-level functions indicated in Figure 24-1. These five functions parse the entire program.

Figure 24-1: Primary High-Level Parsing Functions

Unlike the other four functions, dodeclare() is called in anticipation of a declaration. If it finds one, it parses it and returns true; not finding one, it returns false. It handles both external and static declarations. If parse() sees the keyword extern it calls dodeclare() passing it the constant EXTERNAL as a signal that the external storage class applies; otherwise, it passes STATIC for the static storage class.

Recall from Chapter 22 that amatch() and match(), which recognize literal strings in the source, take care of numerous low level, front end tasks. These include bypassing white space, bypassing recognized tokens, reading and preprocessing new source lines, and even closing one source file and opening the next one if necessary. They return true only when the specified string is recognized. Amatch() differs from match() by assuming that an alphanumeric comparison is being made on an entire token--the token must exactly match the string and the first character after the token must not be alphanumeric. Whereas, match() is satisfied if the string matches only the beginning of the token, regardless of what follows.

Parsing Global Data Declarations

Dodeclare()

Dodeclare() is a small function that tries to match on char, int, unsigned, unsigned char, or unsigned int. Finding none of these, it assumes int if the storage class is EXTERN; this agrees with normal C syntax in which the keyword int may be omitted following extern. Failing these conditions, control returns to parse() indicating false. If a declaration is recognized, however, dodeclare() calls declglb(), passing it the data type (CHR, INT, UCHR, or UINT) and the storage class. After that, it calls ns() to enforce the terminal semicolon, and returns true.

Declglb()

Declglb() parses global declarations (see Listing 19-2, for example). On entry from dodeclare() the optional storage class (extern) and the data type (char, int, unsigned, unsigned char, or unsigned int) have already been recognized, skipped over, and passed to this function which finds itself now facing a list of names to declare and possibly define (reserve storage and initialize).

Each iteration of an infinite loop processes the next name in the list. Two statements test for return conditions. At the top of the loop

		if(endst()) return;

calls endst() to test for the end of the statement (semicolon or end of input) and returns if true. And, at the bottom,

		if(match(",") == 0) return;

returns when a declaration is not followed by a comma. Between these, declglb():

  1. Determines the identity (POINTER, VARIABLE, ARRAY, FUNCTION) of the object, saving it in the local variable id. Note: Small C recognizes either the * prefix (for POINTER) or the [ suffix (for ARRAY), but not both. It cannot handle an array of pointers. If both are specified, an error message is issued.
  2. Determines the number of occurrences of the object, saving it in the local variable dim. This may be greater than one for an array.
  3. Fetches and verifies the object's name.
  4. Verifies that the name has not already been declared--does not yet exist in the symbol table.
  5. Using id and dim, generates appropriate code to define the object if it is not a function. Predeclared functions are only added to the symbol table.
  6. Using id and dim, adds an entry to the symbol table so the object can be referenced.

First, a check is made for * indicating POINTER. That failing, the assumption is VARIABLE. Then, symname() is called to verify the name and copy it from the source line into ssname[]. Next, findglb() is called to verify that the name is not yet in the symbol table. After that, if the identity is VARIABLE, checks are made for [ or ( indicating ARRAY or FUNCTION respectively. These override the initial determination of VARIABLE. Notice that these checks are not made if the name was previously identified as a pointer. In that case, if these tokens are present, the loop terminates after declaring the pointer, and an error results when dodeclare() tries to enforce the terminal semicolon.

The function needsub() is called (1) to evaluate the dimension expression following [ and (2) to enforce the presence of a closing ]. Its returned value is the specified dimension--a constant.

Next, if the storage class is EXTERNAL, external() (Chapter 21) is called to make the declaration to the assembler. Otherwise, if the identity is not FUNCTION (i.e., data is being declared) initials() (see below) is called to define and initialize the object(s). Finally, addsym() (also below) is called to enter the declared object into the global symbol table.

Initials()

Initials() is passed the size of the object (type>>2), its identity (POINTER, VARIABLE, or ARRAY), and the number of occurrences (dim) to define. Internally, these are called size, ident, and dim respectively. First, the declared symbol name, in ssname[], is declared to the assembler as an entry point by calling public(). This guarantees that it can be reached from every module of the program. Next, initials() scans for

		= Value
or

		= { Value, Value, ..., Value }
where each Value is evaluated by init(). The values are stored in the literal pool until the end of the initializer is reached and then dumped to the output as either DB or DW directives. After that, to account for the fact that the number of initial values may not be as great as the dimension of an array, dumpzero() is called to fill out the remaining elements with zeroes. This means that an array will be the size of the larger of its dimension or the number of initial values it has. Initials() can be called with dim equal to zero under two conditions--a pointer is being declared or an array of dimension zero is being declared. Therefore, after calling init(), which decrements dim for each initial value, a check is made to see if dim started at zero and is still zero. If so, an uninitialized pointer or an uninitialized and undimensioned array is being declared. In the first case, initials() assumes a value of zero; in the second, it complains with "need array size."

Init()

Init() recognizes two types of initializer--character strings and constant expressions. There are restrictions on which type is legal for a given object (see Table 7-1), so init() detects violations and reports them.

First, init() calls string() to test for a quoted string. Finding one, string() (1) saves it in the literal pool (giving init() the starting offset as a side effect in the local integer value), (2) skips the closing quote, (3) terminates the string in the pool with a null byte, and (4) returns true. Otherwise, it returns false.

In fetching the characters from the string, string() calls litchar() to translate escape sequences to individual characters. If a string is found and the object being initialized is not a character array or character pointer, the message "must assign to char pointer or char array" is issued. Next, the dimension variable in initials() is reduced by the length of the string (the new literal pool offset litptr minus the original offset value). Finally, if the object being defined is a pointer, then space for it is allocated and initialized to the address of the string. In this case, since the string will be dumped immediately following the pointer,

		DW $+2

is generated by calling point().

If a quoted string is not found in the input line, constexpr() is called to look for a constant expression. On finding one, a test is made to see if the object being defined is a pointer; if so, the message "cannot assign to pointer" is issued. In any case, the constant value is stored in the literal pool and the dimension (in initials() ) is reduced by one.

Parsing Preprocessor Directives

Doasm()

Doasm() is a small and very simple routine for handling assembly language code. It differs from the other second-level parsing functions in that it may be called from parse() or statement(). This is because Small C allows assembly code to appear not only at the global level, but anywhere a statement is allowed as well. When called, the lead-in #asm has already been recognized and bypassed, so doasm() merely drops into a loop copying lines of input to the output until a line beginning with #endasm is reached. Of course, the end-of-input condition also terminates the loop. Finally, the terminating line is discarded by calling kill(), and control returns to the caller.

Doinclude()

Doinclude() is also a very simple function. When called, the keyword #include has already been recognized and bypassed, and a filename, delimited by quotation marks or angle brackets, is next up (Note that Small C also accepts it without delimiters). Doinclude() skips the leading quote or angle bracket, if present, and copies the filename into the local character array str[] until the trailing delimiter or the next white space is reached. Then it calls the library function fopen(), passing it str for the filename and "r" (retrieval) for the mode. If the open is successful, input2 is assigned the new file descriptor, thereby causing inline() to fetch future input from that file instead of the primary file indicated by input. Should the open fail, however, input2 is again assigned its normal value of EOF, and "open failure on include file" is issued. Finally, kill() is called to discard the current line so that the next token will have to come from the new file.

Dodefine()

Dodefine() is another small routine. It processes #define directives by adding the macro name to the macro name table and then copying the remaining text into the macro text queue. When called, the keyword #define has been recognized and bypassed, and the macro name is next up. Symname() is called to verify that the name is legal and to fetch it into msname[]. If it is not a valid name, "illegal name" is issued, the line is discarded, and control returns. Otherwise, search() is called to look for the name in the macro name table. If not found, cptr is left pointing to an available entry into which the name is copied. In case the name is not found and there is no more space available, cptr is set to zero and "macro name table full" is issued before control returns. Regardless of whether a new entry was established or an existing one was found, the entry designated by cptr is pointed to the location in macq where the new replacement text will be stored--in the first case a new macro is being defined and in the second case an old one is being redefined. Finally, the remaining text in the line, starting with the next token, is copied into the text queue where it is terminated by a null byte. If there is insufficient room, "macro string queue full" is issued and the run aborts.

Parsing Function Definitions

As mentioned before, if parse() does not recognize any of the lead-in keywords for declarations or the directives which it handles, it assumes that a function is being defined, and so it calls dofunction(). This large function:

  1. Initializes the literal pool, obtains a label number for the literals that will be dumped at the end of the function, and clears the local symbol table.
  2. Bypasses the optional keyword void if there is one.
  3. Lists the current line on the screen if monitoring was requested.
  4. Verifies that the current token (the function name) is a legal symbol and fetches it into ssname[]. That failing, "illegal function or declaration" is issued, the line is discarded, and control returns to parse().
  5. Determines whether or not the symbol is already in the global symbol table. If so, the current definition might be an illegal redefinition. This would be the case if the table entry is not a function or is a function and was created by a previous definition. Conversely, if it was created by an earlier function reference (recognized by a storage class of AUTOEXT), the current definition is proper and the class is changed to STATIC. Note: Functions that are referenced, but not defined, are automatically declared external at the end of the program by trailer(); hence the name AUTOEXT.

    If no entry matching the current function name is found in the table, one is created with:

    	NAME = ssname
    	IDENT = FUNCTION
    	CLASS = STATIC
    	TYPE = INT
    	SIZE = 0
    	OFFSET = 0

  6. Generates the string in ssname[] as a label and entry point by calling public().
  7. Requires the opening parentheses for the argument list. Failure to find it yields "no open paren."
  8. Declares each formal argument in the local symbol table. This is done in a loop looking for the closing parenthesis (also terminated abnormally by the endst() condition). With each iteration:
    	NAME = ssname
    	IDENT = 0
    	CLASS = AUTOMATIC
    	TYPE = 0
    	SIZE = 0
    	OFFSET = 0, 2, 4, ... for the 1st, 2nd, 3rd, ... argument
    
  9. Determines the identity and type of each declared argument. This is done in another loop in which the variable argstk--used above for assigning OFFSET values--counts down to zero. With each iteration, a declaration list is processed by calling doargs() which decrements argstk by two for each argument it parses. Before each call, int, char, unsigned, unsigned int, or unsigned char must be recognized to introduce the declaration list (and a semicolon afterward). If one of these keywords is not found, "wrong number of arguments" is issued since, because argstk is not zero, not every argument has been typed and something other than a declaration has been found.

    Each call to doargs() accepts the data type from dofunction() and determines the identity and size of a list of arguments. The function decl() is called to determine these.

    Decl() is also used for local declarations. In this case, however, since the argument aid (array id) has the value POINTER, arrays are assigned identities of POINTER. This makes sense because their addresses are passed to the function on the stack, making them lvalues. Whether decl() is handling argument or local declarations, it requires that functions have the syntax of a pointer; otherwise, "try (*...)()" is issued. Decl() issues "i llegal symbol" if an improper name is found.

    Doargs() issues "not an argument" if a name in the list cannot be found in the local symbol table. Having found the argument's entry in the local symbol table, doargs() sets the IDENT, TYPE, SIZE, and OFFSET fields to complete the declaration. As we saw, it gets IDENT and SIZE from decl(), and it receives TYPE as an argument. It inverts the OFFSET field (to account for the fact that Small C pushes actual argum ents onto the stack from left to right) and adjusts it to account for the presence of a return address and the saved value of BP on top of the stack when the function receives control. For example, if three arguments are being passed, step (8b) would have established their OFFSETs as 0, 2, and 4. Then this step would invert and adjust them to 8, 6, and 4 respectively. From this point on, arguments are treated just like local variables. The only difference being that they are placed on the stack before the call of the function. Whereas, local variables are allocated on the stack after the call.

  10. Calls statement(). This is the central event in dofunction() because statement() parses the entire body of the function! More will be said about this important step in Chapter 25.
  11. Generates the default return, that passes control back to the caller, when it reaches the end of the function body. This is done only if the last statement in the function is not a return or goto; otherwise, the default return could never be reached.
  12. Finally, if any literal strings were defined within the function they are dumped from the literal pool to the output. This involves generating the label that was reserved in step (1) (and referenced in the function body) and generating the necessary DB directives.

Now that we have surveyed the steps that declare a function, in the next chapter we shall look deeper into the heart of the compiler by examining statement() and its subordinate functions.

Go to Chapter 25 Return to Table of Contents