CHAPTER 20:

DATA STRUCTURES

This chapter describes the seven main data structures used by the Small C compiler. They are: the input line, the literal pool, the macro pool, the staging buffer, the switch table, the symbol table, and the while queue. Gaining a conceptual understanding of these now will make the remainder of the compiler much easier to learn.

The term structure is being used here in a generic sense, not with reference to the C language construct of the same name. Although that feature of full C is a real convenience, it is hardly essential for working with data structures of any kind and, since it is not supported by Small C, the compiler is written without them.

Space for these structures is allocated dynamically in main() when the compiler begins execution. In each case, a pointer receives the address of the memory block that is allocated for the structure. Table 20-1 lists these pointer names together with their data types and the amount of memory allocated. The memory sizes are expressed as macro names which are defined in the file CC.H.

Table 20-1: Data Structure Pointers

The Input Line

There are two input line buffers in the compiler--pline from which parsing is done, and mline from which macro processing is done.

The preprocessor reads from mline while writing into pline, from which the parser operates. The function which reads source lines into the compiler inline() places its data wherever the pointer line points. Before a line is read into the compiler, line is set to mline, causing the raw source code to be placed there. The preprocessor then scans mline looking for matches with defined symbols. Tokens that do not match are copied directly into pline. Tokens that match have their replacement text copied into pline instead. When the end of the line is reached, line is reset to point to pline, from which parsing takes place.

Associated with line is another global pointer lptr which points to the current character in line. Lptr marches along the line as tokens are being recognized. Since the same scanning functions are shared by the preprocessor and the parser, they are directed by lptr so that they work from either line buffer. They reference mline during preprocessing and pline while parsing.

The Literal Pool

The literal pool is really just a character/integer buffer. It serves two purposes; it temporarily holds strings until they can be dumped at the end of the current function, and it temporarily holds initializers that are applied to global objects. The latter case does not require buffering, but it is more convenient to use the literal pool and its functions than to write special functions just for global initializers. Since function definitions are global, there is no conflict in these two uses of the literal pool; they occur at different times.

In the first case, dofunction(), at the beginning of a function, clears the literal pool and allocates a label number for it. At the end of the function, it dumps the literal pool; that is, it generates the label followed by enough DB instructions to define the function's strings. Each string within the function is referenced by an offset from the label. Since the literal pool is reset with each function, it only has to be large enough to hold the strings in a single function.

In the second case, the literal pool is used by initials() to initialize global objects. For each object, it is first cleared. Any initializers following the object's name cause constants to be stored in the pool. Finally, at the end of the definition, the contents of the pool are output as DB or DW instructions which cause the assembler to place the constants in memory.

The literal pool is controlled by two global variables--litq which points to the beginning of the pool and litptr which is not really a pointer, but a subscript to the next available position in the pool (the byte following the last one used).

The function stowlit() puts things into the pool and dumplits() dumps it by generating a sufficient number of DB or DW instructions to define the contents of the pool to the assembler. These are described in Chapter 21.

The Macro Pool

The macro pool is made up of two separate structures--the macro name table and the macro text queue which holds substitution text. The name table has the format shown in Figure 20-1.

Figure 20-1. Format of the Macro Name Table

It consists of a fixed number of fixed length entries. Each entry consists of a string containing a macro name and an offset into macq. The names are placed in the name field left justified with the customary trailing null byte; the offset is an ordinary integer which locates the beginning of the substitution text for the named macro.

Searches are performed on the name table by the same function that searches the global symbol table, search(). Search() employs a hashing algorithm to quickly locate a name in the table or the vacant entry where a name should be placed. On finding the name it returns true; otherwise, false. In either case, the global pointer cptr is set to the entry matching the name, or the vacant entry in which the name should be placed.

Macq is simply a character buffer into which the replacement text for macro definitions is stored. For each symbol, the entry in macn contains the offset to a null terminated string in macq.

The function addmac() processes #define statements by adding entries to macn and replacement text to macq. Preprocess() performs the substitutions.

The Staging Buffer

The staging buffer is a large integer buffer for holding generated code until outcode() can write it to the output file. It receives only code generated by expressions--the major part of C programs. Code that falls between expressions is not buffered in this way, but goes straight to the output file. The reason for buffering the code generated by expressions is so that it can be optimized. Some optimizing is done by the expression analyzer itself, but most is done by the peephole optimizer peep() as the buffered code is being written to the output file.

Each entry in the staging buffer consists of two integers, a small numeric code called a p-code or pseudo-code, and an integer value that further qualifies the p-code. Each p-code represents one or more assembler instructions. A few represent partial instructions.

Two global pointers control the staging buffer. Snext either contains zero or points to the next available entry in the buffer and slast points to the end of the buffer. When snext is zero, an expression is not being analyzed and so code generated by gen() goes directly to the output file. However, when snext is not zero, gen() places p-codes into the staging buffer at the position indicated.

Two functions setstage() and clearstage() manipulate snext and the contents of the buffer. Setstage() is called before expression analysis begins and before each subexpression is analyzed. In the first case, it changes snext from zero to the starting address of the staging buffer (stage), thereby directing future output to the beginning of the buffer. In the second case, snext is left alone but saved so that code generated from that point can be ignored later by resetting snext to its original value. The expression analyzer does this when it sees that an expression or subexpression resulted in a constant value. In that case, it throws away the code that it generated and replaces it with a single instruction that loads the constant into the primary register.

Clearstage() either writes the contents of the buffer to the output file and resets snext to zero or merely backs up snext to an earlier point in the buffer, thereby discarding the most recently generated code. Clearstage() calls outcode() to translate the p-codes to ASCII strings and write them to the output file. Outcode() in turn calls peep() to optimize the p-codes before it translates and writes them.

The Switch Table

For each case in a switch statement the switch table holds the value of the case and the number of the label to which control should go in that case. Figure 20-2 shows the format of the switch table.

Figure 20-2. Format of the Switch Table

Each entry consists of two words, one for the label number and one for the value of the corresponding case. After compiling the body of the switch statement, which generates ordinary code interspersed with a numeric labels for the cases, Small C generates a call to __switch() followed immediately by a table of label/value pairs to be scanned by __switch() in deciding which label to target. The purpose of the switch table is to accumulate these label/value pairs until they can be dumped after the call to __switch().

Two global pointers control access to the table; they are swnext and swend. Swnext points to the next available entry in the table and swend always points to the last entry. Whenever a case is found in the body of a switch, a label is generated, its number is stored in the table, the value of the case is stored with it, and swnext is advanced to the next entry. An attempt to advance swnext beyond swend results in an error.

Notice that switch statements can be nested. Therefore, it is necessary to track label/value pairs for higher level statements as well as for the current one. This is done by partitioning the table at the start of every nested switch. The function doswitch() saves swnext in a local variable and then calls statement() recursively to compile the compound statement which is the body of the current switch. If another switch is encountered before statement() returns, doswitch() is again called (this time recursively), again saving swnext locally, and again calling statement(). When statement() does finally return, doswitch() generates its call to __switch() and dumps the switch table from the entry indicated by swnext (when the current instance of doswitch() was called) to the last entry made. Finally, swnext is restored to its previous value and doswitch() exits.

The Symbol Table

The symbol table is the principle data structure of the compiler. It stores the names of every object declared in the program and everything the compiler must know about them. As Figure 20-3 shows, the symbol table is partitioned into two parts. There is a local table and a global table. Although they carry the same data, they are really distinct tables with different search algorithms.

Figure 20-3. Format of the Symbol Table

A limited space at the front of the symbol table is reserved for local declarations. Its address is defined by the symbol STARTLOC. Since local declarations pertain to only one function at a time, not much space is needed. The symbol NUMLOCS determines how many local declarations will fit in the reserved space. The global pointer locptr controls placement of entries in the local table by pointing to the next available entry--the one immediately following the last one made.

The remainder of the table, beginning at the symbol STARTGLB, stores global declarations which, unlike local declarations, accumulate throughout the compilation run. The symbol NUMGLBS determines how many global declarations will fit in the space allocated. The global pointer glbptr is initially set to STARTGLB and thereafter always points to the start of the global table.

Each table entry consists of six fields--identity, type, class, size, offset, and name. The symbols IDENT, TYPE, CLASS, SIZE, OFFSET, and NAME specify the offsets, within an entry, to each of these fields.

The identity field tells what the declared entity is. The possible values are defined by the symbols shown in Table 20-2. Symbol table entries can represent labels, variables, arrays of variables, pointers, and functions. These labels are the ones written in the source code, not the numeric labels created by the compiler. Labels exist only in the local symbol table.

Table 20-2: Defined Values for the Identity Field

The type field specifies the data type. The possible values are defined by the symbols shown in Table 20-3. Since data type is meaningless with respect to labels, this field does not apply to label entries, so zero is used. For variables, this field specifies the type of variable. For arrays, it tells the type of the array's elements. For pointers, it indicates the type of object referenced by the pointer. And for functions, it specifies the type of object returned--always integer. The values assigned to these symbols are especially chosen so that when they are shifted right by two bits, they yield the length of objects of the designated type.

Table 20-3: Defined Values for the Type Field

The class field gives the storage class of an object. The possible values are defined by the symbols shown in Table 20-4. Since storage class is meaningless with respect to labels, this field is set to zero for label entries. The idea of a storage class is the same for variables, arrays, and pointers. STATIC storage is always present. STATIC objects are declared to the assembler with labels and DB or DW directives. AUTOMATIC storage is dynamically allocated on the stack upon entry of a compound statement and is deallocated on exit (recall that the body of a function is a compound statement). Automatic (local) objects are referenced by their offset from the stack frame pointer BP rather than by means of labels. EXTERNAL objects exist at the global level in some other, separately compiled, module. They are merely declared to the assembler to be external. AUTOEXT applies to functions which are referenced but not declared in the program. These are assumed to be external and are declared so to the assembler at the end of the compile run.

Table 20-4: Defined Values for the Class Field

The size field gives the number of bytes occupied by the object. If it is a scalar item, then the size is one for a character and two for an integer. However, the size of an array is the total amount of space taken up by the array. For character arrays, this is the same as the number of elements, but for integer arrays it is twice that value.

The offset field specified a numeric value (if applicable). Primarily this is the stack frame offset for local objects. If the entry describes a label, this field contains the compiler assigned label number to be used in place of the actual label name. Label names are not used directly because their scope is restricted to the functions in which they appear. The same label may occur many times in a program, so using the name directly could result in an error at assembly time. Therefore, Small C associates with each declared label a unique compiler generated label number which is used instead.

The name field contains the name of an entry as a character string. This is a variable length field in the local table and a fixed length field in the global table. In the local table, the name is terminated with a byte containing the length of the name (1-8). In the global table a null byte terminates the name and any remaining bytes are ignored.

The local table is searched sequentially backward, making it easy to implement the scope rules for local declarations. The inefficiency of this method is compensated by the relatively small number of local declarations in any given function. Since local declarations can occur at the beginning of any compound statement and compound statements can be nested, searching backwards guarantees that when the same name occurs more than once the search will find the one declared at the lowest level. Locptr controls placement of entries in the local table. At the beginning of each function dofunction() resets it to STARTLOC, thereby emptying the table. Also, at the beginning of each compound statement compound() saves locptr so it can restore it at the end, thereby causing the local table to forget declarations made in the block while remembering those made in superior blocks. The function declloc() is then used to process local declarations except labels. Declloc() in turn calls addsym() to actually place new symbols in the table; the same function adds entries to the global table also. The address of locptr is passed to addsym() so it will know where to place the new entry and so it can set locptr to the address of the next entry to be placed in the local table; i.e., the byte following the name in the current entry. Since the name is terminated with a byte containing the length of the name, the local table can be searched backward even though it has variable length entries. The function addlabel() adds labels to the local table. The function findloc() searches the local table.

The global table is processed by means of a hashing algorithm. This compensates for the large number of entries which it usually contains. The function findglb() searches the global table by calling search(), the same function used for macro name searches. Basically, search() hashes the sought symbol to pick an entry of choice. It then proceeds sequentially (and cyclically) from that point, looking for a match. A match or a vacant entry terminates the search. The terminating entry's address is placed in the global pointer cptr, and true is returned if there was a match, or false otherwise. If the search failed then cptr points to the place where a new entry with the specified name should go.

The function declglb() processes global declarations. Declglb() in turn calls addsym(), the same function that adds entries to the local table, to place a new symbol in the global table. The address of glbptr is passed to addsym() so it will know that it is working in the global table. In that case it calls findglb() to locate a place for the new entry.

The While Queue

The while queue is used to store information needed for breaking out of and continuing while, for, and do loops and for breaking out of switch statements. A queue is needed because these constructs can be nested. Obviously, the term while queue is a misnomer. At one time Small C supported only the while statement for loop control, and the switch statement was not supported. Then, with the addition of the new statements, the while queue was further employed but not renamed.

Figure 20-4 shows the format of the while queue. It is really just a short table of integers, three per entry. WQTABSZ specifies the length of the queue in integers and so must be a multiple of three. WQSIZ defines the length of an entry (three) and WQMAX points to the last possible entry. The three items of information needed for continuing and breaking out of control constructs are: (1) the value of the compiler-relative stack pointer (csp) upon entry to the construct, (2) the number of the label that marks the continuation point of a loop, and (3) the number of the label that marks the exit point of the construct. The first item is needed so the stack can be adjusted back to its original value, thereby deallocating any block-local variables that are no longer needed. The second and third are needed so that jumps can be generated which target the continuation and exit points respectively.

Figure 20-4. Format of the While Queue

The global pointer wqptr points to the next available entry in the queue. It is initially set to the address of the first entry, wq. As each while, for, do, or switch is encountered, an entry is made at wqptr which is then advanced to the next entry. This action is performed by addwhile() which in turn calls getlabel() to create the two needed label numbers.

When a break or continue statement is found readwhile() is called to fetch the address of the last entry in the queue. This is appropriate for break statements, but for continue statements, a superior entry might the desired one, since continues do not pertain to switch statements. If the last entries in the queue correspond to switch statements, then a backward search must be made for the entry representing the most nested loop statement--the last entry with a non-zero continue label number. This is done by repeatedly calling readwhile(), each time passing it the pointer it returned previously. With each call, readwhile() returns the next previous entry pointer.

Finally, at the end of the construct, delwhile() is called to remove the current (last) entry from the queue by backing up wqptr one entry. Addwhile() and delwhile() are called by dowhile(), dofor(), dodo(), and doswitch(). Readwhile() is called by dobreak() and docont().

Go to Chapter 21 Return to Table of Contents