CHAPTER 22:
THE FRONT END

In this chapter we show how the source code is obtained, preprocessed, and scanned before going to the parser. We refer to these functions collectively as the front end of the compiler because they stand on the input side of the parser.

With most C compilers, the preprocessor is separate from the compiler itself. It makes its own pass on the input, changing its form for input to the compiler. With Small C, however, the preprocessor is integrated into the compiler so that it operates between the reading and scanning of the source code. This arrangement saves a pass on the program, but more importantly for Small C, it preserves the single pass design which enables the compiler to respond immediately to program fragments.

Because of this integration and certain efficiency considerations, the distinction between preprocessing and parsing is blurred in the Small C compiler. For instance, two standard preprocessor directives, #define and #include, are handled by the parser, as are the Small C directives #asm and #endasm. Conversely, two of the scanning functions, match() and streq(), are also used by the preprocessor.

As we saw in Chapter 2, the term token refers to instances of the smallest language elements. Individual characters are not necessarily tokens; however, things like names, expression operators, constants, keywords, and punctuation marks are. It is important to realize that tokens may comprise one or more characters, and that they may, or may not, be separated by white space.

The Input Buffers

Before proceeding with the front end functions, it might help to review the data structures which they use. Recall from Chapter 20 that there are two input buffers--pointed to by pline and mline. Input is read, one line at a time, into mline from which it is preprocessed into pline.

The global pointer line points to one or the other of these in order to direct the scanning routines, which are shared by the parser and the preprocessor, to the correct buffer. The global pointer lptr points the scanning functions to the current character of line and the global integer ch holds a copy of that character. This use of ch makes many references to the current character more efficient.

Three functions look for tokens in the input line; they are symname(), match(), and amatch(). These functions exhibit similar behavior in that they advance lptr past the current token if they find what they are looking for, but only if they find it. This allows the parser to make repeated attempts at recognizing the current token without passing it until it is finally accepted. These functions are described more fully below.

Front End Functions

Figure 22-1 diagrams the relationships among the major front end functions. Specifically, it shows the paths by which control flows from the three scanning functions, at the top, down to the library function fgets() which reads lines from the current input file. A vertical line connecting two functions means that the higher function calls the lower function. Notice that preprocess() fits midway between the matching functions at the top and the I/O function at the bottom. Other, miscellaneous front end functions exist, but these suffice to show how program code filters from the input file up to the parser.

We look first at the miscellaneous functions, and then follow Figure 22-1 from top to bottom.

Figure 22-1: Major Front End Functions

Keepch()

Keepch() serves preprocess() by placing one character at a time into pline. It first verifies that the character will fit. If not, it does nothing. In that case the character is lost and when preprocess() reaches the end of the line it issues an error message.
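The bounded store that keepch() performs can be sketched as follows. This is a minimal illustration, not the compiler's source: the buffer size LINEMAX, the names pline_buf and pptr, and the helper finish_line_sketch() are all assumptions, as is the choice to keep counting dropped characters so that the overflow can be detected at the end of the line.

```c
#define LINEMAX 16            /* hypothetical capacity, kept small for testing */

char pline_buf[LINEMAX + 1];  /* stand-in for the buffer that pline points to */
int  pptr;                    /* number of characters offered so far */

/* Store c if it fits; otherwise drop it, but keep counting so that the
   caller can tell at the end of the line that something was lost. */
void keepch_sketch(char c) {
    if (pptr < LINEMAX)
        pline_buf[pptr] = c;
    ++pptr;
}

/* Terminate the buffer. Returns 1 if the line overflowed, 0 otherwise. */
int finish_line_sketch(void) {
    int overflow = pptr > LINEMAX;
    pline_buf[overflow ? LINEMAX : pptr] = '\0';
    return overflow;
}
```

A caller would reset pptr to zero before each line and issue its error message when finish_line_sketch() returns 1.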

Inbyte()

Inbyte() returns the current character of the input line after advancing to the next one. It calls gch() to do this; however, it differs from gch() in that it calls preprocess() to fetch a new line if the end of the current one has been reached. Furthermore, if the new line is empty it goes for yet another, and so on until the next source character is found or the end of the last input file is reached. In the latter case, it returns zero.

Gch()

Gch() returns the current character of the input line, advances lptr to the next one, and places it in ch. If the current character is the null terminator, then gch() does not attempt to advance further, but does return the null character to the caller. Gch() calls bump() to advance to the next character.

Bump()

Bump() either advances the current position in the input line (indicated by lptr) a specified number of positions beyond the current character, or it sets it to the beginning of the line. It accepts an integer n which may be zero or a positive value. If n is zero, lptr is set to line; otherwise, it is increased by n. Then ch is assigned the value of the new character. Finally, if the new character is not the null terminator, nch (next character) is assigned the value of the following character; otherwise, it is assigned the null value. Bump() does not verify that n will not advance beyond the end of the line. It relies on the calling functions to ensure this.
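The behavior described for gch() and bump() can be sketched in a few lines. This is an illustration written from the descriptions above, in modern C, and should not be read as the compiler's own code; the variable types in particular are assumptions.

```c
char *line;  /* start of the buffer being scanned */
char *lptr;  /* current position within line */
int   ch;    /* copy of the current character */
int   nch;   /* copy of the following character, or zero */

/* n == 0 resets to the start of the line; n > 0 advances n characters.
   The caller must ensure that n does not run past the null terminator. */
void bump(int n) {
    if (n) lptr += n;
    else   lptr = line;
    ch  = (unsigned char)*lptr;
    nch = ch ? (unsigned char)lptr[1] : 0;
}

/* Return the current character and step past it, unless it is the
   null terminator, in which case the position is left unchanged. */
int gch(void) {
    int c = ch;
    if (c) bump(1);
    return c;
}
```

With line pointing at "ab", a call to bump(0) leaves ch equal to 'a' and nch equal to 'b'; two calls to gch() then return 'a' and 'b', and every later call returns zero without moving lptr.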

Blanks()

Blanks() advances the input past white space to the beginning of the next token or until the end of the input is reached. If necessary, it calls preprocess() to obtain new, preprocessed source lines. It calls white() to determine whether or not a character is to be skipped, and gch() to advance to the next character.

White()

White() returns true if the current input character is a space or a control character and false otherwise.

Alpha()

Alpha() returns true if the character passed to it is alphabetic or an underscore, and false otherwise.

An()

An() returns true if the character passed to it is alphabetic, an underscore, or numeric, and false otherwise.
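The three character tests might look like the sketch below. Two points are assumptions rather than facts from the text: the white space test is written to take the character as an argument (the text describes it as examining the current input character), and it treats the null terminator as not white so that the end of a line can be detected separately. The name is_white() is hypothetical.

```c
#include <ctype.h>

/* True for a space or an ASCII control character below it, but not
   for the null terminator (an assumption; see the lead-in). */
int is_white(int c) { return c > 0 && c <= ' '; }

/* True for a letter or an underscore. */
int alpha(int c) { return isalpha((unsigned char)c) || c == '_'; }

/* True for a letter, an underscore, or a digit. */
int an(int c) { return alpha(c) || isdigit((unsigned char)c); }
```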

Streq()

Streq() indicates whether or not the current substring in the source line matches a literal string. It accepts the address of the current character in the source line and the address of a literal string, and returns the substring length if a match occurs and zero otherwise. While repeatedly matching characters from left to right, if an inequality is found, failure is indicated. Otherwise, when the end of the literal is reached, success is indicated. Obviously, the end of the literal must determine the length of the comparison since the substring may be followed by anything and so cannot terminate the comparison.
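A minimal sketch of this prefix comparison, written from the description above rather than copied from the compiler, could be:

```c
/* Return the length of lit if str begins with lit, and zero otherwise.
   Only the end of lit stops the comparison. */
int streq(const char *str, const char *lit) {
    int k = 0;
    while (lit[k]) {
        if (str[k] != lit[k]) return 0;
        ++k;
    }
    return k;
}
```

Note that streq("whilex", "while") succeeds with a length of 5 even though the token in the source line is longer than the literal.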

Astreq()

Astreq() indicates whether or not two alphanumeric strings or substrings match. It serves two purposes--to compare symbol names extracted from the source line with names in the symbol table, and to compare alphanumeric tokens in the source line with literal strings. In the latter case, it would seem to overlap the function of streq(), but there is a difference. Streq() will match a literal with the beginning of a token, whereas astreq() ensures that the entire token is examined. The term string is used now to refer to either an alphanumeric string with its own terminator, or an alphanumeric substring which is terminated by any non-alphanumeric character. Astreq() accepts three arguments--the addresses of two strings and the maximum number of characters to match on. It returns either the length of the matched strings or zero according to whether or not a match occurred. It loops, matching characters from the strings, from left to right, until (1) inequality is found, (2) the end of the first string is reached, (3) the end of the second string is reached, or (4) the maximum length is reached. Then, if the ends of both strings have been reached simultaneously, success is indicated; otherwise, failure.
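The whole-token comparison can be sketched as below. This is an illustration under stated assumptions: the end of a string is taken to be any character that is not alphanumeric or an underscore, and the loop folds conditions (2) and (3) into a single test because the two characters are already known to be equal at that point.

```c
#include <ctype.h>

int an(int c) { return isalnum((unsigned char)c) || c == '_'; }

/* Return the matched length if s1 and s2 hold the same alphanumeric
   token (comparing at most len characters), and zero otherwise. */
int astreq(const char *s1, const char *s2, int len) {
    int k = 0;
    while (k < len) {
        if (s1[k] != s2[k]) break;  /* (1) inequality */
        if (!an(s1[k])) break;      /* (2), (3) both strings have ended */
        ++k;
    }
    /* Success only if neither string continues with name characters. */
    if (an(s1[k]) || an(s2[k])) return 0;
    return k;
}
```

Here astreq("whilex", "while", 5) returns zero, whereas a plain prefix comparison of the same two strings would succeed.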

Symname()

Symname() is the first of the major front end functions in Figure 22-1. It is called whenever the parser thinks a name fits the syntax. Its purpose is to indicate whether or not a legal symbol is next in the input line and, if so, to copy it to a designated buffer and skip over it in the source line. Symname() first calls blanks() to advance past any white space at the current position in the line and, if necessary, to input and preprocess another line. If the first non-white character is not alphabetic (or an underscore) it returns false since that is a requirement for C names. Otherwise, it copies the symbol to the buffer pointed to by its only argument sname. If the symbol exceeds eight characters, only the first eight are copied and the rest are bypassed. Finally, it terminates the destination string with a null byte and returns true. Symname() calls alpha() to determine if the first character is alphabetic (or an underscore), an() to determine if other characters are alphanumeric (including underscore), and gch() to accept the current character from the input line and advance to the next one.
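The name-extraction logic can be illustrated with the following sketch. It is a simplification built on assumptions: it scans a single line through a bare pointer, replaces blanks() with an inline loop that never fetches a new line, and uses the constant NAMEMAX (a hypothetical name) for the eight-character limit mentioned above.

```c
#include <ctype.h>

#define NAMEMAX 8   /* significant characters in a name, per the text */

const char *lptr;   /* current position in the line being scanned */

int alpha(int c) { return isalpha((unsigned char)c) || c == '_'; }
int an(int c)    { return alpha(c) || isdigit((unsigned char)c); }

/* Copy the next name into sname (at least NAMEMAX + 1 bytes) and skip
   over it. Returns 1 on success and 0 if no name is next; in the latter
   case lptr is left at the first non-white character. */
int symname(char *sname) {
    int k = 0;
    while (*lptr && (unsigned char)*lptr <= ' ')  /* stand-in for blanks() */
        ++lptr;
    if (!alpha(*lptr)) return 0;
    while (an(*lptr)) {
        if (k < NAMEMAX) sname[k++] = *lptr;      /* keep the first eight */
        ++lptr;                                   /* bypass the rest */
    }
    sname[k] = '\0';
    return 1;
}
```

Given the line "  counter_total = 1", this sketch stores "counter_" in sname and leaves lptr at the blank before the equal sign.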

Match()

Match() is one of two scanning functions that look for a match between a literal string and the current token in the input line. It skips over the token and returns true if a match occurs; otherwise, it retains the current position in the input line and returns false. First, however, it calls blanks() to skip over white space to the next token, preprocessing a new line if necessary. It calls bump() to advance over the matched token. It is important to notice that since match() calls streq() it matches the literal with the same number of characters in the source line and there is no verification that all of the token was matched.

Amatch()

Amatch() is roughly equivalent to match() except that it assumes that an alphanumeric (including underscore) comparison is being made and guarantees that all of the token in the source line is scanned in the process. It uses astreq() to do the comparing.
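The difference between the two matching functions shows up clearly in a small sketch. The code below is an illustration only: it takes a single argument in both cases, uses strncmp() in place of streq() and astreq(), and replaces blanks() with a simple loop (skip_white() is a hypothetical helper) that does not fetch new lines.

```c
#include <ctype.h>
#include <string.h>

const char *lptr;   /* current position in the line being scanned */

int an(int c) { return isalnum((unsigned char)c) || c == '_'; }

void skip_white(void) {
    while (*lptr && (unsigned char)*lptr <= ' ') ++lptr;
}

/* Prefix match: succeeds even if the source token continues. */
int match(const char *lit) {
    size_t n = strlen(lit);
    skip_white();
    if (strncmp(lptr, lit, n) != 0) return 0;
    lptr += n;
    return 1;
}

/* Whole-token match: fails if a name character follows the literal. */
int amatch(const char *lit) {
    size_t n = strlen(lit);
    skip_white();
    if (strncmp(lptr, lit, n) != 0 || an(lptr[n])) return 0;
    lptr += n;
    return 1;
}
```

With the input " integer", match("int") succeeds and leaves "eger" unscanned, while amatch("int") fails and leaves the position at the start of the token. This is why keywords are better tested with the alphanumeric version and punctuation with the plain one.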

Nextop()

Nextop() is called by the expression analyzer to determine if the next token in the source line is one of a list of expression operators. The address of the string containing the list is received as an argument. Each operator in the string is separated from the others by a single space, and, as usual, the string is terminated with a null byte. With each iteration of an infinite loop, nextop() extracts into the local character array op[] the next operator from the list. It then calls streq() to match it to the current token. If there is a match, further tests are made to ensure that, for instance, < did not match <=. If a match did indeed occur, then true is returned. However, if the list is exhausted without a match, then false is returned. On success, the global integer opindex is set to the offset of the matched operator in the list; that is, its subscript. Later, in the expression analyzer, this will be adjusted so that it will correctly subscript the matched operator in the two global arrays op[] and op2[].
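A sketch of the list search follows. Several details are assumptions: the token is passed as an argument instead of being read through lptr, strncmp() stands in for streq(), operators are assumed to be at most three characters long, and the "further tests" are taken to be a check that the character after the match is neither an equal sign nor a repeat of the operator's last character.

```c
#include <string.h>

int opindex;   /* subscript of the matched operator within the list */

/* Return 1 if tok begins with one of the space-separated operators in
   list, leaving its position in opindex; return 0 otherwise. */
int nextop(const char *tok, const char *list) {
    char op[4];
    opindex = 0;
    for (;;) {
        size_t n = 0;
        while (*list > ' ' && n < sizeof op - 1)  /* extract next operator */
            op[n++] = *list++;
        op[n] = '\0';
        if (n && strncmp(tok, op, n) == 0
              && tok[n] != '='           /* "<" must not match "<=" */
              && tok[n] != tok[n - 1])   /* "<" must not match "<<" */
            return 1;
        if (*list == '\0') return 0;     /* list exhausted */
        ++list;                          /* skip the separating space */
        ++opindex;
    }
}
```

For example, nextop("< y", "<= >= < >") returns 1 with opindex set to 2, while nextop("<= y", "< >") returns 0.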

Preprocess()

Preprocess() is a rather large function that fetches and preprocesses a line of source code. It executes the following steps:

Set line to mline.
Call ifline() to obtain the next line that is not excluded by #ifdef or #ifndef directives.
Return immediately if the end of the last source file has been reached.
Copy mline to pline with special treatment given to white space, character strings, character constants, comments, and macro names.
Determine if pline has overflowed and, if so, issue the message "line too long."
Set line to pline for subsequent parsing.
Establish the first character as the current character for parsing.

The cases receiving special treatment are:

White space, of any length, is reduced to a single blank character.
Character strings are checked to ensure that there is a closing quote in the same line as the opening quote.
Character constants are checked to ensure that there is a closing apostrophe.
Comments are eliminated entirely from the preprocessed code. If necessary, additional lines are obtained until the end of a comment is reached.

Macro names are recognized and replaced by their substitution text. A token is suspected of being a macro name if it begins with an alphanumeric character (or underscore). It is then copied into a short character array and passed on to search() for a check against the macro name table. If it is not found in the table, it is simply copied directly to pline. However, if it is found, then its offset is taken from the table and used to locate the start of its replacement text in the macro text queue. From there, everything is copied to pline until a null character is reached. Finally, the remainder (beyond eight characters) of the macro name in mline is passed over.

Gch() is called to advance to the next character in mline and keepch() is called to place characters into pline.
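The copying step and its special cases can be sketched on a single line of text. This is a simplified illustration, not the compiler's preprocess(): the macro table is reduced to one hypothetical entry (MAX defined as 100), names are compared in full instead of being truncated to eight characters, comments are assumed to end on the same line, escaped quotes and character constants are not handled, and no error messages are issued. The names out, outlen, overflow, keep(), and preprocess_line() are all assumptions.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical one-entry macro table; the compiler's real table and
   text queue are not reproduced here. */
const char *macro_name = "MAX";
const char *macro_text = "100";

char   out[64];    /* stand-in for pline */
size_t outlen;     /* characters stored so far */
int    overflow;   /* set when a character did not fit */

int an(int c) { return isalnum((unsigned char)c) || c == '_'; }

void keep(char c) {
    if (outlen + 1 < sizeof out) out[outlen++] = c;
    else overflow = 1;
}

/* Copy src to out, applying the special cases. Returns 0, or -1 if
   the output buffer overflowed. */
int preprocess_line(const char *src) {
    outlen = 0;
    overflow = 0;
    while (*src) {
        if ((unsigned char)*src <= ' ') {             /* white space: one blank */
            while (*src && (unsigned char)*src <= ' ') ++src;
            keep(' ');
        } else if (*src == '"') {                     /* string: copy verbatim */
            keep(*src++);
            while (*src && *src != '"') keep(*src++);
            if (*src) keep(*src++);
        } else if (src[0] == '/' && src[1] == '*') {  /* comment: drop it */
            const char *end = strstr(src + 2, "*/");
            src = end ? end + 2 : src + strlen(src);
        } else if (an(*src)) {                        /* possible macro name */
            const char *start = src;
            size_t len;
            while (an(*src)) ++src;
            len = (size_t)(src - start);
            if (len == strlen(macro_name) &&
                strncmp(start, macro_name, len) == 0) {
                const char *t;
                for (t = macro_text; *t; ++t) keep(*t);  /* substitute */
            } else {
                while (start < src) keep(*start++);      /* copy as is */
            }
        } else {
            keep(*src++);
        }
    }
    out[outlen] = '\0';
    return overflow ? -1 : 0;
}
```

Note that MAXIMUM is copied unchanged because the whole token, not just its first three characters, is compared with the macro name.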

Ifline()

Ifline() handles all matters pertaining to conditional compilation. Since it is called only by preprocess(), it should be viewed as part of the preprocessor. Standing between the source file(s) and preprocess(), it obtains source lines from inline(), looks for #ifdef, #ifndef, #else, and #endif directives, and decides the fate of the lines which they control--whether or not they are passed up to preprocess().

Ifline() contains one large infinite loop in which inline() is called at the top, followed by tests for the conditional compilation directives. If reached, a break at the bottom discontinues the loop and returns control to the preprocessor with a new line. If a conditional compilation directive is seen, it is duly noted and the loop is continued. Conditional compilation directives, therefore, never make it to the preprocessor.

The global integer iflevel serves to match each #else and #endif with its antecedent #if... . Each #if... increases it by one, and each #endif decreases it by one. Therefore, it reflects the nesting level of #if... directives.

Another global integer skiplevel indicates whether or not source lines are being skipped, and at what level the skipping was initiated. A segment of code being skipped is delimited by the next #else or #endif at the same level as the #if... which started the skipping.

The crucial statement in ifline() is

		if(skiplevel) continue;

located before the break at the bottom of the loop. If skiplevel is not zero the loop is continued and the current line is skipped. However, if it is zero the break is reached and the current line is passed to the preprocessor.

At this point the following explanations should make sense:

An #ifdef advances iflevel by one then checks skiplevel to see if it is currently skipping text. If so, it continues the loop. However, if it is not already skipping and if the specified symbol has not been defined, it should initiate skipping. Therefore, it calls symname() to extract the macro name from #ifdef, then it calls search() to look for the name in the macro name table. If the search fails, skiplevel is set to iflevel to initiate skipping and to record the nesting level at which it began. Finally, the loop is continued.
An #ifndef is handled exactly like an #ifdef except that skipping is initiated if the specified symbol is found in the macro name table.
An #else directive first checks iflevel to see if an antecedent #if... exists; that failing, the message "no matching #if..." is issued and the loop is continued. Assuming no such error, the #else must either initiate skipping, terminate skipping, or do nothing at all. It initiates skipping, regardless of its nesting level, if skipping has not already been initiated (skiplevel is zero). As with the previous two directives, this involves assigning iflevel to skiplevel. It terminates skipping only if skipping is already in progress and the nesting level matches the level at which skipping was initiated (skiplevel equals iflevel). This is done by assigning zero to skiplevel. Finally, if neither condition exists, the loop is continued without change.
An #endif directive also checks iflevel for an antecedent #if..., issuing "no matching #if..." if there is not one. Assuming no such error, the #endif must either terminate skipping, or do nothing at all. It terminates skipping only if skipping is already in progress and the nesting level matches the level at which skipping was initiated (skiplevel equals iflevel). As before, this is done by assigning zero to skiplevel. Finally, iflevel is decremented by one and the loop is continued.

One last check is made before returning to the preprocessor. If the new line is null, the loop is continued in an attempt to obtain a significant line.
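The interplay of iflevel and skiplevel is easier to follow in code. The sketch below processes one line per call instead of looping, and it rests on assumptions: directives are recognized with strncmp() at the start of the line, the hypothetical is_defined() stands in for the macro table search (here only DEBUG counts as defined), and the error message is reduced to a comment. The function name ifline_step() is also hypothetical.

```c
#include <string.h>

int iflevel;    /* nesting depth of #if... directives */
int skiplevel;  /* level at which skipping began; 0 means not skipping */

/* Hypothetical stand-in for a search of the macro name table. */
int is_defined(const char *name) { return strcmp(name, "DEBUG") == 0; }

/* Return 1 if ln should be passed up to the preprocessor, and 0 if it
   is a directive, is being skipped, or is empty. */
int ifline_step(const char *ln) {
    if (strncmp(ln, "#ifdef ", 7) == 0) {
        ++iflevel;
        if (!skiplevel && !is_defined(ln + 7)) skiplevel = iflevel;
        return 0;
    }
    if (strncmp(ln, "#ifndef ", 8) == 0) {
        ++iflevel;
        if (!skiplevel && is_defined(ln + 8)) skiplevel = iflevel;
        return 0;
    }
    if (strncmp(ln, "#else", 5) == 0) {
        if (iflevel == 0) return 0;              /* "no matching #if..." */
        if (skiplevel == iflevel) skiplevel = 0; /* terminate skipping */
        else if (skiplevel == 0)  skiplevel = iflevel;  /* initiate it */
        return 0;
    }
    if (strncmp(ln, "#endif", 6) == 0) {
        if (iflevel == 0) return 0;              /* "no matching #if..." */
        if (skiplevel == iflevel) skiplevel = 0;
        --iflevel;
        return 0;
    }
    if (skiplevel) return 0;                     /* the crucial test */
    return *ln != '\0';                          /* null lines are not passed */
}
```

Feeding it the lines "#ifdef DEBUG", "a", "#ifndef DEBUG", "b", "#else", "c", "#endif", "#else", "d", "#endif", "e" passes only a, c, and e, and leaves both counters at zero.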

Inline()

Inline() fetches the next line of code from a source file and optionally lists it. Two global file descriptors direct it to input files; input designates the current primary file, and input2 designates the current #include file. (Since only one file descriptor is used for include files, nesting of include statements is not supported. See Chapter 28 for suggestions on making the compiler support nested include files.) Both descriptors are initially set to EOF, indicating that no input file has yet been opened. On entry, inline() checks input for EOF and, finding it, calls openfile() to prepare a file for input. Should that fail, eof would be set true and inline() would simply exit. Next, it decides which descriptor to use. A local integer unit is set to input or input2 depending on whether or not input2 equals EOF. Unit is then passed to fgets() to get the next line from the source file. In other words, if input2 is not set to EOF, it contains the file descriptor of an active include file, and so it preempts the primary input file designated by input.

If the end-of-file condition is met, then the file designated by unit is closed, either input or input2 (whichever pertains) is set to EOF, the input line is nulled by assigning zero to the first character, which is made current, and control is returned to the calling function. When this happens, a higher level function will eventually go for another line and inline() will again receive control. If the end-of-file occurred on input, openfile() will again be called to open the next source file, as described above. However, if it occurred on input2, then the include file will no longer preempt the primary input file and input will be used to fetch the next line from the current source file.
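The way the include file preempts the primary file can be sketched with standard C streams. This is an illustration under assumptions, not the compiler's code: it uses FILE pointers with NULL meaning "no file" where the text uses file descriptors set to EOF, it omits the call to openfile(), and the name fetch_line() is hypothetical.

```c
#include <stdio.h>

FILE *input;   /* current primary file, or NULL when none is open */
FILE *input2;  /* current include file, or NULL when none is active */

/* Read the next line into buf. Returns 1 if a line was read, and 0 if
   the selected file was exhausted, in which case that file is closed,
   its pointer is cleared, and buf is set to an empty line. */
int fetch_line(char *buf, int size) {
    FILE **unit = input2 ? &input2 : &input;  /* include file preempts */
    if (*unit == NULL || fgets(buf, size, *unit) == NULL) {
        if (*unit) {
            fclose(*unit);
            *unit = NULL;
        }
        buf[0] = '\0';
        return 0;
    }
    return 1;
}
```

Once the include file is exhausted and input2 is cleared, the next call falls back to input automatically, which mirrors the behavior described above.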

After successfully fetching a line, inline() tests listfp to see if a listing has been requested. If so, it writes the line to listfp. First, however, it checks to see if listfp is the same as the output file descriptor output. In that case, source lines are being interleaved with generated code, so each source line must be made to look like a comment by placing a semicolon before it.

Finally, before returning, bump() is called to establish the first character in the new line as the current character.
