CHAPTER 2:

TOKENS

The smallest elements in the syntax of a language are called tokens. A token may be a single character or a sequence of characters that form a single item. The first order of business for a compiler is to recognize the individual tokens from which the program is built. It then tries to recognize patterns of tokens as language constructs. And, finally, it generates code to perform the activities which the language constructs specify. Since tokens are the building blocks of programs, we begin our study of the Small C language by defining its tokens.

ASCII Character Set

First we look at the character set from which tokens are constructed. Small C uses the ASCII character set. As Appendix A indicates, there are 128 defined characters in the ASCII set. The first 32 (values 0-31) and the last one (127) are classified as control characters. Among them are the horizontal tab (9), the line feed (10), and the carriage return (13), all of which are commonly found in C programs. These characters, together with the space character (32), are often called white space characters since they form the gaps which usually separate tokens.

Codes 32-126 include the "normal" characters; these include the space (32), the numeric digits (48-57), the uppercase alphabetics (65-90), the lowercase alphabetics (97-122), and the special characters (all other values). From these are built the tokens of the Small C language.
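
The ranges just described can be turned directly into a classification routine. The following is a minimal sketch in standard C; the names classify, is_white, and the CC_ constants are our own inventions for illustration and are not part of the Small C compiler.

```c
/* Character classes as described in the text; the names are illustrative. */
enum char_class { CC_CONTROL, CC_SPACE, CC_DIGIT, CC_UPPER, CC_LOWER,
                  CC_SPECIAL, CC_INVALID };

/* Classify an ASCII code (0-127) using the ranges given above. */
enum char_class classify(int c)
{
    if (c < 0 || c > 127)    return CC_INVALID;  /* outside the ASCII set */
    if (c < 32 || c == 127)  return CC_CONTROL;
    if (c == 32)             return CC_SPACE;
    if (c >= 48 && c <= 57)  return CC_DIGIT;
    if (c >= 65 && c <= 90)  return CC_UPPER;
    if (c >= 97 && c <= 122) return CC_LOWER;
    return CC_SPECIAL;                           /* all other values in 33-126 */
}

/* White space in the sense used above: tab, line feed, carriage return, space. */
int is_white(int c)
{
    return c == 9 || c == 10 || c == 13 || c == 32;
}
```

With these definitions, classify('7') yields CC_DIGIT and classify('+') yields CC_SPECIAL, while is_white(9) is true because the horizontal tab is one of the gap-forming characters.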

Constants

Numeric constants consist of an uninterrupted sequence of digits which is delimited by white space or special characters (operators or punctuation). Since only integers are known to Small C, the period cannot appear in numeric constants. They may be written with a leading plus or minus sign, however. For more about numeric constants see Chapter 3.
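
To make the idea of an uninterrupted sequence of digits concrete, here is a small sketch of how a scanner might collect one. It is written in standard C, the function name scan_number is hypothetical, and it is not taken from the Small C compiler. It handles the digits only; a leading sign is left to the caller, and overflow is not checked.

```c
/* Scan an unsigned decimal constant at the start of s.
   Stores its value in *value and returns the number of digits consumed
   (0 if s does not start with a digit). Overflow is not checked. */
int scan_number(const char *s, int *value)
{
    int n = 0;   /* digits consumed so far */
    int v = 0;   /* value accumulated so far */

    while (s[n] >= '0' && s[n] <= '9') {
        v = v * 10 + (s[n] - '0');
        n++;
    }
    *value = v;
    return n;
}
```

Given the text 123+4, the scan stops at the plus sign after three digits. Given 12.5, it stops at the period, which illustrates why a period cannot be part of a numeric constant here.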

Character constants are written by enclosing an ASCII character in apostrophes. We would write 'a' for a constant with the ASCII value of the lowercase a (97). There are variations on this idea, as we shall see in Chapter 3.

String constants are written as a sequence of ASCII characters bounded by quotation marks. Thus, "abc" describes a string of characters containing the first three letters of the alphabet in lowercase. See Chapter 3 for more about string constants too.
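
The two kinds of constants can be checked against each other with a few lines of standard C. The function names below are ours, chosen only for this illustration, and the stated values assume an ASCII system.

```c
#include <string.h>

/* Value of the character constant 'a': 97 on an ASCII system. */
int value_of_a(void)
{
    return 'a';
}

/* Number of characters in the string constant "abc". */
int length_of_abc(void)
{
    return (int)strlen("abc");
}

/* Returns 1 because the first character of "abc" has the same value as 'a'. */
int first_matches(void)
{
    return "abc"[0] == 'a';
}
```

A character constant is thus a single integer value, whereas a string constant is a sequence of such values that can be subscripted like any other array of characters.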

Keywords

Keywords (sometimes called reserved words) are the tokens which look like words or abbreviations and serve to distinguish between the different language constructs. Examples are int in

		int i, j, k;

and while in

		while (i < 5) x[i++] = 0;

Perhaps the first thing noticed in Listing 1-1 was that, for the most part, the program is written in lowercase letters. In fact, all keywords in the C language are written in lowercase.

Names

Names (also called identifiers or symbols) are used to identify specific variables, functions, and macros. Small C names may be any length; however, only the first eight characters have significance. Trailing characters are ignored. Thus, the names nameindex1 and nameindex2 are both seen by the compiler as nameinde. This limit of eight characters, while common, is not universal among C compilers.
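
The effect of the eight-character limit can be sketched as a comparison that stops after the significant characters. This is a standard C illustration, not the compiler's own code; NAMESIZE and same_name are hypothetical names, and letter case is compared exactly to keep the sketch short.

```c
#define NAMESIZE 8   /* significant characters, per the text */

/* Return 1 if two names agree in their first NAMESIZE characters. */
int same_name(const char *a, const char *b)
{
    int i;

    for (i = 0; i < NAMESIZE; i++) {
        if (a[i] != b[i]) return 0;   /* they differ within the limit */
        if (a[i] == '\0') return 1;   /* both ended before the limit */
    }
    return 1;   /* identical through the limit; the rest is ignored */
}
```

Here same_name("nameindex1", "nameindex2") returns 1, since the names differ only in the tenth character, whereas same_name("count", "counter") returns 0 because the difference falls within the first eight.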

Names must begin with a letter and the remaining characters must be either letters or digits. The underscore character counts as a letter, however. Thus, the name _abc is perfectly legal, as is a_b_c.
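
These rules are simple enough to express as a short checking routine. The sketch below is in standard C; is_name_letter and is_valid_name are names we made up for the purpose, and the routine checks only the form of a name, not whether it collides with a keyword.

```c
/* Letters for naming purposes: A-Z, a-z, and the underscore. */
static int is_name_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* Return 1 if s is a well-formed name: a letter, then letters or digits. */
int is_valid_name(const char *s)
{
    int i;

    if (!is_name_letter(s[0])) return 0;   /* also rejects the empty string */
    for (i = 1; s[i] != '\0'; i++) {
        if (!is_name_letter(s[i]) && !(s[i] >= '0' && s[i] <= '9'))
            return 0;
    }
    return 1;
}
```

Both _abc and a_b_c pass this test, while 2abc fails because it begins with a digit.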

Names may be written with both uppercase and lowercase letters, which are equivalent. It is customary, however, to use lowercase except for macro symbols. The practice of naming macros in uppercase calls attention to the fact that they are not variable names but defined symbols. To improve readability, one common practice is to capitalize the first letter of each term that goes into a name. GetTwo, for instance, reads better than gettwo.

Every global name defined to the Small C compiler generates an assembly language label of the same name, but preceded by an underscore. The purpose of the underscore is to avoid clashes with the assembler's reserved words. As you study the Small C library (Appendix D) you will notice that global variables and some functions are named with leading underscores. This common practice avoids clashes with names a programmer might choose. So, as a matter of practice, we should not ordinarily name globals with leading underscores.

Since the compiler adds its own underscore, names written with a leading underscore appear in the assembly file with two leading underscores.
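
The label rule amounts to a one-character prefix. The following standard C sketch shows the idea; make_label is a hypothetical helper, not a routine from the compiler, and it assumes the name has already been reduced to its significant characters.

```c
#include <string.h>

/* Build the assembly label for a global name by prefixing an underscore.
   out must have room for strlen(name) + 2 bytes. */
void make_label(const char *name, char *out)
{
    out[0] = '_';
    strcpy(out + 1, name);   /* copies the name and its terminating null */
}
```

Applied to count this produces _count, and applied to _abc it produces __abc, which is the doubled underscore mentioned above.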

Locals cannot clash with assembler-reserved words or library globals. This is because locals are allocated on the stack and are referenced relative to the stack frame instead of by name.

Punctuation

Punctuation in C is done with semicolons, colons, commas, apostrophes, quotation marks, braces, brackets, and parentheses. Semicolons are primarily used as statement terminators. One is placed at the end of every simple statement. As illustrated by

		{ x = j; j = k; k = x; }

even the last statement in a block requires one.

Preprocessor directives are an exception since they are not part of the C language proper, and each one occupies a line by itself. Semicolons also separate the three expressions in a for statement (Chapter 10), as illustrated by

		for (i = 0; i < 10; i = i + 1) x[i] = 0;

Since C has a goto statement, there must be a way of designating the destination address for a jump. This is done by writing an ordinary name followed by a colon. Such a name is called a label. An example is

		loop:
		...
		goto loop;

Colons also terminate the case and default prefixes which appear in switch statements (Chapter 10). Consider

		switch (var) {
			case  3: putchar('x');
			case  2: putchar('x');
			case  1: putchar('x');
			default: putchar(' ');
			}

for example. These prefixes may be thought of as special labels since they are in fact targets for a transfer of control.
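
Because the prefixes are only labels, control entering at one of them continues through the statements that follow it. The sketch below, in standard C, has the same shape as the switch above but appends to a buffer instead of calling putchar so that the result can be inspected. The name fall_through is ours, and the fall through comments are there only to tell readers and some modern compilers that the missing break statements are intentional.

```c
/* Same shape as the switch above, but appending to buf instead of
   calling putchar. buf needs room for at least 5 bytes. */
void fall_through(int var, char *buf)
{
    int n = 0;

    switch (var) {
        case  3: buf[n++] = 'x';
                 /* fall through */
        case  2: buf[n++] = 'x';
                 /* fall through */
        case  1: buf[n++] = 'x';
                 /* fall through */
        default: buf[n++] = ' ';
    }
    buf[n] = '\0';   /* terminate the string */
}
```

A value of 3 yields three x characters and a space, 2 yields two, 1 yields one, and any other value yields only the space.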

Commas separate items that appear in lists. Thus, three integers may be declared by

		int i, j, k;

for instance. Or, a function requiring four arguments might be called with the statement

		func (arg1, arg2, arg3, arg4);

Commas are also used to separate the expressions in a list of expressions. Sometimes it adds clarity to a program if related variables are modified at the same place. For example,

		while (++i, --k) abc ();

The value of a list of expressions is always the value of the last expression in the list.
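
A short experiment shows which expression controls the loop. In this standard C sketch the call to abc() is replaced by a counter; count_iterations and last_of_list are names invented for the illustration, and the first function assumes k is at least 1 so that the loop terminates.

```c
/* Count how many times the body of  while (++i, --k)  runs.
   The loop condition is the value of --k, the last expression;
   ++i is evaluated first, purely for its side effect.
   Assumes k >= 1 so the loop terminates. */
int count_iterations(int k, int *final_i)
{
    int i = 0;
    int calls = 0;

    while (++i, --k)
        calls++;          /* stands in for abc() */
    *final_i = i;
    return calls;
}

/* The value of a comma list is its last expression: here a + b. */
int last_of_list(void)
{
    int a, b;

    return (a = 1, b = 2, a + b);
}
```

Starting with k equal to 3, the body runs twice and i ends at 3: the condition is evaluated three times, and only the value of --k decides whether the loop continues.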

Square brackets enclose array dimensions (in declarations) and subscripts (in expressions). Thus,

		char string[80];

declares a character array named string consisting of 80 characters numbered from 0 through 79, and

		ch = string[4];

assigns the fifth character of that array to the variable ch.
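
The off-by-one relationship between subscripts and positions is easy to verify. The following standard C sketch wraps the two statements above in a function; fifth_char is a hypothetical name, and the copy is limited to 79 characters so that the array always ends with a null character.

```c
#include <string.h>

/* Return the fifth character of text, that is, the element at subscript 4.
   Texts shorter than five characters yield the null character. */
char fifth_char(const char *text)
{
    char string[80] = "";        /* elements string[0] through string[79], all zero */
    char ch;

    strncpy(string, text, 79);   /* leaves string[79] as the terminator */
    ch = string[4];              /* subscript 4 selects the fifth element */
    return ch;
}
```

For the text abcdefgh the function returns e, the fifth letter, even though the subscript is 4.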

As we saw in Listing 1-1, parentheses enclose argument lists which are associated with function declarations and calls. They are required even if there are no arguments.

As in most programming languages, parentheses in C control the order in which expressions are evaluated. Thus, (6+2)/2 yields 4, whereas 6+2/2 yields 7.

The backslash character (\) may be used in character and string constants as an escape character. The presence of a backslash gives the following character(s) special meaning.
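
As an illustration of how a backslash changes the meaning of the character after it, the sketch below maps a few escape letters to the ASCII values they denote. It is written in standard C, covers only a handful of escapes that standard C defines, and does not claim to match the exact set accepted by Small C; escape_value is a name of our own choosing.

```c
/* Translate the character following a backslash into the value it denotes.
   Only a few standard C escapes are handled; anything else is returned
   unchanged. */
int escape_value(int c)
{
    switch (c) {
        case 't':  return 9;    /* horizontal tab */
        case 'b':  return 8;    /* backspace */
        case 'f':  return 12;   /* form feed */
        case '\\': return 92;   /* the backslash itself */
        case '\'': return 39;   /* apostrophe */
        default:   return c;    /* no special meaning in this sketch */
    }
}
```

So the two characters \t inside a constant stand for the single value 9, and a doubled backslash stands for one backslash character.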

Operators

As Table 9-1 illustrates, numerous special characters are used as expression operators. They specify every sort of operation that can be performed on operands. There are operators for:

  1. assignments
  2. mathematical operations
  3. relational comparisons
  4. Boolean operations
  5. bitwise operations
  6. shifting values
  7. calling functions
  8. subscripting
  9. obtaining the size of an object
  10. obtaining the address of an object
  11. referencing an object through its address
  12. choosing between alternate subexpressions

Since there are so many operators, in many cases it was necessary to form operators from two or more characters. Thus <= and <<= are each operators. Small C requires that such operators be written without white space or comments between the characters.
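
One common way for a scanner to recognize multi-character operators is to try the longest candidates first. The sketch below shows that approach in standard C. It is not the Small C compiler's own method, the table holds only a small subset of C operators chosen for illustration, and operator_length is a hypothetical name.

```c
#include <string.h>

/* Return the length of the longest operator at the start of s,
   or 0 if none matches. The table is a small illustrative subset,
   ordered so longer operators are tried before their prefixes. */
int operator_length(const char *s)
{
    static const char *ops[] = { "<<=", ">>=", "<<", ">>", "<=", ">=",
                                 "==", "!=", "<", ">", "=", "!" };
    int count = (int)(sizeof ops / sizeof ops[0]);
    int i;

    for (i = 0; i < count; i++) {
        size_t len = strlen(ops[i]);
        if (strncmp(s, ops[i], len) == 0)
            return (int)len;
    }
    return 0;
}
```

For text beginning with <<= the result is 3, and for <= it is 2. If a space intervenes, as in < =, only the single character < is matched, which shows why the characters of an operator must be written together.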
