Compilers
CSE Compiler Execution
Compilation
- Compilation is a process that translates a program in one language (the source language) into an equivalent program in another language (the object or target language).
- Important part is detection and reporting of errors
- Source language is usually a high-level programming language (e.g., an object-oriented language)
- Target language is a machine language or assembly language (i.e., machine-oriented language)
- Compilation is the link between abstract world of application development and low-level world of application execution on machines
Types of Translators
- Assembler is also a type of translator
- Interpreter is closely related to compiler, but takes both source program and input data
- Translation and execution of the source program are interleaved rather than separate phases
- Easier implementation of programs (run-time errors immediately displayed)
- Slower execution
- Often requires more space
Program Compilation Process
- The process includes preprocessing, compiling, assembling, linking/loading to generate absolute machine code.
- Preprocessor takes a skeletal source program and generates a source program.
- Compiler translates the source program into an assembly program.
- Assembler converts the assembly program into relocatable machine code.
- Link/load editor combines relocatable machine code to produce absolute machine code.
C Preprocessor
- Preprocessing occurs before a program is compiled.
- Includes:
- Inclusion of other files
- Definition of symbolic constants and macros
- Conditional compilation of program code
- Conditional execution of preprocessor directives
- Format of preprocessor directives:
- Lines begin with #
- Only whitespace characters before directives on a line
- Example: gcc -E helloworld.c stops after preprocessing and sends the output to stdout
#include Preprocessor Directive
- Copy of a specified file included in place of directive
- #include <filename> searches the standard include directories
  - Use for standard library files
- #include "filename" searches the current directory first, then the standard include directories
  - Use for user-defined files
- Used for:
  - Programs with multiple source files compiled together
  - A header file holds common declarations and definitions (classes, structures, function prototypes), with an #include statement in each source file
#define Preprocessor Directive
- Symbolic constants
- When program compiled, all occurrences of symbolic constant replaced with replacement text
- Format: #define identifier replacement-text
  - Example: #define PI 3.14159
  - Everything to the right of the identifier becomes the replacement text
  - Caution: #define PI = 3.14159 replaces "PI" with "= 3.14159"
- Cannot redefine symbolic constants once created (without first using #undef)
#define Preprocessor Directive (Macros)
- Macro
- Macro without arguments treated like a symbolic constant
- A macro with arguments has its arguments substituted for replacement text when the macro is expanded
- Performs a text substitution – no data type checking
- The macro #define CIRCLE_AREA( r ) ( PI * ( r ) * ( r ) ) would cause area = CIRCLE_AREA( 4 ); to become area = ( 3.14159 * ( 4 ) * ( 4 ) );
#define Preprocessor Directive (Parentheses)
- Use parentheses around the macro body and each parameter
  - Without them, the macro #define CIRCLE_AREA( x ) PI * x * x would cause area = CIRCLE_AREA( c + 2 ); to become area = 3.14159 * c + 2 * c + 2;
- Multiple arguments
  - #define RECTANGLE_AREA( x, y ) ( ( x ) * ( y ) ) would cause rectArea = RECTANGLE_AREA( a + 4, b + 7 ); to become rectArea = ( ( a + 4 ) * ( b + 7 ) );
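The parenthesization pitfall above can be demonstrated directly; a minimal sketch in C, where BAD_CIRCLE_AREA is a deliberately unparenthesized macro invented for illustration:

```c
#include <assert.h>

#define PI 3.14159

/* Properly parenthesized: the whole body and each use of r are wrapped. */
#define CIRCLE_AREA(r) (PI * (r) * (r))

/* Deliberately unparenthesized version that exposes the substitution bug. */
#define BAD_CIRCLE_AREA(r) PI * r * r

double good(double c) { return CIRCLE_AREA(c + 2); }     /* PI * (c+2) * (c+2) */
double bad(double c)  { return BAD_CIRCLE_AREA(c + 2); } /* PI * c + 2 * c + 2 */
```

Because macro expansion is pure text substitution, the unparenthesized version lets the multiplications bind before the additions inside the argument, changing the result.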
#undef Preprocessor Directive
- Undefines a symbolic constant or macro
- If a symbolic constant or macro has been undefined, it can later be redefined
- Example
  #define WIDTH 80
  #define ADD( X, Y ) ((X) + (Y))
  ...
  #undef WIDTH
  #undef ADD
Conditional Compilation
- Controls which preprocessor directives and code are compiled
- Cast expressions, sizeof, and enumeration constants cannot be evaluated in preprocessor directives
- Structure similar to if:
  #if !defined( NULL )
  #define NULL 0
  #endif
  - Determines whether the symbolic constant NULL has been defined
  - If NULL is defined, defined( NULL ) evaluates to 1
  - If NULL is not defined, this directive defines NULL to be 0
- Every #if must end with #endif
- #ifdef is short for #if defined( name )
- #ifndef is short for #if !defined( name )
Conditional Compilation (cont.)
- Other directives
  - #elif is the equivalent of else if in an if structure
  - #else is the equivalent of else in an if structure
- "Comment out" code
  - Cannot use /* ... */ if the code already contains comments (comments do not nest)
  - Instead use:
    #if 0
    code commented out
    #endif
  - To enable the code, change 0 to 1
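The idioms above can be combined in one compilable sketch; BUFFER_SIZE and compiled_out are made-up names for illustration:

```c
#include <assert.h>

/* Give BUFFER_SIZE a default only when the build did not define one,
   mirroring the "#if !defined( NULL )" idiom above. */
#if !defined(BUFFER_SIZE)
#define BUFFER_SIZE 128
#endif

/* "#if 0 ... #endif" excludes a region from compilation entirely. */
#if 0
int compiled_out(void) { return -1; }
#endif

int buffer_size(void) { return BUFFER_SIZE; }
```

Compiling with -DBUFFER_SIZE=256 would override the default, since the #if !defined guard would then skip the #define.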
Compiler
- GNU Compiler Collection (gcc) is the standard Linux C compiler (see man gcc for details)
- Compiler translates the C program to assembly code
- Example: gcc -S helloworld.c stops after the stage of compilation proper (i.e., output is not passed to the assembler) and writes the assembly code to helloworld.s
- Compiler actually consists of many parts (more later)
- Parser
- Analyzer
- Code optimizer
- Code generation
Assembler
- Assembles, but does not link, source code to generate relocatable object code
- Object code may contain metadata such as label definitions that refer to locations within sections of code
- May include holes (i.e., relocation entries) that are to be filled with the values of labels defined elsewhere
- Example: gcc -c helloworld.c assembles the source code and saves the object file as helloworld.o
- Use nm or objdump to look at the object file:
  nm example.o
  objdump -t example.o
Linker
- Combines previously compiled and assembled object code along with standard library functions to make executable file
- Resolves references in each object file to external variables and procedures declared in other files
- Example: gcc helloworld.o builds the default executable a.out
- Use gcc helloworld.o -lm to link the math library
- Use nm to look at the executable
Loader
- Compilers, assemblers, and linkers usually produce code whose memory references are made relative to an undetermined starting location that can be anywhere in memory (relocatable machine code)
- The loader ensures that the object programs are placed in memory in an executable form
- Loader calculates appropriate absolute addresses for these memory locations and amends the code to use these addresses
- Gets an address from the OS to place the program in memory
- Changes necessary addresses (if any)
- Places code into memory to run
Compiling and Linking
- Only compiling (creating hello.o): gcc -c hello.c
- Only linking: gcc hello.o -o hello
- Compiling and linking: gcc hello.c -o hello
- To run your program:
  ./hello
  Hello World!
- Additional common flags to gcc
  - -g enables debugging
  - -l<library_name> links with external libraries
- If you leave off the -o, the executable goes into the a.out file
Compiler Construction
What is a Compiler?
- All computers only understand machine language
- Therefore, high-level language instructions must be translated into machine language prior to execution
What is a Compiler? (cont.)
- Consists of a piece of system software that translates high-level languages into machine language
Why Build Compilers?
- Compilers provide an essential interface between applications and architectures
- High level programming languages:
- Increase programmer productivity
- Better maintenance
- Portable
- Low level machine details:
- Instruction selection
- Addressing modes
- Pipelines
- Registers and cache
- Compilers efficiently bridge the gap and shield the application developers from low level machine details
Assembler
- A kind of compiler
- One-to-one translation
- Assembly --> Machine Language
Compiler (High-Level Language Translator)
- A high-level language translator
- One-to-many translation
Compiler Properties
- Compiler must generate a correct executable
- The input and output programs must be equivalent; the compiler should preserve the meaning of the input program
- Output program should run fast
- For optimizing compilers, we expect the output program to be more efficient than the input program
- Compiler itself should be fast
- Compiler should provide good diagnostics for programming errors
- Compiler should support separate compilation
- Compiler should work well with debuggers
- Optimizations should be consistent and predictable
- Compile time should be proportional to code size
Compiler Goals
- Code produced must be correct
- Correctness is paramount; an incorrect translation defeats the purpose.
- The compiler must ensure that the generated code accurately reflects the semantics of the source program. For example:
  - Incorrect: a translation of A = (B + C) - (D + E); in which STORE B and STORE D instructions change the values of variables B and D, which the high-level language does not intend
Compiler Goals (Efficiency)
- Code produced should be reasonably efficient and concise
- The generated code should execute quickly and utilize memory effectively.
- Example: compute the sum 2x_1 + 2x_2 + 2x_3 + … + 2x_{50000}
  - A naive translation multiplies each term by 2, using 50,000 multiplication instructions
  - An optimizing compiler instead computes 2(x_1 + x_2 + … + x_{50000}), which uses 49,999 fewer multiplication instructions
Why Study Compilers
- Compilers embody a wide range of theoretical techniques and their application to practice
- DFAs, DPDAs, formal languages, formal grammars, lattice theory
- Compiler construction teaches programming and software engineering skills
- Compiler construction involves a variety of areas
- Theory, algorithms, systems, architecture
- The techniques used in various parts of compiler construction are useful in a wide variety of applications
- Many practical applications have embedded languages, commands, macros, etc.
- Is compiler construction a solved problem?
- No! New developments in programming languages and machine architectures (multicore machines) present new challenges
Syntax & Semantics of Language
- A programming language must include the specification of syntax (structure) and semantics (meaning)
- Syntax typically means the context-free syntax because of the almost universal use of context-free grammars (CFGs)
- Example
  - a = b + c is syntactically legal
  - b + c = a is illegal
- The semantics of a programming language are commonly divided into two classes
- Static semantics
- Semantics rules that can be checked at compile time
- E.g., the type and number of a function’s argument(s)
- Runtime semantics
- Semantics rules that can be checked only at run time
Compilation
- Translate high-level program (source language) into machine code (machine language)
- Slow translation, fast execution
- Compilation process has several phases:
- Lexical Analysis
- Converts characters in source program into lexical units
- Syntax Analysis
- Transforms lexical units into parse trees that represent the syntactic structure of program
- Semantics Analysis
- Generate intermediate code
- Code Generation
- Machine code is generated
Compilation Stages
- For many compilers, the result is assembly, which then has to be run through an assembler
- Lexical analysis (scanner)
- Syntax analysis (parser)
- Semantic analysis
- Intermediate code generation
- Machine-independent code improvement (optional)
- Target code generation
- Machine-specific code improvement (optional)
Lexical Analysis
- Lexical analyzer or scanner
- Recognizes the “tokens” of a program
- The compiler scans the source code from left-to-right, character-by-character, and groups these characters into lexemes (sequence of characters that match a pattern) and outputs a sequence of tokens to the syntax analyzer
- Each token represents a logically cohesive sequence of characters
  - Keywords: if, while, …
  - Operators: +, *, <=, ||, …
  - Identifiers (names of variables, arrays, procedures, classes): i, i1, j1, count, sum, …
  - Numbers: 12, 3.14, 7.2E-2, …
Lexical Analysis (cont.)
- The main functions of this phase are
- Identify the lexical units in a source statement
- Classify units into different lexical classes (e.g., constants, reserved words, etc.) and enter them in different tables
- Build a descriptor (called a token) for each lexical unit (group of input characters)
- Ignore comments in the source program
- Detect tokens that are not a part of the language
Lexical Analysis Example
- Program statement: sum = sum + a[i];
- Character stream (what the scanner sees): tab, s, u, m, blank, =, blank, s, u, m, blank, +, blank, a, [, i, ], ;
- Tokenized: sum, =, sum, +, a[i], ;
Lexical Analysis Process
- Discard blanks, tabs, etc. – look for beginning of token
- Put characters together
- Repeat step 2 until end of token
- Classify and save token
- Repeat steps 1 – 4 until end of statement
- Repeat steps 1 – 5 until end of source code
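The steps above can be sketched as a tiny scanner; next_token is a simplified, hypothetical function handling only identifiers, integers, and single-character operators:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum TokKind { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_OP };

/* Scan one token from *src: skip whitespace (step 1), accumulate
   characters (steps 2-3), then classify (step 4). Advances *src and
   copies the lexeme into the caller's buffer. */
static enum TokKind next_token(const char **src, char *lexeme)
{
    const char *p = *src;
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;      /* step 1 */
    if (*p == '\0') { *src = p; lexeme[0] = '\0'; return TOK_END; }

    int n = 0;
    if (isalpha((unsigned char)*p)) {                        /* identifier */
        while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0'; *src = p; return TOK_IDENT;
    }
    if (isdigit((unsigned char)*p)) {                        /* number */
        while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0'; *src = p; return TOK_NUMBER;
    }
    lexeme[0] = *p++; lexeme[1] = '\0';                      /* operator etc. */
    *src = p; return TOK_OP;
}
```

Calling next_token repeatedly on "sum = sum + a[i];" yields the token sequence from the earlier example (with a, [, i, ] as separate tokens in this simplified scanner).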
How Do We Specify Lexical Patterns?
- Some patterns are easy
  - Keywords and operators
    - Specified as literal patterns: if, then, else, while, =, +, …
- Some patterns are more complex
  - Identifiers
    - A letter followed by letters and digits
  - Numbers
    - Integer: 0, or a digit between 1 and 9 followed by digits between 0 and 9
    - Decimal: an optional sign (“+” or “-”), followed by “0” or a nonzero digit and an arbitrary number of digits, followed by a decimal point and an arbitrary number of digits
Language Definitions
- An alphabet \Sigma is a finite non-empty set (of symbols)
- A string or word over an alphabet \Sigma is a finite concatenation of symbols from \Sigma
- The length of a string w (i.e., number of characters comprising the string) is denoted |w|
- The empty or null string is denoted \epsilon (i.e., |\epsilon|=0)
- The set of all strings over \Sigma is denoted \Sigma^*
- For each n \geq 0, we define \Sigma^n = {w\in\Sigma^* \mid |w| = n}
- For a symbol or word x, x^n denotes x concatenated with itself n times, with the convention that x^0 denotes \epsilon
- A language over \Sigma is a set L \subseteq \Sigma^*
- Two languages L1 and L2 over common alphabet \Sigma are equal if they are equal as sets
Lexical Patterns: Regular Expressions
- Regular expressions (regex) describe regular languages
- Regular Expression (over alphabet \Sigma)
- \epsilon (empty string) is a regex denoting the set {\epsilon}
- If a is in \Sigma, then a is a regex denoting {a}
- If x and y are regexes denoting languages L(x) and L(y), then
  - ( x ) is a regex denoting L(x)
  - x | y is a regex denoting L(x) \cup L(y)
  - xy is a regex denoting L(x)L(y)
  - x^* is a regex denoting L(x)^*
Regular Expression Examples
- All strings of 1s and 0s
- ( 0 | 1 )^*
- All strings of 1s and 0s beginning with a 1
- 1 ( 0 | 1 )^*
- All strings of 0s and 1s containing at least two consecutive 1s
- ( 0 | 1 )^* 1 1( 0 | 1 )^*
- All strings of alternating 0s and 1s
- ( \epsilon | 1 ) ( 0 1 )^* ( \epsilon | 0 )
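The third example above, ( 0 | 1 )^* 1 1 ( 0 | 1 )^*, can be recognized directly; a minimal sketch where the function name is made up and strstr stands in for a real automaton:

```c
#include <assert.h>
#include <string.h>

/* Membership test for the language of (0|1)^* 1 1 (0|1)^*:
   binary strings containing at least two consecutive 1s. */
static int has_consecutive_ones(const char *s)
{
    for (size_t i = 0; s[i]; i++)
        if (s[i] != '0' && s[i] != '1')
            return 0;                     /* not a string over {0, 1} */
    return strstr(s, "11") != NULL;       /* contains the substring 11 */
}
```

A scanner generator would compile the regex into a DFA instead, as the following sections describe.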
Regular Expression Extensions (Lex)
- x+ = x x^* denotes L(x)^+
- x? = x | \epsilon denotes L(x) \cup {\epsilon}
- [abc] = a | b | c matches one of the characters in the square brackets
- a-z = a | b | c | … | z denotes a range
- [0-9a-z] = 0 | 1 | 2 | … | 9 | a | b | c | … | z
- [^abc] matches any character except a, b, or c (^ means negation)
- . (dot) matches any character except the newline (\n), i.e., . = [^\n]
- "[" matches a left square bracket; metacharacters in double quotes become plain characters
- \[ matches a left square bracket; a metacharacter after a backslash becomes a plain character
Regular Definitions
- We can define macros using regular expressions and use them in other regular expressions
- Letter → (a|b|c| … |z|A|B|C| … |Z)
- Digit → (0|1|2| … |9)
- Identifier → Letter ( Letter | Digit )*
- Important: We should be able to order these definitions so that every definition uses only the definitions defined before it (i.e., no recursion)
- Regular definitions can be converted to basic regular expressions with macro expansion
- In Lex, enclose definitions using curly braces
- Identifier → {Letter} ( {Letter} | {Digit} )*
Regular Expression Examples
- Examples:
  - Integer → (+|-)? (0 | (1|2|3| … |9) Digit^*)
  - Decimal → Integer "." Digit^*
  - Real → ( Integer | Decimal ) E (+|-)? Digit^*
  - Complex → "(" Real "," Real ")"
Regular Expression to Scanners
- Regular expressions are useful for specifying patterns that correspond to tokens
- However, we also want to construct programs that recognize these patterns
- How do we do it?
- Use finite automata!
RegEx to DFA Example
- Consider the problem of recognizing register names in an assembler
  - Register → R (0|1|2| … |9) (0|1|2| … |9)^*
  - Allows register names with an arbitrary number of digits
  - Requires at least one digit
- The regex corresponds to a recognizer (or DFA)
Deterministic Finite Automata (DFA)
A set of states S
- S = { s0 , s1 , s2 , se}
A set of input symbols (an alphabet) \Sigma
- \Sigma = { R , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
A transition function δ : S × Σ → S
- Maps (state, symbol) pairs to states
A start state
- s0
A set of final (or accepting) states
- Final = { s2 }
DFA Simulation
- Start in state s0 and follow transitions on each input character
- DFA accepts a word x iff x leaves it in a final state (s2 )
- So,
- “R17” takes it through s0 , s1 , s2 and accepts
- “R” takes it through s0 , s1 and fails
- “A” takes it straight to se
- “R17R” takes it through s0 , s1 , s2 , se and rejects
DFA Simulation (cont.)
- The recognizer translates directly into code
- To change DFAs, just change the arrays
- Takes O(|x|) time for input string x
Recognize Longest Accepted Prefix
Given an input string, this simulation algorithm returns the longest accepted prefix. For example, given the input “R17R”, it returns “R17”.
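The register-name DFA and the longest-accepted-prefix rule can be sketched as table-driven C code; the state names and the -1 failure convention are assumptions for illustration, and a transition function stands in for the transition arrays mentioned above:

```c
#include <assert.h>

/* States of the DFA for Register -> R (0|...|9) (0|...|9)*:
   S0 = start, S1 = saw 'R', S2 = accepting, SE = error state (se). */
enum { S0, S1, S2, SE };

static int step(int state, char c)
{
    if (state == S0) return c == 'R' ? S1 : SE;
    if ((state == S1 || state == S2) && c >= '0' && c <= '9') return S2;
    return SE;
}

/* Returns the length of the longest accepted prefix of x,
   or -1 if no prefix is accepted. Runs in O(|x|) time. */
static int longest_accepted_prefix(const char *x)
{
    int state = S0, best = -1;
    for (int i = 0; x[i] && state != SE; i++) {
        state = step(state, x[i]);
        if (state == S2) best = i + 1;   /* remember last accepting point */
    }
    return best;
}
```

The test cases mirror the simulation examples above: "R17" accepts, "R" and "A" fail, and "R17R" yields the prefix "R17".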
Syntax Analysis
- Syntax analyzer, or parser, groups sequences of tokens from the lexical analysis phase into phrases, each with an associated phrase type (i.e., a logical unit with respect to the rules of the source language)
- The following operations are performed in this phase
- Obtain tokens from lexical analyzer
- Check whether the expression is syntactically correct
- Report syntax errors, if any
- Determine the statement class
- An assignment statement, a condition (e.g., if) statement, etc.
- Group tokens into statements
- Construct hierarchical structures called parse trees
- Parse trees represent the syntactic structure of the program
Grammars and Parse Trees
- The grammar S → aSbS | bSaS | ε generates all strings of a’s and b’s with the same number of a’s as b’s
- This grammar is ambiguous: abab has two parse trees
Grammars and Parse Trees (Unambiguous)
- Natural languages are inherently ambiguous
- But programming languages should not be!
- This grammar G generates the same language
- S → aAbS | bBaS | ε
- A → aAbA | ε
- B → bBaB | ε
- G is unambiguous and has only one parse tree for every sentence in L(G)
Parse Trees
- The statement b = a + 7 can be represented by the following parse tree
- Parse tree illustrates the grouping of tokens into phrases
- Successful construction of a parse tree is proof that the statement is correctly formed
Parse Trees (cont.)
- Every construct (statement, expression, etc.) in a programming language can be represented as a parse tree
- Terminals
- The actual tokens of the language
- Non-Terminals
- Intermediate grammatical categories used to help explain and organize the language
Recursive Descent Parser
- Simplest idea for creating a parser for a grammar is to construct a recursive descent parser
- Built from a context-free grammar
- Simple rules used to build the recursive program
- Implement a function for each non-terminal
- Generate code for each rule
- “Call” non-terminals
- “Match” terminals
- Use if statement to choose between multiple rules with the same (non-terminal) symbol
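The rules above can be sketched for a small hypothetical grammar (Expr → Term { (+|-) Term }, Term → Digit): one function per non-terminal, terminals matched directly, with globals tracking position and error state for brevity (not reentrant):

```c
#include <assert.h>
#include <ctype.h>

static const char *pos;   /* current position in the input */
static int ok;            /* cleared on the first syntax error */

/* Term -> Digit: match a single digit terminal. */
static int term(void)
{
    if (isdigit((unsigned char)*pos)) return *pos++ - '0';
    ok = 0;
    return 0;
}

/* Expr -> Term { ('+' | '-') Term }: call term, then loop on operators. */
static int expr(void)
{
    int v = term();
    while (ok && (*pos == '+' || *pos == '-')) {
        char op = *pos++;
        int r = term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

/* Parse (and here, evaluate) a whole input string. */
static int parse(const char *s)
{
    pos = s; ok = 1;
    int v = expr();
    if (*pos != '\0') ok = 0;   /* trailing input is a syntax error */
    return v;
}
```

A real recursive descent parser would build and return parse-tree nodes rather than evaluating; the call structure is the same.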
Recursive Descent Parser (Pros & Cons)
- Advantages
- They are exceptionally simple
- They can be constructed from recognizers simply by doing some extra work – specifically, building a parse tree
- Disadvantages
- They are not as fast as some other methods
- It is difficult to provide really good error messages
- They cannot do parses that require arbitrarily long look-aheads
Recursive Descent Parser Stack
- One easy way to do recursive descent parsing is to have each parse method take the tokens it needs, build a parse tree, and put the parse tree on a global stack
- Write a parse method for each non-terminal in the grammar
- Each parse method should get the tokens it needs, and only those tokens
- Those tokens (usually) go on the stack
- Each parse method may call other parse methods, and expect those methods to leave their results on the stack
- Each (successful) parse method should leave one result on the stack
Example: while Statement
- <while-statement> ::= "while" <condition> <block>
- The parse method for a while statement:
  - Calls the Tokenizer, which returns a "while" token
  - Makes "while" into a tree node, which it puts on the stack
  - Calls the parser for <condition>, which parses a condition and puts it on the stack
    - Stack contains: "while" <condition> ("top" is on right)
  - Calls the parser for <block>, which parses a block and puts it on the stack
    - Stack now contains: "while" <condition> <block>
  - Pops the top three things from the stack, assembles them into a tree, and pushes this tree onto the stack
Recognizing a while Statement
- This code assumes that keyword(String), condition(), and block() all leave their results on a stack
- On the stack, "while" is at depth 3, <condition> at depth 2, and <block> at depth 1
Parsing a while Statement
- After recognizing a while loop, the stack looks like this:
Compiler Component Generators
- Defines
- Flex specification
- Yacc specification
Desk Calculator Example
- Flex specification for a desk calculator: desk.l
  $ flex desk.l
- yacc specification for a desk calculator: desk.y
  $ yacc desk.y
  - This generates a file called y.tab.c; before compiling the resulting file, you must add the following function prototypes near the top of the file:
    int yylex(void);
    void yyerror(const char *);
- Now compile this file with the flex and yacc libraries
  $ gcc y.tab.c -ly -lfl
Desk Calculator Example (cont’d)
- Example output
Intermediate Code Generation
- Intermediate code generation produces a program in a different language, at an intermediate level between the source code and the machine code
- This allows part of the compiler to be machine independent!
- Intermediate languages are sometimes assembly languages
- Generation of intermediate code offers the following advantages
- Flexibility
- A single lexical analyzer/parser can be used to generate code for several different machines by providing separate back-ends that translate a common intermediate language to a machine-specific assembly language
- Intermediate code is used in interpretation
- Executed directly rather than translating it into binary code and storing it
Semantic Analysis
- A semantic analyzer takes its input from the syntax analysis phase in the form of a parse tree and a symbol table
- Its purpose is to determine if the input has a well-defined meaning
- In practice, semantic analyzers are mainly concerned with type checking and type coercion based on type rules
Semantic Analysis (cont.)
- The semantic phase has the following functions:
- Check phrases for semantic errors (e.g., type checking)
    - In a C program, int x = 10.5; should be detected as a semantic error
- In a C program,
- Keeps track of types of identifiers and expressions to verify their consistent usage
- Maintains the symbol table
- Symbol table contains information about each identifier in a program, such as identifier type, scope of identifier, etc.
Semantic Analysis (cont. 2)
- The semantic phase has the following functions:
- Enforces a large number of rules using the symbol table
- Every identifier is declared before it is used
- No identifier is used in an inappropriate context (e.g., adding a string to an integer)
- Function calls have a correct number and type of arguments
- Checks certain semantic rules, called dynamic semantics, at run time
- Array subscript expression lies within the bounds of the array
- Variables are never used in an expression unless they have been given a value
Symbol Table
- Built and maintained by the semantic analyzer
- Maps each identifier to information known about it
- Identifier’s type (e.g., int, char, float, etc.)
- Internal structure (if any)
- Scope (the portion of the program in which it is valid)
- Using the symbol table, the semantic analyzer enforces a large variety of rules
- Purpose of symbol table is to provide quick and uniform access to identifier attributes throughout compilation process
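A minimal sketch of such a table, assuming a flat (single-scope) linear list rather than the hash table a real compiler would use; storing names and types as strings is an illustration-only simplification:

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 64

struct symbol { const char *name; const char *type; };
static struct symbol table[MAX_SYMS];
static int nsyms;

/* Return the type recorded for an identifier, or NULL if undeclared. */
static const char *lookup(const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0) return table[i].type;
    return NULL;
}

/* Record a declaration; fails on redeclaration or a full table. */
static int insert(const char *name, const char *type)
{
    if (lookup(name) != NULL || nsyms == MAX_SYMS) return 0;
    table[nsyms].name = name;
    table[nsyms].type = type;
    nsyms++;
    return 1;
}
```

With this interface, the semantic analyzer's rules map directly onto table operations: "declared before use" is a lookup, and "no redeclaration" is a failed insert.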
Semantics Analysis Example
- Syntactically correct, but semantically incorrect
- Semantic records
Code Generation
- The code generated depends upon the architecture of the target machine
- Knowledge of the instructions and addressing modes in a target computer is necessary for the code generation process
- One of the important aspects of code generation is the efficient initialization of machine resources using assumptions such as:
- Instruction types in target machines
- Commutative properties of operators in an expression
- Proper usage of syntax for syntax directed translation
Error Handling
- In each phase, a compiler can encounter errors
- On detecting an error, the compiler must
- Report the error in a helpful way
- Correct the error, if possible
- Continue processing (if possible) after the error to look for further errors
Error Types
- Syntax errors are errors in the program text
- A lexical error is a mistake in lexeme (e.g., typing “esle” instead of “else” or missing off one of the quotes in a literal string)
- A grammatical error is one that violates the rules of the language
- Semantic errors are mistakes concerning the meaning of a program construct
- Type errors occur when an operator is applied to an argument of the wrong type or to the wrong number of arguments
- Logical errors occur when a badly conceived program is executed
- Run-time errors are errors that can be detected only when the program is executed
Compiler Optimization
Inside a Basic Compiler Front End
- High-level language (C, C++, Java) to Low-level language (IA-64)
- Inside an optimizing compiler, the IR is additionally improved between the front end and the back end
Control Flow Graph
- Basic Block: A group of consecutive instructions with a single entry point and a single exit point
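Basic blocks are typically found by marking leader instructions: the first instruction, any jump target, and any instruction immediately after a jump. A minimal sketch under a made-up encoding, where jump_target[i] is the destination index of a jump at instruction i, or -1 if i is not a jump:

```c
#include <assert.h>

#define N 6

/* Mark leader[i] = 1 for every basic-block leader among N instructions. */
static void find_leaders(const int jump_target[N], int leader[N])
{
    for (int i = 0; i < N; i++) leader[i] = 0;
    leader[0] = 1;                               /* rule 1: first instruction */
    for (int i = 0; i < N; i++) {
        if (jump_target[i] >= 0) {
            leader[jump_target[i]] = 1;          /* rule 2: jump target */
            if (i + 1 < N) leader[i + 1] = 1;    /* rule 3: after a jump */
        }
    }
}
```

Each basic block then runs from a leader up to (but not including) the next leader, giving the nodes of the control flow graph.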
Code Optimization Requirements
- Preserve correctness
- The speed of an incorrect program is irrelevant
- On average, improve performance
- Optimized code may be worse than original if unlucky
- Be “worth the effort”
- Is this example worth it?
Code Optimization
- Optimization improves programs by making them smaller or faster or both
- Programs can be optimized for speed, memory usage, or program footprint
- Goal of code optimization is to translate a program into a new version that computes the same result more efficiently
- by taking less time, memory, and other system resources
- Code optimization is achieved in two ways
- Rearranging computations in a program to make them execute more efficiently
- Eliminating redundancies in a program
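The two approaches above can be illustrated with a before/after pair; a sketch in which the hand-written "after" version stands in for what an optimizer would produce (the function names are invented for illustration):

```c
#include <assert.h>

/* Before: the loop-invariant expression scale * 2 is recomputed
   on every iteration. */
static long sum_before(const int *x, int n, int scale)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)x[i] * (scale * 2);
    return s;
}

/* After: the invariant is hoisted out of the loop (rearranging the
   computation) and computed once (eliminating the redundancy).
   The result must be identical - correctness is preserved. */
static long sum_after(const int *x, int n, int scale)
{
    long s = 0;
    long k = (long)scale * 2;   /* hoisted loop-invariant */
    for (int i = 0; i < n; i++)
        s += x[i] * k;
    return s;
}
```

In practice a compiler performs this transformation automatically (loop-invariant code motion), so both versions of the source may compile to the same machine code.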
Code Optimization (cont.)
- Should not change the meaning of the program
- Tries to improve the program, but the underlying algorithm is not affected
- Cannot replace an inefficient algorithm with an algorithm that is more efficient
- Also cannot fully utilize the instruction set of a particular architecture
- Therefore, it is independent of the target machine and the programming language
Optimizing Transformations
- Two types of optimizing transformations are
- Local transformations
- Applied over small segments of a program
- Provides limited benefits at a low cost
- Scope is a basic block which is a sequential set of instructions
- Global transformations
- Applied over larger segments consisting of loops or function bodies