Compilers
CSE Compiler Execution
Compilation
- Compilation is a process that translates a program in one language (the source language) into an equivalent program in another language (the object or target language).
- Important part is detection and reporting of errors
- Source language is usually a high-level programming language (e.g., an object-oriented language)
- Target language is a machine language or assembly language (i.e., machine-oriented language)
- Compilation is the link between abstract world of application development and low-level world of application execution on machines
Types of Translators
- Assembler is also a type of translator
- Interpreter is closely related to compiler, but takes both source program and input data
- Translation and execution of the source program are interleaved rather than separate phases
- Easier implementation of programs (run-time errors immediately displayed)
- Slower execution
- Often requires more space
Program Compilation Process
- The process includes preprocessing, compiling, assembling, linking/loading to generate absolute machine code.
- Preprocessor takes a skeletal source program and generates a source program.
- Compiler translates the source program into an assembly program.
- Assembler converts the assembly program into relocatable machine code.
- Link/load editor combines relocatable machine code to produce absolute machine code.
C Preprocessor
- Preprocessing occurs before a program is compiled.
- Includes:
- Inclusion of other files
- Definition of symbolic constants and macros
- Conditional compilation of program code
- Conditional execution of preprocessor directives
- Format of preprocessor directives:
- Lines begin with #
- Only whitespace characters before directives on a line
- Example: gcc -E helloworld.c stops after preprocessing and sends the output to stdout
#include Preprocessor Directive
- Copy of a specified file included in place of directive
- #include <filename> searches the standard include directories
  - Use for standard library files
- #include "filename" searches the current directory first, then the standard include directories
  - Use for user-defined files
- Used for:
  - Programs with multiple source files compiled together
  - A header file holds common declarations and definitions (classes, structures, function prototypes), with an #include statement in each source file
#define Preprocessor Directive
- Symbolic constants
- When program compiled, all occurrences of symbolic constant replaced with replacement text
- Format: #define identifier replacement-text
  - Example: #define PI 3.14159
  - Everything to the right of the identifier becomes the replacement text
  - Caution: #define PI = 3.14159 replaces "PI" with "= 3.14159"
- Cannot redefine symbolic constants once created (without first using #undef)
#define Preprocessor Directive (Macros)
- Macro
- Macro without arguments treated like a symbolic constant
- A macro with arguments has its arguments substituted for replacement text when the macro is expanded
- Performs a text substitution – no data type checking
- The macro #define CIRCLE_AREA( r ) ( PI * ( r ) * ( r ) ) would cause area = CIRCLE_AREA( 4 ); to become area = ( 3.14159 * ( 4 ) * ( 4 ) );
#define Preprocessor Directive (Parentheses)
- Use parentheses around the macro body and each parameter
  - Without them, the macro #define CIRCLE_AREA( x ) PI * x * x would cause area = CIRCLE_AREA( c + 2 ); to become area = 3.14159 * c + 2 * c + 2;
- Multiple arguments
  - #define RECTANGLE_AREA( x, y ) ( ( x ) * ( y ) ) would cause rectArea = RECTANGLE_AREA( a + 4, b + 7 ); to become rectArea = ( ( a + 4 ) * ( b + 7 ) );
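The parenthesization pitfall above can be demonstrated directly; a minimal sketch in C, where BAD_CIRCLE_AREA is a deliberately unparenthesized macro invented for illustration:

```c
#include <assert.h>

#define PI 3.14159

/* Properly parenthesized: the whole body and each use of r are wrapped. */
#define CIRCLE_AREA(r) (PI * (r) * (r))

/* Deliberately unparenthesized version that exposes the substitution bug. */
#define BAD_CIRCLE_AREA(r) PI * r * r

double good(double c) { return CIRCLE_AREA(c + 2); }     /* PI * (c+2) * (c+2) */
double bad(double c)  { return BAD_CIRCLE_AREA(c + 2); } /* PI * c + 2 * c + 2 */
```

Because macro expansion is pure text substitution, the unparenthesized version lets the multiplications bind before the additions inside the argument, changing the result.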
#undef Preprocessor Directive
- Undefines a symbolic constant or macro
- If a symbolic constant or macro has been undefined, it can later be redefined
- Example
  #define WIDTH 80
  #define ADD( X, Y ) ((X) + (Y))
  ...
  #undef WIDTH
  #undef ADD
Conditional Compilation
- Controls which preprocessor directives and code are compiled
- Cast expressions, sizeof, and enumeration constants cannot be evaluated in preprocessor directives
- Structure similar to if:
  #if !defined( NULL )
  #define NULL 0
  #endif
  - Determines whether the symbolic constant NULL has been defined
  - If NULL is defined, defined( NULL ) evaluates to 1
  - If NULL is not defined, this directive defines NULL to be 0
- Every #if must end with #endif
- #ifdef is short for #if defined( name )
- #ifndef is short for #if !defined( name )
Conditional Compilation (cont.)
- Other directives
  - #elif is the equivalent of else if in an if structure
  - #else is the equivalent of else in an if structure
- "Comment out" code
  - Cannot use /* ... */ if the code already contains comments (comments do not nest)
  - Instead use:
    #if 0
    code commented out
    #endif
  - To enable the code, change 0 to 1
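The idioms above can be combined in one compilable sketch; BUFFER_SIZE and compiled_out are made-up names for illustration:

```c
#include <assert.h>

/* Give BUFFER_SIZE a default only when the build did not define one,
   mirroring the "#if !defined( NULL )" idiom above. */
#if !defined(BUFFER_SIZE)
#define BUFFER_SIZE 128
#endif

/* "#if 0 ... #endif" excludes a region from compilation entirely. */
#if 0
int compiled_out(void) { return -1; }
#endif

int buffer_size(void) { return BUFFER_SIZE; }
```

Compiling with -DBUFFER_SIZE=256 would override the default, since the #if !defined guard would then skip the #define.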
Compiler
- GNU Compiler Collection (gcc) is the standard Linux C compiler (see man gcc for details)
- Compiler translates the C program to assembly code
- Example: gcc -S helloworld.c stops after the stage of compilation proper (i.e., output is not passed to the assembler) and writes the assembly code to helloworld.s
- Compiler actually consists of many parts (more later)
- Parser
- Analyzer
- Code optimizer
- Code generation
Assembler
- Assembles, but does not link, source code to generate relocatable object code
- Object code may contain metadata such as label definitions that refer to locations within sections of code
- May include holes (i.e., relocation entries) that are to be filled with the values of labels defined elsewhere
- Example: gcc -c helloworld.c assembles the source code and saves the object file as helloworld.o
- Use nm or objdump to look at the object file:
  nm example.o
  objdump -t example.o
Linker
- Combines previously compiled and assembled object code along with standard library functions to make executable file
- Resolves references in each object file to external variables and procedures declared in other files
- Example: gcc helloworld.o builds the default executable a.out
- Use gcc helloworld.o -lm to link the math library
- Use nm to look at the executable
Loader
- Compilers, assemblers, and linkers usually produce code whose memory references are made relative to an undetermined starting location that can be anywhere in memory (relocatable machine code)
- The loader ensures that the object programs are placed in memory in an executable form
- Loader calculates appropriate absolute addresses for these memory locations and amends the code to use these addresses
- Gets an address from the OS to place the program in memory
- Changes necessary addresses (if any)
- Places code into memory to run
Compiling and Linking
- Only compiling (creating hello.o): gcc -c hello.c
- Only linking: gcc hello.o -o hello
- Compiling and linking: gcc hello.c -o hello
- To run your program:
  ./hello
  Hello World!
- Additional common flags to gcc
  - -g enables debugging
  - -l<library_name> links with external libraries
- If you leave off the -o, the executable goes into the a.out file
Compiler Construction
What is a Compiler?
- All computers only understand machine language
- Therefore, high-level language instructions must be translated into machine language prior to execution
What is a Compiler? (cont.)
- Consists of a piece of system software that translates high-level languages into machine language
Why Build Compilers?
- Compilers provide an essential interface between applications and architectures
- High level programming languages:
- Increase programmer productivity
- Better maintenance
- Portable
- Low level machine details:
- Instruction selection
- Addressing modes
- Pipelines
- Registers and cache
- Compilers efficiently bridge the gap and shield the application developers from low level machine details
Assembler
- A kind of compiler
- One-to-one translation
- Assembly --> Machine Language
Compiler (High-Level Language Translator)
- A high-level language translator
- One-to-many translation
Compiler Properties
- Compiler must generate a correct executable
- The input and output programs must be equivalent; the compiler should preserve the meaning of the input program
- Output program should run fast
- For optimizing compilers, we expect the output program to be more efficient than the input program
- Compiler itself should be fast
- Compiler should provide good diagnostics for programming errors
- Compiler should support separate compilation
- Compiler should work well with debuggers
- Optimizations should be consistent and predictable
- Compile time should be proportional to code size
Compiler Goals
- Code produced must be correct
- Correctness is paramount; an incorrect translation defeats the purpose.
- The compiler must ensure that the generated code accurately reflects the semantics of the source program. For example:
  - Incorrect: a translation of A = (B + C) - (D + E); in which STORE B and STORE D instructions change the values of variables B and D, which the high-level language does not intend
Compiler Goals (Efficiency)
- Code produced should be reasonably efficient and concise
- The generated code should execute quickly and utilize memory effectively.
- Example: compute the sum 2x_1 + 2x_2 + 2x_3 + … + 2x_{50000}
  - A naive translation multiplies each term by 2, using 50,000 multiplication instructions
  - An optimizing compiler instead computes 2(x_1 + x_2 + … + x_{50000}), which uses 49,999 fewer multiplication instructions
Why Study Compilers
- Compilers embody a wide range of theoretical techniques and their application to practice
- DFAs, DPDAs, formal languages, formal grammars, lattice theory
- Compiler construction teaches programming and software engineering skills
- Compiler construction involves a variety of areas
- Theory, algorithms, systems, architecture
- The techniques used in various parts of compiler construction are useful in a wide variety of applications
- Many practical applications have embedded languages, commands, macros, etc.
- Is compiler construction a solved problem?
- No! New developments in programming languages and machine architectures (multicore machines) present new challenges
Syntax & Semantics of Language
- A programming language must include the specification of syntax (structure) and semantics (meaning)
- Syntax typically means the context-free syntax because of the almost universal use of context-free grammars (CFGs)
- Example
  - a = b + c is syntactically legal
  - b + c = a is illegal
- The semantics of a programming language are commonly divided into two classes
- Static semantics
- Semantics rules that can be checked at compile time
- E.g., the type and number of a function’s argument(s)
- Runtime semantics
- Semantics rules that can be checked only at run time
Compilation
- Translate high-level program (source language) into machine code (machine language)
- Slow translation, fast execution
- Compilation process has several phases:
- Lexical Analysis
- Converts characters in source program into lexical units
- Syntax Analysis
- Transforms lexical units into parse trees that represent the syntactic structure of program
- Semantics Analysis
- Generate intermediate code
- Code Generation
- Machine code is generated
Compilation Stages
- For many compilers, the result is assembly, which then has to be run through an assembler
- Lexical analysis (scanner)
- Syntax analysis (parser)
- Semantic analysis
- Intermediate code generation
- Machine-independent code improvement (optional)
- Target code generation
- Machine-specific code improvement (optional)
Lexical Analysis
- Lexical analyzer or scanner
- Recognizes the “tokens” of a program
- The compiler scans the source code from left-to-right, character-by-character, and groups these characters into lexemes (sequence of characters that match a pattern) and outputs a sequence of tokens to the syntax analyzer
- Each token represents a logically cohesive sequence of characters
  - Keywords: if, while, …
  - Operators: +, *, <=, ||, …
  - Identifiers (names of variables, arrays, procedures, classes): i, i1, j1, count, sum, …
  - Numbers: 12, 3.14, 7.2E-2, …
Lexical Analysis (cont.)
- The main functions of this phase are
- Identify the lexical units in a source statement
- Classify units into different lexical classes (e.g., constants, reserved words, etc.) and enter them in different tables
- Build a descriptor (called a token) for each lexical unit (group of input characters)
- Ignore comments in the source program
- Detect tokens that are not a part of the language
Lexical Analysis Example
- Program statement: sum = sum + a[i];
- Character stream (what the scanner sees): tab, s, u, m, blank, =, blank, s, u, m, blank, +, blank, a, [, i, ], ;
- Tokenized: sum, =, sum, +, a[i], ;
Lexical Analysis Process
- Discard blanks, tabs, etc. – look for beginning of token
- Put characters together
- Repeat step 2 until end of token
- Classify and save token
- Repeat steps 1 – 4 until end of statement
- Repeat steps 1 – 5 until end of source code
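The steps above can be sketched as a tiny scanner; next_token is a simplified, hypothetical function handling only identifiers, integers, and single-character operators:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum TokKind { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_OP };

/* Scan one token from *src: skip whitespace (step 1), accumulate
   characters (steps 2-3), then classify (step 4). Advances *src and
   copies the lexeme into the caller's buffer. */
static enum TokKind next_token(const char **src, char *lexeme)
{
    const char *p = *src;
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;      /* step 1 */
    if (*p == '\0') { *src = p; lexeme[0] = '\0'; return TOK_END; }

    int n = 0;
    if (isalpha((unsigned char)*p)) {                        /* identifier */
        while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0'; *src = p; return TOK_IDENT;
    }
    if (isdigit((unsigned char)*p)) {                        /* number */
        while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0'; *src = p; return TOK_NUMBER;
    }
    lexeme[0] = *p++; lexeme[1] = '\0';                      /* operator etc. */
    *src = p; return TOK_OP;
}
```

Calling next_token repeatedly on "sum = sum + a[i];" yields the token sequence from the earlier example (with a, [, i, ] as separate tokens in this simplified scanner).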
How Do We Specify Lexical Patterns?
- Some patterns are easy
  - Keywords and operators
    - Specified as literal patterns: if, then, else, while, =, +, …
- Some patterns are more complex
  - Identifiers
    - A letter followed by letters and digits
  - Numbers
    - Integer: 0, or a digit between 1 and 9 followed by digits between 0 and 9
    - Decimal: an optional sign (“+” or “-”), followed by “0” or a nonzero digit and an arbitrary number of digits, followed by a decimal point and an arbitrary number of digits
Language Definitions
- An alphabet \Sigma is a finite non-empty set (of symbols)
- A string or word over an alphabet \Sigma is a finite concatenation of symbols from \Sigma
- The length of a string w (i.e., number of characters comprising the string) is denoted |w|
- The empty or null string is denoted \epsilon (i.e., |\epsilon|=0)
- The set of all strings over \Sigma is denoted \Sigma^*
- For each n \geq 0, we define \Sigma^n = {w\in\Sigma^* \mid |w| = n}
- For a symbol or word x, x^n denotes x concatenated with itself n times, with the convention that x^0 denotes \epsilon
- A language over \Sigma is a set L \subseteq \Sigma^*
- Two languages L1 and L2 over common alphabet \Sigma are equal if they are equal as sets
Lexical Patterns: Regular Expressions
- Regular expressions (regex) describe regular languages
- Regular Expression (over alphabet \Sigma)
- \epsilon (empty string) is a regex denoting the set {\epsilon}
- If a is in \Sigma, then a is a regex denoting {a}
- If x and y are regexes denoting languages L(x) and L(y), then
  - ( x ) is a regex denoting L(x)
  - x | y is a regex denoting L(x) \cup L(y)
  - xy is a regex denoting L(x)L(y)
  - x^* is a regex denoting L(x)^*
Regular Expression Examples
- All strings of 1s and 0s
- ( 0 | 1 )^*
- All strings of 1s and 0s beginning with a 1
- 1 ( 0 | 1 )^*
- All strings of 0s and 1s containing at least two consecutive 1s
- ( 0 | 1 )^* 1 1( 0 | 1 )^*
- All strings of alternating 0s and 1s
- ( \epsilon | 1 ) ( 0 1 )^* ( \epsilon | 0 )
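The third example above, ( 0 | 1 )^* 1 1 ( 0 | 1 )^*, can be recognized directly; a minimal sketch where the function name is made up and strstr stands in for a real automaton:

```c
#include <assert.h>
#include <string.h>

/* Membership test for the language of (0|1)^* 1 1 (0|1)^*:
   binary strings containing at least two consecutive 1s. */
static int has_consecutive_ones(const char *s)
{
    for (size_t i = 0; s[i]; i++)
        if (s[i] != '0' && s[i] != '1')
            return 0;                     /* not a string over {0, 1} */
    return strstr(s, "11") != NULL;       /* contains the substring 11 */
}
```

A scanner generator would compile the regex into a DFA instead, as the following sections describe.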
Regular Expression Extensions (Lex)
- x+ = x x^* denotes L(x)^+
- x? = x | \epsilon denotes L(x) \cup {\epsilon}
- [abc] = a | b | c matches one of the characters in the square brackets
- a-z = a | b | c | … | z denotes a range
- [0-9a-z] = 0 | 1 | 2 | … | 9 | a | b | c | … | z
- [^abc] matches any character except a, b, or c (^ means negation)
- . (dot) matches any character except the newline (\n), i.e., . = [^\n]
- "[" matches a left square bracket; metacharacters in double quotes become plain characters
- \[ matches a left square bracket; a metacharacter after a backslash becomes a plain character
Regular Definitions
- We can define macros using regular expressions and use them in other regular expressions
- Letter → (a|b|c| … |z|A|B|C| … |Z)
- Digit → (0|1|2| … |9)
- Identifier → Letter ( Letter | Digit )*
- Important: We should be able to order these definitions so that every definition uses only the definitions defined before it (i.e., no recursion)
- Regular definitions can be converted to basic regular expressions with macro expansion
- In Lex, enclose definitions using curly braces
- Identifier → {Letter} ( {Letter} | {Digit} )*
Regular Expression Examples
- Examples:
  - Integer → (+|-)? (0 | (1|2|3| … |9) Digit^*)
  - Decimal → Integer "." Digit^*
  - Real → ( Integer | Decimal ) E (+|-)? Digit^*
  - Complex → "(" Real "," Real ")"
Regular Expression to Scanners
- Regular expressions are useful for specifying patterns that correspond to tokens
- However, we also want to construct programs that recognize these patterns
- How do we do it?
- Use finite automata!
RegEx to DFA Example
- Consider the problem of recognizing register names in an assembler
  - Register → R (0|1|2| … |9) (0|1|2| … |9)^*
  - Allows register names with an arbitrary number of digits
  - Requires at least one digit
- The regex corresponds to a recognizer (or DFA)
Deterministic Finite Automata (DFA)
A set of states S
- S = { s0 , s1 , s2 , se}
A set of input symbols (an alphabet) \Sigma
- \Sigma = { R , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
A transition function δ : S × Σ → S
- Maps (state, symbol) pairs to states
A start state
- s0
A set of final (or accepting) states
- Final = { s2 }
DFA Simulation
- Start in state s0 and follow transitions on each input character
- DFA accepts a word x iff x leaves it in a final state (s2 )
- So,
- “R17” takes it through s0 , s1 , s2 and accepts
- “R” takes it through s0 , s1 and fails
- “A” takes it straight to se
- “R17R” takes it through s0 , s1 , s2 , se and rejects
DFA Simulation (cont.)
- The recognizer translates directly into code
- To change DFAs, just change the arrays
- Takes O(|x|) time for input string x
Recognize Longest Accepted Prefix
Given an input string, this simulation algorithm returns the longest accepted prefix. For example, given the input “R17R”, it returns “R17”.
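The register-name DFA and the longest-accepted-prefix rule can be sketched as table-driven C code; the state names and the -1 failure convention are assumptions for illustration, and a transition function stands in for the transition arrays mentioned above:

```c
#include <assert.h>

/* States of the DFA for Register -> R (0|...|9) (0|...|9)*:
   S0 = start, S1 = saw 'R', S2 = accepting, SE = error state (se). */
enum { S0, S1, S2, SE };

static int step(int state, char c)
{
    if (state == S0) return c == 'R' ? S1 : SE;
    if ((state == S1 || state == S2) && c >= '0' && c <= '9') return S2;
    return SE;
}

/* Returns the length of the longest accepted prefix of x,
   or -1 if no prefix is accepted. Runs in O(|x|) time. */
static int longest_accepted_prefix(const char *x)
{
    int state = S0, best = -1;
    for (int i = 0; x[i] && state != SE; i++) {
        state = step(state, x[i]);
        if (state == S2) best = i + 1;   /* remember last accepting point */
    }
    return best;
}
```

The test cases mirror the simulation examples above: "R17" accepts, "R" and "A" fail, and "R17R" yields the prefix "R17".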
Syntax Analysis
- Syntax analyzer, or parser, groups sequences of tokens from the lexical analysis phase into phrases, each with an associated phrase type (i.e., a logical unit with respect to the rules of the source language)
- The following operations are performed in this phase
- Obtain tokens from lexical analyzer
- Check whether the expression is syntactically correct
- Report syntax errors, if any
- Determine the statement class
- An assignment statement, a condition (e.g., if) statement, etc.
- Group tokens into statements
- Construct hierarchical structures called parse trees
- Parse trees represent the syntactic structure of the program
Grammars and Parse Trees
- The grammar S → aSbS | bSaS | ε generates all strings of a’s and b’s with the same number of a’s as b’s
- This grammar is ambiguous: abab has two parse trees
Grammars and Parse Trees (Unambiguous)
- Natural languages are inherently ambiguous
- But programming languages should not be!
- This grammar G generates the same language
- S → aAbS | bBaS | ε
- A → aAbA | ε
- B → bBaB | ε
- G is unambiguous and has only one parse tree for every sentence in L(G)
Parse Trees
- The statement b = a + 7 can be represented by the following parse tree
- Parse tree illustrates the grouping of tokens into phrases
- Successful construction of a parse tree is proof that the statement is correctly formed
Parse Trees (cont.)
- Every construct (statement, expression, etc.) in a programming language can be represented as a parse tree
- Terminals
- The actual tokens of the language
- Non-Terminals
- Intermediate grammatical categories used to help explain and organize the language
Recursive Descent Parser
- Simplest idea for creating a parser for a grammar is to construct a recursive descent parser
- Built from a context-free grammar
- Simple rules used to build the recursive program
- Implement a function for each non-terminal
- Generate code for each rule
- “Call” non-terminals
- “Match” terminals
- Use if statement to choose between multiple rules with the same (non-terminal) symbol
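The rules above can be sketched for a small hypothetical grammar (Expr → Term { (+|-) Term }, Term → Digit): one function per non-terminal, terminals matched directly, with globals tracking position and error state for brevity (not reentrant):

```c
#include <assert.h>
#include <ctype.h>

static const char *pos;   /* current position in the input */
static int ok;            /* cleared on the first syntax error */

/* Term -> Digit: match a single digit terminal. */
static int term(void)
{
    if (isdigit((unsigned char)*pos)) return *pos++ - '0';
    ok = 0;
    return 0;
}

/* Expr -> Term { ('+' | '-') Term }: call term, then loop on operators. */
static int expr(void)
{
    int v = term();
    while (ok && (*pos == '+' || *pos == '-')) {
        char op = *pos++;
        int r = term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

/* Parse (and here, evaluate) a whole input string. */
static int parse(const char *s)
{
    pos = s; ok = 1;
    int v = expr();
    if (*pos != '\0') ok = 0;   /* trailing input is a syntax error */
    return v;
}
```

A real recursive descent parser would build and return parse-tree nodes rather than evaluating; the call structure is the same.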
Recursive Descent Parser (Pros & Cons)
- Advantages
- They are exceptionally simple
- They can be constructed from recognizers simply by doing some extra work – specifically, building a parse tree
- Disadvantages
- They are not as fast as some other methods
- It is difficult to provide really good error messages
- They cannot do parses that require arbitrarily long look-aheads
Recursive Descent Parser Stack
- One easy way to do recursive descent parsing is to have each parse method take the tokens it needs, build a parse tree, and put the parse tree on a global stack
- Write a parse method for each non-terminal in the grammar
- Each parse method should get the tokens it needs, and only those tokens
- Those tokens (usually) go on the stack
- Each parse method may call other parse methods, and expect those methods to leave their results on the stack
- Each (successful) parse method should leave one result on the stack
Example: while Statement
- <while-statement> ::= "while" <condition> <block>
- The parse method for a while statement:
  - Calls the Tokenizer, which returns a "while" token
  - Makes "while" into a tree node, which it puts on the stack
  - Calls the parser for <condition>, which parses a condition and puts it on the stack
    - Stack contains: "while" <condition> ("top" is on right)
  - Calls the parser for <block>, which parses a block and puts it on the stack
    - Stack now contains: "while" <condition> <block>
  - Pops the top three things from the stack, assembles them into a tree, and pushes this tree onto the stack
Recognizing a while Statement
- This code assumes that keyword(String), condition(), and block() all leave their results on a stack
- On the stack, "while" is at depth 3, <condition> at depth 2, and <block> at depth 1
Parsing a while Statement
- After recognizing a while loop, the stack looks like this:
Compiler Component Generators
- Defines
- Flex specification
- Yacc specification
Desk Calculator Example
- Flex specification for a desk calculator: desk.l
  $ flex desk.l
- yacc specification for a desk calculator: desk.y
  $ yacc desk.y
  - This generates a file called y.tab.c; before compiling the resulting file, you must add the following function prototypes near the top of the file:
    int yylex(void);
    void yyerror(const char *);
- Now compile this file with the flex and yacc libraries
  $ gcc y.tab.c -ly -lfl
Desk Calculator Example (cont’d)
- Example output
Intermediate Code Generation
- Intermediate code generation produces a program in a different language, at an intermediate level between the source code and the machine code
- This allows part of the compiler to be machine independent!
- Intermediate languages are sometimes assembly languages
- Generation of intermediate code offers the following advantages
- Flexibility
- A single lexical analyzer/parser can be used to generate code for several different machines by providing separate back-ends that translate a common intermediate language to a machine-specific assembly language
- Intermediate code is used in interpretation
- Executed directly rather than translating it into binary code and storing it
Semantic Analysis
- A semantic analyzer takes its input from the syntax analysis phase in the form of a parse tree and a symbol table
- Its purpose is to determine if the input has a well-defined meaning
- In practice, semantic analyzers are mainly concerned with type checking and type coercion based on type rules
Semantic Analysis (cont.)
- The semantic phase has the following functions:
- Check phrases for semantic errors (e.g., type checking)
    - In a C program, int x = 10.5; should be detected as a semantic error
- In a C program,
- Keeps track of types of identifiers and expressions to verify their consistent usage
- Maintains the symbol table
- Symbol table contains information about each identifier in a program, such as identifier type, scope of identifier, etc.
Semantic Analysis (cont. 2)
- The semantic phase has the following functions:
- Enforces a large number of rules using the symbol table
- Every identifier is declared before it is used
- No identifier is used in an inappropriate context (e.g., adding a string to an integer)
- Function calls have a correct number and type of arguments
- Checks certain semantic rules, called dynamic semantics, at run time
- Array subscript expression lies within the bounds of the array
- Variables are never used in an expression unless they have been given a value
Symbol Table
- Built and maintained by the semantic analyzer
- Maps each identifier to information known about it
- Identifier’s type (e.g., int, char, float, etc.)
- Internal structure (if any)
- Scope (the portion of the program in which it is valid)
- Using the symbol table, the semantic analyzer enforces a large variety of rules
- Purpose of symbol table is to provide quick and uniform access to identifier attributes throughout compilation process
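A minimal sketch of such a table, assuming a flat (single-scope) linear list rather than the hash table a real compiler would use; storing names and types as strings is an illustration-only simplification:

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 64

struct symbol { const char *name; const char *type; };
static struct symbol table[MAX_SYMS];
static int nsyms;

/* Return the type recorded for an identifier, or NULL if undeclared. */
static const char *lookup(const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0) return table[i].type;
    return NULL;
}

/* Record a declaration; fails on redeclaration or a full table. */
static int insert(const char *name, const char *type)
{
    if (lookup(name) != NULL || nsyms == MAX_SYMS) return 0;
    table[nsyms].name = name;
    table[nsyms].type = type;
    nsyms++;
    return 1;
}
```

With this interface, the semantic analyzer's rules map directly onto table operations: "declared before use" is a lookup, and "no redeclaration" is a failed insert.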
Semantics Analysis Example
- Syntactically correct, but semantically incorrect
- Semantic records
Code Generation
- The code generated depends upon the architecture of the target machine
- Knowledge of the instructions and addressing modes in a target computer is necessary for the code generation process
- One of the important aspects of code generation is the efficient initialization of machine resources using assumptions such as:
- Instruction types in target machines
- Commutative properties of operators in an expression
- Proper usage of syntax for syntax directed translation
Error Handling
- In each phase, a compiler can encounter errors
- On detecting an error, the compiler must
- Report the error in a helpful way
- Correct the error, if possible
- Continue processing (if possible) after the error to look for further errors
Error Types
- Syntax errors are errors in the program text
- A lexical error is a mistake in lexeme (e.g., typing “esle” instead of “else” or missing off one of the quotes in a literal string)
- A grammatical error is one that violates the rules of the language
- Semantic errors are mistakes concerning the meaning of a program construct
- Type errors occur when an operator is applied to an argument of the wrong type or to the wrong number of arguments
- Logical errors occur when a badly conceived program is executed
- Run-time errors are errors that can be detected only when the program is executed
Compiler Optimization
Inside a Basic Compiler Front End
- High-level language (C, C++, Java) to Low-level language (IA-64)
- Inside an optimizing compiler, the IR is additionally improved between the front end and the back end
Control Flow Graph
- Basic Block: A group of consecutive instructions with a single entry point and a single exit point
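Basic blocks are typically found by marking leader instructions: the first instruction, any jump target, and any instruction immediately after a jump. A minimal sketch under a made-up encoding, where jump_target[i] is the destination index of a jump at instruction i, or -1 if i is not a jump:

```c
#include <assert.h>

#define N 6

/* Mark leader[i] = 1 for every basic-block leader among N instructions. */
static void find_leaders(const int jump_target[N], int leader[N])
{
    for (int i = 0; i < N; i++) leader[i] = 0;
    leader[0] = 1;                               /* rule 1: first instruction */
    for (int i = 0; i < N; i++) {
        if (jump_target[i] >= 0) {
            leader[jump_target[i]] = 1;          /* rule 2: jump target */
            if (i + 1 < N) leader[i + 1] = 1;    /* rule 3: after a jump */
        }
    }
}
```

Each basic block then runs from a leader up to (but not including) the next leader, giving the nodes of the control flow graph.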
Code Optimization Requirements
- Preserve correctness
- The speed of an incorrect program is irrelevant
- On average, improve performance
- Optimized code may be worse than original if unlucky
- Be “worth the effort”
- Is this example worth it?
Code Optimization
- Optimization improves programs by making them smaller or faster or both
- Programs can be optimized for speed, memory usage, or program footprint
- Goal of code optimization is to translate a program into a new version that computes the same result more efficiently
- by taking less time, memory, and other system resources
- Code optimization is achieved in two ways
- Rearranging computations in a program to make them execute more efficiently
- Eliminating redundancies in a program
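The two approaches above can be illustrated with a before/after pair; a sketch in which the hand-written "after" version stands in for what an optimizer would produce (the function names are invented for illustration):

```c
#include <assert.h>

/* Before: the loop-invariant expression scale * 2 is recomputed
   on every iteration. */
static long sum_before(const int *x, int n, int scale)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)x[i] * (scale * 2);
    return s;
}

/* After: the invariant is hoisted out of the loop (rearranging the
   computation) and computed once (eliminating the redundancy).
   The result must be identical - correctness is preserved. */
static long sum_after(const int *x, int n, int scale)
{
    long s = 0;
    long k = (long)scale * 2;   /* hoisted loop-invariant */
    for (int i = 0; i < n; i++)
        s += x[i] * k;
    return s;
}
```

In practice a compiler performs this transformation automatically (loop-invariant code motion), so both versions of the source may compile to the same machine code.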
Code Optimization (cont.)
- Should not change the meaning of the program
- Tries to improve the program, but the underlying algorithm is not affected
- Cannot replace an inefficient algorithm with an algorithm that is more efficient
- Also cannot fully utilize the instruction set of a particular architecture
- Therefore, it is independent of the target machine and the programming language
Optimizing Transformations
- Two types of optimizing transformations are
- Local transformations
- Applied over small segments of a program
- Provides limited benefits at a low cost
- Scope is a basic block which is a sequential set of instructions
- Global transformations
- Applied over larger segments consisting of loops or function bodies