Looks like no one added any tags here yet for you.
The Compiler Toolchain
Compiler Sequence
Lexical Analysis
Takes in a stream of characters as input and splits it into tokens/lexemes
Tokens
The “logical units” of the source code
Typically keywords, operators, identifiers, numbers, etc
Parsing/Syntax Analysis
Checks if the code follows a given grammar
E.g “school goes a boy” is lexically correct, but not syntactically correct
Match input to predefined grammar
Outputs an abstract syntax tree
Semantic Analysis
Checks if the code follows the rules of the language and creates meaning
Scope resolution
Type Checking
Array bound checking
Completely traverses the Abstract syntax tree to do this
Outputs intermediate representation
Three steps of Semantic Analysis
Type Checking - inspects type match in assignment, operations, functions, and method calls
Flow Control Checking - checks flow control structures and are used correctly and if classes and objects are correctly assessed
Label Checking - validates the use of labels and identifiers
Optimization
Enhances code efficiency
3 rules of optimization
Can’t change the original programs meaning
Focus on consuming fewer resources and speeding up software operation
Shouldn’t significantly impact compilation time
4 optimization techniques
Function inlining = replace function call with its body
Dead code elimation = delete code if its never executed or if its result is never used
Loop fusion = If several loops have the same iteration conditions, combine them
Instruction combining = combining similar operation into one
Code generation
The compiler converts the optimized intermediate code into the machine code for the target machine
Byte code
Intermediate code between source and machine code
Result of compilation of high level source code
Low level
Not directly understandable by CPU without the use of a virtual machine first
Platform independent
Must be translated into machine code
Binary, hexadecimal, and macros
Intermediate Code
Machine independent code (so no need for unique compilers for all machines)
Easier to use with optimization
Formed by performing a post-order transversal of the AST to generate an IR instruction for each node (this is where most optimization occurs!)
2 forms of intermediate code
High level: Similar to the source language. Easy to boost source code performance
Low level: Close to machine code. Preferred for making machine related optimizations
Chompsky’s Heirarchy
Regular languages - Finite automata
Context free languages - Pushdown automata
Context sensitive languages - Linear bounded automata
Recursively enumerable languages - Turing machine
Context Free Grammar
A formal definition of a language defined by some rules. Each rule specifies how a single symbol can be replaced by one of a selection of sequences of atoms and symbols
LL Grammar
A CFG that can be evaluated by only considering the current rule and the next token in the input stream
3 conditions for an LL grammar
No ambiguity
No left recursion
No common left prefix
Direct Left Recursion
Regular Expression Precedence:
Alteration < Concatenation < Kleene’s Closure
Preprocessor
Prepares source code for the compiler
Substitute macros
Deal with directives starting in # if you’re in C
Compiler
Cleans up code, checks for errors, produces assembly
Assembler
Consumes assembly, produces object code
Object code = almost executable, does not know final memory addresses. The gaps are filled in by the linker
Linker
Consumes object code and libraries and compiles them into a complete, executable program
Selects memory locations where each piece of code and data will be loaded, then links them together by writing in missing address info
Retargetable computer
Contains multiple code generators, so the same IR can be emitted for a variety of microprocessors
Language lexicon
A “dictionary”, a complete set of available terms
Language Syntax
The possible ways we can put words from the syntax together
Syntax is well defined rules to create sentences in a given language
Language semantics
Indicates whether a valid sentence actually makes sense
Common semantic errors
Using an uninitialized variable
Added an operation immediately after a return operation in a function
Type mismatch
Accessing an index in an array outside the bounds
Referencing a variable out of scope
Why use regex?
Regex provides concise notation for string patterns
Use in lexical analysis provides small extensions to resolve ambiguities and handle errors
Good algorithms are known, requiring only a single pass over input and few operations per character
Sentence
A valid sequence of terminals in a language
Sentential form
A valid sequence of terminals and non terminals
Non-terminals
A structure that can occur in a language, but is not a literal symbol
Declarations, statements, and expressions
Ambiguity
A grammar is ambiguous if there exists 2 possible derivations for any sentence
virtual machine
interpreter
An interpreter is a program that reads and executes code written in a high-level programming language, line by line, without the need for compilation. It translates the code into machine language on the fly and executes it immediately
you can modify it as it runs
slower than compiled