Compilers
Table-driven parser
A PDA (pushdown automaton): push the start symbol, and if it has been popped off by the end of the input, we know we're done. A generic table-parsing routine can use the parse table to parse the input.
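A minimal sketch of such a generic routine, assuming a small hand-built LL(1) table. The grammar, the table contents and the name parse are illustrative assumptions, not taken from these notes.

```python
# Hypothetical LL(1) table for the grammar
#   E  -> T E'
#   E' -> + T E' | e        (e = empty string, written as [])
#   T  -> n | ( E )
TABLE = {
    ("E", "n"): ["T", "E'"],
    ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],
    ("E'", "$"): [],
    ("T", "n"): ["n"],
    ("T", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T"}


def parse(tokens, start="E"):
    """Return True if the list of terminals can be derived from start."""
    tokens = list(tokens) + ["$"]       # "$" marks end of input
    stack = ["$", start]                # push the start symbol
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]              # one token of lookahead
        if top in NONTERMINALS:
            rule = TABLE.get((top, look))
            if rule is None:            # empty box in the table: syntax error
                return False
            stack.extend(reversed(rule))  # leftmost symbol ends up on top
        elif top == look:
            pos += 1                    # terminal matches: consume it
        else:
            return False
    return pos == len(tokens)
```

The routine itself knows nothing about the grammar; swapping in a different table parses a different language.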
The grammar for (1+2+3) and (1+2+3)+4 is
non-deterministic without arbitrary lookahead. This is why, for top-down parsing, we change a grammar from left-recursive (the non-terminal recurs at the left end of its own right-hand side) to right-recursive.
Back end of compiler is responsible for
Optimization and code generation.
Lexical Analysis (scanning)
Take the stream of individual chars and convert it to tokens. Detect lexical errors (badly formed literals, illegal chars), discard white space, discard comments.
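A minimal sketch of a scanner doing those jobs, built on Python's re module. The token classes and the "#" comment syntax are assumptions made up for the example.

```python
import re

# Hypothetical token classes; earlier alternatives win when several match.
TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),          # assumed comment syntax: "#" to end of line
    ("SKIP", r"\s+"),                 # white space
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/()=]"),
    ("ERROR", r"."),                  # any other single character is illegal
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Convert a string of characters into a list of (kind, lexeme) tokens."""
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind in ("SKIP", "COMMENT"):
            continue                  # discard white space and comments
        if kind == "ERROR":
            raise SyntaxError(
                f"illegal character {m.group()!r} at offset {m.start()}")
        tokens.append((kind, m.group()))
    return tokens
```

For example, tokenize("x = 3 + 4.5  # sum") keeps the five tokens and drops the spaces and the comment, while tokenize("x = $") raises a SyntaxError.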
CFG contains
Terminal symbols (Tokens generated by lexical analyzer), non-terminal symbols, production rules, start symbol.
symbol table built during _____ _______
semantic analysis
Comparing items that are separated in the program, or counting items, will end up being a ______ ____
semantic check.
Lexical analyzer generator
mechanically converts a set of regular expressions (and their action routines) into a program.
Interpreter's advantages over compilers
more flexible, better diagnostics, easier portability, take less space.
Why not regular expressions for syntactic analysis?
Because they do not have enough power: a finite automaton has only finite memory. We need CFGs because they can recognize nesting that is too deep for DFAs (the input is more complex than O(n)).
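A small sketch of why nesting needs more than a finite automaton: checking that brackets match takes a stack that can grow without bound, and that is exactly the memory a DFA lacks. The function name balanced is an assumption for the example.

```python
def balanced(text):
    """Check that parentheses and square brackets nest correctly."""
    pairs = {")": "(", "]": "["}
    stack = []                        # unbounded memory, unlike a DFA's states
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False          # closer with no matching opener
    return not stack                  # leftover openers mean unbalanced
```

However many states a DFA has, some input nests deeper than it can count; the stack here just keeps growing.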
what does LL(1) mean?
Can we automatically construct a parser from the CFG that will create an unambiguous parse tree from any given string, or tell us the string is unparseable? LL(1): do a Left-to-right scan of the input, producing a Leftmost derivation (needs right recursion), using 1 token of lookahead (look at the next token without consuming it). Top-down; O(n) in the size of the input string.
Semantic analysis covers what a CFG can't express, or can't express conveniently
Comparing the argument list in a function call to the one given in the function's declaration (this isn't context-free). Ensuring a function has at least one return statement (inconvenient with a CFG; the return could be anywhere in the function).
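A minimal sketch of the first check, assuming the declarations have already been collected into a symbol table (here a plain dict). The data layout and the name check_calls are hypothetical.

```python
def check_calls(declarations, calls):
    """Compare each call's argument types with the declared parameter types.

    declarations: dict mapping function name -> list of parameter types
    calls: list of (function name, list of argument types)
    Returns a list of error messages (empty if every call is fine).
    """
    errors = []
    for name, arg_types in calls:
        if name not in declarations:
            errors.append(f"{name}: undeclared function")
            continue
        params = declarations[name]
        if len(arg_types) != len(params):
            errors.append(
                f"{name}: expected {len(params)} argument(s), got {len(arg_types)}")
        elif list(arg_types) != list(params):
            errors.append(f"{name}: argument types do not match declaration")
    return errors
```

The check has to look up information recorded somewhere else in the program, which is what makes it context-sensitive.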
Interpreter Vs Compiler
Compiler: the entire code is analyzed; more efficient because the work is done once; catches lots of errors. Interpreter: one line at a time; less efficient because the work is done every time; more flexible.
Syntactic analysis (Parsing)
Context-free. Take the stream of tokens and convert it into a parse tree. Captures the hierarchical structure of the program (declarations, definitions, blocks, statements, expressions, BUT NOT undeclared identifiers or mismatched function calls). Syntactic errors are about form, NOT meaning.
Semantic analysis
Context sensitive stuff,
Why we use regular expressions to do lexical scanning?
There is a correspondence between REs and finite automata, and finite automata can be converted into a scanning program in a mechanical way.
Front end of the compiler or the first phases
Discovers the meaning of the source program; comprises the lexical, syntactic and semantic analysis steps.
Role of the semantic analyzer
Enforce all static semantic rules. Annotate the parse tree with any info required by intermediate code gen (clarifications such as resolution of overloaded operations, and dynamic semantic checks).
semantic content of scalar
Its Type and Value
LR(k) Parsing
Left-to-right scan of the input, producing a Rightmost derivation (in reverse), using k tokens of lookahead. Bottom-up: production rules are used right to left. The stack is used in a more sophisticated way.
The first k tokens in the input stream form the lookahead.
Shift: push the first input token onto the stack.
Reduce: select some grammar rule, e.g. X -> A B C; pop C, B, A off the stack; and push X onto the stack.
If the parser makes it to the point where it can shift the EOF marker onto the stack, it has accepted the input as valid.
3 cases can occur when reducing:
1) The top few items do not match the RHS of any rule of the grammar: shift the next token onto the stack.
2) The top few items match the RHS of one rule of the grammar: reduce, popping off those matching items and pushing the LHS of the rule.
3) The top few items match the RHS of multiple rules of the grammar: we don't know which reduction to make (there is more than 1 rule in an entry of the parse table). This is a reduce/reduce conflict.
Shift/reduce conflict: you could either shift or reduce (e.g. the dangling else).
Reduce/reduce conflict: the grammar itself need not be ambiguous; the problem can be that we only get 1 token of lookahead, as in
    A -> B c d
    A -> E c f
    B -> x y
    E -> x y
Shifting and reducing are driven by a DFA; the DFA is applied to the contents of the stack, not used to parse the CFG itself.
Improved CFG to make it unambiguous
Lower precedence comes first to enforce order of operations (hierarchy: all add ops come before mul ops in the production rules, unless they are in parentheses). Associativity is enforced by making that rule left-recursive (expr -> expr add_op term).
    expr -> term | expr add_op term
    term -> factor | term mul_op factor
    factor -> id | number | - factor | ( expr )
    add_op -> + | -
    mul_op -> * | /
(In predictive parsing, to decide whether we can parse predictively) To predict which production rule to use, we need 3 things for every non-terminal X:
NULLABLE(X): can X ever derive e (the empty string)? FIRST(X): which terminals can appear first in a string derived from X? FOLLOW(X): which terminals can appear immediately after X? Given these, we can construct the parse table for the CFG. The parse table tells us which rule to use when we are trying to parse a given non-terminal and are seeing a given terminal. For an LL(1) grammar every box holds at most 1 rule, which means the CFG is unambiguous; a box with no rule means the input is a syntax error.
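A sketch of how the three sets can be computed by repeating the rules until nothing changes (a fixed-point iteration). The grammar encoding, the example grammar and the name analyze are assumptions for illustration; an empty list stands for e.

```python
# Hypothetical example grammar: non-terminal -> list of right-hand sides.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] is the empty string e
    "T": [["n"], ["(", "E", ")"]],
}


def analyze(grammar, start):
    """Return (NULLABLE, FIRST, FOLLOW) for every non-terminal of grammar."""
    nullable = set()
    first = {x: set() for x in grammar}
    follow = {x: set() for x in grammar}
    follow[start].add("$")               # "$" marks end of input

    def first_of(sym):
        # FIRST of a terminal is just the terminal itself.
        return first[sym] if sym in grammar else {sym}

    def all_nullable(symbols):
        return all(s in nullable for s in symbols)

    def size():
        return (len(nullable),
                sum(len(s) for s in first.values()),
                sum(len(s) for s in follow.values()))

    while True:
        before = size()
        for x, rules in grammar.items():
            for rhs in rules:
                if all_nullable(rhs):
                    nullable.add(x)
                for i, sym in enumerate(rhs):
                    if all_nullable(rhs[:i]):
                        first[x] |= first_of(sym)
                    if sym not in grammar:
                        continue         # FOLLOW is only kept for non-terminals
                    rest = rhs[i + 1:]
                    for j, nxt in enumerate(rest):
                        if all_nullable(rest[:j]):
                            follow[sym] |= first_of(nxt)
                    if all_nullable(rest):
                        follow[sym] |= follow[x]
        if size() == before:             # sets only grow, so no growth = done
            return nullable, first, follow
```

With these sets, the table entry for (X, t) gets rule X -> rhs when t is in FIRST(rhs), or when rhs is nullable and t is in FOLLOW(X).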
Parsing complexity of regular expressions
O(n) where n is the length of the string you're trying to parse. Because the acceptance or rejection of an input string in a DFA will take no more transitions than the length of the input. (you consume a symbol on every transition)
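A minimal sketch of a table-driven DFA that counts its transitions, to make the "one symbol per transition" point concrete. The DFA (unsigned decimal literals such as 12 or 3.14) and all names are assumptions for the example.

```python
# Hypothetical DFA: (state, character class) -> next state
DFA = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}


def classify(ch):
    """Map a character to the character class used in the table."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def accepts(text):
    """Return (accepted?, number of transitions attempted)."""
    state, steps = "start", 0
    for ch in text:
        state = DFA.get((state, classify(ch)))
        steps += 1                     # exactly one step per input symbol
        if state is None:              # no transition: reject immediately
            return False, steps
    return state in ACCEPTING, steps
```

The step count never exceeds len(text), which is the O(n) bound.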
Parsing complexity of Context free
O(n^3) for arbitrary CFGs, where n is the number of tokens you're trying to parse (LL and LR grammars can be parsed in O(n)).
Last phases of compilation do what?
Producing the corresponding target code. This is the back end, comprised of optimization of the intermediate code and target code generation.
Patterns of token classes are written as
Regular expressions, easy to write, lots of theory to help with processing
Undeclared identifiers and mismatched function calls are a _____ error. Why?
Semantic error because they require context.
Compiler translates _______ into equivalent _______ then ____ _____
Source program into equivalent target program and then (compiler) goes away
Target code generation
Translates intermediate form into the target language, hard to generate good code.
RE
No real recursion, although it may appear to have some: a rule like S->aS only expresses repetition.
non-deterministic without arbitrary lookahead
While the grammar is not ambiguous, it's non-deterministic without arbitrary lookahead: (1+2+3) vs. (1+2+3)+4.
    expr -> expr add_op term
    expr -> term
    term -> term mul_op factor
    term -> factor
Solution: expr -> term in the first case but expr -> expr add_op term in the second case, and a top-down parser can't tell which until it has seen what follows the closing parenthesis.
The output of syntactic analyzer phase is:
A parse tree, which represents the structure of a particular input token stream as determined by the language's grammar rules. It is unique for a given stream of tokens (if not, the grammar is ambiguous and needs to be fixed).
Recursive descent parser
A parser with one routine written for each non-terminal; the routines encode the choices the parse table would make. Slightly more efficient than the table-driven approach.
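A sketch of a recursive descent parser for an arithmetic expression grammar, one method per non-terminal, that evaluates as it parses. The class and method names are assumptions; the right-recursive tail rules are written as while loops, which keeps + - * / left-associative.

```python
import re


class Parser:
    def __init__(self, text):
        # Crude scanner for the sketch: characters it does not know are skipped.
        self.tokens = re.findall(r"\d+|[-+*/()]", text) + ["$"]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]   # look at the next token, don't consume

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, saw {self.peek()!r}")
        self.pos += 1

    def expr(self):                    # expr -> term { add_op term }
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.match(op)
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):                    # term -> factor { mul_op factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.match(op)
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):                  # factor -> number | - factor | ( expr )
        tok = self.peek()
        if tok.isdigit():
            self.match(tok)
            return int(tok)
        if tok == "-":
            self.match("-")
            return -self.factor()
        if tok == "(":
            self.match("(")
            value = self.expr()
            self.match(")")
            return value
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    parser.match("$")                  # all input must be consumed
    return value
```

Each method decides what to do from one token of lookahead (peek), which is the LL(1) idea written directly as code.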
As static semantic phase runs it transforms the parse tree to an
abstract syntax tree (AST), with only the essential info; uses a symbol table. Many compilers use the AST as the intermediate form to hand off to the back end for code gen (a lower-level language); other compilers tree-walk the AST and generate a different intermediate form.
Action and Goto table
Analysis of the grammar produces the action and goto tables. The action table says, when you're in a given state and you see a specific token, whether you should shift or reduce by a certain rule; after a reduce, the goto table gives the next state.
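A sketch of an LR parser driven by action and goto tables. The tables below were worked out by hand for a tiny assumed grammar; the grammar, state numbers and the name lr_parse are illustrative only.

```python
# Hypothetical grammar (rule numbers are used in the tables):
#   1: E -> E + n
#   2: E -> n
RULES = {1: ("E", 3), 2: ("E", 1)}     # rule -> (LHS, length of RHS)

ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}


def lr_parse(tokens):
    """Return (accepted?, list of the shift/reduce moves that were made)."""
    tokens = list(tokens) + ["$"]      # "$" marks end of input
    stack = [0]                        # stack of DFA states
    pos = 0
    trace = []
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:              # empty table entry: syntax error
            return False, trace
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)          # push the new state, consume the token
            pos += 1
            trace.append("shift")
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]        # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])
            trace.append(f"reduce {arg}")
        else:
            return True, trace         # accept
```

A table entry holding two actions would be a shift/reduce or reduce/reduce conflict; this grammar has none, so each entry is a single action.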
Turing complete
Any real-world general-purpose computer or language can simulate the computational aspects of any other computer or language.
Interpretation in C happens when
at run time printf looks through the special characters in its format string (%d, %s) to see how to handle the arguments that come after.
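A toy sketch of that run-time interpretation, written in Python rather than C. mini_printf is a made-up function that handles only %d, %s and %%; it is not how any real C library implements printf.

```python
def mini_printf(fmt, *args):
    """Build the output string by interpreting fmt one character at a time."""
    out = []
    remaining = list(args)
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            spec = fmt[i + 1]          # the character after % says what to do
            if spec == "d":
                out.append(str(int(remaining.pop(0))))
            elif spec == "s":
                out.append(str(remaining.pop(0)))
            elif spec == "%":
                out.append("%")
            else:
                raise ValueError(f"unsupported conversion %{spec}")
            i += 2
        else:
            out.append(ch)             # ordinary character: copy it through
            i += 1
    return "".join(out)
```

The format string is re-read on every call, which is the "translate every time" cost of interpretation in miniature.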
Code improvement happens when?
Can happen right after semantic analysis; the earlier the better. An improvement must not change the result of the program.
dynamic semantic rules are enforced by the
compiler inserting specific code to perform the check. Example: no divide by zero; if the denominator is 0, the inserted check reports an error at run time.
compiling is a ______ analysis of the ________ ______
complete analysis of source code.
Parse Tree (Concrete syntax tree)
completely shows how a sequence of tokens was derived using CFG.
examples of static semantic checks
Declaration of an identifier before use; no use of an identifier in an inappropriate context; correct number/type of parameters in subroutine calls; distinct constant labels on the branches of a switch; a function with a non-void return type returns a value.
improved cfg
Eliminate explicit alternation (expr -> expr add_op term, expr -> term). Eliminate left recursion and use right recursion instead (we used left recursion originally for associativity). Moving the recursion from left to right generates identical strings:
    (left recursion)   X -> X a
                       X -> B
is replaced by
    (right recursion)  X -> B X'
                       X' -> a X'
                       X' -> e
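A sketch of that rewrite as a function, for immediate left recursion only. The grammar encoding (a list of right-hand sides, with [] standing for e) and the function name are assumptions for the example.

```python
def eliminate_left_recursion(name, rules):
    """Rewrite X -> X a | B as X -> B X', X' -> a X' | e.

    name:  the non-terminal X
    rules: list of right-hand sides, each a list of symbols
    Returns a dict mapping non-terminal -> new list of right-hand sides.
    """
    # The "a" parts: what follows X in each left-recursive rule.
    recursive = [rhs[1:] for rhs in rules if rhs and rhs[0] == name]
    # The "B" parts: rules that do not start with X.
    others = [rhs for rhs in rules if not rhs or rhs[0] != name]
    if not recursive:
        return {name: rules}           # nothing to do
    tail = name + "'"                  # the new non-terminal X'
    return {
        name: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }
```

Applied to expr -> expr add_op term | term, it yields expr -> term expr' and expr' -> add_op term expr' | e.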
Interpretation translates ______ ______
every time
What does FLEX do?
flex takes your formal description of token categories and generates a C program that you can compile.
How was the first compiler written?
in assembly and machine code.
All FAs are finite, but the language accepted by an FA can be
infinite, additionally each string in the language itself is finite.
CFGs can refer to themselves (the definition of recursive), unlike REs, which have
left or right recursion exclusively (never both in the same grammar)
In a tree, the leaves of an expression are always
literals or ids
C doesn't have any dynamic semantic checks
The hardware does the checking instead (e.g. divide by 0).
Java has a lot of static and dynamic semantic checks
to catch bugs and malicious code (buffer overflows, SQL injection)
Why would you write assembly?
to touch certain parts of the hardware, or for performance
Dynamic semantic checks (enforced at runtime)
Variables are not used in an expression unless they have been assigned a value; pointers are dereferenced only when they refer to a valid object; array subscripts are within bounds; arithmetic expressions don't overflow. These depend on info that isn't known until run time.
P-code (byte code)
very simple language that is easy to translate to machine code
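A sketch of how simple such a language can be: a tiny stack machine. The opcodes PUSH/ADD/SUB/MUL are made up for the example and are not the real P-code instruction set; the sketch interprets the code rather than translating it to machine code.

```python
def run(program):
    """Execute a list of (opcode, operand) pairs on a stack machine."""
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op in ("ADD", "SUB", "MUL"):
            b = stack.pop()            # right operand is on top
            a = stack.pop()
            if op == "ADD":
                stack.append(a + b)
            elif op == "SUB":
                stack.append(a - b)
            else:
                stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack.pop()                 # result is left on top of the stack


# 2 + 3 * 4 in postfix order
PROGRAM = [("PUSH", 2), ("PUSH", 3), ("PUSH", 4), ("MUL", None), ("ADD", None)]
```

Because every instruction does one small fixed thing, each can also be mapped to a short, fixed sequence of machine instructions.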
Formal notation (CFGs and REs) is good because
we can process them mechanically, which means we can write programs to manipulate them.
Self hosting compiler
Written in its own language (e.g. a C compiler written in C); compiled the first time by bootstrapping (an earlier, simpler version knows enough to get to the next level).
