









These are notes on compiler design.
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessors: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
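For instance, the C preprocessor performs both macro processing and file inclusion; a minimal illustrative fragment:

    #include <stdio.h>              /* file inclusion: the header text is inserted here */
    #define SQUARE(x) ((x) * (x))   /* macro: a shorthand for a longer construct */

    int main(void) {
        printf("%d\n", SQUARE(5));  /* expanded to ((5) * (5)) before compilation */
        return 0;
    }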
A compiler is a translator program that translates a program written in a high-level language (HLL), the source program, into an equivalent program in a machine-level language (MLL), the target program. An important part of a compiler is the reporting of errors in the source program to the programmer.
[Diagram: source program → compiler → target program, with error messages reported to the programmer]
Executing a program written in an HLL programming language basically consists of two parts: the source program must first be compiled (translated) into an object program; then the resulting object program is loaded into memory and executed.
[Diagram: source program → compiler → object program; object program + input → output]
Because writing in machine language was tedious, programmers began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine language translation (object program).
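For illustration only (the mnemonics and bit patterns below are invented, not taken from any real machine), an assembler might translate:

    LOAD  R1, X    ->  0001 0001 0110
    ADD   R1, Y    ->  0010 0001 0111
    STORE R1, Z    ->  0011 0001 1000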
An interpreter is a program that appears to execute a source program as if it were machine language.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters; Java also uses an interpreter. The process of interpretation is carried out in phases.
Advantages:
Modifications to the user program can easily be made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simpler task for a program run under an interpreter.
The interpreter for the language makes the program machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.
1.6 Loader and Link-editor:
Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate his program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into object form which the loader could "relocate" directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions: allocation, linking, relocation, and loading.
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes a source program in one representation and produces output in another representation. The phases of a compiler are shown below.
There are two parts to compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:-
The LA or scanner reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
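For example, the classic assignment statement position := initial + rate * 60 would be carved into the tokens: the identifier position, the assignment symbol :=, the identifier initial, the plus sign, the identifier rate, the multiplication sign, and the number 60.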
Syntax Analysis:-
The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures called expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification of the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression.
The job of syntax analysis is to make explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.
Example: A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
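The two trees (with the operator at each interior node) look like:

    (A/B)*C:           A/(B*C):

        *                  /
       / \                / \
      /   C              A   *
     / \                    / \
    A   B                  B   C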
Semantic Analysis:-
A semantic analyzer checks the source program for semantic errors and collects type information for the code generation phase. It determines the meaning of the source string. Type checking is an important part of the semantic analyzer. Normally, semantic information cannot be represented by the context-free languages used by syntax analyzers.
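For example, a type checker would reject an assignment whose operand types are incompatible; a small illustrative C fragment:

    int count;
    char *name = "pi";
    count = name;   /* semantic error: assigning a pointer to an integer */

The lexical and syntax analyzers accept this code; only the semantic phase, using the collected type information, can detect the mismatch.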
Intermediate Code Generation:-
The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands.
The output of the syntax analyzer is some representation of a parse tree. The intermediate code generation phase transforms this parse tree into an intermediate-language representation of the source program.
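For instance, the assignment a := b + c * d might be translated into the following three-address instructions, where t1 and t2 are temporaries introduced by the compiler:

    t1 := c * d
    t2 := b + t1
    a  := t2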
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.
1, Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For example, the sequence
    If A > B goto L2
    Goto L3
L2:
can be replaced by the single statement
    If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions. For example,
    A := B + C + D
    E := B + C + F
might be evaluated as
    T1 := B + C
    A := T1 + D
    E := T1 + F
taking advantage of the common sub-expression B + C.
2, Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
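A minimal sketch in C (variable names are illustrative): the computation limit * 2 produces the same result on every iteration, so it is hoisted out of the loop.

    /* Before: limit * 2 is recomputed on every iteration */
    for (i = 0; i < n; i++)
        a[i] = b[i] + limit * 2;

    /* After: the loop-invariant computation is moved before the loop */
    t = limit * 2;
    for (i = 0; i < n; i++)
        a[i] = b[i] + t;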
Code Generator:-
The code generator converts the intermediate code into a sequence of machine instructions. For example, for the statement id1 := id2 + id3 * 60, the intermediate code generator might produce

    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

Code Optimizer

    temp1 := id3 * 60.0
    id1 := id2 + temp1

Code Generator

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1
The LA reads the source program one character at a time, carving the source program into a sequence of atomic units called 'tokens'. Each token has two parts:
1. Type of the token.
2. Value of the token.
Type: variable, operator, keyword, constant.
Value: name of the variable, current value, (or) pointer to the symbol table.
If the symbols are given in the standard format, the LA accepts them and produces tokens as output. Each token is a sub-string of the program that is to be treated as a single unit. Tokens are of two types:
1. Specific strings such as IF (or) a semicolon.
2. Classes of strings such as identifiers, labels, constants.
To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by token recognizers, which are designed using transition diagrams and finite automata.
The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a 'get next token' command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation for the token it has found. The representation will be an integer code if the token is a simple construct such as a parenthesis, comma, or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab, and newline characters. Another is correlating error messages from the compiler with the source program.
Lexical analysis: A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc. The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens.
Parsing: A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence). A parser does not give the nodes any meaning beyond structural cohesion; the next thing to do is extract meaning from this structure (sometimes called contextual analysis).
Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are identifiers, keywords, operators, and constants.
Pattern: A set of strings in the input for which the same token is produced as output. This
set of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Example: Description of tokens

Token     Lexeme                Pattern
const     const                 const
if        if                    if
relation  <, <=, =, <>, >=, >   < or <= or = or <> or >= or >
id        pi                    letter followed by letters & digits
num       3.14                  any numeric constant
literal   "core"                any characters between " and " except "
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals", because it presents an interesting structure of lexemes. The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions:
digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.
Lexeme       Token Name   Attribute Value
Any ws       _            _
if           if           _
then         then         _
else         else         _
Any id       id           pointer to table entry
Any number   number       pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. If it is necessary to retract the forward pointer one position, we additionally place a * near that accepting state.
3. One state is designated the start, or initial, state; it is indicated by an edge labeled "start" entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.
As an intermediate step in the construction of an LA, we first produce a stylized flowchart called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.
The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code, as in the sketch below.
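A minimal sketch in C of such a code segment for the identifier diagram (the state numbering and function name are our own, not from any particular compiler):

    #include <ctype.h>

    /* Returns 1 if an identifier lexeme starts at *p, advancing *p past it.
       State 0: expect a letter; state 1: loop on letters/digits; then accept. */
    int recognize_id(const char **p) {
        const char *s = *p;
        if (!isalpha((unsigned char)*s))    /* state 0 */
            return 0;
        s++;
        while (isalnum((unsigned char)*s))  /* state 1 */
            s++;
        *p = s;                             /* accepting state: lexeme found */
        return 1;
    }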
if    = if
then  = then
else  = else
relop = < | <= | = | <> | > | >=
id    = letter (letter | digit)*
num   = digit+ (. digit+)? (E (+ | -)? digit+)?
From state S0, for input 'a' there is only one path, going to S2. Similarly, from S0 there is only one path for input 'b', going to S1.
An NFA is a mathematical model that consists of:
a set of states S;
a set of input symbols ∑;
a transition function, move, that maps state-symbol pairs to sets of states;
a state s0 that is distinguished as the start (or initial) state;
a set of states F distinguished as accepting (or final) states.
Any number of transitions may leave a state on a single input symbol.
An NFA can be diagrammatically represented by a labeled directed graph, called a transition graph, in which the nodes are the states and the labeled edges represent the transition function. This graph looks like a transition diagram, but the same character can label two or more transitions out of one state, and edges can be labeled by the special symbol ε as well as by input symbols. The transition graph for an NFA that recognizes the language (a | b)*abb is shown below.
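Its transition table (state 0 is the start state; state 3 is accepting):

    State   a        b
    0       {0, 1}   {0}
    1       -        {2}
    2       -        {3}
    3       -        -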
A context-free grammar (CFG) involves four quantities: terminals, non-terminals (N-T), a start symbol, and productions.
Terminals are the basic symbols from which strings are formed.
Non-terminals are syntactic variables that denote sets of strings.
In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings it denotes is the language defined by the grammar.
The productions of the grammar specify the manner in which the terminals and non-terminals can be combined to form strings.
Each production consists of a non-terminal, followed by an arrow, followed by a string of non-terminals and terminals.
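For example, a standard textbook grammar for arithmetic expressions, where the terminals are +, *, (, ), and id, the non-terminals are expr, term, and factor, and expr is the start symbol:

    expr   → expr + term | term
    term   → term * factor | factor
    factor → ( expr ) | id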
A symbol table can be organized as an extensible array of records. The identifier and the associated record contain the collected information about the identifier:

FUNCTION identify(identifier name)
RETURNING a pointer to identifier information, which contains:
  the actual string
  a macro definition
  a keyword definition
  a list of type, variable & function definitions
  a list of structure and union name definitions
  a list of structure and union field definitions
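A minimal sketch in C of one such record, assuming the fields listed above (all names are illustrative, and the pointed-to list types are left incomplete):

    struct sym_entry {
        char *name;             /* the actual string */
        struct macro *macro;    /* macro definition, if any */
        int keyword;            /* keyword code, or 0 if not a keyword */
        struct decl  *decls;    /* type, variable & function definitions */
        struct tag   *tags;     /* structure and union name definitions */
        struct field *fields;   /* structure and union field definitions */
    };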
A Lex program (the .l file ) consists of three parts:
declarations
translation rules
auxiliary procedures
The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.
The translation rules of a Lex program are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.
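A minimal illustrative .l file showing the three parts (the patterns and actions are our own toy example, with the actions written in C):

    %{
    #include <stdio.h>
    #define PIE 3.14            /* a manifest constant, as described above */
    %}
    digit   [0-9]
    id      [A-Za-z][A-Za-z0-9]*

    %%
    {digit}+   { printf("NUM\n"); }
    {id}       { printf("ID\n"); }
    [ \t\n]    { /* strip white space: return nothing to the parser */ }
    %%

    int yywrap(void) { return 1; }
    int main(void)   { yylex(); return 0; }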
The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves.
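A minimal sketch in C of the classic two-halves scheme (names and the refill policy shown are illustrative; end-of-file sentinel handling is omitted):

    #include <stdio.h>
    #define N 4096                          /* size of each buffer half */

    static char buf[2 * N];                 /* two halves, reloaded alternately */
    static char *forward = buf;             /* scans ahead to find token ends */

    /* Assumes the first half was filled before scanning began. */
    static void advance(FILE *src) {
        forward++;
        if (forward == buf + N)             /* crossed into the second half */
            (void)fread(buf + N, 1, N, src);/* refill it */
        else if (forward == buf + 2 * N) {  /* ran off the end */
            (void)fread(buf, 1, N, src);    /* refill the first half */
            forward = buf;                  /* wrap around */
        }
    }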