Compiler Design: An Overview of Language Processing Systems
Lecture notes of Compilers
UNIT - 1

1.1 OVERVIEW OF LANGUAGE PROCESSING SYSTEM

1.2 Preprocessor

A preprocessor produces input to compilers. It may perform the following functions:

1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.

2. File inclusion: A preprocessor may include header files into the program text.

3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.

4. Language extensions: These preprocessors attempt to add capabilities to the language in the form of built-in macros.
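
For instance, the C preprocessor performs both macro processing and file inclusion. A minimal sketch:

#include <stdio.h>                 /* file inclusion: the header text is spliced in */
#define SQUARE(x) ((x) * (x))      /* macro: shorthand for a longer construct */

int main(void) {
    /* SQUARE(5) expands to ((5) * (5)) before the compiler proper runs */
    printf("%d\n", SQUARE(5));
    return 0;
}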

1.3 COMPILER

A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler is reporting errors to the programmer.

Source pgm --> [ Compiler ] --> Target pgm
                    |
                Error msg

Executing a program written in an HLL programming language basically consists of two parts: the source program must first be compiled (translated) into an object program, and then the resulting object program is loaded into memory and executed.

Source pgm --> [ Compiler ] --> Obj pgm
Obj pgm input --> [ Obj pgm ] --> Obj pgm output

1.4 ASSEMBLER: Programmers found it difficult to write or read programs in machine language. They began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine-language translation (object program).

1.5 INTERPRETER: An interpreter is a program that appears to execute a source program as if it were machine language.

Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. Java also uses an interpreter. The process of interpretation can be carried out in the following phases:

  1. Lexical analysis
  2. Syntax analysis
  3. Semantic analysis
  4. Direct execution

Advantages:

  • Modification of the user program can easily be made and implemented as execution proceeds.
  • The type of object that a variable denotes may change dynamically.
  • Debugging a program and finding errors is a simpler task for an interpreted program.
  • The interpreter for the language makes the program machine independent.

Disadvantages:

  • Execution of the program is slower.
  • Memory consumption is higher.

1.6 Loader and Link-editor:

Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine-language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate the program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.

"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into an object form that the loader could "relocate" directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.

1.7 TRANSLATOR

Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes a source program in one representation and produces output in another representation. The phases of a compiler are shown below (diagram not reproduced in this copy).

There are two phases of compilation:

a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)

The compilation process is partitioned into a number of sub-processes called 'phases'.

Lexical Analysis:-

The LA, or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.

Syntax Analysis:-

The output of LA is a stream of tokens, which is passed to the next phase, the

syntax analyzer or parser. The SA groups the tokens together into syntactic structure called as

expression. Expression may further be combined to form statements. The syntactic structure

can be regarded as a tree whose leaves are the token called as parse trees.

The parser has two functions. It checks if the tokens from lexical analyzer,

occur in pattern that are permitted by the specification for the source language. It also imposes

on tokens a tree-like structure that is used by the sub-sequent phases of the compiler.

For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression.

Syntax analysis makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together. For example, A/B*C has two possible interpretations:

1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.

Each of these two interpretations can be represented in terms of a parse tree.

Semantic Analysis:-

A semantic analyzer checks the source program for semantic errors and collects the type information for code generation. It determines the meaning of the source string, for example:

  • matching of parentheses,
  • matching of if-else statements,
  • checking the scope of operations.

Type checking is an important part of the semantic analyzer. Normally, semantic information cannot be represented by the context-free grammars used in syntax analysis.

Intermediate Code Generation:-

The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands.

The output of the syntax analyzer is some representation of a parse tree. The intermediate code generation phase transforms this parse tree into an intermediate-language representation of the source program.
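
For example, in the common three-address style each instruction applies at most one operator, so a statement such as a := b + c * d might be broken down as follows (an illustrative sketch):

t1 := c * d
t2 := b + t1
a  := t2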

Code Optimization:-

This is an optional phase intended to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.

1. Local Optimization:-

There are local transformations that can be applied to a program to make an improvement. For example, the sequence

If A > B goto L2
Goto L3
L2:

can be replaced by the single statement

If A <= B goto L3

Another important local optimization is the elimination of common sub-expressions:

A := B + C + D
E := B + C + F

might be evaluated as

T1 := B + C
A  := T1 + D
E  := T1 + F

This takes advantage of the common sub-expression B + C.

2. Loop Optimization:-

Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
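
A minimal sketch of this idea (loop-invariant code motion) in C; the names s and process are hypothetical:

#include <string.h>

void demo(const char *s, void (*process)(char)) {
    /* Before: strlen(s) would be recomputed on every iteration:
       for (size_t i = 0; i < strlen(s); i++) process(s[i]);      */

    /* After moving the invariant computation out of the loop: */
    size_t n = strlen(s);              /* computed once, before the loop */
    for (size_t i = 0; i < n; i++)
        process(s[i]);
}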

Code Generation:-

The code generator produces the target machine code. For example, for a statement of the form id1 := id2 + id3 * 60, the intermediate code generator might produce

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1   := temp3

The code optimizer reduces this to

temp1 := id3 * 60.0
id1   := id2 + temp1

and the code generator then emits

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

1.10 TOKEN

The LA reads the source program one character at a time, carving the source program into a sequence of atomic units called 'tokens'. A token has two parts:

1. the type of the token, and
2. the value of the token.

Type: variable, operator, keyword, constant.
Value: name of the variable, current variable, or a pointer into the symbol table.

If the symbols are given in the standard format, the LA accepts them and produces tokens as output. Each token is a sub-string of the program that is to be treated as a single unit. Tokens are of two types:

1. specific strings, such as IF or a semicolon;
2. classes of strings, such as identifiers, labels, constants.
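
A token can thus be modelled as a (type, value) pair; a minimal sketch in C with illustrative names:

/* A token as a (type, value) pair. */
enum token_type { T_KEYWORD, T_ID, T_CONST, T_OP };

struct token {
    enum token_type type;  /* type of the token */
    const char *value;     /* lexeme text, or a pointer into the symbol table */
};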

2. LEXICAL ANALYSIS

2.1 OVERVIEW OF LEXICAL ANALYSIS

  • To identify the tokens, we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.

  • Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by token recognizers, which are designed using transition diagrams and finite automata.

2.2 ROLE OF LEXICAL ANALYZER

The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a 'get next token' command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation of the token it has found. The representation will be an integer code if the token is a simple construct such as a parenthesis, comma or colon.

The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.

2.3 LEXICAL ANALYSIS VS PARSING:

Lexical analysis:
  • A scanner simply turns an input string (say a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators etc.
  • The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens. From there, the "parser" proper turns those whole tokens into sentences of your grammar.

Parsing:
  • A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence).
  • A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is extract meaning from this structure (sometimes called contextual analysis).

2.4 TOKEN, LEXEME, PATTERN:

Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:

  1. identifiers, 2. keywords, 3. operators, 4. special symbols, 5. constants.

Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

Example: description of tokens

Token      Lexeme                  Pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >=, >     < or <= or = or <> or >= or >
id         pi                      letter followed by letters and digits
num        3.14                    any numeric constant
literal    "core"                  any characters between " and " except "

2.5 LEXICAL ERRORS:

Recognition of tokens:

We have learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

stmt --> if expr then stmt
      |  if expr then stmt else stmt
      |  ε

expr --> term relop term
      |  term

term --> id
      |  number

For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals", because it presents an interesting structure of lexemes. The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions:

digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:

ws --> (blank | tab | newline)+

Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.

Lexeme       Token Name    Attribute Value
Any ws       -             -
if           if            -
then         then          -
else         else          -
Any id       id            pointer to table entry
Any number   number        pointer to table entry
<            relop         LT
<=           relop         LE
=            relop         EQ
<>           relop         NE

2.8 TRANSITION DIAGRAM:

A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.

Some important conventions about transition diagrams are:

  1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.

  2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.

  3. One state is designated the start state, or initial state; it is indicated by an edge labeled "start" entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

As an intermediate step in the construction of an LA, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

The TD above (diagram not reproduced in this copy) is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.
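
As a minimal sketch (the names are illustrative, not the notes' code), the identifier diagram might be turned into the following C function:

#include <ctype.h>

/* State 0: expect a letter; state 1: consume letters and digits.
   Returns the length of the identifier lexeme, or 0 if none is found. */
int match_id(const char *input) {
    int i = 0;
    if (!isalpha((unsigned char)input[i]))    /* state 0: must start with a letter */
        return 0;
    i++;
    while (isalnum((unsigned char)input[i]))  /* state 1: letters or digits */
        i++;
    return i;    /* accepting state: the lexeme is input[0..i) */
}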

The tokens recognized by these diagrams are:

if    = if
then  = then
else  = else
relop = < | <= | = | <> | > | >=
id    = letter (letter | digit)*
num   = digit+

From state S0, on input 'a' there is only one path, going to S2. Similarly, from S0 there is only one path for the other input, going to S1.

2.12 NONDETERMINISTIC AUTOMATA

An NFA is a mathematical model that consists of:

  • a set of states S;
  • a set of input symbols Σ;
  • a transition function, move, that maps state-symbol pairs to sets of states;
  • a state s0 that is distinguished as the start (or initial) state;
  • a set of states F distinguished as accepting (or final) states;
  • possibly several transitions out of a state on a single symbol.

An NFA can be diagrammatically represented by a labeled directed graph, called a transition graph, in which the nodes are the states and the labeled edges represent the transition function.

This graph looks like a transition diagram, but the same character can label two or more transitions out of one state, and edges can be labeled by the special symbol ε as well as by input symbols.

The transition graph for an NFA that recognizes the language (a | b)*abb is shown (diagram not reproduced in this copy).
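
The equivalent DFA obtained from this NFA can be implemented as a table-driven recognizer; the following C sketch (state numbering is an illustrative assumption, not from the notes) accepts exactly the strings of a's and b's ending in abb:

#include <stdio.h>

/* Table-driven DFA for (a | b)*abb: states 0-3, state 3 accepting. */
static int matches_abb(const char *s) {
    static const int next[4][2] = {
        /*        a  b */
        /* 0 */ { 1, 0 },
        /* 1 */ { 1, 2 },
        /* 2 */ { 1, 3 },
        /* 3 */ { 1, 0 },
    };
    int state = 0;
    for (; *s; s++) {
        if (*s == 'a')      state = next[state][0];
        else if (*s == 'b') state = next[state][1];
        else return 0;               /* symbols outside {a, b} are rejected */
    }
    return state == 3;               /* accept iff the input ends in abb */
}

int main(void) {
    printf("%d\n", matches_abb("abaabb"));   /* prints 1 */
    printf("%d\n", matches_abb("abab"));     /* prints 0 */
    return 0;
}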

2.13 DEFINITION OF CFG

A context-free grammar involves four quantities: terminals, non-terminals (N-T), a start symbol, and productions.

  • Terminals are the basic symbols from which strings are formed.
  • Non-terminals are syntactic variables that denote sets of strings.
  • In a grammar, one N-T is distinguished as the start symbol, and the set of strings it denotes is the language defined by the grammar.
  • The productions of the grammar specify the manner in which the terminals and N-Ts can be combined to form strings. Each production consists of an N-T, followed by an arrow, followed by a string of non-terminals and terminals.
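
For example, the familiar grammar for arithmetic expressions (a standard textbook illustration) has non-terminals E, T and F, terminals id, +, *, ( and ), start symbol E, and the productions:

E --> E + T | T
T --> T * F | F
F --> ( E ) | id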

2.14 DEFINITION OF SYMBOL TABLE

A symbol table is an extensible array of records. Each identifier has an associated record containing the information collected about that identifier. Conceptually it provides:

FUNCTION identify(identifier name)
RETURNING a pointer to identifier information, containing:

  • the actual string,
  • a macro definition,
  • a keyword definition,
  • a list of type, variable and function definitions,
  • a list of structure and union name definitions,
  • a list of structure and union field selector definitions.
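
A minimal sketch of such a table in C (the record layout and names are illustrative, not the notes' design):

#include <stdlib.h>
#include <string.h>

struct symbol {
    char *name;    /* the actual string */
    int   kind;    /* e.g. keyword, variable, function */
};

static struct symbol *table = NULL;    /* extensible array of records */
static size_t count = 0, capacity = 0;

/* identify: return the record for name, inserting a new one if absent. */
struct symbol *identify(const char *name) {
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];          /* already known: reuse its record */
    if (count == capacity) {           /* grow the extensible array */
        capacity = capacity ? 2 * capacity : 16;
        table = realloc(table, capacity * sizeof *table);
    }
    table[count].name = strdup(name);  /* strdup is POSIX; copies the string */
    table[count].kind = 0;
    return &table[count++];
}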

2.15 Creating a lexical analyzer with Lex

2.16 Lex specifications:

A Lex program (the .l file) consists of three parts:

  1. declarations
  2. translation rules
  3. auxiliary procedures

  1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.

  2. The translation rules of a Lex program are statements of the form:

p1 {action 1}
p2 {action 2}
p3 {action 3}

where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are written in C.

  3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.
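
Putting the three parts together, a minimal .l file might look as follows (a sketch; the patterns and actions here are illustrative):

%{
/* declarations: C code copied verbatim into the generated scanner */
#include <stdio.h>
%}

digit   [0-9]
letter  [A-Za-z]

%%
if                           { printf("IF\n"); }
{letter}({letter}|{digit})*  { printf("ID(%s)\n", yytext); }
{digit}+                     { printf("NUM(%s)\n", yytext); }
[ \t\n]                      { /* strip white space: no token returned */ }
%%

/* auxiliary procedures */
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }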

2.17 INPUT BUFFERING

The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

Buffering techniques:

  1. Buffer pairs
  2. Sentinels

Often, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure (not reproduced in this copy) shows a buffer divided into two halves.
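
A rough sketch of the buffer-pair idea with sentinels in C (the sizes and names are illustrative assumptions, not the notes' code):

#include <stdio.h>

#define N 4096           /* size of each buffer half */
#define SENTINEL '\0'    /* marks the end of a half, and the end of input */

static char buf[2 * N + 2];   /* two halves, each followed by a sentinel slot */
static char *forward = buf;   /* scanning pointer */
static FILE *src;

/* Fill one half from the source and plant the sentinel after the data. */
static void load_half(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* Call once before scanning begins. */
static void init_buffer(FILE *f) {
    src = f;
    forward = buf;
    load_half(buf);
}

/* Next character; on a sentinel, reload the other half or report end of input. */
static int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (forward == buf + N + 1) {          /* reached the end of the first half */
        load_half(buf + N + 1);
        return next_char();
    }
    if (forward == buf + 2 * N + 2) {      /* reached the end of the second half */
        forward = buf;
        load_half(buf);
        return next_char();
    }
    return EOF;    /* sentinel inside a half: true end of input */
}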