Principles of Compilation: Lexical Analysis

Lexical analysis

Lexical analysis scans the source program and produces word symbols, transforming the source program into an intermediate form: a string of word symbols. Its input is the source program and its output is the word symbols. The lexical analyzer consists of the scanner and the routines that perform the analysis.

Word symbols are the basic grammatical symbols of a programming language. A word symbol, also called a token, is the smallest grammatical unit with independent meaning. Combining characters into tokens is much like assembling letters into words of an English sentence and determining their meaning; the task closely resembles spelling.

The word symbols of a programming language generally fall into:

  • Keywords: reserved words
  • Identifiers: variable names, procedure names, etc.
  • Constants: numbers, strings, Booleans, etc.
  • Operators: +, -, *, /, etc.
  • Delimiters: comma, semicolon, parentheses, etc.

Word symbols are usually represented as a pair: (word type, attribute value of the word symbol). The word type is the information needed by syntax analysis and is usually encoded as an integer. How to group and encode the word symbols of a language is a technical question; it depends mainly on convenience of processing.

  • Identifiers are usually grouped into a single type.
  • Constants are grouped by type.
  • Keywords can be treated as a single type or as one type per word; one type per word is more convenient in practice.
  • Operators can be one type each, but operators with common properties can share a type.
  • Delimiters are generally one type per symbol.

The word type is usually defined as an enumeration:

typedef enum {
    IF, ELSE, ID, NUM, ASSIGN, PLUS, SEMI   /* members shown for illustration */
} TokenType;

The attribute value of a word symbol records its characteristics: for an identifier it can be the entry address in the symbol table, for a constant its binary value, and so on.

If a type contains a single word symbol (e.g., a keyword or operator), the lexical analyzer outputs only its type code, with no attribute value.
If a type contains more than one word symbol, an attribute value must be output along with the type code to distinguish the individual words of that type. An identifier's attribute value is its own character string, or its entry address in the symbol table; a constant's attribute value is the constant's binary value.

The scanner must compute several attributes for each token, so it is useful to collect all the attributes into a single structured data type, called a token record.

typedef struct {
    TokenType tokenval; 
    char* stringval;
    int numval;
} TokenRecord;

Or, with a union:

typedef struct {
    TokenType tokenval;
    union {
        char* stringval;
        int numval;
    } attribute;
} TokenRecord;

In short, the scanner reads the source program and outputs a string of word symbols in the binary form (word type, attribute value of the word symbol).

[Example] Give the output word-symbol string for the program fragment if (a > 1) b = 100;
Assume that each keyword, operator, and delimiter forms its own type, that an identifier's attribute value is its own string, and that a constant's attribute value is its binary value.
(2, -)                keyword if
(29, -)               left parenthesis (
(10, 'a')             identifier a
(23, -)               greater-than sign >
(11, binary of 1)     constant 1
(30, -)               right parenthesis )
(10, 'b')             identifier b
(17, -)               assignment sign =
(11, binary of 100)   constant 100
(26, -)               semicolon ;

Another representation:

[Example] Consider the following C++ code fragment: while (i >= j) i--;
Assume that keywords, operators, and delimiters each form one type, that an identifier's attribute value is its entry address in the symbol table, and that a constant's attribute value is its binary value.
The lexical analyzer converts it into the following word-symbol sequence:
	( while , - )
    ( (     , - )
    ( id    , pointer to the symbol-table entry of i )
    ( >=    , - )
    ( id    , pointer to the symbol-table entry of j )
    ( )     , - )
    ( id    , pointer to the symbol-table entry of i )
    ( --    , - )
    ( ;     , - )

As an independent phase (a separate pass), lexical analysis translates the character sequence of the source program into a word-symbol sequence and stores it in a file; when the parser runs, it reads the word symbols from that file. The structure is simpler, clearer, and better organized, and it helps one concentrate on lexical analysis alone.
As a subroutine, the lexical analyzer is called whenever the parser needs a word symbol; each call recognizes one word symbol from the input string.

There are generally two ways to construct a lexical analyzer:

  • Manual approach: write the analyzer in a high-level language, following the state transition diagrams that recognize the language's words. For example: write the lexical analyzer directly in C.
  • Automatic approach: use LEX, an automatic generator of lexical analyzers, to produce the analyzer.

Manual design of a lexical analyzer

1. Input buffer and preprocessing

The first step of the lexical analyzer is to read the source text into an input buffer.
Preprocessing removes redundant blanks, tab characters, carriage returns, line feeds, other editing characters, and comments from the input source program, and stores the result in the scan buffer, making word-symbol recognition easier.

2. Scan buffer

  • To ensure that word symbols are not cut off at the buffer boundary, the scan buffer is generally split into two halves, as follows;
  • Each input operation refills one half, so in the worst case the longest word symbol the analyzer can recognize is half the length of the scan buffer. This is also known as a double buffer.
  • Two pointers are used:
    • Start pointer: points to the first character of a new word;
    • Search pointer: moves forward to find the end of the word.
  • If the search pointer reaches the edge of a half before finding the end of the word, the preprocessor is called to load the following characters into the other half.

Recognition of word symbols and lookahead: word symbols in the source program have no special terminator, and when the program is well formed there need not be a space between adjacent word symbols. Therefore, even after all the characters of a word symbol have been read, and especially when one word symbol is a prefix of another, the lexical analyzer cannot be sure the current word has ended; it can decide only after reading several characters ahead. Any extra characters read must then be returned to the input.
For example, recognizing ">" versus ">=" in C requires lookahead.
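The lookahead-and-retract idea can be sketched in C; the helper names (next_char, retract, scan_greater) are ours for illustration, not from the text.

```c
#include <stdio.h>

static const char *src;                   /* search pointer */
static int next_char(void) { return *src ? (unsigned char)*src++ : EOF; }
static void retract(void)  { --src; }     /* give the extra character back */

typedef enum { GT, GE } RelOp;

/* Decide between ">" and ">=" by reading one character ahead. */
RelOp scan_greater(const char *input) {
    src = input;
    next_char();                 /* consume '>' */
    int c = next_char();         /* lookahead: one more character */
    if (c == '=')
        return GE;
    if (c != EOF)
        retract();               /* not '=': return it to the input */
    return GT;
}
```

Calling scan_greater(">=") yields GE, while scan_greater(">a") consumes only the ">" and retracts the "a".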

Cases where lookahead is unnecessary:

  • The language stipulates that all keywords are reserved words, which users may not use as their own identifiers; keywords are treated as special identifiers via a reserved-word table;
  • If no definite operator or delimiter appears between a keyword, identifier, or constant (or label) and the next symbol, a blank character must separate them.

State transition diagram: a state transition diagram can be used to recognize (accept) a set of strings. In most programming languages, word symbols can be recognized with transition diagrams.

A state transition diagram is a finite directed graph in which nodes, drawn as circles, represent states, and states are connected by arcs. The label (a character or character class) on an arc gives the input characters that may appear in the state the arc leaves. A transition diagram has a finite number of states, one of which is the initial state and at least one of which is a final state.

A state transition diagram recognizes (accepts) a string α if there is a path from the initial state to some final state such that the labels on the arcs along the path, concatenated, equal α.


The design assumes:

  • Keywords: all keywords are reserved words, which users may not use as their own identifiers;
  • Keywords are treated as a special kind of identifier, with no separate transition diagrams. Instead, the keywords are listed in advance in a keyword table; when an identifier has been recognized, this table is searched to determine whether it is a keyword.
  • If no definite operator or delimiter separates a keyword, identifier, or constant from the next symbol, at least one blank character must do so.

When the state transition diagram is implemented, each state node corresponds to a small piece of code.

Branch nodes, looping state nodes, and terminal nodes each correspond to a fixed small-program pattern (the corresponding diagrams are omitted here).

Implementing the lexical analyzer in pseudocode

Global variables and procedures:
 ch          character variable holding the most recently read source character
 strToken    character array holding the string that forms the word symbol
 GetChar     procedure that reads the next character into ch
 GetBC       procedure that skips blanks until a non-blank character is read into ch
 Concat      procedure that appends the character in ch to strToken
 IsLetter, IsDigit   Boolean functions testing whether the character in ch is a letter or a digit
 Reserve     integer function that looks the string in strToken up in the reserved-word table; returns its code if it is a reserved word, otherwise 0
 Retract     procedure that moves the search pointer back one character
 InsertId    integer function that inserts the identifier in strToken into the symbol table and returns its symbol-table pointer
 InsertConst integer function that inserts the constant in strToken into the constant table and returns its constant-table pointer
int code, value;
strToken := "";	/* set strToken to the empty string */
GetChar(); GetBC();
if (IsLetter())
	while (IsLetter() or IsDigit())
		Concat(); GetChar();
	Retract(); code := Reserve();
	if (code = 0)
		value := InsertId(strToken);
		return ($ID, value);
	else
		return (code, -);
else if (IsDigit())
	while (IsDigit())
		Concat(); GetChar();
	Retract(); value := InsertConst(strToken);
	return ($INT, value);
else if (ch = '=') return ($ASSIGN, -);
else if (ch = '+') return ($PLUS, -);
else if (ch = '*')
	GetChar();
	if (ch = '*') return ($POWER, -);
	Retract(); return ($STAR, -);
else if (ch = ',') return ($COMMA, -);
else if (ch = '(') return ($LPAR, -);
else if (ch = ')') return ($RPAR, -);
else ProcError();		/* error handling */
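The pseudocode above translates almost directly into C. The sketch below uses our own naming, handles only the identifier/keyword and integer branches, and stubs out the InsertId/InsertConst table insertions.

```c
#include <ctype.h>
#include <string.h>

/* Token codes: order of T_IF, T_WHILE must match the reserved[] table. */
enum { T_ID, T_INT, T_IF, T_WHILE, T_ERROR };

static const char *reserved[] = { "if", "while" };  /* tiny reserved-word table */

/* Returns a token code and copies the lexeme into strToken. */
int get_token(const char *src, char *strToken) {
    int n = 0;
    while (isspace((unsigned char)*src)) src++;     /* GetBC */
    if (isalpha((unsigned char)*src)) {
        while (isalnum((unsigned char)*src))        /* Concat; GetChar */
            strToken[n++] = *src++;
        strToken[n] = '\0';
        for (int i = 0; i < (int)(sizeof reserved / sizeof reserved[0]); i++)
            if (strcmp(strToken, reserved[i]) == 0) /* Reserve */
                return T_IF + i;
        return T_ID;                                /* InsertId omitted */
    }
    if (isdigit((unsigned char)*src)) {
        while (isdigit((unsigned char)*src))
            strToken[n++] = *src++;
        strToken[n] = '\0';
        return T_INT;                               /* InsertConst omitted */
    }
    return T_ERROR;
}
```

The Retract of the pseudocode is implicit here: the loops stop without consuming the character that ends the word.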
curState = initial state;
while (stateTrans[curState][ch] is defined) {
   // a successor state exists: append ch to the word,
   // transition to the next state, and read the next character
   curState = stateTrans[curState][ch];
   if (curState is a final state) then return the word in strToken;
   GetChar();
}
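As a concrete instance of this table-driven loop, here is a minimal C recognizer for identifiers letter(letter|digit)* with an explicit stateTrans table. The encoding (lowercase letters only, -1 meaning "no transition") is our own simplification.

```c
/* States: 0 = initial, 1 = inside identifier. Columns: 0 = letter, 1 = digit. */
static const int stateTrans[2][2] = {
    /* letter  digit */
    {  1,     -1 },   /* state 0: must start with a letter */
    {  1,      1 },   /* state 1: letters and digits may follow */
};

int accepts_identifier(const char *s) {
    int cur = 0;
    for (; *s; s++) {
        int col = (*s >= 'a' && *s <= 'z') ? 0 :
                  (*s >= '0' && *s <= '9') ? 1 : -1;
        if (col < 0 || stateTrans[cur][col] < 0)
            return 0;                 /* no defined transition */
        cur = stateTrans[cur][col];
    }
    return cur == 1;                  /* state 1 is the final state */
}
```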

Regular expressions

Regular expression: a notation for defining the word symbols of a language.

[Example] The regular expression for identifiers:
 letter (letter | digit)*

Recursive definition of regular expressions and regular sets:

  • ε and Φ are regular expressions on Σ; the regular sets they denote are {ε} and Φ, respectively;
  • For any a ∈ Σ, a is a regular expression on Σ, and its regular set is {a};
  • Suppose e1 and e2 are regular expressions on Σ denoting the regular sets L(e1) and L(e2); then
    e1|e2 is a regular expression, denoting the regular set L(e1) ∪ L(e2) (union);
    e1e2 is a regular expression, denoting the regular set L(e1)L(e2) (concatenation);
    (e1)* is a regular expression, denoting the regular set (L(e1))* (closure).
    Precedence, from highest to lowest: closure, concatenation, union.

A regular set can be represented by a regular expression; a regular expression is a way of denoting a regular set. A set of strings is a regular set if and only if it can be represented by a regular expression.

A regular expression describes the format of strings. A regular expression r is completely defined by the set of strings it matches; this set is the language generated by the regular expression, written L(r), and each regular expression can be viewed as a matching pattern.

[Example] Let Σ = {a,b,c}; then aa*bb*cc* is a regular expression on Σ, which denotes the regular set
L = {abc,aabc,abbc,abcc,aaabc,...}
  = {a^m b^n c^l | m,n,l ≥ 1}
If the alphabet of the programming language is the set of keyboard characters, the language's word symbols can be defined as follows:
 Keywords: if | else | while | do
 Identifiers: l(l|d)*
 Integer constants: dd*
 Relational operators
	where l stands for any English letter a–z
	and d stands for any digit 0–9

Equivalence of regular expressions

If two regular expressions denote the same regular set, they are considered equivalent. Two equivalent regular expressions R1 and R2 are written R1 = R2.

	(a|b)* = (a*|b*)*
	b(ab)* = (ba)* b

Extensions of regular expressions

One or more repetitions of the regular expression r is written r+.
"." matches any single character.
Square brackets with a hyphen denote a character range, e.g. [0-9], [a-z], [a-zA-Z]. The bracket notation can also list individual alternatives: for example, a|b|c can be written [abc].

"~" denotes any character not in the given set; e.g. ~a denotes any character of the alphabet other than a.
The optional subexpression r? indicates that a string matched by r may appear 0 or 1 times. For example, natural = [0-9]+, signedNatural = (+|-)? natural.
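These extensions map directly onto POSIX extended regular expressions. A small C check of the text's signedNatural = (+|-)? natural using <regex.h> might look like this (the function name and anchoring are our choices):

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s matches (+|-)? [0-9]+ as a whole string, else 0. */
int matches_signed_natural(const char *s) {
    regex_t re;
    /* ^...$ anchors the whole string; [+-]? is the optional sign */
    if (regcomp(&re, "^[+-]?[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```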

Regular grammars and regular expressions

Regular grammars and regular expressions are both tools for describing regular sets.

For every regular expression there is a regular grammar that defines the same language; conversely, for every regular grammar there is a regular expression that generates the same language.

Converting a regular grammar to a regular expression

Each nonterminal of the regular grammar is written as a regular equation, giving a system of simultaneous equations, which is solved with the rules:

If x = αx|β (i.e., x = αx + β), the solution is x = α*β.

If x = xα|β (i.e., x = xα + β), the solution is x = βα*.

Together with the distributive, commutative, and associative laws of regular expressions, these rules yield the solution of the system for the grammar's start symbol.
That solution is a regular expression for the language of the start symbol S. The two rules above are the important ones: they eliminate the recursion in x in favor of a closure of α.

[Example 1]
Given the regular grammar G: Z → 0A, A → 0A | 0B, B → 1A | ε, give a regular expression for the language generated by the grammar.

First write the corresponding regular equations (+ in place of |):
Z = 0A .........(1)
A = 0A+0B .........(2)
B = 1A+ε .........(3)

Substituting (3) for B in (2) gives
A = 0A+01A+0 .........(4)
Applying the distributive law to (4): A = (0+01)A+0 .........(5)
Applying the solution rule to (5): A = (0+01)*0 .........(6)
Substituting (6) into (1): Z = 0(0+01)*0
That is, the regular expression for the language generated by the regular grammar G[Z] is 0(0|01)*0

[example 2]

Given the regular grammar G: A → aB | bB, B → aC | a | b, C → aB,
give a regular expression for the language generated by the grammar.
Following the same steps as above,
first write the corresponding regular equations (+ in place of |):
A = aB+bB .........(1)
B = aC+a+b .........(2)
C = aB .........(3)
Substituting (3) into (2)
B = aaB+a+b .........(4)
Applying the solution rule to (4): B = (aa)*(a+b) .........(5)
Substituting (5) into (1): A = (a+b)(aa)*(a+b)
That is, the regular expression for the language generated by the regular grammar G[A] is (a|b)(aa)*(a|b)

[example 3]

Given the regular grammar G:
	Z → U0 | V1
	U → Z1 | 1
	V → Z0 | 0
give a regular expression for the language generated by the grammar.
First write the corresponding regular equations (+ in place of |):
			Z = U0+V1			.........(1)
			U = Z1+1 			.........(2)
			V = Z0+0			.........(3)
Substituting (2) and (3) into (1):
			Z = Z10+10+Z01+01	.........(4)
			Z = Z(10+01)+10+01	.........(4')
Applying the solution rule to (4'): Z = (10+01)(10+01)*
That is, the regular expression for the language generated by the regular grammar G[Z] is (10|01)(10|01)*

[example 4]

A regular grammar describing the word symbol "identifier" is known:
	<identifier> → l | <identifier> l | <identifier> d
First write the corresponding regular equations (+ in place of |), writing S for <identifier>:
	S = l+Sl+Sd
	S = l+S(l+d)
Applying the solution rule:
	S = l(l+d)*
So the regular expression for the grammar is l(l|d)*.

Converting a regular expression to a regular grammar

A regular expression on the alphabet Σ is converted to a regular grammar G = (V_N, V_T, P, S) as follows:

  1. Let V_T = Σ;
  2. For the given regular expression R, choose a nonterminal Z, generate the rule Z → R, and let S = Z;
  3. If x and y are regular expressions, a rule of the form A → xy is transformed into A → xB and B → y, where B is a new nonterminal;
  4. In the transformed grammar, rules of the form A → x*y are further transformed into A → xA | y;
  5. Rules (3) and (4) are applied repeatedly until every rule contains at most one terminal.

[example 1]

Convert R = (a|b)(aa)*(a|b) to a corresponding regular grammar.
Let A be the start symbol of the grammar; we begin with
A → (a|b)(aa)*(a|b)
By rule (3) this becomes
A → (a|b)B
B → (aa)*(a|b)
By rule (4) (unfolding the closure) this becomes
A → aB|bB
B → aaB|a|b (aaB still contains two terminals, so it must be simplified until each rule has only one)
By rule (3) this becomes
A → aB|bB
B → aC|a|b
C → aB

[example 2]

Convert the regular expression R = l(l|d)* describing identifiers into a corresponding regular grammar.
Let S be the start symbol of the grammar; we begin with
S → l(l|d)*
By rule (3) this becomes
S → lA
A → (l|d)*
By rule (4) this becomes
A → (l|d)A|ε
which further transforms to
A → lA|dA|ε
and finally, eliminating ε, to
S → lA|l
A → lA|dA|l|d

Finite automata

A finite automaton is an abstract mathematical model of a system with discrete inputs and outputs. Finite automata come in two kinds, deterministic and nondeterministic, and both kinds recognize exactly the regular sets.

Deterministic finite automata (DFA)

A deterministic finite automaton DFA M is a quintuple M = (Q, Σ, f, S, Z), where:
Q: a finite set of states;
Σ: a finite alphabet, each element of which is called an input character;
f: the state transition function, a single-valued mapping from Q × Σ to Q. f(qi, a) = qj (qi, qj ∈ Q, a ∈ Σ) means that when the current state is qi and the input character is a, the automaton moves to the next state qj; qj is called a successor of qi;
S ∈ Q: the unique initial state;
Z ⊆ Q: the set of final states (possibly empty).

[Example] Let DFA M = ({q0,q1,q2}, {a,b}, f, q0, {q2}), where:
f(q0,a)= q1
f(q1,b)= q1
f(q0,b)= q2
f(q2,a)= q2
f(q1,a)= q1
f(q2,b)= q1

State transition matrix and state transition diagram

A DFA can be represented by a matrix whose rows correspond to states and whose columns correspond to input characters; the matrix element gives the value of f(s, a). This matrix is called the state transition matrix, or transition table.
A DFA can also be represented by a (deterministic) state transition diagram. If DFA M has m states and n input characters, the diagram has m nodes; each node has at most n outgoing arcs to other states, and the arcs leaving the same node are labeled with distinct characters from Σ. The diagram contains exactly one initial node and some number (possibly 0) of final nodes.

Strings recognized by DFA M: for any word β in Σ*, if there is a path from the initial state to some final state such that the labels of the arcs along the path, concatenated, equal β, then β is recognized by DFA M. If the initial state of M is also a final state, then ε is recognized by M.

The set of all strings recognized by DFA M is the language it accepts, written L(M).

Conclusion: V ⊆ Σ* is regular if and only if there is a finite automaton M over Σ with V = L(M).

Algorithm for simulating a DFA

Input: an input string x terminated by eof; a DFA D with start state s0 and accepting-state set F.
Output: "yes" if D accepts x, otherwise "no".
Method: apply the following algorithm to the input string x. The function move(s, c) gives the state to which D transfers from state s on input character c. The function getch() returns the next character of x.

s = s0;
while ((c = getch()) != eof)
	s = move(s, c);
if (s is in F) return "yes";
else return "no";
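Instantiated for the example DFA M = ({q0,q1,q2}, {a,b}, f, q0, {q2}) given earlier, the algorithm can be written in C as follows; encoding f as a transition matrix is our choice.

```c
/* f encoded as a transition matrix: rows = states q0..q2, columns = a,b. */
static int move(int s, char c) {
    static const int f[3][2] = { {1, 2},   /* f(q0,a)=q1, f(q0,b)=q2 */
                                 {1, 1},   /* f(q1,a)=q1, f(q1,b)=q1 */
                                 {2, 1} }; /* f(q2,a)=q2, f(q2,b)=q1 */
    return f[s][c - 'a'];
}

/* Run the DFA over the whole string; accept iff it ends in q2. */
int dfa_accepts(const char *x) {
    int s = 0;                      /* start state q0 */
    for (; *x; x++)
        s = move(s, *x);
    return s == 2;                  /* F = {q2} */
}
```

For instance, on input "baa" the automaton moves q0 → q2 → q2 → q2 and accepts.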

Nondeterministic finite automata (NFA)

A nondeterministic finite automaton M is a quintuple M = (Q, Σ, f, S, Z), where:

Q: a finite set of states;
Σ: a finite alphabet;
f: the state transition function, a (multi-valued) mapping from Q × Σ* to subsets of Q, i.e.
f: Q × Σ* → 2^Q (the power set of Q)
S ⊆ Q: a non-empty set of initial states;
Z ⊆ Q: the set of final states (possibly empty).

An NFA can also be represented by a matrix whose rows are states and whose columns are input characters; a matrix element is the value (a state set) of f(s, a). An NFA can likewise be represented by a state transition diagram.

Differences between NFA and DFA

NFA can have multiple initial states;
The label on an arc can be a word in Σ* (even a regular expression), not necessarily a single character;
The same word may appear on multiple arcs in the same state;
DFA is a special case of NFA.

Strings recognized by NFA M: for any word β in Σ*, if there is a path from some initial state to some final state such that the labels of the arcs along the path (ignoring ε arcs), concatenated, equal β, then β is recognized by NFA M. If some state of M is both initial and final, then ε is accepted by M.

The set of all strings NFA M can recognize is the language it accepts, written L(M). For instance, in the example above the language recognized by NFA M' is L(M') = b*(b|ab)(bb)*.

By the definition of an NFA, the same string β may be recognized along several different paths. A DFA is a special case of NFA. A lexical analyzer can be constructed with finite automata as follows:

  1. Construct an NFA from the description of the language's words;
  2. Convert the NFA to a DFA;
  3. Minimize the DFA to one with the fewest states;
  4. For each DFA state, construct a program segment, yielding a lexical analyzer that recognizes the words.

Converting an NFA to a DFA

Determinizing an NFA means constructing, for any given NFA, a DFA that accepts the same language.

For an NFA, the state transition function f is multi-valued, so there are states q for which

f(q,a) = {q1, q2, ..., qn}

is a subset of the NFA's state set. To convert the NFA to a DFA, the state set {q1, q2, ..., qn} is treated as a single state A. That is, the basic idea of constructing a DFA from an NFA is that each DFA state represents a subset of the NFA's state set: the DFA uses its state to record the set of all states the NFA may reach after reading the input symbols. This construction is called the subset method.

ε-closure of a state set I

Let I be a subset of the states of NFA N. ε-CLOSURE(I) is defined as follows:

If s ∈ I, then s ∈ ε-CLOSURE(I);
If s ∈ ε-CLOSURE(I), then any state s' reachable from s via an ε arc also belongs to ε-CLOSURE(I).

Computing ε-CLOSURE(I) is a search for the set of reachable nodes in the transition graph, starting from the given node set:

Push all the states in I onto the stack;
initialize ε-CLOSURE(I) to I;
while the stack is not empty do
	pop the top element t off the stack;
	for each state u such that there is an ε-labeled edge from t to u do
		if u is not in ε-CLOSURE(I) then
			add u to ε-CLOSURE(I);
			push u onto the stack
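The worklist algorithm above can be sketched in C with states numbered 0..n-1 and state sets as bitmasks; the eps[] encoding of single-step ε arcs is our assumption, not from the text.

```c
/* eps[s] = bitmask of states reachable from state s by ONE ε arc.
 * I = bitmask of the input state set. Returns ε-CLOSURE(I) as a bitmask. */
unsigned eps_closure(const unsigned *eps, unsigned I) {
    unsigned closure = I;                 /* initialize ε-CLOSURE(I) to I */
    unsigned stack = I;                   /* pending states, also as a bitmask */
    while (stack) {
        int t = 0;                        /* pop one state t off the stack */
        while (!(stack & (1u << t))) t++;
        stack &= ~(1u << t);
        unsigned add = eps[t] & ~closure; /* ε-successors of t not yet seen */
        closure |= add;
        stack |= add;                     /* push the newly found states */
    }
    return closure;
}
```

With ε arcs 0 → 1 and 1 → 2, for example, the closure of {0} is {0,1,2}.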

Constructing an equivalent DFA M = (Q', Σ, f', S', Z') from NFA N = (Q, Σ, f, S, Z)

First, the set of states reachable from the initial states S through ε arcs alone is taken as the initial state S' of M. Then, for each input symbol a ∈ Σ, the set of states reachable from S' by a transition on a (including all states reachable by ε transitions before or after reading a) is taken as a new state of M, and so on, until no new states appear.

Initialize the state sets Q' and Z' of DFA M to empty.
S' = ε-CLOSURE(S) is the initial state of M; add S' to Q' as an unmarked state.
Initially, ε-CLOSURE(S) is the only state in Q' and it is unmarked;
while there is an unmarked state T in Q' do
	mark T;
	for each input symbol a do
		U = ε-CLOSURE( f(T,a) );
		if U is not in Q' then
			add U to Q' as an unmarked state;
		f'(T,a) = U;
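A compact, self-contained C sketch of this construction, for a tiny illustrative NFA of our own (0 -a-> 1, 0 -ε-> 1, 1 -b-> 2; final state 2), with state sets represented as bitmasks:

```c
#define NSTATE 3
#define NSYM   2   /* alphabet {a, b} */

static const unsigned eps[NSTATE]     = { 1u << 1, 0, 0 };  /* ε arcs */
static const unsigned f[NSTATE][NSYM] = { { 1u << 1, 0 },   /* on a, b */
                                          { 0, 1u << 2 },
                                          { 0, 0 } };

static unsigned closure(unsigned I) {          /* ε-CLOSURE(I) */
    unsigned c = I, stack = I;
    while (stack) {
        int t = 0;
        while (!(stack & (1u << t))) t++;
        stack &= ~(1u << t);
        unsigned add = eps[t] & ~c;
        c |= add; stack |= add;
    }
    return c;
}

/* Subset construction: Q collects the subset-states in discovery order
 * (Q[0..marked-1] are marked), trans is f'. Returns the number of states. */
int build_dfa(unsigned Q[], unsigned trans[][NSYM], int max) {
    int n = 0, marked = 0;
    Q[n++] = closure(1u << 0);                 /* S' = ε-CLOSURE({S}) */
    while (marked < n) {                       /* an unmarked state T exists */
        unsigned T = Q[marked];
        for (int a = 0; a < NSYM; a++) {
            unsigned U = 0;
            for (int s = 0; s < NSTATE; s++)
                if (T & (1u << s)) U |= f[s][a];
            U = closure(U);                    /* U = ε-CLOSURE(f(T,a)) */
            int j = 0;
            while (j < n && Q[j] != U) j++;
            if (j == n && n < max) Q[n++] = U; /* add U unmarked to Q' */
            trans[marked][a] = U;              /* f'(T,a) = U */
        }
        marked++;                              /* mark T */
    }
    return n;
}
```

For this NFA the construction yields four subset-states, the empty set serving as the dead state.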

Determinization of an NFA

Converting between finite automata and grammars

Converting a right-linear regular grammar to a finite automaton

Converting a left-linear regular grammar to a finite automaton

Converting a finite automaton to a regular grammar

Transformation between finite automata and regular expressions

Constructing an NFA from a regular expression

Input: a regular expression R on the alphabet Σ

Output: an NFA N recognizing the language L(R)

Throughout the splitting process all new nodes receive distinct names; X and Y are the only initial and final nodes of the whole graph.

Converting a finite automaton to a regular expression

In the inverse process, a new initial state X is added and connected to all the original initial states by ε arcs, and a new final state Y is added, with all the original final states connected to it by ε arcs. This yields a new NFA M' with exactly one initial state X and one final state Y; the arcs between X and Y are then merged step by step.

Practical design

The task of lexical analysis is to scan the source file and output the word-symbol string in the binary form (token type, attribute value of the word symbol).

In this lexical analyzer, regular expressions are used to scan the whole file and identify the types of word symbols. Word symbols are divided by type, for example:

<VOID : "void">
| <CHAR : "char">
| <SHORT : "short">
| <INT : "int">
| <LONG : "long">
| <STRUCT : "struct">
| <UNION : "union">
| <ENUM : "enum">
| <STATIC : "static">
| <EXTERN : "extern">
| <CONST : "const">
| <SIGNED : "signed">
| <UNSIGNED : "unsigned">
| <IF : "if">
| <ELSE : "else">
| <SWITCH : "switch">
| <CASE : "case">
| <DEFAULT_ : "default">
| <WHILE : "while">
| <DO : "do">
| <FOR : "for">
| <RETURN : "return">
| <BREAK : "break">
| <CONTINUE : "continue">
| <GOTO : "goto">
| <TYPEDEF : "typedef">
| <IMPORT : "import">
| <SIZEOF : "sizeof">

The tokens above describe the keyword rules.

<IDENTIFIER: ["a"-"z", "A"-"Z", "_"](["a"-"z", "A"-"Z", "_", "0"-"9"])*>

The token above describes the identifier rule.

The regular expressions are matched with the longest-prefix rule: if the input contains voidFunction, it is matched as the single identifier voidFunction, not as the keyword void followed by Function.

In the same way, numeric rules can be described (matching decimal, hexadecimal, and octal values):

<INTEGER: ["1"-"9"] (["0"-"9"])* ("U")? ("L")?
| "0" ["x", "X"] (["0"-"9", "a"-"f", "A"-"F"])+ ("U")? ("L")?
| "0" (["0"-"7"])* ("U")? ("L")?>

Whitespace and comments are skipped, so whitespace is not described with TOKEN but with the special SPECIAL_TOKEN:

SPECIAL_TOKEN: { <SPACES: ([" ", "\t", "\n", "\r", "\f"])+> }

"," "," \ t "," n "," r "," f "] means any one of" "(space)," \ t "(TAB)," \ n "(line feed)," \ r "(enter)," \ f "(page feed), followed by" + "means one or more of the above five characters.

Describing line comments

<LINE_COMMENT: "//" (~["\n", "\r"])* ("\n" | "\r\n" | "\r")?>

The pattern described above is a string that starts with "//", is followed by characters other than line breaks, and ends with a line break. In short, it is a string from "//" to the end of the line. The last line of a file may lack a line break, so the trailing line break is optional.

Describing block comments
First, note that the following pattern does not scan block comments correctly:

SKIP { <"/*" (~[])* "*/"> }

By the longest-match principle, ordinary code may also be swallowed into the comment, as in:

/* This is the only line that should have been a comment */
main(int argc, char **argv)
{
    printf("Hello, World!\n");
    return 0;
}/* End with status 0 */

Here the subpattern (~[])* matches everything up to the last comment terminator, so the whole fragment is consumed as one comment.
To solve this problem, the rules must be modified to use state transitions.


In the example above, IN_BLOCK_COMMENT is the state in which the scan takes place. By using states, only part of the rules apply at a time.
Let us explain how states are used, looking first at the first rule of the example above.


Writing { pattern } : STATE_NAME in a rule definition means that after matching the pattern, the scanner transitions (migrates) to the corresponding state; the example above transitions to a state called IN_BLOCK_COMMENT.
Once the scanner has moved to a state, only the lexical analysis rules specific to that state apply. In the example, all rules other than those dedicated to IN_BLOCK_COMMENT become invalid. To define rules for a particular state, prefix commands such as TOKEN with <STATE_NAME>, as follows.

<STATE_NAME> TOKEN: { ... }
<STATE_NAME> SKIP: { ... }
<STATE_NAME> SPECIAL_TOKEN: { ... }

The DEFAULT state is the scanner's state at the start of lexical analysis. Rules that do not specify a state belong to the DEFAULT state; that is, the reserved-word rules, the identifier rule, and the line-comment rule defined so far are all DEFAULT-state rules. <"*/"> : DEFAULT means that, on matching the pattern "*/", the scanner returns to the original state.

The MORE command means "the token is not finished by matching this rule alone": input that enters this state must eventually be completed as a full /* ... */ comment, and otherwise an error is reported.
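Outside of JavaCC, the same two-state idea (DEFAULT vs. IN_BLOCK_COMMENT) can be hand-coded in C. This sketch uses shortest match, so the comment ends at the first terminator rather than the last.

```c
/* Returns a pointer just past the comment terminator, the unchanged
 * pointer if p does not start a comment, or NULL if the comment is
 * unterminated at end of input. */
const char *skip_block_comment(const char *p) {
    if (p[0] != '/' || p[1] != '*')
        return p;                    /* still in the DEFAULT state */
    p += 2;                          /* enter IN_BLOCK_COMMENT */
    while (*p) {
        if (p[0] == '*' && p[1] == '/')
            return p + 2;            /* terminator found: back to DEFAULT */
        p++;
    }
    return 0;                        /* hit EOF inside the comment */
}
```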

Scanning string literals

MORE: { <"\""> : IN_STRING }                      // rule 1
<IN_STRING> MORE: {
      <(~["\"", "\\", "\n", "\r"])+>              // rule 2
    | <"\\" (["0"-"7"]){3}>                       // rule 3
    | <"\\" ~[]>                                  // rule 4
}
<IN_STRING> TOKEN: { <STRING: "\""> : DEFAULT }   // rule 5

First, with the aid of state transitions, one token can be described by several rules: rule 1 scans the opening character """ and moves to the IN_STRING state, in which only rules 2, 3, and 4 are valid. Second, all rules except the last, rule 5, use the MORE command, so the token is accumulated across multiple rules: rule 2 matches runs of ordinary characters, rules 3 and 4 match escape sequences, and rule 5 matches the closing """ and returns to DEFAULT.
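For comparison, rules 1-5 can be collapsed into one hand-written C function. Accepting 1-3 octal digits (rather than exactly 3, as rule 3 requires) is a simplification of ours.

```c
/* Scans a quoted string literal at p; returns the number of characters
 * consumed (including both quotes), or -1 on error. */
int scan_string_literal(const char *p) {
    const char *start = p;
    if (*p++ != '"') return -1;                 /* rule 1: opening quote */
    while (*p && *p != '"') {
        if (*p == '\\') {                       /* rules 3 and 4: escapes */
            p++;
            if (*p >= '0' && *p <= '7') {       /* up to 3 octal digits */
                int i = 0;
                while (i < 3 && *p >= '0' && *p <= '7') { p++; i++; }
            } else if (*p) {
                p++;                            /* any single escaped char */
            }
        } else if (*p == '\n' || *p == '\r') {
            return -1;                          /* raw newline inside literal */
        } else {
            p++;                                /* rule 2: ordinary characters */
        }
    }
    if (*p != '"') return -1;                   /* rule 5: closing quote */
    return (int)(p + 1 - start);
}
```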


A lexical analyzer based on automata

Regular expressions are used for the lexical analysis. The vocabulary of the target language is as follows.

Input: the given source string. Output: a sequence of two-tuples (syn, token or sum), where syn is the word-type code, token stores the word's own string, and sum is an integer constant.
The vocabulary of the language is:
1. Keywords
  if then else
  while do
  repeat until
  for from to step
  switch of case default
  integer real char bool
  and or not mod 
  read write
  All keywords are lowercase.
2. Special symbols
 Operators include: =, +, -, *, /, <, <=, >, >=, !=
 Separators include: , ; : { } [ ] ( )
3. Other tokens: ID and NUM
 Other tokens are defined by the following regular expressions:
ID→letter(letter | digit)*
NUM→digit digit*
letter→a | ... | z | A | ... | Z
 4. Whitespace consists of spaces, tabs, and line breaks
   Whitespace separates IDs, NUMs, special symbols and keywords, and is ignored in the lexical analysis phase.
The textbook does not assign category codes to the word symbols, so they are numbered here starting from 1 in the order listed above.
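A quick way to sanity-check the word list above is a single combined regular expression. One subtlety: java.util.regex alternation is leftmost-first rather than longest-match, so two-character operators such as <= must be listed before their one-character prefixes. RegexTokenDemo is a hypothetical helper used only for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenDemo {
    // Alternation order matters: a regex like "<|<=" would always match "<" and never "<=",
    // so two-character operators come first.
    static final String TOKEN_REGEX =
            "[a-zA-Z][a-zA-Z0-9]*"    // ID (keywords are IDs looked up in a table afterwards)
          + "|[0-9]+"                 // NUM
          + "|<=|>=|!=|[=+\\-*/<>]"   // operators, two-character ones first
          + "|[,;:{}\\[\\]()]";       // separators

    /** Split one preprocessed line into space-separated tokens. */
    public static String tokenize(String line) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile(TOKEN_REGEX).matcher(line);
        while (m.find()) out.append(m.group()).append(' ');
        return out.toString().trim();
    }
}
```

For example, tokenize("while(i<=100)") keeps <= as one operator token instead of splitting it into < and =.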

Lexical analysis Token description Token

/**
 * Lexical analysis - symbolic representation of words.
 * Abstract class, parent class of word categories, maintaining the mapping between symbols and category codes.
 */
public abstract class Token {
    //End of file
    public static final Token EOF = new Token(-1) {};
    //Indicates the end of each line, i.e. the line break \n
    public static final String EOL = "\\n";

    /**
     * Regular expressions matching the different words.
     * Because each fragment ends with a '|', they must be concatenated in order.
     */
    public static final String KEYWORD_REGEX = "main|if|then|else|while|do|repeat|until|for|" +
            "from|to|step|switch|of|case|default|return|integer|real|char|bool|and|or|not|mod|read|write|";//Match keywords
    //Match operators; two-character operators come first because regex alternation is leftmost-first, not longest-match
    public static final String OPERATOR_REGEX = "<=|>=|!=|=|\\+|\\-|\\*|\\/|<|>|";
    public static final String SEPARATOR_REGEX = "[,;:{}\\[\\]\\(\\)]|";//Match separators
    public static final String ID_REGEX = "[a-zA-Z][a-zA-Z0-9]*|";//Match identifiers
    public static final String NUM_REGEX = "[0-9]+";//Match integer constants

    /**
     * Mapping between word symbols and category codes.
     * The symbols are stored in the config file in order; the file is read once into this static map.
     */
    public static Map<String, Integer> tokenTypeMap;
    //Location of the mapping configuration file
    public static String mapConfigPath = new File("").getAbsolutePath() + "/tokenTypeMap.config";

    static {
        tokenTypeMap = new LinkedHashMap<>();
        try {
            Scanner in = new Scanner(new BufferedInputStream(new FileInputStream(mapConfigPath)));
            int ite = 1;
            String res = in.hasNextLine() ? in.nextLine() : null;
            while (res != null) {
                String trimmed = res.trim();
                if (!(trimmed.equals("") || trimmed.charAt(0) == '#')) {
                    //A config line may list several symbols separated by spaces
                    for (String symbol : trimmed.split("\\s+")) {
                        tokenTypeMap.put(symbol, ite++);
                    }
                }
                res = in.hasNextLine() ? in.nextLine() : null;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private int lineNumber;//The line number of the word symbol

    public Token(int line) {
        this.lineNumber = line;
    }
}

The configuration file read above lists the fixed symbols such as keywords and operators; they are read into memory and encoded in sequence:


   if then else
   while do
   repeat until
   for from to step
   switch of case default
   integer real char bool
   and or not mod
   read write
   = + - * / < <= > >= !=
   , ; : { } [ ] ( )
   #Identifier and constant value

Lexical analysis output TokenRecord

/**
 * Output of lexical analysis:
 * (word symbol category code, word symbol attribute value)
 */
public class TokenRecord extends Token {
    public TokenRecord(int line) {
        super(line);
    }

    private int flagCode;//Category code
    private String stringValue;//String value
    private String numValue;//Numeric value

    public int getFlagCode() {
        return flagCode;
    }

    public String getNumValue() {
        return numValue;
    }

    public String getStringValue() {
        return stringValue;
    }

    public void setFlagCode(int flagCode) {
        this.flagCode = flagCode;
    }

    public void setNumValue(String numValue) {
        this.numValue = numValue;
    }

    public void setStringValue(String stringValue) {
        this.stringValue = stringValue;
    }
}
Lexical analysis compilation exception CompileException

/**
 * Compilation errors found during lexical analysis
 * are thrown as this exception.
 */
public class CompileException extends Exception {

    public int errorLine;//Line number of the error
    public String errorReason;//Reason for the error

    public CompileException(int errorLine, String errorReason) {
        this.errorLine = errorLine;
        this.errorReason = errorReason;
    }

    public String toString() {
        return "Line " + errorLine + ": " + errorReason;
    }
}

/**
 * Preprocessor: deletes blank lines, spaces and comments from the program.
 */
public class Preprocessor {
    /**
     * Read the specified program file and preprocess it.
     * @param file source file
     * @return map of line number -> preprocessed line content
     */
    public static LinkedHashMap<Integer, String> preprocess(File file) throws CompileException {
        LinkedHashMap<Integer, String> res = new LinkedHashMap<>();
        boolean blockStatus = false;//Inside a block comment or not
        int lineNumber = 1;
        try {
            Scanner scanner = new Scanner(new FileReader(file));
            String lineInfo = scanner.hasNextLine() ? scanner.nextLine() : null;
            //Process each line
            while (lineInfo != null) {
                StringBuilder lineProcessValue = new StringBuilder();
                for (int i = 0; i < lineInfo.length(); i++) {
                    //Inside a block comment: skip characters until the closing */
                    if (blockStatus) {
                        if (i + 1 < lineInfo.length() && lineInfo.charAt(i) == '*' && lineInfo.charAt(i + 1) == '/') {
                            blockStatus = false;
                            i++;
                        }
                        continue;
                    }
                    if (lineInfo.charAt(i) == ' ' || lineInfo.charAt(i) == '\n') continue;//Skip spaces and newlines
                    if (i + 1 < lineInfo.length() && lineInfo.charAt(i) == '/' && lineInfo.charAt(i + 1) == '/')
                        break;//Line comment: discard the rest of the line
                    if (i + 1 < lineInfo.length() && lineInfo.charAt(i) == '/' && lineInfo.charAt(i + 1) == '*') {//Block comment begins
                        blockStatus = true;
                        i++;
                        continue;
                    }
                    lineProcessValue.append(lineInfo.charAt(i));
                }
                if (!lineProcessValue.toString().equals("")) res.put(lineNumber, lineProcessValue.toString());
                lineNumber++;
                lineInfo = scanner.hasNextLine() ? scanner.nextLine() : null;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        //Still inside a block comment at end of file: throw an exception
        //TODO: string literals "..." and character literals '.' are not handled; a // or /* inside
        //      a string literal would change the rule, but that case is not checked here
        if (blockStatus) throw new CompileException(lineNumber, "Unclosed block comment");

        return res;
    }
}
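The TODO above can be made concrete with a toy version of the line-comment rule. Because string literals are not recognized, a // inside a literal is mistaken for a comment; CommentPitfallDemo is a hypothetical helper illustrating the limitation:

```java
public class CommentPitfallDemo {
    /**
     * Naive line-comment stripping, as in the preprocessor above:
     * everything from the first "//" onward is dropped, even inside a string literal.
     */
    public static String stripLineComment(String line) {
        int idx = line.indexOf("//");
        return idx >= 0 ? line.substring(0, idx) : line;
    }
}
```

On the input write("http://x"); the "//" inside the literal is treated as a comment start and the rest of the statement is lost, which is exactly the case the TODO leaves unhandled.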


/**
 * Lexical analyzer: reads the specified file and returns word symbol tuples.
 */
public class Lexer {
    //Combined regular expression
    public static final String regex = Token.KEYWORD_REGEX + Token.OPERATOR_REGEX +
            Token.SEPARATOR_REGEX + Token.ID_REGEX + Token.NUM_REGEX;

    /**
     * Perform lexical analysis.
     * @param file source file
     * @return list of (category code, word symbol or value) tuples
     */
    public List<TokenRecord> lex(File file) {
        List<TokenRecord> tuple = new ArrayList<>();
        try {
            //Preprocess the file line by line first
            LinkedHashMap<Integer, String> map = Preprocessor.preprocess(file);
            //Match each line against the regular expression
            Pattern pattern = Pattern.compile(regex);
            for (int lineNumber : map.keySet()) {
                String string = map.get(lineNumber);
                Matcher matcher = pattern.matcher(string);
                while (matcher.find()) {
                    TokenRecord tokenRecord = new TokenRecord(lineNumber);
                    String match = matcher.group();
                    tokenRecord.setStringValue(match);
                    if (Token.tokenTypeMap.containsKey(match)) {
                        tokenRecord.setFlagCode(Token.tokenTypeMap.get(match));
                        System.out.println("(" + match + ",-)");
                    } else if ('0' <= match.charAt(0) && match.charAt(0) <= '9') {
                        tokenRecord.setNumValue(match);
                        System.out.println("(NUM," + match + ")");
                    } else {
                        System.out.println("(ID," + match + ")");
                    }
                    tuple.add(tokenRecord);
                }
            }
        } catch (CompileException e) {//Compile exception
            return null;
        }
        return tuple;
    }
}

/**
 * Lexical analyzer test
 */
public class LexerTest {
    public static void main(String[] args) {
        try {
            File file = new File(new File("").getAbsoluteFile() + "/test.txt");
            Lexer lexer = new Lexer();
            for (TokenRecord tokenRecord : lexer.lex(file)) {
                System.out.println("(" + tokenRecord.getFlagCode() + "," + tokenRecord.getStringValue() + ")");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The test file is test.txt

integer main(){
    integer i=0;//Line comment
    while(i<100)i++;/*Block comment*/
    return i;
}

The object models below describe a DFA and an NFA, realizing their basic operations (input and output), together with a method that converts an NFA into a DFA.

The test file structure is:
1) The number of states is stateNum; states are numbered 0..(stateNum-1);
2) The number of symbols is symbolNum; symbols are numbered 1..symbolNum, and the symbol numbered 0 is epsilon;
3) The following lines are state transitions, one per line, the list ending with -1;
   Transition format: state, symbol (may be 0 for epsilon), several target states, ending with -1;
4) Start state set, ending with -1;
5) End state set, ending with -1.

0 0 1 7 -1
1 0 2 4 -1
2 1 3 -1
3 0 6 -1
4 2 5 -1
5 0 6 -1
6 0 1 7 -1
7 1 8 -1
8 2 9 -1
9 2 10 -1

0 -1
10 -1

Output: DFA determined, described as:
Number of states: 5
 Number of character tables: 2
 State transition:
 Start status: 0
 End state set [4]
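The first subset-construction step for this input can be checked by hand: symbol 0 is epsilon, so the epsilon-closure of the start set {0} is {0, 1, 2, 4, 7}. A minimal worklist sketch of that closure computation follows; the ClosureDemo class and its hard-coded transition table are assumptions for illustration, not part of the program below:

```java
import java.util.*;

public class ClosureDemo {
    // Epsilon transitions of the test NFA above: state -> epsilon-reachable states
    static final Map<Integer, int[]> EPS = Map.of(
            0, new int[]{1, 7},
            1, new int[]{2, 4},
            3, new int[]{6},
            5, new int[]{6},
            6, new int[]{1, 7});

    /** Standard worklist epsilon-closure: pop a state, add its epsilon targets, repeat. */
    public static List<Integer> closure(Collection<Integer> states) {
        Set<Integer> res = new TreeSet<>(states);        // sorted result set
        Deque<Integer> stack = new ArrayDeque<>(states); // worklist
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int t : EPS.getOrDefault(s, new int[0])) {
                if (res.add(t)) stack.push(t);           // only revisit newly added states
            }
        }
        return new ArrayList<>(res);
    }
}
```

closure(List.of(0)) yields [0, 1, 2, 4, 7], which is the state set that becomes DFA state 0 in the subset construction.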

DFA description

/**
 * @author LSL
 */
public class DFA {
    private List<Integer> statusList;//State set
    private List<Integer> symbolList;//Character set
    private List<Function> functionList;//State transition set
    private int begin;//Initial state
    private List<Integer> endList;//Final state set

    public DFA(){
        statusList=new ArrayList<>();
        symbolList=new ArrayList<>();
        functionList=new ArrayList<>();
        endList=new ArrayList<>();
    }

    /**
     * Stores one transition; int[][] is deliberately not used here
     */
    static class Function{
        private int state;//Source state
        private int symbol;//Input symbol
        private int convertState;//State after the transition

        public Function(int state,int symbol,int convertState){
            this.state=state;
            this.symbol=symbol;
            this.convertState=convertState;
        }

        public boolean equals(Object object){
            if(!(object instanceof Function))return false;
            return state==((Function) object).state && symbol==((Function) object).symbol;
        }

        public int getConvertState() {
            return convertState;
        }
    }

    public String toString(){
        StringBuilder res=new StringBuilder("Number of states:"+statusList.size()+"\n"+
                "Number of character tables:"+(symbolList.size()-1)+"\n"+//The character set contains epsilon (0), so subtract one
                "State transition:\n");
        functionList.sort((Function f1,Function f2)->{
            if(f1.state==f2.state)return f1.symbol-f2.symbol;
            else return f1.state-f2.state;
        });
        //Output each transition
        for (Function function:functionList){
            res.append(function.state).append(" ").append(function.symbol).append(" ")
               .append(function.convertState).append("\n");
        }
        res.append("Start status:").append(begin).append("\n");
        res.append("End state set").append(endList.toString()).append("\n");
        return res.toString();
    }

    //getters and setters omitted here
    public void addConvertState(int state, int symbol, int convertState){
        Function function=new Function(state,symbol,convertState);
        functionList.add(function);
    }
}
NFA description

/**
 * @author LSL
 */
public class NFA {
    private List<Integer> statusList=new ArrayList<>();//State set
    private List<Integer> symbolList=new ArrayList<>();//Character set
    private List<FunctionExtension> functionList=new ArrayList<>();//State transition set
    private List<Integer> beginList=new ArrayList<>();//Initial state set
    private List<Integer> endList=new ArrayList<>();//Final state set

    /**
     * Read a file and construct the NFA
     * @param file description file
     */
    public NFA(File file){
        try {
            Scanner scanner=new Scanner(new FileReader(file));
            int stateNum=scanner.nextInt();
            int symbolNum=scanner.nextInt();

            Set<Integer> statusSet=new HashSet<>();//State set
            Set<Integer> symbolSet=new HashSet<>();//Symbol set
            String line=scanner.nextLine();
            while (line.equals(""))line=scanner.nextLine();//Skip redundant blank lines
            while (!line.equals("-1")){
                String[] num=line.split(" ");
                statusSet.add(Integer.parseInt(num[0]));//Collect states
                symbolSet.add(Integer.parseInt(num[1]));//Collect symbols
                FunctionExtension extension=new FunctionExtension(Integer.parseInt(num[0]),Integer.parseInt(num[1]));//One transition
                for(int j=2;j<num.length-1;j++){//Target states, excluding the trailing -1
                    extension.convertStateList.add(Integer.parseInt(num[j]));
                }
                functionList.add(extension);
                line=scanner.nextLine();
                while (line.equals(""))line=scanner.nextLine();//Skip redundant blank lines
            }
            statusList.addAll(statusSet);
            symbolList.addAll(symbolSet);

            //Read the start state set
            line=scanner.nextLine();
            while (line.equals(""))line=scanner.nextLine();//Skip redundant blank lines
            String[] num=line.split(" ");
            for(int i=0;i<num.length-1;i++)beginList.add(Integer.parseInt(num[i]));//Start state set
            //Read the end state set
            line=scanner.nextLine();
            while (line.equals(""))line=scanner.nextLine();//Skip redundant blank lines
            num=line.split(" ");
            for(int i=0;i<num.length-1;i++)endList.add(Integer.parseInt(num[i]));//End state set
            //TODO: the declared numbers of states and symbols could be used to verify the input
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    /**
     * Stores one transition; int[][] is deliberately not used here
     */
    static class FunctionExtension{
        private int state;//Source state
        private int symbol;//Input symbol
        private List<Integer> convertStateList=new ArrayList<>();//States after the transition

        public FunctionExtension(int state,int symbol){
            this.state=state;
            this.symbol=symbol;
        }
    }
    /**
     * Convert the NFA to a DFA (subset construction)
     * @return the DFA after conversion
     */
    public DFA convertToDFA(){
        DFA res=new DFA();

        Map<List<Integer>,Integer> convertMap=new LinkedHashMap<>();//Map from NFA state set to DFA state id
        Map<List<Integer>,Boolean> convertMapFlag=new LinkedHashMap<>();//Whether a state set is marked
        int stateId=0;//DFA state id (index)
        List<Integer> init=Closure(beginList);
        convertMap.put(init,stateId++);//DFA index maps an NFA state set
        convertMapFlag.put(init,false);

        while (true){
            List<Integer> T=null;
            //Look for an unmarked state set
            for(List<Integer> flag:convertMapFlag.keySet()){
                if(!convertMapFlag.get(flag)){
                    T=flag;//Unmarked state set found
                    break;
                }
            }
            if(T==null)break;//All state sets are marked: finished
            convertMapFlag.put(T,true);//Mark it

            //One step for each symbol
            for(int symbol:symbolList){
                //Here 0 represents epsilon, which must not be treated as an input symbol
                if(symbol==0)continue;
                //Get f(T,symbol)=f(status1,symbol) U f(status2,symbol) U ...
                Set<Integer> tmp=new HashSet<>();//Set removes duplicates
                for(int status:T){
                    tmp.addAll(getFunctionConvert(status,symbol));
                }
                List<Integer> U=Closure(new ArrayList<>(tmp));//Get closure(f(T,symbol))
                //Judge whether U is already in the state set mapping, comparing List<Integer> contents
                boolean find=false;
                List<Integer> UCopy = null;
                for(List<Integer> flagList:new ArrayList<>(convertMap.keySet())){
                    if(sortListEquals(flagList,U)){
                        find=true;
                        UCopy=flagList;//Equivalent to U
                        break;
                    }
                }
                //Add U if it is not yet in the mapping set
                if(!find){
                    convertMap.put(U,stateId++);
                    convertMapFlag.put(U,false);
                    res.addConvertState(convertMap.get(T),symbol,convertMap.get(U));
                }
                //Record the DFA transition f'(T,symbol)=U
                else res.addConvertState(convertMap.get(T),symbol,convertMap.get(UCopy));
            }
        }

        //Initialize the DFA's state set, symbol table, initial state and final state set
        for(List<Integer> list:convertMap.keySet()){//State set
            res.getStatusList().add(convertMap.get(list));
        }
        res.getSymbolList().addAll(symbolList);//Symbol table
        res.setBegin(0);//The initial state is always 0
        res.getEndList().add(stateId-1);//Final state set

        return res;
    }

    /**
     * Starting from a given set of states, search for all epsilon-reachable states
     * @param stateList initial state set
     * @return all states reachable from the set via epsilon transitions
     */
    private List<Integer> Closure(List<Integer> stateList){
        //Initialize closure(state) as a copy of the state set
        List<Integer> res=new ArrayList<>(stateList);
        //Push all states onto the stack
        Stack<Integer> stack=new Stack<>();
        stack.addAll(res);

        while (!stack.empty()){
            int node=stack.pop();//Pop one state
            //Search the transition list for epsilon moves from this state
            for (FunctionExtension extension:functionList){
                if(extension.state==node&&extension.symbol==0){//0 stands for epsilon
                    for(int endState:extension.convertStateList){//Reachable target states
                        if(!res.contains(endState)){
                            res.add(endState);
                            stack.push(endState);
                        }
                    }
                }
            }
        }
        res.sort(Comparator.comparingInt(num->num));//Sort the returned result
        return res;
    }

    /**
     * Given a state and a symbol, return the set of target states
     * @param state source state
     * @param symbol input symbol
     * @return target state set (empty if no transition exists)
     */
    public List<Integer> getFunctionConvert(int state,int symbol){
        for(FunctionExtension extension:functionList){
            if(extension.state==state && extension.symbol==symbol)return extension.convertStateList;
        }
        return new ArrayList<>();
    }

    /**
     * Determine whether two sorted lists contain identical elements
     * @param a first sorted list
     * @param b second sorted list
     * @return true if the lists are element-wise equal
     */
    public boolean sortListEquals(List<Integer> a,List<Integer> b){
        if(a.size()!=b.size())return false;
        for(int i=0;i<a.size();i++){
            if(!a.get(i).equals(b.get(i)))return false;
        }
        return true;
    }

    public String toString(){
        StringBuilder res=new StringBuilder(statusList.toString()+symbolList.toString()+beginList.toString()+endList.toString());
        for(FunctionExtension extension:functionList){
            res.append("\n").append(extension.state).append(" ").append(extension.symbol).append(" ").append(extension.convertStateList);
        }
        return res.toString();
    }
}

public class NfaToDfaTest {
    public static void main(String[] args){
        File file = new File(new File("").getAbsoluteFile() + "/test.txt");
        NFA nfa=new NFA(file);
        System.out.println(nfa.convertToDFA());
    }
}


Posted on Mon, 16 Mar 2020 09:37:27 -0400 by drimades