Compilation Principles: Lexical Analysis

Lexical analysis

Lexical analysis scans the source program and produces word symbols, transforming the source program into an intermediate form: a string of word symbols. Its input is the source program and its output is the word symbols. The program that performs lexical analysis is the lexical analyzer, or scanner.

Word symbols are the basic grammatical symbols of a programming language. A word symbol, also called a token, is the smallest grammatical unit with independent meaning. Combining characters into tokens is much like combining letters into words of an English sentence and determining their meaning; the task is very similar to spelling.

The word symbols of a programming language generally fall into:

• Keywords: reserved words such as if, else, while
• Identifiers: variable names, procedure names, etc.
• Constants: numbers, strings, Booleans, etc.
• Operators: +, -, *, /, etc.
• Delimiters: comma, semicolon, parentheses, etc.

Word symbols are often represented as a pair: (word type, attribute value of the word symbol). The word type is the information needed by syntax analysis and is usually encoded as an integer. How to classify and encode the word symbols of a language is a technical question, decided mainly by convenience of processing.

• Identifiers are generally grouped into a single type.
• Constants are classified by type.
• Keywords can be treated as one type as a whole, or with one type per keyword; the one-type-per-keyword scheme is more convenient in practice.
• Operators can have one type each, but operators with common properties can also be grouped into one type.
• Delimiters generally get one type per symbol.

The word type is usually defined as an enumeration:

typedef enum {
    IF, ELSE, PLUS, NUM, ID /* ... */
} TokenType;


The attribute value of a word symbol records its character or features: it can be the identifier's entry address in the symbol table, the binary value of a constant, etc.

If a type contains exactly one word symbol (as with keywords, operators, etc. under the one-type-per-symbol scheme), the lexical analyzer outputs only the type code, with no attribute value.
If a type contains more than one word symbol, then besides the type code, an attribute value must also be given for each word symbol to distinguish the words of the same type. An identifier's attribute value is its own character string, or its entry address in the symbol table; a constant's attribute value is the binary value of the constant.

The scanner must compute several attributes for each token, so it is convenient to collect all the attributes into a single constructed data type, called a token record.

typedef struct {
    TokenType tokenval;
    char* stringval;
    int numval;
} TokenRecord;


Or, with the attributes collected in a union:

typedef struct {
    TokenType tokenval;
    union {
        char* stringval;
        int numval;
    } attribute;
} TokenRecord;
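To illustrate, a token record for an identifier or a numeric constant might be built like this (a self-contained C sketch restating the union-based types above; the helper names make_id and make_num are assumptions, not part of the original):

```c
#include <string.h>

/* Restated from the text: word types and a union-based token record. */
typedef enum { IF, ELSE, PLUS, NUM, ID } TokenType;
typedef struct {
    TokenType tokenval;
    union {
        char* stringval;   /* attribute of an identifier */
        int numval;        /* attribute of a constant */
    } attribute;
} TokenRecord;

/* Hypothetical helpers: build a record for an identifier or a number. */
TokenRecord make_id(char *name) {
    TokenRecord t;
    t.tokenval = ID;
    t.attribute.stringval = name;
    return t;
}
TokenRecord make_num(int v) {
    TokenRecord t;
    t.tokenval = NUM;
    t.attribute.numval = v;
    return t;
}
```

Only one attribute is active at a time, which is exactly why a union suffices here.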


In short, the scanner reads the source program and outputs the word-symbol string in the pair form (word type, attribute value of the word symbol).

[example] Give the output word-symbol string of the program fragment if (a > 1) b = 100;
Assume that each keyword, operator and delimiter is its own type, that an identifier's attribute value is its own string, and that a constant's attribute value is its binary value.
(2, -)	keyword if
(29, -)	left parenthesis (
(10, 'a')	identifier a
(23, -)	greater-than sign >
(11, binary of 1)	constant 1
(30, -)	right parenthesis )
(10, 'b')	identifier b
(17, -)	assignment sign =
(11, binary of 100)	constant 100
(26, -)	semicolon ;


Another representation:

[example] Consider the following C++ code fragment: while (i >= j) i--;
Assume that keywords, operators and delimiters each form their own type, that an identifier's attribute value is its entry address in the symbol table, and that a constant's attribute value is its binary value.
After processing by the lexical analyzer, it is converted into the following word-symbol sequence:
( while , - )
( (     , - )
( id, pointer to the symbol-table entry of i )
( >=    , - )
( id, pointer to the symbol-table entry of j )
( )     , - )
( id, pointer to the symbol-table entry of i )
( --    , - )
( ;     , - )


As an independent stage (one pass), lexical analysis translates the character sequence of the source program into the word-symbol sequence and stores it in a file; when the syntax analyzer runs, it reads the word symbols from that file. The structure is simpler, clearer and better organized, and it helps to concentrate on lexical analysis alone.
As a subroutine, the lexical analyzer is called whenever the syntax analyzer needs a word symbol; each call recognizes one word symbol from the input string.

Generally, there are two ways to construct a lexical analyzer:

• Manual method: write the lexical analyzer in a high-level language, following the state transition diagrams that recognize the language's words; for example, write it directly in C.
• Automatic method: use LEX, an automatic generator of lexical analyzers, to generate the analyzer.

Manual design of lexical analysis

1. Input buffer and preprocessing

The first step of the lexical analyzer is to read the source text into an input buffer.
Preprocessing: eliminate redundant blanks, skip editing characters such as tabs, carriage returns and line feeds, and remove comments from the input source program, storing the result in a scanning buffer so as to ease the recognition of word symbols.

2. Scan buffer

• To ensure that word symbols are not cut off at the boundary of the scanning buffer, the buffer is generally designed as two halves, as follows;
• Each input operation refills one half of the buffer, so that in the worst case the longest word symbol the analyzer can recognize is half the buffer length. This is also known as a double buffer.
• Two pointers are used:
• the start pointer points at the first character of a new word;
• the search pointer moves forward to find the end of the word;
• if the search pointer reaches the edge of a half without reaching the end of the word, the preprocessor is called to load the following characters into the other half.

Recognition of word symbols: lookahead. Word symbols in the source program have no special terminator, and when the program is free of readability errors there need not be a blank between adjacent word symbols. Therefore, even after all the characters of a word symbol have been read, and especially when one word symbol is a prefix substring of another, the lexical analyzer cannot be sure that the current word has ended; it can decide only after reading one or more characters ahead. If extra characters have been read by then, they must be returned.
For example, recognizing ">" and ">=" in C requires looking one character ahead.
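The ">" versus ">=" case can be sketched in C (a minimal sketch; the token codes and the function name scan_gt are assumptions):

```c
/* Hypothetical token codes for this sketch. */
enum { TOK_GT = 23, TOK_GE = 24 };

/* Recognize ">" vs ">=" with one character of lookahead.
 * src is the input, *pos the current index (pointing at '>').
 * The "retract" is implicit: the second character is consumed
 * only when it completes ">="; otherwise it stays unread. */
int scan_gt(const char *src, int *pos) {
    (*pos)++;                  /* consume '>' */
    if (src[*pos] == '=') {    /* look one character ahead */
        (*pos)++;              /* it belongs to ">=": keep it */
        return TOK_GE;
    }
    return TOK_GT;             /* leave src[*pos] for the next token */
}
```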

Cases in which lookahead can be avoided:

• stipulate that all keywords are reserved words which users may not use as their own identifiers, and treat keywords as special identifiers via a reserved-word table;
• require that, if there is no operator or delimiter between a keyword, identifier and constant (or label), a blank character must separate them.

State transition diagram: a state transition diagram can be used to recognize (accept) a certain set of strings. In most programming languages, the word symbols can be recognized by transition diagrams.

A state transition diagram is a finite directed graph. Nodes represent states and are drawn as circles; states are connected by labelled arcs, and the label (a character or character class) on an arc leaving a node gives the input that may appear in that state. A transition diagram contains a finite number of states, one of which is the initial state and at least one of which is a final state.

The state transition diagram recognizes (accepts) a string as follows: if there is a path from the initial state to some final state such that the labels on the arcs of the path concatenate to a word equal to α, then α is recognized (accepted) by the state transition diagram.
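Such a diagram maps directly onto code, one case per state. A minimal C sketch for the identifier diagram letter (letter | digit)* (the function name accepts_identifier is an assumption):

```c
#include <ctype.h>

/* Transition diagram for identifiers: state 0 is initial,
 * state 1 is final; an input with no valid transition rejects. */
int accepts_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        switch (state) {
        case 0:  /* initial: expect a letter */
            if (isalpha((unsigned char)*s)) state = 1; else return 0;
            break;
        case 1:  /* loop state: letters or digits */
            if (isalnum((unsigned char)*s)) state = 1; else return 0;
            break;
        }
    }
    return state == 1;  /* accepted iff we stop in the final state */
}
```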

[example]

The design assumes:

• all keywords are reserved words, and users may not use them as their own identifiers;
• keywords are treated as a kind of special identifier, so no separate transition diagrams are designed for them; instead, the keywords are arranged in advance in a table, called the keyword table, and when an identifier has been recognized the keyword table is consulted to determine whether it is a keyword;
• if there is no operator or delimiter between a keyword, identifier and constant, at least one blank character must separate them.

When the state transition diagram is implemented, each state node corresponds to a small program segment. Three kinds of nodes occur: branch nodes, loop nodes, and final nodes.

Implementing the lexical analyzer in pseudocode

Global variables and procedures:
• ch — character variable holding the most recently read source character
• strToken — character array holding the string that forms the current word symbol
• GetChar() — procedure that reads the next character into ch
• GetBC() — procedure that skips blanks until a non-blank character is read into ch
• Concat() — procedure that appends the character in ch to strToken
• IsLetter(), IsDigit() — Boolean functions testing whether ch is a letter or a digit
• Reserve() — integer function that looks strToken up in the reserved-word table; returns the keyword's code if it is a reserved word, otherwise 0
• Retract() — procedure that moves the search pointer back one character position
• InsertId() — integer function that inserts the identifier in strToken into the symbol table and returns its symbol-table pointer
• InsertConst() — integer function that inserts the constant in strToken into the constant table and returns its constant-table pointer

int code, value;
strToken := "";	/* set strToken to the empty string */
GetChar(); GetBC();
if (IsLetter())
begin
    while (IsLetter() or IsDigit())
    begin
        Concat(); GetChar();
    end
    Retract();
    code := Reserve();
    if (code = 0)
    begin
        value := InsertId(strToken);
        return ($ID, value);
    end
    else return (code, -);
end
else if (IsDigit())
begin
    while (IsDigit())
    begin
        Concat(); GetChar();
    end
    Retract();
    value := InsertConst(strToken);
    return ($INT, value);
end
else if (ch = '=') return ($ASSIGN, -);
else if (ch = '+') return ($PLUS, -);
else if (ch = '*')
begin
    GetChar();
    if (ch = '*') return ($POWER, -);
    Retract();
    return ($STAR, -);
end
else if (ch = ',') return ($COMMA, -);
else if (ch = '(') return ($LPAR, -);
else if (ch = ')') return ($RPAR, -);
else ProcError();	/* error handling */


A table-driven variant uses the state transition matrix stateTrans directly:

curState := initial state;
GetChar();
while (stateTrans[curState][ch] is defined)	/* a successor state exists: append ch and read on */
begin
    Concat();
    curState := stateTrans[curState][ch];	/* move to the next state */
    if (curState is a final state) return strToken;	/* the word has been recognized */
    GetChar();
end


Regular expressions

A regular expression (regular form) is a way of defining the word symbols of a language.

[example] The regular expression for identifiers: letter (letter | digit)*

Recursive definition of regular expressions and regular sets:

• ε and Φ are regular expressions on Σ; the regular sets they denote are {ε} and Φ, respectively;
• for any a ∈ Σ, a is a regular expression on Σ, and its regular set is {a};
• suppose e1 and e2 are regular expressions on Σ denoting the regular sets L(e1) and L(e2); then
e1|e2 is a regular expression denoting the regular set L(e1) ∪ L(e2) (union),
e1e2 is a regular expression denoting the regular set L(e1)L(e2) (concatenation),
(e1)* is a regular expression denoting the regular set (L(e1))* (closure).

The precedence, from highest to lowest, is closure, concatenation, union.

A regular set can be represented by a regular expression, and a regular expression is a way of denoting a regular set: a string set is a regular set if and only if it can be represented by a regular expression. A regular expression r is completely defined by the set of strings it matches; this set is the language generated by r, written L(r), and every regular expression can be regarded as a matching pattern.
[example] Let Σ = {a,b,c}. Then aa*bb*cc* is a regular expression on Σ, and it denotes the regular set
L = {abc, aabc, abbc, abcc, aaabc, ...} = {a^m b^n c^l | m,n,l ≥ 1}

If the alphabet of a programming language is taken to be the set of keyboard characters, its word symbols can be defined as follows:
Keywords: if | else | while | do
Identifiers: l(l|d)*
Integer constants: dd*
Relational operators
where l stands for any English letter a–z and d stands for any digit 0–9.

Equivalence of regular expressions

If two regular expressions denote the same regular set, they are considered equivalent; two equivalent regular expressions R1 and R2 are written R1 = R2.
[example] (a|b)* = (a*|b*)*,  b(ab)* = (ba)*b

Extensions of regular expressions

• One or more repetitions of the regular expression r is written r+.
• "." matches any character.
• Square brackets with a hyphen denote a character range, such as [0-9], [a-z], [a-zA-Z]; the notation can also list single alternatives, so a|b|c can be written [abc].
• "~" denotes any character not in the given set; for example, ~a denotes any character of the alphabet other than a.
• The optional subexpression r? means that the string matched by r is optional (occurs 0 or 1 times). For example, natural = [0-9]+, signedNatural = (+|-)? natural.

Regular grammars and regular expressions

Both regular grammars and regular expressions are tools for describing regular sets. For every regular grammar there is a regular expression defining the same language; conversely, for every regular expression there is a regular grammar generating the same language.

From regular grammar to regular expression

Each nonterminal of the regular grammar is written as a regular equation about it, giving a system of simultaneous equations, which is then solved.
Solution rules:
• if x = αx | β (i.e. x = αx + β), the solution is x = α*β;
• if x = xα | β (i.e. x = xα + β), the solution is x = βα*.
These two rules, together with the distributive, commutative and associative laws of regular expressions, are used to solve the equations for the start symbol of the grammar; the solution for the start symbol S is a regular expression of the language. The rules are the important ones: they remove the recursion on x in favour of the closure of α.

[example 1] Given the regular grammar G:
Z → 0A
A → 0A | 0B
B → 1A | ε
give a regular expression of the language the grammar generates.
First write the corresponding regular equations (+ in place of |):
Z = 0A .........(1)
A = 0A + 0B .........(2)
B = 1A + ε .........(3)
Substituting (3) for B in (2):
A = 0A + 01A + 0 .........(4)
Applying the distributive law to (4):
A = (0 + 01)A + 0 .........(5)
Applying the solution rule to (5):
A = (0 + 01)*0 .........(6)
Substituting (6) into (1):
Z = 0(0 + 01)*0
That is, the regular expression of the language generated by G[Z] is 0(0|01)*0.

[example 2] Given the regular grammar G:
A → aB | bB
B → aC | a | b
C → aB
give a regular expression of the language the grammar generates.
Following the same steps, first write the regular equations (+ in place of |):
A = aB + bB .........(1)
B = aC + a + b .........(2)
C = aB .........(3)
Substituting (3) into (2):
B = aaB + a + b .........(4)
Applying the solution rule to (4):
B = (aa)*(a + b) .........(5)
Substituting (5) into (1):
A = (a + b)(aa)*(a + b)
That is, the regular expression of the language generated by G[A] is (a|b)(aa)*(a|b).

[example 3] Given the regular grammar G:
Z → U0 | V1
U → Z1 | 1
V → Z0 | 0
give a regular expression of the language the grammar generates.
First write the regular equations (+ in place of |):
Z = U0 + V1 .........(1)
U = Z1 + 1 .........(2)
V = Z0 + 0 .........(3)
Substituting (2) and (3) into (1):
Z = Z10 + 10 + Z01 + 01 .........(4)
Z = Z(10 + 01) + 10 + 01 .........(4')
Applying the solution rule to (4'):
Z = (10 + 01)(10 + 01)*
That is, the regular expression of the language generated by G[Z] is (10|01)(10|01)*.

[example 4] Given a regular grammar describing the word symbol "identifier":
<identifier> → l | <identifier>l | <identifier>d
First write the corresponding regular equation (+ in place of |):
S = l + Sl + Sd
S = l + S(l + d)
Applying the solution rule:
S = l(l + d)*
The regular expression of the grammar is l(l|d)*.

From regular expression to regular grammar

The conversion from a regular expression on the alphabet Σ to a regular grammar G = (V_N, V_T, P, S) is as follows:
1. let V_T = Σ;
2. for the regular expression R, choose a nonterminal Z, generate the rule Z → R, and let S = Z;
3. for a rule A → xy in which x and y are both regular expressions, rewrite it as A → xB and B → y, where B is a new nonterminal;
4. in the transformed grammar, further rewrite rules of the form A → a*b as A → aA | b;
5. apply rules (3) and (4) repeatedly until each rule contains at most one terminal.
[example 1] Convert R = (a|b)(aa)*(a|b) into a corresponding regular grammar.
Let A be the start symbol of the grammar:
A → (a|b)(aa)*(a|b)
By rule (3):
A → (a|b)B
B → (aa)*(a|b)
By rule (4) (turning the * back into recursion):
A → aB | bB
B → aaB | a | b
(aaB contains two terminals, so it must be simplified until each rule has at most one.) By rule (3):
A → aB | bB
B → aC | a | b
C → aB

[example 2] Convert the regular expression R = l(l|d)* describing identifiers into a corresponding regular grammar.
Let S be the start symbol:
S → l(l|d)*
By rule (3):
S → lA
A → (l|d)*
By rule (4):
S → lA
A → (l|d)A | ε
Transforming further:
S → lA
A → lA | dA | ε
Eliminating ε:
S → l | lA
A → l | d | lA | dA

Finite automata

A finite automaton is an abstract mathematical model of a system with discrete inputs and outputs. There are two kinds of finite automata, deterministic and nondeterministic, and both recognize exactly the regular sets.

Deterministic finite automata (DFA)

A deterministic finite automaton (DFA) M is a five-tuple M = (Q, Σ, f, S, Z), where:
Q: a finite set of states;
Σ: a finite alphabet, each element of which is called an input character;
f: the state transition function, a single-valued mapping from Q × Σ to Q; f(qi, a) = qj (qi, qj ∈ Q, a ∈ Σ) means that when the current state is qi and the input character is a, the automaton moves to the next state qj; qj is a successor of qi;
S ∈ Q: the unique initial state;
Z ⊆ Q: the set of final states (possibly empty).

[example] DFA M = ({q0,q1,q2}, {a,b}, f, q0, {q2}), where:
f(q0,a) = q1	f(q0,b) = q2
f(q1,a) = q1	f(q1,b) = q1
f(q2,a) = q2	f(q2,b) = q1

State transition matrix and state transition diagram

A DFA can be represented by a matrix: the rows represent states, the columns represent input characters, and each entry is the value of f(s, a). This matrix is called the state transition matrix, or transition table. A DFA can also be represented by a (deterministic) state transition diagram: if DFA M has m states and n input characters, the diagram has m nodes, each node has at most n outgoing arcs to other states, the arcs leaving one node are labelled with distinct characters of Σ, and the whole diagram contains exactly one initial node and some number (possibly 0) of final nodes.

Strings recognized by DFA M: for any word β in Σ*, if there is a path from the initial state to some final state such that the labels of all arcs along the path concatenate to β, then β is recognized by DFA M. If the initial state of M is also a final state, then ε is recognized by M. The set of all strings recognized by DFA M is the language it accepts, written L(M).

Conclusion: V ⊆ Σ* is a regular set if and only if there is a finite automaton M on Σ such that V = L(M).

Algorithm for simulating a DFA

Input: an input string x terminated by eof; a DFA D with start state s0 and final state set F.
Output: "yes" if D accepts x, otherwise "no".
Method: apply the following algorithm to the input string x. The function move(s, c) gives the state to which D moves when the input character c is encountered in state s; getch() returns the next character of x. (The acceptance test is made after the whole input has been read.)

s = s0;
while ((c = getch()) != eof)
    s = move(s, c);
if (s is in F) return "yes"; else return "no";

Nondeterministic finite automata (NFA)

A nondeterministic finite automaton (NFA) M is a five-tuple M = (Q, Σ, f, S, Z), where:
Q: a finite set of states;
Σ: a finite alphabet;
f: the state transition function, a multivalued mapping from Q × Σ* to the power set of Q, i.e. f: Q × Σ* → 2^Q;
S ⊆ Q: a nonempty set of initial states;
Z ⊆ Q: the set of final states (possibly empty).
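The simulation algorithm can be sketched in C for the example DFA M above, with the transition function encoded as a table (a minimal sketch; the function name dfa_accepts is an assumption):

```c
/* Table-driven simulation of the example DFA
 * M = ({q0,q1,q2}, {a,b}, f, q0, {q2}) from the text:
 * f(q0,a)=q1 f(q0,b)=q2 f(q1,a)=q1 f(q1,b)=q1 f(q2,a)=q2 f(q2,b)=q1 */
int dfa_accepts(const char *x) {
    static const int move[3][2] = {
        /*        a  b  */
        /* q0 */ {1, 2},
        /* q1 */ {1, 1},
        /* q2 */ {2, 1},
    };
    int s = 0;                                /* start state q0 */
    for (; *x; x++) {
        if (*x != 'a' && *x != 'b') return 0; /* not in the alphabet */
        s = move[s][*x - 'a'];
    }
    return s == 2;  /* accept iff the final check lands in q2 */
}
```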

An NFA can also be represented by a matrix: the rows represent states, the columns represent input characters, and each entry is the value (a state set) of f(s, a). An NFA can likewise be represented by a state transition diagram.

Differences between an NFA and a DFA

An NFA may have several initial states;
The label on an arc may be a word in Σ* (even a regular expression), not necessarily a single character;
The same word may appear on several arcs leaving the same state;
A DFA is a special case of an NFA.

Strings recognized by NFA M: for any word β in Σ*, if there is a path from some initial state to some final state such that the labels of all arcs along the path (ignoring ε arcs) concatenate to β, then β is recognized by NFA M. If some state of M is both initial and final, then ε is accepted by M.

The set of all strings that NFA M can recognize is the language it accepts, written L(M). For example, the language recognized by the NFA M' in the example above is L(M') = b*(b|ab)(bb)*.

By the definition of an NFA, the same string β may be recognized along several different paths. A DFA is a special case of an NFA. The method of constructing a lexical analyzer with finite automata is as follows:

1. construct an NFA from the description of the language's words;
2. convert the NFA to a DFA;
3. minimize the DFA to one with the fewest states;
4. construct a program segment for each DFA state, turning the DFA into a lexical analyzer that recognizes words.

Converting an NFA to a DFA

Determinizing an NFA means that, for any given NFA, a DFA can be constructed accordingly that accepts the same language.

For an NFA, since the state transition function f is multivalued, there are always states q for which

f(q,a) = {q1, q2, ..., qn}

is a subset of the NFA's state set. To convert the NFA to a DFA, the state set {q1, q2, ..., qn} is regarded as a single state A. That is, the basic idea of constructing a DFA from an NFA is that each DFA state represents a subset of the NFA's state set: the DFA uses its state to record the set of all states the NFA may reach after reading the input symbols so far. This construction method is called the subset method.

ε-closure of a state set I

Let I be a subset of the states of NFA N. ε-CLOSURE(I) is defined as follows:

If s ∈ I, then s ∈ ε-CLOSURE(I);
If s ∈ ε-CLOSURE(I), then any state s' reachable from s by ε arcs alone also belongs to ε-CLOSURE(I).


Computing ε-CLOSURE(I) is a search for the set of reachable nodes in the transition graph, starting from the given node set:

push all the states in I onto the stack;
initialize ε-CLOSURE(I) to I;
while the stack is not empty do
begin
    pop the top element t off the stack;
    for each state u such that there is an edge labelled ε from t to u do
        if u is not yet in ε-CLOSURE(I) then
        begin
            add u to ε-CLOSURE(I);
            push u onto the stack
        end
end
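The stack algorithm above can be sketched in C with bit-mask state sets (a sketch under stated assumptions: at most 32 states, and an eps table, supplied by the caller, giving for each state the mask of states reachable by a single ε arc):

```c
/* State sets as bit masks: bit s set means state s is in the set. */
typedef unsigned int StateSet;

/* eps[s] = mask of states reachable from s by one ε arc. */
StateSet eps_closure(StateSet I, const StateSet *eps, int nstates) {
    StateSet closure = I;          /* initialize ε-CLOSURE(I) to I */
    int stack[32], top = 0;
    for (int s = 0; s < nstates; s++)
        if (I & (1u << s)) stack[top++] = s;   /* push all states of I */
    while (top > 0) {
        int t = stack[--top];                  /* pop the top element */
        for (int u = 0; u < nstates; u++)
            if ((eps[t] & (1u << u)) && !(closure & (1u << u))) {
                closure |= 1u << u;            /* add u to the closure */
                stack[top++] = u;              /* and push it */
            }
    }
    return closure;
}
```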


Constructing an equivalent DFA M = (Q', Σ, f', S', Z') from an NFA N = (Q, Σ, f, S, Z)

First, the set of states reachable from the initial states S by ε arcs alone is taken as the initial state S' of M; then, for each input symbol a ∈ Σ, the set of states reachable from S' by a transition on a (including all states reachable by possible ε transitions before or after reading a) is taken as a new state of M; and so on, until no new states appear.

Initially, the state set Q' and the final-state set Z' of DFA M are empty.
The initial state of M is S' = ε-CLOSURE(S); S' is added to Q' unmarked.
At the start, ε-CLOSURE(S) is the only state in Q', and it is unmarked;
while there is an unmarked state T in Q' do
begin
    mark T;
    for each input symbol a do
    begin
        U := ε-CLOSURE( f(T, a) );
        if U is not in Q' then
            add U to Q' as an unmarked state;
        f'(T, a) := U;
    end
end
Finally, Z' is the set of all states in Q' that contain at least one final state of N.


Determinization of an NFA

Mutual transformation between finite automata and grammars

Transformation from a right-linear regular grammar to a finite automaton

Transformation from a left-linear regular grammar to a finite automaton

Transformation from a finite automaton to a regular grammar

Transformation between finite automata and regular expressions

Constructing an NFA from a regular expression

Input: a regular expression R on the alphabet Σ

Output: an NFA N recognizing the language L(R)

Throughout the splitting process, all newly created nodes receive distinct names; X and Y are the only initial and final nodes of the whole graph.

Transformation from a finite automaton to a regular expression

In the inverse process, a new initial state X is added and connected by ε arcs to all the original initial states, and a new final state Y is added to which all the original final states are connected by ε arcs. This forms a new NFA M' that has a single initial state X and a single final state Y; the arcs between X and Y are then merged step by step.

Actual design

The task of lexical analysis is to scan the source file and output the word-symbol string in the pair form (token type, attribute value of the word symbol).

In the lexical analysis, regular expressions are used to scan the whole file and identify the types of the word symbols, which are divided by type, for example:

TOKEN: {
<VOID : "void">
| <CHAR : "char">
| <SHORT : "short">
| <INT : "int">
| <LONG : "long">
| <STRUCT : "struct">
| <UNION : "union">
| <ENUM : "enum">
| <STATIC : "static">
| <EXTERN : "extern">
| <CONST : "const">
| <SIGNED : "signed">
| <UNSIGNED : "unsigned">
| <IF : "if">
| <ELSE : "else">
| <SWITCH : "switch">
| <CASE : "case">
| <DEFAULT_ : "default">
| <WHILE : "while">
| <DO : "do">
| <FOR : "for">
| <RETURN : "return">
| <BREAK : "break">
| <CONTINUE : "continue">
| <GOTO : "goto">
| <TYPEDEF : "typedef">
| <IMPORT : "import">
| <SIZEOF : "sizeof">
}


The TOKEN block above describes the keyword rules.

TOKEN: {
<IDENTIFIER: ["a"-"z", "A"-"Z", "_"](["a"-"z", "A"-"Z", "_", "0"-"9"])*>
}


The TOKEN block above describes the identifier rule.

Regular expressions use the longest-prefix-match rule: if the input voidFunction is encountered, it is matched as the single identifier voidFunction, not as the keyword void followed by the identifier Function.

In the same way, the numeric rules can be described (matching decimal, hexadecimal and octal values):

TOKEN: {
<INTEGER: ["1"-"9"] (["0"-"9"])* ("U")? ("L")?
| "0" ["x", "X"] (["0"-"9", "a"-"f", "A"-"F"])+ ("U")? ("L")?
| "0" (["0"-"7"])* ("U")? ("L")?
>
}
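As a rough cross-check of the INTEGER rule, here is a small C predicate mirroring it (a sketch; the function names and the single-U/single-L suffix handling are assumptions drawn from the rule, not part of the original):

```c
#include <ctype.h>
#include <string.h>

/* True iff the first n characters of s all satisfy pred (n > 0). */
static int all_in(const char *s, int n, int (*pred)(int)) {
    for (int i = 0; i < n; i++)
        if (!pred((unsigned char)s[i])) return 0;
    return n > 0;
}
static int is_octal(int c) { return c >= '0' && c <= '7'; }

/* Mirrors the INTEGER token: a decimal literal (no leading 0),
 * a hex literal 0x.../0X..., or an octal literal 0..., each
 * optionally followed by 'U' and then 'L', as in ("U")? ("L")?. */
int is_integer_literal(const char *s) {
    size_t n = strlen(s);
    if (n && s[n-1] == 'L') n--;      /* strip optional L suffix */
    if (n && s[n-1] == 'U') n--;      /* then optional U suffix */
    if (n == 0) return 0;
    if (s[0] != '0')                  /* ["1"-"9"](["0"-"9"])* */
        return all_in(s, (int)n, isdigit);
    if (n >= 2 && (s[1] == 'x' || s[1] == 'X'))        /* hex */
        return n > 2 && all_in(s + 2, (int)n - 2, isxdigit);
    return n == 1 || all_in(s + 1, (int)n - 1, is_octal); /* octal */
}
```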


Whitespace and comments should be skipped, so whitespace is not described with TOKEN; a special kind of token, SPECIAL_TOKEN, is used instead:

SPECIAL_TOKEN: { <SPACES: ([" ", "\t", "\n", "\r", "\f"])+> }


[" ", "\t", "\n", "\r", "\f"] means any one of " " (space), "\t" (tab), "\n" (line feed), "\r" (carriage return) or "\f" (form feed), and the following "+" means one or more of these five characters.

Describing line comments

SPECIAL_TOKEN: {
<LINE_COMMENT: "//" (~["\n", "\r"])* ("\n" | "\r\n" | "\r")?>
}


The pattern described in the code above is a string starting with "//", followed by characters other than line breaks, and ending with a line break. In short, it is a string from "//" to the end of the line. The last line of a file may lack a line break, so the trailing line break is optional.

Describing block comments
The first thing to note is that the following pattern does not scan block comments correctly:

SKIP: { <"/*" (~[])* "*/"> }


Because of the longest-match principle, ordinary code may also be matched as part of a comment, for example:

/* This is the only line that should have been a comment */
int
main(int argc, char **argv)
{
printf("Hello, World!\n");
return 0;
}/* End with status 0 */


In that case the pattern "/*" (~[])* "*/" matches everything up to the last comment terminator.
To solve this problem, we modify the rules as follows, using state transitions:

MORE: { <"/*"> : IN_BLOCK_COMMENT }
<IN_BLOCK_COMMENT> MORE: { <~[]> }
<IN_BLOCK_COMMENT> SKIP: { <"*/"> : DEFAULT }


In the example above, IN_BLOCK_COMMENT is a scanner state. By using states, a rule can be made to apply to only part of the scan.
Let us explain how states are used, starting with line 1 of the example above:

MORE: { <"/*"> : IN_BLOCK_COMMENT }


Writing { pattern : STATE_NAME } in a rule definition means that after matching the pattern, the scanner migrates (transits) to the corresponding state; the example above migrates to the state called IN_BLOCK_COMMENT.
After the scanner migrates to a state, only the lexical rules specific to that state apply; in the example above, all rules other than those dedicated to the IN_BLOCK_COMMENT state become invalid. To define special rules for a state, prefix commands such as TOKEN with <STATE_NAME>, as follows:

<STATE_NAME> TOKEN: { ~ }
<STATE_NAME> SKIP: { ~ }
<STATE_NAME> SPECIAL_TOKEN: { ~ }


The DEFAULT state is the scanner's state when lexical analysis starts. Lexical rules that do not name a state belong to the DEFAULT state; that is, the reserved-word rules, identifier rules and line-comment rules defined so far are all in the DEFAULT state. <"*/"> : DEFAULT means: on matching the pattern "*/", return to the original state.

The MORE command means "the scan is not finished by matching this rule alone": input that enters this state must go on to match the full /* ... */ pattern, otherwise an error is reported.

Scanning string literals

MORE: { <"\""> : IN_STRING } // Rule 1
<IN_STRING> MORE: {
<(~["\"", "\\", "\n", "\r"])+> // Rule 2
| <"\\" (["0"-"7"]){3}> // Rule 3
| <"\\" ~[]> // Rule 4
}
<IN_STRING> TOKEN: { <STRING: "\""> : DEFAULT } // Rule 5


First, with the aid of state transitions, one token can be described by several rules: rule 1 scans the opening quote " and migrates to the IN_STRING state, in which only rules 2, 3 and 4 are valid. Second, all rules except the last one, rule 5, use the MORE command, so that one token is scanned by several rules: rules 2–4 accumulate the characters wrapped in the quotes (ordinary characters, three-digit octal escapes, and other backslash escapes), and rule 5 ends the token at the closing quote and returns to the DEFAULT state.

Test

A lexical analyzer based on automata

Regular expressions are used for the lexical analysis. The lexicon of the target language is as follows.

Input: the given source string. Output: a sequence of pairs (syn, token or sum), where syn is the type code of the word, token holds the string of the word itself, and sum is an integer constant.
The vocabulary of a language is:
1. Keywords
main
if then else
while do
repeat until
for from to step
switch of case default
return
integer real char bool
and or not mod
All keywords are lowercase.
2. Special symbols
Operators include: =, +, -, *, /, <, <=, >, >=, !=
Separators include: , ; : { } [ ] ( )
3. Other tokens: ID and NUM
These are defined by the following regular expressions:
ID→letter(letter | digit)*
NUM→digit digit*
letter→a | ... | z | A | ... | Z
digit→0|...|9
4. Spaces consist of spaces, tabs, and line breaks
Spaces, which separate IDs, NUMs, special symbols, and keywords, are discarded during the lexical analysis phase.
The textbook does not fix the category codes of the word symbols, so here they are numbered from 1 upward, in the order the word symbols are listed above.
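
The ID and NUM definitions above translate directly into java.util.regex patterns. The following quick sanity check (the class name is invented for illustration) scans a line and collects only the matches of these two definitions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Quick check of ID -> letter(letter|digit)* and NUM -> digit digit* with java.util.regex
public class TokenPatternDemo {
    static final Pattern TOKEN = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*|[0-9]+");

    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) out.add(m.group()); // operators and separators fall through unmatched
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("sum1 = x + 42")); // prints [sum1, x, 42]
    }
}
```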


Lexical analysis word symbol Token

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

/**
 * Lexical analysis - symbolic representation of words
 * Abstract class, parent of the word categories, maintaining the mapping between symbols and category codes
 */
public abstract class Token {
//End of file
public static final Token EOF = new Token(-1) {
};
//Indicates the end of each line, which is the line break \ n
public static final String EOL = "\\n";
/**
 * Regular expressions matching the different word classes
 * Each alternative ends with "|", so they must be concatenated in this order
 */
public static final String KEYWORD_REGEX = "main|if|then|else|while|do|repeat|until|for|" +
"from|to|step|switch|of|case|default|return|integer|real|char|bool|and|or|not|mod|";//Match keyword
public static final String OPERATOR_REGEX = "<=|>=|!=|=|\\+|\\-|\\*|\\/|<|>|";//Match operator; longer ones first so "<=" is not split into "<" and "="
public static final String SEPARATOR_REGEX = "[,;:{}\\[\\]()]|";//Match separator
public static final String ID_REGEX = "[a-zA-Z][a-zA-Z0-9]*|";//Match identifier
public static final String NUM_REGEX = "[0-9]+";//Match integer constant

/**
* Define the mapping between word symbols and category codes
* All information is stored in the file in order, read the file and load it into the static class
*/
public static Map<String, Integer> tokenTypeMap = new HashMap<>();
//Location of the mapped profile
public static String mapConfigPath = new File("").getAbsolutePath() + "/tokenTypeMap.config";

static {
try {
Scanner in = new Scanner(new BufferedInputStream(new FileInputStream(mapConfigPath)));
int ite = 1;
String res = in.hasNext() ? in.next() : null;
while (res != null) {
if (!(res.equals("") || res.charAt(0) == ' ' || res.charAt(0) == '#')) {
tokenTypeMap.put(res, ite++);
}
res = in.hasNext() ? in.next() : null;
}

} catch (Exception e) {
e.printStackTrace();
}
}

private int lineNumber;//The line number of the word symbol

public Token(int line) {
this.lineNumber = line;
}

}


The configuration file read above lists the fixed symbols such as keywords and operators; they are read into memory and assigned category codes in order.

tokenTypeMap.config

#Keyword
main
if then else
while do
repeat until
for from to step
switch of case default
return
integer real char bool
and or not mod
#operator
= + - * / < <= > >= !=
#Separator
, ; : { } [ ] ( )
#Identifier and constant value
ID NUM


Lexical analysis output TokenRecord

/**
* Output of lexical analysis
* (Word symbol code, word symbol attribute value)
*/
public class TokenRecord extends Token {
public TokenRecord(int line) {
super(line);
}

private int flagCode;//Identification code
private String stringValue;//Character value
private String numValue;//numerical value

public int getFlagCode() {
return flagCode;
}

public String getNumValue() {
return numValue;
}

public String getStringValue() {
return stringValue;
}

public void setFlagCode(int flagCode) {
this.flagCode = flagCode;
}

public void setNumValue(String numValue) {
this.numValue = numValue;
}

public void setStringValue(String stringValue) {
this.stringValue = stringValue;
}
}


Lexical analysis compilation exception CompileException

/**
* Compilation errors in lexical analysis
* Throw exception
*/
public class CompileException extends Exception {

public int errorLine;//Wrong line number
public String errorReason;//Reason for the error

public CompileException(int errorLine, String errorReason) {
this.errorLine = errorLine;
this.errorReason = errorReason;
}

@Override
public String toString() {
return "Line " + errorLine + ": " + errorReason;
}
}


Preprocessor

import java.io.File;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Scanner;

/**
 * Preprocessor: deletes blank lines, spaces and comments in the program
 */
public class Preprocessor {
/**
* Read the specified program file for preprocessing
*
* @param file
* @return
*/
public static LinkedHashMap<Integer, String> preprocess(File file) throws CompileException {
LinkedHashMap<Integer, String> res = new LinkedHashMap<>();//Result: line number -> processed line content
boolean blockStatus = false;//In block comment or not
int lineNumber = 1;
try {
Scanner scanner = new Scanner(new FileReader(file));
String lineInfo = scanner.hasNextLine() ? scanner.nextLine() : null;
//Process each line
while (lineInfo != null) {
StringBuilder lineProcessValue = new StringBuilder();
for (int i = 0; i < lineInfo.length(); i++) {
//In block notes
if (blockStatus) {
if (i + 1 < lineInfo.length() && lineInfo.charAt(i) == '*' && lineInfo.charAt(i + 1) == '/') {
i++;
blockStatus = false;
}
continue;
}
if (lineInfo.charAt(i) == ' ' || lineInfo.charAt(i) == '\n') continue;//Space or newline omitted
if (i + 1 < lineInfo.length() && lineInfo.charAt(i) == '/' && lineInfo.charAt(i + 1) == '/')
break;//Line exit in case of line comment
if (i + 1 < lineInfo.length() && lineInfo.charAt(i) == '/' && lineInfo.charAt(i + 1) == '*') {//Block comment encountered, identify
i++;
blockStatus = true;
continue;
}
lineProcessValue.append(lineInfo.charAt(i));
}
lineInfo = scanner.hasNextLine() ? scanner.nextLine() : null;
if (!lineProcessValue.toString().equals("")) res.put(lineNumber, lineProcessValue.toString());
lineNumber++;
}
} catch (Exception e) {
e.printStackTrace();
}

//Block comment not closed, throw exception
//TODO string literals "..." and character literals '.' are not defined here; if "/*" or "//" appeared inside a string literal the rules above would be wrong, but that case is not checked
if (blockStatus) throw new CompileException(lineNumber, "Unclosed block comment");

return res;
}
}


Lexer

import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Lexical analyzer: reads the specified file and returns the word symbol tuples
 */
public class Lexer {
//Canonical form
public static final String regex = Token.KEYWORD_REGEX + Token.OPERATOR_REGEX +
Token.SEPARATOR_REGEX + Token.ID_REGEX + Token.NUM_REGEX;

/**
* Conduct lexical analysis
*
* @param file source file
* @return (Category code, word symbol or value) binary
*/
public List<TokenRecord> lex(File file) {
List<TokenRecord> tuple = new ArrayList<>();
try {
//Preprocess first: strip spaces, blank lines and comments, keeping line numbers
LinkedHashMap<Integer, String> map = Preprocessor.preprocess(file);
//Regular match data per row
Pattern pattern = Pattern.compile(regex);
for (int lineNumber : map.keySet()) {
String string = map.get(lineNumber);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
TokenRecord tokenRecord = new TokenRecord(lineNumber);
String match = matcher.group();
if (Token.tokenTypeMap.containsKey(match)) {
System.out.println("(" + match + ",-)");
tokenRecord.setFlagCode(Token.tokenTypeMap.get(match));
tokenRecord.setStringValue("-");
} else if ('0' <= match.charAt(0) && match.charAt(0) <= '9') {
System.out.println("(NUM," + match + ")");
tokenRecord.setFlagCode(Token.tokenTypeMap.get("NUM"));
tokenRecord.setStringValue(match);
} else {
System.out.println("(ID," + match + ")");
tokenRecord.setFlagCode(Token.tokenTypeMap.get("ID"));
tokenRecord.setStringValue(match);
}
tuple.add(tokenRecord);//Record the tuple
}
}
} catch (CompileException e) {//Compile exception
e.printStackTrace();
return null;
}
return tuple;
}
}


test

import java.io.File;

/**
 * Lexical analyzer test
 */
public class LexerTest {
public static void main(String[] args) {
try {
File file = new File(new File("").getAbsoluteFile() + "/test.txt");
Lexer lexer = new Lexer();
for (TokenRecord tokenRecord : lexer.lex(file)) {
System.out.println("(" + tokenRecord.getFlagCode() + "," + tokenRecord.getStringValue() + ")");
}

} catch (Exception e) {
e.printStackTrace();
}
}
}


The test file is test.txt

integer main(){
integer i=0;//Line notes
while(i<100)i++;/*Block annotation
*/
return i;
}


NFA to DFA

The object models of DFA and NFA are designed to support their basic operations (input and output). On top of them, a method is designed to convert an NFA into an equivalent DFA (the subset construction).

The test file structure is:
1) The number of states is stateNum; states are numbered 0..(stateNum-1);
2) The number of symbols is symbolNum; symbols are numbered 1..symbolNum, and the symbol numbered 0 stands for epsilon;
3) The following lines are state transitions, one per line, each ending with -1;
Transition format: state, symbol (may be 0, i.e. epsilon), then the target states, ending with -1;
4) The start state set, ended by -1;
5) The end state set, ended by -1.
[example]
11
2

0 0 1 7 -1
1 0 2 4 -1
2 1 3 -1
3 0 6 -1
4 2 5 -1
5 0 6 -1
6 0 1 7 -1
7 1 8 -1
8 2 9 -1
9 2 10 -1
-1

0 -1
10 -1

Output: the determinized DFA, described as:
Number of states: 5
Number of character tables: 2
State transition:
(0,1)->1
(0,2)->2
(1,1)->1
(1,2)->3
(2,1)->1
(2,2)->2
(3,1)->1
(3,2)->4
(4,1)->1
(4,2)->2
Start status: 0
End state set [4]
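
Before reading the implementation below, the epsilon-closure step of the subset construction can be tried on this example in isolation. The sketch below (the class name is invented; symbol 0 stands for epsilon, as in the file format) hard-codes the epsilon edges of the example NFA and computes closure({0}), which should be {0, 1, 2, 4, 7}:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Epsilon-closure on the example NFA above (only its symbol-0 edges are needed)
public class ClosureDemo {
    static final Map<Integer, List<Integer>> EPS = Map.of(
            0, List.of(1, 7),
            1, List.of(2, 4),
            3, List.of(6),
            5, List.of(6),
            6, List.of(1, 7));

    public static List<Integer> closure(Collection<Integer> states) {
        Set<Integer> res = new TreeSet<>(states);          // sorted result set
        Deque<Integer> stack = new ArrayDeque<>(states);   // standard worklist
        while (!stack.isEmpty()) {
            for (int t : EPS.getOrDefault(stack.pop(), List.of())) {
                if (res.add(t)) stack.push(t);             // only revisit newly found states
            }
        }
        return new ArrayList<>(res);
    }

    public static void main(String[] args) {
        System.out.println(closure(List.of(0))); // prints [0, 1, 2, 4, 7]
    }
}
```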


DFA description

import java.util.ArrayList;
import java.util.List;

/**
 * @author LSL
 */
public class DFA {
private List<Integer> statusList;//State set
private List<Integer> symbolList;//character set
private List<Function> functionList;//State transition set
private int begin;//Initial state
private List<Integer> endList;//Final state set

public DFA(){
statusList=new ArrayList<>();
symbolList=new ArrayList<>();
functionList=new ArrayList<>();
begin=0;
endList=new ArrayList<>();
}

/**
* State saving, int [] [] is not used here
*/
static class Function{
private int state;//state
private int symbol;//Symbol
private int convertState;//Status after conversion

public Function(int state,int symbol,int convertState){
this.state=state;
this.symbol=symbol;
this.convertState=convertState;
}

@Override
public boolean equals(Object object){
if(!(object instanceof Function))return false;
return state==((Function) object).state && symbol==((Function) object).symbol;
}

public int getConvertState() {
return convertState;
}
}

@Override
public String toString(){
StringBuilder res=new StringBuilder("Number of states:"+statusList.size()+"\n"+
"Number of character tables:"+(symbolList.size()-1)+"\n"+//Character table contains 0, one should be subtracted here
"State transition:\n");
//sort
functionList.sort((Function f1,Function f2)->{
if(f1.state==f2.state)return f1.symbol-f2.symbol;
else return f1.state-f2.state;
});
//Output function
for (Function function:functionList){
res.append("(").append(function.state).append(",").append(function.symbol).
append(")->").append(function.getConvertState()).append("\n");
}
res.append("Start status:").append(begin).append("\n");
res.append("End state set").append(endList.toString()).append("\n");
return res.toString();
}

//getter and setter... Omitted here

public void addConvertState(int state, int symbol, int convertState){
Function function=new Function(state,symbol,convertState);
if(!functionList.contains(function))functionList.add(function);//equals compares state and symbol only, so duplicate transitions are skipped
}
}


NFA description

import java.io.File;
import java.util.*;

/**
 * @author LSL
 */
public class NFA {
private List<Integer> statusList=new ArrayList<>();//State set
private List<Integer> symbolList=new ArrayList<>();//character set
private List<FunctionExtension> functionList=new ArrayList<>();//State transition set
private List<Integer> beginList=new ArrayList<>();//Initial state set
private List<Integer> endList=new ArrayList<>();//Final state set

/**
 * Build the NFA from a test file in the format described above
 * @param file the input file
 */
public NFA(File file){
try {
Scanner scanner = new Scanner(file);
int stateNum=scanner.nextInt();
int symbolNum=scanner.nextInt();
//States are numbered 0..stateNum-1; symbols 1..symbolNum, with 0 reserved for epsilon
for(int i=0;i<stateNum;i++)statusList.add(i);
for(int i=0;i<=symbolNum;i++)symbolList.add(i);

String line=scanner.nextLine();
while (line.equals(""))line=scanner.nextLine();//Avoid redundant blank lines
while (!line.equals("-1")){
String[] num=line.split(" ");
FunctionExtension extension=new FunctionExtension(Integer.parseInt(num[0]),Integer.parseInt(num[1]));//Transformation
for(int j=2;j<num.length-1;j++){//The trailing -1 is dropped
extension.convertStateList.add(Integer.parseInt(num[j]));
}
functionList.add(extension);
line=scanner.nextLine();
}

line=scanner.nextLine();
while (line.equals(""))line=scanner.nextLine();//Avoid redundant blank lines
String[] num=line.split(" ");
for(int j=0;j<num.length-1;j++)beginList.add(Integer.parseInt(num[j]));//Start state set, -1 ends it
line=scanner.nextLine();
while (line.equals(""))line=scanner.nextLine();//Avoid redundant blank lines
num=line.split(" ");
for(int j=0;j<num.length-1;j++)endList.add(Integer.parseInt(num[j]));//End state set, -1 ends it
//TODO can be checked by the number of States and symbols to see if the reading is correct
}catch (Exception e){
e.printStackTrace();
}
}

/**
* State saving, int [] [] is not used here
*/
static class FunctionExtension{
private int state;//state
private int symbol;//Symbol
private List<Integer> convertStateList=new ArrayList<>();//Status after conversion

public FunctionExtension(int state,int symbol){
this.state=state;
this.symbol=symbol;
}
}

/**
* Convert the NFA to DFA
* @return DFA after conversion
*/
public DFA convertToDFA(){
DFA res=new DFA();

Map<List<Integer>,Integer> convertMap=new LinkedHashMap<>();//List of unlabeled transformation maps
Map<List<Integer>,Boolean> convertMapFlag=new LinkedHashMap<>();//Is the map marked
int stateId=0;//Status ID (index)
List<Integer> init=Closure(beginList);
convertMap.put(init,stateId++);//DFA index map NFA state set
convertMapFlag.put(init,false);

while (true){
List<Integer> T=null;
//Look for unmarked status
for(List<Integer> flag:convertMapFlag.keySet()){
if(!convertMapFlag.get(flag)){
T=flag;//Unmarked status found
break;
}
}
if(T==null)break;
convertMapFlag.put(T,true);//Mark T as processed

//Compute one transition for each input symbol
for(int symbol:symbolList){
//0 represents epsilon, which is not an input symbol, so it is skipped here
if(symbol==0)continue;
//Get f(T,symbol)=f(status1,symbol) U f(status2,symbol) U ...
Set<Integer> tmp=new HashSet<>();//Set removes duplicates
for(int status:T){
tmp.addAll(getFunctionConvert(status,symbol));
}
List<Integer> U=Closure(new ArrayList<>(tmp));//Get close (f (T, symbol))
//Judge whether the U set is already in the marked state set mapping, and compare the list < integer >
boolean find=false;
List<Integer> UCopy = null;
for(List<Integer> flagList:new ArrayList<>(convertMap.keySet())){
if(sortListEquals(U,flagList)){
find=true;
UCopy=flagList;//Equivalent to U
break;
}
}
//Skip if there is no transition from T on this symbol
if(U.isEmpty())continue;
//Add if U is not in the mapping set
if(!find){
convertMap.put(U,stateId++);
convertMapFlag.put(U,false);//Not marked yet
UCopy=U;
}
//Set the DFA transition f'(T,symbol)=U
res.addConvertState(convertMap.get(T),symbol,convertMap.get(UCopy));
}
}

//Fill in the DFA's state set, symbol table, initial state and final state set
for(List<Integer> list:convertMap.keySet()){//Each NFA state set becomes one DFA state
int id=convertMap.get(list);
res.getStatusList().add(id);
for(int end:endList){//A DFA state is final if its set contains an NFA final state
if(list.contains(end)){res.getEndList().add(id);break;}
}
}
res.getSymbolList().addAll(symbolList);//Same symbol table; 0 still stands for epsilon
res.setBegin(0);//The initial state must be zero

return res;
}

/**
* Starting from a given set of States, searching for reachable nodes
* @param stateList Initial state set
* @return All reachable node sets from the nodes in the set
*/
private List<Integer> Closure(List<Integer> stateList){
//Initialize closure(state) with a copy of the given state collection
List<Integer> res=new ArrayList<>(stateList);
//Push all states onto the stack
Stack<Integer> stack=new Stack<>();
stack.addAll(stateList);

while (!stack.empty()){
int node=stack.pop();//Pop a state
//Search the transition list for epsilon moves from this state
for (FunctionExtension extension:functionList){
if(extension.state==node&&extension.symbol==0){//0 for epsilon
for(int endState:extension.convertStateList){//Reachable states
if(!res.contains(endState)){
res.add(endState);//New reachable state, record it
stack.push(endState);
}
}
}
}
}

res.sort(Comparator.comparingInt(num->num));//Sort returned results
return res;
}

/**
* Given the state and symbol, return the transformed state set
* @param state
* @param symbol
* @return
*/
public List<Integer> getFunctionConvert(int state,int symbol){
for(FunctionExtension extension:functionList){
if(extension.state==state && extension.symbol==symbol)return extension.convertStateList;
}
return new ArrayList<>();
}

/**
* Determine whether the elements of the two lists are identical, and the lists are sorted
* @param a
* @param b
* @return
*/
public boolean sortListEquals(List<Integer> a,List<Integer> b){
if(a.size()!=b.size())return false;
for(int i=0;i<a.size();i++){
if(!a.get(i).equals(b.get(i)))return false;
}
return true;
}

@Override
public String toString(){
StringBuilder res=new StringBuilder(statusList.toString()+symbolList.toString()+beginList.toString()+endList.toString());
for(FunctionExtension extension:functionList){
res.append("\n").append(extension.state).append(" ").append(extension.symbol).append(" ").append(extension.convertStateList);
}
return res.toString();
}
}


test

import java.io.File;

public class NfaToDfaTest {
public static void main(String[] args){
File file = new File(new File("").getAbsoluteFile() + "/test.txt");
NFA nfa=new NFA(file);
System.out.println(nfa.convertToDFA().toString());
}
}


Posted on Mon, 16 Mar 2020 09:37:27 -0400 by drimades