C++: a lexical analyzer for the Compiler Principles experiment

1, Experimental purpose

  1. Learn to convert a DFA transition diagram into the corresponding high-level-language source program (see the small sketch after this list).
  2. Deepen the understanding of state transition diagrams and gradually come to understand finite automata.
  3. Master the method of hand-writing a lexical analyzer and understand its internal working principle.
  4. Strengthen mastery of the C language.
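
As a small illustration of purpose 1 (this sketch is mine, not part of the original experiment), the two states of an identifier DFA, a start state that tests the first character and an "in identifier" state that loops over the following characters, map directly onto an if statement and a while loop. The function name scan_identifier and its parameters are invented for illustration only:

#include <cctype>
#include <string>

// Start state: accept only a letter or underscore as the first character.
// "In identifier" state: loop while letters, digits or underscores follow.
// Returns the recognized identifier, or an empty string if none starts at pos.
std::string scan_identifier(const std::string &src, size_t &pos)
{
    std::string word;
    if (pos < src.size() && (std::isalpha((unsigned char)src[pos]) || src[pos] == '_'))
    {
        while (pos < src.size() &&
               (std::isalnum((unsigned char)src[pos]) || src[pos] == '_'))
            word += src[pos++];
    }
    return word;
}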

2, Experimental content

The lexical analysis part of a compiler for a C-like language is implemented.

The analyzer scans the symbols of each line of the source program from left to right, assembles them into words, replaces each word with a unified internal token, and passes the tokens on to the syntax analyzer.

To simplify the program, the specific requirements are as follows:

  1. Whitespace consists only of spaces, carriage returns/line feeds, and tabs.
  2. The code is free-form.
  3. Comments are written as // line comments or /* */ block comments, and nesting is not allowed.

Words in the C language

Reserved words: if, else, while, do, main, int, float, double, return, const, void, continue, break, char, unsigned, enum, long, switch, case, auto, static

Special symbols: + - * / = < { } ; ( ) ' " == != && || > >= <=

Other word classes:

  1. identifier: the first character is a letter or an underscore; the remaining characters are letters, digits, or underscores.
  2. number: its attributes include a type attribute (integer, floating point, etc.), a base attribute (decimal, octal, hexadecimal), and a suffix attribute (auxiliary type information such as short or long), for example 123L (see the sketch after this table).
  3. character constant: any character in the source program except ', backslash \, and newline, for example 'a'.
  4. string constant: any character in the source program except ", backslash \, and newline, for example "a".
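
The table above also allows numeric suffixes such as 123L, which the program later in this post does not handle. As a hedged sketch only (the helper name scan_number_suffix is invented here), the suffix letters could be consumed after the digits like this:

#include <cctype>
#include <string>

// Sketch: after the digits (and optional fractional part) of a number have been
// read, consume optional suffix letters such as L, U or F and append them to
// the token text. The full program below simply ignores suffixes.
void scan_number_suffix(const std::string &src, size_t &pos, std::string &token)
{
    while (pos < src.size())
    {
        char c = (char)std::toupper((unsigned char)src[pos]);
        if (c == 'L' || c == 'U' || c == 'F')
            token += src[pos++];
        else
            break;
    }
}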

3, Experimental requirements

The following functions are required:

  1. Assemble the words according to the lexical rules and convert them into two-tuple form <type code, attribute value>, as illustrated after this list.
  2. Delete comments.
  3. Delete whitespace (spaces, carriage returns/line feeds, tabs).
  4. Display the source program, prefix each line with its line number, and print the two-tuples of the tokens contained in each line.
  5. Find and report the location of errors.
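
For example, using the type codes assigned in the code later in this post (23 for a reserved word, 22 for an identifier, 9 for '=', 24 for an integer and 4 for ';'), the statement below would be converted roughly as follows:

int a = 10;   -->   <23, int>  <22, a>  <9, =>  <24, 10>  <4, ;>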

Specific requirements for the lexical analysis:

  1. Give detailed state transition diagrams for identifiers, numbers, character constants and string constants.
  2. Draw the state transition diagram for whitespace characters, the state transition diagram for the newline character, and the state transition diagram for comments (// and /* */).
  3. The lexical analysis proper is implemented as a function GetToken(). Each call analyzes the remaining input to obtain one word (token), identifies its type, and collects the token's character-string attribute. When a word is recognized, the type of the symbol is returned as the return value, and the attribute value of the currently recognized token is provided through a program variable.
  4. Identifiers and reserved words have the same lexical form. For a simpler implementation, the reserved words of the language are stored in a table. Recognition of reserved words can then be placed after identifier recognition: the recognized identifier is compared against the table; if it appears there it is a reserved word, otherwise it is an ordinary identifier (a minimal sketch of this lookup follows this list).
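
A minimal sketch of the reserved-word table lookup described in item 4 (the names is_reserved and RESERVED are invented here; the full program below does the same thing with its own keyword array):

#include <cstddef>
#include <cstring>

// Reserved-word table; a spelled identifier is looked up here afterwards.
static const char *RESERVED[] = {"if", "else", "while", "do", "int", "float", "return"};

// Return true if the word matches a table entry, i.e. it is a reserved word
// rather than an ordinary identifier.
bool is_reserved(const char *word)
{
    for (std::size_t i = 0; i < sizeof(RESERVED) / sizeof(RESERVED[0]); ++i)
        if (std::strcmp(word, RESERVED[i]) == 0)
            return true;
    return false;
}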

4, Algorithm analysis

         In lexical analysis, recognizing the next word is simply the process of reading characters one by one and assembling them into a word. The job of the lexical analyzer is to obtain a meaningful word symbol while assembling characters, that is, to determine each word's type as well as the value of the word itself.

  1. This experiment uses C++. In the main function main(), the fopen() function opens the test file (for example test.txt) and the fgetc() function reads the pre-written test sample character by character into the prog array.
  2. Define the keyword array keyword to store common reserved words (only some reserved words are stored; the table can be extended).
  3. Define the global variable line to record the current line number, so that the location of a lexical error can be reported quickly.
  4. Then read the characters in order and recognize meaningful word symbols according to the lexical rules of the language by calling GetToken(). If the current character is a space, carriage return or tab, it is skipped and -1 is returned. If the current character is '/', the analyzer checks further whether it starts a '//' or '/* */' comment. If the current character is a letter, the word is treated as an identifier by default and compared with the keyword array; if it matches, it is a keyword. If the current character is a digit, the word is treated as an integer by default; if a '.' is encountered, it is changed to floating point. Any other character is matched with a switch(); if no match succeeds, -2 is returned to signal a lexical error. This repeats until the input string is exhausted.

code:

#include<iostream>
#include<cstdio>
#include<cstring>
using namespace std;
//prog stores the program read from the file, token stores the spelling of the current word
char token[32], prog[1000], ch;
//p is the current position in prog, sym is the type code of the word, line is the current line number 
int p = 0, sym = 0, n = 0, line = 1; 
//Name of the file to read 
char filename[30];
/*FILE is a structure type defined by the standard library; fpincontent is a pointer to a FILE structure.
Through fpincontent the program can locate the information about the opened file and operate on it */
FILE *fpincontent; 
//Reserved-word table 
const char *keyword[22] = {"if", "else", "while", "do", "main", "int", "float", "double", "return", "const", "void", "continue", "break", "char", "unsigned", "enum", "long", "switch", "case", "unsigned", "auto", "static"};

void GetToken()
{
  	// Clear the token buffer
    for(n = 0; n < 32; n++) 
    	token[n] = '\0';
  	n = 0;
  	//Read character 
  	ch = prog[p++];
  	
  	// Skip whitespace (space, newline, tab) and keep track of the line number
  	while(ch == ' ' || ch == '\n' || ch == '\t') 
  	{
	    //On '\n', increment the line counter
		if (ch == '\n') 
		{ 
	      	line++;
	    } 
	    //Ignore the whitespace character and read the next one 
	    ch = prog[p++];
  	}
  	
  	//Stop cleanly at the end of the input 
  	if(ch == EOF)
  	{
  		p--;       //leave p on the EOF marker so the caller's loop terminates 
  		sym = -1;
  		return;
  	}
  	
  	//Handle comments; a lone '/' is the division operator 
  	if (ch == '/')
  	{
  		ch = prog[p++];
		if(ch == '/')   // "//" line comment: skip to the end of the line 
    	{
    		do 
			{
      			ch = prog[p++];
    		} while(ch != '\n' && ch != EOF);
    		if(ch == '\n')
    			line++;
    		else
    			p--;        //comment ran to the end of the input 
    		sym = -1;
		}
		else if(ch == '*')  // "/* */" block comment: skip to the closing "*/" 
		{
			do 
			{
      			ch = prog[p++];
      			if(ch == '\n')
      				line++;
    		} while(!(ch == '*' && prog[p] == '/') && ch != EOF);
    		if(ch == EOF)
    		{
    			p--;
    			sym = -2;   //unterminated comment 
			}
			else
			{
				p++;        //consume the closing '/' 
				sym = -1;
			}
		}
		else                //a single '/' is the division operator 
		{
			p--;
			sym = 3;
			token[0] = '/';
		}
  	} 
  	
  	//Recognize identifiers and reserved words 
  	else if((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_') 
	{
	    sym = 22;  //identifier  
	    do 
		{
	      	if(n < 31)      //leave room for the terminating '\0' 
	      		token[n++] = ch;
	      	ch = prog[p++];
	    } while((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_');
	    for(n = 0; n < 22; n++) //compare with the reserved-word table 
		{
	      	if(strcmp(token, keyword[n]) == 0) 
	        	sym = 23;   //reserved word 
	    }
	    p--;
  	}
  	
  	//Identification number 
  	else if (ch >= '0' && ch <= '9')
	{
    	sym = 24;   //integer 
    	do 
		{
	      	token[n++] = ch;
	      	ch = prog[p++];
    	} while(ch >= '0' && ch <= '9');
    	if(ch=='.')  //floating-point number 
    	{
    		do 
			{
	      		token[n++] = ch;
	      		ch = prog[p++];
    		} while(ch >= '0' && ch <= '9');
			sym = 25;
		}
    	p--;
  	}
  	else 
	{
    	switch(ch) 
		{
	      	case '+': sym = 0; token[0] = ch; break; //plus 
	      	case '-': sym = 1; token[0] = ch; break; //minus 
	     	case '*': sym = 2; token[0] = ch; break; //multiply 
	     	case '/': sym = 3; token[0] = ch; break; //divide 
	     	case ';': sym = 4; token[0] = ch; break; //end of statement 
	     	case '(': sym = 5; token[0] = ch; break;
	    	case ')': sym = 6; token[0] = ch; break;
	      	case '\'': sym = 7; token[0] = ch; break;
	      	case '\"': sym = 8; token[0] = ch; break;
	      	case '=': 
			{
		        sym = 9;  //assignment 
		        token[0] = ch;
		        ch = prog[p++];
		        if(ch == '=') 
				{
		          	sym = 10; //equality (==) 
		          	token[1] = ch;
		        }else {
		          p--;
		        }
		        break;
		      }
       		case '<':  
	   		{
		    	sym = 11; //Less than sign 
		        token[0] = ch;
		        ch = prog[p++];
		        if(ch == '=') 
				{
		          	sym = 12;//Less than or equal to 
		          	token[1] = ch;
		        }else {
		         	p--;
		        }
		        break;
		    }
      		case '>':  
			{ 
		        sym = 13;  //Greater than sign 
		        token[0] = ch;
		        ch = prog[p++];
		        if(ch == '=') //Greater than or equal to 
				{
		          	sym = 14;
		          	token[1] = ch;
		        }else {
		         	p--;
		        }
		        break;
		      }
      		case '!': 
	  		{
		        token[0] = ch;
		        ch = prog[p++];
		        if(ch == '=') 
				{
		          	sym = 15; //Not equal to 
		          	token[1] = ch;
		        }else {
		          	p--;
		          	sym = -2;
		        }
		        break;
		    }
      		case '&': 
			{
		        token[0] = ch;
		        ch = prog[p++];
		        if(ch == '&')  
				{
		          	sym = 16;  //And 
		          	token[1] = ch;
		        }else {
		         	p--;
		          	sym = -2;
		        }
		        break;
		      }
      		case '|':    
			{
		        token[0] = ch;
		        ch = prog[p++];
		        if(ch == '|') 
				{
		          	sym = 17; //or 
		          	token[1] = ch;
		        }else {
		          	p--;
		          	sym = -2;
		        }
		        break;
		    }
	  		case '#': sym = 18; token[0] = ch; break;
      		case '[': sym = 19; token[0] = ch; break;
      		case ']': sym = 20; token[0] = ch; break;
      		case ',': sym = 21; token[0] = ch; break;
      		case '{': sym = 26; token[0] = ch; break;
      		case '}': sym = 27; token[0] = ch; break;
      		default: 
			{
		        sym = -2;
		        break;
		    }
        }
    }
}

int main()
{
  	p = 0;
  	cout<< "read something from : " << endl<<"                      ";
  	for(;;)
	{
		cin>>filename;
  		//Open the file with fopen function, and "r" is in read-only mode
 		fpincontent = fopen(filename,"r");  
    	if(fpincontent!=NULL)
			break;
		else
			cout<<"File path error, please enter the source file name: "< < endl < <";	
	}
  	// Read the file into the prog array
  	cout<<"Source program: "< < endl; 
  	do
  	{
  		//fgetc() reads one character
    	ch = fgetc(fpincontent); 
    	prog[p++] = ch;
    	cout<<ch;
   	}while(ch != EOF);
   	cout<<endl<<endl; 
   
  	p = 0;
  	// Assemble the words according to the rules and convert them into two-tuple form
  	do
	{
  		/*Each call analyzes the remaining string to obtain a word or token, identifies its type, and collects the symbol string attributes of the token,
		When a word is recognized, the type of symbol is returned in the form of return value, and the attribute value of the currently recognized token is provided in the form of program variable.*/
	    GetToken();
		switch(sym) 
		{
	      	case -1: break;  //whitespace or comment: nothing to report 
	      	case -2: //lexical error 
	      	{
	      		cout<<"Lexical analysis error at line "<<line<<endl;
	      		break;
			}
	      	default: cout<<"Line "<<line<<": <"<<sym<<" , "<<token<<">"<<endl; 
	    }
    }while(prog[p] != EOF);
    
    cout<<endl<<endl; 
    p=0;
    cout<<"After modification:"<<endl;
    do
	{ 
    	GetToken();
		cout<<token;
   	} while(prog[p]!=EOF);
    return 0;
} 
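
As a quick check, a possible test.txt (this sample input and the listing are mine, not from the original post) and the kind of token listing the program would print for it, using the type codes above:

test.txt:
    int a = 10;
    a = a + 1;

Token listing (roughly):
    Line 1: <23 , int>
    Line 1: <22 , a>
    Line 1: <9 , =>
    Line 1: <24 , 10>
    Line 1: <4 , ;>
    Line 2: <22 , a>
    Line 2: <9 , =>
    Line 2: <22 , a>
    Line 2: <0 , +>
    Line 2: <24 , 1>
    Line 2: <4 , ;>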
