PostgreSQL source code analysis - storage management - external storage management



After analyzing the structure of PostgreSQL table files last time, we now turn to smgr, the storage manager, for the first time. This time we make a concrete analysis with the help of md.c, the magnetic disk storage manager implemented by PostgreSQL.

Source code analysis

Segmentation mechanism

Causes of segmentation mechanism
The comments in md.c mention the segmentation mechanism. Some tables hold so much data that the table file becomes very large, which may exceed the file size limit of the operating system and can also make SQL statements less efficient. Segmentation also works together with the virtual file descriptor mechanism described below. For these reasons, when a table file grows past a threshold (1GB by default), it is divided into multiple segments. The segment size limit is defined by two constants in the header file src/include/pg_config.h.

#define BLCKSZ 8192 /* size of a disk block (page) in bytes; this also limits the size of a tuple */
#define RELSEG_SIZE 131072 /* maximum number of disk blocks (pages) stored in one segment file */

Multiplying these two constants gives the maximum size of a segment file. 8192 bytes (8KB) is 2^13 and 131072 is 2^17, so the product is 2^30 bytes, that is, 1GB. Therefore, the default maximum size of a PostgreSQL segment file is 1GB. Because some systems limit the file size to 2GB or 4GB, the author chose 1GB to stay safely below those limits and keep the code portable.
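The arithmetic above can be checked with a small sketch. It assumes the two default constants quoted from pg_config.h; the BlockLocation type and the helper names segment_bytes and locate_block are hypothetical and only illustrate how a block number could be split into a segment number and a byte offset, not how md.c is actually written.

```c
#include <assert.h>
#include <stdint.h>

/* Default values from pg_config.h, as quoted above. */
#define BLCKSZ      8192
#define RELSEG_SIZE 131072

/* Hypothetical helper type: where a block lives on disk. */
typedef struct BlockLocation
{
    uint32_t segno;     /* which segment file holds the block */
    uint64_t offset;    /* byte offset of the block inside that file */
} BlockLocation;

/* Size of one full segment file in bytes (2^13 * 2^17 = 2^30). */
static uint64_t
segment_bytes(void)
{
    return (uint64_t) BLCKSZ * RELSEG_SIZE;
}

/* Map a relation-wide block number to its segment and byte offset. */
static BlockLocation
locate_block(uint32_t blkno)
{
    BlockLocation loc;

    loc.segno = blkno / RELSEG_SIZE;
    loc.offset = (uint64_t) (blkno % RELSEG_SIZE) * BLCKSZ;
    assert(loc.offset < segment_bytes());
    return loc;
}
```

With the defaults, block 131072 is the first block of segment 1, so locate_block(131072) yields segment number 1 and offset 0.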
Naming of segmented files
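In the last post we saw that table files are named after the OID (more precisely, the relfilenode) of the table. Segment files follow the same scheme: the first segment keeps the plain name, and every later segment appends a dot and its segment number. For example, if the file name of a table is 1000, its segments are named 1000, 1000.1, 1000.2, and so on.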
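As a rough illustration of the naming scheme, the sketch below builds a segment file name from a base name and a segment number. It assumes the convention that segment 0 carries no suffix and that later segments append ".N"; the function segment_file_name is a hypothetical helper, not a PostgreSQL routine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the file name of segment `segno` into `buf`,
 * for a relation whose first segment file is called `base` (e.g. "1000").
 * Returns the value of snprintf, i.e. the length the name needs.
 */
static int
segment_file_name(char *buf, size_t buflen, const char *base, unsigned segno)
{
    if (segno == 0)
        return snprintf(buf, buflen, "%s", base);       /* no suffix */
    return snprintf(buf, buflen, "%s.%u", base, segno); /* base.N */
}
```

Calling segment_file_name(buf, sizeof(buf), "1000", 2) fills buf with 1000.2.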

Disk management

The first is a data structure related to file storage.

typedef struct _MdfdVec
{
	File		mdfd_vfd;		/* VFD of the segment file */
	BlockNumber mdfd_segno;		/* Segment number of the file, starting from 0 */
} MdfdVec;

static MemoryContext MdCxt;		/* Memory context for all MdfdVec objects */

The name MdfdVec stands for magnetic disk file descriptor vector, and vfd in the code stands for virtual file descriptor. Let's analyze the VFD mechanism first.

VFD (virtual file descriptor) mechanism

In an operating system, when a process creates or opens a file, the OS assigns it a file descriptor, which the process then uses to identify and operate on the file. However, the OS limits the number of files a process may have open at once, so the supply of file descriptors is limited. A database server process opens many files for various reasons, including base tables, scratch files, and sort and hash spool files, and can easily exceed the OS limit on the number of file descriptors. PostgreSQL uses the VFD mechanism to solve this problem.
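To get a feeling for the limit in question, the following sketch asks the operating system how many descriptors the current process may hold. It assumes a POSIX system where getrlimit and RLIMIT_NOFILE are available; the wrapper max_open_files is a hypothetical name, and this is not how fd.c determines its own budget.

```c
#include <assert.h>
#include <limits.h>
#include <sys/resource.h>

/*
 * Return the soft limit on open file descriptors for this process,
 * LONG_MAX if the limit is unlimited or too large for a long,
 * or -1 if the limit cannot be queried.
 */
static long
max_open_files(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t) LONG_MAX)
        return LONG_MAX;
    return (long) rl.rlim_cur;
}
```

The value is typically far smaller than the number of table, index and temporary files a busy backend may touch, which is the gap the VFD mechanism is meant to bridge.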

The code related to virtual file descriptors lives in src/backend/storage/file/fd.c. The comment at the top of that file introduces VFDs, so let me go through it first.

 * VFDs are managed as an LRU pool, with actual OS file descriptors
 * being opened and closed as needed.  Obviously, if a routine is
 * opened using these interfaces, all subsequent operations must also
 * be through these interfaces (the File type is not a real file
 * descriptor).
 * For this scheme to work, most (if not all) routines throughout the
 * server should use these interfaces instead of calling the C library
 * routines (e.g., open(2) and fopen(3)) themselves.  Otherwise, we
 * may find ourselves short of real file descriptors anyway.

In other words, VFDs are managed as an LRU pool, and real operating system file descriptors are opened and closed on demand. If a file is opened through these interfaces, every subsequent operation on it must also go through them, because the File type is not a real file descriptor.
For the scheme to work, most (if not all) routines in the server should use these interfaces instead of calling C library routines such as open(2) and fopen(3) directly. Otherwise, the server may still run short of real file descriptors.

The main idea is to open files through the fd.c interfaces instead of the C library functions. Let's first look at the open and fopen functions.

int open(const char *path, int flags, mode_t mode);
/* Parameters: file path name, access flags, permission mode used when the file is created */
/* Return value: file descriptor (handle) */
FILE *fopen(const char *filename, const char *mode);
/* Parameters: file path name, access mode */
/* Return value: FILE pointer */
// FILE structure as defined by one C runtime; the layout is implementation-specific
struct _iobuf {
    char *_ptr;      // Next position to read or write in the buffer
    int _cnt;        // Number of bytes remaining in the buffer
    char *_base;     // Start of the buffer
    int _flag;       // File status flags
    int _file;       // File descriptor
    int _charbuf;    // Single-character buffer used when no buffer is allocated
    int _bufsiz;     // Buffer size
    char *_tmpfname; // Temporary file name
};
typedef struct _iobuf FILE;

You can see that both open and fopen consume a real file descriptor for every file they open (fopen wraps one inside FILE), which is exactly the resource we are short of. Next, let's look at how VFD is implemented.
VFD structure

typedef struct vfd
{
	int			fd;				/* Current real file descriptor, or VFD_CLOSED if no real descriptor is open */
	unsigned short fdstate;		/* Flag bits of the VFD, see below */
	ResourceOwner resowner;		/* Resource owner, used for automatic cleanup */
	File		nextFree;		/* Next free VFD. File is an integer type (see below) used as the index of a VFD in the VFD array */
	File		lruMoreRecently;	/* Index of the VFD used more recently than this one */
	File		lruLessRecently; 	/* Index of the VFD used less recently than this one */
	off_t		fileSize;		/* Current size of the file; tracked only for temporary files (0 otherwise) */
	char	   *fileName;		/* Name of the file, or NULL if the VFD is unused */
	int			fileFlags;		/* Flags for opening or reopening the file, like the open flags above (read-only, write-only, read-write) */
	mode_t		fileMode;		/* Mode used when creating the file, like the open mode above */
} Vfd;

typedef int File;               /* The File type is not the C library's FILE but an integer; it is defined in the header file fd.h */
// Three flag bits of fdstate
#define FD_DELETE_AT_CLOSE 	 (1 << 0) 	/* Bit 0 set: the file is deleted when it is closed */
#define FD_CLOSE_AT_EOXACT 	 (1 << 1) 	/* Bit 1 set: the file is closed at the end of the transaction (eoxact) */
#define FD_TEMP_FILE_LIMIT 	 (1 << 2) 	/* Bit 2 set: the file is subject to the temporary file limit */

From this structure, the relationship between VFDs, real FDs and files becomes clear: each VFD describes one file and holds a real FD only while the file is actually open.

VFD storage

static Vfd *VfdCache; // Array of the Vfd structures described above. VfdCache[0] does not hold a valid VFD; it serves as the head of the linked lists.
static Size SizeVfdCache = 0; // Current size of the array

//Initialization of the array (called once when the backend starts)
void
InitFileAccess(void)
{
	Assert(SizeVfdCache == 0);	/* Must be called only once */

	VfdCache = (Vfd *) malloc(sizeof(Vfd));// Allocate space for the header element
	if (VfdCache == NULL)// Allocation failed: report an error
		ereport(FATAL,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of memory")));

	MemSet((char *) &(VfdCache[0]), 0, sizeof(Vfd));// Zero the header element
	VfdCache->fd = VFD_CLOSED;// The header never has an open file

	SizeVfdCache = 1;

	/* Register a hook so that temporary files are deleted at process exit */
	on_proc_exit(AtProcExit_Files, 0);
}

//Allocation of a VFD slot, growing the array when needed
static File
AllocateVfd(void)
{
	Index		i;
	File		file;

	DO_DB(elog(LOG, "AllocateVfd. Size %zu", SizeVfdCache));

	Assert(SizeVfdCache > 0);	/* InitFileAccess must have been called */

	if (VfdCache[0].nextFree == 0)// Free list is empty: grow the array
	{
		Size		newCacheSize = SizeVfdCache * 2;// Double the current size
		Vfd		   *newVfdCache;// Start address of the enlarged array

		if (newCacheSize < 32)// The first expansion goes straight to 32 entries
			newCacheSize = 32;

		newVfdCache = (Vfd *) realloc(VfdCache, sizeof(Vfd) * newCacheSize);// realloc keeps the existing entries; VfdCache is left untouched if it fails
		if (newVfdCache == NULL)// Allocation failed: report an error
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		VfdCache = newVfdCache;// Switch to the enlarged array

		for (i = SizeVfdCache; i < newCacheSize; i++)// Initialize the newly added entries
		{
			MemSet((char *) &(VfdCache[i]), 0, sizeof(Vfd)); // Zero the entry
			VfdCache[i].nextFree = i + 1;// Link it into the free list
			VfdCache[i].fd = VFD_CLOSED; // No file is open yet
		}
		VfdCache[newCacheSize - 1].nextFree = 0;// The last new entry ends the free list
		VfdCache[0].nextFree = SizeVfdCache;// The header now points to the first new entry

		SizeVfdCache = newCacheSize;// Record the new array size
	}

	file = VfdCache[0].nextFree;// Take the first entry of the free list

	VfdCache[0].nextFree = VfdCache[file].nextFree;// Unlink it from the free list

	return file;
}

static void
FreeVfd(File file)
{
	Vfd		   *vfdP = &VfdCache[file];

	DO_DB(elog(LOG, "FreeVfd: %d (%s)",
			   file, vfdP->fileName ? vfdP->fileName : ""));

	if (vfdP->fileName != NULL)
	{
		free(vfdP->fileName);// Release the file name before clearing the pointer
		vfdP->fileName = NULL;
	}
	vfdP->fdstate = 0x0;

	vfdP->nextFree = VfdCache[0].nextFree;// Push the entry onto the front of the free list
	VfdCache[0].nextFree = file;
}

As mentioned earlier, VFDs are maintained as an LRU pool, yet the storage looks like a plain array. In fact, the array contains two linked lists. The Vfd structure has three link fields, nextFree, lruMoreRecently and lruLessRecently, whose roles are described in the comments.

The first list is the free list. It is singly linked through nextFree (next) and records all VFDs that are idle and can be allocated.

The second list is the LRU (Least Recently Used) pool. It is doubly linked through lruMoreRecently (prev) and lruLessRecently (next). Since element 0 of the array is the list header, its lruLessRecently (next) points to the most recently used VFD (the first element), and its lruMoreRecently (prev) points to the least recently used VFD (the last element). Likewise, because no VFD is less recently used than the last element, the lruLessRecently of the last element points back to the header, which closes the ring.

The code above also covers the allocation and initialization of the VFD array. When a postgres backend process starts, it initializes the array but allocates space only for the header element. When no free VFD is left, the array is expanded: the first expansion grows it to 32 elements, and each later expansion doubles its length.
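The free list and the doubling strategy can be tried in isolation with a toy version. This is only a sketch under simplifying assumptions: the Slot and SlotPool types and the pool_* functions are hypothetical names, each slot carries nothing but its free-list link, and an allocation failure is reported through a return value instead of PostgreSQL's error machinery.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy slot: only the fields needed to show the free list. */
typedef struct Slot
{
    int nextFree;   /* index of the next free slot, 0 = end of list */
    int inUse;      /* 1 while the slot is handed out */
} Slot;

typedef struct SlotPool
{
    Slot  *slots;   /* slots[0] is the list header, never handed out */
    size_t size;    /* number of entries in slots */
} SlotPool;

/* Allocate only the header element; returns 0 on success, -1 on failure. */
static int
pool_init(SlotPool *pool)
{
    pool->slots = calloc(1, sizeof(Slot));
    if (pool->slots == NULL)
        return -1;
    pool->size = 1;
    return 0;
}

/* Hand out a free slot index (> 0), or 0 if memory is exhausted. */
static int
pool_alloc(SlotPool *pool)
{
    int slot;

    if (pool->slots[0].nextFree == 0)       /* free list is empty: grow */
    {
        size_t newSize = pool->size * 2;
        Slot  *newSlots;
        size_t i;

        if (newSize < 32)
            newSize = 32;
        newSlots = realloc(pool->slots, newSize * sizeof(Slot));
        if (newSlots == NULL)
            return 0;                       /* old array is still intact */
        pool->slots = newSlots;

        for (i = pool->size; i < newSize; i++)
        {
            memset(&pool->slots[i], 0, sizeof(Slot));
            pool->slots[i].nextFree = (int) (i + 1);
        }
        pool->slots[newSize - 1].nextFree = 0;
        pool->slots[0].nextFree = (int) pool->size;
        pool->size = newSize;
    }

    slot = pool->slots[0].nextFree;         /* pop the first free slot */
    pool->slots[0].nextFree = pool->slots[slot].nextFree;
    pool->slots[slot].inUse = 1;
    return slot;
}

/* Return a slot to the front of the free list. */
static void
pool_free(SlotPool *pool, int slot)
{
    assert(slot > 0 && (size_t) slot < pool->size);
    pool->slots[slot].inUse = 0;
    pool->slots[slot].nextFree = pool->slots[0].nextFree;
    pool->slots[0].nextFree = slot;
}
```

Because a freed slot is pushed onto the front of the list, it is also the first one to be handed out again, and neither operation has to move any array element.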
LRU linked list

Addition and deletion of LRU linked list

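  1. Remove a VFD from the LRU list
    A VFD is removed from the LRU list when its real file descriptor is closed, either because the file is no longer needed or because the descriptor has to be given up for another file.
    1. Normal deletion
      The LruDelete function closes the real file and then calls the Delete function to unlink the corresponding VFD from the LRU list. The VFD itself stays in the array, so the file can be reopened later.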
      static void
      LruDelete(File file)
      {
         Vfd		   *vfdP;// Pointer to the VFD
         Assert(file != 0);// The header element must never be deleted
         DO_DB(elog(LOG, "LruDelete %d (%s)",
         		   file, VfdCache[file].fileName));
         vfdP = &VfdCache[file];// Point to the element to delete
         if (close(vfdP->fd))// A failed close is only logged; the FD is leaked rather than risking inconsistent state
         	elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
         		 "could not close file \"%s\": %m", vfdP->fileName);
         vfdP->fd = VFD_CLOSED;// Mark the VFD as having no real descriptor
         --nfile;// One fewer open file
         Delete(file);// Unlink the VFD from the LRU list
      }
      static void
      Delete(File file)
      {
         Vfd		   *vfdP;// Pointer to the VFD
         Assert(file != 0);// The header element must never be deleted
         DO_DB(elog(LOG, "Delete %d (%s)",
         		   file, VfdCache[file].fileName));
         vfdP = &VfdCache[file];// Point to the element to delete
         //Classic deletion from a doubly linked list
         VfdCache[vfdP->lruLessRecently].lruMoreRecently = vfdP->lruMoreRecently;
         //The prev of this VFD's next element becomes this VFD's prev
         VfdCache[vfdP->lruMoreRecently].lruLessRecently = vfdP->lruLessRecently;
         //The next of this VFD's prev element becomes this VFD's next
      }
    2. Delete tail
      The ReleaseLruFile function checks whether any file is open and, if so, calls LruDelete on the tail of the list, that is, the least recently used VFD.
      static bool
      ReleaseLruFile(void)
      {
      	DO_DB(elog(LOG, "ReleaseLruFile. Opened %d", nfile));
      	if (nfile > 0)// There are open files (real file descriptors in use)
      	{
      		//Since files are open, at least one VFD must be in the LRU list, so the list has a tail
      		Assert(VfdCache[0].lruMoreRecently != 0);// The tail must not be the header itself
      		LruDelete(VfdCache[0].lruMoreRecently);// Close and unlink the tail element
      		return true; // A file descriptor was released
      	}
      	return false;// No open file to release
      }
  2. Insert a VFD into the LRU list
    Insertion happens when a VFD is put into use. The LruInsert function opens the real file if necessary to obtain an FD, then calls the Insert function to place the VFD right behind the list header.
    static int
    LruInsert(File file)
    {
    	Vfd		   *vfdP;// Pointer to the VFD
    	Assert(file != 0);// The header element must never be inserted
    	DO_DB(elog(LOG, "LruInsert %d (%s)",
    			   file, VfdCache[file].fileName));
    	vfdP = &VfdCache[file];// Point to the corresponding VFD
    	if (FileIsNotOpen(file))// The VFD currently has no real descriptor
    	{
    		//Real descriptors may be at their limit, so first close excess ones, starting from the tail (least recently used), so that the file can be opened
    		ReleaseLruFiles();
    		vfdP->fd = BasicOpenFilePerm(vfdP->fileName, vfdP->fileFlags,
    									 vfdP->fileMode);// Open the file to get an FD
    		if (vfdP->fd < 0)// Opening failed
    		{
    			DO_DB(elog(LOG, "re-open failed: %m"));
    			return -1;// Report the failure to the caller
    		}
    		else// Opened successfully
    		{
    			++nfile;// One more open file
    		}
    	}
    	Insert(file);// Just used, so it is the most recently used: put it at the head of the LRU list
    	return 0;
    }
    static void
    Insert(File file)// Classic head insertion
    {
    	Vfd		   *vfdP;// Pointer to the VFD
    	Assert(file != 0);// The header element must never be inserted
    	DO_DB(elog(LOG, "Insert %d (%s)",
    			   file, VfdCache[file].fileName));
    	vfdP = &VfdCache[file];// Point to the corresponding VFD
    	//Make this element the first one; the former first element follows it
    	vfdP->lruMoreRecently = 0;// The prev of this element is the header
    	vfdP->lruLessRecently = VfdCache[0].lruLessRecently;// The next of this element is the former first element
    	VfdCache[0].lruLessRecently = file;// The header now points to this element
    	VfdCache[vfdP->lruLessRecently].lruMoreRecently = file;// The prev of the former first element is this element
    }
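The Delete and Insert steps above can be condensed into a self-contained toy ring to watch the LRU order change. This is a sketch, not fd.c itself: the Node type, the fixed RING_SIZE and the ring_* functions are hypothetical, and no real files are opened or closed.

```c
#include <assert.h>

#define RING_SIZE 8     /* hypothetical fixed capacity; slot 0 is the header */

typedef struct Node
{
    int moreRecently;   /* neighbour used more recently (towards the head) */
    int lessRecently;   /* neighbour used less recently (towards the tail) */
} Node;

/* Zero-initialised, so the header's links point to itself: an empty ring. */
static Node ring[RING_SIZE];

/* Unlink slot i from the ring. */
static void
ring_delete(int i)
{
    assert(i != 0);
    ring[ring[i].lessRecently].moreRecently = ring[i].moreRecently;
    ring[ring[i].moreRecently].lessRecently = ring[i].lessRecently;
}

/* Link slot i in right behind the header, i.e. as most recently used. */
static void
ring_insert(int i)
{
    assert(i != 0);
    ring[i].moreRecently = 0;
    ring[i].lessRecently = ring[0].lessRecently;
    ring[0].lessRecently = i;
    ring[ring[i].lessRecently].moreRecently = i;
}

/* Mark slot i as just used by moving it to the head of the ring. */
static void
ring_touch(int i)
{
    ring_delete(i);
    ring_insert(i);
}

/* Unlink the least recently used slot; returns its index, or 0 if empty. */
static int
ring_evict(void)
{
    int victim = ring[0].moreRecently;

    if (victim != 0)
        ring_delete(victim);
    return victim;
}
```

After inserting slots 1, 2 and 3 in that order and then touching slot 1, successive calls to ring_evict return 2, 3 and 1: the slot that has gone unused the longest is always given up first.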


  1. VFDs are stored in an array, but two static linked lists are formed over it using subscripts instead of pointers. The advantage is that array elements never have to be moved during insertion and deletion, which reduces the time complexity of these two operations from O(n) to O(1). Using element 0 as the header serves both the free list and the doubly linked LRU list. Insertion and deletion in such a static linked list do not really add or remove array elements; they only modify the link fields.
  2. Learned the usage of the Assert macro. Assert() is commonly used while debugging: in assert-enabled builds it evaluates the expression in parentheses, and if the expression is false (0), the program reports an error and terminates. If the expression is not 0, execution continues with the next statement.
  3. Many memory operations are involved here, and possible failures have to be considered. A failed memory allocation cannot be repaired locally, so the code reports an out of memory error and abandons the operation, while assertions guard conditions that should never be violated.
  4. This post analyzed VFD, the core part of disk management, and showed how the database system works around the operating system's limit on the number of open files.

Tags: Database PostgreSQL

Posted on Sun, 17 Oct 2021 19:26:36 -0400 by jmdavis