Recently I've been reading and processing datasets, which inevitably involves file handling in Python. I roughly know what some of the functions do, but I don't feel I have a solid grasp of them, so I'm writing these notes to have something to look up later.
1 os module
File handling in Python relies on the built-in open() function for reading and writing, together with the os module for working with paths and directories. Here are a few common commands.
(1) Open file
# open() function signature
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
- file is the path of the file you want to open. It can be either a relative path or an absolute path; open() accepts both.
- mode indicates what you want to do with the file. There are several common ways:
'r' opens the file for reading only. Writing to it raises 'io.UnsupportedOperation: not writable'.
'w' empties the file before writing the data (and creates the file if it does not exist).
'a' appends data at the end of the existing content.
The modes 'w+' and 'w+b' open the file for both reading and writing and empty its contents first. The modes 'r+' and 'r+b' open the file for both reading and writing without emptying it, so you can update or overwrite the existing contents and no error is reported. The modes 'a+' and 'a+b' open the file for both reading and appending. The 'b' suffix opens the file in binary mode.
- encoding is the text encoding to use. If Chinese characters are read or written, it can be set to 'utf-8'.
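The differences between the modes above can be checked with a short experiment. This is only a minimal sketch: the file name demo.txt and the temporary directory are assumptions made for illustration, so no real file is touched.

```python
import io
import os
import tempfile

# Work in a throwaway directory so no real file is modified.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # 'w' creates the file (or empties an existing one) before writing.
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("first\n")

    # 'a' keeps the existing content and writes at the end.
    with open(demo_path, "a", encoding="utf-8") as f:
        f.write("second\n")

    # 'r' is read-only, so write() raises io.UnsupportedOperation.
    with open(demo_path, "r", encoding="utf-8") as f:
        try:
            f.write("oops")
            write_failed = False
        except io.UnsupportedOperation:
            write_failed = True

    # 'r+' allows reading and writing without emptying the file.
    # Writing starts at the beginning and overwrites in place.
    with open(demo_path, "r+", encoding="utf-8") as f:
        f.write("FIRST")

    with open(demo_path, encoding="utf-8") as f:
        final_content = f.read()

print(write_failed)         # True
print(repr(final_content))  # 'FIRST\nsecond\n'
```

Note that 'r+' overwrites characters from the start of the file rather than inserting them, which is why only the first word changes.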
(2) Read file
After opening the file, we can use three methods to read the contents of the file
- read() function: returns the whole file as a single string, so you can process it with the usual string methods.
with open('./test.txt') as f:
    content = f.read()
    # You can split the content into lines with the split() function.
    # Reuse content here: calling f.read() a second time would return an empty string.
    content1 = content.split('\n')
- readlines() function: the return value of the function is a list, and each element of the list is the data of each line of the original file.
with open('./test.txt') as f:
    lines = f.readlines()
    # You can loop over the list and process each line.
    # int() assumes every line holds a single integer; the trailing newline is ignored.
    values = [int(i) for i in lines]
- readline() function: reads one line per call, so it is usually used inside a loop. Because only one line is held in memory at a time, it is suitable for reading large files.
with open('./test.txt') as f:
    while True:
        line = f.readline()
        if len(line):
            print(line)
        else:
            # Read end
            break
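As a side note to the three methods above, a file object can also be iterated directly, which reads one line at a time just like the readline() loop. The sketch below assumes a hypothetical file numbers.txt with one integer per line, created in a temporary directory for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "numbers.txt")  # hypothetical file name

    # Prepare a small sample file with one integer per line.
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("1\n2\n3\n")

    total = 0
    with open(sample_path, encoding="utf-8") as f:
        # Each iteration yields one line, including its trailing '\n'.
        for line in f:
            total += int(line.strip())

print(total)  # 6
```

This form is equivalent in memory use to calling readline() in a while loop, but needs no explicit end-of-file check.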
(3) Write data
The write() function can write string data to a file (note that it must be string data)
# Open in 'w' mode: the default mode 'r' is read-only, so write() would fail
with open('./test.txt', 'w') as f:
    # Write data to file
    f.write('hello world')
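To illustrate the "string only" restriction, here is a small sketch that writes a list of numbers. The list values and the file name values.txt are made up for the example; the point is that each number has to be converted with str() and that write() does not add a line break by itself.

```python
import os
import tempfile

values = [3, 1, 4]  # hypothetical data to save

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "values.txt")  # hypothetical file name

    with open(out_path, "w", encoding="utf-8") as f:
        # Passing a non-string (here an int) raises TypeError.
        try:
            f.write(values[0])
            type_error_raised = False
        except TypeError:
            type_error_raised = True

        # Convert each value to str and add the line break explicitly.
        for value in values:
            f.write(str(value) + "\n")

    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(type_error_raised)  # True
print(repr(written))      # '3\n1\n4\n'
```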
(4) Get all file addresses and folder addresses under a path
When processing the dataset, I need to get the address of all xml files under the training set and read them. Therefore, the following methods can be used:
# Function signature
os.walk(top, topdown=True, onerror=None, followlinks=False)
os.walk() function can obtain all the file and folder addresses under a certain path. The following is illustrated by a case:
# Step 1: the '../data' directory looks like this
# ---- data
#     ---- obj
#         ---- image1.jpg
#         ---- image1.xml
#     ---- obj.data
#     ---- obj.names

# Step 2: traverse the folder and print
import os

for root_path, sub_dir_path, files_path in os.walk('../data'):
    # (1) Print root_path: the results are '../data' and '../data/obj'
    print(root_path)
    # (2) Print sub_dir_path: the results are ['obj'] and []
    print(sub_dir_path)
    # (3) Traverse all xml files in the files_path list: the printed result is image1.xml
    for file_path in files_path:
        if file_path.endswith('xml'):
            print(file_path)
root_path is the path of the folder currently being traversed; in this case it takes the values '../data' and '../data/obj'. Because the top folder contains a subfolder, that subfolder is traversed as well.
sub_dir_path is the list of subfolder names directly under the current folder. Each subfolder is then visited in a later iteration, where it becomes root_path.
files_path is the list of file names directly under the current folder. Files inside a subfolder are not included here; they appear in files_path when that subfolder itself is visited.
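To see the three values described above without depending on a real '../data' folder, the following sketch rebuilds the same layout inside a temporary directory (an assumption made purely for the demonstration) and records what os.walk() yields on each iteration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Recreate the example layout: data/obj/{image1.jpg, image1.xml}, data/obj.data, data/obj.names
    data_dir = os.path.join(tmp_dir, "data")
    obj_dir = os.path.join(data_dir, "obj")
    os.makedirs(obj_dir)
    for name in ("obj.data", "obj.names"):
        open(os.path.join(data_dir, name), "w").close()
    for name in ("image1.jpg", "image1.xml"):
        open(os.path.join(obj_dir, name), "w").close()

    visited = []
    for root_path, sub_dirs, files in os.walk(data_dir):
        # Store the root relative to data_dir so the output is easy to read.
        rel_root = os.path.relpath(root_path, data_dir)
        visited.append((rel_root, sorted(sub_dirs), sorted(files)))

for entry in visited:
    print(entry)
# ('.', ['obj'], ['obj.data', 'obj.names'])
# ('obj', [], ['image1.jpg', 'image1.xml'])
```

Each tuple corresponds to one folder: the top folder comes first, and its subfolder obj is reported in a separate iteration with its own file list.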
- os.listdir() function
os.listdir() function can get the names (not addresses) of all files and folders under a folder
for path in os.listdir('../data'):
    print(path)
# The output is (the order is not guaranteed)
# obj
# obj.data
# obj.names
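Since os.listdir() returns only names and mixes files with folders, a common follow-up is to separate the two. The sketch below is one way to do it with os.path.isdir() and os.path.isfile(); the temporary directory and its contents are assumptions that mirror the example layout.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the example: one subfolder and two files.
    os.mkdir(os.path.join(tmp_dir, "obj"))
    for name in ("obj.data", "obj.names"):
        open(os.path.join(tmp_dir, name), "w").close()

    # listdir() gives names in arbitrary order, so sort them for stable output.
    names = sorted(os.listdir(tmp_dir))

    # The names must be joined with the folder path before testing them.
    folders = [n for n in names if os.path.isdir(os.path.join(tmp_dir, n))]
    files = [n for n in names if os.path.isfile(os.path.join(tmp_dir, n))]

print(folders)  # ['obj']
print(files)    # ['obj.data', 'obj.names']
```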
(5) Splicing path
When using the os.walk() function, we get the path of the folder and the names of the files separately. In the dataset, we need to save the absolute path of each file, or a relative path that starts from the current root directory, so we need to splice the folder path and the file name together.
When splicing paths, we can use the os.path.join() function (the description below is adapted from another blogger's post).
os.path.join() function: connect two or more pathname components
1. If a component does not start with '/', the function automatically inserts a separator before it
2. If a component is an absolute path, all components before it will be discarded
3. If the last component is empty, the generated path ends with a '/' separator
import os

# Case 1: if a component does not start with '/', the function automatically inserts a separator
path1 = 'home'
path2 = 'develop'
path3 = 'code'
print(os.path.join(path1, path2, path3))
# The output is (on Linux/macOS; Windows uses '\' as the separator):
# home/develop/code

# Case 2: if a component is an absolute path, all components before it will be discarded
# This must be noted!!!!!
path1 = 'home'
path2 = '/develop'
path3 = 'code'
print(os.path.join(path1, path2, path3))
# The output is:
# /develop/code

# Case 3: if the last component is empty, the generated path ends with a '/' separator
path1 = '/home'
path2 = 'develop'
path3 = ''
print(os.path.join(path1, path2, path3))
# The output is:
# /home/develop/
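Putting os.walk() and os.path.join() together gives the full paths of all xml files under a folder, which is the task described earlier. The helper name find_files and the temporary folder layout below are assumptions for illustration, not part of any library.

```python
import os
import tempfile


def find_files(top, suffix):
    """Return the sorted paths of all files under `top` whose names end with `suffix`."""
    matches = []
    for root_path, _sub_dirs, files in os.walk(top):
        for file_name in files:
            if file_name.endswith(suffix):
                # root_path is the folder, file_name only the bare name: join them.
                matches.append(os.path.join(root_path, file_name))
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical layout: <tmp_dir>/obj/{image1.jpg, image1.xml}
    obj_dir = os.path.join(tmp_dir, "obj")
    os.makedirs(obj_dir)
    for name in ("image1.jpg", "image1.xml"):
        open(os.path.join(obj_dir, name), "w").close()

    xml_paths = find_files(tmp_dir, ".xml")
    # Show the paths relative to the top folder so the output is short.
    relative = [os.path.relpath(p, tmp_dir) for p in xml_paths]

print(relative)
```

If absolute paths are needed, each result can be passed through os.path.abspath() before saving.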