Python basics ⑦ - character encoding and file operations

Character encoding

Refer to Baidu Encyclopedia:
https://baike.baidu.com/item/%E5%AD%97%E7%AC%A6%E7%BC%96%E7%A0%81/8446880?fr=aladdin

File operation

'''
1 What is a file
    A file is an abstract unit that the operating system provides to users/applications for operating the hard disk
2 Why use files
    File read and write operations issued by users/applications are converted by the operating system into concrete hard disk operations
    So users/applications can simply read/write files to indirectly control complex hard disk access
    This makes it possible to permanently save data from memory to the hard disk
    user=input('>>>>: ') #user = "Xiao Wang"
3 How to use files
    Basic steps of file operation:
        f=open(...) # Open the file and get a file object f; f works like a remote control that sends instructions to the operating system
        f.read() # Read or write the file: send read/write instructions to the operating system
        f.close() # Close the file and release the operating system resources
    Context management:
        with open(...) as f:
            pass
'''
# Absolute path
f = open(r'/Users/jaidun/data/python_space/a.txt', encoding='utf-8')
print(f.read())
f.close()
# Relative path
# A file in the current working directory
f = open(r'a.txt', encoding='utf-8')
print(f.read())
f.close()
# ../ refers to the parent directory
f = open(r'../a.txt', encoding='utf-8')
print(f.read())
f.close()

# Important: be sure to close every open file before the program ends
# It is easy to forget to call f.close()
# f.close()
# print(f.read())  # reading after close() raises ValueError
# Context management: with
#         with open(...) as f:
#             pass
# The file is closed automatically when the with block ends
with open(r'a.txt', encoding='utf-8') as f:
    print(f.read())
# print(f.read())  # raises ValueError: the file is already closed
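
As a minimal sketch of what the with statement does for us, the snippet below reads a file inside a with block and then checks the file object afterwards. The helper name read_then_probe and the temporary demo file are made up for illustration.

```python
import os
import tempfile


def read_then_probe(path):
    """Read a file inside a with block, then try to read again after it closes."""
    with open(path, encoding='utf-8') as f:
        content = f.read()
    # Leaving the with block closes the file, even if an exception occurred.
    closed_after_block = f.closed
    try:
        f.read()
        error = None
    except ValueError as exc:  # reading a closed file raises ValueError
        error = exc
    return content, closed_after_block, error


# Hypothetical demo file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'a.txt')
    with open(demo_path, mode='w', encoding='utf-8') as f:
        f.write('hello\n')
    content, closed_after_block, error = read_then_probe(demo_path)

print(repr(content), closed_after_block, type(error).__name__)
```

The with form is roughly the same as calling open(), doing the work in a try block, and calling f.close() in a finally block, only shorter and harder to get wrong.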


Other common file operations

1. Modes that control how the file is opened
r: read-only mode (default)
w: write-only mode
a: append-only mode
2. Modes that control the unit in which file content is read and written (must be combined with r/w/a)
t: text mode (default). The encoding parameter should always be specified; otherwise Python falls back to a default that can differ between platforms
Advantage: the bytes read from the hard disk are decoded into unicode strings before being returned
Emphasis: valid only for text files
b: binary mode. The encoding parameter must not be specified
Advantage: the bytes can be transmitted directly over the network
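
A small sketch of the difference between t and b, assuming a throwaway file in a temporary directory (the file name modes.txt is made up): the same file is read once as text and once as bytes, and passing encoding together with b mode is rejected.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'modes.txt')
    with open(path, mode='wt', encoding='utf-8') as f:
        f.write('你好')

    # t mode: the bytes are decoded for us and we get a str.
    with open(path, mode='rt', encoding='utf-8') as f:
        text = f.read()
    # b mode: we get the raw bytes exactly as stored on disk.
    with open(path, mode='rb') as f:
        raw = f.read()
    # b mode refuses an encoding argument.
    try:
        open(path, mode='rb', encoding='utf-8')
        binary_encoding_error = None
    except ValueError as exc:
        binary_encoding_error = exc

print(type(text).__name__, len(text))  # 2 characters
print(type(raw).__name__, len(raw))    # 6 bytes in utf-8
```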

Read only

# I r: read-only mode (default)
# 1. If the file does not exist, an error is raised
# 2. If the file exists, the file pointer starts at the beginning of the file

with open('a.txt', mode='rt', encoding='utf-8') as f:
    res1 = f.read()
    print('111>>>', res1)
    # The first read() consumed the whole file, so the second returns an empty string
    res2 = f.read()
    print('222>>>', res2)
    # rt mode is readable
    print(f.readable())
    # rt mode is not writable
    print(f.writable())
    # read() loads the whole file into memory, which is a problem for large files;
    # readline() reads one line at a time
    f.seek(0)  # rewind: the pointer is at the end of the file after read()
    print(f.readline(), end='')
    # Each line already ends with a newline and print adds its own \n,
    # so this prints an extra blank line
    print(f.readline())
    # Loop over the file object line by line
    for line in f:
        print(line, end='')
    f.seek(0)
    L = []
    for line in f:
        L.append(line)
    print(L)
    # The same result in one line of code
    f.seek(0)
    print(f.readlines())
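
To make the role of the file pointer in r mode concrete, here is a minimal sketch on a three-line temporary file (the file and variable names are only for illustration). It shows that a second read() returns an empty string and that seek(0) rewinds the pointer.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'a.txt')
    with open(path, mode='wt', encoding='utf-8') as f:
        f.write('line1\nline2\nline3\n')

    with open(path, mode='rt', encoding='utf-8') as f:
        first = f.read()           # whole file
        second = f.read()          # pointer is at the end, nothing left
        f.seek(0)                  # move the pointer back to the beginning
        one_line = f.readline()    # only the first line
        rest = [line for line in f]  # iteration continues from the pointer
        f.seek(0)
        all_lines = f.readlines()  # every line as a list

print(repr(second), repr(one_line), rest, len(all_lines))
```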


Write only

# II wt: write-only mode
# 1. If the file does not exist, a new empty file is created
# with open('b.txt',mode='wt',encoding='utf-8')as f:
#     pass
# 2. If the file exists, its content is cleared and the file pointer starts at the beginning of the file
with open('b.txt', mode='wt', encoding='utf-8') as f:
    # Opening in w mode has already emptied the file;
    # whatever we write below becomes the new content
    # Not readable
    print(f.readable())
    # Writable
    print(f.writable())
    # f.read()  # raises io.UnsupportedOperation: not readable
    # write() does not add line breaks, so remember to write \n yourself
    # f.write(string)
    f.write('Xiao Wang\n')
    f.write('king\n')
    f.write('Xiao Dai\n')
    # Write multiple lines at once
    f.write('111\n2222\n3333\n')
    # Write the contents of a list line by line
    info = ['sea\n', 'sea\n', 'sea\n']
    for line in info:
        f.write(line)
    # The same in one line of code
    # f.writelines(list)
    f.writelines(info)

Append only

# III at: append-only mode
# 1. If the file does not exist, a new empty file is created and the file pointer is at the end of the file (which is also the beginning)
# with open('c.txt',mode='at',encoding='utf-8')as f:
#     pass
# 2. If the file exists, the file pointer starts at the end of the file
with open('c.txt', mode='at', encoding='utf-8') as f:
    # Not readable
    print(f.readable())
    # Writable
    print(f.writable())
    f.write('Miss Wang\n')
    f.write('Miss Dai\n')
    f.write('Miss Zhou\n')

# Opening the file again in a mode keeps appending after the existing content
with open('c.txt', mode='at', encoding='utf-8') as f:
    # Not readable
    print(f.readable())
    # Writable
    print(f.writable())
    f.write('Miss Dai\n')
    f.write('Miss Yang\n')
    f.write('Teacher Fu\n')


The difference between w mode and a mode
w mode
While the file stays open, consecutive writes continue from where the previous write left off,
but the next time the file is opened in w mode its existing content is cleared
a mode
Every time the file is opened, the pointer starts at the end of the file, so the previous content is not overwritten
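
The difference can be checked with a short experiment. This is only a sketch with made-up file names in a temporary directory: the same two writes are performed twice, once per open, in w mode and in a mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    w_path = os.path.join(tmp, 'w_mode.txt')
    a_path = os.path.join(tmp, 'a_mode.txt')

    for _ in range(2):  # open each file twice
        with open(w_path, mode='wt', encoding='utf-8') as f:
            f.write('first\n')
            f.write('second\n')  # continues after 'first\n' while the file is open
        with open(a_path, mode='at', encoding='utf-8') as f:
            f.write('first\n')
            f.write('second\n')

    with open(w_path, encoding='utf-8') as f:
        w_content = f.read()
    with open(a_path, encoding='utf-8') as f:
        a_content = f.read()

print(repr(w_content))  # only the content of the last open survives
print(repr(a_content))  # the content of both opens is kept
```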

Difference between t and b

These modes control the unit in which file content is read and written (must be combined with r/w/a)
t: text mode (default). The encoding parameter should always be specified; otherwise Python falls back to a default that can differ between platforms
Advantage: the bytes read from the hard disk are decoded into unicode strings before being returned
Emphasis: valid only for text files
b: binary mode. The encoding parameter must not be specified
Advantage: the bytes can be transmitted directly over the network

# Limitation of t mode: it only works for text files

# Binary files need b mode
# For example pictures and videos
with open('1.jpeg', mode='rb') as f:
    data = f.read()
    print(data)
    print(type(data))  # <class 'bytes'>

with open('2.jpeg', mode='wb') as f1:
    f1.write(data)

# b mode can also operate on text files, but the content must be decoded/encoded by hand
# decode: bytes -> characters
# encode: characters -> bytes
# When reading, decode the bytes into characters
with open('b pattern.txt', mode='rb') as f:
    data = f.read()
    print(data)
    print(data.decode('utf-8'))
# When writing, encode the characters into bytes first
with open('wb pattern.txt', mode='wb') as f:
    f.write('Xiao Hong\n'.encode('utf-8'))
    f.write('Xiao Wang\n'.encode('utf-8'))
    f.write('Xiao Dai\n'.encode('utf-8'))


Readable and writable

r+t mode

1. If the file does not exist, an error is raised
2. If the file exists, the file pointer starts at the beginning of the file
3. Compared with r mode, the file can also be written to

with open('Readable and writable r+t pattern.txt', mode='r+t', encoding='utf-8')as f:
    print(f.readable())
    print(f.writable())
    msg = f.readline()
    print(msg)
    f.write('xxxxxxxx')

w+t mode

1. If the file does not exist, a new empty file is created
2. If the file exists, its content is cleared and the file pointer starts at the beginning of the file

with open('Readable and writable w+t pattern.txt', mode='w+t', encoding='utf-8')as f:
    print(f.readable())
    print(f.writable())
    f.write('aaaaaaaa\n')
    f.write('bbbbbbbb\n')
    # Move the pointer: seek(number of bytes to move, reference point), where 0 means the start of the file
    # Here: move 0 bytes from the beginning, i.e. go back to the start
    f.seek(0, 0)
    print(f.readline())
    f.write('cccccccc\n')

a+t mode

When the file is opened again, writes still go to the end of the file

with open('Readable and writable a+t pattern.txt',mode='a+t',encoding='utf-8')as f:
    print(f.readable())
    print(f.writable())
    f.write('aaaaaaaa\n')
    f.write('bbbbbbbb\n')
    # Move the pointer: seek(number of bytes to move, reference point), where 0 means the start of the file
    # Here: move 0 bytes from the beginning, i.e. go back to the start
    f.seek(0,0)
    print(f.readline())
    f.write('cccccccc\n')
# The t modes above do not work for pictures and videos
# r+b, w+b and a+b follow the same rules as r+t, w+t and a+t
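
The following sketch contrasts r+t and a+t after moving the pointer back to the start. The file names are made up, and the a+ result relies on the usual append behaviour of the platform, where every write lands at the end of the file regardless of the pointer.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    r_path = os.path.join(tmp, 'r_plus.txt')
    a_path = os.path.join(tmp, 'a_plus.txt')
    for path in (r_path, a_path):
        with open(path, mode='wt', encoding='utf-8') as f:
            f.write('aaaa\n')

    # r+t: after seek(0) the write overwrites from the start of the file.
    with open(r_path, mode='r+t', encoding='utf-8') as f:
        f.seek(0)
        f.write('XX')

    # a+t: seek(0) lets us read from the start, but the write is still appended.
    with open(a_path, mode='a+t', encoding='utf-8') as f:
        f.seek(0)
        first_line = f.readline()
        f.write('XX')

    with open(r_path, encoding='utf-8') as f:
        r_result = f.read()
    with open(a_path, encoding='utf-8') as f:
        a_result = f.read()

print(repr(r_result), repr(a_result))
```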

Pointer movement

seek() function

Pointer movement within a file
Only read(n) in t mode counts n in characters
All other pointer movement, including everything in b mode, is counted in bytes

Pointer operation
f.seek(offset, whence) takes two parameters:
offset: the number of bytes to move the pointer
whence: the reference point the offset is measured from
whence = 0: relative to the beginning of the file (default); the only value that works with any offset in both t and b modes
whence = 1: relative to the current position; nonzero offsets require b mode
whence = 2: relative to the end of the file; nonzero offsets require b mode
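
The rules above can be sketched on a small temporary file (its name and content are made up): the file holds two ASCII letters followed by two Chinese characters, so it is 2 + 3 + 3 = 8 bytes in utf-8.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'seek_demo.txt')
    with open(path, mode='wb') as f:
        f.write('ab你好'.encode('utf-8'))

    # b mode: every whence value works and everything is counted in bytes.
    with open(path, mode='rb') as f:
        f.seek(2, 0)                  # 2 bytes from the beginning
        f.seek(3, 1)                  # 3 more bytes from the current position
        pos_after_relative = f.tell()
        f.seek(-3, 2)                 # 3 bytes before the end
        last_char = f.read(3).decode('utf-8')

    # t mode: seek from the beginning works, read(1) returns one character.
    with open(path, mode='rt', encoding='utf-8') as f:
        f.seek(2, 0)
        char_after_seek = f.read(1)
        try:
            f.seek(3, 1)              # nonzero offset with whence=1 is rejected
            text_seek_error = None
        except io.UnsupportedOperation as exc:
            text_seek_error = exc
        end_pos = f.seek(0, 2)        # an offset of 0 is still allowed

print(pos_after_relative, last_char, char_after_seek, end_pos)
```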

# t mode
# with open('pointer movement.txt',mode='rt',encoding='utf-8')as f:
#     print(f.read(1))
#     print(f.read(1))
#     print(f.read(1))

# b mode
# with open('pointer movement.txt', mode='rb')as f:
#     # One hexadecimal digit covers 2 ** 4 values, so two digits cover 2 ** 8 values, which is one byte
#     # One byte is only one third of a Chinese character in utf-8, so Chinese characters are read 3 bytes at a time
#     print(f.read(1).decode('utf-8'))
#     print(f.read(1).decode('utf-8'))
#     print(f.read(3).decode('utf-8'))
#     print(f.read(3).decode('utf-8'))
#     print(f.read(3).decode('utf-8'))

# In t mode seek() still moves the pointer by bytes; only read(n) counts characters
with open('seek.txt',mode='rt',encoding='utf-8')as f:
    f.seek(2,0)
    print(f.read(1))

# In b mode both seek() and read(n) count bytes
with open('seek.txt',mode='rb')as f:
    f.seek(5,0)
    print(f.read(3).decode('utf-8'))


with open('seek.txt',mode='rb')as f:
    msg = f.read(5)
    # tell() returns the current pointer position, counted in bytes from the start of the file
    print(f.tell())
    f.seek(3,1)
    print(f.read(3).decode('utf-8'))

with open('seek.txt',mode='rb')as f:
    f.seek(0,2)
    print(f.tell())
    f.seek(-3,2)
    print(f.read(3).decode('utf-8'))

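Detect what is appended to the end of a file

import time

with open('history.txt', mode='rb') as f:
    # Jump to the end so that only content appended from now on is printed
    f.seek(0, 2)
    while True:
        line = f.readline()
        # 0 bytes means the pointer is at the end and nothing new has arrived yet
        # The file is deliberately kept open while we keep polling
        if len(line) != 0:
            print(line.decode('utf-8'), end='')
        else:
            time.sleep(0.1)  # wait briefly instead of spinning at full CPU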
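
Because the loop above never ends, it is hard to try out in isolation. As a testable sketch of the same idea, the hypothetical helper read_new_lines below performs a single polling step: it returns whatever has been appended since the last call. The temporary history.txt and the separate writer handle only simulate another program appending to the file.

```python
import os
import tempfile


def read_new_lines(f):
    """Return the lines currently available after the pointer of binary file f."""
    lines = []
    while True:
        line = f.readline()
        if len(line) == 0:  # nothing more to read right now
            return lines
        lines.append(line.decode('utf-8'))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'history.txt')
    with open(path, mode='wb') as f:
        f.write(b'old entry\n')

    with open(path, mode='rb') as reader, open(path, mode='ab') as writer:
        reader.seek(0, 2)                 # skip everything already in the file
        before = read_new_lines(reader)   # nothing appended yet
        writer.write('new entry\n'.encode('utf-8'))
        writer.flush()                    # make the appended bytes visible
        after = read_new_lines(reader)

print(before, after)
```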

How to modify files

Method 1 of modifying files

1. Read the entire content of the file from the hard disk into memory
2. Make the modification in memory
3. Write the modified result from memory back to the hard disk, overwriting the file

with open('File modification.txt', mode='rt', encoding='utf-8') as f:
    all_data = f.read()
# The data read is now stored in the all_data variable
with open('File modification.txt', mode='wt', encoding='utf-8') as f1:
    f1.write(all_data.replace('Xiao Wang', 'king'))

Method 2 of modifying files

1. Open the source file in read mode and a temporary file in write mode
2. Read the source file piece by piece, modify each piece and write it to the temporary file, until the whole source file has been read
3. Delete the source file and rename the temporary file to the source file name

import os
with open('Document revision II.txt', mode='rt', encoding='utf-8') as read_f, \
        open('Temporary documents.txt', mode='wt', encoding='utf-8') as write_f:
    for line in read_f:
        write_f.write(line.replace('Xiao Dai', 'Xiao Yang'))

# Delete the source file
os.remove('Document revision II.txt')
# Rename the temporary file to the source file name
os.rename('Temporary documents.txt', 'Document revision II.txt')

Method 1:
Advantage: there is only one copy of the data on the hard disk while the file is being modified
Disadvantage: the whole file is held in memory, so it is not suitable for large files

Method 2:
Advantage: only one line of the source file is in memory at any time, so memory use stays low
Disadvantage: the source file and the temporary file coexist during the modification, so two copies of the data occupy the hard disk at the same time
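
Method 2 can be wrapped in a small function. The sketch below is a variation, not the code above: the function name replace_in_file and the '.tmp' suffix are made up, and it uses os.replace, which overwrites the destination in one step, instead of os.remove followed by os.rename.

```python
import os
import tempfile


def replace_in_file(path, old, new, encoding='utf-8'):
    """Rewrite path line by line through a temporary file, replacing old with new."""
    tmp_path = path + '.tmp'  # hypothetical name for the temporary file
    with open(path, mode='rt', encoding=encoding) as read_f, \
            open(tmp_path, mode='wt', encoding=encoding) as write_f:
        for line in read_f:  # only one line is held in memory at a time
            write_f.write(line.replace(old, new))
    # Swap the temporary file into place of the source file.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, mode='wt', encoding='utf-8') as f:
        f.write('Xiao Dai\nXiao Wang\n')

    replace_in_file(demo_path, 'Xiao Dai', 'Xiao Yang')

    with open(demo_path, encoding='utf-8') as f:
        modified = f.read()
    leftover_tmp = os.path.exists(demo_path + '.tmp')

print(repr(modified), leftover_tmp)
```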

Avoiding garbled text

# "Heaven has endowed me with talents for eventual use"
# Japanese
with open('text1.txt', mode='w', encoding='shift_jis')as f1:
    f1.write('生まれながらにしてわたくし私はかならず必ずやく役にたつ立つ')

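# Decoding the shift_jis file as utf-8 fails or produces garbled text
# with open('text1.txt', mode='r', encoding='utf-8')as f1:
#     a = f1.read()
#     print(a)

# English
with open('text2.txt', mode='w', encoding='shift_jis') as f1:
    f1.write('I believe')

# English letters have the same bytes in shift_jis and utf-8, so this read works
with open('text2.txt', mode='r', encoding='utf-8') as f2:
    a = f2.read()
    print(a)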

!!! Two very important points in summary !!!
1. The core rule for avoiding garbled text: whatever standard the characters were encoded with, they must be decoded with the same standard.
The standard here refers to the character encoding

2. All characters in memory are unicode, without distinction. For example, when we open an editor
and type a "你" ("you"), we cannot yet say that "你" is a Chinese character. At this point it is just a symbol,
one that may be used in many countries, and its appearance may differ depending on the input method we use.
Only when we save it to the hard disk or transmit it over the network
is it determined whether "你" is a Chinese character or a Japanese character; this is the process of converting unicode into another encoding
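
The first rule can be checked directly on strings, without any files. In this minimal sketch the sample text '日本語' is chosen only for illustration: decoding with the standard used for encoding works, while decoding shift_jis bytes as utf-8 fails.

```python
text = '日本語'

# The same unicode string becomes different bytes under different encodings.
sjis_bytes = text.encode('shift_jis')
utf8_bytes = text.encode('utf-8')

# Decoding with the standard that was used for encoding restores the text.
same_standard = sjis_bytes.decode('shift_jis')

# Decoding with a different standard fails for these bytes.
try:
    sjis_bytes.decode('utf-8')
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

# Plain English letters are encoded identically by both standards,
# which is why the 'I believe' example reads back without problems.
ascii_ok = 'I believe'.encode('shift_jis').decode('utf-8')

print(len(sjis_bytes), len(utf8_bytes), same_standard, ascii_ok)
```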

Tags: Python Pycharm crawler

Posted on Mon, 04 Oct 2021 15:46:09 -0400 by youngloopy