Python 3 encoding & decoding

1, The difference between ASCII, Unicode and UTF-8 in character encoding

The following is from the blog: https://www.cnblogs.com/moumoon/p/10988234.html

1. Code introduction

      At first, only 128 characters (codes 0-127) were encoded into computers: the upper- and lower-case English letters, digits and some symbols. This table is called ASCII encoding. For example, the code for the upper-case letter A is 65 and the code for the lower-case letter z is 122.
      One byte is obviously not enough for Chinese: at least two bytes are needed, and the encoding must not conflict with ASCII. China therefore created the GB2312 encoding for Chinese.
      As you can imagine, there are hundreds of languages in the world. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and so on. With every country setting its own national standard, conflicts are inevitable, and mixed-language text ends up garbled.
      Unicode was created to solve this. Unicode unifies all languages into one character set, so the garbling problem disappears. The Unicode standard keeps evolving, but the most common representation uses two bytes per character (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.
      A new problem then arises: if everything is stored as Unicode, garbling disappears, but if your text is almost entirely English, Unicode needs twice the storage space of ASCII, which is wasteful for storage and transmission.
      In the spirit of economy, UTF-8 was created as a "variable-length" encoding of Unicode. UTF-8 encodes each Unicode character into 1-4 bytes depending on its code point: common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. If the text you transmit contains mostly English characters, UTF-8 saves space. A further advantage is that ASCII is effectively a subset of UTF-8, so a large amount of legacy software that only supports ASCII continues to work under UTF-8.
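The byte counts described above are easy to verify in Python 3; a minimal sketch (the sample characters are my own choice):

```python
# ASCII code points mentioned above
print(ord('A'))  # 65
print(ord('z'))  # 122

# UTF-8 is variable-length: 1 byte for an ASCII letter, 3 for a common Chinese character
print(len('A'.encode('utf-8')))   # 1
print(len('中'.encode('utf-8')))  # 3

# ASCII bytes are unchanged under UTF-8, which is why legacy ASCII data still works
print('abc'.encode('ascii') == 'abc'.encode('utf-8'))  # True
```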

  1. ASCII

ASCII defines only 128 characters (codes 0-127): the upper- and lower-case English letters, digits and some symbols. One byte is not enough to represent other languages; a common Chinese character, for example, needs two bytes and must not conflict with ASCII, so China defined the GB2312 encoding format. Similarly, other countries' languages have their own encoding formats.

  1. Unicode

Because each country's language had its own encoding format, mixed-language text was easily garbled, so Unicode came into being. Unicode unifies these languages into one encoding format, usually representing a character with two bytes, whereas ASCII uses one byte per character. As a result, if your text is all English, Unicode encoding requires twice the storage space of ASCII, which is uneconomical for storage and transmission.
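The doubling can be seen by comparing ASCII with a fixed two-byte Unicode encoding such as UTF-16; a small sketch:

```python
text = 'hello'

# ASCII: one byte per character
print(len(text.encode('ascii')))      # 5

# UTF-16 (little-endian, no BOM): two bytes per character
print(len(text.encode('utf-16-le')))  # 10
```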

  1. UTF-8

To solve this problem, UTF-8 was introduced as a "variable-length" encoding of Unicode. UTF-8 encodes each Unicode character into 1-4 bytes depending on its code point: English letters are encoded into one byte and common Chinese characters into three bytes. If your text is pure English, UTF-8 saves space, and ASCII is itself a subset of UTF-8.

2. Connection among the three

After understanding the relationship between ASCII, Unicode and UTF-8, we can summarize how character encoding commonly works in a computer system:
(1) In memory, the computer uses Unicode uniformly; when text needs to be saved to disk or transmitted, it is converted to UTF-8.
(2) When editing with Notepad, the UTF-8 bytes read from the file are decoded into Unicode characters held in memory; when saving, the Unicode text is encoded back to UTF-8 and written to the file.

When browsing a web page, the server converts dynamically generated Unicode content to UTF-8 before transmitting it to the browser.
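The save/load cycle described in (1) and (2) can be sketched in Python; the file path here is a hypothetical temporary file:

```python
import os
import tempfile

text = '编码测试 encoding test'  # a Unicode str held in memory

# Save: encode the Unicode str to UTF-8 bytes on disk
path = os.path.join(tempfile.gettempdir(), 'demo_utf8.txt')  # hypothetical file name
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Load: the UTF-8 bytes on disk are decoded back into a Unicode str
with open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(restored == text)  # True
os.remove(path)
```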

2, Usage

1. General knowledge points

  • In Python 3, strings (str) are Unicode, so Python strings can hold text in any language
  • To transmit over the network or save to disk, a Unicode str must be converted to bytes
  • Conversion methods in Python: str.encode() encodes a str into bytes, and bytes.decode() decodes bytes back into a str:
      str | - encode() -> | byte stream (bytes) | - decode() -> | str
  • Common byte encodings: UTF-8, GBK (GB2312), ...
# -*- coding:utf-8 -*-

a = '\u563f\uff0c\u5582'
print(a)  # prints 嘿，喂

ver = '得时无怠，时不再来，天予不取，反为之灾。'  # a classical Chinese quotation; matches the byte output shown below
print(ver)  # ver is <class 'str'>


########## Encode
ver_bytes = ver.encode()  # returns <class 'bytes'>; UTF-8 is the default encoding
print(ver_bytes)
''' printed output (wrapped here for display; the actual output is one line):
b'\xe5\xbe\x97\xe6\x97\xb6\xe6\x97\xa0\xe6\x80\xa0\xef\xbc\x8c
\xe6\x97\xb6\xe4\xb8\x8d\xe5\x86\x8d\xe6\x9d\xa5\xef\xbc\x8c\xe5
\xa4\xa9\xe4\xba\x88\xe4\xb8\x8d\xe5\x8f\x96\xef\xbc\x8c\xe5\x8f
\x8d\xe4\xb8\xba\xe4\xb9\x8b\xe7\x81\xbe\xe3\x80\x82'
'''
ver_bytes_utf8 = ver.encode('utf-8')  # returns <class 'bytes'>, encoded with UTF-8
print(ver_bytes_utf8)
# Equivalent call with explicit keyword arguments:
ver_bytes_utf8 = ver.encode(encoding='utf-8', errors='strict')
print(ver_bytes_utf8)

ver_bytes_gbk = ver.encode('gbk')  # returns <class 'bytes'>, encoded with GBK
print(ver_bytes_gbk)

ver_bytes_unicode_escape = ver.encode('unicode_escape')  # returns <class 'bytes'> of \uXXXX escape sequences
print(ver_bytes_unicode_escape)


########## Decode
ver_str_unicode = ver.encode('unicode_escape').decode()  # returns <class 'str'> containing \uXXXX escapes
print(ver_str_unicode)
''' printed output (wrapped here for display; the actual output is one line):
\u5f97\u65f6\u65e0\u6020\uff0c\u65f6\u4e0d\u518d\u6765
\uff0c\u5929\u4e88\u4e0d\u53d6\uff0c\u53cd\u4e3a\u4e4b\u707e\u3002
'''
str_unicode = '\u7537\u513f\u4f55\u4e0d\u5e26\u5434\u94a9\uff0c\u6536\u53d6\u5173\u5c71\u4e94\u5341\u5dde\u3002\u8bf7\u541b\u6682\u4e0a\u51cc\u70df\u9601\uff0c\u82e5\u4e2a\u4e66\u751f\u4e07\u6237\u4faf\uff1f'
print(str_unicode)

print(ver_bytes_utf8.decode('utf-8'))

print(ver_bytes_gbk.decode('gbk'))
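A UnicodeDecodeError of the kind discussed in the next section appears as soon as bytes are decoded with a codec different from the one they were encoded with; a minimal sketch:

```python
data = '编码'.encode('gbk')  # GBK-encoded bytes

try:
    data.decode('utf-8')  # wrong codec for these bytes
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte ...

# Decoding with the matching codec recovers the original text
print(data.decode('gbk'))  # 编码
```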


2. Common encoding errors

For the error "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 15: illegal multibyte sequence", there are two solutions.
See: https://blog.csdn.net/qq_40494873/article/details/120474070

# -*- coding:utf-8 -*-
import os, locale
from ProcessorFile import config
normal_file = os.path.join(config.input_path, 'read_test.txt')

# Error caused by an encoding mismatch:
# UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 15: illegal multibyte sequence
# Two solutions follow.

# Solution 1: pass the encoding parameter to specify how the file should be decoded
with open(normal_file, 'r', encoding='utf-8') as f:
    line = f.readline()
    while line:
        print(line)  # each line still includes its trailing \n
        print(type(line))
        line = f.readline()

# Solution 2: open the file in binary mode ('rb') and decode each line when outputting
with open(normal_file, 'rb') as f:
    line = f.readline()
    while line:
        print(line.decode())
        line = f.readline()
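Besides the two solutions above, decode() (and open()) also accept an errors argument that replaces or drops undecodable bytes instead of raising; a sketch with deliberately invalid UTF-8 bytes:

```python
bad = b'\xff\xfe' + 'abc'.encode('utf-8')  # \xff and \xfe are never valid in UTF-8

# errors='replace' substitutes U+FFFD (the replacement character) for each bad byte
print(bad.decode('utf-8', errors='replace'))  # ��abc

# errors='ignore' silently drops the bad bytes
print(bad.decode('utf-8', errors='ignore'))   # abc
```

Note that both options lose information, so prefer specifying the correct encoding whenever it is known.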

Summary

This article briefly introduced the relationship between ASCII, Unicode and UTF-8, how to convert between str and bytes in Python 3 with encode() and decode(), and two ways to avoid UnicodeDecodeError when reading files.

Tags: Python

Posted on Sun, 10 Oct 2021 02:11:15 -0400 by dips_007