Java Basic Tutorial - Conversion Stream

Conversion streams

On Chinese-language Windows systems the default encoding is usually GBK, while UTF-8 is generally recommended for Java projects. In that situation, reading a file can produce garbled text. In practice there are many scenarios in which the encoding formats do not match.

Conversion streams let you specify the encoding explicitly, which solves the garbled-text problem.

OutputStreamWriter
InputStreamReader

Character encoding: the set of rules that maps the characters of a natural language to binary numbers.
When the encoding used to read a file stream differs from the encoding of the file itself, the text comes out garbled.
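
The mismatch described above is easy to reproduce in memory. The following is a minimal sketch (the class and method names are illustrative and not part of the original tutorial): it encodes a string as UTF-8 and then decodes the same bytes once with UTF-8 and once with ISO-8859-1. Only the matching charset recovers the text. The Chinese sample text is written with Unicode escapes so that the result does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with one charset and decode the resulting bytes with another.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, U+4E2D and U+6587
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("matching charset recovers the text: " + original.equals(same));
        System.out.println("mismatched charset recovers the text: " + original.equals(wrong));
        // Each UTF-8 byte was misread as one Latin-1 character
        System.out.println("characters after the wrong decode: " + wrong.length());
    }
}
```

The bytes themselves are never damaged here; only their interpretation changes, which is why choosing the right charset when reading is enough to fix the problem.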

Charset (character set): the collection of all characters supported by a system, including letters, digits, punctuation marks, graphical symbols, and so on. Each character set has at least one character encoding. Common character sets include ASCII, GBK, and Unicode.

ASCII Character Set:
| ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet, used to display modern English. It consists mainly of control characters (carriage return, backspace, line feed, etc.) and printable characters (upper- and lower-case English letters, Arabic numerals, and Western punctuation symbols).
| The basic ASCII character set uses seven bits per character, for a total of 128 characters.
| Extended ASCII character sets use eight bits per character, for a total of 256 characters, which makes room for commonly used European characters.

ISO-8859-1 Character Set:
| The Latin code table, also known as Latin-1, is used for Western European languages, including Dutch, Danish, German, Italian, Spanish, and so on.
| ISO-8859-1 uses a single-byte encoding and is compatible with ASCII.
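
ASCII compatibility can be checked directly: text that contains only ASCII characters produces identical bytes in US-ASCII, ISO-8859-1, and UTF-8, while a character outside ASCII does not. The sketch below is only an illustration (the class and method names are made up for this example).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompatibilityDemo {
    // True when the text encodes to identical bytes in US-ASCII, ISO-8859-1 and UTF-8.
    static boolean sameBytesEverywhere(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        String plain = "Hello, Java!";  // ASCII characters only
        String accented = "caf\u00e9";  // ends with e-acute (U+00E9), which is not in ASCII
        System.out.println(sameBytesEverywhere(plain));
        System.out.println(sameBytesEverywhere(accented));
        // One byte per character in ISO-8859-1, but e-acute needs two bytes in UTF-8
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);
    }
}
```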

GBxxx Character Sets: GB stands for "national standard"; these are character sets designed for displaying Chinese.
| GB2312: the Simplified Chinese code table. Byte values below 127 keep their ASCII meaning, but two consecutive bytes greater than 127 together represent one Chinese character. The table contains about 7,000 simplified Chinese characters, as well as mathematical symbols, Greek letters, and Japanese kana. Even the digits, punctuation marks, and letters that already exist in ASCII are encoded again as two-byte characters; these are the so-called "full-width" characters, while the original ones below 127 are called "half-width" characters.
| GBK: the most commonly used Chinese code table. It is an extension of the GB2312 standard that uses a double-byte encoding scheme for Chinese characters and contains 21,003 of them. It is fully compatible with GB2312 and also supports traditional Chinese characters and the Han characters used in Japanese and Korean. Simplified Chinese editions of Windows use GBK by default.
| GB18030: the latest Chinese code table. It includes more than 70,000 Chinese characters and uses a multi-byte encoding in which each character takes one, two, or four bytes. It supports the scripts of China's ethnic minorities, as well as traditional Chinese characters and the Han characters used in Japanese and Korean.
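
The byte widths described above can be observed from Java. This is a small sketch under the assumption that the runtime ships the GBK and GB18030 charsets (a full JDK normally does, but a trimmed-down runtime may not, hence the `Charset.isSupported` check). The class and method names are illustrative.

```java
import java.nio.charset.Charset;

class GbByteLengthDemo {
    // Number of bytes the text occupies in the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String latinA = "A";
        String han = "\u4e2d"; // U+4E2D, a common Chinese character
        for (String name : new String[] { "GBK", "GB18030" }) {
            if (!Charset.isSupported(name)) { // assumption: extended charsets may be missing
                System.out.println(name + " is not available in this runtime");
                continue;
            }
            System.out.println(name + ": A -> " + byteLength(latinA, name)
                    + " byte(s), U+4E2D -> " + byteLength(han, name) + " byte(s)");
        }
    }
}
```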

Unicode Character Set:
Unicode is designed to represent any character in any language. It is an industry standard, also known as the Universal Code.
It has three encoding schemes: UTF-8, UTF-16, and UTF-32. UTF-8 is the most commonly used.
| UTF-8 can represent any character in the Unicode standard and is the preferred encoding for e-mail, web pages, and other applications that store or transmit text. The Internet Engineering Task Force (IETF) requires that all Internet protocols support UTF-8.
UTF-8 uses 1 to 4 bytes per character (the original design allowed up to 6 bytes, but the current standard, RFC 3629, limits it to 4). The encoding rules are as follows:

  1. The 128 ASCII characters use one byte.
  2. Latin letters with diacritics, Greek letters, and similar characters use two bytes.
  3. Most commonly used characters (including Chinese) use three bytes; the Chinese characters in this range are roughly the set that GBK covers (where they take two bytes each).
  4. The remaining, rarely used characters use four bytes.
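
The byte counts in the list above follow from the code point ranges that UTF-8 assigns to each length. The sketch below (an illustration with made-up names, not the JDK's own implementation) predicts the length from the code point and compares it with what `String.getBytes` actually produces. Surrogate code points (U+D800 to U+DFFF) are not valid characters on their own and are not handled.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Byte count of one Unicode code point in UTF-8, derived from the standard ranges.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F: ASCII
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF: accented Latin, Greek, ...
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF: most CJK characters
        return 4;                           // U+10000..U+10FFFF: supplementary characters
    }

    public static void main(String[] args) {
        // Latin A, Greek capital Omega, a Chinese character, an emoji
        int[] samples = { 'A', 0x03A9, 0x4E2D, 0x1F600 };
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: predicted %d, actual %d%n", cp, utf8Length(cp), actual);
        }
    }
}
```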

Reference [How many bytes a Chinese character takes in UTF-8]: https://www.cnblogs.com/zxz1987/articles/6544593.html

The following program shows how many bytes different characters take up in UTF-8:

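import java.nio.charset.StandardCharsets;

public class TestCharEncoding {
    public static void main(String[] args) {
        // A Latin letter, a Greek letter, and the Chinese character for "one" (U+4E00)
        String[] strArr = { "A", "Ω", "一" };
        for (String s : strArr) {
            // getBytes(Charset) encodes the string; the array length is the byte count
            System.out.println("s: " + s.getBytes(StandardCharsets.UTF_8).length);
        }
    }
}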

Output:

s: 1
s: 2
s: 3

Example code: conversion streams can read and write files in any specified encoding, whereas a plain character stream such as FileReader, which falls back to the default encoding, may produce garbled text.

  • Open the file as a byte stream with the FileInputStream class.
  • Wrap it in an InputStreamReader, which converts the byte stream into a character stream using the encoding you specify.
  • Wrap the character stream in a BufferedReader for line-by-line reading.

Writing: File -> FileOutputStream -> OutputStreamWriter -> BufferedWriter

Reading: File -> FileInputStream -> InputStreamReader -> BufferedReader

package ahjava.io;

import java.io.*;

public class ConversionStreamDemo {
    public static final String FILE_NAME = "testMessyCode.txt";
    public static final String CHARSET_NAME = "gbk";

    // File -> FileOutputStream -> OutputStreamWriter (explicit encoding) -> BufferedWriter
    static void write(String msg) throws IOException {
        File f = new File(FILE_NAME);
        // try-with-resources closes the streams in reverse order: bw, osw, fos.
        // Closing fos before bw by hand can make bw's final flush throw an exception.
        try (FileOutputStream fos = new FileOutputStream(f);
             OutputStreamWriter osw = new OutputStreamWriter(fos, CHARSET_NAME);
             BufferedWriter bw = new BufferedWriter(osw)) {
            bw.write(msg);
        }
    }

    // File -> FileInputStream -> InputStreamReader (explicit encoding) -> BufferedReader
    static void read() throws IOException {
        File f = new File(FILE_NAME);
        try (FileInputStream fis = new FileInputStream(f);
             InputStreamReader isr = new InputStreamReader(fis, CHARSET_NAME);
             BufferedReader br = new BufferedReader(isr)) {
            String str;
            while ((str = br.readLine()) != null) { // read the data line by line
                System.out.println(CHARSET_NAME + " read: " + str);
            }
        }
    }

    // For comparison: a plain character stream, which decodes with the default encoding
    static void readWithFileReader() {
        File file = new File(FILE_NAME);
        try (FileReader fr = new FileReader(file);
             BufferedReader br = new BufferedReader(fr)) {
            String str;
            while ((str = br.readLine()) != null) {
                System.out.println("Plain character stream read: " + str);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        // Non-ASCII text, so that an encoding mismatch becomes visible
        write("乱码吗?");
        // readWithFileReader(); // uncomment to compare with the plain character stream
        read();
    }
}
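
A common practical use of the two conversion streams together is re-encoding a file: read it with the charset it was written in and write it back out with another. The following is a minimal sketch under stated assumptions; the class name, the `transcode` helper, and the temporary files are all hypothetical. The demo converts UTF-8 to UTF-16BE because both are guaranteed to exist in every Java runtime; passing `Charset.forName("GBK")` as the source charset would handle the GBK case discussed in this article.

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class TranscodeDemo {
    // Copy a text file, decoding with one charset and re-encoding with another.
    static void transcode(File source, Charset from, File target, Charset to) throws IOException {
        try (Reader in = new BufferedReader(
                     new InputStreamReader(new FileInputStream(source), from));
             Writer out = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(target), to))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) { // characters, not bytes, are copied
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u6587 text"; // two Chinese characters followed by ASCII
        File source = File.createTempFile("source", ".txt");
        File target = File.createTempFile("target", ".txt");
        Files.write(source.toPath(), text.getBytes(StandardCharsets.UTF_8));

        transcode(source, StandardCharsets.UTF_8, target, StandardCharsets.UTF_16BE);

        String back = new String(Files.readAllBytes(target.toPath()), StandardCharsets.UTF_16BE);
        System.out.println("content preserved: " + back.equals(text));
        System.out.println(source.length() + " bytes -> " + target.length() + " bytes");
    }
}
```

The text is identical before and after; only the byte representation on disk changes.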

* Console input

Scenario: text entered in the console is written to a file.
Java projects are generally set to UTF-8, so by default a file written by the program is UTF-8 encoded.
However, the default encoding on Chinese-language Windows is GBK; a conversion stream makes it possible to write the file in GBK instead.
The following code writes the user's console input to a file in GBK encoding.

import java.io.*;

public class ConsoleInput {
    public static void main(String[] args) throws IOException {
        writeConsoleToFile();
    }

    // Reads lines from the console and writes them to a file until the user enters "!!!"
    static void writeConsoleToFile() throws IOException {
        // Output: FileOutputStream -> OutputStreamWriter (explicit encoding) -> BufferedWriter
        FileOutputStream fos = new FileOutputStream("Console Writes Files.txt");
        OutputStreamWriter osw = new OutputStreamWriter(fos, "GBK");
        BufferedWriter bw = new BufferedWriter(osw);
        // Input: System.in (byte stream) -> InputStreamReader -> BufferedReader
        // Convert the console input into a character stream and add buffering
        InputStreamReader isr = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(isr);
        String str = br.readLine(); // read one line from the console
        // Stop at "!!!" or at end of input (readLine() returns null)
        while (str != null && !str.equals("!!!")) {
            bw.write(str); // write the line to the file
            bw.newLine(); // add a line separator
            str = br.readLine(); // read the next line from the console
        }
        // Close the input-related streams
        br.close();
        isr.close();
        // Close the output-related streams
        bw.close();
        osw.close();
        fos.close();
    }
}

* Think about it: can the method above that writes console input to a file be called twice in a row?
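
One way to explore this question is to check what closing the outer reader does to the stream it wraps. The sketch below is only an illustration: `TrackingInputStream` is a hypothetical in-memory stand-in for `System.in` that records whether `close()` was called, and the other names are made up as well.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

class CloseDemo {
    // An in-memory stand-in for System.in that records whether close() was called.
    static class TrackingInputStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingInputStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Wrap the stream the way the console example wraps System.in, then close the outer reader.
    static boolean closingReaderClosesSource(byte[] data) throws IOException {
        TrackingInputStream source = new TrackingInputStream(data);
        BufferedReader br = new BufferedReader(new InputStreamReader(source));
        br.readLine();
        br.close(); // closing the outermost wrapper is passed down the chain
        return source.closed;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first line\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("underlying stream closed: " + closingReaderClosesSource(data));
    }
}
```

Since closing the BufferedReader also closes the stream underneath it, the first call to the console method leaves `System.in` closed, and a second call that tries to read from it would be expected to fail with an IOException.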


Posted on Fri, 12 Jul 2019 18:11:04 -0400 by adnan1983