Solving character-set problems with EntityUtils.toString(entity)

Crawling job postings from 51Job and Liepin raises character-set issues: 51Job pages are GBK, Liepin pages are UTF-8.

Both sites declare their character set in the same meta tag.
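Since both sites carry the charset in the same `Content-Type` meta tag, one extraction rule covers both. A minimal JDK-only sketch (the regex stands in for Jsoup's selector; the sample tags are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetMeta {
    // Matches "charset=..." inside a Content-Type meta tag, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or null if the page does not declare one.
    public static String extractCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // 51Job-style declaration (GBK)
        String gbkPage = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gbk\">";
        // Liepin-style declaration (UTF-8)
        String utfPage = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(extractCharset(gbkPage)); // gbk
        System.out.println(extractCharset(utfPage)); // utf-8
    }
}
```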

The plan: save the page as a String, parse it once to extract the character set, re-decode the page with that character set, and finally convert everything to a uniform UTF-8.

1.0 implementation, calling EntityUtils.toString twice

CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                // Read the page into a String (charset not known yet)
                String get_Charset_Entity2String = EntityUtils.toString(httpResponse.getEntity());
                // Parse it
                Document get_Charset_Document = Jsoup.parse(get_Charset_Entity2String);
                // Extract the charset from the meta tag (same tag on 51Job and Liepin)
                String charset = get_Charset_Document.select("meta[http-equiv=Content-Type]")
                        .attr("content").split("=")[1];
                System.out.println(charset);
                // Re-decode with the correct charset by calling EntityUtils again
                String Ori_Entity = EntityUtils.toString(httpResponse.getEntity(), charset);
                // Convert to the uniform UTF-8
                String entity = new String(Ori_Entity.getBytes(), "utf-8");
                System.out.println(entity);
            }

Running it reports an error.

Refer to https://blog.csdn.net/qq_23145857/article/details/70213277

It turns out the entity's underlying stream can only be consumed once, and we don't want to request the same page twice.

So instead of calling EntityUtils again, just convert directly from the String we already saved.
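The "stream only exists once" behavior is easy to reproduce with a plain InputStream; a non-repeatable HTTP entity wraps exactly this kind of one-shot stream, which is why a second EntityUtils.toString finds nothing to read. A JDK-only sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class OneShotStream {
    // Drain whatever is left in the stream, like EntityUtils.toString does.
    public static String readOnce(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "<html>page</html>".getBytes(StandardCharsets.UTF_8));
        // First read drains the stream, like the first toString call.
        System.out.println(readOnce(in));           // <html>page</html>
        // Second read gets nothing, like the failing second call.
        System.out.println(readOnce(in).isEmpty()); // true
    }
}
```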

2.0 implementation, not using EntityUtils the second time

CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                // Read the page into a String (charset not known yet)
                String get_Charset_Entity2String = EntityUtils.toString(httpResponse.getEntity());
                // Parse it
                Document get_Charset_Document = Jsoup.parse(get_Charset_Entity2String);
                // Extract the charset from the meta tag (same tag on 51Job and Liepin)
                String charset = get_Charset_Document.select("meta[http-equiv=Content-Type]")
                        .attr("content").split("=")[1];
                System.out.println(charset);
                // Re-decode from the saved String instead of calling EntityUtils again
                String Ori_Entity = new String(get_Charset_Entity2String.getBytes(), charset);
                // Convert to the uniform UTF-8
                String entity = new String(Ori_Entity.getBytes(), "utf-8");
                System.out.println(entity);
            }

Output:

The character set is still wrong. It turns out that EntityUtils.toString() falls back to ISO-8859-1 when no charset is specified, and at that point we don't know the real charset yet.
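That ISO-8859-1 fallback is actually exploitable: ISO-8859-1 maps every byte to a character one-to-one, so a string mis-decoded that way still preserves the original bytes, and `getBytes("ISO-8859-1")` recovers them exactly, while the no-argument `getBytes()` (platform default) used in 2.0 does not. A JDK-only sketch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class IsoRoundTrip {
    // Recover the original bytes from a string that was mis-decoded as
    // ISO-8859-1, then decode them with the real charset.
    public static String redecode(String misread, String realCharset) {
        byte[] original = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(realCharset));
    }

    public static void main(String[] args) {
        byte[] gbkBytes = "中文".getBytes(Charset.forName("GBK"));
        // What EntityUtils.toString() produces without a charset:
        // mojibake on screen, but byte-for-byte lossless.
        String misread = new String(gbkBytes, StandardCharsets.ISO_8859_1);
        System.out.println(redecode(misread, "GBK")); // 中文
    }
}
```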

The solution below the reference link: save the stream directly as a byte array, which can then be decoded as many times as needed.

3.0 implementation, using EntityUtils.toByteArray instead of EntityUtils.toString

CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                // Read the page into a byte[]
                byte[] bytes = EntityUtils.toByteArray(httpResponse.getEntity());
                // Decode the bytes with the default charset, just to find the meta tag
                String get_Charset_Entity2String = new String(bytes);
                // Parse it
                Document get_Charset_Document = Jsoup.parse(get_Charset_Entity2String);
                // Extract the charset from the meta tag (same tag on 51Job and Liepin)
                String charset = get_Charset_Document.select("meta[http-equiv=Content-Type]")
                        .attr("content").split("=")[1];
                System.out.println(charset);
                // Re-decode the original bytes with the correct charset
                String Ori_Entity = new String(bytes, charset);
                // Convert to the uniform UTF-8
                String entity = new String(Ori_Entity.getBytes(), "utf-8");
                System.out.println(entity);
            }
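The 3.0 flow can be sketched end to end with only the JDK. This is a sketch, not the post's exact code: the regex stands in for Jsoup's `meta[http-equiv=Content-Type]` selector, the sample page stands in for a real 51Job response, and the first-pass decoding uses ISO-8859-1 instead of the platform default so it behaves the same on any machine:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageDecoder {
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Decode a raw page: peek at the bytes for the declared charset,
    // then decode the same bytes again with that charset.
    public static String decode(byte[] bytes) {
        // Lossless first pass just to locate the meta tag.
        String peek = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(peek);
        String charset = m.find() ? m.group(1) : "utf-8";
        return new String(bytes, Charset.forName(charset));
    }

    public static void main(String[] args) {
        // Hypothetical GBK page standing in for a 51Job response.
        String page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gbk\">"
                + "<title>前程无忧</title>";
        byte[] gbkBytes = page.getBytes(Charset.forName("GBK"));
        System.out.println(decode(gbkBytes).contains("前程无忧")); // true
    }
}
```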

About the default character set used by new String(bytes):

Reference: https://blog.csdn.net/wangxin1949/article/details/78974037

  • 1. Under eclipse, it is determined by the encoding of the .java source file.
  • 2. Outside eclipse, the local system locale decides; on a Chinese system the default is GBK.
Either way it doesn't matter here, because the ASCII meta tag survives any of these decodings; as long as the charset can be extracted from the tag, the original bytes can be re-decoded correctly.
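The platform dependence described above is visible directly in the JDK: `new String(bytes)` uses `Charset.defaultCharset()`, so the same bytes decode differently on a GBK Windows machine and a UTF-8 Linux box, while passing the charset explicitly removes the ambiguity. A small sketch:

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // What new String(bytes) actually uses: the IDE's file encoding,
        // or the OS locale (GBK on a Chinese Windows install).
        System.out.println(Charset.defaultCharset());

        byte[] gbkBytes = "前程无忧".getBytes(Charset.forName("GBK"));
        // Explicit charset: correct on every machine.
        String right = new String(gbkBytes, Charset.forName("GBK"));
        System.out.println(right); // 前程无忧
        // Implicit default: correct only if the default happens to be GBK.
        String maybeWrong = new String(gbkBytes);
        System.out.println(right.equals(maybeWrong));
    }
}
```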

The output is now correct.

Switching to the Liepin URL and trying again also gives correct output.

Perfect: the crawler now handles both character sets correctly.

Tags: Java encoding Eclipse network

Posted on Tue, 26 Nov 2019 16:22:22 -0500 by bull