Crawling job information from 51Job and Liepin runs into character-set issues: 51Job serves gbk, Liepin serves utf-8.
Both sites declare their character set under the same meta tag, so one selector can extract it for either site.
The plan: read the page into a String, parse it once to get the declared character set, re-decode the page with that character set, and finally convert everything to a uniform utf-8.
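For illustration, a minimal, self-contained sketch of the charset-extraction step. The HTML head here is a made-up example of what a gbk site such as 51Job might serve; the selector is the same one used in the code below:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CharsetProbe {
    public static void main(String[] args) {
        // Illustrative <head> as a gbk site might serve it (assumption)
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gbk\">"
                + "</head><body></body></html>";
        Document doc = Jsoup.parse(html);
        // Both sites declare the charset under the same meta tag,
        // so one selector covers 51Job and Liepin
        String charset = doc.select("meta[http-equiv=Content-Type]")
                .attr("content").split("=")[1];
        System.out.println(charset); // prints: gbk
    }
}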
1.0 implementation, calling the EntityUtils.toString method twice
// Imports used by all three versions (HttpClient 4.x and Jsoup):
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
if (httpResponse.getStatusLine().getStatusCode() == 200) {
    // Read the page into a String
    String get_Charset_Entity2String = EntityUtils.toString(httpResponse.getEntity());
    // Parse it
    Document get_Charset_Document = Jsoup.parse(get_Charset_Entity2String);
    // Extract the charset information (same meta tag on 51Job and Liepin)
    String charset = get_Charset_Document.select("meta[http-equiv=Content-Type]")
            .attr("content").split("=")[1];
    System.out.println(charset);
    // Re-decode with the correct charset: this reads the same entity a second time
    String Ori_Entity = EntityUtils.toString(httpResponse.getEntity(), charset);
    // Convert to the uniform utf-8
    String entity = new String(Ori_Entity.getBytes(), "utf-8");
    System.out.println(entity);
}
This throws an exception.
Refer to https://blog.csdn.net/qq_23145857/article/details/70213277
It turns out the entity's content stream can only be read once, and we don't want to request the same page twice just to read it again.
So instead of calling EntityUtils a second time, just convert the String that was already saved.
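A minimal sketch of that constraint, assuming HttpClient 4.x (the exact exception depends on the entity implementation):

// Needs: import org.apache.http.HttpEntity;
HttpEntity entity = httpResponse.getEntity();
// A plain response entity is backed by the connection's input stream, so it is not repeatable
System.out.println(entity.isRepeatable()); // false
String first = EntityUtils.toString(entity);  // consumes the stream
String second = EntityUtils.toString(entity); // fails here: the stream was already consumed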
2.0 implementation, not using EntityUtils for the second conversion
CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
if (httpResponse.getStatusLine().getStatusCode() == 200) {
    // Read the page into a String
    String get_Charset_Entity2String = EntityUtils.toString(httpResponse.getEntity());
    // Parse it
    Document get_Charset_Document = Jsoup.parse(get_Charset_Entity2String);
    // Extract the charset information (same meta tag on 51Job and Liepin)
    String charset = get_Charset_Document.select("meta[http-equiv=Content-Type]")
            .attr("content").split("=")[1];
    System.out.println(charset);
    // Re-decode with the correct charset; this time convert the saved String
    // directly instead of reading the entity again with EntityUtils
    String Ori_Entity = new String(get_Charset_Entity2String.getBytes(), charset);
    // Convert to the uniform utf-8
    String entity = new String(Ori_Entity.getBytes(), "utf-8");
    System.out.println(entity);
}
Output: the text is still garbled. It turns out that EntityUtils.toString() falls back to ISO-8859-1 when no charset is specified, and at that point the page's real character set is not yet known.
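A minimal sketch of why that round trip corrupts the text (the sample string is illustrative, and it assumes the JVM default charset is not ISO-8859-1):

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        // The raw gbk bytes as a site like 51Job would send them
        byte[] gbkBytes = "职位".getBytes("gbk");
        // What EntityUtils.toString() without a charset produces: an ISO-8859-1 decode
        String misdecoded = new String(gbkBytes, "ISO-8859-1");
        // 2.0 then calls getBytes() with the JVM default charset, which does not
        // reproduce the original bytes, so re-decoding as gbk yields garbage
        System.out.println(new String(misdecoded.getBytes(), "gbk"));
        // ISO-8859-1 maps every byte value to exactly one char, so only an
        // ISO-8859-1 re-encode would round-trip the bytes losslessly
        System.out.println(new String(misdecoded.getBytes("ISO-8859-1"), "gbk")); // 职位
    }
}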
The solution below the reference link is to save the stream as a byte array instead: bytes can be re-decoded flexibly with any character set later.
3.0 implementation, using EntityUtils.toByteArray instead of EntityUtils.toString
CloseableHttpResponse httpResponse = httpClient.execute(httpGet);
if (httpResponse.getStatusLine().getStatusCode() == 200) {
    // Read the page into a byte[] instead of a String
    byte[] bytes = EntityUtils.toByteArray(httpResponse.getEntity());
    // Decode the bytes with the default charset, just to locate the meta tag
    String get_Charset_Entity2String = new String(bytes);
    // Parse it
    Document get_Charset_Document = Jsoup.parse(get_Charset_Entity2String);
    // Extract the charset information (same meta tag on 51Job and Liepin)
    String charset = get_Charset_Document.select("meta[http-equiv=Content-Type]")
            .attr("content").split("=")[1];
    System.out.println(charset);
    // Re-decode the original bytes with the correct charset
    String Ori_Entity = new String(bytes, charset);
    // Convert to the uniform utf-8
    String entity = new String(Ori_Entity.getBytes(), "utf-8");
    System.out.println(entity);
}
As for which default character set new String(bytes) uses (a quick check is sketched after this list):
Reference: https://blog.csdn.net/wangxin1949/article/details/78974037
1. If Eclipse is used, it is determined by the encoding setting of the Java source file.
2. If Eclipse is not used, the operating system's locale decides; on a Chinese-language system the default is GBK.
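Note that the first decode only needs to keep the ASCII meta tag readable, which any of these defaults does; the real decoding always restarts from the byte array. To see which default your own JVM uses, a quick sketch:

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // Set by Eclipse's file/project encoding, or by the OS locale otherwise
        System.out.println(Charset.defaultCharset());             // e.g. GBK on Chinese Windows
        System.out.println(System.getProperty("file.encoding"));  // same information
    }
}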
The output is normal.
Change the URL to Liepin and try again.
Perfect, the crawler now handles both character sets correctly.
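To wrap up, here is the 3.0 logic collected into one reusable method. This is a sketch rather than a drop-in implementation: the URLs are illustrative, and the final utf-8 conversion is dropped because a Java String, once decoded with the right charset, no longer carries a charset of its own.

import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;

public class PageFetcher {
    // Fetch a page and return its text regardless of the site's declared charset
    static String fetchPage(CloseableHttpClient client, String url) throws IOException {
        HttpGet get = new HttpGet(url);
        try (CloseableHttpResponse response = client.execute(get)) {
            if (response.getStatusLine().getStatusCode() != 200) {
                throw new IOException("HTTP " + response.getStatusLine().getStatusCode());
            }
            byte[] bytes = EntityUtils.toByteArray(response.getEntity());
            // First pass: default-charset decode, only to read the ASCII meta tag
            String charset = Jsoup.parse(new String(bytes))
                    .select("meta[http-equiv=Content-Type]")
                    .attr("content").split("=")[1];
            // Second pass: decode the same bytes with the declared charset
            return new String(bytes, charset);
        }
    }

    public static void main(String[] args) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            System.out.println(fetchPage(client, "https://www.51job.com/"));  // gbk site
            System.out.println(fetchPage(client, "https://www.liepin.com/")); // utf-8 site
        }
    }
}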