[Go] Network programming: HTTP programming

12.3 HTTP programming

12.3.1 General

12.3.1.1 Web working mode

When we browse the web, we open a browser, type a URL into the address bar, and press Enter, and the content we want appears. What is hidden behind this seemingly simple action?

For an ordinary page visit, the system actually does the following. The browser itself is a client: when you enter a URL, the browser first asks a DNS server to resolve the domain name to an IP address, then locates the server at that IP address and asks it to establish a TCP connection. The browser then sends an HTTP request packet; after receiving it, the server processes the request, calls its own services, and returns an HTTP response packet. When the client receives the response, it starts rendering the body of the response, and once all the content has arrived it disconnects the TCP connection to the server.

A web server, also called an HTTP server, communicates with clients over the HTTP protocol. The client is usually a web browser (mobile clients, too, are typically built on a browser core).

The working principle of a web server can be summarized as follows (a sketch of these steps in Go follows the list):
 the client establishes a TCP connection to the server via the TCP/IP protocol
 the client sends an HTTP request packet to the server, requesting a resource document
 the server sends an HTTP response packet to the client; if the requested resource contains dynamic content, the server first calls the interpretation engine of the dynamic language to process the "dynamic content" and returns the processed data
 the client disconnects from the server, interprets the HTML document, and renders the result on the screen
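
The whole round trip can be reproduced with a few lines of Go. The following is only a minimal sketch, not production code; it assumes example.com is reachable and dials port 80 explicitly so each protocol step stays visible:

package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
)

func main() {
	// Step 1: resolve the domain name to an IP address (the DNS query the browser makes)
	ips, err := net.LookupIP("example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("resolved IPs:", ips)

	// Step 2: establish a TCP connection to the server
	conn, err := net.Dial("tcp", "example.com:80")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close() // Step 4: disconnect when everything has been received

	// Step 3: send an HTTP request packet and read the first line of the HTTP response packet
	fmt.Fprint(conn, "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
	statusLine, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("server replied:", statusLine)
}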

12.3.1.2 HTTP protocol

The Hypertext Transfer Protocol (HTTP) is the most widely used network protocol on the Internet. It specifies the rules of communication between browsers and World Wide Web servers: the data transmission protocol by which World Wide Web documents are transferred over the Internet.

HTTP is usually carried over TCP, and sometimes over a TLS or SSL layer; in the latter case it becomes what we commonly call HTTPS.
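
In Go, the net/http package handles both cases transparently: the URL scheme decides whether the connection is wrapped in TLS. A minimal sketch, assuming example.com is reachable:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// "http://..." rides directly on TCP; "https://..." wraps the same
	// protocol in a TLS session; the client code is otherwise identical.
	resp, err := http.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("protocol:", resp.Proto, "status:", resp.Status)
}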

12.3.1.3 Address (URL)

The full name of URL is Uniform Resource Locator. A URL identifies a network resource and can be understood as a network file path.

The URL format is as follows:

http://host[":"port][abs_path]

For example: http://192.168.31.1/html/index

The length of a URL is limited; the limit differs from server to server, but it cannot be unbounded.
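
The components of this format can be inspected with the standard net/url package. A small sketch, using the address from the text plus an assumed port 8080 for illustration:

package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	// Parse a URL of the form http://host[":"port][abs_path]
	u, err := url.Parse("http://192.168.31.1:8080/html/index")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("scheme:", u.Scheme) // http
	fmt.Println("host:", u.Host)     // 192.168.31.1:8080 (host plus optional port)
	fmt.Println("port:", u.Port())   // 8080 (empty if no port was given)
	fmt.Println("path:", u.Path)     // /html/index
}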

12.3.2 HTTP message analysis

12.3.2.1 Format of the request message

  1. Test code
    Server test code:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Create a listening socket
	listener, err := net.Listen("tcp", "127.0.0.1:8000")
	if err != nil {
		log.Fatal(err) // log.Fatal() prints the error and exits the program
	}
	defer listener.Close()

	conn, err := listener.Accept() // Block until a client connects
	if err != nil {
		log.Println(err)
		return
	}
	defer conn.Close() // Close the connection socket when this function ends

	// conn.RemoteAddr().String(): the client's network address
	ipAddr := conn.RemoteAddr().String()
	fmt.Println(ipAddr, "Connection succeeded")

	buf := make([]byte, 4096) // Buffer used to receive the data sent by the client

	// Block until the client sends data
	n, err := conn.Read(buf) // n is the length of the received data
	if err != nil {
		fmt.Println(err)
		return
	}

	// Slice off only the valid data
	result := buf[:n]
	fmt.Printf("Received data from [%s]==>:\n%s\n", ipAddr, string(result))
}

Enter http://127.0.0.1:8000/ in the browser's address bar. The server prints the raw request message it receives, in the format described below (screenshots omitted).

  2. Format description of the request message
    An HTTP request message consists of a request line, request headers, a blank line, and the request body (figure omitted).

  3. Request line
    The request line consists of three fields separated by spaces: the method field, the URL field, and the HTTP protocol version field. The most common HTTP request methods are GET and POST.

GET:
 a client uses the GET method when it wants to read a resource from the server. GET asks the server to place the resource identified by the URL in the data part of the response message and send it back, i.e., to return a resource to the client.
 with GET, the request parameters and their values are appended to the URL, with a question mark ("?") marking the end of the URL path and the beginning of the parameters. The length of the parameters is limited, so GET is not suitable for uploading data.
 when a page is fetched with GET, the parameters are displayed in the browser's address bar, so confidentiality is poor.

POST:
 a client can use the POST method when it needs to provide more information to the server. POST submits data, such as a completed form, to the server for processing.
 GET is generally used to fetch or query resource information, while POST carries user data and is generally used to update resource information. POST encapsulates the request parameters in the HTTP request body, where they appear as name/value pairs, so their length is unlimited and large amounts of data can be transmitted. A sketch contrasting the two methods follows this list.
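
The contrast is easy to see from the client side with net/http. A minimal sketch; the /search and /submit paths are made up for illustration, and a server such as the one at 127.0.0.1:8000 used later in this section is assumed to be running:

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// GET: the parameters ride in the URL after "?", visible in the address bar
	resp, err := http.Get("http://127.0.0.1:8000/search?q=golang&page=1")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("GET status:", resp.Status)

	// POST: the parameters travel in the request body as name/value pairs,
	// so their length is not limited by the URL
	form := url.Values{"name": {"gopher"}, "age": {"10"}}
	resp, err = http.PostForm("http://127.0.0.1:8000/submit", form)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("POST status:", resp.Status)
}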

  4. Request headers
    Request headers add additional information to the request message. They consist of "name/value" pairs, one pair per line, with the name and value separated by a colon.
    Request headers tell the server about the client making the request. Typical request headers are:

User-Agent: the type of browser making the request
Accept: the list of response content types the client can recognize; the asterisk "*" groups types by range, with "*/*" meaning all types are acceptable and "type/*" meaning all subtypes of type are acceptable
Accept-Language: the natural languages the client accepts
Accept-Encoding: the encoding/compression formats the client accepts
Accept-Charset: the character sets acceptable in the response
Host: the requested host name, which allows multiple domain names to share one IP address, i.e., virtual hosting
Connection: the connection mode (close or keep-alive)
Cookie: a client-side extension field; sends cookies belonging to the domain to servers in the same domain
  5. Blank line
    The last request header is followed by an empty line. This carriage return and line feed tell the server that there are no more request headers below.

  6. Request body
    The request body is not used with the GET method; it is used with the POST method.
    POST suits situations where the client needs to fill out a form. The header fields most commonly used together with the request body are Content-Type, which gives the type of the body, and Content-Length, which gives its length; a sketch showing both follows this item.
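
To see Content-Type and Content-Length in an actual message, we can build a POST request and dump its wire format. A sketch under the same assumptions as before; the /submit path is hypothetical:

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"strings"
)

func main() {
	// A POST request whose body is a small form payload
	body := strings.NewReader("name=gopher&age=10")
	req, err := http.NewRequest("POST", "http://127.0.0.1:8000/submit", body)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

	// Dump the wire format: request line, headers (including the
	// automatically computed Content-Length), a blank line, then the body
	raw, err := httputil.DumpRequestOut(req, true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(raw))
}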

12.3.2.2 Format of the response message

  1. Test code
    Server example code:

package main

import (
	"fmt"
	"net/http"
)

// Business logic handler written on the server side
func myHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "hello world")
}

func main() {
	http.HandleFunc("/go", myHandler)

	// Listen on the specified address and start an HTTP server
	http.ListenAndServe("127.0.0.1:8000", nil)
}

Start the server program.

Client test example code:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// The client actively connects to the server
	conn, err := net.Dial("tcp", "127.0.0.1:8000")
	if err != nil {
		log.Fatal(err) // log.Fatal() prints the error and exits the program
	}
	defer conn.Close()

	// A hand-written HTTP request message: request line, headers, blank line
	requestHeader := "GET /go HTTP/1.1\r\nAccept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/xaml+xml, application/x-ms-xbap, */*\r\nAccept-Language: zh-Hans-CN,zh-Hans;q=0.8,en-US;q=0.5,en;q=0.3\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)\r\nAccept-Encoding: gzip, deflate\r\nHost: 127.0.0.1:8000\r\nConnection: Keep-Alive\r\n\r\n"

	// Send the request message first
	conn.Write([]byte(requestHeader))

	buf := make([]byte, 4096) // buffer

	// Block until the server replies
	n, err := conn.Read(buf) // n is the length of the received data
	if err != nil {
		fmt.Println(err)
		return
	}

	// Slice off only the valid data
	result := buf[:n]
	fmt.Printf("Data received[%d]:\n%s\n", n, string(result))
}

Start the program to observe a successful HTTP response message; requesting a path other than /go produces a failure response message such as 404 Not Found (screenshots omitted).

  2. Format description of the response message
    An HTTP response message consists of a status line, response headers, a blank line, and the response body (figure omitted).

  3. Status line
    The status line consists of three fields separated by spaces: the HTTP protocol version field, the status code, and the descriptive text of the status code.

Status code:
The status code consists of three digits; the first digit indicates the type of the response. The commonly used status codes fall into five categories:

1xx: the server has received the client request and the client may continue sending requests
2xx: the server successfully received and processed the request
3xx: the server requires the client to redirect
4xx: the client's request contains illegal content
5xx: the server failed to process the client's request because an unexpected error occurred

Examples of common status codes:

200 OK: the client request succeeded
400 Bad Request: the request message has a syntax error
401 Unauthorized: the request is not authorized
403 Forbidden: the server refuses to provide the service
404 Not Found: the requested resource does not exist
500 Internal Server Error: an unexpected error occurred inside the server
503 Service Unavailable: the server is temporarily unable to process the client's request (it may recover later)
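
On the client side, the category can be checked from resp.StatusCode. A minimal sketch, assuming the /go server from the earlier example is running:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:8000/go")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The first digit of the three-digit code gives the response class
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		fmt.Println("success:", resp.Status) // e.g. "200 OK"
	case resp.StatusCode >= 300 && resp.StatusCode < 400:
		fmt.Println("redirect:", resp.Status)
	case resp.StatusCode >= 400 && resp.StatusCode < 500:
		fmt.Println("client error:", resp.Status) // e.g. "404 Not Found"
	default:
		fmt.Println("server error:", resp.Status) // e.g. "500 Internal Server Error"
	}
}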

  4. Response headers
    The response headers may include:

Location: redirects the recipient to a new resource location
Server: the software and version the server used to process the request
Vary: the list of request-header fields that determine whether a cached response may be used
Connection: the connection mode (close or keep-alive)

  5. Blank line
    The last response header is followed by an empty line. This carriage return and line feed tell the client that there are no more response headers below.

  6. Response body
    The text information the server returns to the client.

12.3.3 HTTP programming

The Go standard library provides the built-in net/http package, which covers concrete implementations of both the HTTP client and the HTTP server. Using the net/http package, we can easily write HTTP client or server programs.

12.3.3.1 HTTP server

Example code:

package main

import (
	"fmt"
	"net/http"
)

// Business logic handler written on the server side
// Handler function: a function with the signature func(w http.ResponseWriter, r *http.Request)
func myHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Println(r.RemoteAddr, "Connection succeeded") // r.RemoteAddr is the remote network address
	fmt.Println("method = ", r.Method)                // Request method
	fmt.Println("url = ", r.URL.Path)
	fmt.Println("header = ", r.Header)
	fmt.Println("body = ", r.Body)
	w.Write([]byte("hello go")) // Reply data to the client
}

func main() {
	http.HandleFunc("/go", myHandler)

	// ListenAndServe listens on the TCP network address addr and then calls
	// the server-side handler to process incoming connection requests.
	// It takes two parameters: the first, addr, is the listening address;
	// the second is the server-side handler and is usually nil,
	// which means the server uses http.DefaultServeMux for processing.
	http.ListenAndServe("127.0.0.1:8000", nil)
}

Enter http://127.0.0.1:8000/go in the browser's address bar: the browser displays "hello go", and the server prints the request method, URL, headers, and body (screenshots omitted).

12.3.3.2 HTTP client

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Request a resource with GET
	//resp, err := http.Get("http://www.baidu.com")
	//resp, err := http.Get("http://www.neihan8.com/article/index.html")
	resp, err := http.Get("http://127.0.0.1:8000/go")
	if err != nil {
		log.Println(err)
		return
	}
	defer resp.Body.Close()

	fmt.Println("header = ", resp.Header)
	fmt.Printf("resp status %s\nstatusCode %d\n", resp.Status, resp.StatusCode)
	fmt.Printf("body type = %T\n", resp.Body)

	buf := make([]byte, 2048) // Slice buffer
	var tmp string
	for {
		n, err := resp.Body.Read(buf) // Read part of the body
		if err != nil && err != io.EOF {
			fmt.Println(err)
			return
		}
		if n == 0 {
			fmt.Println("End of reading")
			break
		}
		tmp += string(buf[:n]) // Accumulate what has been read
	}
	fmt.Println("buf = ", tmp)
}
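
The manual read loop above is instructive, but on Go 1.16 and later the same result can be had with io.ReadAll. A shorter sketch of the same client:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:8000/go")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// io.ReadAll drains the body in one call, replacing the
	// buffer-and-accumulate loop shown above
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("body = ", string(body))
}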
Example code: a Baidu Tieba crawler
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
)

// Crawl the content of a web page
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url)
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()

	// Read the page body content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 { // Reading ended, or something went wrong
			break
		}
		result += string(buf[:n])
	}
	return
}

func DoWork(start, end int) {
	fmt.Printf("Crawling pages %d to %d\n", start, end)

	// 1) Identify the target (know where you are going to search):
	// http://tieba.baidu.com/f?kw=...&ie=utf-8&pn=0, where pn increases by 50 per page
	for i := start; i <= end; i++ {
		url := "http://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&ie=utf-8&pn=" + strconv.Itoa((i-1)*50)
		fmt.Println("url = ", url)

		// 2) Crawl (fetch the whole content of the page)
		result, err := HttpGet(url)
		if err != nil {
			fmt.Println("HttpGet err = ", err)
			continue
		}

		// Write the content to a file
		fileName := strconv.Itoa(i) + ".html"
		f, err1 := os.Create(fileName)
		if err1 != nil {
			fmt.Println("os.Create err1 = ", err1)
			continue
		}
		f.WriteString(result) // Write the content
		f.Close()             // Close the file
	}
}

func main() {
	var start, end int
	fmt.Printf("Enter the start page (>= 1): ")
	fmt.Scan(&start)
	fmt.Printf("Enter the end page (>= start page): ")
	fmt.Scan(&end)
	DoWork(start, end)
}

Concurrent version of the crawler:

package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
)

// Crawl the content of a web page
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url)
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()

	// Read the page body content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 { // Reading ended, or something went wrong
			break
		}
		result += string(buf[:n])
	}
	return
}

// Crawl a single page and signal completion on the channel
func SpiderPage(i int, page chan<- int) {
	// Signal completion even on error, so DoWork never blocks forever
	defer func() { page <- i }()

	url := "http://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&ie=utf-8&pn=" + strconv.Itoa((i-1)*50)
	fmt.Printf("Crawling page %d: %s\n", i, url)

	// Crawl (fetch the whole content of the page)
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}

	// Write the content to a file
	fileName := strconv.Itoa(i) + ".html"
	f, err1 := os.Create(fileName)
	if err1 != nil {
		fmt.Println("os.Create err1 = ", err1)
		return
	}
	f.WriteString(result) // Write the content
	f.Close()             // Close the file
}

func DoWork(start, end int) {
	fmt.Printf("Crawling pages %d to %d\n", start, end)
	page := make(chan int)

	for i := start; i <= end; i++ {
		go SpiderPage(i, page)
	}
	for i := start; i <= end; i++ {
		fmt.Printf("Page %d crawled\n", <-page)
	}
}

func main() {
	var start, end int
	fmt.Printf("Enter the start page (>= 1): ")
	fmt.Scan(&start)
	fmt.Printf("Enter the end page (>= start page): ")
	fmt.Scan(&end)
	DoWork(start, end)
}
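
In the version above, the page channel serves purely as a completion signal. An alternative is sync.WaitGroup, which waits for all goroutines without counting receives by hand; a sketch of just the fan-out part, with the crawl itself stubbed out:

package main

import (
	"fmt"
	"sync"
)

func DoWork(start, end int) {
	var wg sync.WaitGroup
	for i := start; i <= end; i++ {
		wg.Add(1)
		go func(page int) {
			defer wg.Done() // Mark this page as finished, even if crawling fails
			// SpiderPage(page, ...) would fetch and save one page here
			fmt.Printf("Page %d crawled\n", page)
		}(i)
	}
	wg.Wait() // Block until every page goroutine has called Done
	fmt.Println("All pages done")
}

func main() {
	DoWork(1, 3)
}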
Example code: a joke-site crawler (www.pengfu.com)
package main

import (
	"fmt"
	"net/http"
	"os"
	"regexp"
	"strconv"
	"strings"
)

func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url) // Send the GET request
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()

	// Read the page content
	buf := make([]byte, 4*1024)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		result += string(buf[:n]) // Accumulate what has been read
	}
	return
}

// Crawl one joke: extract its title and content
func SpiderOneJoy(url string) (title, content string, err error) {
	// Fetch the page content
	result, err1 := HttpGet(url)
	if err1 != nil {
		err = err1
		return
	}

	// Extract the title from <h1>title</h1>; take only the first match
	re1 := regexp.MustCompile(`<h1>(?s:(.*?))</h1>`)
	tmpTitle := re1.FindAllStringSubmatch(result, 1) // The final argument 1 keeps only the first match
	for _, data := range tmpTitle {
		title = data[1]
		title = strings.Replace(title, "\t", "", -1)
		break
	}

	// Extract the content between <div> and <a id="prev" href="
	re2 := regexp.MustCompile(`<div>(?s:(.*?))<a id="prev" href="`)
	tmpContent := re2.FindAllStringSubmatch(result, -1)
	for _, data := range tmpContent {
		content = data[1]
		content = strings.Replace(content, "\t", "", -1)
		content = strings.Replace(content, "\n", "", -1)
		content = strings.Replace(content, "\r", "", -1)
		content = strings.Replace(content, "<br />", "", -1)
		break
	}
	return
}

// Write the titles and contents of one page to a file
func StoreJoyToFile(i int, fileTitle, fileContent []string) {
	f, err := os.Create(strconv.Itoa(i) + ".txt")
	if err != nil {
		fmt.Println("os.Create err = ", err)
		return
	}
	defer f.Close()

	n := len(fileTitle)
	for i := 0; i < n; i++ {
		f.WriteString(fileTitle[i] + "\n")   // Write the title
		f.WriteString(fileContent[i] + "\n") // Write the content
		f.WriteString("\n=================================================================\n")
	}
}

// Crawl one listing page, e.g. https://www.pengfu.com/xiaohua_1.html
func SpiderPage(i int, page chan int) {
	// Signal completion even on error, so DoWork never blocks forever
	defer func() { page <- i }()

	url := "https://www.pengfu.com/xiaohua_" + strconv.Itoa(i) + ".html"
	fmt.Printf("Crawling page %d: %s\n", i, url)

	// Fetch the listing page content
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}

	// Extract each joke's url from <h1><a href="...">
	re := regexp.MustCompile(`<h1><a href="(?s:(.*?))"`)
	joyUrls := re.FindAllStringSubmatch(result, -1)

	fileTitle := make([]string, 0)
	fileContent := make([]string, 0)

	// data[0] is the whole match, data[1] the captured url
	for _, data := range joyUrls {
		// Crawl each individual joke
		title, content, err := SpiderOneJoy(data[1])
		if err != nil {
			fmt.Println("SpiderOneJoy err = ", err)
			continue
		}
		fileTitle = append(fileTitle, title)        // Append the title
		fileContent = append(fileContent, content)  // Append the content
	}

	// Write the collected content to a file
	StoreJoyToFile(i, fileTitle, fileContent)
}

func DoWork(start, end int) {
	fmt.Printf("Preparing to crawl pages %d to %d\n", start, end)
	page := make(chan int)

	for i := start; i <= end; i++ {
		// Start one goroutine per listing page
		go SpiderPage(i, page)
	}
	for i := start; i <= end; i++ {
		fmt.Printf("Page %d crawled\n", <-page)
	}
}

func main() {
	var start, end int
	fmt.Printf("Enter the start page (>= 1): ")
	fmt.Scan(&start)
	fmt.Printf("Enter the end page (>= start page): ")
	fmt.Scan(&end)
	DoWork(start, end) // Worker function
}
Crawler summary

A crawler has four main steps:

  1. Identify the target (know where you are going to search)
  2. Crawl (fetch the whole content of the target pages)
  3. Extract (strip out the data that is useless to us)
  4. Process the data (store and use it the way we want)
