The mental process of online troubleshooting - too long GC time

One day, I suddenly received feedback from other group partners that calling one of our group's interfaces timed out. The information is as follows: cause: java.net.socketimeoutexception: read timed out
So I quickly opened Postman to access the next interface, and the result was accessible. After communication, it was found that not every access failed. It was suspected that one of the machines was down. In order not to affect the normal operation of the online system, the service was restarted first, and then the service was available. Next, we began to investigate and guess the problem.

Guess 1: is there any problem with nginx's load balancing strategy?

From the phenomenon of the problem, I feel that there is indeed a machine that can not provide external services, but if the machine is down, nginx will take the initiative to remove the down machine, and the traffic will not come in. Is it the ip hash strategy configured by nginx, which is not the problem found through communication with the operation and maintenance department, and the operation and maintenance department has tracked that there is traffic coming in during this time period, but it returns The state code is 499 (in nginx source code, 499 is defined as "client has closed connection". This is probably because the server side processing time is too long and the client is "impatient").

Guess 2:Redis is penetrated by cache

Because the interface is a high-frequency query interface, it was originally designed to use redis as a cache, and through tracking the return log, it was found that there were indeed a large number of requests to access redis at this time without obtaining data and then directly accessing the database. Is this really the problem that led to this problem? But continue to track the logs forward, and find that there are other time periods that directly query the database through the cache. It seems that this is not the direct cause of this problem.

Guess 3: slow sql causes

Because the above database goes directly through the cache, consider whether there is slow sql when querying the database, which causes the system to respond slowly. Then, the executed sql explain is as follows:

id = 1    
select_type = SIMPLE    
table = tb_finance    
type = ref    
possible_keys = index_fina_student_id,idx_fina_entity_metavalue    
key = index_fina_student_id    
key_len = 99    
ref = const    
rows = 1    
Extra = Using where    

Because the query goes through the index, and the number of scanned rows is only one, the response is very fast, so it doesn't seem to be caused by slow sql, but it's not reassured. Consider whether there is a large amount of access, resulting in a delay in the database waiting? Contact DBA to see the following monitoring information of database QPS and connection, which are relatively stable, and there is no peak data, so the possibility of change is also ruled out.

Guess 4:CPU and memory surge

In the past, there was a situation that the memory consumption reached the pre-warning level and tomcat was restarted after going online. Is this the same situation this time? Unfortunately, after checking the monitoring information, it was found that CPU, memory and disk IO did not fluctuate much in this period.

Guess 5:JVM GC takes too long

Continue to check the 8 machines in the cluster to see if there is a problem with the JVM in this period of time? Sure enough, after checking GC monitoring, there are major findings as follows:

At 19:10, the GC time began to soar sharply. The longest time was even more than 3 minutes, and it was right with the feedback time point, so the scope of troubleshooting was narrowed down, and the corresponding time point log of this machine was continued to be checked. It was found that there were many similar errors at this time, and the error information was as follows:

[2020-01-12 19:35:35.469] [ERROR] [http-nio-8080-exec-13] [com.xes.payment.safe.handler.GlobalExceptionHandler] - Server exception, please contact administrator!
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
	at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:356)
	at org.apache.catalina.connector.OutputBuffer.flushByteBuffer(OutputBuffer.java:825)
	at org.apache.catalina.connector.OutputBuffer.append(OutputBuffer.java:730)
	at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:391)
	at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:369)
	at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:96)
	at com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2085)
	at com.fasterxml.jackson.core.json.UTF8JsonGenerator.writeNumber(UTF8JsonGenerator.java:899)
	at com.fasterxml.jackson.databind.ser.std.NumberSerializers$IntegerSerializer.serialize(NumberSerializers.java:137)
	at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:727)
	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719)
	at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)
	at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
	at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
	at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
	at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:727)
	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719)
	at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)
	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
	at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1396)
	at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:913)
	at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.writeInternal(AbstractJackson2HttpMessageConverter.java:286)
	at org.springframework.http.converter.AbstractGenericHttpMessageConverter.write(AbstractGenericHttpMessageConverter.java:102)
	at org.springframework.web.servlet.mvc.method.annotation.AbstractMessageConverterMethodProcessor.writeWithMessageConverters(AbstractMessageConverterMethodProcessor.java:272)
	at org.springframework.web.servlet.mvc.method.annotation.RequestResponseBodyMethodProcessor.handleReturnValue(RequestResponseBodyMethodProcessor.java:180)
	at org.springframework.web.method.support.HandlerMethodReturnValueHandlerComposite.handleReturnValue(HandlerMethodReturnValueHandlerComposite.java:82)
	at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:119)
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:877)
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:783)
	at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:991)
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:925)
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:974)
	at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:877)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:661)
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:851)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:158)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:126)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:111)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.actuate.web.trace.servlet.HttpTraceFilter.doFilterInternal(HttpTraceFilter.java:90)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at com.xes.payment.safe.filter.XssFilter.doFilter(XssFilter.java:64)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.web.servlet.support.ErrorPageFilter.doFilter(ErrorPageFilter.java:130)
	at org.springframework.boot.web.servlet.support.ErrorPageFilter.access$000(ErrorPageFilter.java:66)
	at org.springframework.boot.web.servlet.support.ErrorPageFilter$1.doFilterInternal(ErrorPageFilter.java:105)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	at org.springframework.boot.web.servlet.support.ErrorPageFilter.doFilter(ErrorPageFilter.java:123)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
	at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
	at org.apache.tomcat.util.net.NioChannel.write(NioChannel.java:134)
	at org.apache.tomcat.util.net.NioBlockingSelector.write(NioBlockingSelector.java:101)
	at org.apache.tomcat.util.net.NioSelectorPool.write(NioSelectorPool.java:157)
	at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1306)
	at org.apache.tomcat.util.net.SocketWrapperBase.doWrite(SocketWrapperBase.java:670)
	at org.apache.tomcat.util.net.SocketWrapperBase.writeBlocking(SocketWrapperBase.java:450)
	at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:388)
	at org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.doWrite(Http11OutputBuffer.java:623)
	at org.apache.coyote.http11.filters.IdentityOutputFilter.doWrite(IdentityOutputFilter.java:127)
	at org.apache.coyote.http11.Http11OutputBuffer.doWrite(Http11OutputBuffer.java:225)
	at org.apache.coyote.Response.doWrite(Response.java:541)
	at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:351)
	... 83 common frames omitted

This error report concentrated on nearly 300 times before the service was restarted from 19:10 to 20:00, and only in this period of time in the latest week. It seems that the problem has surfaced, so what scenario will lead to java.io.IOException: Broken pipe?
1) After the client initiates the request again, it does not wait for the server to complete the response. Clicking the stop button causes the server to receive the cancellation request. In general, there will be no such boring users. This may be because the user submits the request and the server is slow. For example, there is a problem with the business logic and so on. As a result, the page does not refresh after a long time, the user may cancel or restart the request.
2) When Tomcat server accepts the user's request, it has its own processing capacity, thread, server and other resource limitations. Requests beyond the scope of Tomcat will be stopped by tomcat, and this error may also occur.
3) linux's threading mechanism will cause JVM errors, especially during the peak connection period. tomcat has the same situation under linux.

The above explanation also verifies why the system will return the 499 status code. According to the exception stack information, find the source code of this permission authentication. It is found that this code is used up and not closed when reading the byte stream. If the above exception occurs in the middle, it is easy to cause memory leakage and waste. In this time period, there may be some query ratios operated by the PC end user Slow operation, and then close the client for reoperation. Before the exception stack information, a slow query function is also found and some large objects will be generated, which are all the links to be optimized.

summary

According to the above conjecture and troubleshooting, we can finally locate the cause and general location of the problem. It will be found that real-time monitoring is an important part of the problem, otherwise online problems will not be detected, the system operation cannot be fully mastered, and there are many monitoring means, such as nailing alarm, ELK, CPU and memory utilization, GC log, etc.

Published 14 original articles, won praise 11, visited 7217
Private letter follow

Tags: Java Apache Tomcat Database

Posted on Tue, 14 Jan 2020 03:24:11 -0500 by gabo0303