A Mysterious Server Request Timeout Problem

A few days ago, our online monitoring occasionally showed requests timing out at two seconds. It took quite a bit of time, and a few detours, to get to the bottom of it, so I am recording the process here.

Preliminary Analysis

First, some background on our service:

  • Our service is a stateless set of front-end servers plus a stateful back-end storage layer.
  • These services are deployed on Tencent Cloud BlackStone (bare-metal) servers.

The first step is to determine whether the problem lies on the front end or the back end. From the logs, we found that during each error window every front-end server reported errors, and the failing requests were all sent to the same back-end server. So the problem is in the back-end service.

Next, we need to characterize the errors. The errors found in the logs are all caused by a two-second timeout on the front end, while the back end actually completes the request only after the front end has already timed out. Since this is a storage service, we naturally looked at the disks first. Sure enough, whenever an error occurs, the io.util of the disks is very high (in the output of sar -dp, the fields starting from the fourth column are tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util).
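
For reference, the kind of sampling we relied on looks roughly like this (the interval and sample count here are illustrative, not the exact invocation we used):

sar -d -p 1 3    # per-device statistics with real device names, 1-second interval, 3 samples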

Note that:

  • io.util is very high, but the spike lasts only 2-3 seconds
  • Almost all disks show the high io.util
  • The IO tps and throughput are not high

In addition, we found another pattern: a single machine shows this high io.util roughly every 12 hours, accompanied by timeout errors.

Problem investigation

Quick Screening

We did some quick checks to try to find the source of the problem at relatively low cost:

  1. Check the system's scheduled tasks (crontab, cron.d)
  2. Check the system logs (typical commands for these first two checks are sketched after this list)
  3. Check the scheduled tasks inside our storage service's own code
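
For the first two checks, the commands were roughly along these lines (a sketch for a typical CentOS-style host; paths may differ on other distributions):

crontab -l                            # per-user scheduled tasks
cat /etc/crontab; ls /etc/cron.d/     # system-wide scheduled tasks
grep -i error /var/log/messages       # system log around the error window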

Unfortunately, no valuable clues were found. So we assumed there must be something we did not know about that consumes IO on a 12-hour cycle.

Track IO usage

We first used iotop to find out who was doing heavy IO at the moments when things went wrong:

python /sbin/iotop -bPto -n 86400  > iotop.log

The above command runs iotop in batch mode (-b), per process (-P), with timestamps (-t), showing only processes actually doing IO (-o), once per second for 86400 iterations (24 hours). Again, nothing suspicious showed up other than our own programs.

iotop is a good tool, but one drawback is that there is no detailed documentation of how it works internally, and a documented rationale really matters for a system tool. In this case, we suspected that iotop might not truly reflect the IO usage of processes. To double-check, we used SystemTap to count block IO requests. Here is our adapted iotop.stp:

# request count per (pid, process name, R/W, kernel stack)
global count_stacks
# total request size per (pid, process name, R/W, kernel stack)
global size_stacks

global max_seconds = 172800   # stop after 48 hours
global elapsed_seconds = 0
global interval = 2           # report every 2 seconds

# Record every block IO request submitted to /dev/sdd, keyed by the issuing
# process and the kernel stack that generated the request.
probe ioblock.request {
    if (devname == "sdd") {
        count_stacks[pid(), execname(), bio_rw_str(rw), backtrace()]++
        size_stacks[pid(), execname(), bio_rw_str(rw), backtrace()] += size
    }
}

# Every "interval" seconds, print the top 5 stacks by request count and by
# total IO size, then reset the counters; stop after max_seconds.
probe timer.s(1)
{
    if (++elapsed_seconds % interval == 0) {
        if (elapsed_seconds >= max_seconds) {
            delete count_stacks
            delete size_stacks
            exit()
        }

        time_string = ctime(gettimeofday_s())

        foreach ([process_id, process_name, read_write, stack] in count_stacks- limit 5) {
            printf("==== %25s %6d %20s %s %d\n", time_string, process_id, process_name, read_write, count_stacks[process_id, process_name, read_write, stack])
            print_syms(stack)
        }

        foreach ([process_id, process_name, read_write, stack] in size_stacks- limit 5) {
            printf("++++ %25s %6d %20s %s %d\n", time_string, process_id, process_name, read_write, size_stacks[process_id, process_name, read_write, stack])
            print_syms(stack)
        }

        delete count_stacks
        delete size_stacks
    }
}

This script collects all IO read and write requests issued to /dev/sdd and, every 2 seconds, prints the top 5 kernel stacks by request count and by total IO size, respectively. Execute the script:

nohup stap --suppress-handler-errors -d xfs ./iotop.stp > stap.log &

A typical partial result is:

====  Wed Oct 24 09:19:03 2018   1984                 java W 20
 0xffffffff812c7130 : generic_make_request+0x0/0x130 [kernel]
 0xffffffff812cd89b : blkdev_issue_flush+0xab/0x110 [kernel]
 0xffffffffa0681dc9 : xfs_blkdev_issue_flush+0x19/0x20 [xfs]
 0xffffffffa066b506 : xfs_file_fsync+0x1e6/0x200 [xfs]
 0xffffffff8120f975 : do_fsync+0x65/0xa0 [kernel]
 0xffff882025940b80 : 0xffff882025940b80
 0xffffffff8120fc40 : SyS_fsync+0x10/0x20 [kernel]
 0xffffffff81645909 : system_call_fastpath+0x16/0x1b [kernel]
====  Wed Oct 24 09:19:03 2018   1984                 java R 5
 0xffffffff812c7130 : generic_make_request+0x0/0x130 [kernel]
 0xffffffff8121f630 : do_mpage_readpage+0x2e0/0x6e0 [kernel]
 0xffffffff8121fb1b : mpage_readpages+0xeb/0x160 [kernel]
 0xffffffffa065fa5d : xfs_vm_readpages+0x1d/0x20 [xfs]
 0xffffffff81175cdc : __do_page_cache_readahead+0x1cc/0x250 [kernel]
 0xffff88201534cf00 : 0xffff88201534cf00
 0xffffffff81175ee6 : ondemand_readahead+0x126/0x240 [kernel]
 0xffffffff811762c1 : page_cache_sync_readahead+0x31/0x50 [kernel]
 0xffffffff8120e0d6 : __generic_file_splice_read+0x556/0x5e0 [kernel]
 0xffffffff8120c9b0 : spd_release_page+0x0/0x20 [kernel]
 0xffffffff81513d48 : sk_reset_timer+0x18/0x30 [kernel]
 0xffffffff81582ce5 : tcp_schedule_loss_probe+0x145/0x1e0 [kernel]
 0xffffffff81583b40 : tcp_write_xmit+0x2b0/0xce0 [kernel]
 0xffffffff8163cb5b : _raw_spin_unlock_bh+0x1b/0x40 [kernel]
 0xffffffff815154a8 : release_sock+0x118/0x170 [kernel]
 0xffffffff815769e6 : tcp_sendmsg+0xd6/0xc20 [kernel]
 0xffffffff8120e19e : generic_file_splice_read+0x3e/0x80 [kernel]
 0xffffffffa066bc5b : xfs_file_splice_read+0xab/0x140 [xfs]
 0xffffffff8120d222 : do_splice_to+0x72/0x90 [kernel]
 0xffffffff8120d2f7 : splice_direct_to_actor+0xb7/0x200 [kernel]

The first stack shows our program doing fsync, and the second shows it doing readahead. Both can be ruled out:

  • The system load is not high, and fsync runs all the time, not just during the error windows
  • Readahead had also been tuned via blockdev (see the sketch below), and that had no effect
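
For context, checking and adjusting readahead with blockdev looks roughly like this (the device name follows the earlier stap script; the value shown is only an example, not the one we actually used):

blockdev --getra /dev/sdd        # current readahead, in 512-byte sectors
blockdev --setra 256 /dev/sdd    # example value only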

At this point, the investigation hit a dead end: nobody seemed to be doing much IO, yet io.util was high.

Find the culprit

Thinking further, we speculated that there might be an operation that does not issue much IO itself, but stalls the disk. Note that this assumption is consistent with our initial observation: high io.util with low tps and throughput.

So we decided to use top to see what was going on at those moments (which arguably should have been the first thing to try!):

nohup top -b -d 1 -n 86400 > top.log &

That is, in batch mode, process status is printed once per second for 24 hours. Then, at the moments when the problem occurred, we searched for all processes in state D (uninterruptible sleep, typically waiting on IO). Sure enough, a suspicious process turned up:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32289 root      20   0    2148   1528    212 D   0.0  0.0   0:00.00 sas3flash
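
A minimal way to pull D-state entries out of the batch top log, assuming the column layout shown above (the state is the 8th column), is:

awk '$8 == "D"' top.log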

What is this sas3flash process? First, find where it lives:

$ locate sas3flash
/usr/local/agenttools/agent/plugins/titan_tools/sas3flash

The directory it lives in appears to belong to some agent. Hmm, pretty suspicious. Let's see who calls it:

   /usr/local/agenttools/agent/plugins$ grep sas3flash * -r
Binary file titan_for_serverV1.0 matches

   /usr/local/agenttools/agent/plugins$ strings titan_for_serverV1.0 | grep sas3flash
sas3flash
/usr/local/agenttools/agent/plugins/titan_tools/sas3flash -list
/usr/local/agenttools/agent/plugins/titan_tools/sas3flash
report_hba_by_sas3flash
report_hba_by_sas3flash
report_hba_by_sas3flash

After some more searching, we learned that:

  • sas3flash appears to be an HBA management tool (there is a PDF manual for it)
  • The agenttools directory belongs to the monitoring component of Tencent Cloud BlackStone
  • sas3flash -list appears to be a read-only query command, seemingly harmless, but it is probably the culprit

So we ran sas3flash -list by hand, and the problem reproduced! From strace we can see that the command does not issue any ordinary IO requests, but it does make some ioctl calls. Experiments confirmed that these query ioctls stall the disk.
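
The trace was along these lines (the exact strace flags here are an assumption, not a record of the original session):

strace -f -T -e trace=ioctl /usr/local/agenttools/agent/plugins/titan_tools/sas3flash -list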

In the /usr/local/agenttools/agent/Agent_log.log file, we can find the timestamps and output of each execution of the command (a way to extract them is sketched after the excerpt). Taking into account the delay between execution and reporting, the moments at which the command runs line up exactly with the moments at which the online timeout errors occur:

[2018-10-25 19:12:13] Send warning info, id:36149, value:[CollectID:
ServerSN:2102311YMD10J1000212
Avago Technologies SAS3 Flash Utility
Version 09.00.00.00 (2015.02.03)
Copyright 2008-2015 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5e86819-f-8f0a-f000
        NVDATA Version (Default)       : 0b.02.00.01
        NVDATA Version (Persistent)    : 0b.02.00.01
        Firmware Product ID            : 0x2721 (IR)
        Firmware Version               : 12.00.02.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS3008
        BIOS Version                   : 08.29.02.00
        UEFI BSD Version               : 17.00.04.00
        FCODE Version                  : N/A
        Board Name                     : SAS3008
        Board Assembly                 : N/A
        Board Tracer Number            : 022CDECNHC010620

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.
]
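
A simple way to pull those execution timestamps out of the agent log (the grep pattern is an assumption based on the output above):

grep -n 'SAS3 Flash Utility' /usr/local/agenttools/agent/Agent_log.log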

By carefully examining the files under this directory, we found that this monitoring component implements its own scheduling logic and writes its own logs. That is why we could not locate the problem through the system's crontab, cron.d, or system logs.

As for why this command stalls IO, we are still waiting for a reply from Tencent Cloud. The temporary workaround is to disable execution of this command (which also requires Tencent Cloud to push a hotfix of the monitoring component; simply deleting the sas3flash binary does not help).

Summary

Looking back at the whole investigation, we took some detours. Given the IO characteristics we had already analyzed (high io.util with low tps and throughput), we would have found the problem faster by reaching for top first.

At the same time, I came away with a strong sense of the following pain points:

  1. iotop should document how it actually measures IO
  2. Tencent Cloud's monitoring components should not be completely independent of the system; if they used the system's crontab and logs, problems like this could be located quickly
  3. The system's monitoring components should be adequately tested
