eBPF overview: Part 5: Tracing user processes

Navigation in this series:

  1. eBPF overview: Part 1: Introduction
  2. eBPF overview: Part 2: Machine and bytecode
  3. eBPF overview: Part 3: Software development ecosystem
  4. eBPF overview: Part 4: Running on embedded systems
  5. eBPF overview: Part 5: Tracing user processes

Original article: https://www.collabora.com/news-and-blog/blog/2019/05/14/an-ebpf-overview-part-5-tracing-user-processes/
Translated version: https://ebpf.top/post/ebpf-overview-part-5/
Published: May 14, 2019
By Adrian Ratiu
Translated by Di Weihua

1. Preface

In the previous parts, we focused on Linux kernel tracing. In our view, eBPF-based projects are the safest, most widely available and most efficient option (eBPF is fully supported upstream in the Linux kernel, guarantees a stable ABI, is enabled by default in almost all distributions and integrates with all other tracing mechanisms), which makes eBPF the best choice for working with the kernel. However, so far we have deliberately avoided an in-depth discussion of user-space tracing, because it deserves special treatment, so we devote this part 5 to it.

First, we will discuss why to use it at all, and then we will split eBPF user-space tracing into two categories: static and dynamic.

2. Why use eBPF in user space?

The most important question is: why use eBPF to trace user-space processes at all, when so many other debuggers/profilers/tracers, and so many language- or OS-specific tools, have been developed for the same task? The answer is not simple: depending on the use case, eBPF may not be the best solution, and no single debugging/tracing project in the huge user-space ecosystem fits all needs.

eBPF tracking has the following advantages:

  • It provides a unified tracing interface for both kernel and user space, and is compatible with the mechanisms used by other tools ([k,u]probes, (dtrace)tracepoints and so on). The 2015 article "Choosing a Linux Tracer", although somewhat dated, provides good insight into how much effort it takes to juggle all the different tools. Having a unified, powerful, safe and widely available framework that covers most tracing needs is very valuable. Some higher-level tools such as Perf/SystemTap/DTrace are being rewritten on top of eBPF (becoming eBPF front ends), so understanding eBPF helps when using them.

  • eBPF is fully programmable. Perf/ftrace and similar tools need to post-process data after the fact, while eBPF runs custom, high-level, natively compiled C/Python/Go instrumentation code directly in the kernel/application. It can retain data between multiple eBPF event runs, for example to compute per-function-call statistics based on function state/arguments.

  • eBPF can trace everything in a system; it is not limited to a specific application. For example, you can set uprobes on a shared library and trace all processes that link to and call it.

  • Many debuggers need to pause the program to observe its state, or severely degrade runtime performance, which makes real-time analysis difficult, especially on production workloads. Because eBPF attaches JIT-compiled native instrumentation code, its performance impact is minimal and it does not require pausing the application for long periods.
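The in-kernel aggregation described in the list above can be made concrete with a small sketch. This is only an illustration, assuming BCC is installed and the script runs as root; the probed kernel symbol vfs_read and the map layout are arbitrary choices, not part of the article's examples:

```python
# Sketch: count events per PID entirely inside the kernel with a BPF map,
# instead of streaming every event to user space for post-processing.
bpf_text = """
BPF_HASH(counts, u32, u64);

int on_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;

    /* State survives across eBPF invocations: per-PID call counter. */
    val = counts.lookup_or_init(&pid, &zero);
    (*val)++;
    return 0;
}
"""

try:
    from bcc import BPF  # requires the bcc package and root privileges
    b = BPF(text=bpf_text)
    b.attach_kprobe(event="vfs_read", fn_name="on_entry")  # illustrative symbol
except Exception:
    pass  # BCC not installed or insufficient privileges; the text above still shows the idea
```

Reading the `counts` map periodically from user space then yields ready-made statistics, with no per-event copying out of the kernel.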

Admittedly, eBPF also has some disadvantages:

  • eBPF is not as portable as other tracers. The virtual machine is developed primarily in the Linux kernel (a BSD port is in progress), and the tooling around it is built for Linux.
  • eBPF requires a fairly recent kernel. For example, MIPS support was added in v4.13, but the vast majority of MIPS devices run kernels older than v4.13.
  • In general, eBPF cannot easily provide as much insight as a language- or application-specific user-space debugger. For example, at its core Emacs is an ELISP interpreter written in C: eBPF can trace/debug ELISP programs by hooking the C function calls of the Emacs runtime, but it knows nothing about the higher-level ELISP language implementation, so the ELISP-specific tracers and debuggers provided by Emacs itself are more useful there. Another example is debugging a JavaScript application running in a web browser engine.
  • Because "normal eBPF" runs in the Linux kernel, a kernel/user context switch happens every time eBPF instruments a user process. This can be expensive when debugging performance-critical user-space code (perhaps the user-space eBPF VM project could be used to avoid this switching cost?). Still, this context switch is much cheaper than that of a typical debugger (or tools like strace), so it is usually negligible, but in such cases a tracer like LTTng that can run entirely in user space may be more appropriate.

3. Static tracepoints (USDT probes)

Static tracepoints, known in user space as USDT (User Statically Defined Tracing) probes, are locations of interest in an application where a tracer can hook in to inspect code execution and data. They are explicitly defined by developers in the source code and are usually enabled at compile time with flags such as "--enable-trace". The advantage of static tracepoints is that they rarely change: developers usually maintain a stable static-trace ABI, so tracing tools keep working across application versions, which is useful, for example, when upgrading a PostgreSQL installation and hitting a performance regression.

3.1 Predefined tracepoints

BCC tools contains many useful, well-tested tools for interacting with tracepoints defined by specific applications or language runtimes. For our example, we will trace Python applications. Make sure the "--with-dtrace" flag was enabled when Python 3 was built, and run tplist on the Python binary or on libpython (depending on how it was built) to confirm that the tracepoints are enabled:

$ tplist -l /usr/lib/libpython3.7m.so
b'/usr/lib/libpython3.7m.so' b'python':b'import__find__load__start'
b'/usr/lib/libpython3.7m.so' b'python':b'import__find__load__done'
b'/usr/lib/libpython3.7m.so' b'python':b'gc__start'
b'/usr/lib/libpython3.7m.so' b'python':b'gc__done'
b'/usr/lib/libpython3.7m.so' b'python':b'line'
b'/usr/lib/libpython3.7m.so' b'python':b'function__return'
b'/usr/lib/libpython3.7m.so' b'python':b'function__entry'

First, we use uflow, a handy tracing tool provided by BCC, to trace the execution flow of Python's simple HTTP server. The trace should be self-explanatory: arrows and indentation indicate function entry/exit. What we see in this trace is a worker thread exiting on CPU 3 while the main thread is ready to serve other incoming HTTP requests on CPU 0.

$ python -m http.server >/dev/null & sudo ./uflow -l python $!
[4] 11727
Tracing method calls in python process 11727... Ctrl-C to quit.
3   11740  11757  7.034           /usr/lib/python3.7/_weakrefset.py._remove
3   11740  11757  7.034     /usr/lib/python3.7/threading.py._acquire_restore
0   11740  11740  7.034               /usr/lib/python3.7/threading.py.__exit__
0   11740  11740  7.034             /usr/lib/python3.7/socketserver.py.service_actions
0   11740  11740  7.034     /usr/lib/python3.7/selectors.py.select
0   11740  11740  7.532     /usr/lib/python3.7/socketserver.py.service_actions
0   11740  11740  7.532

Next, we want to run our own custom code when the tracepoint is hit, so that we are not entirely dependent on the tools BCC provides. The following example hooks itself to Python's function__entry tracepoint (see the Python instrumentation documentation) and notifies us whenever someone downloads a file:

#!/usr/bin/env python
from bcc import BPF, USDT
import sys

bpf = """
#include <uapi/linux/ptrace.h>

static int strncmp(char *s1, char *s2, int size) {
    for (int i = 0; i < size; ++i)
        if (s1[i] != s2[i])
            return 1;
    return 0;
}

int trace_file_transfers(struct pt_regs *ctx) {
    uint64_t fnameptr;
    char fname[128] = {0}, searchname[9] = "copyfile";

    bpf_usdt_readarg(2, ctx, &fnameptr);
    bpf_probe_read(&fname, sizeof(fname), (void *)fnameptr);

    if (!strncmp(fname, searchname, sizeof(searchname)))
        bpf_trace_printk("Someone is transferring a file!\\n");
    return 0;
}
"""

u = USDT(pid=int(sys.argv[1]))
u.enable_probe(probe="function__entry", fn_name="trace_file_transfers")
b = BPF(text=bpf, usdt_contexts=[u])
while 1:
    try:
        (_, _, _, _, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %s" % (ts, msg))

We tested it by attaching it to a simple HTTP server again:

$ python -m http.server >/dev/null & sudo ./trace_simplehttp.py $!
[14] 28682
34677.450520000    b'Someone is transferring a file!'

The above example tells us when someone is downloading a file, but it cannot provide more detail, such as who is downloading and which file is being downloaded. This is because Python enables only a few very generic tracepoints by default (module load, function entry/exit and so on). To get more information, we must define our own tracepoints at the places we are interested in, so that we can extract the relevant data.

3.2 Defining our own tracepoints

So far we have only used tracepoints defined by others, but if our application provides no tracepoints, or we need more information than the existing tracepoints expose, we have to add our own.

There are several ways to add tracepoints, for example through systemtap's development package "systemtap-sdt-dev", which the Python core headers pydtrace.h and pydtrace.d use, but we will take another route and use libstapsdt, because it has a simpler API, is lighter weight (it only depends on libelf) and has bindings for multiple languages. For consistency we focus on Python again, but tracepoints can be added in other languages as well, for example in C.
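As a standalone illustration of the libstapsdt workflow, here is a minimal sketch using the python-stapsdt bindings. It assumes those bindings are installed; the provider and probe names merely mirror the server patch in this section and the fired values are made up:

```python
# Sketch: expose a USDT probe from an arbitrary Python process via libstapsdt.
PROVIDER_NAME = "simplehttp"
PROBE_NAME = "file_transfer"

try:
    import stapsdt  # python-stapsdt bindings for libstapsdt

    provider = stapsdt.Provider(PROVIDER_NAME)
    # Argument slots: two pointers (uint64) and one uint32, matching the
    # client-IP string, file-path string and file-size arguments.
    probe = provider.add_probe(PROBE_NAME,
                               stapsdt.ArgTypes.uint64,
                               stapsdt.ArgTypes.uint64,
                               stapsdt.ArgTypes.uint32)
    provider.load()  # maps a small generated ELF object exposing the probe

    # Firing is nearly free when no tracer is attached.
    probe.fire("127.0.0.1", "/tmp/example", 4096)
except ImportError:
    pass  # libstapsdt / python-stapsdt not installed
```

After provider.load(), tools like tplist can discover the probe in the generated shared object, exactly as shown for the patched server below.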

First, we patch the simple HTTP server to expose a tracepoint. The code should be self-explanatory: note the tracepoint name file_transfer and its arguments, which are big enough to store two string pointers and a 32-bit unsigned integer representing the client IP address, the file path and the file size.

diff --git a/usr/lib/python3.7/http/server.py b/usr/lib/python3.7/http/server.py
index ca2dd50..af08e10 100644
--- a/usr/lib/python3.7/http/server.py
+++ b/usr/lib/python3.7/http/server.py
@@ -107,6 +107,13 @@ from functools import partial
 from http import HTTPStatus
+import stapsdt
+provider = stapsdt.Provider("simplehttp")
+probe = provider.add_probe("file_transfer",
+                           stapsdt.ArgTypes.uint64,
+                           stapsdt.ArgTypes.uint64,
+                           stapsdt.ArgTypes.uint32)
+provider.load()
 # Default error message template
@@ -650,6 +657,8 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
         f = self.send_head()
         if f:
+                path = self.translate_path(self.path)
+                probe.fire(self.address_string(), path, os.path.getsize(path))
                 self.copyfile(f, self.wfile)

Running the patched server, we can use tplist to verify that our file_transfer tracepoint exists at runtime:

$ python -m http.server >/dev/null 2>&1 & tplist -p $!
[1] 13297
b'/tmp/simplehttp-Q6SJDX.so' b'simplehttp':b'file_transfer'
b'/usr/lib/libpython3.7m.so.1.0' b'python':b'import__find__load__start'
b'/usr/lib/libpython3.7m.so.1.0' b'python':b'import__find__load__done'

Compared with the tracer sample code in the example above, we make the following important changes:

  • It attaches its logic to our custom file_transfer tracepoint.
  • It uses perf events (BPF_PERF_OUTPUT) to store the data, which can transfer arbitrary structures to user space, instead of the ftrace ring buffer used before, which can only transfer a single string.
  • It does not use bpf_usdt_readarg to fetch the pointers provided by USDT; instead it declares them directly in the handler's function signature. This is a significant quality-of-life improvement that can be used in all handlers.
  • This tracer explicitly uses python2, even though all our examples so far (including the Python http.server patch above) use python3. Hopefully all BCC APIs and documentation will be fully ported to Python 3 in the future.
#!/usr/bin/env python2
from bcc import BPF, USDT
import sys

bpf = """
#include <uapi/linux/ptrace.h>

BPF_PERF_OUTPUT(events);

struct file_transf {
    char client_ip_str[20];
    char file_path[300];
    u32 file_size;
    u64 timestamp;
};

int trace_file_transfers(struct pt_regs *ctx, char *ipstrptr, char *pathptr, u32 file_size) {
    struct file_transf ft = {0};

    ft.file_size = file_size;
    ft.timestamp = bpf_ktime_get_ns();
    bpf_probe_read(&ft.client_ip_str, sizeof(ft.client_ip_str), (void *)ipstrptr);
    bpf_probe_read(&ft.file_path, sizeof(ft.file_path), (void *)pathptr);

    events.perf_submit(ctx, &ft, sizeof(ft));
    return 0;
}
"""

def print_event(cpu, data, size):
    event = b["events"].event(data)
    print("{0}: {1} is downloading file {2} ({3} bytes)".format(
        event.timestamp, event.client_ip_str, event.file_path, event.file_size))

u = USDT(pid=int(sys.argv[1]))
u.enable_probe(probe="file_transfer", fn_name="trace_file_transfers")
b = BPF(text=bpf, usdt_contexts=[u])
b["events"].open_perf_buffer(print_event)

while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

Tracing the patched server:

        $ python -m http.server >/dev/null 2>&1 & sudo ./trace_stapsdt.py $!
        [1] 5613
        325540469950102: is downloading file /home/adi/ (4096 bytes)
        325543319100447: is downloading file /home/adi/.bashrc (827 bytes)
        325552448306918: is downloading file /home/adi/workspace/ (4096 bytes)
        325563646387008: is downloading file /home/adi/workspace/work.tar (112640 bytes)

The custom file_transfer tracepoint above may look trivial (a direct Python print or logging call could achieve the same effect), but it provides a very powerful mechanism: well-placed tracepoints guarantee ABI stability and offer the ability to run safe, natively fast, programmable logic dynamically, which can be very helpful for quickly analyzing and fixing all kinds of problems without restarting the problematic application (where reproducing the problem might take a long time).

4. Dynamic probes

The problem with the static tracepoints illustrated above is that they need to be explicitly defined in the source code, and the application has to be rebuilt whenever the tracepoints are modified. Guaranteeing ABI stability for existing tracepoints imposes constraints on how maintainers can refactor/rewrite the code producing the tracepoint data. Therefore, in some cases fully dynamic runtime user-space probes are preferable: they poke directly at the memory of the running application, in an ad-hoc way, without any special source-code definitions. Dynamic probes break easily between application versions, but even so they are useful for live debugging of running instances.

While static tracepoints are useful for tracing applications written in high-level languages such as Python or Java, uprobes are less useful there because they work at a lower level and know nothing about the language runtime implementation (static tracepoints work because developers take responsibility for exposing the relevant high-level application data). Dynamic probes are, however, useful for debugging the language implementation/engine itself, or applications written in languages without a runtime, such as C.

uprobes can be added to stripped binaries, but then the user has to manually compute the in-process memory offsets where the uprobe should be attached, using tools such as objdump and /proc/<pid>/maps (see the example). That approach is painful and non-portable. Since most distributions provide debug-symbol packages (or quick ways to build with debug symbols), and BCC makes it easy to use uprobes with symbol name resolution, most dynamic instrumentation takes that route.
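To illustrate the manual half of that computation, here is a small helper sketch (hypothetical, not part of BCC) that finds a library's load base by parsing /proc/<pid>/maps; adding the symbol's file offset reported by nm or objdump then gives the address where the uprobe must be attached:

```python
def module_base(pid, name_substr):
    """Return the lowest start address of the first file-backed mapping
    whose path contains name_substr, or None if nothing matches."""
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            # Fields: address perms offset dev inode [pathname]
            fields = line.split()
            if len(fields) >= 6 and name_substr in fields[5]:
                return int(fields[0].split("-", 1)[0], 16)
    return None
```

For example, the attach address for getaddrinfo in a running process would be roughly `module_base(pid, "libc") + symbol_offset`, with symbol_offset taken from `nm`; BCC's symbol resolution does this bookkeeping for you.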

The BCC tool gethostlatency prints DNS request latencies by attaching uprobes to the getaddrinfo/gethostbyname functions in libc. To verify that libc is not stripped, so the tool can run (otherwise a symbol-not-found error is raised):

$ file /usr/lib/libc-2.28.so
/usr/lib/libc-2.28.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, (...), not stripped
$ nm -na /usr/lib/libc-2.28.so | grep -i -e getaddrinfo
0000000000000000 a getaddrinfo.c

The gethostlatency code is very similar to the tracepoint example we examined above (it also uses BPF_PERF_OUTPUT, among other things), so we will not reproduce it in full here. The most relevant difference is its use of the BCC uprobe API:

b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry", pid=args.pid)
b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return", pid=args.pid)
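To make the pattern concrete, here is a minimal sketch in the spirit of gethostlatency (not its actual code, which lives in the BCC repository). It assumes BCC is installed and root privileges, and times getaddrinfo() with a uprobe/uretprobe pair:

```python
# Sketch: measure getaddrinfo() latency per thread with a uprobe/uretprobe pair.
bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);

int do_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();  /* lower 32 bits: thread id */
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int do_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp) {
        bpf_trace_printk("getaddrinfo took %llu ns\\n", bpf_ktime_get_ns() - *tsp);
        start.delete(&tid);
    }
    return 0;
}
"""

try:
    from bcc import BPF  # requires bcc and root privileges
    b = BPF(text=bpf_text)
    b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry")
    b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return")
except Exception:
    pass  # BCC unavailable; the program text above still shows the structure
```

The entry probe records a timestamp keyed by thread id, and the return probe computes the delta, which is the same entry/return pairing gethostlatency uses.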

The key idea to understand and remember here is: with only small changes to our BCC eBPF programs, we can trace very different applications and libraries, and even the kernel, via static and dynamic instrumentation. Before, we statically traced a Python application; now we dynamically measure the hostname-resolution latency of libc. These examples (fewer than 150 LOC each, much of it boilerplate) can be modified in a similar way to trace anything in a running system, completely safely, without the risk of crashes or problems caused by other tools (such as application pauses caused by a debugger).

5. Summary

In part 5, we looked at how to trace user-space applications with eBPF programs. The biggest advantage of using eBPF for this task is that it offers a unified interface for tracing the whole system safely and efficiently: a bug can be reproduced in an application and then traced further into its libraries or the kernel, giving full system visibility through one programming framework/interface. eBPF is not a silver bullet, however: language-specific tools can provide better insight when debugging applications written in high-level languages, and for applications running on older Linux kernels, or requiring non-Linux systems, other tools are needed.

