【翻譯】Aya: Rust風格的 eBPF 夥伴

Aya: your tRusty eBPF companion - Aya: Rust風格的 eBPF 夥伴

  • 原文鏈接: https://deepfence.io/aya-your-trusty-ebpf-companion/
  • 第一次翻譯長篇文章,有不好的地方歡迎評論指出
  • 不確定的翻譯已通過中文斜體標出
  • 引用部分爲原文

Aya is a library that makes it possible to write eBPF programs fully in Rust and is focused on providing an experience that is as friendly as possible for developers. In this post we are going to explain what eBPF is, why Aya started, and what’s unique about it.

Aya這個庫可以讓你完全用Rust來編寫eBPF程序,並且爲開發者提供儘可能友好的開發體驗。這篇文章裏我們會講什麼是eBPF,爲什麼發起Aya,還有它的獨特之處。

What is eBPF? -- eBPF是什麼?

eBPF (extended Berkeley Packet Filter) is a technology that makes it possible to run sandboxed programs inside a virtual machine with its own minimal set of instructions.

eBPF (extended Berkeley Packet Filter) 是一種能在虛擬機裏運行沙箱程序的技術,這個虛擬機擁有自己的一套最小指令集。

It originated in the Linux kernel, where the eBPF VM (being the part of the kernel) triggers eBPF programs when a specific event happens in the system. There are more and more events added in new Linux kernel versions. For each type of event there is a separate kind of eBPF program. Currently in Linux, the most known program types are:

eBPF 起源於 Linux 內核,在系統中發生某個特定事件時,eBPF 虛擬機(作爲內核的一部分)會觸發 eBPF 程序。新的 Linux 內核版本中加入了越來越多的事件。每種事件類型都對應單獨的一類 eBPF 程序。目前在 Linux 中,最知名的程序類型有:

  • Kprobes (kernel probes), fentry – can be attached to any kernel function during runtime.
  • Tracepoints – are hooks placed in various places in the Linux kernel code. They are more stable than Kprobes, which can change faster between kernel versions.
  • TC classifier – can be attached to egress and ingress qdisc (“queuing discipline” in Linux networking) for inspecting network interfaces and performing operations like accepting, dropping, redirecting, sending them to the queue again, etc.
  • XDP – similar to TC classifier, but attaches to the NIC driver and receives raw packets before they go through any layers of kernel networking stack. The limitation is that it can receive only ingress packets.
  • LSM – stands for Linux Security Modules, they are programs that are able to decide whether a particular security-related action is allowed to happen or not.
  • Kprobes (kernel probes,譯:內核探針), fentry – 能在運行時附加到任意內核函數
  • Tracepoints (譯:追蹤點) – 位於Linux內核代碼中各處的鉤子,相比 Kprobes(後者在內核版本之間變化更快)更加穩定
  • TC classifier (譯:TC分類器) – 能被附加到 qdisc(Linux 網絡中的 “queuing discipline”,排隊規則)的出口和入口,用於檢查網絡接口並執行某些操作,比如 accepting(接受)、dropping(丟棄)、redirecting(重定向)、sending them to the queue again(再次發送到隊列)等。
  • XDP – 類似於 TC classifier,不過是附加到 NIC 驅動,能在原始數據包經過內核網絡棧任意一層之前接收它們。限制是它只能接收流入的數據包。
  • LSM – 代表 Linux Security Modules (譯:Linux安全模塊),是能決定某個特定的安全相關行爲是否被允許發生的程序

eBPF projects usually are built from two parts:

  • eBPF program itself, running in the kernel and reacting to events.
  • User space program, which loads eBPF programs into the kernel and manages their lifetime.

There are ways to share data between eBPF programs and user space:

  • Maps – data structures used by eBPF programs and, depending on the type, also by the user space. With standard map types like HashMap, both eBPF and user space can read and write to them.
  • Perf / ring buffers – (PerfEventArray) – buffers to which eBPF program can push events (in form of custom structures) to the user space. This is a way to notify the user space program immediately.

Although eBPF started in Linux, nowadays there is also implementation in Windows. And eBPF is not even limited to operating system kernels. There are several user space implementations of eBPF, such as rbpf – a user space VM used in production by projects like Solana.

eBPF 項目通常由兩部分構成:

  • eBPF program 本身, 運行在內核裏響應事件
  • User space program, 用於加載 eBPF 程序到內核中並負責生命週期管理

在eBPF程序和用戶程序之間有兩種共享數據的方式:

  • Maps – eBPF 程序使用的數據結構,根據具體類型,用戶層也可以使用。對於像 HashMap 這樣的標準 map 類型,eBPF 和用戶層代碼都能對其讀寫。
  • Perf / ring buffers – (PerfEventArray) – 緩衝區,eBPF 程序能把事件(以自定義結構體的形式)推送到裏面給用戶層。這是一種能立即通知用戶態程序的方式。

雖然 eBPF 始於 Linux,但是現在也有 Windows 裏的實現,並且 eBPF 不僅限於操作系統內核領域。有一些用戶態的 eBPF 實現,比如 rbpf – 一個被像 Solana 這樣的項目用於生產環境的用戶態虛擬機。

What is Aya and how did it start? - Aya是什麼,如何開始?

Today, eBPF programs are usually written either in C or eBPF assembly. But in 2021, the Rust compiler got support for compiling eBPF programs. Since then, Rust Nightly can compile a limited subset of Rust into an eBPF object file.

If you are interested in reading about the implementation details, we recommend to check out this blog post by Alessandro Decina, who is the author of the pull request.

Aya is the first library that supports writing the whole eBPF project (both the user space and kernel space parts) in Rust, without any dependency on libbpf or clang. In most environments, Rust Nightly is the only dependency needed for building. Some environments where rustc doesn’t expose its internal LLVM.so library (i.e. aarch64) require installing a shared LLVM library. But there is no need for libbpf, clang, or bcc!

As mentioned before, the main focus of Aya is developer experience – making it as easy as possible to write eBPF programs. Now we are going to go into the details of how Aya achieves that.

現在的 eBPF 程序一般用 C 或者 eBPF 彙編語言來寫。但是在 2021 年,Rust 編譯器開始支持編譯 eBPF 程序。從此以後,Rust Nightly 可以把 Rust 的一個受限子集編譯成 eBPF 目標文件。

如果你對實現細節感興趣,我們建議看看該 pull request 的作者 Alessandro Decina 寫的這篇博客。

Aya 是第一個支持完全用 Rust 編寫整個 eBPF 項目(包括用戶層和內核層部分)的庫,不依賴 libbpf 或者 clang。大部分環境下,只需要 Rust Nightly 這一個依賴即可構建。有些環境下 rustc 沒有導出其內部的 LLVM.so 庫(比如 aarch64),需要安裝 LLVM 共享庫。但是仍然不需要 libbpf、clang 或 bcc!

正如前面所提到的,Aya 的主要關注點是開發者體驗 – 使 eBPF 程序的編寫儘可能簡單。接下來我們詳細瞭解 Aya 是如何做到這一點的。

More (type) safety -- (類型)更安全

Although the eBPF verifier ensures memory safety, using Rust over C is still beneficial in terms of type safety. Both Rust and macros inside Aya are strict in terms of what types are used in which context.

Let’s look at this example in C.

即使 eBPF 驗證器已經保證了內存安全,在類型安全方面,使用 Rust 相比 C 仍然更有優勢。Rust 本身和 Aya 裏的宏對於什麼上下文該用什麼類型都很嚴格。

我們來看一個這方面的 C 語言例子。

頭文件:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

程序:

SEC("xdp")
int incorrect_xdp(struct __sk_buff *skb) {
    return XDP_PASS;
}

It will compile without any problems:

上面的代碼編譯沒有任何問題

$ clang -O2 -emit-llvm -c incorrect_xdp.c -o - | llc -march=bpf -filetype=obj -o bpf.o
$

… despite the fact that the function signature of that program is incorrect. struct __sk_buff *skb is an argument provided to TC classifier programs, not XDP, which has an argument of type struct xdp_md *ctx. Clang is not catching that mistake during compilation.

… 儘管事實上這個程序的函數簽名是不對的。struct __sk_buff *skb 是提供給 TC classifier 程序的參數,而不是 XDP 的(XDP 的參數類型是 struct xdp_md *ctx)。Clang 在編譯期間沒有捕獲到這個錯誤。

Let’s try to make a similar mistake with Rust:

讓我們試試在Rust中製造類似的錯誤

#[xdp(name = "incorrect_xdp")]
pub fn incorrect_xdp(ctx: SkBuffContext) -> u32 {
    xdp_action::XDP_PASS
}
$ cargo xtask build-ebpf
[...]
   Compiling incorrect-xdp-ebpf v0.1.0 (/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf)
     Running `rustc --crate-name incorrect_xdp --edition=2021 src/main.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C opt-level=3 -C panic=abort -C lto -C codegen-units=1 -C metadata=c92607119e7c631d -C extra-filename=-c92607119e7c631d --out-dir /home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps --target bpfel-unknown-none -L dependency=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps -L dependency=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/debug/deps --extern aya_bpf=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libaya_bpf-85e7be8a52b56ed9.rlib --extern aya_log_ebpf=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libaya_log_ebpf-1b46466744bed2bc.rlib --extern 'noprelude:compiler_builtins=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libcompiler_builtins-bb297dda66d0a4e2.rlib' --extern 'noprelude:core=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libcore-65086a797df2a9a7.rlib' --extern incorrect_xdp_common=/home/vadorovsky/repos/aya-examples/incorrect-xdp/incorrect-xdp-ebpf/../target/bpfel-unknown-none/debug/deps/libincorrect_xdp_common-114ad60c902270da.rlib -Z unstable-options`
error[E0308]: mismatched types
 --> src/main.rs:7:1
  |
7 | #[xdp(name = "incorrect_xdp")]
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `SkBuffContext`, found struct `XdpContext`
8 | pub fn incorrect_xdp(ctx: SkBuffContext) -> u32 {
[...]

The Rust compiler was able to detect the mismatch between SkBuffContext (context of TC classifier program) and XdpContext (context of XDP program, which we should use when using the xdp macro).

Rust 編譯器能檢測出 SkBuffContext(TC classifier 程序的上下文)和 XdpContext(XDP 程序的上下文,使用 xdp 宏時應該用它)之間的不匹配。

Error handling -- 錯誤處理

The usual way of error handling in C is by returning an integer indicating success or error in a function and comparing that integer when calling it. In that case, since the return value is an error code, the actual result of a successful work is usually stored in a pointer provided as an argument. To make it very simple, the basic example (which gets triggered when new process is cloned and saves the PID in HashMap) looks like:

C 語言裏常用的錯誤處理方式是讓函數返回一個表示成功或錯誤的整數,並在調用處比較這個整數。這種情況下,因爲返回值是個錯誤碼,成功執行的實際結果通常存到通過參數傳入的指針裏。簡單起見,一個基本的例子(在新進程被 clone 時觸發,並把 PID 保存到 HashMap 裏)看起來像這樣:

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, pid_t);
	__type(value, u32);
} pids SEC(".maps");

SEC("fentry/kernel_clone")
int BPF_PROG(kernel_clone, struct kernel_clone_args *args)
{
	/* Get the pid */
	pid_t pid = bpf_get_current_pid_tgid() >> 32;
	/* Save the pid in map */
	u32 val = 0;
	int err = bpf_map_update_elem(&pids, &pid, &val, 0);
	if (err < 0)
		return err;
	return 0;
}

Aya lets you use the Result enum and handle errors as it’s done in most Rust projects. The only trick is to create two or more functions – one with a C function signature that returns only an integer type (the actual eBPF program), and others that return Result (called by the first function). Example:

跟大多數 Rust 項目一樣,Aya 讓你用 Result 枚舉來處理錯誤。唯一的技巧是創建兩個或更多的函數 – 一個擁有 C 函數簽名、只返回整數類型(這是實際的 eBPF 程序),其他的返回 Result(被第一個函數調用)。例子:

#[map(name = "pids")]
static mut PIDS: HashMap<u32, u32> = HashMap::<u32, u32>::with_max_entries(1024, 0);

#[fentry(name = "kernel_clone")]
pub fn kernel_clone(ctx: FEntryContext) -> u32 {
    match try_kernel_clone(ctx) {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}

fn try_kernel_clone(ctx: FEntryContext) -> Result<u32, c_long> {
    // Get the pid
    let pid = ctx.pid();
    // Save the pid in map.
    unsafe { PIDS.insert(&pid, &0, 0)? };
    Ok(0)
}

The difference becomes significant when the eBPF code becomes larger and there are multiple errors to handle.

當 eBPF 代碼變得更大並且有很多錯誤要處理時,會有顯著的不同。

The Rust toolchain is all you need -- 你只需要Rust工具鏈

That’s right, to start playing with Aya and eBPF, usually you need to install only the Rust toolchain (nightly) and a few crates. Detailed instructions are here. Cargo is enough to build the whole project and the produced binary will load the eBPF program into the kernel. Clang, bpftool, or iproute2 are not needed.

是的,一般情況下你只需要安裝 Rust 工具鏈(Nightly)和少數幾個 crate,就可以玩(譯者:搞起)Aya 和 eBPF 了。詳細的教程在這。Cargo 足以構建整個工程,生成的二進制程序會把 eBPF 程序加載到內核中。不需要 Clang、bpftool 或 iproute2。

With Aya and Rust, you can use libraries in your eBPF program as long as they support no_std usage. More details about no_std are here. It’s also possible to release eBPF code as crates.

有了 Aya 和 Rust,只要庫支持 no_std,你就能在 eBPF 程序裏使用它們。更多關於 no_std 的細節在這。你也能將你的 eBPF 代碼發佈爲 crate。

An example of a crate that is very often used in Aya-based eBPF programs is memoffset, which obtains offsets of struct members. We are going to see it in code examples later.

在基於Aya的 eBPF 程序裏經常用到的一個crate是 memoffset,用於獲取結構體成員的偏移。我們將在後面的代碼例子中看到。
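memoffset 這類偏移宏的概念可以用一個小例子示意(示意代碼:這裏用標準庫自 Rust 1.77 起穩定的 std::mem::offset_of! 演示同樣的概念;IpHdrDemo 是爲演示而假設的簡化結構體,並非真實內核定義):

```rust
use std::mem::offset_of;

// 一個類似內核 iphdr 的簡化結構體(僅作演示)
#[repr(C)]
struct IpHdrDemo {
    version_ihl: u8,
    tos: u8,
    tot_len: u16,
    id: u16,
    frag_off: u16,
    ttl: u8,
    protocol: u8,
    check: u16,
    saddr: u32,
    daddr: u32,
}

fn main() {
    // 獲取字段在結構體中的字節偏移;
    // eBPF 代碼裏常用這種偏移配合 ctx.load 按位置讀取包頭字段
    println!("protocol offset = {}", offset_of!(IpHdrDemo, protocol));
    println!("saddr offset = {}", offset_of!(IpHdrDemo, saddr));
}
```

memoffset crate 的 offset_of! 宏用法與此類似,在 offset_of! 尚未進入標準庫的年代,它是 no_std 環境下的常用選擇。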

aya-log

Aya-log is a library that lets people easily log from their eBPF programs to the user space program. There is no need to use bpf_printk(), bpftool prog tracelog, and the centralized kernel trace buffer. Aya-log sends the logs through a PerfEventArray to the user space part of the project – something eBPF developers often implement from scratch, but there is no need to do so with Aya!

Aya-log 是個能讓人們更容易地把日誌從 eBPF 程序記錄到用戶層的庫。不需要使用 bpf_printk()、bpftool prog tracelog 和集中式的內核 trace buffer。Aya-log 通過 PerfEventArray 把日誌發送到項目的用戶層部分,這是 eBPF 開發者經常需要從頭實現的功能,但是用 Aya 就不需要這些啦!

Logging in Aya is as simple as:

在 Aya 中記錄日誌是如此簡單:

#[fentry(name = "kernel_clone")]
pub fn kernel_clone(ctx: FEntryContext) -> u32 {
    let pid = ctx.pid();
    info!(&ctx, "new process: pid: {}", pid);
    0
}

And then it’s visible in the user space process:

然後它在用戶層進程中是可見的:

aya-template

cargo-generate is a tool that helps with creating new Rust projects by using a git repository with a template. You can use it to create a new eBPF project based on Aya, using our aya-template repository.

cargo-generate 是個通過帶模板的 git 倉庫來輔助創建新 Rust 項目的工具。你可以用它配合我們的 aya-template 倉庫,創建一個基於 Aya 的新 eBPF 項目。

Starting a new project is as simple as:

很容易開始一個新項目:

cargo install cargo-generate
cargo generate https://github.com/aya-rs/aya-template

Then cargo-generate asks questions about the project, which mostly depend on the chosen type of eBPF program.

然後 cargo-generate 會詢問關於項目的問題,這些問題大多取決於所選的 eBPF 程序類型。

cargo-generate with aya

And the project layout with firewall user space crate, firewall-common crate for shared code, firewall-ebpf with eBPF code, and xtask for build commands:

在這個項目結構中, firewall 是用戶層 crate, firewall-common crate 用於共享代碼, firewall-ebpf 是 eBPF 代碼,xtask 是用於構建的一些命令:

Sharing the common types and logic between user space and eBPF -- 在用戶層和eBPF之間共享通用的類型和邏輯

Many eBPF projects keep the eBPF (kernel space) part in C, but the user space part in other languages like Go or Rust. In such cases, when some structures are used in both parts of the project, bindings to C structure definitions have to be generated.

許多 eBPF 項目的 eBPF(內核空間)部分用 C 寫,而用戶層部分用其他語言比如 Go 或者 Rust。這種情況下,當一些結構體在項目的兩部分都用到時,必須生成 C 結構體定義的綁定。

In projects based entirely on Aya and Rust, it’s a common practice to keep common types in a crate called [project-name]-common. That crate is created by default when creating a new project using aya-template. That crate can contain, for example, struct definitions used in maps.

在完全基於Aya和Rust的項目中,將通用的類型放在名爲[project-name]-common的crate中是個常見的做法。用aya-template創建新項目時,這個crate默認會創建,例如可能會包含用在maps裏的結構體定義。
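一個這種共享 crate 的極簡示意如下(示意代碼:crate 名和 PacketLog 這個類型都是假設的,並非 aya-template 生成的真實內容)。關鍵點是類型用 #[repr(C)] 固定內存佈局,並保持 no_std 兼容,這樣 eBPF 端和用戶層端看到的是同一份字節佈局:

```rust
// 假想的 my-project-common crate 內容示意。
// 真實的 common crate 頂部會有 #![no_std],以便 eBPF 端也能使用。

// 放進 eBPF map 或通過 PerfEventArray 傳遞的類型需要固定的 C 內存佈局
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PacketLog {
    pub ipv4_address: u32,
    pub action: u32,
}

fn main() {
    let log = PacketLog {
        ipv4_address: u32::from_be_bytes([10, 0, 0, 1]),
        action: 1,
    };
    // 兩側按同樣的佈局讀寫,無需像 C 項目那樣生成綁定
    println!("{:?}, size = {}", log, core::mem::size_of::<PacketLog>());
}
```

eBPF crate 和用戶層 crate 只要都依賴這個 common crate,就可以直接共享類型,而不需要 bindgen 之類的綁定生成步驟。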

Async support -- 異步支持

The user space part of projects based on Aya can be asynchronous; both Tokio and async-std are supported. Aya can load eBPF programs and perform operations on eBPF maps and perf buffers in an asynchronous context.

基於 Aya 的項目的用戶層部分可以是異步的,Tokio 和 async-std 都受支持。Aya 能在異步上下文中加載 eBPF 程序、操作 eBPF maps 和 perf buffers。

How Deepfence leverages Aya -- Deepfence是如何利用Aya的

Packet inspection overview -- 數據包檢查概覽

We are analyzing network traffic on virtual machines, Docker, Kubernetes, and public cloud environments.

On each node, that analysis is done in two places for different purposes:

  • Inline in eBPF program, with the TC classifier context. On this level, we:
    • Perform a network layer inspection (L3) to check the source and destination addresses.
    • Perform a transport layer inspection (L4) to check the local and remote port.
    • Based on that information, we apply network policies. If the given address (and port) are in our blocklist, we drop the packet. We also have allowlist logic when there is a wildcard blocklist policy.
    • Perform basic application level (L7) inspection:
      • HTTP – some HTTP headers might contain information about the client that were masked by load balancers, which then we also use for enforcing network policies and dropping the packet.
  • User space after retrieving a packet from eBPF (via PerfEventArray) for further analysis.
    • We are matching all the packets with our sets of security rules, which are compatible with Suricata and ModSecurity.
    • When some packet matches any rule, we raise an alert.
    • Each rule has different alert severity – critical, high, medium, or low. Our thresholds for each severity are configurable.
    • After some threshold was reached, we automatically create a new network policy, which is going to block the particular traffic inline, in eBPF.

我們在虛擬機、Docker、Kubernetes和公有云環境中分析網絡傳輸。

在每個節點上,這種分析出於不同的目的在兩個地方進行:

  • 在 eBPF 程序內部內聯分析,使用 TC classifier 上下文。在這一層,我們:
    • 執行網絡層(L3)檢查,檢測源地址和目標地址。
    • 執行傳輸層(L4)檢查,檢測本地和遠程端口。
    • 基於這些信息,應用網絡策略。如果給定的地址(和端口)在黑名單中,就丟棄這個包。當存在通配符黑名單策略時,我們也有白名單邏輯。
    • 執行基本的應用層(L7)檢查:
      • HTTP – 某些 HTTP 頭可能包含被負載均衡器掩蓋的客戶端信息,我們也用這些信息來執行網絡策略並丟棄數據包。
  • 在用戶層,從 eBPF(通過 PerfEventArray)收到數據包後做更多的分析。
    • 我們用與 Suricata 和 ModSecurity 兼容的安全規則集來匹配所有的數據包。
    • 當有數據包匹配到某條規則,就拋出一個警報。
    • 每條規則都有不同的告警嚴重程度(告警級別) – critical(危急)、high(高)、medium(中)、low(低)。每種級別的閾值都是可配置的。
    • 在達到某個閾值後,會自動創建一條新的網絡策略,在 eBPF 裏內聯地阻攔對應的網絡流量。
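上面 "達到閾值後自動生成網絡策略" 的思路可以用一個純邏輯草圖示意(示意代碼:Severity 枚舉、閾值數值和以源地址爲粒度的攔截列表都是爲演示而假設的,並非 Deepfence 的真實實現):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Severity { Critical, High, Medium, Low }

struct PolicyEngine {
    counts: HashMap<(u32, Severity), u32>, // (源地址, 告警級別) -> 已觸發次數
    thresholds: HashMap<Severity, u32>,    // 每個級別可配置的閾值(數值爲假設)
    blocklist: Vec<u32>,                   // 自動生成的攔截策略(簡化爲源地址列表)
}

impl PolicyEngine {
    fn new() -> Self {
        Self {
            counts: HashMap::new(),
            thresholds: HashMap::from([
                (Severity::Critical, 1),
                (Severity::High, 3),
                (Severity::Medium, 10),
                (Severity::Low, 50),
            ]),
            blocklist: Vec::new(),
        }
    }

    // 記錄一條告警;達到對應級別的閾值後,自動爲該源地址創建攔截策略
    fn record_alert(&mut self, saddr: u32, sev: Severity) {
        let c = self.counts.entry((saddr, sev)).or_insert(0);
        *c += 1;
        if *c >= self.thresholds[&sev] && !self.blocklist.contains(&saddr) {
            self.blocklist.push(saddr);
        }
    }
}

fn main() {
    let mut engine = PolicyEngine::new();
    // 一條 critical 告警就足以觸發攔截
    engine.record_alert(0x0a00_0001, Severity::Critical);
    println!("blocklist: {:?}", engine.blocklist);
}
```

真實系統中,blocklist 的更新會寫回 eBPF 的 HashMap,讓後續數據包直接在內核裏被丟棄。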

Example of TC classifier eBPF program -- TC classifier 類型的 eBPF 程序示例

At the beginning, we mentioned the TC classifier type of eBPF program. All incoming traffic comes to TC and is then redirected to a bound socket where the data can be consumed in user space. It’s the same logic for outgoing traffic but in reverse, the data goes via the socket API and then goes through TC. This means by attaching to TC, you can intercept the kernel socket buffer (sk_buff for those who ever ventured the kernel code) and analyze all of it. On top of accessing the content, you can also make decisions such as dropping the packet, or letting it through.

This is the example of eBPF program applying a simple ingress network policy:

一開始我們就提到了 TC classifier 類型的 eBPF 程序。所有流入的流量都會先到 TC,然後被轉發到一個已綁定的 socket,數據在那裏可以被用戶層消費。流出的流量邏輯相同但方向相反:數據經過 socket API 然後通過 TC。這意味着通過附加到 TC,你能夠攔截內核 socket 緩衝區(對於研究過內核代碼的人來說就是 sk_buff)並對其進行完整分析。在能訪問內容的基礎上,你還能做出像丟棄數據包或者放行之類的決定。

這是一個 eBPF 程序應用簡單入口網絡策略的例子:

const ETH_HDR_LEN: usize = mem::size_of::<ethhdr>();

const ETH_P_IP: u16 = 0x0800;
const IPPROTO_TCP: u8 = 6;
const IPPROTO_UDP: u8 = 17;

#[map]
static mut BLOCKLIST_V4_INGRESS: HashMap<u32, u8> = HashMap::with_max_entries(1024, 0);

#[classifier(name = "tc_cls_ingress")]
pub fn tc_cls_ingress(ctx: SkBuffContext) -> i32 {
    match try_tc_cls_ingress(ctx) {
        Ok(_) => TC_ACT_PIPE,
        Err(_) => TC_ACT_SHOT,
    }
}

fn try_tc_cls_ingress(ctx: SkBuffContext) -> Result<(), i64> {
    let eth_proto = u16::from_be(ctx.load(offset_of!(ethhdr, h_proto))?);
    let ip_proto = ctx.load::<u8>(ETH_HDR_LEN + offset_of!(iphdr, protocol))?;
    if !(eth_proto == ETH_P_IP && (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP)) {
        return Ok(());
    }

    let saddr = u32::from_be(ctx.load(ETH_HDR_LEN + offset_of!(iphdr, saddr))?);

    if unsafe { BLOCKLIST_V4_INGRESS.get(&saddr) }.is_some() {
        error!(&ctx, "blocked packet");
        return Err(-1);
    }

    info!(&ctx, "accepted packet");
    Ok(())
}

Rust will compile this code into an eBPF ELF format (tc-filter in our example) that the Linux eBPF VM will be able to execute.

eBPF loading and attaching in user space:

Rust 會把這些代碼編譯成 Linux eBPF 虛擬機能夠執行的 eBPF ELF 格式(在我們的例子中是 tc-filter)。

eBPF 在用戶層的加載和附加:

let mut bpf = Bpf::load(include_bytes_aligned!(
    ".../target/bpfel-unknown-none/release/tc-filter"
))?;
let prog: &mut SchedClassifier = bpf.program_mut("tc_cls_ingress").unwrap().try_into()?;
prog.load()?;
let _ = tc::qdisc_add_clsact("eth0");
prog.attach("eth0", TcAttachType::Ingress)?;

The code above loads the eBPF binary, loads the specific programs and adds a new qdisc to TC. qdisc is short for “queue discipline” and they are mandatory. They allow for multiple eBPF programs to attach together on the same interface. Finally we attach the eBPF classifier tc_cls_ingress to TC on ingress for the interface eth0. So any incoming packets reaching TC will call the tc_cls_ingress function. The same can be done on egress.

Now that we have seen how eBPF programs can be built and triggered, let’s go deeper and see what they can do with the socket buffer. A socket buffer fully encapsulates one TCP or one UDP packet. This means that if you want to reconstruct HTTP messages, you will need to stack TCP packets and reorder them properly.

上面的代碼會加載 eBPF 二進制文件,加載指定的程序並給 TC 添加一個新的 qdisc。qdisc 是 “queuing discipline”(排隊規則)的縮寫,它是必需的。它允許多個 eBPF 程序附加到同一個接口上。最後我們把 eBPF classifier tc_cls_ingress 附加到 eth0 接口入口方向的 TC 上。這樣任何到達 TC 的流入數據包都會調用 tc_cls_ingress 函數。在出口方向也可以這麼做。

現在我們已經看到了 eBPF 程序如何構建和觸發,讓我們更深入地瞭解它們還能對 socket buffer 做什麼。一個 socket buffer 完整封裝一個 TCP 或 UDP 包。這意味着如果你想重建 HTTP 消息,你需要把 TCP 包堆疊起來並正確地重新排序。
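"把 TCP 包堆疊起來並按序重排" 的思路可以用一個極簡草圖示意(示意代碼:按序列號排序後拼接負載;忽略了重傳、序列號環繞、空洞檢測等真實實現必須處理的細節):

```rust
use std::collections::BTreeMap;

/// 按 TCP 序列號緩存亂序到達的負載片段,再按序拼接成字節流。
struct StreamReassembler {
    segments: BTreeMap<u32, Vec<u8>>, // 序列號 -> 負載
}

impl StreamReassembler {
    fn new() -> Self {
        Self { segments: BTreeMap::new() }
    }

    fn push(&mut self, seq: u32, payload: &[u8]) {
        self.segments.insert(seq, payload.to_vec());
    }

    /// BTreeMap 已按序列號排好序,順序拼接即可還原消息
    fn assemble(&self) -> Vec<u8> {
        self.segments.values().flatten().copied().collect()
    }
}

fn main() {
    let mut r = StreamReassembler::new();
    // 兩個亂序到達的片段:後半段先到
    r.push(1016, b"Host: example.com\r\n\r\n");
    r.push(1000, b"GET / HTTP/1.1\r\n");
    let msg = String::from_utf8(r.assemble()).unwrap();
    println!("{msg}");
}
```

重排之後得到的連續字節流,才能交給 HTTP 解析器去重建完整的請求或響應。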

Sending data to user space with PerfEventArray -- 用 PerfEventArray 發送數據到用戶層

There are multiple ways to communicate back and forth between eBPF programs and user space programs. To quickly transmit data from eBPF to user space, PerfEventArray is the most efficient way to do it. Fortunately Aya also brings nice utilities around it.

Coming back to PerfEventArray, it was initially designed to just report metrics about traffic performance – hence the name – but in reality, you can use those arrays to pass any data you want. Aya makes it dead easy as the same data type can be used for PerfEventArray and your user space program.

In the future we want to support ring buffers in Aya, which bring better performance and are supported by newer kernels. The ongoing work is in progress.

Let’s see how to transmit socket buffers to user space.

First, we define a custom data type:

有多種方法能夠在 eBPF 程序和用戶層之間來回通信。想要快速地從 eBPF 傳送數據到用戶層,PerfEventArray 是最高效的方法。幸運的是 Aya 也對此提供了良好的工具支持。

話題回到 PerfEventArray,它最初只是被設計用於報告傳輸性能相關的指標 – 因此得名 – 但實際上,你能用這些數組傳遞任何你想要的數據。Aya 讓這件事變得非常容易,因爲同一個數據類型可以同時用於 PerfEventArray 和你的用戶層程序。

將來我們想在 Aya 裏支持環形緩衝區(ring buffers),它能提供更好的性能,並且被較新的內核支持。相關工作正在進行中。

來看看如何把 socket buffers 傳到用戶層。

首先,自定義一個數據類型:

#[derive(Copy, Clone, Debug, Hash, Eq, PartialEq)]
#[repr(C)]
pub struct OobBuffer {
    pub direction: TrafficDirection,
    pub size: usize,
}

PerfEventArray in eBPF programs:

eBPF 程序裏的 PerfEventArray

#[map]
static mut OOB_DATA: PerfEventArray<OobBuffer> = PerfEventArray::new(0);

#[classifier(name = "my_ingress_endpoint")]
fn tc_cls_ingress(skb: SkBuffContext) -> i32 {
    unsafe {
        OOB_DATA.output(
            &skb,
            &OobBuffer {
                direction: TrafficDirection::Ingress,
                size: skb.len() as usize,
            },
            skb.len(),
        )
    }

    TC_ACT_PIPE
}

用戶層的PerfEventArray

    let mut oob_events: AsyncPerfEventArray<_> =
        bpf.map_mut("OOB_DATA").unwrap().try_into().unwrap();

    for cpu_id in online_cpus()? {
        let mut oob_cpu_buf = oob_events.open(cpu_id, Some(256))?;
        spawn(&format!("cpu_{}_perf_read", cpu_id), async move {
            loop {
                let mut bufs = (0..sizes.buf_count)
                    .map(|_| BytesMut::with_capacity(128 * 4096))
                    .collect::<Vec<_>>();
                let events = oob_cpu_buf.read_events(&mut bufs).await.unwrap();
                // Play with the received events in bufs
            }
        });
    }

The PerfEventArray is a ring buffer and is bound with a map name OOB_DATA across eBPF programs and user space. The faster you can retrieve events from the buffer, the fewer you are going to miss. Here, we open the PerfEventArray and we spawn tokio tasks across all CPUs. There is one PerfEventArray allocated per CPU. Then we start reading from it asynchronously. When an event is sent from eBPF, user space task is awoken and starts reading the event. Note that our PerfEventArray data is composed of: custom type and the appended socket buffer. So to retrieve the underlying socket buffer, we can simply offset the custom type and access the remaining bytes.

Here we are, we have a way to attach eBPF program to TC and retrieve socket buffers to user space. The funny work can start!

PerfEventArray 是個環形緩衝區(ring buffer),通過映射名 OOB_DATA 在 eBPF 程序和用戶層之間綁定。你從緩衝區中獲取事件越快,丟失的事件就越少。在這裏,我們打開 PerfEventArray,然後爲每個 CPU 生成(spawn)tokio 任務 – 每個 CPU 都分配有一個 PerfEventArray 緩衝區,接着我們開始異步地讀取。當一個事件從 eBPF 發送出來,用戶層任務就會被喚醒並開始讀取這個事件。注意我們的 PerfEventArray 數據由自定義類型和附加在後面的 socket buffer 組成。所以要獲取底層的 socket buffer,我們只需偏移過自定義類型,訪問剩下的字節即可。

就這樣,我們有了把 eBPF 程序附加到 TC 並把 socket buffers 取回用戶層的方法。有趣的工作可以開始了!
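"偏移過自定義類型、訪問剩餘字節" 的讀取方式可以示意如下(示意代碼:在普通 Rust 裏模擬 perf buffer 中的一條記錄,OobBuffer 的字段沿用上文,TrafficDirection 的具體定義是假設的):

```rust
use std::mem;

#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum TrafficDirection { Ingress = 0, Egress = 1 }

#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct OobBuffer {
    direction: TrafficDirection,
    size: usize,
}

// 模擬 eBPF 端寫入 perf buffer 的一條記錄:頭部 + 緊隨其後的包字節
fn make_record(direction: TrafficDirection, packet: &[u8]) -> Vec<u8> {
    let header = OobBuffer { direction, size: packet.len() };
    let mut record = Vec::new();
    record.extend_from_slice(unsafe {
        std::slice::from_raw_parts(
            &header as *const OobBuffer as *const u8,
            mem::size_of::<OobBuffer>(),
        )
    });
    record.extend_from_slice(packet);
    record
}

// 用戶層讀取:先 read_unaligned 取出頭部,再偏移過頭部訪問剩餘的包字節
fn parse_record(record: &[u8]) -> (OobBuffer, Vec<u8>) {
    let hdr = unsafe { (record.as_ptr() as *const OobBuffer).read_unaligned() };
    let off = mem::size_of::<OobBuffer>();
    (hdr, record[off..off + hdr.size].to_vec())
}

fn main() {
    let rec = make_record(TrafficDirection::Ingress, &[0xde, 0xad, 0xbe, 0xef]);
    let (hdr, pkt) = parse_record(&rec);
    println!("{:?}, {} bytes: {:02x?}", hdr.direction, hdr.size, pkt);
}
```

read_unaligned 避免了對緩衝區對齊方式的假設,這也是從 BytesMut 裏讀自定義結構體時的常見做法。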

Processing packets in user space -- 在用戶層處理數據包

The way Deepfence does deep packet inspection is by reordering the TCP frames and reconstructing HTTP messages from the socket buffer data gathered by eBPF. Once the HTTP payload is reconstructed, we apply rule matching to detect whether something malicious is present. If so, we generate an alert and notify customers. We deal the same way with the other application layer (L7) protocols.

Deepfence 做深度數據包檢查的方式是重排 TCP 幀,從 eBPF 收集到的 socket buffer 數據中重建 HTTP 消息。一旦 HTTP payload 被重建,我們就應用規則匹配來檢測其中是否存在惡意內容。如果有的話,將生成一個告警並通知客戶。我們以相同的方式處理其他應用層(L7)協議。

A rule defines how to detect a specific malicious content on HTTP payloads. It also includes meta information on its purpose (the reason for its creation, or what it detects). It usually relies on the haystack finding approach but also regular expression matching approaches. Matching happens on different parts of the HTTP message, it can be headers, port, or even HTML body. Needless to say that such operations are CPU intensive. Here is a rule example:

規則定義瞭如何檢測 HTTP payload 上的特定惡意內容。它還包括有關其用途的元信息(創建原因,或它檢測什麼)。它通常依賴於子串查找(haystack)方法,也會用正則表達式匹配。匹配發生在 HTTP 消息的不同部分,可以是頭部、端口,甚至是 HTML 正文。不用說,這種操作是 CPU 密集型的。這是一個規則示例:

alert http $HOME_NET any -> any any (msg:"ET POLICY Outgoing Basic Auth Base64 HTTP Password detected unencrypted";
flow:established,to_server; threshold: type both, count 1, seconds 300, track by_src;
http.header; content:"Authorization|3a 20|Basic"; nocase; content:!"YW5vbnltb3VzOg==";
within:32; content:!"Proxy-Authorization|3a 20|Basic"; nocase;
content:!"KG51bGwpOihudWxsKQ=="; 
reference:url,doc.emergingthreats.net/bin/view/Main/2006380;
classtype:policy-violation; sid:2006380; rev:15; 
metadata:created_at 2010_07_30, former_category POLICY, updated_at 2022_06_14;)

This rule above is an emerging threat rule. It can be roughly translated into: any HTTP payload containing Authorization text and anything but YW5vbnltb3VzOg== next to it, should trigger an alert saying unencrypted passwords were found.

上面的規則是一條新興威脅(emerging threat)規則。大體上可以翻譯爲:任何包含 Authorization 文本、且其後跟的不是 YW5vbnltb3VzOg== 的 HTTP payload,都應該觸發一個告警,告知發現了未加密的密碼。
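這條規則的核心匹配邏輯可以粗略示意成下面這樣的純字符串檢查(示意代碼:只示意 content 與 content:! 的組合,忽略 nocase、within、Proxy-Authorization 排除、流向和閾值等真實 Suricata 語義):

```rust
/// 粗略模擬上面規則的匹配:出現 "Authorization: Basic" 且其後不是匿名憑證時告警。
fn matches_basic_auth_rule(http_headers: &str) -> bool {
    let needle = "Authorization: Basic ";
    if let Some(pos) = http_headers.find(needle) {
        let cred = &http_headers[pos + needle.len()..];
        // content:!"YW5vbnltb3VzOg==" :匿名用戶("anonymous:" 的 base64)不算違規
        return !cred.starts_with("YW5vbnltb3VzOg==");
    }
    false
}

fn main() {
    let bad = "GET / HTTP/1.1\r\nAuthorization: Basic dXNlcjpwYXNz\r\n";
    let anon = "GET / HTTP/1.1\r\nAuthorization: Basic YW5vbnltb3VzOg==\r\n";
    println!("bad triggers: {}", matches_basic_auth_rule(bad));
    println!("anon triggers: {}", matches_basic_auth_rule(anon));
}
```

真實的規則引擎會把成千上萬條這樣的 content / pcre 條件編譯成高效的多模式匹配結構,而不是逐條做 find。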

Finding the right set of rules is challenging, and applying them at the right time is crucial. At Deepfence, we aggregate different rules from different sources and different formats, such as emerging threat rules, but also the ModSecurity core rule set, for instance. But users can also provide their own rules. We apply them to the live traffic captured by the eBPF program to achieve real time alerting.

尋找正確的規則集合是有挑戰性的,而在正確的時間應用它們也至關重要。在 Deepfence,我們聚合了來自不同來源、不同格式的規則,例如新興威脅規則,還有 ModSecurity 核心規則集。用戶也可以提供自己的規則。我們把這些規則應用到 eBPF 程序捕獲的實時流量上,以實現實時告警。

Watching processes and containers -- 監控進程和容器

Deepfence, apart from network tracing, also focuses on monitoring processes and container workloads. That can be achieved by using a tracepoint eBPF program triggered by new processes in the system.

Deepfence 除了網絡追蹤,還聚焦於監控進程和容器工作負載。這可以通過一個在系統中出現新進程時觸發的 tracepoint eBPF 程序來完成。

#[repr(C)]
pub struct FilenameBuf {
    pub buf: [u8; 128],
}

#[map]
pub static mut FILENAME_BUF: PerCpuArray<FilenameBuf> = PerCpuArray::with_max_entries(1, 0);
#[map]
pub static mut RUNC_EVENTS: PerfEventArray<RuncEvent> = PerfEventArray::new(0);

#[tracepoint(name = "runc_tracepoint")]
pub fn runc_tracepoint(ctx: TracePointContext) -> i64 {
    match try_runc_tracepoint(ctx) {
        Ok(ret) => ret,
        Err(ret) => ret,
    }
}

fn try_runc_tracepoint(ctx: TracePointContext) -> Result<i64, i64> {
    // To check offset values:
    // sudo cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/format
    const FILENAME_POINTER_OFFSET: usize = 8;
    let buf = unsafe {
        let ptr = FILENAME_BUF.get_ptr_mut(0).ok_or(0i64)?;
        &mut *ptr
    };
    let filename = unsafe {
        let len = bpf_probe_read_kernel_str(
            (ctx.as_ptr() as *const u8).add(ctx.read_at::<u16>(FILENAME_POINTER_OFFSET)? as usize),
            &mut buf.buf,
        )?;
        core::str::from_utf8_unchecked(&buf.buf[..len])
    };
    if filename.ends_with("runc\0") {
        let pid = bpf_get_current_pid_tgid() as u32;
        let event = RuncEvent { pid };
        unsafe { RUNC_EVENTS.output(&ctx, &event, 0) };
    }
    Ok(0)
}

This program basically:

  • Gets triggered by sched_process_exec tracepoint – when new processes are spawned by executing a binary.
  • Checks if the filename ends with runc.
  • If yes, outputs an event to the user space via a PerfEventArray.

Of course with such simple filtering, if someone calls some binary foobar-runc (and it has nothing to do with the real runc), we have a problem. But let’s deal with that in the user space.

The definition of RuncEvent is here, it just contains a PID:

這個程序基本上做了這些:

  • 通過 sched_process_exec 追蹤點觸發 – 即通過執行二進制文件創建新進程時。
  • 檢查文件名是否以 runc 結尾。
  • 如果是的話,通過 PerfEventArray 輸出一個事件到用戶層。

當然,過濾這麼簡單的話,如果某人調用了一個名爲 foobar-runc 的二進制程序(它和真正的 runc 毫無關係),就會有問題。但我們可以在用戶層處理這種情況。

RuncEvent的定義在這,只是包含了一個PID:

#[derive(Debug, Clone)]
#[repr(C)]
pub struct RuncEvent {
    pub pid: u32,
}

Then we can consume the event in the user space:

然後我們就可以在用戶層消費這些事件:

    let mut runc_events: AsyncPerfEventArray<_> =
        bpf.map_mut("RUNC_EVENTS").unwrap().try_into().unwrap();

    for cpu_id in online_cpus()? {
        let mut runc_buf = runc_events.open(cpu_id, Some(256))?;
        spawn(&format!("cpu_{}_runc_perf_read", cpu_id), async move {
            loop {
                let mut bufs = (0..sizes.buf_count)
                    .map(|_| BytesMut::with_capacity(128 * 4096))
                    .collect::<Vec<_>>();
                let events = runc_buf.read_events(&mut bufs).await.unwrap();
                for i in 0..events.read {
                    let buf = &bufs[i];
                    let ptr = buf.as_ptr() as *const RuncEvent;
                    let event = unsafe { ptr.read_unaligned() };
                    handle_runc_event(event.pid).unwrap();
                }
            }
        });
    }

It’s better to define our logic in some other function, the loop above is already complex enough. We can try to look for the actual process:

最好是在其他函數中定義我們的邏輯,上面的 loop 循環已經足夠複雜了。我們可以嘗試查找實際的進程:

fn handle_runc_event(pid: u32) -> Result<(), anyhow::Error> {
    let p = Process::new(pid as i32)?;
    // do something with `p`, parse its cmdline
    // and check if it's actually runc
    Ok(())
}
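"檢查是不是真正的 runc" 時,一種簡單的做法是看 cmdline 第一個參數的文件名部分是否恰好是 runc(示意代碼:is_real_runc 這個輔助函數是假設的;真實實現裏 cmdline 可以從 /proc/<pid>/cmdline 或 procfs crate 的 Process 讀到):

```rust
use std::path::Path;

/// 僅當可執行文件名恰好是 "runc" 時返回 true,
/// 這樣 "foobar-runc" 這類名字就不會誤報。
fn is_real_runc(argv0: &str) -> bool {
    Path::new(argv0)
        .file_name()
        .map(|name| name == "runc")
        .unwrap_or(false)
}

fn main() {
    println!("{}", is_real_runc("/usr/bin/runc"));
    println!("{}", is_real_runc("/opt/foobar-runc"));
}
```

這樣 eBPF 端保持廉價的後綴過濾,精確判斷留在用戶層完成。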

That way of monitoring runc processes is agnostic to container engines (Docker, podman) or orchestration systems (Kubernetes and different CRI implementations), so it’s universal and fast. And based on container creation events, we are able to start parsing container configuration.

We use that solution for monitoring and scanning of container filesystems whenever a new container is created.

這種監控 runc 進程的方法與容器引擎(Docker、podman)或編排系統(Kubernetes 和不同的 CRI 實現)無關,所以它是通用且快速的。基於容器創建事件,我們就能開始解析容器配置了。

每當有新容器創建時,我們就用這個方案對容器文件系統進行監控和掃描。

Conclusion - 總結

In this post we introduced you to eBPF and Aya, and how Deepfence leverages those technologies to reliably detect real customer security issues.

If you have questions or want to hear more, we encourage you to read our Deep Packet Inspection documentation and join the Deepfence Slack workspace!

If you want to learn about Aya, check out the Aya book.

Aya has very active community on Discord, where conversations happen pretty much everyday. We invite you to join and feel free to ask any questions related to Aya and eBPF!

Finally, if you’re interested in hacking on eBPF, Rust, and Kubernetes, reach out – careers(at)deepfence(dot)io

在這篇文章裏,我們介紹了 eBPF 和 Aya,以及 Deepfence 如何利用這些技術來可靠地檢測真實的客戶安全問題。

如果你有疑問或想了解更多,建議你讀我們的 Deep Packet Inspection documentation 並加入 Deepfence 的 Slack 空間!

如果你想學 Aya (的更多內容), 可以看看Aya book

Aya 在 Discord 上有非常活躍的社區,幾乎每天都有交流。我們邀請你加入,隨時提問任何關於 Aya 和 eBPF 的問題!

最後,如果你對折騰 eBPF、Rust 還有 Kubernetes 感興趣,請聯繫 – careers(at)deepfence(dot)io
