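Kubernetes 1.29 adds nftables support behind a feature gate. This article briefly summarizes how nftables is used, to make the nftables rules generated by Kubernetes easier to understand later. The end of the article gives a kubeadm configuration file that enables the nftables feature gate.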
How nftables differs from iptables
- New syntax: nftables uses a compact syntax inspired by tcpdump.
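- Fully configurable tables and chains: iptables has predefined tables and chains that are registered even when they are not needed, which costs some performance. nftables has no predefined tables or chains, so you must explicitly define each table and the objects it contains (chains, sets, maps, flowtables and stateful objects). You choose the table and chain names as well as the netfilter hook priority.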
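- A single nftables rule can take multiple actions: in iptables a match can trigger only one action, whereas an nftables rule contains zero or more expressions (used to match packets) followed by one or more statements. Each expression tests whether a packet matches a specific payload field or a piece of packet/flow metadata. Expressions are evaluated linearly from left to right: if the first expression matches, the next one is evaluated, and the statements run only if every expression matches.
  Each statement takes one action, such as setting the netfilter mark, counting the packet, logging the packet, or rendering a verdict such as accepting or dropping the packet or jumping to another chain. Statements are also executed linearly from left to right, so a single rule can take multiple actions by using multiple statements (note that a verdict statement ends the rule).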
- No built-in counters on chains and rules: counters are optional in nftables and can be enabled where needed.
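- Better support for dynamic ruleset updates: an iptables ruleset is monolithic, so changing one rule affects the whole ruleset. nftables represents the ruleset internally as a linked list, so adding or deleting a rule leaves the other rules untouched, which simplifies the maintenance of internal state.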
- Simplified IPv4/IPv6 dual-stack management: the nftables inet family lets a chain handle both IPv4 and IPv6, so scripts that duplicate rules are no longer needed.
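- New generic set infrastructure: this infrastructure is tightly integrated into the nftables core and supports advanced configurations such as maps, verdict maps and intervals for performance-oriented packet classification.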
- Support for concatenations: since Linux kernel 4.1 you can concatenate several keys and combine them with maps and verdict maps. The idea is to build a tuple and hash its value to get a lookup cost close to O(1).
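- New protocols without kernel upgrades: upgrading the kernel is a time-consuming and daunting task, especially when you maintain multiple firewalls in a network, and distribution kernels usually lag behind the latest release. Because nftables is built around a virtual machine, supporting a new protocol usually does not require a new kernel, only an updated userspace nft.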
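The left-to-right evaluation of expressions and statements described in the list above can be modeled roughly as follows. This is a minimal Python sketch for illustration only, not how the kernel implements it: the `Rule` class, the dict-based packet and the statement functions are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    # expressions test the packet; statements act on it and may return a verdict
    expressions: list = field(default_factory=list)
    statements: list = field(default_factory=list)


def evaluate(rule: Rule, pkt: dict) -> Optional[str]:
    """Return a verdict such as 'accept', or None if the rule issued no verdict."""
    # Expressions are evaluated left to right; the first mismatch stops the rule.
    for expr in rule.expressions:
        if not expr(pkt):
            return None
    # All expressions matched: run the statements left to right.
    for stmt in rule.statements:
        verdict = stmt(pkt)
        if verdict is not None:
            return verdict  # a verdict statement ends the rule
    return None


def count(pkt):
    pkt["counted"] = pkt.get("counted", 0) + 1


def set_mark(pkt):
    pkt["mark"] = 1


def accept(pkt):
    return "accept"


# Roughly models: tcp dport 22 counter meta mark set 1 accept
rule = Rule(
    expressions=[lambda p: p["l4proto"] == "tcp", lambda p: p["dport"] == 22],
    statements=[count, set_mark, accept],
)

ssh = {"l4proto": "tcp", "dport": 22}
dns = {"l4proto": "udp", "dport": 53}
print(evaluate(rule, ssh), ssh)
print(evaluate(rule, dns), dns)
```

The second packet fails the first expression, so none of the statements run and its dict is left unchanged.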
Structure of nftables
Like iptables, nftables uses the table -> chain -> rule hierarchy, and it uses the concept of a family to distinguish packet types.
ADDRESS FAMILIES
The address family determines the type of packets that are processed; the default is ip.
Family | Description |
---|---|
ip | IPv4 address family. |
ip6 | IPv6 address family. |
inet | Internet (IPv4/IPv6) address family. |
arp | ARP address family, handling IPv4 ARP packets. |
bridge | Bridge address family, handling packets which traverse a bridge device. |
netdev | Netdev address family, handling packets on ingress and egress. |
The kernel places hooks at different stages of the packet processing path, and each address family has its own set of hooks. The hooks of the IPv4/IPv6/Inet families are:
Hook | Description |
---|---|
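prerouting | All packets entering the system are processed by the prerouting hook. It is invoked before routing and is used for early filtering or for changing packet attributes that affect the routing result. |
input | Packets delivered to the local system are processed by the input hook. |
forward | Packets forwarded to a different host are processed by the forward hook. |
output | Packets sent by local processes are processed by the output hook. |
postrouting | All packets leaving the system are processed by the postrouting hook. |
ingress | All packets entering the system are processed by this hook. It is invoked before layer 3 protocol handling, hence before the prerouting hook, and can be used for filtering and policing. Among these families, ingress is only available for the inet family (since Linux kernel 5.10). |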
Special syntax
Concatenation operator
nft uses . as the concatenation operator. The rule below matches 1.1.1.1 and 2.2.2.2 and TCP, or 1.1.1.1 and 3.3.3.3 and UDP:
nft add rule ip filter input ip saddr . ip daddr . ip protocol { 1.1.1.1 . 2.2.2.2 . tcp, 1.1.1.1 . 3.3.3.3 . udp} counter accept
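Conceptually, a concatenation turns several fields into one tuple that is looked up in a hashed set in a single step. The sketch below models the rule above in Python under that assumption; `ALLOWED` and `matches` are hypothetical names, and the kernel's actual set backend is more involved.

```python
# The anonymous set from the rule above, modeled as a Python set of tuples.
ALLOWED = {
    ("1.1.1.1", "2.2.2.2", "tcp"),
    ("1.1.1.1", "3.3.3.3", "udp"),
}


def matches(saddr: str, daddr: str, proto: str) -> bool:
    # ip saddr . ip daddr . ip protocol -> one tuple, one hashed lookup
    return (saddr, daddr, proto) in ALLOWED


print(matches("1.1.1.1", "2.2.2.2", "tcp"))
print(matches("1.1.1.1", "2.2.2.2", "udp"))
```

All three fields must match the same element: the second lookup fails even though each value appears somewhere in the set.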
Intervals
An interval is written as value-value, and an interval can be used as a single argument.
% nft add rule filter input ip daddr 192.168.0.1-192.168.0.250 drop
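Common parameters
The handle position parameter
nftables uses a handle as a position ID for adding or deleting entries; it corresponds to the handle parameter in the commands. The -a option prints handle information.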
$ nft -a list ruleset
Comments (comment)
A comment is a single word or several words in double quotes. In a shell, the quotes must be escaped with backslashes, e.g. \"enable ssh for servers\"
DATA TYPES
Data types define the types of the values used in expressions.
RULESET
- {list | flush} ruleset [family]
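Note: in the synopses, bold text marks keywords and italic text marks user-defined fields; fields in [] are optional and have default values.
ruleset refers to all tables, chains and so on.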
Example
- nft list ruleset: lists all nftables rules; the counterpart of iptables-save for iptables.
  - A family can also be specified, e.g. nft list ruleset inet
- nft flush ruleset: removes all nftables rules
Backup and restore
Backup
% echo "flush ruleset" > backup.nft
% nft list ruleset >> backup.nft
Restore
% nft -f backup.nft
TABLES
- {add | create} table [family] table [ {comment comment ;} { flags flags ; }]
- {delete | destroy | list | flush} table [family] table
- list tables [family]
- delete table [family] handle handle
- destroy table [family] handle handle
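Tables contain chains, sets and stateful objects, and are identified by their address family and their name. The address family must be one of ip, ip6, inet, arp, bridge or netdev. inet is used to create hybrid IPv4/IPv6 tables; if no family is given, ip is used by default. The difference between add and create is that add does not return an error if the table already exists, whereas create does. delete and destroy differ in the same way: delete returns an error if the table does not exist, whereas destroy does not.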
Example
$ nft add table inet my_table
CHAINS
- {add | create} chain [family] table chain [{ type type hook hook [device device] priority priority ; [policy policy ;] [comment comment ;] }]
- {delete | destroy | list | flush} chain [family] table chain
- list chains [family]
- delete chain [family] table handle handle
- destroy chain [family] table handle handle
- rename chain [family] table chain newname
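Chains contain rules. There are two kinds of chains: base chains and regular chains. A base chain is an entry point for packets from the networking stack; a regular chain is used as a jump target and to organize rules.
A base chain requires the type, hook and priority parameters: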
Type | Families | Hooks | Description |
---|---|---|---|
filter | all | all | Standard chain type to use in doubt. |
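nat | ip, ip6, inet | prerouting, input, output, postrouting | Chains of this type perform network address translation based on conntrack entries. Only the first packet of a connection traverses this chain; its rules define the details of the conntrack entry being created. |
route | ip, ip6 | output | If a packet has traversed a chain of this type and is about to be accepted, a new route lookup is performed if relevant fields of the IP header have changed. This makes it possible to implement policy routing selectors in nftables. |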
priority
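The priority parameter accepts a signed integer or a standard priority name and specifies the order in which chains attached to the same hook are traversed. Chains with lower values are evaluated first, i.e. a smaller number means a higher priority.
For chains of type nat, the priority value must be greater than -200, because conntrack hooks at this priority and NAT requires it.
Standard priority names with their family and hook compatibility matrix: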
Name | Value | Families | Hooks |
---|---|---|---|
raw | -300 | ip, ip6, inet | all |
mangle | -150 | ip, ip6, inet | all |
dstnat | -100 | ip, ip6, inet | prerouting |
filter | 0 | ip, ip6, inet, arp, netdev | all |
security | 50 | ip, ip6, inet | all |
srcnat | 100 | ip, ip6, inet | postrouting |
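Basic arithmetic expressions (addition or subtraction) can be combined with the standard priority names to express relative priorities, e.g. mangle - 5 stands for -155. Commands such as list also use this notation when printing chains.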
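To make the priority arithmetic concrete, the following Python sketch resolves priority specifications to integers and sorts chains into the order in which a hook would traverse them. The values come from the ip/ip6/inet table above; `resolve_priority` and the chain names are hypothetical and exist only for this illustration.

```python
import re

# Standard priority names for the ip, ip6 and inet families (from the table above).
STANDARD = {"raw": -300, "mangle": -150, "dstnat": -100,
            "filter": 0, "security": 50, "srcnat": 100}


def resolve_priority(spec: str) -> int:
    """Resolve '-150', 'filter' or 'mangle - 5' to an integer priority."""
    m = re.fullmatch(r"\s*([a-z]+)\s*(?:([+-])\s*(\d+))?\s*", spec)
    if m is None:
        return int(spec)  # plain signed integer
    value = STANDARD[m.group(1)]
    if m.group(2):
        delta = int(m.group(3))
        value += delta if m.group(2) == "+" else -delta
    return value


# Chains on the same hook run in ascending priority order.
chains = {"input": "1", "services": "filter",
          "mangle_pre": "mangle - 5", "trace": "raw - 1"}
order = sorted(chains, key=lambda name: resolve_priority(chains[name]))
print(order)
```

Here `trace` (-301) sorts before `mangle_pre` (-155), `services` (0) and `input` (1).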
Standard priority names and hook compatibility for the bridge family:
Name | Value | Hooks |
---|---|---|
dstnat | -300 | prerouting |
filter | -200 | all |
out | 100 | output |
srcnat | 300 | postrouting |
policy
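Defines whether packets that reach the end of the chain without a rule issuing a verdict are accepted or dropped. Possible values are accept (the default) and drop.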
Example
$ nft add chain inet mytable myin { type filter hook input priority 1; policy accept;}
The following shows how priority and policy interact:
table inet filter {
    # This chain is evaluated first due to priority
    chain services {
        type filter hook input priority 0; policy accept;
        # If matched, this rule will prevent any further evaluation
        tcp dport http drop
        # If matched, and despite the accept verdict, the packet proceeds to enter the chain below
        tcp dport ssh accept
        # Likewise for any packets that get this far and hit the default policy
    }
    # This chain is evaluated last due to priority
    chain input {
        type filter hook input priority 1; policy drop;
        # All ingress packets end up being dropped here!
    }
}
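The behaviour of the ruleset above (an accept only ends the current base chain, while a drop is final) can be simulated with a small Python model. This is a simplified sketch under that assumption; `run_hook`, `dport` and the tuple layout are hypothetical.

```python
def run_hook(chains, pkt):
    """chains: list of (priority, policy, rules); a rule returns 'accept', 'drop' or None."""
    for _prio, policy, rules in sorted(chains, key=lambda c: c[0]):
        verdict = None
        for rule in rules:
            verdict = rule(pkt)
            if verdict is not None:
                break
        if verdict is None:
            verdict = policy  # no rule issued a verdict: the chain policy applies
        if verdict == "drop":
            return "drop"  # drop is final; later chains are not consulted
        # accept only ends this chain; the next chain on the hook still runs
    return "accept"


def dport(port, verdict):
    # Build a rule that issues `verdict` for packets to the given port.
    return lambda pkt: verdict if pkt["dport"] == port else None


services = (0, "accept", [dport(80, "drop"), dport(22, "accept")])
input_chain = (1, "drop", [])      # as in the example: policy drop, no rules
relaxed_input = (1, "accept", [])  # variant with policy accept

print(run_hook([services, input_chain], {"dport": 22}))
print(run_hook([services, relaxed_input], {"dport": 22}))
```

With the original `input` chain even SSH traffic ends up dropped, which is exactly the pitfall the comments in the ruleset point out.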
RULES
- {add | insert} rule [family] table chain [handle handle | index index] statement ... [comment comment]
- replace rule [family] table chain handle handle statement ... [comment comment]
- {delete | reset} rule [family] table chain handle handle
- destroy rule [family] table chain handle handle
- reset rules [family] [table [chain]]
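If no family is specified for a rule, ip is used by default. A rule consists of two parts: expressions and statements.
Both add and insert add a rule: add appends the rule to the given chain (or places it after the position given by handle), while insert puts the rule at the beginning of the given chain (or before the position given by handle).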
Example
$ nft add rule ip filter output ip daddr 192.168.0.0/24 accept
# nft -a list ruleset
table inet filter {
    chain input {
        type filter hook input priority filter; policy accept;
        ct state established,related accept # handle 4
        ip saddr 10.1.1.1 tcp dport ssh accept # handle 5
        ...
# delete the rule with handle 5
nft delete rule inet filter input handle 5
SETS
- add set [family] table set { type type | typeof expression ; [flags flags ;] [timeout timeout ;] [gc-interval gc-interval ;] [elements = { element[, ...] } ;] [size size ;] [comment comment ;] [policy policy ;] [auto-merge ;] }
- {delete | destroy | list | flush} set [family] table set
- list sets [family]
- delete set [family] table handle handle
- {add | delete | destroy } element [family] table set { element[, ...] }
Keyword | Description | Type |
---|---|---|
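type | Data type of the set elements | string: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark |
typeof | Data type of the set elements | data type derived from an expression |
flags | Set flags | string: constant, dynamic, interval, timeout. Describes the properties of the set |
timeout | How long an element stays in the set; mandatory if the set is added to from the packet path (ruleset) | string, decimal followed by a unit. Units are: d, h, m, s |
gc-interval | Garbage collection interval; only takes effect when timeout or flag timeout is specified | string, decimal followed by a unit. Units are: d, h, m, s |
elements | Elements contained in the set | set data type |
size | Maximum number of elements in the set; mandatory if the set is added to from the packet path (ruleset) | unsigned integer (64 bit) |
policy | Set policy | string: performance [default], memory |
auto-merge | Automatically merge adjacent or overlapping set elements (only for interval sets) | |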
nftables provides two kinds of sets: anonymous sets and named sets.
Example
Basic usage of sets:
$ nft add set ip filter flags_set {type ipv4_addr\; flags constant, interval\;}
$ nft add set ip filter daddrs {type ipv4_addr \; flags timeout \; elements={192.168.1.1 timeout 10s, 192.168.1.2 timeout 30s} \;}
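Anonymous sets
An anonymous set is written in curly braces with comma-separated elements. It is removed together with the rule that uses it, and its contents cannot be changed (other than by deleting and re-adding the rule). An anonymous set does not need an element type definition. Some examples of anonymous sets: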
$ nft add rule filter input ip saddr { 10.0.0.0/8, 192.168.0.0/16 } tcp dport { 22, 443 } accept
$ nft add rule ip6 filter input tcp dport {telnet, http, https} accept
$ nft add rule ip6 filter input icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept
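Named sets
Elements can be added to or removed from a named set at any time. A named set is referenced by prefixing its name with @. A named set must define its element type before element values are given.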
$ nft add set ip filter blackhole { type ipv4_addr\; comment \"drop all packets from these hosts\" \; }
$ nft add element ip filter blackhole { 192.168.3.4 } # add an element
$ nft add element ip filter blackhole { 192.168.1.4, 192.168.1.5 } # add elements
$ nft add rule ip filter input ip saddr @blackhole drop # reference the named set
$ nft add rule ip filter output ip daddr != @blackhole accept
MAPS
- add map [family] table map { type type | typeof expression ; [flags flags ;] [elements = { element[, ...] } ;] [size size ;] [comment comment ;] [policy policy ;] }
- {delete | destroy | list | flush} map [family] table map
- list maps [family]
Keyword | Description | Type |
---|---|---|
type | Data type of the map elements | string: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark, counter, quota. counter and quota cannot be used as keys |
typeof | Data type of the map elements | data type derived from an expression |
flags | Map flags | string, same as set flags |
elements | Elements contained in the map | map data type |
size | Maximum number of elements in the map | unsigned integer (64 bit) |
policy | Map policy | string: performance [default], memory |
Anonymous maps
DNAT to 192.168.1.100 if the destination port is 80, and to 192.168.1.101 if the destination port is 8888:
$ nft add rule ip nat prerouting dnat to tcp dport map { 80 : 192.168.1.100, 8888 : 192.168.1.101 }
Named maps
$ nft add map nat porttoip { type inet_service: ipv4_addr\; }
$ nft add element nat porttoip { 80 : 192.168.1.100, 8888 : 192.168.1.101 }
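With the rule below, the map is looked up by TCP destination port: outgoing packets to TCP port 80 have their source address translated to 192.168.1.100, and packets to TCP port 8888 to 192.168.1.101.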
$ nft add rule ip nat postrouting snat to tcp dport map @porttoip
ELEMENTS
- {add | create | delete | destroy | get } element [family] table set { ELEMENT[, ...] }
ELEMENT := key_expression OPTIONS [: value_expression]
OPTIONS := [timeout TIMESPEC] [expires TIMESPEC] [comment string]
TIMESPEC := [numd][numh][numm][num[s]]
Option | Description |
---|---|
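timeout | Timeout value for sets/maps with flag timeout |
expires | Countdown that starts from timeout; the element is removed when it reaches 0 |
comment | Per-element comment field |
The element commands change the contents of named sets and named maps. key_expression is a value of the set's type; value_expression is used with maps, where it is the data part of the mapping.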
Example
$ nft add table inet myfilter
$ nft add set inet myfilter myset {type ipv4_addr\; flags timeout\; }
$ nft add element inet myfilter myset {10.0.0.1 timeout 10s }
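The TIMESPEC grammar shown above ([numd][numh][numm][num[s]]) is easy to misread, so here is a small Python sketch that converts such a string into seconds. It covers only the grammar as quoted in this article; `timespec_to_seconds` is a hypothetical helper, not part of nft.

```python
import re

# [numd][numh][numm][num[s]]: every part is optional, the trailing "s" too.
_TIMESPEC = re.compile(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s?)?")


def timespec_to_seconds(spec: str) -> int:
    m = _TIMESPEC.fullmatch(spec)
    if m is None or not spec:
        raise ValueError(f"invalid TIMESPEC: {spec!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


for spec in ("10s", "1h30m", "1d2h3m4s", "30"):
    print(spec, timespec_to_seconds(spec))
```

A bare number such as `30` is read as seconds, matching the optional `s` in the grammar.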
FLOWTABLES
Flowtables accelerate packet forwarding by bypassing the classic forwarding path.
                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions
STATEFUL OBJECTS
Stateful objects are referenced with "type name", e.g. "counter name", "quota name" or "limit name".
COUNTER
- add counter [family] table counter_name [{ [ packets packets bytes bytes ; ] [ comment comment ; ] }]
- delete counter [family] table counter_name
- list counters
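A counter counts the total number of packets and the total number of bytes it has seen since it was last reset. A counter must be specified explicitly in every rule that should be counted.
Anonymous counters
An anonymous counter belongs only to the rule it appears in. The anonymous counter below counts all TCP traffic routed to the local host: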
table ip counter_demo {
    chain IN {
        type filter hook input priority filter; policy drop;
        ip protocol tcp counter
    }
}
Named counters
$ nft add counter filter http
$ nft add rule filter input tcp dport 80 counter name \"http\"
Using counters in a map:
$ nft add counter filter http
$ nft add counter filter https
$ nft add rule filter input counter name tcp dport map { 80 : \"http\", 443 : \"https\" }
Resetting a counter:
$ nft reset counter filter http
Quotas
- add quota [family] table name { [over|until] bytes BYTE_UNIT [ used bytes BYTE_UNIT ] ; [ comment comment ; ] }
- BYTE_UNIT := bytes | kbytes | mbytes
- delete quota [family] table name
- list quotas
Keyword | Description | Type |
---|---|---|
quota | Quota limit, used as the quota name | Two arguments: unsigned integer (64 bit) and string: bytes, kbytes, mbytes. "over" or "until" goes before these arguments |
used | Initial value of the used quota | Two arguments: unsigned integer (64 bit) and string: bytes, kbytes, mbytes |
comment | Per-quota comment | string |
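A quota:
- defines a threshold number of bytes;
- sets an initial byte count (0 bytes by default);
- counts the total number of bytes received, starting from the initial count, until the count reaches or exceeds the threshold.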
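The three points above can be sketched as a tiny Python class. This is a simplified model: the `Quota` class is hypothetical, units are plain bytes, and treating "reaches or exceeds" as `>=` is an assumption about the boundary case.

```python
class Quota:
    def __init__(self, limit_bytes: int, over: bool = False, used: int = 0):
        self.limit = limit_bytes
        self.over = over   # True models "over", False models "until"
        self.used = used   # initial byte count (the "used" keyword)

    def matches(self, packet_len: int) -> bool:
        """Account for one packet and report whether the quota expression matches."""
        self.used += packet_len
        exceeded = self.used >= self.limit  # assumption: reaching the limit counts
        return exceeded if self.over else not exceeded


until_100 = Quota(100)            # like: quota until 100 bytes
over_100 = Quota(100, over=True)  # like: quota over 100 bytes
for size in (60, 60):
    print(until_100.matches(size), over_100.matches(size))
```

An `until` quota matches while traffic is below the threshold, so it pairs naturally with accept; an `over` quota matches once the threshold is passed, so it pairs with drop.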
Anonymous quotas
The configuration below accepts traffic to port udp/5060 until 100 mbytes have been used.
table inet anon_quota_demo {
    chain IN {
        type filter hook input priority filter; policy drop;
        udp dport 5060 quota until 100 mbytes accept
    }
}
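Named quotas
The following defines two quotas and references both of them in chain IN.
- Traffic to port udp/5060 is accepted up to 100 mbytes; the remaining packets to that port are dropped.
- Traffic to port tcp/80 is accepted up to 500 mbytes; the remaining packets to that port are dropped.
- Packets to tcp/443 (https) are not limited.
- All other packets are dropped (policy drop).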
table inet quota_demo {
    quota q_until_sip { until 100 mbytes used 0 bytes }
    quota q_over_http { over 500 mbytes ; comment "cap http (but not https)" ; }
    chain IN {
        type filter hook input priority filter; policy drop;
        udp dport 5060 quota name "q_until_sip" accept
        tcp dport 80 quota name "q_over_http" drop
        tcp dport { 80, 443 } accept
    }
}
An example of referencing quotas from a map:
$ nft add quota filter user123 { over 20 mbytes }
$ nft add quota filter user124 { over 20 mbytes }
$ nft add rule filter input quota name ip saddr map { 192.168.10.123 : \"user123\", 192.168.10.124 : \"user124\" }
Limits
Limits use a token bucket to rate-limit traffic.
Anonymous limits
The rule below accepts at most 10 ICMP echo-request packets per second:
$ nft add rule filter input icmp type echo-request limit rate 10/second accept
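Named limits
The following defines two limits so that the chain will:
- accept all types of icmp packets up to a maximum rate of 400 packets / minute;
- accept traffic to port tcp/25 (smtp) up to a maximum rate of 1024 bytes / second, allowing bursts of up to 512 bytes;
- drop all other traffic (policy drop).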
table inet limit_demo {
    limit lim_400ppm { rate 400/minute ; comment "use to limit incoming icmp" ; }
    limit lim_1kbps { rate 1024 bytes/second burst 512 bytes ; comment "use to limit incoming smtp" ; }
    chain IN {
        type filter hook input priority filter; policy drop;
        meta l4proto icmp limit name "lim_400ppm" accept
        tcp dport 25 limit name "lim_1kbps" accept
    }
}
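For readers unfamiliar with the token bucket mentioned above, the following Python sketch shows the general algorithm. It is a textbook version with explicit timestamps, not the kernel's implementation; `TokenBucket` and its parameters are hypothetical names.

```python
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # tokens added per second
        self.capacity = burst      # maximum number of stored tokens
        self.tokens = burst        # start with a full bucket
        self.last = 0.0            # timestamp of the previous check

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to the elapsed time, capped at the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost    # spend tokens for this packet
            return True
        return False               # bucket empty: the packet is over the limit


bucket = TokenBucket(rate_per_sec=10, burst=5)
print([bucket.allow(now=0.0) for _ in range(6)])
print(bucket.allow(now=0.5))
```

For a packet-based limit the cost is 1 per packet; for a byte-based limit such as lim_1kbps the cost would be the packet length.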
Others
- Conntrack helpers (ct helper, Layer 7 ALG)
- Conntrack timeout policies (ct timeout)
- Conntrack expectations (ct expectation)
Data types
EXPRESSIONS
PRIMARY EXPRESSIONS
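Expressions are used to match packets.
Primary expressions are the lowest-order expressions; they represent constants or single datums taken from a packet's payload, its metadata or a stateful module.
Only some of the expressions are listed below.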
META EXPRESSIONS
- meta {length | nfproto | l4proto | protocol | priority}
- [meta] {mark | iif | iifname | iiftype | oif | oifname | oiftype | skuid | skgid | nftrace | rtclassid | ibrname | obrname | pkttype | cpu | iifgroup | oifgroup | cgroup | random | ipsec | iifkind | oifkind | time | hour | day }
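As the synopsis shows, there are many kinds of metadata, and each kind has its own data type. They are grouped here by type.
A meta expression refers to metadata associated with a packet.
There are two kinds of meta expressions, unqualified and qualified; the only difference is whether the meta keyword is present: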
# qualified meta expression
filter output meta oif eth0
filter forward meta iifkind { "tun", "veth" }
# unqualified meta expression
filter output oif eth0
ROUTING EXPRESSIONS
- rt [ip | ip6] {classid | nexthop | mtu | ipsec}
Keyword | Description | Type |
---|---|---|
classid | Routing realm | realm |
nexthop | Routing nexthop | ipv4_addr/ipv6_addr |
mtu | TCP maximum segment size of route | integer (16 bit) |
ipsec | route via ipsec tunnel or transport | boolean |
A routing expression refers to routing data associated with a packet:
# IP family independent rt expression
filter output rt classid 10
# IP family dependent rt expressions
ip filter output rt nexthop 192.168.0.1
ip6 filter output rt nexthop fd00::1
inet filter output rt ip nexthop 192.168.0.1
inet filter output rt ip6 nexthop fd00::1
# outgoing packet will be encapsulated/encrypted by ipsec
filter output rt ipsec exists
NUMGEN EXPRESSION
- numgen {inc | random} mod NUM [ offset NUM ]
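Generates a number. The inc or random keyword controls the mode of operation: in inc mode the last returned value is simply incremented, while in random mode a random number is returned. mod applies a modulus to the value, and the optional offset adds a fixed offset to the result.
numgen is typically used for load balancing: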
# round-robin between 192.168.10.100 and 192.168.20.200:
add rule nat prerouting dnat to numgen inc mod 2 map \
{ 0 : 192.168.10.100, 1 : 192.168.20.200 }
# probability-based with odd bias using intervals:
add rule nat prerouting dnat to numgen random mod 10 map \
{ 0-2 : 192.168.10.100, 3-9 : 192.168.20.200 }
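The two numgen rules above can be modeled in Python as follows. This is a sketch of the semantics only (a wrapping counter or a random number, then a map lookup); `numgen_inc`, `numgen_random` and `lookup_interval` are hypothetical helpers, and starting the counter at 0 is an assumption.

```python
import itertools
import random


def numgen_inc(mod: int, offset: int = 0):
    """Yield offset, offset+1, ..., offset+mod-1 and wrap around (models 'numgen inc')."""
    for n in itertools.count():
        yield n % mod + offset


def numgen_random(mod: int, offset: int = 0) -> int:
    """Return a random number in [offset, offset+mod) (models 'numgen random')."""
    return random.randrange(mod) + offset


def lookup_interval(mapping, key):
    """mapping: list of ((low, high), value) with inclusive bounds, like { 0-2 : a, 3-9 : b }."""
    for (low, high), value in mapping:
        if low <= key <= high:
            return value
    return None


# Round-robin between two backends.
round_robin = {0: "192.168.10.100", 1: "192.168.20.200"}
gen = numgen_inc(2)
print([round_robin[next(gen)] for _ in range(4)])

# 30% / 70% split using intervals.
weighted = [((0, 2), "192.168.10.100"), ((3, 9), "192.168.20.200")]
print(lookup_interval(weighted, numgen_random(10)))
```

In the weighted case, three of the ten possible values fall into the first interval, so roughly 30% of new connections go to the first backend.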
HASH EXPRESSIONS
- jhash {ip saddr | ip6 daddr | tcp dport | udp sport | ether saddr} [. ...] mod NUM [ seed NUM ] [ offset NUM ]
- symhash mod NUM [ offset NUM ]
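Uses a hash function to generate a number. The available hash functions are jhash (Jenkins hash) and symhash (symmetric hash). jhash requires an expression that determines the hash input (packet header fields); several fields can be concatenated. mod applies a modulus to the value, the optional seed specifies the seed of the hash function, and the optional offset adds a fixed offset to the result.
jhash and symhash are typically used for load balancing: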
# load balance based on source ip between 2 ip addresses:
add rule nat prerouting dnat to jhash ip saddr mod 2 map \
{ 0 : 192.168.10.100, 1 : 192.168.20.200 }
# symmetric load balancing between 2 ip addresses:
add rule nat prerouting dnat to symhash mod 2 map \
{ 0 : 192.168.10.100, 1 : 192.168.20.200 }
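The idea behind hash-based load balancing is that the same input always lands in the same bucket. The Python sketch below illustrates this with CRC32 as a stand-in hash; it is deliberately not the Jenkins hash used by jhash, so the buckets it picks will differ from what nftables picks. `pick_backend` and `BACKENDS` are hypothetical names.

```python
import ipaddress
import zlib

BACKENDS = {0: "192.168.10.100", 1: "192.168.20.200"}


def pick_backend(saddr: str, seed: int = 0) -> str:
    # Hash the packed source address, then reduce it with "mod 2" like the rule above.
    key = ipaddress.ip_address(saddr).packed
    bucket = zlib.crc32(key, seed) % len(BACKENDS)
    return BACKENDS[bucket]


for client in ("10.0.0.1", "10.0.0.2", "10.0.0.1"):
    print(client, "->", pick_backend(client))
```

Unlike numgen, a given client keeps hitting the same backend for as long as the seed and the number of backends stay the same.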
PAYLOAD EXPRESSIONS
Payload expressions refer to data from the packet's payload. The expressions for commonly used headers are listed below.
ETHERNET HEADER EXPRESSION
- ether {daddr | saddr | type}
Keyword | Description | Type |
---|---|---|
daddr | Destination MAC address | ether_addr |
saddr | Source MAC address | ether_addr |
type | EtherType | ether_type |
VLAN HEADER EXPRESSION
The vlan expression does not work in the ip, ip6 and inet families unless the interface is configured with reorder_hdr off.
- vlan {id | dei | pcp | type}
Keyword | Description | Type |
---|---|---|
id | VLAN ID (VID) | integer (12 bit) |
dei | Drop Eligible Indicator | integer (1 bit) |
pcp | Priority code point | integer (3 bit) |
type | EtherType | ether_type |
IPV4 HEADER EXPRESSION
- ip {version | hdrlength | dscp | ecn | length | id | frag-off | ttl | protocol | checksum | saddr | daddr }
Keyword | Description | Type |
---|---|---|
version | IP header version (4) | integer (4 bit) |
hdrlength | IP header length including options | integer (4 bit) FIXME scaling |
dscp | Differentiated Services Code Point | dscp |
ecn | Explicit Congestion Notification | ecn |
length | Total packet length | integer (16 bit) |
id | IP ID | integer (16 bit) |
frag-off | Fragment offset | integer (16 bit) |
ttl | Time to live | integer (8 bit) |
protocol | Upper layer protocol | inet_proto |
checksum | IP header checksum | integer (16 bit) |
saddr | Source address | ipv4_addr |
daddr | Destination address | ipv4_addr |
ICMP HEADER EXPRESSION
- icmp {type | code | checksum | id | sequence | gateway | mtu}
When used in the inet, bridge or netdev families, this expression creates an implicit dependency on IPv4.
Keyword | Description | Type |
---|---|---|
type | ICMP type field | icmp_type |
code | ICMP code field | integer (8 bit) |
checksum | ICMP checksum field | integer (16 bit) |
id | ID of echo request/response | integer (16 bit) |
sequence | sequence number of echo request/response | integer (16 bit) |
gateway | gateway of redirects | integer (32 bit) |
mtu | MTU of path MTU discovery | integer (16 bit) |
TCP HEADER EXPRESSION
- tcp {sport | dport | sequence | ackseq | doff | reserved | flags | window | checksum | urgptr}
Keyword | Description | Type |
---|---|---|
sport | Source port | inet_service |
dport | Destination port | inet_service |
sequence | Sequence number | integer (32 bit) |
ackseq | Acknowledgement number | integer (32 bit) |
doff | Data offset | integer (4 bit) FIXME scaling |
reserved | Reserved area | integer (4 bit) |
flags | TCP flags | tcp_flag |
window | Window | integer (16 bit) |
checksum | Checksum | integer (16 bit) |
urgptr | Urgent pointer | integer (16 bit) |
UDP HEADER EXPRESSION
- udp {sport | dport | length | checksum}
Keyword | Description | Type |
---|---|---|
sport | Source port | inet_service |
dport | Destination port | inet_service |
length | Total packet length | integer (16 bit) |
checksum | Checksum | integer (16 bit) |
EXTENSION HEADER EXPRESSIONS
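Extension header expressions refer to extension fields of protocol headers, such as IPv6 extension headers, TCP options and IPv4 options.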
CONNTRACK EXPRESSIONS
- ct {state | direction | status | mark | expiration | helper | label | count | id}
- ct [original | reply] {l3proto | protocol | bytes | packets | avgpkt | zone}
- ct {original | reply} {proto-src | proto-dst}
- ct {original | reply} {ip | ip6} {saddr | daddr}
Conntrack expressions match metadata of the connection a packet belongs to.
STATEMENTS
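Statements represent the action to be performed. They can alter the control flow (return, jump to another chain, accept or drop the packet) or perform actions such as logging or rejecting a packet.
Some of the main statements are listed below.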
VERDICT STATEMENT
- {accept | drop | queue | continue | return}
- {jump | goto} chain
The verdict statement alters the control flow in the ruleset and issues policy decisions for packets.
accept and drop are absolute verdicts.
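Verdict | Description |
---|---|
accept | Terminate ruleset evaluation and accept the packet. The packet can still be dropped later by another hook or by a later base chain on the same hook. |
drop | Terminate ruleset evaluation and drop the packet. The drop takes effect immediately: no further chains or hooks are evaluated, so a later chain cannot accept the packet again. |
queue | Terminate ruleset evaluation and queue the packet to userspace. Userspace must provide a drop or accept verdict. In case of accept, processing resumes with the next base chain hook, not with the rule following the queue verdict. |
continue | Continue ruleset evaluation with the next rule. This is the default behaviour when a rule issues no verdict. |
return | Return from the current chain and continue evaluation at the next rule in the calling chain. If issued in a base chain, it is equivalent to the base chain policy. |
jump chain | Continue evaluation at the first rule in chain. The current position in the ruleset is pushed to a call stack, and evaluation resumes there when the new chain has been entirely evaluated or a return verdict is issued. If a rule in the chain issues an absolute verdict, ruleset evaluation terminates immediately and the corresponding action is taken. |
goto chain | Similar to jump, but the current position is not pushed to the call stack. After the new chain has been evaluated, evaluation therefore continues at the last chain on the call stack instead of returning to the chain containing the goto statement. |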
Example
# process packets from eth0 and the internal network in from_lan
# chain, drop all packets from eth0 with different source addresses.
filter input iif eth0 ip saddr 192.168.0.0/24 jump from_lan
filter input iif eth0 drop
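The difference between jump and goto comes down to whether the current position is pushed onto the call stack. The following Python sketch models that mechanism in a simplified way; the `evaluate` function, the tuple-based verdicts and the result string "policy" are hypothetical conventions for this illustration.

```python
def evaluate(chains, base, pkt):
    """chains: name -> list of rules; a rule returns None or a tuple such as
    ('accept',), ('drop',), ('return',), ('jump', name) or ('goto', name)."""
    stack = []                     # saved (chain, next rule index) positions
    chain, index = base, 0
    while True:
        rules = chains[chain]
        if index >= len(rules):
            action = ("return",)   # the end of a chain behaves like return
        else:
            action = rules[index](pkt) or ("continue",)
        index += 1
        kind = action[0]
        if kind in ("accept", "drop"):
            return kind            # absolute verdicts end the evaluation
        if kind == "jump":
            stack.append((chain, index))   # remember where to resume
            chain, index = action[1], 0
        elif kind == "goto":
            chain, index = action[1], 0    # nothing pushed: no way back here
        elif kind == "return":
            if not stack:
                return "policy"    # returning from the base chain: policy applies
            chain, index = stack.pop()


chains_jump = {
    "input": [lambda p: ("jump", "from_lan"), lambda p: ("drop",)],
    "from_lan": [lambda p: None],
}
chains_goto = {
    "input": [lambda p: ("goto", "from_lan"), lambda p: ("drop",)],
    "from_lan": [lambda p: None],
}
print(evaluate(chains_jump, "input", {}))  # resumes in input after from_lan
print(evaluate(chains_goto, "input", {}))  # never comes back to input
```

With jump, evaluation resumes in input and reaches the drop rule; with goto, the rest of input is skipped and the base chain policy decides.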
REJECT STATEMENT
- reject [ with REJECT_WITH ]
REJECT_WITH := icmp icmp_code |
icmpv6 icmpv6_code |
icmpx icmpx_code |
tcp reset
Rejects the matched packet and sends back an error. The default error is port-unreachable. The reject statement can only be used in the input, forward or output hooks.
LOG STATEMENT
- log [prefix quoted_string] [level syslog-level] [flags log-flags]
- log group nflog_group [prefix quoted_string] [queue-threshold value] [snaplen size]
- log level audit
Logs matched packets to the kernel log, which can be read with dmesg(1) or from syslog. Logs can also be passed to userspace by listening on an nflog_group.
COUNTER STATEMENT
Sets the packet count and the byte count of matched packets.
- counter packets number bytes number
- counter { packets number | bytes number }
MAP STATEMENT
- expression map { MAP_ELEMENTS }
MAP_ELEMENTS := MAP_ELEMENT [, MAP_ELEMENTS]
MAP_ELEMENT := key : value
The map statement looks up data based on a specific key; the key is usually the value returned by an expression.
# select DNAT target based on TCP dport; the key is dport
# connections to port 80 are redirected to 192.168.1.100,
# connections to port 8888 are redirected to 192.168.1.101
nft add rule ip nat prerouting dnat tcp dport map { 80 : 192.168.1.100, 8888 : 192.168.1.101 }
# source address based SNAT; the key is saddr
# packets from net 192.168.1.0/24 will appear as originating from 10.0.0.1,
# packets from net 192.168.2.0/24 will appear as originating from 10.0.0.2
nft add rule ip nat postrouting snat to ip saddr map { 192.168.1.0/24 : 10.0.0.1, 192.168.2.0/24 : 10.0.0.2 }
VMAP STATEMENT
- expression vmap { VMAP_ELEMENTS }
VMAP_ELEMENTS := VMAP_ELEMENT [, VMAP_ELEMENTS]
VMAP_ELEMENT := key : verdict
The vmap statement is similar to the map statement, but its values are verdicts.
# jump to different chains depending on layer 4 protocol type:
$ nft add rule ip filter input ip protocol vmap { tcp : jump tcp-chain, udp : jump udp-chain , icmp : jump icmp-chain }
NAT STATEMENTS
- snat [[ip | ip6] [ prefix ] to] ADDR_SPEC [:PORT_SPEC] [FLAGS]
- dnat [[ip | ip6] [ prefix ] to] ADDR_SPEC [:PORT_SPEC] [FLAGS]
- masquerade [to :PORT_SPEC] [FLAGS]
- redirect [to :PORT_SPEC] [FLAGS]
ADDR_SPEC := address | address - address
PORT_SPEC := port | port - port
FLAGS := FLAG [, FLAGS]
FLAG := persistent | random | fully-random
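NAT statements are only valid in chains of type nat.
The snat and masquerade statements modify the source address of a packet. snat can only be used in the postrouting and input chains, and masquerade only in the postrouting chain. The dnat and redirect statements can only be used in the prerouting and output chains and modify the destination address of a packet.
The masquerade statement is a special form of snat that translates the source address to the IP address of the outgoing interface. It is particularly useful on gateways with dynamic IP addresses.
The redirect statement is a special form of dnat that translates the destination address to the local host's address. It is useful if you only want to change the destination port of incoming traffic on different interfaces.
Before kernel 4.18, NAT statements require both prerouting and postrouting base chains to be present; otherwise netfilter does not see packets on the return path and therefore does not perform the reverse translation.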
Expression | Description | Type |
---|---|---|
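address | The new source/destination address of the packet. A mapping can be given instead, containing a list of tuples composed of arbitrary expression key and address value. | ipv4_addr, ipv6_addr, e.g. abcd::1234, or a mapping, e.g. meta mark map |
port | The new source/destination port of the packet | port number (16 bit) |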
Basic usage
# create a suitable table/chain setup for all further examples
add table nat
add chain nat prerouting { type nat hook prerouting priority dstnat; }
add chain nat postrouting { type nat hook postrouting priority srcnat; }
# translate source addresses of all packets leaving via eth0 to address 1.2.3.4
add rule nat postrouting oif eth0 snat to 1.2.3.4
# redirect all traffic entering via eth0 to destination address 192.168.1.120
add rule nat prerouting iif eth0 dnat to 192.168.1.120
# translate source addresses of all packets leaving via eth0 to whatever
# locally generated packets would use as source to reach the same destination
add rule nat postrouting oif eth0 masquerade
# redirect incoming TCP traffic for port 22 to port 2222
add rule nat prerouting tcp dport 22 redirect to :2222
# inet family:
# handle ip dnat:
add rule inet nat prerouting dnat ip to 10.0.2.99
# handle ip6 dnat:
add rule inet nat prerouting dnat ip6 to fe80::dead
# this masquerades both ipv4 and ipv6:
add rule inet nat postrouting meta oif ppp0 masquerade
Advanced usage
# map prefixes in one network to that of another, e.g. 10.141.11.4 is mangled to 192.168.2.4,
# 10.141.11.5 is mangled to 192.168.2.5 and so on.
add rule nat postrouting snat ip prefix to ip saddr map { 10.141.11.0/24 : 192.168.2.0/24 }
# map a source address, source port combination to a pool of destination addresses and ports:
add rule nat postrouting dnat to ip saddr . tcp dport map { 192.168.1.2 . 80 : 10.141.10.2-10.141.10.5 . 8888-8999 }
# The above example generates the following NAT expression:
#
# [ nat dnat ip addr_min reg 1 addr_max reg 10 proto_min reg 9 proto_max reg 11 ]
#
# which expects to obtain the following tuple:
# IP address (min), source port (min), IP address (max), source port (max)
# to be obtained from the map. The given addresses and ports are inclusive.
# This also works with named maps and in combination with both concatenations and ranges:
table ip nat {
    map ipportmap {
        typeof ip saddr : interval ip daddr . tcp dport
        flags interval
        elements = { 192.168.1.2 : 10.141.10.1-10.141.10.3 . 8888-8999, 192.168.2.0/24 : 10.141.11.5-10.141.11.20 . 8888-8999 }
    }
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
        ip protocol tcp dnat ip to ip saddr map @ipportmap
    }
}
# @ipportmap maps network prefixes to a range of hosts and ports.
# The new destination is taken from the range provided by the map element.
# Same for the destination port.
# Note the use of the "interval" keyword in the typeof description.
# This is required so nftables knows that it has to ask for twice the
# amount of storage for each key-value pair in the map.
# ": ipv4_addr . inet_service" would allow associating one address and one port
# with each key. But for this case, for each key, two addresses and two ports
# (The minimum and maximum values for both) have to be stored.
MONITOR
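- monitor [new | destroy] MONITOR_OBJECT
- monitor trace
MONITOR_OBJECT := tables | chains | sets | rules | elements | ruleset
The monitor command listens to the Netlink events produced by the nf_tables subsystem. These events relate either to the creation and deletion of objects or to packets for which meta nftrace was enabled. The events are printed to stdout in either JSON or native nft format.
MONITOR_OBJECT filters for events related to a specific kind of object.
new or destroy filters for events related to a specific action.
The second form of invocation takes no options and only prints events generated for packets with nftrace enabled.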
Enabling nftrace
To enable nftrace, append the following statement to a rule:
meta nftrace set 1
nftrace can also be enabled for specific packets only; the following enables it for TCP packets:
ip protocol tcp meta nftrace set 1
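Using a dedicated chain to enable nftrace
Using a dedicated chain to enable nftrace is recommended. The commands below trace the prerouting hook; if a prerouting chain already exists, make sure that trace_chain has a higher priority (a lower priority value) so that it is evaluated first.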
$ nft add chain filter trace_chain { type filter hook prerouting priority -301\; }
$ nft add rule filter trace_chain meta nftrace set 1
When debugging is finished, simply delete the trace_chain that was created:
% nft delete chain filter trace_chain
Monitoring trace events
Once nftrace is enabled, the generated trace events can be monitored:
#Listen to all events, report in native nft format.
% nft monitor
#Listen to deleted rules, report in JSON format.
% nft -j monitor destroy rules
#Listen to both new and destroyed chains, in native nft format.
% nft monitor chains
#Listen to ruleset events such as table, chain, rule, set, counters and quotas, in native nft format.
% nft monitor ruleset
#Trace incoming packets from host 10.0.0.1.
% nft add rule filter input ip saddr 10.0.0.1 meta nftrace set 1
% nft monitor trace
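Enabling nftables in Kubernetes
In Kubernetes v1.29, nftables is only used when the NFTablesProxyMode feature gate is enabled in kube-proxy and mode is set to nftables. Below is an example that uses kubeadm to bring up a single-node cluster: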
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: node
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: registry.k8s.io
kind: ClusterConfiguration
kubernetesVersion: 1.29.0
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
scheduler: {}
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  NFTablesProxyMode: true
mode: nftables