Abusing Linux’s firewall: the hack that allowed us to build Spectrum (translation)

Today we are introducing Spectrum: a new Cloudflare feature that brings DDoS protection, load balancing, and content acceleration to any TCP-based protocol.

Soon after we started building Spectrum, we hit a major technical obstacle: Spectrum requires us to accept connections on any valid TCP port, from 1 to 65535. On our Linux edge servers it’s impossible to “accept inbound connections on any port number”. This is not a Linux-specific limitation: it’s a characteristic of the BSD sockets API, the basis for network applications on most operating systems. Under the hood there are two overlapping problems that we needed to solve in order to deliver Spectrum:

1. how to accept TCP connections on all port numbers from 1 to 65535

2. how to configure a single Linux server to accept connections on a very large number of IP addresses (we have many thousands of IP addresses in our anycast ranges)

Assigning millions of IPs to a server

Cloudflare’s edge servers have an almost identical configuration. In our early days, we used to assign specific /32 (and /128) IP addresses to the loopback network interface[1]. This worked well when we had dozens of IP addresses, but failed to scale as we grew.

Along came the “AnyIP” trick. AnyIP allows us to assign whole IP prefixes (subnets) to the loopback interface, rather than only individual IP addresses. AnyIP is already in common use: your computer has 127.0.0.0/8 assigned to the loopback interface. From the point of view of your computer, all IP addresses from 127.0.0.1 to 127.255.255.254 belong to the local machine.

This trick is applicable to more than the 127.0.0.0/8 block. To treat the whole range of 192.0.2.0/24 as assigned locally, run:

ip route add local 192.0.2.0/24 dev lo

Following this, you can bind to port 8080 on one of these IP addresses just fine:

nc -l 192.0.2.1 8080

Getting IPv6 to work is a bit harder:

ip route add local 2001:db8::/64 dev lo

 

Sadly, you can’t just bind to these attached v6 IP addresses like in the v4 example. To get this working you must use the IP_FREEBIND socket option, which requires elevated privileges. For completeness, there is also a sysctl net.ipv6.ip_nonlocal_bind, but we don’t recommend touching it.
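
As a rough illustration of the socket-option route, here is a Python sketch. It rests on several assumptions that are not from the original post: the constant 15 is the Linux value of IP_FREEBIND from `<linux/in.h>`, the option is set at the IP level on an IPv6 socket, and the helper name `freebind_listener` is invented. Any failure (no IPv6 support, missing privileges, port in use) is simply reported as a failed bind.

```python
import socket

IP_FREEBIND = 15  # assumed Linux value, from <linux/in.h>

def freebind_listener(addr, port):
    """Try to listen on an IPv6 address that no interface carries explicitly."""
    try:
        s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return None  # this host has no IPv6 socket support
    try:
        # Allow bind() to an address the kernel does not list as configured.
        s.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
        s.bind((addr, port))
        s.listen(16)
        return s
    except OSError:
        s.close()
        return None

listener = freebind_listener("2001:db8::1", 8080)
print("listening" if listener else "bind failed")
if listener:
    listener.close()
```

The address and port mirror the v4 example; for traffic to actually arrive, the `ip route add local` command shown earlier still has to be in place.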

This AnyIP trick allows us to have millions of IP addresses assigned locally to each server:

$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
    inet 1.1.1.0/24 scope global lo
       valid_lft forever preferred_lft forever
    inet 104.16.0.0/16 scope global lo
       valid_lft forever preferred_lft forever
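
To get a feel for what a prefix-sized assignment means, here is a minimal Python sketch using the standard `ipaddress` module. It only does arithmetic on the prefixes quoted in this section (127.0.0.0/8 and the two prefixes in the sample output); it does not read or change the routing table, and the helper name `is_local` is made up for illustration.

```python
import ipaddress

# Prefixes taken from the examples in the text.
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")
ASSIGNED = [
    ipaddress.ip_network("1.1.1.0/24"),
    ipaddress.ip_network("104.16.0.0/16"),
]

def is_local(addr, prefixes):
    """Return True if addr falls inside any of the given prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in prefixes)

# Every address inside an assigned prefix counts as local.
print(is_local("127.0.0.1", [LOOPBACK]))
print(is_local("127.255.255.254", [LOOPBACK]))
print(is_local("104.16.200.7", ASSIGNED))
print(is_local("192.0.2.1", ASSIGNED))  # not covered by these two prefixes

# Number of addresses covered by the two sample prefixes.
total = sum(net.num_addresses for net in ASSIGNED)
print(total)
```

Two prefixes already cover 65,792 addresses, which is why assigning prefixes scales where individual /32 entries did not.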

Binding to ALL ports

The second major issue is the ability to open TCP sockets for any port number. In Linux, and generally in any system supporting the BSD sockets API, you can only bind to a specific TCP port number with a single bind system call. It’s not possible to bind to multiple ports in a single operation.

A naive solution would be to bind 65535 times, once for each of the 65535 possible ports. Indeed, this could have been an option, but with terrible consequences:

The revenge of the listening sockets

Internally, the Linux kernel stores listening sockets in a hash table indexed by port numbers, LHTABLE, using exactly 32 buckets:

/* Yes, really, this is all you need. */
#define INET_LHTABLE_SIZE       32

Had we opened 65k ports, lookups in this table would slow down drastically: each hash table bucket would contain roughly two thousand items (65535 / 32 ≈ 2048).
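
A back-of-the-envelope check of that figure, as a Python sketch. It assumes, as a simplification, that a listening socket's bucket is just the port number modulo the table size; the real kernel hash may mix in additional values, which would move sockets between buckets without changing how many land in each.

```python
from collections import Counter

INET_LHTABLE_SIZE = 32  # value quoted from the kernel source above

def bucket_for(port, size=INET_LHTABLE_SIZE):
    # Simplifying assumption: bucket = port modulo table size.
    return port % size

# One listening socket per port, 1 through 65535.
counts = Counter(bucket_for(port) for port in range(1, 65536))

# Number of buckets used, then the shortest and longest chain.
print(len(counts), min(counts.values()), max(counts.values()))
```

Under that assumption every one of the 32 buckets ends up holding 2047 or 2048 sockets, so a lookup has to walk a chain of about two thousand entries.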

Another way to solve our problem would be to use iptables’ rich NAT features: we could rewrite the destination of inbound packets to some specific address/port, and our application would bind to that.

We didn’t want to do this though, since it requires enabling the iptables conntrack module. Historically we found some performance edge cases, and conntrack cannot cope with some of the large DDoS attacks that we encounter.

Additionally, with the NAT approach we would lose destination IP address information. To remediate this, there exists a little-known SO_ORIGINAL_DST socket option, but the code doesn’t look encouraging.
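
For reference, this is roughly what recovering the pre-NAT destination looks like from userspace. The Python sketch below rests on assumptions: the option value 80 is taken from `<linux/netfilter_ipv4.h>`, the kernel is assumed to return an IPv4 `sockaddr_in`, and both helper names are invented here. Only the parsing step is exercised; `original_dst` needs a real NAT-redirected connection.

```python
import socket
import struct

SO_ORIGINAL_DST = 80  # assumed value, from <linux/netfilter_ipv4.h>

def parse_sockaddr_in(raw):
    """Decode a 16-byte struct sockaddr_in into (ip, port)."""
    # Layout: 2 bytes family (skipped), 2 bytes port in network order,
    # 4 bytes IPv4 address, 8 bytes padding.
    port, packed_ip = struct.unpack("!2xH4s8x", raw[:16])
    return socket.inet_ntoa(packed_ip), port

def original_dst(conn):
    """Ask the kernel where a NAT-redirected connection was originally headed."""
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)

# Parsing demo on a hand-built sockaddr_in for 192.0.2.1:9999.
sample = struct.pack("=H", socket.AF_INET) + struct.pack(
    "!H4s8x", 9999, socket.inet_aton("192.0.2.1")
)
print(parse_sockaddr_in(sample))
```

Every accepted connection would need this extra lookup, on top of keeping conntrack enabled.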

Fortunately, there is a way to achieve our goals that does not involve binding to all 65k ports or using conntrack.

Firewall to the rescue

Before we go any further, let’s revisit the general flow of network packets in an operating system.

Commonly, there are two distinct layers in the inbound packet path:

1. IP firewall

2. network stack

These are conceptually distinct. The IP firewall is usually a stateless piece of software (let’s ignore conntrack and IP fragment reassembly for now). The firewall analyzes IP packets and decides whether to ACCEPT or DROP them. Please note: at this layer we are talking about packets and port numbers – not applications or sockets.

Then there is the network stack. This beast maintains plenty of state. Its main task is to dispatch inbound IP packets into sockets, which are then handled by userspace applications. The network stack manages abstractions which are shared with userspace. It reassembles TCP flows, deals with routing, and knows which IP addresses are local.

The magic dust

At some point we stumbled upon the TPROXY iptables module. The official documentation is easy to overlook:

TPROXY
    This target is only valid in the mangle table, in the
    PREROUTING chain and user-defined chains which are only
    called from this chain.  It redirects the packet to a local
    socket without changing the packet header in any way. It can
    also change the mark value which can then be used in
    advanced routing rules.

Another piece of documentation can be found in the kernel:

docs/networking/tproxy.txt

 

The more we thought about it, the more curious we became…

So… What does TPROXY actually do?

Revealing the magic trick

The TPROXY code is surprisingly trivial:

case NFT_LOOKUP_LISTENER:
        sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
                                  ip_hdrlen(skb) +
                                    __tcp_hdrlen(tcph),
                                  saddr, sport,
                                  daddr, dport,
                                  in->ifindex, 0);

Let me read this out loud for you: in an iptables module, which is part of the firewall, we call inet_lookup_listener. This function takes a src/dst port/IP 4-tuple, and returns the listening socket that is able to accept that connection. This is a core functionality of the network stack’s socket dispatch.

Once again: firewall code calls a socket dispatch routine.

Later on TPROXY actually does the socket dispatch:

skb->sk = sk;

This line assigns a socket (a struct sock) to an inbound packet, completing the dispatch.
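
The effect can be modeled in a few lines of Python. This is a toy, not kernel code: the `Packet` class, the `listeners` dictionary and both function names are invented for illustration, and the lookup is reduced to an exact (IP, port) match. What it aims to show is that a TPROXY-style rule picks the listening socket itself while leaving the packet's destination untouched.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    saddr: str
    sport: int
    daddr: str
    dport: int
    sk: Optional[str] = None  # stands in for skb->sk

# Toy "listening hash": a single server bound to 127.0.0.1:1234.
listeners = {("127.0.0.1", 1234): "proxy-listener"}

def stack_dispatch(pkt):
    """Normal path: find a listener matching the packet's own destination,
    unless an earlier stage already attached a socket."""
    if pkt.sk is None:
        pkt.sk = listeners.get((pkt.daddr, pkt.dport))
    return pkt.sk

def tproxy_rule(pkt, on_ip, on_port):
    """Firewall path: look up the listener at on_ip:on_port and attach it,
    without rewriting daddr or dport in the packet."""
    pkt.sk = listeners.get((on_ip, on_port))
    return pkt

plain = Packet("127.0.0.1", 60036, "192.0.2.1", 9999)
print(stack_dispatch(plain))  # nobody listens on 192.0.2.1:9999

marked = tproxy_rule(Packet("127.0.0.1", 60036, "192.0.2.1", 9999),
                     "127.0.0.1", 1234)
print(stack_dispatch(marked), marked.daddr, marked.dport)
```

In the model, the unmarked packet finds no socket, while the marked one reaches the listener and still carries 192.0.2.1:9999 as its destination.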

Pulling the rabbit from the hat

Armed with TPROXY, we can perform the bind-to-all-ports trick very easily. Here’s the configuration:

# Set 192.0.2.0/24 to be routed locally with AnyIP.
# Make it explicit that the source IP used for this network
# when connecting locally should be in 127.0.0.0/8 range.
# This is needed since otherwise the TPROXY rule would match
# both forward and backward traffic. We want it to catch
# forward traffic only.
sudo ip route add local 192.0.2.0/24 dev lo src 127.0.0.1

# Set the magical TPROXY routing
sudo iptables -t mangle -I PREROUTING \
        -d 192.0.2.0/24 -p tcp \
        -j TPROXY --on-port=1234 --on-ip=127.0.0.1

In addition to setting this in place, you need to start a TCP server with the magical IP_TRANSPARENT socket option. Our example below needs to listen on tcp://127.0.0.1:1234. The man page for IP_TRANSPARENT shows:

IP_TRANSPARENT (since Linux 2.6.24)
    Setting this boolean option enables transparent proxying on
    this socket.  This socket option allows the calling application
    to bind to a nonlocal IP address and operate both as a
    client and a server with the foreign address as the local
    endpoint.  NOTE: this requires that routing be set up in
    a way that packets going to the foreign address are routed
    through the TProxy box (i.e., the system hosting the
    application that employs the IP_TRANSPARENT socket option).
    Enabling this socket option requires superuser privileges
    (the CAP_NET_ADMIN capability).

    TProxy redirection with the iptables TPROXY target also
    requires that this option be set on the redirected socket.
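
A quick way to see the privilege requirement in practice is to try setting the option and watch for an error. The Python sketch below assumes Linux, where IP_TRANSPARENT has the value 19 in `<linux/in.h>`; the function name `can_set_transparent` is made up for this example.

```python
import socket

IP_TRANSPARENT = 19  # Linux value from <linux/in.h>

def can_set_transparent():
    """Return True if this process may enable IP_TRANSPARENT on a TCP socket."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
        return True
    except OSError:
        # Typically a permission error when the capability is missing.
        return False

print("IP_TRANSPARENT allowed:", can_set_transparent())
```

Run as an ordinary user this would normally report that the option is not allowed, which is why the server in this post is started with sudo.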

Here’s a simple Python server:

import socket

# Value of IP_TRANSPARENT from <linux/in.h>.
IP_TRANSPARENT = 19

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Needs elevated privileges, hence the sudo when running the script.
s.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)

s.bind(('127.0.0.1', 1234))
s.listen(32)
print("[+] Bound to tcp://127.0.0.1:1234")
while True:
    c, (r_ip, r_port) = s.accept()
    # The local address of the accepted socket is the one the client dialed.
    l_ip, l_port = c.getsockname()
    print("[ ] Connection from tcp://%s:%d to tcp://%s:%d" % (r_ip, r_port, l_ip, l_port))
    c.send(b"hello world\n")
    c.close()

After running the server you can connect to it from arbitrary IP addresses:

$ nc -v 192.0.2.1 9999
Connection to 192.0.2.1 9999 port [tcp/*] succeeded!
hello world

Most importantly, the server will report the connection indeed was directed to 192.0.2.1 port 9999, even though nobody actually listens on that IP address and port:

$ sudo python3 transparent2.py
[+] Bound to tcp://127.0.0.1:1234
[ ] Connection from tcp://127.0.0.1:60036 to tcp://192.0.2.1:9999

Tada! This is how to bind to any port on Linux, without using conntrack.

That’s all folks

In this post we described how to use an obscure iptables module, originally designed to help transparent proxying, for something slightly different. With its help we can perform things we thought impossible using the standard BSD sockets API, avoiding the need for any custom kernel patches.

The TPROXY module is very unusual – in the context of the Linux firewall it performs things typically done by the Linux network stack. The official documentation is rather lacking, and I don’t believe many Linux users understand the full power of this module.

It’s fair to say that TPROXY allows our Spectrum product to run smoothly on the vanilla kernel. It’s yet another reminder of how important it is to try to understand iptables and the network stack!

[1] Assigning IP addresses to the loopback interface, together with appropriate rp_filter and BGP configuration, allows us to handle arbitrary IP ranges on our edge servers.

Original article URL: https://blog.cloudflare.com/how-we-built-spectrum/#fnref1
