Abusing Linux’s firewall: the hack that allowed us to build Spectrum (translation)



Today we are introducing Spectrum: a new Cloudflare feature that brings DDoS protection, load balancing, and content acceleration to any TCP-based protocol.


Soon after we started building Spectrum, we hit a major technical obstacle: Spectrum requires us to accept connections on any valid TCP port, from 1 to 65535. On our Linux edge servers it’s impossible to “accept inbound connections on any port number”. This is not a Linux-specific limitation: it’s a characteristic of the BSD sockets API, the basis for network applications on most operating systems. Under the hood there are two overlapping problems that we needed to solve in order to deliver Spectrum:



how to accept TCP connections on all port numbers from 1 to 65535


how to configure a single Linux server to accept connections on a very large number of IP addresses (we have many thousands of IP addresses in our anycast ranges)



Assigning millions of IPs to a server


Cloudflare’s edge servers have an almost identical configuration. In our early days, we used to assign specific /32 (and /128) IP addresses to the loopback network interface[1]. This worked well when we had dozens of IP addresses, but failed to scale as we grew.



Along came the “AnyIP” trick. AnyIP allows us to assign whole IP prefixes (subnets) to the loopback interface, rather than individual IP addresses. AnyIP is already in common use: your computer has 127.0.0.0/8 assigned to the loopback interface. From the point of view of your computer, every IP address in that block belongs to the local machine.
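To make the size of the loopback block concrete, here is a small sketch using Python's standard ipaddress module. The 127.0.0.0/8 prefix is the conventional IPv4 loopback block; the sample addresses passed to the helper are arbitrary picks for illustration, and the helper name is our own.

```python
import ipaddress

# The conventional IPv4 loopback block.
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")


def is_covered(addr, prefix=LOOPBACK):
    """Return True if addr falls inside prefix."""
    return ipaddress.ip_address(addr) in prefix


if __name__ == "__main__":
    print(LOOPBACK.num_addresses)    # how many addresses one /8 prefix covers
    print(is_covered("127.1.2.3"))   # an arbitrary address inside the block
    print(is_covered("192.0.2.1"))   # an arbitrary address outside the block
```

A single prefix on the loopback interface therefore stands in for millions of individual address assignments.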



This trick is applicable to more than the 127.0.0.0/8 block. To treat a whole range as assigned locally, run:



ip route add local dev lo

Following this, you can bind to port 8080 on one of these IP addresses just fine:


nc -l 8080

Getting IPv6 to work is a bit harder:


ip route add local 2001:db8::/64 dev lo


Sadly, you can’t just bind to these attached v6 IP addresses like in the v4 example. To get this working you must use the IP_FREEBIND socket option, which requires elevated privileges. For completeness, there is also a sysctl net.ipv6.ip_nonlocal_bind, but we don’t recommend touching it.
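As a rough sketch of what using IP_FREEBIND looks like from an application, the helper below sets the option before calling bind. It is Linux-only. The numeric fallback 15 is the value of IP_FREEBIND in the Linux headers, used in case the Python build does not export the constant. The function name is our own, and we assume the IPv4-level option is honoured on an IPv6 socket, as the text implies. Whether the call needs extra privileges depends on your kernel and configuration.

```python
import socket

# Linux value of IP_FREEBIND from <linux/in.h>; fallback for Python builds
# that do not export the constant.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def bind_with_freebind(addr, port, family=socket.AF_INET6):
    """Create a TCP socket, enable IP_FREEBIND, then bind it to addr:port."""
    s = socket.socket(family, socket.SOCK_STREAM)
    try:
        # With freebind set, bind() may use an address that is not
        # configured on any local interface.
        s.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
        s.bind((addr, port))
    except OSError:
        s.close()
        raise
    return s
```

With the 2001:db8::/64 route from above in place, something like bind_with_freebind("2001:db8::1", 8080) would be the IPv6 counterpart of the v4 nc example.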



This AnyIP trick allows us to have millions of IP addresses assigned locally to each server:


$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
    inet scope global lo
       valid_lft forever preferred_lft forever
    inet scope global lo
       valid_lft forever preferred_lft forever

Binding to ALL ports


The second major issue is the ability to open TCP sockets for any port number. In Linux, and generally in any system supporting the BSD sockets API, you can only bind to a specific TCP port number with a single bind system call. It’s not possible to bind to multiple ports in a single operation.



A naive solution would be to bind 65535 times, once for each of the 65535 possible ports. Indeed, this could have been an option, but with terrible consequences:
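For illustration, the naive approach amounts to a loop like the one below: one socket and one bind call per port. This is only a sketch. The helper name and the loopback default are our own choices, and nobody should run it over the full port range.

```python
import socket


def listen_on_ports(ports, host="127.0.0.1"):
    """Open one listening TCP socket per port; each bind() takes one port."""
    sockets = []
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((host, port))   # port 0 asks the kernel for any free port
        s.listen(16)
        sockets.append(s)
    return sockets
```

Calling listen_on_ports(range(1, 65536)) would create 65535 listening sockets, which is exactly the situation described next.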



The revenge of the listening sockets


Internally, the Linux kernel stores listening sockets in a hash table indexed by port numbers, LHTABLE, using exactly 32 buckets:



/* Yes, really, this is all you need. */
#define INET_LHTABLE_SIZE       32


Had we opened 65k ports, lookups in this table would slow down drastically: each of the 32 hash table buckets would contain about two thousand items (65535 / 32 ≈ 2048).
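The bucket arithmetic can be checked with a few lines of Python. The sketch assumes the bucket index is simply the port number modulo the table size. That is a simplification of whatever the kernel's real hash function does, but it is good enough to see the scale of the problem.

```python
from collections import Counter

INET_LHTABLE_SIZE = 32  # value quoted from the kernel source above


def bucket_for(port):
    # Assumption: bucket index = port modulo table size (a simplification).
    return port & (INET_LHTABLE_SIZE - 1)


def bucket_sizes(ports):
    """Count how many listening ports land in each bucket."""
    return Counter(bucket_for(p) for p in ports)


sizes = bucket_sizes(range(1, 65536))
print(len(sizes), min(sizes.values()), max(sizes.values()))
```

Every inbound connection would then walk a chain of roughly two thousand sockets to find its listener.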



Another way to solve our problem would be to use iptables’ rich NAT features: we could rewrite the destination of inbound packets to some specific address/port, and our application would bind to that.



We didn’t want to do this though, since it requires enabling the iptables conntrack module. Historically we found some performance edge cases, and conntrack cannot cope with some of the large DDoS attacks that we encounter.


Additionally, with the NAT approach we would lose destination IP address information. To remediate this, there exists a poorly known SO_ORIGINAL_DST socket option, but the code doesn’t look encouraging.
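For reference, this is roughly what recovering the pre-NAT destination looks like from userspace. We did not go this way, so treat it as a sketch: the value 80 for SO_ORIGINAL_DST comes from the Linux netfilter headers, the call only works on a connection that conntrack has NATed, and the helper names are our own.

```python
import socket
import struct

SO_ORIGINAL_DST = 80  # from <linux/netfilter_ipv4.h>


def parse_sockaddr_in(raw):
    """Decode a struct sockaddr_in into an (ip, port) pair."""
    # Layout: 2-byte family, 2-byte port and 4-byte address in network
    # byte order, then 8 bytes of padding that we ignore.
    port, packed_ip = struct.unpack_from("!2xH4s", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(conn):
    """Ask the kernel for the pre-NAT destination of an accepted socket."""
    raw = conn.getsockopt(socket.IPPROTO_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```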



Fortunately, there is a way to achieve our goals that involves neither binding to all 65k ports nor using conntrack.



Firewall to the rescue


Before we go any further, let’s revisit the general flow of network packets in an operating system.

Commonly, there are two distinct layers in the inbound packet path:




IP firewall


network stack


These are conceptually distinct. The IP firewall is usually a stateless piece of software (let’s ignore conntrack and IP fragment reassembly for now). The firewall analyzes IP packets and decides whether to ACCEPT or DROP them. Please note: at this layer we are talking about packets and port numbers – not applications or sockets.



Then there is the network stack. This beast maintains plenty of state. Its main task is to dispatch inbound IP packets into sockets, which are then handled by userspace applications. The network stack manages abstractions which are shared with userspace. It reassembles TCP flows, deals with routing, and knows which IP addresses are local.



The magic dust



At some point we stumbled upon the TPROXY iptables module. The official documentation is easy to overlook:




This target is only valid in the mangle table, in the
PREROUTING chain and user-defined chains which are only
called from this chain.  It redirects the packet to a local
socket without changing the packet header in any way. It can
also change the mark value which can then be used in
advanced routing rules.

Another piece of documentation can be found in the kernel: Documentation/networking/tproxy.txt






The more we thought about it, the more curious we became…



So… What does TPROXY actually do?



Revealing the magic trick



The TPROXY code is surprisingly trivial:




sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
                          ip_hdrlen(skb) +
                            __tcp_hdrlen(tcph),
                          saddr, sport,
                          daddr, dport,
                          in->ifindex, 0);

Let me read this out loud for you: in an iptables module, which is part of the firewall, we call inet_lookup_listener. This function takes a src/dst port/IP 4-tuple, and returns the listening socket that is able to accept that connection. This is a core functionality of the network stack’s socket dispatch.


Once again: firewall code calls a socket dispatch routine.


Later on TPROXY actually does the socket dispatch:


skb->sk = sk;

This line assigns a socket struct sock to an inbound packet – completing the dispatch.
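A toy model may help to see why this matters. The sketch below is not kernel code: the listener table is a plain dictionary, the names are our own, and the addresses are illustrative. It only mirrors the two steps just described, a listener lookup followed by attaching the result to the packet.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]


@dataclass
class Packet:
    dst: Addr                   # destination IP and port from the header
    sk: Optional[str] = None    # socket attached to the packet, if any


def lookup_listener(listeners: Dict[Addr, str], addr: Addr) -> Optional[str]:
    """Toy stand-in for inet_lookup_listener: exact match on (ip, port)."""
    return listeners.get(addr)


def tproxy(packet: Packet, listeners: Dict[Addr, str], on_addr: Addr) -> Packet:
    """Look up the listener at on_addr and attach it to the packet."""
    # The lookup uses the address from the rule, not the packet's own
    # destination, and the header itself is left untouched.
    packet.sk = lookup_listener(listeners, on_addr)
    return packet
```

A packet addressed to a port nobody listens on still ends up with a socket attached, and its destination fields survive for the application to read later.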



Pulling the rabbit from the hat


Armed with TPROXY, we can perform the bind-to-all-ports trick very easily. Here’s the configuration:


# Set to be routed locally with AnyIP.
# Make it explicit that the source IP used for this network
# when connecting locally should be in range.
# This is needed since otherwise the TPROXY rule would match
# both forward and backward traffic. We want it to catch
# forward traffic only.
sudo ip route add local dev lo src

# Set the magical TPROXY routing
sudo iptables -t mangle -I PREROUTING \
    -d -p tcp \
    -j TPROXY --on-port=1234 --on-ip=
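To show how the pieces of this configuration fit together, here is a small helper that only builds the two command lines and runs nothing. The prefix and addresses passed in at the bottom are hypothetical placeholders chosen for illustration, not values from our setup. Only the port 1234 comes from the rule above.

```python
def tproxy_commands(prefix, src, on_ip, on_port):
    """Return the route and iptables commands as argv lists (nothing is run)."""
    route = ["ip", "route", "add", "local", prefix, "dev", "lo", "src", src]
    rule = [
        "iptables", "-t", "mangle", "-I", "PREROUTING",
        "-d", prefix, "-p", "tcp",
        "-j", "TPROXY", f"--on-port={on_port}", f"--on-ip={on_ip}",
    ]
    return route, rule


# Hypothetical values: a documentation prefix and the loopback address.
route, rule = tproxy_commands("192.0.2.0/24", "127.0.0.1", "127.0.0.1", 1234)
print(" ".join(route))
print(" ".join(rule))
```

The same prefix appears in both commands: the route makes the range local, and the rule steers TCP packets for that range to the one listening socket.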

In addition to setting this in place, you need to start a TCP server with the magical IP_TRANSPARENT socket option. Our example below needs to listen on the port named in the TPROXY rule (1234). The man page for IP_TRANSPARENT shows:



IP_TRANSPARENT (since Linux 2.6.24)
    Setting this boolean option enables transparent proxying on
    this socket.  This socket option allows the calling
    application to bind to a nonlocal IP address and operate both
    as a client and a server with the foreign address as the local
    endpoint.  NOTE: this requires that routing be set up in
    a way that packets going to the foreign address are routed
    through the TProxy box (i.e., the system hosting the
    application that employs the IP_TRANSPARENT socket option).
    Enabling this socket option requires superuser privileges
    (the CAP_NET_ADMIN capability).

    TProxy redirection with the iptables TPROXY target also
    requires that this option be set on the redirected socket.

Here’s a simple Python server:


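import socket

# Linux value of IP_TRANSPARENT from <linux/in.h>.
IP_TRANSPARENT = 19

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Setting IP_TRANSPARENT requires CAP_NET_ADMIN, so run the script as root.
s.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)

# Port 1234 matches --on-port in the TPROXY rule.
s.bind(('', 1234))
s.listen(32)
print("[+] Bound to tcp://%s:%d" % s.getsockname())
while True:
    c, (r_ip, r_port) = s.accept()
    # getsockname() returns the address and port the client actually dialed.
    l_ip, l_port = c.getsockname()
    print("[ ] Connection from tcp://%s:%d to tcp://%s:%d" % (r_ip, r_port, l_ip, l_port))
    c.send(b"hello world\n")
    c.close()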


After running the server you can connect to it from arbitrary IP addresses:



$ nc -v 9999
Connection to 9999 port [tcp/*] succeeded!
hello world


Most importantly, the server will report the connection indeed was directed to port 9999, even though nobody actually listens on that IP address and port:



$ sudo python3 transparent2.py
[+] Bound to tcp://
[ ] Connection from tcp:// to tcp://
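The server learns the original destination from getsockname() on the accepted socket. The sketch below shows that mechanism in isolation. It uses a plain wildcard listener on an ephemeral port and involves no TPROXY at all, so it runs without any special setup. The helper name is our own.

```python
import socket


def observed_destination(dest_ip):
    """Connect to dest_ip and return the local address the server sees."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("0.0.0.0", 0))      # wildcard address, ephemeral port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.create_connection((dest_ip, port), timeout=5)
    conn, _ = server.accept()
    local = conn.getsockname()       # the address the client actually dialed
    for s in (conn, client, server):
        s.close()
    return local, port
```

On Linux you can pass any address from the loopback block, for example 127.9.9.9, and the accepted socket reports exactly that address. With TPROXY in the picture, the same call reports addresses and ports that the listening socket was never bound to.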

Tada! This is how to bind to any port on Linux, without using conntrack.



That’s all folks

In this post we described how to use an obscure iptables module, originally designed to help transparent proxying, for something slightly different. With its help we can perform things we thought impossible using the standard BSD sockets API, avoiding the need for any custom kernel patches.



The TPROXY module is very unusual – in the context of the Linux firewall it performs things typically done by the Linux network stack. The official documentation is rather lacking, and I don’t believe many Linux users understand the full power of this module.



It’s fair to say that TPROXY allows our Spectrum product to run smoothly on the vanilla kernel. It’s yet another reminder of how important it is to try to understand iptables and the network stack!



[1] Assigning IP addresses to the loopback interface, together with appropriate rp_filter and BGP configuration, allows us to handle arbitrary IP ranges on our edge servers.



Original URL: https://blog.cloudflare.com/how-we-built-spectrum/#fnref1
