This article accompanies Go issue #10948 (net: use splice for TCPConn.ReadFrom on Linux) and CL 107715, and aims to provide context, insight into implementation decisions, performance measurements, and ideas for future work.
splice(2) moves data between file descriptors, without copying between kernel and user address space. A bird’s-eye view of the concept is that splice is akin to “read from a file descriptor to a kernel buffer” or “write from a kernel buffer to a file descriptor”. The buffer is controlled by the user, and is, in fact, a plain UNIX pipe. At least one of the file descriptors passed to splice must refer to a pipe.
One might ask “Why is the pipe important? Why can’t splice transfer data between arbitrary file descriptors directly?”. The short answer to this question is that having an intermediary place to store the data is very useful, for various reasons. We will not pursue this line of questioning further in this article, because there is much to talk about. Instead, the interested reader is encouraged to read an e-mail thread where Linus Torvalds himself provides excellent insight on the matter, and explains the general concepts behind splice and its friend tee(2) very clearly.
If splice is indeed a general data transfer function, then why does issue #10948 only mention TCPConn? The next section answers this question.
Per rsc’s comment in the original issue, the objective is for the standard library support for splice to be transparent, like the existing sendfile optimization. The change must not introduce any new API, and callers – who typically use io.Copy to move data around – should benefit from it transparently. Package os and package net should remain as portable as possible, without introducing new API for OS-specific optimizations.
splice requires help from the Go network poller. The poller is implemented in the runtime, and is exposed to standard library code through package internal/poll. All standard library types which wrap a file descriptor hold a *poll.FD. Code outside the standard library cannot make use of the poller, at least not directly. While splice is a very general data transfer function, it operates on raw file descriptors, which are not exposed by the Go standard library directly.
The typical function used by code that wants to transfer data from an io.Reader to an io.Writer without looking at it is io.Copy. io.Copy investigates its io.Reader and io.Writer arguments for io.WriterTo and io.ReaderFrom specializations respectively, and uses the specialized code, if possible. Otherwise, it falls back to a generic read-write loop.
For example, if conn is a *net.TCPConn and f is an *os.File, a call to io.Copy(conn, f) results in a call to conn.ReadFrom(f). ReadFrom checks if the io.Reader argument is an *os.File and uses poll.SendFile if it is. Usage of splice in the standard library must follow a similar pattern.
Similarly to how io.Copy takes the liberty of allocating a buffer for the data transfer if necessary, uses of splice on code paths known to be optimizable could take the liberty of creating a pipe scoped to the data transfer.
Let’s investigate candidates for file descriptors we could splice to, under the condition that no new API may be introduced:

- *os.File doesn’t have a ReadFrom, and both files and pipes are represented by *os.File, so splicing to a file or a pipe is not possible.
- *net.UDPConn has a ReadFrom, but that ReadFrom is net.PacketConn’s ReadFrom(b []byte) (int, net.Addr, error), not io.ReaderFrom’s. The signature doesn’t match, and cannot be changed.
- *net.TCPConn has the right ReadFrom and is thus the only real candidate for the write side.
The entry point for all splice optimizations must be (*net.TCPConn).ReadFrom. We have our destination file descriptor. Next, we investigate possible source file descriptors to represent the io.Reader passed to ReadFrom.
If the reader is an *os.File, we are on the sendfile code path… but not quite. The file could represent the read half of a pipe, which we could splice to the connection directly. The sendfile code path calls Fd() on the file to get at the file descriptor, which is set to blocking mode as a side effect. sendfile can get away with this, because disk files are always ready from the perspective of the poller. The splice code is not so lucky, because pipes must be polled for readiness. After calling Fd() on the file, attempting to register the returned file descriptor with the network poller is not possible: there is no way package net can get at the *poll.FD inside the *os.File. Only very intrusive solutions which involve duping the pipe file descriptor come to mind. Unfortunate.
*net.UnixConn seems like a good candidate, and the initial implementation enabled splice for this case, but benchmarks proved to be inconclusive. It turns out UNIX sockets are quite fast as-is.
We are left with only one clear candidate: TCP socket to TCP socket transfers, using a temporary pipe. To perform a copy from a TCP connection to another, package
net takes care to honor
io.LimitedReader, unwraps the source and destination connections to get at their
*poll.FDs, then transfers control to the new
poll.Splice function, which performs most of the work.
The reader is encouraged to refer to the source code for
poll.Splice for the remainder of this section.
poll.Splice creates a temporary pipe, locks the source and destination file descriptors, and prepares the file descriptors for a new round of polling. Then, it alternates between splicing from the source socket to the pipe, and from the pipe to the destination socket. This alternation deserves some extra attention.
To move data from the source socket to the pipe, Splice calls spliceDrain. This is the equivalent of the Read part of an io.Copy in userspace. Conversely, splicePump is the equivalent of the Write part. Note how spliceDrain only attempts a single splice into the pipe, whereas splicePump loops until it has spliced all the data out of the pipe.
This asymmetry is intentional. If Splice simply alternated reads and writes, the pipe could fill up, if the source socket outpaced the destination socket. This would be problematic for multiple reasons.
First, if a splice from the source socket to the pipe failed with EAGAIN, Splice would need to determine if the cause was the source socket not being ready for reading, or the pipe being full. This would complicate the implementation slightly.
Second, consider the following situation: At some point in the data transfer, a splice from the pipe to the destination socket is short, and leaves some data in the pipe. The next attempted splice is from the source socket to the pipe. The source socket times out, but it takes 30 seconds for that to happen. In the meantime, the old data is still stuck in the pipe, and the destination socket can only receive it when the source socket eventually times out. To mitigate this,
Splice would need to add timeouts to the poller events it waits for, and implement some form of flow control.
All of this would complicate
Splice, for no tangible gain. Therefore, the
Splice implementation remains simple, and mirrors the behavior of
io.Copy by ensuring that all the data read from the source socket is written to the destination socket, before new data is read again.
A final implementation detail is the intentional omission of the SPLICE_F_MORE flag from calls to splice. SPLICE_F_MORE acts like TCP_CORK, and is not an opinion standard library code should impose upon callers, who should not wake up to 200ms of unexpected latency introduced by io.Copy when they upgrade Go versions. If the TCP_CORK behavior is desirable, callers can cork the TCP connection themselves before initiating the copy, using syscall.RawConn, or a specialized package like mikioh/tcp.
We move on to a brief performance analysis of the new code.
CL 107715 includes a set of benchmarks, but they are superficial at best: the code is probably only hitting the loopback interface, and it doesn’t quite measure the most important impact of the change. In an attempt to improve on this state of affairs, I conducted a more concrete performance test.
Three m5.large nodes were used. One minute’s worth of network traffic was moved from one node to another, using the third node as a TCP proxy. The new splice-optimized code was tested against the old code. CPU profiles and execution traces were recorded for both runs.
Both proxy servers were able to saturate 10Gbps links, but the
splice-optimized server spent much less CPU time in doing so.
To begin with, basic timing measurements give

real 1m1.444s user 0m13.875s sys 0m43.481s

for the default server, versus

real 1m1.211s user 0m8.391s sys 0m28.498s

for the splice-optimized server.
Both proxy servers moved ~70GiB of data in this time, but the execution traces tell an interesting tale. A 2ms window into the execution trace of the unoptimized server looks like this:
On the other hand, for the
splice-optimized server, a similar 2ms window looks like this:
Because the speed of the data transfer is bound by the speed of the network in both cases, roughly the same work is performed in both windows. However, the
splice-optimized server spends significantly less CPU time doing so. There are significant gaps between sections of activity throughout the execution trace of the
splice-optimized server. The non-optimized server trace also shows gaps, but they are significantly more narrow: the server is busier overall.
Compared to the un-optimized server, the
splice-enabled server seems to be able to move very large chunks of data at a time. The default buffer size for
io.Copy is 32 KiB. Copies through userspace are strictly bound by the size of the buffer. On the other hand, the default buffer size for a Linux pipe is 64 KiB. To make the test more fair, perhaps
io.CopyBuffer with a 64 KiB buffer could have been used, but even that fails to make the comparison more reasonable.
For reasons I do not understand, the kernel was willing to move chunks larger than the presumed pipe buffer size in a single call to
splice. Perhaps this stems from the fact that the data pages originated in the kernel, and it moved them to the pipe using a simple reference count increase, instead of copying data from userspace to pages allocated for the pipe. The exact process is still unclear to me, because I do not understand the inner workings of the Linux kernel very well. Clarifications would be appreciated.
CPU profiles (tip and 1.10) were also taken, but they don’t tell us anything we haven’t been able to discern from the execution traces already. That being said, one of them certainly looks cleaner than the other!
In conclusion, the optimization is not as much about raw speed, as it is about CPU time. Proxy servers and load balancers in high-concurrency scenarios are almost certain to benefit greatly from it.
Given the fact that the bar for new API added to the standard library is pretty high, we are unlikely to see splice used in many more places than where it is used now.
That being said, a fun experiment would be to try to make use of kernel TLS support with crypto/tls (which has been prototyped already) and enable splice for TLS connections too, probably by adding a ReadFrom method to *tls.Conn. One can imagine using this to arrive at an L7 proxy or load balancer that moves request and response bodies between upstream and downstream without copying them to userspace.