On Thursday 10 June 2010, Andrei Pelinescu-Onciul wrote:
Is the performance gain really worth the effort? I wonder why constructing IP/UDP headers in sr would be faster than having the kernel do it.
It's a workaround for performance/scalability problems on bigger multi-cpu/core machines running linux. It has nothing to do with the construction of the headers, but with locking inside the kernel when sending on the same udp socket (or raw socket w/o IP_HDRINCL). If you try to send on the same socket at the same time from multiple cores, you'll hit this problem. Some of us have seen symptoms which I believe are related to it. On an 8 cpu machine running an older kernel (IIRC 2.6.22), I got between 18%-28% improvement in _stateless_ forwarding just by distributing the traffic over 8 different sockets instead of one. I believe it would be even better with the raw socket support, but we'll see once the code is ready for testing.
Hello Andrei,
the statistics look promising, thanks. It would indeed be interesting to know how it performs with raw sockets then. I also looked around a bit; apparently this is a known issue for other services as well, e.g. memcached:
"We discovered that under load on Linux, UDP performance was downright horrible. This is caused by considerable lock contention on the UDP socket lock when transmitting through a single socket from multiple threads. Fixing the kernel by breaking up the lock is not easy. Instead, we used separate UDP sockets for transmitting replies (with one of these reply sockets per thread). With this change, we were able to deploy UDP without compromising performance on the backend." (http://www.facebook.com/note.php?note_id=39391378919)
So it might be a good idea to evaluate how big the actual improvement with raw sockets over multiple sockets is, and whether it makes sense from a maintenance POV to go with this solution (not sure how complicated the actual implementation will be..).
Henning