Module: sip-router
Branch: andrei/raw_sock
Commit: 2f0276f711ba7aad9ffa1de0445f75df015abbad
URL:    http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commit;h=2f0276f7...
Author:    Andrei Pelinescu-Onciul <andrei@iptel.org>
Committer: Andrei Pelinescu-Onciul <andrei@iptel.org>
Date:      Tue Jun 8 00:21:08 2010 +0200
core: basic raw socket support functions
Basic support for raw sockets. Functions for creating, sending and receiving udp packets over raw sockets. Initial version supports only linux.
---
 raw_sock.c | 452 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 raw_sock.h |  52 +++++++
 2 files changed, 504 insertions(+), 0 deletions(-)
Diff: http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commitdiff;h=2f02...
Lovely.
Do you plan to export this functionality to the script? e.g. if force_send_socket() does not find the forced UDP socket, it uses a raw socket?
regards klaus
On 08.06.2010 00:24, Andrei Pelinescu-Onciul wrote:
On Jun 08, 2010 at 08:43, Klaus Darilion <klaus.mailinglists@pernau.at> wrote:
Not exactly; I plan to use it as a replacement for UDP sockets (higher send performance on Linux) and, in the future, for handling ICMP. The force_send_socket() stuff is a good idea.
Andrei
On 08.06.2010 09:09, Andrei Pelinescu-Onciul wrote:
I just read about raw sockets (http://sock-raw.org/papers/sock_raw): when sending on raw sockets, there won't be any fragmentation. So that would need to be implemented in sr as well.
Is the performance gain really worth the effort? I wonder why constructing the IP/UDP headers in sr would be faster than having it done in the kernel.
another interesting point is if the raw socket receives the packet before or after iptables filtering.
regards klaus
On Jun 10, 2010 at 10:48, Klaus Darilion <klaus.mailinglists@pernau.at> wrote:
Yes, if one sets IP_HDRINCL (which is what I intend to do).
It's a workaround for performance/scalability problems on bigger multi-cpu/core machines running Linux. It has nothing to do with the construction of the headers, but with locking inside the kernel when sending on the same UDP socket (or raw socket w/o IP_HDRINCL). If you try to send on the same socket at the same time from multiple cores you'll hit this problem. Some of us have seen symptoms which I believe are related to this problem. On an 8 cpu machine running an older kernel (IIRC 2.6.22), I got between 18%-28% improvement in _stateless_ forwarding just by distributing the traffic over 8 different sockets instead of one. I believe it would be even better with the raw socket support, but we'll see when the code is ready for testing.
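[Editorial sketch: the send path described above boils down to roughly the following on Linux. Names and structure are illustrative only, not the actual raw_sock.c code; it assumes IPv4 and a caller that has already built the IP + UDP headers.]

/* Minimal sketch: create a raw IPv4 socket with IP_HDRINCL and send a
 * datagram whose IP + UDP headers the caller built. Needs root or
 * CAP_NET_RAW. Illustrative names, not the actual raw_sock.c code. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

static int raw_udp4_socket(void)
{
	int s;
	int on = 1;

	s = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
	if (s == -1)
		return -1;
	/* IP_HDRINCL: we supply the IPv4 header ourselves; the kernel
	 * still fills in the IP checksum and, if left zero, the packet id */
	if (setsockopt(s, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on)) == -1) {
		close(s);
		return -1;
	}
	return s;
}

/* buf must already contain the IPv4 header + UDP header + payload;
 * the destination in *to is still needed by the kernel for routing */
static int raw_udp4_send(int s, const void* buf, unsigned int len,
				const struct sockaddr_in* to)
{
	return sendto(s, buf, len, 0, (const struct sockaddr*)to, sizeof(*to));
}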
Note that I'm speaking only about the sending part. I've added the receive part just for fun (while I'm at it I'd like to also test receiving on raw sockets, although I don't know if it would add any benefit).
another interesting point is if the raw socket receives the packet before or after iptables filtering.
After NF_INET_LOCAL_IN. For sending, after the raw socket send call it would go through NF_INET_LOCAL_OUT (it's the same as for udp or tcp from the iptables hooks point of view).
Andrei
On Thursday 10 June 2010, Andrei Pelinescu-Onciul wrote:
Hello Andrei,
the statistics look promising, thanks. It would indeed be interesting to know how it performs with raw sockets then. I also looked around a bit; apparently this is also a known problem for other services, e.g. memcached:
"We discovered that under load on Linux, UDP performance was downright horrible. This is caused by considerable lock contention on the UDP socket lock when transmitting through a single socket from multiple threads. Fixing the kernel by breaking up the lock is not easy. Instead, we used separate UDP sockets for transmitting replies (with one of these reply sockets per thread). With this change, we were able to deploy UDP without compromising performance on the backend." (http://www.facebook.com/note.php?note_id=39391378919)
So it might be a good idea to evaluate how big the actual improvement with raw sockets is over multiple sockets, and whether it makes sense from a maintenance POV to go with this solution (not sure how complicated the actual implementation will be...).
Henning
On Jun 11, 2010 at 13:33, Henning Westerholt <henning.westerholt@1und1.de> wrote:
We cannot go with multiple UDP sockets because then we would have multiple source ports, which is something one does not want for a SIP proxy (think NATed UACs). A possible workaround is to use multiple ports and SRV records to balance the traffic over them, but IMHO the raw socket should solve the problem in a simpler way (from the sr.cfg point of view). The only problem is finding the right MTU, but I guess for most setups (that don't use multiple interfaces with different MTUs) an sr.cfg-configurable MTU would do (at least for the initial version).
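[Editorial sketch: with IP_HDRINCL the sender has to fragment datagrams larger than the MTU itself. Below is one way the fragment boundaries could be computed for a given MTU; names are made up and this is not the code Andrei is writing.]

/* Sketch: compute IPv4 fragment boundaries for one UDP datagram
 * (UDP header + payload of udp_len bytes) that, together with its IP
 * header, may exceed the MTU. On the wire the fragment offset field is
 * in 8-byte units, so every fragment but the last must carry a multiple
 * of 8 data bytes; the UDP header ends up only in the first fragment.
 * Illustrative only, not actual sip-router code. */

struct frag {
	unsigned int offset;	/* byte offset into UDP header + payload */
	unsigned int len;	/* data bytes carried by this fragment */
	int more;		/* 1 => set the MF (more fragments) flag */
};

static int split_frags(unsigned int udp_len, unsigned int mtu,
			unsigned int ip_hdr_len /* usually 20 */,
			struct frag* frags, int max_frags)
{
	unsigned int max_payload, off;
	int n;

	if (mtu <= ip_hdr_len)
		return -1;
	/* per-fragment data: MTU minus IP header, rounded down to 8 */
	max_payload = (mtu - ip_hdr_len) & ~7u;
	if (max_payload == 0)
		return -1;
	for (off = 0, n = 0; off < udp_len && n < max_frags; n++) {
		frags[n].offset = off;
		frags[n].len = (udp_len - off > max_payload) ?
					max_payload : (udp_len - off);
		off += frags[n].len;
		frags[n].more = (off < udp_len);
	}
	return (off < udp_len) ? -1 : n;	/* -1: frags[] too small */
}

(For the common 1500-byte MTU and a 20-byte IP header this yields 1480-byte fragments.)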
Anyway from the coding point of view I'm almost ready, the testing will be more difficult.
Andrei
Andrei Pelinescu-Onciul wrote:
Hello Andrei,
I can help with testing and provide some performance numbers. Unfortunately I only have access to some quad-core Xeon servers, not 8-core ones. If the weird UDP behavior also appears on 4 cores (less lock contention), I can run a test plan.
My idea is to have several UDP workers (I plan to use 8 - this will allow the kernel to distribute them evenly across all 4 CPUs, as the machine is otherwise idle). I plan to test using 2-3 other machines for sending SIP messages (using sipp or sipsak); ser will just send back replies with some error code. I will measure the throughput over a 1-minute interval on the ser machine, with both the current UDP sender and the new UDP-over-raw-socket sender.
If this is OK, I will bring some results in the afternoon. Any suggestion regarding the test plan is welcome.
Cheers, Marius
On Jun 14, 2010 at 12:37, marius zbihlei <marius.zbihlei@1and1.ro> wrote:
Thanks a lot!
Well, while there is only a little more to be done, the code is not yet ready for testing (a few small things like on-send fragmentation and the integration into ser are still missing). I'll have it ready sometime this week, but I'm not sure I'll have time to do it today or tomorrow.
Andrei
Andrei Pelinescu-Onciul wrote:
Hello,
I want to test using some small SIP replies (so I am not sure if fragmentation comes into play). Of course the fragmentation code should also be tested for performance. Ser integration is another issue: do you plan to use a global parameter to switch between the normal sendto() path and raw sockets?
Marius
Andrei
On Jun 14, 2010 at 15:08, marius zbihlei <marius.zbihlei@1and1.ro> wrote:
Yes, a global param for raw sockets, another for the MTU (at least for now), and in the future an option to use raw sockets also for listening (they might be useful for a transparent proxy / load balancer implementation and, who knows, maybe we get a nice surprise when testing their performance).
Andrei
Andrei Pelinescu-Onciul wrote:
Hello Andrei,
I have just performed a couple of tests (I was busy myself), but I think I have some interesting results. I tested with 25 UAC/UAS pairs per test server, each pair generating 500 calls/s, for a total of 12,500 calls/s. The test servers (each running 25 sipp instances as UAC and 25 as UAS on different ports) were 2 quad-core Xeon machines in the same LAN (Gigabit Ethernet between them). Ser was doing a simple forward() based on the R-URI of the request, with 8 worker processes.
1. SER on a quad core Xeon, kernel 2.6.26.
a. I have enabled just one test server, for a total of 12,500 calls/s.
In this case the CPU usage was worse with UDP sockets (udp_raw=0) (median values):
"usr", "sys", "idl", "wai", "hiq", "siq" 13.584, 15.030, 50.713, 0.0, 2.950, 17.723
For raw sockets (udp_raw=1) these were the values:
"usr", "sys", "idl", "wai", "hiq", "siq" 10.396, 4.950, 76.238, 0.0, 2.970, 5.446
So the biggest difference is in software IRQ servicing time (last column, siq) and in sys. What is a little weird is the comparable usr CPU; I expected it to be greater in raw socket mode.
b. I enabled both testing machines for a total of 25,000 calls/s.
In this case the CPU usage was almost identical, but mostly because the sipp instances couldn't sustain 500 reqs/s in UDP mode. I limited sipp to send 20,000 calls per UAC/UAS pair. With raw sockets it took an average of 55 s (closer to the 40 s ideal value), but in UDP mode it took almost 88 s to send the 20,000 calls. The system load was the same (27% idle).
2. SER on a Dual quad core Xeon, kernel 2.6.32
I have done only some basic runs, but the results are not consistent with the ones on the other ser machine. Siq time is the same and the rate is steady at 500 calls/s, but user CPU is greater in raw socket mode. I have dug around a bit and came across two interesting patches in 2.6.29:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=...
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=...
The release notes here: http://kernelnewbies.org/Linux_2_6_29#head-612c6b882f705935cc804d4af0b383167...
As time allows, I will rerun some tests and provide some graphs if necessary.
Marius
On Jun 18, 2010 at 12:42, marius zbihlei <marius.zbihlei@1and1.ro> wrote:
[...]
Great!
Yes, it's strange that UDP sockets eat more CPU than the raw ones. For example, on the raw sockets I do the UDP checksum by hand, in an unoptimized way.
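[Editorial sketch: "by hand" here means the usual 16-bit one's-complement sum over an IPv4 pseudo-header plus the UDP header and payload. A straightforward, unoptimized version could look like this; it is not the actual raw_sock.c code.]

/* Straightforward (unoptimized) UDP checksum over the IPv4
 * pseudo-header + UDP header + payload. Illustrative only. */
#include <stdint.h>
#include <stddef.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static uint32_t sum16(const uint8_t* buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
	if (len & 1)	/* an odd trailing byte is padded with a zero byte */
		sum += (uint32_t)(buf[len - 1] << 8);
	return sum;
}

/* src & dst are IPv4 addresses in network byte order; udp points at the
 * UDP header (checksum field zeroed) followed by the payload;
 * udp_len = UDP header + payload length. Store the result with htons(). */
static uint16_t udp4_checksum(uint32_t src, uint32_t dst,
				const uint8_t* udp, uint16_t udp_len)
{
	uint32_t sum = 0;

	/* pseudo-header: source addr, dest addr, zero + protocol, UDP length */
	sum += (ntohl(src) >> 16) + (ntohl(src) & 0xffff);
	sum += (ntohl(dst) >> 16) + (ntohl(dst) & 0xffff);
	sum += IPPROTO_UDP;
	sum += udp_len;
	sum = sum16(udp, udp_len, sum);
	while (sum >> 16)	/* fold the carries into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	sum = ~sum & 0xffff;
	return (uint16_t)(sum ? sum : 0xffff);	/* RFC 768: 0 is sent as all 1s */
}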
That's way better than I expected...
The first one has to do with opening new sockets & binding them fast, and the other one is mostly about the receive side. The second might help, but not for sending, while the first one should speed up rtpproxy (if it doesn't pre-bind its sockets at startup).
The locking-on-send problem is also present in the latest 2.6.35 (lock_sock(sk) in udp_sendmsg()).
Actually it looks like newer kernels are a bit slower on the receive side, a problem that is solved in 2.6.35: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=...
The slowdown (memory accounting) was added in 95766fff:
$ git describe --all --contains 95766fff
tags/v2.6.25-rc1~1162^2~899
So it's present from 2.6.25 onwards.
Thanks a lot! I guess I should start thinking about making it more portable (*BSD support) and then merging it into master. I might be able to do some testing next week, if I manage to set up a big testing environment and finish the TCP & TLS stress tests by then (kind of a low probability).
Andrei
On Friday 11 June 2010, Andrei Pelinescu-Onciul wrote:
Hi Andrei,
thanks for the clarification. I completely forgot about the problem with the multiple source ports... Just as a side note, maybe also interesting in this problem space are the new RFS/RPS implementations in the upcoming 2.6.35:
http://lwn.net/Articles/382428/
http://lwn.net/Articles/362339/
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdif...
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdif...
Even if the patches mention mainly TCP-based workloads, the gains look promising.
Regards,
Henning