Another thing I have found is that running a certain number of async workers, even when they are not doing anything, appears to cause unexplained CPU load, even though the Kamailio instance is completely idle and not processing any calls.
Here is the baseline load with no async workers:
[root@centosity6 ~]# uptime
 14:06:28 up 2 min, 2 users, load average: 0.01, 0.00, 0.00
If I start Kamailio with 1 async worker...
modparam("async", "workers", 1)
async_workers=1
14:07:59 up 3 min, 2 users, load average: 0.00, 0.00, 0.00
Still nothing.
Now I try 2:
 14:08:54 up 4 min, 2 users, load average: 0.45, 0.11, 0.04
 14:08:59 up 4 min, 2 users, load average: 0.49, 0.12, 0.04
 14:09:04 up 4 min, 2 users, load average: 0.53, 0.14, 0.04
 14:09:09 up 4 min, 2 users, load average: 0.57, 0.15, 0.05
 14:09:14 up 5 min, 2 users, load average: 0.60, 0.17, 0.05
 14:09:19 up 5 min, 2 users, load average: 0.63, 0.18, 0.06
 14:09:31 up 5 min, 2 users, load average: 0.69, 0.21, 0.07
 14:09:37 up 5 min, 2 users, load average: 0.71, 0.22, 0.07
 14:10:35 up 6 min, 2 users, load average: 0.89, 0.36, 0.13
 14:11:07 up 6 min, 2 users, load average: 0.93, 0.42, 0.16
 ...
I don't have time right now to see if it will rise to infinity, but it just keeps going up.
There are no SIP messages being received at all, and there is absolutely nothing going on.
The load average climbs considerably faster with >= 4 workers. /proc/cpuinfo shows 8 "CPUs" available (quad-core processor with HyperThreading).
When I attach strace to the PID of the Async Worker, there's nothing going on except the normal recvfrom() blocking wait:
[root@centosity6 kamailio]# kamctl ps | grep Async
which: no greadlink in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/)
Process:: ID=11 PID=1559 Type=Async Task Worker
Process:: ID=12 PID=1560 Type=Async Task Worker
[root@centosity6 kamailio]# strace -p 1559
Process 1559 attached - interrupt to quit
recvfrom(9,

[root@centosity6 ~]# strace -p 1560
Process 1560 attached - interrupt to quit
recvfrom(9,
Nor do the other Kamailio processes seem to be doing anything unusual. 'ps aux' and 'top' show 0.0% CPU usage for the Kamailio processes themselves, and the system is 98-99% idle with negligible iowait.
I suppose it's possible that this is a peculiarity of the testing environment: I am running this on CentOS 6.5 inside VirtualBox. I have not had a chance to duplicate this experiment in a different kind of execution environment yet. It's possible that the load average is simply being reported incorrectly because of the VM environment.
I just tested the same thing on a Rackspace VPS (Xen, I think) and am seeing the same climbing load average with 2 async workers. It seems to top out at 1.05.
Can you attach gdb to one of the async processes and see, with the backtrace command, what the process is doing at that time?
Cheers, Daniel
Hi Daniel,
On 10/24/2014 09:03 AM, Daniel-Constantin Mierla wrote:
Can you attach gdb to one of the async processes and see, with the backtrace command, what the process is doing at that time?
Sometimes I can do this:
(gdb) where
#0  0x00007f4c17007673 in __recvfrom_nocancel () from /lib64/libc.so.6
#1  0x000000000042c10c in async_task_run (idx=1) at async_task.c:201
#2  0x000000000042b449 in async_task_child_init (rank=0) at async_task.c:144
#3  0x0000000000590af0 in init_child (rank=0) at sr_module.c:944
#4  0x00000000004a603c in main_loop () at main.c:1705
#5  0x00000000004aad16 in main (argc=11, argv=0x7fff73224228) at main.c:2560
But at other times:
Process:: ID=11 PID=1932 Type=Async Task Worker
Process:: ID=12 PID=1933 Type=Async Task Worker
[root@centosity6 ~]# gdb
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) attach 1932
Attaching to process 1932
[nothing happens, gdb hangs]
Moreover, no signals crash or kill the process:
[root@centosity6 kamailio]# kill -SEGV 1932
[root@centosity6 kamailio]# kill -TERM 1932
[root@centosity6 kamailio]# kill -INT 1932
[root@centosity6 kamailio]#
... not even kill, which shouldn't be trappable:
[root@centosity6 kamailio]# kill -9 1932
[root@centosity6 kamailio]# kill -9 1932
[root@centosity6 kamailio]# kill -9 1932
[root@centosity6 kamailio]# kill -9 1932
[root@centosity6 kamailio]# kill -9 1932
-- Alex
But most of the time, I do get this from both processes:
(gdb) where
#0  0x00007fb699e07673 in __recvfrom_nocancel () from /lib64/libc.so.6
#1  0x000000000042c10c in async_task_run (idx=2) at async_task.c:201
#2  0x000000000042b449 in async_task_child_init (rank=0) at async_task.c:144
#3  0x0000000000590af0 in init_child (rank=0) at sr_module.c:944
#4  0x00000000004a603c in main_loop () at main.c:1705
#5  0x00000000004aad16 in main (argc=11, argv=0x7fff6da24fd8) at main.c:2560
Interestingly enough, the load problem seems to come from both async workers running at once. When I attach gdb to one or the other, the load drops back to zero:
12:26:58 up 40 min, 2 users, load average: 0.00, 0.08, 0.15
But as soon as I detach from them both, the load average comes back up.
I also get 0 load with only async_workers=1.
So, I think it's somehow related to having more than one async worker.
I also find this in 'dmesg' about the async workers periodically:
INFO: task kamailio:4480 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kamailio      D 0000000000000001     0  4480   4458 0x00000080
 ffff880037b1db28 0000000000000082 ffff880037b1dbb8 ffffffff8112f183
 ffff88000001dd80 0000000000000002 ffff880037b1dac8 ffffffff8109b39c
 ffff8800a6069058 ffff880037b1dfd8 000000000000fbc8 ffff8800a6069058
Call Trace:
 [<ffffffff8112f183>] ? __alloc_pages_nodemask+0x113/0x8d0
 [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
 [<ffffffff814f668a>] unix_dgram_recvmsg+0x7a/0x4d0
 [<ffffffff8144a923>] sock_recvmsg+0x133/0x160
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8114b00a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff8104a98c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff8144aa9e>] sys_recvfrom+0xee/0x180
 [<ffffffff810e1e07>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
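If I'm reading that output right, the "D" after "kamailio" is the task state, i.e. uninterruptible sleep: the worker is stuck in mutex_lock() inside unix_dgram_recvmsg(), presumably on a per-socket lock held by the other worker that is already blocked in recvfrom() on the same internal socket. Linux counts tasks in uninterruptible sleep toward the load average even when they use no CPU at all, which would fit a load of ~1 with two idle workers and 0.0% CPU. Here is a tiny stand-alone test (purely illustrative, not Kamailio code) that should make it easy to check whether a second blocked reader on the same AF_UNIX datagram socket ends up in "D" state on this kernel:

/* Two child processes blocking in recvfrom() on the same end of an
 * AF_UNIX datagram socket pair, similar to what the async workers do.
 * While it runs, watch whether the second reader shows up in "D" state
 * and whether the load average creeps up:
 *   watch 'uptime; ps -o pid,stat,wchan,cmd -C a.out'
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(void)
{
    int fds[2];
    char buf[256];
    int i;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0) {
        perror("socketpair");
        return 1;
    }

    for (i = 0; i < 2; i++) {
        if (fork() == 0) {
            /* both children block here forever; nothing is ever sent */
            recvfrom(fds[0], buf, sizeof(buf), 0, NULL, NULL);
            _exit(0);
        }
    }

    sleep(600);   /* keep the parent (and the descriptors) alive for the test */
    return 0;
}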
Hello,
I'm back in the discussion after traveling.
Interesting, because it looks like the normal recvfrom() done on the internal socket, where I would expect no load. I have seen the backtrace you sent in the previous email, and it is in the same recv function from the system. The process is waiting for packets on the internal socket.
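For reference, the receiving loop is roughly like this (a simplified sketch for illustration, not the exact code from async_task.c): the worker blocks in recvfrom() on the read end of the internal socket pair and executes whatever task the SIP worker processes push to it.

#include <sys/types.h>
#include <sys/socket.h>

/* simplified task descriptor passed by pointer over the internal socket */
typedef struct async_task {
    void (*exec)(void *param);   /* callback to run in the async worker */
    void *param;                 /* data prepared by the pushing process */
} async_task_t;

void async_worker_loop(int sock)
{
    async_task_t *ptask;
    ssize_t len;

    for (;;) {
        /* this is the recvfrom(9, ... seen in strace: the worker just
         * sleeps here until a task pointer arrives on the socket */
        len = recvfrom(sock, &ptask, sizeof(ptask), 0, NULL, NULL);
        if (len < (ssize_t)sizeof(ptask))
            continue;
        ptask->exec(ptask->param);   /* run the queued work */
        /* the task memory is released afterwards */
    }
}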
I will try to read more about it and see if there is anything missing there.
Cheers, Daniel