Hi guys,
Since 1.3.0 (now running 1.4.4) I'm seeing a very slow build-up of SHM memory on our low-traffic setup (less than 5 cps per machine). I'm looking for some basis to take my research into the cause further. :)
In production I compiled Kamailio 1.4.4-notls with #define SHM_MEM_SIZE 4*32 in config.h. For my testing setup I'm running with the standard 32 there.
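For reference, the only change in config.h is that one define (the value should be in megabytes, if I read the stock 1.4.x header correctly):

    #define SHM_MEM_SIZE 4*32   /* 128 MB shared memory pool instead of the default 32 */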
After about 3 weeks of uptime I start top, sort on memory size and find that the kamailio processes (I'm running with 16 children) all have about 40 MB in the SHR column. My understanding is that this figure should also go down at some point, but it only goes up, slowly. More CPS (for example a benchmark using sipp) makes it go up faster, but it never seems to come back down. I think this is wrong, but I could be wrong myself. :)
On a separate machine with no traffic I compiled in the memory debugging according to the "memory troubleshooting" page on the wiki. LOTS of info in the logs. I also ran with valgrind and didn't find anything interesting (but I'm really no dev myself).
My plan now is to remove our acc module (compiled with radius support) and see whether it's maybe that module that's causing this. My test on this traffic-less machine is as follows: start, run 20 cps for a while (we do no registers, just routing and auth) and note the SHR figure from top. According to my understanding that figure should then drop after a period of 20 minutes with no traffic. Is that a correct assumption?
On the test setup the top data looks like this after about 10 calls:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15975 root      18   0 69240 1536  568 S  0.0  0.6  0:00.00 kamailio
15972 root      18   0 69240 1568  588 S  0.0  0.6  0:00.00 kamailio
15970 root      18   0 69240 1568  588 S  0.0  0.6  0:00.00 kamailio
15969 root      19   0 69244 1488  568 S  0.0  0.6  0:00.00 kamailio
15968 root      15   0 69240 1860  936 S  0.0  0.7  0:00.00 kamailio
15967 root      15   0 71436 3376 2268 S  0.0  1.3  0:00.00 kamailio
15966 root      15   0 71436 3396 2288 S  0.0  1.3  0:00.00 kamailio
15965 root      23   0 69240 5552 4640 S  0.0  2.1  0:00.02 kamailio
As far as I can tell, the SHR entries never go down. When running with very little SHM in config.h, the process runs out of shm memory and complains, as expected.
Are my assumptions about all of this correct?
On Friday, 6 November 2009, Robin Vleij wrote:
Since 1.3.0 (now running 1.4.4) I'm seeing a very slow build-up of SHM memory on our low-traffic setup (less than 5 cps per machine). I'm looking for some basis to take my research into the cause further. :)
In production I compiled Kamailio 1.4.4-notls with #define SHM_MEM_SIZE 4*32 in config.h. For my testing setup I'm running with the standard 32 there.
After about 3 weeks of uptime I start top, sort on memory size and find that the kamailio processes (I'm running with 16 children) all have about 40 MB in the SHR column. My understanding is that this figure should also go down at some point, but it only goes up, slowly. More CPS (for example a benchmark using sipp) makes it go up faster, but it never seems to come back down. I think this is wrong, but I could be wrong myself. :)
On a separate machine with no traffic I compiled in the memory debugging according to the "memory troubleshooting" page on the wiki. LOTS of info in the logs. I also ran with valgrind and didn't find anything interesting (but I'm really no dev myself).
My plan now is to remove our acc module (compiled with radius support) and see whether it's maybe that module that's causing this. My test on this traffic-less machine is as follows: start, run 20 cps for a while (we do no registers, just routing and auth) and note the SHR figure from top. According to my understanding that figure should then drop after a period of 20 minutes with no traffic. Is that a correct assumption? [..] As far as I can tell, the SHR entries never go down. When running with very little SHM in config.h, the process runs out of shm memory and complains, as expected.
Are my assumptions about all of this correct?
Hello Robin,
Do you experience any problems in your setup when you use a reasonable SHM mem size? In my experience the size of the SHM memory (as displayed by top) depends on the load of the machine. But there is a certain level of shared memory that is used regardless of load. Even if the machine has been completely passive over a longer time, it will not reclaim this memory. On a certain test system, for example, there is one process that has 11 MB SHM at the moment, even though it's completely idle.
For the VIRT column (again in top) it's another story; it will just show something like the SHM + PKG memory size, regardless of the actual load.
If you have a real memory leak in shared memory, then after a certain time the server will report memory allocation errors. Otherwise I don't think it's something to worry about.
Regards,
Henning
Henning Westerholt wrote:
Hi Henning!
Do you experience any problems in your setup when you use a reasonable SHM mem size? In my experience the size of the SHM memory (as displayed
I've had problems finding a "reasonable" shm mem size. :) The standard is 32 MB, which runs out quickly when customers do "funny stuff" (read: loops). Now I'm compiling with #define SHM_MEM_SIZE 4*32; 128 MB should be enough to last pretty long. So there's no immediate memory problem or crashes (when it's full, it reports errors and stops processing traffic the right way). But right now, for example, after a "funny" customer, I'm seeing over 40 MB per child in top (16 children). That won't go down anymore, so we'll have to see how long it holds. What do you suggest for SHM sizes?
machine has been completely passive over a longer time, it will not reclaim this memory. On a certain test system, for example, there is one process that has 11 MB SHM at the moment, even though it's completely idle.
OK. We often run for a very long time at 10-20 MB per process (all processes have about the same, at least the children that process UDP), but like today, when someone has a problem and it becomes SIP spaghetti, it jumps up to 40 MB and then continues to rise slowly from there. It doesn't feel good to be able to hit some kind of ceiling with the same traffic load.
For the VIRT column (again in top) it's another story; it will just show something like the SHM + PKG memory size, regardless of the actual load.
VIRT shows 421 MB right now for me. I figure that's what you describe: the PKG memory of each process + the SHM.
If you have a real memory leak in shared memory, then after a certain time the server will report memory allocation errors. Otherwise I don't think it's something to worry about.
It does, if I don't raise the limit. Say I'm running with 32: if I hit that after some weeks of uptime, it starts reporting memory allocation errors in different parts of my config and stops doing important stuff. I also reproduced this by assigning a small amount on a dev machine and then sending 20 cps to it.
On a test machine I have 4 processes, all using 600 KB or so; after 20 calls it'll go up to something like
31409 root      15   0 94672 1936 1052 R  0.0  0.7  0:00.00 kamailio
31408 root      15   0 94784 3072 2068 S  0.0  1.2  0:00.00 kamailio
31407 root      15   0 94784 3072 2068 S  0.0  1.2  0:00.00 kamailio
31406 root      25   0 94672 5428 4556 S  0.0  2.1  0:00.02 kamailio
And it only goes back down a little after 15-20 minutes or so (often a bit faster if load is low).
If this is a leak, it'll be almost impossible to find. I can't run production with memlog or debug on, and in dev it seems quite hard to reproduce. Not sure what to expect. :)
On Monday, 9 November 2009, Robin Vleij wrote:
Do you experience any problems in your setup when you use a reasonable SHM mem size? In my experience the size of the SHM memory (as displayed
I've had problems finding a "reasonable" shm mem size. :) The standard is 32 MB, which runs out quickly when customers do "funny stuff" (read: loops). Now I'm compiling with #define SHM_MEM_SIZE 4*32; 128 MB should be enough to last pretty long.
Hi Robin,
By the way, there is no need to re-compile the server just to change this setting; it's a normal daemon binary parameter. 128 MB should be really fine, given the load you quoted.
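For example, starting the daemon with something like the following (the config file path is just an illustration) should give you a 128 MB shared memory pool without touching config.h:

    kamailio -m 128 -f /etc/kamailio/kamailio.cfg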
So there's no immediate memory problem or crashes (when it's full, it reports errors and stops processing traffic the right way). But right now, for example, after a "funny" customer, I'm seeing over 40 MB per child in top (16 children). That won't go down anymore, so we'll have to see how long it holds. What do you suggest for SHM sizes?
With today's memory sizes/prices you could use for example 512 MB, which should give you plenty of room even in really abnormal load conditions. And as it's shared, you'll still have plenty of room left for e.g. the database.
machine has been completely passive over a longer time, it will not reclaim this memory. On a certain test system, for example, there is one process that has 11 MB SHM at the moment, even though it's completely idle.
OK. We often run for a very long time at 10-20 MB per process (all processes have about the same, at least the children that process UDP), but like today, when someone has a problem and it becomes SIP spaghetti, it jumps up to 40 MB and then continues to rise slowly from there. It doesn't feel good to be able to hit some kind of ceiling with the same traffic load.
You mentioned the loops a few times; normally they should be detected pretty quickly by the Max-Forwards counter checks and additionally by Diversion header checks?
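Just as a sketch, the usual check from the default configuration looks roughly like this (the limit of 10 is only an example; it assumes the maxfwd and sl modules are loaded):

    # reject requests whose Max-Forwards counter has run out,
    # which is how forwarding loops eventually get stopped
    if (!mf_process_maxfwd_header("10")) {
        sl_send_reply("483", "Too Many Hops");
        exit;
    }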
If you have a real memory leak in shared memory, then after a certain time the server will report memory allocation errors. Otherwise I don't think it's something to worry about.
It does, if I don't raise the limit. Say I'm running with 32: if I hit that after some weeks of uptime, it starts reporting memory allocation errors in different parts of my config and stops doing important stuff. I also reproduced this by assigning a small amount on a dev machine and then sending 20 cps to it.
On a test machine I have 4 processes, all using 600 KB or so; after 20 calls it'll go up to something like
31409 root      15   0 94672 1936 1052 R  0.0  0.7  0:00.00 kamailio
31408 root      15   0 94784 3072 2068 S  0.0  1.2  0:00.00 kamailio
31407 root      15   0 94784 3072 2068 S  0.0  1.2  0:00.00 kamailio
31406 root      25   0 94672 5428 4556 S  0.0  2.1  0:00.02 kamailio
And it only goes back down a little after 15-20 minutes or so (often a bit faster if load is low).
With the memory debugging you could dump all the allocations during runtime, though they are a bit hard to read for a non-developer. But this way you could trace "call by call" how your server behaves and how the situation develops.
If this is a leak, it'll be almost impossible to find. I can't run production with memlog or debug on, and in dev it seems quite hard to reproduce. Not sure what to expect. :)
If you had a leak in a commonly used code path, you would run out of memory pretty fast, within a few days. If your servers are stable (for some weeks or months) with the setting you use at the moment, I don't think there is much to worry about.
Regards,
Henning
Henning Westerholt wrote:
Hi Henning!
By the way, there is no need to re-compile the server just to change this setting; it's a normal daemon binary parameter. 128 MB should be really fine, given the load you quoted.
Ah, OK. I missed that. I had already set -m to 256, but wasn't aware that that was the shared memory. So I'll let it run, see how it goes and whether it keeps growing until it reaches 256.
With today's memory sizes/prices you could use for example 512 MB, which should give you plenty of room even in really abnormal load conditions. And as it's shared, you'll still have plenty of room left for e.g. the database.
Yep. The idea is basically to just use a huge amount and then normally you won't hit it. I understand from all of this that the memory "growing" is really not something specific to my setup here.
You mentioned the loops a few times; normally they should be detected pretty quickly by the Max-Forwards counter checks and additionally by Diversion header checks?
Well, it's more like this. A customer sends an INVITE which is really to himself (failboat). So I send it to him (directly or via a PSTN gateway, depending on the routing setup), which causes (for example when their PBX has a forwarding) a new INVITE to me, a new call leg. Until one of the two sides dies or is congested. That's normally not Kamailio, so that's the good news. The only thing then is the memory usage after such a spike. I'm also running pike, so in the end I just send them 480s back.
With the memory debugging you could dump all the allocations during runtime, though they are a bit hard to read for a non-developer. But this way you could trace "call by call" how your server behaves and how the situation develops.
I had this on and it was a LOT of info. :) I made some calls and it showed that there's a lot of stats and init stuff, but otherwise nothing shocking. Then again, as you said, I'm not a dev, so I might have missed shocking errors. :)
If you had a leak in a commonly used code path, you would run out of memory pretty fast, within a few days. If your servers are stable (for some weeks or months) with the setting you use at the moment, I don't think there is much to worry about.
OK, I'll keep an eye on it. I'll run with 128 MB for now and see how it grows with the load. I was looking at a 1.4.4 -> 1.5.0 upgrade, but that was a bit more complicated than 1.3 -> 1.4 because of the database layout and some changed modules. I have to write a fallback plan before I upgrade. I might also wait for 3.0.0, which sounds interesting.
/Robin
On Tuesday, 10 November 2009, Robin Vleij wrote:
With today's memory sizes/prices you could use for example 512 MB, which should give you plenty of room even in really abnormal load conditions. And as it's shared, you'll still have plenty of room left for e.g. the database.
Yep. The idea is basically to just use a huge amount and then normally you won't hit it. I understand from all of this that the memory "growing" is really not something specific to my setup here.
Hey Robin,
Yes, and normally only a small fraction of the memory is used.
You mentioned the loops a few times; normally they should be detected pretty quickly by the Max-Forwards counter checks and additionally by Diversion header checks?
Well, it's more like this. A customer sends an INVITE which is really to himself (failboat). So I send it to him (directly or via a PSTN gateway, depending on the routing setup), which causes (for example when their PBX has a forwarding) a new INVITE to me, a new call leg. Until one of the two sides dies or is congested. That's normally not Kamailio, so that's the good news. The only thing then is the memory usage after such a spike. I'm also running pike, so in the end I just send them 480s back.
OK, I understand. So it's more a temporary overload condition that you face.
With the memory debugging you could dump all the allocations during runtime, though they are a bit hard to read for a non-developer. But this way you could trace "call by call" how your server behaves and how the situation develops.
I had this on and it was a LOT of info. :) I made some calls and it showed that there's a lot of stats and init stuff, but otherwise nothing shocking. Then again, as you said, I'm not a dev, so I might have missed shocking errors. :)
Yes, it's a lot of information to parse. But if you do a memory dump (as described e.g. here: http://www.kamailio.org/dokuwiki/doku.php/troubleshooting:memory) with memory debugging enabled, you can see where the allocations come from.
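Besides the full dump, you can also keep an eye on the overall shared memory statistics at runtime; if I remember correctly something like this works on 1.4.x (it needs the FIFO interface that the default setup has) and shows the total, used and free sizes of the shm pool:

    kamctl fifo get_statistics shmem: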
If you had a leak in a commonly used code path, you would run out of memory pretty fast, within a few days. If your servers are stable (for some weeks or months) with the setting you use at the moment, I don't think there is much to worry about.
OK, I'll keep an eye on it. I'll run with 128 MB for now and see how it grows with the load. I was looking at a 1.4.4 -> 1.5.0 upgrade, but that was a bit more complicated than 1.3 -> 1.4 because of the database layout and some changed modules. I have to write a fallback plan before I upgrade. I might also wait for 3.0.0, which sounds interesting.
We did an upgrade to 1.5 in the last months on some of our production systems, without any notable problems. Some other systems will need some more time before they can run on 1.5, especially because of the database changes you also mentioned. If you update, make sure that you use the latest 1.5 version/stable branch. 3.0 will indeed be interesting; I'm looking forward to it.
Regards,
Henning
On 11/10/09 12:26 PM, Henning Westerholt wrote:
Hi Henning!
Well, it's more like this. A customer sends an INVITE which is really to himself (failboat). So I send it to him (directly or via a PSTN gateway, depending on the routing setup), which causes (for example when their PBX has a forwarding) a new INVITE to me, a new call leg. Until one of the two sides dies or is congested. That's normally not Kamailio, so that's the good news. The only thing then is the memory usage after such a spike. I'm also running pike, so in the end I just send them 480s back.
OK, I understand. So it's more a temporary overload condition that you face.
Yes, exactly. And it's not really overload; the machine can easily handle it and isn't loaded at all, even when customers do "interesting" stuff. It's just that the memory grows and never seems to shrink again.
We did an upgrade to 1.5 in the last months on some of our production systems, without any notable problems. Some other systems will need some more time before they can run on 1.5, especially because of the database changes you also mentioned. If you update, make sure that you use the latest 1.5 version/stable branch. 3.0 will indeed be interesting; I'm looking forward to it.
Me too. I think I'll actually wait for 3.0.x or 3.1 and then make one big upgrade. Night maintenance is not my favourite hobby, so I'd rather take one big step. I think the current 1.4.4 I'm running doesn't have any acute problems or crashes, so it's fine. :)
Thanks for your help and explanations (and of course your work on Kamailio)!
/Robin