Hello,
On 10/06/08 13:22, mayamatakeshi wrote:
[...]
Hello,
we have openser 1.3.3 running in production (current rev.: 4943).
Three times in 50 days we had to restart openser to correct a pkg
memory problem.
openser 1.3.3 was released 3 weeks ago, so I guess you were running a
previous version before, but it has happened again since you upgraded
to 1.3.3, right?
After some time logging messages like this:
/openser.log:Aug 19 10:39:18 ipx022 /usr/local/sbin/openser[16991]: ERROR:core:new_credentials: no pkg memory left,
openser will eventually run out of pkg memory and refuse all
subsequent requests.
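
Since pkg memory is private to each openser process, it can help to see which process IDs are reporting the error and how often. The following is only a minimal sketch: the regular expression is modelled on the single log line quoted above, the function name count_pkg_errors is hypothetical, and the second and third sample lines are invented for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this thread;
# other syslog layouts may need a different expression.
PKG_ERROR = re.compile(
    r"openser\[(?P<pid>\d+)\]:\s*ERROR:[^ ]*\s*no pkg memory left"
)

def count_pkg_errors(lines):
    """Return a Counter mapping openser PID -> number of pkg memory errors."""
    counts = Counter()
    for line in lines:
        match = PKG_ERROR.search(line)
        if match:
            counts[match.group("pid")] += 1
    return counts

# First line is taken from the thread; the other two are made up.
sample = [
    "Aug 19 10:39:18 ipx022 /usr/local/sbin/openser[16991]: "
    "ERROR:core:new_credentials: no pkg memory left",
    "Aug 19 10:39:19 ipx022 /usr/local/sbin/openser[16991]: "
    "ERROR:core:new_credentials: no pkg memory left",
    "Aug 19 10:39:20 ipx022 /usr/local/sbin/openser[16992]: "
    "INFO: unrelated message",
]

for pid, hits in count_pkg_errors(sample).items():
    print(pid, hits)
```

To run it against a real log, pass an open file object instead of the sample list, for example count_pkg_errors(open("openser.log")).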
We are trying to recreate this in our lab so that we can follow the
memory troubleshooting instructions at
http://kamailio.net/dokuwiki/doku.php/troubleshooting:memory,
but so far we have been unable to do so, even when generating
millions of calls and registration transactions (we are using SIPp to
generate normal call flows and even abnormal call flows detected when
reading openser.log, like 'invalid cseq for aor', malformed SIP
messages etc.).
We can spot memory leaks even if the "out of memory" message is not
printed. Just archive the logs (the most important part is the
shutdown time) and make them available for download so they can be
investigated.
There could be two reasons:
- there is a memory leak, but it happens only in cases that you
don't reproduce in the lab, although they occur in the production
environment
- you get memory fragmentation
Let's see first the debug messages...
Hello,
here is the link for the openser.log and cfg files:
http://www.yousendit.com/download/bVlEV0o4R3NoeWJIRGc9PQ
After compiling with the debug flags for the memory manager, I left
openser running in production for 24 hours. Then I moved all
traffic to another host and waited for more than 30 minutes before
stopping openser.
In the openser.cfg, I set debug=2. If you need, I can run it again
with a higher value (but I hope it doesn't have to be too high,
due to overhead concerns).
Sorry, I forgot to mention one thing: the last revision that
showed this problem was 4809, so we reverted to that
revision before performing the above.
Am I to understand that you couldn't reproduce it with the latest svn
version? So you had to get a previous version?
Hi,
no, the reason for the reversion is that the latest version running in
production will not show the problem, because we adopted preventive
resets to minimize the impact on customer calls. So I don't know yet
whether it shows this problem or not.
So I collected the logs using a revision that I was sure could
recreate the problem.
OK, I understand now. I was looking at the logs and there seems to be a
leak with db operations: something does not free a db result. I will go
over the modules that you are using and try to spot any issue. I will
also check the change log to see if anything changed recently
regarding such an issue.
But here are some developments in my investigation:
Up to now, I was trying to recreate the problem using virtual machines
running the same OS (Fedora 5) as in production. It never happened
there, even after 30 million calls.
But we were eventually able to test openser 1.3 on a production
machine with the same spec as the ones showing the problem, and we were
able to trigger the pkg memory problem using a simple outgoing SIPp
scenario. The problem always happens after we reach around 28,000
calls, and we confirmed that the number of calls needed to cause the
problem grows linearly with the amount of pkg memory (after increasing
the pkg memory pool by a factor of 4, the problem started to happen
only after around 128,000 calls).
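
As a rough illustration of what these two data points imply, the sketch below compares the growth in calls with the growth in pool size and derives an average amount of pkg memory lost per call. The pool size is not stated in this thread, so ASSUMED_POOL_BYTES is purely a placeholder, and the function name implied_leak_per_call is hypothetical.

```python
def implied_leak_per_call(pool_bytes, calls_until_exhaustion):
    """Average bytes of pkg memory lost per call, if the whole pool leaks away."""
    return pool_bytes / calls_until_exhaustion

# ASSUMPTION: the actual pkg pool size is not given in the thread;
# 1 MiB is used here only so the arithmetic has something to work with.
ASSUMED_POOL_BYTES = 1024 * 1024

base_calls = 28_000      # approximate calls before failure, original pool
scaled_calls = 128_000   # approximate calls before failure, pool enlarged
pool_factor = 4          # pool size multiplier reported above

call_ratio = scaled_calls / base_calls
print("calls grew by a factor of", round(call_ratio, 2))
print("pool grew by a factor of", pool_factor)
print("implied bytes lost per call (under the assumed pool size):",
      round(implied_leak_per_call(ASSUMED_POOL_BYTES, base_calls), 1))
```

The call ratio comes out somewhat above the pool factor. That would be consistent with part of the original pool being occupied by a fixed footprint before any call is handled, but both call counts are approximate, so this is only a plausibility check.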
However, we also ran the same tests with kamailio 1.4 (rev. 5017) on
that machine and could not recreate the problem after 1.5 million
calls, so we are thinking of just upgrading to 1.4 once other scenarios
show that everything else is working.
OK, 1.4 is recommended; it has a lot of new features and many fixes.
But I don't know why the problem cannot be recreated using the VMs:
the only significant difference is that the production machines have
4 NICs that are bonded in 2 pairs (one for the private IP and another
for the public IP), while the VMs have just one NIC.
I see no relation with the NICs.
I hope upgrading to 1.4 will solve everything. However, since nobody
else is complaining about openser stopping after 28,000 calls, I
still believe we have some problem in the openser.cfg itself. I'll
check it after we put kamailio 1.4 in production.
OK, I will dig in further; I might be a bit slow these days, however.
Cheers,
Daniel
--
Daniel-Constantin Mierla
http://www.asipto.com