DMQ mem leak issues


Rogelio Perez

31 Jul 2018

4:58 a.m.

Hello,

We're running three instances of Kamailio v5.14 as registrars handling registrations from ~2000 SIP clients, with one instance being primary and the other two as backups.

All three use the dmq and dmq_usrloc modules to synchronize user locations. However, after a couple of days of operation the two failover instances start to show memory leak behavior, with the memory usage attributed to the core consuming all available resources.

When this happens we've noticed that:
- The shared memory used by the function "sip_msg_shm_clone" spikes (from 1kb to 1.5GB).
- The shared memory used by the function "dmq:worker.c:job_queue_push" also increases, but not as much (from 1kb to 1MB).
- DMQ requests are not being answered (with a 200 OK) by the affected instance during this memory leak, which makes us think that the DMQ module becomes unresponsive.

A few more notes:
- The failover instances are doing nothing except receiving replicated contacts.
- The shared memory grows at the same rate on both instances, but the critical behavior never happens at the same time.
- We are allocating 1GB memory on startup to each instance.
- We store the location DB in a psql DB and we load it at startup.
- We didn't find any errors in syslog, even at debug level.

Has anyone experienced a similar issue who can suggest a possible solution?

Thanks, Rogelio Perez Telnyx


Daniel-Constantin Mierla

31 Jul

12:05 p.m.

Hello,

I don't use dmq much, but at a quick look in the code I noticed some cases where the job fields were not released if the processing did not complete for various reasons.

I pushed the commit a1f5fbe2c18246d4afefa44fd8a52612a5182a46, can you try with it and see the results?

Maybe Charles Chance can also do a bit of review here, being the one doing most of the work lately for dmq.

Cheers, Daniel


Kamailio (SER) - Users Mailing List sr-users@lists.kamailio.org https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users

-- Daniel-Constantin Mierla - http://www.asipto.com http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda

Charles Chance

12:59 p.m.

Hi Daniel,

Nice spot! I had tried to reproduce locally, but had not considered the possibility that jobs may be failing somewhere in Rogelio’s setup.

Most likely your patch will resolve it but I’m happy to take a look further if not.

Cheers,

Charles


-- *Charles Chance* Managing Director t. 0330 120 1200 m. 07932 063 891 -- Sipcentric Ltd. Company registered in England & Wales no. 7365592. Registered office: Faraday Wharf, Innovation Birmingham Campus, Holt Street, Birmingham Science Park, Birmingham B7 4BB.

Julien Chavanton

3:50 p.m.

Nice finding! However, Rogelio says there were no errors in the logs, and looking at the patch he would have seen some:

LM_ERR("running job failed\n");

Hope I am wrong.

Rogelio, is this a slow leak, or do you mean it is suddenly using 1.5G?

A lot of shm can be used by sip_msg_shm_clone if Kamailio starts to retransmit. 2K seems very small. Normally this kind of high memory usage can occur at startup, when DMQ usrloc is getting a full sync from all other nodes.


Rogelio Perez

4:21 p.m.

Thanks Daniel, Charles and Julien.

I confirm we're not getting the error log "running job failed". The behavior is always the same: either of the two failover instances runs without issues for a day or two and then suddenly starts consuming all available memory in the span of an hour or less. Please check these graphs with some examples for more details: https://www.dropbox.com/sh/tu0jxi1vlbq81m8/AABhfz9rDumdCu3l0ROH7Lkla?dl=0

I'll try Daniel's patch and confirm results soon.

Rogelio

Julien Chavanton

4:58 p.m.

Since it seems you are recovering the memory, this does not seem like a real "leak".

One hypothesis:

When you restart a node on the DMQ bus, it can trigger memory usage on the other nodes, since they will start a SYNC and send one DMQ message per contact. Could it be that one node in the DMQ bus is restarted and not answering DMQ messages?

A few ideas:

You could search your trace; maybe you will find the DMQ sync requests ...

You can also check for a significant increase in active transactions.

Verify the state of the bus:
kamcmd dmq.list_nodes

Verify the number of contacts on each node (to confirm that the cluster is healthy):
kamctl stats | grep usrloc | grep contact


Rogelio Perez

8:15 p.m.

Julien,

> Since it seems you are recovering the memory this does not seem like a real "leak"

I forgot to mention that the recoveries are actual manual Kamailio restarts.

> One hypothesis: When you restart a node on the DMQ bus, it can trigger memory usage on the other nodes since they will start to do a SYNC and send one DMQ message / contact
> It could be that one node in the DMQ bus is restarted and not answering DMQ messages?

The mem leak periods do not match the moment we restart any of the nodes.

> Few ideas: You could search your trace, maybe you will find the DMQ sync requests ...

We verified the traces and found nothing unusual at the moment of the mem leak.

> You can also confirm significant increase in active transactions.

Same.

> Verify the state of the bus: kamcmd dmq.list_nodes

The primary node state shows the affected secondary node as inactive.

> Verify the amount of contact on each node (confirm that the cluster is healthy)
> kamctl stats | grep usrloc | grep contact

I'll run this check the next time we see the mem leak in action.

Daniel's patch is now in production, I'll confirm results soon.

Thanks, Rogelio

Rogelio Perez

1 Aug

5:06 p.m.

Hello,

We had to roll back the changes, as dmq_usrloc notifications are not working at all with the latest master that includes the DMQ patch. On the receiving node we see the following errors:

(394) INFO: <script>: [1ab4f27229f4d994-409@10.15.7.9] Contact not found for user notification_peer -> 404 Not Found
3(395) INFO: <script>: [16b52b012246628b-422@10.15.7.9] Contact not found for user usrloc -> 404 Not Found

We've reproduced this issue on our dev environment. Any ideas what we should try next?

Thanks, Rogelio

Charles Chance

5:16 p.m.

Out of interest, can you show me the beginning of your route block - just the part where you call dmq_handle_message()?

Cheers,

Charles


Rogelio Perez

2 Aug

3:45 a.m.

Charles, here you go:

####### Routing Logic ########

# Main SIP request routing logic
# - processing of any incoming SIP request starts with this route
# - note: this is the same as route { ... }
request_route {

    # per request initial checks
    route(REQINIT);

#!ifdef ENABLE_KDMQ
    # Handle Kamailio DMQ messages
    if (is_method("KDMQ")) {
        dmq_handle_message();
    }
#!endif

Charles Chance

7:43 a.m.

Again out of interest, what happens if you change it to:

if (method == "KDMQ") { ...

Cheers,

Charles


Paolo Visintin - evosip.cloud

12:05 p.m.

Hi Charles, method == "KDMQ" gives a "syntax error".

We solved it using if ($rm == "KDMQ").

Cheers

Paolo Visintin, CTO, evosip.cloud
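
For reference, the $rm workaround described above might look roughly like the fragment below. This is only a sketch based on the route block shown earlier in the thread; the surrounding request_route contents are assumed, and the rest of the routing logic is omitted.

```cfg
request_route {

#!ifdef ENABLE_KDMQ
    # Match KDMQ requests by comparing the request-method
    # pseudo-variable ($rm) instead of calling is_method()
    if ($rm == "KDMQ") {
        dmq_handle_message();
    }
#!endif

    # ... remaining routing logic (not shown in the thread) ...
}
```

The only difference from the original block is the condition; the call to dmq_handle_message() is unchanged.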


Charles Chance

12:27 p.m.

Hi Paolo,

How strange...it really shouldn’t (and has never done so for us): https://www.kamailio.org/wiki/cookbooks/devel/core#method

Either way, $rm is also fine.

Rogelio - to clarify, my aim here is simply to establish if something other than is_method() works in your case.

Best,

Charles


Charles Chance

3:53 p.m.

Hi Rogelio,

No need to test with method/$rm - the is_method() regression is now fixed.

If you pull the latest master you should be able to test Daniel's patch again.

Cheers,

Charles


Rogelio Perez

5:57 p.m.

Thanks Charles, it's working now. I'm deploying to production and confirming results soon.

Rogelio

Rogelio Perez

6 Aug

8:43 p.m.

Charles, Julien, Daniel,

The results are pretty much the same: the mem leak is still there and we need to restart Kamailio when it reaches a certain threshold. https://www.dropbox.com/s/enxx6b7t0c8vl49/Selection_539.png?dl=0

Is there anything else we can try? Will a core dump file tell us what's causing it?

Thanks, Rogelio


-- https://telnyx.com Rogelio Perez | engineering | telnyx https://telnyx.com chicago: +1 312 270 8119 | dublin: +353 1 912 6119

Julien Chavanton

7 Aug

3:42 p.m.

I wonder if this was introduced by a regression or if you are facing a specific edge case.

I briefly looked at the commits of DMQ and DMQ_USRLOC, and it seems there was significant work done. I would give 5.0.0 a try; then we would at least learn whether or not this is a recent regression.


Charles Chance

8 Aug

10:36 a.m.

Hi Rogelio,

I have been running master on a three-node lab (one primary, two secondary) for the past 24 hours or so, maintaining 2000 registrations on the primary, replicating to both secondaries, and memory usage has remained constant throughout.

I will leave it running for another 24 hours to be sure but in the meantime, you mentioned you are loading records from DB - which mode are you using for writing (write-through or write-back)? Do you experience the same symptoms if you disable the database completely on the secondary nodes (or just one for testing) and instead, enable sync in dmq_usrloc?

Cheers,

Charles
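
The test suggested above (no database on a secondary node, with dmq_usrloc sync enabled instead) could be configured along these lines. This is a minimal sketch, assuming the usrloc, dmq and dmq_usrloc modules are already loaded and the dmq addresses are set elsewhere in the config; the values shown are not taken from Rogelio's setup.

```cfg
# Secondary node, test configuration (assumed values)

# db_mode 0: usrloc keeps contacts in memory only, no database
modparam("usrloc", "db_mode", 0)

# Replicate usrloc records over DMQ
modparam("dmq_usrloc", "enable", 1)

# Sync usrloc records from the other nodes at startup,
# replacing the load from the database
modparam("dmq_usrloc", "sync", 1)
```

If memory stays flat on the node configured this way, that would point at the interaction between the database writes and replication rather than at DMQ alone.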


Julien Chavanton

22 Aug

2:39 p.m.

Hi Rogelio, did you have any luck digging further into this leak?


Rogelio Perez

23 Aug

4:43 a.m.

Hi Julien,

Thanks for checking on this. I've been working in the background with Charles on this issue and we think we've found a solution, although the cause isn't clear to me yet. Following Charles's advice we changed the usrloc module parameter db_mode from 1 (Write-Through) to 2 (Write-Back), and there have been no more memory leak incidents since then. I'll report back if we have any further updates.

Best, Rogelio
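
In config terms, the change described above is a single usrloc module parameter. The fragment below is a sketch of that one line only; the rest of the usrloc configuration (database URL and so on) is assumed to stay as it was.

```cfg
# usrloc db_mode values mentioned in the thread:
#   1 = Write-Through (the setting in use when the problem occurred)
#   2 = Write-Back    (the setting after which no more incidents were seen)
modparam("usrloc", "db_mode", 2)
```

The thread does not establish why this helps, so it is best treated as a workaround that worked in this deployment rather than a confirmed fix.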


Julien Chavanton

8:52 p.m.

Hi, I am glad you guys found a solution.

Thanks for sharing it.

Regards, Julien

On Wed, Aug 22, 2018, 21:43 Rogelio Perez rogelio@telnyx.com wrote:

...

Hi Julien,

Thanks for checking on this. I've been working in the background with Charles on this issue and we think we've found a solution, although the cause isn't clear to me yet. Following Charles advice we changed the usrloc module parameter db_mode from 1 (Write-Through) to 2 (Write-Back) and there's been no more memory leaks incidents since then. I'll report back if we have any further updates.

Best, Rogelio

On Wed, Aug 22, 2018 at 11:39 AM Julien Chavanton jchavanton@gmail.com wrote:

...
Hi Rogerio, did you have any luck digging this leak further ?

On Wed, Aug 8, 2018 at 3:37 AM Charles Chance < charles.chance@sipcentric.com> wrote:

...
Hi Rogelio,

I have been running master on a three-node lab (one primary, two secondary) for the past 24 hours or so, maintaining 2000 registrations on the primary, replicating to both secondaries, and memory usage has remained constant throughout.

I will leave it running for another 24 hours to be sure but in the meantime, you mentioned you are loading records from DB - which mode are you using for writing (write-through or write-back)? Do you experience the same symptoms if you disable the database completely on the secondary nodes (or just one for testing) and instead, enable sync in dmq_usrloc?

Cheers,

Charles

On 7 August 2018 at 16:42, Julien Chavanton jchavanton@gmail.com wrote:

...
I wonder if this could be introduced by a regression or if you are facing a specific edge case

I briefly looked at the commits of DMQ and DMQ_USRLOC It seems there was significant work done. I would give a try with 5.0.0 and then we will at least learn that this is not a recent regression.

On Mon, Aug 6, 2018 at 1:43 PM, Rogelio Perez rogelio@telnyx.com wrote:

Charles, Julien, Daniel,

The results are pretty much the same: the memory leak is still there, and we need to restart Kamailio when it reaches a certain threshold. https://www.dropbox.com/s/enxx6b7t0c8vl49/Selection_539.png?dl=0

Is there anything else we can try? Will a core dump file tell us what's causing it?

Thanks, Rogelio

On Thu, Aug 2, 2018 at 2:57 PM Rogelio Perez rogelio@telnyx.com wrote:

Thanks Charles, it's working now. I'm deploying to production and will confirm the results soon.

Rogelio


Joel Serrano

24 Aug 2018

2:58 a.m.

I’m planning to use DMQ + usrloc too! Thanks for sharing the solution!

Do you know if it’s also fixed with db_mode=1?


Charles Chance

4 Sep 2018

6:35 p.m.

Hi Joel,

I have not had much time recently to find the cause of the issue with db_mode 1, although I should have some more time over the next couple of weeks.

Out of interest, what will you be using the database for?

Cheers,

Charles


Joel Serrano

6:48 p.m.

Hi Charles,

I'm trying to move _away_ from the database for data replication and hand over to DMQ as much as possible. I currently have db_mode=1 for usrloc, which is why I asked.

If I understand correctly, on a two-node cluster without a database for usrloc and using only DMQ, you are fine as long as both nodes don't go down at the same time. It would be possible to restart the nodes sequentially and not lose registration info, as they would replicate to each other on startup, right?

Do you see a need for the database (specifically talking about usrloc replication) other than persistence in case *all* nodes in the DMQ cluster are down at the same time?

Thanks, Joel.


