Module: sip-router Branch: master Commit: 3eef4c8be5cc2b6ccba993a62f8c8039d1adf7da URL: http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commit;h=3eef4c8b...
Author: Alex Hermann alex@speakup.nl Committer: Alex Hermann alex@speakup.nl Date: Thu Oct 23 15:09:05 2014 +0200
dialog: Don't destroy dialogs in timer while they're still being referenced
---
modules/dialog/dlg_hash.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/modules/dialog/dlg_hash.c b/modules/dialog/dlg_hash.c index 09166b8..31ad428 100644 --- a/modules/dialog/dlg_hash.c +++ b/modules/dialog/dlg_hash.c @@ -242,7 +242,7 @@ int dlg_clean_run(ticks_t ti) while (dlg) { tdlg = dlg; dlg = dlg->next; - if(tdlg->state==DLG_STATE_UNCONFIRMED && tdlg->init_ts<tm-300) { + if(tdlg->state==DLG_STATE_UNCONFIRMED && tdlg->init_ts<tm-300 && tdlg->ref<=1) { /* dialog in early state older than 5min */ LM_NOTICE("dialog in early state is too old (%p ref %d)\n", tdlg, tdlg->ref);
Hello,
what would be the situation to happen like that? Have you spotted a case when an non established dialog lasting for more than 5 minutes is still referenced externally by pointer? For cases when it can take many minutes to come back to a dialog, the dialog ids should be cloned and used for searching dialog when needed.
Cheers, Daniel
On 03/11/14 10:34, Alex Hermann wrote:
Module: sip-router Branch: master Commit: 3eef4c8be5cc2b6ccba993a62f8c8039d1adf7da URL: http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commit;h=3eef4c8b...
Author: Alex Hermann alex@speakup.nl Committer: Alex Hermann alex@speakup.nl Date: Thu Oct 23 15:09:05 2014 +0200
dialog: Don't destroy dialogs in timer while they're still being referenced
modules/dialog/dlg_hash.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/modules/dialog/dlg_hash.c b/modules/dialog/dlg_hash.c index 09166b8..31ad428 100644 --- a/modules/dialog/dlg_hash.c +++ b/modules/dialog/dlg_hash.c @@ -242,7 +242,7 @@ int dlg_clean_run(ticks_t ti) while (dlg) { tdlg = dlg; dlg = dlg->next;
if(tdlg->state==DLG_STATE_UNCONFIRMED && tdlg->init_ts<tm-300) {
if(tdlg->state==DLG_STATE_UNCONFIRMED && tdlg->init_ts<tm-300 && tdlg->ref<=1) { /* dialog in early state older than 5min */ LM_NOTICE("dialog in early state is too old (%p ref %d)\n", tdlg, tdlg->ref);
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
On Monday 03 November 2014, Daniel-Constantin Mierla wrote:
what would be the situation to happen like that? Have you spotted a case when an non established dialog lasting for more than 5 minutes is still referenced externally by pointer?
I had a segfault because another process was handling a very late reply. I don't know the exact circumstances that led to it.
For cases when it can take many minutes to come back to a dialog, the dialog ids should be cloned and used for searching dialog when needed.
The whole idea of refcounting is not to hold a lock during processing, but just during obtaining a pointer to the object. Refcounted objects should never be deleted when they are still being referenced by others.
In this case, the process handling the reply behaved well and had upped the refcounter. The timer process ignored the refcount and freed the object still in use.
Now that i think of it, the proper thing to do here would probably be just to unref the dlg instead of open-coding around it.
On 03/11/14 16:32, Alex Hermann wrote:
On Monday 03 November 2014, Daniel-Constantin Mierla wrote:
what would be the situation to happen like that? Have you spotted a case when an non established dialog lasting for more than 5 minutes is still referenced externally by pointer?
I had a segfault because another process was handling a very late reply. I don't know the exact circumstances that led to it.
For cases when it can take many minutes to come back to a dialog, the dialog ids should be cloned and used for searching dialog when needed.
The whole idea of refcounting is not to hold a lock during processing, but just during obtaining a pointer to the object. Refcounted objects should never be deleted when they are still being referenced by others.
In this case, the process handling the reply behaved well and had upped the refcounter. The timer process ignored the refcount and freed the object still in use.
Now that i think of it, the proper thing to do here would probably be just to unref the dlg instead of open-coding around it.
Reference counters are good as long as they are used in predictable circumstances. The problem encountered so far were related to the fact that not all calls have proper signaling (e.g., network issues, buggy clients), the reason for this cleanup routine as well (i.e., also sip protocol relies on a time interval for not receiving ACK).
My goal was to figure out what was the situation, see if it is something predictable that can be fixed in source code -- dialogs staying too long in a state that shouldn't take too long are susceptible to issues and it is better if we know what was the reason.
Daniel
On Monday 03 November 2014, Daniel-Constantin Mierla wrote:
Reference counters are good as long as they are used in predictable circumstances. The problem encountered so far were related to the fact that not all calls have proper signaling (e.g., network issues, buggy clients), the reason for this cleanup routine as well
True. But even then refcounting shouldn't be bypassed. For each (cleanup) situation the expected refcount should be compared to the actual refcount before deleting.
For the situation at hand, if the refcount is >1, something is handling the dlg (or has buggy code not unreffing after finishing).
I think my fix catches those situations tough. When the ACK is missing, the refcount should be 1. When another process is handling the dlg, the refcount will be >1.
(i.e., also sip protocol relies on a time interval for not receiving ACK).
This specific situation is not about missing ACK. That would be DLG_STATE_CONFIRMED_NA.
My goal was to figure out what was the situation, see if it is something predictable that can be fixed in source code -- dialogs staying too long in a state that shouldn't take too long are susceptible to issues and it is better if we know what was the reason.
It is weird that the TMCB_DESTROY callback didn't cleanup the dialog. I haven't investigated the cause as i could fix the segfault pretty easily.
On 03/11/14 17:05, Alex Hermann wrote:
On Monday 03 November 2014, Daniel-Constantin Mierla wrote:
Reference counters are good as long as they are used in predictable circumstances. The problem encountered so far were related to the fact that not all calls have proper signaling (e.g., network issues, buggy clients), the reason for this cleanup routine as well
True. But even then refcounting shouldn't be bypassed. For each (cleanup) situation the expected refcount should be compared to the actual refcount before deleting.
Not sure what would be expected reference counter value in each of the states. The aim of the cleanup was to go beyond reasonable time a dialog can stay in a state and clean up to avoid memory leaks.
For the situation at hand, if the refcount is >1, something is handling the dlg (or has buggy code not unreffing after finishing).
In theory, the dialog cannot last for ever in initial state, so cleaning it up make sense to stay safe of memory leaks at runtime -- we can print log messages to alert, but keeping it for ever brings more troubles (e.g., besides memory leaks, the call limit can be affected).
I think my fix catches those situations tough. When the ACK is missing, the refcount should be 1. When another process is handling the dlg, the refcount will be >1.
As I said, for operations that are long term, I consider better to keep dlg index and label (two integers) than increasing the reference count and storing the pointer (on 64b it is the same size). Then lookup again the dialog when needed. Finding the reason of what determined you to do this is important, not to push kind of workaround that we don't fully understand its purpose and benefits.
IMO, it is not right to keep a dialog in proxy as long as caller/callee don't have it anymore.
You can eventually make the behavior of the patch configurable, so you can enable it via modparam, in the current form I don't find it ok.
(i.e., also sip protocol relies on a time interval for not receiving ACK).
This specific situation is not about missing ACK. That would be DLG_STATE_CONFIRMED_NA.
It was a quick example that sip relies on time intervals for changing states.
My goal was to figure out what was the situation, see if it is something predictable that can be fixed in source code -- dialogs staying too long in a state that shouldn't take too long are susceptible to issues and it is better if we know what was the reason.
It is weird that the TMCB_DESTROY callback didn't cleanup the dialog. I haven't investigated the cause as i could fix the segfault pretty easily.
So far sounds like fixing something that we don't know what was and the side effects are not good. I would like to get to a proper solution, knowing what was the issue.
Cheers, Daniel
On Tuesday 04 November 2014, Daniel-Constantin Mierla wrote:
On 03/11/14 17:05, Alex Hermann wrote:
On Monday 03 November 2014, Daniel-Constantin Mierla wrote:
True. But even then refcounting shouldn't be bypassed. For each (cleanup) situation the expected refcount should be compared to the actual refcount before deleting.
Not sure what would be expected reference counter value in each of the states.
I think it is necessary to know it to be able to securely cleanup dialogs.
In theory, the dialog cannot last for ever in initial state, so cleaning it up make sense to stay safe of memory leaks at runtime -- we can print log messages to alert, but keeping it for ever brings more troubles (e.g., besides memory leaks, the call limit can be affected).
I agree those must be cleaned, even though it is not technically a memory leak, as the dialog pointer is still used in the dialog hash.
As I said, for operations that are long term, I consider better to keep dlg index and label (two integers) than increasing the reference count and storing the pointer (on 64b it is the same size). Then lookup again the dialog when needed.
Who is talking about "long term" operations? The duration is not of any relevance here.
Almost none of the dialog related operations apart from getting the dlg pointer are protected by locks but by the refcounter. Suppose process 1 obtains a dlg pointer (from the h_entry, and h_id), and upped the refcounter as a result. It can get scheduled away at any random time, even directly after obtaining the dlg pointer. If the timer process get scheduled at that moment and frees the memory of the dlg (disregarding the fact that is is still in use), how is process 1 to know that?
Finding the reason of what determined you to do this is important, not to push kind of workaround that we don't fully understand its purpose and benefits.
IMO, it is not right to keep a dialog in proxy as long as caller/callee don't have it anymore.
I agree, just not with the method you used to get rid of the dialog.
You can eventually make the behavior of the patch configurable, so you can enable it via modparam, in the current form I don't find it ok.
I'm not going to make a known segfault path configurable, so i'll revert it if you insist. But may i kindly suggest you revert your code too which introduced this code in the first place as, IMHO, it is fundamentally wrong by deleting an object without checking if it is in use.
(i.e., also sip protocol relies on a time interval for not receiving ACK).
This specific situation is not about missing ACK. That would be DLG_STATE_CONFIRMED_NA.
It was a quick example that sip relies on time intervals for changing states.
My goal was to figure out what was the situation, see if it is something predictable that can be fixed in source code -- dialogs staying too long in a state that shouldn't take too long are susceptible to issues and it is better if we know what was the reason.
It is weird that the TMCB_DESTROY callback didn't cleanup the dialog. I haven't investigated the cause as i could fix the segfault pretty easily.
So far sounds like fixing something that we don't know what was and the side effects are not good. I would like to get to a proper solution, knowing what was the issue.
I'm sorry, but IMHO this (and earlier) argument applies more to your initial cleanup routine for the stray dialogs than to my fix for the segfault.
On 04/11/14 14:19, Alex Hermann wrote:
On Tuesday 04 November 2014, Daniel-Constantin Mierla wrote:
On 03/11/14 17:05, Alex Hermann wrote:
On Monday 03 November 2014, Daniel-Constantin Mierla wrote: True. But even then refcounting shouldn't be bypassed. For each (cleanup) situation the expected refcount should be compared to the actual refcount before deleting.
Not sure what would be expected reference counter value in each of the states.
I think it is necessary to know it to be able to securely cleanup dialogs.
In theory, the dialog cannot last for ever in initial state, so cleaning it up make sense to stay safe of memory leaks at runtime -- we can print log messages to alert, but keeping it for ever brings more troubles (e.g., besides memory leaks, the call limit can be affected).
I agree those must be cleaned, even though it is not technically a memory leak, as the dialog pointer is still used in the dialog hash.
As I said, for operations that are long term, I consider better to keep dlg index and label (two integers) than increasing the reference count and storing the pointer (on 64b it is the same size). Then lookup again the dialog when needed.
Who is talking about "long term" operations? The duration is not of any relevance here.
Almost none of the dialog related operations apart from getting the dlg pointer are protected by locks but by the refcounter. Suppose process 1 obtains a dlg pointer (from the h_entry, and h_id), and upped the refcounter as a result. It can get scheduled away at any random time, even directly after obtaining the dlg pointer. If the timer process get scheduled at that moment and frees the memory of the dlg (disregarding the fact that is is still in use), how is process 1 to know that?
Finding the reason of what determined you to do this is important, not to push kind of workaround that we don't fully understand its purpose and benefits.
IMO, it is not right to keep a dialog in proxy as long as caller/callee don't have it anymore.
I agree, just not with the method you used to get rid of the dialog.
You can eventually make the behavior of the patch configurable, so you can enable it via modparam, in the current form I don't find it ok.
I'm not going to make a known segfault path configurable, so i'll revert it if you insist. But may i kindly suggest you revert your code too which introduced this code in the first place as, IMHO, it is fundamentally wrong by deleting an object without checking if it is in use.
If you discovered a flaw in existing code, then should be fixed properly. It is what I am trying to do as well by discussing your patch. This is the benefit of open source, reviewing patches for mutual benefits, and I do hope other people check my commits as well.
I didn't say you didn't get an issue with it and the existing old code was perfect, I am just trying to understand the issue and be sure it is the right fix, with no side effects. I will be more than happy to solve what you found as problem in the old patch, together we what I found not clear with the last patch.
You said you don't have a backtrace for it and also you didn't provide any further details about the situation, but now you refer to it as a 'known segfault'. If you can provide more info about the crash, it can help to really solve the problem.
Just reverting and going back in another buggy state it is not good. I expressed my opinions that with the patch, it seems we can get back to the state with dialogs stuck in memory for ever. The config was more like a temporary solution if you had any pressure on having the patch, to give everyone an easy option to decide how to use it.
Otherwise, yes, I do want to get it properly fixed, with no options.
Cheers, Daniel