On 03/11/14 17:05, Alex Hermann wrote:
On Monday 03 November 2014, Daniel-Constantin Mierla wrote:
Reference counters are good as long as they are used in predictable circumstances. The problem encountered so far were related to the fact that not all calls have proper signaling (e.g., network issues, buggy clients), the reason for this cleanup routine as well
True. But even then refcounting shouldn't be bypassed. For each (cleanup) situation the expected refcount should be compared to the actual refcount before deleting.
Not sure what would be expected reference counter value in each of the states. The aim of the cleanup was to go beyond reasonable time a dialog can stay in a state and clean up to avoid memory leaks.
For the situation at hand, if the refcount is >1, something is handling the dlg (or has buggy code not unreffing after finishing).
In theory, the dialog cannot last for ever in initial state, so cleaning it up make sense to stay safe of memory leaks at runtime -- we can print log messages to alert, but keeping it for ever brings more troubles (e.g., besides memory leaks, the call limit can be affected).
I think my fix catches those situations tough. When the ACK is missing, the refcount should be 1. When another process is handling the dlg, the refcount will be >1.
As I said, for operations that are long term, I consider better to keep dlg index and label (two integers) than increasing the reference count and storing the pointer (on 64b it is the same size). Then lookup again the dialog when needed. Finding the reason of what determined you to do this is important, not to push kind of workaround that we don't fully understand its purpose and benefits.
IMO, it is not right to keep a dialog in proxy as long as caller/callee don't have it anymore.
You can eventually make the behavior of the patch configurable, so you can enable it via modparam, in the current form I don't find it ok.
(i.e., also sip protocol relies on a time interval for not receiving ACK).
This specific situation is not about missing ACK. That would be DLG_STATE_CONFIRMED_NA.
It was a quick example that sip relies on time intervals for changing states.
My goal was to figure out what was the situation, see if it is something predictable that can be fixed in source code -- dialogs staying too long in a state that shouldn't take too long are susceptible to issues and it is better if we know what was the reason.
It is weird that the TMCB_DESTROY callback didn't cleanup the dialog. I haven't investigated the cause as i could fix the segfault pretty easily.
So far sounds like fixing something that we don't know what was and the side effects are not good. I would like to get to a proper solution, knowing what was the issue.
Cheers, Daniel