Hello,
We are experiencing an issue when having some blocking code in the
dialog:end event route. If the code takes too long to execute then CDR
information is not logged.
In our case we have a blocking socket trying to establish a connection
and waiting 500ms before failing. When it timeouts, the accounting
module will never log any CDR logs for that call. It appears that the
callbacks for DLGCB_TERMINATED_CONFIRMED event are never called. That is
because the function dlg_terminated_confirmed is never called (this
function is registered as a callback to the TMCB_RESPONSE_OUT event from
tm module).
After further investigation, the problem seems to be related to a race
condition between two processes when dealing with ending a dialog:
From what we saw the dialog:end event route is treated when a BYE
request is received. But if that takes too long, and that BYE request
will be retransmitted (we are using SIPp to test this, and SIPp will
automatically retransmit the message), it will then be treated by a
different process.
When this happens, the first process has already modified the dialog
state, so the second process will not execute code specific to dialog
ending from the dialog module dlg_onroute function. It will not execute
the event route, nor the DLGCB_TERMINATED event callbacks, and it will
neither register the dlg_terminated_confirmed function to the
TMCB_RESPONSE_OUT event. This means that the second process will
actually finish processing the BYE faster then the first process. By the
time the first process finishes with processing the dialog:end event
route, the transaction corresponding to the BYE request is already
created, and even though dlg_terminated_confirmed function is eventually
registered for the TMCB_RESPONSE_OUT event, it is too late because this
event was already treated, and all callbacks registered for it were
called. This means that DLGCB_TERMINATED_CONFIRMED event will never be
treated, and the registered callbacks for this event (including the
accounting module functions) will not get called.
We managed to work around this behavior by decreasing the timeout of the
socket so that the BYE can be processed before a retransmit occurs, but
maybe there is a different solution to fix this issue. Do you have any
suggestions on how to solve this problem?
Thank you,
Claudiu Boriga.