Hi,
I am running 4.3.5:1b0c0a, which I am aware is an EOL'd release train, and have a problem with the dialog module I am baffled by.
On many calls - I can't find any correlation to particular kinds of endpoints - I see spoofed BYEs come out of Kamailio after a minute and a half, as if the call had hit a dialog timeout or the dialog module's dead peer detection (OPTIONS pinging) kicked it out.
The thing is, neither of those explanations seem to be borne out by the parameters of the calls under investigation. The dialog timeout on these calls is set to 7200, and the dialog keepalives aren't enabled in either direction. If they were, it'd be logged.
Calls that are terminated in this manner also see the following log message:
WARNING: dialog [dlg_req_within.c:214]: bye_reply_cb(): inconsitent dlg timer data on dlg 0x7f9997317e48 [3390:9644] with clid '1996679936_133218050@x.x.x.x' and tags '2bb17663-co4006-INS001' 'as7925780d'
But I don't think this is unusual for a dialog-spoofed BYE; presumably this is due to a 200 OK for the BYE from both ends.
So anyway, I am trying to track down anything else that could inadvertently cause dialog to hang up calls relatively quickly in that release, or inadvertently set the dialog timeout parameter to something lower than is apparent from the logging and the config.
If anyone has any pointers, that would not go unappreciated!
Many thanks,
-- Alex
Hi Alex,
sounds strange indeed. 4.3.5 is from March 2016, a lot happened since then..
One idea to investigate further - have you tried already to get more debugging information from dialog? With debugger module you could restrict it to dialog, if this is a loaded production instance.
Another idea - are you using functionality like timeout_avp or sst module to update/modify the dialog timeout?
Cheers,
Henning
Am 27.08.19 um 22:31 schrieb Alex Balashov:
Hi,
I am running 4.3.5:1b0c0a, which I am aware is an EOL'd release train, and have a problem with the dialog module I am baffled by.
On many calls - I can't find any correlation to particular kinds of endpoints - I see spoofed BYEs come out of Kamailio after a minute and a half, as if the call had hit a dialog timeout or the dialog module's dead peer detection (OPTIONS pinging) kicked it out.
The thing is, neither of those explanations seem to be borne out by the parameters of the calls under investigation. The dialog timeout on these calls is set to 7200, and the dialog keepalives aren't enabled in either direction. If they were, it'd be logged.
Calls that are terminated in this manner also see the following log message:
WARNING: dialog [dlg_req_within.c:214]: bye_reply_cb(): inconsitent dlg timer data on dlg 0x7f9997317e48 [3390:9644] with clid '1996679936_133218050@x.x.x.x' and tags '2bb17663-co4006-INS001' 'as7925780d'
But I don't think this is unusual for a dialog-spoofed BYE; presumably this is due to a 200 OK for the BYE from both ends.
So anyway, I am trying to track down anything else that could inadvertently cause dialog to hang up calls relatively quickly in that release, or inadvertently set the dialog timeout parameter to something lower than is apparent from the logging and the config.
If anyone has any pointers, that would not go unappreciated!
Many thanks,
-- Alex
Hi Henning,
I am in fact using timeout_avp, but it's set to 2 hours (7200 sec).
However, to follow up on this, I have made the discovery that the caller adulterates my RR cookie parameter value to something other than the value my Kamailio instance injected. It's truncated.
This leads parse_dlg_rr_param() to complain about a malformed RR parameter.
At some point, a reinvite from the caller occurs, and the Route header in both the reinvite and the subsequent e2e ACK to the 200 OK contain an improper RR cookie. 154 seconds after the ACK, Kamailio spoofs the BYE in both directions.
Nevertheless, I am moved to ask: why is Kamailio so actively interventionist in this situation? I have dlg_match_mode modparam set to 1 (cookie with fallback to SIP fundamentals), so presumably Kamailio is able to follow the mutation to the dialog state based on these in-dialog requests, otherwise it would not see a need to end the call after a lot shorter a period than the 7200 sec to which the timeout_avp is set. But if it follows the change based on a fallback to SIP identifiers, why would it care that the cookie is broken, and why does this put the dialog in a state where the dialog module seems to apply some sort of timeout to it that leads to a spoofed BYE?
-- Alex
On Thu, Aug 29, 2019 at 08:08:06AM +0000, Henning Westerholt wrote:
Hi Alex,
sounds strange indeed. 4.3.5 is from March 2016, a lot happened since then..
One idea to investigate further - have you tried already to get more debugging information from dialog? With debugger module you could restrict it to dialog, if this is a loaded production instance.
Another idea - are you using functionality like timeout_avp or sst module to update/modify the dialog timeout?
Cheers,
Henning
Am 27.08.19 um 22:31 schrieb Alex Balashov:
Hi,
I am running 4.3.5:1b0c0a, which I am aware is an EOL'd release train, and have a problem with the dialog module I am baffled by.
On many calls - I can't find any correlation to particular kinds of endpoints - I see spoofed BYEs come out of Kamailio after a minute and a half, as if the call had hit a dialog timeout or the dialog module's dead peer detection (OPTIONS pinging) kicked it out.
The thing is, neither of those explanations seem to be borne out by the parameters of the calls under investigation. The dialog timeout on these calls is set to 7200, and the dialog keepalives aren't enabled in either direction. If they were, it'd be logged.
Calls that are terminated in this manner also see the following log message:
WARNING: dialog [dlg_req_within.c:214]: bye_reply_cb(): inconsitent dlg timer data on dlg 0x7f9997317e48 [3390:9644] with clid '1996679936_133218050@x.x.x.x' and tags '2bb17663-co4006-INS001' 'as7925780d'
But I don't think this is unusual for a dialog-spoofed BYE; presumably this is due to a 200 OK for the BYE from both ends.
So anyway, I am trying to track down anything else that could inadvertently cause dialog to hang up calls relatively quickly in that release, or inadvertently set the dialog timeout parameter to something lower than is apparent from the logging and the config.
If anyone has any pointers, that would not go unappreciated!
Many thanks,
-- Alex
-- Henning Westerholt - https://skalatan.de/blog/ Kamailio services - https://skalatan.de/services
Hello Alex,
just looked briefly into the code. There is only one place in the dialog module that ends a dialog with "dlg_bye_all" - and this happens in the dlg_ontimeout routine, if there is the proper flag set for the particular dialog.
Do you have the send_bye parameter set to 1 or use the $dlg_ctx(timeout_bye) PV? Is there any configured timeout that relates to the ~150s that you see in the module? It is always ~150s or does it change from call to call?
About the other topic - the matching mode: if there is an existing, but broken RR DID parameter, the code will stop processing and not try again the other match mode.
This is correct according the docs "1 - DID_FALLBACK - the match is first tried based on DID and if not present, it will fall back to SIP matching;", but could be changed to be more resilient, I agree.
Cheers,
Henning
Am 01.09.19 um 04:53 schrieb Alex Balashov:
Hi Henning,
I am in fact using timeout_avp, but it's set to 2 hours (7200 sec).
However, to follow up on this, I have made the discovery that the caller adulterates my RR cookie parameter value to something other than the value my Kamailio instance injected. It's truncated.
This leads parse_dlg_rr_param() to complain about a malformed RR parameter.
At some point, a reinvite from the caller occurs, and the Route header in both the reinvite and the subsequent e2e ACK to the 200 OK contain an improper RR cookie. 154 seconds after the ACK, Kamailio spoofs the BYE in both directions.
Nevertheless, I am moved to ask: why is Kamailio so actively interventionist in this situation? I have dlg_match_mode modparam set to 1 (cookie with fallback to SIP fundamentals), so presumably Kamailio is able to follow the mutation to the dialog state based on these in-dialog requests, otherwise it would not see a need to end the call after a lot shorter a period than the 7200 sec to which the timeout_avp is set. But if it follows the change based on a fallback to SIP identifiers, why would it care that the cookie is broken, and why does this put the dialog in a state where the dialog module seems to apply some sort of timeout to it that leads to a spoofed BYE?
-- Alex
On Thu, Aug 29, 2019 at 08:08:06AM +0000, Henning Westerholt wrote:
Hi Alex,
sounds strange indeed. 4.3.5 is from March 2016, a lot happened since then..
One idea to investigate further - have you tried already to get more debugging information from dialog? With debugger module you could restrict it to dialog, if this is a loaded production instance.
Another idea - are you using functionality like timeout_avp or sst module to update/modify the dialog timeout?
Cheers,
Henning
Am 27.08.19 um 22:31 schrieb Alex Balashov:
Hi,
I am running 4.3.5:1b0c0a, which I am aware is an EOL'd release train, and have a problem with the dialog module I am baffled by.
On many calls - I can't find any correlation to particular kinds of endpoints - I see spoofed BYEs come out of Kamailio after a minute and a half, as if the call had hit a dialog timeout or the dialog module's dead peer detection (OPTIONS pinging) kicked it out.
The thing is, neither of those explanations seem to be borne out by the parameters of the calls under investigation. The dialog timeout on these calls is set to 7200, and the dialog keepalives aren't enabled in either direction. If they were, it'd be logged.
Calls that are terminated in this manner also see the following log message:
WARNING: dialog [dlg_req_within.c:214]: bye_reply_cb(): inconsitent dlg timer data on dlg 0x7f9997317e48 [3390:9644] with clid '1996679936_133218050@x.x.x.xmailto:1996679936_133218050@x.x.x.x' and tags '2bb17663-co4006-INS001' 'as7925780d'
But I don't think this is unusual for a dialog-spoofed BYE; presumably this is due to a 200 OK for the BYE from both ends.
So anyway, I am trying to track down anything else that could inadvertently cause dialog to hang up calls relatively quickly in that release, or inadvertently set the dialog timeout parameter to something lower than is apparent from the logging and the config.
If anyone has any pointers, that would not go unappreciated!
Many thanks,
-- Alex
-- Henning Westerholt - https://skalatan.de/blog/ Kamailio services - https://skalatan.de/services
-- Henning Westerholt - https://skalatan.de/blog/ Kamailio services - https://skalatan.de/services
Hi Henning,
On Sun, Sep 01, 2019 at 09:35:43AM +0000, Henning Westerholt wrote:
just looked briefly into the code. There is only one place in the dialog module that ends a dialog with "dlg_bye_all" - and this happens in the dlg_ontimeout routine, if there is the proper flag set for the particular dialog.
Do you have the send_bye parameter set to 1 or use the $dlg_ctx(timeout_bye) PV? Is there any configured timeout that relates to the ~150s that you see in the module? It is always ~150s or does it change from call to call?
No, mysteriously there is not.
While the value of the timeout_avp can vary from call to call, in the case of these particular calls--and indeed, 95%+ of calls--it is set to 7200 sec. The default_timeout modparam is set to 28800 sec (8 hours).
The timeout period always seems to be 150-160 sec, but as you note, the arithmetical relationship between this period and any formally declared timeouts is a mystery.
About the other topic - the matching mode: if there is an existing, but broken RR DID parameter, the code will stop processing and not try again the other match mode.
This is correct according the docs "1 - DID_FALLBACK - the match is first tried based on DID and if not present, it will fall back to SIP matching;", but could be changed to be more resilient, I agree.
Ah, I see. Well, "if not present" is a bit ambiguous from a layperson's perspective; in my case the cookie is present but incorrect and so does not match a known dialog, so I didn't know offhand if the code treat the parameter as though it were altogether absent, or whether it would treat it as present but wrong and abort processing as you suggest. :-)
-- Alex
On Mon, Sep 02, 2019 at 04:04:59AM -0400, Alex Balashov wrote:
On Sun, Sep 01, 2019 at 09:35:43AM +0000, Henning Westerholt wrote:
just looked briefly into the code. There is only one place in the dialog module that ends a dialog with "dlg_bye_all" - and this happens in the dlg_ontimeout routine, if there is the proper flag set for the particular dialog.
Do you have the send_bye parameter set to 1 or use the $dlg_ctx(timeout_bye) PV? Is there any configured timeout that relates to the ~150s that you see in the module? It is always ~150s or does it change from call to call?
Also, I neglected to answer your question: yes, I set 'send_bye' modparam rather than using $dlg_ctx(timeout_bye) = 1.
Hi Alex,
answer inline below
Am 02.09.19 um 10:04 schrieb Alex Balashov:
just looked briefly into the code. There is only one place in the dialog module that ends a dialog with "dlg_bye_all" - and this happens in the dlg_ontimeout routine, if there is the proper flag set for the particular dialog.
Do you have the send_bye parameter set to 1 or use the $dlg_ctx(timeout_bye) PV? Is there any configured timeout that relates to the ~150s that you see in the module? It is always ~150s or does it change from call to call?
So you've set dhe dlg_bye_all parameter. Could you try to deactivate it just for a brief time and see if the problem does occur as well? Then we know at for sure that it is related to this functionality.
No, mysteriously there is not.
While the value of the timeout_avp can vary from call to call, in the case of these particular calls--and indeed, 95%+ of calls--it is set to 7200 sec. The default_timeout modparam is set to 28800 sec (8 hours).
The timeout period always seems to be 150-160 sec, but as you note, the arithmetical relationship between this period and any formally declared timeouts is a mystery.
Another idea , there is the dlg_clean_timer which is 90s. But this is still a bit off to your observed time period.
Maybe an additional timer is running before.
About the other topic - the matching mode: if there is an existing, but broken RR DID parameter, the code will stop processing and not try again the other match mode.
This is correct according the docs "1 - DID_FALLBACK - the match is first tried based on DID and if not present, it will fall back to SIP matching;", but could be changed to be more resilient, I agree.
Ah, I see. Well, "if not present" is a bit ambiguous from a layperson's perspective; in my case the cookie is present but incorrect and so does not match a known dialog, so I didn't know offhand if the code treat the parameter as though it were altogether absent, or whether it would treat it as present but wrong and abort processing as you suggest. :-)
Yes, this could be improved, I agree.
Cheers,
Henning