I have a 1.5.3 installation functioning as a registrar that is exhibiting a very curious, if only infrequent set of behaviours. The host is CentOS 5.x, with PostgreSQL 9.0 backing usrloc, and db_mode = 1 (immediate write-through).
I have a registration handler in my initial request route that begins with a log message:
...
else if(is_method("REGISTER")) { xlog("L_INFO", "... Processing REGISTER from $si:$sp for AOR $tu\n"); route(2); exit; }
I've got a registrant who consistently re-registers with an expiration time of 120, so in theory they ought to be re-registering every two minutes or so. Most of the time they do. But, occasionally, I'll see a gap of 5-8 minutes go by, in which the above log message does not appear, and their registration expires so that they can't receive calls.
That's not what's interesting. What's interesting is that I did a packet capture on the proxy and caught one of these scenarios. Within that ~8 minute window when the registration expired and calls were failing, the UAC actually re-registered several times, on normal ~2 minute boundaries, AND the proxy successfully challenged the request and sent a 200 OK indicating that the binding was stored!
I double-checked all the parameters to ensure that the 401s and 200 OKs corresponded to the right REGISTER flow, etc. - yes, it's all correct.
Is there any imaginable set of circumstances or a known bug that would cause Kamailio to successfully authenticate and affirm a registration request, while neither logging its receipt, nor, evidently, storing it to the database? I can envision a number of database failure modes that would account for the binding not actually being in the 'location' table, but that doesn't explain why the request wouldn't even be logged. That's what has me baffled.
Thanks!
-- Alex
Alex Balashov writes:
else if(is_method("REGISTER")) { xlog("L_INFO", "... Processing REGISTER from $si:$sp for
AOR $tu\n"); route(2); exit; }
no matter what db you use, the above would print the syslog message IF the piece of code is executed for register requests. check again the logic in your config script. perhaps there is a way around the above piece of code.
-- juha
On 11/17/2011 11:02 PM, Juha Heinanen wrote:
else if(is_method("REGISTER")) { xlog("L_INFO", "... Processing REGISTER from $si:$sp for
AOR $tu\n"); route(2); exit; }
no matter what db you use, the above would print the syslog message IF the piece of code is executed for register requests. check again the logic in your config script. perhaps there is a way around the above piece of code.
That was my thought, as well. However, I don't see it. There's only one path to route #2, and only route #2 contains any registration processing. There is only the bare minimum boilerplate above that log statement, e.g. message length, loop detection, in-dialog request handling, etc. All authentication and higher-level processing takes place afterward.
It's not logically possible. I checked, believe me.
On 11/17/2011 11:15 PM, Juha Heinanen wrote:
Alex Balashov writes:
It's not logically possible. I checked, believe me.
ok, then start adding more xlog statements in order to figure out what happens.
That would imply a deterministic state of affairs. But this happens very infrequently, no more than once a day for a few minutes, and only to a very small percentage of the users.
There is nothing structurally different about the REGISTER requests that take place within the problem interval vs. outside of it; I scrutinised that in great detail.
On 18.11.2011 05:07, Alex Balashov wrote:
On 11/17/2011 11:02 PM, Juha Heinanen wrote:
else if(is_method("REGISTER")) { xlog("L_INFO", "... Processing REGISTER from $si:$sp for AOR $tu\n"); route(2); exit; }
no matter what db you use, the above would print the syslog message IF the piece of code is executed for register requests. check again the logic in your config script. perhaps there is a way around the above piece of code.
That was my thought, as well. However, I don't see it. There's only one path to route #2, and only route #2 contains any registration processing. There is only the bare minimum boilerplate above that log statement, e.g. message length, loop detection, in-dialog request handling, etc. All authentication and higher-level processing takes place afterward.
It's not logically possible. I checked, believe me.
Then, the problem may be the logging: - sychronous vs asynchronous syslogging? - local or via UDP to a remote syslog server?
klaus
On 11/18/2011 05:47 AM, Klaus Darilion wrote:
Then, the problem may be the logging:
- sychronous vs asynchronous syslogging?
It's asynchronous.
- local or via UDP to a remote syslog server?
It's local.
On Nov 18, 2011, at 5:47 AM, Klaus Darilion wrote:
On 18.11.2011 05:07, Alex Balashov wrote:
On 11/17/2011 11:02 PM, Juha Heinanen wrote:
else if(is_method("REGISTER")) { xlog("L_INFO", "... Processing REGISTER from $si:$sp for AOR $tu\n"); route(2); exit; }
What happens if you move the Register to the beginning of the if statement?
With best regards,
Fred http://qxork.com
On 11/18/2011 06:24 AM, Fred Posner wrote:
What happens if you move the Register to the beginning of the if statement?
While I have not empirically investigated this, the structure literally looks like this:
if(is_method("OPTIONS")) { # respond to OPTIONS pings exit; }
else if(is_method("INVITE")) { # route the INVITE exit; }
else if(is_method("REGISTER")) { xlog(...); route(2); }
else { # 403 Forbidden }
Also, if this were a logic problem, wouldn't it happen all the time? This registrar has many thousands of users, out of which about half a dozen have reported this problem, and for them, it only occurs for at most a few minutes out of each day.
It can look stupidity, but, I dislike this 'else if' structure.... so, suppressing the 'else' from the 'else if' wouldn't change in nothing the logic, so have you tested this suppression?
Edson.
Em 18/11/2011 09:32, Alex Balashov escreveu:
On 11/18/2011 06:24 AM, Fred Posner wrote:
What happens if you move the Register to the beginning of the if statement?
While I have not empirically investigated this, the structure literally looks like this:
if(is_method("OPTIONS")) { # respond to OPTIONS pings exit; }
else if(is_method("INVITE")) { # route the INVITE exit; }
else if(is_method("REGISTER")) { xlog(...); route(2); }
else { # 403 Forbidden }
Also, if this were a logic problem, wouldn't it happen all the time? This registrar has many thousands of users, out of which about half a dozen have reported this problem, and for them, it only occurs for at most a few minutes out of each day.
On 11/18/2011 02:04 PM, Edson - Lists wrote:
It can look stupidity, but, I dislike this 'else if' structure.... so, suppressing the 'else' from the 'else if' wouldn't change in nothing the logic, so have you tested this suppression?
It logically protects from situations in which, hypothetically, is_method() could malfunction.
No, I haven't tested it. But again, this is obviously not a logic problem. 99.99% of the registrations work fine, 99% of the time.
On 11/17/2011 10:28 PM, Alex Balashov wrote:
I have a 1.5.3 installation functioning as a registrar that is exhibiting a very curious, if only infrequent set of behaviours. The host is CentOS 5.x, with PostgreSQL 9.0 backing usrloc, and db_mode = 1 (immediate write-through).
I have a registration handler in my initial request route that begins with a log message:
...
else if(is_method("REGISTER")) { xlog("L_INFO", "... Processing REGISTER from $si:$sp for AOR $tu\n"); route(2); exit; }
I've got a registrant who consistently re-registers with an expiration time of 120, so in theory they ought to be re-registering every two minutes or so. Most of the time they do. But, occasionally, I'll see a gap of 5-8 minutes go by, in which the above log message does not appear, and their registration expires so that they can't receive calls.
That's not what's interesting. What's interesting is that I did a packet capture on the proxy and caught one of these scenarios. Within that ~8 minute window when the registration expired and calls were failing, the UAC actually re-registered several times, on normal ~2 minute boundaries, AND the proxy successfully challenged the request and sent a 200 OK indicating that the binding was stored!
I double-checked all the parameters to ensure that the 401s and 200 OKs corresponded to the right REGISTER flow, etc. - yes, it's all correct.
Is there any imaginable set of circumstances or a known bug that would cause Kamailio to successfully authenticate and affirm a registration request, while neither logging its receipt, nor, evidently, storing it to the database? I can envision a number of database failure modes that would account for the binding not actually being in the 'location' table, but that doesn't explain why the request wouldn't even be logged. That's what has me baffled.
My logic tells me: 1) If its not hitting your xlog in your register section then its not registering to this server 2) If your packet capture is showing a successful registration and you cannot see it registered in this server then its registering somewhere else. 3) Is it possible that your server is simply re-routing the packet to another server before it hits your xlog statement and thus the client is registering there? A full packet capture in and out of the server might give you a hint, but if you have thousands of customers that might be tough. On the other hand, if you have thousands of customers, I would venture to guess that you have more than one server where users can register and that might explain why the trace shows a successful registration.
Cheers.
On 11/18/2011 03:16 PM, Andres wrote:
My logic tells me:
- If its not hitting your xlog in your register section then its not
registering to this server
The registrar is running inside an OpenVZ container, so the first possibility I investigated was that someone had inadvertently cloned the container on the same hardware node and that there was an IP address conflict. However, this turned out not to be the case. I also double-checked the MAC addresses on the registrations within the successful interval vs. outside of it.
- If your packet capture is showing a successful registration and
you cannot see it registered in this server then its registering somewhere else.
That stands to reason, but I cannot find any evidence apart to support that viewpoint, nor anything about the packet capture (e.g. different hardware or instance signatures) that would seem to corroborate that.
- Is it possible that your server is simply re-routing the packet
to another server before it hits your xlog statement and thus the client is registering there?
I entertained that possibility, but can't find any indication of how this would be possible, particularly in an arbitrary, small number of cases for very brief periods.
A full packet capture in and out of the server might give you a hint, but if you have thousands of customers that might be tough.
I took a packet capture at the edge router of the network as well as on the hardware node itself, in addition to the container. All of them tell the same story.
On the other hand, if you have thousands of customers, I would venture to guess that you have more than one server where users can register and that might explain why the trace shows a successful registration.
There is just one registrar -- the one under discussion. It's the main reason Kamailio is used here; it is the only OSS registrar I know of that has enough throughput capacity to sustain thousands of devices banging on it with relatively short re-registration intervals.
There is just one registrar -- the one under discussion. It's the main reason Kamailio is used here; it is the only OSS registrar I know of that has enough throughput capacity to sustain thousands of devices banging on it with relatively short re-registration intervals.
Hi Alex,
At this point I would consider customizing your script to append an extra header to your onreply_route logic using something like append_hf("P-hint: bla...bla...bla\r\n"). My guess is that a packet capture will show the header inserted under normal circumstances, but will be absent during the mystery registration in the 8 minute window. If that happens, it means your mystery registration is not hitting this server or its not following your expected logic in the config script.
Hope it helps.
On 11/18/2011 04:23 PM, Andres wrote:
There is just one registrar -- the one under discussion. It's the main reason Kamailio is used here; it is the only OSS registrar I know of that has enough throughput capacity to sustain thousands of devices banging on it with relatively short re-registration intervals.
Hi Alex,
At this point I would consider customizing your script to append an extra header to your onreply_route logic using something like append_hf("P-hint: bla...bla...bla\r\n"). My guess is that a packet capture will show the header inserted under normal circumstances, but will be absent during the mystery registration in the 8 minute window. If that happens, it means your mystery registration is not hitting this server or its not following your expected logic in the config script.
Yeah, I've already "tagged" the registration request in similar manner. But this is a very good idea, thank you!
I think I've figured out what's going on. Unsurprisingly, it isn't a Kamailio problem--my proposed thesis seemed as improbable to me as to all of you.
I had been going off the assumption that the absolute timestamps in the packet captures off the edge router, from which I was basing my analysis, were correct, or reasonably close to correct. It's an edge router, after all. However, the person who took them (I do not have direct access to it) just informed me that the router actually had substantial system clock drift - about 20 minutes - at the time they were taken. This invalidates the correlations I had made about the proxy log <-> capture.
As you suggested, there is most likely, in fact, a reachability or UA-related issue.