Hi,
We're hitting an issue in a deployment where all UDP receivers end up sitting in FUTEX_WAIT, caused by save() -> lock_udomain(), and seem to deadlock every couple of days.
Looking at the code, enable_gruu in registrar is active by default, and in lookup there is a code path
/* temp-gruu lookup */ res = ul.get_urecord_by_ruid(_d, ahash, &inst, &r, &ptr);
but no lock_udomain is obtained. However, when the execution falls through to the "done:" marker, it does
ul.unlock_udomain(_d, &aor);
without having called ul.lock_udomain first.
1.) Could someone please review this part? It looks a bit suspicious, although I don't know what implicitly happens in this case. If it were a semaphore and you decreased it to -1 by decrementing it without a prior increment, it would essentially cause a deadlock, but the current locking implementation might work completely differently.
2.) I have no clue how gruu is supposed to work in detail, and in our config we don't explicitly handle gruu (no lookup in loose-route, but gruu is enabled by default in registrar and we didn't explicitly turn it off), so I'm not even sure we ever hit this code path. I only see that the ruid column in the location table is filled. To get to this part, the ";gr" parameter needs to be set in the R-URI for a lookup(), and I don't know whether that happened in some call flows (we only log $ru, which I don't think logs these parameters, right?).
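To make the concern in 1.) concrete, here is a minimal sketch of an unbalanced unlock, assuming a POSIX counting semaphore used as a lock. That is an assumption for illustration only; the lock implementation Kamailio is built with may behave differently. The function name is hypothetical. With this particular primitive, an unlock without a prior lock raises the count rather than lowering it, so the visible effect would be two callers getting the lock at once rather than an immediate block.

```c
#include <semaphore.h>

/* Hypothetical demo, not Kamailio code: returns the semaphore count after an
 * "unlock" (sem_post) that was never preceded by a "lock" (sem_wait).
 * Returns -1 if the semaphore could not be set up or read. */
int count_after_unbalanced_unlock(void)
{
    sem_t lock;
    int value = -1;

    /* Binary-semaphore style lock: count 1 means "free". */
    if (sem_init(&lock, 0, 1) != 0)
        return -1;

    sem_post(&lock);                      /* unlock without a prior lock */

    if (sem_getvalue(&lock, &value) != 0) /* read the resulting count */
        value = -1;

    sem_destroy(&lock);
    return value;
}
```

Calling count_after_unbalanced_unlock() and printing the result shows how far the count has drifted from the "free" value of 1 on a given platform.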
Some input is highly appreciated!
Andreas
Hello,
what version are you using? I'd like to look at the right branch first when troubleshooting, then check the others that might be affected.
Cheers, Daniel
On 4/30/13 5:07 PM, Andreas Granig wrote:
> [...]
SIP Express Router (SER) and Kamailio (OpenSER) - sr-users mailing list sr-users@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-users
Hi Daniel,
On 04/30/2013 05:34 PM, Daniel-Constantin Mierla wrote:
> what version are you using? I'd like to look at the right branch first when troubleshooting, then check the others that might be affected.
The affected version is the latest 3.3 branch, but the same code is in 4.0 as well.
Andreas
Hello,
I looked over the code and it seems OK. The domain lock is taken inside ul.get_urecord_by_ruid(_d, ahash, &inst, &r, &ptr).
With temp gruu, the real AOR is not present in the URI; it is discovered from the ahash (AOR hash ID) and the ruid, which together compose the temp-gruu value. If the record is found by ahash+ruid, the domain is kept locked and aor is set to the value from the record, so the domain is then unlocked at the end of the respective function.
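The "lock is kept on success" convention described above can be sketched as follows. This is only an illustration using a pthread mutex; the types and function names (struct domain, find_by_ruid_locked, unlock_domain, lookup_demo) are hypothetical stand-ins and not the usrloc API, and the held flag exists purely so the demo can check the lock state.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a usrloc domain and record. */
struct record { const char *aor; const char *ruid; };
struct domain {
    pthread_mutex_t lock;
    struct record *recs;
    size_t n;
    int held;               /* demo-only bookkeeping of the lock state */
};

/* Takes the domain lock itself. On a hit it returns 0 and leaves the domain
 * LOCKED (the caller must unlock); on a miss it unlocks and returns -1. */
int find_by_ruid_locked(struct domain *d, const char *ruid, struct record **out)
{
    size_t i;

    pthread_mutex_lock(&d->lock);
    d->held = 1;
    for (i = 0; i < d->n; i++) {
        if (strcmp(d->recs[i].ruid, ruid) == 0) {
            *out = &d->recs[i];
            return 0;       /* lock intentionally still held */
        }
    }
    d->held = 0;
    pthread_mutex_unlock(&d->lock);
    return -1;
}

void unlock_domain(struct domain *d)
{
    d->held = 0;
    pthread_mutex_unlock(&d->lock);
}

/* Returns 1 if both the hit path and the miss path behave as described. */
int lookup_demo(void)
{
    struct record recs[] = { { "alice@example.com", "uloc-1" } };
    struct domain d;
    struct record *r = NULL;
    int ok;

    pthread_mutex_init(&d.lock, NULL);
    d.recs = recs;
    d.n = 1;
    d.held = 0;

    /* Hit: the helper returns with the lock held, so one unlock is due. */
    ok = find_by_ruid_locked(&d, "uloc-1", &r) == 0 && d.held == 1
         && strcmp(r->aor, "alice@example.com") == 0;
    if (d.held)
        unlock_domain(&d);

    /* Miss: the helper already released the lock, so no unlock here. */
    ok = ok && find_by_ruid_locked(&d, "no-such-ruid", &r) == -1 && d.held == 0;

    pthread_mutex_destroy(&d.lock);
    return ok;
}
```

In this sketch the unlock in the caller is balanced only because the lookup helper took the lock; a caller that unlocks after a miss would be the unbalanced case.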
Are you doing any operations in the config with usrloc/registrar other than save()/lookup()? Any MI/RPC commands? Any other modules bound to usrloc (e.g., pua_usrloc)?
Cheers, Daniel
On 4/30/13 5:41 PM, Andreas Granig wrote:
> [...]
Hi,
On 05/01/2013 10:46 AM, Daniel-Constantin Mierla wrote:
> I looked over the code and it seems OK. The domain lock is taken inside ul.get_urecord_by_ruid(_d, ahash, &inst, &r, &ptr).
Ok, good to know.
> Are you doing any operations in the config with usrloc/registrar other than save()/lookup()? Any MI/RPC commands? Any other modules bound to usrloc (e.g., pua_usrloc)?
I'm regularly calling MI functions to get the number of records in usrloc. During the last deadlock, I saw a kamctl fifo process querying usrloc that had already been hanging for two days, but I actually expected to see more of them. I didn't try issuing the MI command manually to see whether it responded; I will do that the next time this happens.
Andreas