gruu and dead-lock in registrar module - sr-users

30 Apr 2013


Hi,
We're hitting an issue in a deployment where, every couple of days, all 
UDP receivers end up sitting in FUTEX_WAIT, reached via save() -> 
lock_udomain(), and appear to have deadlocked themselves.
Looking at the code, enable_gruu in registrar is active by default, and 
in lookup there is a code path
/* temp-gruu lookup */
    res = ul.get_urecord_by_ruid(_d, ahash, &inst, &r, &ptr);
but no lock is taken with lock_udomain beforehand. However, when 
execution falls through to the "done:" label, it does
ul.unlock_udomain(_d, &aor);
without having called ul.lock_udomain first.
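
To make the suspected imbalance concrete, here is a minimal sketch in plain C, with a pthread mutex standing in for the usrloc domain lock. It is not the registrar code: lookup_sketch, lock_domain, unlock_domain, the path enum and the locked flag are all hypothetical names, and the real lock type depends on how the server was built. It only shows one way to let several code paths share a single "done:" label while unlocking only on the paths that actually took the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the usrloc domain lock (assumption: the real lock is
 * whatever primitive the build was configured with). */
static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counters used only to check that locks and unlocks stay balanced.
 * Both are modified while the mutex is held. */
static int lock_calls = 0;
static int unlock_calls = 0;

static void lock_domain(void)
{
    int rc = pthread_mutex_lock(&domain_lock);
    assert(rc == 0);
    (void)rc;
    lock_calls++;
}

static void unlock_domain(void)
{
    unlock_calls++;
    int rc = pthread_mutex_unlock(&domain_lock);
    assert(rc == 0);
    (void)rc;
}

/* Hypothetical code paths that all end at the same exit label. */
enum path { PATH_EARLY_EXIT, PATH_GRUU, PATH_AOR };

static int lookup_sketch(enum path p)
{
    bool locked = false;
    int ret = -1;

    if (p == PATH_EARLY_EXIT)
        goto done;          /* leaves before any lock was taken */

    if (p == PATH_GRUU) {
        lock_domain();      /* gruu-style path takes the lock itself */
        locked = true;
        ret = 1;
        goto done;
    }

    lock_domain();          /* regular AoR-style path */
    locked = true;
    ret = 2;

done:
    /* Unlock only if this call really holds the lock. */
    if (locked)
        unlock_domain();
    return ret;
}
```

Calling lookup_sketch() once per path and comparing lock_calls with unlock_calls afterwards is a quick way to confirm that no path reaches the label unbalanced. Whether the real lookup needs such a flag depends on whether the by-ruid lookup already returns with the lock held, which the sketch does not try to answer.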
1.) Could someone please review this part? It looks a bit suspicious, 
although I don't know what implicitly happens in this case. If it were a 
semaphore and you decremented it to -1 without a prior increment, that 
would essentially cause a deadlock, but the current locking 
implementation might work completely differently.
2.) I have no clue how gruu is supposed to work in detail, and our 
config doesn't explicitly handle gruu (no lookup in loose-route, but 
gruu is enabled by default in registrar and we haven't explicitly turned 
it off), so I'm not even sure we ever hit this code path. I only see 
that the ruid column in the location table is filled. To get to this 
part, though, the ";gr" parameter needs to be set in the R-URI for a 
lookup(), and I don't know whether that happened in some call flows (we 
only log $ru, which I don't think logs these parameters, right?).
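
For checking captured or logged request URIs for the gr parameter mentioned in point 2, a small helper along these lines might do. It is a sketch under assumptions: uri_has_gr_param is a hypothetical name, not a Kamailio function, and the parsing is deliberately simplified (it skips the user part up to '@', stops at the '?' that starts URI headers, and compares the parameter name case-insensitively).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Kamailio: returns true if the URI
 * string carries a "gr" URI parameter, i.e. ";gr" or ";gr=value". */
static bool uri_has_gr_param(const char *uri)
{
    if (uri == NULL)
        return false;

    /* URI headers start at '?'; parameters can only appear before it. */
    const char *end = strchr(uri, '?');
    if (end == NULL)
        end = uri + strlen(uri);

    /* Skip the user part, which may itself contain ';'. */
    const char *start = uri;
    const char *at = strchr(uri, '@');
    if (at != NULL && at < end)
        start = at + 1;

    for (const char *p = start; p < end; p++) {
        if (*p != ';')
            continue;
        /* Need at least two more characters for the name "gr". */
        if (end - p < 3)
            break;
        if (tolower((unsigned char)p[1]) != 'g' ||
            tolower((unsigned char)p[2]) != 'r')
            continue;
        /* The name must end here: ";gr", ";gr=..." or ";gr;...". */
        const char *q = p + 3;
        if (q == end || *q == '=' || *q == ';')
            return true;
    }
    return false;
}
```

Feeding it the request URIs of the calls that precede a hang would show whether any lookup() could have taken the gruu path at all. A parameter such as ";group=1" is intentionally not treated as a match.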
Any input would be highly appreciated!
Andreas