[SR-Users] gruu and dead-lock in registrar module

Tue Apr 30 17:07:42 CEST 2013

Hi,

We're hitting an issue in a deployment where, every couple of days, all 
UDP receivers end up sitting in FUTEX_WAIT, entered via save() -> 
lock_udomain(), and appear to have deadlocked themselves.

Looking at the code, gruu_enabled in registrar is active by default, and 
in lookup there is a code path:

	/* temp-gruu lookup */
	res = ul.get_urecord_by_ruid(_d, ahash, &inst, &r, &ptr);

but no lock_udomain is obtained. However, when the execution falls 
through to the "done:" marker, it does

	ul.unlock_udomain(_d, &aor);

without having called ul.lock_udomain first.
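
To illustrate what I would have expected, here is a minimal sketch. It 
is not the registrar code: toy_domain_t, toy_lock(), toy_unlock() and 
toy_lookup() are hypothetical stand-ins built on a pthread mutex, and it 
assumes that the gruu lookup helper does not take the lock itself (which 
I have not verified). The point is only that the "done:" path unlocks 
exactly when this function locked before.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a usrloc domain; not the real udomain_t. */
typedef struct toy_domain {
    pthread_mutex_t lock;
    int lock_calls;   /* how often toy_lock() was called   */
    int unlock_calls; /* how often toy_unlock() was called */
} toy_domain_t;

static void toy_domain_init(toy_domain_t *d)
{
    pthread_mutex_init(&d->lock, NULL);
    d->lock_calls = 0;
    d->unlock_calls = 0;
}

static void toy_lock(toy_domain_t *d)
{
    pthread_mutex_lock(&d->lock);
    d->lock_calls++;
}

static void toy_unlock(toy_domain_t *d)
{
    d->unlock_calls++;
    pthread_mutex_unlock(&d->lock);
}

/* is_gruu selects the branch that jumps to "done" without locking
 * (assumption: the lookup helper on that branch takes no lock). */
static int toy_lookup(toy_domain_t *d, bool is_gruu)
{
    bool locked = false; /* remembers whether we hold the lock */
    int ret = -1;

    if (is_gruu) {
        ret = 1; /* pretend the record was found by ruid */
        goto done;
    }

    toy_lock(d);
    locked = true;
    ret = 1; /* pretend the record was found by AoR */

done:
    if (locked)
        toy_unlock(d); /* only release what was actually taken */
    return ret;
}
```

With this shape, a call such as toy_lookup(&d, true) leaves lock_calls 
and unlock_calls balanced, which is the property I cannot see in the 
gruu branch of lookup.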

1.) Could someone please review this part? It looks a bit suspicious, 
although I don't know what implicitly happens in this case. If it were a 
semaphore and you decreased it to -1 by decrementing it without a prior 
increment, that would essentially cause a dead-lock, but the current 
locking implementation might work completely differently.

2.) Since I have no clue how gruu is supposed to work in detail, and 
since our config doesn't explicitly handle gruu (no lookup in 
loose-route, but gruu is enabled by default in registrar and we haven't 
explicitly turned it off), I'm not even sure we ever hit this code 
path. I only see that the ruid column in the location table is filled. 
To get to this part, though, the ";gr" parameter needs to be set in the 
R-URI for a lookup(), and I don't know whether that happened in some 
call flows (we only log $ru, which I don't think logs these 
parameters, right?).
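
For checking captured traces offline, a naive test for a "gr" URI 
parameter could look like the sketch below. It is a plain string scan, 
not a SIP parser and not what the registrar module does internally; 
uri_has_gr_param() is a hypothetical helper and the example URIs are 
made up.

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Returns true if the URI string carries a "gr" parameter, either bare
 * (";gr") or with a value (";gr=..."). Naive scan: it looks at every
 * ';'-separated token and does not distinguish URI headers. */
static bool uri_has_gr_param(const char *uri)
{
    const char *p = uri;

    while ((p = strchr(p, ';')) != NULL) {
        p++; /* step past ';' to the parameter name */
        if (strncasecmp(p, "gr", 2) == 0) {
            char next = p[2];
            /* the name must end here, so ";group=1" does not match */
            if (next == '\0' || next == ';' || next == '='
                    || next == '?' || next == '>')
                return true;
        }
    }
    return false;
}
```

For example, "sip:alice@example.com;gr=urn:uuid:1234" and 
"sip:alice@example.com;gr" would match, while 
"sip:alice@example.com;group=1" would not.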

Any input would be highly appreciated!

Andreas