Hi,
We're hitting an issue in a deployment where, every couple of days, all
UDP receivers end up sitting in FUTEX_WAIT under save() ->
lock_udomain() and appear to have deadlocked themselves.
Looking at the code, enable_gruu in registrar is active by default, and
in lookup there is a code path
/* temp-gruu lookup */
res = ul.get_urecord_by_ruid(_d, ahash, &inst, &r, &ptr);
but no lock_udomain is obtained. However, when the execution falls
through to the "done:" marker, it does
ul.unlock_udomain(_d, &aor);
without having called ul.lock_udomain first.
1.) Could someone please review this part? It looks a bit suspicious,
although I don't know what implicitly happens in this case. If it were a
semaphore and you decreased it to -1 by decrementing it without a prior
increment, that would essentially cause a deadlock, but the current
locking implementation might work completely differently.
2.) Since I have no clue how gruu is supposed to work in detail, and
since in our config we don't explicitly handle gruu (no lookup in
loose-route, but gruu is enabled by default in registrar and we haven't
explicitly turned it off), I'm not even sure we ever hit this code
path. I only see that the ruid column in the location table is filled,
but in order to get to this part, the ";gr" parameter needs to be set in
the R-URI for a lookup(), and I don't know whether that has happened in
some call flows (we only log $ru, which I don't think logs these
parameters, right?).
Any input would be highly appreciated!
Andreas