[Users] Re: [Devel] "detached" timer

Bogdan-Andrei Iancu bogdan at voice-system.ro
Fri Mar 30 16:23:41 CEST 2007


Well, the openser-related information you based your statements/opinions
on is quite outdated, as a lot of work has been done in that area.

Please try to catch up on the progress of the openser project.

bogdan

Jiri Kuthan wrote:
> At 14:48 30/03/2007, Bogdan-Andrei Iancu wrote:
>   
>> wrong again :)
>>     
>
> I wish it were.
>
> The operational experience shows us that in the former versions
> there have been race conditions which do cause trouble under
> hard-to-reproduce conditions. Based on surface knowledge, it appears
> that openser inherited those from ser before ser's overhaul
> of those.
>
>   
>> as I mentioned in my previous email, the "detached timer" was more a marker that something else was going wrong - there was no amplification.
>>     
>
> Lucky are those who haven't been affected by the race conditions. My point
> is, though, that this particular warning correlates with nondeterminism.
>
>   
>> and as TR clearly said, the problem was with DB connectivity and had nothing to do with TM timers.
>>     
>
> Well, as a matter of fact, I have witnessed several failures which coincidentally
> appeared with this warning. Studying the code will reveal to you and anyone else
> that this warning is actually just a hack which helps to ignore erroneous conditions
> and survive those, but doesn't heal the cause of the problem, which may still generate
> dysfunctional service.
>
> Again -- I don't mean to demonize it; with this ignore-the-problem hack, things
> have been running mostly fine.
>
> -jiri
>
>
>   
>> regards,
>> bogdan
>>
>> Jiri Kuthan wrote:
>>     
>>> Actually more likely it has been both. The root problem lies in the timer subsystem
>>> and may be amplified by other troubles (or amplify those).
>>>
>>> -jiri
>>>
>>> At 01:35 30/03/2007, T.R. Missner wrote:
>>>  
>>>       
>>>> FYI All
>>>>
>>>> This turned out to be a database write ( acc ) that was blocking due to a raid card problem.
>>>>
>>>>
>>>>
>>>> T.R. Missner wrote:
>>>>    
>>>>         
>>>>> Is it possible the locked state I am seeing with openser leads to the "detached" timer?
>>>>> Since the "detached" timer is a race, it would make sense to see the race condition after openser locks up and messages buffer up in the stack.
>>>>> When a bunch of messages are processed all at once by multiple threads the race condition would occur.
>>>>> Does this make sense?
>>>>>
>>>>> Maybe I have been focusing on the wrong place.
>>>>>
>>>>> Ignoring the "detached" timer what could cause openser to hang for a couple seconds then clear every 5 - 10 minutes?
>>>>>
>>>>> Ideas?
>>>>>
>>>>> We are seeing this on 3 different production servers.
>>>>>
>>>>> Thanks
>>>>>
>>>>> TR
>>>>>
>>>>> using openser 1.1.1
>>>>>
>>>>>
>>>>>
>>>>> T.R. Missner wrote:
>>>>>      
>>>>>           
>>>>>> Bogdan,
>>>>>>
>>>>>> I have been chasing this for days and done lots of debugging.
>>>>>> using 1.1.1
>>>>>> While looking at the network trace at the time of these messages ( I usually see at least 5 in a row with differing hex values ) I see many incoming packets coming into the box and no response from the proxy for somewhere between 5 - 10 seconds, then a flood of responses from the proxy.
>>>>>> I can email you a sample pcap file if you like.
>>>>>> As part of my debugging I forced a 100 reply at the very top of my cfg file.
>>>>>> The forced 100 was not sent during the locked up time leading me to believe openser was not processing incoming packets.
>>>>>> I have now seen this on multiple servers in different locations. Likely a particular customer call flow is causing this but I have not been able to pin it down to the exact customer. These proxies run pretty fast during the day so finding a pattern leading up to this issue is difficult. What could I add to the log output to identify the offending sip-callid? Is sip-callid or branch tag or anything similar easily accessible in any of the data structs in timer.c?
>>>>>>
>>>>>> TR
>>>>>>
>>>>>> Bogdan-Andrei Iancu wrote:
>>>>>>        
>>>>>>             
>>>>>>> Hi TR,
>>>>>>>
>>>>>>> it is a race between the expire event (from the timer) and inserting again on a timer list.
>>>>>>>  1 is the final response timer list (fr_timer)
>>>>>>>  3 is the wait timer list (wt_timer)
>>>>>>>
>>>>>>> I would say there is no way this could lead to any kind of lock.
>>>>>>>
>>>>>>> what version are you using? what makes you say it locks?
>>>>>>>
>>>>>>> regards,
>>>>>>> bogdan
>>>>>>>
>>>>>>> T.R. Missner wrote:
>>>>>>>          
>>>>>>>               
>>>>>>>> Does anyone know what causes this?
>>>>>>>>
>>>>>>>> set_timer for 1 list called on a "detached" timer -- ignoring
>>>>>>>>
>>>>>>>> I also see
>>>>>>>>
>>>>>>>> set_timer for 3 list called on a "detached" timer -- ignoring
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> When this happens Openser seems to lock up for 10 seconds or so.
>>>>>>>>
>>>>>>>> From searching it appears this is caused by a race but I am not sure what the race is or why this results in an unresponsive openser instance for multiple seconds.
>>>>>>>>
>>>>>>>> Transaction expiration racing reply?
>>>>>>>>
>>>>>>>>
>>>>>>>> Desperately need to understand how this could be triggered so I can get customer to adjust system.
>>>>>>>>
>>>>>>>> Any way to adjust?
>>>>>>>>
>>>>>>>> tried tweaking fr_inv_timer but no joy.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> TR
>>>>>>>>            
>>>>>>>>                 
>>>>>>>
>>>>>>> --
>>>>>>> Jiri Kuthan            http://iptel.org/~jiri/
>>>>>>>               
>
>
>   
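
For readers following the thread, the race described above (a timer expiry racing a re-insertion on the fr_timer or wt_timer list) can be illustrated with a small model. This is only a sketch under stated assumptions: it is written in Python, not taken from the openser or ser sources, and the names TimerLink, TimerList, detach_expired and the DETACHED sentinel are hypothetical. It is single-threaded and steps through one unlucky interleaving by hand; the real code runs these steps in separate processes.

```python
# Hypothetical model of a "detached" timer; not the actual tm module code.
DETACHED = object()  # sentinel: the expiry pass has pulled this link off its list


class TimerLink:
    """One timer entry belonging to a transaction (hypothetical name)."""

    def __init__(self, name):
        self.name = name
        self.owner = None        # None, a TimerList, or DETACHED
        self.expires_at = None


class TimerList:
    """A timer list such as list 1 (fr_timer) or list 3 (wt_timer)."""

    def __init__(self, list_id):
        self.list_id = list_id
        self.links = []
        self.warnings = []

    def set_timer(self, link, now, timeout):
        # If the expiry pass already took the link, refuse to re-arm it
        # and only record a warning, as the log message in the thread suggests.
        if link.owner is DETACHED:
            self.warnings.append(
                f'set_timer for {self.list_id} list called on a "detached" timer -- ignoring'
            )
            return False
        # Make sure the link is not on a list already, then (re)insert it.
        if isinstance(link.owner, TimerList):
            link.owner.links.remove(link)
        link.owner = self
        link.expires_at = now + timeout
        self.links.append(link)
        return True

    def detach_expired(self, now):
        # The timer pass removes every expired link and marks it DETACHED
        # so that it can be processed outside the list.
        expired = [l for l in self.links if l.expires_at <= now]
        self.links = [l for l in self.links if l.expires_at > now]
        for l in expired:
            l.owner = DETACHED
        return expired


# One unlucky interleaving, stepped through by hand:
fr_timer = TimerList(1)
tl = TimerLink("transaction-A")
fr_timer.set_timer(tl, now=0, timeout=5)             # worker arms the timer
expired = fr_timer.detach_expired(now=5)             # timer pass grabs the link
rearmed = fr_timer.set_timer(tl, now=5, timeout=5)   # late re-arm is ignored
```

In this model the late set_timer call returns without touching the list, which matches the point made in the thread that the warning lets the server survive the condition without addressing whatever caused the two events to collide.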




