doc/
subfolder, the README file is autogenerated)We've been running the ims_charging module in production for quite some time, without any issues.
A couple of days ago the whole instance freezed, but I was able to get a dump before the stuck processes before restarting. The traffic volume has gradually increased over time, so it's likely related to that.
I then did a summary of the different types of stuck processes. Which locks they're holding, and what they're waiting for:
holding lock:
AAASessionsLock
waiting for lock:
lock_get(peer_list_lock);
#1 0x00007fe70eba8cde in futex_get (lock=0x7fe6e4a5c680) at ../../core/futexlock.h:108
#2 0x00007fe70ebaaca2 in get_peer_by_fqdn (fqdn=0x7fe6e4a5ba30) at peermanager.c:259
#3 0x00007fe70ebb64b7 in get_first_connected_route (cdp_session=0x7fe6e5cc5350, r=0x7fe6e4a5ba30, app_id=4, vendor_id=10415) at routing.c:115
#4 0x00007fe70ebb9a37 in get_routing_peer (cdp_session=0x7fe6e5cc5350, m=0x7fe6e5447ab0) at routing.c:293
#5 0x00007fe70ebcaf8c in AAASendMessage (message=0x7fe6e5447ab0, callback_f=0x7fe707a9e323 <resume_on_initial_ccr>, callback_param=0x7fe6e5297110) at diameter_comm.c:139
#6 0x00007fe707a9d3e0 in Ro_Send_CCR (msg=0x7fe70f011100, dlg=0x7fe6e5c571f0, dir=0, reservation_units=30, incoming_trunk_id=0x7ffe50efb060, outgoing_trunk_id=0x7ffe50efb050, pani=0x7ffe50efaee0, action=0x7fe70efa4730, tindex=10484, tlabel=677172593) at ims_ro.c:1511
#7 0x00007fe707a8285d in ki_ro_ccr (msg=0x7fe70f011100, s_route_name=0x7ffe50efb080, s_direction=0x7ffe50efb070, reservation_units=30, s_incoming_trunk_id=0x7ffe50efb060, s_outgoing_trunk_id=0x7ffe50efb050) at ims_charging_mod.c:742
#8 0x00007fe707a7bf01 in w_ro_ccr (msg=0x7fe70f011100, c_route_name=0x7fe70ef8b8d0 "\220\311\371\016\347\177", c_direction=0x7fe70ef8b980 "p\240\371\016\347\177", reservation_units=30, c_incoming_trunk_id=0x7fe70ef8ba30 "p\241\371\016\347\177", c_outgoing_trunk_id=0x7fe70ef8bae0 "\360\241\371\016\347\177") at ims_charging_mod.c:507
#9 0x00000000004858d8 in do_action (h=0x7ffe50efb970, a=0x7fe70ef99e20, msg=0x7fe70f011100) at core/action.c:1144
#10 0x00000000004928d6 in run_actions (h=0x7ffe50efb970, a=0x7fe70ef99e20, msg=0x7fe70f011100) at core/action.c:1618
#11 0x0000000000492f52 in run_actions_safe (h=0x7ffe50eff1e0, a=0x7fe70ef99e20, msg=0x7fe70f011100) at core/action.c:1681
#12 0x0000000000450156 in rval_get_long (h=0x7ffe50eff1e0, msg=0x7fe70f011100, i=0x7ffe50efbec8, rv=0x7fe70ef9c1d8, cache=0x0) at core/rvalue.c:973
#13 0x0000000000454d24 in rval_expr_eval_long (h=0x7ffe50eff1e0, msg=0x7fe70f011100, res=0x7ffe50efbec8, rve=0x7fe70ef9c1d0) at core/rvalue.c:1854
#14 0x0000000000454d52 in rval_expr_eval_long (h=0x7ffe50eff1e0, msg=0x7fe70f011100, res=0x7ffe50efc448, rve=0x7fe70ef9b920) at core/rvalue.c:1864
#15 0x00000000004850ce in do_action (h=0x7ffe50eff1e0, a=0x7fe70ef9b070, msg=0x7fe70f011100) at core/action.c:1097
--
holding lock:
AAASessionsLock
waiting for lock:
lock_get(peer_list_lock);
#1 0x00007fe70eba8d4f in futex_get (lock=0x7fe6e4a5c680) at ../../core/futexlock.h:121
#2 0x00007fe70ebaaca2 in get_peer_by_fqdn (fqdn=0x7ffe50efab90) at peermanager.c:259
#3 0x00007fe70ebb8e89 in get_routing_peer (cdp_session=0x7fe6e5ab6910, m=0x7fe6e5435be0) at routing.c:252
#4 0x00007fe70ebcaf8c in AAASendMessage (message=0x7fe6e5435be0, callback_f=0x7fe707a95edc <resume_on_termination_ccr>, callback_param=0x0) at diameter_comm.c:139
#5 0x00007fe707a95b02 in send_ccr_stop_with_param (ro_session=0x7fe6e5ab65e0, code=0, reason=0x7ffe50efb060) at ims_ro.c:1181
#6 0x00007fe707a72ff7 in dlg_terminated (dlg=0x7fe6e623d7a0, type=64, termcode=0, reason=0x7fe707ab72b3 "normal call clearing", _params=0x7fe707f67280 <params>) at dialog.c:249
#7 0x00007fe707a6a729 in dlg_callback_received (dlg=0x7fe6e623d7a0, type=64, _params=0x7fe707f67280 <params>) at dialog.c:25
#8 0x00007fe707d341b9 in run_dlg_callbacks (type=64, dlg=0x7fe6e623d7a0, req=0x7fe70f011100, rpl=0x0, dir=1, dlg_data=0x0) at dlg_cb.c:271
#9 0x00007fe707cf4db4 in dlg_terminated (req=0x7fe70f011100, dlg=0x7fe6e623d7a0, dir=1) at dlg_handlers.c:413
#10 0x00007fe707cfddeb in dlg_onroute (req=0x7fe70f011100, route_params=0x7ffe50efb6d0, param=0x0) at dlg_handlers.c:1097
#11 0x00007fe70ad285f6 in run_rr_callbacks (req=0x7fe70f011100, rr_param=0x7ffe50efb7c0) at rr_cb.c:96
#12 0x00007fe70ad3ae92 in after_loose (_m=0x7fe70f011100, preloaded=0) at loose.c:1021
#13 0x00007fe70ad3b5ce in loose_route_mode (_m=0x7fe70f011100, _mode=0) at loose.c:1056
#14 0x00007fe70ad3f74f in w_loose_route (msg=0x7fe70f011100, p1=0x0, p2=0x0) at rr_mod.c:273
#15 0x00000000004855ff in do_action (h=0x7ffe50efc390, a=0x7fe70efe0d40, msg=0x7fe70f011100) at core/action.c:1121
--
holding lock:
lock_get(peer_list_lock);
waiting for lock:
lock_get(p->lock);
#1 0x00007fe70eba8d4f in futex_get (lock=0x7fe6e4a5cbd0) at ../../core/futexlock.h:121
#2 0x00007fe70ebab0ae in peer_timer (now=1742807320, ptr=0x0) at peermanager.c:286
#3 0x00007fe70ebd0f39 in timer_loop () at timer.c:116
#4 0x00007fe70ebd21b2 in timer_process (returns=0) at timer.c:216
#5 0x00007fe70eb8ccf8 in diameter_peer_start (blocking=0) at diameter_peer.c:350
#6 0x00007fe70eb7cbb2 in cdp_child_init (rank=0) at cdp_mod.c:272
--
holding lock:
lock_get(p->lock)
waiting for lock:
AAASessionsLock
#1 0x00007fe70ebf2597 in futex_get (lock=0x7fe6e4a5d490) at ../../core/futexlock.h:108
#2 0x00007fe70ebf26f1 in AAASessionsLock (hash=0) at session.c:79
#3 0x00007fe70ebf5e6e in cdp_get_session (id=...) at session.c:316
#4 0x00007fe70eba6892 in Snd_Message (p=0x7fe6e4a5c880, msg=0x7fe6e63998d0) at peerstatemachine.c:1237
#5 0x00007fe70eba003e in sm_process (p=0x7fe6e4a5c880, event=Send_Message, msg=0x7fe6e63998d0, peer_locked=0, sock=0) at peerstatemachine.c:429
#6 0x00007fe70ebcbdc6 in AAASendMessage (message=0x7fe6e63998d0, callback_f=0x7fe707a8f208 <resume_on_interim_ccr>, callback_param=0x7fe6e6190a90) at diameter_comm.c:166
#7 0x00007fe707a8edb3 in send_ccr_interim (ro_session=0x7fe6e5399160, used=60, reserve=30) at ims_ro.c:847
#8 0x00007fe707a68bd6 in ro_session_ontimeout (tl=0x7fe6e5399200) at ro_timer.c:513
#9 0x00007fe707a63078 in ro_timer_routine (ticks=114862426, attr=0x0) at ro_timer.c:279
#10 0x00000000004fd33e in compat_old_handler (ti=1837798827, tl=0x7fe6e4cf6260, data=0x7fe6e4cf6260) at core/timer.c:980
#11 0x00000000004fde7a in slow_timer_main () at core/timer.c:1103
#12 0x000000000042e4e7 in main_loop () at main.c:1911
#13 0x000000000043876c in main (argc=10, argv=0x7ffe50f001c8) at main.c:3236
In get_first_connected_route()
in routing.c
of the cdp module there are two places get_peer_by_fqdn()
are called. One of them has an unlock/relock of the session list before and after (and a comment about holding two locks at a time), while the other doesn't.
I'm currently testing two version of this. The first by just doing the some relock for the other get_peer_by_fqdn()
, but this PR got an approach for maybe fixing more latent issues. sm_process()
also got some strange handling by Rcv_Process()
after the peer lock is released, which seems to be about the same thing. My deadlock is for the Snd_Message()
equivalent.
I've now removed this queueing behaviour, and instead doing a re-lock of the peer to (hopefully) have the same locking order as the other operations.
The problem with this thing is how rare it occurs. Just wanted to share my findings for others with more knowledge to the cdp module for comments, suggestions and hopefully some extra testing.
https://github.com/kamailio/kamailio/pull/4191
(2 files)
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.