Hi, I've submitted some patches to the SR tracker in order to improve something that I consider buggy (and for sure it is):
http://sip-router.org/tracker/index.php?do=details&task_id=66
As explained in the tracker, in daemonize mode kamailio doesn't return the proper status code. If for example the listen address is not local or a DB connection uses an invalid password, then the command "kamailio" would return 0 (OK) anyway (even if it fails to start). This occurs because all those checking (socket, DB connections...) are performed *after* invoking daemonize() so the parent process (which could be invoked by "/etc/init.d/kamailio start") returns 0 knowing nothing about those errors. In fact the only checking done before daemonizing is the config file parsing.
The patches I've upload to the tracker improve/fix it by doing the following:
- 'daemonize()' function doesn't exit the parent process, instead the parent process gets blocked in 'main()' function reading a pipe. - When the main process initializes the modules, sockets and so, it writes into the pipe so the parent process exists with 0. - If the main process fails to start then the pipe write fd is closed with no more processes writting into it so the parent process detects it and exists with -1 error.
I've checked it and works ok. You can try to set a wrong MySQL password or a wrong listen address and run kamailio in daemonize mode. Check the exit status so you will get error code rather than 0 (as occurs without these changes).
This is strongly required when working with HeartBeat (and any other software managing daemons) because HeartBeat relies on the init script returned status codes. Imagine for example that a MySQL password is changed. For any reason HeartBeat decides to change the service to nodo-2. This is what would happen without the proposed changes: - HeartBeat invokes "/etc/init.d/kamailio start". - It returns 0 even if it fails to start due to MySQL password. - HeartBeat invokes periodically "/etc/init.d/kamailio monitor" (similar to "status") which would return 1 (it's not running). - Then HeartBeat tries to start it again. And of course, it receives 0 again. - Loop...
This is, a init script return codes are *important* and current SR/Kamailio doesn't do it well as it fails in daemonize mode.
I will use these patches in production very soon as I'm building some platforms of Kamailio+HeartBeat and this is really required. If somebody could review than patches it should be great.
Regards.
2010/4/19 Iñaki Baz Castillo ibc@aliax.net:
If somebody could review than patches it should be great.
=> "it would be great"
Iñaki Baz Castillo wrote:
Hi, I've submitted some patches to the SR tracker in order to improve something that I consider buggy (and for sure it is):
http://sip-router.org/tracker/index.php?do=details&task_id=66
As explained in the tracker, in daemonize mode kamailio doesn't return the proper status code. If for example the listen address is not local or a DB connection uses an invalid password, then the command "kamailio" would return 0 (OK) anyway (even if it fails to start). This occurs because all those checking (socket, DB connections...) are performed *after* invoking daemonize() so the parent process (which could be invoked by "/etc/init.d/kamailio start") returns 0 knowing nothing about those errors. In fact the only checking done before daemonizing is the config file parsing.
The patches I've upload to the tracker improve/fix it by doing the following:
- 'daemonize()' function doesn't exit the parent process, instead the
parent process gets blocked in 'main()' function reading a pipe.
- When the main process initializes the modules, sockets and so, it
writes into the pipe so the parent process exists with 0.
- If the main process fails to start then the pipe write fd is closed
with no more processes writting into it so the parent process detects it and exists with -1 error.
I've checked it and works ok. You can try to set a wrong MySQL password or a wrong listen address and run kamailio in daemonize mode. Check the exit status so you will get error code rather than 0 (as occurs without these changes).
This is strongly required when working with HeartBeat (and any other software managing daemons) because HeartBeat relies on the init script returned status codes. Imagine for example that a MySQL password is changed. For any reason HeartBeat decides to change the service to nodo-2. This is what would happen without the proposed changes:
- HeartBeat invokes "/etc/init.d/kamailio start".
- It returns 0 even if it fails to start due to MySQL password.
- HeartBeat invokes periodically "/etc/init.d/kamailio monitor"
(similar to "status") which would return 1 (it's not running).
- Then HeartBeat tries to start it again. And of course, it receives 0 again.
- Loop...
This is, a init script return codes are *important* and current SR/Kamailio doesn't do it well as it fails in daemonize mode.
I will use these patches in production very soon as I'm building some platforms of Kamailio+HeartBeat and this is really required. If somebody could review than patches it should be great.
Regards.
Hello Iñaki ,
I had a look over the patches and they look fine. Of course I think one of the core developers should have a look also.
I suggest one thing: Instead of a read() from the read end of the pipe, can we use a select()/poll() so we can have timeouts and prevent blocking. For example does it make sense to say that if the child process doesn't write something to the pipe in let's say 1 minute, this means that it is blocked somewhere and the main process should exit with error (thus the init.d script should return != 0) ?!
Marius
2010/4/19 marius zbihlei marius.zbihlei@1and1.ro:
Hello Iñaki ,
I had a look over the patches and they look fine. Of course I think one of the core developers should have a look also.
I suggest one thing: Instead of a read() from the read end of the pipe, can we use a select()/poll() so we can have timeouts and prevent blocking. For example does it make sense to say that if the child process doesn't write something to the pipe in let's say 1 minute, this means that it is blocked somewhere and the main process should exit with error (thus the init.d script should return != 0) ?!
Hi, in the proposed code if the child process (main process) exits due to an error then it writes nothing to the pipe and the parent process reads 0 bytes from it. It means that an error has occurred and it exits with -1. In case main process starts properly it writes something to the pipe ("go") so the master process reads 2 bytes (>0) and exits with 0.
In the case you suggest, if the main process gets blocked for some reason (it doesn't exit but neither writes into the pipe) then as you say the parent process would get blocked. Not good. Is it possible to do a blocking read of the pipe with a timeout of 4-5 seconds? or is the select()/poll() stuff required for it?
Anyhow, I wonder if it would be enough. Note that in case the main process gets blocked and the parent process exits with -1 (due to the suggested timeout) the main process still remains running (even if blocked). Perhaps the parent process should kill it and ensure it's dead in case such timeout occurs?
Thanks a lot.
Iñaki Baz Castillo wrote:
2010/4/19 marius zbihlei marius.zbihlei@1and1.ro:
Hello Iñaki ,
I had a look over the patches and they look fine. Of course I think one of the core developers should have a look also.
I suggest one thing: Instead of a read() from the read end of the pipe, can we use a select()/poll() so we can have timeouts and prevent blocking. For example does it make sense to say that if the child process doesn't write something to the pipe in let's say 1 minute, this means that it is blocked somewhere and the main process should exit with error (thus the init.d script should return != 0) ?!
Hello Iñaki ,
Hi, in the proposed code if the child process (main process) exits due to an error then it writes nothing to the pipe and the parent process reads 0 bytes from it. It means that an error has occurred and it exits with -1. In case main process starts properly it writes something to the pipe ("go") so the master process reads 2 bytes (>0) and exits with 0.
Indeed, main process is the process after the fork, and this is the process that writes to signal the parent. I see two possible pitfalls: 1. If the main process blocks, this will block the parent process also 2. If the main process returns without writing the bytes, and there are still child processes left(tcp or udp worker children etc), then they will still have the writing part of the socket open (forked from the main process)and again the parent (master) process will keep blocking (didn't discovered a case where it might happen).
In the case you suggest, if the main process gets blocked for some reason (it doesn't exit but neither writes into the pipe) then as you say the parent process would get blocked. Not good. Is it possible to do a blocking read of the pipe with a timeout of 4-5 seconds? or is the select()/poll() stuff required for it?
With a select it is possible to do a blocking read from some time. I strongly suggest more than 4-5 seconds, I think 30s should be a minimum.
Anyhow, I wonder if it would be enough. Note that in case the main process gets blocked and the parent process exits with -1 (due to the suggested timeout) the main process still remains running (even if blocked). Perhaps the parent process should kill it and ensure it's dead in case such timeout occurs?
Thanks a lot.
Good question.. We can kill all children from the main proces, but I am not sure that from the masetr process we can do this..
Marius
2010/4/19 marius zbihlei marius.zbihlei@1and1.ro:
Indeed, main process is the process after the fork, and this is the process that writes to signal the parent. I see two possible pitfalls:
- If the main process blocks, this will block the parent process also
This could be improved using select() with a timeout, right?
- If the main process returns without writing the bytes, and there are
still child processes left(tcp or udp worker children etc), then they will still have the writing part of the socket open (forked from the main process)and again the parent (master) process will keep blocking (didn't discovered a case where it might happen).
Then the workers should immediately close the pipe output fd after being created, right?
With a select it is possible to do a blocking read from some time. I strongly suggest more than 4-5 seconds, I think 30s should be a minimum.
ok, but IMHO 30 seconds is too much for a init script. However this is a corner case.
Anyhow, I wonder if it would be enough. Note that in case the main process gets blocked and the parent process exits with -1 (due to the suggested timeout) the main process still remains running (even if blocked). Perhaps the parent process should kill it and ensure it's dead in case such timeout occurs?
Good question.. We can kill all children from the main proces, but I am not sure that from the masetr process we can do this..
But if we just kill the main process from the parent process, then all the workers and rest of children would be killed, right? The problem would be if the main process gets "blocked" for some reason, but I think that is a terrible case I've never seen.
Regards.
Iñaki Baz Castillo wrote:
2010/4/19 marius zbihlei marius.zbihlei@1and1.ro:
Indeed, main process is the process after the fork, and this is the process that writes to signal the parent. I see two possible pitfalls:
- If the main process blocks, this will block the parent process also
This could be improved using select() with a timeout, right?
Hello I have attached to this email a new main.c.patch containing the select changes (timeout is 30 seconds, hard coded :) ) I have taken the liberty to also format your initial patch.
- If the main process returns without writing the bytes, and there are
still child processes left(tcp or udp worker children etc), then they will still have the writing part of the socket open (forked from the main process)and again the parent (master) process will keep blocking (didn't discovered a case where it might happen).
Then the workers should immediately close the pipe output fd after being created, right?
Good question.. We can kill all children from the main proces, but I am not sure that from the masetr process we can do this..
But if we just kill the main process from the parent process, then all the workers and rest of children would be killed, right? The problem would be if the main process gets "blocked" for some reason, but I think that is a terrible case I've never seen.
I have not yet provide with a solution for this case. If the select timeout is finished, then the main process will try to write to a pipe that the other end stop reading, receive EPIPE and kill all its children. This is the case in witch the main process reaches the write part, if it's blocked somewhere else... well this must be taken care of.
Regards.
Cheers Marius
2010/4/19 marius zbihlei marius.zbihlei@1and1.ro:
Hello I have attached to this email a new main.c.patch containing the select changes (timeout is 30 seconds, hard coded :) )
It's ok for me, I don't consider that this value should be a configurable parameter. I've also seen you have fixed a bug of mine since I closed twice the fd[0] in the main (child) process :)
I have taken the liberty to also format your initial patch.
Great. However I see a diff for modules/carrierroute/cr_func.c ¿? :)
But if we just kill the main process from the parent process, then all the workers and rest of children would be killed, right? The problem would be if the main process gets "blocked" for some reason, but I think that is a terrible case I've never seen.
I have not yet provide with a solution for this case. If the select timeout is finished, then the main process will try to write to a pipe that the other end stop reading, receive EPIPE and kill all its children. This is the case in witch the main process reaches the write part, if it's blocked somewhere else... well this must be taken care of.
Perhaps the parent process could return a different error code (i.e. -2) if pipe read timeout occurs so the init script could then do a kill -9 (problem, parent process doesn't know the main process PID). Well, I think this is an already existing issue even without these changes. For sure there are more corner cases to handle but I do think that this patch is a real improvement over the current code, am I right? :)
Thanks a lot.
On Monday 19 April 2010, Iñaki Baz Castillo wrote:
With a select it is possible to do a blocking read from some time. I strongly suggest more than 4-5 seconds, I think 30s should be a minimum.
ok, but IMHO 30 seconds is too much for a init script. However this is a corner case.
Hi Iñaki,
i think its a bit of a problem that the init script can block somehow, as this causes delays for the system start. But on the other hand, kamailio is not really desktop software..
But if we could allow the init script to block for some time, what about this (IMHO much simpler) approach:
* kamailio gets started as usual * init script sleeps for some time (e.g. 5-10 seconds, configurable) * init script checks if the process is really running (process list, FIFO, SIP OPTIONS..) * init script returns the appropriate return code
Not sure if you already considered this.
Cheers,
Henning
2010/4/19 Henning Westerholt henning.westerholt@1und1.de:
i think its a bit of a problem that the init script can block somehow, as this causes delays for the system start. But on the other hand, kamailio is not really desktop software..
Sure, it's a server software and that's the reason that return codes must be precise.
But if we could allow the init script to block for some time, what about this (IMHO much simpler) approach:
- kamailio gets started as usual
- init script sleeps for some time (e.g. 5-10 seconds, configurable)
- init script checks if the process is really running (process list, FIFO,
SIP OPTIONS..)
- init script returns the appropriate return code
Not sure if you already considered this.
I've done the same for some other services wrongly managing exit status codes. Sincerely I consider it a bad workaround, not reliable and ugly (why to force a timeout if the init script could exit quickly?). Also note that usually HeartBeat waits just a limited ammount of seconds when it starts a service.
Perhaps we could forget the corner case of the timeout as it would only occur if the main process blocks without exiting and without previously writing into the pipe (if it occurs then we have bigger problems, something that must be fixed in the code and not something related to a DB connection error or wrong socket). I really think that the pipe mechanism is the best approach here as it allows getting the real status of the main process, and quickly.
Regards.
Juha Heinanen wrote:
Iñaki Baz Castillo writes:
Sure, it's a server software and that's the reason that return codes must be precise.
i agree with inaki here. in order to to be able to write accurate init scripts, the return codes must be accurate.
-- juha
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
Hello,
Also the patch from Inaki is fairly small and easy to understand. With the timeout I only tried to catch some really rather obscure corner cases(don't even know if they can happen), so IMHO this behavior has to be also enabled in the daemon. Another benefit is that testing script will catch the exact status of the daemon, so no only the init.d scripts will benefit from it, other testing and management tools might also use this.
(I am thinking at our own test suites that would benefit from this)
Greetings all,
Marius
On Monday 19 April 2010, Iñaki Baz Castillo wrote:
But if we could allow the init script to block for some time, what about this (IMHO much simpler) approach:
- kamailio gets started as usual
- init script sleeps for some time (e.g. 5-10 seconds, configurable)
- init script checks if the process is really running (process list,
FIFO, SIP OPTIONS..)
- init script returns the appropriate return code
Not sure if you already considered this.
I've done the same for some other services wrongly managing exit status codes. Sincerely I consider it a bad workaround, not reliable and ugly (why to force a timeout if the init script could exit quickly?). Also note that usually HeartBeat waits just a limited ammount of seconds when it starts a service.
Hi Iñaki,
i've not yet looked into your patch, if its not that invasive and fairly easy to implement (as mentioned in later messages) its fine with me. Implementing it in the init script would be indeed more like a workaround. It would not help in the case when the daemon gets started directly.
Cheers,
Henning
20 apr 2010 kl. 13.20 skrev Henning Westerholt:
On Monday 19 April 2010, Iñaki Baz Castillo wrote:
But if we could allow the init script to block for some time, what about this (IMHO much simpler) approach:
- kamailio gets started as usual
- init script sleeps for some time (e.g. 5-10 seconds, configurable)
- init script checks if the process is really running (process list,
FIFO, SIP OPTIONS..)
- init script returns the appropriate return code
Not sure if you already considered this.
I've done the same for some other services wrongly managing exit status codes. Sincerely I consider it a bad workaround, not reliable and ugly (why to force a timeout if the init script could exit quickly?). Also note that usually HeartBeat waits just a limited ammount of seconds when it starts a service.
Hi Iñaki,
i've not yet looked into your patch, if its not that invasive and fairly easy to implement (as mentioned in later messages) its fine with me. Implementing it in the init script would be indeed more like a workaround. It would not help in the case when the daemon gets started directly.
The preference should always be to have proper exit codes. I don't know how much sleeps the various distros allow you to implement in the rc scripts. Curious if there are guidelines for it.
/O
2010/4/20 Olle E. Johansson oej@edvina.net:
The preference should always be to have proper exit codes. I don't know how much sleeps the various distros allow you to implement in the rc scripts. Curious if there are guidelines for it.
Any init script should accomodate to LSB specificacions: http://refspecs.freestandards.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generi...
This is true for Debian, RedHat or whatever *nix distro.
2010/4/20 Henning Westerholt henning.westerholt@1und1.de:
Hi Iñaki,
i've not yet looked into your patch, if its not that invasive and fairly easy to implement (as mentioned in later messages) its fine with me. Implementing it in the init script would be indeed more like a workaround. It would not help in the case when the daemon gets started directly.
Hi Henning, IMHO it's not invasive (in fact I've seen similar pipe mechanism in other servers). Please take a look to the patches (is very few code). However note that Marius has fixed a small bug (a fd closed twice by the main process) and has also added a select() to allow timeout when the parent process reads from the pipe.
Thanks a lot.