Hello everyone,
I have a heartbeat cluster that manages ser 0.9.3 running on one
machine and asterisk 1.2.3 running on the other, in an active/active
configuration with two IP addresses and failover support. Both machines
run SUSE 10.1. My ha.cf files look like this:
###############################################
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 10
warntime 10
initdead 20
udpport 694
baud 19200
bcast eth1
ping xxx.xxx.x.x
auto_failback on
node linux-xczz
node prueba2
respawn hacluster /usr/local/lib/heartbeat/ipfail
#################################################
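The timing directives above can be sanity-checked mechanically. The sample ha.cf shipped with Heartbeat recommends an initdead of at least twice deadtime, and warntime only gives advance notice if it is below deadtime (here both are 10 s, so the "late heartbeat" warning fires no earlier than the node is declared dead). The following is a minimal Python sketch; the helper names are hypothetical, and it assumes plain integer seconds (ha.cf also accepts "ms" values, which this does not handle).

```python
# Copy of the ha.cf shown above (addresses still masked).
HA_CF = """\
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 10
warntime 10
initdead 20
udpport 694
baud 19200
bcast eth1
ping xxx.xxx.x.x
auto_failback on
node linux-xczz
node prueba2
respawn hacluster /usr/local/lib/heartbeat/ipfail
"""

def parse_ha_cf(text):
    """Map each directive name to the list of values it was given."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and '####' separators
        if not line:
            continue
        parts = line.split(None, 1)
        directives.setdefault(parts[0], []).append(parts[1] if len(parts) > 1 else "")
    return directives

def timing_problems(directives):
    """Flag timing values that contradict the usual ha.cf guidance."""
    def seconds(name):
        # Simplification: if a directive repeats, use its last value.
        return int(directives[name][-1])
    keepalive, deadtime = seconds("keepalive"), seconds("deadtime")
    warntime, initdead = seconds("warntime"), seconds("initdead")
    problems = []
    if not keepalive < warntime < deadtime:
        problems.append(
            f"expected keepalive < warntime < deadtime, got {keepalive}/{warntime}/{deadtime}"
        )
    if initdead < 2 * deadtime:
        problems.append(f"initdead ({initdead}) is less than twice deadtime ({deadtime})")
    return problems

directives = parse_ha_cf(HA_CF)
problems = timing_problems(directives)
print(directives["node"])
for problem in problems:
    print(problem)
```

With this config the only complaint is warntime being equal to deadtime; initdead 20 exactly meets the twice-deadtime guideline.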
My haresources files look like this:
########################################
prueba2 xxx.xxx.x.125/24 safe_asterisk
linux-xczz xxx.xxx.x.124/24 serctl
########################################
So linux-xczz is the preferred (primary) node for ser, and prueba2 is
the preferred node for asterisk.
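How a haresources line turns into script calls can be sketched in Python. Assumptions: the first token is the preferred node; a bare IP/netmask token is shorthand for the IPaddr resource script (the logs show it being run as "IPaddr xxx.xxx.x.124/24 stop"); "Script::arg" passes arguments; and a group is started left to right and stopped right to left, which matches the order of the IPaddr and serctl calls in the logs. The 192.0.2.x addresses are hypothetical placeholders for the masked ones, and the helper names are illustrative only.

```python
import re

# Hypothetical placeholder addresses (192.0.2.0/24 is a documentation range).
HARESOURCES = """\
prueba2 192.0.2.125/24 safe_asterisk
linux-xczz 192.0.2.124/24 serctl
"""

IP_TOKEN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(/\d+)?$")

def parse_line(line):
    """Return (preferred node, [resource, ...]) for one haresources line."""
    node, *tokens = line.split()
    # A bare address is treated as shorthand for IPaddr::<address>.
    resources = [f"IPaddr::{t}" if IP_TOKEN.match(t) else t for t in tokens]
    return node, resources

def commands(resources, action):
    """Script invocations for a group: stop runs right to left, anything else left to right."""
    ordered = list(reversed(resources)) if action == "stop" else resources
    calls = []
    for resource in ordered:
        script, *args = resource.split("::")
        calls.append(" ".join([script, *args, action]))
    return calls

groups = dict(parse_line(line) for line in HARESOURCES.splitlines() if line.strip())
for node, resources in groups.items():
    print(node, commands(resources, "start"), commands(resources, "stop"))
```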
The authkeys files are also identical on both machines, with the correct
permissions (mode 600). The /etc/hosts files look like this:
##########################
10.10.10.1 linux-xczz
10.10.10.2 prueba2
##########################
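Heartbeat identifies each machine by the name `uname -n` returns, and that name must match one of the `node` lines in ha.cf; the /etc/hosts entries above are a convenient place to cross-check the spelling. A small Python sketch of that cross-check follows. The helpers are hypothetical, not part of Heartbeat, and the local name is passed in as a literal so the example is deterministic.

```python
NODES = ["linux-xczz", "prueba2"]  # the node lines from ha.cf above

HOSTS = """\
10.10.10.1 linux-xczz
10.10.10.2 prueba2
"""

def parse_hosts(text):
    """Map each hostname or alias in /etc/hosts-style text to its address."""
    names = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        address, *aliases = line.split()
        for alias in aliases:
            names.setdefault(alias, address)  # first matching line wins
    return names

def node_problems(nodes, hosts, local_name):
    """Check that this machine's uname -n is a declared node and every node resolves."""
    problems = []
    if local_name not in nodes:
        problems.append(f"{local_name!r} (uname -n) is not a node line in ha.cf")
    for node in nodes:
        if node not in hosts:
            problems.append(f"{node!r} has no /etc/hosts entry")
    return problems

# On a real node, pass platform.node() (the same value as uname -n) instead of a literal.
hosts = parse_hosts(HOSTS)
print(node_problems(NODES, hosts, "prueba2"))  # []
```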
The first time I start heartbeat on one machine (prueba2) with
/etc/init.d/heartbeat start, both services run fine. But when I then start
heartbeat on the other machine (linux-xczz) so that it takes over the ser
service, things fall apart. Once linux-xczz takes over ser, prueba2 also
gives up the other resource (asterisk), which it should not do. Asterisk
then appears on linux-xczz, only to disappear seconds later along with
ser, leaving the cluster with no service running on either machine. This
is the log I get from prueba2:
heartbeat[12609]: 2007/03/15_11:41:18 info: Link linux-xczz:eth1 up.
heartbeat[12609]: 2007/03/15_11:41:18 info: Status update for node linux-xczz: status init
heartbeat[12609]: 2007/03/15_11:41:18 info: Status update for node linux-xczz: status up
harc[13730]: 2007/03/15_11:41:18 info: Running /etc/ha.d/rc.d/status status
harc[13741]: 2007/03/15_11:41:18 info: Running /etc/ha.d/rc.d/status status
heartbeat[12609]: 2007/03/15_11:41:19 info: Status update for node linux-xczz: status active
harc[13754]: 2007/03/15_11:41:19 info: Running /etc/ha.d/rc.d/status status
heartbeat[12609]: 2007/03/15_11:41:19 info: remote resource transition completed.
heartbeat[12609]: 2007/03/15_11:41:19 info: prueba2 wants to go standby [foreign]
heartbeat[12609]: 2007/03/15_11:41:20 info: standby: linux-xczz can take our foreign resources
heartbeat[13767]: 2007/03/15_11:41:20 info: give up foreign HA resources (standby).
ResourceManager[13777]: 2007/03/15_11:41:20 info: Releasing resource group: linux-xczz xxx.xxx.x.124/24 serctl
ResourceManager[13777]: 2007/03/15_11:41:20 info: Running /etc/init.d/serctl stop
ResourceManager[13777]: 2007/03/15_11:41:20 info: Running /etc/ha.d/resource.d/IPaddr xxx.xxx.x.124/24 stop
IPaddr[13915]: 2007/03/15_11:41:20 INFO: /sbin/route -n del -host xxx.xxx.x.124
IPaddr[13915]: 2007/03/15_11:41:20 INFO: /sbin/ifconfig eth0:0 xxx.xxx.x.124 down
IPaddr[13915]: 2007/03/15_11:41:20 INFO: IP Address xxx.xxx.x.124 released
IPaddr[13836]: 2007/03/15_11:41:20 INFO: IPaddr Success
heartbeat[13767]: 2007/03/15_11:41:20 info: foreign HA resource release completed (standby).
heartbeat[12609]: 2007/03/15_11:41:20 info: Local standby process completed [foreign].
heartbeat[12609]: 2007/03/15_11:41:23 WARN: 1 lost packet(s) for [linux-xczz] [13:15]
heartbeat[12609]: 2007/03/15_11:41:23 info: remote resource transition completed.
heartbeat[12609]: 2007/03/15_11:41:23 info: No pkts missing from linux-xczz!
heartbeat[12609]: 2007/03/15_11:41:23 info: Other node completed standby takeover of foreign resources.
heartbeat[12609]: 2007/03/15_11:41:35 info: linux-xczz wants to go standby [foreign]
heartbeat[12609]: 2007/03/15_11:41:36 info: standby: acquire [foreign] resources from linux-xczz
heartbeat[14011]: 2007/03/15_11:41:36 info: acquire local HA resources (standby).
ResourceManager[14021]: 2007/03/15_11:41:36 info: Acquiring resource group: prueba2 xxx.xxx.x.125/24 asterisk-rosa
IPaddr[14048]: 2007/03/15_11:41:36 INFO: IPaddr Running OK
ResourceManager[14021]: 2007/03/15_11:41:36 info: Running /etc/init.d/safe_asterisk start
ResourceManager[14021]: 2007/03/15_11:41:36 ERROR: Return code 1 from /etc/init.d/safe_asterisk
ResourceManager[14021]: 2007/03/15_11:41:36 CRIT: Giving up resources due to failure of safe_asterisk
ResourceManager[14021]: 2007/03/15_11:41:36 info: Releasing resource group: prueba2 xxx.xxx.x.125/24 asterisk-rosa
ResourceManager[14021]: 2007/03/15_11:41:36 info: Running /etc/init.d/safe_asterisk stop
ResourceManager[14021]: 2007/03/15_11:41:37 info: Running /etc/ha.d/resource.d/IPaddr xxx.xxx.x.125/24 stop
IPaddr[14310]: 2007/03/15_11:41:37 INFO: /sbin/route -n del -host xxx.xxx.x.125
IPaddr[14310]: 2007/03/15_11:41:37 INFO: /sbin/ifconfig eth0:2 xxx.xxx.x.125 down
IPaddr[14310]: 2007/03/15_11:41:37 INFO: IP Address xxx.xxx.x.125 released
IPaddr[14231]: 2007/03/15_11:41:37 INFO: IPaddr Success
heartbeat[14011]: 2007/03/15_11:41:37 info: local HA resource acquisition completed (standby).
heartbeat[12609]: 2007/03/15_11:41:37 info: Standby resource acquisition done [foreign].
heartbeat[12609]: 2007/03/15_11:41:37 info: remote resource transition completed.
heartbeat[12609]: 2007/03/15_11:41:38 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 520 ms (> 50 ms) (GSource: 0x80fbe00)
hb_standby[14375]: 2007/03/15_11:42:07 Going standby [foreign].
heartbeat[12609]: 2007/03/15_11:42:07 info: prueba2 wants to go standby [foreign]
heartbeat[12609]: 2007/03/15_11:42:08 info: standby: linux-xczz can take our foreign resources
heartbeat[14385]: 2007/03/15_11:42:08 info: give up foreign HA resources (standby).
ResourceManager[14395]: 2007/03/15_11:42:08 info: Releasing resource group: linux-xczz xxx.xxx.x.124/24 serctl
ResourceManager[14395]: 2007/03/15_11:42:08 info: Running /etc/init.d/serctl stop
ResourceManager[14395]: 2007/03/15_11:42:08 ERROR: Return code 1 from /etc/init.d/serctl
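Both failures in this log are non-zero exit codes from the init scripts (safe_asterisk on start, serctl on stop), after which ResourceManager gives the group up. A hedged Python sketch for pulling those lines out of a long ha-log: it re-joins records that a mail client wrapped onto several lines and pairs each "Return code N from SCRIPT" error with the action the same process last ran. The regular expressions are based only on the record format visible in these logs and may not cover every Heartbeat message; the function names are hypothetical.

```python
import re

# Start of a Heartbeat log record, e.g. "ResourceManager[14021]: 2007/03/15_11:41:36 "
RECORD_START = re.compile(r"^([\w.-]+)\[(\d+)\]: \d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2} ")
RETURN_CODE = re.compile(r" ERROR: Return code (\d+) from (\S+)")

def join_wrapped(text):
    """Re-join records that were split across continuation lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records

def failed_actions(records):
    """Return (script, action, exit code) for each 'Return code N from ...' error."""
    last_action = {}  # pid -> last word of that process's most recent "Running ..." record
    failures = []
    for record in records:
        start = RECORD_START.match(record)
        if not start:
            continue
        pid = start.group(2)
        if " info: Running " in record:
            last_action[pid] = record.split()[-1]  # usually "start", "stop" or "status"
        error = RETURN_CODE.search(record)
        if error:
            failures.append((error.group(2), last_action.get(pid), int(error.group(1))))
    return failures

# Excerpt of the prueba2 log, deliberately wrapped the way a mail client would wrap it.
EXCERPT = """\
ResourceManager[14021]: 2007/03/15_11:41:36 info: Running
/etc/init.d/safe_asterisk start
ResourceManager[14021]: 2007/03/15_11:41:36 ERROR: Return code 1 from
/etc/init.d/safe_asterisk
ResourceManager[14021]: 2007/03/15_11:41:36 CRIT: Giving up resources
due to failure of safe_asterisk
ResourceManager[14395]: 2007/03/15_11:42:08 info: Running
/etc/init.d/serctl stop
ResourceManager[14395]: 2007/03/15_11:42:08 ERROR: Return code 1 from
/etc/init.d/serctl
"""

for script, action, code in failed_actions(join_wrapped(EXCERPT)):
    print(f"{script} {action} -> exit {code}")
```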
This is the log I get from linux-xczz when I start heartbeat:
heartbeat[1063]: 2007/03/15_14:26:12 WARN: Core dumps could be lost if multiple dumps occur
heartbeat[1063]: 2007/03/15_14:26:12 WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
heartbeat[1063]: 2007/03/15_14:26:12 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[1063]: 2007/03/15_14:26:12 info: **************************
heartbeat[1063]: 2007/03/15_14:26:12 info: Configuration validated. Starting heartbeat 2.0.7
heartbeat[1064]: 2007/03/15_14:26:12 info: heartbeat: version 2.0.7
heartbeat[1064]: 2007/03/15_14:26:12 info: Heartbeat generation: 130
heartbeat[1064]: 2007/03/15_14:26:12 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[1064]: 2007/03/15_14:26:12 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[1064]: 2007/03/15_14:26:12 info: Removing /usr/local/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[1064]: 2007/03/15_14:26:12 info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
heartbeat[1064]: 2007/03/15_14:26:12 info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
heartbeat[1064]: 2007/03/15_14:26:12 info: glib: ping heartbeat started.
heartbeat[1064]: 2007/03/15_14:26:12 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[1064]: 2007/03/15_14:26:12 info: Local status now set to: 'up'
heartbeat[1064]: 2007/03/15_14:26:13 info: Link linux-xczz:eth1 up.
heartbeat[1064]: 2007/03/15_14:26:13 info: Link prueba2:eth1 up.
heartbeat[1064]: 2007/03/15_14:26:13 info: Status update for node prueba2: status active
heartbeat[1064]: 2007/03/15_14:26:13 info: Link xxx.xxx.x.x:xxx.xxx.x.x up.
heartbeat[1064]: 2007/03/15_14:26:13 info: Status update for node xxx.xxx.x.x: status ping
harc[1073]: 2007/03/15_14:26:13 info: Running /usr/local/etc/ha.d/rc.d/status status
heartbeat[1064]: 2007/03/15_14:26:14 info: Comm_now_up(): updating status to active
heartbeat[1064]: 2007/03/15_14:26:14 info: Local status now set to: 'active'
heartbeat[1064]: 2007/03/15_14:26:14 info: Starting child client "/usr/local/lib/heartbeat/ipfail" (1001,100)
heartbeat[1084]: 2007/03/15_14:26:14 info: Starting "/usr/local/lib/heartbeat/ipfail" as uid 1001 gid 100 (pid 1084)
heartbeat[1064]: 2007/03/15_14:26:14 info: remote resource transition completed.
heartbeat[1064]: 2007/03/15_14:26:14 info: remote resource transition completed.
heartbeat[1064]: 2007/03/15_14:26:14 info: Local Resource acquisition completed. (none)
heartbeat[1064]: 2007/03/15_14:26:15 info: prueba2 wants to go standby [foreign]
heartbeat[1064]: 2007/03/15_14:26:15 info: standby: acquire [foreign] resources from prueba2
heartbeat[1088]: 2007/03/15_14:26:15 info: acquire local HA resources (standby).
ResourceManager[1098]: 2007/03/15_14:26:15 info: Acquiring resource group: linux-xczz xxx.xxx.x.124/24 serctl
IPaddr[1122]: 2007/03/15_14:26:16 INFO: IPaddr Resource is stopped
ResourceManager[1098]: 2007/03/15_14:26:16 info: Running /usr/local/etc/ha.d/resource.d/IPaddr 192.168.1.124/24 start
IPaddr[1321]: 2007/03/15_14:26:16 INFO: eval /sbin/ifconfig eth0:0 xxx.xxx.x.124 netmask 255.255.255.0 broadcast xxx.xxx.x.255
IPaddr[1321]: 2007/03/15_14:26:16 INFO: Sending Gratuitous Arp for xxx.xxx.x.124 on eth0:0 [eth0]
IPaddr[1321]: 2007/03/15_14:26:16 INFO: /usr/local/lib/heartbeat/send_arp -i 500 -r 10 -p /usr/local/var/run/heartbeat/rsctmp/send_arp/send_arp-xxx.xxx.x.124 eth0 xxx.xxx.x.124 auto xxx.xxx.x.124 ffffffffffff
IPaddr[1241]: 2007/03/15_14:26:16 INFO: IPaddr Success
ResourceManager[1098]: 2007/03/15_14:26:16 info: Running /etc/init.d/serctl start
heartbeat[1088]: 2007/03/15_14:26:17 info: local HA resource acquisition completed (standby).
heartbeat[1064]: 2007/03/15_14:26:17 info: Standby resource acquisition done [foreign].
heartbeat[1064]: 2007/03/15_14:26:17 info: Initial resource acquisition complete (auto_failback)
heartbeat[1064]: 2007/03/15_14:26:23 info: remote resource transition completed.
heartbeat[1064]: 2007/03/15_14:26:28 info: linux-xczz wants to go standby [foreign]
heartbeat[1064]: 2007/03/15_14:26:28 info: standby: prueba2 can take our foreign resources
heartbeat[1492]: 2007/03/15_14:26:28 info: give up foreign HA resources (standby).
ResourceManager[1502]: 2007/03/15_14:26:28 info: Releasing resource group: prueba2 xxx.xxx.x.125/24 safe_asterisk
ResourceManager[1502]: 2007/03/15_14:26:28 info: Running /etc/init.d/safe_asterisk stop
ResourceManager[1502]: 2007/03/15_14:26:28 info: Running /usr/local/etc/ha.d/resource.d/IPaddr xxx.xxx.x.125/24 stop
IPaddr[1561]: 2007/03/15_14:26:29 INFO: IPaddr Success
heartbeat[1492]: 2007/03/15_14:26:29 info: foreign HA resource release completed (standby).
heartbeat[1064]: 2007/03/15_14:26:29 info: Local standby process completed [foreign].
heartbeat[1064]: 2007/03/15_14:26:30 WARN: 1 lost packet(s) for [prueba2] [68:70]
heartbeat[1064]: 2007/03/15_14:26:30 info: remote resource transition completed.
heartbeat[1064]: 2007/03/15_14:26:30 info: No pkts missing from prueba2!
heartbeat[1064]: 2007/03/15_14:26:30 info: Other node completed standby takeover of foreign resources.
heartbeat[1064]: 2007/03/15_14:27:00 info: prueba2 wants to go standby [foreign]
heartbeat[1064]: 2007/03/15_14:27:11 info: standby: acquire [foreign] resources from prueba2
heartbeat[1784]: 2007/03/15_14:27:11 info: acquire local HA resources (standby).
ResourceManager[1794]: 2007/03/15_14:27:11 info: Acquiring resource group: linux-xczz xxx.xxx.x.124/24 serctl
IPaddr[1818]: 2007/03/15_14:27:11 INFO: IPaddr Running OK
ResourceManager[1794]: 2007/03/15_14:27:11 info: Running /etc/init.d/serctl start
ResourceManager[1794]: 2007/03/15_14:27:11 ERROR: Return code 1 from /etc/init.d/serctl
ResourceManager[1794]: 2007/03/15_14:27:11 CRIT: Giving up resources due to failure of serctl
ResourceManager[1794]: 2007/03/15_14:27:11 info: Releasing resource group: linux-xczz xxx.xxx.x.124/24 serctl
ResourceManager[1794]: 2007/03/15_14:27:11 info: Running /etc/init.d/serctl stop
ResourceManager[1794]: 2007/03/15_14:27:11 info: Running /usr/local/etc/ha.d/resource.d/IPaddr xxx.xxx.x.124/24 stop
IPaddr[2090]: 2007/03/15_14:27:12 INFO: /sbin/route -n del -host xxx.xxx.x.124
IPaddr[2090]: 2007/03/15_14:27:12 INFO: /sbin/ifconfig eth0:0 xxx.xxx.x.124 down
IPaddr[2090]: 2007/03/15_14:27:12 INFO: IP Address xxx.xxx.x.124 released
IPaddr[2006]: 2007/03/15_14:27:12 INFO: IPaddr Success
heartbeat[1784]: 2007/03/15_14:27:12 info: local HA resource acquisition completed (standby).
heartbeat[1064]: 2007/03/15_14:27:12 info: Standby resource acquisition done [foreign].
heartbeat[1064]: 2007/03/15_14:27:12 info: remote resource transition completed.
hb_standby[2228]: 2007/03/15_14:27:42 Going standby [foreign].
heartbeat[1064]: 2007/03/15_14:27:42 info: linux-xczz wants to go standby [foreign]
heartbeat[1064]: 2007/03/15_14:27:42 info: standby: prueba2 can take our foreign resources
heartbeat[2238]: 2007/03/15_14:27:42 info: give up foreign HA resources (standby).
ResourceManager[2248]: 2007/03/15_14:27:43 info: Releasing resource group: prueba2 xxx.xxx.x.125/24 safe_asterisk
ResourceManager[2248]: 2007/03/15_14:27:43 info: Running /etc/init.d/safe_asterisk stop
ResourceManager[2248]: 2007/03/15_14:27:43 info: Running /usr/local/etc/ha.d/resource.d/IPaddr xxx.xxx.x.125/24 stop
IPaddr[2310]: 2007/03/15_14:27:43 INFO: IPaddr Success
heartbeat[2238]: 2007/03/15_14:27:43 info: foreign HA resource release completed (standby).
heartbeat[1064]: 2007/03/15_14:27:43 info: Local standby process completed [foreign].
heartbeat[1064]: 2007/03/15_14:27:44 WARN: 1 lost packet(s) for [prueba2] [114:116]
heartbeat[1064]: 2007/03/15_14:27:44 info: remote resource transition completed.
heartbeat[1064]: 2007/03/15_14:27:44 info: No pkts missing from prueba2!
heartbeat[1064]: 2007/03/15_14:27:44 info: Other node completed standby takeover of foreign resources.
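The linux-xczz log shows the same back-and-forth as the prueba2 one: an auto_failback handover, a failed start of the group, then hb_standby roughly 30 seconds later. A Python sketch that reduces a log to just those transitions with their timestamps follows. It assumes the record format seen in these logs; the event phrases are copied from the log lines, and the function name is hypothetical.

```python
import re

# One Heartbeat log record: program[pid]: YYYY/MM/DD_HH:MM:SS message
RECORD = re.compile(r"^([\w.-]+)\[(\d+)\]: \d{4}/\d{2}/\d{2}_(\d{2}:\d{2}:\d{2}) (.*)$")
# Message fragments that mark a standby handover or a give-up decision.
EVENTS = ("wants to go standby", "Going standby", "Giving up resources")

def standby_timeline(text):
    """List (time, message) for standby and give-up events, re-joining wrapped lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line  # continuation of a wrapped record
    timeline = []
    for record in records:
        m = RECORD.match(record)
        if m and any(event in m.group(4) for event in EVENTS):
            timeline.append((m.group(3), m.group(4)))
    return timeline

# Excerpt of the linux-xczz log, wrapped as a mail client would wrap it.
EXCERPT = """\
heartbeat[1064]: 2007/03/15_14:26:28 info: linux-xczz wants to go
standby [foreign]
heartbeat[1064]: 2007/03/15_14:27:00 info: prueba2 wants to go standby
[foreign]
ResourceManager[1794]: 2007/03/15_14:27:11 ERROR: Return code 1 from
/etc/init.d/serctl
ResourceManager[1794]: 2007/03/15_14:27:11 CRIT: Giving up resources
due to failure of serctl
hb_standby[2228]: 2007/03/15_14:27:42 Going standby [foreign].
"""

timeline = standby_timeline(EXCERPT)
for time, message in timeline:
    print(time, message)
```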
What I am trying to achieve is ser running on linux-xczz and asterisk
running on prueba2, with failover configured on both machines. Apparently
failover breaks down, and I lose both services whenever heartbeat is
running on both nodes. Any idea why this happens or what I'm doing
wrong?
Can ser and asterisk be managed by heartbeat with failover support?
Thanks in advance