Since it seems you are recovering the memory, this does not look like a real
"leak".
One hypothesis:
When you restart a node on the DMQ bus, it can trigger memory usage on the
other nodes, since they will start a SYNC and send one DMQ message per
contact.
Could it be that one node on the DMQ bus was restarted and is not answering
DMQ messages?
A few ideas:
You could search your traces; maybe you will find the DMQ sync requests.
You can also confirm whether there is a significant increase in active
transactions.
Verify the state of the bus:
kamcmd dmq.list_nodes
Verify the number of contacts on each node (to confirm that the cluster is
healthy):
kamctl stats | grep usrloc | grep contact
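
To compare nodes more easily, the contact check above could be wrapped in a
small helper and run on each node. This is only a rough sketch: the function
name sum_contacts is made up for illustration, and it assumes that each
statistic is printed as "group:name = value", which may differ between
versions.

```shell
#!/bin/sh
# Sketch: total the usrloc contact counters from `kamctl stats`-style output.
# Assumption: each statistic is printed as "group:name = value".

# sum_contacts is a hypothetical helper name, not part of Kamailio.
sum_contacts() {
    # Read statistics on stdin, keep the usrloc contact counters,
    # and print the sum of their values (0 if none are found).
    grep usrloc | grep contact | awk -F'=' '{ total += $2 } END { print total + 0 }'
}

# Query the local node only when kamctl is actually installed.
if command -v kamctl >/dev/null 2>&1; then
    echo "contacts on $(hostname): $(kamctl stats | sum_contacts)"
fi
```

If the totals differ a lot between nodes, or one node keeps reporting 0, that
node is probably the one to look at first.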
On Tue, Jul 31, 2018 at 9:21 AM, Rogelio Perez <rogelio(a)telnyx.com> wrote:
Thanks Daniel, Charles and Julien.
I confirm we're not getting the error log "running job failed".
The behavior is always the same: either of the two failover instances runs
without issues for a day or two and then suddenly starts consuming all
available memory in the span of an hour or less.
Please check these graphs with some examples for more details:
https://www.dropbox.com/sh/tu0jxi1vlbq81m8/AABhfz9rDumdCu3l0ROH7Lkla?dl=0
I'll try Daniel's patch and confirm results soon.
Rogelio