Since it seems you are recovering the memory, this does not look like a real "leak".
One hypothesis:
When you restart a node on the DMQ bus, it can trigger memory usage on the other nodes, since they will start a SYNC and send one DMQ message per contact. Could it be that one node on the DMQ bus was restarted and is not answering DMQ messages?
A few ideas:
You could search your trace; maybe you will find the DMQ sync requests ...
You can also confirm whether there is a significant increase in active transactions.
Verify the state of the bus: kamcmd dmq.list_nodes
Verify the number of contacts on each node (to confirm that the cluster is healthy): kamctl stats | grep usrloc | grep contact
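
If the contact counts are being compared by hand, a small helper along these lines could flag a node that is out of step with the rest of the bus, for example one that was restarted and has not finished syncing. This is only a rough sketch: the node names, the 10% tolerance and the function name are made up for illustration, and the counts have to be copied in manually from the kamctl output on each node.

```python
def find_outlier_nodes(contact_counts, tolerance=0.10):
    """Return the nodes whose contact count is well below the cluster maximum.

    contact_counts: dict of node name -> number of contacts, collected by hand
                    from the usrloc statistics on each node (hypothetical input).
    tolerance:      allowed relative gap to the largest count (assumed value).
    """
    if not contact_counts:
        return []
    reference = max(contact_counts.values())
    if reference == 0:
        # No contacts anywhere, so there is nothing to compare.
        return []
    return sorted(
        node
        for node, count in contact_counts.items()
        if (reference - count) / reference > tolerance
    )


# Example with invented numbers: kam-c looks like it is still syncing.
counts = {"kam-a": 12000, "kam-b": 11950, "kam-c": 300}
print(find_outlier_nodes(counts))  # ['kam-c']
```

A node that shows up here right before the memory spike on the others would fit the restart hypothesis above.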
On Tue, Jul 31, 2018 at 9:21 AM, Rogelio Perez rogelio@telnyx.com wrote:
Thanks Daniel, Charles and Julien.
I confirm we're not getting the error log "running job failed". The behavior is always the same, any of the two failover instances would run without issues for a day or two and then suddenly start consuming all available memory in the span of an hour or less. Please check these graphs with some examples for more details: https://www.dropbox.com/sh/tu0jxi1vlbq81m8/AABhfz9rDumdCu3l0ROH7Lkla?dl=0
I'll try Daniel's patch and confirm results soon.
Rogelio