Since it seems you are recovering the memory, this does not look like a real "leak".

One hypothesis:

When you restart a node on the DMQ bus, it can trigger memory usage on the other nodes, since they will start a SYNC and send one DMQ message per contact.

Could it be that one node on the DMQ bus was restarted and is not answering DMQ messages?

A few ideas:

You could search your trace; you may find the DMQ sync requests there.

You can also check whether there is a significant increase in active transactions.
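
As a rough sketch of that search: DMQ traffic travels as SIP requests with the KDMQ method, so counting KDMQ request lines per request URI in a text capture should show whether a sync burst happened. The function name count_kdmq and the file name are placeholders, and this assumes the trace is a plain-text dump where each SIP request line starts at the beginning of a line.

```shell
# Hypothetical helper: count KDMQ request lines per request URI
# in a plain-text SIP trace, busiest URI first.
# Assumes request lines look like "KDMQ sip:... SIP/2.0" at line start.
count_kdmq() {
    grep -E '^KDMQ sip:' "$1" \
        | awk '{ n[$2]++ } END { for (u in n) print n[u], u }' \
        | sort -rn
}
```

For example, count_kdmq trace.txt prints one line per URI with its count; a sudden large count around the time memory climbs would support the sync hypothesis.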

Verify the state of the bus:

kamcmd dmq.list_nodes

Verify the number of contacts on each node (to confirm that the cluster is healthy):

kamctl stats | grep usrloc | grep contact
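
Once you have the contact count from each node, something like the sketch below can flag a node that is behind the others. It is only an illustration: check_contact_spread is a made-up helper, the "node count" input format is an assumption (you would collect the numbers yourself from the command above), and the 10% threshold is arbitrary.

```shell
# Hypothetical helper: reads "node count" pairs on stdin and prints
# every node whose contact count is more than <threshold> percent
# below the highest count seen (default threshold: 10).
check_contact_spread() {
    threshold="${1:-10}"
    awk -v t="$threshold" '
        { name[NR] = $1; count[NR] = $2; if ($2 > max) max = $2 }
        END {
            for (i = 1; i <= NR; i++)
                if (max > 0 && (max - count[i]) * 100 / max > t)
                    printf "LAGGING %s %d\n", name[i], count[i]
        }'
}
```

Fed with lines such as "kam1 1000", "kam2 990" and "kam3 500", it would report only kam3, which is the kind of node that would still be receiving sync traffic.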

On Tue, Jul 31, 2018 at 9:21 AM, Rogelio Perez <rogelio@telnyx.com> wrote:

Thanks Daniel, Charles and Julien.

I confirm we're not getting the error log "running job failed".
The behavior is always the same: either of the two failover instances runs without issues for a day or two and then suddenly starts consuming all available memory within an hour or less.
Please check these graphs, which show some examples in more detail: https://www.dropbox.com/sh/tu0jxi1vlbq81m8/AABhfz9rDumdCu3l0ROH7Lkla?dl=0

I'll try Daniel's patch and confirm results soon.

Rogelio