[Django]-Celery missed heartbeat (on_node_lost)

16👍

Celery 3.1 added the new mingle and gossip procedures. I too was getting a ton of missed heartbeats, and passing --without-gossip to my workers cleared it up.

https://docs.celeryproject.org/en/3.1/whatsnew-3.1.html#mingle-worker-synchronization

Mingle: Worker synchronization

The worker will now attempt to synchronize with other workers in the
same cluster.

Synchronized data currently includes revoked tasks and logical clock.

This only happens at startup and causes a one second startup delay to
collect broadcast responses from other workers.

You can disable this bootstep using the --without-mingle argument.

https://docs.celeryproject.org/en/3.1/whatsnew-3.1.html#gossip-worker-worker-communication

Gossip: Worker <-> Worker communication

Workers are now passively subscribing to worker related events like
heartbeats.

This means that a worker knows what other workers are doing and can
detect if they go offline. Currently this is only used for clock
synchronization, but there are many possibilities for future additions
and you can write extensions that take advantage of this already.

Some ideas include consensus protocols, reroute task to best worker
(based on resource usage or data locality) or restarting workers when
they crash.

We believe that although this is a small addition, it opens amazing
possibilities.

You can disable this bootstep using the --without-gossip argument.
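For anyone launching workers from a script, here is a minimal sketch of how the two flags could be added to the worker command line. The app name `proj`, the helper `build_worker_argv`, and the `--loglevel` choice are assumptions for illustration; only `--without-gossip` and `--without-mingle` come from the release notes quoted above.

```python
import shlex


def build_worker_argv(app_name, without_gossip=True, without_mingle=False,
                      loglevel="INFO"):
    """Build the argument list for starting a Celery worker.

    `app_name` is a hypothetical project module; adjust to your own app.
    """
    argv = ["celery", "-A", app_name, "worker", f"--loglevel={loglevel}"]
    if without_gossip:
        # Disables the gossip bootstep (worker <-> worker event subscription).
        argv.append("--without-gossip")
    if without_mingle:
        # Disables the mingle bootstep (startup synchronization).
        argv.append("--without-mingle")
    return argv


argv = build_worker_argv("proj", without_gossip=True, without_mingle=True)
# Print the command as it would be typed in a shell.
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv)` or copied into a supervisor or systemd unit. Whether you need both flags or only `--without-gossip` depends on your setup.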

👤user3204501

12👍

Saw the same thing, and noticed a couple of things in the log files.

1) There were messages about time drift at the start of the log and occasional missed heartbeats.

2) At the end of the log file, the drift messages went away and only the missed heartbeat messages were present.

3) There were no changes to the system when the drift messages went away… They just stopped showing up.

I figured that the drift itself was likely the problem.

After syncing the time on all the servers involved, these messages went away. On Ubuntu, either run ntpdate from cron or run ntpd.
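To illustrate what "drift" means here, a rough sketch: compare the timestamp a worker puts in its heartbeat with the local clock at the moment the heartbeat arrives. The hostnames, sample timestamps, and the 16-second threshold are all assumptions for illustration, not values taken from the logs described above, and this is not Celery's own implementation.

```python
import time

# Assumed warning threshold in seconds, for illustration only.
DRIFT_WARN_SECONDS = 16.0


def clock_drift(sender_timestamp, received_at=None):
    """Absolute gap between the sender's clock reading and the local clock."""
    if received_at is None:
        received_at = time.time()
    return abs(received_at - sender_timestamp)


def hosts_out_of_sync(heartbeats, threshold=DRIFT_WARN_SECONDS):
    """Return hostnames whose drift exceeds the threshold.

    `heartbeats` maps hostname -> (sender_timestamp, received_at).
    """
    return sorted(
        host
        for host, (sent, received) in heartbeats.items()
        if clock_drift(sent, received) > threshold
    )


# Hypothetical sample: web2's clock is about 42 seconds off.
sample = {
    "celery@web1": (1000.0, 1000.4),
    "celery@web2": (1000.0, 1042.0),
}
print(hosts_out_of_sync(sample))  # ['celery@web2']
```

If a host shows up in a check like this, syncing its clock with NTP is the fix, as described above.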

👤user3691996

1👍

I’m having a similar issue, and I found the cause in my case.

I have two servers running workers.

When I ping one server from the other and the ping time is longer than 2 seconds, the log shows "missed heartbeat from celery@". The default heartbeat interval is 2 seconds.

The cause is my poor network.
http://docs.celeryproject.org/en/latest/internals/reference/celery.worker.heartbeat.html
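A simplified model of the observation above: if heartbeats are sent every 2 seconds, any gap between arrivals longer than that interval looks like a missed heartbeat. The helper name and the arrival times are made up for illustration, and Celery's real detection logic may allow more tolerance than this sketch does.

```python
# Default heartbeat interval in seconds, as stated above.
HEARTBEAT_INTERVAL = 2.0


def late_heartbeats(arrival_times, interval=HEARTBEAT_INTERVAL):
    """Return (previous, current, gap) for each gap longer than `interval`."""
    late = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        gap = cur - prev
        if gap > interval:
            late.append((prev, cur, gap))
    return late


# Hypothetical arrival times: network delay holds up the fourth heartbeat.
arrivals = [0.0, 2.0, 4.0, 8.5, 10.5]
print(late_heartbeats(arrivals))  # [(4.0, 8.5, 4.5)]
```

In this model a slow link delays delivery past the interval even though the remote worker is healthy, which matches the poor-network explanation.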

👤mutex86

-1👍

Add --without-mingle when you start Celery.

👤Flora
