H13 (Connection closed without response) errors on Heroku scale down


After three weeks of work I was finally able to fix this issue.

Short answer:

Avoid using Heroku to run Docker images if you can.

Heroku sends SIGTERM to ALL processes in the dyno, which is very hard to deal with. You will need to patch almost every process inside the Docker container so that it handles SIGTERM and terminates gracefully.

The standard way of terminating a Docker container is the docker stop command, which sends SIGTERM ONLY to the root process (the entrypoint), where it can be dealt with.

Heroku has a very arbitrary process for terminating the instance, which is incompatible with existing applications as well as existing Docker image deployments. According to my communication with Heroku, they are unable to change this in the future.

Long answer:

There was not one single issue but 5 separate issues.
In order to terminate the instance successfully, the following conditions need to be fulfilled:

  • Nginx has to terminate first and start last (so that the Heroku router stops sending requests; this is similar to Puma), and the termination has to be graceful, which is usually done with the SIGQUIT signal.
  • The other applications need to terminate gracefully and in the correct order: in my case first Nginx, then Gunicorn, and PGBouncer last. The order matters, e.g. PGBouncer must terminate after Gunicorn so that running SQL queries are not interrupted.
  • The docker-entrypoint.sh needs to catch the SIGTERM signal. This didn’t show up when I was testing locally.
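The ordering above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code used in this deployment (where supervisord handles the ordering): the function name `stop_in_order`, the dictionary keys, and the timeout are hypothetical, and the signals are the ones named in this answer.

```python
import signal
import subprocess

# Stop order described above: Nginx first, then Gunicorn, PGBouncer last.
# QUIT is stock Nginx's graceful shutdown signal; USR1 is the graceful stop
# of the patched Gunicorn from this answer; INT stops PGBouncer gracefully.
# The names used as keys are hypothetical.
STOP_ORDER = [
    ("nginx", signal.SIGQUIT),
    ("gunicorn", signal.SIGUSR1),
    ("pgbouncer", signal.SIGINT),
]


def stop_in_order(processes, stop_order=STOP_ORDER, timeout=10.0):
    """Signal each process in turn and wait for it to exit before the next one.

    `processes` maps a name to a subprocess.Popen object. Returns the names
    in the order in which they were stopped.
    """
    stopped = []
    for name, sig in stop_order:
        proc = processes[name]
        proc.send_signal(sig)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # The graceful stop took too long: fall back to SIGKILL.
            proc.kill()
            proc.wait()
        stopped.append(name)
    return stopped
```

Waiting for each process to exit before signalling the next one is what keeps PGBouncer alive until Gunicorn has finished its queries. The sum of the timeouts has to stay below Heroku's 30 seconds.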

In order to achieve this I had to deal with every application separately:

Nginx:

I had to patch Nginx to switch the SIGTERM and SIGQUIT signals, so I run the following command in my Dockerfile:

# Compile nginx and patch it to switch SIGTERM and SIGQUIT signals
RUN curl -L http://nginx.org/download/nginx-1.22.0.tar.gz -o nginx.tar.gz \
  && tar -xvzf nginx.tar.gz \
  && cd nginx-1.22.0 \
  && sed -i "s/ QUIT$/TIUQ/g" src/core/ngx_config.h \
  && sed -i "s/ TERM$/QUIT/g" src/core/ngx_config.h \
  && sed -i "s/ TIUQ$/TERM/g" src/core/ngx_config.h \
  && ./configure --without-http_rewrite_module \
  && make \
  && make install \
  && cd .. \
  && rm nginx-1.22.0 -rf \
  && rm nginx.tar.gz

Issue I created

uWSGI/Gunicorn:

I gave up on uWSGI and switched to Gunicorn (which terminates gracefully on SIGTERM), but I had to patch it anyway in the end, because it needs to terminate later than Nginx. I disabled the SIGTERM signal and mapped its function to SIGUSR1.
My patched version is here: https://github.com/PetrDlouhy/gunicorn/commit/1414112358f445ce714c5d4f572d78172b993b79
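The idea behind the patch can be shown with Python's standard signal module. This is a minimal sketch, not Gunicorn's code and not the patch itself: the class `GracefulServer` and its attributes are hypothetical stand-ins for a server master process.

```python
import os
import signal


class GracefulServer:
    """Toy stand-in for a server master process (not Gunicorn's real code)."""

    def __init__(self):
        self.running = True
        self.ignored_sigterms = 0

    def install_signal_handlers(self):
        # Heroku sends SIGTERM to every process at the same moment, so it is
        # only counted here and never triggers a shutdown.
        signal.signal(signal.SIGTERM, self._ignore_sigterm)
        # The graceful shutdown moves to SIGUSR1, which a supervisor can
        # send later, once the front-end proxy has stopped.
        signal.signal(signal.SIGUSR1, self._graceful_stop)

    def _ignore_sigterm(self, signum, frame):
        self.ignored_sigterms += 1

    def _graceful_stop(self, signum, frame):
        # A real server would finish in-flight requests before exiting.
        self.running = False
```

Calling `install_signal_handlers()` and then `os.kill(os.getpid(), signal.SIGTERM)` leaves `running` unchanged, while sending `signal.SIGUSR1` flips it to `False`. Note that stock Gunicorn uses SIGUSR1 for reopening log files, so that function is lost with this kind of remapping.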

I install it with:

# The git config line is needed for Dockerfile.test only, until the next version of Dulwich is released
RUN poetry run pip install -e git+https://github.com/PetrDlouhy/gunicorn@no_sigterm#egg=gunicorn[gthread] \
   && cd `poetry env info -p`/src/gunicorn/ \
   && git config core.repositoryformatversion 0 \
   && cd /project

Issue I created

PGBouncer:

I also deployed PGBouncer, which I had to modify so that it does not react to SIGTERM:

# Compile pgbouncer and patch it to ignore SIGTERM
RUN curl -L https://github.com/pgbouncer/pgbouncer/releases/download/pgbouncer_1_17_0/pgbouncer-1.17.0.tar.gz -o pgbouncer.tar.gz \
  && tar -xvzf pgbouncer.tar.gz \
  && cd pgbouncer-1.17.0 \
  && sed -i "s/got SIGTERM, fast exit/PGBouncer got SIGTERM, do nothing/" src/main.c \
  && sed -i "s/ exit(1);$//g" src/main.c \
  && ./configure \
  && make \
  && make install \
  && cd .. \
  && rm pgbouncer-1.17.0 -rf \
  && rm pgbouncer.tar.gz

It can still be brought down gracefully with SIGINT.

Issue I created

docker-entrypoint.sh:

I had to trap SIGTERM in my docker-entrypoint.sh with:

_term() {
  echo "Caught SIGTERM signal. Do nothing here, because Heroku already sent signal everywhere."
}

trap _term SIGTERM

supervisor:

To avoid R12 errors, all processes need to terminate within Heroku's 30-second grace period. I achieved this by setting priorities in supervisord.conf:

[supervisord]
nodaemon=true

[program:gunicorn]
command=poetry run newrelic-admin run-program gunicorn wsgi:application -c /etc/gunicorn/gunicorn.conf.py
priority=2
stopsignal=USR1
...

[program:nginx]
command=/usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
priority=3
...

[program:pgbouncer]
command=/usr/local/bin/pgbouncer /project/pgbouncer/pgbouncer.ini
priority=1
stopsignal=INT
...

Testing the solutions:

In order to test what was going on, I had to develop some testing techniques which might come in handy in different but similar cases.

I created a view which waits 10 seconds before answering and bound it to the /slow_view URL.
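Such a slow endpoint could look roughly like the following. This is a framework-free WSGI sketch using only the standard library, not the actual view from this project; the factory name `make_slow_app` and its parameters are hypothetical, and the `sleep` argument exists only so that the delay can be replaced in tests.

```python
import time


def make_slow_app(delay_seconds=10.0, sleep=time.sleep):
    """Build a WSGI app whose /slow_view path waits before answering."""

    def application(environ, start_response):
        if environ.get("PATH_INFO") == "/slow_view":
            # Keep the request in flight long enough to send signals
            # to the server processes while it is being handled.
            sleep(delay_seconds)
            status = "200 OK"
            body = b"finished after waiting\n"
        else:
            status = "404 Not Found"
            body = b"not found\n"
        headers = [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]

    return application


# Module-level callable that a WSGI server can be pointed at.
application = make_slow_app()
```

If the server shuts down gracefully, a request to /slow_view that started before the signal should still receive its 200 response.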

Then I started the server in a Docker instance and made a request to the slow view with curl -I "http://localhost:8080/slow_view". While that request was still running, I opened a second connection to the Docker instance and executed the kill command there, with pkill -SIGTERM . or e.g. pkill -SIGTERM gunicorn.
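The curl step can also be scripted, which makes it easier to repeat the experiment. The sketch below is an assumption-laden alternative to curl, not something used in the original setup: the function name `probe` is hypothetical, and it relies only on the standard library's http.client.

```python
import http.client


def probe(host, port, path="/slow_view", timeout=30.0):
    """Send a HEAD request and return the HTTP status code.

    Returns None when the server closes the connection without sending a
    response, which is the local counterpart of what Heroku reports as H13.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (http.client.RemoteDisconnected, ConnectionResetError, BrokenPipeError):
        # The peer went away before any status line arrived.
        return None
    finally:
        conn.close()
```

Running `probe("localhost", 8080)` in one terminal and the pkill command in another then gives either a status code (graceful shutdown) or None (connection closed without response).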

I could also run the kill command on a testing Heroku dyno, which I connected to with heroku ps:exec --dyno web.1 --app my_app.
