RabbitMQ Under Load

Hi there

We're continuing to have trouble with our RabbitMQ setup. Under load things
break down quickly, and I use the term "load" very generously here!

Sentry is flooded with alerts such as: OS Error Socket Closed, Connection
Forced, Killing n managed threads, RecoverableConnectionError, killing
managed thread "run", killing 1 active worker, etc. etc.

We are using Amazon's load balancer with 3 rabbit instances behind it.

All we are doing is pub-sub between 2 services, and this is very light
message publishing.

Our heartbeat is set to 5 seconds.

Can anyone advise or suggest somewhere we can get hands-on help with our
rabbit setup?

Thanks

With a 5 second heartbeat this is almost certainly the same problem as
reported in an earlier thread on this list.

From that thread:

> If the workload inside the event handler isn't doing any I/O it is likely
> to be stealing the CPU long enough to starve the thread that's sending your
> heartbeats. Once two heartbeats are missed the broker will close the
> connection from the other end. You'll be able to see this happening if you
> check your rabbit logs.

Unfortunately Nameko handles the error very ungracefully. It would be much
better to catch it and try to reconnect rather than letting it bubble up and
kill the container, but fixing that won't solve what I suspect is your
underlying issue: tight CPU loops will starve other greenthreads. This is an
inherent limitation of implicitly yielding coroutines. There is a simple fix
though -- offload non-yielding code to a (native) threadpool, or insert some
explicit yields. Another quick fix would be to decrease the heartbeat
frequency, i.e. use a longer interval.
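
For what it's worth, here is a minimal sketch of both workarounds under
eventlet (which Nameko runs on). The service class, event names and the
cpu_heavy functions are made up for illustration; only tpool.execute and
eventlet.sleep are the real mechanisms:

```python
import eventlet
from eventlet import tpool

from nameko.events import event_handler


class ExampleService:
    name = "example_service"

    @event_handler("publisher_service", "something_happened")
    def handle_event(self, payload):
        # Option 1: run the non-yielding work in a native OS thread so the
        # eventlet hub (and the greenthread sending AMQP heartbeats) keeps
        # running while the work happens.
        return tpool.execute(cpu_heavy, payload)


def cpu_heavy(payload):
    # Pure-Python loop: no I/O, so it never yields to the hub on its own.
    total = 0
    for i in range(10 ** 7):
        total += i * i
    return total


def cpu_heavy_with_yields(payload):
    # Option 2: keep the work in the greenthread, but yield explicitly
    # every so often so heartbeats still get sent on time.
    total = 0
    for i in range(10 ** 7):
        total += i * i
        if i % 100000 == 0:
            eventlet.sleep(0)
    return total
```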

Please check your rabbit broker logs for "heartbeat missed" messages to
confirm this is the source of your problem. You could also look at how long
you take to service each request -- you'll see some that are longer than 2x
the heartbeat if this is your issue.
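
A crude way to spot those slow requests is to time the handler body yourself.
This decorator is purely illustrative (it is not a Nameko feature), and the
5-second threshold mirrors the heartbeat from this thread:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)
HEARTBEAT = 5  # seconds, as currently configured


def log_duration(func):
    """Log a warning when a handler runs longer than two heartbeats."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > 2 * HEARTBEAT:
                logger.warning(
                    "%s took %.1fs (more than 2 x %ss heartbeat); "
                    "likely to trigger a broker disconnect",
                    func.__name__, elapsed, HEARTBEAT,
                )
    return wrapper
```

Wrap the body of any handler you suspect with it, or just inline the timing
and logging.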

Thanks Matt, I appreciate your very quick response, and I've passed that
thread on.

All our handlers typically either write to Postgres or Redis.

What heartbeat do you use/recommend?

> Thanks Matt, I appreciate your very quick response, and I've passed that thread on.
>
> All our handlers typically either write to Postgres or Redis.

It might be just one handler causing the problem. You only need one thread to
hold the CPU for longer than 2x the heartbeat.

> What heartbeat do you use/recommend?

We use the Nameko default of 60 seconds.
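
In case it's useful, here is where that number lives. This sketch assumes the
HEARTBEAT config key (next to AMQP_URI) read by recent Nameko 2.x releases,
so double-check it against the version you're running; the rest is
illustrative:

```python
from nameko.runners import ServiceRunner

# The same keys can go into the YAML file passed to `nameko run --config ...`.
config = {
    "AMQP_URI": "amqp://guest:guest@localhost:5672/",
    # AMQP heartbeat interval in seconds (assumed config key -- check
    # nameko.constants for your version). At 5 seconds, roughly 10s of
    # hogged CPU (two missed heartbeats) is enough for the broker to
    # force-close the connection; 60 is the default.
    "HEARTBEAT": 60,
}

runner = ServiceRunner(config)
# runner.add_service(MyService)  # MyService is whatever service you run
# runner.start()
# runner.wait()
```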

Thanks Matt.

I've set ours to the default. I'll update here what happens next.

There is a bug here. The same thing shows up in
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/nameko-dev/Hmz659s4dBY/bqfmZtS9AwAJ

I'm building a test case to reproduce it, but it happens when we try to
acknowledge the message through a connection that has already closed. This
situation arises when the connection is lost while a worker is still
running, so it could be triggered by missed heartbeats or flaky connections.
We actually retry [1] these errors, but even if a connection is
re-established, the channel for the particular message will never recover,
so the retry simply postpones the error bubbling out.
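
To make the failure mode concrete, here is a rough kombu-level sketch (not
Nameko's actual code; the queue name and URI are made up). Delivery tags are
scoped to the channel a message arrived on, so once that connection is gone
a later ack cannot succeed, however many times you reconnect:

```python
from kombu import Connection

with Connection("amqp://guest:guest@localhost:5672//") as conn:
    queue = conn.SimpleQueue("demo_queue")
    queue.put({"hello": "world"})

    # The message is delivered on a channel belonging to this connection.
    message = queue.get(block=True, timeout=5)

    # ... long-running worker here; meanwhile the broker drops the
    # connection (missed heartbeats, flaky network, ...) ...
    conn.close()

    # Reconnecting gives us a brand new channel; the delivery tag of
    # `message` belongs to the dead one, so the late ack still fails
    # and retrying it only postpones the error.
    conn.connect()
    message.ack()  # raises -- the original channel is gone
```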

We actually have tests for this scenario but they're not precise enough and
don't notice that the container dies.

It should be an easy fix once I'm confident I know exactly what's going on.

[1] https://github.com/nameko/nameko/blob/b33a36197cc3d819360317f213f10afd0901b589/nameko/messaging.py#L313-L319

Thanks Matt.

Interesting to hear about this, and I've seen the PR - great stuff.

Simon
