Meaning of "Disconnected while waiting for reply"

I've just deployed a nameko service (http) which calls another nameko
service over rpc and am getting this error randomly but quite often.

RpcConnectionError("Disconnected while waiting for reply")

While this error is happening I can see from the rabbitmq management
interface that messages are being processed.

Here's the stack-trace

RpcConnectionError: Disconnected while waiting for reply
  File "nameko/containers.py", line 405, in _run_worker
    result = method(*worker_ctx.args, **worker_ctx.kwargs)
  File "newrelic/api/background_task.py", line 102, in wrapper
    return wrapped(*args, **kwargs)
  File "./app/helpers/monitoring.py", line 18, in wrapper
    return func(*args, **kwargs)
  File "./app/helpers/decorators.py", line 16, in wrapper
    response = func(*args, **kwargs)
  File "./app/service.py", line 65, in process_event
    self.handle_process_elements(event_data=event_data)
  File "./app/service.py", line 77, in handle_process_elements
    user_id=event_data['user_id']
  File "./app/components/handlers/thing_handler.py", line 22, in handle_selected_elements
    user_id=user_id)
  File "nameko/rpc.py", line 395, in __call__
    return reply.result()
  File "nameko/rpc.py", line 376, in result
    self.resp_body = self.reply_event.wait()
  File "eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()

What could cause this to occur? And why does that code in nameko raise
an error on reconnection?

I'm fairly new to working with these services so could do with an
explanation of that part of the code.

The call it's failing on is fairly simple. However, if I ssh into the
machine and run it manually, I'm unable to reproduce the connection error.

stuff = self.stuff_rpc.get_stuff(user=123)

Thanks in advance

Looks like the issue was haproxy being in front of rabbitmq. Found this
fix https://deviantony.wordpress.com/2014/10/30/rabbitmq-and-haproxy-a-timeout-issue/

Hi Richard,

Glad you found the cause of the disconnection. Regarding the exception
raised in the code you linked to, the explanation is in the comment just
above: Due to the disconnection, the reply queue may have been deleted, so
it's _possible_ that unread replies have been lost. If that's the case, the
client would wait forever for a reply to arrive. To avoid this, we
"invalidate" all pending replies.

Best,
David

···

On Monday, 27 March 2017 14:29:38 UTC+1, richard...@babylonhealth.com wrote:

Looks like the issue was haproxy being in front of rabbitmq. Found this
fix
https://deviantony.wordpress.com/2014/10/30/rabbitmq-and-haproxy-a-timeout-issue/

This happens when your service is disconnected from the RabbitMQ broker
while an RPC reply is still pending. It's raised in on_consume_ready (when
the connection is re-established) because that's the first chance we have
to detect the disconnection.
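
To make the "invalidate pending replies on reconnection" idea concrete, here is a minimal sketch. It is not nameko's actual code: PendingReply and ReplyListener are hypothetical names, the exception class is a local stand-in, and threading.Event is used in place of eventlet so the sketch runs with the standard library alone.

```python
import threading


class RpcConnectionError(Exception):
    """Local stand-in for the nameko exception of the same name."""


class PendingReply:
    """One in-flight request: wait() blocks until a result or an error is set."""

    def __init__(self):
        self._done = threading.Event()
        self._result = None
        self._error = None

    def send(self, result):
        self._result = result
        self._done.set()

    def send_exception(self, error):
        self._error = error
        self._done.set()

    def wait(self, timeout=None):
        if not self._done.wait(timeout):
            raise TimeoutError("no reply received")
        if self._error is not None:
            raise self._error
        return self._result


class ReplyListener:
    """Hypothetical listener that tracks pending replies by correlation id."""

    def __init__(self):
        self.pending = {}

    def expect_reply(self, correlation_id):
        reply = PendingReply()
        self.pending[correlation_id] = reply
        return reply

    def on_message(self, correlation_id, body):
        # A reply arrived on the reply queue: wake the matching waiter.
        reply = self.pending.pop(correlation_id, None)
        if reply is not None:
            reply.send(body)

    def on_consume_ready(self):
        # Called whenever the consumer is (re)established. Anything still
        # pending was requested over the previous connection, whose reply
        # queue may be gone, so fail those waiters instead of letting them
        # block forever. On the very first connect nothing is pending.
        error = RpcConnectionError("Disconnected while waiting for reply")
        while self.pending:
            _, reply = self.pending.popitem()
            reply.send_exception(error)
```

A waiter blocked in wait() at the moment on_consume_ready() runs gets the exception raised at its call site, which is consistent with the stack trace above ending inside reply.result().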

RPC reply queues are exclusive and auto-delete, meaning that the queue can
only be consumed from one connection (the one which declared it), and when
that consumer disconnects the queue is removed.

Disconnection and reconnection means that by definition, the original reply
queue (the one to which the result will be sent) won’t exist anymore. The
client has no choice but to retry the request.
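
If retrying is acceptable for a given call, the retry can live in a small helper on the client side. This is only a sketch: call_with_retry and its parameters are made up for illustration, and the exception class is a local stand-in so the snippet runs without nameko installed (in a real service you would catch nameko's own RpcConnectionError, which I believe lives in nameko.exceptions).

```python
import time


class RpcConnectionError(Exception):
    """Local stand-in so the sketch runs without nameko installed."""


def call_with_retry(rpc_call, attempts=3, delay=0.5, sleep=time.sleep):
    """Call rpc_call(); try again if the reply was lost to a disconnection."""
    for attempt in range(1, attempts + 1):
        try:
            return rpc_call()
        except RpcConnectionError:
            if attempt == attempts:
                # Out of attempts: let the caller see the error.
                raise
            sleep(delay)
```

For the call in the original question this would look like stuff = call_with_retry(lambda: self.stuff_rpc.get_stuff(user=123)). Bear in mind that the first request may well have been processed by the target service even though its reply was lost, so this is only safe for calls that can be repeated without side effects.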

So yes, having HAProxy yank connections will definitely cause this. If
you're using a load-balancer I highly recommend you use nameko 2.4.4 or
greater, since this has support for consumer heartbeats. The heartbeat will
keep the connections active so your load-balancer doesn't kill them. It'll
also mean that you detect genuine disconnections immediately, rather than
waiting 2 hours (!) for a TCP timeout.
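
As a rough illustration of why the heartbeat helps with a load-balancer: the proxy only drops connections that stay silent for longer than its idle timeout, so any traffic that arrives more often than that keeps the connection open. The numbers below are hypothetical (not HAProxy or nameko defaults), and proxy_keeps_connection is a made-up helper, not part of any library.

```python
# Linux's default tcp_keepalive_time, in seconds: presumably the "2 hours" above.
LINUX_DEFAULT_TCP_KEEPALIVE_TIME = 7200


def proxy_keeps_connection(seconds_between_frames, proxy_idle_timeout):
    """True if traffic arrives often enough to keep resetting the idle timer."""
    return seconds_between_frames < proxy_idle_timeout


# Hypothetical proxy that drops connections idle for 50 seconds.
PROXY_IDLE_TIMEOUT = 50

# With a heartbeat frame every 30 seconds the connection is never idle long enough.
with_heartbeat = proxy_keeps_connection(30, PROXY_IDLE_TIMEOUT)

# Without heartbeats, a client silently waiting 120 seconds for a reply is cut off.
without_heartbeat = proxy_keeps_connection(120, PROXY_IDLE_TIMEOUT)
```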

I saw you also replied to
https://github.com/nameko/nameko/issues/359#issuecomment-289437640, which
is related. On reflection, I think what’s happening there is just that the
RpcProxy always tries to reconnect with the same queue. If RabbitMQ hasn’t
auto-deleted the old one yet, you get a clash. In the standalone case we
don’t catch it and so it bubbles out.

I’m not 100% sure there isn’t a better implementation for the RpcProxy.
It would be nice if RabbitMQ let you reconnect to an exclusive queue if
you were the original consumer. If that were the case then we could
survive small outages (as long as the queue wasn’t auto-deleted in the
meantime).

my understanding is that exclusive queues are owned by the _connection_, so
by definition if you get disconnected and come back, you are a new
connection and can't get back at it.

the alternative would be to use a different mechanism, though i'd worry
that would require some manual mechanism for garbage collection of reply
queues

d

···

On Monday, 27 March 2017 14:39:17 UTC+1, Matt Yule-Bennett wrote:

I’m not 100% sure there’s not a better possible implementation for the
RpcProxy. It would be nice if RabbitMQ let you reconnect to an exclusive
queue if you were the original consumer. If that were the case then we
could survive small outages (just as long as the queue wasn’t auto-deleted
in the mean time).
