nameko and HA rabbit

Hi,
Our RabbitMQ cluster consists of 3 servers clustered under an HA-all policy.
Due to disk errors, our master node went down, and all our clients started throwing errors.
We expected them to connect to the elected slave, but that didn't happen.

The connection string in the conf.yaml file is as follows:
AMQP_URI='amqp://path_to_node1;amqp://path_to_node2;amqp://path_to_node3'

So, given that the first (master) node is down, we expected nameko (and its clients) to switch to the second node (the elected master).
But that didn't happen.
Furthermore, restarting the services also didn't work, since the master node is first on the list of URIs, and is down.

We expect that if the first connection is refused/broken, the next URI should be used.
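
The expected behaviour can be sketched as follows. This is a minimal illustration using only the standard library, not kombu's or nameko's actual implementation; `split_uris`, `connect_with_failover` and the `connect` callable are hypothetical names introduced for the example.

```python
from itertools import cycle, islice


def split_uris(amqp_uri):
    """Split a semicolon-separated AMQP_URI value into individual URIs."""
    return [part.strip() for part in amqp_uri.split(";") if part.strip()]


def connect_with_failover(uris, connect, max_attempts=None):
    """Try each URI in turn and return (uri, connection) for the first success.

    `connect` is a hypothetical callable that takes one URI and either
    returns a connection object or raises OSError when the node is down.
    """
    uris = list(uris)
    if not uris:
        raise ValueError("at least one URI is required")
    if max_attempts is None:
        max_attempts = len(uris)  # one pass over the whole list by default

    last_error = None
    # cycle() wraps around to the first URI once the list is exhausted.
    for uri in islice(cycle(uris), max_attempts):
        try:
            return uri, connect(uri)
        except OSError as exc:
            last_error = exc  # remember the failure and move to the next node
    raise ConnectionError("all broker URIs failed") from last_error
```

With the three URIs from the connection string above and node1 down, a call such as `connect_with_failover(split_uris(AMQP_URI), connect)` would return a connection to node2 instead of failing on the first entry.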

So, how would you handle automatic reconnect to elected nodes once the master is down?

I just found an issue on that: https://github.com/celery/kombu/issues/185 (from 2012!!)
The suggested fix also has a bug that leaks memory.
Anyway, I don't understand why this is not top priority

I wasn't aware that kombu supported this kind of connection params. Nameko
has never explicitly supported it. From quickly scanning the kombu docs it
looks like we should be round-robining between the provided URIs, but it's
never been tested.

At Student.com (and everywhere else I know that uses HA'd RabbitMQ) there's
a load-balancer in front of the cluster, and that takes care of routing
traffic to healthy nodes.

If you provide a testcase we might be able to figure out why nameko isn't
supporting round-robin connections out of the box.

···

On Sunday, November 20, 2016 at 1:48:18 PM UTC, tsachi...@gmail.com wrote:

I just found an issue on that: https://github.com/celery/kombu/issues/185
(from 2012!!)
The suggested fix also has a bug that leaks memory.
Anyway, I don't understand why this is not top priority

Well, a load balancer is a valid option, but given the way RabbitMQ implements HA, it creates unnecessary hops between nodes, which adds latency.

The best solution, performance-wise (though a bad programming design), is for the clients to cycle through a list of nodes when the main node is down.

It IS implemented in kombu, but as mentioned, it has a bug, so it always tries to connect to the master, even if it is down.

By the way, we are now testing a configuration of rabbit nodes behind ELB,
and this is not good either.
Once you stop the master node (and wait for the ELB to remove it from
service), nameko starts throwing "IOError: Socket closed" exceptions,
exclusive queues created by ClusterRpcProxy become locked, and things look
really bad.

Did you try to see what happens when you stop the master?

···

On Sunday, November 20, 2016 at 4:06:00 PM UTC+2, Matt Yule-Bennett wrote:

I wasn't aware that kombu supported this kind of connection params. Nameko
has never explicitly supported it. From quickly scanning the kombu docs it
looks like we should be round-robining between the provided URIs, but it's
never been tested.

At Student.com (and everywhere else I know that uses HA'd RabbitMQ)
there's a load-balancer in front of the cluster, and that takes care of
routing traffic to healthy nodes.

If you provide a testcase we might be able to figure out why nameko isn't
supporting round-robin connections out of the box.

Is the bug in kombu or nameko? And can you reproduce it in a test case?

···

On Sun, 20 Nov 2016 at 14:43, <tsachi.shuval@gmail.com> wrote:

Well, a load balancer is a valid option, but given the way RabbitMQ
implements HA, it creates unnecessary hops between nodes, which adds
latency.

The best solution, performance-wise (though a bad programming design), is
for the clients to cycle through a list of nodes when the main node is
down.

It IS implemented in kombu, but as mentioned, it has a bug, so it always
tries to connect to the master, even if it is down.

I've done some testing with ELB.

The "IOError: socket closed" exceptions are expected. Kombu prints the
stacktrace when it detects the disconnection, and then immediately tries to
reconnect again. It will keep trying until the connection can be
re-established, which should be as soon as the ELB redirects traffic to the
other node.

With nameko 2.4.4 you will see "disconnected while waiting for reply" from
the client, which is also expected. This will be raised for any requests
that were in flight when the connection was lost, because there's no way to
know whether the reply was swallowed by a reply-queue being auto-deleted.

Things behave better after the changes
in https://github.com/nameko/nameko/pull/383. Critically, increasing the
safety_interval in consume() stops the ResourceLocked exception being
thrown by the RPC proxy (although it's worth noting that the client should
recover even in this case).

The changes in https://github.com/nameko/nameko/pull/337 (enable confirms
for all AMQP publishers) are also required for nameko to be truly tolerant
of disconnections. Without them, publishers will lose messages immediately
after a disconnection, which often leads to hanging workers (e.g. when an
RPC reply message is lost, the caller waits forever).

I expect #337 to land soon, but in the meantime are you able to do some
testing with 2.4.4?

···

On Tuesday, November 22, 2016 at 9:07:35 AM UTC, tsachi...@gmail.com wrote:

By the way, we are now testing a configuration of rabbit nodes behind ELB,
and this is not good either.
Once you stop the master node (and wait for the ELB to remove it from
service), nameko starts throwing "IOError: Socket closed" exceptions,
exclusive queues created by ClusterRpcProxy become locked, and things look
really bad.

Did you try to see what happens when you stop the master?

The bug is in kombu and can be easily reproduced.
1. AMQP_URI variable in conf.yaml: assign a semicolon-separated string of URIs. The master is the first one, followed by its slaves.
2. Start a nameko service.
3. Stop the master node.
4. Try to communicate with nameko.

Instead of kombu switching to the newly elected master, it will continue trying to connect to the original master, which is down.

This doesn't illustrate whether the bug is with kombu or nameko, because
you're using both. Can you produce it using kombu alone?

···

On Sunday, November 20, 2016 at 2:59:36 PM UTC, tsachi...@gmail.com wrote:

The bug is in kombu and can be easily reproduced.
1. AMQP_URI variable in conf.yaml: assign a semicolon-separated string of
URIs. The master is the first one, followed by its slaves.
2. Start a nameko service.
3. Stop the master node.
4. Try to communicate with nameko.

Instead of kombu switching to the newly elected master, it will continue
trying to connect to the original master, which is down.

Well, after digging into the kombu and nameko code, the bug is in kombu's Connection class.
See my second post in this thread for an issue opened for this bug.

You can produce it easily in kombu.
Create a connection, channel, consumer and queue. Start consuming in a loop. Then stop the master rabbit node and see the errors that the consumer throws.
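
A kombu-only reproduction along those lines might look like the sketch below. It assumes kombu is installed and the cluster is reachable; the hostnames are the placeholders from the first post, and `build_failover_uri`, `consume_forever` and the `failover-test` queue name are made up for the example.

```python
import socket


def build_failover_uri(uris):
    """Join broker URIs with ';', the form used for AMQP_URI in this thread."""
    return ";".join(uris)


def consume_forever(amqp_uri, queue_name="failover-test"):
    """Consume in a loop so connection errors surface when the master stops."""
    # Imported here so build_failover_uri() can be used without kombu installed.
    from kombu import Connection, Exchange, Queue

    def on_message(body, message):
        print("received:", body)
        message.ack()

    exchange = Exchange(queue_name, type="direct")
    queue = Queue(queue_name, exchange, routing_key=queue_name)

    with Connection(amqp_uri) as conn:
        with conn.Consumer(queue, callbacks=[on_message]):
            while True:
                try:
                    conn.drain_events(timeout=1)
                except socket.timeout:
                    pass  # no message within the timeout; keep waiting


# To run the reproduction, call:
# consume_forever(build_failover_uri([
#     "amqp://path_to_node1",
#     "amqp://path_to_node2",
#     "amqp://path_to_node3",
# ]))
```

Stop the master node while the loop is running; any connection error other than the read timeout propagates out of `consume_forever`, which shows what the consumer throws.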

Hi Matt,

No, we're still running nameko 2.2.0.

We solved the disconnections issue by monkey-patching the kombu package
(essentially fixing a bug in it).

We're not using an ELB, since in our tests it doesn't work well. We prefer
(not ideal, but it works better) to run in HA mode, and let the clients
connect to all active rabbit nodes.
This architecture didn't work due to the bug in kombu, which, as mentioned
above, we monkey-patched.

I will take a look at nameko 2.4 soon. At the moment, things seem to work
just fine.

Tsachi

···

On Monday, December 5, 2016 at 8:06:02 PM UTC+2, Matt Yule-Bennett wrote:

I've done some testing with ELB.

The "IOError: socket closed" exceptions are expected. Kombu prints the
stacktrace when it detects the disconnection, and then immediately tries to
reconnect again. It will keep trying until the connection can be
re-established, which should be as soon as the ELB redirects traffic to the
other node.

With nameko 2.4.4 you will see "disconnected while waiting for reply" from
the client, which is also expected. This will be raised for any requests
that were in flight when the connection was lost, because there's no way to
know whether the reply was swallowed by a reply-queue being auto-deleted.

Things behave better after the changes in
https://github.com/nameko/nameko/pull/383. Critically, increasing the
safety_interval in consume() stops the ResourceLocked exception being
thrown by the RPC proxy (although it's worth noting that the client should
recover even in this case).

The changes in https://github.com/nameko/nameko/pull/337 (enable confirms
for all AMQP publishers) are also required for nameko to be truly tolerant
of disconnections. Without them, publishers will lose messages immediately
after a disconnection, which often leads to hanging workers (e.g. when an
RPC reply message is lost, the caller waits forever).

I expect #337 to land soon, but in the meantime are you able to do some
testing with 2.4.4?

Glad to hear you got it working. So you went back to passing multiple URIs
and letting kombu round-robin by itself?

What was the monkey-patch / bug-fix you had to apply to kombu?

Matt.

···

On Tuesday, December 6, 2016 at 8:42:08 AM UTC, tsachi...@gmail.com wrote:

Hi Matt,

No, we're still running nameko 2.2.0.

We solved the disconnections issue by monkey-patching the kombu package
(essentially fixing a bug in it).

We're not using an ELB, since in our tests it doesn't work well. We prefer
(not ideal, but it works better) to run in HA mode, and let the clients
connect to all active rabbit nodes.
This architecture didn't work due to the bug in kombu, which, as mentioned
above, we monkey-patched.

I will take a look at nameko 2.4 soon. At the moment, things seem to work
just fine.

Tsachi

That is correct. All clients are instantiated with multiple URIs, and kombu
round-robins between the hosts if the master fails.

Our patch looks like this:

from kombu.connection import Connection

original_info = Connection._info

def _info(self, resolve=True):
    # Fixes a bug in kombu.Connection._info method
    info = original_info(self, resolve=resolve)

    info = list(info)

    # The last item is the 'alternates' param. Remove it and use it to
    # replace the 'hostname' entry.
    _, alt = info.pop()
    if alt:
        info[0] = ('hostname', ';'.join(alt))

    return tuple(info)

Connection._info = _info

Import this piece of code at startup, before any kombu connections are
created, and it will fix the round-robin bug.
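
To make the transformation easier to follow, here is the same logic as a standalone function applied to sample data. The tuple layout (a 'hostname' pair first, an 'alternates' pair last) is an assumption taken from the patch itself, and `merge_alternates` and the sample values are made up for illustration; this is not kombu code.

```python
def merge_alternates(info):
    """Apply the same transformation as the patch to a tuple of pairs.

    `info` stands in for the return value of Connection._info: a tuple of
    (key, value) pairs whose first item is 'hostname' and whose last item
    is 'alternates' (an assumed layout, as described above).
    """
    items = list(info)
    _, alternates = items.pop()  # drop the trailing 'alternates' entry
    if alternates:
        # Fold every alternate host back into a single hostname string.
        items[0] = ("hostname", ";".join(alternates))
    return tuple(items)


# Made-up sample data for illustration only.
sample = (
    ("hostname", "node1"),
    ("userid", "guest"),
    ("alternates", ["node1", "node2", "node3"]),
)
result = merge_alternates(sample)
```

After the call, the 'hostname' entry in `result` carries all three hosts joined by semicolons and the 'alternates' entry is gone.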

Tsachi

···

On Tuesday, December 6, 2016 at 12:14:18 PM UTC+2, Matt Yule-Bennett wrote:

Glad to hear you got it working. So you went back to passing multiple URIs
and letting kombu round-robin by itself?

What was the monkey-patch / bug-fix you had to apply to kombu?

Matt.
