Call_async takes longer than expected

I’m using call_async to issue multiple parallel RPC calls into a nameko RPC service from a client. I was gathering some timing data on how long it takes to complete the task (I wanted to compare serial requests vs parallel ones and see how it affected the system).

I noticed that serially, each call takes about 30ms. When run with call_async, each call to call_async takes about 20ms, except sometimes (usually the first 2 calls of a batch) it takes under 2ms to complete. This is reliably reproducible: if there is a barrage of requests, the first few return fast and the rest are delayed. Waiting a bit seems to reset something, and in the next barrage the first few are again quick and the rest are delayed.

The overall time to complete the task serially is 14s and in parallel about 10s. So there is an improvement, but I was expecting more (I’m making about 400 calls total). The service is configured to run 10 workers.
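
Roughly, the comparison I’m doing looks like this (a simplified sketch; my_service.do_work stands in for my actual service and method):

from nameko.standalone.rpc import ClusterRpcProxy
from time import perf_counter

config = {'AMQP_URI': 'amqp://guest:guest@localhost:5672'}

with ClusterRpcProxy(config) as cluster_rpc:
    # serial: each call blocks until its reply comes back
    start = perf_counter()
    for _ in range(400):
        cluster_rpc.my_service.do_work()
    print(f'serial: {perf_counter() - start:.2f}s')

    # parallel: fire all the requests with call_async, then collect the replies
    start = perf_counter()
    pending = [cluster_rpc.my_service.do_work.call_async() for _ in range(400)]
    for p in pending:
        p.result()
    print(f'parallel: {perf_counter() - start:.2f}s')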

Why does call_async take 20ms to complete? Does it wait for a rabbit ack before returning? Am I just setting my expectations too high? :slight_smile:
What’s going on with the first few calls taking 2ms? That’s the kind of performance I was expecting from all of the calls. It feels like a configuration issue, but I’m not really sure what the issue would be.

Any insights would be really helpful and appreciated. Thanks!

-Eric

This is a bit surprising. call_async doesn’t wait for anything apart from the publish confirm from the RabbitMQ broker. The “ack” is sent after the worker completes. The difference between call and call_async is precisely that the latter does not wait for the ack before returning.
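
For reference, the usage difference looks roughly like this (service and method names here are placeholders):

from nameko.standalone.rpc import ClusterRpcProxy

config = {'AMQP_URI': 'amqp://guest:guest@localhost:5672'}

with ClusterRpcProxy(config) as cluster_rpc:
    # call: publish the request and block until the worker's reply arrives
    result = cluster_rpc.some_service.some_method()

    # call_async: publish the request (waiting only for the broker's publish
    # confirm) and return a handle immediately
    reply = cluster_rpc.some_service.some_method.call_async()
    result = reply.result()  # blocks here until the reply arrives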

Which version of Nameko are you using?

If you can post a quick reproduction that would be helpful.

Does it improve if you set a higher number of workers? (Like 100, say)

I’ll work on it and get back to you. What I’ve found so far: a simple ping() method in the same service returns in under 2ms almost every time when I throw it into a 1000x loop. When I call a method that actually does some work (an entry lookup), it also returns in roughly under 2ms in the same 1000x loop. There are occasional calls that take over 10ms, but they are rare. So far so good. It shouldn’t matter how complex the method is, so it was good to see this result.
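
The quick loop I’m using for these timings is roughly this (a simplified sketch; my_service stands in for the real service, and I’m timing each call_async the same way as in the full client I post further down):

from nameko.standalone.rpc import ClusterRpcProxy
from time import perf_counter

config = {'AMQP_URI': 'amqp://guest:guest@localhost:5672'}

with ClusterRpcProxy(config) as cluster_rpc:
    pending = []
    for _ in range(1000):
        s = perf_counter()
        pending.append(cluster_rpc.my_service.ping.call_async())
        print(f'{(perf_counter() - s) * 1000:.3f}ms')
    for p in pending:
        p.result()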

When I go back to my other test (which is an integration test on real software) and time a similar call, it comes back either in 25-35ms or in under 3ms, but much more often in 25-35ms.

So there’s a difference somewhere… I’ll see if I can narrow it down into a reproducible test case, and try your suggestions of increasing the number of workers to see if that makes a difference. I wonder if there’s some kind of contention between queues in rabbit… or maybe I’m just making some dumb timing mistake… I’ll look and post what I find here.

I’m on nameko 2.13.0 – I see there is a 3.0.0 but wasn’t sure how to upgrade or what the consequences of upgrading would be. Can I just stick the latest 3.0.0 on the host and run it just the same?

-Eric

Based on the sample code in the documentation, I’m able to reproduce the call_async delay. Calling service_y’s ping method, most of the 1000 calls the client makes complete in about 3ms or less, but roughly 20% take more than 3ms and up to around 15ms. It takes about 3.5s to complete all 1000 calls. That’s more or less the expected result.
service_x’s ping method shows the same pattern: about 3.5s to complete all 1000 calls.

service_x also has a method that calls service_y’s ping. In this case, roughly half of the calls are under 2ms. The rest cluster around 16-25ms, and it is highly variable. There is a clear gap between 2ms and 13ms where almost none of the calls land: in about 99% of cases a call either takes under 2ms or over 13ms. This test takes at least 2x as long to run as either ping test, sometimes up to 11s to complete all 1000 calls.

Here is my test service.py:
from nameko.rpc import rpc, RpcProxy

class ServiceY:
    name = "service_y"

    @rpc
    def ping(self):
        return 'Y'


class ServiceX:
    name = "service_x"

    y = RpcProxy("service_y")

    @rpc
    def ping(self):
        return 'X'

    @rpc
    def ping_y(self):
        # the problematic path: one extra RPC hop into service_y
        return self.y.ping()

And here is the client. It is set up to show the delay scenario. When you run it, the output shows the time for each call_async, then the total time for all the call_asyncs, then the total time to get all the results. If you pipe the output through awk '{print $2}' | sed 's:\(.*\)\.\(\).*:\1.\2:' | sort -n | uniq -c you’ll get a nice frequency count by number of milliseconds; it just makes it easier to compare.

from nameko.standalone.rpc import ClusterRpcProxy
from time import perf_counter, sleep

config = {
        'AMQP_URI': 'amqp://guest:guest@localhost:5672'  # e.g. "pyamqp://guest:guest@localhost"
}

with ClusterRpcProxy(config) as cluster_rpc:
    pc = perf_counter()
    pending = []
    for i in range(1000):
        s = perf_counter()
        #r = cluster_rpc.service_y.ping.call_async()
        #r = cluster_rpc.service_x.ping.call_async()
        r = cluster_rpc.service_x.ping_y.call_async()
        print(f'-- {(perf_counter() - s)*1000}ms async')
        pending.append(r)
        #sleep(0.001)
        #r.result()


    print(f'{(perf_counter() - pc)*1000}ms request')
    for p in pending:
        p.result()
    print(f'{(perf_counter() - pc)*1000}ms result')

Interestingly enough, when using call_async and then immediately waiting for the result, almost every call_async takes under 3ms (over 99%). If I throw a 10ms sleep between each call_async instead of the result() call, about 98% of the calls to ping_y are under 3ms. With a 1ms sleep, it’s closer to 85% under 3ms. And, again, with no sleep, it’s about 50%, sometimes 40%, under 3ms.
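
To be concrete, the variants only change the loop body; a sketch (for the sleep variants the replies get collected in a list and resolved at the end, like the client above):

from nameko.standalone.rpc import ClusterRpcProxy
from time import perf_counter, sleep

config = {'AMQP_URI': 'amqp://guest:guest@localhost:5672'}

with ClusterRpcProxy(config) as cluster_rpc:
    for i in range(1000):
        s = perf_counter()
        r = cluster_rpc.service_x.ping_y.call_async()
        r.result()        # variant 1: wait immediately; over 99% of call_asyncs under 3ms
        # sleep(0.010)    # variant 2: 10ms sleep instead of result(); ~98% under 3ms
        # sleep(0.001)    # variant 3: 1ms sleep; ~85% under 3ms
        # (no sleep, no result(): back to ~40-50% under 3ms, as in the client above)
        print(f'-- {(perf_counter() - s) * 1000}ms async')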

Hope this is helpful, thanks!

-Eric

I can’t reproduce this. Running these tests on my local machine, with both Nameko 2 and 3, the histogram is quite consistent.

My test client, services and results all here: https://gist.github.com/mattbennett/27b27b481d560cf965e5578e565bda2e

These were run with Python 3.7.7 on a 2018 MacBook Pro, with RabbitMQ 3.6.6 running in a local docker container.

It did reveal that Nameko 3 is consistently slower, which I was not expecting, but it hasn’t shed any light on the behaviour you’re seeing.

Thanks for taking a look, Matt. I’m running Python 3.9.0 with RabbitMQ 3.8.9 inside a local docker container. The services are running together in one docker container and the client is running in a different container (but from the same image). I’ll see if maybe going to Python 3.7.7 helps get numbers closer to what you are seeing.

My box is a 2x10C 3.1GHz server.

Also I don’t remember if I mentioned it earlier, but I did increase the number of workers from 10 to 1000 and it made absolutely no difference. Good idea posting the results as a gist; I’ll do the same with mine so you can see what I’m seeing; not that it will help much.

I’ll continue to press into this and see where the problem is. If anything comes to mind, I’d love to hear it. Thanks again for jumping in and taking a quick look at this for me. I really appreciate it.

-Eric

Thanks again for looking. I was able to re-run using a docker image of Python 3.7.7 and RabbitMQ 3.6.6 out of the box (no customizations). There was no change in my runs: calls to service_x.ping and service_y.ping seem fine; calls to service_x.ping_y are volatile.

I recorded my output in a gist, but just for the service_x.ping_y calls.

I may try to reproduce your results on a slower machine and see if maybe the problem here is that my server is pushing harder than your Mac. I don’t really know what the cause is. It could be that rabbit isn’t happy above 400 messages/s, which is about where it tops out. I re-ran with 10000 calls from two independent clients and 1000 workers defined in the config.yaml; about 33% were under 4ms and most were above 15ms, in both clients. That was on Python 3.7.7 and RabbitMQ 3.6.6.

One thing I did notice was that rabbitmq topped out at around 455 messages/s overall; the services were getting about 100 messages/s, and the number of queued messages in rabbitmq climbed linearly until the clients ended (at the same time), then it went to zero. So we were at nearly 20k queued messages by the time both clients ended, and almost as soon as they ended, the queued messages dropped to zero. Not sure what that indicates; I’m sort of new to rabbitmq and what all it can do.
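
Not sure if it’s related, but one experiment I might try next is capping how many call_asyncs are in flight at once so the queue can’t build up a big backlog; a rough sketch (the cap is arbitrary, just for the experiment):

from nameko.standalone.rpc import ClusterRpcProxy
from time import perf_counter

config = {'AMQP_URI': 'amqp://guest:guest@localhost:5672'}

MAX_IN_FLIGHT = 50  # arbitrary cap, just for the experiment

with ClusterRpcProxy(config) as cluster_rpc:
    pc = perf_counter()
    pending = []
    for i in range(10000):
        pending.append(cluster_rpc.service_x.ping_y.call_async())
        if len(pending) >= MAX_IN_FLIGHT:
            # drain the outstanding replies before publishing more
            for p in pending:
                p.result()
            pending = []
    for p in pending:
        p.result()
    print(f'{(perf_counter() - pc) * 1000}ms total')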

Here is a chart I created of the times from the second client with 10,000 runs. The banding is interesting, isn’t it?