Validation of nameko Usage Pattern

Hi

Newbie Python developer here. We are putting together a microservice in
Python and have been using nameko as the basis. So far so good! We are
seeing some performance issues, and before we look into these in detail I
wanted to get some validation of how we have approached developing this
microservice. The service calculates deferred revenue over a period, i.e. a
company issues a policy of 100,000 for a year and we need to calculate
which periods that revenue should be distributed across. The calculations
are a mix of linear and non-linear calculation patterns.
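For illustration, the linear pattern can be sketched in a few lines (the function name and the rounding convention here are our own assumptions, not the actual service code):

```python
def linear_spread(premium, periods):
    # Hypothetical helper, not the real service code: spread a premium
    # evenly across N periods, pushing the rounding remainder into the
    # last period so the total always reconciles.
    base = round(premium / float(periods), 2)
    amounts = [base] * (periods - 1)
    amounts.append(round(premium - base * (periods - 1), 2))
    return amounts

# a 100,000 policy earned over 12 monthly periods
schedule = linear_spread(100000, 12)
print(schedule[0], schedule[-1])
```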

We need to support 3 different methods of interaction:
1. An HTTP and RPC service that will be called periodically for individual
requests - either to retrieve previously calculated results (stored in
MongoDB) or to calculate new results and store them if required.
2. Batch calculation and response for 100,000s of requests in a short
period of time, typically from mainframe COBOL applications.
3. HTTP requests that perform aggregation functions on MongoDB and return
the results.

We have configured the nameko HTTP and RPC methods, and for the 1st use
case performance is acceptable - around 12 records a second per single
running instance. (We tested this using locust for HTTP, which does appear
to provide better performance.)

The 2nd use case, however, is causing some anxiety due to performance. We
have used RabbitMQ to queue the batch requests, with an inbound queue and
an outbound queue for the responses.

We have Python code that loads the requests into MQ using struct and
StringIO - it achieves around 1,200 records a second from a flat file.
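A loader along those lines can be sketched as follows (the fixed-width record layout is invented for illustration, and `BytesIO` stands in for Python 2's StringIO):

```python
import struct
from io import BytesIO  # stands in for StringIO/cStringIO on Python 2

# Invented layout for illustration: 10-byte policy id, 12-byte amount,
# 8-byte start date (YYYYMMDD)
RECORD = struct.Struct('10s12s8s')

def parse_records(stream):
    records = []
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        policy, amount, start = (field.decode('ascii').strip()
                                 for field in RECORD.unpack(chunk))
        records.append({'policy': policy, 'amount': float(amount),
                        'start': start})
    return records

flat_file = BytesIO(b'POL0000001   100000.0020170101'
                    b'POL0000002    50000.0020170201')
for rec in parse_records(flat_file):
    print(rec['policy'], rec['amount'])
```

Each parsed record would then be published to the inbound queue.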

We have a Python 2.7 queue worker which reads the queue, calls the nameko
RPC service, and puts the result back on another queue. Each instance of
nameko seems to achieve 2.5 records a second (based on the acknowledgement
queue in RabbitMQ) when using the RPC route (which I believe is the
preferred route here). It is implemented as a straight service - we have
not looked at any pub-sub type patterns in nameko. The volume of open
requests doesn't seem to affect the performance in any way - tested up to
500K.

As an alternative, on the same infrastructure we tested bypassing nameko
completely and just calling the Python code directly from the worker, and
we achieve throughput of around 11 records per second - we think this is
more acceptable. With multiple instances of the queue worker we can reach
at least 50 records per second in a fairly linear manner (i.e. 5 instances
achieves roughly 5x the single-instance rate).

What I want to understand is whether RPC as we've used it is the right
approach for the batch workloads?

Our current thoughts on next steps are to:
1. Look at implementing multiprocessing in the python service code. (CPU
usage on the nameko service is around 10%)
2. Consider alternative methods in nameko to achieve this.
3. Abandon nameko for the batch processing and call the service code
directly (certainly not our preference - it sort of breaks the
microservice concept in our eyes).
4. Understand whether the pre-fetch setting of 10 on the rpc
implementation is causing the issue or perhaps some sort of configuration
issue in rabbitMQ.

Are we missing anything else here (I feel we might be)?

Thanks in advance

Mick

Hi Mick,

I've replied inline. There are some things you can quickly tweak.


We have a Python 2.7 queue worker which reads the queue, calls the nameko
RPC service, and puts the result back on another queue. Each instance of
nameko seems to achieve 2.5 records a second (based on the acknowledgement
queue in RabbitMQ) when using the RPC route (which I believe is the
preferred route here). It is implemented as a straight service - we have
not looked at any pub-sub type patterns in nameko. The volume of open
requests doesn't seem to affect the performance in any way - tested up to
500K.

Using RabbitMQ as a buffer like this is a good design. Is this "queue
worker" implemented in nameko, or using the standalone RPC proxy? The
standalone proxy has more overhead than the service-based one because it
creates a reply queue for every usage, rather than once at service startup.
If you're finding that RPC requests aren't being *made* fast enough, this
may be your bottleneck.

As an alternative, on the same infrastructure we tested bypassing nameko
completely and just calling the Python code directly from the worker, and
we achieve throughput of around 11 records per second - we think this is
more acceptable. With multiple instances of the queue worker we can reach
at least 50 records per second in a fairly linear manner (i.e. 5 instances
achieves roughly 5x the single-instance rate).

What I want to understand is whether RPC as we've used it is the right
approach for the batch workloads?

Our current thoughts on next steps are to:
1. Look at implementing multiprocessing in the python service code. (CPU
usage on the nameko service is around 10%)

2. Consider alternative methods in nameko to achieve this.

3. Abandon nameko for the batch processing and call the service code
directly (certainly not our preference - it sort of breaks the
microservice concept in our eyes).
4. Understand whether the pre-fetch setting of 10 on the rpc
implementation is causing the issue or perhaps some sort of configuration
issue in rabbitMQ.

Multiprocessing isn't required -- nameko already handles concurrency with
eventlet. The maximum number of workers is set by the `max_workers` config
key (which, inconsistently, must be lowercase). The default value is 10.
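For example, in a config file passed to nameko on startup, this might look like (the URI and value are illustrative):

```yaml
# config.yaml (sketch)
AMQP_URI: 'amqp://guest:guest@localhost'
max_workers: 50        # lowercase, unlike most other config keys
```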

With CPU utilisation at 10% it seems there's plenty of scope to try to do
more work concurrently. Also note that a nameko service runs inside a
single Python thread, so multiple concurrent workers will share the same
CPU core. On multicore machines you may want to run one instance per core
for maximum utilisation.

There is a subtle distinction between max_workers and the consumer
pre-fetch count. Max workers determines the number of eventlet threads that
nameko makes available to process entrypoints. The consumer pre-fetch count
specifies the maximum number of AMQP messages that can be un-ack'd at a
time. Nameko entrypoints ack messages after the worker completes, so the
pre-fetch count also roughly determines how much concurrent work is
possible, and by default the value is inherited from the max_workers value.
There is slightly more complexity to it than this, but hopefully this gives
you the gist. I can add more detail in a later reply if it would be helpful.

I would try the following next steps:

1. Try increasing the max worker count
2. Check whether you're maxing out a single core and under-utilising the
others; if so, run multiple processes per machine
3. If increasing the processing capacity doesn't help at all, check whether
the bottleneck is in making the requests

Further, I'd suggest using nameko to implement the "queue worker" -- either
as a separate service or an additional entrypoint in your existing service.
The Publisher
<https://github.com/nameko/nameko/blob/master/nameko/messaging.py#L74> DependencyProvider
can be used to publish messages, and the @consume
<https://github.com/nameko/nameko/blob/master/nameko/messaging.py#L424>
entrypoint to consume them again.

Hope that helps. Look forward to hearing how you get on.

Matt.

···

On Tuesday, December 6, 2016 at 1:18:02 PM UTC, micksc...@googlemail.com wrote:


Matt

Thanks for the quick response. It's been a real joy to get this far this
quickly - we probably need to dive deeper into some of these things now.

I'm going to work out how to implement the worker as a nameko service
itself using the @consume entrypoint.

Currently we call it this way, which is standalone I believe:

    with ClusterRpcProxy(CONFIG) as rpc:
        response = rpc.earningsStorageService.earningsStore(body.decode('utf-8'))

We have the HTTP service running in the same nameko instance at the moment
- if we want to run multiple instances on the same machine (avoiding
port-in-use errors) I guess we'll need to split the HTTP from the RPC
implementation into separate processes? i.e. 4 cores = 1 HTTP instance, 4
RPC instances.

Thanks

Mick

···

On Tuesday, 6 December 2016 09:13:50 UTC-5, Matt Yule-Bennett wrote:


Happy to hear you're enjoying it :)

The ClusterRpcProxy is the "standalone" version (so called because it
doesn't need to be hosted by a nameko service). Using the context manager
around every request like that will be very slow -- it creates a reply
queue on __enter__ and destroys it again on __exit__. You can test whether
this is your bottleneck by only entering the context manager once.
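The effect is easy to see with a generic stand-in (purely illustrative -- this is not nameko's actual internals, just a context manager with costly setup):

```python
import time

class FakeProxy(object):
    """Stand-in for ClusterRpcProxy: entering the context manager is
    expensive (like declaring a reply queue); calls themselves are cheap."""
    def __enter__(self):
        time.sleep(0.005)  # simulate the queue-declaration round trip
        return self
    def __exit__(self, *exc):
        pass
    def call(self, x):
        return x * 2

start = time.time()
for i in range(50):
    with FakeProxy() as rpc:   # setup cost paid on every request
        rpc.call(i)
per_call = time.time() - start

start = time.time()
with FakeProxy() as rpc:       # setup cost paid once
    for i in range(50):
        rpc.call(i)
per_batch = time.time() - start

print(per_call, per_batch)     # per_call is dramatically larger
```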

As written, your code also blocks until the response is available. The RPC
proxy does have an async call mode that you can use to make multiple
requests simultaneously --
see http://nameko.readthedocs.io/en/stable/built_in_extensions.html#rpc.

As for splitting the instances so HTTP ports don't clash, you can use the
WEB_SERVER_ADDRESS config key to set a different bind address for each
instance, and then maybe route requests to them using something like nginx.
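For instance, each instance's config might set (ports illustrative):

```yaml
# config for instance 1 (instance 2 would use :8002, and so on)
WEB_SERVER_ADDRESS: '0.0.0.0:8001'
```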

Matt.

···

On Tuesday, December 6, 2016 at 2:40:17 PM UTC, Mick wrote:


OK, I flipped it to the consume model and the performance is greatly
improved. Thanks again for that. Getting around 20 a second now. With 8
instances on the 8-CPU box it's screaming along. Looking at AWS ELB and the
multiple ports.

The piece I'm now struggling with is publishing the response onto a queue
from within the consumer.

I've been trying this but with little success. Can I call a publisher
within the consume, or should it be separate? When I run this nothing
happens - nothing in the logs at all. I have it working through pika, but
I'd rather not have another component in there. I probably need to dive
deep into kombu to see what's happening.

class asyncEarningsQueueWorker(object):
    name = 'asyncEarningsQueueWorker'

    @consume(sendQueue)
    def processEarningsMessage(self, msg):
        success = Publisher(queue=receiveQueue)
        fail = Publisher(queue=failQueue)

        response = earningsEngine_Store(msg.decode('utf-8'), 'normal execution')

        resp = ast.literal_eval(response)
        if resp['Status'] == 'SUCCESS':
            self.success(response)
        else:
            self.fail(response)

The Publisher is a DependencyProvider
<http://nameko.readthedocs.io/en/stable/key_concepts.html#dependency-injection>.
It can't be instantiated inside the service method like that.

Publishers need to be added to your service at class definition time:

class asyncEarningsQueueService(object):
    name = 'asyncEarningsQueueService'

    success = Publisher(queue=receiveQueue)
    fail = Publisher(queue=failQueue)

    @consume(sendQueue)
    def processEarningsMessage(self, msg):
        response = earningsEngine_Store(msg.decode('utf-8'), 'normal execution')

        resp = ast.literal_eval(response)
        if resp['Status'] == 'SUCCESS':
            self.success(response, routing_key="...")
        else:
            self.fail(response, routing_key="...")

You've declared two publishers here, but one is probably sufficient if you
bind your two queues with different routing keys.

Also note that the term "worker" has special meaning in nameko -- it's how
we refer to the execution of a single entrypoint. This and lots of other
useful stuff is covered in more detail in the key concepts
<http://nameko.readthedocs.io/en/stable/key_concepts.html> section of the
docs.

···

On Tuesday, December 6, 2016 at 10:55:49 PM UTC, Mick wrote:


RTFM!

Thanks again