Long-Running processes in ThreadPool executor


#1

I hope someone can help me....

So I have a process that marshals several co-routines via a
concurrent.futures.ThreadPoolExecutor.

This works beautifully for quick processes.... but not all of my processes
are quick.... and in fact one particular one (transcoding audio) is
downright slooow.

And so I have fallen foul of https://github.com/nameko/nameko/issues/477
where the ThreadPool is blocking the heartbeats, which causes the worker to
restart (and the whole thing starts again)

Now, I was able to switch the 'worker' side of this to the 'greened'
Subprocess - so the actual Transcode shouldn't be blocking anymore.

`from eventlet.green import subprocess`

But I am pretty sure the issue is now with the ThreadPool.

I'm am a total newbie to Eventlet - and honestly didn't anticipate this
kind of nuanced issue. So I'm not entirely sure how I can fix this. Does
anyone have any thoughts?

I could try running it inside a tpool? or am I over-complicating it, and
simply need to use the Futures in a diferent way to how I am right now?

                with concurrent.futures.ThreadPoolExecutor(max_workers=25) 
as executor:
                    x = '/net/' + platform.node() + temp_location
                    futures = [executor.submit(self.audio.transcode, 
source_location + src, x + tgt, **transcode)
                               for src, tgt in filemap]
                    for future in concurrent.futures.as_completed(futures):
                        future.result()  # Raises exceptions

*any* help, guidance, or pointers *greatly* appreciated :slight_smile:

Geoff


#2

OK, I spent a long time really thinking on this, and checking patterns in
the failures, and any non-nameko issues - and whilst I am sure I am falling
foul of issue linked - it is not necessarily because of the
threading/subprocess.

My problem appears to be I/O related, causing severe slowdowns at the OS
level.

I thought I'd cracked it with the 'green subprocess, but when the issue
continued (and after reassuring myself I'd done that bit right!) I started
looking more fundamentally.

I happened to be running a (very crude) SaltStack command to monitor the
disks on the workers, when I noticed it slowed down at times (just
calculating disk space). So I logged into one of the 'slow' responders, and
the shell was very very sluggish. I attempted to clear the `/tmp` folder,
and it just hung out, for ages and ages.

I don't (yet) know if it's an issue with one or more hosts running the
Nameko services; or if it's a more general issue with the hardware
configuration (I have a cluster of identical blades running the services);
or a network I/O issue - possible, but not probably.

My money is on crappy Disk IO on a couple of machines.

If anyone has tips on how I might track that down, I'm all ears (and eyes).

Many thanks,

Geoff

···

On Friday, March 9, 2018 at 3:22:17 PM UTC-8, juko...@gmail.com wrote:

I hope someone can help me....

So I have a process that marshals several co-routines via a
concurrent.futures.ThreadPoolExecutor.

This works beautifully for quick processes.... but not all of my processes
are quick.... and in fact one particular one (transcoding audio) is
downright slooow.

And so I have fallen foul of https://github.com/nameko/nameko/issues/477
where the ThreadPool is blocking the heartbeats, which causes the worker to
restart (and the whole thing starts again)

Now, I was able to switch the 'worker' side of this to the 'greened'
Subprocess - so the actual Transcode shouldn't be blocking anymore.

`from eventlet.green import subprocess`

But I am pretty sure the issue is now with the ThreadPool.

I'm am a total newbie to Eventlet - and honestly didn't anticipate this
kind of nuanced issue. So I'm not entirely sure how I can fix this. Does
anyone have any thoughts?

I could try running it inside a tpool? or am I over-complicating it, and
simply need to use the Futures in a diferent way to how I am right now?

                with concurrent.futures.ThreadPoolExecutor(max_workers=25) 
as executor:
                    x = '/net/' + platform.node() + temp_location
                    futures = [executor.submit(self.audio.transcode, 
source_location + src, x + tgt, **transcode)
                               for src, tgt in filemap]
                    for future in concurrent.futures.as_completed(futures):
                        future.result()  # Raises exceptions

*any* help, guidance, or pointers *greatly* appreciated :slight_smile:

Geoff


#3

I know I have a bad habit of replying to myself, but I figure it's polite
to close the loop :slight_smile:

I have confirmed that there is a disk IOPS issue. I used SaltStack to send
the following command to all the workers at once (race!):

`fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=4k --iodepth=64 --size=4G
--readwrite=randrw --rwmixread=75 | grep iops`

The results (below) clearly show wildly variable IOPS performance.

So what looked like a Nameko/Kombu issue (those were the only errors being
thrown) was actually more fundamental :slight_smile:

Thanks for listening.

worker-02-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=53605KB/s, iops=13401, runt= 58677msec
      write: io=1024.4MB, bw=17876KB/s, iops=4469, runt= 58677msec
worker-05-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=13852KB/s, iops=3463, runt=227064msec
      write: io=1024.4MB, bw=4619.5KB/s, iops=1154, runt=227064msec
worker-07-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=7588.3KB/s, iops=1897, runt=414510msec
      write: io=1024.4MB, bw=2530.6KB/s, iops=632, runt=414510msec
worker-06-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6205.5KB/s, iops=1551, runt=506908msec
      write: io=1024.4MB, bw=2069.3KB/s, iops=517, runt=506908msec
worker-03-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6203.6KB/s, iops=1550, runt=507030msec
      write: io=1024.4MB, bw=2068.8KB/s, iops=517, runt=507030msec
worker-08-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6049.2KB/s, iops=1512, runt=519969msec
      write: io=1024.4MB, bw=2017.3KB/s, iops=504, runt=519969msec
worker-04-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=6021.4KB/s, iops=1505, runt=522370msec
      write: io=1024.4MB, bw=2007.2KB/s, iops=501, runt=522370msec
worker-01-p:
    TERM environment variable not set.
    /bin/sh: 1: cd: can't cd to ~
      read : io=3071.7MB, bw=3877.2KB/s, iops=969, runt=811256msec
      write: io=1024.4MB, bw=1292.1KB/s, iops=323, runt=811256msec
···

On Friday, March 9, 2018 at 3:22:17 PM UTC-8, juko...@gmail.com wrote:

I hope someone can help me....

So I have a process that marshals several co-routines via a
concurrent.futures.ThreadPoolExecutor.

This works beautifully for quick processes.... but not all of my processes
are quick.... and in fact one particular one (transcoding audio) is
downright slooow.

And so I have fallen foul of https://github.com/nameko/nameko/issues/477
where the ThreadPool is blocking the heartbeats, which causes the worker to
restart (and the whole thing starts again)

Now, I was able to switch the 'worker' side of this to the 'greened'
Subprocess - so the actual Transcode shouldn't be blocking anymore.

`from eventlet.green import subprocess`

But I am pretty sure the issue is now with the ThreadPool.

I'm am a total newbie to Eventlet - and honestly didn't anticipate this
kind of nuanced issue. So I'm not entirely sure how I can fix this. Does
anyone have any thoughts?

I could try running it inside a tpool? or am I over-complicating it, and
simply need to use the Futures in a diferent way to how I am right now?

                with concurrent.futures.ThreadPoolExecutor(max_workers=25) 
as executor:
                    x = '/net/' + platform.node() + temp_location
                    futures = [executor.submit(self.audio.transcode, 
source_location + src, x + tgt, **transcode)
                               for src, tgt in filemap]
                    for future in concurrent.futures.as_completed(futures):
                        future.result()  # Raises exceptions

*any* help, guidance, or pointers *greatly* appreciated :slight_smile:

Geoff