No shared memory. To communicate between processes you usually use sockets; to communicate between threads you just mutate shared variables. This is a huge performance difference.
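To make the contrast concrete, here's a minimal Python sketch (the function names and pipe payload are just illustrative): threads communicate by mutating a variable that lives in shared memory, while processes have to serialize a value and push it through an OS-level channel like a pipe or socket.

```python
import threading
from multiprocessing import Process, Pipe

def count_with_threads(n):
    # Threads share the process's memory: they communicate by
    # mutating a shared variable, guarded by a lock.
    counter = [0]
    lock = threading.Lock()

    def bump():
        with lock:
            counter[0] += 1

    threads = [threading.Thread(target=bump) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

def _child(conn):
    conn.send("hello from another process")
    conn.close()

def send_over_pipe():
    # Processes share nothing by default: values must be serialized
    # and sent over an OS-level channel such as a pipe.
    parent, child = Pipe()
    p = Process(target=_child, args=(child,))
    p.start()
    msg = parent.recv()
    p.join()
    return msg

if __name__ == "__main__":
    print(count_with_threads(4))   # 4
    print(send_over_pipe())        # hello from another process
```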
A tangent, but I find it amusing to contrast the perpetual Python GIL debate with all the new computation platforms that claim to be focused on scalability. Those are mostly single-threaded or max out at a few virtual CPUs (e.g. "serverless" platforms), and there people applaud it. There, people view the isolation as supporting scalability.
Yeah, I know about that argument but it just doesn't make sense to me. Removing the GIL means that 1) you make your language runtime more complex and 2) you make your app more complex.
Is it truly worth it just to avoid some memory overhead? Or is there some other Windows-specific thing that I'm missing here?
> Yeah, I know about that argument but it just doesn't make sense to me. Removing the GIL means that 1) you make your language runtime more complex and 2) you make your app more complex.
#2 need not be true; e.g., the approach proposed here is transparent to most Python code and even minimizes impact on C extensions, still exposing the same GIL hook functions that C code would use in the same circumstances, though they have a slightly different effect.
Well, actually, on the type of CPU that OP refers to (128 threads, i.e. AMD Threadripper), the L3 cache is only shared within each pair of CCXs that forms a CCD. If you launch a program with 32 threads, they may have 1, 2, 3, or 4 distinct L3 caches to work with.
Moreover, unless thread pinning is enforced, a given thread will bounce around between different cores during execution, so the number of distinct L3 caches in action will not be constant.
Of course, you have the same story with memory: accessing another thread's memory is slower if that thread is on another CCD.
TL;DR: NUMA makes life hard if you want to get consistent performance from parallelism.
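If you do want pinning, here's a minimal sketch using `os.sched_setaffinity`, which is Linux-only (hence the guard); the CPU set is illustrative. Pinning keeps the scheduler from migrating a thread between cores, so the set of L3 caches in play stays constant.

```python
import os

def pin_to_cpus(cpus):
    # Pin the calling process (pid 0 = self) to a fixed set of CPUs so the
    # scheduler stops bouncing its threads between cores. Linux-only API,
    # hence the hasattr guard; returns the resulting affinity, or None if
    # pinning isn't supported on this platform.
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
        return sorted(os.sched_getaffinity(0))
    return None

print(pin_to_cpus({0}))  # e.g. [0] on Linux, None elsewhere
```

Real deployments usually reach for `taskset`/`numactl` or a thread-pool library's pinning support rather than hand-rolling this, but the mechanism is the same.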
I mean, is there anything here preventing one from writing their code single-threaded, though? This is an addition to the capability, not a detraction.
Say your webapp talks to a database or a cache. It'd be really nice if you could use a single connection to that database instead of 64 connections. Or, if you wanted to cache some things on the web server, it would be nice to have one easily accessible copy instead of 64 copies that you have to fill 64 times over.
Unfortunately using a single db/RPC connection for many active threads is not done in any multithreaded system I’m aware of for good reasons. Sharing this type of resource across threads is not safe without expensive and performance-destroying mutexes. In practice each thread needs exclusive access to its own database connection while it is active. This is normally achieved using connection pooling which can save a few connections when some threads are idle, but 1 connection for 64 active web worker threads is not a recipe for a performant web app. If you can point to a multithreaded web app server that works this way I’d be very interested to hear about it.
The idea of a process-local cache (or other data) shared among all worker threads is a different story. Along with reduced memory consumption, I see this as one of the bigger advantages of threaded app servers. However, preforking multiprocess servers can always use shmget(2) to share memory directly with a bit more work.
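For illustration, the Python-level analogue of shmget(2) is `multiprocessing.shared_memory`. A minimal sketch, where a second attach by name stands in for a sibling worker process mapping the same segment:

```python
from multiprocessing import shared_memory

def roundtrip(payload: bytes) -> bytes:
    # One process creates the segment and writes into it; a second process
    # (simulated here by a second attach in the same process) maps the same
    # name and reads the bytes back with no copying or serialization.
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[:len(payload)] = payload
        peer = shared_memory.SharedMemory(name=owner.name)  # attach by name
        seen = bytes(peer.buf[:len(payload)])
        peer.close()
        return seen
    finally:
        owner.close()
        owner.unlink()  # creator is responsible for freeing the segment

print(roundtrip(b"cached value"))  # b'cached value'
```

The "bit more work" is real, though: you get raw bytes, so anything structured needs its own serialization and locking on top.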
> Unfortunately using a single db/RPC connection for many active threads is not done in any multithreaded system I’m aware of for good reasons. Sharing this type of resource across threads is not safe without expensive and performance-destroying mutexes
lol, you're so deep into Python Stockholm syndrome ("don't share anything between threads because we don't support that at all, even a little bit") that you don't even realize that connection pools exist. Instead of holding a connection open per process, you can have one connection pool with 30 connections that services 200 threads (the exact ratio depends on how many are actually using connections, of course). literally everybody "shares a single DB/RPC connection across multiple threads" (or at least shares a number of connections across a number of threads), except python.
And yeah, you can turn that into yet another standalone service that you've got to deliver in your docker-compose setup, but everybody else just builds it into the application itself.
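For illustration, a minimal in-process pool sketch built on a thread-safe queue; the `object()` instances stand in for real DB handles, and the class name is made up:

```python
import queue
import threading

class ConnectionPool:
    # Sketch of an in-process pool: N connections serve many more threads.
    # A thread blocks only while it holds a checked-out connection, not for
    # the life of the request, so 4 connections can serve 32 workers.
    def __init__(self, make_conn, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_conn())

    def acquire(self):
        return self._pool.get()   # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(object, size=4)  # object() stands in for a DB handle

results = []
def handle_request(i):
    conn = pool.acquire()
    try:
        results.append(i)  # pretend to run a query on conn
    finally:
        pool.release(conn)

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 32 requests served through 4 connections
```

Real pools (SQLAlchemy's, HikariCP, etc.) add health checks, timeouts, and recycling, but the check-out/check-in shape is the same.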
> that you don't even realize that connection pools exist
The GP mentions connection pooling literally three sentences later.
> literally everybody "shares a single DB/RPC connection across multiple threads" (or at least shares a number of connections across a number of threads), except python.
Right, but multiple ≠ many. You're discussing the former. GP is discussing the latter.
Depending on the structure, it can indeed be many: both in the case of protocols that support multiplexing of requests, and in situations where you have multiple databases (so a given thread might not need to be talking to a particular database all the time).
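A minimal sketch of the multiplexing idea, with a queue standing in for the wire and a fake echo server; all names are illustrative. Each request carries an id, so responses coming back over the one shared "connection" can be routed to whichever thread is waiting for them:

```python
import itertools
import queue
import threading

wire = queue.Queue()          # stands in for one shared connection
waiters = {}                  # request id -> slot awaiting the response
waiters_lock = threading.Lock()
ids = itertools.count()

def server():
    # Fake server: answers each tagged request (here, by upper-casing it).
    while True:
        req_id, payload = wire.get()
        if payload is None:   # shutdown sentinel
            break
        with waiters_lock:
            slot = waiters.pop(req_id)
        slot["response"] = payload.upper()
        slot["done"].set()

def call(payload):
    # Any number of threads can call this concurrently; they all share
    # the same wire, distinguished only by their request ids.
    req_id = next(ids)
    slot = {"done": threading.Event()}
    with waiters_lock:
        waiters[req_id] = slot
    wire.put((req_id, payload))
    slot["done"].wait()
    return slot["response"]

srv = threading.Thread(target=server)
srv.start()

results = []
callers = [threading.Thread(target=lambda s=s: results.append(call(s)))
           for s in ("a", "b", "c")]
for t in callers:
    t.start()
for t in callers:
    t.join()
wire.put((None, None))  # stop the fake server
srv.join()
print(sorted(results))  # ['A', 'B', 'C']
```

This is roughly what HTTP/2, gRPC, and PostgreSQL pipelining do at the protocol level: in-flight requests from many callers interleave on one connection.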