July 28, 2022
A (not) short story about how I came to understand the subtle difference between multiprocessing and multithreading.
What do you remember from 1991? It was full of interesting events: the USSR collapsed, and the Chicago Bulls, with Michael Jordan, won their first NBA championship, for example. But developers remember that year for a different milestone: Python had its premiere.
Although Python has been around for so long, there is still a lot to explore. One of the things I learned recently was multithreading, where, in general, Python offers two approaches to the topic.
In this article, I will try to describe them and point out the differences, which I hope will allow you to choose the right solutions for the processes you program in your remarkable, wonderful, and amazing Python applications.
Imagine you have a yeast dough to bake. Generally, it is a simple cake, whose preparation we can divide into 3 main phases:
As I mentioned, yeast dough is a simple dough. To prepare it you need: flour, sugar, yeast, milk, 5 eggs, margarine, and salt. (By the way, I will tell you my grandma's secret: it's better to use butter instead of margarine; it comes out much better.)
So our list of ingredients looks like this:
ingredients = ["egg", "egg", "egg", "egg", "egg", "sugar 250g", "flour 1kg", "yeast 75g", "milk 1.5 glass", "butter 200g", "pinch of salt"]
In the traditional approach of gathering ingredients for a cake as a cook, you are in the kitchen alone. Moreover, you are not very smart and you’re using only one hand to do the work.
The cook's hand is our Thread.
So the code for gathering ingredients written in Python would look like this:
for ingredient in ingredients:
    cook.go_to_fridge_or_cabinet()
    cook.take_ingredient(ingredient)
    cook.bring_ingredient()
    cook.put_it_on_the_table()
Quite a few steps to do. We repeat them separately for each ingredient, so the whole thing is done 11 times.
With this (traditional) programming approach we run around the kitchen a bit.
Preparing a yeast dough is not as simple as gathering the ingredients. It is also a more laborious process and more prone to errors. The order of operations is crucial; otherwise, we will end up with a scone. Our "program" for preparing the dough must follow the recipe exactly. No step can be skipped.
So our cake baking “code” looks like this:
cook.warm_up_milk("37 °C")
cook.put_ingredients_to_bowl(["(warm) milk", "yeast 75g", "pinch of sugar", "spoon of flour"])
cook.mix_ingredients_in_bowl()
cook.wait("10 min")
cook.melt_butter()
cook.put_ingredients_to_bowl(["egg", "egg", "egg", "egg", "egg", "rest of sugar", "melted butter", "pinch of salt", "rest of flour"])
cook.mix_ingredients_in_bowl("20 min")
cook.wait("30 min")
cook.put_cake_to_baking_tray()
cook.wait("30 min")
cook.bake_cake("50 min", "170 °C")
Again, you do everything with one hand. In the end, you have baked a tasty cake, but its preparation takes a lot of time. The order of the steps is also important here. For example, you can't combine all the `cook.wait()` commands into one à la `cook.wait("70 min")`. You also can't change the order in which the lines of the program are executed. If you do, the cake won't be good.
While baking a cake is hard to optimize (it takes about the same amount of time no matter how many cooks make it), Phase 1 (gathering ingredients) seems pretty easy to optimize. It doesn't matter in which order you bring the ingredients from the fridge and put them on the table for further processing. What's more, you can safely bring in all the ingredients at once. I assure you that no egg will protest that you bring it to the table together with another egg, the flour, or the sugar.
Just how do you do it with one hand?
To understand how multithreading works in Python, the key point is the "To Our Cook" part of the chapter title. By using multithreading, you add hands to your cook, but there is still only one cook in the kitchen.
The official Python documentation refers to threading as "Thread-based parallelism". Tasks are executed in parallel... or rather quasi-parallel. It is this fine distinction between multithreading and multiprocessing that had eluded me all along.
Multithreading gives us the ability to start all the queued tasks simultaneously and execute them concurrently, regardless of their duration. The tasks are executed by a single processor core, with shared access to the memory in which the program runs.
Referring to our cake example, our cook in the kitchen is a mutant octopus on steroids that has grown an extra 4 arms. On a signal we specify, the octopus performs a job we specify, which for readability we will call a worker:
def worker(ingredient):
    cook.go_to_fridge_or_cabinet()
    cook.take_ingredient(ingredient)
    cook.bring_ingredient()
    cook.put_it_on_the_table()

for ingredient in ingredients:
    octopus.submit(worker, ingredient)
What is going on here? In the beginning, we define the work to be done, our "worker": this is the function that will be performed. The worker needs to know which ingredient to bring; without it, it will get lost. This is the same as described above; we do not change anything here.
The second part of the program is more interesting. Having our list (array) of ingredients, we tell our octopus to run a worker for each ingredient on the list.
`octopus.submit(worker, ingredient)`
Since we do this nearly infinitely fast, shouting out successive commands (fetch egg, fetch sugar, fetch yeast...), the octopus, before it even moves, already has all the workers queued and starts executing them. Each worker is carried out by a separate arm with a tentacle.
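In real Python, the "octopus" maps naturally to a thread pool. Below is a minimal, self-contained sketch using `concurrent.futures.ThreadPoolExecutor`; the `time.sleep` call and the message format are illustrative stand-ins for the walk to the fridge.

```python
from concurrent.futures import ThreadPoolExecutor
import time

ingredients = ["egg", "egg", "egg", "egg", "egg", "sugar 250g",
               "flour 1kg", "yeast 75g", "milk 1.5 glass",
               "butter 200g", "pinch of salt"]

def worker(ingredient):
    # Simulate the walk to the fridge and back (I/O-like waiting).
    time.sleep(0.1)
    return f"{ingredient} is on the table"

# The executor is our octopus; each submitted worker runs in its own thread.
with ThreadPoolExecutor(max_workers=5) as octopus:
    futures = [octopus.submit(worker, ing) for ing in ingredients]
    results = [f.result() for f in futures]

print(len(results))  # all 11 ingredients gathered
```

Because the workers spend their time waiting rather than computing, five threads finish the eleven fetches in roughly three sleep periods instead of eleven.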
And so in the first phase the octopus:
This is where we may encounter minor inconveniences. Yeast, milk, or butter are likely to be on different shelves of the refrigerator, in different places. The eggs are most likely on one shelf, in one package. What does the octopus do? Five arms reach onto the same shelf at the same time, wedging together and blocking each other.
No worries. Python's multithreading handles this for us: a thread waits a while until one of the tentacles (threads) frees up the space (computer resources), and then the next thread can proceed.
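The "wedged tentacles" situation is what a lock models: when several threads need the same resource, all but one wait their turn. A small illustrative sketch (the shelf and egg names are invented for the analogy):

```python
import threading
import time

shelf_lock = threading.Lock()  # only one hand fits on the egg shelf at a time
taken = []

def take_egg(n):
    with shelf_lock:      # other threads block here until the shelf is free
        time.sleep(0.01)  # reaching onto the shelf
        taken.append(f"egg {n}")

threads = [threading.Thread(target=take_egg, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(taken))  # all 5 eggs taken, one at a time
```

The `with shelf_lock:` block guarantees the eggs are taken one at a time, even though all five threads were started at once.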
Once all the workers are done, we have the ingredients on the table, ready to continue, and we can move on to Phase 2. And here we hit an obstacle that multithreading is not able to handle, especially if we have several cakes to bake.
The preparation of the cake, as I have already mentioned, is the phase that requires more attention from the cook. The "program" must also be executed in the right order, ending with baking in the oven.
Imagine a situation where we have to bake 10 cakes. In the first step, we bring all the ingredients needed to make them. The table gets cramped but we still fit somehow. We start with 10 cake workers and here we hit an obstacle we cannot overcome.
Running 10 parallel threads won't do us any good. Preparing and baking one cake blocks our resources (bowl and oven) for another 70-80 minutes. The threads running in parallel "wait" for resources to be released before they start executing. And so baking 10 cakes, using multithreading is a job for about 800 minutes (13+ hours).
How do we increase our resources and add more bowls and ovens to the kitchen?
The idea of multiprocessing, allowing all computer resources to be used in parallel, is nothing new. However, unlike some other programming languages, Python itself is not ready for it. It is hindered by the Global Interpreter Lock (GIL), which prevents more than one thread from executing Python bytecode at a time. For those interested in the issue, I recommend the interesting post "What is the Python Global Interpreter Lock (GIL)?", where you will find a detailed description of this "infamous" Python feature.
In order to bypass GIL-related limitations, Python's standard library provides the multiprocessing module (available since version 2.6). Multiprocessing frees us from this limitation, giving us the possibility of full use of all the computer's resources... however, it has its own constraints, which you have to keep in mind and which we will discuss in more detail later in this article.
Let's refer one last time to our cake-baking example. Our kitchen in which we can bake one cake at a time, even if our cook has 10 hands and moves the ingredients to the workbench quite quickly, is not able to handle the case where we have to bake 10 cakes, because this kitchen has one oven into which we can put one cake at a time. With help comes multiprocessing, which replicates our kitchen.
You can imagine it as a block of apartments in which there are many apartments and each of them has a kitchen.
Thanks to multiprocessing, we can use each of them, keeping in mind some important things.
As you can guess, the block is our computer/server. We can expand it by adding more processors, RAM, hard disks, etc. However, every computer, even the biggest one in the world, will reach its limit, which means that if we run 100 processes on it and each of them uses 100 threads, we'll exhaust the capacity of most publicly available servers.
A separate problem associated with multiprocessing is the issue of information exchange. As I mentioned our "kitchens" do not know anything about each other. However, in most cases, when we run parallel tasks, we would like to be able to do something with the results of their actions at the end, when they are all done. In our case, in the end, we'd like to pack up our cakes and take them to the cafe for guests. This is quite an obvious problem, so the library creators have added appropriate solutions that we can use.
The obvious problem we will encounter when using the power of multiprocessing is the question of the maximum resources we will use. Let's imagine a situation where our server gives us 30 processes to use. If we occupy all of them, and in the meantime some other user types in the address of our web page, the server won't even be able to display it, because all 30 processes will be currently occupied with the work we gave the server. Like a hungry child, a client who doesn't see the website will very quickly lose patience.
Multiprocessing, like multithreading, also runs tasks at the same time, but it uses separate computer resources, so the work of one process can be performed independently of the work of another.
The execution of our 10-cake multiprocessing baking program would look like this.
START
  Process 1:  Phase 1 -> Phase 2
  Process 2:  Phase 1 -> Phase 2
  ...
  Process 10: Phase 1 -> Phase 2
END. 10 cakes baked
The baking time for all 10 cakes is the baking time of the slowest cake. Since no one is blocking the oven, none of the running processes waits for an oven to free up. The whole program will be done not in 800 minutes (multithreading) but in 80 minutes. 10 cakes in 80 minutes is already a micro-bakery that we can conquer the market with, especially if we bake a cake as good as the yeast cake described above.
I hope that this explanation of the differences between traditional (in the loop) programming, multithreading, and multiprocessing will help you understand the difference between these issues in more detail and... more importantly, will allow you to better choose a programming solution strategy for your applications.
Now it's time for a short break with coffee and yeast cake and after that, in the next part of the article, I will show you how you can use the knowledge gained in practice.
For the rest of this article, I assume you have some basic knowledge of programming in Python :) and of configuring Docker, which we will use to create our development environment.
You can find the examples below in the repository at https://github.com/michal-stachura/blog-mvm.
Given the speed at which computer programs are executed, the differences between multithreading and multiprocessing that we will discuss next are quite difficult to see. However, we will add a few "test points" to our code and load the CPU heavily, which will give us a better understanding of the differences between these approaches.
—
Ok, in the first part of this article we had two main "Phases" of baking a cake. Phase 1 was easier to execute. Phase 2 required more resources and could "block" us from executing the program due to insufficient computer/server resources.
In the publicly available examples on the Internet, this issue is solved with solutions à la `time.sleep(1)` for easy processes and `time.sleep(10)` for hard processes taking 10 seconds. In reality, both of these are equally trivial for the CPU: they do not consume any CPU resources, but simply make it wait 1 or 10 seconds.
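To make that contrast concrete, compare a sleep-based "task" with work that actually keeps the processor busy; both functions below are illustrative stand-ins, not code from the repository.

```python
import time

def easy_task():
    # I/O-style wait: the CPU sits idle, just like time.sleep(1).
    time.sleep(0.1)

def hard_task():
    # CPU-bound work: the processor is genuinely busy, not just waiting.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

start = time.perf_counter()
result = hard_task()
elapsed = time.perf_counter() - start
print(f"hard_task kept the CPU busy for {elapsed:.3f}s")
```

During `easy_task` the interpreter can happily run other threads; during `hard_task` the GIL is contended the whole time, which is exactly the difference the benchmarks below expose.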
Personally, I prefer a more empirical approach. We write a program that does:
As you can guess, image downloading (point 1) is the easier Phase 1. The service responds at different speeds, so we will see small differences in the time it takes to download and process the images.
Points 2, 3 and 4 of our program are already Phase 2, involving much more of our computer and requiring more resources. Ok, enough talking. Let’s do some code.
`main.py` is our main application file, where in the first part I add the logger configuration and define the parameters that we can use in the tests:
Next, we have a simple call to the `PhaseOne` and `PhaseTwo` classes, which are defined in `app/phase1.py` and `app/phase2.py` respectively:
Both files, with the `PhaseOne` and `PhaseTwo` classes, have a similar structure: for each, I first define the "job" that will be done (`def job()`) and then the workers that, in addition to logging times, execute the defined `job()`.
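For orientation, here is a hypothetical sketch of that structure; the names and details differ from the actual files in the repository, which downloads images and generates PDFs.

```python
# Hypothetical sketch of the job/worker structure in app/phase1.py;
# the real repository code differs.
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

class PhaseOne:
    def job(self, task_id):
        # The actual work: in the repository this downloads an image.
        return f"image {task_id} downloaded"

    def worker(self, task_id):
        # The worker wraps the job with timing/logging "test points".
        start = datetime.now()
        result = self.job(task_id)
        logger.info("Task %s took %s", task_id, datetime.now() - start)
        return result

phase1 = PhaseOne()
print(phase1.worker(0))
```

Keeping the job and the worker separate means the same `job()` can be driven by a plain loop, a thread pool, or a process pool without changes.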
OK, I think this is understandable. Time to get your hands a little dirty :)
The traditional approach: doing the work in a simple loop, without running the code in parallel or doing several things at once.
In the ssh console, type:
docker build --tag monte_py .
This builds an image of the environment that we will use in further testing:
Successfully built 267b3b24efc6
Successfully tagged monte_py:latest
With the image ready, we run the first test:
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="common" --p2_type="common"
The resulting output should look roughly like this:
######################
Number of CV's: 10
Test type:
- Phase 1: common
- Phase 2: common
Detailed report: Y
Max workers:
- Phase 1: Not considered
- Phase 2: Not considered
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.398223
Phase 1 took: 0:00:03.985781
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.375734
Phase 2 took: 0:00:23.978809
--- Summary ---
Whole process took: 0:00:27.964590
--- Details Phase 1 ---
[
"Task: 0 (start) - PID: 1 CPU: 9.5%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 0 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (end) - PID: 1 CPU: 10.2%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
"Task: 3 (end) - PID: 1 CPU: 5.2%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 4 (end) - PID: 1 CPU: 1.7%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 5 (end) - PID: 1 CPU: 2.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 6 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 6 (end) - PID: 1 CPU: 4.0%, RAM (GB): avl: 23.93, used: 6.47, 23.7%)",
"Task: 7 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.93, used: 6.47, 23.7%)",
"Task: 7 (end) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 8 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.59, used: 6.81, 24.8%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.59, used: 6.81, 24.8%)",
"Task: 9 (end) - PID: 1 CPU: 12.3%, RAM (GB): avl: 23.56, used: 6.84, 24.9%)"
]
--- Details Phase 2 ---
[
"Task: 0 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.56, used: 6.84, 24.9%)",
"Task: 0 (end) - PID: 1 CPU: 15.2%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 1 (end) - PID: 1 CPU: 16.2%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 2 (end) - PID: 1 CPU: 15.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 3 (end) - PID: 1 CPU: 14.7%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 4 (end) - PID: 1 CPU: 16.3%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 5 (end) - PID: 1 CPU: 14.5%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 6 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 6 (end) - PID: 1 CPU: 14.4%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 7 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 7 (end) - PID: 1 CPU: 15.5%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 8 (end) - PID: 1 CPU: 15.3%, RAM (GB): avl: 23.77, used: 6.63, 24.2%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.77, used: 6.63, 24.2%)",
"Task: 9 (end) - PID: 1 CPU: 13.6%, RAM (GB): avl: 23.76, used: 6.65, 24.3%)"
]
At the beginning, we have the configuration of the test that was run. As you can see, `max_workers` for Phase 1 and Phase 2 is not considered, even though the default value is 10. Well, in a traditional loop we do not run the code in parallel: the whole thing is processed in a single thread/process, and we have no influence on it.
Then we have the time summaries for phase 1 and phase 2. In my case, it came out at:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.398223
Phase 1 took: 0:00:03.985781
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.375734
Phase 2 took: 0:00:23.978809
It took an average of ~0.39 seconds to download one image; downloading all 10 images took ~3.98 seconds.
It takes my computer about ~2.37 seconds to generate one PDF file, and 10 PDF files take ~23.97 seconds... quite long.
We closed the entire process in ~27.96 seconds, which is a very poor time. I don't think many customers would wait almost half a minute after clicking the "Generate my 10 resume files" button :)
In the test details, you can see how each task in the loop is executed. In both cases, we have the same scheme.
"Task: 0 (start) - PID: 1 CPU: 9.5%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 0 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (end) - PID: 1 CPU: 10.2%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
We start a task -> we do it -> we finish the task. Boredom. We use a single process (`PID: 1`) the whole time, our `CPU` is bored at around 10% most of the time, and the `RAM` remains mostly unused.
It's time to speed things up a bit.
We run the same job, generating 10 resume files, this time with multithreading:
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="multithreading" --p2_type="multithreading" --p1_max_workers=8 --p2_max_workers=8
######################
Number of CV's: 10
Test type:
- Phase 1: multithreading
- Phase 2: multithreading
Detailed report: Y
Max workers:
- Phase 1: 8
- Phase 2: 8
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.447264
Phase 1 took: 0:00:00.706548
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:20.365066
Phase 2 took: 0:00:30.022417
--- Summary ---
Whole process took: 0:00:30.728965
--- Details Phase 1 ---
[
"Task: 0 (start) - PID: 1 CPU: 17.6%, RAM (GB): avl: 23.64, used: 6.82, 24.6%)",
"Task: 4 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 6 (start) - PID: 1 CPU: 25.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 7 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 6 (end) - PID: 1 CPU: 10.9%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 0 (end) - PID: 1 CPU: 20.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 4 (end) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 1 (end) - PID: 1 CPU: 11.1%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 2 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 5 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 7 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 3 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 8 (end) - PID: 1 CPU: 9.1%, RAM (GB): avl: 23.53, used: 6.93, 25.0%)",
"Task: 9 (end) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)"
]
--- Details Phase 2 ---
[
"Task: 1 (start) - PID: 1 CPU: 66.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 0 (start) - PID: 1 CPU: 66.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 2 (start) - PID: 1 CPU: 33.3%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 22.2%, RAM (GB): avl: 23.47, used: 6.99, 25.2%)",
"Task: 6 (start) - PID: 1 CPU: 18.7%, RAM (GB): avl: 23.33, used: 7.14, 25.6%)",
"Task: 5 (start) - PID: 1 CPU: 19.2%, RAM (GB): avl: 23.33, used: 7.14, 25.6%)",
"Task: 7 (start) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.32, used: 7.14, 25.7%)",
"Task: 6 (end) - PID: 1 CPU: 14.5%, RAM (GB): avl: 23.16, used: 7.32, 26.2%)",
"Task: 0 (end) - PID: 1 CPU: 9.3%, RAM (GB): avl: 23.15, used: 7.32, 26.2%)",
"Task: 8 (start) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.15, used: 7.33, 26.2%)",
"Task: 7 (end) - PID: 1 CPU: 10.6%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 9 (start) - PID: 1 CPU: 8.3%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 5 (end) - PID: 1 CPU: 8.1%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 4 (end) - PID: 1 CPU: 8.1%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 3 (end) - PID: 1 CPU: 13.5%, RAM (GB): avl: 23.25, used: 7.23, 25.9%)",
"Task: 1 (end) - PID: 1 CPU: 18.2%, RAM (GB): avl: 23.25, used: 7.23, 25.9%)",
"Task: 2 (end) - PID: 1 CPU: 18.4%, RAM (GB): avl: 23.26, used: 7.21, 25.8%)",
"Task: 8 (end) - PID: 1 CPU: 13.1%, RAM (GB): avl: 23.21, used: 7.27, 26.0%)",
"Task: 9 (end) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.21, used: 7.27, 26.0%)"
]
Let's see what happened here:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.447264
Phase 1 took: 0:00:00.706548
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:20.365066
Phase 2 took: 0:00:30.022417
--- Summary ---
Whole process took: 0:00:30.728965
This time, the average image download took ~0.45 seconds, so in the traditional approach 10 images would download in about ~4.5 seconds. Meanwhile, we have now downloaded them in just ~0.71 seconds. Multithreading sped this phase up by about 84%.
Phase 2, however, did not go so well. Generating one PDF file now takes about ~20.36 seconds. Remember, in the traditional process it was almost ten times less, at ~2.37 seconds. Why such a slowdown?
Let's go back to our cake-baking analogy for a moment. Phase 1 is gathering ingredients, where our eight-handed cook (--p1_max_workers=8) brings the ingredients for the cake and puts them on the table. Since we have 10 ingredients and 8 hands, the first "run" brings 8 ingredients from the fridge and the next run brings the remaining 2. The whole process is asynchronous, so the total time will never be exactly two fridge <-> table runs, but it will be close. If we allowed 10 workers, the total would be not much longer than the time to fetch a single image.
By executing 10 image requests in parallel, we in fact wait only for the last image response. The request that responds last closes the thread queue, and our code continues to execute.
The matter gets a bit more complicated in Phase 2, where we deal with more serious activities requiring more computer resources. Here the time to generate one PDF file took on average ~20.36 seconds. 20 seconds vs. 2.4 in the classic approach is more than 8 times slower. We definitely don't want that :)
Why has the file generation time increased so much?
The answer is simple. In this test, we ran 8 PDF generation threads in parallel, and generating a PDF is itself quite processor-intensive. Even if we run 10 or more threads, it won't change much: the resources available to the asynchronously started threads remain the same. Individual threads simply limp along, waiting for the previously started CV generation tasks to finish and free up resources.
Back to our cake analogy. It doesn't matter how many cooks you put at the cake table. The table is a certain size. Even if you tell everyone at the same time, "start, make cakes", they will get stuck and have to wait for the table to clear before they start working.
For tasks requiring large amounts of CPU resources, Python multithreading simply does not help.
Let's see how it looks using multiprocessing. But before that, take another look at the details of the individual tasks in Phase 1 and Phase 2.
For Phase 1, we have 8 tasks executing in parallel (--p1_max_workers=8); then task 6 ends, and task 8 (waiting in the queue) starts immediately. The same goes for task 9, which starts after task 0 ends, as soon as a worker is free to run it.
The details look similar for Phase 2. Notice the CPU usage. At the very beginning, it reaches a fairly high result of 66.7%, which reflects the situation where at one point 8 threads are opened in parallel to generate a pdf file. Then, the processor, with a relatively constant load oscillating around 20%, closes the individual threads generating the PDF file.
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="multiprocessing" --p2_type="multiprocessing" --p1_max_workers=8 --p2_max_workers=8
After running such a test, you will get a result that looks roughly like this:
######################
Number of CV's: 10
Test type:
- Phase 1: multiprocessing
- Phase 2: multiprocessing
Detailed report: Y
Max workers:
- Phase 1: 8
- Phase 2: 8
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.451822
Phase 1 took: 0:00:00.815908
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.507320
Phase 2 took: 0:00:05.045943
--- Summary ---
Whole process took: 0:00:05.861851
--- Details Phase 1 ---
[
"Task: 1 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.88, used: 7.52, 27.1%)",
"Task: 0 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 6 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 3 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 2 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 5 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 4 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 7 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 6 (end) - PID: 10 CPU: 11.7%, RAM (GB): avl: 22.75, used: 7.65, 27.5%)",
"Task: 8 (start) - PID: 10 CPU: 33.3%, RAM (GB): avl: 22.75, used: 7.65, 27.5%)",
"Task: 0 (end) - PID: 10 CPU: 10.7%, RAM (GB): avl: 22.74, used: 7.66, 27.5%)",
"Task: 9 (start) - PID: 10 CPU: 25.0%, RAM (GB): avl: 22.74, used: 7.66, 27.5%)",
"Task: 1 (end) - PID: 10 CPU: 10.9%, RAM (GB): avl: 22.72, used: 7.68, 27.6%)",
"Task: 2 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 7 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 3 (end) - PID: 10 CPU: 11.2%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 5 (end) - PID: 10 CPU: 11.4%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 4 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 9 (end) - PID: 10 CPU: 11.5%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 8 (end) - PID: 10 CPU: 7.7%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)"
]
--- Details Phase 2 ---
[
"Task: 0 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.78, used: 7.63, 27.4%)",
"Task: 1 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.78, used: 7.63, 27.4%)",
"Task: 2 (start) - PID: 10 CPU: 12.9%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 4 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 6 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 3 (start) - PID: 10 CPU: 12.9%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 5 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 7 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 2 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 8 (start) - PID: 10 CPU: 75.0%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 5 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.29, used: 8.11, 28.9%)",
"Task: 9 (start) - PID: 10 CPU: 100.0%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 3 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.29, used: 8.11, 28.9%)",
"Task: 4 (end) - PID: 10 CPU: 72.5%, RAM (GB): avl: 22.25, used: 8.16, 29.1%)",
"Task: 1 (end) - PID: 10 CPU: 71.5%, RAM (GB): avl: 22.25, used: 8.16, 29.1%)",
"Task: 0 (end) - PID: 10 CPU: 71.4%, RAM (GB): avl: 22.26, used: 8.14, 29.0%)",
"Task: 7 (end) - PID: 10 CPU: 71.4%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 6 (end) - PID: 10 CPU: 71.3%, RAM (GB): avl: 22.3, used: 8.11, 28.9%)",
"Task: 9 (end) - PID: 10 CPU: 25.6%, RAM (GB): avl: 22.33, used: 8.07, 28.8%)",
"Task: 8 (end) - PID: 10 CPU: 25.7%, RAM (GB): avl: 22.32, used: 8.08, 28.8%)"
]
We received interesting results here:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.451822
Phase 1 took: 0:00:00.815908
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.507320
Phase 2 took: 0:00:05.045943
--- Summary ---
Whole process took: 0:00:05.861851
Phase 1 shows more or less the same results as with multithreading. For Phase 2, which took more than 30 seconds with multithreading, we got a very big speedup: this time we generated 10 PDF files in about ~5.04 seconds. That's over 80% less time!
The details of how each phase runs are also interesting. As with multithreading, the image downloads and PDF generation run asynchronously. However, unlike multithreading, multiprocessing uses not 1 but 8 parallel processes. CPU utilization at the beginning is 12-13%, similar to the multithreading run, but then it rapidly rises to 70-100% and stays at that level until the very end, when the last two tasks use about 25% of the CPU.
The pattern is even clearer when generating more CV files, but I don't want to produce overly long logs here. To finish, let's do one more test.
Finally, let's combine the two approaches: multithreading for Phase 1 and multiprocessing for Phase 2. This time we will use more CVs, compare the two phases of the code separately, and try to run as many processes as possible.
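A hedged sketch of that combination using `concurrent.futures` (the `download_data` and `generate_pdf` names are illustrative placeholders, not the repository's actual functions): threads for the I/O-bound Phase 1, processes for the CPU-bound Phase 2.

```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def download_data(cv_id):
    # Phase 1: I/O-bound work (network requests) -- threads are enough,
    # because the GIL is released while waiting on I/O.
    return {"id": cv_id, "name": f"candidate_{cv_id}"}

def generate_pdf(data):
    # Phase 2: CPU-bound work -- separate processes bypass the GIL.
    checksum = sum(ord(c) for c in data["name"])
    return f"cv_{data['id']}.pdf ({checksum})"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=20) as threads:
        records = list(threads.map(download_data, range(100)))
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as procs:
        files = list(procs.map(generate_pdf, records))
    print(f"{len(files)} PDFs generated")
```

The design choice mirrors the article's point: threads cost almost nothing and leave the CPU free during Phase 1, while the process pool spends the CPU budget only where it actually pays off.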
In both cases, we generate 100 files, and the results are more or less on the same level. The important difference is that with multithreading in Phase 1, the computer still has resources left for other tasks (for example, requests from other users). Multiprocessing, on the other hand, is always physically limited by the hardware. In my case it looks like this:
import psutil
psutil.cpu_count()
12
psutil.cpu_count(logical=False)
6
This means I have at most 12 parallel processes to use, of which 6 are physical CPU cores. Even if I start 100 at once, it won't change anything: once 12 parallel processes are running at the same time, I'm clogging up the CPU and waiting for resources to be freed, which can be seen in the average PDF generation time, which increased by 21.89%.
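Given those numbers, one sensible pattern (a sketch, not the article's code) is to cap the worker count at the logical core count instead of oversubscribing:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def cpu_task(n):
    # Placeholder for a CPU-heavy job such as PDF generation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # More workers than logical cores just queue up and wait for CPU time,
    # which is why the average task time grew once all 12 were saturated.
    workers = min(100, os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cpu_task, [10_000] * 100))
```

Tasks beyond the worker count simply wait in the executor's queue, so requesting 100 workers on a 12-core machine buys nothing but scheduling overhead.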
However, the case is different with the servers on which you host your applications. There it is worth checking what resources you have at your disposal and what you can afford using multiprocessing for "difficult" tasks.
Is it worth it?
Think about how the customer sees it. Let's say your application generates PDF reports for your customers, containing sales statements, charts, tables, etc., and then sends each file by email. This is quite a labor-intensive process. There are 100,000 clients to serve, and generating and emailing one file takes 2 seconds.
Naturally, we would use Celery to run the process of generating and sending the files, informing you at the end by email (or any other way) that the files have been generated, 100,000 emails were sent, 234 did not reach the recipient, and so on.
100,000 files at 2 seconds each: starting the process on Friday at 8 a.m., it will finish after about 55 hours, on Sunday around 3 p.m. Quite a lot of time. But what if we sped it up 100 times? The inexpensive VPS I use gives me 150 parallel processes. If I use 100 of them to generate report files, it will take not 55 hours but... about 33 minutes.
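The arithmetic behind those numbers, spelled out (the 100-worker figure comes from the VPS mentioned above):

```python
n_files = 100_000
seconds_per_file = 2
workers = 100

# Sequential: every file waits for the previous one.
sequential_hours = n_files * seconds_per_file / 3600

# Parallel: 100 files are in flight at any moment.
parallel_minutes = n_files * seconds_per_file / workers / 60

print(f"Sequential: {sequential_hours:.1f} h")          # ~55.6 hours
print(f"With {workers} workers: {parallel_minutes:.1f} min")  # ~33.3 minutes
```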
In the first case, I will only know on Sunday how many users did not receive the report email; in the second, I will know after half an hour, with 7.5 hours of the workday still ahead of me to react, check the emails, and resend the report to users if needed.
...And this is just one of a million cases in which we can use multithreading and multiprocessing in Python :)
I hope this article has pretty much shown you the differences between the traditional approach and the asynchronous approach using the two capabilities Python gives us.
Both approaches have their advantages and disadvantages. It's worth knowing the differences in their operation so that you can use them wisely and take into account the computer resources at your disposal. The most important conclusions I drew are:
- Once you understand concepts like `.join()`, `Queue`, etc., the whole thing starts to be simple, and the possibilities it opens up to us are huge.
- With multiprocessing, each worker runs as a separate process (see `generate_details_log`). This is our "clone kitchen" process that you remember from the first part of this article. Using multiprocessing, we need to provide the worker with everything it will need to perform the task, including the very description of the task it needs to perform.

Feel free to use the code from the repository for your own tests. If you have processes in your application that run in loops, check whether they can be accelerated this way :)