July 28, 2022
A (not) short story about how I came to understand the subtle difference between multiprocessing and multithreading.
What do you remember from 1991? It was full of interesting events: the USSR collapsed, and the Chicago Bulls, with Michael Jordan, won their first NBA championship, for example. But developers remember that year for a different milestone: Python had its premiere.
Although Python has been around for so long, there is still a lot to explore. One of the things I learned recently was multithreading, where, in general, Python offers two approaches to the topic.
In this article, I will try to describe them and point out the differences, which I hope will allow you to choose the right solutions for the processes you program in your remarkable, wonderful, and amazing Python applications.
Imagine you have a yeast dough to bake. Generally, it is a simple cake, whose preparation we can divide into 3 main phases:
As I mentioned, yeast dough is a simple dough. To prepare it you need: flour, sugar, yeast, milk, 5 eggs, margarine, and salt. (By the way, I will tell you my grandma's secret: it's better to use butter instead of margarine; it comes out much better.)
So our list of ingredients looks like this:
ingredients = ["egg", "egg", "egg", "egg", "egg", "sugar 250g", "flour 1kg", "yeast 75g", "milk 1.5 glass", "butter 200g", "pinch of salt"]
In the traditional approach of gathering ingredients for a cake as a cook, you are in the kitchen alone. Moreover, you are not very smart and you’re using only one hand to do the work.
The cook's hand is our Thread.
So the code for gathering ingredients written in Python would look like this:
for ingredient in ingredients:
    cook.go_to_fridge_or_cabinet()
    cook.take_ingredient(ingredient)
    cook.bring_ingredient()
    cook.put_it_on_the_table()
Quite a few steps to do. We repeat them separately for each ingredient, so the whole thing is done 11 times.
With this (traditional) programming approach we run around the kitchen a bit.
Preparing a yeast dough is not as simple as gathering the ingredients. It is also a more laborious process and more prone to errors. The order of operations is crucial; otherwise, we will end up with a scone. Our "program" for preparing the dough must follow the recipe exactly. No step can be skipped.
So our cake baking “code” looks like this:
cook.warm_up_milk("37 °C")
cook.put_ingredients_to_bowl(["(warm) milk", "yeast 75g", "pinch of sugar", "spoon of flour"])
cook.mix_ingredients_in_bowl()
cook.wait("10 min")
cook.melt_butter()
cook.put_ingredients_to_bowl(["egg", "egg", "egg", "egg", "egg", "rest of sugar", "melted butter", "pinch of salt", "rest of flour"])
cook.mix_ingredients_in_bowl("20 min")
cook.wait("30 min")
cook.put_cake_to_baking_tray()
cook.wait("30 min")
cook.bake_cake("50 min", "170 °C")
Again, you do everything with one hand. In the end, you have baked a tasty cake, but its preparation takes a lot of time. The order of the steps is also important here. For example, you can't combine all the `cook.wait()` commands into one à la `cook.wait("70 min")`. You also can't change the order in which the lines of the program are executed. If you do, the cake won't be good.
While baking a cake is hard to optimize (it takes about the same amount of time no matter how many cooks make it), Phase 1 (gathering ingredients) seems pretty easy to optimize. It doesn't matter in which order you bring the ingredients from the fridge and put them on the table for further processing. What's more, you can safely bring in all the ingredients at once. I assure you that no egg will protest that you bring it to the table together with another egg, the flour, or the sugar.
Just how do you do it with one hand?
To understand how multithreading works in Python, the key point is the "To Our Cook" part of the chapter title. By using multithreading, you add hands to your cook, but there is still only one cook in the kitchen.
The official Python documentation refers to threading as "Thread-based parallelism". Tasks are executed in parallel... or rather quasi-parallel. It is this fine distinction between multithreading and multiprocessing that had eluded me all along.
Multithreading gives us the ability to start all the queued tasks simultaneously and execute them concurrently, regardless of their duration. The tasks are executed by a single processor core, with shared access to the memory in which the program runs.
Referring to our cake example, our cook in the kitchen is a mutant octopus on steroids that has grown an extra 4 arms. On a signal we specify, the octopus performs a job we specify, which for readability we will call a worker:
def worker(ingredient):
    cook.go_to_fridge_or_cabinet()
    cook.take_ingredient(ingredient)
    cook.bring_ingredient()
    cook.put_it_on_the_table()

for ingredient in ingredients:
    octopus.submit(worker, ingredient)
What is going on here? In the beginning, we define the work to be done, our "worker": this is the function that will be performed. The worker needs to know which ingredient to bring; without it, it will get lost. This is the same as described above; we do not change anything here.
The second part of the program is more interesting. Having our list (array) of ingredients, we tell our octopus to run a worker for each ingredient on the list.
`octopus.submit(worker, ingredient)`
Since we do this nearly infinitely fast, shouting out successive commands (fetch egg, fetch sugar, fetch yeast...), the octopus, before it even moves, already has all the workers queued and starts executing them. Each worker is carried out by a separate arm with a tentacle.
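In real Python, the "octopus" maps naturally to a thread pool. Below is a minimal, self-contained sketch using `concurrent.futures.ThreadPoolExecutor`; the `time.sleep` call and the message format are illustrative stand-ins for the walk to the fridge.

```python
from concurrent.futures import ThreadPoolExecutor
import time

ingredients = ["egg", "egg", "egg", "egg", "egg", "sugar 250g",
               "flour 1kg", "yeast 75g", "milk 1.5 glass",
               "butter 200g", "pinch of salt"]

def worker(ingredient):
    # Simulate the walk to the fridge and back (I/O-like waiting).
    time.sleep(0.1)
    return f"{ingredient} is on the table"

# The executor is our octopus; each submitted worker runs in its own thread.
with ThreadPoolExecutor(max_workers=5) as octopus:
    futures = [octopus.submit(worker, ing) for ing in ingredients]
    results = [f.result() for f in futures]

print(len(results))  # all 11 ingredients gathered
```

Because the workers spend their time waiting rather than computing, five threads finish the eleven fetches in roughly three sleep periods instead of eleven.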
And so in the first phase the octopus:
This is where we may encounter minor inconveniences. Yeast, milk, or butter are likely to be on different shelves of the refrigerator, in different places. The eggs are most likely on one shelf, in one package. What does the octopus do? Five arms reach onto the same shelf at the same time, wedging together and blocking each other.
No worries. Python's multithreading handles this for us: a thread waits a while until one of the tentacles (threads) frees up the space (computer resources), and then the next thread can proceed.
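The "wedged tentacles" situation is what a lock models: when several threads need the same resource, all but one wait their turn. A small illustrative sketch (the shelf and egg names are invented for the analogy):

```python
import threading
import time

shelf_lock = threading.Lock()  # only one hand fits on the egg shelf at a time
taken = []

def take_egg(n):
    with shelf_lock:      # other threads block here until the shelf is free
        time.sleep(0.01)  # reaching onto the shelf
        taken.append(f"egg {n}")

threads = [threading.Thread(target=take_egg, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(taken))  # all 5 eggs taken, one at a time
```

The `with shelf_lock:` block guarantees the eggs are taken one at a time, even though all five threads were started at once.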
Once all the workers are done, we have the ingredients on the table, ready to continue, and we can move on to Phase 2. And here we hit an obstacle that multithreading is not able to handle, especially if we have several cakes to bake.
The preparation of the cake, as I have already mentioned, is the phase that requires more attention from the cook. The "program" must also be executed in the right order, ending with baking in the oven.
Imagine a situation where we have to bake 10 cakes. In the first step, we bring all the ingredients needed to make them. The table gets cramped but we still fit somehow. We start with 10 cake workers and here we hit an obstacle we cannot overcome.
Running 10 parallel threads won't do us any good. Preparing and baking one cake blocks our resources (bowl and oven) for another 70-80 minutes. The threads running in parallel "wait" for resources to be released before they start executing. And so baking 10 cakes, using multithreading is a job for about 800 minutes (13+ hours).
How do we increase our resources and add more bowls and ovens to the kitchen?
The idea of multiprocessing, allowing all computer resources to be used in parallel, is nothing new. However, unlike some other programming languages, Python itself is not ready for it. It is hindered by the Global Interpreter Lock (GIL), which prevents more than one thread from executing Python bytecode at a time. For those interested in the issue, I recommend the interesting post "What is the Python Global Interpreter Lock (GIL)?", where you will find a detailed description of this "infamous" Python feature.
In order to bypass GIL-related limitations, Python's standard library provides the multiprocessing module (available since version 2.6). Multiprocessing frees us from this limitation, giving us the possibility of full use of all the computer's resources... however, it has its own constraints, which you have to keep in mind and which we will discuss in more detail later in this article.
Let's refer one last time to our cake-baking example. Our kitchen in which we can bake one cake at a time, even if our cook has 10 hands and moves the ingredients to the workbench quite quickly, is not able to handle the case where we have to bake 10 cakes, because this kitchen has one oven into which we can put one cake at a time. With help comes multiprocessing, which replicates our kitchen.
You can imagine it as a block of apartments in which there are many apartments and each of them has a kitchen.
Thanks to multiprocessing, we can use each of them, keeping in mind some important things.
As you can guess, the block is our computer/server. We can expand it by adding more processors, RAM, hard disks, etc. However, every computer, even the biggest one in the world, will reach its limit, which means that if we run 100 processes on it and each of them uses 100 threads, we'll exhaust the capacity of most publicly available servers.
A separate problem associated with multiprocessing is the issue of information exchange. As I mentioned our "kitchens" do not know anything about each other. However, in most cases, when we run parallel tasks, we would like to be able to do something with the results of their actions at the end, when they are all done. In our case, in the end, we'd like to pack up our cakes and take them to the cafe for guests. This is quite an obvious problem, so the library creators have added appropriate solutions that we can use.
The obvious problem we will encounter when using the power of multiprocessing is the question of the maximum resources we will use. Let's imagine a situation where our server gives us 30 processes to use. If we occupy all of them, and in the meantime some other user types in the address of our web page, the server won't even be able to display it, because all 30 processes will be currently occupied with the work we gave the server. Like a hungry child, a client who doesn't see the website will very quickly lose patience.
Multiprocessing, like multithreading, also runs tasks at the same time, but it uses separate computer resources, so the work of one process can be performed independently of the work of another.
The execution of our 10-cake multiprocessing baking program would look like this.
START
  Process 1:  Phase 1 -> Phase 2
  Process 2:  Phase 1 -> Phase 2
  ...
  Process 10: Phase 1 -> Phase 2
END. 10 cakes baked
The baking time for all 10 cakes is the baking time of the slowest cake. Since no one is blocking the oven, none of the running processes waits for an oven to free up. The whole program will be done not in 800 minutes (multithreading) but in 80 minutes. 10 cakes in 80 minutes is already a micro-bakery that we can conquer the market with, especially if we bake a cake as good as the yeast cake described above.
I hope that this explanation of the differences between traditional (in the loop) programming, multithreading, and multiprocessing will help you understand the difference between these issues in more detail and... more importantly, will allow you to better choose a programming solution strategy for your applications.
Now it's time for a short break with coffee and yeast cake and after that, in the next part of the article, I will show you how you can use the knowledge gained in practice.
For the rest of this article, I assume you have some basic knowledge of programming in Python :) and of configuring Docker, which we will use to create our development environment.
You can find the examples below in the repository at https://github.com/michal-stachura/blog-mvm.
Given the speed at which computer programs are executed, the differences between multithreading and multiprocessing that we will discuss next are quite difficult to see. However, we will add a few "test points" to our code and load the CPU heavily, which will give us a better understanding of the differences between these approaches.
—
Ok, in the first part of this article we had two main "Phases" of baking a cake. Phase 1 was easier to execute. Phase 2 required more resources and could "block" us from executing the program due to insufficient computer/server resources.
In the publicly available examples on the Internet, this issue is solved with solutions à la `time.sleep(1)` for easy processes and `time.sleep(10)` for hard processes taking 10 seconds. In reality, both of these are equally trivial for the CPU: they do not consume any CPU resources, but simply make it wait 1 or 10 seconds.
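To make that contrast concrete, compare a sleep-based "task" with work that actually keeps the processor busy; both functions below are illustrative stand-ins, not code from the repository.

```python
import time

def easy_task():
    # I/O-style wait: the CPU sits idle, just like time.sleep(1).
    time.sleep(0.1)

def hard_task():
    # CPU-bound work: the processor is genuinely busy, not just waiting.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

start = time.perf_counter()
result = hard_task()
elapsed = time.perf_counter() - start
print(f"hard_task kept the CPU busy for {elapsed:.3f}s")
```

During `easy_task` the interpreter can happily run other threads; during `hard_task` the GIL is contended the whole time, which is exactly the difference the benchmarks below expose.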
Personally, I prefer a more empirical approach. We write a program that does:
As you can guess, image downloading (point 1) is the easier Phase 1. The service responds at different speeds, so we will see small differences in the time it takes to download and process the images.
Points 2, 3 and 4 of our program are already Phase 2, involving much more of our computer and requiring more resources. Ok, enough talking. Let’s do some code.
`main.py` is our main application file, where in the first part I add the logger configuration and define the parameters that we can use in the tests:
Next, we have a simple call to the `PhaseOne` and `PhaseTwo` classes, which are defined in `app/phase1.py` and `app/phase2.py` respectively:
Both files, with the `PhaseOne` and `PhaseTwo` classes, have a similar structure: for each, I first define the "job" that will be done (`def job()`) and then the workers that, in addition to logging times, execute the defined `job()`.
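For orientation, here is a hypothetical sketch of that structure; the names and details differ from the actual files in the repository, which downloads images and generates PDFs.

```python
# Hypothetical sketch of the job/worker structure in app/phase1.py;
# the real repository code differs.
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

class PhaseOne:
    def job(self, task_id):
        # The actual work: in the repository this downloads an image.
        return f"image {task_id} downloaded"

    def worker(self, task_id):
        # The worker wraps the job with timing/logging "test points".
        start = datetime.now()
        result = self.job(task_id)
        logger.info("Task %s took %s", task_id, datetime.now() - start)
        return result

phase1 = PhaseOne()
print(phase1.worker(0))
```

Keeping the job and the worker separate means the same `job()` can be driven by a plain loop, a thread pool, or a process pool without changes.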
OK, I think this is understandable. Time to get your hands a little dirty :)
The traditional approach: doing the work in a simple loop, without running the code in parallel or doing several things at once.
In the ssh console, type:
docker build --tag monte_py .
This builds an image of the environment that we will use in further testing:
Successfully built 267b3b24efc6
Successfully tagged monte_py:latest
With the image ready, we run the first test:
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="common" --p2_type="common"
The resulting output should look roughly like this:
######################
Number of CV's: 10
Test type:
- Phase 1: common
- Phase 2: common
Detailed report: Y
Max workers:
- Phase 1: Not considered
- Phase 2: Not considered
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.398223
Phase 1 took: 0:00:03.985781
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.375734
Phase 2 took: 0:00:23.978809
--- Summary ---
Whole process took: 0:00:27.964590
--- Details Phase 1 ---
[
"Task: 0 (start) - PID: 1 CPU: 9.5%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 0 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (end) - PID: 1 CPU: 10.2%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
"Task: 3 (end) - PID: 1 CPU: 5.2%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 4 (end) - PID: 1 CPU: 1.7%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 5 (end) - PID: 1 CPU: 2.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 6 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 6 (end) - PID: 1 CPU: 4.0%, RAM (GB): avl: 23.93, used: 6.47, 23.7%)",
"Task: 7 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.93, used: 6.47, 23.7%)",
"Task: 7 (end) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 8 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.59, used: 6.81, 24.8%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.59, used: 6.81, 24.8%)",
"Task: 9 (end) - PID: 1 CPU: 12.3%, RAM (GB): avl: 23.56, used: 6.84, 24.9%)"
]
--- Details Phase 2 ---
[
"Task: 0 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.56, used: 6.84, 24.9%)",
"Task: 0 (end) - PID: 1 CPU: 15.2%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 1 (end) - PID: 1 CPU: 16.2%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 2 (end) - PID: 1 CPU: 15.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 3 (end) - PID: 1 CPU: 14.7%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 4 (end) - PID: 1 CPU: 16.3%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 5 (end) - PID: 1 CPU: 14.5%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 6 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 6 (end) - PID: 1 CPU: 14.4%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 7 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 7 (end) - PID: 1 CPU: 15.5%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 8 (end) - PID: 1 CPU: 15.3%, RAM (GB): avl: 23.77, used: 6.63, 24.2%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.77, used: 6.63, 24.2%)",
"Task: 9 (end) - PID: 1 CPU: 13.6%, RAM (GB): avl: 23.76, used: 6.65, 24.3%)"
]
At the beginning, we have the configuration of the test that was run. As you can see, `max_workers` for Phase 1 and Phase 2 is not considered, even though the default value is 10. Well, in a traditional loop we do not run the code in parallel: the whole thing is processed in a single thread/process, and we have no influence on it.
Then we have the time summaries for phase 1 and phase 2. In my case, it came out at:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.398223
Phase 1 took: 0:00:03.985781
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.375734
Phase 2 took: 0:00:23.978809
It took an average of ~0.39 seconds to download one image; downloading all 10 images took ~3.98 seconds.
It takes my computer about ~2.37 seconds to generate one PDF file, and 10 PDF files take ~23.97 seconds... quite long.
We closed the entire process in ~27.96 seconds, which is a very poor time. I don't think many customers would wait almost half a minute after clicking the "Generate my 10 resume files" button :)
In the test details, you can see how each task in the loop is executed. In both cases, we have the same scheme.
"Task: 0 (start) - PID: 1 CPU: 9.5%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 0 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (end) - PID: 1 CPU: 10.2%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
We start a task -> we do it -> we finish the task. Boredom. We use a single process (`PID: 1`) the whole time, our `CPU` is bored at around 10% most of the time, and the `RAM` remains mostly unused.
It's time to speed things up a bit.
We run the same job, generating 10 resume files, this time with multithreading:
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="multithreading" --p2_type="multithreading" --p1_max_workers=8 --p2_max_workers=8
######################
Number of CV's: 10
Test type:
- Phase 1: multithreading
- Phase 2: multithreading
Detailed report: Y
Max workers:
- Phase 1: 8
- Phase 2: 8
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.447264
Phase 1 took: 0:00:00.706548
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:20.365066
Phase 2 took: 0:00:30.022417
--- Summary ---
Whole process took: 0:00:30.728965
--- Details Phase 1 ---
[
"Task: 0 (start) - PID: 1 CPU: 17.6%, RAM (GB): avl: 23.64, used: 6.82, 24.6%)",
"Task: 4 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 6 (start) - PID: 1 CPU: 25.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 7 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 6 (end) - PID: 1 CPU: 10.9%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 0 (end) - PID: 1 CPU: 20.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 4 (end) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 1 (end) - PID: 1 CPU: 11.1%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 2 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 5 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 7 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 3 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 8 (end) - PID: 1 CPU: 9.1%, RAM (GB): avl: 23.53, used: 6.93, 25.0%)",
"Task: 9 (end) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)"
]
--- Details Phase 2 ---
[
"Task: 1 (start) - PID: 1 CPU: 66.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 0 (start) - PID: 1 CPU: 66.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 2 (start) - PID: 1 CPU: 33.3%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 22.2%, RAM (GB): avl: 23.47, used: 6.99, 25.2%)",
"Task: 6 (start) - PID: 1 CPU: 18.7%, RAM (GB): avl: 23.33, used: 7.14, 25.6%)",
"Task: 5 (start) - PID: 1 CPU: 19.2%, RAM (GB): avl: 23.33, used: 7.14, 25.6%)",
"Task: 7 (start) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.32, used: 7.14, 25.7%)",
"Task: 6 (end) - PID: 1 CPU: 14.5%, RAM (GB): avl: 23.16, used: 7.32, 26.2%)",
"Task: 0 (end) - PID: 1 CPU: 9.3%, RAM (GB): avl: 23.15, used: 7.32, 26.2%)",
"Task: 8 (start) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.15, used: 7.33, 26.2%)",
"Task: 7 (end) - PID: 1 CPU: 10.6%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 9 (start) - PID: 1 CPU: 8.3%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 5 (end) - PID: 1 CPU: 8.1%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 4 (end) - PID: 1 CPU: 8.1%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 3 (end) - PID: 1 CPU: 13.5%, RAM (GB): avl: 23.25, used: 7.23, 25.9%)",
"Task: 1 (end) - PID: 1 CPU: 18.2%, RAM (GB): avl: 23.25, used: 7.23, 25.9%)",
"Task: 2 (end) - PID: 1 CPU: 18.4%, RAM (GB): avl: 23.26, used: 7.21, 25.8%)",
"Task: 8 (end) - PID: 1 CPU: 13.1%, RAM (GB): avl: 23.21, used: 7.27, 26.0%)",
"Task: 9 (end) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.21, used: 7.27, 26.0%)"
]
Let's see what happened here:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.447264
Phase 1 took: 0:00:00.706548
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:20.365066
Phase 2 took: 0:00:30.022417
--- Summary ---
Whole process took: 0:00:30.728965
This time, the average image download took ~0.45 seconds, so in the traditional approach 10 images would download in about ~4.5 seconds. Meanwhile, we have now downloaded them in just ~0.71 seconds. Multithreading sped this phase up by about 84%.
Phase 2, however, did not go so well. Generating one PDF file now takes about ~20.36 seconds. Remember, in the traditional process it was almost ten times less, at ~2.37 seconds. Why such a slowdown?
Let's go back to our cake-baking analogy for a moment. Phase 1 is gathering ingredients, where our eight-handed cook (--p1_max_workers=8) brings the ingredients for the cake and puts them on the table. Since we have 10 ingredients and 8 hands, the first "run" brings 8 ingredients from the fridge and the next run brings the remaining 2. The whole process is asynchronous, so the total time will never be exactly two fridge <-> table runs, but it will be close. If we allowed 10 workers, the total would be not much longer than the time to fetch a single image.
By executing 10 image requests in parallel, we in fact wait only for the last image response. The request that responds last closes the thread queue, and our code continues to execute.
The matter gets a bit more complicated in Phase 2, where we deal with more serious activities requiring more computer resources. Here the time to generate one PDF file took on average ~20.36 seconds. 20 seconds vs. 2.4 in the classic approach is more than 8 times slower. We definitely don't want that :)
Why has the file generation time increased so much?
The answer is simple. In this test, we ran 8 PDF generation threads in parallel, and generating a PDF is itself quite processor-intensive. Even if we run 10 or more threads, it won't change much: the resources available to the asynchronously started threads remain the same. Individual threads simply limp along, waiting for the previously started CV generation tasks to finish and free up resources.
Back to our cake analogy. It doesn't matter how many cooks you put at the cake table. The table is a certain size. Even if you tell everyone at the same time, "start, make cakes", they will get stuck and have to wait for the table to clear before they start working.
For tasks requiring large amounts of CPU resources, Python multithreading simply does not help.
Let's see how it looks using multiprocessing. But before that, take another look at the details of the individual tasks in Phase 1 and Phase 2.
For Phase 1, we have 8 tasks executing in parallel (--p1_max_workers=8); then task 6 ends, and task 8 (waiting in the queue) starts immediately. The same goes for task 9, which starts after task 0 ends, as soon as a worker is free to run it.
The details look similar for Phase 2. Notice the CPU usage. At the very beginning, it reaches a fairly high result of 66.7%, which reflects the situation where at one point 8 threads are opened in parallel to generate a pdf file. Then, the processor, with a relatively constant load oscillating around 20%, closes the individual threads generating the PDF file.
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="multiprocessing" --p2_type="multiprocessing" --p1_max_workers=8 --p2_max_workers=8
After running such a test, you will get a result that looks roughly like this:
######################
Number of CV's: 10
Test type:
- Phase 1: multiprocessing
- Phase 2: multiprocessing
Detailed report: Y
Max workers:
- Phase 1: 8
- Phase 2: 8
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.451822
Phase 1 took: 0:00:00.815908
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.507320
Phase 2 took: 0:00:05.045943
--- Summary ---
Whole process took: 0:00:05.861851
--- Details Phase 1 ---
[
"Task: 1 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.88, used: 7.52, 27.1%)",
"Task: 0 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 6 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 3 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 2 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 5 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 4 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 7 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 6 (end) - PID: 10 CPU: 11.7%, RAM (GB): avl: 22.75, used: 7.65, 27.5%)",
"Task: 8 (start) - PID: 10 CPU: 33.3%, RAM (GB): avl: 22.75, used: 7.65, 27.5%)",
"Task: 0 (end) - PID: 10 CPU: 10.7%, RAM (GB): avl: 22.74, used: 7.66, 27.5%)",
"Task: 9 (start) - PID: 10 CPU: 25.0%, RAM (GB): avl: 22.74, used: 7.66, 27.5%)",
"Task: 1 (end) - PID: 10 CPU: 10.9%, RAM (GB): avl: 22.72, used: 7.68, 27.6%)",
"Task: 2 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 7 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 3 (end) - PID: 10 CPU: 11.2%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 5 (end) - PID: 10 CPU: 11.4%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 4 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 9 (end) - PID: 10 CPU: 11.5%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 8 (end) - PID: 10 CPU: 7.7%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)"
]
--- Details Phase 2 ---
[
"Task: 0 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.78, used: 7.63, 27.4%)",
"Task: 1 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.78, used: 7.63, 27.4%)",
"Task: 2 (start) - PID: 10 CPU: 12.9%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 4 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 6 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 3 (start) - PID: 10 CPU: 12.9%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 5 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 7 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 2 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 8 (start) - PID: 10 CPU: 75.0%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 5 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.29, used: 8.11, 28.9%)",
"Task: 9 (start) - PID: 10 CPU: 100.0%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 3 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.29, used: 8.11, 28.9%)",
"Task: 4 (end) - PID: 10 CPU: 72.5%, RAM (GB): avl: 22.25, used: 8.16, 29.1%)",
"Task: 1 (end) - PID: 10 CPU: 71.5%, RAM (GB): avl: 22.25, used: 8.16, 29.1%)",
"Task: 0 (end) - PID: 10 CPU: 71.4%, RAM (GB): avl: 22.26, used: 8.14, 29.0%)",
"Task: 7 (end) - PID: 10 CPU: 71.4%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 6 (end) - PID: 10 CPU: 71.3%, RAM (GB): avl: 22.3, used: 8.11, 28.9%)",
"Task: 9 (end) - PID: 10 CPU: 25.6%, RAM (GB): avl: 22.33, used: 8.07, 28.8%)",
"Task: 8 (end) - PID: 10 CPU: 25.7%, RAM (GB): avl: 22.32, used: 8.08, 28.8%)"
]
We received interesting results here:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.451822
Phase 1 took: 0:00:00.815908
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.507320
Phase 2 took: 0:00:05.045943
--- Summary ---
Whole process took: 0:00:05.861851
Phase 1 shows more or less the same results as with multithreading. For Phase 2, which took more than 30 seconds with multithreading, we got a very big speedup: this time we generated 10 PDF files in about ~5.04 seconds. That's over 80% less time!
The details of how each phase runs are also interesting. As with multithreading, the image downloads and PDF generation run asynchronously. However, unlike multithreading, multiprocessing uses not 1 but 8 parallel processes. CPU utilization at the beginning is 12-13%, similar to the multithreading run, but then it rapidly rises to 70-100% and stays at that level until the very end, when the last two tasks use about 25% of the CPU.
The pattern is even clearer when generating more CV files, but I don't want to produce overly long logs here. To finish, let's do one more test.
Finally, let's combine the two approaches: multithreading for Phase 1 and multiprocessing for Phase 2. This time we will use more CVs, compare the two phases of the code separately, and try to run as many processes as possible.
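A hedged sketch of that combination using `concurrent.futures` (the `download_data` and `generate_pdf` names are illustrative placeholders, not the repository's actual functions): threads for the I/O-bound Phase 1, processes for the CPU-bound Phase 2.

```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def download_data(cv_id):
    # Phase 1: I/O-bound work (network requests) -- threads are enough,
    # because the GIL is released while waiting on I/O.
    return {"id": cv_id, "name": f"candidate_{cv_id}"}

def generate_pdf(data):
    # Phase 2: CPU-bound work -- separate processes bypass the GIL.
    checksum = sum(ord(c) for c in data["name"])
    return f"cv_{data['id']}.pdf ({checksum})"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=20) as threads:
        records = list(threads.map(download_data, range(100)))
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as procs:
        files = list(procs.map(generate_pdf, records))
    print(f"{len(files)} PDFs generated")
```

The design choice mirrors the article's point: threads cost almost nothing and leave the CPU free during Phase 1, while the process pool spends the CPU budget only where it actually pays off.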
In both cases, we generate 100 files, and the results are more or less on the same level. The important difference is that with multithreading in Phase 1, the computer still has resources left for other tasks (for example, requests from other users). Multiprocessing, on the other hand, is always physically limited by the hardware. In my case it looks like this:
import psutil
psutil.cpu_count()
12
psutil.cpu_count(logical=False)
6
This means I have at most 12 parallel processes to use, of which 6 are physical CPU cores. Even if I start 100 at once, it won't change anything: once 12 parallel processes are running at the same time, I'm clogging up the CPU and waiting for resources to be freed, which can be seen in the average PDF generation time, which increased by 21.89%.
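Given those numbers, one sensible pattern (a sketch, not the article's code) is to cap the worker count at the logical core count instead of oversubscribing:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def cpu_task(n):
    # Placeholder for a CPU-heavy job such as PDF generation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # More workers than logical cores just queue up and wait for CPU time,
    # which is why the average task time grew once all 12 were saturated.
    workers = min(100, os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cpu_task, [10_000] * 100))
```

Tasks beyond the worker count simply wait in the executor's queue, so requesting 100 workers on a 12-core machine buys nothing but scheduling overhead.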
However, the case is different with the servers on which you host your applications. There it is worth checking what resources you have at your disposal and what you can afford using multiprocessing for "difficult" tasks.
Is it worth it?
Think about how the customer sees it. Let's say your application generates PDF reports for your customers, containing sales statements, charts, tables, etc., and then sends each file by email. This is quite a labor-intensive process. There are 100,000 clients to serve, and generating and emailing one file takes 2 seconds.
Naturally, we would use Celery to run the process of generating and sending the files, informing you at the end by email (or any other way) that the files have been generated, 100,000 emails were sent, 234 did not reach the recipient, and so on.
100,000 files at 2 seconds each: starting the process on Friday at 8 a.m., it will finish after about 55 hours, on Sunday around 3 p.m. Quite a lot of time. But what if we sped it up 100 times? The inexpensive VPS I use gives me 150 parallel processes. If I use 100 of them to generate report files, it will take not 55 hours but... about 33 minutes.
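The arithmetic behind those numbers, spelled out (the 100-worker figure comes from the VPS mentioned above):

```python
n_files = 100_000
seconds_per_file = 2
workers = 100

# Sequential: every file waits for the previous one.
sequential_hours = n_files * seconds_per_file / 3600

# Parallel: 100 files are in flight at any moment.
parallel_minutes = n_files * seconds_per_file / workers / 60

print(f"Sequential: {sequential_hours:.1f} h")          # ~55.6 hours
print(f"With {workers} workers: {parallel_minutes:.1f} min")  # ~33.3 minutes
```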
In the first case, I will only know on Sunday how many users did not receive the report email; in the second, I will know after half an hour, with 7.5 hours of the workday still ahead of me to react, check the emails, and resend the report to users if needed.
...And this is just one of a million cases in which we can use multithreading and multiprocessing in Python :)
I hope this article has pretty much shown you the differences between the traditional approach and the asynchronous approach using the two capabilities Python gives us.
Both approaches have their advantages and disadvantages. It's worth knowing the differences in their operation so that you can use them wisely and take into account the computer resources at your disposal. The most important conclusions I drew are:
- Once you understand concepts like `.join()`, `Queue`, etc., the whole thing starts to be simple, and the possibilities it opens up to us are huge.
- With multiprocessing, each worker runs as a separate process (see `generate_details_log`). This is our "clone kitchen" process that you remember from the first part of this article. Using multiprocessing, we need to provide the worker with everything it will need to perform the task, including the very description of the task it needs to perform.

Feel free to use the code from the repository for your own tests. If you have processes in your application that run in loops, check whether they can be accelerated this way :)