Talking About Load Balance in Multi-threaded Programs

Speaking of load balance, what usually comes to mind is load balancing across multiple replicas of a service, and the kernel's load balancing. The former is mostly a matter of distributing traffic at the entry point, and its logic is relatively simple; the latter is more involved: the kernel has to continuously detect imbalance among the running processes and migrate processes between CPUs, so that every CPU is put to good use.

The load balance this article wants to discuss is different from both of the above: it is the load balance among the worker threads (or processes) inside a multi-threaded (or multi-process) server program.

Consider a commonly used server model: a receiver thread is responsible for accepting requests, and a thread pool holds a bunch of worker threads; the requests received are dispatched to these workers. The receiver and the workers communicate through a pthread_cond plus a request queue. The usual approach is: the receiver pushes the request onto the queue, signals the cond, and that is it. Which worker gets woken up is the kernel's business (in fact the kernel follows a first-in-first-out principle and wakes the process that has been waiting the longest; see "Linux Futex"). Usually this is good enough, and waking a worker from the receiver involves no extra logic. But sometimes we can also do some load balance work here to improve the server's performance.
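As a point of reference, a minimal sketch of this receiver/worker pattern might look like the following (my own illustration, not the code attached to this article), with an unbounded queue and a single cond:

#include <pthread.h>
#include <queue>

// One mutex + cond protects a request queue; the receiver pushes and
// signals, workers pop and process. Which worker wakes is up to the kernel.
struct Request { long payload; };

static std::queue<Request> g_queue;
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond  = PTHREAD_COND_INITIALIZER;

// Called by the receiver thread for every incoming request.
void dispatch(const Request &req) {
    pthread_mutex_lock(&g_mutex);
    g_queue.push(req);
    pthread_cond_signal(&g_cond);
    pthread_mutex_unlock(&g_mutex);
}

// Body of each worker thread.
void *worker_main(void *) {
    for (;;) {
        pthread_mutex_lock(&g_mutex);
        while (g_queue.empty())
            pthread_cond_wait(&g_cond, &g_mutex);
        Request req = g_queue.front();
        g_queue.pop();
        pthread_mutex_unlock(&g_mutex);
        // ... process req ...
    }
    return NULL;
}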

Kernel Load Balance Overview

Since the load balance discussed here is closely related to the kernel's load balance, we must first see what the kernel's load balance actually does. For details, please refer to "Linux kernel SMP load balancing"; here we only give a brief summary.

To put it bluntly, the kernel's load balance does one thing: keep the runnable processes in the system balanced across every scheduling domain, as far as possible. How to understand this? The structure of today's CPUs generally has a few levels: physical CPU, core, hyper-thread. "Balanced across every scheduling domain" can be understood as balanced at every one of these levels: the total load on each physical CPU is about the same, the total load on each core is about the same, and the load on each hyper-thread is also about the same.

The "CPUs" we see from the system are at the lowest, hyper-thread level. Intuitively we might think that spreading the runnable processes evenly across these "CPUs" is enough, but the kernel's load balance actually has higher requirements. Suppose our machine has 2 physical CPUs, each physical CPU has 2 cores, and each core has 2 hyper-threads, 8 "CPUs" in total. If there are 8 runnable processes (assume equal priority), then one process per "CPU" is naturally balanced. But if there are only 4 runnable processes (again equal priority), true balance is not merely each process landing on some "CPU"; it further requires that two processes run on each physical CPU, one per core.

Why this stronger constraint? Because although the "CPUs" are logically independent (there is no master/slave relationship among them), they are not isolated from each other. "CPUs" under the same physical CPU share the cache, and "CPUs" under the same core share execution resources (so-called hyper-threading is one pipeline running two threads). And sharing means contention. Therefore, when the runnable processes cannot fill all the "CPUs", the kernel also has to consider which higher-level resources the chosen "CPUs" share, to avoid contention for the cache and the CPU pipeline (besides performance, this also reflects the kernel's notion of fairness).

Finally, it is worth mentioning that the kernel's load balance is asynchronous. To avoid consuming too many resources, the kernel certainly does not monitor the state of every "CPU" in real time and react to changes immediately (real-time processes are an exception, but they are outside the scope of this discussion).

Load Balance Considerations in the Server

With the kernel's load balance as background, let us see what the receiver thread in our server can do.

The first issue is the number of worker threads. What happens if there are many more workers than "CPUs"? Assume our machine has the 8 "CPUs" described above and we start 80 workers, and assume those 80 threads have been spread evenly, 10 per "CPU", all waiting for tasks. When a batch of requests arrives, since our receiver has no load balance strategy at all, which "CPU" each awakened worker happens to sit on is essentially random. Think about it: if 8 requests arrive "at the same time", what is the probability that they land on 8 different "CPUs"? It is (70 * 60 * 50 * 40 * 30 * 20 * 10) / (79 * 78 * 77 * 76 * 75 * 74 * 73) ≈ 0.34% (the first woken worker can sit anywhere; the second must be one of the 70 workers on the other 7 "CPUs" among the remaining 79; and so on). In other words, almost always some "CPU" ends up handling several requests while some other "CPU" sits idle, and the resulting performance can be imagined. By the time the kernel's load balance, reacting after the fact, has migrated things so that the requests are spread across the "CPUs", much of the requests' processing time may already have passed; and when the next batch of requests arrives, the load gets messed up again, because the workers that just finished (and were migrated into good positions) go back to the tail of the cond wait queue, while the ones that respond first to the new requests are exactly those that were never balanced. Will balance be reached after several rounds of requests, then? If requests really arrived in neat rounds and every request were exactly identical, perhaps; the real situation is certainly far from that.
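A tiny standalone computation (my own sketch, not part of the attached code) reproduces that number:

#include <stdio.h>

// Sanity check of the 0.34% figure: 80 workers, 10 per "CPU", 8 woken at
// random; probability that the 8 woken workers sit on 8 distinct "CPUs".
int main() {
    double p = 1.0;
    for (int k = 1; k < 8; ++k) {
        // k workers already placed on k distinct "CPUs"; the next one must
        // come from the 10*(8-k) workers on the other CPUs, out of 80-k left.
        p *= (double)(10 * (8 - k)) / (80 - k);
    }
    printf("%.2f%%\n", p * 100.0);   // prints 0.34%
    return 0;
}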

What is the solution? Changing the cond's first-in-first-out wait queue into last-in-first-out stack behaviour might relieve the problem somewhat, but a better way is to limit the number of workers to be equal to, or slightly smaller than, the number of "CPUs"; then things are naturally balanced.
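As a small illustration (again my own sketch, not the attached code), the pool size can be derived from the number of online "CPUs" at startup:

#include <unistd.h>
#include <stdio.h>

// Sketch: derive the worker pool size from the number of online "CPUs"
// (hyper-threads), leaving one for the receiver thread (a design choice
// made for this example, not something prescribed by the article).
int main() {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    long nworkers = ncpu > 1 ? ncpu - 1 : 1;
    printf("online CPUs: %ld -> worker threads: %ld\n", ncpu, nworkers);
    return 0;
}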

The second issue: since we acknowledge the value of load balance across every scheduling domain, can the receiver thread in our server obtain the same benefit in a similar way? Suppose, having learned the lesson above, we start only 8 worker threads. Thanks to the kernel's load balance, these 8 threads end up basically fixed, one per "CPU". Now suppose 4 requests arrive; they will fall on 4 different "CPUs". If we are lucky, those 4 "CPUs" belong to 4 different cores, and the threads handling the requests will not contend for CPU resources; if not, we may end up with 2 cores very busy and 2 cores idle.

To solve this problem, two things are needed; let us continue with our server program as the example. First, the receiver thread must know which "CPU" each worker thread sits on, so that it can dispatch tasks with balance in mind. The easiest way to achieve this is to pin each worker onto a "CPU" with the sched affinity interfaces (sched_setaffinity / pthread_setaffinity_np), which also keeps the kernel's load balance from complicating the picture; since we already keep the number of workers equal to, or slightly smaller than, the number of "CPUs", pinning one worker per "CPU" is perfectly feasible. Second, we need to improve on the plain pthread_cond by assigning a priority to each worker that enters the wait state, for example the first hyper-thread of each core as the first priority and the second hyper-thread as the second priority. Then, when the cond wakes a worker, we can try to avoid piling the busy workers onto the same core. The implementation can use the bitset family of futex operations, encoding the priority in the bitset so that a specific group of workers can be woken (see "Linux Futex").
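To make the two points concrete, here is a rough sketch (my own illustration, not the attached code) of pinning a worker to one "CPU" and of waking workers by priority class with the futex bitset operations; the shared futex word and the "one bit per layer" encoding are assumptions made for the example:

#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static int g_futex_word = 0;   // hypothetical shared futex word

// Pin the calling thread onto one "CPU".
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// A worker of priority `layer` sleeps with only its layer's bit in the bitset.
static void wait_in_layer(int expected, int layer) {
    syscall(SYS_futex, &g_futex_word, FUTEX_WAIT_BITSET, expected,
            NULL, NULL, 1u << layer);
}

// The receiver wakes one worker, trying the first priority class first.
static void wake_one(int layers) {
    for (int layer = 0; layer < layers; ++layer) {
        long n = syscall(SYS_futex, &g_futex_word, FUTEX_WAKE_BITSET, 1,
                         NULL, NULL, 1u << layer);
        if (n > 0)
            break;   // someone in this priority class was woken
    }
}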

Example

OK, enough theory; let us verify with a real example. For simplicity we do not write an actual server program: just one producer thread and several consumer threads. The producer thread generates tasks and hands them to the consumer threads through a cond + queue. In order to observe the behaviour of the program under different task loads, we need to control the task load: after a consumer thread finishes a task, it reports back to the producer through another cond + queue, so the producer knows how many tasks are currently being processed and can pace the production of new ones. Finally, we judge the program's performance by the time it takes to finish a batch of tasks under different conditions.

The key here is the processing logic of the task itself. Since we are discussing CPU load, the task must be CPU-intensive. Moreover, the processing time of a single task should not be too short, otherwise the dispatching path may become the bottleneck of the program and the CPU load issue would not show; on the other hand, the processing time of a single task should not be too long either, otherwise the kernel's load balance, even arriving late, could still fix the imbalance, and our proactive load balance would show no benefit (for example, if a task takes 10 seconds, the tens of milliseconds the kernel spends on rebalancing hardly matter). The code is attached at the end of the article; the compiled binary is used like this:

$ g++ cond.cpp -pthread -O2
$ ./a.out
usage: ./a.out -j job_kind(shm|calc) [-t thread_count(1)] [-o job_load(1)] [-c job_count(10)] [-a affinity(0)] [-l] [-f filename("./test") -n filelength(128M)]

There are two kinds of task logic in the code: "-j shm" mmaps a file and then reads the data in it to do some computation (the file and its length are controlled by the -f and -n parameters); "-j calc" does some arithmetic operations;

The "-t" parameter specifies the number of worker threads;

"-o" specifies the task load, i.e. how many tasks the producer keeps in flight at the same time;

"-c" specifies the number of tasks each thread processes;

"-a" specifies whether to set sched affinity, and with which stride the worker threads are pinned to the "CPUs". For example, "-a 1" pins the workers to "CPU" 1, 2, 3, ... in turn, "-a 2" pins them to "CPU" 2, 4, 6, ..., and so on. Note that adjacent "CPU" numbers do not mean the "CPUs" are physically adjacent; for example, on the machine I tested there are 24 "CPUs" in total, where 0-11 are the first hyper-threads of the cores and 12-23 are the second hyper-threads. Such details have to be determined by reading /proc/cpuinfo.
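One way to check which logical "CPUs" are hyper-thread siblings (my own sketch; the article reads /proc/cpuinfo, and the sysfs topology files below simply expose the same information):

#include <stdio.h>

// Print (package, core) for each "CPU" so we can see which logical CPUs
// share a core, i.e. which ones are hyper-thread siblings.
int main() {
    for (int cpu = 0; ; ++cpu) {
        char path[128];
        int pkg = -1, core = -1;
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                       // no more CPUs
        fscanf(f, "%d", &pkg);
        fclose(f);
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        f = fopen(path, "r");
        if (!f) break;
        fscanf(f, "%d", &core);
        fclose(f);
        printf("cpu%d: package %d, core %d\n", cpu, pkg, core);
    }
    return 0;
}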

The "-l" parameter enables our enhanced, layered cond, which treats the workers on "CPU" 0-11 as the first priority and those on 12-23 as the second priority (of course this only makes sense together with the "-a" parameter; otherwise we cannot be sure which "CPU" a worker is on);

First, let us look at the problems caused by having too many worker threads (each case below is run 5 times and the minimum time is taken).

Case-1, 240 worker threads, task load 24:

$ ./a.out -j calc -t 240 -o 24
total cost: 23790
$ ./a.out -j shm -t 240 -o 24
total cost: 16827

Case-2, 24 worker threads, task load 24:

$ ./a.out -j calc -t 24 -o 24
total cost: 23210
$ ./a.out -j shm -t 24 -o 24
total cost: 16121

Case-2 is clearly better. And if you watch top while they run, you will find that Case-1 only pushes the CPU to about 2200%, while Case-2 reaches 2400%.

What if, on top of Case-1, we also shut the kernel's load balance out of the picture by pinning the threads? Let us try:

Case-3, 240 worker threads, task load 24, with affinity:

$ ./a.out -j calc -t 240 -o 24 -a 1
total cost: 27170
$ ./a.out -j shm -t 240 -o 24 -a 1
total cost: 15351

The calc task behaves as expected: without the kernel's load balance, performance drops further.

The shm task, however, comes as a surprise: performance actually improves! In fact, besides the CPU, this task also depends on memory: since all tasks work on the mmap of the same file, having them crowd onto neighbouring "CPUs" is friendlier to the memory cache. (Judging by this case, the kernel's load balance was actually doing it a disservice.)

So, if we bring the number of worker threads back down to 24, will it be even better?

Case-3':

$ ./a.out -j shm -t 24 -o 24 -a 1
total cost: 15133

Now look at the second issue: the impact of workers being distributed unevenly across the cores.

Case-4, 24 worker threads, task load 12:

$ ./a.out -j calc -t 24 -o 12
total cost: 14686
$ ./a.out -j shm -t 24 -o 12
total cost: 13265

Case-5, 24 worker threads, task load 12, with affinity and the layered cond enabled:

$ ./a.out -j calc -t 24 -o 12 -a 1 -l
total cost: 12206
$ ./a.out -j shm -t 24 -o 12 -a 1 -l
total cost: 12376

The effect is quite good. What if we change the "-a" parameter so that the two hyper-threads of the same core fall into the same priority class?

Case-5':

$ ./a.out -j calc -t 24 -o 12 -a 2 -l
total cost: 23510
$ ./a.out -j shm -t 24 -o 12 -a 2 -l
total cost: 15063

Because of the contention for CPU execution resources, the calc task's performance becomes very poor, almost halved. The shm task suffers less thanks to the benefit of cache reuse (it is even slightly better than Case-3').

The tasks here are only the two examples, calc and shm; real workloads can be much more complicated. The value of load balance is beyond doubt, but will a given task profit from sharing the cache, or lose because it keeps evicting the cache? How much will it lose from contending for the CPU pipeline? These questions can only be answered case by case. The kernel's load balance spreads the load across "CPUs" that are as far apart as possible, and in most cases that is the right call. However, we have also seen that the gain from cache sharing in the shm task is considerable; with a more extreme example, packing the load onto "CPUs" that are as close together as possible might well turn out better.

On the other hand, how much is lost by contending for the CPU pipeline can also be analysed roughly. Hyper-threading is essentially two threads sharing one CPU pipeline. If a single thread's code has serious dependencies between adjacent instructions, the instructions can only execute serially and cannot fill the pipeline, so the idle slots of the pipeline are left for the second thread to use. Conversely, if one thread can fill the pipeline by itself, then squeezing two such threads onto one core leaves each with only about 50% of the performance (the calc example above is roughly this case).

To illustrate this, we add a SERIAL_CALC macro switch to the calc task, turning its computation into one where adjacent operations depend on each other. Re-running the two calc commands of Case-5 and Case-5' then shows that, with this kind of load, packing the workers close together is much less of a problem:

Case-6, calc task with the SERIAL_CALC logic, re-running the calc commands of Case-5 and Case-5':

$ g++ cond.cpp -pthread -O2 -DSERIAL_CALC
$ ./a.out -j calc -t 24 -o 12 -a 1 -l
total cost: 51269
$ ./a.out -j calc -t 24 -o 12 -a 2 -l
total cost: 56753
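The difference comes from the shape of the inner loop. The two variants, as they appear in the code attached below, are:

for (long i = 0; i < job->_input; i++) {
#ifndef SERIAL_CALC
    v1 += v2 + v3;   // three independent chains of operations
    v2 *= 3;
    v3 *= 5;
#else
    v1 += v2 + v3;   // each result feeds into the next statement
    v2 = v1 * 5 + v2 * v3;
    v3 = v1 * 3 + v1 * v2;
#endif
}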

Finally, you can try more cases yourself. Have fun!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define CPUS 24
#define FUTEX_WAIT_BITSET 9
#define FUTEX_WAKE_BITSET 10

struct Job {
    long _input;
    long _output;
};

class JobRunner {
public:
    virtual void run(Job *job) = 0;
};

class ShmJobRunner : public JobRunner {
public:
    ShmJobRunner(const char *filepath, size_t length) : _length(length) {
        int fd = open(filepath, O_RDONLY);
        _base = (long *)mmap(NULL, _length * sizeof(long), PROT_READ,
                             MAP_SHARED | MAP_POPULATE, fd, 0);
        if (_base == MAP_FAILED) {
            printf("FATAL: mmap %s (%lu) failed!\n", filepath, _length * sizeof(long));
            abort();
        }
        close(fd);
    }
    virtual void run(Job *job) {
        // walk the mapped file with a fixed stride, folding the data into the output
        long i = job->_input % _length;
        long j = i + _length - 1;
        const int step = 4;
        while (i + step < j) {
            job->_output += _base[i % _length];
            i += step;
        }
    }
private:
    const long *_base;
    size_t _length;
};

class CalcJobRunner : public JobRunner {
public:
    virtual void run(Job *job) {
        long v1 = 1, v2 = 1, v3 = 1;
        for (long i = 0; i < job->_input; i++) {
#ifndef SERIAL_CALC
            v1 += v2 + v3;
            v2 *= 3;
            v3 *= 5;
#else
            v1 += v2 + v3;
            v2 = v1 * 5 + v2 * v3;
            v3 = v1 * 3 + v1 * v2;
#endif
        }
        job->_output = v1;
    }
};

class JobRunnerCreator {
public:
    static JobRunner *create(const char *name, const char *filepath, size_t filelength) {
        if (strcmp(name, "shm") == 0) {
            printf("share memory job\n");
            return new ShmJobRunner(filepath, filelength);
        } else if (strcmp(name, "calc") == 0) {
            printf("calculation job\n");
            return new CalcJobRunner();
        }
        printf("unknown job '%s'\n", name);
        return NULL;
    }
};

class Cond {
public:
    virtual void lock() = 0;
    virtual void unlock() = 0;
    virtual void wait(size_t) = 0;
    virtual void wake() = 0;
};

class NormalCond : public Cond {
public:
    NormalCond() { pthread_mutex_init(&_mutex, NULL); pthread_cond_init(&_cond, NULL); }
    ~NormalCond() { pthread_mutex_destroy(&_mutex); pthread_cond_destroy(&_cond); }
    void lock() { pthread_mutex_lock(&_mutex); }
    void unlock() { pthread_mutex_unlock(&_mutex); }
    void wait(size_t) { pthread_cond_wait(&_cond, &_mutex); }
    void wake() { pthread_cond_signal(&_cond); }
private:
    pthread_mutex_t _mutex;
    pthread_cond_t _cond;
};

class LayeredCond : public Cond {
public:
    LayeredCond(size_t layers = 1) : _value(0), _layers(layers) {
        pthread_mutex_init(&_mutex, NULL);
        if (_layers > sizeof(int) * 8) {
            printf("FATAL: can not support %lu layers (max %lu)\n",
                   _layers, sizeof(int) * 8);
            abort();
        }
        _waiters = new size_t[_layers];
        memset(_waiters, 0, sizeof(size_t) * _layers);
    }
    ~LayeredCond() {
        pthread_mutex_destroy(&_mutex);
        delete[] _waiters;
        _waiters = NULL;
    }
    void lock() { pthread_mutex_lock(&_mutex); }
    void unlock() { pthread_mutex_unlock(&_mutex); }
    void wait(size_t layer) {
        if (layer >= _layers) {
            printf("FATAL: layer overflow (%lu/%lu)