The Performance Analysis of MENCODER with PARROT

1 ENVIRONMENTS

Table 1 ENV A

CPU:            Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz, 10-core
Memory:         100 GB
Linux Version:  3.11.0
Ubuntu Version: 12.04

2 RESULTS
We find that MENCODER with PARROT is faster than MENCODER without PARROT when run over four threads.


3 EXPLANATIONS
We need to explain why MENCODER with PARROT is faster than MENCODER without PARROT over four threads. [1] tells us that the number of Synchronization Context Switches with PARROT is much lower than without PARROT, and the reason is that PARROT's round-robin scheduling reduces contention and its synchronizations use a more efficient wait that combines spin- and block-waits. So we take the following steps to find the detailed reasons.
(a) First, is Syn Context Switches the main factor that leads to the result?
(b) If it is, see (c)-(e); otherwise jump to (f).
(c) What is the relationship between Syn Context Switches and thread contention (a.k.a. lock contention)?
(d) What causes the contention?
(e) How does PARROT's round-robin scheduling reduce contention?
(f) Otherwise, we need to find other factors.


3.1 Preparations
We can't read through the source code of MENCODER, and we don't know where PARROT takes effect in it. So I want a simple model that is similar to MENCODER. It is located in xtern/apps/mplayer/ and consists of speedup-example.cpp (with PARROT) and speedup-example-mutex-only.cpp (without PARROT). We call it EXAMPLE, and I wrote the following build script (mk) for speedup-example.cpp.

#!/bin/bash
cd $XTERN_ROOT/apps/mplayer
gcc speedup-example.cpp -o speedup-example -O2 -g \
-I$XTERN_ROOT/include -L$XTERN_ROOT/dync_hook -Wl,--rpath,$XTERN_ROOT/dync_hook \
-lxtern-annot \
-lpthread


So we can run the model. Before that, however, we should argue that EXAMPLE can stand in for MENCODER. We can regard MENCODER as the common producer-consumer idiom: the producer reads blocks from the mp4 file, and the consumers convert the blocks into avi. When many consumers strive for the same lock, performance is lost; this is the key point. If we want to explore the reasons for the speedup of MENCODER with PARROT, we need to simplify it as above, and EXAMPLE is a good choice: it lets many threads access the same critical resource, and uses PARROT to get a speedup. Additionally, EXAMPLE and MENCODER both enjoy big speedups. So if we can find the detailed reasons for the speedup of EXAMPLE with PARROT, those reasons should also apply to MENCODER. A rough sketch of this producer-consumer view is shown below.
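To make the producer-consumer view concrete, here is a minimal, hypothetical sketch of it. The names, block count, and thread count are invented for illustration; this is not MENCODER's real code. The single mutex that every consumer must take is the contention point discussed above.

// Hypothetical producer-consumer sketch (not MENCODER's real code): one
// producer pushes "blocks" into a queue guarded by a single mutex, and four
// consumer threads pop and "encode" them. The shared mutex is where the
// consumers contend.
#include <pthread.h>
#include <queue>
#include <cstdio>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static std::queue<int> blocks;   // stands in for mp4 blocks
static bool done = false;

void* producer(void*) {
    for (int b = 0; b < 1000; ++b) {         // "read" 1000 blocks (made-up number)
        pthread_mutex_lock(&mutex);
        blocks.push(b);
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&mutex);
    }
    pthread_mutex_lock(&mutex);
    done = true;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

void* consumer(void*) {
    for (;;) {
        pthread_mutex_lock(&mutex);          // every consumer strives for this lock
        while (blocks.empty() && !done)
            pthread_cond_wait(&cond, &mutex);
        if (blocks.empty() && done) { pthread_mutex_unlock(&mutex); return NULL; }
        int b = blocks.front();
        blocks.pop();
        pthread_mutex_unlock(&mutex);
        (void)b;                             // "convert" block b into avi (omitted)
    }
}

int main() {
    pthread_t prod, cons[4];
    pthread_create(&prod, NULL, producer, NULL);
    for (int i = 0; i < 4; ++i) pthread_create(&cons[i], NULL, consumer, NULL);
    pthread_join(prod, NULL);
    for (int i = 0; i < 4; ++i) pthread_join(cons[i], NULL);
    printf("all blocks consumed\n");
    return 0;
}

The point of the sketch is only the shape of the contention: the more consumers there are, the more threads queue up behind the one mutex.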


3.2 Is Syn Context Switches the main factor that leads to the result?
Now we can use EXAMPLE to explain MENCODER. We run EXAMPLE over four threads on ENV A. The results are as follows.

Table 2 Results of EXAMPLE over four threads, ENV A

                 WITHOUT PARROT    WITH PARROT
Elapsed Time     7.85 s            4.36 s
Very good: PARROT gets a high speedup. Let's also run it over a single thread without PARROT.

Table 3 Results of EXAMPLE without PARROT over 1 thread, ENV A

                 WITHOUT PARROT
Elapsed Time     1.38 s

Now we try to explain the results. First, let’s see the source code.


speedup-example-mutex-only.cpp

void* thread_func(void* arg) {
    ...
    for (int j = 0; j < M; j++) {
        pthread_mutex_lock(&mutex);
        // This is the key to the speedup with PARROT: the mutex needs to be
        // a little bit congested.
        for (long i = 0; i < loops; i++)
            sum += i;
        pthread_mutex_unlock(&mutex);
        ...
    }
}

int main(int argc, char* argv[]) {
    ...
    for (unsigned i = 0; i < N; ++i) {
        ret = pthread_create(&th[i], NULL, thread_func, (void*)i);
        assert(!ret && "pthread_create() failed!");
    }
    ...
    exit(0);
}
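For readers who want to reproduce the measurement, here is a self-contained version of the mutex-only benchmark that compiles on its own. The values of N, M, and loops are placeholders I picked for illustration; they are not necessarily the constants used in speedup-example-mutex-only.cpp.

// Self-contained sketch of the mutex-only benchmark.
// N, M and loops below are illustrative placeholders, not the original constants.
#include <pthread.h>
#include <cassert>
#include <cstdio>

static const unsigned N = 4;        // number of threads (assumed)
static const int M = 1000;          // lock/unlock rounds per thread (assumed)
static const long loops = 100000;   // work done while holding the lock (assumed)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static volatile long sum = 0;       // volatile so the compiler keeps the loop

void* thread_func(void* arg) {
    (void)arg;
    for (int j = 0; j < M; j++) {
        pthread_mutex_lock(&mutex);
        for (long i = 0; i < loops; i++)   // the congested critical section
            sum += i;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    pthread_t th[N];
    for (unsigned i = 0; i < N; ++i) {
        int ret = pthread_create(&th[i], NULL, thread_func, (void*)(long)i);
        assert(!ret && "pthread_create() failed!");
        (void)ret;
    }
    for (unsigned i = 0; i < N; ++i)
        pthread_join(th[i], NULL);
    printf("sum = %ld\n", sum);
    return 0;
}

Built with g++ -O2 -pthread and timed with different thread counts, this sketch should reproduce the general shape of Tables 2 and 3 for the runs without PARROT.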

According to Amdahl's law, since nearly all of the work happens inside the critical section and every thread's critical sections must execute serially, the elapsed time without PARROT over four threads should be about four times that over a single thread. However, it is clearly more than four times. The execution with PARROT enjoys a good speedup, and its elapsed time is only about three times the single-thread time; a quick check of the numbers is given below. Now let's use VTune to find out where the extra time without PARROT goes.
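A back-of-the-envelope check using Tables 2 and 3, where T1 = 1.38 s is the single-thread time without PARROT:

    4 × T1 = 4 × 1.38 s = 5.52 s    (expected four-thread time if only the critical sections serialize)
    7.85 s / 1.38 s ≈ 5.7           (measured: four threads, without PARROT)
    4.36 s / 1.38 s ≈ 3.2           (measured: four threads, with PARROT)

The run without PARROT is well above the 4× bound; the extra ≈ 2.3 s is the overhead we need to explain.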


Fig. 1. VTune results: 1 thread, without PARROT

Fig. 2. VTune results: 4 threads, without PARROT

Fig. 3. VTune results: 4 threads, with PARROT


Fig. 1 and Fig. 2 show that as the number of threads increases, the number of Synchronization Context Switches grows. Fig. 2 and Fig. 3 show that PARROT reduces the number of Context Switches. In other words, as Synchronization Context Switches increase, the Elapsed Time increases, and as they drop, the Elapsed Time drops. From this we conclude that Syn Context Switches is the main factor behind the result we are trying to explain.


3.3 What's the relationship between Syn Context Switches and thread contention?
We now know that Syn Context Switches is the main factor. We also know that as the number of threads increases, thread contention increases. So what is the relationship between Syn Context Switches and thread contention? Contention not only causes serialization, it also brings the overhead of thread switches [2]. From VTune (Fig. 4) we can see that threads spend more Wait Time on the Mutex lock, which is caused by thread contention.


So if we can reduce the contention, or use a more efficient wait, we will get a good speedup. To do that, we need to understand how the contention is generated. If VTune is not at hand, the sketch below shows a rough way to observe the same context-switch behaviour.
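A rough, Linux-specific way to observe the effect is to read each thread's context-switch counters via getrusage(RUSAGE_THREAD, ...). The workload and constants below are illustrative only; this is my own sketch, not part of EXAMPLE.

// Sketch: print per-thread voluntary/involuntary context-switch counts on Linux.
// Voluntary switches (ru_nvcsw) grow when threads block on a contended mutex.
#define _GNU_SOURCE 1        // for RUSAGE_THREAD (Linux >= 2.6.26)
#include <pthread.h>
#include <sys/resource.h>
#include <cstdio>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static volatile long sum = 0;

void* worker(void* arg) {
    for (int j = 0; j < 1000; ++j) {
        pthread_mutex_lock(&mutex);
        for (long i = 0; i < 10000; ++i)   // contended critical section
            sum += i;
        pthread_mutex_unlock(&mutex);
    }
    struct rusage ru;
    getrusage(RUSAGE_THREAD, &ru);
    printf("thread %ld: voluntary=%ld involuntary=%ld\n",
           (long)arg, ru.ru_nvcsw, ru.ru_nivcsw);
    return NULL;
}

int main() {
    pthread_t th[4];
    for (long i = 0; i < 4; ++i) pthread_create(&th[i], NULL, worker, (void*)i);
    for (long i = 0; i < 4; ++i) pthread_join(th[i], NULL);
    return 0;
}

Running it with one thread and then with four should show the voluntary context-switch counts climbing with the thread count, mirroring what VTune reports.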


3.4 What causes the contention?


Let's look at the source code of EXAMPLE. Every thread executes the following code:

pthread_mutex_lock(&mutex);
for (long i = 0; i < loops; i++) sum += i;
pthread_mutex_unlock(&mutex);

When many threads strive for the same lock, contention happens: only one thread can hold the lock at a time, while the others have to wait. So if we want to reduce contention, we should not let too many threads fight over the same lock at once; we should apply some scheduling policy rather than let the threads contend freely.


3.5 How does PARROT’s round-robin scheduling reduce contention?


First, let's look at the Locks and Waits analysis of EXAMPLE, with and without PARROT, in VTune.


With PARROT


Without PARROT


Without PARROT, threads spend most of their Wait Time on the Mutex lock, while with PARROT they spend most of it on a Condition Variable instead of the Mutex. This is because PARROT adds


pthread_cond_wait(&cond,&mutex);............(2)


which adds threads to a wait queue, then uses


pthread_cond_broadcast(&cond); .............(3)


to wake up all the threads together. Let's try to explain the benefit. Thread A finishes its access and unlocks the shared resource. Normally, many threads would then strive for the lock, including A itself. PARROT instead lets A wait via (2), and then lets B, C and D wait in turn, so the contention is reduced. When all four threads lie in the wait queue, (3) lets them run again. The following graph describes the number of threads striving for the lock as time goes on.


So PARROT reduces contention by scheduling the threads in order. A simplified sketch of this turn-taking pattern is given below.
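The sketch below is my own simplified illustration of the turn-taking pattern just described, using the same pthread_cond_wait/pthread_cond_broadcast idiom as (2) and (3). It is not PARROT's actual implementation, and the constants are made up.

// Sketch of the turn-taking idea (not PARROT's real code): after its turn in
// the critical section, a thread parks on the condition variable instead of
// immediately re-contending; the last thread of the round broadcasts, and
// everyone starts the next round together.
#include <pthread.h>
#include <cstdio>

static const int N = 4;            // number of threads (assumed)
static const int ROUNDS = 1000;    // rounds per thread (assumed)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int finished = 0;           // threads that finished the current round
static int generation = 0;         // bumps when a round completes
static volatile long sum = 0;

void* worker(void*) {
    for (int r = 0; r < ROUNDS; ++r) {
        pthread_mutex_lock(&mutex);
        for (long i = 0; i < 10000; ++i)   // the contended critical section
            sum += i;
        int gen = generation;
        if (++finished == N) {             // last thread of this round:
            finished = 0;
            ++generation;
            pthread_cond_broadcast(&cond); // corresponds to (3): wake everyone
        } else {
            while (gen == generation)      // corresponds to (2): park instead of
                pthread_cond_wait(&cond, &mutex);   // re-contending for the lock
        }
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    pthread_t th[N];
    for (int i = 0; i < N; ++i) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < N; ++i) pthread_join(th[i], NULL);
    printf("done, sum = %ld\n", sum);
    return 0;
}

The point is only the shape of the schedule: at any moment one thread runs its turn while the others sit in the wait queue, so the lock itself sees far less scrambling. Whether this pays off depends on the critical section being "a little bit congested", as the source code comment notes.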


4 CONCLUSION

PARROT tells threads A, B, C and D: when you finish an access, wait for the others, because you are cooperators. As a result, the enemy, Syn Context Switches, gradually decreases. Although the threads have to wait, the overall performance improves.



5 REFERENCES
[1] Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth Gibson, and Randy Bryant. "Parrot: a Practical Runtime for Deterministic, Stable, and Reliable Threads". Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13).
[2] Weiming Zhou, Multi-core Computing and Programming, Wuhan: Huazhong University of Science and Technology Press, 2009 (in Chinese).