首页 > 代码库 > Parrot源码分析之海贼王
Parrot源码分析之海贼王
我们的目的是找到speedup-example在使用Parrot加速的原因,如果只说它源于Context Switch的减少,有点简单了,它到底为什么减少了?除了Context Switch外是否还有其他的performance counter也对提速有帮助?这些都是值得去思考的问题。先来看一下我们用来探索Parrot奥秘的程序speedup-example.cpp。
前言:RRScheduler::getTurn&RRScheduler::wait_t::wait
speedup-example.cpp
- /* Copyright (c) 2013, Regents of the Columbia University
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other
- * materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
- * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
- * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
- * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
- #include <pthread.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <errno.h>
- #include <assert.h>
- //#include "tern/user.h"
- #define N 4
- //#define M 30000
- #define M 30000*10
- //int nwait = 0;
- volatile long long sum;
- volatile long long sum_2;
- long loops = 6e3;
- pthread_mutex_t mutex;
- pthread_cond_t cond;
- pthread_barrier_t bar;
- void set_affinity(int core_id) {
- cpu_set_t cpuset;
- CPU_ZERO(&cpuset);
- CPU_SET(core_id, &cpuset);
- assert(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) == 0);
- }
- void* thread_func(void *arg) {
- set_affinity((int)(long)arg);
- for (int j = 0; j < M; j++) {
- //pthread_mutex_lock(&mutex);
- //nwait++;
- for (long i = 0; i < loops; i++) // This is the key of speedup for parrot: the mutex needs to be a little bit congested.
- {
- sum += i;
- sum_2 += i;
- }
- //pthread_cond_wait(&cond, &mutex);
- //printf("being in the lock is: %lu\n", pthread_self());
- //pthread_mutex_unlock(&mutex);
- //soba_wait(0);
- //pthread_barrier_wait(&bar);
- for (long i = 0; i < loops; i++)
- {
- sum += i*i*i*i*i*i;
- }
- //fprintf(stderr, "compute thread %u %d\n", (unsigned)thread, sched_getcpu());
- }
- }
- int main(int argc, char *argv[]) {
- set_affinity(23);
- //soba_init(0, N, 20);
- pthread_t th[N];
- int ret;
- //pthread_cond_init(&cond, NULL);
- //pthread_barrier_init(&bar, NULL, N);
- for(unsigned i=0; i<N; ++i) {
- ret = pthread_create(&th[i], NULL, thread_func, (void*)i);
- assert(!ret && "pthread_create() failed!");
- }
- /*for (int j = 0; j < M; j++) {
- while (nwait < N) {
- sched_yield();
- }
- pthread_mutex_lock(&mutex);
- nwait = 0;
- //fprintf(stderr, "broadcast %u %d\n", (unsigned)pthread_self(), sched_getcpu());
- pthread_cond_broadcast(&cond);
- pthread_mutex_unlock(&mutex);
- }
- */
- for(unsigned i=0; i<N; ++i)
- pthread_join(th[i], NULL);
- // test the validity of results
- printf("sum_2 = %lld\n", sum_2);
- //printf("sum = %lld\n", sum);
- exit(0);
- }
导师曾教导我——
Which approach are you going to use in order to ensure that you will do the right experiments and wont‘ mess up with the results?
所以为了不出现实验和实验数据混乱,我这样去做:
- 只保留speedup-example.cpp一个文件,每次只对这一个文件进行修改,那如何保证修改过程的正确性,除了重复三次检查修改的地方外,借助下面的记录。
- 修改记录表:记录改动信息,包括改动文件及行号、是否保存、是否恢复和备注等。
(附链接:
http://bluecloudmatrix.github.io/jekyll_demo/PARROT%E6%BA%90%E7%A0%81%E6%94%B9%E5%8A%A8%E8%AE%B0%E5%BD%95.xlsx)
我们要通过它弄清楚Parrot的运行,下面是我们的始发地。之所以从它开始是受导师的指点——
How many times the sched_yield() in RRScheduler::wait_t::wait() is called within one execution? Have you figured out the relationship between the number of context switches and sched_yield()?
我们想这应该是最接近答案的入口,于是先尝试直接去探究它。
xtern/lib/runtime/record-scheduler.cpp
- void RRScheduler::getTurn()
- {
- int tid = self();
- assert(tid>=0 && tid < Scheduler::nthread);
- waits[tid].wait();
- dprintf("RRScheduler: %d gets turn\n", self());
- SELFCHECK;
- }
self()返回当前线程的tid,来自xtern/include/tern/runtime/scheduler.h中的struct TidMap:
- /// tern tid for current thread
- static int self() { return self_tid; }
RRScheduler::getTurn是RRScheduler::wait_t::wait的唯一入口。
getTurn的作用是让一个线程运行,而让其他线程等待,实现串行化。
实际上getTurn最早见于struct Serializer,所以这里引出一些重要类的继承关系。
接下来要弄明白是哪里调用了getTurn,以及它在RR调度中起的作用。
在34次直接调用getTurn()的共有五处:
- xtern/lib/runtime/record-runtime.cpp: line 337: SCHED_TIMER_START(共调用30次)
- xtern/lib/runtime/record-runtime.cpp: line 272: RecorderRT<_S>::idle_sleep (共调用1次)
- xtern/lib/runtime/record-scheduler.cpp: line 372: RRScheduler::wait(共调用1次)
- xtern/lib/runtime/record-runtime.cpp: line 376: RecorderRT<_S>::printStat (共调用1次)
- xtern/lib/runtime/record-runtime.cpp: line 286: RecorderRT<_S>::idle_cond_wait(共调用1次)
而调用SCHED_TIMER_START的函数有(它们都在xtern/lib/runtime/record-runtime.cpp):
- RecorderRT<_S>::threadBegin(void)
- RecorderRT<_S>::threadEnd(unsigned ins)
- RecorderRT<_S>::pthreadCreate
- RecorderRT<_S>::pthreadJoin
- RecorderRT<_S>::pthreadMutexInit
- RecorderRT<_S>::pthreadMutexDestroy
- RecorderRT<_S>::pthreadMutexLock
- RecorderRT<_S>::__pthread_rwlock_rdlock
- RecorderRT<_S>::__pthread_rwlock_wrlock
- RecorderRT<_S>::__pthread_rwlock_tryrdlock
- RecorderRT<_S>::__pthread_rwlock_trywrlock
- RecorderRT<_S>::__pthread_rwlock_unlock
- RecorderRT<_S>::__pthread_rwlock_destroy
- RecorderRT<_S>::__pthread_rwlock_init
- RecorderRT<_S>::pthreadMutexTryLock
- RecorderRT<_S>::pthreadMutexTimedLock
- RecorderRT<_S>::pthreadMutexUnlock
- RecorderRT<_S>::pthreadBarrierInit
- RecorderRT<_S>::pthreadBarrierWait
- RecorderRT<_S>::pthreadBarrierDestroy
- RecorderRT<_S>::pthreadCondWait
- RecorderRT<_S>::pthreadCondTimedWait
- RecorderRT<_S>::pthreadCondSignal
- RecorderRT<_S>::pthreadCondBroadcast
- RecorderRT<_S>::semWait
- RecorderRT<_S>::semTryWait
- RecorderRT<_S>::semTimedWait
- RecorderRT<_S>::semPost
- RecorderRT<_S>::semInit
- RecorderRT<_S>::lineupInit
- RecorderRT<_S>::lineupDestroy
- RecorderRT<_S>::lineupStart
- RecorderRT<_S>::lineupEnd
- RecorderRT<_S>::nonDetStart()
- RecorderRT<_S>::symbolic
- RecorderRT<RecordSerializer>::pthreadBarrierWait
- RecorderRT<RecordSerializer>::pthreadCondWait
- RecorderRT<RecordSerializer>::pthreadCondTimedWait
- RecorderRT<RecordSerializer>::pthreadCondSignal
- RecorderRT<RecordSerializer>::pthreadCondBroadcast
- RecorderRT<_S>::__fork
- RecorderRT<_S>::__execv
- RecorderRT<_S>::schedYield
- RecorderRT<_S>::__sleep
- RecorderRT<_S>::__usleep
- RecorderRT<_S>::__nanosleep
(B)利用tern_pthread_create创建四个子线程。
在第三回中我们来探索一下tern_pthread_create都干了些什么。
第三回:tern_pthread_create
上回说到我们遇到了第一个强敌“两种顺序”,在首回交锋中,他使出了大招“用tern_pthread_create来创建子线程”。接下来看我们如何破。
这就是大招的原理分解图。在分析RecorderRT<_S>::pthreadCreate时,发现有提示:
- /// The pthread_create wrapper solves three problems.
- ///
- /// Problem 1. We must assign a logical tern tid to the new thread while
- /// holding turn, or multiple newly created thread could get their logical
- /// tids nondeterministically. To do so, we assign a logical tid to a new
- /// thread in the thread that creates the new thread.
- ///
- /// If we were to assign this logical id in the new thread itself, we may
- /// run into nondeterministic runs, as illustrated by the following
- /// example
- ///
- /// t0 t1 t2 t3
- /// getTurn();
- /// create t2
- /// putTurn();
- /// getTurn();
- /// create t3
- /// putTurn();
- /// getTurn();
- /// get turn tid 2
- /// putTurn();
- /// getTurn();
- /// get turn tid 3
- /// putTurn();
- ///
- /// in a different run, t1 may run first and get turn tid 2.
- ///
- /// Problem 2. When a child thread is created, the child thread may run
- /// into a getTurn() before the parent thread has assigned a logical tid
- /// to the child thread. This causes getTurn to refer to self_tid, which
- /// is undefined. To solve this problem, we use @thread_begin_sem to
- /// create a thread in suspended mode, until the parent thread has
- /// assigned a logical tid for the child thread.
- ///
- /// Problem 3. We can use one semaphore to create a child thread
- /// suspended. However, when there are two pthread_create() calls, we may
- /// still run into cases where a child thread tries to get its
- /// thread-local tid but gets -1. consider
- ///
- /// t0 t1 t2 t3
- /// getTurn();
- /// create t2
- /// sem_post(&thread_begin_sem)
- /// putTurn();
- /// getTurn();
- /// create t3
- /// sem_wait(&thread_begin_sem);
- /// self_tid = TidMap[pthread_self()]
- /// sem_post(&thread_begin_sem)
- /// putTurn();
- /// getTurn();
- /// get turn tid 2
- /// putTurn();
- /// sem_wait(&thread_begin_sem);
- /// self_tid = TidMap[pthread_self()]
- /// getTurn();
- /// get turn tid 3
- /// putTurn();
- ///
- /// The crux of the problem is that multiple sem_post can pair up with
- /// multiple sem_down in different ways. We solve this problem using
- /// another semaphore, thread_begin_done_sem.
- ///
顿时剑法的一些关键招数四处浮现在眼前:pthread_create wrapper,assign a logical tid to a new thread,holding turn,getTurn...
看来是上乘剑法。我们目前不能马上理解它,先留着,以后慢慢破。
有了招数分解图和一部分剑谱还是费解,那么先来看一下我们的伙伴厨师top探得的消息:
情景一:./speedup-example(有mutex,四个子线程,M=30000*10,没使用Parrot)
情景二:./run_example_tern(有mutex,四个子线程,M=30000*10,使用Parrot)
情景三:./run_example_tern(没有mutex,四个子线程,M=30000*10,使用Parrot)
通过top,我们发现——
- 情景二和情景三相比于情景一,id列(绿色部分)为0.0%,而情景一的id列则占据了较大的百分比。
- 情景三较之于情景二,sy列(橙色部分)为0.0%,而情景二的sy列占据了近乎一半。
- 情景一和情景二的四个core(Cpu0,1,2,3)都是一起运行,而情景三是一个接着一个运行(Cpu0->Cpu1->Cpu2->Cpu3)。
通过上面的实验我们明显看出情景二和情景三明显与没有使用Parrot的情景一不同,我们做这个实验的目的是要更细致地查看子线程的创建和执行顺序以及一些指标(如id,sy)。我们想探究一下在情景二和情景三下四个子线程的创建顺序是什么样的。先设想一下:情景二是四个子线程近乎同时创建,而情景三是一开始先近乎同时创建两个子线程,此后每当一个子线程完成自己的任务结束时再创建一个新的子线程。通过下面的实验来验证此猜想。
情景四:./run_example_tern(有mutex,八个子线程,M=30000*10,使用Parrot)
情景五:./run_example_tern(没有mutex,八个子线程,M=30000*10,使用Parrot)
对于情景五的行为,有点匪夷所思,为什么一开始要近乎同时创建两个子线程,而不是一个一个的创建,当创建的子线程完成任务,再创建新的子线程?(在第四回会有初步解释)情景四的行为符合常理,与没有使用Parrot的状况是一致的(宏观上如此,实际上它与情景五是一种机制,在第四回中会有答案)。在第二回中我们提到了第一个强敌“两种顺序”,而目前我们得到的情报或许能对战胜它有帮助。其实在这众多的信息中,我们只要铭记一点(它的代号为“crocodile”)——解释有mutex时的顺序:四个子线程在近乎同时创建出来后,究竟是什么让它们保持A-B-C-D这样恒定的执行顺序循环M次。让我们再次回到源码。
第四回:让源码告诉我们crocodile的答案
当我们再次回顾tern_pthread_create这个的流程,我们发现能做到控制子线程创建和执行顺序的可能性最大的是sem_post(&thread_begin_sem)和sem_wait(&thread_begin_done_sem),在经过一番printf之后,我们探得关于信号量thread_begin_sem和thread_begin_done_sem的一些蛛丝马迹:
这让我们意识到在tern_pthread_create创建子线程的同时,通过对信号量thread_begin_sem和thread_begin_done_sem的sem_post和sem_wait操作,便有可能做到了对子线程顺序的控制。
我们发现,实际上创建的子线程要执行的函数并不是speedup_example.cpp中的thread_func,而是位于xtern/lib/runtime/helper.cpp: line 61的__tern_thread_func,thread_func成了__tern_thread_func的参数,让我们来看一下这位大神的庐山面目:
- static void *__tern_thread_func(void *arg) {
- #ifdef XTERN_PLUS_DBUG
- if (idle_th != pthread_self())
- Runtime::__detach_self_from_dbug(__FUNCTION__);
- #endif
- void **args;
- void *ret_val;
- thread_func_t user_thread_func;
- void *user_thread_arg;
- args = (void**)arg;
- user_thread_func = (thread_func_t)((intptr_t)args[0]);
- user_thread_arg = args[1];
- // free arg before calling user_thread_func as it may not return (i.e.,
- // it may call pthread_exit())
- delete[] (void**)arg;
- tern_thread_begin();
- ret_val = user_thread_func(user_thread_arg);
- tern_pthread_exit(-1, ret_val); // calls tern_thread_end() and pthread_exit()
- assert(0 && "unreachable!");
- }
参数arg实际上就是thread_func,我们可以看到在执行thread_func(即语句ret_val = user_thread_func(user_thread_arg))之前,要执行tern_thread_begin(),再来看一下tern_thread_begin(位于xtern/lib/runtime/hooks.cpp: line 70):
- void tern_thread_begin(void) {
- assert(Space::isSys() && "tern_thread_begin must start in sys space");
- int error = errno;
- // thread begins in Sys space
- Runtime::the->threadBegin();
- Space::exitSys();
- errno = error;
- assert(Space::isApp() && "tern_thread_begin must end in app space");
- }
看来还得继续,Runtime::the->threadBegin()(位于xtern/lib/runtime/record-runtime.cpp: line 374)
- template <typename _S>
- void RecorderRT<_S>::threadBegin(void) {
- pthread_t th = pthread_self();
- unsigned ins = INVALID_INSID;
- if(_S::self() != _S::MainThreadTid) {
- sem_wait(&thread_begin_sem);
- _S::self(th);
- sem_post(&thread_begin_done_sem);
- }
- assert(_S::self() != _S::InvalidTid);
- SCHED_TIMER_START;
- app_time.tv_sec = app_time.tv_nsec = 0;
- Logger::threadBegin(_S::self());
- SCHED_TIMER_END(syncfunc::tern_thread_begin, (uint64_t)th);
- }
瞧瞧我们发现了什么,sem_wait(&thread_begin_sem),绝对的宝物。还记得在pthreadCreate(位于xtern/lib/runtime/record-runtime.cpp: line 467)
- template <typename _S>
- int RecorderRT<_S>::pthreadCreate(unsigned ins, int &error, pthread_t *thread,
- pthread_attr_t *attr, void *(*thread_func)(void*), void *arg) {
- int ret;
- SCHED_TIMER_START;
- ret = __tern_pthread_create(thread, attr, thread_func, arg);
- assert(!ret && "failed sync calls are not yet supported!");
- _S::create(*thread);
- SCHED_TIMER_END(syncfunc::pthread_create, (uint64_t)*thread, (uint64_t) ret);
- sem_post(&thread_begin_sem);
- sem_wait(&thread_begin_done_sem);
- return ret;
- }
对于第一个子线程,先执行了pthreadCreate中的sem_post(&thread_begin_sem),才使得接下来执行threadBegin中的sem_wait(&thread_begin_sem)时不会阻塞,进而能继续执行下去,并能在threadBegin之后紧接着执行speedup_example,cpp中的thread_func。可问题出现了——
按理说在第一个子线程在执行threadBegin之后应该马上执行thread_func,可是我们发现了如下现象:
我猜想在第一个子线程执行threadBegin之后,有个东西触发了第二个子线程,第二个子线程的创建过程开始启动(注意只是启动,当第一个子线程执行完第一次临界区中的任务时,才让第二个子线程去执行threadBegin及临界区中的任务),当第二个子线程执行完自己的threadBegin时,该东西又触发了子线程3的启动。
- 在上面的实验之后,我们想弄清楚一点——
在speedup_example.cpp有pthread_mutex_lock(&mutex)并使用Parrot时,在第一个子线程执行第一次循环后,是什么让子线程2开始执行自己的函数__tern_thread_func。这是上面的探索留给我们的一个目前还无法解释的问题,其实截至目前的探索对我们找到speedup-example加速的原因并没有什么帮助,只不过是从源头了解一下源码的执行过程,我们的目的是要找到speedup-example加速的原因,要学会舍弃对此无用的探索,目前我们只需要知道四个子线程的创建和执行顺序,将“为什么”先放在一边,专注我们的最初目的。虽然截至目前的探索对了解加速原因没有直接帮助,但它却是第七回中从源码角度了解原因的前提准备工作。我们目前还没发回答crocodile的原因,毕竟crocodile是我们通向最终答案的关键,我们尝试在第七回中进行解释。
- 这里要再次提醒一下,我们的最初的目的是要找出speedup_example在使用Parrot后速度加快的原因。
- 本来我们可以在下面一回中接着上面,继续分析四个子线程在创建之后在执行中发生了什么,但更重要的是,要利用VTune探得能影响运行时间的除了synchronization context switches,还有哪些cpu performance counters,这是我们的关键,也是导师一再强调的,伟大航路之行一定要铭记。
第五回:来自论文的cpu performance counters
在探索其他的performance counters之前(因为我们不能乱撞),先来重读一下论文,从论文中看看Parrot究竟做了哪些工作,它的原理是什么。
其实我们无法肯定加速的原因只来自于context switches的减少,实际上根据导师google docs的实验数据,有的获得加速的程序的context switches反而增加了,而docs中列出的所有加速的程序(除了phoenix string_match-pthread)的migration都减少了,通过向Intel的技术人员请教,cpu-migration也是一项performance counters——
CPU-migrations:表示进程 t1 运行过程中发生了多少次 CPU 迁移,即被调度器从一个 CPU 转移到另外一个 CPU 上运行。
先让我们来研究一下migration吧——
导师给出的逻辑:
if performance number X go up and two factors A and B also go up, so either A or B or both could affect X. The way to verify the hypothesis is thinking about fixing A, and only let B go up, and see whether X is affected as B goes up. In the parrot task, A or B could be the synchronization context switches or some other cpu performance counters, and X could be the performance of running a program with Parrot. And in order to play with A or B, you first have to figure out which lines of code A or B come from (maybe by using vtune).
regarding the preemption context switches and sync context switches, have you figured out their differences to performance? In order to do this, you should figure out where the preemption context switches come from (for both with and without parrot), and control the number of these context switches by modifying code or any other possible ways. Similar things should happen for sync context switches, and maybe other cpu performance counters.
Now X is the execution time, and you see performance gain when running speedup-example.cpp with Parrot. Let say, in this gain, three performance counters A, B, and C are different from the ones running without Parrot.
So, either A, B, C, or some of them, or all of them may affect performance X. Right?
Your report should fix all the other two, and let only one (let say A) change, and see whether the change of A will affect the performance X. If so, then A should be the one that affect performance.
Could you take this program, some some other programs that have similar results as pbzip2, figure our whether the preemption context switches and sync context switches come from, and figure our their differences in performance impact
我们先假设Parrot能降低migration。
实际上在我们的程序speedup-example.cpp中有set_affinity(23)和set_affinity((int)(long)arg),它们会让子线程固定在一个core上,但这样并没有让speedup-example提速(这都是在没有使用Parrot的情况)。
(A)下面是没有使用Parrot,但使用了set_affinity(23)和set_affinity((int)(long)arg),speedup-example子线程使用core的情况:
(B)下图是将set_affinity(23)和set_affinity((int)(long)arg)注掉,并在没有使用Parrot的情况下,speedup-example子线程使用core的情况:
(C)我们先将set_affinity(23)和set_affinity((int)(long)arg)注掉,看看Parrot能否代替它们,让子线程固定在一个core上。下面是使用了Parrot,没有set_affinity(23)和set_affinity((int)(long)arg)的情况:
从上面的实验可以看到,Parrot会将子线程固定在一个core上。这样speeup-example在使用Parrot后加速,context switches减少,cpu-migration减少。但A和B告诉我们,即使固定住,也不会起到加速的作用,这样来看的话,虽然Parrot能减少cpu-migration,但speedup-example的加速应该不是得益于migration的减少,看起来似乎没有必要去研究Parrot下的migration。元芳你怎么看?元芳:导师的google docs有记录显示,有的程序被加速了,但context switches却增多了,反而migration却减少了,context switches增多一般情况下会增加运行时间,要想解释提速的原因,看来只能从migration入手了,大人请看实验数据。
Parrot还有这个功能,能降低migration?而且这有可能有益于提速。先来解答这样一个问题:migration究竟会不会增加运行时间?
第六回:迷茫的migration
从Parrot论文,《Identifying OS thread migration using Intel? VTune? Amplifier XE》,Parrot google docs, VTune和Parrot源码入手。
“This poses a problem to the newer computing architectures as this SW thread migration disassociates the thread from data that has already been fetched into the caches resulting in longer data access latencies.”这句话告诉我们,当一个程序很依赖cache时(即子线程经常从cache读写数据),如果发生子线程频繁得在core间迁移,那么子线程就不得不从memory中重新读取数据装载cache,这有可能导致程序运行时间增加。而speedup-example不属于cache依赖型,它的四个子线程所做的只是对sum进行加和乘,将它放在寄存器中即可。元芳提到的四条数据,stable_sort,mismatch,equal和linear regression都是cache依赖的。这或许就解答了为什么它们在使用Parrot后Context Switch增加反而提速了,因为migration减少了。这也只是初步的结论。
我们需要通过实验来验证。首先要弄明白五个问题——
- 问题6.1:migration如何测得具体数值(VTune似乎不能测migration的具体数值,而又无法用perf)?
- 问题6.2:在探究migration是否会影响到performance时,假设我们可以让使用Parrot后的Context Switch大致与没有使用Parrot时的相当(即让Context Switch不变,这或许要修改Parrot的源码),但如何同时让Parrot保持减少migration的作用,同理,在探究Context Switch是否会影响performance时,又如何让migration不变,让Context Switch变?这都需要我们知道到底Parrot的哪部分做到了减少migration,哪部分做到了减少Context Switch。
- 问题6.3:找到在使用Parrot后migration和Context Switch都减少,且减少migration对performance有提升(不是speedup-example那样的)的程序。
- 问题6.4:解释下面为什么migration和Context Switch都减少了,而且幅度很大,但运行时间反而增加了?
- 问题6.5:假使我们探究出了migration对performance的影响,我们仍然不能肯定提速只与Context Switch和migration有关,会不会还有其他原因(如下面附中的Preemption Context Switches),似乎没有尽头?
第七回:重回源码之路tern_pthread_mutex_lock——最终的答案
至此我们发现,不论是要解决migration的五个问题,还是进一步分析speedup-example的加速原因,还是透彻了解Parrot源码的执行流程,都需要我们重回源码进行分析,所以在历经上面几回后我们需要进行回归。其实第一、二、三、四回都在分析Parrot源码的执行流程,在第四回的结束时为第五回讲migration开了个头。让我们接着第四回继续分析speedup-example在执行临界区时Parrot都做了些什么。
先从这入手:当第一个子线程创建后,开始执行pthread_mutex_lock时,第二个线程开始创建,那我们就来分析一下第一个线程在执行pthread_mutex_lock时,是否会对第二个子线程的创建有影响。这样假设的原因是在没有pthread_mutex_lock时,speedup-example的执行流程是当第一个子线程将自己所有的任务都完成时,第二个子线程才开始自己的任务。请看第三回中的情景五,一开始第一和第二子线程都创建了,但只有第一子线程在执行自己的任务,而且在第一子线程结束任务后立即创建了第三子线程,随后一段时间第二子线程完成自己的任务。这都让我们猜想有什么东西在背后控制子线程的创建和执行顺序。就像第四回我们猜想在四个子线程的创建过程中似乎也被规定了顺序,而不是简单的for顺序。当然这只是猜测,接下来通过进一步的源码分析来
- 问题7.1:确定之前的猜测——四个子线程在创建时是否通过某些机制控制了创建顺序?
- 问题7.2:探究出四个子线程在执行任务期间是如何维持顺序的?
- 问题7.3:维持顺序是在减少竞争吗?
- 问题7.4:减少了竞争会提高性能吗?
- 问题7.5:那个高效的类似spin-lock的lock在哪里?
来看一下tern_pthread_mutex_lock——
经过一番打探,我们连分析带猜测,认为对于没有获得锁的三个子线程都通过
- while (!wakenUp) {
- sched_yield();
- }
的方式在等待,获得锁的子线程在释放锁后要再次访问pthread_mutex_lock时会做三件事:
- 通过waitq.push_back(tid)将自己放入等待队列的尾端
- 通过next函数中的waits[next_tid].post()唤醒下一个应该获得锁的线程(位于runq头部的线程)
- 通过getTurn函数中的RRScheduler::wait_t::wait()使自己进入等待状态
综上我们先尝试得出以下结论:
目前来看speedup-example加速的原因之一是Context Switch的大幅度减少和锁竞争降低,下面是详细原因——
- 当A线程要去得到锁,而锁正在被线程B占据着,如果没有使用Parrot,A会将状态转成idle,这样当A得到机会时,就不得不花点时间进行上下文切换将状态由idle转成运行,但使用Parrot就不一样了,A会一致保持运行状态(类似于spin-lock,但却不同于pthread_spin_lock,使用pthread_spin_lock时,top命令下us列会达到100%,而使用Parrot,100%近乎由us列和sy列平分,这是因为Parrot加上了sched_yield(),这是个系统调用),轮到A时,立马获得锁,这样便减少了上下文切换的花销,但如果是这样,按理说不用Parrot,只用pthread_spin_lock也能起到同样幅度的加速效果,实际上pthread_spin_lock提高性能的前提是锁竞争不激烈【1】,在speedup-example中临界区较大,锁竞争激烈,用pthread_spin_lock反而会增加运行时间,而且使用pthread_spin_lock时的Context Switch已经降得和使用Parrot一样少,这就说明不是Context Switch减少了就能提速,一定要满足锁竞争不激烈这个前提,所以Parrot提速的原因绝不仅如此,定有能降低锁竞争的前提。
- 通过上面不太全面的源码探索,这个降低锁竞争的前提就隐藏在四个子线程那固定的执行顺序,我们猜想四个子线程不仅在创建时通过信号量保持住固定顺序,而且在执行任务时,也可能是通过tid,队列和信号量保持住类似B->A->C->D的顺序,进一步说,当一个子线程执行完临界区释放锁后,这个子线程不是再去要锁(如果是再去要锁,这就和其它被唤醒子线程发生了锁竞争),而是唤醒下个一该运行的子线程,之后让自己等待,能做到这样有秩序得益于前面提到的它们通过一些机制保持住B->A->C->D这样的顺序,这样在任意时刻都只有一个子线程尝试获取锁,这就变成了非竞争锁【2】,锁竞争减少。这就是Parrot论文中所说的——PARROT’s round-robin scheduling reduced contention【3】。
除此之外,我们在第六回中的五个问题也是我们最应该去啃的地方(speedup-example或许还与其他类似migration的performance counter有关,只让A变,其它不变,来探究speedup-example的性能是否与A有关),接下来就去做这些。
关于migration对speedup-exmaple的影响,在第五回中我们已经给出了答案——
“speedup-example.cpp中有set_affinity(23)和set_affinity((int)(long)arg),它们会让子线程固定在一个core上,但这样并没有让speedup-example提速(这都是在没有使用Parrot的情况)”
如果migration对speedup-example有影响,为什么在用set_affinity后(此时Context没有变,但migration减少),speedup-example没有性能上的变化,这说明起码对于speedup-example来说migration并没有对提速有所影响。至于是否会有其它的performance counter,在我们分析源码的执行流程后,我们可以发现Parrot的贡献在于让四个子线程从创建到执行变得有秩序,以及摆脱了pthread_mutex_lock要产生大量的上下文切换的弊端,所以从这个角度来说,Parrot应该不会再去影响其它的performance counter。
现在来解答一下对于一些程序为什么在使用Parrot后Context Switch大幅度增加,而运行时间反而基本不变的情况,同时解答一下第六回中的问题6.5。为此我们找到了stl中的find_first_of_found,下面是使用Parrot和没有使用Parrot的对比:
- 不使用Parrot:
- 使用Parrot:
我们可以发现,二者的运行时间基本相同,但使用Parrot后,synchronization context switch基本不变,而preemption context switch大幅度增加。虽然我们无法准确度量migration,但从Core/Thread/Function/Call Stack大致可以看到二者migration基本持平(另外,从google doc【4】上我们也可以看到二者的migration基本一致),加上synchronization context switch基本持平,这时虽然preemption context switch大幅度增加,但性能基本未发生变化,在不考虑更多未知performance counter影响时,可以认为preemption context switch并不会影响到性能。
在对speedup-example中Preemption Context Switch的来源探究中(附:speedup-example中Preemption Context Switches的来源探究),我们发现preemption context switch是运行时间的附属品,而不是左右者,意思是preemption context switch随着运行时间变长而增多,而不是运行时间随着preemption context switch的增多而变长。虽然这并不能用于解释上面的find_first_of_found,但从另一个角度我们可以认为Preemption Context Switch并不会影响到性能。
引用:
【1】Pthreads并行编程之spin lock与mutex性能对比分析
http://www.parallellabs.com/2010/01/31/pthreads-programming-spin-lock-vs-mutex-performance-analysis/
【2】如何聪明地使用锁
http://www.ibm.com/developerworks/cn/java/j-lo-lock/
【3】Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth Gibson, and Randy Bryant. "Parrot: a Practical Runtime for Deterministic, Stable, and Reliable Threads". Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP ’13)
http://www.cs.columbia.edu/~junfeng/papers/parrot-sosp13.pdf
【4】Google docs——xtern performance evaluation
https://docs.google.com/spreadsheet/ccc?key=0AmNhZTitrIuAdG1YZzJQbVpjMWVPeWtVY2w5WHpOLXc#gid=13
附录:
speedup-example.cpp(用pthread_spin_lock代替pthread_mutex_lock)
- /* Copyright (c) 2013, Regents of the Columbia University
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other
- * materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
- * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
- * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
- * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
- #include <pthread.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <errno.h>
- #include <assert.h>
- //#include "tern/user.h"
- #define N 4
- #define M 30000
- //#define M 30000*10
- //int nwait = 0;
- volatile long long sum;
- volatile long long sum_2;
- long loops = 6e3;
- //pthread_mutex_t mutex;
- pthread_spinlock_t mutex;
- //pthread_cond_t cond;
- //pthread_barrier_t bar;
- void set_affinity(int core_id) {
- cpu_set_t cpuset;
- CPU_ZERO(&cpuset);
- CPU_SET(core_id, &cpuset);
- assert(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) == 0);
- }
- void* thread_func(void *arg) {
- set_affinity((int)(long)arg);
- for (int j = 0; j < M; j++) {
- //pthread_mutex_lock(&mutex);
- pthread_spin_lock(&mutex);
- //nwait++;
- for (long i = 0; i < loops; i++) // This is the key of speedup for parrot: the mutex needs to be a little bit congested.
- {
- sum += i;
- sum_2 += i;
- }
- //pthread_cond_wait(&cond, &mutex);
- //printf("being in the lock is: %lu\n", pthread_self());
- //pthread_mutex_unlock(&mutex);
- pthread_spin_unlock(&mutex);
- //soba_wait(0);
- //pthread_barrier_wait(&bar);
- for (long i = 0; i < loops; i++)
- {
- sum += i*i*i*i*i*i;
- }
- //fprintf(stderr, "compute thread %u %d\n", (unsigned)thread, sched_getcpu());
- }
- }
- int main(int argc, char *argv[]) {
- set_affinity(23);
- //soba_init(0, N, 20);
- pthread_t th[N];
- int ret;
- pthread_spin_init(&mutex, 0);
- //pthread_cond_init(&cond, NULL);
- //pthread_barrier_init(&bar, NULL, N);
- for(unsigned i=0; i<N; ++i) {
- ret = pthread_create(&th[i], NULL, thread_func, (void*)i);
- assert(!ret && "pthread_create() failed!");
- }
- /*for (int j = 0; j < M; j++) {
- while (nwait < N) {
- sched_yield();
- }
- pthread_mutex_lock(&mutex);
- nwait = 0;
- //fprintf(stderr, "broadcast %u %d\n", (unsigned)pthread_self(), sched_getcpu());
- pthread_cond_broadcast(&cond);
- pthread_mutex_unlock(&mutex);
- }
- */
- for(unsigned i=0; i<N; ++i)
- pthread_join(th[i], NULL);
- // test the validity of results
- printf("sum_2 = %lld\n", sum_2);
- //printf("sum = %lld\n", sum);
- pthread_spin_destroy(&mutex);
- exit(0);
- }
Parrot源码分析之海贼王