
Distributed Communication Optimization – IRQ affinity

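      During a C500K performance stress test, I noticed a problem: on a machine with 8 processors, the load was concentrated almost entirely on CPU0, where it exceeded 70%, and mpstat showed that CPU0 was also spending a high share of its time servicing interrupts (%irq + %soft).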
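To make the symptom concrete, here is a minimal sketch that computes a similar per-CPU figure straight from /proc/stat. The function name irq_share is made up for this example. Note that the counters in /proc/stat are cumulative since boot, so this shows the long-run share rather than the per-interval values that mpstat prints.

```shell
# Print, for every CPU, the share of time spent in hard + soft interrupt handling.
# Reads /proc/stat-formatted text on stdin.
# Fields of a cpuN line: user nice system idle iowait irq softirq steal ...
irq_share() {
    awk '/^cpu[0-9]+ / {
        total = 0
        for (i = 2; i <= 9; i++) total += $i      # user .. steal
        if (total > 0) printf "%s %.1f\n", $1, 100 * ($7 + $8) / total
    }'
}

# Run against the live counters when they are available.
if [ -r /proc/stat ]; then
    irq_share < /proc/stat
fi
```

If one CPU shows a much larger value than the others, interrupt handling is probably concentrated there, which is the situation described above.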

      This article grew out of investigating, fixing, and thinking about that problem. I hope you enjoy it, and comments and discussion are welcome.

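      Before getting to the main topic, let's look at two basic performance-related concepts: interrupts and context switches. (In practice, I have found that more than 90% of people cannot explain them clearly; I hope this article leaves you with a deeper understanding.)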


      Interrupts


        Hardware interrupts are used by devices to communicate that they require attention from the operating system. Internally, hardware interrupts are implemented using electronic alerting signals that are sent to the processor from an external device, which is either a part of the computer itself, such as a disk controller, or an external peripheral. For example, pressing a key on the keyboard or moving the mouse triggers hardware interrupts that cause the processor to read the keystroke or mouse position. Unlike the software type (described below), hardware interrupts are asynchronous and can occur in the middle of instruction execution, requiring additional care in programming. The act of initiating a hardware interrupt is referred to as an interrupt request (IRQ).


         A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed. The former is often called a trap or exception and is used for errors or events occurring during program execution that are exceptional enough that they cannot be handled within the program itself. For example, if the processor's arithmetic logic unit is commanded to divide a number by zero, this impossible demand will cause a divide-by-zero exception, perhaps causing the computer to abandon the calculation or display an error message. Software interrupt instructions function similarly to subroutine calls and are used for a variety of purposes, such as to request services from low-level system software such as device drivers. For example, computers often use software interrupt instructions to communicate with the disk controller to request data be read or written to the disk.


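        In short: a hard interrupt is a hardware device interrupting the CPU, and it is usually handled asynchronously. A soft interrupt is triggered by an instruction and handled by the kernel, and it comes in two kinds: exceptions, and instructions that behave like subroutine calls. Note that the latter should be kept distinct from system calls.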


      Context switches


        In computing, a context switch is the process of storing and restoring the state (context) of a process or thread so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU and is an essential feature of a multitasking operating system. What constitutes the context is determined by the processor and the operating system. Context switches are usually computationally intensive, and much of the design of operating systems is to optimize the use of context switches. Switching from one process to another requires a certain amount of time for doing the administration – saving and loading registers and memory maps, updating various tables and lists etc. A context switch can mean a register context switch, a task context switch, a stack frame switch, a thread context switch, or a process context switch.

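        A context switch happens in kernel mode. A system call on its own is only a kernel mode switch, not a context switch. Context switches take several forms, for example between processes, between threads, or between stack frames.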

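        So here is the question: what exactly is the quantitative relationship between interrupts and context switches? I went through a lot of English-language material and came back empty-handed. My remaining action item is to check the Linux kernel source for how cs, %soft, and %irq are computed. If you happen to know, I would be grateful for any pointers!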
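As a starting point for that investigation, the raw totals can at least be read side by side. The sketch below does not answer the question. It only pulls the cumulative context-switch count (the ctxt line) and the total interrupt count (the first number on the intr line) out of /proc/stat, as described in proc(5). The function name ctxt_and_intr is made up for this example.

```shell
# Print the cumulative interrupt and context-switch totals.
# Reads /proc/stat-formatted text on stdin.
#   intr N ...  -> N is the total number of interrupts serviced since boot
#   ctxt N      -> N is the number of context switches since boot
ctxt_and_intr() {
    awk '$1 == "intr" { printf "intr=%s\n", $2 }
         $1 == "ctxt" { printf "ctxt=%s\n", $2 }'
}

if [ -r /proc/stat ]; then
    ctxt_and_intr < /proc/stat
fi
```

Taking two samples one second apart and subtracting gives per-second rates for both counters, which can then be compared under different loads.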


        OK, now let's look at the basic techniques you need for CPU affinity tuning: RPS/RFS, irqbalance, and IRQ affinity!

      

      RPS/RFS - Receive Packet Steering/Receive Flow Steering


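        A patch set developed by Google engineers and merged into the kernel in 2.6.35. Put simply, RPS hashes the TCP or UDP packet header to choose the CPU that runs the softirq processing, and RFS additionally takes into account the CPU on which the consuming application is running. One passage in the documentation sums up the use case best (quoted below). Roughly, it says that for a single-queue NIC, or when there are fewer queues than CPU cores, this is an excellent tool as long as the chosen CPUs share the same memory domain.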

        For a single queue device, a typical RPS configuration would be to set the rps_cpus to the CPUs in the same memory domain of the interrupting CPU. If NUMA locality is not an issue, this could also be all CPUs in the system. At high interrupt rate, it might be wise to exclude the interrupting CPU from the map since that already performs much work. For a multi-queue system, if RSS is configured so that a hardware receive queue is mapped to each CPU, then RPS is probably redundant and unnecessary. If there are fewer hardware queues than CPUs, then RPS might be beneficial if the rps_cpus for each queue are the ones that share the same memory domain as the interrupting CPU for that queue.
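The quoted advice can be sketched as follows. The interface name eth0, the queue rx-0, and the choice of CPUs 1-3 are assumptions for illustration, and the helper cpus_to_mask is made up for this example. The script only prints the command it would run, so nothing is changed until you execute that command yourself as root.

```shell
# Turn a list of CPU indices into the hexadecimal bitmask that rps_cpus expects.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Hypothetical setup: eth0 has a single receive queue and CPU 0 takes the
# hardware interrupt, so CPU 0 is left out of the RPS map.
IFACE=eth0
RPS_FILE=/sys/class/net/$IFACE/queues/rx-0/rps_cpus

# Dry run: show the command instead of writing to sysfs.
echo "echo $(cpus_to_mask 1 2 3) > $RPS_FILE"
# prints: echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

RFS is configured separately, through /proc/sys/net/core/rps_sock_flow_entries and the per-queue rps_flow_cnt file. The kernel scaling document listed in the references discusses suitable values.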

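        That raises two more questions: how do you tell whether a NIC is multi-queue, and how do you make sure the CPUs share memory? Here is one approach. For the first question, you can use the command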

        lspci -vvv | grep 'Ethernet controller'
        

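        If the lspci -vvv entry for that controller shows an MSI-X capability with Enable+ and TabSize > 1, the NIC is multi-queue. For the second question, consider using lscpu to look up the CPU topology and then binding the interrupts to the cores of one specific physical CPU.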
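As a cross-check that does not depend on reading lspci output, the number of receive queues the kernel has created for an interface can be counted under sysfs. This is a minimal sketch: the function name count_rx_queues and the interface name eth0 are assumptions, and the SYSFS_NET variable exists only so the function can be pointed at a fake directory tree.

```shell
# Count the receive queues the kernel exposes for an interface.
# SYSFS_NET can be overridden to point at a different directory tree.
count_rx_queues() {
    iface=$1
    root=${SYSFS_NET:-/sys/class/net}
    n=0
    for q in "$root/$iface"/queues/rx-*; do
        if [ -d "$q" ]; then
            n=$(( n + 1 ))
        fi
    done
    echo "$n"
}

# Hypothetical interface name; prints 0 if the interface does not exist.
count_rx_queues eth0

# For the memory-locality question, the CPU topology can be listed with:
#   lscpu -p=CPU,CORE,SOCKET,NODE
```

A result greater than 1 suggests the driver has set up multiple receive queues. CPUs that report the same NODE value in the lscpu output belong to the same NUMA memory domain.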

      Irqbalance


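        The manual describes it this way: "distribute hardware interrupts across processors on a multiprocessor system". It has had quite a few problems on SMP architectures; see the Ubuntu bug tracker. Chu Ba (褚霸) has also analyzed its source code in detail, which is worth reading if you are interested.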

       SMP IRQ Affinity


        Finally, let's look at SMP IRQ Affinity, which was added in kernel 2.4:

        An interrupt request (IRQ) is a request for service, sent at the hardware level. Interrupts can be sent by either a dedicated hardware line, or across a hardware bus as an information packet (a Message Signaled Interrupt, or MSI). When interrupts are enabled, receipt of an IRQ prompts a switch to interrupt context. Kernel interrupt dispatch code retrieves the IRQ number and its associated list of registered Interrupt Service Routines (ISRs), and calls each ISR in turn. The ISR acknowledges the interrupt and ignores redundant interrupts from the same IRQ, then queues a deferred handler to finish processing the interrupt and stop the ISR from ignoring future interrupts.

       /proc/interrupts lists the IRQ number, the number of that interrupt handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt. (Refer to the proc(5) man page for further details: man 5 proc).
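Given that layout, the IRQ numbers belonging to a NIC can be extracted with a short filter. This is a sketch under the assumption that the driver registers its interrupts under a name containing the interface name (eth0 here is hypothetical). The function name irqs_for is made up for this example, and it does a plain substring match.

```shell
# Print the IRQ numbers whose /proc/interrupts line mentions the given name.
# Reads /proc/interrupts-formatted text on stdin; only numeric IRQ rows are
# considered, so summary rows such as NMI: or LOC: are skipped.
irqs_for() {
    awk -v dev="$1" '$1 ~ /^[0-9]+:$/ && index($0, dev) {
        sub(":", "", $1)
        print $1
    }'
}

if [ -r /proc/interrupts ]; then
    irqs_for eth0 < /proc/interrupts
fi
```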


       /proc/irq/IRQ_NUMBER/smp_affinity: smp_affinity describes the interrupt's CPU affinity. This property can be used to improve application performance by assigning both interrupt affinity and the application's thread affinity to one or more specific CPU cores. This allows cache line sharing between the specified interrupt and application threads.
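The value in smp_affinity is a hexadecimal bitmask in which bit N stands for CPU N. The sketch below decodes such a mask into a list of CPU indices. The function name mask_to_cpus is made up for this example, and because it relies on shell integer arithmetic it only covers masks of up to 63 bits, not the comma-separated groups used on larger machines.

```shell
# Decode an smp_affinity-style hex mask into the CPU indices it selects.
mask_to_cpus() {
    mask=$(( 0x$1 ))
    cpu=0
    list=
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            list="$list${list:+ }$cpu"     # append, space-separated
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$list"
}

mask_to_cpus ff    # all eight CPUs of an 8-processor machine
mask_to_cpus 04    # CPU 2 only
```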

       How do you verify that your IRQ affinity setting works? Follow the steps below:


       a. Find the NIC's IRQ number:

       cat /proc/interrupts

      b. Check that IRQ's CPU affinity:

       sudo cat /proc/irq/42/smp_affinity

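      c. Change the binding (the redirection has to run as root, so pipe through sudo tee rather than using sudo echo):

       echo ff | sudo tee /proc/irq/42/smp_affinity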

      d. Generate traffic to a specific site (flood ping requires root):

       sudo ping -f www.creative.com

      e. Check the result of the binding:

       grep 'CPU\|42:' /proc/interrupts
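On a multi-queue NIC there is usually one IRQ per queue, and step c has to be repeated for each of them. The sketch below prints the commands that would give each IRQ its own CPU, wrapping around when there are more IRQs than CPUs. The function name plan_affinity and the IRQ numbers are made up for this example. Because it is a dry run, nothing is written until you execute the printed commands as root.

```shell
# Print (not execute) the commands that pin each IRQ to a single CPU.
# Usage: plan_affinity NCPUS IRQ...
plan_affinity() {
    ncpus=$1
    shift
    i=0
    for irq in "$@"; do
        cpu=$(( i % ncpus ))
        # A single-CPU mask has exactly one bit set: 1 << cpu, written in hex.
        printf 'echo %x > /proc/irq/%s/smp_affinity\n' "$(( 1 << cpu ))" "$irq"
        i=$(( i + 1 ))
    done
}

# Hypothetical IRQ numbers 42-45 on an 8-CPU machine.
plan_affinity 8 42 43 44 45
```

If irqbalance is running it may later rewrite these masks, so a manual binding is only reliable once that daemon is stopped or told to leave these IRQs alone.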

Summary:

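       On my multi-queue NIC, I set the SMP IRQ Affinity values by hand and ruled out interference from the other two mechanisms, which resolved the performance problem described at the start. The quantitative relationship mentioned earlier still needs follow-up. If I find out more, I will share it promptly. I hope you enjoy it!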


References:

1. http://en.wikipedia.org/wiki/Interrupt
2. http://en.wikipedia.org/wiki/Context_switch
3. http://wenku.baidu.com/view/315d2c8571fe910ef12df838.html
4. https://www.kernel.org/doc/Documentation/networking/scaling.txt
5. http://kernelnewbies.org/Linux_2_6_35
6. https://cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt
7. http://www.linfo.org/context_switch.html
8. http://lwn.net/Articles/328339/
9. http://lwn.net/Articles/398385/
10. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/network-rps.html
11. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html



