Real-Time Processing and Stream Processing
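Preface: As a programmer, I keep hearing new technical buzzwords: big data, cloud computing, real-time processing, stream processing, in-memory computing... But what do these fashionable terms actually mean? I happened to come across a good post, so here is a summary of the difference between real-time processing and stream processing.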
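Main text: To talk about real-time processing, we first need to mention real-time systems. A real-time system is one that can respond to requests within a strict time limit. For example, if a system can strictly guarantee that it processes a NASDAQ stock quote arriving from the network within 10 milliseconds, it counts as a real-time system, regardless of whether it achieves this in software or hardware, or by what design.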
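Simple as this sounds, such systems are hard to build in the real world, especially in software. Your process may be preempted by another process at any moment, and the CPU scheduler cannot guarantee it the time and resources it needs to respond within a strict deadline. This is why various real-time operating system kernels exist. Real-world examples that come to mind are high-end software systems such as military missile control systems and the Space Shuttle.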
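So what is real-time processing (or real-time computing)? It is similar to a real-time system, but the software industry does not seem to have a precise definition of "real-time". For example, when people talk about real-time trading, they mean that market data changes constantly and decisions are often made within milliseconds. In soft real-time, the deadline is a probability rather than an absolute. One example is Amazon's requirement that every software subsystem, for 99% of requests, either return a result or fail immediately within 100-200 milliseconds.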
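A soft real-time requirement of this kind can be checked against measured latencies. The following is only a minimal sketch: the function name `meets_soft_deadline`, the 200 ms deadline, and the sample latencies are all made up for illustration and are not taken from Amazon or from the original post.

```python
def meets_soft_deadline(latencies_ms, deadline_ms=200.0, required_fraction=0.99):
    """Return True if at least required_fraction of the requests
    finished within deadline_ms (a soft, probabilistic deadline)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    on_time = sum(1 for t in latencies_ms if t <= deadline_ms)
    return on_time / len(latencies_ms) >= required_fraction


# Hypothetical samples: 995 fast requests and 5 slow ones (99.5% on time).
mostly_fast = [50.0] * 995 + [500.0] * 5
# Hypothetical samples: 980 fast requests and 20 slow ones (98% on time).
too_many_slow = [50.0] * 980 + [500.0] * 20

print(meets_soft_deadline(mostly_fast))    # True
print(meets_soft_deadline(too_many_slow))  # False
```

A hard real-time system would correspond to `required_fraction=1.0`: a single late response counts as a failure.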
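Now for stream processing. As the name suggests, stream processing means that the system computes continuously as an endless stream of data flows through it. Stream processing therefore has no strict time limit; it may take a while from the moment data enters the system until a result comes out. The one constraint is that the system's long-term output rate must be at least equal to its long-term input rate. Otherwise data would pile up inside the system (where else would it go?), and whether it is held in memory, in flash, or on disk, the space will eventually run out. It is a snowball effect: the system gets slower and slower while the backlog keeps growing.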
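The rate constraint is easy to see in a toy simulation. This is a sketch under simplifying assumptions (constant rates, discrete ticks, an unbounded queue); `simulate_backlog` and its parameters are hypothetical names chosen for illustration.

```python
def simulate_backlog(input_rate, output_rate, steps):
    """Return the queue length after each tick, when input_rate items
    arrive per tick and at most output_rate items are processed per tick."""
    backlog = 0
    history = []
    for _ in range(steps):
        backlog += input_rate                  # new items enter the system
        backlog -= min(backlog, output_rate)   # process as many as capacity allows
        history.append(backlog)
    return history


# Output keeps up with input: the backlog stays empty.
print(simulate_backlog(input_rate=10, output_rate=12, steps=5))  # [0, 0, 0, 0, 0]
# Output is slower than input: the backlog grows by 2 every tick, without bound.
print(simulate_backlog(input_rate=10, output_rate=8, steps=5))   # [2, 4, 6, 8, 10]
```

In the second case, no amount of storage is enough in the long run; only raising the output rate or lowering the input rate fixes it.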
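So we can say that Storm is a framework for building stream processing systems. If our code can guarantee a bounded processing time at every Bolt in a Storm Topology, then we have effectively built a (soft) real-time system with Storm. Incidentally, Spark, which is primarily an in-memory computing framework, gained the Spark Streaming subproject, which splits a data stream into batches and converts them into RDDs for further computation, so it supports stream processing too (before that, Spark only took a fixed chunk of data as input).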
Original: What's the difference between real-time processing and stream processing?
“Usually, a system is called a real time system if it has tight deadlines within which a result is guaranteed. For example, you can consider your TV to be a real time processing system: given an analog or digital input, within say 1ms, a corresponding phosphor dot will light up on the screen. In the context of software systems, a system is usually called a real time system if it has responses that are guaranteed within hard "real-world" time deadlines. For example, a system that guarantees the processing of a NASDAQ stock quote coming in from the network within 10 ms would be considered a real time processing system: whether this is achieved by using a software architecture that utilizes continuous (stream) processing or one shot processing in hardware is immaterial. The fact that there is a reasonably small real-world guaranteed deadline for the processing makes it a real time system.
“In practice though real time systems are extremely hard to implement using common software systems. For example, the vanilla linux kernel isn't a real time kernel: certain operations such as process scheduling, network packet processing etc. are implemented using algorithms that don't guarantee a hard time limit. eg. If your process is preempted from CPU resources by a higher priority process, the scheduler may not give your process the CPU resources it needs to guarantee a response in the given deadline (depending on the scheduling algorithm). The same thing applies to network packets. There are, of course, flavors of the kernel available that provide real time scheduling guarantees for processes etc. (QNX [1] comes to mind) Software systems in this area usually go for a flavor of real time processing called soft real time computing where the deadline is not an absolute but a probability. For example, Amazon requires all the software subcomponents on its page to provide a result or fail within 100-200ms for 99% of all requests. This gives it a soft real time guarantee that a page will render within a given time limit.
“Stream processing on the other hand refers to a method of continuous computation that happens as data is flowing through the system. There are no compulsory time limitations in stream processing. For example, a system that simply output the count of words present in a Tweet for 99.9% of the tweets it encountered but output the complete works of Shakespeare for the remaining 0.1% of tweets is a valid stream processing system. There is no fixed time deadline on the output of the system when an input is received: the data is processed as it comes in and sometimes data might be awaiting processing. The only constraint on such a stream processing system is that its long term output rate should be faster or at least equal to the long term data input rate (otherwise the storage requirements of the system grow without bound). Additionally, it must have enough memory to store queued inputs should it be stuck while processing any item in the input stream.
“Given this context, I'm sure it's easy to figure out that Storm is a stream processing system. You can use Storm to develop a (soft) real time system if you can place guarantees on the processing duration for all inputs at every stage of the topology.”