<Spark><Programming><Key/Value Pairs><RDD>
Working with Key/Value Pairs
Motivation
- Pair RDDs are a useful building block in many programs, as they expose operations that allow us to act on each key in parallel or regroup data across the network.
- E.g., pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
Creating Pair RDDs
- Many data formats we load from will directly return pair RDDs for their key/value data.
- We can also turn a regular RDD into a pair RDD by using the map() function:
val pairs = lines.map(x => (x.split(" ")(0), x))
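- Below is a minimal end-to-end sketch of the same idea, assuming an existing SparkContext named sc and a placeholder input path; it keys each line by its first word:
val lines = sc.textFile("input.txt")               // placeholder path
val pairs = lines.map(x => (x.split(" ")(0), x))   // RDD[(String, String)]
pairs.take(3).foreach(println)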
Transformations on Pair RDDs
- We can still pass functions to Spark, but since pair RDDs contain tuples, the functions we pass must operate on tuples rather than on individual elements. In fact, pair RDDs are simply RDDs of Tuple2 objects, as in the sketch below.
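- For example, a small sketch (reusing the hypothetical pairs RDD from above) that keeps only the pairs whose value is shorter than 20 characters:
val result = pairs.filter { case (key, value) => value.length < 20 }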
Aggregations
- reduceByKey() is quite similar to reduce(): both take a function and use it to combine values. They differ in the following ways (see the word-count sketch after this list):
- reduceByKey() runs the reduce operation in parallel for each key in the dataset, rather than reducing the whole RDD to a single value.
- reduceByKey() is a transformation and returns a new RDD rather than a value, because a dataset can have a very large number of keys.
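- A minimal per-key word-count sketch using reduceByKey(), assuming the lines RDD from the earlier example:
val words = lines.flatMap(x => x.split(" "))
val counts = words.map(x => (x, 1)).reduceByKey((a, b) => a + b)   // sums the 1s for each word
counts.take(5).foreach(println)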