reduceByKey
reduceByKey is similar to reduce, but it first groups records by key, then merges the records within each group into a single record. It returns a PairRDD, where K is the key type and V is the type of each group's merged record.
The method signature is:
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
...
}
Counting words (wordcount)
Reference: https://www.cnblogs.com/convict/p/14828084.html
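Word count is the canonical reduceByKey use case: map each word to (word, 1), then sum the counts per key. A minimal sketch along those lines (the input lines here are made up for illustration; in practice they would come from sc.textFile):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TestWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestWordCount").setMaster("local[1]")
    val sc = new SparkContext(conf)

    // Hypothetical input; in practice read from a file via sc.textFile(...)
    val lines = Array("hello spark", "hello world")
    val counts: Array[(String, Int)] = sc.parallelize(lines)
      .flatMap(_.split(" "))  // split each line into words
      .map(word => (word, 1)) // (word, 1)
      .reduceByKey(_ + _)     // sum the 1s per word
      .collect()

    counts.foreach(println)
    sc.stop()
  }
}
```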
Example: given the unit price of fruit at each store, compute the average market price of each kind of fruit.
import org.apache.spark.{SparkConf, SparkContext}

object TestReduceByKey {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("TestReduceByKey").setMaster("local[1]")
    val sc: SparkContext = new SparkContext(conf)

    val data = Array(("Apple", 5.0), ("Apple", 5.5), ("Banana", 2.0), ("Pear", 2.0))
    val result: Array[(String, Double)] = sc.parallelize(data)
      .map(v => (v._1, (v._2, 1)))                              // (name, (price, 1))
      .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))  // (name, (price sum, count))
      .map(v => (v._1, v._2._1 / v._2._2))                      // (name, average price)
      .collect()

    result.foreach(println)
    sc.stop()
  }
}
Output:
(Apple,5.25)
(Pear,2.0)
(Banana,2.0)
Explanation: the first map converts each (name, price) record into (name, (price, 1)). reduceByKey then merges the records within each key's group: the prices are added together, and the 1s are added together as a count, yielding (sum of prices, count). The final map divides the price sum by the count, producing the average price of each fruit.
Source: https://www.cnblogs.com/convict/p/14869032.html