The business has a vehicle-trajectory dataset, carRecord.csv, as follows:
id;carNum;orgId;capTime
1;粤A321;0002;20200512 102010
2;云A321;0001;20200512 102010
3;粤A321;0001;20200512 103010
4;云A321;0002;20200512 103010
5;粤A321;0003;20200512 114010
6;京A321;0003;20200512 114011
The fields are the record id, license plate number (carNum), capture checkpoint (orgId), and capture time (capTime). We need to select the last record for each vehicle, i.e. the most recent checkpoint each car passed, filtering out all of its earlier records. This is a select-and-deduplicate problem. Below we show how to implement it with the DataFrame API and with the RDD API.
DataFrame implementation:
// Read the semicolon-separated CSV, inferring column types from the data
val carDF = spark.read
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(basePath + "/carRecord.csv")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// This import is needed to use the $-notation
import spark.implicits._
val lastPassCar = carDF.withColumn("num",
row_number().over(
Window.partitionBy($"carNum")
.orderBy($"capTime" desc)
)
).where($"num" === 1).drop($"num")
lastPassCar.explain()
lastPassCar.show()
The physical plan shows a shuffle (Exchange hashpartitioning on carNum), a per-partition sort, and then the Window operator:
== Physical Plan ==
*(3) Project [id#10, carNum#11, orgId#12, capTime#13]
+- *(3) Filter (isnotnull(num#19) && (num#19 = 1))
+- Window [row_number() windowspecdefinition(carNum#11, capTime#13 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS num#19], [carNum#11], [capTime#13 DESC NULLS LAST]
+- *(2) Sort [carNum#11 ASC NULLS FIRST, capTime#13 DESC NULLS LAST], false, 0
+- Exchange hashpartitioning(carNum#11, 200)
+- *(1) FileScan csv [id#10,carNum#11,orgId#12,capTime#13]
The result: the last checkpoint record for each vehicle. Note that orgId shows 3 rather than 0003 because inferSchema parsed the column as an integer, dropping the leading zeros.
+---+------+-----+---------------+
| id|carNum|orgId| capTime|
+---+------+-----+---------------+
| 5|粤A321| 3|20200512 114010|
| 6|京A321| 3|20200512 114011|
| 4|云A321| 2|20200512 103010|
+---+------+-----+---------------+
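If the leading zeros in orgId matter, one option is to supply an explicit schema instead of relying on inferSchema. A minimal sketch, assuming the same file layout as above (carDFTyped is a made-up name for this variant):

import org.apache.spark.sql.types._

// Declare orgId as a string so values like "0003" survive intact
val carSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("carNum", StringType, nullable = false),
  StructField("orgId", StringType, nullable = false),
  StructField("capTime", StringType, nullable = false)
))

val carDFTyped = spark.read
  .option("sep", ";")
  .option("header", "true")
  .schema(carSchema)
  .csv(basePath + "/carRecord.csv")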
RDD implementation. The idea: group the records by carNum, then take the record with the maximum capTime in each group.
import org.apache.spark.rdd.RDD

case class CarRecord(id: Int, carNum: String, orgId: Int, capTime: String)

// Convert the DataFrame rows into a typed RDD
val carRDD: RDD[CarRecord] =
  carDF.rdd.map(x =>
    CarRecord(x.getInt(0), x.getString(1), x.getInt(2), x.getString(3)))
val res = carRDD.groupBy(_.carNum).map { case (_, records) =>
  // records holds every record for one car; keep the one with the largest
  // capTime. The lexicographic maximum is correct here because capTime is
  // a fixed-width "yyyyMMdd HHmmss" string.
  records.maxBy(_.capTime)
}
println(res.toDebugString)  // inspect the RDD lineage
res.collect.foreach(println)
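One caveat: groupBy materializes every record for a car on a single executor, which can hurt when some keys have many records. A reduceByKey formulation avoids building the full group; here is a sketch of the same logic under that substitution (lastByCar is a made-up name):

// Pair each record with its carNum, then keep the later of any two records
// per key; no per-key collection is ever materialized.
val lastByCar: RDD[CarRecord] = carRDD
  .map(r => (r.carNum, r))
  .reduceByKey((a, b) => if (a.capTime >= b.capTime) a else b)
  .values

lastByCar.collect.foreach(println)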
Those are two common ways to implement select-and-deduplicate: a window function over a DataFrame, and groupBy plus maxBy over an RDD.
How would you implement the same deduplication in Hive? The idea mirrors the two methods above; try it yourself.
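As a hint, the window-function approach translates directly to SQL. A sketch via spark.sql; an equivalent ROW_NUMBER() query also runs on Hive (the view name car_record is made up for this example):

carDF.createOrReplaceTempView("car_record")

val lastPassCarSql = spark.sql(
  """SELECT id, carNum, orgId, capTime
    |FROM (
    |  SELECT *,
    |         ROW_NUMBER() OVER (PARTITION BY carNum ORDER BY capTime DESC) AS num
    |  FROM car_record
    |) t
    |WHERE num = 1""".stripMargin)

lastPassCarSql.show()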
Source: https://www.cnblogs.com/stillcoolme/p/12885197.html