
Incorrect PV counts when set hive.map.aggr=true

Date: 2015-08-20 20:36:56

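The task: group a single table and compute a total count (PV) and a distinct count (UV) for each group.

To speed things up, parallel execution was enabled with set hive.exec.parallel=true (optionally set hive.exec.parallel.thread.number=16), together with set hive.groupby.skewindata=true and set hive.map.aggr=true.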
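For reference, the session settings described above can be written out as a short script at the top of the job. This is only a sketch of the configuration as the post describes it; the comments reflect the usual meaning of each property and are not taken from the original post.

```sql
-- Run independent stages of the query plan concurrently.
set hive.exec.parallel=true;
-- Optional: cap on the number of stages run in parallel.
set hive.exec.parallel.thread.number=16;
-- Add an extra job that spreads skewed group-by keys across reducers.
set hive.groupby.skewindata=true;
-- Partially aggregate on the map side before the shuffle.
set hive.map.aggr=true;
```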

select plat, pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat, pagetype
union all
select plat, 'all' pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat
union all
select 'all' plat, pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by pagetype
union all
select 'all' plat, 'all' pagetype, count(*) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19'

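The culprit turned out to be set hive.map.aggr=true, the setting that enables map-side aggregation.

With it on, the PV counts returned by the query did not match the true values.

Rewriting the query as follows, with sum(1) in place of count(*), produced correct results: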

select plat, pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat, pagetype
union all
select plat, 'all' pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by plat
union all
select 'all' plat, pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19' group by pagetype
union all
select 'all' plat, 'all' pagetype, sum(1) pv, count(distinct userkey) uv from client_pv_form where dt = '2015-08-19'
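One way to check whether the same discrepancy appears on another cluster is to compute both expressions side by side with map-side aggregation turned off, then rerun with it turned on and compare. The query below is a minimal sketch, not something from the original post: the column aliases pv_count and pv_sum are made up for illustration, and dt is assumed to be a string partition column.

```sql
-- Baseline run: map-side aggregation disabled.
set hive.map.aggr=false;

select plat,
       pagetype,
       count(*)                pv_count,  -- the expression that gave wrong PV in the post
       sum(1)                  pv_sum,    -- the workaround expression
       count(distinct userkey) uv
from client_pv_form
where dt = '2015-08-19'
group by plat, pagetype;

-- Then set hive.map.aggr=true, rerun the same select, and compare
-- pv_count and pv_sum between the two runs.
```

If pv_count and pv_sum agree in the baseline run but diverge once hive.map.aggr=true is set, that would be consistent with the behaviour described here.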

 


Source: http://www.cnblogs.com/sudz/p/4745985.html
