Hive Syntax-Level Optimization, Part 4: Data Skew Caused by count(distinct)

2024-07-13 02:38:51

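When the column being counted has a large number of records whose value is NULL or an empty string, data skew is likely.

Approach:

For count(distinct), handle the empty values separately. If count(distinct) is the only calculation, no special handling is needed: filter the empty values out directly and add 1 to the final result.

If there are other calculations that require a group by, first process the records with empty values on their own, then union that result with the result of the other calculations.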
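The group by case described above can be sketched roughly as follows. This is a minimal illustration, not the author's query: the dt grouping column and the pv metric are hypothetical, added only so that there is a second calculation next to the distinct count. The empty branch contributes 1 to user_num only when an empty string actually occurs in the group, since count(distinct) ignores NULL but counts '' as a value.

```sql
-- Assumed for illustration: trackinfo has a dt column, and pv is a plain row count.
select dt,
       sum(pv)       as pv,
       sum(user_num) as user_num
from (
  -- Branch 1: records with a real end_user_id; the distinct count runs only here.
  select dt,
         count(1)                    as pv,
         count(distinct end_user_id) as user_num
  from trackinfo
  where end_user_id is not null and end_user_id <> ''
  group by dt

  union all

  -- Branch 2: NULL or empty records, handled separately without any distinct.
  select dt,
         count(1) as pv,
         cast(max(case when end_user_id = '' then 1 else 0 end) as bigint) as user_num
  from trackinfo
  where end_user_id is null or end_user_id = ''
  group by dt
) t
group by dt;
```

The outer sum merges the two branches back into one row per dt, so the result has the same shape as a single group by over the whole table.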

Example:

select count(distinct  end_user_id) as user_num  from trackinfo;

Rewritten as:

select cast(count(distinct end_user_id) + 1 as bigint) as user_num from trackinfo where end_user_id is not null and end_user_id <> '';

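Analysis: the empty values are filtered out and 1 is added to the total count. count(distinct) already ignores NULL, so the added 1 stands for the empty string as a single distinct value, which assumes the table contains at least one record with an empty end_user_id.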
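If it is not known in advance whether the table contains any empty strings, the constant 1 can be replaced by a check. The sketch below is one possible way to do that with the same table and column names; it is a variation on the rewrite, not part of the original.

```sql
select sum(user_num) as user_num
from (
  -- Distinct count over the non-empty values only.
  select count(distinct end_user_id) as user_num
  from trackinfo
  where end_user_id is not null and end_user_id <> ''

  union all

  -- Contributes 1 if at least one empty-string record exists, otherwise 0.
  select cast(case when count(1) > 0 then 1 else 0 end as bigint) as user_num
  from trackinfo
  where end_user_id = ''
) t;
```

The second branch is a plain count with no distinct, so the empty records are not part of the distinct calculation.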

Multi-Count Distinct

select pid, count(distinct acookie), count(distinct ip), count(distinct wangwangid) from ods_p4ppv_ad_d where dt=20140305 group by pid;

The following parameter must be set: set hive.groupby.skewindata=true;

Note: depending on the Hive version, a query with DISTINCT on several different columns may be rejected when hive.groupby.skewindata is enabled, with the error "DISTINCT on different columns not supported with skew in data", so the setting should be tested on the target cluster first.
