首页 > 代码库 > Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因
Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因
count(distinct key)案例
explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘ 2013-07-21‘) type: boolean Filter Operator predicate: expr: (ds = ‘ 2013-07-21‘) type: boolean Select Operator expressions: expr: session_id type: string outputColumnNames: session_id Group By Operator aggregations: expr: count(DISTINCT session_id) bucketGroup: true keys: expr: session_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col0:0._col0) bucketGroup: false mode: mergepartial outputColumnNames: _col0 Select Operator expressions: expr: _col0 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
group by案例
explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: max(session_id) bucketGroup: false keys: expr: city_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: string Reduce Operator Tree: Group By Operator aggregations: expr: max(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: string outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
count(distinct key)联合group by案例
explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: count(DISTINCT session_id) bucketGroup: false keys: expr: city_id type: string expr: session_id type: string mode: hash outputColumnNames: _col0, _col1, _col2 Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: string sort order: ++ Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col2 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col1:0._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
max(key)联合group by案例
explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stageSTAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: trackinfo TableScan alias: trackinfo Filter Operator predicate: expr: (ds = ‘2013-07-21‘) type: boolean Select Operator expressions: expr: city_id type: string expr: session_id type: string outputColumnNames: city_id, session_id Group By Operator aggregations: expr: count(session_id) bucketGroup: false keys: expr: city_id type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(VALUE._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator expressions: expr: _col1 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
执行计划总结
select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ ;
分发的是:session_id
select max(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;
分发的是:city_id
select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;
分发的是:session_id和city_id
select count(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;
分发的是:city_id
得出数据倾斜的结论:
join、group by、 count(distinct key)容易出现数据倾斜;
max、count等聚合函数并不会导致数据倾斜。
案例中的trackinfo建表语句
create table trackinfo (id bigint ,url string ,referer string ,keyword string ,type int ,gu_id string ,page_id string ,module_id string ,link_id string ,attached_info string ,session_id string ,tracker_u string ,tracker_type int ,ip string ,tracker_src string ,cookie string ,order_code string ,track_time string ,end_user_id bigint ,first_link string ,session_view_no int ,product_id string ,merchant_id bigint ,province_id string ,city_id string ,fee string ,edm_activity string ,edm_email string ,edm_jobid string ,ie_version string ,platform string ,internal_keyword string ,result_sum string ,currentpage string ,link_position string ,button_position string ,ext_field1 string ,ext_field2 string ,ext_field3 string ,ext_field4 string ,ext_field5 string ,adgroupkeywordid string ,ext_field6 string ,ext_field7 string ,ext_field8 string ,ext_field9 string ,ext_field10 string ,url_page_id int ,url_page_value string ,refer_page_id int ,refer_page_value string ) partitioned by(ds string);
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。