Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

首页 > 代码库 > Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

2024-07-13 02:37:15 221人阅读

count(distinct key)案例

explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘ 2013-07-21‘)                  type: boolean              Filter Operator                predicate:                    expr: (ds = ‘ 2013-07-21‘)                    type: boolean                Select Operator                  expressions:                        expr: session_id                        type: string                  outputColumnNames: session_id                  Group By Operator                    aggregations:                          expr: count(DISTINCT session_id)                    bucketGroup: true                    keys:                          expr: session_id                          type: string                    mode: hash                    outputColumnNames: _col0, _col1                    Reduce Output Operator                      key expressions:                            expr: _col0                            type: string                      sort order: +                      tag: -1                      value expressions:                            expr: _col1                            type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col0:0._col0)          bucketGroup: false          mode: mergepartial          outputColumnNames: _col0          Select Operator            expressions:                  expr: _col0                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

group by案例

explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: max(session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                    sort order: +                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col1                          type: string      Reduce Operator Tree:        Group By Operator          aggregations:                expr: max(VALUE._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: string            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

count(distinct key)联合group by案例

explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: count(DISTINCT session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                        expr: session_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1, _col2                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                          expr: _col1                          type: string                    sort order: ++                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col2                          type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col1:0._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

max(key)联合group by案例

explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: count(session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                    sort order: +                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col1                          type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(VALUE._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

执行计划总结

select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ ;

分发的是：session_id

select max(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是：city_id

select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是：session_id和city_id

select count(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是：city_id

得出数据倾斜的结论：

join、group by、 count(distinct key)容易出现数据倾斜；

max、count等聚合函数并不会导致数据倾斜。

案例中的trackinfo建表语句

create table trackinfo (id      bigint                          ,url     string                          ,referer string                          ,keyword string                          ,type    int                             ,gu_id   string                          ,page_id string                          ,module_id       string                  ,link_id string                          ,attached_info   string                  ,session_id      string                  ,tracker_u       string                  ,tracker_type    int                     ,ip      string                          ,tracker_src     string                  ,cookie  string                          ,order_code      string                  ,track_time      string                  ,end_user_id     bigint                  ,first_link      string                  ,session_view_no int                     ,product_id      string                  ,merchant_id     bigint                  ,province_id     string                  ,city_id string                          ,fee     string                          ,edm_activity    string                  ,edm_email       string                  ,edm_jobid       string                  ,ie_version      string                  ,platform        string                  ,internal_keyword        string          ,result_sum      string                  ,currentpage     string                  ,link_position   string                  ,button_position string                  ,ext_field1      string                  ,ext_field2      string                  ,ext_field3      string                  ,ext_field4      string                  ,ext_field5      string                  ,adgroupkeywordid        string          ,ext_field6      string                  ,ext_field7      string                  ,ext_field8      string                  ,ext_field9      string                  ,ext_field10     string                  ,url_page_id     int                     ,url_page_value  string                  ,refer_page_id   int                     ,refer_page_value        string ) partitioned by(ds      string);

声明：以上内容来自用户投稿及互联网公开渠道收集整理发布，本网站不拥有所有权，未作人工编辑处理，也不承担相关法律责任，若内容有误或涉及侵权可进行投诉：投诉/举报工作人员会在5个工作日内联系你，一经查实，本站将立刻删除涉嫌侵权内容。

联系
我们

首页 > 代码库 > Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因