首页 > 代码库 > Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

Hive语法层面优化之五分析执行计划追踪导致数据倾斜的原因

count(distinct key)案例

explain select count(distinct session_id) from trackinfo where ds=‘ 2013-07-21‘ ;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘ 2013-07-21‘)                  type: boolean              Filter Operator                predicate:                    expr: (ds = ‘ 2013-07-21‘)                    type: boolean                Select Operator                  expressions:                        expr: session_id                        type: string                  outputColumnNames: session_id                  Group By Operator                    aggregations:                          expr: count(DISTINCT session_id)                    bucketGroup: true                    keys:                          expr: session_id                          type: string                    mode: hash                    outputColumnNames: _col0, _col1                    Reduce Output Operator                      key expressions:                            expr: _col0                            type: string                      sort order: +                      tag: -1                      value expressions:                            expr: _col1                            type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col0:0._col0)          bucketGroup: false          mode: mergepartial          outputColumnNames: _col0          Select Operator            expressions:                  expr: _col0                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

 

 

group by案例

explain select max(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: max(session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                    sort order: +                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col1                          type: string      Reduce Operator Tree:        Group By Operator          aggregations:                expr: max(VALUE._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: string            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

 

 

count(distinct key)联合group by案例

explain select count(distinct session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: count(DISTINCT session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                        expr: session_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1, _col2                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                          expr: _col1                          type: string                    sort order: ++                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col2                          type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col1:0._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

 

max(key)联合group by案例

explain select count(session_id) from trackinfo where ds=‘2013-07-21‘ group by city_id;STAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        trackinfo           TableScan            alias: trackinfo            Filter Operator              predicate:                  expr: (ds = ‘2013-07-21‘)                  type: boolean              Select Operator                expressions:                      expr: city_id                      type: string                      expr: session_id                      type: string                outputColumnNames: city_id, session_id                Group By Operator                  aggregations:                        expr: count(session_id)                  bucketGroup: false                  keys:                        expr: city_id                        type: string                  mode: hash                  outputColumnNames: _col0, _col1                  Reduce Output Operator                    key expressions:                          expr: _col0                          type: string                    sort order: +                    Map-reduce partition columns:                          expr: _col0                          type: string                    tag: -1                    value expressions:                          expr: _col1                          type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(VALUE._col0)          bucketGroup: false          keys:                expr: KEY._col0                type: string          mode: mergepartial          outputColumnNames: _col0, _col1          Select Operator            expressions:                  expr: _col1                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

 

执行计划总结

select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ ;

分发的是:session_id

 

select max(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是:city_id

 

select count(distinct session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是:session_id和city_id

 

select count(session_id) from trackinfo where ds=‘2013-11-01‘ group by city_id;

分发的是:city_id

 

得出数据倾斜的结论:

join、group by、 count(distinct key)容易出现数据倾斜;

max、count等聚合函数并不会导致数据倾斜。

 

案例中的trackinfo建表语句

create table trackinfo (id      bigint                          ,url     string                          ,referer string                          ,keyword string                          ,type    int                             ,gu_id   string                          ,page_id string                          ,module_id       string                  ,link_id string                          ,attached_info   string                  ,session_id      string                  ,tracker_u       string                  ,tracker_type    int                     ,ip      string                          ,tracker_src     string                  ,cookie  string                          ,order_code      string                  ,track_time      string                  ,end_user_id     bigint                  ,first_link      string                  ,session_view_no int                     ,product_id      string                  ,merchant_id     bigint                  ,province_id     string                  ,city_id string                          ,fee     string                          ,edm_activity    string                  ,edm_email       string                  ,edm_jobid       string                  ,ie_version      string                  ,platform        string                  ,internal_keyword        string          ,result_sum      string                  ,currentpage     string                  ,link_position   string                  ,button_position string                  ,ext_field1      string                  ,ext_field2      string                  ,ext_field3      string                  ,ext_field4      string                  ,ext_field5      string                  ,adgroupkeywordid        string          ,ext_field6      string                  ,ext_field7      string                  ,ext_field8      string                  ,ext_field9      string                  ,ext_field10     string                  ,url_page_id     int                     ,url_page_value  string                  ,refer_page_id   int                     ,refer_page_value        string ) partitioned by(ds      string);