首页 > 代码库 > Hive架构层面优化之二合理利用中间结果集(单Job)

Hive架构层面优化之二合理利用中间结果集(单Job)

是针对单个作业,针对本job再怎么优化也不会影响到其他job

 

Hadoop的负载主要有两部分:CPU负载和IO负载;

问题:机器io开销很大,但是机器的cpu开销较小,另外map输出文件也较大,怎么办?

解决办法:通过设置map的中间输出进行压缩就可以了,这个不会影响最终reduce的输出。

 

集群中的机器一旦选定了,那么CPU就没的改变了,所以集群的最主要的负载还是IO负载

压缩技术虽然可以降低IO负载,但是同时也加重了CPU负载,治标不治本,CPU加重了,整体性能还是上不去;如果当前CPU负荷不高的话,可以考虑使用压缩,但是如果CPU负荷本来就很高了,那就没必要再使用压缩了;随着Job的增多,IO资源通常是最大的瓶颈(文件读写方式造成瓶颈)

 

充分利用中间结果集的目的:降低IO负载

 

90%的job按天执行,每天执行一次;5%的job按小时执行,每小时执行一次;

 

案例:

drop table tmp_brand_pv_dtl ;create table tmp_brand_pv_dtl as          select ‘{$label}‘ tdate, a.brand_id, b.province_id, max(a.pv) pv, a.site_id            from          (select getBrandId(url) brand_id, session_id, getSiteId(url) site_id,                  min(track_time) track_time, url_page_id page_id, count(1) pv            from trackinfo           where ds=‘{$label}‘   and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘)   and url_page_id=14           group by getBrandId(url),                   session_id,                   url_page_id,                  getSiteId(url)) a           join          (select getBrandId(url) brand_id, session_id, track_time,                 ext_field2 province_id, city_id, getSiteId(url) site_id            from trackinfo     where ds=‘{$label}‘      and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘)   and url_page_id = in(14,15)) b          on (a.brand_id = b.brand_id and a.session_id = b.session_id and a.site_id = b.site_id and a.track_time = b.track_time)          group by a.brand_id,                  b.province_id,a.site_id; 

 

分析得出:

trackinfo表抽取了两次,数据用完就丢了

 

如何优化:

1、  将抽取的两次的数据存放到一个临时表中,对于trackinfo表来说只查一次;

两个select字段的字段对比:

getBrandId(url) brand_idgetBrandId(url) brand_id
session_idsession_id
getSiteId(url) site_idgetSiteId(url) site_id
min(track_time) track_timetrack_time
url_page_id page_id 
count(1) pv 
 ext_field2 province_id
 city_id

 

 

 

 

 

 

两个select的where对比

‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id=14 ‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id = in(14,15)

 

drop table tmp_brand_pv_0 ;create table tmp_brand_pv_0 asselect getBrandId(url) brand_id, session_id,  getSiteId(url) site_id,     track_time, url_page_id , ext_field2, city_idfrom trackInfowhere     ds=‘{$label}‘    and (button_position is null or length(button_position)=0 or button_position = ‘null‘)and url_page_id in (14,15)  ;

 

2、  在join的时候直接从临时表中获取数据即可,临时表做为两个子查询的数据源;

drop table tmp_brand_pv_dtl ;create table tmp_brand_pv_dtl asselect ‘{$label}‘ tdate,a.brand_id,b.province_id,max(a.pv) pv,a.site_id     from (select brand_id,session_id,site_id,min(track_time) track_time,url_page_id  page_id,count(1) pvfrom tmp_brand_pv_0 where url_page_id=14 group by brand_id,session_id,site_id,url_page_id) a joinselect brand_id, session_id, track_time, ext_field2 province_id, city_id, site_idfrom tmp_brand_pv_0where url_page_id in(14,15)) bon (a.brand_id = b.brand_id and a.session_id = b.session_id and a.site_id = b.site_id and a.track_time = b.track_time) group by a.brand_id, b.province_id,a.site_id;