首页 > 代码库 > Hive架构层面优化之二合理利用中间结果集(单Job)
Hive架构层面优化之二合理利用中间结果集(单Job)
是针对单个作业,针对本job再怎么优化也不会影响到其他job;
Hadoop的负载主要有两部分:CPU负载和IO负载;
问题:机器io开销很大,但是机器的cpu开销较小,另外map输出文件也较大,怎么办?
解决办法:通过设置map的中间输出进行压缩就可以了,这个不会影响最终reduce的输出。
集群中的机器一旦选定了,那么CPU就没的改变了,所以集群的最主要的负载还是IO负载;
压缩技术虽然可以降低IO负载,但是同时也加重了CPU负载,治标不治本,CPU加重了,整体性能还是上不去;如果当前CPU负荷不高的话,可以考虑使用压缩,但是如果CPU负荷本来就很高了,那就没必要再使用压缩了;随着Job的增多,IO资源通常是最大的瓶颈(文件读写方式造成瓶颈);
充分利用中间结果集的目的:降低IO负载
90%的job按天执行,每天执行一次;5%的job按小时执行,每小时执行一次;
案例:
drop table tmp_brand_pv_dtl ;create table tmp_brand_pv_dtl as select ‘{$label}‘ tdate, a.brand_id, b.province_id, max(a.pv) pv, a.site_id from (select getBrandId(url) brand_id, session_id, getSiteId(url) site_id, min(track_time) track_time, url_page_id page_id, count(1) pv from trackinfo where ds=‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id=14 group by getBrandId(url), session_id, url_page_id, getSiteId(url)) a join (select getBrandId(url) brand_id, session_id, track_time, ext_field2 province_id, city_id, getSiteId(url) site_id from trackinfo where ds=‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id = in(14,15)) b on (a.brand_id = b.brand_id and a.session_id = b.session_id and a.site_id = b.site_id and a.track_time = b.track_time) group by a.brand_id, b.province_id,a.site_id;
分析得出:
trackinfo表抽取了两次,数据用完就丢了;
如何优化:
1、 将抽取的两次的数据存放到一个临时表中,对于trackinfo表来说只查一次;
两个select字段的字段对比:
getBrandId(url) brand_id | getBrandId(url) brand_id |
session_id | session_id |
getSiteId(url) site_id | getSiteId(url) site_id |
min(track_time) track_time | track_time |
url_page_id page_id | |
count(1) pv | |
ext_field2 province_id | |
city_id |
两个select的where对比
‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id=14 ‘{$label}‘ and (button_postion is null or length(button_postion)=0 or button_postion = ‘null‘) and url_page_id = in(14,15)
drop table tmp_brand_pv_0 ;create table tmp_brand_pv_0 asselect getBrandId(url) brand_id, session_id, getSiteId(url) site_id, track_time, url_page_id , ext_field2, city_idfrom trackInfowhere ds=‘{$label}‘ and (button_position is null or length(button_position)=0 or button_position = ‘null‘)and url_page_id in (14,15) ;
2、 在join的时候直接从临时表中获取数据即可,临时表做为两个子查询的数据源;
drop table tmp_brand_pv_dtl ;create table tmp_brand_pv_dtl asselect ‘{$label}‘ tdate,a.brand_id,b.province_id,max(a.pv) pv,a.site_id from (select brand_id,session_id,site_id,min(track_time) track_time,url_page_id page_id,count(1) pvfrom tmp_brand_pv_0 where url_page_id=14 group by brand_id,session_id,site_id,url_page_id) a joinselect brand_id, session_id, track_time, ext_field2 province_id, city_id, site_idfrom tmp_brand_pv_0where url_page_id in(14,15)) bon (a.brand_id = b.brand_id and a.session_id = b.session_id and a.site_id = b.site_id and a.track_time = b.track_time) group by a.brand_id, b.province_id,a.site_id;
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。