首页 > 代码库 > Hive-ORC表,百亿数据分组查询测试结果

Hive-ORC表,百亿数据分组查询测试结果

说明:同样数据的两个orc表,分别模拟一个月120亿wifi数据,平均每日4亿数据。

字段:id    唯一标记

          mac  mac地址

          point  点位

          x        x坐标

          y        y坐标

          hour  数据时间-小时

          day    数据时间-日期

区别:

         表1:wifi_orc     以day作为一级分区,hour作为二级分区

         表2:wifi_orc_ex   只以day作为一级分区

 

测试sql:   选取6个不同时间段同范围的点位作为条件,进行6表之间的数据碰撞,之后对频次进行排序

SELECT * FROM
(SELECT tmp.mac, count(tmp.mac) c FROM
(SELECT t1.mac FROM
wifi_orc t1,
wifi_orc t2,
wifi_orc t3,
wifi_orc t4,
wifi_orc t5,
wifi_orc t6
WHERE
t1.day = ‘20170111‘ AND t1.hour >= ‘07‘ AND t1.hour <= ‘09‘ AND t1.point IN(1686,1199,1356,999,2500)
AND t2.day = ‘20170112‘ AND t2.hour >= ‘11‘ AND t2.hour <= ‘13‘ AND t2.point IN(1686,1199,1356,999,2500)
AND t3.day = ‘20170113‘ AND t3.hour >= ‘16‘ AND t3.hour <= ‘19‘ AND t2.point IN(1686,1199,1356,999,2500)
AND t4.day = ‘20170114‘ AND t4.hour >= ‘20‘ AND t4.hour <= ‘22‘ AND t4.point IN(1686,1199,1356,999,2500)
AND t5.day = ‘20170112‘ AND t5.hour >= ‘09‘ AND t5.hour <= ‘11‘ AND t5.point IN(1686,1199,1356,999,2500)
AND t6.day = ‘20170115‘ AND t6.hour >= ‘00‘ AND t6.hour <= ‘05‘ AND t6.point IN(1686,1199,1356,999,2500)
AND t1.mac = t2.mac AND t2.mac = t3.mac AND t3.mac = t4.mac AND t4.mac = t5.mac AND t5.mac = t6.mac
) tmp
GROUP BY tmp.mac) ret
ORDER BY ret.c DESC LIMIT 10

 

结果:

带二级分区的wifi_orc            执行时间: 2分42秒

只有一级分区的wifi_orc_ex   执行时间:4分42秒

上述执行引擎为mr方式执行,等上线spark引擎,可以进行比较.

 

Hive-ORC表,百亿数据分组查询测试结果