首页 > 代码库 > Hive基础之各种排序的区别
Hive基础之各种排序的区别
order by
1、order by会对输入做全局排序,因此只有一个reducer(多个reducer无法保证全局排序);
只有一个reducer会导致当输入规模较大时,需要较长的计算时间,速度很非常慢;
2、hive.mapred.mode(默认值是nonstrict)对order by的影响
1)当hive.mapred.mode=nonstrict时,order by和关系型数据库中的order by功能一致,按照指定的某一列或多列排序输出;
2)当hive.mapred.mode=strict时,order by必须要使用limit,否则执行会报错;;
set hive.mapred.mode=strict;select * from emp order by empno desc;
FAILED: SemanticException 1:27 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token ‘empno‘
报错原因:在order by状态下所有数据会到一台服务器进行reducer操作,也就是说只有一个reducer,如果在数据量大的情况下会出现无法输出结果的情况,如果进行limit n,那只有 n * map number条记录而已。只有一个reduce也可以处理过来。
sort by
1、可以有多个reduce任务;
2、sort by不受hive.mapred.mode是否是strict还是nostrict的影响;
3、sort by的数据只能保证在同一个reduce中的数据可以按照指定字段排序;
4、使用sort by可以指定reduce的个数:set mapred.reduce.tasks=10; 对输出的数据在进行归并排序即可得到全部结果;
5、可以用limit子句大大减少数据量。使用limit n后,传输到reduce端的数据记录数就减少到n* (map个数);
set mapred.reduce.tasks = 3;select * from emp sort by empno;
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 307698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 307782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 107788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 207839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 107844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 307499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 307521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 307566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 207876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 207900 JAMES CLERK 7698 1981-12-3 950.0 NULL 307934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 107369 SMITH CLERK 7902 1980-12-17 800.0 NULL 207902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
把上面的结果写到文件中再观察
set mapred.reduce.tasks = 3;insert overwrite local directory ‘/home/spark/data‘ select * from emp sort by empno;cd /home/spark/datals000000_0000001_0000002_0
more 000000_0 7654MARTINSALESMAN76981981-9-281250.01400.0307698BLAKEMANAGER78391981-5-12850.0\N307782CLARKMANAGER78391981-6-92450.0\N107788SCOTTANALYST75661987-4-193000.0\N207839KINGPRESIDENT\N1981-11-175000.0\N107844TURNERSALESMAN76981981-9-81500.00.030
more 000001_0 7499ALLENSALESMAN76981981-2-201600.0300.0307521WARDSALESMAN76981981-2-221250.0500.0307566JONESMANAGER78391981-4-22975.0\N207876ADAMSCLERK77881987-5-231100.0\N207900JAMESCLERK76981981-12-3950.0\N307934MILLERCLERK77821982-1-231300.0\N10
more 000002_0 7369SMITHCLERK79021980-12-17800.0\N207902FORDANALYST75661981-12-33000.0\N20
可见每个reduce内部的数据是经过排序的。
distribute by
1、按照指定的字段对数据进行划分到不同的reduce文件中(可以指定map到reduce端分发的key,这样可以充分利用hadoop资源,在多个reduce中局部按需要排序的字段进行排序);
set mapred.reduce.tasks = 3;insert overwrite local directory ‘/home/spark/data‘ select * from emp distribute by length(ename) sort by empno; cd /home/spark/datals000000_0000001_0000002_0
more 000000_0 7654MARTINSALESMAN76981981-9-281250.01400.0307844TURNERSALESMAN76981981-9-81500.00.0307934MILLERCLERK77821982-1-231300.0\N10
more 000001_0 7521WARDSALESMAN76981981-2-221250.0500.0307839KINGPRESIDENT\N1981-11-175000.0\N107902FORDANALYST75661981-12-33000.0\N20
more 000002_0 7369SMITHCLERK79021980-12-17800.0\N207499ALLENSALESMAN76981981-2-201600.0300.0307566JONESMANAGER78391981-4-22975.0\N207698BLAKEMANAGER78391981-5-12850.0\N307782CLARKMANAGER78391981-6-92450.0\N107788SCOTTANALYST75661987-4-193000.0\N207876ADAMSCLERK77881987-5-231100.0\N207900JAMESCLERK76981981-12-3950.0\N30
length是内建函数,也可以指定其他的函数或者使用UDF;
2、distribute by与sort by连用
set mapred.reduce.tasks = 3;insert overwrite local directory ‘/home/spark/data‘ select * from emp distribute by ename sort by ename; cd /home/spark/datals000000_0000001_0000002_0
more 000000_0 7698BLAKEMANAGER78391981-5-12850.0\N307839KINGPRESIDENT\N1981-11-175000.0\N10
more 000001_0 7876ADAMSCLERK77881987-5-231100.0\N207499ALLENSALESMAN76981981-2-201600.0300.0307654MARTINSALESMAN76981981-9-281250.01400.0307934MILLERCLERK77821982-1-231300.0\N107788SCOTTANALYST75661987-4-193000.0\N207844TURNERSALESMAN76981981-9-81500.00.030
more 000002_0 7782CLARKMANAGER78391981-6-92450.0\N107902FORDANALYST75661981-12-33000.0\N207900JAMESCLERK76981981-12-3950.0\N307566JONESMANAGER78391981-4-22975.0\N207369SMITHCLERK79021980-12-17800.0\N207521WARDSALESMAN76981981-2-221250.0500.030
按照ename指定到reduce,每个reduce中按照ename升序排列;
cluster by
1、cluster by除了具有distribute by的功能外还兼备sort by功能;
2、但是排序只能倒序,不能指定排序规则为asc或者desc;
3、当distribute by col1与sort by col1连用,且跟随的字段相同时,可使用cluster by简写;
select * from emp cluster by ename;
Hive排序总结
1、在hive中进行字段排序统计过程中,使用ORDER BY是全局排序,hive只能通过一个reduce进行排序,效率很低;
2、sort by实现部分排序,单个reduce输出的结果是有序的、效率高,通常与distribute by关键字一起使用,distribute by关键字可以指定map到reduce端的分发key, 这样可以充分利用hadoop资源, 在多个reduce中局部按需要排序的字段进行排序;
3、cluster by col1等同于distributed by col1与sort by col1组合。