首页 > 代码库 > Hive基础之Hive表常用操作
Hive基础之Hive表常用操作
本案例使用的数据均来源于Oracle自带的emp和dept表
创建表
语法:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY ‘storage.handler.class.name‘ [WITH SERDEPROPERTIES (...)] (Note: Only available starting with Hive 0.6.0) ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] (Note: Only available starting with Hive 0.6.0) [AS select_statement] (Note: Only available starting with Hive 0.5.0, and not supported when creating external tables.)
create table emp(empno int,ename string,job string,mgr int,hiredate string,sal double,comm double,deptno int)row format delimited fields terminated by ‘\t‘ lines terminated by ‘\n‘ stored as textfile;
create table dept(deptno int,dname string,loc string)row format delimited fields terminated by ‘\t‘ lines terminated by ‘\n‘ stored as textfile;
注:创建表时默认列分割符是\001,行分隔符是\n
加载数据到hive表
Hive操作的数据源:文件、其他表、其他数据库
1)load:加载本地/HDFS文件到hive表
LOAD DATA [LOCAL] INPATH ‘filepath‘ [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
默认表数据存储在HDFS上的/user/hive/warehouse目录下,该目录可以在hive-site.xml中配置。
load data inpath 加载hdfs文件到hive表中;
load data local inpath 加载本地文件到hive表中;
overwrite 是否会覆盖表里已有的数据
load data local inpath ‘/home/spark/software/data/emp.txt‘ overwrite into table emp;load data local inpath ‘/home/spark/software/data/dept.txt‘ overwrite into table dept;
2)insert:导入数据到表里/从表里导出到HDFS或者本地目录
Standard syntax:INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement; Hive extension (multiple inserts):FROM from_statementINSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2][INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;FROM from_statementINSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2][INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;Standard syntax:INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0) SELECT ... FROM ... Hive extension (multiple inserts):FROM from_statementINSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
3)sqoop: 关系型数据库和HDFS文件导入/导出操作
详见sqoop章节介绍。
select操作
select * from emp;7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 207499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 307521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 307566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 207654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 307698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 307782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 107788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 207839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 107844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 307876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 207900 JAMES CLERK 7698 1981-12-3 950.0 NULL 307902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 207934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
select * from dept;10 ACCOUNTING NEW YORK20 RESEARCH DALLAS30 SALES CHICAGO40 OPERATIONS BOSTON
where使用
select * from emp where deptno =10;7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 107839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 107934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10select * from emp where deptno <>10; 7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 207499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 307521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 307566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 207654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 307698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 307788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 207844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 307876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 207900 JAMES CLERK 7698 1981-12-3 950.0 NULL 307902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20select * from emp where ename =‘SCOTT‘;7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20select ename,sal from emp where sal between 800 and 1500; SMITH 800.0WARD 1250.0MARTIN 1250.0TURNER 1500.0ADAMS 1100.0JAMES 950.0MILLER 1300.0
limit使用
select * from emp limit 4;7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 207499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 307521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 307566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
(not) in使用
select ename,sal,comm from emp where ename in (‘SMITH‘,‘KING‘);SMITH 800.0 NULLKING 5000.0 NULLselect ename,sal,comm from emp where ename not in (‘SMITH‘,‘KING‘);ALLEN 1600.0 300.0WARD 1250.0 500.0JONES 2975.0 NULLMARTIN 1250.0 1400.0BLAKE 2850.0 NULLCLARK 2450.0 NULLSCOTT 3000.0 NULLTURNER 1500.0 0.0ADAMS 1100.0 NULLJAMES 950.0 NULLFORD 3000.0 NULLMILLER 1300.0 NULL
is (not) null使用
select ename,sal,comm from emp where comm is null;SMITH 800.0 NULLJONES 2975.0 NULLBLAKE 2850.0 NULLCLARK 2450.0 NULLSCOTT 3000.0 NULLKING 5000.0 NULLADAMS 1100.0 NULLJAMES 950.0 NULLFORD 3000.0 NULLMILLER 1300.0 NULLselect ename,sal,comm from emp where comm is not null;ALLEN 1600.0 300.0WARD 1250.0 500.0MARTIN 1250.0 1400.0TURNER 1500.0 0.0
order by的使用
与关系型数据库的order by功能一致,按照某个字段或某几个字段排序输出;
与关系型数据库区别在于:当hive.mapred.mode=strict模式下,必须指定limit否则执行报错;
hive.mapred.mode默认值为nonstrict;
select * from dept;10 ACCOUNTING NEW YORK20 RESEARCH DALLAS30 SALES CHICAGO40 OPERATIONS BOSTONselect * from dept order by deptno desc;40 OPERATIONS BOSTON30 SALES CHICAGO20 RESEARCH DALLAS10 ACCOUNTING NEW YORKselect ename,sal,deptno from emp order by deptno asc,ename desc;MILLER 1300.0 10KING 5000.0 10CLARK 2450.0 10SMITH 800.0 20SCOTT 3000.0 20JONES 2975.0 20FORD 3000.0 20ADAMS 1100.0 20WARD 1250.0 30TURNER 1500.0 30MARTIN 1250.0 30JAMES 950.0 30BLAKE 2850.0 30ALLEN 1600.0 30
set hive.mapred.mode=strict;select * from emp order by empno desc;
报错:FAILED: SemanticException 1:27 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token ‘empno‘
正确写法:
select * from emp order by empno desc limit 4;7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 107902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 207900 JAMES CLERK 7698 1981-12-3 950.0 NULL 307876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
为什么会报错呢?
在order by状态下所有数据会分发到一个节点上进行reduce操作也就只有一个reduce作业,如果在数据量大的情况下会出现无法输出结果的情况,如果进行limit n,那就只有n*map数个记录而已,只有一个reduce也可以处理的过来。
select嵌套查询、别名
from(select ename, sal from emp) eselect e.ename, e.salwhere e.sal>1000;
等价于
select ename, sal from emp where sal>1000;
ALLEN 1600.0WARD 1250.0JONES 2975.0MARTIN 1250.0BLAKE 2850.0CLARK 2450.0SCOTT 3000.0KING 5000.0TURNER 1500.0ADAMS 1100.0FORD 3000.0MILLER 1300.0
组函数:max(), min(), avg(), sum(), count()等
select count(*) from emp where deptno=10;3select count(ename) from emp where deptno=10; #count某个字段,如果这个字段不为空就算一个.3select count(distinct deptno) from emp;3select sum(sal) from emp;29025.0
group by的使用
出现在select中的字段,如果没出现在组函数中,必须出现在Group by语句中
求每个部门的平均薪水:
select deptno, avg(sal) from emp group by deptno;10 2916.666666666666520 2175.030 1566.6666666666667
求每个部门中每个工作最高的薪水:
select deptno,job,max(sal) from emp group by deptno,job;10 CLERK 1300.010 MANAGER 2450.010 PRESIDENT 5000.020 ANALYST 3000.020 CLERK 1100.020 MANAGER 2975.030 CLERK 950.030 MANAGER 2850.030 SALESMAN 1600.0
having的使用
对分组结果筛选,后跟聚合函数,hive0.11版本之后才支持;where是对单条纪录进行筛选,Having是对分组结果进行筛选。
求每个部门的平均薪水大于2000的部门:
select avg(sal),deptno from emp group by deptno having avg(sal)>2000;2916.6666666666665 102175.0 20
having是hive0.11后才支持的,如果不使用having而想达到having一样的功能,语句如何写?
select deptno, e.avg_sal from (select deptno, avg(sal) as avg_sal from emp group by deptno) e where e.avg_sal > 2000;
CASE...WHEN..THEN使用
select ename, sal,casewhen sal > 1 and sal <=1000 then ‘LOWER‘when sal >1000 and sal <=2000 then ‘MIDDLE‘when sal >2000 and sal <=4000 then ‘HIGH‘ELSE ‘HIGHEST‘ endfrom emp;SMITH 800.0 LOWERALLEN 1600.0 MIDDLEWARD 1250.0 MIDDLEJONES 2975.0 HIGHMARTIN 1250.0 MIDDLEBLAKE 2850.0 HIGHCLARK 2450.0 HIGHSCOTT 3000.0 HIGHKING 5000.0 HIGHESTTURNER 1500.0 MIDDLEADAMS 1100.0 MIDDLEJAMES 950.0 LOWERFORD 3000.0 HIGHMILLER 1300.0 MIDDLE