
Hive Case Study: Detecting Orders with Abnormal Status Timing

Requirements

An order has 5 states: created, picked, shipped, received, and cancelled.

Statistics to produce: find orders where the gap between creation and picking exceeds 2 hours, between creation and shipping exceeds 4 hours, or between creation and delivery exceeds 48 hours.

 

Key Concepts

1) external tables

2) using desc formatted

3) virtual columns

4) ALTER FILEFORMAT

5) using COALESCE and unix_timestamp

6) PARQUET

 

Implementation

External Tables

The order created table:

CREATE EXTERNAL TABLE order_created (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data into the table:

load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created;
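A quick way to confirm the table really is external (a minimal check; desc formatted output abbreviated): the Table Type field should read EXTERNAL_TABLE, meaning a later DROP TABLE removes only the metastore entry, not the files on HDFS.

desc formatted order_created;
Table Type:             EXTERNAL_TABLE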

Another way to import data: point the table at a file directory with LOCATION at creation time.

1) Files already exist in the directory specified by LOCATION:

CREATE TABLE test_load1 (
    id   STRING
  , name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/student';

2) No files exist yet in the directory specified by LOCATION:

CREATE TABLE test_load2 (
    id   STRING
  , name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/load';

Upload the data file into that directory:

hadoop fs -put /home/spark/software/data/student.txt /load/
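Once the file lands in /load, the table picks it up with no explicit load step. A quick sanity check (assuming student.txt is tab-delimited to match the table definition):

select * from test_load2;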

Query the data:

select * from order_created;
10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
10103043509747  2014-05-01 07:50:12.33+01
10103043501575  2014-05-01 09:27:12.33+01
10104043514061  2014-05-01 09:03:12.324+01

 

Static Partitioned Table

CREATE TABLE order_created_partition (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Loading the partitioned table, method 1:

load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created_partition PARTITION (event_month='2014-05');

Query the data:

select * from order_created_partition where event_month='2014-05';   -- no MapReduce job is launched
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05
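To verify that the query is served by partition pruning rather than a full scan, you can inspect the plan; the exact plan shape varies by Hive version, but only the 2014-05 partition should appear as input:

explain select * from order_created_partition where event_month='2014-05';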

Loading the partitioned table, method 2:

Step 1: create the directory event_month=2014-06 under the HDFS path /user/hive/warehouse/order_created_partition:

hadoop fs -mkdir /user/hive/warehouse/order_created_partition/event_month=2014-06

Step 2: copy the data into the newly created directory:

hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_partition/event_month=2014-06

Step 3: register the new partition in the metastore:

msck repair table order_created_partition;

Execution log:

Partitions not in metastore:    order_created_partition:event_month=2014-06
Repair: Added partition to metastore order_created_partition:event_month=2014-06
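An alternative to msck repair that registers just this one partition instead of scanning the whole table directory (equivalent here, and more targeted):

ALTER TABLE order_created_partition ADD IF NOT EXISTS PARTITION (event_month='2014-06')
LOCATION '/user/hive/warehouse/order_created_partition/event_month=2014-06';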

Query the event_month=2014-06 partition:

select * from order_created_partition where event_month='2014-06';
10703007267488  2014-05-01 06:01:12.334+01      2014-06
10101043505096  2014-05-01 07:28:12.342+01      2014-06
10103043509747  2014-05-01 07:50:12.33+01       2014-06
10103043501575  2014-05-01 09:27:12.33+01       2014-06
10104043514061  2014-05-01 09:03:12.324+01      2014-06

List all partitions of the table:

show partitions order_created_partition;

Show a specific partition:

SHOW PARTITIONS order_created_partition PARTITION(event_month='2014-06');

Inspect the table's column and storage information:

desc order_created_partition;
desc extended order_created_partition;
desc formatted order_created_partition;
desc formatted order_created_partition partition(event_month='2014-05');

 

Dynamic Partitioned Table

CREATE TABLE order_created_dynamic_partition (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string);

Load data:

insert into table order_created_dynamic_partition PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month
from order_created;

This fails with:

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

Fix:

set hive.exec.dynamic.partition.mode=nonstrict;
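Depending on the Hive version, dynamic partitioning itself may also need to be switched on explicitly (recent releases default it to true, so treat this as a belt-and-braces setting):

set hive.exec.dynamic.partition=true;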

Re-run:

insert into table order_created_dynamic_partition PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month
from order_created;

Query the data:

select * from order_created_dynamic_partition;
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

 

Parquet Tables and ALTER FILEFORMAT

Create a partitioned table stored as Parquet:

CREATE TABLE order_created_dynamic_partition_parquet (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;

Inspect the table; note how the SerDe Library, InputFormat, and OutputFormat properties differ from the text-format table:

desc formatted order_created_dynamic_partition_parquet;
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

desc formatted order_created_dynamic_partition;
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Insert data:

insert into table order_created_dynamic_partition_parquet PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month
from order_created;

Query the data:

select * from order_created_dynamic_partition_parquet;
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

Now look at the file stored on HDFS:

hadoop fs -text /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0

The output looks like gibberish: Parquet is a binary format, so the file contents cannot be read with the naked eye.
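One way to confirm the format without a Parquet reader: Parquet files begin (and end) with the 4-byte magic string PAR1, which you can peek at with plain shell tools:

hadoop fs -cat /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0 | head -c 4
# prints: PAR1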

Note: the following steps copy a plain TEXTFILE into the Parquet table.

Create event_month=2014-06 under the HDFS directory /user/hive/warehouse/order_created_dynamic_partition_parquet:

hadoop fs -mkdir /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06

Copy the data into the newly created directory:

hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06

Register the new partition in the metastore:

msck repair table order_created_dynamic_partition_parquet;

Query the data:

select * from order_created_dynamic_partition_parquet;

This fails with:

Failed with exception java.io.IOException:java.lang.RuntimeException: hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [52, 43, 48, 49]

Cause: the file under event_month=2014-06 is not a Parquet file but an ordinary text file. (The expected magic bytes [80, 65, 82, 49] in the error message are ASCII for PAR1.)

Reading a genuine Parquet partition still works, for example:

select * from order_created_dynamic_partition_parquet where event_month='2014-05';

Workarounds:

For Hive 0.12 and earlier:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

Both the SERDE and the FILEFORMAT must be changed for the fix to take effect; once both are set manually, queries work normally again.
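A quick verification that the text partition is readable again (a sketch; expect the five rows loaded earlier, tagged with 2014-06):

select * from order_created_dynamic_partition_parquet where event_month='2014-06';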

Note the resulting SerDe Library (LazySimpleSerDe), InputFormat, and OutputFormat values.

Storage Information of the event_month='2014-06' partition:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat

Storage Information of the event_month='2014-05' partition:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-05');
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

Storage Information of the table as a whole:

desc formatted order_created_dynamic_partition_parquet;
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

From these observations we can conclude:

1) different partitions of order_created_dynamic_partition_parquet can have different storage formats;

2) the table-level storage format remains whatever was specified at creation time.

For Hive 0.13 and later, a single statement suffices (it updates the SerDe as well):

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

Inserting data into the Parquet table

insert into table order_created_dynamic_partition_parquet PARTITION (event_month='2014-06')
select orderNumber, event_time from order_created;

Listing the files under the partition directory shows that one partition now holds two different file formats:

hadoop fs -ls /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/000000_0            # Parquet
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt   # TEXTFILE

This happens because the original event_month='2014-06' partition was manually switched to TEXTFILE, while the rows written by insert into are Parquet (insert into writes in the storage format specified when the table was created).

select * from order_created_dynamic_partition_parquet;   -- fails again

Inspect the partition again:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

The SerDe Library is back to Parquet: the earlier manual settings were lost and the partition reverted to the storage format specified at table creation. Set them again:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

select * from order_created_dynamic_partition_parquet;

Only part of the data comes back; the rows stored in Parquet format cannot be displayed. This may be a Hive bug.
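If you need the partition to be consistent again, one pragmatic cleanup (a sketch, not part of the original walkthrough) is to delete the stray text file and restore the partition's Parquet metadata, mirroring the ALTER statements used above:

hadoop fs -rm /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;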

 

Hive Virtual Columns

INPUT__FILE__NAME: the name of the input file the row came from

BLOCK__OFFSET__INSIDE__FILE: the byte offset of the row within that file

select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488  2014-05-01 06:01:12.334+01      0
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096  2014-05-01 07:28:12.342+01      42
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747  2014-05-01 07:50:12.33+01       84
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575  2014-05-01 09:27:12.33+01       125
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061  2014-05-01 09:03:12.324+01      166

Compute which line of the file each row came from. Each line occupies length(ordernumber) + length(event_time) + 2 bytes (the +2 accounts for the tab separator and the trailing newline), so dividing BLOCK__OFFSET__INSIDE__FILE by the line length and adding 1 yields the line number. Lines whose timestamps have fewer fractional digits are shorter, which makes the division inexact for later rows, hence the round() in the second query:

select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1 from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488  2014-05-01 06:01:12.334+01      1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096  2014-05-01 07:28:12.342+01      2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747  2014-05-01 07:50:12.33+01       3.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575  2014-05-01 09:27:12.33+01       4.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061  2014-05-01 09:03:12.324+01      4.95238095238

select INPUT__FILE__NAME, ordernumber, event_time, round(BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1) from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488  2014-05-01 06:01:12.334+01      1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096  2014-05-01 07:28:12.342+01      2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747  2014-05-01 07:50:12.33+01       3
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575  2014-05-01 09:27:12.33+01       4
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061  2014-05-01 09:03:12.324+01      5

Order Picked Table

Create the table:

CREATE TABLE order_picked (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_picked.txt' overwrite into table order_picked;

Query the data:

select * from order_picked;
10703007267488  2014-05-01 07:02:12.334+01
10101043505096  2014-05-01 08:29:12.342+01
10103043509747  2014-05-01 10:55:12.33+01

Order Shipped Table

Create the table:

CREATE TABLE order_shipped (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_shipped.txt' overwrite into table order_shipped;

Query the data:

select * from order_shipped;
10703007267488  2014-05-01 10:00:12.334+01
10101043505096  2014-05-01 18:39:12.342+01

Order Received Table

Create the table:

CREATE TABLE order_received (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_received.txt' overwrite into table order_received;

Query the data:

select * from order_received;
10703007267488  2014-05-02 12:12:12.334+01

Order Cancelled Table

Create the table:

CREATE TABLE order_cancelled (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_cancelled.txt' overwrite into table order_cancelled;

Query the data:

select * from order_cancelled;
10103043501575  2014-05-01 12:12:12.334+01

Pivoting rows into columns:

Method 1: UNION ALL

CREATE TABLE order_tracking AS
SELECT orderNumber
     , max(CASE WHEN type_id="order_created"   THEN event_time ELSE 0 END) AS order_created_ts
     , max(CASE WHEN type_id="order_picked"    THEN event_time ELSE 0 END) AS order_picked_ts
     , max(CASE WHEN type_id="order_shipped"   THEN event_time ELSE 0 END) AS order_shipped_ts
     , max(CASE WHEN type_id="order_received"  THEN event_time ELSE 0 END) AS order_received_ts
     , max(CASE WHEN type_id="order_cancelled" THEN event_time ELSE 0 END) AS order_cancelled_ts
FROM (
    select orderNumber, "order_created"   as type_id, event_time FROM order_created
  UNION ALL
    select orderNumber, "order_picked"    as type_id, event_time FROM order_picked
  UNION ALL
    select orderNumber, "order_shipped"   as type_id, event_time FROM order_shipped
  UNION ALL
    select orderNumber, "order_received"  as type_id, event_time FROM order_received
  UNION ALL
    select orderNumber, "order_cancelled" as type_id, event_time FROM order_cancelled
) u
group by orderNumber;
select * from order_tracking order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01      2014-05-01 07:02:12.334+01      2014-05-01 10:00:12.334+01      2014-05-02 12:12:12.334+01      0
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01      0       0
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       0       0       0
10104043514061  2014-05-01 09:03:12.324+01      0       0       0       0
10103043501575  2014-05-01 09:27:12.33+01       0       0       0       2014-05-01 12:12:12.334+01

Method 2: JOIN

CREATE TABLE order_tracking_join AS
select t1.orderNumber
     , t1.event_time as order_created_ts
     , t2.event_time as order_picked_ts
     , t3.event_time as order_shipped_ts
     , t4.event_time as order_received_ts
     , t5.event_time as order_cancelled_ts
from (
  select ordernumber, max(event_time) as event_time from order_created group by ordernumber
) t1
left outer join (
  select ordernumber, max(event_time) as event_time from order_picked group by ordernumber
) t2
on t1.ordernumber = t2.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_shipped group by ordernumber
) t3
on t1.ordernumber = t3.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_received group by ordernumber
) t4
on t1.ordernumber = t4.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_cancelled group by ordernumber
) t5
on t1.ordernumber = t5.ordernumber;
select * from order_tracking_join order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01      2014-05-01 07:02:12.334+01      2014-05-01 10:00:12.334+01      2014-05-02 12:12:12.334+01      NULL
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01      NULL    NULL
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       NULL    NULL    NULL
10104043514061  2014-05-01 09:03:12.324+01      NULL    NULL    NULL    NULL
10103043501575  2014-05-01 09:27:12.33+01       NULL    NULL    NULL    2014-05-01 12:12:12.334+01

The final statistics:

Usage note for COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0): COALESCE returns its first non-NULL argument, so when unix_timestamp returns NULL (the timestamp is missing or unparseable), the fallback value 0 is used instead.
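A minimal illustration of the two building blocks (hypothetical values; on older Hive versions SELECT without FROM may not be supported, in which case run it against any one-row table):

select COALESCE(NULL, 0);                                                    -- returns 0
select unix_timestamp('2014-05-01 06:01:12.334', 'yyyy-MM-dd HH:mm:ss.S');   -- seconds since the epoch; NULL if the string does not parse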

Method 1: using the UNION ALL result table

select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
  from order_tracking
 WHERE order_created_ts != 0 AND order_cancelled_ts = 0
   AND (
    COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
   );
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01      0       0
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       0       0       0

Method 2: using the JOIN result table

select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
  from order_tracking_join
 WHERE order_created_ts IS NOT NULL AND order_cancelled_ts IS NULL
   AND (
    COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
   );
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01      NULL    NULL
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       NULL    NULL    NULL

 
