首页 > 代码库 > Hive综合案例分析之简易推荐系统

Hive综合案例分析之简易推荐系统

知识点:

1、Hive复合数据类型map与Lateral View的使用;

  map、str_to_map、map_keys、map_values,map与lateral view

2、通过translate进行简单数据保护;

  Hive转换函数进行数据保护,确保企业应用信息安全

3、Hive的窗口和分析函数入门;

  row_number、rank、dense_rank

 

创建订单表:

CREATE EXTERNAL TABLE f_orders (    user_id   STRING  , ts        STRING  , order_id  STRING  , items     map<STRING,BIGINT>)ROW FORMAT DELIMITEDFIELDS TERMINATED BY \tCOLLECTION ITEMS TERMINATED BY |MAP KEYS TERMINATED BY :;

 加载数据:

load data local inpath /home/spark/software/data/f_orders.txt overwrite into table f_orders;

 查询数据:

select * from f_orders;11      2014-05-01 06:01:12.334+01      10703007267488  {"item8":2,"item1":1}22      2014-05-01 07:28:12.342+01      10101043505096  {"item6":3,"item3":2}33      2014-05-01 07:50:12.33+01       10103043509747  {"item7":7}11      2014-05-01 09:27:12.33+01       10103043501575  {"item5":5,"item1":1,"item4":1,"item9":1}22      2014-05-01 09:03:12.324+01      10104043514061  {"item1":3}33      2014-05-02 19:10:12.343+01      11003002067594  {"item4":2,"item1":1}11      2014-05-02 09:07:12.344+01      10101043497459  {"item9":1}35      2014-05-03 11:07:12.339+01      10203019269975  {"item5":1,"item1":1}789     2014-05-03 12:59:12.743+01      10401003346256  {"item7":3,"item8":2,"item9":1}77      2014-05-03 18:04:12.355+01      10203019262235  {"item5":2,"item1":1}99      2014-05-04 00:36:39.713+01      10103044681799  {"item9":3,"item1":1}33      2014-05-04 19:10:12.343+01      12345678901234  {"item5":1,"item1":1}11      2014-05-05 09:07:12.344+01      12345678901235  {"item6":1,"item1":1}35      2014-05-05 11:07:12.339+01      12345678901236  {"item5":2,"item1":1}22      2014-05-05 12:59:12.743+01      12345678901237  {"item9":3,"item1":1}77      2014-05-05 18:04:12.355+01      12345678901238  {"item8":3,"item1":1}99      2014-05-05 20:36:39.713+01      12345678901239  {"item9":3,"item1":1}

从map中取值:map_keys, map_values

select map_keys(items), map_values(items) from f_orders where user_id = 35;["item5","item1"]       [1,1]["item5","item1"]       [2,1]

查询包含订单条目中有item8的订单

select * from f_orders where array_contains(map_keys(items), item8);11      2014-05-01 06:01:12.334+01      10703007267488  {"item1":1,"item8":2}789     2014-05-03 12:59:12.743+01      10401003346256  {"item7":3,"item8":2,"item9":1}77      2014-05-05 18:04:12.355+01      12345678901238  {"item1":1,"item8":3}

将f_orders中items列打开成横向视图

select user_id, order_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount;11      10703007267488  item8   211      10703007267488  item1   122      10101043505096  item6   322      10101043505096  item3   233      10103043509747  item7   711      10103043501575  item5   511      10103043501575  item1   111      10103043501575  item4   111      10103043501575  item9   122      10104043514061  item1   333      11003002067594  item4   233      11003002067594  item1   111      10101043497459  item9   135      10203019269975  item5   135      10203019269975  item1   1789     10401003346256  item7   3789     10401003346256  item8   2789     10401003346256  item9   177      10203019262235  item5   277      10203019262235  item1   199      10103044681799  item9   399      10103044681799  item1   133      12345678901234  item5   133      12345678901234  item1   111      12345678901235  item6   111      12345678901235  item1   135      12345678901236  item5   235      12345678901236  item1   122      12345678901237  item9   322      12345678901237  item1   177      12345678901238  item8   377      12345678901238  item1   199      12345678901239  item9   399      12345678901239  item1   1

创建订单条目表:

CREATE EXTERNAL TABLE d_items (  item_sku  STRING,  price     DOUBLE,  catalogs  array<STRING>)ROW FORMAT DELIMITEDFIELDS TERMINATED BY \tCOLLECTION ITEMS TERMINATED BY |;

加载数据:

load data local inpath /home/spark/software/data/d_items.txt overwrite into table d_items;

查询数据:

select * from d_items;item1   100.2   ["catalogA","catalogD","catalogX"]item2   200.3   ["catalogA"]item3   300.4   ["catalogA","catalogX"]item4   400.5   ["catalogB"]item5   500.6   ["catalogB","catalogX"]item6   600.7   ["catalogB"]item7   700.8   ["catalogC"]item8   800.9   ["catalogC","catalogD"]item9   899.99  ["catalogC","catalogA"]

求每个人的每个订单的金额

select orders.user_id, orders.order_id, round(sum(d.price*orders.amount), 2) as order_pricefrom (  select user_id, order_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin d_items don (orders.item = d.item_sku)group by orders.user_id, orders.order_id;11      10101043497459  899.9911      10103043501575  3903.6911      10703007267488  1702.011      12345678901235  700.922      10101043505096  2402.922      10104043514061  300.622      12345678901237  2800.1733      10103043509747  4905.633      11003002067594  901.233      12345678901234  600.835      10203019269975  600.835      12345678901236  1101.477      10203019262235  1101.477      12345678901238  2502.9789     10401003346256  4604.1999      10103044681799  2800.1799      12345678901239  2800.17

求人和订单条目以及订单条目数量对应关系的数量

select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount;11      item8   211      item1   122      item6   322      item3   233      item7   711      item5   511      item1   111      item4   111      item9   122      item1   333      item4   233      item1   111      item9   135      item5   135      item1   1789     item7   3789     item8   2789     item9   177      item5   277      item1   199      item9   399      item1   133      item5   133      item1   111      item6   111      item1   135      item5   235      item1   122      item9   322      item1   177      item8   377      item1   199      item9   399      item1   1

订单条目与类别(类别打散后)的关系

select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog;item1   catalogAitem1   catalogDitem1   catalogXitem2   catalogAitem3   catalogAitem3   catalogXitem4   catalogBitem5   catalogBitem5   catalogXitem6   catalogBitem7   catalogCitem8   catalogCitem8   catalogDitem9   catalogCitem9   catalogA

人和订单条目和订单条目数量以及与类别(类别打散后)的关系

select orders.user_id, orders.item, orders.amount, catalogs.catalogfrom (  select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin (  select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog) catalogson (orders.item = catalogs.item_sku);11      item8   2       catalogC11      item8   2       catalogD11      item1   1       catalogA11      item1   1       catalogD11      item1   1       catalogX22      item6   3       catalogB22      item3   2       catalogA22      item3   2       catalogX33      item7   7       catalogC11      item5   5       catalogB11      item5   5       catalogX11      item1   1       catalogA11      item1   1       catalogD11      item1   1       catalogX11      item4   1       catalogB11      item9   1       catalogC11      item9   1       catalogA22      item1   3       catalogA22      item1   3       catalogD22      item1   3       catalogX33      item4   2       catalogB33      item1   1       catalogA33      item1   1       catalogD33      item1   1       catalogX11      item9   1       catalogC11      item9   1       catalogA35      item5   1       catalogB35      item5   1       catalogX35      item1   1       catalogA35      item1   1       catalogD35      item1   1       catalogX789     item7   3       catalogC789     item8   2       catalogC789     item8   2       catalogD789     item9   1       catalogC789     item9   1       catalogA77      item5   2       catalogB77      item5   2       catalogX77      item1   1       catalogA77      item1   1       catalogD77      item1   1       catalogX99      item9   3       catalogC99      item9   3       catalogA99      item1   1       catalogA99      item1   1       catalogD99      item1   1       catalogX33      item5   1       catalogB33      item5   1       catalogX33      item1   1       catalogA33      item1   1       catalogD33      item1   1       catalogX11      item6   1       catalogB11      item1   1       catalogA11      item1   1       catalogD11      item1   1       catalogX35      item5   2       catalogB35      item5   2       catalogX35      item1   1       catalogA35      item1   1       catalogD35      item1   1       catalogX22      item9   3       catalogC22      item9   3       catalogA22      item1   1       catalogA22      item1   1       catalogD22      item1   1       catalogX77      item8   3       catalogC77      item8   3       catalogD77      item1   1       catalogA77      item1   1       catalogD77      item1   1       catalogX99      item9   3       catalogC99      item9   3       catalogA99      item1   1       catalogA99      item1   1       catalogD99      item1   1       catalogX

将结果写到usr_cat_weight表中

create table usr_cat_weight asselect orders.user_id, catalogs.catalog, sum(orders.amount) as weightfrom (  select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin (  select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog) catalogson (orders.item = catalogs.item_sku)group by orders.user_id, catalogs.catalogorder by user_id, weight desc;
select * from usr_cat_weight;11      catalogX        811      catalogB        711      catalogD        511      catalogA        511      catalogC        422      catalogA        922      catalogX        622      catalogD        422      catalogB        322      catalogC        333      catalogC        733      catalogX        333      catalogB        333      catalogA        233      catalogD        235      catalogX        535      catalogB        335      catalogA        235      catalogD        277      catalogD        577      catalogX        477      catalogC        377      catalogA        277      catalogB        2789     catalogC        6789     catalogD        2789     catalogA        199      catalogA        899      catalogC        699      catalogD        299      catalogX        2

row_number: 行号

select user_id, catalog, weight, row_number() OVER (PARTITION BY user_id ORDER BY weight DESC) as row_num FROM usr_cat_weight where user_id < 33;11      catalogX        8       111      catalogB        7       211      catalogA        5       311      catalogD        5       411      catalogC        4       522      catalogA        9       122      catalogX        6       222      catalogD        4       322      catalogC        3       422      catalogB        3       5

rank: 相同的值排名是相同的,排名值会跳过重复排名的

select user_id, catalog, weight, rank() OVER (PARTITION BY user_id ORDER BY weight DESC) as rnk FROM usr_cat_weight where user_id < 33;11      catalogX        8       111      catalogB        7       211      catalogA        5       311      catalogD        5       311      catalogC        4       522      catalogA        9       122      catalogX        6       222      catalogD        4       322      catalogC        3       422      catalogB        3       4

dense_rank:排名值不会跳过重复排名的

select user_id, catalog, weight, dense_rank() OVER (PARTITION BY user_id ORDER BY weight DESC) as drnk FROM usr_cat_weight where user_id < 33;11      catalogX        8       111      catalogB        7       211      catalogA        5       311      catalogD        5       311      catalogC        4       422      catalogA        9       122      catalogX        6       222      catalogD        4       322      catalogC        3       422      catalogB        3       4
CREATE TABLE usr_cat ASselect user_id, catalog, row_number() OVER (PARTITION BY user_id ORDER BY weight DESC) as row_numFROM (select orders.user_id, catalogs.catalog, sum(orders.amount) as weightfrom (  select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin (  select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog) catalogson (orders.item = catalogs.item_sku)group by orders.user_id, catalogs.catalogorder by user_id, weight) xORDER BY user_id, row_num;
select * from usr_cat;11      catalogX        111      catalogB        211      catalogA        311      catalogD        411      catalogC        522      catalogA        122      catalogX        222      catalogD        322      catalogC        422      catalogB        533      catalogC        133      catalogB        233      catalogX        333      catalogD        433      catalogA        535      catalogX        135      catalogB        235      catalogA        335      catalogD        477      catalogD        177      catalogX        277      catalogC        377      catalogA        477      catalogB        5789     catalogC        1789     catalogD        2789     catalogA        399      catalogA        199      catalogC        299      catalogD        399      catalogX        4

创建用户表:

CREATE EXTERNAL TABLE d_users (    user_id  STRING  , gender   STRING  , birthday STRING  , email    STRING  , regday   STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY \073;

加载数据:

load data local inpath /home/spark/software/data/d_users.txt overwrite into table d_users;

查询:

select * from d_users;11      m       1981-01-01      张三@gmail.com        2014-04-2122      w       1982-01-01      user22@abcn.net 2014-04-2233      m       1983-01-01      user33@fxlive.de        2014-04-2377      w       1977-01-01      user77@fxlive.fr        2014-05-0188      m       1988-01-01      user88@fxlive.eu        2014-05-0299      w       1999-01-01      user99@abcn.net 2014-05-03789     m       2008-01-01      admin@abcn.net  2014-05-03

Hive转换函数translate进行简单数据保护

select user_id, birthday, translate(birthday, 0123456789, 1234567890), email, translate(email, userfxgmail1234567890, 1234567890userfxgmail) from d_users;11      1981-01-01      2092-12-12      user11@gmail.com        1234ss@7890u.co822      1982-01-01      2093-12-12      user22@abcn.net 1234ee@9bcn.n3t33      1983-01-01      2094-12-12      user33@fxlive.de        1234rr@56u0v3.d377      1977-01-01      2088-12-12      user77@fxlive.fr        1234mm@56u0v3.5488      1988-01-01      2099-12-12      user88@fxlive.eu        1234aa@56u0v3.3199      1999-01-01      2000-12-12      user99@abcn.net 1234ii@9bcn.n3t789     2008-01-01      3119-12-12      admin@abcn.net  9d80n@9bcn.n3t

 

Hive综合案例分析之简易推荐系统