首页 > 代码库 > Hive综合案例分析之简易推荐系统
Hive综合案例分析之简易推荐系统
知识点:
1、Hive复合数据类型map与Lateral View的使用;
map、str_to_map、map_keys、map_values,map与lateral view
2、通过translate进行简单数据保护;
Hive转换函数进行数据保护,确保企业应用信息安全
3、Hive的窗口和分析函数入门;
row_number、rank、dense_rank
创建订单表:
CREATE EXTERNAL TABLE f_orders ( user_id STRING , ts STRING , order_id STRING , items map<STRING,BIGINT>)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ‘\t‘COLLECTION ITEMS TERMINATED BY ‘|‘MAP KEYS TERMINATED BY ‘:‘;
加载数据:
load data local inpath ‘/home/spark/software/data/f_orders.txt‘ overwrite into table f_orders;
查询数据:
select * from f_orders;11 2014-05-01 06:01:12.334+01 10703007267488 {"item8":2,"item1":1}22 2014-05-01 07:28:12.342+01 10101043505096 {"item6":3,"item3":2}33 2014-05-01 07:50:12.33+01 10103043509747 {"item7":7}11 2014-05-01 09:27:12.33+01 10103043501575 {"item5":5,"item1":1,"item4":1,"item9":1}22 2014-05-01 09:03:12.324+01 10104043514061 {"item1":3}33 2014-05-02 19:10:12.343+01 11003002067594 {"item4":2,"item1":1}11 2014-05-02 09:07:12.344+01 10101043497459 {"item9":1}35 2014-05-03 11:07:12.339+01 10203019269975 {"item5":1,"item1":1}789 2014-05-03 12:59:12.743+01 10401003346256 {"item7":3,"item8":2,"item9":1}77 2014-05-03 18:04:12.355+01 10203019262235 {"item5":2,"item1":1}99 2014-05-04 00:36:39.713+01 10103044681799 {"item9":3,"item1":1}33 2014-05-04 19:10:12.343+01 12345678901234 {"item5":1,"item1":1}11 2014-05-05 09:07:12.344+01 12345678901235 {"item6":1,"item1":1}35 2014-05-05 11:07:12.339+01 12345678901236 {"item5":2,"item1":1}22 2014-05-05 12:59:12.743+01 12345678901237 {"item9":3,"item1":1}77 2014-05-05 18:04:12.355+01 12345678901238 {"item8":3,"item1":1}99 2014-05-05 20:36:39.713+01 12345678901239 {"item9":3,"item1":1}
从map中取值:map_keys, map_values
select map_keys(items), map_values(items) from f_orders where user_id = ‘35‘;["item5","item1"] [1,1]["item5","item1"] [2,1]
查询包含订单条目中有item8的订单
select * from f_orders where array_contains(map_keys(items), ‘item8‘);11 2014-05-01 06:01:12.334+01 10703007267488 {"item1":1,"item8":2}789 2014-05-03 12:59:12.743+01 10401003346256 {"item7":3,"item8":2,"item9":1}77 2014-05-05 18:04:12.355+01 12345678901238 {"item1":1,"item8":3}
将f_orders中items列打开成横向视图
select user_id, order_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount;11 10703007267488 item8 211 10703007267488 item1 122 10101043505096 item6 322 10101043505096 item3 233 10103043509747 item7 711 10103043501575 item5 511 10103043501575 item1 111 10103043501575 item4 111 10103043501575 item9 122 10104043514061 item1 333 11003002067594 item4 233 11003002067594 item1 111 10101043497459 item9 135 10203019269975 item5 135 10203019269975 item1 1789 10401003346256 item7 3789 10401003346256 item8 2789 10401003346256 item9 177 10203019262235 item5 277 10203019262235 item1 199 10103044681799 item9 399 10103044681799 item1 133 12345678901234 item5 133 12345678901234 item1 111 12345678901235 item6 111 12345678901235 item1 135 12345678901236 item5 235 12345678901236 item1 122 12345678901237 item9 322 12345678901237 item1 177 12345678901238 item8 377 12345678901238 item1 199 12345678901239 item9 399 12345678901239 item1 1
创建订单条目表:
CREATE EXTERNAL TABLE d_items ( item_sku STRING, price DOUBLE, catalogs array<STRING>)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ‘\t‘COLLECTION ITEMS TERMINATED BY ‘|‘;
加载数据:
load data local inpath ‘/home/spark/software/data/d_items.txt‘ overwrite into table d_items;
查询数据:
select * from d_items;item1 100.2 ["catalogA","catalogD","catalogX"]item2 200.3 ["catalogA"]item3 300.4 ["catalogA","catalogX"]item4 400.5 ["catalogB"]item5 500.6 ["catalogB","catalogX"]item6 600.7 ["catalogB"]item7 700.8 ["catalogC"]item8 800.9 ["catalogC","catalogD"]item9 899.99 ["catalogC","catalogA"]
求每个人的每个订单的金额
select orders.user_id, orders.order_id, round(sum(d.price*orders.amount), 2) as order_pricefrom ( select user_id, order_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin d_items don (orders.item = d.item_sku)group by orders.user_id, orders.order_id;11 10101043497459 899.9911 10103043501575 3903.6911 10703007267488 1702.011 12345678901235 700.922 10101043505096 2402.922 10104043514061 300.622 12345678901237 2800.1733 10103043509747 4905.633 11003002067594 901.233 12345678901234 600.835 10203019269975 600.835 12345678901236 1101.477 10203019262235 1101.477 12345678901238 2502.9789 10401003346256 4604.1999 10103044681799 2800.1799 12345678901239 2800.17
求人和订单条目以及订单条目数量对应关系的数量
select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount;11 item8 211 item1 122 item6 322 item3 233 item7 711 item5 511 item1 111 item4 111 item9 122 item1 333 item4 233 item1 111 item9 135 item5 135 item1 1789 item7 3789 item8 2789 item9 177 item5 277 item1 199 item9 399 item1 133 item5 133 item1 111 item6 111 item1 135 item5 235 item1 122 item9 322 item1 177 item8 377 item1 199 item9 399 item1 1
订单条目与类别(类别打散后)的关系
select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog;item1 catalogAitem1 catalogDitem1 catalogXitem2 catalogAitem3 catalogAitem3 catalogXitem4 catalogBitem5 catalogBitem5 catalogXitem6 catalogBitem7 catalogCitem8 catalogCitem8 catalogDitem9 catalogCitem9 catalogA
人和订单条目和订单条目数量以及与类别(类别打散后)的关系
select orders.user_id, orders.item, orders.amount, catalogs.catalogfrom ( select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin ( select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog) catalogson (orders.item = catalogs.item_sku);11 item8 2 catalogC11 item8 2 catalogD11 item1 1 catalogA11 item1 1 catalogD11 item1 1 catalogX22 item6 3 catalogB22 item3 2 catalogA22 item3 2 catalogX33 item7 7 catalogC11 item5 5 catalogB11 item5 5 catalogX11 item1 1 catalogA11 item1 1 catalogD11 item1 1 catalogX11 item4 1 catalogB11 item9 1 catalogC11 item9 1 catalogA22 item1 3 catalogA22 item1 3 catalogD22 item1 3 catalogX33 item4 2 catalogB33 item1 1 catalogA33 item1 1 catalogD33 item1 1 catalogX11 item9 1 catalogC11 item9 1 catalogA35 item5 1 catalogB35 item5 1 catalogX35 item1 1 catalogA35 item1 1 catalogD35 item1 1 catalogX789 item7 3 catalogC789 item8 2 catalogC789 item8 2 catalogD789 item9 1 catalogC789 item9 1 catalogA77 item5 2 catalogB77 item5 2 catalogX77 item1 1 catalogA77 item1 1 catalogD77 item1 1 catalogX99 item9 3 catalogC99 item9 3 catalogA99 item1 1 catalogA99 item1 1 catalogD99 item1 1 catalogX33 item5 1 catalogB33 item5 1 catalogX33 item1 1 catalogA33 item1 1 catalogD33 item1 1 catalogX11 item6 1 catalogB11 item1 1 catalogA11 item1 1 catalogD11 item1 1 catalogX35 item5 2 catalogB35 item5 2 catalogX35 item1 1 catalogA35 item1 1 catalogD35 item1 1 catalogX22 item9 3 catalogC22 item9 3 catalogA22 item1 1 catalogA22 item1 1 catalogD22 item1 1 catalogX77 item8 3 catalogC77 item8 3 catalogD77 item1 1 catalogA77 item1 1 catalogD77 item1 1 catalogX99 item9 3 catalogC99 item9 3 catalogA99 item1 1 catalogA99 item1 1 catalogD99 item1 1 catalogX
将结果写到usr_cat_weight表中
create table usr_cat_weight asselect orders.user_id, catalogs.catalog, sum(orders.amount) as weightfrom ( select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin ( select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog) catalogson (orders.item = catalogs.item_sku)group by orders.user_id, catalogs.catalogorder by user_id, weight desc;
select * from usr_cat_weight;11 catalogX 811 catalogB 711 catalogD 511 catalogA 511 catalogC 422 catalogA 922 catalogX 622 catalogD 422 catalogB 322 catalogC 333 catalogC 733 catalogX 333 catalogB 333 catalogA 233 catalogD 235 catalogX 535 catalogB 335 catalogA 235 catalogD 277 catalogD 577 catalogX 477 catalogC 377 catalogA 277 catalogB 2789 catalogC 6789 catalogD 2789 catalogA 199 catalogA 899 catalogC 699 catalogD 299 catalogX 2
row_number: 行号
select user_id, catalog, weight, row_number() OVER (PARTITION BY user_id ORDER BY weight DESC) as row_num FROM usr_cat_weight where user_id < ‘33‘;11 catalogX 8 111 catalogB 7 211 catalogA 5 311 catalogD 5 411 catalogC 4 522 catalogA 9 122 catalogX 6 222 catalogD 4 322 catalogC 3 422 catalogB 3 5
rank: 相同的值排名是相同的,排名值会跳过重复排名的
select user_id, catalog, weight, rank() OVER (PARTITION BY user_id ORDER BY weight DESC) as rnk FROM usr_cat_weight where user_id < ‘33‘;11 catalogX 8 111 catalogB 7 211 catalogA 5 311 catalogD 5 311 catalogC 4 522 catalogA 9 122 catalogX 6 222 catalogD 4 322 catalogC 3 422 catalogB 3 4
dense_rank:排名值不会跳过重复排名的
select user_id, catalog, weight, dense_rank() OVER (PARTITION BY user_id ORDER BY weight DESC) as drnk FROM usr_cat_weight where user_id < ‘33‘;11 catalogX 8 111 catalogB 7 211 catalogA 5 311 catalogD 5 311 catalogC 4 422 catalogA 9 122 catalogX 6 222 catalogD 4 322 catalogC 3 422 catalogB 3 4
CREATE TABLE usr_cat ASselect user_id, catalog, row_number() OVER (PARTITION BY user_id ORDER BY weight DESC) as row_numFROM (select orders.user_id, catalogs.catalog, sum(orders.amount) as weightfrom ( select user_id, item, amount from f_orders LATERAL VIEW explode(items) t AS item, amount) ordersjoin ( select item_sku, catalog from d_items LATERAL VIEW explode(catalogs) t AS catalog) catalogson (orders.item = catalogs.item_sku)group by orders.user_id, catalogs.catalogorder by user_id, weight) xORDER BY user_id, row_num;
select * from usr_cat;11 catalogX 111 catalogB 211 catalogA 311 catalogD 411 catalogC 522 catalogA 122 catalogX 222 catalogD 322 catalogC 422 catalogB 533 catalogC 133 catalogB 233 catalogX 333 catalogD 433 catalogA 535 catalogX 135 catalogB 235 catalogA 335 catalogD 477 catalogD 177 catalogX 277 catalogC 377 catalogA 477 catalogB 5789 catalogC 1789 catalogD 2789 catalogA 399 catalogA 199 catalogC 299 catalogD 399 catalogX 4
创建用户表:
CREATE EXTERNAL TABLE d_users ( user_id STRING , gender STRING , birthday STRING , email STRING , regday STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\073‘;
加载数据:
load data local inpath ‘/home/spark/software/data/d_users.txt‘ overwrite into table d_users;
查询:
select * from d_users;11 m 1981-01-01 张三@gmail.com 2014-04-2122 w 1982-01-01 user22@abcn.net 2014-04-2233 m 1983-01-01 user33@fxlive.de 2014-04-2377 w 1977-01-01 user77@fxlive.fr 2014-05-0188 m 1988-01-01 user88@fxlive.eu 2014-05-0299 w 1999-01-01 user99@abcn.net 2014-05-03789 m 2008-01-01 admin@abcn.net 2014-05-03
Hive转换函数translate进行简单数据保护
select user_id, birthday, translate(birthday, ‘0123456789‘, ‘1234567890‘), email, translate(email, ‘userfxgmail1234567890‘, ‘1234567890userfxgmail‘) from d_users;11 1981-01-01 2092-12-12 user11@gmail.com 1234ss@7890u.co822 1982-01-01 2093-12-12 user22@abcn.net 1234ee@9bcn.n3t33 1983-01-01 2094-12-12 user33@fxlive.de 1234rr@56u0v3.d377 1977-01-01 2088-12-12 user77@fxlive.fr 1234mm@56u0v3.5488 1988-01-01 2099-12-12 user88@fxlive.eu 1234aa@56u0v3.3199 1999-01-01 2000-12-12 user99@abcn.net 1234ii@9bcn.n3t789 2008-01-01 3119-12-12 admin@abcn.net 9d80n@9bcn.n3t
Hive综合案例分析之简易推荐系统
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。