首页 > 代码库 > hive:子查询

hive:子查询

hive本身支持的子查询非常有限,Hive不支持where子句中的子查询,只允许子查询在from中出现

错误写法:

insert into table branch_atmzc_sumSelect  XT_OP_TRL, SA_TX_DT,"取款-存款",b.cr_tx_amt- a.cr_tx_amt as cr_tx_amt from branch_atmzc a join  branch_atmzc b  on (a.XT_OP_TRL = b.XT_OP_TRL and a.SA_TX_DT = b.SA_TX_DT and a.tran_cd = ‘ATM存款‘ and b.tran_cd = ‘ATM取款‘), counts from branch_atmzcgroup by XT_OP_TRL, SA_TX_DT,cr_tx_amt,counts;

正确写法:

insert into table branch_atmzc_sumSelect a.XT_OP_TRL, a.SA_TX_DT,"取款-存款",b.cr_tx_amt- a.cr_tx_amt ,b.counts+a.counts  from branch_atmzc a join  branch_atmzc b  on (a.XT_OP_TRL = b.XT_OP_TRL and a.SA_TX_DT = b.SA_TX_DT and a.tran_cd = ‘ATM存款‘ and b.tran_cd = ‘ATM取款‘)

另外一种:考虑效率

SELECT	t1.产品类型,	COUNT(DISTINCT (IF(t2.用户ID IS NULL, NULL, t1.用户ID))) AS KEEP_UVFROM(	SELECT		产品类型,		用户ID	FROM 事实表	WHERE (`DATE` >= 20140201 AND `DATE` <= 20140228)) t1LEFT OUTER JOIN(	SELECT		产品类型,		用户ID	FROM 事实表	WHERE (`DATE` >= 20140101 AND `DATE` <= 20140131)) t2 ON (t1.产品类型 = t2.产品类型 AND t1.用户ID = t2.用户ID)GROUP BY t1.产品类型

 本身表包含的字段信息多,时间跨度大。对于这种对于IN / EXISTS子查询(准确地说,这里是非相关子查询)有一种高效的实现,就是LEFT SEMI JOIN:

       LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way.

left join 

SELECT	产品类型,	COUNT(DISTINCT t1.用户ID) AS KEEP_UVFROM(	SELECT		产品类型,		用户ID	FROM 事实表	WHERE (`DATE` >= 20140201 AND `DATE` <= 20140228)) t1LEFT SEMI JOIN(	SELECT		产品类型,		用户ID	FROM 事实表	WHERE (`DATE` >= 20140101 AND `DATE` <= 20140131)) t2 ON (t1.产品类型 = t2.产品类型 AND t1.用户ID = t2.用户ID)GROUP BY 产品类型

  

hive:子查询