Pig WordCount analysis
grunt> cat /opt/dataset/input.txt
keyword1 keyword2
keyword2 keyword4
keyword3 keyword1
keyword4 keyword4

A = LOAD '/opt/dataset/input.txt' using PigStorage('\n') as (line:chararray);
B = foreach A generate TOKENIZE((chararray)$0);
C = foreach B generate flatten($0) as word;
D = group C by word;
E = foreach D generate COUNT(C), group;

dump B;
({(keyword1),(keyword2)})
({(keyword2),(keyword4)})
({(keyword3),(keyword1)})
({(keyword4),(keyword4)})

dump C;
(keyword1)
(keyword2)
(keyword2)
(keyword4)
(keyword3)
(keyword1)
(keyword4)
(keyword4)

dump D;
(keyword1,{(keyword1),(keyword1)})
(keyword2,{(keyword2),(keyword2)})
(keyword3,{(keyword3)})
(keyword4,{(keyword4),(keyword4),(keyword4)})

dump E;
(2,keyword1)
(2,keyword2)
(1,keyword3)
(3,keyword4)

store E into './wordcount';
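To make the contents of each relation easier to follow, the same four steps (tokenize, flatten, group, count) can be mirrored in plain Python. This is only an illustrative sketch, not how Pig executes the script: `str.split` stands in for TOKENIZE, `collections.Counter` stands in for the GROUP and COUNT steps, and the names `lines`, `bags`, `words`, `counts`, and `result` are chosen here for illustration. The input lines are the four lines shown by `cat` above.

```python
from collections import Counter

# The four lines of /opt/dataset/input.txt from the session above.
lines = [
    "keyword1 keyword2",
    "keyword2 keyword4",
    "keyword3 keyword1",
    "keyword4 keyword4",
]

# B: one bag of words per input line (str.split stands in for TOKENIZE).
bags = [line.split() for line in lines]

# C: flatten the bags so that each word becomes its own record.
words = [word for bag in bags for word in bag]

# D and E: group identical words and count the size of each group.
counts = Counter(words)

# Emit (count, word) pairs like relation E. Sorting by word is done here only
# to get a stable order; Pig guarantees no order without an ORDER BY.
result = sorted(((n, w) for w, n in counts.items()), key=lambda pair: pair[1])

for pair in result:
    print(pair)
```

The intermediate lists `bags` and `words` correspond to the output of `dump B` and `dump C`, which may help when checking what FLATTEN changes between the two relations.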
TOKENIZE

Splits a string and outputs a bag of words.

Syntax

TOKENIZE(expression)

Terms

expression: An expression with data type chararray.

Usage

Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote ("), comma (,), parentheses (()), star (*).

Example

In this example the strings in each row are split.

A = LOAD 'data' AS (f1:chararray);
DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)

X = FOREACH A GENERATE TOKENIZE(f1);
DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
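The separator rule described in the usage notes can be approximated in Python with a regular expression. This is a rough sketch that assumes only the five separator characters listed above; the helper name `tokenize` and the constant `SEPARATORS` are hypothetical and are not part of Pig, and the real function may differ in edge cases.

```python
import re

# Separators taken from the usage notes above:
# space, double quote, comma, parentheses, star.
SEPARATORS = re.compile(r'[ ",()*]+')

def tokenize(text):
    """Split text on the separator characters, dropping empty pieces."""
    return [token for token in SEPARATORS.split(text) if token]

print(tokenize("Here is the first string."))
```

Note that the period is not a separator, so the last token keeps it (`string.`), which matches the `DUMP X` output in the example.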