Packaging a Hadoop project with Maven (including third-party jars)
Background:
1. How do you package a map-reduce program that depends on third-party jars and submit it to the cluster for execution?
2. For Mahout's item-based algorithm, how do you map uids from string to long?
The concrete task implemented here:
Mahout's item-based algorithm expects input in the format uid,vid,score, where uid and vid must be numeric (long) and score may be an integer or a decimal.
In my data, however, the uid field of each record contains letters, so I have to map the uid from string to long first.
For speed, I do this conversion with a distributed map-reduce program.
The program also directly uses a class from Mahout:
org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator
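The idea behind MemoryIDMigrator is to hash each string id to a long deterministically and remember the mapping in memory so it can be reversed later. Below is a minimal self-contained sketch of that idea — this is not Mahout's actual class; the MD5-based hash mirrors what Mahout's migrators do, but treat the details (class name, exact byte handling) as illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Sketch of a string-to-long id migrator: deterministic hash plus an
// in-memory reverse map, roughly what MemoryIDMigrator provides.
public class StringIdMigrator {
    private final Map<Long, String> longToString = new HashMap<>();

    public long toLongID(String stringID) {
        long id = hash(stringID);
        longToString.put(id, stringID); // remember for reverse lookup
        return id;
    }

    public String toStringID(long longID) {
        return longToString.get(longID);
    }

    // Deterministic hash: first 8 bytes of the MD5 digest, big-endian.
    private static long hash(String value) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        StringIdMigrator m = new StringIdMigrator();
        long id = m.toLongID("user_abc123");
        // Same input always maps to the same long, and the mapping reverses.
        System.out.println(id == m.toLongID("user_abc123"));
        System.out.println("user_abc123".equals(m.toStringID(id)));
    }
}
```

Because the hash is deterministic, every mapper that sees the same uid produces the same long, which is what makes this safe to run distributed.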
Create a standard Java project with Maven
mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=org.linger.mahout -DartifactId=mahoutProject -DpackageName=org.linger.mahout -Dversion=1.0 -DinteractiveMode=false
The archetype generates a project skeleton with a pom.xml; run mvn clean install once to verify that the new project builds.
Modify the pom.xml:
1. Remove the generated junit dependency.
2. Add the Mahout dependencies (how this dependency set was derived is not covered here):
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <mahout.version>0.8</mahout.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-core</artifactId>
        <version>${mahout.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-integration</artifactId>
        <version>${mahout.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.cassandra</groupId>
                <artifactId>cassandra-all</artifactId>
            </exclusion>
            <exclusion>
                <groupId>me.prettyprint</groupId>
                <artifactId>hector-core</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>
3. Configure jar packaging in the pom.xml:
<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>org.linger.mahout.mapreducer.UserVideoFormat</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
My map-reduce code:
package org.linger.mahout.mapreducer;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;

public class UserVideoFormat {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private Text userId = new Text();
        private Text lefts = new Text();
        private MemoryIDMigrator thing2long = new MemoryIDMigrator();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            // Split at the first comma only: uid | "vid,score"
            int spliter = line.indexOf(',');
            String userStr = line.substring(0, spliter);
            String leftsStr = line.substring(spliter + 1);
            // Replace the string uid with its hashed long form
            userId.set(Long.toString(thing2long.toLongID(userStr)));
            lefts.set(leftsStr);
            output.collect(userId, lefts);
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(UserVideoFormat.class);
        conf.setJobName("UserVideoFormat");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        // Keep the output comma-separated, as Mahout expects
        conf.set("mapred.textoutputformat.separator", ",");

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        FileInputFormat.setInputPaths(conf, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(conf, new Path(otherArgs[1]));

        JobClient.runJob(conf);
    }
}
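The mapper's core transformation can be exercised outside Hadoop: split each record at the first comma only (so vid and score stay together) and replace the string uid with a numeric one. The sketch below uses a hypothetical stand-in hash in place of MemoryIDMigrator, purely so it runs without Mahout on the classpath:

```java
// Stand-alone check of the record transformation done in the mapper.
public class UserVideoFormatDemo {

    // Hypothetical placeholder for MemoryIDMigrator.toLongID; only the
    // "deterministic, non-negative long" property matters here.
    static long toLongID(String s) {
        return s.hashCode() & 0xFFFFFFFFL;
    }

    static String transform(String line) {
        int spliter = line.indexOf(',');            // first comma only
        String userStr = line.substring(0, spliter);
        String lefts = line.substring(spliter + 1); // "vid,score" kept intact
        return toLongID(userStr) + "," + lefts;
    }

    public static void main(String[] args) {
        // The uid contains letters, so it cannot be parsed as a long directly;
        // after the transform, the first field is numeric.
        System.out.println(transform("userAB12,789,4.5"));
    }
}
```

Splitting with indexOf rather than String.split is deliberate: it guarantees only the uid is peeled off even if later fields ever contained commas.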
Run mvn package to build the jar.
mahoutProject-1.0-jar-with-dependencies.jar is generated under the target directory.
hadoop jar mahoutProject-1.0-jar-with-dependencies.jar input output
Note that because the manifest configuration in the pom.xml already names the main class, it does not need to be given on the command line here. Otherwise, the main class would normally be appended after the jar name, e.g. hadoop jar mahoutProject-1.0-jar-with-dependencies.jar org.linger.mahout.mapreducer.UserVideoFormat input output.
References:
Building a Mahout project with Maven
http://blog.fens.me/hadoop-mahout-maven-eclipse/
Using third-party dependency jar files in a Hadoop job
http://shiyanjun.cn/archives/373.html
String-typed uid/pid in Mahout recommendations
http://blog.csdn.net/pan12jian/article/details/38703569
Original post: http://blog.csdn.net/lingerlanlan/article/details/42086623
Author: linger