
Building Nutch 2.x with Ant on Ubuntu & Configuring Nutch 2.x

Building Nutch 2.x with Ant

See: 1.    http://blog.javachen.com/2014/05/20/nutch-intro/

     2.    wiki.apache.org/nutch/Nutch2Tutorial

Prerequisite: Ant is installed and configured (http://www.cnblogs.com/xxx0624/p/4172277.html)

1. Download Nutch (for example, I used apache-nutch-2.2.1-src.tar.gz)

Extract the archive, rename the extracted folder to nutch, then move it into /home (so it ends up at /home/nutch).
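The extract, rename and move steps can be sketched as one small shell function. This is only an illustration: the function name unpack_nutch and its two arguments are hypothetical, and it assumes the tarball contains a single top-level directory.

```shell
# Hypothetical helper: unpack a Nutch source tarball and place it at a target path.
# Assumes the archive contains exactly one top-level directory.
unpack_nutch() {
  archive="$1"   # e.g. apache-nutch-2.2.1-src.tar.gz
  target="$2"    # e.g. /home/nutch
  # Refuse to overwrite or nest inside an existing target.
  [ ! -e "$target" ] || return 1
  workdir="$(mktemp -d)" || return 1
  tar -xzf "$archive" -C "$workdir" || return 1
  extracted="$(find "$workdir" -mindepth 1 -maxdepth 1 -type d | head -n 1)"
  [ -n "$extracted" ] || return 1
  # Moving the directory to the target path also renames it.
  mv "$extracted" "$target"
}

# Example call (writing into /home usually needs root privileges):
# unpack_nutch apache-nutch-2.2.1-src.tar.gz /home/nutch
```

Extracting into a temporary directory first means the name of the folder inside the archive does not have to be known in advance.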

 

2. Build Nutch

cd nutch
ant

   2.1 You may run into this error:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

    Cause: the required jar file is missing.

    Fix:

        (1) Download sonar-ant-task-2.1.jar and put it in the nutch folder

        (2) Edit build.xml so that it picks up the new jar

<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
    <classpath path="${ant.library.dir}" />
    <classpath path="${mysql.library.dir}" />
    <classpath><fileset dir="." includes="sonar*.jar" /></classpath>
</taskdef>

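          //Find the matching taskdef in build.xml and add whatever it is missing. The fileset entry is what matches the sonar*.jar placed in the nutch folder.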

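  2.2 The build takes too long

    Nutch resolves its dependencies with Ivy, which downloads them during the build, so the build can be slow. If it takes too long, switching the repository URL may help.

    Edit this file: ivy/ivysettings.xml

http://mirrors.ibiblio.org/maven2/

     replaces

http://repo1.maven.org/maven2/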
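The URL swap can also be done from the command line instead of by hand. The sketch below is an assumption about how one might script it: the function name switch_maven_mirror is hypothetical, and it relies on GNU sed's -i option as shipped with Ubuntu.

```shell
# Hypothetical helper: swap the Maven repository URL in an Ivy settings file.
switch_maven_mirror() {
  settings="$1"   # e.g. ivy/ivysettings.xml
  # Keep a backup so the original settings can be restored.
  cp "$settings" "$settings.bak" || return 1
  # '#' is used as the sed delimiter because the URLs contain slashes.
  sed -i 's#http://repo1\.maven\.org/maven2/#http://mirrors.ibiblio.org/maven2/#g' "$settings"
}

# Example call, run from the nutch folder:
# switch_maven_mirror ivy/ivysettings.xml
```

If the mirror turns out to be slower or unavailable, copying the .bak file back restores the previous setting.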

   2.3 Directory layout after the build:

.
├── build
├── build.xml
├── build.xml~
├── CHANGES.txt
├── conf
├── default.properties
├── docs
├── ivy
├── lib
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── runtime
├── sonar-ant-task-2.1.jar
└── src

7 directories, 8 files

 

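3. Edit the Nutch configuration files

    Nutch 2.x uses Gora as its storage layer to access Cassandra, HBase, Accumulo, Avro and others, so the Gora properties have to be specified in the configuration files below.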

 3.1 Edit conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

  3.2 Edit ivy/ivy.xml

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

  3.3 Edit conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

 

/*****************************************************************************************************************************/

Configuring Nutch

(the nutch folder is already under /home)

1. Set the system environment variable

sudo gedit /etc/profile

 //Add:

#set nutch
export PATH=/home/nutch/runtime/local/bin:$PATH
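As an alternative to opening the file in an editor, the same two lines can be appended from a script. This is a hedged sketch: the function name add_nutch_path is hypothetical, and the duplicate check is an addition of this example rather than something the steps above require.

```shell
# Hypothetical helper: append the Nutch bin directory to PATH in a profile file.
add_nutch_path() {
  profile="$1"   # e.g. /etc/profile (needs root) or a per-user profile file
  bin_dir="$2"   # e.g. /home/nutch/runtime/local/bin
  # \$PATH stays literal so it is expanded when the profile is loaded, not now.
  line="export PATH=$bin_dir:\$PATH"
  # Append only if the exact line is not already present, so reruns are harmless.
  grep -qxF "$line" "$profile" 2>/dev/null || printf '#set nutch\n%s\n' "$line" >> "$profile"
}

# Example call (as root, since /etc/profile is system-wide):
# add_nutch_path /etc/profile /home/nutch/runtime/local/bin
```

The change only takes effect in shells that load the profile afterwards, for example after logging in again or after running `. /etc/profile` in the current shell.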

 

2. Test (the ./nutch and ./crawl scripts in nutch/runtime/local/bin)

nutch
//Output:
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

 

crawl
//Output:
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
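The usage message lists four positional arguments, the first being a seed directory. A minimal sketch of preparing one is shown below. Everything named here is an assumption for illustration: the function make_seed_dir, the directory name urls, the file name seed.txt, the crawl ID testCrawl and the Solr URL. Actually running the crawl also depends on the storage backend and Solr being reachable, which this post does not cover.

```shell
# Hypothetical helper: create a seed directory holding one URL per line.
make_seed_dir() {
  seed_dir="$1"; shift
  mkdir -p "$seed_dir" || return 1
  # Remaining arguments are the seed URLs; seed.txt is an assumed file name.
  printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Example call shape, following the usage string
#   crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
# (all values below are placeholders):
# make_seed_dir urls http://nutch.apache.org/
# crawl urls testCrawl http://localhost:8983/solr/ 2
```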

 
