
Nutch Analysis, Part 1


1. Building Nutch

tar -zxvf apache-nutch-2.2.1-src.tar.gz 

cd apache-nutch-2.2.1

ant runtime


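2. Once the ant build finishes, it generates a runtime folder. This folder contains a deploy folder and a local folder, which correspond to the two ways Nutch can run: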

deploy: the job runs on Hadoop, and its data must live in Hadoop's HDFS.

local: the job runs against directories on the local file system.

(1) The directory structure of the two is as follows:

[jediael@jediael44 runtime]$ ls deploy/ local/

deploy/:

apache-nutch-2.2.1.job  bin 

local/:

bin  conf  lib logs  plugins  test

In deploy, the files are packaged into a single .job file, which is submitted to Hadoop and run as a Hadoop job.
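The difference between the two layouts can be checked with a few lines of shell. The sketch below assumes that the presence of a .job file is what marks a directory as the deploy layout, as the listing above suggests. detect_mode is a hypothetical helper written for illustration; it is not taken from bin/nutch.

```shell
# detect_mode is a hypothetical helper (not part of Nutch): it prints "deploy"
# when the given directory contains a .job file and "local" otherwise.
detect_mode() {
  local f
  for f in "$1"/*.job; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    if [ -f "$f" ]; then
      echo deploy
      return 0
    fi
  done
  echo local
}

# Run from the Nutch source root after "ant runtime".
detect_mode runtime/deploy
detect_mode runtime/local
```

A directory that does not exist is reported as "local" by this sketch, so it is only meaningful after the build has produced the runtime folder.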

(2) Both folders contain a bin directory, which holds the same two executable scripts: crawl and nutch.

Let's look at the last few lines of the nutch script:

if $local; then
 # fix for the external Xerces lib issue with SAXParserFactory
 NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl $NUTCH_OPTS"
 EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
 # check that hadoop can be found on the path
 if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
 fi
 # distributed mode
 EXEC_CALL="hadoop jar $NUTCH_JOB"
fi

# run it
exec $EXEC_CALL $CLASS "$@"

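In other words, in distributed mode the script uses EXEC_CALL="hadoop jar $NUTCH_JOB"; in local mode it uses EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"; and if the mode is not local and the hadoop executable cannot be found on the path, the script prints an error and exits.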
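The branching above can be tried out without launching anything. The following is a minimal sketch: build_exec_call is a hypothetical helper that prints the command prefix instead of exec'ing it, and every variable value is a placeholder chosen for illustration, not a value taken from a real installation.

```shell
# build_exec_call is a hypothetical helper (not part of bin/nutch).
# It prints the command prefix the launcher would use for the given mode.
build_exec_call() {
  local is_local="$1"   # "true" for local mode, anything else for distributed
  if [ "$is_local" = "true" ]; then
    # local mode: start the JVM directly with the local classpath
    echo "$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
  elif ! command -v hadoop >/dev/null 2>&1; then
    # distributed mode needs the hadoop executable on the path
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode." >&2
    return 1
  else
    # distributed mode: hand the packaged .job file to Hadoop
    echo "hadoop jar $NUTCH_JOB"
  fi
}

# Placeholder values, for illustration only.
JAVA=java
JAVA_HEAP_MAX=-Xmx1000m
NUTCH_OPTS="-Dhadoop.log.dir=logs"
CLASSPATH="conf:lib/*"
NUTCH_JOB=apache-nutch-2.2.1.job

build_exec_call true
```

The sketch uses command -v rather than counting the output lines of which, but the decision it makes is the same: local mode never needs Hadoop, while distributed mode fails early when hadoop is missing.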

(3) The class to run is chosen from the command argument:

if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawler
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.InjectorJob
elif [ "$COMMAND" = "hostinject" ] ; then
  CLASS=org.apache.nutch.host.HostInjectorJob
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.GeneratorJob
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.FetcherJob
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParserJob
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.DbUpdaterJob
elif [ "$COMMAND" = "updatehostdb" ] ; then
  CLASS=org.apache.nutch.host.HostDbUpdateJob
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.WebTableReader
elif [ "$COMMAND" = "readhostdb" ] ; then
  CLASS=org.apache.nutch.host.HostDbReader
elif [ "$COMMAND" = "elasticindex" ] ; then
  CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
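The excerpt above is essentially a lookup table from command name to Java class. As a rough illustration, the same mapping can be written as a case statement. class_for_command is a hypothetical helper, it covers only a handful of the commands listed, and rejecting unknown commands is a choice made for this sketch rather than a description of what bin/nutch does.

```shell
# class_for_command is a hypothetical helper (not part of bin/nutch).
# It prints the class name for a few of the commands shown in the excerpt.
class_for_command() {
  case "$1" in
    inject)   echo org.apache.nutch.crawl.InjectorJob ;;
    generate) echo org.apache.nutch.crawl.GeneratorJob ;;
    fetch)    echo org.apache.nutch.fetcher.FetcherJob ;;
    parse)    echo org.apache.nutch.parse.ParserJob ;;
    updatedb) echo org.apache.nutch.crawl.DbUpdaterJob ;;
    *)        echo "unknown command: $1" >&2; return 1 ;;
  esac
}

class_for_command fetch
# prints: org.apache.nutch.fetcher.FetcherJob
```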

For example, the class behind the nutch fetch command is org.apache.nutch.fetcher.FetcherJob:

[jediael@jediael44 java]$ cat org/apache/nutch/fetcher/FetcherJob.java 

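This prints the source of that class. The same approach works for every command in the shell script: look up the class name, then open the matching source file.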
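Turning a class name into a source path is mechanical: the dots in the package name become directory separators and .java is appended. A small sketch, where class_to_source_path is a hypothetical helper name:

```shell
# class_to_source_path is a hypothetical helper: it converts a fully qualified
# class name into a path relative to the Java source root.
class_to_source_path() {
  printf '%s.java\n' "$(printf '%s' "$1" | tr '.' '/')"
}

class_to_source_path org.apache.nutch.fetcher.FetcherJob
# prints: org/apache/nutch/fetcher/FetcherJob.java
```

Run from the java source directory shown in the prompt above, cat "$(class_to_source_path org.apache.nutch.fetcher.FetcherJob)" would display the same file as the cat command in the text.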