首页 > 代码库 > Nuch分析一
Nuch分析一
1、构建Nutch
tar -zxvf apache-nutch-2.2.1-src.tar.gz
cd apache-nutch-2.2.1
ant runtime
2、 ant构建之后,生成runtime文件夹,该文件夹下面有deploy和local文件夹,分别代表了nutch的两种运行方式:
Deploy:的数据必须运行在Hadoop的HDFS中
local:是运行在本地目录中。
(1)二者的目录结构如下:
[jediael@jediael44 runtime]$ ls deploy/ local/
deploy/:
apache-nutch-2.2.1.job bin
local/:
bin conf lib logs plugins test
在deploy中,文件被打包成一个Job,作为Hadoop的一个Job来运行。
(2)二者目录下均有一个bin的目录,其内包含相同的crawl与nutch两个执行文件。
我们查看nutch文件的最后几行
if $local; then
# fix for the external Xerceslib issue with SAXParserFactory
NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl$NUTCH_OPTS"
EXEC_CALL="$JAVA$JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
# check that hadoop can befound on the path
if [ $(which hadoop | wc -l )-eq 0 ]; then
echo "Can‘t findHadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
exit -1;
fi
# distributed mode
EXEC_CALL="hadoop jar$NUTCH_JOB"
fi
# run it
exec $EXEC_CALL $CLASS "$@"
即默认情况下为 EXEC_CALL="hadoop jar$NUTCH_JOB",若为Local,则 EXEC_CALL="$JAVA$JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH",若未local,且hadoop不存在,则报错。
(3)根据参数确定类文件
if [ "$COMMAND" = "crawl" ] ; then
CLASS=org.apache.nutch.crawl.Crawler
elif [ "$COMMAND" = "inject" ] ; then
CLASS=org.apache.nutch.crawl.InjectorJob
elif [ "$COMMAND" = "hostinject" ] ; then
CLASS=org.apache.nutch.host.HostInjectorJob
elif [ "$COMMAND" = "generate" ] ; then
CLASS=org.apache.nutch.crawl.GeneratorJob
elif [ "$COMMAND" = "fetch" ] ; then
CLASS=org.apache.nutch.fetcher.FetcherJob
elif [ "$COMMAND" = "parse" ] ; then
CLASS=org.apache.nutch.parse.ParserJob
elif [ "$COMMAND" = "updatedb" ] ; then
CLASS=org.apache.nutch.crawl.DbUpdaterJob
elif [ "$COMMAND" = "updatehostdb" ] ; then
CLASS=org.apache.nutch.host.HostDbUpdateJob
elif [ "$COMMAND" = "readdb" ] ; then
CLASS=org.apache.nutch.crawl.WebTableReader
elif [ "$COMMAND" = "readhostdb" ] ; then
CLASS=org.apache.nutch.host.HostDbReader
elif [ "$COMMAND" = "elasticindex" ] ; then
CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "parsechecker" ] ; then
CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "plugin" ] ; then
CLASS=org.apache.nutch.plugin.PluginRepository
如,对于nutch fetch命令,对应的类文件应该是:org.apache.nutch.fetcher.FetcherJob
[jediael@jediael44 java]$ cat org/apache/nutch/fetcher/FetcherJob.java
可以查看类文件。此方法可以查看一切的shell对应的源文件。