[Incomplete] Crawling web pages step by step with nutch commands


2024-07-11 01:38:30

This article is incomplete. Whether Nutch can actually be driven step by step in this way is still unknown.

1. Basic setup: building the environment

(1) Download the release package and extract it to /usr/search/apache-nutch-2.2.1/

(2) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime

(3) Verify that Nutch is installed

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
elasticindex run the elasticsearch indexer
solrindex run the solr indexer on parsed batches
solrdedup remove duplicates from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(4) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawler's agent name inside the <configuration> element

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
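As a quick sanity check for this step, the sketch below tests whether a nutch-site.xml file contains a non-empty http.agent.name value. The helper name has_agent_name and the SITE_XML variable are made up for illustration, and the check is a plain text match rather than real XML parsing, so treat it as a rough guard only.

```shell
# Path taken from step (4); override SITE_XML if Nutch lives elsewhere.
SITE_XML="${SITE_XML:-/usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml}"

# Hypothetical helper: succeeds only if the file sets a non-empty http.agent.name.
has_agent_name() {
  # Join all lines so that <name> and <value> may sit on separate lines.
  tr -d '\n' < "$1" |
    grep -Eq '<name>http\.agent\.name</name>[[:space:]]*<value>[^<]*[^<[:space:]][^<]*</value>'
}

# Only run the check when the file is actually present.
if [ -f "$SITE_XML" ]; then
  if has_agent_name "$SITE_XML"; then
    echo "http.agent.name is set"
  else
    echo "http.agent.name is missing or empty" >&2
  fi
fi
```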

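(5) Create seed.txt in a urls directory under runtime/local, since the inject step in section 2 is run from runtime/local and reads the directory named urls

cd /usr/search/apache-nutch-2.2.1/runtime/local/

mkdir -p urls

vi urls/seed.txt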

http://nutch.apache.org/

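(6) Edit the URL filter. Note that this is the top-level conf directory; if the local runtime has already been built, the same change is presumably also needed in runtime/local/conf/regex-urlfilter.txt, or ant runtime has to be rerun.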

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept only URLs under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
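To see what this filter rule lets through, the pattern can be tried against a few sample URLs with grep. This is only an approximation: Nutch evaluates the rule as a Java regular expression, while grep -E uses POSIX extended syntax, although for a pattern this simple the two should agree. The helper name check_url and the sample URLs are illustrative, and the literal dots are escaped here.

```shell
# Pattern from the filter line, without the leading "+" that marks an accept rule.
pattern='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

# Hypothetical helper: print "accept" or "reject" for a single URL.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://nutch.apache.org/'          # accept
check_url 'http://wiki.nutch.apache.org/FAQ'  # accept (subdomain)
check_url 'http://www.example.com/'           # reject (other host)
check_url 'https://nutch.apache.org/'         # reject (the rule only covers http://)
```

The last case is worth noticing: a seed URL that uses https:// would be rejected by this rule as written.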

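When a user invokes a crawling command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory containing details about the crawl. In Apache Nutch 2.x there is no CrawlDB. Instead, Nutch keeps all crawl data directly in the database. In our case, we have used Apache HBase, so all crawl data would go into Apache HBase. Note, however, that the InjectorJob log below reports org.apache.gora.memory.store.MemStore as the Gora storage class, so in the run recorded here the data was held in memory rather than in HBase, and an in-memory store is not expected to keep data between separate nutch invocations.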

2. InjectorJob

[root@jediael44 local]# ./bin/nutch inject urls

InjectorJob: starting at 2014-07-07 14:15:21

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 2

Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03

3. GeneratorJob

[root@jediael44 local]# ./bin/nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]

-topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE

-crawlId <id> - the id to prefix the schemas to operate on,

(default: storage.crawl.id)");

-noFilter - do not activate the filter plugin to filter the url, default is true

-noNorm - do not activate the normalizer plugin to normalize the url, default is true

-adddays - Adds numDays to the current time to facilitate crawling urls already

fetched sooner then db.fetch.interval.default. Default value is 0.

-batchId - the batch id

----------------------

Please set the params.

[root@jediael44 local]# ./bin/nutch generate -topN 3

GeneratorJob: starting at 2014-07-07 14:22:55

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: true

GeneratorJob: normalizing: true

GeneratorJob: topN: 3

GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03

GeneratorJob: generated batch id: 1404714175-1017128204
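The batch ID printed on the last line can be captured and handed to later steps instead of -all. The sketch below pulls it out of the GeneratorJob output with sed. The helper name extract_batch_id is made up for illustration, and the commented usage assumes that the generate output reaches the console and that the fetch command accepts a batch ID as its first argument.

```shell
# Hypothetical helper: read GeneratorJob output on stdin and print the batch ID.
extract_batch_id() {
  sed -n 's/^GeneratorJob: generated batch id: *\([^ ]*\).*$/\1/p' | tail -n 1
}

# Sample line copied from the log above.
sample='GeneratorJob: generated batch id: 1404714175-1017128204'
printf '%s\n' "$sample" | extract_batch_id   # prints 1404714175-1017128204

# Intended use (assumption: run from runtime/local with a working Nutch setup):
#   batch_id=$(./bin/nutch generate -topN 3 2>&1 | extract_batch_id)
#   ./bin/nutch fetch "$batch_id"
```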

4. FetcherJob

The job of the fetcher is to fetch the URLs generated by the GeneratorJob, using the GeneratorJob's output as its input. The following command is used for the FetcherJob:

[root@jediael44 local]# bin/nutch fetch -all

FetcherJob: starting

FetcherJob: batchId: –all

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 10

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : -1

Using queue mode : byHost

Fetcher: threads: 10

QueueFeeder finished: total 0 records. Hit by time limit :0

-finishing thread FetcherThread0, activeThreads=0

-finishing thread FetcherThread1, activeThreads=0

-finishing thread FetcherThread2, activeThreads=0

-finishing thread FetcherThread3, activeThreads=0

-finishing thread FetcherThread4, activeThreads=0

-finishing thread FetcherThread5, activeThreads=0

-finishing thread FetcherThread6, activeThreads=0

-finishing thread FetcherThread7, activeThreads=1

-finishing thread FetcherThread8, activeThreads=0

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

-finishing thread FetcherThread9, activeThreads=0

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

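Here the -all argument means that the job fetches every URL generated by the GeneratorJob. A batch ID, such as the one printed by the GeneratorJob, can be passed instead to fetch a single batch. Note that the log above reports "batchId: –all" with an en dash and a QueueFeeder total of 0 records: an argument typed with an en dash is not recognized as the -all option and is treated as a literal batch ID, which is a likely reason that nothing was fetched in this run.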

5. ParserJob

After the FetcherJob, the ParserJob parses the URLs that the FetcherJob fetched. The following command is used for the ParserJob:

[root@jediael44 local]# bin/nutch parse -all

ParserJob: starting

ParserJob: resuming: false

ParserJob: forced reparse: false

ParserJob: batchId: –all

ParserJob: success

[root@jediael44 local]#

Here the -all argument means that the job parses every URL fetched by the FetcherJob. A batch ID can be passed instead to parse a single batch.

6. DbUpdaterJob

[root@jediael44 local]# ./bin/nutch updatedb
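Putting the steps together, a crawl is one inject followed by repeated generate, fetch, parse and updatedb rounds. The script below is a minimal sketch of that loop using only the commands shown in this article. The variables NUTCH, DRY_RUN, ROUNDS and TOPN are invented for this example; it prints the commands by default instead of running them, and it assumes it is started from runtime/local with a persistent Gora store configured, since data would otherwise not carry over from one command to the next.

```shell
#!/bin/sh
# Sketch of repeated crawl rounds. Dry run by default: commands are printed, not executed.
NUTCH="${NUTCH:-./bin/nutch}"   # assumed to be run from runtime/local
DRY_RUN="${DRY_RUN:-1}"         # set DRY_RUN=0 to actually invoke Nutch
ROUNDS="${ROUNDS:-2}"           # hypothetical knob: number of crawl rounds
TOPN="${TOPN:-3}"               # passed to generate -topN

# Print the command in dry-run mode, otherwise execute it and stop on failure.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@" || { echo "failed: $*" >&2; exit 1; }
  fi
}

crawl() {
  run "$NUTCH" inject urls
  i=1
  while [ "$i" -le "$ROUNDS" ]; do
    run "$NUTCH" generate -topN "$TOPN"
    run "$NUTCH" fetch -all
    run "$NUTCH" parse -all
    run "$NUTCH" updatedb
    i=$((i + 1))
  done
}

crawl
```

Each round can pick up URLs discovered by the previous one only if updatedb has written the parsed outlinks back to the store, which is why the order of the four commands inside the loop matters.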
