Nutch 2.2.1 Crawl Workflow


2024-07-17 20:44:15

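I. Overview of the crawl workflow

1. The Nutch crawl workflow

When a crawl is run with the crawl command, the basic steps are as follows:

(1) InjectorJob

Start of the first iteration

(2) GeneratorJob

(3) FetcherJob

(4) ParserJob

(5) DbUpdaterJob

(6) SolrIndexerJob

Start of the second iteration

(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob

Start of the third iteration

...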
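
The sequence above (one inject, then the same five jobs in every iteration) can be sketched as a small Python driver. This is only an illustration, not the crawl script that ships with Nutch: the function names are hypothetical, and the command lines are copied from the per-step transcripts later in this article. Which steps also need -crawlId in a given setup is an assumption left to the reader, and `-all` for solrindex is taken from the usage string rather than from a command the article actually ran.

```python
import subprocess
from typing import List

NUTCH = "bin/nutch"  # launcher path as used in the article's transcripts


def inject_command(url_dir: str) -> List[str]:
    # Step (1): run once, before the first iteration.
    return [NUTCH, "inject", url_dir]


def iteration_commands(crawl_id: str, solr_url: str) -> List[List[str]]:
    # Steps (2) to (6): repeated in every iteration, in this order.
    return [
        [NUTCH, "generate", "-crawlId", crawl_id],
        [NUTCH, "fetch", "-all", "-crawlId", crawl_id],
        [NUTCH, "parse", "-all", "-crawlId", crawl_id],
        # Shown without options in the article; adding -crawlId here
        # would be an assumption about your setup.
        [NUTCH, "updatedb"],
        [NUTCH, "solrindex", solr_url, "-all", "-crawlId", crawl_id],
    ]


def plan_crawl(url_dir: str, crawl_id: str, solr_url: str,
               iterations: int) -> List[List[str]]:
    """Return every command of a crawl, in execution order."""
    plan = [inject_command(url_dir)]
    for _ in range(iterations):
        plan.extend(iteration_commands(crawl_id, solr_url))
    return plan


def run_crawl(plan: List[List[str]]) -> None:
    """Execute the plan, stopping at the first failing job."""
    for argv in plan:
        subprocess.run(argv, check=True)
```

Building the plan separately from running it makes it possible to print and review the command list before anything touches HBase or Solr.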

2. Crawl log

When crawling with the crawl command, the console output looks like this:

InjectorJob: starting at 2014-07-08 10:41:27

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 2

Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05

Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5

Generating batchId

Generating a new fetchlist

GeneratorJob: starting at 2014-07-08 10:41:34

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1404787293-26339

Fetching :

FetcherJob: starting

FetcherJob: batchId: 1404787293-26339

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1404798101129

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 2 records. Hit by time limit :0

fetching http://www.csdn.net/ (queue crawl delay=5000ms)

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

fetching http://www.itpub.net/ (queue crawl delay=5000ms)

-finishing thread FetcherThread47, activeThreads=48

-finishing thread FetcherThread46, activeThreads=47

-finishing thread FetcherThread45, activeThreads=46

-finishing thread FetcherThread44, activeThreads=45

-finishing thread FetcherThread43, activeThreads=44

-finishing thread FetcherThread42, activeThreads=43

-finishing thread FetcherThread41, activeThreads=42

-finishing thread FetcherThread40, activeThreads=41

-finishing thread FetcherThread39, activeThreads=40

-finishing thread FetcherThread38, activeThreads=39

-finishing thread FetcherThread37, activeThreads=38

-finishing thread FetcherThread36, activeThreads=37

-finishing thread FetcherThread35, activeThreads=36

-finishing thread FetcherThread34, activeThreads=35

-finishing thread FetcherThread33, activeThreads=34

-finishing thread FetcherThread32, activeThreads=33

-finishing thread FetcherThread31, activeThreads=32

-finishing thread FetcherThread30, activeThreads=31

-finishing thread FetcherThread29, activeThreads=30

-finishing thread FetcherThread48, activeThreads=29

-finishing thread FetcherThread27, activeThreads=29

-finishing thread FetcherThread26, activeThreads=28

-finishing thread FetcherThread25, activeThreads=27

-finishing thread FetcherThread24, activeThreads=26

-finishing thread FetcherThread23, activeThreads=25

-finishing thread FetcherThread22, activeThreads=24

-finishing thread FetcherThread21, activeThreads=23

-finishing thread FetcherThread20, activeThreads=22

-finishing thread FetcherThread19, activeThreads=21

-finishing thread FetcherThread18, activeThreads=20

-finishing thread FetcherThread17, activeThreads=19

-finishing thread FetcherThread16, activeThreads=18

-finishing thread FetcherThread15, activeThreads=17

-finishing thread FetcherThread14, activeThreads=16

-finishing thread FetcherThread13, activeThreads=15

-finishing thread FetcherThread12, activeThreads=14

-finishing thread FetcherThread11, activeThreads=13

-finishing thread FetcherThread10, activeThreads=12

-finishing thread FetcherThread9, activeThreads=11

-finishing thread FetcherThread8, activeThreads=10

-finishing thread FetcherThread7, activeThreads=9

-finishing thread FetcherThread5, activeThreads=8

-finishing thread FetcherThread4, activeThreads=7

-finishing thread FetcherThread3, activeThreads=6

-finishing thread FetcherThread2, activeThreads=5

-finishing thread FetcherThread49, activeThreads=4

-finishing thread FetcherThread6, activeThreads=3

-finishing thread FetcherThread28, activeThreads=2

-finishing thread FetcherThread0, activeThreads=1

fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null

-finishing thread FetcherThread1, activeThreads=0

0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

Parsing :

ParserJob: starting

ParserJob: resuming:    false

ParserJob: forced reparse:      false

ParserJob: batchId:     1404787293-26339

Parsing http://www.csdn.net/

http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561

Parsing http://www.itpub.net/

ParserJob: success

CrawlDB update for csdnitpub

DbUpdaterJob: starting

DbUpdaterJob: done

Indexing csdnitpub on SOLR index -> http://ip:8983/solr/

SolrIndexerJob: starting

SolrIndexerJob: done.

SOLR dedup -> http://ip:8983/solr/

Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5

Generating batchId

Generating a new fetchlist

GeneratorJob: starting at 2014-07-08 10:42:19

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1404787338-30453

Fetching :

FetcherJob: starting

FetcherJob: batchId: 1404787338-30453

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1404798146676

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0
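
In the log above, each iteration prints a generated batch id, and the fetcher then reports how many records the QueueFeeder queued (2 in the first iteration, 0 in the second). A minimal sketch for pulling those two values out of a saved console log follows. The function name and the returned dictionary layout are hypothetical; the regular expressions are based only on the lines shown above.

```python
import re
from typing import Dict, Iterable, List, Optional

BATCH_ID = re.compile(r"GeneratorJob: generated batch id: (\S+)")
QUEUE_FEEDER = re.compile(r"QueueFeeder finished: total (\d+) records")


def summarize_log(lines: Iterable[str]) -> List[Dict[str, Optional[object]]]:
    """Collect one entry per iteration: its batch id and queued record count."""
    iterations: List[Dict[str, Optional[object]]] = []
    for line in lines:
        match = BATCH_ID.search(line)
        if match:
            # A new batch id marks the start of a new iteration's fetch phase.
            iterations.append({"batch_id": match.group(1), "queued": None})
            continue
        match = QUEUE_FEEDER.search(line)
        if match and iterations:
            iterations[-1]["queued"] = int(match.group(1))
    return iterations
```

An iteration whose queued count is 0 had nothing to fetch, which is a quick way to spot where a crawl stopped making progress.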

II. Crawling step by step with individual commands

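A Nutch 1.x crawl is stored as a crawlDb, a linkDb, and a set of segments; in this Nutch 2.2.1 setup the crawl data is instead kept in HBase through Gora.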

1. InjectorJob

This step injects the URLs listed in seed.txt into the crawl queue to initialize it.

(1) Basic command

[root@jediael local]# bin/nutch inject urls/
InjectorJob: starting at 2014-08-15 21:17:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 2
InjectorJob: total number of urls injected after normalization and filtering: 3
Injector: finished at 2014-08-15 21:17:06, elapsed: 00:00:05

The contents of urls/seed.txt are:

http://money.163.com/

http://www.hexun.com/
http://www.gw.com.cn/

(2) Inspecting the injected URLs

The step above creates a new table in HBase named test_1_webpage, and the data for each URL is written to this table.

hbase(main):007:0> scan 'test_1_webpage'

ROW                    COLUMN+CELL
 cn.com.gw.www:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 cn.com.gw.www:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 cn.com.gw.www:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 cn.com.gw.www:http/   column=mk:dist, timestamp=1408086716518, value=0
 cn.com.gw.www:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 cn.com.gw.www:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
 com.163.money:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 com.163.money:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 com.163.money:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 com.163.money:http/   column=mk:dist, timestamp=1408086716518, value=0
 com.163.money:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 com.163.money:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
 com.hexun.www:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 com.hexun.www:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 com.hexun.www:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 com.hexun.www:http/   column=mk:dist, timestamp=1408086716518, value=0
 com.hexun.www:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 com.hexun.www:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
3 row(s) in 0.1100 seconds
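
The cell values in the scan output are raw bytes printed by the HBase shell, with non-printable bytes shown as \xHH escapes. The sketch below decodes three of them. It assumes, based on the default Gora HBase mapping for Nutch 2.x, that f:fi is the fetch interval (a big-endian 32-bit integer, in seconds), f:ts is the fetch time (a big-endian 64-bit integer, in milliseconds since the epoch), and s:s is the score (a big-endian 32-bit float). The function names are hypothetical.

```python
import re
import struct

_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def parse_hbase_escaped(text: str) -> bytes:
    """Turn HBase shell output such as \\x00'\\x8D\\x00 back into raw bytes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Characters between escapes are printable bytes shown literally.
        out += text[pos:match.start()].encode("latin-1")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


def decode_fetch_interval(cell: str) -> int:
    # Assumed layout: 4-byte big-endian signed integer (seconds).
    return struct.unpack(">i", parse_hbase_escaped(cell))[0]


def decode_fetch_time(cell: str) -> int:
    # Assumed layout: 8-byte big-endian signed integer (epoch milliseconds).
    return struct.unpack(">q", parse_hbase_escaped(cell))[0]


def decode_score(cell: str) -> float:
    # Assumed layout: 4-byte big-endian IEEE 754 float.
    return struct.unpack(">f", parse_hbase_escaped(cell))[0]


print(decode_fetch_interval(r"\x00'\x8D\x00"))            # 2592000
print(decode_fetch_time(r'\x00\x00\x01G\xD8\x82\x1B"'))   # 1408086711074
print(decode_score(r"?\x80\x00\x00"))                     # 1.0
```

Under these assumptions, every freshly injected URL has a fetch interval of 2592000 seconds (30 days), a fetch time close to the moment of injection, and an initial score of 1.0.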

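(3) About the **_webpage table

Each crawl job gets its own table named <crawlId>_webpage. Information about every URL, whether already fetched or not yet fetched, is stored in this table.

A URL that has not been fetched yet has only a few columns in its row. Once a URL has been fetched, the fetched data, such as the page content, is written to the same row.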
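
The row keys in the webpage table are not the URLs themselves: in the scan output, http://www.gw.com.cn/ appears as cn.com.gw.www:http/, that is, the host with its labels reversed, then the protocol, then the path. The following is a simplified sketch of that conversion in both directions. The function names are hypothetical, and only the cases visible in this article (plus an optional port) are handled, so it is not a substitute for Nutch's own utility code.

```python
from urllib.parse import urlsplit


def url_to_row_key(url: str) -> str:
    """http://www.gw.com.cn/ -> cn.com.gw.www:http/"""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


def row_key_to_url(key: str) -> str:
    """cn.com.gw.www:http/ -> http://www.gw.com.cn/"""
    slash = key.find("/")
    head, path = (key, "") if slash == -1 else (key[:slash], key[slash:])
    fields = head.split(":")
    host = ".".join(reversed(fields[0].split(".")))
    scheme = fields[1]
    port = f":{fields[2]}" if len(fields) > 2 else ""
    return f"{scheme}://{host}{port}{path}"
```

Reversing the host keeps pages of the same domain next to each other in HBase's sorted key space, which makes per-domain scans cheap.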

2. GeneratorJob

(1) Basic command

[root@jediael local]# bin/nutch generate -crawlId test_2

GeneratorJob: starting at 2014-08-15 21:24:49

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: true

GeneratorJob: normalizing: true

GeneratorJob: finished at 2014-08-15 21:24:55, time elapsed: 00:00:05

GeneratorJob: generated batch id: 1408109089-403376773

(2) Command options

[root@jediael local]# bin/nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]

 -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE

   -crawlId <id>  - the id to prefix the schemas to operate on, default: storage.crawl.id)");

   -noFilter      - do not activate the filter plugin to filter the url, default is true

    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true 

    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId       - the batch id 

----------------------

Please set the params.
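
The -topN and -adddays options, together with the log line "Selecting best-scoring urls due for fetch", suggest the following selection rule: keep the pages whose fetch time has been reached (after shifting the current time forward by numDays), then take the N highest-scoring ones. The sketch below models only that reading of the option text. The Page type and function names are hypothetical, and the real GeneratorJob works through a fetch schedule and MapReduce and applies further conditions not modeled here.

```python
from typing import List, NamedTuple, Optional

DAY_MS = 24 * 60 * 60 * 1000


class Page(NamedTuple):
    url: str
    fetch_time_ms: int  # next scheduled fetch, epoch milliseconds
    score: float


def is_due(page: Page, now_ms: int, add_days: int = 0) -> bool:
    # -adddays shifts "now" forward, so pages scheduled later become due.
    return page.fetch_time_ms <= now_ms + add_days * DAY_MS


def select_for_fetch(pages: List[Page], now_ms: int,
                     top_n: Optional[int] = None,
                     add_days: int = 0) -> List[Page]:
    """Return due pages, best score first, limited to top_n if given."""
    due = [p for p in pages if is_due(p, now_ms, add_days)]
    due.sort(key=lambda p: p.score, reverse=True)
    return due if top_n is None else due[:top_n]
```

With add_days left at 0, a page fetched today with a 30-day interval would not be selected again until that interval has passed; passing add_days=30 would make it eligible immediately.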

3. FetcherJob

(1) Basic command

[root@jediael local]# bin/nutch fetch -all -crawlId test_2

FetcherJob: starting

FetcherJob: fetching all

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 10

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : -1

Using queue mode : byHost

Fetcher: threads: 10

QueueFeeder finished: total 3 records. Hit by time limit :0

fetching http://www.gw.com.cn/ (queue crawl delay=5000ms)

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

fetching http://www.hexun.com/ (queue crawl delay=5000ms)

-finishing thread FetcherThread2, activeThreads=8

-finishing thread FetcherThread7, activeThreads=7

-finishing thread FetcherThread6, activeThreads=6

-finishing thread FetcherThread5, activeThreads=5

-finishing thread FetcherThread4, activeThreads=4

-finishing thread FetcherThread3, activeThreads=3

fetching http://money.163.com/ (queue crawl delay=5000ms)

-finishing thread FetcherThread9, activeThreads=3

-finishing thread FetcherThread1, activeThreads=2

-finishing thread FetcherThread0, activeThreads=1

-finishing thread FetcherThread8, activeThreads=0

0/0 spinwaiting/active, 3 pages, 0 errors, 0.6 1 pages/s, 307 307 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done
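
The fetcher's status line (for example "0/0 spinwaiting/active, 3 pages, 0 errors, ...") is the quickest way to check how a fetch went. Below is a minimal sketch for turning it into a dictionary. The field names are hypothetical; in particular, reading the two pages/s and kb/s numbers as an overall average followed by a recent-interval value is an assumption about the log format.

```python
import re
from typing import Dict, Optional, Union

STATUS = re.compile(
    r"(?P<spinwaiting>\d+)/(?P<active>\d+) spinwaiting/active, "
    r"(?P<pages>\d+) pages, (?P<errors>\d+) errors, "
    r"(?P<pages_avg>[\d.]+) (?P<pages_recent>[\d.]+) pages/s, "
    r"(?P<kb_avg>[\d.]+) (?P<kb_recent>[\d.]+) kb/s, "
    r"(?P<urls>\d+) URLs in (?P<queues>\d+) queues"
)


def parse_status(line: str) -> Optional[Dict[str, Union[int, float]]]:
    """Parse a fetcher status line; return None if the line does not match."""
    match = STATUS.search(line)
    if match is None:
        return None
    # Keep integers as int and values with a decimal point as float.
    return {name: float(value) if "." in value else int(value)
            for name, value in match.groupdict().items()}
```

For instance, a nonzero errors count points back to a "fetch of ... failed with" line earlier in the same log.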

4. ParserJob

(1) Basic command

[root@jediael local]# bin/nutch parse  -all -crawlId test_2

ParserJob: starting

ParserJob: resuming:    false

ParserJob: forced reparse:      false

ParserJob: parsing all

Parsing http://www.gw.com.cn/

Parsing http://money.163.com/

Parsing http://www.hexun.com/

ParserJob: success

(2) Command arguments

[root@jediael local]# bin/nutch parse 

Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]

    <batchId>     - symbolic batch ID created by Generator

    -crawlId <id> - the id to prefix the schemas to operate on, 

                    (default: storage.crawl.id)

    -all          - consider pages from all crawl jobs

    -resume       - resume a previous incomplete job

    -force        - force re-parsing even if a page is already parsed
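
The usage text above says that a parse run targets either one batch id or -all, with -crawlId, -resume and -force as optional extras. A small sketch that assembles such a command line from those rules follows. The helper name and its parameters are hypothetical; only flags that appear in the usage output are emitted.

```python
from typing import List, Optional


def parse_command(batch_id: Optional[str] = None,
                  crawl_id: Optional[str] = None,
                  resume: bool = False,
                  force: bool = False) -> List[str]:
    """Build a 'bin/nutch parse' argument list from the usage string."""
    # The first argument is mandatory: a batch id, or -all when none is given.
    argv = ["bin/nutch", "parse", batch_id if batch_id else "-all"]
    if crawl_id:
        argv += ["-crawlId", crawl_id]
    if resume:
        argv.append("-resume")
    if force:
        argv.append("-force")
    return argv
```

Passing the batch id printed by GeneratorJob restricts parsing to that batch, while omitting it reproduces the -all command used in this article.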

5. DbUpdaterJob

(1) Basic command

[root@jediael local]# bin/nutch updatedb

DbUpdaterJob: starting

DbUpdaterJob: done

6. SolrIndexerJob

(1) Basic command

[root@jediael local]# bin/nutch solrindex http://182.92.160.44:8583/solr/ -crawlId test_2

SolrIndexerJob: starting

SolrIndexerJob: done.

(2) Command arguments

[root@jediael local]# bin/nutch solrindex
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]