
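Nutch collects fewer URLs than the site actually has

Symptom: about 500 URLs can be extracted from this site in total, but Nutch actually extracted only 100.

Explanation: by default, Nutch keeps only the first 100 links it parses out of a single page.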

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

Increase this value, for example to 1000.
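A minimal sketch of the change, assuming a standard Nutch layout in which site-specific settings are placed in conf/nutch-site.xml and take precedence over the defaults in conf/nutch-default.xml. The value 1000 is the one suggested above. According to the property description, a negative value such as -1 would remove the limit altogether.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Raise the per-page outlink limit from the default of 100. -->
  <!-- Assumption: this file is conf/nutch-site.xml; if it already has a -->
  <!-- <configuration> element, add only the <property> block to it. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
  </property>
</configuration>
```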

 
