Nutch HTTP file truncation problem


2024-07-20 13:21:08, 218 views

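Problem:

The list page was expected to yield 355 + 6 = 361 links, but only 220 were actually extracted. The cause is that Nutch limits the length of content downloaded over HTTP, so the page was truncated before all of the links had been read.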
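To see why a byte limit costs links, the following minimal Python sketch builds a synthetic list page, cuts it off at a given number of bytes, and counts the anchors that survive. It is only an illustration under stated assumptions: the page content and the helpers `build_page`, `truncate` and `count_links` are hypothetical, and Nutch's own fetcher and parser differ in detail.

```python
import re


def build_page(n_links: int) -> bytes:
    """Build a synthetic list page with n_links anchors (hypothetical test data)."""
    items = "".join(
        f'<li><span><a href="/news/{i}.html">Article number {i}</a></span></li>\n'
        for i in range(n_links)
    )
    return f"<html><body><ul>\n{items}</ul></body></html>".encode("utf-8")


def truncate(content: bytes, limit: int) -> bytes:
    """Apply the documented rule: a non-negative limit truncates, a negative one does not."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


def count_links(content: bytes) -> int:
    """Count complete <a href="..."> opening tags in the (possibly truncated) content."""
    return len(re.findall(rb'<a href="[^"]*">', content))


page = build_page(361)
small_limit = len(page) // 2  # stand-in for a limit that is too small for this page

print(count_links(truncate(page, small_limit)))  # fewer than 361
print(count_links(truncate(page, -1)))           # all 361
```

Links located after the cut-off point never reach the parser, which matches the symptom above: part of the list is extracted and the rest is silently missing.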

Solution:
Increase this property tenfold.
vim conf/nutch-default.xml
Change the http.content.limit property from 65536 to 655360:
<property>
  <name>http.content.limit</name>
  <value>655360</value>
  <!-- Raised on purpose: some HTML pages really are quite large. -->
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
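A quick way to check whether a page would be cut off is to compare its size with the configured value. The sketch below parses a Hadoop-style configuration snippet with Python's standard library and applies the rule from the property description. The function names `read_limit` and `would_truncate` are hypothetical, and the XML is inlined for illustration rather than read from a real Nutch installation.

```python
import xml.etree.ElementTree as ET

# Inlined stand-in for a Nutch configuration file (assumption for this sketch).
CONFIG_XML = """
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>655360</value>
  </property>
</configuration>
"""


def read_limit(xml_text: str, name: str = "http.content.limit") -> int:
    """Return the integer value of the named property from the configuration XML."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return int(prop.findtext("value", "").strip())
    raise KeyError(name)


def would_truncate(content_length: int, limit: int) -> bool:
    """A non-negative limit truncates longer content; a negative limit never truncates."""
    return limit >= 0 and content_length > limit


limit = read_limit(CONFIG_XML)
print(would_truncate(100_000, 65536))  # True: larger than the previous value
print(would_truncate(100_000, limit))  # False: fits within 655360 bytes
```

In a typical Nutch setup, overrides are usually placed in conf/nutch-site.xml rather than edited into nutch-default.xml, so the same property block can be copied there instead.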
After the change, all links are extracted as expected:
//div[@class='com_page']/ul/li/span/a/@href
extract 355 outlinks
//div[@class='page_link']/a/@href
extract 6 outlinks
found 361 outlinks in http://www.ly.com/news/scenery.html
