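Crawling data with Nutch 2.2.1 + MySQL
Environment: Linux CentOS 6.5, Nutch 2.2.1 source package, MySQL 5.5, Elasticsearch 1.1.1, JDK 1.7
1. Download the source package from http://mirror.bjtu.edu.cn/apache/nutch/2.2.1/ and unpack it.
2. Switch the data store to MySQL
Edit ivy/ivy.xml under the Nutch root directory. The MySQL storage dependencies are commented out by default; uncomment them and use gora-core rev 0.2.1, so that the section reads: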
    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

    <!-- Uncomment this to use SQL as Gora backend. It should be noted that the
         gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3. Users should
         downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

    <!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
3. Edit conf/gora.properties
Comment out the default HSQLDB settings and add the MySQL settings, replacing ip, user and pwd with your own values:
    #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    #gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    #gora.sqlstore.jdbc.user=sa
    #gora.sqlstore.jdbc.password=

    # MySQL properties
    ################################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
    gora.sqlstore.jdbc.user=user
    gora.sqlstore.jdbc.password=pwd
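Because the HSQLDB and MySQL entries share the same key names, it is easy to leave a needed line commented out. The following is a minimal sketch of a check for that, assuming each active entry starts at the beginning of a line in the form key=value; the function name check_gora_props is hypothetical and not part of Nutch or Gora.

```shell
# Sketch: report which gora.sqlstore.jdbc.* keys are active (uncommented)
# in a properties file. check_gora_props is a hypothetical helper name.
check_gora_props() {
    props_file="$1"
    missing=0
    for key in gora.sqlstore.jdbc.driver gora.sqlstore.jdbc.url \
               gora.sqlstore.jdbc.user gora.sqlstore.jdbc.password; do
        # Assumption: an active entry begins the line with "key=".
        # Lines starting with '#' are comments and do not match.
        if grep -q "^${key}=" "$props_file"; then
            echo "ok: ${key}"
        else
            echo "missing: ${key}"
            missing=1
        fi
    done
    # Non-zero exit status when at least one key is not active.
    return "$missing"
}
```

It could be run as check_gora_props conf/gora.properties from the Nutch root directory. It only checks that the keys are present, not that the values are correct.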
4. Edit conf/nutch-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>My Spider</value>
      </property>
      <property>
        <name>http.accept.language</name>
        <value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
      </property>
      <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information is available</description>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      </property>
    </configuration>
5. Build the source with ant
Run ant in the Nutch root directory. The end of the output looks like this:
    job:
          [jar] Building jar: /home/hadoop/nutch221/build/apache-nutch-2.2.1.job

    runtime:
        [mkdir] Created dir: /home/hadoop/nutch221/runtime
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/local
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/deploy/bin
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib/native
         [copy] Copying 26 files to /home/hadoop/nutch221/runtime/local/conf
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/local/bin
         [copy] Copying 100 files to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 106 files to /home/hadoop/nutch221/runtime/local/plugins
         [copy] Copied 2 empty directories to 2 empty directories under /home/hadoop/nutch221/runtime/local/test

    BUILD SUCCESSFUL
    Total time: 41 seconds

The build succeeded.
6. Create the database and the webpage table
    CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

    USE nutch;

    CREATE TABLE `webpage` (
      `id` varchar(767) CHARACTER SET latin1 NOT NULL,
      `headers` blob,
      `text` mediumtext DEFAULT NULL,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
      `content` mediumblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
7. Run the crawl:
    bin/nutch crawl urls -depth 3
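The crawl command takes urls as its seed directory, but the steps here do not show how that directory was created. The following is a minimal sketch, assuming the directory holds a plain text file with one URL per line; the helper name make_seed_dir, the file name seed.txt and the example URL in the usage note are illustrative and not taken from the original setup.

```shell
# Sketch: create a seed directory containing one URL per line.
# make_seed_dir is a hypothetical helper; seed.txt is an assumed file name.
make_seed_dir() {
    if [ "$#" -lt 2 ]; then
        echo "usage: make_seed_dir DIR URL [URL...]" >&2
        return 1
    fi
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir"
    # printf repeats the format for every remaining argument,
    # so each URL ends up on its own line.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

For example, make_seed_dir urls http://nutch.apache.org/ would be run in the directory from which bin/nutch is invoked, before starting the crawl.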
When the crawl finishes, the fetched content can be viewed in MySQL.
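One way to look at the result is to query a few columns of the webpage table from step 6. The sketch below only prints the SQL statement so that it can be piped into the mysql client; the function name webpage_summary_sql and the choice of columns are assumptions made for illustration.

```shell
# Sketch: print a query that lists a few columns of crawled rows.
# webpage_summary_sql is a hypothetical helper; the columns come from
# the webpage table definition in step 6.
webpage_summary_sql() {
    limit="${1:-10}"
    # Accept only digits so the value is safe to place in the statement.
    case "$limit" in
        *[!0-9]*)
            echo "limit must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'SELECT id, status, title, fetchTime FROM webpage LIMIT %s;\n' "$limit"
}
```

A possible invocation is webpage_summary_sql 5 | mysql -u user -p nutch, where user and nutch are the account and database configured in conf/gora.properties.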
8. Run the indexing step:
    bin/nutch elasticindex clustername -all
Problem encountered: step 7 failed with the following exception:
    [hadoop@master bin]$ nutch crawl urls -depth 3
    Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:89)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:73)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
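A ClassNotFoundException for org.apache.gora.sql.store.SqlStore means the class was not on the runtime classpath. One possible cause, offered only as a diagnostic starting point and not as the fix from the linked mailing-list thread, is that the gora-sql and MySQL connector jars never reached the lib directory of the built runtime. The sketch below checks for them by file name; the function name check_store_jars is hypothetical, and the jar name prefixes are assumed from the ivy.xml dependencies in step 2.

```shell
# Sketch: report whether the SQL store jars are present in a lib directory.
# check_store_jars is a hypothetical helper; the name prefixes are assumed
# from the dependencies enabled in ivy/ivy.xml.
check_store_jars() {
    lib_dir="$1"
    status=0
    for prefix in gora-sql mysql-connector-java; do
        hit=""
        for jar in "$lib_dir/$prefix"-*.jar; do
            # An unmatched glob stays literal, so confirm the file exists.
            if [ -f "$jar" ]; then
                hit="$jar"
            fi
        done
        if [ -n "$hit" ]; then
            echo "found: $hit"
        else
            echo "missing: $prefix-*.jar"
            status=1
        fi
    done
    # Non-zero exit status when at least one jar is missing.
    return "$status"
}
```

With the build output shown in step 5, it could be run as check_store_jars /home/hadoop/nutch221/runtime/local/lib. If a jar is reported missing, rebuilding with ant after editing ivy/ivy.xml is a reasonable first thing to try.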
I followed this online post, but it did not solve the problem: http://blog.sina.com.cn/s/blog_3c9872d00101p4f0.html
Official solution:
http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%3CCAErFeLSwoZ2UhxMA1iYi7H-L52Ojo-j9KoWT7xDittBzvB0F0A@mail.gmail.com%3E
References:
http://nlp.solutions.asia/?p=362
https://issues.apache.org/jira/browse/NUTCH-1473