(2) hbase-0.90.4
(1)vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml
<property> <name>storage.data.store.class</name> <value>org.apache.gora.hbase.store.HBaseStore</value> <description>Default class for storing data</description> </property>
(2)vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
(3)vi /usr/search/apache-nutch-2.2.1/conf/gora.properties
cd /usr/search/apache-nutch-2.2.1/
ant runtime
[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject          inject new urls into the database
  hostinject      creates or updates an existing host table from a text file
  generate        generate new batches to fetch from crawl db
  fetch           fetch URLs marked during generate
  parse           parse URLs marked during fetch
  updatedb        update web table after parsing
  updatehostdb    update host table after parsing
  readdb          read/dump records from page database
  readhostdb      display entries from the hostDB
  elasticindex    run the elasticsearch indexer
  solrindex       run the solr indexer on parsed batches
  solrdedup       remove duplicates from solr
  parsechecker    check the parser for a given url
  indexchecker    check the indexing filters for a given url
  plugin          load a plugin and run one of its classes main()
  nutchserver     run a (local) Nutch server on a user defined port
  junit           runs the given JUnit test
  CLASSNAME       run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(6)Add the crawl task: vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml
<property> <name>http.agent.name</name> <value>My Nutch Spider</value> </property>
cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
vi seed.txt
(8)Modify the URL filter: vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt
# accept anything else
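For orientation: each rule in regex-urlfilter.txt is a `+` (accept) or `-` (reject) followed by a regular expression. Rules are tried from top to bottom, the first one that matches decides, and a URL that matches no rule is dropped. The Python sketch below mimics that behaviour. The rule list and the example.com domain are hypothetical (they are not the contents of the file above), and Python's `re` module only approximates the Java regex engine that Nutch uses.

```python
import re

# Hypothetical rules written in regex-urlfilter.txt syntax (illustration only).
RULES_TEXT = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip common image suffixes
-\.(gif|jpg|png)$
# accept anything under the (hypothetical) example.com domain
+^https?://([a-z0-9-]+\.)*example\.com/
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first matching rule; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
```

With these hypothetical rules, `accepts("http://www.example.com/a.html", rules)` is accepted, while an image URL on the same host is rejected by the earlier `-` rule.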
(1)vi /usr/search/hbase-0.90.4/conf/hbase-site.xml
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<property> <name>hbase.rootdir</name> <value>file:///tmp/hbase-${user.name}/hbase</value> <description>The directory shared by region servers and into which HBase persists. The URL should be 'fully-qualified' to include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the HDFS instance's namenode is running at namenode.example.org on port 9000, set this value to: hdfs://namenode.example.org:9000/hbase. By default HBase writes into /tmp. Change this configuration else all data will be lost on machine restart. </description> </property>
In other words, by default the data is stored under /tmp, so it may be lost if the machine restarts.
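A quick way to catch this before starting HBase is to read hbase.rootdir back out of the site file and check whether it still points under /tmp. The following is a minimal sketch: the helper names `get_property` and `is_volatile_rootdir` are made up for this example, and the sample assumes the property sits inside the usual `<configuration>` root element.

```python
import xml.etree.ElementTree as ET

# Sample file content; assumes the usual <configuration> root element.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///tmp/hbase-${user.name}/hbase</value>
  </property>
</configuration>"""


def get_property(site_xml, name):
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def is_volatile_rootdir(value):
    """True if the root directory lives under /tmp on the local filesystem."""
    return value.startswith("file:///tmp/") or value.startswith("/tmp/")


rootdir = get_property(SAMPLE_SITE_XML, "hbase.rootdir")
if rootdir is not None and is_volatile_rootdir(rootdir):
    print("warning: hbase.rootdir is under /tmp:", rootdir)
```

To check a real installation, read the file content from /usr/search/hbase-0.90.4/conf/hbase-site.xml and pass it in place of `SAMPLE_SITE_XML`.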
cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/
Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
Add: <field name="_version_" type="long" indexed="true" stored="true"/>
[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh
(2)Start Solr
[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar
(3)Start Nutch and begin the crawl task
[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2
That's it; the crawl task starts running.
Crawling the web has already been explained above. To crawl more sites, add their URLs to the seed.txt file and run the crawl again.
When a user invokes a crawl command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory holding the details of the crawl. In Apache Nutch 2.x there is no CrawlDB. Instead, Nutch keeps all crawl data directly in the configured data store. In our case that store is Apache HBase, so all crawl data goes into HBase. The following describes how each crawling function works.
A crawling cycle has four steps, each implemented as a Hadoop MapReduce job:
• GeneratorJob
• FetcherJob
• ParserJob (optionally done while fetching, using 'fetcher.parse')
• DbUpdaterJob
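To make the order of these four jobs concrete, here is a toy, in-memory model of one crawl round. It is not Nutch code: the dictionary `table` merely stands in for the data store, the `pages` dictionary is a fake web whose "content" is just a whitespace-separated list of outlinks, and all function names and the example URLs are invented for this sketch.

```python
# Toy model of a crawl round; illustrative only, not how Nutch is implemented.

def generate(table, batch_id, top_n):
    """GeneratorJob's role: mark up to top_n unfetched rows with a batch id."""
    picked = [url for url, row in sorted(table.items())
              if row["status"] == "unfetched"][:top_n]
    for url in picked:
        table[url]["batch"] = batch_id
    return picked


def fetch(table, batch_id, pages):
    """FetcherJob's role: store the raw content of every row in the batch."""
    for url, row in table.items():
        if row.get("batch") == batch_id:
            row["content"] = pages.get(url, "")
            row["status"] = "fetched"


def parse(table, batch_id):
    """ParserJob's role: extract outlinks from the fetched content."""
    for row in table.values():
        if row.get("batch") == batch_id and row["status"] == "fetched":
            row["outlinks"] = row["content"].split()


def update_db(table, batch_id):
    """DbUpdaterJob's role: add newly discovered URLs as unfetched rows."""
    discovered = set()
    for row in table.values():
        if row.get("batch") == batch_id:
            discovered.update(row.get("outlinks", []))
    for url in discovered:
        table.setdefault(url, {"status": "unfetched"})


def crawl_round(table, batch_id, pages, top_n=10):
    generate(table, batch_id, top_n)
    fetch(table, batch_id, pages)
    parse(table, batch_id)
    update_db(table, batch_id)


# Fake web and a single seed row (hypothetical URLs).
pages = {
    "http://a.example/": "http://b.example/ http://c.example/",
    "http://b.example/": "http://a.example/",
}
table = {"http://a.example/": {"status": "unfetched"}}

crawl_round(table, "batch-1", pages)
after_round_1 = {url: row["status"] for url, row in table.items()}
crawl_round(table, "batch-2", pages)
after_round_2 = {url: row["status"] for url, row in table.items()}
```

After the first round only the seed is fetched and its two outlinks are waiting as unfetched rows; the second round fetches those, which is why each additional round reaches one link further from the seeds.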
Additionally, the following processes need to be understood:
• InjectorJob
• Invertlinks
• Indexing with Apache Solr
First, the job of the Injector is to populate the initial rows of the web table. We run the InjectorJob with a set of seed URLs, and it inserts them into the crawl store (in Nutch 2.x this is a table in the data store, not a crawlDB directory).
The GeneratorJob then picks up these injected URLs and selects the batch to fetch. The table that these jobs read from and write to is called webpage; every row in it is a URL (web page). The row key is the URL with its host components reversed, so that URLs from the same TLD and domain sort next to each other and form a group. In most NoSQL stores row keys are kept sorted, which is an advantage here: a scan restricted to a row-key range or prefix is faster than a scan over the entire table. The following are examples of row keys:
• org.apache.www:http/
• org.apache.gora:http/
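The row keys listed above can be reproduced with a few lines of Python. This is a sketch of the idea only, not Nutch's own implementation: `reverse_url` and `scan_prefix` are names chosen for this example, the exact key layout (reversed host, then scheme, optional port, and path) is inferred from the examples, and a sorted Python list stands in for the sorted row keys of the data store.

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key with reversed host components, e.g. org.apache.www:http/."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    key = ".".join(reversed(host.split("."))) + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix without scanning the whole list."""
    start = bisect_left(sorted_keys, prefix)  # jump straight to the first candidate
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing further can match
        matches.append(key)
    return matches


urls = [
    "http://www.apache.org/",
    "http://example.com/index.html",  # hypothetical extra URL
    "http://gora.apache.org/",
]
row_keys = sorted(reverse_url(u) for u in urls)
apache_rows = scan_prefix(row_keys, "org.apache.")
```

Because both apache.org hosts share the prefix `org.apache.`, they end up adjacent in `row_keys`, and the prefix scan stops as soon as it leaves that group.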
Let's look at each step in depth so that we can understand crawling step by step.
In Apache Nutch 1.x, crawl data is kept in three main directories: crawlDB, linkdb, and a set of segments. crawlDB contains information about every URL known to Apache Nutch, including, if the URL has been fetched, when it was fetched. The link database (linkdb) contains the known links to each URL, including the source URL and the anchor text of each link. A segment is a set of URLs that is fetched as a unit. Each segment directory contains the following subdirectories:
• crawl_generate names the set of URLs to be fetched
• crawl_fetch contains the status of fetching each URL
• content contains the raw content retrieved from each URL
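As a small illustration of this layout, the sketch below reports which of the three subdirectories named above exist in a segment directory. The function name `segment_status` and the segment name used in the demo are hypothetical, and the demo builds a throwaway directory rather than touching a real crawl.

```python
import tempfile
from pathlib import Path

# Only the subdirectories named in the text above; a real segment may hold more.
EXPECTED_SUBDIRS = ("crawl_generate", "crawl_fetch", "content")


def segment_status(segment_dir):
    """Map each expected subdirectory name to whether it exists in the segment."""
    base = Path(segment_dir)
    return {name: (base / name).is_dir() for name in EXPECTED_SUBDIRS}


# Demo on a temporary directory: a segment that has been generated but not fetched.
with tempfile.TemporaryDirectory() as tmp:
    segment = Path(tmp) / "20140101000000"  # hypothetical segment name
    (segment / "crawl_generate").mkdir(parents=True)
    demo_status = segment_status(segment)
```

A segment where only `crawl_generate` is present has been generated but not yet fetched, which is what `demo_status` shows.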
Now let's understand each crawling job in detail.