Web Crawler 2: Crawling Web Content with crawler4j
Two jars are required:
crawler4j-4.1-jar-with-dependencies.jar
slf4j-simple-1.7.22.jar (without it you get the warning: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".)
Jar downloads:
http://download.csdn.net/detail/talkwah/9747407
(Material on crawler4j-4.1-jar-with-dependencies.jar is scarce and downloading it from GitHub kept failing, so the jars were bundled at the link above.)
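If your project uses Maven, the two dependencies can also be declared in the POM instead of downloading jars by hand. A sketch, assuming the Maven Central coordinates for these versions (groupId `edu.uci.ics` for crawler4j):

```xml
<!-- crawler4j core (pulls in its own transitive dependencies) -->
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>
<!-- slf4j-simple binds SLF4J so the StaticLoggerBinder warning goes away -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.22</version>
</dependency>
```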
References:
http://blog.csdn.net/zjm131421/article/details/13093869
http://favccxx.blog.51cto.com/2890523/1691079/
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Three questions a crawler answers:
    //   - Which URLs to visit?
    //   - How to visit them?
    //   - What to do with a page once it is fetched?
    private static final String C_URL = "http://www.ximalaya.com";

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Skip resources ending in .mp3, .jpg or .png.
        // Note: href has already been lowercased, so the pattern must use
        // lowercase extensions (the original "MP3" could never match).
        Pattern p = Pattern.compile(".*(\\.(mp3|jpg|png))$");
        return !p.matcher(href).matches() && href.startsWith(C_URL);
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();
        System.out.println("********************************");
        System.out.println("URL        : " + url);
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);
        logger.info("URL: {}", url);
        logger.debug("Parent page: {}", parentUrl);
        logger.debug("Anchor text: {}", anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();
            System.out.println("--------------------------");
            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }

    public static void main(String[] args) throws Exception {
        // In the upstream example these two values are command-line arguments.
        // The storage folder acts like a temp directory and does not need to
        // be created in advance.
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig crawlConf = new CrawlConfig();
        crawlConf.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(crawlConf);
        RobotstxtConfig robotConf = new RobotstxtConfig();
        RobotstxtServer robotServ = new RobotstxtServer(robotConf, pageFetcher);

        // Controller
        CrawlController c = new CrawlController(crawlConf, pageFetcher, robotServ);

        // Add the seed URL
        c.addSeed(C_URL);

        // Start the crawl (blocks until it finishes)
        c.start(MyCrawler.class, numberOfCrawlers);
    }
}