Web Crawler 2: Crawling Web Content with crawler4j
Two jars are required:
crawler4j-4.1-jar-with-dependencies.jar
slf4j-simple-1.7.22.jar (without it you get the warning: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".)
Jar downloads:
http://download.csdn.net/detail/talkwah/9747407
(Material on crawler4j-4.1-jar-with-dependencies.jar is scarce and downloading it from GitHub kept failing, so the jars were bundled at the link above.)
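If your project uses Maven, the two dependencies can also be declared in the POM instead of downloading jars by hand. A sketch, assuming the Maven Central coordinates for these versions (groupId `edu.uci.ics` for crawler4j):

```xml
<!-- crawler4j core (pulls in its own transitive dependencies) -->
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>
<!-- slf4j-simple binds SLF4J so the StaticLoggerBinder warning goes away -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.22</version>
</dependency>
```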
References:
http://blog.csdn.net/zjm131421/article/details/13093869
http://favccxx.blog.51cto.com/2890523/1691079/
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Three questions a crawler answers:
    //   - Which URLs to visit?
    //   - How to visit them?
    //   - What to do with a page once it is fetched?
    private static final String C_URL = "http://www.ximalaya.com";

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Skip resources ending in .mp3, .jpg or .png.
        // Note: href has already been lowercased, so the pattern must use
        // lowercase extensions (the original "MP3" could never match).
        Pattern p = Pattern.compile(".*(\\.(mp3|jpg|png))$");
        return !p.matcher(href).matches() && href.startsWith(C_URL);
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();
        System.out.println("********************************");
        System.out.println("URL        : " + url);
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);
        logger.info("URL: {}", url);
        logger.debug("Parent page: {}", parentUrl);
        logger.debug("Anchor text: {}", anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();
            System.out.println("--------------------------");
            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }

    public static void main(String[] args) throws Exception {
        // In the upstream example these two values are command-line arguments.
        // The storage folder acts like a temp directory and does not need to
        // be created in advance.
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig crawlConf = new CrawlConfig();
        crawlConf.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(crawlConf);
        RobotstxtConfig robotConf = new RobotstxtConfig();
        RobotstxtServer robotServ = new RobotstxtServer(robotConf, pageFetcher);

        // Controller
        CrawlController c = new CrawlController(crawlConf, pageFetcher, robotServ);

        // Add the seed URL
        c.addSeed(C_URL);

        // Start the crawl (blocks until it finishes)
        c.start(MyCrawler.class, numberOfCrawlers);
    }
}