Getting Started with Web Crawlers
1.0 Example study: a Web crawler
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;

public class WebCrawler {
    // Seed URL
    private static String url = "http://www.cnblogs.com/";

    public static void main(String[] args) {
        ArrayList<String> list = crawler(url);
        System.out.println("Length of listOfPendingURLs: " + list.size());
    }

    /**
     * Crawl up to 100 URLs starting from the seed URL and return
     * the URLs that are still pending when the crawl stops.
     */
    public static ArrayList<String> crawler(String startingURL) {
        ArrayList<String> listOfPendingURLs = new ArrayList<String>();   // URLs waiting to be crawled
        ArrayList<String> listOfTraversedURLs = new ArrayList<String>(); // URLs already crawled
        listOfPendingURLs.add(startingURL);
        while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() <= 100) {
            String urlString = listOfPendingURLs.remove(0); // always take the first URL from the pending list
            if (!listOfTraversedURLs.contains(urlString)) {
                listOfTraversedURLs.add(urlString);
                System.out.println("Crawl " + urlString);
                // Collect every URL on this page and add the unseen ones to the pending list
                for (String s : getSubURLs(urlString)) {
                    if (!listOfTraversedURLs.contains(s)) {
                        listOfPendingURLs.add(s);
                    }
                }
            }
        }
        return listOfPendingURLs;
    }

    /**
     * Fetch the page at urlString and return all of its "http:" links as an ArrayList.
     */
    public static ArrayList<String> getSubURLs(String urlString) {
        ArrayList<String> list = new ArrayList<String>();
        try {
            URL url = new URL(urlString);
            @SuppressWarnings("resource")
            Scanner input = new Scanner(url.openStream());
            while (input.hasNextLine()) {
                String line = input.nextLine();
                int begin = line.indexOf("http:"); // search each line from the start
                while (begin >= 0) {
                    int end = line.indexOf("\"", begin);
                    if (end > 0) {
                        list.add(line.substring(begin, end));  // a link ends at the closing quote
                        begin = line.indexOf("http:", end);    // look for the next link on the same line
                    } else {
                        begin = -1; // no closing quote on this line, move on
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }
}
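One practical note on the example above: because both lists are ArrayLists, every contains() check scans the whole list and remove(0) shifts all remaining elements. For a crawl capped at 100 pages this hardly matters, but if you raise the limit, the usual fix is a FIFO queue for the pending URLs and a HashSet for the visited URLs. The sketch below is not part of the original example; the class name, method name, and the maxPages parameter are my own, and it assumes it can reuse the getSubURLs() method from the WebCrawler class above.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class WebCrawlerWithSet {
    /**
     * Same traversal as crawler(), but the pending URLs live in a queue and
     * the traversed URLs in a HashSet, so duplicate checks are O(1) on average.
     * Assumes WebCrawler.getSubURLs(String) from the example above is available.
     */
    public static Set<String> crawl(String startingURL, int maxPages) {
        Queue<String> pendingURLs = new ArrayDeque<String>();
        Set<String> traversedURLs = new HashSet<String>();
        pendingURLs.add(startingURL);
        while (!pendingURLs.isEmpty() && traversedURLs.size() <= maxPages) {
            String urlString = pendingURLs.remove();   // O(1) dequeue from the front
            if (traversedURLs.add(urlString)) {        // add() returns false if the URL was already seen
                System.out.println("Crawl " + urlString);
                ArrayList<String> subURLs = WebCrawler.getSubURLs(urlString);
                for (String s : subURLs) {
                    if (!traversedURLs.contains(s)) {  // O(1) average-case membership test
                        pendingURLs.add(s);
                    }
                }
            }
        }
        return traversedURLs;
    }

    public static void main(String[] args) {
        Set<String> visited = crawl("http://www.cnblogs.com/", 100);
        System.out.println("Pages crawled: " + visited.size());
    }
}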