Scraping the Baidu homepage logo with Java
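- Two methods
- One, getUrlContentString, fetches the page source for a URL; the other, getLogoUrl, extracts the wanted address fragment from that source, using a regular expression to match it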
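- Getting the page source:
- The address starts out as a String and is converted into a Java URL object
- URL's openConnection method returns a URLConnection
- URLConnection's connect method establishes the connection
- Create an InputStreamReader. Its constructor needs an InputStream, and URLConnection's getInputStream method returns one, so the two can be chained
- Wrap the InputStreamReader in a BufferedReader
- Read the page source line by line from the BufferedReader and append each line to the result string; result then holds the page source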
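- Matching the logo address
- Pattern pattern = Pattern.compile(patternString);
- java.util.regex: the Java library package for matching strings against patterns defined by regular expressions
Its two main classes are Pattern and Matcher.
Pattern: the compiled form of a regular expression.
Matcher: matches a Pattern against an input string.
- Pattern's compile method: compiles the given regular expression string into a pattern
- Matcher matcher = pattern.matcher(contentString);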
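The matching step can be tried on its own, without fetching anything. The following is a minimal sketch under stated assumptions: the class name LogoPatternDemo, the helper firstGroup, and the HTML fragment are all made up for illustration, and the fragment is not the real Baidu markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogoPatternDemo {
    // Hypothetical helper: returns the text captured by group 1 of the
    // first match, or null when the pattern does not match at all.
    static String firstGroup(String content, String patternString) {
        Pattern pattern = Pattern.compile(patternString);
        Matcher matcher = pattern.matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up HTML fragment used only to exercise the pattern.
        String html = "<div><img id=\"logo\" src=\"/img/logo.png\" width=\"270\"></div>";
        // (.+?) is a lazy group: it stops at the first closing quote after src=".
        System.out.println(firstGroup(html, "src=\"(.+?)\""));
    }
}
```

Because find() returns only the first match, this approach picks up whichever src attribute appears first in the input, so it relies on the logo being the first image in the page.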
package com.test;

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class baidulogo {
    // Downloads the page at urlString and returns its source as one string.
    static String getUrlContentString(String urlString) throws Exception {
        StringBuilder result = new StringBuilder();
        URL url = new URL(urlString);
        URLConnection urlConnection = url.openConnection();
        urlConnection.connect();
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConnection.getInputStream(), "utf-8"))) {
            String line;
            // readLine() strips line terminators, so lines are joined without them.
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String getLogoUrl(String contentString, String patternString) {
        String LogoUrl = null;
        Pattern pattern = Pattern.compile(patternString);
        Matcher matcher = pattern.matcher(contentString);
        if (matcher.find()) {
            LogoUrl = matcher.group(1);
        }
        return LogoUrl;
    }

    public static void main(String[] args) throws Exception {
        // The link to visit
        String urlString = "http://www.baidu.com";
        String result = getUrlContentString(urlString);
        // Capture the text between the quotes of the first src="..." attribute.
        String patternString = "src=\"(.+?)\"";
        String contentString = result;
        String logoUrl = getLogoUrl(contentString, patternString);
        System.out.println(logoUrl);
    }
}
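The stream-wrapping steps (InputStream, then InputStreamReader, then BufferedReader) can also be tried without network access. This is a rough sketch, not the author's code: the class ReadLinesDemo and the helper readAll are hypothetical, and a ByteArrayInputStream stands in for the stream that URLConnection.getInputStream() would return.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadLinesDemo {
    // Hypothetical helper: decodes the stream as UTF-8 and joins its lines.
    // readLine() drops line terminators, so the result has no line breaks.
    static String readAll(InputStream input) {
        StringBuilder result = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // In-memory bytes stand in for a downloaded page.
        byte[] page = "<html>\n<body>hi</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

Taking an InputStream instead of a URL keeps the read loop separate from the connection setup, so the same loop works for a network stream or a local test fixture.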