[Repost] Nutch Source Code Study: The Page Fetching and Download Plugin

2024-07-18 23:06:37, 221 views

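Today we look at how the protocol-http plugin in the Nutch source code fetches and downloads web pages. protocol-http consists of just two classes, HttpResponse and Http. HttpResponse sends the request to the web server and reads the response, which is how the page gets downloaded. Http is very simple and is essentially a Facade over HttpResponse: it sets up the configuration and then creates an HttpResponse. Callers apparently only need to deal with the Http class (I have not read all of the code, so this is only a guess).
Let's look at the HttpResponse class.
Reading this class starts with its constructor:
public HttpResponse(HttpBase http, URL url, CrawlDatum datum) throws ProtocolException, IOException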

First it checks that the protocol is http:

    if (!"http".equals(url.getProtocol()))
      throw new HttpException("Not an HTTP url:" + url);

Next it gets the path: if url.getFile() is empty the path is "/", otherwise it is url.getFile():

    String path = "".equals(url.getFile()) ? "/" : url.getFile();

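Then it gets the host name and port from the URL. If the URL does not specify a port (url.getPort() returns -1), the port defaults to 80 and portString is "", so the request does not mention a port; otherwise it uses the port from the URL and sets portString to ":" + port:

    String host = url.getHost();
    int port;
    String portString;
    if (url.getPort() == -1) {
      port= 80;
      portString= "";
    } else {
      port= url.getPort();
      portString= ":" + port;
    }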
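To make the behavior of getFile() and getPort() concrete, here is a minimal standalone sketch. The class name UrlPartsSketch and the method hostPortPath are made up for illustration and are not part of Nutch; the logic simply mirrors the two snippets above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class UrlPartsSketch {
    /** Returns host + portString + path, derived the same way as in the constructor. */
    static String hostPortPath(String spec) {
        URL url;
        try {
            url = URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
        // getFile() is the path plus the query string; it is "" when the URL has neither
        String path = "".equals(url.getFile()) ? "/" : url.getFile();
        String host = url.getHost();
        // getPort() is -1 when the URL does not name a port explicitly
        String portString = (url.getPort() == -1) ? "" : ":" + url.getPort();
        return host + portString + path;
    }

    public static void main(String[] args) {
        System.out.println(hostPortPath("http://example.com"));
        System.out.println(hostPortPath("http://example.com:8080/a/b.html?q=1"));
    }
}
```

Note that getFile() includes the query string, so the query ends up in the request path as well.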

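Then it creates the socket and sets its read timeout (SO_TIMEOUT); the connect timeout is passed separately to connect() a few lines later:

    socket = new Socket();                    // create the socket
    socket.setSoTimeout(http.getTimeout());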

Depending on whether a proxy is used, it picks sockHost and sockPort:

    String sockHost = http.useProxy() ? http.getProxyHost() : host;
    int sockPort = http.useProxy() ? http.getProxyPort() : port;

It creates an InetSocketAddress and opens the connection:

    InetSocketAddress sockAddr= new InetSocketAddress(sockHost, sockPort);
    socket.connect(sockAddr, http.getTimeout());
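The two timeouts are easy to mix up: setSoTimeout() limits each blocking read, while the second argument of connect() limits how long the connection attempt may take. The sketch below shows both on a plain java.net.Socket. The class SocketTimeoutSketch and its methods are hypothetical, and the throwaway loopback listener only exists so that the example has something to connect to.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketTimeoutSketch {
    /** Connects with separate read and connect timeouts and returns the read timeout in effect. */
    static int connectAndGetReadTimeout(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.setSoTimeout(timeoutMs);                               // limit for each blocking read
            socket.connect(new InetSocketAddress(host, port), timeoutMs); // limit for the connect itself
            return socket.getSoTimeout();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens a temporary listener on the loopback interface and connects to it. */
    static int demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return connectAndGetReadTimeout(
                server.getInetAddress().getHostAddress(), server.getLocalPort(), 2000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timeout in ms: " + demo());
    }
}
```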

It gets the output stream that the request will be written to:

    // make request
    OutputStream req = socket.getOutputStream();

The following code sends the GET request to the server. When a proxy is used the request line carries the absolute URL; otherwise it carries only the path:

    StringBuffer reqStr = new StringBuffer("GET ");
    if (http.useProxy()) {
      reqStr.append(url.getProtocol()+"://"+host+portString+path);
    } else {
      reqStr.append(path);
    }

    reqStr.append(" HTTP/1.0\r\n");
    reqStr.append("Host: ");
    reqStr.append(host);
    reqStr.append(portString);
    reqStr.append("\r\n");
    reqStr.append("Accept-Encoding: x-gzip, gzip\r\n");
    String userAgent = http.getUserAgent();
    if ((userAgent == null) || (userAgent.length() == 0)) {
      if (Http.LOG.isFatalEnabled()) { Http.LOG.fatal("User-agent is not set!"); }
    } else {
      reqStr.append("User-Agent: ");
      reqStr.append(userAgent);
      reqStr.append("\r\n");
    }
    reqStr.append("\r\n");
    byte[] reqBytes= reqStr.toString().getBytes();
    req.write(reqBytes);
    req.flush();
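To see exactly which bytes go over the wire, the request construction can be pulled out into a pure function. This is only a sketch: GetRequestSketch and buildRequest are invented names, and the proxy flag and user agent are passed in as plain parameters instead of coming from the Http configuration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class GetRequestSketch {
    /** Builds the HTTP/1.0 request text for the given URL. */
    static String buildRequest(String spec, boolean useProxy, String userAgent) {
        URL url;
        try {
            url = URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
        String path = "".equals(url.getFile()) ? "/" : url.getFile();
        String host = url.getHost();
        String portString = (url.getPort() == -1) ? "" : ":" + url.getPort();

        StringBuilder req = new StringBuilder("GET ");
        // a proxy needs the absolute URL, an origin server only the path
        req.append(useProxy ? url.getProtocol() + "://" + host + portString + path : path);
        req.append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append(portString).append("\r\n");
        req.append("Accept-Encoding: x-gzip, gzip\r\n");
        if (userAgent != null && !userAgent.isEmpty()) {
            req.append("User-Agent: ").append(userAgent).append("\r\n");
        }
        req.append("\r\n");                       // blank line ends the header section
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("http://example.com:8080/a.html", false, "demo-bot"));
    }
}
```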

Next it processes the response. It gets the input stream and wraps it in a PushbackInputStream to make it easier to work with:

    PushbackInputStream in =                  // process response
      new PushbackInputStream(
        new BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE),
        Http.BUFFER_SIZE) ;

It extracts the status code and the HTTP headers of the response, and repeats this as long as the status is 100 (Continue):

    boolean haveSeenNonContinueStatus= false;
    while (!haveSeenNonContinueStatus) {
      // parse status code line
      this.code = parseStatusLine(in, line);
      // parse headers
      parseHeaders(in, line);
      haveSeenNonContinueStatus= code != 100; // 100 is "Continue"
    }
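The effect of this loop is easiest to see on a canned response in which an interim 100 Continue precedes the real status. The sketch below is a simplification and not the Nutch code: ContinueLoopSketch and finalStatus are made-up names, and it uses BufferedReader.readLine() in place of parseStatusLine and parseHeaders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ContinueLoopSketch {
    /** Returns the first status code that is not 100, skipping each interim header block. */
    static int finalStatus(String rawResponse) {
        BufferedReader in = new BufferedReader(new StringReader(rawResponse));
        try {
            int code;
            do {
                String statusLine = in.readLine();
                if (statusLine == null) {
                    throw new IllegalArgumentException("no status line");
                }
                code = Integer.parseInt(statusLine.split(" ")[1]);
                // skip the header block that belongs to this status line
                String header;
                while ((header = in.readLine()) != null && !header.isEmpty()) {
                    // headers are ignored in this sketch
                }
            } while (code == 100);                // 100 is "Continue"
            return code;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 100 Continue\r\n\r\n"
                   + "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";
        System.out.println(finalStatus(raw));
    }
}
```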

Then it reads the content:

    readPlainContent(in);

It checks the content encoding and, if the content is gzip-compressed, decompresses it:

    String contentEncoding = getHeader(Response.CONTENT_ENCODING);
    if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
      content = http.processGzipEncoded(content, url);
    } else {
      if (Http.LOG.isTraceEnabled()) {
        Http.LOG.trace("fetched " + content.length + " bytes from " + url);
      }
    }
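The body of http.processGzipEncoded is not shown in this article, so the following is not that method. It is only a sketch of how a gzip-encoded body can be decompressed with java.util.zip; GzipBodySketch and all of its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBodySketch {
    /** Compresses bytes; only used to produce test input for this sketch. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses the body when the Content-Encoding says gzip, otherwise returns it unchanged. */
    static byte[] decodeBody(byte[] content, String contentEncoding) {
        if (!"gzip".equals(contentEncoding) && !"x-gzip".equals(contentEncoding)) {
            return content;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(content))) {
            for (int n = gz.read(buffer); n != -1; n = gz.read(buffer)) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Compresses the text, decodes it again and returns the result as a string. */
    static String roundTrip(String text, String contentEncoding) {
        byte[] compressed = gzip(text.getBytes(StandardCharsets.UTF_8));
        return new String(decodeBody(compressed, contentEncoding), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<html>hello</html>", "gzip"));
    }
}
```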

That completes the whole process.

Now let's look at how parseStatusLine, parseHeaders, readPlainContent and readChunkedContent work.

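private int parseStatusLine(PushbackInputStream in, StringBuffer line)
throws IOException, HttpException:
This function extracts the status of the response, for example the status code in 200 OK.

The status line of a response generally has one of two forms (taking an OK response as the example): "HTTP/1.1 200" or "HTTP/1.1 200 OK"

    int codeStart = line.indexOf(" ");
    int codeEnd = line.indexOf(" ", codeStart+1);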

In the first case there is no second space:

    if (codeEnd == -1)
      codeEnd = line.length();

so the status code (200) ends at line.length();
otherwise the status code ends at line.indexOf(" ", codeStart+1).
Then the status code is extracted:

    int code;
    try {
      code= Integer.parseInt(line.substring(codeStart+1, codeEnd));
    } catch (NumberFormatException e) {
      throw new HttpException("bad status line '" + line
                              + "': " + e.getMessage(), e);
    }
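The index arithmetic can be checked in isolation. The sketch below applies the same steps to a plain String; StatusLineSketch and parseStatusCode are names chosen for this example, and IllegalArgumentException stands in for Nutch's HttpException.

```java
public class StatusLineSketch {
    /** Extracts the numeric status code from a status line, with or without a reason phrase. */
    static int parseStatusCode(String line) {
        int codeStart = line.indexOf(' ');
        int codeEnd = line.indexOf(' ', codeStart + 1);
        if (codeEnd == -1) {
            codeEnd = line.length();              // "HTTP/1.1 200" has no reason phrase
        }
        try {
            return Integer.parseInt(line.substring(codeStart + 1, codeEnd));
        } catch (NumberFormatException e) {
            // stand-in for HttpException in this sketch
            throw new IllegalArgumentException(
                "bad status line '" + line + "': " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStatusCode("HTTP/1.1 200"));
        System.out.println(parseStatusCode("HTTP/1.1 404 Not Found"));
    }
}
```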

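Next, let's look at

    private void parseHeaders(PushbackInputStream in, StringBuffer line)
    throws IOException, HttpException

This function adds the response headers to headers, the Metadata structure that has already been created.

A loop reads the headers.
In an HTTP response the header section and the body are normally separated by a blank line. For a blank line readLine returns 0 as the number of characters read (we look at the readLine implementation in detail after this function):

    while (readLine(in, line, true) != 0)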

If the blank line is missing, the body follows the headers directly. The body usually begins with <!DOCTYPE, <HTML or <html, so if the line just read contains one of these, the header section is over:

    // handle HTTP responses with missing blank line after headers
    int pos;
    if ( ((pos= line.indexOf("<!DOCTYPE")) != -1)
         || ((pos= line.indexOf("<HTML")) != -1)
         || ((pos= line.indexOf("<html")) != -1) ) {

Then the part that was read too far is pushed back into the stream, and the line is truncated to length pos:

      in.unread(line.substring(pos).getBytes("UTF-8"));
      line.setLength(pos);

Handling of the line itself is then delegated to processHeaderLine(line):

      try {
          //TODO: (CM) We don't know the header names here
          //since we're just handling them generically. It would
          //be nice to provide some sort of mapping function here
          //for the returned header names to the standard metadata
          //names in the ParseData class
        processHeaderLine(line);
      } catch (Exception e) {
        // fixme:
        e.printStackTrace(LogUtil.getErrorStream(Http.LOG));
      }
      return;
    }
    processHeaderLine(line);
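The push-back trick for responses without a blank line can be tried on its own. In the sketch below, MissingBlankLineSketch and split are invented for illustration: the first argument plays the role of the line that was just read, the second is whatever is still waiting in the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class MissingBlankLineSketch {
    /** Returns {header part of the line, everything the stream yields afterwards}. */
    static String[] split(String lineText, String restOfStream) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(restOfStream.getBytes(StandardCharsets.UTF_8)), 1024);
            StringBuilder line = new StringBuilder(lineText);

            int pos;
            if ((pos = line.indexOf("<!DOCTYPE")) != -1
                    || (pos = line.indexOf("<HTML")) != -1
                    || (pos = line.indexOf("<html")) != -1) {
                // the body started inside this line: give it back to the stream
                in.unread(line.substring(pos).getBytes(StandardCharsets.UTF_8));
                line.setLength(pos);
            }

            // drain the stream to show that the pushed-back bytes come out first
            ByteArrayOutputStream remaining = new ByteArrayOutputStream();
            for (int c = in.read(); c != -1; c = in.read()) {
                remaining.write(c);
            }
            return new String[] {
                line.toString(),
                new String(remaining.toByteArray(), StandardCharsets.UTF_8)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = split("Server: demo<html><body>", "hi</body></html>");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

One thing to keep in mind: unread() fails with an IOException when the pushed-back bytes do not fit into the push-back buffer, which is presumably why the constructor creates the PushbackInputStream with Http.BUFFER_SIZE.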

Now let's see how a single header line is processed:
private void processHeaderLine(StringBuffer line)
throws IOException, HttpException
Response headers generally look like this:
Cache-Control: private
Date: Fri, 14 Dec 2007 15:32:06 GMT
Content-Length: 7602
Content-Type: text/html
Server: Microsoft-IIS/6.0

With that in mind, the following code is easy to understand:

    int colonIndex = line.indexOf(":");       // key is up to colon

If there is no ":" and the line is not blank, an HttpException is thrown; a line that contains only whitespace is silently ignored:

    if (colonIndex == -1) {
      int i;
      for (i= 0; i < line.length(); i++)
        if (!Character.isWhitespace(line.charAt(i)))
          break;
      if (i == line.length())
        return;
      throw new HttpException("No colon in header:" + line);
    }

Otherwise the key-value pair can be extracted:
the key is the part from 0 to colonIndex; the value is what follows the colon, with leading spaces and tabs skipped.

Finally the pair is stored in headers:

    String key = line.substring(0, colonIndex);

    int valueStart = colonIndex+1;            // skip whitespace
    while (valueStart < line.length()) {
      int c = line.charAt(valueStart);
      if (c != ' ' && c != '\t')
        break;
      valueStart++;
    }
    String value = line.substring(valueStart);
    headers.set(key, value);
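Put together, the header-line handling boils down to a split at the first colon. Here is a minimal sketch on a plain String; HeaderLineSketch and splitHeader are hypothetical names, a null return stands for an ignored blank line, and IllegalArgumentException stands in for HttpException.

```java
public class HeaderLineSketch {
    /** Returns {key, value}, or null when the line contains only whitespace. */
    static String[] splitHeader(String line) {
        int colonIndex = line.indexOf(':');       // key is up to the first colon
        if (colonIndex == -1) {
            if (line.chars().allMatch(Character::isWhitespace)) {
                return null;                      // blank line: nothing to store
            }
            throw new IllegalArgumentException("No colon in header:" + line);
        }
        String key = line.substring(0, colonIndex);

        int valueStart = colonIndex + 1;          // skip spaces and tabs after the colon
        while (valueStart < line.length()) {
            char c = line.charAt(valueStart);
            if (c != ' ' && c != '\t') {
                break;
            }
            valueStart++;
        }
        return new String[] { key, line.substring(valueStart) };
    }

    public static void main(String[] args) {
        String[] kv = splitHeader("Date: Fri, 14 Dec 2007 15:32:06 GMT");
        System.out.println(kv[0] + " => " + kv[1]);
    }
}
```

Because only the first colon is used as the separator, values that themselves contain colons, such as the time in a Date header, stay intact.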

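Now let's look at a helper function that is used a lot:
private static int readLine(PushbackInputStream in, StringBuffer line,
boolean allowContinuedLine) throws IOException

The implementation:
It first sets the length of line to 0 and then keeps reading until c == -1 (end of stream). For each c:

If it is \r and the next character is \n, the \n is consumed as well, and control falls through to the \n case. If it is \n, and line.length() > 0 (the line already contains characters), and continued lines are allowed, it peeks at the next character: if that is ' ' or \t the line has not ended yet, so that character is consumed and reading goes on. Otherwise the line is complete and its length is returned. In all other cases the character is simply appended to line. If the stream ends before a line terminator is seen, an EOFException is thrown:

    line.setLength(0);
    for (int c = in.read(); c != -1; c = in.read()) {
      switch (c) {
        case '\r':
          if (peek(in) == '\n') {
            in.read();
          }
        case '\n':
          if (line.length() > 0) {
            // at EOL -- check for continued line if the current
            // (possibly continued) line wasn't blank
            if (allowContinuedLine)
              switch (peek(in)) {
                case ' ' : case '\t':                   // line is continued
                  in.read();
                  continue;
              }
          }
          return line.length();      // else complete
        default :
          line.append((char)c);
      }
    }
    throw new EOFException();
  }   // end of readLine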
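The peek helper is not shown in this article. The sketch below assumes it reads one byte and immediately pushes it back, and on that assumption re-creates readLine so that the handling of continued (folded) lines can be tried out. ReadLineSketch and firstLine are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReadLineSketch {
    /** Assumed behavior of peek: look at the next byte without consuming it. */
    static int peek(PushbackInputStream in) throws IOException {
        int value = in.read();
        if (value != -1) {
            in.unread(value);
        }
        return value;
    }

    static int readLine(PushbackInputStream in, StringBuilder line,
                        boolean allowContinuedLine) throws IOException {
        line.setLength(0);
        for (int c = in.read(); c != -1; c = in.read()) {
            switch (c) {
                case '\r':
                    if (peek(in) == '\n') {
                        in.read();                // swallow the \n of a \r\n pair
                    }
                    // fall through: \r alone also ends the line
                case '\n':
                    if (line.length() > 0 && allowContinuedLine) {
                        int next = peek(in);
                        if (next == ' ' || next == '\t') {
                            in.read();            // drop the folding whitespace
                            continue;             // and keep reading the same line
                        }
                    }
                    return line.length();
                default:
                    line.append((char) c);
            }
        }
        throw new EOFException();
    }

    /** Reads the first logical line of the given text. */
    static String firstLine(String raw, boolean allowContinuedLine) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(raw.getBytes(StandardCharsets.ISO_8859_1)));
            StringBuilder line = new StringBuilder();
            readLine(in, line, allowContinuedLine);
            return line.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine("X-Long: part one\r\n\tpart two\r\nNext: x\r\n", true));
    }
}
```

Note that the line terminator and the single folding character are both dropped, so the two parts of a continued line are joined without any separator.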

Next, let's see how the content is read, that is, the implementation of
private void readPlainContent(InputStream in)
throws HttpException, IOException:

First it gets the length of the response from headers (the headers were already read into the metadata before this point). If there is no Content-Length header, the length stays at Integer.MAX_VALUE:

    int contentLength = Integer.MAX_VALUE;    // get content length
    String contentLengthString = headers.get(Response.CONTENT_LENGTH);
    if (contentLengthString != null) {
      contentLengthString = contentLengthString.trim();
      try {
        contentLength = Integer.parseInt(contentLengthString);
      } catch (NumberFormatException e) {
        throw new HttpException("bad content length: "+contentLengthString);
      }
    }

If that length is greater than http.getMaxContent() (configured through http.content.limit in the configuration file), the download is limited to maxContent bytes; a negative maxContent means no limit. The content is then read buffer by buffer until the stream ends or the limit is reached:

    if (http.getMaxContent() >= 0
      && contentLength > http.getMaxContent())   // limit download size
      contentLength  = http.getMaxContent();

    ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
    byte[] bytes = new byte[Http.BUFFER_SIZE];
    int length = 0;                           // read content
    for (int i = in.read(bytes); i != -1; i = in.read(bytes)) {
      out.write(bytes, 0, i);
      length += i;
      if (length >= contentLength)
        break;
    }
    content = out.toByteArray();
  }   // end of readPlainContent
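One detail of this loop is worth trying out: the limit is only checked after a whole buffer has been written, so the stored content can exceed the limit by up to one buffer minus one byte. The sketch below reproduces the loop over an in-memory stream; BoundedReadSketch and readContent are hypothetical names, and the buffer size is a parameter here instead of Http.BUFFER_SIZE.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class BoundedReadSketch {
    /** Reads the body buffer by buffer, stopping once the effective limit has been reached. */
    static byte[] readContent(byte[] body, int contentLength, int maxContent, int bufferSize) {
        if (maxContent >= 0 && contentLength > maxContent) {
            contentLength = maxContent;           // limit download size
        }
        InputStream in = new ByteArrayInputStream(body);
        ByteArrayOutputStream out = new ByteArrayOutputStream(bufferSize);
        byte[] bytes = new byte[bufferSize];
        int length = 0;
        try {
            for (int i = in.read(bytes); i != -1; i = in.read(bytes)) {
                out.write(bytes, 0, i);           // the whole chunk is kept ...
                length += i;
                if (length >= contentLength) {    // ... and only then is the limit checked
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = new byte[100];
        System.out.println(readContent(body, 100, 30, 8).length);
        System.out.println(readContent(body, Integer.MAX_VALUE, -1, 8).length);
    }
}
```

With a 100-byte body, a limit of 30 and an 8-byte buffer, the loop stops after the fourth chunk, so 32 bytes are kept instead of 30.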
