Spidering Hacks study notes (2)


2024-07-05 23:05:46

These look messy; they are just a record of what I am learning. Once I finish the book I will tidy the notes up. Hehe.


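8: Hack 8, Installing Perl Modules

Installation methods:

On Linux, Mac, and Unix: CPAN (Comprehensive Perl Archive Network)

On Windows: PPM (Programmer's Package Manager)

Example: installing the LWP module (full name: The World-Wide Web library for Perl; the distribution that contains it is called libwww-perl).

In a terminal (I use Ubuntu), either of these works. CPAN's install command expects a module name, so LWP is used here:

(1) sudo perl -MCPAN -e 'install LWP'

(2) sudo perl -MCPAN -e shell

    install LWP

Manual installation (no details here; the Perl learning material covers it):

Normally: perl Makefile.PL, followed by make, make test, and make install, which installs the module system-wide (e.g. under /usr/local).

But if you have little more than user access to the system, install into a directory you own instead, for example /home/hqh/lib:

perl Makefile.PL LIB=/home/hqh/lib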
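After installing, it helps to confirm that Perl can actually find the module, especially when it went into a private directory. The following is a minimal sketch, assuming the private path /home/hqh/lib from the notes above; the helper name module_available is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Make Perl search the private library directory too (path taken from the notes above).
use lib '/home/hqh/lib';

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # LWP::UserAgent -> LWP/UserAgent.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(strict LWP::UserAgent)) {
    printf "%-15s %s\n", $module, module_available($module) ? "installed" : "missing";
}
```

The use lib line is harmless when the directory does not exist; Perl just skips it while searching.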

9: Hack 9, Simply Fetching with LWP::Simple

Code:

#!/usr/bin/perl -w
# The -w switch turns on warnings; "use strict" below is what enforces the strict checks.
use strict;
use LWP::Simple;

my $url = "http://www.baidu.com";
my $content = get($url);
die "could not get $url" unless defined $content;

if ($content =~ m/baidu/i) {
    print "contains the string \"baidu\"\n";
} else {
    print "does not contain the string \"baidu\"\n";
}

# Review of the three operators m//, s///, and tr///
my $str = "i love perl, oh yeah!";
if ($str =~ m/lo/) {
    print "have lo\n";
}
if ($str =~ /lo/) {
    print "have lo as well\n";
}
# The leading m is optional when the delimiters are slashes.

my $name = "my name is huangqihao haha";
$name =~ s/name/handsome name/;
print "$name\n";
$name =~ s/m/heihei/;
print "$name\n";
# s/// replaces only the first m with heihei. To replace every m in $name, add the /g flag:
$name =~ s/m/heihei/g;
print "$name\n";
# Now every remaining m has been replaced.

# When a pattern contains (), Perl assigns what each group matched to $1, $2, ... in order.
my $office = "hangzhou wenyixilu ";
$office =~ s/(yi)(xi)(lu)/<$2>,<$3>,<$1>/;
print "$office\n";
# yi goes to $1, xi to $2, and lu to $3; the matched "yixilu" is replaced by "<xi>,<lu>,<yi>".

# tr///
my $car = "my car's brand is bora";
$car =~ s/bora/bmw/;
print "$car\n";
# tr/// works character by character: every b, m, and w is uppercased, not just the word "bmw".
$car =~ tr/bmw/BMW/;
print "$car\n";
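The three operators also return useful values, which the review above does not show. A small sketch using the same sample strings; the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# m// in list context returns the captured groups directly.
my $office = "hangzhou wenyixilu";
my ($first, $second, $third) = $office =~ /(yi)(xi)(lu)/;
print "$first $second $third\n";

# s///g returns the number of substitutions it made.
my $name  = "my name is huangqihao";
my $count = ($name =~ s/m/heihei/g);
print "$count replacements: $name\n";

# tr/// with an empty replacement list counts characters without changing the string.
my $car    = "my car's brand is bmw";
my $vowels = ($car =~ tr/aeiou//);
print "$vowels vowels in: $car\n";
```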

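LWP::Simple also has a head function, which returns a handful of HTTP header values (content type, document length, modification time, expiry time, server), whereas get returns the whole document body.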

10: Hack 10, More Involved Requests with LWP::UserAgent

LWP::UserAgent is a class for virtual browsers, which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.
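The split between the two classes is easiest to see from the response side. The sketch below builds two HTTP::Response objects by hand instead of fetching anything, so it runs without a network; in a real script they would come from $browser->get($url). The describe helper is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: summarize a response the way a spider's error handling might.
sub describe {
    my ($response) = @_;
    return $response->is_success
        ? "OK, " . length($response->content) . " bytes of " . $response->content_type
        : "failed: " . $response->status_line;
}

# Hand-built stand-ins for what $browser->get($url) would return.
my $good = HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/html'], '<html>qq</html>');
my $bad  = HTTP::Response->new(404, 'Not Found');

print describe($good), "\n";
print describe($bad),  "\n";
```

Checking is_success before touching the content is the usual pattern, since a failed request still returns a response object rather than undef.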

11: Hack 11, Adding HTTP Headers to Your Request

Q1: Why?

Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents.

Q2: How?

$response = $browser->get($url, 'Header-Name' => 'value', ...)

Example:

With the Accept header, "you're telling the remote server which types of data you're willing to Accept".

To change the User-Agent:

$browser->agent('Mozilla/4.76 [en] (Win98; U)')

#!/usr/bin/perl
# Hack 11: Adding HTTP Headers to Your Request

=xxx

# Review of the basic request/response flow: calling new on the LWP::UserAgent class creates an object,
# and that object has the class's methods and attributes.
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
my $url = "http://www.qq.com";
my $response = $browser->get($url);
if ($response->content =~ /qq/) {
    print "response has 'qq'\n";
} else {
    print "no 'qq'\n";
}

# To add headers, look at the code in the book to see what they contain.
# The call to use is $response = $browser->get($url, @ns_headers)
# The structure of the headers in the book:
my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
# User-Agent: identifies the browser and its version
# Accept: the data types the client will accept
# Accept-Charset: the character sets the client will accept
# Accept-Language: the preferred language
# If you only want to change the browser identification, use the agent method of LWP::UserAgent:
# $browser->agent('Mozilla/4.76 [en] (Win98; U)');
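To see exactly which headers would go out without contacting any server, the request object can be built and printed on its own. This is only a sketch: it uses GET from HTTP::Request::Common (part of the same libwww-perl family) rather than the $browser->get call from the book, and www.example.com is a placeholder URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Language' => 'en-US',
);

# GET() only builds the HTTP::Request object; nothing is sent over the network.
my $request = GET('http://www.example.com/', @ns_headers);

# Print the request line and headers as they would appear on the wire.
print $request->as_string;
```

An object built this way can later be sent with $browser->request($request).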

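# 12: Hack 12, Posting Form Data with LWP

=xxx

Example: http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22

Breaking down what follows the ?: num is the number of results returned per page

hl is the language

q is the query in its URL-encoded form (%22 is a double quote, + is a space)

=cut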
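The encoding in that example URL does not have to be done by hand. A short sketch, assuming the URI module (which LWP itself depends on), that rebuilds the same URL from plain parameter values:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://www.google.com/search');

# query_form escapes each value and joins the pairs in the order given.
$uri->query_form(
    num => 100,
    hl  => 'en',
    q   => '"three blind mice"',
);

print "$uri\n";
```

Called without arguments, query_form goes the other way and returns the decoded key/value pairs.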

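#!/usr/bin/perl -w
use strict;
use LWP;

my $word = shift;
$word or die "Usage: perl altavista_post.pl [keyword]\n";

my $browser = LWP::UserAgent->new;
my $url = 'http://www.altavista.com/web/results';
my $response = $browser->post( $url,
    [ 'q' => $word,   # the Altavista query string
      'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
    ]);

# This sends the form with POST instead of GET. Roughly, POST is for submitting or updating, GET for querying and retrieving.
# Since the request is now a POST, check that the result that comes back is what the request asked for:
die "$url error: ", $response->status_line unless $response->is_success;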
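What the POST actually sends can be inspected offline in the same way. The sketch below uses POST from HTTP::Request::Common to build the request without sending it; the keyword is a sample value, and the form fields are the ones from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

my $word = 'three blind mice';   # sample keyword

# POST() builds the request and URL-encodes the form fields into the body; nothing is sent.
my $request = POST('http://www.altavista.com/web/results',
    [ 'q' => $word, 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX' ]);

print $request->as_string;
```

Unlike a GET, the form data travels in the request body rather than in the URL.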

13: Hack 13, Authentication, Cookies, and Proxies

=xxx

# All that discussion of authentication boils down to this:
$browser->credentials(
    'servername:portnumber',
    'realm-name',
    'username' => 'password'
);
# This has to be done before making the request.

Example:

$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode'
);

Cookies:

Read cookies from a file on disk (with autosave they are written back to it):

use HTTP::Cookies;
$browser->cookie_jar( HTTP::Cookies->new(
    'file' => '/some/where/cookies.lwp',
    'autosave' => 1,
));

Use a cookies file in the Netscape browser's format:

use HTTP::Cookies; # yes, loads HTTP::Cookies::Netscape too
$browser->cookie_jar( HTTP::Cookies::Netscape->new(
    'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
));

Proxies:

use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
$browser->env_proxy;   # load proxy settings from environment variables such as http_proxy

Annoyingly, the book does not cover the other proxy methods and tells me to go and read about them myself. How lazy!

14: Hack 14, Handling Relative and Absolute URLs

Use the URI class.

$url->scheme returns the scheme, e.g. http or ftp

$url->host returns the host, e.g. www.baidu.com

URI->new_abs takes a URL string that is most likely relative and gives back an absolute URL:

use URI; my $abs = URI->new_abs($maybe_relative, $base);

This hack also shows how to match http URLs.

=cut
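A few concrete inputs make the behaviour of new_abs clearer. A minimal sketch; the base URL and the links are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical page the links were found on.
my $base = 'http://www.example.com/docs/index.html';

for my $link ('../images/logo.png', 'page2.html', '/about', 'ftp://ftp.example.com/pub/') {
    # Relative links are resolved against $base; absolute ones come back unchanged.
    my $abs = URI->new_abs($link, $base);
    printf "%-28s -> %s (scheme %s, host %s)\n", $link, $abs, $abs->scheme, $abs->host;
}
```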

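15: Hack 15, Secured Access and Browser Attributes

This hack explains that a site such as a bank's usually puts SSL (Secure Sockets Layer) between the browser and the server.

A secured site can generally be recognized by the https at the start of its URL.

In other words, you need to install HTTPS support; the book points to a reference for this.

It also introduces other methods of the browser object.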

16: Hack 16, Respecting Your Scrapee's Bandwidth

time2str($response->last_modified) returns the time the URL was last modified, as a formatted date string.

#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;

my $url = 'http://disobey.com/amphetadesk/';
my $date = "Thu, 31 Oct 2002 01:05:16 GMT";
my %headers = ( 'If-Modified-Since' => $date );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );

This code is mainly used to find out whether the page has changed since $date; if it has not, the server can answer with 304 Not Modified instead of sending the page again.
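The two pieces involved here, date conversion and the 304 check, can be tried without a network. This is a sketch with hand-built responses standing in for what $browser->get would return; needs_refetch is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Date qw(str2time time2str);
use HTTP::Response;

my $date  = "Thu, 31 Oct 2002 01:05:16 GMT";
my $epoch = str2time($date);     # HTTP date string -> seconds since the epoch
print time2str($epoch), "\n";    # and back to the HTTP date format

# Hypothetical helper: a 304 status means the copy we already have is still current.
sub needs_refetch {
    my ($response) = @_;
    return $response->code == 304 ? 0 : 1;
}

# Hand-built stand-ins for server replies to an If-Modified-Since request.
my $unchanged = HTTP::Response->new(304, 'Not Modified');
my $changed   = HTTP::Response->new(200, 'OK');

print "unchanged page: ", (needs_refetch($unchanged) ? "fetch again" : "keep cached copy"), "\n";
print "changed page:   ", (needs_refetch($changed)   ? "fetch again" : "keep cached copy"), "\n";
```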

ETags: Instead of a date, it returns a unique string based on the content you're downloading. In other words, a distinct string tied to the content.

Compressed Data:

The book spends a long passage on compression and then another long one on decompression. Easy enough; here is the code from the book:

use strict;
use Compress::Zlib;
use LWP 5.64;

my $url = 'http://www.disobey.com/';
my %headers = ( 'Accept-Encoding' => 'gzip, deflate' );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
my $data = $response->content;

if (my $encoding = $response->content_encoding) {
    $data = Compress::Zlib::memGunzip($data) if $encoding =~ /gzip/i;
    $data = Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
}
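The two decompression calls can be checked locally by compressing a string first. A small sketch with an invented sample page; it only shows that memGunzip and uncompress reverse memGzip and compress, not how any particular server encodes its responses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Invented sample page with plenty of repetition, so it compresses well.
my $html = "<html>" . ("spider " x 200) . "</html>";

my $gzipped  = Compress::Zlib::memGzip($html);    # gzip format
my $deflated = Compress::Zlib::compress($html);   # zlib format

printf "original %d bytes, gzip %d bytes, zlib %d bytes\n",
    length($html), length($gzipped), length($deflated);

# Reverse both, as the script above does with the response body.
my $from_gzip    = Compress::Zlib::memGunzip($gzipped);
my $from_deflate = Compress::Zlib::uncompress($deflated);

print "gzip round trip ok\n" if $from_gzip    eq $html;
print "zlib round trip ok\n" if $from_deflate eq $html;
```

The size difference is the bandwidth saved for the site being scraped.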
