Spidering Hacks Study Notes (Part 2)

  These notes look messy; they are just a record of what I am learning. Once I finish the book I will tidy them up. Hehe.

 
8: Hack 8 Installing Perl Modules
Installation methods:
On Linux, Mac, and Unix: CPAN (Comprehensive Perl Archive Network)
On Windows: PPM (Programmer's Package Manager)
 
Example: installing the LWP module (full name: The World-Wide Web library for Perl, distributed as libwww-perl)
In a terminal (I am on Ubuntu), either of these works:
(1) sudo perl -MCPAN -e 'install LWP'
(2) sudo perl -MCPAN -e shell
    install LWP
 
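Manual installation (not covered in detail here; it is in the Perl learning material):
Normally: perl Makefile.PL, followed by make and make install, installs the module system-wide (for example under /usr/local).
But if you have little more than user access to the system, you should force the install into a directory you own,
for example /home/hqh/lib:
perl Makefile.PL LIB=/home/hqh/lib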
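
A module installed into a private directory is only found if Perl is told to look there. Here is a minimal sketch, assuming the same example path /home/hqh/lib as above (the path is only an illustration and does not need to exist for the script to run):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical private library directory; it matches the LIB= value
# used in the notes above and is only an example path.
use lib '/home/hqh/lib';

# "use lib" puts the directory at the front of @INC, so modules installed
# there are found before the system-wide copies.
print "Perl searches for modules in:\n";
print "  $_\n" for @INC;
```

Setting the PERL5LIB environment variable to the same directory should have a similar effect without editing the script.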
 
9: Hack 9 Simply Fetching with LWP::Simple

Code:

#!/usr/bin/perl -w
# The -w switch above enables warnings; strictness comes from "use strict" below.
use strict;
use LWP::Simple;
my $url = "http://www.baidu.com";
my $content = get($url);
die "could not get $url" unless defined $content;
if ($content =~ m/baidu/i) {
    print "found the string \"baidu\"\n";
} else {
    print "did not find the string \"baidu\"\n";
}
 
 
 
# Review of the three operators m//, s///, and tr///
my $str = "i love perl,oh year!";
if ($str =~ m/lo/) {
    print "have lo\n";
}

$str = "i love perl,oh year!";
if ($str =~ /lo/) {
    print "have lo as well\n";
}

# The leading m is optional when the delimiters are slashes.
 
my $name = "my name is huangqihao haha";
$name =~ s/name/handsome name/;
print "$name\n";
$name =~ s/m/heihei/;
print "$name\n";
# Note that s/// replaced only the first m with heihei. To replace every m in $name, add the /g modifier:
$name =~ s/m/heihei/g;
print "$name\n";
# Now every remaining m has been replaced.
 
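# When a Perl regular expression contains (), the text matched by each group is assigned, in order,
# to the special variables $1, $2, ... after a successful match or substitution.
my $office = "hangzhou wenyixilu ";
$office =~ s/(yi)(xi)(lu)/<$2>,<$3>,<$1>/;
print "$office\n";
# Explanation: yi is captured in $1, xi in $2, and lu in $3; the matched text "yixilu" is then
# replaced with "<xi>,<lu>,<yi>".

# tr
my $car = "my car's brand is bora";
$car =~ s/bora/bmw/;
print "$car\n";

# tr/// maps single characters (b to B, m to M, w to W) everywhere in the string,
# so the m in "my" and the b in "brand" are uppercased too.
$car =~ tr/bmw/BMW/;
print "$car\n";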
 
 
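LWP::Simple also has a head function, which returns only a few of the HTTP header values (content type, document length, modification time, expiry time, and server), whereas get returns the full document content.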
 
10: Hack 10 More Involved Requests with LWP::UserAgent
LWP::UserAgent is a class for virtual browsers, which you use for performing
requests, and HTTP::Response is a class for the responses (or error messages) that you get back from
those requests.
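
To make the split between the two classes concrete, here is a small sketch. The helper name describe_response and the example URL are my own assumptions, not from the book. The real network call is left commented out, and hand-built HTTP::Response objects stand in for server replies so the script runs offline:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Summarize an HTTP::Response: is_success is true for 2xx status codes,
# and status_line returns the code plus message, e.g. "404 Not Found".
# (describe_response is a hypothetical helper name.)
sub describe_response {
    my ($response) = @_;
    return $response->is_success
        ? "ok: "     . $response->status_line
        : "failed: " . $response->status_line;
}

# The "virtual browser" object; creating it needs no network access.
my $browser = LWP::UserAgent->new( timeout => 10 );

# A real request would look like this (example URL, needs network access):
# my $response = $browser->get('http://www.example.com/');
# print describe_response($response), "\n";

# Offline demonstration with hand-built response objects.
print describe_response( HTTP::Response->new( 200, 'OK' ) ),        "\n";
print describe_response( HTTP::Response->new( 404, 'Not Found' ) ), "\n";
```

Unlike LWP::Simple's get, which just returns undef on failure, the response object lets the script report why a request failed.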
 
11: Hack 11 Adding HTTP Headers to Your Request
Q1: Why?
Add more functionality to your programs, or mimic common browsers, to circumvent server-side
filtering of unknown user agents.
Q2: How?
$response = $browser->get($url, $key1, $value1, $key2, $value2, ...);
For example:
"you're telling the remote server which types of data you're willing to Accept"

Change the User-Agent:
$browser->agent('Mozilla/4.76 [en] (Win98; U)');
 
 
#!/usr/bin/perl
# 11 Hack 11: Adding HTTP Headers to Your Request
=xxx
# Review of the simple request/response flow: calling new on the LWP::UserAgent class creates an object that has the class's methods and attributes.
use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
my $url = "http://www.qq.com";
my $response = $browser->get($url);
if ($response->content =~ /qq/) {
    print "response has 'qq'\n";
}
else {
    print "no 'qq'\n";
}
 
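# To add header fields, look at the code in the book to see what the headers contain. The call to use is $response = $browser->get($url, ...)

# The structure of the headers in the book:
my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

# Breakdown: User-Agent identifies the browser and its version;
# Accept lists the media types the client will accept;
# Accept-Charset lists the acceptable character sets;
# Accept-Language lists the preferred languages.

# If you only want to change the browser identification, use the agent method of LWP::UserAgent:
# $browser->agent('Mozilla/4.76 [en] (Win98; U)');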
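
To see what these headers look like on an actual request, here is a sketch that builds an HTTP::Request by hand and prints it without sending anything. The URL is an example of my own, and building the request object directly is just one way to inspect the headers, not the book's approach:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;

# Header list in the same shape as the one from the book.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

# Build the request without sending it, so the headers can be inspected offline.
my $url     = 'http://www.example.com/';    # example URL (an assumption)
my $request = HTTP::Request->new( GET => $url );
$request->header(@ns_headers);               # sets all field/value pairs at once

print $request->as_string;

# Sending it needs network access and an LWP::UserAgent object:
# my $response = $browser->request($request);
# or, more directly:
# my $response = $browser->get( $url, @ns_headers );
```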
 
# 12: Hack 12 Posting Form Data with LWP
=xxx

Example: http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22
Breakdown: after the ?, num is the number of results returned per page,
           hl is the interface language,
           q is the query, in its URL-encoded form
=cut
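
The encoded query string in the example URL does not have to be typed by hand. Here is a sketch using the URI module's query_form method, which should produce a URL of the same form as the one above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Start from the bare search URL and let URI encode the form fields.
my $url = URI->new('http://www.google.com/search');
$url->query_form(
    num => 100,
    hl  => 'en',
    q   => '"three blind mice"',    # spaces and quotes get escaped for us
);

print "$url\n";

# query_form with no arguments decodes the pairs again.
my %fields = $url->query_form;
print "decoded q: $fields{q}\n";
```

A GET request carries these pairs in the URL itself; a POST request sends them in the request body instead.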
 
#!/usr/bin/perl -w
use strict;
use LWP;
my $word = shift;
$word or die "Usage: perl altavista_post.pl [keyword]\n";
my $browser = LWP::UserAgent->new;
my $url = 'http://www.altavista.com/web/results';
my $response = $browser->post( $url,
    [ 'q' => $word, # the Altavista query string
      'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
    ]);
# This sends the request with POST instead of GET. Roughly speaking, POST is used to submit or update data, while GET is used to query and retrieve it.
# Since the request method has changed, check that the returned result is what the request expects.
13: Hack 13 Authentication, Cookies, and Proxies
=xxx
# All that discussion of authentication boils down to this:
$browser->credentials(
    'servername:portnumber',
    'realm-name',
    'username' => 'password'
);
# This must be done before making the request!
For example:
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode'
);
 
Cookies:
Keep cookies in a file on disk (they are loaded from it and, with autosave, written back to it):
use HTTP::Cookies;
$browser->cookie_jar( HTTP::Cookies->new(
    'file' => '/some/where/cookies.lwp',
    'autosave' => 1,
));

Use the cookies from an existing Netscape-format cookies file:
use HTTP::Cookies; # yes, loads HTTP::Cookies::Netscape too
$browser->cookie_jar( HTTP::Cookies::Netscape->new(
    'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
));
 
 
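use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
$browser->env_proxy;
# env_proxy loads the proxy settings from environment variables such as http_proxy.
# Annoyingly, the book does not cover the other proxy methods and tells me to go look them up myself. How lazy!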
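
Since the book leaves the other proxy methods out, here is a small sketch of setting a proxy by hand with the proxy and no_proxy methods of LWP::UserAgent. The proxy address and the domain names are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy address; replace it with a real one.
my $proxy_url = 'http://proxy.example.com:8080/';

# Route http and ftp requests through the proxy...
$browser->proxy( [ 'http', 'ftp' ], $proxy_url );

# ...but contact these (example) domains directly.
$browser->no_proxy( 'localhost', 'example.org' );

# Called with only a scheme, proxy returns the current setting.
print "http proxy: ", $browser->proxy('http'), "\n";
```

No request is sent here, so the script runs without network access; the settings only take effect on later get or post calls.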
 
 
 
 
14: Hack 14 Handling Relative and Absolute URLs
Use the URI class:
$url->scheme returns the scheme, such as http or ftp
$url->host returns the host name, such as www.baidu.com
URI->new_abs takes a URL string that is most likely relative and gives back an absolute URL:
use URI; my $abs = URI->new_abs($maybe_relative, $base);

This hack also shows how to match http URLs.
=cut
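
A quick sketch of the URI methods from Hack 14 in action. The base URL, the links, and the helper name absolutize are examples of my own:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Resolve a possibly relative link against the page it was found on.
# (absolutize is a hypothetical helper name.)
sub absolutize {
    my ( $maybe_relative, $base ) = @_;
    return URI->new_abs( $maybe_relative, $base );
}

my $base = 'http://www.example.com/a/b/c.html';    # example page URL

for my $link ( '../foo.html', 'img/logo.png', '/top.html',
               'http://other.example.org/x' ) {
    my $abs = absolutize( $link, $base );
    printf "%-28s => %s (scheme %s, host %s)\n",
        $link, $abs, $abs->scheme, $abs->host;
}
```

A link that is already absolute, like the last one, comes back unchanged, so every link on a page can be passed through the same call.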
 
 
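15: Hack 15 Secured Access and Browser Attributes
The hack explains that a site such as a bank usually puts SSL (Secure Sockets Layer) between the browser and the server.
A secured site can usually be recognized by the https at the start of its URL.
In other words, you need to install HTTPS support for LWP; see the reference for details.
It also introduces some other methods of the browser object.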
 
16: Hack 16 Respecting Your Scrapee's Bandwidth
time2str($response->last_modified) returns the time the URL was last modified, formatted as an HTTP date string.

#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;
my $url = 'http://disobey.com/amphetadesk/';
my $date = "Thu, 31 Oct 2002 01:05:16 GMT";
my %headers = ( 'If-Modified-Since' => $date );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
This code checks whether the page has changed since $date.
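
The notes stop at sending the request. When the page has not changed, a server that honors If-Modified-Since replies with status 304 and no body. Here is a sketch of checking for that; the helper name page_status is my own, and hand-built responses stand in for real ones so the script runs offline:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Date qw(time2str);
use HTTP::Response;

# Decide what to do with the reply to a conditional GET.
# (page_status is a hypothetical helper name.)
sub page_status {
    my ($response) = @_;
    return 'unchanged' if $response->code == 304;    # 304 Not Modified: no body sent
    return 'changed'   if $response->is_success;     # 2xx: fresh content arrived
    return 'error: ' . $response->status_line;
}

# time2str turns epoch seconds into the date format HTTP headers use.
print time2str(0), "\n";    # Thu, 01 Jan 1970 00:00:00 GMT

# Offline demonstration with hand-built responses.
print page_status( HTTP::Response->new( 304, 'Not Modified' ) ), "\n";
print page_status( HTTP::Response->new( 200, 'OK' ) ),           "\n";
```

Only the 'changed' case needs to be downloaded and parsed again, which is where the bandwidth saving comes from.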
 
ETags: Instead of a date, it returns a unique string based on the content you're
downloading. In other words, it is an identifier string tied to the content itself.
 
Compressed Data:
The book spends a long passage on compression and another on decompression. It is easy enough, so here is the code, adapted from the book:
use strict;
use Compress::Zlib;
use LWP 5.64;
my $url = 'http://www.disobey.com/';
my %headers = ( 'Accept-Encoding' => 'gzip, deflate' );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
my $data = $response->content;
if (my $encoding = $response->content_encoding) {
    $data = Compress::Zlib::memGunzip($data) if $encoding =~ /gzip/i;
    $data = Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
}
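
The decoding step can be tried without a server. In this sketch the body is compressed locally and then decoded the same way as above; the helper name decode_body and the sample text are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Undo a Content-Encoding, given the raw body and the header value.
# (decode_body is a hypothetical helper name.)
sub decode_body {
    my ( $data, $encoding ) = @_;
    return $data unless defined $encoding && length $encoding;
    return Compress::Zlib::memGunzip($data)  if $encoding =~ /gzip/i;
    return Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
    return $data;    # unknown encoding: hand the body back untouched
}

my $text = "hello spider " x 20;
my $gz   = Compress::Zlib::memGzip($text);     # gzip format, as a gzip-encoding server would send
my $zl   = Compress::Zlib::compress($text);    # zlib format, which uncompress expects

printf "original %d bytes, gzip %d bytes\n", length $text, length $gz;
print "gzip round trip ok\n"    if decode_body( $gz, 'gzip' )    eq $text;
print "deflate round trip ok\n" if decode_body( $zl, 'deflate' ) eq $text;
```

Repetitive text like HTML compresses well, which is why asking for gzip usually saves bandwidth on both sides.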