Spidering Hacks study notes (2)
These look messy; they're just a record of what I'm learning. Once I finish the book I'll tidy the notes up. Hehe.
8: Hack8 Installing Perl Modules

How to install: on Linux, Mac, and Unix use CPAN (Comprehensive Perl Archive Network); on Windows use PPM (Programmer's Package Manager).
Example: installing the LWP module (full name: The World-Wide Web library for Perl).
In a terminal (I'm on Ubuntu), either:
(1) sudo perl -MCPAN -e "install libwww-perl"
(2) sudo perl -MCPAN -e shell
    install libwww-perl
Manual install (no details here; the Perl learning material covers it):
Normally, perl Makefile.PL sets the module up to be installed system-wide under /usr/local. But if you have little more than user access to the system, you should point the install at your own directory instead, e.g. /home/hqh/lib (perl Makefile.PL LIB=/home/hqh/lib).

9: Hack9 Simply Fetching with LWP::Simple

coding:
#!/usr/bin/perl -w
# About the -w on the line above: it turns on warnings (the strict checking comes from "use strict").
use strict;
use LWP::Simple;

my $url = "http://www.baidu.com";
my $content = get($url);
die "could not get $url" unless defined $content;
if ($content =~ m/baidu/i) {
    print "found the string \"baidu\"\n";
} else {
    print "no \"baidu\" string here\n";
}

# Quick review of the three operators m//, s///, and tr///
my $str = "i love perl,oh yeah!";
if ($str =~ m/lo/) {
    print "have lo\n";
}
$str = "i love perl,oh yeah!";
if ($str =~ /lo/) {
    print "have lo as well\n";
}
# The m can be left out!
my $name = "my name is huangqihao haha";
$name =~ s/name/handsome name/;
print "$name\n";
$name =~ s/m/heihei/;
print "$name\n";
# See? s/// replaced only the first m with heihei. What if every m in $name should be replaced?
$name =~ s/m/heihei/g;
print "$name\n";
# See? With the g flag it happens!

# If a Perl regular expression contains (), then after a match or substitution the text matched
# by each () group is automatically assigned, in order, to $1, $2, ...
my $office = "hangzhou wenyixilu ";
$office =~ s/(yi)(xi)(lu)/<$2>,<$3>,<$1>/;
print "$office\n";
# Explanation: yi goes into $1, xi into $2, lu into $3; the matched "yixilu" is then replaced
# by "<xi>,<lu>,<yi>".

# tr
my $car = "my car's brand is bora";
$car =~ s/bora/bmw/;
print "$car\n";
$car =~ tr/bmw/BMW/;
print "$car\n";
# Note: tr/// maps characters one-for-one across the whole string, so every b, m, and w is
# upper-cased, not just the word bmw.

LWP::Simple also has a head function, which returns only a small part of the HTTP headers, whereas (per my notes) get.head returns all of them.

10: Hack10 More Involved Requests with LWP::UserAgent

LWP::UserAgent is a class for virtual browsers, which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.

11: Hack11 Adding HTTP Headers to Your Request

Q1: why? Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents.
Q2: how? $response = $browser->get($url, ...), with the extra header name/value pairs after the URL.
e.g. "you're telling the remote server which types of data you're willing to Accept"
To change only the User-Agent: $browser->agent('Mozilla/4.76 [en] (Win98; U)')

#!/usr/bin/perl
# 11 hack11: Adding HTTP Headers to Your Request
=xxx
# Review of the basic request/response flow: new an object from the LWP::UserAgent class;
# that object gets the class's methods and attributes.
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
my $url = "http://www.qq.com";
my $response = $browser->get($url);
if ($response->content =~ /qq/) {
    print "response has 'qq'";
} else {
    print "no 'qq'";
}
# To add header content, look at the book's code to see what a header contains;
# the call to use is $response = $browser->get($url, ...)
# Here is how the book structures the headers:
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
# Breakdown: User-Agent identifies the browser and its version;
# Accept lists the data types the client will accept;
# Accept-Charset lists the character sets;
# Accept-Language gives the preferred language.
# OK, if you only want to change the browser version, use the agent method of LWP::UserAgent:
# $browser->agent('Mozilla/4.76 [en] (Win98; U)');

# 12. Hack12 Posting Form Data with LWP
=xxx
e.g. http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22
Breakdown: after the ?, num is the number of results returned per page, hl is the language, and q is the query with its special characters replaced by their encoded equivalents.
=cut
#!/usr/bin/perl -w
use strict;
use LWP;

my $word = shift;
$word or die "Usage: perl altavista_post.pl [keyword]\n";
my $browser = LWP::UserAgent->new;
my $url = 'http://www.altavista.com/web/results';
my $response = $browser->post(
    $url,
    [
        'q'    => $word,    # the Altavista query string
        'pg'   => 'q',
        'avkw' => 'tgz',
        'kl'   => 'XX',
    ]
);
# This switches the request method to POST. Roughly speaking, POST is like an update,
# while GET is like a query or fetch.
# Having switched to POST, check whether the returned result matches what the request asked for.

13. Hack13 Authentication, Cookies, and Proxies
=xxx
# All that talk about authentication really comes down to:
$browser->credentials(
    'servername:portnumber',
    'realm-name',
    'username' => 'password'
);
# This has to be done before making the request!
e.g.
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode'
);

cookies:
Load cookies from a file on disk (with autosave on, cookies received later are written back to it):
use HTTP::Cookies;
$browser->cookie_jar(
    HTTP::Cookies->new(
        'file'     => '/some/where/cookies.lwp',
        'autosave' => 1,
    )
);
Reuse the cookies stored in a Netscape-format browser cookie file:
use HTTP::Cookies;    # yes, loads HTTP::Cookies::Netscape too
$browser->cookie_jar(
    HTTP::Cookies::Netscape->new(
        'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
    )
);

use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
$browser->env_proxy;
Argh, the book doesn't go into the proxy methods and tells me to look them up myself. So lazy!
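Since the book leaves the proxy methods out, here is a minimal sketch of the ones I believe LWP::UserAgent offers alongside env_proxy: proxy and no_proxy. The proxy host proxy.example.com:8080 and the no_proxy domains are made-up placeholders, not anything from the book, and nothing here touches the network.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Option 1: pick up proxy settings from environment variables
# such as http_proxy and no_proxy.
$browser->env_proxy;

# Option 2: set a proxy explicitly for the listed schemes.
# proxy.example.com:8080 is a placeholder, not a real proxy.
$browser->proxy(['http', 'ftp'], 'http://proxy.example.com:8080/');

# Domains that should be contacted directly, bypassing the proxy.
$browser->no_proxy('localhost', 'example.org');

# Calling proxy() with only a scheme returns the current setting.
my $http_proxy = $browser->proxy('http');
print "http proxy: $http_proxy\n";
```

Any request made through $browser afterwards (get, post, and so on) should then go through the configured proxy, except for the no_proxy domains.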
14. Hack14 Handling Relative and Absolute URLs

Use the URI class:
$url->scheme returns things like http or ftp
$url->host returns things like www.baidu.com
URI->new_abs: taking a URL string that is most likely relative and getting back an absolute URL
use URI;
my $abs = URI->new_abs($maybe_relative, $base);
This hack also shows how to match HTTP URLs.
=cut

15: Hack15 Secured Access and Browser Attributes

It explains that when you fetch from, say, a bank's website, there is usually SSL (Secure Sockets Layer) between the browser and the server.
You can usually tell a secured site by the https at the front of the URL.
In other words, you need HTTPS support installed; check the reference material for that.
It also introduces some other browser methods.

16: Hack16 Respecting Your Scrapee's Bandwidth

time2str($response->last_modified): last_modified gives the time the URL was last modified, and time2str (from HTTP::Date) formats it as a date string.
#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;

my $url = 'http://disobey.com/amphetadesk/';
my $date = "Thu, 31 Oct 2002 01:05:16 GMT";
my %headers = ( 'If-Modified-Since' => $date );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
This code is mainly for checking whether the page has changed since $date!

ETags: Instead of a date, it returns a unique string based on the content you're downloading. In other words, a unique string tied to the content.

Compressed Data:
A long passage that boils down to: compress. Then another long passage: decompress. OK, so easy; here is the book's code:
use strict;
use Compress::Zlib;
use LWP 5.64;

my $url = 'http://www.disobey.com/';
my %headers = ( 'Accept-Encoding' => 'gzip, deflate' );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
my $data = $response->content;
if (my $encoding = $response->content_encoding) {
    $data = Compress::Zlib::memGunzip($data)  if $encoding =~ /gzip/i;
    $data = Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
}
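To try the decompression branch without hitting a server, the same two Compress::Zlib calls can be exercised on data compressed locally. This is only a sketch: decode_body is a hypothetical helper name of mine, and the sample string stands in for a response body; the book's code works on a real $response instead.

```perl
use strict;
use warnings;
use Compress::Zlib;

# Hypothetical helper: undo the Content-Encoding named in $encoding,
# using the same calls as the book's snippet.
sub decode_body {
    my ($data, $encoding) = @_;
    return $data unless defined $encoding;
    return Compress::Zlib::memGunzip($data)  if $encoding =~ /gzip/i;
    return Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
    return $data;    # unknown encoding: hand the body back untouched
}

# Stand-in for a response body (not real server data).
my $original = "hello spider " x 20;

# Compress it both ways, as a server honouring Accept-Encoding might.
my $gzipped  = Compress::Zlib::memGzip($original);
my $deflated = Compress::Zlib::compress($original);

my $from_gzip    = decode_body($gzipped,  'gzip');
my $from_deflate = decode_body($deflated, 'deflate');
my $plain        = decode_body($original, undef);

print "gzip round trip: ",    ($from_gzip    eq $original ? "ok" : "FAILED"), "\n";
print "deflate round trip: ", ($from_deflate eq $original ? "ok" : "FAILED"), "\n";
print "compressed sizes: ", length($gzipped), " and ", length($deflated),
      " bytes, from ", length($original), "\n";
```

Returning as soon as one encoding matches keeps the helper from trying to decompress the same body twice.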