
Spidering Hacks Study Notes (1)

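  My team lead handed me a book, Spidering Hacks, and said that once I had learned what's in it I could go anywhere without fear! So let's have a look: it is an English book of more than 400 pages. I wanted to buy a paper copy, but it was too expensive for me.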

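OK, my approach: read the table of contents first, then the section headings, then see how the book explains each heading. A section heading is just the main idea, after all! Heh, let's go....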

 

                    I: Overview:
chapter1:
(basics, philosophies, considerations, issues)
 
chapter2:
(spidering toolbox, modules galore, prominent)
 
chapter3:
(media files, Library of Congress)
 
chapter4:
(getting to information that is not as easy as just scraping)
 
chapter5:
(keep data current, mirror collections to hard disk, spider schedule)
 
chapter6:
(share your own data so it can be spidered)
 
                   II: Chapter 1
1:Hack1:
(traverse the Web)
 
Q1:What is the difference between spiders and scrapers?
   Spiders are programs that grab entire pages, files, or
sets of either, while scrapers grab very specific bits of information within these files.
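
To make the distinction concrete, here is a minimal Python sketch (the book itself works in Perl, so this is only a stand-in). The sample page, the `fake_web` dict, and the `price` class name are all made up for illustration; a real spider would download pages over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real spider would download this over HTTP.
SAMPLE_PAGE = """<html>
<head><title>Example Store</title></head>
<body><h1>Widgets</h1><p>Price: <span class="price">9.99</span></p></body>
</html>"""


def spider(pages, url):
    """Spider-style: return the entire page for a URL, untouched."""
    return pages[url]


class PriceScraper(HTMLParser):
    """Scraper-style: keep only the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def scrape_prices(html):
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


# A dict stands in for the Web so the sketch runs offline.
fake_web = {"http://example.com/widgets": SAMPLE_PAGE}
whole_page = spider(fake_web, "http://example.com/widgets")  # the whole file
prices = scrape_prices(whole_page)                           # one specific bit
```

The spider hands back the full document, while the scraper throws almost all of it away and keeps a single value.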
  
Q2:Why Spider:
(1) Gain automated access to resources
(2) Gather information and present it in an alternate format
(3) Aggregate otherwise disparate data sources
(4) Combine the functionalities of sites (many search engines integrate resources this way)
(5) Find and gather specific kinds of information
(6) Perform regular webmaster functions (taking over part of a site administrator's duties)
 
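2:Hack2: Best Practices for You and Your Spider
(1) Be Liberal in What You Accept (there will be many formats: HTML, XML, ...; they need converting, and you need boundaries)
(2) Don't Limit Your Dataset
(3) Don't Reinvent the Wheel (don't start over from scratch; borrow from other people's spider scripts!)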
 
Q3:Best practices for you and your spider (a few points to watch):
Choose the most structured format available
If you must scrape HTML, do so sparingly (the less HTML, the less fragile your spider will be!!! heh)
Don't go where you're not wanted
Choose a good identifier
Make information on your spider readily available
Don't demand unlimited site access or support
Go light on the bandwidth (a spider should know when to stop; watch the bandwidth)
Take just enough, and don't take too often
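
A few of these points (don't go where you're not wanted, choose a good identifier, go light on the bandwidth) can be sketched in Python with the standard `urllib.robotparser` module. This is only a rough sketch: the spider name, the info URL, the sample robots.txt, and the `fetch` callback are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier: a name plus a URL where the spider is documented.
USER_AGENT = "NotesSpider/0.1 (+http://example.com/notes-spider.html)"

# Sample robots.txt, inlined so the sketch runs offline. A real spider
# would call set_url(".../robots.txt") and read() instead of parse().
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Don't go where you're not wanted: ask robots.txt first."""
    return robots.can_fetch(USER_AGENT, url)


def polite_crawl(urls, fetch, delay_seconds=1.0):
    """Fetch only allowed URLs, pausing between requests.

    `fetch` is a caller-supplied function (url, user_agent) -> content,
    an assumption of this sketch so that no network access is needed.
    """
    results = {}
    for url in urls:
        if not allowed(url):
            continue  # skip anything the site has asked spiders to avoid
        results[url] = fetch(url, USER_AGENT)
        time.sleep(delay_seconds)  # go light on the bandwidth
    return results
```

Passing `fetch` in from outside keeps the politeness logic separate from the download code, so the same loop works with whatever HTTP library you prefer.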
 
3:Hack3: Anatomy of an HTML Page
<html>
<head>
<title>
 Title
</title>
</head>
<body>
 body
</body>
</html>
 
(a)Header Information with the H Tags
<H1> <H2>: the heading levels
(b)List Information with Special HTML Tags
ordered list: <ol> <li> </li> </ol>
You can grab everything between <ol> and </ol> and parse each <li></li>
element into an array.
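
The "grab everything between <ol> and </ol>" idea might look like this in Python, using the standard `html.parser` module. It is a minimal sketch that assumes flat, non-nested lists; the class name and sample markup are made up.

```python
from html.parser import HTMLParser


class OrderedListParser(HTMLParser):
    """Collect the text of every <li> that sits inside an <ol>."""

    def __init__(self):
        super().__init__()
        self.in_ol = False
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.in_ol = True
        elif tag == "li" and self.in_ol:
            self.in_li = True
            self.items.append("")  # start a new array element

    def handle_endtag(self, tag):
        if tag == "ol":
            self.in_ol = False
            self.in_li = False
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def list_items(html):
    parser = OrderedListParser()
    parser.feed(html)
    return [item.strip() for item in parser.items]


sample = "<p>Top hacks</p><ol><li>Walk the Web</li><li>Be polite</li></ol><p>End</p>"
items = list_items(sample)
```

Text outside the list (the two paragraphs in the sample) is ignored, so only the list entries end up in the array.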
(c)Non-HTML Files
XML's parts are defined more rigidly than HTML's
Using XML::RSS to Repurpose Everything
Perl XML modules
(such as XML::Simple, XML::RSS, or XML::LibXML)
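
Because XML is defined so rigidly, pulling data out of a feed needs no guesswork. The book uses the Perl modules above; as a stand-in, here is a small Python sketch with the standard `xml.etree.ElementTree` module. The sample feed and its URLs are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs offline.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def feed_items(rss_text):
    """Return (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


entries = feed_items(SAMPLE_RSS)
```

Compared with scraping HTML, there is no state tracking here: every item has the same named children, which is why the notes say to choose the most structured format available.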
 
4:Hack4: Registering Your Spider
 
(a)Naming Your Spider (pick a meaningful name!)
(b)A Web Page About Your Spider
(c)Places to Register Your Spider
 
5:Hack5: Preempting Discovery
 
(a)Making Contact: tell people about your spider and how to contact you!!
(b)Making the Arguments for Your Spider: tell people what you are doing
(c)Making Your Spider Easy to Find and Learn About
(d)Considering Legal Issues
 
6:Hack6: Keeping Your Spider Out of Sticky Situations
(a)Bad Spider, No Biscuit! (the point: don't do harmful things)
(b)Violating Copyright (don't take other people's intellectual property, or you may end up breaking the law!)
(c)Aggregating Data
(d)Competitive Intelligence (competitors)
(e)Possible Consequences of Misbehaving Spiders (the book says the police will come knocking on your door!)
(f)Tracking Legal Issues (keep an eye on the legal side)

7:Hack7: Finding the Patterns of Identifiers
(nothing much to say!)