gocrawl Analysis
1. gocrawl type structure
// The crawler itself, the master of the whole process
type Crawler struct {
	Options *Options

	// Internal fields
	logFunc         func(LogFlags, string, ...interface{})
	push            chan *workerResponse
	enqueue         chan interface{}
	stop            chan struct{}
	wg              *sync.WaitGroup
	pushPopRefCount int
	visits          int

	// keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
	// is of no use, but this is the smallest type possible - it uses no memory at all.
	visited map[string]struct{}
	hosts   map[string]struct{}
	workers map[string]*worker
}
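The comment inside the struct is worth a pause: map[string]struct{} is the standard Go set idiom. A minimal standalone sketch of it (the URL key is just an illustrative value):

package main

import "fmt"

func main() {
	// visited acts as a set: struct{} occupies zero bytes, so only the
	// keys consume memory, and membership tests are O(1).
	visited := make(map[string]struct{})
	visited["http://0value.com/"] = struct{}{} // add a member

	if _, ok := visited["http://0value.com/"]; ok {
		fmt.Println("already visited")
	}
}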
// The Options available to control and customize the crawling process.
type Options struct {
	UserAgent             string
	RobotUserAgent        string
	MaxVisits             int
	EnqueueChanBuffer     int
	HostBufferFactor      int
	CrawlDelay            time.Duration // Applied per host
	WorkerIdleTTL         time.Duration
	SameHostOnly          bool
	HeadBeforeGet         bool
	URLNormalizationFlags purell.NormalizationFlags
	LogFlags              LogFlags
	Extender              Extender
}
// Extension methods required to provide an extender instance.
type Extender interface {
	// Start, End, Error and Log are not related to a specific URL, so they don't
	// receive a URLContext struct.
	Start(interface{}) interface{}
	End(error)
	Error(*CrawlError)
	Log(LogFlags, LogFlags, string)

	// ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
	// is related to a URLContext (holds a ctx field).
	ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

	// All other extender methods are executed in the context of an URL, and thus
	// receive an URLContext struct as first argument.
	Fetch(*URLContext, string, bool) (*http.Response, error)
	RequestGet(*URLContext, *http.Response) bool
	RequestRobots(*URLContext, string) ([]byte, bool)
	FetchedRobots(*URLContext, *http.Response)
	Filter(*URLContext, bool) bool
	Enqueued(*URLContext)
	Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
	Visited(*URLContext, interface{})
	Disallowed(*URLContext)
}
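DefaultExtender implements every method of this interface, so a custom extender typically embeds it and overrides only the hooks it needs. Here is a sketch of the Ext type used in the entry point below, assuming we only want to print each visited URL (the Printf body is illustrative):

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/goquery"
)

// Ext embeds *gocrawl.DefaultExtender, which already satisfies the full
// Extender interface; only the hooks we care about get overridden.
type Ext struct {
	*gocrawl.DefaultExtender
}

// Visit is called for every fetched page. Returning (nil, true) asks
// gocrawl to harvest the links of the document itself.
func (e *Ext) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
	fmt.Printf("visited: %s\n", ctx.URL())
	return nil, true
}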
Entry point:
func main() {
	ext := &Ext{&gocrawl.DefaultExtender{}}
	// Set custom options
	opts := gocrawl.NewOptions(ext)
	opts.CrawlDelay = 1 * time.Second
	opts.LogFlags = gocrawl.LogError
	opts.SameHostOnly = false
	opts.MaxVisits = 10

	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("http://0value.com")
}
main proceeds in three steps:
1) create an Extender
2) create Options with the given Extender
3) create the gocrawl Crawler with those Options and run it
As the comments indicate, the Crawler controls the whole process, Options supplies configuration, and the Extender does the real work.
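To see the Extender "doing the real work" beyond Visit, here is a hedged sketch of a Filter override, which decides whether an enqueued URL gets visited; the rxOk regexp is an assumption added for illustration, not part of the original example:

import "regexp"

// rxOk is an illustrative whitelist (an assumption, not from the post).
var rxOk = regexp.MustCompile(`^https?://0value\.com/`)

// Filter is consulted for every enqueued URL; returning false drops it.
// Here we skip already-visited URLs and anything outside the target site.
func (e *Ext) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
	return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
}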
2. Other key structs
worker, workerResponse and sync.WaitGroup:
// Communication from worker to the master crawler, about the crawling of a URL
type workerResponse struct {
	ctx           *URLContext
	visited       bool
	harvestedURLs interface{}
	host          string
	idleDeath     bool
}
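A standalone toy sketch of the push pattern this struct exists for (a simplification, not gocrawl's actual code): every worker reports to the master over one shared channel, and the master drains it in a single goroutine:

package main

import "fmt"

type workerResponse struct {
	host    string
	visited bool
}

func main() {
	push := make(chan *workerResponse)

	// A stand-in worker reporting one visit. Closing the channel here is
	// a simplification; gocrawl instead tracks outstanding work with the
	// pushPopRefCount counter seen in the Crawler struct.
	go func() {
		push <- &workerResponse{host: "0value.com", visited: true}
		close(push)
	}()

	for res := range push {
		fmt.Printf("host %s visited=%v\n", res.host, res.visited)
	}
}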
// The worker is dedicated to fetching and visiting a given host, respecting
// this host's robots.txt crawling policies.
type worker struct {
	// Worker identification
	host  string
	index int

	// Communication channels and sync
	push    chan<- *workerResponse
	pop     popChannel
	stop    chan struct{}
	enqueue chan<- interface{}
	wg      *sync.WaitGroup

	// Robots validation
	robotsGroup *robotstxt.Group

	// Logging
	logFunc func(LogFlags, string, ...interface{})

	// Implementation fields
	wait           <-chan time.Time
	lastFetch      *FetchInfo
	lastCrawlDelay time.Duration
	opts           *Options
}
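The wait <-chan time.Time field hints at how the per-host CrawlDelay is enforced between fetches. A standalone sketch of that throttling pattern, under the assumption that the worker re-arms the timer after each fetch:

package main

import (
	"fmt"
	"time"
)

func main() {
	crawlDelay := 1 * time.Second
	var wait <-chan time.Time

	for i := 0; i < 3; i++ {
		if wait != nil {
			<-wait // block until the delay since the last fetch elapses
		}
		fmt.Println("fetching URL", i)
		wait = time.After(crawlDelay) // re-arm the delay for the next fetch
	}
}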
3. I will walk through the whole gocrawl workflow in a few days. (6/20/2014)