
gocrawl Analysis

1. The gocrawl type structure

 

// The crawler itself, the master of the whole process
type Crawler struct {
    Options *Options

    // Internal fields
    logFunc         func(LogFlags, string, ...interface{})
    push            chan *workerResponse
    enqueue         chan interface{}
    stop            chan struct{}
    wg              *sync.WaitGroup
    pushPopRefCount int
    visits          int

    // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
    // is of no use, but this is the smallest type possible - it uses no memory at all.
    visited map[string]struct{}
    hosts   map[string]struct{}
    workers map[string]*worker
}
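
The visited and hosts maps use the empty-struct set idiom the comment describes: struct{} occupies zero bytes, so the map stores only its keys. A standalone sketch of the idiom (my own illustration, not gocrawl code):

package main

import "fmt"

func main() {
    // A set of strings: struct{} values take no space, and membership
    // lookup is O(1), vs O(n) for scanning a slice.
    visited := make(map[string]struct{})
    visited["http://0value.com/"] = struct{}{}

    if _, ok := visited["http://0value.com/"]; ok {
        fmt.Println("already visited, skip")
    }
}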

 

// The Options available to control and customize the crawling process.
type Options struct {
    UserAgent             string
    RobotUserAgent        string
    MaxVisits             int
    EnqueueChanBuffer     int
    HostBufferFactor      int
    CrawlDelay            time.Duration // Applied per host
    WorkerIdleTTL         time.Duration
    SameHostOnly          bool
    HeadBeforeGet         bool
    URLNormalizationFlags purell.NormalizationFlags
    LogFlags              LogFlags
    Extender              Extender
}
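
Two of the less self-explanatory options are URLNormalizationFlags and WorkerIdleTTL. A minimal sketch of tuning them (my own example, not from the post; the values are arbitrary, and FlagsAllGreedy is one of purell's predefined flag sets):

package main

import (
    "time"

    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/purell"
)

func main() {
    // DefaultExtender is a ready-made implementation of the whole
    // Extender interface, good enough to demonstrate option tuning.
    opts := gocrawl.NewOptions(&gocrawl.DefaultExtender{})

    // Normalize aggressively, so more URL variants collapse to the same
    // key in the crawler's visited map.
    opts.URLNormalizationFlags = purell.FlagsAllGreedy

    // Shut down a per-host worker after 30 idle seconds (arbitrary value).
    opts.WorkerIdleTTL = 30 * time.Second

    _ = gocrawl.NewCrawlerWithOptions(opts)
}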

 

// Extension methods required to provide an extender instance.
type Extender interface {
    // Start, End, Error and Log are not related to a specific URL, so they don't
    // receive a URLContext struct.
    Start(interface{}) interface{}
    End(error)
    Error(*CrawlError)
    Log(LogFlags, LogFlags, string)

    // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
    // is related to a URLContext (holds a ctx field).
    ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

    // All other extender methods are executed in the context of an URL, and thus
    // receive an URLContext struct as first argument.
    Fetch(*URLContext, string, bool) (*http.Response, error)
    RequestGet(*URLContext, *http.Response) bool
    RequestRobots(*URLContext, string) ([]byte, bool)
    FetchedRobots(*URLContext, *http.Response)
    Filter(*URLContext, bool) bool
    Enqueued(*URLContext)
    Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
    Visited(*URLContext, interface{})
    Disallowed(*URLContext)
}
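
In practice you rarely implement all of these methods yourself: gocrawl ships a DefaultExtender that satisfies the whole interface, so a custom extender just embeds it and overrides what it needs. A minimal sketch (the Ext name and the method bodies are my assumption, chosen to match the entry point below):

package main

import (
    "net/http"

    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
)

// Ext overrides a subset of the Extender interface; every method it does
// not define falls through to the embedded DefaultExtender.
type Ext struct {
    *gocrawl.DefaultExtender
}

// Visit is called once per fetched page. Returning (nil, true) tells
// gocrawl to harvest the page's links itself.
func (e *Ext) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    // e.g. inspect doc (the parsed goquery document) here
    return nil, true
}

// Filter decides whether a harvested URL gets enqueued.
func (e *Ext) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    return !isVisited
}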

 

Entry point:

package main

import (
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

func main() {
    // Ext embeds *gocrawl.DefaultExtender (see the sketch above), so only
    // the overridden methods differ from the defaults
    ext := &Ext{&gocrawl.DefaultExtender{}}
    // Set custom options
    opts := gocrawl.NewOptions(ext)
    opts.CrawlDelay = 1 * time.Second
    opts.LogFlags = gocrawl.LogError
    opts.SameHostOnly = false
    opts.MaxVisits = 10

    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("http://0value.com")
}
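
Run blocks until the crawl terminates, here once MaxVisits (10) pages have been visited; behind the scenes the Crawler spawns one dedicated worker goroutine per host, as the worker struct below shows.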

 

Three steps in main:

1) get an Extender

2) create Options with the given Extender

3) create the gocrawl Crawler

As the struct comments say, the Crawler controls the whole process, Options supplies the configuration, and the Extender does the real work.

 

2. Other key structs

worker, workerResponse, and sync.WaitGroup:

// Communication from worker to the master crawler, about the crawling of a URL
type workerResponse struct {
    ctx           *URLContext
    visited       bool
    harvestedURLs interface{}
    host          string
    idleDeath     bool
}

 

// The worker is dedicated to fetching and visiting a given host, respecting
// this host's robots.txt crawling policies.
type worker struct {
    // Worker identification
    host  string
    index int

    // Communication channels and sync
    push    chan<- *workerResponse
    pop     popChannel
    stop    chan struct{}
    enqueue chan<- interface{}
    wg      *sync.WaitGroup

    // Robots validation
    robotsGroup *robotstxt.Group

    // Logging
    logFunc func(LogFlags, string, ...interface{})

    // Implementation fields
    wait           <-chan time.Time
    lastFetch      *FetchInfo
    lastCrawlDelay time.Duration
    opts           *Options
}
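
The push/pop/stop channels plus the shared WaitGroup form a standard Go master-worker pattern: the crawler hands URLs to a per-host worker, the worker reports back a workerResponse, and the WaitGroup tracks shutdown. A stripped-down sketch of that pattern (my own illustration, not gocrawl's actual code):

package main

import (
    "fmt"
    "sync"
)

// response mimics workerResponse: what a worker reports to the master.
type response struct {
    host    string
    visited bool
}

func workerLoop(host string, pop <-chan string, push chan<- *response,
    stop <-chan struct{}, wg *sync.WaitGroup) {
    defer wg.Done()
    for {
        select {
        case u := <-pop: // the master hands this worker a URL for its host
            _ = u // a real worker would fetch and visit u here
            push <- &response{host: host, visited: true}
        case <-stop: // the master asks the worker to exit
            return
        }
    }
}

func main() {
    var wg sync.WaitGroup
    pop := make(chan string)
    push := make(chan *response)
    stop := make(chan struct{})

    wg.Add(1)
    go workerLoop("0value.com", pop, push, stop, &wg)

    pop <- "http://0value.com/"   // enqueue one URL
    fmt.Println((<-push).visited) // master receives the worker's response
    close(stop)                   // broadcast stop to the worker
    wg.Wait()                     // wait for the worker to shut down
}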

 



3. I will post a full walkthrough of the gocrawl workflow in a few days. (6/20/2014)