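Colly is a fairly full-featured HTTP client and scraping tool for Go.
GoLand is used as the development environment.
GOROOT: the go directory was placed at /opt/go, so GOROOT points to /opt/go by default.
GOPATH: under Settings->Go->GOPATH, configure the Global GOPATH to point to /home/milton/WorkGo
GOPROXY: under Settings->Go->Go Modules, set Environment to GOPROXY=https://goproxy.cn
In GoLand's built-in Terminal, run go env to check the environment variables and confirm the paths are correct, then run the following commands to install: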
# v1
go get -u github.com/gocolly/colly
# v2
go get -u github.com/gocolly/colly/v2
Add the imports (fmt is needed by the example that follows)
import (
	"fmt"

	"github.com/gocolly/colly/v2"
)
Usage
func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
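When following the proxy switcher example in the docs, http://go-colly.org/docs/examples/proxy_switcher/, there are two issues to watch for.
1. AllowURLRevisit must be set when creating the collector. Otherwise a repeat visit to the same URL is skipped: no new request is sent for a URL that has already been visited.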
c := colly.NewCollector(colly.AllowURLRevisit())
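2. Keep-alive must also be disabled. Otherwise, when the same site is visited repeatedly, GetProxy is called only once, which defeats the purpose of rotating through the proxy pool. Related issues: #392, #366, #339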
// Requires the additional import "net/http"
c := colly.NewCollector(colly.AllowURLRevisit())
c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})
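To make the rotation idea concrete, here is a minimal standard-library sketch of a round-robin proxy selector. It is not Colly's implementation; roundRobin is a hypothetical helper name and the socks5 addresses are placeholders. The function it returns has the same shape as the Proxy field of http.Transport, so each call hands back the next proxy in the pool.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobin is a hypothetical helper (not part of Colly). It parses the
// given proxy URLs and returns a selector that cycles through them in order.
func roundRobin(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, fmt.Errorf("no proxy URLs given")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		proxies = append(proxies, u)
	}
	var next uint32
	return func(_ *http.Request) (*url.URL, error) {
		// Atomically advance the counter so concurrent callers are safe.
		i := atomic.AddUint32(&next, 1) - 1
		return proxies[i%uint32(len(proxies))], nil
	}, nil
}

func main() {
	// Placeholder proxy addresses, assumed for illustration only.
	pick, err := roundRobin("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
	if err != nil {
		panic(err)
	}

	// The selector plugs into a transport with keep-alive disabled,
	// mirroring the setting described above.
	transport := &http.Transport{Proxy: pick, DisableKeepAlives: true}
	_ = transport

	// Call the selector directly to show the rotation order.
	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u.String())
	}
}
```

Calling the selector three times with two proxies returns the first, the second, then the first again. Rotation only takes effect if the selector is actually consulted for each request, which is why the keep-alive setting matters in the Colly case.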
Original post: https://www.cnblogs.com/milton/p/13093544.html