Go Colly Notes

Date: 2020-06-11 17:07:10

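Colly is a fairly full-featured HTTP client and scraping tool for Go.

Installation

GoLand is used as the development environment.

GOROOT: the Go distribution is installed in /opt/go, so GOROOT defaults to /opt/go.

GOPATH: under Settings -> Go -> GOPATH, configure the Global GOPATH to point to /home/milton/WorkGo.

GOPROXY: under Settings -> Go -> Go Modules, set Environment to GOPROXY=https://goproxy.cn

In GoLand's built-in Terminal, run go env to check the environment variables and confirm the paths are correct, then install Colly with one of the following commands.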

# v1
go get -u github.com/gocolly/colly

# v2
go get -u github.com/gocolly/colly/v2

Basic usage

Add the imports (the example below also uses fmt)

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

Call it from main

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
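
To make the link handling in the callback above more concrete, here is a minimal standard-library sketch of what resolving an href against the page URL and checking it against a domain allowlist roughly involves. The helpers resolveLink and hostAllowed are hypothetical names introduced for illustration, and the exact-hostname comparison is an assumption of this sketch. It illustrates the idea and is not Colly's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it resolves href (which may be
// relative) against the URL of the page it was found on.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// hostAllowed is a hypothetical helper: it reports whether the hostname of
// rawURL exactly matches one of the allowed domains (an assumption here).
func hostAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, d := range allowed {
		if u.Hostname() == d {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"hackerspaces.org", "wiki.hackerspaces.org"}
	for _, href := range []string{"/wiki/Berlin", "https://example.com/"} {
		abs, err := resolveLink("https://wiki.hackerspaces.org/List", href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs, "allowed:", hostAllowed(abs, allowed))
	}
}
```

With exact matching, a subdomain has to be listed explicitly, which would explain why the example passes both hackerspaces.org and wiki.hackerspaces.org to AllowedDomains.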

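Using a proxy pool

See the example in the documentation: http://go-colly.org/docs/examples/proxy_switcher/ There are two things to watch out for with that example.

1. AllowURLRevisit must be set when creating the collector; otherwise a repeated visit to the same URL is skipped instead of being requested again.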

c := colly.NewCollector(colly.AllowURLRevisit())

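2. KeepAlive also has to be disabled; otherwise, when the same site is visited several times, GetProxy is called only once and the proxy pool is never rotated. Related issues: #392, #366, #339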

c := colly.NewCollector(colly.AllowURLRevisit())

c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})
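
The rotation that the proxy pool is meant to provide can be sketched with the standard library alone. The snippet below is a minimal sketch under stated assumptions: newRoundRobin is a hypothetical helper (not part of Colly), and the proxy addresses are placeholders. It returns a function with the signature that http.Transport.Proxy expects and sets it on a transport with keep-alives disabled, mirroring the setting above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// newRoundRobin is a hypothetical helper that returns a proxy-selection
// function cycling through proxyURLs in order. The returned function has
// the signature expected by http.Transport.Proxy.
func newRoundRobin(proxyURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(proxyURLs) == 0 {
		return nil, errors.New("proxy list is empty")
	}
	parsed := make([]*url.URL, len(proxyURLs))
	for i, raw := range proxyURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		parsed[i] = u
	}
	var next uint32
	return func(*http.Request) (*url.URL, error) {
		// AddUint32 returns the incremented value; subtract 1 to start at index 0.
		i := atomic.AddUint32(&next, 1) - 1
		return parsed[i%uint32(len(parsed))], nil
	}, nil
}

func main() {
	// Placeholder proxy addresses (assumptions); replace with real proxies.
	pick, err := newRoundRobin("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
	if err != nil {
		panic(err)
	}

	// A transport that asks pick for a proxy and does not reuse connections.
	// In a Colly program it could be passed to c.WithTransport(transport).
	transport := &http.Transport{
		Proxy:             pick,
		DisableKeepAlives: true,
	}
	_ = transport

	// Show the rotation without making any network request.
	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u)
	}
}
```

The atomic counter keeps the selection safe if the function is called from several goroutines at once, which matters when a collector runs requests in parallel.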

Original article: https://www.cnblogs.com/milton/p/13093544.html
