
Crawler - Day 02 - Fetching and Parsing

Date: 2018-05-09 14:02:45
###Page Fetching###
1. urllib3
    A powerful and convenient HTTP client that fills gaps in the Python standard library
    Install: pip install urllib3
    Usage:
import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://news.qq.com')
print(response.headers)
result = response.data.decode('gbk')  # news.qq.com historically served GBK
print(result)
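The `decode('gbk')` call above hard-codes the page's charset. A small helper (the name `decode_body` is my own) can try UTF-8 first and fall back to GBK instead of raising `UnicodeDecodeError`:

```python
def decode_body(data: bytes) -> str:
    """Try UTF-8 first, then fall back to GBK (common on older
    Chinese sites). Note: some GBK byte sequences happen to be
    valid UTF-8, so charset sniffing like this is a heuristic."""
    for encoding in ('utf-8', 'gbk'):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: substitute replacement characters rather than crash
    return data.decode('utf-8', errors='replace')

print(decode_body('人民币'.encode('gbk')))  # 人民币
```

In real code, prefer the charset declared in the `Content-Type` response header when one is present.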
 
Sending a request over HTTPS
Install the dependency: pip install certifi
import certifi
import urllib3
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())  # verify certificates against the CA bundle
resp = http.request('GET', 'https://news.baidu.com/')
print(resp.data.decode('utf-8'))
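certifi only supplies a CA bundle; the verification behavior itself can be illustrated with the standard library's ssl module (a sketch, no urllib3 or network needed):

```python
import ssl

# A default context already requires and verifies certificates and
# checks hostnames, mirroring cert_reqs='CERT_REQUIRED' above.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True
```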
 
####With query parameters
import urllib3
from urllib.parse import urlencode
http = urllib3.PoolManager()
args = {'wd': '人民币'}
# url = 'http://www.baidu.com/s?%s' % (args)   # wrong: interpolates the raw dict
url = 'http://www.baidu.com/s?%s' % (urlencode(args))
print(url)
# resp = http.request('GET', url)
# print(resp.data.decode('utf-8'))
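urlencode percent-encodes non-ASCII values and joins multiple parameters with `&`, which is why the commented-out raw-dict interpolation cannot work. A quick stdlib check (the extra `pn` parameter is only for illustration):

```python
from urllib.parse import urlencode

# Dicts preserve insertion order in Python 3.7+, so the query
# string comes out in the order the keys were written.
args = {'wd': '人民币', 'pn': 10}
query = urlencode(args)
print(query)  # wd=%E4%BA%BA%E6%B0%91%E5%B8%81&pn=10
```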
 
headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.baidu.com',
    'Referer': 'https://www.baidu.com/s?wd=人民币',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
resp = http.request('GET', 'http://www.baidu.com/s', fields=args, headers=headers)
print(resp.data.decode('utf-8'))
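The same custom headers can also be attached with the standard library's urllib.request, as a sketch of what urllib3 does for you (no request is actually sent here; the URL and header values are placeholders):

```python
from urllib.request import Request

req = Request('http://www.baidu.com/s?wd=test',
              headers={'User-Agent': 'Mozilla/5.0',
                       'Referer': 'https://www.baidu.com/'})
# Request normalizes header names to Capitalized-form internally,
# and get_header() applies the same normalization on lookup.
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.host)                      # www.baidu.com
```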

Original: https://www.cnblogs.com/Albert-w/p/9013194.html
