
0. Web Scraping: the urllib Library, urlopen() and Request()

Posted: 2019-04-09 12:11:51

# Pay attention to which import form you use: import urllib.request or from urllib import request
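The two import forms bind different names but load the same module, so the choice only changes how calls are spelled. A minimal sketch that needs no network access:

```python
import urllib.request          # call as urllib.request.urlopen(...)
from urllib import request     # call as request.urlopen(...)

# Both names refer to the very same module object.
same_module = request is urllib.request
print(same_module)
```

Whichever form you pick, use it consistently within one script so that calls such as urlopen() are always spelled the same way.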

0. urlopen()

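Syntax: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • Example 0: (in practice this function is usually called with just three parameters: url, data and timeout)

* The data argument must be of type bytes, not str: URL-encode the parameters first, then convert the resulting string to a byte stream with bytes().

* response.read() returns data of type bytes, so it has to be decoded before use.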

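import socket
import urllib.parse
import urllib.request
import urllib.error

url = 'http://httpbin.org/post'
# URL-encode the form fields, then convert the resulting str to bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
try:
    response = urllib.request.urlopen(url, data=data, timeout=1)
except urllib.error.URLError as e:
    # e.reason is a socket.timeout instance when the request timed out
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
else:
    # Runs only when no exception was raised
    print(response.read().decode("utf-8"))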

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "101.206.170.234, 101.206.170.234", 
  "url": "https://httpbin.org/post"
}
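The Content-Length of 10 in the output above is simply the size of the encoded form body. A small offline sketch of what ends up in data (the variable names are illustrative, not from the original):

```python
from urllib import parse

form = {'word': 'hello'}

# urlencode() turns the dict into a query string (a str) ...
encoded = parse.urlencode(form)
# ... and bytes() turns that str into the bytes object urlopen() expects.
body = bytes(encoded, encoding='utf8')

print(encoded)      # word=hello
print(body)         # b'word=hello'
print(len(body))    # 10
```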
  • Example 1: inspect the status code, the response headers, and the Server field of the response headers
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48410'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 09 Apr 2019 02:32:34 GMT'), ('Via', '1.1 varnish'), ('Age', '722'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18751-HND'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 1223'), ('X-Timer', 'S1554777154.210361,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

Calling urlopen() with a bare URL is quite limiting: for example, you cannot set the request headers. That is why the Request() class is needed.

 

1. Request()

  • Example 0: (the two approaches below have the same effect)
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

######################################

import urllib.request

req = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

 

The rest of this section shows how to use Request() to make GET and POST requests and how to set their parameters.

  • Example 1: (POST request)
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict so the built-in dict type is not shadowed
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
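A Request object can be inspected before anything is sent, which is a convenient way to check the method, body and headers. The sketch below only builds the object and makes no network call; note that Request stores header names in capitalized form ("User-agent"), so lookups must use that spelling.

```python
from urllib import request, parse

data = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request(
    url='http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'},
    method='POST',
)

print(req.full_url)        # http://httpbin.org/post
print(req.get_method())    # POST
print(req.data)            # b'name=Germey'
# Header names are stored capitalized, so look them up as "User-agent".
print(req.get_header('User-agent'))
print(req.header_items())
```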

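You can also add headers with the add_header() method to imitate a browser, and the data value can equally be built as shown below:

Note: headers can also be changed with build_opener(); it is not covered in detail here, since the approach above is usually enough.

from urllib import request, parse

url = 'http://httpbin.org/post'
# Named form_data rather than dict so the built-in dict type is not shadowed
form_data = {
    'name': 'Germey'
}
data = parse.urlencode(form_data).encode('utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))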
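For completeness, here is a minimal sketch of the build_opener() approach mentioned above. It sets headers once on an opener instead of on each Request. The httpbin.org/get URL in the commented-out lines is an assumption for illustration and is not taken from the original post.

```python
from urllib import request

# build_opener() returns an OpenerDirector; the headers in its addheaders
# list are sent with every request made through that opener.
opener = request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')]

print(opener.addheaders)

# Assumption: uncomment to actually send a request (needs network access).
# response = opener.open('http://httpbin.org/get')
# print(response.read().decode('utf-8'))

# Optional: make this opener the default used by request.urlopen().
# request.install_opener(opener)
```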
  • Example 2: (GET request) keyword search on Baidu
from urllib import request, parse

url = 'http://www.baidu.com/s?wd='
key = '路飞'
# quote() is defined in urllib.parse; request.quote is the same function re-imported
key_code = parse.quote(key)
url_all = url + key_code
"""
# Alternative way to build the URL
url = 'http://www.baidu.com/s'
key = '路飞'
wd = parse.urlencode({'wd': key})
url_all = url + '?' + wd
"""
req = request.Request(url_all)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

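At this point, decode, the quote() method reached through the request module, and the urlencode() method of the parse module may raise some questions, so here are a few notes:

  1. request.quote: percent-encodes a str (it is urllib.parse.quote, which urllib.request re-imports)
  2. parse.urlencode: turns the k: v pairs of a dict into a query string of percent-encoded k=v pairs
  3. request.unquote: turns percent-encoded data back into the original string
  4. decode: converts bytes to str; decode("utf-8") pairs naturally with read()
  5. encode: converts str to bytes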
>>> str0 = '我爱你'
>>> str1 = str0.encode('gb2312')
>>> str1
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str2 = str0.encode('gbk')
>>> str2
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str3 = str0.encode('utf-8')
>>> str3
b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
>>> str00 = str1.decode('gb2312')
>>> str00
'我爱你'
>>> str11 = str1.decode('utf-8')  # raises an error, because str1 was encoded as gb2312
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str11 = str1.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte

 

* The encoding argument specifies which character encoding to use.
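The quote(), urlencode() and unquote() helpers from the list above can be tried offline. This sketch calls them through urllib.parse, where they are defined, and uses the same keyword as the Baidu example:

```python
from urllib import parse

key = '路飞'

quoted = parse.quote(key)                 # percent-encode the UTF-8 bytes of the str
query = parse.urlencode({'wd': key})      # key=value pairs, percent-encoded
restored = parse.unquote(quoted)          # back to the original str

print(quoted)     # %E8%B7%AF%E9%A3%9E
print(query)      # wd=%E8%B7%AF%E9%A3%9E
print(restored)   # 路飞
```

urlencode() is usually the more convenient choice when there are several parameters, because it also inserts the = and & separators.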

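This raises another question: what is the difference between read(), readline() and readlines()?

  1. read(): the whole content as a single object (bytes for a urlopen() response, str only for a file opened in text mode)
  2. readline(): a single line
  3. readlines(): the whole content, as a list of lines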
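The three methods can be compared without a network connection. In this sketch io.BytesIO is only a stand-in (an assumption for illustration) for the response object returned by urlopen(), which offers the same three methods and also yields bytes:

```python
import io

raw = b'line one\nline two\nline three\n'

# A fresh BytesIO is created for each call so every read starts at the beginning.
everything = io.BytesIO(raw).read()         # all content, as one bytes object
first_line = io.BytesIO(raw).readline()     # one line, including its newline
all_lines = io.BytesIO(raw).readlines()     # all content, as a list of lines

print(everything)
print(first_line)    # b'line one\n'
print(all_lines)     # [b'line one\n', b'line two\n', b'line three\n']
# Decode once, then split, to get a list of str lines.
print(everything.decode('utf-8').splitlines())
```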

 


Source: https://www.cnblogs.com/DC0307/p/10675878.html
