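I. The urllib Library in Detail
1. What is urllib
urllib is Python's built-in HTTP request library. It consists of four modules:
urllib.request: the request module (opens a URL passed to it, simulating a visit to that address)
urllib.error: the exception module (lets you catch request errors and then retry or take other action, so the program does not stop unexpectedly)
urllib.parse: the URL parsing module (a utility module with many URL-handling functions, such as splitting and joining)
urllib.robotparser: the robots.txt parsing module (reads a site's robots.txt file and determines which URLs may be crawled and which may not)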
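The two utility modules can be tried without any network access. The following is a minimal sketch; the example.com URLs and the robots.txt rules are made up for illustration and do not come from the original article.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse('http://www.example.com/path/index.html?id=5#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against a base URL
joined = urljoin('http://www.example.com/path/index.html', 'other.html')
print(joined)

# Parse robots.txt rules from a list of lines (hypothetical rules)
robots_lines = ['User-agent: *', 'Disallow: /private/']
parser = RobotFileParser()
parser.parse(robots_lines)
allowed = parser.can_fetch('*', 'http://www.example.com/index.html')
blocked = parser.can_fetch('*', 'http://www.example.com/private/data.html')
print(allowed, blocked)
```

In a real crawler the rules would normally be loaded with set_url() and read() rather than from a hard-coded list.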
2. Changes compared with Python 2
Python 2
    import urllib2
    response = urllib2.urlopen('http://www.baidu.com')
Python 3
    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')
3. Basic usage
Urllib
urlopen
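urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Only url is required. data is the request body as bytes, and timeout is given in seconds. cafile, capath and cadefault have been deprecated since Python 3.6 in favor of context.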
Method 1: a simple GET request
    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    # Read the response body (bytes) and decode it as UTF-8 for display
    print(response.read().decode('utf-8'))
Method 2: a POST request with data
    import urllib.request
    import urllib.parse

    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
    # Passing data sends the request as POST; without data it is sent as GET
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read())
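The encoding step and the GET/POST switch can be checked offline by building Request objects without sending them. This is a small sketch; the extra 'page' field is an assumption added only to show how several fields are joined.

```python
import urllib.parse
import urllib.request

# urlencode turns a dict into a query string; the body must then be bytes
encoded = urllib.parse.urlencode({'word': 'hello', 'page': 2})
data = encoded.encode('utf-8')
print(encoded)  # word=hello&page=2

# The HTTP method is chosen from whether data is present
get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)
print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent over the network here; get_method() only reports which method urlopen would use.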
Method 3: setting a timeout
    import urllib.request

    # Give up if the server does not respond within 1 second
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
    print(response.read())
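Method 4: catching a timeout
    import socket
    import urllib.request
    import urllib.error

    try:
        response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    except urllib.error.URLError as e:
        # A timeout while connecting or sending is wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
    except socket.timeout:
        # A timeout while waiting for the response can be raised directly
        print('TIME OUT')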
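Catching the exception is usually the first half of the retry idea mentioned in the description of urllib.error. The following is a minimal sketch of a retry loop, not part of urllib itself: fetch_with_retry and flaky_opener are hypothetical names, and the opener parameter exists only so the loop can be exercised without a network connection.

```python
import io
import socket
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, timeout=1.0, opener=urllib.request.urlopen):
    """Return the response body, trying up to `retries` times before giving up."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as e:
            last_error = e
            # URLError carries the cause in .reason; a bare timeout is its own cause
            reason = getattr(e, 'reason', e)
            kind = 'timeout' if isinstance(reason, socket.timeout) else 'error'
            print(f'attempt {attempt} failed ({kind}): {reason}')
    raise last_error


# Hypothetical stand-in for urlopen: fails twice with a timeout, then succeeds
attempts = []


def flaky_opener(url, timeout=None):
    attempts.append(url)
    if len(attempts) < 3:
        raise urllib.error.URLError(socket.timeout('timed out'))
    return io.BytesIO(b'ok')


body = fetch_with_retry('http://example.invalid/', opener=flaky_opener)
print(body)
```

With the default opener, fetch_with_retry('http://httpbin.org/get', timeout=1) would perform real requests instead.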
Response
Response type
    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    # For HTTP and HTTPS URLs this is http.client.HTTPResponse
    print(type(response))
Status code and response headers
    import urllib.request

    response = urllib.request.urlopen('http://www.python.org')
    print(response.status)               # status code
    print(response.getheaders())         # all response headers
    print(response.getheader('Server'))  # one specific header, Server as an example
Request
Passing the URL to urlopen as an object
    import urllib.request

    request = urllib.request.Request('https://python.org')  # wrap the URL in a Request object
    response = urllib.request.urlopen(request)               # urlopen accepts the object just like a URL string
    print(response.read().decode('utf-8'))
Building a request with headers, data and a method
    from urllib import request, parse

    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
        'Host': 'httpbin.org'
    }
    # Named form_data rather than dict to avoid shadowing the built-in
    form_data = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(form_data), encoding='utf-8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
The Request.add_header() method
    from urllib import request, parse

    url = 'http://httpbin.org/post'
    form_data = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(form_data), encoding='utf-8')
    req = request.Request(url=url, data=data, method='POST')
    # Headers can also be added one at a time after the Request is created
    req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
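Both ways of setting headers end up in the same place on the Request object, which can be inspected before anything is sent. The sketch below mixes the two approaches purely for illustration; it makes no network request.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
data = parse.urlencode({'name': 'Germey'}).encode('utf-8')

# One header passed to the constructor, one added afterwards
req = request.Request(url=url, data=data, headers={'Host': 'httpbin.org'}, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

print(req.full_url)
print(req.data)
print(req.header_items())

# Header names are stored capitalized, so look them up as 'User-agent'
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
```

Looking up 'User-Agent' with its original casing returns None from get_header(), which is an easy mistake to make when checking a request.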
Handler
Proxy
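The text stops at this heading, so the following is only a sketch of how a proxy is commonly configured with urllib's handler mechanism, not the original author's example. The proxy address 127.0.0.1:9743 is a made-up placeholder, and the actual request is left commented out because it would need a proxy listening there.

```python
import urllib.request

# Map each URL scheme to the proxy that should handle it (hypothetical address)
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})

# build_opener returns an opener that uses this handler instead of the default ProxyHandler
opener = urllib.request.build_opener(proxy_handler)

# With a real proxy running, requests made through the opener would go via the proxy:
# response = opener.open('http://httpbin.org/get', timeout=5)
# print(response.read().decode('utf-8'))

print(proxy_handler.proxies)
print(proxy_handler in opener.handlers)
```

To make every later urlopen call use the proxy, the opener can be registered with urllib.request.install_opener(opener).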
Original article: https://www.cnblogs.com/wyh-study/p/11055140.html