Depth
Scrapy can also be installed through PyCharm.
Data is handed over as items; pipelines take care of formatting and persistence.
Under "Override the default request headers:" in settings, enable the headers and add 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36 QIHU 360SE'
# inside the spider class (requires: from scrapy import Request)
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, callback=self.parse)
Fields declared in the XXXXItem class in items.py, for example author = scrapy.Field() and content = scrapy.Field(), define the structure of the data handed to the pipelines: the data is wrapped in an XXXXItem object. In the spider file, import the class with from <project name>.items import XXXXItem, build the object with item = XXXXItem(author=author, content=content), and then write yield item; the item object is automatically handed to the engine, which passes it on to the pipelines.
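As a minimal sketch of that flow (the class name QsbkItem, the start URL and the XPath expressions below are assumptions for illustration, not taken from the original project files):

# items.py
import scrapy

class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

# spiders/qsbk_spider.py
import scrapy
from qsbk.items import QsbkItem

class QsbkSpider(scrapy.Spider):
    name = 'qsbk_spider'
    start_urls = ['https://www.qiushibaike.com/text/']  # assumed start URL

    def parse(self, response):
        for div in response.xpath('//div[@class="article"]'):  # assumed selectors
            author = div.xpath('.//h2/text()').extract_first()
            content = div.xpath('.//div[@class="content"]//text()').extract_first()
            yield QsbkItem(author=author, content=content)  # handed to the engine, which forwards it to the pipelines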
In a pipeline object, process_item() runs every time the spider yields (or returns) data; open_spider() is called when the spider is opened, and close_spider() when the spider finishes.
Opening a file therefore belongs in open_spider() and closing it in close_spider(), which avoids repeatedly opening and closing the file. from_crawler() creates the pipeline object and can read the relevant values from the settings file.
import json
# from scrapy.exporters import JsonItemExporter
from scrapy.exceptions import DropItem


class QsbkPipeline(object):
    # def __init__(self):
    #     self.fp = open('duanzi.json', 'w', encoding='utf-8')

    def __init__(self):
        '''
        Initialize the pipeline's data
        '''
        # self.fp = open('duanzi.json', 'wb')  # open in binary mode: JsonItemExporter writes bytes
        # self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        # self.exporter.start_exporting()
        pass

    @classmethod
    def from_crawler(cls, crawler):
        '''
        Create the pipeline object and read the relevant values from the settings file
        :param crawler:
        :return:
        '''
        # conn_str = crawler.settings.get("DB")  # read the DB value from the settings file
        # return cls(conn_str)  # create the object with the DB value so __init__ can receive it
        return cls()

    def process_item(self, item, spider):
        '''
        Called every time the spider yields (or returns) data
        :param item:
        :param spider:
        :return:
        '''
        # item_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(item_json + '\n')
        # return item
        print(item['author'] + ":" * 5 + item['content'])
        # return item  # with several pipelines, returning the item passes it on to the next one
        # raise DropItem()  # raise DropItem() if the item should NOT be passed on; dropped items can be monitored later

    def open_spider(self, spider):
        '''
        Called when the spider is opened
        :param spider:
        :return:
        '''
        pass

    def close_spider(self, spider):
        '''
        Called when the spider finishes
        :param spider:
        :return:
        '''
        pass
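To make the commented-out pattern above concrete, here is a minimal sketch of a pipeline that opens its file once in open_spider(), writes one JSON line per item, and closes the file in close_spider(); the class name and the JSON_PATH setting are assumptions, not part of the original project:

import json

class JsonWriterPipeline(object):
    def __init__(self, path):
        self.path = path
        self.fp = None

    @classmethod
    def from_crawler(cls, crawler):
        # JSON_PATH is an assumed custom setting; fall back to 'duanzi.json' if it is missing
        return cls(crawler.settings.get('JSON_PATH', 'duanzi.json'))

    def open_spider(self, spider):
        self.fp = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # returning the item passes it on to the next registered pipeline

It would be registered in ITEM_PIPELINES alongside QsbkPipeline, with the numeric value deciding which runs first.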
Typing the import below shows Scrapy's built-in dedup class; model a new file on that class, define the same methods yourself, and you have custom de-duplication:
from scrapy.dupefilters import RFPDupeFilter
DUPEFILTER_CLASS = 'project_name.module_name.RepeatFilter'
The code in the RepeatFilter module looks like this:
# the class can be given any name
class RepeatFilter(object):
    def __init__(self):
        '''
        2. Initialize the object
        '''
        self.visited_set = set()  # a set holding the URLs that have already been seen

    @classmethod  # this is a class method
    def from_settings(cls, settings):
        '''
        1. Create the object
        '''
        return cls()  # an instance is created here, so __init__() runs

    def request_seen(self, request):
        '''
        4. Called for each request to check whether it has been seen before
        '''
        if request.url in self.visited_set:
            return True
        self.visited_set.add(request.url)
        return False

    def open(self):  # can return deferred
        '''
        3. The spider is opened and crawling starts
        '''
        pass

    def close(self, reason):  # can return a deferred
        '''
        Crawling stops
        '''
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass
# -*- coding: utf-8 -*-
# Scrapy settings for qsbk project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'qsbk'  # name of the crawler; it is also used in the default USER_AGENT
# where the spider modules live
SPIDER_MODULES = ['qsbk.spiders']
NEWSPIDER_MODULE = 'qsbk.spiders'
# identifies who is sending the requests
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qsbk (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
# Obey robots.txt rules (whether to honour the site's robots.txt crawling policy)
ROBOTSTXT_OBEY = False
# maximum number of concurrent requests, i.e. how many requests may be outstanding at once
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# delay (in seconds) between successive requests to the same website
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
# per-domain / per-IP concurrency limits; more fine-grained than the global setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# whether Scrapy handles cookies for you; enabled by default
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# whether the telnet console for monitoring the crawler is enabled (default True); run telnet 127.0.0.1 6023 in a terminal
# to attach to the running crawler and issue commands such as est() to get status information
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'qsbk.middlewares.QsbkSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'qsbk.middlewares.QsbkDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# register extensions
EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
'qsbk.my_extention.MyExtend':200
}
# from scrapy.extensions.telnet import TelnetConsole  # inspect this class to see how a built-in extension is written
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# register the pipelines so that the code in pipelines.py actually runs
ITEM_PIPELINES = {
    'qsbk.pipelines.QsbkPipeline': 300,  # priority of this pipeline (lower values run first)
}
# AutoThrottle: make the interval between requests vary instead of staying fixed; the values below tune the throttling algorithm
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True # whether AutoThrottle is enabled
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5 # initial delay for requests (seconds)
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60 # maximum delay between requests
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler
# HTTP caching
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# whether the HTTP cache is enabled; default False
# HTTPCACHE_ENABLED = True
# cache policy: cache every request, and serve later identical requests straight from the cache (the default policy)
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified; usually preferable to the policy above
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
# cache expiration time: responses older than this are no longer used
# HTTPCACHE_EXPIRATION_SECS = 0
# directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []
# storage backend used for the cache, i.e. how the cache is actually written
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# from scrapy.extensions.httpcache import FilesystemCacheStorage  # inspect this class to see the concrete methods
#DEPTH_LIMIT = 1 # maximum "recursion" depth, measured relative to the URLs in start_urls;
# e.g. with the start URL http://www.domz.com/game/ and DEPTH_LIMIT = 1, only pages one level below that URL are crawled; anything deeper is ignored
#DEPTH_PRIORITY = 0 # 0 or 1: depth-first or breadth-first, i.e. the order in which the scheduler visits URLs; default 0
# register (activate) the custom dedup class
# DUPEFILTER_CLASS = 'qsbk.my_dupefilter.RepeatFilter'
# setting names in this file must be written in upper case
Caching: keep data somewhere closer so it can be read back quickly.
Register and configure it in the settings file.
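A minimal sketch of what that registration could look like, using the header-aware policy from the commented block above (the values are illustrative):

HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_DIR = 'httpcache'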
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware  # built-in proxy middleware (older releases imported it from scrapy.contrib)
# the default proxies have to be added to os.environ, in the following format
import os
os.environ['http_proxy'] = 'http://proxy_user:proxy_password@192.168.11.11:9999/'
os.environ['https_proxy'] = 'http://192.168.11.11:9999/'
import base64
import random


def to_bytes(text, encoding=None, errors='strict'):
    if isinstance(text, bytes):
        return text
    if not isinstance(text, str):
        raise TypeError('to_bytes must receive a str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)


# the class can have any name, as long as it matches the entry registered in DOWNLOADER_MIDDLEWARES
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # put your own proxies into this list
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        if proxy['user_pass']:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # base64.encodestring() no longer exists in Python 3; b64encode does the same job
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass'])).decode('ascii')
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
        else:
            print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
Configure this in the settings file to activate the middleware:
DOWNLOADER_MIDDLEWARES = {
    'project_name.proxy_module.ProxyMiddleware': 500,
}
Custom HTTPS certificates: the defaults in the settings file are
DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)


class MySSLFactory(ScrapyClientContextFactory):
    def getCertificateOptions(self):
        from OpenSSL import crypto
        # just change these two paths to point at your own certificate files
        v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
        v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
        return CertificateOptions(
            privateKey=v1,  # a PKey object
            certificate=v2,  # an X509 object
            verify=False,
            method=getattr(self, 'method', getattr(self, '_ssl_method', None))
        )
The related configuration in the settings file:
DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
DOWNLOADER_CLIENTCONTEXTFACTORY = "project_name.module_name.MySSLFactory"
Requests handed to the scheduler pass one by one through the custom downloader middlewares (DOWNLOADER_MIDDLEWARES), in the order given by the numbers in the settings file (smaller values run first). If none of the custom middlewares produces a response itself, the request is finally downloaded by Scrapy's built-in downloader.
class DownMiddleware1(object):
    def process_request(self, request, spider):
        '''
        Called for every request that needs to be downloaded, by each downloader middleware in turn
        :param request:
        :param spider:
        :return:
        None: continue with the remaining middlewares and download as usual (the default when there is no return statement)
        Response object: stop calling process_request and start calling process_response
        Request object: stop the middleware chain and hand the Request back to the scheduler
        raise IgnoreRequest: stop calling process_request and start calling process_exception
        '''
        pass

    def process_response(self, request, response, spider):
        '''
        Called on the way back, once a response has been downloaded
        :param request:
        :param response:
        :param spider:
        :return:
        Response object: handed on to the other middlewares' process_response
        Request object: stop the middleware chain; the request is rescheduled for download
        raise IgnoreRequest: Request.errback is called
        '''
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        '''
        Called when the download handler or a process_request() (downloader middleware) raises an exception
        :param request:
        :param exception:
        :param spider:
        :return:
        None: let the remaining middlewares handle the exception
        Response object: stop calling the remaining process_exception methods
        Request object: stop the middleware chain; the request will be rescheduled for download
        '''
        return None
DOWNLOADER_MIDDLEWARES = {'project_name.module_name.DownMiddleware1': 543}
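For example, the "return a Response object" branch described in process_request above can be exercised with a tiny middleware that short-circuits the real download (a hypothetical class, purely for illustration):

from scrapy.http import HtmlResponse

class OfflineMiddleware(object):
    def process_request(self, request, spider):
        # returning a Response here skips the download; Scrapy then runs the
        # middlewares' process_response and hands this response to the spider
        return HtmlResponse(url=request.url, body=b'<html><body>stub</body></html>', encoding='utf-8')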
##### Example code for a custom spider-middleware file:
class SpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        '''
        Called once the downloader middlewares are done, before the response is handed to the spider's parse (the function given as callback)
        :param response:
        :param spider:
        :return:
        '''
        pass

    def process_spider_output(self, response, result, spider):
        # result is the generator of Requests / Items yielded by parse
        '''
        Called with the results once the spider has processed the response
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        '''
        return result

    def process_spider_exception(self, response, exception, spider):
        '''
        Called when an exception is raised
        :param response:
        :param exception:
        :param spider:
        :return: None to let the remaining middlewares handle the exception; or an iterable containing Response or Item objects, handed to the scheduler or the pipelines
        '''
        return None

    def process_start_requests(self, start_requests, spider):
        '''
        Called when the crawl starts
        :param start_requests:
        :param spider:
        :return: an iterable containing Request objects
        '''
        return start_requests
SPIDER_MIDDLEWARES = {'project_name.module_name.SpiderMiddleware': 500}
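As one concrete (hypothetical) use of process_spider_output, a spider middleware can filter what the spider yields before it reaches the scheduler or the pipelines; the domain below is only an example:

from scrapy.http import Request

class KeepOnSiteMiddleware(object):
    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, Request) and 'qiushibaike.com' not in obj.url:
                continue  # silently drop requests that leave the assumed site
            yield obj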
# custom command that runs all spiders (placed in its own file, e.g. crawlall.py)
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # collect the names of all spiders in the project
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
COMMANDS_MODULE = 'project_name.commands_package'  # dotted path of the package that holds the custom command files
Run scrapy crawlall from the project directory in a terminal to start all of the spiders.
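For reference, COMMANDS_MODULE points at a package, and the command name comes from the file name inside it; a typical layout (the folder name commands is an arbitrary choice) might be:

qsbk/
    commands/
        __init__.py
        crawlall.py   # contains the Command class above; the file name becomes the command name

with COMMANDS_MODULE = 'qsbk.commands' in settings.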
scrapy shell <website URL> opens an interactive shell against that page
exit() quits this mode
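A typical session might look like this (the URL and selector are placeholders):

scrapy shell "https://www.qiushibaike.com/text/"
>>> response.status
>>> response.xpath('//title/text()').extract_first()
>>> exit()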
scrapy crawl <spider name> -o <output file>.json # export as a JSON file
scrapy crawl <spider name> -o <output file>.csv # export as a CSV file
scrapy crawl <spider name> -o <output file>.jl # export as JSON lines, one item per line
scrapy crawl <spider name> -o <output file>.xml # export as an XML file
scrapy crawl <spider name> -o <output file>.pickle # export as a pickle file
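For example, with this project's spider (the output file name is arbitrary):
scrapy crawl qsbk_spider -o duanzi.json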
Create a new .py file in the project folder, for example start.py.
Example code for that file:
from scrapy import cmdline
# this file simply runs the command-line invocation
cmdline.execute(['scrapy', 'crawl', 'qsbk_spider'])
From then on, running start.py is enough to start the spider.
Original article: https://www.cnblogs.com/greatljg/p/11160250.html