首页 > 其他 > 详细

爬虫框架之Scrapy(四 ImagePipeline)

时间:2019-04-26 23:13:44      阅读:320      评论:0      收藏:0      [点我收藏+]

ImagePipeline

使用scrapy框架我们除了要下载文本,还有可能需要下载图片,scrapy提供了ImagePipeline来进行图片的下载。

ImagePipeline还支持以下特别的功能:

1 生成缩略图:通过配置IMAGES_THUMBS = {‘size_name‘: (width_size,heigh_size),}

2 过滤小图片:通过配置IMAGES_MIN_HEIGHTIMAGES_MIN_WIDTH来过滤过小的图片。

具体其他功能可以看下参考官网手册:https://docs.scrapy.org/en/latest/topics/media-pipeline.html.

ImagePipelines的工作流程

1 在spider中爬取需要下载的图片链接,将其放入item中的image_urls.

2 spider将其传送到pipieline

3 当ImagePipeline处理时,它会检测是否有image_urls字段,如果有的话,会将url传递给scrapy调度器和下载器

4 下载完成后会将结果写入item的另一个字段images,images包含了图片的本地路径,图片校验,和图片的url。

示例 爬取巴士lol的英雄美图

只爬第一页

http://lol.tgbus.com/tu/yxmt/

技术分享图片

第一步:items.py

import scrapy


class Happy4Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

 爬虫文件lol.py

# -*- coding: utf-8 -*-
import scrapy
from happy4.items import Happy4Item

class LolSpider(scrapy.Spider):
    name = lol
    allowed_domains = [lol.tgbus.com]
    start_urls = [http://lol.tgbus.com/tu/yxmt/]

    def parse(self, response):
        li_list = response.xpath(//div[@class="list cf mb30"]/ul//li)
        for one_li in li_list:
            item = Happy4Item()
            item[image_urls] =one_li.xpath(./a/img/@src).extract()
            yield item

 最后 settings.py

BOT_NAME = happy4

SPIDER_MODULES = [happy4.spiders]
NEWSPIDER_MODULE = happy4.spiders


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   scrapy.pipelines.images.ImagesPipeline: 1,
}
IMAGES_STORE = images

 不需要操作管道文件,就可以爬去图片到本地

技术分享图片

极大的减少了代码量.

注意:因为图片管道会尝试将所有图片都转换成JPG格式的,你看源代码的话也会发现图片管道中文件名类型直接写死为JPG的。所以如果想要保存原始类型的图片,就应该使用文件管道。

示例 爬取mm131美女图片

http://www.mm131.com/xinggan/

要求爬取的就是这个网站

技术分享图片

这个网站是有反爬的,当你直接去下载一个图片的时候,你会发现url被重新指向了别处,或者有可能是直接报错302,这是因为它使用了referer这个请求头里的字段,当你打开一个图片的url的时候,你的请求头里必须有referer,不然就会被识别为爬虫文件,禁止你的爬取,那么如何解决呢?

手动在爬取每个图片的时候添加referer字段。

技术分享图片

xingan.py

# -*- coding: utf-8 -*-
import scrapy
from happy5.items import Happy5Item
import re

class XinganSpider(scrapy.Spider):
    name = xingan
    allowed_domains = [www.mm131.com]
    start_urls = [http://www.mm131.com/xinggan/]

    def parse(self, response):
        every_html = response.xpath(//div[@class="main"]/dl//dd)
        for one_html in every_html[0:-1]:
            item = Happy5Item()
            # 每个图片的链接
            link = one_html.xpath(./a/@href).extract_first()
            # 每个图片的名字
            title = one_html.xpath(./a/img/@alt).extract_first()
            item[title] = title
            # 进入到每个标题里面
            request = scrapy.Request(url=link, callback=self.parse_one, meta={item:item})
            yield request

    # 每个人下面的图集
    def parse_one(self, response):
        item = response.meta[item]
        # 找到总页数
        total_page = response.xpath(//div[@class="content-page"]/span[@class="page-ch"]/text()).extract_first()
        num = int(re.findall((\d+), total_page)[0])
        # 找到当前页数
        now_num = response.xpath(//div[@class="content-page"]/span[@class="page_now"]/text()).extract_first()
        now_num = int(now_num)
        # 当前页图片的url
        every_pic = response.xpath(//div[@class="content-pic"]/a/img/@src).extract()
        # 当前页的图片url
        item[image_urls] = every_pic
        # 当前图片的refer
        item[referer] = response.url
        yield item

        # 如果当前数小于总页数
        if now_num < num:
            if now_num == 1:
                url1 = response.url[0:-5] + _%d%(now_num+1) + .html
            elif now_num > 1:
                url1 = re.sub(_(\d+), _ + str(now_num+1), response.url)
            headers = {
                referer:self.start_urls[0]
            }
            # 给下一页发送请求
            yield scrapy.Request(url=url1, headers=headers, callback=self.parse_one, meta={item:item})

items.py

import scrapy


class Happy5Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    title = scrapy.Field()
    referer = scrapy.Field()

pipelines.py

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class Happy5Pipeline(object):
    def process_item(self, item, spider):
        return item

class QiushiImagePipeline(ImagesPipeline):

    # 下载图片时加入referer请求头
    def get_media_requests(self, item, info):
        for image_url in item[image_urls]:
            headers = {referer:item[referer]}
            yield Request(image_url, meta={item: item}, headers=headers)
            # 这里把item传过去,因为后面需要用item里面的书名和章节作为文件名

    # 获取图片的下载结果, 控制台查看
    def item_completed(self, results, item, info):
        image_paths = [x[path] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # 修改文件的命名和路径
    def file_path(self, request, response=None, info=None):
        item = request.meta[item]
        image_guid = request.url.split(/)[-1]
        filename = ./{}/{}.format(item[title], image_guid)
        return filename

settings.py

BOT_NAME = happy5

SPIDER_MODULES = [happy5.spiders]
NEWSPIDER_MODULE = happy5.spiders


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   # ‘scrapy.pipelines.images.ImagesPipeline‘: 1,
   happy5.pipelines.QiushiImagePipeline: 2,
}
IMAGES_STORE = images

得到图片文件:

技术分享图片

这种图还是要少看。

爬虫框架之Scrapy(四 ImagePipeline)

原文:https://www.cnblogs.com/xiaozx/p/10776694.html

(0)
(0)
   
举报
评论 一句话评论(0
关于我们 - 联系我们 - 留言反馈 - 联系我们:wmxa8@hotmail.com
© 2014 bubuko.com 版权所有
打开技术之扣,分享程序人生!