
Basic Usage of Scrapy

Posted: 2019-10-24 00:08:38

Target site: http://quotes.toscrape.com

Single page

# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    """
    Key points
        1. text() gets a tag's text content
        2. @attribute gets an attribute's value
        3. extract() returns all matches; extract_first() returns only the first
    """
    def parse(self, response):
        # print(response.text)
        quotes = response.xpath('//div[@class="col-md-8"]/div[@class="quote"]')
        # print(quotes)
        for quote in quotes:
            print('=' * 20)
            # print(quote)
            # extract_first() returns the first match
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()
            print(text)
            author = quote.xpath('.//span/small[@class="author"]/text()').extract_first()
            print(author)
            # extract() returns all matches
            tags = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/@href').extract()
            print(tags)

All pages

# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    """
    Key points
        1. text() gets a tag's text content
        2. @attribute gets an attribute's value
        3. extract() returns all matches; extract_first() returns only the first
        4. response.urljoin()     joins a relative URL onto the page URL
        5. scrapy.Request(url=_next, callback=self.parse)   schedules a callback
    """
    def parse(self, response):
        # print(response.text)
        quotes = response.xpath('//div[@class="col-md-8"]/div[@class="quote"]')
        # print(quotes)
        for quote in quotes:
            print('=' * 20)
            # print(quote)
            # extract_first() returns the first match
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()
            print(text)
            author = quote.xpath('.//span/small[@class="author"]/text()').extract_first()
            print(author)
            # extract() returns all matches
            tags = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/@href').extract()
            print(tags)
        print('>' * 40)
        next_url = response.xpath('//div[@class="col-md-8"]/nav/ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
        print(next_url)
        # on the last page there is no "next" link, so guard against None
        if next_url:
            # join the relative href onto the current page URL
            _next = response.urljoin(next_url)
            print(_next)
            # callback: the method that will parse the next response
            yield scrapy.Request(url=_next, callback=self.parse)
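`response.urljoin(next_url)` resolves the relative `@href` against the URL of the page being parsed; it behaves like the stdlib `urljoin` shown below (the example URLs are illustrative):

```python
from urllib.parse import urljoin

# What the pagination XPath typically extracts is a site-relative path
page_url = "http://quotes.toscrape.com/page/1/"
next_href = "/page/2/"

# Resolve the relative href against the current page's URL
next_url = urljoin(page_url, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```

Yielding the joined URL back into `scrapy.Request(..., callback=self.parse)` makes the spider follow pages until no "next" link is found.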

Addendum: there is a Bilibili video showing how to use CSS selectors instead of XPath:

https://www.bilibili.com/video/av19057145/?p=23


Original (Chinese): https://www.cnblogs.com/wt7018/p/11729534.html
