LinkExtractor (the link extractor):
Helps us extract the specified links from a response object.
Usage:
Code example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class ACrawlspiderSpider(scrapy.Spider):
    name = '_crawlspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://pic.netbian.com/']

    def parse(self, response):
        # Each assignment below just demonstrates one parameter;
        # only the last LinkExtractor is the one actually used by extract_links().
        # link = LinkExtractor(restrict_xpaths='//ul/li')
        # allow: accepts a regex or a list of regexes; extracts links that match
        link = LinkExtractor(allow=r'tupian.+\.html')
        # deny: the opposite of allow; excludes links that match the regex
        link = LinkExtractor(deny=r'tupian.+\.html')
        # allow_domains: only extract links under the given domain(s)
        link = LinkExtractor(allow_domains='pic.netbian.com')
        # deny_domains: drop links under the given domain(s)
        link = LinkExtractor(deny_domains='pic.netbian.com')
        # restrict_xpaths: accepts XPath expression(s); only extract links inside the matched regions
        link = LinkExtractor(restrict_xpaths='//ul/li')
        # restrict_css: same idea, but with CSS selectors
        links = link.extract_links(response)
        for link in links:
            print(link.url, link.text)
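A quick way to try these parameters without running a full crawl is to build an HtmlResponse by hand and pass it to a LinkExtractor. The sketch below is not from the original post; the HTML body is invented for illustration.

# Minimal sketch (assumption: a hand-built HtmlResponse is enough to exercise extract_links()).
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = (b'<ul>'
        b'<li><a href="/tupian/123.html">pic 123</a></li>'
        b'<li><a href="/about.html">about</a></li>'
        b'</ul>')
response = HtmlResponse(url='http://pic.netbian.com/', body=html, encoding='utf-8')

link = LinkExtractor(allow=r'tupian.+\.html')
for l in link.extract_links(response):
    # extract_links() returns scrapy.link.Link objects with url and text attributes.
    print(l.url, l.text)
# Only /tupian/123.html matches the allow pattern; /about.html is filtered out.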
Source: https://www.cnblogs.com/zhangjian0092/p/11704660.html