
CrawlSpider spiders in Scrapy

Posted: 2019-04-21 00:46:12

 

Target site to crawl:

http://www.chinanews.com/rss/rss_2.html


The index page above lists the RSS feeds. After extracting those URLs, the spider follows each one to a second page (the XML feed itself) and does the actual data extraction there.
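As a quick sanity check, the link-extraction step can be exercised on its own before wiring it into the spider. Below is a minimal sketch: the two .xml URLs and the anchor text are hypothetical stand-ins for the real feed list on rss_2.html.

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Hypothetical snippet standing in for the real RSS index markup
body = b'''<html><body>
<a href="http://www.chinanews.com/rss/scroll-news.xml">Scrolling news</a>
<a href="http://www.chinanews.com/rss/importnews.xml">Top news</a>
<a href="http://www.chinanews.com/about.html">About</a>
</body></html>'''

response = HtmlResponse(url='http://www.chinanews.com/rss/rss_2.html',
                        body=body, encoding='utf-8')

extractor = LinkExtractor(allow=r'http://www.chinanews.com/rss/.*?\.xml')
for link in extractor.extract_links(response):
    print(link.url)  # prints only the two .xml feed URLs

Only the links matching the allow pattern come back; the "About" link is filtered out, which is exactly the behavior the spider's Rule relies on.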


Inspecting the page in the browser's developer tools shows the markup that the link extractor and the field regexes below will match.

 

The crawling logic for this page: a CrawlSpider rule extracts every feed URL matching http://www.chinanews.com/rss/.*?\.xml from the index page, and the parse_item callback then pulls each entry's title, link, description, and publish date out of the feed with regular expressions.

The CrawlSpider spider class:

# -*- coding: utf-8 -*-
import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NwSpider(CrawlSpider):
    name = 'nw'
    # allowed_domains = ['www.chinanews.com']
    start_urls = ['http://www.chinanews.com/rss/rss_2.html']

    # Follow every link on the index page that points at an RSS feed (.xml)
    rules = (
        Rule(LinkExtractor(allow=r'http://www.chinanews.com/rss/.*?\.xml'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Each <item> node in the feed is one news entry; grab its raw XML
        items = response.xpath('//item').extract()
        for node in items:
            item = {}
            item['title'] = re.findall(r'<title>(.*?)</title>', node, re.S)[0]
            item['link'] = re.findall(r'<link>(.*?)</link>', node, re.S)[0]
            item['desc'] = re.findall(r'<description>(.*?)</description>', node, re.S)[0]
            item['pub_date'] = re.findall(r'<pubDate>(.*?)</pubDate>', node, re.S)[0]
            print(item)
            # yield item  # uncomment to hand items to pipelines / feed exports
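The regex-based field extraction can also be checked in isolation. Here is a minimal sketch against a hand-written <item> node in the shape the chinanews feeds use (the headline, link, and summary text are made-up placeholders):

import re

# Hypothetical RSS <item> in the shape parse_item expects
node = '''<item>
<title>Sample headline</title>
<link>http://www.chinanews.com/gn/2019/04-21/0000000.shtml</link>
<description>One-sentence summary of the story.</description>
<pubDate>Sun, 21 Apr 2019 00:46:12 GMT</pubDate>
</item>'''

item = {}
item['title'] = re.findall(r'<title>(.*?)</title>', node, re.S)[0]
item['link'] = re.findall(r'<link>(.*?)</link>', node, re.S)[0]
item['desc'] = re.findall(r'<description>(.*?)</description>', node, re.S)[0]
item['pub_date'] = re.findall(r'<pubDate>(.*?)</pubDate>', node, re.S)[0]
print(item)

To collect the output instead of just printing it, uncomment yield item in parse_item and run the spider with an output feed, e.g. scrapy crawl nw -o news.json.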

 


Original post: https://www.cnblogs.com/knighterrant/p/10743532.html
