scrapy startproject <project_name>
This generates the spider files under the spiders directory:
cd spiders
scrapy genspider douban_spider <domain>
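For orientation, the resulting layout looks roughly like this (assuming the project was named douban; minor differences between Scrapy versions are possible):
douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban_spider.py    # created by genspider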
Decide exactly which fields you need to scrape, then define the data structure in items.py:
import scrapy

class DoubanItem(scrapy.Item):
    # ranking number
    serial_number = scrapy.Field()
    # movie title
    movie_name = scrapy.Field()
    # introduction
    introduce = scrapy.Field()
    # star rating
    star = scrapy.Field()
    # number of reviews
    evaluate = scrapy.Field()
    # one-line description
    describe = scrapy.Field()
Open the generated douban_spider.py file; it comes with three attributes by default:
import scrapy

class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # allowed domains; links outside these domains will not be crawled
    allowed_domains = ['movie.douban.com']
    # entry URL
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        pass
Content parsing is done inside the parse method:
def parse(self, response):
    print(response.text)
Start the crawl from the command line:
# douban_spider is the spider name (the `name` attribute) in douban_spider.py
scrapy crawl douban_spider
If you get a 403 error, the User-Agent is not set correctly; configure it in settings.py:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
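If you prefer not to change the project-wide setting, Scrapy also supports per-spider overrides via the custom_settings class attribute; a minimal sketch:
class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    # per-spider settings take precedence over settings.py
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }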
To start it from PyCharm, create a main.py file:
from scrapy import cmdline

if __name__ == '__main__':
    # equivalent to running `scrapy crawl douban_spider` in a terminal
    cmdline.execute('scrapy crawl douban_spider'.split())
All of the parsing logic lives in def parse(self, response).
Extracting content with XPath
You will need to learn XPath syntax first.
movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
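Selectors are easiest to debug interactively in the Scrapy shell; the -s flag passes a one-off setting, here a placeholder User-Agent so the request is not rejected:
scrapy shell -s USER_AGENT='Mozilla/5.0' 'https://movie.douban.com/top250'
>>> response.xpath("//div[@class='article']//ol[@class='grid_view']/li")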
Wrap the results into the item class defined earlier in items.py:
from douban.items import DoubanItem
The full code:
# first select the list nodes with XPath, then call text() to get their content
movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
for item in movie_list:
    douban_item = DoubanItem()
    douban_item['serial_number'] = item.xpath(".//div[@class='item']//em/text()").extract_first()
    douban_item['movie_name'] = item.xpath(".//div[@class='info']//a/span/text()").extract_first()
    content = item.xpath(".//div[@class='bd']/p[1]/text()").extract()
    # the introduction spans several lines; strip all whitespace from each line
    content_set = list()
    for i_content in content:
        tmp = ""
        for temp in i_content.split():
            tmp += temp
        content_set.append(tmp)
    douban_item['introduce'] = content_set
    douban_item['star'] = item.xpath(".//div[@class='star']/span[2]/text()").extract_first()
    douban_item['evaluate'] = item.xpath(".//div[@class='star']/span[4]/text()").extract_first()
    douban_item['describe'] = item.xpath(".//div[@class='bd']/p[2]/span/text()").extract_first()
    # important: hand the finished item back to the Scrapy engine
    yield douban_item
Once an item is fully populated, you must yield it so that Scrapy can process it:
yield douban_item
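Yielded items are handed to the item pipeline, where they can be cleaned or stored. A minimal sketch of a hypothetical pipeline (not part of the original post), which would go in pipelines.py and be enabled via the ITEM_PIPELINES setting:
# pipelines.py (hypothetical example)
class DoubanPipeline:
    def process_item(self, item, spider):
        # inspect or persist the item here; returning it passes it along
        print(item['movie_name'], item['evaluate'])
        return item

# settings.py would then need:
# ITEM_PIPELINES = {'douban.pipelines.DoubanPipeline': 300}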
The code above only reads the current page; we also need to extract the link to the next page and then yield a new request:
# extract the link to the next page
next_link = response.xpath("//span[@class='next']/link/@href").extract()
# if this is not the last page
if next_link:
    next_page = next_link[0]
    yield scrapy.Request("https://movie.douban.com/top250" + next_page, callback=self.parse)
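As an aside, Scrapy 1.4+ offers response.follow, which resolves relative URLs automatically; an equivalent sketch:
next_link = response.xpath("//span[@class='next']/link/@href").extract_first()
if next_link:
    # response.follow joins the relative href onto the current page's URL
    yield response.follow(next_link, callback=self.parse)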
To export the results, add the -o option to the command; JSON (saved with unicode escapes by default), CSV and several other formats are supported:
scrapy crawl douban_spider -o test.json
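CSV export works the same way:
scrapy crawl douban_spider -o test.csv
To save the JSON as readable UTF-8 instead of unicode escapes, you can also add Scrapy's FEED_EXPORT_ENCODING setting to settings.py:
FEED_EXPORT_ENCODING = 'utf-8'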
Source: https://www.cnblogs.com/yisany/p/11227781.html