A First Try at Web Crawling with Node

Date: 2017-03-14 22:31:06

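I had heard of web crawlers before. My rough understanding is that a crawler fetches useful data from web pages and stores it locally. Today I am trying one out for the first time, as a small warm-up exercise.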

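Requirement: scrape course data so that, after the URL is entered in the browser, the data is displayed there in a structured format (as shown in the figure below).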

[Figure: course data displayed in the browser]

Node libraries used

cheerio (https://github.com/cheeriojs/cheerio) can be thought of as jQuery for Node.js. It extracts data from a page using CSS selectors, and its API closely mirrors jQuery's.

superagent (http://visionmedia.github.io/superagent/) is a lightweight HTTP client library for Node.js. It is convenient whenever we need to issue GET, POST, HEAD, or other HTTP requests.

express (http://www.expressjs.com.cn/starter/) is a minimal, flexible web application framework for Node.js. It provides routing, the express application generator, static file serving, and more.

Source code

var express = require('express'),
    app = express(), // Express application instance
    superagent = require('superagent'), // HTTP client used to fetch the page
    cheerio = require('cheerio'); // jQuery-like library for Node.js; selects data from HTML with CSS selectors
var pathUrl = 'http://www.imooc.com/learn/348';
  
/*=========================================================================
| Structure of the scraped data:
|    var courseData = [{
|        chapterTitle: '',
|        videos: [{
|            title: '',
|            id: ''
|        }]
|    }]
*==========================================================================*/
function printCourseInfo(courseData){
    courseData.forEach(function(item){
        var chapterTitle = item.chapterTitle;
        console.log(chapterTitle + '\n');
        item.videos.forEach(function(video){
            console.log(' 【' + video.id + '】' + video.title + '\n');
        });
    });
}
/*==========================================================================
|   Parse the data scraped from the page
==========================================================================*/
function filterChapter(html){
    var courseData = [];
    var $ = cheerio.load(html);
    var chapters = $('.chapter');
    chapters.each(function(){
        var chapter = $(this);
        var chapterTitle = chapter.find('strong').text().replace(/\s+/g, ''); // chapter title with whitespace removed
        var videos = chapter.find('.video').children('li');

        var chapterData = {
            chapterTitle: chapterTitle,
            videos: []
        };

        // collect the videos in this chapter
        videos.each(function(){
            var $that = $(this),
                video = $that.find('.J-media-item'),
                title = video.text().replace(/\s+/g, ''),
                // href is expected to contain /video/<id>; keep only the id part
                id = video.attr('href').split('/video')[1].replace(/\s+/g, '').replace('/', '');
            chapterData.videos.push({
                title: title,
                id: id
            });
        });
        courseData.push(chapterData);
    });
    return courseData;
}
/*==========================================================================
| GET method route
===========================================================================*/
app.get('/', function(req, res, next){
    // fetch the course page
    superagent.get(pathUrl).end(function(error, sres){
        // pass request errors on to Express's error handler
        if (error) {
            return next(error);
        }
        // HTML of the fetched page
        var html = sres.text;
        var courseData = filterChapter(html);
        // print to the console
        printCourseInfo(courseData);
        // send the data to the browser
        res.send(courseData);
    });
});
/*==========================================================================
| listening at port
===========================================================================*/
app.listen(9090, function(){
    console.log('app is listening at port 9090');
});
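
One fragile spot in the code above is the id extraction: if a link has no href, or the href does not contain /video, the chained split and replace calls throw. A more defensive version of that string handling might look like the sketch below. The helper names stripWhitespace and extractVideoId are hypothetical, and the sketch assumes ids are numeric and appear as /video/<digits> in the link.

```javascript
// Hypothetical helpers, not part of the original code.

// Remove all whitespace (spaces, tabs, newlines) from scraped text.
function stripWhitespace(text) {
  return String(text || '').replace(/\s+/g, '');
}

// Return the numeric id from a link such as "/video/7965",
// or null when the link is missing or has a different shape.
// Assumption: ids consist of digits only.
function extractVideoId(href) {
  const match = /\/video\/(\d+)/.exec(href || '');
  return match ? match[1] : null;
}

console.log(stripWhitespace('  1-1 \n Intro ')); // prints 1-1Intro
console.log(extractVideoId('/video/7965'));      // prints 7965
console.log(extractVideoId(undefined));          // prints null
```

With helpers like these, a list item without a usable link can be skipped instead of crashing the whole request.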

References

http://www.imooc.com/video/7965

http://www.cnblogs.com/coco1s/p/4954063.html

https://github.com/alsotang/node-lessons

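Author: Avenstar

Source: http://www.cnblogs.com/zjf-1992/p/6548220.html

About the author: focused on front-end development

Copyright of this article belongs to the author. Please cite the original link when reposting.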
