
A Simple Web Crawler


Python

Step 1: Simulate login

import requests

s = requests.Session()

# Login form (open the developer tools, log in to the site, then find the
# POST URL and its form fields in the Network panel)
form = {
    'username': '',
    'password': '',
}
s.post(url, data=form)  # url: the POST address found above
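requests sends a `data=` dict as an application/x-www-form-urlencoded body; a minimal sketch of what actually goes over the wire (the credentials here are placeholders, not real values):

```python
from urllib.parse import urlencode

# Placeholder credentials, for illustration only
form = {
    'username': 'alice',
    'password': 'secret',
}

# This is the body requests builds from data=form
body = urlencode(form)
print(body)  # username=alice&password=secret
```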

Step 2: Find nodes

from bs4 import BeautifulSoup

r = s.get(url)  # the page to crawl
r.encoding = 'gbk'  # the page's encoding

soup = BeautifulSoup(r.text, 'lxml')

# find example
title = soup.find('h1').text

# find_all example
dls = soup.find_all('dl', class_='attachlist')
for dl in dls:
    filename = dl.dt.a.text
    fileUrl = baseUrl + dl.dt.a.get('href')  # baseUrl: the site's base URL
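The find_all loop above can be exercised on a hardcoded snippet; the markup and baseUrl below are made-up stand-ins for the real attachment list:

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the <dl class="attachlist"> structure
html = '''
<dl class="attachlist"><dt><a href="attach/1.pdf">notes.pdf</a></dt></dl>
<dl class="attachlist"><dt><a href="attach/2.zip">code.zip</a></dt></dl>
'''
baseUrl = 'http://example.com/'  # assumption

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; 'lxml' also works if installed
for dl in soup.find_all('dl', class_='attachlist'):
    print(dl.dt.a.text, baseUrl + dl.dt.a.get('href'))
```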

Step 3: Download

def download(url, s, filename):
    import os, urllib.parse
    # filename = urllib.parse.unquote(url)
    # filename = filename[filename.rfind('/') + 1:]
    try:
        if os.path.isfile('./' + filename):
            print('  File already exists, skipped')
            return False
        r = s.get(url, stream=True, timeout=2)
        chunk_size = 1000
        received = 0
        length = int(r.headers['Content-Length'])
        print('Downloading {}'.format(filename))
        with open('./' + filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size):
                received += len(chunk)
                percent = round(received / length, 4) * 100
                print('\r  {:.2f}%'.format(percent), end='')
                f.write(chunk)
        print('\r  Finished    ')
        return True
    except requests.exceptions.ReadTimeout:
        print('Read timed out, this file failed to download')
        return False
    except requests.exceptions.ConnectionError:
        print('ConnectionError, this file failed to download')
        return False
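The two commented-out lines in download() derive a filename from the URL; as a standalone sketch:

```python
from urllib.parse import unquote

def filename_from_url(url):
    # Decode percent-escapes, then keep everything after the last '/'
    name = unquote(url)
    return name[name.rfind('/') + 1:]

print(filename_from_url('http://example.com/files/%E8%AF%BE%E4%BB%B6.pdf'))  # 课件.pdf
```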

JavaScript

Step 1: Find nodes

// XPath query
var threads = document.evaluate("//span[contains(@id, 'thread')]/a", document, null,
                                XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)
// user code
for (var i = 0; i < threads.snapshotLength; ++i) {
    thread = threads.snapshotItem(i)

    threadUrl = baseUrl + thread.attributes.href.textContent
    threadTitle = thread.text

    console.log('%s (%s)', threadTitle, threadUrl)
}

Step 2: GET the sub-pages

var parser = new DOMParser()

var xhr = new XMLHttpRequest()
xhr.overrideMimeType("text/html;charset=gbk") // the page's encoding

xhr.onload = function(e) {
    if (xhr.readyState === 4) {
        if (xhr.status === 200) {
            // parse the response into a document
            var threadDoc = parser.parseFromString(xhr.response, 'text/html')
            // XPath query (evaluate on threadDoc, not document, since the
            // nodes belong to the parsed sub-page)
            var files = threadDoc.evaluate("//dl[@class='t_attachlist']/dt/a", threadDoc, null,
                                      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)
            // user code
        } else {
            console.error(xhr.statusText)
        }
    }
}
xhr.onerror = function(e) {
    console.error(xhr.statusText)
}

xhr.open('GET', threadUrl, false) // GET the sub-page; false means synchronous
xhr.send(null)

Step 3: Download

function download(url, filename) {
    var a = document.createElement('a')
    var e = document.createEvent('MouseEvents')

    e.initEvent('click', false, false)
    a.download = filename
    a.href = url

    a.dispatchEvent(e)
}

Step 4: Install as a userscript

Install the Tampermonkey extension and save the code as a userscript. Be sure to add a match rule, for example

// @include      *www.baidu.com/*

Once the script runs correctly, open the developer tools - Console, then open the page you want to crawl.
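For reference, a Tampermonkey userscript begins with a metadata block like the following; the name and @include pattern are placeholders to be adjusted for the target site:

```javascript
// ==UserScript==
// @name         simple-crawler
// @version      0.1
// @include      *www.baidu.com/*
// @grant        none
// ==/UserScript==
```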


Original: https://www.cnblogs.com/maoruimas/p/13254362.html
