
Crawling blog data

Posted: 2016-04-24 11:08:43


#coding:utf-8
# Crawl the article list of a Sina blog and save each post locally.
# Python 2 script: urllib.urlopen and the print statement are Python 2 only.

import urllib
import time

url = [''] * 350
page = 1
link = 1
while page <= 7:
    # Fetch one page of the article list.
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    i = 0
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)

    while title != -1 and href != -1 and html != -1 and i < 50:
        # Slice out the article URL: skip 'href="' (6 chars) and keep '.html' (5 chars).
        url[i] = con[href + 6 : html + 5]
        print link, url[i]
        content = urllib.urlopen(url[i]).read()
        # Save the post under hanhan/ using the last 26 characters of the URL as the file name.
        open(r'hanhan/' + url[i][-26:], 'w+').write(content)
        print 'downloading', url[i]
        time.sleep(1)  # be polite: pause between requests
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
        link = link + 1
    else:
        print page, 'find end!'
    page = page + 1
else:
    print 'all find end'
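The script above targets Python 2 (urllib.urlopen, print statements) and will not run under Python 3. Below is a minimal Python 3 sketch of just the link-extraction step, using the same find()-based scan. The function name and the inline HTML snippet are illustrative assumptions, chosen so the example runs offline without fetching the real article-list page:

```python
import re  # not required by the scan itself; kept for optional regex variants


def extract_article_links(html):
    """Return the .html URLs that follow each '<a title=' anchor,
    mirroring the find()-based scan in the original script."""
    links = []
    pos = 0
    while True:
        title = html.find('<a title=', pos)
        if title == -1:
            break
        href = html.find('href=', title)
        end = html.find('.html', href)
        if href == -1 or end == -1:
            break
        # Skip 'href="' (6 chars) and keep the trailing '.html' (5 chars).
        links.append(html[href + 6 : end + 5])
        pos = end
    return links


# Illustrative snippet shaped like the Sina article-list markup.
sample = (
    '<a title="post one" href="http://blog.sina.com.cn/s/blog_4701280b0102dxyz.html">one</a>'
    '<a title="post two" href="http://blog.sina.com.cn/s/blog_4701280b0102dabc.html">two</a>'
)
print(extract_article_links(sample))
```

For real pages, an HTML parser (e.g. html.parser from the standard library) is more robust than raw string scanning, since it tolerates attribute reordering and whitespace.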


Original: http://www.cnblogs.com/XDJjy/p/5426510.html
