Writing a Web Crawler in Python from Scratch, Part 3: Writing an ID Iteration Crawler

Posted: 2017-10-08 21:04:07
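When browsing a website, we sometimes find that its page IDs are sequential numbers. In that case we can crawl the content by iterating over the IDs. The limitation is that some sites use IDs around 10 digits long, and enumerating an ID space that large makes the crawl very inefficient.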
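To put the 10-digit limitation in numbers, here is a rough back-of-the-envelope sketch. The helper name `crawl_time_days` and the rate of one request per second are assumptions chosen for illustration, not figures from the original post.

```python
def crawl_time_days(num_ids, seconds_per_request=1.0):
    """Estimated days needed to request every ID once, one after another."""
    return num_ids * seconds_per_request / 86400  # 86400 seconds in a day


# Small ID space: every ID from 1 to 999
small_days = crawl_time_days(999)

# Large ID space: every 10-digit ID (1000000000 to 9999999999)
large_days = crawl_time_days(9000000000)

print('3-digit IDs: about {:.1f} minutes'.format(small_days * 24 * 60))
print('10-digit IDs: about {:.0f} years'.format(large_days / 365))
```

Under these assumptions the small site takes well under an hour, while the full 10-digit range takes roughly 285 years, so plain enumeration is only practical when the IDs are short and densely used.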



import itertools
from common import download  # download(url) is expected to return the page HTML, or None on failure


def iteration():
    max_errors = 5  # maximum number of consecutive download errors allowed
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of errors in a row, so assume we
                # have reached the last country ID and can stop downloading
                break
        else:
            # success - can scrape the result
            # ...
            num_errors = 0
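The listing above depends on the `common` module, so it cannot be run on its own. The following is a minimal self-contained sketch of the same consecutive-error logic. The names `iterate_ids`, `fake_download` and `EXISTING` are hypothetical and exist only for this illustration; the download function is passed in as a parameter so that the loop can be tried without any network access.

```python
import itertools


def iterate_ids(download, url_template, max_errors=5):
    """Request IDs 1, 2, 3, ... until max_errors downloads fail in a row.

    download: callable that takes a URL and returns HTML, or None on failure.
    Returns the list of IDs that were downloaded successfully.
    """
    found = []
    num_errors = 0  # current number of consecutive failures
    for page in itertools.count(1):
        html = download(url_template.format(page))
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # too many failures in a row: assume no IDs remain
        else:
            num_errors = 0  # a success resets the failure streak
            found.append(page)
    return found


# Hypothetical stand-in for a site: IDs 1 to 10 exist, except 4 and 5.
EXISTING = {1, 2, 3, 6, 7, 8, 9, 10}


def fake_download(url):
    """Pretend to fetch a page; return None for IDs that do not exist."""
    page_id = int(url.rsplit('-', 1)[1])
    if page_id in EXISTING:
        return '<html>page {}</html>'.format(page_id)
    return None


ids = iterate_ids(fake_download, 'http://example.webscraping.com/view/-{}')
print(ids)  # [1, 2, 3, 6, 7, 8, 9, 10]
```

Counting only consecutive failures is what lets the crawl survive gaps in the ID sequence, for example records that were deleted. With `max_errors=2`, the same call would stop at the gap at IDs 4 and 5 and return only `[1, 2, 3]`, so the threshold should be larger than the longest gap you expect.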


Original post: http://www.cnblogs.com/mrruning/p/7638459.html
