Python Web Crawler Basics

Date: 2017-01-09 00:49:40

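# coding: utf-8
# Crawler basics: requires two modules, urllib and re (Python 2)
import urllib, re

# Fetch the page source
def get_html():
    page = urllib.urlopen('http://www.baidu.com')
    html = page.read()  # read() returns the page source as a string
    return html

x = 0  # counter used to name the downloaded files

# Match the image URLs in the page, then download them
def getimages(html):
    global x  # use the module-level x; global accepts one or more names

    # Compile the pattern into a regex object; compile() avoids re-parsing it on reuse
    image_re = re.compile(r'src="(.*?)" class=')

    # Find every match of the pattern and return the captured strings as a list
    image_list = image_re.findall(html)
    for image_url in image_list:
        print image_url

        # Download the resource at image_url to a local file (/tmp/python must already exist)
        urllib.urlretrieve(image_url, '/tmp/python/%s.jpg' % x)
        x += 1  # advance the counter so each image gets its own file name

getimages(get_html())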
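
The listing above is Python 2 code (`urllib.urlopen`, the `print` statement). As a rough sketch of how the same regex-based approach might look in Python 3, the block below separates the extraction step from the download step. The helper names `extract_image_urls` and `download_images`, the sample HTML, and the example.com URLs are illustrative assumptions and do not come from the original post. It also resolves relative and protocol-relative `src` values with `urljoin`, which the original listing does not do.

```python
# Python 3 sketch of the same regex-based extraction; helper names are illustrative.
import os
import re
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

# Same pattern as the listing above: capture a src value that is followed by a class attribute.
IMAGE_RE = re.compile(r'src="(.*?)" class=')


def extract_image_urls(html, base_url):
    """Return the matched src values as absolute URLs, resolved against base_url."""
    return [urljoin(base_url, src) for src in IMAGE_RE.findall(html)]


def download_images(page_url, target_dir):
    """Fetch page_url and save each matched image as 0.jpg, 1.jpg, ... in target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with urlopen(page_url) as page:
        # Assumption: the page is UTF-8; undecodable bytes are replaced instead of raising.
        html = page.read().decode("utf-8", errors="replace")
    saved = []
    for index, image_url in enumerate(extract_image_urls(html, page_url)):
        path = os.path.join(target_dir, "%d.jpg" % index)
        urlretrieve(image_url, path)
        saved.append(path)
    return saved


# Made-up sample markup, used only to exercise the extraction step offline.
sample = '<img src="//example.com/img/logo.png" class="logo"><img src="/img/a.jpg" class="b">'
print(extract_image_urls(sample, "http://example.com/"))
```

Calling `download_images('http://www.baidu.com', '/tmp/python')` would mirror what the original script attempts, though whether that page's markup still matches the `src="..." class=` pattern is not something this sketch can guarantee.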

This article comes from the “王家东哥” blog; please do not repost.

Original: http://xiaodongge.blog.51cto.com/11636589/1890232
