首页 > 编程语言 > 详细

python爬虫之BeautifulSoup的HTML解析

时间:2020-05-21 23:10:02      阅读:63      评论:0      收藏:0      [点我收藏+]

  BeautifulSoup是一个用于从HTML和XML文件中提取数据的python库,它提供一些简单的函数来处理导航、搜索、修改分析树等功能。BeautifulSoup能自动将文档转换成Unicode编码,输出文档转换为UTF-8编码。

  本例直接创建模拟HTML代码,进行美化:

# 导入BeautifulSoup库
from bs4 import BeautifulSoup

# 创建模拟HTML代码的字符串
html_doc = """
<html><head><title>The Dormouse‘s story</title></head>
<body>
<p class="title"><b>The Dormouse‘s story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# 创建一个BeautifulSoup对象,获取页面正文
soup = BeautifulSoup(html_doc, features="lxml")
# 打印解析的HTML代码
print(经BeautifulSoup美化的代码:,soup)
print(===================================)
print(通过prettify()方法进行代码的格式化处理:,soup.prettify())

结果:

经BeautifulSoup美化的代码: <html><head><title>The Dormouses story</title></head>
<body>
<p class="title"><b>The Dormouses story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
===================================
通过prettify()方法进行代码的格式化处理: <html>
 <head>
  <title>
   The Dormouses story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouses story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

技术分享图片

 

python爬虫之BeautifulSoup的HTML解析

原文:https://www.cnblogs.com/xiao02fang/p/12933835.html

(0)
(0)
   
举报
评论 一句话评论(0
关于我们 - 联系我们 - 留言反馈 - 联系我们:wmxa8@hotmail.com
© 2014 bubuko.com 版权所有
打开技术之扣,分享程序人生!