BeautifulSoup Quick-Start Guide

from bs4 import BeautifulSoup
# Import BeautifulSoup; the bs4 package also contains other submodules
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
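# soup = BeautifulSoup(open("index.html"), 'html.parser')  # parse a page from an open file instead
# Parses the page into a BeautifulSoup object; the second argument, 'html.parser', selects Python's built-in parser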
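The commented-out open() call above leaves the file handle open. A minimal sketch of the same idea using a with block follows. The temporary directory and the page contents are assumptions made only so the example is self-contained; index.html is the file name from the comment above.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Assumption: a throwaway index.html is written first so the example can run anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'index.html')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('<html><head><title>Demo</title></head></html>')

    # BeautifulSoup reads the whole file while the object is being built,
    # so the handle can be closed as soon as the constructor returns.
    with open(path, encoding='utf-8') as handle:
        file_soup = BeautifulSoup(handle, 'html.parser')

page_title = file_soup.title.string
print(page_title)
```

Passing the parser name explicitly avoids the warning Beautiful Soup emits when it has to pick a parser on its own.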
print(soup.prettify())
# prettify() prints the document with standard indentation
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
print(soup.title)
print(soup.title.name)
print(soup.title.string)
# title finds the <title> tag; title.name gives the tag's name; title.string gives the text inside the <title> tag
<title>The Dormouse's story</title>
title
The Dormouse's story
print(soup.p)
print(soup.p['class'])
# p finds the first <p> tag; p['class'] gives the class attribute of that <p> tag
<p class="title"><b>The Dormouse's story</b></p>
['title']
print(soup.a)
# a finds the first <a> tag
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all('a'))
print(soup.find(id="link3"))
# find_all('a') returns a list of every <a> tag; find(id="link3") returns the first tag whose id is link3
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# get('href') returns the value of each tag's href attribute
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
print(soup.get_text())
# get_text() returns all of the text in the document
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
soup.p.name = 'p1'
print(soup.p1)
# Tag names and attributes can all be modified or deleted
<p1 class="title"><b>The Dormouse's story</b></p1>
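The rename above changes a tag's name. Attributes can be changed and deleted in much the same way, because a tag behaves like a dictionary. The sketch below assumes a single <a> tag copied from html_doc; the new URL and the data-seen attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    'html.parser',
)
link = attr_soup.a

# Overwrite an existing attribute (hypothetical replacement URL).
link['href'] = 'http://example.com/elsie-new'

# Add an attribute that was not there before (hypothetical name).
link['data-seen'] = 'yes'

# Delete an attribute with del, exactly as with a dict key.
del link['id']

has_id = link.has_attr('id')
print(link)
print(has_id)  # False
```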
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.p['class'])
# The most common multi-valued attribute is class (a tag can have several CSS classes). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns a multi-valued attribute as a list:
['body', 'strikeout']
print(soup.p.contents)
print(soup.p.contents[0])
# The contents attribute returns a tag's direct children as a list
['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']
Once upon a time there were three little sisters; and their names were
  • Other attributes: .children .descendants .string .strings .stripped_strings
  • .parent .parents .next_sibling .previous_sibling .next_element .previous_element
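A few of the navigation attributes listed above can be sketched as follows. The snippet is a shortened, hypothetical variant of html_doc, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, shortened from the html_doc used earlier.
snippet = '<p class="story">Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
nav_soup = BeautifulSoup(snippet, 'html.parser')
p = nav_soup.p

# .children iterates over direct children, both strings and tags;
# the isinstance check keeps only the tags.
link_ids = [child['id'] for child in p.children if isinstance(child, Tag)]

# .stripped_strings yields each text fragment with surrounding whitespace removed.
texts = list(p.stripped_strings)

first_link = nav_soup.find(id='link1')

# .parent is the tag that directly contains this one.
parent_name = first_link.parent.name

# .next_sibling is the very next node, which here is a string and not a tag.
next_node = first_link.next_sibling

print(link_ids)         # ['link1', 'link2']
print(texts)            # ['Once', 'Elsie', ',', 'Lacie']
print(parent_name)      # p
print(repr(next_node))  # ', '
```

Note that siblings are often strings (commas, whitespace) rather than tags, which is a common source of surprise when stepping through a document with .next_sibling.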
print(soup.find_all(['a', 'p']))
# Passing a list matches every tag whose name equals any item in the list
[<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

Other find methods

  • find_parents() find_parent()
  • find_next_siblings() find_next_sibling()
  • find_previous_siblings() find_previous_sibling()
  • find_all_next() find_next()
  • find_all_previous() find_previous()
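The methods above take the same filters as find_all() but search in a different direction from the starting tag. A minimal sketch, assuming a hypothetical one-paragraph snippet modelled on html_doc:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three sibling links inside one paragraph.
snippet = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
find_soup = BeautifulSoup(snippet, 'html.parser')
middle = find_soup.find(id='link2')

# Search upwards through the ancestors.
parent = middle.find_parent('p')

# Search sideways among later and earlier siblings.
later_ids = [a['id'] for a in middle.find_next_siblings('a')]
earlier = middle.find_previous_sibling('a')

# Search forwards through everything parsed after this tag.
next_link = middle.find_next('a')

print(parent['class'])   # ['story']
print(later_ids)         # ['link3']
print(earlier['id'])     # link1
print(next_link['id'])   # link3
```

The plural forms return a list of every match, while the singular forms return the first match or None.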
print(soup.select("title"))
# BeautifulSoup supports most CSS selectors; pass the selector string to select()
[<title>The Dormouse's story</title>]
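Beyond a bare tag name, select() accepts class, id, child, and attribute selectors. The following sketch assumes a hypothetical snippet containing the three links from html_doc, and it assumes a Beautiful Soup install that includes its usual soupsieve dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the three links from html_doc inside one paragraph.
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
sel_soup = BeautifulSoup(snippet, 'html.parser')

# Class selector: every <a> with class "sister".
by_class = [a['id'] for a in sel_soup.select('a.sister')]

# Child combinator: <a> tags that are direct children of <p class="story">.
direct_children = sel_soup.select('p.story > a')

# select_one() returns only the first match (or None), like find().
by_id = sel_soup.select_one('#link2').get_text()

# Attribute selector: href values that end with "tillie".
by_suffix = [a['id'] for a in sel_soup.select('a[href$="tillie"]')]

print(by_class)              # ['link1', 'link2', 'link3']
print(len(direct_children))  # 3
print(by_id)                 # Lacie
print(by_suffix)             # ['link3']
```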

Searching with regular expressions

import re
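# A compiled pattern is matched against tag names; "^bo" matches names that start with "bo", here only <body>
for tag in soup.find_all(re.compile("^bo")):
    print(tag)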
<body>
<p1 class="title"><b>The Dormouse's story</b></p1>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
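A compiled pattern can filter attribute values as well as tag names. The sketch below uses a small hypothetical snippet so the result is easy to check by eye.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet with a <body>, a <b>, and two links.
snippet = (
    '<body><p><b>bold</b> '
    '<a href="http://example.com/elsie">Elsie</a> '
    '<a href="http://example.com/lacie">Lacie</a></p></body>'
)
re_soup = BeautifulSoup(snippet, 'html.parser')

# Pattern against tag names: every tag whose name starts with "b".
b_names = [tag.name for tag in re_soup.find_all(re.compile('^b'))]

# Pattern against an attribute value: links whose href ends with "lacie".
lacie_links = re_soup.find_all('a', href=re.compile('lacie$'))
lacie_texts = [a.get_text() for a in lacie_links]

print(b_names)      # ['body', 'b']
print(lacie_texts)  # ['Lacie']
```

Beautiful Soup applies the pattern with search(), so an unanchored pattern such as "t" would match any tag name containing that letter.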

Modifying the parse tree

  • append()
  • NavigableString()
  • insert() insert_before() insert_after()
  • clear()
  • extract()
  • decompose()
  • replace_with()
  • wrap() unwrap()
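Several of the methods above can be chained on a small document to see their effect. This is a sketch on a hypothetical one-line snippet, not a complete tour of the API.

```python
from bs4 import BeautifulSoup

mod_soup = BeautifulSoup('<p>Hello <b>world</b><i>!</i></p>', 'html.parser')

# append() adds a string (or tag) to the end of a tag's contents.
mod_soup.p.append(' Bye')

# wrap() encloses a tag in a new tag created with new_tag().
mod_soup.b.wrap(mod_soup.new_tag('em'))

# replace_with() swaps a tag for another tag or a plain string.
mod_soup.b.replace_with('there')

# extract() removes a tag from the tree and returns it.
removed = mod_soup.i.extract()

# unwrap() removes a tag but keeps its contents in place.
mod_soup.em.unwrap()

print(mod_soup)  # <p>Hello there Bye</p>
print(removed)   # <i>!</i>
```

decompose() also removes a tag, but it destroys the tag instead of returning it, so extract() is the one to use when the removed piece is still needed.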

Formatting output

  • prettify()
  • unicode() (Python 2 only; in Python 3, str() returns the Unicode string)
  • str()
  • get_text()
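The output methods above differ mainly in whitespace and in whether markup is kept. A minimal Python 3 sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

fmt_soup = BeautifulSoup('<p>Hi <b>there</b></p>', 'html.parser')

# str() returns the markup as-is, with no added whitespace.
compact = str(fmt_soup)

# prettify() returns the markup with one tag or string per line, indented by depth.
pretty = fmt_soup.prettify()

# get_text() drops the markup and returns only the text.
text_only = fmt_soup.get_text()

print(compact)    # <p>Hi <b>there</b></p>
print(pretty)
print(text_only)  # Hi there
```

Because prettify() inserts newlines and indentation, it is meant for reading a document, not for producing markup to store or compare.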