饭桶教程-python教程-

python教程

Python 爬虫（Web Scraping）是指通过编写 Python 程序从互联网上自动提取信息的过程。

爬虫的基本流程通常包括发送 HTTP 请求获取网页内容、解析网页并提取数据，然后存储数据。

Python 的丰富生态使其成为开发爬虫的热门语言，特别是由于其强大的库支持。

一般来说，爬虫的流程可以分为以下几个步骤：

发送 HTTP 请求：爬虫通过 HTTP 请求从目标网站获取 HTML 页面，常用的库包括 requests。
解析 HTML 内容：获取 HTML 页面后，爬虫需要解析内容并提取数据，常用的库有 BeautifulSoup、lxml、Scrapy 等。
提取数据：通过定位 HTML 元素（如标签、属性、类名等）来提取所需的数据。
存储数据：将提取的数据存储到数据库、CSV 文件、JSON 文件等格式中，以便后续使用或分析。

本章节主要介绍 BeautifulSoup，它是一个用于解析 HTML 和 XML 文档的 Python 库，能够从网页中提取数据，常用于网页抓取和数据挖掘。

BeautifulSoup

BeautifulSoup 是一个用于从网页中提取数据的 Python 库，特别适用于解析 HTML 和 XML 文件。

BeautifulSoup 能够通过提供简单的 API 来提取和操作网页中的内容，非常适合用于网页抓取和数据提取的任务。

安装 BeautifulSoup

要使用 BeautifulSoup，需要安装 beautifulsoup4 和 lxml 或 html.parser（一个 HTML 解析器）。

我们可以使用 pip 来安装这些依赖：

pip install beautifulsoup4
pip install lxml  # 推荐使用 lxml 作为解析器（速度更快）

如果你没有 lxml，可以使用 Python 内置的 html.parser 作为解析器。

基本用法

BeautifulSoup 用于解析 HTML 或 XML 数据，并提供了一些方法来导航、搜索和修改解析树。

BeautifulSoup 常见的操作包括查找标签、获取标签属性、提取文本等。

要使用 BeautifulSoup，需要先导入 BeautifulSoup，并将 HTML 页面加载到 BeautifulSoup 对象中。

通常，你会先用爬虫库（如 requests）获取网页内容:

实例

from bs4 import BeautifulSoup
import requests

# 使用 requests 获取网页内容
url = 'https://cn.bing.com/' # 抓取bing搜索引擎的网页内容
response = requests.get(url)

# 使用 BeautifulSoup 解析网页
soup = BeautifulSoup(response.text, 'lxml')  # 使用 lxml 解析器
# 解析网页内容 html.parser 解析器
# soup = BeautifulSoup(response.text, 'html.parser')

获取网页标题：

实例

from bs4 import BeautifulSoup
import requests

# 指定你想要获取标题的网站
url = 'https://cn.bing.com/' # 抓取bing搜索引擎的网页内容

# 发送HTTP请求获取网页内容
response = requests.get(url)

# 中文乱码问题
response.encoding = 'utf-8' 

# 确保请求成功
if response.status_code == 200:
    # 使用BeautifulSoup解析网页内容
    soup = BeautifulSoup(response.text, 'lxml')
    
    # 查找<title>标签
    title_tag = soup.find('title')
    
    # 打印标题文本
    if title_tag:
        print(title_tag.get_text())
    else:
        print("未找到<title>标签")
else:
    print("请求失败，状态码：", response.status_code)

执行以上代码，输出标题为：

搜索 - Microsoft 必应

中文乱码问题

使用 requests 库抓取中文网页时，可能会遇到编码问题，导致中文内容无法正确显示，为了确保能够正确抓取并显示中文网页，通常需要处理网页的字符编码。

自动检测编码 requests 通常会自动根据响应头中的 Content-Type 来推测网页的编码，但有时可能不准确，此时可以使用 chardet 来自动检测编码。

实例

import requests
url = 'https://cn.bing.com/'
response = requests.get(url)

# 使用 chardet 自动检测编码
import chardet
encoding = chardet.detect(response.content)['encoding']
print(encoding)
response.encoding = encoding

执行以上代码，输出为：

utf-8

如果你知道网页的编码（例如 utf-8 或 gbk），可以直接设置 response.encoding:

response.encoding = 'utf-8'  # 或者 'gbk'，根据实际情况选择

查找标签

BeautifulSoup 提供了多种方法来查找网页中的标签，最常用的包括 find() 和 find_all()。

find() 返回第一个匹配的标签
find_all() 返回所有匹配的标签

实例

from bs4 import BeautifulSoup
import requests

# 指定你想要获取标题的网站
url = 'https://www.baidu.com/' # 抓取bing搜索引擎的网页内容

# 发送HTTP请求获取网页内容
response = requests.get(url)

# 中文乱码问题
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'lxml')

# 查找第一个 <a> 标签
first_link = soup.find('a')
print(first_link)
print("----------------------------")

# 获取第一个 <a> 标签的 href 属性
first_link_url = first_link.get('href')
print(first_link_url)
print("----------------------------")

# 查找所有 <a> 标签
all_links = soup.find_all('a')
print(all_links)

输出结果类似如下：

<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
----------------------------
http://news.baidu.com
----------------------------
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>,

获取标签的文本

通过 get_text() 方法，你可以提取标签中的文本内容：

实例

from bs4 import BeautifulSoup
import requests

# 指定你想要获取标题的网站
url = 'https://www.baidu.com/' # 抓取bing搜索引擎的网页内容

# 发送HTTP请求获取网页内容
response = requests.get(url)

# 中文乱码问题
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'lxml')

# 获取第一个 <p> 标签中的文本内容
paragraph_text = soup.find('p').get_text()

# 获取页面中所有文本内容
all_text = soup.get_text()
print(all_text)

输出结果类似如下：

 百度一下，你就知道           
 ...

查找子标签和父标签

你可以通过 parent 和 children 属性访问标签的父标签和子标签：

# 获取当前标签的父标签
parent_tag = first_link.parent

# 获取当前标签的所有子标签
children = first_link.children

实例

from bs4 import BeautifulSoup
import requests

# 指定你想要获取标题的网站
url = 'https://www.baidu.com/' # 抓取bing搜索引擎的网页内容

# 发送HTTP请求获取网页内容
response = requests.get(url)

# 中文乱码问题
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'lxml')

# 查找第一个 <a> 标签
first_link = soup.find('a')
print(first_link)
print("----------------------------")

# 获取当前标签的父标签
parent_tag = first_link.parent
print(parent_tag.get_text())

输出结果类似如下：

<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
----------------------------
 新闻 hao123 地图 视频 贴吧  登录   更多产品

查找具有特定属性的标签

你可以通过传递属性来查找具有特定属性的标签。

例如，查找类名为 example-class 的所有 div 标签：

# 查找所有 class="example-class" 的 <div> 标签
divs_with_class = soup.find_all('div', class_='example-class')

# 查找具有 id="unique-id" 的 <p> 标签
unique_paragraph = soup.find('p', id='unique-id')

获取搜索按钮，id 为 su ：

实例

from bs4 import BeautifulSoup
import requests

# 指定你想要获取标题的网站
url = 'https://www.baidu.com/' # 抓取bing搜索引擎的网页内容

# 发送HTTP请求获取网页内容
response = requests.get(url)

# 中文乱码问题
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'lxml')

# 查找具有 id="unique-id" 的 <input> 标签
unique_input = soup.find('input', id='su')
input_value = unique_input['value'] # 获取 input 输入框的值
print(input_value)

输出结果为：

百度一下

高级用法

CSS 选择器

BeautifulSoup 也支持通过 CSS 选择器来查找标签。

select() 方法允许使用类似 jQuery 的选择器语法来查找标签：

# 使用 CSS 选择器查找所有 class 为 'example' 的 <div> 标签
example_divs = soup.select('div.example')

# 查找所有 <a> 标签中的 href 属性
links = soup.select('a[href]')

处理嵌套标签

BeautifulSoup 支持深度嵌套的 HTML 结构，你可以通过递归查找子标签来处理这些结构：

# 查找嵌套的 <div> 标签
nested_divs = soup.find_all('div', class_='nested')
for div in nested_divs:
    print(div.get_text())

修改网页内容

BeautifulSoup 允许你修改 HTML 内容。

我们可以修改标签的属性、文本或删除标签：

实例

# 修改第一个 <a> 标签的 href 属性
first_link['href'] = 'http://new-url.com'

# 修改第一个 <p> 标签的文本内容
first_paragraph = soup.find('p')
first_paragraph.string = 'Updated content'

# 删除某个标签
first_paragraph.decompose()

转换为字符串

你可以将解析的 BeautifulSoup 对象转换回 HTML 字符串：

# 转换为字符串html_str = str(soup)

BeautifulSoup 属性与方法

以下是 BeautifulSoup 中常用的属性和方法:

方法/属性	描述	示例
`BeautifulSoup()`	用于解析 HTML 或 XML 文档并返回一个 BeautifulSoup 对象。	`soup = BeautifulSoup(html_doc, 'html.parser')`
`.prettify()`	格式化并美化文档内容，生成结构化的字符串。	`print(soup.prettify())`
`.find()`	查找第一个匹配的标签。	`tag = soup.find('a')`
`.find_all()`	查找所有匹配的标签，返回一个列表。	`tags = soup.find_all('a')`
`.find_all_next()`	查找当前标签后所有符合条件的标签。	`tags = soup.find('div').find_all_next('p')`
`.find_all_previous()`	查找当前标签前所有符合条件的标签。	`tags = soup.find('div').find_all_previous('p')`
`.find_parent()`	返回当前标签的父标签。	`parent = tag.find_parent()`
`.find_all_parents()`	查找当前标签的所有父标签。	`parents = tag.find_all_parents()`
`.find_next_sibling()`	查找当前标签的下一个兄弟标签。	`next_sibling = tag.find_next_sibling()`
`.find_previous_sibling()`	查找当前标签的前一个兄弟标签。	`prev_sibling = tag.find_previous_sibling()`
`.parent`	获取当前标签的父标签。	`parent = tag.parent`
`.next_sibling`	获取当前标签的下一个兄弟标签。	`next_sibling = tag.next_sibling`
`.previous_sibling`	获取当前标签的前一个兄弟标签。	`prev_sibling = tag.previous_sibling`
`.get_text()`	提取标签内的文本内容，忽略所有HTML标签。	`text = tag.get_text()`
`.attrs`	返回标签的所有属性，以字典形式表示。	`href = tag.attrs['href']`
`.string`	获取标签内的字符串内容。	`string_content = tag.string`
`.name`	返回标签的名称。	`tag_name = tag.name`
`.contents`	返回标签的所有子元素，以列表形式返回。	`children = tag.contents`
`.descendants`	返回标签的所有后代元素，生成器形式。	`for child in tag.descendants: print(child)`
`.parent`	获取当前标签的父标签。	`parent = tag.parent`
`.previous_element`	获取当前标签的前一个元素（不包括文本）。	`prev_elem = tag.previous_element`
`.next_element`	获取当前标签的下一个元素（不包括文本）。	`next_elem = tag.next_element`
`.decompose()`	从树中删除当前标签及其内容。	`tag.decompose()`
`.unwrap()`	移除标签本身，只保留其子内容。	`tag.unwrap()`
`.insert()`	向标签内插入新标签或文本。	`tag.insert(0, new_tag)`
`.insert_before()`	在当前标签前插入新标签。	`tag.insert_before(new_tag)`
`.insert_after()`	在当前标签后插入新标签。	`tag.insert_after(new_tag)`
`.extract()`	删除标签并返回该标签。	`extracted_tag = tag.extract()`
`.replace_with()`	替换当前标签及其内容。	`tag.replace_with(new_tag)`
`.has_attr()`	检查标签是否有指定的属性。	`if tag.has_attr('href'):`
`.get()`	获取指定属性的值。	`href = tag.get('href')`
`.clear()`	清空标签的所有内容。	`tag.clear()`
`.encode()`	编码标签内容为字节流。	`encoded = tag.encode()`
`.is_empty_element`	检查标签是否是空元素（例如 `<br>`、`<img>` 等）。	`if tag.is_empty_element:`
`.is_ancestor_of()`	检查当前标签是否是指定标签的祖先元素。	`if tag.is_ancestor_of(another_tag):`
`.is_descendant_of()`	检查当前标签是否是指定标签的后代元素。	`if tag.is_descendant_of(another_tag):`

其他属性

方法/属性	描述	示例
`.style`	获取标签的内联样式。	`style = tag['style']`
`.id`	获取标签的 `id` 属性。	`id = tag['id']`
`.class_`	获取标签的 `class` 属性。	`class_name = tag['class']`
`.string`	获取标签内部的字符串内容，忽略其他标签。	`content = tag.string`
`.parent`	获取标签的父元素。	`parent = tag.parent`

其他

方法/属性	描述	示例
`find_all(string)`	使用字符串查找匹配的标签。	`tag = soup.find_all('div', class_='container')`
`find_all(id)`	查找指定 `id` 的标签。	`tag = soup.find_all(id='main')`
`find_all(attrs)`	查找具有指定属性的标签。	`tag = soup.find_all(attrs={"href": "http://example.com"})`