Master the basic usage of the urllib.request module
Course location: 04.第一个爬虫.pdf - 4.1 Using urllib
Core methods:
urllib.request.urlopen(url) - open a URL
response.read() - read the response body (bytes)
response.read().decode("utf-8") - decode the bytes into a string
urllib.request.Request(url, headers=headers) - create a request object
Below is a simple piece of HTML content; use urllib to crawl this page:
urllib is Python's built-in HTTP request library and can be used without installing anything.
It provides methods such as urlopen and Request for sending HTTP requests.
Although requests is simpler and more convenient to use, understanding the basics of urllib is still important.
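For comparison, here is roughly what the same request looks like with the third-party requests library mentioned above (a minimal sketch; it assumes requests has been installed, e.g. with pip install requests):
import requests                     # third-party library, not part of the standard library
url = 'https://req.haleibc.com/practice1'
resp = requests.get(url)            # send a GET request
resp.encoding = 'utf-8'             # decode the body as UTF-8
print(resp.text)                    # response body as a string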
Method 1: Request the page directly with urlopen
import urllib.request
url = 'https://req.haleibc.com/practice1'
response = urllib.request.urlopen(url)   # send a GET request to the practice page
html = response.read()                   # read the raw response body (bytes)
html = html.decode("utf-8")              # decode the bytes into a UTF-8 string
print(html)
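The object returned by urlopen() also carries the status code and response headers, and it can be used as a context manager so the connection is closed automatically. A minimal sketch of inspecting the response (same practice URL; the attributes come from the standard library response object):
import urllib.request
url = 'https://req.haleibc.com/practice1'
with urllib.request.urlopen(url) as response:    # the connection is closed when the block exits
    print(response.status)                       # HTTP status code, e.g. 200
    print(response.getheaders())                 # list of (name, value) header tuples
    html = response.read().decode("utf-8")       # body as a UTF-8 string
print(html)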
Method 2: Use a Request object
import urllib.request
url = 'https://req.haleibc.com/practice1'
req = urllib.request.Request(url)        # wrap the URL in a Request object
response = urllib.request.urlopen(req)   # urlopen accepts a Request as well as a plain URL string
html = response.read().decode("utf-8")
print(html)
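urlopen() raises an exception when the request fails; the exception classes live in urllib.error. A sketch of basic error handling (the timeout value is an illustrative choice, not something required by this exercise):
import urllib.request
import urllib.error
url = 'https://req.haleibc.com/practice1'
req = urllib.request.Request(url)
try:
    response = urllib.request.urlopen(req, timeout=10)   # give up if the server does not answer in time
    print(response.read().decode("utf-8"))
except urllib.error.HTTPError as e:    # the server answered with an error status (4xx/5xx)
    print('HTTP error:', e.code)
except urllib.error.URLError as e:     # network problem: DNS failure, refused connection, timeout, ...
    print('URL error:', e.reason)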
Method 3: Send a User-Agent header
import urllib.request
url = 'https://req.haleibc.com/practice1'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
req = urllib.request.Request(url, headers=headers)   # attach the headers when building the request
response = urllib.request.urlopen(req)
html = response.read().decode("utf-8")
print(html)
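Headers can also be attached after the Request object has been created, with add_header(); the effect is the same as passing headers= above (a sketch reusing the same User-Agent string):
import urllib.request
url = 'https://req.haleibc.com/practice1'
req = urllib.request.Request(url)
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")  # set the header on the existing request
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))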
1. Use urllib.request.urlopen() to access this page
2. Read the response body and decode it as a UTF-8 string
3. Use BeautifulSoup to extract the heading with id "target-title"
4. Extract all paragraphs with class "target-content"
5. Try sending the request with a User-Agent header
import urllib.request
from bs4 import BeautifulSoup
url = 'https://req.haleibc.com/practice1'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read().decode("utf-8")
# parse the page with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# extract the heading with id "target-title"
title = soup.find('h3', id='target-title').text
print(f'Title: {title}')
# extract all paragraphs with class "target-content"
paragraphs = soup.find_all('p', class_='target-content')
for i, p in enumerate(paragraphs, 1):
    print(f'Paragraph {i}: {p.text}')