Master the basic usage of the urllib.request module
Course location: 04.第一个爬虫.pdf - 4.1 Using urllib
Core methods:
urllib.request.urlopen(url) - open a URL
response.read() - read the response body (bytes)
response.read().decode("utf-8") - decode the bytes into a string
urllib.request.Request(url, headers=headers) - create a request object
Below is a simple piece of HTML content; use urllib to crawl this page:
urllib is Python's built-in HTTP request library and can be used without installing anything.
It provides methods such as urlopen and Request for sending HTTP requests.
Although requests is simpler and more convenient to use, understanding the basics of urllib is still important.
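For comparison, here is roughly what the same request looks like with the third-party requests library mentioned above (a minimal sketch; it assumes requests has been installed, e.g. with pip install requests):
import requests                     # third-party library, not part of the standard library
url = 'https://req.haleibc.com/practice1'
resp = requests.get(url)            # send a GET request
resp.encoding = 'utf-8'             # decode the body as UTF-8
print(resp.text)                    # response body as a string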
Method 1: Request the page directly with urlopen
import urllib.request
url = 'https://req.haleibc.com/practice1'
response = urllib.request.urlopen(url)   # send a GET request to the practice page
html = response.read()                   # read the raw response body (bytes)
html = html.decode("utf-8")              # decode the bytes into a UTF-8 string
print(html)
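The object returned by urlopen() also carries the status code and response headers, and it can be used as a context manager so the connection is closed automatically. A minimal sketch of inspecting the response (same practice URL; the attributes come from the standard library response object):
import urllib.request
url = 'https://req.haleibc.com/practice1'
with urllib.request.urlopen(url) as response:    # the connection is closed when the block exits
    print(response.status)                       # HTTP status code, e.g. 200
    print(response.getheaders())                 # list of (name, value) header tuples
    html = response.read().decode("utf-8")       # body as a UTF-8 string
print(html)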
Method 2: Use a Request object
import urllib.request
url = 'https://req.haleibc.com/practice1'
req = urllib.request.Request(url)        # wrap the URL in a Request object
response = urllib.request.urlopen(req)   # urlopen accepts a Request as well as a plain URL string
html = response.read().decode("utf-8")
print(html)
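urlopen() raises an exception when the request fails; the exception classes live in urllib.error. A sketch of basic error handling (the timeout value is an illustrative choice, not something required by this exercise):
import urllib.request
import urllib.error
url = 'https://req.haleibc.com/practice1'
req = urllib.request.Request(url)
try:
    response = urllib.request.urlopen(req, timeout=10)   # give up if the server does not answer in time
    print(response.read().decode("utf-8"))
except urllib.error.HTTPError as e:    # the server answered with an error status (4xx/5xx)
    print('HTTP error:', e.code)
except urllib.error.URLError as e:     # network problem: DNS failure, refused connection, timeout, ...
    print('URL error:', e.reason)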
Method 3: Send a User-Agent header
import urllib.request
url = 'https://req.haleibc.com/practice1'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
req = urllib.request.Request(url, headers=headers)   # attach the headers when building the request
response = urllib.request.urlopen(req)
html = response.read().decode("utf-8")
print(html)
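Headers can also be attached after the Request object has been created, with add_header(); the effect is the same as passing headers= above (a sketch reusing the same User-Agent string):
import urllib.request
url = 'https://req.haleibc.com/practice1'
req = urllib.request.Request(url)
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")  # set the header on the existing request
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))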
1. Use urllib.request.urlopen() to access this page
2. Read the response body and decode it as a UTF-8 string
3. Use BeautifulSoup to extract the heading with id "target-title"
4. Extract all paragraphs with class "target-content"
5. Try sending the request with a User-Agent header
import urllib.request
from bs4 import BeautifulSoup
url = 'https://req.haleibc.com/practice1'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read().decode("utf-8")
# parse the page with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# extract the heading with id "target-title"
title = soup.find('h3', id='target-title').text
print(f'Title: {title}')
# extract all paragraphs with class "target-content"
paragraphs = soup.find_all('p', class_='target-content')
for i, p in enumerate(paragraphs, 1):
    print(f'Paragraph {i}: {p.text}')