Site: https://movie.douban.com/top250

First, check whether the data is present in the page source.

Start by trying to fetch the page source.

import re  
import requests
url = 'https://movie.douban.com/top250'
resp = requests.get(url)
print(resp.text)

The response comes back empty, which suggests the request needs a User-Agent header.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
}

Pass the headers along with the request:

url = 'https://movie.douban.com/top250'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
}
resp = requests.get(url, headers=headers)
print(resp.text)
resp.close()
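
To make the difference between the two requests visible, the sketch below compares the identifier that requests sends by default with the browser string set above. It builds the request without sending it, so nothing touches the network. That the site rejects the default identifier is an inference from the empty response, not something the site documents; `browser_ua` and `prepared` are names chosen for this illustration.

```python
import requests

url = 'https://movie.douban.com/top250'
browser_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27')

# The identifier requests uses when no User-Agent header is supplied
default_ua = requests.utils.default_user_agent()

# Build (but do not send) the request, to inspect the outgoing headers
prepared = requests.Request('GET', url, headers={'User-Agent': browser_ua}).prepare()

print('default :', default_ua)
print('override:', prepared.headers['User-Agent'])
```

The default value starts with `python-requests/`, which makes a script easy for a server to tell apart from a browser.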

Parsing the data

Start parsing from the <li> element.

# Parse the data
obj = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>', re.S)
resp.close()
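
The pattern relies on three things: the lazy `.*?`, the named group `(?P<name>...)`, and the `re.S` flag, which lets `.` match newlines. A minimal sketch of how they interact follows, run against a hypothetical fragment shaped like one list entry. The `sample` markup is made up for illustration and is not copied from the live page.

```python
import re

# Hypothetical fragment; the real page contains far more markup per entry
sample = '''
<li>
  <div class="item">
    <span class="title">Film A</span>
    <span class="title">&nbsp;/&nbsp;Alternate Title</span>
  </div>
</li>
<li>
  <div class="item">
    <span class="title">Film B</span>
  </div>
</li>
'''

pattern = r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'

with_dotall = re.compile(pattern, re.S)   # "." also matches newlines
without_dotall = re.compile(pattern)      # "." stops at the end of each line

names = [m.group('name') for m in with_dotall.finditer(sample)]
print(names)                          # ['Film A', 'Film B']
print(without_dotall.search(sample))  # None
```

Because each match has to begin at an `<li>`, only the first title span of every entry is captured; the alternate title is skipped.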

Extend the pattern to capture the year, score, and number of ratings, then iterate over the matches:

import re
import requests

url = 'https://movie.douban.com/top250'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse the data; re.S lets "." match newlines so one pattern can span a whole <li>
obj = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<p class="">.*?<br>(?P<year>.*?)&nbsp'
    r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
    # .*? allows for any markup sitting between the score and the rating count
    r'.*?<span>(?P<num>.*?)人评价</span>',
    re.S,
)
# Start matching
result = obj.finditer(page_content)
for it in result:
    print(it.group('name'))
    print(it.group('year').strip())  # year is surrounded by whitespace and newlines
    print(it.group('score').strip())
    print(it.group('num'))
resp.close()
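
The extraction can be tried offline before pointing it at the site. The sketch below runs a four-group pattern against a hypothetical entry and converts the captured strings to numbers. The `sample` markup, including the extra `<span>` between the score and the rating count, is an assumption about the page layout, and the pattern allows arbitrary markup at that point for the same reason.

```python
import re

# Hypothetical entry; the layout is assumed, not copied from the live page
sample = '''
<li>
  <div class="item">
    <span class="title">Film A</span>
    <p class="">
      Director: Someone<br>
      1994&nbsp;/&nbsp;Country&nbsp;/&nbsp;Genre
    </p>
    <div class="star">
      <span class="rating_num" property="v:average">9.7</span>
      <span property="v:best" content="10.0"></span>
      <span>12345人评价</span>
    </div>
  </div>
</li>
'''

obj = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<p class="">.*?<br>(?P<year>.*?)&nbsp'
    r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
    r'.*?<span>(?P<num>.*?)人评价</span>',
    re.S,
)

rows = []
for m in obj.finditer(sample):
    rows.append({
        'name': m.group('name'),
        'year': int(m.group('year').strip()),  # strip the newline and indent first
        'score': float(m.group('score')),
        'num': int(m.group('num')),
    })

print(rows)  # [{'name': 'Film A', 'year': 1994, 'score': 9.7, 'num': 12345}]
```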

Using the csv module

import csv
import re
import requests

url = 'https://movie.douban.com/top250'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse the data
obj = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<p class="">.*?<br>(?P<year>.*?)&nbsp'
    r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>',
    re.S,
)
# Start matching
result = obj.finditer(page_content)

# Create the CSV file; newline='' keeps the csv module from emitting blank rows on Windows
with open('data.csv', 'w', encoding='utf-8', newline='') as f:
    csvwriter = csv.writer(f)
    # Write one row per match
    for it in result:
        dic = it.groupdict()
        dic['year'] = dic['year'].strip()  # clean up year, which contains newlines
        csvwriter.writerow(dic.values())
resp.close()
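
The rows written above carry no header, so the column order has to be remembered from the regex. One possible variation, sketched here with `csv.DictWriter`, writes a header row first. It writes to an in-memory buffer instead of `data.csv` so it can be run without side effects, and the `rows` data is made up for illustration.

```python
import csv
import io

# Made-up rows in the same shape as it.groupdict() after stripping year
rows = [
    {'name': 'Film A', 'year': '1994', 'score': '9.7'},
    {'name': 'Film B', 'year': '1993', 'score': '9.6'},
]

# newline='' leaves the line endings chosen by the csv module untouched
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'year', 'score'])
writer.writeheader()     # first line: name,year,score
writer.writerows(rows)   # one line per dict, in fieldnames order

output = buffer.getvalue()
print(output)
```

Swapping `buffer` for a file opened with `open('data.csv', 'w', encoding='utf-8', newline='')` would produce the same content on disk.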