Site: https://movie.douban.com/top250
First, check whether the data is present in the page source.

Start by trying to fetch the page source:

    import re
    import requests

    url = 'https://movie.douban.com/top250'
    resp = requests.get(url)
    print(resp.text)
The response comes back empty, which indicates that the request needs a User-Agent header:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
    }
The full request with the header attached:

    url = 'https://movie.douban.com/top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
    }
    # Send the browser-like User-Agent along with the request
    resp = requests.get(url, headers=headers)
    print(resp.text)
    resp.close()
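An empty body is easier to diagnose when the status code is inspected alongside it. The helper below is a minimal sketch and not part of the original code: `looks_blocked` is a hypothetical name, and treating any non-200 status or blank body as "rejected" is an assumption made for illustration.

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when a response looks rejected rather than served.

    Assumption: a normal page answers with HTTP 200 and a non-blank body.
    """
    return status_code != 200 or not body.strip()


# Intended use with the response object from above (not executed here):
#     if looks_blocked(resp.status_code, resp.text):
#         print('request was probably rejected; check the headers')
print(looks_blocked(200, ''))               # True: empty body
print(looks_blocked(200, '<html></html>'))  # False
```

Keeping the check as a pure function of the status code and body means it can be tried out without sending any request.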
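Parsing the data
Start parsing from the <li> element.

    # re.S lets '.' match newlines, so '.*?' can span several lines of HTML
    obj = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>', re.S)
    resp.close()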
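To see why the `re.S` flag matters for this pattern, here is a small sketch run against a hand-written fragment. The HTML in `sample` is hypothetical: it only mimics the `<li>` / `item` / `title` nesting and is not copied from the site.

```python
import re

# Hypothetical fragment shaped like one list entry; not copied from the site.
sample = """<li>
  <div class="item">
    <span class="title">Movie A</span>
  </div>
</li>"""

pattern = r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'

without_flag = re.search(pattern, sample)     # '.' stops at line breaks
with_flag = re.search(pattern, sample, re.S)  # '.' also matches '\n'

print(without_flag)             # None
print(with_flag.group('name'))  # Movie A
```

Without the flag, `.*?` cannot cross the line break after `<li>`, so the search finds nothing; with it, the named group `name` captures the title text.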
Extending the pattern to also capture the year, the score, and the number of ratings:

    import re
    import requests

    url = 'https://movie.douban.com/top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
    }
    resp = requests.get(url, headers=headers)
    page_content = resp.text

    # The year is followed by &nbsp; in the page source (it only looks like a
    # space in the browser). '.*?' before the last <span> skips the markup
    # between the score and the rating count.
    obj = re.compile(
        r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
        r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span '
        r'class="rating_num" property="v:average">(?P<score>.*?)</span>'
        r'.*?<span>(?P<num>.*?)人评价</span>', re.S)

    # finditer yields one match object per movie entry
    result = obj.finditer(page_content)
    for it in result:
        print(it.group('name'))
        print(it.group('year').strip())
        print(it.group('score').strip())
        print(it.group('num'))
    resp.close()
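Every captured group comes back as a string, whitespace included. The sketch below shows one way to tidy a single record before using it further; `clean_record` is a hypothetical helper, the sample values are made up, and it assumes the score looks like `9.7` and the rating count is a plain digit string.

```python
def clean_record(raw: dict) -> dict:
    """Strip whitespace and convert the numeric fields.

    Assumes 'score' looks like '9.7' and 'num' is a plain digit string.
    """
    return {
        'name': raw['name'].strip(),
        'year': raw['year'].strip(),
        'score': float(raw['score']),
        'num': int(raw['num']),
    }


# Made-up values standing in for one it.groupdict() result.
example = {'name': 'Movie A', 'year': '\n    1994', 'score': '9.7', 'num': '12345'}
cleaned = clean_record(example)
print(cleaned)
```

If a field does not match the assumed shape, `float()` or `int()` raises `ValueError`, which makes a mismatch between the pattern and the page visible early.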
Using the csv module

    import csv
    import re
    import requests

    url = 'https://movie.douban.com/top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'
    }
    resp = requests.get(url, headers=headers)
    page_content = resp.text

    # The year is followed by &nbsp; in the page source
    obj = re.compile(
        r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
        r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span '
        r'class="rating_num" property="v:average">(?P<score>.*?)</span>', re.S)

    result = obj.finditer(page_content)

    # newline='' keeps the csv module from writing blank rows on Windows;
    # the with block closes the file even if a row fails to write
    with open('data.csv', 'w', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        for it in result:
            dic = it.groupdict()
            dic['year'] = dic['year'].strip()
            csvwriter.writerow(dic.values())
    resp.close()
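Since `groupdict()` already returns dictionaries, `csv.DictWriter` is a possible alternative that also writes a header row. This is a sketch of that swap, not the approach used above: the rows are made up, and an in-memory `io.StringIO` buffer stands in for `data.csv` so the output can be inspected directly.

```python
import csv
import io

# Made-up rows standing in for the dictionaries produced by groupdict().
rows = [
    {'name': 'Movie A', 'year': '1994', 'score': '9.7'},
    {'name': 'Movie B', 'year': '1993', 'score': '9.6'},
]

buffer = io.StringIO(newline='')  # in-memory stand-in for data.csv
writer = csv.DictWriter(buffer, fieldnames=['name', 'year', 'score'])
writer.writeheader()    # first row: the column names
writer.writerows(rows)  # one row per dictionary, in fieldnames order

csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, the buffer would be replaced by `open('data.csv', 'w', encoding='utf-8', newline='')` inside a `with` block.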