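I just learned regular expressions and wanted to use them to clean up a scraper I wrote earlier for some teaching materials. I spent most of the night on it, and the pattern would not match anything.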
import os
import re
import requests

download_path = './所有课本抓取'
# Create the download folder if it does not exist yet.
os.makedirs(download_path, exist_ok=True)


def get_one_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    response = requests.get(url, headers=headers)
    print(response.status_code)
    # Return the body as text; without this return, main() would receive None.
    return response.text


def fenxi(html):
    # re.S lets "." match newlines, so the pattern can span several lines of a tag.
    zhen_url = re.compile('<div.*?data-hint-title="(.*?)"', re.S)
    zhen = re.findall(zhen_url, html)
    print(zhen)
    # re.compile('')


def main():
    url = 'https://book.yunzhan365.com/bookcase/tfsc/index.html'
    html = get_one_page(url)
    fenxi(html)
    # print(html)  # only works because get_one_page returns the text; otherwise html would be None


main()
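In the end, one of the experts in my group chat told me to check whether the tag actually appears in the page source. I pressed Ctrl+U and found that the page is almost entirely JavaScript. requests does not run JavaScript, so the data-hint-title attributes I saw in the browser's inspector were never in the HTML I downloaded.
A whole night gone, but lesson learned. Back to studying.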
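
For anyone who runs into the same thing, the Ctrl+U check can also be done in code before writing any regex. This is only a rough sketch: check_page_source is a helper name I made up, and the two sample pages are invented strings, not the real content of the site. It reuses the same pattern as fenxi() and shows that the pattern works on HTML that really contains the attribute, and finds nothing when the page is just scripts.

```python
import re

# Same pattern as in fenxi(); re.S lets "." match across newlines.
HINT_TITLE = re.compile(r'<div.*?data-hint-title="(.*?)"', re.S)


def check_page_source(html, marker='data-hint-title'):
    """Hypothetical helper: is `marker` in the raw HTML, and how many <script> tags are there?"""
    return {
        'marker_found': marker in html,
        'script_tags': len(re.findall(r'<script\b', html, re.I)),
    }


# Invented sample pages for illustration only (not taken from the real site).
static_page = '<div class="book"\n data-hint-title="Math Grade 1"></div>'
js_page = ('<html><body><script src="app.js"></script>'
           '<script>render()</script></body></html>')

print(check_page_source(static_page))   # marker is present, so regex can work
print(HINT_TITLE.findall(static_page))
print(check_page_source(js_page))       # marker is missing, only scripts
print(HINT_TITLE.findall(js_page))
```

If marker_found is False for the text returned by get_one_page, no regex will help; the data has to come from whatever the JavaScript loads, or from a tool that actually runs the page.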