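I typed in the book's example exactly as printed, but it would not crawl anything. So I started modifying it myself and even changed the target page to hot, but then it looked like an encoding problem, and I got stuck.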
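The post does not say what the encoding problem turned out to be. One common cause of garbled Chinese text when scraping is UTF-8 bytes being decoded with the wrong charset. The snippet below is only a sketch of that effect, using a made-up sample string rather than a real Zhihu response:

```python
# Hypothetical sample text standing in for a page body; not taken from Zhihu.
sample = '知乎热榜'

# The bytes a server would send for this text when it uses UTF-8.
raw = sample.encode('utf-8')

# Decoding with the wrong charset: each byte becomes one Latin-1 character,
# so the Chinese text turns into mojibake.
garbled = raw.decode('iso-8859-1')

# Undo the wrong guess: go back to the original bytes, then decode as UTF-8.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled != sample)   # the wrongly decoded text differs from the original
print(repaired == sample)  # the repaired text matches the original
```

With requests, the equivalent fix is usually to set `response.encoding = 'utf-8'` (or `response.apparent_encoding`) before reading `response.text`, assuming the page really is UTF-8.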

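After an hour of struggling I got a modified version working. Since the hot page cannot be fetched directly without logging in, I switched to scraping a sub-page reached from https://www.zhihu.com/explore instead: https://www.zhihu.com/collection/123237973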

import requests
from pyquery import PyQuery as pq

url = 'https://www.zhihu.com/collection/123237973'
# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
doc = pq(response.text)
# Each collected answer sits in an element with class "zm-item".
items = doc('.zm-item').items()
with open('zhihu.txt', 'w', encoding='utf-8') as file:
    for item in items:
        question = item.find('h2').text()
        # author = item.find('.author-link-line').text()
        # .html() returns None when no .content element is found, so guard
        # against that before re-parsing the inner HTML to extract its text.
        content = item.find('.content').html()
        answer = pq(content).text() if content and content.strip() else ''
        file.write('\n'.join([question, answer]))
        file.write('\n' + '=' * 50 + '\n')
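The write loop above can be tried out without touching the network. The sketch below pulls the same output format (question, answer, then a line of 50 `=` characters) into a helper and runs it on made-up entries. The name `write_entries` and the sample data are hypothetical, and skipping empty fields is an added assumption rather than something the original script does:

```python
import io

SEPARATOR = '=' * 50  # same divider as in the script above


def write_entries(file, entries):
    """Write (question, answer) pairs to file and return how many were written.

    Empty or None fields are skipped; a pair with no text at all is dropped.
    """
    count = 0
    for question, answer in entries:
        parts = [part for part in (question, answer) if part]
        if not parts:
            continue
        file.write('\n'.join(parts))
        file.write('\n' + SEPARATOR + '\n')
        count += 1
    return count


# Made-up entries: a full pair, a question with no answer, and an empty pair.
buffer = io.StringIO()
written = write_entries(buffer, [('Q1', 'A1'), ('Q2', None), (None, '')])
print(written)
print(buffer.getvalue())
```

Passing an in-memory `io.StringIO` instead of an open file makes it easy to inspect the result before writing `zhihu.txt` for real.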


Last modified: December 5, 2022