Scraping test: Zhihu - Explore - subpage

March 28, 2020

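I typed out the example from the book exactly as printed, but it would not scrape anything. So I started modifying it myself, even changing the target page to hot, but then I ran into what looked like an encoding problem and got stuck.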
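One way to check whether a problem like this really is an encoding issue is to decode the raw response bytes yourself instead of relying on the guessed text. The sketch below is only an illustration, not what the book or this post does: `decode_body` is a hypothetical helper, and the sample bytes are made up. As far as I know, requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, which turns UTF-8 Chinese text into mojibake.

```python
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Hypothetical helper: decode raw response bytes.

    Try the declared charset first (if any), then UTF-8. If both fail,
    decode as UTF-8 and replace the undecodable bytes.
    """
    candidates = [declared] if declared else []
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit: try the next one
            continue
    return raw.decode("utf-8", errors="replace")


# Made-up sample standing in for a UTF-8 encoded page body
sample = "知乎 发现".encode("utf-8")

# A wrong but permissive charset never raises; it silently yields mojibake
garbled = decode_body(sample, "iso-8859-1")

# No declared charset: UTF-8 is tried directly and the text survives
clean = decode_body(sample)
```

With requests, the raw bytes are available as `response.content`, and assigning to `response.encoding` before reading `response.text` changes how the text is decoded.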

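After an hour of struggling I got a modified version working. Since scraping hot directly is not possible without logging in, I switched to scraping a subpage of https://www.zhihu.com/explore instead: https://www.zhihu.com/collection/123237973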

import requests
from pyquery import PyQuery as pq

# A collection page linked from https://www.zhihu.com/explore
url = 'https://www.zhihu.com/collection/123237973'
# Send a browser User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
html = requests.get(url, headers=headers).text
doc = pq(html)
# Each .zm-item element holds one question and its answer
items = doc('.zm-item').items()
with open('zhihu.txt', 'w', encoding='utf-8') as file:
    for item in items:
        question = item.find('h2').text()
        # author = item.find('.author-link-line').text()
        # .content holds the answer as HTML; parse it again to get plain text
        content_html = item.find('.content').html()
        # Skip the second parse when an item has no .content element
        answer = pq(content_html).text() if content_html else ''
        file.write('\n'.join([question, answer]))
        # Separator line between records
        file.write('\n' + '=' * 50 + '\n')
