Notes on Capturing Shopee HAR Content with Browsermob-Proxy

December 19, 2022

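Damn, I got tricked into installing Java yet again.
First, install the Browsermob proxy as described in the previous post. The project has gone unmaintained for years and has not improved since.
Previous post: Installing the Browsermob proxy
Then remember to install Java 8 and add it to your Path.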
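A missing or misconfigured Java is a common reason the proxy launcher fails to start, so it can help to confirm the setup from Python first. The snippet below is only a sketch added for illustration: `find_java` and `java_version_output` are hypothetical helper names, and the only assumption is that `java` should be discoverable on Path.

```python
import shutil
import subprocess


def find_java(executable="java"):
    """Return the full path of the Java launcher found on PATH, or None."""
    return shutil.which(executable)


def java_version_output(java_path):
    """Run `java -version` and return its text (Java writes it to stderr)."""
    completed = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (completed.stderr or completed.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("java not found on PATH; install Java 8 and update Path first")
    else:
        print(java_path)
        print(java_version_output(java_path))
```

If the script reports that `java` is missing, fix Path before touching the Python side of things.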
In the Python code, remember to point to the path of this handy tool:

__BMP = r"D:\XX\FastAPI\NG\browsermob-proxy-2.1.4\bin\browsermob-proxy.bat"

Here is the full code; no point in hiding it:

import time

from browsermobproxy import Server
from selenium import webdriver

class ProxyManger:
    # Path to the Browsermob-Proxy launcher script (a .bat file on Windows)
    __BMP = r"D:\XX\FastAPI\NG\browsermob-proxy-2.1.4\bin\browsermob-proxy.bat"

    def __init__(self):
        self.__server = Server(ProxyManger.__BMP)
        self.__client = None

    def start_server(self):
        # Launch the Java proxy process
        self.__server.start()
        return self.__server

    def start_client(self):
        # Create a proxy instance that trusts all upstream certificates
        self.__client = self.__server.create_proxy(params={"trustAllServers": "true"})
        return self.__client

    @property
    def client(self):
        return self.__client

    @property
    def server(self):
        return self.__server
if __name__ == "__main__":
    # Start the proxy server and create a proxy client
    proxy = ProxyManger()
    server = proxy.start_server()
    client = proxy.start_client()

    # Configure WebDriver to route its traffic through the proxy
    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server={}".format(client.proxy))
    options.add_argument("--ignore-certificate-errors")
    options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36")
    # options.add_argument("--headless")
    # chromePath = r"D:\AzRjN\anaconda3_7\envs\demo36\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe"
    # `options=` replaces the deprecated `chrome_options=` keyword
    driver = webdriver.Chrome(options=options)

    try:
        # Record a HAR that includes headers and response bodies
        client.new_har("shopee.com.my", options={"captureHeaders": True, "captureContent": True})
        driver.get("https://shopee.com.my/search?keyword=phone")
        time.sleep(3)  # crude wait for the search API call to finish

        result = client.har
        # print(result)

        for entry in result["log"]["entries"]:
            _url = entry["request"]["url"]
            # print("Request URL:", _url)

            if "/api/v4/search/search_items?" in _url:
                _response = entry["response"]
                _content = _response["content"]
                print("Response content:", _content)
    finally:
        # Always shut down the browser and the proxy, even on errors
        driver.quit()
        server.stop()
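The loop above prints the raw HAR structure. To get at the actual JSON payload, the body has to be pulled out of `response.content.text`, which the HAR format allows to be base64-encoded. The helpers below are a minimal sketch of that step, not part of the original script: the function names are hypothetical and the sample HAR in the demo is made up for illustration.

```python
import base64
import json


def matching_entries(har, needle):
    """Yield HAR entries whose request URL contains `needle`."""
    for entry in har.get("log", {}).get("entries", []):
        if needle in entry.get("request", {}).get("url", ""):
            yield entry


def response_body(entry):
    """Return the captured response body as text, or None if not captured."""
    content = entry.get("response", {}).get("content", {})
    text = content.get("text")
    if text is None:
        return None
    # HAR marks binary or re-encoded bodies with encoding == "base64"
    if content.get("encoding") == "base64":
        return base64.b64decode(text).decode("utf-8", errors="replace")
    return text


def response_json(entry):
    """Parse the body as JSON; return None when it is missing or not JSON."""
    body = response_body(entry)
    if body is None:
        return None
    try:
        return json.loads(body)
    except ValueError:
        return None


if __name__ == "__main__":
    # Made-up HAR fragment, shaped like what `client.har` returns
    sample_har = {
        "log": {
            "entries": [
                {
                    "request": {"url": "https://example.test/api/v4/search/search_items?keyword=phone"},
                    "response": {"content": {"text": '{"items": []}'}},
                }
            ]
        }
    }
    for entry in matching_entries(sample_har, "/api/v4/search/search_items?"):
        print(response_json(entry))
```

In the script above, `matching_entries(client.har, "/api/v4/search/search_items?")` would take the place of the manual loop.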

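You will quickly notice that the big players do anti-scraping well: Shopee clearly picks up on Selenium's fingerprints. The site already shows a popup that you have to click through, but that alone would not stop more sophisticated scrapers, so once it decides you are not a real browser it redirects you straight to the login page. At that point the tool is useless, because the search API request you wanted to capture never happens.

I stopped my investigation here. This tool has not been updated in a long time and it lives in the Java world. After a lot of digging, I found that capturing the HAR itself works fine. MitmProxy does not fit neatly into a Python script, whereas this tool does. However, it was not as fast as I had imagined, and together with the other drawbacks it is probably not a good fit...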
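One practical consequence of the login redirect is that the script should notice it instead of silently finding no matching HAR entries. The check below is a rough sketch under stated assumptions: the helper name is hypothetical, and the path markers (`/buyer/login`, `/login`) are guesses that should be adjusted to wherever the site actually sends you.

```python
from urllib.parse import urlsplit

# Assumption: path fragments that indicate a login page. Verify these
# against the real redirect target before relying on them.
LOGIN_PATH_MARKERS = ("/buyer/login", "/login")


def was_redirected_to_login(requested_url, current_url, markers=LOGIN_PATH_MARKERS):
    """Return True if the browser ended up on a login-looking path
    that differs from the path originally requested."""
    requested_path = urlsplit(requested_url).path
    current_path = urlsplit(current_url).path
    if current_path == requested_path:
        return False
    return any(marker in current_path for marker in markers)


if __name__ == "__main__":
    target = "https://shopee.com.my/search?keyword=phone"
    landed = "https://shopee.com.my/buyer/login?next=search"
    print(was_redirected_to_login(target, landed))
```

With Selenium, the second argument would be `driver.current_url`, read after `driver.get(target)` and the wait.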

E_Page • December 19, 2022