How to Scrape Dynamic Web Pages with Python: Implementation Methods for Dynamic-Page Crawlers

Developer News
2025-06-24

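Unlike traditional static pages, dynamic web pages use JavaScript, AJAX, and related techniques to load data asynchronously and refresh parts of the page, which gives users a smoother and richer interactive experience. This also poses a new challenge for crawlers: a traditional crawler that only parses the static HTML returned by the server usually cannot see content that is loaded dynamically. The sections below walk through several ways to implement a dynamic-page crawler in Python.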
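
To make the problem concrete, here is a minimal standard-library sketch. The HTML string and the `news-title` class are made-up stand-ins for a server response whose list is filled in later by JavaScript; they are assumptions for illustration, not taken from any real site. A parser that only sees the static markup finds nothing to extract.

```python
from html.parser import HTMLParser

# Hypothetical server response: the list stays empty until JavaScript fills it in.
STATIC_HTML = """
<html><body>
  <ul id="news-list"></ul>
  <script>
    fetch("/api/news").then(r => r.json()).then(renderNews);
  </script>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collect the text of elements whose class list contains 'news-title'.

    Deliberately simplistic: it does not track nested tags.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "news-title" in classes:
            self._inside = True

    def handle_endtag(self, tag):
        self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.titles.append(data.strip())


def static_titles(html: str) -> list[str]:
    """Return the titles visible to a parser that never runs JavaScript."""
    parser = TitleCollector()
    parser.feed(html)
    return parser.titles


# The static markup contains no titles; they would only appear after rendering.
print(static_titles(STATIC_HTML))
```

The tools in the following sections close this gap by running the page's JavaScript in a real or headless browser before any extraction happens.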

I. Scraping Dynamic Pages with Selenium

Selenium is a powerful web automation testing tool. By driving a real browser, it can interact with the page and wait until dynamic content has loaded before extracting data, which makes it well suited to scraping dynamic pages.

1. Install Selenium

Install Selenium from the command line:


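    pip install selenium

2. Install a browser driver

Selenium works through a browser driver, and each browser has its own: Chrome uses ChromeDriver, and Firefox uses GeckoDriver. After downloading the driver, either add its directory to the system PATH environment variable or specify the driver path in code. Selenium 4.6 and later also bundle Selenium Manager, which can usually locate or download a matching driver automatically.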
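
As a sketch of the "specify the driver path in code" option, the helper below passes an explicit path through Selenium 4's `Service` class. The `CHROMEDRIVER_PATH` environment variable and both function names are hypothetical, chosen only for this example; when no path is given, the code falls back to whatever driver Selenium finds on its own.

```python
import os
from typing import Mapping, Optional


def resolve_driver_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Read an explicit driver path from CHROMEDRIVER_PATH (a made-up variable name)."""
    path = env.get("CHROMEDRIVER_PATH", "").strip()
    return path or None


def build_chrome_driver(driver_path: Optional[str] = None):
    """Create a Chrome driver, using an explicit driver binary if one is given."""
    # Imported lazily so resolve_driver_path() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path:
        return webdriver.Chrome(service=Service(executable_path=driver_path))
    # No explicit path: let Selenium look the driver up (PATH or Selenium Manager).
    return webdriver.Chrome()
```

A typical call would be `driver = build_chrome_driver(resolve_driver_path())`, followed by `driver.quit()` when the work is finished.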

3. Write the scraping code

Taking a news site with dynamically loaded data as an example, the code looks like this:


    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import time

    # Create the browser driver object (change this to match the browser you use)
    driver = webdriver.Chrome()

    # Open the page
    driver.get("https://example.com")

    # Wait for the dynamic content to load (adjust the delay as needed)
    time.sleep(5)

    # Extract data with a CSS selector or XPath (Selenium 4 locator API)
    news_titles = driver.find_elements(By.CSS_SELECTOR, '.news-title')
    for title in news_titles:
        print(title.text)

    # Close the browser
    driver.quit()

The code above first creates the browser driver object and opens the target page, then uses time.sleep to wait for the dynamic content to finish loading, and finally extracts the data with a CSS selector and closes the browser once scraping is done.
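
A fixed `time.sleep` either waits longer than necessary or returns before the content arrives. One common alternative is Selenium's explicit wait, sketched below with `WebDriverWait` and `expected_conditions`. The function names, the default `.news-title` selector, and the 10-second timeout are assumptions for illustration.

```python
from typing import Iterable, List


def clean_texts(texts: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace and drop empty strings."""
    return [t.strip() for t in texts if t and t.strip()]


def collect_titles(driver, selector: str = ".news-title", timeout: float = 10.0) -> List[str]:
    """Wait until at least one element matches `selector`, then return the texts."""
    # Imported lazily so clean_texts() can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Polls the page until the condition holds; raises TimeoutException otherwise.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    return clean_texts(el.text for el in elements)
```

With a driver created as in the example above, `collect_titles(driver)` would replace both the `time.sleep(5)` call and the `find_elements` loop; wrapping the call in `try` / `finally: driver.quit()` keeps the browser from lingering if the wait times out.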


II. Scraping Dynamic Pages with Scrapy-Splash

Scrapy is a powerful Python crawling framework, and Splash is a JavaScript rendering service. Scrapy-Splash combines the two so that a Scrapy spider can render dynamic pages before scraping them.

1. Install Scrapy and Scrapy-Splash

Run the following commands in order from the command line:


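    pip install scrapy
    pip install scrapy-splash

2. Install and run the Splash service

Splash is a service built on Twisted and Qt5. It can be installed from the official distribution (the Splash documentation describes running it from a Docker image). Once installed, start the Splash service; by default it listens at http://localhost:8050.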
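
Before wiring Splash into Scrapy, it can help to confirm that the service renders a page at all. The sketch below builds a request to Splash's `render.html` HTTP endpoint using only the standard library; the function names and the default 3-second wait are assumptions for illustration, and the second function needs a running Splash instance to succeed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # default address of a local Splash service


def build_render_url(page_url: str, wait: float = 3, splash_url: str = SPLASH_URL) -> str:
    """Build a render.html URL asking Splash to load `page_url` and wait `wait` seconds."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(page_url: str, wait: float = 3, timeout: float = 30) -> str:
    """Return the HTML of `page_url` after Splash has executed its JavaScript."""
    with urlopen(build_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

If `fetch_rendered_html("https://example.com")` returns markup, the service is reachable and the Scrapy integration in the next step has something to talk to.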

3. Write the Scrapy-Splash spider

Create a new Scrapy project and define the spider:


    import scrapy
    from scrapy_splash import SplashRequest


    class DynamicSpider(scrapy.Spider):
        name = "dynamic_spider"
        start_urls = ["https://example.com"]

        def start_requests(self):
            # Route every start URL through Splash and wait 3 seconds for rendering
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 3})

        def parse(self, response):
            # Extract data with CSS selectors or XPath
            products = response.css('.product-item')
            for product in products:
                yield {
                    'name': product.css('.product-name::text').get(),
                    'price': product.css('.product-price::text').get(),
                }

In the code above, the request is sent through SplashRequest with a wait time set, so that the page's dynamic content has been rendered before the data is parsed and extracted.
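
A `SplashRequest` only reaches the Splash service if the project's `settings.py` enables the scrapy-splash middlewares. The fragment below follows the configuration given in the scrapy-splash README as best recalled; treat the exact priority numbers as values to verify against the version you install. `SPLASH_URL` assumes the default local address mentioned earlier.

```python
# settings.py (fragment): scrapy-splash integration, values per the project README

# Address of the running Splash service
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that send requests through Splash and handle its cookies
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Only relevant if Scrapy's HTTP cache is enabled
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these settings in place, the spider can be run from the project directory with `scrapy crawl dynamic_spider`.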

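III. Scraping Dynamic Pages with Playwright

Playwright is a newer automation testing and web scraping tool developed by Microsoft. It supports multiple browsers, makes it easy to work with page elements, and handles dynamically loaded content.

1. Install Playwright

Run the following commands from the command line: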


    pip install playwright
    # Install the browser binaries
    playwright install

2. Write the scraping code

Taking a dynamically loaded product listing page on an e-commerce site as an example:


    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")

        # Wait until network activity has gone idle
        page.wait_for_load_state('networkidle')

        products = page.query_selector_all('.product')
        for product in products:
            name = product.query_selector('.product-name').text_content()
            price = product.query_selector('.product-price').text_content()
            print(f"Product name: {name}, price: {price}")

        browser.close()

In this code, page.wait_for_load_state('networkidle') waits until the page's network requests go idle, which is taken here as the signal that dynamic content has finished loading; query_selector and related methods then extract the data.
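
Pages that poll or stream in the background may never become network-idle, and `query_selector` returns `None` when a product card lacks a field. The variant below is a sketch under those assumptions: it waits for the first `.product` element with `page.wait_for_selector` and tolerates missing children. The function names and the 10-second timeout are made up for this example.

```python
from typing import Dict, List, Optional


def _text(element) -> Optional[str]:
    """Return the stripped text of an element handle, or None if it is missing."""
    if element is None:
        return None
    content = element.text_content()
    return content.strip() if content else None


def extract_products(page, timeout_ms: int = 10_000) -> List[Dict[str, Optional[str]]]:
    """Wait for the first product card, then read name and price from each card."""
    page.wait_for_selector(".product", timeout=timeout_ms)
    rows = []
    for product in page.query_selector_all(".product"):
        rows.append({
            "name": _text(product.query_selector(".product-name")),
            "price": _text(product.query_selector(".product-price")),
        })
    return rows


def scrape(url: str = "https://example.com") -> List[Dict[str, Optional[str]]]:
    """Open `url` in headless Chromium and return the extracted products."""
    # Imported lazily so extract_products() can be exercised without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return extract_products(page)
        finally:
            browser.close()
```

Because `extract_products` only relies on the methods of the page object passed in, it can be checked against a small stand-in object without launching a browser.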

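As web technology continues to evolve, dynamic pages are becoming more and more common. With tools and frameworks such as Selenium, Scrapy-Splash, and Playwright, Python offers a range of options for scraping them. Developers can pick the approach that fits their requirements and scenario and efficiently collect data from dynamic pages to support work such as data analysis and information retrieval.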