How to scrape website data with Python: scraping web page content and saving it to a database

Developer News
2025-05-13

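In the data era, data analysis is receiving growing attention, and acquiring data has become an important part of it. Python, a powerful programming language, offers a rich set of libraries and tools for scraping and storing web data. This article explains in detail how to scrape web page content with Python and save it to a database, using MySQL and MongoDB as examples.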

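I. Preparation

1. Install the required libraries

First, install a few Python libraries that handle the scraping and storage tasks. Commonly used ones are requests, BeautifulSoup, pymysql (for MySQL), and pymongo (for MongoDB).

pip install requests beautifulsoup4 pymysql pymongo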

2. Prepare the databases

MySQL

Create the database and table

CREATE DATABASE baby_info;
USE baby_info;
CREATE TABLE mamawang_info (
    id bigint(20) NOT NULL AUTO_INCREMENT,
    title varchar(255) DEFAULT NULL,
    href varchar(255) DEFAULT NULL,
    content text,
    imgs varchar(255) DEFAULT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=627 DEFAULT CHARSET=utf8;

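Connect to the database

import pymysql

# The host, user, and password are local example values; adjust them for your environment.
# "password" and "database" replace the deprecated "passwd" and "db" argument names.
connect = pymysql.connect(
    host='localhost',
    port=3306,
    user='root',
    password='admin',
    database='baby_info',
    charset='utf8'
)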

MongoDB

Connect to the database

import pymongo

myclient = pymongo.MongoClient('localhost', 27017)
mydb = myclient['webpages']

# MongoDB creates a database lazily, so it is listed only after data has been written to it.
dblist = myclient.list_database_names()
if "webpages" in dblist:
    print("The database exists")

mycol = mydb['gov.publicity']


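II. Scrape the web page content

1. Fetch the page source with the requests module

import requests

url = 'http://www.mama.cn/z/t1183/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
}
# A timeout keeps the script from hanging; raise_for_status() stops on HTTP error codes.
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()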
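Network requests can fail for transient reasons, so it may help to wrap the fetch step in a small retry loop. The following is only a minimal sketch: the name fetch_with_retries and its parameters are assumptions made for illustration, and fetch stands for any callable that takes a URL, for example a lambda around requests.get.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url); if it raises, wait and try again, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # in real code, narrow this to the expected error types
            last_error = error
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error


# Hypothetical usage with the request shown above:
# response = fetch_with_retries(
#     lambda u: requests.get(u, headers=headers, timeout=10), url)
```

Passing the fetch step in as a callable keeps the helper independent of any particular HTTP library.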

2. Parse the page with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
# find() returns None when no element has this class, for example after a layout change.
div = soup.find(class_='list-left')

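3. Extract the required data

# Example: extract article titles and links
# href=True skips <a> tags without an href; the None check covers a missing 'list-left' block.
articles = div.find_all('a', href=True) if div is not None else []
for article in articles:
    title = article.get_text(strip=True)
    href = article['href']
    print(title, href)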
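Links taken from a page are often relative, repeated, or not real pages at all (for example javascript: anchors), so a small clean-up pass before storage can be useful. The sketch below uses only the standard library; normalize_links is a hypothetical helper name, and the filtering rules are assumptions rather than anything the target site requires.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, pairs):
    """Clean (title, href) pairs: trim text, resolve relative URLs, drop duplicates."""
    seen = set()
    cleaned = []
    for title, href in pairs:
        title = " ".join((title or "").split())  # collapse runs of whitespace and newlines
        url = urljoin(base_url, (href or "").strip())  # make relative links absolute
        if not title or urlparse(url).scheme not in ("http", "https"):
            continue  # skip empty anchors and javascript:/mailto: style links
        if url in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(url)
        cleaned.append((title, url))
    return cleaned


# Hypothetical usage with the loop above:
# pairs = [(a.get_text(), a['href']) for a in articles]
# for title, href in normalize_links(url, pairs):
#     print(title, href)
```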

III. Save the data to a database

1. Save to MySQL

def save_to_mysql(title, href, content, imgs):
    # Parameterized query: the driver escapes the values, which guards against SQL injection.
    sql = "INSERT INTO mamawang_info (title, href, content, imgs) VALUES (%s, %s, %s, %s)"
    cursor = connect.cursor()
    try:
        cursor.execute(sql, (title, href, content, imgs))
        connect.commit()
    finally:
        cursor.close()

# Example call
save_to_mysql('Example title', 'http://example.com', 'Example content', 'http://example.com/image.jpg')
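Committing once per row can become slow when a page yields many links. As one possible alternative, the sketch below inserts a whole batch with the cursor's executemany() and commits once. The function name save_many_to_mysql and the connection parameter are assumptions made for illustration; connection is expected to be a PyMySQL connection such as the one created earlier.

```python
INSERT_SQL = (
    "INSERT INTO mamawang_info (title, href, content, imgs) "
    "VALUES (%s, %s, %s, %s)"
)


def save_many_to_mysql(connection, rows):
    """Insert many (title, href, content, imgs) tuples in a single transaction."""
    rows = list(rows)
    if not rows:
        return 0
    try:
        with connection.cursor() as cursor:
            cursor.executemany(INSERT_SQL, rows)
        connection.commit()
    except Exception:
        connection.rollback()  # leave the table unchanged if any row fails
        raise
    return len(rows)


# Hypothetical usage:
# save_many_to_mysql(connect, [('Example title', 'http://example.com', '', '')])
```

Rolling back on failure means a batch is stored either completely or not at all.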

2. Save to MongoDB

def save_to_mongodb(title, href, content, imgs):
    # insert_one() stores a single document in the collection.
    mycol.insert_one({
        'title': title,
        'href': href,
        'content': content,
        'imgs': imgs
    })

# Example call
save_to_mongodb('Example title', 'http://example.com', 'Example content', 'http://example.com/image.jpg')
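Running the scraper twice with insert_one() stores every article twice. One way to avoid this, sketched below, is an upsert keyed on href using update_one() with upsert=True. The name upsert_article is hypothetical, and treating href as the unique key is an assumption about the data; collection is expected to be a PyMongo collection such as mycol.

```python
def upsert_article(collection, title, href, content="", imgs=""):
    """Insert the article, or update the stored one if this href already exists.

    Returns True when a new document was created, False when one was updated.
    """
    document = {"title": title, "href": href, "content": content, "imgs": imgs}
    result = collection.update_one(
        {"href": href},        # match an existing document by its link
        {"$set": document},    # overwrite the fields with the latest values
        upsert=True,           # create the document when nothing matches
    )
    return result.upserted_id is not None


# Hypothetical usage:
# upsert_article(mycol, 'Example title', 'http://example.com')
```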

IV. Complete example code

Below is a complete example that scrapes article data from mama.cn and saves it to a MySQL database.

import requests
from bs4 import BeautifulSoup
import pymysql

# Connect to the database (local example credentials)
connect = pymysql.connect(
    host='localhost',
    port=3306,
    user='root',
    password='admin',
    database='baby_info',
    charset='utf8'
)


def save_to_mysql(title, href, content, imgs):
    # Parameterized query: the driver escapes the values.
    sql = "INSERT INTO mamawang_info (title, href, content, imgs) VALUES (%s, %s, %s, %s)"
    cursor = connect.cursor()
    try:
        cursor.execute(sql, (title, href, content, imgs))
        connect.commit()
    finally:
        cursor.close()


def get_one_page():
    url = 'http://www.mama.cn/z/t1183/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    div = soup.find(class_='list-left')
    if div is None:
        print("No element with class 'list-left' was found")
        return
    for article in div.find_all('a', href=True):
        title = article.get_text(strip=True)
        href = article['href']
        # Only the title and link are collected here; content and imgs stay empty.
        save_to_mysql(title, href, '', '')


if __name__ == '__main__':
    try:
        get_one_page()
    finally:
        connect.close()

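V. Notes

Anti-scraping mechanisms: many websites deploy anti-scraping measures. Setting request headers or routing requests through proxies are common ways to cope with them.

Data cleaning: scraped data may contain information you do not need, so it has to be cleaned and organized.

Legal issues: make sure you have the right to access and use the data, and follow the website's rules and privacy policy.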
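For the point about following a website's rules, one common starting point is the site's robots.txt file. The sketch below uses the standard library's urllib.robotparser to test whether a URL may be fetched. The helper name is_allowed, the user agent "my-crawler", and the sample rules are all made up for illustration; a real crawler would download the file from the target site first, and robots.txt does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, not the real robots.txt of any site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "http://example.com/z/t1183/"))
    print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "http://example.com/private/data"))
```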

With the steps above, you can use Python to scrape data from web pages and save it to a database. This helps with storing and managing the data, and it lays the groundwork for later analysis and visualization.
