Reference
Install the lxml library
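Once lxml is installed, a small check along the following lines can confirm that parsing and XPath queries work. This is only a sketch: the HTML snippet, the `parse_items` function and the `first_or_default` helper are made up for illustration and do not come from any real site. It relies on the fact that `xpath()` returns a list, so a missing element yields an empty list rather than an error.

```python
from lxml import etree

# Hypothetical HTML, invented for this check; not taken from a real page.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><a href="/m/1"><img src="a.jpg"/><span>01:30</span></a><p>First</p></li>
    <li><a href="/m/2"><img src="b.jpg"/></a><p>Second</p></li>
  </ul>
</body></html>
"""


def first_or_default(nodes, default):
    """Return the first XPath result, or `default` when the result list is empty."""
    return nodes[0] if nodes else default


def parse_items(html_text):
    """Parse the HTML and pull a few fields out of every <li>."""
    root = etree.HTML(html_text)  # note the upper-case HTML
    items = []
    for li in root.xpath("//ul/li"):
        # Paths starting with "./" are evaluated relative to the current <li>.
        items.append({
            "title": first_or_default(li.xpath("./p/text()"), ""),
            "image": first_or_default(li.xpath("./a/img/@src"), ""),
            "duration": first_or_default(li.xpath("./a/span/text()"), "No information"),
        })
    return items


if __name__ == "__main__":
    for item in parse_items(SAMPLE_HTML):
        print(item)
```

The second `<li>` has no `<span>`, so its duration falls back to the default instead of raising an `IndexError`.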
import pymysql
import requests
from lxml import etree


def get_movies(page):
    # The URL template is blank in the source; it needs a placeholder
    # for the page number (e.g. %d) before this line will run.
    url = "" % page
    # Fetch the page content from the URL
    response = requests.get(url)
    html_content = response.text
    # Parse the content so it can be queried with XPath
    html = etree.HTML(html_content)
    # Extract the list items according to the XPath rule
    movies = html.xpath("/html/body/div[8]/div[2]/ul/li")

    # Store the results in the database.
    # The connection parameters are missing from the source.
    dbparams = {}
    conn = pymysql.connect(**dbparams)  # unpacked as arbitrary keyword arguments
    # Get a cursor
    cursor = conn.cursor()
    for movie in movies:
        title = movie.xpath("./div/div[1]/a/p/text()")[0]
        cover_image = movie.xpath("./a/img/@_src")[0]
        # Optional field: xpath() returns an empty list when nothing matches
        durations = movie.xpath("./a/span/text()")
        if durations:
            duration = durations[0]
        else:
            duration = 'No information'
        publish_time = movie.xpath("./a/div[2]/p/text()")[0]
        cate = movie.xpath("./div/div[1]/div[1]/span[1]/text()")[0]
        play_num = movie.xpath("./div/div[1]/div[2]/span[1]/text()")[0]
        like_num = movie.xpath("./div/div[1]/div[2]/span[2]/text()")[0]
        # Optional field, handled the same way as the duration
        descriptions = movie.xpath("./a/div[2]/div/text()")
        if descriptions:
            description = descriptions[0]
        else:
            description = "Description"
        print(title, cover_image, duration, description, publish_time, cate, play_num, like_num)
        # Executing the SQL only adds it to the execution queue
        # % (cover_image, duration, description, publish_time, title, cate, play_num, like_num))
    # Commit
    # conn.commit()


if __name__ == '__main__':
    for i in range(2, 10):
        get_movies(i)
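The script above leaves out the SQL statement and keeps the commit commented out. As a rough sketch of how that step might look, the block below builds a parameterized INSERT and commits once at the end. The table name `movies`, the column names and the helper functions are assumptions made for illustration. Only the field order is taken from the commented-out tuple in the script. The `%s` placeholders are passed to `cursor.execute(sql, args)` so that the driver escapes the values, instead of formatting them into the string with `%`.

```python
# Hypothetical table and column names; the original SQL is not shown in the post.
FIELDS = ("cover_image", "duration", "description", "publish_time",
          "title", "cate", "play_num", "like_num")

INSERT_SQL = (
    "INSERT INTO movies (" + ", ".join(FIELDS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(FIELDS)) + ")"
)


def movie_params(movie):
    """Order the values of a movie dict to match the placeholders in INSERT_SQL."""
    return tuple(movie[name] for name in FIELDS)


def save_all(conn, movies):
    """Queue one INSERT per movie, then commit; roll back if anything fails."""
    cursor = conn.cursor()
    try:
        for movie in movies:
            # execute() only queues the statement inside the current transaction.
            cursor.execute(INSERT_SQL, movie_params(movie))
        conn.commit()  # nothing is persisted until this call
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()
```

With a connection from `pymysql.connect(...)`, this could be called as `save_all(conn, list_of_movie_dicts)`, where each dict holds the eight fields extracted in the loop.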