Scraping Dangdang book information with Scrapy

2021-10-04 12:16:54 · 2,937 characters · 4,679 views

This run only scrapes the books returned by a search for "python".

scrapy startproject ddbook

(cd ddbook/ddbook)

scrapy genspider -t basic book dangdang.com

Then open book.py:

# book.py
import re

import scrapy

from ddbook.items import DdbookItem


class BookSpider(scrapy.Spider):
    name = "book"
    allowed_domains = ["dangdang.com"]

    def start_requests(self):
        # 100 result pages in total
        for i in range(1, 101):
            # NOTE: the search-URL base was lost from the original post;
            # only the page number and the "#j_tab" fragment survived
            url = "" + str(i) + "#j_tab"
            print(url)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for j in range(1, 61):  # 60 books per page
            try:
                item = DdbookItem()
                # each book is an <li class="lineN"> inside <ul class="bigimg">
                author = "//ul[@class='bigimg']/li[@class='line" + str(j) + "']"
                print(author)
                if response.xpath(author + "//a/@title"):
                    item["title"] = response.xpath(author + "//a/@title")[0].extract()
                    print(item["title"])
                else:
                    item["title"] = ''
                if response.xpath(author + "//a[@name='itemlist-author']/text()"):
                    item["author"] = response.xpath(author + "//a[@name='itemlist-author']/text()")[0].extract()
                    print(item["author"])
                else:
                    item["author"] = ''
                if response.xpath(author + "//span[@class='search_now_price']/text()"):
                    item["price"] = response.xpath(author + "//span[@class='search_now_price']/text()")[0].extract()
                    print(item["price"])
                else:
                    item["price"] = ''
                if response.xpath(author + "//a[@name='p_cbs']/text()"):
                    item["press"] = response.xpath(author + "//a[@name='p_cbs']/text()")[0].extract()
                    print(item["press"])
                else:
                    item["press"] = ''
                if response.xpath(author + "//span/text()"):
                    alldata = response.xpath(author + "//span/text()").extract()
                    # the date's position counted from the start varies from book
                    # to book, but it is always the second-to-last <span>
                    data = alldata[len(alldata) - 2]
                    # the extracted date carries a leading slash, e.g. "/2021-10-04";
                    # findall()[0] keeps only the date itself
                    pat = r"\d+-\d+-\d+"
                    item["data"] = re.compile(pat).findall(data)[0]
                    print(item["data"])
                else:
                    item["data"] = ''
                yield item
            except Exception as e:
                print(e)
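The date-cleanup step leans on a regular expression because the raw span text carries a leading slash. A standalone sketch of just that step, with the sample string assumed:

```python
import re

# raw text as it comes out of the second-to-last <span>, with a leading slash
raw = "/2021-10-04"

# one or more digits, dash, digits, dash, digits
pat = re.compile(r"\d+-\d+-\d+")

date = pat.findall(raw)[0]  # findall matches only the date part, dropping the slash
print(date)  # -> 2021-10-04
```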
