Scrapy: Data Scraping and Data Processing

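Scrapy keeps data scraping and data processing in two separate places: the spider and the item pipelines (itcast is the name of the spider we created).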

The scraping code is as follows (the response argument of the parse method is the response returned for the requests to start_urls):

import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'  # spider name
    allowed_domains = ['itcast.cn']  # crawl scope
    start_urls = ['']  # URL(s) to crawl

    # Handle the response for each URL
    def parse(self, response):
        # print(response.xpath("//div[@class='tea_con']/div/ul/li/h3/text()"))
        # extract() returns a list of strings
        # res = response.xpath("//div[@class='tea_con']/div/ul/li//h3/text()").extract()
        li_list = response.xpath("//div[@class='tea_con']/div/ul/li")
        for li in li_list:
            item = {}
            # extract_first() returns the first string, or None if nothing matched
            item["name"] = li.xpath(".//h3/text()").extract_first()
            item["level"] = li.xpath(".//h4/text()").extract_first()
            item["desc"] = li.xpath(".//p/text()").extract_first()
            # yield hands each item on for processing as soon as it is built;
            # only a dict, None, Request or BaseItem may be yielded
            yield item
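The per-item extraction pattern above (select every li, then query each one with a relative path) can be tried without a Scrapy project. The following is only a sketch: the standard library's ElementTree stands in for Scrapy's selectors, findtext plays the role of extract_first (it also returns None when nothing matches), and the sample markup and teacher names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup that mimics the structure the spider expects.
SAMPLE = """
<html><body>
<div class="tea_con"><div><ul>
  <li><h3>Teacher A</h3><h4>Senior lecturer</h4><p>Ten years of experience</p></li>
  <li><h3>Teacher B</h3><p>No level listed</p></li>
</ul></div></div>
</body></html>
"""


def parse(root):
    """Yield one dict per <li>, like the spider's parse method does."""
    for li in root.findall(".//div[@class='tea_con']/div/ul/li"):
        yield {
            # Paths starting with "." are relative to the current <li>.
            "name": li.findtext(".//h3"),
            "level": li.findtext(".//h4"),  # None when the <li> has no <h4>
            "desc": li.findtext(".//p"),
        }


items = list(parse(ET.fromstring(SAMPLE.strip())))
for item in items:
    print(item)
```

The second entry has no h4 element, so its level field comes back as None instead of raising an error, which is the same reason the spider uses extract_first rather than indexing into the list returned by extract.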

Next comes data processing. Pipelines are enabled in settings.py through the ITEM_PIPELINES setting:

ITEM_PIPELINES =

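Each key is the location (import path) of a pipeline class, and each value is that pipeline's distance from the engine: the smaller the distance, the earlier the pipeline runs.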
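As an illustration of what such a setting might look like, here is a hypothetical fragment. The project name myspider and the values 300 and 301 are assumptions, not taken from the original project; only the key/value rule described above is being shown.

```python
# Hypothetical settings.py fragment. The project name "myspider" and the
# values 300 and 301 are assumptions chosen for illustration.
ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,
    "myspider.pipelines.MyspiderPipeline1": 301,
}

# Lower values run first, so sorting the keys by value gives the run order.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```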

class MyspiderPipeline:
    def process_item(self, item, spider):
        # Only handle items produced by the itcast spider
        if spider.name == "itcast":
            print(item)
        return item


class MyspiderPipeline1:
    def process_item(self, item, spider):
        if spider.name == "itcast":
            print(item)
        return item

In process_item(), item is whatever the spider yielded, and spider is the spider that produced that item, so items from different spiders can be handled differently.
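A minimal sketch of handling spiders differently inside one pipeline follows. The PerSpiderPipeline class, the second spider name "other" and the whitespace trimming are all hypothetical, and SimpleNamespace objects stand in for real spider instances so the snippet runs without Scrapy.

```python
from types import SimpleNamespace


class PerSpiderPipeline:
    """Hypothetical pipeline that treats items differently per spider."""

    def process_item(self, item, spider):
        if spider.name == "itcast":
            # Only items from the itcast spider get their text fields trimmed.
            item = {
                key: (value.strip() if isinstance(value, str) else value)
                for key, value in item.items()
            }
        # Items from every other spider pass through unchanged.
        return item


# Stand-ins for spider objects; only the name attribute matters here.
itcast = SimpleNamespace(name="itcast")
other = SimpleNamespace(name="other")

pipeline = PerSpiderPipeline()
cleaned = pipeline.process_item({"name": "  Teacher A "}, itcast)
untouched = pipeline.process_item({"name": "  Teacher B "}, other)
print(cleaned, untouched)
```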

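Because the first pipeline's distance is smaller than the second's, the first pipeline runs first. If it adds an attribute a to the dict, the second pipeline sees that attribute when it prints the item. (Note: when a later pipeline needs the result of an earlier one, the earlier pipeline must return the item.)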
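The ordering and the need to return the item can be modelled in a few lines. This is a simplified stand-in for how items flow through pipelines, not Scrapy's actual implementation; run_chain and all three pipeline classes are hypothetical names invented for the sketch.

```python
from types import SimpleNamespace


class AddFieldPipeline:
    def process_item(self, item, spider):
        item["a"] = "added by the first pipeline"
        return item  # hands the modified item to the next pipeline


class ForgetfulPipeline:
    def process_item(self, item, spider):
        item["a"] = "added by the first pipeline"
        # No return statement, so the next pipeline receives None.


class CollectPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)  # record what this pipeline was given
        return item


def run_chain(item, spider, pipelines):
    """Simplified model: call pipelines in ascending order of their value."""
    for pipeline in sorted(pipelines, key=pipelines.get):
        item = pipeline.process_item(item, spider)
    return item


spider = SimpleNamespace(name="itcast")

good = CollectPipeline()
run_chain({"name": "Teacher A"}, spider, {good: 301, AddFieldPipeline(): 300})

bad = CollectPipeline()
run_chain({"name": "Teacher A"}, spider, {bad: 301, ForgetfulPipeline(): 300})

print(good.seen)
print(bad.seen)
```

In the first run the collecting pipeline receives the dict with the extra a field; in the second it receives None, because the earlier pipeline never returned the item.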
