Scraping Beike (貝殼找房) rental listings with Scrapy

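The data is rendered directly in the page: it is not loaded dynamically, no cookie is required, and there is no anti-scraping protection to speak of. (I am writing this post because I am still learning the Scrapy framework and want to reinforce what I have learned.)

The page URLs follow a regular pattern.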

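    # The URLs can be expressed like this, one per page for pages 1 to 100
    # (the URL string itself is not shown here)
    start_urls = [f'' for i in range(1, 101)]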
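
As a minimal sketch of how such a page list can be built: the template below is a hypothetical placeholder (not the site's real URL pattern), and `build_page_urls` is a helper name introduced here only for illustration.

```python
def build_page_urls(template, first_page=1, last_page=100):
    """Return one URL per page by filling the page number into the template."""
    return [template.format(page=page) for page in range(first_page, last_page + 1)]


# Hypothetical template; replace it with the real listing URL pattern.
PAGE_TEMPLATE = "https://example.com/list/pg{page}/"

start_urls = build_page_urls(PAGE_TEMPLATE)
print(len(start_urls), start_urls[0], start_urls[-1])
```

Keeping the template in one constant makes it easy to change the page range without touching the spider logic.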

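If you are not familiar with how to use Scrapy, see my other posts:

Basic Scrapy commands

Persistent storage with the Scrapy framework

Distributed crawling with Scrapy

Incremental crawling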

1. A few settings need to be changed in the settings.py configuration file
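
The post does not list which settings were changed, so the fragment below is only a sketch of what such changes commonly look like. `ITEM_PIPELINES`, `ROBOTSTXT_OBEY`, and `DOWNLOAD_DELAY` are real Scrapy settings, but the values, and the dotted path `properties.pipelines.PropertiesPipeline` (project named `properties`, class named `PropertiesPipeline`), are assumptions.

```python
# settings.py (fragment): an illustrative sketch, not the author's actual configuration.

# Assumption: robots.txt checking is switched off, as many tutorials do.
ROBOTSTXT_OBEY = False

# Assumption: a small delay between requests to stay polite to the server.
DOWNLOAD_DELAY = 1

# Register the pipeline so that process_item() is called for every item.
# The dotted path assumes the project is named "properties" and the class
# is called PropertiesPipeline.
ITEM_PIPELINES = {
    "properties.pipelines.PropertiesPipeline": 300,
}
```

Without an `ITEM_PIPELINES` entry, Scrapy never calls the pipeline, so no CSV file would be written.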

2. Define the fields you need in items.py

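    import scrapy


    class PropertiesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()    # listing title
        link = scrapy.Field()     # URL of the detail page
        address = scrapy.Field()  # location shown in the listing
        big = scrapy.Field()      # floor area
        where = scrapy.Field()    # orientation
        how = scrapy.Field()      # room layout
        price = scrapy.Field()    # monthly rent
        name = scrapy.Field()     # contact name from the detail page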

3. Write the spider file

    import copy

    import scrapy
    from properties.items import PropertiesItem


    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        # One URL per listing page, pages 1 to 100 (the URL string is not shown here)
        start_urls = [f'' for i in range(1, 101)]

        def parse(self, response):
            # Each listing on the page sits in its own div
            node_list = response.xpath('//div[@class="content__list--item--main"]')
            item = PropertiesItem()
            for node in node_list:
                item["title"] = node.xpath("./p[1]/a/text()").extract_first().strip()
                item["link"] = response.urljoin(node.xpath("./p[1]/a/@href").extract_first().strip())
                item["address"] = node.xpath("./p[2]/a[3]/text()").extract_first().strip()
                item["big"] = node.xpath("./p[2]/text()[5]").extract_first().strip()
                item["where"] = node.xpath("./p[2]/text()[6]").extract_first().strip()
                item["how"] = node.xpath("./p[2]/text()[7]").extract_first().strip()
                # '元/月' means "yuan per month"
                item["price"] = node.xpath('./span[@class="content__list--item-price"]/em/text()').extract_first().strip() + '元/月'
                # item["name"] = 'none'
                # yield item
                # Follow the detail page to fetch the contact name. The item is
                # deep-copied because the same item object is reused on every
                # iteration of this loop.
                yield scrapy.Request(
                    url=item["link"],
                    callback=self.makes,
                    meta={"item": copy.deepcopy(item)},
                    dont_filter=True,
                )

        def makes(self, response):
            # Pick up the item passed from parse() and add the contact name
            item = response.meta['item']
            item["name"] = response.xpath('//span[@class="contact_name"]/@title').extract_first()
            yield item
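
The spider imports `copy` and reuses a single item object for every listing on a page. A plausible reason for copying the item before handing it to the follow-up request is that the callback runs later, after the loop has already overwritten the shared object. The standalone sketch below illustrates that effect with a plain dict standing in for the item; the listing titles are made up.

```python
import copy

shared = {}          # stands in for the single item created before the loop
by_reference = []    # what later callbacks would see without a copy
by_copy = []         # what they see when each request gets its own copy

for title in ["listing A", "listing B", "listing C"]:
    shared["title"] = title
    by_reference.append(shared)            # same object every time
    by_copy.append(copy.deepcopy(shared))  # independent snapshot

print([d["title"] for d in by_reference])  # ['listing C', 'listing C', 'listing C']
print([d["title"] for d in by_copy])       # ['listing A', 'listing B', 'listing C']
```

An alternative design is to create a fresh item inside the loop, which avoids the need for a copy altogether.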

4. Write the pipeline file pipelines.py

    import csv


    class PropertiesPipeline:
        def __init__(self):
            self.fp = None

        def open_spider(self, spider):
            # Called once when the spider starts: open the CSV and write the header row
            print('***** crawl started *****')
            self.fp = open('貝殼.csv', 'w', newline='', encoding="utf8")
            self.csv_writer = csv.writer(self.fp)
            # The "**" column holds item["price"]
            self.csv_writer.writerow(["Title", "Link", "Address", "Size", "Orientation", "Rooms", "**", "Name"])

        def process_item(self, item, spider):
            # Write one row per item, in the same order as the header
            self.csv_writer.writerow(
                [item["title"], item["link"], item["address"],
                 item["big"], item["where"], item["how"], item["price"], item['name']]
            )
            return item

        def close_spider(self, spider):
            # Called once when the spider finishes: release the file handle
            self.fp.close()
            print('***** crawl finished *****')

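No other files need to be changed. If you want to store the data in a database, a CSV file, or elsewhere, see my blog posts:

Persistent storage with Scrapy

Working with CSV, Excel, and Word in Python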

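This is the data I scraped. It contains many duplicates, and that is caused by the site itself: when it displays listings, it draws them from one large list, so the results differ every time you refresh the page. You can deduplicate with data-analysis modules, which I will cover in a later blog post. For now I store the data in a CSV file, and both WPS and Office have built-in deduplication.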
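
Deduplication can also be done with the standard library alone. The following is a rough sketch, not the approach the author plans to publish: it assumes that two rows with the same link (the second column written by the pipeline) describe the same listing, and `dedupe_rows` and `dedupe_csv` are helper names made up for this example.

```python
import csv


def dedupe_rows(rows, key_index=1):
    """Keep the first row for each distinct value in column key_index."""
    seen = set()
    unique = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_csv(src_path, dst_path, key_index=1):
    """Copy src_path to dst_path, dropping rows whose key column repeats."""
    with open(src_path, newline="", encoding="utf8") as src:
        rows = list(csv.reader(src))
    if not rows:
        return  # nothing to write for an empty file
    header, body = rows[0], rows[1:]
    with open(dst_path, "w", newline="", encoding="utf8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(dedupe_rows(body, key_index))
```

For example, `dedupe_csv('貝殼.csv', 'deduped.csv')` would write a copy of the scraped file with repeated links removed; the output file name is arbitrary.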
