Python Crawler Basics 16: Scraping Lianjia Rental Listings

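As a developer active in the Beijing-Tianjin-Hebei region, I like to look at data about Shijiazhuang, that "international metropolis", whenever I have some spare time. This post scrapes rental listings from Lianjia; the scraped data can serve as material for data analysis in later posts.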

The URL we need to scrape is:

First, decide which data we need.

As you can see, the ××× boxes mark the data we need.

Next, work out the pagination pattern.

pg1/ pg2/ pg3/ pg4/ pg5/ ... pg80/
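
Since the pages only differ in the trailing `pgN/` segment, the full list of page URLs can be generated up front. The following is a minimal sketch of that idea; `BASE_URL` is a hypothetical placeholder (not the real listing address) and `build_page_urls` is a name introduced here for illustration.

```python
# Hypothetical placeholder; substitute the real listing URL of the site.
BASE_URL = "https://example.com/zufang/"
LAST_PAGE = 80  # the pagination shown above runs from pg1/ to pg80/


def build_page_urls(base_url=BASE_URL, last_page=LAST_PAGE):
    """Return the listing URLs pg1/ ... pg<last_page>/ in order."""
    return ["{}pg{}/".format(base_url, page) for page in range(1, last_page + 1)]


if __name__ == "__main__":
    urls = build_page_urls()
    print(len(urls), urls[0], urls[-1])
```

Building the list first keeps the crawling code free of string handling, and it makes it easy to crawl only a few pages while testing.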

The main technique in this post is sending requests with a random User-Agent (UA).

    self._ua = UserAgent()  # UserAgent comes from the fake_useragent package
    self._headers = {"User-Agent": self._ua.random}  # pick a random UA
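
The headers above are built once, so every request from the same spider carries the same UA. If you would rather rotate the UA per request, or avoid an extra dependency, something like the sketch below may work. It is only an illustration: `USER_AGENTS` is a small hand-written pool and `random_headers` is a hypothetical helper, neither of which appears in the original code.

```python
import random

# Illustrative pool of UA strings; a real crawler would use a larger, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers():
    """Return request headers with a User-Agent chosen at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}


if __name__ == "__main__":
    # Each call may return a different UA, so call it once per request.
    print(random_headers())
```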

Because the page URLs can be generated quickly, the crawl uses coroutines, and the CSV file is written with the pandas module.
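
The CSV is written with `encoding='utf_8_sig'`, which prepends a byte-order mark (BOM) so that spreadsheet programs such as Excel typically recognise the file as UTF-8 and display Chinese text correctly. The sketch below demonstrates the effect with the standard-library `csv` module rather than pandas, purely to stay dependency-free; the rows and the helper names `write_rows` and `starts_with_bom` are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows standing in for scraped listings.
ROWS = [
    {"region": "Example Estate A", "price": "1500"},
    {"region": "Example Estate B", "price": "2200"},
]


def write_rows(path, rows):
    """Write a list of dicts to a CSV file encoded as UTF-8 with a BOM."""
    with open(path, "w", newline="", encoding="utf_8_sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 byte-order mark."""
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "rentals.csv")
        write_rows(target, ROWS)
        print(starts_with_bom(target))
```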

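    import asyncio

    import pandas as pd
    from lxml import etree


    class LianjiaSpider:
        # __init__ is expected to set self._ua, self._headers and self._data (a list of rows);
        # get(url) is the coroutine that downloads a page and returns its HTML text.

        async def parse_html(self):
            for page in range(1, 81):  # pg1/ ... pg80/
                # self._base_url is an assumed attribute holding the listing URL
                url = "{}pg{}/".format(self._base_url, page)
                print("Crawling {}".format(url))
                html = await self.get(url)  # fetch the page content
                html = etree.HTML(html)  # parse the page
                self.parse_page(html)  # extract the fields we want

            print("Saving data...")
            data = pd.DataFrame(self._data)
            data.to_csv("鏈家網租房資料.csv", encoding="utf_8_sig")  # write the CSV file

        def run(self):
            # asyncio.run() starts an event loop and runs the crawl to completion
            asyncio.run(self.parse_html())


    if __name__ == '__main__':
        l = LianjiaSpider()
        l.run()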
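
Note that the crawl above schedules a single coroutine, so pages are still awaited one after another. To actually overlap the downloads, the requests can be launched together and capped with a semaphore. The sketch below shows that pattern under stated assumptions: `fetch` is a stand-in that sleeps instead of making a real HTTP request, and `MAX_CONCURRENCY`, `fetch_limited`, `crawl` and `crawl_all` are names introduced here, not part of the original spider.

```python
import asyncio

MAX_CONCURRENCY = 5  # assumed limit; tune it to what the site tolerates


async def fetch(url):
    """Stand-in for a real HTTP request; returns fake HTML for the URL."""
    await asyncio.sleep(0.01)  # simulate network latency
    return "<html>{}</html>".format(url)


async def fetch_limited(url, semaphore):
    # The semaphore caps how many fetches are in flight at the same time.
    async with semaphore:
        return await fetch(url)


async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [fetch_limited(url, semaphore) for url in urls]
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)


def crawl_all(urls):
    """Synchronous entry point: run the whole crawl and return the pages."""
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    pages = crawl_all(["pg{}/".format(n) for n in range(1, 11)])
    print(len(pages))
```

Capping concurrency matters for a real site: firing all 80 requests at once is more likely to trigger rate limiting than a handful at a time.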

The code above is missing the function that parses the page, so let's fill it in next.

    def parse_page(self, html):
        # Each listing on the page sits in its own "info-panel" div
        info_panel = html.xpath("//div[@class='info-panel']")
        for info in info_panel:
            region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
            zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
            meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
            where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))
            con = info.xpath(".//div[@class='con']/text()")
            floor = con[0]  # floor
            house_type = con[1]  # style
            agent = info.xpath(".//div[@class='con']/a/text()")[0]
            has = info.xpath(".//div[@class='left agency']//text()")
            price = info.xpath(".//div[@class='price']/span/text()")[0]
            price_pre = info.xpath(".//div[@class='price-pre']/text()")[0]
            look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]
            # Collect the fields extracted above into one row
            one_data = {
                "region": region, "zone": zone, "meters": meters, "where": where,
                "floor": floor, "type": house_type, "agent": agent, "has": has,
                "price": price, "price_pre": price_pre, "look_num": look_num,
            }
            self._data.append(one_data)
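
`parse_page` calls `self.remove_space`, which the post never shows. One plausible version is sketched below as a plain function; its behaviour (take the first XPath text result and trim whitespace, falling back to a default when nothing matched) is an assumption, not the author's confirmed implementation.

```python
def remove_space(values, default=""):
    """Return the first string of an XPath text() result with whitespace trimmed.

    `values` is the list that lxml returns for a text() query. An empty list
    means the node was not found, in which case `default` is returned.
    """
    if not values:
        return default
    # strip() with no arguments removes leading and trailing whitespace
    return values[0].strip()


if __name__ == "__main__":
    print(remove_space(["  Example Estate \n"]))
    print(repr(remove_space([])))
```

The same guard is worth considering for the fields that index `[0]` directly: a listing with a missing node would otherwise raise `IndexError` and abort the page.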

Before long, most of the data has been scraped.
