Scraping Lianjia second-hand housing data for the Chongqing area with Python

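I have recently been learning data analysis and wanted a dataset to practice on, so I decided to use Python to scrape Lianjia's second-hand housing listings for the Chongqing area.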

The Lianjia listing page looks like this:

The scraping code is as follows:

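import csv
import time

import requests
from bs4 import BeautifulSoup


def parse_one_page(url):
    # The headers dict is missing from the source; supply your own request headers here
    headers = {}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Class names are kept as they appear in the source, which looks lowercased;
    # BeautifulSoup matches class names case-sensitively, so check them against the live page
    results = soup.find_all(class_="clear logclickdata")
    for item in results:
        output = []
        # Get the district from the URL
        output.append(url.split('/')[-3])
        # Get the layout, area, orientation and so on; the elevator field is
        # sometimes missing, which is easy to handle during data cleaning
        info1 = item.find('div', 'houseinfo').text.replace('\n', '').split('|')
        for t in info1:
            output.append(t)
        # Get the total price
        output.append(item.find('div', 'totalprice').text)
        # Get the construction year; if there is none, store a placeholder
        info2 = item.find('div', 'positioninfo').text.replace('\n', '')
        if info2.find('年') != -1:
            pos = info2.find('年')
            # Reconstructed: the slice is missing from the source; this assumes
            # a four-digit year directly before '年'
            output.append(info2[pos - 4:pos])
        else:
            output.append(' ')
        # Get the unit price
        output.append(item.find('div', 'unitprice').text)
        # print(output)
        write_to_file(output)


def write_to_file(content):
    # newline='' keeps blank lines out of the CSV output
    with open('data.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # writer.writerow(['region', 'garden', 'layout', 'area', 'direction', 'renovation', 'elevator', 'price', 'year', 'perprice'])
        writer.writerow(content)


def main(offset):
    regions = ['jiangbei', 'yubei', 'nanan', 'banan', 'shapingba', 'jiulongpo', 'yuzhong', 'dadukou', 'jiangjing', 'fuling',
               'wanzhou', 'hechuang', 'bishan', 'changshou1', 'tongliang', 'beibei']
    for region in regions:
        # range(1, offset) visits pages 1 to offset - 1
        for i in range(1, offset):
            # The site prefix of the URL is missing from the source, so the leading string is left empty
            url = '' + region + '/pg' + str(i) + '/'
            parse_one_page(url)
            time.sleep(1)
        print('{} has been written.'.format(region))


main(101)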
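
To make the two string-handling steps in the scraper easier to follow, here is a minimal offline sketch. The helper names `split_house_info` and `extract_year` and the sample strings are hypothetical, made up for illustration rather than taken from the site. The year slice assumes a four-digit year sits directly before '年', and unlike the scraper this version also trims whitespace around each field.

```python
def split_house_info(text):
    """Split a houseinfo string on '|' after removing newlines, trimming each field."""
    return [field.strip() for field in text.replace('\n', '').split('|')]


def extract_year(text):
    """Return the (up to) four characters before '年', or a single space if '年' is absent."""
    pos = text.find('年')
    if pos == -1:
        return ' '  # the post uses a space as the placeholder for a missing year
    return text[max(pos - 4, 0):pos]


# Hypothetical sample strings, not real listings
sample_info = 'Garden A | 2室1廳 | 67.5平米 | 南 | 精裝 | 有電梯'
sample_position = '中樓層(共18層)2010年建板樓'

print(split_house_info(sample_info))
print(extract_year(sample_position))
```

When the elevator field is absent, `split_house_info` simply returns one field fewer, which is why some rows in the CSV end up shifted.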

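Lianjia shows at most 100 pages of listings, so here we scrape the first 100 pages for each district. Some districts have fewer than 100 pages, but that does no harm. The results are shown below (I have already cleaned the data a little: the problematic values are in the elevator column and the residential-complex name column, and sorting the rows and then shifting the affected cells as a block is enough to fix them; the missing years are handled later):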
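
The manual fix described above (shifting cells over when a value is missing) can also be scripted. The sketch below rests on two assumptions: the ten columns are those in the commented-out header row of the scraper, and a nine-field row is missing only its elevator value. The function name `align_row` is hypothetical. Rows of any other length are returned unchanged so they can be inspected by hand, and problems in the complex-name column are not addressed here.

```python
# Column order taken from the commented-out header row in the scraper
COLUMNS = ['region', 'garden', 'layout', 'area', 'direction',
           'renovation', 'elevator', 'price', 'year', 'perprice']
ELEVATOR_INDEX = COLUMNS.index('elevator')


def align_row(row):
    """Insert an empty elevator cell into rows that are exactly one field short."""
    if len(row) == len(COLUMNS) - 1:
        # Assumption: the missing field is the elevator value
        return row[:ELEVATOR_INDEX] + [''] + row[ELEVATOR_INDEX:]
    return list(row)


# Hypothetical nine-field row with no elevator value
short_row = ['yubei', 'Garden A', '2室1廳', '67.5平米', '南', '精裝',
             '120萬', '2010', '17778元/平米']
print(align_row(short_row))
```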

Next, we use an Excel pivot table to take a quick look at how many records there are:

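The table shows that this run scraped 33225 records in total. The elevator column has many missing values, and the year column also has gaps, because the scraper wrote a space in place of a missing year. With the data in hand, the analysis can begin.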
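
The per-column counts can also be reproduced in Python instead of Excel. This is a minimal sketch, assuming data.csv has no header row (the scraper leaves the header line commented out) and uses the ten columns from that commented-out header. The names `count_filled` and `count_filled_in_file` are hypothetical. Blank and whitespace-only cells count as missing, which covers the single space written for a missing year.

```python
import csv

# Column order taken from the commented-out header row in the scraper
COLUMNS = ['region', 'garden', 'layout', 'area', 'direction',
           'renovation', 'elevator', 'price', 'year', 'perprice']


def count_filled(rows, columns=COLUMNS):
    """Count, for each column, the rows whose cell is not blank or whitespace-only."""
    counts = {name: 0 for name in columns}
    for row in rows:
        # zip stops at the shorter sequence, so short rows leave trailing columns uncounted
        for name, value in zip(columns, row):
            if value.strip():
                counts[name] += 1
    return counts


def count_filled_in_file(path):
    """Apply count_filled to a CSV file, opened with the same default encoding the scraper used."""
    with open(path, newline='') as csvfile:
        return count_filled(csv.reader(csvfile))


# Hypothetical in-memory rows; the second one has a space in place of the year
demo_rows = [
    ['yubei', 'Garden A', '2室1廳', '67.5平米', '南', '精裝', '有電梯', '120萬', '2010', '17778元/平米'],
    ['yubei', 'Garden B', '3室2廳', '98平米', '南 北', '簡裝', '', '150萬', ' ', '15306元/平米'],
]
print(count_filled(demo_rows))
```

Calling `count_filled_in_file('data.csv')` would give the same kind of summary for the scraped file.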
