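Web Scraping Practice: Data Cleaning with Pandas

The dataset contains the fields salary, company, time, job_name, and address.

The goal of this round of cleaning is to pull out the salary data and convert it to a single uniform format, ready for later statistical analysis and visualization.

First, let's see how messy the data is.

Sample of the raw salary data

Besides the usual x千/月 (thousands per month), there are also x萬/月 (ten-thousands per month) and x萬/年 (ten-thousands per year).

No "xx以上" (xx or above) entries show up here, but they still have to be accounted for.

Given the format, the salary can be split into two columns at the '-' character and each case handled separately. The position of each part after the split tells us which is the lower bound (base salary) and which is the upper bound.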
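As a rough illustration of the split described above, the sketch below uses pandas string methods on a small made-up Series. The sample values and the column names `low_raw` and `high_raw` are assumptions for the example, not taken from the scraped data.

```python
import pandas as pd

# Toy salary strings in the three formats described above (illustrative only).
salary = pd.Series(['6-8千/月', '1-1.5萬/月', '10-20萬/年'])

# Split once on '-': the left part is the lower bound, the right part is
# the upper bound followed by the unit suffix.
parts = salary.str.split('-', n=1, expand=True)
parts.columns = ['low_raw', 'high_raw']  # hypothetical column names
print(parts)
```

The right-hand column still carries the unit, so it needs the per-case handling discussed next.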

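Getting the lower bound

There are three cases to handle (strictly four, but xx千/年 never appears in the data):

xx千/月, xx萬/月, xx萬/年

The approach:

1. Determine which case applies: xx千/月, xx萬/月, or xx萬/年.
2. Find the position of '-'.
3. For 萬/月 and 萬/年, convert the value so that the lower bound is in thousands per month.

Entries with no upper bound can be handled with one extra check.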
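The extra check for open-ended entries might look like the sketch below. Since no such entry appears in the data, the exact shape of the string (here '2萬以上/月') is an assumption, and `split_bounds` is a hypothetical helper name.

```python
import re

NUMBER = re.compile(r'\d+(?:\.\d+)?')

def split_bounds(word):
    """Return (low, high) as numeric strings; high is None if there is no '-'."""
    if '-' in word:
        low, rest = word.split('-', 1)
        high = NUMBER.match(rest).group()
        return low, high
    # Assumed open-ended form such as '2萬以上/月': only a lower bound exists.
    low = NUMBER.match(word).group()
    return low, None

print(split_bounds('1-1.5萬/月'))
print(split_bounds('2萬以上/月'))
```

Returning None for the missing upper bound leaves the decision of how to treat it (drop, or reuse the lower bound) to the later analysis.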

The function code is as follows:

# coding=utf-8
def cut_word(word):
    # First attempt: return the lower bound as a string, in thousands per month (k).
    postion = word.find('-')
    if word.find('萬') == -1:
        # xx千/月: already in k per month
        bottomsalary = word[:postion]
    elif word.find('年') == -1:
        # xx萬/月: appending '0.0' turns a whole number into ten times itself ('1' -> '10.0')
        bottomsalary = word[:postion] + '0.0'
    else:
        # xx萬/年: 萬 per year -> k per month (x * 10 / 12 = x / 1.2)
        bottomsalary = str(int(word[:postion]) / 1.2)
    return bottomsalary

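Getting the upper bound

The approach is the same as for the lower bound, with minor changes to the code.

There is one pitfall with Chinese text here: in UTF-8 each Chinese character takes 3 bytes, so when the strings are handled as UTF-8 bytes (as Python 2 str values are), a suffix such as '萬/年' is 7 bytes long. To get the right slice, subtract 7, not 3.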
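The difference between counting characters and counting bytes can be checked directly. The following sketch is written for Python 3, where `str` is indexed by character and the byte form has to be requested explicitly; the sample value is illustrative.

```python
word = '10-20萬/年'  # illustrative value
suffix = '萬/年'

n_chars = len(suffix)                  # characters in the suffix
n_bytes = len(suffix.encode('utf-8'))  # UTF-8 bytes in the suffix
print(n_chars, n_bytes)

# Stripping the suffix from a Python 3 str: count characters.
text_part = word[:len(word) - n_chars]

# Stripping the suffix from UTF-8 bytes: count bytes.
raw = word.encode('utf-8')
bytes_part = raw[:len(raw) - n_bytes].decode('utf-8')
print(text_part, bytes_part)
```

So the right number to subtract depends on whether the data is held as text or as encoded bytes.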

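Here the two methods are merged into one function, and a parameter selects whether the lower or upper bound is returned.

Values such as 0.x also occur, so code like `bottomsalary = word[:(postion)] + '0.0'` goes wrong: for a lower bound of '0.8' it yields '0.80.0', which is not a valid number.

Incorrect example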
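A small reproduction of the problem, using a made-up lower bound of '0.5', shows why the string trick has to be replaced by a numeric conversion:

```python
low = '0.5'  # hypothetical lower bound from a string like '0.5-1萬/月'

# String concatenation only "multiplies by ten" for whole numbers.
broken = low + '0.0'
print(broken)

# The result is not a parseable number.
try:
    float(broken)
    parsed_ok = True
except ValueError:
    parsed_ok = False
print(parsed_ok)

# Converting to a number first avoids the problem.
fixed = str(float(low) * 10)
print(fixed)
```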

The function code is as follows:

def cut_word(word, method):
    # Return one bound of a salary string such as '1-1.5萬/月',
    # as a string in thousands per month (k).
    postion = word.find('-')
    if method == 'bottom':
        if word.find('萬') == -1:
            # xx千/月: already in k per month
            bottomsalary = str(float(word[:postion]))
        elif word.find('年') == -1:
            # xx萬/月: 1萬 = 10k
            bottomsalary = str(float(word[:postion]) * 10)
        else:
            # xx萬/年: x * 10 / 12 = x / 1.2
            bottomsalary = str(float(word[:postion]) / 1.2)
        return bottomsalary
    if method == 'top':
        # The unit suffix is 7 bytes in a UTF-8 byte string (Python 2 str)
        # and 3 characters in a Python 3 str; len() covers both cases.
        end = len(word) - len('萬/年')
        if word.find('萬') == -1:
            # xx千/月: already in k per month
            topsalary = str(float(word[(postion + 1):end]))
        elif word.find('年') == -1:
            # xx萬/月: 1萬 = 10k
            topsalary = str(float(word[(postion + 1):end]) * 10)
        else:
            # xx萬/年: x * 10 / 12 = x / 1.2
            topsalary = str(float(word[(postion + 1):end]) / 1.2)
        return topsalary
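An alternative that avoids counting suffix lengths altogether is to match the whole string with a regular expression. This is a sketch of a different technique, not the approach used above. `parse_salary` and the lookup table are hypothetical names, and it assumes Python 3 and that every well-formed entry looks like `low-high` followed by 千 or 萬, a slash, and 月 or 年.

```python
import re

# Multiplier that converts each unit to thousands per month (k).
UNIT_TO_K_PER_MONTH = {
    ('千', '月'): 1.0,
    ('萬', '月'): 10.0,
    ('千', '年'): 1.0 / 12,
    ('萬', '年'): 10.0 / 12,
}

PATTERN = re.compile(r'^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|萬)/(月|年)$')

def parse_salary(word):
    """Return (low, high) in k per month, or None if the format is unexpected."""
    m = PATTERN.match(word)
    if m is None:
        return None
    low, high, unit, period = m.groups()
    factor = UNIT_TO_K_PER_MONTH[(unit, period)]
    return float(low) * factor, float(high) * factor

print(parse_salary('6-8千/月'))
print(parse_salary('1-1.5萬/月'))
print(parse_salary('12-24萬/年'))
```

Strings that do not match return None, which makes unexpected formats easy to spot before computing statistics.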

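# Add the lower- and upper-bound columns (assumed step: apply cut_word to each salary)
df_clean['bottomsalary'] = df_clean['salary'].apply(cut_word, method='bottom')
df_clean['topsalary'] = df_clean['salary'].apply(cut_word, method='top')
# Select the salary, bottomsalary and topsalary columns
df_clean[['salary', 'bottomsalary', 'topsalary']]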

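Displaying only the salary-related columns shows that the results are as expected (the last two columns are in k, i.e. thousands per month).

Computing the average salary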

df_clean['bottomsalary'] = df_clean['bottomsalary'].astype('float')
df_clean['topsalary'] = df_clean['topsalary'].astype('float')
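Once both columns are numeric, one simple way to define the average salary is the midpoint of the two bounds. The sketch below uses a toy DataFrame, and the `avgsalary` column name is an assumption for the example; the source does not show how the average was actually computed.

```python
import pandas as pd

# Toy data standing in for df_clean (values are illustrative, in k per month).
df_demo = pd.DataFrame({
    'bottomsalary': ['6.0', '10.0'],
    'topsalary': ['8.0', '15.0'],
})

# Same casts as above: the bounds are stored as strings and must become floats.
df_demo['bottomsalary'] = df_demo['bottomsalary'].astype('float')
df_demo['topsalary'] = df_demo['topsalary'].astype('float')

# Midpoint of the lower and upper bound (hypothetical column name).
df_demo['avgsalary'] = (df_demo['bottomsalary'] + df_demo['topsalary']) / 2
print(df_demo)
```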

