Python Data Cleaning

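Handling duplicate values:

Duplicates are usually dropped, but some repeated rows are legitimate and must be kept.

df.duplicated()                            # boolean Series: True for a row that repeats an earlier one
df.duplicated(subset=..., keep='first')    # subset: columns to compare; keep: 'first' or 'last'
np.sum(df.duplicated())                    # number of duplicated rows
df.drop_duplicates(subset=..., keep='first', inplace=False)   # keep: 'first' or 'last'; inplace: True or False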
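
As a minimal sketch of the calls above, the following uses a small made-up DataFrame (the column names and values are assumptions for illustration only) to flag, count, and drop duplicated rows.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "age": [23, 35, 23, 41],
})

dup_mask = df.duplicated()                   # True for a row that repeats an earlier one
n_dups = int(dup_mask.sum())                 # number of duplicated rows
deduped = df.drop_duplicates(keep="first")   # returns a new frame; df itself is unchanged

# Compare on selected columns only and keep the last occurrence instead.
by_name = df.drop_duplicates(subset=["name"], keep="last")

print(n_dups)
print(deduped)
print(by_name)
```

Leaving inplace at its default of False keeps the original frame available, which helps when some of the duplicates turn out to be legitimate.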

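Missing values:

Missing values can be handled by deletion, replacement (filling), or interpolation.

Count missing values
np.sum(df.isnull())                                      # idiomatic form: df.isnull().sum()
Missing-value rate
df.apply(lambda x: sum(x.isnull()) / len(x), axis=0)     # one result per column
Delete directly
df.dropna(subset=['gender', 'age'], how='any', axis=0)   # drop rows; how: 'any' or 'all'
df.drop(['age', 'gender'], axis=1)                       # drop the columns themselves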
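
The counting and deletion steps above can be sketched as follows. The data is invented for illustration, and the city column is an assumption added so that one column has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in 'gender' and 'age'.
df = pd.DataFrame({
    "gender": ["F", None, "M", "F"],
    "age": [23.0, np.nan, np.nan, 41.0],
    "city": ["A", "B", "C", "D"],
})

missing_count = df.isnull().sum()    # number of missing values per column
missing_rate = df.isnull().mean()    # fraction missing per column (count / length)

# how='any': drop a row if either listed column is missing.
drop_any = df.dropna(subset=["gender", "age"], how="any")
# how='all': drop a row only if both listed columns are missing.
drop_all = df.dropna(subset=["gender", "age"], how="all")

print(missing_count)
print(missing_rate)
print(drop_any)
print(drop_all)
```

Taking the mean of the boolean mask gives the same rate as the apply/lambda form, since True counts as 1.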
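Filling: median, mean
df.age.fillna(df.age.mean())                      # or df.age.median()
df.gender.fillna(df.gender.mode()[0])             # fill with the mode
df.age.fillna(20)                                 # fill with a fixed value
df.fillna(value=...)                              # value: a scalar, or a dict mapping column to fill value
df.fillna(method='ffill')                         # forward fill; method='bfill' for backward fill (newer pandas: df.ffill() / df.bfill())
df.age.interpolate(method='linear')               # linear interpolation
df.age.interpolate(method='polynomial', order=1)  # polynomial interpolation; order must be an integer, requires SciPy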
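
A short sketch of the filling options above, on invented data. It uses Series.ffill() and Series.bfill() rather than fillna(method=...), and it omits polynomial interpolation because that method needs SciPy installed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, np.nan, 40.0],
    "gender": ["F", "M", None, "F", "F"],
})

age_mean = df["age"].fillna(df["age"].mean())              # fill with the mean
age_median = df["age"].fillna(df["age"].median())          # fill with the median
gender_mode = df["gender"].fillna(df["gender"].mode()[0])  # fill with the mode
age_ffill = df["age"].ffill()                              # carry the previous value forward
age_bfill = df["age"].bfill()                              # pull the next value backward
age_linear = df["age"].interpolate(method="linear")        # straight line between neighbours

print(pd.DataFrame({
    "mean": age_mean, "ffill": age_ffill,
    "bfill": age_bfill, "linear": age_linear,
}))
```

None of these calls modifies df; each returns a new Series, so the result has to be assigned back to take effect.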

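Outliers:

Detection:
xbar = df.counts.mean()
xstd = df.counts.std()
# normal values are expected to lie between these two bounds
xbar + 2 * xstd
xbar - 2 * xstd
# quick check with any()
any(df.counts > xbar + 2 * xstd)
any(df.counts < xbar - 2 * xstd)
# visual check
df.counts.plot(kind='hist')       # histogram of the distribution
q1 = df.counts.quantile(q=0.25)
q3 = df.counts.quantile(q=0.75)
iqr = q3 - q1                     # interquartile range
df.counts.plot(kind='box')        # box plot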
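
The two detection rules above (mean plus or minus two standard deviations, and the box-plot limits based on the interquartile range) can be sketched without plotting. The data is invented, and the lower limit ll is an assumption added by symmetry with the upper limit.

```python
import pandas as pd

# Hypothetical data with one obviously extreme value.
counts = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100], name="counts")

# Rule 1: values more than two standard deviations from the mean.
xbar, xstd = counts.mean(), counts.std()
lower, upper = xbar - 2 * xstd, xbar + 2 * xstd
sigma_outliers = counts[(counts < lower) | (counts > upper)]

# Rule 2: values outside the box-plot whiskers.
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
ul, ll = q3 + 1.5 * iqr, q1 - 1.5 * iqr
iqr_outliers = counts[(counts < ll) | (counts > ul)]

print(sigma_outliers)
print(iqr_outliers)
```

The quartile-based rule is usually the more robust of the two, because an extreme value inflates the standard deviation it is being compared against.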
ul = q3 + 1.5 * iqr               # upper limit (upper whisker of the box plot)
# replace with the maximum
replace = df.counts[df.counts < ul].max()   # largest value below the upper limit
df.loc[df.counts > ul, 'counts'] = replace
# replace with percentiles
p1 = df.counts.quantile(0.01)
p2 = df.counts.quantile(0.99)
df['counts_new'] = df['counts']
df.loc[df['counts_new'] > p2, 'counts_new'] = p2
df.loc[df['counts_new'] < p1, 'counts_new'] = p1
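
Both replacement strategies above can be written more compactly with Series.where and Series.clip. This is a sketch on invented data, and the column name counts_capped is hypothetical. Both results go into new columns so that the original values stay available for comparison.

```python
import pandas as pd

# Hypothetical data with one extreme value.
df = pd.DataFrame({"counts": [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]})

q1, q3 = df["counts"].quantile(0.25), df["counts"].quantile(0.75)
ul = q3 + 1.5 * (q3 - q1)         # upper limit from the box-plot rule

# Strategy 1: replace values above the limit with the largest value within it.
within = df["counts"] <= ul
replace = df.loc[within, "counts"].max()
df["counts_capped"] = df["counts"].where(within, replace)

# Strategy 2: cap both tails at the 1st and 99th percentiles.
p1, p2 = df["counts"].quantile(0.01), df["counts"].quantile(0.99)
df["counts_new"] = df["counts"].clip(lower=p1, upper=p2)

print(df)
```

On a sample this small the 99th percentile is still pulled up by the extreme value, so the percentile cap is much looser than the box-plot cap.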

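Data discretization (binning, usually into equal-frequency or equal-width bins):

pd.cut(series, bins, labels=...)     # bins: the number of bins, or a list of cut points

pd.qcut(series, q, labels=...)       # q: the number of quantiles, or a list of quantile probabilities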

# Equal-width bins:
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'))
>>> a = pd.cut(df.a, 2, labels=[1, 2])
>>> a   # the first column is the row index
0    1
1    1
2    2
3    2
>>> a.value_counts()   # count per bin
2    2
1    2
# Equal-frequency bins
# Method 1:
>>> b = pd.Series(np.arange(12))
>>> w = 4
>>> k = [i/w for i in range(w+1)]
>>> k
[0.0, 0.25, 0.5, 0.75, 1.0]
>>> pd.qcut(b, k, labels=[1, 2, 3, 4])   # k: the quantile probabilities used as cut points
0     1
1     1
2     1
3     2
4     2
5     2
6     3
7     3
8     3
9     4
10    4
11    4
# Method 2:
>>> w = b.quantile([i/w for i in range(w+1)])
>>> w
0.00     0.00
0.25     2.75
0.50     5.50
0.75     8.25
1.00    11.00
>>> pd.cut(b, w)   # 0 falls outside (0.0, 2.75]; pass include_lowest=True to keep it
0              NaN
1      (0.0, 2.75]
2      (0.0, 2.75]
3      (2.75, 5.5]
4      (2.75, 5.5]
5      (2.75, 5.5]
6      (5.5, 8.25]
7      (5.5, 8.25]
8      (5.5, 8.25]
9     (8.25, 11.0]
10    (8.25, 11.0]
11    (8.25, 11.0]
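
With evenly spaced data such as np.arange(12), equal-width and equal-frequency bins look the same. The sketch below uses an invented skewed series to show where they differ, then repeats the quantile-edge approach from Method 2 with include_lowest=True so that the minimum is not lost.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: one value far from the rest.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

width_bins = pd.cut(s, 2, labels=[1, 2])    # two bins of equal width over the range 1 to 100
freq_bins = pd.qcut(s, 2, labels=[1, 2])    # two bins holding equal numbers of values

print(width_bins.value_counts())
print(freq_bins.value_counts())

# Quantile edges passed to pd.cut, as in Method 2.
b = pd.Series(np.arange(12))
edges = b.quantile([0, 0.25, 0.5, 0.75, 1.0]).to_numpy()
default_cut = pd.cut(b, edges)                       # intervals are open on the left
with_lowest = pd.cut(b, edges, include_lowest=True)  # first interval also includes the minimum

print(default_cut.isnull().sum(), with_lowest.isnull().sum())
```

Equal-width bins put almost everything into the first bin when the data is skewed, which is the usual reason to prefer pd.qcut for such columns.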
