07 20 Missing-value handling (part 2) and outlier handling

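I. Deletion

Rows with missing values are usually not deleted outright, because deleting too many rows distorts the analysis. The usual alternative is the second approach: imputation (filling the gaps).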
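To make the cost of deletion concrete, here is a minimal sketch on a hypothetical toy DataFrame (the data is invented for illustration; only the column names `age` and `gender` follow the rest of this post). It counts the missing values per column and shows how many rows `dropna()` would discard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame({
    "age": [23.0, np.nan, 31.0, 45.0, np.nan],
    "gender": ["F", "M", None, "M", "M"],
})

missing_per_column = df.isna().sum()   # number of missing values in each column
dropped = df.dropna()                  # keep only rows without any missing value
rows_lost = len(df) - len(dropped)

print(missing_per_column)
print(f"rows lost by deletion: {rows_lost} of {len(df)}")
```

Only three cells are missing here, yet three of the five rows disappear, which is why filling is usually preferred.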

II. Imputation (generally suited to float or int data)

1. Filling with the mean or the median:

df.age                            # view the age column
df.age.mean()                     # mean of the age column
df.age.fillna(df.age.mean())      # fill each missing position with the mean
df.age.median()                   # median of the age column
df.age.fillna(df.age.median())    # fill each missing position with the median
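The following sketch runs both fills on a small hypothetical `age` column (the values are invented for illustration). It includes one extreme age to show that the mean is pulled towards it while the median is not.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 110 is deliberately extreme.
df = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0, 110.0]})

mean_filled = df["age"].fillna(df["age"].mean())      # mean of the known values is 50.0
median_filled = df["age"].fillna(df["age"].median())  # median of the known values is 35.0

print(mean_filled.tolist())
print(median_filled.tolist())
```

Note that `fillna` returns a new Series; `df` itself still contains the missing value unless the result is assigned back.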

2. Filling with the mode (suited to string data)

df.gender                               # view the gender column
df.gender.mode()                        # mode(s) of the gender column
# There is often more than one mode; the first one is usually chosen.
df.gender.fillna(df.gender.mode()[0])   # fill missing values with the first mode
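A short sketch of why the `[0]` matters, using a hypothetical `gender` column with two equally frequent values (the data is invented for illustration). `mode()` returns a Series, and passing a Series to `fillna` fills by index alignment rather than with a single value.

```python
import pandas as pd

# Hypothetical data with a tie: "F" and "M" both appear twice.
df = pd.DataFrame({"gender": ["F", "M", None, "M", "F", None]})

modes = df["gender"].mode()          # Series of all modes, sorted: "F", "M"
first_mode = modes[0]                # pick the first one
filled = df["gender"].fillna(first_mode)

# Passing the whole Series aligns on the index (0 and 1 here),
# so the missing values at positions 2 and 5 are left untouched.
aligned = df["gender"].fillna(modes)

print(modes.tolist())
print(filled.tolist())
print(aligned.isna().sum())
```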

In practice, different variables are usually filled with different methods.

Combined use:

df.fillna(value=<dict mapping each column name to its fill value>)
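One way to read the combined call is to pass `value` a dictionary keyed by column name. The sketch below assumes a hypothetical two-column frame (invented for illustration) and fills `age` with its mean and `gender` with its first mode in a single call.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "gender": ["M", None, "M"],
})

# One fill value per column: mean for the numeric column, first mode for the string column.
fill_values = {
    "age": df["age"].mean(),
    "gender": df["gender"].mode()[0],
}
filled = df.fillna(value=fill_values)

print(filled)
```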

3. Forward and backward fill

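Forward fill: fill a missing value with the value from the previous row.
df.fillna(method='ffill')    # recent pandas versions deprecate this form in favour of df.ffill()
Backward fill: fill a missing value with the value from the next row.
df.fillna(method='bfill')    # recent pandas versions deprecate this form in favour of df.bfill()
Missing values can remain after a forward or backward fill, because a missing value in the first row has no previous row to copy from, and a missing value in the last row has no next row.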
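The leftover missing values are easy to see on a small hypothetical Series that starts and ends with a gap (values invented for illustration). The sketch uses the `ffill()` and `bfill()` methods and then chains them, which is one possible way to cover both edges.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()        # the first value stays missing: nothing above it
backward = s.bfill()       # the last value stays missing: nothing below it
both = s.ffill().bfill()   # forward fill first, then backward fill what is left

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```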

4. Interpolation:

df.age.interpolate(method='polynomial', order=1)   # replace missing values with interpolated ones (method='polynomial' requires SciPy)
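A minimal sketch of what interpolation produces, on a hypothetical Series (values invented for illustration). It swaps in `method="linear"`, which needs no SciPy and, like a first-order polynomial, draws a straight line between the known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical values with interior gaps.
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan, 60.0])

# Linear interpolation treats the rows as equally spaced.
interpolated = s.interpolate(method="linear")

print(interpolated.tolist())
```

Unlike forward fill, which would repeat 10.0 twice, interpolation spreads the gap evenly between 10.0 and 40.0.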

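Outliers are values that deviate from the normal range. They are not erroneous values. Outliers occur rarely, but they can still bias the analysis.

Outliers are usually handled by capping or by discretizing the data.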
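Capping is worked through later in this post; discretization is not, so here is a rough sketch of the idea using `pd.cut`. The ages, bin edges, and labels are all hypothetical choices made for illustration. Once values are grouped into bins, an extreme value simply lands in the outermost bin and no longer dominates the scale.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 120 is an extreme value.
ages = pd.Series([5, 17, 30, 64, 120])

# Hypothetical bin edges and labels; intervals are (0, 18], (18, 65], (65, inf).
bins = [0, 18, 65, np.inf]
labels = ["minor", "adult", "senior"]
binned = pd.cut(ages, bins=bins, labels=labels)

print(binned.tolist())
```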

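I. Detecting outliers

1. Two standard deviations from the mean

Any value within mean ± 2 standard deviations is treated as normal; any value outside that range is an outlier.

Example: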

import pandas as pd
import numpy as np
import os

os.chdir('path/to/data/directory')                # placeholder: the directory where the data file is stored
sunspots = pd.read_csv('sunspots.csv', sep=',')   # read the data
sunspots                                          # view the data
xbar = sunspots.counts.mean()                     # mean
xstd = sunspots.counts.std()                      # standard deviation
xbar + 2 * xstd                                   # upper limit
xbar - 2 * xstd                                   # lower limit
any(sunspots.counts > xbar + 2 * xstd)            # is any value above the upper limit? returns True or False
any(sunspots.counts < xbar - 2 * xstd)            # is any value below the lower limit?
sunspots.counts.plot()                            # line plot of the series
sunspots.counts.plot(kind='hist')                 # histogram of the distribution
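Since `sunspots.csv` is not included with this post, the sketch below applies the same two-standard-deviation rule to a small hypothetical `counts` column (values invented for illustration) and also extracts the offending rows rather than only testing for their existence.

```python
import pandas as pd

# Hypothetical stand-in for the sunspots data; 100 is deliberately extreme.
sunspots = pd.DataFrame({"counts": [10, 12, 11, 13, 12, 11, 10, 12, 11, 100]})

xbar = sunspots["counts"].mean()
xstd = sunspots["counts"].std()   # sample standard deviation (ddof=1)
upper = xbar + 2 * xstd
lower = xbar - 2 * xstd

has_high = (sunspots["counts"] > upper).any()   # vectorised equivalent of any(...)
has_low = (sunspots["counts"] < lower).any()

# Select the outlying values themselves.
outliers = sunspots.loc[(sunspots["counts"] > upper) | (sunspots["counts"] < lower), "counts"]

print(has_high, has_low)
print(outliers.tolist())
```

Note that the extreme value itself inflates both the mean and the standard deviation, so this rule can miss outliers in small samples.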

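2. Quantiles (box-plot method)

First compute the median, the upper quartile Q3 (75% quantile), and the lower quartile Q1 (25% quantile).
Interquartile range: IQR = Q3 - Q1
Upper limit: Q3 + 1.5 × IQR
Lower limit: Q1 - 1.5 × IQR

Any value within this range is treated as normal.

Example: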

q1 = sunspots.counts.quantile(q=0.25)     # lower quartile
q3 = sunspots.counts.quantile(q=0.75)     # upper quartile
iqr = q3 - q1                             # interquartile range
Check:
any(sunspots.counts > q3 + 1.5 * iqr)     # is any value above the upper limit?
any(sunspots.counts < q1 - 1.5 * iqr)     # is any value below the lower limit?
Draw the box plot:
sunspots.counts.plot(kind='box')
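The quartile rule can be checked by hand on a small hypothetical `counts` column (values invented for illustration, standing in for the sunspots data). The sketch computes both limits and lists the values that fall outside them.

```python
import pandas as pd

# Hypothetical stand-in for the sunspots data.
sunspots = pd.DataFrame({"counts": [10, 12, 11, 13, 12, 11, 10, 12, 11, 100]})

q1 = sunspots["counts"].quantile(q=0.25)
q3 = sunspots["counts"].quantile(q=0.75)
iqr = q3 - q1

upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr   # the lower limit is measured from Q1, not from Q3

is_outlier = (sunspots["counts"] > upper) | (sunspots["counts"] < lower)
outliers = sunspots.loc[is_outlier, "counts"]

print(q1, q3, iqr, lower, upper)
print(outliers.tolist())
```

Because quartiles barely react to a single extreme value, the limits here stay tight around the bulk of the data.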

II. Handling outliers

1. Replacement

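ul = q3 + 1.5 * iqr                                             # ul = upper limit
replace_value = sunspots.counts[sunspots.counts < ul].max()     # the largest value that does not exceed the upper limit
sunspots.loc[sunspots.counts > ul, 'counts'] = replace_value    # every value above the upper limit is replaced with it
sunspots.counts.describe()                                      # summary statistics after the replacement
After the replacement the maximum equals replace_value, and no value exceeds the upper limit any more, because all of them have been replaced.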
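The same replacement, run end to end on a hypothetical `counts` column (values invented for illustration), so that the effect on the maximum can be verified.

```python
import pandas as pd

# Hypothetical stand-in for the sunspots data.
sunspots = pd.DataFrame({"counts": [10, 12, 11, 13, 12, 11, 10, 12, 11, 100]})

q1 = sunspots["counts"].quantile(q=0.25)
q3 = sunspots["counts"].quantile(q=0.75)
ul = q3 + 1.5 * (q3 - q1)   # upper limit

# Largest value that is still below the upper limit.
replace_value = sunspots.loc[sunspots["counts"] < ul, "counts"].max()

# Overwrite every value above the upper limit with that replacement value.
sunspots.loc[sunspots["counts"] > ul, "counts"] = replace_value

print(replace_value)
print(sunspots["counts"].describe())
```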

2. Quantile replacement

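p1 = sunspots.counts.quantile(q=0.01)                     # 1% quantile
p99 = sunspots.counts.quantile(q=0.99)                    # 99% quantile
Every value below the 1% quantile is replaced with p1, and every value above the 99% quantile is replaced with p99.
sunspots['new_counts'] = sunspots.counts                  # start from a copy of the original column
sunspots.loc[sunspots.counts < p1, 'new_counts'] = p1
sunspots.loc[sunspots.counts > p99, 'new_counts'] = p99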
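As an alternative to two `.loc` assignments, `Series.clip` can cap both tails in one call. This is a swapped-in technique, not the method written above, and the data is a hypothetical column holding the integers 1 to 100.

```python
import pandas as pd

# Hypothetical data: the integers 1..100.
sunspots = pd.DataFrame({"counts": list(range(1, 101))})

p1 = sunspots["counts"].quantile(q=0.01)
p99 = sunspots["counts"].quantile(q=0.99)

# Values below p1 become p1, values above p99 become p99, the rest are unchanged.
sunspots["new_counts"] = sunspots["counts"].clip(lower=p1, upper=p99)

print(p1, p99)
print(sunspots["new_counts"].describe())
```

Writing the result to a new column keeps the original `counts` available for comparison.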
