ML Data Processing: Data Preprocessing

Data normalization

Parameters of train_test_split: *arrays

Lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames

The sample sets to be split.

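**options

test_size

A float between 0.0 and 1.0: the proportion of the sample set to put in the test set. If neither test_size nor train_size is given, it is set to 0.25.

train_size

A float between 0.0 and 1.0: the proportion of the sample set to put in the training set. The default is None, in which case it is set to the complement of test_size.

random_state

Seed for the random number generator. Passing an integer makes the split reproducible.

shuffle

Whether to shuffle the sample set before splitting. The default is True.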

Examples

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> x, y = np.arange(10).reshape((5, 2)), range(5)
>>> x
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=66)
>>> x_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> x_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
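The effect of random_state and shuffle can be checked with a small sketch. The data and the values test_size=0.4 and random_state=0 are chosen only for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape((5, 2))
y = list(range(5))

# The same random_state gives the same split on every call.
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.4, random_state=0)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.4, random_state=0)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

# With shuffle=False the original order is kept:
# the first rows go to the training set, the last rows to the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, shuffle=False
)

print(same_split)
print(x_test)
print(y_train, y_test)
```

Turning shuffling off is mainly useful when the order of the samples carries meaning, for example in time-ordered data.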

import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    assert x.shape[0] == y.shape[0], "the size of x must equal y"
    assert 0.0 < test_ratio < 1.0, "test_ratio must be valid"
    # compare with None so that seed=0 is also accepted as a seed
    if seed is not None:
        np.random.seed(seed)
    # permutation(n) returns a shuffled array of the indices 0..n-1; x itself is not modified
    shuffle_index = np.random.permutation(len(x))
    test_size = int(len(x) * test_ratio)
    train_index = shuffle_index[test_size:]
    test_index = shuffle_index[:test_size]
    x_train = x[train_index]
    y_train = y[train_index]
    x_test = x[test_index]
    y_test = y[test_index]
    return x_train, x_test, y_train, y_test
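A short usage sketch for the hand-written split follows. The function is restated in compact form so that the snippet runs on its own; the toy arrays and seed=0 are assumptions made for the demonstration.

```python
import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    """Compact restatement of the hand-written split, for a self-contained demo."""
    assert x.shape[0] == y.shape[0], "the size of x must equal y"
    assert 0.0 < test_ratio < 1.0, "test_ratio must be valid"
    if seed is not None:
        np.random.seed(seed)
    shuffle_index = np.random.permutation(len(x))
    test_size = int(len(x) * test_ratio)
    train_index = shuffle_index[test_size:]
    test_index = shuffle_index[:test_size]
    return x[train_index], x[test_index], y[train_index], y[test_index]

# Toy data: row i of x is [2*i, 2*i + 1] and its label is i.
x = np.arange(20).reshape((10, 2))
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_ratio=0.2, seed=0)
print(x_train.shape, x_test.shape)

# Rows and labels stay aligned because both are indexed with the same shuffled indices.
print(np.array_equal(x_train[:, 0], 2 * y_train))
```

Because x and y must support indexing with an integer array, this version expects NumPy arrays rather than plain Python lists.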

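1. When the feature values of the samples differ widely in scale or range, the trained model is easily dominated by a few features.

As in the figure, if kNN is used, the distance is determined almost entirely by the "time of discovery" feature.

2. Solution:

3. One question: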
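Point 1 can be made concrete with a small sketch. The three samples below are hypothetical (column 0 is a size, column 1 a time of discovery in days), and the min-max rescaling is written out by hand with NumPy.

```python
import numpy as np

# Hypothetical samples: [size, time of discovery in days]
samples = np.array([[1.0, 200.0],
                    [5.0, 100.0],
                    [3.0, 150.0]])

def time_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 1."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()

# On the raw values the time column accounts for almost all of the distance.
raw_share = time_share(samples[0], samples[1])

# Min-max rescaling of each column to [0, 1]: (x - min) / (max - min)
col_min = samples.min(axis=0)
col_max = samples.max(axis=0)
scaled = (samples - col_min) / (col_max - col_min)

# After rescaling both columns contribute equally for these two samples.
scaled_share = time_share(scaled[0], scaled[1])

print(raw_share, scaled_share)
```

The same two samples are compared before and after rescaling, so the change in the share comes from the scale of the features alone.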

# MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# build the normalization model (scaler) from the training set
scaler.fit(x_train)
# per-feature maximum of the training set
scaler.data_max_
# per-feature minimum of the training set
scaler.data_min_
# normalize the training set
x_train = scaler.transform(x_train)
# normalize the test set with the same scaler
x_test = scaler.transform(x_test)

# StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
# per-feature mean of the training set
scaler.mean_
# per-feature standard deviation of the training set (the variance is in scaler.var_)
scaler.scale_
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
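What the two scalers compute can be checked against the formulas written by hand. The sketch below uses made-up training and test arrays; it shows that the test set is transformed with the statistics of the training set, so min-max scaled test values may fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data for illustration only.
x_train = np.array([[1.0, 100.0],
                    [3.0, 300.0],
                    [5.0, 500.0]])
x_test = np.array([[2.0, 600.0]])

# Min-max normalization: (x - min) / (max - min), with min and max taken from x_train.
minmax = MinMaxScaler().fit(x_train)
x_test_mm = minmax.transform(x_test)
train_min = x_train.min(axis=0)
train_max = x_train.max(axis=0)
manual_mm = (x_test - train_min) / (train_max - train_min)

# Standardization: (x - mean) / std, with mean and std taken from x_train.
# np.std uses the population standard deviation by default, which matches scale_.
standard = StandardScaler().fit(x_train)
x_test_std = standard.transform(x_test)
manual_std = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print(x_test_mm)   # second value is above 1 because 600 exceeds the training maximum
print(x_test_std)
```

Fitting only on the training set keeps information about the test set out of the preprocessing step.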

import numpy as np

class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, x):
        """Compute the mean and standard deviation of each feature from the training set x."""
        assert x.ndim == 2, "the dimension of x must be 2"
        # for each dimension (i.e. feature) of the samples, compute the mean and standard deviation
        self.mean_ = np.array([np.mean(x[:, i]) for i in range(x.shape[1])])
        self.scale_ = np.array([np.std(x[:, i]) for i in range(x.shape[1])])
        return self

    def transform(self, y):
        """Standardize the sample set y with the mean and standard deviation of this StandardScaler."""
        assert y.ndim == 2, "the dimension of y must be 2"
        assert self.mean_ is not None and self.scale_ is not None, "must fit before transform!"
        assert y.shape[1] == len(self.mean_), "the feature number of y must be equal to mean_ and scale_"
        # the output has the shape of y, the array being transformed
        scaled_y = np.empty(shape=y.shape, dtype=float)
        for col in range(y.shape[1]):
            scaled_y[:, col] = (y[:, col] - self.mean_[col]) / self.scale_[col]
        return scaled_y
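The hand-written scaler can be exercised as follows. The class is restated in vectorized form so that the snippet is self-contained, and the random data (generated with a fixed seed) is an assumption made for the demonstration.

```python
import numpy as np

class StandardScaler:
    """Vectorized restatement of the hand-written scaler, for a self-contained demo."""

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, x):
        assert x.ndim == 2, "the dimension of x must be 2"
        self.mean_ = x.mean(axis=0)   # per-feature mean
        self.scale_ = x.std(axis=0)   # per-feature standard deviation
        return self

    def transform(self, y):
        assert y.ndim == 2, "the dimension of y must be 2"
        assert self.mean_ is not None and self.scale_ is not None, "must fit before transform!"
        assert y.shape[1] == len(self.mean_), "feature count must match the fitted data"
        return (y - self.mean_) / self.scale_

# Two features on very different scales (illustrative values).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(100, 2))
x_test = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(20, 2))

scaler = StandardScaler().fit(x_train)
train_scaled = scaler.transform(x_train)
test_scaled = scaler.transform(x_test)

# The training set has mean 0 and standard deviation 1 per feature after the transform.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

This sketch divides by zero for a feature that is constant in the training set; scikit-learn's StandardScaler avoids that by using a scale of 1 for such features.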
