Python sklearn Library: Data Preprocessing

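Dataset transformations: preprocessing data

Preprocessing converts raw input data into a form that machine learning algorithms can use. It covers feature extraction and standardization.

Why it matters: many machine learning algorithms expect a standardized dataset, in which each feature looks like standard normally distributed (Gaussian) data with mean 0 and variance 1.

If the raw data is far from Gaussian, an algorithm may perform poorly on it. In practice we therefore often standardize the data with a z-score: subtract each feature's mean and divide by its standard deviation.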

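I. Standardization (z-score), or mean removal and variance scaling

The formula is (x - mean) / std, computed separately for each feature (each column).

For each feature (column), subtract the column mean and divide by the column standard deviation (not the variance). As a result, the values in every column are centered around 0 and have variance 1.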
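The column-wise formula can be checked by hand with plain NumPy. The following is a minimal sketch, not the library's implementation; the variable names (`col_mean`, `col_std`, `x_manual`) are illustrative, and it assumes the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

# Toy data: 3 samples (rows) x 3 features (columns).
x = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Per-column statistics. np.std uses ddof=0 (population std) by default.
col_mean = x.mean(axis=0)
col_std = x.std(axis=0)

# z-score: subtract each column's mean, then divide by its standard deviation.
x_manual = (x - col_mean) / col_std

print(x_manual)
print(x_manual.mean(axis=0))  # approximately 0 for every column
print(x_manual.std(axis=0))   # approximately 1 for every column
```

Dividing by the variance instead of the standard deviation would not give unit variance, which is why the distinction matters.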

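There are two ways to do this in practice: the preprocessing.scale() function, which standardizes a given array directly, and the preprocessing.StandardScaler class, which stores the mean and standard deviation learned from the training set so that the same transformation can be applied to test data.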

>>> from sklearn import preprocessing
>>> import numpy as np
>>> x = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> x_scaled = preprocessing.scale(x)
>>> x_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> # mean and standard deviation of the scaled data
>>> x_scaled.mean(axis=0)
array([ 0.,  0.,  0.])
>>> x_scaled.std(axis=0)
array([ 1.,  1.,  1.])

>>> scaler = preprocessing.StandardScaler().fit(x)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1.  ...,  0.  ...,  0.33...])
>>> scaler.scale_  # called std_ in old releases; std_ has since been removed
array([ 0.81...,  0.81...,  1.24...])
>>> scaler.transform(x)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> # the scaler fitted on the training set can be applied directly to test data
>>> scaler.transform([[-1., 1., 0.]])
array([[-2.44...,  1.22..., -0.26...]])
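To make the train/test step concrete, here is a small sketch that fits a StandardScaler on a training array and checks that transform() on a new sample is just the z-score formula applied with the stored training statistics. The names `x_train`, `x_test` and `x_test_manual` are illustrative, and the single test row is a made-up sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
x_test = np.array([[-1., 1., 0.]])  # hypothetical unseen sample

# fit() learns the per-column mean and standard deviation from x_train only.
scaler = StandardScaler().fit(x_train)

# transform() reuses those training statistics; it does not refit on x_test.
x_test_scaled = scaler.transform(x_test)

# The same result computed by hand from the stored attributes.
x_test_manual = (x_test - scaler.mean_) / scaler.scale_

print(x_test_scaled)
print(np.allclose(x_test_scaled, x_test_manual))
```

Fitting on the training set alone keeps information about the test data out of the preprocessing step.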

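II. Scaling features to a range

Besides the method above, another common approach is to scale each feature so that it lies between a given minimum and maximum (usually 0 and 1). This can be done with the preprocessing.MinMaxScaler class.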

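Reasons to use this kind of scaling include:

1. It is more robust for features whose variance is very small.

2. It can preserve zero entries in sparse data (strictly, a zero stays zero only when the feature's minimum is 0; scaling by the maximum absolute value, as MaxAbsScaler does, always keeps zeros).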

>>> x_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> x_train_minmax = min_max_scaler.fit_transform(x_train)
>>> x_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
>>> # apply the same scaling to the test data
>>> x_test = np.array([[-3., -1., 4.]])
>>> x_test_minmax = min_max_scaler.transform(x_test)
>>> x_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
>>> # the learned scale factor and offset
>>> min_max_scaler.scale_
array([ 0.5       ,  0.5       ,  0.33...])
>>> min_max_scaler.min_
array([ 0.        ,  0.5       ,  0.33...])

You can also specify the target range when constructing the object, with feature_range=(min, max). The formula applied then becomes:

x_std = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

x_scaled = x_std * (max - min) + min
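The range mapping can be verified numerically. This sketch assumes an illustrative target range of (-1, 1), stored in the hypothetical names `lo` and `hi`, and compares MinMaxScaler against a by-hand computation that multiplies x_std by (max - min) and then adds min.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

lo, hi = -1.0, 1.0  # illustrative target range, passed as feature_range

scaler = MinMaxScaler(feature_range=(lo, hi))
x_sklearn = scaler.fit_transform(x_train)

# Step 1: map each column to [0, 1].
col_min = x_train.min(axis=0)
col_max = x_train.max(axis=0)
x_std = (x_train - col_min) / (col_max - col_min)

# Step 2: stretch to the width (hi - lo), then shift so the minimum is lo.
x_manual = x_std * (hi - lo) + lo

print(x_sklearn)
print(np.allclose(x_sklearn, x_manual))
```

With the default feature_range=(0, 1), step 2 leaves x_std unchanged.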

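III. Normalization

Normalization scales each individual sample to unit norm, so that every sample has a norm of 1. It is useful if you later use a quadratic form such as the dot product, or any other kernel, to measure the similarity of two samples.

The main idea is to compute the p-norm of each sample and then divide every element of that sample by the norm, so that each transformed sample has a p-norm (L1 norm or L2 norm) equal to 1.

The p-norm is computed as: ||x||_p = (|x_1|^p + |x_2|^p + ... + |x_n|^p)^(1/p)

This method is mainly used in text classification and clustering. For example, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of those vectors.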
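The p-norm formula and the cosine-similarity remark can be illustrated with a few lines of NumPy. This is a sketch only: `p_norm` is a hypothetical helper written for this example, and the two vectors are made-up numbers standing in for TF-IDF rows.

```python
import numpy as np

def p_norm(v, p):
    """Return (|v_1|^p + ... + |v_n|^p)^(1/p) for a 1-D array v."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

# Two made-up vectors standing in for TF-IDF rows.
a = np.array([1., 2., 0., 2.])
b = np.array([2., 0., 1., 2.])

# Scale each vector to unit L2 norm.
a_unit = a / p_norm(a, 2)
b_unit = b / p_norm(b, 2)

# Dot product of the unit vectors versus the textbook cosine formula.
dot_of_units = a_unit @ b_unit
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(p_norm(a, 1), p_norm(a, 2))
print(dot_of_units, cosine)
```

Because both vectors already have length 1 after normalization, the denominator of the cosine formula drops out and only the dot product remains.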

1. The preprocessing.normalize() function transforms a given array directly:

>>> x = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> x_normalized = preprocessing.normalize(x, norm='l2')
>>> x_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

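2. The preprocessing.Normalizer class offers the same operation through the fit/transform interface, so it can be applied to both training and test data: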

>>> normalizer = preprocessing.Normalizer().fit(x)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')
>>> normalizer.transform(x)
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
>>> normalizer.transform([[-1., 1., 0.]])
array([[-0.70...,  0.70...,  0.  ...]])
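The "fit does nothing" remark can be checked directly. The sketch below also tries norm='l1' to show the other norm mentioned in the text. The names `x_l1`, `new_sample`, `out_a` and `out_b` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

x = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# L1 normalization: the absolute values in each row sum to 1.
x_l1 = normalize(x, norm='l1')
print(x_l1)
print(np.abs(x_l1).sum(axis=1))

# Normalizer works row by row and learns nothing in fit(), so the data
# passed to fit() does not change how a new sample is transformed.
new_sample = np.array([[-1., 1., 0.]])
out_a = Normalizer(norm='l2').fit(x).transform(new_sample)
out_b = Normalizer(norm='l2').fit(new_sample).transform(new_sample)
print(out_a)
print(np.allclose(out_a, out_b))
```

This is the key difference from StandardScaler and MinMaxScaler, which store column statistics learned from the training set.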
