Data Preprocessing

#############################  Data preprocessing with StandardScaler  #######################################
# Import numpy
import numpy as np
# Import the plotting library
import matplotlib.pyplot as plt
# Import the dataset generator
from sklearn.datasets import make_blobs
# Create 40 data points split into two classes
x, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
# Draw the data points as a scatter plot
plt.scatter(x[:, 0], x[:, 1], c=y, s=30, cmap=plt.cm.cool)
# Show the figure
plt.show()
# Import StandardScaler
from sklearn.preprocessing import StandardScaler
# Preprocess the data with StandardScaler
x_1 = StandardScaler().fit_transform(x)
# Draw the preprocessed data points as a scatter plot
plt.scatter(x_1[:, 0], x_1[:, 1], c=y, cmap=plt.cm.cool)
# Show the figure
plt.show()
#############################  Data preprocessing with MinMaxScaler  #######################################
# Import the dataset generator
from sklearn.datasets import make_blobs
# Create 40 data points split into two classes
x, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Preprocess the data with MinMaxScaler
x_2 = MinMaxScaler().fit_transform(x)
# Draw the preprocessed data points as a scatter plot
plt.scatter(x_2[:, 0], x_2[:, 1], c=y, cmap=plt.cm.cool)
# Show the figure
plt.show()
#############################  Data preprocessing with RobustScaler  #######################################
# Import the dataset generator
from sklearn.datasets import make_blobs
# Create 40 data points split into two classes
x, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
# Import RobustScaler
from sklearn.preprocessing import RobustScaler
# Preprocess the data with RobustScaler
x_3 = RobustScaler().fit_transform(x)
# Draw the preprocessed data points as a scatter plot
plt.scatter(x_3[:, 0], x_3[:, 1], c=y, cmap=plt.cm.cool)
# Show the figure
plt.show()
#############################  Data preprocessing with Normalizer  #######################################
# Import the dataset generator
from sklearn.datasets import make_blobs
# Create 40 data points split into two classes
x, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
# Import Normalizer
from sklearn.preprocessing import Normalizer
# Preprocess the data with Normalizer
x_4 = Normalizer().fit_transform(x)
# Draw the preprocessed data points as a scatter plot
plt.scatter(x_4[:, 0], x_4[:, 1], c=y, cmap=plt.cm.cool)
# Show the figure
plt.show()
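#############################  Improving model accuracy with data preprocessing  #######################################
# Import the wine dataset
from sklearn.datasets import load_wine
# Import the MLP neural network
from sklearn.neural_network import MLPClassifier
# Import the dataset splitting tool
from sklearn.model_selection import train_test_split
# Create the training set and the test set
wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, random_state=62)
# Print the shapes of the data
print(x_train.shape, x_test.shape)

(133, 13) (45, 13)
# Set the parameters of the MLP neural network
mlp = MLPClassifier(hidden_layer_sizes=[100, 100], max_iter=400, random_state=62)
# Fit the MLP to the raw training data
mlp.fit(x_train, y_train)
# Print the model score on the raw test data
print('Model score: {:.2f}'.format(mlp.score(x_test, y_test)))

Model score: 0.93
# Preprocess the data with MinMaxScaler (fit on the training set only)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_pp = scaler.transform(x_train)
x_test_pp = scaler.transform(x_test)
# Retrain the model on the preprocessed training data
mlp.fit(x_train_pp, y_train)
# Print the model score on the preprocessed test data
print('Model score: {:.2f}'.format(mlp.score(x_test_pp, y_test)))

Model score: 1.00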
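Note: we fit MinMaxScaler on the original training set only, then use that same fitted scaler to transform both the training set and the test set.
Never fit the scaler on the test set and then transform the test set with it. The two sets would then be scaled with different parameters, which defeats the purpose of the transformation.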
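One way to make the "fit on the training set only" rule hard to break is to wrap the scaler and the model in a scikit-learn pipeline. The following is a minimal sketch, not part of the original listing: it reuses the wine data, split, and MLP settings from above, and the names `pipe` and `score` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Same data and split as in the listing above.
wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=62)

# The pipeline fits the scaler on whatever is passed to fit() (the training
# set) and only applies transform() to data passed to score() or predict().
pipe = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=400, random_state=62),
)
pipe.fit(x_train, y_train)
score = pipe.score(x_test, y_test)
print('Model score: {:.2f}'.format(score))

# The scaler inside the pipeline learned its minima from the training set only.
scaler = pipe.named_steps['minmaxscaler']
print(np.allclose(scaler.data_min_, x_train.min(axis=0)))
```

Because the test set is only ever transformed, its scaled values may fall slightly outside the range 0 to 1. That is expected and not a sign of an error.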
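Summary:
StandardScaler transforms every feature so that it has mean 0 and variance 1, which puts all features on the same scale.
MinMaxScaler rescales every feature to the range 0 to 1. With two features, you can picture it as squeezing the data into a square whose sides are both of length 1.
RobustScaler is similar to StandardScaler, but instead of the mean and variance it uses the median and the quartiles, which makes it less sensitive to outliers.
Normalizer rescales each sample's feature vector to a Euclidean length of 1, so the data end up on a circle of radius 1 (or on a sphere in higher dimensions).
After preprocessing, the model's accuracy improved markedly (from 0.93 to 1.00 on the wine test set). The effect is especially pronounced for models that are sensitive to feature scale, such as the MLP used here.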
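The four transformations summarized above can also be written out by hand. The sketch below computes each one in NumPy and compares it with the corresponding scikit-learn class at its default settings, on the same `make_blobs` data used earlier. The variable names (`standard`, `minmax`, `robust`, `normalized`, `checks`) are chosen here for illustration. The formulas are meant to show the idea and are not a copy of the library's internal code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import (
    MinMaxScaler, Normalizer, RobustScaler, StandardScaler)

x, _ = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)

# StandardScaler: subtract the per-feature mean and divide by the
# per-feature (population) standard deviation.
standard = (x - x.mean(axis=0)) / x.std(axis=0)

# MinMaxScaler: map the per-feature minimum to 0 and the maximum to 1.
x_min, x_max = x.min(axis=0), x.max(axis=0)
minmax = (x - x_min) / (x_max - x_min)

# RobustScaler: subtract the per-feature median and divide by the
# interquartile range (75th percentile minus 25th percentile).
q25, median, q75 = np.percentile(x, [25, 50, 75], axis=0)
robust = (x - median) / (q75 - q25)

# Normalizer: divide each sample (row) by its own Euclidean norm.
normalized = x / np.linalg.norm(x, axis=1, keepdims=True)

# Compare each hand-written result with scikit-learn's default output.
checks = {
    'StandardScaler': np.allclose(standard, StandardScaler().fit_transform(x)),
    'MinMaxScaler': np.allclose(minmax, MinMaxScaler().fit_transform(x)),
    'RobustScaler': np.allclose(robust, RobustScaler().fit_transform(x)),
    'Normalizer': np.allclose(normalized, Normalizer().fit_transform(x)),
}
for name, matches in checks.items():
    print(name, matches)
```

Note that the first three scalers work column by column (per feature), while Normalizer works row by row (per sample), which is why it changes the shape of the point cloud rather than just its scale.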
Source: 《深入淺出Python機器學習》