7 Data Preprocessing: Data Standardization

This article covers introductory data preprocessing. For more normalization methods, see:

Common data preprocessing methods in sklearn

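Because the offset and span of the feature values affect how well a machine learning model performs, normalizing (standardizing) the data can improve the results. Two examples illustrate this: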

- Example 1 - Data standardization

- Example 2 - The effect of data standardization on machine learning performance

# Data preprocessing module
from sklearn import preprocessing
import numpy as np
# Build an array
a = np.array([[10, 2.7, 3.6],
              [-100, 5, -2],
              [120, 20, 40]], dtype=np.float64)
# The preprocessing module provides scale, which standardizes each column
# to zero mean and unit variance
print(preprocessing.scale(a))
# [[ 0.         -0.85170713 -0.55138018]
#  [-1.22474487 -0.55187146 -0.852133  ]
#  [ 1.22474487  1.40357859  1.40351318]]
# Alternatively, rescale each column to the range [-1, 1]
print(preprocessing.minmax_scale(a, feature_range=(-1, 1)))
'''Result:
[[ -2.77555756e-17 -1.00000000e+00 -7.33333333e-01]
 [ -1.00000000e+00 -7.34104046e-01 -1.00000000e+00]
 [  1.00000000e+00  1.00000000e+00  1.00000000e+00]]
'''
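
To make the two results above less of a black box, here is a minimal NumPy-only sketch of the arithmetic that should reproduce them for this array: column-wise z-scores (using the population standard deviation, ddof=0) and column-wise min-max scaling to [-1, 1]. The names z_scores, min_max, low and high are illustrative and do not come from the article.

```python
import numpy as np

a = np.array([[10, 2.7, 3.6],
              [-100, 5, -2],
              [120, 20, 40]], dtype=np.float64)

# Z-score standardization: per column, subtract the mean and divide by
# the population standard deviation (NumPy's default, ddof=0).
z_scores = (a - a.mean(axis=0)) / a.std(axis=0)

# Min-max scaling: per column, map the minimum to `low` and the maximum to `high`.
low, high = -1.0, 1.0
col_min = a.min(axis=0)
col_max = a.max(axis=0)
min_max = (a - col_min) / (col_max - col_min) * (high - low) + low

print(z_scores)
print(min_max)
```

After the z-score step every column has mean 0 and standard deviation 1; after the min-max step every column spans exactly [-1, 1].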

# Data standardization module
from sklearn import preprocessing
import numpy as np
# Module for splitting data into train and test sets
from sklearn.model_selection import train_test_split
# Module for generating data suited to classification
from sklearn.datasets import make_classification
# The support vector classifier from support vector machines
from sklearn.svm import SVC
# Module for visualizing data
import matplotlib.pyplot as plt
# Generate 300 samples with 2 features
x, y = make_classification(
    n_samples=300, n_features=2,
    n_redundant=0, n_informative=2,
    random_state=22, n_clusters_per_class=1,
    scale=100)
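'''Meaning of the parameters:
n_samples: number of samples.
n_features: total number of features.
n_informative: number of informative features.
n_redundant: number of redundant features.
n_repeated: number of duplicated features, drawn at random from the informative and redundant features.
n_classes: number of classes (or labels) in the classification problem.
n_clusters_per_class: number of clusters per class.
random_state: seed used by the random number generator.
scale: factor by which the features are multiplied.
'''
# Visualize the data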
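
The article mentions visualizing the data but does not show the plotting code. A minimal sketch of one way to do it, assuming a scatter plot of the two features colored by class label; the Agg backend, the axis labels and the output file name are assumptions added here, not part of the article.

```python
import matplotlib
# Non-interactive backend (assumption) so the sketch also runs without a display;
# drop this line and call plt.show() to get a window instead.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Same generated data as in the example
x, y = make_classification(
    n_samples=300, n_features=2,
    n_redundant=0, n_informative=2,
    random_state=22, n_clusters_per_class=1,
    scale=100)

# One point per sample, colored by its class label
fig, ax = plt.subplots()
ax.scatter(x[:, 0], x[:, 1], c=y)
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
fig.savefig("classification_data.png")  # hypothetical output file name
```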
Before standardization, the accuracy is only 0.477777777778:
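x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
clf = SVC()
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))
# 0.477777777778 in the author's run; the split is random and SVC defaults
# differ across scikit-learn versions, so the exact value varies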
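After data standardization
The units of the data change, and the x values are compressed into roughly the same range. After standardization, the accuracy rises to 0.9: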
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
clf = SVC()
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))
# 0.9
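
The example above standardizes the whole dataset before splitting it. A common variant, sketched here as an assumption rather than as the article's method, is to fit the scaler on the training split only and let a pipeline apply the same transformation to the test split. The random_state=0 on the split is added only to make the run repeatable; no accuracy value is claimed, since it depends on the split and on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same generated data as in the example
x, y = make_classification(
    n_samples=300, n_features=2,
    n_redundant=0, n_informative=2,
    random_state=22, n_clusters_per_class=1,
    scale=100)

# random_state=0 is an assumption, used only so the split is repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

# The pipeline learns mean and std from x_train only, then reuses them
# when scoring on x_test
model = make_pipeline(StandardScaler(), SVC())
model.fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
print(test_accuracy)

# Check that the training features now have mean 0 and std 1 per column
scaler = model.named_steps["standardscaler"]
train_scaled = scaler.transform(x_train)
print(np.round(train_scaled.mean(axis=0), 6), np.round(train_scaled.std(axis=0), 6))
```

Keeping the scaler inside the pipeline means the test split never influences the statistics used for scaling.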