Machine Learning in Practice: Credit Card Fraud (Part 1)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the data
data = pd.read_csv('creditcard.csv')
print(data.head())

# Count the samples in each class (0 = normal, 1 = fraud) and plot the distribution
count_classes = data['Class'].value_counts(sort=True).sort_index()
count_classes.plot(kind='bar')
plt.title('Fraud class histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

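Imbalanced samples

When the classes in the sample data are imbalanced, use undersampling or oversampling.

Undersampling: shrink the larger class so that classes 0 and 1 have the same, smaller number of samples.
Oversampling: grow the smaller class so that both classes have the same, larger number of samples.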
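
The code in this post only covers undersampling. For comparison, here is a minimal sketch of random oversampling using only the standard library. The function name `random_oversample` and the toy data are assumptions made for illustration and are not part of the original walkthrough. It simply duplicates minority-class rows at random until both classes are the same size.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All rows that belong to this class
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to fill the gap (empty when n == target)
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * (target - n))
    return out_rows, out_labels


# Toy data: five normal transactions (0) and one fraud (1)
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.9]]
labels = [0, 0, 0, 0, 0, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Duplicating rows adds no new information, so it is usually applied to the training set only; otherwise copies of the same row can end up in both the training and test sets.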

from sklearn.preprocessing import StandardScaler

# Standardize the transaction amount, then drop the columns that are no longer needed
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
print(data.head())
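
To make the scaling step concrete, the sketch below reproduces by hand what standardization computes for a single column: subtract the mean and divide by the population standard deviation. The helper `standardize` and the sample amounts are hypothetical, and the sketch assumes the column is not constant (otherwise the division fails).

```python
import statistics


def standardize(values):
    """Return (v - mean) / std for each value, using the population
    standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical transaction amounts
amounts = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = standardize(amounts)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, which keeps the amount on a scale comparable to the other features.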

Undersampling:

# Undersampling
x = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of fraud records and their indices
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Randomly pick the same number of normal records, without replacement
normal_indices = data[data.Class == 0].index
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Merge the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.iloc[under_sample_indices, :]
x_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

print('Percentage of normal transactions:', len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print('Percentage of fraud transactions:', len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print('Total number of transactions in resampled data:', len(under_sample_data))

Cross-validation

# Cross-validation: split the data into training and test sets
from sklearn.model_selection import train_test_split

# Split the full dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# Split the undersampled dataset the same way
x_train_undersample, x_test_undersample, y_train_undersample, y_test_undersample = train_test_split(x_undersample, y_undersample, test_size=0.3, random_state=0)
print('')
print('Number of transactions in train dataset: ', len(x_train))
print('Number of transactions in test dataset: ', len(x_test))
print('Total number of undersampled transactions: ', len(x_train_undersample)+len(x_test_undersample))
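
The split above is a single hold-out split. K-fold cross-validation, which the section title usually refers to, rotates the validation set across k folds. The sketch below only shows how the indices are partitioned. The function `k_fold_indices` is a hypothetical helper written for this illustration, and it does not shuffle, so ordered data (such as the undersampled frame, where the fraud rows come first) would need shuffling beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.
    Folds are contiguous and differ in size by at most one sample."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# Ten samples split into five folds of two validation samples each
folds = list(k_fold_indices(10, 5))
for train, val in folds:
    print('train:', train, 'validation:', val)
```

Each sample is used for validation exactly once, so averaging a metric across the folds gives a steadier estimate than one split.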
