Machine Learning in Practice: Credit Card Fraud (Part 1)

2021-10-10 15:24:55

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

# Load the data

data = pd.read_csv('creditcard.csv')

print(data.head())

# Count the samples per class (the label column in creditcard.csv is named 'Class')
count_classes = data['Class'].value_counts(sort=True).sort_index()

count_classes.plot(kind='bar')

plt.title('fraud class histogram')

plt.xlabel('class')

plt.ylabel("frequency")

plt.show()
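
The bar chart shows the imbalance visually; a numeric check can complement it. The following is only a minimal sketch on a small made-up label series (the real `data['Class']` column would take its place), showing absolute counts and relative frequencies per class:

```python
import pandas as pd

# Toy labels standing in for data['Class']: 0 = normal, 1 = fraud (made-up values)
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name='Class')

# Absolute counts per class, ordered by class label
counts = labels.value_counts().sort_index()

# Relative frequencies per class
ratios = labels.value_counts(normalize=True).sort_index()

print(counts)
print(ratios)
```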

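Class imbalance

When the classes are imbalanced, two common remedies are undersampling and oversampling.

Undersampling: shrink the majority class (0) so that both classes have equally few samples. Oversampling: enlarge the minority class (1) so that both classes have equally many samples.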
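
This post goes on to use undersampling. For contrast, random oversampling can be sketched roughly as follows. This is an illustration only, on a made-up toy DataFrame (the `feature` column and its values are assumptions, not part of the real dataset), and it simply duplicates minority rows at random rather than using a dedicated technique such as SMOTE:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'feature': [0.1, 0.4, 0.3, 0.9, 0.7, 0.2, 0.8, 0.5],
    'Class':   [0,   0,   0,   0,   0,   0,   1,   1],
})

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

majority = toy[toy['Class'] == 0]
minority = toy[toy['Class'] == 1]

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority) - len(minority)
extra_idx = rng.choice(minority.index.to_numpy(), size=n_extra, replace=True)
oversampled = pd.concat([toy, toy.loc[extra_idx]], ignore_index=True)

print(oversampled['Class'].value_counts().sort_index())
```

Because the extra rows are exact copies, a model can overfit to them, which is one reason to oversample only the training portion of the data.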

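from sklearn.preprocessing import StandardScaler

# Standardize the Amount column to zero mean and unit variance
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the columns that are no longer needed
data = data.drop(['Time', 'Amount'], axis=1)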

print(data.head())
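
As a rough illustration of what the standardization step does, the same z-score transform can be written by hand with NumPy. The amounts below are made up, and the sketch assumes the population standard deviation (`ddof=0`), which is what NumPy's `std` uses by default:

```python
import numpy as np

# Made-up transaction amounts standing in for the Amount column
amounts = np.array([1.0, 10.0, 100.0, 250.0, 1000.0])

# z-score: subtract the mean, then divide by the population standard deviation
norm_amounts = (amounts - amounts.mean()) / amounts.std()

print(norm_amounts.mean())  # close to 0
print(norm_amounts.std())   # close to 1
```

Rescaling matters here because the amount column has a much larger range than the other features, which would otherwise dominate models that are sensitive to feature scale.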

Undersampling:

# Undersampling: separate the features from the label

x = data.loc[:, data.columns != 'Class']

y = data.loc[:, data.columns == 'Class']

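# Count the fraud samples and collect their index labels
number_records_fraud = len(data[data.Class == 1])

fraud_indices = np.array(data[data.Class == 1].index)

# Index labels of the normal samples
normal_indices = data[data.Class == 0].index

# Randomly pick as many normal samples as there are fraud samples
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)

random_normal_indices = np.array(random_normal_indices)

# Merge the two sets of indices

under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# The indices are index labels, so select the rows with .loc
under_sample_data = data.loc[under_sample_indices, :]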

x_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']

y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

print('percentage of normal transactions: ', len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))

print('percentage of fraud transactions: ', len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))

print('total number of transactions in resampled data: ', len(under_sample_data))

Cross-validation

# Split into training and test sets, for the full data and for the undersampled data

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

x_train_undersample, x_test_undersample, y_train_undersample, y_test_undersample = train_test_split(x_undersample, y_undersample, test_size=0.3, random_state=0)

print('')

print('number of transactions in training set: ', len(x_train))

print('number of transactions in test set: ', len(x_test))

print('total number of transactions (undersampled): ', len(x_train_undersample)+len(x_test_undersample))
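
The code above performs a single hold-out split. Cross-validation proper would further partition the training data into k folds, with each fold serving once as the validation set. The following is a minimal sketch of that idea using NumPy only; the helper name `k_fold_indices` and the sample count are hypothetical, and in practice scikit-learn's `KFold` from `sklearn.model_selection` serves this purpose:

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Split the shuffled positions into n_folds nearly equal chunks
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

# 10 hypothetical training samples split into 5 folds
splits = list(k_fold_indices(10, n_folds=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so averaging a metric over the folds gives a steadier estimate than a single split.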
