Step 0. Overview of the Fraud Detection Classification Model
Characteristics of financial data (review)
- 1) Combination of heterogeneous data
- 2) Skewed distributions
- 3) Ambiguous classification labels
- 4) Multicollinearity among variables
- 5) Nonlinearity of variables
- In addition, data may be incomplete (missing, truncated, censored) because of practical constraints in regulation, collection, and storage
Imbalanced data
1) Imbalance in X (features)
- Categorical variables may have rarely observed categories
- Reduce the high-dimensional feature space (feature transformation)
- Use algorithms such as PCA and t-SNE
2) Imbalance in y (target)
- Most financial data pose a rare-target problem: lending (loans), deposits (savings), insurance (claims), cards (fraud detection), macroeconomics (recessions), and so on
- Handle the low-frequency class with resampling, as sketched below
- Use algorithms such as Random Oversampling, Random Undersampling, SMOTE, and Tomek Links
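As a point of reference, here is a minimal, self-contained sketch of these four resamplers using the imbalanced-learn package; the toy data built with make_classification is purely illustrative and is not the credit-card dataset used below.
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
# Toy imbalanced data: roughly 1% positive class
X_toy, y_toy = make_classification(n_samples=5000, n_features=10, weights=[0.99], random_state=0)
samplers = [RandomOverSampler(random_state=0),   # duplicate minority rows at random
            RandomUnderSampler(random_state=0),  # drop majority rows at random
            SMOTE(random_state=0),               # interpolate synthetic minority samples
            TomekLinks()]                        # remove majority points that form Tomek links
for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X_toy, y_toy)
    print(type(sampler).__name__, ':', (y_res == 1).sum(), 'positive /', (y_res == 0).sum(), 'negative')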
Step 1. Data Transformation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
## In the Colab version, the CSV dataset must be loaded from Google Drive, so mount Google Drive in Colab first.
import os, sys
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Colab Notebooks/card_fraud'
!ls
Inspecting the data
# Load the file
df = pd.read_csv('creditcard.csv')
df.info()
# Check the first 30 rows of the loaded data
df.head(30)
# Check for missing values
df.isnull().sum()
# Check the class distribution of the loaded data
df.groupby(by=['Class']).count()
print('Target class is ', '{0:0.4f}'.format(492/(284315+492)*100), '%')  # 0.1727 %
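The hard-coded counts above can also be read off the data itself; a small equivalent sketch, assuming df is the DataFrame loaded above:
# Derive the class ratio directly instead of hard-coding 492 and 284315
class_counts = df['Class'].value_counts()
print(class_counts)
print('Target class is {:.4f} %'.format(class_counts[1] / class_counts.sum() * 100))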
Scaling the variables
# Adjust the data scale
from sklearn.preprocessing import StandardScaler, RobustScaler
std_scaler = StandardScaler()
rob_scaler = RobustScaler()  # robust to outliers: centers on the median and scales by the IQR
df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))
# Drop the original Time and Amount columns
df.drop(['Time','Amount'], axis=1, inplace=True)
# Move the scaled columns to the front of the DataFrame
scaled_amount = df['scaled_amount']
scaled_time = df['scaled_time']
df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)
## Check the scaled data
df.head()
Train / Test set split
# Build the X and y datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import KFold, StratifiedKFold
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data with stratified folds so each split keeps the class ratio
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
for train_index, test_index in sss.split(X, y):
original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]
# Because the classes are heavily skewed, we need to balance the class distribution.
# Shuffle before building the subsample so the labels are not clustered on one side.
df = df.sample(frac=1)
# Prepare the data: all 492 fraud rows plus the same number of non-fraud rows
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:492]
normal_distributed_df = pd.concat([fraud_df, non_fraud_df])
# Shuffle the data
new_df = normal_distributed_df.sample(frac=1, random_state=0)
# Check the new shuffled dataset
new_df.head()
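For comparison only, the same 1:1 subsample can be drawn with imbalanced-learn's RandomUnderSampler; this sketch is not used in the rest of the tutorial and the variable names are illustrative.
from imblearn.under_sampling import RandomUnderSampler
# sampling_strategy=1.0 keeps one non-fraud row per fraud row, like the manual slicing above
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=0)
X_under, y_under = rus.fit_resample(df.drop('Class', axis=1), df['Class'])
print(pd.Series(y_under).value_counts())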
Step 2. Dimensionality Reduction
- Principal Component Analysis (PCA)
- t-SNE (t-distributed Stochastic Neighbor Embedding)
- SNE reduces discrete data distributed in n dimensions to k dimensions (an integer no greater than n), preserving distance information while prioritizing the information of nearby points
- One of the most widely used algorithms for visualizing high-dimensional data such as word vectors
- The Gaussian distribution used in SNE training drops off much more steeply with distance than the t-distribution, so points beyond a certain distance are barely reflected during training (the crowding problem)
- t-SNE was devised to fix this: it uses a t-distribution instead of a Gaussian during training (see the formulas after this list)
- t-SNE is commonly used to visualize word vectors embedded with word2vec
- Singular Value Decomposition (SVD)
- Others: Latent Semantic Analysis, Matrix Factorization, etc.
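For reference, the difference comes down to the similarity kernels (in LaTeX notation, following van der Maaten and Hinton, 2008); the heavy-tailed Student-t kernel in the low-dimensional map is what relieves the crowding problem:
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}  (Gaussian similarities in the original space)
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}  (Student-t with one degree of freedom, similarities in the embedding)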
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
# Prepare the data for dimensionality reduction
X = new_df.drop('Class', axis=1)
y = new_df['Class']
# t-SNE
X_reduced_tsne = TSNE(n_components=2, random_state=0).fit_transform(X.values)
print('t-SNE done')
# PCA
X_reduced_pca = PCA(n_components=2, random_state=0).fit_transform(X.values)
print('PCA done')
# TruncatedSVD
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=0).fit_transform(X.values)
print('Truncated SVD done')
Visualizing the results
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
f.suptitle('Clusters after Dimensionality Reduction', fontsize=16)
labels = ['No Fraud', 'Fraud']
blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')
# t-SNE scatter plot
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('t-SNE', fontsize=14)
ax1.grid(True)
ax1.legend(handles=[blue_patch, red_patch])
# PCA scatter plot
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('PCA', fontsize=14)
ax2.grid(True)
ax2.legend(handles=[blue_patch, red_patch])
# TruncatedSVD scatter plot
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax3.set_title('Truncated SVD', fontsize=14)
ax3.grid(True)
ax3.legend(handles=[blue_patch, red_patch])
plt.show()
Step 3. Undersampling
# Check the class distribution of the rebuilt data
new_df.groupby(by=['Class']).count()
# Build the X and y datasets
X = new_df.drop('Class', axis=1)
y = new_df['Class']
# Build the sample data for undersampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Convert the data to NumPy arrays for model input
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values
# Load the models to train
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "LightGBM Classifier": LGBMClassifier()
}
# Print each model's mean cross-validation accuracy
from sklearn.model_selection import cross_val_score
for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    training_score = cross_val_score(classifier, X_train, y_train, cv=5)
    print(classifier.__class__.__name__, ':', round(training_score.mean(), 2) * 100, '% accuracy')
# Results
LogisticRegression : 93.0 % accuracy
KNeighborsClassifier : 93.0 % accuracy
SVC : 93.0 % accuracy
DecisionTreeClassifier : 89.0 % accuracy
RandomForestClassifier : 93.0 % accuracy
GradientBoostingClassifier : 93.0 % accuracy
[LightGBM] verbose training log omitted (per-fold data summaries and repeated "No further splits with positive gain, best gain: -inf" warnings)
LGBMClassifier : 94.0 % accuracy
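Accuracy alone says little for this problem; a minimal sketch of repeating the cross-validation with recall and F1 as scoring metrics, reusing the classifiers and the undersampled X_train, y_train from above:
from sklearn.model_selection import cross_val_score
# Mean cross-validated recall and F1 per model on the balanced subsample
for key, classifier in classifiers.items():
    recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
    f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
    print(classifier.__class__.__name__, ': recall {:.3f}, f1 {:.3f}'.format(recall.mean(), f1.mean()))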
Checking the results
# Correct example
# Check each model's classification results
from sklearn.metrics import classification_report
for key, classifier in classifiers.items():
    y_pred = classifier.predict(X_test)              # predict on the balanced (undersampled) test set
    results = classification_report(y_test, y_pred)
    print(classifier.__class__.__name__, '-------', '\n', results)
LogisticRegression -------
precision recall f1-score support
0 0.93 0.97 0.95 101
1 0.97 0.93 0.95 96
accuracy 0.95 197
macro avg 0.95 0.95 0.95 197
weighted avg 0.95 0.95 0.95 197
KNeighborsClassifier -------
precision recall f1-score support
0 0.92 0.97 0.94 101
1 0.97 0.91 0.94 96
accuracy 0.94 197
macro avg 0.94 0.94 0.94 197
weighted avg 0.94 0.94 0.94 197
SVC -------
precision recall f1-score support
0 0.90 0.98 0.94 101
1 0.98 0.89 0.93 96
accuracy 0.93 197
macro avg 0.94 0.93 0.93 197
weighted avg 0.94 0.93 0.93 197
DecisionTreeClassifier -------
precision recall f1-score support
0 0.93 0.95 0.94 101
1 0.95 0.93 0.94 96
accuracy 0.94 197
macro avg 0.94 0.94 0.94 197
weighted avg 0.94 0.94 0.94 197
RandomForestClassifier -------
precision recall f1-score support
0 0.92 0.96 0.94 101
1 0.96 0.91 0.93 96
accuracy 0.93 197
macro avg 0.94 0.93 0.93 197
weighted avg 0.94 0.93 0.93 197
GradientBoostingClassifier -------
precision recall f1-score support
0 0.92 0.96 0.94 101
1 0.96 0.92 0.94 96
accuracy 0.94 197
macro avg 0.94 0.94 0.94 197
weighted avg 0.94 0.94 0.94 197
LGBMClassifier -------
precision recall f1-score support
0 0.92 0.97 0.95 101
1 0.97 0.92 0.94 96
accuracy 0.94 197
macro avg 0.95 0.94 0.94 197
weighted avg 0.95 0.94 0.94 197
# Check each model's confusion matrix
from sklearn.metrics import confusion_matrix
for key, classifier in classifiers.items():
    y_pred = classifier.predict(X_test)              # predict on the balanced (undersampled) test set
    cm = confusion_matrix(y_test, y_pred)
    print(classifier.__class__.__name__, '\n', cm, '\n')
LogisticRegression
[[98 3]
[ 7 89]]
KNeighborsClassifier
[[98 3]
[ 9 87]]
SVC
[[99 2]
[11 85]]
DecisionTreeClassifier
[[96 5]
[ 7 89]]
RandomForestClassifier
[[97 4]
[ 9 87]]
GradientBoostingClassifier
[[97 4]
[ 8 88]]
LGBMClassifier
[[98 3]
[ 8 88]]
# Wrong example
# Check each model's classification results (wrong example)
from sklearn.metrics import classification_report
for key, classifier in classifiers.items():
    y_pred = classifier.predict(original_Xtest)                    # predict on the original, imbalanced test set
    results_wrong = classification_report(original_ytest, y_pred)  # accuracy stays high, but fraud precision collapses
    print(classifier.__class__.__name__, '-------', '\n', results_wrong)
LogisticRegression -------
precision recall f1-score support
0 1.00 0.97 0.98 56863
1 0.05 0.91 0.09 98
accuracy 0.97 56961
macro avg 0.52 0.94 0.54 56961
weighted avg 1.00 0.97 0.98 56961
KNeighborsClassifier -------
precision recall f1-score support
0 1.00 0.97 0.99 56863
1 0.05 0.92 0.10 98
accuracy 0.97 56961
macro avg 0.53 0.95 0.54 56961
weighted avg 1.00 0.97 0.98 56961
SVC -------
precision recall f1-score support
0 1.00 0.98 0.99 56863
1 0.09 0.88 0.16 98
accuracy 0.98 56961
macro avg 0.54 0.93 0.57 56961
weighted avg 1.00 0.98 0.99 56961
DecisionTreeClassifier -------
precision recall f1-score support
0 1.00 0.85 0.92 56863
1 0.01 0.98 0.02 98
accuracy 0.85 56961
macro avg 0.51 0.91 0.47 56961
weighted avg 1.00 0.85 0.92 56961
RandomForestClassifier -------
precision recall f1-score support
0 1.00 0.98 0.99 56863
1 0.08 0.98 0.15 98
accuracy 0.98 56961
macro avg 0.54 0.98 0.57 56961
weighted avg 1.00 0.98 0.99 56961
GradientBoostingClassifier -------
precision recall f1-score support
0 1.00 0.96 0.98 56863
1 0.04 0.98 0.08 98
accuracy 0.96 56961
macro avg 0.52 0.97 0.53 56961
weighted avg 1.00 0.96 0.98 56961
LGBMClassifier -------
precision recall f1-score support
0 1.00 0.97 0.98 56863
1 0.05 0.98 0.09 98
accuracy 0.97 56961
macro avg 0.52 0.97 0.54 56961
weighted avg 1.00 0.97 0.98 56961
# Check each model's confusion matrix (wrong example)
from sklearn.metrics import confusion_matrix
for key, classifier in classifiers.items():
    y_pred = classifier.predict(original_Xtest)          # predict on the original, imbalanced test set
    cm_wrong = confusion_matrix(original_ytest, y_pred)
    print(classifier.__class__.__name__, '\n', cm_wrong, '\n')
LogisticRegression
[[55130 1733]
[ 9 89]]
KNeighborsClassifier
[[55300 1563]
[ 8 90]]
SVC
[[55941 922]
[ 12 86]]
DecisionTreeClassifier
[[48226 8637]
[ 2 96]]
RandomForestClassifier
[[55739 1124]
[ 2 96]]
GradientBoostingClassifier
[[54794 2069]
[ 2 96]]
LGBMClassifier
[[55005 1858]
[ 2 96]]
Step 4. Oversampling
pip install scikit-learn==0.23.1
pip install imbalanced-learn==0.7.0
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_resampled, y_resampled = sm.fit_resample(original_Xtrain, list(original_ytrain))  # fit_sample was renamed to fit_resample
# https://github.com/scikit-learn-contrib/imbalanced-learn/issues/528
print('Before SMOTE, original X_train: {}'.format(original_Xtrain.shape))
print('Before SMOTE, original y_train: {}'.format(np.array(original_ytrain).shape))
print('After SMOTE, resampled original X_train: {}'.format(X_resampled.shape))
print('After SMOTE, resampled original y_train: {} \n'.format(np.array(y_resampled).shape))
print("Before SMOTE, fraud counts: {}".format(sum(np.array(original_ytrain)==1)))
print("Before SMOTE, non-fraud counts: {}".format(sum(np.array(original_ytrain)==0)))
print("After SMOTE, fraud counts: {}".format(sum(np.array(y_resampled)==1)))
print("After SMOTE, non-fraud counts: {}".format(sum(np.array(y_resampled)==0)))
Classification Model 1
from sklearn.metrics import accuracy_score, recall_score
# f1_score, roc_auc_score, precision_score
# How to set the class_weight parameter of a Logistic Regression model
w = {0: 1, 1: 99}  ## weights for the imbalanced classes (non-fraud: 1, fraud: 99)
# Fit the model
logreg_weighted = LogisticRegression(random_state=0, class_weight=w)
logreg_weighted.fit(original_Xtrain, original_ytrain)
# Get predictions
y_pred = logreg_weighted.predict(original_Xtest)
# Check the prediction results
print('Logistic Regression ------ Weighted')
print(f'Accuracy: {accuracy_score(original_ytest,y_pred)}')
print('\n')
print(f'Confusion Matrix: \n{confusion_matrix(original_ytest, y_pred)}')
print('\n')
print(f'Recall: {recall_score(original_ytest,y_pred)}')
# Check the prediction results with the imblearn package
from imblearn.metrics import classification_report_imbalanced
label = ['non-fraud', 'fraud']
print(classification_report_imbalanced(original_ytest, y_pred, target_names=label))
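As an alternative to hand-tuning w, scikit-learn can derive the weights from the class frequencies; a minimal sketch using the same training data (logreg_balanced is an illustrative name):
# class_weight='balanced' uses n_samples / (n_classes * np.bincount(y)) as the per-class weights
logreg_balanced = LogisticRegression(random_state=0, class_weight='balanced', max_iter=1000)
logreg_balanced.fit(original_Xtrain, original_ytrain)
y_pred_bal = logreg_balanced.predict(original_Xtest)
print('Recall (balanced weights): {:.4f}'.format(recall_score(original_ytest, y_pred_bal)))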
Classification Model 2
# Fit the model on the resampled data
logreg_resampled = LogisticRegression(random_state=0)
logreg_resampled.fit(X_resampled, y_resampled)
# Get predictions
y_pred = logreg_resampled.predict(original_Xtest)
print('Logistic Regression ------ Resampled Data')
print(f'Accuracy: {accuracy_score(original_ytest,y_pred)}')
print('\n')
print(f'Confusion Matrix: \n{confusion_matrix(original_ytest, y_pred)}')
print('\n')
print(f'Recall: {recall_score(original_ytest,y_pred)}')
from imblearn.metrics import classification_report_imbalanced
label = ['non-fraud', 'fraud']
print(classification_report_imbalanced(original_ytest, y_pred, target_names=label))
Step 5. Summary
- 1) Understanding imbalanced classification problems: fraud detection data
- 2) Understanding feature transformation algorithms: PCA, t-SNE, SVD
- 3) Ways to handle overfitting: tuning model parameters, rebuilding the sample
- 4) Understanding resampling algorithms: Random Undersampling, Random Oversampling, SMOTE Oversampling, etc.
- 5) Correctly interpreting classification results on imbalanced data: using classification_report_imbalanced