Data Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('galaxy.csv')
data.head()
data.info()
data.describe()
sns.histplot(data['startprice'], kde=True)
sns.histplot(data['charCountDescription'], kde=True)
plt.figure(figsize=(20, 10))
sns.boxplot(x='productline', y='startprice', data = data)
Handling Missing Values
data.isna().sum() / len(data)
data.head()
# Replace missing values
data = data.fillna('Unknown')
data['carrier'].value_counts()
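`data.fillna('Unknown')` fills every column, so a numeric column with gaps would turn into a mixed text column. If that is a concern, a minimal sketch of filling only the non-numeric columns might look like this. The toy frame and its values are hypothetical and are not taken from galaxy.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({
    'carrier': ['AT&T', np.nan, 'Verizon'],
    'startprice': [199.99, np.nan, 150.0],
})

# Fill only the non-numeric columns; numeric columns keep their dtype and their NaN
text_cols = toy.select_dtypes(exclude='number').columns
toy[text_cols] = toy[text_cols].fillna('Unknown')
print(toy)
```

Whether the remaining numeric gaps should be dropped or imputed depends on the data, so this sketch leaves them untouched.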
Handling Categorical Variables
data[['carrier','color','productline','noDescription']].nunique()
data['carrier'].value_counts()
data['color'].value_counts()
data['productline'].value_counts()
data['noDescription'].value_counts()
# Merge the black variants into a single 'Black' category
def black(x):
    if x in ('Midnight Black', 'Aura Black', 'Prism Black'):
        return 'Black'
    else:
        return x
data['color'] = data['color'].apply(black)
data['color'].value_counts()
data = pd.get_dummies(data, columns = ['carrier', 'color', 'productline', 'noDescription'])
data
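To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame (the values are hypothetical, not from galaxy.csv). Each column listed in `columns` is replaced by one indicator column per category, named `<column>_<value>`, while the other columns are left as they are.

```python
import pandas as pd

# Hypothetical toy rows, only for illustration
toy = pd.DataFrame({
    'carrier': ['AT&T', 'Unknown', 'AT&T'],
    'startprice': [199.99, 235.0, 150.0],
})

# 'carrier' is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=['carrier'])
print(encoded.columns.tolist())
```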
Modeling
Building the Decision Tree Model
from sklearn.model_selection import train_test_split
X = data.drop('sold', axis = 1)
y = data['sold']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=100)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth = 10)
model.fit(X_train, y_train)
Prediction
pred = model.predict(X_test)
y_test
Evaluation
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, pred)
# 0.7811447811447811
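Accuracy alone does not show which kind of mistake the model makes. As a rough sketch, the four cells of a binary confusion matrix can be unpacked and turned into precision and recall. The labels below are hypothetical and assume the target is coded as 1 (sold) and 0 (not sold).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, only for illustration (1 = sold, 0 = not sold)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)  # share of predicted "sold" that really sold
recall = tp / (tp + fn)     # share of actual "sold" that the model found
print(tn, fp, fn, tp, precision, recall)
```

`precision_score` and `recall_score` from `sklearn.metrics` return the same two values directly.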
Finding the Optimal Max Depth (Parameter Tuning)
for i in range(2, 31):
    model = DecisionTreeClassifier(max_depth = i)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(i, round(accuracy_score(y_test, pred), 4)) # depth 3 scores highest
# Refit the model with the best depth
model = DecisionTreeClassifier(max_depth = 3)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred) # 0.8316498316498316
confusion_matrix(y_test, pred)
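Choosing the depth by its test-set accuracy means the reported score is tuned to that particular split. One hedged alternative, which is not what this post does, is to pick the depth with cross-validation on the training data and score the test set only once at the end. The sketch below uses `GridSearchCV` on synthetic data from `make_classification` as a stand-in for the galaxy data; the sample size, feature count, and fold count are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the galaxy data, only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=100)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=100),
    param_grid={'max_depth': list(range(2, 31))},
    cv=5,                # 5-fold cross-validation on the training set
    scoring='accuracy',
)
search.fit(X_tr, y_tr)   # the depth is chosen from the training folds only

best_depth = search.best_params_['max_depth']
test_accuracy = search.score(X_te, y_te)  # the test set is used once, at the end
print(best_depth, round(test_accuracy, 4))
```

With the real data, `X_train`, `y_train`, `X_test`, and `y_test` from the split above would take the place of the synthetic arrays.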
Creating the Tree Plot
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(model, feature_names=list(X_train.columns), fontsize=15, label='none', max_depth=2)