Loading the data

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 100) 

import warnings
warnings.filterwarnings("ignore")



from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder
import random

train = pd.read_csv("/content/drive/MyDrive/carddata/train.csv")
test = pd.read_csv('/content/drive/MyDrive/carddata/test.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/carddata/sample_submission.csv')
train.head()

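Variable descriptions

gender : sex (F/M), car : owns a car (Y/N), reality : owns real estate (Y/N), child_num : number of children

income_total : annual income, income_type : income category (5 levels), edu_type : education level (5 levels)

family_type : marital status (5 levels), house_type : type of housing (6 levels), DAYS_BIRTH : date of birth (days counted backward from the collection date, so negative)

DAYS_EMPLOYED : employment start date (days counted backward from the collection date, so negative; people who are not working get the value 365243), FLAG_MOBIL : owns a mobile phone

work_phone : has a work phone, phone : has a home phone, email : has an email address

occyp_type : occupation type, family_size : family size, begin_month : month the credit card was issued (months counted backward from the collection date, so negative)

Response variable => credit : credit rating based on how delinquent the user is on credit card payments. A lower value means better credit.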
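Because DAYS_BIRTH and DAYS_EMPLOYED are stored as negative day counts and 365243 is a sentinel rather than a real duration, it can help to convert them before reading the plots. This is only a minimal sketch on toy values: the derived column names (age_years, employed_years, not_working) are my own and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train; the values are illustrative only.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-13899, -11380, -19087],
    "DAYS_EMPLOYED": [-4709, 365243, -4434],
})

# Hypothetical derived columns (not part of the original notebook).
toy["age_years"] = -toy["DAYS_BIRTH"] / 365.25

# 365243 marks people who are not working, so treat it as missing, not as a duration.
employed_days = toy["DAYS_EMPLOYED"].where(toy["DAYS_EMPLOYED"] != 365243, np.nan)
toy["employed_years"] = -employed_days / 365.25
toy["not_working"] = toy["DAYS_EMPLOYED"].eq(365243)

print(toy)
```

Keeping a separate not_working flag preserves the information carried by the sentinel once the value itself is replaced.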

train.describe()

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          26457 non-null  int64  
 1   gender         26457 non-null  object 
 2   car            26457 non-null  object 
 3   reality        26457 non-null  object 
 4   child_num      26457 non-null  int64  
 5   income_total   26457 non-null  float64
 6   income_type    26457 non-null  object 
 7   edu_type       26457 non-null  object 
 8   family_type    26457 non-null  object 
 9   house_type     26457 non-null  object 
 10  DAYS_BIRTH     26457 non-null  int64  
 11  DAYS_EMPLOYED  26457 non-null  int64  
 12  FLAG_MOBIL     26457 non-null  int64  
 13  work_phone     26457 non-null  int64  
 14  phone          26457 non-null  int64  
 15  email          26457 non-null  int64  
 16  occyp_type     18286 non-null  object 
 17  family_size    26457 non-null  float64
 18  begin_month    26457 non-null  float64
 19  credit         26457 non-null  float64
dtypes: float64(4), int64(8), object(8)
memory usage: 4.0+ MB

occyp_type (occupation type) is the only variable with null values.

I will fill them with the string 'NAN'.

train.fillna('NAN', inplace=True)
test.fillna('NAN', inplace=True)
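The null counts reported by train.info() can also be checked directly. Below is a small sketch on a toy frame with made-up values, assuming only pandas; it fills just the one column that needs it.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only occyp_type contains missing entries.
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "F"],
    "occyp_type": ["Laborers", np.nan, "Managers", np.nan],
})

# Missing values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Fill only the column that needs it, so other columns are left untouched.
toy["occyp_type"] = toy["occyp_type"].fillna("NAN")
print(toy["occyp_type"].value_counts())
```

Filling the whole frame, as the cell above does, gives the same result here because no other column has nulls.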

plt.subplots(figsize = (8,8))
plt.pie(train['credit'].value_counts(), labels = train['credit'].value_counts().index, 
        autopct="%.2f%%", shadow = True, startangle = 90)
plt.title('credit ratio', size=20)
plt.show()

I used matplotlib's pie chart to check the class proportions of the response variable.

Class 2, the worst credit rating, accounts for a large share.
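The same proportions can be read as numbers instead of from the pie chart. A minimal sketch, using made-up labels in place of train['credit']:

```python
import pandas as pd

# Made-up labels standing in for train['credit'].
credit = pd.Series([2.0, 2.0, 1.0, 2.0, 0.0, 1.0, 2.0, 2.0])

# Share of each class, sorted by class label.
ratio = credit.value_counts(normalize=True).sort_index()
print(ratio)
```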

Looking at categorical variables split by credit rating

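# .copy() so that columns can be added to these subsets later without touching train
train_0 = train[train['credit']==0.0].copy()
train_1 = train[train['credit']==1.0].copy()
train_2 = train[train['credit']==2.0].copy()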

def cat_plot(column):
  # Draw one count plot per credit class, side by side.
  f, ax = plt.subplots(1, 3, figsize=(16, 6))

  for i, subset in enumerate([train_0, train_1, train_2]):
    sns.countplot(x = column,
                  data = subset,
                  ax = ax[i],
                  order = subset[column].value_counts().index)
    ax[i].tick_params(labelsize=12)
    ax[i].set_title(f'credit = {i}')
    ax[i].set_ylabel('count')
    ax[i].tick_params(rotation=50)

  plt.subplots_adjust(wspace=0.3, hspace=0.3)
  plt.show()

cat_plot("gender")

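I split the train data by credit rating and wrote a function that plots each explanatory variable for every class.

Starting with gender: in absolute terms there are simply more women.

The female-to-male ratio looks about the same in all three plots, so it is hard to see any difference in credit rating by gender.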
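Since the three classes have different sizes, raw counts are hard to compare by eye. One way to compare proportions directly is a normalized cross-tabulation. The sketch below uses made-up rows in place of train[['gender', 'credit']].

```python
import pandas as pd

# Made-up rows standing in for train[['gender', 'credit']].
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "F", "M", "F", "F", "M"],
    "credit": [0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0],
})

# normalize='index' turns each row (one credit class) into proportions that sum to 1.
share = pd.crosstab(toy["credit"], toy["gender"], normalize="index")
print(share)
```

The same call works for any of the categorical columns plotted in this section.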

cat_plot('car')

First, people without a car are the majority in every credit class.

Judging from the plots, though, car ownership does not seem to be strongly related to credit rating.

cat_plot('reality')

In every credit class, most people own real estate.

There does not seem to be much difference across credit classes.

cat_plot('income_type')

Income type does not differ noticeably across credit classes either.

One thing that stands out is that there are no students in credit class 0.

cat_plot('edu_type')

Education level does not seem to differ across credit classes either.

cat_plot('family_type')

Credit rating does not seem to differ by family type either.

What stands out is that married people are the majority overall.

cat_plot('house_type')

house_type also looks like an uninformative variable, all the more so because most people fall in the House / apartment category.

cat_plot('FLAG_MOBIL')

Everyone in this data set has a mobile phone, so FLAG_MOBIL is 1 for every row and carries no information.

cat_plot('work_phone')

The share of people with a work phone does not differ across credit classes. What stands out is how few people have one.

cat_plot('email')

The email variable does not look informative either.

f, ax = plt.subplots(1, 3, figsize=(16, 6))
sns.countplot(y = 'occyp_type', data = train_0, order = train_0['occyp_type'].value_counts().index, ax=ax[0])
sns.countplot(y = 'occyp_type', data = train_1, order = train_1['occyp_type'].value_counts().index, ax=ax[1])
sns.countplot(y = 'occyp_type', data = train_2, order = train_2['occyp_type'].value_counts().index, ax=ax[2])
plt.subplots_adjust(wspace=0.5, hspace=0.3)
plt.show()

I compared occupation type across credit classes.

The overall pattern is similar, but there seem to be some small differences.

Looking at continuous variables split by credit rating

def num_plot(column):
  # One distribution plot (histogram + KDE) per credit class, side by side.
  # Note: sns.distplot is deprecated since seaborn 0.11.
  fig, axes = plt.subplots(1, 3, figsize=(16, 6))

  for i, subset in enumerate([train_0, train_1, train_2]):
    sns.distplot(subset[column],
                 ax = axes[i])
    axes[i].tick_params(labelsize=12)
    axes[i].set_title(f'credit = {i}')
    axes[i].set_ylabel('density')  # with a KDE the histogram is normalized, so this is not a raw count

  plt.subplots_adjust(wspace=0.3, hspace=0.3)

num_plot("child_num")

This is the number of children. It does not seem to differ much across credit classes.

However, credit class 2 contains a few observations with a very large number of children.
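To see which rows sit in that long tail, the extreme observations can be pulled out with a simple filter. This is a rough sketch on made-up rows; the cutoff of 5 is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Made-up rows; 19 mirrors the maximum child_num reported by train.describe().
toy = pd.DataFrame({
    "child_num": [0, 1, 0, 2, 19, 0, 3, 14],
    "credit": [2.0, 1.0, 0.0, 2.0, 2.0, 1.0, 0.0, 2.0],
})

# Hypothetical cutoff, chosen only for illustration.
threshold = 5
extreme = toy[toy["child_num"] > threshold]
print(extreme)

# How many extreme rows fall in each credit class.
print(extreme["credit"].value_counts())
```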

num_plot("family_size")

Family size seems to show the same pattern as the number of children.

num_plot("income_total")

Annual income does not seem to differ much by credit rating.

# Overlay the income density of each credit class; the labels are the class values.
sns.distplot(train_0['income_total'],label='0.0', hist=False)
sns.distplot(train_1['income_total'],label='1.0', hist=False)
sns.distplot(train_2['income_total'],label='2.0', hist=False)
plt.legend()

To check more precisely, I overlaid the three curves. There are slight differences, but they are very similar.
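A per-class summary table is another way to back up what the overlaid curves suggest. The sketch below uses made-up incomes in place of train[['income_total', 'credit']].

```python
import pandas as pd

# Made-up incomes standing in for train[['income_total', 'credit']].
toy = pd.DataFrame({
    "income_total": [202500.0, 157500.0, 247500.0, 112500.0, 450000.0, 135000.0],
    "credit": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
})

# Median and mean income within each credit class.
summary = toy.groupby("credit")["income_total"].agg(["median", "mean"])
print(summary)
```

The median is less sensitive than the mean to the few very high incomes in the right tail.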

num_plot("DAYS_BIRTH")

The smaller the absolute value, the younger the person. The distributions look similar overall.

train_0['Month'] = abs(train_0['begin_month'])
train_1['Month'] = abs(train_1['begin_month'])
train_2['Month'] = abs(train_2['begin_month'])
train_0 = train_0.astype({'Month': 'int'})
train_1 = train_1.astype({'Month': 'int'})
train_2 = train_2.astype({'Month': 'int'})
train_0['Month'].head()

num_plot("Month")

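I converted the card issuance month to a positive number for this analysis.

The overall trend looks similar, but among the most recently issued cards it looks like roughly 70% get credit class 1 and roughly 30% get class 0.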
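A rough reading like the 70/30 split can be checked by computing class shares per Month value. This is a sketch on made-up rows, not the notebook's actual numbers.

```python
import pandas as pd

# Made-up rows standing in for train[['begin_month', 'credit']].
toy = pd.DataFrame({
    "begin_month": [0.0, 0.0, 0.0, -1.0, -1.0, -30.0, -30.0, -30.0],
    "credit": [1.0, 1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 1.0],
})

# Months since issuance as a positive integer, like the Month column above.
toy["Month"] = toy["begin_month"].abs().astype(int)

# Share of each credit class within every Month value (rows sum to 1).
share_by_month = (
    toy.groupby("Month")["credit"]
       .value_counts(normalize=True)
       .unstack(fill_value=0.0)
)
print(share_by_month)
```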

Fitting a simple model

object_col = []
for col in train.columns:
    if train[col].dtype == 'object':
        object_col.append(col)

enc = OneHotEncoder()
enc.fit(train.loc[:,object_col])

train_onehot_df = pd.DataFrame(enc.transform(train.loc[:,object_col]).toarray(), 
             columns=enc.get_feature_names(object_col))
train.drop(object_col, axis=1, inplace=True)
train = pd.concat([train, train_onehot_df], axis=1)

test_onehot_df = pd.DataFrame(enc.transform(test.loc[:,object_col]).toarray(), 
             columns=enc.get_feature_names(object_col))
test.drop(object_col, axis=1, inplace=True)
test = pd.concat([test, test_onehot_df], axis=1)

All categorical variables are one-hot encoded.

sample_submission

In this competition, the model has to predict the probability of each of the classes 0, 1, and 2.
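The training log further down reports multi_logloss, which scores predicted class probabilities. As I understand it, this is the mean negative log of the probability assigned to the true class. Here is a small NumPy sketch of that formula; the function name and the eps clipping value are my own choices, and the labels and probabilities are made up.

```python
import numpy as np

def multi_logloss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    y_true = np.asarray(y_true, dtype=int)
    # Pick, for each row, the predicted probability of its true class.
    picked = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

# Made-up labels and predicted probabilities for classes 0, 1, 2.
y = [2, 1, 0]
p = [[0.1, 0.2, 0.7],
     [0.2, 0.6, 0.2],
     [0.5, 0.3, 0.2]]
print(multi_logloss(y, p))
```

A confident wrong prediction is penalized heavily, which is why well-calibrated probabilities matter more here than the predicted label alone.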

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds=[]
for train_idx, valid_idx in skf.split(train, train['credit']):
    folds.append((train_idx, valid_idx))
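StratifiedKFold is used here because the classes are imbalanced: it keeps the class ratio roughly the same in every fold. The sketch below illustrates this on made-up labels; the feature matrix is a dummy, since only the labels drive the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 10 of class 0, 20 of class 1, 70 of class 2.
y = np.array([0] * 10 + [1] * 20 + [2] * 70)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for train_idx, valid_idx in skf.split(X, y):
    # Count how many samples of each class land in the validation part.
    fold_counts.append(np.bincount(y[valid_idx], minlength=3))
    print(fold_counts[-1])
```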

random.seed(42)
lgb_models={}
for fold in range(5):
    print(f'===================================={fold+1}============================================')
    train_idx, valid_idx = folds[fold]
    # Split features and target once, then slice by the fold indices.
    X, y = train.drop(['credit'],axis=1), train['credit']
    X_train, X_valid = X.iloc[train_idx].values, X.iloc[valid_idx].values
    y_train, y_valid = y[train_idx].values, y[valid_idx].values
    lgb = LGBMClassifier(n_estimators=1000)
    lgb.fit(X_train, y_train, 
            eval_set=[(X_train, y_train), (X_valid, y_valid)], 
            early_stopping_rounds=30,
           verbose=100)
    lgb_models[fold]=lgb
    print(f'================================================================================\n\n')

====================================1============================================
Training until validation scores don't improve for 30 rounds.
[100]	training's multi_logloss: 0.676692	valid_1's multi_logloss: 0.766702
[200]	training's multi_logloss: 0.596634	valid_1's multi_logloss: 0.755074
[300]	training's multi_logloss: 0.53456	valid_1's multi_logloss: 0.751863
[400]	training's multi_logloss: 0.482683	valid_1's multi_logloss: 0.750901
Early stopping, best iteration is:
[385]	training's multi_logloss: 0.489523	valid_1's multi_logloss: 0.750597
================================================================================


====================================2============================================
Training until validation scores don't improve for 30 rounds.
[100]	training's multi_logloss: 0.673988	valid_1's multi_logloss: 0.778812
[200]	training's multi_logloss: 0.593911	valid_1's multi_logloss: 0.766056
[300]	training's multi_logloss: 0.532019	valid_1's multi_logloss: 0.762532
Early stopping, best iteration is:
[358]	training's multi_logloss: 0.500235	valid_1's multi_logloss: 0.761024
================================================================================


====================================3============================================
Training until validation scores don't improve for 30 rounds.
[100]	training's multi_logloss: 0.676709	valid_1's multi_logloss: 0.771762
[200]	training's multi_logloss: 0.593522	valid_1's multi_logloss: 0.758924
Early stopping, best iteration is:
[236]	training's multi_logloss: 0.57026	valid_1's multi_logloss: 0.758105
================================================================================


====================================4============================================
Training until validation scores don't improve for 30 rounds.
[100]	training's multi_logloss: 0.675515	valid_1's multi_logloss: 0.7694
[200]	training's multi_logloss: 0.597206	valid_1's multi_logloss: 0.758117
[300]	training's multi_logloss: 0.533343	valid_1's multi_logloss: 0.753141
Early stopping, best iteration is:
[308]	training's multi_logloss: 0.528916	valid_1's multi_logloss: 0.752857
================================================================================


====================================5============================================
Training until validation scores don't improve for 30 rounds.
[100]	training's multi_logloss: 0.676696	valid_1's multi_logloss: 0.767947
[200]	training's multi_logloss: 0.595696	valid_1's multi_logloss: 0.757343
[300]	training's multi_logloss: 0.531936	valid_1's multi_logloss: 0.753206
Early stopping, best iteration is:
[346]	training's multi_logloss: 0.50629	valid_1's multi_logloss: 0.752064
================================================================================
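The five best validation scores can be condensed into a single cross-validation estimate. A minimal sketch, with the numbers copied from the log above:

```python
import numpy as np

# Best validation multi_logloss of each fold, copied from the log above.
best_valid_logloss = np.array([0.750597, 0.761024, 0.758105, 0.752857, 0.752064])

cv_mean = best_valid_logloss.mean()
cv_std = best_valid_logloss.std()
print(f"CV multi_logloss: {cv_mean:.4f} +/- {cv_std:.4f}")
```

In every fold the training loss at the best iteration is well below the validation loss (for example 0.4895 against 0.7506 in fold 1), which hints that the default model overfits somewhat.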

sample_submission.iloc[:,1:]=0
for fold in range(5):
    sample_submission.iloc[:,1:] += lgb_models[fold].predict_proba(test)/5
sample_submission.to_csv('ssu6_submission.csv', index=False)
sample_submission.head()
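The loop above averages the class probabilities of the five fold models. The sketch below shows the same idea in plain NumPy, with made-up arrays standing in for lgb_models[fold].predict_proba(test).

```python
import numpy as np

# Made-up per-fold probabilities for 2 test rows and 3 classes
# (stand-ins for lgb_models[fold].predict_proba(test)).
fold_probas = [
    np.array([[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]),
    np.array([[0.0, 0.3, 0.7], [0.2, 0.4, 0.4]]),
    np.array([[0.2, 0.1, 0.7], [0.1, 0.5, 0.4]]),
]

# Averaging over folds; each row still sums to 1 because every input row does.
avg_proba = np.mean(fold_probas, axis=0)
print(avg_proba)
```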

Output of train.head():
	index	gender	car	reality	child_num	income_total	income_type	edu_type	family_type	house_type	DAYS_BIRTH	DAYS_EMPLOYED	FLAG_MOBIL	phone	email	occyp_type	family_size	begin_month	credit
0	0	F	N	N	0	202500.0	Commercial associate	Higher education	Married	Municipal apartment	-13899	-4709	1	0	0	NaN	2.0	-6.0	1.0
1	1	F	N	Y	1	247500.0	Commercial associate	Secondary / secondary special	Civil marriage	House / apartment	-11380	-1540	1	0	1	Laborers	3.0	-5.0	1.0
2	2	M	Y	Y	0	450000.0	Working	Higher education	Married	House / apartment	-19087	-4434	1	1	0	Managers	2.0	-22.0	2.0
3	3	F	N	Y	0	202500.0	Commercial associate	Secondary / secondary special	Married	House / apartment	-15088	-2092	1	1	0	Sales staff	2.0	-37.0	0.0
4	4	F	Y	Y	0	157500.0	State servant	Higher education	Married	House / apartment	-15037	-2105	1	0	0	Managers	2.0	-26.0	2.0

Output of train.describe():
	index	child_num	income_total	DAYS_BIRTH	DAYS_EMPLOYED	FLAG_MOBIL	work_phone	phone	email	family_size	begin_month	credit
count	26457.000000	26457.000000	2.645700e+04	26457.000000	26457.000000	26457.0	26457.000000	26457.000000	26457.000000	26457.000000	26457.000000	26457.000000
mean	13228.000000	0.428658	1.873065e+05	-15958.053899	59068.750728	1.0	0.224742	0.294251	0.091280	2.196848	-26.123294	1.519560
std	7637.622372	0.747326	1.018784e+05	4201.589022	137475.427503	0.0	0.417420	0.455714	0.288013	0.916717	16.559550	0.702283
min	0.000000	0.000000	2.700000e+04	-25152.000000	-15713.000000	1.0	0.000000	0.000000	0.000000	1.000000	-60.000000	0.000000
25%	6614.000000	0.000000	1.215000e+05	-19431.000000	-3153.000000	1.0	0.000000	0.000000	0.000000	2.000000	-39.000000	1.000000
50%	13228.000000	0.000000	1.575000e+05	-15547.000000	-1539.000000	1.0	0.000000	0.000000	0.000000	2.000000	-24.000000	2.000000
75%	19842.000000	1.000000	2.250000e+05	-12446.000000	-407.000000	1.0	0.000000	1.000000	0.000000	3.000000	-12.000000	2.000000
max	26456.000000	19.000000	1.575000e+06	-7705.000000	365243.000000	1.0	1.000000	1.000000	1.000000	20.000000	0.000000	2.000000

Output of sample_submission:
	index	0	1	2
0	26457	0	0	0
1	26458	0	0	0
2	26459	0	0	0
3	26460	0	0	0
4	26461	0	0	0
...	...	...	...	...
9995	36452	0	0	0
9996	36453	0	0	0
9997	36454	0	0	0
9998	36455	0	0	0
9999	36456	0	0	0

Output of sample_submission.head() after averaging the fold predictions:
	index	0	1	2
0	26457	0.018329	0.187203	0.794468
1	26458	0.061934	0.121026	0.817041
2	26459	0.027629	0.203945	0.768427
3	26460	0.067723	0.199497	0.732780
4	26461	0.079370	0.229451	0.691179