Loading the data

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

path = '/content/drive/MyDrive/heart/'

train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')
sample_submission = pd.read_csv(path + 'sample_submission.csv')
train.head()

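Basic variable descriptions

sex: sex (0 = female, 1 = male), cp: chest pain (0-3; larger means more severe pain), trestbps: resting blood pressure

chol: serum cholesterol, fbs: fasting blood sugar (1 if 120 or above), restecg: resting ECG result (0 = left ventricular hypertrophy, 1 = normal, 2 = ST-T wave abnormality)

thalach: maximum heart rate, exang: exercise-induced angina (0 = normal, 1 = abnormal), oldpeak: ST depression induced by exercise relative to rest

slope: slope of the peak exercise ST segment (0 = downsloping, 1 = flat, 2 = upsloping), ca: number of major vessels (0-3; 4 is a NULL value), thal: thalassemia status (0 = NULL, 1 = normal, 2-3 = defect)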

train['target'].value_counts()

1    83
0    68
Name: target, dtype: int64

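The response variable is target, which indicates whether heart disease was diagnosed: 1 means disease, 0 means normal.

The training data has 83 disease cases and 68 normal cases, so disease cases are slightly more common.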
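The class balance is easier to read as proportions than as raw counts. The sketch below rebuilds a stand-in `target` series from the counts printed above (the series itself is an assumption, not the real file) and normalizes it.

```python
import pandas as pd

# Hypothetical stand-in for train['target']; the counts mirror the output above.
target = pd.Series([1] * 83 + [0] * 68, name="target")

# normalize=True returns the share of each class instead of raw counts.
class_share = target.value_counts(normalize=True)
print(class_share.round(3))
```

On the real data the same call would be `train['target'].value_counts(normalize=True)`.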

Examining categorical variables split by heart disease status

import matplotlib.pyplot as plt
import seaborn as sns

train_0 = train[train['target']==0]
train_1 = train[train['target']==1]

def cat_plot(column):
  # Side-by-side count plots of one categorical column, split by target.
  f, ax = plt.subplots(1, 2, figsize=(16, 6))
  # Use the same category order in both panels so the bars can be compared.
  order = sorted(train[column].unique())

  sns.countplot(x = column,
                data = train_0,
                ax = ax[0],
                order = order)
  ax[0].tick_params(labelsize=12, labelrotation=50)
  ax[0].set_title('target = 0')
  ax[0].set_ylabel('count')

  sns.countplot(x = column,
                data = train_1,
                ax = ax[1],
                order = order)
  ax[1].tick_params(labelsize=12, labelrotation=50)
  ax[1].set_title('target = 1')
  ax[1].set_ylabel('count')

  plt.subplots_adjust(wspace=0.3, hspace=0.3)
  plt.show()

cat_plot("sex")

test['sex'].value_counts()

1    104
0     48
Name: sex, dtype: int64

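Overall, most records have sex = 1, which as noted above means male. The test data shows a similar pattern.

The left plot shows counts by sex among records without heart disease; the right plot shows counts by sex among records with heart disease.

The plots suggest that, within this data, women are more likely than men to have heart disease.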
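Reading a rate off two separate count plots is error-prone, so it can help to compute the disease rate per group directly. The sketch below uses made-up rows (not the competition data) to show the idea: the mean of a 0/1 target within each group is the share of positive cases.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Mean of a binary target per group = proportion of target == 1 in that group.
rate_by_sex = toy.groupby("sex")["target"].mean()
print(rate_by_sex)
```

On the real data this would be `train.groupby('sex')['target'].mean()`, and the same pattern applies to any of the categorical columns examined here.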

cat_plot("cp")

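Next, the records with and without heart disease are compared by the chest pain variable cp.

A value of 0 means no chest pain, and most records with cp = 0 do not have heart disease.

For the remaining values (cp = 1 to 3), heart disease is more likely than not.

One notable point: since cp = 3 indicates the most severe chest pain, one might expect it to carry the highest probability of heart disease, but it does not.

Instead, cp = 2 shows a higher probability of heart disease.

So this variable should not be treated as ordinal.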
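One common way to keep a model from reading cp as an ordered scale is one-hot encoding. The sketch below uses made-up values and `pd.get_dummies`; it illustrates the idea and is not a step taken in this notebook's preprocessing.

```python
import pandas as pd

# Toy cp values, made up for illustration only.
toy = pd.DataFrame({"cp": [0, 2, 1, 3, 0, 2]})

# Each cp level becomes its own indicator column, so no ordering is implied.
cp_dummies = pd.get_dummies(toy["cp"], prefix="cp")
encoded = pd.concat([toy.drop(columns="cp"), cp_dummies], axis=1)
print(encoded.columns.tolist())
```

Tree-based models can often cope with integer-coded categories, so whether this encoding helps is something to check empirically.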

cat_plot("fbs")

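The remaining categorical variables are analyzed with the same pattern.

Judging from the plots, fbs does not appear to be a very informative variable for determining heart disease.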
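A visual impression like this can be backed by a simple association measure. The sketch below computes the Pearson chi-square statistic (no continuity correction) for a contingency table by hand with NumPy; the helper name `chi2_statistic` and the toy rows are assumptions for illustration. A value near 0 means the column and the target look independent.

```python
import numpy as np
import pandas as pd

def chi2_statistic(table: pd.DataFrame) -> float:
    """Pearson chi-square statistic for a contingency table of counts."""
    observed = table.to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: (row total * column total) / grand total.
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Toy data, made up so that fbs and target are exactly independent.
toy = pd.DataFrame({
    "fbs":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})
table = pd.crosstab(toy["fbs"], toy["target"])
print(chi2_statistic(table))
```

On the real data the table would come from `pd.crosstab(train['fbs'], train['target'])`; with small cell counts the statistic should be interpreted cautiously.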

cat_plot("restecg")
test['restecg'].value_counts()

1    77
0    72
2     3
Name: restecg, dtype: int64

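restecg is the resting ECG variable. Although 1 is the normal value, it is associated with a higher probability of heart disease than 0.

This variable is awkward to handle. The value 2 has very few samples, none of which have heart disease, but the sample is too small to draw conclusions from.

I therefore judged this variable to be uninformative and will drop it.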

cat_plot("exang")

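exang indicates exercise-induced angina. As I understand it, 0 is normal and 1 is abnormal, but the result is odd.

Records with exang = 0 are more likely to have heart disease. I do not fully understand why, but the difference is clear, so I will keep this variable.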

cat_plot("slope")

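slope is the slope of the peak exercise ST segment. The value 0 is rare in absolute terms, and its distribution is similar with or without heart disease.

The difference shows up in 1 and 2: with slope = 1, the absence of heart disease is more likely; with slope = 2, its presence is more likely.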

cat_plot("ca")
test['ca'].value_counts()

0    80
1    34
2    23
3    10
4     5
Name: ca, dtype: int64

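ca is the number of major vessels identified. The value 0 is by far the most common; 2 and 3 are rare, but most of those records do not have heart disease.

Judging from the plots, the values 2 and 3 are good indicators of no heart disease.

About 70% of records with ca = 0 have heart disease, while most records with ca = 1 do not.

One caveat: the value 4, which denotes NULL, appears only in the test data, so its handling needs some thought.

Since 2 and 3 both have a very high probability of no heart disease, I will merge these two categories.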
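A minimal sketch of one way to merge the two categories, using made-up values. The column name `ca_merged` is hypothetical; mapping 3 onto 2 is just one possible choice of merged label.

```python
import pandas as pd

# Toy ca values, made up for illustration only.
toy = pd.DataFrame({"ca": [0, 1, 2, 3, 0, 3]})

# Map category 3 onto category 2 so both share a single level.
toy["ca_merged"] = toy["ca"].replace({3: 2})
print(toy["ca_merged"].tolist())
```

The same mapping would have to be applied to both train and test so the two stay consistent.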

cat_plot("thal")
test['thal'].value_counts()

2    82
3    59
1    10
0     1
Name: thal, dtype: int64

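thal indicates thalassemia status. First, the value 2 makes up a fairly large share of the data.

Most records with thal = 2 have heart disease, and most with thal = 3 do not.

The value 1 means normal, but it is hard to infer heart disease status from it.

The value 0 is NULL; since that amounts to withholding judgment on this variable, I will merge it with 1.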

Examining continuous variables split by heart disease status

def num_plot(column):
  # Side-by-side histograms (with KDE) of one continuous column, split by target.
  fig, axes = plt.subplots(1, 2, figsize=(16, 6))

  # sns.distplot is deprecated; histplot with kde=True draws a comparable plot.
  sns.histplot(train_0[column],
               kde = True,
               ax = axes[0])
  axes[0].tick_params(labelsize=12)
  axes[0].set_title('target = 0')
  axes[0].set_ylabel('count')

  sns.histplot(train_1[column],
               kde = True,
               ax = axes[1])
  axes[1].tick_params(labelsize=12)
  axes[1].set_title('target = 1')
  axes[1].set_ylabel('count')

  plt.subplots_adjust(wspace=0.3, hspace=0.3)
  plt.show()

num_plot("trestbps")
[(train_0['trestbps']).mean(), (train_1['trestbps']).mean()]

[134.4558823529412, 130.04819277108433]

trestbps is the resting blood pressure variable. There does not seem to be a meaningful difference between the two groups.

num_plot("chol")
[(train_0['chol']).mean(), (train_1['chol']).mean()]

[242.23529411764707, 246.40963855421685]

chol is the cholesterol variable. The two distributions do not differ meaningfully.

num_plot("thalach")
[(train_0['thalach']).mean(), (train_1['thalach']).mean()]

[141.19117647058823, 158.36144578313252]

thalach is the maximum heart rate variable. Higher thalach values clearly appear to go with a higher probability of heart disease.

num_plot("oldpeak")
[(train_0['oldpeak']).mean(), (train_1['oldpeak']).mean()]

[1.4808823529411763, 0.563855421686747]

oldpeak is the exercise-induced ST depression variable. Larger values of this variable go with a higher probability of no heart disease.
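The per-group means computed one column at a time above can also be produced in a single table. The sketch below uses made-up rows (not the competition data) and adds the gap between the two groups, which makes the direction of each effect explicit.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "thalach": [140, 150, 130, 160, 170, 165],
    "oldpeak": [1.5, 2.0, 1.0, 0.5, 0.0, 1.0],
    "target":  [0, 0, 0, 1, 1, 1],
})

# One row per target class, one column per continuous variable.
group_means = toy.groupby("target")[["thalach", "oldpeak"]].mean()

# Positive gap: higher on average when target == 1; negative: lower.
mean_gap = group_means.loc[1] - group_means.loc[0]
print(group_means)
print(mean_gap)
```

Raw gaps depend on each variable's scale, so they are best compared within a variable rather than across variables.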

Continuous-variable EDA from the DACON baseline

fig, axes = plt.subplots(5, 3, figsize=(25, 20))

fig.suptitle('feature distributions per target', fontsize= 40)
for ax, col in zip(axes.flat, train.columns[1:-1]):
    sns.violinplot(x= 'target', y= col, ax=ax, data=train)
    ax.set_title(col, fontsize=20)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

I referred to the following code.

https://dacon.io/competitions/official/235848/codeshare/3832?page=1&dtype=recent

It is a nice way to look over all the variables at a glance.

Data preprocessing

# Merge the NULL code 0 into 1; .loc avoids pandas chained assignment.
train.loc[train['thal'] == 0, 'thal'] = 1
test.loc[test['thal'] == 0, 'thal'] = 1

train_label = train['target']
train.drop(['trestbps','chol', 'fbs', 'restecg', 'target'], axis = 1, inplace= True)
test.drop(['trestbps','chol', 'fbs', 'restecg'], axis = 1, inplace= True)

Based on the EDA above, trestbps, chol, fbs, and restecg are excluded from the model.

test2 = (test[test['ca'] == 4]).drop(['ca'], axis = 1)
test2id = test2['id']

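Also, ca = 4 is a NULL value that does not appear in the training data.

Test rows with this value are therefore set aside so they can be predicted by a separate model trained without the ca variable.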
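The routing described above can be sketched on made-up data as follows. The feature values, sizes, and the names `full_model`, `no_ca_model`, and `unseen` are assumptions for illustration; only the idea of sending ca = 4 rows to a model trained without ca comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data, made up for illustration: ca is 0-3 in train, but 4 appears in test.
X_train = pd.DataFrame({
    "thalach": rng.integers(100, 200, size=40),
    "ca": rng.integers(0, 4, size=40),
})
y_train = (X_train["thalach"] > 150).astype(int)
X_test = pd.DataFrame({"thalach": [120, 180, 110, 190], "ca": [0, 4, 4, 1]})

# One model sees every feature; the other is trained without ca.
full_model = RandomForestClassifier(n_estimators=50, random_state=0)
full_model.fit(X_train, y_train)
no_ca_model = RandomForestClassifier(n_estimators=50, random_state=0)
no_ca_model.fit(X_train.drop(columns="ca"), y_train)

# Start from the full model, then overwrite rows whose ca value was never seen.
pred = pd.Series(full_model.predict(X_test), index=X_test.index)
unseen = X_test["ca"] == 4
pred[unseen] = no_ca_model.predict(X_test.loc[unseen].drop(columns="ca"))
print(pred.tolist())
```

A boolean mask keeps predictions aligned with the test rows by index, so no separate id lookup is needed in this toy setting.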

Fitting simple models

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state = 0, n_estimators = 100)
rf.fit(train,train_label)
sample_submission['target'] = rf.predict(test)

# Rows with ca == 4 use predictions from the model trained without ca.
rf2 = RandomForestClassifier(random_state = 0, n_estimators = 100)
rf2.fit(train.drop(['ca'], axis = 1),train_label)
pred2 = rf2.predict(test2)

# Overwrite the predictions for the ca == 4 rows, matching on id.
for i, p in zip(test2id, pred2):
    sample_submission.loc[sample_submission['id'] == i, 'target'] = p

sample_submission.to_csv('heart_final_3.csv',index=False)

This builds the random forest model.

from xgboost import XGBClassifier

xgb = XGBClassifier()

xgb.fit(train,train_label)
sample_submission['target'] = xgb.predict(test)

# Rows with ca == 4 use predictions from the model trained without ca.
xgb2 = XGBClassifier()
xgb2.fit(train.drop(['ca'], axis = 1),train_label)
pred2 = xgb2.predict(test2)

# Overwrite the predictions for the ca == 4 rows, matching on id.
for i, p in zip(test2id, pred2):
    sample_submission.loc[sample_submission['id'] == i, 'target'] = p

sample_submission.to_csv('heart_final_4.csv',index=False)

[14:00:08] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[14:00:08] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

This builds the XGBoost model.

	id	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	1	53	1	2	130	197	1	0	152	0	1.2	0	0	2	1
1	2	52	1	3	152	298	1	1	178	0	1.2	1	0	3	1
2	3	54	1	1	192	283	0	0	195	0	0.0	2	1	3	0
3	4	45	0	0	138	236	0	0	152	1	0.2	1	0	2	1
4	5	35	1	1	122	192	0	1	174	0	0.0	2	0	2	1
