데이터 불러오기

pip install kaggle --upgrade

Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2021.5.30)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (5.0.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.62.2)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)

!pip install kaggle
from google.colab import files
files.upload()

Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2021.5.30)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.62.2)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (5.0.2)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)

Saving kaggle.json to kaggle (1).json

{'kaggle.json': b'{"username":"ksy1998","key":"0c820de52cea65ec11954012ef8b00d2"}'}

kaggle.json 파일 선택합니다.

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c bike-sharing-demand

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
Downloading train.csv to /content
  0% 0.00/633k [00:00<?, ?B/s]
100% 633k/633k [00:00<00:00, 42.8MB/s]
Downloading sampleSubmission.csv to /content
  0% 0.00/140k [00:00<?, ?B/s]
100% 140k/140k [00:00<00:00, 44.0MB/s]
Downloading test.csv to /content
  0% 0.00/316k [00:00<?, ?B/s]
100% 316k/316k [00:00<00:00, 41.9MB/s]

캐글에서 복사한 코드에 느낌표만 붙여줍니다.

import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge ,Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, PolynomialFeatures
from sklearn.metrics import mean_squared_log_error as msle
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
%matplotlib inline

train=pd.read_csv('./train.csv')
test=pd.read_csv('./test.csv')

데이터 둘러보기

train.head()

test.head()

변수가 3개 차이 나는데, casual + registered = count 변수 입니다.

테스트 데이터에서는 count를 맞추는것이 목적입니다.

train.drop(['casual','registered'],1,inplace=True)

글쓴이는 이런 이유로 쿨하게 두 변수를 날렸습니다.

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   count       10886 non-null  int64  
dtypes: float64(3), int64(6), object(1)
memory usage: 850.6+ KB

데이터 10886개, count를 제외한 변수 개수 9개, 결측값은 없습니다.

train.describe()

최솟값 또는 최댓값에 이상한 값은 없습니다.

또 값을 관찰해보면 season, holiday, workingday, weather 변수는 범주형 변수인 것을 알 수 있습니다.

그리고 datetime 변수는 형태가 특수하여 이 표에 표현되지 않습니다.

datetime 변수에 대해

train['datetime']

0        2011-01-01 00:00:00
1        2011-01-01 01:00:00
2        2011-01-01 02:00:00
3        2011-01-01 03:00:00
4        2011-01-01 04:00:00
                ...         
10881    2012-12-19 19:00:00
10882    2012-12-19 20:00:00
10883    2012-12-19 21:00:00
10884    2012-12-19 22:00:00
10885    2012-12-19 23:00:00
Name: datetime, Length: 10886, dtype: object

datetime 변수는 날짜 + 시간 변수 입니다.

데이터 타입은 현재 object 타입인데요. 변경해보겠습니다.

train['datetime'] = pd.to_datetime(train['datetime'])
test['datetime'] = pd.to_datetime(test['datetime'])
train['Month']=pd.DatetimeIndex(train['datetime']).month
test['Month']=pd.DatetimeIndex(test['datetime']).month
train['Year']=pd.DatetimeIndex(train['datetime']).year
test['Year']=pd.DatetimeIndex(test['datetime']).year
train['WeekDay']=pd.DatetimeIndex(train['datetime']).weekday
test['WeekDay']=pd.DatetimeIndex(test['datetime']).weekday
train['Hour']=pd.DatetimeIndex(train['datetime']).hour
test['Hour']=pd.DatetimeIndex(test['datetime']).hour
train.head(3)

판다스에 to_* 함수를 통해 데이터 타입을 바꿀 수 있습니다.

여기서는 to_datetime 함수로 'datetime' 데이터 형으로 바꿨습니다.

이 데이터 형에 특징은 DatetimeIndex함수를 이용하여 연도/달/일 등을 쉽게 추출할 수 있다는 것 입니다.

데이터 소개

season => 계절 변수, 1 : 봄, 2 : 여름, 3 : 가을, 4 : 겨울

holiday => 휴일 변수, 날짜가 휴일이면 1 아니면 0

workingday => 근무일 변수, 날짜가 주말도 휴일도 아니라면 1

weather => 날씨 변수, 맑음이 1 폭우가 4. 흐릴수록 값이 점차 증가.

temp => 온도, atemp => 체감온도, humidity => 습도, windspeed => 풍속

categorical_cols=['season','holiday','workingday','weather']
numerical_cols=['temp','atemp','humidity','windspeed']
label='count'

변수를 범주형 변수와 연속형 변수로 나눴습니다.

시간 데이터 관측

def encodetime(train,test,col,label):
    d3=train[[col,label]].groupby(col).mean()
    d3.sort_values(by='count',ascending=False)
    plt.scatter(x=d3.index,y=d3['count'])
    d3=d3.sort_values(by='count')
    d3['w']=np.arange(train[col].nunique())
    #순서
    dic=dict(zip(d3.index,d3['w']))
    train[col]=train[col].map(dic)
    test[col]=test[col].map(dic)

.nunique() => 유니크한 범주 개수 출력

dict(zip()) => 두 시리즈 변수를 튜플로 묶고 딕셔러니 자료형으로 변환

map => 주로 함수를 입력하는데, 여기서는 딕셔너리 키 값을 받을때 value 값을 반환해주는 함수로 사용

즉 이 함수는 col 변수로 변수가 입력되면 그 변수의 plot를 출력해주고 0~n 까지 레이블 인코딩을 해줍니다.

encodetime(train,test,'Year',label)

연도별로 차이가 있습니다.

encodetime(train,test,'Month',label)

날씨가 따뜻한 여름 부근에 확실히 값이 큽니다.

encodetime(train,test,'Hour',label)

야간시간에 값이 떨어집니다.

encodetime(train,test,'WeekDay',label)

0부터 월요일이므로 값이 많이 떨어지는 6은 일요일입니다.

범주형 자료 시각화

def boxplot(x,y,**kwargs):
    sns.boxplot(x=x,y=y)
    x=plt.xticks(rotation=90)
f=pd.melt(train,id_vars=['count'],value_vars=categorical_cols)
g=sns.FacetGrid(f,col='variable',col_wrap=2,sharex=False)
g.map(boxplot,'value','count')

<seaborn.axisgrid.FacetGrid at 0x7fd4625d22d0>

boxplot 모형입니다. count변수의 이상치가 변수에 상관없이 존재합니다.

파이썬 문법 : 함수 인자 내 **kwargs에 대해서.

단순하게 함수 인자 내 *이나 **이 보일 경우 C언어에서 사용하는 포인터 개념이 아닙니다.

몇개의 인자를 보낼지 모를때 사용되며, **같은 경우 딕셔너리 형태일때 사용합니다.

다만 이 코드는 너무 복잡해서 지금 실력에서 어떻게 해석을 못하겠습니다.

sns.pairplot(train[[*numerical_cols,'count']])

<seaborn.axisgrid.PairGrid at 0x7fd460faa110>

*numerical_cols은 리스트를 해체한다고 이해하면 편할 것 같아요.

pairplot함수를 통해 연속형 변수 간에 산점도를 한 눈에 볼 수 있습니다.

f, ax = plt.subplots(figsize=(15, 15))
corr = train[[*numerical_cols,'count']].corr()
sns.heatmap(corr,cmap=sns.diverging_palette(220, 10, as_cmap=True),square=True, ax=ax, annot = True)

<matplotlib.axes._subplots.AxesSubplot at 0x7fd461781f50>

heatmap 함수는 아까 본 산점도를 상관계수 버전으로 보여줍니다.

square는 셀을 정사각형으로 출력하는 것, annot은 셀 안에 숫자를 출력해주는 것을 의미합니다.

여기서는 체감온도와 실제 온도 간 상관계수가 엄청 높은 것이 눈에 띄네요.

f, ax = plt.subplots(figsize=(15, 15))
corr = train.corr()
sns.heatmap(corr,cmap=sns.diverging_palette(220, 10, as_cmap=True),square=True, ax=ax, annot = True)

<matplotlib.axes._subplots.AxesSubplot at 0x7fd4602b2650>

조금 더 확장해서 모든 변수간 상관계수를 살펴보았습니다.

반응변수 변환

sns.displot(train[label] , kde=True, height=8.27, aspect=11.7/8.27)
sns.displot(np.log(train[label]) , kde=True, height=8.27, aspect=11.7/8.27)

<seaborn.axisgrid.FacetGrid at 0x7fd46015a090>

count 변수를 로그변환 해보았는데요.

로그변환 전 우측 꼬리가 긴 그래프였는데 변환 후 상대적으로 더 정규분포에 가까워졌습니다.

다만 왼쪽 꼬리가 다소 길어져 변환 정도를 조절할 필요가 있겠습니다.

def trans(x,l1=0.3,l2=0):
    if l1!=0:
        return ((x+l2)**l1-1)/l1
    else:
        return np.log(x+l2)
def rev_trans(x,l1=0.3,l2=0):
    return (x*l1+1)**(1/l1)-l2

z=train[label].apply(trans)   
sns.displot(z , kde=True, height=8.27, aspect=11.7/8.27)

<seaborn.axisgrid.FacetGrid at 0x7fd45feda3d0>

l2가 0일때 이는 람다가 l1인 box-cox 변환입니다.

이 변환은 정규분포가 아닌 값을 정규분포형태로 변환합니다.

그림을 확인해보면 이전 대비 그래프가 확연히 정규분포 형태에 가까운 것을 알 수 있습니다.

데이터 모델링

x=train.drop(['count','datetime','atemp'],1) 
xtest=test.drop(['datetime','atemp'],1)
y=train['count']
xt,xv,yt,yv=train_test_split(x,y,test_size=0.2,random_state=101)

변수이름을 조금 대충 만들었네요.

보면서 생각난 점은 데이터 분석 프로젝트롤 여러명이서 할 경우 변수명 통일이 상당히 중요하다는 점 입니다.

def redun(x):
    return x

def mk_model_RF(xt1,xv1,yt,yv,md=None,func1=redun,func2=redun,mss=2,n_est=100,al=0):    
    ytt=yt.apply(func1)
    yvt=yv.apply(func2)
    model=RandomForestRegressor(max_depth=md, random_state=0,min_samples_split=mss,n_estimators=n_est,ccp_alpha=al)
    model.fit(xt1,ytt)
    ypt=np.apply_along_axis(func2,arr=model.predict(xt1),axis=0)
    ypv=np.apply_along_axis(func2,arr=model.predict(xv1),axis=0)
    print('training r2:',r2_score(yt,ypt))
    print('Validation r2:',r2_score(yv,ypv))
    print('training rmsle:',np.sqrt(msle(yt,ypt)))
    print('validation rmsle:',np.sqrt(msle(yv,ypv)))
    return model

함수를 조금 복잡하게 만들었는데 함수 한 개만 소개하겠습니다.

np.apply_along_axis 함수는 인자가 (함수, 어레이, 행/열 여부) 입니다.

apply와 비슷한 역할의 함수인데 넘파이 함수라서 실행속도가 엄청 빠릅니다.

mk_model_RF(xt,xv,yt,yv)

training r2: 0.9922090269471024
Validation r2: 0.9338288545809785
training rmsle: 0.16871805329660905
validation rmsle: 0.379570233664069

RandomForestRegressor(bootstrap=True, ccp_alpha=0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

랜덤 포레스트 방법이네요.

count 변수에 box-cox변환을 하지 않은 결과입니다.

mk_model_RF(xt,xv,yt,yv,func1=trans,func2=rev_trans,n_est=400,md=20)

training r2: 0.9916973216662878
Validation r2: 0.9338712883052291
training rmsle: 0.12704646646266962
validation rmsle: 0.34592422107196

RandomForestRegressor(bootstrap=True, ccp_alpha=0, criterion='mse',
                      max_depth=20, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=400, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

box-cox 변환을 한 결과가 조금 더 좋은 것 같습니다.

def mk_model_xgb(xt,xv,yt,yv,func1=redun,func2=redun,lr=1,min_child_weight =25,colsample_bytree = 0.8,md=None):
    model =XGBRegressor( colsample_bytree = colsample_bytree, learning_rate = lr,min_child_weight =min_child_weight, max_depth=md )
    ytt=yt.apply(func1)
    model.fit(xt,ytt)
    ypt=np.apply_along_axis(func2,arr=model.predict(xt),axis=0)
    ypv=np.apply_along_axis(func2,arr=model.predict(xv),axis=0)
    print('training r2:',r2_score(yt,ypt))
    print('Validation r2:',r2_score(yv,ypv))
    print('training rmsle:',np.sqrt(msle(yt,ypt)))
    print('validation rmsle:',np.sqrt(msle(yv,ypv)))
    return model

mk_model_xgb(xt,xv,yt,yv,func1=trans,func2=rev_trans,lr=0.2,min_child_weight =20,colsample_bytree = 0.8,md=20)

[14:54:06] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
training r2: 0.9945096180099459
Validation r2: 0.9504273424389956
training rmsle: 0.10444065614264901
validation rmsle: 0.3133324307194445

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, gamma=0,
             importance_type='gain', learning_rate=0.2, max_delta_step=0,
             max_depth=20, min_child_weight=20, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

xgboost 사용 결과 값이 더 좋게 나옵니다. box-cox 변환을 한 결과입니다.

model=XGBRegressor(colsample_bytree = 0.8, learning_rate = 0.2,min_child_weight =20, max_depth=20).fit(x,y.apply(trans))

[14:55:11] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

그래서 box-cox변환 후 xgboost를 사용해 모델을 적합시켰습니다.

분석 결과 제출

yp=np.apply_along_axis(rev_trans,arr=model.predict(xtest),axis=0)

plt.hist(yp)

(array([2543., 1497., 1057.,  606.,  367.,  184.,  125.,   69.,   34.,
          11.]),
 array([7.3024821e-01, 9.8740685e+01, 1.9675111e+02, 2.9476157e+02,
        3.9277197e+02, 4.9078241e+02, 5.8879285e+02, 6.8680328e+02,
        7.8481372e+02, 8.8282416e+02, 9.8083459e+02], dtype=float32),
 <a list of 10 Patch objects>)

test['count']=yp
test[['datetime', 'count']].to_csv('./submission.csv', index=False)
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "Message"

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
100% 188k/188k [00:01<00:00, 106kB/s]
Successfully submitted to Bike Sharing Demand

별도에 작업 없이 캐글과 연동하여 바로 제출할 수 있습니다.

느낀점

우선 이번 데이터 분석은 저번보다 열심히 한 것 같네요.

중점으로 두었던 것은 제가 시각화 부분이 많이 부족해서 이 부분 공부해보려고 이번 코드 골랐습니다.

잊고 있었던 box-cox 정규분포 변환에 대해 다시 떠올리게 되었던 것도 큰 수확인것 같네요.

이 사람이 코드 설명을 크게 한 것이 없어 찾아보느라도 고생한 것 같습니다.

결과는 0.41정도로 상위권은 아니지만 노력 대비 어느정도 성과가 있습니다.

이번 코드 리뷰를 통해서 많이 배운것 같습니다.

대회 출처 : https://www.kaggle.com/c/bike-sharing-demand

코드 출처 : https://www.kaggle.com/muhammedmamdouhsalah/bike-sharing

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed
0	2011-01-20 00:00:00	1	1	1	10.66	11.365	56	26.0027
1	2011-01-20 01:00:00	1	1	1	10.66	13.635	56	0.0000
2	2011-01-20 02:00:00	1	1	1	10.66	13.635	56	0.0000
3	2011-01-20 03:00:00	1	1	1	10.66	12.880	56	11.0014
4	2011-01-20 04:00:00	1	1	1	10.66	12.880	56	11.0014

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	count
count	10886.000000	10886.000000	10886.000000	10886.000000	10886.00000	10886.000000	10886.000000	10886.000000	10886.000000
mean	2.506614	0.028569	0.680875	1.418427	20.23086	23.655084	61.886460	12.799395	191.574132
std	1.116174	0.166599	0.466159	0.633839	7.79159	8.474601	19.245033	8.164537	181.144454
min	1.000000	0.000000	0.000000	1.000000	0.82000	0.760000	0.000000	0.000000	1.000000
25%	2.000000	0.000000	0.000000	1.000000	13.94000	16.665000	47.000000	7.001500	42.000000
50%	3.000000	0.000000	1.000000	1.000000	20.50000	24.240000	62.000000	12.998000	145.000000
75%	4.000000	0.000000	1.000000	2.000000	26.24000	31.060000	77.000000	16.997900	284.000000
max	4.000000	1.000000	1.000000	4.000000	41.00000	45.455000	100.000000	56.996900	977.000000

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1