https://www.kaggle.com/c/house-prices-advanced-regression-techniques

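This is an analysis of the Kaggle house price prediction dataset.

Boosting models take a long time to tune, so I will use a simpler linear regression model instead.

I have studied classification a little, so this time I will use this dataset to study linear regression, the foundation of regression.

The key point of this analysis is that most of the numeric variables are skewed, so we log-transform them.

Loading and exploring the data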

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr


%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

train = pd.read_csv("/content/drive/MyDrive/house/train.csv")
test = pd.read_csv("/content/drive/MyDrive/house/test.csv")

train.head()

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

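All columns except Id (the unique identifier) and the target variable SalePrice were concatenated into all_data for preprocessing.

Data preprocessing

The preprocessing in this code is not fancy. It sticks to the basics.

It can be summarized in three steps.

Transform right-skewed numeric features with log(feature + 1), which brings them closer to a normal distribution.
Create dummy variables for the categorical features.
Replace missing numeric values (NaN) with the mean of each column.

Log-transforming the target variable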

matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

[Figure: histograms of price and log(price + 1)]

Before the log transform the distribution had a long right tail; after the transform it looks much closer to a normal distribution.

all_data.dtypes

MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
                  ...   
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
Length: 79, dtype: object

train["SalePrice"] = np.log1p(train["SalePrice"])

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

all_data.dtypes lists the data type of each column. Its index holds the column names, so the names of the numeric columns can be extracted easily this way.
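
The same selection can be tried on a tiny frame. This is only a sketch: the toy columns below are made up, and select_dtypes is shown as one possible alternative rather than what the notebook uses.

```python
import pandas as pd

# Hypothetical toy frame (not the Kaggle data)
df = pd.DataFrame({
    "area": [50.0, 80.0],
    "rooms": [2, 3],
    "zone": pd.Series(["RL", "RM"], dtype=object),
})

# dtypes is a Series indexed by column name, so boolean filtering
# followed by .index yields the names of the non-object columns
numeric_cols = df.dtypes[df.dtypes != "object"].index

# select_dtypes expresses the same intent more directly
numeric_cols_alt = df.select_dtypes(include="number").columns

print(list(numeric_cols))
print(list(numeric_cols_alt))
```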

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

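skew is a function that returns the skewness. Skewness measures how asymmetric a distribution is.

A large positive skew value means the distribution has a long tail to the right.

So the skew value can be used to find the variables that should be log-transformed; the code above uses a threshold of 0.75.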
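
To see the effect of the transform on skewness, here is a minimal sketch on synthetic data. The log-normal sample and its parameters are assumptions chosen only to imitate a right-skewed, price-like variable.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample (assumed parameters, not the Kaggle data)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=10.0, sigma=0.5, size=5000)

skew_before = skew(x)            # clearly positive: long right tail
skew_after = skew(np.log1p(x))   # close to 0 after log(x + 1)

print(skew_before, skew_after)
```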

Note that apply is a pandas DataFrame method; it is used when you want to apply a function of your choice to the data.

Its default argument is axis = 0, so the function is applied to each column.
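
The difference between the two axis values is easy to check on a small made-up frame:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series
col_sums = df.apply(lambda s: s.sum())

# axis=1: the function receives each row as a Series
row_sums = df.apply(lambda s: s.sum(), axis=1)

print(col_sums.tolist())  # one value per column
print(row_sums.tolist())  # one value per row
```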

all_data = pd.get_dummies(all_data)
all_data.head(5)

get_dummies one-hot encoded every object-type column.

In a previous project I typed in the variables one by one, so this is a more convenient way that I have now learned.
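
A small sketch of what get_dummies does, using a made-up frame with one numeric and one categorical column:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"area": [50, 80, 65], "zone": ["RL", "RM", "RL"]})

# Numeric columns pass through unchanged; each category of "zone"
# becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(df)

print(list(encoded.columns))
```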

all_data = all_data.fillna(all_data.mean())

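When there are missing values, replacing them with the mean of each column is a common approach.

As with the code above, in my previous project I ran a function on each column, so this too is a more convenient way.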
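
The sketch below shows how fillna pairs each column with its own mean. The frame is made up; note that in the notebook the means are computed over the combined train and test rows in all_data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column name; fillna matches
# each column to the mean with the same label
filled = df.fillna(df.mean())

print(filled)
```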

X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

In my previous project I applied the preprocessing to the train and test data separately.

Combining them into all_data and preprocessing once, as done here, seems cleaner.
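
One practical reason the combined approach is cleaner can be sketched with two made-up frames: when a category appears in only one of them, encoding them separately gives mismatched columns, while encoding the concatenation and splitting by position keeps the columns aligned.

```python
import pandas as pd

# Hypothetical toy train/test frames; "FV" appears only in the test frame
train_toy = pd.DataFrame({"zone": ["RL", "RM"], "area": [50, 80]})
test_toy = pd.DataFrame({"zone": ["RL", "FV"], "area": [65, 70]})

# Encoding separately: the dummy columns differ between the two frames
sep_train = pd.get_dummies(train_toy)
sep_test = pd.get_dummies(test_toy)

# Encoding jointly, then splitting by row position as in the notebook
all_toy = pd.get_dummies(pd.concat((train_toy, test_toy)))
X_train_toy = all_toy[:train_toy.shape[0]]
X_test_toy = all_toy[train_toy.shape[0]:]

print(list(sep_train.columns), list(sep_test.columns))
print(list(X_train_toy.columns))
```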

Ridge model

We now fit linear regression models.

Both Lasso and Ridge are tried in order to find the best RMSE.

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV, Lasso
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    # 5-fold CV; the scorer returns negative MSE, so negate before the square root
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

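cross_val_score runs cross-validation and returns an array with one score per fold. The score is whatever scoring specifies: here it is the negative mean squared error, not an accuracy, which is why rmse_cv negates it before taking the square root.

Because cv = 5, this is 5-fold cross-validation.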
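
The sign convention can be checked in isolation. The sketch below uses synthetic data from make_regression instead of the house data; the sample size, noise level and alpha are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumed settings, not the Kaggle data)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# One score per fold; "neg_mean_squared_error" is MSE with the sign flipped
# so that larger is better, as scikit-learn scorers expect
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign back and take the square root to get per-fold RMSE
rmse = np.sqrt(-neg_mse)

print(neg_mse.shape, rmse.mean())
```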

model_ridge = Ridge()

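The main parameter of the Ridge model is alpha.

A higher alpha means stronger regularization, which helps prevent overfitting.

If alpha is too high, however, the model underfits, so a suitable value has to be found.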
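
One way to see what "stronger regularization" means is to watch the size of the coefficient vector shrink as alpha grows. This is a sketch on synthetic data; the three alpha values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumed settings, not the Kaggle data)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# L2 norm of the fitted coefficients for increasing alpha
test_alphas = (0.1, 10, 1000)
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in test_alphas]

for a, n in zip(test_alphas, norms):
    print(f"alpha={a}: coefficient norm={n:.2f}")
```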

alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() 
            for alpha in alphas]

Several alpha values were tried with the Ridge model.

Here, [value for alpha in alphas] is a list comprehension: it runs the for loop inside the list.
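
For comparison, here is the same computation written as an explicit loop and as a list comprehension. The squaring is just a placeholder for any per-alpha computation.

```python
toy_alphas = [0.1, 1, 10]

# Explicit loop: build the list step by step
squares_loop = []
for alpha in toy_alphas:
    squares_loop.append(alpha ** 2)

# List comprehension: the same loop written inside the brackets
squares_comp = [alpha ** 2 for alpha in toy_alphas]

print(squares_loop == squares_comp)
```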

cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")

[Figure: line plot "Validation - Just Do It", rmse (y-axis) against alpha (x-axis)]

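Calling plot on a Series draws a graph.

The x-axis is the index and the y-axis is the values.

The RMSE is lowest when alpha is 10, so alpha = 10 is the best choice among the values tried.

A plot of a validation metric against a regularization parameter often comes out U-shaped.

That is because weak regularization leads to overfitting and strong regularization leads to underfitting, and in both cases the error metric goes up.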

cv_ridge.min()

0.1273373466867076

The best RMSE, measured on the log-transformed price, is 0.1273.
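
The alpha that produced the minimum can be read from the Series index with idxmin instead of from the plot. The scores below are made-up numbers for illustration, not the notebook's results.

```python
import pandas as pd

# Hypothetical RMSE values indexed by alpha (illustrative only)
scores = pd.Series([0.139, 0.131, 0.128, 0.127, 0.129],
                   index=[0.1, 1, 5, 10, 30])

best_alpha = scores.idxmin()  # index label of the smallest value
best_rmse = scores.min()      # the smallest value itself

print(best_alpha, best_rmse)
```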

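Lasso model

Next is the Lasso model.

Unlike Ridge, Lasso drives the coefficients of variables with little influence to exactly 0.

Its advantage is that it performs variable selection at the same time.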
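
This difference can be sketched on synthetic data where only a few features carry signal. The data settings and alpha = 1.0 are assumptions for illustration, not the values used on the house data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which are informative (assumed settings)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Count coefficients that are exactly zero after fitting
lasso_zero = int(np.sum(Lasso(alpha=1.0).fit(X, y).coef_ == 0))
ridge_zero = int(np.sum(Ridge(alpha=1.0).fit(X, y).coef_ == 0))

print("Lasso zero coefficients:", lasso_zero)
print("Ridge zero coefficients:", ridge_zero)
```

Ridge shrinks coefficients toward 0 but leaves them non-zero, whereas the L1 penalty in Lasso sets some of them to exactly 0.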

model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)

LassoCV can evaluate several alpha values at once using cross-validation.
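
As a rough illustration of what LassoCV stores after fitting, the sketch below runs it on synthetic data with an assumed candidate list and cv=5. The chosen alpha is always one of the candidates, and mse_path_ holds one error per candidate and fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem (assumed settings, not the Kaggle data)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Candidate alphas chosen arbitrarily for this sketch
candidates = [10.0, 1.0, 0.1, 0.01]
model = LassoCV(alphas=candidates, cv=5).fit(X, y)

chosen = float(model.alpha_)        # the candidate with the lowest mean CV error
path_shape = model.mse_path_.shape  # (number of alphas, number of folds)

print(chosen, path_shape)
```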

model_lasso.alpha_

0.0005

rmse_cv(model_lasso).mean()

0.12256735885048142

The Lasso model has a lower RMSE (0.1226 versus 0.1273 for Ridge), so it is the better of the two.

I will use the Lasso model.

coef = pd.Series(model_lasso.coef_, index = X_train.columns)

The coef_ attribute of a fitted regression model holds the coefficients in column order.

print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

Lasso picked 110 variables and eliminated the other 178 variables

110 variables were selected, and the other 178 have a coefficient of 0, meaning they were not selected.
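
The counting can be reproduced on a small made-up coefficient Series. Note that a coefficient printed as -0.0 still compares equal to 0, so it is counted as eliminated.

```python
import pandas as pd

# Hypothetical coefficients (illustrative only)
coef_toy = pd.Series([0.5, 0.0, -0.2, 0.0, -0.0], index=list("abcde"))

# A boolean Series sums to the number of True entries
n_kept = int((coef_toy != 0).sum())
n_dropped = int((coef_toy == 0).sum())

print(n_kept, n_dropped)
```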

coef

MSSubClass              -0.007480
LotFrontage              0.000000
LotArea                  0.071826
OverallQual              0.053160
OverallCond              0.043027
                           ...   
SaleCondition_AdjLand    0.000000
SaleCondition_Alloca    -0.000000
SaleCondition_Family    -0.007925
SaleCondition_Normal     0.019666
SaleCondition_Partial    0.000000
Length: 288, dtype: float64

imp_coef = pd.concat([coef.sort_values().head(10),
                     coef.sort_values().tail(10)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")

[Figure: horizontal bar chart "Coefficients in the Lasso Model", showing the 10 smallest and 10 largest coefficients]

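sort_values() sorts a Series by its values. (Counting how often each category occurs, like a histogram for a categorical variable, is done with value_counts() instead.)

Here sorting alone is enough to pick out the most extreme coefficients.

The 10 smallest and 10 largest of the sorted values were plotted; these are the key variables.

That is because their coefficients are large in absolute value.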
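
If the goal is to rank by absolute value, the ranking can also be done directly on abs(). This is a sketch with made-up coefficients; keep in mind that coefficient magnitudes depend on the scale of each feature, so the ranking is only a rough guide to importance.

```python
import pandas as pd

# Hypothetical coefficients (illustrative only)
coef_toy = pd.Series({"a": 0.4, "b": -0.3, "c": 0.01, "d": -0.02, "e": 0.2})

# Notebook-style: most negative and most positive after sorting
sorted_coef = coef_toy.sort_values()
by_value = pd.concat([sorted_coef.head(1), sorted_coef.tail(1)])

# Alternative: order the labels by absolute value, largest first
order = coef_toy.abs().sort_values(ascending=False).index
by_abs = coef_toy.reindex(order).head(2)

print(list(by_value.index), list(by_abs.index))
```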

GrLivArea, which has the largest positive coefficient, is the above-ground living area, so it naturally has a large effect on the house price.

matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)

preds = pd.DataFrame({"preds":model_lasso.predict(X_train), "true":y})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals",kind = "scatter")

[Figure: scatter plot of residuals against predicted values]

The residual plot shows no major problems either.

model_lasso = Lasso(alpha = 0.0005).fit(X_train, y)
pred = model_lasso.predict(X_test)
# Undo the log1p applied to SalePrice; expm1(x) computes exp(x) - 1
pred2 = np.expm1(pred)
# Build the submission frame directly instead of adding columns to X_test,
# which is a slice of all_data and triggers SettingWithCopyWarning
final = pd.DataFrame({'Id': test['Id'].values, 'SalePrice': pred2})
final.to_csv('/content/drive/MyDrive/houselasso2.csv',encoding='UTF-8', index=False)

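I fitted a Lasso model with alpha = 0.0005 and used it to create the prediction file. The predictions are on the log scale, so they are converted back to prices before saving.

The two tables at the end of this post are the outputs of train.head() and all_data.head(5) from the earlier cells.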
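
The round trip between prices and log prices can be checked with the first three SalePrice values shown in train.head(). This sketch only demonstrates that expm1 inverts log1p.

```python
import numpy as np

# First three SalePrice values from train.head()
prices = np.array([208500.0, 181500.0, 223500.0])

logged = np.log1p(prices)    # log(price + 1), as applied to the target
restored = np.expm1(logged)  # exp(x) - 1, the exact inverse

print(np.allclose(restored, prices))
```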

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	...	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196.0	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	...	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NaN	Attchd	2003.0	RFn	2	548	TA	TA	Y	0	61	0	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0.0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	...	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976.0	RFn	2	460	TA	TA	Y	298	0	0	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162.0	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	...	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001.0	RFn	2	608	TA	TA	Y	0	42	0	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0.0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	540	756	GasA	...	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998.0	Unf	3	642	TA	TA	Y	0	35	272	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350.0	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	490	1145	GasA	...	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000.0	RFn	3	836	TA	TA	Y	192	84	0	NaN	NaN	NaN	12	2008	WD	Normal	250000

	MSSubClass	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	MoSold	YrSold	MSZoning_RL	...	GarageFinish_Unf	GarageQual_TA	GarageCond_TA	PavedDrive_Y	SaleType_WD	SaleCondition_Abnorml	SaleCondition_Normal
0	4.110874	4.189655	9.042040	7	5	2003	2003	5.283204	6.561031	5.017280	6.753438	6.753438	6.751101	7.444833	1.0	0.000000	2	1	3	0.693147	8	0	2003.0	2.0	548.0	0.000000	4.127134	0.000000	2	2008	1	...	0	1	1	1	1	0	1
1	3.044522	4.394449	9.169623	6	8	1976	1976	0.000000	6.886532	5.652489	7.141245	7.141245	0.000000	7.141245	0.0	0.693147	2	0	3	0.693147	6	1	1976.0	2.0	460.0	5.700444	0.000000	0.000000	5	2007	1	...	0	1	1	1	1	0	1
2	4.110874	4.234107	9.328212	7	5	2001	2002	5.093750	6.188264	6.075346	6.825460	6.825460	6.765039	7.488294	1.0	0.000000	2	1	3	0.693147	6	1	2001.0	2.0	608.0	0.000000	3.761200	0.000000	9	2008	1	...	0	1	1	1	1	0	1
3	4.262680	4.110874	9.164401	7	5	1915	1970	0.000000	5.379897	6.293419	6.629363	6.869014	6.629363	7.448916	1.0	0.000000	1	0	3	0.693147	7	1	1998.0	3.0	642.0	0.000000	3.583519	5.609472	2	2006	1	...	1	1	1	1	1	1	0
4	4.110874	4.442651	9.565284	8	5	2000	2000	5.860786	6.486161	6.196444	7.044033	7.044033	6.960348	7.695758	1.0	0.000000	2	1	4	0.693147	9	1	2000.0	3.0	836.0	5.262690	4.442651	0.000000	12	2008	1	...	0	1	1	1	1	0	1