
Connecting to Kaggle

!pip install kaggle
from google.colab import files
files.upload()
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2021.5.30)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.62.3)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (5.0.2)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving kaggle.json to kaggle.json
{'kaggle.json': b'{"username":"ksy1998","key":"<redacted>"}'}
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c competitive-data-science-predict-future-sales
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
Downloading sample_submission.csv.zip to /content
  0% 0.00/468k [00:00<?, ?B/s]
100% 468k/468k [00:00<00:00, 69.0MB/s]
Downloading sales_train.csv.zip to /content
 38% 5.00M/13.3M [00:00<00:01, 5.79MB/s]
100% 13.3M/13.3M [00:00<00:00, 14.4MB/s]
Downloading item_categories.csv to /content
  0% 0.00/3.49k [00:00<?, ?B/s]
100% 3.49k/3.49k [00:00<00:00, 2.51MB/s]
Downloading shops.csv to /content
  0% 0.00/2.91k [00:00<?, ?B/s]
100% 2.91k/2.91k [00:00<00:00, 10.6MB/s]
Downloading test.csv.zip to /content
  0% 0.00/1.02M [00:00<?, ?B/s]
100% 1.02M/1.02M [00:00<00:00, 156MB/s]
Downloading items.csv.zip to /content
  0% 0.00/368k [00:00<?, ?B/s]
100% 368k/368k [00:00<00:00, 117MB/s]
!unzip items.csv.zip
!unzip sales_train.csv.zip
!unzip sample_submission.csv.zip
!unzip test.csv.zip
Archive:  items.csv.zip
  inflating: items.csv               
Archive:  sales_train.csv.zip
  inflating: sales_train.csv         
Archive:  sample_submission.csv.zip
  inflating: sample_submission.csv   
Archive:  test.csv.zip
  inflating: test.csv                

Loading the Data

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from matplotlib import pylab as plt
import matplotlib.dates as mdates
plt.rcParams['figure.figsize'] = (15.0, 8.0)
import seaborn as sns
train = pd.read_csv('./sales_train.csv')
print ('number of shops: ', train['shop_id'].max())  # max shop_id; IDs start at 0, so there are 60 shops
print ('number of items: ', train['item_id'].max())  # max item_id, not a count
num_month = train['date_block_num'].max()
print ('number of month: ', num_month)  # last month index (0-based), i.e. 34 months in total
print ('size of train: ', train.shape)
train.head()
number of shops:  59
number of items:  22169
number of month:  33
size of train:  (2935849, 6)
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0

Column Descriptions

date: the date of the sale; date_block_num: a consecutive month index (January 2013 => 0, October 2015 => 33)

shop_id, item_id: unique IDs for the shop and the item

item_price: the item's price; item_cnt_day: the number of units of the item sold that day

(A negative item_cnt_day appears to indicate that the item was returned.)
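
As a quick sanity check of that reading (a minimal sketch reusing the train frame loaded above; the printed values are not from the original):

returns = train[train['item_cnt_day'] < 0]  # rows that look like refunds/returns
print('rows with negative item_cnt_day:', len(returns))
print('most negative daily count:', returns['item_cnt_day'].min())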

test = pd.read_csv('./test.csv')
test.head()
   ID  shop_id  item_id
0   0        5     5037
1   1        5     5320
2   2        5     5233
3   3        5     5232
4   4        5     5268
sub = pd.read_csv('./sample_submission.csv')
sub.head()
   ID  item_cnt_month
0   0             0.5
1   1             0.5
2   2             0.5
3   3             0.5
4   4             0.5

This Kaggle competition asks us to predict sales for November 2015.

That month corresponds to date_block_num = 34.

items = pd.read_csv('./items.csv')
print ('number of categories: ', items['item_category_id'].max()) # the maximum category id; ids run 0-83, so there are 84 categories
items.head()
number of categories:  83
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76
2      ***В ЛУЧАХ СЛАВЫ   (UNV)                    D        2                40
3    ***ГОЛУБАЯ ВОЛНА  (Univ)                      D        3                40
4        ***КОРОБКА (СТЕКЛО)                       D        4                40
train_clean = train.drop(labels = ['date', 'item_price'], axis = 1)
train_clean.head()
   date_block_num  shop_id  item_id  item_cnt_day
0               0       59    22154           1.0
1               0       25     2552           1.0
2               0       25     2552          -1.0
3               0       25     2554           1.0
4               0       25     2555           1.0

The date column is dropped because the date_block_num column already encodes the month.

The item price column is dropped as well.

train_clean = train_clean.groupby(["item_id","shop_id","date_block_num"]).sum().reset_index()
train_clean = train_clean.rename(index=str, columns = {"item_cnt_day":"item_cnt_month"})
train_clean = train_clean[["item_id","shop_id","date_block_num","item_cnt_month"]]
train_clean
         item_id  shop_id  date_block_num  item_cnt_month
0              0       54              20             1.0
1              1       55              15             2.0
2              1       55              18             1.0
3              1       55              19             1.0
4              1       55              20             1.0
...          ...      ...             ...             ...
1609119    22168       12               8             1.0
1609120    22168       16               1             1.0
1609121    22168       42               1             1.0
1609122    22168       43               2             1.0
1609123    22169       25              14             1.0

1609124 rows × 4 columns

Rows are grouped by month (i.e., by equal values of date_block_num).

This is because the test set asks for predictions at monthly granularity.

The column is renamed to item_cnt_month to match.

Practicing with Time-Series Data

check = train_clean[["shop_id","item_id","date_block_num","item_cnt_month"]]
check = check.loc[check['shop_id'] == 5]
check = check.loc[check['item_id'] == 5037]
check
        shop_id  item_id  date_block_num  item_cnt_month
400439        5     5037              20             1.0
400440        5     5037              22             1.0
400441        5     5037              23             2.0
400442        5     5037              24             2.0
400443        5     5037              28             1.0
400444        5     5037              29             1.0
400445        5     5037              30             1.0
400446        5     5037              31             3.0
400447        5     5037              32             1.0

Here we keep only the rows with one specific shop_id and item_id.

Since this is a first attempt at time-series analysis, we start with a small slice of the data.

Studying the data this way should make training an LSTM model more intuitive.

plt.figure(figsize=(10,4))
plt.title('Check - Sales of Item 5037 at Shop 5')
plt.xlabel('Month')
plt.ylabel('Sales of Item 5037 at Shop 5')
plt.plot(check["date_block_num"],check["item_cnt_month"]);

This simply plots the target values over time.

month_list=[i for i in range(num_month+1)] # num_month = train['date_block_num'].max(), the last month index
shop = []
for i in range(num_month+1):
    shop.append(5)
item = []
for i in range(num_month+1):
    item.append(5037)
months_full = pd.DataFrame({'shop_id':shop, 'item_id':item,'date_block_num':month_list})
months_full.head(10)
   shop_id  item_id  date_block_num
0        5     5037               0
1        5     5037               1
2        5     5037               2
3        5     5037               3
4        5     5037               4
5        5     5037               5
6        5     5037               6
7        5     5037               7
8        5     5037               8
9        5     5037               9

Here a data frame covering every month from the start is built, so that months with no sales are not simply missing.

shop = []
for i in range(num_month+1):
    shop.append(5)

Building the list as [5]*(num_month+1) would be cleaner than this loop, though; a sketch follows.
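
A minimal sketch of the cleaner construction (it produces the same months_full as the loops above):

month_list = list(range(num_month + 1))  # month indices 0..33
shop = [5] * (num_month + 1)             # constant shop_id, one entry per month
item = [5037] * (num_month + 1)          # constant item_id, one entry per month
months_full = pd.DataFrame({'shop_id': shop, 'item_id': item, 'date_block_num': month_list})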

sales_33month = pd.merge(check, months_full, how='right', on=['shop_id','item_id','date_block_num'])
sales_33month = sales_33month.sort_values(by=['date_block_num'])
sales_33month.fillna(0.00,inplace=True)
plt.figure(figsize=(10,4))
plt.title('Check - Sales of Item 5037 at Shop 5 for whole period')
plt.xlabel('Month')
plt.ylabel('Sales of Item 5037 at Shop 5')
plt.plot(sales_33month["date_block_num"],sales_33month["item_cnt_month"]);

Months with no purchases are filled with 0 so they appear in the plot as well.

for i in range(1,6):
    sales_33month["T_" + str(i)] = sales_33month.item_cnt_month.shift(i)
sales_33month.fillna(0.0, inplace=True)
df = sales_33month[['shop_id','item_id','date_block_num','T_1','T_2','T_3','T_4','T_5', 'item_cnt_month']].reset_index()
df = df.drop(labels = ['index'], axis = 1)
df
    shop_id  item_id  date_block_num  T_1  T_2  T_3  T_4  T_5  item_cnt_month
0         5     5037               0  0.0  0.0  0.0  0.0  0.0             0.0
1         5     5037               1  0.0  0.0  0.0  0.0  0.0             0.0
2         5     5037               2  0.0  0.0  0.0  0.0  0.0             0.0
3         5     5037               3  0.0  0.0  0.0  0.0  0.0             0.0
4         5     5037               4  0.0  0.0  0.0  0.0  0.0             0.0
5         5     5037               5  0.0  0.0  0.0  0.0  0.0             0.0
6         5     5037               6  0.0  0.0  0.0  0.0  0.0             0.0
7         5     5037               7  0.0  0.0  0.0  0.0  0.0             0.0
8         5     5037               8  0.0  0.0  0.0  0.0  0.0             0.0
9         5     5037               9  0.0  0.0  0.0  0.0  0.0             0.0
10        5     5037              10  0.0  0.0  0.0  0.0  0.0             0.0
11        5     5037              11  0.0  0.0  0.0  0.0  0.0             0.0
12        5     5037              12  0.0  0.0  0.0  0.0  0.0             0.0
13        5     5037              13  0.0  0.0  0.0  0.0  0.0             0.0
14        5     5037              14  0.0  0.0  0.0  0.0  0.0             0.0
15        5     5037              15  0.0  0.0  0.0  0.0  0.0             0.0
16        5     5037              16  0.0  0.0  0.0  0.0  0.0             0.0
17        5     5037              17  0.0  0.0  0.0  0.0  0.0             0.0
18        5     5037              18  0.0  0.0  0.0  0.0  0.0             0.0
19        5     5037              19  0.0  0.0  0.0  0.0  0.0             0.0
20        5     5037              20  0.0  0.0  0.0  0.0  0.0             1.0
21        5     5037              21  1.0  0.0  0.0  0.0  0.0             0.0
22        5     5037              22  0.0  1.0  0.0  0.0  0.0             1.0
23        5     5037              23  1.0  0.0  1.0  0.0  0.0             2.0
24        5     5037              24  2.0  1.0  0.0  1.0  0.0             2.0
25        5     5037              25  2.0  2.0  1.0  0.0  1.0             0.0
26        5     5037              26  0.0  2.0  2.0  1.0  0.0             0.0
27        5     5037              27  0.0  0.0  2.0  2.0  1.0             0.0
28        5     5037              28  0.0  0.0  0.0  2.0  2.0             1.0
29        5     5037              29  1.0  0.0  0.0  0.0  2.0             1.0
30        5     5037              30  1.0  1.0  0.0  0.0  0.0             1.0
31        5     5037              31  1.0  1.0  1.0  0.0  0.0             3.0
32        5     5037              32  3.0  1.0  1.0  1.0  0.0             1.0
33        5     5037              33  1.0  3.0  1.0  1.0  1.0             0.0

This walks through time-series feature construction from the ground up.

T_1 through T_5 hold the target values from the previous 5 months; for example, T_1 is the target value one month earlier.

Because future sales depend on recent history, lag features like these suit this data well; a reusable helper is sketched below.
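
The same lag construction can be packaged as a small helper, sketched here under the assumption that the frame has an item_cnt_month column (the name make_lags is mine, not from the original):

def make_lags(frame, target='item_cnt_month', n_lags=5):
    # T_i holds the target value from i months earlier; the first rows get 0.0
    out = frame.copy()
    for i in range(1, n_lags + 1):
        out['T_' + str(i)] = out[target].shift(i)
    return out.fillna(0.0)

df_lagged = make_lags(sales_33month)  # reproduces the T_1..T_5 columns built above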

Using an LSTM Model

train_df = df[:-3]
val_df = df[-3:]
x_train,y_train = train_df.drop(["item_cnt_month"],axis=1),train_df.item_cnt_month
x_val,y_val = val_df.drop(["item_cnt_month"],axis=1),val_df.item_cnt_month

The last 3 rows are held out as validation data.

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
model_lstm = Sequential()
model_lstm.add(LSTM(15, input_shape=(1,8)))  # 8 features: shop_id, item_id, date_block_num, T_1..T_5
model_lstm.add(Dense(1))
model_lstm.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])  # note: accuracy is not meaningful for regression; kept only to match the logs below
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))  # an earlier StandardScaler() was created and immediately overwritten, so it is omitted here
x_train_scaled = scaler.fit_transform(x_train)
x_val_scaled = scaler.fit_transform(x_val)  # caution: re-fitting on the validation set leaks its statistics; scaler.transform(x_val) is the safer choice
x_train_reshaped = x_train_scaled.reshape((x_train_scaled.shape[0], 1, x_train_scaled.shape[1]))
x_val_reshaped = x_val_scaled.reshape((x_val_scaled.shape[0], 1, x_val_scaled.shape[1]))
history = model_lstm.fit(x_train_reshaped, y_train, validation_data=(x_val_reshaped, y_val), epochs=70, batch_size=12, verbose=2, shuffle=False)
y_pre = model_lstm.predict(x_val_reshaped)
Epoch 1/70
3/3 - 2s - loss: 0.4119 - accuracy: 0.7742 - val_loss: 3.6385 - val_accuracy: 0.3333
Epoch 2/70
3/3 - 0s - loss: 0.3959 - accuracy: 0.7742 - val_loss: 3.5825 - val_accuracy: 0.3333
Epoch 3/70
3/3 - 0s - loss: 0.3818 - accuracy: 0.7742 - val_loss: 3.5290 - val_accuracy: 0.3333
Epoch 4/70
3/3 - 0s - loss: 0.3689 - accuracy: 0.7742 - val_loss: 3.4781 - val_accuracy: 0.3333
Epoch 5/70
3/3 - 0s - loss: 0.3571 - accuracy: 0.7742 - val_loss: 3.4296 - val_accuracy: 0.3333
Epoch 6/70
3/3 - 0s - loss: 0.3464 - accuracy: 0.7742 - val_loss: 3.3839 - val_accuracy: 0.3333
Epoch 7/70
3/3 - 0s - loss: 0.3368 - accuracy: 0.7742 - val_loss: 3.3409 - val_accuracy: 0.3333
Epoch 8/70
3/3 - 0s - loss: 0.3281 - accuracy: 0.7742 - val_loss: 3.3008 - val_accuracy: 0.3333
Epoch 9/70
3/3 - 0s - loss: 0.3203 - accuracy: 0.7742 - val_loss: 3.2637 - val_accuracy: 0.3333
Epoch 10/70
3/3 - 0s - loss: 0.3132 - accuracy: 0.7742 - val_loss: 3.2296 - val_accuracy: 0.3333
Epoch 11/70
3/3 - 0s - loss: 0.3069 - accuracy: 0.7742 - val_loss: 3.1984 - val_accuracy: 0.3333
Epoch 12/70
3/3 - 0s - loss: 0.3012 - accuracy: 0.7742 - val_loss: 3.1702 - val_accuracy: 0.3333
Epoch 13/70
3/3 - 0s - loss: 0.2960 - accuracy: 0.7742 - val_loss: 3.1451 - val_accuracy: 0.3333
Epoch 14/70
3/3 - 0s - loss: 0.2913 - accuracy: 0.7742 - val_loss: 3.1228 - val_accuracy: 0.3333
Epoch 15/70
3/3 - 0s - loss: 0.2869 - accuracy: 0.7742 - val_loss: 3.1035 - val_accuracy: 0.3333
Epoch 16/70
3/3 - 0s - loss: 0.2829 - accuracy: 0.7742 - val_loss: 3.0871 - val_accuracy: 0.3333
Epoch 17/70
3/3 - 0s - loss: 0.2791 - accuracy: 0.7742 - val_loss: 3.0733 - val_accuracy: 0.3333
Epoch 18/70
3/3 - 0s - loss: 0.2755 - accuracy: 0.7742 - val_loss: 3.0623 - val_accuracy: 0.3333
Epoch 19/70
3/3 - 0s - loss: 0.2720 - accuracy: 0.7742 - val_loss: 3.0537 - val_accuracy: 0.3333
Epoch 20/70
3/3 - 0s - loss: 0.2687 - accuracy: 0.7742 - val_loss: 3.0476 - val_accuracy: 0.3333
Epoch 21/70
3/3 - 0s - loss: 0.2654 - accuracy: 0.7742 - val_loss: 3.0437 - val_accuracy: 0.3333
Epoch 22/70
3/3 - 0s - loss: 0.2622 - accuracy: 0.7742 - val_loss: 3.0419 - val_accuracy: 0.3333
Epoch 23/70
3/3 - 0s - loss: 0.2590 - accuracy: 0.7742 - val_loss: 3.0421 - val_accuracy: 0.3333
Epoch 24/70
3/3 - 0s - loss: 0.2558 - accuracy: 0.8065 - val_loss: 3.0440 - val_accuracy: 0.3333
Epoch 25/70
3/3 - 0s - loss: 0.2527 - accuracy: 0.8065 - val_loss: 3.0477 - val_accuracy: 0.3333
Epoch 26/70
3/3 - 0s - loss: 0.2495 - accuracy: 0.8387 - val_loss: 3.0528 - val_accuracy: 0.3333
Epoch 27/70
3/3 - 0s - loss: 0.2463 - accuracy: 0.8387 - val_loss: 3.0592 - val_accuracy: 0.3333
Epoch 28/70
3/3 - 0s - loss: 0.2432 - accuracy: 0.8387 - val_loss: 3.0669 - val_accuracy: 0.3333
Epoch 29/70
3/3 - 0s - loss: 0.2401 - accuracy: 0.8387 - val_loss: 3.0756 - val_accuracy: 0.3333
Epoch 30/70
3/3 - 0s - loss: 0.2370 - accuracy: 0.8387 - val_loss: 3.0853 - val_accuracy: 0.3333
Epoch 31/70
3/3 - 0s - loss: 0.2339 - accuracy: 0.8387 - val_loss: 3.0958 - val_accuracy: 0.3333
Epoch 32/70
3/3 - 0s - loss: 0.2308 - accuracy: 0.8387 - val_loss: 3.1070 - val_accuracy: 0.3333
Epoch 33/70
3/3 - 0s - loss: 0.2278 - accuracy: 0.8065 - val_loss: 3.1187 - val_accuracy: 0.6667
Epoch 34/70
3/3 - 0s - loss: 0.2248 - accuracy: 0.8065 - val_loss: 3.1310 - val_accuracy: 0.6667
Epoch 35/70
3/3 - 0s - loss: 0.2219 - accuracy: 0.8065 - val_loss: 3.1436 - val_accuracy: 0.6667
Epoch 36/70
3/3 - 0s - loss: 0.2190 - accuracy: 0.7742 - val_loss: 3.1565 - val_accuracy: 0.6667
Epoch 37/70
3/3 - 0s - loss: 0.2162 - accuracy: 0.7742 - val_loss: 3.1696 - val_accuracy: 0.3333
Epoch 38/70
3/3 - 0s - loss: 0.2134 - accuracy: 0.7742 - val_loss: 3.1829 - val_accuracy: 0.3333
Epoch 39/70
3/3 - 0s - loss: 0.2107 - accuracy: 0.8065 - val_loss: 3.1963 - val_accuracy: 0.3333
Epoch 40/70
3/3 - 0s - loss: 0.2081 - accuracy: 0.8065 - val_loss: 3.2096 - val_accuracy: 0.3333
Epoch 41/70
3/3 - 0s - loss: 0.2056 - accuracy: 0.8065 - val_loss: 3.2229 - val_accuracy: 0.3333
Epoch 42/70
3/3 - 0s - loss: 0.2031 - accuracy: 0.8065 - val_loss: 3.2361 - val_accuracy: 0.3333
Epoch 43/70
3/3 - 0s - loss: 0.2008 - accuracy: 0.8065 - val_loss: 3.2492 - val_accuracy: 0.3333
Epoch 44/70
3/3 - 0s - loss: 0.1985 - accuracy: 0.8065 - val_loss: 3.2621 - val_accuracy: 0.3333
Epoch 45/70
3/3 - 0s - loss: 0.1963 - accuracy: 0.8065 - val_loss: 3.2748 - val_accuracy: 0.3333
Epoch 46/70
3/3 - 0s - loss: 0.1941 - accuracy: 0.8065 - val_loss: 3.2872 - val_accuracy: 0.3333
Epoch 47/70
3/3 - 0s - loss: 0.1921 - accuracy: 0.8065 - val_loss: 3.2994 - val_accuracy: 0.3333
Epoch 48/70
3/3 - 0s - loss: 0.1901 - accuracy: 0.8065 - val_loss: 3.3113 - val_accuracy: 0.3333
Epoch 49/70
3/3 - 0s - loss: 0.1882 - accuracy: 0.8065 - val_loss: 3.3229 - val_accuracy: 0.3333
Epoch 50/70
3/3 - 0s - loss: 0.1864 - accuracy: 0.8065 - val_loss: 3.3342 - val_accuracy: 0.3333
Epoch 51/70
3/3 - 0s - loss: 0.1847 - accuracy: 0.8065 - val_loss: 3.3451 - val_accuracy: 0.3333
Epoch 52/70
3/3 - 0s - loss: 0.1830 - accuracy: 0.8065 - val_loss: 3.3558 - val_accuracy: 0.3333
Epoch 53/70
3/3 - 0s - loss: 0.1814 - accuracy: 0.8065 - val_loss: 3.3661 - val_accuracy: 0.3333
Epoch 54/70
3/3 - 0s - loss: 0.1799 - accuracy: 0.8065 - val_loss: 3.3760 - val_accuracy: 0.3333
Epoch 55/70
3/3 - 0s - loss: 0.1785 - accuracy: 0.8065 - val_loss: 3.3855 - val_accuracy: 0.3333
Epoch 56/70
3/3 - 0s - loss: 0.1771 - accuracy: 0.8065 - val_loss: 3.3947 - val_accuracy: 0.3333
Epoch 57/70
3/3 - 0s - loss: 0.1757 - accuracy: 0.8065 - val_loss: 3.4036 - val_accuracy: 0.3333
Epoch 58/70
3/3 - 0s - loss: 0.1745 - accuracy: 0.8065 - val_loss: 3.4120 - val_accuracy: 0.3333
Epoch 59/70
3/3 - 0s - loss: 0.1732 - accuracy: 0.8065 - val_loss: 3.4201 - val_accuracy: 0.3333
Epoch 60/70
3/3 - 0s - loss: 0.1720 - accuracy: 0.8065 - val_loss: 3.4278 - val_accuracy: 0.3333
Epoch 61/70
3/3 - 0s - loss: 0.1709 - accuracy: 0.8065 - val_loss: 3.4351 - val_accuracy: 0.3333
Epoch 62/70
3/3 - 0s - loss: 0.1698 - accuracy: 0.8065 - val_loss: 3.4420 - val_accuracy: 0.3333
Epoch 63/70
3/3 - 0s - loss: 0.1687 - accuracy: 0.8065 - val_loss: 3.4485 - val_accuracy: 0.3333
Epoch 64/70
3/3 - 0s - loss: 0.1677 - accuracy: 0.8065 - val_loss: 3.4547 - val_accuracy: 0.3333
Epoch 65/70
3/3 - 0s - loss: 0.1667 - accuracy: 0.8065 - val_loss: 3.4605 - val_accuracy: 0.3333
Epoch 66/70
3/3 - 0s - loss: 0.1658 - accuracy: 0.8065 - val_loss: 3.4659 - val_accuracy: 0.3333
Epoch 67/70
3/3 - 0s - loss: 0.1648 - accuracy: 0.8065 - val_loss: 3.4710 - val_accuracy: 0.3333
Epoch 68/70
3/3 - 0s - loss: 0.1639 - accuracy: 0.8065 - val_loss: 3.4758 - val_accuracy: 0.3333
Epoch 69/70
3/3 - 0s - loss: 0.1631 - accuracy: 0.8065 - val_loss: 3.4802 - val_accuracy: 0.3333
Epoch 70/70
3/3 - 0s - loss: 0.1622 - accuracy: 0.8065 - val_loss: 3.4844 - val_accuracy: 0.3333
fig, ax = plt.subplots()
ax.plot(x_val['date_block_num'], y_val, label='Actual')
ax.plot(x_val['date_block_num'], y_pre, label='Predicted')
plt.title('LSTM Prediction vs Actual Sales for last 3 months')
plt.xlabel('Month')
plt.xticks(x_val['date_block_num'])
plt.ylabel('Sales of Item 5037 at Shop 5')
ax.legend()
plt.show()

Here the LSTM's predictions are plotted against the actual values.

The result is decent in some respects and disappointing in others.
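
One easy improvement, sketched under the assumption that x_train and x_val are as defined above: fit the scaler on the training rows only and reuse it for validation, so no validation statistics leak into the scaling.

scaler = MinMaxScaler(feature_range=(-1, 1))
x_train_scaled = scaler.fit_transform(x_train)  # fit on training data only
x_val_scaled = scaler.transform(x_val)          # reuse the training-set scaling for validation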

Exploratory Data Analysis

sales_data = pd.read_csv('./sales_train.csv')
item_cat = pd.read_csv('./item_categories.csv')
items = pd.read_csv('./items.csv')
shops = pd.read_csv('./shops.csv')
sample_submission = pd.read_csv('./sample_submission.csv')
test_data = pd.read_csv('./test.csv')
def basic_eda(df):
    print("----------TOP 5 RECORDS--------")
    print(df.head(5))
    print("----------INFO-----------------")
    print(df.info())
    print("----------Describe-------------")
    print(df.describe())
    print("----------Columns--------------")
    print(df.columns)
    print("----------Data Types-----------")
    print(df.dtypes)
    print("-------Missing Values----------")
    print(df.isnull().sum())
    print("-------NULL values-------------")
    print(df.isna().sum())
    print("-----Shape Of Data-------------")
    print(df.shape)

print("=============================Sales Data=============================")
basic_eda(sales_data)
print("=============================Test data=============================")
basic_eda(test_data)
print("=============================Item Categories=============================")
basic_eda(item_cat)
print("=============================Items=============================")
basic_eda(items)
print("=============================Shops=============================")
basic_eda(shops)
print("=============================Sample Submission=============================")
basic_eda(sample_submission)
=============================Sales Data=============================
----------TOP 5 RECORDS--------
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int64  
 2   shop_id         int64  
 3   item_id         int64  
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
None
----------Describe-------------
       date_block_num       shop_id       item_id    item_price  item_cnt_day
count    2.935849e+06  2.935849e+06  2.935849e+06  2.935849e+06  2.935849e+06
mean     1.456991e+01  3.300173e+01  1.019723e+04  8.908532e+02  1.242641e+00
std      9.422988e+00  1.622697e+01  6.324297e+03  1.729800e+03  2.618834e+00
min      0.000000e+00  0.000000e+00  0.000000e+00 -1.000000e+00 -2.200000e+01
25%      7.000000e+00  2.200000e+01  4.476000e+03  2.490000e+02  1.000000e+00
50%      1.400000e+01  3.100000e+01  9.343000e+03  3.990000e+02  1.000000e+00
75%      2.300000e+01  4.700000e+01  1.568400e+04  9.990000e+02  1.000000e+00
max      3.300000e+01  5.900000e+01  2.216900e+04  3.079800e+05  2.169000e+03
----------Columns--------------
Index(['date', 'date_block_num', 'shop_id', 'item_id', 'item_price',
       'item_cnt_day'],
      dtype='object')
----------Data Types-----------
date               object
date_block_num      int64
shop_id             int64
item_id             int64
item_price        float64
item_cnt_day      float64
dtype: object
-------Missing Values----------
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
-------NULL values-------------
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
-----Shape Of Data-------------
(2935849, 6)
=============================Test data=============================
----------TOP 5 RECORDS--------
   ID  shop_id  item_id
0   0        5     5037
1   1        5     5320
2   2        5     5233
3   3        5     5232
4   4        5     5268
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   ID       214200 non-null  int64
 1   shop_id  214200 non-null  int64
 2   item_id  214200 non-null  int64
dtypes: int64(3)
memory usage: 4.9 MB
None
----------Describe-------------
                  ID        shop_id        item_id
count  214200.000000  214200.000000  214200.000000
mean   107099.500000      31.642857   11019.398627
std     61834.358168      17.561933    6252.644590
min         0.000000       2.000000      30.000000
25%     53549.750000      16.000000    5381.500000
50%    107099.500000      34.500000   11203.000000
75%    160649.250000      47.000000   16071.500000
max    214199.000000      59.000000   22167.000000
----------Columns--------------
Index(['ID', 'shop_id', 'item_id'], dtype='object')
----------Data Types-----------
ID         int64
shop_id    int64
item_id    int64
dtype: object
-------Missing Values----------
ID         0
shop_id    0
item_id    0
dtype: int64
-------NULL values-------------
ID         0
shop_id    0
item_id    0
dtype: int64
-----Shape Of Data-------------
(214200, 3)
=============================Item Categories=============================
----------TOP 5 RECORDS--------
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1
2         Аксессуары - PS3                 2
3         Аксессуары - PS4                 3
4         Аксессуары - PSP                 4
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   item_category_name  84 non-null     object
 1   item_category_id    84 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ KB
None
----------Describe-------------
       item_category_id
count         84.000000
mean          41.500000
std           24.392622
min            0.000000
25%           20.750000
50%           41.500000
75%           62.250000
max           83.000000
----------Columns--------------
Index(['item_category_name', 'item_category_id'], dtype='object')
----------Data Types-----------
item_category_name    object
item_category_id       int64
dtype: object
-------Missing Values----------
item_category_name    0
item_category_id      0
dtype: int64
-------NULL values-------------
item_category_name    0
item_category_id      0
dtype: int64
-----Shape Of Data-------------
(84, 2)
=============================Items=============================
----------TOP 5 RECORDS--------
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76
2      ***В ЛУЧАХ СЛАВЫ   (UNV)                    D        2                40
3    ***ГОЛУБАЯ ВОЛНА  (Univ)                      D        3                40
4        ***КОРОБКА (СТЕКЛО)                       D        4                40
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 519.7+ KB
None
----------Describe-------------
           item_id  item_category_id
count  22170.00000      22170.000000
mean   11084.50000         46.290753
std     6400.07207         15.941486
min        0.00000          0.000000
25%     5542.25000         37.000000
50%    11084.50000         40.000000
75%    16626.75000         58.000000
max    22169.00000         83.000000
----------Columns--------------
Index(['item_name', 'item_id', 'item_category_id'], dtype='object')
----------Data Types-----------
item_name           object
item_id              int64
item_category_id     int64
dtype: object
-------Missing Values----------
item_name           0
item_id             0
item_category_id    0
dtype: int64
-------NULL values-------------
item_name           0
item_id             0
item_category_id    0
dtype: int64
-----Shape Of Data-------------
(22170, 3)
=============================Shops=============================
----------TOP 5 RECORDS--------
                        shop_name  shop_id
0   !Якутск Орджоникидзе, 56 фран        0
1   !Якутск ТЦ "Центральный" фран        1
2                Адыгея ТЦ "Мега"        2
3  Балашиха ТРК "Октябрь-Киномир"        3
4        Волжский ТЦ "Волга Молл"        4
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   shop_name  60 non-null     object
 1   shop_id    60 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
None
----------Describe-------------
         shop_id
count  60.000000
mean   29.500000
std    17.464249
min     0.000000
25%    14.750000
50%    29.500000
75%    44.250000
max    59.000000
----------Columns--------------
Index(['shop_name', 'shop_id'], dtype='object')
----------Data Types-----------
shop_name    object
shop_id       int64
dtype: object
-------Missing Values----------
shop_name    0
shop_id      0
dtype: int64
-------NULL values-------------
shop_name    0
shop_id      0
dtype: int64
-----Shape Of Data-------------
(60, 2)
=============================Sample Submission=============================
----------TOP 5 RECORDS--------
   ID  item_cnt_month
0   0             0.5
1   1             0.5
2   2             0.5
3   3             0.5
4   4             0.5
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 2 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   ID              214200 non-null  int64  
 1   item_cnt_month  214200 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 3.3 MB
None
----------Describe-------------
                  ID  item_cnt_month
count  214200.000000        214200.0
mean   107099.500000             0.5
std     61834.358168             0.0
min         0.000000             0.5
25%     53549.750000             0.5
50%    107099.500000             0.5
75%    160649.250000             0.5
max    214199.000000             0.5
----------Columns--------------
Index(['ID', 'item_cnt_month'], dtype='object')
----------Data Types-----------
ID                  int64
item_cnt_month    float64
dtype: object
-------Missing Values----------
ID                0
item_cnt_month    0
dtype: int64
-------NULL values-------------
ID                0
item_cnt_month    0
dtype: int64
-----Shape Of Data-------------
(214200, 2)

The code from here on is by a different author than the code above.

This author names the training data frame sales_data instead of train.

The basic_eda function is nicely organized, so I copied it over to reuse in future analyses.

Data Preprocessing

sales_data['date'] = pd.to_datetime(sales_data['date'],format = '%d.%m.%Y')
dataset = sales_data.pivot_table(index = ['shop_id','item_id'],
                                 values = ['item_cnt_day'],columns = ['date_block_num'],fill_value = 0,aggfunc='sum')
dataset.reset_index(inplace = True)
dataset.head()
                    item_cnt_day, by date_block_num (months 0-33)
   shop_id  item_id   0   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
0        0       30   0  31  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
1        0       31   0  11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
2        0       32   6  10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
3        0       33   3   3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
4        0       35   1  14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

This uses a pandas pivot table, which can be thought of as a generalization of the groupby function.

A pivot table first partitions the data by index; here, rows with the same shop_id and item_id are grouped together.

It then partitions once more by columns; here, each shop/item group is split by month.

values names the column the aggregation actually applies to; here it is item_cnt_day.

When several rows share the same shop, item, and month, their item_cnt_day values are summed (aggfunc='sum').

Empty cells are quite possible; they mean no sales were recorded for that combination, so they are filled with 0 (fill_value=0). A toy example follows.
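
A toy example of the same pivot (the tiny frame toy is invented for illustration):

toy = pd.DataFrame({'shop_id': [5, 5, 5, 7],
                    'item_id': [1, 1, 1, 1],
                    'date_block_num': [0, 0, 1, 0],
                    'item_cnt_day': [2, 3, 1, 4]})
pivot = toy.pivot_table(index=['shop_id', 'item_id'], columns='date_block_num',
                        values='item_cnt_day', aggfunc='sum', fill_value=0)
# shop 5 / item 1: 5 units in month 0 (2+3 summed), 1 unit in month 1
# shop 7 / item 1: 4 units in month 0, and fill_value puts 0 in month 1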

dataset = pd.merge(test_data,dataset,on = ['item_id','shop_id'],how = 'left')
dataset.fillna(0,inplace = True)
dataset.head()
/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/merge.py:643: UserWarning: merging between different levels can give an unintended result (1 levels on the left,2 on the right)
  warnings.warn(msg, UserWarning)
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py:3889: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  obj = obj._drop_axis(labels, axis, level=level, errors=errors)
ID shop_id item_id (item_cnt_day, 0) (item_cnt_day, 1) (item_cnt_day, 2) (item_cnt_day, 3) (item_cnt_day, 4) (item_cnt_day, 5) (item_cnt_day, 6) (item_cnt_day, 7) (item_cnt_day, 8) (item_cnt_day, 9) (item_cnt_day, 10) (item_cnt_day, 11) (item_cnt_day, 12) (item_cnt_day, 13) (item_cnt_day, 14) (item_cnt_day, 15) (item_cnt_day, 16) (item_cnt_day, 17) (item_cnt_day, 18) (item_cnt_day, 19) (item_cnt_day, 20) (item_cnt_day, 21) (item_cnt_day, 22) (item_cnt_day, 23) (item_cnt_day, 24) (item_cnt_day, 25) (item_cnt_day, 26) (item_cnt_day, 27) (item_cnt_day, 28) (item_cnt_day, 29) (item_cnt_day, 30) (item_cnt_day, 31) (item_cnt_day, 32) (item_cnt_day, 33)
0 0 5 5037 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 2.0 2.0 0.0 0.0 0.0 1.0 1.0 1.0 3.0 1.0 0.0
1 1 5 5320 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 5 5233 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0 0.0 1.0 3.0 1.0
3 3 5 5232 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 4 5 5268 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

The pivot table shows, for each shop/item pair, how many units were sold in each month.

Merging it with the test data frame therefore attaches the full monthly sales history to every shop/item pair in the test set.

Pairs with no match in the pivot table (i.e., no prior sales history) are filled with 0. A sketch for avoiding the merge warnings follows.
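
The UserWarning above comes from merging a flat-column frame (test_data) with a MultiIndex-column frame (the pivot, built with values=['item_cnt_day']). A minimal sketch of one way to avoid it: pass values as a plain string so the pivot's columns stay flat, then give the month columns explicit names (the month_0..month_33 names are my own choice):

pivot = sales_data.pivot_table(index=['shop_id', 'item_id'], columns='date_block_num',
                               values='item_cnt_day', fill_value=0, aggfunc='sum').reset_index()
pivot.columns = ['shop_id', 'item_id'] + ['month_' + str(m) for m in range(34)]  # flat column names
dataset = pd.merge(test_data, pivot, on=['item_id', 'shop_id'], how='left').fillna(0)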

Modeling the Data

dataset.drop(['shop_id','item_id','ID'],inplace = True, axis = 1)
dataset.head()

X_train = np.expand_dims(dataset.values[:,:-1],axis = 2)
y_train = dataset.values[:,-1:]
X_test = np.expand_dims(dataset.values[:,1:],axis = 2)

print(X_train.shape,y_train.shape,X_test.shape)
(214200, 33, 1) (214200, 1) (214200, 33, 1)

To prepare for modeling, the shop and item ID columns are dropped and the train and test arrays are built.

X_train: sales records from month 0 through month 32

y_train: sales records for month 33

X_test: sales records from month 1 through month 33 (so that train and test share the same shape)

The y_test we need to predict is month 34, i.e., sales for November 2015. A toy check of the window shift follows.
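
A toy check of the one-step window shift (the months array is illustrative, not from the original):

months = np.arange(34)     # month indices 0..33 exist in the training data
X_tr_window = months[:-1]  # months 0..32 -> the model learns to predict month 33
X_te_window = months[1:]   # months 1..33 -> same window length, now predicting month 34
print(X_tr_window.shape, X_te_window.shape)  # both (33,)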

from keras.models import Sequential
from keras.layers import LSTM,Dense,Dropout

my_model = Sequential()
my_model.add(LSTM(units = 64,input_shape = (33,1)))
my_model.add(Dropout(0.4))
my_model.add(Dense(1))

my_model.compile(loss = 'mse',optimizer = 'adam', metrics = ['mean_squared_error'])
my_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 64)                16896     
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 16,961
Trainable params: 16,961
Non-trainable params: 0
_________________________________________________________________
my_model.fit(X_train,y_train,batch_size = 4096,epochs = 10)
Epoch 1/10
53/53 [==============================] - 27s 471ms/step - loss: 30.6011 - mean_squared_error: 30.6011
Epoch 2/10
53/53 [==============================] - 25s 466ms/step - loss: 30.2430 - mean_squared_error: 30.2430
Epoch 3/10
53/53 [==============================] - 24s 462ms/step - loss: 30.0014 - mean_squared_error: 30.0014
Epoch 4/10
53/53 [==============================] - 25s 481ms/step - loss: 29.8476 - mean_squared_error: 29.8476
Epoch 5/10
53/53 [==============================] - 26s 482ms/step - loss: 29.7404 - mean_squared_error: 29.7404
Epoch 6/10
53/53 [==============================] - 26s 487ms/step - loss: 29.7396 - mean_squared_error: 29.7396
Epoch 7/10
53/53 [==============================] - 25s 480ms/step - loss: 29.7369 - mean_squared_error: 29.7369
Epoch 8/10
53/53 [==============================] - 25s 473ms/step - loss: 29.6503 - mean_squared_error: 29.6503
Epoch 9/10
53/53 [==============================] - 25s 472ms/step - loss: 29.6353 - mean_squared_error: 29.6353
Epoch 10/10
53/53 [==============================] - 25s 468ms/step - loss: 29.5096 - mean_squared_error: 29.5096
<keras.callbacks.History at 0x7f2b3e51ff90>

The model is an LSTM, a standard time-series approach. This was actually my first time using an LSTM.

I was somewhat short on time this week, so I have not yet worked out how LSTMs operate or how best to use them. (I will listen closely to the other presentations.)

submission_pfs = my_model.predict(X_test)
submission_pfs = submission_pfs.clip(0,20)
submission = pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
submission.to_csv('./submission.csv',index = False)
submission
            ID  item_cnt_month
0            0        0.396485
1            1        0.103207
2            2        0.743674
3            3        0.135947
4            4        0.103207
...        ...             ...
214195  214195        0.331131
214196  214196        0.103207
214197  214197        0.097571
214198  214198        0.103207
214199  214199        0.069235

214200 rows × 2 columns

The model's predictions are computed and then shaped into the required submission format.

Note:

The clip function here reins in extreme values.

It has the form clip(min, max): values outside that range are clamped to the nearest bound.

data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5
df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

The example makes the behavior easy to see at a glance.

Since clip is broadly useful in other analyses, I noted it here separately.

!kaggle competitions submit -c competitive-data-science-predict-future-sales -f submission.csv -m "Message"
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
100% 3.55M/3.55M [00:04<00:00, 769kB/s]
Successfully submitted to Predict Future Sales

This command submits the file to Kaggle automatically.

The score is about 1.02, ranking around 6,000th out of roughly 12,000 participants.

Reflections

First of all, thank you for finding such a good dataset to study.

Time-series data is everywhere in the real world, so this was a field I really wanted to study, and I am glad to have had the chance to analyze it here.

Personally, the areas I most want to study are those that describe the real world, such as natural language processing and time-series analysis, rather than things like image classification.

This time I was a bit short on time, so my exploration of the LSTM, a widely used time-series model, fell short.

I will listen closely to the other presentations and study LSTM models in earnest when I have time.

Thank you.
