Connecting to Kaggle

!pip install kaggle
from google.colab import files
files.upload()

Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2021.5.30)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.62.3)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (5.0.2)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)

Saving kaggle.json to kaggle.json

{'kaggle.json': b'{"username":"ksy1998","key":"<redacted>"}'}

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c competitive-data-science-predict-future-sales

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
Downloading sample_submission.csv.zip to /content
  0% 0.00/468k [00:00<?, ?B/s]
100% 468k/468k [00:00<00:00, 69.0MB/s]
Downloading sales_train.csv.zip to /content
 38% 5.00M/13.3M [00:00<00:01, 5.79MB/s]
100% 13.3M/13.3M [00:00<00:00, 14.4MB/s]
Downloading item_categories.csv to /content
  0% 0.00/3.49k [00:00<?, ?B/s]
100% 3.49k/3.49k [00:00<00:00, 2.51MB/s]
Downloading shops.csv to /content
  0% 0.00/2.91k [00:00<?, ?B/s]
100% 2.91k/2.91k [00:00<00:00, 10.6MB/s]
Downloading test.csv.zip to /content
  0% 0.00/1.02M [00:00<?, ?B/s]
100% 1.02M/1.02M [00:00<00:00, 156MB/s]
Downloading items.csv.zip to /content
  0% 0.00/368k [00:00<?, ?B/s]
100% 368k/368k [00:00<00:00, 117MB/s]

!unzip items.csv.zip
!unzip sales_train.csv.zip
!unzip sample_submission.csv.zip
!unzip test.csv.zip

Archive:  items.csv.zip
  inflating: items.csv               
Archive:  sales_train.csv.zip
  inflating: sales_train.csv         
Archive:  sample_submission.csv.zip
  inflating: sample_submission.csv   
Archive:  test.csv.zip
  inflating: test.csv

Loading the data

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from matplotlib import pylab as plt
import matplotlib.dates as mdates
plt.rcParams['figure.figsize'] = (15.0, 8.0)
import seaborn as sns

train = pd.read_csv('./sales_train.csv')
# Note: .max() returns the largest ID / month index, not a count (IDs start at 0).
print('number of shops: ', train['shop_id'].max())
print('number of items: ', train['item_id'].max())
num_month = train['date_block_num'].max()
print('number of month: ', num_month)
print('size of train: ', train.shape)
train.head()

number of shops:  59
number of items:  22169
number of month:  33
size of train:  (2935849, 6)

Variable descriptions

date: the date; date_block_num: a month index (January 2013 => 0, October 2015 => 33)

shop_id, item_id: unique IDs of the shop and the item

item_price: the item's price; item_cnt_day: the number of units sold that day

(A negative item_cnt_day most likely means the item was returned.)

test = pd.read_csv('./test.csv')
test.head()

sub = pd.read_csv('./sample_submission.csv')
sub.head()

This Kaggle competition asks us to predict sales for November 2015.

That month corresponds to date_block_num 34.
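
As a quick check of the month index, here is a minimal sketch of the mapping implied by the variable descriptions (January 2013 => 0). The helper name `to_date_block_num` is made up for illustration and is not part of the competition data or any library.

```python
from datetime import date

def to_date_block_num(d: date, start_year: int = 2013) -> int:
    """Months elapsed since January of start_year (January 2013 -> 0)."""
    return (d.year - start_year) * 12 + (d.month - 1)

print(to_date_block_num(date(2013, 1, 1)))   # 0
print(to_date_block_num(date(2015, 10, 1)))  # 33
print(to_date_block_num(date(2015, 11, 1)))  # 34
```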

items = pd.read_csv('./items.csv')
print('number of categories: ', items['item_category_id'].max()) # the largest category id (ids start at 0)
items.head()

number of categories:  83

train_clean = train.drop(labels = ['date', 'item_price'], axis = 1)
train_clean.head()

The date column is dropped because date_block_num already stands in for it.

The item price column is dropped as well.

train_clean = train_clean.groupby(["item_id","shop_id","date_block_num"]).sum().reset_index()
train_clean = train_clean.rename(index=str, columns = {"item_cnt_day":"item_cnt_month"})
train_clean = train_clean[["item_id","shop_id","date_block_num","item_cnt_month"]]
train_clean

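The rows are grouped by month (that is, by equal date_block_num) for each item and shop, and the daily counts are summed.

This is because the test data asks for a prediction at monthly granularity.

The column is renamed to item_cnt_month to match.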
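
To see what the monthly aggregation does, here is a small sketch on made-up rows (the numbers are illustrative, not taken from sales_train.csv). A sale and a return of the same item in the same month cancel out in the monthly total.

```python
import pandas as pd

# Toy daily sales (made-up numbers); -1.0 stands for a returned item.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [25, 25, 25, 25],
    "item_id":        [2552, 2552, 2554, 2552],
    "item_cnt_day":   [1.0, -1.0, 1.0, 2.0],
})

# Sum the daily counts per (item, shop, month) and rename the result.
monthly = (
    daily.groupby(["item_id", "shop_id", "date_block_num"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```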

Practicing with time series data

check = train_clean[["shop_id","item_id","date_block_num","item_cnt_month"]]
check = check.loc[check['shop_id'] == 5]
check = check.loc[check['item_id'] == 5037]
check

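Only the rows for one specific shop_id and item_id are kept.

Since this is my first time doing time series analysis, I started with a small amount of data.

Studying the data this way should make training the LSTM model more intuitive.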

plt.figure(figsize=(10,4))
plt.title('Check - Sales of Item 5037 at Shop 5')
plt.xlabel('Month')
plt.ylabel('Sales of Item 5037 at Shop 5')
plt.plot(check["date_block_num"],check["item_cnt_month"]);

This simply plots the target (Y) values.

month_list=[i for i in range(num_month+1)] # num_month = train['date_block_num'].max(), the largest month index
shop = []
for i in range(num_month+1):
    shop.append(5)
item = []
for i in range(num_month+1):
    item.append(5037)
months_full = pd.DataFrame({'shop_id':shop, 'item_id':item,'date_block_num':month_list})
months_full.head(10)

To avoid gaps for months with no records, the data frame is set up from scratch with one row per month.

shop = []
for i in range(num_month+1):
    shop.append(5)

That said, building the list as [5]*(num_month+1) seems cleaner than this loop.
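
A minimal sketch of the more compact construction, assuming `num_month` is 33 as printed earlier:

```python
import pandas as pd

num_month = 33  # largest date_block_num in the training data

# One row per month for shop 5 and item 5037, built with list repetition.
months_full = pd.DataFrame({
    "shop_id": [5] * (num_month + 1),
    "item_id": [5037] * (num_month + 1),
    "date_block_num": list(range(num_month + 1)),
})
print(months_full.shape)  # (34, 3)
```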

sales_33month = pd.merge(check, months_full, how='right', on=['shop_id','item_id','date_block_num'])
sales_33month = sales_33month.sort_values(by=['date_block_num'])
sales_33month.fillna(0.00,inplace=True)
plt.figure(figsize=(10,4))
plt.title('Check - Sales of Item 5037 at Shop 5 for whole period')
plt.xlabel('Month')
plt.ylabel('Sales of Item 5037 at Shop 5')
plt.plot(sales_33month["date_block_num"],sales_33month["item_cnt_month"]);

The plot now includes the months with no purchases, filled in with 0.
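
The gap-filling step can be sketched on a tiny made-up series: a right merge against the full list of months keeps every month, and the months without sales come out as NaN before being set to 0.

```python
import pandas as pd

# Toy monthly sales with months 1 and 3 missing (made-up numbers).
observed = pd.DataFrame({"date_block_num": [0, 2, 4],
                         "item_cnt_month": [1.0, 2.0, 3.0]})
all_months = pd.DataFrame({"date_block_num": list(range(5))})

filled = (
    observed.merge(all_months, how="right", on="date_block_num")
    .sort_values("date_block_num")
    .fillna({"item_cnt_month": 0.0})
    .reset_index(drop=True)
)
print(filled["item_cnt_month"].tolist())  # [1.0, 0.0, 2.0, 0.0, 3.0]
```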

for i in range(1,6):
    sales_33month["T_" + str(i)] = sales_33month.item_cnt_month.shift(i)
sales_33month.fillna(0.0, inplace=True)
df = sales_33month[['shop_id','item_id','date_block_num','T_1','T_2','T_3','T_4','T_5', 'item_cnt_month']].reset_index()
df = df.drop(labels = ['index'], axis = 1)
df

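This feels like taking time series analysis apart from the basics.

T_1 through T_5 hold the target values from the previous five months. For example, T_1 is the value from one month earlier.

Because the value being predicted depends on how sales evolved over time, this kind of lag feature suits this data.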
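
Here is a small sketch of how `shift` produces the lag columns, using a made-up four-month series and only two lags to keep the output short.

```python
import pandas as pd

# Made-up monthly sales for one shop/item pair.
sales = pd.DataFrame({"item_cnt_month": [1.0, 0.0, 2.0, 3.0]})

# T_k is the target value from k months earlier.
for lag in range(1, 3):
    sales[f"T_{lag}"] = sales["item_cnt_month"].shift(lag)

# The first rows have no history, so shift leaves NaN there; fill with 0.
sales = sales.fillna(0.0)
print(sales)
```

One thing to keep in mind: after filling with 0, a month with no history looks the same as a month with zero sales.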

Using an LSTM model

train_df = df[:-3]
val_df = df[-3:]
x_train,y_train = train_df.drop(["item_cnt_month"],axis=1),train_df.item_cnt_month
x_val,y_val = val_df.drop(["item_cnt_month"],axis=1),val_df.item_cnt_month

The last three rows are held out as validation data.
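
A minimal sketch of a time-ordered hold-out, with min-max scaling to [-1, 1] computed from the training rows only. The array and the helper `scale` are made up for illustration; this is one common way to preprocess such a split, not necessarily the only reasonable one.

```python
import numpy as np

# Toy feature matrix: 6 time-ordered rows, 2 features (made-up numbers).
X = np.array([[0, 1.0], [1, 3.0], [2, 2.0], [3, 5.0], [4, 9.0], [5, 7.0]])

n_val = 2
X_train, X_val = X[:-n_val], X[-n_val:]  # the last rows are held out

# Scaling statistics come from the training rows only.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

def scale(a):
    """Map values to [-1, 1] relative to the training minimum and maximum."""
    return 2 * (a - lo) / (hi - lo) - 1

X_train_scaled, X_val_scaled = scale(X_train), scale(X_val)
print(X_val_scaled)
```

Held-out values that lie beyond the training range end up outside [-1, 1], which is expected when the statistics are not refit on the held-out rows.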

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
model_lstm = Sequential()
model_lstm.add(LSTM(15, input_shape=(1,8)))
model_lstm.add(Dense(1))
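# Note: 'accuracy' is not a meaningful metric for a regression target; the MSE loss is the number to watch.
model_lstm.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])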

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
x_train_scaled = scaler.fit_transform(x_train)
# Reuse the scaling fitted on the training rows; refitting on the validation rows would scale the two sets differently.
x_valid_scaled = scaler.transform(x_val)

# Reshape to (samples, timesteps=1, features) as expected by the LSTM layer.
x_train_reshaped = x_train_scaled.reshape((x_train_scaled.shape[0], 1, x_train_scaled.shape[1]))
x_val_reshaped = x_valid_scaled.reshape((x_valid_scaled.shape[0], 1, x_valid_scaled.shape[1]))
history = model_lstm.fit(x_train_reshaped, y_train, validation_data=(x_val_reshaped, y_val),epochs=70, batch_size=12, verbose=2, shuffle=False)
y_pre = model_lstm.predict(x_val_reshaped)

Epoch 1/70
3/3 - 2s - loss: 0.4119 - accuracy: 0.7742 - val_loss: 3.6385 - val_accuracy: 0.3333
Epoch 2/70
3/3 - 0s - loss: 0.3959 - accuracy: 0.7742 - val_loss: 3.5825 - val_accuracy: 0.3333
Epoch 3/70
3/3 - 0s - loss: 0.3818 - accuracy: 0.7742 - val_loss: 3.5290 - val_accuracy: 0.3333
Epoch 4/70
3/3 - 0s - loss: 0.3689 - accuracy: 0.7742 - val_loss: 3.4781 - val_accuracy: 0.3333
Epoch 5/70
3/3 - 0s - loss: 0.3571 - accuracy: 0.7742 - val_loss: 3.4296 - val_accuracy: 0.3333
Epoch 6/70
3/3 - 0s - loss: 0.3464 - accuracy: 0.7742 - val_loss: 3.3839 - val_accuracy: 0.3333
Epoch 7/70
3/3 - 0s - loss: 0.3368 - accuracy: 0.7742 - val_loss: 3.3409 - val_accuracy: 0.3333
Epoch 8/70
3/3 - 0s - loss: 0.3281 - accuracy: 0.7742 - val_loss: 3.3008 - val_accuracy: 0.3333
Epoch 9/70
3/3 - 0s - loss: 0.3203 - accuracy: 0.7742 - val_loss: 3.2637 - val_accuracy: 0.3333
Epoch 10/70
3/3 - 0s - loss: 0.3132 - accuracy: 0.7742 - val_loss: 3.2296 - val_accuracy: 0.3333
Epoch 11/70
3/3 - 0s - loss: 0.3069 - accuracy: 0.7742 - val_loss: 3.1984 - val_accuracy: 0.3333
Epoch 12/70
3/3 - 0s - loss: 0.3012 - accuracy: 0.7742 - val_loss: 3.1702 - val_accuracy: 0.3333
Epoch 13/70
3/3 - 0s - loss: 0.2960 - accuracy: 0.7742 - val_loss: 3.1451 - val_accuracy: 0.3333
Epoch 14/70
3/3 - 0s - loss: 0.2913 - accuracy: 0.7742 - val_loss: 3.1228 - val_accuracy: 0.3333
Epoch 15/70
3/3 - 0s - loss: 0.2869 - accuracy: 0.7742 - val_loss: 3.1035 - val_accuracy: 0.3333
Epoch 16/70
3/3 - 0s - loss: 0.2829 - accuracy: 0.7742 - val_loss: 3.0871 - val_accuracy: 0.3333
Epoch 17/70
3/3 - 0s - loss: 0.2791 - accuracy: 0.7742 - val_loss: 3.0733 - val_accuracy: 0.3333
Epoch 18/70
3/3 - 0s - loss: 0.2755 - accuracy: 0.7742 - val_loss: 3.0623 - val_accuracy: 0.3333
Epoch 19/70
3/3 - 0s - loss: 0.2720 - accuracy: 0.7742 - val_loss: 3.0537 - val_accuracy: 0.3333
Epoch 20/70
3/3 - 0s - loss: 0.2687 - accuracy: 0.7742 - val_loss: 3.0476 - val_accuracy: 0.3333
Epoch 21/70
3/3 - 0s - loss: 0.2654 - accuracy: 0.7742 - val_loss: 3.0437 - val_accuracy: 0.3333
Epoch 22/70
3/3 - 0s - loss: 0.2622 - accuracy: 0.7742 - val_loss: 3.0419 - val_accuracy: 0.3333
Epoch 23/70
3/3 - 0s - loss: 0.2590 - accuracy: 0.7742 - val_loss: 3.0421 - val_accuracy: 0.3333
Epoch 24/70
3/3 - 0s - loss: 0.2558 - accuracy: 0.8065 - val_loss: 3.0440 - val_accuracy: 0.3333
Epoch 25/70
3/3 - 0s - loss: 0.2527 - accuracy: 0.8065 - val_loss: 3.0477 - val_accuracy: 0.3333
Epoch 26/70
3/3 - 0s - loss: 0.2495 - accuracy: 0.8387 - val_loss: 3.0528 - val_accuracy: 0.3333
Epoch 27/70
3/3 - 0s - loss: 0.2463 - accuracy: 0.8387 - val_loss: 3.0592 - val_accuracy: 0.3333
Epoch 28/70
3/3 - 0s - loss: 0.2432 - accuracy: 0.8387 - val_loss: 3.0669 - val_accuracy: 0.3333
Epoch 29/70
3/3 - 0s - loss: 0.2401 - accuracy: 0.8387 - val_loss: 3.0756 - val_accuracy: 0.3333
Epoch 30/70
3/3 - 0s - loss: 0.2370 - accuracy: 0.8387 - val_loss: 3.0853 - val_accuracy: 0.3333
Epoch 31/70
3/3 - 0s - loss: 0.2339 - accuracy: 0.8387 - val_loss: 3.0958 - val_accuracy: 0.3333
Epoch 32/70
3/3 - 0s - loss: 0.2308 - accuracy: 0.8387 - val_loss: 3.1070 - val_accuracy: 0.3333
Epoch 33/70
3/3 - 0s - loss: 0.2278 - accuracy: 0.8065 - val_loss: 3.1187 - val_accuracy: 0.6667
Epoch 34/70
3/3 - 0s - loss: 0.2248 - accuracy: 0.8065 - val_loss: 3.1310 - val_accuracy: 0.6667
Epoch 35/70
3/3 - 0s - loss: 0.2219 - accuracy: 0.8065 - val_loss: 3.1436 - val_accuracy: 0.6667
Epoch 36/70
3/3 - 0s - loss: 0.2190 - accuracy: 0.7742 - val_loss: 3.1565 - val_accuracy: 0.6667
Epoch 37/70
3/3 - 0s - loss: 0.2162 - accuracy: 0.7742 - val_loss: 3.1696 - val_accuracy: 0.3333
Epoch 38/70
3/3 - 0s - loss: 0.2134 - accuracy: 0.7742 - val_loss: 3.1829 - val_accuracy: 0.3333
Epoch 39/70
3/3 - 0s - loss: 0.2107 - accuracy: 0.8065 - val_loss: 3.1963 - val_accuracy: 0.3333
Epoch 40/70
3/3 - 0s - loss: 0.2081 - accuracy: 0.8065 - val_loss: 3.2096 - val_accuracy: 0.3333
Epoch 41/70
3/3 - 0s - loss: 0.2056 - accuracy: 0.8065 - val_loss: 3.2229 - val_accuracy: 0.3333
Epoch 42/70
3/3 - 0s - loss: 0.2031 - accuracy: 0.8065 - val_loss: 3.2361 - val_accuracy: 0.3333
Epoch 43/70
3/3 - 0s - loss: 0.2008 - accuracy: 0.8065 - val_loss: 3.2492 - val_accuracy: 0.3333
Epoch 44/70
3/3 - 0s - loss: 0.1985 - accuracy: 0.8065 - val_loss: 3.2621 - val_accuracy: 0.3333
Epoch 45/70
3/3 - 0s - loss: 0.1963 - accuracy: 0.8065 - val_loss: 3.2748 - val_accuracy: 0.3333
Epoch 46/70
3/3 - 0s - loss: 0.1941 - accuracy: 0.8065 - val_loss: 3.2872 - val_accuracy: 0.3333
Epoch 47/70
3/3 - 0s - loss: 0.1921 - accuracy: 0.8065 - val_loss: 3.2994 - val_accuracy: 0.3333
Epoch 48/70
3/3 - 0s - loss: 0.1901 - accuracy: 0.8065 - val_loss: 3.3113 - val_accuracy: 0.3333
Epoch 49/70
3/3 - 0s - loss: 0.1882 - accuracy: 0.8065 - val_loss: 3.3229 - val_accuracy: 0.3333
Epoch 50/70
3/3 - 0s - loss: 0.1864 - accuracy: 0.8065 - val_loss: 3.3342 - val_accuracy: 0.3333
Epoch 51/70
3/3 - 0s - loss: 0.1847 - accuracy: 0.8065 - val_loss: 3.3451 - val_accuracy: 0.3333
Epoch 52/70
3/3 - 0s - loss: 0.1830 - accuracy: 0.8065 - val_loss: 3.3558 - val_accuracy: 0.3333
Epoch 53/70
3/3 - 0s - loss: 0.1814 - accuracy: 0.8065 - val_loss: 3.3661 - val_accuracy: 0.3333
Epoch 54/70
3/3 - 0s - loss: 0.1799 - accuracy: 0.8065 - val_loss: 3.3760 - val_accuracy: 0.3333
Epoch 55/70
3/3 - 0s - loss: 0.1785 - accuracy: 0.8065 - val_loss: 3.3855 - val_accuracy: 0.3333
Epoch 56/70
3/3 - 0s - loss: 0.1771 - accuracy: 0.8065 - val_loss: 3.3947 - val_accuracy: 0.3333
Epoch 57/70
3/3 - 0s - loss: 0.1757 - accuracy: 0.8065 - val_loss: 3.4036 - val_accuracy: 0.3333
Epoch 58/70
3/3 - 0s - loss: 0.1745 - accuracy: 0.8065 - val_loss: 3.4120 - val_accuracy: 0.3333
Epoch 59/70
3/3 - 0s - loss: 0.1732 - accuracy: 0.8065 - val_loss: 3.4201 - val_accuracy: 0.3333
Epoch 60/70
3/3 - 0s - loss: 0.1720 - accuracy: 0.8065 - val_loss: 3.4278 - val_accuracy: 0.3333
Epoch 61/70
3/3 - 0s - loss: 0.1709 - accuracy: 0.8065 - val_loss: 3.4351 - val_accuracy: 0.3333
Epoch 62/70
3/3 - 0s - loss: 0.1698 - accuracy: 0.8065 - val_loss: 3.4420 - val_accuracy: 0.3333
Epoch 63/70
3/3 - 0s - loss: 0.1687 - accuracy: 0.8065 - val_loss: 3.4485 - val_accuracy: 0.3333
Epoch 64/70
3/3 - 0s - loss: 0.1677 - accuracy: 0.8065 - val_loss: 3.4547 - val_accuracy: 0.3333
Epoch 65/70
3/3 - 0s - loss: 0.1667 - accuracy: 0.8065 - val_loss: 3.4605 - val_accuracy: 0.3333
Epoch 66/70
3/3 - 0s - loss: 0.1658 - accuracy: 0.8065 - val_loss: 3.4659 - val_accuracy: 0.3333
Epoch 67/70
3/3 - 0s - loss: 0.1648 - accuracy: 0.8065 - val_loss: 3.4710 - val_accuracy: 0.3333
Epoch 68/70
3/3 - 0s - loss: 0.1639 - accuracy: 0.8065 - val_loss: 3.4758 - val_accuracy: 0.3333
Epoch 69/70
3/3 - 0s - loss: 0.1631 - accuracy: 0.8065 - val_loss: 3.4802 - val_accuracy: 0.3333
Epoch 70/70
3/3 - 0s - loss: 0.1622 - accuracy: 0.8065 - val_loss: 3.4844 - val_accuracy: 0.3333

fig, ax = plt.subplots()
ax.plot(x_val['date_block_num'], y_val, label='Actual')
ax.plot(x_val['date_block_num'], y_pre, label='Predicted')
plt.title('LSTM Prediction vs Actual Sales for last 3 months')
plt.xlabel('Month')
plt.xticks(x_val['date_block_num'])
plt.ylabel('Sales of Item 5037 at Shop 5')
ax.legend()
plt.show()

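This is the result of applying the LSTM model.

It could be called a decent fit or a disappointing one, depending on how you look at it. In the log above, the training loss keeps falling (to about 0.16) while the validation loss bottoms out near 3.04 around epoch 22 and then rises again, which suggests the model overfits this very small sample.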
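
To put a number on "how close", one option is the root mean squared error, which as far as I know is also the metric this competition is scored on. The sketch below uses made-up actual and predicted values, not the model's real output, and the helper `rmse` is my own.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and predicted sales for three held-out months.
print(rmse([1.0, 3.0, 1.0], [1.5, 1.0, 1.0]))
```

With the notebook's variables, the call would look like `rmse(y_val, y_pre)`.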

Exploring the data

sales_data = pd.read_csv('./sales_train.csv')
item_cat = pd.read_csv('./item_categories.csv')
items = pd.read_csv('./items.csv')
shops = pd.read_csv('./shops.csv')
sample_submission = pd.read_csv('./sample_submission.csv')
test_data = pd.read_csv('./test.csv')

def basic_eda(df):
    print("----------TOP 5 RECORDS--------")
    print(df.head(5))
    print("----------INFO-----------------")
    print(df.info())
    print("----------Describe-------------")
    print(df.describe())
    print("----------Columns--------------")
    print(df.columns)
    print("----------Data Types-----------")
    print(df.dtypes)
    print("-------Missing Values----------")
    print(df.isnull().sum())
    print("-------NULL values-------------")
    print(df.isna().sum())
    print("-----Shape Of Data-------------")
    print(df.shape)

print("=============================Sales Data=============================")
basic_eda(sales_data)
print("=============================Test data=============================")
basic_eda(test_data)
print("=============================Item Categories=============================")
basic_eda(item_cat)
print("=============================Items=============================")
basic_eda(items)
print("=============================Shops=============================")
basic_eda(shops)
print("=============================Sample Submission=============================")
basic_eda(sample_submission)

=============================Sales Data=============================
----------TOP 5 RECORDS--------
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int64  
 2   shop_id         int64  
 3   item_id         int64  
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
None
----------Describe-------------
       date_block_num       shop_id       item_id    item_price  item_cnt_day
count    2.935849e+06  2.935849e+06  2.935849e+06  2.935849e+06  2.935849e+06
mean     1.456991e+01  3.300173e+01  1.019723e+04  8.908532e+02  1.242641e+00
std      9.422988e+00  1.622697e+01  6.324297e+03  1.729800e+03  2.618834e+00
min      0.000000e+00  0.000000e+00  0.000000e+00 -1.000000e+00 -2.200000e+01
25%      7.000000e+00  2.200000e+01  4.476000e+03  2.490000e+02  1.000000e+00
50%      1.400000e+01  3.100000e+01  9.343000e+03  3.990000e+02  1.000000e+00
75%      2.300000e+01  4.700000e+01  1.568400e+04  9.990000e+02  1.000000e+00
max      3.300000e+01  5.900000e+01  2.216900e+04  3.079800e+05  2.169000e+03
----------Columns--------------
Index(['date', 'date_block_num', 'shop_id', 'item_id', 'item_price',
       'item_cnt_day'],
      dtype='object')
----------Data Types-----------
date               object
date_block_num      int64
shop_id             int64
item_id             int64
item_price        float64
item_cnt_day      float64
dtype: object
-------Missing Values----------
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
-------NULL values-------------
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
-----Shape Of Data-------------
(2935849, 6)
=============================Test data=============================
----------TOP 5 RECORDS--------
   ID  shop_id  item_id
0   0        5     5037
1   1        5     5320
2   2        5     5233
3   3        5     5232
4   4        5     5268
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   ID       214200 non-null  int64
 1   shop_id  214200 non-null  int64
 2   item_id  214200 non-null  int64
dtypes: int64(3)
memory usage: 4.9 MB
None
----------Describe-------------
                  ID        shop_id        item_id
count  214200.000000  214200.000000  214200.000000
mean   107099.500000      31.642857   11019.398627
std     61834.358168      17.561933    6252.644590
min         0.000000       2.000000      30.000000
25%     53549.750000      16.000000    5381.500000
50%    107099.500000      34.500000   11203.000000
75%    160649.250000      47.000000   16071.500000
max    214199.000000      59.000000   22167.000000
----------Columns--------------
Index(['ID', 'shop_id', 'item_id'], dtype='object')
----------Data Types-----------
ID         int64
shop_id    int64
item_id    int64
dtype: object
-------Missing Values----------
ID         0
shop_id    0
item_id    0
dtype: int64
-------NULL values-------------
ID         0
shop_id    0
item_id    0
dtype: int64
-----Shape Of Data-------------
(214200, 3)
=============================Item Categories=============================
----------TOP 5 RECORDS--------
        item_category_name  item_category_id
0  PC - Гарнитуры/Наушники                 0
1         Аксессуары - PS2                 1
2         Аксессуары - PS3                 2
3         Аксессуары - PS4                 3
4         Аксессуары - PSP                 4
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   item_category_name  84 non-null     object
 1   item_category_id    84 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ KB
None
----------Describe-------------
       item_category_id
count         84.000000
mean          41.500000
std           24.392622
min            0.000000
25%           20.750000
50%           41.500000
75%           62.250000
max           83.000000
----------Columns--------------
Index(['item_category_name', 'item_category_id'], dtype='object')
----------Data Types-----------
item_category_name    object
item_category_id       int64
dtype: object
-------Missing Values----------
item_category_name    0
item_category_id      0
dtype: int64
-------NULL values-------------
item_category_name    0
item_category_id      0
dtype: int64
-----Shape Of Data-------------
(84, 2)
=============================Items=============================
----------TOP 5 RECORDS--------
                                           item_name  item_id  item_category_id
0          ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76
2      ***В ЛУЧАХ СЛАВЫ   (UNV)                    D        2                40
3    ***ГОЛУБАЯ ВОЛНА  (Univ)                      D        3                40
4        ***КОРОБКА (СТЕКЛО)                       D        4                40
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 519.7+ KB
None
----------Describe-------------
           item_id  item_category_id
count  22170.00000      22170.000000
mean   11084.50000         46.290753
std     6400.07207         15.941486
min        0.00000          0.000000
25%     5542.25000         37.000000
50%    11084.50000         40.000000
75%    16626.75000         58.000000
max    22169.00000         83.000000
----------Columns--------------
Index(['item_name', 'item_id', 'item_category_id'], dtype='object')
----------Data Types-----------
item_name           object
item_id              int64
item_category_id     int64
dtype: object
-------Missing Values----------
item_name           0
item_id             0
item_category_id    0
dtype: int64
-------NULL values-------------
item_name           0
item_id             0
item_category_id    0
dtype: int64
-----Shape Of Data-------------
(22170, 3)
=============================Shops=============================
----------TOP 5 RECORDS--------
                        shop_name  shop_id
0   !Якутск Орджоникидзе, 56 фран        0
1   !Якутск ТЦ "Центральный" фран        1
2                Адыгея ТЦ "Мега"        2
3  Балашиха ТРК "Октябрь-Киномир"        3
4        Волжский ТЦ "Волга Молл"        4
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   shop_name  60 non-null     object
 1   shop_id    60 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
None
----------Describe-------------
         shop_id
count  60.000000
mean   29.500000
std    17.464249
min     0.000000
25%    14.750000
50%    29.500000
75%    44.250000
max    59.000000
----------Columns--------------
Index(['shop_name', 'shop_id'], dtype='object')
----------Data Types-----------
shop_name    object
shop_id       int64
dtype: object
-------Missing Values----------
shop_name    0
shop_id      0
dtype: int64
-------NULL values-------------
shop_name    0
shop_id      0
dtype: int64
-----Shape Of Data-------------
(60, 2)
=============================Sample Submission=============================
----------TOP 5 RECORDS--------
   ID  item_cnt_month
0   0             0.5
1   1             0.5
2   2             0.5
3   3             0.5
4   4             0.5
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 2 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   ID              214200 non-null  int64  
 1   item_cnt_month  214200 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 3.3 MB
None
----------Describe-------------
                  ID  item_cnt_month
count  214200.000000        214200.0
mean   107099.500000             0.5
std     61834.358168             0.0
min         0.000000             0.5
25%     53549.750000             0.5
50%    107099.500000             0.5
75%    160649.250000             0.5
max    214199.000000             0.5
----------Columns--------------
Index(['ID', 'item_cnt_month'], dtype='object')
----------Data Types-----------
ID                  int64
item_cnt_month    float64
dtype: object
-------Missing Values----------
ID                0
item_cnt_month    0
dtype: int64
-------NULL values-------------
ID                0
item_cnt_month    0
dtype: int64
-----Shape Of Data-------------
(214200, 2)

From here on, the code comes from a different author than the code above.

This author calls the train data frame sales_data.

The data exploration function looked well made, so I brought it over to reuse in future analyses.

Preprocessing the data

sales_data['date'] = pd.to_datetime(sales_data['date'],format = '%d.%m.%Y')
dataset = sales_data.pivot_table(index = ['shop_id','item_id'],
                                 values = ['item_cnt_day'],columns = ['date_block_num'],fill_value = 0,aggfunc='sum')
dataset.reset_index(inplace = True)
dataset.head()

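This uses the pivot table in pandas, which can be thought of as an extension of groupby.

A pivot table first partitions the data by index. Here, rows with the same shop_id and item_id are grouped together.

It then partitions the data once more by columns, so each shop and item pair is split out by month.

values is the quantity that actually gets aggregated. Here it is item_cnt_day.

When several item_cnt_day values share the same shop, item, and month, they are added together (aggfunc='sum').

Empty cells are quite likely, and they mean no sales were recorded for that combination, so they are filled with 0 (fill_value = 0).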
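
The pivot can be sketched on a few made-up rows: two sales of the same item in the same shop and month are summed, and months with no record become 0.

```python
import pandas as pd

# Toy daily sales (made-up numbers).
sales = pd.DataFrame({
    "shop_id":        [5, 5, 5, 6],
    "item_id":        [10, 10, 10, 10],
    "date_block_num": [0, 0, 2, 1],
    "item_cnt_day":   [1.0, 2.0, 4.0, 3.0],
})

# One row per (shop, item), one column per month, summed daily counts.
wide = sales.pivot_table(
    index=["shop_id", "item_id"],
    columns="date_block_num",
    values="item_cnt_day",
    fill_value=0,
    aggfunc="sum",
)
print(wide)
```

In this sketch `values` is passed as a plain string rather than a list, which keeps the column labels single-level (just the month numbers).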

dataset = pd.merge(test_data,dataset,on = ['item_id','shop_id'],how = 'left')
dataset.fillna(0,inplace = True)
dataset.head()

/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/merge.py:643: UserWarning: merging between different levels can give an unintended result (1 levels on the left,2 on the right)
  warnings.warn(msg, UserWarning)
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py:3889: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  obj = obj._drop_axis(labels, axis, level=level, errors=errors)

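The pivot table gives a data frame showing, for each shop and item, how many units were sold in each month.

Merging it with the test data frame gives the full month-by-month sales history for every shop and item pair in the test data.

Any pair that finds no match in the merge (presumably one with no earlier sales records) gets 0 for every month.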
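
A small sketch of the left merge with made-up frames: the pair that has no history keeps its row and gets zeros. The column names `month_0` and `month_1` are placeholders for the per-month columns.

```python
import pandas as pd

# Made-up sales history for a single (shop, item) pair.
history = pd.DataFrame({"shop_id": [5], "item_id": [10],
                        "month_0": [3.0], "month_1": [0.0]})
# Made-up test pairs; item 99 never appears in the history.
test_pairs = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [10, 99]})

merged = test_pairs.merge(history, on=["shop_id", "item_id"], how="left").fillna(0)
print(merged)
```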

Modeling

dataset.drop(['shop_id','item_id','ID'],inplace = True, axis = 1)
dataset.head()

X_train = np.expand_dims(dataset.values[:,:-1],axis = 2)
y_train = dataset.values[:,-1:]
X_test = np.expand_dims(dataset.values[:,1:],axis = 2)

print(X_train.shape,y_train.shape,X_test.shape)

(214200, 33, 1) (214200, 1) (214200, 33, 1)

To prepare the data for modeling, the shop and item columns are dropped and the train and test sets are built.

X_train: sales for months 0 through 32

y_train: sales for month 33

X_test: sales for months 1 through 33 (so that the train and test inputs have the same shape)

The y_test we have to predict is the sales for month 34, that is, November 2015.
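
The slicing above can be sketched with a tiny made-up table of 2 series and 5 months: the training input drops the last month, the target is the last month, and the test input drops the first month so that its window ends one step later.

```python
import numpy as np

# Toy wide table: 2 series x 5 months (made-up numbers).
values = np.array([[1, 2, 3, 4, 5],
                   [0, 0, 1, 0, 2]], dtype=float)

X_train = np.expand_dims(values[:, :-1], axis=2)  # months 0..3 as input
y_train = values[:, -1:]                          # month 4 as target
X_test = np.expand_dims(values[:, 1:], axis=2)    # months 1..4, used to predict month 5

print(X_train.shape, y_train.shape, X_test.shape)  # (2, 4, 1) (2, 1) (2, 4, 1)
```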

from keras.models import Sequential
from keras.layers import LSTM,Dense,Dropout

my_model = Sequential()
my_model.add(LSTM(units = 64,input_shape = (33,1)))
my_model.add(Dropout(0.4))
my_model.add(Dense(1))

my_model.compile(loss = 'mse',optimizer = 'adam', metrics = ['mean_squared_error'])
my_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 64)                16896     
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 16,961
Trainable params: 16,961
Non-trainable params: 0
_________________________________________________________________
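
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counts for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer; the helper names are made up for illustration.

```python
def lstm_params(units: int, input_dim: int) -> int:
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs: int, outputs: int) -> int:
    # One weight per input/output pair plus one bias per output.
    return inputs * outputs + outputs

print(lstm_params(64, 1))                        # 16896
print(dense_params(64, 1))                       # 65
print(lstm_params(64, 1) + dense_params(64, 1))  # 16961
```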

my_model.fit(X_train,y_train,batch_size = 4096,epochs = 10)

Epoch 1/10
53/53 [==============================] - 27s 471ms/step - loss: 30.6011 - mean_squared_error: 30.6011
Epoch 2/10
53/53 [==============================] - 25s 466ms/step - loss: 30.2430 - mean_squared_error: 30.2430
Epoch 3/10
53/53 [==============================] - 24s 462ms/step - loss: 30.0014 - mean_squared_error: 30.0014
Epoch 4/10
53/53 [==============================] - 25s 481ms/step - loss: 29.8476 - mean_squared_error: 29.8476
Epoch 5/10
53/53 [==============================] - 26s 482ms/step - loss: 29.7404 - mean_squared_error: 29.7404
Epoch 6/10
53/53 [==============================] - 26s 487ms/step - loss: 29.7396 - mean_squared_error: 29.7396
Epoch 7/10
53/53 [==============================] - 25s 480ms/step - loss: 29.7369 - mean_squared_error: 29.7369
Epoch 8/10
53/53 [==============================] - 25s 473ms/step - loss: 29.6503 - mean_squared_error: 29.6503
Epoch 9/10
53/53 [==============================] - 25s 472ms/step - loss: 29.6353 - mean_squared_error: 29.6353
Epoch 10/10
53/53 [==============================] - 25s 468ms/step - loss: 29.5096 - mean_squared_error: 29.5096

<keras.callbacks.History at 0x7f2b3e51ff90>

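The model uses an LSTM, a method for time series analysis. This was actually my first time using an LSTM model.

I was a bit short on time this week, so I have not yet worked out how the LSTM model is used or how it works. (I will listen closely to the other presentations.)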

submission_pfs = my_model.predict(X_test)
submission_pfs = submission_pfs.clip(0,20)
submission = pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
submission.to_csv('./submission.csv',index = False)
submission

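After running the data through the model to get predictions, I clipped them to the range 0 to 20 with clip and shaped the data frame to match the submission format.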

data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
df

df.clip(-4, 6)

The example makes clip easier to grasp intuitively.

This function is broadly useful and likely to come up often in other analyses, so I wrote it up separately.

!kaggle competitions submit -c competitive-data-science-predict-future-sales -f submission.csv -m "Message"

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
100% 3.55M/3.55M [00:04<00:00, 769kB/s]
Successfully submitted to Predict Future Sales

This is the command that submits the file to Kaggle automatically.

The score is about 1.02, which places around 6,000th out of roughly 12,000 participants.

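Reflections

First of all, thank you for finding data that is so good to study with.

Time series data is very common in the real world, so this was a field I really wanted to study, and I am glad I got to analyze it this time.

Personally, I would rather study areas that can describe the real world, such as natural language processing and time series analysis, than something like image classification.

I was short on time this round, so I did not dig deeply enough into the LSTM, a commonly used time series model.

I will listen closely to the other presentations and study the LSTM model properly when I have time.

Thank you.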

	ID	item_cnt_month
0	0	0.396485
1	1	0.103207
2	2	0.743674
3	3	0.135947
4	4	0.103207
...	...	...
214195	214195	0.331131
214196	214196	0.103207
214197	214197	0.097571
214198	214198	0.103207
214199	214199	0.069235

	date	shop_id	item_id	item_price	item_cnt_day
0	02.01.2013	59	22154	999.00	1.0
1	03.01.2013	25	2552	899.00	1.0
2	05.01.2013	25	2552	899.00	-1.0
3	06.01.2013	25	2554	1709.05	1.0
4	15.01.2013	25	2555	1099.00	1.0

	item_name	item_id	item_category_id
0	! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D	0	40
1	!ABBYY FineReader 12 Professional Edition Full...	1	76
2	***В ЛУЧАХ СЛАВЫ (UNV) D	2	40
3	***ГОЛУБАЯ ВОЛНА (Univ) D	3	40
4	***КОРОБКА (СТЕКЛО) D	4	40

	item_id	shop_id	date_block_num	item_cnt_month
0	0	54	20	1.0
1	1	55	15	2.0
2	1	55	18	1.0
3	1	55	19	1.0
4	1	55	20	1.0
...	...	...	...	...
1609119	22168	12	8	1.0
1609120	22168	16	1	1.0
1609121	22168	42	1	1.0
1609122	22168	43	2	1.0
1609123	22169	25	14	1.0

	shop_id	item_id	date_block_num	item_cnt_month
400439	5	5037	20	1.0
400440	5	5037	22	1.0
400441	5	5037	23	2.0
400442	5	5037	24	2.0
400443	5	5037	28	1.0
400444	5	5037	29	1.0
400445	5	5037	30	1.0
400446	5	5037	31	3.0
400447	5	5037	32	1.0

	shop_id	item_id	item_cnt_day
date_block_num			0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33
0	0	30	0	31	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	31	0	11	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	32	6	10	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	33	3	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	0	35	1	14	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

	ID	shop_id	item_id	(item_cnt_day, 20)	(item_cnt_day, 22)	(item_cnt_day, 23)	(item_cnt_day, 24)	(item_cnt_day, 28)	(item_cnt_day, 29)	(item_cnt_day, 30)	(item_cnt_day, 31)	(item_cnt_day, 32)	(item_cnt_day, 33)
0	0	5	5037	1.0	1.0	2.0	2.0	1.0	1.0	1.0	3.0	1.0	0.0
1	1	5	5320	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	2	5	5233	0.0	0.0	0.0	0.0	3.0	2.0	0.0	1.0	3.0	1.0
3	3	5	5232	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
4	4	5	5268	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

	col_0	col_1
0	9	-2
1	-3	-7
2	0	6
3	-1	8
4	5	-5

	col_0	col_1
0	6	-2
1	-3	-4
2	0	6
3	-1	6
4	5	-4