Open In Colab

데이터 불러오기

install.packages("TTR")
install.packages("forecast")
install.packages("tseries")
library(TTR)
library(forecast)
library(tseries)

data <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
birth <- ts(data, frequency = 12, start = c(1946, 1))
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘xts’, ‘zoo’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘quadprog’, ‘quantmod’, ‘fracdiff’, ‘lmtest’, ‘timeDate’, ‘tseries’, ‘urca’, ‘RcppArmadillo’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 

필요한 패키지들을 설치하고, '뉴욕 월별 출생자수' 데이터를 입력받습니다.

ts 함수를 쓰면 보다 편리하게 원 자료를 시계열 자료로 변환이 가능합니다.

data
<ol class=list-inline>
  • 26.663
  • 23.598
  • 26.931
  • 24.74
  • 25.806
  • 24.364
  • 24.477
  • 23.901
  • 23.175
  • 23.227
  • 21.672
  • 21.87
  • 21.439
  • 21.089
  • 23.709
  • 21.669
  • 21.752
  • 20.761
  • 23.479
  • 23.824
  • 23.105
  • 23.11
  • 21.759
  • 22.073
  • 21.937
  • 20.035
  • 23.59
  • 21.672
  • 22.222
  • 22.123
  • 23.95
  • 23.504
  • 22.238
  • 23.142
  • 21.059
  • 21.573
  • 21.548
  • 20
  • 22.424
  • 20.615
  • 21.761
  • 22.874
  • 24.104
  • 23.748
  • 23.262
  • 22.907
  • 21.519
  • 22.025
  • 22.604
  • 20.894
  • 24.677
  • 23.673
  • 25.32
  • 23.583
  • 24.671
  • 24.454
  • 24.122
  • 24.252
  • 22.084
  • 22.991
  • 23.287
  • 23.049
  • 25.076
  • 24.037
  • 24.43
  • 24.667
  • 26.451
  • 25.618
  • 25.014
  • 25.11
  • 22.964
  • 23.981
  • 23.798
  • 22.27
  • 24.775
  • 22.646
  • 23.988
  • 24.737
  • 26.276
  • 25.816
  • 25.21
  • 25.199
  • 23.162
  • 24.707
  • 24.364
  • 22.644
  • 25.565
  • 24.062
  • 25.431
  • 24.635
  • 27.009
  • 26.606
  • 26.268
  • 26.462
  • 25.246
  • 25.18
  • 24.657
  • 23.304
  • 26.982
  • 26.199
  • 27.21
  • 26.122
  • 26.706
  • 26.878
  • 26.152
  • 26.379
  • 24.712
  • 25.688
  • 24.99
  • 24.239
  • 26.721
  • 23.475
  • 24.767
  • 26.219
  • 28.361
  • 28.599
  • 27.914
  • 27.784
  • 25.693
  • 26.881
  • 26.217
  • 24.218
  • 27.914
  • 26.975
  • 28.527
  • 27.139
  • 28.982
  • 28.169
  • 28.056
  • 29.136
  • 26.291
  • 26.987
  • 26.589
  • 24.848
  • 27.543
  • 26.896
  • 28.878
  • 27.39
  • 28.065
  • 28.141
  • 29.048
  • 28.484
  • 26.634
  • 27.735
  • 27.132
  • 24.924
  • 28.963
  • 26.589
  • 27.931
  • 28.009
  • 29.229
  • 28.759
  • 28.405
  • 27.945
  • 25.912
  • 26.619
  • 26.076
  • 25.286
  • 27.66
  • 25.951
  • 26.398
  • 25.565
  • 28.865
  • 30
  • 29.261
  • 29.012
  • 26.992
  • 27.897
  • </ol>
    birth
    
    A Time Series: 14 × 12
    JanFebMarAprMayJunJulAugSepOctNovDec
    194626.66323.59826.93124.74025.80624.36424.47723.90123.17523.22721.67221.870
    194721.43921.08923.70921.66921.75220.76123.47923.82423.10523.11021.75922.073
    194821.93720.03523.59021.67222.22222.12323.95023.50422.23823.14221.05921.573
    194921.54820.00022.42420.61521.76122.87424.10423.74823.26222.90721.51922.025
    195022.60420.89424.67723.67325.32023.58324.67124.45424.12224.25222.08422.991
    195123.28723.04925.07624.03724.43024.66726.45125.61825.01425.11022.96423.981
    195223.79822.27024.77522.64623.98824.73726.27625.81625.21025.19923.16224.707
    195324.36422.64425.56524.06225.43124.63527.00926.60626.26826.46225.24625.180
    195424.65723.30426.98226.19927.21026.12226.70626.87826.15226.37924.71225.688
    195524.99024.23926.72123.47524.76726.21928.36128.59927.91427.78425.69326.881
    195626.21724.21827.91426.97528.52727.13928.98228.16928.05629.13626.29126.987
    195726.58924.84827.54326.89628.87827.39028.06528.14129.04828.48426.63427.735
    195827.13224.92428.96326.58927.93128.00929.22928.75928.40527.94525.91226.619
    195926.07625.28627.66025.95126.39825.56528.86530.00029.26129.01226.99227.897
    class(birth)
    
    'ts'

    데이터 관찰

    plot(birth)
    

    전반적인 추세도 조금 있는 것 같고, 계절성이 있는 것 같아요.

    평균과 분산이 시간에 따라 달라지는 것으로 관찰되는데 정상성을 만족하진 않는 자료인것 같습니다.

    autoplot(decompose(birth))
    

    시계열 자료를 원 데이터, 트렌드, 계절성, 그외 오차로 나눠주는 autoplot(decompose()) 함수 입니다.

    자료를 시각적으로 보기 매우 좋은 것 같아요.

    acf(birth, lag.max=20)
    

    acf 자기상관계수 입니다. 빠르게 감소하는 지점이 있다면 MA를 쓰는게 좋으나 그렇지 않아 보입니다.

    pacf(birth, lag.max=20)
    

    pacf 편자기상관계수 입니다. 빠르게 감소하는 지점이 있다면 AR을 쓰는 것이 적절합니다.

    AR 모델 적합

    fit <- ar(birth, method = 'mle')
    fit
    
    Warning message in arima0(x, order = c(i, 0L, 0L), include.mean = demean):
    “possible convergence problem: optim gave code = 1”
    Warning message in arima0(x, order = c(i, 0L, 0L), include.mean = demean):
    “possible convergence problem: optim gave code = 1”
    
    Call:
    ar(x = birth, method = "mle")
    
    Coefficients:
          1        2        3        4        5        6        7        8  
     0.4043   0.2923  -0.0888  -0.0623   0.0544  -0.0443  -0.0171  -0.0640  
          9       10       11       12  
     0.1755   0.1134  -0.2982   0.5094  
    
    Order selected 12  sigma^2 estimated as  0.8879

    자동으로 P값을 정해서 계산해줍니다.

    est.1 <- arima(birth, order = c(11,0,0), fixed = c(0,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
    est.1
    
    Warning message in arima(birth, order = c(11, 0, 0), fixed = c(0, NA, NA, NA, NA, :
    “some AR parameters were fixed: setting transform.pars = FALSE”
    
    Call:
    arima(x = birth, order = c(11, 0, 0), fixed = c(0, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA, NA, NA))
    
    Coefficients:
          ar1     ar2     ar3      ar4     ar5      ar6      ar7      ar8     ar9
            0  0.5759  0.1475  -0.1288  0.0248  -0.0355  -0.0245  -0.1414  0.1476
    s.e.    0  0.0773  0.0818   0.0880  0.0871   0.0894   0.0907   0.0897  0.0897
            ar10     ar11  intercept
          0.3941  -0.0112    25.6374
    s.e.  0.0816   0.0800     1.2054
    
    sigma^2 estimated as 1.319:  log likelihood = -263.59,  aic = 551.18

    order = (p,d,q) 이고 fixed는 0인경우 그 항의 계수를 사용하지 않고 NA인 경우 계수를 사용하겠다는 의미입니다.

    acf(est$residuals)
    

    ACF 값이 안정회된 모습입니다.

    Box.test(est$residuals)
    
    	Box-Pierce test
    
    data:  est$residuals
    X-squared = 0.30321, df = 1, p-value = 0.5819
    

    통계적 검증도 할 수 있습니다.

    plot(birth, type = 'l')
    
    lines(fitted(est), col = 2, lty = 1)
    

    예측한 값을 그래프로 표현했습니다.

    fitted(est, h = 3)
    
    A Time Series: 14 × 12
    JanFebMarAprMayJunJulAugSepOctNovDec
    1946 NA NA NA26.3996324.8402526.0714425.1257726.2249824.4634425.6299924.5075724.29708
    194725.6244122.5437025.1882422.0363724.0506922.9105923.2367022.2612621.6071822.8313822.0837822.79384
    194822.5564721.8336723.2343721.6676621.9338721.8231623.7022323.5388523.3122223.5890322.0923422.28697
    194922.7198720.6323323.1493121.5247022.5787922.0104223.2078122.8924722.5965123.6681521.9315422.83969
    195022.2195120.8176722.7598121.7483122.8338624.1221225.2035625.0951623.6956323.7050922.5056423.09822
    195123.8876622.1614424.9889323.5112525.6430323.9966725.1576124.1572924.3861025.2032123.4778424.36874
    195224.3843623.6383224.9873524.1541824.3221224.6099225.3441024.7359624.7197625.3477423.7810924.75321
    195324.3258522.7220224.9705023.4142424.6614225.1656726.3352625.7817524.9226725.8847724.2344425.65019
    195425.3772724.2708325.7523524.1849625.6815725.4628627.6650827.0873226.7374826.3218125.2163725.31379
    195525.3705724.1130826.9952425.6383526.8538825.9054725.4487825.3398325.8036827.1861326.0332427.22084
    195626.2659624.7787426.7067624.5581525.6721527.0028829.1062929.1603427.7274428.0943125.7961427.04030
    195727.2529725.4974927.8609926.5863227.9676726.6004728.5800128.0825327.7999328.3282426.1789827.54414
    195826.7828625.8453028.0331026.7666727.8822427.6390628.0175627.4896228.5647328.7088226.9413027.60250
    195927.1852225.0860727.9373325.7904627.7266427.2224528.2721627.2495426.5573127.4912326.8617127.76910

    fitted 함수를 가지고 모델로 만든 예측값을 사용했습니다. h를 3으로 제한했기 때문에 앞 3개 값은 나올 수가 없습니다.

    MA 모델 적합

    ma.est = arima(birth, order = c(0,0,9))
    ma.est
    
    Call:
    arima(x = birth, order = c(0, 0, 9))
    
    Coefficients:
             ma1     ma2     ma3     ma4     ma5     ma6     ma7     ma8     ma9
          0.6346  0.9636  0.5796  0.9329  0.6558  0.7272  0.5426  0.2053  0.4116
    s.e.  0.0728  0.1019  0.1162  0.1137  0.1242  0.1304  0.1380  0.1186  0.0779
          intercept
            25.0739
    s.e.     0.5602
    
    sigma^2 estimated as 1.244:  log likelihood = -261.17,  aic = 544.33

    order = c(p, d, q) 에서 q값이 MA 개수를 결정합니다.

    fitted(ma.est, h = 1)
    
    A Time Series: 14 × 12
    JanFebMarAprMayJunJulAugSepOctNovDec
    194625.9525125.6714325.4056124.8239926.1992024.8545324.8431824.5605123.6660423.8584322.4720422.74461
    194721.8418521.6628221.8094323.0516723.4551521.6742422.2255322.9359224.3425423.6807722.7947223.63520
    194821.7521522.0434721.3419222.1019223.8812921.0489823.3952524.5002723.7453823.8765922.0509022.97825
    194921.2127120.7785521.6395221.7028822.2182921.4515224.0364724.6617123.9006924.4084222.9229022.43632
    195021.6700922.1028522.3676122.9934225.2743024.9894725.4804524.3933325.0251624.6806122.6750523.35984
    195122.1952323.3759123.5419624.5055225.9828623.9542425.2406526.2260325.5426825.3456124.4495924.05728
    195223.3061123.5839723.0151224.0243123.9869322.7968626.1316625.6411226.1636226.3857424.0223924.91265
    195323.2595223.9308923.8738023.5063825.7878124.6751825.7693226.5152827.0099526.6401525.2889325.92493
    195424.8597224.1185123.5620425.5980627.3446325.7852327.4876026.7862726.5303826.7151024.6628925.81467
    195524.3957524.4492725.0280925.4333625.6250023.5679326.4426527.8154828.0546128.4375227.0956627.21027
    195625.3697925.5665424.9468425.3091727.6899127.0642328.6315128.2040528.7661028.3021826.8358827.38308
    195725.2639226.0218425.0188725.9374728.2890326.9381029.0112627.3841627.3982229.4472426.9110526.89012
    195826.8808926.4454825.5705326.4543528.2629326.6750028.0038628.4062529.0636128.4463126.2598327.32193
    195925.2262624.8898925.6054026.2173426.9225525.5433926.5643127.8208029.6043029.3203328.0251427.98380
    fitted(ma.est, h = 10)
    
    A Time Series: 14 × 12
    JanFebMarAprMayJunJulAugSepOctNovDec
    1946 NA NA NA NA NA NA NA NA NA NA25.0739225.07392
    194725.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    194825.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    194925.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195025.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195125.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195325.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195425.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195525.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195625.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195725.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195825.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392
    195925.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.0739225.07392

    h가 일정 수준 이상 커지면 MA 모델은 평균값으로 수렴합니다.

    ARIMA 모델 자동 적합

    est <- auto.arima(birth, stepwise = FALSE, max.p = 12, max.q = 10)
    est
    
    Series: birth 
    ARIMA(2,1,1)(1,1,1)[12] 
    
    Coefficients:
             ar1     ar2      ma1     sar1     sma1
          0.4349  -0.241  -0.4999  -0.2474  -0.8465
    s.e.  0.1846   0.085   0.1854   0.0986   0.1004
    
    sigma^2 estimated as 0.406:  log likelihood=-157.78
    AIC=327.56   AICc=328.12   BIC=345.82

    auto.arima 함수를 가지고 자동으로 arima 작업을 했습니다.

    이론적으로 p, q 값을 5 이하로 하는 것이 적당한데, 그에 맞는 결과입니다.

    d값은 추세를 의미합니다.

    fitted(est)
    
    A Time Series: 14 × 12
    JanFebMarAprMayJunJulAugSepOctNovDec
    194626.6476123.5934926.9252324.7376325.8031524.3629524.4760023.9006623.1753923.2273021.6737621.88578
    194721.5434319.3638824.1283021.1353522.2970320.4299421.8754823.0792822.5488822.7159121.5274022.16409
    194821.6982319.9355623.1413821.4245622.1980721.0277223.2526523.3630422.3818222.3192021.6465021.36331
    194921.4059019.9560622.9200720.3674021.3067520.8877724.1694123.3737422.7647823.2987121.3452422.00278
    195021.8944620.8678023.8633322.5826824.0185524.1362724.7969824.5299723.7380924.4663022.4882322.58123
    195123.0163821.6258625.7532922.7626624.6973923.9937526.1559426.0225024.7187025.2293123.5116123.49823
    195224.1587422.0497925.4609423.0136023.7916323.5509926.0611825.6663024.9179225.2967923.4549023.78538
    195324.8594222.8260225.6219824.1101224.9151624.9161026.0268526.6010325.7068526.2827424.5333425.73220
    195425.0920523.4111326.3466225.3267826.8485026.6911627.4261226.5356326.4695926.2594624.4976125.56646
    195525.6143723.5257027.1437324.8910624.6015624.9403227.9955427.5183727.6568527.6971425.9155826.36729
    195626.8670024.6656627.4015726.6969727.8457827.7346628.3871428.7182627.5200528.2873827.2877326.76079
    195726.9763825.4196527.8216225.8665327.9586528.5211128.6529828.1728227.7986929.0638826.4481727.42565
    195827.4642525.6816628.1013427.3981227.5463727.7831429.6763428.8280528.1790128.7869925.9810726.86454
    195926.4564924.7495128.2397526.1561727.3107726.0011227.3708828.7432929.2140728.9004926.9426527.87884
    plot(birth, type = 'l')
    
    lines(fitted(est), col = 2, lty = 1)
    

    이전 모델보다 더 잘 적합된 모습입니다.

    느낀점

    간단한 ARIMA 모형에 대해서 공부했습니다.

    AR와 MA가 무엇인지 부분적으로는 이해했는데 완벽히, 직관적으로, 누구에게 설명할 정도로 이해한 것 같진 않아서 조금 걱정입니다.

    요즘 딥러닝으로 시계열 자료를 많이 예측합니다만 소규모 데이터 셋은 과적합 방지를 위해, 또 설명력을 가지기 위해 ARIMA 모델도 사용합니다.

    시계열분석에 기초인 ARIMA 모델을 배워서 그동안 시계열 데이터를 보면 막막했던게 조금 사라진 것 같아요.