Splitting Time Series with Scikit-learn

Amir Hossein Jazayeri
Apr 28, 2021

There is a fundamental difference between time series data and other types of data. The sequential order of observations matters in a time series, so shuffling or any other kind of sampling that disrupts that order cannot be used.

Trend, seasonality, and cyclicality are important to capture, whether you use machine learning models, ARIMA models, or anything else. So you need to split the data in a way that preserves these attributes.
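For a single train/test split, one simple way to keep the order intact is to turn shuffling off. The sketch below uses a made-up series of 10 values (not the data from this post) and scikit-learn's train_test_split with shuffle=False, so the earliest observations go to the training set and the latest ones to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy series: 10 observations already in time order.
values = np.arange(10)

# shuffle=False keeps the original order: the first 80% become the
# training set and the last 20% become the test set.
train, test = train_test_split(values, test_size=0.2, shuffle=False)

print("TRAIN:", train)
print("TEST:", test)
```

With the default shuffle=True the same call would mix past and future observations, which is exactly what we want to avoid here.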

Cross-validation on time series can also be a problem for the same reason: a standard k-fold split would train on observations that come after the test set. Scikit-learn addresses this with TimeSeriesSplit, a great tool to get the job done.

Let’s see how!

We have a pandas Series with 50 observations:

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
TP=pd.Series([9248.9, 9248.9, 9178.3, 9130.5, 9089.2, 9023.7, 8973.3, 8907. ,8890.7, 8869. , 8857.1, 8851.6, 8832.2, 8803.4, 8795.7, 8786.7,8757.8, 8726.9, 8695. , 8656.3, 8641.8, 8619.6, 8596.4, 8575.7,8542. , 8519. , 8463.7, 8456.9, 8458.1, 8459.9, 8467.4, 8496.9,8514.3, 8553.7, 8519.1, 8507.3, 8509.1, 8505.6, 8508.3, 8491.8,8496.2, 8483.4, 8466.4, 8451.3, 8438. , 8423.7, 8430. , 8420. ,8412.7, 8390.2],index=['2008-12-04', '2008-12-05', '2008-12-06', '2008-12-07', '2008-12-08','2008-12-10', '2008-12-13', '2008-12-14', '2008-12-15', '2008-12-16','2008-12-20', '2008-12-21', '2008-12-22', '2008-12-23', '2008-12-24','2008-12-27', '2008-12-28', '2008-12-29', '2008-12-30', '2008-12-31','2009-01-03', '2009-01-04', '2009-01-05', '2009-01-10', '2009-01-11','2009-01-12', '2009-01-13', '2009-01-14', '2009-01-17', '2009-01-18','2009-01-19', '2009-01-20', '2009-01-21', '2009-01-24', '2009-01-25','2009-01-26', '2009-01-27', '2009-01-28', '2009-01-31', '2009-02-01','2009-02-02', '2009-02-03', '2009-02-04', '2009-02-07', '2009-02-08','2009-02-09', '2009-02-11', '2009-02-14', '2009-02-15', '2009-02-17'])

Then we create a TimeSeriesSplit object.

# With no parameters
tscv = TimeSeriesSplit()
for train_index, test_index in tscv.split(TP):
    print("TRAIN:", train_index, "TEST:", test_index)
    # split() yields positional indices, so select rows with .iloc
    X_train, X_test = TP.iloc[train_index], TP.iloc[test_index]

TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11 12 13 14 15 16 17]

TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] TEST: [18 19 20 21 22 23 24 25]

TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25] TEST: [26 27 28 29 30 31 32 33]

TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33] TEST: [34 35 36 37 38 39 40 41]

TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41] TEST: [42 43 44 45 46 47 48 49]

As you can see, we now have 5 train/test sets (the TimeSeriesSplit default) that respect the order of the time series.
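If you want to check this property in code rather than by eye, a small sanity check like the one below can help. It is only a sketch: it uses a stand-in array of 50 positions instead of the actual series, since the split indices depend only on the number of observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for the 50-observation series above; only the length
# matters for the split indices.
data = np.arange(50)

tscv = TimeSeriesSplit()  # n_splits defaults to 5
folds = list(tscv.split(data))

for train_index, test_index in folds:
    # Every training position comes strictly before every test position.
    assert train_index.max() < test_index.min()
    print(len(train_index), "train /", len(test_index), "test")
```

The training window grows from fold to fold while the test window keeps the same size, which matches the output printed above.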

We have 4 optional parameters that we can use to modify our split:

1- n_splits

2- max_train_size

3- test_size

4- gap
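To get a feel for what each parameter does, it can help to change them one at a time. The sketch below uses a hypothetical series of 20 positions and a small helper named show (my own, not part of scikit-learn) that prints the folds for a given splitter. As far as I understand the defaults, n_splits is 5, max_train_size and test_size are None, and gap is 0. When test_size is not set, it works out to roughly the number of samples divided by n_splits + 1.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(20)  # hypothetical series of 20 observations


def show(label, splitter):
    """Print every train/test fold produced by the given splitter."""
    print(label)
    for train_index, test_index in splitter.split(data):
        print("  TRAIN:", train_index, "TEST:", test_index)


# Number of folds.
show("n_splits=2", TimeSeriesSplit(n_splits=2))
# Cap the training window at the 4 most recent observations.
show("max_train_size=4", TimeSeriesSplit(n_splits=2, max_train_size=4))
# Fix the size of every test set.
show("test_size=3", TimeSeriesSplit(n_splits=2, test_size=3))
# Drop 2 observations between the end of train and the start of test.
show("gap=2", TimeSeriesSplit(n_splits=2, gap=2))
```

Note that test_size and gap are only available in newer scikit-learn releases, so an older installation may reject those arguments.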

A more customized TimeSeriesSplit object can be defined like this.

Let's say we need 3 splits, with a maximum train size of 15 and a test size of 2. We also want a gap between each train set and its test set (for example, we want to skip 2 observations after the end of the train set before the test set begins).

Using the provided parameters in TimeSeriesSplit, we can easily achieve that:

# Using all parameters
tscv = TimeSeriesSplit(n_splits=3, max_train_size=15, test_size=2, gap=2)
for train_index, test_index in tscv.split(TP):
    print("TRAIN:", train_index, "TEST:", test_index)
    # split() yields positional indices, so select rows with .iloc
    X_train, X_test = TP.iloc[train_index], TP.iloc[test_index]

TRAIN: [27 28 29 30 31 32 33 34 35 36 37 38 39 40 41] TEST: [44 45]

TRAIN: [29 30 31 32 33 34 35 36 37 38 39 40 41 42 43] TEST: [46 47]

TRAIN: [31 32 33 34 35 36 37 38 39 40 41 42 43 44 45] TEST: [48 49]
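A TimeSeriesSplit object can also be passed as the cv argument to scikit-learn's cross-validation helpers. The sketch below is only an illustration under my own assumptions: the data is a synthetic noisy trend (not the series from this post), the model is a plain LinearRegression, and the single feature is the previous observation (a lag-1 feature).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical data: a noisy upward trend of 50 points.
rng = np.random.default_rng(0)
y = np.arange(50, dtype=float) + rng.normal(scale=0.5, size=50)

# Predict each value from the previous one (a lag-1 feature).
X = y[:-1].reshape(-1, 1)
target = y[1:]

# Same splitter settings as in the example above.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=15, test_size=2, gap=2)

# One score per fold; each fold trains only on observations that come
# before its test window.
scores = cross_val_score(
    LinearRegression(), X, target,
    cv=tscv, scoring="neg_mean_absolute_error",
)
print(scores)
```

Each score is the negative mean absolute error on one test window, so values closer to zero are better.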
