.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/preparing_data_for_ts_prediction.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_preparing_data_for_ts_prediction.py:

============================================
Data Preparation for Time Series Prediction
============================================

This example demonstrates how to prepare data for time series prediction,
especially for deep learning models such as LSTMs and other RNNs.

.. GENERATED FROM PYTHON SOURCE LINES 10-26

.. code-block:: Python

    import time
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from tensorflow.keras.layers import Input, LSTM, Dense
    from tensorflow.keras.models import Model

    from aqua_fetch import RainfallRunoff

    print("tf: ", tf.__version__)
    print("np: ", np.__version__)
    print('pd: ', pd.__version__)

    from utils import prepare_data, prepare_data_sample

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    tf: 2.7.0
    np: 1.21.6
    pd: 2.0.3

.. GENERATED FROM PYTHON SOURCE LINES 27-29

First, we create a simple dataset with 2000 rows and 1 column, i.e. a
univariate time series with no covariates.

.. GENERATED FROM PYTHON SOURCE LINES 29-34

.. code-block:: Python

    rows = 2000
    cols = 1
    data = np.arange(int(rows*cols)).reshape(-1,rows).transpose()

.. GENERATED FROM PYTHON SOURCE LINES 35-37

Below we print the first 10 rows, the shape of the dataset, and the last 10 rows
to give an overview of the data structure.

.. GENERATED FROM PYTHON SOURCE LINES 37-42

.. code-block:: Python

    print(data[0:10])
    print('\n {} \n'.format(data.shape))
    print(data[-10:])

.. rst-class:: sphx-glr-script-out

..
code-block:: none [[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]] (2000, 1) [[1990] [1991] [1992] [1993] [1994] [1995] [1996] [1997] [1998] [1999]] .. GENERATED FROM PYTHON SOURCE LINES 43-47 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=1, num_outputs=1, lookback=4) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1997, 4, 1) (1997, 3, 1) (1997, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 48-49 Checking the first sample/example/data point .. GENERATED FROM PYTHON SOURCE LINES 49-51 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[0], [1], [2], [3]]) .. GENERATED FROM PYTHON SOURCE LINES 52-55 .. code-block:: Python _y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[0], [1], [2]]) .. GENERATED FROM PYTHON SOURCE LINES 56-59 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[3]]) .. GENERATED FROM PYTHON SOURCE LINES 60-61 Checking the second sample/example/data point .. GENERATED FROM PYTHON SOURCE LINES 61-63 .. code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[1], [2], [3], [4]]) .. GENERATED FROM PYTHON SOURCE LINES 64-67 .. code-block:: Python _y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[1], [2], [3]]) .. GENERATED FROM PYTHON SOURCE LINES 68-71 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[4]]) .. GENERATED FROM PYTHON SOURCE LINES 72-75 Now we create another dataset with 2000 rows but with 6 columns i.e. multivariate timeseries. Each column can represent a different feature or variable in the time series data. The dataset is filled with sequential integers for demonstration purposes. .. GENERATED FROM PYTHON SOURCE LINES 75-84 .. 
code-block:: Python rows = 2000 cols = 6 data = np.arange(int(rows*cols)).reshape(-1,rows).transpose() print(data[0:10]) print('\n {} \n'.format(data.shape)) print(data[-10:]) .. rst-class:: sphx-glr-script-out .. code-block:: none [[ 0 2000 4000 6000 8000 10000] [ 1 2001 4001 6001 8001 10001] [ 2 2002 4002 6002 8002 10002] [ 3 2003 4003 6003 8003 10003] [ 4 2004 4004 6004 8004 10004] [ 5 2005 4005 6005 8005 10005] [ 6 2006 4006 6006 8006 10006] [ 7 2007 4007 6007 8007 10007] [ 8 2008 4008 6008 8008 10008] [ 9 2009 4009 6009 8009 10009]] (2000, 6) [[ 1990 3990 5990 7990 9990 11990] [ 1991 3991 5991 7991 9991 11991] [ 1992 3992 5992 7992 9992 11992] [ 1993 3993 5993 7993 9993 11993] [ 1994 3994 5994 7994 9994 11994] [ 1995 3995 5995 7995 9995 11995] [ 1996 3996 5996 7996 9996 11996] [ 1997 3997 5997 7997 9997 11997] [ 1998 3998 5998 7998 9998 11998] [ 1999 3999 5999 7999 9999 11999]] .. GENERATED FROM PYTHON SOURCE LINES 85-87 If this were a multivariate time series with no covariates then we would use the same approach as before i.e. set the num_inputs equal to that of num_outputs. .. GENERATED FROM PYTHON SOURCE LINES 87-92 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=6, num_outputs=6, lookback=4) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1997, 4, 6) (1997, 3, 6) (1997, 6, 1) .. GENERATED FROM PYTHON SOURCE LINES 93-94 Checking the first sample/example/data point .. GENERATED FROM PYTHON SOURCE LINES 94-96 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000, 10000], [ 1, 2001, 4001, 6001, 8001, 10001], [ 2, 2002, 4002, 6002, 8002, 10002], [ 3, 2003, 4003, 6003, 8003, 10003]]) .. GENERATED FROM PYTHON SOURCE LINES 97-100 .. code-block:: Python _y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000, 10000], [ 1, 2001, 4001, 6001, 8001, 10001], [ 2, 2002, 4002, 6002, 8002, 10002]]) .. 
GENERATED FROM PYTHON SOURCE LINES 101-104 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 3], [ 2003], [ 4003], [ 6003], [ 8003], [10003]]) .. GENERATED FROM PYTHON SOURCE LINES 105-106 Checking the second sample/example/data point .. GENERATED FROM PYTHON SOURCE LINES 106-108 .. code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001, 10001], [ 2, 2002, 4002, 6002, 8002, 10002], [ 3, 2003, 4003, 6003, 8003, 10003], [ 4, 2004, 4004, 6004, 8004, 10004]]) .. GENERATED FROM PYTHON SOURCE LINES 109-112 .. code-block:: Python _y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001, 10001], [ 2, 2002, 4002, 6002, 8002, 10002], [ 3, 2003, 4003, 6003, 8003, 10003]]) .. GENERATED FROM PYTHON SOURCE LINES 113-116 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 4], [ 2004], [ 4004], [ 6004], [ 8004], [10004]]) .. GENERATED FROM PYTHON SOURCE LINES 117-120 However, if this were a multivariate time series with covariates, i.e. one timeseries column is our target variable and the others are input features, we would need to adjust the data preparation accordingly. .. GENERATED FROM PYTHON SOURCE LINES 120-125 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1997, 4, 5) (1997, 3, 1) (1997, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 126-129 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000], [ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003]]) .. GENERATED FROM PYTHON SOURCE LINES 130-133 .. code-block:: Python _y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10000], [10001], [10002]]) .. GENERATED FROM PYTHON SOURCE LINES 134-137 .. 
code-block:: Python

    y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10003]])

.. GENERATED FROM PYTHON SOURCE LINES 138-141

.. code-block:: Python

    x[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   1, 2001, 4001, 6001, 8001],
           [   2, 2002, 4002, 6002, 8002],
           [   3, 2003, 4003, 6003, 8003],
           [   4, 2004, 4004, 6004, 8004]])

.. GENERATED FROM PYTHON SOURCE LINES 142-145

.. code-block:: Python

    _y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10001],
           [10002],
           [10003]])

.. GENERATED FROM PYTHON SOURCE LINES 146-149

.. code-block:: Python

    y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10004]])

.. GENERATED FROM PYTHON SOURCE LINES 150-151

Consider the case where the number of input features is 4 and the number of
output features is 2. Here ``num_outputs`` is not passed, and the output shapes
show that the remaining 2 columns are treated as outputs.

.. GENERATED FROM PYTHON SOURCE LINES 151-156

.. code-block:: Python

    x, _y, y = prepare_data(data, num_inputs=4, lookback=4)
    print(x.shape, _y.shape, y.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (1997, 4, 4) (1997, 3, 2) (1997, 2, 1)

.. GENERATED FROM PYTHON SOURCE LINES 157-160

.. code-block:: Python

    x[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   0, 2000, 4000, 6000],
           [   1, 2001, 4001, 6001],
           [   2, 2002, 4002, 6002],
           [   3, 2003, 4003, 6003]])

.. GENERATED FROM PYTHON SOURCE LINES 161-164

.. code-block:: Python

    _y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[ 8000, 10000],
           [ 8001, 10001],
           [ 8002, 10002]])

.. GENERATED FROM PYTHON SOURCE LINES 165-168

.. code-block:: Python

    y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[ 8003],
           [10003]])

.. GENERATED FROM PYTHON SOURCE LINES 169-172

.. code-block:: Python

    x[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   1, 2001, 4001, 6001],
           [   2, 2002, 4002, 6002],
           [   3, 2003, 4003, 6003],
           [   4, 2004, 4004, 6004]])

.. GENERATED FROM PYTHON SOURCE LINES 173-176

.. code-block:: Python

    _y[1]

.. rst-class:: sphx-glr-script-out

..
code-block:: none

    array([[ 8001, 10001],
           [ 8002, 10002],
           [ 8003, 10003]])

.. GENERATED FROM PYTHON SOURCE LINES 177-180

.. code-block:: Python

    y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[ 8004],
           [10004]])

.. GENERATED FROM PYTHON SOURCE LINES 181-186

Nowcasting vs forecasting
--------------------------
If ``forecast_step`` is greater than 0, we are forecasting, i.e. predicting
into the future. With ``forecast_step=1``, the input window ends at timestep
``t`` and the target is taken from timestep ``t+1``.

.. GENERATED FROM PYTHON SOURCE LINES 186-191

.. code-block:: Python

    x, _y, y = prepare_data(data, num_inputs=5, lookback=4, forecast_step=1)
    print(x.shape, _y.shape, y.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (1996, 4, 5) (1996, 3, 1) (1996, 1, 1)

.. GENERATED FROM PYTHON SOURCE LINES 192-193

First sample

.. GENERATED FROM PYTHON SOURCE LINES 193-195

.. code-block:: Python

    x[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   0, 2000, 4000, 6000, 8000],
           [   1, 2001, 4001, 6001, 8001],
           [   2, 2002, 4002, 6002, 8002],
           [   3, 2003, 4003, 6003, 8003]])

.. GENERATED FROM PYTHON SOURCE LINES 196-199

.. code-block:: Python

    _y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10000],
           [10001],
           [10002]])

.. GENERATED FROM PYTHON SOURCE LINES 200-203

.. code-block:: Python

    y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10004]])

.. GENERATED FROM PYTHON SOURCE LINES 204-205

Second sample

.. GENERATED FROM PYTHON SOURCE LINES 205-207

.. code-block:: Python

    x[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   1, 2001, 4001, 6001, 8001],
           [   2, 2002, 4002, 6002, 8002],
           [   3, 2003, 4003, 6003, 8003],
           [   4, 2004, 4004, 6004, 8004]])

.. GENERATED FROM PYTHON SOURCE LINES 208-211

.. code-block:: Python

    _y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10001],
           [10002],
           [10003]])

.. GENERATED FROM PYTHON SOURCE LINES 212-215

..
code-block:: Python

    y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10005]])

.. GENERATED FROM PYTHON SOURCE LINES 216-217

If we want to forecast multiple timesteps into the future, we can set
``forecast_len`` to a value greater than 1.

.. GENERATED FROM PYTHON SOURCE LINES 217-222

.. code-block:: Python

    x, _y, y = prepare_data(data, num_inputs=5, lookback=4, forecast_step=1, forecast_len=2)
    print(x.shape, _y.shape, y.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (1995, 4, 5) (1995, 3, 1) (1995, 1, 2)

.. GENERATED FROM PYTHON SOURCE LINES 223-226

.. code-block:: Python

    x[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   0, 2000, 4000, 6000, 8000],
           [   1, 2001, 4001, 6001, 8001],
           [   2, 2002, 4002, 6002, 8002],
           [   3, 2003, 4003, 6003, 8003]])

.. GENERATED FROM PYTHON SOURCE LINES 227-230

.. code-block:: Python

    _y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10000],
           [10001],
           [10002]])

.. GENERATED FROM PYTHON SOURCE LINES 231-234

.. code-block:: Python

    y[0]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10004, 10005]])

.. GENERATED FROM PYTHON SOURCE LINES 235-238

.. code-block:: Python

    x[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[   1, 2001, 4001, 6001, 8001],
           [   2, 2002, 4002, 6002, 8002],
           [   3, 2003, 4003, 6003, 8003],
           [   4, 2004, 4004, 6004, 8004]])

.. GENERATED FROM PYTHON SOURCE LINES 239-242

.. code-block:: Python

    _y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10001],
           [10002],
           [10003]])

.. GENERATED FROM PYTHON SOURCE LINES 243-246

.. code-block:: Python

    y[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[10005, 10006]])

.. GENERATED FROM PYTHON SOURCE LINES 247-249

If ``forecast_step`` is 0, we are nowcasting: the prediction starts at the
current timestep ``t``, i.e. the inputs up to the current timestep are used to
predict the output at the current timestep.

.. GENERATED FROM PYTHON SOURCE LINES 249-254

..
code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4, forecast_step=0, forecast_len=2) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1996, 4, 5) (1996, 3, 1) (1996, 1, 2) .. GENERATED FROM PYTHON SOURCE LINES 255-258 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000], [ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003]]) .. GENERATED FROM PYTHON SOURCE LINES 259-262 .. code-block:: Python _y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10000], [10001], [10002]]) .. GENERATED FROM PYTHON SOURCE LINES 263-266 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10003, 10004]]) .. GENERATED FROM PYTHON SOURCE LINES 267-270 .. code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003], [ 4, 2004, 4004, 6004, 8004]]) .. GENERATED FROM PYTHON SOURCE LINES 271-274 .. code-block:: Python _y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10001], [10002], [10003]]) .. GENERATED FROM PYTHON SOURCE LINES 275-278 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10004, 10005]]) .. GENERATED FROM PYTHON SOURCE LINES 279-281 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=1, forecast_step=0) .. GENERATED FROM PYTHON SOURCE LINES 282-283 changing input_steps .. GENERATED FROM PYTHON SOURCE LINES 283-287 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4, input_steps=2) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1993, 4, 5) (1993, 3, 1) (1993, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 288-291 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. 
code-block:: none array([[ 0, 2000, 4000, 6000, 8000], [ 2, 2002, 4002, 6002, 8002], [ 4, 2004, 4004, 6004, 8004], [ 6, 2006, 4006, 6006, 8006]]) .. GENERATED FROM PYTHON SOURCE LINES 292-295 .. code-block:: Python _y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10000], [10002], [10004]]) .. GENERATED FROM PYTHON SOURCE LINES 296-299 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10006]]) .. GENERATED FROM PYTHON SOURCE LINES 300-303 .. code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001], [ 3, 2003, 4003, 6003, 8003], [ 5, 2005, 4005, 6005, 8005], [ 7, 2007, 4007, 6007, 8007]]) .. GENERATED FROM PYTHON SOURCE LINES 304-307 .. code-block:: Python _y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10001], [10003], [10005]]) .. GENERATED FROM PYTHON SOURCE LINES 308-312 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10007]]) .. GENERATED FROM PYTHON SOURCE LINES 313-314 changing output_steps .. GENERATED FROM PYTHON SOURCE LINES 314-318 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4, output_steps=2) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1996, 4, 5) (1996, 3, 1) (1996, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 319-322 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000], [ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003]]) .. GENERATED FROM PYTHON SOURCE LINES 323-326 .. code-block:: Python _y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10000], [10001], [10002]]) .. GENERATED FROM PYTHON SOURCE LINES 327-330 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10003]]) .. GENERATED FROM PYTHON SOURCE LINES 331-334 .. 
code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003], [ 4, 2004, 4004, 6004, 8004]]) .. GENERATED FROM PYTHON SOURCE LINES 335-338 .. code-block:: Python _y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10001], [10002], [10003]]) .. GENERATED FROM PYTHON SOURCE LINES 339-342 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10004]]) .. GENERATED FROM PYTHON SOURCE LINES 343-344 using known future inputs .. GENERATED FROM PYTHON SOURCE LINES 344-353 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4, forecast_step=1, forecast_len=4, known_future_inputs=True) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1989, 8, 5) (1989, 7, 1) (1989, 1, 4) .. GENERATED FROM PYTHON SOURCE LINES 354-357 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000], [ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003], [ 4, 2004, 4004, 6004, 8004], [ 5, 2005, 4005, 6005, 8005], [ 6, 2006, 4006, 6006, 8006], [ 7, 2007, 4007, 6007, 8007]]) .. GENERATED FROM PYTHON SOURCE LINES 358-361 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10004, 10005, 10006, 10007]]) .. GENERATED FROM PYTHON SOURCE LINES 362-365 .. code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001], [ 2, 2002, 4002, 6002, 8002], [ 3, 2003, 4003, 6003, 8003], [ 4, 2004, 4004, 6004, 8004], [ 5, 2005, 4005, 6005, 8005], [ 6, 2006, 4006, 6006, 8006], [ 7, 2007, 4007, 6007, 8007], [ 8, 2008, 4008, 6008, 8008]]) .. GENERATED FROM PYTHON SOURCE LINES 366-368 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. 
code-block:: none array([[10005, 10006, 10007, 10008]]) .. GENERATED FROM PYTHON SOURCE LINES 369-370 using known future inputs with forecast_step=2 .. GENERATED FROM PYTHON SOURCE LINES 370-382 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4, forecast_len=4, forecast_step=2, input_steps=2, output_steps=2, known_future_inputs=True) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1976, 8, 5) (1976, 7, 1) (1976, 1, 4) .. GENERATED FROM PYTHON SOURCE LINES 383-386 .. code-block:: Python x[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 0, 2000, 4000, 6000, 8000], [ 2, 2002, 4002, 6002, 8002], [ 4, 2004, 4004, 6004, 8004], [ 6, 2006, 4006, 6006, 8006], [ 8, 2008, 4008, 6008, 8008], [ 10, 2010, 4010, 6010, 8010], [ 12, 2012, 4012, 6012, 8012], [ 14, 2014, 4014, 6014, 8014]]) .. GENERATED FROM PYTHON SOURCE LINES 387-390 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10008, 10010, 10012, 10014]]) .. GENERATED FROM PYTHON SOURCE LINES 391-394 .. code-block:: Python x[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[ 1, 2001, 4001, 6001, 8001], [ 3, 2003, 4003, 6003, 8003], [ 5, 2005, 4005, 6005, 8005], [ 7, 2007, 4007, 6007, 8007], [ 9, 2009, 4009, 6009, 8009], [ 11, 2011, 4011, 6011, 8011], [ 13, 2013, 4013, 6013, 8013], [ 15, 2015, 4015, 6015, 8015]]) .. GENERATED FROM PYTHON SOURCE LINES 395-398 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10009, 10011, 10013, 10015]]) .. GENERATED FROM PYTHON SOURCE LINES 399-402 Handling missing values -------------------------- Consider the case where missing values are present in the output/target variable/feature .. GENERATED FROM PYTHON SOURCE LINES 402-415 .. 
code-block:: Python data = np.arange(int(rows*cols)).reshape(-1,rows).transpose() rng = np.random.default_rng(seed=313) # for reproducibility # create a random mask for the last column mask = rng.integers(0, 2, size=data[:, -1].shape).astype(bool) # introduce NaNs in the last column data = data.astype(float) data[mask, -1] = None print(data[0:10]) print('\n {} \n'.format(data.shape)) print(data[-10:]) .. rst-class:: sphx-glr-script-out .. code-block:: none [[ 0. 2000. 4000. 6000. 8000. 10000.] [ 1. 2001. 4001. 6001. 8001. nan] [ 2. 2002. 4002. 6002. 8002. 10002.] [ 3. 2003. 4003. 6003. 8003. 10003.] [ 4. 2004. 4004. 6004. 8004. 10004.] [ 5. 2005. 4005. 6005. 8005. nan] [ 6. 2006. 4006. 6006. 8006. nan] [ 7. 2007. 4007. 6007. 8007. 10007.] [ 8. 2008. 4008. 6008. 8008. nan] [ 9. 2009. 4009. 6009. 8009. nan]] (2000, 6) [[ 1990. 3990. 5990. 7990. 9990. 11990.] [ 1991. 3991. 5991. 7991. 9991. 11991.] [ 1992. 3992. 5992. 7992. 9992. nan] [ 1993. 3993. 5993. 7993. 9993. nan] [ 1994. 3994. 5994. 7994. 9994. 11994.] [ 1995. 3995. 5995. 7995. 9995. 11995.] [ 1996. 3996. 5996. 7996. 9996. 11996.] [ 1997. 3997. 5997. 7997. 9997. 11997.] [ 1998. 3998. 5998. 7998. 9998. 11998.] [ 1999. 3999. 5999. 7999. 9999. 11999.]] .. GENERATED FROM PYTHON SOURCE LINES 416-420 .. code-block:: Python x, _y, y = prepare_data(data, num_inputs=5, lookback=4) print(x.shape, _y.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1997, 4, 5) (1997, 3, 1) (1997, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 421-424 .. code-block:: Python y[0] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10003.]]) .. GENERATED FROM PYTHON SOURCE LINES 425-428 .. code-block:: Python y[1] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[10004.]]) .. GENERATED FROM PYTHON SOURCE LINES 429-432 .. code-block:: Python y[2] .. rst-class:: sphx-glr-script-out .. code-block:: none array([[nan]]) .. GENERATED FROM PYTHON SOURCE LINES 433-435 .. code-block:: Python y[3] .. 
rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[nan]])

.. GENERATED FROM PYTHON SOURCE LINES 436-438

.. code-block:: Python

    y[4], y[5], y[6]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (array([[10007.]]), array([[nan]]), array([[nan]]))

.. GENERATED FROM PYTHON SOURCE LINES 439-441

We should now remove all samples with NaN in the output. This reduces the
number of samples.

.. GENERATED FROM PYTHON SOURCE LINES 441-452

.. code-block:: Python

    nan_idx_y = np.isnan(y).any(axis=(1, 2))
    non_nan_idx_y = np.invert(nan_idx_y)

    x = x[non_nan_idx_y]
    _y = _y[non_nan_idx_y]
    y = y[non_nan_idx_y]

    print(x.shape, _y.shape, y.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (955, 4, 5) (955, 3, 1) (955, 1, 1)

.. GENERATED FROM PYTHON SOURCE LINES 453-454

Now consider the case where the missing values are in the input features.

.. GENERATED FROM PYTHON SOURCE LINES 454-466

.. code-block:: Python

    data = np.arange(int(rows*cols)).reshape(-1,rows).transpose()

    rng = np.random.default_rng(seed=313)  # for reproducibility
    # put missing at random positions in the input data
    mask = rng.integers(0, 50, size=data[:, :-1].shape).astype(bool)
    data = data.astype(float)
    data[:, :-1][~mask] = np.nan
    print(data[0:10])
    print('\n {} \n'.format(data.shape))
    print(data[-10:])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [[    0.  2000.  4000.  6000.  8000. 10000.]
     [    1.  2001.  4001.  6001.  8001. 10001.]
     [    2.  2002.  4002.  6002.  8002. 10002.]
     [    3.  2003.    nan  6003.  8003. 10003.]
     [    4.  2004.  4004.  6004.  8004. 10004.]
     [    5.  2005.  4005.  6005.  8005. 10005.]
     [    6.  2006.  4006.  6006.  8006. 10006.]
     [    7.  2007.  4007.  6007.  8007. 10007.]
     [    8.  2008.  4008.  6008.    nan 10008.]
     [    9.  2009.  4009.  6009.  8009. 10009.]]

     (2000, 6)

    [[ 1990.  3990.  5990.  7990.  9990. 11990.]
     [ 1991.  3991.  5991.  7991.  9991. 11991.]
     [ 1992.  3992.    nan  7992.  9992. 11992.]
     [ 1993.  3993.  5993.  7993.  9993. 11993.]
     [ 1994.  3994.  5994.  7994.  9994. 11994.]
     [ 1995.  3995.  5995.  7995.  9995. 11995.]
     [ 1996.  3996.  5996.  7996.  9996. 11996.]
     [ 1997.  3997.  5997.  7997.  9997. 11997.]
     [ 1998.  3998.  5998.  7998.  9998. 11998.]
     [ 1999.  3999.  5999.  7999.  9999. 11999.]]

.. GENERATED FROM PYTHON SOURCE LINES 467-472

.. code-block:: Python

    x, _y, y = prepare_data(data, num_inputs=5, lookback=5)
    print(x.shape, _y.shape, y.shape)
    x[-3]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (1996, 5, 5) (1996, 4, 1) (1996, 1, 1)

    array([[1993., 3993., 5993., 7993., 9993.],
           [1994., 3994., 5994., 7994., 9994.],
           [1995., 3995., 5995., 7995., 9995.],
           [1996., 3996., 5996., 7996., 9996.],
           [1997., 3997., 5997., 7997., 9997.]])

.. GENERATED FROM PYTHON SOURCE LINES 473-476

.. code-block:: Python

    x[-4]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[1992., 3992.,   nan, 7992., 9992.],
           [1993., 3993., 5993., 7993., 9993.],
           [1994., 3994., 5994., 7994., 9994.],
           [1995., 3995., 5995., 7995., 9995.],
           [1996., 3996., 5996., 7996., 9996.]])

.. GENERATED FROM PYTHON SOURCE LINES 477-480

.. code-block:: Python

    y[-4]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[11996.]])

.. GENERATED FROM PYTHON SOURCE LINES 481-482

We should also remove all samples with NaN in the input (``x``).

.. GENERATED FROM PYTHON SOURCE LINES 482-493

.. code-block:: Python

    nan_idx_x = np.isnan(x).any(axis=(1, 2))
    non_nan_idx_x = np.invert(nan_idx_x)

    x = x[non_nan_idx_x]
    _y = _y[non_nan_idx_x]
    y = y[non_nan_idx_x]

    print(x.shape, _y.shape, y.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (1188, 5, 5) (1188, 4, 1) (1188, 1, 1)

.. GENERATED FROM PYTHON SOURCE LINES 494-501

Making batches
--------------
A batch is a group of samples, i.e. (x, y) pairs. The concept of a batch is
important in deep learning because neural networks are usually not trained on
all the data at once. Instead, we divide the data into batches, feed a single
batch to the neural network, train on it, and then feed the next batch.

..
GENERATED FROM PYTHON SOURCE LINES 501-508

.. code-block:: Python

    lookback = 4
    num_inputs = 5
    data = np.arange(int(rows*cols)).reshape(-1,rows).transpose()
    x, _y, y = prepare_data(data, num_inputs=num_inputs, lookback=lookback)
    print(x.shape, _y.shape, y.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (1997, 4, 5) (1997, 3, 1) (1997, 1, 1)

.. GENERATED FROM PYTHON SOURCE LINES 509-510

Consider the following example of training an LSTM on a dataset of ~2000 samples.

.. GENERATED FROM PYTHON SOURCE LINES 510-518

.. code-block:: Python

    inputs = Input(shape=(lookback, num_inputs))
    lstm = LSTM(32)(inputs)
    output = Dense(1)(lstm)
    model = Model(inputs=inputs, outputs=output)
    model.compile(optimizer='adam', loss='mse')

.. GENERATED FROM PYTHON SOURCE LINES 519-520

.. code-block:: Python

    model.fit(x, y, epochs=2, batch_size=128)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Epoch 1/2
     1/16 [>.............................] - ETA: 12s - loss: 121331936.0000
    16/16 [==============================] - 1s 1ms/step - loss: 121338976.0000
    Epoch 2/2
     1/16 [>.............................] - ETA: 0s - loss: 121454472.0000
    16/16 [==============================] - 0s 1ms/step - loss: 121330960.0000

.. GENERATED FROM PYTHON SOURCE LINES 521-523

When we trained the model on the whole data, i.e. 1997 samples, each epoch
consisted of 16 batches. This is because we set the batch size to 128, and
1997 / 128 rounded up is 16.

.. GENERATED FROM PYTHON SOURCE LINES 523-526

.. code-block:: Python

    pred = model.predict(x)

.. GENERATED FROM PYTHON SOURCE LINES 527-539

Using a generator
-----------------
In the previous example, we had 1997 samples, each of shape (4, 5), and ``x``
contained all of them. Since this is a small dataset, all the samples fit in
memory. In the real world, however, we may have large datasets with e.g.
millions of samples, which cannot all fit in memory.
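
As a rough illustration of this point, the sketch below estimates how much
memory ``x`` would occupy for a given number of samples. The helper
``estimate_x_memory_gb``, the ``float32`` dtype, and the size of the large
dataset (50 million samples of shape (365, 20)) are assumptions made for
illustration; they are not part of the original example.

.. code-block:: Python

    import numpy as np

    def estimate_x_memory_gb(num_samples, lookback, num_inputs, dtype=np.float32):
        """Approximate size in GB of an array of shape (num_samples, lookback, num_inputs)."""
        bytes_per_value = np.dtype(dtype).itemsize
        return num_samples * lookback * num_inputs * bytes_per_value / 1e9

    # the small dataset used above: 1997 samples of shape (4, 5)
    small = estimate_x_memory_gb(1997, lookback=4, num_inputs=5)

    # a hypothetical large dataset: 50 million samples of shape (365, 20)
    large = estimate_x_memory_gb(50_000_000, lookback=365, num_inputs=20)

    print(f"small: {small:.6f} GB, large: {large:.0f} GB")

Under these assumptions the small array needs well under a megabyte, while the
large one would need more than a terabyte.
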
With datasets of that size, we cannot hold ``x`` with millions of samples in
memory, especially when each sample is itself large. In such cases, we can use
a data generator to load and preprocess the data in batches ourselves: we
prepare only those samples that are required at the moment. At any given time,
``x`` then does not consist of all the samples but only of those needed for the
current **batch**.

.. GENERATED FROM PYTHON SOURCE LINES 539-550

.. code-block:: Python

    cols = 6
    rows = 200
    lookback = 4
    num_inputs = 5
    data = np.arange(int(rows*cols)).reshape(-1,rows).transpose()

    x0, _, y0 = prepare_data_sample(data, index=0, lookback=lookback, num_inputs=num_inputs)
    x0

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[  0, 200, 400, 600, 800],
           [  1, 201, 401, 601, 801],
           [  2, 202, 402, 602, 802],
           [  3, 203, 403, 603, 803]])

.. GENERATED FROM PYTHON SOURCE LINES 551-553

The function ``prepare_data_sample`` returns a single sample at a time; the
``index`` parameter specifies which sample to return.

.. GENERATED FROM PYTHON SOURCE LINES 553-556

.. code-block:: Python

    y0

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[1003]])

.. GENERATED FROM PYTHON SOURCE LINES 557-558

So, to get the second sample, we call the function with ``index=1``.

.. GENERATED FROM PYTHON SOURCE LINES 558-563

.. code-block:: Python

    x1, _, y1 = prepare_data_sample(data, index=1, lookback=lookback, num_inputs=num_inputs)
    x1

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[  1, 201, 401, 601, 801],
           [  2, 202, 402, 602, 802],
           [  3, 203, 403, 603, 803],
           [  4, 204, 404, 604, 804]])

.. GENERATED FROM PYTHON SOURCE LINES 564-568

.. code-block:: Python

    y1

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[1004]])

..
GENERATED FROM PYTHON SOURCE LINES 569-570

Similarly, to get the fifth sample, we call the function with ``index=4``.

.. GENERATED FROM PYTHON SOURCE LINES 570-575

.. code-block:: Python

    x4, _, y4 = prepare_data_sample(data, index=4, lookback=lookback, num_inputs=num_inputs)
    x4

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[  4, 204, 404, 604, 804],
           [  5, 205, 405, 605, 805],
           [  6, 206, 406, 606, 806],
           [  7, 207, 407, 607, 807]])

.. GENERATED FROM PYTHON SOURCE LINES 576-579

.. code-block:: Python

    y4

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[1007]])

.. GENERATED FROM PYTHON SOURCE LINES 580-581

Now we can create a generator function that yields samples from the dataset
one at a time.

.. GENERATED FROM PYTHON SOURCE LINES 581-607

.. code-block:: Python

    def sample_generator(data: np.ndarray,
                         lookback,
                         num_inputs,
                         num_outputs=None,
                         input_steps=1,
                         forecast_step=0,
                         forecast_len=1,
                         known_future_inputs=False,
                         output_steps=1):
        for i in range(len(data) - lookback * input_steps + 1 - forecast_step - forecast_len * output_steps):
            x, _, y = prepare_data_sample(data,
                                          index=i,
                                          lookback=lookback,
                                          num_inputs=num_inputs,
                                          num_outputs=num_outputs,
                                          input_steps=input_steps,
                                          forecast_step=forecast_step,
                                          forecast_len=forecast_len,
                                          known_future_inputs=known_future_inputs,
                                          output_steps=output_steps
                                          )
            # Skip samples with NaNs in x or y
            if np.isnan(x).any() or np.isnan(y).any():
                continue
            yield x, y

    gen = sample_generator(data, lookback, num_inputs)
    for idx, (x, y) in enumerate(gen):
        print(idx, x.shape, y.shape)

.. rst-class:: sphx-glr-script-out

..
code-block:: none 0 (4, 5) (1, 1) 1 (4, 5) (1, 1) 2 (4, 5) (1, 1) 3 (4, 5) (1, 1) 4 (4, 5) (1, 1) 5 (4, 5) (1, 1) 6 (4, 5) (1, 1) 7 (4, 5) (1, 1) 8 (4, 5) (1, 1) 9 (4, 5) (1, 1) 10 (4, 5) (1, 1) 11 (4, 5) (1, 1) 12 (4, 5) (1, 1) 13 (4, 5) (1, 1) 14 (4, 5) (1, 1) 15 (4, 5) (1, 1) 16 (4, 5) (1, 1) 17 (4, 5) (1, 1) 18 (4, 5) (1, 1) 19 (4, 5) (1, 1) 20 (4, 5) (1, 1) 21 (4, 5) (1, 1) 22 (4, 5) (1, 1) 23 (4, 5) (1, 1) 24 (4, 5) (1, 1) 25 (4, 5) (1, 1) 26 (4, 5) (1, 1) 27 (4, 5) (1, 1) 28 (4, 5) (1, 1) 29 (4, 5) (1, 1) 30 (4, 5) (1, 1) 31 (4, 5) (1, 1) 32 (4, 5) (1, 1) 33 (4, 5) (1, 1) 34 (4, 5) (1, 1) 35 (4, 5) (1, 1) 36 (4, 5) (1, 1) 37 (4, 5) (1, 1) 38 (4, 5) (1, 1) 39 (4, 5) (1, 1) 40 (4, 5) (1, 1) 41 (4, 5) (1, 1) 42 (4, 5) (1, 1) 43 (4, 5) (1, 1) 44 (4, 5) (1, 1) 45 (4, 5) (1, 1) 46 (4, 5) (1, 1) 47 (4, 5) (1, 1) 48 (4, 5) (1, 1) 49 (4, 5) (1, 1) 50 (4, 5) (1, 1) 51 (4, 5) (1, 1) 52 (4, 5) (1, 1) 53 (4, 5) (1, 1) 54 (4, 5) (1, 1) 55 (4, 5) (1, 1) 56 (4, 5) (1, 1) 57 (4, 5) (1, 1) 58 (4, 5) (1, 1) 59 (4, 5) (1, 1) 60 (4, 5) (1, 1) 61 (4, 5) (1, 1) 62 (4, 5) (1, 1) 63 (4, 5) (1, 1) 64 (4, 5) (1, 1) 65 (4, 5) (1, 1) 66 (4, 5) (1, 1) 67 (4, 5) (1, 1) 68 (4, 5) (1, 1) 69 (4, 5) (1, 1) 70 (4, 5) (1, 1) 71 (4, 5) (1, 1) 72 (4, 5) (1, 1) 73 (4, 5) (1, 1) 74 (4, 5) (1, 1) 75 (4, 5) (1, 1) 76 (4, 5) (1, 1) 77 (4, 5) (1, 1) 78 (4, 5) (1, 1) 79 (4, 5) (1, 1) 80 (4, 5) (1, 1) 81 (4, 5) (1, 1) 82 (4, 5) (1, 1) 83 (4, 5) (1, 1) 84 (4, 5) (1, 1) 85 (4, 5) (1, 1) 86 (4, 5) (1, 1) 87 (4, 5) (1, 1) 88 (4, 5) (1, 1) 89 (4, 5) (1, 1) 90 (4, 5) (1, 1) 91 (4, 5) (1, 1) 92 (4, 5) (1, 1) 93 (4, 5) (1, 1) 94 (4, 5) (1, 1) 95 (4, 5) (1, 1) 96 (4, 5) (1, 1) 97 (4, 5) (1, 1) 98 (4, 5) (1, 1) 99 (4, 5) (1, 1) 100 (4, 5) (1, 1) 101 (4, 5) (1, 1) 102 (4, 5) (1, 1) 103 (4, 5) (1, 1) 104 (4, 5) (1, 1) 105 (4, 5) (1, 1) 106 (4, 5) (1, 1) 107 (4, 5) (1, 1) 108 (4, 5) (1, 1) 109 (4, 5) (1, 1) 110 (4, 5) (1, 1) 111 (4, 5) (1, 1) 112 (4, 5) (1, 1) 113 (4, 5) (1, 1) 114 (4, 5) (1, 1) 115 (4, 5) (1, 1) 116 
(4, 5) (1, 1) 117 (4, 5) (1, 1) 118 (4, 5) (1, 1) 119 (4, 5) (1, 1) 120 (4, 5) (1, 1) 121 (4, 5) (1, 1) 122 (4, 5) (1, 1) 123 (4, 5) (1, 1) 124 (4, 5) (1, 1) 125 (4, 5) (1, 1) 126 (4, 5) (1, 1) 127 (4, 5) (1, 1) 128 (4, 5) (1, 1) 129 (4, 5) (1, 1) 130 (4, 5) (1, 1) 131 (4, 5) (1, 1) 132 (4, 5) (1, 1) 133 (4, 5) (1, 1) 134 (4, 5) (1, 1) 135 (4, 5) (1, 1) 136 (4, 5) (1, 1) 137 (4, 5) (1, 1) 138 (4, 5) (1, 1) 139 (4, 5) (1, 1) 140 (4, 5) (1, 1) 141 (4, 5) (1, 1) 142 (4, 5) (1, 1) 143 (4, 5) (1, 1) 144 (4, 5) (1, 1) 145 (4, 5) (1, 1) 146 (4, 5) (1, 1) 147 (4, 5) (1, 1) 148 (4, 5) (1, 1) 149 (4, 5) (1, 1) 150 (4, 5) (1, 1) 151 (4, 5) (1, 1) 152 (4, 5) (1, 1) 153 (4, 5) (1, 1) 154 (4, 5) (1, 1) 155 (4, 5) (1, 1) 156 (4, 5) (1, 1) 157 (4, 5) (1, 1) 158 (4, 5) (1, 1) 159 (4, 5) (1, 1) 160 (4, 5) (1, 1) 161 (4, 5) (1, 1) 162 (4, 5) (1, 1) 163 (4, 5) (1, 1) 164 (4, 5) (1, 1) 165 (4, 5) (1, 1) 166 (4, 5) (1, 1) 167 (4, 5) (1, 1) 168 (4, 5) (1, 1) 169 (4, 5) (1, 1) 170 (4, 5) (1, 1) 171 (4, 5) (1, 1) 172 (4, 5) (1, 1) 173 (4, 5) (1, 1) 174 (4, 5) (1, 1) 175 (4, 5) (1, 1) 176 (4, 5) (1, 1) 177 (4, 5) (1, 1) 178 (4, 5) (1, 1) 179 (4, 5) (1, 1) 180 (4, 5) (1, 1) 181 (4, 5) (1, 1) 182 (4, 5) (1, 1) 183 (4, 5) (1, 1) 184 (4, 5) (1, 1) 185 (4, 5) (1, 1) 186 (4, 5) (1, 1) 187 (4, 5) (1, 1) 188 (4, 5) (1, 1) 189 (4, 5) (1, 1) 190 (4, 5) (1, 1) 191 (4, 5) (1, 1) 192 (4, 5) (1, 1) 193 (4, 5) (1, 1) 194 (4, 5) (1, 1) 195 (4, 5) (1, 1) .. GENERATED FROM PYTHON SOURCE LINES 608-610 Since we have drawn all the samples from the generator, it is now exhausted, so iterating over it again yields no more samples. .. GENERATED FROM PYTHON SOURCE LINES 610-613 .. code-block:: Python for idx, (x, y) in enumerate(gen): print(idx, x.shape, y.shape) .. GENERATED FROM PYTHON SOURCE LINES 614-615 Now we can prepare a TensorFlow `Dataset` using the generator. .. GENERATED FROM PYTHON SOURCE LINES 615-629 .. 
code-block:: Python output_signature = ( tf.TensorSpec(shape=(4, 5), dtype=tf.float32), # shape and dtype for x tf.TensorSpec(shape=(1, 1), dtype=tf.float32) # shape and dtype for y ) dataset = tf.data.Dataset.from_generator( sample_generator, args=(data, lookback, num_inputs), output_signature=output_signature ) dataset .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 630-632 The `dataset` is a generator which returns a single sample (x,y) pair at each iteration .. GENERATED FROM PYTHON SOURCE LINES 632-636 .. code-block:: Python for idx, (x,y) in enumerate(dataset): print(idx, type(x), type(y), x.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none 0 (4, 5) (1, 1) 1 (4, 5) (1, 1) 2 (4, 5) (1, 1) 3 (4, 5) (1, 1) 4 (4, 5) (1, 1) 5 (4, 5) (1, 1) 6 (4, 5) (1, 1) 7 (4, 5) (1, 1) 8 (4, 5) (1, 1) 9 (4, 5) (1, 1) 10 (4, 5) (1, 1) 11 (4, 5) (1, 1) 12 (4, 5) (1, 1) 13 (4, 5) (1, 1) 14 (4, 5) (1, 1) 15 (4, 5) (1, 1) 16 (4, 5) (1, 1) 17 (4, 5) (1, 1) 18 (4, 5) (1, 1) 19 (4, 5) (1, 1) 20 (4, 5) (1, 1) 21 (4, 5) (1, 1) 22 (4, 5) (1, 1) 23 (4, 5) (1, 1) 24 (4, 5) (1, 1) 25 (4, 5) (1, 1) 26 (4, 5) (1, 1) 27 (4, 5) (1, 1) 28 (4, 5) (1, 1) 29 (4, 5) (1, 1) 30 (4, 5) (1, 1) 31 (4, 5) (1, 1) 32 (4, 5) (1, 1) 33 (4, 5) (1, 1) 34 (4, 5) (1, 1) 35 (4, 5) (1, 1) 36 (4, 5) (1, 1) 37 (4, 5) (1, 1) 38 (4, 5) (1, 1) 39 (4, 5) (1, 1) 40 (4, 5) (1, 1) 41 (4, 5) (1, 1) 42 (4, 5) (1, 1) 43 (4, 5) (1, 1) 44 (4, 5) (1, 1) 45 (4, 5) (1, 1) 46 (4, 5) (1, 1) 47 (4, 5) (1, 1) 48 (4, 5) (1, 1) 49 (4, 5) (1, 1) 50 (4, 5) (1, 1) 51 (4, 5) (1, 1) 52 (4, 5) (1, 1) 53 (4, 5) (1, 1) 54 (4, 5) (1, 1) 55 (4, 5) (1, 1) 56 (4, 5) (1, 1) 57 (4, 5) (1, 1) 58 (4, 5) (1, 1) 59 (4, 5) (1, 1) 60 (4, 5) (1, 1) 61 (4, 5) (1, 1) 62 (4, 5) (1, 1) 63 (4, 5) (1, 1) 64 (4, 5) (1, 1) 65 (4, 5) (1, 1) 66 (4, 5) (1, 1) 67 (4, 5) (1, 1) 68 (4, 5) (1, 1) 69 (4, 5) (1, 1) 70 (4, 5) (1, 1) 71 (4, 5) (1, 1) 72 (4, 5) (1, 1) 73 (4, 5) (1, 1) 74 (4, 5) (1, 1) 75 (4, 5) 
(1, 1) 76 (4, 5) (1, 1) 77 (4, 5) (1, 1) 78 (4, 5) (1, 1) 79 (4, 5) (1, 1) 80 (4, 5) (1, 1) 81 (4, 5) (1, 1) 82 (4, 5) (1, 1) 83 (4, 5) (1, 1) 84 (4, 5) (1, 1) 85 (4, 5) (1, 1) 86 (4, 5) (1, 1) 87 (4, 5) (1, 1) 88 (4, 5) (1, 1) 89 (4, 5) (1, 1) 90 (4, 5) (1, 1) 91 (4, 5) (1, 1) 92 (4, 5) (1, 1) 93 (4, 5) (1, 1) 94 (4, 5) (1, 1) 95 (4, 5) (1, 1) 96 (4, 5) (1, 1) 97 (4, 5) (1, 1) 98 (4, 5) (1, 1) 99 (4, 5) (1, 1) 100 (4, 5) (1, 1) 101 (4, 5) (1, 1) 102 (4, 5) (1, 1) 103 (4, 5) (1, 1) 104 (4, 5) (1, 1) 105 (4, 5) (1, 1) 106 (4, 5) (1, 1) 107 (4, 5) (1, 1) 108 (4, 5) (1, 1) 109 (4, 5) (1, 1) 110 (4, 5) (1, 1) 111 (4, 5) (1, 1) 112 (4, 5) (1, 1) 113 (4, 5) (1, 1) 114 (4, 5) (1, 1) 115 (4, 5) (1, 1) 116 (4, 5) (1, 1) 117 (4, 5) (1, 1) 118 (4, 5) (1, 1) 119 (4, 5) (1, 1) 120 (4, 5) (1, 1) 121 (4, 5) (1, 1) 122 (4, 5) (1, 1) 123 (4, 5) (1, 1) 124 (4, 5) (1, 1) 125 (4, 5) (1, 1) 126 (4, 5) (1, 1) 127 (4, 5) (1, 1) 128 (4, 5) (1, 1) 129 (4, 5) (1, 1) 130 (4, 5) (1, 1) 131 (4, 5) (1, 1) 132 (4, 5) (1, 1) 133 (4, 5) (1, 1) 134 (4, 5) (1, 1) 135 (4, 5) (1, 1) 136 (4, 5) (1, 1) 137 (4, 5) (1, 1) 138 (4, 5) (1, 1) 139 (4, 5) (1, 1) 140 (4, 5) (1, 1) 141 (4, 5) (1, 1) 142 (4, 5) (1, 1) 143 (4, 5) (1, 1) 144 (4, 5) (1, 1) 145 (4, 5) (1, 1) 146 (4, 5) (1, 1) 147 (4, 5) (1, 1) 148 (4, 5) (1, 1) 149 (4, 5) (1, 1) 150 (4, 5) (1, 1) 151 (4, 5) (1, 1) 152 (4, 5) (1, 1) 153 (4, 5) (1, 1) 154 (4, 5) (1, 1) 155 (4, 5) (1, 1) 156 (4, 5) (1, 1) 157 (4, 5) (1, 1) 158 (4, 5) (1, 1) 159 (4, 5) (1, 1) 160 (4, 5) (1, 1) 161 (4, 5) (1, 1) 162 (4, 5) (1, 1) 163 (4, 5) (1, 1) 164 (4, 5) (1, 1) 165 (4, 5) (1, 1) 166 (4, 5) (1, 1) 167 (4, 5) (1, 1) 168 (4, 5) (1, 1) 169 (4, 5) (1, 1) 170 (4, 5) (1, 1) 171 (4, 5) (1, 1) 172 (4, 5) (1, 1) 173 (4, 5) (1, 1) 174 (4, 5) (1, 1) 175 (4, 5) (1, 1) 176 (4, 5) (1, 1) 177 (4, 5) (1, 1) 178 (4, 5) (1, 1) 179 (4, 5) (1, 1) 180 (4, 5) (1, 1) 181 (4, 5) (1, 1) 182 (4, 5) (1, 1) 183 (4, 5) (1, 1) 184 (4, 5) (1, 1) 185 (4, 5) (1, 1) 186 (4, 5) (1, 1) 187 (4, 5) (1, 1) 
188 (4, 5) (1, 1) 189 (4, 5) (1, 1) 190 (4, 5) (1, 1) 191 (4, 5) (1, 1) 192 (4, 5) (1, 1) 193 (4, 5) (1, 1) 194 (4, 5) (1, 1) 195 (4, 5) (1, 1) .. GENERATED FROM PYTHON SOURCE LINES 637-638 Getting batches instead of single samples (x, y pairs) during iteration .. GENERATED FROM PYTHON SOURCE LINES 638-651 .. code-block:: Python dataset = tf.data.Dataset.from_generator( sample_generator, args=(data, lookback, num_inputs), output_signature=output_signature ) batch_size = 32 dataset = dataset.shuffle(buffer_size=10000) dataset = dataset.batch(batch_size) dataset = dataset.prefetch(tf.data.AUTOTUNE) dataset .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 652-655 Now when we iterate over `dataset`, each iteration yields not a single sample/example (x, y) pair but a batch of samples, whose size is determined by the `batch_size` parameter. .. GENERATED FROM PYTHON SOURCE LINES 655-659 .. code-block:: Python for idx, (x,y) in enumerate(dataset): print(idx, type(x), type(y), x.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none 0 (32, 4, 5) (32, 1, 1) 1 (32, 4, 5) (32, 1, 1) 2 (32, 4, 5) (32, 1, 1) 3 (32, 4, 5) (32, 1, 1) 4 (32, 4, 5) (32, 1, 1) 5 (32, 4, 5) (32, 1, 1) 6 (4, 4, 5) (4, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 660-662 Let's use a real-world example. We get rainfall-runoff data for several hundred catchments/stations from Colombia. .. GENERATED FROM PYTHON SOURCE LINES 662-669 .. code-block:: Python ds = RainfallRunoff('CAMELS_COL', verbosity=0) static, dynamic = ds.fetch() type(dynamic), len(dynamic) .. rst-class:: sphx-glr-script-out .. code-block:: none /home/docs/checkouts/readthedocs.org/user_builds/ml-tutorials/envs/latest/lib/python3.9/site-packages/aqua_fetch/rr/utils.py:126: UserWarning: netCDF4 module is not installed. Please install it to save data in netcdf format warnings.warn(msg, UserWarning) (, 347) .. 
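Before moving on to the real data, the batch shapes printed for the synthetic dataset above can be checked by hand. The sketch below is plain Python and does not call TensorFlow; `count_samples` and `batch_sizes` are hypothetical helper names introduced only for this illustration. It assumes that the number of samples equals the length of the `range(...)` used in `sample_generator` (no sample is skipped because the toy data has no NaNs) and that the final partial batch is kept, which is the default behaviour of `dataset.batch`.

```python
def count_samples(n_rows, lookback, input_steps=1, forecast_step=0,
                  forecast_len=1, output_steps=1):
    """Length of the range() iterated over in sample_generator."""
    return (n_rows - lookback * input_steps + 1
            - forecast_step - forecast_len * output_steps)


def batch_sizes(n_samples, batch_size):
    """Sizes of consecutive batches when the last partial batch is kept."""
    n_full, remainder = divmod(n_samples, batch_size)
    return [batch_size] * n_full + ([remainder] if remainder else [])


# Toy dataset from above: 200 rows, lookback of 4, batches of 32
n_samples = count_samples(n_rows=200, lookback=4)
sizes = batch_sizes(n_samples, batch_size=32)

print(n_samples)  # 196
print(sizes)      # [32, 32, 32, 32, 32, 32, 4]
```

This agrees with the output above: indices 0 to 195 when iterating sample by sample, and six batches of 32 followed by one batch of 4 after batching.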
GENERATED FROM PYTHON SOURCE LINES 670-671 `dynamic` is a dictionary whose keys are station names and whose values are DataFrames. .. GENERATED FROM PYTHON SOURCE LINES 671-674 .. code-block:: Python dynamic['26247030'].shape .. rst-class:: sphx-glr-script-out .. code-block:: none (13971, 6) .. GENERATED FROM PYTHON SOURCE LINES 675-676 Get the total number of rows across all DataFrames in `dynamic` .. GENERATED FROM PYTHON SOURCE LINES 676-679 .. code-block:: Python sum(df.shape[0] for df in dynamic.values()) .. rst-class:: sphx-glr-script-out .. code-block:: none 4079935 .. GENERATED FROM PYTHON SOURCE LINES 680-681 Get the total number of rows after dropping NaNs in the last column .. GENERATED FROM PYTHON SOURCE LINES 681-684 .. code-block:: Python sum(df.dropna(subset=[df.columns[-1]]).shape[0] for df in dynamic.values()) .. rst-class:: sphx-glr-script-out .. code-block:: none 4079935 .. GENERATED FROM PYTHON SOURCE LINES 685-686 Now we define a `sample_generator` that draws samples from the stations listed in `station_ids` .. GENERATED FROM PYTHON SOURCE LINES 686-732 .. 
code-block:: Python def sample_generator( station_ids, lookback:int, num_inputs:int, num_outputs=None, input_steps=1, forecast_step=0, forecast_len=1, known_future_inputs=False, output_steps=1): for stn in station_ids: stn = stn.decode() if isinstance(stn, bytes) else stn data = dynamic[stn].values for i in range(len(data) - lookback * input_steps + 1 - forecast_step - forecast_len * output_steps): x, _, y = prepare_data_sample(data, index=i, lookback=lookback, num_inputs=num_inputs, num_outputs=num_outputs, input_steps=input_steps, forecast_step=forecast_step, forecast_len=forecast_len, known_future_inputs=known_future_inputs, output_steps=output_steps ) # Skip samples with NaNs in x or y if np.isnan(x).any() or np.isnan(y).any(): continue yield x, y lookback = 365 num_inputs = dynamic['26247030'].shape[1] - 1 output_signature = ( tf.TensorSpec(shape=(lookback, num_inputs), dtype=tf.float32), # shape and dtype for x tf.TensorSpec(shape=(1, 1), dtype=tf.float32) # shape and dtype for y ) dataset = tf.data.Dataset.from_generator( sample_generator, args=(list(dynamic.keys())[0:34], lookback, num_inputs), output_signature=output_signature ) dataset .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 733-735 Now we iterate over `dataset` and measure the time taken to draw all the samples from 34 stations. We chose 34 stations to keep the running time of this example manageable. .. GENERATED FROM PYTHON SOURCE LINES 735-742 .. code-block:: Python start = time.time() for idx, (x,y) in enumerate(dataset): pass print(round(time.time() - start, 2), 'seconds taken') print("index of last sample: ", idx) print(x.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none 61.78 seconds taken index of last sample: 397807 (365, 5) (1, 1) .. GENERATED FROM PYTHON SOURCE LINES 743-744 Getting batches instead of single samples (x, y pairs) during iteration .. GENERATED FROM PYTHON SOURCE LINES 744-759 .. 
code-block:: Python dataset = tf.data.Dataset.from_generator( sample_generator, args=(list(dynamic.keys())[0:34], lookback, num_inputs), output_signature=output_signature ) batch_size = 1024 dataset = dataset.take(1_000_000) # Limit to 1 million samples dataset = dataset.shuffle(buffer_size=10000) dataset = dataset.batch(batch_size) dataset = dataset.prefetch(tf.data.AUTOTUNE) dataset .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 760-763 Now when we iterate over `dataset`, each iteration yields not a single sample/example (x, y) pair but a batch of samples, whose size is determined by the `batch_size` parameter. .. GENERATED FROM PYTHON SOURCE LINES 763-771 .. code-block:: Python start = time.time() for idx, (x,y) in enumerate(dataset): pass print(round(time.time() - start, 2), 'seconds taken') print("index of last batch: ", idx) print(x.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none 34.88 seconds taken index of last batch: 388 (496, 365, 5) (496, 1, 1) .. GENERATED FROM PYTHON SOURCE LINES 772-773 Using the `tf.keras` utility function `timeseries_dataset_from_array`, which is highly optimized .. GENERATED FROM PYTHON SOURCE LINES 773-785 .. code-block:: Python data = pd.concat([val for val in list(dynamic.values())[0:34]], axis=0) print(data.shape) dataset = tf.keras.utils.timeseries_dataset_from_array( data.iloc[:, 0:-1].values, targets=data.iloc[:, -1].values, sequence_length=lookback, batch_size=batch_size ) dataset .. rst-class:: sphx-glr-script-out .. code-block:: none (410218, 6) .. GENERATED FROM PYTHON SOURCE LINES 786-793 .. code-block:: Python start = time.time() for idx, (x,y) in enumerate(dataset): pass print(round(time.time() - start, 2), 'seconds taken') print("index of last batch: ", idx) print(x.shape, y.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none 22.67 seconds taken index of last batch: 400 (254, 365, 5) (254,) .. 
rst-class:: sphx-glr-timing **Total running time of the script:** (2 minutes 17.893 seconds) .. _sphx_glr_download_auto_examples_preparing_data_for_ts_prediction.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/sphinx-gallery/sphinx-gallery.github.io/master?urlpath=lab/tree/notebooks/auto_examples/preparing_data_for_ts_prediction.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: preparing_data_for_ts_prediction.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: preparing_data_for_ts_prediction.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: preparing_data_for_ts_prediction.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_
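A note on target alignment when using `tf.keras.utils.timeseries_dataset_from_array` as in the last example: its documentation states that `targets[i]` is paired with the window that starts at index `i`. Passing the unshifted last column therefore pairs each window with the target at the window's first time step, whereas `prepare_data_sample` paired `x0` (rows 0 to 3) with `y0 = 1003`, the target at the window's last row. The NumPy-only sketch below imitates that pairing rule on the toy array; it does not call TensorFlow, and `pair_windows` is a hypothetical helper written for this illustration, not the library implementation. Under that assumption, shifting the targets by `lookback - 1` appears to reproduce the generator's alignment.

```python
import numpy as np


def pair_windows(inputs, targets, sequence_length):
    """Pair window i (rows i .. i + sequence_length - 1) with targets[i].

    Hypothetical helper that imitates the documented pairing rule of
    timeseries_dataset_from_array; it is not the library implementation.
    """
    n_windows = len(inputs) - sequence_length + 1
    n_pairs = min(n_windows, len(targets))
    xs = np.stack([inputs[i:i + sequence_length] for i in range(n_pairs)])
    return xs, targets[:n_pairs]


rows, cols, lookback = 200, 6, 4
data = np.arange(rows * cols).reshape(-1, rows).transpose()
inputs, target_col = data[:, :-1], data[:, -1]

# Unshifted targets: window 0 is paired with the target at row 0
_, y_unshifted = pair_windows(inputs, target_col, lookback)

# Shifted targets: window 0 is paired with the target at row lookback - 1
x_shifted, y_shifted = pair_windows(inputs, target_col[lookback - 1:], lookback)

print(y_unshifted[0])   # 1000
print(y_shifted[0])     # 1003
print(x_shifted.shape)  # (197, 4, 5)
```

Separately, concatenating several stations into one array, as done with `pd.concat` above, means that some windows straddle the boundary between two stations; the station-wise generator does not produce such windows.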