1. Structuring the input
In order to train and test models, we need to structure the input data in a way that is compatible with the model and the framework’s experiments.
For now, this framework is designed to work with time-series data and Pytorch Lightning. Thus, it provides the necessary tools to create Dataset and LightningDataModule objects, which Pytorch Lightning requires to train and test models.
In this notebook, we explain the default data pipeline, which includes:

1. Creating Dataset objects, which are responsible for loading the data.
2. Creating DataLoader objects, which are responsible for batched loading of the data. A DataLoader encapsulates a Dataset object and provides an iterator over the data in batches.
3. Creating LightningDataModule objects, which are responsible for loading the data, creating the Dataset, and encapsulating it into DataLoader objects for the training, validation, and test sets.
Time-series dataset implementations
The Dataset object is responsible for loading the data. It is a Pytorch object that loads the data and makes it available to the model.
Every Dataset class must implement two methods: __len__ and __getitem__. The __len__ method returns the number of samples in the dataset, and __getitem__, given an integer from 0 to __len__ - 1, returns the corresponding sample. The return type of __getitem__ is not fixed, but it is usually a 2-element tuple with the input and the target. The input is the data used to make predictions, and the target is the data the model tries to predict.
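As a reference, a minimal custom Dataset may look like the sketch below. This is a toy example over an in-memory numpy array, not part of this framework, just to illustrate the __len__ / __getitem__ contract and the (channels, time_steps) layout used throughout this notebook:

```python
import numpy as np
from torch.utils.data import Dataset


class ToyTimeSeriesDataset(Dataset):
    """Toy example: each sample is a (channels, time_steps) window with an integer label."""

    def __init__(self, num_samples: int = 8, channels: int = 6, time_steps: int = 60):
        # Random data, only to illustrate the (C, T) layout
        self.data = np.random.randn(num_samples, channels, time_steps).astype(np.float32)
        self.labels = np.random.randint(0, 2, size=num_samples)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.data)

    def __getitem__(self, index):
        # Returns a 2-element tuple: (input, target)
        return self.data[index], self.labels[index]


toy = ToyTimeSeriesDataset()
x, y = toy[0]
print(len(toy), x.shape, y)  # e.g. 8 (6, 60) 1
```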
The first step when creating a Dataset object is to identify the layout of the data directory and choose the appropriate class to handle it. For now, this framework provides two default Dataset classes for time-series data: SeriesFolderCSVDataset and MultiModalSeriesCSVDataset. Both classes assume that the data is stored in CSV files, but with different layouts, namely:

- A directory with several CSV files, where each file contains a time-series. Each row in a CSV file is a time-step and each column is a feature; thus, the whole file is a single multi-modal time-series. If you want to use labels, they must be in a separate column of the CSV file and must exist for all rows (time-steps). This layout is handled by the SeriesFolderCSVDataset class.
- A single CSV file with windowed time-series. Each row contains different modalities of the same windowed time-series. The prefix of the column names is used to identify the modalities. For instance, if the prefix is accel-x, all columns that start with this prefix, such as accel-x-1, accel-x-2, and accel-x-3, are considered time-steps of the same modality (accel-x). If you want to use labels, they must be in a separate column and must exist for all rows, that is, for each windowed multi-modal time-series. This layout is handled by the MultiModalSeriesCSVDataset class.
We will show how to use these classes in the next sections.
SeriesFolderCSVDataset
The SeriesFolderCSVDataset class is designed to work with a directory containing several CSV files, where each file represents a time-series. Each row in a CSV file is a time-step, and each column is a feature.
This class assumes that the data is organized in the following way, where my_dataset is the path to the directory containing the CSV files:
my_dataset/
series1.csv
series2.csv
other_series.csv
...
Where each CSV file represents a time-series, similar to the one below:
| accel-x  | accel-y | accel-z | gyro-x  | gyro-y  | gyro-z  | class |
|----------|---------|---------|---------|---------|---------|-------|
| 0.502123 | 0.02123 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1     |
| 0.682012 | 0.02123 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1     |
| 0.498217 | 0.00001 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1     |
Note that the CSV must have a header with the column names. Also, columns that are not used as features or labels are ignored.
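If you do not have data in this layout at hand, a small example directory can be generated with pandas, as in the hypothetical sketch below (file names, sizes, and values are arbitrary and only mimic the layout described above):

```python
# Hypothetical helper: one CSV per time-series, one row per time-step
from pathlib import Path

import numpy as np
import pandas as pd

out_dir = Path("my_dataset")
out_dir.mkdir(exist_ok=True)

columns = ["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"]
for name in ["series1", "series2", "other_series"]:
    num_time_steps = 100
    df = pd.DataFrame(np.random.randn(num_time_steps, len(columns)), columns=columns)
    df["class"] = 1  # label column, one value per time-step
    df.to_csv(out_dir / f"{name}.csv", index=False)
```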
To handle this kind of data, we use the SeriesFolderCSVDataset class. This class is a Pytorch Dataset object that loads the data from the CSV files and makes it available to the model. Note that each feature (column) represents a dimension of the time-series, while the rows represent the time-steps. Each sample is a numpy array.
For this class, we must specify the following parameters:

- data_path: the path to the directory containing the CSV files;
- features: a list of strings with the names of the feature columns, e.g. ['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'];
- label: a string with the name of the label column, e.g. 'class'.
[1]:
from ssl_tools.data.datasets import SeriesFolderCSVDataset
# Path to the data
data_path = "/workspaces/hiaac-m4/ssl_tools/data/view_concatenated/KuHar_cpc/train"
# Creating the dataset
dataset = SeriesFolderCSVDataset(
data_path=data_path,
features=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
label="standard activity code",
)
dataset
[1]:
SeriesFolderCSVDataset at /workspaces/hiaac-m4/ssl_tools/data/view_concatenated/KuHar_cpc/train (57 samples)
We can get the number of samples in the dataset with the len function, and we can retrieve a sample with the __getitem__ method, that is, using [], such as dataset[0]. The dataset's return type depends on the label parameter:

- If label is specified, the return type is a 2-element tuple, where the first element is a 2D numpy array with shape (num_features, time_steps), and the second element is the label array with one value per time-step.
- If label is not specified, the return type is a single 2D numpy array with shape (num_features, time_steps) (see the sketch below for the unlabeled case).
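For the unlabeled case, a hedged sketch (reusing the same data_path and assuming label can simply be omitted, as described above; the exact shape depends on your files):

```python
# Hypothetical: the same folder, but without passing a label column.
# Each item should then be just the (num_features, time_steps) array.
unlabeled_dataset = SeriesFolderCSVDataset(
    data_path=data_path,
    features=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
)
print(unlabeled_dataset[0].shape)  # expected: (6, <time_steps of the first file>)
```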
Let’s check the number of samples and access the first sample and its label.
[2]:
# Get the length of the dataset
length_of_dataset = len(dataset)
print(f"Length of dataset: {length_of_dataset} samples")
Length of dataset: 57 samples
[3]:
# Get the first sample. We can go from 0 to length_of_dataset - 1 (56)
sample = dataset[0]
type_of_sample = type(sample).__name__
print(f"Type of sample: {type_of_sample} with {len(sample)} elements")
Type of sample: tuple with 2 elements
[4]:
# The first element of the sample is the input, while the second element is the label
# We can split the sample into input and label variables
shape_of_sample = sample[0].shape
shape_of_label = sample[1].shape
print(f"Shape of sample: {shape_of_sample}, shape of label: {shape_of_label}")
Shape of sample: (6, 2586), shape of label: (2586, 1)
As you can see above, the sample is a 2-element tuple. The first element is a 2D numpy array with shape (6, 2586), and the second element is the label array with shape (2586, 1), that is, one label for each time-step.
MultiModalSeriesCSVDataset
The MultiModalSeriesCSVDataset class is designed to work with a single CSV file containing windowed time-series. Each row contains different modalities of the same windowed time-series. The prefix of the column names is used to identify the modalities. For instance, if the prefix is accel-x, all columns that start with this prefix, such as accel-x-1, accel-x-2, and accel-x-3, are considered time-steps of the same modality (accel-x). If you want to use labels, they must be in a separate column and must exist for all rows, that is, for each windowed multi-modal time-series.
The CSV file looks like this:
| accel-x-0 | accel-x-1 | accel-y-0 | accel-y-1 | class |
|-----------|-----------|-----------|-----------|-------|
| 0.502123  | 0.02123   | 0.502123  | 0.502123  | 0     |
| 0.6820123 | 0.02123   | 0.502123  | 0.502123  | 1     |
| 0.498217  | 0.00001   | 1.414141  | 3.141592  | 1     |
In the example, columns accel-x-0 and accel-x-1 are the accel-x feature at time 0 and time 1, respectively. The same goes for the accel-y feature. Finally, the class column is the label. Columns that are not used as features or labels are ignored.
To use MultiModalSeriesCSVDataset, we must specify the following parameters:

- data_path: the path to the CSV file;
- feature_prefixes: a list of strings with the prefixes of the feature columns, e.g. ['accel-x', 'accel-y']. The class will look for columns with these prefixes and consider them features of a modality;
- label: a string with the name of the label column, e.g. 'class';
- features_as_channels: a boolean indicating whether the features should be treated as channels, that is, whether each prefix becomes a channel. If True, the data is returned as an array of shape (C, T), where C is the number of channels (features/prefixes) and T is the number of time-steps. Otherwise, the data is returned as a single flat vector of shape (T*C,) (all features concatenated). A sketch illustrating this parameter is shown below.

Note that each feature (column) represents a dimension of the time-series, while each row represents a sample.
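To illustrate the features_as_channels behavior, here is a hedged sketch using a small synthetic CSV that mirrors the table above (the file name is hypothetical, and the expected shapes follow from the parameter description, not from running the framework):

```python
import pandas as pd
from ssl_tools.data.datasets import MultiModalSeriesCSVDataset

# Toy CSV with two prefixes (accel-x, accel-y), two time-steps each, plus a label column
toy_df = pd.DataFrame({
    "accel-x-0": [0.502123, 0.6820123, 0.498217],
    "accel-x-1": [0.02123, 0.02123, 0.00001],
    "accel-y-0": [0.502123, 0.502123, 1.414141],
    "accel-y-1": [0.502123, 0.502123, 3.141592],
    "class": [0, 1, 1],
})
toy_df.to_csv("toy_windows.csv", index=False)

# With features_as_channels=True, each sample should have shape (C, T) = (2, 2)
dataset_2d = MultiModalSeriesCSVDataset(
    data_path="toy_windows.csv",
    feature_prefixes=["accel-x", "accel-y"],
    label="class",
    features_as_channels=True,
)

# With features_as_channels=False, each sample should be a flat vector of length T*C = 4
dataset_flat = MultiModalSeriesCSVDataset(
    data_path="toy_windows.csv",
    feature_prefixes=["accel-x", "accel-y"],
    label="class",
    features_as_channels=False,
)

print(dataset_2d[0][0].shape, dataset_flat[0][0].shape)  # expected: (2, 2) (4,)
```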
Let’s show how to read this data and create a MultiModalSeriesCSVDataset
object.
[5]:
from ssl_tools.data.datasets import MultiModalSeriesCSVDataset
# Path to the data
data_path = "/workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv"
# Instantiate the dataset
dataset = MultiModalSeriesCSVDataset(
data_path=data_path,
feature_prefixes=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
label="standard activity code",
features_as_channels = True,
)
dataset
[5]:
MultiModalSeriesCSVDataset at /workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv (1386 samples)
We can get the number of samples in the dataset with the len function, and we can retrieve a sample with the __getitem__ method, that is, using [], such as dataset[0]. The dataset may return:

- A 2-element tuple, where the first element is a 2D numpy array with shape (num_features, time_steps), and the second element is the label of the window (an integer), if label was specified at the time of the dataset object's creation.
- A single 2D numpy array with shape (num_features, time_steps), if label is None.
Let’s check the number of samples and access the first sample and its label.
[6]:
length_of_dataset = len(dataset)
print(f"Length of dataset: {length_of_dataset} samples")
Length of dataset: 1386 samples
[7]:
sample = dataset[0]
type_of_sample = type(sample).__name__
print(f"Type of sample: {type_of_sample} with length {len(sample)} elements")
Type of sample: tuple with length 2 elements
[8]:
shape_of_sample = sample[0].shape
print(f"Shape of sample: {shape_of_sample}")
Shape of sample: (6, 60)
Loading batches of data using DataLoader
Pytorch models are trained using batches of data. Thus, we do not feed the model with a single sample at a time, but with a batch of samples. As we saw in the last example, the MultiModalSeriesCSVDataset object returns a single sample at a time. Each sample is a 2-element tuple, where the first element is a (6, 60) numpy array and the second is an integer, representing the label.
A batch of samples adds an extra dimension to the data. Thus, in our case, a batch of samples would be a 3D tensor, where the first dimension is the batch size (B), the second dimension is the number of features, or channels (C), and the third dimension is the number of time-steps (T). So, if a sample has shape (6, 60), a batch of 32 samples will be a tensor with shape (32, 6, 60). The same happens to the labels, which gain an extra dimension and become a 1D tensor with shape (32,).
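To make the shape change concrete, here is a minimal sketch that stacks a few samples manually, which is essentially what the DataLoader does for us (using the dataset created above and plain torch.stack):

```python
import torch

# Manual batching of the first 4 samples: inputs gain a leading batch dimension
inputs = torch.stack([torch.as_tensor(dataset[i][0]) for i in range(4)])
labels = torch.as_tensor([dataset[i][1] for i in range(4)])

print(inputs.shape, labels.shape)  # expected: torch.Size([4, 6, 60]) torch.Size([4])
```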
The batching of samples is done using a DataLoader object. This is a Pytorch object that takes a Dataset object and returns batches of samples. The DataLoader object is responsible for shuffling the data, dividing it into batches, and loading the data in parallel. Thus, given a Dataset object, we can easily create a DataLoader object using the torch.utils.data.DataLoader class.
[9]:
from torch.utils.data import DataLoader
# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
dataloader
[9]:
<torch.utils.data.dataloader.DataLoader at 0x7f4dd2cb5960>
Datasets implement the __len__ and __getitem__ methods, while the DataLoader object implements the iterable protocol, that is, the __iter__ method, which returns an iterator over the data in batches. Thus, to fetch batches of samples from the DataLoader object, we can use a for loop, as we do with any other iterable object in Python, like lists and tuples (e.g. for batch in dataloader: ...). We can also use the next function to fetch a single batch, such as batch = next(iter(dataloader)). Let's fetch a single batch from the DataLoader object and check its shape.
[10]:
batch = next(iter(dataloader))
# Batch is a tuple with two elements: inputs and labels.
# Let's extract it to two different variables
inputs, labels = batch
# Print the shape of the inputs and labels
print(f"Inputs shape: {inputs.shape}, labels shape: {labels.shape}")
Inputs shape: torch.Size([32, 6, 60]), labels shape: torch.Size([32])
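Since the cell above only fetches one batch, here is a hedged sketch of iterating over the whole DataLoader with a for loop (with the default settings, the last batch may be smaller than 32 when the dataset size is not a multiple of the batch size):

```python
# Iterate over all batches once (one epoch worth of data)
num_batches = 0
for inputs, labels in dataloader:
    num_batches += 1

print(f"Number of batches: {num_batches}")  # e.g. ceil(1386 / 32) = 44
```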
Handling data splits (train, validation, and test) using LightningDataModule
Usually, we create a DataLoader object for the training data, another for the validation data, and another for the test data. We can encapsulate the DataLoader creation logic in a single place and make it easy to reuse the same data processing logic across different experiments. A simple way to do this is to create a LightningDataModule object.
A LightningDataModule object is responsible for splitting the data into training, validation, and test sets, and for creating the DataLoader objects for each set. This object may also be responsible for setting up the data, such as downloading it from the internet, checking it, and adding augmentations.
A LightningDataModule typically implements four methods: setup, train_dataloader, val_dataloader, and test_dataloader. The setup method is optional and is responsible for splitting the data into training, validation, and test sets, while the train_dataloader, val_dataloader, and test_dataloader methods are responsible for creating the DataLoader objects for the training, validation, and test sets, respectively.
A data module may be implemented as shown below.
[11]:
import lightning as L


class MyDataModule(L.LightningDataModule):
    def __init__(self, train_csv_path, batch_size=32, num_workers=0):
        super().__init__()
        self.train_csv_path = train_csv_path
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage=None):
        # Create the training dataset when fitting (validation/test sets would be added analogously)
        if stage == "fit" or stage is None:
            self.train_dataset = MultiModalSeriesCSVDataset(
                data_path=self.train_csv_path,
                feature_prefixes=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
                label="standard activity code",
                features_as_channels=True,
            )

    def train_dataloader(self):
        # Batched, shuffled loading of the training set
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=True,
        )
When training a Pytorch Lightning model, we pass a LightningDataModule object to the Trainer object, and the Trainer object is responsible for calling the setup, train_dataloader, and val_dataloader methods, and for training the model.
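A minimal sketch of this usage is shown below. It assumes some LightningModule, here called my_model, is already defined, and the Trainer arguments are only illustrative:

```python
import lightning as L

# Hypothetical usage: the Trainer calls setup() and train_dataloader() for us
datamodule = MyDataModule(
    train_csv_path="/workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv",
    batch_size=32,
)
trainer = L.Trainer(max_epochs=10, accelerator="cpu")
trainer.fit(my_model, datamodule=datamodule)
```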
Summary
First, we need to check what our dataset is and how the data is organized.

- If the data is organized as a directory with several CSV files, we use the SeriesFolderCSVDataset class.
- If the data is organized as a single CSV file with windowed time-series, we use the MultiModalSeriesCSVDataset class.

Then, we create a Dataset object and use it to create a DataLoader object.
In order to organize the creation of the DataLoader objects for each split (train, validation, and test), we encapsulate this logic in a LightningDataModule object. The LightningDataModule object is responsible for creating the DataLoader objects for the training (train_dataloader), validation (val_dataloader), and test (test_dataloader) data, and for setting up the data.
The LightningDataModule object is then used to train Pytorch Lightning models, which will call the setup, train_dataloader, and val_dataloader methods correctly, as needed in the training/test process.