1. Structuring the input
In order to train and test models, we need to structure the input data in a way that is compatible with the model and the framework’s experiments.
For now, this framework is designed to work with time-series data and Pytorch Lightning. Thus, it provides the necessary tools to create Dataset and LightningDataModule objects, which Pytorch Lightning requires to train and test models.
In this notebook, we explain the default data pipeline, which includes:

1. Creating Dataset objects, which are responsible for loading the data.
2. Creating DataLoader objects, which are responsible for batched loading of the data. A DataLoader encapsulates a Dataset object and provides an iterator over the data in batches.
3. Creating LightningDataModule objects, which are responsible for loading the data, creating the Dataset, and encapsulating it into DataLoader objects for the training, validation, and test sets.
Time-series dataset implementations
The Dataset object is responsible for loading the data. It is a Pytorch object that loads the data and makes it available to the model.
Every Dataset class must implement two methods: __len__ and __getitem__. The __len__ method returns the number of samples in the dataset, and __getitem__, given an integer from 0 to __len__ - 1, returns the corresponding sample. The return type of __getitem__ is not fixed, but it is usually a 2-element tuple with the input and the target. The input is the data used to make predictions, and the target is the data the model tries to predict.
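As a reference, a minimal custom Dataset may look like the sketch below. This is a toy example over an in-memory numpy array, not part of this framework, just to illustrate the __len__ / __getitem__ contract and the (channels, time_steps) layout used throughout this notebook:

```python
import numpy as np
from torch.utils.data import Dataset


class ToyTimeSeriesDataset(Dataset):
    """Toy example: each sample is a (channels, time_steps) window with an integer label."""

    def __init__(self, num_samples: int = 8, channels: int = 6, time_steps: int = 60):
        # Random data, only to illustrate the (C, T) layout
        self.data = np.random.randn(num_samples, channels, time_steps).astype(np.float32)
        self.labels = np.random.randint(0, 2, size=num_samples)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.data)

    def __getitem__(self, index):
        # Returns a 2-element tuple: (input, target)
        return self.data[index], self.labels[index]


toy = ToyTimeSeriesDataset()
x, y = toy[0]
print(len(toy), x.shape, y)  # e.g. 8 (6, 60) 1
```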
The first step when creating a Dataset object is to identify the layout of the data directory and choose the appropriate class to handle it. For now, this framework provides two default Dataset classes for time-series data: SeriesFolderCSVDataset and MultiModalSeriesCSVDataset. Both classes assume that the data is stored in CSV files, but with different layouts, namely:

- A directory with several CSV files, where each file contains a time-series. Each row in a CSV file is a time-step and each column is a feature; thus, the whole file is a single multi-modal time-series. If you want to use labels, they must be in a separate column of the CSV file and must exist for all rows (time-steps). This layout is handled by the SeriesFolderCSVDataset class.
- A single CSV file with windowed time-series. Each row contains different modalities of the same windowed time-series. The prefix of the column names is used to identify the modalities. For instance, if the prefix is accel-x, all columns that start with this prefix, such as accel-x-1, accel-x-2, and accel-x-3, are considered time-steps of the same modality (accel-x). If you want to use labels, they must be in a separate column and must exist for all rows, that is, for each windowed multi-modal time-series. This layout is handled by the MultiModalSeriesCSVDataset class.
We will show how to use these classes in the next sections.
SeriesFolderCSVDataset
The SeriesFolderCSVDataset class is designed to work with a directory containing several CSV files, where each file represents a time-series. Each row in a CSV file is a time-step, and each column is a feature.
This class assumes that the data is organized in the following way, where my_dataset is the path to the directory containing the CSV files:
my_dataset/
series1.csv
series2.csv
other_series.csv
...
Where each CSV file represents a time-series, similar to the one below:
| accel-x  | accel-y | accel-z | gyro-x  | gyro-y  | gyro-z  | class |
|----------|---------|---------|---------|---------|---------|-------|
| 0.502123 | 0.02123 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1     |
| 0.682012 | 0.02123 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1     |
| 0.498217 | 0.00001 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1     |
Note that the CSV must have a header with the column names. Also, columns that are not used as features or labels are ignored.
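If you do not have data in this layout at hand, a small example directory can be generated with pandas, as in the hypothetical sketch below (file names, sizes, and values are arbitrary and only mimic the layout described above):

```python
# Hypothetical helper: one CSV per time-series, one row per time-step
from pathlib import Path

import numpy as np
import pandas as pd

out_dir = Path("my_dataset")
out_dir.mkdir(exist_ok=True)

columns = ["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"]
for name in ["series1", "series2", "other_series"]:
    num_time_steps = 100
    df = pd.DataFrame(np.random.randn(num_time_steps, len(columns)), columns=columns)
    df["class"] = 1  # label column, one value per time-step
    df.to_csv(out_dir / f"{name}.csv", index=False)
```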
To handle this kind of data, we use the SeriesFolderCSVDataset class. This class is a Pytorch Dataset object that loads the data from the CSV files and makes it available to the model. Note that each feature (column) represents a dimension of the time-series, while the rows represent the time-steps. Each sample is a numpy array.
For this class, we must specify the following parameters:

- data_path: the path to the directory containing the CSV files;
- features: a list of strings with the names of the feature columns, e.g. ['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'];
- label: a string with the name of the label column, e.g. 'class'.
[1]:
from ssl_tools.data.datasets import SeriesFolderCSVDataset
# Path to the data
data_path = "/workspaces/hiaac-m4/ssl_tools/data/view_concatenated/KuHar_cpc/train"
# Creating the dataset
dataset = SeriesFolderCSVDataset(
data_path=data_path,
features=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
label="standard activity code",
)
dataset
[1]:
SeriesFolderCSVDataset at /workspaces/hiaac-m4/ssl_tools/data/view_concatenated/KuHar_cpc/train (57 samples)
We can get the number of samples in the dataset with the len function, and we can retrieve a sample with the __getitem__ method, that is, using [], such as dataset[0]. The dataset's return type depends on the label parameter:

- If label is specified, the return type is a 2-element tuple, where the first element is a 2D numpy array with shape (num_features, time_steps), and the second element is the label array with one value per time-step.
- If label is not specified, the return type is a single 2D numpy array with shape (num_features, time_steps) (see the sketch below for the unlabeled case).
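For the unlabeled case, a hedged sketch (reusing the same data_path and assuming label can simply be omitted, as described above; the exact shape depends on your files):

```python
# Hypothetical: the same folder, but without passing a label column.
# Each item should then be just the (num_features, time_steps) array.
unlabeled_dataset = SeriesFolderCSVDataset(
    data_path=data_path,
    features=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
)
print(unlabeled_dataset[0].shape)  # expected: (6, <time_steps of the first file>)
```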
Let’s check the number of samples and access the first sample and its label.
[2]:
# Get the length of the dataset
length_of_dataset = len(dataset)
print(f"Length of dataset: {length_of_dataset} samples")
Length of dataset: 57 samples
[3]:
# Get the first sample. We can go from 0 to length_of_dataset - 1 (56)
sample = dataset[0]
type_of_sample = type(sample).__name__
print(f"Type of sample: {type_of_sample} with {len(sample)} elements")
Type of sample: tuple with 2 elements
[4]:
# The first element of the sample is the input, while the second element is the label
# We can split the sample into input and label variables
shape_of_sample = sample[0].shape
shape_of_label = sample[1].shape
print(f"Shape of sample: {shape_of_sample}, shape of label: {shape_of_label}")
Shape of sample: (6, 2586), shape of label: (2586, 1)
As you can see above, the sample is a 2-element tuple. The first element is a 2D numpy array with shape (6, 2586), and the second element is the label array with shape (2586, 1), that is, one label for each time-step.
MultiModalSeriesCSVDataset
The MultiModalSeriesCSVDataset class is designed to work with a single CSV file containing windowed time-series. Each row contains different modalities of the same windowed time-series. The prefix of the column names is used to identify the modalities. For instance, if the prefix is accel-x, all columns that start with this prefix, such as accel-x-1, accel-x-2, and accel-x-3, are considered time-steps of the same modality (accel-x). If you want to use labels, they must be in a separate column and must exist for all rows, that is, for each windowed multi-modal time-series.
The CSV file looks like this:
| accel-x-0 | accel-x-1 | accel-y-0 | accel-y-1 | class |
|-----------|-----------|-----------|-----------|-------|
| 0.502123  | 0.02123   | 0.502123  | 0.502123  | 0     |
| 0.6820123 | 0.02123   | 0.502123  | 0.502123  | 1     |
| 0.498217  | 0.00001   | 1.414141  | 3.141592  | 1     |
In the example, columns accel-x-0 and accel-x-1 are the accel-x feature at time 0 and time 1, respectively. The same goes for the accel-y feature. Finally, the class column is the label. Columns that are not used as features or labels are ignored.
To use MultiModalSeriesCSVDataset, we must specify the following parameters:

- data_path: the path to the CSV file;
- feature_prefixes: a list of strings with the prefixes of the feature columns, e.g. ['accel-x', 'accel-y']. The class will look for columns with these prefixes and consider them features of a modality;
- label: a string with the name of the label column, e.g. 'class';
- features_as_channels: a boolean indicating whether the features should be treated as channels, that is, whether each prefix becomes a channel. If True, the data is returned as an array of shape (C, T), where C is the number of channels (features/prefixes) and T is the number of time-steps. Otherwise, the data is returned as a single flat vector of shape (T*C,) (all features concatenated). A sketch illustrating this parameter is shown below.

Note that each feature (column) represents a dimension of the time-series, while each row represents a sample.
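To illustrate the features_as_channels behavior, here is a hedged sketch using a small synthetic CSV that mirrors the table above (the file name is hypothetical, and the expected shapes follow from the parameter description, not from running the framework):

```python
import pandas as pd
from ssl_tools.data.datasets import MultiModalSeriesCSVDataset

# Toy CSV with two prefixes (accel-x, accel-y), two time-steps each, plus a label column
toy_df = pd.DataFrame({
    "accel-x-0": [0.502123, 0.6820123, 0.498217],
    "accel-x-1": [0.02123, 0.02123, 0.00001],
    "accel-y-0": [0.502123, 0.502123, 1.414141],
    "accel-y-1": [0.502123, 0.502123, 3.141592],
    "class": [0, 1, 1],
})
toy_df.to_csv("toy_windows.csv", index=False)

# With features_as_channels=True, each sample should have shape (C, T) = (2, 2)
dataset_2d = MultiModalSeriesCSVDataset(
    data_path="toy_windows.csv",
    feature_prefixes=["accel-x", "accel-y"],
    label="class",
    features_as_channels=True,
)

# With features_as_channels=False, each sample should be a flat vector of length T*C = 4
dataset_flat = MultiModalSeriesCSVDataset(
    data_path="toy_windows.csv",
    feature_prefixes=["accel-x", "accel-y"],
    label="class",
    features_as_channels=False,
)

print(dataset_2d[0][0].shape, dataset_flat[0][0].shape)  # expected: (2, 2) (4,)
```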
Let’s show how to read this data and create a MultiModalSeriesCSVDataset
object.
[5]:
from ssl_tools.data.datasets import MultiModalSeriesCSVDataset
# Path to the data
data_path = "/workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv"
# Instantiate the dataset
dataset = MultiModalSeriesCSVDataset(
data_path=data_path,
feature_prefixes=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
label="standard activity code",
features_as_channels = True,
)
dataset
[5]:
MultiModalSeriesCSVDataset at /workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv (1386 samples)
We can get the number of samples in the dataset with the len function, and we can retrieve a sample with the __getitem__ method, that is, using [], such as dataset[0]. The dataset may return:

- A 2-element tuple, where the first element is a 2D numpy array with shape (num_features, time_steps), and the second element is the label of the window (an integer), if label was specified at the time of the dataset object's creation.
- A single 2D numpy array with shape (num_features, time_steps), if label is None.
Let’s check the number of samples and access the first sample and its label.
[6]:
length_of_dataset = len(dataset)
print(f"Length of dataset: {length_of_dataset} samples")
Length of dataset: 1386 samples
[7]:
sample = dataset[0]
type_of_sample = type(sample).__name__
print(f"Type of sample: {type_of_sample} with length {len(sample)} elements")
Type of sample: tuple with length 2 elements
[8]:
shape_of_sample = sample[0].shape
print(f"Shape of sample: {shape_of_sample}")
Shape of sample: (6, 60)
Loading batches of data using DataLoader
Pytorch models are trained using batches of data. Thus, we do not feed the model with a single sample at a time, but with a batch of samples. As we saw in the last example, the MultiModalSeriesCSVDataset object returns a single sample at a time. Each sample is a 2-element tuple, where the first element is a (6, 60) numpy array and the second is an integer, representing the label.
A batch of samples adds an extra dimension to the data. Thus, in our case, a batch of samples would be a 3D tensor, where the first dimension is the batch size (B), the second dimension is the number of features, or channels (C), and the third dimension is the number of time-steps (T). So, if a sample has shape (6, 60), a batch of 32 samples will be a tensor with shape (32, 6, 60). The same happens to the labels, which gain an extra dimension and become a 1D tensor with shape (32,).
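To make the shape change concrete, here is a minimal sketch that stacks a few samples manually, which is essentially what the DataLoader does for us (using the dataset created above and plain torch.stack):

```python
import torch

# Manual batching of the first 4 samples: inputs gain a leading batch dimension
inputs = torch.stack([torch.as_tensor(dataset[i][0]) for i in range(4)])
labels = torch.as_tensor([dataset[i][1] for i in range(4)])

print(inputs.shape, labels.shape)  # expected: torch.Size([4, 6, 60]) torch.Size([4])
```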
The batching of samples is done using a DataLoader object. This is a Pytorch object that takes a Dataset object and returns batches of samples. The DataLoader object is responsible for shuffling the data, dividing it into batches, and loading the data in parallel. Thus, given a Dataset object, we can easily create a DataLoader object using the torch.utils.data.DataLoader class.
[9]:
from torch.utils.data import DataLoader
# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
dataloader
[9]:
<torch.utils.data.dataloader.DataLoader at 0x7f4dd2cb5960>
Datasets implement the __len__ and __getitem__ methods, while the DataLoader object implements the iterable protocol, that is, the __iter__ method, which returns an iterator over the data in batches. Thus, to fetch batches of samples from the DataLoader object, we can use a for loop, as we do with any other iterable object in Python, like lists and tuples (e.g. for batch in dataloader: ...). We can also use the next function to fetch a single batch, such as batch = next(iter(dataloader)). Let's fetch a single batch from the DataLoader object and check its shape.
[10]:
batch = next(iter(dataloader))
# Batch is a tuple with two elements: inputs and labels.
# Let's extract it to two different variables
inputs, labels = batch
# Print the shape of the inputs and labels
print(f"Inputs shape: {inputs.shape}, labels shape: {labels.shape}")
Inputs shape: torch.Size([32, 6, 60]), labels shape: torch.Size([32])
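Since the cell above only fetches one batch, here is a hedged sketch of iterating over the whole DataLoader with a for loop (with the default settings, the last batch may be smaller than 32 when the dataset size is not a multiple of the batch size):

```python
# Iterate over all batches once (one epoch worth of data)
num_batches = 0
for inputs, labels in dataloader:
    num_batches += 1

print(f"Number of batches: {num_batches}")  # e.g. ceil(1386 / 32) = 44
```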
Handling data splits (train, validation, and test) using LightningDataModule
Usually, we create a DataLoader object for the training data, another for the validation data, and another for the test data. We can encapsulate the DataLoader creation logic in a single place and make it easy to reuse the same data processing logic across different experiments. A simple way to do this is to create a LightningDataModule object.
A LightningDataModule object is responsible for splitting the data into training, validation, and test sets, and for creating the DataLoader objects for each set. This object may also be responsible for setting up the data, such as downloading it from the internet, checking it, and adding augmentations.
A LightningDataModule typically implements four methods: setup, train_dataloader, val_dataloader, and test_dataloader. The setup method is optional and is responsible for splitting the data into training, validation, and test sets, while the train_dataloader, val_dataloader, and test_dataloader methods are responsible for creating the DataLoader objects for the training, validation, and test sets, respectively.
A data module may be implemented as shown below.
[11]:
import lightning as L


class MyDataModule(L.LightningDataModule):
    def __init__(self, train_csv_path, batch_size=32, num_workers=0):
        super().__init__()
        self.train_csv_path = train_csv_path
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage=None):
        # Create the training dataset when fitting (validation/test sets would be added analogously)
        if stage == "fit" or stage is None:
            self.train_dataset = MultiModalSeriesCSVDataset(
                data_path=self.train_csv_path,
                feature_prefixes=["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"],
                label="standard activity code",
                features_as_channels=True,
            )

    def train_dataloader(self):
        # Batched, shuffled loading of the training set
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=True,
        )
When training a Pytorch Lightning model, we pass a LightningDataModule object to the Trainer object, and the Trainer object is responsible for calling the setup, train_dataloader, and val_dataloader methods, and for training the model.
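A minimal sketch of this usage is shown below. It assumes some LightningModule, here called my_model, is already defined, and the Trainer arguments are only illustrative:

```python
import lightning as L

# Hypothetical usage: the Trainer calls setup() and train_dataloader() for us
datamodule = MyDataModule(
    train_csv_path="/workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv",
    batch_size=32,
)
trainer = L.Trainer(max_epochs=10, accelerator="cpu")
trainer.fit(my_model, datamodule=datamodule)
```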
Summary
First, we need to check what our dataset is and how the data is organized.

- If the data is organized as a directory with several CSV files, we use the SeriesFolderCSVDataset class.
- If the data is organized as a single CSV file with windowed time-series, we use the MultiModalSeriesCSVDataset class.

Then, we create a Dataset object and use it to create a DataLoader object.
In order to organize the creation of the DataLoader objects for each split (train, validation, and test), we encapsulate this logic in a LightningDataModule object. The LightningDataModule object is responsible for creating the DataLoader objects for the training (train_dataloader), validation (val_dataloader), and test (test_dataloader) data, and for setting up the data.
The LightningDataModule object is then used to train Pytorch Lightning models, which will call the setup, train_dataloader, and val_dataloader methods correctly, as needed in the training/test process.