{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Structuring the input\n", "\n", "In order to train and test models, we need to structure the input data in a way that is compatible with the model and the framework's experiments.\n", "\n", "For now, this framework is designed to work with time-series data and Pytorch Lightning. \n", "Thus, it provides the necessary tools to create `Dataset` and `LightningDataModule` objects, required by Pytorch Lightning to train and test models.\n", "\n", "In this notebook, we explain the default data pipeline, which includes:\n", "1. Creating `Dataset` objects, that are responsible for loading the data.\n", "2. Creating `DataLoader` objects, that are responsible for batched loading of the data. It encapsulates the `Dataset` object and provides an iterator to iterate over the data in batches.\n", "3. Creating `LightningDataModule` objects, that are responsible for loading the data and creating the `Dataset` and encapsulate it into `DataLoader` objects for training, validation, and test sets.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time-series dataset implementations\n", "\n", "The `Dataset` object is responsible for loading the data. \n", "It is a Pytorch object that is used to load the data and make it available to the model. \n", "\n", "Every `Dataset` class must implement two methods: `__len__` and `__getitem__`.\n", "The `__len__` method returns the number of samples in the dataset, and the `__getitem__`, given an integer from 0 to `__len__` - 1, returns the corresponding sample from the dataset.\n", "The returned type of the `__getitem__` method is not specified, but it is usually a 2-element tuple with the input and the target. The input is the data that will be used to make the predictions, and the target is the data that the model will try to predict.\n", "\n", "The first step when creating a `Dataset` object is to identify the layout of the data directory and choose the appropriate class to handle it.\n", "For now, this framework provides default `Dataset` classes for time-series data, which are the `SeriesFolderCSVDataset` and `MultiModalSeriesCSVDataset` classes.\n", "Both classes assumes that data are stored in CSV files, but with different layouts, to know:\n", "\n", "- A directory with several CSV files, where each file contains a time-series. Each row in a CSV file is a time-step, each column is a feature. Thus, the whole file is a single multi-modal time-series. Also, if you want to use labels, it must be in a separated column of the CSV file and it should exists to all rows (time-steps). This layout is handled by the `SeriesFolderCSVDataset` class.\n", "- A single CSV file with a windowed time-series. Each row contains different modalities of the same windowed time-series. The prefix of the column names is used to identify the modalities. For instance, if the is `accel-x`, all columns that start with this prefix, like `accel-x-1`, `accel-x-2`, `accel-x-3`, are considered time-steps from the same modality (`accel-x`). Also, if you want to use labels, it must be in a separated column and it should exists to all rows, that is, for each windowed multimodal time-series. This layout is handled by the `MultiModalSeriesCSVDataset` class.\n", "\n", "We will show how to use these classes in the next sections." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `SeriesFolderCSVDataset`\n", "\n", "The `SeriesFolderCSVDataset` class is designed to work with a directory containing several CSV files, where each file represent a time-series. \n", "Each row in a CSV file is a time-step, and each column is a feature. \n", "\n", "This class assumes that data is organized in the following way, where `my_dataset` is the path to the directory containing the CSV files:\n", "\n", "```bash\n", "my_dataset/\n", " series1.csv\n", " series2.csv\n", " other_series.csv\n", " ...\n", "```\n", "\n", "Where each CSV file represents a time-series, similar to the one below: \n", "\n", "| accel-x | accel-y | accel-z | gyro-x | gyro-y | gyro-z | class |\n", "|---------|---------|---------|---------|---------|---------|---------|\n", "| 0.502123| 0.02123 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1 |\n", "| 0.682012| 0.02123 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1 |\n", "| 0.498217| 0.00001 | 0.12312 | 0.12312 | 0.12312 | 0.12312 | 1 |\n", "\n", "\n", "Note that the CSV must have a header with the column names.\n", "Also, columns that are not used as features or labels are ignored." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To handle this kind of data, we use the `SeriesFolderCSVDataset` class. This class is a Pytorch `Dataset` object that loads the data from the CSV files and makes it available to the model.\n", "Note that, each feature (column) represent a dimension of the time-series, while the rows represent the time-steps. The sample is a numpy array.\n", "\n", "For this class, we must specify the following parameters:\n", "\n", "- `data_path`: the path to the directory containing the CSV files;\n", "- `features`: a list of strings with the names of the features columns, *e.g.* `['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z']`;\n", "- `label`: a string with the name of the label column, *e.g.* `'class'`." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1706883997.242541] [aae107fc745c:2264626:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device\n" ] }, { "data": { "text/plain": [ "SeriesFolderCSVDataset at /workspaces/hiaac-m4/ssl_tools/data/view_concatenated/KuHar_cpc/train (57 samples)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from ssl_tools.data.datasets import SeriesFolderCSVDataset\n", "\n", "# Path to the data\n", "data_path = \"/workspaces/hiaac-m4/ssl_tools/data/view_concatenated/KuHar_cpc/train\"\n", "\n", "# Creating the dataset\n", "dataset = SeriesFolderCSVDataset(\n", " data_path=data_path,\n", " features=[\"accel-x\", \"accel-y\", \"accel-z\", \"gyro-x\", \"gyro-y\", \"gyro-z\"],\n", " label=\"standard activity code\",\n", ")\n", "\n", "dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the number of samples in the dataset with the `len` function, and we can retrive a sample with the `__getitem__` method, that is, using `[]`, such as `dataset[0]`.\n", "The dataset return type is different depending on the `label` parameter.\n", "\n", "- If `label` is speficied, the return type is a 2-element tuple, where the first element is a 2D numpy array with shape `(num_features, time_steps)`, and the second element is a 1D tensor with shape `(time_steps,)`.\n", "- If `label` is not speficied, the return type is a single 2D numpy array with shape `(num_features, time_steps)`.\n", "\n", "Let's check the number of samples and access the first sample and its label." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of dataset: 57 samples\n" ] } ], "source": [ "# Gte the length of the dataset\n", "length_of_dataset = len(dataset)\n", "print(f\"Length of dataset: {length_of_dataset} samples\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type of sample: tuple with 2 elements\n" ] } ], "source": [ "# Get the first sample. We can go from 0 to length_of_dataset - 1 (56)\n", "sample = dataset[0]\n", "type_of_sample = type(sample).__name__\n", "print(f\"Type of sample: {type_of_sample} with {len(sample)} elements\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of sample: (6, 2586), shape of label: (2586, 1)\n" ] } ], "source": [ "# The first element of the sample is the input, while the second element is the label\n", "# We can split the sample into input and label variables\n", "shape_of_sample = sample[0].shape\n", "shape_of_label = sample[1].shape\n", "print(f\"Shape of sample: {shape_of_sample}, shape of label: {shape_of_label}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see above, the sample is a 2-element tuple. The first element is a 2D numpy array with shape `(6, 2586)`, and the second element is a 1D tensor with shape `(2586,)`, that is, a label for each time-step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `MultiModalSeriesCSVDataset`\n", "\n", "\n", "The `MultiModalSeriesCSVDataset` class is designed to work with a single CSV file containing a windowed time-series.\n", "Each row contains different modalities of the same windowed time-series. 
\n", "The prefix of the column names is used to identify the modalities. \n", "For instance, if the prefix is `accel-x`, all columns that start with this prefix, like `accel-x-1`, `accel-x-2`, `accel-x-3`, are considered time-steps from the same modality (`accel-x`). \n", "Also, if you want to use labels, it must be in a separated column and it should exists to all rows, that is, for each windowed multimodal time-series. \n", "\n", "The CSV file looks like this:\n", "\n", "\n", "| accel-x-0 | accel-x-1 | accel-y-0 | accel-y-1 | class |\n", "|-----------|-----------|-----------|-----------|--------|\n", "| 0.502123 | 0.02123 | 0.502123 | 0.502123 | 0 |\n", "| 0.6820123 | 0.02123 | 0.502123 | 0.502123 | 1 |\n", "| 0.498217 | 0.00001 | 1.414141 | 3.141592 | 1 |\n", "\n", "In the example, columns `accel-x-0` and `accel-x-1` are the `accel-x` feature at time `0` and time `1`, respectively. \n", "The same goes for the `accel-y` feature. Finally, the `class` column is the label. \n", "Columns that are not used as features or labels are ignored.\n", "\n", "To use `MultiModalSeriesCSVDataset`, we must specify the following parameters:\n", "\n", "- `data_path`: the path to the CSV file\n", "- `feature_prefixes`: a list of strings with the prefixes of the feature columns, e.g. `['accel-x', 'accel-y']`. The class will look for columns with these prefixes and will consider them as features of a modality.\n", "- `label`: a string with the name of the label column, e.g. `'class'`\n", "- `features_as_channels`: a boolean indicating if the features should be treated as channels, that is, if each prefix will become a channel. If ``True``, the data will be returned as a vector of shape `(C, T)`, where C is the number of channels (features/prefixes) and `T` is the number of time steps. Else, the data will be returned as a vector of shape `T*C` (a single vector with all the features).\n", "\n", "Note that, each feature (column) represent a dimension of the time-series, while the rows represent the samples.\n", "\n", "Let's show how to read this data and create a `MultiModalSeriesCSVDataset` object." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiModalSeriesCSVDataset at /workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv (1386 samples)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from ssl_tools.data.datasets import MultiModalSeriesCSVDataset\n", "\n", "# Path to the data\n", "data_path = \"/workspaces/hiaac-m4/ssl_tools/data/standartized_balanced/KuHar/train.csv\"\n", "\n", "# Instantiate the dataset\n", "dataset = MultiModalSeriesCSVDataset(\n", " data_path=data_path,\n", " feature_prefixes=[\"accel-x\", \"accel-y\", \"accel-z\", \"gyro-x\", \"gyro-y\", \"gyro-z\"],\n", " label=\"standard activity code\",\n", " features_as_channels = True,\n", ")\n", "\n", "dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the number of samples in the dataset with the `len` function, and we can retrive a sample with the `__getitem__` method, that is, using `[]`, such as `dataset[0]`.\n", "The dataset may return:\n", "\n", "- A 2-element tuple, where the first element is a 2D numpy array with shape `(num_features, time_steps)`, and the second element is a 1D tensor with shape `(time_steps,)`.\n", "- A 2D numpy array with shape `(num_features, time_steps)`, if `label` is `None`, at the time of the dataset object's creation.\n", "\n", "Let's check the number of samples and access the first sample and its label." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of dataset: 1386 samples\n" ] } ], "source": [ "length_of_dataset = len(dataset)\n", "print(f\"Length of dataset: {length_of_dataset} samples\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type of sample: tuple with length 2 elements\n" ] } ], "source": [ "sample = dataset[0]\n", "type_of_sample = type(sample).__name__\n", "print(f\"Type of sample: {type_of_sample} with length {len(sample)} elements\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of sample: (6, 60)\n" ] } ], "source": [ "shape_of_sample = sample[0].shape\n", "print(f\"Shape of sample: {shape_of_sample}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading batches of data using DataLoader\n", "\n", "Pytorch models are trained using batches of data. Thus, we do not feed the model with a single sample at a time, but with a batch of samples.\n", "If we see the last example, the `MultiModalSeriesCSVDataset` object returns a single sample at a time. Each sample is a 2-element tuple, where first element is a `(6, 60)` numpy array and the second is an integer, representing the label.\n", "\n", "A batch of samples add an extra dimension to the data. \n", "Thus, in our case, a batch of samples would be a 3D tensor, where the first dimension is the batch size (`B`), the second dimension is the number of features, or channels (`C`), and the third dimension is the number of time-steps (`T`).\n", "Thus, if the data have the shape `(6, 60)`, a batch of 32 samples will be a tensor with shape `(32, 6, 60)`. \n", "The same happens to `label`, which gains an extra dimension, and would be an 1D tensor with shape `(32,)`.\n", "\n", "The batching of samples is done using a `DataLoader` object. 
\n", "This object is a Pytorch object that takes a `Dataset` object and returns batches of samples. \n", "The `DataLoader` object is responsible for shuffling the data, dividing it into batches, and loading the data in parallel.\n", "Thus, given a `Dataset` object, we can easilly create a `DataLoader` object using the `torch.utils.data.DataLoader` class." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from torch.utils.data import DataLoader\n", "\n", "# Create a DataLoader\n", "dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n", "dataloader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Datasets implement the `__len__` and `__getitem__` methods. \n", "However, the `DataLoader` object implements the iterable protocol, that is, it implements the `__iter__` method, which returns an iterator to iterate over the data in batches.\n", "Thus, to fetch a batch of samples from the `DataLoader` object, we can use a `for` loop, as we do with any other iterable object in Python, like lists and tuples (*e.g.* `for batch in dataloader: ...`).\n", "We can also use the `next` function to fetch a single batch of samples, such as `batch = next(iter(dataloader))`.\n", "Let's fetch a single sample from the `DataLoader` object and check its shape." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inputs shape: torch.Size([32, 6, 60]), labels shape: torch.Size([32])\n" ] } ], "source": [ "batch = next(iter(dataloader))\n", "# Batch is a tuple with two elements: inputs and labels. \n", "# Let's extract it to two different variables\n", "inputs, labels = batch\n", "# Print the shape of the inputs and labels\n", "print(f\"Inputs shape: {inputs.shape}, labels shape: {labels.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling data splits (train, validation, and test) using `LightningDataModule`\n", "\n", "Usually, we create a `DataLoader` object for the training data, another for the validation data, and another for the test data. \n", "We can encapsulate the `DataLoader` creation logic in a single place, and make it easy to use the same data processing logic across different experiments.\n", "A simple way to do this is to create a `LightningDataModule` object.\n", "\n", "A `LightningDataModule` object is responsible for splitting the data into training, validation, and test sets, and creating the `DataLoader` objects for each set. \n", "This object may also be responsible for setting up the data, such as downloading the data from the internet, checking the data, and add the augmentations. \n", "\n", "The `LightningDataModule` object must implement four methods: `setup`, `train_dataloader`, `val_dataloader`, and `test_dataloader`. The `setup` is optional, and is responsible for splitting the data into training, validation, and test sets, and `train_dataloader`, `val_dataloader` and `test_dataloader` methods are responsible for creating the `DataLoader` objects for the training, validation and test sets, respectively.\n", "\n", "A data module may be implemented as shown below." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import lightning as L\n", "\n", "class MyDataModule(L.LightningDataModule):\n", " def __init__(self, train_csv_path, batch_size=32):\n", " super().__init__()\n", " self.data_path = data_path\n", " self.batch_size = batch_size\n", " self.train_csv_path = train_csv_path\n", "\n", " def setup(self, stage=None):\n", " if stage == \"fit\" or stage is None:\n", " self.train_dataset = MultiModalSeriesCSVDataset(\n", " data_path=self.train_csv_path,\n", " feature_prefixes=[\"accel-x\", \"accel-y\", \"accel-z\", \"gyro-x\", \"gyro-y\", \"gyro-z\"],\n", " label=\"standard activity code\",\n", " features_as_channels=True,\n", " )\n", "\n", " def train_dataloader(self):\n", " return DataLoader(\n", " self.train_dataset, batch_size=self.batch_size, num_workers=self.num_workers, shuffle=True\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When training a Pytorch Lightning model, we pass a `LightningDataModule` object to the `Trainer` object, and the `Trainer` object is responsible calling the `setup`, `train_dataloader`, and `val_dataloader` methods, and for training the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "First, we need to check which is our dataset and how the data is organized.\n", "\n", "- If data is organized in a directory with several CSV files, we use the `SeriesFolderCSVDataset` class. \n", "- If data is organized in a single CSV file with a windowed time-series, we use the `MultiModalSeriesCSVDataset` class.\n", "\n", "Then, we create a `Dataset` object, and use it to create a `DataLoader` object.\n", "\n", "In order to organize the creation of the `DataLoader` object for each split (train, validation and test), we encapsule this logic in a `LightningDataModule` object.\n", "The `LightningDataModule` object is responsible for creating the `DataLoader` objects for the training (`train_dataloader`), validation (`val_dataloader`), and test (`test_dataloader`) data, and for setting up the data.\n", "\n", "The `LightningDataModule` object is then used to train Pytorch Lightning models, which will call the `setup`, `train_dataloader`, and `val_dataloader` methods, corretly, as needed in the training/test process." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 2 }