Versioned Data

Overview

Versioned Datasets manage the flow of data through your machine learning workloads. Each Dataset has immutable versions that track your data as it changes. Dataset versions can be used as both inputs to and outputs of Gradient workloads. Data is stored at a Storage Provider and is cached on a Gradient cluster's shared storage for a period of time, so that it is readily available on repeated use.

Versions, Tags, and Messages

Datasets have multiple versions that can be referenced. When you create a new dataset version, you can attach a message that describes it. You can also tag a specific dataset version with a custom name. The available ways to reference a dataset are:
    [dataset-id]:latest : uses the latest version of your dataset
    [dataset-id]:[dataset-version] : uses the specified dataset version
    [dataset-id]:[dataset-tag] : uses the dataset version that the dataset tag points to
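
As a rough illustration of the reference forms listed above, the following Python sketch splits a reference string into its dataset ID and its version selector. The helper name parse_dataset_ref and the DatasetRef type are hypothetical and not part of Gradient. Treating a bare ID as "latest" is an assumption based on the workflow examples on this page, where ref: dst123abc is described as dst123abc:latest.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    dataset_id: str
    selector: str  # "latest", a version ID, or a tag name


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split '[dataset-id]:[selector]' into its parts (hypothetical helper)."""
    dataset_id, sep, selector = ref.strip().partition(":")
    if not dataset_id:
        raise ValueError(f"missing dataset ID in reference: {ref!r}")
    if sep and not selector:
        raise ValueError(f"empty version or tag in reference: {ref!r}")
    # Assumption: a bare ID with no selector means the latest version.
    return DatasetRef(dataset_id, selector or "latest")


print(parse_dataset_ref("dst364npcw6ccok:fo5rp4m"))
print(parse_dataset_ref("dst123abc"))
```

The sketch cannot tell a version ID from a tag name, since both appear in the same position; resolving that would be up to the service.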

Committed state

A Dataset version is either uncommitted or committed. While a version is uncommitted, you can add or modify files freely. Once a version is committed, it is immutable and allows no further modifications. This makes workloads that use the Dataset repeatable and deterministic.
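
The two states can be pictured with a small toy model. This Python sketch is only an illustration of the rule described above (edits allowed before commit, rejected after); the class and method names are invented for the example and do not reflect how Gradient implements versions.

```python
class DatasetVersion:
    """Toy model of a dataset version; not Gradient's implementation."""

    def __init__(self):
        self.files = {}
        self.committed = False

    def put_file(self, name, data):
        # Files may be added or replaced only while the version is uncommitted.
        if self.committed:
            raise PermissionError("committed versions are immutable")
        self.files[name] = data

    def commit(self):
        # After this call, put_file rejects every change.
        self.committed = True


version = DatasetVersion()
version.put_file("hello.txt", b"hello world\n")
version.commit()
try:
    version.put_file("extra.txt", b"too late")
except PermissionError as err:
    print(f"rejected: {err}")
```

Because a committed version can no longer change, any workload that reads it sees the same files on every run.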

Creating a Dataset and Dataset Version

GUI

To create a new dataset (one that does not yet have an ID) in the GUI, go to the Data tab on your team's page and click "Create a Dataset". This opens a window where you give the dataset a name and an optional description, and select the storage provider on which it will be created.
Figure: Creation of new dataset
If the team already has datasets, there is a similar "Add" button. The screen shown after creation lets you upload files, or you can simply retrieve the dataset ID for use elsewhere.
Figure: Optional importing of data
Importing data, or adding it in some other way such as through a Workflow output, creates a new version of the dataset.

CLI

To create a new dataset (one that does not yet have an ID) in the CLI, use gradient datasets create with a command like

    $ gradient datasets create --name democli --storageProviderId ssfe843ndkjdsnr
    Created dataset: dsr5zdx0thjhfe2
To create a new dataset version, use a command such as

    $ gradient datasets versions create --id=dst364npcw6ccok --source-path=./some-data/
    Created dataset version: dst364npcw6ccok:fo5rp4m
    Committed dataset version: dst364npcw6ccok:fo5rp4m
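
When scripting these commands, it can be handy to capture the new ID or version reference from the CLI output. The Python sketch below extracts it from text shaped like the sample output on this page. The helper name created_ref is hypothetical, and the "Created dataset..." line format is assumed from these samples only; it may differ between CLI versions.

```python
import re

# Matches "Created dataset: <id>" and "Created dataset version: <id>:<version>",
# the two line shapes that appear in the sample output on this page.
CREATED_RE = re.compile(r"^Created dataset(?: version)?: (\S+)$", re.MULTILINE)


def created_ref(cli_output: str) -> str:
    """Return the ID or reference from a 'Created dataset...' line (hypothetical helper)."""
    match = CREATED_RE.search(cli_output)
    if match is None:
        raise ValueError("no 'Created dataset' line found in output")
    return match.group(1)


sample = """Created dataset version: dst364npcw6ccok:fo5rp4m
Committed dataset version: dst364npcw6ccok:fo5rp4m
"""
print(created_ref(sample))
```

In a real script the text would come from running the gradient command, for example through subprocess.run with capture_output=True.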

Using Datasets

You can use existing Datasets or create new ones. The scenarios below perform the following dataset actions:
    dst123abc:latest will be mounted to: /inputs/my-dataset

        job-1:
          inputs:
            my-dataset:
              type: dataset
              with:
                ref: dst123abc
          uses: container@v1
          with:
            image: bash:5
            args: ["bash", "-c", "ls /inputs/my-dataset"]

    dst123abc:latest will be created by job-1 and mounted in job-2 at: /inputs/my-created-dataset

        job-1:
          outputs:
            my-dataset:
              type: dataset
              with:
                ref: dst123abc
          uses: container@v1
          with:
            image: bitnami/git
            args: ["bash", "-c", "git clone https://github.com/username/repo /outputs/my-dataset"]
        job-2:
          needs:
            - job-1
          inputs:
            my-created-dataset: job-1.outputs.my-dataset
          uses: container@v1
          with:
            image: bash:5
            args: ["bash", "-c", "ls /inputs/my-created-dataset"]
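
In the second scenario, job-2 consumes an output of job-1 and also lists job-1 under needs. The Python sketch below shows one way to check that pairing on a workflow that has already been loaded into a dictionary. The function name check_output_references is hypothetical, and whether Gradient itself enforces this rule is not stated here; the check only mirrors the pattern in the example.

```python
def check_output_references(jobs: dict) -> list:
    """Report inputs of the form '<job>.outputs.<name>' that look inconsistent."""
    problems = []
    for job_name, job in jobs.items():
        needs = set(job.get("needs", []))
        for input_name, source in job.get("inputs", {}).items():
            if not isinstance(source, str):
                continue  # inline definition such as {"type": "dataset", ...}
            producer, _, output = source.partition(".outputs.")
            if not output:
                continue  # not a reference to another job's output
            if producer not in needs:
                problems.append(
                    f"{job_name}: input {input_name!r} uses {producer} "
                    "but does not list it under needs"
                )
            elif output not in jobs.get(producer, {}).get("outputs", {}):
                problems.append(
                    f"{job_name}: {producer} declares no output named {output!r}"
                )
    return problems


jobs = {
    "job-1": {
        "outputs": {"my-dataset": {"type": "dataset", "with": {"ref": "dst123abc"}}},
    },
    "job-2": {
        "needs": ["job-1"],
        "inputs": {"my-created-dataset": "job-1.outputs.my-dataset"},
    },
}
print(check_output_references(jobs))
```

For the workflow shown above the list of problems is empty; removing the needs entry from job-2 would produce one message.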

Viewing Datasets

    $ gradient datasets list
    +------+-----------------+-------------------------+
    | Name | ID              | Storage Provider        |
    +------+-----------------+-------------------------+
    | test | dst364npcw6ccok | test1 (splgct3arqdh77c) |
    +------+-----------------+-------------------------+

    $ gradient datasets details --id=dst364npcw6ccok
    +-----------------+-------------------------+
    | Name            | test                    |
    +-----------------+-------------------------+
    | ID              | dst364npcw6ccok         |
    | Description     |                         |
    | StorageProvider | test1 (splgct3arqdh77c) |
    +-----------------+-------------------------+

Viewing Dataset files

    $ gradient datasets files list --id=dst364npcw6ccok:fo5rp4m
    +-----------+------+
    | Name      | Size |
    +-----------+------+
    | hello.txt | 12   |
    +-----------+------+
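
If the table output needs to be consumed by a script, a small parser is usually enough. The Python sketch below turns a bordered table like the one above into a list of dictionaries keyed by the header row. The helper name parse_table is hypothetical, and the layout is assumed from the samples on this page; it fits the tables with a header row (datasets list, files list), not the key/value layout of datasets details.

```python
def parse_table(text: str) -> list:
    """Parse a '+---+' bordered table into row dicts keyed by the header row."""
    rows = [
        # Drop the outer pipes, then split the remaining text into cells.
        [cell.strip() for cell in line.strip()[1:-1].split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip border lines and other output
    ]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


sample = """\
+-----------+------+
| Name      | Size |
+-----------+------+
| hello.txt | 12   |
+-----------+------+
"""
print(parse_table(sample))
```

All values come back as strings, so a size such as 12 would still need converting with int before doing arithmetic on it.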