Storage

Overview

There are two types of storage offered in Gradient: Persistent Storage and Versioned Data.

Persistent storage

Persistent storage is a high-performance storage directory available within Gradient Notebooks at /storage. It is backed by a filesystem and is ideal for storing data such as images, datasets, and model checkpoints. Anything you store in the /storage directory will be accessible across multiple runs of notebooks in a given storage region.

For more information on persistent storage in notebooks, check out the files and storage section of the notebooks docs.

Versioned data

In Workflows and Datasets, Gradient provides the ability to mount S3-compatible object storage buckets to workloads at runtime. Datasets have immutable versions that can be used to track your data as it changes.

The easiest way to get data into Gradient is to use the web-based uploader. Read on to understand how versioned data works within Gradient and learn about connecting to additional data sources.

How versioned data works

Versioned Datasets are used to manage the flow of data with your machine learning workloads. Datasets have immutable versions that can be used to track your data as it changes. Dataset versions can be used as inputs to Gradient workloads as well as outputs from them. Data is stored at a storage provider and is cached on a Gradient cluster's shared storage for a period of time so that it is readily available on repeated use.

Volumes

Within Gradient, Volumes allow various Gradient resources to access a shared Network File System. Storage volumes provide a block-level storage device for use as the primary system drive on a Paperspace machine. Volumes appear to the operating system as locally attached storage which can be partitioned and formatted according to your needs.

Volumes for Notebooks, Workflows, and Deployments:

  • Gradient Versioned Datasets - These are created by the user for storing ML data, artifacts, models, etc. The data is accessible via one or more user-chosen, job-specific directory paths. Users can create Versioned Datasets directly via the CLI, within the Gradient GUI in the Data tab, or even through Workflows to enable regular updates to the files via GitHub. Versioned Datasets can be stored in Gradient Managed storage or with a chosen storage provider.

    • /inputs/{user-chosen-job-specific-dir-name1}
    • /inputs/{user-chosen-job-specific-dir-name2}
    • /outputs/{user-chosen-job-specific-dir-name1}
    • /outputs/{user-chosen-job-specific-dir-name2}
  • /storage - This is a team-wide shared storage space on the NFS or another Kubernetes Container Storage Interface (CSI) storage option, such as Ceph. It is created and allocated as a Kubernetes PersistentVolume during installation, but can be accessed by the customer afterwards.

  • Gradient Volumes - These are temporary Workflow run volumes that only exist for the duration of the Workflow run. They are referenced under the same root paths as Gradient Dataset Versions. Use these volumes to instantiate, access, and upload to temporary storage spaces that facilitate your Workflow without having to store the files or data permanently in one of the persistent storage options.

    • /inputs/{user-chosen-job-specific-dir-name3}
    • /inputs/{user-chosen-job-specific-dir-name4}
    • /outputs/{user-chosen-job-specific-dir-name3}
    • /outputs/{user-chosen-job-specific-dir-name4}
    • ...
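The path convention in the list above can be sketched in a few lines of Python. This is only an illustration of the /inputs/{name} and /outputs/{name} layout described here; the function name mount_path and its validation rules are assumptions made for the example, not part of Gradient.

```python
from pathlib import PurePosixPath


def mount_path(kind: str, name: str) -> str:
    """Return the in-container path for a named input or output volume.

    Hypothetical helper: it only mirrors the /inputs/{name} and
    /outputs/{name} convention described in this section.
    """
    if kind not in ("inputs", "outputs"):
        raise ValueError(f"kind must be 'inputs' or 'outputs', got {kind!r}")
    if not name or "/" in name:
        raise ValueError(f"name must be a single path segment, got {name!r}")
    return str(PurePosixPath("/") / kind / name)


print(mount_path("inputs", "my-dataset"))   # /inputs/my-dataset
print(mount_path("outputs", "my-dataset"))  # /outputs/my-dataset
```

The same name can appear under both roots, because an output written by one job is typically consumed as an input by another.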

Volumes accessible by Notebooks only:

  • /notebook - This is a directory path under the team's /storage root that stores the home directory content of each notebook run. The files in the notebook repo can be cloned directly from GitHub to efficiently set up your workspace. This is done via the Workspace URL entry box in the Advanced Options section of the Notebook Create page. The directory is allocated as a temporary subvolume under the main team storage volume.

Team-wide Volumes:

  • /{team-id}/datasets - This contains cached, named versions of the Gradient Datasets. The team can control the size of this cache area. Data stored in the cache is automatically backed up to the team's configured object storage.

Cluster-wide Volumes:

  • "metrics" - This is a persistent volume where Prometheus metrics data is stored.
  • "share-storage" - team subvolumes are allocated from this cluster-wide persistent volume.

Versions, Tags, and Messages

Datasets have multiple versions that can be referenced. You can attach a message to a new dataset version to describe it. You can also tag a specific dataset version with a custom name. Here are the available ways to reference a dataset:

  • [dataset-id]:latest : this will use the latest version of your dataset
  • [dataset-id]:[dataset-version] : this will use the specified dataset version
  • [dataset-id]:[dataset-tag] : this will use the specified dataset version that the dataset-tag points to
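The three reference forms above share one shape: a dataset ID, optionally followed by a colon and a selector. A minimal Python sketch of splitting such a reference is shown here. The names DatasetRef and parse_dataset_ref are hypothetical, and treating a bare ID as "latest" is an assumption based on the Workflow example later on this page, where ref: dst123abc is described as mounting dst123abc:latest.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    dataset_id: str
    selector: str  # "latest", a version ID, or a tag name


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split '[dataset-id]:[selector]' into its two parts.

    Assumption: a reference without a selector means 'latest'.
    """
    dataset_id, sep, selector = ref.partition(":")
    if not dataset_id:
        raise ValueError(f"missing dataset ID in {ref!r}")
    if sep and not selector:
        raise ValueError(f"empty version or tag in {ref!r}")
    return DatasetRef(dataset_id, selector or "latest")


print(parse_dataset_ref("dst364npcw6ccok:fo5rp4m"))
print(parse_dataset_ref("dst123abc"))
```

Note that the string alone does not say whether the selector is a version ID or a tag; that distinction can only be resolved by Gradient itself.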

Committed state

Dataset versions have two states: uncommitted and committed. While a dataset version is uncommitted, you can modify or add files freely. Once a dataset version is committed, it is immutable and will not allow any modifications. This allows workloads to be repeatable and deterministic with the provided Datasets.
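To make the two states concrete, here is a toy in-memory model in Python. It is not how Gradient stores data; the class DatasetVersion and its methods are invented for illustration and only mimic the rule that files can be added before a commit and not after.

```python
class DatasetVersion:
    """Toy model of a dataset version (illustrative only)."""

    def __init__(self, ref: str) -> None:
        self.ref = ref
        self.files = {}          # file name -> file contents
        self.committed = False   # new versions start uncommitted

    def put(self, name: str, data: bytes) -> None:
        # Files may only be added or replaced while uncommitted.
        if self.committed:
            raise PermissionError(f"{self.ref} is committed and immutable")
        self.files[name] = data

    def commit(self) -> None:
        # After this call the version no longer accepts changes.
        self.committed = True


version = DatasetVersion("dst364npcw6ccok:fo5rp4m")
version.put("hello.txt", b"hello world\n")
version.commit()
try:
    version.put("late.txt", b"too late")
except PermissionError as err:
    print(err)
```

Because a committed version never changes, two runs that reference the same version are guaranteed to see the same files.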

How to create a dataset and dataset version in the GUI

To create a new dataset (one that does not yet have an ID) in the GUI, go to the Data tab in your team's page and click "Create a Dataset". This brings up a window to give it a name, optional description, and select the storage provider on which it will be created.

Creation of new dataset

If the team already has datasets, there is a similar "Add" button. The resulting screen after creation allows you to upload files, or you can just retrieve the dataset ID for use elsewhere.

Optional importing of data

Importing data, or adding it in some other way such as a Workflow output, will create a new version of the dataset.

Datasets can also be created from the Notebook IDE through the Datasets tab.

How to create a dataset and dataset version in the CLI

To create a new dataset (one that does not yet have an ID) in the CLI, use the gradient datasets create command as shown below:

$ gradient datasets create --name democli --storageProviderId ssfe843ndkjdsnr
Created dataset: dsr5zdx0thjhfe2

All Gradient datasets are versioned, so to change the data in a dataset you first need to create a new version, as shown below. The remaining examples use an existing dataset with the ID dst364npcw6ccok; substitute the ID of your own dataset.

$ gradient datasets versions create --id dst364npcw6ccok
Created dataset version: dst364npcw6ccok:fo5rp4m

Once the version is created, you can then add files to the Dataset version.

$ gradient datasets files put --id dst364npcw6ccok:fo5rp4m --source-path ./some-data/

Once all desired files are uploaded to the version, commit the version to the Dataset.

$ gradient datasets versions commit --id dst364npcw6ccok:fo5rp4m
Committed dataset version: dst364npcw6ccok:fo5rp4m

Once the Dataset version is committed, the data will be available in the UI and can be referenced in other Gradient services (i.e. Notebooks, Workflows, and Deployments).
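The create, put, and commit steps above can be chained from a script. The following Python sketch uses only the three commands and the "Created dataset version:" output line shown in this section. It assumes the CLI prints that line to standard output in exactly that form; the function names publish_version, parse_version_ref, and run_cli are hypothetical.

```python
import re
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_cli(argv: List[str]) -> str:
    """Run a command and return its stdout (needs the gradient CLI installed)."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def parse_version_ref(output: str) -> str:
    """Extract 'dataset-id:version' from the create command's output."""
    match = re.search(r"Created dataset version:\s*(\S+)", output)
    if match is None:
        raise ValueError(f"no dataset version found in {output!r}")
    return match.group(1)


def publish_version(dataset_id: str, source_path: str, run: Runner = run_cli) -> str:
    """Create a version, upload files into it, then commit it."""
    created = run(["gradient", "datasets", "versions", "create", "--id", dataset_id])
    version_ref = parse_version_ref(created)
    run(["gradient", "datasets", "files", "put",
         "--id", version_ref, "--source-path", source_path])
    run(["gradient", "datasets", "versions", "commit", "--id", version_ref])
    return version_ref
```

Calling publish_version("dst364npcw6ccok", "./some-data/") would run the same three commands as above. The run parameter exists so the sequence can be exercised with a stand-in function when the CLI is not available.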

How to use datasets

You can use existing Datasets or create new ones. The scenarios below specify the following dataset actions:

  • dst123abc:latest will be mounted to: /inputs/my-dataset

    job-1:
      inputs:
        my-dataset:
          type: dataset
          with:
            ref: dst123abc
      uses: container@v1
      with:
        image: bash:5
        args: ["bash", "-c", "ls /inputs/my-dataset"]

  • dst123abc:latest will be created by job-1 and mounted to job-2 at: /inputs/my-created-dataset

    job-1:
      outputs:
        my-dataset:
          type: dataset
          with:
            ref: dst123abc
      uses: container@v1
      with:
        image: bitnami/git
        args:
          [
            "bash",
            "-c",
            "git clone https://github.com/username/repo /outputs/my-dataset",
          ]
    job-2:
      needs:
        - job-1
      inputs:
        my-created-dataset: job-1.outputs.my-dataset
      uses: container@v1
      with:
        image: bash:5
        args: ["bash", "-c", "ls /inputs/my-created-dataset"]
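The second scenario wires one job's output to another job's input with a reference of the form job-1.outputs.my-dataset. The Python sketch below shows how such a reference maps to paths on each side. The function resolve_job_output is hypothetical, and the check that the producing job appears under needs is an assumption drawn from the example, not a documented rule.

```python
from typing import Dict, List


def resolve_job_output(input_name: str, ref: str, needs: List[str]) -> Dict[str, str]:
    """Map '<job>.outputs.<name>' to the producer and consumer paths."""
    parts = ref.split(".")
    if len(parts) != 3 or parts[1] != "outputs":
        raise ValueError(f"expected '<job>.outputs.<name>', got {ref!r}")
    producer, _, output_name = parts
    # Assumption: the consuming job must wait for the producing job.
    if producer not in needs:
        raise ValueError(f"{producer!r} is not listed under needs")
    return {
        "producer": producer,
        "written_at": f"/outputs/{output_name}",  # path inside the producer
        "mounted_at": f"/inputs/{input_name}",    # path inside the consumer
    }


print(resolve_job_output("my-created-dataset", "job-1.outputs.my-dataset", ["job-1"]))
```

The consumer chooses its own input name, so the directory name under /inputs does not have to match the name the producer used under /outputs.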

How to view datasets

$ gradient datasets list
+------+-----------------+-------------------------+
| Name | ID              | Storage Provider        |
+------+-----------------+-------------------------+
| test | dst364npcw6ccok | test1 (splgct3arqdh77c) |
+------+-----------------+-------------------------+

$ gradient datasets details --id=dst364npcw6ccok
+-----------------+-------------------------+
| Name            | test                    |
+-----------------+-------------------------+
| ID              | dst364npcw6ccok         |
| Description     |                         |
| StorageProvider | test1 (splgct3arqdh77c) |
+-----------------+-------------------------+

How to view dataset files

$ gradient datasets files list --id=dst364npcw6ccok:fo5rp4m
+-----------+------+
| Name      | Size |
+-----------+------+
| hello.txt | 12   |
+-----------+------+
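If you want to use these listings in a script, the table text can be turned into records. The Python sketch below assumes the output always has the bordered layout shown above, with a header row followed by data rows, and that cell values never contain a "|" character. The function parse_table is a hypothetical helper, not part of the Gradient CLI.

```python
from typing import Dict, List


def parse_table(text: str) -> List[Dict[str, str]]:
    """Convert a '|'-delimited table with a header row into dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip border lines such as +------+ and anything that is not a row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


sample = """
+-----------+------+
| Name      | Size |
+-----------+------+
| hello.txt | 12   |
+-----------+------+
"""
print(parse_table(sample))  # [{'Name': 'hello.txt', 'Size': '12'}]
```

All values are returned as strings. The key/value layout printed by gradient datasets details has no header row, so it would need a different treatment.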

Storage providers

Storage providers are a way to connect various storage resources to Gradient. Once connected, this storage can be used to store and access data for use in Gradient, such as models, artifacts, and datasets.

How storage providers work

Gradient uses storage providers with versioned data to ensure that your data is verified and immutable. Gradient will create a folder with the same name as your Paperspace team ID within the storage provider. Gradient storage providers do not provide general S3 capabilities through the storage provider interface. However, if you define additional storage providers, you can use the tools compatible with your storage provider to interact with the data stored by Gradient.

Gradient Managed Storage Provider

Your Gradient account automatically comes with a storage provider named Gradient Managed. This storage provider can be used without additional configuration to store data in Gradient's hosted S3-compatible object storage.

Gradient Managed storage has a default persistent storage quota, based on your Gradient subscription level, which can be used at no additional charge. After the default quota is consumed, you may need to upgrade your subscription plan to get more. See your Gradient subscription plan details for more info.

Setting up additional Storage Providers

Choose a public storage provider, such as DigitalOcean Spaces, AWS S3, Google GCS, MinIO, or similar. Gradient currently supports these types of storage providers:

  • DigitalOcean Spaces
  • AWS S3
  • S3-compatible

Define a storage bucket

Create a bucket within your storage provider, and a set of read/write credentials for accessing the data (usually an access key and secret key). Note the bucket name and endpoint URL, as well as the access key and secret key.
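Before entering these four values anywhere, it can help to sanity-check them. The Python sketch below gathers them in one place and reports obvious mistakes. The class BucketSettings and its field names are hypothetical and do not correspond to any Gradient API; the bucket name and keys in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass(frozen=True)
class BucketSettings:
    """The four values noted above (field names are illustrative)."""
    bucket: str
    endpoint_url: str
    access_key: str
    secret_key: str

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means none found."""
        issues = []
        for field_name in ("bucket", "endpoint_url", "access_key", "secret_key"):
            if not getattr(self, field_name).strip():
                issues.append(f"{field_name} is empty")
        parsed = urlparse(self.endpoint_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            issues.append("endpoint_url must be a full http(s) URL")
        return issues


settings = BucketSettings(
    bucket="my-gradient-data",                         # placeholder name
    endpoint_url="https://nyc3.digitaloceanspaces.com",
    access_key="EXAMPLEACCESSKEY",                     # placeholder
    secret_key="example-secret-key",                   # placeholder
)
print(settings.problems())  # []
```

In practice, read the access key and secret key from environment variables or a secrets manager rather than hard-coding them in a script.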

For DigitalOcean Spaces, you can navigate to the DigitalOcean Spaces page and create a new space in the desired region.

Once created, you can navigate to the Spaces Keys page to create a new access key pair.

Configure your storage bucket

By default, CORS is automatically configured for any DigitalOcean Spaces integration so that it works seamlessly with Gradient. If you are using a different storage provider, you may need to configure CORS rules for your bucket.

If you need to modify your CORS config, you can navigate to the Settings tab under your Spaces bucket.

Add a Storage Provider

A Storage Provider can be created on your team's settings page.

Note: The "AccessKey" and "SecretAccessKey" can be obtained from the Spaces Keys page within the DigitalOcean console.

Public datasets

A number of public datasets are available out of the box for use in Gradient.

How to use public datasets

A read-only collection of sample datasets is provided for free for use within Gradient.

  • For Notebooks, they are available in the directory /datasets, e.g., /datasets/mnist.
  • For Workflows, they are in the Gradient namespace, e.g., in YAML, ref: gradient/mnist.
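The two access patterns above follow a simple naming rule, sketched here in Python. The names in PUBLIC_DATASETS are taken from the list of public datasets on this page; the function public_dataset_locations is a hypothetical helper written for this example.

```python
from typing import Dict

# Directory names from the list of public datasets on this page.
PUBLIC_DATASETS = (
    "fastai", "lsun", "mnist", "coco", "openslr", "tiny-imagenet-200",
)


def public_dataset_locations(name: str) -> Dict[str, str]:
    """Return where a public dataset is found in Notebooks and Workflows."""
    if name not in PUBLIC_DATASETS:
        raise KeyError(f"unknown public dataset: {name!r}")
    return {
        "notebook_path": f"/datasets/{name}",  # read-only directory in Notebooks
        "workflow_ref": f"gradient/{name}",    # value for 'ref:' in Workflow YAML
    }


print(public_dataset_locations("mnist"))
```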

List of Public Datasets

Each entry below lists the dataset name, its Notebook path, its Workflow ref, a description, and the source.

Fast.ai

/datasets/fastai/

ref: gradient/fastai

Paperspace's Fast.ai template is built for getting up and running with the enormously popular Fast.ai online course, Practical Deep Learning for Coders.

Source: https://registry.opendata.aws/ (previously http://files.fast.ai/data/ )

LSUN

/datasets/lsun/

ref: gradient/lsun

Contains around one million labeled images for each of 10 scene categories and 20 object categories.

Source: http://www.yf.io/p/lsun (previously http://lsun.cs.princeton.edu/2017; link no longer active)

MNIST

/datasets/mnist/

ref: gradient/mnist

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples.

Source: http://yann.lecun.com/exdb/mnist/

COCO

/datasets/coco

ref: gradient/coco

COCO is a large-scale object detection, segmentation, and captioning dataset.


Source: http://cocodataset.org/

OpenSLR

/datasets/openslr

ref: gradient/openslr

Open Speech and Language Resources. This is dataset number 12, the LibriSpeech ASR corpus.


Source: https://www.openslr.org/resources.php

Tiny-imagenet-200

/datasets/tiny-imagenet-200

ref: gradient/tiny-imagenet-200

A subset of the ImageNet dataset created by the Stanford CS231n course. It spans 200 image classes with 500 training examples per class. It also has 50 validation and 50 test examples per class.


Source: http://cs231n.stanford.edu/tiny-imagenet-200.zip