Data

Gradient offers two types of storage: persistent storage and versioned data. This guide explains how to connect a storage provider to access external resources, how to use the public datasets that come out of the box in Gradient, and more.

Overview

There are two types of storage offered in Gradient: Persistent Storage and Versioned Data.

Persistent storage

Persistent storage is a high-performance storage directory available within Gradient Notebooks at /storage. It is backed by a filesystem and is ideal for storing data such as images, datasets, and model checkpoints. Anything you store in the /storage directory is accessible across multiple notebook runs in a given storage region.
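
As a minimal sketch of how a notebook might use this directory, the Python below writes a checkpoint file under a storage root and reads it back. Only the /storage path comes from this guide; the helper names, the checkpoints subfolder, and the storage_root parameter are illustrative assumptions.

```python
from pathlib import Path

# /storage is the persistent directory described above. The parameter exists so
# the helpers can be pointed at another directory, e.g. for local testing.
DEFAULT_STORAGE_ROOT = Path("/storage")


def save_checkpoint(name: str, payload: bytes, storage_root: Path = DEFAULT_STORAGE_ROOT) -> Path:
    """Write payload to <storage_root>/checkpoints/<name> and return the path."""
    target_dir = Path(storage_root) / "checkpoints"  # hypothetical subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)
    return target


def load_checkpoint(name: str, storage_root: Path = DEFAULT_STORAGE_ROOT) -> bytes:
    """Read back a checkpoint previously written by save_checkpoint."""
    return (Path(storage_root) / "checkpoints" / name).read_bytes()
```

Because the files live under /storage rather than in the notebook's working directory, a later notebook run in the same storage region should be able to call load_checkpoint with the same name.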

For more information on persistent storage in notebooks, check out the files and storage section of the notebooks docs.

Versioned data

In Workflows and Datasets, Gradient provides the ability to mount S3-compatible object storage buckets to workloads at runtime. Datasets have immutable versions that can be used to track your data as it changes.

The easiest way to get data into Gradient is to use the web-based uploader. Read on to understand how versioned data works within Gradient and learn about connecting to additional data sources.

How versioned data works

Versioned Datasets are used to manage the flow of data through your machine learning workloads. Datasets have immutable versions that can be used to track your data as it changes. Dataset versions can be used as inputs to Gradient workloads as well as outputs. Data is stored at a Storage Provider and is cached on a Gradient cluster's shared storage for a period of time so that it is readily available on repeated use.

Volumes

Within Gradient, Volumes allow various Gradient resources to access a shared Network File System. Storage volumes provide a block-level storage device for use as the primary system drive on a Paperspace machine. Volumes appear to the operating system as locally attached storage which can be partitioned and formatted according to your needs.

Volumes for Notebooks, Workflows, and Deployments:

  • Gradient Versioned Datasets - These are created by the user for storing ML data, artifacts, models, etc. The data is accessible via one or more user-chosen, job-specific directory paths. Users can create Versioned Datasets directly via the CLI, within the Gradient GUI in the Data tab, or through Workflows, which enables regular updates to the files via GitHub. Versioned Datasets can be stored in Gradient Managed storage or with a chosen storage provider.

    • /inputs/{user-chosen-job-specific-dir-name1}
    • /inputs/{user-chosen-job-specific-dir-name2}
    • /outputs/{user-chosen-job-specific-dir-name1}
    • /outputs/{user-chosen-job-specific-dir-name2}
  • /storage - This is a team-wide shared storage space on NFS or another Kubernetes Container Storage Interface storage option, such as Ceph. It is created and allocated as a Kubernetes PersistentVolume during installation, but can be accessed by the customer afterwards.

  • Gradient Volumes - These are temporary Workflow run volumes that only exist for the duration of the Workflow run. They are referenced under the same root paths as Gradient Dataset Versions. Use these volumes to instantiate, access, and upload to temporary storage spaces that facilitate your Workflow without having to store the files or data permanently in one of the persistent storage options.

    • /inputs/{user-chosen-job-specific-dir-name3}
    • /inputs/{user-chosen-job-specific-dir-name4}
    • /outputs/{user-chosen-job-specific-dir-name3}
    • /outputs/{user-chosen-job-specific-dir-name4}
    • ...

Volumes accessible by Notebooks only:

  • /notebook - This is a directory path under the team's /storage root that stores the home directory content of each notebook run. The files in the notebook repo can be cloned directly from GitHub to efficiently set up your workspace. This is done via the Workspace URL entry box in the Advanced Options section of the Notebook Create page. The directory is allocated as a temporary subvolume under the main team storage volume.

Team-wide Volumes:

  • /{team-id}/datasets - This contains cached named versions of the Gradient Datasets. The team can control the size of this cache area. Data stored in the cache is automatically backed up to the configured team Object Storage.

Cluster-wide Volumes:

  • "metrics" - This is a persistent volume where Prometheus metrics data is stored.
  • "share-storage" - Team subvolumes are allocated from this cluster-wide persistent volume.

Versions, Tags, and Messages

Datasets have multiple versions that can be referenced. You can attach a message to a new dataset version to describe it, and you can tag a specific dataset version with a custom name. Here are the available ways to reference a dataset:

  • [dataset-id]:latest : this will use the latest version of your dataset
  • [dataset-id]:[dataset-version] : this will use the specified dataset-version
  • [dataset-id]:[dataset-tag] : this will use the specified dataset version that the dataset-tag points to
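
The reference formats above can be sketched as a small parser. This is an illustration only, not part of the Gradient CLI or SDK: the names DatasetRef and parse_dataset_ref are hypothetical, and treating a bare dataset ID as :latest is an assumption based on the Workflow examples later in this guide.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    dataset_id: str
    selector: str  # "latest", a version ID, or a tag name


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split "[dataset-id]:[selector]" into its two parts.

    Assumption: a reference without a ":" suffix is treated as ":latest".
    """
    dataset_id, sep, selector = ref.partition(":")
    if not dataset_id:
        raise ValueError(f"missing dataset ID in reference: {ref!r}")
    if sep and not selector:
        raise ValueError(f"empty version or tag in reference: {ref!r}")
    return DatasetRef(dataset_id, selector or "latest")


print(parse_dataset_ref("dst364npcw6ccok:fo5rp4m"))
```

The string alone does not reveal whether the selector is a version ID or a tag, so the sketch keeps it as a single field.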

Committed state

Dataset versions have an uncommitted and committed state. When a Dataset is uncommitted, you can modify or add files freely. When a Dataset is committed it will be immutable (will not allow any modifications). This allows workloads to be repeatable and deterministic with the provided Datasets.
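
To make the two states concrete, here is a toy in-memory model in Python. It is not the Gradient API; the class and method names are hypothetical and only mirror the rule described above: files can be added until the version is committed, and never afterwards.

```python
class DatasetVersionModel:
    """Toy model of a dataset version's lifecycle (not the Gradient API)."""

    def __init__(self) -> None:
        self.files = {}  # file name -> content
        self.committed = False

    def put_file(self, name: str, content: bytes) -> None:
        # Modifications are only allowed while the version is uncommitted.
        if self.committed:
            raise PermissionError("version is committed and cannot be modified")
        self.files[name] = content

    def commit(self) -> None:
        # After this point the version is immutable.
        self.committed = True


version = DatasetVersionModel()
version.put_file("hello.txt", b"hello world\n")
version.commit()
try:
    version.put_file("extra.txt", b"too late")
except PermissionError as err:
    print(f"rejected: {err}")
```

Because a committed version can no longer change, two workloads that reference the same version see the same files.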

How to create a dataset and dataset version in the GUI

To create a new dataset (one that does not yet have an ID) in the GUI, go to the Data tab in your team's page and click "Create a Dataset". This brings up a window where you give the dataset a name and an optional description, and select the storage provider on which it will be created.

Creation of new dataset

If the team already has datasets, there is a similar "Add" button. The resulting screen after creation allows you to upload files, or you can just retrieve the dataset ID for use elsewhere.

Optional importing of data

Importing data, or adding it in some other way such as a Workflow output, will create a new version of the dataset.

Datasets can also be created from the Notebook IDE through the Datasets tab.

How to create a dataset and dataset version in the CLI

To create a new dataset (one that does not yet have an ID) in the CLI, use the gradient datasets create command as shown below:

$ gradient datasets create --name democli --storageProviderId ssfe843ndkjdsnr
Created dataset: dsr5zdx0thjhfe2

All Gradient datasets are versioned, so to make any changes to data in a Dataset, first you need to create a new version. That can be done with the command below.

$ gradient datasets versions create --id dst364npcw6ccok
Created dataset version: dst364npcw6ccok:fo5rp4m

Once the version is created, you can then add files to the Dataset version.

$ gradient datasets files put --id dst364npcw6ccok:fo5rp4m --source-path ./some-data/

Once all desired files are uploaded to the version, commit the version to the Dataset.

$ gradient datasets versions commit --id dst364npcw6ccok:fo5rp4m
Committed dataset version: dst364npcw6ccok:fo5rp4m

Once the Dataset version is committed, the data will be available in the UI and to reference in other Gradient services (i.e. Notebooks, Workflows, and Deployments).
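
If you script these steps, the sequence can be sketched in Python as below. The sketch only builds and prints the commands; it does not run them. The flags are copied from the commands above, the helper names are hypothetical, and writing the commit step under the same versions group as the create step is an assumption about the CLI.

```python
import shlex
from typing import List


def create_version_cmd(dataset_id: str) -> List[str]:
    """Command that opens a new, uncommitted version of an existing dataset."""
    return ["gradient", "datasets", "versions", "create", "--id", dataset_id]


def parse_version_ref(cli_output: str) -> str:
    """Extract "<dataset-id>:<version>" from output such as
    "Created dataset version: dst364npcw6ccok:fo5rp4m"."""
    return cli_output.strip().rsplit(" ", 1)[-1]


def upload_and_commit_cmds(version_ref: str, source_path: str) -> List[List[str]]:
    """Commands that add files to the version and then commit it."""
    return [
        ["gradient", "datasets", "files", "put", "--id", version_ref, "--source-path", source_path],
        # Assumption: commit lives under the same "versions" group as create.
        ["gradient", "datasets", "versions", "commit", "--id", version_ref],
    ]


version_ref = parse_version_ref("Created dataset version: dst364npcw6ccok:fo5rp4m")
commands = [create_version_cmd("dst364npcw6ccok")]
commands += upload_and_commit_cmds(version_ref, "./some-data/")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments means the same values could later be passed to subprocess.run without shell quoting issues.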

How to use datasets

You can use existing Datasets or create new ones. In the below scenarios, the following dataset actions are specified:

  • dst123abc:latest will be mounted to: /inputs/my-dataset

    job-1:
      inputs:
        my-dataset:
          type: dataset
          with:
            ref: dst123abc
      uses: container@v1
      with:
        image: bash:5
        args: ["bash", "-c", "ls /inputs/my-dataset"]

  • dst123abc:latest will be created by job-1 and mounted to job-2 at: /inputs/my-created-dataset

    job-1:
      outputs:
        my-dataset:
          type: dataset
          with:
            ref: dst123abc
      uses: container@v1
      with:
        image: bitnami/git
        args:
          [
            "bash",
            "-c",
            "git clone https://github.com/username/repo /outputs/my-dataset",
          ]
    job-2:
      needs:
        - job-1
      inputs:
        my-created-dataset: job-1.outputs.my-dataset
      uses: container@v1
      with:
        image: bash:5
        args: ["bash", "-c", "ls /inputs/my-created-dataset"]
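
The mount convention used in these scenarios (inputs under /inputs/<name>, outputs under /outputs/<name>) can be sketched in Python. The job dictionaries below are hand-written stand-ins for the parsed YAML, and mount_paths is a hypothetical helper, not a Gradient function.

```python
def mount_paths(job: dict) -> dict:
    """Map each declared input/output name to the directory the job sees."""
    paths = {}
    for name in job.get("inputs", {}):
        paths[f"inputs.{name}"] = f"/inputs/{name}"
    for name in job.get("outputs", {}):
        paths[f"outputs.{name}"] = f"/outputs/{name}"
    return paths


# Stand-ins for job-1 and job-2 from the second scenario.
job_1 = {
    "outputs": {"my-dataset": {"type": "dataset", "with": {"ref": "dst123abc"}}},
    "uses": "container@v1",
}
job_2 = {
    "needs": ["job-1"],
    "inputs": {"my-created-dataset": "job-1.outputs.my-dataset"},
    "uses": "container@v1",
}

print(mount_paths(job_1))
print(mount_paths(job_2))
```

Note that the directory name comes from the key the job declares (my-created-dataset), not from the name the upstream job used for its output (my-dataset).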

How to view datasets

$ gradient datasets list
+------+-----------------+-------------------------+
| Name | ID              | Storage Provider        |
+------+-----------------+-------------------------+
| test | dst364npcw6ccok | test1 (splgct3arqdh77c) |
+------+-----------------+-------------------------+

$ gradient datasets details --id=dst364npcw6ccok
+-----------------+-------------------------+
| Name            | test                    |
+-----------------+-------------------------+
| ID              | dst364npcw6ccok         |
| Description     |                         |
| StorageProvider | test1 (splgct3arqdh77c) |
+-----------------+-------------------------+

How to view dataset files

$ gradient datasets files list --id=dst364npcw6ccok:fo5rp4m
+-----------+------+
| Name      | Size |
+-----------+------+
| hello.txt | 12   |
+-----------+------+

Storage providers

Storage providers are a way to connect various storage resources to Gradient. Once connected, this storage can be used to store and access data for use in Gradient, such as models, artifacts, and datasets.

How storage providers work

Gradient uses storage providers with versioned data to ensure that your data is verified and immutable. Gradient will create a folder with the same name as your Paperspace team ID within the storage provider. Gradient storage providers do not provide general S3 capabilities through the storage provider interface. However, if you define additional storage providers, you can use the tools compatible with your storage provider to interact with the data stored by Gradient.

Gradient Managed Storage Provider

Your Gradient account automatically comes with a storage provider named Gradient Managed. This storage provider can be used without additional configuration to store data in Gradient's hosted S3-compatible object storage.

Gradient Managed storage has a default persistent storage quota, based on your Gradient subscription level, which can be used at no additional charge. After the default quota is consumed, you may need to upgrade your subscription plan to get more storage. See your Gradient subscription plan details for more info.

Setting up additional Storage Providers

Choose a public storage provider, such as AWS S3, Google Cloud Storage (GCS), MinIO, or similar. Gradient currently supports the following types of storage provider:

  • S3-compatible storage

Define a storage bucket

Create a bucket within your storage provider, and a set of read/write credentials for accessing the data (usually an access key and secret key). Note the bucket name, endpoint URL, access key, and secret key.

For AWS S3 you can define a bucket using the AWS CLI. Here we create a bucket named my-gradient-storage-provider-bucket.

aws s3api create-bucket --bucket my-gradient-storage-provider-bucket --region us-east-1

Gradient will create folders and objects inside this bucket, once configured.

Configure your storage bucket

From within a new or existing S3 bucket, you'll need to edit the CORS configuration so your data can be viewed within Gradient.

As a best practice, you should create a dedicated user identity and access key/secret key pair for accessing the bucket. You should also set a restricted access policy on the user identity so it is limited to this bucket.

Assign CORS Rules (AWS)

Add CORS rules to your bucket:

[
  {
    "AllowedHeaders": [
      "*"
    ],
    "AllowedMethods": [
      "GET",
      "PUT"
    ],
    "AllowedOrigins": [
      "https://console.paperspace.com"
    ],
    "ExposeHeaders": [],
    "MaxAgeSeconds": 3000
  }
]

In the AWS S3 console bucket permissions settings, you'll see an option to edit the CORS configuration. Click edit, then copy and paste the JSON above, and then save your changes.

Alternatively you can apply them using the AWS CLI:

aws s3api put-bucket-cors --bucket my-gradient-storage-provider-bucket --cors-configuration '{
  "CORSRules": [
    {
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["GET", "PUT"],
      "AllowedOrigins": ["https://console.paperspace.com"],
      "MaxAgeSeconds": 3000
    }
  ]
}'
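
Before applying a rule set, it can help to check that it permits what the Gradient console needs. The Python below is a rough sketch of such a check, not an AWS or Gradient tool: cors_allows is a hypothetical helper, and it only matches exact origins and a bare "*" rather than implementing full CORS wildcard matching.

```python
import json

# The rule from this guide, as a Python structure.
CORS_RULES = [
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT"],
        "AllowedOrigins": ["https://console.paperspace.com"],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 3000,
    }
]


def cors_allows(rules, origin: str, method: str) -> bool:
    """Return True if any rule lists both the origin and the method.

    Simplification: only exact origins and a bare "*" are matched.
    """
    for rule in rules:
        origins = rule.get("AllowedOrigins", [])
        methods = rule.get("AllowedMethods", [])
        if ("*" in origins or origin in origins) and method in methods:
            return True
    return False


# Same shape as the --cors-configuration argument shown above.
cli_payload = json.dumps({"CORSRules": CORS_RULES}, indent=2)
print(cli_payload)
```

With the rule above, GET and PUT from https://console.paperspace.com pass the check, while other methods or origins do not.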

Assign CORS Rules (GCS)

Create a cors.json file with the following contents:

[
  {
    "responseHeader": [
      "*"
    ],
    "method": [
      "GET",
      "PUT"
    ],
    "origin": [
      "https://console.paperspace.com"
    ],
    "maxAgeSeconds": 3000
  }
]

Apply it to your bucket:

gsutil cors set cors.json gs://my-storage-provider-bucket
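
The AWS and GCS samples in this guide carry the same settings under different key names. The Python sketch below mirrors that field-by-field correspondence so one rule can be kept as the source for both. It is based only on the two samples here, so treat it as an assumption rather than a general converter; s3_rule_to_gcs is a hypothetical helper, and keys without a counterpart in the GCS sample (such as ExposeHeaders) are dropped.

```python
import json

# Key correspondence taken from the two samples in this guide.
S3_TO_GCS_KEYS = {
    "AllowedHeaders": "responseHeader",
    "AllowedMethods": "method",
    "AllowedOrigins": "origin",
    "MaxAgeSeconds": "maxAgeSeconds",
}

S3_RULE = {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "PUT"],
    "AllowedOrigins": ["https://console.paperspace.com"],
    "ExposeHeaders": [],
    "MaxAgeSeconds": 3000,
}


def s3_rule_to_gcs(rule: dict) -> dict:
    """Rename the keys of one S3-style CORS rule to the GCS-style names."""
    return {
        gcs_key: rule[s3_key]
        for s3_key, gcs_key in S3_TO_GCS_KEYS.items()
        if s3_key in rule
    }


# Text that could be saved as cors.json for gsutil.
gcs_cors_json = json.dumps([s3_rule_to_gcs(S3_RULE)], indent=2)
print(gcs_cors_json)
```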

Create a Restricted User and Access Key/Secret Key

Create a restricted user identity and access key/secret key for accessing the bucket.

Here we create these using the AWS CLI:

aws iam create-user --user-name gradient-storage-provider-user

aws iam create-access-key --user-name gradient-storage-provider-user

Note the returned AccessKeyId and SecretAccessKey values as they will only appear once.

Create Bucket Access Policy

Gradient requires a minimal level of policy permissions to access the bucket. Sample permissions for a bucket named my-gradient-storage-provider-bucket are as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGeneratedUrls",
      "Effect": "Allow",
      "Action": "sts:GetFederationToken",
      "Resource": "*"
    },
    {
      "Sid": "AllowListBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-gradient-storage-provider-bucket"
    },
    {
      "Sid": "AllowBucketAccess",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::my-gradient-storage-provider-bucket/*"
    }
  ]
}

Place the policy definition in a file, e.g., gradient-storage-provider-access-policy.json.

Edit the file and replace my-gradient-storage-provider-bucket with your actual bucket name in the AllowListBucket and AllowBucketAccess Resource fields.
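
Instead of editing the file by hand, the bucket name can be substituted programmatically. The Python below is a minimal sketch that reproduces the sample policy above for a given bucket; build_access_policy is a hypothetical helper name, and the statements are copied from the sample rather than derived from any additional AWS requirements.

```python
import json


def build_access_policy(bucket: str) -> dict:
    """Return the sample policy from this guide with the bucket name filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGeneratedUrls",
                "Effect": "Allow",
                "Action": "sts:GetFederationToken",
                "Resource": "*",
            },
            {
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,  # the bucket itself
            },
            {
                "Sid": "AllowBucketAccess",
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": f"{bucket_arn}/*",  # objects inside the bucket
            },
        ],
    }


policy_json = json.dumps(build_access_policy("my-gradient-storage-provider-bucket"), indent=2)
print(policy_json)
```

Redirecting the printed output to gradient-storage-provider-access-policy.json gives a file that can be passed to the create-policy step.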

Create Policy

Using the AWS CLI create a policy object from the policy definition file:

aws iam create-policy --policy-name gradient-storage-provider-access-policy \
--policy-document file://gradient-storage-provider-access-policy.json

Note the returned policy "Arn" value, which is used when assigning the policy to the user identity.

Attach Policy to User

Attach the policy object to the restricted user identity.

Using the AWS CLI:

aws iam attach-user-policy --user-name gradient-storage-provider-user \
--policy-arn "arn:aws:iam::XXXXXXXXXXXX:policy/gradient-storage-provider-access-policy"

Add a Storage Provider

A Storage Provider can be created on your team's settings page.

Note: The "AccessKey" and "SecretAccessKey" can be obtained from the "My security credentials" section of the AWS Identity and Access Management (IAM) portal.

CLI

A Storage Provider can also be created with the Gradient CLI:

$ gradient storageProviders create s3 --name test --bucket my-bucket --accessKey=access-key --secretAccessKey=secret-key
Created storage provider: splgct3arqdh77c

Public datasets

A number of public datasets are available out of the box for use in Gradient.

How to use public datasets

A read-only collection of sample datasets is provided for free for use within Gradient.

  • For Notebooks, they are available in the directory /datasets, e.g., /datasets/mnist.
  • For Workflows, they are in the Gradient namespace, e.g., in YAML, ref: gradient/mnist.
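
The two naming conventions above can be sketched as small Python helpers. The function names are hypothetical. Note that some datasets in the list below show a Notebook path but no Workflow ref, so workflow_ref is only a naming convention, not a guarantee that the ref exists.

```python
import os
from typing import List


def notebook_path(name: str) -> str:
    """Directory where a public dataset appears inside a Notebook."""
    return f"/datasets/{name}"


def workflow_ref(name: str) -> str:
    """Reference to use as `ref:` in Workflow YAML, in the gradient namespace."""
    return f"gradient/{name}"


def available_public_datasets(root: str = "/datasets") -> List[str]:
    """List dataset directories under root; empty if root is not present."""
    if not os.path.isdir(root):
        return []
    return sorted(
        entry for entry in os.listdir(root)
        if os.path.isdir(os.path.join(root, entry))
    )


print(notebook_path("mnist"), workflow_ref("mnist"))
```

Inside a Notebook, available_public_datasets() is one way to see which dataset directories are actually mounted before relying on a path.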

List of Public Datasets

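Each entry below lists the dataset name, its Notebook path, its Workflow ref (where available), a description, and the source.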

Fast.ai

/datasets/fastai/

ref: gradient/fastai

Paperspace's Fast.ai template is built for getting up and running with the enormously popular Fast.ai MOOC, Practical Deep Learning for Coders.

Source: https://registry.opendata.aws/ (previously http://files.fast.ai/data/ )

LSUN

/datasets/lsun/

ref: gradient/lsun

Contains around one million labeled images for each of 10 scene categories and 20 object categories.


Source: http://www.yf.io/p/lsun

(was http://lsun.cs.princeton.edu/2017; link no longer active)

MNIST

/datasets/mnist/

ref: gradient/mnist

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples.


Source: http://yann.lecun.com/exdb/mnist/

COCO

/datasets/coco

ref: gradient/coco

COCO is a large-scale object detection, segmentation, and captioning dataset.


Source: http://cocodataset.org/

Selfie

/datasets/selfie

ref: gradient/selfie

Selfie dataset contains 46,836 selfie images annotated with 36 different attributes divided into several categories.


Source: https://www.crcv.ucf.edu/data/Selfie/

(was http://crcv.ucf.edu/data/Selfie )

StyleGAN

/datasets/stylegan

StyleGAN is a Style-Based Generator Architecture for Generative Adversarial Networks. This dataset allows for photographs of people to be produced by the generator.


Source: https://github.com/NVlabs/stylegan

OpenSLR

/datasets/openslr

ref: gradient/openslr

Open Speech and Language Resources. This is dataset number 12, the LibriSpeech ASR corpus.


Source: https://www.openslr.org/resources.php

Self Driving Demo

/datasets/self-driving-demo-data

A dataset by comma.ai that includes over 33 hours of commute on California's I280 freeway.


Source: https://github.com/commaai/comma2k19

Sentiment140

/datasets/sentiment140

ref: gradient/sentiment140

Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.


Source: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

Tiny-imagenet-200

/datasets/tiny-imagenet-200

ref: gradient/tiny-imagenet-200

A subset of the ImageNet dataset created by the Stanford CS231n course. It spans 200 image classes with 500 training examples per class. It also has 50 validation and 50 test examples per class.


Source: http://cs231n.stanford.edu/tiny-imagenet-200.zip