Storage and datasets
In Gradient Notebooks, there is a file browser, shared persistent storage, and Gradient Datasets. This guide explains the full storage architecture of your notebook.
Introduction to the file architecture of Gradient Notebooks
Every notebook in Gradient has a file management interface that looks like this:
The file manager within the notebook does not represent the full file structure of the notebook.
The full file structure of a notebook is as follows:
Here are the main components:
- File manager - Files available in the normal IDE sidebar. This corresponds to the directory located at /notebooks.
- Storage - Shared persistent storage directory accessible to your entire team on a specific cluster. Available at /storage. This is a method for sharing data across notebooks and users. In the case of the Private Workspace team, the /storage volume cannot be shared with other users.
- Gradient Datasets - Team and public datasets that you can mount in the IDE. Ideal for large amounts of data and for sharing. Public datasets include popular datasets that Gradient makes available out of the box, such as MNIST.
What is the file manager?
Refer to Introduction to the file structure of Gradient Notebooks to understand the overall file architecture of Gradient Notebooks.
Files stored in the file manager are persisted across notebook sessions. This is the same directory that is represented by the yellow box labeled { notebook IDE } in the previous section.
Within the /notebooks directory, the folder name checkpoints is reserved by Jupyter. Do not use checkpoints as a directory name, as this can cause unexpected behavior.
The notebook must be in the Running state to display files.
How to upload large files and folders to the file manager
To upload a large number of files or a large amount of data, it is best to use command-line tools such as curl, Wget, or gdown.
Here is an example of how to use Wget to download the Stanford Dogs dataset to our notebook:
This command downloads the dataset to our current folder:
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
That's all there is to it! We can also perform the same command from the terminal if we are on the Pro or Growth subscription plans.
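When a Python call is more convenient than a shell command, the same kind of download can be sketched with the standard library alone. This is only an illustrative sketch: the helper name download_file and its dest_dir parameter are assumptions made for this example and are not part of Gradient.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url: str, dest_dir: str = ".") -> Path:
    """Stream the resource at `url` into `dest_dir` and return the local path.

    Hypothetical helper for illustration only; it is not a Gradient API.
    """
    # Reuse the last path segment of the URL as the local file name.
    filename = Path(urlparse(url).path).name
    target = Path(dest_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    # copyfileobj copies in chunks, so a large file never has to fit in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (not executed here because it needs network access):
# download_file("http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar", "/notebooks")
```

For very large archives, Wget or curl remain the simpler choice because they can resume interrupted transfers.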
Transferring files from Google Drive
Files and folders in Google Drive can be brought into your notebook using gdown.
- Through the notebook or terminal, execute pip install gdown to install and pip install --upgrade gdown to upgrade. Use a ! before each command in the notebook.
- In the permissions settings of the files/folders you want to upload, set the permissions to "Anyone with the Link."
- Obtain the file ID by copying and extracting it from the file share link, then use one of the following commands based on your needs.
  - For files larger than 500 MB, use:
    gdown "<file_ID>&confirm=t"
  - For smaller files:
    gdown <file_ID>
  - For folders:
    gdown https://drive.google.com/drive/folders/<file_ID> -O /tmp/folder --folder
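Extracting the file ID from a share link by hand is error-prone, so a small helper can do it. The sketch below is an assumption-laden illustration. The function name extract_drive_id is hypothetical, and the three link shapes it recognizes (/file/d/<ID>/, /folders/<ID>, and ?id=<ID>) are common forms of Google Drive share links, not an exhaustive or guaranteed list.

```python
import re
from typing import Optional

# Assumed share-link shapes; Google Drive may produce others.
_ID_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),   # .../file/d/<ID>/view?usp=sharing
    re.compile(r"/folders/([A-Za-z0-9_-]+)"),  # .../drive/folders/<ID>
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),    # ...?id=<ID>
)


def extract_drive_id(share_link: str) -> Optional[str]:
    """Return the ID embedded in a Drive share link, or None if none is found.

    Hypothetical helper for illustration only.
    """
    for pattern in _ID_PATTERNS:
        match = pattern.search(share_link)
        if match:
            return match.group(1)
    return None
```

The returned value can then be substituted for <file_ID> in the gdown commands listed above.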
How to download files and folders from the file manager
To download large files or folders from the notebook, we suggest you zip/tar the files first. You can do this from the notebook or terminal.
- Compress the files/folders using one of the following commands in a notebook code cell or the terminal. If you use the notebook, make sure to add a ! before each command.
  tar
  cd /notebooks
  tar -cf [filename].tar [file1] [file2]...
  zip
  cd /notebooks
  zip -r [filename].zip [file1] [file2]...
- Refresh the file manager.
- Right-click the compressed file you created.
- Select the Download option.
If the files are in shared storage or a dataset, they can be downloaded by moving them into the file manager and following the steps shown above.
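The compression step can also be done from a code cell in Python instead of through shell commands. The following is a minimal sketch using the standard tarfile module. The helper name make_tar and its parameters are assumptions for illustration, and the result is an uncompressed archive comparable to what tar -cf produces.

```python
import tarfile
from pathlib import Path
from typing import Sequence


def make_tar(archive_path: str, base_dir: str, names: Sequence[str]) -> Path:
    """Bundle the entries `names` found in `base_dir` into one tar archive.

    Hypothetical helper for illustration only.
    """
    archive = Path(archive_path)
    with tarfile.open(archive, "w") as tar:
        for name in names:
            # arcname keeps paths relative, as running tar from base_dir would.
            tar.add(Path(base_dir) / name, arcname=name)
    return archive


# Example call, assuming these entries exist in /notebooks:
# make_tar("/notebooks/results.tar", "/notebooks", ["outputs", "train.ipynb"])
```

Directories passed in names are added recursively, so one call can archive a whole folder.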
What is shared storage?
Refer to Introduction to the file structure of Gradient Notebooks to understand the overall file architecture of Gradient Notebooks.
Data can be shared between users on a team and between notebooks that belong to users on a team.
Access to shared persistent storage must be done through code, either via the notebook terminal or via a code cell within a notebook, as there is currently no way to access shared persistent storage from the GUI.
Shared storage cannot be accessed across clusters. As a result, data stored in /storage on the Gradient cluster will not be accessible on the Graphcore cluster.
How to access shared persistent storage from a notebook code cell
We can access shared persistent storage from a code cell within a notebook by using the ! operator and issuing our bash commands on a single line, connected with the && operator.
For example, to create a new directory within our persistent /storage directory, we'll input the following:
!cd /storage && mkdir data && cd data
This is what that would look like in a notebook code cell:
We can also access persistent storage via the terminal, as described in the next section.
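Each ! line in a notebook runs in its own subshell, so a cd issued that way does not carry over to later lines or cells. This is why the commands above are chained with &&. When only the directory is needed, a pure-Python alternative avoids the issue. The sketch below is illustrative: the helper name ensure_storage_dir is an assumption, and the root parameter exists only so the function can be tried outside a Gradient notebook.

```python
from pathlib import Path


def ensure_storage_dir(name: str, root: str = "/storage") -> Path:
    """Create `root/name` if it does not exist yet and return its path.

    Hypothetical helper for illustration only.
    """
    path = Path(root) / name
    # exist_ok=True makes the call safe to repeat across notebook runs.
    path.mkdir(parents=True, exist_ok=True)
    return path


# In a Gradient notebook this would create /storage/data:
# data_dir = ensure_storage_dir("data")
```

The returned path can be passed directly to any code that writes files meant to be shared across notebooks.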
How to access shared persistent storage from a notebook terminal
The terminal feature requires a Gradient Pro or Gradient Growth subscription.
To access persistent storage in a Gradient Notebooks terminal, we can use the cd command to change into the persistent directory /storage.
Let's say we'd like to create a new persistent directory called data. We can accomplish this as follows:
cd /storage
mkdir data
cd data
Let's try it out:
We can now use the directory located at /storage/data to store any files we need to access across users and notebooks.
How to view storage limits
Storage in Gradient is scoped to the team level. By default, storage tiers are as follows:
Subscription | Free | Pro | Growth | Enterprise |
---|---|---|---|---|
Storage | 5 GB | 15 GB | 50 GB | β GB |
Excess storage is billed at $0.29 per GB per month and this is prorated for the duration of the month.
As an example, if we are on the Pro plan, which grants us 15 GB of storage, and we use 50 GB of storage for an entire month, we will be billed (50 - 15) * 0.29 = $10.15 on top of our normal bill.
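The arithmetic in this example can be written as a small function. This is a sketch for a full month of usage only. The function name and parameters are assumptions for illustration, and the default rate is the $0.29 per GB per month figure quoted above.

```python
def monthly_overage_cost(used_gb: float, included_gb: float,
                         rate_per_gb_month: float = 0.29) -> float:
    """Return the overage charge in dollars for one full month of usage.

    Hypothetical helper for illustration only.
    """
    # Only storage beyond the plan allowance is billable.
    overage_gb = max(0.0, used_gb - included_gb)
    return round(overage_gb * rate_per_gb_month, 2)


# Pro plan (15 GB included) with 50 GB used for the whole month:
print(monthly_overage_cost(50, 15))  # 10.15
```

Usage at or below the included amount yields a charge of 0.0.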
To view storage utilization, visit the Storage tab in the workspace settings.
Here we have an example of the Storage tab for a new team that is not yet using any volume storage:
Here we have an example of a Private Workspace team that is using a good amount of storage:
If we expect to be billed for storage overages, we can use the Utilization tab to explore our storage use further.
Use the file management tab to upload data, organize files and folders, and download files stored in a notebook.
Some additional options such as renaming, duplicating, and deleting files and folders are available by clicking the menu icon on the individual entity.
There are multiple ways to upload files to a notebook, which are discussed in the following sections.
What is a dataset?
Refer to Introduction to the file structure of Gradient Notebooks to understand the overall file architecture of Gradient Notebooks.
Gradient Datasets are available as a first-class resource within Gradient Notebooks.
How to mount datasets in a notebook
The IDE supports mounting Gradient Datasets to explore data and train models. Use the datasets tab to mount existing team datasets, mount public datasets, and create new team datasets.
Mounting a dataset is as easy as clicking the MOUNT button next to either the team or public dataset you would like to use.
Mounting a team dataset mounts only the latest version of that dataset. To change the version of the dataset, see the Datasets Advanced Settings section below.
How to add small datasets to a notebook
To add a new dataset, click on the + icon. Then name, describe, and upload the data. Feel free to close the modal once you start the upload; the process continues in the background.
Datasets can also be added from the Gradient Project level. To learn more see this article.
How to add large datasets (5GB +) to a notebook
To create datasets larger than 5GB, you can use the CLI through the terminal. To learn more about how to create a dataset through the CLI, see this article.
Datasets Advanced Settings
To access the settings file that manages all of your mounted Datasets, go to .gradient/settings.yaml. Here you can see all of the mounted Datasets and their arguments. This file should only be used to do one of the following:
- Change the version-id of the dataset that should be mounted.
integrations:
  quarterly-reports: # mounts in /datasets/quarterly-reports
    type: dataset # denotes a paperspace dataset
    id: dataset-id # a paperspace dataset id
    version: version-id # a paperspace version id
  my-bucket-data: # mounts in /datasets/my-bucket-data
    type: s3 # an s3 bucket
    url: s3://my-bucket/my-data # your s3 bucket url
    accessKeyId: AK123 # your s3 access key id
    secretAccessKey: secret:my-bucket-secret-key # a paperspace secret with your s3 secret key
    region: "us-west-1" # the aws region your bucket is in, if not in aws set "endpoint"
    endpoint: "https://my-bucket-host.com" # a custom bucket host, do not set region if set
Other Data Sources
DigitalOcean Spaces
The Gradient IDE provides the ability to mount DigitalOcean Spaces into Notebooks to access data that is stored externally. This is available to Pro and Growth plans. Follow these simple steps to mount your DigitalOcean Space.
- Add a new data source and select the DigitalOcean Spaces icon.
- Enter the endpoint URL (e.g. https://jane.nyc3.digitaloceanspaces.com), a display name (an arbitrary name for the data source), and the access key and secret.
  - Note: you will have to upload project secrets under the project settings tab.
Once the data source is created, find the source in the list of data sources and click the mount button.
This will create a bidirectional mount on the underlying container for reading and writing data to the space.
AWS S3
The Gradient IDE provides the ability to mount public and private S3 buckets into the Notebook to access data that is stored externally. This is available to Pro and Growth plans. Follow these simple steps to mount your S3 bucket.
- Add a new data source and select the Amazon S3 icon.
- Enter the name of your data source and the bucket URL.
- If the bucket is private add an Access Key ID and Secret Access Key by choosing a Gradient Secret in the dropdown. Learn more about how to create a Gradient secret.
- The data source can now be mounted to your notebook and accessed through the data source panel.
S3-Compatible Data Source
To connect to other S3-compatible data sources like Google Cloud Platform (GCP), follow these steps.
- Add the S3-compatible bucket URL. For GCP it would look like s3://example-bucket-name
- Open the Advanced Settings and change the default endpoint. For GCP enter https://storage.googleapis.com
Making an S3 Bucket Publicly Accessible
Warning: This will allow anyone on the internet to access your files. DO NOT enable this if you have sensitive information in your S3 bucket.
To make an AWS S3 bucket publicly accessible without credentials, you'll need to update two settings in your bucket settings under the Permissions tab:
Uncheck "Block all public access".
Edit ACL to allow Everyone (public access) List Objects and Read Bucket ACL.
Storage uses and billing
Gradient Notebooks provide volume storage and bucket storage. The delineation refers to whether the data is available online or offline and helps users pay only for what they use.
With volume storage, data is available only while running a Gradient Resource such as a notebook or workflow. With bucket storage, data is available for online or offline viewing.
Volume storage
Volumes are persistent storage resources that provide shared access to a filesystem while the instance is online.
Examples of volume storage include:
- Gradient Notebooks: Any information stored in /storage and in /notebooks
- Gradient Datasets: Any dataset cache in Gradient Workflows or Gradient Deployments
For more information about team volumes go to the Storage tab in the Team Settings view which can be found by clicking the user icon in the top right.
Volume storage billing
The amount of volume storage you have access to is dictated by your Gradient subscription tier as shown below. These storage limits are on a per notebook basis.
Any storage over these limits will be charged at $0.29/GB/month.
These charges are accrued hourly based on current usage. For example, if a user with a free subscription goes over the 5GB limit for 3 days, then the account will only be charged for the 3 days of usage over the free limit.
Subscription | Volume Storage |
---|---|
Free | 5GB |
Pro | 15GB |
Growth | 50GB |
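To make the hourly accrual concrete, the sketch below estimates a charge for time spent over the limit. It rests on assumptions that the text above does not state: the hourly rate is derived by dividing the monthly rate by an assumed 730 hours per month, and the overage amount is treated as constant for the whole period. The function name and parameters are hypothetical, and actual invoices may be computed differently.

```python
def accrued_overage_cost(used_gb: float, included_gb: float, hours_over: float,
                         rate_per_gb_month: float = 0.29,
                         hours_per_month: float = 730.0) -> float:
    """Estimate the overage charge for `hours_over` hours above the limit.

    Hypothetical helper; hours_per_month=730 is an assumption, not a
    documented Gradient billing constant.
    """
    # Only storage beyond the plan allowance is billable.
    overage_gb = max(0.0, used_gb - included_gb)
    hourly_rate = rate_per_gb_month / hours_per_month
    return overage_gb * hourly_rate * hours_over


# Free plan (5 GB included), 7 GB used for 3 days (72 hours):
estimate = accrued_overage_cost(7, 5, hours_over=72)
```

Under these assumptions, three days over the limit costs only a small fraction of the full-month overage for the same amount of data.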
Bucket storage
Bucket storage refers to long-term storage, which is primarily used for offline viewing of files and dataset versioning.
Examples of bucket storage include:
- Offline data for Gradient Notebooks: Files and datasets viewable offline in a notebook. In the offline view, buckets will store .ipynb files, .md files, any git tracked files, and any files included in the .notebookinclude file.
- Versioned Datasets: Each time a Gradient Dataset is versioned, these iterations of the dataset are stored within a bucket. For more information see Versioned Data.
To avoid getting charged for bucket storage, delete Gradient Notebooks and versioned datasets that may no longer be in use.
Bucket storage billing
No matter the subscription type, all Gradient users receive 2GB of free bucket storage.
Users who exceed that limit will be charged $0.29/GB/month.
These charges are accrued hourly at the current usage of the bucket. For example, if a user goes over the 2GB limit for 3 days, then the account will only be charged for the 3 days of usage over the free limit.
If a user does not have a credit card associated with the team, bucket storage is strictly capped at 2GB, and reaching the cap could lead to a failed notebook teardown.