Workflows Sample Project: StyleGAN2
This sample project shows how to use Workflows to train a StyleGAN deep learning model to generate pictures of cats, based on the popular Nvidia StyleGAN2 GitHub repository. A higher-level overview is also given in the blog entry.
This is one of the advanced tutorials in our set, so we assume the reader is familiar with the basics of Workflows and wishes to proceed to using them for real projects. The emphasis here is on realistic, end-to-end data science with Gradient, so the dataset is larger, the runtimes are longer, and the setup is more involved than in some of the other tutorials.
The code can be run by the user to create this project, or used as a starting point for another project.
The project contains 2 Workflows:
    1 (optional): Download a full-sized set of images from online (42GB), and extract a (size-controllable) subsample of images from the supplied database format
    2: Train, evaluate, and generate images from a StyleGAN deep learning model
If desired, Workflow 2 can be run without running Workflow 1, using our copy of the extracted images in Gradient Public Datasets storage. This is suitable for users who wish to see the model training without preparing the data from the 42GB file themselves.

Tutorial

Prerequisites

We utilize Gradient's functionality for linking Projects to GitHub repositories, then triggering the Workflows to run by making a change to the repository.
We therefore assume that:
    You have your own GitHub username, providing a space into which to fork the repository
    Your GitHub user space has the Paperspace Gradient app enabled for the repo fork (or all repos)
    You have access to both C3 CPU and P4000 GPU (or higher) Gradient instances, on which to run the model training, via an appropriate subscription level
    OPTIONAL (running Workflow 1 only): You have sufficient storage space to output a copy of the 42GB image dataset

Run the project

Once these prerequisites are met, go through the following to run the project:
1. Fork (i.e., copy) the project repository into your own GitHub username. This is most easily done using the "Fork" button on the top right of the repo page.
2. Create a Project from the Gradient GUI.
3. Create a Workflow from within your newly created Project.
4. Choose "import an existing gradient repository" rather than "Select a template". In the resulting dialog, choose your fork of the project repo from step 1.
5. Don't follow the "Let's create a Workflow" steps; instead, proceed as described below.
The two Workflows in the repo are stored in the .gradient/workflows directory. This location, similar to GitHub's Actions directory, is where Workflows can be placed so they can be triggered to run or rerun via a repo change.
Note: we could place this project in the GUI as a ready-forkable repo; however, we deliberately omit this to show more realistically how you would create your own future projects.
By default, the triggering mechanism in the Workflow YAML (the same in both cases):

```yaml
#on:
#  github:
#    branches:
#      only: main
```

is commented out. This is to prevent the Workflows, which have runtimes of a couple of hours, from being triggered accidentally.
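When you come to enable the trigger (step 7 below), removing the # characters yields the standard Gradient Workflows GitHub trigger:

```yaml
on:
  github:
    branches:
      only: main
```

With this in place, any commit pushed to the main branch of the linked repo triggers the Workflow to run.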
Most of the caveats from earlier incarnations of this project, which arose because Gradient was at an earlier stage of development, have now been resolved. One remains: the user must create their own empty copies of the output Datasets before the Workflow can be run. This is because there are 6 of them, so our default demo-dataset, which avoids this caveat in simpler demos, doesn't work here.
6. Navigate to the Data tab within your Project, and create these 6 Gradient-managed Datasets:

```
stylegan2-wsp-pretrained-network
stylegan2-wsp-evaluation-pretrained
stylegan2-wsp-generated-cats-pretrained
stylegan2-wsp-our-trained-network
stylegan2-wsp-evaluation-ours
stylegan2-wsp-generated-cats-ours
```

(where wsp stands for "Workflows Sample Project", to keep the names from being very long).
Now you are set up to run the second of the two Workflows. For the optional Workflow 1, see below.
7. In your copy of Workflow 2, stylegan2-train-and-evaluate-model.yaml, edit the YAML to uncomment the on: trigger lines given above. The YAML annotations also show where.
8. Upload (commit) this modified copy to the .gradient/workflows directory in your fork of the repo, the one linked to your Project, replacing the copy of the file that is there.
This should trigger the Workflow to run, and you can see it running under the Workflows tab in your Project. The Workflow uses the Gradient Public Dataset copy, created by us earlier, of the output from Workflow 1 containing the extracted images. This dataset is located at gradient/stylegan2-workflows-sample-project-extr-img; the gradient namespace is where our public datasets are stored.
Depending on which level of instance you are using, the runtime will be approximately 2-4 hours.

OPTIONAL: Run Workflow 1 in addition to Workflow 2

In addition to running the model training above, if you would like to see the full end-to-end dataflow, you need to direct the output of Workflow 1 to your own Gradient-managed Datasets, since our public one is not writable by users (as you would expect).
Therefore, add the following steps:
9. Create two additional Gradient-managed Datasets: stylegan2-wsp-cat-img-db and stylegan2-wsp-extr-img
10. Re-comment the 4 on: lines in stylegan2-train-and-evaluate-model.yaml that were uncommented above
11. Uncomment the same 4 lines in stylegan2-download-and-extract-data.yaml
12. In stylegan2-train-and-evaluate-model.yaml, swap the commenting/uncommenting on the dataset references (ref:) in jobs 3, 5, and 6 to point to your copy of the output datasets from Workflow 1 instead of our public copy. The YAML annotation also shows where.
13. Upload the updated versions of the 2 YAML files to .gradient/workflows in your forked repo.
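As a sketch of step 12's edit: the jobs that read the extracted images take them as a dataset input, and swapping the ref: moves them from our public copy to yours. The input name catImages below is illustrative, not the repo's exact name; refer to the repo's YAML for the real one.

```yaml
inputs:
  catImages:
    type: dataset
    with:
      # Our public copy of the extracted images (default):
      # ref: gradient/stylegan2-workflows-sample-project-extr-img
      # Your own copy, written by Workflow 1:
      ref: stylegan2-wsp-extr-img
```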
This should trigger Workflow 1 to run (and not 2), and the process will again take a couple of hours, depending upon download speed. It may seem slightly lengthy, but it proves that Gradient Workflows work correctly with realistically-sized datasets.
Once Workflow 1 has run, you can then run Workflow 2 again to see it using your copy of the data, and hence end-to-end data science. To run Workflow 2, swap the on: commenting in the 2 files again so that a new upload/commit triggers Workflow 2 instead of 1.
You can also try other steps, for example, increasing the number of images extracted and then training the model for longer. This may give better model performance.
Now that we have seen the project running, some other details are given below, along with a few possible future improvements (e.g., we didn't show Gradient Deployments).

Other project details

These are some details in addition to those given in the blog entry.

Workflow time to run: Autoscaling

The second Workflow, stylegan2-train-and-evaluate-model.yaml, trains and evaluates a StyleGAN neural network, and since our purpose here is to show a realistic example, it typically takes a couple of hours to run. The machine type used can affect this quite significantly, so if you are using a private cluster, you can enable autoscaling with hot nodes for best performance. With two each of C5, P4000, and V100 machines, and the corresponding resources specified in the YAML file, the instances do not need to be spun up:
| Step number | Step name | Machine type |
| --- | --- | --- |
| 1 | cloneStyleGAN2Repo | C5 |
| 2 | getPretrainedModel | C5 |
| 3 | evaluatePretrainedModel | V100 |
| 4 | generateImagesPretrainedModel | P4000 |
| 5 | trainOurModel | V100 |
| 6 | evaluateOurModel | V100 |
| 7 | generateImagesOurModel | P4000 |
We want two of each machine because steps 1 & 2, 3 & 5, and 4 & 7 can run in parallel, but 6 requires 5 to finish first. The generate images steps take less time than the evaluate and train steps, so they can be run using a P4000 machine, while V100 is best for steps 3, 5, and 6.
If you want to avoid complexity and use a simpler resource list, defaulting all steps to V100 will run in the same time, or it will run in 3-4 hours if all machines are P4000.
Note that if you are using the public cluster (the default), hot nodes dedicated to your account are not available, and the availability of machine types may differ. In this case, the default setup that we supply, all-P4000, is fine.
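In the Workflow YAML, the instance type is chosen per step with a resources block, and a top-level default can also be set. A minimal sketch, with illustrative instance types (see the repo's YAML for the actual settings):

```yaml
defaults:
  resources:
    instance-type: P4000

jobs:
  trainOurModel:
    resources:
      instance-type: V100  # override the default for the heavy training step
```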

What the Workflow is doing

The first Workflow of the two downloads and extracts the data from http://dl.yf.io/lsun/objects/cat.zip . The data is supplied as a 42 gigabyte .zip file; we unzip it to obtain an LMDB database, extract the WebP images that the database contains, and compile those images into the multi-resolution TFRecords format needed for training the model. Working with this large file demonstrates Workflows and Gradient-managed storage being used with realistic data sizes. The extract step subsamples the images to 1000, but this is controllable via the --max_images argument and can be set to larger sample sizes, at the expense of a longer training runtime in the second Workflow.
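The subsampling logic of the extract step can be sketched as follows. This is not the repo's code: the LMDB access is abstracted into an iterable of (key, bytes) records, and the function simply writes the first max_images WebP payloads to disk.

```python
from pathlib import Path

def extract_subsample(records, out_dir, max_images=1000):
    """Write the first `max_images` WebP payloads in `records` to `out_dir`.

    `records` is any iterable of (key, image_bytes) pairs, for example an
    LMDB cursor over the LSUN cat database. Returns the paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, (key, data) in enumerate(records):
        if i >= max_images:
            break
        # LSUN stores images as raw WebP bytes, so we can dump them directly.
        path = out / f"{i:06d}.webp"
        path.write_bytes(data)
        written.append(path)
    return written
```

With the real database, the records would come from an LMDB cursor (e.g. iterating `lmdb.open(db_path).begin().cursor()` with the python-lmdb package).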
The second Workflow sets up the following steps, as shown in this directed acyclic graph (DAG) representation. The box colors show inputs, repo clone, pretrained model steps, and steps for our model. "Cat images subsample" is the set of extracted images from the first Workflow.
We clone the StyleGAN2 repo, and get the online pretrained model. Then we use the repo's run_metrics.py to evaluate the pretrained model, and its run_generator.py to generate new cat images from it. At the same time (Gradient Workflow jobs can run in parallel), we train our model on some image data using run_training.py, then evaluate it and generate images from it in the same way as for the pretrained model. This allows us to compare the two models.
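To make this concrete, one step of the Workflow, written in Gradient's Workflow YAML, looks roughly like the following. The script body, container image, and input/output names here are assumptions for illustration, not a verbatim excerpt; refer to the repo's YAML for the exact version.

```yaml
trainOurModel:
  needs:
    - cloneStyleGAN2Repo
  inputs:
    repo: cloneStyleGAN2Repo.outputs.repo
  outputs:
    ourTrainedNetwork:
      type: dataset
      with:
        ref: stylegan2-wsp-our-trained-network
  resources:
    instance-type: V100
  uses: script@v1
  with:
    script: |-
      cd /inputs/repo
      python run_training.py --num-gpus=1 --config=config-f \
        --data-dir=/inputs/catImages --result-dir=/outputs/ourTrainedNetwork
    image: tensorflow/tensorflow:1.14.0-gpu-py3
```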
The DAG shows that, as one would expect, both models require the repo to be able to run, and then our evaluation requires our model to be trained first, but the pretrained evaluation does not.
To keep this tutorial tractable with a Workflow runtime of hours and not days, we train on the small subsample mentioned above of the 1.6 million images available in the original dataset. So as we will see below, our model does not do well compared with the pretrained model. But to train the full model, the setup is the same as here: one would simply pass in the full-sized dataset and run for longer.

Results

If the Workflows have run correctly, for the second Workflow you will see a DAG that looks like this:
You can then explore, in the usual way in the GUI, the output logs of each job, the YAML, and the contents of the output directories. For this project, that allows us to compare the results from our trained model to those from the pretrained one.
By viewing the output files of the generateImagesPretrainedModel and generateImagesOurModel jobs, we can see the quality of the images that the models generate. As expected, the pretrained model generates realistic looking cats:
and in contrast, our model trained on only 1000 images clearly needs more time and data:
Some of its images have hints of a cat shape, but they are mostly noise. Again this is as expected.
To see a more quantitative comparison: in the output logs from the network evaluations, evaluatePretrainedModel and evaluateOurModel, the file run.txt contains the values for the fid50k and ppl2_wend metrics that we evaluated. fid50k is the Fréchet Inception Distance (FID), where lower is better and 0 is perfect. ppl2_wend is the Perceptual Path Length in W, with path endpoints.
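As background (not part of the Workflow), the FID between two sets of Inception features summarized by means and covariances is the Fréchet distance between Gaussians. A minimal NumPy/SciPy sketch of that distance, not the repo's implementation:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet (FID-style) distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can
    # introduce tiny imaginary parts, so keep only the real part.
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Identical distributions give a distance of 0.
mu, cov = np.zeros(3), np.eye(3)
print(round(frechet_distance(mu, cov, mu, cov), 6))
```

In the real metric, mu and cov are the mean and covariance of Inception-network features over 50,000 generated and real images, which is what makes fid50k expensive to compute.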
The difference between the models can be seen (your results may vary slightly from the numbers here):
Pretrained:

```
stylegan2-cat-config-f time 44m 08s fid50k 35.4571
stylegan2-cat-config-f time 1h 16m 21s ppl2_wend 437.9006
```

Ours:

```
network-final time 45m 18s fid50k 386.3678
network-final time 1h 17m 42s ppl2_wend 19.5041
```
For our model, the fid50k is much higher, and the ppl2_wend much lower, both indicating a worse model.
The resolution to this is of course to train our model for longer (days, not hours) on a training set of proper size: tens of thousands of images or more, not 1000. The Workflow and YAML setups are the same; we would simply pass in the bigger image set.

Future additions

As mentioned, Gradient Workflows is still a new product, and as such, aside from the caveats above, some of Gradient's functionality is not yet supported in Workflows. This includes:
    Adding Gradient's multi-GPU support to Workflows
    Deployment, monitoring, and triggering model retraining
Further YAML actions, such as managed deployment, are also being added.
Once these are available, the code here is straightforwardly alterable or extensible to take advantage of them.