Reproducibility in Machine Learning

Machine learning is said to be experiencing a reproducibility crisis. What does this mean?

A 2016 “Nature” survey demonstrated that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. -- Sam Charrington, TWiML

As heuristics are replaced by models, the focus of the machine learning community has shifted away from factors like explainability and reproducibility towards matters of raw performance as measured by evaluation metrics. As a result, many machine learning models are either not reproducible or are difficult to reproduce. This is largely the result of a lack of best practices (what we call MLOps) in machine learning today.


Retraining an existing model with the same hyperparameters and dataset does not always reproduce the same result. This seems counterintuitive -- why is this? There are several factors that contribute to non-determinism:

  • Initialization of layer weights: It is common practice to set initial weights to small random, non-zero values. Either the random seed or the initial weights themselves need to be captured to reproduce the same results.

  • Dataset shuffling: Datasets are often randomly shuffled at initialization, which is a fairly obvious root cause of non-determinism. Even if the model is set to use a fixed range of the dataset (e.g. the last 20%), shuffling means the contents of that range will not be consistent across training runs. Shuffling within the training dataset also changes the order of samples and therefore the way the model learns as it iterates over them.

  • Randomness in hidden layers: Many neural network-based architectures include layers with deliberate randomness. Dropout, which randomly disables units during training to prevent overfitting, is a common example.

  • Updates to ML frameworks, libraries, & drivers: Updates to ML libraries and even GPU drivers can lead to subtly different behavior across iterations.

  • Hardware used during the training process: The specific combination of GPU and CPU can produce different results. This is the result of several factors including the way GPUs handle floating-point calculations and CPUs handle multi-threading.
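
The first three factors above all come down to random number generation, so they can be illustrated with a small sketch. The example below is not a real training loop: it uses only Python's standard library, and the function name make_run and its parameters are hypothetical. It assumes a single seeded generator drives weight initialization, shuffling, and a dropout-style mask. Real frameworks have their own seeding mechanisms, and fixing a seed does not address the framework, driver, or hardware differences listed above.

```python
import random


def make_run(seed, n_samples=10, n_weights=4, holdout_fraction=0.2):
    """Simulate the random parts of a training run from one seeded generator."""
    # A private generator keeps the run independent of global random state.
    rng = random.Random(seed)

    # Weight initialization: small random values.
    weights = [rng.gauss(0.0, 0.1) for _ in range(n_weights)]

    # Dataset shuffling, followed by a "last 20%" holdout split.
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_holdout = round(n_samples * holdout_fraction)
    split = n_samples - n_holdout
    train_idx, holdout_idx = indices[:split], indices[split:]

    # Dropout-style mask: each unit is kept with probability 0.5.
    dropout_mask = [rng.random() >= 0.5 for _ in range(n_weights)]

    return weights, train_idx, holdout_idx, dropout_mask


run_a = make_run(seed=42)
run_b = make_run(seed=42)

# With the same seed, the weights, the split, and the mask all match.
print(run_a == run_b)
```

Without the seed, the "last 20%" holdout would contain different samples on each run, which is the inconsistency described in the dataset shuffling bullet.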

Making Machine Learning Reproducible

At a high level, the first step in making machine learning more reproducible involves capturing all of the core primitives (hyperparameters, code commit, and dataset) and metadata (outlined above) associated with the training process.
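
As a rough illustration of what capturing these primitives might look like, the sketch below builds a small run manifest using only Python's standard library. The function names (current_commit, build_manifest), the manifest fields, and the short run_id are assumptions made for this example, not the format of any particular tool. The dataset is fingerprinted with a SHA-256 hash, and the commit is read from git when it is available.

```python
import hashlib
import json
import platform
import subprocess
import sys


def current_commit():
    """Return the current git commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or this is not a repository.
        return None


def build_manifest(hyperparameters, dataset_bytes, seed):
    """Collect the primitives and metadata for one training run."""
    manifest = {
        "hyperparameters": hyperparameters,
        "seed": seed,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "code_commit": current_commit(),
        "python_version": sys.version,
        "platform": platform.platform(),
    }
    # Derive a short identifier from the manifest contents, so identical
    # inputs in an identical environment map to the same identifier.
    canonical = json.dumps(manifest, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    manifest["run_id"] = digest[:12]
    return manifest


# In-memory bytes stand in for a real dataset file here.
example = build_manifest({"learning_rate": 0.01, "epochs": 5}, b"a,b\n1,2\n", seed=42)
print(json.dumps(example, indent=2))
```

Storing a manifest like this next to each trained model makes it possible to tell later whether two runs really used the same data, code, and settings. Library and driver versions could be added to the same structure.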

Beyond capturing the basic components in the machine learning system, the path towards reproducible ML can be thought of as a philosophical shift away from ad-hoc methodologies to a more deterministic way of working.

By adopting mature practices found in software engineering and DevOps, machine learning can evolve to achieve improved resiliency and predictability. Versioning and continuous integration are an integral part of many software workflows, and transposing these concepts to machine learning will streamline many processes.

Finally, organization-wide visibility and collaboration are also essential. Due to a lack of available tools, data scientists often work in siloed environments and don't have access to shared notebooks, code, and parameters.

A unified hub for tracking all models in development, testing/QA, and production is a must-have for any ML team -- especially as the number of data scientists, the number of models, and the complexity of those models increase.

Reproducibility + Gradient

Gradient from Paperspace automatically tags each entity (e.g. training inputs or the parameters of deployed models) with a unique identifier. Versioning and continuous integration are first-class citizens in its unified CI/CD for machine learning.

MLOps is another modern approach to machine learning that is deeply embedded in the Gradient platform.