Once a machine learning model is released into production, it is inevitable that data teams will want to improve its performance based on new learnings. The data might drift over time and degrade model performance, an edge case might be having an outsized impact on the business, or your product team could start tracking a new data point that you think would likely improve the model. So what process should businesses follow to test and resolve these questions, and how can you do this when working in a large team? The most common approach is experiment tracking and data versioning.
It is important to version data to make sure that experiments can be reproduced accurately and reliably. Without a version control system, it is cumbersome to guarantee reproducible experiments and to know which experiments have been successful. This problem is amplified in teams, where multiple people work on the same model yet see different results. Data versioning and experiment tracking aim to solve exactly these problems.
Ideally, since such a system is relatively easy to set up and provides utility even for a solo data scientist, it should be implemented in most cases. Working solo, you can simply version your data and experiment results as you go, without maintaining external spreadsheets of results or manually versioning the data you use. In team settings, these systems become indispensable tools for seamless collaboration, letting you share results and apply more traditional software engineering practices to your repository, such as code review, DevOps, and CI/CD.
By the end of this tutorial you will be able to:
- Set up DVC for version controlling datasets and models
- Link your GitHub repository to DagsHub
- Use DagsHub to track your experiments
The data used for this tutorial can be found here and you should fork the GitHub repository here. Alternatively, there is a Google Colab notebook here, but in that case you'll need to set up your own GitHub repo accordingly.
We will need a number of libraries for this tutorial, so boot up a terminal and install them before you proceed.
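A minimal install command might look like the following, assuming the stack used later in the tutorial (DVC, the DagsHub client, pandas, scikit-learn, XGBoost, and joblib); pin versions as needed for your environment.

```bash
pip install dvc dagshub pandas scikit-learn xgboost joblib
```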
When we version our software projects, we typically use a version control system such as git. Using a system like this allows us to keep track of any changes we make to the codebase, easily collaborate with others by merging versions, and revert to earlier versions if something breaks. Git does have a limitation, however: it is not built to handle large files or binary files like those used in model building. DVC is a version control system built for datasets and models. You can think of it as git for data: it lets you version large data files and model files, treating them in much the same way git treats code.
If you have forked this repo on GitHub, you already have git version control active. If not, please create a git repo. You can now use DVC to version your data and models. Initialise DVC, and commit the changes DVC made to git.
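A sketch of those commands, run from the root of the repository:

```bash
# Initialise DVC inside the existing git repo
dvc init

# Commit the configuration files DVC creates
git add .dvc .dvcignore
git commit -m "Initialise DVC"
```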
Our git repo is now a DVC repo too! Let us add the data we downloaded into a `data` folder, add it to DVC, and push our changes to GitHub!
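Something like the following should work; the dataset file name `creditcard.csv` is an assumption, so substitute whatever you downloaded.

```bash
# Move the downloaded dataset into a data/ folder
mkdir -p data
mv creditcard.csv data/

# Track the raw dataset with DVC and commit the pointer file to git
dvc add data/creditcard.csv
git add data/creditcard.csv.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"
git push
```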
That’s how you version your data and models with DVC! We now want a central remote repository that can host DVC repos.
While it is useful to use DVC to version our datasets and models, we need a central remote store that takes advantage of the DVC versioning system in the same way that GitHub takes advantage of git. This is what DagsHub allows us to do, acting not only as a source of truth for our codebase, but also as a central repository for our data, data pipeline, models, and experiments.
Navigate to DagsHub and sign up for a free account. You can log in with your GitHub account.
You should be greeted with the following screen. Click the Connect button and select GitHub.
You will be prompted to connect a GitHub repository. Search for the repo you forked earlier and connect it! You’ll be greeted with a familiar-looking screen: this is your repo page. You should refer back to this page throughout the tutorial.
If we want our data to be viewable in DagsHub, we need to add our dataset to DVC and set the DVC remote to our DagsHub repo.
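Setting the remote looks roughly like this; replace `<username>` and `<repo>` with your own DagsHub account and repository name (DagsHub shows the exact URL on your repo page).

```bash
# Point DVC at the DagsHub storage for this repository
dvc remote add origin https://dagshub.com/<username>/<repo>.dvc
```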
Now we need to tell DVC how to authenticate.
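One way to do this is with `--local` remote settings, so your credentials stay out of git; the placeholder values are yours to fill in.

```bash
# Store credentials locally (in .dvc/config.local, which git ignores)
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your-dagshub-username>
dvc remote modify origin --local password <your-dagshub-password>
```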
Before you push your DVC repo, navigate to your DagsHub settings and create a password if you do not have one. Then push! (It may take a while)
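Pushing goes to the `origin` remote we configured above:

```bash
dvc push -r origin
```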
You can now view your data in your DagsHub repo. Our file is a little too big to preview in the dashboard, but you can view the raw file to verify that it is working.
To see this update, let's add more files to the data directory. Using DVC, we'll run the `preprocess.py` script to generate train-test splits of the data to be used in training and testing.
We could run the file as-is with vanilla Python, but we can also use DVC to specify a pipeline stage. This tells DVC what the inputs and outputs are for this particular command, which DVC can use to determine whether the stage needs to be run again in a future build. With the following command we will create a stage called preprocessing, and specify all the inputs and outputs of the stage, as well as the processing script itself.
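A sketch of the stage definition; the dependency and output paths are assumptions based on the files described in this tutorial, and newer DVC versions replace `dvc run` with `dvc stage add` followed by `dvc repro`.

```bash
# Create a 'preprocessing' stage that produces the train/test splits
dvc run -n preprocessing \
        -d preprocess.py -d data/creditcard.csv \
        -o data/X_train.csv -o data/X_test.csv \
        -o data/y_train.csv -o data/y_test.csv \
        python preprocess.py
```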
In this case, we specify the name of the stage with `-n`, all inputs with `-d` and all outputs with `-o`, and finally specify the actual command to run for the stage as the last argument.
This should generate six new files: the four data splits `X_train.csv`, `X_test.csv`, `y_train.csv`, and `y_test.csv`, plus two new DVC files, `dvc.lock` and `dvc.yaml`. These DVC files specify how the data pipeline works, version the data, and determine which pipeline stages need to be rerun upon rebuild. Add them to the git and DVC repos, commit, and push.
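For example:

```bash
# The split CSVs are cached by DVC as stage outputs; git only needs the
# pipeline files
git add dvc.yaml dvc.lock data/.gitignore
git commit -m "Add preprocessing pipeline stage"
git push
dvc push -r origin
```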
You can now have a look at your Data Pipeline in DagsHub! You can keep adding new stages to the pipeline, and DVC will automatically determine which stages need to be run while DagsHub will visualise it. We will create a new stage in the next section to see how this works.
DagsHub doesn’t only supply you with a remote DVC store; it also allows you to track your experiments along with your versioned code and data. We are going to modify the supplied file, `train.py`, to show how this is done. At the top of the file, add the following:
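That is, import the DagsHub client so its logger is available; `train.py` is assumed to already import its modelling dependencies.

```python
import dagshub
```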
Now, we are going to wrap our training code in a `dagshub_logger` context manager. The logger has two methods, `log_hyperparams` and `log_metrics`, for logging hyperparameters and metrics respectively. Indent the `XGBClassifier` and `.fit()` lines by one level and wrap them in a `dagshub.dagshub_logger` context manager as shown below. If you’d like, you can be more specific about which hyperparameters you want to log by passing them to the logger manually (we just use `.get_params()` here).
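A sketch of what the wrapped training code could look like, building on the `import dagshub` above; the variable names, placeholder hyperparameters, and the accuracy metric are assumptions rather than the tutorial's exact code.

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test are assumed to be loaded earlier in train.py
with dagshub.dagshub_logger() as logger:
    # Train the classifier (hyperparameter values here are placeholders)
    model = XGBClassifier(n_estimators=100, max_depth=3)
    model.fit(X_train, y_train)

    # Log the hyperparameters of this run (written to params.yml by default)
    logger.log_hyperparams(model.get_params())

    # Log evaluation metrics for this run (written to metrics.csv by default)
    y_pred = model.predict(X_test)
    logger.log_metrics({"accuracy": accuracy_score(y_test, y_pred)})
```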
Save the changes and git commit.
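For example:

```bash
git add train.py
git commit -m "Add DagsHub experiment logging to train.py"
```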
Run your modified training file in a new DVC pipeline stage. You may notice we have a new flag, `-M`. This tells DVC that the file is a metric output to be tracked by git. There is another option, `-p`, which allows us to specify a hyperparameter file as a pipeline dependency. In this case, we track `params.yml` as a metric artifact instead, since we specify the hyperparameters in code and this file is written by the DagsHub logger.
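A sketch of the stage, with dependency names assumed from the preprocessing outputs above:

```bash
# Create a 'training' stage; -M marks git-tracked metric files
dvc run -n training \
        -d train.py \
        -d data/X_train.csv -d data/X_test.csv \
        -d data/y_train.csv -d data/y_test.csv \
        -M metrics.csv -M params.yml \
        python train.py
```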
This creates three new files: our two DagsHub files, `metrics.csv` and `params.yml`, and our model file, `models/xgb-fraud-classifier.joblib`. Add the DagsHub files to the git repo and the model file to DVC, then commit. It is also a good idea to tag your commit with some sort of model version.
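Roughly:

```bash
# Commit the experiment artifacts and pipeline files to git
git add metrics.csv params.yml dvc.yaml dvc.lock

# Track the trained model with DVC and commit its pointer file
dvc add models/xgb-fraud-classifier.joblib
git add models/xgb-fraud-classifier.joblib.dvc models/.gitignore
git commit -m "Train fraud classifier and log first experiment"

# Tag the model version and push everything
git tag -a v1.0 -m "Baseline XGBoost model"
git push
git push --tags
dvc push -r origin
```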
You can now view your experiments in DagsHub under the Experiments tab.
You will also see your updated data pipeline.
Suppose we are testing a new hypothesis and want to run a new experiment with the same model, but with a different set of hyperparameters. After some analysis, we change the `n_estimators` and `max_depth` hyperparameters in the training file, finalising them as below. Save the file modification.
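In `train.py`, that change might look like the following; the values shown are placeholders rather than the tutorial's final choices.

```python
# Inside the dagshub_logger block: updated hyperparameters for the
# second experiment (placeholder values)
model = XGBClassifier(n_estimators=200, max_depth=6)
```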
We can now run the experiment again. Since we have already specified how each pipeline stage should be run, we simply tell DVC to reproduce the entire pipeline.
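Reproducing the pipeline is a single command; DVC re-runs only the stages whose dependencies have changed (here, the training stage, since `train.py` was modified).

```bash
dvc repro
```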
Great! Push your changes and view your new experiment in DagsHub.
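For example:

```bash
# dvc repro updates dvc.lock, and the logger rewrites metrics.csv and params.yml
git commit -am "Second experiment: new n_estimators and max_depth"
git push

# If the model file is tracked with `dvc add`, re-add it first so its
# .dvc pointer reflects the retrained model
dvc push -r origin
```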
You are now armed with the tools to successfully run experiments and track your results in DagsHub. You can visit each commit of an experiment separately to see the exact code and data that was used to train the model in that particular run. Reproducing that result is simply a matter of checking out the commit and running the pipeline!