Perform proper version control on your team’s notebooks
The struggle is real
Let me give you two quick tests to tell a data scientist from a software engineer. Give them a simple coding task in Python, something that mostly uses common sense, and then:
- Only give them a Jupyter notebook to work with
- Ask them to commit to a git branch and then open a pull request on GitHub
The data scientist will be like a fish in water in the notebook but will surely struggle with git/GitHub, while the engineer will be bewildered by the “run-any-cell-anytime” nature of the notebook but fly through git as if you had asked them to simply breathe.
Yet both will throw in the towel if you ask them to review a PR involving notebooks.
And there is a good reason for it. It is hard.
What data scientists discover when they try to follow a software engineering workflow is that Jupyter notebooks are, in essence, JSON files. These files store a plethora of metadata, such as the order in which cells were run (an ordinal execution count), and plenty more. They also store outputs, with images embedded as base64-encoded text! Any graph you plotted, any table you showed with .head, etc. It quickly becomes a grueling task to work out whether a change is relevant and should be inspected or not.
Even a colleague who opens a notebook for a quick check and runs the import cell by mistake will change the file. Ouch!
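To make the problem concrete, here is a minimal sketch using only Python's standard library. The notebook dictionary is hand-written to mimic the nbformat 4 layout, and the helpers `rerun_first_cell`, `strip_volatile` and `serialize` are hypothetical names made up for this illustration, not part of Jupyter or any tool. It simulates re-running a cell and shows that the file changes even though the code did not.

```python
import copy
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "id": "imports",
            "metadata": {},
            "source": ["import pandas as pd"],
            "execution_count": 1,
            "outputs": [],
        }
    ],
}


def rerun_first_cell(nb):
    """Simulate a colleague re-running the first cell: only the counter moves."""
    rerun = copy.deepcopy(nb)
    rerun["cells"][0]["execution_count"] += 1
    return rerun


def strip_volatile(nb):
    """Hypothetical helper: drop outputs and execution counts from code cells."""
    clean = copy.deepcopy(nb)
    for cell in clean["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean


def serialize(nb):
    """Serialize the way a notebook lands on disk, which is what git compares."""
    return json.dumps(nb, indent=1, sort_keys=True)


rerun = rerun_first_cell(notebook)

# The source code is identical, yet the serialized files differ.
raw_changed = serialize(notebook) != serialize(rerun)

# Once the volatile fields are removed, the two versions are the same.
clean_changed = serialize(strip_volatile(notebook)) != serialize(strip_volatile(rerun))

print(raw_changed, clean_changed)  # True False
```

The first comparison is what a reviewer sees in a raw diff; the second is what they actually care about.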
In defense of notebooks
Jupyter notebooks are essential to the data science workflow. The reason is that a data science pipeline revolves around data, and there is almost no setting in which you can really constrain what the data looks like. There is no equivalent of a statically typed language for data, only dynamically typed ones. In other words, data is alive and its collection is complex.
Data Science is to software…