Coding tools

Solved: Jupyter nightmare on Git

Dany Majard
7 min read · Jul 21, 2022

Perform proper version control on your team’s notebooks

Photo by Roman Synkevych 🇺🇦 on Unsplash

The struggle is real

Let me give you two quick tests to differentiate a data scientist from a software engineer. Give them a simple coding task in Python, something that mostly uses common sense, and then:

  • Only give them a Jupyter notebook to work with
  • Ask them to commit to a git branch and then open a pull request on GitHub

The data scientists will be like a fish in water in the notebook but will surely struggle with git/GitHub, while the engineers will be bewildered by the “run-any-cell-anytime” nature of the notebook but fly through git as if you had asked them to simply breathe.

Yet both will be throwing in the towel if you ask them to review a PR involving notebooks.

And there is a good reason for it. It is hard.

What data scientists discover when trying to follow a software engineering workflow is that Jupyter Notebooks are, in essence, JSON files. These store a plethora of metadata, such as how many times the notebook was opened (not edited, opened), when cells were run (an ordinal), and plenty more. They also store outputs in binary form! Any graph you…
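To make the JSON point concrete, here is a minimal sketch using only Python's standard `json` module. The notebook below is hand-written and hypothetical: it assumes the nbformat 4 layout (`cells`, `execution_count`, `outputs`), in which images appear as base64-encoded text, and the image string is a truncated placeholder rather than a real picture. The helper names `summarize_cells` and `code_sources` are made up for illustration.

```python
import json

# A hand-written, hypothetical notebook in the nbformat 4 layout.
# The "image/png" value is a truncated placeholder, not a real image.
notebook_text = json.dumps({
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,  # the ordinal: 7th execution in the session
            "metadata": {},
            "source": ["plot(data)"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgo..."},
                    "metadata": {},
                }
            ],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})


def summarize_cells(raw):
    """Return (execution_count, number_of_outputs) for each code cell."""
    nb = json.loads(raw)
    return [
        (cell.get("execution_count"), len(cell.get("outputs", [])))
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]


def code_sources(raw):
    """Return only the code a human wrote, one string per code cell."""
    nb = json.loads(raw)
    return [
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]


# Simulate re-running the same cell: the code is identical,
# only the execution counter moves.
rerun = json.loads(notebook_text)
rerun["cells"][0]["execution_count"] = 8
rerun_text = json.dumps(rerun)

print(summarize_cells(notebook_text))                            # [(7, 1)]
print(code_sources(notebook_text) == code_sources(rerun_text))   # True
print(notebook_text == rerun_text)                               # False
```

The last two lines are the whole problem in miniature: the code did not change, yet the file on disk did, so git reports a modification that a reviewer then has to read through.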
