best practices

5 tips to take the Data Science training wheels off

and write more sustainable code.

Dany Majard
7 min read · Aug 5, 2022


Photo by David Clarke on Unsplash

Houston, we have a problem

There is a fair amount of snobbery towards wizards of the spreadsheet in the data science community. Yet we are to seasoned ML engineers what Excel power users are to us.

Chatting with data scientists at meetups, there's no end to the sarcasm we can spit towards Excel power users. Who are these cavemen, and why do they all have a background in finance? Why do they think they're the masters of the universe for their lookup chops, their pivot table addiction, or, god forbid, their macros?!

To data scientists, everything they do is painfully convoluted and intrinsically wrong: 😱 when a single tab contains three tables, 😱 when complex lookups perform a simple filtering of a dataset, 😱 when a formula was extended with click-and-drag but one cell was modified along the way, and so on.

To a data scientist, the real world is a jungle of data, and spreadsheets are akin to Swiss Army knives. You need better tools to survive.

Yet we turn around and commit the same sin. We think Jupyter notebooks are the best tool ever and spin up a server as soon as we want to perform any coding task. As I claimed in a previous article (see notes), we fall prey to the same issues that plague spreadsheet super users: the tool is so damn convenient that we don't keep it within its natural bounds.

Start Jupyter, select your kernel, and boom! Your hands are on the data.

With all the libraries that exist and its very high interactivity, Jupyter taps into our need to learn by experience and really allows work to become a form of play with the data and the computer. The more time I spend in the industry, the more I understand that Jupyter is the Excel of data scientists. And with time, this convenience creates bad habits, as Joel Grus masterfully laid out in his JupyterCon 2018 talk, "I don't like notebooks."
