A mantra that we’ve all heard about blockchain technology, and bitcoin in particular, is that “The data is public, the information is accessible to all”. But how accessible is it really? The answer: not very.
There are tools people use to get information on what is happening in the bitcoin blockchain. One of the most used is blockchain.com, which is a private company with a misleading name (so much so that the company’s LinkedIn employee list is cluttered with people who are in no way connected to the company). But for an analyst or economist, these tools only give snippets of activity and don’t allow them to extract meaningful knowledge about the state of the bitcoin economy. They don’t allow them to replicate results found by their peers either.
Before being able to query the data, they must become blockchain engineers and data engineers. They must learn to run a Bitcoin node, understand its API to access the block data, and write the whole ETL (Extract-Transform-Load) procedure to get the data into a database of their choice. They will also need to learn SCRIPT, the low-level coding language of Bitcoin. This is hard technical work.
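To give a feel for what that ETL work involves, here is a minimal sketch of the "extract" and "transform" steps. It assumes a Bitcoin Core node whose JSON-RPC `getblock` call is made with verbosity 2, so that transactions come back fully decoded. The helper names, the row layout and the sample block (with placeholder values, not real chain data) are my own illustration, not the pipeline Google uses.

```python
import json


def rpc_payload(method, params, request_id=0):
    """Extract step: serialize a JSON-RPC request body for a Bitcoin Core node.

    Sending it (HTTP POST with the node's RPC credentials) is left out here.
    """
    return json.dumps(
        {"jsonrpc": "1.0", "id": request_id, "method": method, "params": params}
    )


def flatten_block(block):
    """Transform step: turn one decoded block into one flat row per output.

    Expects the shape returned by `getblock <hash> 2`: a dict with "hash",
    "height" and a "tx" list whose items carry "txid" and "vout".
    """
    rows = []
    for tx in block["tx"]:
        for out in tx["vout"]:
            rows.append(
                {
                    "block_hash": block["hash"],
                    "block_height": block["height"],
                    "txid": tx["txid"],
                    "output_index": out["n"],
                    "value_btc": out["value"],
                    # The locking script, i.e. the SCRIPT program, as hex.
                    "script_hex": out["scriptPubKey"]["hex"],
                }
            )
    return rows


# Placeholder block for illustration only (hypothetical values, not real data).
SAMPLE_BLOCK = {
    "hash": "block-hash-placeholder",
    "height": 1,
    "tx": [
        {
            "txid": "txid-placeholder",
            "vout": [
                {"n": 0, "value": 1.5, "scriptPubKey": {"hex": "76a914"}},
                {"n": 1, "value": 0.5, "scriptPubKey": {"hex": "a914"}},
            ],
        }
    ],
}

rows = flatten_block(SAMPLE_BLOCK)
```

Each row could then be loaded into the database of your choice. A real pipeline also has to deal with inputs, chain reorganisations and the many script types, which is where most of the hard work lies.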
Thankfully, some people at Google decided that it would be good to take care of these steps and make the data readily available through their cloud platform and via Kaggle, which they acquired. This is not yet widely known.
So help me get other analysts on board! Share this article with your Data Scientist friends. They can get started with my introductory series of articles, or request a tutorial in the comments below.
Kaggle’s Jupyter kernels are not the only solution, though they provide a good machine and a free quota of queries. Moreover, the more people post good analyses there, the more knowledge will be shared within the community. In any case, there are ways to work straight from the Google Cloud Platform or from your local machine.
In my articles I slowly go through the functioning of the blockchain, and the reader will learn about SCRIPT. Along the way, I highlight some of the issues with the current dataset. I have seen many analysts who don’t consider the subtleties of the Bitcoin architecture, don’t bother checking the cleanliness of the data, and draw false conclusions from their hasty analyses. I hope that my articles will help make good analyses more accessible. To that end, I have been communicating with Google to clean the data. I am also pushing for extra datasets for ease of use:
- a dev/test set
- a stable set
These datasets will address two issues that analysts currently face. First, BigQuery bills by the amount of data a query reads, and on these tables a query reads every row of each column it references, so even adding a LIMIT does not make it cheaper. This makes query writing potentially costly for beginners. Trying new queries against the small dev/test set will allow you to write them step by step and fine-tune them without blowing your quota or your wallet. Second, the stable set will allow you to make full use of BigQuery’s cache. Once you run a query, Google caches the results, so repeating the exact same query returns them at no additional cost instead of running it again. The cache only helps while the underlying tables stay unchanged, which is why a set that is not constantly updated matters. This feature is very important for crafting an analysis and fully exploring the data without fear.
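Until such datasets exist, one way to avoid surprises is to ask BigQuery how much data a query would read before actually running it. The sketch below assumes the `google-cloud-bigquery` Python client with default credentials configured. The price per TiB is a parameter you must supply yourself from the current pricing page, and the table name in the commented usage is an assumption about the public dataset's location.

```python
def estimated_cost_usd(bytes_processed, usd_per_tib):
    """Convert a byte count into an on-demand cost estimate.

    `usd_per_tib` is an assumption supplied by the caller; check the
    current BigQuery pricing rather than trusting a hard-coded value.
    """
    return bytes_processed / 2**40 * usd_per_tib


def dry_run_bytes(sql):
    """Ask BigQuery how many bytes `sql` would read, without running it."""
    # Imported lazily so the pure helper above works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


# Usage (needs credentials, so it is left commented out; table name assumed):
# sql = "SELECT COUNT(*) FROM `bigquery-public-data.crypto_bitcoin.blocks`"
# print(estimated_cost_usd(dry_run_bytes(sql), usd_per_tib=...))
```

A dry run itself is not billed, so it is a cheap habit while you are still shaping a query. For a query that really ran, the job's `cache_hit` attribute should tell you whether the results came from the cache.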
Looking forward to sharing the actual analyses with you! Follow me if you too are curious about what exactly is happening within the blockchains.