The giant footprint of F2Pool, or Discus Fish, the 2nd largest mining pool in Bitcoin.

Data science is fun: the more I do it, the more I see how exciting forensics could be. Lately I have been diving into the Bitcoin ledger, trying to see what’s to be found there. It is an extremely rich dataset, yet very little work has been done on it so far, everyone being more interested in the off-chain data of trading histories. But I’ll show you, step by step, that there are a lot of things worth looking at in the data.

Let us start with a very strange phenomenon. It is linked to the fact that bitcoin transactions are, contrary to usual banking transactions, not necessarily one-to-one. They can be many-to-many like in the picture below:

I define the number of unique addresses in the inputs and in the outputs of a transaction as its in-diversity and out-diversity, respectively. If bitcoin users behaved according to standard banking practice, the in-diversity would be 1 (the sender) and the out-diversity would be 2 (one recipient, plus the sender’s own account for change). Tracking the average daily diversity will therefore show us how much bitcoin activity deviates from standard banking practice.
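To make the definition concrete, here is a minimal sketch in Python. The transaction layout (a dict with "inputs" and "outputs" lists of addresses) and the address names are made up for illustration; this is not the schema of the BigQuery dataset. Outputs that carry no address are represented as None and are simply not counted.

```python
from typing import Iterable, Optional


def diversity(addresses: Iterable[Optional[str]]) -> int:
    """Count distinct addresses, ignoring entries that have no address (None)."""
    return len({a for a in addresses if a is not None})


# Hypothetical transaction: one address list per side.
tx = {
    "inputs": ["addr_A", "addr_A", "addr_B"],  # addr_A spends two of its outputs
    "outputs": ["addr_C", "addr_A", None],     # None: an output with no address
}

in_div = diversity(tx["inputs"])    # distinct senders
out_div = diversity(tx["outputs"])  # distinct recipients, change included
print(in_div, out_div)
```

Using a set means an address that appears several times on one side is counted once, which is what "unique addresses" asks for.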

The graph below shows the average daily diversity across transactions. We see that when transactions other than the coinbase (reward from mining) start appearing, we are rather close to 1-in, 2-out. But the more time passes, the more the behavior changes. There is even a peak in 2011 where the daily average is above 10!!
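The aggregation behind such a graph can be sketched as follows. This is only an illustration on made-up rows, assuming each transaction has already been reduced to a (day, in-diversity, out-diversity) triple; the function name daily_average is my own.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up rows: (day, in_diversity, out_diversity), one per transaction.
rows = [
    (date(2011, 6, 1), 1, 2),
    (date(2011, 6, 1), 3, 2),
    (date(2011, 6, 2), 1, 20),
    (date(2011, 6, 2), 1, 2),
]


def daily_average(rows):
    """Return {day: (mean in-diversity, mean out-diversity)}, sorted by day."""
    by_day = defaultdict(lambda: ([], []))
    for day, in_div, out_div in rows:
        by_day[day][0].append(in_div)
        by_day[day][1].append(out_div)
    return {day: (mean(i), mean(o)) for day, (i, o) in sorted(by_day.items())}


averages = daily_average(rows)
for day, (avg_in, avg_out) in averages.items():
    print(day, avg_in, avg_out)
```

Note how a single transaction with 20 output addresses is enough to pull the second day's average out-diversity far away from 2.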

That peak alone would be extremely fun to look at, and I encourage you to check it for yourself on the BigQuery dataset. But I will keep it for later, maybe. Note that the fall of 2015 shows many more similar events. Surprisingly, the in-diversity shoots past the out-diversity in the first quarter of 2018. Could that be explained by people consolidating their bitcoin dust while it was worth a lot of money? I can’t answer this yet.

Note that the bitcoin ledger is such a zoo that our conclusions may be a little hasty. One may think that colored coins add 1 to out-diversity for example, but the output coloring the coins doesn’t have an address attached to it so it doesn’t appear in this graph. In fact, due to ETL issues, neither do multi-sigs or segwits. So take the graph above with a grain of salt.

But there is something else that I checked. I checked the daily max for in- and out-diversity. And boy did it look interesting!

The first thing that we notice is the plateau of max out-diversity in 2014. That is SO different from the random human activity we saw in the first graph. There is definitely one actor that set the maximum for a year straight. Once again I will let you fork my (unfinished) Kaggle kernel or investigate by yourself on Google’s cloud platform. I preferred looking at the three times it rose above 10000. That’s an incredible number of output addresses!
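Finding the transaction that sets each day's maximum, and then keeping only the days above 10000, can be sketched like this. The rows, transaction ids and function name are invented for the example; only the 10000 threshold comes from the text.

```python
from datetime import date

# Made-up rows: (day, transaction id, out_diversity).
rows = [
    (date(2015, 1, 1), "tx1", 3),
    (date(2015, 1, 1), "tx2", 12000),
    (date(2015, 1, 2), "tx3", 800),
    (date(2015, 1, 2), "tx4", 2),
]


def daily_max(rows):
    """Return {day: (tx_id, out_diversity)} for the largest transaction of each day."""
    best = {}
    for day, tx_id, out_div in rows:
        if day not in best or out_div > best[day][1]:
            best[day] = (tx_id, out_div)
    return best


THRESHOLD = 10_000
maxima = daily_max(rows)
peaks = {day: hit for day, hit in maxima.items() if hit[1] > THRESHOLD}
print(peaks)
```

Keeping the transaction id next to the maximum is what makes the next step possible: once you know which transactions set the record, you can look at where they come from.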

So I checked and found that they are all coming from the same input address:


After a quick investigation I figured out that it belongs to F2Pool, a Chinese mining pool responsible for ~25% of current mining. These are their profit distribution transfers. Focusing on this address, we see that all transfers happen at precisely 00:00:00, and almost every day. But plotting the diversity for this address only, we also discover that it has defined the above curve since the fall of 2014!
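The "precisely 00:00:00, almost every day" observation is easy to check once you have the timestamps of the transactions sent from one address. Here is a small sketch with made-up timestamps; the helper names are mine, and I am assuming the timestamps have already been extracted into Python datetime objects.

```python
from datetime import datetime, time

# Made-up timestamps of transactions sent from one hypothetical address.
timestamps = [
    datetime(2015, 3, 1, 0, 0, 0),
    datetime(2015, 3, 2, 0, 0, 0),
    datetime(2015, 3, 4, 0, 0, 0),
]


def all_at_midnight(stamps):
    """True if every timestamp falls exactly on 00:00:00."""
    return all(ts.time() == time(0, 0, 0) for ts in stamps)


def day_coverage(stamps):
    """Fraction of days, between the first and last transfer, with at least one transfer."""
    days = {ts.date() for ts in stamps}
    span = (max(days) - min(days)).days + 1
    return len(days) / span


print(all_at_midnight(timestamps), day_coverage(timestamps))
```

A coverage close to 1 together with all_at_midnight being true is the signature of an automated daily payout rather than a human clicking "send".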

With a little more work one can discover that there is a second address used by F2Pool that is responsible for the rest of the curve when the first one dips. So the whole max daily out-diversity has been defined by Discus Fish since the fall of 2014. I bet the plateau we saw throughout 2014 was due to another pool.

Let’s make this a challenge for you, reader. Go find out what this plateau is and share your findings in the comments! I hope you’ll have as much fun as I am having.

