BigData (R)evolution

I want to learn more about scale: not scale in general, but scaling data storage and processing. Nowadays, when you hear about big data, it often implies a Hadoop-based distributed system. As Yahoo is one of the biggest contributors to Hadoop at the moment, the best tutorial is also on their website: the Yahoo! Hadoop Tutorial. But that is only the beginning, and there are many different things to learn as distributed environments constantly evolve.

The newest product of the most innovative company from Mountain View is F1, the fault-tolerant distributed RDBMS. Very briefly, it exposes a NoSQL key/value-based interface which allows for simple access to rows. There is a short presentation on F1, if you do not want to read the full paper. As they call it themselves, it is a “descendant of bigtable, successor to megastore”. Surprisingly, according to the presentation, reads are much slower than in MySQL, but… it scales easily and it is not a traditional RDBMS. For more detailed information you have to go and read the full paper, or a nice discussion on

I wonder what Google will show us next. First they introduced a file system and a programming model that allowed for distributed computing on commodity hardware (GFS & MapReduce). Then they built on top of that a database that was able to cope with enormous amounts of data (BigTable). To make it safer, they spread the data across different data centres (Spanner). They have also changed the file system from GFS to Colossus, which allows for smaller data chunks. Finally, to make use of it in areas that require a bit more than reading static user data, they created F1, which gives an RDBMS-like feel on top of a NoSQL database.

Breakthrough articles by Google:
The Google File System
MapReduce: Simplified Data Processing on Large Clusters
Bigtable: A Distributed Storage System for Structured Data
Spanner: Google’s Globally-Distributed Database
F1 – The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business
Wired article about Colossus

Some Hadoop related keywords:
Hadoop – framework for distributed computing
Zookeeper – coordination service for distributed systems
HBase – a BigTable like experience on Hadoop
Hive – data warehouse layer on Hadoop, queried with a SQL-like language called HiveQL
CUDA on Hadoop – computations on GPUs in a distributed environment
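
To make the MapReduce idea behind Hadoop a bit more concrete, here is a minimal single-process sketch of the classic word count in Python. It only illustrates the map, shuffle and reduce steps; it is not Hadoop's API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. A real cluster would run the map and reduce steps on many machines and do the grouping over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real framework this grouping happens between the map and
    reduce stages; here it is just an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, word_count(["big data", "big table"]) returns {"big": 2, "data": 1, "table": 1}. Because each map call only looks at one record and each reduce call only looks at one key, both steps can be spread over many machines, which is what makes the model scale.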