Thursday

Lessons learned: Hadoop + Cassandra integration

After a break of a few weeks, we have at last finished tuning and configuring our Cassandra + Hadoop stack in production. It was exciting, and I decided to share our experience with everyone.
1) Cassandra 1.2 and later has some problems and doesn't integrate with Hadoop very well. The problem is with MapReduce: when we run any MapReduce job, it is always assigned only one mapper, regardless of the amount of data. See here for more detail.
2) If you are going to use Pig for your data analysis, think twice, because Pig always loads all the data from Cassandra storage and only after that can it filter. If you have billions of rows and need to aggregate only a few million of them, Pig will still load all the billions of rows. Here you can find a comparison between Hadoop frameworks for executing MapReduce.
3) If you are using Pig, filter rows as early as possible. Filter out fields that are null or empty.
4) When using Pig, try to model your CF (column family) slightly differently. Use the bucket pattern: store your data by week or month, which is better than storing all the data in one CF. Consider using TTL.
5) If you have more than 8 GB of heap, consider a JVM from IBM (J9) or Azul.
6) Always use a separate hard disk (high speed, i.e. more than 7200 rpm) for the Cassandra commit log.
7) Size your hardware carefully; the choice between RAID 5 and RAID 10 depends on your needs. It's better to create a separate LUN for every node.
8) Tune the bloom filter if you have an analytical node. See here for more information.
9) Tune your Cassandra CF and use the row cache. The Cassandra row cache is an off-heap cache, much like memcached. It's slightly slower than an on-heap cache but much faster than disk I/O.
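
To make the bucket pattern from point 4 a bit more concrete, here is a minimal Python sketch of how bucket names could be derived, so that a job only has to read the weeks or months it actually needs. This is not our production code: the base name `events`, the naming scheme and the helper functions `bucket_name` and `buckets_for_range` are all hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta
from typing import List


def bucket_name(base: str, day: date, granularity: str = "month") -> str:
    """Return the (hypothetical) CF name that would hold the data for `day`."""
    if granularity == "month":
        return f"{base}_{day.year:04d}_{day.month:02d}"
    if granularity == "week":
        # ISO week numbering: the ISO year can differ from the calendar year
        # around New Year, so take both values from isocalendar().
        iso_year, iso_week, _ = day.isocalendar()
        return f"{base}_{iso_year:04d}_w{iso_week:02d}"
    raise ValueError(f"unsupported granularity: {granularity!r}")


def buckets_for_range(base: str, start: date, end: date,
                      granularity: str = "month") -> List[str]:
    """List the buckets covering [start, end], in order and without duplicates."""
    if end < start:
        raise ValueError("end must not be before start")
    names: List[str] = []
    day = start
    while day <= end:
        name = bucket_name(base, day, granularity)
        # Consecutive days usually fall into the same bucket; keep each name once.
        if not names or names[-1] != name:
            names.append(name)
        day += timedelta(days=1)
    return names
```

With monthly buckets, `buckets_for_range("events", date(2013, 5, 20), date(2013, 7, 4))` covers only three buckets, so an analysis of that period would not have to touch older data. Combined with TTL, old buckets can simply age out.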

Disclaimer:
Every experience described above is my own and may differ from the experiences of others.
UP1 - The bug has already been fixed in version 1.2.6.