Sunday

Book Review: Cassandra Design Patterns

This post is my review of the Packt Publishing book Cassandra Design Patterns by Sanjay Sharma. As the title suggests, it is all about patterns and anti-patterns for using Cassandra. The book has about 74 pages covering six chapters.
Preface: What this book covers and Who this book is for
The preface starts with the ideas behind the book. Its main goal is to help the Cassandra audience understand where and how to use Cassandra correctly and effectively. The preface also provides brief summaries of each of the six chapters and the conventions followed in the book.
Under the section "What you need for this book", the author states that readers don't need any particular version of Cassandra, although Cassandra 2.0 or above is preferred. The "Who this book is for" section of the preface specifies the audience of the book: architects or developers starting with Cassandra.
Chapter 1: An Overview of Architecture and Data Modeling in Cassandra
In this chapter the author briefly describes the architecture and history of Cassandra. Most Cassandra books and articles omit this important information. It is really nice to learn how Cassandra picks up and combines the best features of two technologies: Google Bigtable for the data model and Amazon Dynamo for scale-out. The author also covers the core Cassandra architecture, how Cassandra handles writes and reads under the hood, consistency levels, and much more. On one point I can't agree with the author: Cassandra's read performance. When the data is not in cache, Cassandra is not fast at reading data. Every single read needs 2 IOPS to disk, which makes Cassandra slow to read data.
In the last part of the chapter the author outlines the features of Cassandra, which is very useful.
Chapter 2: An Overview of Case and Design Patterns
This chapter introduces a few key use cases and design patterns that are discussed in the following chapters. First, the author describes the 3V model and how Cassandra fits into it. The next section covers Cassandra's high availability architecture and a comparison with Oracle RDBMS. In the next few sections, the author introduces Cassandra's schema flexibility, counter columns, streaming analytics capability, and much more. Like the first chapter, this second chapter covers additional information about Cassandra's strengths and features.
Chapter 3: 3V Patterns
The third chapter covers the fundamental patterns of Cassandra and describes where and how Cassandra should be used. The first part of the chapter explains how Cassandra can handle huge amounts of data to scale a web application. It also mentions why an RDBMS such as Oracle or Teradata does not scale well vertically. Finally, it describes the pattern's solution, focusing on Cassandra CQL and third-party frameworks that enrich Cassandra.
The second pattern focuses on Cassandra's fast write capability. Cassandra supports parallel writes, where each node in a cluster is responsible for a specific key range, which distinguishes Cassandra from a traditional RDBMS. The author also provides a benchmark from Netflix, which is very impressive and informative in support of the pattern.
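The key-range ownership behind this write pattern can be illustrated with a small Python sketch. It is a toy model, not code from the book or from Cassandra: the `TokenRing` class, the node names, and the use of MD5 as the hash are all assumptions made for illustration (Cassandra's real partitioners and virtual nodes are more involved).

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the key range that ends at its token."""

    def __init__(self, nodes):
        # Hypothetical token assignment: hash each node name onto the ring.
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        # MD5 is used only as a stable, well-spread hash for this sketch.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def owner(self, partition_key):
        # The first node whose token is >= the key's token owns the key;
        # keys past the last token wrap around to the first node.
        idx = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[idx % len(self._ring)][1]


nodes = ["node-a", "node-b", "node-c"]
ring = TokenRing(nodes)
keys = ["user:1", "user:2", "user:3", "user:4"]
placement = {key: ring.owner(key) for key in keys}
```

Because every key maps to exactly one owner, writes for different keys can go to different nodes at the same time instead of queuing on a single server.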
The last pattern of this chapter describes Cassandra's schemaless feature. This feature is taken from Google Bigtable and allows Cassandra to store data in multiple formats. This pattern is one of the main advantages over an RDBMS, where schema changes cannot be made online.
Chapter 4: Core Cassandra Patterns
The fourth chapter starts with Cassandra's fundamental feature: high availability. Cassandra provides a highly available data store with peer-to-peer communication between nodes. This feature provides fine-grained control over how the data is spread and replicated across different data centers. The example with Oracle GoldenGate was very informative and helpful.
The next section deals with time series data in Cassandra. The author successfully explains the term time series with examples and provides a solution in CQL pseudocode. The CQL example clarifies the concepts and also shows how to store and retrieve time series data from Cassandra. A few words of additional information about the KairosDB project round out the section.
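The usual shape of a time series layout can be sketched in Python as an in-memory model. This is not the book's CQL and not a Cassandra client: the day-sized bucket, the `sensor-1` source, and the function names are assumptions chosen for illustration. The idea is that the partition key combines a source id with a time bucket, and rows inside a partition stay ordered by timestamp.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Toy model: partition key = (source id, day bucket),
# rows inside a partition are kept sorted by timestamp.
partitions = defaultdict(list)


def day_bucket(ts):
    # Hypothetical bucketing rule: one partition per source per day.
    return ts.strftime("%Y-%m-%d")


def write(source_id, ts, value):
    rows = partitions[(source_id, day_bucket(ts))]
    rows.append((ts, value))
    rows.sort(key=lambda row: row[0])  # mimics clustering order by time


def read_range(source_id, start, end):
    """Return readings with start <= ts < end, assuming both fall on one day."""
    rows = partitions.get((source_id, day_bucket(start)), [])
    return [(ts, value) for ts, value in rows if start <= ts < end]


def at(hour, minute):
    return datetime(2014, 3, 1, hour, minute, tzinfo=timezone.utc)


write("sensor-1", at(10, 5), 21.5)
write("sensor-1", at(10, 0), 21.0)
write("sensor-1", at(11, 0), 22.0)
morning = read_range("sensor-1", at(10, 0), at(11, 0))
```

Bucketing by day keeps any single partition from growing without bound, and a range read only has to touch the partition for the requested day.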
The last pattern in this chapter explains when and how to use counter columns to keep track of events or content. The pseudocode example makes the concept completely clear and shows the reader how it works under the hood.
Chapter 5: Search and Analytics Applied Use Case Patterns
Chapter five focuses on two serious topics: data analysis and search. Every enterprise application needs some search capability over its data, whether a simple search or a complex contextual search. Data analysis, on the other hand, is another business challenge, and it can be very laborious. The first section of this chapter focuses on streaming analytics, or real-time analytics. The author provides a reference architecture that combines the Storm framework with Cassandra to do real-time analytics. With a Storm bolt you can easily get precomputed values or aggregate data to raise alerts on some event.
The second section of this chapter is dedicated to enterprise search. Cassandra, like any other database, doesn't support enterprise search out of the box. For enterprise search, anyone can now use the two most popular search engines, Solr and Elasticsearch, both based on Lucene technology. The author explains what an inverted index is and when not to use Cassandra secondary indexes.
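For readers new to the term, an inverted index maps each term to the set of documents that contain it. The Python sketch below is a minimal illustration only: it is not how Lucene is implemented, and the sample documents and function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Cassandra design patterns",
    2: "Elasticsearch with Cassandra data",
    3: "Graph analysis with Titan",
}
index = build_inverted_index(docs)
hits = search(index, "with cassandra")
```

A lookup goes straight from a term to its matching documents, which is why search engines can answer text queries without scanning every row.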
In the third section, the author focuses on graph analysis. The Cassandra data model is not a good fit for graph analysis, and there are other databases specially suited to this type of model; for example, Neo4j is a popular graph database suited to these tasks. But if somebody has to solve graph analysis over Cassandra data, the Titan framework can solve this problem.
The final section of this chapter is dedicated to Hadoop and Cassandra integration. This topic is big enough to fill several more volumes. Cassandra provides the ColumnFamilyInputFormat and ColumnFamilyOutputFormat classes, which help to run MapReduce on Hadoop. You can use Pig (a data flow language) to run batch analysis over Cassandra data with Hadoop MapReduce, and you can even use Hive-like queries. The author forgot to mention other frameworks, such as Spark or Presto, for fast data analysis.
Chapter 6: Patterns and Anti-patterns
The last chapter focuses on some additional patterns and anti-patterns, which were very interesting to read. The Content/document store pattern is based on the question: which data store should be used as a content/document store? Under the hood Cassandra stores any data as raw bytes, which allows content or documents to be stored as raw bytes in column values. The author also mentions the Astyanax framework, which supports storing and retrieving large objects in chunks.
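The chunking idea can be sketched in a few lines of Python. This is not the Astyanax API; the function names and the 1000-byte chunk size are assumptions for illustration. Each chunk would be stored as the raw-byte value of its own column, keyed by chunk number.

```python
def split_into_chunks(data, chunk_size):
    """Split raw bytes into numbered chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        offset // chunk_size: data[offset:offset + chunk_size]
        for offset in range(0, len(data), chunk_size)
    }


def reassemble(chunks):
    """Join chunks back together in chunk-number order."""
    return b"".join(chunks[number] for number in sorted(chunks))


payload = bytes(range(256)) * 10           # 2560 bytes of sample content
chunks = split_into_chunks(payload, 1000)  # hypothetical chunk size
restored = reassemble(chunks)
```

Keeping each chunk small means no single column value has to hold the whole document, and a reader can fetch the chunks one at a time.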
I found the Materialized view pattern very useful and explained in detail. The author introduces two implementations of materialized views in Cassandra: the application-tier-driven materialized view and the analytics-driven materialized view.
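As a rough illustration of the application-tier-driven variant, here is a Python sketch in which plain dictionaries stand in for tables. It assumes that "application-tier-driven" means the application code itself writes both the base table and the view on every update; the table names, columns, and `save_user` function are hypothetical and not taken from the book.

```python
from collections import defaultdict

# Dictionaries stand in for two tables:
users_by_id = {}                   # base table, keyed by user id
users_by_city = defaultdict(dict)  # "view", keyed by city then user id


def save_user(user_id, name, city):
    """Write the base row and keep the city view in step with it."""
    old = users_by_id.get(user_id)
    if old is not None and old["city"] != city:
        # The user moved: remove the stale entry from the old city's view.
        users_by_city[old["city"]].pop(user_id, None)
    users_by_id[user_id] = {"name": name, "city": city}
    users_by_city[city][user_id] = name


save_user(1, "Ann", "Oslo")
save_user(2, "Bob", "Oslo")
save_user(1, "Ann", "Bergen")  # update moves Ann between view partitions
```

The trade-off is that every write touches two places, in exchange for reads by city that need no secondary index or scan.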
The last part of the chapter covers the Messaging queue anti-pattern. Many times I have heard from people that they are using Cassandra as a messaging queue, and I asked them why. Most of them couldn't answer the question; a few of them had tried to build a distributed persistent queue with Cassandra. There are Hazelcast and many other products that can be used as a distributed queue. I have to agree with the author that using Cassandra as a messaging queue is an anti-pattern.
Summary:
I really enjoyed Cassandra Design Patterns and recommend it to anyone interested in learning about Cassandra. The author touches on most of the main topics of Cassandra and explains them in an accessible way. Today we depend on DataStax for any documentation about Cassandra, and documentation about Cassandra is not widely available. This book could be a major source of information for deciding when and how to use Cassandra for real-life problems. Thanks to the author, Sanjay Sharma, for such a nice book and to Packt Publishing for giving me the chance to review it.

Saturday

Elasticsearch with Cassandra data

Sooner or later every enterprise application needs full-text search over its content. Solr and Elasticsearch, both based on Lucene, are among the best candidates for developing enterprise search. Elasticsearch has become very popular thanks to its simplicity, but out of the box it doesn't support importing data from a Cassandra cluster. However, Elasticsearch provides rivers: a river is a pluggable service running within the Elasticsearch cluster that pulls data (or has data pushed to it), which is then indexed into the cluster. After a little searching I found a cassandra-river on GitHub from eBay; unfortunately, the project was legacy and only supported Cassandra version 1.2*. With a little effort I rewrote the project with the DataStax Cassandra driver. Here you can find the project; it now supports the following features:
1) Cron scheduling;
2) Reading Cassandra rows through paging;
3) Based on DataStax Java driver 2.0.

For quick installation, download the project from GitHub and build it with Maven:
mvn clean install

This creates the river plugin at target/releases/cassandra-river-1.0-SNAPSHOT.zip. To install the river plugin you can use the plugin command-line utility.
From the elasticsearch_home/bin directory run the following command:
./plugin --url file:/PATH/cassandra-river-1.0-SNAPSHOT.zip --install cassandra-river
Now you can start Elasticsearch and initialize the river with the following command:
curl -XPUT 'http://HOST:PORT/_river/cassandra-river/_meta' -d '{
    "type" : "cassandra",
    "cassandra" : {
        "cluster_name" : "Test Cluster",
        "keyspace" : "nortpole",
        "column_family" : "users",
        "batch_size" : 20000,
        "hosts" : "localhost",
        "dcName" : "DC",
        "cron"  : "0/60 * * * * ?"
    },
    "index" : {
        "index" : "prodinfo",
        "type" : "product"
    }
}'
It should start pulling data from your Cassandra cluster.
To remove the plugin use:
./plugin --remove cassandra-river

If you have installed the Elasticsearch _head plugin, you can search as follows:
Improvement plan:
1) Add unit tests
2) Update the index in ES
3) Add newly added rows to ES by date
4) Add multi-table support