Objectives
Data modeling is often the hardest part for people coming from an RDBMS background. The goal here is to lay out the basic rules to keep in mind before designing a schema. Following these guidelines should help you design a schema that performs well.
Cassandra Data Model
To design a good schema, first understand the use cases for the data, then design the schema around the queries. Cassandra is a NoSQL, peer-to-peer database whose data is distributed across multiple machines and datacenters. Data therefore needs to be spread evenly across the cluster, and related data needs to be grouped so that it can be read from one node instead of being gathered from several. Storage is cheap nowadays, but streaming large amounts of data takes time. For both reasons, the partition key and clustering key must be chosen carefully when designing the schema.
Partition Key
Cassandra is a distributed database that stores data on multiple nodes. The partition key determines how data is distributed across those nodes. By default, Cassandra applies Murmur3 hashing to the partition key (the first part of the primary key, not the whole primary key), and the resulting token decides which nodes hold the row.
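The idea of hashing a partition key to pick a node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names are hypothetical, MD5 from the standard library stands in for Murmur3, and a simple modulo replaces the token ranges a real cluster uses.

```python
import hashlib
from collections import Counter

# Hypothetical three-node cluster (names are made up for illustration).
NODES = ["node-a", "node-b", "node-c"]


def token(partition_key: str) -> int:
    """Return a signed 64-bit token for a partition key.

    Assumption: MD5 is used here only as a stand-in for Murmur3,
    which is not available in the Python standard library.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner(partition_key: str) -> str:
    """Map a token to a node.

    Simplification: real clusters assign token ranges to nodes;
    modulo is used here to keep the sketch short.
    """
    return NODES[token(partition_key) % len(NODES)]


# Many distinct partition key values spread roughly evenly over the nodes.
keys = [f"user-{i}" for i in range(3000)]
placement = Counter(owner(k) for k in keys)
print(placement)
```

The same key always lands on the same node, and a key with many distinct values spreads load across the cluster. A key with only a handful of distinct values would concentrate data on a few nodes.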
Clustering Key
The clustering key determines how rows are stored within a partition on a single node. Rows are kept on disk in sorted order by clustering key, so related rows sit close together and can be retrieved in a minimum number of operations. This is especially useful for storing time series data.
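A rough in-memory model of this behaviour, assuming a hypothetical sensor table where `sensor_id` is the partition key and the timestamp `ts` is the clustering key, might look like the following. It is a sketch of the sorted-storage idea, not of Cassandra's actual storage engine.

```python
import bisect
from collections import defaultdict

# partition key -> list of (clustering key, value), kept sorted by clustering key
partitions = defaultdict(list)


def insert(sensor_id, ts, reading):
    """Insert a row so the partition stays sorted by timestamp."""
    bisect.insort(partitions[sensor_id], (ts, reading))


def range_query(sensor_id, start_ts, end_ts):
    """Return rows with start_ts <= ts < end_ts from a single partition."""
    rows = partitions[sensor_id]
    # A 1-tuple (ts,) sorts before any (ts, reading), so bisect_left
    # finds the first row whose timestamp is >= the given bound.
    lo = bisect.bisect_left(rows, (start_ts,))
    hi = bisect.bisect_left(rows, (end_ts,))
    return rows[lo:hi]


# Rows arrive out of order but are stored in timestamp order.
insert("s1", 30, 2.5)
insert("s1", 10, 1.0)
insert("s1", 20, 1.5)
insert("s2", 15, 9.0)

print(range_query("s1", 10, 30))
```

Because the rows are already sorted, a time-range read is a contiguous slice of one partition rather than a scan over unrelated rows.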
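The primary key can take the following forms:
-P1: the primary key (PK) is a single partition key column.
-(P1,C1): PK, where P1 is the partition key and C1 is the clustering key.
-((P1,P2),C1,C2): both the partition key and the clustering key can be compound keys consisting of more than one field.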
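To make the notation concrete, the small helper below builds the CQL `PRIMARY KEY` clause for each form. The function name `primary_key_clause` and the column names are hypothetical and exist only for this illustration; the helper does not talk to a database.

```python
def primary_key_clause(partition_cols, clustering_cols=()):
    """Build a CQL PRIMARY KEY clause from partition and clustering columns."""
    if not partition_cols:
        raise ValueError("at least one partition key column is required")
    if len(partition_cols) == 1:
        partition = partition_cols[0]
    else:
        # A compound partition key gets its own inner parentheses.
        partition = "(" + ", ".join(partition_cols) + ")"
    parts = [partition, *clustering_cols]
    return "PRIMARY KEY (" + ", ".join(parts) + ")"


print(primary_key_clause(["p1"]))                      # PRIMARY KEY (p1)
print(primary_key_clause(["p1"], ["c1"]))              # PRIMARY KEY (p1, c1)
print(primary_key_clause(["p1", "p2"], ["c1", "c2"]))  # PRIMARY KEY ((p1, p2), c1, c2)
```

The inner parentheses are what separate the partition key from the clustering columns: everything inside them is hashed to choose a node, and everything after them only affects ordering within the partition.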
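What not to do
Do not try to reduce the number of writes. Writes are cheap in Cassandra, so extra writes are an acceptable price for more efficient reads.
Do not try to reduce data duplication. Denormalized, duplicated data is normal in Cassandra, since storage is cheap.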
Basic Guidelines
Spread data evenly across the cluster.
Minimize the number of partitions read.
Model around your use cases (queries)
Model around your queries: determine exactly which queries you need in order to fetch your data, and note whether they involve any of the following:
Grouping by fields
Ordering by fields
Filtering by some fields
Distinct results
A change in any of these requirements will change the design of your model.
What queries to support
Determine the queries the application must support.
Design each table and query so that the query reads from only one partition.
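As a sketch of what designing around queries means in practice, the following Python model keeps one "table" per query and writes each record to both. The video example, the table names, and the functions are all hypothetical; the dictionaries stand in for Cassandra tables, with the dictionary key playing the role of the partition key.

```python
from collections import defaultdict

# One "table" per query; each maps a partition key to its rows.
videos_by_user = defaultdict(list)  # query: videos uploaded by a user
videos_by_tag = defaultdict(list)   # query: videos carrying a tag


def add_video(video_id, user, tags, title):
    """Write the same row to every table that serves a query (duplication is intended)."""
    row = {"video_id": video_id, "user": user, "title": title}
    videos_by_user[user].append(row)
    for tag in tags:
        videos_by_tag[tag].append(row)


def get_videos_by_user(user):
    """Read exactly one partition of videos_by_user."""
    return videos_by_user.get(user, [])


def get_videos_by_tag(tag):
    """Read exactly one partition of videos_by_tag."""
    return videos_by_tag.get(tag, [])


add_video(1, "alice", ["cassandra", "nosql"], "Intro to partitions")
add_video(2, "bob", ["cassandra"], "Clustering keys")

print([v["title"] for v in get_videos_by_tag("cassandra")])
print([v["title"] for v in get_videos_by_user("alice")])
```

Each write touches several tables, but each read is answered from a single partition. That is the trade the guidelines above recommend: more writes and duplicated data in exchange for cheap reads.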