Introduction to the data model of Cassandra DB

My last post was about setting up Cassandra. This article discusses Cassandra's data model and objects. In essence, Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-value nature is represented by the row object, whose value is generally organized in columns. In short, Cassandra knows the following objects:

  • Keyspace can be seen as a DB schema in SQL.
  • Column family resembles a table in the SQL world (read on below: this analogy is misleading).
  • Row has a key and, as its value, a set of Cassandra columns, but without the corset of a relational schema.
  • Column is a triplet := (name, value, timestamp).
  • Super column is a tuple := (name, collection of columns).
  • Data types: validators & comparators
  • Indexes

Keyspace

Keyspaces are easy to understand: they are the first-level container for all other objects. Every model begins with a keyspace.

Rows and Columns

Cassandra organizes data in columns and in rows of columns. Rows are collected in an object called a column family.

The similarity to SQL tables is noticeable here. Looking at the columns, all of them have an implicit, externally supplied timestamp ("ts"). Furthermore, rows in the same column family are under no rigid obligation to have the same set of columns or column types. There is also no obligation to provide a value for a column: it can consist of just a name (and a timestamp). Cassandra also allows additional attributes to be specified per column, such as a TTL, but these are not essential for understanding the model in general.
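To make the row and column description a bit more tangible, here is a minimal Python sketch. It is only a model of the idea, not how Cassandra stores anything: the `Column` class, the `now_micros` helper, the `users` column family and all the values in it are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional
import time


@dataclass(frozen=True)
class Column:
    """One Cassandra-style column: a name, an optional value, a timestamp."""
    name: str
    value: Optional[bytes] = None   # a column may carry only a name
    timestamp: int = 0              # supplied from outside, by the writer
    ttl: Optional[int] = None       # optional time-to-live in seconds


def now_micros() -> int:
    """Hypothetical helper: a writer-side timestamp in microseconds."""
    return int(time.time() * 1_000_000)


# A hypothetical "users" column family: row key -> {column name -> Column}.
users = {
    "alice": {
        "email": Column("email", b"alice@example.com", now_micros()),
        "city": Column("city", b"Berlin", now_micros()),
    },
    "bob": {
        # A different set of columns, one of them without a value.
        "is_admin": Column("is_admin", None, now_micros()),
        "session": Column("session", b"abc123", now_micros(), ttl=3600),
    },
}

# Rows in the same column family do not have to share column names.
alice_columns = set(users["alice"])
bob_columns = set(users["bob"])
```

Note that "alice" and "bob" live in the same column family although they have no column in common, which is exactly what a SQL table would not allow.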

Super Column

A super column combines several simple columns under one single name. This grouping provides an additional level of abstraction and access, but it also adds unnecessary complexity.

Hence super columns are no longer favoured. Nowadays it is recommended to work with the C* data model through CQL and to use composite keys instead of super columns (more on this in the next tutorial).
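The idea behind replacing super columns with composite keys can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not CQL and not Cassandra's actual encoding: composite names are modelled as plain tuples, and the row content ("home", "work", the addresses) is invented.

```python
# Super column layout: super column name -> column name -> value.
# All names and values here are made up for illustration.
super_row = {
    "home": {"city": "Berlin", "zip": "10115"},
    "work": {"city": "Munich", "zip": "80331"},
}


def flatten(super_columns):
    """Turn nested super columns into columns with composite (tuple) names."""
    flat = {}
    for super_name, columns in super_columns.items():
        for column_name, value in columns.items():
            flat[(super_name, column_name)] = value
    return flat


flat_row = flatten(super_row)

# Tuples sort component by component, so columns of one group stay adjacent.
sorted_names = sorted(flat_row)

# Reading one former super column becomes a filter on the first component.
home = {name[1]: value for name, value in flat_row.items() if name[0] == "home"}
```

The nesting level disappears, while the grouping survives in the first component of each column name.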

Column families

As a typical NoSQL database, Cassandra does not enforce relationships between column families the way relational databases do between tables. Consequently, Apache Cassandra has no notion of foreign keys. Each column family has a self-contained set of columns that are intended to be accessed together to satisfy the queries of your application. In addition, there is no rigid schema, so don't think of a column family as some sort of relational table. It is better to think of it as a structure like

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

and, in the case of a super column family, as

Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
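The nested map view can be tried out directly. The following Python sketch models `Map<RowKey, SortedMap<ColumnKey, ColumnValue>>` with a plain dict of rows, where each row keeps its column names sorted. The `Row` class, the `insert` function and the sensor data are hypothetical names of mine, chosen only to show why the inner map being sorted matters: a range of column names can be read in order.

```python
from bisect import bisect_left, bisect_right


class Row:
    """Columns of one row, kept sorted by column name (a tiny SortedMap)."""

    def __init__(self):
        self._names = []    # column names in sorted order
        self._values = {}   # column name -> column value

    def put(self, name, value):
        # Insert new names at their sorted position; overwrite existing ones.
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def column_slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end, in order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Column family: Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
column_family = {}


def insert(row_key, column_name, value):
    column_family.setdefault(row_key, Row()).put(column_name, value)


# Columns are written out of order but read back sorted by name.
insert("sensor-1", "2013-01-03", 21.5)
insert("sensor-1", "2013-01-01", 19.0)
insert("sensor-1", "2013-01-02", 20.2)

first_two_days = column_family["sensor-1"].column_slice("2013-01-01", "2013-01-02")
```

A super column family would simply add one more sorted level between the row and its columns.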

Data Types

And of course there are predefined data types in Cassandra, where
  • The data type of a row key or column value is called a validator.
  • The data type of a column name is called a comparator.

You can assign predefined data types when you create your column family (which is recommended), but Cassandra does not require it. Internally, Cassandra stores column names and values as byte arrays (BytesType), usually displayed as hexadecimal. This is also the default client encoding when no data types are defined.
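Why it is worth declaring types becomes clearer with a small experiment. A comparator fixes the order in which column names are sorted, and a validator decides which bytes are acceptable at all. The sketch below imitates both in Python. The function names are hypothetical, and for readability I assume integers arrive as ASCII digits, which is not how Cassandra encodes them on the wire.

```python
# Column names as they might arrive from a client: raw bytes.
raw_names = [b"10", b"9", b"2"]


def bytes_comparator(raw: bytes) -> bytes:
    """Hypothetical comparator: order names byte by byte."""
    return raw


def integer_comparator(raw: bytes) -> int:
    """Hypothetical comparator: order names by their numeric value."""
    return int(raw.decode("ascii"))


by_bytes = sorted(raw_names, key=bytes_comparator)
by_integer = sorted(raw_names, key=integer_comparator)


def validate_int(raw: bytes) -> bool:
    """Hypothetical validator: accept only bytes that parse as an integer."""
    try:
        int(raw.decode("ascii"))
        return True
    except (UnicodeDecodeError, ValueError):
        return False
```

With the byte comparator, `b"10"` sorts before `b"2"`, while the integer comparator yields 2, 9, 10. The same column names give two different physical orders, so the comparator is best chosen when the column family is created.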

The following list shows the built-in Cassandra types:

  • ascii: US-ASCII character string
  • bigint: 64-bit signed long
  • blob: Arbitrary bytes (no validation), expressed as hexadecimal
  • boolean: true or false
  • counter: Distributed counter value (64-bit long)
  • decimal: Variable-precision decimal
  • double: 64-bit IEEE-754 floating point
  • float: 32-bit IEEE-754 floating point
  • inet: IP address string in IPv4 or IPv6 format
  • int: 32-bit signed integer
  • list: A collection of one or more ordered elements
  • map: A JSON-style array of literals: { literal : literal, literal : literal ... }
  • set: A collection of one or more elements
  • text: UTF-8 encoded string
  • timestamp: Date plus time, encoded as 8 bytes since epoch
  • uuid: A UUID in standard UUID format
  • timeuuid: Type 1 UUID only (CQL 3)
  • varchar: UTF-8 encoded string
  • varint: Arbitrary-precision integer
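A few of these types are easy to get a feeling for from a client language. The Python sketch below uses only the standard library, not a Cassandra driver. It shows the difference between uuid and timeuuid and gives rough Python counterparts for varint and decimal. The `is_timeuuid` helper is my own hypothetical check, not part of any Cassandra API.

```python
import uuid
from decimal import Decimal

time_based = uuid.uuid1()    # version 1: built from a timestamp and a node id
random_based = uuid.uuid4()  # version 4: random


def is_timeuuid(value: uuid.UUID) -> bool:
    """Hypothetical check: a timeuuid accepts only type 1 (version 1) UUIDs."""
    return value.version == 1


# Rough Python counterparts of two more types from the list.
varint_value = 2 ** 100                           # Python ints are arbitrary precision
decimal_value = Decimal("0.1") + Decimal("0.2")   # exact decimal arithmetic
```

Both values are valid for a uuid column, but only `time_based` would pass as a timeuuid.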

Indexes

Understanding indexes in Cassandra is essential. There are two kinds of them.

  • The primary index of a column family is the index of its row keys. Each node maintains this index for the data it manages.
  • Secondary indexes in Cassandra are indexes on column values. Cassandra implements a secondary index as a hidden column family.

The primary index determines cluster-wide row distribution. Secondary indexes are very important for custom queries. Cassandra's native index behaves like a hashed index and therefore has limitations on range queries. To get a better understanding of secondary indexes, I would like to recommend Mavazo's nice introduction to secondary indexes.
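The "hidden column family" idea and the range query limitation can be illustrated with a toy model in Python. This is only a sketch under my own assumptions: the index is modelled as a hash map from column value to row keys, and the `users` data and the `build_index` function are invented for the example.

```python
from collections import defaultdict

# Hypothetical "users" column family: row key -> {column name -> value}.
users = {
    "alice": {"city": "Berlin", "age": 31},
    "bob": {"city": "Munich", "age": 45},
    "carol": {"city": "Berlin", "age": 27},
}


def build_index(column_family, column_name):
    """Build a hidden 'index column family': column value -> set of row keys."""
    index = defaultdict(set)
    for row_key, columns in column_family.items():
        if column_name in columns:
            index[columns[column_name]].add(row_key)
    return index


city_index = build_index(users, "city")
age_index = build_index(users, "age")

# Equality lookup: a single hash access into the index.
berliners = city_index["Berlin"]

# Range lookup: a hash gives no ordering, so every indexed value is inspected.
over_30 = {key for age, keys in age_index.items() if age > 30 for key in keys}
```

The equality query touches one index entry, while the range query has to walk the whole index. That is the kind of limitation a hashed index brings with it.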