My last post was about Cassandra setup. This article discusses Cassandra's data model and its objects. In essence, Cassandra is a hybrid between a key-value store and a column-oriented NoSQL database. The key-value nature is represented by the row object, whose value is generally organized into columns. In short, Cassandra knows the following objects:
- Keyspace: can be seen as a DB schema in SQL.
- Column family: resembles a table in the SQL world (read on below, this analogy is misleading).
- Row: has a key and, as its value, a set of Cassandra columns, but without the relational schema corset.
- Column: is a triplet := (name, value, timestamp).
- Super column: is a tuple := (name, collection of columns).
- Data Types: Validators and Comparators
- Indexes
Keyspace
Keyspaces are easy to understand: they are the first-level collection that contains all other objects. Every model begins with a keyspace.
Rows and Columns
Cassandra organizes data in columns and in rows of columns. Rows are collected in an object called a column family.
The similarity to SQL tables is noticeable here. Looking at the columns, note that every one of them implicitly carries an externally supplied timestamp (“ts”). Furthermore, rows in the same column family are under no rigid obligation to have the same set of columns or the same column types. There is also no obligation to provide a value for a column; it can consist of just a name (and a timestamp). Moreover, Cassandra allows additional properties to be specified per column, such as a TTL, but these are not essential for understanding the model in general.
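To make the row and column description more tangible, here is a minimal sketch in plain Python. It is only a model of the idea, not Cassandra's API: the names `Column`, `put` and `users` are made up for illustration, and the "newest timestamp wins" rule in `put` is my shorthand for how the timestamp is used when the same column is written twice.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Column:
    """One Cassandra-style column: a (name, value, timestamp) triplet."""
    name: str
    timestamp: int                 # supplied from outside, e.g. by the client
    value: Optional[bytes] = None  # a column may consist of a name only
    ttl: Optional[int] = None      # optional time-to-live in seconds


# A row is a key plus a set of columns; a column family collects rows.
Row = Dict[str, Column]
ColumnFamily = Dict[str, Row]


def put(cf: ColumnFamily, row_key: str, column: Column) -> None:
    """Store a column, keeping whichever version has the newest timestamp."""
    row = cf.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


users: ColumnFamily = {}
put(users, "alice", Column("email", 1, b"alice@example.org"))
put(users, "alice", Column("city", 1, b"Berlin"))
put(users, "bob", Column("is_admin", 1))                     # name only, no value
put(users, "bob", Column("session", 1, b"token", ttl=3600))  # column with a TTL
put(users, "alice", Column("city", 2, b"Hamburg"))           # newer timestamp wins
```

Note that "alice" and "bob" live in the same column family but share no column names at all, which is exactly what the missing schema corset permits.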
Super Column
A super column is a combination of simple columns grouped under one single name. This nesting provides an additional level of abstraction and access, but it also adds unnecessary complexity.
Hence super columns are no longer favoured. Nowadays it is recommended to work with the C* data model through CQL and to use composite keys instead of super columns (more on this in the next tutorial).
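One way to picture the move from super columns to composite keys is to flatten the two nesting levels into a single level whose names are pairs. The sketch below is plain Python rather than CQL, and `flatten` and `address_book` are hypothetical names chosen for this illustration only.

```python
from typing import Dict, List, Tuple

# One row of a super column family:
# super column name -> column name -> value
SuperRow = Dict[str, Dict[str, str]]


def flatten(super_row: SuperRow) -> Dict[Tuple[str, str], str]:
    """Replace the two nesting levels with one level of composite names."""
    return {
        (super_name, column_name): value
        for super_name, columns in super_row.items()
        for column_name, value in columns.items()
    }


address_book: SuperRow = {
    "work": {"city": "Hamburg", "zip": "20095"},
    "home": {"city": "Berlin", "zip": "10115"},
}

flat = flatten(address_book)

# Sorting the composite names keeps the columns of one former
# super column next to each other.
ordered: List[Tuple[str, str]] = sorted(flat)
```

The information is the same as before, but there is only one kind of column left to reason about.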
Column families
As a typical NoSQL database, Cassandra does not enforce relationships between column families the way relational databases do between tables. Therefore Apache Cassandra has no notion of foreign keys. Each column family has a self-contained set of columns that are intended to be accessed together to satisfy the queries of your application. In addition, there is no rigid schema, so don't think of a column family as some sort of relational table; it is better to think of it as a structure like
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
and, in the case of a super column family, as:
Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
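The nested-map view can be tried out in a few lines of Python. This is a rough sketch, assuming string keys and values; `SortedColumns` is a hypothetical stand-in for the inner sorted map and has nothing to do with Cassandra's real storage engine.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple


class SortedColumns:
    """A tiny stand-in for SortedMap<ColumnKey, ColumnValue>."""

    def __init__(self) -> None:
        self._names: List[str] = []        # column names, kept sorted
        self._values: Dict[str, str] = {}  # column name -> value

    def put(self, name: str, value: str) -> None:
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start: str, end: str) -> List[Tuple[str, str]]:
        """Return the columns with start <= name <= end, in name order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
column_family: Dict[str, SortedColumns] = {}

row = column_family.setdefault("sensor-1", SortedColumns())
for name, value in [("2013-03", "21.5"), ("2013-01", "19.0"), ("2013-02", "20.1")]:
    row.put(name, value)

first_two_months = row.slice("2013-01", "2013-02")
```

Because the inner map is sorted by column name, a contiguous slice of columns within one row is cheap to read, whatever order the columns were written in.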
Data Types
And of course there are predefined data types in Cassandra, for which the following terms are used:
- The data type of a row key (or of a column value) is called a validator.
- The data type of a column name is called a comparator.
You can assign predefined data types when you create your column family (which is recommended), but Cassandra does not require it. Internally, Cassandra stores column names and values as hex byte arrays (BytesType). This is the default client encoding, used when no data types are defined for the column family.
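The difference between the two roles is easier to see with a toy example: a validator decides whether a value is acceptable, while a comparator decides how column names are ordered. The sketch below is plain Python under that assumption; the `VALIDATORS` and `COMPARATORS` tables and their entries are hypothetical and only mimic a handful of types.

```python
from typing import Callable, Dict, List

# Hypothetical validators: each checks that a value fits the declared type.
VALIDATORS: Dict[str, Callable[[object], bool]] = {
    "ascii": lambda v: isinstance(v, str) and v.isascii(),
    "int": lambda v: (isinstance(v, int) and not isinstance(v, bool)
                      and -2**31 <= v < 2**31),
    "boolean": lambda v: isinstance(v, bool),
}


def validate(type_name: str, value: object) -> bool:
    """Return True if the value is acceptable for the given type."""
    return VALIDATORS[type_name](value)


# Hypothetical comparators: each yields the sort key for a column name.
COMPARATORS: Dict[str, Callable[[str], object]] = {
    "bytes": lambda name: name.encode("utf-8"),  # compare raw bytes
    "int": lambda name: int(name),               # compare numerically
}


def sort_column_names(names: List[str], comparator: str) -> List[str]:
    """Order column names the way the chosen comparator would."""
    return sorted(names, key=COMPARATORS[comparator])


names = ["10", "9", "100"]
by_bytes = sort_column_names(names, "bytes")  # ['10', '100', '9']
by_int = sort_column_names(names, "int")      # ['9', '10', '100']
```

The same three column names come back in a different order depending on the comparator, which is why it pays to pick the comparator deliberately instead of relying on raw bytes.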
The following table shows the built-in Cassandra types:

| Type | Description |
|---|---|
| ascii | US-ASCII character string |
| bigint | 64-bit signed long |
| blob | Arbitrary bytes (no validation), expressed as hexadecimal |
| boolean | true or false |
| counter | Distributed counter value (64-bit long) |
| decimal | Variable-precision decimal |
| double | 64-bit IEEE-754 floating point |
| float | 32-bit IEEE-754 floating point |
| inet | IP address string in IPv4 or IPv6 format |
| int | 32-bit signed integer |
| list | A collection of one or more ordered elements |
| map | A JSON-style array of literals: { literal : literal, literal : literal … } |
| set | A collection of one or more elements |
| text | UTF-8 encoded string |
| timestamp | Date plus time, encoded as 8 bytes since epoch |
| uuid | A UUID in standard UUID format |
| timeuuid | Type 1 UUID only (CQL 3) |
| varchar | UTF-8 encoded string |
| varint | Arbitrary-precision integer |
Indexes
Understanding indexes in Cassandra is essential. There are two kinds of them.
- The primary index for a column family is the index of its row keys. Each node maintains this index for the data it manages.
- Secondary indexes in Cassandra are indexes on column values. Cassandra implements a secondary index as a hidden column family.
The row key behind the primary index determines how rows are distributed across the cluster. Secondary indexes are very important for custom queries, that is, queries that filter on column values rather than on row keys. Cassandra's native secondary index behaves like a hashed index: it serves equality lookups well but has limitations on range queries.
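Both kinds of index can be imitated in a few lines. The following is a simplified sketch in plain Python, not a description of Cassandra's internals: the node names, `node_for` (a plain hash modulo the node count, where a real cluster uses a partitioner and token ranges) and `build_secondary_index` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

NODES: List[str] = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(row_key: str) -> str:
    """Pick a node from a hash of the row key (simplified placement)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


users: Dict[str, Dict[str, str]] = {
    "alice": {"city": "Berlin"},
    "bob": {"city": "Hamburg"},
    "carol": {"city": "Berlin"},
}


def build_secondary_index(rows: Dict[str, Dict[str, str]],
                          column: str) -> Dict[str, Set[str]]:
    """Model the hidden column family: column value -> row keys."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for row_key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


city_index = build_secondary_index(users, "city")

# Equality lookup: a single probe into the index.
in_berlin = sorted(city_index["Berlin"])

# Range query: no shortcut in a hash-like index, every value is inspected.
b_to_c = sorted(key
                for value, keys in city_index.items() if "B" <= value < "C"
                for key in keys)
```

The equality lookup touches one index entry, whereas the range query has to walk over all indexed values, which mirrors the limitation mentioned above.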
Let me know if you would like to read more on the topic.