Schema-less databases are the latest buzzword in the IT world. Programmers love their flexibility and low cost, and these attributes have fired up many a start-up. A NoSQL database is schema-agnostic: information can be stored without any upfront schema design. With so much demand in the industry for NoSQL, let’s have a look at the top Cassandra interview questions you must know if you are going to apply for a NoSQL Database Developer or a NoSQL Database Administrator role.
The salary trend for people with Cassandra experience is quite high. So let’s begin with the Cassandra interview questions.
I’ve divided this blog of Cassandra Interview Questions into 3 parts:
- General NoSQL Interview Questions
- Beginners Cassandra Interview Questions
- Advanced Cassandra Interview Questions
General NoSQL Interview Questions
1. What are the key features of any NoSQL Database?
Feature | Description |
Schema Agnostic | Information can be stored without doing any upfront schema design |
Auto-Sharding & Elastic | NoSQL allows the workload to automatically spread across any number of servers |
Highly Distributable | A cluster of servers can be used to hold a single large database. |
Easily Scalable | Allows easy scaling to adapt to the data volume and complexity of cloud applications |
Integrated Caching | Cached data in system memory is transparent to the application developers & operations team. |
2. What is a NoSQL Database?
- NoSQL is also referred to as "Not only SQL" to emphasize that such databases may support an SQL-like query language like the one used in relational databases.
- A NoSQL database provides a mechanism to store and retrieve data that is modeled by means other than the tabular relations used in relational databases.
3. What are the different types of NoSQL Databases?
There are four major types of NoSQL databases:
- Key Value Store
- Document Store
- Column Store
- Graph Databases
4. What is Key-Value Store DB? Explain with an example.
All of the data within the database consists of an indexed key and a value. A key may correspond to one or multiple values, as in a hash table. Key-value stores provide great performance and can be scaled very easily as business needs grow.
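As a minimal illustration, a key-value store behaves like a hash table. The sketch below uses a plain Python dict; the `put`/`get` helper names and the sample keys and values are made up for the example and are not tied to any particular product.

```python
# A toy key-value store backed by a Python dict (a hash table).
# Keys and values here are hypothetical sample data.
store = {}

def put(key, value):
    """Insert or overwrite the value stored under key."""
    store[key] = value

def get(key, default=None):
    """Look up a value by its key; return default if the key is absent."""
    return store.get(key, default)

put("user:1001", "Alice")
put("cart:1001", ["book", "pen"])   # one key may map to multiple values

print(get("user:1001"))   # Alice
print(get("cart:1001"))   # ['book', 'pen']
print(get("user:9999"))   # None
```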
5. What is Document Store DB? Explain with an example.
Each data record is a JSON/XML representation of key-value pairs, and every record can have a different set of fields.
Document DBs are similar to key-value stores; the difference is that each key is associated with a document.
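A small sketch may help here. It assumes two made-up employee documents stored under hypothetical keys, and shows that documents in the same store need not share the same fields. The `find_by_field` helper is illustrative only, not the API of any real document database.

```python
import json

# Hypothetical documents: each key maps to a JSON-like document,
# and the two documents do not share the same set of fields.
documents = {
    "emp:1": {"name": "Alice", "age": 30, "skills": ["CQL", "Java"]},
    "emp:2": {"name": "Bob", "city": "Pune"},
}

def find_by_field(docs, field, value):
    """Return the keys of all documents whose field equals value."""
    return [key for key, doc in docs.items() if doc.get(field) == value]

print(json.dumps(documents["emp:1"], sort_keys=True))
print(find_by_field(documents, "city", "Pune"))   # ['emp:2']
```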
6. What is Column Store DB? Explain with an example.
Data is stored in cells that are grouped into columns of data rather than rows of data. Columns are logically grouped into column families.
One row may hold one or multiple data records and is indexed by a partition key.
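To make the layout concrete, here is a rough sketch that models a column family as a mapping from row key to named columns. The `users` data and the `read_column` helper are hypothetical, and real column stores organise data on disk very differently.

```python
# Toy column-family layout: row key -> {column name -> value}.
# The column family name, row keys and columns are hypothetical.
users = {
    "alice": {"email": "alice@example.com", "city": "Pune"},
    "bob": {"email": "bob@example.com"},          # rows may omit columns
}

def read_column(column_family, column):
    """Collect one column across all rows, skipping rows that lack it."""
    return {row_key: cols[column]
            for row_key, cols in column_family.items() if column in cols}

print(read_column(users, "city"))    # {'alice': 'Pune'}
```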
7. What is Graph DB? Explain with an example.
A graph DB is a type of NoSQL database that uses a flexible graph representation. Its key purpose is to store relationships between nodes.
For example, take a graph whose nodes are Ids 1, 2 and 3, where the properties of Node 1 are Name and Age.
Its edges are Ids 100, 101, 102, 103, 104 and 105.
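One possible way to write such a graph down in code is sketched below. Only the node and edge Ids and the property names of Node 1 come from the text. The property values, edge endpoints and relationship labels are assumptions chosen for illustration.

```python
# Nodes keyed by id; property values are made up for the sketch.
nodes = {
    1: {"Name": "Alice", "Age": 30},
    2: {"Name": "Bob"},
    3: {"Name": "Carol"},
}

# Edges keyed by id: (source node, target node, relationship label).
# Endpoints and labels are assumptions, not taken from the original figure.
edges = {
    100: (1, 2, "knows"),
    101: (2, 1, "knows"),
    102: (1, 3, "follows"),
    103: (3, 1, "follows"),
    104: (2, 3, "knows"),
    105: (3, 2, "knows"),
}

def neighbours(node_id):
    """Return the ids of nodes reachable over one outgoing edge."""
    return sorted({dst for src, dst, _ in edges.values() if src == node_id})

print(neighbours(1))   # [2, 3]
```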
Beginners Cassandra Interview Questions
8. What is Apache Cassandra?
Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
9. What are the features of Apache Cassandra?
Apache Cassandra has a lot of features; some of those that make it stand out from the crowd are:
10. What are the Different types of Data Model?
There are three major types/stages of data model:
- Conceptual Data Model
- Logical Data Model
- Physical Data Model
11. What are the Key Differences between Cassandra and Traditional RDBMS?
12. What are the different Database Elements of Cassandra?
There are 4 main Cassandra Database Elements:
13. What is CQLSH? And why is it used?
cqlsh is a command-line shell that lets users communicate with a Cassandra database using CQL (Cassandra Query Language). Using cqlsh, you can do the following things:
- Define a schema
- Insert data, and
- Execute a query
14. What is a YAML file in Cassandra?
The cassandra.yaml file is the main configuration file for Cassandra. After changing properties in the cassandra.yaml file, you must restart the node for the changes to take effect.
15. What are Clusters in Cassandra?
The outermost structure in Cassandra is the cluster. A cluster is a container for keyspaces.
It is sometimes called the ring, because Cassandra assigns data to nodes in the cluster by arranging them in a ring.
Each node holds replicas for a different range of data.
16. What is a Keyspace in Cassandra?
A keyspace is the outermost container for data in Cassandra. Like a relational database, a keyspace has a name and a set of attributes that define keyspace-wide behaviour. The keyspace is used to group Column families together.
17. How is a Keyspace created in Cassandra? & What are the parameters used?
CREATE KEYSPACE ABC
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
AND durable_writes = true;
The parameters used while creating a keyspace are:
- Keyspace Name
- Replication Strategy
- Replication Factor &
- Durable Writes
18. What are durable writes?
Durable writes tell Cassandra whether to use the commit log for updates on the current keyspace.
This option is not mandatory. The default value for durable writes is true.
19. What do you mean by replication factor?
Cassandra stores copies (called replicas) of each row based on the row key. The replication factor refers to the number of nodes that will act as copies (replicas) of each row of data.
20. What do you mean by replication Strategy?
The replica placement strategy refers to how the replicas will be placed in the ring.
Different strategies ship with Cassandra for determining which nodes will get copies of which keys.
There are mainly two strategies:
- SimpleStrategy
- NetworkTopologyStrategy
21. What is Simple Strategy?
It is used for simple, single-datacenter clusters. It places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring, without considering rack or datacenter location.
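The clockwise walk can be sketched roughly as follows. This is a simplified model rather than Cassandra's implementation: the four-node ring, its small integer tokens and the `simple_strategy_replicas` function are all hypothetical.

```python
from bisect import bisect_left

def simple_strategy_replicas(ring, key_token, replication_factor):
    """Pick replicas by walking the ring clockwise from the key's token.

    ring is a list of (token, node_name) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

# Hypothetical four-node ring with made-up tokens.
ring = [(0, "node-A"), (25, "node-B"), (50, "node-C"), (75, "node-D")]

print(simple_strategy_replicas(ring, key_token=30, replication_factor=3))
# ['node-C', 'node-D', 'node-A']
```

Note that rack and datacenter are never consulted, which is why this strategy suits single-datacenter clusters.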
22. What is Network Topology Strategy?
This is used when we deploy a cluster across multiple datacenters. The primary considerations when placing replicas are being able to satisfy reads locally, without incurring cross-datacenter latency, and being able to handle failure scenarios.
23. What is a Column Family?
A column family is a container for an ordered collection of rows, each of which is itself an ordered collection of columns. You can freely add any column to any column family at any time, depending on your needs. The comparator value indicates how columns will be sorted when they are returned to you in a query.
24. What is a Row in Cassandra? and What are the different elements of it?
A row is a collection of sorted columns. It is the smallest unit that stores related data in Cassandra. Any component of a row can store data or metadata.
The different elements/parts of a row are:
- Row Key
- Column Keys
- Column Values
25. What is a Primary Key? And what are its different types?
The Primary Key is made up of one or more columns that are used to uniquely identify a row.
There are 3 types of Primary Keys:
- Single Primary Key
- Compound Primary Key
- Composite Partitioning Key
These were some beginner-level Cassandra interview questions you must know.
So, let’s move ahead with some advanced Cassandra interview questions.
Advanced Cassandra Interview Questions
26. Differentiate between the various types of Primary Keys in Cassandra.
- In the Single Primary Key, there is only a single column as the primary key.
That column is also called the partitioning key. Data is partitioned on the basis of that column and spread across different nodes on the basis of the partition key.
- In a Compound Primary Key, data is partitioned and then clustered.
For example, with race_name as the partitioning key and race_position as the clustering key, data is partitioned on the basis of race_name and clustered on the basis of race_position. Clustering is the process that sorts data within the partition. Retrieval of rows is very efficient when the rows for a partition key are stored in order, based on the clustering column.
- A Composite Partitioning Key is used to create multiple partitions for the data.
For example, with race_year and race_name as the composite partition key, data is partitioned on the basis of both columns and clustered on the basis of rank. It is used when a single partition would otherwise hold too much data.
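The effect of a compound primary key can be imitated in a few lines of Python. This is only a sketch of the idea: rows are grouped by the partition key and kept sorted by the clustering key. The column names race_name and race_position follow the text, while the `cyclist` column and the sample rows are made up.

```python
from collections import defaultdict

# Hypothetical rows: (race_name, race_position, cyclist).
rows = [
    ("Tour A", 2, "Bob"),
    ("Tour B", 1, "Carol"),
    ("Tour A", 1, "Alice"),
    ("Tour A", 3, "Dave"),
]

# Group rows by the partition key (race_name) ...
partitions = defaultdict(list)
for race_name, race_position, cyclist in rows:
    partitions[race_name].append((race_position, cyclist))

# ... and keep each partition sorted by the clustering key (race_position).
for partition in partitions.values():
    partition.sort()

print(partitions["Tour A"])   # [(1, 'Alice'), (2, 'Bob'), (3, 'Dave')]
```

Because each partition is already ordered, reading the rows of one race in position order needs no extra sort.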
27. Differentiate between Static and Dynamic CQL Tables.
- A static table uses a relatively static set of column names and is similar to a relational database table.
- A dynamic table allows you to pre-compute result sets and store them in a single row for efficient data retrieval.
28. Differentiate between Drop and Truncate in CQLSH
- The Drop table command removes the specified table, including all of its data, from the keyspace.
- The Truncate table command permanently deletes all the rows of a table but keeps the table definition itself.
29. What is Gossip Protocol?
Gossip Protocol in Cassandra is a peer-to-peer communication protocol in which nodes can choose among themselves with whom they want to exchange their state information. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster.
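The way knowledge spreads through gossip can be imitated with a small simulation. This is a heavily simplified sketch: real Cassandra gossip exchanges versioned state on a timer, whereas here each hypothetical node simply merges its set of known node names with one randomly chosen peer per round.

```python
import random

def gossip_round(views, rng):
    """Each node picks one random peer and the two merge what they know."""
    names = list(views)
    for node in names:
        peer = rng.choice([n for n in names if n != node])
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)

# Initially every node knows only about itself (node names are hypothetical).
views = {name: {name} for name in ["n1", "n2", "n3", "n4", "n5", "n6"]}
rng = random.Random(42)

rounds = 0
while any(len(v) < len(views) for v in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1

# After a handful of rounds every node knows about every other node.
print(rounds, all(len(v) == len(views) for v in views.values()))
```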
30. How does gossip Protocol Work?
31. How does gossip Protocol help in Failure Detection?
The process of Acknowledging messages helps in failure detection. When a node is down/failing it is unable to send or receive messages and hence the Acknowledgements are not received.
32. What are partitions and Tokens in Cassandra?
- Partitioner: A hash function, located on each node, that derives a token from the partition key of each row being added. It converts a variable-length input to a fixed-length value.
- Token: An integer value generated by a hashing algorithm, identifying a partition’s location within a cluster.
33. What are the different types of Partitioners in Cassandra? Explain.
- Murmur3Partitioner is the default partitioner. It is an improvement on RandomPartitioner and faster than it. It distributes data uniformly based on the MurmurHash function.
It produces a 64-bit hash value of the partition key, with range -2^63 to 2^63-1.
- RandomPartitioner was the default partitioner prior to Cassandra 1.2. It can be used with vnodes and distributes data uniformly.
It uses MD5 hash values, with range 0 to 2^127-1.
- ByteOrderedPartitioner is used for ordered partitioning. It orders rows lexically by key bytes. Using the ordered partitioner allows ordered scans by primary key. This means we can scan rows as though we were moving a cursor through a traditional index.
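To show what "hashing a key to a token" means, here is a small sketch based on MD5. It only mimics the idea behind an MD5-based partitioner by folding the digest into the range 0 to 2^127-1. It is not Cassandra's exact token computation, and `md5_token` and the sample keys are hypothetical.

```python
import hashlib

def md5_token(partition_key):
    """Map a key to an integer token in the range 0 .. 2**127 - 1.

    This only mimics the idea behind an MD5-based partitioner; it is not
    Cassandra's exact token computation.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)

for key in ["alice", "bob", "carol"]:   # hypothetical partition keys
    print(key, md5_token(key))
```

The same key always hashes to the same token, while similar keys land far apart, which is what gives the uniform distribution.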
34. What do you mean by Snitch? Name a few
A snitch determines which datacenters and racks the nodes belong to. Snitches inform Cassandra about the network topology so that it can distribute replicas appropriately; specifically, the replication strategy places the replicas based on the information provided by the snitch.
There are many types of snitches, to name a few:
- Dynamic snitching
- SimpleSnitch
- RackInferringSnitch
- Ec2Snitch
- PropertyFileSnitch
- GossipingPropertyFileSnitch
- Ec2MultiRegionSnitch
- GoogleCloudSnitch
- CloudstackSnitch
35. How does Cassandra perform write operations?
When a write request comes to the node:
- First, it is logged in the commit log. The data is then captured and stored in the memtable.
- When the memtable is full, the data is flushed to an SSTable data file.
All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
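The steps above can be sketched as a toy model. The `ToyNode` class and its tiny `memtable_limit` threshold are assumptions for illustration. Real memtables are flushed based on size and other settings, and SSTables are files on disk rather than Python lists.

```python
class ToyNode:
    """Very small model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory writes not yet flushed
        self.sstables = []        # immutable, sorted snapshots "on disk"
        self.memtable_limit = memtable_limit   # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log for crash recovery
        self.memtable[key] = value             # 2. store in the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when "full"

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = ToyNode()
for key, value in [("k2", "b"), ("k1", "a"), ("k3", "c")]:
    node.write(key, value)

print(node.sstables)   # [[('k1', 'a'), ('k2', 'b')]]
print(node.memtable)   # {'k3': 'c'}
```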
36. Explain the terms Memtable, CommitLog and SSTables.
- Commit log: The Commit log is a crash-recovery mechanism that supports Cassandra’s durability goals
- MemTable: MemTable is an in-memory data structure that corresponds to a CQL table
- SSTable: The contents of the memtable are flushed to disk in a file called an SSTable.
37. What is the use of Coordinator Node in Read?
Read Operation is easy because clients can connect to any node in the cluster to perform reads. If a client connects to a node that doesn’t have the data it’s trying to read, the node it’s connected to will act as the coordinator node.
38. How does Cassandra perform Read operation? Explain
39. What do you mean by Compaction?
It is the process of freeing up space by merging accumulated data files (SSTables). It improves performance by reducing the number of required seeks.
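A rough sketch of the merge is shown below. It assumes each data file is a mapping from key to a (timestamp, value) pair, keeps only the newest version of each key, and drops deleted entries immediately. Real compaction only purges deletion markers after a grace period, and the `compact` function and sample data are hypothetical.

```python
TOMBSTONE = None   # marker for a deleted value in this sketch

def compact(sstables):
    """Merge SSTables, keeping the newest version of each key.

    Each SSTable is a dict: key -> (timestamp, value). Deleted keys
    (tombstones) are dropped from the merged result.
    """
    newest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return {key: cell for key, cell in sorted(newest.items())
            if cell[1] is not TOMBSTONE}

old = {"k1": (1, "a"), "k2": (1, "b")}
new = {"k1": (2, "a2"), "k2": (3, TOMBSTONE), "k3": (2, "c")}

print(compact([old, new]))   # {'k1': (2, 'a2'), 'k3': (2, 'c')}
```

Two files with five entries become one file with two, so later reads have fewer places to look.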
40. What is Anti-Entropy and how is it associated with the Merkle Tree?
Anti-entropy is the replica synchronization mechanism, ensuring that data on different nodes is updated to the newest version.
Cassandra uses a Merkle tree for anti-entropy repair. A Merkle tree is a hash tree where the leaves are hashes of the values of individual keys.
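The idea can be illustrated with a minimal hash tree. In this sketch the leaves are hashes of individual key/value pairs, and two replicas are considered in sync when their root hashes match. The choice of SHA-256, the `merkle_root` function and the replica contents are assumptions for illustration, not Cassandra's internal implementation.

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(rows):
    """Build a hash tree over key/value rows and return the root hash."""
    # Leaves: one hash per key/value pair, in key order.
    level = [_h(f"{key}={value}".encode()) for key, value in sorted(rows.items())]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [_h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = {"k1": "a", "k2": "b", "k3": "c"}
replica_b = {"k1": "a", "k2": "b", "k3": "stale"}

print(merkle_root(replica_a) == merkle_root(dict(replica_a)))   # True
print(merkle_root(replica_a) == merkle_root(replica_b))         # False
```

Comparing a single root hash is enough to tell whether two replicas differ, so full data only needs to be exchanged for the parts of the tree that disagree.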
41. Explain the different types of Repairs.
- Anti Entropy: Anti-entropy repair is the process of comparing the data of all replicas and updating them with the newest version of the data using a Merkle tree. Anti-entropy repair is triggered manually. The process has two phases:
  - Building a Merkle tree for each replica
  - Comparing the Merkle trees to discover differences
Anti-entropy repair is very useful, and running it periodically is often recommended to keep data in sync.
- Read Repair: Read repair is the process of fixing inconsistencies among the replica nodes at the time of a read request. In a read operation, if some nodes respond with data that is older than the data returned by other replicas, a read repair is performed on the out-of-date nodes. It ensures consistency throughout the node ring. It is done by pulling all of the data from the nodes, performing a merge, and then writing the result back to the nodes that were out of sync.
- Nodetool Repair: Running the nodetool repair command against a node initiates a repair for some range of tokens. The range being repaired depends on which options are specified. The default options, just calling "nodetool repair", initiate a repair of every token range owned by the node.
- Full Repair: Full repairs operate over all of the data in the token range.
- Incremental Repair: Incremental repair only repairs the data that has been written since the previous incremental repair. Incremental repairs are the default repair type and, if run regularly, can significantly reduce the time and I/O cost of performing a repair. It splits the data into repaired and unrepaired SSTables, and only repairs unrepaired data. Once an incremental repair marks data as repaired, it won’t try to repair it again. Incremental repair is not recommended; a full repair should be performed instead.
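The read repair step described above can be sketched in a few lines. The sketch assumes each replica stores a (timestamp, value) pair per key and that the highest timestamp wins. The `read_with_repair` function and the three toy replicas are hypothetical.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and fix replicas that are behind.

    Each replica is a dict: key -> (timestamp, value).
    """
    responses = [replica[key] for replica in replicas if key in replica]
    newest = max(responses, key=lambda cell: cell[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest        # write the newest version back
    return newest[1]

r1 = {"k1": (5, "new")}
r2 = {"k1": (3, "old")}
r3 = {}                                  # this replica missed the write

print(read_with_repair([r1, r2, r3], "k1"))   # new
print(r2, r3)   # {'k1': (5, 'new')} {'k1': (5, 'new')}
```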
42. What is Hinted Handoff?
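Hinted Handoff is a mechanism to ensure availability, fault-tolerance and graceful degradation in Cassandra. When a replica node is unavailable for a write, another node stores a hint describing that write and replays it once the replica returns. The node that holds the hint will know when the unavailable node comes back online again, because of Gossip.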
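A toy model of the mechanism is shown below. The `Coordinator` class, its `node_is_back` method and the node names are hypothetical. In Cassandra the "node is back" signal comes from gossip, and hints are persisted rather than held in a Python list.

```python
class Coordinator:
    """Toy coordinator that stores hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = replicas              # name -> dict acting as storage
        self.up = {name: True for name in replicas}
        self.hints = []                       # (target, key, value) to replay

    def write(self, key, value):
        for name, storage in self.replicas.items():
            if self.up[name]:
                storage[key] = value
            else:
                self.hints.append((name, key, value))   # keep a hint instead

    def node_is_back(self, name):
        """Called when the node is seen alive again (via gossip in Cassandra)."""
        self.up[name] = True
        remaining = []
        for target, key, value in self.hints:
            if target == name:
                self.replicas[target][key] = value      # replay the hint
            else:
                remaining.append((target, key, value))
        self.hints = remaining

coord = Coordinator({"n1": {}, "n2": {}, "n3": {}})
coord.up["n3"] = False
coord.write("k1", "v1")
print(coord.replicas["n3"], coord.hints)   # {} [('n3', 'k1', 'v1')]
coord.node_is_back("n3")
print(coord.replicas["n3"], coord.hints)   # {'k1': 'v1'} []
```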
43. What do you mean by Logging in Cassandra?
Logs are written to the system.log and debug.log files in the Cassandra logging directory.
We can configure logging programmatically or manually. The simplest way to get a picture of what’s happening in your database is to change the logging level to make the output more verbose. By default it is set at INFO.
44. Explain the different Logging levels in Cassandra.
- ALL: All levels including custom levels
- TRACE: Designates finer-grained informational events than the DEBUG
- DEBUG: Designates fine-grained informational events that are most useful to debug an application
- INFO: Designates informational messages that highlight the progress at a coarse-grained level
- WARN: Designates potentially harmful situations
- ERROR: Designates error events that might still allow the application to continue running
- OFF: The highest possible rank and is intended to turn off logging
45. What is JMX? And how is it useful in Cassandra?
JMX (Java Management Extensions) is a Java technology that supplies tools for managing and monitoring Java applications and services. Cassandra makes use of JMX to enable remote management of the servers.
46. What are snapshots and how do you create one in Cassandra?
A snapshot represents the state of the data files at a particular point in time. The nodetool snapshot command is used while taking a backup; it creates hard links for the SSTables in the snapshots folder, which can later be used to restore the node.
47. Why is JConsole used? What are its different elements?
JConsole is used to Monitor and perform analysis on the Server activities. Once you’ve connected to a server, the default view includes four major categories about your server’s state, which are updated constantly:
48. Explain Nodetool Utility.
The Nodetool Utility is a command-line utility that comes out of the box with Cassandra and is a great tool for administration and monitoring. It communicates with JMX to perform operational and monitoring tasks exposed by MBeans.
49. What are Roles in CQLSH?
Roles enable authorization management on a larger scale than per-user security can provide. A role is created and may be granted to other roles, which makes it possible to build hierarchical sets of permissions.
50. What is Python Stress test in Cassandra?
Cassandra comes with a popular utility called py_stress that can be used to run a stress test on Cassandra cluster. The Cassandra-stress tool is a Java-based stress testing utility for basic benchmarking and load testing a Cassandra cluster. This is an effective tool for populating a cluster and stress testing CQL tables and queries.
So, I hope these Cassandra Interview Questions helped you to brush up your knowledge of Apache Cassandra.