50 Latest Apache Cassandra Interview Questions with Answers

Q1: How many types of NoSQL databases are there?
There are four types of NoSQL databases, namely:
Document Stores (MongoDB, Couchbase)
Key-Value Stores (Redis, Voldemort)
Column Stores (Cassandra)
Graph Stores (Neo4j, Giraph)

Q2: What do you understand by Commit log in Cassandra?
Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.

Q3: Define Mem-table in Cassandra.
A memtable is a memory-resident data structure. After a write is recorded in the commit log, the data is written to the memtable, which acts as an in-memory, write-back cache of content in key and column format. Data in the memtable is sorted by key, and each column family has its own memtable, from which column data is retrieved by key. The memtable accumulates writes until it is full and is then flushed to disk as an SSTable.

Q4: What is SSTable?
An SSTable, or 'Sorted String Table', is an important data file in Cassandra. Memtables are regularly flushed to disk as SSTables, which exist for each Cassandra table. SSTables are immutable: once written, no data items can be added or removed. For each SSTable, Cassandra also creates supporting structures such as a partition index, a partition summary and a bloom filter.

Q5: What is bloom filter?
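A bloom filter is an off-heap data structure that Cassandra consults to check whether an SSTable may contain the requested data before performing any disk I/O. It is probabilistic: it can report that data is definitely absent, but a positive answer may occasionally be a false positive.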
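
The idea can be illustrated with a minimal Python sketch. This is a toy model, not Cassandra's implementation: the class name BloomFilter, the bit-array size, the number of hashes and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical partition keys stored in one SSTable.
sstable_filter = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(partition_key)

print(sstable_filter.might_contain("user:2"))  # True: the key was added
```

A read would skip the SSTable whenever might_contain returns False, which is what saves the disk I/O.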

Q6: Establish the difference between a node, cluster & data centres in Cassandra.
A node is a single machine running Cassandra.
A cluster is the full collection of nodes that together hold the data, organized into one or more data centres.
A data centre is a group of related nodes within a cluster. Data centres are useful when serving customers in different geographical areas.

Q7: Define composite type in Cassandra?
In Cassandra, a composite type allows you to define a key or a column name as a concatenation of values of different types. Composite types can be used in two places:
Row Key
Column Name

Q8: What is Cassandra Data Model?
Cassandra Data Model consists of four main components, namely:
Cluster: A cluster is made up of multiple nodes and keyspaces.
Keyspace: A namespace that groups multiple column families, typically one keyspace per application.
Column: It consists of a column name, a value and a timestamp.
Column family: This refers to multiple columns grouped under a row key reference.

Q9: Explain what is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.

Q10: Elaborate on CQL?
A user can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through language-specific application drivers.

Q11: Talk about the concept of tunable consistency in Cassandra.
Tunable consistency is a characteristic that makes Cassandra a favored database choice of developers, analysts and big data architects. Consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. Cassandra's tunable consistency allows users to select the consistency level best suited to their use case. It supports two kinds of consistency: eventual consistency and strong consistency.
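
A small Python sketch of the usual rule of thumb for choosing between the two: reads see the latest write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (RF). The function names here are hypothetical helpers, not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Number of replicas in a quorum: a majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(write_replicas, read_replicas, replication_factor):
    """Reads overlap the latest write when R + W > RF."""
    return write_replicas + read_replicas > replication_factor


rf = 3
print(quorum(rf))                                          # 2
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True
print(is_strongly_consistent(1, 1, rf))                    # False
```

With a replication factor of 3, quorum reads and quorum writes give strong consistency, while reading and writing a single replica gives only eventual consistency.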

Q12: What are the three components of Cassandra write?
The three components are:
Commitlog write
Memtable write
SStable write
Cassandra first writes data to the commit log, then to an in-memory structure called the memtable, and finally flushes it to an SSTable on disk.
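
The three steps above can be sketched as a toy Python model. The ToyNode class, its memtable_limit parameter and the choice to clear the log after a flush are illustrative assumptions; a real node manages commit log segments and flush thresholds in a more involved way.

```python
class ToyNode:
    """Toy model of the write path: commit log, then memtable, then SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record used for crash recovery
        self.memtable = {}     # in-memory writes awaiting a flush
        self.sstables = []     # immutable, key-sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # step 1: commit log write
        self.memtable[key] = value             # step 2: memtable write
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # step 3: SSTable write

    def flush(self):
        # Persist the memtable as a sorted, immutable tuple, then start fresh.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


node = ToyNode(memtable_limit=3)
for key, value in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    node.write(key, value)

print(node.sstables)   # [(('a', 1), ('b', 2), ('c', 3))]
print(node.memtable)   # {'d': 4}
```

The third write fills the memtable and triggers a flush, so the fourth write lands in a fresh memtable and a fresh commit log.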

Q13: Explain zero consistency.
With zero consistency, write operations are handled in the background, asynchronously. It is the fastest way to write data.

Q14: What values are stored in a Cassandra column?
A Cassandra column stores three values:
Column Name
Value
Timestamp

Q15: What do you understand by Kundera?
Kundera is an object-relational mapping (ORM) implementation for Cassandra, written using Java annotations.

Q16: What is the concept of SuperColumn in Cassandra?
A Cassandra SuperColumn is a special element that groups similar collections of data. It is a key-value pair whose value is a sorted array of columns, which adds a further level of hierarchy.

Q17: When do you have to avoid secondary indexes?
Avoid secondary indexes on columns that contain a high count of unique values, because a query on such a column returns very few results.

Q18: List the steps in which Cassandra writes changed data into commitlog?
Cassandra appends changed data to the commit log, which then acts as a crash-recovery log for that data. A write operation is not considered successful until the changed data has been appended to the commit log.

Q19: What is the use of “ResultSet execute(Statement statement)” method?
This method of the driver's Session object executes a query. It requires a Statement object and returns a ResultSet.

Q20: What is Thrift?
Thrift is the name of the Remote Procedure Call (RPC) client used to communicate with the Cassandra server.

Q21: Explain the two types of compactions in Cassandra.
Compaction is a maintenance process in Cassandra in which SSTables are reorganized to optimize the data structures on disk. There are two types of compaction in Cassandra:
Minor compaction: It starts automatically when a new SSTable is created. Here, Cassandra merges similarly sized SSTables into one.
Major compaction: It is triggered manually using nodetool. It compacts all SSTables of a column family into one.
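
The core of a compaction, merging several SSTables and keeping only the newest value for each key, can be sketched in Python. This is a simplified illustration: the compact function and the dictionary layout (key mapped to a value and timestamp pair) are assumptions, and deletions (tombstones) are not modeled.

```python
def compact(sstables):
    """Merge SSTables, keeping the newest (highest-timestamp) value per key."""
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # The result is written out sorted by key, like any other SSTable.
    return dict(sorted(merged.items()))


older = {"k1": ("a", 1), "k2": ("b", 1)}
newer = {"k2": ("c", 2), "k3": ("d", 2)}

print(compact([older, newer]))
# {'k1': ('a', 1), 'k2': ('c', 2), 'k3': ('d', 2)}
```

The stale value of k2 disappears from the merged output, which is how compaction reclaims disk space.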

Q22: Explain what is Cassandra-Cqlsh?
Cassandra-Cqlsh (cqlsh) is a command-line shell that lets users communicate with the database using CQL. With cqlsh, you can do the following:
Define a schema
Insert data, and
Execute a query

Q23: What is the use of “void close()” method?
This method is used to close the current session instance.

Q24: What are the collection data types provided by CQL?
There are three collection data types:
List: A list is an ordered collection of one or more elements, which may contain duplicates.
Map: A map is a collection of key-value pairs.
Set: A set is a collection of one or more unique elements.

Q25: Describe Replication Factor?
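Replication factor is the number of copies of each row of data stored across the cluster. A higher replication factor improves fault tolerance. For example, increasing the replication factor of the system_auth keyspace helps ensure that users can still log into the cluster when a node is down.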
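
As a rough Python illustration of how copies are spread, the sketch below places the first replica on the node that owns a row and the remaining replicas on the next nodes clockwise around the ring, which is the simple (single data centre) placement scheme. The function name replicas_for and the node names are hypothetical.

```python
def replicas_for(owner_index, nodes, replication_factor):
    """Pick the owning node plus the next nodes clockwise on the ring."""
    # Assumption for this sketch: never place more copies than there are nodes.
    count = min(replication_factor, len(nodes))
    return [nodes[(owner_index + i) % len(nodes)] for i in range(count)]


ring = ["node1", "node2", "node3", "node4"]

# A row owned by node3 with a replication factor of 3:
print(replicas_for(2, ring, 3))  # ['node3', 'node4', 'node1']
```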

26). What is the syntax to create keyspace in Cassandra?
The syntax for creating a keyspace in Cassandra is:
CREATE KEYSPACE <identifier> WITH <properties>

27). What is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.

28). What is cqlsh?
cqlsh is a Python-based command-line client for Cassandra.

29). Does Cassandra work on Windows?
Yes, Cassandra works well on Windows. Both Linux-compatible and Windows-compatible versions are available.

30). What do you understand by Consistency in Cassandra?
Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.

31). Explain Zero Consistency?
With zero consistency, write operations are handled in the background, asynchronously. It is the fastest way to write data, but it offers the least confidence that operations will succeed.

32). What do you understand by Thrift?
Thrift is the name of the RPC client used to communicate with the Cassandra server.

33). What do you understand by Kundera?
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.

34). JMX stands for?
Java Management Extensions

35). What is the difference between Cassandra, Hadoop Big Data, MongoDB, CouchDB?
http://www.interviewquestionspdf.com/2015/10/what-is-difference-between-cassandra.html

36). When to use Cassandra?
As part of the NoSQL family, Cassandra suits problems that require a very write-heavy system with a responsive reporting system on top of the stored data. Consider web analytics, where log data is stored for each request and you want to build an analytics platform around it to count hits by hour, by browser, by IP and so on in real time.

37). When should you not use Cassandra? OR When to use RDBMS instead of Cassandra?
Cassandra is a NoSQL database and does not provide ACID or relational data properties. If you have a strong requirement for ACID properties (for example, financial data), Cassandra is not a good fit. You can make it work, but you will end up writing a lot of application code to handle ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.

38). What are secondary indexes?
Secondary indexes are indexes built over column values. For example, say you have a user table that contains each user's email. The primary index is the user ID, so to access a particular user's email you look them up by their ID. Solving the inverse query (given an email, fetch the user ID) requires a secondary index.
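
The user and email example can be mimicked with plain Python dictionaries. This is only an analogy for the two lookup directions, with made-up user IDs and addresses; it does not reflect how Cassandra stores or distributes its indexes.

```python
from collections import defaultdict

# "Primary index": rows are looked up directly by user ID.
users = {
    101: {"email": "ann@example.com"},
    102: {"email": "bob@example.com"},
}

# "Secondary index": maps a column value back to the IDs of matching rows.
email_index = defaultdict(set)
for user_id, row in users.items():
    email_index[row["email"]].add(user_id)

print(users[101]["email"])             # ann@example.com
print(email_index["bob@example.com"])  # {102}
```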

39). When to use secondary indexes?
Use a secondary index when you want to query on a column that is not the primary key and is not part of a composite key, and the column has few unique values. For example, a Town column is a good choice for secondary indexing because many people come from the same town, whereas a date-of-birth column is not such a good choice.

40). When to avoid secondary indexes?
Avoid secondary indexes on columns that contain a high count of unique values, because a query on such a column returns very few results.

41). I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?
XX%. The hit rate is reported as a fraction between 0 and 1, so 0.XX corresponds to XX%.

42). What happens to existing data in my cluster when I add new nodes?
When a new node joins a cluster, it automatically contacts the other nodes in the cluster and copies the right data to itself.
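
Which data counts as "the right data" can be illustrated with a toy hash ring in Python. This sketch assumes one token per node and uses MD5 as a stand-in for a partitioner; the function names token and owner and the node names are hypothetical, and real clusters use different token assignment.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (toy stand-in for a partitioner)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def owner(key, nodes):
    """Return the node whose token is the first at or after the key's token."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # Wrap around to the first node when the key hashes past the last token.
    index = bisect.bisect_left(tokens, token(key)) % len(ring)
    return ring[index][1]


keys = [f"row{i}" for i in range(1000)]
before = {key: owner(key, ["node1", "node2", "node3"]) for key in keys}
after = {key: owner(key, ["node1", "node2", "node3", "node4"]) for key in keys}
moved = [key for key in keys if before[key] != after[key]]

# Every key that changed owner now belongs to the new node, node4.
print(all(after[key] == "node4" for key in moved))
print(len(moved), "of", len(keys), "keys moved")
```

Only the keys falling in the new node's range move; the rest stay where they were, so adding a node does not reshuffle the whole data set.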

43). What are "Seed Nodes" in Cassandra?
A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. A seed node helps bootstrap a new node joining the cluster. It is recommended to use two seed nodes per data center.

44). When to avoid secondary indexes?
Avoid secondary indexes on columns that contain a high count of unique values, because a query on such a column returns very few results.

45). What are the benefits of NoSQL over relational databases?
NoSQL addresses needs that the relational data model does not handle well, which are as follows:
Huge volumes of structured, semi-structured, and unstructured data
Flexible data model(schema) that is easy to change
Scalability and performance for web-scale applications
Lower cost
Impedance mismatch between the relational data model and object-oriented programming
Built-in replication
Support for agile software development

46). What ports does Cassandra use?
By default, Cassandra uses 7000 for cluster communication, 9160 for Thrift clients, and 7199 for JMX (8080 in versions before 0.8). Clients that use the CQL native protocol connect on 9042. These ports are all editable in the configuration file or in bin/cassandra.in.sh (for JVM options). All ports are TCP.

47). What do you understand by High availability?
A high-availability system is one that is ready to serve any request at any time. High availability is usually achieved by adding redundancy, so if one part fails, another part of the system can serve the request. To a client, it seems as if everything worked fine.

48). How does Cassandra provide high availability?
Cassandra is robust software. Nodes joining and leaving are handled automatically. With proper settings, Cassandra can be made failure resistant, meaning that if some of the servers fail, no data is lost. So you can deploy Cassandra on cheap commodity hardware or in a cloud environment, where hardware or infrastructure failures may occur.

49). Who uses Cassandra?
Cassandra is in wide use around the world, and usage is growing all the time. Companies like Netflix, eBay, Twitter, Reddit, and Ooyala all use Cassandra to power pieces of their architecture, and it is critical to the day-to-day operations of those organizations. To date, the largest publicly known Cassandra cluster by machine count has over 300TB of data spanning 400 machines.
Because of Cassandra's ability to handle high-volume data, it works well for a myriad of applications. This means that it's well suited to handling projects from the high-speed world of advertising technology in real time to the high-volume world of big-data analytics and everything in between. It is important to know your use case before moving forward to ensure things like proper deployment and good schema design.

50). When to use secondary indexes?
Use a secondary index when you want to query on a column that is not the primary key and is not part of a composite key, and the column has few unique values. For example, a Town column is a good choice for secondary indexing because many people come from the same town, whereas a date-of-birth column is not such a good choice.
