Interview Solutions
This section offers curated MongoDB interview questions with detailed answers to sharpen your knowledge and support exam preparation.
1. How can we change the configuration of the replica set?

Answer» Loading every document into RAM means the query is not using an index efficiently and has to fetch documents from disk into RAM. For a query to use an index, the initial match in the find statement should use either the index or an index prefix. The query below filters on the key b, which does not match any existing index or index prefix, so it has to fetch documents from disk:

db.sample.find( { b : 1 } ).sort( { c : 1, a : 1 } )
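A minimal sketch of the index-prefix rule, assuming a hypothetical compound index on the sample collection:

// hypothetical compound index
db.sample.createIndex( { c: 1, a: 1 } )

// can use the index: the filter matches the index prefix { c: 1 }
db.sample.find( { c: 1 } ).sort( { a: 1 } )

// cannot use the index: b is not part of any index prefix
db.sample.find( { b: 1 } ).sort( { c: 1, a: 1 } )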
2. What is the need for Replication in MongoDB and what all kinds of replica members does it support?

Answer» In replication we keep multiple copies of the same data in sync with each other; it is mainly useful for high availability. In sharding, by contrast, we divide the entire dataset into small chunks and distribute them among several servers. Sharding is used where we have some sort of hardware bottleneck or want the benefits of query parallelism. If our dataset is very small, sharding does not provide many advantages, but as the data grows we should move to sharding. One situation where sharding is recommended over replication: breaking the dataset over shards means each shard has more resources available to handle the subset of data it owns, and operations that move data across machines for replication, backups and restores are also faster.
3. How do sharding and replication affect concurrency in MongoDB?

Answer» In some cases, chunks can grow beyond the specified chunk size but cannot undergo a split. The most common scenario is when a chunk represents a single shard key value. Since the chunk cannot split, it continues to grow beyond the chunk size, becoming a jumbo chunk. These jumbo chunks can become a performance bottleneck as they continue to grow, especially if the shard key value occurs with high frequency. The addition of new data or new shards can result in data distribution imbalances within the cluster: a particular shard may acquire more chunks than another shard, or the size of a chunk may grow beyond the configured maximum chunk size. MongoDB ensures a balanced cluster using two processes: chunk splitting and the balancer.
4. What are different index options MongoDB provides?

Answer» Chunk split operations are carried out automatically by the system when an insert operation causes a chunk to exceed the maximum chunk size. The balancer then migrates recently split chunks to other shards. In some cases, however, we may want to pre-split chunks manually. To split chunks manually we can use the split command via the helpers sh.splitFind() and sh.splitAt(). Example: to split the chunk of the employee collection on the employee id field at a value of 713626, the command below should be used:

sh.splitAt( "test.employee", { "employeeid": 713626 } )

We should be careful while pre-splitting chunks, as it can sometimes lead to a collection with chunks of very different sizes.
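A hedged sketch of the difference between the two helpers (the namespace and values follow the example above and are illustrative):

// splits the chunk containing the first document matching the query, at the chunk's median point
sh.splitFind( "test.employee", { "employeeid": 713626 } )

// splits the chunk at exactly the given shard key value
sh.splitAt( "test.employee", { "employeeid": 713626 } )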
5. What are two important properties of MongoDB replication?

Answer» Shard key selection is based on the workload. Since the first query is used 90% of the time, it should drive the selection of the shard key; a combination of fields from that query makes the best shard key. This eliminates options b, c and d. Options a and e both use a subset of fields from the most-used query and either could serve as a shard key, but the option that includes more fields from that query is more suitable.
6. What is the role of the balancer in a sharded cluster environment and how does it work?

Answer» In MongoDB we can read from the primary as well as the secondary members of a replica set. We control this behaviour by defining the read preference, which determines to which replica set member clients route read operations. If we do not specify any read preference, MongoDB reads from the primary by default. There are situations where you would want to reduce the load on your primary by directing applications to read from a secondary. The different MongoDB read preference modes are listed here, with a usage sketch after the list:

primary: This is the default mode. Applications read from the replica set primary.
primaryPreferred: Applications read from the primary, but if the primary member is not available they read from a secondary.
secondary: Applications read only from the secondary members of the replica set.
secondaryPreferred: Applications read from a secondary, but if no secondary member is available they read from the primary.
nearest: Applications read from the member that is nearest to them in terms of network latency, irrespective of whether that member is primary or secondary.
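A minimal sketch of setting a read preference from the mongo shell (the collection name is illustrative):

// route reads to a secondary when one is available, otherwise fall back to the primary
db.getMongo().setReadPref("secondaryPreferred")
db.sample.find( { state: "WA" } )

// the same preference can also be set in the connection string:
// mongodb://host1:27017,host2:27017/?replicaSet=rs0&readPreference=secondaryPreferred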
7. As a MongoDB administrator you are asked to perform hardening of the current MongoDB environment. What all should be implemented?

Answer» Idempotence is the property of certain operations whereby they can be applied multiple times without changing the result beyond the initial application. In MongoDB, oplog entries are idempotent: even if they are applied multiple times, the same output is produced. So if the server goes down and we need to apply oplogs, there will not be any inconsistency; even if a log that was already applied is applied again, the end state of the database does not change. There was also a desire to have the new state of a document be independent of its previous state. For this, all operators that rely on the previous state to determine a new value need to be transformed to record the actual values. For example, if an addition operation results in modifying a value from 21 to 30, the operation is rewritten to set the value 30 on the field; replaying the operation multiple times then produces the same result.
8. What are the different Monitoring utilities available for MongoDB?

Answer» MongoDB applies database operations on the primary and then records the operations in the primary's oplog. The secondary members then copy and apply these operations in an asynchronous process. For each operation there is a separate oplog entry. First, let's check how many rows the query would fetch by changing the delete to a find operation:

db.sample.find( { state : "WA" } )

This returns all the rows where the state is WA:

{ "firstName" : "Arthur", "lastName" : "Aaronson", "state" : "WA", "city" : "Seattle", "likes" : [ "dogs", "cats" ] }
{ "firstName" : "Beth", "lastName" : "Barnes", "state" : "WA", "city" : "Richland", "likes" : [ "forest", "cats" ] }
{ "firstName" : "Dawn", "lastName" : "Davis", "state" : "WA", "city" : "Seattle", "likes" : [ "forest", "mountains" ] }

Ideally a delete would remove all matching rows, but the query says deleteOne. If the query had said deleteMany, all three matching rows would have been deleted and there would be 3 oplog entries, but deleteOne removes only the first matching row. So 1 oplog entry is generated by the provided query.
9. Your production sharded cluster has lots of chunks with a jumbo flag which is hampering application performance. How will you clear the jumbo flag?

Answer» The oplog is the operation log that records all operations that modify the data stored in databases. We can define the oplog size while starting MongoDB by specifying the --oplogSize option. If we do not specify this option, it takes the default value, which for WiredTiger is 5% of free disk space. While the default value is sufficient for most workloads, in some cases we may need to change the oplog size for the replica set. The oplog size is changed in a rolling manner: first on all secondaries, then on the primary member of the replica set. To check the current oplog size:

use local
db.oplog.rs.stats().maxSize
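A hedged sketch of resizing the oplog in place (available from MongoDB 3.6; the size is in megabytes and the value here is illustrative):

// run on each member in turn, secondaries first, then the primary
db.adminCommand( { replSetResizeOplog: 1, size: 16000 } )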
10. db.employee.find({"employeeid" : {"$gte" : 15000, "$lte" : 70000}});

Answer» Every operation on the primary is logged in the operation log known as the oplog. These oplogs are replicated to the secondary members. For a healthy replica set system, it is recommended that all members are in sync with no replication lag. Data is first written on the primary by the applications and then replicated to the secondaries. This synchronization is important to maintain up-to-date copies of data on all members. Synchronization happens in 2 ways: initial sync and continuous replication.
11. Which shards would be involved in answering the following query:

Answer» The first part of the query gives all documents where y >= 10, so we have 2 documents:

d> { "_id" : 4, "x" : 4, "y" : 10 }
e> { "_id" : 5, "x" : 5, "y" : 75 }

The second part of the query updates the value of y for the above 2 documents to 75, but document e already has y : 75, so it is not modified. Finally, only one document is updated by the provided query:

d> { "_id" : 4, "x" : 4, "y" : 10 }
12. shard key: { "employeeid" : 1 }

{ "employeeid" : { "$minKey" : 1 } } -->> { "employeeid" : 8122 } on : shard0000
{ "employeeid" : 8122 } -->> { "employeeid" : 17600 } on : shard0002
{ "employeeid" : 17600 } -->> { "employeeid" : 25851 } on : shard0003
{ "employeeid" : 25851 } -->> { "employeeid" : 35852 } on : shard0004
{ "employeeid" : 35852 } -->> { "employeeid" : 46047 } on : shard0005
{ "employeeid" : 46047 } -->> { "employeeid" : 55450 } on : shard0006
{ "employeeid" : 55450 } -->> { "employeeid" : 64644 } on : shard0007
{ "employeeid" : 64644 } -->> { "employeeid" : 73769 } on : shard0000
{ "employeeid" : 73769 } -->> { "employeeid" : 82950 } on : shard0002
{ "employeeid" : 82950 } -->> { "employeeid" : 91983 } on : shard0001
{ "employeeid" : 91983 } -->> { "employeeid" : { "$maxKey" : 1 } } on : shard0001

Answer» In MongoDB data is stored as JSON documents. These documents can have different sets of fields, with a different data type for each field. For example, a collection can hold a number, a string and an array, each in a different document:

{ "a" : 143 }
{ "name" : "John" }
{ "x" : [1,2,3] }

It is not correct to say MongoDB is schemaless; in fact, schema plays an important role in the design of MongoDB applications. MongoDB has a dynamic schema, with a database structure of collections and indexes. Collections can be created either implicitly or explicitly. Due to this dynamic behaviour of the schema, MongoDB has several advantages over RDBMS systems. Schema migrations become very easy: in traditional systems we had to use the ALTER TABLE command after adding any column, which could result in downtime, whereas in MongoDB such adjustments are transparent and automatic. For example, if we want to add a city field to the people collection, we just add the attribute and resave; in a traditional system we would have to run ALTER TABLE followed by a reorg, which would require downtime.
13. Suppose we have a sharded cluster having a sharded collection employee sharded on key employee id having below chunk distribution:

Answer» In MongoDB we have built-in roles as well as custom roles. Built-in roles have pre-defined access associated with them; we can assign these roles directly to users or groups. To run mongostat we require the serverStatus privilege action on the cluster, and the built-in role clusterMonitor comes with the required access. Custom roles, or user-defined roles, are the ones where we manually define access actions on particular resources. MongoDB provides the method db.createRole() for creating user-defined roles. These roles are created in a specific database, as MongoDB uses the combination of database and role name to uniquely identify a role. We will create a custom role mongostatRole that provides only the privileges needed to run mongostat. First, connect with mongod or mongos to the admin database with a user that has privileges to create roles in the admin as well as other databases:

mongo --port 27017 -u admin -p 'abc***' --authenticationDatabase 'admin'

Now create the desired custom role in the admin database:

use admin
db.createRole({
   role: "mongostatRole",
   privileges: [ { resource: { cluster: true }, actions: [ "serverStatus" ] } ],
   roles: []
})

This role can now be assigned to members of the monitoring team.
14. How can MongoDB wiredTiger internal cache be sized? Also, how does it differ from the filesystem cache?

Answer» The $exists operator matches only those documents that contain the field specified in the query. For the following documents in the employee collection:

{ _id: 1, name: "Jonas", linkedInProfile: null }
{ _id: 2, name: "Williams" }

The query { linkedInProfile: { $exists: true } } will return only the employee "Jonas".
15. In your production environment due to an imbalance in shard data balancer initiates the chunk migration. Explain how chunks will be migrated?

Answer» This can be achieved in MongoDB using the $type operator. A null value, i.e. BSON type null, has the type number 10. Using this type number, only those documents whose value is null can be retrieved. Take the example of the two documents below in the startup collection:

{ _id: 1, name: "XYZ Tech", website: null }
{ _id: 2, name: "ABC Pvt Ltd" }

The query { website : { $type: 10 } } retrieves only those documents where website is null; in the above case that is the startup "XYZ Tech".

Note: The query { website : null }, on the other hand, matches documents where website is null as well as documents where the website field does not exist. For the above collection data, that query returns both startups.
16. What are the different Encryption options MongoDB offers?

Answer»
17. What are the different backup methods MongoDB provides?

Answer» We can change the configuration of the replica set as per the requirements of the application. Configuration changes may include adding a new member, adding an arbiter, removing a member, changing priority or votes for members, or changing a member from a normal secondary to a hidden or delayed member.

To add a new member, first start the mongod process with the --replSet option on the new server, then run:

rs.add( { host: "hostname:port" } )

Once added, the member fetches the data from the primary using initial sync and then stays current through replication.

To add an arbiter:

rs.addArb("hostname:port")

To remove a member:

rs.remove("hostname:port")

As a good practice, shut down the member being removed before running the above command.

To change the replica set configuration:

rs.reconfig(newConfig)

Reconfig is best explained with an example. Suppose we have replica set "rs0" and want to raise the priority of its second member. From the primary:

cfg = rs.conf();
cfg.members[1].priority = 2;
rs.reconfig(cfg);
18. How we can enable keyfile authentication on the existing sharded cluster without downtime?

Answer» MongoDB has replication to provide the high availability and redundancy which are the basis of any production database. With replica sets, we can achieve HA as well as DR capability. Replication also enables horizontal scaling, allowing the use of commodity servers instead of enterprise servers. With proper configuration, replication can prevent downtime even if an entire data centre goes down. There are several types of replica members, chosen based on requirements: regular secondaries that can be elected primary, arbiters that vote in elections but hold no data, hidden members that are invisible to applications, and delayed members that maintain a lagging copy of the data for recovery from mistakes.
19. What are the different authentication mechanism MongoDB supports?

Answer» It is important to maintain data consistency in any database, especially when multiple applications are accessing the same piece of data simultaneously. MongoDB uses locking and other concurrency control measures to ensure consistency. Multiple clients can read and write the same data while MongoDB ensures that every write to a single document either occurs in full or not at all, so that clients never see inconsistent data.

Effect of sharding on concurrency: In sharding, collections are distributed among several shard servers, which improves concurrency. The mongos process routes many operations concurrently to different shards and combines the results before sending them back to the client. In a sharded cluster, locking is at the individual shard level rather than the cluster level, so operations on one shard do not block operations on another; each shard uses its own locks, independent of the other shards in the cluster.

Effect of replication on concurrency: In a MongoDB replica set, each operation on the primary is also written to a special capped collection in the local database called the oplog. So every time an application writes to MongoDB, it locks both databases, i.e. the collection's database and the local database. Both must be locked at the same time to keep the database consistent and to ensure that, even with replication, write operations keep their "all-or-nothing" property.

In MongoDB replication, applications do not write to secondaries; the secondaries receive writes from the primary in the form of the oplog. These oplog entries are not applied serially: they are collected in batches and the batches are applied in parallel, while the write operations are still applied in the same order as they appear in the oplog. While oplog batches are being applied, a secondary does not allow reads of the data being applied, to maintain consistency.
20. How can we perform backup for a sharded cluster?

Answer» Indexes help in improving the performance of queries. Without indexes, a query must perform a collection scan, where each and every document of the collection is scanned for the desired result. With proper indexes, we can limit the number of documents scanned, improving query performance. Like collections, indexes use storage, as they store a small portion of the collection's data. For example, if we create an index on the field 'name', it stores the data for this field in ascending or descending order, which also helps sort operations. Using indexes, we can satisfy equality matches and range-based queries more efficiently. Some of the different index options available in MongoDB are:

Default _id index: By default, MongoDB creates an index on the _id field at the time of creating a collection. This is a unique index and prevents applications from inserting multiple documents with the same value for the _id field. MongoDB ensures that this index cannot be deleted.

Single field and compound indexes: These are indexes on any one field or a combination of fields, i.e. db.records.createIndex( { score: 1 } ) creates an index on the single field "score", and db.products.createIndex( { "item": 1, "stock": 1 } ) creates an index on the combination of "item" and "stock".

Multikey indexes: MongoDB provides the option of creating an index on the contents stored in arrays. For every element of the array, a separate index entry is created. Multikey indexes let us select matching elements of an array more efficiently.

Geospatial indexes: MongoDB also provides geospatial indexes, which help to efficiently query geospatial coordinate data: 2d indexes for planar geometry and 2dsphere indexes for spherical geometry.

Text indexes: To support searching string content in a collection, MongoDB provides text indexes. These indexes only store root words, ignoring language-specific stop words like 'the', 'a', etc.

Partial indexes: To search for a specific filter expression in a collection, partial indexes are used. Since they store only a subset of the documents in a collection, they have lower storage requirements, and index creation, maintenance and performance costs are also lower.

Sparse indexes: Sparse indexes contain entries only for the documents that have the indexed field, skipping documents that lack it.

TTL indexes: Certain applications have requirements where documents need to be removed automatically after a certain amount of time. We can achieve this using TTL indexes: we specify a TTL (time to live) for the documents, after which a background process removes them. This index is ideal for logs, session data and event data, as such data only needs to persist for a limited time. A sketch of a TTL index and a partial index follows this list.
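A hedged sketch of the last two options (collection and field names are illustrative):

// TTL index: documents are removed about 3600 seconds after their createdAt time
db.eventlog.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 3600 } )

// partial index: index only the orders that are still in progress
db.orders.createIndex(
   { orderId: 1 },
   { partialFilterExpression: { status: "in-progress" } }
)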
21. What are the important factors that affect the choice of efficient shard key?

Answer» MongoDB creates oplog entries for each operation on the primary and these are then replicated to the secondaries. MongoDB uses asynchronous replication and automatic failover to perform this efficiently.

Asynchronous replication: Oplog entries from the primary are applied to secondaries asynchronously. This lets applications continue without downtime despite the failure of members. MongoDB deployments usually run on commodity servers, where synchronous replication would mean waiting on acknowledgements with latencies on the order of 100ms, which is quite high; for this reason MongoDB prefers asynchronous replication. From version 4.0.6, MongoDB can log entries for slow oplog operations on the secondary members of a replica set. These slow oplog messages are logged for the secondaries in the diagnostic log under the REPL component. Slow oplog entries do not depend on log levels or the profiling level, only on the slow operation threshold; the profiler does not capture them.

Automatic failover: Many traditional databases follow a master-slave setup, where in case of master failure we have to manually cut over to a slave database. In MongoDB, we can have one primary with multiple secondaries. With few servers we could still afford a manual cutover, but a MongoDB deployment may have 100 shards, making manual cutover impractical, so MongoDB has automatic failover. When the primary is unable to communicate with the other members for more than the configured time (electionTimeoutMillis), an eligible secondary triggers an election to nominate itself as primary. Until the new primary is elected, the cluster cannot serve write requests and can only serve reads; once the new primary is selected, the cluster resumes normal operations. The architecture of the cluster should be designed keeping in mind network latency and the time required for replica sets to complete elections, as these affect how long the cluster runs without a primary.
22. What is the process to setup sharded cluster?

Answer» The balancer is a background process that runs on the primary of the config server replica set in a cluster. It constantly monitors the number of chunks on each shard, and if the number of chunks on a specific shard exceeds the migration threshold, it automatically migrates chunks between shards so that each shard has an equal number of chunks, moving chunks from shards with more chunks to shards with fewer. For example, suppose we have 2 shards [shard01, shard02] with 4 and 5 chunks respectively, and we add another shard [shard03]. Initially shard03 has no chunks; the balancer notices this uneven distribution and migrates chunks from shard01 and shard02 to shard03 until all three shards have three chunks each. There can be a performance impact when the balancer migrates chunks, as migrations carry some overhead in terms of bandwidth and workload. To minimize the impact, the balancer migrates only one chunk at a time and can be restricted to a configured balancing window; common balancer controls are sketched below.

Impact of adding and removing shards on the balancer: Adding or removing a shard creates an imbalance, as either the new shard has no chunks or the removed shard's chunks need to be redistributed throughout the cluster. If a shard is removed from a cluster with uneven chunk distribution, the balancer drains the chunks from the departing shard before balancing the remaining uneven chunks. When the balancer notices such an imbalance, it starts the chunk migration process immediately; the migration process takes time to complete.
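A minimal sketch of the usual balancer controls from the mongos shell:

// check whether the balancer is enabled, and whether a migration is currently running
sh.getBalancerState()
sh.isBalancerRunning()

// disable and re-enable the balancer (e.g., around backups or maintenance)
sh.stopBalancer()
sh.startBalancer()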
23. What is the need for sharding in MongoDB and what are the different components of a sharded cluster?

Answer» Security is very important for any production database. MongoDB documents a set of best practices for hardening a MongoDB deployment, and this list should act as a security checklist before we give a green light to any production deployment. Typical items on that checklist include: enable access control and enforce authentication; configure role-based access control so users have only the privileges they need; encrypt communication with TLS/SSL; encrypt data at rest; limit network exposure by binding to specific interfaces and firewalling ports; enable auditing of system activity; run MongoDB under a dedicated, unprivileged user; and keep MongoDB and its host patched and up to date.
24. What are different factors and conditions affecting elections when a primary replica set goes down?

Answer» Monitoring is a critical component of all database administration. A firm grasp of MongoDB's reporting will allow us to assess the state of the database and maintain the deployment without crisis. Below are some of the utilities used for MongoDB monitoring.

mongostat: The mongostat utility provides a quick overview of the status of a currently running mongod or mongos instance. mongostat is functionally similar to the UNIX/Linux utility vmstat, but provides data on mongod and mongos instances. To run mongostat, the user must have the serverStatus privilege action on the cluster resource. E.g. to run mongostat every 2 minutes, the command below can be used: mongostat 120

mongotop: mongotop provides a method to track the amount of time a mongod instance spends reading and writing data, with statistics on a per-collection level. By default, mongotop returns values every second. E.g. to run mongotop every 30 seconds, the command below can be used: mongotop 30

MongoDB also includes a number of commands that report on the state of the database:

serverStatus: The serverStatus command, or db.serverStatus() from the shell, returns a general overview of the status of the database, detailing disk usage, memory use, connections, journaling, and index access. The command returns quickly and does not impact MongoDB performance.

dbStats: The dbStats command, or db.stats() from the shell, returns a document that addresses storage use and data volumes: the amount of storage used, the quantity of data contained in the database, and object, collection, and index counters. We can use this data to monitor the state and storage capacity of a specific database, compare usage between databases, and determine the average document size in a database.

collStats: The collStats command, or db.collection.stats() from the shell, provides statistics resembling dbStats at the collection level: a count of the objects in the collection, the size of the collection, the amount of disk space used by the collection, and information about its indexes.

replSetGetStatus: The replSetGetStatus command (rs.status() from the shell) returns an overview of the replica set's status, detailing the state and configuration of the replica set and statistics about its members. This data can be used to ensure that replication is properly configured and to check the connections between the current host and the other members of the replica set.

Apart from the above tools, MongoDB also provides GUI-based monitoring with Ops Manager and Cloud Manager. These are very efficient and are mostly used in large enterprise environments.
25. Your replica set maintains five copies of the data. Either dc1-01, dc1-02 or dc2-01, dc2-02 may become primary. dc3-01 should never be primary. Clients may read from dc3-01.

Answer» If MongoDB cannot split a chunk that exceeds the specified chunk size or that contains a number of documents exceeding the max, MongoDB labels the chunk as jumbo. If the chunk no longer exceeds the limits, MongoDB clears the jumbo flag when the mongos reloads or rewrites the chunk metadata. In some cases, though, we need to clear the jumbo flag manually.

If the chunk is divisible, MongoDB removes the flag upon a successful split of the chunk. Process:

The output below from sh.status(true) shows that the chunk with shard key range { "x" : 2 } -->> { "x" : 4 } is jumbo:

--- Sharding Status ---
..................
..................
test.foo
  shard key: { "x" : 1 }
  chunks:
    shard-b  2
    shard-a  2
  { "x" : { "$minKey" : 1 } } -->> { "x" : 1 } on : shard-b Timestamp(2, 0)
  { "x" : 1 } -->> { "x" : 2 } on : shard-a Timestamp(3, 1)
  { "x" : 2 } -->> { "x" : 4 } on : shard-a Timestamp(2, 2) jumbo
  { "x" : 4 } -->> { "x" : { "$maxKey" : 1 } } on : shard-b Timestamp(3, 0)

MongoDB removes the jumbo flag upon a successful split of the chunk.

In some instances MongoDB cannot split the no-longer-jumbo chunk, such as a chunk whose range covers a single shard key value, so the preferred method to clear the flag is not applicable. Process:

In the chunks collection of the config database, unset the jumbo flag for the chunk. For example:

db.getSiblingDB("config").chunks.update(
   { ns: "test.foo", min: { x: 2 }, jumbo: true },
   { $unset: { jumbo: "" } }
)

After the jumbo flag has been cleared from the chunks collection, update the cluster routing metadata cache:

db.adminCommand( { flushRouterConfig: "test.foo" } )
26. Suppose we have a five-node replica set distributed across three data centres: dc1, dc2 and dc3. What would be configurations that meet the following requirements:

Answer» Any query on a sharded cluster goes through mongos, which consults the config database metadata about the chunk distribution. These queries fall broadly into 2 groups:

Scatter-gather queries: Scatter-gather queries are the ones that do not include the shard key. With no shard key, mongos does not know which shard to send the query to, so it queries all shards in the cluster. These queries are generally inefficient and are unfeasible for routine operations on large clusters.

Targeted queries: If a query includes the shard key, mongos directs the query only to the specific shards that hold data in the query's shard key range. These queries are very efficient.

In this case we have a query on the shard key with 15000 <= employeeid <= 70000, which covers a subset of the data in the cluster, so it is a targeted query. Any shard holding chunks within this range is queried. From the sample chunk distribution above, the chunks spanning 8122 to 73769 overlap this range, so shards shard0000, shard0002, shard0003, shard0004, shard0005, shard0006 and shard0007 are all accessed; only shard0001 falls entirely outside the range.
27. How can we configure 3 node replica sets in MongoDB?

Answer» MongoDB's WiredTiger storage engine uses both the WiredTiger internal cache and the filesystem cache for storing data. If we do not set the WiredTiger internal cache size, by default it uses the larger of 256MB or 50% of (RAM - 1GB). For example, on a system with a total of 6GB RAM, 2.5GB (50% of (6GB - 1GB)) is allocated to the WiredTiger internal cache. This default assumes there is only one mongod process running; if we have multiple MongoDB instances on the server, we should decrease the WiredTiger internal cache size to accommodate the other instances. WiredTiger also provides compression by default for both collections and indexes: snappy compression is used for collections and prefix compression for indexes. We can set the compression at the database level as well as at the collection and index level. The WiredTiger internal cache differs from the filesystem cache in data representation: data in the internal cache uses a different, uncompressed representation, while the filesystem cache holds data in the same compressed form as the on-disk format. All free memory that is not used by the WiredTiger cache or by any other process is automatically used as the MongoDB filesystem cache.
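A hedged sketch of where the cache is sized in mongod.conf (the value is illustrative; the option names are the documented ones):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2.5        # cap the WiredTiger internal cache
    collectionConfig:
      blockCompressor: snappy # default collection compression
    indexConfig:
      prefixCompression: true # default index compression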
28. Explain different architectural components of MongoDB.

Answer» The MongoDB balancer is a background process that monitors the number of chunks on each shard. When the number of chunks on a given shard reaches specific migration thresholds, the balancer attempts to automatically migrate chunks between shards to reach an equal number of chunks per shard. All chunk migrations follow the same broad procedure: the balancer sends the moveChunk command to the source shard; the source shard copies the chunk's documents to the destination shard, which builds any required indexes; the destination shard catches up on changes made to the chunk during the migration; once fully synchronized, the metadata on the config servers is updated to point the chunk at the destination shard; and finally the source shard deletes its copy of the documents.
29. Why does MongoDB store data in BSON format over JSON?

Answer» Encryption plays a key role in securing any production environment. MongoDB offers encryption at rest as well as transport encryption.

Transport encryption encrypts information over network traffic between the client and the server. MongoDB supports TLS/SSL (Transport Layer Security/Secure Sockets Layer) to encrypt all of MongoDB's network traffic, ensuring that it is only readable by the intended client.

Encryption at rest encrypts the data on disk. This can be achieved either at the storage engine level or at the application level. Application-level encryption is done at the application end and is similar to the masking done earlier in RDBMS systems.

Encrypted Storage Engine: MongoDB Enterprise 3.2 introduces a native encryption option for the WiredTiger storage engine. This allows MongoDB to encrypt data files such that only parties with the decryption key can decode and read the data. The data files are encrypted with a database key that is itself protected by an external master key. The encryption occurs transparently in the storage layer, i.e. all data files are fully encrypted from a file system perspective, and data exists in an unencrypted state only in memory and during transmission.

Application Level Encryption: Application-level encryption provides encryption on a per-field or per-document basis within the application layer. To encrypt document- or field-level data, write custom encryption and decryption routines or use a commercial solution.
30. How does MongoDB text search work on all string fields of a document? Should Compound Text Index be created on all the string fields to achieve this?

Answer» When deploying MongoDB in production, we should have a strategy for capturing and restoring backups in case of data loss events. Below are the different backup options.

MongoDB Atlas backups: MongoDB Atlas, the official MongoDB cloud service, provides fully managed backups, including continuous backups and cloud provider snapshots.

MongoDB Cloud Manager and Ops Manager: These provide backup, monitoring, and automation services for MongoDB. They support backing up and restoring MongoDB replica sets and sharded clusters from a graphical user interface.

Back up by copying underlying data files: MongoDB can also be backed up with operating system features that are not specific to MongoDB. Point-in-time filesystem snapshots can be used for backup if the volume where MongoDB stores its data files supports snapshots. MongoDB deployments can also be backed up using system commands such as cp or rsync if the storage system does not support snapshots; it is recommended to stop all writes to mongod before copying database files, as copying multiple files is not an atomic operation.

mongodump: mongodump is the utility with which we can take a backup of a MongoDB database in BSON file format. The backup files can then be used by the mongorestore utility for restoring to another database. mongodump reads data page by page, hence taking a lot of time, and so is not recommended for large deployments. A usage sketch follows.
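A minimal usage sketch (hosts and paths are illustrative):

# dump the deployment, including the oplog for point-in-time consistency on a replica set
mongodump --host localhost --port 27017 --oplog --out /backups/dump-2019-01-01

# restore the dump elsewhere, replaying the captured oplog
mongorestore --host otherhost --port 27017 --oplogReplay /backups/dump-2019-01-01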
31. Why is it so important to choose the right shard key for sharding?

Answer» There are a few key differences when setting up authentication on a sharded cluster. To set up authentication we should connect to mongos instead of mongod, and clients that want to authenticate to the sharded cluster must do so through mongos. Ensure the sharded cluster has at least two mongos instances available, as the procedure requires restarting each mongos in the cluster; if the cluster has only one mongos instance, this results in downtime while that mongos is offline.

Restart each mongos with a configuration file containing the new security settings as well as all of the configuration settings it previously used:

security:
  transitionToAuth: true
  keyFile: <path-to-keyfile>

Connect to the primary member of each shard replica set and create a user with the db.createUser() method; this user can be used for maintenance activities on the individual shards:

db.createUser({
   user: "admin1",
   pwd: "<password>",
   roles: [ { role: "clusterAdmin", db: "admin" },
            { role: "userAdmin", db: "admin" } ]
});
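A hedged sketch of creating the keyfile itself (the path is illustrative; the same file must be distributed to every member):

openssl rand -base64 756 > /etc/mongodb/keyfile
chmod 400 /etc/mongodb/keyfile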
32. How easy or how difficult is it to maintain an audit trail in MongoDB?

Answer» We can broadly divide MongoDB authentication mechanisms into 2 parts: client/user authentication, which deals with how clients of the database authenticate to MongoDB, and internal authentication, which is how the members of replica sets or sharded clusters authenticate with each other.

For client authentication, SCRAM-SHA-1 and MONGODB-CR are challenge/response mechanisms. From version 3.0, SCRAM-SHA-1 is the default mechanism and has replaced MONGODB-CR.

MongoDB currently supports two internal authentication mechanisms: keyfile authentication, which uses SCRAM-SHA-1, and X.509 authentication. With keyfile authentication, the contents of the keyfile essentially act as a shared password between the members of a replica set or sharded cluster; the same keyfile must be present on each member that talks to the others. X.509 utilizes certificates to authenticate members to one another. While we can use the same certificate on all members, it is recommended to issue a different certificate to each member; this way, if one of the certificates is compromised, we only need to reissue and deploy that one certificate instead of updating the entire cluster. It is important to note that enabling internal authentication, either with X.509 or with keyfile-based authentication, automatically enables client authentication as well.
33. The db.collection.bulkWrite() provides the ability for bulk CRUD operations. During execution, if there is any error from an operation, do the remaining operations get processed?

Answer» To back up a sharded cluster we need to take backups of the config database and of the individual shards.

First, disable the balancer from mongos. If we do not stop the balancer, the backup could duplicate or omit data as chunks migrate while the backup is being recorded:

use config
sh.stopBalancer()

For each shard replica set in the sharded cluster, connect a mongo shell to a secondary member's mongod instance and lock it:

db.fsyncLock()

Connect to a secondary of the config server replica set and run db.fsyncLock() there as well.

Now back up the locked config secondary member. We are using mongodump here, but any other method such as cp or rsync can be used. Once the backup is taken, unlock the member so that it resumes receiving the oplog from the config primary:

mongodump --oplog
db.fsyncUnlock()

Next, back up the locked member of each shard in the same way, unlocking each member afterwards so it resumes receiving the oplog from its shard primary:

mongodump --oplog
db.fsyncUnlock()

Once we have the backups from the config server and each shard, re-enable the balancer from the config database:

use config
sh.setBalancerState(true)
34. Are multi-document transactions possible in MongoDB?

Answer» Shard key selection is an important aspect of a sharded cluster, as it affects the performance and overall efficiency of the cluster. Chunk creation and distribution among the shards is based on the choice of the shard key. Ideally, the shard key should allow MongoDB to distribute documents evenly across all the shards in the cluster. Three main factors affect the selection of the shard key:

Cardinality: Cardinality refers to the number of distinct values for a given shard key; it bounds the maximum number of chunks that can exist in the cluster. Ideally, the shard key should have high cardinality. For example, suppose we have an application used only by members of a particular city and we shard on state: we would have at most one chunk, as both the upper and lower bounds of the chunk would be that one state, and one chunk only allows us one shard. Hence we need to ensure the shard key field has high cardinality. If we cannot find a single field with high cardinality, we can increase the cardinality of the shard key by creating a compound shard key; in the above scenario, a combination of state and name would restore cardinality.

Frequency: Apart from having a large number of different values for the shard key, it is important that each value occurs with even frequency. If certain values occur far more often than others, the load is not distributed equally across the cluster, limiting the ability to scale reads and writes. For example, if the majority of people using an application have the last name 'Jones', the throughput of the application would be constrained by the shard holding those values. Chunks containing such values grow larger and larger and may become jumbo chunks, which cannot be split and therefore reduce the ability to scale horizontally. To address such issues we should choose a good compound shard key; in the above scenario, adding _id as a second field compensates for the high frequency of the first.

Monotonic change: We should avoid shard keys on fields whose values are always increasing or decreasing, for example ObjectId in MongoDB, whose value always increases with each new document. In such a case, all writes go to the chunk holding the upper bound of the key (or, for monotonically decreasing values, to the chunk holding the lower bound). We can still use such a field in the shard key as long as it is not the first field.
35. What are the projection operators $, $elemMatch and $slice used for in MongoDB?

Answer» A MongoDB sharded cluster has 3 components: shards, mongos and config servers. We deploy the components as follows.

Start all the members of the shard replica sets with the --shardsvr option:

mongod --replSet "rs0" --shardsvr
mongod --replSet "rs1" --shardsvr

Suppose we have 2 shards with a 3-member replica set each; all 6 mongod processes should be started with the above option. The shard members are deployed as replica sets on hosts h1 to h6 at port 27017: sh1 (m1, m2, m3 as replica set "rs0") and sh2 (m4, m5, m6 as replica set "rs1").

Start all members of the config servers as a replica set with the --configsvr option:

mongod --configsvr --replSet "cf1"

The config server members c1, c2 and c3 form replica set cf1 on hosts h7, h8 and h9 at port 27017.

Start the mongos, specifying the config server replica set name followed by a slash / and at least one of the config server hostnames and ports. mongos is deployed on server h9 at port 27017:

mongos --configdb cf1/h7:27017,h8:27017,h9:27017

The shards are then registered from mongos, as sketched below.
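A hedged sketch of the final step, run from the mongos shell (replica set names and hosts follow the example above; the collection is illustrative):

// register each shard by its replica set name and one member host
sh.addShard( "rs0/h1:27017" )
sh.addShard( "rs1/h4:27017" )

// enable sharding for a database and shard a collection on a key
sh.enableSharding( "test" )
sh.shardCollection( "test.employee", { "employeeid": 1 } )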
36. How does MongoDB support efficient querying against array fields?

Answer» Big data systems with large data sets or high throughput requirements can challenge the capacity of a single server: a large number of parallel queries can exhaust the CPU capacity of the server, and working sets larger than the system's RAM can cause I/O bottlenecks and disk performance disruption. Such growth is generally handled by either vertical or horizontal scaling.

Vertical scaling: Bottlenecks are handled by increasing the capacity of a single server, adding more RAM, a more powerful CPU or more storage. This works only up to a limit, as even the biggest server has finite RAM, CPU and storage beyond which we cannot add capacity. This scaling method is also very expensive, as big servers cost much more than commodity servers.

Horizontal scaling: Bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU as data is spread out, which also allows high throughput, since resources work in parallel; we also benefit from the comparatively lower cost of commodity servers. MongoDB supports horizontal scaling through sharding: it supports very large data sets and high-throughput operations by distributing data among several machines called shards. A MongoDB sharded cluster consists of the following components:

Shards: Application data in a MongoDB sharded cluster is stored in shards. Each shard has a subset of the collection data, divided on the basis of the shard key we define when sharding a collection. These shards should also be deployed as replica sets. A query performed on a single shard returns only a subset of the data, so applications usually should not connect to individual shards; connections to individual shards should be made by administrators for maintenance purposes.

mongos: In a sharded cluster, applications should connect through mongos, which acts as a query router and the interface between client applications and the sharded cluster. mongos fetches the metadata from the config server about which data is on which shard and caches it, using it to route each query to the appropriate shard. We should have multiple mongos instances for redundancy; they can be deployed either on separate servers or together with the application servers. To reduce latency, it is recommended to deploy them on the application servers. mongos uses minimal server resources and holds no persistent state.

Config servers: All the metadata and configuration settings for the sharded cluster are stored in the config servers: which data is stored in which shard, the number of chunks, and the distribution of shard keys across the cluster. It is recommended to deploy the config server as a replica set. If the config server replica set has no primary at any time, the cluster cannot perform metadata changes and becomes read-only for that period, so the config server replica set should be monitored and maintained just like the application data shards.
37. What is Covered Query and what is the usage of this?

Answer» When the primary of a replica set is not available, a secondary becomes primary. This is done via elections, where the most appropriate member of the replica set is promoted to primary. Apart from primary unavailability, there are a few other situations that trigger elections, such as: adding a new node to the replica set, initiating a replica set, performing maintenance with rs.stepDown() or rs.reconfig(), and the secondary members losing connectivity to the primary for more than the configured timeout.
38. The oplog.rs collection in MongoDB stores the log of operations (in a replica set). This means that this collection should not grow infinitely and there should be a rolling mechanism wherein as new documents (logs) are inserted, the older ones are automatically removed. How does MongoDB achieve this?

Answer» The first requirement eliminates any five-node replica set where one node is an arbiter, as arbiters do not have a copy of the data. The second requirement eliminates setting a priority of 0 for dc1-01, dc1-02, dc2-01 or dc2-02; they can be assigned any positive integer, or the default value of 1, to be electable as primary. As per the third requirement, dc3-01 can never be primary, so its priority has to be set to 0. Finally, as per the fourth requirement, dc3-01 cannot be configured as hidden, as this would prevent clients from reading from that member. So the config below meets all the requirements:

{
  "_id" : "rs0",
  "version" : 1,
  "members" : [
    { "_id" : "dc1-01", "host" : "mongodb0.example.net:27017" },
    { "_id" : "dc1-02", "host" : "mongodb1.example.net:27017" },
    { "_id" : "dc2-01", "host" : "mongodb2.example.net:27017" },
    { "_id" : "dc2-02", "host" : "mongodb3.example.net:27017" },
    { "_id" : "dc3-01", "host" : "mongodb4.example.net:27017", "priority" : 0 }
  ]
}
39. Multiple addresses are stored as an embedded document for a user. Within the UI, each of these addresses need to be shown as an individual document along with all the user details. Should this be done programmatically? Or is there a simpler way to achieve this in MongoDB?

Answer» Suppose we have 3 servers: abc.com, xyz.com and pqr.com.

Start mongod on each server with the --replSet option, which names the replica set; here we use the name rs0. The bind IP is the IP through which the server can be reached from outside.

Log in to server abc.com and run the command mongo, which takes you to the mongo shell. Now initiate the replica set with a configuration listing all 3 members:

rs.initiate( {
   _id : "rs0",
   members: [
      { _id: 0, host: "abc.com:27017" },
      { _id: 1, host: "xyz.com:27017" },
      { _id: 2, host: "pqr.com:27017" }
   ]
})

MongoDB initiates the replica set using the default replica set configuration. To view the replica set configuration:

rs.conf()

To check the status of each member:

rs.status()

The server from which we ran rs.initiate becomes the primary and the other 2 servers become secondaries.
40. What is the significance of the "as" field in $graphLookup?

Answer» First, we have the MongoDB query language. This is the set of instructions and commands that we use to interact with MongoDB. All CRUD operations, and the documents that we send back and forth in MongoDB, are managed by this layer: it translates the incoming BSON wire protocol messages that MongoDB uses to communicate with the client-side application libraries we call drivers into MongoDB operations.

Then we have the MongoDB data model layer. This is the layer responsible for applying all the CRUD operations defined in the MongoDB query language and deciding how they should affect the data structures managed by MongoDB. Management of namespaces, database names and collections, which indexes are defined per namespace, and which interactions need to be performed to respond to incoming requests are all managed here. This is also the layer where the replication mechanism is defined; this is where we define the WriteConcerns and ReadConcerns that applications may require.

Next, we have the storage layer. At this layer we have all the calls that persist data to a physical medium: how data is stored on disk, what kind of files it uses, and what levels of compression, among other settings. MongoDB has several types of storage engines that persist data with different properties depending on how the system is configured; WiredTiger is the default storage engine. All actions regarding flushes to disk, journal commits, compression operations and low-level system access happen at this layer.

For scaling out, shards are themselves replica sets, the highly available unit of deployment, alongside other components such as the mongos query routers and the config servers.
41. How are recursive queries supported within MongoDB?

Answer» BSON is binary JSON. Inside the database there is a need for a binary representation for efficiency. There are 3 major reasons for preferring BSON: it is lightweight, keeping spatial overhead to a minimum; it is traversable, allowing the database to skip over fields it does not need thanks to its linear serialization with length prefixes; and it is efficient to encode and decode, with richer data types (such as dates and binary data) than plain JSON.

Example: In the document below we have a large subdocument named hobbies. Suppose we want to query the field "active" while skipping "hobbies"; we can do so in BSON due to its linear serialization property:

{ _id: "32781", name: "Smith", age: 30, hobbies: { .............................500 KB ..............}, active: "true" }
42. Oracle provides the EXPLAIN PLAN and PostgreSQL provides EXPLAIN ANALYZE, both of these help to understand the query plan chosen by the database, what is the equivalent of this in MongoDB?

Answer» When any text content within a document needs to be searchable, all the string fields of the document can be indexed using the $** wildcard specifier:

db.articles.createIndex( { "$**" : "text" } )

Note: Any new string field added to the document after creating the index will automatically be indexed. When data is huge, wildcard text indexes can have an impact on performance and hence should be used with due consideration.
43. When "fast reads" are a key criteria, what is the best recommended modeling to represent relationships like one-to-one and one-to-many in MongoDB?

Answer» Once selected, the shard key can't be changed later, hence it should be chosen after a lot of consideration. The distribution of the documents of a collection between the cluster's shards is based on the shard key. The effectiveness of the chunk distribution is important for efficient querying and writing of the MongoDB database, and this effectiveness is directly related to the shard key. That is why choosing the right shard key up front is of utmost importance.
44. How can an array within a document be updated with multiple values in a single operation?

Answer» The MongoDB Enterprise version includes auditing capability, and this is fairly easy to set up. Some salient features of auditing in MongoDB: it can record schema (DDL) changes, replica set and sharded cluster events, authentication and authorization events, and optionally CRUD operations; audit events can be written to the console, to syslog, or to a JSON or BSON file; and audit filters let us restrict which events are captured.

Note: Auditing adds performance overhead, and the amount of overhead is determined by a combination of the factors listed above. The specific needs of the application should be taken into account to arrive at the optimal configuration.
45. When using Compound Index in MongoDB, what are the key points to consider when writing queries so that the query plan is able to use this index?

Answer» In the case of an error, whether the remaining operations get processed or not is determined by whether the bulk operation is ordered or unordered. If it is ordered, MongoDB will not process the remaining operations, whereas if it is unordered, MongoDB will continue to process them.

Note: "ordered" is an optional Boolean parameter that can be passed to bulkWrite(); by default it is true.
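A minimal sketch of the unordered behaviour (collection and documents are illustrative):

db.employee.bulkWrite(
   [
      { insertOne: { document: { _id: 1, name: "A" } } },
      { insertOne: { document: { _id: 1, name: "B" } } },  // fails: duplicate _id
      { insertOne: { document: { _id: 2, name: "C" } } }   // still processed because ordered is false
   ],
   { ordered: false }
)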
46. How can we capture slow running queries in MongoDB?

Answer» Starting in version 4.0, multi-document transactions are possible in MongoDB. Prior to this version, atomic operations were possible only on a single document. With embedded documents and arrays, data in the documents is generally denormalized and stored in a single structure, and with this as the recommended data model, MongoDB's single-document atomicity is sufficient for most applications. Multi-document transactions now enable the remaining small percentage of applications that require them (due to related data spread across documents) to let the database handle transactions rather than implementing this programmatically in the application (which can cause performance overheads).

Note: The performance cost is higher for multi-document transactions in most cases, hence they should be used judiciously.
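A hedged sketch of a multi-document transaction from the shell (MongoDB 4.0+ on a replica set; database and collection names are illustrative):

const session = db.getMongo().startSession();
session.startTransaction();
try {
   session.getDatabase("test").orders.insertOne( { item: "abc", qty: 1 } );
   session.getDatabase("test").inventory.updateOne( { item: "abc" }, { $inc: { stock: -1 } } );
   session.commitTransaction();  // both writes become visible together
} catch (e) {
   session.abortTransaction();   // neither write is applied
}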
47. Why _id index in a sharded cluster is recommended to be used as a shard key?

Answer» All three projection operators, i.e. $, $elemMatch and $slice, are used for manipulating arrays: they limit the contents of an array in the query results. For example,

db.startups.find( {}, { skills: { $slice: 2 } } )

selects the first 2 items from the skills array of each document returned.
48. How can we migrate primary shards in the sharded clusters?

Answer» Multikey indexes can be used for supporting efficient querying against array fields. MongoDB creates an index key for each element in the array.

Note: MongoDB automatically creates a multikey index if any indexed field is an array; no separate indication is required.

Consider the startups collection with an array of skills:

{ _id: 1, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }

Multikey indexes allow searching on the values in the skills array:

db.startups.createIndex( { skills : 1 } )

The query db.startups.find( { skills : "AI" } ) will use this index on skills to return the matching document.
49. You are an admin for MongoDB test database having sample collection. How can you export the contents of sample collection into CSV file?

Answer» A query that is able to return its entire result using only an index is called a covered query. This is one of the optimization techniques that can be used for faster retrieval of data. A query can be a covered query only if all the fields in the query filter are part of an index and all the fields returned in the results are in that same index. Since everything is part of the index, there is no need for the query to examine the documents for any information; a covered-query sketch follows.
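A minimal sketch (index and collection names are illustrative):

// index covering both the filter and the projected fields
db.employee.createIndex( { employeeid: 1, name: 1 } )

// covered: filters on employeeid, returns only indexed fields, and excludes _id
db.employee.find( { employeeid: 15000 }, { _id: 0, employeeid: 1, name: 1 } )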
50. find, updateOne, insertOne

Answer» MongoDB supports capped collections, which are fixed-size collections. Once the allocated space is filled up, space is made for new documents by removing (overwriting) the oldest documents. The insertion order is preserved, and if a query does not specify any ordering, the ordering of results is the same as the insertion order. The oplog.rs collection is a capped collection, thus ensuring that the collection of logs does not grow infinitely.
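A hedged sketch of creating a capped collection (names and sizes are illustrative):

// fixed-size collection of about 1 MB; the oldest documents are overwritten first
db.createCollection( "eventlog", { capped: true, size: 1048576 } )

// optionally also cap the number of documents
db.createCollection( "recent", { capped: true, size: 1048576, max: 1000 } )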