Interview Solutions
This section offers curated MongoDB interview questions with detailed answers to sharpen your knowledge and support exam preparation.
1. How can we change the configuration of the replica set?

Answer» Loading every document into RAM means the query is not using an index efficiently and has to fetch documents from disk into RAM. For a query to use an index, the initial match in the find statement should use either the index or an index prefix. The query below filters on the key b, which does not match any existing index or index prefix, so it has to fetch documents from disk:

db.sample.find( { b : 1 } ).sort( { c : 1, a : 1 } )
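A minimal sketch of the index-prefix rule, assuming a hypothetical compound index on the sample collection:

// hypothetical compound index
db.sample.createIndex( { c: 1, a: 1 } )

// can use the index: the filter matches the index prefix { c: 1 }
db.sample.find( { c: 1 } ).sort( { a: 1 } )

// cannot use the index: b is not part of any index prefix
db.sample.find( { b: 1 } ).sort( { c: 1, a: 1 } )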
2. What is the need for Replication in MongoDB and what all kinds of replica members does it support?

Answer» In replication we keep multiple copies of the same data in sync with each other; it is mainly useful for high availability. In sharding, by contrast, we divide the entire dataset into small chunks and distribute them among several servers. Sharding is used where we have some sort of hardware bottleneck or want the benefits of query parallelism. If our dataset is very small, sharding does not provide many advantages, but as the data grows we should move to sharding. One situation where sharding is recommended over replication: breaking the dataset over shards means each shard has more resources available to handle the subset of data it owns, and operations that move data across machines for replication, backups and restores are also faster.
3. How do sharding and replication affect concurrency in MongoDB?

Answer» In some cases, chunks can grow beyond the specified chunk size but cannot undergo a split. The most common scenario is when a chunk represents a single shard key value. Since the chunk cannot split, it continues to grow beyond the chunk size, becoming a jumbo chunk. These jumbo chunks can become a performance bottleneck as they continue to grow, especially if the shard key value occurs with high frequency. The addition of new data or new shards can result in data distribution imbalances within the cluster: a particular shard may acquire more chunks than another shard, or the size of a chunk may grow beyond the configured maximum chunk size. MongoDB ensures a balanced cluster using two processes: chunk splitting and the balancer.
4. What are different index options MongoDB provides?

Answer» Chunk split operations are carried out automatically by the system when an insert operation causes a chunk to exceed the maximum chunk size. The balancer then migrates recently split chunks to other shards. In some cases, however, we may want to pre-split chunks manually. To split chunks manually we can use the split command via the helpers sh.splitFind() and sh.splitAt(). Example: to split the chunk of the employee collection on the employee id field at a value of 713626, the command below should be used:

sh.splitAt( "test.employee", { "employeeid": 713626 } )

We should be careful while pre-splitting chunks, as it can sometimes lead to a collection with chunks of very different sizes.
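A hedged sketch of the difference between the two helpers (the namespace and values follow the example above and are illustrative):

// splits the chunk containing the first document matching the query, at the chunk's median point
sh.splitFind( "test.employee", { "employeeid": 713626 } )

// splits the chunk at exactly the given shard key value
sh.splitAt( "test.employee", { "employeeid": 713626 } )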
5. What are two important properties of MongoDB replication?

Answer» Shard key selection is based on the workload. Since the first query is used 90% of the time, it should drive the selection of the shard key; a combination of fields from that query makes the best shard key. This eliminates options b, c and d. Options a and e both use a subset of fields from the most-used query and either could serve as a shard key, but the option that includes more fields from that query is more suitable.
6. What is the role of the balancer in a sharded cluster environment and how does it work?

Answer» In MongoDB we can read from the primary as well as the secondary members of a replica set. We control this behaviour by defining the read preference, which determines to which replica set member clients route read operations. If we do not specify any read preference, MongoDB reads from the primary by default. There are situations where you would want to reduce the load on your primary by directing applications to read from a secondary. The different MongoDB read preference modes are listed here, with a usage sketch after the list:

primary: This is the default mode. Applications read from the replica set primary.
primaryPreferred: Applications read from the primary, but if the primary member is not available they read from a secondary.
secondary: Applications read only from the secondary members of the replica set.
secondaryPreferred: Applications read from a secondary, but if no secondary member is available they read from the primary.
nearest: Applications read from the member that is nearest to them in terms of network latency, irrespective of whether that member is primary or secondary.
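A minimal sketch of setting a read preference from the mongo shell (the collection name is illustrative):

// route reads to a secondary when one is available, otherwise fall back to the primary
db.getMongo().setReadPref("secondaryPreferred")
db.sample.find( { state: "WA" } )

// the same preference can also be set in the connection string:
// mongodb://host1:27017,host2:27017/?replicaSet=rs0&readPreference=secondaryPreferred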
7. As a MongoDB administrator you are asked to perform hardening of the current MongoDB environment. What all should be implemented?

Answer» Idempotence is the property of certain operations whereby they can be applied multiple times without changing the result beyond the initial application. In MongoDB, oplog entries are idempotent: even if they are applied multiple times, the same output is produced. So if the server goes down and we need to apply oplogs, there will not be any inconsistency; even if a log that was already applied is applied again, the end state of the database does not change. There was also a desire to have the new state of a document be independent of its previous state. For this, all operators that rely on the previous state to determine a new value need to be transformed to record the actual values. For example, if an addition operation results in modifying a value from 21 to 30, the operation is rewritten to set the value 30 on the field; replaying the operation multiple times then produces the same result.
8. What are the different Monitoring utilities available for MongoDB?

Answer» MongoDB applies database operations on the primary and then records the operations in the primary's oplog. The secondary members then copy and apply these operations in an asynchronous process. For each operation there is a separate oplog entry. First, let's check how many rows the query would fetch by changing the delete to a find operation:

db.sample.find( { state : "WA" } )

This returns all the rows where the state is WA:

{ "firstName" : "Arthur", "lastName" : "Aaronson", "state" : "WA", "city" : "Seattle", "likes" : [ "dogs", "cats" ] }
{ "firstName" : "Beth", "lastName" : "Barnes", "state" : "WA", "city" : "Richland", "likes" : [ "forest", "cats" ] }
{ "firstName" : "Dawn", "lastName" : "Davis", "state" : "WA", "city" : "Seattle", "likes" : [ "forest", "mountains" ] }

Ideally a delete would remove all matching rows, but the query says deleteOne. If the query had said deleteMany, all three matching rows would have been deleted and there would be 3 oplog entries, but deleteOne removes only the first matching row. So 1 oplog entry is generated by the provided query.
9. Your production sharded cluster has lots of chunks with a jumbo flag which is hampering application performance. How will you clear the jumbo flag?

Answer» The oplog is the operation log that records all operations that modify the data stored in databases. We can define the oplog size while starting MongoDB by specifying the --oplogSize option. If we do not specify this option, it takes the default value, which for WiredTiger is 5% of free disk space. While the default value is sufficient for most workloads, in some cases we may need to change the oplog size for the replica set. The oplog size is changed in a rolling manner: first on all secondaries, then on the primary member of the replica set. To check the current oplog size:

use local
db.oplog.rs.stats().maxSize
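A hedged sketch of resizing the oplog in place (available from MongoDB 3.6; the size is in megabytes and the value here is illustrative):

// run on each member in turn, secondaries first, then the primary
db.adminCommand( { replSetResizeOplog: 1, size: 16000 } )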
10. db.employee.find({"employeeid" : {"$gte" : 15000, "$lte" : 70000}});

Answer» Every operation on the primary is logged in the operation log known as the oplog. These oplogs are replicated to the secondary members. For a healthy replica set system, it is recommended that all members are in sync with no replication lag. Data is first written on the primary by the applications and then replicated to the secondaries. This synchronization is important to maintain up-to-date copies of data on all members. Synchronization happens in 2 ways: initial sync and continuous replication.
11. Which shards would be involved in answering the following query:

Answer» The first part of the query gives all documents where y >= 10, so we have 2 documents:

d> { "_id" : 4, "x" : 4, "y" : 10 }
e> { "_id" : 5, "x" : 5, "y" : 75 }

The second part of the query updates the value of y for the above 2 documents to 75, but document e already has y : 75, so it is not modified. Finally, only one document is updated by the provided query:

d> { "_id" : 4, "x" : 4, "y" : 10 }
12. shard key: { "employeeid" : 1 }

{ "employeeid" : { "$minKey" : 1 } } -->> { "employeeid" : 8122 } on : shard0000
{ "employeeid" : 8122 } -->> { "employeeid" : 17600 } on : shard0002
{ "employeeid" : 17600 } -->> { "employeeid" : 25851 } on : shard0003
{ "employeeid" : 25851 } -->> { "employeeid" : 35852 } on : shard0004
{ "employeeid" : 35852 } -->> { "employeeid" : 46047 } on : shard0005
{ "employeeid" : 46047 } -->> { "employeeid" : 55450 } on : shard0006
{ "employeeid" : 55450 } -->> { "employeeid" : 64644 } on : shard0007
{ "employeeid" : 64644 } -->> { "employeeid" : 73769 } on : shard0000
{ "employeeid" : 73769 } -->> { "employeeid" : 82950 } on : shard0002
{ "employeeid" : 82950 } -->> { "employeeid" : 91983 } on : shard0001
{ "employeeid" : 91983 } -->> { "employeeid" : { "$maxKey" : 1 } } on : shard0001

Answer» In MongoDB data is stored as JSON documents. These documents can have different sets of fields, with a different data type for each field. For example, a collection can hold a number, a string and an array, each in a different document:

{ "a" : 143 }
{ "name" : "John" }
{ "x" : [1,2,3] }

It is not correct to say MongoDB is schemaless; in fact, schema plays an important role in the design of MongoDB applications. MongoDB has a dynamic schema, with a database structure of collections and indexes. Collections can be created either implicitly or explicitly. Due to this dynamic behaviour of the schema, MongoDB has several advantages over RDBMS systems. Schema migrations become very easy: in traditional systems we had to use the ALTER TABLE command after adding any column, which could result in downtime, whereas in MongoDB such adjustments are transparent and automatic. For example, if we want to add a city field to the people collection, we just add the attribute and resave; in a traditional system we would have to run ALTER TABLE followed by a reorg, which would require downtime.
13. Suppose we have a sharded cluster having a sharded collection employee sharded on key employee id having below chunk distribution:

Answer» In MongoDB we have built-in roles as well as custom roles. Built-in roles have pre-defined access associated with them; we can assign these roles directly to users or groups. To run mongostat we require the serverStatus privilege action on the cluster, and the built-in role clusterMonitor comes with the required access. Custom roles, or user-defined roles, are the ones where we manually define access actions on particular resources. MongoDB provides the method db.createRole() for creating user-defined roles. These roles are created in a specific database, as MongoDB uses the combination of database and role name to uniquely identify a role. We will create a custom role mongostatRole that provides only the privileges needed to run mongostat. First, connect with mongod or mongos to the admin database with a user that has privileges to create roles in the admin as well as other databases:

mongo --port 27017 -u admin -p 'abc***' --authenticationDatabase 'admin'

Now create the desired custom role in the admin database:

use admin
db.createRole({
   role: "mongostatRole",
   privileges: [ { resource: { cluster: true }, actions: [ "serverStatus" ] } ],
   roles: []
})

This role can now be assigned to members of the monitoring team.
14. How can MongoDB wiredTiger internal cache be sized? Also, how does it differ from the filesystem cache?

Answer» The $exists operator matches only those documents that contain the field specified in the query. For the following documents in the employee collection:

{ _id: 1, name: "Jonas", linkedInProfile: null }
{ _id: 2, name: "Williams" }

The query { linkedInProfile: { $exists: true } } will return only the employee "Jonas".
15. In your production environment due to an imbalance in shard data balancer initiates the chunk migration. Explain how chunks will be migrated?

Answer» This can be achieved in MongoDB using the $type operator. A null value, i.e. BSON type null, has the type number 10. Using this type number, only those documents whose value is null can be retrieved. Take the example of the two documents below in the startup collection:

{ _id: 1, name: "XYZ Tech", website: null }
{ _id: 2, name: "ABC Pvt Ltd" }

The query { website : { $type: 10 } } retrieves only those documents where website is null; in the above case that is the startup "XYZ Tech".

Note: The query { website : null }, on the other hand, matches documents where website is null as well as documents where the website field does not exist. For the above collection data, that query returns both startups.
16. What are the different Encryption options MongoDB offers?

Answer»
17. What are the different backup methods MongoDB provides?

Answer» We can change the configuration of the replica set as per the requirements of the application. Configuration changes may include adding a new member, adding an arbiter, removing a member, changing priority or votes for members, or changing a member from a normal secondary to a hidden or delayed member.

To add a new member, first start the mongod process with the --replSet option on the new server, then run:

rs.add( { host: "hostname:port" } )

Once added, the member fetches the data from the primary using initial sync and then stays current through replication.

To add an arbiter:

rs.addArb("hostname:port")

To remove a member:

rs.remove("hostname:port")

As a good practice, shut down the member being removed before running the above command.

To change the replica set configuration:

rs.reconfig(newConfig)

Reconfig is best explained with an example. Suppose we have replica set "rs0" and want to raise the priority of its second member. From the primary:

cfg = rs.conf();
cfg.members[1].priority = 2;
rs.reconfig(cfg);
18. How we can enable keyfile authentication on the existing sharded cluster without downtime?

Answer» MongoDB has replication to provide the high availability and redundancy which are the basis of any production database. With replica sets, we can achieve HA as well as DR capability. Replication also enables horizontal scaling, allowing the use of commodity servers instead of enterprise servers. With proper configuration, replication can prevent downtime even if an entire data centre goes down. There are several types of replica members, chosen based on requirements: regular secondaries that can be elected primary, arbiters that vote in elections but hold no data, hidden members that are invisible to applications, and delayed members that maintain a lagging copy of the data for recovery from mistakes.
19. What are the different authentication mechanism MongoDB supports?

Answer» It is important to maintain data consistency in any database, especially when multiple applications are accessing the same piece of data simultaneously. MongoDB uses locking and other concurrency control measures to ensure consistency. Multiple clients can read and write the same data while MongoDB ensures that every write to a single document either occurs in full or not at all, so that clients never see inconsistent data.

Effect of sharding on concurrency: In sharding, collections are distributed among several shard servers, which improves concurrency. The mongos process routes many operations concurrently to different shards and combines the results before sending them back to the client. In a sharded cluster, locking is at the individual shard level rather than the cluster level, so operations on one shard do not block operations on another; each shard uses its own locks, independent of the other shards in the cluster.

Effect of replication on concurrency: In a MongoDB replica set, each operation on the primary is also written to a special capped collection in the local database called the oplog. So every time an application writes to MongoDB, it locks both databases, i.e. the collection's database and the local database. Both must be locked at the same time to keep the database consistent and to ensure that, even with replication, write operations keep their "all-or-nothing" property.

In MongoDB replication, applications do not write to secondaries; the secondaries receive writes from the primary in the form of the oplog. These oplog entries are not applied serially: they are collected in batches and the batches are applied in parallel, while the write operations are still applied in the same order as they appear in the oplog. While oplog batches are being applied, a secondary does not allow reads of the data being applied, to maintain consistency.
20. How can we perform backup for a sharded cluster?

Answer» Indexes help in improving the performance of queries. Without indexes, a query must perform a collection scan, where each and every document of the collection is scanned for the desired result. With proper indexes, we can limit the number of documents scanned, improving query performance. Like collections, indexes use storage, as they store a small portion of the collection's data. For example, if we create an index on the field 'name', it stores the data for this field in ascending or descending order, which also helps sort operations. Using indexes, we can satisfy equality matches and range-based queries more efficiently. Some of the different index options available in MongoDB are:

Default _id index: By default, MongoDB creates an index on the _id field at the time of creating a collection. This is a unique index and prevents applications from inserting multiple documents with the same value for the _id field. MongoDB ensures that this index cannot be deleted.

Single field and compound indexes: These are indexes on any one field or a combination of fields, i.e. db.records.createIndex( { score: 1 } ) creates an index on the single field "score", and db.products.createIndex( { "item": 1, "stock": 1 } ) creates an index on the combination of "item" and "stock".

Multikey indexes: MongoDB provides the option of creating an index on the contents stored in arrays. For every element of the array, a separate index entry is created. Multikey indexes let us select matching elements of an array more efficiently.

Geospatial indexes: MongoDB also provides geospatial indexes, which help to efficiently query geospatial coordinate data: 2d indexes for planar geometry and 2dsphere indexes for spherical geometry.

Text indexes: To support searching string content in a collection, MongoDB provides text indexes. These indexes only store root words, ignoring language-specific stop words like 'the', 'a', etc.

Partial indexes: To search for a specific filter expression in a collection, partial indexes are used. Since they store only a subset of the documents in a collection, they have lower storage requirements, and index creation, maintenance and performance costs are also lower.

Sparse indexes: Sparse indexes contain entries only for the documents that have the indexed field, skipping documents that lack it.

TTL indexes: Certain applications have requirements where documents need to be removed automatically after a certain amount of time. We can achieve this using TTL indexes: we specify a TTL (time to live) for the documents, after which a background process removes them. This index is ideal for logs, session data and event data, as such data only needs to persist for a limited time. A sketch of a TTL index and a partial index follows this list.
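A hedged sketch of the last two options (collection and field names are illustrative):

// TTL index: documents are removed about 3600 seconds after their createdAt time
db.eventlog.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 3600 } )

// partial index: index only the orders that are still in progress
db.orders.createIndex(
   { orderId: 1 },
   { partialFilterExpression: { status: "in-progress" } }
)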
21. What are the important factors that affect the choice of efficient shard key?

Answer» MongoDB creates oplog entries for each operation on the primary and these are then replicated to the secondaries. MongoDB uses asynchronous replication and automatic failover to perform this efficiently.

Asynchronous replication: Oplog entries from the primary are applied to secondaries asynchronously. This lets applications continue without downtime despite the failure of members. MongoDB deployments usually run on commodity servers, where synchronous replication would mean waiting on acknowledgements with latencies on the order of 100ms, which is quite high; for this reason MongoDB prefers asynchronous replication. From version 4.0.6, MongoDB can log entries for slow oplog operations on the secondary members of a replica set. These slow oplog messages are logged for the secondaries in the diagnostic log under the REPL component. Slow oplog entries do not depend on log levels or the profiling level, only on the slow operation threshold; the profiler does not capture them.

Automatic failover: Many traditional databases follow a master-slave setup, where in case of master failure we have to manually cut over to a slave database. In MongoDB, we can have one primary with multiple secondaries. With few servers we could still afford a manual cutover, but a MongoDB deployment may have 100 shards, making manual cutover impractical, so MongoDB has automatic failover. When the primary is unable to communicate with the other members for more than the configured time (electionTimeoutMillis), an eligible secondary triggers an election to nominate itself as primary. Until the new primary is elected, the cluster cannot serve write requests and can only serve reads; once the new primary is selected, the cluster resumes normal operations. The architecture of the cluster should be designed keeping in mind network latency and the time required for replica sets to complete elections, as these affect how long the cluster runs without a primary.
22. What is the process to setup sharded cluster?

Answer» The balancer is a background process that runs on the primary of the config server replica set in a cluster. It constantly monitors the number of chunks on each shard, and if the number of chunks on a specific shard exceeds the migration threshold, it automatically migrates chunks between shards so that each shard has an equal number of chunks, moving chunks from shards with more chunks to shards with fewer. For example, suppose we have 2 shards [shard01, shard02] with 4 and 5 chunks respectively, and we add another shard [shard03]. Initially shard03 has no chunks; the balancer notices this uneven distribution and migrates chunks from shard01 and shard02 to shard03 until all three shards have three chunks each. There can be a performance impact when the balancer migrates chunks, as migrations carry some overhead in terms of bandwidth and workload. To minimize the impact, the balancer migrates only one chunk at a time and can be restricted to a configured balancing window; common balancer controls are sketched below.

Impact of adding and removing shards on the balancer: Adding or removing a shard creates an imbalance, as either the new shard has no chunks or the removed shard's chunks need to be redistributed throughout the cluster. If a shard is removed from a cluster with uneven chunk distribution, the balancer drains the chunks from the departing shard before balancing the remaining uneven chunks. When the balancer notices such an imbalance, it starts the chunk migration process immediately; the migration process takes time to complete.
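A minimal sketch of the usual balancer controls from the mongos shell:

// check whether the balancer is enabled, and whether a migration is currently running
sh.getBalancerState()
sh.isBalancerRunning()

// disable and re-enable the balancer (e.g., around backups or maintenance)
sh.stopBalancer()
sh.startBalancer()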
23. What is the need for sharding in MongoDB and what are the different components of a sharded cluster?

Answer» Security is very important for any production database. MongoDB documents a set of best practices for hardening a MongoDB deployment, and this list should act as a security checklist before we give a green light to any production deployment. Typical items on that checklist include: enable access control and enforce authentication; configure role-based access control so users have only the privileges they need; encrypt communication with TLS/SSL; encrypt data at rest; limit network exposure by binding to specific interfaces and firewalling ports; enable auditing of system activity; run MongoDB under a dedicated, unprivileged user; and keep MongoDB and its host patched and up to date.
24. What are different factors and conditions affecting elections when a primary replica set goes down?

Answer» Monitoring is a critical component of all database administration. A firm grasp of MongoDB's reporting will allow us to assess the state of the database and maintain the deployment without crisis. Below are some of the utilities used for MongoDB monitoring.

mongostat: The mongostat utility provides a quick overview of the status of a currently running mongod or mongos instance. mongostat is functionally similar to the UNIX/Linux utility vmstat, but provides data on mongod and mongos instances. To run mongostat, the user must have the serverStatus privilege action on the cluster resource. E.g. to run mongostat every 2 minutes, the command below can be used: mongostat 120

mongotop: mongotop provides a method to track the amount of time a mongod instance spends reading and writing data, with statistics on a per-collection level. By default, mongotop returns values every second. E.g. to run mongotop every 30 seconds, the command below can be used: mongotop 30

MongoDB also includes a number of commands that report on the state of the database:

serverStatus: The serverStatus command, or db.serverStatus() from the shell, returns a general overview of the status of the database, detailing disk usage, memory use, connections, journaling, and index access. The command returns quickly and does not impact MongoDB performance.

dbStats: The dbStats command, or db.stats() from the shell, returns a document that addresses storage use and data volumes: the amount of storage used, the quantity of data contained in the database, and object, collection, and index counters. We can use this data to monitor the state and storage capacity of a specific database, compare usage between databases, and determine the average document size in a database.

collStats: The collStats command, or db.collection.stats() from the shell, provides statistics resembling dbStats at the collection level: a count of the objects in the collection, the size of the collection, the amount of disk space used by the collection, and information about its indexes.

replSetGetStatus: The replSetGetStatus command (rs.status() from the shell) returns an overview of the replica set's status, detailing the state and configuration of the replica set and statistics about its members. This data can be used to ensure that replication is properly configured and to check the connections between the current host and the other members of the replica set.

Apart from the above tools, MongoDB also provides GUI-based monitoring with Ops Manager and Cloud Manager. These are very efficient and are mostly used in large enterprise environments.
25. Your replica set maintains five copies of the data. Either dc1-01, dc1-02 or dc2-01, dc2-02 may become primary. dc3-01 should never be primary. Clients may read from dc3-01.

Answer» If MongoDB cannot split a chunk that exceeds the specified chunk size or that contains a number of documents exceeding the max, MongoDB labels the chunk as jumbo. If the chunk no longer exceeds the limits, MongoDB clears the jumbo flag when the mongos reloads or rewrites the chunk metadata. In some cases, though, we need to clear the jumbo flag manually.

If the chunk is divisible, MongoDB removes the flag upon a successful split of the chunk. Process:

The output below from sh.status(true) shows that the chunk with shard key range { "x" : 2 } -->> { "x" : 4 } is jumbo:

--- Sharding Status ---
..................
..................
test.foo
  shard key: { "x" : 1 }
  chunks:
    shard-b  2
    shard-a  2
  { "x" : { "$minKey" : 1 } } -->> { "x" : 1 } on : shard-b Timestamp(2, 0)
  { "x" : 1 } -->> { "x" : 2 } on : shard-a Timestamp(3, 1)
  { "x" : 2 } -->> { "x" : 4 } on : shard-a Timestamp(2, 2) jumbo
  { "x" : 4 } -->> { "x" : { "$maxKey" : 1 } } on : shard-b Timestamp(3, 0)

MongoDB removes the jumbo flag upon a successful split of the chunk.

In some instances MongoDB cannot split the no-longer-jumbo chunk, such as a chunk whose range covers a single shard key value, so the preferred method to clear the flag is not applicable. Process:

In the chunks collection of the config database, unset the jumbo flag for the chunk. For example:

db.getSiblingDB("config").chunks.update(
   { ns: "test.foo", min: { x: 2 }, jumbo: true },
   { $unset: { jumbo: "" } }
)

After the jumbo flag has been cleared from the chunks collection, update the cluster routing metadata cache:

db.adminCommand( { flushRouterConfig: "test.foo" } )
26. Suppose we have a five-node replica set distributed across three data centres: dc1, dc2 and dc3. What would be configurations that meet the following requirements:

Answer» Any query on a sharded cluster goes through mongos, which consults the config database metadata about the chunk distribution. These queries fall broadly into 2 groups:

Scatter-gather queries: Scatter-gather queries are the ones that do not include the shard key. With no shard key, mongos does not know which shard to send the query to, so it queries all shards in the cluster. These queries are generally inefficient and are unfeasible for routine operations on large clusters.

Targeted queries: If a query includes the shard key, mongos directs the query only to the specific shards that hold data in the query's shard key range. These queries are very efficient.

In this case we have a query on the shard key with 15000 <= employeeid <= 70000, which covers a subset of the data in the cluster, so it is a targeted query. Any shard holding chunks within this range is queried. From the sample chunk distribution above, the chunks spanning 8122 to 73769 overlap this range, so shards shard0000, shard0002, shard0003, shard0004, shard0005, shard0006 and shard0007 are all accessed; only shard0001 falls entirely outside the range.
27. How can we configure 3 node replica sets in MongoDB?

Answer» MongoDB's WiredTiger storage engine uses both the WiredTiger internal cache and the filesystem cache for storing data. If we do not set the WiredTiger internal cache size, by default it uses the larger of 256MB or 50% of (RAM - 1GB). For example, on a system with a total of 6GB RAM, 2.5GB (50% of (6GB - 1GB)) is allocated to the WiredTiger internal cache. This default assumes there is only one mongod process running; if we have multiple MongoDB instances on the server, we should decrease the WiredTiger internal cache size to accommodate the other instances. WiredTiger also provides compression by default for both collections and indexes: snappy compression is used for collections and prefix compression for indexes. We can set the compression at the database level as well as at the collection and index level. The WiredTiger internal cache differs from the filesystem cache in data representation: data in the internal cache uses a different, uncompressed representation, while the filesystem cache holds data in the same compressed form as the on-disk format. All free memory that is not used by the WiredTiger cache or by any other process is automatically used as the MongoDB filesystem cache.
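A hedged sketch of where the cache is sized in mongod.conf (the value is illustrative; the option names are the documented ones):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2.5        # cap the WiredTiger internal cache
    collectionConfig:
      blockCompressor: snappy # default collection compression
    indexConfig:
      prefixCompression: true # default index compression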
28. Explain different architectural components of MongoDB.

Answer» The MongoDB balancer is a background process that monitors the number of chunks on each shard. When the number of chunks on a given shard reaches specific migration thresholds, the balancer attempts to automatically migrate chunks between shards to reach an equal number of chunks per shard. All chunk migrations follow the same broad procedure: the balancer sends the moveChunk command to the source shard; the source shard copies the chunk's documents to the destination shard, which builds any required indexes; the destination shard catches up on changes made to the chunk during the migration; once fully synchronized, the metadata on the config servers is updated to point the chunk at the destination shard; and finally the source shard deletes its copy of the documents.
29. Why does MongoDB store data in BSON format over JSON?

Answer» Encryption plays a key role in securing any production environment. MongoDB offers encryption at rest as well as transport encryption.

Transport encryption encrypts information over network traffic between the client and the server. MongoDB supports TLS/SSL (Transport Layer Security/Secure Sockets Layer) to encrypt all of MongoDB's network traffic, ensuring that it is only readable by the intended client.

Encryption at rest encrypts the data on disk. This can be achieved either at the storage engine level or at the application level. Application-level encryption is done at the application end and is similar to the masking done earlier in RDBMS systems.

Encrypted Storage Engine: MongoDB Enterprise 3.2 introduces a native encryption option for the WiredTiger storage engine. This allows MongoDB to encrypt data files such that only parties with the decryption key can decode and read the data. The data files are encrypted with a database key that is itself protected by an external master key. The encryption occurs transparently in the storage layer, i.e. all data files are fully encrypted from a file system perspective, and data exists in an unencrypted state only in memory and during transmission.

Application Level Encryption: Application-level encryption provides encryption on a per-field or per-document basis within the application layer. To encrypt document- or field-level data, write custom encryption and decryption routines or use a commercial solution.
30. How does MongoDB text search work on all string fields of a document? Should Compound Text Index be created on all the string fields to achieve this?

Answer» When deploying MongoDB in production, we should have a strategy for capturing and restoring backups in case of data loss events. Below are the different backup options.

MongoDB Atlas backups: MongoDB Atlas, the official MongoDB cloud service, provides fully managed backups, including continuous backups and cloud provider snapshots.

MongoDB Cloud Manager and Ops Manager: These provide backup, monitoring, and automation services for MongoDB. They support backing up and restoring MongoDB replica sets and sharded clusters from a graphical user interface.

Back up by copying underlying data files: MongoDB can also be backed up with operating system features that are not specific to MongoDB. Point-in-time filesystem snapshots can be used for backup if the volume where MongoDB stores its data files supports snapshots. MongoDB deployments can also be backed up using system commands such as cp or rsync if the storage system does not support snapshots; it is recommended to stop all writes to mongod before copying database files, as copying multiple files is not an atomic operation.

mongodump: mongodump is the utility with which we can take a backup of a MongoDB database in BSON file format. The backup files can then be used by the mongorestore utility for restoring to another database. mongodump reads data page by page, hence taking a lot of time, and so is not recommended for large deployments. A usage sketch follows.
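A minimal usage sketch (hosts and paths are illustrative):

# dump the deployment, including the oplog for point-in-time consistency on a replica set
mongodump --host localhost --port 27017 --oplog --out /backups/dump-2019-01-01

# restore the dump elsewhere, replaying the captured oplog
mongorestore --host otherhost --port 27017 --oplogReplay /backups/dump-2019-01-01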
31. Why is it so important to choose the right shard key for sharding?

Answer» There are a few key differences when setting up authentication on a sharded cluster. To set up authentication we should connect to mongos instead of mongod, and clients that want to authenticate to the sharded cluster must do so through mongos. Ensure the sharded cluster has at least two mongos instances available, as the procedure requires restarting each mongos in the cluster; if the cluster has only one mongos instance, this results in downtime while that mongos is offline.

Restart each mongos with a configuration file containing the new security settings as well as all of the configuration settings it previously used:

security:
  transitionToAuth: true
  keyFile: <path-to-keyfile>

Connect to the primary member of each shard replica set and create a user with the db.createUser() method; this user can be used for maintenance activities on the individual shards:

db.createUser({
   user: "admin1",
   pwd: "<password>",
   roles: [ { role: "clusterAdmin", db: "admin" },
            { role: "userAdmin", db: "admin" } ]
});
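A hedged sketch of creating the keyfile itself (the path is illustrative; the same file must be distributed to every member):

openssl rand -base64 756 > /etc/mongodb/keyfile
chmod 400 /etc/mongodb/keyfile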
32. How easy or how difficult is it to maintain an audit trail in MongoDB?

Answer» We can broadly divide MongoDB authentication mechanisms into 2 parts: client/user authentication, which deals with how clients of the database authenticate to MongoDB, and internal authentication, which is how the members of replica sets or sharded clusters authenticate with each other.

For client authentication, SCRAM-SHA-1 and MONGODB-CR are challenge/response mechanisms. From version 3.0, SCRAM-SHA-1 is the default mechanism and has replaced MONGODB-CR.

MongoDB currently supports two internal authentication mechanisms: keyfile authentication, which uses SCRAM-SHA-1, and X.509 authentication. With keyfile authentication, the contents of the keyfile essentially act as a shared password between the members of a replica set or sharded cluster; the same keyfile must be present on each member that talks to the others. X.509 utilizes certificates to authenticate members to one another. While we can use the same certificate on all members, it is recommended to issue a different certificate to each member; this way, if one of the certificates is compromised, we only need to reissue and deploy that one certificate instead of updating the entire cluster. It is important to note that enabling internal authentication, either with X.509 or with keyfile-based authentication, automatically enables client authentication as well.
33. The db.collection.bulkWrite() provides the ability for bulk CRUD operations. During execution, if there is any error from an operation, do the remaining operations get processed?

Answer» To back up a sharded cluster we need to take backups of the config database and of the individual shards.

First, disable the balancer from mongos. If we do not stop the balancer, the backup could duplicate or omit data as chunks migrate while the backup is being recorded:

use config
sh.stopBalancer()

For each shard replica set in the sharded cluster, connect a mongo shell to a secondary member's mongod instance and lock it:

db.fsyncLock()

Connect to a secondary of the config server replica set and run db.fsyncLock() there as well.

Now back up the locked config secondary member. We are using mongodump here, but any other method such as cp or rsync can be used. Once the backup is taken, unlock the member so that it resumes receiving the oplog from the config primary:

mongodump --oplog
db.fsyncUnlock()

Next, back up the locked member of each shard in the same way, unlocking each member afterwards so it resumes receiving the oplog from its shard primary:

mongodump --oplog
db.fsyncUnlock()

Once we have the backups from the config server and each shard, re-enable the balancer from the config database:

use config
sh.setBalancerState(true)
34. Are multi-document transactions possible in MongoDB?

Answer» Shard key selection is an important aspect of a sharded cluster, as it affects the performance and overall efficiency of the cluster. Chunk creation and distribution among the shards is based on the choice of the shard key. Ideally, the shard key should allow MongoDB to distribute documents evenly across all the shards in the cluster. Three main factors affect the selection of the shard key:

Cardinality: Cardinality refers to the number of distinct values for a given shard key; it bounds the maximum number of chunks that can exist in the cluster. Ideally, the shard key should have high cardinality. For example, suppose we have an application used only by members of a particular city and we shard on state: we would have at most one chunk, as both the upper and lower bounds of the chunk would be that one state, and one chunk only allows us one shard. Hence we need to ensure the shard key field has high cardinality. If we cannot find a single field with high cardinality, we can increase the cardinality of the shard key by creating a compound shard key; in the above scenario, a combination of state and name would restore cardinality.

Frequency: Apart from having a large number of different values for the shard key, it is important that each value occurs with even frequency. If certain values occur far more often than others, the load is not distributed equally across the cluster, limiting the ability to scale reads and writes. For example, if the majority of people using an application have the last name 'Jones', the throughput of the application would be constrained by the shard holding those values. Chunks containing such values grow larger and larger and may become jumbo chunks, which cannot be split and therefore reduce the ability to scale horizontally. To address such issues we should choose a good compound shard key; in the above scenario, adding _id as a second field compensates for the high frequency of the first.

Monotonic change: We should avoid shard keys on fields whose values are always increasing or decreasing, for example ObjectId in MongoDB, whose value always increases with each new document. In such a case, all writes go to the chunk holding the upper bound of the key (or, for monotonically decreasing values, to the chunk holding the lower bound). We can still use such a field in the shard key as long as it is not the first field.
35. What are the projection operators $, $elemMatch and $slice used for in MongoDB?

Answer» A MongoDB sharded cluster has 3 components: shards, mongos and config servers. We deploy the components as follows.

Start all the members of the shard replica sets with the --shardsvr option:

mongod --replSet "rs0" --shardsvr
mongod --replSet "rs1" --shardsvr

Suppose we have 2 shards with a 3-member replica set each; all 6 mongod processes should be started with the above option. The shard members are deployed as replica sets on hosts h1 to h6 at port 27017: sh1 (m1, m2, m3 as replica set "rs0") and sh2 (m4, m5, m6 as replica set "rs1").

Start all members of the config servers as a replica set with the --configsvr option:

mongod --configsvr --replSet "cf1"

The config server members c1, c2 and c3 form replica set cf1 on hosts h7, h8 and h9 at port 27017.

Start the mongos, specifying the config server replica set name followed by a slash / and at least one of the config server hostnames and ports. mongos is deployed on server h9 at port 27017:

mongos --configdb cf1/h7:27017,h8:27017,h9:27017

The shards are then registered from mongos, as sketched below.
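A hedged sketch of the final step, run from the mongos shell (replica set names and hosts follow the example above; the collection is illustrative):

// register each shard by its replica set name and one member host
sh.addShard( "rs0/h1:27017" )
sh.addShard( "rs1/h4:27017" )

// enable sharding for a database and shard a collection on a key
sh.enableSharding( "test" )
sh.shardCollection( "test.employee", { "employeeid": 1 } )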
36. How does MongoDB support efficient querying against array fields?

Answer» Big data systems with large data sets or high throughput requirements can challenge the capacity of a single server: a large number of parallel queries can exhaust the CPU capacity of the server, and working sets larger than the system's RAM can cause I/O bottlenecks and disk performance disruption. Such growth is generally handled by either vertical or horizontal scaling.

Vertical scaling: Bottlenecks are handled by increasing the capacity of a single server, adding more RAM, a more powerful CPU or more storage. This works only up to a limit, as even the biggest server has finite RAM, CPU and storage beyond which we cannot add capacity. This scaling method is also very expensive, as big servers cost much more than commodity servers.

Horizontal scaling: Bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU as data is spread out, which also allows high throughput, since resources work in parallel; we also benefit from the comparatively lower cost of commodity servers. MongoDB supports horizontal scaling through sharding: it supports very large data sets and high-throughput operations by distributing data among several machines called shards. A MongoDB sharded cluster consists of the following components:

Shards: Application data in a MongoDB sharded cluster is stored in shards. Each shard has a subset of the collection data, divided on the basis of the shard key we define when sharding a collection. These shards should also be deployed as replica sets. A query performed on a single shard returns only a subset of the data, so applications usually should not connect to individual shards; connections to individual shards should be made by administrators for maintenance purposes.

mongos: In a sharded cluster, applications should connect through mongos, which acts as a query router and the interface between client applications and the sharded cluster. mongos fetches the metadata from the config server about which data is on which shard and caches it, using it to route each query to the appropriate shard. We should have multiple mongos instances for redundancy; they can be deployed either on separate servers or together with the application servers. To reduce latency, it is recommended to deploy them on the application servers. mongos uses minimal server resources and holds no persistent state.

Config servers: All the metadata and configuration settings for the sharded cluster are stored in the config servers: which data is stored in which shard, the number of chunks, and the distribution of shard keys across the cluster. It is recommended to deploy the config server as a replica set. If the config server replica set has no primary at any time, the cluster cannot perform metadata changes and becomes read-only for that period, so the config server replica set should be monitored and maintained just like the application data shards.
37. What is Covered Query and what is the usage of this?

Answer» When the primary of a replica set is not available, a secondary becomes primary. This is done via elections, where the most appropriate member of the replica set is promoted to primary. Apart from primary unavailability, there are a few other situations that trigger elections, such as: adding a new node to the replica set, initiating a replica set, performing maintenance with rs.stepDown() or rs.reconfig(), and the secondary members losing connectivity to the primary for more than the configured timeout.
38. The oplog.rs collection in MongoDB stores the log of operations (in a replica set). This means that this collection should not grow infinitely and there should be a rolling mechanism wherein as new documents (logs) are inserted, the older ones are automatically removed. How does MongoDB achieve this?

Answer» The first requirement eliminates any five-node replica set where one node is an arbiter, as arbiters do not have a copy of the data. The second requirement eliminates setting a priority of 0 for dc1-01, dc1-02, dc2-01 or dc2-02; they can be assigned any positive integer, or the default value of 1, to be electable as primary. As per the third requirement, dc3-01 can never be primary, so its priority has to be set to 0. Finally, as per the fourth requirement, dc3-01 cannot be configured as hidden, as this would prevent clients from reading from that member. So the config below meets all the requirements:

{
  "_id" : "rs0",
  "version" : 1,
  "members" : [
    { "_id" : "dc1-01", "host" : "mongodb0.example.net:27017" },
    { "_id" : "dc1-02", "host" : "mongodb1.example.net:27017" },
    { "_id" : "dc2-01", "host" : "mongodb2.example.net:27017" },
    { "_id" : "dc2-02", "host" : "mongodb3.example.net:27017" },
    { "_id" : "dc3-01", "host" : "mongodb4.example.net:27017", "priority" : 0 }
  ]
}
39. Multiple addresses are stored as an embedded document for a user. Within the UI, each of these addresses need to be shown as an individual document along with all the user details. Should this be done programmatically? Or is there a simpler way to achieve this in MongoDB?

Answer» Suppose we have 3 servers: abc.com, xyz.com and pqr.com.

Start mongod on each server with the --replSet option, which names the replica set; here we use the name rs0. The bind IP is the IP through which the server can be reached from outside.

Log in to server abc.com and run the command mongo, which takes you to the mongo shell. Now initiate the replica set with a configuration listing all 3 members:

rs.initiate( {
   _id : "rs0",
   members: [
      { _id: 0, host: "abc.com:27017" },
      { _id: 1, host: "xyz.com:27017" },
      { _id: 2, host: "pqr.com:27017" }
   ]
})

MongoDB initiates the replica set using the default replica set configuration. To view the replica set configuration:

rs.conf()

To check the status of each member:

rs.status()

The server from which we ran rs.initiate becomes the primary and the other 2 servers become secondaries.
40. What is the significance of the "as" field in $graphLookup?

Answer» First, we have the MongoDB query language. This is the set of instructions and commands that we use to interact with MongoDB. All CRUD operations, and the documents that we send back and forth in MongoDB, are managed by this layer: it translates the incoming BSON wire protocol messages that MongoDB uses to communicate with the client-side application libraries we call drivers into MongoDB operations.

Then we have the MongoDB data model layer. This is the layer responsible for applying all the CRUD operations defined in the MongoDB query language and deciding how they should affect the data structures managed by MongoDB. Management of namespaces, database names and collections, which indexes are defined per namespace, and which interactions need to be performed to respond to incoming requests are all managed here. This is also the layer where the replication mechanism is defined; this is where we define the WriteConcerns and ReadConcerns that applications may require.

Next, we have the storage layer. At this layer we have all the calls that persist data to a physical medium: how data is stored on disk, what kind of files it uses, and what levels of compression, among other settings. MongoDB has several types of storage engines that persist data with different properties depending on how the system is configured; WiredTiger is the default storage engine. All actions regarding flushes to disk, journal commits, compression operations and low-level system access happen at this layer.

For scaling out, shards are themselves replica sets, the highly available unit of deployment, alongside other components such as the mongos query routers and the config servers.
41. How are recursive queries supported within MongoDB?

Answer» BSON is binary JSON. Inside the database there is a need for a binary representation for efficiency. There are 3 major reasons for preferring BSON: it is lightweight, keeping spatial overhead to a minimum; it is traversable, allowing the database to skip over fields it does not need thanks to its linear serialization with length prefixes; and it is efficient to encode and decode, with richer data types (such as dates and binary data) than plain JSON.

Example: In the document below we have a large subdocument named hobbies. Suppose we want to query the field "active" while skipping "hobbies"; we can do so in BSON due to its linear serialization property:

{ _id: "32781", name: "Smith", age: 30, hobbies: { .............................500 KB ..............}, active: "true" }
42. Oracle provides the EXPLAIN PLAN and PostgreSQL provides EXPLAIN ANALYZE, both of these help to understand the query plan chosen by the database, what is the equivalent of this in MongoDB?

Answer» When any text content within a document needs to be searchable, all the string fields of the document can be indexed using the $** wildcard specifier:

db.articles.createIndex( { "$**" : "text" } )

Note: Any new string field added to the document after creating the index will automatically be indexed. When data is huge, wildcard text indexes can have an impact on performance and hence should be used with due consideration.
43. When "fast reads" are a key criteria, what is the best recommended modeling to represent relationships like one-to-one and one-to-many in MongoDB?

Answer» Once selected, the shard key can't be changed later, hence it should be chosen after a lot of consideration. The distribution of the documents of a collection between the cluster's shards is based on the shard key. The effectiveness of the chunk distribution is important for efficient querying and writing of the MongoDB database, and this effectiveness is directly related to the shard key. That is why choosing the right shard key up front is of utmost importance.
44. How can an array within a document be updated with multiple values in a single operation?

Answer» The MongoDB Enterprise version includes auditing capability, and this is fairly easy to set up. Some salient features of auditing in MongoDB: it can record schema (DDL) changes, replica set and sharded cluster events, authentication and authorization events, and optionally CRUD operations; audit events can be written to the console, to syslog, or to a JSON or BSON file; and audit filters let us restrict which events are captured.

Note: Auditing adds performance overhead, and the amount of overhead is determined by a combination of the factors listed above. The specific needs of the application should be taken into account to arrive at the optimal configuration.
45. When using Compound Index in MongoDB, what are the key points to consider when writing queries so that the query plan is able to use this index?

Answer» In the case of an error, whether the remaining operations get processed or not is determined by whether the bulk operation is ordered or unordered. If it is ordered, MongoDB will not process the remaining operations, whereas if it is unordered, MongoDB will continue to process them.

Note: "ordered" is an optional Boolean parameter that can be passed to bulkWrite(); by default it is true.
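A minimal sketch of the unordered behaviour (collection and documents are illustrative):

db.employee.bulkWrite(
   [
      { insertOne: { document: { _id: 1, name: "A" } } },
      { insertOne: { document: { _id: 1, name: "B" } } },  // fails: duplicate _id
      { insertOne: { document: { _id: 2, name: "C" } } }   // still processed because ordered is false
   ],
   { ordered: false }
)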
46. How can we capture slow running queries in MongoDB?

Answer» Starting in version 4.0, multi-document transactions are possible in MongoDB. Prior to this version, atomic operations were possible only on a single document. With embedded documents and arrays, data in the documents is generally denormalized and stored in a single structure, and with this as the recommended data model, MongoDB's single-document atomicity is sufficient for most applications. Multi-document transactions now enable the remaining small percentage of applications that require them (due to related data spread across documents) to let the database handle transactions rather than implementing this programmatically in the application (which can cause performance overheads).

Note: The performance cost is higher for multi-document transactions in most cases, hence they should be used judiciously.
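A hedged sketch of a multi-document transaction from the shell (MongoDB 4.0+ on a replica set; database and collection names are illustrative):

const session = db.getMongo().startSession();
session.startTransaction();
try {
   session.getDatabase("test").orders.insertOne( { item: "abc", qty: 1 } );
   session.getDatabase("test").inventory.updateOne( { item: "abc" }, { $inc: { stock: -1 } } );
   session.commitTransaction();  // both writes become visible together
} catch (e) {
   session.abortTransaction();   // neither write is applied
}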
47. Why _id index in a sharded cluster is recommended to be used as a shard key?

Answer» All three projection operators, i.e. $, $elemMatch and $slice, are used for manipulating arrays: they limit the contents of an array in the query results. For example,

db.startups.find( {}, { skills: { $slice: 2 } } )

selects the first 2 items from the skills array of each document returned.
48. How can we migrate primary shards in the sharded clusters?

Answer» Multikey indexes can be used for supporting efficient querying against array fields. MongoDB creates an index key for each element in the array.

Note: MongoDB automatically creates a multikey index if any indexed field is an array; no separate indication is required.

Consider the startups collection with an array of skills:

{ _id: 1, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }

Multikey indexes allow searching on the values in the skills array:

db.startups.createIndex( { skills : 1 } )

The query db.startups.find( { skills : "AI" } ) will use this index on skills to return the matching document.
49. You are an admin for MongoDB test database having sample collection. How can you export the contents of sample collection into CSV file?

Answer» A query that is able to return its entire result using only an index is called a covered query. This is one of the optimization techniques that can be used for faster retrieval of data. A query can be a covered query only if all the fields in the query filter are part of an index and all the fields returned in the results are in that same index. Since everything is part of the index, there is no need for the query to examine the documents for any information; a covered-query sketch follows.
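A minimal sketch (index and collection names are illustrative):

// index covering both the filter and the projected fields
db.employee.createIndex( { employeeid: 1, name: 1 } )

// covered: filters on employeeid, returns only indexed fields, and excludes _id
db.employee.find( { employeeid: 15000 }, { _id: 0, employeeid: 1, name: 1 } )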
50. find, updateOne, insertOne

Answer» MongoDB supports capped collections, which are fixed-size collections. Once the allocated space is filled up, space is made for new documents by removing (overwriting) the oldest documents. The insertion order is preserved, and if a query does not specify any ordering, the ordering of results is the same as the insertion order. The oplog.rs collection is a capped collection, thus ensuring that the collection of logs does not grow infinitely.
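A hedged sketch of creating a capped collection (names and sizes are illustrative):

// fixed-size collection of about 1 MB; the oldest documents are overwritten first
db.createCollection( "eventlog", { capped: true, size: 1048576 } )

// optionally also cap the number of documents
db.createCollection( "recent", { capped: true, size: 1048576, max: 1000 } )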