How does MongoDB support efficient querying against array fields?

1.	How does MongoDB support efficient querying against array fields?
Answer» Big data systems with large data sets or high throughput requirements usually challenge the capacity of a single server like a large number of parallel queries can exhaust CPU capacity for the server. Also, larger working sets than the RAM of the system can cause I/O bottleneck and disk performance disruption. Such growth is generally handled either by vertical scaling or horizontal scaling. Vertical Scaling Here bottlenecks are handled by increasing the capacity of a single server by adding more RAM, having a more powerful CPU or adding more storage. This is fine up to a limit as even the biggest server has limits of RAM, CPU and storage as beyond a point we cannot add capacity. Also, this scaling method is very expensive as bigger servers cost must more than commodity servers. Horizontal Scaling Here bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU when data is spread. This also allows having high throughput as we can use parallelism of resources for the same. We also get the benefit of comparatively lower cost due to the use of commodity servers. MongoDB supports horizontal scaling through sharding. It supports very large data sets and high throughput operations with sharding. In sharding data is distributed among several machines called shards. A MongoDB sharded cluster consists of the following components: shard: APPLICATION data in MongoDB sharded cluster is stored in shards. Each shard has a subset of collection data divided on the basis of shard key which we define at the time of creating a collection. These shards can also be deployed as REPLICA sets. If a query is performed at single shard it will return a subset of data. Applications usually should not connect to individual shards. Connections to individual shards should be made by administrators for maintenance purpose. mongos: In a sharded cluster applications should connect through mongos which acts as a query router which acts as an interface between clients’ applications and sharded cluster. Mongos fetches the metadata from CONFIG server regarding what data is on which shard and caches it. This metadata is then used by config server to route the query to appropriate shard. We should have multiple mongos for redundancy and they can either be deployed on a separate server or mixed with application servers. To reduce latency, it is recommended to deploy them on application servers. These mongos utilize minimal server resources and do not have any persistent state. config servers: All the metadata and configuration settings for the sharded cluster are stored in config servers. Metadata shows which data is stored in which shard, number of chunks, and distribution of shard keys across the cluster. It is recommended to deploy config server as a replica set. In case the config server does not have PRIMARY at any time the cluster cannot perform metadata changes and becomes read-only for the time PERIOD so config server replica set should also be monitored and maintained as the application data shards.

Answer»

Big data systems with large data sets or high throughput requirements usually challenge the capacity of a single server like a large number of parallel queries can exhaust CPU capacity for the server. Also, larger working sets than the RAM of the system can cause I/O bottleneck and disk performance disruption. Such growth is generally handled either by vertical scaling or horizontal scaling.

Vertical Scaling

Here bottlenecks are handled by increasing the capacity of a single server by adding more RAM, having a more powerful CPU or adding more storage. This is fine up to a limit as even the biggest server has limits of RAM, CPU and storage as beyond a point we cannot add capacity. Also, this scaling method is very expensive as bigger servers cost must more than commodity servers.

Horizontal Scaling

Here bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU when data is spread. This also allows having high throughput as we can use parallelism of resources for the same. We also get the benefit of comparatively lower cost due to the use of commodity servers.

MongoDB supports horizontal scaling through sharding. It supports very large data sets and high throughput operations with sharding. In sharding data is distributed among several machines called shards.

A MongoDB sharded cluster consists of the following components:

shard:

APPLICATION data in MongoDB sharded cluster is stored in shards. Each shard has a subset of collection data divided on the basis of shard key which we define at the time of creating a collection. These shards can also be deployed as REPLICA sets. If a query is performed at single shard it will return a subset of data. Applications usually should not connect to individual shards. Connections to individual shards should be made by administrators for maintenance purpose.

mongos:

In a sharded cluster applications should connect through mongos which acts as a query router which acts as an interface between clients’ applications and sharded cluster. Mongos fetches the metadata from CONFIG server regarding what data is on which shard and caches it. This metadata is then used by config server to route the query to appropriate shard. We should have multiple mongos for redundancy and they can either be deployed on a separate server or mixed with application servers. To reduce latency, it is recommended to deploy them on application servers. These mongos utilize minimal server resources and do not have any persistent state.

config servers:

All the metadata and configuration settings for the sharded cluster are stored in config servers. Metadata shows which data is stored in which shard, number of chunks, and distribution of shard keys across the cluster. It is recommended to deploy config server as a replica set. In case the config server does not have PRIMARY at any time the cluster cannot perform metadata changes and becomes read-only for the time PERIOD so config server replica set should also be monitored and maintained as the application data shards.

How does MongoDB support efficient querying against array fields?

Discussion

No Comment Found

Related InterviewSolutions

Reply to Comment