# Data Modelling with MongoDB

Though MongoDB is schema-less, there's an implied structure and hence a data model.

When designing a data model, developers should ask what data needs to be stored, what data is likely to be accessed together, how often a piece of data will be accessed, and how fast the data will grow and whether that growth is unbounded. Answers to these questions lead to a data model that's right for each application.

In designing the data model, there's no formal process, algorithm or set of rules. Based on years of experience, MongoDB practitioners have come up with a set of design patterns and matched them to common use cases. These are only guidelines. Ultimately, the developer has to analyze the application and see what fits best.

## Discussion

• Given that MongoDB is schema-less, why do we need data modelling?

By being schema-less, we can easily change the structure in which data is stored. Documents in a collection need not conform to a single rigid structure. The trade-off is that without a schema to validate against, software bugs can creep in. When data is stored in many different ways, application complexity increases. A schema is self-documenting and leads to cleaner code. With sophisticated validations, application code can become simpler.

While MongoDB is schema-less, in reality, data is seldom completely unstructured. Most data has some implied structure, though MongoDB may not validate that structure. Even when data evolves over time, there's some base structure that stays constant. Starting from MongoDB 3.2, document validation is possible.
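
As an illustration, a validation rule can be expressed as a plain document. The sketch below builds a `$jsonSchema` validator (available from MongoDB 3.6) in Python; the collection and field names are hypothetical, and the actual attachment of the validator (shown in a comment) would need a live server and the pymongo driver:

```python
# A $jsonSchema validator document. Since it's plain data, it can be
# built and inspected without connecting to a server.
user_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "email"],
        "properties": {
            "name": {"bsonType": "string", "description": "must be a string"},
            "email": {"bsonType": "string", "pattern": "^.+@.+$"},
            "age": {"bsonType": "int", "minimum": 0},
        },
    }
}

# With pymongo and a live server (hypothetical names, not run here):
#   db.create_collection("users", validator=user_validator)
```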

Here are some specific areas where schema helps:

• When matching on nested document fields, the order of fields matters.
• Data with a complicated structure is easier to understand and process given the schema.
• When using an Object-Document Mapper (ODM), the ODM can benefit from the schema.
• Frequent changes to the document structure without a schema can cause performance issues.

• What's the difference between embedded data models and normalized data models?

While MongoDB is document-oriented, it still allows for relations among collections. This is supported by the $lookup and $graphLookup pipeline stages. When designing a data model, we therefore have to decide if it makes sense to embed information within a document or to keep it in a separate collection.

The embedded data model is applicable when there's a "contains" relationship. Reads are performant when nested documents need to be accessed along with the main document. A document can also be updated in a single atomic operation. To model a one-to-many relationship, an array of documents can be nested. However, MongoDB limits a document's size to 16MB and nesting to at most 100 levels.
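
A sketch of an embedded document, with illustrative field names:

```python
# Embedded ("contains") model: addresses live inside the user document,
# so a single read returns everything and a single update stays atomic.
user = {
    "_id": 1,
    "name": "Alice",
    "addresses": [                      # 1-to-many via an embedded array
        {"city": "Oslo", "zip": "0150"},
        {"city": "Bergen", "zip": "5003"},
    ],
}
```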

The normalized data model uses object references to model relationships between documents. Such a model avoids data duplication: a document can be referenced by many other documents without duplicating its content. Complex many-to-many relationships and large hierarchical datasets are easily modelled. Data can also be referenced across collections.
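
A sketch of the normalized alternative, with a review referencing an item by its `_id`. The `$lookup` stage that resolves the reference server-side is shown as plain data; collection and field names are illustrative:

```python
# Normalized model: the review references the item instead of embedding
# it, so any number of reviews can point at one item without
# duplicating its content.
item = {"_id": "item-42", "name": "Coffee Mug", "price": 9.99}
review = {"_id": "rev-1", "item_id": "item-42", "rating": 5}

# The reference would be resolved in an aggregation with $lookup;
# the stage itself is just a document:
lookup_stage = {
    "$lookup": {
        "from": "items",
        "localField": "item_id",
        "foreignField": "_id",
        "as": "item",
    }
}
```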

• How do I define 1-to-1, 1-to-n and n-to-n relationships in MongoDB?

Consider a User document where the person works for only one company. The company's details can be embedded as a field or sub-document within the User document. This is an example of a 1-to-1 relationship.

A person may have multiple addresses, which is a 1-to-n relationship. This is modelled in MongoDB as an embedded array of documents: an array of addresses within the User document.

Another 1-to-n example is a Product document that contains many Part documents. There could be dozens of parts. It may be necessary to access each part on its own independent of the product. Hence, we might embed into a Product document an array of ObjectID references to the parts. Parts are stored in a separate collection.

Consider a to-do application that assigns tasks to users. A user can have many tasks and a task can be assigned to many users. This is an example of n-to-n relationship. A User document would embed an array of ObjectID references to tasks. A Task document would embed an array of ObjectID references to users.
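
The three relationship shapes above can be sketched as plain documents (field names are illustrative):

```python
# 1-to-1: the company is embedded directly in the user document.
user_1to1 = {"_id": 1, "name": "Alice",
             "company": {"name": "Acme", "city": "Oslo"}}

# 1-to-n with few n: an embedded array of address documents.
user_1ton = {"_id": 2, "name": "Bob",
             "addresses": [{"city": "Oslo"}, {"city": "Bergen"}]}

# n-to-n: both sides hold arrays of ObjectID-style references
# (plain ints stand in for ObjectIDs here).
user_nton = {"_id": 3, "name": "Carol", "task_ids": [101, 102]}
task_nton = {"_id": 101, "title": "Write docs", "user_ids": [3, 4]}
```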

• Could you share essential tips towards MongoDB schema design?

Embed documents unless there's a good reason not to. Objects that need to be accessed on their own or high-cardinality arrays are compelling reasons not to embed.

Arrays model the 1-to-n relationship. If n is few, embed the documents. If n is in the hundreds, embed an array of ObjectID references, called child referencing. If n is in the thousands, store a single ObjectID reference to the parent within each of the n documents, called parent referencing.
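
The two referencing styles as document sketches (names are illustrative):

```python
# Child referencing: the "1" side (product) holds an array of
# references to its parts. Works while the array stays modest in size.
product = {"_id": "prod-1", "name": "Engine",
           "part_ids": ["p-1", "p-2", "p-3"]}

# Parent referencing: when n grows into the thousands, each part (the
# "n" side) instead holds a single reference back to its product, so
# the product document carries no unbounded array.
part = {"_id": "p-1", "name": "Piston", "product_id": "prod-1"}
```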

Don't shun application-level joins. If data is correctly indexed and results are minimally projected, such joins are quite efficient.
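
A minimal application-level join, sketched with plain lists standing in for collections; with a driver, the two sides would come from `find()` calls on indexed fields:

```python
def join_users_companies(users, companies):
    """Stitch each user to its company in application code."""
    by_id = {c["_id"]: c for c in companies}   # index the small side
    return [{**u, "company": by_id.get(u["company_id"])} for u in users]

users = [{"_id": 1, "name": "Alice", "company_id": "c1"}]
companies = [{"_id": "c1", "name": "Acme"}]
joined = join_users_companies(users, companies)
```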

Consider the read-to-write ratio. A high ratio favours denormalization and improves read performance. A field that's updated often is not a good candidate for denormalization since it has to be updated in multiple places.

Analyze your application and its data access patterns. Structure the data to match the application.

• What schema design patterns are available in MongoDB?

We briefly describe each pattern:

• Approximation: Fewer writes and calculations by saving only approximate values.
• Attribute: On large documents, index and query only on a subset of fields.
• Bucket: For streaming data or IoT applications, bucket values to reduce the number of documents. Pre-aggregation (sum, mean) simplifies data access.
• Computed: Avoids repeated computations on reads by doing them at writes or at regular intervals.
• Document Versioning: Allows different versions of documents to coexist.
• Extended Reference: Avoid lots of joins by embedding only frequently accessed fields.
• Outlier: Data model and queries are designed for typical use cases, and not influenced by outliers.
• Pre-Allocation: Reduce memory reallocation and improve performance when document structure is known in advance.
• Polymorphic: Useful when documents are similar but don't have the same structure.
• Schema Versioning: Useful when schema evolves during the application's lifetime. Avoids downtime and technical debt.
• Subset: Useful when only some data is used by application. Smaller dataset will fit into RAM and improve performance.
• Tree: Suited for hierarchical data. Application needs to manage updates to the graph.
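
As a concrete sketch of one of these, the Bucket pattern groups readings for a time window into one document and pre-aggregates values at write time (field names are illustrative):

```python
def add_reading(bucket, timestamp, value):
    """Append a reading to its bucket and update the pre-aggregates."""
    bucket["readings"].append({"t": timestamp, "v": value})
    bucket["count"] += 1
    bucket["sum"] += value
    return bucket

# One bucket document per sensor per hour, instead of one document
# per reading.
bucket = {"sensor_id": "s-1",
          "start": "2021-11-01T00:00", "end": "2021-11-01T01:00",
          "count": 0, "sum": 0.0, "readings": []}
add_reading(bucket, "2021-11-01T00:05", 21.5)
add_reading(bucket, "2021-11-01T00:10", 22.5)
```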

• Could you share an example showing the use of MongoDB schema design patterns?

The example in the figure pertains to an e-commerce application. It shows the use of five design patterns among three collections:

• Schema Versioning: Every collection includes an integer field schema to store the schema version.
• Subset: Items and reviews are stored as separate collections but since top_reviews are frequently accessed from items, these are embedded into items documents. Other reviews are rarely accessed. Another example is staff information embedded into stores collection.
• Computed: Since sum_reviews and num_reviews are frequently accessed, these are pre-computed and stored in reviews. Other examples are the fields tot_rating and num_ratings in the items collection.
• Bucket: Rather than store each review as a separate document, reviews are bucketed into a time window (start_date and end_date).
• Extended Reference: Fields that belong to other collections but are frequently accessed are duplicated for higher read performance and to avoid joins. The fields sold_at in items and items_in_stock in stores are two examples.
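
An items document combining several of these patterns might look as follows; field names follow the figure, values are illustrative:

```python
item = {
    "_id": "item-1",
    "schema": 2,                          # Schema Versioning
    "name": "Espresso Machine",
    "tot_rating": 92, "num_ratings": 20,  # Computed: stored at write time
    "top_reviews": [                      # Subset: only hot reviews embedded
        {"rating": 5, "text": "Great!"},
    ],
    "sold_at": [                          # Extended Reference: store fields
        {"store_id": "st-1", "name": "Downtown"},
    ],
}

# The computed fields make the average rating a cheap read.
avg_rating = item["tot_rating"] / item["num_ratings"]
```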

• What are some schema design anti-patterns in MongoDB?

We could have large arrays stored in documents, some of which are unbounded. Storing lots of data together leads to bloated documents. Storing numerous collections in a database, some of which are unused, is another anti-pattern.

A collection may have many indexes, some of which could be unnecessary. Remove indexes that are rarely used. Remove indexes already covered by another compound index.

Another anti-pattern is storing information in separate collections although it's often accessed together. The $lookup operator is similar to a JOIN in relational databases: it's slow and resource intensive. Instead, it's better to denormalize this data.

A case-insensitive query without a case-insensitive index to cover it is another anti-pattern. We could use $regex with the i option but this doesn't efficiently utilize case-insensitive indexes. Instead, create a case-insensitive index, that is, an index with a collation strength of 1 or 2.

On MongoDB Atlas, Performance Advisor or Data Explorer can spot anti-patterns and warn developers of the same.

## Milestones

Feb
2009

MongoDB 1.0 is released. By August, this version becomes generally available for production environments.

Dec
2015

MongoDB 3.2 is released. Schema validation is now possible during updates and inserts. Validation rules can be specified with the validator option via the db.createCollection() method and collMod command. Validation rules follow the same syntax as query expressions. This release also adds the $lookup pipeline stage to join collections.

Nov
2017

MongoDB 3.6 is released with support for JSON Schema validation. The $jsonSchema operator within a validator expression must be used for this purpose.

Jun
2018

MongoDB 4.0 is released with support for multi-document transactions. However, it's useful to note that due to associated performance costs, developers may benefit from schema redesign, such as a denormalized data model.

Apr
2019

On the MongoDB blog, Coupal and Alger conclude their series of articles titled Building with Patterns with a useful summary. Along with a brief description of each data modelling pattern, they note possible use cases of each pattern.

Jul
2021

MongoDB 5.0 is released. When schema validation fails, detailed explanations are shown.


## Cite As

Devopedia. 2021. "Data Modelling with MongoDB." Version 2, November 14. Accessed 2022-10-09. https://devopedia.org/data-modelling-with-mongodb