MongoDB: The data model design - Part 3

JP
Feb 22, 2015

One of the key issues to address before setting up your MongoDB database is which type of structure to use. MongoDB gives you two primary approaches. You can embed all of the data in a single document, or you can split the data into separate documents and use references to connect them. Now, what you see on the screen is the embedded approach. There is not a single choice you should always make. Which one you choose depends on the application’s needs and performance requirements, so we’ll talk about when to use one approach over the other.

Let’s start by looking at the so-called denormalized approach. This is where you have all the related data within the same document. It has some very strong advantages and a few disadvantages. First of all, an application will need fewer queries. If I use a referenced approach, where related data lives in different documents that are connected together, then to answer a single request I might have to get one document, then get a related document, then another related document. With the denormalized approach, I get a single document, and all my data is in it.

A key decision to make is which structure to use: embedding or references. The decision should be based on application needs and performance requirements.

An example of a document with two embedded sub-documents is as follows:

In the example, the first embedded sub-document section is

phone: “555-555-1111”, Cell: “555-555-222”, email: “jdoe@xyz.com”

The second embedded sub-document section is

Position: { department: “sales”, manager: “vp of sales” }
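To see the two fragments above in context, here is a minimal sketch of what the whole embedded document might look like. It uses a plain JavaScript object rather than a live database. The `contact` field name, the lowercase keys, the numeric `_id`, the in-memory `users` array standing in for a collection, and the `findUser` helper are all assumptions made for illustration. The values are copied from the example as they appear.

```javascript
// In-memory stand-in for a "users" collection (assumption: no real MongoDB here).
const users = [
  {
    _id: 1, // placeholder for an ObjectId
    username: "jdoe",
    // First embedded sub-document ("contact" is an assumed field name).
    contact: {
      phone: "555-555-1111",
      cell: "555-555-222",
      email: "jdoe@xyz.com",
    },
    // Second embedded sub-document.
    position: {
      department: "sales",
      manager: "vp of sales",
    },
  },
];

// Hypothetical helper: one lookup returns the user and all embedded data.
function findUser(username) {
  return users.find((u) => u.username === username);
}

const jdoe = findUser("jdoe");
console.log(jdoe.contact.email, jdoe.position.department);
```

The point of the sketch is that a single lookup hands back the contact and position data along with the user, with no second request.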

Data is “denormalized” when related data is kept within the same document. Applications then need fewer queries.

I should use this model in a few scenarios. The first is when the data has a containing relationship, in other words, when one object just naturally contains others. The case you see on the screen is a pretty good example. John Doe, this user, contains contact information. It’s a pretty clear relationship. That contact information really makes no sense unless it’s associated with an individual user or employee. So in this case the denormalized approach, putting it all together, makes sense. One-to-many relationships, where one item relates to several other things, are another good place to use the denormalized approach. This is going to provide better performance for read operations. As we already indicated, you can make a single query and get back all the data from a single place, instead of going back to different documents. I request and retrieve the data in one single operation.

The denormalized approach uses embedded data models when there are “contains” relationships and one-to-many relationships among data elements. The denormalized approach also provides better performance for read operations. Related data can be updated in a single atomic write operation.
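As a rough illustration of updating related data in one operation, the sketch below applies a single update specification to one embedded document. MongoDB’s `$set` operator accepts dot-notation paths such as `contact.email`. The `applySet` function here is only a hypothetical in-memory imitation of that idea, not the database’s own code, and the `contact` field name and the new values are assumptions.

```javascript
// Hypothetical in-memory imitation of a `$set` update with dot-notation paths.
// It is NOT MongoDB's implementation; it only illustrates that one update
// specification can touch several embedded fields of a single document.
function applySet(doc, fields) {
  for (const [path, value] of Object.entries(fields)) {
    const keys = path.split(".");
    let target = doc;
    // Walk down to the parent of the final key, creating objects as needed.
    for (const key of keys.slice(0, -1)) {
      if (typeof target[key] !== "object" || target[key] === null) {
        target[key] = {};
      }
      target = target[key];
    }
    target[keys[keys.length - 1]] = value;
  }
  return doc;
}

const user = {
  username: "jdoe",
  contact: { phone: "555-555-1111", email: "jdoe@xyz.com" },
  position: { department: "sales", manager: "vp of sales" },
};

// One update specification changes the contact and position data together.
applySet(user, {
  "contact.email": "john.doe@xyz.com", // assumed new value
  "position.department": "marketing", // assumed new value
});

console.log(user.contact.email, user.position.department);
```

Against a real database, the comparable step would be one update call with a `$set` document. Because it targets a single document, MongoDB applies that write atomically.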

I can also update the related data with a single write operation. Since all of the data is contained in a single document, I can do it in one single atomic write operation. However, there are drawbacks. The first problem is that this can cause documents to grow. I may have a great deal of information in a single document, and that will in turn impact write performance. If the documents get large enough, it can also lead to data fragmentation.

Let’s take a look at the so-called normalized approach. Instead of putting all the data in a single document, I have different documents that are connected together with references. So I use reference links to connect the various data points in the various documents. Now, we use this model when embedding would result in lots of duplication of data.

The denormalized approach may cause documents to grow. This may impact write performance and can lead to data fragmentation.

A user document example is as follows:

{ _id: <ObjectId1>, username: “jdoe” }

A department document example is

{ _id: <ObjectId3>, user_id: <ObjectId1>, Name: “Sales”, Office: “Chicago” }

The related line in the user document is

_id: <ObjectId1>,

The related line in the department document is

user_id: <ObjectId1>,

With the normalized approach, related data is referenced using links to other documents. Use normalized models when embedding would result in duplication of data, when many-to-many relationships are required, or when large hierarchical data sets need to be modeled.
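The user and department documents above can be sketched as two separate in-memory collections, which makes the extra lookup visible. This is only an illustration under assumptions. The numeric `_id` values stand in for ObjectIds, the keys are lowercased, and `findOne` is a hypothetical helper written for this sketch, not a driver call.

```javascript
// In-memory stand-ins for two collections (assumption: no real MongoDB here).
const users = [{ _id: 1, username: "jdoe" }];
const departments = [
  { _id: 3, user_id: 1, name: "Sales", office: "Chicago" },
];

let queryCount = 0; // counts lookups to make the extra round trip visible

// Hypothetical helper standing in for a findOne-style query.
function findOne(collection, predicate) {
  queryCount += 1;
  return collection.find(predicate);
}

// First query: fetch the primary document.
const user = findOne(users, (u) => u.username === "jdoe");
// Follow-up query: find the department document that references this user.
const department = findOne(departments, (d) => d.user_id === user._id);

console.log(department.office, queryCount);
```

Two lookups are needed here, where the embedded version needed one. That is the cost paid for keeping the department data in its own document.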

Now, notice I said lots of duplication. Any time you do embedding, there’s likely to be some duplication. But if you think embedding would lead to a great deal of duplication, you might want to go with this normalized approach instead. If instead of a one-to-many relationship you have a many-to-many relationship, then a normalized approach is probably going to be a better solution. The same goes for large hierarchical data sets, in other words, where you have a document that has sub-documents, and perhaps even those sub-documents have sub-documents. Well, it gets very complex, and it might be a better idea to put those sub-documents into separate documents and simply link them with a reference. This is going to provide more flexibility, particularly if I have data that might be shared in multiple places. I put it in a single document and reference it from all of them. The trade-off is that applications then have to use follow-up queries to resolve the links. I get the primary document, and then I look at any references to go get the additional documents.

The normalized approach provides more flexibility, but applications must use “follow-up” queries to resolve the links.
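The flexibility of shared data can be sketched as follows: one department document is stored once and referenced by several users, so a single change is visible through every reference. Note the assumptions. Here each user holds a `department_id`, which is the reverse direction of the `user_id` field in the department document shown earlier. The second user, the new office value, and the `withDepartment` helper are hypothetical.

```javascript
// Assumption: each user holds a reference to a shared department document.
const departments = new Map([
  [3, { _id: 3, name: "Sales", office: "Chicago" }],
]);
const users = [
  { _id: 1, username: "jdoe", department_id: 3 },
  { _id: 2, username: "asmith", department_id: 3 }, // hypothetical second user
];

// Hypothetical helper: one follow-up lookup per user to resolve the link.
function withDepartment(user) {
  return { ...user, department: departments.get(user.department_id) };
}

// The shared data is stored once, so one change is seen through every reference.
departments.get(3).office = "Boston"; // assumed new value

const resolved = users.map(withDepartment);
console.log(resolved.map((u) => `${u.username}: ${u.department.office}`));
```

With embedding, the same change would have to be written into every user document that carries a copy of the department data.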