MongoDB: Schema Planning Tips

MongoDB is often advertised as being “schemaless”. That does not mean you do not need to design your database schema, or that no schema applies to MongoDB at all. It is a good idea to enforce some schema validation when inserting data into collections, for better performance and scalability. Designing the schema can be tedious, yet it can be fun too.

Avoid Growing Documents

By default, MongoDB allows up to 16MB per document. If you expect your documents to keep growing in size, it is advisable to avoid that design because:

  • Continuously growing documents degrade database and I/O performance.
  • A badly designed schema can sometimes cause queries to fail.

Avoid Updating Whole Documents

When you update, try to avoid rewriting the whole document, because MongoDB will rewrite the entire document elsewhere in memory, which degrades write performance. Instead, use field modifiers to update only specific fields in the document. This triggers an in-place update in memory and therefore improves performance.
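For example, a minimal sketch with PyMongo (the connection string, database, collection, and field names are assumptions for illustration):

from pymongo import MongoClient

# Hypothetical connection and collection, for illustration only
client = MongoClient("mongodb://localhost:27017")
people = client.mydb.people

# Avoid: replacing the whole document rewrites it entirely
people.replace_one({"_id": 1}, {"name": "Jane Doe", "city": "Springfield", "age": 31})

# Prefer: a field modifier ($set) updates only the changed field
people.update_one({"_id": 1}, {"$set": {"age": 31}})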

Avoid Application-Level Joins

MongoDB does not support joins the way a relational database does, so we often have to fetch the data from the database and perform the join at the application level. When working with a large amount of data, calling the database several times to gather the necessary data obviously takes more time. Denormalizing the schema makes more sense when your application relies heavily on joins: you can use embedded documents to get all the required data in a single query.

Below is a use case for an embedded document, where you put the addresses in an array inside the Person object.
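A minimal sketch of such a Person document (the field names and values are illustrative assumptions):

# A Person document with its addresses embedded as an array
person = {
    "name": "Jane Doe",
    "addresses": [
        {"type": "home", "street": "12 Elm Street", "city": "Springfield"},
        {"type": "work", "street": "99 Main Street", "city": "Springfield"},
    ],
}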

The advantage of an embedded document is that you do not have to perform a separate query to get the embedded details. The disadvantage is that you have no way to access the embedded details as standalone entities.

Field Names Take Up Space

This matters less for small collections, but when you get up to billions of records, field names significantly affect your storage and index size. Disk space is cheap, but RAM is not.
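For instance, a purely illustrative sketch of trimming verbose field names:

# Long field names are stored again in every single document
verbose = {"date_of_birth": "1990-05-01", "shipping_address": "12 Elm Street"}

# Shorter names save space and RAM across billions of documents
compact = {"dob": "1990-05-01", "ship_addr": "12 Elm Street"}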

Use Proper Indexing

If there is no index on the field you sort by, MongoDB is forced to sort the results without an index. There is a memory limit of 32MB on the total size of all documents involved in the sort operation; if MongoDB hits that limit, it may either produce an error or return an empty dataset, so make sure the sort field is indexed (a sketch follows the list below). It is also important not to add unnecessary indexes, because every index you add must be updated whenever documents in the database are updated. Unnecessary indexes will:

  • degrade database performance.
  • occupy extra space and memory.
  • lead to storage-related problems as their number grows.
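As noted above, the first step is to index the field you sort on. A minimal sketch with PyMongo (the connection string, collection, and field names are assumptions):

from pymongo import MongoClient, ASCENDING

orders = MongoClient("mongodb://localhost:27017").mydb.orders

# Index the field used for sorting so large sorts do not hit the in-memory limit
orders.create_index([("created_at", ASCENDING)])

# This sort can now be served from the index
recent = orders.find().sort("created_at", ASCENDING)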

One more way to optimize index usage is to override the default _id field. Its only purpose is to hold one unique value per document, so if your data already contains a timestamp or another unique id field, you can store that value in _id and save one extra index.
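For example, a sketch under the assumption that an order number is already unique:

from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017").mydb.orders

# Reuse an existing unique value (here, a hypothetical order number) as _id
# instead of indexing it separately alongside an auto-generated ObjectId
orders.insert_one({"_id": "ORD-2021-000042", "total": 99.50})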

If you create an index that contains all the fields your query filters on and all the fields it returns, MongoDB can answer the query from the index alone and never needs to read the documents themselves. This greatly reduces the amount of data that must fit in memory for maximum performance. Such queries are called covered queries.
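A minimal sketch of a covered query with PyMongo (collection and field names are assumptions):

from pymongo import MongoClient, ASCENDING

users = MongoClient("mongodb://localhost:27017").mydb.users

# Compound index covering both the filter field and the returned field
users.create_index([("user_id", ASCENDING), ("status", ASCENDING)])

# The projection excludes _id and returns only indexed fields,
# so MongoDB can satisfy the query from the index alone
cursor = users.find({"user_id": 42}, {"_id": 0, "status": 1})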

Read vs Write Ratio

When designing the schema for any application, it matters whether the application is read-heavy or write-heavy. For example, when building a dashboard that displays time-series data with a constant stream of data loaded into the database, you should design the schema in such a way that it maximizes write throughput. If most of the operations in the application are reads, you should use a denormalized schema to reduce the number of calls to the database.

BSON Data Types

Make sure you define the BSON data types of all fields correctly while designing the schema, because changing the data type of a field forces MongoDB to rewrite the whole document in a new memory space (it can cause a document to be moved).
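A minimal sketch of pinning down BSON types with schema validation in PyMongo (the collection and field names are assumptions):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb

# Create the collection with a validator that fixes the BSON type of each field
db.create_collection(
    "people",
    validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["name", "dob"],
            "properties": {
                "name": {"bsonType": "string"},
                "dob": {"bsonType": "date"},
                "age": {"bsonType": "int"},
            },
        }
    },
)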

Day 42: Importing Data in Python


Importing data from various sources such as:
– flat files, e.g. txt, csv
– files from software, e.g. Excel, SAS, MATLAB files
– relational databases (e.g. SQL Server, MySQL) and NoSQL databases

Reading a text file
Use the open() function to open a connection to a file, passing two parameters:
– filename.
– mode, e.g. 'r' for read, 'w' for write.

Then, assign the result of read() to a variable. Lastly, call close() after the process is done to close the connection to the file.

If you wish to print out the text in the file, use a print() statement.

file = open(filename, mode='r')
text = file.read()
file.close()


The filename can be assigned to a variable instead of being written directly in the open() function's parameter. It is always best practice to close() an open connection.

To avoid forgetting to close a file connection (missing a file.close() at the end of our code), it is advisable to use a context manager, via the with statement.

Some of the tutorials in my previous posts used a context manager to open a file connection and read .csv files. The with statement executes the open() command and creates a context in which you can execute commands while the file is open. Once out of this clause, or context, the file is no longer open, and for this reason with is called a context manager.

What it is doing here is called 'binding' a variable in the context manager construct; within this construct, the variable file will be bound to open(filename, 'r').
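A minimal sketch of the same read done with a context manager:

# The file is bound to the variable `file` only inside this block
filename = 'moby_dick.txt'
with open(filename, 'r') as file:
    text = file.read()

# Once the block ends, the file is closed automatically
print(file.closed)   # True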

Let me share the tutorials I did on DataCamp’s website.

# Open a file: file
file = open('moby_dick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

First, it opens a file using the open() function, passing the filename and read mode. It then reads the file. The first print() statement prints out the contents of the moby_dick text file. The second print() statement returns a Boolean which checks whether the file is closed; in this case, it returns False. Lastly, it closes the file using the close() function and checks the Boolean again. This time, it returns True.

Importing text files line by line
For a larger file, we do not want to print out everything inside it; we may only want to print a few lines of the content. To do this, use the readline() method.

When the file is open, use file.readline() to read one line at a time. See the code below, which uses the with statement and file.readline() to print the first few lines of the file.

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Summary of the day:

  • Importing data from different sources.
  • Reading from a text file using open() and read().
  • Importing text line by line using readline() method.
  • with statement as context manager.
  • close() to close a file.
  • file.closed is an attribute that returns a Boolean indicating whether the file is closed.