MongoDB: Importing CSV Files with mongoimport

Someone asked for my help to upload some .csv files to a MongoDB database and to back up the database before sending it to the next person. I completed the import with the command below. It imports a .csv file into the selected database and collection, specifying the file type, the file location and whether the file has a header line. It works on both Linux and Windows machines using the terminal or command prompt.
mongoimport -d mydb -c things --type csv --file locations.csv --headerline
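
For the backup part of the task, mongodump can dump the database to disk before handing it over. A minimal sketch, assuming the same database name as the import above; the output path is illustrative:

# Dump the mydb database to a dated folder (path is illustrative)
mongodump --db mydb --out /backups/mydb-$(date +%F)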

MongoDB: Enabling SSL for MongoDB

Some time ago, I was asked to work with a team of people to explore TLS/SSL for MongoDB and to use Windows Active Directory users to access MongoDB via LDAP authentication.

It is administration work on MongoDB to enhance security between client and server during data transmission. Default connections to MongoDB servers are not encrypted. It is highly advisable to ensure that all connections to mongod are Transport Layer Security (TLS), also known as SSL, enabled.

The definition of TLS/SSL from Google: TLS and its now-deprecated predecessor, Secure Sockets Layer (SSL), are protocols designed to provide communications security over a computer network.

For a development server, you can enable TLS using a self-signed certificate. OpenSSL allows us to generate the self-signed certificate on the server itself. A Windows machine requires OpenSSL to be installed before the self-signed certificate can be generated, and the certificate then needs to be installed. On a Linux Ubuntu machine, no installation is required for either OpenSSL or the self-signed certificate.
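Below is a minimal sketch of generating a self-signed certificate with OpenSSL; the file names and validity period are illustrative:

# Generate a private key and a self-signed certificate valid for 365 days
openssl req -newkey rsa:2048 -nodes -keyout mongodb.key -x509 -days 365 -out mongodb.crt
# Concatenate the key and the certificate into the .pem file mongod expects
cat mongodb.key mongodb.crt > mongodb.pem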

The SSL mode can be enabled in the MongoDB configuration file (mongod.conf on a Linux machine, mongod.cfg on a Windows machine). You need to specify the path to the .pem file. If you are working on a production server, you are required to include the CAFile path too. CA stands for certificate authority, an entity that issues digital certificates.

net:
   tls:
      mode: requireTLS
      certificateKeyFile: /etc/ssl/mongodb.pem
      CAFile: /etc/ssl/caToValidateClientCertificates.pem
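
With this in place, clients need to connect over TLS too. A hedged sketch of connecting from the mongo shell; the host name and CA path are illustrative, and on versions before 4.2 the flags are spelled --ssl and --sslCAFile instead:

# Connect to a TLS-enabled mongod, validating its certificate against our CA
mongo --tls --host db.example.com --tlsCAFile /etc/ssl/ca.pem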

For more reading about MongoDB TLS/SSL, refer to this document: https://docs.mongodb.com/manual/tutorial/configure-ssl/

Dates in MongoDB

My colleague came to me last Friday and asked for help with an issue he faced in a MongoDB query. He found that when our BI tool inserted the JSON data received via the API into MongoDB, the date and datetime values were saved in string format.

The JSON specification does not specify a format for exchanging dates, which is why there are so many different ways to do it. The best choice is the ISO date format: it is a well-known and widely used format that can be handled across many different languages, making it very well suited for interoperability.

It looks like this:

2012-04-23T18:25:43.511Z

When I took over the JSON data, I found that most of the date and time values were in a plain string format (e.g. 2012-04-23 18:25:43.511), so MongoDB inserted them into the database as the string type.

To understand dates in MongoDB, I found an article that is pretty good at explaining them; here is the link. Instantly, it hit me! Dealing with dates in MongoDB, or in any database, is not an easy job at all; moreover, this time I am dealing with MongoDB, where dates can live in a subdocument or in an array inside a subdocument.

I can use $dateFromString, which converts a date/time string to a date object. For more detail about this operator, you can find it at this link.
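
Below is a hedged sketch of how I use it; the collection and field names are illustrative, and the format option requires MongoDB 4.0+:

// Convert a string field such as "2012-04-23 18:25:43.511" into a Date object
db.events.aggregate([
   { $addFields: {
        created_at: {
           $dateFromString: {
              dateString: "$created_at",        // the string field to convert
              format: "%Y-%m-%d %H:%M:%S.%L"    // matches the incoming layout
           }
        }
   } }
])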

This aggregation operator worked like magic for me when dealing with date and time strings, until I reached a point where the date is in an array inside a subdocument. I faced a roadblock: I am not able to use the same method to do the conversion. It seems quite clear that I have to use $unwind in my query. It deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.

I need a solution where I can use $unwind to flatten the array from the collection and convert each date and time into a date object. I am still looking for one; if you have solved something similar, please give me a helping hand and link me to the solution. Thank you!

Maybe useful: https://stackoverflow.com/questions/38299186/query-that-combines-project-unwind-group
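
Based on the $project-$unwind-$group pattern in that answer, this is the direction I am exploring, sketched with hypothetical collection and field names. Note that the $group here rebuilds only the array; any other fields would need to be carried along, e.g. with $first:

db.orders.aggregate([
   // One output document per array element
   { $unwind: "$items" },
   // Convert the string inside each unwound element
   { $addFields: { "items.purchased_at": {
        $dateFromString: { dateString: "$items.purchased_at" } } } },
   // Rebuild the array, one document per original _id
   { $group: { _id: "$_id", items: { $push: "$items" } } }
])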

Big Data, A Long Introduction

One of my colleagues from Business Operations texted me one morning and asked where she could get insights, understand some of the terminology and the difference between SQL and NoSQL, and how to decide which type of database to use. Instantly, I replied, “get it from me!” I was pretty confident that I could give her an answer and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is a computer language for database management systems and data manipulation. SQL is used to perform insertions, updates and deletions. It allows us to access and modify data. Data is stored in a relational model, with rows and columns. Rows contain all of the information about one specific entry, and columns are the separate data points.

What is NoSQL?

NoSQL encompasses a wide range of database technologies that are designed to cater to the demands of modern apps. It stores a wide range of data types, each with a different data storage model. The main ones are document, graph, key-value and columnar.

Apps such as Facebook, Twitter, web search engines and IoT applications generate huge amounts of data, both structured and unstructured. The best examples of unstructured data are photos and videos. Therefore, a different method is needed to store the data. NoSQL databases do not store data in a rows-and-columns (table) format.

Differences between SQL and NoSQL

There are a lot of websites we can search online that give us the differences; I referred to this website.

NoSQL databases are also known as schema-less databases. The screenshot above uses the term dynamic schema, which means the same thing: there is no fixed schema that locks every data entry into the same number of columns (fields). NoSQL data is allowed to have a different number of columns as data is added, as the sketch below the image credit shows.

Image: https://www.guru99.com/nosql-tutorial.html
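
A minimal illustration in the MongoDB shell, with hypothetical collection and field names: two documents living in the same collection, each with its own set of fields.

// Two documents in one collection with different fields; no ALTER needed
db.customers.insertOne({ name: "Alice", email: "alice@example.com" })
db.customers.insertOne({ name: "Bob", phone: "555-0100", loyalty_points: 42 })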

Another major difference is scalability: SQL databases use vertical scaling, while NoSQL databases use horizontal scaling.

Relational databases are designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing. Often, we look at more RAM, more CPU and more HDD: ways to upsize our system by upgrading our hardware specification. This is scaling up, or vertical scaling. This process is expensive.

NoSQL databases are non-relational, making them easy to scale out, or horizontally scale, meaning they run on multiple servers that work together, each sharing part of the load. This can be done on inexpensive commodity hardware.

Question: SQL or NoSQL?

Referring to this article, the choice between SQL and NoSQL cannot be concluded from their differences alone, but from the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is the preferable database. Conversely, if you have applications where data changes frequently and grows rapidly, as in Big Data analytics, NoSQL is the best option for you. And remember, SQL is not dead and can never be superseded by NoSQL or any other database technology.

In short, it depends on the type of application, the project requirements, and the type of query results needed as well.

Big Data

Big data refers not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are large in both size and complexity, for which algorithms are required to extract useful information. Example sources are search engine data, healthcare data and real-time data. In my previous article, What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

Yes, based on all the above, we have covered two of the three V’s: volume and variety. Velocity is how fast data is generated and processed. There are more V’s out there, though, and some are relevant to Big Data’s description. During my visit to Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to volume and variety. In this blog, I am going to write more.

Storing Big Data

Unstructured data cannot be stored in a normal RDBMS for various reasons, and often Big Data is related to real-time data, with real-time processing requirements.

Hadoop Distributed File System (HDFS)

It provides efficient and reliable storage for big data across many computers. It is one of the popular distributed file systems (DFS), storing both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics in the market at the moment. One of the popular methods of dealing with Big Data is MapReduce: dividing the data up into small chunks and processing each of them individually. In other words, MapReduce spreads the required processing or queries over many computers (many processors).
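
MongoDB itself ships a mapReduce command; the classic example below is a hedged sketch with hypothetical collection and field names, summing order amounts per customer:

db.orders.mapReduce(
   function() { emit(this.cust_id, this.amount); },       // map: one (key, value) pair per order
   function(key, values) { return Array.sum(values); },   // reduce: sum the values for each key
   { out: "order_totals" }                                // write the results to a collection
)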

Big Data is not limited to search engines and healthcare; it can be data from e-commerce websites, where we want to perform targeted advertising and provide recommendation systems, which we often see on websites such as Amazon, Spotify or Netflix.

Big Data Security

Securing a network and the data it holds is a key issue; basic measures such as firewalls and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While the smart home has become a reality in recent years, the successful invention of smart vehicles, which allows vehicles to drive in autonomous mode, gives us great hope that one day the smart city can be realized. Countries such as Singapore, Korea and China, and European countries such as Ireland and the UK, are planning smart cities, using IoT and Big Data management techniques to develop them.

I am looking forward.

Reference:
Dawn E. Holmes (2017), Big Data: A Very Short Introduction.

MongoDB: The Best Way to Work With Data

Relational databases have a long-standing position in most organizations. This made them the default way to think about storing, using and enriching data. However, modern applications present new challenges that stretch the limits of what is possible with a relational database. A relational database uses a tabular data model, storing data across many tables linked by foreign keys due to the need to normalize the data.

Document Model

In contrast, MongoDB uses a document data model and presents data in a single structure, with the related data embedded as sub-documents and arrays. The JSON document below shows how a customer object can be modeled in a single document structure with embedded sub-documents and arrays.
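
Here is an illustrative sketch; the field names are assumed:

// One customer document, with addresses and orders embedded
{
   "name": "Jane Tan",
   "email": "jane@example.com",
   "addresses": [
      { "type": "home", "city": "Singapore" },
      { "type": "office", "city": "Singapore" }
   ],
   "orders": [
      { "order_id": 1001, "total": 59.9 }
   ]
}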

Flexibility: Dynamically Adapting to Changes

The fields of MongoDB documents can vary from document to document within a single collection. There is no need to declare the structure of documents to the system: documents are self-describing. If a new field needs to be added to a document, the field can be added without affecting all other documents in the collection, unlike in relational databases, where we need to run ‘ALTER TABLE’ operations.
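
For example, a minimal sketch in the MongoDB shell with hypothetical names, adding a field to a single document without touching the rest of the collection:

// Only the matched document gains the new field; no schema change required
db.customers.updateOne(
   { name: "Jane Tan" },
   { $set: { loyalty_tier: "gold" } }
)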

Schema Governance

While MongoDB allows a flexible schema, it also provides schema validation within the database, from MongoDB version 3.6 and above. The JSON schema validator allows us to define a fixed schema and validation rules directly in the database, freeing developers from having to take care of it at the application level. With this, we can apply data governance standards to the schema while maintaining the benefits of a flexible document model.

Below is a sample validation rule:

db.createCollection( "people" , {
   validator: { $jsonSchema: {
      bsonType: "object",
      required: [ "name", "surname", "email" ],
      properties: {
         name: {
            bsonType: "string",
            description: "required and must be a string" },
         surname: {
            bsonType: "string",
            description: "required and must be a string" },
         email: {
            bsonType: "string",
            pattern: "^.+\@.+$",
            description: "required and must be a valid email address" },
         year_of_birth: {
            bsonType: "int",
            minimum: 1900,
            maximum: 2018,
            description: "the value must be in the range 1900-2018" },
         gender: {
            enum: [ "M", "F" ],
            description: "can be only M or F" }
      }
   }
}})

So, is it also possible to apply validation rules to existing collections? The answer is that we just need to use the collMod command instead of the createCollection command.

db.runCommand( { collMod: "people3",
   validator: {
      $jsonSchema : {
         bsonType: "object",
         required: [ "name", "surname", "gender" ],
         properties: {
            name: {
               bsonType: "string",
               description: "required and must be a string" },
            surname: {
               bsonType: "string",
               description: "required and must be a string" },
            gender: {
               enum: [ "M", "F" ],
               description: "required and must be M or F" }
         }
      }
   },
   validationLevel: "moderate",
   validationAction: "warn"
})
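
Two parameters at the end are worth noting: validationLevel: "moderate" applies the rules only to inserts and to updates of documents that already satisfy them, leaving existing non-conforming documents alone, while validationAction: "warn" logs a warning on violation but still allows the write. The defaults are "strict" and "error".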

Having a Really Fixed Schema

By default, MongoDB allows additional fields that are not in the validation rules to be inserted into the collection. If we would like to be more restrictive and have a truly fixed schema for the collection, we need to add the following parameter to the validation rule:

additionalProperties: false

The MongoDB script below shows how to use the above parameter.

db.createCollection( "people2" , {
   validator: {
     $jsonSchema: {
        bsonType: "object",
        additionalProperties: false,
        required: ["name","age"],
        properties: {
           _id : {
              bsonType: "objectId" },
           name: {
              bsonType: "string",
              description: "required and must be a string" },
           age: {
              bsonType: "int",
              minimum: 0,
              maximum: 100,
              description: "required and must be in the range 0-100" }
        }
     }
}})

Speed: Great Performance

For most MongoDB queries, there is no need to JOIN multiple records. Should your application require it, MongoDB does provide the equivalent of a JOIN: the $lookup aggregation stage, introduced in version 3.2. For more reading, you can follow this link.
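
A hedged sketch of $lookup, with hypothetical collection and field names, joining each order with its matching customer document:

db.orders.aggregate([
   { $lookup: {
        from: "customers",        // the collection to join with
        localField: "cust_id",    // field in the orders documents
        foreignField: "_id",      // field in the customers documents
        as: "customer"            // output array of matched documents
   } }
])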

I will stop here for now and shall return with more information in my next write-up, or I will continue from this post. Stay tuned.

Database Stability

This is one of the common questions asked, either during a talk or during an interview. Personally, I regard this topic highly, and it is important for every database administrator to pay attention to it.

Slow performance means tasks take longer to complete. If tasks take longer, they are more likely to overlap when multiple users or connections come in at the same time. This leads to frequent locks, deadlocks and resource contention, and eventually to errors and stability issues.

Poor scalability means there are limited options when demand exceeds capacity, such as queuing requests or rejecting requests. Rejecting requests results in errors or unexpected behaviour, and that is instability. Queuing requests leads to reduced performance, putting demands on resources such as CPU and memory. As demand increases further, this leads to further stability issues.

Poor stability affects performance. Partial successes and partial failures must be handled, usually with database rollbacks or manual compensation logic. This places additional resource requirements on the system, whether to perform the rollback or to process the manual compensation logic. And that, in turn, affects scalability.

On the MSDN website, I found that someone shared some important points on designing a database or an application: always consider performance, scalability, and stability when architecting, building, and testing your databases and applications.

MongoDB Indexes

Indexes

Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB performs a collection scan: it scans every document in a collection to select the documents that match the query statement.
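
A quick way to see which of the two happens is explain(); a minimal sketch with illustrative names, where "COLLSCAN" in the winning plan means a full collection scan and "IXSCAN" means an index scan:

// Inspect the query plan and execution statistics
db.things.find( { name: "abc" } ).explain( "executionStats" )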

Default _id Index

As mentioned, MongoDB creates a unique index on the _id field when a collection is created. This index prevents two documents from having the same value for the _id field. MongoDB also supports the creation of user-defined ascending/descending indexes.
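
To see it, list the indexes on a collection; a freshly created collection shows only the default _id index (the collection name is illustrative):

db.things.getIndexes()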

Index Types

  • Single Index – single field.
  • Compound Index – multiple fields. The order of fields in a compound index has significance.
  • Multikey Index – to index the content stored in arrays.
  • Geospatial Index – to support efficient queries of geospatial coordinate data.
  • Text Indexes – provides a text index type that supports searching for string content in a collection.
  • Hashed Indexes – to support hash based sharding.

The syntax to create MongoDB indexes based on the index types above is shown below:

#Single Index
db.collection.createIndex( <key and index type specification>, <options> )
db.collection.createIndex( { name: -1 } )

#Compound Index
db.collection.createIndex( { <field1>: <type>, <field2>: <type2>, ... } )
db.collection.createIndex( { "item": 1, "stock": 1 } )

#Multikey Index is used when any indexed field is an array
db.collection.createIndex( { <field>: < 1 or -1 > } )
db.collection.createIndex( { ratings: 1 } )

#Multikey Index in an embedded document
db.collection.createIndex( { "stock.size": 1, "stock.quantity": 1 } )

#Text Index with keyword "text"
db.collection.createIndex( { <field>: "text" } )
db.collection.createIndex(
   {
     subject: "text",
     comments: "text"
   }
 )

#Hashed Indexes with keyword "hashed"
db.collection.createIndex( { _id: "hashed" } )

Option ‘-1’ creates a single-key descending index, while option ‘1’ creates a single-key ascending index.