Big Data, A Long Introduction

One of my colleagues from Business Operations texted me one morning and asked where she could get insights into database terminology, the difference between SQL and NoSQL, and how to decide which type of database to use. Instantly, I replied, “Get it from me!” I was confident I could give her an answer, and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is a computer language for database management systems and data manipulation. SQL is used to perform insert, update and delete operations, and it allows us to access and modify data. The data is stored in a relational model, with rows and columns: rows contain all of the information about one specific entry, and columns are the separate data points.

What is NoSQL?

NoSQL encompasses a wide range of database technologies designed to cater to the demands of modern apps. It stores a wide range of data types, each with a different data storage model. The main ones are document, graph, key-value and columnar.

Apps such as Facebook and Twitter, web search engines and IoT applications generate huge amounts of data, both structured and unstructured. Photos and videos are the best examples of unstructured data. Such data therefore needs a different storage method: NoSQL databases do not store data in a rows-and-columns (table) format.

Differences between SQL and NoSQL

There are a lot of websites we can search online for the differences; I referred to this website.

NoSQL databases are also known as schema-less databases. The screenshot above uses the term dynamic schema, which means the same thing: there is no fixed schema that locks every data entry to the same number of columns (fields). NoSQL allows records with a different number of fields to be added, as sketched below.

Image: https://www.guru99.com/nosql-tutorial.html
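
As a rough sketch of a dynamic schema in the mongo shell, two documents with different fields can live side by side in the same collection (the customers collection and its fields are purely illustrative):

// Two documents, two different shapes, one collection.
db.customers.insertOne( { name: "Alice", email: "alice@example.com" } )
db.customers.insertOne( { name: "Bob", phone: "555-0100", address: { city: "Singapore" } } )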

Another major difference is scalability: SQL databases scale vertically, while NoSQL databases scale horizontally.

Relational databases are designed to run on a single server in order to maintain the integrity of table mappings and avoid the problems of distributed computing. To upsize such a system, we look at more RAM, more CPU and more HDD, upgrading the hardware specification. This is scaling up, or vertical scaling, and it is an expensive process.

NoSQL databases are non-relational, which makes them easy to scale out, or scale horizontally: they run on multiple servers that work together, each sharing part of the load. This can be done on inexpensive commodity hardware.

Question: SQL or NoSQL?

Referring to this article, the choice between SQL and NoSQL cannot be concluded from their differences alone; it comes down to the project requirements. If your application has a fixed structure and does not need frequent modification, SQL is the preferable database. Conversely, if you have applications where data changes frequently and grows rapidly, as in Big Data analytics, NoSQL is the best option for you. And remember, SQL is not dead and will not be superseded by NoSQL or any other database technology.

In short, it depends on the type of application, the project requirements and the kind of query results needed.

Big Data

Big Data refers not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are so large in both size and complexity that algorithms are required to extract useful information from them. Example sources include search engine data, healthcare data and real-time data. In my previous article, What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. The amount of data from myriad sources.
  • Variety of data. The types of data: structured, semi-structured and unstructured.
  • Velocity of data. The speed at which Big Data is generated.

Based on all the above, we have covered two of the three V’s: volume and variety. Velocity is how fast data is generated and processed. There are more V’s out there, and some are relevant to describing Big Data. During my visit to Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to volume and variety. In this blog, I am going to write more.

Storing Big Data

Unstructured data cannot be stored in a normal RDBMS, and Big Data is often real-time data with real-time processing requirements, so it needs different storage systems.

Hadoop Distributed File System (HDFS)

HDFS provides efficient and reliable storage for Big Data across many computers. It is one of the popular distributed file systems (DFS), storing both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics on the market at the moment. One of the popular methods for dealing with Big Data is MapReduce: the data is divided into small chunks, and each chunk is processed individually. In other words, MapReduce spreads the required processing or queries over many computers (many processors).
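
As a minimal sketch of the idea in the mongo shell, MongoDB’s own mapReduce command splits the work into a map phase and a reduce phase (the orders collection and its cust_id and amount fields are assumptions for illustration):

// Map: emit one (customer, amount) pair per order document.
var mapFn = function() { emit(this.cust_id, this.amount); };
// Reduce: sum all the amounts emitted for the same customer.
var reduceFn = function(key, values) { return Array.sum(values); };
// Write the per-customer totals to the "order_totals" collection.
db.orders.mapReduce(mapFn, reduceFn, { out: "order_totals" })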

Big Data is not limited to search engines and healthcare; it can also be data from e-commerce websites, where we want to perform targeted advertising and provide the recommendation systems we often see on websites such as Amazon, Spotify or Netflix.

Big Data Security

Securing a network and the data it holds is a key issue. Basic measures such as firewalls and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While the smart home has become a reality in recent years, the successful invention of smart vehicles that can drive themselves gives us great hope that one day the smart city can be realized. Countries such as Singapore, Korea and China, and European countries such as Ireland and the UK, are planning smart cities, using IoT implementations and Big Data management techniques to develop them.

I am looking forward to it.

Reference:
Dawn E. Holmes (2017). Big Data: A Very Short Introduction.

MongoDB: The Best Way to Work With Data

Relational databases have a long-standing position in most organizations. This has made them the default way to think about storing, using and enriching data. However, modern applications present new challenges that stretch the limits of what is possible with a relational database. A relational database uses a tabular data model, storing data across many tables linked by foreign keys, since the data must be normalized.

Document Model

In contrast, MongoDB uses a document data model and presents data in a single structure, with the related data embedded as sub-documents and arrays. The JSON document below shows how a customer object might be modeled in a single document structure with embedded sub-documents and arrays.
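
A minimal sketch of such a customer document (the field names are illustrative, not a prescribed model):

{
   "name": "Jane Tan",
   "email": "jane@example.com",
   "addresses": [
      { "type": "home", "city": "Singapore", "postcode": "123456" },
      { "type": "office", "city": "Singapore", "postcode": "654321" }
   ],
   "orders": [
      { "order_id": 1001, "total": 59.90 },
      { "order_id": 1002, "total": 12.50 }
   ]
}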

Flexibility: Dynamically Adapting to Changes

Fields can vary from document to document within a single MongoDB collection. There is no need to declare the structure of documents to the system; documents are self-describing. If a new field needs to be added to a document, it can be added without affecting all the other documents in the collection, unlike in a relational database, where we would need to run an ‘ALTER TABLE’ operation.
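
A minimal sketch (collection and field names assumed): adding a brand-new field to one document leaves every other document untouched, with no ALTER TABLE step required.

// Add a field that no other document in the collection has.
db.customers.updateOne(
   { name: "Jane Tan" },
   { $set: { loyalty_tier: "gold" } }
)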

Schema Governance

While MongoDB allows a flexible schema, it also provides schema validation within the database, from MongoDB version 3.6 onwards. The JSON schema validator allows us to define a fixed schema and validation rules directly in the database, freeing developers from having to enforce them at the application level. With this, we can apply data governance standards to the schema while maintaining the benefits of a flexible document model.

Below is a sample validation rule:

db.createCollection( "people" , {
   validator: { $jsonSchema: {
      bsonType: "object",
      required: [ "name", "surname", "email" ],
      properties: {
         name: {
            bsonType: "string",
            description: "required and must be a string" },
         surname: {
            bsonType: "string",
            description: "required and must be a string" },
         email: {
            bsonType: "string",
            pattern: "^.+\@.+$",
            description: "required and must be a valid email address" },
         year_of_birth: {
            bsonType: "int",
            minimum: 1900,
            maximum: 2018,
            description: "the value must be in the range 1900-2018" },
         gender: {
            enum: [ "M", "F" ],
            description: "can be only M or F" }
      }
   }
}})

So, is it also possible to apply validation rules to existing collections? The answer is yes; we just need to use the collMod command instead of the createCollection command.

db.runCommand( { collMod: "people3",
   validator: {
      $jsonSchema : {
         bsonType: "object",
         required: [ "name", "surname", "gender" ],
         properties: {
            name: {
               bsonType: "string",
               description: "required and must be a string" },
            surname: {
               bsonType: "string",
               description: "required and must be a string" },
            gender: {
               enum: [ "M", "F" ],
               description: "required and must be M or F" }
         }
       }
},
validationLevel: "moderate",   // apply the rules only to inserts and to updates of documents that already pass validation
validationAction: "warn"       // log violations instead of rejecting the write
})

Having a Really Fixed Schema

By default, MongoDB allows fields that are not in the validation rules to be inserted into the collection. If we would like to be more restrictive and have a really fixed schema for the collection, we need to add the following parameter to the validation rule:

additionalProperties: false

The MongoDB script below shows how to use this parameter.

db.createCollection( "people2" , {
   validator: {
     $jsonSchema: {
        bsonType: "object",
        additionalProperties: false,
		required: ["name","age"],
        properties: {
           _id : {
              bsonType: "objectId" },
           name: {
              bsonType: "string",
              description: "required and must be a string" },
           age: {
              bsonType: "int",
              minimum: 0,
              maximum: 100,
              description: "required and must be in the range 0-100" }
        }
     }
}})

Speed: Great Performance

For most MongoDB queries, there is no need to JOIN multiple records. Should your application require it, MongoDB does provide the equivalent of a JOIN: the $lookup aggregation stage, introduced in version 3.2. For more reading, you can refer to this link.
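
A minimal sketch of $lookup in an aggregation pipeline (the orders and customers collections and their fields are assumptions for illustration):

db.orders.aggregate([
   { $lookup: {
        from: "customers",        // collection to join with
        localField: "cust_id",    // field from the orders documents
        foreignField: "_id",      // field from the customers documents
        as: "customer"            // output array holding the matches
   } }
])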

I will stop here for now and shall return with more information in my next write-up, or continue from this post. Stay tuned.

Database Stability

This is one of the common questions asked during talks and interviews. Personally, I regard this topic highly; it is important for every database administrator to pay attention to it.

Slow performance means tasks take longer to complete. If tasks take longer, they are more likely to overlap when there are multiple users or connections at the same time. This leads to frequent locks, deadlocks and resource contention, and eventually to errors and stability issues.

Poor scalability means the system has limited options when demand exceeds capacity: it can queue requests or reject them. Rejecting requests results in errors or unexpected behaviour, which is instability. Queuing requests leads to reduced performance and puts demands on resources such as CPU and memory. As demand increases, this leads to further stability issues.

Poor stability affects performance. Partial successes and partial failures must be handled, usually with database rollbacks or manual compensation logic. Either one places additional resource requirements on the system, and that in turn affects scalability.

On the MSDN website, I found that someone shared an important point about designing a database or an application: always consider performance, scalability, and stability when architecting, building, and testing your databases and applications.

MongoDB Indexes

Indexes

Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB performs a collection scan: it scans every document in a collection to select those that match the query statement.

Default _id Index

As mentioned, MongoDB creates a unique index on the _id field when a collection is created. This index prevents two documents from having the same value for the _id field. MongoDB also supports the creation of user-defined ascending/descending indexes.

Index Types

  • Single Index – a single field.
  • Compound Index – multiple fields. The order of fields in a compound index is significant.
  • Multikey Index – indexes the content stored in arrays.
  • Geospatial Index – supports efficient queries of geospatial coordinate data.
  • Text Index – supports searching for string content in a collection.
  • Hashed Index – supports hash-based sharding.

The syntax to create MongoDB indexes based on the index types above is shown below:

#Single Index
db.collection.createIndex( <key and index type specification>, <options> )
db.collection.createIndex( { name: -1 } )

#Compound Index
db.collection.createIndex( { <field1>: <type>, <field2>: <type2>, ... } )
db.collection.createIndex( { "item": 1, "stock": 1 } )

#Multikey Index is used when any indexed field is an array
db.collection.createIndex( { <field>: < 1 or -1 > } )
db.collection.createIndex( { ratings: 1 } )

#Multikey Index on fields in embedded documents
db.collection.createIndex( { "stock.size": 1, "stock.quantity": 1 } )

#Text Index with keyword "text"
db.collection.createIndex( { <field>: "text" } )
db.collection.createIndex(
   {
     subject: "text",
     comments: "text"
   }
 )

#Hashed Index with keyword "hashed"
db.collection.createIndex( { _id: "hashed" } )

Option ‘-1’ creates a descending index on the key, while option ‘1’ creates an ascending index.
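
To verify which indexes exist and whether a query actually uses one, the shell provides getIndexes() and explain(); the collection and field below are assumptions:

// List all indexes on the collection.
db.collection.getIndexes()
// Inspect the query plan: an IXSCAN stage means an index was used,
// while a COLLSCAN stage means a full collection scan.
db.collection.find( { name: "abc" } ).explain( "executionStats" )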

MongoDB: Schema Planning Tips

MongoDB is advertised for its ability to be “schemaless”. That does not mean you do not need to design your database schema, or that no database schema applies to MongoDB. It is a good idea to enforce some schema validation during data insertion into the collections for better performance and scalability. Designing the schema can be tedious, yet it can be fun too.

Avoid Growing Documents

By default, MongoDB allows a maximum of 16MB per document. It is advisable not to let your documents grow in size continuously, because:

  • It can lead to degradation of database and I/O performance.
  • A badly designed schema can sometimes lead to failing queries.

Avoid Updating Whole Documents

When you perform an update, try to avoid updating the whole document, because MongoDB will rewrite the whole document elsewhere in memory, which degrades write performance. Instead, use field modifiers to update only specific fields in the documents. This triggers an in-place update in memory and improves performance.
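
A hedged sketch of the difference (the people collection and its fields are assumptions):

// Rewrites the entire document:
db.people.replaceOne( { _id: 1 }, { name: "Jane", age: 31, city: "Singapore" } )
// Field modifier: touches only the field that changed.
db.people.updateOne( { _id: 1 }, { $set: { age: 31 } } )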

Avoid Application-Level Joins

As MongoDB does not support server-level joins the way relational databases do, we have to get all the data from the database and then perform the join at the application level. If we are working with a large amount of data, calling the database several times to get the necessary data obviously requires more time. Denormalizing the schema makes more sense when your application relies heavily on joins: you can use embedded documents to get all the required data in a single query.

Below is a use case for an embedded document, where you put the addresses in an array inside the Person object.
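
A minimal sketch in the mongo shell (the field names are illustrative):

db.person.insertOne( {
   name: "Jane Tan",
   addresses: [
      { type: "home", street: "1 Example Road", city: "Singapore" },
      { type: "office", street: "2 Sample Street", city: "Singapore" }
   ]
} )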

The advantage of an embedded document is that you do not have to perform a separate query to get the embedded details. The disadvantage is that you have no way to access the embedded details as standalone entities.

Field Names Take Up Space

This is less important, but when you get up to billions of records, field name length significantly affects your index and storage size. Disk space is cheap, but RAM is not.

Use Proper Indexing

If there is no index on the field being sorted, MongoDB is forced to sort without an index. There is a 32MB memory limit on the total size of all documents involved in the sort operation; if MongoDB hits that limit, it may either produce an error or return an empty dataset. It is also important not to add unnecessary indexes, because every index you add must be maintained whenever documents are updated. Unnecessary indexes:

  • degrade database performance.
  • occupy space and memory.
  • can lead to storage-related problems as their number grows.

One more way to optimize index usage is to override the default _id field. The only purpose of this field is to keep one unique field per document, so if your data already contains a timestamp or an ID field, you can store it in _id and save one extra index.
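
For example, a hedged sketch in which a naturally unique value replaces the auto-generated ObjectId (the collection and field are illustrative):

// Use an existing unique value as _id instead of a generated ObjectId,
// saving the extra index a duplicated field would otherwise need.
db.readings.insertOne( { _id: "sensor-42-2018-12-01T00:00:00Z", value: 23.1 } )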

If you create an index that contains all the fields you would query and all the fields that the query returns, MongoDB never needs to read the underlying data, because everything is contained within the index. This significantly reduces the need to fit all data into memory for maximum performance. These are called covered queries.
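
A minimal sketch of a covered query (the users collection and fields are assumptions). The projection excludes _id so the query can be answered entirely from the index:

// The index contains both the filtered field and the returned fields.
db.users.createIndex( { email: 1, name: 1 } )
// Filter and projection use only indexed fields, so no documents are read.
db.users.find( { email: "jane@example.com" }, { _id: 0, email: 1, name: 1 } )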

Read vs Write Ratio

When designing a schema for any application, it matters whether the application is read-heavy or write-heavy. For example, if we build a dashboard to display time-series data with a constant stream of data loading into the database, we should design the schema to maximize write throughput. If most of the operations in the application are reads, we should use a denormalized schema to reduce the number of calls to the database.

BSON Data Types

Make sure you define the BSON data types of all fields correctly while designing the schema, because if you change the data type of a field, MongoDB will rewrite the whole document in a new memory space (which can cause the document to be moved).

SysOps By Trials and Errors

During a discussion with my big boss and two other colleagues, he mentioned that the MongoDB sales account manager had not come back to him with the quotation for MongoDB Enterprise. There is a project that may need to use the enterprise version for security reasons. He then suggested that I try out Percona Server for MongoDB.

According to Percona’s website, “Percona Server for MongoDB is our free and open-source drop-in replacement for MongoDB Community Edition. It offers all the features and benefits of MongoDB Community Edition, plus additional enterprise-grade functionality.”

Below is the feature comparison made by Percona, available on their website.

Since I was curious too, on Friday I started setting up a virtual machine on my Linux machine. My IT guy told me to use VBox (VirtualBox), and so I did. I used the installer to complete the installation; it can also be done from the command line:

sudo apt-get update
sudo apt-get install virtualbox-6.0

For VBox installation, you can refer to this link.

Just as I was wondering where to download the Ubuntu disk image, he sent me a message telling me where to get it. Alternatively, you can get it from this link.

The setup and installation of Ubuntu 16.04 were all done in VBox using the .vdi image. I do not recall any complications during the Ubuntu installation; it was straightforward all the way.

Next, I needed to get Percona Server installed. I registered on Percona’s website to obtain a copy of the PDF document containing the installation guide and feature setups. It was useful documentation for setting up the server. On Monday, by trial and error, I installed the latest Percona Server, configured it and ran the service in the virtual machine.

There are plenty of features covered in the documentation, with guides to implement them. I completed the configuration to use the Percona Memory Engine as the storage engine. I am not sure whether I configured it correctly, especially as the virtual machine is running on 1GB of memory while I set the Percona memory engine to run at 3GB. It is something I need to revisit after this.

Besides that, I tried enabling authorization mode. Immediately after it was enabled, I tried to launch my company’s product using the default authentication method, and the system returned errors because some databases were not authorized for use. Authorization and authentication are different things, even though I had created a user credential in MongoDB to access those databases. It is another good topic to revisit.

It gave me the experience of being a system engineer, or what most people call SysOps nowadays, for a day or two. Although it was not a full SysOps cycle, the experience of installing and setting up Ubuntu in the virtual machine, followed by installing Percona Server and Robo 3T for MongoDB, and finally configuring and using the Percona server with my company’s product, was so great that I wanted to share it here today.

I have always had the special privilege, as a woman in the industry, of having men do this dirty job, but when a woman tries her hand at it, it is a beautiful piece of art!

Great thanks to my colleagues who were willing to help and guide me through this trial and error. At least I did it myself once!

MongoDB – MacOS Installation

I covered both the Windows and Linux installations of MongoDB in my recently updated blog. Unfortunately, I am not able to write much about the macOS installation because I did not have the environment to try it on.

Nevertheless, there are plenty of materials on the Internet we can search for and follow. One of them, which is always kept updated, is the MongoDB website, https://docs.mongodb.com/manual. It has a comprehensive guide on how to complete the installation.

From what I see, macOS users first need to download the MongoDB .tar.gz tarball, then extract the downloaded file using a command such as:

tar -zxvf mongodb-osx-ssl-x86_64-4.0.5.tgz

After that, a couple of setup steps need to be done. In two examples I saw online, the contents of the extracted download were moved to another folder using a command similar to:

sudo mv mongodb-osx-ssl-x86_64-4.0.5 /usr/local/mongodb

By default, the mongod process uses the /data/db directory to store data. Use the command below to create that directory:

sudo mkdir -p /data/db

If you wish to use a different directory, you must specify it in the dbpath option when starting the mongod process. I will share that command later in this blog.

Let’s assume we are keeping the same location in this blog.

Then, change the permissions so that your username can access the directory. To check your machine’s username, use the command whoami, which returns the username. With this, we can set the permissions using the command below:

sudo chown <username> /data/db

Lastly, we add mongodb/bin to the PATH in ~/.bash_profile. A couple of steps to follow:
1. Type cd to go back to the home directory.
2. Type pwd to make sure you are in the /Users/<username> directory.
3. Type ls -al to list all the files in the directory, including hidden files. The .bash_profile is a hidden file in this case.
4. If the .bash_profile file is not found, type touch .bash_profile to create it.
5. If the .bash_profile file is found, type open .bash_profile to open it.
6. Add or append these two lines at the end of the opened file:
export MONGO_PATH=/usr/local/mongodb
export PATH=$PATH:$MONGO_PATH/bin

7. Save the file.
8. Type source .bash_profile to reload the file.

Start the mongod service using the command mongod. Then, you can check whether MongoDB is running by looking for this output line in the terminal:
[initandlisten] waiting for connections on port 27017

#Run without specifying path
mongod

#Run with specifying path
mongod --dbpath <data directory path>

Again, the default port for MongoDB is 27017.