Big Data, A Long Introduction


One morning, a colleague from Business Operations texted me to ask where she could get some insight into databases: the terminology, the difference between SQL and NoSQL, and how to decide which type of database to use. I replied instantly, “Get it from me!” I was pretty confident that I could give her an answer, and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is a computer language for managing relational databases and manipulating the data in them. It lets us access and modify data by querying, inserting, updating and deleting records. The data is stored in a relational model, in tables made of rows and columns. Each row contains all of the information about one specific entry, and each column is a separate data point about that entry.
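To make the insert, update and delete operations concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The customers table, its columns and the sample names are made up for illustration; they are not from any real system.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A table has a fixed set of columns; every row follows this structure.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Insertion: each row is one entry.
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Singapore"))
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Dublin"))

# Update: change one column of an existing row.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("London", "Bob"))

# Deletion: remove a row.
conn.execute("DELETE FROM customers WHERE name = ?", ("Alice",))

# Query: read back what is left.
rows = conn.execute("SELECT id, name, city FROM customers").fetchall()
print(rows)
```

Only Bob's row remains at the end, with his city updated.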

What is NoSQL?

NoSQL encompasses a wide range of database technologies that are designed to cater to the demands of modern apps. These databases store a wide range of data types, each with a different data storage model. The main models are document, graph, key-value and columnar.

The picture above illustrates this. Apps such as Facebook and Twitter, web search engines and IoT applications generate huge amounts of data, both structured and unstructured. Photos and videos are the best examples of unstructured data. Such data needs a different storage method, so NoSQL databases do not store data in a table format of rows and columns.

Differences between SQL and NoSQL

Many websites list the differences between the two, and I referred to this website.

NoSQL databases are also known as schema-less databases. The screenshot above uses the term “dynamic schema”, which means the same thing: there is no fixed schema that locks every entry into the same set of columns (fields). Each record added to a NoSQL database can have a different number of fields.
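A dynamic schema can be sketched with plain Python dictionaries standing in for documents. This is only an illustration of the idea, not the API of any particular NoSQL product; the field names and the find helper are hypothetical.

```python
# Two "documents" in the same collection with different fields.
# No schema forces them to share the same columns.
documents = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "photos": ["beach.jpg"], "followers": 120},
]


def find(collection, **criteria):
    """Return the documents whose fields match all the given criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Each document carries its own set of fields.
field_sets = [sorted(doc.keys()) for doc in documents]
print(field_sets)

# Querying still works even though the documents differ in shape.
matches = find(documents, name="Bob")
print(matches)
```

In a relational table, adding the photos and followers fields would mean altering the table for every row. Here the second document simply has them and the first does not.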

Image: https://www.guru99.com/nosql-tutorial.html

Another major difference is scalability: SQL databases typically scale vertically, while NoSQL databases scale horizontally. Let’s use a picture to explain scalability.

Relational databases are traditionally designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing. To grow such a system, we usually upgrade the hardware specification: more RAM, more CPU and more disk space. This is called scaling up, or vertical scaling, and it is expensive.

NoSQL databases are non-relational, which makes it easier to scale out, or scale horizontally. This means that the database runs on multiple servers that work together, each sharing part of the load. It can be done on inexpensive commodity hardware.
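One simple way to picture “each server sharing part of the load” is hash-based sharding, where the key of each record decides which server stores it. The sketch below is an assumption-laden toy: the server names are invented, the servers are just dictionaries, and real systems usually use more robust schemes (for example consistent hashing) so that adding a server does not move most of the keys.

```python
import hashlib

# Hypothetical server names; each "server" is just a dictionary here.
SERVERS = ["server-a", "server-b", "server-c"]
shards = {name: {} for name in SERVERS}


def pick_server(key, servers=SERVERS):
    """Map a key to one server using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def put(key, value):
    """Store the value on whichever server owns the key."""
    shards[pick_server(key)][key] = value


def get(key):
    """Look the key up on the server that owns it."""
    return shards[pick_server(key)].get(key)


# Spread nine records over the three servers.
for i in range(9):
    put(f"user:{i}", {"visits": i})

for name, data in shards.items():
    print(name, sorted(data))
print(get("user:4"))
```

Because the same key always hashes to the same server, reads and writes for that key never need to touch the other machines.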

Question: SQL or NoSQL?

Let’s refer to this article. The choice between SQL and NoSQL should not be based on the differences between them alone, but on the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is the preferable choice. Conversely, if your data changes frequently and grows rapidly, as in Big Data analytics, NoSQL is often the better option. And remember, SQL is not dead and is unlikely to be superseded by NoSQL or any other database technology.

In short, it depends on the type of application, the project requirements and the type of query results you need.

Big Data

Big Data refers not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are so large and complex that algorithms are required to extract useful information from them. Example sources include search engine data, healthcare data and real-time data. In my previous article, What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

So far, we have covered 2 of the 3 V’s: volume and variety. Velocity is how fast data is generated and processed. There are more V’s out there, and some of them are relevant to describing Big Data. During my visit to Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to volume and variety. In this blog, I am going to write more.

Storing Big Data

Unstructured data often cannot be stored in a conventional RDBMS. In addition, Big Data is frequently real-time data, which requires real-time processing.

Hadoop Distributed File System (HDFS)

HDFS provides efficient and reliable storage for Big Data across many computers. It is one of the most popular distributed file systems (DFS), and it stores both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics on the market at the moment. One popular method for dealing with Big Data is MapReduce, which divides the data into small chunks and processes each chunk individually. In other words, MapReduce spreads the required processing or queries over many computers (many processors).
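The idea behind MapReduce can be sketched with the classic word count example. This is a single-machine illustration in plain Python, not Hadoop's actual API: the function names and sample text are made up, and in a real cluster each chunk would be mapped on a different computer.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all the values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# The data is divided into small chunks; each is mapped independently.
chunks = ["big data is big", "data is everywhere"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each chunk is mapped without looking at the others, the map step is the part that can be spread over many computers.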

Big Data is not limited to search engines and healthcare. It can also be data from e-commerce websites, where it is used for targeted advertising and for the recommendation systems that we often see on websites such as Amazon, Spotify or Netflix.

Big Data Security

Securing a network and the data it holds is a key issue. Basic measures such as firewalls and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

Smart homes have become a reality in recent years, and the successful development of smart vehicles that can drive themselves gives us hope that smart cities will be realized one day. Countries such as Singapore, Korea and China, and European countries such as Ireland and the UK, are planning smart cities that use IoT and Big Data management techniques.

I am looking forward to it.

Reference:
Dawn E. Holmes (2017). Big Data: A Very Short Introduction.
