SQL, NoSQL and MySQL

The SQL, NoSQL and MySQL are the common terms used in today’s database management system technology. In this article, I try to explain them in simple English to distinguish them. It covers the questions below:

  • What is SQL?
  • What is NoSQL?
  • What is MySQL?
  • What is MongoDB?
  • Whether use SQL or NoSQL?

What is SQL?

SQL stands for Structured Query Language. It is the core of the relational database that it is used for accessing, managing and manipulating structured data. The data is stored in the form of columns and rows, and holds relationships between those tables. There are four different type of relationships:

  • One-to-one relationship – It happens when a single row in Table A relates to a single row in Table B.
  • One-to-many relationship – It happens when a single row in Table A relates to multiple rows in Table B.
  • Many-to-many relationship – It happens when many rows in Table A can be related to many rows in Table B.
  • Self-referencing relationship – When a record in Table A relates to the same table itself.

Often SQL relates to the relational database such as Microsoft SQL Server, Oracle, PostgreSQL and MySQL.

What is NoSQL?

NoSQL also is known as Not SQL database provides a mechanism to store and retrieve unstructured data. NoSQL database can handle a large amount of data and has a dynamic schema. It means there is no specific number of columns or fields in the table. The data are stored in the format of collections and documents. NoSQL has no or very few relationships.

SQLNoSQL
Relational database, structured data in defined columns and rows with each table relates to other tables in the database.Non-relational database because data is stored in the form of collections and documents with no or few relationships.
Predefined schema.Dynamic schema.
Table based databasesDocument database
Key-value stores
Graph database
Column family database
SQL databases are vertically scalable. You can load balance the data servers by optimizing hardware such as increasing CPU, RAM, SSD.NoSQL databases are horizontally scalable. You can perform load balancing by adding more servers to your cluster to handle a large amount of traffic.
SQL is mainly used for Online Transactional Processing (OLTP).

Good for complex queries.
NoSQL is mainly used for Online Analytical Processing (OLAP).

Not a good fit for complex queries.

Maybe some non-technical people may think MySQL is another type of SQL besides SQL and NoSQL. It is incorrect. MySQL is one of the popular relational databases. Below is some information about MySQL and MongoDB for your references.

What is MySQL?

It is one of the famous relational database management systems. MySQL is an open-source tool that works on many platforms. MySQL uses SQL to write simple to complex queries to retrieve and manipulate data. MySQL supports relationships using the JOIN statement to retrieve data from a table that relates to another table. MySQL uses a privilege-based security model that authenticates a user and user privileges on a particular database. Most of the relational databases support master-slave replication, MySQL too.

What is MongoDB

MongoDB is a non-relational database that stores the data in documents. The document database stores the related information together for quicker query processing. MongoDB has its own unstructured query language and it does not support JOIN statement, but it supports placing a document inside another document (embedded document or sub-document) and multi-dimensional data types such as arrays. MongoDB is based on role-based access control with a flexible set of privileges. MongoDB supports built-in replication, sharding, and auto-elections.

I have written a post that share the top 10 databases in 2019, and some information of what is key-value stores, document-based database, and etc that I think it is useful for basic understanding. Refer to this link for more information.

Whether use SQL or NoSQL?

Based on my reading, there is no clear winner or clear conclusion which one is better and which one to be used. The choice of the database depends upon the schema of your database and how you want to access your data.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

References:
https://www.edureka.co/blog/sql-vs-nosql-db/#What%20is%20MongoDB

[SQL] What is ACID property in a database?

ACID stands for Atomicity, Consistency, Isolation, Durability. It is used to ensure that the data transactions are processed reliably in a database system. 

ACID Model

Atomicity: Atomicity refers to the transactions that are completely done or failed where transaction refers to a single logical operation of a data. It means if one part of any transaction fails, the entire transaction fails and the database state is left unchanged.

Consistency: Consistency ensures that the data must meet all the validation rules. In simple words, you can say that your transaction never leaves the database without completing its state.

Isolation: The main goal of isolation is concurrency control.

Durability: Durability means that if a transaction has been committed, it will occur whatever may come in between such as power loss, crash or any sort of error.

Base Model

Last time, I knew some of the NoSQL databases are not ACID, and this may not be a suitable database for your application, depending on your use-cases. The acronym BASE is slightly more confusing than ACID. BASE stands for:

  • Basically Available – Rather than enforcing immediate consistency, BASE-modelled NoSQL databases will ensure availability of data by spreading and replicating it across the nodes of the database cluster.
  • Soft State – Due to the lack of immediate consistency, data values may change over time. The BASE model breaks off with the concept of a database which enforces its own consistency, delegating that responsibility to developers.
  • Eventually Consistent – The fact that BASE does not enforce immediate consistency does not mean that it never achieves it. However, until it does, data reads are still possible (even though they might not reflect the reality).

SQL

[SQL] WHAT IS THE DIFFERENCE BETWEEN CLUSTERED AND NON CLUSTERED INDEX IN SQL?

 | EDIT

The differences between the clustered and non clustered index in SQL are :

  1. A clustered index is used for easy retrieval of data from the database and it is faster, whereas reading from non clustered index is relatively slower. It optimizes search queries.
  2. Clustered index alters the way records are stored in a database as it sorts out rows by the column which is set to be clustered index. In a non clustered index, it does not alter the way it was stored but it creates a separate object within a table that points back to the original table rows after searching.
  3. It reduces the amount of data pages that need to be read to retrieve the data in a SQL statement.
  4. One table can only have one clustered index whereas it can have many non clustered index.

EXPLAIN DIFFERENT TYPES OF INDEX

There are three types of index namely:

UNIQUE INDEX:

This index does not allow the field to have duplicate values if the column is unique indexed. If a primary key is defined, a unique index can be applied automatically.

CLUSTERED INDEX:

This index reorders the physical order of the table and searches based on the basis of key values. Each table can only have one clustered index.

NON-CLUSTERED INDEX:

Non-Clustered Index does not alter the physical order of the table and maintains a logical order of the data. Each table can have many nonclustered indexes.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

SHARE THIS:

Customize buttonshttps://widgets.wp.com/likes/index.html?ver=20210413#blog_id=4915531&post_id=5435&origin=liyenz.wordpress.com&obj_id=4915531-5435-6092b1b4da3d8

SQL, NOSQL AND MYSQL

 | EDIT

The SQL, NoSQL and MySQL are the common terms used in today’s database management system technology. In this article, I try to explain them in simple English to distinguish them. It covers the questions below:

  • What is SQL?
  • What is NoSQL?
  • What is MySQL?
  • What is MongoDB?
  • Whether use SQL or NoSQL?

WHAT IS SQL?

SQL stands for Structured Query Language. It is the core of the relational database that it is used for accessing, managing and manipulating structured data. The data is stored in the form of columns and rows, and holds relationships between those tables. There are four different type of relationships:

  • One-to-one relationship – It happens when a single row in Table A relates to a single row in Table B.
  • One-to-many relationship – It happens when a single row in Table A relates to multiple rows in Table B.
  • Many-to-many relationship – It happens when many rows in Table A can be related to many rows in Table B.
  • Self-referencing relationship – When a record in Table A relates to the same table itself.

Often SQL relates to the relational database such as Microsoft SQL Server, Oracle, PostgreSQL and MySQL.

WHAT IS NOSQL?

NoSQL also is known as Not SQL database provides a mechanism to store and retrieve unstructured data. NoSQL database can handle a large amount of data and has a dynamic schema. It means there is no specific number of columns or fields in the table. The data are stored in the format of collections and documents. NoSQL has no or very few relationships.

SQLNoSQL
Relational database, structured data in defined columns and rows with each table relates to other tables in the database.Non-relational database because data is stored in the form of collections and documents with no or few relationships.
Predefined schema.Dynamic schema.
Table based databasesDocument database
Key-value stores
Graph database
Column family database
SQL databases are vertically scalable. You can load balance the data servers by optimizing hardware such as increasing CPU, RAM, SSD.NoSQL databases are horizontally scalable. You can perform load balancing by adding more servers to your cluster to handle a large amount of traffic.
SQL is mainly used for Online Transactional Processing (OLTP).

Good for complex queries.
NoSQL is mainly used for Online Analytical Processing (OLAP).

Not a good fit for complex queries.

Maybe some non-technical people may think MySQL is another type of SQL besides SQL and NoSQL. It is incorrect. MySQL is one of the popular relational databases. Below is some information about MySQL and MongoDB for your references.

WHAT IS MYSQL?

It is one of the famous relational database management systems. MySQL is an open-source tool that works on many platforms. MySQL uses SQL to write simple to complex queries to retrieve and manipulate data. MySQL supports relationships using the JOIN statement to retrieve data from a table that relates to another table. MySQL uses a privilege-based security model that authenticates a user and user privileges on a particular database. Most of the relational databases support master-slave replication, MySQL too.

WHAT IS MONGODB

MongoDB is a non-relational database that stores the data in documents. The document database stores the related information together for quicker query processing. MongoDB has its own unstructured query language and it does not support JOIN statement, but it supports placing a document inside another document (embedded document or sub-document) and multi-dimensional data types such as arrays. MongoDB is based on role-based access control with a flexible set of privileges. MongoDB supports built-in replication, sharding, and auto-elections.

I have written a post that share the top 10 databases in 2019, and some information of what is key-value stores, document-based database, and etc that I think it is useful for basic understanding. Refer to this link for more information.

WHETHER USE SQL OR NOSQL?

Based on my reading, there is no clear winner or clear conclusion which one is better and which one to be used. The choice of the database depends upon the schema of your database and how you want to access your data.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

References:
https://www.edureka.co/blog/sql-vs-nosql-db/#What%20is%20MongoDB

SHARE THIS:

Customize buttonshttps://widgets.wp.com/likes/index.html?ver=20210413#blog_id=4915531&post_id=5417&origin=liyenz.wordpress.com&obj_id=4915531-5417-6092b1b4dce0f

[SQL] EXCEPT AND INTERSECT

 | EDIT

Last Wednesday I went back to office and met up with my teammates. It was our weekly back-to-office after 9 months of staying away from office. During our meet-up we discussed on our work progress, and my teammate asked whether we would like to explore using the SQL EXCEPT in our data extraction. I have a few questions in my mind before I hop on to share our idea to the bosses and business analysts.

WHAT IS EXCEPT?

In SQL Server, EXCEPT returns distinct rows from the left input query that aren’t output by the right input query. It returns any distinct values from the query left of the EXCEPT operator. Those values return as long the right query doesn’t return those values as well.

Just as with the UNION operator, the same rules apply when using the EXCEPT operator.

WHAT IS INTERSECT?

In SQL Server, INTERSECT returns distinct rows that are output by both the left and right input queries operator. It returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operator.

BASIC RULES

To combine the result sets of two queries that use EXCEPT or INTERSECT, the basic rules are:

  • The number and the order of the columns must be the same in all queries.
  • The data types must be compatible.

In our use case, our data mart and temporary table are having the same number of columns in the same order, and the same data types. There are no issues for us to apply EXCEPT in our data extraction. Besides that, we are not using data types such as XML, text, ntext, image. These data types are non-comparable data types.

NULLABILITY

When comparing column values for determining DISTINCT rows, two NULL values are considered equal.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

REFERENCE:

SHARE THIS:

Customize buttonshttps://widgets.wp.com/likes/index.html?ver=20210413#blog_id=4915531&post_id=5972&origin=liyenz.wordpress.com&obj_id=4915531-5972-6092b1b4debef

POWER BI CANNOT USE SQL CTE

 | EDIT

I just want to do a quick sharing of what I have found out today in Power BI Desktop. I was trying to use the SQL CTE in the Power BI Desktop to get a result set for my project’s testers to check the data transformation and ETL processes in the UAT database. A little bit background, we are using Power BI Desktop to access our data in the UAT and Production environment. Users do not have direct access to the databases in the server or access the data via SQL Server Management Tool (SSMS).

WHAT IS CTE?

CTE stands for command table expression in SQL Server. According to the Microsoft website, CTE is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE or MERGE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself. This is referred to as a recursive common table expression. For more reading about CTE, please refer to the link in the reference list below.

IT GAVES ERROR!

The CTE script that I wrote was running perfectly good in SSMS installed in my local machine. However, when I used the Get Data function in the Power BI Desktop and placed the same script into the SQL command text field, the Power BI Desktop threw an error on the screen that suggested it has a syntax error. I knew it was not having any syntax errors, even some of the online solutions mentioned that we have to put a semi-colon in the script.

1234567891011121314-- Define the CTE expression name and column list.  WITH Sales_CTE (SalesPersonID, SalesOrderID, SalesYear)  AS -- Define the CTE query.      SELECT SalesPersonID, SalesOrderID, YEAR(OrderDate) AS SalesYear      FROM Sales.SalesOrderHeader      WHERE SalesPersonID IS NOT NULL -- Define the outer query referencing the CTE name.  SELECT SalesPersonID, COUNT(SalesOrderID) AS TotalSales, SalesYear  FROM Sales_CTE  GROUP BY SalesYear, SalesPersonID  ORDER BY SalesPersonID, SalesYear;

POWER BI DOES NOT SUPPORT CTE

I am quite agree with one of the Power BI users’ replies, the Power BI Desktop is supposed to be a self-servicing tool for users. In other words, it is not meant writing SQL scripts such as views or stored procedures. To resolve the error, I have to change the SQL statement to use normal SELECT statement. This is a good time to test the SQL skill.

I do not have the answer or SQL script for the above SQL statement, if you are looking for one because I just picked the sample script from the Microsoft website. If you have the solution, you can share it with me in the comment box below.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

REFERENCES

SHARE THIS:

Customize buttonshttps://widgets.wp.com/likes/index.html?ver=20210413#blog_id=4915531&post_id=5981&origin=liyenz.wordpress.com&obj_id=4915531-5981-6092b1b4e1847

[SQL] HOW SQL SERVER COMPARES STRINGS WITH TRAILING SPACES

 | EDIT

A few months back I asked the Business Analyst a question, why do we need to perform trimming by using the SQL keywords, LTRIM() and RTRIM() to the dataset that we extracted from the data warehouse before sending the data to our client. The Business Analyst is not a technical person, so I did not get an answer from my question. But, I answered my own question.

From what I understood through data profiling, some dataset contains the trailing spaces because its original data sources use ‘char’ as the data type whereas, in our data warehouse, we use varchar data type. During the data ingestion from the source systems into our data warehouse, we did not perform any kind of transformation except the data type for strings based columns have changed to a varchar data type. It led the columns saving the data with trailing spaces.

Transact-SQL considers the strings ‘abc’ and ‘abc ‘ to be equivalent for most comparison operations.

An article from Microsoft support website states the above and shares a query that helps us to understand how the padding for character strings affects the semantics of WHERE and HAVING clause predicates. The only exception to this rule is the LIKE predicate because the purpose of the LIKE predicate is to perform pattern searches.

Let us look into the given query below:

123456789101112131415CREATE TABLE #tmp (c1 varchar(10))GOINSERT INTO #tmp VALUES ('abc ')INSERT INTO #tmp VALUES ('abc')GOSELECT DATALENGTH(c1) as 'EqualWithSpace', * FROM #tmp WHERE c1 = 'abc 'SELECT DATALENGTH(c1) as 'EqualNoSpace  ', * FROM #tmp WHERE c1 = 'abc'SELECT DATALENGTH(c1) as 'GTWithSpace   ', * FROM #tmp WHERE c1 > 'ab 'SELECT DATALENGTH(c1) as 'GTNoSpace     ', * FROM #tmp WHERE c1 > 'ab'SELECT DATALENGTH(c1) as 'LTWithSpace   ', * FROM #tmp WHERE c1 < 'abd 'SELECT DATALENGTH(c1) as 'LTNoSpace     ', * FROM #tmp WHERE c1 < 'abd'SELECT DATALENGTH(c1) as 'LikeWithSpace ', * FROM #tmp WHERE c1 LIKE 'abc %'SELECT DATALENGTH(c1) as 'LikeNoSpace   ', * FROM #tmp WHERE c1 LIKE 'abc%'GODROP TABLE #tmp

You can try above on our local machine or SQL Fiddle.

It creates a temporary table to hold the 2 records. You can create a table too. No issues. Then, it inserts two records into the temporary table. The next few lines are to make a comparison when executing the queries with different conditions. When I execute the first SELECT statement, it returns result as below:

1SELECT DATALENGTH(c1) as 'EqualWithSpace', * FROM #tmp WHERE c1 = 'abc '
EQUALWITHSPACEC1
4abc
3abc

Let me also explain the definition of DATALENGTH() used in the above query. The DATALENGTH() function returns the length (number of bytes) used to represent an expression. The DATALENGTH() function counts both leading and trailing spaces when calculating the length of the expression. You may ask what the difference between the DATALENGTH() and the LEN(). We use LEN() function to return the number of characters in an expression, excluding trailing spaces.

Okay back to the result table. It returns two rows of result for checking ‘EqualWithSpace’ because the T-SQL considers both strings ‘abc ‘ and ‘abc’ to be the same but it has a different number of bytes used to represent the string.

When we execute the below SELECT statement a different result returns, let us look closer to the result below.

12SELECT DATALENGTH(c1) as 'LikeWithSpace ', * FROM #tmp WHERE c1 LIKE 'abc %'SELECT DATALENGTH(c1) as 'LikeNoSpace   ', * FROM #tmp WHERE c1 LIKE 'abc%'
LIKEWITHSPACEC1
4abc
LIKENOSPACEC1
4abc
3abc

As mentioned earlier the LIKE predicate is a pattern search, and when we execute with a condition WHERE c1 LIKE ‘abc %’, the first table above (‘LikeWithSpace’) returns 1 record that exactly matching the same pattern with trailing space.

The result for the 2nd SELECT statement returns 2 rows of records because the pattern search that used did not include the space at the end of the string ‘abc’, but uses the wildcard character % that represents zero or more characters.

I hope you can get some ideas from the explanation above. If you want to read more about it you can refer to the references link below that I found useful. Other reading on LEN() and DATALENGTH() can be found below too.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

References

January 2020

I hope it is not late to write out my plans for the year 2020. My volunteer work with the TechLadies will come to an end, this March. The TechLadies is recruiting the new core team for the year 2020. The upcoming boot-camp graduation will introduce the new team to the community. Then, the year 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

I have this question in my mind lately, and I am not sure how the TechLadies plans for it. I am quite sure there it would be a great idea to let a new team leads the community. New team, new ideas and directions.

I may consider taking a side role to continue on the study group sessions. But, I also hope that someone is going to plan and run the study group sessions together. If not, then I will be slowly running the events as and when I am available. I am not sure whether a mobile study group will work in Singapore.

Besides TechLadies, what else?

Good question. I have a plan to conduct, learn and teach program after being inspired by my classmate. This program teaches the community (not necessarily must be within TechLadies) of what I learned recently.

I will randomly pick up a topic to learn and share to the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving inputs without receiving feedback from the community.

I hope I will write and share more technical stuff through my blog here as well as my posts in the Medium website.

New focuses

I am looking out for other communities in Singapore that work closely on master data management (MDM), focuses on SQL and NoSQL databases, work on data engineering and use Power BI for data visualization.

I am not going away from my core interest, the databases. Also, I want to go in-depth into master data management and will consider taking some of the courses or certifications in this area. Next, I need to upskill and gain essential experience in the data engineering field while continue exploring the data visualization with Power BI. I am still looking out for Data Engineering meetup or users group in Singapore. Do you know any?

Not to forget, I am doing my data analytics in my final module in Temasek Poly. It is going to be an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate too, so that everything is fresh in my mind. Right now, I have completed 10/26 modules. I still need to complete some Pandas, statistics and machine learning topics before the end of February. Maybe, I will take a bit time off from other activities to focus on study and work.

Database Model

During a check on the Database Engines ranking just now, I found that this link is listing the available database management systems according to their popularity. The website updates monthly and you able to see the current ranking, previous month ranking, and the ranking one year ago.

The picture above shows the list of the top 10 database management systems. Another exciting part of the website is the list of the database model.

Relational DBMS

Relational database management systems (RDBMS) support the relational or table-oriented data model. The schema of a table or the defines by the table name, fixed number of columns (attributes) with fixed data types. A record corresponds to a row in the table (entity) and consists of the values of each column. A relation thus consists of a set of uniform records, according to the website.

The normalization in the process of data modeling generates table schemas. There are some operations used to define a relationship. For example,

  • classical set operations (union, intersection, and difference)
  • Selection (selection of a subset of records according to certain filter criteria for the attribute values)
  • Projection (selecting a subset of attributes/columns of the table)
  • Join: special conjunction of multiple tables as a combination of the Cartesian product with selection and projection.

Document Stores

Document stores, also called document-oriented database systems, are characterized by their schema-free organization of data. According to the website that means,

  • Records do not need to have a uniform structure, i.e. different records may have different columns.
  • The types of values of individual columns can be different for each record.
  • Columns can have more than one value (arrays).
  • Records can have a nested structure.

Document stores often use internal notations, which can be processed directly in applications, mostly JSON. JSON documents, of course, can also be stored as pure text in key-value stores or relational database systems.

Key-value Stores

Key-value stores are probably the simplest form of database management systems. They can only store pairs of keys and values, as well as retrieve values when a key is known.

These simple systems usually are not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances. For example, resource-efficient key-value stores that apply in the embedded systems or as high performance in-process databases.

Advanced Forms

An extended form of key-value stores is able to sort the keys, and thus enables range queries as well as ordered processing of keys. Many systems provide further extensions so that we see a fairly seamless transition to document stores and wide column stores.

Search Engines

Search engines are NoSQL database management systems dedicated to the search for data content. In addition to general optimization for this type of application, the specialization consists of typically offering the following features:

  • Support for complex search expressions
  • Full text search
  • Stemming (reducing inflected words to their stem)
  • Ranking and grouping of search results
  • Geospatial search
  • Distributed search for high scalability

Wide Column Stores

As mentioned above, the wide column stores, also called extensible record stores, store data in records with an ability to hold huge numbers of dynamic columns. Since the column names, as well as the record keys, are not fixed, and since a record can have billions of columns, wide column stores see as two-dimensional key-value stores.

The wide column stores share the characteristic of being schema-free with document stores. However, the implementation is very different. The wide column stores must not be confused with the column-oriented storage in some relational systems. The wide column stores is an internal concept for improving the performance of an RDBMS for OLAP (Online Analytical Processing) workloads and stores the data of a table, not record after record but column by column.

Graph DBMS

Graph DBMS, also called graph-oriented DBMS or graph database, represent data in graph structures as nodes and edges, which are relationships between nodes. Graph DBMS allows easy processing of data in that form, and a simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node. Graph DBMS usually does not provide indexes on all nodes, direct access to nodes based on attribute values is not possible in these cases.

Time Series DBMS

A Time Series DBMS is a database management system that optimizes handling time series data; for example, each entry associated with a timestamp.

Time Series DBMS is designed to efficiently collect, store, and query various time series with high transaction volumes. Although the management of the time series data can be the same as other categories of DBMS (from key-value stores to relational systems), the specific challenges often require specialized systems.

I hope the information extracted from the website is able to help us understand the differences between the database models.

MariaDB

MariaDB is a newer database management system, created by the original developers of MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system). One notable feature of MariaDB is its built-in support for temporal data. A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.

PostgreSQL

PostgreSQL is a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties. The database management system is extensible; you can add code modules to the database, which can be run by queries. Another key feature is the ability to store and manipulate geometric data, such as lines, circles, and polygons.

PostgreSQL has its own query language called pgsql. This language is a variant of the standard relational query language, SQL, with features that enable you to write stored procedures that run inside the database.

References: https://db-engines.com/en/ranking

MongoDB: Importing csv files with mongoimport

Someone asked for my help to upload some .csv files to the MongoDB database and backup the database before sending the file to the next person. I completed the task with the command below. It imports a .csv file to the selected database and collection by specifying the file type, location and whether the file has a headerline. It runs for both Linux and Windows’ machines using the terminal or command prompt.
mongoimport -d mydb -c things --type csv --file locations.csv --headerline

MongoDB: Enabling SSL for MongoDB

Sometimes ago, I was told to prepare and work with a team of people to explore the TLS/SSL for MongoDB and use the Windows Active Directory users to access MongoDB via LDAP authentication.

It is an administration work on MongoDB to enhance security between client and server during data transmission. The default connections to MongoDB servers are not encrypted. It is highly advisable to ensure all connections to the mongod are Transport Layer Security (TLS) also known as SSL enabled.

The definition of TLS/SSL from Google is TLS and its now-deprecated predecessor, Secure Sockets Layer (SSL), which are protocols designed to provide communications security over a computer network

For a development server, you can enable the TLS using a self-signed certificate. OpenSSL allows us to generate the self-signed certificate on our server itself. For the Windows machine, it requires to install the OpenSSL to generate the self-signed certificate. Then, you need to install the self-signed certificate. For the Linux Ubuntu machine, no installation required for the OpenSSL and self-signed certificate.

In the MongoDB configuration file (mongod.conf for Linux machine) or (mongod.cfg for Windows machine) can enable the SSL mode in the MongoDB. You need to specify the path to the .pem file. If you are working on a production server, you are required to include the CAFile path too. CA stands for certificate authority, is an entity that issues digital certificates.

net:
   tls:
      mode: requireTLS
      certificateKeyFile: /etc/ssl/mongodb.pem
      CAFile: /etc/ssl/caToValidateClientCertificates.pem

More reading about MongoDB TLS/SSL can refer to this document: https://docs.mongodb.com/manual/tutorial/configure-ssl/

Dates in MongoDB

My colleague came to me last Friday and asked me to help him out on one of the issues he faced with MongoDB’s query. He found that the JSON data received via the API when our BI tool inserted the data into MongoDB, it saved the date and datetime values in string format.

The JSON specification does not specify a format for exchanging dates which is why there are so many different ways to do it. The best format is the ISO date format, it is a well known and widely used format and can be handled across many different languages, making it very well suited for interoperability.

It shall look like below:

2012-04-23T18:25:43.511Z

When I took over the JSON data and found that most of the date and time are in normal string format (eg: 2012-04-23 18:25:43.511), MongoDB inserted the data into the database as string type.

To understand dates in MongoDB, I found this article is pretty good in explaining to them, here is the link. Instantly, it hit me! Dealing with dates in MongoDB or any databases is not an easy job at all, moreover, this time I am dealing with MongoDB where it has subdocument or array in a subdocument structure.

I can use the $dateFromString which converts a date/time string to a date object. For more detail about this function, you can find it out at this link.

This aggregation operator worked like magic to me when I am dealing with the date and time in the string until I reached a point where the date is in an array inside a subdocument. I faced a roadblock and I am not able to use the same method to do the conversion. It seems quite true that I have to use $unwind in my query. It deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.

I need a solution that I can use $unwind to flatten the array from the collection and convert each date and time into a date object. I am still looking for a solution, if you happened to solve something similar like mine, please give me a helping hand and link me to the solutions. Thank you!

Maybe useful: https://stackoverflow.com/questions/38299186/query-that-combines-project-unwind-group

Big Data, A Long Introduction

One of my colleagues from the Business Operation texted me on one morning and asked me where she can get insights, understand some of the terminology, difference between the SQL and NoSQL, and how to make decision which type of database to be used. Instantly, I replied, “get it from me!” I was pretty confident that I could give her an answer and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is computer language for database management systems and data manipulation. SQL is used to perform insertion, updation, deletion. It allows us accessing and modifying data. It stored in a relational model, with rows and columns. Rows contain all of the information about one specific entry and columns are the separate data points.

What is NoSQL?

NoSQL encompasses a wide range of database technologies that are designed to cater to the demands of modern apps. It stored a wide range of data types, each with different data storage models. The main ones are document, graph, key-value and columnar. 

This explains the above picture. Apps such as Facebook, Twitter, search engine (web) and IoT applications generate huge amount of data, both structured and unstructured. The best examples to explain what is unstructured data are photos and videos. Therefore, it needs different method to store the data. NoSQL databases do not store data in rows and columns (table) format.

Differences between SQL and NoSQL

There are a lot of websites which we can search online to give us the differences and I referred to this website.

NoSQL is also known as schema-less databases. The above screenshot uses the word, dynamic schema, which means the same, it does not have a fixed schema which locked same number of the columns (fields) for data entry. NoSQL data allow to have different number of columns when data is added.

Image: https://www.guru99.com/nosql-tutorial.html

Another major difference is scalability, SQL is vertical scaling and NoSQL is horizontal scaling. Let’s use a picture to explain scalability.

Relational databases are designed to run on single server in order to maintain integrity of the table mappings and avoid the problems of distributed computing. Often, we will look into more RAM, more CPU and more HDD, ways to upsize our system by upgrading our hardware specification. It is scale up or vertical scaling. This process is expensive.

NoSQL databases is non-relational, making it easy to scale out or horizontal scaling, meaning that it runs on multiple servers that work together, each sharing part of the load. It can be done on inexpensive commodity hardware.

Question: SQL or NoSQL?

Let’s refer to this article, the choice of the database between SQL and NoSQL cannot be concluded on the differences between them but the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is a preferable database. Conversely, if you have applications where data is changing frequently and growing rapidly, like in Big Data analytics, NoSQL is the best option for you. And remember, SQL is not deceased and can never be superseded by NoSQL or any other database technology.

In short, it depends on what type of applications or project requirements and type of query result as well.

Big Data

Big data is used to refer not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are large in both size and complexity which algorithms are required in order to extract useful information from them. Example sources such as search engine data, healthcare data and real-time data. In my previous article about What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

Yes, based on all the above, we have covered 2 of the 3 V’s, the volume and variety. Velocity is how fast data is generated and processed. Although, there are more V’s out there and some are relevant to Big Data’s description.

During my visit to the Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to the understanding of the volume and variety. More explanation is given from the same website on the new Vs.

Data veracity refers to the degree of certainty in data sets. Uncertain raw data collected from multiple sources, such as social media platforms and webpages, can cause serious data quality issues that may be difficult to pinpoint.

Bad data leads to inaccurate analysis and may undermine the value of business analytics because it can cause executives to mistrust data as a whole. The amount of uncertain data in an organization must be accounted for before it is used in big data analytics applications. IT and analytics teams also need to ensure that they have enough accurate data available to produce valid results.

Not all data collected has real business value and the use of inaccurate data can weaken insights provided by analytics applications. It is critical that organizations employ practices such as data cleansing and confirm that data relates to relevant business issues before they use it in a big data analytics project.

Variability also often applies to sets of big data, which are less consistent than conventional transaction data and may have multiple meanings or be formatted in different ways from one data source to another — things that further complicate efforts to process and analyze the data.

Storing Big Data

Unstructured data storage which cannot be stored in the normal RDBMS for some reasons and often Big Data is related to real-time data and required real-time processing requirements.

Hadoop Distributed File System (HDFS)

It provides efficient and reliable storage for big data across many computers. It is one of the popular distributed file systems (DFS) which stored both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics in the markets at the moment. One of the popular method dealing with Big Data is MapReduce by dividing it up into small chunks and process each of these individually. In other words, MapReduce spread the required processing or queries over many computers (many processors).

This Big Data does not limited to search engine and healthcare, it can be data e-commerce websites where we want to perform targeted advertising and provide recommendations systems which we can often see in websites such as Amazon, Spotify or Netflix.

Big Data Security

Securing a network and the data it holds are the key issues, a basic measurement such as firewall and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While smart home has became a reality in the recent years, the successful invention of smart vehicles which allows vehicles drive in auto-mode, gives us a big hope that one day smart city can be realized. Countries such as Singapore, Korea, China and European countries such as Ireland and UK are planning smart cities, using the implementation of IoTs and Big Data management techniques to develop the smart cities.

I am looking forward.

Reference:
Dawn E. Holmes (2017) Big Data A Very Short Introduction.
https://searchdatamanagement.techtarget.com/definition/big-data

MongoDB: The Best Way to Work With Data

Relational databases have a long-standing position in most organizations. This made them the default way to think about storing, using and enriching data. However, modern applicants present new challenges that stretch the limits of what is possible with a relational database. Relational database uses tabular data model, stores data across many tables and links by foreign keys as the need to normalize the data.

Document Model

In contrast, MongoDB uses a document data model and presents data in single structure with the related data embedded as sub-documents and arrays. Below JSON document shows how a customer object is modeled in a single document structure with embedded sub-documents and arrays.

Flexibility: Dynamically Adapting to Changes

MongoDB documents’ fields can vary from document to document within a single collection. There is no need to declare the structure of documents to the system – documents are self-describing. If a new field needed to be added into a document, the field can be added without affecting all other documents in the MongoDB, unlike relational databases, we need to run the ‘ALTER TABLE’ operations.

Schema Governance

While MongoDB allows flexible schema, MongoDB also provides schema validation with the database, from MongoDB version 3.6 and above. The JSON schema validator allows us to define a fixed schema and validation rules directly into the database and free the developers to take care of it from the application level. With this, we can apply data governance standard to the schema while maintaining the benefits of a flexible document model.

Below is the sample validation rule,

db.createCollection( "people" , {
   validator: { $jsonSchema: {
      bsonType: "object",
      required: [ "name", "surname", "email" ],
      properties: {
         name: {
            bsonType: "string",
            description: "required and must be a string" },
         surname: {
            bsonType: "string",
            description: "required and must be a string" },
         email: {
            bsonType: "string",
            pattern: "^.+\@.+$",
            description: "required and must be a valid email address" },
         year_of_birth: {
            bsonType: "int",
            minimum: 1900,
            maximum: 2018,
            description: "the value must be in the range 1900-2018" },
         gender: {
            enum: [ "M", "F" ],
            description: "can be only M or F" }
      }
   }
}})

So, it is possible also to implement the validation rules to the existing collections? The answer is we just need to use the collMod command instead of createCollection command.

db.runCommand( { collMod: "people3",
   validator: {
      $jsonSchema : {
         bsonType: "object",
         required: [ "name", "surname", "gender" ],
         properties: {
            name: {
               bsonType: "string",
               description: "required and must be a string" },
            surname: {
               bsonType: "string",
               description: "required and must be a string" },
            gender: {
               enum: [ "M", "F" ],
               description: "required and must be M or F" }
         }
       }
},
validationLevel: "moderate",
validationAction: "warn"
})

Having a Really Fixed Schema

MongoDB allows the additional fields that are not in the validation rules to be inserted into the collection. If we would like to be more restrictive and have a really fixed schema for the collection we need to add the following parameter in the validation rule,

additionalProperties: false

The below MongoDB script shows how to use the above parameter.

db.createCollection( "people2" , {
   validator: {
     $jsonSchema: {
        bsonType: "object",
        additionalProperties: false,
		required: ["name","age"],
        properties: {
           _id : {
              bsonType: "objectId" },
           name: {
              bsonType: "string",
              description: "required and must be a string" },
           age: {
              bsonType: "int",
              minimum: 0,
              maximum: 100,
              description: "required and must be in the range 0-100" }
        }
     }
}})

Speed: Great Performance

For most of the MongoDB’s queries, there is no need to JOIN multiple records. Should your application require it, MongoDB does provide the equivalent of a JOIN, the $lookup which was introduced since version 3.2. For more reading, you can find in this link.

I will stop here for now and shall return with more information in my next write up or I will continue from this post. Stay tuned.

Database Stability

This is one of the common question to be asked either during a talk or during the interview. Personally, I look at this topic highly and important for every database administrator to pay attention to it.

Slow performance means tasks take longer time to complete. If it takes longer, there is more likely to overlap when multiple users or connections at the same time. It leads to frequent locks, deadlocks and resource contention and eventually leads to errors and stability issues.

Poor scalability means it has limited options when demand exceed capacity such as queue requests or reject requests. Rejecting requests result error or unexpected behaviour and this is instability. Queuing requests lead to reduced performance, putting demands on resources such as CPU, memory and etc. When it increases demands, it leads to further stability issues.

Poor stability affects performance. The partial success and partial failure must be handled, usually with database rollbacks or manual compensation logic. It is an additional resource requirements on the system whether to do rollback or process the manual compensation logic. And it affects scalability.

I found from the MSDN website, someone shared some important points when come to designing whether a database or an application. It always consider performance, scalability, and stability when architecting, building, and testing your databases and applications.

%d bloggers like this: