Cardinality in Databases

Recently, I read an article on cardinality in databases. When you do a Google search to define cardinality in general terms, it returns a definition as “the number of elements in a set or other grouping, as a property of that grouping“. It may sound a bit difficult to visualize and understand.

In another search in Stackoverflow website, a contributor named Oded shared, the cardinality can be described in two different contexts, data modelling and data query optimization.

In term of data modelling, cardinality means how one table relates to another, for example, one to one relationship, one to many relationship or many to many relationship. Below diagram is extracted from lucidchart website which shows the different types of relationship in the database. It is used in ER diagram, entity-relationship diagram.

In term of data query optimization, cardinality means the data in a column of a table, specifically how many unique values are in it. If you have done data profiling before using the Microsoft Power BI, for example, you notice there is a summary statistics of the table loaded into the application. This information helps with planning queries and optimizing the execution plans.

Advertisements

Big Data, A Long Introduction

One of my colleagues from the Business Operation texted me on one morning and asked me where she can get insights, understand some of the terminology, difference between the SQL and NoSQL, and how to make decision which type of database to be used. Instantly, I replied, “get it from me!” I was pretty confident that I could give her an answer and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is computer language for database management systems and data manipulation. SQL is used to perform insertion, updation, deletion. It allows us accessing and modifying data. It stored in a relational model, with rows and columns. Rows contain all of the information about one specific entry and columns are the separate data points.

What is NoSQL?

NoSQL encompasses a wide range of database technologies that are designed to cater to the demands of modern apps. It stored a wide range of data types, each with different data storage models. The main ones are document, graph, key-value and columnar. 

This explains the above picture. Apps such as Facebook, Twitter, search engine (web) and IoT applications generate huge amount of data, both structured and unstructured. The best examples to explain what is unstructured data are photos and videos. Therefore, it needs different method to store the data. NoSQL databases do not store data in rows and columns (table) format.

Differences between SQL and NoSQL

There are a lot of websites which we can search online to give us the differences and I referred to this website.

NoSQL is also known as schema-less databases. The above screenshot uses the word, dynamic schema, which means the same, it does not have a fixed schema which locked same number of the columns (fields) for data entry. NoSQL data allow to have different number of columns when data is added.

Image: https://www.guru99.com/nosql-tutorial.html

Another major difference is scalability, SQL is vertical scaling and NoSQL is horizontal scaling. Let’s use a picture to explain scalability.

Relational databases are designed to run on single server in order to maintain integrity of the table mappings and avoid the problems of distributed computing. Often, we will look into more RAM, more CPU and more HDD, ways to upsize our system by upgrading our hardware specification. It is scale up or vertical scaling. This process is expensive.

NoSQL databases is non-relational, making it easy to scale out or horizontal scaling, meaning that it runs on multiple servers that work together, each sharing part of the load. It can be done on inexpensive commodity hardware.

Question: SQL or NoSQL?

Let’s refer to this article, the choice of the database between SQL and NoSQL cannot be concluded on the differences between them but the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is a preferable database. Conversely, if you have applications where data is changing frequently and growing rapidly, like in Big Data analytics, NoSQL is the best option for you. And remember, SQL is not deceased and can never be superseded by NoSQL or any other database technology.

In short, it depends on what type of applications or project requirements and type of query result as well.

Big Data

Big data is used to refer not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are large in both size and complexity which algorithms are required in order to extract useful information from them. Example sources such as search engine data, healthcare data and real-time data. In my previous article about What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

Yes, based on all the above, we have covered 2 of the 3 V’s, the volume and variety. Velocity is how fast data is generated and processed. Although, there are more V’s out there and some are relevant to Big Data’s description. During my visit to the Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to the understanding of the volume and variety. In this blog, I am going to write more.

Storing Big Data

Unstructured data storage which cannot be stored in the normal RDBMS for some reasons and often Big Data is related to real-time data and required real-time processing requirements.

Hadoop Distributed File System (HDFS)

It provides efficient and reliable storage for big data across many computers. It is one of the popular distributed file systems (DFS) which stored both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics in the markets at the moment. One of the popular method dealing with Big Data is MapReduce by dividing it up into small chunks and process each of these individually. In other words, MapReduce spread the required processing or queries over many computers (many processors).

This Big Data does not limited to search engine and healthcare, it can be data e-commerce websites where we want to perform targeted advertising and provide recommendations systems which we can often see in websites such as Amazon, Spotify or Netflix.

Big Data Security

Securing a network and the data it holds are the key issues, a basic measurement such as firewall and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While smart home has became a reality in the recent years, the successful invention of smart vehicles which allows vehicles drive in auto-mode, gives us a big hope that one day smart city can be realized. Countries such as Singapore, Korea, China and European countries such as Ireland and UK are planning smart cities, using the implementation of IoTs and Big Data management techniques to develop the smart cities.

I am looking forward.

Reference:
Dawn E. Holmes (2017) Big Data A Very Short Introduction.

Database Stability

This is one of the common question to be asked either during a talk or during the interview. Personally, I look at this topic highly and important for every database administrator to pay attention to it.

Slow performance means tasks take longer time to complete. If it takes longer, there is more likely to overlap when multiple users or connections at the same time. It leads to frequent locks, deadlocks and resource contention and eventually leads to errors and stability issues.

Poor scalability means it has limited options when demand exceed capacity such as queue requests or reject requests. Rejecting requests result error or unexpected behaviour and this is instability. Queuing requests lead to reduced performance, putting demands on resources such as CPU, memory and etc. When it increases demands, it leads to further stability issues.

Poor stability affects performance. The partial success and partial failure must be handled, usually with database rollbacks or manual compensation logic. It is an additional resource requirements on the system whether to do rollback or process the manual compensation logic. And it affects scalability.

I found from the MSDN website, someone shared some important points when come to designing whether a database or an application. It always consider performance, scalability, and stability when architecting, building, and testing your databases and applications.

MSSQL: Database Partitioning

mssql

I was being called into a meeting room to discuss with the support team regarding an issue faced by the customer. I did not have the full context of the issue and could only think of some of the queries made by the customer to build the Business Intelligence’s cubes caused some serious performance issues.

And, during the meeting, my boss said the word, database partitioning. For one moment, I did mistakenly mixed up between database partitioning and database normalization. I believed both are used to organize the data in the database but both are targeting at different areas.

Database normalization
According to Microsoft, normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.

Redundant data wastes disk space and creates maintenance problems. If data appears in more than 1 table, we need to make changes exactly the same way in all the other tables.

In general, normalization requires additional tables. There are a few rules for database normalization. Each rule is called a “normal form”. I will write more about it in the next blog.

Database Partitioning
Again, according to Microsoft, partitioning a database improves performance and simplifies maintenance. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan. Maintenance tasks, such as rebuilding indexes or backing up a table, can run more quickly.

a.Hardware Partitioning
Partitioning can be achieved without splitting tables by physically putting tables on individual disk drives. Putting a table on one physical drive and related tables on a separate drive can improve query performance because, when queries that involve joins between the tables are run, multiple disk heads read data at the same time. SQL Server filegroups can be used to specify on which disks to put the tables.

Multiprocessors that enable multiple threads of operations, permitting many queries to run at the same time. Alternatively, a single query may be able to run faster on multiple processors by letting components of the query run at the same time. For example, each table referenced in the query can be scanned at the same time by a different thread.

b.Horizontal Partitioning
Horizontal partitioning divides a table into multiple tables. Each table then contains the same number of columns, but fewer rows. For example, a table that contains 1 billion rows could be partitioned horizontally into 12 tables, with each smaller table representing one month of data for a specific year. Any queries requiring data for a specific month only reference the appropriate table.

Partitioning data horizontally based on age and use is common. For example, a table may contain data for the last five years, but only data from the current year is regularly accessed. In this case, you may consider partitioning the data into five tables, with each table containing data from only one year.

c.Vertical Partitioning
Vertical partitioning divides a table into multiple tables that contain fewer columns. The two types of vertical partitioning are normalization and row splitting.

Normalization is the standard database process of removing redundant columns from a table and putting them in secondary tables that are linked to the primary table by primary key and foreign key relationships.

Row splitting divides the original table vertically into tables with fewer columns. Each logical row in a split table matches the same logical row in the other tables as identified by a UNIQUE KEY column that is identical in all of the partitioned tables. For example, joining the row with ID 712 from each split table re-creates the original row.

Sources:
https://support.microsoft.com/en-sg/help/283878/description-of-the-database-normalization-basics
https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms178148(v=sql.105)

MSSQL: Change DATA Folder Directory

mssql

My colleague came and asked me if the client wanted to change their current SQL Server’s DATA folder’s directory to another directory in their server, can it be done?

The answer is it can be done.

Then, the next question is how? There is a method I found online and tried it and it worked for my local machine. The procedure of changing is not difficult and it is simple to understand. If you are using Window machine, the steps are as below:

Step 1: Identify the mdf and ldf files.
If you did not change the file path during the installation, by default it will set at, C:\Program Files\Microsoft SQL Server\<server_version>\MSSQL\DATA.

So, your filenames are:
– database_name.mdf
– database_name_log.ldf

Just make sure there are .mdf and .ldf files with the matching database name. For example, in my case,
MoveDatabase.mdf
MoveDatabase_log.ldf

Step 2: Set the new file path for the SQL database files.
For my case, I will move the current file path to another drive which I have created a folder to hold the data files. My path is E:\Data. Open the SQL Server Management Studio (SSMS) and execute the below script after you have changed the database name and new file path.

Then, what are the things to be changed?

ALTER DATABASE MoveDatabase   
    MODIFY FILE ( NAME = MoveDatabase,   
                  FILENAME = 'E:\Data\MoveDatabase.mdf');  
 
ALTER DATABASE MoveDatabase
    MODIFY FILE ( NAME = MoveDatabase_log,   
                  FILENAME = 'E:\Data\MoveDatabase_log.ldf');  

There is a couple of things to be done before you can execute the above script in SSMS.

  1. Change the database name “MoveDatabase” to your database name.
  2. In some cases, the .mdf file name is with “_data” at the end of the database name. It could be something like, “MoveDatabase_data”. Just check your database name from step 1. It uses the same name from the step 1.

To execute the above SQL script, the SQL server’s service does not need to be stopped and the database does not need to be offline at this point of time, however, it is advised to ensure your production’s database is not in-used during this point of time to avoid any data losses.

Note: It is advisable to keep database folder in a different directory or best to be in different drive in a machine. Same goes to the database backup files.

Step 3: Set database to offline.
Run the following SQL script in the SSMS to set the database to offline.

ALTER DATABASE MoveDatabase   SET OFFLINE; 

This script is important from this point onward. Change the database name according to your database name. If the database is not set to offline, the following steps cannot be done.

Step 4: Move the database files to new location.
Move .mdf and .ldf files of the specific SQL database to a new location specified in step 2 either by using copy and paste or cut and paste function. If you are familiar with using command, it works the same.

Note: Moving files to another location, there is one important thing to look into, the folder or file permission access.

Step 5: Set database to online.
Run the following SQL script in the SSMS to set the database to online.

ALTER DATABASE MoveDatabase   SET ONLINE;

From this point onward, the specific database starts pointing to new location. However, if you received an error message reads as below:

Msg 5120, Level 16, State 101, Line 13
Unable to open the physical file “E:\Data\MoveDatabase.mdf”. Operating system error 5: “5(Access is denied.)”.


Then, there is a need to set the folder or file permission access with the user account which you can do it at the Permission Setting of the folder. For more details, you can refer to this link, https://www.sqlshack.com/move-sql-database-files-mdf-ldf-another-location/.

MSSQL : Import CSV file into a Table

Recently, I am working on the data stored in the Excel files. Those data are needed to be imported into the database in order to allow the stored procedures to execute and tabulate data back to the users. During the process of importing the Excel files using the “Import Data” (Right click the database > Task), I found the process was tedious because of setting the correct data type for each columns and always cause me to have an error related to truncation text.

Since, the Excel files have a lot of columns and rows, I decided to try another method to import the Excel data into the table, by using the BULK INSERT method. This method requires the Excel files to be in .csv format to begin with. It is easy to convert the .xlsx into .csv file.

Below is the script on how to use the “BULK INSERT” script. Beforehand, a table called dbo.upload_file is created. Based on the delimiter set in the local machine, place the value accordingly at the “FieldTerminator”.

BULK INSERT [dbo].[upload_file] FROM 'C:\Users\<filename&gt;.csv'
    WITH (
       FIELDTERMINATOR = ',',
       ROWTERMINATOR = '\n'
 );

The nightmare began when I realized the data in the Excel contains a lot of comma (,). The data cleaning process began by removing them altogether from the Excel file. Besides that, you want to check if you have inserted the header into the database’ table.

We can set the first row to be inserted into the table.

BULK INSERT [dbo].[upload_file] FROM 'C:\Users\<filename&gt;.csv'
    WITH (
       FIRSTROW = 2,
       FIELDTERMINATOR = ',',
       ROWTERMINATOR = '\n'
 );

By default, Excel saves the .csv file using delimiter comma (,), however it can be changed at the Control Panel setting. Open up Control Panel –> Region and Language, and then click the Additional settings button on the bottom. Now look very closely at the “List separator” item, which normally has a comma in the field, change it to your preferred delimiter such as pipes “|”. Then, hit “Apply” and “Save” button to save the changes made.

Then, next time when saving a .csv file, it will use the new delimiter. Either way can solve the issues mentioned above and the data can be inserted into the database’s table successfully.

MSSQL: Customize theme for SSMS Express 2014

The newer versions of SQL Server Management Studio (SSMS) allows users to change the theme easily from the option menu, however, the version 2014 has different approach and I googled online to find a method to customize it according to my preference of dark theme.

This link gives a good example of how to customize dark theme in SSMS 2014, https://blogs.sentryone.com/aaronbertrand/making-ssms-pretty-my-dark-theme/.

It is simple to follow and you can change the RBG according to your preferences without having to follow exactly every colours mentioned by the writer.

Enjoy!