Database Stability

This is one of the common question to be asked either during a talk or during the interview. Personally, I look at this topic highly and important for every database administrator to pay attention to it.

Slow performance means tasks take longer time to complete. If it takes longer, there is more likely to overlap when multiple users or connections at the same time. It leads to frequent locks, deadlocks and resource contention and eventually leads to errors and stability issues.

Poor scalability means it has limited options when demand exceed capacity such as queue requests or reject requests. Rejecting requests result error or unexpected behaviour and this is instability. Queuing requests lead to reduced performance, putting demands on resources such as CPU, memory and etc. When it increases demands, it leads to further stability issues.

Poor stability affects performance. The partial success and partial failure must be handled, usually with database rollbacks or manual compensation logic. It is an additional resource requirements on the system whether to do rollback or process the manual compensation logic. And it affects scalability.

I found from the MSDN website, someone shared some important points when come to designing whether a database or an application. It always consider performance, scalability, and stability when architecting, building, and testing your databases and applications.

What is the difference between Distinct and Unique in Power BI?

I looked into the summary statistic which I shared previously and I was a bit confused to distinguish the different between “Distinct” and “Unique” when both of them give almost the similar meaning. In one of the datasets, the name column has “Distinct” value of 740. This means within the datasets there are 740 unique names.

Then, why the “Unique” value is not 740 also.

Here is the explanation I found in most layman way, “Distinct” means total number of different values regardless how many times it appears in the dataset. A name appears in the list multiple times is counted as 1 distinct count.

Whereas, the “Unique” value is total number of values that only appear once. This means there are 740 distinct names in the dataset and out of it there is 485 names has 1 record only.

Depend on the business use cases, for example in my project, it mentions that for client names appear more than once, keep the first record. Then, after some data cleaning, we will able to see the “Distinct” value and “Unique” value is the same in the statistics summary.

My experience using Power BI

As I mentioned in the previous entry, what data visualization tool to be used is depending on what kind of data you have, what is the purpose of your project and how you want to present it. My project does require me to demonstrate my ability to perform data exploration using Descriptive Statistics and apply techniques to clean data using Power BI.

I was given 3 sets of data. I was told that the data is messy because it used different external sources to compile it. I loaded the Power BI software tool on my machine, There is an option called “Get Data” found under the Home menu. It is able to load many types of data sources, from files to databases and online services and cloud.

Once the file is selected, in my case, I chose a .csv file. it will load into this dialog box to preview the source file. From here, it allows us to load or edit the data. We can edit the data anytime in the Power Query Editor. So, I chose to load the data first by clicking on the “Load” button.

I loaded all the 3 sets and under the “Fields” explorer on the right side of the tool, you can view the columns names for each of the datasets. Firstly, I noticed there is something wrong with the country code dataset. It did not read the first row as header, therefore my column name turned into “Column1”, “Column2”, “Column3”. Usually, Power BI able to identify the header in the excel file, just like the other 2 sets of data.

How to set first row as header?

I changed my view into “Model” view which I can select it from the side bar on the left side of the tool. The other 2 buttons on the left side bar are the “Report” view where we can do our visualization and the “Data” view where we can view our data. Again, there is a limitation on how many rows of data is able to be previewed. So, be aware of this point.

In the “Model” view, I changed the columns names by selecting the “countrycode” table and click on the “Edit Query” button under the Home menu. It opened the Power Query Editor which I mentioned earlier. There is a function called “Use First Row as Headers” in the Home menu.

This editor gives us the data profiling view, allows us to perform data cleaning and checking through the data quality through its summary statistics and column distribution. Also, I noticed the column names have been updated accordingly.

Data Profiling

It is a process of examine the data available from the data sources, either from flat files or databases. It collects and shows statistics summary such as count, min and max values, unique and distinct value, and other informative summaries about the data.

It allows us to understand the data before we can decide the data cleaning process. More often, it is a toolbox of business rules and analytical algorithms to discover, understand and potentially expose inconsistencies in your data. This helps to improve data quality or maintaining the health of the data.

Why do we need data profiling?

Data that is not formatted right, standardized or correctly integrated with the rest of the database can cause delays and problems that lead to missed opportunities, confused customers and bad decisions“.

Totally correct. Uncovering erroneous data within the dataset helps you ensure that your data is up to standard statistical measures, as well as business rules specific. Data quality check helps to identify incorrect or ambiguous data and provides ideas on what are the data to be cleaned.

You can check whether the data is consistent and formatted correctly through some pattern matching on the data types for example. Pattern matching helps you to understand whether the field is a text or number based field.

Lastly, relationship discovery. You want to know how each data sets are related or connected to one and another by starting with metadata analysis to determine key relationships.

End of the day, you will get a dataset diagram which looks similar to the screenshot once the data is cleaned, integrated and correctly formatted.

Data Visualization

The story did not end here, data visualization comes in when we have done some data crunching so that the data can be useful for analysis and visualization. Power BI data visualization covers a number of different charts, graphs and etc. It can be easily be located from its marketplace.

Histogram

I added a histogram from the marketplace and pulled out some data from my dataset to see the age distribution and worth in billions distribution. It is quite interesting to look at data in this histogram and start asking questions from it.

Scatter Plot & Line Chart

Scatter plot is type of plot diagram which showing the pattern of the resulting points revealing any correlation present. A line chart or line plot or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments. It is a basic type of chart common in many fields.

There are plenty of different charts and graphs can be used in Power BI. The important question here is knowing what kind of charts and graphs to be used in data visualization. It is important to choose the right one to draw the visual (chart) for data presentation.

Data Visualization Tool

Recently, I started using PowerBI for my study and I did a quick search on the comparison between tools used in data visualization. As many of us know, data visualization is the most initial and important steps of uncovering insights from data.

As it is getting more important, some of the top data visualization best practices include standardized methodologies, clear classification, data profiling and data presentation based on its audience.

What features are needed in Data Visualization Software?

One useful feature in many visualization software types is the ability to choose different visuals and graphs. Different types of charts or graphs represent different type of data. In other words, there is no single graphs or charts can be used to present all types of data.

Data visualization tools should have an accurate trend tracking capability. For example, you can see which are the products selling at each months and its demographics of buying those products.

Depending on what type of information you need, you might also requirehigh level of security. It is crucial to protect your data because you may have people who want to break into your account and steal important information and without proper security measurement, you are simply put your company and data at risk.

Lastly and most importantly, the software has to be ease to use. A simplified user interface and easy to understand tips to help users to work around on the software right away can avoid users from getting confused and trying to decode and interpret lots of data.

What are the top best data visualization tool in 2019?

Sometimes, I wonder whether I need to search for answers for this question when we always hear people just say they are using Tableau or QlikSense. However, I do notice there are a bunch of other data visualization tools out there and many of them serve different purposes in the business, come with different features and customization levels.

It depends on your data to decide which visualization tool is good to be used to give insight and present data. There are some which are desktop based, some are available by publishing the links to the audiences and some are just conveniently available in the cloud.

Let me just show you a table of some of the available data visualization tools in the market. I did not use all of them before.

If you are a Business Analyst or BI Developer, then you may have been seeing or using Tableau, Power BI, QlikView (QlikSense) or even the IBM Watson Analytics and SAP Analytics Cloud.

If you are a Developer or Data Engineer, you may want to have chart plugin or generator, then Plotly, Highcharts, D3.js or MATLAB.

And before I end my write-up, I want to share something which I found from the Internet and still related to visualization, it is the visualization on large screen. There is one called Ali DataV. It is a drag-and-drop visualization tool from Alibaba Cloud, which is mainly used for big data visualization of business data combined with geographic information.

Another one is Digital Hail, focuses on Data Imaging, 3D Processing, Data Analysis, and other related services. You can visualize and display data analysis results, which is more used in Smart Cities and Industrial Monitoring.

There are the two visualization tools which I find it so impressive and new to me.

How to use SSH to connect to a remote server

To establish a connection with a remote machine depending on the operation system you are running, there are two most commonly used protocols:

  • Secure Shell (SSH) for Linux based machine.
  • Remote desktop protocal (RDP) for Windows based machine.

This protocols use client and server applications to establish a remote connection. It allows you to gain access and remotely manage other machines. Today’s topic focuses on Linux based machine and I am using Ubuntu 16.04.

How to install an OpenSSH Client

For Linux Ubuntu, you can install the openssh-server to enable the SSH on Ubuntu client machine. For Windows machine, you can install PuTTY or any other client of your choice.

sudo apt-get install openssh-server

OpenSSH defaults

  • TCP port: 22
  • OpenSSH server config file is called sshd_config which is located at /etc/ssh/

How to install an OpenSSH Server

In order to accept SSH connections, a machine needs to have a server-side part of the SSH. First, you can check if OpenSSH server is available on the Ubuntu server machine of the remote computer that needs to accept SSH connections, you can try to connect to the localhost.

ssh localhost

Ubuntu machine that without the SSH server installed, the screen may shows:

username@host:~$ ssh localhost
ssh: connect to host localhost port 22: Connection refused username@host:~$

The port 22 is not established, therefore the connection is refused. Then, you can install the SSH server,

sudo apt-get install openssh-server ii

You can check if SSH server is running after the installation is completed on the Ubuntu machine by using command below:

sudo service ssh status

The screenshot above show the status is active and running currently. Then, we can move back to the client machine which can be our local machine to try the command ssh to the remote server machine.

How to connect via SSH

Open the terminal from your machine and run command: ssh username@host_ip_address

Key in the password and start connecting. If you are connected for the first time, an ECDSA key fingerprint is required, so just follow the instruction on the terminal and then, you are connected to the remote server. If the default port has changed, you can specify the port behind the ip address with -p 9876

Firewall rules

Next you may want to look at the firewall setting in the server machine to ensure the port is listening, not blocked by firewall and forwarded incorrectly.

Creating a Linux service

I think creating a Linux service is a fairly easy task and it allows us to write a program and turn it into a service using systemd. The service can start or stop using a terminal or GUI (Graphical User Interface) for Windows. I am going to use terminal for this topic here as we are going to create a Linux service to start running my program.

Another reason I want to create a service to run my program is it allows my program to be restarted if it terminates due to unforeseen reasons.

How to begin?

I started with create a script file with .service file extension. I got a sample copy from my colleague and began to modify the script. I have to be careful enough with the directory path. While I thought it may just a directory change from /opt/ to /home/ for the WorkingDirectory and ExecStart in the script, it did not turn out working when I enabled and started the service. I will share what the stupid mistaken I made was.

Once you have the script file ready, saved and copy it to the following directory, /etc/systemd/system. You cannot do a copy-paste of the document into this directory path.

sudo cp mytest_service.service /etc/systemd/system

Permission and Executable File

Next, add the file permission to the root user and make the file executable. You can check on the command, chmod for details setting of file permission for user, guest and other. Otherwise, to set the file to be executable, the below command does the work.

sudo chmod +x mytest_service.service

Enable and Start Service

Once you have the .service file ready in the said directory and make it executable, I think it should be good enough for the next step which is to enable to the service and start the service.

sudo systemctl enable mytest_service.service
sudo systemctl start mytest_service.service

Created symlink from /etc/systemd/system/multi-user.target.wants/mytest_service.service to /etc/systemd/system/mytest_service.service.

Upon running the above command to enable the service, the symlink is created as shown above.

If you wish to know whether the service is started correctly, you can use the following command to check the running status.

sudo systemctl status mytest_service.service

Checking the status will give us few information which we want know, whether the service is active, running or failed with error. The error message serves an important keyword for us to search online for solutions.

If you search on the Internet, you may find that some command is using service instead of systemctl. service is a fairly simple wrapper which only supports limited actions such as start, stop and check the status of the service. For more complex tasks, actual command to be used is systemctl.

For more reading about systemctl and service command, you can find it in this link.

What is the error I received?

The “Active” status on the screen shows “failed” when I ran the command to start the service. It showed an error code 200/CHDIR. I googled the error and found out this error indicated the path is not found or accessible at the time the service is attempting to run.

Since, I have set the file permission earlier on, the access right to the file should be granted with executable file. Hence, it should be my path wrong.

Having my program saved at home directory, I missed out to include my full home directory path in the “WorkingDirectory” and “ExecStart”. Use “pwd” in the terminal, it is a command line to print the current working directory. Hence, it helps to get a correct path to allow the service to execute the program.

/home/myname/program/run.sh
##/home/program/run.sh

Data Types : Statistics

Two main data types in the Satistics, the qualitative data and the quantitative data. In the previous topic, we discussed about the “Level of Data Measurement” in which we talked about nominal, ordinal, interval and ratio. How does these measurements can be related to qualitative and quantitative data.

The qualitative data is the nominal and ordinal measurements which describe a “feature” of a data object. Meanwhile, the quantitative data refers to data can be counted or measured by numbers. The quantitative data is the interval and ratio measurements.

More examples to distinguish the qualitative data and quantitative data as below:

And, further discussion on the qualitative data where it has a sub level of data types called discrete and continuous. Both have differences in few areas, see the table below:

Discrete data is a whole number (integer) and it cannot be subdivided into smaller and smaller parts.

Continuous data continues on and on and on.

Source:
https://www.mymarketresearchmethods.com/data-types-in-statistics/