Data Management

Data management is the process of profiling, cleaning, and transforming data sources into useful information. It covers the areas of data profiling, data cleaning, data exploration, data integration, and data transformation.

Currently, I am working on my part-time course's project, which uses Power BI for data profiling and data exploration of different datasets. It made me understand the importance of knowing the working datasets and the story we want to tell as a conclusion before deciding what to clean in the data cleaning process.

The project also made me understand that even when there are incorrect values or data that should be cleaned, and cleaning is usually encouraged, it is not always required. I believe it is acceptable as long as we can justify which data we clean and which we leave as-is in our work, and how that affects the overall data exploration.

Each step serves a different purpose in the ETL process.

  • Data Profiling – to get an overview of the data's structure and assess data quality.
  • Data cleaning – to improve data quality.
  • Data exploration – to use statistics and charts to get a sense of distributions or correlations.

Data Profiling

  • Get a visual profile of the data to assess its structure and quality. In Power BI, you get a table profile of summary statistics such as min and max values, distinct counts, and error or missing-value counts.
  • It allows us to identify missing values and inconsistencies in the data. Based on this information, we can further assess the quality and plan the data cleaning process.
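
Outside Power BI, a similar profile can be computed with a few lines of pandas. This is only a sketch; "dataset.csv" is a placeholder file name.

import pandas as pd

# A minimal profiling sketch; "dataset.csv" is a placeholder file name.
df = pd.read_csv("dataset.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes,          # data type of each column
    "missing": df.isna().sum(),  # count of missing values per column
    "distinct": df.nunique(),    # count of distinct values per column
})
print(profile)
print(df.describe())             # min, max, mean and quartiles for numeric columns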

The above view shows the Power Query Editor once the data is loaded into Power BI through the Get Data option. Power BI supports many types of data files.

One limitation of these summary statistics is that Power BI profiles the first 1,000 records only by default. In other words, Power BI will not show us data errors or missing values from the 1,001st row onward.

Similar data profiling features can be found in other software such as SAS, which allows users to check data quality with its tools.

Data Cleaning

Data cleaning or data cleansing is the process of detecting and correcting (or removing) incorrect values or records from a dataset by replacing, modifying, or deleting the dirty data. Several things we can do during data cleaning include the following (a small sketch follows the list):

  • Remove duplication
  • Change text to lower/upper/proper case
  • Spell check
  • Remove extra spaces
  • Treat blank cells
  • Standardization
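
Below is a small pandas sketch of what a few of these steps can look like in code; "dataset.csv" and the column names "name" and "country" are made up for illustration.

import pandas as pd

# A minimal cleaning sketch; file and column names are placeholders.
df = pd.read_csv("dataset.csv")

df = df.drop_duplicates()                             # remove duplication
df["name"] = df["name"].str.strip()                   # remove extra spaces
df["name"] = df["name"].str.title()                   # change text to proper case
df["country"] = df["country"].replace(                # standardization of values
    {"UK": "United Kingdom", "U.K.": "United Kingdom"})
df["country"] = df["country"].fillna("Unknown")       # treat blank cells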

Are data cleaning techniques essential?

The answer is yes. We spend around 80% of our time on data cleaning. It is not only essential for data analytics and data science, it is also the most time-consuming part: making sure the data matches the correct fields, integrates well with the rest, and is easy to visualize.

Data Exploration

This is where the storytelling begins, after all the datasets are cleaned and integrated. How well the data is transformed from its raw values and integrated with the other datasets affects the overall quality of the data. It is important to have a set of quality datasets before beginning data exploration.

Many times in my experience of data exploration, I found more data that needed to be cleaned in order to get the right insights or stories for my dashboard. Ideally, these issues should be identified during data profiling, and any wrong outcomes from data exploration may be due to outliers we overlooked earlier.


What is the difference between Distinct and Unique in Power BI?

I looked at the summary statistics I shared previously and was a bit confused about the difference between "Distinct" and "Unique", since they seem to have almost the same meaning. In one of the datasets, the name column has a "Distinct" value of 740. This means there are 740 different names in the dataset.

Then why is the "Unique" value not 740 as well?

Here is the explanation I found, in the most layman terms: "Distinct" means the total number of different values, regardless of how many times each appears in the dataset. A name that appears in the list multiple times is counted as one distinct value.

"Unique", on the other hand, is the total number of values that appear only once. So there are 740 distinct names in the dataset, and out of those, 485 names have only one record.

It depends on the business use case. For example, my project states that for client names appearing more than once, we keep the first record. After that data cleaning, the "Distinct" value and the "Unique" value become the same in the summary statistics.
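
To make the distinction concrete, here is a small pandas sketch with made-up names: nunique() corresponds to the "Distinct" count, while the "Unique" count is the number of values that appear exactly once.

import pandas as pd

# Made-up example data for illustration.
names = pd.Series(["Alice", "Bob", "Alice", "Carol", "Dave", "Dave"])

distinct = names.nunique()                        # 4 different values ("Distinct")
unique_once = (names.value_counts() == 1).sum()   # 2 values appear only once ("Unique")
print(distinct, unique_once)                      # -> 4 2

# Keep the first record for names appearing more than once;
# after this, the Distinct and Unique counts are equal.
deduped = names.drop_duplicates(keep="first")
print(deduped.nunique(), (deduped.value_counts() == 1).sum())   # -> 4 4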

My experience using Power BI

As I mentioned in the previous entry, which data visualization tool to use depends on what kind of data you have, the purpose of your project, and how you want to present it. My project requires me to demonstrate the ability to perform data exploration using descriptive statistics and to apply data cleaning techniques using Power BI.

I was given 3 sets of data and was told the data is messy because it was compiled from different external sources. I installed the Power BI tool on my machine. There is an option called "Get Data" under the Home menu, which can load many types of data sources, from files to databases, online services, and the cloud.

Once a file is selected, a .csv file in my case, it loads into a dialog box to preview the source file. From here, we can load or edit the data. We can edit the data at any time in the Power Query Editor, so I chose to load the data first by clicking the "Load" button.

I loaded all 3 sets, and under the "Fields" pane on the right side of the tool you can view the column names for each dataset. I noticed something wrong with the country code dataset: it did not read the first row as the header, so my column names became "Column1", "Column2", "Column3". Usually, Power BI is able to identify the header row, as it did for the other 2 sets of data.

How to set the first row as the header?

I changed to the "Model" view, which can be selected from the sidebar on the left side of the tool. The other 2 buttons on the left sidebar are the "Report" view, where we do our visualization, and the "Data" view, where we can view our data. Again, there is a limit on how many rows of data can be previewed, so be aware of this point.

In the "Model" view, I changed the column names by selecting the "countrycode" table and clicking the "Edit Query" button under the Home menu. This opened the Power Query Editor I mentioned earlier, which has a function called "Use First Row as Headers" in the Home menu.

This editor gives us the data profiling view and allows us to perform data cleaning and check the data quality through its summary statistics and column distributions. I also noticed the column names were updated accordingly.
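
For comparison, a rough pandas analogue of "Use First Row as Headers" might look like the sketch below; "countrycode.csv" is a placeholder for my country code file.

import pandas as pd

# Read without a header, so columns come out as 0, 1, 2, ...
df = pd.read_csv("countrycode.csv", header=None)

# Promote the first data row to column headers, similar to
# "Use First Row as Headers" in the Power Query Editor.
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)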

Data Profiling

It is the process of examining the data available from data sources, either flat files or databases. It collects and shows summary statistics such as counts, min and max values, unique and distinct values, and other informative summaries about the data.

It allows us to understand the data before deciding on the data cleaning process. More often than not, it is a toolbox of business rules and analytical algorithms to discover, understand, and potentially expose inconsistencies in your data. This helps to improve data quality and maintain the health of the data.

Why do we need data profiling?

"Data that is not formatted right, standardized or correctly integrated with the rest of the database can cause delays and problems that lead to missed opportunities, confused customers and bad decisions."

Totally correct. Uncovering erroneous data within a dataset helps you ensure that your data meets standard statistical measures as well as business-specific rules. A data quality check helps to identify incorrect or ambiguous data and gives ideas on which data to clean.

You can check whether the data is consistent and formatted correctly, for example through pattern matching on the values. Pattern matching helps you understand whether a field is text based or number based.
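
A minimal sketch of such a pattern check in pandas, using made-up values:

import pandas as pd

# Made-up values for illustration.
values = pd.Series(["120", "85", "n/a", "42.5", "abc"])

# Flag entries that do not look like numbers (integer or decimal).
is_numeric = values.str.match(r"^-?\d+(\.\d+)?$")
print(values[~is_numeric])   # -> "n/a", "abc"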

Lastly, relationship discovery: you want to know how the datasets are related or connected to one another, starting with metadata analysis to determine key relationships.

At the end of the day, once the data is cleaned, integrated, and correctly formatted, you will get a dataset diagram that looks similar to the screenshot.

Data Visualization

The story does not end here. Data visualization comes in after we have done the data crunching, so that the data can be useful for analysis and visualization. Power BI covers a number of different charts and graphs, which can easily be found in its marketplace.

Histogram

I added a histogram from the marketplace and pulled some data from my dataset to see the age distribution and the worth-in-billions distribution. It is quite interesting to look at data in a histogram and start asking questions from it.
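
Outside Power BI, the same kind of histogram can be sketched with pandas and matplotlib; the file name "billionaires.csv" and the columns "age" and "worth_billions" are assumptions for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# A minimal histogram sketch; file and column names are placeholders.
df = pd.read_csv("billionaires.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["age"].plot.hist(bins=20, ax=axes[0], title="Age distribution")
df["worth_billions"].plot.hist(bins=20, ax=axes[1], title="Worth (billions)")
plt.tight_layout()
plt.show()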

Scatter Plot & Line Chart

A scatter plot is a type of plot that shows points whose pattern reveals any correlation present. A line chart (or line plot, or line graph) is a type of chart that displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields.
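
A quick matplotlib sketch of both chart types, again with assumed file and column names ("year" is also made up):

import pandas as pd
import matplotlib.pyplot as plt

# Placeholders for illustration only.
df = pd.read_csv("billionaires.csv")

# Scatter plot: the pattern of points hints at any correlation.
df.plot.scatter(x="age", y="worth_billions", title="Age vs worth")

# Line chart: data points ("markers") connected by straight segments.
df.groupby("year")["worth_billions"].sum().plot.line(marker="o", title="Total worth by year")

plt.show()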

There are plenty of different charts and graphs that can be used in Power BI. The important question is knowing which charts and graphs to use in data visualization; it is important to choose the right visual for the data you are presenting.

Data Visualization Tool

Recently, I started using Power BI for my studies and did a quick search comparing tools used in data visualization. As many of us know, data visualization is one of the first and most important steps in uncovering insights from data.

As it becomes more important, some of the top data visualization best practices include standardized methodologies, clear classification, data profiling, and presenting data based on its audience.

What features are needed in Data Visualization Software?

One useful feature in many visualization tools is the ability to choose among different visuals and graphs. Different types of charts or graphs represent different types of data; in other words, no single graph or chart can be used to present all types of data.

Data visualization tools should have accurate trend-tracking capability. For example, you can see which products are selling each month and the demographics of the people buying those products.

Depending on what type of information you need, you might also require a high level of security. It is crucial to protect your data because there may be people who want to break into your account and steal important information, and without proper security measures you are simply putting your company and data at risk.

Lastly and most importantly, the software has to be easy to use. A simplified user interface and easy-to-understand tips that help users work with the software right away can keep them from getting confused while trying to decode and interpret lots of data.

What are the top data visualization tools in 2019?

Sometimes I wonder whether I need to search for an answer to this question when people always just say they are using Tableau or QlikSense. However, I notice there are a bunch of other data visualization tools out there, and many of them serve different purposes in the business and come with different features and customization levels.

Which visualization tool is best to give insight and present data depends on your data. Some tools are desktop based, some are available by publishing links to the audience, and some are conveniently available in the cloud.

Let me show you a table of some of the data visualization tools available in the market. I have not used all of them before.

If you are a Business Analyst or BI Developer, you may have seen or used Tableau, Power BI, QlikView (QlikSense), or even IBM Watson Analytics and SAP Analytics Cloud.

If you are a Developer or Data Engineer looking for a chart plugin or generator, then consider Plotly, Highcharts, D3.js, or MATLAB.

Before I end my write-up, I want to share something I found on the Internet that is still related to visualization: visualization on large screens. One tool is Ali DataV, a drag-and-drop visualization tool from Alibaba Cloud, mainly used for big data visualization of business data combined with geographic information.

Another is Digital Hail, which focuses on data imaging, 3D processing, data analysis, and other related services. You can visualize and display data analysis results; it is used more in smart cities and industrial monitoring.

These are two visualization tools which I find impressive and new to me.

Data Types : Statistics

There are two main data types in statistics: qualitative data and quantitative data. In the previous topic, we discussed the "Levels of Data Measurement", where we talked about nominal, ordinal, interval, and ratio. How do these measurements relate to qualitative and quantitative data?

Qualitative data corresponds to the nominal and ordinal measurements, which describe a "feature" of a data object. Meanwhile, quantitative data refers to data that can be counted or measured with numbers; it corresponds to the interval and ratio measurements.

More examples distinguishing qualitative data from quantitative data are shown below:

Going further, quantitative data has sub-types called discrete and continuous. They differ in a few areas; see the table below:

Discrete data consists of whole numbers (integers) and cannot be subdivided into smaller and smaller parts.

Continuous data can take any value within a range and can be measured to finer and finer precision.

Source:
https://www.mymarketresearchmethods.com/data-types-in-statistics/

Levels of Data Measurement : Statistics

Last week, during my 2nd class in Business Intelligence, a statistics topic on levels of measurement was discussed. The lecturer tried her very best to explain the differences between each of the levels.

Nominal, Ordinal, Interval or Ratio.

In statistics, there are four levels of data measurement: nominal, ordinal, interval, and ratio (sometimes interval and ratio are called by other names, such as continuous and scale).

I think it is important for researchers to understand this theoretical part of statistics to determine which statistical analysis suits their problem statements. For students like me, it is good enough to be able to differentiate them; I was told the exam paper would not ask us to differentiate, but theoretically we have to understand what each of them is.

There are a number of statistics articles online which explain this, and I found that the website http://www.mymarketresearchmethods.com gave me a better understanding. You can refer to the link below for the write-up, and I will explain a bit here too.

It is quite easy to distinguish the nominal and ordinal measurements.

Nominal & Ordinal

The first level of measurement is nominal. The numbers in the variable are used only to classify the data. Words, letters, and alphanumeric symbols can be used as the values in the variable (without quantitative value). The best example is gender: male or female.

Nominal data can be ordered or unordered; gender is unordered. Ordered nominal data can be something like cold, warm, hot, and very hot.

The second level of measurement is ordinal. With ordinal scales, the order of the values is what is important and significant, but the differences between them are not really known. An example is ranking students' scores No.1, No.2, and No.3, with the highest score being No.1 and the second highest No.2.

However, I get a bit confused by the statement above that "the differences between each one is not really known". Scores and ranks do seem to tell the differences, unless we use exam grades such as A+, A, and A-. Do you agree with this example?

Maybe I shall follow what the website says and use satisfaction level as the example: we cannot quantify how much better one level is than another.

Interval & Ratio

The third level of measurement is interval. Interval scales are numeric scales in which we know the classification, the order, and the exact differences between the values. I picked up the explanation from the same website.

Like the others, you can remember the key points of an "interval scale" pretty easily. "Interval" itself means "space in between", which is the important thing to remember: interval scales not only tell us about order, but also about the value between each item.

For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees.

Here's the problem with interval scales: they don't have a "true zero." For example, there is no such thing as "no temperature," at least not with Celsius. On an interval scale, zero doesn't mean the absence of value but is actually another number on the scale, like 0 degrees Celsius. Negative numbers also have meaning. Without a true zero, it is impossible to compute ratios. With interval data, we can add and subtract, but we cannot multiply or divide.

Consider this: 10 degrees C + 10 degrees C = 20 degrees C. No problem there. However, 20 degrees C is not twice as hot as 10 degrees C, because there is no such thing as "no temperature" on the Celsius scale. Converted to Fahrenheit it becomes obvious: 10 degrees C = 50 degrees F and 20 degrees C = 68 degrees F, which is clearly not twice as hot.

The fourth level of measurement is ratio. Ratio scales tell us about the order, they tell us the exact value between units, AND they also have an absolute zero, which allows a wide range of both descriptive and inferential statistics to be applied. Good examples of ratio variables include height, weight, and duration. These variables can be meaningfully added, subtracted, multiplied, and divided (ratios).

Sources:
https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
https://www.statisticssolutions.com/data-levels-of-measurement/

Day 42: Importing Data in Python

Introduction

Importing data from various sources such as:
– flat files, e.g. txt, csv
– files from software, e.g. Excel, SAS, MATLAB files
– relational databases, e.g. SQL Server, MySQL, as well as NoSQL databases.

Reading a text file
Use the open() function to open a connection to a file, passing two parameters:
– filename
– mode, e.g. 'r' for read, 'w' for write.

Then, assign a variable to read() the file. Lastly, close() the file once the process is done to close the connection to the file.

If you wish to print out the text in the file, use a print() statement.

Syntax:
file = open(filename, mode='r')
text = file.read()
file.close()

print(text)

The filename can be assigned to a variable instead of writing it directly in the open() function's parameter. It is always best practice to close() an opened connection.

To avoid forgetting to close a file connection (missing file.close() at the end of our code), it is advisable to use a context manager, which uses the with statement.

In my previous posts, some of the tutorials used a context manager to open a file connection to read .csv files. The with statement executes the open() command and creates a context in which commands can be executed while the file is open. Once out of this clause or context, the file is no longer open; for this reason it is called a context manager.

What it is doing here is called 'binding' a variable in the context manager construct; within this construct, the variable file is bound to open(filename, 'r').
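
A minimal sketch of that binding, using the same moby_dick.txt sample file as the tutorial below:

# The with statement opens the file and binds it to the variable file.
with open('moby_dick.txt', mode='r') as file:
    text = file.read()    # file is only open inside this block

print(file.closed)        # True: the file was closed automatically
print(text[:100])         # first 100 characters of the content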

Let me share the tutorials I did on DataCamp's website.

# Open a file: file
file = open('moby_dick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

First, it opens the file using the open() function, passing the filename and read mode, and then reads the file. The first print() statement prints the content of the moby_dick text file. The second print() statement returns a Boolean indicating whether the file is closed; in this case, it returns False. Lastly, it closes the file using the close() function and checks the Boolean again. This time, it returns True.

Importing text files line by line
For a larger file, we do not want to print out everything inside the file; we may only want to print several lines of the content. To do this, use the readline() method.

When the file is open, use file.readline() to print each line. See the code below, which uses the with statement and file.readline() to print the first few lines of the file.

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Summary of the day:

  • Importing data from different sources.
  • Reading from a text file using open() and read().
  • Importing text line by line using the readline() method.
  • The with statement as a context manager.
  • close() to close a file.
  • file.closed returns a Boolean indicating whether the file is closed.