As I mentioned in the previous entry, the choice of data visualization tool depends on what kind of data you have, the purpose of your project and how you want to present it. My project requires me to demonstrate that I can explore data using descriptive statistics and apply data cleaning techniques in Power BI.
I was given 3 sets of data. I was told that the data is messy because it was compiled from different external sources. I opened Power BI on my machine. There is an option called “Get Data” under the Home menu. It can load many types of data sources, from files and databases to online and cloud services.
Once the file is selected (in my case, a .csv file), it loads into a dialog box that previews the source file. From here, we can either load or edit the data. We can edit the data at any time in the Power Query Editor, so I chose to load the data first by clicking the “Load” button.
I loaded all 3 sets. Under the “Fields” explorer on the right side of the tool, you can view the column names for each dataset. First, I noticed something wrong with the country code dataset. Power BI did not read the first row as the header, so my column names turned into “Column1”, “Column2”, “Column3”. Usually, Power BI is able to identify the header in the Excel file, as it did for the other 2 sets of data.
How to set the first row as the header?
I switched to the “Model” view, which can be selected from the sidebar on the left side of the tool. The other 2 buttons on the left sidebar are the “Report” view, where we build our visualizations, and the “Data” view, where we can view our data. Again, there is a limit on how many rows of data can be previewed, so be aware of this point.
In the “Model” view, I changed the column names by selecting the “countrycode” table and clicking the “Edit Query” button under the Home menu. This opened the Power Query Editor, which I mentioned earlier. There is a function called “Use First Row as Headers” in the Home menu.
This editor gives us a data profiling view. It allows us to clean the data and check data quality through its summary statistics and column distribution. I also noticed that the column names have been updated accordingly.
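To show what promoting the first row to headers means, here is a small Python sketch of the same idea outside Power BI. It is only an illustration, not what Power BI runs internally. The sample rows and the column names `code`, `name` and `region` are my own assumptions, since the real contents of the country code file are not shown here.

```python
import csv
import io

# Hypothetical sample standing in for the "countrycode" file; the real
# column names and values are assumptions made for this illustration.
raw = "code,name,region\nMY,Malaysia,Asia\nSG,Singapore,Asia\n"

rows = list(csv.reader(io.StringIO(raw)))

# Without header detection, every row is treated as data and the columns
# get placeholder names, similar to Column1, Column2, Column3.
placeholder_names = [f"Column{i + 1}" for i in range(len(rows[0]))]

# Promoting the first row: use it as the column names and drop it from the data.
header, data = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in data]

print(placeholder_names)
print(records[0])
```

After the promotion, each record is keyed by the real column names instead of the placeholder ones.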
Data profiling is the process of examining the data available from the data sources, either flat files or databases. It collects and shows summary statistics such as count, min and max values, unique and distinct values, and other informative summaries about the data.
It allows us to understand the data before we decide on the data cleaning process. More broadly, it is a toolbox of business rules and analytical algorithms used to discover, understand and potentially expose inconsistencies in your data. This helps to improve data quality and maintain the health of the data.
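The summary statistics mentioned above can be sketched in a few lines of Python. This is a minimal illustration with made-up ages, not the profiling engine Power BI uses. I am assuming that "distinct" means the number of different values and "unique" means the number of values that appear exactly once, which is how I understand the column statistics in the Power Query Editor.

```python
from collections import Counter

def profile_column(values):
    """Return simple profiling statistics for one column of values."""
    # Treat None as a missing (empty) entry. Assumes at least one value
    # is present, otherwise min() and max() would raise an error.
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(present),
        "empty": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        # distinct: how many different values appear at all
        "distinct": len(counts),
        # unique: how many values appear exactly once
        "unique": sum(1 for n in counts.values() if n == 1),
    }

# Hypothetical ages with one missing entry (not taken from my dataset).
ages = [34, 51, 51, 67, None, 45, 34, 72]
stats = profile_column(ages)
print(stats)
```

Even on a tiny column like this, the gap between the count and the distinct count already hints at repeated values worth a second look.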
Why do we need data profiling?
“Data that is not formatted right, standardized or correctly integrated with the rest of the database can cause delays and problems that lead to missed opportunities, confused customers and bad decisions”.
Totally correct. Uncovering erroneous data within the dataset helps you ensure that your data meets standard statistical measures as well as your specific business rules. A data quality check helps to identify incorrect or ambiguous data and suggests which data needs to be cleaned.
You can check whether the data is consistent and correctly formatted through pattern matching, for example on the data types. Pattern matching helps you understand whether a field is text-based or number-based.
Lastly, relationship discovery. You want to know how the datasets are related or connected to one another, starting with metadata analysis to determine key relationships.
At the end of the day, once the data is cleaned, integrated and correctly formatted, you will get a dataset diagram similar to the one in the screenshot.
The story does not end here. Data visualization comes in once we have done some data crunching, so that the data is useful for analysis and visualization. Power BI offers a number of different charts and graphs, and more can easily be found in its marketplace.
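As a rough illustration of pattern matching on data types, the sketch below classifies each raw value in a column as a number, text or empty using a regular expression. The sample values and the rule for what counts as a number are my own assumptions. A real dataset may need extra patterns, for example for dates or thousands separators.

```python
import re
from collections import Counter

# Assumed rule: an optional sign, digits, and an optional decimal part.
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")

def classify(value):
    """Label one raw string value as 'empty', 'number' or 'text'."""
    text = value.strip()
    if text == "":
        return "empty"
    if NUMBER.fullmatch(text):
        return "number"
    return "text"

def profile_types(values):
    """Count how many values in a column fall under each label."""
    return Counter(classify(v) for v in values)

# Hypothetical "worth in billions" column with a few messy entries.
worth = ["12.5", "3", "N/A", "", "7.0", "4,2"]
type_counts = profile_types(worth)
print(type_counts)
```

A column that is supposed to be numeric but shows some "text" entries is a good candidate for cleaning.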
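Metadata analysis looks at column names and types. A simple complementary check, sketched here in Python, looks at the values instead: is a column usable as a key, and does every value in the related table find a match? The table contents and the helper names `is_candidate_key` and `unmatched` are hypothetical and only serve the illustration.

```python
def is_candidate_key(values):
    """A column can act as a key only if no value repeats."""
    return len(values) == len(set(values))

def unmatched(child_values, parent_keys):
    """Return child values that have no matching key in the parent table."""
    return sorted(set(child_values) - set(parent_keys))

# Hypothetical data: a country code table and a table that refers to it.
country_codes = ["MY", "SG", "US"]
people_countries = ["MY", "US", "US", "UK"]

print(is_candidate_key(country_codes))
print(unmatched(people_countries, country_codes))
```

Any value returned by `unmatched` points to rows that would be left without a partner when the two tables are related.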
I added a histogram from the marketplace and pulled some data from my dataset to see the age distribution and the worth-in-billions distribution. It is quite interesting to look at the data in this histogram and start asking questions about it.
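Under the hood, a histogram simply groups values into bins and counts them. Here is a minimal Python sketch of that idea. The ages and the bin width of 10 are assumptions for the example, not values from my dataset or settings taken from the Power BI visual.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin, where each bin is labelled by its lower edge."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

# Hypothetical ages.
ages = [34, 51, 51, 67, 45, 34, 72, 58]
age_bins = histogram(ages, 10)

# Print a simple text version of the chart.
for lower, count in age_bins.items():
    print(f"{lower}-{lower + 9}: {'#' * count}")
```

Changing the bin width changes the shape of the chart, so it is worth trying a few widths before drawing conclusions.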
Scatter Plot & Line Chart
A scatter plot is a type of plot in which the pattern of the resulting points reveals any correlation present. A line chart (also called a line plot or line graph) is a type of chart that displays information as a series of data points called ‘markers’ connected by straight line segments. It is a basic type of chart common in many fields.
There are plenty of different charts and graphs that can be used in Power BI. The important question is which kind of chart or graph to use. It is important to choose the right visual for presenting the data.
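The correlation that a scatter plot shows visually can also be put into a number. One common measure is the Pearson correlation coefficient, sketched below in plain Python. The sample points are made up for the illustration, and Pearson is my own choice of measure here. The scatter plot itself does not depend on it.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither list is constant, otherwise this divides by zero.
    return cov / sqrt(var_x * var_y)

# Hypothetical points: a rising line and a falling line.
xs = [1, 2, 3, 4]
rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]

print(pearson(xs, rising))
print(pearson(xs, falling))
```

A value close to 1 means the points rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no clear linear pattern.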