Charting Guideline on Tableau: How to Decide Which Chart to Use

The following was shared by the instructor of a Udemy online course I subscribed to, called Tableau for Beginners: Get CA Certified, Grow Your Career.

Okay, now back to the original question that I think most people ask: how do you decide which chart to use in different situations? The instructor shared some guidance that may help us understand and practice more in Tableau, so that we become familiar with the tool and are able to pick the right chart next time.

Most of the time, when you want to show how a numeric value differs across categories, bar charts are the way to go. The eye is very good at making comparisons based on length (as compared with differences in angle, color, etc.).

If you are showing change over a date range, you will want to use a line chart.

Histograms and box plots are used to show the distribution of data.

Scatter plots show how two continuous variables are related.

There is also more detail in this guide: https://www.tableau.com/learn/whitepapers/tableau-visual-guidebook. It talks about how to use color and other visual elements to add more information to your chart.


Intermediate Python for Data Science: Logic, Control Flow and Filtering

Boolean logic is the foundation of decision-making in Python programs. Learn about different comparison operators, how to combine them with Boolean operators, and how to use the Boolean outcomes in control structures. Also learn to filter data in pandas DataFrames using logic.

When I first started to learn Python, there was a topic on Boolean and comparison operators, where I studied Booleans (True and False), logical operators ('and', 'or', 'not') and comparison operators ('==', '!=', '<' and '>').

Comparison operators tell how two Python values relate and return a Boolean. They allow you to compare two numbers, strings or any variables of the same type. Comparing values of incompatible types with ordering operators (such as < or >) raises an error, because Python cannot tell how two objects of different types relate (equality checks like == are allowed and simply return False).
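As a minimal illustration of this behaviour (the values here are made up):

print(2 < 3)             # True: comparing two integers
print("apple" < "pear")  # True: strings compare alphabetically
print(2 == "2")          # False: equality across types is allowed, it just returns False

# Ordering comparisons across incompatible types raise a TypeError
try:
    print(2 < "3")
except TypeError as error:
    print("TypeError:", error)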

Comparing a NumPy array with an integer

Based on the example above, taken from a tutorial in the DataCamp online learning course that I am currently taking, the variable bmi is a NumPy array, and it is compared against the value 23. It works perfectly and returns Boolean values. Behind the scenes, NumPy builds an array of the same size and performs an element-wise comparison against the number 23.
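Since the original screenshot is not reproduced here, below is a minimal sketch of that comparison; the bmi values are made-up sample numbers, not the actual DataCamp data:

import numpy as np

# Illustrative BMI values
bmi = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

# Element-wise comparison returns a Boolean array of the same size
print(bmi > 23)   # [False False False  True False]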

Boolean operators with Numpy

To use these operators with NumPy, you will need np.logical_and(), np.logical_or() and np.logical_not(). Here’s an example using the my_house and your_house arrays from before to give you an idea:

np.logical_and(your_house > 13, 
               your_house < 15)

Refer to below for the sample code:

# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))

The first print statement checks the ‘or’ condition: if either of the two conditions is True, it returns True. The second print statement checks the ‘and’ condition: both comparisons have to be True for it to return True. The output of the execution is returned as Boolean arrays, as below:

[False  True False  True]
[False False False  True]

Combining Boolean operators and comparison operators with conditional statements: if, else and elif.

It follows the if statement syntax. The simplest code that can be used to explain the above:

z = 4
if z % 2 == 0:
  print('z is even')

The same goes for the if-else statement with a comparison operator; see the code below:

z = 5
if z % 2 == 0:
  print('z is even')
else:
  print('z is odd')

It also works if you are using if, elif and else statements. See the code below:

z = 6
if z % 2 == 0:
  print('z is divisible by 2')
elif z % 3 == 0:
  print('z is divisible by 3')
else:
  print('z is neither divisible by 2 nor 3')

In the example above, both the first and second conditions match; however, in this control structure, once Python hits a condition that returns True, it executes the corresponding code and then exits the control structure. It will not evaluate the next condition, corresponding to the elif statement.

Filtering Pandas DataFrame

For an example taken from DataCamp’s tutorial, using the DataFrame below, select the countries with an area over 8 million km². There are 3 steps to achieve this.

Step 1: Select the area column from the DataFrame. Ideally, this returns a pandas Series, not a pandas DataFrame. Assuming the DataFrame is called brics, the area column is selected using:

brics["area"]

#alternatively it can use the below too:
# brics.loc[:, "area"]
# brics.iloc[:, 2]

Step 2: When the code adds the comparison operator to see which rows have an area greater than 8, it returns a Series containing Boolean values. The final step is to use this Boolean Series to subset the pandas DataFrame.

Step 3: Store this Boolean Series as ‘is_huge’ as below:

is_huge = brics["area"] > 8

Then, create a subset of the DataFrame using the following code; the result is shown in the screenshot:

brics[is_huge]

It shows the countries with areas greater than 8 million km². The steps can be shortened into 1 line of code:

brics[brics["area"] > 8]

It also works with the Boolean operators (np.logical_and(), np.logical_or() and np.logical_not()). For example, to look for areas between 8 and 10 million km², the single line of code can be:

brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

The result returned from the above code is Brazil and China.

Intermediate Python for Data Science I

The subjects in this DataCamp track, Intermediate Python for Data Science, include:

  • Matplotlib
  • Dictionaries and Pandas
  • Logic, Control Flow and Filtering
  • Loops

It looks at data visualization (how to visualize data) and data structures (how to store data). Along the way, it shows how control structures customize the flow of your scripts.

Data Visualization

It is one of the key skills for data scientists, and Matplotlib makes it easy to create meaningful and informative charts. Matplotlib allows us to build various charts and customize them to make them more visually interpretable. It is not a hard thing to do and it is pretty interesting to work on. In my previous write-up, I wrote about how to use Matplotlib to build a line chart, a scatter plot and a histogram.

Data visualization is a very important part of data analysis. It helps to explore the dataset and extract insights. I call this data profiling: the process of examining a dataset coming from an existing data source, such as a database, and producing statistics or summaries of it. The purpose is to find out whether the existing data can be used for other purposes, and to determine the accuracy, completeness and validity of the dataset. I relate this to “performing a body check on the dataset to ensure it is healthy”.

One of the methods I learned at school for data profiling is the use of histograms, scatter plots and box plots to examine the dataset and find the outliers. I can use Python’s Matplotlib, Excel, Power BI or Tableau to do this.

It does not end here…

Python allows us to customize the charts to suit our data. There are many types of charts and customizations one can do with Python, from changing colours and labels to axes’ tick sizes. It depends on the data and the story one wants to tell. Refer to the links above to read my write-ups on those charts.

Dictionaries

We can use lists to store a collection of data and access the values using indexes. This can be troublesome and inefficient when it comes to large datasets; therefore, the use of dictionaries in data analysis is important, as they represent data in the form of key-value pairs. Creating a dictionary from lists of data can be found in this link, which has one simple example demonstrating how to convert it. However, I do have a question: how about converting long lists to a dictionary? I assume it is not going to be the same method as in this simple example. Does anyone have an example to share?
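One possible approach, shown here as a minimal sketch (the country and capital values are only illustrative), is to zip two lists together; this works the same way no matter how long the lists are:

# Two parallel lists (illustrative values)
countries = ["Brazil", "Russia", "India", "China", "South Africa"]
capitals = ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"]

# zip() pairs the elements up; dict() turns the pairs into key-value pairs
capitals_by_country = dict(zip(countries, capitals))
print(capitals_by_country["India"])   # New Delhi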

If you have questions about dictionaries, you can refer to my blog, where I wrote a fairly comprehensive introduction to dictionaries in Python.

What is the difference between lists and dictionaries?

If you have a collection of values where order matters, and you want to easily select entire subsets, you will want to go with a list. On the other hand, if you need some sort of lookup table where looking up data should be fast, by specifying unique keys, a dictionary is the preferred option.

Lastly, Pandas

Pandas is a high-level data manipulation tool built on top of the NumPy package. Since a 2D NumPy array allows only one data type for its elements, it may not be suitable for data structures that comprise more than one data type. In pandas, data is stored in a tabular structure called a DataFrame, for example:

How to build a DataFrame?

There are a few ways to build a pandas DataFrame, and we need to import the pandas package before we begin. In my blog, two methods are shared: using dictionaries and using an external file such as a .csv file. You can find the examples from the given link. Reading from a dictionary can be done by converting the dictionary into a DataFrame using DataFrame(), and reading from an external file can be done using pandas’ read_csv(); see the sketch after the list below.

  • Converting a dictionary using DataFrame()
  • Reading from an external file using read_csv()
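As a minimal sketch of both methods (the dictionary values and the brics.csv file name are assumptions for illustration):

import pandas as pd

# Method 1: convert a dictionary into a DataFrame
data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],   # in millions of km²
}
brics = pd.DataFrame(data, index=["BR", "RU", "IN", "CH", "SA"])

# Method 2: read an external .csv file (file name assumed)
# brics = pd.read_csv("brics.csv", index_col=0)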

How to read from a DataFrame?

The above screenshot shows what the pandas DataFrame looks like; it is in the form of rows and columns. If you wonder why the first column goes unnamed: yes, in the .csv file it has no column name. It appears to be an identifier for each row, just like a table index or row label. I have no idea whether the content of the file was designed with this purpose or whether it has another meaning.

Index and Select Data

There are two methods you can select data:

  • Using square bracket []
  • Advanced methods: loc and iloc.

The advanced methods, loc and iloc, are pandas’ powerful, advanced data access methods. To access a column using square brackets, with reference to the above screenshot again, the following code demonstrates how to select the country column:

brics["country"]

The result shows the row labels together with the country column. Reading a DataFrame this way returns an object called a pandas Series; you can think of a Series as a one-dimensional labelled array, and when a bunch of Series come together, they form a DataFrame.

If you want to do the same selection of the country column and keep the data as a DataFrame, then double square brackets do the magic with the following code:

brics[["country"]]

If you check the type of the object, it is a DataFrame. You can specify more than one column to be returned. To access rows using square brackets and slices, with reference to the same screenshot, the code below is used:

brics[1:4]

The result returns rows 2 to 4 (index 1 to 3), which contain Russia, India and China. Do you still remember the characteristic of a slice? The stop value (end value) of a slice is exclusive (not included in the output).

However, this method has a limitation. For example, you cannot access the data the way you would with a 2D NumPy array, where square brackets take a specific row and column:

my_array[row, column]

Hence, pandas has the powerful and advanced data access methods loc and iloc, where loc is label-based and iloc is position-based. Let us look into the usage of loc. The first example selects rows with loc, followed by another example selecting rows and columns with loc. With the same concept as above, single square brackets return a Series and double square brackets return a DataFrame, just as below:

brics.loc["RU"] --Series single row
brics.loc[["RU"]] --DataFrame single row
brics.loc[["RU, "IN", "CH"]] --DataFrame multiple row

Let us extend the above code to read the country and capital columns using rows and columns with loc. The first part specifies the row labels and the second part specifies the column labels. The code below returns a DataFrame.

brics.loc[["RU, "IN", "CH"], ["country", "capital"]]

The above rows values can be replaced with slice, just like the sample code below:

brics.loc[:, ["country", "capital"]]

The above code does not specify the start and end index, which means it returns all the rows with the country and capital columns. Below is a screenshot comparing square brackets and loc (label-based).

Using iloc is similar to using loc; the only difference is how you refer to rows and columns, using integer positions (indexes) instead of row and column labels.
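As a minimal sketch of the iloc equivalents of the loc examples above (assuming the same brics DataFrame, where "RU" is the second row and country and capital are the first two columns):

brics.iloc[1]                    # Series, single row (same as brics.loc["RU"])
brics.iloc[[1]]                  # DataFrame, single row
brics.iloc[[1, 2, 3], [0, 1]]    # DataFrame, rows RU/IN/CH, columns country/capital
brics.iloc[:, [0, 1]]            # all rows, country and capital columns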

Python: formatter

Below shows how to do more complicated string formatting. Refer to the sample code below:

formatter = "{} {} {} {}"

print(formatter.format(1,2,3,4))
print(formatter.format("one","two","three","four"))
print(formatter.format(False, False, True, False))
print(formatter.format(formatter, formatter, formatter, formatter))
print(formatter.format(
  "First thing",
  "that we can try",
  "maybe is having",
  "a line of sentence"))

It uses something called a function to turn the formatter variable into other strings. When the code writes formatter.format, it tells Python to do the following:

  • Take the formatter string declared in the first line.
  • Call its format function.
  • Pass the format function 4 arguments, which match up with the 4 curly brackets {} in the formatter variable.
  • The result of calling format on formatter is a new string with the {} replaced by the four arguments.

This is what the print statements print:

1 2 3 4
one two three four
False False True False
{} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {}
First thing that we can try maybe is having a line of sentence

Python: Introduction IV

It has been a while since I wrote about Python Introduction I, II and III. Today, I am going to complete the last part of the introduction, NumPy. Months ago, during my Python self-learning time, I wrote about NumPy; here is the link.

NumPy

The NumPy array is an alternative to the Python list; it helps us solve problems with the limitations of Python list operations. Calculations over Python lists cannot be done in the same way as for two integers or strings. The package needs to be installed before we can import and use it.

In my blog above, I wrote about the behaviour of the NumPy array. It does not allow different types of elements in the array. When a NumPy array is built, the elements’ data types are changed to end up with a homogeneous array. For example, suppose the list contains a string, a number and a Boolean; everything is converted to string format.
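A minimal sketch of that type coercion (the values are made up):

import numpy as np

mixed = np.array(["body", 1.0, True])
print(mixed)        # ['body' '1.0' 'True'] - everything becomes a string
print(mixed.dtype)  # a Unicode string dtype such as <U32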

Also, the operators “+”, “-”, etc., which we used with Python lists, behave differently with NumPy arrays. Refer below for an example:

import numpy as np

py_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])

py_list + py_list          # list concatenation: [1, 2, 3, 1, 2, 3]
numpy_array + numpy_array  # element-wise addition: array([2, 4, 6])

The first output shows the two lists merged or concatenated into a single list. The second output shows an array containing the element-wise addition of the numbers. The screenshot below shows the result, which I executed in a Jupyter Notebook.

What is covered in the link above is good enough to give us a basic understanding of NumPy. If you wish to learn more, there is another link I found on Medium which we can refer to.

NumPy Subsetting

Specifically for NumPy, there is a way of doing subsetting using an array of Booleans. The example below shows how we can get all the BMI values above 23. Refer to the example from DataCamp.

The first result is returned as Booleans: True if the BMI value is above 23. Then, you can use this Boolean array inside square brackets to do the subsetting. Where the Boolean value is True, the corresponding value is selected.

In short, it is using the result of the comparison to make a selection of data.
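Reusing the illustrative bmi values from the earlier sketch, the subsetting looks like this:

import numpy as np

bmi = np.array([21.85, 20.97, 21.75, 24.74, 21.44])  # illustrative values

above_23 = bmi > 23    # Boolean array: [False False False  True False]
print(bmi[above_23])   # [24.74] - keeps only the values where the mask is True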

2D NumPy Array

I covered the 2D NumPy array in this link, where I show how to declare a 2D NumPy array and how it works for subsetting, indexing, slicing and math operations.
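As a quick, illustrative sketch of those operations (the values are made up):

import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71],
                  [65.4, 59.2, 63.6]])

print(np_2d.shape)    # (2, 3): 2 rows, 3 columns
print(np_2d[0, 1])    # 1.68: row 0, column 1
print(np_2d[:, 1:])   # all rows, columns 1 and 2 (slicing)
print(np_2d * 2)      # element-wise math operation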

NumPy: Basic Statistics

You can generate summary statistics of the data using NumPy. NumPy has a few useful statistical functions which can be used for analytics, including finding the min, max, average, standard deviation, variance, etc. of the elements in an array. Refer to my write-up on these basic statistics in this link.
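For instance, a minimal sketch using the illustrative bmi values from above:

import numpy as np

bmi = np.array([21.85, 20.97, 21.75, 24.74, 21.44])  # illustrative values

print(np.min(bmi), np.max(bmi))    # smallest and largest values
print(np.mean(bmi))                # average
print(np.median(bmi))              # median
print(np.std(bmi), np.var(bmi))    # standard deviation and variance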

Career Changed

Yesterday, I was looking at the photographs I took at the numerous events I have taken part in in Singapore and Malaysia since 2015. It was 6 years after I started working in Malaysia, and I thought it would be the right time for me to choose another path within the IT industry. I was an MSSQL Database Administrator with 3-4 years of experience. I did some C# coding before that too.

How did I get started?

One day, I came across an event organized by MindValley in Kuala Lumpur, Malaysia, where they wanted to share some knowledge about NoSQL. It was a new thing back then; it was something I had heard of before, but nobody was really using it. I decided to book a flight ticket and flew back to Kuala Lumpur just for that event, and spent a long weekend there. As soon as I landed in Kuala Lumpur, I went directly to the venue, which was in Bangsar. It was quite near KL Sentral, the last stop for the buses and trains from the airport. The whole journey sounded a bit crazy to some people when I shared it.

As I sat down on those colourful bean bags around the open area and enjoyed my pizza and drink, I spotted my university senior. She recognized me too and we talked for a while. She gave me some encouraging words, and that was how I began attending talks and events to learn more from different people.

Got into Data Science…

The same year, I joined a few online groups such as the Singapore SQL User Group, CodingGirls, TechLadies and Google Developers Group. I attended a few of the SQL user group’s meetups. They are experienced users who work in Singapore, and at each meetup they share different topics.

Back in Malaysia, there was a Data Science program organized by MDEC (Malaysia Digital Economy Corporation) in collaboration with Coursera to provide Malaysians a platform to learn Data Science and apply it at work. It was the first time I had heard about such a program, and I intended to join it with a few friends of mine.

I signed up and paid for the Coursera course called Introduction to Data Science in USD, while attending the online lectures and 6 offline lectures in Cyberjaya, Malaysia, in order to qualify for reimbursement of the course fee after completing it with certain grades. I made special arrangements to head back to Malaysia to attend those classes, and it proved worth it; some of the lecturers (each class had different lecturers or working adults) facilitated the learning and helped us understand and complete the course.

Then, I attended an event called Opportunities in Data Science Talk Sept 2015 to find out more about work opportunities in Singapore.

Well, it did not immediately give me an opportunity to join the Data Science field. This kind of event usually shares experience and knowledge with non-Data-Science people or those who want to know about the role. It gave me an insight into what Data Science could do to help me solve the problems faced by the management team of my previous company. I started to look at data from different angles of my life, and I love playing around with data to find out more. Data, eventually, became my boyfriend.

I was involved in quite a lot of Microsoft and Google events during that same year, as well as the following year. I loved meeting up with people and learning whatever I could from them. Sometimes, it was really hard to catch up with their tempo.

Slowly, I found myself attending these meetups more to check in at Microsoft’s office or Google’s office, or sometimes a local university’s auditorium, and to feed my stomach. I rarely got in touch with people and did not socialize throughout the events.

Microsoft’s old interior design of the open space used for events.

Google has a pretty cool open space after they moved to Pasir Panjang.

I need to do something…

After completing my Coursera course, I wanted to make a change in my career. It would have been great if I could have moved into Data Science since I had just finished the course; however, it turned out to be impossible in Singapore. Employers were looking for experienced data scientists to join their companies. With no experience, even a startup was kind of hard to get into. Then, I reflected on myself. The diagram below suggested that data analytics was my next career path instead. So, I decided to take a step back and explore.

In the end, at the end of 2016, I quit and started another role at another company in 2017 as a BI developer. The company required the developers to develop the ETL and use business intelligence software to generate reports and dashboards. Some programming skills such as Scala and ScalaJS were required. It was a steep turn in my life: from the Windows platform to the Linux platform, from databases to programming, from SQL Server to MongoDB.

It was a completely different life, and it took me quite a while to adjust, with help from the team, who I always found helpful. The team spirit was really high, and I almost thought this company could be the company I retired from. I thought I had found someone who works like Richard Branson.

One year later, I was lucky enough to get paired up for mentorship in a Data Science group in Singapore, but I was truly not ready for it, so it did not turn out well and was eventually discontinued.

Back then, I struggled a lot with Scala, and my mentor told me to look for a software engineer who is good at Scala to help me out. My mentor did not see that I was anywhere near Data Science. After a long email to my CTO, I came to a conclusion: my passion for coding is not really strong, and I did not intend to do it long term; at most, I would only code when required to clean or transform data as part of my data preparation.

Data Analytics or BigData?

In 2018, I was looking into Big Data. I wanted to know how BI, Data Analytics and Big Data fit into the picture and why people were so into Big Data. I even took annual leave to attend the 3-day Big Data World event in October 2018 to explore and look for possible job opportunities for my next career move.

After the event, I realized how wrongly I had defined Big Data in my own mind, and even to some people. Big Data does not have just one meaning. From then on, I was quite determined to get things right, to use the right terms and to explain things the correct way. After that, I thought the Data Analyst role was closer to what I wanted to do, and there are a lot of job opportunities out there. My client once told me there are some people working as Data Analysts and Data Scientists who do not really know what they are doing. It sounded like a glamorous job title that everyone was using. And I still followed the trend: I wanted to be a data analyst and looked into taking a short course to move into this role.

There are a lot of courses from the polytechnics in Singapore offering analytics programmes for working adults. At that point in time, I intended to do a Masters, either at SMU or NUS, or at least go back to school and study. I really needed to study analytics for real. However, as a foreign student I would need to take the GMAT exam. In the end, I decided I preferred something short that covers the essential topics in data analytics.

Being a volunteer

In Dec 2018, I volunteered for the Google Developers Group. I was at one of their big events organized at Google’s office at Pasir Panjang. I was told to assist at the registration area, and I could also attend the talks during that time. However, I chose to talk to people, for the first time. I learned from them, even though it was not related to what I was working on. I realized the importance of networking and socializing. It was a great experience, and subsequently I decided to register myself to help the TechLadies group.

I wanted to reach out to more people, especially those who want to study something and want some people to motivate each other. It is the right platform to begin with. My focus in the group is organizing study groups.

Charging toward analytics and system administration work

In the early months of 2019, I signed up for the Specialized Diploma in Business Analytics and the Specialized Diploma in Big Data at Temasek Polytechnic. They offered me a place in the first option. I had preferred the Big Data course over Analytics because I wanted to learn something different.

In April, I started school in Singapore, learning Business Analytics. It helped me to understand my job well. I got a real picture of data analytics, I am able to conceptualize better than before, and through in-class discussions with other coursemates, I became fascinated by how much data analytics can do. I resumed attending data analytics talks, but I chose what to attend. If an event gives me opportunities for networking and jobs, then it is worth going to. I do not need any more introduction sessions.

The same month, my volunteer work began. I started to learn how to juggle my time between work, school and study groups. I have conducted 4 study groups and there are more to go. Meanwhile, I am taking a break and will assist with the pre-bootcamp workshops. I think it will be a great experience to learn from a different team of people how to run a workshop.

On the other hand, I was exposed to system administration work while working with my colleague. He was kind enough to teach me and let me work with virtual machines and set up databases. Of course, all of this could be done by himself without my assistance.

It is something I had wanted to do in my previous job. And now I know how to do it. Besides that, one of my project partners and the IT team got me into setting up an encrypted database and connection pool. It is an interesting area: databases, system administration and security.

Image: Udureka.

Actually, is it data analytics or data engineering that interests me more?

About 5 months back, I started to think about this question. All the while, I thought I liked analytics. Yes, no doubt, but it is not entirely about analytics only. Data engineering gives me a sense of security. What kind of security am I talking about? I mean, it makes more sense to me if I know where the data that I use daily for analysis comes from, and how to process it from raw data into something clean and meaningful for analysis.

Studying helps me solve my puzzle. I am able to form a clearer picture after learning through the program. My assignments and projects help me understand things. Soon, I was quite confident to tell people that I am in love with another boyfriend: data engineering.

Data Engineering

Then, I started to compare the 3 roles: data analyst, data engineer and data scientist. Simple definitions are below:

  • A data analyst takes data and helps companies make better business decisions: collecting data, analysing data and making reports.
  • A data engineer develops, constructs, tests and maintains the whole architecture of the large-scale processing system: setting up data pipelines.
  • A data scientist is a professional who deals with enormous masses of structured and unstructured data and uses their skills in math, statistics and programming: visualization and business decision making.

However, the roadmap shared by Udureka seems a little bit misleading from my point of view. I think these 3 different roles each have their own roadmap, and it is not something that starts with Data Analyst, then Data Engineer, and eventually becomes Data Scientist.

Looking into the breakdown of the skill sets, it tells me that each of them has a different skill set. Each of them needs different sets of knowledge, and some of the advanced knowledge is unique to the role itself, such as machine learning and deep learning for the Data Scientist, data architecture and ETL for the Data Engineer, and reporting and data visualization for the Data Analyst.

And there are different tasks to be done for each role.

During the interviews I went to recently, I shared with my interviewers my future career path as a Data Engineer and how I can use my knowledge and experience to march into this role. Looking back over all these years, even though I did not have the Data Analyst title, I have done a lot of data warehouse, ETL, reporting and visualization work, including data cleaning on the data sources. It shows how close I am to data engineering. All I need is to upgrade my skill set to an advanced level.

In my next job, starting next month, I will be working closely with the system analysts to build data warehouses and data marts (one of the areas I think brings me closer to data engineering; the next involvement is data pipelines) and with the stakeholders to do data visualization (storytelling) and training. I do hope it is going to be an exciting journey of learning. Again, I am not holding a Data Engineer or Data Analyst title in my new job 🙂

Data Management

Data Management is the process of profiling, cleaning and transforming data sources into useful information. Data Management covers the areas of data profiling, data cleaning, data exploration, data integration and data transformation.

Currently, I am working on my part-time course’s project, which uses Power BI to do data profiling and data exploration of different datasets. It made me understand the importance of understanding the working datasets and knowing what story we want to tell as a conclusion before deciding what to clean in the data cleaning process.

The project made me understand that even when there are incorrect values or data that needs to be cleaned, and cleaning is usually encouraged, it is not always required. I believe this is fine as long as we are able to justify the reasons why data is cleaned or not cleaned in our work and how it affects the overall data exploration.

Each step serves a different purpose in the ETL process.

  • Data Profiling – to get an overview of the structure of the data and assess the data quality.
  • Data cleaning – to improve the data quality.
  • Data exploration – to use statistical charts to get a sense of the distribution or correlation.

Data Profiling

  • Get a visual profile of the data to assess the structure and quality of the data. Using Power BI, you will get a table profile of summary statistics such as min and max values, distinct count, error or missing value counts.
  • It allows us to identify missing values and inconsistencies in the data. Based on this information, we can further assess the quality and plan for the data cleaning process.

The above view shows the Power BI editor once we have loaded the data into the Power BI tool through the Get Data option. Power BI supports different types of data files.

One limitation of these summary statistics is that Power BI works on the first 1,000 records only. In other words, Power BI will not show us any data errors or missing values from the 1,001st row onward.

Similar data profiling can be found in other software tools, such as SAS, which allow their users to check data quality.

Data Cleaning

Data cleaning or data cleansing is the process of detecting and correcting (or removing) incorrect values or records from datasets by replacing, modifying or deleting the dirty data. There are several things we can do during data cleaning, such as the following (see the sketch after this list for an illustration):

  • Remove duplication
  • Change text to lower/upper/proper case
  • Spell check
  • Remove extra spaces
  • Treat blank cells
  • Standardization
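The course project does this in Power BI, but as an illustrative sketch, the same kinds of operations could look like this in pandas (the file name and column names are assumptions, not from the project):

import pandas as pd

df = pd.read_csv("customers.csv")                 # hypothetical input file

df = df.drop_duplicates()                         # remove duplication
df["name"] = df["name"].str.title()               # change text to proper case
df["city"] = df["city"].str.strip()               # remove extra spaces
df["country"] = df["country"].fillna("Unknown")   # treat blank cells
df["country"] = df["country"].replace({"S'pore": "Singapore"})  # standardization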

Are data cleaning techniques essential?

The answer is yes. We spend around 80% of our time on data cleaning. It is not only essential for data analytics and data science, it is also the most time-consuming part: ensuring the data always matches the correct fields, interacts effectively and is easy to visualize.

Data Exploration

It is the part where the storytelling begins, after all the datasets are cleaned and integrated. How well the data is transformed from its raw values and integrated with all the other datasets affects the overall quality of the data. It is important that we have quality datasets before beginning the data exploration.

Many times in my experience of data exploration, I have found there was more data to be cleaned in order to get the right insights or stories that I want to tell in my dashboard. These should be identified during data profiling, and any wrong outcomes from the data exploration may be due to outliers which we overlooked earlier.