Data Management is the process of profiling, cleaning and transforming data sources into useful information. Data Management covers the areas of data profiling, data cleaning, data exploration, data integration and data transformation.
Currently, I am working on the part time course’s project which uses Power BI to do data profiling and data exploration of different datasets. It made me understanding the important of understanding the working datasets and know what story we want to tell as a conclusion before deciding what to be cleaned in the data cleaning process.
The project made me to understand, even there are some incorrect values or data needed to be cleaned and usually it is encouraged to be cleaned, yet it is not required to do so. I believed as long as we are able to justify the reasons of the data to be cleaned and not to be cleaned in our work and how does it affect to the overall data exploration.
Each steps serve different purposes in the ETL process.
- Data Profiling – to get an overview structure of the data and assess the data quality.
- Data cleaning – to improve the data quality.
- Data exploration – to use statistics charts to get a sense of the distribution or correlation.
- Get a visual profile of the data to assess the structure and quality of the data. Using Power BI, you will get a table profile of summary statistics such as min and max values, distinct count, error or missing value counts.
- It allows us to identify missing values and inconsistencies in the data. Based on this information, we can further assess the quality and plan for the data cleaning process.
The above view shows the Power BI Editor once we loaded the data into Power BI tool through the Get Data option. The Power BI supports different types of data files.
One limitation on this summary statistics is Power BI works on the first 1,000 records only. In other words, Power BI will not be able to show us any data error and missing values from 1001th row onward.
Similar data profiling can be found in other software tools such as SAS which allows its users to check the data quality using their tools.
Data cleaning or data cleansing is a process of detecting and correcting (or removing) incorrect values or records from the datasets and replacing, modifying and deleting the dirty data. Several things we can do during the data cleaning such as,
- Remove duplication
- Change text to lower/upper/proper case
- Spell check
- Remove extra spaces
- Treat blank cells
Are data cleaning technique essential?
The answer is yes. We spend 80% of our time in data cleaning and it is not only essentially important for data analytics and data science, it is most time consuming part to ensure the data always matches the correct fields, interact effectively and making it easy for data visualization.
It is the part where the story-telling begins after all the datasets are cleaned and integrated. How well is the data transformed from its raw values and integrated together with all the datasets affect the overall quality of the data. It is important that we have sets of quality datasets before begin the data exploration.
Many times, in my experience of data exploration I found there are more data to be cleaned in order to get the right insights or stories that I want to tell in my dashboard. During the data profiling, these should be identified and any wrong outcomes from the data exploration maybe due to outliers which we overlooked earlier.