Data Management: Data Wrangling Versus ETL

Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.

Data Wrangling Versus ETL: What’s the Difference?

Here are the three major differences between the two technologies.

1. The Users Are Different

The core idea of data wrangling technologies is that the people who know the data best should be exploring and preparing that data. This means business analysts, line-of-business users, and managers (among others) are the intended users of data wrangling tools. I can personally attest to the painstaking amount of design and engineering effort that has gone into developing a product that enables business people to intuitively do this work themselves.

In comparison, ETL technologies are focused on IT as the end-users. IT employees receive requirements from their business counterparts and implement pipelines or workflows using ETL tools to deliver the desired data to the systems in the required formats.
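As a rough sketch of that pattern (not any particular ETL product), the flow below uses Python, with a hypothetical orders.csv source and a local SQLite database standing in for the target system:

```python
import sqlite3

import pandas as pd

# Extract: read the raw operational export (hypothetical file).
orders = pd.read_csv("orders.csv")

# Transform: apply the business rules received as requirements,
# e.g. normalize dates and derive a line total.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["line_total"] = orders["quantity"] * orders["unit_price"]

# Load: deliver the result to the target system in the required format.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="replace", index=False)
```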

Business users rarely see or leverage ETL technologies when working with data. Before data wrangling tools were available, these users’ interactions with data would only occur in spreadsheets or business intelligence tools.

2. The Data Is Different

The rise of data wrangling software solutions came out of necessity. A growing variety of data sources can now be analyzed, but analysts didn't have the right tools to understand, clean, and organize this data in the appropriate format. Much of the data business analysts must deal with today comes in a growing variety of shapes and sizes that are either too big or too complex to work with in traditional self-service tools such as Excel. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale.

ETL is designed to handle data that is generally well-structured, often originating from a variety of operational systems or databases the organization wants to report against. Large-scale data or complex raw sources that require substantial extraction and derivation to structure are not among ETL tools' strengths.

Additionally, a growing amount of analysis occurs in environments where the schema of data is not defined or known ahead of time. This means the analyst doing the wrangling is determining how the data can be leveraged for analysis as well as the schema required to perform that analysis.
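A minimal sketch of that situation in Python, using hypothetical event records whose fields vary, shows the schema being discovered during wrangling rather than defined up front:

```python
import pandas as pd

# Semi-structured records whose schema is not known up front
# (hypothetical events; fields vary from record to record).
events = [
    {"user": "a", "action": "click", "meta": {"page": "home"}},
    {"user": "b", "action": "purchase", "amount": 9.99},
]

# Flatten the nested fields and let pandas infer a tabular schema.
df = pd.json_normalize(events)

print(df.columns.tolist())  # discovered columns, e.g. ['user', 'action', 'meta.page', 'amount']
print(df.dtypes)            # inferred types the analyst can then refine
```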

3. The Use Cases Are Different

The use cases we see among users of data wrangling solutions tend to be more exploratory in nature and are often conducted by small teams or departments before being rolled out across the organization. Users of data wrangling technologies typically are trying to work with a new data source or a new combination of data sources for an analytics initiative. We also see data wrangling solutions making existing analytics processes more efficient and accurate as users can always have their eyes on their data as they prepare it.

ETL technologies initially gained popularity in the 1970s as tools primarily focused on extracting, transforming, and loading data into a centralized enterprise data warehouse for reporting and analysis via business intelligence applications. This continues to be the primary use case for ETL tools and one that they are extremely good at.

With some customers, we see data wrangling and ETL solutions deployed as complementary elements of an organization’s data platform. IT leverages ETL tools to move and manage data, so business users have access to explore and prepare the appropriate data with data wrangling solutions.

Reference: https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx

Data Management

Data Management is the process of profiling, cleaning and transforming data sources into useful information. Data Management covers the areas of data profiling, data cleaning, data exploration, data integration and data transformation.

Currently, I am working on a part-time course project that uses Power BI for data profiling and data exploration of different datasets. It made me understand the importance of knowing the working datasets, and of knowing what story we want to tell as a conclusion, before deciding what to clean in the data cleaning process.

The project also made me understand that even when there are incorrect values or data that needs to be cleaned, and cleaning is usually encouraged, it is not strictly required. I believe this is fine as long as we can justify why certain data is cleaned or left as-is in our work, and how that decision affects the overall data exploration.

Each step serves a different purpose in the ETL process.

  • Data profiling – to get an overview of the data's structure and assess its quality.
  • Data cleaning – to improve the data quality.
  • Data exploration – to use statistical charts to get a sense of distributions or correlations.

Data Profiling

  • Get a visual profile of the data to assess its structure and quality. In Power BI, you get a table profile of summary statistics such as minimum and maximum values, distinct counts, and error or missing-value counts.
  • It allows us to identify missing values and inconsistencies in the data. Based on this information, we can further assess the quality and plan for the data cleaning process.

After loading a dataset through the Get Data option, Power BI's Power Query Editor displays this kind of column profile. Power BI supports many different types of data files.

One limitation of these summary statistics is that, by default, Power BI profiles only the first 1,000 rows. In other words, data errors and missing values from row 1,001 onward will not appear unless column profiling is switched to the entire dataset in the Power Query Editor.
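For comparison, the same kind of column profile can be computed over every row with pandas; this is only a sketch, and the data.csv file name is a placeholder:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Min/max and other summary statistics per column.
print(df.describe(include="all"))

# Distinct and missing-value counts, profiled over every row,
# not just the first 1,000.
print(df.nunique())
print(df.isna().sum())
```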

Similar data profiling features can also be found in other software tools such as SAS, which likewise lets users check the quality of their data.

Data Cleaning

Data cleaning, or data cleansing, is the process of detecting and correcting (or removing) incorrect values or records from a dataset by replacing, modifying, or deleting the dirty data. Several things we can do during data cleaning include the following (a short code sketch follows the list):

  • Remove duplicate records
  • Change text to lower/upper/proper case
  • Spell check
  • Remove extra spaces
  • Handle blank cells
  • Standardize values
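Here is a minimal pandas sketch of several of these steps; the column names and category mappings are hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

df = df.drop_duplicates()                        # remove duplicate records
df["name"] = df["name"].str.strip().str.title()  # trim extra spaces, proper case
df["city"] = df["city"].replace(                 # standardize category values
    {"NYC": "New York", "N.Y.": "New York"}
)
df["amount"] = df["amount"].fillna(0)            # treat blank cells explicitly
```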

Are data cleaning techniques essential?

The answer is yes. We spend about 80% of our time on data cleaning. It is not only essential for data analytics and data science; it is also the most time-consuming part of ensuring the data always matches the correct fields, integrates effectively, and is easy to visualize.

Data Exploration

This is the part where the storytelling begins, after all the datasets are cleaned and integrated. How well the data is transformed from its raw values and integrated with the other datasets affects the overall quality of the data. It is important to have a set of quality datasets before beginning data exploration.
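A quick sketch of this kind of exploration with pandas and matplotlib, again with hypothetical file and column names:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("clean_data.csv")  # hypothetical cleaned, integrated dataset

# Distribution: histogram of one numeric column.
df["amount"].hist(bins=30)
plt.show()

# Correlation between numeric columns to spot relationships.
print(df.corr(numeric_only=True))
```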

Many times in my experience of data exploration, I have found more data that needs to be cleaned in order to get the right insights or tell the story I want in my dashboard. Ideally, these issues are identified during data profiling; wrong outcomes in data exploration may be due to outliers we overlooked earlier. A simple check for such outliers is sketched below.
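For example, an interquartile-range check can flag values that summary statistics alone might hide; the amount column here is hypothetical:

```python
import pandas as pd

df = pd.read_csv("clean_data.csv")  # hypothetical dataset

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1

# Flag values outside the usual 1.5 * IQR fences as potential outliers.
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(outliers)
```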