Data Management: Data Wrangling Versus ETL

Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.

Data Wrangling Versus ETL: What’s the Difference?

Here are the three major differences between the two technologies.

1. The Users Are Different

The core idea of data wrangling technologies is that the people who know the data best should be exploring and preparing that data. This means business analysts, line-of-business users, and managers (among others) are the intended users of data wrangling tools. I can personally attest to the painstaking amount of design and engineering effort that has gone into developing a product that enables business people to intuitively do this work themselves.

In comparison, ETL technologies are focused on IT as the end-users. IT employees receive requirements from their business counterparts and implement pipelines or workflows using ETL tools to deliver the desired data to the systems in the required formats.

Business users rarely see or leverage ETL technologies when working with data. Before data wrangling tools were available, these users’ interactions with data would only occur in spreadsheets or business intelligence tools.

2. The Data Is Different

The rise of data wrangling software solutions came out of necessity. A growing variety of data sources can now be analyzed, but analysts didn't have the right tools to understand, clean, and organize this data in the appropriate format. Much of the data business analysts must deal with today comes in a growing variety of shapes and sizes that are either too big or too complex to work with in traditional self-service tools such as Excel. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale.

ETL is designed to handle data that is generally well-structured, often originating from a variety of operational systems or databases the organization wants to report against. Large-scale data or complex raw sources that require substantial extraction and derivation to structure are not among the strengths of ETL tools.

Additionally, a growing amount of analysis occurs in environments where the schema of data is not defined or known ahead of time. This means the analyst doing the wrangling is determining how the data can be leveraged for analysis as well as the schema required to perform that analysis.

3. The Use Cases Are Different

The use cases we see among users of data wrangling solutions tend to be more exploratory in nature and are often conducted by small teams or departments before being rolled out across the organization. Users of data wrangling technologies typically are trying to work with a new data source or a new combination of data sources for an analytics initiative. We also see data wrangling solutions making existing analytics processes more efficient and accurate as users can always have their eyes on their data as they prepare it.

ETL technologies initially gained popularity in the 1970s as tools primarily focused on extracting, transforming, and loading data into a centralized enterprise data warehouse for reporting and analysis via business intelligence applications. This continues to be the primary use case for ETL tools and one that they are extremely good at.

With some customers, we see data wrangling and ETL solutions deployed as complementary elements of an organization’s data platform. IT leverages ETL tools to move and manage data, so business users have access to explore and prepare the appropriate data with data wrangling solutions.

Reference: https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx

January 2020

I hope it is not too late to write out my plans for the year 2020. My volunteer work with TechLadies will come to an end this March. TechLadies is recruiting its new core team for 2020. The upcoming boot-camp graduation will introduce the new team to the community. Then, the 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

I have had this question in my mind lately, and I am not sure what TechLadies plans for it. I am quite sure it would be a great idea to let a new team lead the community. A new team brings new ideas and directions.

I may consider taking a side role to continue the study group sessions. But I also hope that someone will plan and run the study group sessions together with me. If not, I will slowly run the events as and when I am available. I am not sure whether a mobile study group will work in Singapore.

Besides TechLadies, what else?

Good question. Inspired by my classmate, I have a plan to run a learn-and-teach program. Through this program, I will teach the community (not necessarily within TechLadies) what I have learned recently.

I will randomly pick a topic to learn and share it with the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving input without receiving feedback from the community.

I hope I will write and share more technical content through my blog here as well as through my posts on Medium.

New focuses

I am looking out for other communities in Singapore that work closely on master data management (MDM), focus on SQL and NoSQL databases, work on data engineering, and use Power BI for data visualization.

I am not going away from my core interest, databases. I also want to go in-depth into master data management and will consider taking some courses or certifications in this area. Next, I need to upskill and gain essential experience in the data engineering field while continuing to explore data visualization with Power BI. I am still looking out for a data engineering meetup or user group in Singapore. Do you know any?

Not to forget, I am doing data analytics in my final module at Temasek Poly. It is going to be an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate, so that everything is fresh in my mind. Right now, I have completed 10 of the 26 modules. I still need to complete some Pandas, statistics, and machine learning topics before the end of February. Maybe I will take a bit of time off from other activities to focus on study and work.

Database Model

While checking the DB-Engines ranking just now, I found that the site lists the available database management systems according to their popularity. The website updates monthly, and you are able to see the current ranking, the previous month's ranking, and the ranking from one year ago.

The picture above shows the list of the top 10 database management systems. Another exciting part of the website is its list of database models.

Relational DBMS

Relational database management systems (RDBMS) support the relational or table-oriented data model. According to the website, a table schema is defined by the table name and a fixed number of columns (attributes) with fixed data types. A record corresponds to a row in the table (entity) and consists of the values of each column. A relation thus consists of a set of uniform records.

Normalization in the process of data modeling generates the table schemas. There are several operations defined on relations, for example (see the sketch after this list):

  • Classical set operations (union, intersection, and difference)
  • Selection (selecting a subset of records according to filter criteria on the attribute values)
  • Projection (selecting a subset of attributes/columns of the table)
  • Join (a special conjunction of multiple tables, as a combination of the Cartesian product with selection and projection)
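To make these operations concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The departments and employees tables are my own hypothetical examples, not from the website.

  import sqlite3

  # An in-memory database with two small hypothetical tables.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
  conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)")
  conn.execute("INSERT INTO departments VALUES (1, 'Sales'), (2, 'Engineering')")
  conn.execute("INSERT INTO employees VALUES (1, 'Amy', 1), (2, 'Ben', 2), (3, 'Cara', 2)")

  # Selection: a subset of records according to a filter criterion.
  print(conn.execute("SELECT * FROM employees WHERE dept_id = 2").fetchall())

  # Projection: a subset of attributes (columns).
  print(conn.execute("SELECT name FROM employees").fetchall())

  # Join: the Cartesian product combined with selection and projection.
  print(conn.execute("""
      SELECT e.name, d.name FROM employees e
      JOIN departments d ON e.dept_id = d.id
  """).fetchall())

  # A classical set operation: the union of two selections.
  print(conn.execute("""
      SELECT name FROM employees WHERE dept_id = 1
      UNION
      SELECT name FROM employees WHERE dept_id = 2
  """).fetchall())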

Document Stores

Document stores, also called document-oriented database systems, are characterized by their schema-free organization of data. According to the website, that means:

  • Records do not need to have a uniform structure, i.e. different records may have different columns.
  • The types of values of individual columns can be different for each record.
  • Columns can have more than one value (arrays).
  • Records can have a nested structure.

Document stores often use internal notations that can be processed directly in applications, mostly JSON. JSON documents, of course, can also be stored as pure text in key-value stores or relational database systems.
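As a small illustration of this flexibility, here is a toy "collection" in Python; the documents themselves are hypothetical.

  import json

  # A hypothetical collection of documents: no uniform structure is required.
  collection = [
      {"_id": 1, "name": "Amy", "email": "amy@example.com"},          # flat record
      {"_id": 2, "name": "Ben", "phones": ["555-1234", "555-5678"]},  # array-valued column
      {"_id": 3, "name": "Cara", "address": {"city": "Singapore"}},   # nested structure
  ]

  # Each document can be serialized directly as JSON text.
  for doc in collection:
      print(json.dumps(doc))

  # A simple query: find the documents that have a "phones" field at all.
  print([d["_id"] for d in collection if "phones" in d])  # -> [2]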

Key-value Stores

Key-value stores are probably the simplest form of database management systems. They can only store pairs of keys and values, as well as retrieve values when a key is known.

These simple systems are usually not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances. For example, resource-efficient key-value stores are used in embedded systems or as high-performance in-process databases.

Advanced Forms

An extended form of key-value stores is able to sort the keys, and thus enables range queries as well as ordered processing of keys. Many systems provide further extensions so that we see a fairly seamless transition to document stores and wide column stores.
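Here is a toy sketch in Python of such a sorted key-value store. It keeps the keys ordered with the standard bisect module so that range queries become possible; the class and the keys are my own illustration, not any particular product.

  import bisect

  class SortedKVStore:
      # A toy key-value store that keeps its keys sorted for range queries.
      def __init__(self):
          self._keys = []   # sorted list of keys
          self._data = {}   # key -> value

      def put(self, key, value):
          if key not in self._data:
              bisect.insort(self._keys, key)  # keep the keys ordered on insert
          self._data[key] = value

      def get(self, key):
          return self._data.get(key)

      def range(self, lo, hi):
          # Return (key, value) pairs with lo <= key < hi, in key order.
          i = bisect.bisect_left(self._keys, lo)
          j = bisect.bisect_left(self._keys, hi)
          return [(k, self._data[k]) for k in self._keys[i:j]]

  store = SortedKVStore()
  for k, v in [("user:3", "Cara"), ("user:1", "Amy"), ("user:2", "Ben")]:
      store.put(k, v)
  print(store.get("user:2"))               # simple lookup by key
  print(store.range("user:1", "user:3"))   # ordered range query over the keys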

Search Engines

Search engines are NoSQL database management systems dedicated to searching data content. In addition to general optimization for this type of application, their specialization typically consists of offering the following features (a toy example follows the list):

  • Support for complex search expressions
  • Full text search
  • Stemming (reducing inflected words to their stem)
  • Ranking and grouping of search results
  • Geospatial search
  • Distributed search for high scalability
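As a toy illustration of full-text search, here is a minimal inverted index in Python; the documents and the AND-only query logic are my own simplification of what real search engines do.

  from collections import defaultdict

  # Hypothetical documents to index.
  docs = {
      1: "big data requires new storage systems",
      2: "search engines rank and group results",
      3: "data wrangling prepares data for analysis",
  }

  # Inverted index: term -> set of ids of documents containing the term.
  index = defaultdict(set)
  for doc_id, text in docs.items():
      for term in text.lower().split():
          index[term].add(doc_id)

  def search(query):
      # Return the ids of documents containing every term in the query.
      results = set(docs)
      for term in query.lower().split():
          results &= index.get(term, set())
      return sorted(results)

  print(search("data"))           # -> [1, 3]
  print(search("data analysis"))  # -> [3]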

Wide Column Stores

As mentioned above, wide column stores, also called extensible record stores, store data in records with the ability to hold huge numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.

Wide column stores share the schema-free characteristic with document stores; however, the implementation is very different. Wide column stores must not be confused with the column-oriented storage found in some relational systems. The latter is an internal concept for improving the performance of an RDBMS for OLAP (Online Analytical Processing) workloads; it stores the data of a table not record after record but column by column.
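To show what "two-dimensional key-value store" means in practice, here is a toy sketch in Python; the record keys and column names are hypothetical.

  from collections import defaultdict

  # Outer key: the record key. Inner key: a dynamic column name.
  table = defaultdict(dict)

  table["user:1"]["name"] = "Amy"
  table["user:1"]["email"] = "amy@example.com"
  table["user:2"]["name"] = "Ben"
  table["user:2"]["last_login"] = "2020-01-15"  # a column user:1 does not have

  # Lookup is by (record key, column name); missing columns simply do not exist.
  print(table["user:1"].get("email"))  # -> amy@example.com
  print(table["user:2"].get("email"))  # -> None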

Graph DBMS

Graph DBMS, also called graph-oriented DBMS or graph databases, represent data in graph structures as nodes and edges, where the edges are relationships between nodes. Graph DBMS allow easy processing of data in that form and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another. Graph DBMS usually do not provide indexes on all nodes, so direct access to nodes based on attribute values is not possible in those cases.
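Here is a small sketch in Python of counting the steps between two nodes with a breadth-first search; the graph itself is a made-up example.

  from collections import deque

  # A hypothetical graph as an adjacency list: node -> neighbouring nodes.
  graph = {
      "A": ["B", "C"],
      "B": ["A", "D"],
      "C": ["A", "D"],
      "D": ["B", "C", "E"],
      "E": ["D"],
  }

  def steps_between(start, goal):
      # Breadth-first search: the number of steps from one node to another.
      queue = deque([(start, 0)])
      seen = {start}
      while queue:
          node, dist = queue.popleft()
          if node == goal:
              return dist
          for neighbour in graph[node]:
              if neighbour not in seen:
                  seen.add(neighbour)
                  queue.append((neighbour, dist + 1))
      return None  # no path between the nodes

  print(steps_between("A", "E"))  # -> 3 (A -> B -> D -> E)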

Time Series DBMS

A Time Series DBMS is a database management system optimized for handling time series data: each entry is associated with a timestamp.

Time Series DBMS are designed to efficiently collect, store, and query various time series with high transaction volumes. Although time series data can be managed with other categories of DBMS (from key-value stores to relational systems), the specific challenges often require specialized systems.
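As a toy illustration, here is a tiny time series in Python with a range query and a simple aggregate; the sensor readings are invented.

  from datetime import datetime, timedelta

  # Each entry is (timestamp, value), appended in time order.
  start = datetime(2020, 1, 1)
  series = [(start + timedelta(hours=i), 20.0 + i) for i in range(6)]

  def query_range(series, t_from, t_to):
      # Return the entries with t_from <= timestamp < t_to.
      return [(t, v) for t, v in series if t_from <= t < t_to]

  window = query_range(series, start + timedelta(hours=1), start + timedelta(hours=4))
  print(window)                                   # three entries
  print(sum(v for _, v in window) / len(window))  # a simple aggregate: the mean -> 22.0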

I hope the information extracted from the website helps us understand the differences between the database models.

Reference: https://db-engines.com/en/ranking

Cardinality in Databases

Recently, I read an article on cardinality in databases. A Google search for a general definition of cardinality returns "the number of elements in a set or other grouping, as a property of that grouping". That may sound a bit difficult to visualize and understand.

In another search, on the Stack Overflow website, a contributor named Oded shared that cardinality can be described in two different contexts: data modelling and data query optimization.

In terms of data modelling, cardinality means how one table relates to another: a one-to-one, one-to-many, or many-to-many relationship. The diagram below, extracted from the Lucidchart website, shows the different types of relationships in a database. This notation is used in entity-relationship (ER) diagrams.

In terms of data query optimization, cardinality refers to the data in a column of a table, specifically how many unique values it contains. If you have done data profiling using Microsoft Power BI, for example, you will notice summary statistics for the table loaded into the application. This information helps with planning queries and optimizing the execution plans.
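Here is a quick sketch in Python (using the built-in sqlite3 module and a made-up orders table) of measuring the cardinality of a column in this second sense.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE orders (id INTEGER, country TEXT)")
  conn.executemany(
      "INSERT INTO orders VALUES (?, ?)",
      [(1, "SG"), (2, "MY"), (3, "SG"), (4, "US"), (5, "SG")],
  )

  # Cardinality in the query-optimization sense: unique values in a column.
  total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
  distinct = conn.execute("SELECT COUNT(DISTINCT country) FROM orders").fetchone()[0]
  print(f"{distinct} unique values out of {total} rows")  # -> 3 unique values out of 5 rows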

Big Data, A Long Introduction

One of my colleagues from Business Operations texted me one morning and asked where she could get insights: understand some of the terminology, the difference between SQL and NoSQL, and how to decide which type of database to use. Instantly, I replied, "Get it from me!" I was pretty confident that I could give her an answer, and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is a computer language for database management systems and data manipulation. SQL is used to perform insertion, updating, and deletion, and it allows us to access and modify data. The data is stored in a relational model, with rows and columns: rows contain all of the information about one specific entry, and columns are the separate data points.
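Here is a minimal sketch of those operations in Python, using the built-in sqlite3 module; the contacts table is a made-up example.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

  # Insertion
  conn.execute("INSERT INTO contacts (name, city) VALUES ('Amy', 'Singapore')")
  conn.execute("INSERT INTO contacts (name, city) VALUES ('Ben', 'Penang')")

  # Update
  conn.execute("UPDATE contacts SET city = 'Johor' WHERE name = 'Ben'")

  # Deletion
  conn.execute("DELETE FROM contacts WHERE name = 'Amy'")

  # Access: each row is one entry; each column is a separate data point.
  for row in conn.execute("SELECT id, name, city FROM contacts"):
      print(row)  # -> (2, 'Ben', 'Johor')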

What is NoSQL?

NoSQL encompasses a wide range of database technologies that are designed to cater to the demands of modern apps. NoSQL databases store a wide range of data types, each with a different data storage model. The main ones are document, graph, key-value, and columnar.

This explains the picture above. Apps such as Facebook, Twitter, web search engines, and IoT applications generate huge amounts of data, both structured and unstructured. The best examples of unstructured data are photos and videos. Therefore, a different method is needed to store the data: NoSQL databases do not store data in a rows-and-columns (table) format.

Differences between SQL and NoSQL

There are a lot of websites we can search online that explain the differences, and I referred to this website.

NoSQL databases are also known as schema-less databases. The screenshot above uses the term "dynamic schema", which means the same thing: there is no fixed schema that locks in the same number of columns (fields) for data entry. NoSQL data is allowed to have a different number of columns when data is added.

Image: https://www.guru99.com/nosql-tutorial.html

Another major difference is scalability: SQL databases scale vertically, while NoSQL databases scale horizontally. Let's use a picture to explain scalability.

Relational databases are designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing. Often, we look to more RAM, more CPU, and more HDD as ways to upsize our system by upgrading our hardware specification. This is scaling up, or vertical scaling, and the process is expensive.

NoSQL databases are non-relational, making them easy to scale out, or scale horizontally, meaning that they run on multiple servers that work together, each sharing part of the load. This can be done on inexpensive commodity hardware.

Question: SQL or NoSQL?

Let's refer to this article: the choice between an SQL and a NoSQL database cannot be concluded from the differences between them alone, but depends on the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is a preferable database. Conversely, if you have applications where data changes frequently and grows rapidly, as in Big Data analytics, NoSQL is the best option for you. And remember, SQL is not dead and can never be superseded by NoSQL or any other database technology.

In short, it depends on the type of application, the project requirements, and the type of query results needed as well.

Big Data

Big data refers not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are large in both size and complexity, such that algorithms are required to extract useful information from them. Example sources include search engine data, healthcare data, and real-time data. In my previous article, What is Big Data?, I shared that Big Data has 3 V's:

  • Volume of data. The amount of data from myriad sources.
  • Variety of data. The types of data: structured, semi-structured, and unstructured.
  • Velocity of data. The speed at which Big Data is generated.

Yes, based on all the above, we have covered two of the three V's: volume and variety. Velocity is how fast data is generated and processed. There are more V's out there, though, and some are relevant to Big Data's description. During my visit to Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to volume and variety. In this blog, I am going to write more.

Storing Big Data

Big Data often involves unstructured data that cannot be stored in a normal RDBMS for various reasons, and it is often related to real-time data with real-time processing requirements.

Hadoop Distributed File System (HDFS)

HDFS provides efficient and reliable storage for big data across many computers. It is one of the popular distributed file systems (DFS), and it stores both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics on the market at the moment. One popular method of dealing with Big Data is MapReduce: divide the data up into small chunks and process each of these individually. In other words, MapReduce spreads the required processing or queries over many computers (many processors).
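Here is a toy word-count sketch in Python of the map, shuffle, and reduce phases, run on one machine; in a real cluster, each chunk would be processed on a different node.

  from collections import defaultdict

  # The input split into small chunks, as a cluster would distribute them.
  chunks = [
      "big data big insights",
      "data drives decisions",
      "big decisions need data",
  ]

  # Map phase: each chunk is processed independently into (word, 1) pairs.
  mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

  # Shuffle phase: group the pairs by key (the word).
  grouped = defaultdict(list)
  for word, count in mapped:
      grouped[word].append(count)

  # Reduce phase: combine each group into a single count.
  counts = {word: sum(values) for word, values in grouped.items()}
  print(counts)  # e.g. {'big': 3, 'data': 3, ...}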

Big Data is not limited to search engines and healthcare; it can also be data from e-commerce websites, where we want to perform targeted advertising and provide recommendation systems, which we often see on websites such as Amazon, Spotify, or Netflix.

Big Data Security

Securing a network and the data it holds is a key issue; basic measures such as firewalls and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While the smart home has become a reality in recent years, the successful invention of smart vehicles that drive in autonomous mode gives us great hope that one day the smart city can be realized. Countries such as Singapore, Korea, and China, along with European countries such as Ireland and the UK, are planning smart cities, using IoT implementations and Big Data management techniques to develop them.

I am looking forward to it.

Reference:
Dawn E. Holmes (2017), Big Data: A Very Short Introduction.

Database Stability

This is one of the common questions asked either during a talk or during an interview. Personally, I regard this topic as highly important, and every database administrator should pay attention to it.

Slow performance means tasks take a longer time to complete. If tasks take longer, they are more likely to overlap when multiple users or connections are active at the same time. That leads to frequent locks, deadlocks, and resource contention, and it eventually leads to errors and stability issues.

Poor scalability means the system has limited options when demand exceeds capacity, such as queuing requests or rejecting requests. Rejecting requests results in errors or unexpected behaviour, and this is instability. Queuing requests leads to reduced performance, putting demands on resources such as CPU and memory. As demand increases further, it leads to further stability issues.

Poor stability affects performance. Partial success and partial failure must be handled, usually with database rollbacks or manual compensation logic. That places additional resource requirements on the system, whether it performs the rollback or processes the manual compensation logic, and it affects scalability.

I found on the MSDN website that someone shared some important points when it comes to designing a database or an application: always consider performance, scalability, and stability when architecting, building, and testing your databases and applications.

MSSQL: Database Partitioning

I was called into a meeting room to discuss, with the support team, an issue faced by a customer. I did not have the full context of the issue and could only guess that some of the queries made by the customer to build Business Intelligence cubes had caused serious performance issues.

And during the meeting, my boss said the words "database partitioning". For one moment, I mistakenly mixed up database partitioning and database normalization. I believe both are used to organize the data in a database, but they target different areas.

Database normalization
According to Microsoft, normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.

Redundant data wastes disk space and creates maintenance problems. If data appears in more than one table, we need to make changes in exactly the same way in all of those tables.

In general, normalization requires additional tables. There are a few rules for database normalization. Each rule is called a “normal form”. I will write more about it in the next blog.

Database Partitioning
Again, according to Microsoft, partitioning a database improves performance and simplifies maintenance. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan. Maintenance tasks, such as rebuilding indexes or backing up a table, can run more quickly.

a. Hardware Partitioning
Partitioning can be achieved without splitting tables by physically putting tables on individual disk drives. Putting a table on one physical drive and related tables on a separate drive can improve query performance because, when queries that involve joins between the tables are run, multiple disk heads read data at the same time. SQL Server filegroups can be used to specify on which disks to put the tables.

Hardware partitioning can also include multiprocessors that enable multiple threads of operations, permitting many queries to run at the same time. Alternatively, a single query may be able to run faster on multiple processors by letting components of the query run at the same time. For example, each table referenced in the query can be scanned at the same time by a different thread.

b. Horizontal Partitioning
Horizontal partitioning divides a table into multiple tables. Each table then contains the same number of columns, but fewer rows. For example, a table that contains 1 billion rows could be partitioned horizontally into 12 tables, with each smaller table representing one month of data for a specific year. Any queries requiring data for a specific month only reference the appropriate table.

Partitioning data horizontally based on age and use is common. For example, a table may contain data for the last five years, but only data from the current year is regularly accessed. In this case, you may consider partitioning the data into five tables, with each table containing data from only one year.
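As a rough sketch of this idea (SQL Server also offers built-in partitioned tables, but the description above is about splitting tables manually), here is a toy example in Python using the built-in sqlite3 module, with made-up monthly sales tables.

  import sqlite3

  conn = sqlite3.connect(":memory:")

  # One table per month: each partition has the same columns but fewer rows.
  for month in ("2019_01", "2019_02", "2019_03"):
      conn.execute(f"CREATE TABLE sales_{month} (id INTEGER, amount REAL)")

  conn.execute("INSERT INTO sales_2019_01 VALUES (1, 100.0), (2, 250.0)")
  conn.execute("INSERT INTO sales_2019_02 VALUES (3, 75.0)")

  def query_month(month):
      # Route the query to the single partition holding that month's data.
      return conn.execute(f"SELECT * FROM sales_{month}").fetchall()

  print(query_month("2019_01"))  # scans only the January partition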

c. Vertical Partitioning
Vertical partitioning divides a table into multiple tables that contain fewer columns. The two types of vertical partitioning are normalization and row splitting.

Normalization is the standard database process of removing redundant columns from a table and putting them in secondary tables that are linked to the primary table by primary key and foreign key relationships.

Row splitting divides the original table vertically into tables with fewer columns. Each logical row in a split table matches the same logical row in the other tables as identified by a UNIQUE KEY column that is identical in all of the partitioned tables. For example, joining the row with ID 712 from each split table re-creates the original row.
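To illustrate row splitting with the ID 712 example above, here is a toy sketch in Python using sqlite3; the person_core and person_extra tables are my own invented split.

  import sqlite3

  conn = sqlite3.connect(":memory:")

  # The original wide table split vertically into two narrower tables
  # that share the same unique key column (id).
  conn.execute("CREATE TABLE person_core (id INTEGER PRIMARY KEY, name TEXT)")
  conn.execute("CREATE TABLE person_extra (id INTEGER PRIMARY KEY, bio TEXT)")
  conn.execute("INSERT INTO person_core VALUES (712, 'Amy')")
  conn.execute("INSERT INTO person_extra VALUES (712, 'Works on data engineering')")

  # Joining on the shared key re-creates the original logical row.
  row = conn.execute("""
      SELECT c.id, c.name, e.bio
      FROM person_core c JOIN person_extra e ON c.id = e.id
      WHERE c.id = 712
  """).fetchone()
  print(row)  # -> (712, 'Amy', 'Works on data engineering')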

Sources:
https://support.microsoft.com/en-sg/help/283878/description-of-the-database-normalization-basics
https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms178148(v=sql.105)