Data Management: Data Wrangling Versus ETL

Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.

Data Wrangling Versus ETL: What’s the Difference?

Here are the top three differences between the two technologies.

1. The Users Are Different

The core idea of data wrangling technologies is that the people who know the data best should be exploring and preparing that data. This means business analysts, line-of-business users, and managers (among others) are the intended users of data wrangling tools. I can personally attest to the painstaking amount of design and engineering effort that has gone into developing a product that enables business people to intuitively do this work themselves.

In comparison, ETL technologies are focused on IT as the end-users. IT employees receive requirements from their business counterparts and implement pipelines or workflows using ETL tools to deliver the desired data to the systems in the required formats.

Business users rarely see or leverage ETL technologies when working with data. Before data wrangling tools were available, these users’ interactions with data would only occur in spreadsheets or business intelligence tools.

2. The Data Is Different

The rise of data wrangling software solutions came out of necessity. A growing variety of data sources can now be analyzed, but analysts didn’t have the right tools to understand, clean, and organize this data in the appropriate format. Much of the data business analysts must deal with today comes in a growing variety of shapes and sizes that are either too big or too complex to work with in traditional self-service tools such as Excel. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale.

ETL is designed to handle data that is generally well-structured, often originating from a variety of operational systems or databases the organization wants to report against. Large-scale data and complex raw sources that require substantial extraction and derivation to structure are not among ETL tools’ strengths.

Additionally, a growing amount of analysis occurs in environments where the schema of data is not defined or known ahead of time. This means the analyst doing the wrangling is determining how the data can be leveraged for analysis as well as the schema required to perform that analysis.

3. The Use Cases Are Different

The use cases we see among users of data wrangling solutions tend to be more exploratory in nature and are often conducted by small teams or departments before being rolled out across the organization. Users of data wrangling technologies typically are trying to work with a new data source or a new combination of data sources for an analytics initiative. We also see data wrangling solutions making existing analytics processes more efficient and accurate as users can always have their eyes on their data as they prepare it.

ETL technologies initially gained popularity in the 1970s as tools primarily focused on extracting, transforming, and loading data into a centralized enterprise data warehouse for reporting and analysis via business intelligence applications. This continues to be the primary use case for ETL tools and one that they are extremely good at.

With some customers, we see data wrangling and ETL solutions deployed as complementary elements of an organization’s data platform. IT leverages ETL tools to move and manage data, so business users have access to explore and prepare the appropriate data with data wrangling solutions.

Reference: https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx

January 2020

I hope it is not too late to write out my plans for the year 2020. My volunteer work with TechLadies will come to an end this March. TechLadies is recruiting a new core team for 2020. The upcoming boot-camp graduation will introduce the new team to the community. Then, the 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

I have had this question on my mind lately, and I am not sure what TechLadies plans for it. I am quite sure it would be a great idea to let a new team lead the community. New team, new ideas and directions.

I may consider taking a side role to continue the study group sessions. But I also hope that someone will plan and run the study group sessions with me. If not, I will slowly run the events as and when I am available. I am not sure whether a mobile study group will work in Singapore.

Besides TechLadies, what else?

Good question. I have a plan to run a learn-and-teach program, after being inspired by my classmate. Through this program, I will teach the community (not necessarily within TechLadies) what I have learned recently.

I will randomly pick a topic to learn and share it with the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving input without receiving feedback from the community.

I hope I will write and share more technical content through my blog here, as well as my posts on Medium.

New focuses

I am looking out for other communities in Singapore that work closely on master data management (MDM), focus on SQL and NoSQL databases, work on data engineering, and use Power BI for data visualization.

I am not going away from my core interest: databases. I also want to go in-depth into master data management and will consider taking some courses or certifications in this area. Next, I need to upskill and gain essential experience in the data engineering field while continuing to explore data visualization with Power BI. I am still looking for a data engineering meetup or user group in Singapore. Do you know any?

Not to forget, I am doing data analytics in my final module at Temasek Polytechnic. It is going to be an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate too, so that everything is fresh in my mind. Right now, I have completed 10 of 26 modules. I still need to complete some Pandas, statistics, and machine learning topics before the end of February. Maybe I will take a bit of time off from other activities to focus on study and work.

Database Models

While checking the DB-Engines ranking just now, I found that the site lists the available database management systems according to their popularity. The website updates monthly, and you are able to see the current ranking, the previous month’s ranking, and the ranking from one year ago.

The ranking shows the top 10 database management systems. Another exciting part of the website is its list of database models.

Relational DBMS

Relational database management systems (RDBMS) support the relational, or table-oriented, data model. The schema of a table is defined by the table name and a fixed number of columns (attributes) with fixed data types. A record corresponds to a row in the table (entity) and consists of the values of each column. A relation thus consists of a set of uniform records, according to the website.

Normalization during data modeling generates the table schemas. Several operations are defined on relations; for example (a small code sketch follows the list):

  • Classical set operations (union, intersection, and difference)
  • Selection (selecting a subset of records according to filter criteria on the attribute values)
  • Projection (selecting a subset of attributes/columns of the table)
  • Join (a special conjunction of multiple tables, combining the Cartesian product with selection and projection)
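
To make these operations concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Alice", "Singapore"), (2, "Bob", "Jakarta")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)])

# Projection (pick columns), join (combine tables), and selection (filter rows) in one query.
cur.execute("""
    SELECT c.name, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.city = 'Singapore'
""")
print(cur.fetchall())  # e.g. [('Alice', 120.0), ('Alice', 80.0)]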

Document Stores

Document stores, also called document-oriented database systems, are characterized by their schema-free organization of data. According to the website, that means:

  • Records do not need to have a uniform structure, i.e. different records may have different columns.
  • The types of values of individual columns can be different for each record.
  • Columns can have more than one value (arrays).
  • Records can have a nested structure.

Document stores often use internal notations, which can be processed directly in applications, mostly JSON. JSON documents, of course, can also be stored as pure text in key-value stores or relational database systems.
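
As a hypothetical illustration of these points, here are two records that could live in the same document collection, written as Python dicts; every field name is invented.

record_a = {
    "name": "Alice",
    "emails": ["a@example.com", "alice@work.example"],  # one column holding multiple values (array)
    "address": {"city": "Singapore"},  # nested structure
}
record_b = {
    "name": "Bob",  # different columns from record_a
    "age": 34,  # and a numeric value where record_a stores only strings
}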

Key-value Stores

Key-value stores are probably the simplest form of database management systems. They can only store pairs of keys and values, as well as retrieve values when a key is known.

These simple systems are usually not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances: for example, resource-efficient key-value stores used in embedded systems or as high-performance in-process databases.
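
Python’s standard library ships a small embedded key-value store, the dbm module, which makes for a minimal sketch of the idea (the file name is arbitrary):

import dbm

# Store pairs of keys and values; retrieve a value when its key is known.
with dbm.open("example_kv.db", "c") as db:
    db["user:1"] = "Alice"
    db["user:2"] = "Bob"
    print(db["user:1"])  # b'Alice'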

Advanced Forms

An extended form of key-value stores is able to sort the keys, and thus enables range queries as well as ordered processing of keys. Many systems provide further extensions so that we see a fairly seamless transition to document stores and wide column stores.

Search Engines

Search engines are NoSQL database management systems dedicated to the search for data content. In addition to general optimization for this type of application, the specialization typically consists of offering the following features:

  • Support for complex search expressions
  • Full text search
  • Stemming (reducing inflected words to their stem)
  • Ranking and grouping of search results
  • Geospatial search
  • Distributed search for high scalability

Wide Column Stores

As mentioned above, wide column stores, also called extensible record stores, store data in records with the ability to hold very large numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.

Wide column stores share the schema-free characteristic with document stores, but the implementation is very different. Wide column stores must not be confused with the column-oriented storage found in some relational systems: that is an internal concept for improving the performance of an RDBMS for OLAP (Online Analytical Processing) workloads, which stores the data of a table not record after record but column by column.

Graph DBMS

Graph DBMS, also called graph-oriented DBMS or graph databases, represent data in graph structures as nodes and edges, where edges are relationships between nodes. Graph DBMS allow easy processing of data in that form and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another. Graph DBMS usually do not provide indexes on all nodes, so direct access to nodes based on attribute values is not possible in those cases.

Time Series DBMS

A Time Series DBMS is a database management system optimized for handling time series data: each entry is associated with a timestamp.

Time Series DBMS are designed to efficiently collect, store, and query various time series with high transaction volumes. Although time series data can also be managed with other categories of DBMS (from key-value stores to relational systems), the specific challenges often require specialized systems.

I hope the information extracted from the website is able to help us understand the differences between the database models.

Reference: https://db-engines.com/en/ranking

MongoDB: Importing csv files with mongoimport

Someone asked for my help to upload some .csv files to a MongoDB database and back up the database before sending the files to the next person. I completed the import with the command below. It imports a .csv file into the selected database and collection, specifying the file type, the file location, and whether the file has a header line. It works on both Linux and Windows machines from the terminal or command prompt.
mongoimport -d mydb -c things --type csv --file locations.csv --headerline
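
For the backup step before handing the database over, mongodump can dump a database to a directory of files; a minimal sketch, where the output path is an assumption:

mongodump -d mydb -o /backup/mydb-dump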

MongoDB: Enabling SSL for MongoDB

Some time ago, I was asked to prepare and work with a team of people to explore TLS/SSL for MongoDB and to use Windows Active Directory users to access MongoDB via LDAP authentication.

This is administration work on MongoDB to enhance security between client and server during data transmission. By default, connections to MongoDB servers are not encrypted. It is highly advisable to ensure all connections to mongod are Transport Layer Security (TLS), also known as SSL, enabled.

According to the definition found via Google, TLS and its now-deprecated predecessor, Secure Sockets Layer (SSL), are protocols designed to provide communications security over a computer network.

For a development server, you can enable TLS using a self-signed certificate. OpenSSL allows us to generate the self-signed certificate on the server itself. On a Windows machine, you need to install OpenSSL to generate the self-signed certificate, and then install the certificate. On a Linux Ubuntu machine, no installation is required for OpenSSL or the self-signed certificate.
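
As a minimal sketch, the self-signed certificate can be generated with OpenSSL and the key and certificate concatenated into the .pem file MongoDB expects; the file names and validity period here are assumptions:

openssl req -newkey rsa:2048 -new -x509 -days 365 -nodes -out mongodb-cert.crt -keyout mongodb-cert.key
cat mongodb-cert.key mongodb-cert.crt > mongodb.pem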

In the MongoDB configuration file (mongod.conf on a Linux machine, mongod.cfg on a Windows machine), you can enable TLS mode in MongoDB. You need to specify the path to the .pem file. If you are working on a production server, you are required to include the CAFile path too. CA stands for certificate authority, an entity that issues digital certificates.

net:
   tls:
      mode: requireTLS  # reject connections that do not use TLS
      certificateKeyFile: /etc/ssl/mongodb.pem  # server certificate and private key (.pem)
      CAFile: /etc/ssl/caToValidateClientCertificates.pem  # CA certificate, required in production

For more reading about MongoDB TLS/SSL, refer to this document: https://docs.mongodb.com/manual/tutorial/configure-ssl/

Dates in MongoDB

My colleague came to me last Friday and asked me to help him out with an issue he faced with a MongoDB query. He found that when our BI tool inserted the JSON data received via the API into MongoDB, the date and datetime values were saved in string format.

The JSON specification does not specify a format for exchanging dates, which is why there are so many different ways to do it. The best choice is the ISO date format: it is a well-known and widely used format that can be handled across many different languages, making it very well suited for interoperability.

It looks like this:

2012-04-23T18:25:43.511Z

When I took over the JSON data, I found that most of the date and time values were in a plain string format (e.g., 2012-04-23 18:25:43.511), so MongoDB had inserted them into the database as the string type.

To understand dates in MongoDB, I found an article that explains them pretty well; here is the link. Instantly, it hit me: dealing with dates in MongoDB, or in any database, is not an easy job at all. Moreover, this time I am dealing with MongoDB, which has subdocuments and arrays inside subdocuments.

I can use $dateFromString, which converts a date/time string to a date object. For more detail about this operator, you can find it at this link.
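
Here is a minimal sketch with pymongo, assuming a collection named events with a string field created_at; the connection string, database, collection, and field names are all made up for illustration.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Convert the string field into a proper BSON date inside an aggregation.
pipeline = [
    {"$addFields": {
        "created_at": {
            "$dateFromString": {
                "dateString": "$created_at",
                "format": "%Y-%m-%d %H:%M:%S.%L",  # matches e.g. 2012-04-23 18:25:43.511
            }
        }
    }},
]
for doc in db.events.aggregate(pipeline):
    print(doc["created_at"])

Note that the format option of $dateFromString requires MongoDB 4.0 or later.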

This aggregation operator worked like magic for me when dealing with dates and times stored as strings, until I reached a point where the date is in an array inside a subdocument. I hit a roadblock and was not able to use the same method to do the conversion. It seems quite true that I have to use $unwind in my query. It deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.

I need a solution where I can use $unwind to flatten the array from the collection and convert each date and time into a date object; a sketch of the direction I am exploring follows. I am still looking for a complete solution, so if you happen to have solved something similar, please give me a helping hand and link me to the solution. Thank you!
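
This is the unfinished direction as a hedged sketch: unwind the array, convert each string, then group the documents back together. It assumes documents shaped like {"order": {"events": [{"ts": "2012-04-23 18:25:43.511"}]}}, and every name here is hypothetical.

# Continuing with the pymongo db handle from the sketch above.
pipeline = [
    {"$unwind": "$order.events"},  # one output document per array element
    {"$addFields": {
        "order.events.ts": {
            "$dateFromString": {
                "dateString": "$order.events.ts",
                "format": "%Y-%m-%d %H:%M:%S.%L",
            }
        }
    }},
    # Rebuild the array, one document per original _id; other fields still need carrying over.
    {"$group": {
        "_id": "$_id",
        "events": {"$push": "$order.events"},
    }},
]
for doc in db.orders.aggregate(pipeline):
    print(doc)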

Maybe useful: https://stackoverflow.com/questions/38299186/query-that-combines-project-unwind-group

Big Data, A Long Introduction

One of my colleagues from Business Operations texted me one morning and asked where she could get insights: understand some of the terminology, the difference between SQL and NoSQL, and how to decide which type of database to use. Instantly, I replied, “get it from me!” I was pretty confident that I could give her an answer, and I wanted to explain databases in a more interesting way.

What is SQL?

Structured Query Language (SQL) is a computer language for database management systems and data manipulation. SQL is used to perform insertions, updates, and deletions, and it allows us to access and modify data. The data is stored in a relational model, with rows and columns: rows contain all of the information about one specific entry, and columns are the separate data points.
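
As a quick, hedged illustration of those operations, here is a tiny sketch using Python’s built-in sqlite3 module; the table and data are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO people (name, city) VALUES ('Alice', 'Singapore')")  # insertion
conn.execute("UPDATE people SET city = 'Jakarta' WHERE name = 'Alice'")  # update
conn.execute("DELETE FROM people WHERE name = 'Alice'")  # deletion
conn.commit()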

What is NoSQL?

NoSQL encompasses a wide range of database technologies that are designed to cater to the demands of modern apps. These systems store a wide range of data types, each with a different data storage model. The main ones are document, graph, key-value, and columnar.

Apps such as Facebook and Twitter, web search engines, and IoT applications generate huge amounts of data, both structured and unstructured. The best examples of unstructured data are photos and videos. Therefore, a different method is needed to store the data: NoSQL databases do not store data in a rows-and-columns (table) format.

Differences between SQL and NoSQL

There are a lot of websites we can search online for the differences; I referred to this website.

NoSQL databases are also known as schema-less databases. The tutorial uses the term dynamic schema, which means the same thing: there is no fixed schema that locks in the same number of columns (fields) for every data entry. NoSQL allows records with different numbers of columns to be added.

Source: https://www.guru99.com/nosql-tutorial.html

Another major difference is scalability: SQL databases scale vertically, while NoSQL databases scale horizontally.

Relational databases are designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing. To upsize such a system, we typically look into more RAM, more CPU, and more HDD, upgrading the hardware specification. This is scaling up, or vertical scaling, and the process is expensive.

NoSQL databases are non-relational, which makes it easy to scale out, or horizontally: they run on multiple servers that work together, each sharing part of the load. This can be done on inexpensive commodity hardware.

Question: SQL or NoSQL?

Let’s refer to this article: the choice between SQL and NoSQL cannot be concluded from the differences between them alone, but from the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is preferable. Conversely, if you have applications where data is changing frequently and growing rapidly, as in Big Data analytics, NoSQL is the best option for you. And remember, SQL is not dead and can never be superseded by NoSQL or any other database technology.

In short, it depends on the type of application, the project requirements, and the type of query results needed.

Big Data

Big data refers not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are large in both size and complexity, such that special algorithms are required to extract useful information from them. Example sources include search engine data, healthcare data, and real-time data. In my previous article, What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

Based on all the above, we have covered two of the three V’s: volume and variety. Velocity is how fast data is generated and processed. There are more V’s out there, and some are relevant to describing Big Data. During my visit to Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited to volume and variety. In this blog, I am going to write more.

Storing Big Data

Unstructured data cannot be stored in a normal RDBMS for several reasons, and Big Data is often related to real-time data with real-time processing requirements.

Hadoop Distributed File System (HDFS)

HDFS provides efficient and reliable storage for big data across many computers. It is one of the popular distributed file systems (DFS), storing both unstructured and semi-structured data for data analysis.

Big Data Analytics

There are not many tools for NoSQL analytics on the market at the moment. One of the popular methods for dealing with Big Data is MapReduce: divide the data into small chunks and process each of them individually. In other words, MapReduce spreads the required processing or queries over many computers (many processors).
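
To make the idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python; real frameworks such as Hadoop distribute the same map and reduce phases across many machines, and the word-count data here is invented.

from collections import defaultdict

chunks = ["big data is big", "data needs processing"]

# Map phase: each chunk is processed independently, emitting (key, value) pairs.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle phase: group the values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}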

Big Data is not limited to search engines and healthcare; it can also be data from e-commerce websites, where we want to perform targeted advertising and provide the recommendation systems we often see on sites such as Amazon, Spotify, or Netflix.

Big Data Security

Securing a network and the data it holds are key issues; basic measures such as firewalls and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While the smart home has become a reality in recent years, the successful development of smart vehicles that can drive themselves gives us great hope that one day the smart city can be realized. Countries such as Singapore, Korea, and China, and European countries such as Ireland and the UK, are planning smart cities, implementing IoT and Big Data management techniques to develop them.

I am looking forward to it.

Reference:
Dawn E. Holmes (2017). Big Data: A Very Short Introduction.