Describe the difference between batch and streaming data

Data processing is simply converting raw data into meaningful information. Depending on how the data is ingested into your system, you can process each data item as it arrives, or buffer the raw data and process it in groups. Processing data as it arrives is called streaming. Buffering and processing the data in groups is called batch processing.

Understand batch processing

In batch processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time as a batch. Exactly when each group is processed can be determined in several ways. For example, you can process data on a scheduled time interval (every hour, say), or processing can be triggered when a certain amount of data has arrived or by some other event.

Advantages of batch processing include:

  • Large volumes of data can be processed at a convenient time.
  • It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight or during off-peak hours.

Disadvantages of batch processing include:

  • There is a time delay between ingesting the data and getting the results, because processing has to wait until the next scheduled run.
  • All of a batch job’s input data must be ready before the batch can be processed, so the data must be carefully checked. Problems with data, errors, and program crashes during batch jobs halt the process, and the input data must be checked again before the job can be rerun. Even minor data errors, such as typographical errors in dates, can prevent a batch job from running.

Understand streaming and real-time data

In stream processing, each new piece of data is processed when it arrives; data ingestion itself is inherently a streaming process.

Streaming handles data in real-time. Unlike batch processing, there is no waiting until the next batch processing interval, and data is processed as individual pieces rather than a batch at a time. Streaming data processing is beneficial in most scenarios where new, dynamic data is generated continually.

Examples of streaming data include:

  • A financial institution tracks changes in the stock market in real-time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
  • An online gaming company collects real-time data about player-game interactions and feeds the data into its gaming platform. It then analyses the data in real-time and offers incentives and dynamic experiences to engage its players.
  • A real-estate website tracks a subset of data from consumers’ mobile devices and makes real-time recommendations of properties to visit based on their geo-location.
  • Stream processing is ideal for time-critical operations that require an instant real-time response. For example, a system that monitors a building for smoke and heat needs to trigger alarms and unlock doors to allow residents to escape immediately in the event of a fire.

Understand the differences between batch and streaming data

Apart from how batch processing and streaming processing handle data, there are other differences; a small code sketch contrasting the two follows the list:

  • Data Scope: Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or to data within a rolling time window (the last 30 seconds, for example).
  • Data Size: Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro-batches consisting of a few records.
  • Performance: Latency is the time taken for the data to be received and processed. The latency for batch processing is typically a few hours, whereas stream processing typically occurs immediately, with latency on the order of seconds or milliseconds.
  • Analysis: You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
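
To make the contrast concrete, here is a minimal Python sketch of my own (not tied to any particular framework): the batch function waits until all records are collected and processes them in one go, while the streaming class updates a rolling average over the most recent records as each one arrives.

from collections import deque

# Batch: wait until all records are collected, then process them in one go.
def batch_average(records):
    return sum(records) / len(records)

# Streaming: keep only a rolling window of recent values and
# produce an updated result for every arriving record.
class RollingAverage:
    def __init__(self, window=3):
        self.values = deque(maxlen=window)

    def add(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)

readings = [10, 12, 11, 15, 14]
print(batch_average(readings))   # one result, after all the data is ready

stream = RollingAverage(window=3)
for r in readings:
    print(stream.add(r))         # a fresh result as each record arrives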

Day 2: Machine Learning for Beginners

I have begun to pick up some topics and learn them. One of them is machine learning. I was first exposed to machine learning (ML for short) when I was studying for my Specialized Diploma in 2019-2020. It is a good time to write some simple posts that give a basic introduction for beginners like me to learn a thing or two. I have very little idea about machine learning, so please share your thoughts on what I write below.

What are Supervised and Unsupervised Learning?

The very first keywords that I came across when I started watching the online learning videos were supervised and unsupervised learning. It is important to know the difference between these two.

In supervised learning, I am expected to train the machine using labelled data. Labelled data means each example is tagged with the correct answer. The supervised learning algorithm learns from this training data and then applies the knowledge to test data to predict outcomes for new, unseen examples.

Unsupervised learning is the training of a machine using unlabelled data, so I do not have to supervise the model. The algorithm is left to act on the information on its own, and it can find all kinds of unknown patterns in the data. It allows the model to perform more complex processing tasks compared to supervised learning.

Types of Supervised Machine Learning Techniques

The regression technique predicts a single output value using training data. For example, I can use regression to predict a house price from training data. The output variable is a real value, such as a weight or the size of a house.
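
To make this concrete, here is a minimal sketch of my own using scikit-learn (assuming it is installed) with made-up numbers, predicting a house price from its size:

from sklearn.linear_model import LinearRegression

# Labelled training data: house sizes in square metres and their prices.
sizes = [[50], [80], [100], [120]]
prices = [300000, 450000, 550000, 650000]

model = LinearRegression()
model.fit(sizes, prices)          # learn the relationship from the data

print(model.predict([[90]]))      # predict the price of a 90 sqm house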

Classification means grouping the output into a class. If the algorithm tries to label input into two distinct classes, it is called binary classification; for example, determining whether or not someone will default on a loan. The output variable is a category, such as “red or blue” or “disease or no disease”. Selecting between more than two classes is referred to as multiclass classification.

Other supervised learning techniques include the following (a small classification sketch follows the list):

  • Logistic Regression
  • Naive Bayes Classifiers
  • K-NN (k nearest neighbors)
  • Decision Trees
  • Support Vector Machine
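
As a small sketch of one of these, k-NN (again scikit-learn, with toy numbers of my own invention), the classifier labels a new applicant by looking at its nearest labelled neighbours:

from sklearn.neighbors import KNeighborsClassifier

# Toy labelled data: [income, loan_amount] -> 0 = repays, 1 = defaults.
X = [[50, 10], [60, 12], [20, 30], [25, 28]]
y = [0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

print(clf.predict([[55, 11]]))    # classify a new loan applicant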

Types of Unsupervised Machine Learning Techniques

Clustering is an important concept when it comes to unsupervised learning. It mainly deals with finding a structure or pattern in a collection of uncategorized data. Clustering algorithms will process your data and find natural clusters (groups) if they exist in the data. You can also specify how many clusters your algorithm should identify, which allows you to adjust the granularity of these groups.
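
For example, a quick k-means sketch (scikit-learn, toy data of my own) where the n_clusters parameter sets how many groups the algorithm should identify:

from sklearn.cluster import KMeans

# Uncategorized data: [annual_spend, visits_per_month] for six customers.
X = [[100, 2], [120, 3], [110, 2], [800, 20], [900, 25], [850, 22]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)    # assign each customer to a cluster

print(labels)                     # e.g. [0 0 0 1 1 1] -- two natural groups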

Association rules allow you to establish associations amongst data objects inside large databases. This unsupervised technique is about discovering interesting relationships between variables in large databases. For example, people who buy a new home are most likely to buy new furniture.
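
A very simplified sketch of the idea in plain Python (real association-rule mining uses algorithms such as Apriori; this just counts how often pairs of items occur in the same transaction):

from collections import Counter
from itertools import combinations

transactions = [
    {"new home", "furniture", "curtains"},
    {"new home", "furniture"},
    {"groceries", "furniture"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # frequent pairs hint at association rules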


Day 1: Machine Learning for Beginners

I have begun to pick up some topics and learn them. One of them is machine learning. I was first exposed to machine learning (ML for short) when I was studying for my Specialized Diploma in 2019-2020. It is a good time to write some simple posts that give a basic introduction for beginners like me to learn a thing or two. I have very little idea about machine learning, so please share your thoughts on what I write below.

Introduction to Machine Learning for Beginners

It is always good to start with an introduction. One of the best introduction memes for machine learning:

What is Machine Learning?

Machine Learning (ML) uses statistics and algorithms to predict outputs. The basic idea of ML is to build algorithms that can receive input data and use statistical analysis to predict an output, updating the outputs as and when new data becomes available. ML is a category of algorithms that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. I guess this is where the machine is learning to predict outputs.
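
That idea of updating predictions as new data becomes available can be sketched with scikit-learn’s SGDRegressor, which supports incremental fitting (a toy illustration of my own, not a production setup):

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=0)

# First batch of data arrives: learn an initial relationship (roughly y = 2x).
X1, y1 = np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0])
model.partial_fit(X1, y1)

# More data arrives later: update the same model instead of retraining.
X2, y2 = np.array([[4.0], [5.0]]), np.array([8.0, 10.0])
model.partial_fit(X2, y2)

print(model.predict(np.array([[6.0]])))   # predict with the updated model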

Machine learning is used everywhere. What does it do?

I managed to pull out some examples from online articles. Based on these examples, I noticed that most of them refer to past historical data to identify or recognize a pattern and predict an output.

  • Machine learning can be used in prediction systems. Considering the loan example, to compute the probability of a default, the system needs to classify the available data into groups.
  • Machine learning can be used for face detection in an image. There is a separate category for each person in a database of several people.
  • Speech recognition, where it translates spoken words into text. It is used in voice user interfaces such as voice dialing, call routing, and appliance control. Besides that, it can be used for data entry and the preparation of structured documents.
  • In medical diagnoses, machine learning is trained to recognize cancerous tissues.
  • Financial industry and trading companies use machine learning in fraud investigations and credit checks.

There are a lot of new terms and keywords that I see when going through the online videos. I cannot remember all of them now. I think it is good to spend some time writing down some notes and saving the reference links in my browser’s bookmarks.


Tableau: What is the difference between continuous and discrete

The following is based on an explanation shared by the instructor of a Udemy online course which I subscribed to. The course is called Tableau for Beginners: Get CA Certified, Grow Your Career.

One of the questions asked by a student, which I also wanted answered, was this: when I moved from a bar chart to a line graph, I noticed there were two different types of line graphs, one continuous and the other discrete.

In Google’s definition, discrete means individually separate and distinct, while continuous means forming an unbroken whole, without interruption. To put it simply, discrete data is whole numbers, and continuous data is running values.

Based on the instructor’s explanation, continuous creates an axis, while discrete creates headers. With headers, I get a different label for each unique value in the data. With an axis, the axis represents the range of possible values.
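
A rough analogy in pandas, not Tableau itself (a sketch of my own with made-up numbers): treating the order date as a discrete label gives one header per unique value, while treating it as a continuous datetime lays values along an unbroken date axis.

import pandas as pd

sales = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2020-01-05", "2020-02-10", "2021-01-07"]),
    "Sales": [100, 150, 120],
})

# Discrete: each unique year becomes a separate label (a "header").
print(sales.groupby(sales["OrderDate"].dt.year)["Sales"].sum())

# Continuous: values laid out along an unbroken monthly date axis.
print(sales.set_index("OrderDate")["Sales"].resample("MS").sum())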

Confused?

When I choose a field called Sales from the Measures panel and another field called OrderDate from the Dimensions panel, it automatically generates the discrete line graph as below:

The x-axis shows the year values for the “columns”, and there are no values or breakdown of values for each month in the line graph.

Meanwhile, if I choose to change from discrete to continuous, the values on the x-axis change to go by month/year. I can change it by selecting “Month” from the drop-down list.

Unfortunately, I realized that the drop-down list does not specify that the first set of Year, Month, etc. is for discrete data and the second set is for continuous data. It gave me the impression that it only changes the format shown on the x-axis, but the meaning is different. I also noticed the “column” field at the top changed from blue to green; this visual change indicates the values have changed from discrete to continuous.

Now, the line graph is showing the range of dates in month and year. This is a simple line graph to explain the difference between discrete values and continuous values.

I am unable to continue updating the entry because my access to Tableau has expired. I am sorry for the incomplete information.


Python: How to code in 5 minutes

Happy New Year to all my dear readers. We are in the year 2021 now. This year’s new resolution is learning technologies without any barriers. As part of the skill improvement within my organization, our management has decided to get everyone on board in learning all kinds of technologies that my organization is going to use this year. It is part of our strategy for data digitization and self-servicing. During the long Christmas and New Year holidays, the first step took place when every one of us was given some time off to learn Python, Tableau, Power BI, AWS Glue and Azure.

Get Python installed

Today, I will start with the Python installation and display “Hello World” on my screen. If you have not installed Python on your machine before, you can head to the python.org website to get Python 3.8.x or Python 3.9 installed. My machine uses the Windows operating system. Just select your operating system, download Python, and install it on your machine. The download and installation did not take me more than 2 minutes to complete.

Get installed with Anaconda

Next, I downloaded and installed Anaconda, and the whole process took me less than 2 minutes to complete. You can visit anaconda’s website to get the installer. These are the two steps to begin with when we want to set up our machine for Python programming. After installation, you are automatically in the default conda environment, with all packages installed. Conda is a package and environment manager.

What are Python and Anaconda?

You may be wondering why we need Python and Anaconda. What is this Anaconda for? Wiki tells us that Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), aiming to simplify package management and deployment.

Anaconda helps manage packages and environments and reduces future issues when dealing with the various libraries that you will be using. Anaconda is a distribution of packages built for data science, and it comes with conda, a package and environment manager. We usually use conda to create environments that isolate projects using different versions of Python and/or different versions of packages; for example, you may want to set up separate Python 2 and Python 3 environments.
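
A minimal sketch of how that looks from the command prompt (the environment names here are my own choice):

conda create --name py2 python=2.7    # an environment with Python 2
conda create --name py3 python=3.8    # an environment with Python 3
conda activate py3                    # switch into the Python 3 environment
conda list                            # see the packages installed in it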

Jupyter Notebook

Before I began my first Python programming exercise to display “Hello World”, I chose to use the Jupyter Notebook from Anaconda Navigator. The Jupyter Notebook is an application that allows me to create documents (notebooks) in which I write Python code and display the input and output of that code. The Jupyter Notebook is widely used for other programming languages too. While Googling which IDE (integrated development environment) is most used for Python programming, I found that Spyder is another coding tool, and it is also available from Anaconda.

Clicking the “Launch” button launches my browser at localhost (http://localhost:8888/tree) and shows the Jupyter Notebook dashboard with my machine’s working directory; in my case, it is C:\Users\<myname>. I create a new notebook by clicking the “New” button on the right side of the window. A new window launches, and I can see the Jupyter Notebook.

I renamed my notebook to “HelloWorld” as below. Then, I started Python programming by writing the first line of code.

print("Hello World, Happy New Year 2021")

First Python code

print() displays whatever is inside the double quotes when I run the selected cell. The number in the brackets after In [1]: stands for the number of commands run. If you keep running the same cell over again, the number changes for that cell. If you switch to another cell and run it, the number changes for that cell too.

Alright, I stop here for the first blog entry of my Python programming learning journey. Here, I have covered the following topics:

  • Anaconda
  • Python
  • Jupyter Notebook

PIP

It is not the first time I have written about Python programming on my blog. Previously, I installed Python packages by using pip, the package manager. Pip is a tool that allows me to install and manage additional libraries and dependencies. Pip installs Python packages, whereas conda installs packages which may contain software written in any language. You can refer to my earlier entry, which shares how I updated pip and Python on my Windows machine. Installation using pip completes in the Windows command prompt.
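
For example (a minimal sketch; the package name is just an illustration):

python -m pip install --upgrade pip   # update pip itself
pip install pandas                    # install a package and its dependencies
pip list                              # show what is installed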

In my previous Python posts, I wrote about how I used IntelliJ as the IDE and signed up for a course at Udemy to begin my Python learning journey. Here is the link to Day 1: Let Get Started with Python. If you wish to check out my previous write-ups, please visit this link. I hope you enjoy my sharing; please stay tuned for the next updates. Thank you.


Business Analytics Framework

I am looking through the slides from my Specialized Diploma in Business Analytics today, since it is a public holiday in Singapore, and I am taking the opportunity to focus on reading. The diagram above brought me back to Data Management, which my boss told me about some time ago. Recently, my teammate told me that there is supposed to be a CDMP (Certified Data Management Professional) examination in August; however, he is busy with work and unable to prepare for it. Last year, I thought about taking the same certification as a step towards the professional level.

According to DMBOK, data management is vital for every organization. Whether known as Data Management, Data Resource Management, or Enterprise Information Management, organizations increasingly recognize that the data they possess is a valuable asset which must be appropriately managed to ensure success.

DMBOK is the Data Management Body of Knowledge, similar to other professions’ bodies of knowledge, for example, project management (PMBOK) and software engineering (SWEBOK).

Why Do I Link It with the DMBOK?

Based on the above diagram, I realized that I learned about data integration, central metadata management, and data warehouses during the Diploma, and this relates to some of the topics in the DMBOK. Besides that, the Diploma covered data governance in general.


I extracted some of the points from the 11 Data Management Knowledge Areas above:

  • Data Governance – plan, oversight, and control over the management of data and the use of data and data-related resources. While we understand that governance covers ‘processes’, not ‘things’, the common term for Data Management Governance is Data Governance.
  • Data Integration & Interoperability – acquisition, extraction, transformation, movement, delivery, replication, federation, virtualization and operational support (a Knowledge Area new in DMBOK2).
  • Metadata – collecting, categorizing, maintaining, integrating, controlling, managing, and delivering metadata.
  • Data Warehousing & Business Intelligence – managing analytical data processing and enabling access to decision support data for reporting and analysis.

Reading through the descriptions above, the business analytics framework consists of some of the data management knowledge areas. These knowledge areas give us an idea of the industry standards, terminology, and common best practices, without going into implementation details.

Data Governance in BI

Furthermore, the Data Governance knowledge area is a big topic. In Business Intelligence, it governs all activities within the environment. The guiding principles ensure the information is managed as a corporate asset: standardized, integrated, and reused across the organization.

Objectives of BI Governance

  • Clearly defined authority and accountability, roles and responsibilities.
  • Program planning, prioritization, and funding processes.
  • Communicating strategic business opportunities to IT.
  • Transparent decision-making processes for development activities.
  • Tracking value and reporting results.

Benefits of BI and Data Governance

  • Generate greater ROI.
  • Balance business needs and IT imperatives.

Regardless of whether you are working in the public or private sector, data governance in the business intelligence or business analytics context plays an important role in every organization. Hence, revising what I learned during my Specialized Diploma helped me to understand what the industry needs and to make the linkage to DMBOK topics. I am not sure how many of us are into this topic area, but I believe it is a good topic to discuss with the community, to share the best or common practices, and to learn from each other to improve standards and guidelines.

Explore Power BI Desktop

I updated my Power BI Desktop via the Windows Apps Store recently, so now is a good time to share the new user interface (UI) of Power BI after the installation. In 2019, during my Specialized Diploma study, the Power BI Desktop skin was in dark mode. I am not sure when the Microsoft team changed the Power BI Desktop skin to light mode, as well as moving the Filter pane to the right side.

Another new feature I spotted is that Power BI has theme options for dashboards and reports. This theme does not refer to the Power BI Desktop skin. You have to enable this feature from the Power BI settings, and it allows you to change the theme to suit your dashboard and report presentation.

To do so, navigate to the File menu, select Options and Settings, then Options. Next, in the Preview features section, select Customize current theme.

Click the OK button to proceed. It may prompt you to restart the application so that the theme feature takes effect. There is a list of built-in themes available in Power BI Desktop, and you can refer to this link for more detail. Furthermore, you can optionally export a theme’s JSON file and make amendments by manually modifying the settings in that file. You can rename that fine-tuned JSON file and import it later. This gives users more control to customize the theme according to their dashboards and reports.
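
As an illustration, a minimal made-up theme file might look like the following; the field names follow the basic report-theme JSON format, but treat this as a sketch rather than a complete reference:

{
  "name": "My Custom Theme",
  "dataColors": ["#31B6FD", "#4584D3", "#5BD078", "#A5D028"],
  "background": "#FFFFFF",
  "foreground": "#252423",
  "tableAccent": "#31B6FD"
}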

Getting familiar with the interface

The Microsoft website shares the details of each pane, labelled below. I extracted the picture and its explanation.

  1. Ribbon – Displays common tasks that are associated with reports and visualizations.
  2. Report view, or canvas – It is the place where visualizations are created and arranged. You can switch between Report, Data, and Model views by selecting the icons in the left column.
  3. Pages tab – This area is where you would select or add a report page.
  4. Visualizations pane – It is the pane where you can change visualizations, customize colours or axes, apply filters, drag fields, and more.
  5. Fields pane – It is the pane where query elements and filters can be dragged onto the Report view or dragged to the Filters area of the Visualizations pane.

You can collapse the Visualizations and Fields panes to provide more space in the Report view by selecting the small arrow.

The screenshot above shows an example of the collapsible pane for the Filter pane. It works for the Visualizations and Fields panes too.

Connect to data sources

Power BI Desktop connects to many types of data sources; you can choose from local databases, Excel sheets, or data in the cloud. There are about 70 different types of data sources available. Go to Get Data on the Home tab of the ribbon to begin accessing the data. Then, select a source to establish a connection. For some data source connections, you may be required to input user credentials to authenticate and access the data. Here is the list of data connectors available in Power BI’s Get Data function.

It brings you to the Navigator window, which displays the entities (tables) of your data source and gives you a preview of the selected data. In the same window, you can choose to Load or Transform Data. If you are not making any changes, formatting, or data transformations, you can click the Load button; otherwise, Transform Data allows you to perform data cleaning and conversion before importing the data into Power BI Desktop. You are allowed to edit the data after importing too.

Transform data to include in a report

Power BI Desktop includes the Power Query Editor tool, which helps you shape and transform data so that it is ready for your visualizations. There are two ways to bring up the Power Query Editor window:

  1. use Transform Data button on the Home ribbon. [For April/2020 version]
  2. use Edit Queries button on the Home ribbon. [For older versions]

If you click the Enter Data button on the Home ribbon (as shown above), a Create Table window pops up. From this window, clicking the Edit button brings up the Power Query Editor tool. Remember the Load and Transform Data buttons I mentioned earlier, when we load data via the Get Data button? The Transform Data button brings up the Power Query Editor too, functioning similarly to the Create Table window’s Edit button. I am not going to cover any data transformation in this blog; it is a big topic, so I think it is better to share it with some good examples and a dataset in the next article.

Connect from multiple sources

Most of the time, we deal with more than one data source when building a report. You can use the Power Query Editor tool to combine data from multiple sources into a single report. How is it able to combine them into a single table? Power BI Desktop has a feature called Append Queries that adds the data from a new table to an existing query.

Create a visual

If I remember correctly, in Tableau, when fields are selected, Tableau suggests suitable visualizations for the user to use in the dashboard or report. I am not sure whether Power BI has a similar feature. In the Report view, when you drag a field onto the canvas, Power BI Desktop automatically creates a table visual as the default. This visual acts as a report listing because it lists the selected fields in tabular form. You can choose a different visual, such as a bar chart or line graph, if you wish.

To create a visual, select a field from the Fields pane; you can drag the field into the data field (Values) in the Visualizations pane, or you can click its checkbox. A table visual displays on the screen, and you can then choose another type of visual from the Visualizations pane. There is no required order for creating a visual; you can select a visual before selecting the fields. Each visual has different visualization settings. For example, if you choose a dual chart, the following screenshot shows a shared axis, column values, and line values; when you choose a pie chart, it displays a legend and values.

Publish a report

After all the hard work on the dashboard or reports, you want to publish it and share it with other people. You can do so in Power BI Desktop by clicking on the Publish button in the Home menu. You will be prompted to sign in to Power BI, follow the steps and you will see the published reports after that.

At this point of writing, I do not have any published report to show. Therefore, I cannot put up the steps here and show how to pin a visual to the dashboard. This feature allows you to choose whether to pin the visual to an existing dashboard or to create a new dashboard.

Conclusion

This article is a high-level walkthrough of Power BI Desktop that explains how to use it to create visuals and publish dashboards and reports. I do not cover visualization and publication in detail in this article; I will include them in a future article.

I hope this article gives a good impression of Power BI Desktop’s features and gives you a feel for the tool. Furthermore, Power BI Desktop’s buttons are self-explanatory, so you should not have trouble using and navigating it. Besides that, people who have been using Microsoft Excel and Tableau for data analysis may find that Power BI Desktop has some similar functions, because it is another data visualization tool.

Reference: microsoft.com

Power BI – Learning new skill

Recently, I got access to an abundance of online learning resources for Microsoft Power BI. I learned the fundamentals of using Power BI in my Specialized Diploma course. Now is a good time to recap what I have learned.

So, what is Power BI? It is a Microsoft product: a business analytics service that delivers insights to enable fast, informed decisions. The software has both a free version and paid versions, Pro and Premium (with different subscription fees and features). A small introduction to Power BI and its versions follows.

What is Power BI Desktop?

Power BI Free/Desktop enables you to connect to 70+ data sources, analyse data, publish to the web, export to Excel and much more. The free version gives you the basic features of Power BI.

What is Power BI Pro?

Power BI Pro is the full version of Power BI, which means it comes complete with the ability to build dashboards and reports, plus unlimited viewing, sharing, and consumption of your created reports (and reports shared by others), which is not possible with Power BI Desktop.

What is the difference?

  • Power BI Pro can share data, reports, and dashboards with a large number of other users who also have a Power BI Pro license.
  • Power BI Pro can create app-based workspaces.
  • Power BI Pro has a 10 GB per Pro user data storage limit.

Maybe these differences are a little irrelevant if you just want to learn Power BI for leisure instead of using it commercially. For personal learning, I did not need to use up to 10 GB of data. As long as my email account is valid, I can start using Power BI.

What is the Power BI App?

All Power BI versions can be accessed via mobile applications. Furthermore, the Power BI Mobile applications are available for multiple platforms, including Android, iOS and Windows devices.

What is Power BI Report Server?

Power BI Report Server is an on-premises (at your own location) server that publishes and shares Power BI reports via a website within your organisation’s firewall (infrastructure). Power BI On-Premises, or Report Server, is an option included with Power BI Premium and is ideal for your business if you want to establish reporting infrastructure on-premises and have it operate under your own policies and rules. The server allows you to seamlessly scale up and move to the cloud if you wish to do so.

The visual above helps in understanding all of this. There are three elements: Power BI Desktop, the Power BI service, and the Mobile apps. Power BI Desktop accesses the data and creates the dashboards and reports; these are then published to the Power BI service and shared with users, who can also access them via Power BI Mobile.

By now, you may start getting familiar with some of the terms used in the Power BI. These are some of them:

  • Dashboard, visualization, or tile. A tile is a single visualization on a dashboard or report. A visualization is a visual representation of data, like a chart. A dashboard is a collection of visuals on a single page.
  • Reports. A report is a collection of visualizations that appear together on one or more pages.
  • Datasets. A dataset is a collection of data that Power BI uses to create its visualizations.

The example above shows a dashboard containing bar charts, a line graph, and cards. These are different visualizations available in Power BI, and the red box refers to a tile.

Limitations: Power BI Free/Desktop

As most of us in the learning stage will use the Power BI Free version, there are some feature limitations with Power BI Desktop.

  • Can’t share created reports with non-Power BI Pro users
  • No App Workspaces
  • No API embedding
  • No email subscriptions
  • No peer-to-peer sharing
  • No support to analyse in Excel within Power BI Desktop

However, there are useful features available for Power BI Free/Desktop users.

Advantages: Power BI Free/Desktop

  • You can connect and import data from over 70 cloud-based and on-premises sources
  • The same rich visualisations and filters from Power BI Pro
  • Auto-detect that finds and creates data relationships between tables and formats
  • Export your reports to CSV, Microsoft Excel, Microsoft PowerPoint and PDF
  • Python support
  • Save, upload and publish your reports to the Web and the full Power BI service
  • Storage limit of 10 GB per user

I will share more about Power BI Desktop from time to time, as part of my learning objectives and to improve my technical writing. I hope to hear feedback from my readers from time to time. Please help me fill out the survey form so that I can improve my next blog.

References:
https://dynamics.folio3.com/blog/difference-between-power-bi-pro-vs-free-vs-premium/
https://docs.microsoft.com/en-gb/learn/modules/get-started-with-power-bi/1-introduction

Data Management: Data Wrangling Versus ETL

Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.

Data Wrangling Versus ETL: What’s the Difference?

Here are the top three major differences between the two technologies.

1. The Users Are Different

The core idea of data wrangling technologies is that the people who know the data best should be exploring and preparing that data. This means business analysts, line-of-business users, and managers (among others) are the intended users of data wrangling tools. I can personally attest to the painstaking amount of design and engineering effort that has gone into developing a product that enables business people to intuitively do this work themselves.

In comparison, ETL technologies are focused on IT as the end-users. IT employees receive requirements from their business counterparts and implement pipelines or workflows using ETL tools to deliver the desired data to the systems in the required formats.

Business users rarely see or leverage ETL technologies when working with data. Before data wrangling tools were available, these users’ interactions with data would only occur in spreadsheets or business intelligence tools.

2. The Data Is Different

The rise of data wrangling software solutions came out of necessity. A growing variety of data sources can now be analyzed, but analysts didn’t have the right tools to understand, clean, and organize this data in the appropriate format. Much of the data business analysts must deal with today comes in a growing variety of shapes and sizes that are either too big or too complex to work with in traditional self-service tools such as Excel. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale.

ETL is designed to handle data that is generally well-structured, often originating from a variety of operational systems or databases the organization wants to report against. Large-scale data or complex raw sources that require substantial extraction and derivation to structure are not one of the ETL tools’ strengths.

Additionally, a growing amount of analysis occurs in environments where the schema of data is not defined or known ahead of time. This means the analyst doing the wrangling is determining how the data can be leveraged for analysis as well as the schema required to perform that analysis.
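
As a small illustration of that schema-on-read idea (a pandas sketch of my own, not from the article), the analyst flattens nested records into columns only at analysis time:

import pandas as pd

# Raw records whose schema is not defined ahead of time.
raw = [
    {"user": {"id": 1, "city": "Singapore"}, "spend": 120.5},
    {"user": {"id": 2, "city": "Jakarta"}, "spend": 80.0},
]

# The analyst decides the schema while wrangling: flatten nested
# fields into columns suitable for analysis.
df = pd.json_normalize(raw)
print(df.columns.tolist())   # ['spend', 'user.id', 'user.city']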

3. The Use Cases Are Different

The use cases we see among users of data wrangling solutions tend to be more exploratory in nature and are often conducted by small teams or departments before being rolled out across the organization. Users of data wrangling technologies typically are trying to work with a new data source or a new combination of data sources for an analytics initiative. We also see data wrangling solutions making existing analytics processes more efficient and accurate as users can always have their eyes on their data as they prepare it.

ETL technologies initially gained popularity in the 1970s as tools primarily focused on extracting, transforming, and loading data into a centralized enterprise data warehouse for reporting and analysis via business intelligence applications. This continues to be the primary use case for ETL tools and one that they are extremely good at.

With some customers, we see data wrangling and ETL solutions deployed as complementary elements of an organization’s data platform. IT leverages ETL tools to move and manage data, so business users have access to explore and prepare the appropriate data with data wrangling solutions. For other data series blog entries of mine, please click here.

Reference: https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx

January 2020

I hope it is not too late to write out my plans for the year 2020. My volunteer work with TechLadies will come to an end this March. TechLadies is recruiting the new core team for the year 2020. The upcoming boot-camp graduation will introduce the new team to the community, and then the 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

I have had this question in my mind lately, and I am not sure how TechLadies plans for it. I am quite sure it would be a great idea to let a new team lead the community. New team, new ideas and directions.

I may consider taking a side role to continue the study group sessions. But I also hope that someone will plan and run the study group sessions together with me. If not, I will run the events slowly, as and when I am available. I am not sure whether a mobile study group would work in Singapore.

Besides TechLadies, what else?

Good question. I have a plan to run a learn-and-teach program, after being inspired by my classmate. This program teaches the community (not necessarily only within TechLadies) what I have learned recently.

I will randomly pick a topic to learn and share with the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving input without receiving feedback from the community.

I hope I will write and share more technical material through my blog here, as well as my posts on Medium.

New focuses

I am looking out for other communities in Singapore that work closely on master data management (MDM), focus on SQL and NoSQL databases, work on data engineering, and use Power BI for data visualization.

I am not going away from my core interest, databases. Also, I want to go in-depth into master data management and will consider taking some courses or certifications in this area. Next, I need to upskill and gain essential experience in the data engineering field while continuing to explore data visualization with Power BI. I am still looking out for a data engineering meetup or user group in Singapore. Do you know of any?

Not to forget, I am doing data analytics as my final module at Temasek Poly. It is going to be an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate, so that everything is fresh in my mind. Right now, I have completed 10 of 26 modules. I still need to complete some Pandas, statistics, and machine learning topics before the end of February. Maybe I will take a bit of time off from other activities to focus on study and work.
