Part 2: Introduction to Kafka

I wrote an introduction to Kafka a while ago without touching on the technical side or its use cases. I will not explain each use case in detail for now. There are a couple of terms to be familiar with for this post. I used an image I downloaded from the Internet to explain them.

Image: https://images.app.goo.gl/A6PnHPocHe8yJeveA

There are four core APIs (Application Programming Interfaces) we need to know:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics (a minimal Java sketch follows this list).
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
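
As a rough illustration of the Producer API, here is a minimal sketch in Java. The broker address (localhost:9092), the topic name (my-topic), and the key/value contents are assumptions for the example, not part of any real setup:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed broker address; adjust to match your cluster.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Publish one record to the hypothetical topic "my-topic".
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
            }
        }
    }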

We can run Kafka on a single server (node) or in cluster mode with multiple nodes (Kafka brokers). Producers are processes that publish data or a stream of records (push messages) into Kafka topics within the broker. A consumer pulls records off one or more Kafka topics and processes the stream of records produced to them, as in the sketch below.
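
To make the pull model concrete, a consumer sketch might look like the following. The broker address, group id (demo-group), and topic name are again hypothetical:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "demo-group");              // hypothetical consumer group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    // Pull the next batch of records from the broker.
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("key=%s value=%s%n", record.key(), record.value());
                    }
                }
            }
        }
    }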

Main parts of Kafka system:

  • Broker: Handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
  • Zookeeper: Keeps the state of the cluster (brokers, topics, users). It is a separate coordination service.
  • Producer: Sends records to a broker.
  • Consumer: Consumes batches of records from the broker.

For now, I will keep the explanation of Zookeeper for another blog post. In my self-learning course, the instructor shared some use cases for Kafka:

  • Messaging system
  • Activity tracking
  • Application log gathering
  • Stream processing with Spark or the Kafka Streams API
  • Decoupling of system dependencies
  • Integration with Spark, Flink, Hadoop, Storm, and other Big Data technologies

Reference:
https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html
https://docs.confluent.io/
https://kafka.apache.org/

Data Management: Data Wrangling Versus ETL

Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.

Data Wrangling Versus ETL: What’s the Difference?

Here are the top three major differences between the two technologies.

1. The Users Are Different

The core idea of data wrangling technologies is that the people who know the data best should be exploring and preparing that data. This means business analysts, line-of-business users, and managers (among others) are the intended users of data wrangling tools. I can personally attest to the painstaking amount of design and engineering effort that has gone into developing a product that enables business people to intuitively do this work themselves.

In comparison, ETL technologies are focused on IT as the end-users. IT employees receive requirements from their business counterparts and implement pipelines or workflows using ETL tools to deliver the desired data to the systems in the required formats.

Business users rarely see or leverage ETL technologies when working with data. Before data wrangling tools were available, these users’ interactions with data would only occur in spreadsheets or business intelligence tools.

2. The Data Is Different

The rise of data wrangling software solutions came out of necessity. A growing variety of data sources can now be analyzed, but analysts didn’t have the right tools to understand, clean, and organize this data in the appropriate format. Much of the data business analysts must deal with today comes in a growing variety of shapes and sizes that are either too big or too complex to work with in traditional self-service tools such as Excel. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale.

ETL is designed to handle data that is generally well-structured, often originating from a variety of operational systems or databases the organization wants to report against. Large-scale data or complex raw sources that require substantial extraction and derivation to structure are not among the strengths of ETL tools.

Additionally, a growing amount of analysis occurs in environments where the schema of data is not defined or known ahead of time. This means the analyst doing the wrangling is determining how the data can be leveraged for analysis as well as the schema required to perform that analysis.

3. The Use Cases Are Different

The use cases we see among users of data wrangling solutions tend to be more exploratory in nature and are often conducted by small teams or departments before being rolled out across the organization. Users of data wrangling technologies typically are trying to work with a new data source or a new combination of data sources for an analytics initiative. We also see data wrangling solutions making existing analytics processes more efficient and accurate as users can always have their eyes on their data as they prepare it.

ETL technologies initially gained popularity in the 1970s as tools primarily focused on extracting, transforming, and loading data into a centralized enterprise data warehouse for reporting and analysis via business intelligence applications. This continues to be the primary use case for ETL tools and one that they are extremely good at.

With some customers, we see data wrangling and ETL solutions deployed as complementary elements of an organization’s data platform. IT leverages ETL tools to move and manage data, so business users have access to explore and prepare the appropriate data with data wrangling solutions.

Reference: https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx

How to install CentOS 7 using the GUI in VirtualBox

It took me a while to get this installation working on my machine. First, I had to install VirtualBox; previously, I was using VMware Workstation. The VirtualBox installation was completed before installing CentOS 7.

Set up a new VM

When you launch VirtualBox, it looks empty as above, or you may have other virtual machines (VMs) set up already. I set up a new VM by clicking the “New” button. Next, follow through the wizard, a guided mode for setting up the new VM. If you are familiar with VirtualBox, you can use Expert Mode instead.

There are a few things to be done in the above screen.

  • Name the VM
  • Set the folder path
  • Select Type: Linux
  • Version: Red Hat (64-bit)

Note: CentOS is a clone of Red Hat Enterprise Linux and uses a similar architecture.

This screen allocates the amount of memory for the virtual machine. In my setup, I left it at the default. You can allocate more memory if your machine has enough to spare.

In the above screen, choose to create a virtual disk image (VDI) and proceed to create it.

Choose the storage size on the physical hard disk. There are two options:

A fixed-size disk is not recommended in this scenario because you will be downloading many packages to run various applications.

A dynamically allocated disk will use space on the physical hard disk only as it fills up. If you select dynamic allocation, make sure that your hard drive has enough free space. 15GB of space is sufficient to start with.

Click “Next” to proceed.

Click “Create” to proceed and finish the setup. Once the virtual machine has been created successfully, the screen appears as below:

You can run the virtual machine now, but before installing CentOS you need to download the CentOS ISO image and link it to the newly created virtual machine.

Where to download the ISO image?

I downloaded the ISO image from this link. The download may take a while to complete due to the file size. The file comes with a .iso file extension.

Link up ISO image with VM

From the screenshot above, click the “Settings” button and go to “Storage”. Under the optical drive (Empty), select the ISO image (.iso) file that you downloaded earlier. You also need to enable the network adapter so that the VM can use the internet to download the required packages.

Start the virtual machine

Click on the “Start” button to start the virtual machine. There are different options for running the virtual machine in VirtualBox. Select the option “Install CentOS Linux 7” and proceed to install. Again, it will take a while to load the required packages and complete the installation.

Once it is ready, you will see the opening screen of the installer. It asks for basic information to set up the server, such as language, timezone, and user account. You can configure these accordingly.

Reference: https://resources.infosecinstitute.com/installing-configuring-centos-7-virtualbox/

Introduction to Kafka

The idea came about because when we have multiple source systems and target systems, the integrations require many different configurations. Each of these configurations comes with difficulties around:

  • Protocol – how the data is transported (examples: HTTP, REST, TCP).
  • Data format – how the data is parsed (examples: CSV, JSON, binary).
  • Data schema – how the data is shaped, which may change over time.

Each source system may also face an increased load from the many connections.

Why Apache Kafka?

Decoupling the data streams & systems

What is Apache Kafka?

Apache Kafka is a high-throughput distributed messaging system (or streaming platform). It was created at LinkedIn and is now an open-source Apache project, with major contributions from Confluent.

You can have data streams from websites, microservices, financial transactions, and so on. Once the data is in Kafka, you may want to put it into your databases, analytics systems, email systems, and so on.

Kafka is used for these broad classes of applications:

  • Building real-time streaming data pipelines that reliably get data between systems or applications.
  • Building real-time streaming applications that transform or react to the streams of data (a sketch follows this list).
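
As a rough sketch of the second class of application, the Streams API can transform records between topics. The topology below copies records from an input topic to an output topic, upper-casing each value; the application id, broker address, and topic names are all hypothetical:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("input-topic");
            // Transform each record's value and write the result to the output topic.
            input.mapValues(value -> value.toUpperCase()).to("output-topic");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }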

Kafka Concepts

  • Kafka runs as a cluster on one or more servers that can span multiple data centres.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp (a sketch follows this list).
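
To illustrate the record structure, here is a minimal sketch that builds a record with all three fields set explicitly. The topic name, key, and value are hypothetical:

    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RecordFields {
        public static void main(String[] args) {
            // A Kafka record carries a key, a value, and a timestamp.
            // A null partition lets the partitioner decide where the record goes.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "my-topic",                 // hypothetical topic
                    null,                       // partition
                    System.currentTimeMillis(), // timestamp (epoch millis)
                    "user-42",                  // key
                    "page_view"                 // value
            );
            System.out.println(record.key() + " / " + record.value() + " @ " + record.timestamp());
        }
    }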

Reference:
https://docs.confluent.io/
https://kafka.apache.org/

VirtualBox Installation

I installed VirtualBox on my machine recently, and I planned to write about the installation steps for a Windows machine. It is easy with the wizard; just follow through the steps during the installation. First, download the installer from the website. I used this link.

Once the download completes, run the .exe file. The first view of the wizard looks as below. Depending on the version you downloaded, the interface may look different. I used version 6.1.4 for this installation.

Set up using the wizard

Click “Next” to proceed.

Next, select the location where you want to install the program. I left it at the default location. This screen shows the disk space required to install the software on your machine. Click “Next” to proceed.

Next, you can choose whether to create shortcuts on your machine. In my case, I unticked the checkboxes for creating shortcuts on the desktop and in the Quick Launch bar. Click “Next” to proceed.

Then it shows a warning page; you can just click “Next” to proceed and click “Install” to install the software on your machine. Make sure you allow the wizard to continue installing the software when it prompts you.

Launch the VirtualBox

You can begin to use VirtualBox once you have downloaded some of the images to run in it, available here.

If your machine is running Linux, you can install VirtualBox from this link. It lists the command lines that install the software. Choose the correct Linux version to begin the installation.

Chong Qing Grilled Fish

It is located at Mosque Street in Chinatown; it is the same restaurant as the Bugis branch at Liang Seah Street. There was no queue on the day I visited with two other friends. I had made a reservation beforehand, and we got our table after 10 minutes of waiting. They provide some drinks at the entrance of the restaurant for waiting customers.

Grilled Fish with Mild Spiciness

We ordered the grilled seabass at a mild level of spiciness with a few additional ingredients such as lotus root and enoki mushrooms. Contrary to my expectation, the ingredients had already been added to the pot when it was served. It would be better if they served the side dishes separately and allowed us to add them to the grilled pot as and when we liked, to avoid overcooking the ingredients.

Stir-fried Clams

The stir-fried spicy clams were quite delicious, and the spiciness kicked in well. The sauce went well with the white rice I ordered. The portion of clams was generous too.

Stir-fried Frog Meat

The stir-fried frog meat was a disappointment because it did not taste as good. It was plainly spicy without much other flavour, and it was oily. It seemed like a mala xiang guo with two or three ingredients stir-fried together.

All the dishes we ordered were mildly spicy, but eating them at the same time made us feel the spiciness had increased to hot. That made me want to order a drink from their menu to ease the spiciness and the saltiness of the food.

Drink

I cannot remember clearly which drink I ordered. It could have been a concoction of Yakult with something else, and it tasted quite nice.

Address: 18 Mosque St, #01-01, Singapore 059498.

Breakthrough Cafe

It is located at People’s Park Centre, opposite Chinatown Point. It is easily accessible, facing the side of the Singapore State Courts. It opens in the morning for dim sum breakfast and serves lunch before closing around 4pm, according to my colleague.

Four of us walked to this restaurant for lunch and we decided to share the dishes together.

We tried the pig trotters; it is not my favourite at all, especially the amount of fat on top of the meat. So, I took half of an egg soaked in the vinegar instead. The sauce is rich in collagen and said to be protein-rich, helping with muscle strengthening and repair. No doubt it is one of those warm-your-heart dishes.

My favourite was the sesame oil chicken topped with shredded fried ginger. According to two of my colleagues, the shredded fried ginger takes a lot of effort and time to prepare and cook. The sauce mixes nicely with rice, and the amount of meat inside the little claypot was generous enough for us to share.

The sesame oil chicken always sells out fast, so it would be good to be there early for lunch to avoid missing this dish.

The less promising dish was the curry fish with assorted vegetables. The sauce was a little diluted. All of us agreed that this curry needs more santan (coconut milk) to make it taste richer and the sauce thicker. We wished it had more pineapple too. However, the portion looked generous, with a lot of vegetables.

My colleague recommended we try the egg tarts as well. They are Hong Kong-style egg tarts, which sell like hot cakes.

Address: 101A Upper Cross Street, #01-02A-C, People’s Park Centre, 058358 Singapore.