Part 2: Introduction to Kafka

I wrote an introduction to Kafka a while ago without touching on its technical side or its use cases. I will not explain each use case in detail for now. There are a couple of terms to be familiar with before reading this post. I used an image I downloaded from the Internet to explain them.


There are four core APIs (Application Programming Interfaces) we need to know:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

We can run Kafka on a single server (node) or in cluster mode with multiple nodes (Kafka brokers). Producers are processes that publish a stream of records (push messages) into Kafka topics within the broker. A consumer pulls records off one or more Kafka topics and processes the stream of records produced to them.
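To make the push and pull roles concrete, here is a minimal in-memory sketch. It is a toy model and not the Kafka client API: the names ToyBroker and ToyConsumer, the topic "page-views" and the batch size are all made up for illustration, and the position a consumer remembers stands in for what Kafka calls an offset.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, record):
        """Producer side: push a record and return its position in the topic."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def fetch(self, topic, offset, max_records):
        """Consumer side: pull up to max_records starting at the given position."""
        return self._topics[topic][offset:offset + max_records]


class ToyConsumer:
    """Remembers how far it has read in each subscribed topic."""

    def __init__(self, broker, topics):
        self._broker = broker
        self._offsets = {topic: 0 for topic in topics}

    def poll(self, max_records=2):
        """Pull the next batch from every subscribed topic."""
        batch = []
        for topic, offset in self._offsets.items():
            records = self._broker.fetch(topic, offset, max_records)
            self._offsets[topic] = offset + len(records)
            batch.extend((topic, record) for record in records)
        return batch


broker = ToyBroker()
for page in ["home", "cart", "checkout"]:
    broker.publish("page-views", page)  # the producer pushes

consumer = ToyConsumer(broker, ["page-views"])
first_batch = consumer.poll()   # the consumer pulls a batch
second_batch = consumer.poll()  # and continues where it stopped
```

The producer never needs to know who reads the records, and the consumer decides when to ask for more, which is the decoupling the paragraph above describes.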

Main parts of a Kafka system:

  • Broker: Handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
  • ZooKeeper: A separate system that keeps the state of the cluster (brokers, topics, users).
  • Producer: Sends records to a broker.
  • Consumer: Consumes batches of records from the broker.

For now, I will leave the explanation of ZooKeeper for another post. In my self-learning course, the instructor shared some use cases for Kafka:

  • Messaging system
  • Activity tracking
  • Application logs gathering
  • Stream processing with Spark or the Kafka Streams API.
  • Decoupling system dependencies.
  • Integration with Spark, Flink, Hadoop, Storm and other Big Data technologies.


Data Management: Data Wrangling Versus ETL

Data management (DM) consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise, to meet the data consumption requirements of all applications and business processes.

Data Wrangling Versus ETL: What’s the Difference?

Here are the top three differences between the two technologies.

1. The Users Are Different

The core idea of data wrangling technologies is that the people who know the data best should be exploring and preparing that data. This means business analysts, line-of-business users, and managers (among others) are the intended users of data wrangling tools. I can personally attest to the painstaking amount of design and engineering effort that has gone into developing a product that enables business people to intuitively do this work themselves.

In comparison, ETL technologies are focused on IT as the end-users. IT employees receive requirements from their business counterparts and implement pipelines or workflows using ETL tools to deliver the desired data to the systems in the required formats.

Business users rarely see or leverage ETL technologies when working with data. Before data wrangling tools were available, these users’ interactions with data would only occur in spreadsheets or business intelligence tools.

2. The Data Is Different

The rise of data wrangling software solutions came out of necessity. A growing variety of data sources can now be analyzed, but analysts didn’t have the right tools to understand, clean, and organize this data in the appropriate format. Much of the data business analysts must deal with today comes in a growing variety of shapes and sizes that are either too big or too complex to work within traditional self-service tools such as Excel. Data wrangling solutions are specifically designed and architected to handle diverse, complex data at any scale.

ETL is designed to handle data that is generally well-structured, often originating from a variety of operational systems or databases the organization wants to report against. Large-scale data or complex raw sources that require substantial extraction and derivation to structure are not among the strengths of ETL tools.

Additionally, a growing amount of analysis occurs in environments where the schema of data is not defined or known ahead of time. This means the analyst doing the wrangling is determining how the data can be leveraged for analysis as well as the schema required to perform that analysis.
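As a small illustration of working without a predefined schema, the sketch below guesses column types from the raw values themselves. The sample data, the column names and the three type labels are assumptions made up for this example; real wrangling tools do far more than this.

```python
import csv
import io

# Hypothetical raw export: no schema is supplied, and one date is missing.
RAW = """order_id,amount,shipped_on
1001,19.90,2020-03-01
1002,5,
1003,7.5,2020-03-04
"""


def guess_type(values):
    """Return the narrowest type that fits every non-empty value."""
    present = [value for value in values if value != ""]
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in present:
                cast(value)
            return name
        except ValueError:
            continue
    return "text"


def infer_schema(text):
    """Read the rows as strings and guess a type for each column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = rows[0].keys()
    return {col: guess_type([row[col] for row in rows]) for col in columns}


schema = infer_schema(RAW)
```

Here the analyst, not an upfront design, decides what the data looks like: the schema is an output of exploring the data rather than an input.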

3. The Use Cases Are Different

The use cases we see among users of data wrangling solutions tend to be more exploratory in nature and are often conducted by small teams or departments before being rolled out across the organization. Users of data wrangling technologies typically are trying to work with a new data source or a new combination of data sources for an analytics initiative. We also see data wrangling solutions making existing analytics processes more efficient and accurate as users can always have their eyes on their data as they prepare it.

ETL technologies initially gained popularity in the 1970s as tools primarily focused on extracting, transforming, and loading data into a centralized enterprise data warehouse for reporting and analysis via business intelligence applications. This continues to be the primary use case for ETL tools and one that they are extremely good at.
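The extract, transform and load steps can be sketched in a few lines. This is only a toy version under stated assumptions: the CSV content, the sales table and the in-memory SQLite database standing in for a data warehouse are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system.
SOURCE_CSV = """customer,country,amount
alice,sg,120.50
bob,my,80.00
alice,sg,30.25
"""


def extract(text):
    """Extract: read rows from the operational export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise country codes and convert amounts to numbers."""
    return [
        (row["customer"], row["country"].upper(), float(row["amount"]))
        for row in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(warehouse, transform(extract(SOURCE_CSV)))

# A reporting query of the kind a business intelligence tool might run.
report = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

Note that the table structure is fixed before any data arrives, which is the main contrast with the exploratory wrangling use cases described earlier.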

With some customers, we see data wrangling and ETL solutions deployed as complementary elements of an organization’s data platform. IT leverages ETL tools to move and manage data, so business users have access to explore and prepare the appropriate data with data wrangling solutions.


How to install CentOS 7 using the GUI in Virtual Box

It took me a while to get this installation working on my machine. First, I had to install VirtualBox; previously, I was using VMware Workstation. The VirtualBox installation was completed before I installed CentOS 7.

Setup new VM

When you launch VirtualBox, the list of virtual machines (VMs) is empty, as above, unless you have set some up before. To set up a new VM, click the “New” button. Next, follow the wizard, which guides you through setting up the new VM. If you are familiar with VirtualBox, you can use Expert Mode instead.

There are a few things to do on the above screen.

  • Name the VM
  • Set the folder path.
  • Select Type: Linux
  • Version: Red Hat (64-bit)

Note: CentOS is a clone of Red Hat Enterprise Linux and uses a similar architecture, which is why the Red Hat version is selected.

On this screen, you allocate the amount of memory for the virtual machine. In my setup, I left it at the default. You can allocate more if your machine has enough memory.

On the above screen, choose to create a virtual disk (VDI) and proceed.

Choose how the storage is allocated on the physical hard disk. There are two options:

A fixed-size disk is not recommended for this setup because you will be downloading many packages to run various applications.

A dynamically allocated disk uses space on the hard disk only as it fills up. Select dynamically allocated and make sure that your hard drive has enough free space. 15 GB is sufficient to start with.

Click “Next” to proceed.

Click “Create” to finish the setup. Once the virtual machine has been created successfully, the screen appears as below:

The virtual machine can be started now, but first you need to download the CentOS ISO image and link it to the newly created virtual machine.

Where to download the ISO image?

I downloaded the ISO image from this link. The download may take a while to complete due to the file size. The file comes with a .iso file extension.

Link up ISO image with VM

From the screenshot above, click the “Settings” button and go to “Storage”. Under the optical drive (Empty), select the ISO image (.iso) file that you downloaded earlier. Also, enable the network adapter so that the VM can use the internet to download the required packages.
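For readers who prefer the command line, the wizard steps above correspond roughly to VBoxManage, the command-line tool that ships with VirtualBox. The sketch below only builds the command lists and does not run anything. The VM name, file names, the 1024 MB of memory, the NAT adapter and the SATA controller are all assumptions chosen for illustration; I used the GUI myself, so treat this as a starting point rather than a tested recipe.

```python
def vboxmanage_commands(vm_name, disk_path, iso_path, memory_mb=1024, disk_gb=15):
    """Return the VBoxManage invocations as argument lists (nothing is run)."""
    return [
        # Name the VM and pick the Red Hat (64-bit) type.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", "RedHat_64", "--register"],
        # Allocate memory and enable a NAT network adapter (assumed values).
        ["VBoxManage", "modifyvm", vm_name,
         "--memory", str(memory_mb), "--nic1", "nat"],
        # "Standard" is the dynamically allocated variant; size is given in MB.
        ["VBoxManage", "createmedium", "disk", "--filename", disk_path,
         "--size", str(disk_gb * 1024), "--format", "VDI", "--variant", "Standard"],
        # Add a storage controller, then attach the disk and the ISO image.
        ["VBoxManage", "storagectl", vm_name, "--name", "SATA", "--add", "sata"],
        ["VBoxManage", "storageattach", vm_name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", disk_path],
        ["VBoxManage", "storageattach", vm_name, "--storagectl", "SATA",
         "--port", "1", "--device", "0", "--type", "dvddrive", "--medium", iso_path],
    ]


# Hypothetical names; replace them with your own VM name and file paths.
commands = vboxmanage_commands("centos7", "centos7.vdi", "CentOS-7.iso")
```

Once VirtualBox is installed, each list could be passed to subprocess.run(cmd, check=True) in order.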

Start the virtual machine

Click the “Start” button to start the virtual machine. There are different options for running the virtual machine in VirtualBox. Select the option “Install CentOS Linux 7” and proceed to install. Again, it will take a while to load the required packages and complete the installation.

Once it is ready, you will see the opening screen of the server. It requires basic information to set up the server such as language, timezone and user account. You can set up accordingly.


Introduction to Kafka

The idea came about because, when we have multiple source systems and target systems, each integration between them needs its own configuration. Each of these integrations comes with difficulties around:

  • Protocol – how the data is transported (example: HTTP, REST, TCP, etc).
  • Data format – how the data is parsed (example: CSV, JSON, binary, etc).
  • Data schema – how the data is shaped and may change.

Each source system may have an increased load from the connections.

Why Apache Kafka?

Decoupling the data streams & systems
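A quick count shows why decoupling helps. This sketch assumes that every source needs to reach every target, and the numbers 4 and 6 are made up for illustration.

```python
def point_to_point(sources, targets):
    """Every source is integrated with every target directly."""
    return sources * targets


def through_kafka(sources, targets):
    """Every system is integrated once, with Kafka in the middle."""
    return sources + targets


direct = point_to_point(4, 6)    # 4 sources x 6 targets
decoupled = through_kafka(4, 6)  # 4 producers + 6 consumers
```

With 4 sources and 6 targets, that is 24 direct integrations against 10 through Kafka, and the gap widens as more systems are added.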

What is Apache Kafka?

Apache Kafka is a high-throughput distributed messaging system (or streaming platform). It was created at LinkedIn and is now an open-source project under the Apache Software Foundation, with Confluent as one of its main contributors.

You can have data streams from websites, microservices, financial transactions, and so on. Once the data is in Kafka, you may want to put it into your databases, analytics systems, email systems, and so on.

Kafka is used for these broad classes of applications:

  • Building real-time streaming data pipelines that reliably get data between systems or applications.
  • Building real-time streaming applications that transform or react to the streams of data.

Kafka Concepts

  • Kafka runs as a cluster on one or more servers that can span multiple data centres.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.
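The shape of a record can be sketched as a small data class. This is an illustration and not the Kafka client's own record type: the Record class, the make_record helper, the "payments" topic and the JSON encoding are all assumptions. What it does reflect is that keys and values travel as bytes and that the timestamp is counted in milliseconds since the Unix epoch.

```python
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: bytes         # an identifier for the record, for example a user ID
    value: bytes       # the payload itself
    timestamp_ms: int  # milliseconds since the Unix epoch


def make_record(key, payload, timestamp_ms=None):
    """Encode a key and a JSON-serialisable payload into a Record."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return Record(
        key=key.encode("utf-8"),
        value=json.dumps(payload, sort_keys=True).encode("utf-8"),
        timestamp_ms=timestamp_ms,
    )


# Topics are named categories, modelled here as a dict of lists.
topics = {"payments": []}
topics["payments"].append(
    # A fixed timestamp keeps the example repeatable.
    make_record("user-42", {"amount": 9.5, "currency": "SGD"}, timestamp_ms=1583625600000)
)
record = topics["payments"][0]
```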


VirtualBox Installation

I installed VirtualBox on my machine recently, and I planned to write about the installation steps for a Windows machine. It is easy: just follow the wizard through the steps during the installation. First, download the installer from the website. I am using this link.

Once the download has completed, run the .exe file. The first view of the wizard looks as below. Depending on the version that you downloaded, the interface may look different. I am using version 6.1.4 for this installation.

Setup using wizard

Click “Next” to proceed.

Next, select the location where you want to install the program. I left it as default location. This screen shows the required disk space to install the software in your machine. Click “Next” to proceed.

Next, you can choose whether to create shortcuts on your machine. In my case, I unticked the checkboxes for creating shortcuts on the desktop and Quick Launch. Click “Next” to proceed.

Then, it shows a warning page. You can just click “Next” to proceed and then click “Install” to install the software on your machine. Make sure you allow the wizard to continue installing the software when it prompts you.

Launch the VirtualBox

You can begin to use VirtualBox once you have downloaded some images to run in it.

If your machine is running Linux, you can install VirtualBox from this link. It lists the command lines that install the software. Choose the correct Linux version to begin the installation.

Happy International Women’s Day

Happy International Women’s Day! This year’s theme is #EachforEqual. According to the IWD website, “We can actively choose to challenge stereotypes, fight bias, broaden perceptions, improve situations and celebrate women’s achievements.”

An equal world is an enabled world.

International Women’s Day, 2020.

Milestones Achieved

On this beautiful day, I am humbled to share the milestones that I achieved over the past year. I hope this will inspire more women to go forward and achieve their goals in life.

AI4I (AI for Industry)

AI Singapore is a national programme in artificial intelligence (AI), set up to enhance Singapore’s AI capabilities to power our future digital economy, according to the LinkedIn website. I first heard about this programme in Dec 2018, when a colleague shared it.

At that point in time, two other colleagues were doing AI proof-of-concept projects for the company, and one of them was quite good at machine learning. I thought it would be great to kick-start the Python learning journey on my own before losing out. Since I was not involved in any Python projects, I thought this was the best way for me to pick up a new language.

I started the online learning and did a few modules in the first four months. Then, I took a six-month break from this programme to concentrate on my specialized diploma course.

I received my certificate of completion from AI Singapore in February 2020.

Completed One-Year Volunteer Work with TechLadies

I began my first volunteer work when I was chosen by TechLadies to assist in running the study groups for the year 2019. I am proud to help people, especially ladies, who are keen to start coding and move into the tech industry.

The initial thought of doing the study group was closely related to the online course I had started. It would be great if I could meet up with other ladies who took the same online course and we could team up to complete the course. However, I realized it would be better to build the foundation well before pushing out new technologies or new programming languages to someone new to the tech industry.

Although I did not manage to run the Python study group during my one year of volunteering, it gave me enough ideas on what worked and what did not work well in the TechLadies community when running a study group event.

Time went by so fast, and I came to the end of my volunteering work with TechLadies and recently had my graduation dinner with a group of great people I worked with for a year.

Specialized Diploma in Business Analytics

I do agree that sometimes a specialized diploma does not carry the same weight as an actual diploma or degree. Before I started my course in mid-April, my colleagues kept telling me that I would be wasting my time attending classes at a polytechnic. I appreciated their advice, and I did not think it wrong to assume that the modules taught at a polytechnic would be less detailed, less technical and easy to score in.

However, my first semester of business intelligence modules at Temasek Polytechnic opened my eyes to using powerful tools such as Tableau and Power BI to perform simple data analysis and data visualization easily. You do not need to be a technical person to use the tools.

There were class discussions among coursemates from different industries who tried to put the concepts learned in class into real work. I loved the interactions and the knowledge-sharing sessions, which nobody would do in the office even though I tried to cultivate this culture in my previous company. Indeed, it needs a lot of time, effort and encouragement to make it work and keep it running.

Little did I know, my specialized diploma course was not just about business intelligence; the curriculum had been upgraded to include popular topics such as text analysis and machine learning. Indirectly, I completed some formal education in AI and machine learning through this course.

Studying helped me to identify what I wanted to be next. I gained experience working on data visualization, and next I wanted to use my SQL and Python skills for data management. Therefore, I switched from mastering data analyst skills to data management skills.

It has been a challenging yet fulfilling journey with Temasek Polytechnic, and I completed and submitted my last project last week. I would never call this course easy, but it is not hard either; it is just right to cover the basics, and the rest is up to me.

Finally, Learning Agile

Before I continue, I am officially six months into my current role in my new workplace now. I enjoyed both the work and the people here who emphasize and demonstrate good teamwork. Although the technology stack over here is strictly limited to Microsoft products, we keep learning new technologies through sharing sessions.

My team decided to pick up agile and started to implement scrum methodology in our team. I am excited to hop onto this bandwagon with the rest of the project teams. There are many things to learn and adapt when it comes to agile. It gives me a chance to learn the proper way of working in a team. Someone told me to learn it!

Work Life Getting Busy, Yet I’m Happy

Many people will think this could only happen when we are working on something that we are passionate about. Yes, for now at least, the vision and mission are clear. The management included the project team members in the team planning last November. The director and management team listened to the ground staff’s open discussion. Last week, during the knowledge-sharing session, we saw that some of the planned actions had been put in place. It was never this well planned in my previous work, and everyone is able to visualize the future.

More news came to me: I have another project team to work with; in other words, I am handling two projects now. While a small restructuring happened within my team, right now most of us are heading in our preferred directions for sure!

What is next, then? For real…

Looking into getting certifications

My boss introduced this certification to my project team during the knowledge-sharing session. He is DAMA certified and actively contributes to data management within the project teams and organization. Data governance is not fancy work, and most of the time, it is not implemented in projects due to cost and time. To add on, not many people understand the need for it.

It is good enough to have just a handful of people who look into this seriously, and I have benefited from taking up my current role to maintain data governance. I think pursuing this certification is a good investment if I am keen to concentrate on data management and data governance work in future.

How can I put my learning to good use?

My coursemate inspired me to start the brown bag programme. I need a group of regulars, a combination of both experienced and inexperienced people, to meet regularly, learn something in an hour and share the findings in the next hour. It increases the interaction between community members.

I do not have an exact plan for how to execute this programme, or whether I want to try it within my department or project team or collaborate with tech communities. Any other ideas for conducting the brown bag sessions?

I intend to share more about my Python learning journey through my technical blog on Medium or through teaching sessions (if they do not mind having a newbie teach). In return, I hope to meet some regulars who are actively using Python for data analysis and AI learning. Also, I want to engage with them proactively to build two-way communication. It helps to keep me learning the language, although I am not using it at work.

Mentorship Programme

Besides Python, I wish to pick up some skills in the big family of Apache projects, such as Apache Kafka for big data, Apache Spark for machine learning, etc. In other words, I am looking into learning open-source technology, continuing from where I left off six months ago. I plan to put this into my mentorship programme. My mentor is a person who used to head a department at NUS.

What else can I do?

Besides what I mentioned above, what other options are available that I can explore and try out? I am open to suggestions.

January 2020

I hope it is not too late to write out my plans for the year 2020. My volunteer work with TechLadies will come to an end this March. TechLadies is recruiting the new core team for the year 2020. The upcoming boot-camp graduation will introduce the new team to the community. Then, the year 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

I have had this question in my mind lately, and I am not sure what TechLadies plans for it. I am quite sure it would be a great idea to let a new team lead the community. New team, new ideas and directions.

I may consider taking a side role to continue the study group sessions. But I also hope that someone is going to plan and run the study group sessions with me. If not, then I will run the events slowly, as and when I am available. I am not sure whether a mobile study group will work in Singapore.

Besides TechLadies, what else?

Good question. I plan to run a learn-and-teach program after being inspired by my classmate. This program teaches the community (not necessarily within TechLadies) what I learned recently.

I will randomly pick a topic to learn and share with the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving inputs without receiving feedback from the community.

I hope I will write and share more technical stuff through my blog here as well as my posts on Medium.

New focuses

I am looking out for other communities in Singapore that work closely on master data management (MDM), focus on SQL and NoSQL databases, work on data engineering and use Power BI for data visualization.

I am not going away from my core interest, databases. Also, I want to go in-depth into master data management and will consider taking some of the courses or certifications in this area. Next, I need to upskill and gain essential experience in the data engineering field while continuing to explore data visualization with Power BI. I am still looking out for a Data Engineering meetup or user group in Singapore. Do you know any?

Not to forget, I am doing data analytics in my final module at Temasek Poly. It is going to be an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate too, so that everything is fresh in my mind. Right now, I have completed 10 of the 26 modules. I still need to complete some Pandas, statistics and machine learning topics before the end of February. Maybe I will take a bit of time off from other activities to focus on study and work.