January 2020

I hope it is not too late to write out my plans for 2020. My volunteer work with TechLadies will come to an end this March. TechLadies is recruiting a new core team for 2020, and the upcoming boot camp graduation will introduce the new team to the community. Then the 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

This question has been on my mind lately, and I am not sure what TechLadies plans for it. I am quite sure it would be a great idea to let a new team lead the community: new team, new ideas and new directions.

I may consider taking a side role to continue the study group sessions. But I also hope that someone will plan and run the study group sessions together with me. If not, I will run the events slowly, as and when I am available. I am not sure whether a mobile study group will work in Singapore.

Besides TechLadies, what else?

Good question. Inspired by my classmate, I plan to run a learn-and-teach program. The program teaches the community (not necessarily within TechLadies) what I have learned recently.

I will pick a topic at random to learn and share it with the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving input without receiving feedback from the community.

I hope to write and share more technical content through this blog as well as my posts on Medium.

New focuses

I am looking for other communities in Singapore that work closely on master data management (MDM), focus on SQL and NoSQL databases, work on data engineering and use Power BI for data visualization.

I am not moving away from my core interest, databases. I also want to go in depth into master data management and will consider taking some courses or certifications in this area. Next, I need to upskill and gain essential experience in data engineering while continuing to explore data visualization with Power BI. I am still looking for a data engineering meetup or user group in Singapore. Do you know any?

Not to forget, I am taking data analytics as my final module at Temasek Poly. It will complete an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate too, so that everything is fresh in my mind. Right now, I have completed 10 of 26 modules. I still need to complete some Pandas, statistics and machine learning topics before the end of February. I may take a bit of time off from other activities to focus on study and work.

Intermediate Python for Data Science: Looping Data Structure

This part comes after Matplotlib for visualization, the introduction to dictionaries and Pandas DataFrames, and logical, Boolean and comparison operators with if-elif-else control flow. Now comes the last part: the while loop, the for loop and looping over different data structures.

In Python, some objects are iterable, which means a loop can step through them one element at a time. Looping over a list yields each element; looping over a string yields each character. A for loop iterates over a collection of things, while a while loop repeats its block of code for as long as some condition remains True.
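As a small sketch of what "iterable" means, the snippet below loops over a string and then steps through a list by hand with the built-in iter() and next(). The variable names are made up for illustration.

```python
# Looping over a string yields one character per iteration.
word = "loop"
chars = []
for ch in word:
    chars.append(ch)
print(chars)  # ['l', 'o', 'o', 'p']

# iter() and next() show roughly what a for loop does behind the scenes.
it = iter([10, 20])
first = next(it)
second = next(it)
print(first, second)  # 10 20
```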

For Loop

The main keywords are for and in, used together with a colon (:) and indentation (whitespace). Below is the syntax:

#loop statement
my_iterable = [1,2,3]
for item_name in my_iterable:
    print(item_name)

The sample code below uses two iterator variables (index, area) with enumerate(). enumerate() wraps an iterable with an automatic counter and returns an enumerate object that yields (index, element) pairs.

# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate() and update print()
for index, area in enumerate(areas) :
    print("room " + str(index) + ": " + str(area))

#Output:
"""
room 0: 11.25
room 1: 18.0
room 2: 20.0
room 3: 10.75
room 4: 9.5
"""

Another example loops through each sublist of house and prints "the x is y sqm", where x is the name of the room and y is its area.

# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
         
# Build a for loop from scratch
for x in house:
    print("the " + str(x[0]) + " is " + str(x[1]) + " sqm")

# Output:
"""
the hallway is 11.25 sqm
the kitchen is 18.0 sqm
the living room is 20.0 sqm
the bedroom is 10.75 sqm
the bathroom is 9.5 sqm
"""

Definition of enumerate() can be found here. My post on for loop is here.

While Loop

The main keyword is while, used together with a colon (:) and indentation (whitespace). Below is the syntax:

# while loop statement (template, shown as comments so the snippet runs)
# while some_boolean_condition:
#     do something

# Example
x = 0
while x < 5:
    print(f'The number is {x}')
    x += 1

An example of putting an if-else statement inside a while loop.

# Initialize offset
offset = -6

# Code the while loop
while offset != 0 :
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)

# Output:
"""
correcting...
-5
correcting...
-4
correcting...
-3
correcting...
-2
correcting...
-1
correcting...
0
"""

My post on while loop is here.

Loop Data Structure

Dictionary:
If you want to iterate over key-value pairs in a dictionary, use the items() method on the dictionary to define the sequence in the loop.

for key, value in my_dic.items() : 

Numpy Array:
If you want to iterate over all elements in a NumPy array, including every element of a multi-dimensional array, use the nditer() function to specify the sequence.

for val in np.nditer(my_array) : 

Some examples as below:

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe
for key, value in europe.items():
    print("the capital of " + key + " is " + value)

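# Output (this ordering comes from an older Python version;
# Python 3.7+ prints the pairs in insertion order, starting with spain):
"""
the capital of austria is vienna
the capital of norway is oslo
the capital of italy is rome
the capital of spain is madrid
the capital of germany is berlin
the capital of poland is warsaw
the capital of france is paris
"""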

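# Import numpy as np
import numpy as np

# Sample arrays (assumed values; the course provides its own np_height and np_baseball)
np_height = np.array([74, 74, 72])
np_baseball = np.array([[74, 180], [74, 215], [72, 210]])

# For loop over np_height (1D array)
for x in np_height:
    print(str(x) + " inches")

# For loop over np_baseball (2D array), element by element
for x in np.nditer(np_baseball):
    print(x)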

Loop over DataFrame explanation and example can be found in my post here.
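For readers who do not follow the link, here is a minimal sketch of looping over a DataFrame with iterrows(). The two-row DataFrame is hypothetical, and whether the linked post uses exactly this approach is an assumption.

```python
import pandas as pd

# Hypothetical two-row DataFrame, for illustration only.
df = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

# iterrows() yields (row label, row as a Series) pairs.
lines = []
for label, row in df.iterrows():
    lines.append(label + ": " + row["capital"])
print(lines)  # ['BR: Brasilia', 'RU: Moscow']
```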

Intermediate Python for Data Science: Logic, Control Flow and Filtering

Boolean logic is the foundation of decision-making in Python programs. Learn about different comparison operators, how to combine them with Boolean operators, and how to use the Boolean outcomes in control structures. Also learn to filter data in pandas DataFrames using logic.

In the earlier days when I started to learn Python, there was a topic on Boolean and comparison operators, where I studied Booleans (True and False), logical operators ('and', 'or', 'not') and comparison operators ('==', '!=', '<' and '>').

Comparison operators tell how two Python values relate and result in a Boolean. They can compare two numbers, two strings or other values of compatible types. An ordering comparison such as < or > between incompatible types, for example a string and an integer, raises a TypeError because Python cannot tell how the two objects relate. An equality check with == between such types does not raise an error; it simply returns False.
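A quick sketch of how comparisons behave across types in Python 3. The values are arbitrary and only meant to illustrate the behaviour.

```python
print(2 < 3)          # True
print("abc" < "acd")  # True, strings compare alphabetically
print(2 == 2.0)       # True, int and float are compatible
print("2" == 2)       # False, equality across types raises no error

# An ordering comparison between a string and an integer raises TypeError.
try:
    "2" < 2
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```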

Comparing a NumPy array with an integer

In an example taken from a tutorial in the DataCamp online course that I am currently taking, the variable bmi is a NumPy array, and the code checks whether bmi is greater than 23. It works and returns Boolean values. Behind the scenes, NumPy conceptually builds an array of the same size filled with the number 23 and then performs an element-wise comparison.

Boolean operators with Numpy

To use the Boolean operators and, or and not with NumPy arrays, you need np.logical_and(), np.logical_or() and np.logical_not(). Here’s an example on the my_house and your_house arrays (defined in the sample code further below) to give you an idea:

np.logical_and(your_house > 13, 
               your_house < 15)

Refer to below for the sample code:

# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))

The first print statement checks the ‘or’ condition: if either of the two conditions is True, it returns True. The second print statement checks the ‘and’ condition: both comparisons have to be True for it to return True. The output is a pair of Boolean arrays, as below:

[False  True False  True]
[False False False  True]

Combining Boolean operators and comparison operators with the conditional statements if, else and elif

It follows the if statement syntax. The simplest code to explain the above is:

z = 4
if z % 2 == 0:
  print('z is even')

The same goes for the if-else statement with a comparison operator; see the code below:

z = 5
if z % 2 == 0:
  print('z is even')
else:
  print('z is odd')

It also works with an if, elif and else statement. See the code below:

z = 6
if z % 2 == 0:
  print('z is divisible by 2')
elif z % 3 == 0:
  print('z is divisible by 3')
else:
  print('z is neither divisible by 2 nor 3')

In the example above, both the first and second conditions are true. However, in this control structure, once Python hits a condition that evaluates to True, it executes the corresponding code and exits the control structure. It does not check the next condition, the one in the elif statement.

Filtering Pandas DataFrame

In an example taken from DataCamp’s tutorial, using the brics DataFrame, select the countries with an area over 8 million km². There are 3 steps to achieve this.

Step 1: Select the area column from the DataFrame. Ideally, this gives a Pandas Series, not a Pandas DataFrame. Assuming the DataFrame is called brics, select the area column using:

brics["area"]

#alternatively it can use the below too:
# brics.loc[:, "area"]
# brics.iloc[:, 2]

Step 2: Add the comparison operator to see which rows have an area greater than 8. This returns a Series containing Boolean values.

Step 3: Use this Boolean Series to subset the Pandas DataFrame. First, store the Boolean Series as ‘is_huge’ as below:

is_huge = brics["area"] > 8

Then create a subset of the DataFrame using the following code:

brics[is_huge]

It shows the countries with an area greater than 8 million km². The steps can be shortened into one line of code:

brics[brics["area"] > 8]

It also works with the Boolean operators (np.logical_and(), np.logical_or() and np.logical_not()). For example, to look for areas between 8 and 10 million km², the single line of code can be:

brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

The above code returns Brazil and China.
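The filtering steps can be tried end to end with the sketch below. The brics values here are assumptions for illustration (area in million km², population in millions), chosen to be consistent with the results described in the text; the course loads its own data.

```python
import numpy as np
import pandas as pd

# Assumed sample data: area in million km², population in millions.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

is_huge = brics["area"] > 8   # Boolean Series
huge = brics[is_huge]         # rows where the mask is True
mid = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

print(huge["country"].tolist())  # ['Brazil', 'Russia', 'China']
print(mid["country"].tolist())   # ['Brazil', 'China']
```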

Intermediate Python for Data Science

The subjects in DataCamp’s Intermediate Python for Data Science track include:

  • Matplotlib
  • Dictionaries and Pandas
  • Logic, Control Flow and Filtering
  • Loops

It looks at data visualization (how to visualize data) and data structures (how to store data). Along the way, it shows how control structures customize the flow of your scripts.

Data Visualization

Data visualization is one of the key skills for data scientists, and Matplotlib makes it easy to create meaningful and informative charts. Matplotlib allows us to build various charts and customize them to make them easier to interpret visually. It is not hard to do, and it is quite interesting to work on. In my previous write-up, I wrote about how to use Matplotlib to build a line chart, scatter plot and histogram.

Data visualization is a very important part of data analysis. It helps to explore the dataset and extract insights from it. I call this data profiling: the process of examining a dataset from an existing data source such as a database and producing statistics or summaries of it. The purpose is to find out whether existing data can be used for other purposes and to determine the accuracy, completeness and validity of the dataset. I relate this to “performing a body check on the dataset to ensure it is healthy”.

One of the methods I learned at school for data profiling is to use a histogram, scatter plot and boxplot to examine the dataset and find outliers. I can use Python’s Matplotlib, Excel, Power BI or Tableau to do this.

It does not end here…

Python allows us to customize the charts to suit our data. There are many types of charts and customizations one can do with Python, from colours and labels to the axes’ tick size. It depends on the data and the story one wants to tell. Refer to the links above to read my write-up on those charts.

Dictionaries

We can use lists to store a collection of data and access the values using indexes. This can be troublesome and inefficient for large datasets, so dictionaries are important in data analysis because they represent data as key-value pairs. An example of creating a dictionary from lists of data can be found in this link; it has one simple example demonstrating the conversion. However, I do have a question: how about converting long lists to a dictionary? I assume it will not be the same method as in this simple example. Does anyone have an example to share?

If you have questions about dictionaries, you can refer to my blog, where I wrote a fairly comprehensive introduction to dictionaries in Python.
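On the question of long lists, one common approach is dict(zip(keys, values)), which works the same way regardless of list length. The sketch below uses made-up lists; whether it matches the method in the linked example is an assumption.

```python
# Hypothetical lists, for illustration only.
countries = ["spain", "france", "germany", "norway"]
capitals = ["madrid", "paris", "berlin", "oslo"]

# zip() pairs the elements by position; dict() turns the pairs into key-value entries.
europe = dict(zip(countries, capitals))
print(europe["norway"])  # oslo

# The same call works unchanged for lists with thousands of elements.
keys = list(range(10000))
values = [k * k for k in keys]
squares = dict(zip(keys, values))
print(len(squares), squares[99])  # 10000 9801
```

If the two lists differ in length, zip() stops at the shorter one, so it is worth checking the lengths first.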

What is the difference between lists and dictionaries?

If you have a collection of values where order matters, and you want to easily select entire subsets, you will want to go with a list. On the other hand, if you need some sort of lookup table where looking up data by unique keys should be fast, a dictionary is the preferred option.

Lastly, Pandas

Pandas is a high-level data manipulation tool built on top of the NumPy package. Since a NumPy 2D array allows only one data type for its elements, it may not be suitable for data that comprises more than one data type. In Pandas, data is stored in a tabular structure called a DataFrame.

How to build a DataFrame?

There are a few ways to build a Pandas DataFrame, and we need to import the Pandas package before we begin. In my blog, two methods are shared: using dictionaries and using an external file such as a .csv file. You can find the examples from the given link. A dictionary can be converted into a DataFrame using DataFrame(), and an external file can be read using Pandas’ read_csv().

  • Converting dictionary using DataFrame()
  • reading from external file using read_csv()
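Both methods can be sketched as follows. The data is hypothetical, and io.StringIO stands in for a real .csv file path so that the snippet is self-contained.

```python
import io
import pandas as pd

# Method 1: convert a dictionary (column name -> list of values).
data = {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]}
from_dict = pd.DataFrame(data)
from_dict.index = ["BR", "RU"]  # optional row labels

# Method 2: read CSV text. io.StringIO replaces a file path here;
# index_col=0 uses the unnamed first column as the row labels.
csv_text = ",country,capital\nBR,Brazil,Brasilia\nRU,Russia,Moscow\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(from_dict)
print(from_csv)
```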

How to read from a DataFrame?

The above screenshot shows what the Pandas DataFrame looks like: it is in the form of rows and columns. You may wonder why the first column has no name. In the .csv file, it has no column name. It appears to be an identifier for each row, just like an index of the table or a row label. I do not know whether the file was built this way on purpose or whether it has another meaning.

Index and Select Data

There are two ways to select data:

  • Using square bracket []
  • Advanced methods: loc and iloc.

The advanced methods, loc and iloc, are Pandas’ powerful, advanced data access tools. To access a column using square brackets, with reference to the above screenshot again, the following code selects the country column:

brics["country"]

The result shows the row labels together with the country column. Reading a DataFrame column this way returns an object called a Pandas Series. You can think of a Series as a one-dimensional labelled array; when a bunch of Series come together, they form a DataFrame.

If you want to make the same selection of the country column and keep the data as a DataFrame, use double square brackets, as in the following code:

brics[["country"]]

If you check the type of the object, it is a DataFrame. You can list more than one column to be returned. To access rows using square brackets and slices, with reference to the same screenshot, the code below is used:

brics[1:4]

The result returns rows 2 to 4, or index 1 to 3, which contain Russia, India and China. Do you still remember the characteristic of a slice? The stop value (end value) of a slice is exclusive (not included in the output).

However, this method has a limitation. With a 2D NumPy array, you can access a specific element using square brackets with a row and a column, which plain square brackets on a DataFrame cannot do:

my_array[row, column]

Hence, Pandas has the powerful and advanced data access methods loc and iloc, where loc is label-based and iloc is position-based. Let us look into the usage of loc. The first example selects rows with loc, and the next example selects rows and columns with loc. With the same concept as above, single square brackets return a Series and double square brackets return a DataFrame, as below:

brics.loc["RU"]                  # Series, single row
brics.loc[["RU"]]                # DataFrame, single row
brics.loc[["RU", "IN", "CH"]]    # DataFrame, multiple rows

Let us extend the above code to read the country and capital columns using rows and columns with loc. The first part specifies the row labels and the second part specifies the column labels. The code below returns a DataFrame.

brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

The above rows values can be replaced with slice, just like the sample code below:

brics.loc[:, ["country", "capital"]]

The above code does not specify a start or end index, which means it returns all the rows with the country and capital columns. Below is the screenshot of the comparison between square brackets and loc (label-based).

Using iloc is similar to loc; the only difference is that you refer to rows and columns by integer position instead of by their labels.
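To make the loc versus iloc comparison concrete, here is a minimal sketch. The small brics DataFrame is assumed sample data, reusing the row labels mentioned in the text.

```python
import pandas as pd

# Assumed sample data, for illustration only.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Label-based selection with loc.
by_label = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

# The same rows and columns by integer position with iloc.
by_position = brics.iloc[[1, 2, 3], [0, 1]]
print(by_label.equals(by_position))  # True

one_row = brics.iloc[1]            # Series for the second row (label "RU")
all_rows = brics.iloc[:, [0, 1]]   # every row, first two columns
```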

Python: formatter

The sample code below shows how to do more complicated string formatting:

formatter = "{} {} {} {}"

print(formatter.format(1,2,3,4))
print(formatter.format("one","two","three","four"))
print(formatter.format(False, False, True, False))
print(formatter.format(formatter, formatter, formatter, formatter))
print(formatter.format(
  "First thing",
  "that we can try",
  "maybe is having",
  "a line of sentence"))

It uses something called a function (here, the string method format) to turn the formatter variable into other strings. When the code says formatter.format(...), it tells the Python interpreter to do the following:

  • Take the formatter string defined in the first line.
  • Call its format function.
  • Pass to format the 4 arguments, which match up with the 4 curly brackets {} in the formatter variable.
  • The result of calling format on formatter is a new string that has each {} replaced with the corresponding argument.

This is what the print statements print:
1 2 3 4
one two three four
False False True False
{} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {}
First thing that we can try maybe is having a line of sentence

Python: Introduction IV

It has been a while since I wrote Python Introduction I, II and III. Today, I am going to complete the last part of the introduction, NumPy. Months ago, during my Python self-learning time, I wrote about NumPy; here is the link.

NumPy

The NumPy array is an alternative to the Python list and helps us solve problems with list operations. Calculations on Python lists cannot be done element by element in the way we add two integers. The NumPy package needs to be installed before we can import and use it.

In my blog above, I wrote about the behaviour of the NumPy array. It does not allow different types of elements in the array. When a NumPy array is built, the elements’ data types are converted so that the array ends up homogeneous. Suppose the list contains a string, a number and a Boolean; the resulting array stores all of them as strings, for example.
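The type conversion can be seen in a short sketch. The list contents are made up; the point is only that the resulting array has a single string type.

```python
import numpy as np

# A list with a number, a string and a Boolean.
mixed = np.array([1.0, "is", True])

print(mixed)             # every element is now a string
print(mixed.dtype.kind)  # U, a Unicode string type
```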

Also, operators such as “+” and “-” behave differently on NumPy arrays than on Python lists (lists do not support “-” at all). Refer below for an example:

import numpy as np

py_list = [1,2,3]
numpy_array = np.array([1,2,3])
print(py_list + py_list)          # [1, 2, 3, 1, 2, 3]
print(numpy_array + numpy_array)  # [2 4 6]

The first output shows the two lists merged into a single list. The second output shows the element-wise sum of the two arrays. The screenshot below shows the result, which I executed in Jupyter Notebook.

What is covered in the link above is good enough to give us a basic understanding of NumPy. If you wish to learn more, there is another link I found on Medium which we can refer to.

NumPy Subsetting

Specifically for NumPy, there is another way of subsetting: using an array of Booleans. The example from DataCamp gets all the BMI values above 23.

The comparison first returns a Boolean array, True where the BMI value is above 23. You can then use this Boolean array inside square brackets to do the subsetting. Elements where the Boolean value is True are selected.

In short, it is using the result of the comparison to make a selection of data.
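A minimal sketch of Boolean subsetting follows. The BMI values are hypothetical, not the ones from the DataCamp exercise.

```python
import numpy as np

# Hypothetical BMI values, for illustration only.
bmi = np.array([21.9, 20.9, 21.8, 24.7, 21.4])

mask = bmi > 23       # Boolean array from the comparison
print(mask)           # [False False False  True False]

selected = bmi[mask]  # keeps only the elements where mask is True
print(selected)       # [24.7]
```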

2D NumPy Array

I covered the 2D NumPy array in this link, which shows how to declare a 2D NumPy array and how it works with subsetting, indexing, slicing and math operations.
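As a brief sketch of those operations, the snippet below declares a small 2D array and indexes, slices and scales it. The height and weight values are made up for illustration.

```python
import numpy as np

# Hypothetical 2D array: each row is [height, weight].
np_2d = np.array([[1.73, 65.4],
                  [1.68, 59.2],
                  [1.71, 63.6]])

print(np_2d.shape)   # (3, 2)
print(np_2d[0, 1])   # 65.4, row 0 and column 1
print(np_2d[:, 0])   # [1.73 1.68 1.71], every row of column 0

doubled = np_2d * 2  # element-wise arithmetic on the whole array
```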

NumPy: Basic Statistics

You can generate summary statistics of the data using NumPy. NumPy has a few useful statistical functions which can be used for analytics. They include finding the min, max, average, standard deviation, variance and so on from the elements in the array. Refer to my write-up on basic statistics in this link.
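The functions mentioned can be tried on a small array. The values below are arbitrary and chosen so that the results come out as round numbers.

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # arbitrary sample values

print(np.min(values), np.max(values))  # 2 9
print(np.mean(values))    # 5.0
print(np.median(values))  # 4.5
print(np.std(values))     # 2.0 (population standard deviation)
print(np.var(values))     # 4.0
```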

Python: Introduction III

This is the last part of the Python Introduction, and I will cover functions, methods and packages in Python. For sure, there is a difference between a function and a method. I revisited my original post where I wrote about the differences between functions and methods. You can read those before continuing here.

User-defined Functions

The simplest way I can explain what a function is comes from my original post:

A function is a block of code that carries out a task, and it is called by its name. A function may have zero or more arguments. The arguments are passed explicitly (directly). On exit, the function may or may not return a value or values.

There are some examples in this post that explain functions: how to define a function with and without arguments, how to use a default value for an argument, how to use the flexible arguments *args and **kwargs, and how to use the return statement in a function.
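For quick reference, here is a compact sketch of those function features. The function names are invented for illustration and are not taken from the linked post.

```python
def greet(name, greeting="Hello"):
    """Return a greeting; greeting has a default value."""
    return greeting + ", " + name

def total(*args):
    """Accept any number of positional arguments."""
    return sum(args)

def describe(**kwargs):
    """Accept any number of keyword arguments."""
    return ", ".join(key + "=" + str(value) for key, value in kwargs.items())

print(greet("Ada"))        # Hello, Ada
print(greet("Ada", "Hi"))  # Hi, Ada
print(total(1, 2, 3))      # 6
print(describe(city="Singapore", year=2020))  # city=Singapore, year=2020
```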

Methods

A method is like a function, except it is attached to an object (dependent). The object on which a method is invoked is passed to the method implicitly (indirectly). It may or may not return a value or values. The method can access the data that is contained within the class.

For method examples, I wrote about them in this post.
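The difference can be sketched with a built-in list and a tiny class. The Counter class is hypothetical and only there to show how self receives the object.

```python
numbers = [3, 1, 2]

# Function: the object is passed explicitly as an argument.
print(len(numbers))  # 3

# Method: called on the object, which is passed implicitly.
numbers.sort()
print(numbers)       # [1, 2, 3]

class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):  # self is the object the method is invoked on
        self.count += 1
        return self.count

c = Counter()
print(c.increment())  # 1
```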

Packages

Think of a package as a directory of Python scripts. Each .py script is a module. A module specifies functions, methods and types for solving a particular problem. I found a link which explains packages in Python in detail. Refer here for more reading.

In this part III, I know many external links are given, mainly to avoid rewriting entries which I wrote some time ago. This blog serves as a place to find the relevant resources for reading and examples, which I think are enough to cover a basic understanding of functions, methods and packages in Python.
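The standard library's json package is one convenient way to see the package and module relationship without installing anything. This is only an illustrative choice; any package would do.

```python
import json          # json is a package: a directory of modules
import json.decoder  # decoder.py is one module inside that directory

is_package = hasattr(json, "__path__")  # only packages have __path__
print(is_package)             # True
print(json.decoder.__name__)  # json.decoder

from json import loads        # import a single function from the package
parsed = loads('{"year": 2020}')
print(parsed["year"])         # 2020
```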