Python: How to code in 5 minutes

Happy New Year to all my dear readers. We are in the year 2021 now. This year's resolution is to learn technologies without any barriers. As part of the skill improvement within my organization, our management has decided to get everyone on board to learn all kinds of technologies that my organization is going to use this year. It is part of our strategy for data digitization and self-servicing. During the long Christmas and New Year holidays, the first step took place: every one of us was given some time off to learn Python, Tableau, Power BI, AWS Glue and Azure.

Get Python installed

Today, I will start with the Python installation and display “Hello World” on my screen. If you have not installed Python on your machine before, you can head to the python.org website to get Python 3.8.x or Python 3.9 installed. My machine is running the Windows operating system. You just select your operating system, download Python, and install it on your machine. The download and installation did not take me more than 2 minutes to complete.

get Anaconda installed

Next, I downloaded and installed Anaconda, and the whole process took me less than 2 minutes to complete. You can visit Anaconda’s website to get the installer. These are the two steps to begin with when we want to set up our machine for Python programming. After installation, you are automatically in the default conda environment with all packages installed. Conda is a package and environment manager.

what is python and anaconda?

You may be wondering why we need Python and Anaconda. What is this Anaconda for? Wikipedia tells us that Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment.

Anaconda helps manage the packages and environments, and reduces future issues dealing with the various libraries that you will be using. Anaconda is a distribution of packages built for data science. It comes with conda, a package and environment manager. We usually use conda to create environments for isolating our projects that use different versions of Python and/or different versions of packages. For example, you may want to set up separate Python 2 and Python 3 environments; see the sketch below. You can read more about Anaconda from the reference links below.
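As a minimal sketch of that Python 2 / Python 3 example (the environment names here are my own), creating and switching between isolated environments looks like this:

conda create --name py2 python=2.7
conda create --name py39 python=3.9
conda activate py39

Each environment keeps its own Python version and packages, so experiments in one project do not break another.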

jupyter notebook

Before I begin my first Python programming exercise to display “Hello World”, I choose to use the Jupyter Notebook from Anaconda Navigator. The Jupyter Notebook is an application that allows me to create documents (notebooks) in which I write Python code and display its input and output. The Jupyter Notebook is widely used for other programming languages too. While Googling which IDE (integrated development environment) is most used for Python programming, I found that Spyder is another coding tool, and it also comes with Anaconda.

When I click the “Launch” button, my browser opens the localhost (http://localhost:8888/tree) and shows the Jupyter Notebook’s dashboard with my machine’s working directory. In my case, it is C:\Users\<myname>. I create a new notebook by clicking the “New” button on the right side of the window. A new window launches, and I can see the Jupyter Notebook.

I rename my notebook to “HelloWorld” as below. Then, I start Python programming by writing the first line of code.

print("Hello World, Happy New Year 2021")

first python code

print() displays the text in the double quotes when I run the selected cell. The number in the brackets after In [1]: stands for the number of commands run. If you keep running the same cell over again, the number keeps increasing for that cell; if you switch to another cell and run it, that cell gets the next number.

Alright, I stop here for the first blog entry of my Python programming learning journey. Here, I have covered the following topics:

  • Anaconda
  • Python
  • Jupyter Notebook

PIP

This is not the first time I have written about Python programming on my blog. Previously, I installed Python packages by using pip, the package manager. Pip is a tool that allows me to install and manage additional libraries and dependencies. Pip installs Python packages, whereas conda installs packages which may contain software written in any language. You can refer to my entry where I share how I updated pip and Python on my Windows machine. Installation using pip is done in the Windows command prompt.

In my previous Python posts, I wrote about how I used IntelliJ as the IDE and signed up for a course on Udemy to begin my Python learning journey. Here is the link to Day 1: Let Get Started with Python. If you wish to check out my previous write-ups, please visit this link. I hope you enjoy my sharing; please stay tuned for the next updates. Thank you.

references

Updated pip and Installed Jupyter Notebook

It is just a quick blog entry before the end of Sunday. I have been trying to write some blog entries since last week, when I started to clear my annual leave, but I could not find the right topic to kick-start. I did a small update just now on my laptop. I used to code Python using the Jupyter Notebook that I installed on another laptop. I have returned that laptop to my friend, so it is a good time to update my own laptop to launch the Jupyter Notebook. When I first started self-learning Python, I learned it through the Jupyter Notebook web version. It is convenient to use, quick, and needs no installation. Now, I wanted to use the Jupyter Notebook without going to the web version, hosting it directly on my localhost. However, the pip version on my laptop was too old to install the Jupyter Notebook.

You are using pip version 8.1.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

I copied the above message from the command prompt. It has only been a year, so how did the version number jump so much? (As far as I know, pip switched to calendar-based version numbers in 2018, which explains the jump from 8.x to 20.x.) The pip update is easy to do because the command is provided directly on the command prompt screen. Just follow it!

C:\Users\Li Yen\AppData\Local\Programs\Python\Python35-32\Scripts>python -m pip install --upgrade pip
Cache entry deserialization failed, entry ignored
Collecting pip
  Cache entry deserialization failed, entry ignored
  Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB)
    100% |################################| 1.5MB 603kB/s
Installing collected packages: pip
  Found existing installation: pip 8.1.1
    Uninstalling pip-8.1.1:
      Successfully uninstalled pip-8.1.1
Successfully installed pip-20.2.2

After successfully installing the updated version of pip, I ran the next command to install the Jupyter Notebook:

C:\Users\Li Yen\AppData\Local\Programs\Python\Python35-32\Scripts>pip3 install jupyter

It took a few minutes to complete the whole installation without any problems on my laptop, and Jupyter Notebook 6.1.3 was installed. Lastly, I launched the Jupyter Notebook using the command below.
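The launch command itself is just jupyter notebook; a sketch of the prompt, assuming we are still in the same Scripts folder as above:

C:\Users\Li Yen\AppData\Local\Programs\Python\Python35-32\Scripts>jupyter notebook

The server log then appears as follows.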

[I 18:56:08.259 NotebookApp] Writing notebook server cookie secret to C:\Users\Li Yen\AppData\Roaming\jupyter\runtime\notebook_cookie_secret
[I 18:56:09.241 NotebookApp] Serving notebooks from local directory: C:\Users\Li Yen\AppData\Local\Programs\Python\Python35-32
[I 18:56:09.241 NotebookApp] Jupyter Notebook 6.1.3 is running at:
[I 18:56:09.244 NotebookApp] http://localhost:8888/?token=4cda80785065046f320111b2788cd0057428c4112abe6915
[I 18:56:09.247 NotebookApp]  or http://127.0.0.1:8888/?token=4cda80785065046f320111b2788cd0057428c4112abe6915
[I 18:56:09.247 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:56:09.272 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///C:/Users/Li%20Yen/AppData/Roaming/jupyter/runtime/nbserver-6628-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=4cda80785065046f320111b2788cd0057428c4112abe6915
     or http://127.0.0.1:8888/?token=4cda80785065046f320111b2788cd0057428c4112abe6915

How many of us really read what is on the black screen? It gives some information that I can also see in the browser, such as the localhost running on port 8888. Another useful piece of information: it tells us how to stop the server (yes, you are running a Jupyter server, and the web browser acts as the Jupyter client where you code) by pressing Ctrl+C to stop the server and shut down all the kernels.

What is a kernel? It receives the code sent by the Jupyter client (our browser), executes it, and returns the results to the client (browser) for display. If you wish to read a bit more about the Jupyter Notebook, you can refer to the link below.

If you know a better way to do the above, please share with me. I am learning every single thing from everyone daily 🙂

Reference: https://threathunterplaybook.com/tutorials/jupyter/introduction.html

January 2020

I hope it is not too late to write out my plans for the year 2020. My volunteer work with TechLadies will come to an end this March. TechLadies is recruiting the new core team for the year 2020. The upcoming boot-camp graduation will introduce the new team to the community. Then, the year 2019 core team will pass the baton to the new team.

Will I still continue volunteering with TechLadies?

I have had this question in my mind lately, and I am not sure how TechLadies plans for it. I am quite sure it would be a great idea to let a new team lead the community. New team, new ideas and directions.

I may consider taking a side role to continue the study group sessions. But I also hope that someone is going to plan and run the study group sessions together with me. If not, then I will slowly run the events as and when I am available. I am not sure whether a mobile study group will work in Singapore.

Besides TechLadies, what else?

Good question. I have a plan to run a learn-and-teach program, after being inspired by my classmate. This program teaches the community (not necessarily within TechLadies) what I have learned recently.

I will randomly pick a topic to learn and share with the community via my blog or private meet-ups. I hope to get more interaction between community members, instead of just giving input without receiving feedback from the community.

I hope I will write and share more technical stuff through my blog here, as well as my posts on Medium.

New focuses

I am looking out for other communities in Singapore that work closely on master data management (MDM), focus on SQL and NoSQL databases, work on data engineering, and use Power BI for data visualization.

I am not going away from my core interest, databases. Also, I want to go in-depth into master data management and will consider taking some courses or certifications in this area. Next, I need to upskill and gain essential experience in the data engineering field while continuing to explore data visualization with Power BI. I am still looking out for a Data Engineering meetup or user group in Singapore. Do you know any?

Not to forget, I am doing data analytics in my final module at Temasek Poly. It is going to be an end-to-end data specialization when I graduate with my Specialized Diploma in Business Analytics this April.

Complete my Python course!

Last but not least, I want to complete my Python course before I graduate too, so that everything is fresh in my mind. Right now, I have completed 10 out of 26 modules. I still need to complete some Pandas, statistics and machine learning topics before the end of February. Maybe I will take a bit of time off from other activities to focus on study and work.

Intermediate Python for Data Science

The subjects in this DataCamp track, Intermediate Python for Data Science, include:

  • Matplotlib
  • Dictionaries and Pandas
  • Logic, Control Flow and Filtering
  • Loops

It looks at data visualization (how to visualize data) and data structures (how to store data). Along the way, it shows how control structures customize the flow of your scripts.

Data Visualization

Data visualization is one of the key skills for data scientists, and Matplotlib makes it easy to create meaningful and informative charts. Matplotlib allows us to build various charts and customize them to make them more visually interpretable. It is not a hard thing to do, and it is pretty interesting to work on. In my previous write-up, I wrote about how to use Matplotlib to build a line chart, scatter plot and histogram.

Data visualization is a very important part of data analysis. It helps to explore the dataset and extract insights from it. I call this data profiling: the process of examining a dataset from an existing data source, such as a database, to produce statistics or summaries of that dataset. The purpose is to find out whether existing data can be used for other purposes, and to determine the accuracy, completeness and validity of the dataset. I relate this to “performing a body check on the dataset to ensure it is healthy”.

One of the methods I learned from my school for data profiling is the use of histograms, scatter plots and boxplots to examine the dataset and find the outliers. I can use Python’s Matplotlib, Excel, Power BI or Tableau to perform this action.
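As a minimal sketch of this kind of profiling with Matplotlib (the values here are made up, with one obvious outlier):

import matplotlib.pyplot as plt

values = [2.1, 2.3, 2.2, 2.4, 9.8, 2.2, 2.5, 2.3]

plt.subplot(1, 2, 1)
plt.hist(values, bins=5)    # shape of the distribution
plt.title('Histogram')

plt.subplot(1, 2, 2)
plt.boxplot(values)         # the outlier shows up as a point beyond the whiskers
plt.title('Boxplot')

plt.show()

Running it, the 9.8 stands out immediately in both charts, which is exactly the “body check” idea above.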

It does not end here…

Python allows us to customize charts to suit our data. There are many types of charts and customizations one can do with Python, from changing colours and labels to axes’ tick sizes. It depends on the data and the story one wants to tell. Refer to the links above to read my write-ups on those charts.

Dictionaries

We can use lists to store a collection of data and access the values using indexes. This can be troublesome and inefficient when it comes to a large dataset; therefore, the use of dictionaries in data analysis is important, as a dictionary represents data in the form of key-value pairs. Creating a dictionary from lists of data can be found in this link, which has one simple example demonstrating how to convert them. However, I do have a question: what about converting long lists to a dictionary? I assumed it is not going to be the same method as in this simple example. Does anyone have an example to share?
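While waiting for a better answer: as far as I know, dict() with zip() works the same regardless of list length. A minimal sketch with made-up lists:

countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# zip() pairs the elements up; dict() turns the pairs into key-value entries.
# The same two lines work for lists of any length.
europe = dict(zip(countries, capitals))
print(europe['norway'])    # fast lookup by key: 'oslo'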

If you have questions about dictionaries, you can refer to my blog, where I wrote a quite comprehensive introduction to dictionaries in Python.

What is the difference between lists and dictionaries?

If you have a collection of values where order matters and you want to easily select entire subsets, you will want to go with a list. On the other hand, if you need some sort of lookup table where looking up data should be fast, by specifying unique keys, a dictionary is the preferred option.

Lastly, Pandas

Pandas is a high-level data manipulation tool built on top of the NumPy package. Since a 2D NumPy array allows only one data type for its elements, it may not be suitable for data structures that comprise more than one data type. In Pandas, data is stored in a tabular structure called a DataFrame.

How to build a DataFrame?

There are a few ways to build a Pandas DataFrame, and we need to import the Pandas package before we begin. In my blog, two methods are shared: using dictionaries, and using an external file such as a .csv file. You can find the examples from the given link, and a short sketch after the list below. Reading from a dictionary is done by converting the dictionary into a DataFrame using DataFrame(), and reading from an external file is done using Pandas’ read_csv().

  • Converting a dictionary using DataFrame()
  • Reading from an external file using read_csv()
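A minimal sketch of both methods, using the BRICS data that appears in the later examples (the column values are my own reconstruction, not from the original screenshot):

import pandas as pd

# Method 1: convert a dictionary into a DataFrame
data = {
    'country': ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
    'capital': ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria'],
}
brics = pd.DataFrame(data)
brics.index = ['BR', 'RU', 'IN', 'CH', 'SA']   # set the row labels

# Method 2: read from an external file; index_col=0 uses the first column as row labels
# brics = pd.read_csv('brics.csv', index_col=0)

print(brics)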

How to read from a DataFrame?

The above screenshot shows what a Pandas DataFrame looks like: it is in the form of rows and columns. If you wonder why the first column goes without a name: yes, in the .csv file it has no column name. It appears to be an identifier for each row, just like an index of the table or a row label. I have no idea whether the content of the file was done with this purpose or it has another meaning.

Index and Select Data

There are two methods you can select data:

  • Using square bracket []
  • Advanced methods: loc and iloc.

The advanced methods, loc and iloc, are Pandas’ powerful, advanced data access methods. To access a column using the square brackets, with reference to the above screenshot again, the following code demonstrates how to select the country column:

brics["country"]

The result shows the row labels together with the country column. Reading a DataFrame this way returns an object called a Pandas Series; you can think of a Series as a one-dimensional labelled array, and when a bunch of Series come together, that is called a DataFrame.

If you want to do the same selection of the country column and keep the data as a DataFrame, then double square brackets do the magic with the following code:

brics[["country"]]

If you check the type of the object, it returns DataFrame. You can define more than one column to be returned. To access rows using the square brackets and slices, with reference to the same screenshot, the code below is used:

brics[1:4]

The result returns rows 2 to 4, or index 1 to 3, which contain Russia, India and China. Do you still remember the characteristic of a slice? The stop value (end value) of a slice is exclusive (not included in the output).

However, this method has a limitation. For example, you cannot access data the way you would with a 2D NumPy array, where the square brackets take a specific row and column:

my_array[row, column]

Hence, Pandas has the powerful, advanced data access methods loc and iloc, where loc is label-based and iloc is position-based. Let us look into the usage of loc. The first example reads a row with loc, followed by another example that reads rows and columns with loc. With the same concept as above, single square brackets return a Series and double square brackets return a DataFrame, just as below:

brics.loc["RU"]                    # Series, single row
brics.loc[["RU"]]                  # DataFrame, single row
brics.loc[["RU", "IN", "CH"]]      # DataFrame, multiple rows

Let us extend the above code to read the country and capital columns using rows and columns with loc. The first part mentions the row labels and the second part mentions the column labels. The code below returns a DataFrame.

brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

The above row values can be replaced with a slice, just like the sample code below:

brics.loc[:, ["country", "capital"]]

The above code does not specify the start and end index, which means it returns all the rows with the country and capital columns. Below is a screenshot comparing square brackets and loc (label-based).

Using iloc is similar to loc; the only difference is that you refer to columns and rows by their index positions instead of specifying row and column labels.
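For example, assuming the BRICS rows are in the order BR, RU, IN, CH, SA and that country and capital are the first two columns (my own reconstruction, not from the original post), these iloc calls would return the same results as the loc examples above:

brics.iloc[[1]]                    # same as brics.loc[["RU"]]
brics.iloc[[1, 2, 3]]              # same as brics.loc[["RU", "IN", "CH"]]
brics.iloc[[1, 2, 3], [0, 1]]      # same as brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
brics.iloc[:, [0, 1]]              # same as brics.loc[:, ["country", "capital"]]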

Python: Introduction IV

It has been a while since I wrote Python Introduction I, II and III. Today, I am going to complete the last part of the introduction, NumPy. Months ago, during my Python self-learning time, I wrote about NumPy; here is the link.

NumPy

As an alternative to the Python list, the NumPy array helps us solve problems with Python list operations: calculations on Python lists cannot be done the same way we do them for two integers or strings. This package needs to be installed before we can import and use it.

In my blog above, I wrote about the behaviour of the NumPy array. It does not allow different types of elements in the array. When a NumPy array is built, the elements’ data types are changed to end up with a homogeneous array. Suppose the list contains a string, a number and a Boolean; they all change to string format, for example.
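A quick sketch of this coercion behaviour (my own example):

import numpy as np

mixed = np.array([1.0, 'is', True])
print(mixed)          # ['1.0' 'is' 'True'] -- every element is now a string
print(mixed.dtype)    # a string dtype such as <U32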

Also, operators such as “+” and “-”, which we use with Python lists, work differently with NumPy arrays. Refer below for an example:

import numpy as np

py_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])

py_list + py_list          # list concatenation: [1, 2, 3, 1, 2, 3]
numpy_array + numpy_array  # element-wise addition: array([2, 4, 6])

The first output shows the two lists merged or combined into a single list. The second output shows an array that is the element-wise addition of those numbers. The screenshot below shows the result, which I executed in the Jupyter Notebook.

What is covered in the link above is good enough to give us a basic understanding of NumPy. If you wish to learn more, there is another link I found on Medium which we can refer to.

NumPy Subsetting

Specifically for NumPy, there is a way of doing list subsetting using an array of Booleans. The example below shows how we can get all the BMI values above 23. Refer to the example from DataCamp:

The first result is an array of Booleans: True if the BMI value is above 23. Then, you can use this Boolean array inside square brackets to do the subsetting. Where the Boolean value is True, the corresponding value is selected.

In short, it is using the result of the comparison to make a selection of data.
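Since the original screenshot is not shown here, a minimal sketch of the idea (BMI values as I recall them from the DataCamp example):

import numpy as np

bmi = np.array([21.852, 20.975, 21.750, 24.747, 21.441])

high = bmi > 23      # array of Booleans
print(high)          # [False False False  True False]
print(bmi[high])     # [24.747] -- only the values where the Boolean is True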

2D NumPy Array

I covered the 2D NumPy array in this link, where I show how to declare a 2D NumPy array and how it works for subsetting, indexing, slicing and math operations.

NumPy: Basic Statistics

You can generate summary statistics of the data using NumPy. NumPy has a few useful statistical functions which can be used for analytics, including finding the min, max, average, standard deviation, variance, etc. of the given elements in an array. Refer to my write-up on these basic statistics in this link.
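A small sketch of those functions on made-up data:

import numpy as np

heights = np.array([1.73, 1.68, 1.71, 1.89, 1.79])

print(np.min(heights), np.max(heights))   # smallest and largest values
print(np.mean(heights))                   # average
print(np.median(heights))                 # middle value
print(np.std(heights), np.var(heights))   # spread around the mean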

God forbid you analyze your data using traditional Python lists; particularly in the context of data science, NumPy arrays have numerous advantages over lists.

Murtaza Ali

Reference

Day 42: Importing Data in Python

Introduction

Importing data from various sources such as:
– flat files, e.g. txt, csv
– files from software, e.g. Excel, SAS, Matlab files
– relational databases, e.g. SQL Server, MySQL, NoSQL

Reading a text file
It uses the open() function to open a connection to a file, passing two parameters:
– filename
– mode, e.g. r is read, w is write

Then, assign a variable to the result of read() on the file. Lastly, close() the file after the process is done to close the connection.

If you wish to print out the texts in the file, use print() statement.

Syntax:
file = open(filename, mode='r')
text = file.read()
file.close()

print(text)

The filename can be assigned to a variable instead of writing the filename in the open() function’s parameter. It is always best practice to close() an opened connection.

To avoid forgetting to close a file connection (missing file.close() at the end of our code), it is advisable to use a context manager, which uses the with statement.

Some of the tutorials in my previous posts used a context manager to open a file connection to read .csv files. The with statement executes the open() command, and this creates a context in which commands can be executed while the file is open. Once out of this clause, or context, the file is no longer open; for this reason, with is called a context manager.

What it is doing here is called ‘binding’ a variable in the context manager construct; within this construct, the variable file will be bound to open(filename, 'r').

Let me share the tutorials I did on DataCamp’s website.

# Open a file: file
file = open('moby_dick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

First, it opens a file using the open() function, passing the filename and read mode. It then reads the file. The first print() statement prints out the content of the moby_dick text file. The second print() statement returns a Boolean which checks whether the file is closed; in this case, it returns False. Lastly, it proceeds to close the file using the close() function and then checks the Boolean again. This time, it returns True.

Importing text files line by line
For a larger file, we do not want to print out everything inside the file. We may want to print out only several lines of the content. To do this, we use the readline() method.

When the file is opened, use file.readline() to read each line. See the code below, using the with statement and file.readline() to print the first three lines of the content in the file.

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Summary of the day:

  • Importing data from different sources.
  • Reading from a text file using open() and read(),
  • Importing text line by line using readline() method.
  • with statement as context manager.
  • close() to close a file.
  • file.closed returns a Boolean indicating whether the file is closed.

Using Python for Streaming Data with Iterators

Using pandas read_csv iterator for streaming data

Next, the tutorial uses the Pandas read_csv() function with the chunksize argument to read the same dataset, the World Bank World Development Indicators, chunk by chunk.

Another way to read data too large to store in memory in chunks is to read the file in as DataFrames of a certain length (using the chunksize).

First and foremost, import the Pandas package, then use the .read_csv() function, which creates an iterable reader object. Then, we can call next() on it to print a chunk. Refer below for the sample code.

# Import the pandas package
import pandas as pd

# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv', chunksize=10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))

The output of the above code shows the data processed in chunks of 10 rows.

The next 10 records will be from index 10 to 19.

Next, the tutorial requires creating another DataFrame composed of only the rows for a specific country. Then, zip together two of the columns from the new DataFrame, ‘Total Population’ and ‘Urban population (% of total)’. Finally, create a list of tuples from the zip object, where each tuple is composed of a value from each of the two columns mentioned.

Sounds a bit complicated now… Let us see the sample code below:

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the DataFrame
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

My output looks as below:

Now, it requires plotting a scatter plot. The source code below contains the previous exercise’s code from DataCamp itself; therefore, there is a difference in the method used in this line:

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc[df_urb_pop['CountryCode'] == 'CEB']

#From DataCamp
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

It requires using a list comprehension to create a new DataFrame column. The values in this new DataFrame column, ‘Total Urban Population’, are the product of the first and second element in each tuple.

Furthermore, because the second element is a percentage, the entire result needs to be divided by 100, or alternatively, multiplied by 0.01.

Then, using the Matplotlib package, plot the scatter plot with the new column ‘Total Urban Population’ against ‘Year’. It is quite a lot of stuff combined up to this point. See the code and the resulting plot shown below.

# Code from previous exercise
import pandas as pd
import matplotlib.pyplot as plt

urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
df_urb_pop = next(urb_pop_reader)
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
pops = zip(df_pop_ceb['Total Population'], 
           df_pop_ceb['Urban population (% of total)'])
pops_list = list(pops)

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(entry[0] * entry[1] * 0.01) for entry in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

I realized that ‘Year’ is not in integer format. I printed out the values of the ‘Year’ column and they looked perfectly fine. Do you know what is wrong with my code above?

This time, you will aggregate the results over all the DataFrame chunks in the dataset. This basically means you will be processing the entire dataset now. This is neat because you’re going to be able to process the entire large dataset by just working on smaller pieces of it!

The sample code below consists of some of DataCamp’s code, so some variable names have been changed to use their variable names. Here, it requires appending each DataFrame chunk to a variable called ‘data’ and plotting the scatter plot.

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
                df_pop_ceb['Urban population (% of total)'])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
    # Append DataFrame chunk to data: data
    data = data.append(df_pop_ceb)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

I tried to compare the lines of code used by DataCamp and mine.

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(entry[0] * entry[1] * 0.01) for entry in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

# Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
    # Append DataFrame chunk to data: data
    data = data.append(df_pop_ceb)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

Both df_pop_ceb look the same, so why do the scatter plots display differently? I still cannot figure out why my scatter plot shows ‘Year’ with decimal points.

Lastly, wrap up the tutorial by creating a user-defined function taking two parameters, the filename and the country code, to do all of the above. The source code and another scatter plot are shown below:

# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
        # Append DataFrame chunk to data: data
        data = data.append(df_pop_ceb)

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')

The scatter plot of the country code ‘ARB’ is shown below.

Summary of the day:

  • Use of user defined functions with parameters.
  • Iterators and list comprehensions.
  • Pandas’ DataFrame.
  • Matplotlib scatter plot.

Using Python for Streaming Data with Generators

Continuing with the same dataset from the World Bank World Development Indicators. Previously, I wrote about using iterators to load data chunk by chunk; this time, it is about using generators to load data or a file line by line.

Generators work for streaming data, where the data is written line by line from time to time. Generators are able to read and process the data until they reach the end of the file or there are no more lines to process. Sometimes, data sources can be so large that storing the entire dataset in memory becomes too resource-intensive.

In this exercise from DataCamp’s tutorials, I process the first 1000 rows of a file line by line, to create a dictionary of the counts of how many times each country appears in a column of the dataset. Below are some details about the dataset and how to import it.

To begin, I need to open a connection to a file using what is known as a context manager. For example, the command,

with open('datacamp.csv') as datacamp

binds the csv file 'datacamp.csv' as datacamp in the context manager.

Here, the with statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file.

The sample code below uses the .readline() method to read a line from the file object. It can then split the line into a list using the .split() method.

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

The output looks as below:

{'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 18}

Use generator to load data
Generators allow users to lazily evaluate data. This concept of lazy evaluation is useful when you have to deal with very large datasets because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.

The tutorial requires defining a generator function read_large_file() that produces a generator object which yields a single line from a file each time next() is called on it.

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

Next, use the generator function created above to read a file line by line, create a dictionary of counts of how many times each country appears in a column of the dataset, process all the rows in the file, and print out the result. See the sample code below:

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print            
print(counts_dict)

And, the output looks as below:

{'CountryName': 1, 'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 99, 'High income': 131, 'High income: nonOECD': 68, 'High income: OECD': 127, 'Latin America & Caribbean (all income levels)': 130, 'Latin America & Caribbean (developing only)': 133, 'Least developed countries: UN classification': 78, 'Low & middle income': 138, 'Low income': 80, 'Lower middle income': 126, 'Middle East & North Africa (all income levels)': 89, 'Middle East & North Africa (developing only)': 94, 'Middle income': 138, 'North America': 123, 'OECD members': 130, 'Other small states': 63, 'Pacific island small states': 66, 'Small states': 69, 'South Asia': 36}

Obviously, it shows more data than before, because it is not limited to 1000 rows.

Summary of the day:

  • Generators for streaming data.
  • Context manager, open a file connection.
  • Use generator function to load data.
  • Use a dictionary to store result.

World Bank World Development Indicator Case Study with Python

In DataCamp’s tutorial, Python Data Science Toolbox (Part 2), it combines user-defined functions, iterators, list comprehensions and generators to wrangle and extract meaningful information from a real-world case study.

It is going to use the World Bank’s dataset. The tutorial will use everything I have learned recently to work with this dataset.

Dictionaries for Data Science

The zip() function combines two lists into a zip object, which can then be converted into a dictionary.

Before I share the sample code, let me share again what the zip() function does.

Using zip()
It allows us to stitch together an arbitrary number of iterables. In other words, it zips them together to create a zip object, which is an iterator of tuples.

# Header and first data row (values reconstructed from the output shown below)
feature_names = ['CountryName', 'CountryCode', 'IndicatorName',
                 'IndicatorCode', 'Year', 'Value']
row_vals = ['Arab World', 'ARB',
            'Adolescent fertility rate (births per 1,000 women ages 15-19)',
            'SP.ADO.TFRT', '1960', '133.56090740552298']

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)

# Output: {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}

Next, the tutorial wants us to create a user-defined function with two parameters. I can re-use the above code, wrap it in a user-defined function, and call it passing two arguments, feature_names and row_vals.

# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

It should give the same result when the code is run. Next, the tutorial requires me to use a list comprehension. It requires turning a bunch of lists into a list of dictionaries with the help of a list comprehension, where the keys are the header names and the values are the row entries.

The syntax,
[[output expression] for iterator variable in iterable]

The question in the tutorial is,
Create a list comprehension that generates a dictionary using lists2dict() for each sublist in row_lists. The keys are from the feature_names list and the values are the row entries in row_lists. Use sublist as your iterator variable and assign the resulting list of dictionaries to list_of_dicts.

This was the code on the screen before I started coding. As above, sublist is the iterator variable, so it is substituted between the “for” and “in” keywords. The instruction says,
for each sublist in row_lists

indirectly, it means,

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [--- for sublist in row_lists]

The lists2dict() function which I created above returns a dictionary. The question says,
generates a dictionary using lists2dict()

Indirectly, it means calling the lists2dict() function in the output expression. But if I code,

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, row_lists) for sublist in row_lists]

The output was very wrong, and when I clicked the “Submit” button, it prompted an error message:
Check your call of lists2dict(). Did you correctly specify the second argument? Expected sublist, but got row_lists.

It expected sublist, and yes, the for loop reads each list in row_lists. I have code to print each list:
print(row_lists[0])

It is more meaningful to use sublist as the second argument rather than row_lists. Therefore, the final code is:

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
#Output:
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}

The above code really took my time to find out what I should write and why I did not get the code right at first. That did not stop me from continuing my tutorial.

Turning it into a DataFrame
Up to this point in the case study, I used the zip() function, put it into a user-defined function, and used the newly created function in a list comprehension to generate a list of dictionaries.

Next, the tutorial wants to convert the list of dictionaries into a Pandas DataFrame. First and foremost, I need to import the Pandas package. Let us refer to the code below:

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

Summary of the day:

  • zip() function combines the lists into a zip object.
  • Use user defined function in list comprehension.
  • Convert a list of dictionaries into a DataFrame.

Day 41: Write Generator Function

What is a generator function?

– It produces a generator object when it is called.
– It is defined like a regular function, using the def keyword.
– It returns values using the yield keyword; it yields a sequence of values instead of a single value.

How to build a generator?

A generator function is defined as you would a regular function, but whenever it generates a value, it uses the keyword yield instead of return. Let us look at the exercise in DataCamp’s tutorial, which walks me through how to write a generator function in Python.

The instructions are as below:
1. Complete the function header for the function get_lengths() that has a single parameter, input_list.
2. In the for loop in the function definition, yield the length of the strings in input_list.
3. Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. Supply the call to get_lengths(), passing in the list lannister.

The code is as below:

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

#Output:
#6
#5
#5
#6
#7

Summary of the day:

  • Generator function.
  • Keyword: yield.