Python: Introduction III

This last part of the Python introduction covers functions, methods and packages in Python. There is indeed a difference between a function and a method; I revisited my original post where I wrote about those differences, and you may want to read it before continuing here.

User-defined Functions

The simplest way I can explain what a function is, as I wrote in my original post:

A function is a block of code that carries out a task and is called by its name. A function may have zero or many arguments, which are passed explicitly (directly). On exit, the function may or may not return a value or values.

There are some examples in this post explaining functions: how to define a function with and without arguments, how to use a default value for an argument, how to use the flexible arguments *args and **kwargs, and how to use the return statement in a function.
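As a quick illustration (the names here are my own invented examples, not from the original posts), a sketch of a function with a default argument, the flexible *args and **kwargs, and a return statement:

```python
# Invented example: default argument value and a return statement.
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Invented example: *args gathers extra positional arguments into a tuple,
# **kwargs gathers extra keyword arguments into a dictionary.
def summarize(*args, **kwargs):
    return sum(args), kwargs

print(greet("Anna"))                 # Hello, Anna!
print(greet("Anna", greeting="Hi"))  # Hi, Anna!
print(summarize(1, 2, 3, unit="m"))  # (6, {'unit': 'm'})
```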

Methods

A method is like a function, except that it is attached to an object (dependent on it). The object on which a method is invoked is passed to it implicitly (indirectly). A method may or may not return a value or values, and it has access to the data contained within its class.

For examples of methods, I wrote about them in this post.
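To make the contrast concrete, here is a minimal invented sketch: the function area() is attached to the Circle class, and the object is passed implicitly as self when the method is invoked:

```python
# Invented example: a method is a function attached to an object.
class Circle:
    def __init__(self, radius):
        self.radius = radius        # data contained within the class

    def area(self):
        # 'self' is passed implicitly when c.area() is invoked.
        return 3.14159 * self.radius ** 2

c = Circle(2)
print(c.area())
```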

Packages

Think of a package as a directory of Python scripts, where each .py script is a module. A module defines functions, methods and types for solving a particular problem. I found a link which explains packages in Python in detail; refer here for more reading.
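For instance, modules can be imported in several ways (using the standard library's math module here, just as an illustration):

```python
# Importing a whole module, and importing a single name from it.
import math               # use as math.sqrt(...)
from math import sqrt     # use as sqrt(...)

print(math.pi)
print(sqrt(16))   # 4.0
```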

In this part III, I know many external links are given, mainly to avoid rewriting entries I wrote some time ago. This blog serves as a place to find the relevant resources and examples, which I think are enough to cover a basic understanding of functions, methods and packages in Python.


Python: Introduction II

I continue from Python: Introduction, which I wrote yesterday and which gave a very basic idea of Python in terms of declaring variables, data types and storing data in collections. So: variables, data types and collections. If you missed it or cannot recall it, here is the link to yesterday’s post, which links to the various data types and collections; I did not want to repeat them here.

For part II, I have decided to concentrate on an introduction to control flow. I think it is good to have control flow taught first before we head to more Data Science oriented topics, such as using the package called NumPy and creating our own functions and methods. So, I swapped the order of topics a little in my writing.

if, elif, else

A conditional statement lets us act when certain conditions are matched. There are a few ways to write this statement, and we do not always need elif and else. Let us look at the syntax below:

if condition :
  expression

This syntax matches one single condition only, which is why just “if” is used. Example:

z = 4
if z % 2 == 0 :
  print("z is even")

All control flow statements have a standard syntax, and indentation marks the beginning of the expression, i.e. what should happen when the condition is matched. That is why the print() statement is slightly indented. In most IDEs, the next line is automatically indented after the colon (:) that follows the condition, in this case z % 2 == 0.

The moment we have one more branch in our code, the if-else statement is used. See the syntax below:

if condition :
  expression
else :
  expression

In the else statement, we do not need to specify a condition, because it is understood that when the if condition does not match, execution falls through to the else block and runs its code. The else statement acts as a default branch. I know some other programming languages have an explicit default clause at the end, meaning “if nothing else matches, run this”.

We can omit the else statement when nothing needs to happen if the first condition does not match. However, we might then silently skip important scenarios: besides the legitimate case where the condition simply does not match, unexpected values can slip through unnoticed. It is a good habit to use the else statement as a catch-all that prints a line to the console or a log file (true exception handling in Python uses try/except, which is a topic of its own). This helps during debugging.

z = 5
if z % 2 == 0 :
  print("z is even")
else :
  print("z is odd")

The above is an example of an if-else statement for an either-or situation. The variable z is 5, hence execution goes to the else branch and prints “z is odd”.

Next, the if-elif-else statement is used when there are several conditions, of which one may match. When the first condition does not match, checking moves on to the next elif condition, and if none match, it ends at the else statement. You can have many elif statements in your code. Below is the syntax:

if condition :
  expression
elif condition :
  expression
else :
  expression

Example of using the syntax:

z = 3
if z % 2 == 0 :
  print("z is divisible by 2")
elif z % 3 == 0 :
  print("z is divisible by 3")
else :
  print("z is neither divisible by 2 nor by 3")

The output is “z is divisible by 3”. As soon as one condition matches, its expression runs and the statement terminates; it does not proceed to check whether the next condition matches. With these three examples, I hope you have some idea of the if, if-else and if-elif-else statements.

while

The while statement repeats an action as long as its condition remains true. It is important to assess the code before running a while statement, because if the condition never becomes false, the statement keeps running; we call this an infinite loop, and you have to end the application manually. The syntax for the while statement:

while condition :
  expression

Example of using the while syntax:

x = 0
while x < 5:
     print(f'The number is {x}')
     x += 1  

The most crucial part here is the variable x, which works as a counter to ensure the loop eventually stops. Without this line of code (x += 1), the condition is always true and the loop becomes infinite.

This is the output from executing the while statement:

The number is 0
The number is 1
The number is 2
The number is 3
The number is 4

When x reaches 5, the condition is no longer true, so it exits the while statement without printing anything further.

for

Remember in the previous blog I mentioned the Python list (a collection)? The for statement is a good control flow to iterate (repeat) through a Python list and get each element.

for var in seq :
  expression

Without the for statement, we might repeat the print statement a few times to print out the elements inside the Python list below:

fam = [1.73, 1.68, 1.71, 1.89]

print(fam[0])
print(fam[1])
print(fam[2])
print(fam[3])

Although this is correct syntax, it is not good practice. Below demonstrates how to use a for statement to iterate through the Python list and print the values out.

for height in fam :
  print(height)

Both pieces of code return the same output.

1.73
1.68
1.71
1.89

The for statement works well for any type of collection, and even for a string: using my_string as the sequence and a variable named character as the loop variable, it prints out each character in my_string, one by one.
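A sketch of that string example (the value of my_string is my own assumption):

```python
# Iterating over a string, character by character.
my_string = "Python"
for character in my_string:
    print(character)
```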

In a for statement, we can use enumerate(), a Python built-in function. enumerate() adds a counter to an iterable and returns an enumerate object, which can then be used directly in for loops or converted into a list of tuples using the list() function.

Why does enumerate() return tuples?

The enumerate object yields pairs in the form (index, value). enumerate() also accepts a start parameter, the initial value of the counter, which defaults to 0. A simple illustration is below:

enumerate(iterable, start=0)

words = ["eat", "sleep", "repeat"]  # avoid naming a variable 'list': it shadows the built-in
print(list(enumerate(words)))

When we check the output from the console, it shows as below:

[(0, 'eat'), (1, 'sleep'), (2, 'repeat')]

It starts with index 0; of course, that can be changed by passing a start value, e.g. enumerate(iterable, 1), so the index begins with 1 instead of 0. The enumerate() function is useful when we want to list the elements of a collection together with their index and value.
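A small sketch of changing the start value:

```python
# The counter starts at 1 instead of the default 0.
items = ["eat", "sleep", "repeat"]
print(list(enumerate(items, start=1)))   # [(1, 'eat'), (2, 'sleep'), (3, 'repeat')]
```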

fam = [1.73, 1.68, 1.71, 1.89]
for index, height in enumerate(fam) :
  print("index " + str(index) + ": " + str(height))

Reusing the earlier example, we now have enumerate(fam) in the for statement instead of “for height in fam”. Then, in the print() statement, we convert the index and height values to strings and concatenate them. This may be useful when, for example, printing a numbered list of a shopping cart’s items. The output shows:

index 0: 1.73
index 1: 1.68
index 2: 1.71
index 3: 1.89

Mastering control flow helps at the later stage when we go into the data structures section. I have written separate blogs about if-else statements, while loops and for loops; you can refer to the links below:

Python: Introduction I

It has been a while since I stopped learning Python on DataCamp, due to my part-time classes and assignments, and work commitments. It is not easy to keep track of each of them every day. On top of that, I still have my volunteer work with TechLadies, with regular meet-ups to brainstorm and update each other.

Today’s topic is very much on Python. I want to concentrate my writing on Python for the next two weeks before I head off for a holiday; I am sure I will be lazy after my break. It would be great to write something that summarizes and reorganizes what I have written over the past few months while learning Python with DataCamp and Udemy.

I remember my very first day learning Python on Udemy: it covered installation, and I went on to install IntelliJ. To date I hardly use it; most of the time I use the online version of Jupyter Notebook, which I find pretty easy to use. I understand there are many other IDEs on the market, and no specific software is required to code Python. For now, I will keep it simple for my learning.

Checking version

The very first time, we always want to know whether everything we installed for Python works. Checking that the version is the updated, correct one is the first thing we might want to do:

python --version

Simply open your terminal or command line and type the above command. On the screen, it returns the version info, such as:

Python 3.7.0

print('Hello World')

Next, we always start with a simple print statement, using the built-in function print() to print out some lines; most often the first line we print is “Hello World”. Really, most people who start learning a programming language have this line printed. I use this function everywhere in my coding and it is very useful. It is just the same as the PRINT statement in SQL Server, if you come from a database background. Using a single quote or a double quote does not matter.

print('Hello World!')
print("Hello World!")

Variables and Types

Then we touch on variables and types, important components of most programming languages. Variables and types are interrelated. I discussed the characteristics of a variable in my first post; let me repeat them here:

  • Specific and case-sensitive name; best practice is to use lowercase.
  • Defines things that are subject to change.
  • Can be used to store texts, numbers or dates.
  • Cannot start with a number.
  • Cannot use spaces or symbols in the name; use _ instead.

Then there are plenty of different data types as well; yes, those are the “types” I meant. Remember, different types have different behaviours. I wrote posts about each of them before and will link them up whenever we revisit the topic.

  • Boolean operations: and, or, not (True, False)
  • Numeric types: int, float, complex (number, decimal)
  • Text sequence type: str (string)
  • Sequence type: list, tuple, range
  • Mapping type: dict 
  • Sets type: set

The simplest way to demonstrate this is to create a variable and assign a value to it:

height = 1.67
weight = 180

name = 'Joanne'
gender = 'Female'

isStudent = True

The above shows the height and weight variables with float and int data types, then name and gender as strings, and a variable called isStudent with a Boolean value. In Python, declaring a variable does not require any prefix in front of or behind the variable, as we see in JavaScript or SQL Server, if you are familiar with those languages. Then you may ask: how does the interpreter know which data type it is?

type()

print(type(height))
print(type(weight))
print(type(name))
print(type(isStudent))

type() is a built-in function which lets us check the data type of a variable we created; wrapping it in print() displays the result when running a script. type() answers the question above.

That completed the fundamental and basic to code in Python. Now, you know how to do the following:

  • use the print() statement to print texts.
  • use of variables and data types.
  • use the type() function to print out the data type of a variable.

Probably you now want to know what an integer, a string, a Boolean etc. are. I have some links here with basic explanations and examples:

Numbers and strings could be a topic of their own, as there is much that is interesting about them, such as the use of the (+) sign. For strings it is the concatenation sign, combining two or more values of the same type, while for numbers it performs addition; so numbers and strings use the (+) sign differently. Also remember that in Python, a string and an integer cannot be combined with the (+) sign: it throws an exception (“exception” is programming jargon for an error, and exception handling is a Python topic of its own). In such cases, there is string formatting and integer conversion.
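A short sketch of these behaviours (the values are invented):

```python
# (+) adds numbers but concatenates strings.
print(1 + 2)               # 3
print("data" + "camp")     # datacamp

# Mixing str and int raises a TypeError; convert first or use an f-string.
age = 30
print("Age: " + str(age))  # Age: 30
print(f"Age: {age}")       # Age: 30
```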

Let us move into fundamental part two, Python List.

Python Collections

It is an interesting topic and an important part of Python. Almost every one of us uses the Python list in our daily coding life 🙂 It is a collection of values, allows different types among its elements, and is one of the simplest and easiest collections. When it comes to the word “collection”, Python has four types of collections.

You can read more about the basics of these collections here. Each has different characteristics, syntax, structure and usage. Along the way, we will use different collections to explain Python code and concepts. Below is an example of what a list looks like:

fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']

Declaring a list is the same as declaring a variable; it just requires following the list syntax. As mentioned earlier, a list can hold any data types, so you can declare a list as below too:

family = ['Anna', 1.73, 'Eddie', 1.68, 'Mother', 1.71, 'Father', 1.89]

We can use the lists above with control flow, going through iteration and/or condition checking, then calculating a value and returning a result. I think I will cover that in the next post.
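As a small preview (a sketch of my own, not from the tutorial), a for loop with a condition can pick the heights out of the mixed list and compute their average:

```python
family = ['Anna', 1.73, 'Eddie', 1.68, 'Mother', 1.71, 'Father', 1.89]

# Keep only the float elements (the heights), then average them.
heights = [value for value in family if isinstance(value, float)]
average = sum(heights) / len(heights)
print(round(average, 4))   # 1.7525
```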

Up to now, this is still basic Python and does not involve any analytics or data science work, if that is what you are looking for.

Day 42: Importing Data in Python

Introduction

Importing data from various sources, such as:
– flat files, e.g. txt, csv.
– files from software, e.g. Excel, SAS, Matlab files.
– databases, e.g. SQL Server, MySQL, and NoSQL stores.

Reading a text file
It uses the open() function to open a connection to a file, passing two parameters:
– filename.
– mode, e.g. r for read, w for write.

Then, assign a variable to the result of the file's read(). Lastly, close() the file after processing is done, to close the connection to the file.

If you wish to print out the texts in the file, use print() statement.

Syntax:
file = open(filename, mode='r')
text = file.read()
file.close()

print(text)

The filename can be assigned to a variable instead of written directly in the open() function’s parameter. It is always best practice to close() an opened connection.

To avoid forgetting to close a file connection (missing file.close() at the end of our code), it is advisable to use a context manager, i.e. the with statement.

In my previous posts, some tutorials used a context manager to open a file connection to read .csv files. The with statement executes the open() command and creates a context in which commands can run while the file is open. Once out of this clause, the file is no longer open; for this reason with is called a context manager.

What happens here is called ‘binding’ a variable in the context manager construct: within the construct, the variable file is bound to open(filename, 'r').
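A minimal self-contained sketch of that binding (the file example.txt is created here only so the snippet runs on its own):

```python
# Create a small file so the example is self-contained.
with open('example.txt', 'w') as f:
    f.write('Hello file')

# 'file' is bound to the open file object inside the with block.
with open('example.txt') as file:
    text = file.read()

print(text)          # Hello file
print(file.closed)   # True: the file was closed when the block ended
```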

Let me share the tutorials I did in DataCamp’s website.

# Open a file: file
file = open('moby_dick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

First, it opens a file using the open() function, passing the filename and read mode, and then reads the file. The first print() statement prints the content of the moby_dick text file. The second print() statement returns a Boolean indicating whether the file is closed; at this point it prints False. Lastly, it closes the file using the close() function and checks the Boolean again; this time it prints True.

Importing text files line by line
For a larger file, we do not want to print everything in the file; we may want to print only several lines of the content. To do this, use the readline() method.

When the file is opened, use file.readline() to print each line. See the code below, which uses the with statement and file.readline() to print the first lines of the file’s content.

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Summary of the day:

  • Importing data from different sources.
  • Reading from a text file using open() and read().
  • Importing text line by line using the readline() method.
  • The with statement as context manager.
  • close() to close a file.
  • file.closed, an attribute holding a Boolean that tells whether the file is closed.

Using Python for Streaming Data with Iterators

Using pandas read_csv iterator for streaming data

Next, the tutorial uses the pandas read_csv() function with the chunksize argument to read the same dataset, the World Bank World Development Indicators, chunk by chunk.

Another way to read data too large to store in memory is to read the file in chunks, as DataFrames of a certain length (using chunksize).

First and foremost, import the pandas package, then use the read_csv() function, which with chunksize creates an iterable reader object. We can then call next() on it to get and print each chunk. Refer below for the sample code.

# Import the pandas package
import pandas as pd

# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv', chunksize=10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))

The above code processes the data with a chunk size of 10; the output shows the first two chunks.

The next 10 records will be from index 10 to 19.

Next, the tutorial requires creating another DataFrame composed of only the rows for a specific country, then zipping together two of the columns from the new DataFrame, ‘Total Population’ and ‘Urban population (% of total)’. Finally, create a list of tuples from the zip object, where each tuple is composed of a value from each of the two columns.

Sounds a bit complicated now… Let us see the sample code below:

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the DataFrame
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

My output looks as below:

Now it requires plotting a scatter plot. The source code below contains the previous exercise’s code from DataCamp itself. Note there is a difference in the method used on this line:

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc[df_urb_pop['CountryCode'] == 'CEB']

#From DataCamp
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

It requires using a list comprehension to create a new DataFrame column, ‘Total Urban Population’, whose values are the product of the first and second element in each tuple.

Furthermore, because the second element is a percentage, the result needs to be divided by 100, or alternatively multiplied by 0.01.

Then, using the Matplotlib package, plot the scatter plot with the new column ‘Total Urban Population’ against ‘Year’. It is quite a lot of stuff combined up to this point. See the code and the resulting plot below.

# Code from previous exercise
import matplotlib.pyplot as plt   # import added here: needed for plt.show() below
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
df_urb_pop = next(urb_pop_reader)
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
pops = zip(df_pop_ceb['Total Population'], 
           df_pop_ceb['Urban population (% of total)'])
pops_list = list(pops)

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(entry[0] * entry[1] * 0.01) for entry in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

I realized that the ‘Year’ axis is not in integer format. I printed out the values of the ‘Year’ column and they looked perfectly fine. Do you know what is wrong with my code above?
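I cannot verify the cause without the exact dataset, but one way to investigate (a hedged sketch with made-up data) is to check the dtype of ‘Year’: if missing values crept in, pandas stores the column as float64, and Matplotlib then renders decimal tick labels. Casting back to int would fix the axis:

```python
import pandas as pd

# Made-up stand-in for the real DataFrame.
df = pd.DataFrame({'Year': [1960.0, 1961.0], 'Total Urban Population': [100, 110]})

print(df['Year'].dtype)              # float64: the likely reason for decimal ticks
df['Year'] = df['Year'].astype(int)  # cast back to integers
print(df['Year'].dtype)              # int64
```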

This time, we aggregate the results over all the DataFrame chunks in the dataset, which basically means processing the entire dataset. This is neat because we can process the entire large dataset by just working on smaller pieces of it!

The sample code below includes some of DataCamp’s code, so some variable names have been changed to use their names. Here, it requires appending each DataFrame chunk to a variable called ‘data’ and plotting the scatter plot.

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
                df_pop_ceb['Urban population (% of total)'])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
    # Append DataFrame chunk to data: data
    # (note: DataFrame.append was removed in pandas 2.0;
    #  newer versions need data = pd.concat([data, df_pop_ceb]))
    data = data.append(df_pop_ceb)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

I tried to compare DataCamp’s lines of code and mine.

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(entry[0] * entry[1] * 0.01) for entry in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

# Append DataFrame chunk to data: data
data = data.append(df_pop_ceb)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

Both df_pop_ceb computations look the same, so how come the scatter plots display differently? I still cannot figure out why my scatter plot shows ‘Year’ with decimal points.

Lastly, wrap up the tutorial by creating a user-defined function taking two parameters, the filename and the country code, to do all of the above. The source code and another scatter plot are shown below:

# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
        # Append DataFrame chunk to data: data
        data = data.append(df_pop_ceb)

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')

The scatter plot of the country code ‘ARB’ is shown below.

Summary of the day:

  • Use of user defined functions with parameters.
  • Iterators and list comprehensions.
  • Pandas’ DataFrame.
  • Matplotlib scatter plot.

Using Python for Streaming Data with Generators

Continuing with the same dataset, the World Bank World Development Indicators. Previously I wrote about using iterators to load data chunk by chunk; now we use generators to load a file line by line.

Generators work well for streaming data, where the data is written line by line over time. A generator can read and process the data until it reaches the end of the file and there are no more lines to process. Sometimes data sources are so large that storing the entire dataset in memory is too resource-intensive.

In this exercise from DataCamp’s tutorials, I process the first 1000 rows of a file line by line to create a dictionary counting how many times each country appears in a column of the dataset. Below are some details about the dataset and how to import it.

To begin, I need to open a connection to a file using what is known as a context manager. For example, the command

with open('datacamp.csv') as datacamp:

binds the csv file ‘datacamp.csv’ to the variable datacamp within the context manager.

Here, the with statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file.

The sample code below uses the readline() method to read a line from the file object. It can then split the line into a list using the split() method.

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

The output looks as below:

{'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 18}

Use a generator to load data
Generators allow users to evaluate data lazily. This concept of lazy evaluation is useful with very large datasets, because it lets us generate values efficiently by yielding only chunks of data at a time instead of the whole thing at once.
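The difference is easy to see by comparing a list comprehension with a generator expression:

```python
# The list comprehension builds every value up front;
# the generator expression produces values only on demand.
squares_list = [x * x for x in range(5)]
squares_gen = (x * x for x in range(5))

print(squares_list)        # [0, 1, 4, 9, 16]
print(next(squares_gen))   # 0
print(next(squares_gen))   # 1
```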

The tutorial requires defining a generator function read_large_file() that produces a generator object yielding a single line from a file each time next() is called on it.

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

Next, use the generator function created above to read the file line by line, create a dictionary counting how many times each country appears in a column of the dataset, process all the rows in the file, and print the result. See the sample code below:

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print            
print(counts_dict)

And, the output looks as below:

{'CountryName': 1, 'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 99, 'High income': 131, 'High income: nonOECD': 68, 'High income: OECD': 127, 'Latin America & Caribbean (all income levels)': 130, 'Latin America & Caribbean (developing only)': 133, 'Least developed countries: UN classification': 78, 'Low & middle income': 138, 'Low income': 80, 'Lower middle income': 126, 'Middle East & North Africa (all income levels)': 89, 'Middle East & North Africa (developing only)': 94, 'Middle income': 138, 'North America': 123, 'OECD members': 130, 'Other small states': 63, 'Pacific island small states': 66, 'Small states': 69, 'South Asia': 36}

As expected, it reports more data than before, because this version processes the whole file instead of stopping at 1,000 rows.
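As an aside, the if/else counting branch above can be collapsed with dict.get(), which returns a default value when a key is missing. A small equivalent sketch (the CSV lines here are made up for illustration):

```python
counts_dict = {}

# Each made-up line stands in for one row of the real dataset
for line in ["Arab World,x\n", "Euro area,y\n", "Arab World,z\n"]:
    first_col = line.split(',')[0]
    # get() returns the current count, or 0 if the key is new
    counts_dict[first_col] = counts_dict.get(first_col, 0) + 1

print(counts_dict)  # {'Arab World': 2, 'Euro area': 1}
```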

Summary of the day:

  • Generators for streaming data.
  • Context manager to open a file connection.
  • Generator function to load the data line by line.
  • Dictionary to store the counts.

World Bank World Development Indicator Case Study with Python

In DataCamp’s tutorial, Python Data Science Toolbox (Part 2), user-defined functions, iterators, list comprehensions and generators are combined to wrangle and extract meaningful information from a real-world case study.

It uses a World Bank dataset, and the exercises apply everything I have learned recently to this dataset.

Dictionaries for Data Science

The first exercise uses the zip() function to combine two lists into a zip object and convert it into a dictionary.

Before I share the sample code, let me share again what the zip() function does.

Using zip()
It allows us to stitch together an arbitrary number of iterables, zipping them together to create a zip object, which is an iterator of tuples.
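A tiny standalone illustration of that behaviour (the two lists are invented for the example):

```python
names = ['CountryName', 'Year']
vals = ['Arab World', '1960']

zipped = zip(names, vals)   # a zip object: an iterator of tuples
print(type(zipped))         # <class 'zip'>

pairs = list(zipped)
print(pairs)                # [('CountryName', 'Arab World'), ('Year', '1960')]
```

Note that a zip object is an iterator, so it is exhausted after one pass; calling list() on it a second time would give an empty list.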

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)

# Output: {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}

Next, the tutorial asks us to create a user-defined function with two parameters. I can re-use the above code, wrap it in a function, and call the function with the two arguments feature_names and row_vals.

# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

Running this code should give the same result as before. Next, the tutorial asks me to use a list comprehension: turn a bunch of lists into a list of dictionaries, where the keys are the header names and the values are the row entries.

The syntax:
[output expression for iterator variable in iterable]

The question in the tutorial is,
Create a list comprehension that generates a dictionary using lists2dict() for each sublist in row_lists. The keys are from the feature_names list and the values are the row entries in row_lists. Use sublist as your iterator variable and assign the resulting list of dictionaries to list_of_dicts.

A code skeleton was already on the screen before I started. As above, sublist is the iterator variable, so it goes between the “for” and “in” keywords. The instruction says,
for each sublist in row_lists

indirectly, it means,

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [--- for sublist in row_lists]

The lists2dict() function which I created above returns a dictionary. The question says,
generates a dictionary using lists2dict()

indirectly, it means calling the lists2dict() function in the output expression. But if I code,

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, row_lists) for sublist in row_lists]

The output was wrong, and when I clicked the “Submit” button, it showed this error message:
Check your call of lists2dict(). Did you correctly specify the second argument? Expected sublist, but got row_lists.

It expected sublist, and indeed the for loop reads each list inside row_lists one at a time. I can print the first list to check:
print(row_lists[0])

The second argument must be sublist rather than row_lists: sublist holds a single row on each iteration, whereas row_lists is the entire list of lists, so passing it would zip the header names against whole rows instead of the row values. Therefore, the final code is:

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
#Output:
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}
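To convince myself the pattern works outside the DataCamp environment, here is a self-contained rerun with two short made-up sublists (only two columns, invented for illustration):

```python
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides the keys
    and list2 provides the values."""
    return dict(zip(list1, list2))

feature_names = ['CountryName', 'Year']
row_lists = [['Arab World', '1960'],
             ['Arab World', '1961']]

# One dictionary per sublist, keyed by the header names
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

print(list_of_dicts[0])  # {'CountryName': 'Arab World', 'Year': '1960'}
print(list_of_dicts[1])  # {'CountryName': 'Arab World', 'Year': '1961'}
```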

The above exercise really took me some time to work out what to code and why my first attempt was wrong, but that did not stop me from continuing the tutorial.

Turning it into a DataFrame
Up to here in the case study, I have used the zip() function, wrapped it in a user-defined function, and used that new function in a list comprehension to generate a list of dictionaries.

Next, the tutorial converts the list of dictionaries into a pandas DataFrame. First and foremost, I need to import the pandas package. Refer to the code below:

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())
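pd.DataFrame() accepts a list of dictionaries directly, using the dictionary keys as column names. A self-contained sketch with made-up rows (assuming pandas is installed):

```python
import pandas as pd

# Invented sample rows, shaped like the World Bank records above
list_of_dicts = [
    {'CountryName': 'Arab World', 'Year': '1960', 'Value': '133.56'},
    {'CountryName': 'Arab World', 'Year': '1961', 'Value': '87.79'},
]

df = pd.DataFrame(list_of_dicts)
print(df.head())  # first five rows (here, both rows)
print(df.shape)   # (2, 3): two rows, three columns
```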

Summary of the day:

  • zip() function combines two lists into a zip object.
  • Use a user-defined function inside a list comprehension.
  • Convert a list of dictionaries into a DataFrame.