Day 42: Importing Data in Python

Introduction

Importing data from various sources, such as:
– flat files, e.g. txt, csv
– files from software, e.g. Excel, SAS, MATLAB files
– relational databases, e.g. SQL Server, MySQL, as well as NoSQL stores

Reading a text file
Use the open() function to open a connection to a file, passing two parameters:
– filename.
– mode, e.g. 'r' for read, 'w' for write.

Then, call read() on the file object and assign the result to a variable. Lastly, close() the file after the process is done to close the connection to the file.

If you wish to print out the text in the file, use a print() statement.

Syntax:
file = open(filename, mode='r')
text = file.read()
file.close()

print(text)

The filename can be assigned to a variable instead of being written directly as the open() function's parameter. It is always best practice to close() an opened connection.

To avoid forgetting to close a file connection (missing file.close() at the end of our code), it is advisable to use a context manager, which uses the with statement.

Some of the tutorials in my previous posts used a context manager to open a file connection and read .csv files. The with statement executes the open() command and creates a context in which commands can be executed while the file is open. Once out of this clause, the file is no longer open, and for this reason the with statement is called a context manager.

What happens here is called 'binding' a variable in the context manager construct: within this construct, the variable file is bound to open(filename, 'r').
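As a minimal, self-contained sketch of this pattern (the file name example.txt is just a placeholder that the snippet creates itself):

```python
# Create a placeholder file so the sketch is self-contained.
with open('example.txt', mode='w') as out:
    out.write('Hello, context manager!')

# Within the with block, the variable file is bound to the open file object.
with open('example.txt', mode='r') as file:
    text = file.read()

# Once the block is exited, the file is closed automatically.
print(text)          # Hello, context manager!
print(file.closed)   # True — no explicit file.close() needed
```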

Let me share the tutorials I did on DataCamp's website.

# Open a file: file
file = open('moby_dick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

First, it opens the file using the open() function, passing the filename and read mode. It then reads the file. The first print() statement prints out the contents of the moby_dick text file. The second print() statement returns a Boolean indicating whether the file is closed; at this point, it returns False. Lastly, it closes the file using the close() method and checks the Boolean again. This time, it returns True.

Importing text files line by line
For a larger file, we may not want to print out everything inside the file; we may want to print only a few lines of the content. To do this, use the readline() method.

When the file is opened, file.readline() returns one line at a time. The code below uses the with statement and file.readline() to print the first three lines of the file.

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Summary of the day:

  • Importing data from different sources.
  • Reading from a text file using open() and read(),
  • Importing text line by line using readline() method.
  • with statement as context manager.
  • close() to close a file.
  • file.closed is a Boolean attribute that indicates whether the file is closed.

Using Python for Streaming Data with Iterators

Using pandas read_csv iterator for streaming data

Next, the tutorial uses the pandas read_csv() function with the chunksize argument to read the same dataset, the World Bank World Development Indicators, chunk by chunk.

When data is too large to store in memory, one way to handle it is to read the file in as DataFrames of a certain length (using chunksize).

First and foremost, import the pandas package, then use the read_csv() function, which creates an iterable reader object. You can then call next() on it to retrieve and print a chunk. Refer below for the sample code.

# Import the pandas package
import pandas as pd

# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv', chunksize=10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))

Below is the output of the above code, which processes the data in chunks of 10 rows.

The next 10 records will be from index 10 to 19.
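Since the World Bank CSV itself is not reproduced here, the chunking behaviour can be sketched with a small synthetic table (the column names and values below are made up). Note how the row index continues across chunks, so the second chunk covers index 10 to 19:

```python
import io
import pandas as pd

# Synthetic stand-in for the CSV file: 25 rows with two columns.
csv_data = 'Year,Value\n' + '\n'.join(f'{1960 + i},{i * 10}' for i in range(25))

# chunksize=10 creates an iterable reader that yields 10-row DataFrames.
reader = pd.read_csv(io.StringIO(csv_data), chunksize=10)

chunk1 = next(reader)   # rows with index 0-9
chunk2 = next(reader)   # rows with index 10-19
print(chunk1.index.min(), chunk1.index.max())  # 0 9
print(chunk2.index.min(), chunk2.index.max())  # 10 19
```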

Next, the tutorial requires creating another DataFrame composed of only the rows from a specific country. Then, zip together two of the columns from the new DataFrame, 'Total Population' and 'Urban population (% of total)'. Finally, create a list of tuples from the zip object, where each tuple is composed of a value from each of the two columns mentioned.

Sounds a bit complicated… Let us see the sample code below:

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the DataFrame
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

My output looks as below:

Now, the exercise requires plotting a scatter plot. The source code below contains the previous exercise's code from DataCamp itself, so a slightly different selection method is used in this line (both select the same rows; .loc is simply more explicit):

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc[df_urb_pop['CountryCode'] == 'CEB']

#From DataCamp
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

It requires using a list comprehension to create a new DataFrame column, 'Total Urban Population', whose values are the product of the first and second element in each tuple.

Furthermore, because the second element is a percentage, the result needs to be divided by 100, or alternatively, multiplied by 0.01.
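With made-up sample numbers, the arithmetic inside the comprehension looks like this:

```python
# Hypothetical values: total population and urban share in percent.
total_population = 1_000_000.0
urban_pct = 25.0  # 'Urban population (% of total)'

# Multiplying by 0.01 converts the percentage to a fraction;
# int() truncates the result to a whole number of people.
total_urban = int(total_population * urban_pct * 0.01)
print(total_urban)  # 250000
```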

Then, using the Matplotlib package, plot a scatter plot of the new column 'Total Urban Population' against 'Year'. It is quite a lot of stuff combined together up to this point. See the code and the resulting plot below.

# Code from previous exercise
import matplotlib.pyplot as plt  # needed for plt.show() below

urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
df_urb_pop = next(urb_pop_reader)
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
pops = zip(df_pop_ceb['Total Population'], 
           df_pop_ceb['Urban population (% of total)'])
pops_list = list(pops)

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(entry[0] * entry[1] * 0.01) for entry in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

I realized that the 'Year' axis is not shown in integer format. I printed out the values of the 'Year' column and they looked perfectly fine. Do you know what is wrong with my code above?

This time, you will aggregate the results over all the DataFrame chunks in the dataset. This basically means you will be processing the entire dataset now. This is neat because you’re going to be able to process the entire large dataset by just working on smaller pieces of it!

The sample code below consists of some of DataCamp's code, so some variable names have been changed to match theirs. Here, it requires appending each DataFrame chunk to a variable called 'data' and plotting the scatter plot.

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
                df_pop_ceb['Urban population (% of total)'])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
    # Append DataFrame chunk to data: data
    # (note: DataFrame.append was removed in pandas 2.0; the modern
    # equivalent is data = pd.concat([data, df_pop_ceb]))
    data = data.append(df_pop_ceb)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

I tried to compare the lines of code used by DataCamp with mine.

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(entry[0] * entry[1] * 0.01) for entry in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

# Append DataFrame chunk to data: data
data = data.append(df_pop_ceb)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

Both versions of df_pop_ceb look the same, so why do the scatter plots display differently? I still cannot figure out why my scatter plot shows 'Year' with decimal points.
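One possible explanation (my guess, not something the tutorial confirms) is that matplotlib chooses tick positions for a numeric axis independently of the column's values, so the tick labels can be non-integer even when every 'Year' value is an integer. Forcing integer ticks with MaxNLocator is one way to check this:

```python
from matplotlib.ticker import MaxNLocator

# MaxNLocator(integer=True) restricts tick positions to whole numbers.
# Applied to the scatter plot above, it would look something like:
#   ax = data.plot(kind='scatter', x='Year', y='Total Urban Population')
#   ax.xaxis.set_major_locator(MaxNLocator(integer=True))
locator = MaxNLocator(integer=True)
ticks = locator.tick_values(1960.0, 1972.0)
print(all(float(t).is_integer() for t in ticks))  # True
```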

Lastly, wrap up the tutorial by creating a user-defined function that takes two parameters, the filename and the country code, to do all of the above. The source code and another scatter plot are shown below:

# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
        # Append DataFrame chunk to data: data
        data = data.append(df_pop_ceb)

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')

The scatter plot of the country code ‘ARB’ is shown below.

Summary of the day:

  • Use of user defined functions with parameters.
  • Iterators and list comprehensions.
  • Pandas’ DataFrame.
  • Matplotlib scatter plot.

Using Python for Streaming Data with Generators

Continuing with the same dataset, the World Bank World Development Indicators. Previously, I wrote about using iterators to load data chunk by chunk; this time, it is about using generators to load data from a file line by line.

Generators work well for streaming data, where data is written line by line over time. A generator can read and process the data until it reaches the end of the file and there are no more lines to process. Sometimes, data sources can be so large that storing the entire dataset in memory becomes too resource-intensive.

In this exercise from DataCamp's tutorials, I process the first 1000 rows of a file line by line, to create a dictionary of the counts of how many times each country appears in a column of the dataset. Below are some details about the dataset and how to import it.

To begin, I need to open a connection to a file using what is known as a context manager. For example, the command,

with open('datacamp.csv') as datacamp:

binds the csv file 'datacamp.csv' to the variable datacamp in the context manager.

Here, the with statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file.

The sample code below uses the readline() method to read a line from the file object. The line can then be split into a list using the split() method.

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

The output looks as below:

{'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 18}

Use generator to load data
Generators allow users to lazily evaluate data. This concept of lazy evaluation is useful when you have to deal with very large datasets because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.

The tutorial requires defining a generator function read_large_file() that produces a generator object which yields a single line from the file each time next() is called on it.

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

Next, use the generator function created above to read the file line by line, create a dictionary counting how many times each country appears in a column of the dataset, process all the rows in the file, and print out the result. See the sample code below:

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print            
print(counts_dict)

And, the output looks as below:

{'CountryName': 1, 'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 99, 'High income': 131, 'High income: nonOECD': 68, 'High income: OECD': 127, 'Latin America & Caribbean (all income levels)': 130, 'Latin America & Caribbean (developing only)': 133, 'Least developed countries: UN classification': 78, 'Low & middle income': 138, 'Low income': 80, 'Lower middle income': 126, 'Middle East & North Africa (all income levels)': 89, 'Middle East & North Africa (developing only)': 94, 'Middle income': 138, 'North America': 123, 'OECD members': 130, 'Other small states': 63, 'Pacific island small states': 66, 'Small states': 69, 'South Asia': 36}

Obviously, it shows more data than before, because it is not limited to the first 1000 rows. Note the extra 'CountryName': 1 entry: unlike the earlier code, this run does not skip the header line, so the column name itself is counted once.
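As an aside, the increment-or-initialise bookkeeping in the loop can also be done with collections.Counter from the standard library. A small sketch with inline stand-in data (not the real CSV):

```python
from collections import Counter

# Inline stand-in for the CSV lines; the real file is not reproduced here.
lines = [
    'Arab World,ARB,1960',
    'Arab World,ARB,1961',
    'Euro area,EMU,1960',
]

# Counter replaces the manual "if key in dict ... else ..." pattern.
counts = Counter(line.split(',')[0] for line in lines)
print(counts)  # Counter({'Arab World': 2, 'Euro area': 1})
```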

Summary of the day:

  • Generators for streaming data.
  • Context manager, open a file connection.
  • Use generator function to load data.
  • Use a dictionary to store result.

World Bank World Development Indicator Case Study with Python

In DataCamp's tutorial, Python Data Science Toolbox (Part 2), it combines user-defined functions, iterators, list comprehensions and generators to wrangle and extract meaningful information from a real-world case study.

It uses a World Bank dataset. The tutorial applies everything I have learned recently to work with this dataset.

Dictionaries for Data Science

Use the zip() function to combine two lists into a zip object, then convert it into a dictionary.

Before I share the sample code, let me share again what the zip() function does.

Using zip()
It allows us to stitch together any arbitrary number of iterables. In other words, it zips them together to create a zip object, which is an iterator of tuples.

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)

# Output: {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
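The snippet above assumes feature_names and row_vals already exist in the session. A self-contained version, with the two lists shortened and reconstructed from the output shown, might look like:

```python
# Reconstructed (and shortened) from the output above.
feature_names = ['CountryName', 'CountryCode', 'Year']
row_vals = ['Arab World', 'ARB', '1960']

# zip() pairs the header names with the row values...
zipped_lists = zip(feature_names, row_vals)

# ...and dict() turns the pairs into key-value entries.
rs_dict = dict(zipped_lists)
print(rs_dict)  # {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'Year': '1960'}
```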

Next, the tutorial wants us to create a user-defined function with two parameters. I can re-use the above code, wrap it in a user-defined function, and call it passing two arguments, feature_names and row_vals.

# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

It should give the same result when the code is run. Next, the tutorial requires me to use a list comprehension: it requires turning a bunch of lists into a list of dictionaries with the help of a list comprehension, where the keys are the header names and the values are the row entries.

The syntax,
[output expression for iterator variable in iterable]

The question in the tutorial is,
Create a list comprehension that generates a dictionary using lists2dict() for each sublist in row_lists. The keys are from the feature_names list and the values are the row entries in row_lists. Use sublist as your iterator variable and assign the resulting list of dictionaries to list_of_dicts.

This was the code on the screen before I filled it in. As above, sublist is the iterator variable, so it goes between the for and in keywords. The instruction says,
for each sublist in row_lists

indirectly, it means,

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [--- for sublist in row_lists]

The lists2dict() function which I created above returns a dictionary. The question says,
generates a dictionary using lists2dict()

indirectly, it means calling the lists2dict() function in the output expression. But if I code,

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, row_lists) for sublist in row_lists]

The output was very wrong, and when I clicked the "Submit" button, it prompted an error message,
Check your call of lists2dict(). Did you correctly specify the second argument? Expected sublist, but got row_lists.

It expected sublist, and yes, the for loop reads each list in row_lists. I have a line of code to print each list,
print(row_lists[0])

It is more meaningful to use sublist as the second argument rather than row_lists. Therefore, the final code is,

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
#Output:
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}

The above code really took me some time to work out what I should write and why my first attempt was not right. That did not stop me from continuing the tutorial.

Turning it into a DataFrame
Up to this point in the case study, I used the zip() function, put it into a user-defined function, and used the newly created function in a list comprehension to generate a list of dictionaries.

Next, the tutorial wants to convert the list of dictionaries into a pandas DataFrame. First and foremost, I need to import the pandas package. Let us refer to the code below:

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

Summary of the day:

  • The zip() function combines lists into a zip object.
  • Use a user-defined function in a list comprehension.
  • Convert a list of dictionaries into a DataFrame.

Day 41: Write Generator Function

What is a generator function?

– It produces a generator object when it is called.
– It is defined like a regular function, using the def keyword.
– It returns values using the yield keyword, yielding a sequence of values instead of a single value.

How to build a generator?

A generator function is defined as you would a regular function, but whenever it generates a value, it uses the keyword yield instead of return. Let us look at the exercise in DataCamp's tutorial, which walks me through how to write a generator function in Python.

The instructions as below:
1. Complete the function header for the function get_lengths() that has a single parameter, input_list.
2. In the for loop in the function definition, yield the length of the strings in input_list.
3. Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. Supply the call to get_lengths(), passing in the list lannister.

The code as below:

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

#Output:
#6
#5
#5
#6
#7

Summary of the day:

  • Generator function.
  • Keyword: yield.

Day 41: Introduction to Generator in Python

Generators have some connections with list comprehensions. Remember, a list comprehension uses square brackets [].

Generator expression
Instead of using square brackets, it uses parentheses. When I execute the code, it creates a generator object.

The syntax:

( output expression for iterator variable in iterable )

What is this generator?
It is the same as a list comprehension, except that it does not store the list in memory.

List comprehensions vs generators vs dict comprehensions

So now, there are three things to remember, and their differences:
– A list comprehension returns a list.
– A dict comprehension returns a dictionary.
– A generator expression returns a generator object.

Generators are good to use when we want to generate a large volume of data, for example, a range of 10 * 10000000 values. A list comprehension may cause the machine to run out of memory, but a generator can handle it because it does not generate the entire list up front.
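One quick way to see this difference is to compare object sizes with sys.getsizeof(): the list comprehension materialises every element up front, while the generator expression stores only its iteration state.

```python
import sys

# A list comprehension builds all 100,000 elements immediately...
squares_list = [num ** 2 for num in range(100_000)]

# ...while a generator expression holds only the recipe for producing them.
squares_gen = (num ** 2 for num in range(100_000))

# The list occupies far more memory than the generator object.
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True
```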

List comprehensions and generator expressions look very similar in their syntax, except for the use of parentheses () in generator expressions and brackets [] in list comprehensions. Both can be iterated over.

Below is an example from DataCamp whereby it creates a generator object using parentheses () and combines it with the next() function, which I learned about with iterators, to iterate over and print the first 5 values of the range 0 to 30. The remaining values are then printed out using a for loop. Let us see the code below:

# Create generator object: result
result = (num for num in range(0,31))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)

Next, whatever we can apply in a list comprehension, such as conditionals, also applies to generator expressions.
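For instance, a conditional in a generator expression looks just like one in a list comprehension:

```python
# Keep only the even numbers from 0 to 9, evaluated lazily.
even_nums = (num for num in range(10) if num % 2 == 0)

# Materialise the generator to see its values.
evens = list(even_nums)
print(evens)  # [0, 2, 4, 6, 8]
```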

Changing the output in generator expressions
It works similarly to a list comprehension: we are able to modify the output expression of a generator expression. An example is below:

# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)

It generates the length of each string in the list. Then, a for loop is used to print out the values in the generator object.

Lastly, there is the generator function, which produces a generator object when it is called. I will cover generator functions in my next entry.

Summary of the day:

  • Generators in Python.
  • List comprehensions vs generators.
  • Conditionals in generator expressions.

Day 40: Create Dictionary Comprehension

Create Dict Comprehensions

There are many other objects you can build using comprehensions, such as dictionaries, which are pervasive objects in Data Science. The main difference between a list comprehension and a dict comprehension is the use of curly braces {} instead of []. Additionally, members of the dictionary are created using a colon (:), as in <key>: <value>, in the output expression.

The syntax:

{ key: value for iterator variable in iterable }

DataCamp’s example:

Another piece of sample code I extracted from DataCamp's tutorial outputs each fellowship name and its length as a key-value pair.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create dict comprehension: new_fellowship
new_fellowship = {key: len(key) for key in fellowship}

# Print the new list
print(new_fellowship)

#Output: {'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}

Summary of the day:

  • Dict comprehensions.
  • Dict comprehensions vs list comprehensions.