Using Python for Streaming Data with Generators

Continuing with the same dataset from the World Bank's World Development Indicators. Previously, I wrote about using iterators to load data chunk by chunk, and about using generators to load data from a file line by line.

Generators work well for streaming data, where data is written line by line over time. A generator can read and process the data until it reaches the end of the file or there are no more lines to process. Sometimes, data sources are so large that storing the entire dataset in memory becomes too resource-intensive.

In this exercise from DataCamp’s tutorials, I process the first 1000 rows of a file line by line to create a dictionary of counts of how many times each country appears in a column of the dataset. Below are some details about the dataset and how to import it.

To begin, I need to open a connection to a file using what is known as a context manager. For example, the command,

with open('datacamp.csv') as datacamp:

binds the csv file 'datacamp.csv' to the variable datacamp inside the context manager.

Here, the with statement is the context manager, and its purpose is to ensure that resources are properly allocated when opening a connection to a file and released again when the block ends.
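As a quick sketch (with a hypothetical file name), this shows that the context manager closes the file automatically once the with block exits:

```python
# Open a file inside a context manager (hypothetical file name).
with open('sample.txt', 'w') as f:
    f.write('hello')

# Outside the block, the file object still exists,
# but the connection has been closed for us.
print(f.closed)  # True
```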

The sample code below uses the .readline() method to read a line from the file object. The line can then be split into a list using the .split() method.

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict:
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

The output looks as below:

{'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 18}

Use a generator to load data
Generators allow users to lazily evaluate data. This concept of lazy evaluation is useful when you have to deal with very large datasets because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.
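The idea of lazy evaluation can be seen in a minimal sketch: nothing is computed until a value is actually requested with next().

```python
# A generator expression: no values are computed when this line runs.
squares = (n * n for n in range(1_000_000))

# Values are produced one at a time, only when asked for.
first = next(squares)   # 0
second = next(squares)  # 1

print(first, second)  # 0 1
```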

The tutorial requires defining a generator function read_large_file() that produces a generator object which yields a single line from the file each time next() is called on it.

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

Next, use the generator function created above to read the file line by line, create a dictionary of counts of how many times each country appears in a column of the dataset, process all the rows in the file, and print out the result. See the sample code below:

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict:
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print            
print(counts_dict)

And, the output looks as below:

{'CountryName': 1, 'Arab World': 80, 'Caribbean small states': 77, 'Central Europe and the Baltics': 71, 'East Asia & Pacific (all income levels)': 122, 'East Asia & Pacific (developing only)': 123, 'Euro area': 119, 'Europe & Central Asia (all income levels)': 109, 'Europe & Central Asia (developing only)': 89, 'European Union': 116, 'Fragile and conflict affected situations': 76, 'Heavily indebted poor countries (HIPC)': 99, 'High income': 131, 'High income: nonOECD': 68, 'High income: OECD': 127, 'Latin America & Caribbean (all income levels)': 130, 'Latin America & Caribbean (developing only)': 133, 'Least developed countries: UN classification': 78, 'Low & middle income': 138, 'Low income': 80, 'Lower middle income': 126, 'Middle East & North Africa (all income levels)': 89, 'Middle East & North Africa (developing only)': 94, 'Middle income': 138, 'North America': 123, 'OECD members': 130, 'Other small states': 63, 'Pacific island small states': 66, 'Small states': 69, 'South Asia': 36}

It shows more data than before because it is no longer limited to the first 1000 rows. Note also the 'CountryName' entry with a count of 1: this time the header row was not skipped, so it was counted as well.
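As an aside, Python's standard library offers collections.Counter, which implements this if/else counting pattern directly. A minimal sketch over a few made-up sample rows (the actual CSV is not reproduced here):

```python
from collections import Counter

# Hypothetical rows standing in for lines of the CSV file.
sample_rows = [
    'Arab World,1960,92.5',
    'Arab World,1961,93.1',
    'Euro area,1960,88.0',
]

# Counter consumes the generator expression and tallies the first column.
counts = Counter(row.split(',')[0] for row in sample_rows)
print(counts)  # Counter({'Arab World': 2, 'Euro area': 1})
```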

Summary of the day:

  • Generators for streaming data.
  • Context manager, open a file connection.
  • Use generator function to load data.
  • Use a dictionary to store result.

Day 41: Write Generator Function

What is a generator function?

– It produces a generator object when it is called.
– It is defined like a regular function, using the def keyword.
– It returns values using the yield keyword, yielding a sequence of values instead of a single value.
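These points can be illustrated with a minimal sketch (the function name count_up_to is my own, not from the tutorial):

```python
# A minimal generator function: defined with def, produces values with yield.
def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

gen = count_up_to(3)   # calling it returns a generator object, not a value
print(type(gen))       # <class 'generator'>
print(next(gen))       # 1
print(list(gen))       # [2, 3] -- the remaining values in the sequence
```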

How to build a generator?

A generator function is defined the way you define a regular function, but whenever it generates a value, it uses the keyword yield instead of return. Let us look at the exercise in DataCamp’s tutorial, which walks me through how to write a generator function in Python.

The instructions are as below:
1. Complete the function header for the function get_lengths() that has a single parameter, input_list.
2. In the for loop in the function definition, yield the length of the strings in input_list.
3. Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. Supply the call to get_lengths(), passing in the list lannister.

The code is as below:

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

#Output:
#6
#5
#5
#6
#7

Summary of the day:

  • Generator function.
  • Keyword: yield.

Day 41: Introduction to Generator in Python

Generators have some connection with list comprehensions. Remember that a list comprehension uses square brackets [].

Generator expression
Instead of square brackets, it uses parentheses. When I execute the code, it creates a generator object.

The syntax:

( output expression for iterator variable in iterable )

What is a generator expression?
It is the same as a list comprehension, except it does not store the list in memory.

List comprehensions vs generators vs dict comprehensions

So now, there are three things to remember and how they differ:
– A list comprehension returns a list.
– A dict comprehension returns a dictionary.
– A generator expression returns a generator object.
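The three forms can be compared side by side in a small sketch:

```python
# The same transformation written three ways.
nums = [1, 2, 3]

squares_list = [n * n for n in nums]      # list comprehension -> list
squares_dict = {n: n * n for n in nums}   # dict comprehension -> dict
squares_gen = (n * n for n in nums)       # generator expression -> generator object

print(squares_list)       # [1, 4, 9]
print(squares_dict)       # {1: 1, 2: 4, 3: 9}
print(list(squares_gen))  # [1, 4, 9] -- materialized only on demand
```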

A generator is good to use when we want to generate a large volume of data, for example a range of 10*10000000. A list comprehension may cause the server to run out of memory; a generator can handle it because it never materializes the entire list at once.
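The memory difference can be seen directly with sys.getsizeof, a rough sketch using a smaller range:

```python
import sys

# A list stores all of its elements up front.
big_list = [n for n in range(100_000)]

# A generator stores only its current state, not the elements.
big_gen = (n for n in range(100_000))

print(sys.getsizeof(big_list))  # large: grows with the number of elements
print(sys.getsizeof(big_gen))   # small: constant, regardless of the range
```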

List comprehensions and generator expressions look very similar in their syntax, except for the use of parentheses () in generator expressions and brackets [] in list comprehensions. Both can be iterated over.

Below is an example from DataCamp that creates a generator object using parentheses () and combines it with the next() function, which I learned about with iterators, to iterate over the elements and print the first 5 values out of the range 0 to 30. The remaining values are then printed using a for loop. Let's see the code below:

# Create generator object: result
result = (num for num in range(0,31))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)

Next, what we can apply in a list comprehension, such as conditionals, also applies to generator expressions.
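A short sketch of a conditional inside a generator expression:

```python
# The "if" clause filters which values the generator yields,
# exactly as it would in a list comprehension.
evens = (num for num in range(0, 10) if num % 2 == 0)

even_values = list(evens)
print(even_values)  # [0, 2, 4, 6, 8]
```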

Changing the output in generator expressions
This works similarly to list comprehensions: we are able to modify the output expression of a generator expression. Example as below:

# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)

It generates the length of each string in the list, then uses a for loop to print out the values in the generator object.

Lastly, there is the generator function, which produces a generator object when it is called. I will cover generator functions in my next entry.

Summary of the day:

  • Generators in Python.
  • List comprehensions vs generators.
  • Conditionals in generator expressions.