Day 39: Using Iterator for Big Data


The above illustrate the real scenario of data science where often they need to load a big chunk of data and sometimes it is too huge to be handled by the memory. The usage of Pandas, read_csv() function and setting the chunksize, it helps to load data in a smaller chunk, process the data and store the result somewhere before discard the chunk to load the next set to be processed. This is where iterator becomes useful.

Examples as below:

Either we use a variable “total” to hold the sum’s result or we can create an empty dictionary to perform the same computation and it gives the same result. Below is the exercise I did in DataCamp’s online learning website using the Twitter’s data.

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)

Create an empty dictionary, iterate over the csv file with chunksize is 10. Read the ‘lang’ column in the chunk and iterate again to get the count of each ‘lang’ in the .csv file. The output on the screen when executed is:

{‘en’: 97, ‘et’: 1, ‘und’: 2}

Let us convert the above code into an user defined function and takes three parameters, the csv filename, the chunk size and column name in the csv file. The updated version of the code with user defined function looks as below:

# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')

# Print result_counts
print(result_counts)

It gives the same result as the previous code.

Summary of the day:

  • Iterator for Big Data.
  • Using Pandas’ read_csv().
  • Using dictionaries, for loop statement to iterate data.
  • Create an user defined function for the above and call to the function to print out result.
Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s