Day 34: Using Pandas with a Dictionary

Below is an exercise I did on DataCamp. It required using the pandas package to read/import a .csv file into a DataFrame, then, with a dictionary called ‘langs_count’ and a for loop iterating over the column named ‘lang’ in the DataFrame, performing the following actions:

Loop through each entry in the DataFrame: if the ‘lang’ is not found in the dictionary, add the ‘lang’ to the dictionary with a default value of 1; otherwise, increment its current value by 1.

At first, it sounded a bit confusing to me. Let us go through the code and make sense of the statements above. My code snippet is as below:

# Import pandas
import pandas as pd 

# Import Twitter data as DataFrame: df
df = pd.read_csv('tweets.csv')

# Initialize an empty dictionary: langs_count
langs_count = {}

# Extract column from DataFrame: col
col = df['lang']

# Iterate over lang column in DataFrame
for entry in col:

    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry] = langs_count[entry] + 1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1

# Print the populated dictionary
print(langs_count)

As mentioned earlier, it imports the pandas package and reads the .csv file using pandas’ read_csv() function. The next line creates an empty dictionary called ‘langs_count’.

Up to here, it is straightforward.

The next line extracts the column ‘lang’ from the DataFrame and stores it in a variable called ‘col’. You can print df and col respectively to check what each contains.
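For example, a quick peek using pandas’ head() (assuming the df and col from the snippet above):

# Preview the first five rows of the DataFrame and of the column
print(df.head())
print(col.head())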

Then, a for loop iterates over each entry found in col; in other words, it iterates over each value in the ‘lang’ column of the DataFrame. Inside the for loop, we have an if-else condition check.

In a dictionary, .keys() returns a view object of all the keys in the dictionary.
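A minimal sketch of what that view looks like, using a small hand-made dictionary:

# A tiny dictionary for illustration
langs_count = {'en': 2, 'et': 1}
print(langs_count.keys())           # dict_keys(['en', 'et'])
print('en' in langs_count.keys())   # True
# Note: 'en' in langs_count gives the same result and is slightly more idiomatic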

The condition checks whether the language is already inside the ‘langs_count’ dictionary: if so, it adds 1 (+1) to its current value (a dictionary stores each key and value as a pair); otherwise, it creates a new entry for the language in ‘langs_count’ and sets its value to 1 (=1).

We can access the value by specifying the key,
langs_count[entry]

where entry is the key and the expression returns its value. Adding 1 (+1) to that value gives the new count. It works the same way as the price_lookup['apple'] example.
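To make this concrete, here is a small sketch with a hypothetical price_lookup dictionary (the items and prices are made up for illustration):

# A made-up price lookup table
price_lookup = {'apple': 2.99, 'orange': 1.99}

# Access the value by its key
print(price_lookup['apple'])   # 2.99

# Add 1 to the current value, exactly like langs_count[entry] + 1
price_lookup['apple'] = price_lookup['apple'] + 1
print(price_lookup['apple'])   # 3.99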

Otherwise, we need to create a new item inside the dictionary by defining its key and value with this code,
langs_count[entry] = 1

This is how we add a new item; it is the same method used in the example below,
d['k3'] = 300
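In context, that could look like this (a small sketch, assuming d starts with two items):

# Start with two existing items
d = {'k1': 100, 'k2': 200}

# Assigning to a new key adds a new item
d['k3'] = 300
print(d)   # {'k1': 100, 'k2': 200, 'k3': 300}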

Lastly, when print(langs_count) is called, it prints out the dictionary as below:

print(langs_count)
{'en': 97, 'und': 2, 'et': 1}


Based on the output, the language ‘en’ was found 97 times, ‘et’ appears once, and ‘und’ (which I think means undefined) appears twice.

The code above can be converted into a user-defined function that accepts two parameters and returns a value. We then call the function, store the return value in a variable, and print it out. The code is as below:

# Define count_entries()
def count_entries(df, col_name):
    """Return a dictionary with counts of 
    occurrences as value for each key."""

    # Initialize an empty dictionary: langs_count
    langs_count = {}
    
    # Extract column from DataFrame: col
    col = df[col_name]
    
    # Iterate over lang column in DataFrame
    for entry in col:

        # If the language is in langs_count, add 1
        if entry in langs_count.keys():
            langs_count[entry] = langs_count[entry] + 1
        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry] = 1

    # Return the langs_count dictionary
    return langs_count

# Call count_entries() on the DataFrame loaded earlier: result
result = count_entries(df, 'lang')

# Print the result
print(result)

The result is the same as with the previous code. The difference is that now we can call the function and substitute the required arguments for its two parameters.
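As a side note, the same counting can be written more compactly. The sketch below is only an alternative I am noting here, not what the DataCamp exercise asks for; it uses Python’s collections.Counter, and pandas’ own value_counts() gives the same dictionary:

# Alternative: count occurrences with collections.Counter
from collections import Counter

def count_entries_alt(df, col_name):
    """Return a dictionary with counts of occurrences for each value."""
    return dict(Counter(df[col_name]))

# pandas can also do this directly:
# df[col_name].value_counts().to_dict()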

This is a basic data science function. There is more to learn in the upcoming DataCamp topics, including error handling in functions.
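As a small preview, error handling in this function could look something like the sketch below. This is my own rough sketch (the name count_entries_safe and the column check are my assumptions, not the DataCamp solution):

# Sketch: the same counting with a basic column-existence check
def count_entries_safe(df, col_name='lang'):
    """Return a dictionary of counts, raising a clear error
    if the requested column does not exist."""
    if col_name not in df.columns:
        raise ValueError('The DataFrame does not have a "' + col_name + '" column.')
    counts = {}
    for entry in df[col_name]:
        counts[entry] = counts.get(entry, 0) + 1
    return counts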

Summary of the day:

  • Using the pandas package and read_csv() to read a .csv file.
  • Using a dictionary to store counts.
  • Using a for loop and if-else conditions to count the occurrences.


What is Big Data?

Recently, I have been asking myself a lot of “what is” questions as part of my journey to learn new knowledge beyond what I know now. This entry is a basic overview of Big Data.

Introduction
The term is used to describe large volumes of both structured and unstructured data, so large that they are difficult to analyze and process using traditional databases and data processing methods. In short, the volume is too big and exceeds the processing capacity.

Big Data is commonly described by the 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

Based on a write-up I found at searchdatamanagement.techtarget.com, Big Data also encompasses a wide variety of data types, including structured data in SQL databases and data warehouses, unstructured data, such as text and document files held in Hadoop clusters or NoSQL systems, and semi-structured data, such as web server logs or streaming data from sensors. Furthermore, big data includes multiple, simultaneous data sources, which may not otherwise be integrated.

It further shares,
Velocity refers to the speed at which big data is generated and must be processed and analyzed. In many cases, sets of big data are updated on a real- or near-real-time basis, compared with daily, weekly or monthly updates in many traditional data warehouses. Big data analytics projects ingest, correlate and analyze the incoming data and then render an answer or result based on an overarching query. This means data scientists and other data analysts must have a detailed understanding of the available data and possess some sense of what answers they are looking for to make sure the information they get is valid and up to date. Velocity is also important as big data analysis expands into fields like machine learning (ML) and artificial intelligence (AI), where analytical processes automatically find patterns in the collected data and use them to generate insights.

Over time, other V’s have become relevant to descriptions of Big Data. Let us look further:

  • Veracity. The degree to which Big Data can be trusted.
  • Value. The business value of the data collected.
  • Variability. The way in which the Big Data can be used and formatted.

The same website gives more explanation of the new V’s.

Data veracity refers to the degree of certainty in data sets. Uncertain raw data collected from multiple sources, such as social media platforms and webpages, can cause serious data quality issues that may be difficult to pinpoint.

Bad data leads to inaccurate analysis and may undermine the value of business analytics because it can cause executives to mistrust data as a whole. The amount of uncertain data in an organization must be accounted for before it is used in big data analytics applications. IT and analytics teams also need to ensure that they have enough accurate data available to produce valid results.

Not all data collected has real business value and the use of inaccurate data can weaken insights provided by analytics applications. It is critical that organizations employ practices such as data cleansing and confirm that data relates to relevant business issues before they use it in a big data analytics project.

Variability also often applies to sets of big data, which are less consistent than conventional transaction data and may have multiple meanings or be formatted in different ways from one data source to another — things that further complicate efforts to process and analyze the data.

To make sense of all of this messy data, Big Data projects often use cutting-edge analytics involving artificial intelligence (AI) and machine learning (ML). By teaching computers to identify what this data represents (through image recognition or natural language processing, for example), they can learn to spot patterns much more quickly and reliably than humans.

Source:
https://searchdatamanagement.techtarget.com/definition/big-data