Dates in MongoDB

Last Friday, a colleague asked me to help with an issue he faced with a MongoDB query. He found that when our BI tool inserted the JSON data received via the API into MongoDB, the date and datetime values were saved in string format.

The JSON specification does not define a format for exchanging dates, which is why there are so many different ways to do it. The best choice is the ISO 8601 date format: it is well known, widely used and handled across many different languages, making it very well suited for interoperability.

It looks like this:

2012-04-23T18:25:43.511Z

When I looked at the JSON data, I found that most of the date and time values were in a plain string format (e.g. 2012-04-23 18:25:43.511), so MongoDB inserted them into the database as the string type.

To understand dates in MongoDB, I found an article that explains them pretty well; here is the link. Instantly, it hit me: dealing with dates in MongoDB, or any database, is not an easy job at all. On top of that, this time I am dealing with MongoDB documents that have subdocuments and arrays inside subdocuments.

I can use $dateFromString, which converts a date/time string to a date object. For more detail about this operator, you can find it at this link.
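
To give a concrete idea, below is a minimal sketch of how $dateFromString could be applied in an aggregation pipeline via PyMongo. The connection string, collection name (reports) and field name (created_at) are made-up placeholders for illustration, not the actual schema, and the format option requires MongoDB 4.0 or later.

# Minimal sketch: convert a string field into a BSON date with $dateFromString.
# The collection and field names here are hypothetical placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["reports"]

pipeline = [
    {"$addFields": {
        "created_at": {
            "$dateFromString": {
                "dateString": "$created_at",
                # matches strings like 2012-04-23 18:25:43.511
                "format": "%Y-%m-%d %H:%M:%S.%L"
            }
        }
    }}
]

for doc in collection.aggregate(pipeline):
    print(doc["created_at"])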

This aggregation operator worked like magic for me when dealing with dates and times stored as strings, until I reached a point where the date is in an array inside a subdocument. I hit a roadblock and was not able to use the same method to do the conversion. It seems I have to use $unwind in my query. It deconstructs an array field from the input documents and outputs a document for each element. Each output document is the input document with the value of the array field replaced by the element.
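
For reference, here is a rough, untested sketch of what an $unwind-based pipeline might look like for this case. It reuses the collection handle from the sketch above, and the field names (events, events.timestamp) are placeholders rather than the real schema.

# Rough, untested sketch: flatten the array with $unwind, convert the string
# date on each element, then rebuild the array with $group. Placeholder names.
pipeline = [
    {"$unwind": "$events"},  # one output document per array element
    {"$addFields": {
        "events.timestamp": {
            "$dateFromString": {
                "dateString": "$events.timestamp",
                "format": "%Y-%m-%d %H:%M:%S.%L"
            }
        }
    }},
    {"$group": {  # optional: regroup the converted elements per document
        "_id": "$_id",
        "events": {"$push": "$events"}
    }}
]

for doc in collection.aggregate(pipeline):
    print(doc)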

I need a solution where I can use $unwind to flatten the array from the collection and convert each date and time into a date object. I am still looking for one, so if you have solved something similar, please give me a helping hand and link me to your solution. Thank you!

Maybe useful: https://stackoverflow.com/questions/38299186/query-that-combines-project-unwind-group

Intermediate Python for Data Science: Looping Data Structure

After matplotlib for visualization, the introduction to dictionaries and the Pandas DataFrame, followed by logical, Boolean and comparison operators with if-elif-else control flow, now comes the last part: the while loop, the for loop, and looping over different data structures.

In Python, some objects are iterable, which means you can loop through them, for example looping through a list to get each element, or looping through a string to capture each character. A for loop iterates over a collection of things, while a while loop repeats a block of code as long as some condition remains True.

For Loop

The main keywords are for and in, used together with a colon (:) and indentation (whitespace). Below is the syntax:

#loop statement
my_iterable = [1,2,3]
for item_name in my_iterable:
    print(item_name)

I used two iterator variables (index, area) with enumerate(), as in the sample code below. enumerate() loops over something while keeping an automatic counter, and returns an enumerate object.

# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate() and update print()
for index, area in enumerate(areas) :
    print("room " + str(index) + ": " + str(area))

#Output:
"""
room 0: 11.25
room 1: 18.0
room 2: 20.0
room 3: 10.75
room 4: 9.5
"""

Another example uses a loop that goes through each sublist of house and prints out "the x is y sqm", where x is the name of the room and y is the area of the room.

# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
         
# Build a for loop from scratch
for x in house:
    print("the " + str(x[0]) + " is " + str(x[1]) + " sqm")

# Output:
"""
the hallway is 11.25 sqm
the kitchen is 18.0 sqm
the living room is 20.0 sqm
the bedroom is 10.75 sqm
the bathroom is 9.5 sqm
"""

The definition of enumerate() can be found here. My post on the for loop is here.

While Loop

The main keyword is while, used together with a colon (:) and indentation (whitespace). Below is the syntax:

# while loop statement
while some_boolean_condition:
     # do something 

# Examples
x = 0
while x < 5:
     print(f'The number is {x}')
     x += 1  

Here is an example of putting an if-else statement inside a while loop:

# Initialize offset
offset = -6

# Code the while loop
while offset != 0 :
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)

# Output:
"""
correcting...
-5
correcting...
-4
correcting...
-3
correcting...
-2
correcting...
-1
correcting...
0
"""

My post on while loop is here.

Loop Data Structure

Dictionary:
If you want to iterate over key-value pairs in a dictionary, use the items() method on the dictionary to define the sequence in the loop.

for key, value in my_dic.items() : 

Numpy Array:
If you want to iterate all elements in a Numpy array, use the nditer() function to specify the sequence.

for val in np.nditer(my_array) : 

Some examples are below:

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe
for key, value in europe.items():
    print("the capital of " + key + " is " + value)

# Output:
"""
the capital of austria is vienna
the capital of norway is oslo
the capital of italy is rome
the capital of spain is madrid
the capital of germany is berlin
the capital of poland is warsaw
the capital of france is paris
"""

# Import numpy as np
import numpy as np

# np_height and np_baseball come from the DataCamp exercise; small sample
# arrays are defined here so the snippet runs on its own
np_height = np.array([74, 74, 72, 75])
np_baseball = np.array([[74, 180], [74, 215], [72, 210], [75, 205]])

# For loop over np_height
for x in np_height:
    print(str(x) + " inches")

# For loop over np_baseball: nditer() visits every element of the 2D array
for x in np.nditer(np_baseball):
    print(x)

An explanation and example of looping over a DataFrame can be found in my post here.

Data Architecture

Recently, I changed my job, and my new workplace enforces good data architecture practices. I tried to understand data architecture better on my own, and I found a good link that explains it.

Data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated and put to use in data systems and the organization. Data architecture describes how data is processed, stored and utilized in an information system.

Data architecture provides criteria for data processing operations, makes it possible to design data flows and also controls the flow of data in the system. Data architecture should be defined in the planning phase of the design of a new data processing and storage system.

Data Modeling and Design is defined as "the process of discovering, analyzing, representing and communicating data requirements in a precise form called the data model." Data models illustrate an organization's data assets and enable it to understand them through core building blocks such as entities, relationships, and attributes. These represent core business concepts such as customer, product, and employee.

Data architecture and data modeling should align with the organization's core business processes and activities, and need to be integrated into the entire architecture. Without knowing what the existing data import and export processes are, it is difficult to know whether a new platform will be a good fit. A model entails developing simple business rules about what the business has: customers, products, parts, and so on.
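
As a loose illustration of those building blocks (not taken from the linked article), here is a tiny, hypothetical sketch of entities, attributes and a one-to-many relationship written as Python dataclasses; in practice a data model would usually live in an ER diagram or schema definition rather than in application code.

# Hypothetical sketch of entities, attributes and a relationship.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:              # entity
    product_id: int         # attribute
    name: str

@dataclass
class Order:                # entity linking a customer to a product
    order_id: int
    product_id: int

@dataclass
class Customer:             # entity
    customer_id: int
    name: str
    orders: List[Order] = field(default_factory=list)  # one-to-many relationship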

Link: https://www.dataversity.net/data-modeling-vs-data-architecture/

Cafe Salivation

One of the TechLadies team meetups was held at this vegetarian cafe in Little India, Singapore. It features Western-style food such as pasta, lasagna and pizza. An interesting item on the drink menu caught my attention, and one of my teammates tried it too: the sugarless sweet potato latte. There is no trace of coffee, but a very strong sweet potato smell, even after it is mixed with the cinnamon powder.

The presentation needs a lot of improvement, as the plate and cup holder look quite redundant, and the table is rather small for so many things once the food is served. We shared the potato skins, which have a strong spice flavour.

The pizza looks quite good and the portion is big. It does not normally come with a thin crust, but we requested a thin crust instead. It can be shared among a few people.

Another great dish for sharing is the spinach lasagna, which I could not manage to finish by myself. I did not expect such a big portion, and we did not think of sharing it initially. Generally, it tastes good, though a little too cheesy for me, and it is really filling.

Address: Cafe Salivation, 176 Race Course Rd, Singapore 218607.

Cardinality in Databases

Recently, I read an article on cardinality in databases. A Google search for cardinality in general terms returns the definition "the number of elements in a set or other grouping, as a property of that grouping". It may sound a bit difficult to visualize and understand.

In another search, on the Stack Overflow website, a contributor named Oded shared that cardinality can be described in two different contexts: data modelling and query optimization.

In terms of data modelling, cardinality means how one table relates to another, for example a one-to-one, one-to-many or many-to-many relationship. The diagram below is extracted from the Lucidchart website and shows the different types of relationships in a database; this notation is used in ER (entity-relationship) diagrams.

In terms of query optimization, cardinality refers to the data in a column of a table, specifically how many unique values it contains. If you have done data profiling before, using Microsoft Power BI for example, you will notice summary statistics for the table loaded into the application. This information helps with planning queries and optimizing execution plans.
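
As a small illustration of cardinality in the query-optimization sense, the sketch below counts unique values per column with pandas; the tiny sample data is made up for the example.

# Column cardinality: how many unique values each column contains.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],             # high cardinality: all values unique
    "country": ["SG", "SG", "MY", "SG", "MY"],  # low cardinality: only 2 unique values
})

print(df.nunique())
# customer_id    5
# country        2
# dtype: int64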

Charting Guideline on Tableau: How to Decide Which Chart to Use

The following was shared by the instructor of a Udemy online course I subscribed to, called Tableau for Beginners: Get CA Certified, Grow Your Career.

Okay, now back to the original question that I think most people ask: how to decide which chart to use in different situations. The instructor shares some guidance that may help us understand and practice more in Tableau, so that we can become familiar with the tool and pick the right chart next time.

Most of the time, when you want to show how a numeric value differs across categories, bar charts are the way to go. The eye is very good at making comparisons based on length (as compared with differences in angle, color, etc.).

If you are showing change over a date range, you will want to use a line chart.

Histograms and box plots are to show the distribution of data.

Scatter plots show how two continuous variables are related.

There is also more detail in this guide: https://www.tableau.com/learn/whitepapers/tableau-visual-guidebook. It discusses how to use color and other visual elements to add more information to your chart.

Intermediate Python for Data Science: Logic, Control Flow and Filtering

Boolean logic is the foundation of decision-making in Python programs. Learn about different comparison operators, how to combine them with Boolean operators, and how to use the Boolean outcomes in control structures. Also learn to filter data in pandas DataFrames using logic.

When I first started to learn Python, there was a topic on Boolean and comparison operators, where I studied Booleans (True and False), logical operators (and, or, not) and comparison operators (==, !=, < and >).

Comparison operators tell how two Python values relate and result in a Boolean. They allow you to compare two numbers, strings, or any two variables of the same type. They throw an exception or error message when comparing variables of different data types, because Python cannot tell how two objects of different types relate.
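
For instance, in Python 3 a comparison between unrelated types raises a TypeError rather than returning a Boolean:

# Comparing values of the same type returns a Boolean
print(2 < 3)                 # True
print("apple" < "banana")    # True (lexicographic comparison)

# Comparing unrelated types raises an error in Python 3
try:
    print(2 < "3")
except TypeError as err:
    print(err)               # '<' not supported between instances of 'int' and 'str'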

Comparing a Numpy array with an integer

In an example taken from the DataCamp online learning course that I am currently taking, the variable bmi is a Numpy array, which is compared against 23 to check whether each bmi is greater than 23. It works perfectly and returns Boolean values. Behind the scenes, Numpy builds an array of the same size and performs an element-wise comparison against the number 23.
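
Since the original screenshot is not reproduced here, a small sketch of that comparison, using a few sample bmi values, looks like this:

# Element-wise comparison of a Numpy array with an integer (sample bmi values)
import numpy as np

bmi = np.array([21.852, 20.975, 21.750, 24.747, 21.441])
print(bmi > 23)
# [False False False  True False]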

Boolean operators with Numpy

To use these operators with Numpy, you will need np.logical_and(), np.logical_or() and np.logical_not(). Here's an example using the my_house and your_house arrays (defined in the full code below) to give you an idea:

np.logical_and(your_house > 13, 
               your_house < 15)

Refer to below for the sample code:

# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))

The first print statement checks the 'or' condition, meaning that if either of the two conditions is True, it returns True. The second print statement checks the 'and' condition, meaning that both comparisons have to be True for it to return True. The execution returns Boolean arrays as below:

[False  True False  True]
[False False False  True]

Combining Boolean operators and comparison operators with conditional statements: if, else and elif.

It follows the if statement syntax. The simplest code that can be used to explain the above:

z = 4
if z % 2 == 0:
  print('z is even')

The same goes for the if-else statement with a comparison operator; see the code below:

z = 5
if z % 2 == 0:
  print('z is even')
else:
  print('z is odd')

Or, if you are working with an if, elif and else statement, it works too. See the code below:

z = 6
if z % 2 == 0:
  print('z is divisible by 2')
elif z % 3 == 0:
  print('z is divisible by 3')
else:
  print('z is neither divisible by 2 nor 3')

In the example above, both the first and second conditions match. However, in this control structure, once Python hits a condition that evaluates to True, it executes the corresponding code and then exits the control structure. It will not evaluate the next condition, corresponding to the elif statement.

Filtering Pandas DataFrame

In an example taken from DataCamp's tutorial, using the brics DataFrame of BRICS countries, where the area column is in millions of square kilometres, select the countries with an area over 8 million square kilometres. There are 3 steps to achieve this.

Step 1: select the area column from the DataFrame. This should return a Pandas Series, not a Pandas DataFrame. Assuming the DataFrame is called brics, the area column is selected using:

brics["area"]

#alternatively it can use the below too:
# brics.loc[:, "area"]
# brics.iloc[:, 2]

Step 2: add the comparison operator to see which rows have an area greater than 8; this returns a Series containing Boolean values. The final step is to use this Boolean Series to subset the Pandas DataFrame.

Step 3: store this Boolean Series as is_huge, as below:

is_huge = brics["area"] > 8

Then, create a subset of the DataFrame using the following code:

brics[is_huge]

It shows the countries with areas greater than 8 million square kilometres. The steps can be shortened into one line of code:

brics[brics["area"] > 8]

It also works with the Boolean operators (np.logical_and(), np.logical_or() and np.logical_not()). For example, to look for areas between 8 and 10 million square kilometres, the single line of code can be:

brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

The result returned from the above code is Brazil and China.