MongoDB: Importing csv files with mongoimport

Someone asked for my help to upload some .csv files to the MongoDB database and backup the database before sending the file to the next person. I completed the task with the command below. It imports a .csv file to the selected database and collection by specifying the file type, location and whether the file has a headerline. It runs for both Linux and Windows’ machines using the terminal or command prompt.
mongoimport -d mydb -c things --type csv --file locations.csv --headerline

MongoDB: Enabling SSL for MongoDB

Sometimes ago, I was told to prepare and work with a team of people to explore the TLS/SSL for MongoDB and use the Windows Active Directory users to access MongoDB via LDAP authentication.

It is an administration work on MongoDB to enhance security between client and server during data transmission. The default connections to MongoDB servers are not encrypted. It is highly advisable to ensure all connections to the mongod are Transport Layer Security (TLS) also known as SSL enabled.

The definition of TLS/SSL from Google is TLS and its now-deprecated predecessor, Secure Sockets Layer (SSL), which are protocols designed to provide communications security over a computer network

For a development server, you can enable the TLS using a self-signed certificate. OpenSSL allows us to generate the self-signed certificate on our server itself. For the Windows machine, it requires to install the OpenSSL to generate the self-signed certificate. Then, you need to install the self-signed certificate. For the Linux Ubuntu machine, no installation required for the OpenSSL and self-signed certificate.

In the MongoDB configuration file (mongod.conf for Linux machine) or (mongod.cfg for Windows machine) can enable the SSL mode in the MongoDB. You need to specify the path to the .pem file. If you are working on a production server, you are required to include the CAFile path too. CA stands for certificate authority, is an entity that issues digital certificates.

net:
   tls:
      mode: requireTLS
      certificateKeyFile: /etc/ssl/mongodb.pem
      CAFile: /etc/ssl/caToValidateClientCertificates.pem

More reading about MongoDB TLS/SSL can refer to this document: https://docs.mongodb.com/manual/tutorial/configure-ssl/

Dates in MongoDB

My colleague came to me last Friday and asked me to help him out on one of the issues he faced with MongoDB’s query. He found that the JSON data received via the API when our BI tool inserted the data into MongoDB, it saved the date and datetime values in string format.

The JSON specification does not specify a format for exchanging dates which is why there are so many different ways to do it. The best format is the ISO date format, it is a well known and widely used format and can be handled across many different languages, making it very well suited for interoperability.

It shall look like below:

2012-04-23T18:25:43.511Z

When I took over the JSON data and found that most of the date and time are in normal string format (eg: 2012-04-23 18:25:43.511), MongoDB inserted the data into the database as string type.

To understand dates in MongoDB, I found this article is pretty good in explaining to them, here is the link. Instantly, it hit me! Dealing with dates in MongoDB or any databases is not an easy job at all, moreover, this time I am dealing with MongoDB where it has subdocument or array in a subdocument structure.

I can use the $dateFromString which converts a date/time string to a date object. For more detail about this function, you can find it out at this link.

This aggregation operator worked like magic to me when I am dealing with the date and time in the string until I reached a point where the date is in an array inside a subdocument. I faced a roadblock and I am not able to use the same method to do the conversion. It seems quite true that I have to use $unwind in my query. It deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.

I need a solution that I can use $unwind to flatten the array from the collection and convert each date and time into a date object. I am still looking for a solution, if you happened to solve something similar like mine, please give me a helping hand and link me to the solutions. Thank you!

Maybe useful: https://stackoverflow.com/questions/38299186/query-that-combines-project-unwind-group

Intermediate Python for Data Science: Looping Data Structure

After the matplotlib for visualization, introduction to dictionaries and Pandas DataFrame, follows by logical, Boolean and comparison operators with if-elif-else control flow and now, comes to the last part, the while loop, for loop and loop for a different data structure.

In Python, some of the objects are iterable which means it loops through the object in a list, for example, to get each element. It loops through a string to capture each character in the string. A for loop iterates over a collection of things and while loop can do any kind of iteration within the block of codes, while some condition remains True

For Loop

The main keywords are for and in. It uses along with colon (:) and indentation (whitespace). Below is the syntax, 

#loop statement
my_iterable = [1,2,3]
for item_name in my_iterable:
    print(item_name)

I used two iterator variables (index, area) with enumerate(), for example, the sample code below. enumerate() loops over something and has an automatic counter, then returns an enumerate object.

# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate() and update print()
for index, area in enumerate(areas) :
    print("room " + str(index) + ": " + str(area))

#Output:
"""
room 0: 11.25
room 1: 18.0
room 2: 20.0
room 3: 10.75
room 4: 9.5
"""

Another example utilizes a loop that goes through each sublist of house and prints out the x is y sqm, where x is the name of the room and y is the area of the room.

# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
         
# Build a for loop from scratch
for x in house:
    print("the " + str(x[0]) + " is " + str(x[1]) + " sqm")

# Output:
"""
the hallway is 11.25 sqm
the kitchen is 18.0 sqm
the living room is 20.0 sqm
the bedroom is 10.75 sqm
the bathroom is 9.5 sqm
"""

Definition of enumerate() can be found here. My post on for loop is here.

While Loop

The main keyword is while, colon (:) and indentation (whitespace). Below is the syntax,

# while loop statement
while some_boolean_condition:
     # do something 

# Examples
x = 0
while x < 5:
     print(f'The number is {x}')
     x += 1  

An example of putting an if-else statement inside a while loop.

# Initialize offset
offset = -6

# Code the while loop
while offset != 0 :
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)

# Output:
"""
correcting...
-5
correcting...
-4
correcting...
-3
correcting...
-2
correcting...
-1
correcting...
0
"""

My post on while loop is here.

Loop Data Structure

Dictionary:
If you want to iterate over key-value pairs in a dictionary, use the items() method on the dictionary to define the sequence in the loop.

for key, value in my_dic.items() : 

Numpy Array:
If you want to iterate all elements in a Numpy array, use the nditer() function to specify the sequence.

for val in np.nditer(my_array) : 

Some examples as below:

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe
for key, value in europe.items():
    print("the capital of " + key + " is " + value)

# Output:
the capital of austria is vienna
the capital of norway is oslo
the capital of italy is rome
the capital of spain is madrid
the capital of germany is berlin
the capital of poland is warsaw
the capital of france is paris
"""

# Import numpy as np
import numpy as np

# For loop over np_height
for x in np_height:
    print(str(x) + " inches")

# For loop over np_baseball
for x in np.nditer(np_baseball):
    print(x)

Loop over DataFrame explanation and example can be found in my post here.

Data Architecture

Recently, I changed my job and my workplace enforces good practice of Data Architecture. I tried to understand better about data architecture on my own and I found a good link that explained it.

Data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated and put to use in data systems and the organization. Data architecture describes how data is processed, stored and utilized in an information system.

Data architecture provides criteria for data processing operations, makes it possible to design data flows and also controls the flow of data in the system. Data architecture should be defined in the planning phase of the design of a new data processing and storage system.

Data Modeling and Design defines as “the process of discovering, analyzing, representing and communicating data requirements in a precise form called the data model.” Data models illustrate and enable an organization to understand its data assets through core building blocks such as entities, relationships, and attributes. These represent the core concepts of the business such as customer, product, employee, and more.

Data architecture and data modeling should align with core business processes and activities of the organization. It needs to be integrated into the entire architecture. Without knowing what the existing data import and export processes are, it is difficult to know whether the new platform will be a good fit. A model entails developing simple business rules about what business has: customer, products, part, etc.

Link: https://www.dataversity.net/data-modeling-vs-data-architecture/

Cardinality in Databases

Recently, I read an article on cardinality in databases. When you do a Google search to define cardinality in general terms, it returns a definition as “the number of elements in a set or other grouping, as a property of that grouping“. It may sound a bit difficult to visualize and understand.

In another search in Stackoverflow website, a contributor named Oded shared, the cardinality can be described in two different contexts, data modelling and data query optimization.

In term of data modelling, cardinality means how one table relates to another, for example, one to one relationship, one to many relationship or many to many relationship. Below diagram is extracted from lucidchart website which shows the different types of relationship in the database. It is used in ER diagram, entity-relationship diagram.

In term of data query optimization, cardinality means the data in a column of a table, specifically how many unique values are in it. If you have done data profiling before using the Microsoft Power BI, for example, you notice there is a summary statistics of the table loaded into the application. This information helps with planning queries and optimizing the execution plans.

Charting Guideline on Tableau: How to decide what chart to be used

The following is a sharing made by the instructor of the Udemy’s online learning course which I subscribed to. The course is called Tableau for Beginners: Get CA Certified, Grow Your Career.

Okay, now back to the original question which I think most people always ask, how to decide what chart to be used in different situations. The instructor shares some information which I think it may help us to understand and practice more in Tableau so that we can familiarize with the tool and able to pick the right chart next time.

Most of the time when you want to show how a numeric value differs according to different categories, bar charts are the way to go. The eye is very good at making comparisons based on length (as compared with differences in angle, color, etc)

If you are showing change over a date range, you will want to use a line chart.

Histograms and box plots are to show the distribution of data.

Scatter plots show how two continuous variables are related.

There is also more detail in this guide: https://www.tableau.com/learn/whitepapers/tableau-visual-guidebook. It gets into talking about how to use color and other visual elements to add more information to your chart.