Describe the difference between batch and streaming data

Data processing means converting raw data into meaningful information. Depending on how data is ingested into your system, you can process each data item as it arrives, or buffer the raw data and process it in groups. Processing data as it arrives is called streaming. Buffering and processing the data in groups is called batch processing.

Understand batch processing

In batch processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time as a batch. Exactly when each group is processed can be determined in several ways. For example, you can process data on a scheduled time interval (for example, every hour), or processing could be triggered when a certain amount of data has arrived, or as the result of some other event.

Advantages of batch processing include:

  • Large volumes of data can be processed at a convenient time.
  • It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight or during off-peak hours.

Disadvantages of batch processing include:

  • There is a time delay between ingesting the data and getting the results, because processing does not start until the next scheduled run.
  • All of a batch job's input data must be ready before the batch can be processed, which means the data must be carefully checked. Problems with the data, errors, and program crashes that occur during a batch job halt the entire process, and the input data must be checked again before the job can be rerun. Even minor data errors, such as typographical errors in dates, can prevent a batch job from running.

Understand streaming and real-time data

In stream processing, each new piece of data is processed when it arrives. Data ingestion, for example, is inherently a streaming process.

Streaming handles data in real time. Unlike batch processing, there is no waiting until the next batch processing interval: data is processed as individual pieces rather than a batch at a time. Streaming data processing is beneficial in most scenarios where new, dynamic data is generated continually.

Examples of streaming data include:

  • A financial institution tracks changes in the stock market in real-time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
  • An online gaming company collects real-time data about player-game interactions and feeds the data into its gaming platform. It then analyses the data in real-time and offers incentives and dynamic experiences to engage its players.
  • A real-estate website tracks a subset of data from consumers’ mobile devices and makes real-time recommendations of properties to visit based on their geo-location.
  • Stream processing is ideal for time-critical operations that require an instant real-time response. For example, a system that monitors a building for smoke and heat needs to trigger alarms and unlock doors to allow residents to escape immediately in the event of a fire.

Understand the differences between batch and streaming data

Apart from how batch processing and streaming processing handle data, there are other differences:

  • Data Scope: Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received or within a rolling time window (the last 30 seconds, for example).
  • Data Size: Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro-batches consisting of a few records.
  • Performance: The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds. Latency is the time taken for the data to be received and processed.
  • Analysis: You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.

[SQL] Datetime and Datetime2

Recently, I was working on updating the data marts in the testing environment, and I found that one of them was using the datetime2 data type in MSSQL. So I did some reading about the differences between datetime and datetime2.

Introduction

Format:

  • Datetime – YYYY-MM-DD hh:mm:ss[.nnn]
  • Datetime2 – YYYY-MM-DD hh:mm:ss[.nnnnnnn]

The datetime type needs 8 bytes of storage and has a date range from 1753-01-01 to 9999-12-31. The datetime2 type uses 6 to 8 bytes depending on the fractional-second precision, and ranges from 0001-01-01 00:00:00.0000000 through 9999-12-31 23:59:59.9999999.

Also, I understand that data coming from Oracle uses datetime2. Therefore, some tables that store source data from Oracle may use the datetime2 data type in their schema.

Problem faced in ETL

Occasionally, I faced errors when I tried to do a Data Conversion of a column from datetime2 to database timestamp (datetime) in an SSIS package. For example, it tried to convert the datetime2 value 2021-05-23 15:38:44.6666667 to datetime. The error happened randomly in my SSIS packages, and I could not figure out why. In one package, the Data Conversion worked well on this value without any changes. Another package worked only after I deleted the data flow arrow linking the Data Conversion to the OLE DB Destination used for data insertion.

Cast datetime2 to datetime

While investigating the same problem, I used T-SQL in SQL Server Management Studio to convert the datetime2 value 2021-05-23 15:38:44.6666667 to datetime in a SELECT statement. It returned the result without any errors. I was expecting it might throw a truncation error because of the fractional seconds in this record. If I pre-transform the data in the SELECT statement before loading it for the actual transformation, is that still considered ETL, or is it ELT?
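For reference, here is a minimal T-SQL sketch of that conversion, using the value from my example. As far as I understand, CAST from datetime2 to datetime rounds the fractional seconds to datetime's roughly 1/300-second precision rather than raising an error:

```sql
-- datetime2(7) value with seven fractional digits
DECLARE @d2 datetime2(7) = '2021-05-23 15:38:44.6666667';

-- CAST to datetime rounds to the nearest .000/.003/.007 increment
SELECT CAST(@d2 AS datetime) AS converted;
-- returns 2021-05-23 15:38:44.667

-- Note: dates before 1753-01-01 are outside datetime's range
-- and would raise an out-of-range error instead of rounding.
```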

If you have encountered the same problem before and know how to fix it, please share it with me in the comment box below. I appreciate your help. Thank you 🙂

[SQL] Why use “where 1=2”

Some time ago, I went to an interview, and one of the interview questions was: what do you get after running the SQL statement below?

select * from table where 1=2

I should have quickly Googled it, since it was a video interview, but I did not, and I did not answer the question. After the interview, I Googled it and posted my answer on Facebook. Facebook’s memories feature reminded me today that I had shared this interview question and its answer. I would like to share one of the examples from stackoverflow.com.

Example

CREATE TABLE EMPL_DEMO AS SELECT * FROM employees WHERE 1=2

Explanation

The above is an example of how WHERE 1=2 is used in table creation. A respondent replied that this command copies a table’s structure to another table, here EMPL_DEMO. The new table will have the same columns as the employees table, but not its keys or constraints.

The condition 1=2 always evaluates to false, which prevents any records from being copied from the source table.
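The CREATE TABLE ... AS SELECT syntax above is Oracle/MySQL style. In SQL Server, the same trick is usually written with SELECT ... INTO; this is a hedged sketch assuming an employees table exists:

```sql
-- Copy only the column structure of employees; 1 = 2 filters out every row.
SELECT *
INTO EMPL_DEMO
FROM employees
WHERE 1 = 2;

-- EMPL_DEMO now has the same columns as employees, but zero rows.
```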

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

[SQL] Columnstore Indexes

When my team introduced columnstore indexes, I tried my best to understand them, as I needed to implement columnstore indexes in my projects as well. According to Microsoft’s documentation, columnstore indexes are the standard for storing and querying large data warehousing fact tables.

To put it simply, a columnstore index stores and retrieves data using a columnar data format. The data is logically organized as a table with rows and columns, but physically stored in a column-based format. I found an easy-to-understand comparison on the data-nerd.blog website.

The standard row-store index is useful when searching for a particular value in small tables or over small ranges of values; it suits transactional workloads. However, for data warehouse and business analytics workloads, whose queries perform full scans of large datasets on large tables such as fact tables, the columnstore index is said to be better.

A columnstore index stores each column on separate disk pages, rather than storing multiple rows per page.

A few things you gain from applying columnstore indexes in your MSSQL databases:

  • The index uses column-based data storage instead of conventional row-oriented storage. Query performance in a data warehouse can improve by up to 10 times.
  • Data compression gains of up to 10 times over the uncompressed data size.
  • Reduced I/O, because only the columns named in the SELECT statement are fetched from disk into the memory buffer, instead of the entire row.

Nonclustered columnstore index

A nonclustered columnstore index and a clustered columnstore index function the same. The difference is that a nonclustered index is a secondary index that’s created on a rowstore table, but a clustered columnstore index is the primary storage for the entire table.

The nonclustered index contains a copy of part or all of the rows and columns in the underlying table. The index is defined as one or more columns of the table and has an optional condition that filters the rows.

A nonclustered columnstore index enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index. 
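As a hedged sketch of the two kinds (the table, index, and column names here are illustrative, not from a real project), creating them looks like this:

```sql
-- Secondary columnstore index on an existing rowstore table,
-- covering the columns that analytic queries scan:
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, ProductKey, SalesAmount);

-- Alternatively, a clustered columnstore index replaces the
-- table's storage entirely (a table can have only one):
-- CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;
```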

Columnstore has drawbacks too

Tuple (row) writes are very expensive on a columnstore index. In earlier versions of SQL Server, tables with a columnstore index could not be updated directly using INSERT, UPDATE, DELETE and MERGE statements or bulk load operations; to perform DML (data manipulation language) on such a table, the recommendation was to disable or drop the columnstore index temporarily and re-create it after the DML activity. Newer versions support updatable columnstore indexes, but heavy row-by-row writes remain costly.

[SQL] What is the difference between clustered and non clustered index in SQL?

Before I share the differences, let me give a short introduction to what SQL indexes are, in case some of us are new to this topic.

What is an index?

An index helps to search for data in a table. Think of an index over a table as an index at the back of the book. When you want to find a reference to an item in the book, you look it up through the index. Then, you can use the page number in the index to go directly to the correct pages in the book. Without an index, you might have to read through the entire book to find the references you are looking for.

When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in sorted order, with pointers to the corresponding rows in the table. When the user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row.

An index might consume additional storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional work can slow down insert, update, and delete operations, and incur additional processing charges. Therefore, when deciding which indexes to create, you must strike a balance between having indexes that speed up your queries versus the cost of performing other operations.

In a table that is read-only, or whose data is modified infrequently, more indexes will improve query performance. If a table is queried infrequently but is subject to a large number of inserts, updates, and deletes (such as a table involved in OLTP), then creating many indexes on that table can slow your system down.

The differences between the clustered and non clustered index in SQL

  1. A clustered index makes retrieval of data from the database faster, whereas reading from a nonclustered index is relatively slower. Both optimize search queries.
  2. A clustered index alters the way records are stored in the database: it sorts the rows by the column designated as the clustered index key. A nonclustered index does not alter how the rows are stored; instead, it creates a separate object within the table that points back to the original table rows after searching.
  3. A clustered index reduces the number of data pages that must be read to retrieve the data for a SQL statement.
  4. One table can have only one clustered index, whereas it can have many nonclustered indexes.

A clustered index physically reorganizes a table by the index key. This arrangement can improve the performance of queries still further, because the relational database management system doesn’t have to follow references from the index to find the corresponding data in the underlying table.

Explain different types of index

There are three types of index namely:

Unique Index:

This index does not allow the field to have duplicate values when the column is uniquely indexed. When a primary key is defined, a unique index is applied to it automatically.

Clustered Index:

This index reorders the physical order of the table and searches based on the basis of key values. Each table can only have one clustered index.

Non-Clustered Index:

Non-Clustered Index does not alter the physical order of the table and maintains a logical order of the data. Each table can have many nonclustered indexes.
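As a hedged sketch (the table, column, and index names below are made up for illustration), the three types are created like this:

```sql
-- Unique index: disallows duplicate Email values
CREATE UNIQUE INDEX UX_Customers_Email ON Customers (Email);

-- Clustered index: physically orders the table by OrderID (one per table)
CREATE CLUSTERED INDEX CIX_Orders_OrderID ON Orders (OrderID);

-- Nonclustered index: separate structure pointing back to the rows (many allowed)
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON Orders (CustomerID);
```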

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

SQL, NoSQL and MySQL

SQL, NoSQL and MySQL are common terms in today’s database management system technology. In this article, I try to explain them in simple English and distinguish between them. It covers the questions below:

  • What is SQL?
  • What is NoSQL?
  • What is MySQL?
  • What is MongoDB?
  • Should you use SQL or NoSQL?

What is SQL?

SQL stands for Structured Query Language. It is the core language of relational databases, used for accessing, managing and manipulating structured data. The data is stored in the form of columns and rows, and tables hold relationships with other tables. There are four different types of relationships:

  • One-to-one relationship – It happens when a single row in Table A relates to a single row in Table B.
  • One-to-many relationship – It happens when a single row in Table A relates to multiple rows in Table B.
  • Many-to-many relationship – It happens when many rows in Table A can be related to many rows in Table B.
  • Self-referencing relationship – It happens when a record in Table A relates to another record in the same table.

Often SQL relates to the relational database such as Microsoft SQL Server, Oracle, PostgreSQL and MySQL.
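To make the one-to-many case concrete, here is a minimal sketch (the table and column names are illustrative) of how such a relationship is declared with a foreign key:

```sql
-- One customer ...
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(100)
);

-- ... can have many orders: each Orders row points back to one customer.
CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customers (CustomerID)
);
```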

What is NoSQL?

NoSQL, also known as a “not only SQL” database, provides a mechanism to store and retrieve unstructured data. A NoSQL database can handle a large amount of data and has a dynamic schema, meaning there is no fixed set of columns or fields in a table. The data is stored in the form of collections and documents. NoSQL has no, or very few, relationships.

SQL versus NoSQL:

  • SQL: Relational database; structured data in defined columns and rows, with each table related to other tables in the database.
    NoSQL: Non-relational database; data is stored in the form of collections and documents with no or few relationships.
  • SQL: Predefined schema.
    NoSQL: Dynamic schema.
  • SQL: Table-based databases.
    NoSQL: Document databases, key-value stores, graph databases, column-family databases.
  • SQL: Databases are vertically scalable; you can load-balance the data servers by upgrading hardware, such as adding CPU, RAM or SSD.
    NoSQL: Databases are horizontally scalable; you can load-balance by adding more servers to your cluster to handle a large amount of traffic.
  • SQL: Mainly used for Online Transactional Processing (OLTP); good for complex queries.
    NoSQL: Mainly used for Online Analytical Processing (OLAP); not a good fit for complex queries.

Some non-technical people may think MySQL is another category besides SQL and NoSQL. That is incorrect: MySQL is one of the popular relational databases. Below is some information about MySQL and MongoDB for your reference.

What is MySQL?

MySQL is one of the most famous relational database management systems. It is an open-source tool that works on many platforms. MySQL uses SQL to write simple to complex queries to retrieve and manipulate data, and supports relationships using the JOIN statement to retrieve data from a table that relates to another table. MySQL uses a privilege-based security model that authenticates a user and that user’s privileges on a particular database. Like most relational databases, MySQL supports master-slave replication.

What is MongoDB?

MongoDB is a non-relational database that stores data in documents. A document database stores related information together for quicker query processing. MongoDB has its own unstructured query language and does not support the JOIN statement, but it supports placing a document inside another document (an embedded document or sub-document) and multi-dimensional data types such as arrays. MongoDB security is based on role-based access control with a flexible set of privileges. MongoDB supports built-in replication, sharding, and auto-elections.

I have written a post that shares the top 10 databases in 2019, along with some information on what key-value stores, document-based databases, and so on are, which I think is useful for a basic understanding. Refer to this link for more information.

Should you use SQL or NoSQL?

Based on my reading, there is no clear winner or clear conclusion about which one is better or which one to use. The choice of database depends on the schema of your data and how you want to access it.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.


Types of Non-Relational Data

I came across a few different types of semi-structured data introduced by the speaker during a presentation. I used to deal with the JSON format in my previous projects; it is one of the most commonly used formats and file types.

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. JSON is a collection of key-value pairs.
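As a small aside, SQL Server (2016 and later) can parse such key-value pairs directly; this is a hedged sketch using the built-in OPENJSON function with a made-up document:

```sql
-- A tiny JSON document: a collection of key-value pairs
DECLARE @json NVARCHAR(MAX) = N'{"name": "Alice", "city": "Singapore"}';

-- OPENJSON shreds the document into one row per key-value pair
SELECT [key], [value] FROM OPENJSON(@json);
-- returns (name, Alice) and (city, Singapore)
```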

Other formats

The other formats or file types that were introduced to me are:

  • AVRO – Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols and serializes data in a compact binary format. Combined with Kafka, it provides schema-based, robust, and fast binary serialization.
  • ORC – Apache ORC (Optimized Row Columnar) is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem, such as RCFile and Parquet.
  • Parquet – Parquet is an open-source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed as an efficient, performant flat columnar storage format, compared to row-based files like CSV or TSV.

I am not deeply familiar with these formats. If you have more experience using them, please share it in the comment box below. If you wish to help me explain these formats in simple English, please do so too.

[SQL] What is Normalization and what are the advantages of it?

Normalization is the process of organizing data to avoid duplication and redundancy. Normalization helps to reduce storage and improve data quality. See below for more advantages:

  • Better Database organization
  • More Tables with smaller rows
  • Efficient data access
  • Greater Flexibility for Queries
  • Quickly find the information
  • Easier to implement Security
  • Allows easy modification
  • Reduction of redundant and duplicate data
  • More Compact Database
  • Ensure Consistent data after modification

Relations

In a normalized database schema, the primary keys and foreign keys are used to define relationships. No data duplication exists other than key values in 3rd normal form (3NF). The data can be queried or retrieved by joining the tables together in a query.

Next, I am going to share some theories related to normalization. It should be covered in the data management subject in the university.

Explain different types of Normalization

There are many successive levels of normalization, called normal forms. Each consecutive normal form depends on the previous one. The first three normal forms are usually adequate.

  • First Normal Form (1NF) – No repeating groups within rows
  • Second Normal Form (2NF) – Every non-key (supporting) column value is dependent on the whole primary key.
  • Third Normal Form (3NF) – Every non-key column value depends solely on the primary key and on no other non-key (supporting) column value.

First Normal Form (1NF)

A table is in 1NF if every cell contains a single value, not a list of values; such values are known as atomic. 1NF also prohibits repeating groups of columns such as item1, item2, .., itemN. Instead, you should create another table using a one-to-many relationship, and give each table a primary key. Each field contains at most one value from its attribute domain.

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and every non-key column is fully dependent on the primary key. Furthermore, if the primary key is made up of several columns, every non-key column must depend on the entire set and not just part of it. If an attribute depends on only part of a multi-valued key, move it to a separate table.

For example, the primary key of the OrderDetails table comprising orderID and productID. If unitPrice is dependent only on productID, it shall not be kept in the OrderDetails table (but in the Products table). On the other hand, if the unitPrice is dependent on the product as well as the particular order, then it shall be kept in the OrderDetails table.
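A minimal sketch of the schema that example describes (the column types are assumptions added for illustration):

```sql
-- unitPrice depends only on productID, so it belongs in Products
CREATE TABLE Products (
    productID INT PRIMARY KEY,
    name      VARCHAR(100),
    unitPrice DECIMAL(10, 2)
);

-- OrderDetails' key is (orderID, productID); every non-key column
-- here must depend on that whole key, not just part of it
CREATE TABLE OrderDetails (
    orderID   INT,
    productID INT REFERENCES Products (productID),
    quantity  INT,
    PRIMARY KEY (orderID, productID)
);
```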

Third Normal Form (3NF)

A table is in 3NF if it is in 2NF and the non-key columns are independent of each other. In other words, the non-key columns depend on the primary key, only on the primary key, and nothing else. If attributes do not contribute to a description of the key, move them to a separate table. All attributes must be directly dependent on the primary key.

For example, suppose that we have a Products table with columns productID (primary key), name and unitPrice. The column discountRate shall not belong to Products table if it is also dependent on the unitPrice, which is not part of the primary key.

Higher Normal Form

The 3NF has its inadequacies, which lead to higher normal forms such as Boyce/Codd Normal Form (BCNF), Fourth Normal Form (4NF) and Fifth Normal Form (5NF).

  • BCNF – If there are non-trivial dependencies between candidate key attributes, separate them out into distinct tables.
  • 4NF – No table may contain two or more 1:n or n:m relationships that are not directly related.
  • 5NF – There may be practical constraints on information that justify separating logically related many-to-many relationships.

At times, you may decide to break some of the normalization rules for performance reasons (e.g., creating a totalPrice column in the Orders table that can be derived from the OrderDetails records), or because the end user requested it. Make sure that you are fully aware of the trade-off, develop programming logic to handle it, and properly document the decision.

It takes a while to understand and put it in practice. If you have any experiences that you want to share, please leave your comments. I will update this post from time to time. Please check out my next post about denormalization.

[SQL] What is ACID property in a database?

ACID stands for Atomicity, Consistency, Isolation, Durability. It is used to ensure that the data transactions are processed reliably in a database system. 

ACID Model

Atomicity: Atomicity means a transaction is either completely done or completely fails, where a transaction is a single logical operation on data. If one part of a transaction fails, the entire transaction fails and the database state is left unchanged.

Consistency: Consistency ensures that the data always meets all validation rules. In simple words, a transaction never leaves the database in a partially completed state; it brings the database from one valid state to another.

Isolation: The main goal of isolation is concurrency control: concurrent transactions do not interfere with each other, and each behaves as though it were running alone.

Durability: Durability means that once a transaction has been committed, its changes persist regardless of whatever may come in between, such as power loss, a crash or any other sort of error.
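Atomicity in particular is easy to see in T-SQL; here is a hedged sketch with an illustrative Accounts table:

```sql
-- Transfer 100 between accounts: both updates commit together,
-- or (on error/rollback) neither takes effect.
BEGIN TRANSACTION;

UPDATE Accounts SET balance = balance - 100 WHERE accountID = 1;
UPDATE Accounts SET balance = balance + 100 WHERE accountID = 2;

COMMIT TRANSACTION;
-- Use ROLLBACK TRANSACTION instead of COMMIT to undo both updates.
```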

Base Model

Previously, I learned that some NoSQL databases are not ACID-compliant, which may make them unsuitable for your application, depending on your use cases. The acronym BASE is slightly more confusing than ACID. BASE stands for:

  • Basically Available – Rather than enforcing immediate consistency, BASE-modelled NoSQL databases will ensure availability of data by spreading and replicating it across the nodes of the database cluster.
  • Soft State – Due to the lack of immediate consistency, data values may change over time. The BASE model breaks off with the concept of a database which enforces its own consistency, delegating that responsibility to developers.
  • Eventually Consistent – The fact that BASE does not enforce immediate consistency does not mean it never achieves consistency. Until it does, however, data reads are still possible (even though they might not reflect the latest state).

SQL

[SQL] WHAT IS THE DIFFERENCE BETWEEN CLUSTERED AND NON CLUSTERED INDEX IN SQL?

 | EDIT

The differences between the clustered and non clustered index in SQL are :

  1. A clustered index is used for easy retrieval of data from the database and it is faster, whereas reading from non clustered index is relatively slower. It optimizes search queries.
  2. Clustered index alters the way records are stored in a database as it sorts out rows by the column which is set to be clustered index. In a non clustered index, it does not alter the way it was stored but it creates a separate object within a table that points back to the original table rows after searching.
  3. It reduces the amount of data pages that need to be read to retrieve the data in a SQL statement.
  4. One table can only have one clustered index whereas it can have many non clustered index.

EXPLAIN DIFFERENT TYPES OF INDEX

There are three types of index namely:

UNIQUE INDEX:

This index does not allow the field to have duplicate values if the column is unique indexed. If a primary key is defined, a unique index can be applied automatically.

CLUSTERED INDEX:

This index reorders the physical order of the table and searches based on the basis of key values. Each table can only have one clustered index.

NON-CLUSTERED INDEX:

Non-Clustered Index does not alter the physical order of the table and maintains a logical order of the data. Each table can have many nonclustered indexes.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

SHARE THIS:

Customize buttonshttps://widgets.wp.com/likes/index.html?ver=20210413#blog_id=4915531&post_id=5435&origin=liyenz.wordpress.com&obj_id=4915531-5435-6092b1b4da3d8

SQL, NOSQL AND MYSQL

 | EDIT

The SQL, NoSQL and MySQL are the common terms used in today’s database management system technology. In this article, I try to explain them in simple English to distinguish them. It covers the questions below:

  • What is SQL?
  • What is NoSQL?
  • What is MySQL?
  • What is MongoDB?
  • Whether use SQL or NoSQL?

WHAT IS SQL?

SQL stands for Structured Query Language. It is the core of the relational database that it is used for accessing, managing and manipulating structured data. The data is stored in the form of columns and rows, and holds relationships between those tables. There are four different type of relationships:

  • One-to-one relationship – It happens when a single row in Table A relates to a single row in Table B.
  • One-to-many relationship – It happens when a single row in Table A relates to multiple rows in Table B.
  • Many-to-many relationship – It happens when many rows in Table A can be related to many rows in Table B.
  • Self-referencing relationship – When a record in Table A relates to the same table itself.

Often SQL relates to the relational database such as Microsoft SQL Server, Oracle, PostgreSQL and MySQL.

WHAT IS NOSQL?

NoSQL also is known as Not SQL database provides a mechanism to store and retrieve unstructured data. NoSQL database can handle a large amount of data and has a dynamic schema. It means there is no specific number of columns or fields in the table. The data are stored in the format of collections and documents. NoSQL has no or very few relationships.

SQLNoSQL
Relational database, structured data in defined columns and rows with each table relates to other tables in the database.Non-relational database because data is stored in the form of collections and documents with no or few relationships.
Predefined schema.Dynamic schema.
Table based databasesDocument database
Key-value stores
Graph database
Column family database
SQL databases are vertically scalable. You can load balance the data servers by optimizing hardware such as increasing CPU, RAM, SSD.NoSQL databases are horizontally scalable. You can perform load balancing by adding more servers to your cluster to handle a large amount of traffic.
SQL is mainly used for Online Transactional Processing (OLTP).

Good for complex queries.
NoSQL is mainly used for Online Analytical Processing (OLAP).

Not a good fit for complex queries.

Maybe some non-technical people may think MySQL is another type of SQL besides SQL and NoSQL. It is incorrect. MySQL is one of the popular relational databases. Below is some information about MySQL and MongoDB for your references.

WHAT IS MYSQL?

It is one of the famous relational database management systems. MySQL is an open-source tool that works on many platforms. MySQL uses SQL to write simple to complex queries to retrieve and manipulate data. MySQL supports relationships using the JOIN statement to retrieve data from a table that relates to another table. MySQL uses a privilege-based security model that authenticates a user and user privileges on a particular database. Most of the relational databases support master-slave replication, MySQL too.

WHAT IS MONGODB

MongoDB is a non-relational database that stores the data in documents. The document database stores the related information together for quicker query processing. MongoDB has its own unstructured query language and it does not support JOIN statement, but it supports placing a document inside another document (embedded document or sub-document) and multi-dimensional data types such as arrays. MongoDB is based on role-based access control with a flexible set of privileges. MongoDB supports built-in replication, sharding, and auto-elections.

I have written a post that share the top 10 databases in 2019, and some information of what is key-value stores, document-based database, and etc that I think it is useful for basic understanding. Refer to this link for more information.

WHETHER USE SQL OR NOSQL?

Based on my reading, there is no clear winner or clear conclusion which one is better and which one to be used. The choice of the database depends upon the schema of your database and how you want to access your data.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

References:
https://www.edureka.co/blog/sql-vs-nosql-db/#What%20is%20MongoDB

SHARE THIS:


[SQL] EXCEPT AND INTERSECT


Last Wednesday I went back to the office and met up with my teammates. It was our first weekly back-to-office after 9 months of staying away from the office. During our meet-up we discussed our work progress, and my teammate asked whether we would like to explore using SQL EXCEPT in our data extraction. I had a few questions in mind before hopping on to share our idea with the bosses and business analysts.

WHAT IS EXCEPT?

In SQL Server, EXCEPT returns the distinct rows from the left input query that are not output by the right input query. In other words, a distinct value from the query left of the EXCEPT operator is returned as long as the right query does not also return it.

Just as with the UNION operator, the same rules apply when using the EXCEPT operator.

WHAT IS INTERSECT?

In SQL Server, INTERSECT returns the distinct rows that are output by both the left and right input queries. In other words, it returns any distinct values that are returned by both the query on the left and the query on the right of the INTERSECT operator.

BASIC RULES

To combine the result sets of two queries that use EXCEPT or INTERSECT, the basic rules are:

  • The number and the order of the columns must be the same in all queries.
  • The data types must be compatible.

In our use case, our data mart and temporary table have the same number of columns in the same order, with the same data types, so there is no issue applying EXCEPT in our data extraction. Besides that, we are not using non-comparable data types such as xml, text, ntext, or image.
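A minimal runnable sketch of both operators, using SQLite through Python's sqlite3 module rather than SQL Server (both engines implement EXCEPT and INTERSECT with the distinct-row semantics described above); the table names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE current_extract (id INTEGER);
    CREATE TABLE previous_extract (id INTEGER);
    INSERT INTO current_extract VALUES (1), (2), (2), (3);
    INSERT INTO previous_extract VALUES (2), (4);
""")

# EXCEPT: distinct rows from the left query that the right query does not return
new_rows = sorted(conn.execute(
    "SELECT id FROM current_extract EXCEPT SELECT id FROM previous_extract"
).fetchall())
print(new_rows)     # [(1,), (3,)]

# INTERSECT: distinct rows returned by both queries
common_rows = sorted(conn.execute(
    "SELECT id FROM current_extract INTERSECT SELECT id FROM previous_extract"
).fetchall())
print(common_rows)  # [(2,)]
```

Note that the duplicate value 2 in the left table appears at most once in the output: both operators work on distinct rows.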

NULLABILITY

When comparing column values for determining DISTINCT rows, two NULL values are considered equal.
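This can be seen with a tiny sketch in SQLite via Python's sqlite3 (SQLite follows the same rule for these operators):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two NULLs count as equal when determining DISTINCT rows, so INTERSECT
# keeps the NULL row and EXCEPT eliminates it.
kept = conn.execute("SELECT NULL INTERSECT SELECT NULL").fetchall()
removed = conn.execute("SELECT NULL EXCEPT SELECT NULL").fetchall()
print(kept)     # [(None,)]
print(removed)  # []
```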

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

REFERENCE:

SHARE THIS:


POWER BI CANNOT USE SQL CTE


I just want to do a quick sharing of what I found out today in Power BI Desktop. I was trying to use a SQL CTE in Power BI Desktop to get a result set for my project's testers to check the data transformation and ETL processes in the UAT database. A little bit of background: we are using Power BI Desktop to access our data in the UAT and Production environments. Users do not have direct access to the databases on the server and cannot access the data via SQL Server Management Studio (SSMS).

WHAT IS CTE?

CTE stands for common table expression in SQL Server. According to the Microsoft website, a CTE is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or MERGE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself; this is referred to as a recursive common table expression. For more reading about CTEs, please refer to the link in the reference list below.
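As a small illustration of a CTE that references itself, here is a recursive CTE that counts from 1 to 5, run through SQLite via Python's sqlite3 (SQLite requires the RECURSIVE keyword, while SQL Server uses plain WITH):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A recursive common table expression: the CTE "counter" references itself.
rows = conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1                                -- anchor member
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5   -- recursive member
    )
    SELECT n FROM counter
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```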

IT GAVE AN ERROR!

The CTE script that I wrote ran perfectly well in the SSMS installed on my local machine. However, when I used the Get Data function in Power BI Desktop and placed the same script into the SQL command text field, Power BI Desktop threw an error on the screen suggesting a syntax error. I knew there was no syntax error, even though some of the online solutions mentioned that we have to put a semicolon in the script.

-- Define the CTE expression name and column list.
WITH Sales_CTE (SalesPersonID, SalesOrderID, SalesYear)
AS
-- Define the CTE query.
(
    SELECT SalesPersonID, SalesOrderID, YEAR(OrderDate) AS SalesYear
    FROM Sales.SalesOrderHeader
    WHERE SalesPersonID IS NOT NULL
)
-- Define the outer query referencing the CTE name.
SELECT SalesPersonID, COUNT(SalesOrderID) AS TotalSales, SalesYear
FROM Sales_CTE
GROUP BY SalesYear, SalesPersonID
ORDER BY SalesPersonID, SalesYear;

POWER BI DOES NOT SUPPORT CTE

I quite agree with one of the Power BI users' replies: Power BI Desktop is supposed to be a self-service tool for users. In other words, it is not meant for writing SQL scripts such as views or stored procedures. To resolve the error, I had to change the SQL statement to use a normal SELECT statement. This is a good time to test my SQL skills.
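In general, a non-recursive CTE can be rewritten as a derived table (a subquery in the FROM clause), which is one way to get a normal SELECT statement. Here is a minimal sketch of that equivalence, using SQLite via Python's sqlite3 with made-up table names (not the Microsoft sample script):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (person INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, 20), (2, 5);
""")

# CTE version (the style SSMS accepts but Power BI's SQL command text rejected)
cte_rows = conn.execute("""
    WITH totals AS (
        SELECT person, SUM(amount) AS total FROM orders GROUP BY person
    )
    SELECT person, total FROM totals ORDER BY person
""").fetchall()

# Equivalent plain SELECT: the CTE body becomes a derived table in FROM
derived_rows = conn.execute("""
    SELECT person, total FROM (
        SELECT person, SUM(amount) AS total FROM orders GROUP BY person
    ) AS totals
    ORDER BY person
""").fetchall()

print(cte_rows == derived_rows)  # True
```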

I do not have the answer or the SQL script for the above SQL statement, if you are looking for one, because I just picked the sample script from the Microsoft website. If you have the solution, you can share it with me in the comment box below.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

REFERENCES

SHARE THIS:


[SQL] HOW SQL SERVER COMPARES STRINGS WITH TRAILING SPACES


A few months back I asked the Business Analyst a question: why do we need to perform trimming, using the SQL functions LTRIM() and RTRIM(), on the dataset we extract from the data warehouse before sending the data to our client? The Business Analyst is not a technical person, so I did not get an answer to my question. In the end, I answered my own question.

From what I understood through data profiling, some datasets contain trailing spaces because their original data sources use the char data type whereas, in our data warehouse, we use the varchar data type. During data ingestion from the source systems into our data warehouse, we did not perform any transformation, except that string-based columns were changed to the varchar data type. This led to the columns saving the data with trailing spaces.

Transact-SQL considers the strings ‘abc’ and ‘abc ‘ to be equivalent for most comparison operations.

An article from Microsoft support website states the above and shares a query that helps us to understand how the padding for character strings affects the semantics of WHERE and HAVING clause predicates. The only exception to this rule is the LIKE predicate because the purpose of the LIKE predicate is to perform pattern searches.
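The padding rule can be sketched in Python (an illustration of the ANSI behavior, not SQL Server code): before comparing, the shorter string is effectively padded with spaces to the length of the longer one.

```python
# Sketch of ANSI padding semantics for '=' on char/varchar comparisons.
def sql_equals(a: str, b: str) -> bool:
    width = max(len(a), len(b))
    return a.ljust(width) == b.ljust(width)   # pad the shorter with spaces

print(sql_equals("abc", "abc "))  # True: the trailing space is ignored
print("abc" == "abc ")            # False: plain Python does not pad
```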

Let us look into the given query below:

CREATE TABLE #tmp (c1 varchar(10))
GO
INSERT INTO #tmp VALUES ('abc ')
INSERT INTO #tmp VALUES ('abc')
GO
SELECT DATALENGTH(c1) as 'EqualWithSpace', * FROM #tmp WHERE c1 = 'abc '
SELECT DATALENGTH(c1) as 'EqualNoSpace  ', * FROM #tmp WHERE c1 = 'abc'
SELECT DATALENGTH(c1) as 'GTWithSpace   ', * FROM #tmp WHERE c1 > 'ab '
SELECT DATALENGTH(c1) as 'GTNoSpace     ', * FROM #tmp WHERE c1 > 'ab'
SELECT DATALENGTH(c1) as 'LTWithSpace   ', * FROM #tmp WHERE c1 < 'abd '
SELECT DATALENGTH(c1) as 'LTNoSpace     ', * FROM #tmp WHERE c1 < 'abd'
SELECT DATALENGTH(c1) as 'LikeWithSpace ', * FROM #tmp WHERE c1 LIKE 'abc %'
SELECT DATALENGTH(c1) as 'LikeNoSpace   ', * FROM #tmp WHERE c1 LIKE 'abc%'
GO
DROP TABLE #tmp

You can try the above on your local machine or on SQL Fiddle.

It creates a temporary table to hold the two records. You can create a permanent table too; no issues. Then, it inserts the two records into the temporary table. The next few lines compare the results of executing queries with different conditions. When I execute the first SELECT statement, it returns the result below:

SELECT DATALENGTH(c1) as 'EqualWithSpace', * FROM #tmp WHERE c1 = 'abc '

EqualWithSpace    c1
4                 abc
3                 abc

Let me also explain the definition of DATALENGTH() used in the above query. The DATALENGTH() function returns the length (in bytes) used to represent an expression, counting both leading and trailing spaces. You may ask what the difference is between DATALENGTH() and LEN(). We use the LEN() function to return the number of characters in an expression, excluding trailing spaces.
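The difference can be sketched in Python (an approximation assuming a single-byte varchar collation; these are not the actual SQL Server implementations):

```python
# DATALENGTH(): number of bytes, trailing spaces included
def datalength(s: str) -> int:
    return len(s.encode("latin-1"))

# LEN(): number of characters, excluding trailing spaces
def sql_len(s: str) -> int:
    return len(s.rstrip(" "))

print(datalength("abc "))  # 4
print(sql_len("abc "))     # 3
print(sql_len(" abc"))     # 4  (leading spaces still count)
```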

Okay, back to the result table. It returns two rows for 'EqualWithSpace' because T-SQL considers the strings 'abc ' and 'abc' to be the same, even though they use a different number of bytes to represent the string.

When we execute the SELECT statements below, a different result is returned; let us look closer at the results.

SELECT DATALENGTH(c1) as 'LikeWithSpace ', * FROM #tmp WHERE c1 LIKE 'abc %'
SELECT DATALENGTH(c1) as 'LikeNoSpace   ', * FROM #tmp WHERE c1 LIKE 'abc%'

LikeWithSpace     c1
4                 abc

LikeNoSpace       c1
4                 abc
3                 abc

As mentioned earlier, the LIKE predicate performs a pattern search. When we execute with the condition WHERE c1 LIKE 'abc %', the first result above ('LikeWithSpace') returns one record that exactly matches the pattern with the trailing space.

The second SELECT statement returns two rows because its search pattern does not include the space at the end of the string 'abc'; instead it uses the wildcard character %, which matches zero or more characters.
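This no-padding behavior of LIKE can be sketched in Python with a tiny pattern translator (an illustration of the semantics, not SQL Server internals):

```python
import re

# Translate a LIKE pattern to a regex: '%' matches any run of characters,
# '_' matches exactly one character; everything else matches literally.
def sql_like(value: str, pattern: str) -> bool:
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    return re.fullmatch(regex, value) is not None

print(sql_like("abc", "abc %"))   # False: no padding is applied to the value
print(sql_like("abc ", "abc %"))  # True
print(sql_like("abc", "abc%"))    # True: '%' matches zero characters
```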

I hope you can get some ideas from the explanation above. If you want to read more about it, you can refer to the reference links below, which I found useful. Further reading on LEN() and DATALENGTH() can be found there too.

I hope this is simple enough for quick understanding and learning about SQL. You can check out other SQL topics in my blog from this link. If you have any feedback, please leave me a message in the comment box below.

References

Apache Flink

This is the first time I have come across Apache Flink; I saw it in a job description. Today, I will try to write down some Apache Flink information in this entry, going from zero to something. My post may not be accurate because I am sharing what I read and what I understand. Hopefully, I can get some feedback and sharing to learn more.

What is Apache Flink?

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, and to perform computations at in-memory speed and at any scale. This definition is taken from the website, https://flink.apache.org/flink-architecture.html

Unbounded and Bounded Data Streams

Credit card transactions, sensor measurements, machine logs, and user interactions on a website or mobile application are all generated as streams. Let me do a quick summary of bounded and unbounded data.

Bounded Data

  • Bounded data is finite; it has a beginning and an end. 
  • Bounded data maps to batch processing. It can be processed by ingesting all the data before performing any computations.
  • Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted.

Unbounded Data

  • Unbounded data is an ever-growing, essentially infinite data set.
  • Unbounded data corresponds to stream processing. It must be continuously processed: it is not possible to wait for all input data to arrive, because the input is unbounded and will not be complete at any point in time.
  • Processing unbounded data often requires that events are ingested in a specific order.
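The bounded-versus-unbounded distinction above can be sketched in plain Python (a conceptual illustration only, not Flink code; the function names are made up):

```python
import itertools
from typing import Iterable, Iterator

# Bounded / batch: ingest the complete data set, then compute once.
def batch_sum(records: Iterable[int]) -> int:
    data = list(records)        # all input must arrive before computing
    return sum(data)

# Unbounded / streaming: process each element as it arrives and keep
# emitting updated results; the input never has to "finish".
def streaming_sum(records: Iterable[int]) -> Iterator[int]:
    total = 0
    for r in records:
        total += r
        yield total

print(batch_sum([1, 2, 3]))  # 6
# Take the first 3 running totals from an infinite (unbounded) input:
print(list(itertools.islice(streaming_sum(itertools.count(1)), 3)))  # [1, 3, 6]
```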

While researching Apache Flink, I saw an article about Apache Apex saying that it can be used to process (and is very capable of processing) both unbounded and bounded data. For more reading on this article, refer to the reference link below.

Run In All Common Cluster Environment

Apache Flink is a distributed system and requires compute resources to execute applications. Apache Flink integrates with all common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes. (I did not know what Apache Mesos is; I will Google it and share something about these terms.) Apache Flink automatically identifies the required resources based on the application's configured parallelism and requests them from the resource manager. In case of a failure, Apache Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls.

Run Applications at Any Scale, at In-Memory Speed

Apache Flink runs stateful streaming applications at any scale and is optimized for local state access. Applications are parallelized into tasks that are distributed and concurrently executed in a cluster, so an application can leverage virtually unlimited CPUs, main memory, disk, and network IO. State is kept locally, often in memory or in access-efficient on-disk data structures. Moreover, Apache Flink easily maintains very large application state. Its asynchronous and incremental checkpointing algorithm (which copies local state to durable storage) ensures minimal impact on processing latencies while guaranteeing exactly-once state consistency.

My apologies that I could not explain much at this point; I am in the midst of understanding this "new" technology that I only started Googling today. I will update this entry again once I become more knowledgeable.

References
