Home Artificial Intelligence When Do You Self Join? A Handy Trick

When Do You Self Join? A Handy Trick

1
When Do You Self Join? A Handy Trick

Intermediate SQL for ETL dev to Data Engineer Transition

Photo by Campaign Creators on Unsplash

There may be nothing called Self take part SQL. Hear me out.

Often Data Analysts and Data Engineers need to search out patterns in data that aren’t obvious. Nonetheless the insights could be generated and patterns identified through the use of common SQL practices, like Self Join.

Many beginners often struggle to know Self Join. They confuse Self Take part SQL as one other command just like the Inner or Left Join. But Self Join shouldn’t be a keyword in SQL. Neither is it a command in SQL.

Self-join is similar to a standard Join(Inner/Left/Right/Outer), between two tables. Nonetheless, in a self-join, the 2 tables are the identical but act as different tables via their aliases.

Self Join is usually considered a foul practice in Data Engineering. They are saying it’s dangerous to make use of. But, there are scenarios when using a self-join is practical and the very best option to tackle the problem.

Let’s see a number of examples:

Hierarchical Data:

Self-joins are useful for working with hierarchical data. In an organisational chart, we are able to join a table to itself based on manager-employee relationships to search out worker reports, department heads, etc.

Let’s generate some free data to check this.

create table worker
(
employee_id int,
employee_name varchar(10),
EmpSSN varchar(11),
manager_id int null,
city varchar(20)
);

--Correct data
insert into worker values(1, 'Jack', '555-55-5555','','Kolkata');
insert into worker values (2, 'Joe', '555-56-5555',1,'Kolkata');
insert into worker values (3, 'Fred', '555-57-5555',2,'Dehli');
insert into worker values (4, 'Mike', '555-58-5555',2,'Kolkata');
insert into worker values (5, 'Cathy', '555-59-5555',2,'Dehli');
insert into worker values (6, 'Lisa', '555-70-5555',3,'Bangalore');

Here the worker details of an organisation are stored together with their manager ID. We will use Self join to discover the manager of all distinct employees.

select emp.*,isnull(mgr.employee_name,'Boss') as managerName from worker emp
left join worker mgr on emp.manager_id = mgr.employee_id

Here the query returns the manager’s name corresponding to every worker by joining the identical worker table with itself on manager_id = employee_id.

Return the name of the manager through the use of self-join, image by writer

WARNING: Don’t forget to make use of alias for the worker table, to distinguish between two parts of the self join. Also, the join column ought to be appropriately used.

Similarly, we are able to even find the various levels of hierarchy by recursively joining the CTE as a self-join with itself.

WITH
EmployeeHierarchy AS (
SELECT
employee_id, employee_name, manager_id, 0
AS
level
FROM
worker
WHERE
manager_id = 0
UNION ALL
SELECT
emp.employee_id, emp.employee_name, emp.manager_id, eh.level + 1
FROM
worker emp
JOIN
EmployeeHierarchy eh
ON
emp.manager_id = eh.employee_id
)
SELECT
employee_id, employee_name, level
FROM
EmployeeHierarchy;

Products and Categories:

This could be related to hierarchical data only, but that is a selected subset. Self Joins could be extremely handy to discover all mixtures of products, categories and subcategories. Within the manufacturing industry, this will provide components and sub-components, in e-commerce it may be used to get similar products or categories.

Let’s learn it through an example:

Create a table and insert dummy data:

create table bom (item_id int, parent_id int null,description varchar(50), quantity int)

INSERT INTO bom (item_id, parent_id, description, quantity)
VALUES (1, NULL, 'Widget (Foremost Assembly)', 1),
(2, 1, 'Gear A', 2),
(3, 1, 'Spring B', 4),
(4, 2, 'Screw C (Small)', 10),
(5, 2, 'Screw C (Large)', 5),
(6, 3, 'Nut D', 1);

We created a table with columns of item_id, parent_id, description, and quantity. We’ve got also inserted sample data from a producing line, where ‘Widget (Foremost Assembly)’ is the parent product and Gear, Screw, nut etc. are sub-products.

We will use self-join to discover the parent-child relationship, and recursive self-join can discover the total product sequence.

Let’s review this with the query and the outcomes:

WITH recursive_bom AS (
SELECT item_id, parent_id, description, quantity, solid(description as nvarchar(255)) AS full_path
FROM bom
WHERE parent_id IS NULL -- Start line: Top-level items
UNION ALL
SELECT
b.item_id,
b.parent_id,
b.description,
b.quantity,
solid(CONCAT(rb.full_path, '.', b.description) as nvarchar(255)) AS full_path
FROM bom b
INNER JOIN recursive_bom rb ON b.parent_id = rb.item_id
)
SELECT item_id, description, quantity, full_path
FROM recursive_bom
ORDER BY full_path;

Sample Data and Output

Generating product sequence using self join.
Image by writer

User Segmentation:

In Business and Data Analytics, a very important aspect is user segmentation. Users are sometimes categorised based on their purchase behaviour, their frequency of interaction with the business etc. A self-join could be an incredible option to discover these patterns within the transaction data.

Let’s consider the below example:

We’d like to discover returning customers inside a specified time period ( 7 days ) for an eccommerce business. Similar challenges could be found on the web, example here.

Let’s create a test table and insert some sample records within the table.

Trick: You’ll be able to ask ChatGpt to generate the test data as per need.

Create a table and insert dummy data:

create table ecom_tran (
tranid int,
userid int,
created_date datetime,
itemname varchar(50)
)

INSERT INTO ecom_tran (tranid, userid, created_date, itemname, price_paid)
VALUES
(1, 201, '2024-02-23 11:45:00', 'Running Shoes'),
(2, 202, '2024-02-24 10:00:00', 'Yoga Mat'),
(3, 203, '2024-02-26 14:10:00', 'Water Bottle'),
(4, 204, '2024-02-27 09:30:00', 'Gym Bag'),
(5, 205, '2024-02-28 12:00:00', 'Protein Powder'),
(6, 201, '2024-02-29 15:15:00', 'Phone Case'),
(7, 206, '2024-03-01 10:45:00', 'Webcam'),
(8, 202, '2024-03-02 16:30:00', 'Pen Drive'),
(9, 207, '2024-03-04 12:00:00', 'Powerbank'),
(10, 203, '2024-03-05 09:00:00', 'Monitor'),
(11, 101, '2024-03-06 11:00:00', 'Mouse'),
(12, 102, '2024-03-07 14:45:00', 'Speaker'),
(13, 103, '2024-03-08 10:10:00', 'Tablet'),
(14, 101, '2024-03-09 13:30:00', 'Headphones'),
(15, 104, '2024-03-10 17:00:00', 'Book'),
(16, 102, '2024-03-11 08:20:00', 'Coffee Maker'),
(17, 105, '2024-03-12 11:15:00', 'Smartwatch'),
(18, 101, '2024-03-13 15:45:00', 'Shirt'),
(19, 103, '2024-03-14 12:30:00', 'Laptop')

Solution approach:

Within the sample table created, we’ve the userid, transactionid and the created_date column that are relevant to the challenge. As we’ve been asked to discover the users who’ve made a minimum of 2 purchases inside a 7 day period, we are able to consider the below approach:

  1. Check how many various transactions the users have made.
  2. Mix each transaction together with itself to discover all possible pairs of transactions by the identical user.
  3. Calculate the date difference between the 2 mixtures.
  4. The date difference ought to be > 0 and < 7. This can ensure only records where transactions have been made inside 7 days are returned.
  5. We will collect the distinct userids to discover the users which have returning transactions inside 7 days.

This can be a classic use case to contemplate Self-Join together with non-equi join.

SELECT a.userid,
a.tranid AS id1,
a.created_date AS created_at1,
b.tranid AS id2,
b.created_date AS created_at2,
mod(DATEDIFF(dd,a.created_date,b.created_date))
FROM ecom_tran a
JOIN ecom_tran b
ON a.userid=b.userid
AND a.tranid <> b.tranid
ORDER BY a.userid

The above query generates all mixtures of transactions made by the identical users. We’ve got achieved this by joining the ecom_tran to itself, with the assistance of aliases, on column userid. This inner join ensures, that only transactions of the identical user are returned.

Nonetheless the non-equi join is on a.tranid <> b.tranid , this ensures the identical transactions aren’t repeated.

We also calculated the date difference between the 2 transactions.

Queries and pictures by the writer

Now if we just filter on those where the date difference between the created_dates is > 0 and < 7, that may give us all transactions that happened inside 7 days by the identical user. We will take a definite value of the userid column to only discover the users that made returning purchases inside 7 days.

Self join example in user segmentation.
Image by writer

Conclusion:

I hope you bought a general understanding and intuition about how the self-join performs in SQL. Though self-joins aren’t very intuitive to know and use, there are specific use cases where they’re indispensable.

I actually have covered only a number of of the possible scenarios. Yet, that is sufficient to provide you confidence while facing any SQL interview questions. Even when the query might need an intuitive understanding to unravel the challenges, these concepts will help to discover what approach to make use of.

1 COMMENT

LEAVE A REPLY

Please enter your comment!
Please enter your name here