SQL Challenge #1 – Solution

Challenge – When the marketing department calls

The marketing department is planning a private event for its 10 TOP customers. The marketing department provides the following guidelines for customer selection:

  • Only customers with at least 3 orders in the last quarter of 2022 may be considered
  • The 10 customers with the most orders in the last quarter of 2022 should be considered
  • If more customers had the same number of orders in the last quarter of 2022, they should also be invited – regardless of sales.
  • For the total amount of sales, only the last 3 orders from 2022 may be taken into account.

Solution

When you are challenged with multiple kind of requirements, you shall it break down into as small parts as possible. This allows you to concentrate exclusively on the relevant sub-area.

Only Customers with at least 3 orders in the last quarter of 2022

It is intended to be an event for the company’s top 10 customers. The selection criteria is the number of orders, regardless of the order value. This creates a conflict for those customers who – as can be seen in the example – also have 38 orders, but have made less revenue. The marketing department wants to avoid this and has set an additional condition that – if there are several customers with the same number of orders – these customers must also be invited.

SELECT	TOP (10)
		o_custkey,
		COUNT_BIG(*)	AS	num_orders
FROM	dbo.orders
WHERE	o_orderdate >= '2022-10-01'
		AND o_orderdate <= '2022-12-31'
GROUP BY
		o_custkey
HAVING	COUNT_BIG(*) >= 3
ORDER BY
		COUNT_BIG(*) DESC
SELECT	TOP (10) WITH TIES
		o_custkey,
		COUNT_BIG(*)	AS	num_orders
FROM	dbo.orders
WHERE	o_orderdate >= '2022-10-01'
		AND o_orderdate <= '2022-12-31'
GROUP BY
		o_custkey
HAVING	COUNT_BIG(*) >= 3
ORDER BY
		COUNT_BIG(*) DESC

This solution approach takes into account that only the 10 customers with the most orders are selected. However, this solution does not allow for customers with the same number of orders to also be listed. The WITH TIES option in a SELECT statement, when used with TOP (n), ensures that if there are additional rows that have the same values as the last row in the result set according to the ORDER BY clause, those rows are included in the final output as well, potentially resulting in more than n rows being returned.

Without TiesWith Ties
List of 10 customers with the highest number of ordersList of 13 customers with the highest number of orders
The option WITH TIES include all customers with 38 orders

Only the last 3 orders are considered for the aggregation

This part of the challenge presents us with bigger problems. It should be made clear that there are different approaches to a solution! A clear hint in the challenge was that this query is frequently updated in a dashboard. It has to be performant!

;WITH l
AS
(
	SELECT	ROW_NUMBER() OVER
			(
			  PARTITION BY o_custkey
			  ORDER BY o_orderdate DESC
			) AS rn,
			o_custkey,
			o_totalprice
	FROM	dbo.orders
	WHERE	o_orderdate  >= '2022-10-01'
			AND o_orderdate <= '2022-12-31'
)
SELECT	o_custkey,
		o_totalprice
FROM	l
WHERE	o_custkey = 1184546
		AND rn <= 3;
GO
SELECT	TOP (3)
		o_custkey,
		o_totalprice
FROM	dbo.orders
WHERE	o_custkey = 1184546
		AND o_orderdate  >= '2022-10-01'
		AND o_orderdate <= '2022-12-31'
ORDER BY
		o_orderdate DES
 SQL Server Execution Times:
   CPU time = 483 ms,  elapsed time = 243 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 3 ms.
 SQL Server Execution Times:
   CPU time = 437 ms,  elapsed time = 224 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Both solutions behave almost identically (for ONE customer) in their query behavior. Because we are not usin any indexes at this time, this behavior should not be surprising. But even with a suitable index, there are no significant differences in runtime behavior.

/* Creation of an nonclustered index covering the predicates and the output */
CREATE NONCLUSTERED INDEX nix_orders_o_custkey
ON dbo.orders
(
	o_custkey,
	o_orderdate DESC
)
INCLUDE (o_totalprice)
WITH
(
	DATA_COMPRESSION = PAGE,
	SORT_IN_TEMPDB = ON
);
GO

Let’s assemble the elements

There is hardly a measurable difference in the performance of solutions. Nevertheless, the solution you choose is crucial for further processing.

The variant using TOP (n) cannot be used in a JOIN, as the number of records to be delivered is always determined depending on the current customer number. In a JOIN, it is not possible to use attributes of the outer table in the subquery. Only the APPLY operator has this option.

;WITH cl
AS
(
	/*
		All customers with more than 3 orders
		in the last quarter
	*/
	SELECT	TOP (10) WITH TIES
			o_custkey,
			COUNT_BIG(*)	AS	num_orders
	FROM	dbo.orders
	WHERE	o_orderdate >= '2022-10-01'
			AND o_orderdate <= '2022-12-31'
	GROUP BY
			o_custkey
	HAVING	COUNT_BIG(*) >= 3
	ORDER BY
			COUNT_BIG(*) DESC
)
SELECT	c.c_custkey,
		c.c_name,
		cl.num_orders,
		SUM(lo.o_totalprice)	AS	total_amount
FROM	dbo.customers AS c
		INNER JOIN cl ON (c.c_custkey = cl.o_custkey)
		INNER JOIN
		(
			/* All orders from each of the TOP n customers */
			SELECT	ROW_NUMBER() OVER (PARTITION BY o.o_custkey ORDER BY o.o_orderdate DESC) AS rn,
					o.o_custkey,
					o.o_totalprice
			FROM	dbo.orders AS o
			WHERE	o_orderdate >= '2022-10-01'
					AND o_orderdate <= '2022-12-31'
		) AS lo
		ON (c.c_custkey = lo.o_custkey)
WHERE	lo.rn <= 3
GROUP BY
		c.c_custkey,
		c.c_name,
		cl.num_orders
ORDER BY
		cl.num_orders DESC,
		SUM(lo.o_totalprice) DESC;
GO
;WITH cl
AS
(
	/*
		All customers with more than 3 orders
		in the last quarter
	*/
	SELECT	TOP (10) WITH TIES
			o_custkey,
			COUNT_BIG(*)	AS	num_orders
	FROM	dbo.orders
	WHERE	o_orderdate >= '2022-10-01'
			AND o_orderdate <= '2022-12-31'
	GROUP BY
			o_custkey
	HAVING	COUNT_BIG(*) >= 3
	ORDER BY
			COUNT_BIG(*) DESC
)
SELECT	c.c_custkey,
		c.c_name,
		cl.num_orders,
		SUM(lo.o_totalprice)	AS	total_amount
FROM	dbo.customers AS c
		INNER JOIN cl
		ON (c.c_custkey = cl.o_custkey)
		CROSS APPLY
		(
			/* The last 3 orders */
			SELECT	TOP (3)
					o.o_totalprice
			FROM	dbo.orders AS o
			WHERE	o.o_custkey = c.c_custkey
					AND o_orderdate >= '2022-10-01'
					AND o_orderdate <= '2022-12-31'
			ORDER BY
					o_orderdate DESC
		) AS lo
GROUP BY
		c.c_custkey,
		c.c_name,
		cl.num_orders
ORDER BY
		cl.num_orders DESC,
		SUM(lo.o_totalprice) DESC;
GO
JOIN variantAPPLY variant
SQL Server Execution Times:
CPU time = 2015 ms, elapsed time = 1045 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 1094 ms, elapsed time = 557 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
Execution Plan when using an INNER JOINExecution Plan with efficient nested loop operation against the indexed orders table

Let’s make it perfect

Finally, let’s implement indexes to access the tables as efficiently as possible.

/* Placing a clustered index by adding the PRIMARY KEY to dbo.customers */
ALTER TABLE dbo.customers
ADD CONSTRAINT pk_customers PRIMARY KEY CLUSTERED (c_custkey)
WITH
(
	DATA_COMPRESSION = PAGE,
	SORT_IN_TEMPDB = ON
);
GO
/* This index is for the optimization of the search for the TOP (10) customers */
CREATE NONCLUSTERED INDEX nix_orders_o_orderdate
ON dbo.orders
(o_orderdate)
INCLUDE (o_custkey)
WITH
(
	DATA_COMPRESSION = PAGE,
	SORT_IN_TEMPDB = ON
);
GO
 SQL Server Execution Times:
   CPU time = 1078 ms,  elapsed time = 575 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
 SQL Server Execution Times:
   CPU time = 484 ms,  elapsed time = 263 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Thank you for reading.