SQL Challenge #2 – Solution

Challenge – Aggregations in Partitioned Tables

The customer operates a 45 TB data warehouse that is updated 24/7 with a frequence of 5 minutes. Partitioning is used on the biggest table to increase efficiency in loading processes. Before the loading starts the loading process must evaluate the last value from the attribute [o_orderkey] from the table [dbo].[orders]:

  • This request is causing long runtimes and massive resources (CPU / Memory) when determining the maximum value.
  • As impact of the long running evaluation process the timeframe for the load sometimes exceed.
  • You are asked to find out the root cause of the long runtime and find an adequate solution to the problem.

Business Condition

The business regularly imports new data into the table at short intervals and needs to update existing data. Therefore, they use the SWITCH function to detach existing partitions from the table to edit them outside the table. Using a NON-aligned index cannot be used for the existing problem because NON-aligned indexes do not support SWITCH() and MERGE().

Query to be optimized

SELECT MAX(o_orderkey) FROM dbo.orders;

Execution Plan

Parallel execution plan for recieving the MAX value on an indexed column
Execution Plan

Solution

OK, I admit. This challenge is tough, because it is not a problem that can be easily solved with an index. The problem for this SQL Challenge is partitioning. For partitioned tables, partition elimination is performed before applying predicates. Although the primary key for the table [dbo].[orders] is set to [o_orderkey], [o_orderdate], SQL Server must completely search each partition. This must be done because it is not clear in which partition the maximum value is present.

The query itself does not have the partition key as a criteria so all partitions must be scanned. Paul White and Izik Ben Gan describes this “lack of optimization” in detail:

AuthorBlog Post
Paul Whitehttps://www.sql.kiwi/2013/07/aggregates-and-partitioning/
Izik Ben Ganhttps://www.itprotoday.com/early-versions/max-and-min-aggregates-against-partitioned-tables

To optimize the query we cannot help with an additional indexe for two reasons:

  • A NON-aligned index prevent the business to switch partitions out for updating data
  • An additional index will behave the same way (SCAN) than the Primary Key. There is a small benefit if only the key attribute o_orderkey is part of the index but the root cause cannot be solved with an additional index!
/*
	The trick is to do the aggregation by partition
	and evaluate the aggregation at the end!
*/
SELECT	MAX(part_agg.o_orderkey) AS max_o_orderkey
FROM	sys.partitions AS p
		CROSS APPLY
		(
			SELECT	MAX(o_orderkey) AS o_orderkey
			FROM	dbo.orders
			WHERE	$PARTITION.pf_o_orderdate(o_orderdate) = p.partition_number
		) AS part_agg;
GO

Original Query

Optimized Query

Solution by Seba

In the comments section of the SQL Challenge #2, the user Seba suggested using a columnstore index to solve the problem. Thank you very much for the suggestion!

I also tested this variant as an alternative and came to the following conclusion:

  • Due to the nature of a columnstore index (no sorting), a HASH operator must be used for aggregation. The HASH operator is a STOP operator and requires additional memory.
  • The columnstore index will slow down the ETL processes, as UPDATE operations always initiate a DELETE -> INSERT process. (-)
  • The columnstore index is implemented as an aligned index and does not prevent the SWITCH operations (+)
  • The columnstore index does not prevent all partitions from being scanned. (-)
  • The columnstore index is significantly faster compared to the original query. However, this is more due to compression than to efficient access to the data

Demo

/* Create the columnstore index */
CREATE NONCLUSTERED COLUMNSTORE INDEX nccs_orders_o_orderkey
ON dbo.orders (o_orderkey)
ON ps_o_orderdate (o_orderdate);
GO

/* as aligned - partitioned index */
SELECT	partition_number,
		data_compression_desc,
		rows
FROM	sys.partitions
WHERE	object_id = OBJECT_ID(N'dbo.orders')
		AND index_id = 3;
GO

/* run the query to test the performance */
SELECT MAX(o_orderkey) FROM dbo.orders;
GO

IO and CPU statistics

Table 'orders'. Scan count 2, logical reads 0
Table 'orders'. Segment reads 11, segment skipped 0.

SQL Server Execution Times:
    CPU time = 31 ms,  elapsed time = 53 ms.

Execution Plan

The depiction shows a full scan on the columnstore indexe and an HASH Aggregation which costs memory grants.

Thank you for reading.