Sorting Algorithms in DBMS

Sorting in Database Management Systems

Introduction

Sorting is a fundamental operation in database management systems (DBMS) that is used in various scenarios, including:

ORDER BY clause in SQL queries: Sorting data to present it in a specific order.
Grouping operations: Arranging data for aggregate functions.
Sort-merge join algorithm: Combining data from multiple tables.
Duplicate removal: Identifying and eliminating redundant records.
Bulk loading of B+ tree indexes: Creating indexes for efficient data retrieval.

When data exceeds available main memory, external sorting algorithms are employed.

DBMS Interaction with the Operating System

The DBMS requests data from the OS in terms of pages.
Page size is a configurable parameter, typically 4KB or 8KB.
The number of pages that can be buffered in memory is another DBMS parameter.
An SQL query is translated into a sequence of page requests by the DBMS.
Each record has a record ID (rid) that identifies its page (disk address).
Page I/O operations are the most significant cost factor.

Importance of Indexing

Indexing is crucial for query performance.
Indexing structures must adapt to file changes, especially growth.

2-Way Merge Sort

A basic external sorting algorithm using 3 buffer pages.

Pass 1: Read a page, sort it, and write it back to disk. Only one buffer page is used.
Pass 2, 3, …: Merge sorted pages using three buffer pages (two for input, one for output).

Cost Analysis

Each pass involves reading and writing each page in the file, resulting in a cost of $2N$ where $N$ is the number of pages.
The number of passes is log_2 N + 1.
Total cost: 2N (log_2 N + 1).

General External Merge Sort

An optimized external sorting algorithm that utilizes available buffer pages more efficiently.

Phase 1: Prepare

Use $B$ buffer pages to construct starter lists (sorted runs) as large as possible.
This reduces the number of passes needed.

Phase 2: Merge

Merge as many sorted sublists as possible into one long sorted list.
Keep 1 buffer page for output.
Use $B-1$ buffers to read from $B-1$ lists.

Algorithm Details

To sort a file with $N$ pages using $B$ buffer pages:

Pass 0: Use $B$ buffer pages to produce sorted runs of $B$ pages each. The number of runs is N / B.
Passes 1, 2, …: Merge $B-1$ runs in each pass.

Cost Calculation

Number of passes: 1 + log_{B-1} (N / B)
Total cost: 2N * (# \text{ of passes})

Example

Sorting a file with 108 pages using 5 buffer pages:

Pass 0: 108 / 5 = 22 sorted runs of 5 pages each (the last run has only 3 pages).
Pass 1: 22 / 4 = 6 sorted runs of 20 pages each (the last run has only 8 pages).
Pass 2: 2 sorted runs: one of 80 pages, one of 28 pages.
Pass 3: Sorted file of 108 pages.
Total I/O costs: $2 * 108 * 4 = 864$

Number of Passes based on buffer size

| #Pages in File | B=3 | B=5 | B=9 | B=17 | B=129 | B=257 |
| :-------------- | :-: | :-: | :-: | :-: | :---: | :---: |
| 100 | 7 | 4 | 3 | 2 | 1 | 1 |
| 1,000 | 10 | 5 | 4 | 3 | 2 | 2 |
| 10,000 | 13 | 7 | 5 | 4 | 2 | 2 |
| 100,000 | 17 | 9 | 6 | 5 | 3 | 3 |
| 1,000,000 | 20 | 10 | 7 | 5 | 3 | 3 |
| 10,000,000 | 23 | 12 | 8 | 6 | 4 | 3 |
| 100,000,000 | 26 | 14 | 9 | 7 | 4 | 4 |
| 1,000,000,000 | 30 | 15 | 10 | 8 | 5 | 4 |

Utilizing available buffers significantly reduces the number of passes.
High fan-in during merging is important.

Summary

B+ trees are widely used to speed up equality and range queries.
For using alternative 1, the file must be sorted on the search key, which can be done on the fly.
Sorting can be done quickly if the database has adequate buffer space.