Sorting Algorithms in DBMS

Sorting in Database Management Systems

Introduction

Sorting is a fundamental operation in database management systems (DBMS) that is used in various scenarios, including:

  • ORDER BY clause in SQL queries: Sorting data to present it in a specific order.
  • Grouping operations: Arranging data for aggregate functions.
  • Sort-merge join algorithm: Combining data from multiple tables.
  • Duplicate removal: Identifying and eliminating redundant records.
  • Bulk loading of B+ tree indexes: Creating indexes for efficient data retrieval.

When data exceeds available main memory, external sorting algorithms are employed.

DBMS Interaction with the Operating System

  • The DBMS requests data from the OS in terms of pages.
  • Page size is a configurable parameter, typically 4KB or 8KB.
  • The number of pages that can be buffered in memory is another DBMS parameter.
  • An SQL query is translated into a sequence of page requests by the DBMS.
  • Each record has a record ID (rid) that identifies its page (disk address).
  • Page I/O operations are the most significant cost factor.

Importance of Indexing

  • Indexing is crucial for query performance.
  • Indexing structures must adapt to file changes, especially growth.

2-Way Merge Sort

A basic external sorting algorithm using 3 buffer pages.

  • Pass 1: Read a page, sort it, and write it back to disk. Only one buffer page is used.
  • Pass 2, 3, …: Merge sorted pages using three buffer pages (two for input, one for output).
Cost Analysis
  • Each pass involves reading and writing each page in the file, resulting in a cost of 2N2N where NN is the number of pages.
  • The number of passes is log_2 N + 1.
  • Total cost: 2N (log_2 N + 1).

General External Merge Sort

An optimized external sorting algorithm that utilizes available buffer pages more efficiently.

Phase 1: Prepare
  • Use BB buffer pages to construct starter lists (sorted runs) as large as possible.
  • This reduces the number of passes needed.
Phase 2: Merge
  • Merge as many sorted sublists as possible into one long sorted list.
  • Keep 1 buffer page for output.
  • Use B1B-1 buffers to read from B1B-1 lists.
Algorithm Details

To sort a file with NN pages using BB buffer pages:

  • Pass 0: Use BB buffer pages to produce sorted runs of BB pages each. The number of runs is N / B.
  • Passes 1, 2, …: Merge B1B-1 runs in each pass.
Cost Calculation
  • Number of passes: 1 + log_{B-1} (N / B)
  • Total cost: 2N * (# \text{ of passes})

Example

Sorting a file with 108 pages using 5 buffer pages:

  • Pass 0: 108 / 5 = 22 sorted runs of 5 pages each (the last run has only 3 pages).
  • Pass 1: 22 / 4 = 6 sorted runs of 20 pages each (the last run has only 8 pages).
  • Pass 2: 2 sorted runs: one of 80 pages, one of 28 pages.
  • Pass 3: Sorted file of 108 pages.
  • Total I/O costs: 21084=8642 * 108 * 4 = 864

Number of Passes based on buffer size

| #Pages in File | B=3 | B=5 | B=9 | B=17 | B=129 | B=257 |
| :-------------- | :-: | :-: | :-: | :-: | :---: | :---: |
| 100 | 7 | 4 | 3 | 2 | 1 | 1 |
| 1,000 | 10 | 5 | 4 | 3 | 2 | 2 |
| 10,000 | 13 | 7 | 5 | 4 | 2 | 2 |
| 100,000 | 17 | 9 | 6 | 5 | 3 | 3 |
| 1,000,000 | 20 | 10 | 7 | 5 | 3 | 3 |
| 10,000,000 | 23 | 12 | 8 | 6 | 4 | 3 |
| 100,000,000 | 26 | 14 | 9 | 7 | 4 | 4 |
| 1,000,000,000 | 30 | 15 | 10 | 8 | 5 | 4 |

  • Utilizing available buffers significantly reduces the number of passes.
  • High fan-in during merging is important.

Summary

  • B+ trees are widely used to speed up equality and range queries.
  • For using alternative 1, the file must be sorted on the search key, which can be done on the fly.
  • Sorting can be done quickly if the database has adequate buffer space.