Sorting Algorithms in DBMS
Sorting in Database Management Systems
Introduction
Sorting is a fundamental operation in database management systems (DBMS) that is used in various scenarios, including:
ORDER BYclause in SQL queries: Sorting data to present it in a specific order.- Grouping operations: Arranging data for aggregate functions.
- Sort-merge join algorithm: Combining data from multiple tables.
- Duplicate removal: Identifying and eliminating redundant records.
- Bulk loading of B+ tree indexes: Creating indexes for efficient data retrieval.
When data exceeds available main memory, external sorting algorithms are employed.
DBMS Interaction with the Operating System
- The DBMS requests data from the OS in terms of pages.
- Page size is a configurable parameter, typically 4KB or 8KB.
- The number of pages that can be buffered in memory is another DBMS parameter.
- An SQL query is translated into a sequence of page requests by the DBMS.
- Each record has a record ID (rid) that identifies its page (disk address).
- Page I/O operations are the most significant cost factor.
Importance of Indexing
- Indexing is crucial for query performance.
- Indexing structures must adapt to file changes, especially growth.
2-Way Merge Sort
A basic external sorting algorithm using 3 buffer pages.
- Pass 1: Read a page, sort it, and write it back to disk. Only one buffer page is used.
- Pass 2, 3, …: Merge sorted pages using three buffer pages (two for input, one for output).
Cost Analysis
- Each pass involves reading and writing each page in the file, resulting in a cost of where is the number of pages.
- The number of passes is log_2 N + 1.
- Total cost: 2N (log_2 N + 1).
General External Merge Sort
An optimized external sorting algorithm that utilizes available buffer pages more efficiently.
Phase 1: Prepare
- Use buffer pages to construct starter lists (sorted runs) as large as possible.
- This reduces the number of passes needed.
Phase 2: Merge
- Merge as many sorted sublists as possible into one long sorted list.
- Keep 1 buffer page for output.
- Use buffers to read from lists.
Algorithm Details
To sort a file with pages using buffer pages:
- Pass 0: Use buffer pages to produce sorted runs of pages each. The number of runs is N / B.
- Passes 1, 2, …: Merge runs in each pass.
Cost Calculation
- Number of passes: 1 + log_{B-1} (N / B)
- Total cost: 2N * (# \text{ of passes})
Example
Sorting a file with 108 pages using 5 buffer pages:
- Pass 0: 108 / 5 = 22 sorted runs of 5 pages each (the last run has only 3 pages).
- Pass 1: 22 / 4 = 6 sorted runs of 20 pages each (the last run has only 8 pages).
- Pass 2: 2 sorted runs: one of 80 pages, one of 28 pages.
- Pass 3: Sorted file of 108 pages.
- Total I/O costs:
Number of Passes based on buffer size
| #Pages in File | B=3 | B=5 | B=9 | B=17 | B=129 | B=257 |
| :-------------- | :-: | :-: | :-: | :-: | :---: | :---: |
| 100 | 7 | 4 | 3 | 2 | 1 | 1 |
| 1,000 | 10 | 5 | 4 | 3 | 2 | 2 |
| 10,000 | 13 | 7 | 5 | 4 | 2 | 2 |
| 100,000 | 17 | 9 | 6 | 5 | 3 | 3 |
| 1,000,000 | 20 | 10 | 7 | 5 | 3 | 3 |
| 10,000,000 | 23 | 12 | 8 | 6 | 4 | 3 |
| 100,000,000 | 26 | 14 | 9 | 7 | 4 | 4 |
| 1,000,000,000 | 30 | 15 | 10 | 8 | 5 | 4 |
- Utilizing available buffers significantly reduces the number of passes.
- High fan-in during merging is important.
Summary
- B+ trees are widely used to speed up equality and range queries.
- For using alternative 1, the file must be sorted on the search key, which can be done on the fly.
- Sorting can be done quickly if the database has adequate buffer space.