Distributed Database Processing Lecture Notes

UNIT 1: DISTRIBUTED DATA PROCESSING

Distributed Database

A distributed database is a collection of multiple logically interrelated databases, physically spread across various locations and connected via a computer network.
- Features:
- Databases are logically interrelated, often representing a single logical database.
- Data stored across multiple sites, managed by independent DBMSs.
- No multiprocessor configuration; connected via a network.
- Not merely a loosely connected file system.
- Incorporates transaction processing but differs from transaction processing systems.
- A modification made at any site is propagated to every site that holds the affected data.
- Used in applications processing large volumes of data accessed by multiple users simultaneously.
- Designed for heterogeneous database platforms.
- Maintains confidentiality and data integrity.

Distributed Data Storage

Two methods for storing data across sites:
1. Replication:
  - Entire relation redundantly stored at two or more sites.
  - If the entire database is available at all sites, it's a fully redundant database.
  - Advantage: Increases data availability, permits parallel processing of queries.
  - Disadvantage: Requires constant updates across all sites, increasing overhead.
  - Complexity in concurrency control due to concurrent access across multiple sites.
2. Fragmentation:
  - Relations divided into smaller parts (fragments) stored where needed.
  - Critical to ensure reconstructability (no data loss).
  - Advantages:
  - No duplication, alleviates consistency concerns.
  - Types of fragmentation:
  - Horizontal Fragmentation: Splits by rows; each tuple is assigned to at least one fragment.
  - Vertical Fragmentation: Splits by columns; the schema is divided into smaller schemas. All fragments must share a common candidate key so that the relation can be rebuilt by joining them.
  - Hybrid approaches of fragmentation and replication may also be employed.
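
The two fragmentation types and the reconstructability requirement can be illustrated with a minimal Python sketch. The EMPLOYEE relation, its column names, its rows, and all helper function names are assumptions made up for illustration; a real DDBMS would do this with relational operators, not Python lists.

```python
# Hypothetical EMPLOYEE relation; columns and rows are illustrative only.
EMPLOYEE = [
    {"emp_id": 1, "name": "Asha", "city": "Delhi", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 62000},
    {"emp_id": 3, "name": "Meena", "city": "Delhi", "salary": 58000},
]


def horizontal_fragment(rows, predicate):
    """Split by rows: keep whole tuples that satisfy the predicate."""
    return [dict(r) for r in rows if predicate(r)]


def vertical_fragment(rows, key, columns):
    """Split by columns: keep the key plus the chosen columns."""
    return [{c: r[c] for c in (key, *columns)} for r in rows]


def rebuild_horizontal(fragments, key):
    """Reconstruct by taking the union of the row fragments."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


def rebuild_vertical(left, right, key):
    """Reconstruct by joining the column fragments on the shared key."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left]


# Horizontal: rows split by city.
delhi = horizontal_fragment(EMPLOYEE, lambda r: r["city"] == "Delhi")
others = horizontal_fragment(EMPLOYEE, lambda r: r["city"] != "Delhi")

# Vertical: columns split, both fragments keep the key emp_id.
public = vertical_fragment(EMPLOYEE, "emp_id", ["name", "city"])
payroll = vertical_fragment(EMPLOYEE, "emp_id", ["salary"])

print(rebuild_horizontal([delhi, others], "emp_id") == EMPLOYEE)  # True
print(rebuild_vertical(public, payroll, "emp_id") == EMPLOYEE)    # True
```

Horizontal fragments are rebuilt with a union and vertical fragments with a join on the shared key, which is why every vertical fragment has to carry that key.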

Applications of Distributed Database

Utilized in:
- Corporate Management Information Systems
- Multimedia applications
- Military control systems
- Hotel chains
- Manufacturing control systems

Advantages of Distributed Database Systems

Rapid data processing due to participation of multiple sites in request processing.
High reliability and availability.
Reduced operating costs.
Easier expansion by adding additional sites.
Enhanced sharing abilities and local autonomy.

Disadvantages of Distributed Database Systems

Increased management and control complexity.
Potential security issues need careful handling.
Requirement for deadlock handling during transaction processing; inconsistency can arise otherwise.
Necessity for standardization in processing distributed database systems.

Types of DDBMS

Homogeneous Distributed Databases

Characteristics:
- Identical DBMS and operating systems used at all sites.
- Sites use similar software.
- Each site aware of and cooperates with others to process requests.
- Accessed through a single interface.
- Further divided into:
- Autonomous: Each database independent, integrated by a controlling application.
- Non-autonomous: Central master DBMS coordinates data updates.

Heterogeneous Distributed Databases

Characteristics:
- Different sites utilize different operating systems, DBMS products, and data models.
- Diverse schemas and software across sites.
- Query processing complexity due to varied schemas and software.
- Limited cooperation among sites for user requests.
- Types include:
- Federated: Independent heterogeneous systems integrated into a single operational system.
- Un-federated: Use a central module for access coordination.

Problem Areas of Distributed Database Systems

Complexity of management due to network structures and multiple sites.
High operational costs from maintenance, hardware, network communication, and labor.
Security challenges due to the distributed nature potentially exposed to data theft and misuse.
Integrity control is crucial; modifications must be consistent across all sites, leading to high communication and processing costs for data integrity enforcement.
Lack of standard protocols for transitioning from centralized to distributed DBMS decreases effectiveness.
Difficulty in designing robust distributed databases compared to centralized systems.

Distributed DBMS Architectures

Parameters Affecting DDBMS Architectures

Distribution: Physical distribution of data across sites.
Autonomy: Control distribution allowing DBMSs to operate independently.
Heterogeneity: Consistency or variety in data models, components, and databases.

Architectural Models

Client-Server Architecture
- Two-level architecture dividing functionality between servers (data management, queries) and clients (user interfaces).
- Types include:
  - Single Server Multiple Client
  - Multiple Server Multiple Client
Peer-to-Peer Architecture
- Each peer acts as both client and server for database services.
- Has schema levels including global conceptual, local conceptual, local internal, and external schema.
Multi-DBMS Architectures
- Involves integrating two or more autonomous systems.
- Six layers of schemas exist, facilitating multi-database architectures.

Transparency in DDBMS

Definition: Refers to the extent to which users can access a distributed database as a single entity, without needing to understand its complexities.
Types include:
- Transaction Transparency: Preserves the integrity and consistency of the database across distributed transactions.
- Performance Transparency: Ensures the system performs as if it were a centralized database.
- DBMS Transparency: Hides differences in local DBMSs for users.
- Distribution Transparency: Users treat the distributed nature as a single logical entity, unaware of distribution.
Types of Distribution Transparency:
- Fragmentation Transparency: Users unaware of data fragmentation.
- Location Transparency: Users do not need to know where data is stored.
- Replication Transparency: Users unaware of data being copied across sites.
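- Local Mapping Transparency: The lowest level of distribution transparency. Users must specify fragment names and the sites that hold them (including replicas); only the mapping onto each local DBMS is hidden.
- Naming Transparency: Ensures that every data item has a unique name across the distributed database.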

Global Directory Issues

A global directory maintains information about the location and makeup of the fragments in a distributed DBMS, or in a multi-DBMS that uses a global conceptual schema.
Features:
- A directory could be global or local.
- A directory can be maintained centrally at one site or distributed over multiple sites.
- Replication may involve single or multiple copies for reliability.
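
A directory lookup can be sketched in a few lines of Python. The fragment names, site names, and the `locate` function are hypothetical; the sketch only shows how a directory lets a query name a fragment without naming a site, which is the basis of location and replication transparency.

```python
# Hypothetical global directory: fragment name -> sites holding a copy.
GLOBAL_DIRECTORY = {
    "EMP_DELHI": ["site_a", "site_c"],  # replicated at two sites
    "EMP_MUMBAI": ["site_b"],           # single copy
}


def locate(fragment, available_sites, directory=GLOBAL_DIRECTORY):
    """Return the first reachable site that stores the fragment."""
    for site in directory.get(fragment, []):
        if site in available_sites:
            return site
    raise LookupError(f"no reachable copy of {fragment}")


# site_a is down; the replica at site_c is used without the caller knowing.
print(locate("EMP_DELHI", {"site_b", "site_c"}))  # site_c
```

A fragment with a single copy becomes unreachable when its site is down, which is one reason the directory itself is often replicated too.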

UNIT 2: DISTRIBUTED DATABASE DESIGN

Strategies for Database Design

Strategies broadly categorized into replication and fragmentation, with combinations frequently used.

Data Replication

Involves storing multiple copies of databases at different sites as a fault tolerance approach.

Advantages of Data Replication

Reliability: Ensures operations continue if one site fails due to other available copies.
Network Load Reduction: Local data copies reduce query processing network usage.
Quicker Response: Local data copies lead to faster query responses.
Simpler Transactions: Reduces complexity in table joins across sites.

Disadvantages of Data Replication

Storage Requirements Increase: More copies entail higher storage needs.
Cost and Complexity of Updates: Synchronizing copies requires complex updates each time the data changes.
Coupling Issues: Without a robust update mechanism, removing inconsistencies requires coordination at the application level, which couples applications tightly to the database.
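
The trade-off between cheap reads and expensive updates can be made concrete with a small sketch of a read-one/write-all scheme. This is one simple replication policy chosen for illustration, not the only one; the class name, site names, and message counter are assumptions, and the sketch ignores site failures and concurrent writers.

```python
class ReplicatedTable:
    """Read-one/write-all replication over in-memory copies (illustrative)."""

    def __init__(self, sites):
        self.copies = {site: {} for site in sites}
        self.remote_messages = 0  # update messages sent to other sites

    def read(self, site, key):
        # A read is served from the local copy: no network traffic.
        return self.copies[site].get(key)

    def write(self, site, key, value):
        # A write must reach every copy to keep them consistent.
        for target, copy in self.copies.items():
            copy[key] = value
            if target != site:
                self.remote_messages += 1


table = ReplicatedTable(["s1", "s2", "s3"])
table.write("s1", "stock", 40)
print(table.read("s3", "stock"))  # 40, served locally at s3
print(table.remote_messages)      # 2, one per remote copy
```

With n copies, each write costs n - 1 remote messages in this model, while each read costs none.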

Data Fragmentation

Involves dividing a table into smaller fragments; fragmentation can be horizontal, vertical, or hybrid.

Advantages of Fragmentation

Proximity of data to usage enhances efficiency.
Local query optimization is more effective as data is locally available.
Limits irrelevant data, thus maintaining security and privacy.

Disadvantages of Fragmentation

Access speeds may drop if data from multiple fragments is required.
Complex reconstruction techniques may be needed for recursive fragmentations.
With no backup copies at other sites, the failure of one site can make its fragments unavailable and render the database ineffective.

Types of Fragmentation

Vertical Fragmentation:
- Involves grouping columns; each fragment must include primary key(s) to maintain reconstructability.
Horizontal Fragmentation:
- Involves grouping rows based on field values; all columns of the base table must be retained in each fragment.
Hybrid Fragmentation:
- Combines horizontal and vertical techniques for greater flexibility despite expensive reconstruction.

Semantic Data Control in Distributed Databases

Overview of Semantic Data

Semantic Data is data representing meanings, facilitating machine understanding of data context.
Focuses on accurate representation of real-world implications within datasets leading to effective data modeling.

Role of Semantic Data Control

Encompasses view management, data security, and integrity control.
- View Management: Derivation of views in DDBMS mimics centralized systems, but fragmented relations generate costly evaluations. Optimizations, like snapshots (temporary relations), are crucial in execution.
- Semantic Integrity Control: Defines and enforces integrity constraints, including:
- Data Type Integrity Constraint: Limits values and operations for fields of a particular type, ensuring valid database entries.
- Entity Integrity Control: Ensures unique identification for tuples through primary keys with no NULL values allowed in primary key fields.
- Referential Integrity Constraint: Governs relationships between tables by ensuring foreign keys are either valid or NULL.
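
The three constraint kinds can be tried out on a single node with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table and column names are invented, and a CHECK constraint stands in for a domain (data type) constraint. It shows only local enforcement; a distributed DBMS must also check constraints that span sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE employee (
        emp_id  TEXT PRIMARY KEY NOT NULL,                  -- entity integrity
        salary  INTEGER CHECK (salary >= 0),                -- domain constraint
        dept_id INTEGER REFERENCES department(dept_id)      -- referential integrity
    )""")
conn.execute("INSERT INTO department VALUES (10, 'Sales')")
conn.commit()

INSERT_EMPLOYEE = "INSERT INTO employee VALUES (?, ?, ?)"


def try_insert(params):
    """Run one insert and report whether the DBMS accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT_EMPLOYEE, params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


results = {
    "valid row": try_insert(("E1", 40000, 10)),
    "duplicate key": try_insert(("E1", 45000, 10)),
    "NULL key": try_insert((None, 45000, 10)),
    "negative salary": try_insert(("E2", -5, 10)),
    "unknown department": try_insert(("E3", 30000, 99)),
    "NULL department": try_insert(("E4", 30000, None)),
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Only the first and last rows are accepted: a NULL foreign key is allowed, while a foreign key that points at a missing department is not.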

Query Processing Issues in Distributed Databases

Overview

Processing queries involves global and local optimization as queries enter via client sites for validation and execution.

Objectives of Query Processing

The aim is to convert a high-level query into an efficient execution strategy that minimizes resource consumption, taking into account parameters such as communication cost and CPU time when choosing an optimization strategy.

Layers of Query Processing

Query Decomposition: Transforming calculus queries into algebraic queries on global relations.
Data Localization: Identifying the fragments relevant to a query and rewriting the global query into queries on those fragments.
Global Query Optimization: Finding optimal execution strategies based on query structure and resource allocation.
Distributed Query Execution: Executing local queries at relevant sites and merging results accordingly.
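
The localization and execution layers can be sketched as follows. The fragment catalog, site names, and function names are assumptions for illustration; the sketch covers only a single equality selection on the attribute used for horizontal fragmentation.

```python
# Hypothetical horizontal fragments of an employee relation, split by city.
FRAGMENTS = {
    "EMP_NORTH": {
        "site": "site_a",
        "cities": {"Delhi", "Jaipur"},
        "rows": [("Asha", "Delhi"), ("Kiran", "Jaipur")],
    },
    "EMP_WEST": {
        "site": "site_b",
        "cities": {"Mumbai", "Pune"},
        "rows": [("Ravi", "Mumbai")],
    },
}


def localize(city):
    """Data localization: keep only fragments that can contain the city."""
    return [name for name, frag in FRAGMENTS.items() if city in frag["cities"]]


def execute(city):
    """Run the selection at each relevant site and merge the partial results."""
    result = []
    for name in localize(city):
        local_rows = FRAGMENTS[name]["rows"]
        result.extend(row for row in local_rows if row[1] == city)
    return result


print(localize("Mumbai"))  # ['EMP_WEST']
print(execute("Delhi"))    # [('Asha', 'Delhi')]
```

Fragments whose defining predicate contradicts the query predicate are never contacted, so a query on Mumbai sends nothing to site_a.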

Distributed Query Optimization

Involves evaluating potential query trees to achieve optimal solutions across various replicated and fragmented data.

Challenges

Critical issues include optimal resource utilization, managing query trading (buyer/seller dynamics), and reducing query solution space with heuristics similar to centralized systems, including performing early selections and local optimizations.
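
The benefit of performing selections early can be shown with simple arithmetic. The cost model below is an assumption that counts only bytes sent over the network; the row count, row size, and selectivity are made-up figures.

```python
def transfer_cost(rows, row_bytes, selectivity=1.0):
    """Bytes shipped over the network when only the selected rows are sent."""
    return round(rows * selectivity) * row_bytes


# Strategy 1: ship the whole relation, then select at the query site.
ship_then_select = transfer_cost(100_000, 200)

# Strategy 2: select at the data site (1% of rows match), then ship.
select_then_ship = transfer_cost(100_000, 200, selectivity=0.01)

print(ship_then_select)                      # 20000000
print(select_then_ship)                      # 200000
print(ship_then_select // select_then_ship)  # 100
```

Under these assumptions the early selection cuts network traffic by a factor of 100, which is why the heuristic carries over from centralized optimizers.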

Load Balancing in Distributed Systems

Importance of Load Balancing

Serves as a mechanism for distributing traffic and resource balancing across multiple servers for enhanced security, availability, response, and user experience.

Types of Load Balancing Approaches

Round Robin
Least Connections
Least Time
IP Hashing
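
Three of these approaches can be sketched in Python. The server names and class names are hypothetical. Least Time is left out because it needs measured response times, which this sketch does not simulate.

```python
import hashlib
from itertools import cycle

SERVERS = ["db1", "db2", "db3"]  # hypothetical server names


class RoundRobin:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self):
        return next(self._rotation)


class LeastConnections:
    """Send each request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


def ip_hash(client_ip, servers):
    """Map a client IP to a server so the same client always lands on it."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


rr = RoundRobin(SERVERS)
print([rr.pick() for _ in range(4)])  # ['db1', 'db2', 'db3', 'db1']

lc = LeastConnections(SERVERS)
first_three = [lc.pick() for _ in range(3)]
lc.release("db2")   # db2 finishes a request and becomes the least loaded
print(lc.pick())    # db2
```

Round Robin ignores load, Least Connections tracks it, and IP Hashing trades balance for session affinity, since one client always reaches the same server.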

Benefits of Load Balancing

Enhances performance, ensures high availability, provides security against threats (DDoS), and optimizes resource utilization, minimizing response time across distributed databases or systems.

Migration in Distributed Systems

Migration Models: Define how to manage processes across different nodes, including code sections, resource references, and execution states.
Types of Migration: Weak (only the code is moved) vs. Strong (both the code and the execution state are moved).

Mobile Databases

Introduction

Mobile databases connect to mobile devices across networks, facilitating various computing applications in today’s growing mobile tech landscape.
Features: Frequent data caching, independent operation without constant connection, and compatibility with various mobile platforms.
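
Caching combined with disconnected operation can be sketched as below. The class, its attributes, and the plain dict standing in for the central database are all assumptions for illustration; the sketch replays queued writes on reconnection and does not handle conflicting updates from other clients.

```python
class MobileClient:
    """Local cache with a queue of writes made while offline (illustrative)."""

    def __init__(self, server):
        self.server = server  # dict standing in for the central database
        self.cache = {}
        self.pending = []     # writes made while disconnected
        self.online = True

    def read(self, key):
        # Refresh the cache when connected; otherwise serve the cached value.
        if self.online and key in self.server:
            self.cache[key] = self.server[key]
        return self.cache.get(key)

    def write(self, key, value):
        self.cache[key] = value
        if self.online:
            self.server[key] = value
        else:
            self.pending.append((key, value))

    def reconnect(self):
        # Replay queued writes in the order they were made.
        self.online = True
        for key, value in self.pending:
            self.server[key] = value
        self.pending.clear()


server = {"stock": 5}
client = MobileClient(server)
client.read("stock")       # caches 5 while connected
client.online = False
client.write("stock", 4)   # applied locally, queued for the server
print(client.read("stock"), server["stock"])  # 4 5
client.reconnect()
print(server["stock"])     # 4
```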

Limitations

Issues include: limited bandwidth, wireless communication speed, battery dependency, and security vulnerabilities.

Distributed Object Management

Overview

Aims for transparent management of objects across distributed sites, presenting users with a single-system image while addressing the limitations of traditional relational systems.

Architecture and Features

Establishes frameworks for client-server systems, facilitating distributed processing and connectivity between applications via Object Management tools.

Conclusion

All aspects discussed demonstrate the intricate relationship between data accessibility, control structures, query processing, and maintenance of integrity, security, and performance across distributed environments. Ensuring effective systems facilitates smooth operations in modern technological applications, proving essential in determining the efficiency of database management strategies across numerous applications.