D426 Database Management Chapter 6: Database Architecture

98 Terms

1
New cards

architecture

describes the components of a computer system and the relationships between those components

2
New cards

What are the four layers of MySQL components?

  • Tools

  • query processor

  • storage engine/storage manager

  • file system

3
New cards

What information does MySQL Enterprise Monitor collect, and what do DBAs use it for?

CPU, memory, index utilization, and information about queries and results. This can be used to manage and tune large databases with many users.

4
New cards

What does Audit track, and what kinds of entities does it support?

The time that any change is made to the database and who made the change. This supports sensitive databases such as finance, medical, and defense that might receive audits from government or other businesses.

5
New cards

What are the five categories of utility programs?

  • installation

  • client

  • administrative

  • developer

  • miscellaneous

6
New cards

Most utility programs are intended for __.

database administrators or programmers

7
New cards

What is the difference between a Connector and an API in MySQL?

Connectors are newer interfaces that are mostly built on top of the older APIs and are developed directly by Oracle rather than by third parties.

8
New cards

Name the two main functions of a query processor.

  1. Manage connections between tools and the database.

  2. Compile queries and generate execution plans.

9
New cards

execution plan

a detailed, low-level sequence of steps that specify exactly how to process a query
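
Execution plans can be inspected directly. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (the table, index, and data are invented); EXPLAIN QUERY PLAN asks the engine to report the plan it would use rather than running the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customer(city)")
conn.executemany("INSERT INTO customer (city) VALUES (?)",
                 [("Denver",), ("Austin",), ("Denver",)])

# Ask the engine how it plans to process the query instead of executing it.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM customer WHERE city = 'Denver'"):
    print(row)   # typically reports a search using idx_city
```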

10
New cards

query parser

checks each query for syntax errors and converts valid queries to an internal representation

11
New cards

query optimizer

reads the internal representation of a query, generates alternative execution plans, estimates execution times, and selects the fastest plan

12
New cards

What does a query optimizer base its estimates of query time on?

Heuristics and statistics about data, like the number of rows in each table and the number of values in each column, which it pulls from the data dictionary.

13
New cards

cache manager

a component of the query processor layer that stores reusable information in main memory, such as execution plans for common queries or even query results if the data does not change

14
New cards

Name the two main functions of the storage engine layer.

  1. transaction management

  2. data access

15
New cards

What does the data access component do?

Communicate with the file system and translate table, column, and index reads into block addresses.

16
New cards

buffer manager

retains data blocks from the file system for possible reuse in order to reduce data access time

17
New cards

On the storage engine layer, the buffer manager is the equivalent of the __ on the query processor layer.

cache manager

18
New cards

How does InnoDB handle the limited amount of memory in the buffer manager?

Using a least recently used (LRU) algorithm, which tracks the time each block was last accessed and, when space is needed, discards ‘stale’ blocks. A block that has been updated gets saved to disk before deletion.
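
A minimal Python sketch of the LRU idea described on this card; it is not InnoDB's actual implementation, and the block ids and read/write callbacks are hypothetical.

```python
from collections import OrderedDict

class BufferPool:
    """Toy LRU buffer: just the idea on this card, not InnoDB's real algorithm."""

    def __init__(self, capacity, read_block, write_block):
        self.capacity = capacity
        self.read_block = read_block      # callback: fetch a block from the file system
        self.write_block = write_block    # callback: save a dirty block back to disk
        self.blocks = OrderedDict()       # block_id -> (data, dirty), oldest first

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)          # mark as most recently used
        else:
            if len(self.blocks) >= self.capacity:
                stale_id, (data, dirty) = self.blocks.popitem(last=False)
                if dirty:                              # updated blocks are saved first
                    self.write_block(stale_id, data)
            self.blocks[block_id] = (self.read_block(block_id), False)
        return self.blocks[block_id][0]

    def update(self, block_id, data):
        self.get(block_id)                             # make sure the block is resident
        self.blocks[block_id] = (data, True)           # mark the block dirty

pool = BufferPool(2, read_block=lambda b: f"data for {b}",
                  write_block=lambda b, d: print("flush", b))
pool.get("block 1"); pool.get("block 2"); pool.get("block 3")   # evicts block 1
```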

19
New cards

What three components are included in the transaction manager?

  • concurrency system

  • recovery system

  • lock manager

20
New cards

Name the three types of data that a file system contains for each database.

  • user data

  • log files

  • data dictionary

21
New cards

What are the tiers in a multi-tier architecture?

Groups of computers linked by a network and arranged in a hierarchy.

22
New cards

The top tier of a multi-tier architecture consists of __.

computers interacting directly with end-users

23
New cards

The bottom tier of a multi-tier architecture consists of __.

servers managing resources like databases and email

24
New cards

What functions do the middle tiers of a multi-tier architecture execute?

A variety, such as user authorization, business logic, and communication with other computers.

25
New cards

web architecture

a multi-tier architecture consisting of web browsers and web servers communicating over the internet

26
New cards

Name the three broad categories of cloud services.

  • Infrastructure as a service

  • Platform as a service

  • Software as a service

27
New cards

Infrastructure-as-a-Service (IaaS)

provides computer processing, memory, and storage media, as if the customer were renting a computer

ex: Elastic Compute Cloud (EC2) from AWS

28
New cards

Platform-as-a-service (PaaS)

provides tools and services, such as databases, application development tools, and messaging services.

ex: Microsoft Azure

29
New cards

Software-as-a-service (SaaS)

provides complete applications, usually through web browsers on customer machines

ex: Salesforce, Google Drive

30
New cards

virtual machine (VM)

a software layer that emulates a complete, independent computing environment

31
New cards

parallel computer

consists of multiple processors managed by a single operating system instance and can achieve faster processing speeds by processing multiple instructions concurrently

32
New cards

Name the 3 categories of parallel computers.

  1. shared memory

  2. shared storage

  3. shared nothing

33
New cards

What is the benefit of shared memory for data management?

It is optimal for parallel processing against a common data set stored within a single memory space.

34
New cards

node

a single computer in a group connected by either LAN or WAN

35
New cards

cluster

a group of nodes connected by LAN, managed by separate operating system instances, and coordinated by specialized cluster management software.

36
New cards

What is the difference between a cluster and a parallel computer?

Computers in a cluster can share storage, but not memory because LAN connections are too slow to support memory access.

37
New cards

parallel database

runs on a parallel computer or a cluster

38
New cards

distributed database

runs on multiple computers connected by a wide area network

39
New cards

distributed transaction

a transaction that updates data on multiple nodes of a distributed database

40
New cards

What are the steps of a two-phase commit?

  1. A central transaction coordinator notifies all participating nodes of the required updates.

  2. Participating nodes receive the notification, store the update in a local log, and send a confirmation message to the transaction coordinator.

  3. After receiving confirmation from all participating nodes, the transaction coordinator instructs all nodes to commit.

  4. Participating nodes receive the commit message, commit the update to the database, and notify the transaction coordinator of success.
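
A minimal single-process Python sketch of the four steps above; the class names and return values are invented, and a real implementation would add persistent logs, timeouts, and failure handling.

```python
class Participant:
    """One node in a distributed transaction (toy version of the steps above)."""
    def __init__(self, name):
        self.name, self.log, self.committed = name, [], []

    def prepare(self, update):
        self.log.append(update)           # step 2: store the update in a local log
        return True                       # ...and send a confirmation back

    def commit(self):
        self.committed.extend(self.log)   # step 4: commit the logged update
        self.log.clear()

def two_phase_commit(update, participants):
    # Phase 1 (steps 1-2): notify every node and collect confirmations.
    if not all(node.prepare(update) for node in participants):
        return "rolled back"              # any refusal aborts the transaction
    # Phase 2 (steps 3-4): all nodes confirmed, so instruct them to commit.
    for node in participants:
        node.commit()
    return "committed"

nodes = [Participant("east"), Participant("west")]
print(two_phase_commit("UPDATE account SET ...", nodes))    # committed
```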

41
New cards

local transaction

updates data on a single node of a distributed database

42
New cards

What is the downside of local transactions?

They can create temporary inconsistency in a distributed database, as nodes are updated at different times.

43
New cards

network partition

forms when a network error prevents nodes from communicating

44
New cards

CAP theorem

states that a distributed database cannot simultaneously be Consistent, Available, and Partition-tolerant; it can guarantee at most two of these properties

45
New cards

available (database)

‘live’ nodes must respond to queries at all times, regardless of the state of other nodes

46
New cards

consistent (database)

conforms to all rules at all times, across all nodes to which those rules apply

47
New cards

Most distributed databases must always function and are therefore __.

partition-tolerant

48
New cards

partition-tolerant

when a database can continue to function if a network partition occurs

49
New cards

replicated database

maintains two or more replicas on separate storage devices

50
New cards

What are the advantages of replicated databases?

  1. High availability in case of storage device failures.

  2. Fast concurrent reads by decomposing large queries into smaller queries that read separate replicas in parallel.

  3. Local reads for distributed databases.

51
New cards

What is the major disadvantage of replicated databases?

Slow or inconsistent updates.

52
New cards

How is security enhanced for a replicated database?

Updates can be restricted to a single replica, accessible only to trusted database users, and then propagated to read-only replicas across the userbase.

53
New cards

How do replicated databases simplify backups?

One replica can be backed up while transactions execute against other replicas, meaning that the database is more consistently available.

54
New cards

How do replicated databases complicate server administration?

DBAs have to figure out how to propagate updates across replicas.

55
New cards

storage arrays

a type of storage device that can manage database replicas internally, without database intervention

56
New cards

What happens if the primary node fails when updating a database with the primary/secondary technique?

The database automatically designates a new primary node to ensure continued availability.

57
New cards

primary/secondary technique

one node is designated as a primary node that receives updates first in local transactions, and all other nodes receive the update only after the primary node commits
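
A minimal Python sketch of the primary/secondary idea, including the failover described on the previous card; the Replica class and the "first live replica becomes primary" rule are assumptions for illustration.

```python
class Replica:
    def __init__(self, name, alive=True):
        self.name, self.alive, self.rows = name, alive, {}

def primary_secondary_update(replicas, key, value):
    # If the current primary is down, designate a new one (previous card).
    live = [r for r in replicas if r.alive]
    primary, secondaries = live[0], live[1:]
    primary.rows[key] = value            # the primary node receives the update first
    for node in secondaries:             # only after the primary commits do the
        node.rows[key] = value           # other nodes receive it (they briefly lag)
    return primary.name

nodes = [Replica("p", alive=False), Replica("s1"), Replica("s2")]
print(primary_secondary_update(nodes, "account 42", 100))    # 's1' acts as primary
```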

58
New cards

group replication technique

applies updates to any node in a group, which then broadcasts transaction information to other nodes so that they can search for conflicts with concurrent transactions before any node commits the update

59
New cards

What happens if a node in a group replication system detects a conflict between concurrent transactions?

A predetermined algorithm selects which transaction commits and rolls back any others.

60
New cards

Between group replication, distributed transactions, and primary/secondary technique, which is partition-tolerant, and which is always consistent?

Primary/secondary technique is partition-tolerant because it updates only the primary node independent of secondary nodes and assigns a new primary node if the original is unavailable. Distributed transactions are always consistent because they update data on multiple nodes within one transaction.

61
New cards

central catalog

the entire catalog of a distributed database resides on a single node

62
New cards

What are the pros and cons of a central catalog?

Pros: it is easy to manage

Cons:

  • remote nodes must access it via WAN, which may be slow or unreliable

  • there may be a bottleneck at the central catalog

63
New cards

replicated catalog

a catalog for a distributed database that has a copy on each node

64
New cards

What are the pros and cons of a replicated catalog?

Pros: most queries are fast and reliable since the catalog data is all available locally

Cons: statements that update the catalog will increase network traffic and may fail in a distributed transaction if any replica is unavailable

65
New cards

Catalog updates are __ compared to other database queries.

infrequent

66
New cards

operational data

used to conduct daily business functions

Ex: sales invoices, student test scores, driving violation records

67
New cards

analytic data

used to understand, manage, and plan the business

Ex: sales totals by region, average student grades over time, driving violation counts by ZIP code

68
New cards

Analytic data is sometimes called __.

reporting data or decision support data

69
New cards

What are four ways in which operational and analytic data differ?

  1. Volatility - operational data changes in real time and analytic data is updated at fixed intervals.

  2. Detail - operational data reflects individual transactions and analytic data is summarized by business dimensions like time period or geography

  3. Scope - operational databases supporting different functions are often incompatible, while analytic databases combine this data with standard formats, data types, and keys

  4. History - operational databases have primarily current data, while analytic databases contain both current and historic data in order to track trends over time

70
New cards

What problems does storing operational and analytic data in the same database create?

  1. The two types of data have different design requirements.

  2. Analytic queries on an operational database can interfere with business operations.

  3. The volatility of operational databases gives analytic queries uncertain reference times and can produce misleading results.

71
New cards

data mart

a data warehouse designed for a specific business area, such as sales, HR, or product development

72
New cards

Name the 5 steps of the ETL process.

  1. Extract data to the staging area.

  2. Cleanse data to eliminate errors, unusual spellings, and incorrect data.

  3. Integrate data into a uniform structure.

  4. Restructure data into a design optimized for analytic queries.

  5. Load data to the data warehouse.
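
A toy Python walk-through of the five steps; the source rows, column names, and summary structure are all invented.

```python
staging = [                                    # 1. Extract raw rows to the staging area
    {"state": "Colo.", "amount": "100"},
    {"state": "CO",    "amount": "  250 "},
]

cleansed = [                                   # 2. Cleanse: fix spellings and bad values
    {"state": "CO" if r["state"] == "Colo." else r["state"],
     "amount": int(r["amount"].strip())}
    for r in staging
]

# 3. Integrate: every source now shares one structure (state code + integer amount).
# 4. Restructure: summarize by a business dimension to suit analytic queries.
sales_by_state = {}
for r in cleansed:
    sales_by_state[r["state"]] = sales_by_state.get(r["state"], 0) + r["amount"]

warehouse = list(sales_by_state.items())       # 5. Load into the data warehouse
print(warehouse)                               # [('CO', 350)]
```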

73
New cards

List 3 examples of ETL tools

  1. PowerCenter from Informatica

  2. SQL Server Integration Services from Microsoft

  3. Oracle Data Integrator

74
New cards

fact table

contains numeric data used to measure business performance; each row consists of numeric fact columns and foreign keys that reference dimension tables

75
New cards

dimension tables

contain textual data that describes the fact data
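
A minimal star-schema sketch of the two cards above, using Python's built-in sqlite3 module; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: textual data that describes the facts.
conn.execute("""CREATE TABLE store_dim (
                    store_id INTEGER PRIMARY KEY,
                    city     TEXT,
                    region   TEXT)""")
# Fact table: numeric measures plus foreign keys to the dimensions.
conn.execute("""CREATE TABLE sales_fact (
                    store_id  INTEGER REFERENCES store_dim(store_id),
                    sale_date TEXT,
                    amount    REAL)""")

conn.execute("INSERT INTO store_dim VALUES (1, 'Denver', 'West')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, "2024-01-05", 20.0), (1, "2024-01-06", 5.5)])

# A typical analytic query joins facts to dimensions and aggregates the measures.
for row in conn.execute("""SELECT d.region, SUM(f.amount)
                           FROM sales_fact f JOIN store_dim d USING (store_id)
                           GROUP BY d.region"""):
    print(row)   # ('West', 25.5)
```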

76
New cards

Why are fact tables easy to maintain?

Because their primary keys are made up of foreign keys that consist of small, meaningless integers that never change.

77
New cards

dimension hierarchy

a sequence of columns in which each column has a one-to-many relationship to the next column

78
New cards

type 2 design for slowly changing dimensions

a method of tracking historical data that adds start and end date foreign keys to a fact table

79
New cards

What are high-level dimensions of value for a BI program?

  • Financial value

  • Productivity value

  • Trust value

  • Risk value

80
New cards

knowledge discovery (data mining)

the process of discovering patterns that lead to actionable knowledge from large data sets through one or more traditional data mining techniques, such as market basket analysis and clustering

81
New cards

Explain the difference between clustering and classification.

In clustering tasks, the classes are not defined beforehand, but are discovered in the clustering process.
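
A small sketch of the contrast, assuming scikit-learn is installed; the points and labels are made up.

```python
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]      # four data points

# Clustering: no classes are defined beforehand; KMeans discovers two groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: the classes ("low", "high") are defined in advance as labels.
labels = ["low", "low", "high", "high"]
model = DecisionTreeClassifier().fit(X, labels)
prediction = model.predict([[2, 1]])

print(clusters, prediction)               # e.g. [0 0 1 1] ['low']
```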

82
New cards

in-memory database

a database that stores data in main memory instead of, or in addition to, storage media

83
New cards

Describe SQL Server In-Memory OLTP

An extension to SQL Server by Microsoft that supports in-memory tables with the same transaction and recovery options as storage media tables.

84
New cards

Oracle Database In-Memory

Creates in-memory copies of tables with columnar organization to optimize analytic queries while the source data remains on storage media, grouped by rows in blocks.

85
New cards

What storage engines allow in-memory databases for MySQL?

MEMORY can create temporary in-memory tables that do not support transactions or recovery. NDB Cluster supports transactions, recovery, and distributed data for in-memory tables.

86
New cards

embedded/in-process database

a database that is packaged with a programming language and used in single-user applications that require no database administration, such as apps for mobile devices

87
New cards

SQLite

an open source relational database that supports all major programming languages and is the dominant embedded relational database
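
A minimal example of SQLite as an embedded database, using Python's standard-library sqlite3 module; the file and table names are arbitrary. The engine runs inside the application process, so there is no server to install or administer.

```python
import sqlite3

# The whole database lives in one local file; no server process and no DBA needed.
conn = sqlite3.connect("notes.db")
conn.execute("CREATE TABLE IF NOT EXISTS note (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO note (body) VALUES (?)", ("buy milk",))
conn.commit()

for row in conn.execute("SELECT id, body FROM note"):
    print(row)
conn.close()
```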

88
New cards

federated database

a collection of two or more participating databases underneath a coordinating software layer, each of which is autonomous and heterogeneous

89
New cards

autonomous database

operates independently of other participating databases in a coordinating software layer

90
New cards

heterogeneous database

runs under a different database system than other participating databases, or has an incompatible schema, such as inconsistent primary and foreign keys or similar columns with different names and data types

91
New cards

middleware

the coordinating software layer in a federated database

92
New cards

What components are common to most federated database middleware products?

  • global catalog

  • global query processor

  • database wrapper - converts decomposed queries to the appropriate syntax for each participating database

93
New cards

SQL/MED

an extension of the SQL standard for federated databases that adds constructs such as nicknames for participating database objects and user mapping, which associates a federated database user with a participating database user

94
New cards

What are the main benefits of a federated database over a distributed database or a data warehouse?

It is easier to build and may be the only practical option to combine data from existing, incompatible databases.

95
New cards

Name two prominent examples of federated database middleware.

  • InfoSphere Federated Server from IBM

  • WebLogic Server from Oracle

96
New cards

How is a data lake similar to a data warehouse?

It is a separate database designed for analytic queries and consisting of data extracted from multiple sources.

97
New cards

How is a data lake similar to a federated database?

The data in a lake is not cleansed, integrated or restructured, and it may be loaded continuously rather than at fixed intervals, depending on the source.

98
New cards

Because formulating and understanding queries is more difficult with a data lake than a data warehouse, lakes are more suitable for __ than business analysts.

data scientists