1/97
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
architecture
describes the components of a computer system and the relationships between those components
What are the four layers of MySQL components?
Tools
query processor
storage engine/storage manager
file system
What information does MySQL Enterprise Monitor collect, and what do DBAs use it for?
CPU, memory, index utilization, and information about queries and results. This can be used to manage and tune large databases with many users.
What does Audit track, and what kinds of entities does it support?
The time that any change is made to the database and who made the change. This supports sensitive databases such as finance, medical, and defense that might receive audits from government or other businesses.
What are the five categories of utility programs?
installation
client
administrative
developer
miscellaneous
Most utility programs are intended for __.
database administrators or programmers
What is the difference between a Connector and an API in MySQL?
Connectors are newer versions of APIs that are mostly built on top of APIs and developed directly by Oracle as opposed to a third party.
Name the two main functions of a query processor.
Manage connections between tools and the query processor.
Compile queries and generate execution plans.
execution plan
a detailed, low-level sequence of steps that specify exactly how to process a query
query parser
checks each query for syntax errors and converts valid queries to an internal representation
query optimizer
reads the internal representation of a query, generates alternative execution plans, estimates execution times, and selects the fastest plan
What does a query optimizer base its estimates of query time on?
Heuristics and statistics about data, like the number of rows in each table and the number of values in each column, which it pulls from the data dictionary.
cache manager
a component of the query processor layer that stores reusable information in main memory, such as execution plans for common queries or even query results if the data does not change
Name the two main functions of the storage engine layer.
transaction management
data access
What does the data access component do?
Communicate with the file system and translate table, column, and index reads into block addresses.
buffer manager
retains data blocks from the file system for possible reuse in order to reduce data access time
On the storage engine layer, the buffer manager is the equivalent of the __ on the query processor layer.
cache manager
How does InnoDB handle the limited amount of memory in the buffer manager?
Using a least recently used (LRU) algorithm, which tracks the time each block was last accessed and, when space is needed, discards ‘stale’ blocks. A block that has been updated gets saved to disk before deletion.
What three components are included in the transaction manager?
concurrency system
recovery system
lock manager
Name the three types of data that a file system contains for each database.
user data
log files
data dictionary
What are the tiers in a multi-tier architecture?
Grouped layers of computers linked by a network and arranged in a hierarchy.
The top tier of a multi-tier architecture consists of __.
computers interacting directly with end-users
The bottom tier of a multi-tier architecture consists of __.
servers managing resources like databases and email
What functions do the middle tiers of a multi-tier architecture execute?
A variety, such as user authorization, business logic, and communication with other computers.
web architecture
a multi-tier architecture consisting of web browsers and web servers communicating over the internet
Name the three broad categories of cloud services.
Infrastructure as a service
Platform as a service
Software as a service
Infrastructure-as-a-Service (IaaS)
provides computer processing, memory, and storage media, as if the customer were renting a computer
ex: Elastic Compute Cloud (EC2) from AWS
Platform-as-a-service (PaaS)
provides tools and services, such as databases, application development tools, and messaging services.
ex: Microsoft Azure
Software-as-a-service (SaaS)
provides complete applications, usually through web browsers on customer machines
ex: Salesforce, Google Drive
virtual machine (VM)
a software layer that emulates a complete, independent computing environment
parallel computer
consists of multiple processors managed by a single operating system instance and can achieve faster processing speeds by processing multiple instructions concurrently
Name the 3 categories of parallel computers.
shared memory
shared storage
shared nothing
What is the benefit of shared memory for data management?
It is optimal for parallel processing against a common data set stored within a single memory space.
node
a single computer in a group connected be either LAN or WAN
cluster
a group of nodes connected by LAN, managed by separate operating system instances, and coordinated by specialized cluster management software.
What is the difference between a cluster and a parallel computer?
Computers in a cluster can share storage, but not memory because LAN connections are too slow to support memory access.
parallel database
runs on a parallel computer or a cluster
distributed database
runs on multiple computers connected by a wide area network
distributed transaction
a transaction that updates data on multiple nodes of a distributed database
What are the steps of a two-phase commit?
A central transaction coordinator notifies all participating nodes of the required updates.
Participating nodes receive the notification, store the update in a local log, and send a confirmation message to the transaction coordinator.
After receiving confirmation from all participating nodes, the transaction coordinator instructs all nodes to commit.
Participating nodes receive the commit message, commit the update to the database, and notify the transaction coordinator of success.
local transaction
updates data on a single node of a distributed database
What is the downside of local transactions?
They can create temporary inconsistency in a distribute database, as nodes are updated at different times.
network partition
forms when a network error prevents nodes from communicating
CAP theorem
states that a distributed database cannot simultaneously be Consistent, Available, and Partition-tolerant; it can guarantee two at most of these properties
available (database)
‘live’ nodes must respond to queries at all times, regardless of the state of other nodes
consistent (database)
conforms to all rules at all times across all nodes that these rules apply to
Most distributed databases must always function and are therefore __.
partition-tolerant
partition-tolerant
when a database can continue to function if a network partition occurs
replicated database
maintains two or more replicas on separate storage devices
What are the advantages of replicated databases?
High availability in case of storage device failures.
Fast concurrent reads by decomposing large queries into smaller queries that read separate replicas in parallel.
Local reads for distributed databases.
What is the major disadvantage of replicated databases?
Slow or inconsistent updates.
How is security enhanced for a replicated database?
Updates can be restricted to a single replica, accessible only to trusted database users, and then propagated to read-only replicas across the userbase.
How do replicated databases simplify backups?
One replica can be backed up while transactions execute against other replicas, meaning that the database is more consistently available.
How do replicated databases complicate server administration?
DBAs have to figure out how to propagate updates across replicas.
storage arrays
a type of storage device that can manage database replicas internally, without database intervention
What happens if the primary node fails when updating a database with the primary/secondary technique?
The database automatically designates a new primary node to ensure continued availability.
primary/secondary technique
one node is designated as a primary node that receives updates first in local transactions, and all other nodes receive the update only after the primary node commits
group replication technique
applies updates to any node in a group, which then broadcasts transaction information to other nodes so that they can search for conflicts with concurrent transactions before any node commits the update
What happens if a node in a group replication system detects a conflict between concurrent transactions?
A predetermined algorithm selects which transaction commits and rolls back any others.
Between group replication, distributed transactions, and primary/secondary technique, which is partition-tolerant, and which is always consistent?
Primary/secondary technique is partition-tolerant because it updates only the primary node independent of secondary nodes and assigns a new primary node if the original is unavailable. Distributed transactions are always consistent because they update data on multiple nodes within one transaction.
central catalog
the entire catalog of a distributed database resides on a single node
What are the pros and cons of a central catalog?
Pros: it is easy to manage
Cons:
- remote nodes must access via WAN, which may be slow or unreliable
- there may be a bottleneck at the central catalog
replicated catalog
a catalog for a distributed database that has a copy on each node
What are the pros and cons of a replicated catalog?
Pros: most queries are fast and reliable since the catalog data is all available locally
Cons: statements that update the catalog will increase network traffic and may fail in a distributed transaction if any replica is unavailable
Catalog updates are __ compared to other database queries.
infrequent
operational data
used to conduct daily business functions
Ex: sales invoices, student test scores, driving violation records
analytic data
used to understand, manage, and plan the business
Ex: sales totals by region, average student grades over time, driving violation counts by ZIP code
Analytic data is sometimes called __.
reporting data or decision support data
What are four ways in which operational and analytic data differ?
Volatility - operational data changes in real time and analytic data is updated at fixed intervals.
Detail - operational data reflects individual transactions and analytic data is summarized by business dimensions like time period or geography
Scope - operational databases supporting different functions are often incompatible, while analytic databases combine this data with standard formats, data types, and keys
History - operational databases have primarily current data, while analytic databases contain both current and historic data in order to track trends over time
What problems does storing operational and analytic data in the same database create?
The two types of data have different design requirements.
Analytic queries on an operational database can interfere with business operations.
The volatility of operational databases creates analytic queries with uncertain reference times and misleading results.
data mart
a data warehouse designed for a specific business area, such as sales, HR, or product development
Name the 5 steps of the ETL process.
Extract data to the staging area.
Cleanse data to eliminate errors, unusual spellings, and incorrect data.
Integrate data into a uniform structure.
Restructure data into a design optimized for analytic queries.
Load data to the data warehouse.
List 3 examples of ETL tools
PowerCenter from Informatica
SQL Server Integration Services from Microsoft
Oracle Data Integrator
fact table
contains numeric data used to measure business performance where each row consists of numeric fact columns and foreign keys that reference dimension tables
dimension tables
contain textual data that describes the fact data
Why are fact tables easy to maintain?
Because their primary keys are made up of foreign keys that consist of small, meaningless integers that never change.
dimension hierarchy
a sequence of columns in which each column has a one-many relationship to the next column
type 2 design for slowly changing dimensions
a method of tracking historical data that adds start and end date foreign keys to a fact table
What are high-level dimensions of value for a BI program?
Financial value
Productivity value
Trust value
Risk value
knowledge discovery (data mining)
the process of discovering patterns that lead to actionable knowledge from large data sets through one or more traditional data mining techniques, such as market basket analysis and clusering
Explain the difference between clustering and classification.
In clustering tasks, the classes are not defined beforehand, but are discovered in the clustering process.
in-memory database
a database that stores data in main memory instead of, or in addition to, storage media
Describe SQL Server In-Memory OLTP
An extension to SQL Server by Microsoft that supports in-memory tables with the same transaction and recover options as storage media tables.
Oracle Database In-Memory
Creates in-memory copies of tables with columnar organization to optimize analytic queries while the source data remains on storage media, grouped by rows in blocks.
What storage engines allow in-memory databases for MySQL?
MEMORY can create temporary in-memory tables that do not support transactions or recovery. NDB Cluster supports transactions, recovery, and distributed data for in-memory tables.
embedded/in-process database
a database that is packaged with a programming language and used in single-user applications that require no database administration, such as apps for mobile devices
SQLite
an open source relational database that supports all major programming languages and is the dominant embedded relational database
federated database
a collection of two or more participating databases underneath a coordinating software layer, each of which is autonomous and heterogeneous
autonomous database
operates independently of other participating databases in a coordinating software layer
heterogeneous database
either run under different database systems or have incompatible schema, such as inconsistent primary and foreign keys or similar columns with different names and data types
middleware
the coordinating software layer in a federated database
What components are common to most federated database middleware products?
global catalog
global query processor
database wrapper - converts decomposed queries to the appropriate syntax for each participating database
SQL/MED
an extension of the SQL standard for federated database that adds constructs such as nicknames for participating database objects and user mapping, which associates a federated database user with a participating database user
What are the main benefits of a federated database over a distributed database or a data warehouse?
It is easier to build and may be the only practical option to combine data from existing, incompatible databases.
Name two prominent examples of federated database middleware.
InfoSphere Federated Server from IBM
WebLogic Server from Oracle
How is a data lake similar to a data warehouse?
It is a separate database designed for analytic queries and consisting of data extracted from multiple sources.
How is a data lake similar to a federated database?
The data in a lake is not cleansed, integrated or restructured, and it may be loaded continuously rather than at fixed intervals, depending on the source.
Because formulating and understanding queries is more difficult with a data lake than a data warehouse, lakes are more suitable for __ than business analysts.
data scientists