Flashcards for reviewing data engineering concepts with AWS, focusing on data modeling, relational and NoSQL databases, Apache Cassandra, and database structuring techniques.
Data Modeling
An abstraction that organizes data elements and the relationships between them; the basis for designing a database.
Conceptual Data Modeling
Mapping out the concepts (entities) a database will contain, similar to naming the columns of an Excel spreadsheet.
Logical Data Modeling
Mapping conceptual models to tables, schemas, and columns, making them more practical.
Physical Data Modeling
Turning logical data models into the database's Data Definition Language (DDL).
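As a rough illustration of the step from a logical model to DDL, the sketch below uses Python's built-in `sqlite3` module. The `customer` table and its columns are hypothetical and are not part of the flashcards.

```python
import sqlite3

# Hypothetical logical model: a "customer" entity with three attributes.
# The physical model is the DDL statement that creates it in a real database.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Inspect the schema the database actually created (column name is field 1).
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
print(columns)
```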
Relational Databases
Databases that organize data into tables with rows and columns, each row having a unique key.
Relational Database Management System (RDBMS)
Software for managing relational databases.
SQL
A language used to interact with relational databases.
ACID Transactions
Properties guaranteeing database transaction validity, including Atomicity, Consistency, Isolation, and Durability.
Atomicity
All-or-nothing processing of a transaction: either every step is applied or none is.
Consistency
Only transactions abiding by certain rules can change the database.
Isolation
Transactions are processed independently of each other.
Durability
Completed transactions are saved even if the system fails.
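Atomicity can be seen in a small sketch with Python's `sqlite3` module. The `account` table, the `transfer` helper, and the balances are assumptions made up for the example. A transfer that would overdraw an account fails a `CHECK` constraint partway through, and the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint if funds are insufficient.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # too large: the credit to bob is undone too
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, balances)
```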
NoSQL Databases
Non-relational databases that offer a simpler design, horizontal scaling, and finer control over availability.
Apache Cassandra
A NoSQL database that distributes data by partitions across nodes and servers, organized in columns and rows.
Keyspace
Collection of tables in Apache Cassandra.
Table (Cassandra)
Group of partitions in Apache Cassandra.
Partition (Cassandra)
Fundamental unit of access in Cassandra; a collection of rows.
Primary Key (Cassandra)
Consists of a partition key and, optionally, one or more clustering columns in Cassandra.
MongoDB
A document-oriented NoSQL database: besides key lookups like those of a key-value store, it offers an API that retrieves documents based on their contents.
DynamoDB
A NoSQL database where the data is represented as a collection of key and value pairs.
Apache HBase
A NoSQL database that uses tables, rows, and columns but allows column names and formats to vary from row to row.
Neo4j
A NoSQL database focused on relationships between entities, representing data as nodes and edges.
CQL (Cassandra Query Language)
Cassandra’s query language, similar to SQL but without JOINs or subqueries; GROUP BY is absent in older versions and restricted to primary key columns in newer ones.
Normalization
Reduces data redundancy and increases data integrity in databases.
Denormalization
Improves read performance at the expense of write performance by storing redundant copies of data.
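The trade-off between the two approaches can be sketched with `sqlite3`. All table names and rows here are invented for illustration. The normalized layout needs a JOIN to read an order's city but only one row to update it, while the denormalized layout reads without a JOIN but must rewrite every redundant copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: each customer's city is stored exactly once.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    total       REAL
);
-- Denormalized: customer data is copied onto every order row.
CREATE TABLE orders_denorm (
    order_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT, total REAL
);
INSERT INTO customer VALUES (1, 'Ada', 'London');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);
INSERT INTO orders_denorm VALUES
    (10, 'Ada', 'London', 25.0), (11, 'Ada', 'London', 40.0);
""")

# Reads: the normalized layout needs a JOIN, the denormalized one does not.
normalized_read = conn.execute(
    "SELECT o.order_id, c.city FROM orders o "
    "JOIN customer c ON c.customer_id = o.customer_id ORDER BY o.order_id"
).fetchall()
denorm_read = conn.execute(
    "SELECT order_id, city FROM orders_denorm ORDER BY order_id"
).fetchall()

# Writes: changing the city touches one row versus every redundant copy.
rows_touched_normalized = conn.execute(
    "UPDATE customer SET city = 'Paris' WHERE customer_id = 1"
).rowcount
rows_touched_denorm = conn.execute(
    "UPDATE orders_denorm SET city = 'Paris' WHERE customer_name = 'Ada'"
).rowcount
print(rows_touched_normalized, rows_touched_denorm)
```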
Normal Form
Ensures a database is free from unwanted insertion, update, and deletion dependencies.
First Normal Form (1NF)
Each cell holds a single atomic value, with no sets, collections, or lists in a column.
Second Normal Form (2NF)
The table is in 1NF and every non-key column depends on the entire primary key, with no partial dependencies on just part of a composite key.
Third Normal Form (3NF)
The table is in 2NF and has no transitive dependencies: no non-key column depends on another non-key column.
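A minimal sketch of bringing a table into 1NF, using plain Python dictionaries as rows. The data and the `to_first_normal_form` helper are hypothetical. The list-valued `phones` column is split into one row per phone number in a separate table, so each cell holds a single value.

```python
# A table that violates 1NF: the "phones" column holds a list.
unnormalized = [
    {"customer_id": 1, "name": "Ada", "phones": ["555-0100", "555-0101"]},
    {"customer_id": 2, "name": "Grace", "phones": ["555-0199"]},
]


def to_first_normal_form(rows):
    """Split the list-valued column into its own table, one value per row."""
    customers = [{"customer_id": r["customer_id"], "name": r["name"]} for r in rows]
    phones = [
        {"customer_id": r["customer_id"], "phone": p}
        for r in rows
        for p in r["phones"]
    ]
    return customers, phones


customers, phones = to_first_normal_form(unnormalized)
for row in phones:
    print(row)
```

Keeping `name` in the `customers` table rather than repeating it on every phone row also avoids a partial dependency, since `name` depends only on `customer_id` and not on the phone number.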
Fact Tables
Measurements, metrics, or facts of a business process, often numeric and aggregated.
Dimension Tables
Categorize facts and measures to help answer business questions; typical dimensions are people, products, places, and time.
Star Schema
One or more fact tables referencing any number of dimension tables, often denormalized.
Snowflake Schema
Logical arrangement of tables in a multidimensional database with a centralized fact table and multiple dimensions; unlike a star schema, the dimensions are normalized into further related tables.
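A small star schema can be sketched with `sqlite3`: one fact table of sales referencing two dimension tables. The table names, columns, and figures are made up for the example. The query aggregates a numeric fact by a dimension attribute, which is the typical access pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds numeric measurements plus keys into the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture'), (3, 'Phone', 'Electronics');
INSERT INTO dim_date VALUES (202401, 2024, 1), (202402, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 202401, 2, 2000.0), (2, 202401, 1, 300.0),
    (3, 202402, 4, 1600.0), (1, 202402, 1, 1000.0);
""")

# Business question: how much revenue did each product category bring in?
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

In a snowflake variant of this sketch, `category` would move out of `dim_product` into its own table that `dim_product` references.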
Distributed Database
A database scaled out horizontally and made of multiple machines.
Eventual Consistency
Guarantees that if no new updates are made to a data item, all accesses to that item will eventually return the last updated value.
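The idea can be illustrated with a toy in-memory simulation. This is not how any particular database replicates data; the `Replica` and `EventuallyConsistentStore` classes are hypothetical. A write lands on one replica immediately and reaches the others only when `sync` runs, so reads can be stale in between but converge once updates stop.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0


class EventuallyConsistentStore:
    """Toy model: replica 0 applies writes at once, the rest lag until sync()."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # (replica_index, version, value) not yet delivered
        self.version = 0

    def write(self, value):
        self.version += 1
        first = self.replicas[0]
        first.value, first.version = value, self.version
        for i in range(1, len(self.replicas)):
            self.pending.append((i, self.version, value))

    def read(self, replica_index):
        return self.replicas[replica_index].value

    def sync(self):
        """Deliver queued updates; a newer version always wins."""
        for i, version, value in self.pending:
            replica = self.replicas[i]
            if version > replica.version:
                replica.value, replica.version = value, version
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("v1")
stale = store.read(2)       # replica 2 has not received the update yet
store.sync()
converged = [store.read(i) for i in range(3)]
print(stale, converged)
```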
CAP Theorem
It is impossible for a distributed data store to guarantee more than two out of the three qualities: Consistency, Availability, and Partition Tolerance.
Consistency (CAP Theorem)
Every read gets the most recent write or returns an error.
Availability (CAP Theorem)
Every request gets a response, but there’s no guarantee that the data is the latest update.
Partition Tolerance (CAP Theorem)
The system continues to function even when network connectivity between nodes is lost.
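The choice a system faces during a network partition can be sketched with a deliberately simplified two-node model. The `TwoNodeStore` class and its "CP"/"AP" modes are assumptions for illustration; real systems are more nuanced (for example, a majority side may keep serving requests). In "CP" mode the store refuses requests it cannot keep consistent, while in "AP" mode it keeps answering but may return stale data.

```python
class Unavailable(Exception):
    """Raised when the store refuses a request to preserve consistency."""


class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data = [None, None]   # value held by node 0 and node 1
        self.partitioned = False   # True when the nodes cannot talk

    def write(self, node, value):
        if not self.partitioned:
            self.data = [value, value]  # replicate to both nodes
        elif self.mode == "CP":
            raise Unavailable("cannot replicate during a partition")
        else:
            self.data[node] = value     # AP: accept locally, nodes diverge

    def read(self, node):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value during a partition")
        return self.data[node]


ap = TwoNodeStore("AP")
ap.write(0, "old")
ap.partitioned = True
ap.write(0, "new")
ap_read = ap.read(1)   # answers, but with the stale value

cp = TwoNodeStore("CP")
cp.write(0, "old")
cp.partitioned = True
try:
    cp.read(1)
    cp_result = "answered"
except Unavailable:
    cp_result = "error"
print(ap_read, cp_result)
```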
Primary Key (General)
How each row is uniquely identified and how data is distributed between nodes/servers in the system.
Partition Key
First element of the primary key in NoSQL databases such as Cassandra; it determines how data is distributed across nodes.
Clustering Columns
The columns of the primary key that follow the partition key; they determine the sort order of rows within a partition.
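How the two parts of such a primary key divide the work can be sketched with a toy in-memory table. The node names, `node_for`, and `ToyTable` are hypothetical, and the modulo hashing is a simplification (Cassandra itself assigns token ranges to nodes). The partition key picks the node and the partition; the clustering value orders rows inside that partition.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(partition_key):
    """Map a partition key to a node by hashing it (simplified placement)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


class ToyTable:
    def __init__(self):
        # partition key -> {clustering value -> row}
        self.partitions = {}

    def insert(self, partition_key, clustering_value, row):
        # Writing the same full primary key twice overwrites the row (an upsert).
        self.partitions.setdefault(partition_key, {})[clustering_value] = row

    def read_partition(self, partition_key):
        """Return the partition's rows ordered by clustering value."""
        rows = self.partitions.get(partition_key, {})
        return [rows[key] for key in sorted(rows)]


# Hypothetical primary key: (artist) as partition key, (year) as clustering column.
albums = ToyTable()
albums.insert("The Beatles", 1970, "Let It Be")
albums.insert("The Beatles", 1965, "Rubber Soul")
albums.insert("The Who", 1969, "Tommy")

print(node_for("The Beatles"))
print(albums.read_partition("The Beatles"))
```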