SQL Code: Create a table based on its meta data using CREATE TABLE
CREATE TABLE schema.table (column1_name datatype NULL/NOT NULL, column2_name datatype NULL/NOT NULL, PRIMARY KEY (column_name), FOREIGN KEY (column_name) REFERENCES other_table (column_name));
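A hypothetical concrete example (schema, table, and column names are made up, and a sales.Regions table is assumed to already exist):
CREATE TABLE sales.Customers (
    CustomerID INT NOT NULL,
    CustomerName VARCHAR(50) NOT NULL,
    SignupDate DATE NULL,
    RegionID INT NULL,
    PRIMARY KEY (CustomerID),
    FOREIGN KEY (RegionID) REFERENCES sales.Regions (RegionID)
);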
which column always has NOT NULL when creating a table
primary key
SQL Code: change the structure of a table using ALTER TABLE - adding a column to a table
ALTER TABLE schema.table ADD column_name datatype NULL/NOT NULL;
SQL Code: change the structure of a table using ALTER TABLE - removing a column and its data from a table
ALTER TABLE schema.table DROP COLUMN column_name;
SQL Code: change the structure of a table using ALTER TABLE - changing a column datatype and/or NULL in a table
ALTER TABLE schema.table
ALTER COLUMN column_name datatype NULL/NOT NULL;
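Hypothetical examples of all three forms (names are illustrative; ALTER COLUMN is the SQL Server syntax, other dialects may differ):
ALTER TABLE sales.Customers ADD Email VARCHAR(100) NULL;
ALTER TABLE sales.Customers DROP COLUMN Email;
ALTER TABLE sales.Customers ALTER COLUMN CustomerName VARCHAR(100) NOT NULL;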
which SQL command to use for adding a column
ALTER TABLE, because you are changing the table structure (this affects every row)
which SQL command to use for adding a row
INSERT INTO, because you are changing the table data but not the structure
SQL Code: add a record to a table using INSERT INTO
INSERT INTO schema.table (column1, column2, column3) VALUES (value1, value2, value3);
which datatype values must be surrounded with single quotes using INSERT INTO
VARCHAR, DATE, and DATETIME
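For example (hypothetical names), the VARCHAR and DATE values are quoted but the INT value is not:
INSERT INTO sales.Customers (CustomerID, CustomerName, SignupDate) VALUES (101, 'Acme Corp', '2024-05-01');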
SQL Code: delete a record from a table using DELETE
DELETE FROM schema.table WHERE condition;
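A hypothetical example, deleting one record by its primary key:
DELETE FROM sales.Customers WHERE CustomerID = 101;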
SQL Code: delete a table using DROP TABLE
DROP TABLE schema.table;
order of dropping tables
if the primary key of a table is referenced as a FK in another table, must drop the table w/ the FK reference first
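Illustrative sketch: if a hypothetical sales.Orders table has a FK referencing sales.Customers, the child table is dropped first:
DROP TABLE sales.Orders;
DROP TABLE sales.Customers;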
SQL Code: create a database view using CREATE VIEW
CREATE VIEW schema.view AS SELECT column_name(s) FROM schema.table WHERE condition;
SQL Code: to see a view
SELECT * FROM schema.view;
SQL Code: deleting a view
DROP VIEW schema.view;
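Hypothetical example of the full view lifecycle (view, table, and column names are made up):
CREATE VIEW sales.BigOrders AS SELECT OrderID, OrderDate, CustomerID FROM sales.Orders WHERE OrderTotal > 1000;
SELECT * FROM sales.BigOrders;
DROP VIEW sales.BigOrders;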
SQL CODE: updating a row w/ condition
UPDATE schema.table SET column1=value1, column2=value2 WHERE condition;
what is the safest way to update records in SQL
one row at a time, based on primary key
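For example (hypothetical names), updating a single row by its primary key:
UPDATE sales.Customers SET CustomerName = 'Acme Corporation' WHERE CustomerID = 101;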
SQL Code: enforcing FK/PK relations in tables
inside CREATE TABLE: PRIMARY KEY (column_name), FOREIGN KEY (fk_column) REFERENCES other_table (pk_column)
SQL Code: add records to one table using data from another (extract and load w/ SQL)
INSERT INTO target_table (column1, column2, column3) SELECT column1, column2, column3 FROM source_table WHERE condition;
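A hypothetical extract-and-load example (both table names are made up):
INSERT INTO analytics.CustomerCopy (CustomerID, CustomerName, RegionID)
SELECT CustomerID, CustomerName, RegionID
FROM sales.Customers
WHERE RegionID = 5;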
SQL data types
INT, DECIMAL (p,s), VARCHAR(n), DATE, DATETIME
DECIMAL (p,s)
p = precision (maximum # of digits to be stored) and s = scale (# of digits stored to the right of the decimal point)
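Worked example (illustrative column name): a column declared as Price DECIMAL(7,2) stores up to 7 digits total with 2 to the right of the decimal point, so 99999.99 fits but 123456.78 does not.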
DATE format
YYYY-MM-DD
advantages of normalization
1. less redundancy
2. less storage
disadvantages of normalization
queries take longer because more joins are needed
advantages of denormalization
1. quicker queries (fewer joins)
2. easier to understand structure
3. simpler to construct queries
disadvantages of denormalization
1. more storage
2. redundancy
first normal form (1NF) test
is each row uniquely identified, and does each field contain only one value (atomicity)?
first normal form (1NF) fix
create additional rows and/or identifiers (maybe composite key)
second normal form (2NF) test
is the table free from partial functional dependencies?
partial functional dependencies
columns that are functionally dependent on just a part of the composite key (is knowing that part of the key sufficient to determine the value?)
second normal form (2NF) fix
split the table into separate tables based on dependencies
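A classic illustration with hypothetical tables: if OrderLines has composite key (OrderID, ProductID) and ProductName depends only on ProductID, ProductName moves to its own table:
CREATE TABLE sales.Products (ProductID INT NOT NULL, ProductName VARCHAR(50) NOT NULL, PRIMARY KEY (ProductID));
CREATE TABLE sales.OrderLines (OrderID INT NOT NULL, ProductID INT NOT NULL, Quantity INT NOT NULL, PRIMARY KEY (OrderID, ProductID), FOREIGN KEY (ProductID) REFERENCES sales.Products (ProductID));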
third normal form (3NF) test
is the table free from transitive functional dependencies?
transitive functional dependency
occurs when a non-key column is functionally dependent on another non-key column
candidate key
a column (or group of columns) that is not the primary key of the table but could serve as the key of a new table
third normal form (3NF) fix
split the table according to these dependencies
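Illustrative sketch (hypothetical tables): if DeptName depends on DeptID, a non-key column in Employees, the department columns move to their own table:
CREATE TABLE hr.Departments (DeptID INT NOT NULL, DeptName VARCHAR(50) NOT NULL, PRIMARY KEY (DeptID));
CREATE TABLE hr.Employees (EmployeeID INT NOT NULL, EmployeeName VARCHAR(50) NOT NULL, DeptID INT NOT NULL, PRIMARY KEY (EmployeeID), FOREIGN KEY (DeptID) REFERENCES hr.Departments (DeptID));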
denormalization
process of reintroducing redundant data into a normalized database
connection between views and denormalization
views make normalized data look denormalized, but data integrity is maintained because the original tables are still fully normalized
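A hypothetical sketch: a view that joins the normalized tables above so users see one wide, denormalized-looking result:
CREATE VIEW hr.EmployeeDetails AS SELECT e.EmployeeID, e.EmployeeName, d.DeptName FROM hr.Employees e JOIN hr.Departments d ON e.DeptID = d.DeptID;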
data warehouse
repository for many sources of data across the organization, where data is "preprocessed"
data mart
type of data warehouse, more focused and topic specific (ex: sales)
data cube
aggregates data from data mart in a multidimensional format with dimensions and facts and allows you to analyze data across dimensions
data cubes describe _____
a business event
dimensions
aspects of the event (product, store, time)
facts
raw data about the event (quantity sold, price, etc)
star schema (OLAP)
a. optimized for analytics and reporting (aggregation, slicing and dicing)
b. large fact table with denormalized dimension tables
c. redundant (month, day as attributes)
d. stored in data marts/data warehouses
relational schema (OLTP)
a. optimized for day to day transactions (insert, update, delete)
b. normalized tables
c. no redundancy, attributes only appear once
d. stored in operational databases
role of denormalization within star schema
dimension tables are denormalized, which reduces the number of joins (faster and easier analysis)
advantages of star schema
1. faster
2. fewer joins
3. easier for non-tech users
4. allows slicing and dicing
disadvantages of star schema
1. data redundancy increases
2. slower to update
3. not good for frequent insert/delete
4. more storage
in star schema, the fact table has FKs from ______
each dimension table
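Illustrative sketch of a fact table with FKs to hypothetical dimension tables (assumed to already exist):
CREATE TABLE dw.FactSales (
    DateKey INT NOT NULL,
    StoreKey INT NOT NULL,
    ProductKey INT NOT NULL,
    QuantitySold INT NOT NULL,
    SalesAmount DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (DateKey, StoreKey, ProductKey),
    FOREIGN KEY (DateKey) REFERENCES dw.DimDate (DateKey),
    FOREIGN KEY (StoreKey) REFERENCES dw.DimStore (StoreKey),
    FOREIGN KEY (ProductKey) REFERENCES dw.DimProduct (ProductKey)
);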
___ relationship between dimension tables and fact table
1:n (one store could be associated with many sales, but a sale can only occur at one store)
which table in star schema is normalized
fact table (can get very big)
each row in the fact table is...
a block in the data cube
slicing data
constraining a dimension on a value (specific store, date, etc)
dicing data
filtering by multiple dimensions (sales by product for a certain store in a certain quarter)
how to slice and dice in SQL
using WHERE and HAVING
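A hypothetical slice-and-dice query against the fact table sketched above (store and product names are made up):
SELECT p.ProductName, SUM(f.SalesAmount) AS TotalSales
FROM dw.FactSales f
JOIN dw.DimProduct p ON f.ProductKey = p.ProductKey
JOIN dw.DimStore s ON f.StoreKey = s.StoreKey
WHERE s.StoreName = 'Downtown'
GROUP BY p.ProductName
HAVING SUM(f.SalesAmount) > 10000;
Here WHERE slices on the store dimension and HAVING filters the aggregated results.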
data mart granularity
level of detail in the data
costs of higher granularity
1. more space, less speed
2. effort and expense gathering and storing the data
3. more complex analysis and reporting
benefits of higher granularity
1. more accurate reporting
2. more use cases for the data mart
3. more flexibility in how the data is sliced and diced
data granularity rule of thumb
use the lowest level of granularity that still allows you to answer the question
ETL: extract
pulling data from the operational data store
ETL: transform
make sure the data is in a format ready for analysis (fix data consistency issues, handle missing data)
ETL: load
load the data into analytical data store
data quality
the degree to which the data reflects the actual environment (consistent with purpose, verified measurement)
role of SQL in ETL process
run SELECT queries to extract the data, then save the results as an Excel file to load into the analysis application
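A hypothetical extract query (names made up) that filters and shapes the data in SQL before the results are saved and loaded:
SELECT OrderID, OrderDate, CustomerID, OrderTotal
FROM sales.Orders
WHERE OrderDate >= '2024-01-01'
ORDER BY OrderDate;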
what does power query do
allows users to extract data from variety of sources, transform it and clean it, and load it into worksheet/dashboard
major functions of power query
1. find and replace
2. add/split/merge columns
3. filter rows
4. remove duplicates
5. change data type
6. calculated fields
7. combine tables (merge queries to join, append queries to stack rows)
why you would use SQL instead of other data analysis tools
1. better at scale
2. more control
3. easier to automate
SQL data analysis: better at scale
a. more efficient at joins, filtering, and sorting
b. allows you to filter/sort data so Excel/Power BI don't have to process the entire database
SQL data analysis: control
more flexible since it is a language
SQL data analysis: easier to automate
a. write a query once, run it multiple times
b. can run a query within a programming language
data visualization vs dashboard
data visualization = single chart/graph, dashboard = collection
dashboard
collection of data visualizations that summarize key metrics, facilitate understanding, and can be interactive
dashboard interaction
clickable elements impose filters
Power BI and Pivot Table similarities
1. drag and drop
2. Power Query for ETL
3. star schemas
4. slicing and dicing (filtering)
unstructured data
no data model and no predefined organization (text documents, images)
semi-structured data
no formal data model, but contains symbols to separate and label data elements (CSV, XML, JSON)
semistructured data purpose
common way to transfer data between software applications
CSV (comma separated value file)
a. each value is separated by a comma; otherwise it's plain text
b. no specified field lengths
c. first row is often the field names
d. quotes don't indicate data type, but rather allow commas to be treated as part of the value instead of as a separator
e. cannot skip values in a row
f. no hierarchies
XML (extensible markup language)
a. plain text file
b. text between an opening tag and its closing tag is the value, and the tags themselves create the labels
c. values can be any length, and include commas and quotes
d. field can be skipped
e. can have hierarchies (elements nested underneath)
JSON (javascript object notation)
a. plain text file, organized as objects within braces {}
b. uses key-value pairs
c. fields can be skipped
d. values can be any length
e. can have hierarchies (nested)
key-value pairs (JSON)
keys = field names (strings in quotes); values = the data (can be strings, integers, or decimals)
semi-structured data comparison: space
least = CSV, middle = XML (tags), most = JSON (tags and spaces)
semi-structured data comparison: hierarchies
XML and JSON
semi-structured data comparison: lightweight
most = JSON, then XML then CSV
semi-structured data comparison: skipping fields
XML and JSON
standard for transferring data across web
XML and JSON
NoSQL
category of databases that use flexible schema, are optimized for unstructured/semi-structured data and are designed for high performance through horizontal scaling
horizontal scaling (NoSQL)
add more computers instead of upgrading to more powerful ones
NoSQL is used for
high volume transaction processing with semi-structured data
examples of NoSQL use cases
online shopping carts, product catalogs, real-time chat, etc