SQL Code: Create a table based on its meta data using CREATE TABLE
CREATE TABLE schema.table (column1_name datatype NULL/NOT NULL, column2_name datatype NULL/NOT NULL, PRIMARY KEY (column_name), FOREIGN KEY (column_name) REFERENCES other_table (column_name));
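A hypothetical concrete example (schema, table, and column names are made up, and a sales.Regions table is assumed to already exist):
CREATE TABLE sales.Customers (
    CustomerID INT NOT NULL,
    CustomerName VARCHAR(50) NOT NULL,
    SignupDate DATE NULL,
    RegionID INT NULL,
    PRIMARY KEY (CustomerID),
    FOREIGN KEY (RegionID) REFERENCES sales.Regions (RegionID)
);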
which column always has NOT NULL when creating a table
primary key
SQL Code: change the structure of a table using ALTER TABLE - adding a column to a table
ALTER TABLE schema.table ADD column_name datatype NULL/NOT NULL;
SQL Code: change the structure of a table using ALTER TABLE - removing a column and its data from a table
ALTER TABLE schema.table DROP COLUMN column_name;
SQL Code: change the structure of a table using ALTER TABLE - changing a column datatype and/or NULL in a table
ALTER TABLE schema.table
ALTER COLUMN column_name datatype NULL/NOT NULL;
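Hypothetical examples of all three forms (names are illustrative; ALTER COLUMN is the SQL Server syntax, other dialects may differ):
ALTER TABLE sales.Customers ADD Email VARCHAR(100) NULL;
ALTER TABLE sales.Customers DROP COLUMN Email;
ALTER TABLE sales.Customers ALTER COLUMN CustomerName VARCHAR(100) NOT NULL;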
which SQL command to use for adding a column
ALTER TABLE, because you are changing the table structure (this affects every row)
which SQL command to use for adding a row
INSERT INTO, because you are changing the table data but not the structure
SQL Code: add a record to a table using INSERT INTO
INSERT INTO schema.table (column1, column2, column3) VALUES (value1, value2, value3);
which datatype values must be surrounded with single quotes using INSERT INTO
VARCHAR, DATE, and DATETIME
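For example (hypothetical names), the VARCHAR and DATE values are quoted but the INT value is not:
INSERT INTO sales.Customers (CustomerID, CustomerName, SignupDate) VALUES (101, 'Acme Corp', '2024-05-01');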
SQL Code: delete a record from a table using DELETE
DELETE FROM schema.table WHERE condition;
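A hypothetical example, deleting one record by its primary key:
DELETE FROM sales.Customers WHERE CustomerID = 101;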
SQL Code: delete a table using DROP TABLE
DROP TABLE schema.table;
order of dropping tables
if the primary key of a table is referenced as a FK in another table, must drop the table w/ the FK reference first
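Illustrative sketch: if a hypothetical sales.Orders table has a FK referencing sales.Customers, the child table is dropped first:
DROP TABLE sales.Orders;
DROP TABLE sales.Customers;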
SQL Code: create a database view using CREATE VIEW
CREATE VIEW schema.view AS SELECT column_name(s) FROM schema.table WHERE condition;
SQL Code: to see a view
SELECT * FROM schema.view;
SQL Code: deleting a view
DROP VIEW schema.view;
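Hypothetical example of the full view lifecycle (view, table, and column names are made up):
CREATE VIEW sales.BigOrders AS SELECT OrderID, OrderDate, CustomerID FROM sales.Orders WHERE OrderTotal > 1000;
SELECT * FROM sales.BigOrders;
DROP VIEW sales.BigOrders;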
SQL CODE: updating a row w/ condition
UPDATE schema.table SET column1=value1, column2=value2 WHERE condition;
what is the safest way to update records in SQL
one row at a time, based on primary key
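For example (hypothetical names), updating a single row by its primary key:
UPDATE sales.Customers SET CustomerName = 'Acme Corporation' WHERE CustomerID = 101;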
SQL Code: enforcing FK/PK relations in tables
inside CREATE TABLE: PRIMARY KEY (column_name), FOREIGN KEY (fk_column) REFERENCES other_table (pk_column)
SQL Code: add records to one table using data from another (extract and load w/ SQL)
INSERT INTO target_table (column1, column2, column3) SELECT column1, column2, column3 FROM source_table WHERE condition;
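A hypothetical extract-and-load example (both table names are made up):
INSERT INTO analytics.CustomerCopy (CustomerID, CustomerName, RegionID)
SELECT CustomerID, CustomerName, RegionID
FROM sales.Customers
WHERE RegionID = 5;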
SQL data types
INT, DECIMAL (p,s), VARCHAR(n), DATE, DATETIME
DECIMAL (p,s)
p = precision (maximum # of digits to be stored) and s = scale (# of digits stored to the right of the decimal point)
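Worked example (illustrative column name): a column declared as Price DECIMAL(7,2) stores up to 7 digits total with 2 to the right of the decimal point, so 99999.99 fits but 123456.78 does not.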
DATE format
YYYY-MM-DD
advantages of normalization
1. less redundancy
2. less storage
disadvantages of normalization
queries take longer because more joins are needed
advantages of denormalization
1. quicker queries (fewer joins)
2. easier to understand structure
3. simpler to construct queries
disadvantages of denormalization
1. more storage
2. redundancy
first normal form (1NF) test
is each row uniquely identified, and does each field contain only one value (atomicity)?
first normal form (1NF) fix
create additional rows and/or identifiers (maybe composite key)
second normal form (2NF) test
is the table free from partial functional dependencies?
partial functional dependencies
columns that are functionally dependent on just a part of the composite key (is knowing that part of the key sufficient to determine the value?)
second normal form (2NF) fix
split the table into separate tables based on dependencies
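A classic illustration with hypothetical tables: if OrderLines has composite key (OrderID, ProductID) and ProductName depends only on ProductID, ProductName moves to its own table:
CREATE TABLE sales.Products (ProductID INT NOT NULL, ProductName VARCHAR(50) NOT NULL, PRIMARY KEY (ProductID));
CREATE TABLE sales.OrderLines (OrderID INT NOT NULL, ProductID INT NOT NULL, Quantity INT NOT NULL, PRIMARY KEY (OrderID, ProductID), FOREIGN KEY (ProductID) REFERENCES sales.Products (ProductID));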
third normal form (3NF) test
is the table free from transitive functional dependencies?
transitive functional dependency
occurs when a non-key column is functionally dependent on another non-key column
candidate key
a column (or group of columns) that is not the primary key of the table but could serve as the key of a new table
third normal form (3NF) fix
split the table according to these dependencies
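Illustrative sketch (hypothetical tables): if DeptName depends on DeptID, a non-key column in Employees, the department columns move to their own table:
CREATE TABLE hr.Departments (DeptID INT NOT NULL, DeptName VARCHAR(50) NOT NULL, PRIMARY KEY (DeptID));
CREATE TABLE hr.Employees (EmployeeID INT NOT NULL, EmployeeName VARCHAR(50) NOT NULL, DeptID INT NOT NULL, PRIMARY KEY (EmployeeID), FOREIGN KEY (DeptID) REFERENCES hr.Departments (DeptID));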
denormalization
process of reintroducing redundant data into a normalized database
connection between views and denormalization
views make normalized data look denormalized, but data integrity is maintained because the original tables are still fully normalized
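A hypothetical sketch: a view that joins the normalized tables above so users see one wide, denormalized-looking result:
CREATE VIEW hr.EmployeeDetails AS SELECT e.EmployeeID, e.EmployeeName, d.DeptName FROM hr.Employees e JOIN hr.Departments d ON e.DeptID = d.DeptID;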
data warehouse
repository for many sources of data across the organization, where data is "preprocessed"
data mart
type of data warehouse, more focused and topic specific (ex: sales)
data cube
aggregates data from data mart in a multidimensional format with dimensions and facts and allows you to analyze data across dimensions
data cubes describe _____
a business event
dimensions
aspects of the event (product, store, time)
facts
raw data about the event (quantity sold, price, etc)
star schema (OLAP)
a. optimized for analytics and reporting (aggregation, slicing and dicing)
b. large fact table with denormalized dimension tables
c. redundant (month, day as attributes)
d. stored in data marts/data warehouses
relational schema (OLTP)
a. optimized for day to day transactions (insert, update, delete)
b. normalized tables
c. no redundancy, attributes only appear once
d. stored in operational databases
role of denormalization within star schema
dimension tables are denormalized, which reduces the number of joins (faster and easier analysis)
advantages of star schema
1. faster
2. fewer joins
3. easier for non-tech users
4. allows slicing and dicing
disadvantages of star schema
1. data redundancy increases
2. slower to update
3. not good for frequent insert/delete
4. more storage
in star schema, the fact table has FKs from ______
each dimension table
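Illustrative sketch of a fact table with FKs to hypothetical dimension tables (assumed to already exist):
CREATE TABLE dw.FactSales (
    DateKey INT NOT NULL,
    StoreKey INT NOT NULL,
    ProductKey INT NOT NULL,
    QuantitySold INT NOT NULL,
    SalesAmount DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (DateKey, StoreKey, ProductKey),
    FOREIGN KEY (DateKey) REFERENCES dw.DimDate (DateKey),
    FOREIGN KEY (StoreKey) REFERENCES dw.DimStore (StoreKey),
    FOREIGN KEY (ProductKey) REFERENCES dw.DimProduct (ProductKey)
);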
___ relationship between dimension tables and fact table
1:n (one store could be associated with many sales, but a sale can only occur at one store)
which table in star schema is normalized
fact table (can get very big)
each row in the fact table is...
a block in the data cube
slicing data
constraining a dimension on a value (specific store, date, etc)
dicing data
filtering by multiple dimensions (sales by product for a certain store in a certain quarter)
how to slice and dice in SQL
using WHERE and HAVING
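A hypothetical slice-and-dice query against the fact table sketched above (store and product names are made up):
SELECT p.ProductName, SUM(f.SalesAmount) AS TotalSales
FROM dw.FactSales f
JOIN dw.DimProduct p ON f.ProductKey = p.ProductKey
JOIN dw.DimStore s ON f.StoreKey = s.StoreKey
WHERE s.StoreName = 'Downtown'
GROUP BY p.ProductName
HAVING SUM(f.SalesAmount) > 10000;
Here WHERE slices on the store dimension and HAVING filters the aggregated results.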
data mart granularity
level of detail in the data
costs of higher granularity
1. more space, less speed
2. effort and expense gathering and storing the data
3. more complex analysis and reporting
benefits of higher granularity
1. more accurate reporting
2. more use cases for the data mart
3. more flexibility in how the data is sliced and diced
data granularity rule of thumb
use the lowest level of granularity that still allows you to answer the question
ETL: extract
pulling data from the operational data store
ETL: transform
make sure the data is in a format ready for analysis (fix data consistency issues, handle missing data)
ETL: load
load the data into analytical data store
data quality
the degree to which the data reflects the actual environment (consistent with purpose, verified measurement)
role of SQL in ETL process
run SELECT queries to extract the data, then save the results as an Excel file to load into the analysis application
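A hypothetical extract query (names made up) that filters and shapes the data in SQL before the results are saved and loaded:
SELECT OrderID, OrderDate, CustomerID, OrderTotal
FROM sales.Orders
WHERE OrderDate >= '2024-01-01'
ORDER BY OrderDate;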
what does power query do
allows users to extract data from variety of sources, transform it and clean it, and load it into worksheet/dashboard
major functions of power query
1. find and replace
2. add/split/merge columns
3. filter rows
4. remove duplicates
5. change data type
6. calculated fields
7. combine tables (merge queries to join, append queries to stack rows)
why you would use SQL instead of other data analysis tools
1. better at scale
2. more control
3. easier to automate
SQL data analysis: better at scale
a. more efficient at joins, filtering, and sorting
b. allows you to filter/sort data so Excel/Power BI don't have to process the entire database
SQL data analysis: control
more flexible since it is a language
SQL data analysis: easier to automate
a. write a query once, run it multiple times
b. can run a query within a programming language
data visualization vs dashboard
data visualization = single chart/graph, dashboard = collection
dashboard
collection of data visualizations that summarize key metrics, facilitate understanding, and can be interactive
dashboard interaction
clickable elements impose filters
Power BI and Pivot Table similarities
1. drag and drop
2. Power Query for ETL
3. star schemas
4. slicing and dicing (filtering)
unstructured data
no data model and no predefined organization (text documents, images)
semi-structured data
no formal data model, but contains symbols to separate and label data elements (CSV, XML, JSON)
semistructured data purpose
common way to transfer data between software applications
CSV (comma separated value file)
a. each value is separated by a comma; otherwise it's plain text
b. no specified field lengths
c. first row is often the field names
d. quotes don't indicate data type, but rather allow commas to be treated as part of the value instead of as a separator
e. cannot skip values in a row
f. no hierarchies
XML (extensible markup language)
a. plain text file
b. text between an opening tag and its closing tag is the value, and the tags themselves create the labels
c. values can be any length, and include commas and quotes
d. field can be skipped
e. can have hierarchies (elements nested underneath)
JSON (javascript object notation)
a. plain text file, organized as objects within braces {}
b. uses key-value pairs
c. fields can be skipped
d. values can be any length
e. can have hierarchies (nested)
key-value pairs (JSON)
keys = field names (strings in quotes); values = the data (can be strings, integers, or decimals)
semi-structured data comparison: space
least = CSV, middle = XML (tags), most = JSON (tags and spaces)
semi-structured data comparison: hierarchies
XML and JSON
semi-structured data comparison: lightweight
most = JSON, then XML then CSV
semi-structured data comparison: skipping fields
XML and JSON
standard for transferring data across web
XML and JSON
NoSQL
category of databases that use flexible schema, are optimized for unstructured/semi-structured data and are designed for high performance through horizontal scaling
horizontal scaling (NoSQL)
add more computers instead of upgrading to more powerful ones
NoSQL is used for
high volume transaction processing with semi-structured data
examples of NoSQL use cases
online shopping carts, product catalogs, real-time chat, etc