Data schema
used to describe both the organization of data & the relationships between tables in a given database
Database Engineers
plan the database schema before they begin to build the system
Relational Database
has multiple tables linked together with different relationships
Spreadsheet
uses tables to store the data that's being captured
Tabular Schema
uses rows and columns in a table format to store all of the data
Relational Database Mgmt. System (RDBMS)
Used with many different database software packages, such as MySQL, MariaDB, and even Amazon's serverless database system, Aurora
SQL
a programming language for data that works across all relational databases. Structured Query Language is a domain-specific language used to manage data, especially in a relational database management system. It is particularly useful for handling structured data, i.e., data incorporating relations among entities and variables
Non-relational Databases
Databases that are not based on table relationships and do not use SQL (Structured Query Language); able to handle large amounts of traffic and data; it is often faster to get info from a non-relational DB than a relational DB. Remember: SQL = relational databases; NoSQL or GraphQL = non-relational DBs
Document oriented databases
store the data inside of XML documents or in JSON format, which is the JavaScript Object Notation
Key value stores
store data as key-value pairs, where each unique key points directly to its value
Column oriented databases
store data in columns instead of rows like we do in a traditional relational database
Graph Stores
x is used to store individual elements as nodes inside of the database, with the relationships between nodes stored as edges
Relational DB
Tables are used to store fields inside of the different columns |
Each row is going to have the record that holds the data for that relational DB |
Normalization is the tool used to optimally design the tables so they hold the least amount of redundant info possible |
Uses SQL, to write data into the DB |
the amount of data that can be held in each field and the data type is limited |
Each particular field in your db is going to have a specific purpose and field type |
Non relational DB
Alternative to relational db |
Stores as much info as you want in a key-value pair, as opposed to being limited to 255 characters |
Easier to scale and build out for web based or cloud based applications |
Uses any programming language |
Stores both structured and unstructured data within those DBs because you have more flexibility than relational DB |
Data normalization
Goal of x is to establish the relationships between the different data in the forms, so that we have the data we need when we run our reports; optimizes the storage and use of data within a given DB
First normal form (1NF)
eliminates repeating groups and redundant info in individual tables; each field holds only a single value
Second normal form (2NF)
moves related info that applies to multiple rows into separate tables and links them using a foreign key
Third normal form (3NF)
eliminates fields that do not depend on a given key
Fourth normal form (4NF)
the data must be in Boyce-Codd normal form (BCNF) and have no multi-valued dependencies
Fifth Normal form (5NF)
has all the same characteristics as 4NF and ensures there are no join dependencies
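The first steps of normalization can be sketched in plain Python: start from a table that repeats customer details on every order row, then split the customer attributes into their own table linked by a foreign key. All table, field, and sample values here are made up for illustration.

```python
# Hypothetical denormalized table: customer details repeat on every order row.
denormalized = [
    {"order_id": 1, "customer": "Ada",   "city": "London", "item": "Widget"},
    {"order_id": 2, "customer": "Ada",   "city": "London", "item": "Gadget"},
    {"order_id": 3, "customer": "Grace", "city": "Boston", "item": "Widget"},
]

customers = {}       # customer name -> customer_id
customer_table = []  # one row per unique customer
order_table = []     # orders now reference customers by key only

for row in denormalized:
    name = row["customer"]
    if name not in customers:
        # First time we see this customer: give them an id and one row.
        customers[name] = len(customers) + 1
        customer_table.append(
            {"customer_id": customers[name], "customer": name, "city": row["city"]}
        )
    # The order keeps only a foreign key to the customer, not the details.
    order_table.append(
        {"order_id": row["order_id"], "customer_id": customers[name], "item": row["item"]}
    )
```

After the split, Ada's city is stored once instead of twice, which is the redundancy that normalization is meant to remove.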
Primary key-
unique identifier for a record that cannot contain duplicates, every table needs to have a primary key
Foreign key-
a primary key from one table that is referenced by a different table
DB Relationships- One to one-
one record in the table will be associated with only one record in another table
DB Relationships: One to many
one record in the table with a primary key is associated with multiple records in other tables
Referential integrity
Remember: make sure that all of the changes we make to any part of our db cascade throughout that db, whether through an update or a deletion |
Used to establish and maintain record relationships in tables |
Ensures and guarantees that a primary key being used as a foreign key in a table actually exists in the first table before records are created in the second table |
Prevents the occurrence of bad or missing data in any of the tables |
Referential integrity comes under attack when people start modifying or deleting parts of the db |
Cascade delete/update
makes sure that no data is left orphaned as you make updates or deletions to the primary key and the records it controls
Data denormalization
Occurs when the data is not structured in tables using normalization |
Will have many redundancies and repetitive data |
Happens when dealing with big data, data warehousing, data mining, data analysis and data visualization |
Data systems
any info tech system that captures, processes, stores, queries or reports on the data contained within them
Data processing types
It's important to determine whether you should use OLTP or OLAP to prevent performance issues
DB transaction
an insertion, deletion, or simply query of the db
Online Transactional Processing (OLTP)
systems designed to handle very large volumes of transactions; built for large numbers of real-time DB transactions
Online Analytical Processing (OLAP)
systems designed to handle longer-running queries for more complex data processing; built for big queries that might involve many joins and large data sets
Data Warehouse
Ingests data from the source systems of record and combines them together; considered the single source of truth; combines all data into one place to be accessed efficiently; an example of an OLAP system, built to handle really big, heavy queries
Source System
system of record for any particular kind of data
Clickstream data
data where individual users are clicking on different pages of the website
Purchase data
involves people who used credit cards to buy something that then had to be fulfilled and shipped to them
Catalog information
another source system that contains the categorization and the descriptions about individual items
OLAP Cubes
3D structure that provides data grouped together by different dimensions; optimizes the expected queries from each data mart's customers
Data warehouse schemas
Fact Table- contains all of the main keys associated with the queries |
Dimension table- info associated with a fact table that is tied to it using a shared key |
Star schema- individual dimension tables branching off from the fact table so that it looks like a star; only one layer of dimensions |
Snowflake schema- multiple dimension tables associated with each dimension, branching out like a snowflake |
Data Lakes
Centralized repository that can hold both structured and unstructured data, does not require data to be structured, but giving it structure makes it easier to query
Data Lakehouse
queries data in place on the data lake; to run those queries it needs structured data, with schema data built up off to the side; you get the flexibility and cost effectiveness of a data lake but can still conduct queries across the entire data set; a lakehouse or a true data warehouse can be used to analyse large amounts of data
Dimensional table-
contains metadata about the things in your fact table; how do you manage changes to that dimensional data, and how do you retain a history of what it used to be?
Type 1 dimensional table
slowly changing dimension where new info simply overwrites the old info; you can no longer query for previous names that existed in the past
Type 2 dimensional table
slowly changing dimension; retains a complete history of the info and of all previous data changes
Type 3 dimensional table
approach is to maintain the current and previous data
Discrete data
data that can be counted with a certain # of values
Continuous data
data that is measured rather than counted and can take any value. Height, weight, temperature and length are all examples of continuous data. Some continuous data will change over time; the weight of a baby in its first year or the temperature in a room throughout the day.
Quantitative
data that is defined through #s
Qualitative
data arranged into groups or categories based on some kind of quality
Nominal data
Nominal data is data that can be labelled or classified into mutually exclusive categories within a variable. These categories cannot be ordered in a meaningful way. For example, for the nominal variable of preferred mode of transportation, you may have the categories of car, bus, train, tram or bicycle.
Ordinal data
type of data that has a natural ordering or ranking. It is categorical data that can be ranked or ordered in accordance with a specific attribute or characteristic. Examples of ordinal data are the level of education, the range of income, or the grades.
Data fields
contains the different pieces of info in different databases, data types are a thing we can control
Texts and alphanumeric field data types-
the most common type, and you'll hear this called either a character, a text or a string
Character type
a single character, either a letter or a # that's being stored in a field
Text or string type
grouping of characters that contains letters or #s in this alphanumeric data field type; can be uppercase or lowercase; you cannot do mathematical operations on a # stored in a string
Date type
stores a calendar date in a month/day/year or day/month/year format; can also store time
Number data type
will not allow any text, as it only stores numbers (as you might expect). Use of the number data type allows for calculations. It is very important to use the number data type if you need to do any basic arithmetic in your reporting.
Currency data type
special type of number that represents money; allows for two decimal places
Boolean data type
used for things that have only two values, either a yes or a no, a true or a false, on off, one or zero
Structured Data
follows an existing convention, fits into tables and dbs nicely, specific format in a specific data field type for each particular data
Unstructured Data
not organized in a pre-defined manner that meets standards for structured data, can be text, images or video
Semi-structured data
mix of both; a webpage is a great example, as are XML files, zip files, emails and JSON files
Delimited file
fields in which some form of character is used to separate each field of data from the other data fields; the most common type is the CSV, or comma-separated values file
Tab delimited file-
uses a tab to separate each of the fields in the file; other characters, like a pipe |, can also be used to separate fields; extensions include .CSV, .TAB, .TSV, .TXT
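The cards above can be sketched with Python's csv module, which parses both comma-delimited and tab-delimited text; the delimiter is just a parameter. The sample records are invented.

```python
import csv
import io

# The same two records, once comma-delimited and once tab-delimited.
csv_text = "name,age\nAda,36\nGrace,45\n"
tsv_text = "name\tage\nAda\t36\nGrace\t45\n"

# csv.reader splits each line on the delimiter into a list of strings.
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Both readers produce identical rows, which is why a pipe-delimited file only needs `delimiter="|"` rather than a different parser.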
Flat file
any delimited file that is exported out of a db system and can then be sent to someone else; because it has been exported from the db, it is not real-time data but rather backup data or a point-in-time snapshot of the db
SQL
x is most commonly used when working with a db and its data; uses a series of statements to provide instructions to the db
Select statement-
how you query info from the db and select fields
Where keyword-
allows you to select something where a certain condition happens
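The SELECT and WHERE cards can be demonstrated against a throwaway SQLite table; the employee data here is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", "IT"), ("Grace", "IT"), ("Alan", "HR")])

# SELECT picks which fields come back; WHERE keeps only the rows
# matching the condition.
names = [row[0] for row in conn.execute(
    "SELECT name FROM employees WHERE dept = 'IT' ORDER BY name")]
```

Dropping the WHERE clause would return all three names; the condition is what narrows the result to the IT department.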
HTML
hypertext markup language; used to write webpages in a semi-structured environment; uses tags to dictate which parts of the info are displayed
XML
extensible markup language; a text-based markup language like HTML but with a different purpose; interacts really well with JavaScript; its goal is to transfer data, not display it on screen
JavaScript Object Notation (JSON)
language used to get data to and from different websites; X is the de facto file format when sending info; uses arrays, which makes it easier to use in modern networks
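A quick JSON round-trip with Python's json module, showing the array syntax the card mentions; the payload contents are invented.

```python
import json

# JSON text with an object holding a string and an array of numbers.
payload = '{"user": "ada", "scores": [10, 20, 30]}'

record = json.loads(payload)   # JSON text -> Python dict
total = sum(record["scores"])  # the JSON array arrives as a Python list
text = json.dumps(record)      # dict -> JSON text, ready to send back
```

Because arrays map straight onto native lists, lists of records can be serialized and parsed without any custom format handling.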
Data System
any info tech system that captures, processes, stores, queries or reports on the data contained within it
Extract Transform Load (ETL)
process that occurs when moving data from a source system to a data warehouse by extracting data from the sources, transforming the data and then loading it to the data warehouse
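The extract, transform, and load steps can be sketched as three tiny Python functions; the source rows, field names, and cleanup rules are all illustrative, not a real pipeline.

```python
# Toy "source system": amounts are strings and names are messy.
source_rows = [
    {"name": " Ada ", "amount": "9.50"},
    {"name": "grace", "amount": "20.00"},
]

def extract():
    # In real ETL this would read from a source system (DB, API, file).
    return list(source_rows)

def transform(rows):
    # Clean up names and convert amounts from strings to numbers.
    return [{"name": r["name"].strip().title(), "amount": float(r["amount"])}
            for r in rows]

warehouse = []  # stand-in for the data warehouse

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
```

Swapping the order of the last line's calls, loading raw rows first and transforming inside the destination, is essentially the ELT variant described on the next card.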
Extract Load and Transform (ELT)
a modern method used when preparing data for data lakes by holding data in preparation for future transformation
Extracting data
Process of extracting source data and importing it into the system; the objective of extraction is to connect to the data source; SQL, Power BI and Power Query are tools for extracting data from external databases
Transforming data
Process of transforming data to another table format, timestamps are very useful for end users
API-
connection between computers or other programs; designed to present a set of questions and define answers in the system. Application Programming Interface: "application" refers to any software with a distinct function, and the interface can be thought of as a contract of service between two applications, defining how the two communicate with each other using requests and responses
Pull model-
continuously pulls data into the system
Push model
only sends notifications when data changes
Web service
communication between or among electronic devices; JSON and XML both encode structured data; has a specific function to provide different kinds of info
Synchronous
requests from a web service and waits for the response
Asynchronous
allows you to do other tasks while waiting for the response
Web/Data/Screen Scraping
act of extracting data from a website
Machine data
data generated by webservers, machine data can be used as a predictive maintenance tool
Observation
the act of collecting data by observing and then analyzing afterwards, observation data should not be manipulated
Sampling
creating a smaller data set from a larger data set, random sampling, systematic sampling, stratified sampling
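The three sampling methods on this card can be sketched with the standard library's random module over a made-up population of 100 records.

```python
import random

population = list(range(1, 101))  # 100 records, purely illustrative

# Random sampling: every record has an equal chance of selection.
random.seed(0)  # seeded only so the sketch is repeatable
random_sample = random.sample(population, 10)

# Systematic sampling: take every k-th record (here k = 10).
systematic_sample = population[::10]

# Stratified sampling: split into groups (strata), then sample each
# group separately so both are represented.
evens = [n for n in population if n % 2 == 0]
odds = [n for n in population if n % 2 == 1]
stratified_sample = random.sample(evens, 5) + random.sample(odds, 5)
```

All three produce a smaller data set of 10 records; they differ only in how the 10 are chosen from the 100.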
Data Profiling
the process of working with data to begin to discern information and trends present in that data
Steps of Data Profiling
1. ID and document data source |
2. ID the field names and data types |
3. Determine the fields to be ID'd for reporting |
4. Check for the primary, natural or foreign keys |
5. Recognize all the data in the dataset |
Redundant data
ID data stored in multiple places, determine the redundant data and work on how to minimize it
Duplicated data
data repeated within the same dataset, to find duplicated data, use the built-in tools in a data analytics software
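Outside of the built-in tools in analytics software, duplicated rows can also be found with a few lines of Python; the sample rows are invented.

```python
from collections import Counter

# Three rows, one of which is an exact repeat.
rows = [("Ada", "London"), ("Grace", "Boston"), ("Ada", "London")]

counts = Counter(rows)  # how many times each row appears
duplicates = [row for row, n in counts.items() if n > 1]

# Deduplicate while keeping the first occurrence and the original order.
deduplicated = list(dict.fromkeys(rows))
```

The same Counter approach works on any hashable row representation, such as tuples built from the fields that define a "duplicate" in your dataset.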
Unnecessary data
Its important to understand what data you need and what data you can ignore, extra data slows down your system Tools: Excel, Power BI, Tableau
Missing values
Missing data is referred to as Null, represented as blank fields, "Null" or "n/a". A value is Null when it is not applicable to the field, when the dataset doesn't have the information, when the datasets do not match the expected information, or when the survey data is incomplete. What can you do about it? Filter out the NULL values, or replace the missing values
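Both remedies from this card, filtering out nulls and replacing them, can be sketched in a few lines; `None` plays the role of NULL and `"n/a"` stands in for a placeholder string from an export.

```python
# A column of ages with two kinds of missing value.
ages = [34, None, 29, "n/a", 41]

def is_missing(value):
    # Treat both the real null and the "n/a" placeholder as missing.
    return value is None or value == "n/a"

filtered = [v for v in ages if not is_missing(v)]     # drop missing values
replaced = [0 if is_missing(v) else v for v in ages]  # or replace them
```

Which remedy is right depends on the report: filtering shrinks the dataset, while replacing keeps the row count but injects a substitute value (here 0, chosen arbitrarily).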
Invalid Data
Invalid data = incorrect data. Different reasons data can be invalid: hard-coded data, invalid data questions, extreme values, incorrect data, and invisible characters (ASCII codes inside a computer system that are invisible or non-printable). Look for leading and/or trailing spaces; remove or replace invalid data
Meeting specification
Specifications: certain types or quality standards set by DB engineers when designing systems; the most common reason data doesn't meet specifications is the wrong data type, followed by improper storage of numeric characters
Data Outliers
Any piece of data that is outside the normal distance from the other values in a sample. Nonparametric statistics: identifies data not assumed to come from a prescribed model predetermined by a small number of parameters
Parametric
normal baseline
Nonparametric
distribution independent
SCD type 3 relies on creating _____ to reflect historical data
new columns
Schema that will improve performance and allow for faster queries:
star schema
Common raster graphic format
GIF