Data Lake
a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data.
Data Lakehouse
Data lakehouses are a combination of a data warehouse and data lakes. They provide a cost-effective and flexible solution for data storage needs, but are not focused on transactional data.
OLTP (online transaction processing)
a technology used for real-time data queries and record creation
OLAP (online analytical processing)
a class of software that allows complex analysis to be conducted on large databases without affecting transactional systems
Data Mart
a data storage technology built for a specific department or need within a company: a subject-oriented relational database that stores transactional data in rows and columns, which makes it easy to access, organize, and understand. Because it contains historical data, this structure makes it easier for an analyst to identify data trends
quantitative (discrete)
countable and with a limited number of values
quantitative (continuous)
measurable and can take any value within a range
qualitative (nominal)
has no natural order, e.g. pink, green, brown
qualitative (ordinal)
follows a natural order, e.g. bad, good, great
SQL
a query language used to interact directly with a database
HTML
the markup language used to build web pages
XML (Extensible Markup Language)
a markup format used for sending data to and from another system
JSON (JavaScript Object Notation)
key-value pairs of data exchanged between two web applications
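A minimal sketch of JSON key-value data being read and written with Python's standard json module. The payload and its field names are made up for illustration.

```python
import json

# Hypothetical payload one web application might send to another.
payload_text = '{"user": "ana", "active": true, "roles": ["admin", "editor"]}'

payload = json.loads(payload_text)                 # JSON text -> Python dict
print(payload["user"], payload["roles"])

round_trip = json.dumps(payload, sort_keys=True)   # Python dict -> JSON text
print(round_trip)
```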
delimited file
a delimited flat file contains one or more records set off from each other by a specified delimiter, or separator
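As a small illustration, a comma-delimited set of records can be read with Python's standard csv module. The sample text and field names below are assumptions, not part of the card.

```python
import csv
import io

# Hypothetical sample data: three fields per record, comma as the delimiter.
raw_text = "id,name,city\n1,Ana,Lisbon\n2,Ben,Oslo\n"

# DictReader uses the first line as field names and splits each record on the delimiter.
reader = csv.DictReader(io.StringIO(raw_text), delimiter=",")
records = list(reader)

for record in records:
    print(record["id"], record["name"], record["city"])
```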
synchronous
the caller must wait for the web service's response before doing anything else
Asynchronous
allows you to do other tasks while waiting for the response
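A rough sketch of the asynchronous style using Python's asyncio. The call_service function is hypothetical, and asyncio.sleep stands in for waiting on a real web service.

```python
import asyncio

async def call_service(name: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on a (hypothetical) web service.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list:
    # Both requests are in flight at once; waiting on one does not block the other.
    return await asyncio.gather(
        call_service("request A", 0.02),
        call_service("request B", 0.01),
    )

results = asyncio.run(main())
print(results)
```

A synchronous version would await the first call to completion before even starting the second.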
web scraping
the act of extracting data from a website (last resort, get permission)
machine data
data generated automatically by machines such as web servers - can be used for predictive maintenance
sampling
creating a smaller data set from a larger data set
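One simple way to sample is a random draw without replacement, sketched here with Python's random module. The population and the seed are made up for illustration.

```python
import random

population = list(range(1, 101))        # hypothetical larger data set: 1..100

rng = random.Random(42)                 # fixed seed so the draw is repeatable
sample = rng.sample(population, k=10)   # simple random sample without replacement
print(sample)
```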
data profiling
the process of checking information that is present in the data
1. identify and document the source of data
2. identify the field names and data types
3. determine the fields to be identified for reporting
4. check for the primary, natural, or foreign keys
5. recognize all the data in the data set
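Parts of the profiling steps above can be sketched in plain Python: listing field names and data types, counting missing values, and checking which fields could serve as keys. The records and field names are hypothetical.

```python
# Hypothetical data set: a list of records with made-up fields.
rows = [
    {"customer_id": 1, "name": "Ana", "signup_year": 2021},
    {"customer_id": 2, "name": "Ben", "signup_year": None},
    {"customer_id": 3, "name": "Cy", "signup_year": 2023},
]

profile = {}
for field in rows[0]:
    values = [row[field] for row in rows]
    present = [v for v in values if v is not None]
    profile[field] = {
        # Step 2: field names and data types.
        "types": sorted({type(v).__name__ for v in present}),
        "missing": len(values) - len(present),
        # Step 4: a field is a candidate key when every value is present and unique.
        "candidate_key": len(present) == len(values) and len(set(present)) == len(values),
    }

for field, summary in profile.items():
    print(field, summary)
```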
parametric
Data with an underlying normal distribution
nonparametric
Data for which the probability distribution is unknown or known not to be normal
noise
Unnecessary data fields that have no value to the analysis
data manipulation
the process of recoding data so that it can be more useful during our processing, correlation, analysis, and reporting
derived variable
a new variable or data point derived or created from existing data
recoded data
data whose original values have been transformed into new values or categories
data imputation
substitutes missing data with estimated values
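Mean imputation is one common estimate, sketched below. Using the mean is an assumption here; other estimates such as the median are equally valid choices. The values are made up.

```python
from statistics import mean

ages = [34, None, 29, 41, None]   # hypothetical field with missing values

observed = [a for a in ages if a is not None]
fill_value = mean(observed)       # (34 + 29 + 41) / 3

# Replace each missing entry with the estimated value.
imputed = [fill_value if a is None else a for a in ages]
print(imputed)
```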
index field
a unique, non-personally identifiable number that can be used as a unique identifier
transposing data
swapping columns for rows, and rows for columns
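A minimal sketch of transposing a small table held as a list of rows. The table contents are made up.

```python
table = [
    ["region", "q1", "q2"],
    ["north", 10, 12],
    ["south", 7, 9],
]

# zip(*table) groups the first items of every row, then the second items, and so on,
# so each original column becomes a row.
transposed = [list(column) for column in zip(*table)]

for row in transposed:
    print(row)
```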
appending data
combines data from one data set to another data set
inline append
combines data sets together (discards original)
intermediate append
retains individual data sets, but also creates a new data set with the combined data (keeps original)
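The two append styles can be loosely illustrated with Python lists. The monthly data sets are hypothetical, and real tools implement these operations in their own ways.

```python
january = [{"id": 1, "sales": 100}, {"id": 2, "sales": 80}]
february = [{"id": 3, "sales": 120}]

# Intermediate append: build a new combined data set; both originals stay unchanged.
combined = january + february
print(len(january), len(february), len(combined))

# Inline append: add February's records directly into the January data set,
# so the original January-only version no longer exists.
january.extend(february)
print(len(january))
```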
data blending
combines data from multiple sources into a single data set for analysis
conditional logic
any kind of function that checks if there is a logical condition that's being met (if, isnull, and, or)
IF
is a logical function that uses a logical test to validate whether a condition is true or false
ISNULL
returns a specified value if the expression is null
AND
a logical function that tests whether two conditions are both true
OR
tests if either one of two conditions is true
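The four functions above have direct counterparts in most languages. Here is a rough Python sketch; the is_null helper and the order data are hypothetical, written to mirror the ISNULL definition.

```python
def is_null(value, replacement):
    """Return replacement when value is None, otherwise the value itself."""
    return replacement if value is None else value

orders = [
    {"qty": 5, "discount": None},
    {"qty": 0, "discount": 0.1},
    {"qty": 12, "discount": 0.2},
    {"qty": 3, "discount": 0.1},
]

labels = []
for order in orders:
    discount = is_null(order["discount"], 0.0)       # ISNULL
    if order["qty"] > 10 and discount > 0:           # IF with AND: both must hold
        labels.append("bulk discounted")
    elif order["qty"] == 0 or discount == 0:         # OR: either one is enough
        labels.append("review")
    else:
        labels.append("standard")

print(labels)
```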
system functions
any functions that are packaged with your reporting tool or analysis tool to perform certain functions inside of that software
aggregate functions
written for a group of records, not just for a single record, and work with a column of data
date functions
derive attributes from date fields, like determining the day of the week, month, or year from a single date
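A short sketch of deriving attributes from a single date with Python's datetime module. The date value is made up.

```python
from datetime import date

order_date = date(2024, 3, 15)   # hypothetical date field value

weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

year = order_date.year
month = order_date.month
day_of_week = weekday_names[order_date.weekday()]   # weekday(): Monday is 0

print(year, month, day_of_week)
```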
indexing
a field property setting that improves query speed and performance for fields that are commonly queried, sorted, or filtered
parsing
breaks and extracts data out of a field for use
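For instance, a combined name field or an email address can be parsed into its parts with basic string methods. The field values are made up.

```python
full_name = "Doe, Jane"            # hypothetical "last, first" field
email = "jane.doe@example.com"     # hypothetical email field

# Split on the separator and trim the surrounding whitespace from each piece.
last_name, first_name = [part.strip() for part in full_name.split(",")]
user, domain = email.split("@")

print(first_name, last_name, user, domain)
```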
inner join
selects records that have matching values in both tables
left outer join
returns all records from the left table, plus the matching records from the right table
right outer join
returns all records from the right table, plus the matching records from the left table
full outer join
returns everything that intersects, as well as the rest of the data from both sides
cross join
joins every row of the first table with every row of the second table, resulting in a potentially very large result set
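A minimal sketch of three of these joins using Python's built-in sqlite3 module. The tables and rows are hypothetical. Right and full outer joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Inner join: only customers that have at least one matching order.
inner_rows = conn.execute(
    "SELECT c.name, o.total FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Left outer join: every customer; total is NULL (None) where no order matches.
left_rows = conn.execute(
    "SELECT c.name, o.total FROM customers c "
    "LEFT OUTER JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
).fetchall()

# Cross join: every customer paired with every order (3 x 3 rows).
cross_count = conn.execute(
    "SELECT COUNT(*) FROM customers CROSS JOIN orders"
).fetchone()[0]

print(inner_rows, left_rows, cross_count)
```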
parameterization
the concept of replacing values within the query with parameters
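A small sketch of a parameterized query with Python's sqlite3 module, where ? placeholders stand in for the values. The table and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pen", 1.5), ("book", 12.0), ("lamp", 30.0)],
)

max_price = 15.0   # the value is passed separately, never pasted into the SQL text
cheap = conn.execute(
    "SELECT name FROM products WHERE price <= ? ORDER BY name",
    (max_price,),
).fetchall()

print(cheap)
```

The same query text can be reused with a different max_price without rebuilding the SQL string.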
temporary table
a table that exists only temporarily on the database, typically for the duration of a session, and is dropped automatically afterward
subquery
a query nested inside another query statement
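As an illustrative sketch, the inner query below computes an average that the outer query then filters on. It uses Python's sqlite3 module with made-up data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pen", 1.5), ("book", 12.0), ("lamp", 30.0)],
)

# The nested SELECT runs first and yields the average price (14.5 here);
# the outer SELECT keeps only products priced above it.
above_average = conn.execute(
    "SELECT name FROM products "
    "WHERE price > (SELECT AVG(price) FROM products)"
).fetchall()

print(above_average)
```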
actual execution plan
shows the steps the database actually took to run a query, captured after the query executes
estimated execution plan
shows the steps the database expects to take to run a query, generated without executing it
exploratory analysis
the goal is to figure out what type of cleaning, profiling and transformation the data needs - it's all about the initial look at a given data set
performance analysis
type of analysis that measures the performance of a particular product, outcome, or scenario against the defined objective
KPIs (Key Performance Indicators)
measurements and goals that help identify whether a business is achieving its objectives (qualitative or quantitative)
gap analysis
analyzes the difference between the present state and a desired or future state (mostly quantitative measures)
delta
the change between where you are and where you want to be
trend analysis
measures the trend on historical data to predict a future outcome
link analysis
determines how a single data point links to other data points
finding standard deviation for a SAMPLE
find the mean, subtract the mean from each value to get the differences, square the differences, then sum the squared differences and divide by the number of samples minus 1 (this is the sample variance). The standard deviation is the square root of the variance.
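The calculation can be followed step by step in Python and checked against the standard library's statistics.stdev, which also computes the sample standard deviation. The data values are made up.

```python
from math import sqrt
from statistics import stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

mean = sum(data) / len(data)                              # 1. find the mean
squared_diffs = [(x - mean) ** 2 for x in data]           # 2. subtract the mean, square
sample_variance = sum(squared_diffs) / (len(data) - 1)    # 3. divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)                         # 4. take the square root

print(round(sample_sd, 3), round(stdev(data), 3))
```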
z-score
the number of standard deviations a data value lies above or below the mean
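A z-score expresses how many standard deviations a value sits from the mean, computed as (value - mean) / standard deviation. The sketch below assumes the data is the whole population, so it uses statistics.pstdev. The scores are made up.

```python
from statistics import mean, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, treated as the full population

mu = mean(scores)        # 5
sigma = pstdev(scores)   # population standard deviation, 2.0

value = 9
z_score = (value - mu) / sigma   # standard deviations above the mean
print(z_score)
```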
frequency
number of times that the given data value appears in the dataset
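Counting frequencies is a one-liner with collections.Counter. The values are made up.

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red"]   # hypothetical data values

frequency = Counter(colors)   # maps each value to the number of times it appears
print(frequency.most_common())
```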
percentage difference
the difference between two data points relative to their mean: (b6-b5)/average(b5,b6)*100
percentage change
(b6-b5)/b5*100
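Both formulas, sketched in Python with the spreadsheet cell names kept as variable names. The two values are made up.

```python
b5 = 80.0    # hypothetical earlier value
b6 = 100.0   # hypothetical later value

# Percentage change: the difference relative to the starting value.
percentage_change = (b6 - b5) / b5 * 100

# Percentage difference: the difference relative to the mean of the two values.
percentage_difference = (b6 - b5) / ((b5 + b6) / 2) * 100

print(percentage_change, round(percentage_difference, 2))
```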
T-test
compares two groups to determine if there's a significant difference between their means
P-value
the probability of seeing a difference at least as large as the one observed if it were due to chance alone (you typically want lower than 5%)
null hypothesis
assumes that there is NO relationship between the two variables being tested
alternative hypothesis
assumes that there is a relationship between the two variables being tested
type I error
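false positive (rejecting a null hypothesis that is actually true)
type II error
false negative (failing to reject a null hypothesis that is actually false)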
chi-square statistic
compares the size of the difference between the expected result and the actual result
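The statistic is the sum of (observed - expected)^2 / expected over all categories. A minimal sketch with made-up counts:

```python
observed = [30, 20]   # hypothetical actual counts per category
expected = [25, 25]   # hypothetical expected counts per category

# Each category contributes its squared gap, scaled by the expected count.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)
```

A larger value means the actual results sit further from what was expected.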
regression analysis
Statistical method used to estimate relationships between a dependent variable and one or more independent variables
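The simplest case is a straight-line fit with one independent variable. The sketch below uses ordinary least squares written out by hand. The data points are made up and happen to lie exactly on y = 2x + 1.

```python
x = [1, 2, 3, 4]   # hypothetical independent variable
y = [3, 5, 7, 9]   # hypothetical dependent variable

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Ordinary least squares for one predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
print(slope, intercept)
```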
pie chart
to show percentages
tree map
made for representing hierarchical data
bar chart
values on the x axis
column chart
values on the Y axis
line chart
time based data
scatter plot
to see if your data fits a trend
bubble chart
A type of scatter plot with circular symbols used to compare three variables; the area of the circle indicates the value of a third variable
Histogram
A graph of vertical bars representing the frequency distribution of a set of data (no spaces between the bars)
waterfall chart
shows discrete events over time and how each event builds on the one before it, adding to or subtracting from the running total
stacked chart
The stacked column/bar chart breaks a bar or column into separate portions to represent each data point
static report
report that is not automatically updated
real-time reporting
occurs when receiving up-to-date data
ad hoc report
generated in response to a one-time request
paginated report
a multi-page report that is not suitable for display on a dashboard
wireframe
a series of multiple mockups for multiple screens that are likely connected on a dashboard
narrative
a summary of the report contents and key findings
data creation
when data is acquired, entered, or captured in the system
data acquisition
occurs when existing data is produced outside and imported automatically to the system
data entry
occurs when information is manually typed into the system
data capture
occurs when data generated by a device is brought into the organization
data storage
occurs when data is not being actively used
data use
viewing, processing, modifying, manipulating or saving the data
data archival
copying and storing of data that can be used when needed
data destruction
when the data is no longer valuable or has reached its useful life and needs to be destroyed
data steward
the person responsible for ensuring data is properly labeled, identified, collected, and stored
data custodian
a role that's responsible for handling the management of the system on which the data assets are going to be stored
data sovereignty
A term that refers to the legal implications of data stored in different countries or states