Motivation
The reason the analysis is being performed
Opportunities
New, potentially advantageous occasions that are a motivation for data analytics
Professional Issues and Requirements
Ensuring compliance with regulation, which is a motivation for data analytics
Problem Solving
Solving problems related to risks and issues around clients, which could be a motivation for data analytics
Process and Performance Assessment
Evaluating financial statements for potential material misstatements or risks of material misstatement, which could be a motivation for data analysis
Objective
Should follow from the motivation; the goal needs to be clear
Articulating Questions
Good questions
Role of critical thinking
motivation, objectives, and questions need to be critically evaluated
Data analysis objective
requires knowing why the task is being performed
descriptive analytics
objectives involving understanding something that is currently happening, or has happened
Descriptive Questions
often asked prior to advanced analyses
Descriptive Questions
identify purpose of analysis
break down the purpose into questions
Analyses Used to Answer Descriptive Questions
Frequency measures, measures of location, measures of dispersion, and measures of percentage change
Frequency measures
understand categories of data
Measure of location
reveal the average or typical observation in a data set
measure of dispersion
show how much variance there is among the observations in the data set
measure of percentage change
help compare to prior periods and the percent of the total
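A minimal sketch of these four descriptive measures in Python with pandas; the data and column names are hypothetical:

import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "Region": ["East", "West", "East", "South", "East"],
    "Amount": [1200.0, 950.0, 1100.0, 700.0, 1300.0],
})

print(sales["Region"].value_counts())                    # frequency measure: counts per category
print(sales["Amount"].mean(), sales["Amount"].median())  # measures of location
print(sales["Amount"].std(), sales["Amount"].var())      # measures of dispersion

prior_total = 4800.0                                     # hypothetical prior-period total
change = (sales["Amount"].sum() - prior_total) / prior_total * 100
print(change)                                            # percentage change vs. prior period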
Diagnostic Questions
build on descriptive questions; explore the data to find the cause of an outcome
What do diagnostic questions look for?
anomalies, correlations, patterns, trends
Diagnostic Analyses
Anomaly detection, correlation, pattern detection, trend analysis
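As a sketch of one of these diagnostic analyses, a correlation between two hypothetical fields in Python with pandas:

import pandas as pd

df = pd.DataFrame({
    "AdvertisingSpend": [10, 12, 9, 15, 14],
    "UnitsSold": [100, 115, 90, 140, 133],
})

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship
print(df["AdvertisingSpend"].corr(df["UnitsSold"]))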
Predictive Questions
what may happen in the future
Use of predictive analyses
financial accountants: use to find trends in sales/expenses
cost accountants: predict costs, create forecasts, and evaluate cost drivers
auditors: identify potential material misstatements
tax accountants: tax planning
Trendlines
Used by predictive analyses; show the underlying relationship of the data (functional relationships and linear functions)
Functional relationships
effect of an independent variable on a dependent variable
linear function
steady increases or decreases over the range of the independent variable
Linear regression
tool for building mathematical and statistical models; explains the relationship between a dependent variable and one or more independent variables (predictive)
Building a predictive model
Variable, Dependent Variable, Independent Variables
Variable
data field used for analysis
Dependent Variable
outcome measure (warranty expenses)
independent variable
variables that influence the dependent variable
Multiple R
correlation coefficient, measures strength of relationship between dependent and independent (-1 to 1)
R Squared
Coefficient of determination, measure of how well the regression line fits the data (closer to 1, the better the regression line fits to the data)
Adjusted R-Squared
Explains how well the regression line fits the data; modifies R-squared (the proportion of the variation in the dependent variable explained by the model) to account for the number of independent variables
standard error
variability of the observed dependent variable values from the values predicted by the model
regression statistics
statistical measures used to evaluate the model
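A minimal sketch producing these regression statistics in Python, assuming the statsmodels library and hypothetical warranty data:

import numpy as np
import statsmodels.api as sm

units = np.array([100, 120, 90, 150, 130, 110])            # independent variable
warranty = np.array([2100, 2500, 1900, 3100, 2700, 2300])  # dependent variable

X = sm.add_constant(units)        # add the intercept term
model = sm.OLS(warranty, X).fit()

print(model.rsquared)             # R Squared (coefficient of determination)
print(model.rsquared_adj)         # Adjusted R-Squared
print(np.sqrt(model.rsquared))    # Multiple R (magnitude of the correlation)
print(np.sqrt(model.mse_resid))   # standard error of the regression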
Data Analysis Plan
focus on the objective
select a data strategy
select an analysis strategy
consider risks
embed controls
Best Data strategy Alternative
The one with the highest overall factor ratings
In Progress Project Plans
data strategy and analysis strategy risks, and data strategy and analysis strategy controls
data strategy risks
data errors, data extraction errors, inaccurate management assumptions and estimates
analysis strategy risks
errors in age group categorization and calculations, erroneous application of bad debt estimation percentages to age group totals, erroneous summing of estimated bad debts across age groups, errors in calculating the amount to use for bad debt expense
data strategy controls
compare data to source documentation, compare to prior period to verify management estimates and assumptions, review customer/policy/procedure changes
analysis strategy controls
verify all classifications and groupings, verify all calculations, verify correct use of analysis technologies
When is data appropriate for analysis?
relevant, available, and characteristics match the analysis method requirements
Internal data
generated within the organization and more easily controlled/verified by organization (sales data, purchase data)
external data
obtained from sources outside the organization and somewhat riskier to use, but can provide insights that internal data alone cannot (weather data, publicly available competitor data)
Measured raw data
data created by a controlled process capturing the value of the data (price, cost, number on hand, weight, etc.); discrete or continuous
ex: InventoryCost, NumberOnHand
non-measured raw data
data created automatically by the computer or by company policy for control; discrete format (identification codes, standard descriptions)
ex: InventoryCode, ProductCategory
Calculated Data
data created when math operations are applied to one or more fields in a particular record (row); discrete or continuous
ex: =UnitCost x NumberOnHand
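A sketch of this calculated column in Python with pandas; the result column name ExtendedCost is hypothetical:

import pandas as pd

inventory = pd.DataFrame({
    "UnitCost": [4.50, 12.00, 7.25],
    "NumberOnHand": [100, 40, 60],
})

# Calculated data: one value computed per record (row)
inventory["ExtendedCost"] = inventory["UnitCost"] * inventory["NumberOnHand"]
print(inventory)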
Objective question
what is a reasonable estimate of the uncollectible receivables in the 2025 year-end outstanding accounts receivable?
Data strategy alternatives example
all credit sales invoices and collections data that pertain to the accounting period
Analysis Strategy Examples
Different methods for determining the age of an outstanding invoice
days outstanding from the invoice date forward
Data Strategy Risk and Controls Example
Missing or incorrectly included invoices in the general ledger or the subsidiary ledger
Possible Risks
human bias, changes in customer behavior, business process changes
Analysis Strategy Risk and Controls Examples
Errors in age group categorization
Identifying Appropriate Data
relevant, available, and the characteristics match the analysis method requirements
data set
collection of data columns and rows available for analysis
statistical measures and tests often require certain data characteristics
Fields
individual columns in a data set are called fields
attributes
the columns if the source of the data was a database
each column describes and represents one unique characteristic
records
rows in a data set from a database which represent the collection of columns that hold the description of a single occurrence of the data set’s purpose
Typical elements in a data set
column = data field = attribute
data set is the entire collection of rows and columns
row = record
Why is considering the source of the data important?
the quality of the data in the fields impacts the quality of the analysis
automatically sequentially assigned field data
can be confident that the data is consistently correct
categorical (nominal) data
labeled or named data that can be sorted into groups according to specific characteristics, do not have a quantitative value
ex: regions, product type, account number, account category, location
Ordinal Data
ordered or ranked categorical data, distance between the categories does not need to be known or equal
ex: survey question asking about customer service on a scale of 1-10
interval data
ordinal data that have equal and constant differences between observations and an arbitrary zero point
ex: temperature, time, credit scores
ratio data
interval data with a natural zero point (a natural zero point means that it is not arbitrary)
ex: economic data (dollars)
Data Risks
non-representative sample selection, outlier data points, dirty data
controls for data risks
verify that the sample is representative
perform a histogram or quartile analysis to identify outliers, then explain the rule used for outlier adjustment or removal
verify the integrity of the data set and clean up dirty data issues
outlier data points
unusual data points compared to rest of data
controlled by visualizing the data to check whether a measure is very different from the rest of the data
dirty data
missing, invalid, duplicate, and incorrectly formatted data
test before the analysis is performed
compare to source documents or test for reasonableness
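A minimal sketch of such dirty-data tests in Python with pandas; the sample data are hypothetical:

import pandas as pd

customers = pd.DataFrame({
    "CustomerID": [101, 102, 102, 104],
    "State": ["NY", None, "ny", "CA"],
})

print(customers.isna().sum())                            # missing values per column
print(customers.duplicated(subset="CustomerID").sum())   # duplicate IDs
print(customers["State"].str.upper().unique())           # surfaces inconsistent formatting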
table names
must be correct, intuitive, and clear
scan tables for incorrect or ambiguous names and rename
problem
incorrect or ambiguous table names make it harder to understand and work with a data set
detect (data profiling)
visually scan tables for incorrect or ambiguous names
correct (ETL)
rename tables
Pattern 3
irrelevant and unreliable data that bloat the data model
scan columns for irrelevant and unreliable data
can be done visually, data dictionary can be a good tool
pattern 4
incorrect and ambiguous column names; column names become variables during data exploration and interpretation
names are important because other people may use the analytical database
visually scan a column’s content
pattern 5
incorrect data types, integral part of column definition because they determine what can and cannot be done with the data in a column
inspect data type
change the data type
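A sketch of inspecting and correcting data types in Python with pandas; the columns are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "InvoiceDate": ["2025-01-15", "2025-02-03"],
    "Amount": ["1200.50", "980.00"],
})

print(df.dtypes)                                       # inspect: both columns load as text (object)
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])  # change to a date type
df["Amount"] = df["Amount"].astype(float)              # change to a numeric type
print(df.dtypes)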
pattern 6
composite and multi-valued columns; each cell should contain one value describing one characteristic
2 or more values in the same cell make analysis difficult
columns that violate the single-valued rule are composite or multi-valued columns and make analysis more complex
detect them by visually scanning; the solution is to split the column
By delimiter
used to split columns
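A sketch of splitting a composite column by delimiter in Python with pandas; FullName is a hypothetical field:

import pandas as pd

df = pd.DataFrame({"FullName": ["Smith, Ann", "Lee, Bo"]})

# Split the composite column on its delimiter into two single-valued columns
df[["LastName", "FirstName"]] = df["FullName"].str.split(", ", expand=True)
print(df)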
pattern 7
wrong value is assigned to one of the entities’ characteristics
helpful to look for outlying values in numeric data
more than 1.5 times the IQR below the first quartile or above the third quartile
identify the root cause and eliminate it, or correct the value in the source data; the value could also be corrected in the analytical database
outlier
falls more than 1.5 times the IQR below the first quartile or above the third quartile
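A minimal sketch of the 1.5 x IQR rule in Python with pandas; the amounts are hypothetical:

import pandas as pd

amounts = pd.Series([210, 225, 198, 240, 2150, 205])

q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
# Flag values more than 1.5 times the IQR below Q1 or above Q3
outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]
print(outliers)   # flags 2150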
pattern 8
data inconsistency occurs when two or more different representations of the same value are mixed in the same column
distinct values: visually scanning the distinct values of a column
frequencies: values with a low frequency could indicate inconsistent data
Eliminate bad data by identifying the root cause and eliminating/modifying the values in either the source data or the analytical database
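A sketch of profiling distinct values and frequencies in Python with pandas; the values are hypothetical:

import pandas as pd

states = pd.Series(["NY", "New York", "NY", "CA", "NY", "CA"])

print(states.unique())        # distinct values reveal "NY" vs. "New York" in the same column
print(states.value_counts())  # low-frequency values can signal inconsistent data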
Pattern 9
incomplete values
addresses the incompleteness that might make data unusable and unreliable
investigate null values (ETL tools reveal them on a column-by-column basis)
replace
remove
design a consistent schema
pattern 10
invalid values
domain-specific rules that determine whether data are acceptable; can be created for most columns
create and apply validation rules
rely on data profiling
for a mandatory column, stats about null values provided by ETL tools can be used for validation
eliminate the root cause, change the value in the source data, or change the value in the analytical database
Design and Implementation of Validation Rules
actual hours for a service must be positive, and can’t exceed 14
ACTUALHOURSVALID = IF SERVICE.ACTUALTIME > 0 AND SERVICE.ACTUALTIME <= 14, THEN "YES", ELSE "NO"
minimum employee rate is $150, and the maximum employee rate must be lower than $500
RATEVALID = IF EMPLOYEE.RATE >= 150 AND EMPLOYEE.RATE < 500, THEN "YES", ELSE "NO"
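A sketch of the same two validation rules in Python with pandas rather than the formula notation above; the sample data are hypothetical:

import numpy as np
import pandas as pd

service = pd.DataFrame({"ACTUALTIME": [3.5, 0.0, 16.0]})
employee = pd.DataFrame({"RATE": [150, 495, 500]})

# Actual hours must be positive and cannot exceed 14
service["ACTUALHOURSVALID"] = np.where(
    (service["ACTUALTIME"] > 0) & (service["ACTUALTIME"] <= 14), "YES", "NO")

# Rate must be at least $150 and below $500
employee["RATEVALID"] = np.where(
    (employee["RATE"] >= 150) & (employee["RATE"] < 500), "YES", "NO")

print(service)
print(employee)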
Pattern 11
non-intuitive and ambiguous table names
part of data model and data set vocab, so must be correct, intuitive, and clear
scan tables for ambiguous names and rename
check whether each name accurately reflects the table's content
names should be intuitive and avoid spaces, underscores, and special coding
pattern 12
missing primary keys
tables are descriptions of entities, and each instance of an entity should be uniquely identified
the key column must have a unique value for each instance and no nulls
often already in place when the data are extracted
pattern 13
redundant content across columns
data inconsistencies occur when the same data are recorded more than once and changed in one place but not the other (e.g., an email address)
Possible scenarios
overlap, such as an address field that contains state info alongside a separate state field
dependency, which exists when one column’s values are dependent on the values of another column in the same table
perform a column-by-column comparison; delete redundant and dependent columns
pattern 14
find invalid values with intra-table rules
create and apply intra-table rules that identify invalid data
creating validation rules requires in-depth knowledge
Transform Models
search for data issues across tables
ex: data that describe the same entity spread across multiple tables, models with a structure that is hard to understand, models that do not support efficient processing
Pattern 15
data spread across tables
analysis is more challenging when data that describe the same entity are spread across multiple tables
Identify similarly structured tables or tables describing different characteristics of the same entity, and combine the tables
Pattern 16
data models do not comply with principles of dimensional modeling
technique of creating data models with fact tables surrounded by dimension tables
star schemas
determine whether fields belong in the correct fact or dimension tables
Pattern 17
find invalid values with inter-table rules
determine the validity of a column's values based on values in one or more other tables
referential integrity
create and apply inter-table validation rules that identify invalid data; modify invalid values
referential integrity
all values in a foreign key should also exist as values in the corresponding primary key
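A minimal sketch of a referential integrity check in Python with pandas; the tables and keys are hypothetical:

import pandas as pd

customer = pd.DataFrame({"CustomerID": [101, 102, 103]})   # primary key
sales = pd.DataFrame({"InvoiceID": [1, 2, 3],
                      "CustomerID": [101, 104, 102]})      # foreign key

# Every foreign-key value must exist in the corresponding primary key; 104 violates the rule
violations = sales[~sales["CustomerID"].isin(customer["CustomerID"])]
print(violations)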
Star schema
database design used to organize data in a warehouse/data mart, simplifying data analysis and querying
information modeling
process of generating additional knowledge from data that is relevant for analysis purposes
data is the input (raw facts and figures)
algorithms
set of instructions that transform the data into information, which is the output of additional knowledge gained from the data
link the input of facts and the output of useful info
used to calculate depreciation, cost, ratios, and more
can be simple or complex
calculated columns
value is calculated for each cell in the column, integral parts of the table
ex: NetAmount and TransactionSize
Measure
aggregate, or total, that can be used in reports, and thus for analytical purposes
created by algorithms, not integral parts of a table
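A sketch contrasting a calculated column with a measure in Python with pandas, reusing the NetAmount example; the data are hypothetical:

import pandas as pd

sales = pd.DataFrame({
    "Quantity": [2, 5, 1],
    "UnitPrice": [10.0, 4.0, 25.0],
})

# Calculated column: one value per row, stored as an integral part of the table
sales["NetAmount"] = sales["Quantity"] * sales["UnitPrice"]

# Measure: an aggregate computed for reporting, not stored as a table column
total_net_amount = sales["NetAmount"].sum()
print(total_net_amount)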
Data model
shows the structure of a data set
shows the concepts being described, the tables, and the fields used to describe the concepts
Extended Sales Star Schema
shows the extended sales star schema with field names such as brand, country, and gender
fact table describes sales transactions
fields with question marks indicate the information model’s calculated columns and measures