What is the purpose of primary and foreign keys in a relational database?
Keys help link different data together
According to Moss (2021), 179 assets were unintentionally included in a purchase contract due to...
Hidden rows becoming visible when the spreadsheet was converted to a PDF file
How are data stored in a relational database? (Rosenblum and Dorsey 2013)
Different types of data are stored in different tables
According to Redman (2013), which is NOT a cause of bad data?
Everyone is invested in getting the data right
According to Tableau, if data are incorrect, the analysis may...
Produce results that look as if they were correct
According to IBM (2020), the ETL process often involves...
Data cleansing to improve the quality and consistency of data
According to IBM (2020), Extract, Transform, Load (ETL) processes are often needed because...
Data from multiple sources need to be integrated for analysis
How does an ELT process differ from an ETL process (IBM 2020)?
ELT involves less data transformation before the data is loaded into a target system
According to Aaltonen (2013), metrics can lead to mindlessly optimizing every individual aspect of business. What does this mean?
Managers lose sight of the big picture and things that matter to customers
According to Schambra (2013), the problem with scoring non-profit outcomes as "success" or "failure" is...
The distinction between success and failure is not always clear
According to Aaltonen (2013), metrics and intuition are...
Both are needed in organizations
What might happen if non-profits were funded solely on their success KPIs? (Schambra 2013)
Small non-profits serving the most vulnerable may be culled
What does ETL stand for?
Extract, Transform, Load
Define the ETL process
ETL is an automated process of:
Extracting data from multiple sources
Transforming data into a consistent format
Loading data into an analytical system
What happens in the EXTRACT phase of ETL?
Data is extracted/pulled from multiple different sources (databases, files, APIs, etc.)
What happens in the TRANSFORM phase of ETL?
Data is transformed into a consistent format by:
Resolving formatting differences
Standardizing data representation
Cleaning inconsistencies
Validating data quality
What happens in the LOAD phase of ETL?
The transformed data is loaded into an analytical system (like a data warehouse) where it can be analyzed
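A minimal sketch of all three phases in Python with pandas (the column names echo the lecture's Employees/Performance example; the file name and inline data are made up):

```python
import sqlite3

import pandas as pd

# EXTRACT: in practice this would be pd.read_csv(...) or pd.read_sql(...);
# small inline frames stand in for two hypothetical source systems.
employees = pd.DataFrame({"EmpNo": [1, 2, 2], "Name": [" abbie ", "Carol", "Carol"]})
reviews = pd.DataFrame({"EmpID": [1, 2], "Rating": [4, 5]})

# TRANSFORM: resolve naming differences, standardize formats, drop duplicates
employees = employees.rename(columns={"EmpNo": "EmpID"})
employees["Name"] = employees["Name"].str.strip().str.title()
employees = employees.drop_duplicates(subset="EmpID")

# LOAD: write the integrated dataset into an analytical store (SQLite here)
combined = employees.merge(reviews, on="EmpID", how="inner")
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("employee_performance", conn, if_exists="replace", index=False)
```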
Why do we need ETL?
The power of data analytics is often based on combining data from different sources
Data stored in different places are often formatted differently
These differences need to be resolved before data can be combined
Why is it difficult to enforce consistent schemas even within the same organization?
Different departments may:
Use different systems
Have different data collection practices
Use different naming conventions
Store data in different formats
Have different data requirements
When data are in the same database, how can they be combined?
Data can be combined directly using database queries (like JOIN operations)
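For example, a sketch using Python's built-in sqlite3 module (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (DeptNo INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE Employees (EmpID INTEGER PRIMARY KEY, Name TEXT, DeptNo INTEGER);
    INSERT INTO Departments VALUES (10, 'Sales'), (20, 'IT');
    INSERT INTO Employees VALUES (1, 'Abigail', 10), (2, 'Douglas', 20);
""")

# JOIN combines the two tables through their common field, DeptNo
rows = conn.execute("""
    SELECT e.Name, d.DeptName
    FROM Employees e
    JOIN Departments d ON e.DeptNo = d.DeptNo
""").fetchall()
print(rows)  # [('Abigail', 'Sales'), ('Douglas', 'IT')]
```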
What challenges arise when data come from different sources?
Need to identify which columns represent the same data across tables
Must decide which version of data to use when represented differently
Different naming conventions (e.g., EmpNo vs EmpID)
Different value formats (e.g., "Abbie" vs "Abigail")
Inconsistent data types or structures
What are the three steps to setting up an ETL process?
Inspect metadata (data dictionary) to assess which columns map to each other in different datasets
Decide how to choose the correct version of data that reside in multiple databases
Set up rules for resolving inconsistencies, duplicates, omissions, and other problems; validate the data
What is metadata in the context of ETL?
Metadata is "data about data" - it includes data dictionaries that describe the structure, format, and meaning of data columns in different datasets
How does ETL improve data quality?
ETL transforms data into a format that is ready for use in business tasks by:
Resolving inconsistencies
Removing duplicates
Filling omissions
Validating data
Standardizing formats
Give an example of integrating data for insights
Combining Seattle Crime Rate data with Seattle Real Estate Price data to find correlations or patterns
In the lecture example with Departments, Employees, and Performance tables, what mapping issues existed?
DeptNum vs DeptNo (different column names)
EmpNo vs EmpID (different column names)
Name vs FirstName (different column names)
"Abbie" vs "Abigail" (different name formats)
"Carol" vs "Carolyn" (different name formats)
"Douglas" vs "Doug" (different name formats)
What are the row and column limits for Excel 1997-2003?
65,536 rows by 256 columns
What are the row and column limits for Excel 2007 and later?
1,048,576 rows by 16,384 columns
How many rows can Excel digest for practical analysis?
Perhaps a few hundred thousand rows (datasets can have billions of rows, which Excel cannot handle)
What are the "Three Vs" of Big Data?
Volume - large amounts of data
Velocity - speed at which data is generated/processed
Variety - different types and formats of data
What is the fundamental approach of distributed data processing?
Instead of bringing huge amounts of data to a huge processing unit (supercomputer), it is better to:
Distribute data to several smaller processing units
Send the analytical code to them
Combine the results together
What is “Big Data”?
A set of technologies that allow processing very large amounts of data so that they can be analyzed
What is Hadoop?
Hadoop stores data in smaller chunks across a network on different computers (nodes)
What is MapReduce?
MapReduce processes the pieces of data in parallel in different nodes and combines the results together
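A toy sketch of the idea in plain Python (not Hadoop's actual API): each "node" maps over its own chunk of data, and the partial results are then reduced into one answer.

```python
from collections import Counter
from functools import reduce

# The data is split into chunks, as if stored on different nodes
chunks = [
    ["pause", "play", "pause"],
    ["skip", "pause"],
    ["play", "skip", "skip"],
]

# MAP: each node independently counts the events in its own chunk
partial_counts = [Counter(chunk) for chunk in chunks]

# REDUCE: the partial results are combined into a single result
total = reduce(lambda a, b: a + b, partial_counts)
print(total)  # Counter({'pause': 3, 'skip': 3, 'play': 2})
```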
What are Hadoop and MapReduce known for?
They are some of the most popular technologies for dealing with large amounts of data
What is the main advantage of the Hadoop/MapReduce approach?
The idea is deceptively simple
Provides much more scalability
Can process huge amounts of data on relatively cheap hardware (not expensive supercomputers)
What is a limitation of the Hadoop/MapReduce approach?
It becomes difficult when data are so interdependent that processing one data item requires knowledge of other data items (e.g., social network data)
Describe the Comcast cable box example from lecture
Comcast:
Stores real-time cable box activity for millions of customers by region
Analyzes during which programs people are most likely to pause and then skip the commercials
Uses distributed processing across different regions (Region A, B, C, D)
What is a UNION operation in data integration?
Union combines rows from two or more tables that have the same structure (same columns), stacking them vertically to create a single larger dataset
What is a JOIN operation in data integration?
Join combines columns from two or more tables based on a related column between them (like a common key), connecting them horizontally
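A sketch of both operations in pandas on made-up tables: pd.concat plays the role of UNION and pd.merge the role of JOIN.

```python
import pandas as pd

sales_east = pd.DataFrame({"CustID": [1, 2], "Amount": [100, 250]})
sales_west = pd.DataFrame({"CustID": [3], "Amount": [75]})
customers = pd.DataFrame({"CustID": [1, 2, 3], "Name": ["Abigail", "Carolyn", "Douglas"]})

# UNION: same columns, rows stacked vertically
all_sales = pd.concat([sales_east, sales_west], ignore_index=True)

# JOIN: different columns connected horizontally via the common key CustID
print(all_sales.merge(customers, on="CustID", how="left"))
```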
What tools are available for ETL processes?
There are numerous tools available to create ETL processes for different purposes (specific tools not mentioned in lecture, but they automate the Extract, Transform, Load workflow)
Why is distributed processing better than using a supercomputer?
More cost-effective (uses cheaper hardware)
More scalable (can add more nodes as needed)
Better for parallel processing of independent data chunks
Avoids bottleneck of moving massive amounts of data to one location
What happened in the Barclays-Lehman Brothers case?
Barclays purchased 179 extra assets from failed Lehman Brothers due to an Excel error.
What happened with UK COVID-19 data?
The UK government missed nearly 16,000 coronavirus cases because case data exceeded the row limit of the legacy Excel (.xls) file format.
What happens to zip codes like 08098 in Excel?
Excel interprets them as integers, so the leading zero disappears (08098 becomes 8098).
What happens when you export a large number from Excel to CSV?
The exported value becomes 2.34555E+14 instead of the original number.
What gets exported from Excel to CSV: display value or exact value?
The display-formatted value is exported instead of the exact value.
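A defensive sketch in pandas (the column names are hypothetical): reading identifier-like columns as text sidesteps both the leading-zero and the scientific-notation problem.

```python
import io

import pandas as pd

csv_data = io.StringIO("zip,account\n08098,234555000000000\n10001,234555000000001\n")

# Naive read: zip is parsed as an integer, so the leading zero is lost
naive = pd.read_csv(csv_data)
print(naive["zip"].tolist())  # [8098, 10001]

# Defensive read: treat identifier-like columns as text, not numbers
csv_data.seek(0)
safe = pd.read_csv(csv_data, dtype={"zip": str, "account": str})
print(safe["zip"].tolist())  # ['08098', '10001']
```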
What does CSV stand for?
Comma-Separated Values.
What are CSV files commonly used for?
Transferring data between systems and applications (along with Excel files).
How can you inspect CSV files?
Using any text editor because they are just text files.
What do CSV files contain?
Just the plain dataset - no formatting, no formulas, no visualizations.
What is the Agency Problem in data?
The data creator is often NOT the data consumer.
What percentage of time do analysts spend on data quality issues?
50% of their time.
What three activities do analysts spend 50% of their time on?
1) Searching for data, 2) Correcting errors, 3) Verifying correctness.
According to Varian (2010), what's better when experiments are cheap?
Experiments provide more reliable answers than opinion
According to Varian, when is expert opinion plausible for decisions?
When experiments are costly or impractical to run.
What should managers say to questions about user preferences according to Varian?
"Run an experiment."
What's the first step in the vicious cycle of bad data?
Mistakes in data.
What do mistakes in data lead to?
Errors in decision-making.
What happens after errors in decision-making?
People stop trusting data.
What do people do when they stop trusting data?
People return to intuition as the basis for decisions.
What happens when people return to intuition?
Less investment in getting the data right.
What does the vicious cycle of bad data ultimately kill?
A data-driven decision-making culture.
Who identified the vicious cycle of bad data?
Redman (2013).
What's the first best practice for data quality?
Focus on getting new data right.
What's the second best practice for data quality?
Limit time fixing old data.
What's the third best practice for data quality?
Data producers should communicate with data consumers.
What's the fourth best practice for data quality?
Have a mindset to check your work constantly.
What is an outlier?
An observation that lies an abnormal distance from other values in a sample.
Who decides what counts as abnormal for outliers?
The analyst (or a consensus process).
Is dealing with outliers always a clear decision?
No, it's always a judgment call.
What are three ways to deal with outliers?
1) Remove them, 2) Replace with another value, 3) Leave as is.
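A sketch of all three options in pandas, using the common 1.5×IQR rule as one (judgment-call) definition of "abnormal":

```python
import pandas as pd

values = pd.Series([12, 14, 13, 15, 14, 98])  # 98 lies far from the rest

# Flag values outside 1.5 interquartile ranges of the middle 50%
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

removed = values[~is_outlier]                        # 1) remove them
replaced = values.mask(is_outlier, values.median())  # 2) replace with another value
kept = values                                        # 3) leave as is
print(removed.tolist(), replaced.tolist(), kept.tolist())
```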
Are outliers always bad?
No, sometimes we may be specifically interested in the outliers.
What's Step 1 in cleaning data?
Remove duplicate or irrelevant observations.
What's Step 2 in cleaning data?
Fix structural errors.
What's Step 3 in cleaning data?
Filter unwanted outliers.
What's Step 4 in cleaning data?
Handle missing data.
What's Step 5 in cleaning data?
Validate and QA (Quality Assurance).
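A sketch walking through the five steps on a tiny made-up dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Seattle", "seattle ", "Seattle", "Tacoma", None],
    "price": [450_000, 450_000, 460_000, 99_000_000, 380_000],
})

# Step 1: remove duplicate or irrelevant observations
df = df.drop_duplicates()

# Step 2: fix structural errors (stray whitespace, inconsistent capitalization)
df["city"] = df["city"].str.strip().str.title()
df = df.drop_duplicates()  # the fix may reveal new duplicates

# Step 3: filter unwanted outliers (a judgment call; a fixed cap here)
df = df[df["price"] < 10_000_000]

# Step 4: handle missing data (dropped here; imputing is another option)
df = df.dropna(subset=["city"])

# Step 5: validate and QA, e.g. assert basic expectations hold
assert df["price"].gt(0).all()
assert df["city"].notna().all()
print(df)
```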
What are flat files?
CSV and Excel files that store data in a simple format.
How are flat files structured?
As datasets that are more or less ready to be analyzed.
What are the two most common types of flat files?
CSV files and Excel spreadsheet files.
What does SQL stand for?
Structured Query Language.
What is SQL used for?
Operating relational databases.
Name three relational database products.
Oracle, Microsoft Access, MySQL.
What does a relational database contain?
A collection of tables that store data for different types of entities.
What are database tables made of?
Rows (records/observations) and columns (variables).
What are fields made of?
Characters that can represent different types of data (data types).
What is a database schema?
The logical and visual configuration of the entire relational database.
What does a database schema describe?
The structure of the database.
What visual tool is used to show database structure?
ERD Diagram (Entity Relationship Diagram).
What does a database query do?
Combines data from two or more tables.
How are tables joined in a database?
By common fields.
Give an example of a simple SQL query.
SELECT * FROM Employees;
How does a relational database differ from one big dataset?
It stores different types of data in different tables instead of one big two-dimensional matrix.
What do you usually need to do before using data from a relational database?
Combine data from different tables into a new table (dataset) using queries.
What is the first benefit of relational databases?
Integrity - easier to maintain data integrity when the same item is recorded in one place only.
What is the second benefit of relational databases?
Flexibility - you can create different cuts of the data.
What is the third benefit of relational databases?
Efficiency - faster to retrieve and update data without redundant values.
What is the main drawback of relational databases?
They are more complex to operate and use than flat files.
What does the MATCH function in Excel do?
Finds a specified value in a range of cells and returns its relative position.
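MATCH is an Excel worksheet function; a rough Python analogue for exact matching (returning a 1-based position, as Excel does) might look like this:

```python
def match(value, cells):
    """Return the 1-based position of value in cells, like Excel's MATCH
    with match_type 0 (exact match); raises ValueError if absent,
    roughly analogous to Excel returning #N/A."""
    return cells.index(value) + 1

print(match("Carolyn", ["Abigail", "Carolyn", "Douglas"]))  # 2
```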