What is the purpose of primary and foreign keys in a relational database?
Keys help link different data together
According to Moss (2021), 179 assets were unintentionally included in a purchase contract due to...
Hidden rows becoming visible when the spreadsheet was converted to a PDF file
How are data stored in a relational database? (Rosenblum and Dorsey 2013)
Different types of data are stored in different tables
According to Redman (2013), which is NOT a cause of bad data?
Everyone is invested in getting the data right
According to Tableau, if data are incorrect, the analysis may...
Produce results that look as if they were correct
According to IBM (2020), the ETL process often involves...
Data cleansing to improve the quality and consistency of data
According to IBM (2020), Extract, Transform, Load (ETL) processes are often needed because...
Data from multiple sources need to be integrated for analysis
How does an ELT process differ from an ETL process (IBM 2020)?
ELT involves less data transformation before the data is loaded into a target system
According to Aaltonen (2013), metrics can lead to mindlessly optimizing every individual aspect of business. What does this mean?
Managers lose sight of the big picture and things that matter to customers
According to Schambra (2013), the problem with scoring non-profit outcomes as "success" or "failure" is...
The distinction between success and failure is not always clear
According to Aaltonen (2013), metrics and intuition are...
Both are needed in organizations
What might happen if non-profits were funded solely on their success KPIs? (Schambra 2013)
Small non-profits serving the most vulnerable may be culled
What does ETL stand for?
Extract, Transform, Load
Define the ETL process
ETL is an automated process of:
Extracting data from multiple sources
Transforming data into a consistent format
Loading data into an analytical system
What happens in the EXTRACT phase of ETL?
Data is extracted/pulled from multiple different sources (databases, files, APIs, etc.)
What happens in the TRANSFORM phase of ETL?
Data is transformed into a consistent format by:
Resolving formatting differences
Standardizing data representation
Cleaning inconsistencies
Validating data quality
What happens in the LOAD phase of ETL?
The transformed data is loaded into an analytical system (like a data warehouse) where it can be analyzed
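A minimal sketch of all three phases in Python with pandas (the column names echo the lecture's Employees/Performance example; the file name and inline data are made up):

```python
import sqlite3

import pandas as pd

# EXTRACT: in practice this would be pd.read_csv(...) or pd.read_sql(...);
# small inline frames stand in for two hypothetical source systems.
employees = pd.DataFrame({"EmpNo": [1, 2, 2], "Name": [" abbie ", "Carol", "Carol"]})
reviews = pd.DataFrame({"EmpID": [1, 2], "Rating": [4, 5]})

# TRANSFORM: resolve naming differences, standardize formats, drop duplicates
employees = employees.rename(columns={"EmpNo": "EmpID"})
employees["Name"] = employees["Name"].str.strip().str.title()
employees = employees.drop_duplicates(subset="EmpID")

# LOAD: write the integrated dataset into an analytical store (SQLite here)
combined = employees.merge(reviews, on="EmpID", how="inner")
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("employee_performance", conn, if_exists="replace", index=False)
```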
Why do we need ETL?
The power of data analytics is often based on combining data from different sources
Data stored in different places are often formatted differently
These differences need to be resolved before data can be combined
Why is it difficult to enforce consistent schemas even within the same organization?
Different departments may:
Use different systems
Have different data collection practices
Use different naming conventions
Store data in different formats
Have different data requirements
When data are in the same database, how can they be combined?
Data can be combined directly using database queries (like JOIN operations)
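For example, a sketch using Python's built-in sqlite3 module (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (DeptNo INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE Employees (EmpID INTEGER PRIMARY KEY, Name TEXT, DeptNo INTEGER);
    INSERT INTO Departments VALUES (10, 'Sales'), (20, 'IT');
    INSERT INTO Employees VALUES (1, 'Abigail', 10), (2, 'Douglas', 20);
""")

# JOIN combines the two tables through their common field, DeptNo
rows = conn.execute("""
    SELECT e.Name, d.DeptName
    FROM Employees e
    JOIN Departments d ON e.DeptNo = d.DeptNo
""").fetchall()
print(rows)  # [('Abigail', 'Sales'), ('Douglas', 'IT')]
```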
What challenges arise when data come from different sources?
Need to identify which columns represent the same data across tables
Must decide which version of data to use when represented differently
Different naming conventions (e.g., EmpNo vs EmpID)
Different value formats (e.g., "Abbie" vs "Abigail")
Inconsistent data types or structures
What are the three steps to setting up an ETL process?
Inspect metadata (data dictionary) to assess which columns map to each other in different datasets
Decide how to choose the correct version of data that reside in multiple databases
Set up rules for resolving inconsistencies, duplicates, omissions, and other problems; validate the data
What is metadata in the context of ETL?
Metadata is "data about data" - it includes data dictionaries that describe the structure, format, and meaning of data columns in different datasets
How does ETL improve data quality?
ETL transforms data into a format that is ready for use in business tasks by:
Resolving inconsistencies
Removing duplicates
Filling omissions
Validating data
Standardizing formats
Give an example of integrating data for insights
Combining Seattle Crime Rate data with Seattle Real Estate Price data to find correlations or patterns
In the lecture example with Departments, Employees, and Performance tables, what mapping issues existed?
DeptNum vs DeptNo (different column names)
EmpNo vs EmpID (different column names)
Name vs FirstName (different column names)
"Abbie" vs "Abigail" (different name formats)
"Carol" vs "Carolyn" (different name formats)
"Douglas" vs "Doug" (different name formats)
What are the row and column limits for Excel 1997-2003?
65,536 rows by 256 columns
What are the row and column limits for Excel 2007 and later?
1,048,576 rows by 16,384 columns
How many rows can Excel digest for practical analysis?
Perhaps a few hundred thousand rows (datasets can have billions of rows, which Excel cannot handle)
What are the "Three Vs" of Big Data?
Volume - large amounts of data
Velocity - speed at which data is generated/processed
Variety - different types and formats of data
What is the fundamental approach of distributed data processing?
Instead of bringing huge amounts of data to a huge processing unit (supercomputer), it is better to:
Distribute data to several smaller processing units
Send the analytical code to them
Combine the results together
What is “Big Data”?
A set of technologies that allow processing very large amounts of data so that they can be analyzed
What is Hadoop?
Hadoop stores data in smaller chunks across a network on different computers (nodes)
What is MapReduce?
MapReduce processes the pieces of data in parallel in different nodes and combines the results together
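A toy sketch of the idea in plain Python (not Hadoop's actual API): each "node" maps over its own chunk of data, and the partial results are then reduced into one answer.

```python
from collections import Counter
from functools import reduce

# The data is split into chunks, as if stored on different nodes
chunks = [
    ["pause", "play", "pause"],
    ["skip", "pause"],
    ["play", "skip", "skip"],
]

# MAP: each node independently counts the events in its own chunk
partial_counts = [Counter(chunk) for chunk in chunks]

# REDUCE: the partial results are combined into a single result
total = reduce(lambda a, b: a + b, partial_counts)
print(total)  # Counter({'pause': 3, 'skip': 3, 'play': 2})
```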
What are Hadoop and MapReduce known for?
They are some of the most popular technologies for dealing with large amounts of data
What is the main advantage of the Hadoop/MapReduce approach?
The idea is deceptively simple
Provides much more scalability
Can process huge amounts of data on relatively cheap hardware (not expensive supercomputers)
What is a limitation of the Hadoop/MapReduce approach?
It becomes difficult when data are so interdependent that processing one data item requires knowledge of other data items (e.g., social network data)
Describe the Comcast cable box example from lecture
Comcast:
Stores real-time cable box activity for millions of customers by region
Analyzes during which programs people are most likely to pause and then skip the commercials
Uses distributed processing across different regions (Region A, B, C, D)
What is a UNION operation in data integration?
Union combines rows from two or more tables that have the same structure (same columns), stacking them vertically to create a single larger dataset
What is a JOIN operation in data integration?
Join combines columns from two or more tables based on a related column between them (like a common key), connecting them horizontally
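A sketch of both operations in pandas on made-up tables: pd.concat plays the role of UNION and pd.merge the role of JOIN.

```python
import pandas as pd

sales_east = pd.DataFrame({"CustID": [1, 2], "Amount": [100, 250]})
sales_west = pd.DataFrame({"CustID": [3], "Amount": [75]})
customers = pd.DataFrame({"CustID": [1, 2, 3], "Name": ["Abigail", "Carolyn", "Douglas"]})

# UNION: same columns, rows stacked vertically
all_sales = pd.concat([sales_east, sales_west], ignore_index=True)

# JOIN: different columns connected horizontally via the common key CustID
print(all_sales.merge(customers, on="CustID", how="left"))
```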
What tools are available for ETL processes?
There are numerous tools available to create ETL processes for different purposes (specific tools not mentioned in lecture, but they automate the Extract, Transform, Load workflow)
Why is distributed processing better than using a supercomputer?
More cost-effective (uses cheaper hardware)
More scalable (can add more nodes as needed)
Better for parallel processing of independent data chunks
Avoids bottleneck of moving massive amounts of data to one location
What happened in the Barclays-Lehman Brothers case?
Barclays purchased 179 extra assets from failed Lehman Brothers due to an Excel error.
What happened with UK COVID-19 data?
The UK government missed nearly 16,000 coronavirus cases because case data exceeded the row limit of the legacy Excel (.xls) file format.
What happens to zip codes like 08098 in Excel?
Excel interprets them as integers, so the leading zero disappears (08098 becomes 8098).
What happens when you export a large number from Excel to CSV?
The exported value becomes 2.34555E+14 instead of the original number.
What gets exported from Excel to CSV: display value or exact value?
The display-formatted value is exported instead of the exact value.
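A defensive sketch in pandas (the column names are hypothetical): reading identifier-like columns as text sidesteps both the leading-zero and the scientific-notation problem.

```python
import io

import pandas as pd

csv_data = io.StringIO("zip,account\n08098,234555000000000\n10001,234555000000001\n")

# Naive read: zip is parsed as an integer, so the leading zero is lost
naive = pd.read_csv(csv_data)
print(naive["zip"].tolist())  # [8098, 10001]

# Defensive read: treat identifier-like columns as text, not numbers
csv_data.seek(0)
safe = pd.read_csv(csv_data, dtype={"zip": str, "account": str})
print(safe["zip"].tolist())  # ['08098', '10001']
```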
What does CSV stand for?
Comma-Separated Values.
What are CSV files commonly used for?
Transferring data between systems and applications (along with Excel files).
How can you inspect CSV files?
Using any text editor because they are just text files.
What do CSV files contain?
Just the plain dataset - no formatting, no formulas, no visualizations.
What is the Agency Problem in data?
The data creator is often NOT the data consumer.
What percentage of time do analysts spend on data quality issues?
50% of their time.
What three activities do analysts spend 50% of their time on?
1) Searching for data, 2) Correcting errors, 3) Verifying correctness.
According to Varian (2010), what's better when experiments are cheap?
Experiments provide more reliable answers than opinion
According to Varian, when is expert opinion plausible for decisions?
When experiments are costly or impractical to run.
What should managers say to questions about user preferences according to Varian?
"Run an experiment."
What's the first step in the vicious cycle of bad data?
Mistakes in data.
What do mistakes in data lead to?
Errors in decision-making.
What happens after errors in decision-making?
People stop trusting data.
What do people do when they stop trusting data?
People return to intuition as the basis for decisions.
What happens when people return to intuition?
Less investment in getting the data right.
What does the vicious cycle of bad data ultimately kill?
A data-driven decision-making culture.
Who identified the vicious cycle of bad data?
Redman (2013).
What's the first best practice for data quality?
Focus on getting new data right.
What's the second best practice for data quality?
Limit time fixing old data.
What's the third best practice for data quality?
Data producers should communicate with data consumers.
What's the fourth best practice for data quality?
Have a mindset to check your work constantly.
What is an outlier?
An observation that lies an abnormal distance from other values in a sample.
Who decides what counts as abnormal for outliers?
The analyst (or a consensus process).
Is dealing with outliers always a clear decision?
No, it's always a judgment call.
What are three ways to deal with outliers?
1) Remove them, 2) Replace with another value, 3) Leave as is.
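A sketch of all three options in pandas, using the common 1.5×IQR rule as one (judgment-call) definition of "abnormal":

```python
import pandas as pd

values = pd.Series([12, 14, 13, 15, 14, 98])  # 98 lies far from the rest

# Flag values outside 1.5 interquartile ranges of the middle 50%
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

removed = values[~is_outlier]                        # 1) remove them
replaced = values.mask(is_outlier, values.median())  # 2) replace with another value
kept = values                                        # 3) leave as is
print(removed.tolist(), replaced.tolist(), kept.tolist())
```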
Are outliers always bad?
No, sometimes we may be specifically interested in the outliers.
What's Step 1 in cleaning data?
Remove duplicate or irrelevant observations.
What's Step 2 in cleaning data?
Fix structural errors.
What's Step 3 in cleaning data?
Filter unwanted outliers.
What's Step 4 in cleaning data?
Handle missing data.
What's Step 5 in cleaning data?
Validate and QA (Quality Assurance).
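A sketch walking through the five steps on a tiny made-up dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Seattle", "seattle ", "Seattle", "Tacoma", None],
    "price": [450_000, 450_000, 460_000, 99_000_000, 380_000],
})

# Step 1: remove duplicate or irrelevant observations
df = df.drop_duplicates()

# Step 2: fix structural errors (stray whitespace, inconsistent capitalization)
df["city"] = df["city"].str.strip().str.title()
df = df.drop_duplicates()  # the fix may reveal new duplicates

# Step 3: filter unwanted outliers (a judgment call; a fixed cap here)
df = df[df["price"] < 10_000_000]

# Step 4: handle missing data (dropped here; imputing is another option)
df = df.dropna(subset=["city"])

# Step 5: validate and QA, e.g. assert basic expectations hold
assert df["price"].gt(0).all()
assert df["city"].notna().all()
print(df)
```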
What are flat files?
CSV and Excel files that store data in a simple format.
How are flat files structured?
As datasets that are more or less ready to be analyzed.
What are the two most common types of flat files?
CSV files and Excel spreadsheet files.
What does SQL stand for?
Structured Query Language.
What is SQL used for?
Operating relational databases.
Name three relational database products.
Oracle, Microsoft Access, MySQL.
What does a relational database contain?
A collection of tables that store data for different types of entities.
What are database tables made of?
Rows (records/observations) and columns (variables).
What are fields made of?
Characters that can represent different types of data (data types).
What is a database schema?
The logical and visual configuration of the entire relational database.
What does a database schema describe?
The structure of the database.
What visual tool is used to show database structure?
ERD Diagram (Entity Relationship Diagram).
What does a database query do?
Combines data from two or more tables.
How are tables joined in a database?
By common fields.
Give an example of a simple SQL query.
SELECT * FROM Employees;
How does a relational database differ from one big dataset?
It stores different types of data in different tables instead of one big two-dimensional matrix.
What do you usually need to do before using data from a relational database?
Combine data from different tables into a new table (dataset) using queries.
What is the first benefit of relational databases?
Integrity - easier to maintain data integrity when the same item is recorded in one place only.
What is the second benefit of relational databases?
Flexibility - you can create different cuts of the data.
What is the third benefit of relational databases?
Efficiency - faster to retrieve and update data without redundant values.
What is the main drawback of relational databases?
They are more complex to operate and use than flat files.
What does the MATCH function in Excel do?
Finds a specified value in a range of cells and returns its relative position.
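MATCH is an Excel worksheet function; a rough Python analogue for exact matching (returning a 1-based position, as Excel does) might look like this:

```python
def match(value, cells):
    """Return the 1-based position of value in cells, like Excel's MATCH
    with match_type 0 (exact match); raises ValueError if absent,
    roughly analogous to Excel returning #N/A."""
    return cells.index(value) + 1

print(match("Carolyn", ["Abigail", "Carolyn", "Douglas"]))  # 2
```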