Data Science Culner-Smith Exam 2

1
New cards

What is the purpose of primary and foreign keys in a relational database?

Keys help link different data together

2
New cards

According to Moss (2021), 179 assets were unintentionally included in a purchases contract due to...

Hidden rows becoming visible when the spreadsheet was converted to a PDF file

3
New cards

How are data stored in a relational database? (Rosenblum and Dorsey 2013)

Different types of data are stored in different tables

4
New cards

According to Redman (2013), which is NOT a cause of bad data?

Everyone is invested in getting the data right

5
New cards

According to Tableau, if data are incorrect, the analysis may...

Produce results that look as if they were correct

6
New cards

According to IBM (2020), ETL process often involves...

Data cleansing to improve the quality and consistency of data

7
New cards

According to IBM (2020), Extract, Transform, Load (ETL) processes are often needed because...

Data from multiple sources need to be integrated for analysis

8
New cards

How does an ELT process differ from an ETL process (IBM 2020)?

ELT involves less data transformation before the data is loaded into a target system

9
New cards

According to Aaltonen (2013), metrics can lead to mindlessly optimizing every individual aspect of business. What does this mean?

Managers lose sight of the big picture and things that matter to customers

10
New cards

According to Schambra (2013), the problem with scoring non-profit outcomes as "success" or "failure" is...

The distinction between success and failure is not always clear 

11
New cards

According to Aaltonen (2013), metrics and intuition are...

Both are needed in organizations

12
New cards

What might happen if non-profits were funded solely on their success KPIs? (Schambra 2013)

Small non-profits serving the most vulnerable may be culled

13
New cards

What does ETL stand for?

Extract, Transform, Load

14
New cards

Define the ETL process

ETL is an automated process of:

  • Extracting data from multiple sources

  • Transforming data into a consistent format

  • Loading data into an analytical system
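
A minimal sketch of the three phases in Python, assuming two hypothetical source exports (hr_export.csv and payroll_export.csv); the file and column names are illustrative only:

  import pandas as pd

  # Extract: pull data from multiple hypothetical source files
  hr = pd.read_csv("hr_export.csv")            # columns: EmpNo, Name, Dept
  payroll = pd.read_csv("payroll_export.csv")  # columns: EmpID, Salary

  # Transform: standardize names and formats so the sources line up
  hr = hr.rename(columns={"EmpNo": "EmpID"})
  hr["Name"] = hr["Name"].str.strip().str.title()
  combined = hr.merge(payroll, on="EmpID", how="inner")

  # Load: write the integrated dataset into the analytical system
  combined.to_csv("warehouse/employees.csv", index=False)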

15
New cards

What happens in the EXTRACT phase of ETL?

Data is extracted/pulled from multiple different sources (databases, files, APIs, etc.)

16
New cards

What happens in the TRANSFORM phase of ETL?

Data is transformed into a consistent format by:

  • Resolving formatting differences

  • Standardizing data representation

  • Cleaning inconsistencies

  • Validating data quality

17
New cards

What happens in the LOAD phase of ETL?

The transformed data is loaded into an analytical system (like a data warehouse) where it can be analyzed

18
New cards

Why do we need ETL?

  • The power of data analytics is often based on combining data from different sources

  • Data stored in different places are often formatted differently

  • These differences need to be resolved before data can be combined

19
New cards

Why is it difficult to enforce consistent schemas even within the same organization?

Different departments may:

  • Use different systems

  • Have different data collection practices

  • Use different naming conventions

  • Store data in different formats

  • Have different data requirements

20
New cards

When data are in the same database, how can they be combined?

Data can be combined directly using database queries (like JOIN operations)

21
New cards

What challenges arise when data come from different sources?

  • Need to identify which columns represent the same data across tables

  • Must decide which version of data to use when represented differently

  • Different naming conventions (e.g., EmpNo vs EmpID)

  • Different value formats (e.g., "Abbie" vs "Abigail")

  • Inconsistent data types or structures

22
New cards

What are the three steps to setting up an ETL process?

  1. Inspect metadata (data dictionary) to assess which columns map to each other in different datasets

  2. Decide how to choose the correct version of data that reside in multiple databases

  3. Setup rules for resolving inconsistencies, duplicates, omissions, and other problems; validate the data

23
New cards

What is metadata in the context of ETL?

Metadata is "data about data" - it includes data dictionaries that describe the structure, format, and meaning of data columns in different datasets

24
New cards

How does ETL improve data quality?

ETL transforms data into a format that is ready for use in business tasks by:

  • Resolving inconsistencies

  • Removing duplicates

  • Filling omissions

  • Validating data

  • Standardizing formats

25
New cards

Give an example of integrating data for insights

Combining Seattle Crime Rate data with Seattle Real Estate Price data to find correlations or patterns
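
A sketch of what that integration might look like in Python; the file names and columns (zipcode, crime_rate, median_price) are assumptions for illustration:

  import pandas as pd

  # Hypothetical extracts of the two Seattle datasets
  crime = pd.read_csv("seattle_crime.csv")         # zipcode, crime_rate
  prices = pd.read_csv("seattle_real_estate.csv")  # zipcode, median_price

  # Join on the shared zipcode column, then look for a relationship
  merged = crime.merge(prices, on="zipcode")
  print(merged["crime_rate"].corr(merged["median_price"]))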

26
New cards

In the lecture example with Departments, Employees, and Performance tables, what mapping issues existed?

  • DeptNum vs DeptNo (different column names)

  • EmpNo vs EmpID (different column names)

  • Name vs FirstName (different column names)

  • "Abbie" vs "Abigail" (different name formats)

  • "Carol" vs "Carolyn" (different name formats)

  • "Douglas" vs "Doug" (different name formats)

27
New cards

What are the row and column limits for Excel 1997-2003?

65,536 rows by 256 columns

28
New cards

What are the row and column limits for Excel 2007 and later?

1,048,576 rows by 16,384 columns

29
New cards

How many rows can Excel digest for practical analysis?

Perhaps a few hundred thousand rows (datasets can have billions of rows, which Excel cannot handle) 

30
New cards

What are the "Three Vs" of Big Data?

  • Volume - large amounts of data

  • Velocity - speed at which data is generated/processed

  • Variety - different types and formats of data

31
New cards

What is the fundamental approach of distributed data processing?

Instead of bringing huge amounts of data to a huge processing unit (supercomputer), it is better to:

  • Distribute data to several smaller processing units

  • Send the analytical code to them

  • Combine the results together

32
New cards

What is “Big Data”?

A set of technologies that allow processing very large amounts of data so that they can be analyzed

33
New cards

What is Hadoop?

Hadoop stores data in smaller chunks across a network of different computers (nodes)

34
New cards

What is MapReduce?

MapReduce processes the pieces of data in parallel in different nodes and combines the results together
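
A toy word-count sketch of the map and reduce steps, run on one machine for illustration (real MapReduce distributes the chunks across nodes):

  from collections import Counter

  # Stand-ins for data chunks stored on different nodes
  chunks = ["to be or not to be", "be the change"]

  # Map: each node counts the words in its own chunk, in parallel
  mapped = [Counter(chunk.split()) for chunk in chunks]

  # Reduce: partial results from all nodes are combined together
  total = Counter()
  for partial in mapped:
      total.update(partial)
  print(total)   # Counter({'be': 3, 'to': 2, ...})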

35
New cards

What are Hadoop and MapReduce known for?

They are some of the most popular technologies for dealing with large amounts of data

36
New cards

What is the main advantage of the Hadoop/MapReduce approach?

  • The idea is deceptively simple

  • Provides much more scalability

  • Can process huge amounts of data on relatively cheap hardware (not expensive supercomputers)

37
New cards

What is a limitation of the Hadoop/MapReduce approach?

It becomes difficult when data are so interdependent that processing one data item requires knowledge of other data items (e.g., social network data)

38
New cards

Describe the Comcast cable box example from lecture

Comcast:

  • Stores real-time cable box activity for millions of customers by region

  • Analyzes which programs people are most likely to pause and then skip commercials

  • Uses distributed processing across different regions (Region A, B, C, D)

39
New cards

What is a UNION operation in data integration?

Union combines rows from two or more tables that have the same structure (same columns), stacking them vertically to create a single larger dataset
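
A small sketch of a union using pandas (one tool choice among many; the example tables are made up):

  import pandas as pd

  jan = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.5]})
  feb = pd.DataFrame({"order_id": [3, 4], "amount": [7.5, 12.0]})

  # Same columns, rows stacked vertically into one larger dataset
  orders = pd.concat([jan, feb], ignore_index=True)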

40
New cards

What is a JOIN operation in data integration?

Join combines columns from two or more tables based on a related column between them (like a common key), connecting them horizontally
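
A matching sketch of a join, again with made-up tables sharing the key EmpID:

  import pandas as pd

  employees = pd.DataFrame({"EmpID": [1, 2], "Name": ["Abigail", "Carolyn"]})
  salaries = pd.DataFrame({"EmpID": [1, 2], "Salary": [70000, 82000]})

  # Columns combined horizontally by matching on the common key
  joined = employees.merge(salaries, on="EmpID", how="inner")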

41
New cards

What tools are available for ETL processes?

There are numerous tools available to create ETL processes for different purposes (specific tools not mentioned in lecture, but they automate the Extract, Transform, Load workflow)

42
New cards

Why is distributed processing better than using a supercomputer?

  • More cost-effective (uses cheaper hardware)

  • More scalable (can add more nodes as needed)

  • Better for parallel processing of independent data chunks

  • Avoids bottleneck of moving massive amounts of data to one location

43
New cards

What happened in the Barclays-Lehman Brothers case?

Barclays purchased 179 extra assets from the failed Lehman Brothers due to an Excel error.

44
New cards

What happened with UK COVID-19 data?

The UK government missed nearly 16,000 coronavirus cases when case data exceeded the row limit of an older Excel file format.

45
New cards

What happens to zip codes like 08098 in Excel?

The leading zero disappears when interpreted as an integer.
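
One way to avoid this when reading such data programmatically is to force the column to be text; the file and column names below are illustrative:

  import pandas as pd

  # Read as a number, 08098 becomes 8098; read as text, the leading zero survives
  addresses = pd.read_csv("addresses.csv", dtype={"zip": str})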

46
New cards

What happens when you export a large number from Excel to CSV?

The exported value becomes 2.34555E+14 instead of the original number.

47
New cards

What gets exported from Excel to CSV: display value or exact value?

The display-formatted value is exported instead of the exact value.

48
New cards

What does CSV stand for?

Comma-Separated Values.

49
New cards

What are CSV files commonly used for?

Transferring data between systems and applications (along with Excel files).

50
New cards

How can you inspect CSV files?

Using any text editor because they are just text files.

51
New cards

What do CSV files contain?

Just the plain dataset - no formatting, no formulas, no visualizations.
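
Because a CSV file is just text, a few lines of Python are enough to inspect one; the file name is a placeholder:

  import csv

  # Each line is one row; values are separated by commas
  with open("data.csv", newline="") as f:
      for row in csv.reader(f):
          print(row)   # e.g. ['EmpID', 'Name', 'Salary'] for a header line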

52
New cards

What is the Agency Problem in data?

The data creator is often NOT the data consumer.

53
New cards

What percentage of time do analysts spend on data quality issues?

50% of their time.

54
New cards

What three activities do analysts spend 50% of their time on?

1) Searching for data, 2) Correcting errors, 3) Verifying correctness.

55
New cards

According to Varian (2010), what's better when experiments are cheap?

Experiments provide more reliable answers than opinion

56
New cards

According to Varian, when is expert opinion plausible for decisions?

When running experiments is costly or impractical

57
New cards

What should managers say to questions about user preferences according to Varian?

"Run an experiment."

58
New cards

What's the first step in the vicious cycle of bad data?

Mistakes in data.

59
New cards

What do mistakes in data lead to?

Errors in decision-making.

60
New cards

What happens after errors in decision-making?

People stop trusting data.

61
New cards

What do people do when they stop trusting data?

People return to intuition as the basis for decisions.

62
New cards

What happens when people return to intuition? (Data)

Less investment in getting the data right.

63
New cards

What does the vicious cycle of bad data ultimately kill?

A data-driven decision-making culture.

64
New cards

Who identified the vicious cycle of bad data?

Redman (2013).

65
New cards

What's the first best practice for data quality?

Focus on getting new data right.

66
New cards

What's the second best practice for data quality?

Limit time fixing old data.

67
New cards

What's the third best practice for data quality?

Data producers should communicate with data consumers.

68
New cards

What's the fourth best practice for data quality?

Have a mindset to check your work constantly.

69
New cards

What is an outlier?

An observation that lies an abnormal distance from other values in a sample.

70
New cards

Who decides what counts as abnormal for outliers?

The analyst (or a consensus process).

71
New cards

Is dealing with outliers always a clear decision?

No, it's always a judgment call.

72
New cards

What are three ways to deal with outliers?

1) Remove them, 2) Replace with another value, 3) Leave as is.
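
A sketch of the three options using the common 1.5 x IQR rule to flag outliers (which cutoff counts as "abnormal" remains the analyst's judgment call):

  import pandas as pd

  sales = pd.Series([12, 15, 14, 13, 400])   # 400 looks abnormal
  q1, q3 = sales.quantile([0.25, 0.75])
  low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
  is_outlier = (sales < low) | (sales > high)

  removed = sales[~is_outlier]                       # 1) remove them
  replaced = sales.mask(is_outlier, sales.median())  # 2) replace with another value
  kept = sales                                       # 3) leave as is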

73
New cards

Are outliers always bad?

No, sometimes we may be specifically interested in the outliers.

74
New cards

What's Step 1 in cleaning data?

Remove duplicate or irrelevant observations.

75
New cards

What's Step 2 in cleaning data?

Fix structural errors.

76
New cards

What's Step 3 in cleaning data?

Filter unwanted outliers.

77
New cards

What's Step 4 in cleaning data?

Handle missing data.

78
New cards

What's Step 5 in cleaning data?

Validate and QA (Quality Assurance).
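
A compact sketch of the five cleaning steps on a hypothetical survey extract; the file and column names are assumptions:

  import pandas as pd

  df = pd.read_csv("survey.csv")

  df = df.drop_duplicates()                                   # 1. remove duplicate observations
  df["state"] = df["state"].str.strip().str.upper()           # 2. fix structural errors ("wa " vs "WA")
  df = df[df["age"].between(0, 120)]                          # 3. filter unwanted outliers
  df["income"] = df["income"].fillna(df["income"].median())   # 4. handle missing data
  assert df["age"].notna().all()                              # 5. validate and QA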

79
New cards

What are flat files?

CSV and Excel files that store data in a simple format.

80
New cards

How are flat files structured?

As datasets that are more or less ready to be analyzed.

81
New cards

What are the two most common types of flat files?

CSV files and Excel spreadsheet files.

82
New cards

What does SQL stand for?

Structured Query Language.

83
New cards

What is SQL used for?

Operating relational databases.

84
New cards

Name three relational database products.

Oracle, Microsoft Access, MySQL.

85
New cards

What does a relational database contain?

A collection of tables that store data for different types of entities.

86
New cards

What are database tables made of?

Rows (records/observations) and columns (variables).

87
New cards

What are fields made of?

Characters that can represent different types of data (data types).

88
New cards

What is a database schema?

The logical and visual configuration of the entire relational database.

89
New cards

What does a database schema describe?

The structure of the database.

90
New cards

What visual tool is used to show database structure?

An ERD (Entity Relationship Diagram).

91
New cards

What does a database query do?

Combines data from two or more tables.

92
New cards

How are tables joined in a database?

By common fields.

93
New cards

Give an example of a simple SQL query.

SELECT * FROM Employees;
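
A way to try queries like this without installing a database server is Python's built-in sqlite3 module; the tables and values below are illustrative:

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE Employees (EmpID INTEGER, Name TEXT, DeptNo INTEGER)")
  con.execute("CREATE TABLE Departments (DeptNo INTEGER, DeptName TEXT)")
  con.execute("INSERT INTO Employees VALUES (1, 'Abigail', 10)")
  con.execute("INSERT INTO Departments VALUES (10, 'Sales')")

  # The query from the card, plus a join on the common field DeptNo
  print(con.execute("SELECT * FROM Employees").fetchall())
  print(con.execute("SELECT e.Name, d.DeptName FROM Employees e "
                    "JOIN Departments d ON e.DeptNo = d.DeptNo").fetchall())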

94
New cards

How does a relational database differ from one big dataset?

It stores different data into different tables instead of one big two-dimensional matrix.

95
New cards

What do you usually need to do before using data from a relational database?

Combine data from different tables into a new table (dataset) using queries.

96
New cards

What is the first benefit of relational databases?

Integrity - easier to maintain data integrity when the same item is recorded in one place only.

97
New cards

What is the second benefit of relational databases?

Flexibility - you can create different cuts into the data.

98
New cards

What is the third benefit of relational databases?

Efficiency - faster to retrieve and update data without redundant values.

99
New cards

What is the main drawback of relational databases?

They are more complex to operate and use than flat files.

100
New cards

What does the MATCH function in Excel do?

Finds a specified value in a range of cells and returns its relative position. For example, =MATCH(25, A1:A10, 0) returns the position of the first cell in A1:A10 that exactly equals 25.