Unit 5 Study Guide

A set of flashcards based on key concepts from the Unit 5 Study Guide on Big Data.

37 Terms

What is Big Data?

Extremely large sets of data that can be analyzed for patterns, trends, and associations.

Where does Big Data come from?

It comes from sources like social media, search engines, sensors, transactions, and devices.

How do we collect Big Data?

Through user activities, devices, sensors, online actions, surveys, etc.

How do we store and use Big Data?

Stored in servers, databases, and the cloud; used for analysis, decision-making, and predictions.

What are the steps for understanding Big Data?

Collect - Gather raw data from multiple sources.
Store - Save data safely using databases or cloud storage.
Process - Organize and clean the data for use.
Analyze - Find patterns, trends, and insights.
Visualize - Present data in charts, graphs, etc., for understanding.
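
The five steps above can be sketched as a tiny pipeline. This is only an illustration: the records, the field names, and the use of JSON as a stand-in for a database are all assumptions, not part of the study guide.

```python
import json
from collections import Counter

# Collect: hypothetical raw records from two made-up sources.
raw = [
    {"source": "web", "page": " Home "},
    {"source": "app", "page": "cart"},
    {"source": "web", "page": "home"},
    {"source": "web", "page": None},  # incomplete record
]

# Store: serialize to JSON (a stand-in for a database or cloud storage).
stored = json.dumps(raw)

# Process: reload, drop incomplete records, normalize the text.
records = [r for r in json.loads(stored) if r["page"]]
pages = [r["page"].strip().lower() for r in records]

# Analyze: count how often each page appears.
counts = Counter(pages)

# Visualize: a plain-text bar chart, one '#' per visit.
for page, n in counts.most_common():
    print(f"{page:<6}{'#' * n}")
```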

Define Usable data. What makes data usable?

Usable data is organized, clean, and accessible; it must be easy to retrieve and interpret.

Define Useful data. What makes data useful?

Useful data is relevant and helpful for a specific goal or question.

Why do we collect data?

To gain insights, make decisions, predict trends, and improve products/services.

What are the differences between structured and unstructured data?

Structured data fits into organized systems (like spreadsheets); unstructured data is messy (like videos, emails, or tweets).

What happens as we go from unstructured to structured data? Is it reversible?

Data is cleaned and organized for easier analysis. It is hard to reverse: detail dropped during structuring cannot be recovered unless the raw data was also kept.

What is Data Extraction? Why do we need it?

Pulling the important information out of raw data. We need it because raw data is too large and messy to analyze directly.
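
One common way to extract fields from raw text is pattern matching. The sketch below uses Python's `re` module on a made-up sentence; the sample text and the simplified patterns are assumptions for illustration and would miss many real-world email formats.

```python
import re

# Hypothetical raw, unstructured text.
raw = "Order 1042 shipped to ana@example.com on 2024-03-05; contact bo@example.org."

# Simplified patterns: good enough for this sample, not for every real address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[a-z]{2,}", raw)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", raw)

print(emails)
print(dates)
```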

Is the internet structured or unstructured?

Mostly unstructured.

Difference between structured and unstructured searches?

Structured search: uses filters and organized queries; fast and accurate, but limited to what the filters cover.
Unstructured search: open-ended; can find unexpected results, but is slower.

How does Google search the internet for your queries?

Uses bots (crawlers) to index websites and an algorithm to match your search with the most relevant pages.

What is screen scraping? Why is it useful?

Automated extraction of data from websites; useful when there’s no easy way to download the information.
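
A minimal sketch of screen scraping with Python's built-in `html.parser`. The HTML snippet, the `price` class name, and the `PriceScraper` class are hypothetical; a real scraper would first download the page, and should respect the site's terms of use.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


# Hypothetical page content, hard-coded instead of downloaded.
page = """
<ul>
  <li>Notebook <span class="price">$3.50</span></li>
  <li>Pen <span class="price">$1.20</span></li>
</ul>
"""

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)
```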

What do we do after extracting data? Why?

Clean, organize, and validate it — to ensure accuracy and usability.
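
Cleaning and validating might look like the sketch below. The rows, the field names, and the validation rules (non-empty name, numeric age, no duplicates) are assumptions chosen for illustration.

```python
# Hypothetical rows as they might arrive after extraction.
raw_rows = [
    {"name": " Ana ", "age": "17"},
    {"name": "bo", "age": "abc"},   # invalid age
    {"name": "Ana", "age": "17"},   # duplicate once cleaned
    {"name": "", "age": "16"},      # missing name
    {"name": "Cy", "age": "16"},
]


def clean(row):
    """Return a tidied row, or None if the row fails validation."""
    name = row["name"].strip().title()
    if not name or not row["age"].isdigit():
        return None
    return {"name": name, "age": int(row["age"])}


cleaned = []
seen = set()
for row in raw_rows:
    tidy = clean(row)
    if tidy is None:
        continue  # drop rows that fail validation
    key = (tidy["name"], tidy["age"])
    if key not in seen:  # drop exact duplicates
        seen.add(key)
        cleaned.append(tidy)

print(cleaned)
```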

How do we store big data? Is it necessary to store all data?

In servers, cloud storage, and databases; no, only valuable or necessary data is kept.

What is metadata? Why is it useful?

Data about data (e.g., file size, date created); it helps organize and search data faster.
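
File metadata can be read without opening the file's contents. The sketch below creates a throwaway file (an assumption, so the example is self-contained) and reads its size and modification time with `os.stat`.

```python
import os
import tempfile
from datetime import datetime

# Create a small temporary file so there is something to inspect.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello big data")
    path = f.name

# os.stat returns metadata about the file, not its contents.
info = os.stat(path)
metadata = {
    "name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified": datetime.fromtimestamp(info.st_mtime).isoformat(),
}
print(metadata)

os.remove(path)  # clean up the temporary file
```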

Two ways to structure data and pros/cons?

Relational databases (tables): easy to query, but data must fit a strict format.
NoSQL (flexible storage): stores messy data easily, but is harder to search.
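
The trade-off can be sketched with Python's built-in `sqlite3` for the relational side and a plain list of dictionaries standing in for a document-style NoSQL store. The table, field names, and records are hypothetical.

```python
import sqlite3

# Relational: every row must fit the table's fixed columns.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE students (name TEXT NOT NULL, grade INTEGER NOT NULL)")
db.executemany("INSERT INTO students VALUES (?, ?)", [("Ana", 11), ("Bo", 12)])
juniors_sql = db.execute("SELECT name FROM students WHERE grade = 11").fetchall()
db.close()

# Document-style (stand-in for NoSQL): records can differ in shape.
docs = [
    {"name": "Ana", "grade": 11, "clubs": ["chess"]},
    {"name": "Bo", "tweet": "big data is big"},  # no grade field at all
]
# Searching means checking every document and tolerating missing fields.
juniors_docs = [d["name"] for d in docs if d.get("grade") == 11]

print(juniors_sql)
print(juniors_docs)
```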

What is data persistence?

Data continues to exist and can be retrieved over time.

What is PII? Examples?

Personally Identifiable Information — like names, addresses, Social Security numbers.

Pros and cons of data persisting online?

Pros: Easy access, backup, analysis. Cons: Privacy risks, hacking.

What are we trading for convenience when sharing data?

Privacy.

Three types of data analysis and differences?

Descriptive: Summarizes what happened (high confidence).
Predictive: Forecasts future events (medium confidence).
Prescriptive: Recommends actions (lower confidence but actionable).

Two methods for finding patterns in Big Data?

Regression: Predicts future trends based on past data.
Clustering: Groups similar data together.
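
Both methods can be sketched in a few lines. The example below fits a least-squares line to made-up points, then splits made-up one-dimensional values into two groups with a simplified two-center k-means. The data and the `two_means` helper are assumptions; the helper expects at least two distinct values.

```python
# Regression: fit y = slope * x + intercept by least squares.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical past data
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def predict(x):
    """Use the fitted line to estimate y for a new x."""
    return slope * x + intercept


# Clustering: a simplified 1-D k-means with two centers.
def two_means(points, rounds=10):
    lo, hi = min(points), max(points)  # start centers at the extremes
    for _ in range(rounds):
        near_lo = [p for p in points if abs(p - lo) <= abs(p - hi)]
        near_hi = [p for p in points if abs(p - lo) > abs(p - hi)]
        # Move each center to the average of its group.
        lo = sum(near_lo) / len(near_lo)
        hi = sum(near_hi) / len(near_hi)
    return near_lo, near_hi


groups = two_means([1, 2, 3, 10, 11, 12])
print(predict(6))
print(groups)
```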

Six strategies for data mining (explain each):

1. Clustering: Grouping similar items.
2. Classification: Sorting into categories.
3. Anomaly Detection: Finding outliers.
4. Regression: Predicting trends.
5. Association Rule Mining: Finding relationships (like 'people who buy X also buy Y').
6. Summarization: Giving a general overview of data.
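
As one illustration from the list, anomaly detection can be sketched with a z-score: flag any value far from the average, measured in standard deviations. The readings and the cutoff of 2 are assumptions; other cutoffs and methods are common.

```python
import statistics

# Hypothetical sensor readings with one obvious outlier.
values = [10, 11, 9, 10, 12, 10, 50]

mean = statistics.mean(values)
spread = statistics.pstdev(values)  # population standard deviation

# Flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / spread > 2]
print(outliers)
```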

How does association rule mining work?

It finds items or actions that frequently occur together (e.g., customers who buy milk often also buy bread); these links help make predictions and recommendations.
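
Two standard measures for a rule such as "milk → bread" are support (how often the items appear together) and confidence (how often bread appears when milk does). The baskets below are made up for illustration.

```python
# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "jam"},
]


def count(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b)


# Support: share of all baskets containing both milk and bread.
support = count({"milk", "bread"}) / len(baskets)

# Confidence: of the baskets with milk, the share that also have bread.
confidence = count({"milk", "bread"}) / count({"milk"})

print(support, confidence)
```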

What is a model? Why is it useful?

A simplified version of a system or concept; helps predict or understand real-world phenomena.

What are simulations? Why are they useful?

Running a model to see how a system might behave; useful because it's safer, faster, and cheaper than real-world tests.
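
A small simulation sketch: instead of rolling real dice thousands of times, run a model of two dice and estimate how often they sum to 7. The function name, trial count, and seed are assumptions; the exact answer is 1/6, so the estimate can be checked against it.

```python
import random


def estimate_seven(trials=100_000, seed=0):
    """Estimate the chance that two dice sum to 7 by simulated rolls."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / trials


estimate = estimate_seven()
print(f"simulated: {estimate:.3f}   exact: {1 / 6:.3f}")
```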

Do we need to test everything 1:1 in real life? Why model and simulate?

No. Modeling and simulating save time and money and avoid real-world risks.

Drawbacks of modeling/simulating?

Models might be inaccurate if based on bad data; always room for error.

Real-world examples of modeling/simulation?

Weather forecasting, traffic flow modeling, testing new airplane designs.

Why is data sometimes called 'the new oil'?

Because, like crude oil, raw data has to be refined (processed) before it becomes extremely valuable.

What makes Big Data 'big'?

Volume (amount), Variety (types), Velocity (speed of creation).

Give one real-life example of a model/simulation.

Simulating virus spread to predict pandemic outcomes.
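
One widely used simple model for this is SIR, which tracks Susceptible, Infected, and Recovered people day by day. The sketch below is a simplified version; the population, infection rate `beta`, and recovery rate `gamma` are made-up values, not real disease data.

```python
def simulate(population=1000, infected=1, beta=0.3, gamma=0.1, days=160):
    """Simplified day-by-day SIR model; returns (s, i, r) for each day."""
    s, i, r = population - infected, infected, 0
    history = []
    for _ in range(days):
        new_infections = beta * s * i / population  # contacts that spread it
        new_recoveries = gamma * i                  # share who recover today
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
        history.append((s, i, r))
    return history


history = simulate()
peak_day, peak = max(enumerate(h[1] for h in history), key=lambda pair: pair[1])
print(f"peak of about {peak:.0f} infected on day {peak_day + 1}")
```

Changing `beta` or `gamma` and rerunning shows how the outbreak might change, which is the point of simulating rather than waiting to observe it.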

What’s the benefit of organizing unstructured data using metadata?

Easier to search, organize, and retrieve information.

Which is more accurate: descriptive or predictive analysis? Why?

Descriptive — it’s based on actual past data, not guesses about the future.