A set of flashcards based on key concepts from the Unit 5 Study Guide on Big Data.
What is Big Data?
Extremely large sets of data that can be analyzed for patterns, trends, and associations.
Where does Big Data come from?
It comes from sources like social media, search engines, sensors, transactions, and devices.
How do we collect Big Data?
Through user activities, devices, sensors, online actions, surveys, etc.
How do we store and use Big Data?
Stored on servers, in databases, and in the cloud; used for analysis, decision-making, and predictions.
What are the steps for understanding Big Data?
Collect - Gather raw data from multiple sources.
Store - Save data safely using databases or cloud storage.
Process - Organize and clean the data for use.
Analyze - Find patterns, trends, and insights.
Visualize - Present data in charts, graphs, etc., for understanding.
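The five steps above can be sketched as a tiny pipeline. This is only an illustration: the records, the `temp` field, and the `clean` helper are made up, and a JSON string stands in for a real database or cloud store.

```python
import json
from collections import Counter

# Collect: hypothetical raw records from two made-up sources.
raw = [
    {"source": "sensor", "temp": "21.5"},
    {"source": "sensor", "temp": "bad"},
    {"source": "app", "temp": "19.0"},
    {"source": "sensor", "temp": "22.5"},
]

# Store: serialize to JSON, standing in for a database or cloud storage.
stored = json.dumps(raw)


# Process: drop records whose temperature is not a number.
def clean(records):
    out = []
    for record in records:
        try:
            out.append({"source": record["source"], "temp": float(record["temp"])})
        except ValueError:
            continue  # skip unusable rows such as "bad"
    return out


cleaned = clean(json.loads(stored))

# Analyze: average temperature and number of clean records per source.
average = sum(r["temp"] for r in cleaned) / len(cleaned)
counts = Counter(r["source"] for r in cleaned)

# Visualize: a text bar chart, one '#' per clean record.
for source, n in sorted(counts.items()):
    print(f"{source:<8}{'#' * n}")
print(f"average temp: {average:.1f}")
```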
Define usable data. What makes data usable?
Usable data is organized, clean, and accessible; it must be easy to retrieve and interpret.
Define useful data. What makes data useful?
Useful data is relevant and helpful for a specific goal or question.
Why do we collect data?
To gain insights, make decisions, predict trends, and improve products/services.
What are the differences between structured and unstructured data?
Structured data fits into organized systems (like spreadsheets); unstructured data is messy (like videos, emails, or tweets).
What happens as we go from unstructured to structured data? Is it reversible?
Data is cleaned and organized for easier analysis. It is usually hard to reverse, because details discarded during cleaning cannot be recovered from the structured form.
What is Data Extraction? Why do we need it?
Pulling important information out of raw data so it can be analyzed. We need it because raw data is usually too large and messy to use directly.
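As a small sketch of data extraction, a regular expression can pull the useful fields out of free-form text. The log text, the pattern, and the field names `order_id` and `shipped` are invented for this example.

```python
import re

# Hypothetical raw, unstructured text.
raw_text = "Order 1001 shipped on 2024-03-05. Order 1002 shipped on 2024-03-09."

# Capture the order number and the ISO-style date from each sentence.
pattern = re.compile(r"Order (\d+) shipped on (\d{4}-\d{2}-\d{2})")

# Turn every match into a structured record.
orders = [
    {"order_id": int(order_id), "shipped": date}
    for order_id, date in pattern.findall(raw_text)
]

for order in orders:
    print(order)
```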
Is the internet structured or unstructured?
Mostly unstructured.
Difference between structured and unstructured searches?
Structured Search: Uses filters and organized queries; faster and more accurate, but limited. Unstructured Search: Open-ended; can find unexpected results, but slower.
How does Google search the internet for your queries?
Uses bots (crawlers) to index websites and an algorithm to match your search with the most relevant pages.
What is screen scraping? Why is it useful?
Automated extraction of data from websites; useful when there’s no easy way to download the information.
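A minimal sketch of screen scraping using Python's standard `html.parser` module. The HTML snippet, the `price` class, and the `PriceScraper` name are assumptions for illustration; a real scraper would first download the page.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the numbers found inside <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data.strip().lstrip("$")))


# Made-up page content standing in for a downloaded web page.
page = (
    '<ul><li>Tea <span class="price">$3.50</span></li>'
    '<li>Coffee <span class="price">$4.25</span></li></ul>'
)

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)
```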
What do we do after extracting data? Why?
Clean, organize, and validate it — to ensure accuracy and usability.
How do we store big data? Is it necessary to store all data?
On servers, in cloud storage, and in databases. No, only valuable or necessary data is kept.
What is metadata? Why is it useful?
Data about data (e.g., file size, date created); it helps organize and search data faster.
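To make "data about data" concrete, this sketch writes a temporary file and reads its metadata with `os.stat`. The file contents and the keys of the `metadata` dictionary are chosen only for this example.

```python
import os
import tempfile
import time

# Create a small temporary file so there is something to inspect.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello big data")
    path = handle.name

# os.stat returns metadata about the file, not the file's contents.
info = os.stat(path)
metadata = {
    "name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified": time.ctime(info.st_mtime),
}
os.remove(path)  # clean up the temporary file

print(metadata)
```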
Two ways to structure data and pros/cons?
Relational Databases (Tables): Easy to query but strict formats. NoSQL (Flexible storage): Stores messy data easily but harder to search.
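The trade-off can be sketched with the standard `sqlite3` module for the relational side and a plain list of dictionaries standing in for a NoSQL document store. The table, the names, and the `city_of` helper are hypothetical.

```python
import sqlite3

# Relational: a fixed schema makes the query short, but every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ana", "Lima"), ("Ben", "Oslo"), ("Cy", "Lima")],
)
lima_sql = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM users WHERE city = ? ORDER BY name", ("Lima",)
    )
]
conn.close()

# Document style: records may have different shapes, so storing is easy,
# but searching has to cope with every shape.
documents = [
    {"name": "Ana", "city": "Lima", "tags": ["vip"]},
    {"name": "Ben", "location": {"city": "Oslo"}},
    {"name": "Cy", "city": "Lima"},
]


def city_of(doc):
    # The city may be a top-level field or nested under "location".
    return doc.get("city") or doc.get("location", {}).get("city")


lima_docs = sorted(d["name"] for d in documents if city_of(d) == "Lima")

print(lima_sql, lima_docs)
```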
What is data persistence?
Data continues to exist and can be retrieved over time.
What is PII? Examples?
Personally Identifiable Information — like names, addresses, Social Security numbers.
Pros and cons of data persisting online?
Pros: Easy access, backup, analysis. Cons: Privacy risks, hacking.
What are we trading for convenience when sharing data?
Privacy.
Three types of data analysis and differences?
Descriptive: Summarizes what happened (high confidence). Predictive: Forecasts future events (medium confidence). Prescriptive: Recommends actions (lower confidence but actionable).
Two methods for finding patterns in Big Data?
Regression: Predicts future trends based on past data. Clustering: Groups similar data together.
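Both methods can be sketched in plain Python. The regression part fits a least-squares line, and the clustering part is a simplified one-dimensional k-means with two groups. The data and the function names `fit_line` and `two_means` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def two_means(values, rounds=10):
    """Split numbers into two groups around two moving centers."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Regression: past sales per month predict the next month.
months = [1, 2, 3, 4]
sales = [2, 4, 6, 8]
slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept

# Clustering: similar values end up in the same group.
groups = two_means([1, 2, 3, 10, 11, 12])

print(f"forecast for month 5: {forecast:.1f}")
print(groups)
```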
Six strategies for data mining (explain each):
How does association rule mining work?
It finds links between behaviors or actions (e.g., if a user buys milk, they often also buy bread); helps make predictions.
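The milk-and-bread rule can be checked with two standard measures: support (how often the items appear together) and confidence (how often the rule holds when the first item is present). The shopping baskets below are invented, and this sketch scores a single rule rather than searching for rules as a full mining algorithm would.

```python
# Hypothetical shopping baskets.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)

print(f"support(milk, bread) = {rule_support:.2f}")
print(f"confidence(milk -> bread) = {rule_confidence:.2f}")
```

Here milk and bread appear together in 3 of 5 baskets, and 3 of the 4 baskets with milk also contain bread.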
What is a model? Why is it useful?
A simplified version of a system or concept; helps predict or understand real-world phenomena.
What are simulations? Why are they useful?
Running a model to see how a system might behave; useful because it's safer, faster, and cheaper than real-world tests.
Do we need to test everything 1:1 in real life? Why model and simulate?
No — modeling saves time, money, and avoids risks.
Drawbacks of modeling/simulating?
Models might be inaccurate if based on bad data; always room for error.
Real-world examples of modeling/simulation?
Weather forecasting, traffic flow modeling, testing new airplane designs.
Why is data sometimes called 'the new oil'?
Because, like oil, raw data must be refined (processed) before it becomes extremely valuable.
What makes Big Data 'big'?
Volume (amount), Variety (types), Velocity (speed of creation).
Give one real-life example of a model/simulation.
Simulating virus spread to predict pandemic outcomes.
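A virus-spread simulation can be sketched with a simple SIR model, which tracks susceptible, infected, and recovered people day by day. The population size and the `beta` (infection rate) and `gamma` (recovery rate) values are assumptions chosen for illustration, not real epidemiological data.

```python
def simulate_sir(days, beta, gamma, s, i, r):
    """Run a discrete-time SIR model; returns one (s, i, r) tuple per day."""
    n = s + i + r
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / n  # contacts between s and i
        new_recoveries = gamma * i         # a fixed share of i recovers
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Made-up scenario: 1000 people, 10 infected at the start.
history = simulate_sir(days=100, beta=0.3, gamma=0.1, s=990.0, i=10.0, r=0.0)

# Find the day on which the number of infected people is highest.
peak_day = max(range(len(history)), key=lambda day: history[day][1])
print(f"Infections peak on day {peak_day} at about {history[peak_day][1]:.0f} people")
```

Changing `beta` or `gamma` and rerunning shows how the outcome shifts, which is far cheaper and safer than testing in the real world.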
What’s the benefit of organizing unstructured data using metadata?
Easier to search, organize, and retrieve information.
Which is more accurate: descriptive or predictive analysis? Why?
Descriptive — it’s based on actual past data, not guesses about the future.