Data science
an interdisciplinary field that uses statistical and computational techniques to extract insights or knowledge from data. It draws on statistics, computer science (especially machine learning and data mining), and domain expertise. Data scientists clean and organize raw data, analyze it (often writing code in languages like Python and R), apply algorithms and models to find patterns, and communicate results (usually via visualizations or reports) to inform decision-making
Big data (the three V’s)
a term describing datasets that are large in volume, generated at high velocity, and arriving in a variety of formats, often beyond the ability of traditional databases to handle. For example, data from millions of users on a social network streams in continuously (velocity); it can be structured (tables of user info), semi-structured (logs, JSON), or unstructured (images, text) (variety); and it can total terabytes or petabytes (volume). Big data requires special processing tools, such as distributed computing frameworks (e.g., Hadoop or Spark)
Machine learning
a subset of artificial intelligence where algorithms learn patterns from data and improve their performance on a task without being explicitly programmed with domain-specific rules. For instance, a machine learning model can learn to predict house prices by training on past housing data. Types of machine learning include supervised learning (with labeled examples), unsupervised learning (finding structure in unlabeled data), and reinforcement learning (learning via trial-and-error rewards)
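The house-price example can be sketched as a tiny supervised-learning model. This is a minimal illustration, not a production approach: it assumes a single feature (floor area) and fits a straight line by ordinary least squares. The data values and the names `fit_line` and `predict` are hypothetical, chosen only for this example.

```python
# Hypothetical labeled examples: floor area (square metres) and sale price
# (thousands). The values are made up for illustration.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 210.0, 270.0, 330.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": the parameters are learned from the data, not hand-coded.
slope, intercept = fit_line(sizes, prices)


def predict(size):
    """Predict a price for an unseen house size using the learned line."""
    return slope * size + intercept


print(predict(100.0))
```

The rule "price is about 3 times the area" is never written into the code; it is recovered from the labeled examples, which is what distinguishes this from explicitly programmed logic.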
Correlation vs Causation
correlation is a statistical measure that indicates the extent to which two variables move together. If one variable goes up when the other goes up, they have a positive correlation; if one goes up when the other goes down, they have a negative correlation. Causation means that changes in one variable actually cause changes in another. Importantly, correlation does not imply causation: two things might correlate due to coincidence or a third factor. For example, ice cream sales correlate with drowning incidents (both rise in summer), but ice cream doesn’t cause drowning
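To make the ice cream example concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. It assumes Pearson's r is the measure meant by "correlation" (the card does not name one), and all of the monthly figures below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r, ranging from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical monthly observations (made-up numbers).
temperature = [5, 10, 15, 20, 25, 30]       # degrees Celsius
ice_cream_sales = [20, 35, 50, 70, 90, 110]  # units sold
drownings = [1, 2, 3, 5, 6, 8]               # incidents

r_sales_drownings = pearson(ice_cream_sales, drownings)
r_temp_sales = pearson(temperature, ice_cream_sales)
r_temp_drownings = pearson(temperature, drownings)

print(r_sales_drownings, r_temp_sales, r_temp_drownings)
```

In this toy data, sales and drownings are strongly correlated with each other, but both also track temperature closely. The coefficient alone cannot tell which of these relationships, if any, is causal; here temperature plays the role of the third factor.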
Data cleaning (Data preparation)
the process of detecting and correcting (or removing) errors and inconsistencies in data to improve its quality before analysis. This can include handling missing values (e.g., filling them or dropping records), removing duplicate entries, correcting typos or formatting issues, and ensuring that data types are correct. Since raw data is often messy (with errors, outliers, or different conventions), data cleaning is a crucial step that can consume a large portion of a data scientist’s time, and it helps ensure that subsequent analysis or modeling is accurate
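The cleaning steps listed above can be sketched in plain Python. This is one possible approach under stated assumptions: the records, field names, and the `clean` function are hypothetical, duplicates are identified by `id`, and missing ages are filled with the rounded mean of the known ages (dropping those records would be an equally valid choice).

```python
# Hypothetical raw records, as they might arrive from a CSV file:
# everything is a string, with stray whitespace, mixed case,
# a duplicate row, and missing or invalid ages.
raw_records = [
    {"id": "1", "name": " Alice ", "age": "34"},
    {"id": "2", "name": "bob", "age": ""},
    {"id": "1", "name": " Alice ", "age": "34"},  # duplicate of id 1
    {"id": "3", "name": "CAROL", "age": "n/a"},
    {"id": "4", "name": "Dave", "age": "30"},
]


def clean(records):
    """Deduplicate by id, normalize names, fix types, fill missing ages."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        rec_id = int(rec["id"])              # enforce the correct data type
        if rec_id in seen_ids:               # remove duplicate entries
            continue
        seen_ids.add(rec_id)
        name = rec["name"].strip().title()   # fix formatting issues
        try:
            age = int(rec["age"])
        except ValueError:                   # "" or "n/a": treat as missing
            age = None
        cleaned.append({"id": rec_id, "name": name, "age": age})

    # Fill missing ages with the rounded mean of the known ages.
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = round(sum(known) / len(known)) if known else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Filling with the mean keeps every record but can distort the distribution when many values are missing, so the right strategy depends on the dataset and the analysis that follows.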