Data Science
Interdisciplinary field that focuses on extracting knowledge and insights from data using algorithms, statistics, and domain expertise.
Big Data 5 V’s
Volume, Velocity, Variety, Veracity, and Value describe the scale, speed, diversity, quality, and usefulness of data.
Volume
Refers to the massive amount of data collected from sensors, the web, or transactions.
Velocity
Refers to the speed at which data is generated and processed, for example streaming data.
Variety
Refers to the diversity of data formats: structured, semi-structured, and unstructured.
Veracity
Refers to the trustworthiness and quality of data, including noise and missing values.
Value
Refers to the usefulness and actionable insight data can provide.
Visualization vs Value
Visualization is not one of the 5 V's; it often appears as a trick option on exams.
Knowledge Discovery in Databases KDD
Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large collections of data.
Data Mining vs KDD
Data mining is the algorithmic extraction step within the larger KDD process.
KDD Step 1 Understanding the Problem Domain
Define goals and success criteria, gather domain knowledge, identify key people, and translate business goals into KDD goals.
KDD Step 2 Understanding the Data
Collect and assess the data: examine format, size, completeness, redundancy, missing values, and plausibility, and determine whether the data is sufficient.
KDD Step 3 Preparation of the Data
Clean, transform, and prepare the data: impute or remove missing values, reduce dimensionality, and discretize. This step consumes about half of the total project effort.
KDD Step 4 Data Mining
Apply algorithms such as decision trees, k-NN, Bayesian methods, SVMs, clustering, association rules, or neural networks; mining can be descriptive or predictive.
KDD Step 5 Evaluation
Assess patterns for validity, novelty, usefulness, and interpretability; select the best models and potentially revisit previous steps.
KDD Step 6 Using Discovered Knowledge
Deploy models, integrate them into decision making, monitor performance, and document results, possibly extending the knowledge to other domains.
KDD Iteration
The KDD process is iterative, with feedback loops between steps; earlier phases are often revisited if results are poor.
Value Data
A single measurement or observation such as a number or category.
Feature Attribute
A characteristic describing an object; also called a dimension, descriptor, or variable.
Object Instance
A single entity described by multiple features; also called an example, record, or data point.
Numerical Values
Values expressed as numbers such as integers or real numbers.
Symbolic Values
Qualitative concepts like words or categories.
Discrete Feature
Feature with a finite set of possible values, for example chest pain type 1, 2, 3, or 4.
Continuous Feature
Feature with an infinite or very large number of values within an interval, for example blood pressure between 0 and 250.
Nominal Feature
Feature with no natural ordering between values, for example colors.
Ordinal Feature
Feature with a natural ordering between values, for example sizes: small, medium, large.
Binary Feature
Special case of a discrete feature with only two values, such as 0 or 1, or child or adult.
Dataset
A collection of objects described by the same set of features, typically stored as a rectangular flat file with rows as objects and columns as features.
Flat File
Simple table structure, often exported from databases and stored in text formats such as CSV.
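A minimal sketch, in Python, of loading such a flat file; the column names and values are made up for illustration:

```python
import csv, io

# Hypothetical flat file contents: rows are objects, columns are features.
flat_file = io.StringIO(
    "age,blood_pressure,chest_pain_type,class\n"
    "63,145,1,sick\n"
    "41,120,3,healthy\n"
)

reader = csv.DictReader(flat_file)   # header row gives the feature names
dataset = list(reader)               # list of objects (dicts: feature -> value)

print(len(dataset))                  # 2 objects
print(dataset[0]["blood_pressure"])  # '145' -- one feature value of one object
```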
Data Repository UCI ML
Common source of benchmark datasets for data mining and machine learning; for example, the Labor Relations dataset is multivariate with categorical, integer, and real features.
Data Storage Options
Datasets can be stored in relational databases, flat files, or data warehouses.
Importance of Data Quality
Success of data science depends heavily on the quality and structure of input data.
Big Data
Refers to datasets with extremely large numbers of objects, features, and values that pose computational challenges.
Big Data Issues
Quantity and quality issues such as incompleteness, redundancy, noise, and missing values affect performance and scalability.
Scalability
Refers to how well an algorithm handles increasing data size, not storage efficiency.
Data Scale Dimensions
Three dimensions: number of objects, number of features, and number of feature values.
Number of Objects
Refers to the number of rows, examples, or instances in a dataset, ranging from hundreds to billions.
Number of Features
Refers to the number of columns, attributes, or dimensions, ranging from a few to thousands.
Feature Value Range
Refers to the number of distinct values a feature can assume, ranging from two to millions.
Asymptotic Complexity
Describes the growth rate of algorithm runtime as dataset size increases; used to compare scalability.
Linear Complexity O(n)
Algorithm runtime grows linearly with the number of objects; best suited for large data.
Log-Linear Complexity O(n log n)
Algorithm runtime grows slightly faster than linear; common in many scalable methods.
Quadratic Complexity O(n^2)
Runtime grows rapidly with data size; often infeasible for very large datasets.
Cubic Complexity O(n^3)
Runtime grows extremely fast; typically impractical for large n.
Runtime Growth Example
Doubling the dataset size doubles a linear runtime, multiplies a quadratic runtime by four, and multiplies a cubic runtime by eight.
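These factors follow from the complexity ratios when n doubles; a quick check (any n gives the same ratios):

```python
# Ratio of runtimes when the dataset size doubles from n to 2n.
n = 1000  # arbitrary; the ratios do not depend on n
print((2 * n) / n)           # 2.0 -> linear O(n)
print((2 * n)**2 / n**2)     # 4.0 -> quadratic O(n^2)
print((2 * n)**3 / n**3)     # 8.0 -> cubic O(n^3)
```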
Scalability Improvement Techniques
Algorithmic optimization and dataset partitioning are two main strategies to handle big data.
Algorithmic Optimization
Use heuristics, efficient data structures, and parallelization to reduce runtime.
Dataset Partitioning
Reduce dimensionality, sample representative subsets, and process the data in chunks, sequentially or in parallel.
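A minimal sketch of chunk-wise processing in Python; the column name, values, and chunk size are assumptions, and the per-chunk work (averaging one feature) stands in for any algorithm applied piecewise:

```python
import csv, io

CHUNK_SIZE = 2  # tiny here for illustration; real chunks hold e.g. 100_000 objects

def column_mean_in_chunks(file_obj, column, chunk_size=CHUNK_SIZE):
    """Compute the mean of one numeric feature while reading the data chunk by chunk."""
    reader = csv.DictReader(file_obj)
    total, count, chunk = 0.0, 0, []
    for row in reader:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:       # process a full chunk, then discard it
            total, count = total + sum(chunk), count + len(chunk)
            chunk = []
    if chunk:                               # leftover partial chunk
        total, count = total + sum(chunk), count + len(chunk)
    return total / count

# Demo with an in-memory file; in practice this would be an open file on disk.
data = io.StringIO("blood_pressure\n120\n145\n130\n110\n150\n")
print(column_mean_in_chunks(data, "blood_pressure"))  # 131.0
```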
Preprocessing Challenges Big Data
High noise, missing values, inconsistent formats, and redundant features require careful preprocessing.
Missing Data Causes
Common causes include human error, equipment malfunction, attrition, skip patterns, nonresponse, and noise removal.
MCAR
Missing completely at random: missingness is unrelated to the data; safe to delete or impute.
MAR
Missing at random: missingness is systematic across subgroups but random within each subgroup; can be handled with careful imputation or deletion.
MNAR
Missing not at random: missingness depends on unobserved values; cannot be reliably imputed and may require collecting additional data.
Listwise Deletion
Remove entire rows with missing values; quick, but reduces data size and may bias the analysis.
Variable Deletion
Remove features with many missing values; may lose important information.
Mean Imputation
Replace missing values with the mean of the observed values for that feature; fast, but reduces variance.
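A minimal mean-imputation sketch in plain Python (the course does imputation in Altair AI Studio; this only shows the idea, with None marking a missing value):

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

blood_pressure = [120, None, 145, 130, None]
print(mean_impute(blood_pressure))  # [120, 131.67, 145, 130, 131.67] (approx.)
```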
Hot Deck Imputation
Replace missing values using similar records (nearest neighbors); more accurate, but slower.
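A rough hot-deck sketch using nearest-neighbor donor selection over numeric features with Euclidean distance; the records and feature names are made up, and real implementations handle mixed types and ties more carefully:

```python
import math

def hot_deck_impute(records, target_idx, missing_feature):
    """Fill one missing value by copying it from the most similar complete record."""
    target = records[target_idx]

    def distance(r):
        # Compare only features observed in both records, excluding the missing one.
        shared = [k for k in target if k != missing_feature
                  and target[k] is not None and r.get(k) is not None]
        return math.sqrt(sum((target[k] - r[k]) ** 2 for k in shared))

    donors = [r for i, r in enumerate(records)
              if i != target_idx and r.get(missing_feature) is not None]
    donor = min(donors, key=distance)       # nearest complete record
    target[missing_feature] = donor[missing_feature]
    return target

records = [
    {"age": 63, "bp": 145, "chol": 230},
    {"age": 41, "bp": 120, "chol": 180},
    {"age": 60, "bp": 150, "chol": None},   # missing cholesterol value
]
print(hot_deck_impute(records, 2, "chol"))  # copies chol=230 from the closest donor
```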
Choosing Missing Data Method
Use deletion for small random gaps and imputation for important features; MNAR often requires collecting new data.
Data Warehouse Definition
A subject-oriented, integrated, time-variant, and nonvolatile collection of data for decision support.
Subject Oriented
Data is organized around major subjects, such as customer or sales, rather than around applications.
Integrated
Combines heterogeneous sources, cleans inconsistencies, and ensures uniformity.
Time Variant
Stores historical data, unlike operational databases, which store current snapshots.
Nonvolatile
The data warehouse is kept separate from operational databases; data is only loaded and read, with no updates or deletes.
OLTP vs OLAP
OLTP supports daily transactions with current data, many users, and read-write access, while OLAP supports analytical queries with historical data, fewer users, and read-only access.
Data Mart
Subset of a data warehouse focused on a specific business area.
Data Warehouse Architecture
Three tiers: bottom tier storage, middle tier OLAP server, and top tier front-end tools, plus a metadata repository.
Metadata Repository
Stores schema definitions, dimension hierarchies, data mart locations, operational metadata, summarization algorithms, and business definitions.
Multidimensional Data Model
Represents data as cubes with dimensions and measures for OLAP analysis.
Cuboid
One element in the lattice of data cube groupings; the apex cuboid is the zero-dimensional summary and the base cuboid is the n-dimensional full data.
Number of Cuboids Formula
The number of cuboids equals 2^d, where d is the number of dimensions.
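A small check of the formula for a hypothetical cube with dimensions time, item, and location (d = 3, so 2^3 = 8 cuboids):

```python
from itertools import combinations

dimensions = ["time", "item", "location"]  # d = 3, so 2**3 = 8 cuboids

# Every subset of the dimensions is one cuboid in the lattice.
cuboids = [combo for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

print(len(cuboids))   # 8
print(cuboids[0])     # () -> apex cuboid (total aggregate)
print(cuboids[-1])    # ('time', 'item', 'location') -> base cuboid
```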
Star Schema
Central fact table linked to denormalized dimension tables for fast querying.
Snowflake Schema
Normalized refinement of the star schema; uses more tables and less redundancy.
Protein Data Case Study
Protein Data Overview
Proteins have sequences and 3D structures stored in databases, which are growing exponentially due to sequencing technologies.
Protein Data Characteristics
Protein data is high-dimensional, heterogeneous, and noisy, and contains complex nonlinear patterns.
Protein Data Importance
Protein data provides a motivating example for preprocessing, feature selection, and scalable algorithms.
Altair AI Studio
GUI-based data science software, formerly RapidMiner, used to build and evaluate models without coding.
Altair Features
Supports data import, preprocessing, feature selection, sampling, partitioning, built-in algorithms, and evaluation tools.
Altair Algorithms
Supports k-NN, Decision Tree, Decision Stump, Naive Bayes, and some SVMs.
Altair Course Use
Used in assignments and projects for imputation, model building, and evaluation, with datasets like Labor Relations.
Labor Relations Dataset
Multivariate dataset with categorical, integer, and real attributes, used for practice in Altair.
Information Retrieval
Field concerned with finding relevant documents in large collections using term weighting and similarity measures.
Document
A text unit to be retrieved such as a web page or article.
Term
A word or token used to index documents.
Query
A set of terms representing a user’s information need.
Stop Words
Common words such as "and", "or", and "the" that are removed to reduce noise.
Stemming
Reducing words to their root forms to improve matching.
Inverted Index
Data structure that maps terms to lists of documents containing them.
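A minimal inverted-index sketch in Python, with a toy two-document corpus made up for illustration:

```python
from collections import defaultdict

# Toy corpus: document id -> already-tokenized text (stop words not removed here).
docs = {
    1: ["data", "mining", "is", "interesting"],
    2: ["data", "science", "uses", "data", "mining"],
}

inverted_index = defaultdict(set)   # term -> set of document ids containing it
for doc_id, terms in docs.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["data"]))     # [1, 2]
print(sorted(inverted_index["science"]))  # [2]
```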
TF Term Frequency
Number of times a term appears in a document, normalized by the maximum term frequency in that document.
IDF Inverse Document Frequency
Logarithm of the total number of documents divided by the number of documents containing the term; measures term rarity.
TF IDF Weighting
Multiplies term frequency and inverse document frequency to assign importance to each term in each document.
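A minimal TF-IDF sketch following the definitions above (TF normalized by the maximum frequency in the document, IDF as the log of total documents over document frequency); the natural log and the two-document corpus are assumptions for illustration, and the lecture example may use a different log base:

```python
import math
from collections import Counter

# Toy corpus: document id -> tokenized text.
docs = {
    1: ["data", "mining", "is", "interesting"],
    2: ["data", "science", "uses", "data", "mining"],
}

def tf_idf(docs):
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        max_count = max(counts.values())
        weights[doc_id] = {
            term: (count / max_count) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return weights

w = tf_idf(docs)
print(round(w[2]["science"], 3))  # 0.347 -- rare term, nonzero weight
print(w[2]["data"])               # 0.0 -- appears in every document, so IDF is 0
```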
Cosine Similarity
Measures similarity between document and query vectors as their dot product divided by the product of their magnitudes.
Cosine Similarity Formula
sim(d, q) = (d · q) / (||d|| ||q||), where d and q are the document and query vectors.
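A minimal cosine-similarity sketch over term-weight vectors, matching the formula above; the example weights are made up:

```python
import math

def cosine_similarity(d, q):
    """d . q / (||d|| * ||q||) for sparse term-weight dictionaries."""
    dot = sum(d[t] * q[t] for t in d if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

doc   = {"data": 1.0, "mining": 2.0, "science": 2.0}
query = {"data": 1.0, "mining": 1.0}
print(round(cosine_similarity(doc, query), 3))  # 0.707 -- closer to 1 means more relevant
```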
High Cosine Similarity
Indicates that a document is more relevant to the query.
TF IDF Example
"Science" and "mining" have IDF 2, "data" has IDF 1, and "interesting" has IDF 0; document weights equal term frequency times IDF.
Precision
Fraction of retrieved documents that are relevant: true positives divided by (true positives plus false positives).
Recall
Fraction of relevant documents that are retrieved: true positives divided by (true positives plus false negatives).
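A short worked example of precision and recall, with made-up retrieval counts:

```python
# Suppose a query retrieves 8 documents: 6 are relevant (true positives) and
# 2 are irrelevant (false positives); 4 relevant documents were missed (false negatives).
tp, fp, fn = 6, 2, 4

precision = tp / (tp + fp)   # fraction of retrieved documents that are relevant
recall    = tp / (tp + fn)   # fraction of relevant documents that were retrieved

print(precision)  # 0.75
print(recall)     # 0.6
```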
Low Precision High Recall
Means most relevant documents were retrieved but many irrelevant ones were also retrieved.
High Precision Low Recall
Means the retrieved documents are mostly relevant, but many relevant documents were missed.
Inductive Learning
Process of building general models from specific examples to make predictions on unseen data.
Supervised Learning
Learns a mapping from inputs to known outputs, for classification and regression.