Data Science
Interdisciplinary field that focuses on extracting knowledge and insights from data using algorithms, statistics, and domain expertise.
Big Data 5 V’s
Volume, Velocity, Variety, Veracity, and Value describe the scale, speed, diversity, quality, and usefulness of data.
Volume
Refers to the massive amount of data collected from sensors, the web, or transactions.
Velocity
Refers to the speed at which data is generated and processed, for example streaming data.
Variety
Refers to the diversity of data formats: structured, semi-structured, and unstructured.
Veracity
Refers to the trustworthiness and quality of data, including noise and missing values.
Value
Refers to the usefulness and actionable insight data can provide.
Visualization vs Value
Visualization is not one of the 5 V's; it often appears as a trick option on exams.
Knowledge Discovery in Databases KDD
Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large collections of data.
Data Mining vs KDD
Data mining is the algorithmic extraction step within the larger KDD process.
KDD Step 1 Understanding the Problem Domain
Define goals and success criteria, gather domain knowledge, identify key people, and translate business goals into KDD goals.
KDD Step 2 Understanding the Data
Collect and assess the data: examine format, size, completeness, redundancy, missing values, and plausibility, and determine whether the data is sufficient.
KDD Step 3 Preparation of the Data
Clean, transform, and prepare the data: impute or remove missing values, reduce dimensionality, and discretize. This step consumes about half of the total project effort.
KDD Step 4 Data Mining
Apply algorithms such as decision trees, k-NN, Bayesian methods, SVMs, clustering, association rules, or neural networks; mining can be descriptive or predictive.
KDD Step 5 Evaluation
Assess patterns for validity, novelty, usefulness, and interpretability; select the best models and potentially revisit previous steps.
KDD Step 6 Using Discovered Knowledge
Deploy models, integrate them into decision making, monitor performance, and document results, possibly extending the knowledge to other domains.
KDD Iteration
The KDD process is iterative, with feedback loops between steps; earlier phases are often revisited if results are poor.
Value Data
A single measurement or observation such as a number or category.
Feature Attribute
A characteristic describing an object; also called a dimension, descriptor, or variable.
Object Instance
A single entity described by multiple features; also called an example, record, or data point.
Numerical Values
Values expressed as numbers such as integers or real numbers.
Symbolic Values
Qualitative concepts like words or categories.
Discrete Feature
Feature with a finite set of possible values, for example chest pain type 1, 2, 3, or 4.
Continuous Feature
Feature with an infinite or very large number of values within an interval, for example blood pressure between 0 and 250.
Nominal Feature
Feature with no natural ordering between values, for example colors.
Ordinal Feature
Feature with a natural ordering between values, for example sizes: small, medium, large.
Binary Feature
Special case of a discrete feature with only two values, such as 0 or 1, or child or adult.
Dataset
A collection of objects described by the same set of features, typically stored as a rectangular flat file with rows as objects and columns as features.
Flat File
Simple table structure, often exported from databases and stored in text formats such as CSV.
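A minimal sketch, in Python, of loading such a flat file; the column names and values are made up for illustration:

```python
import csv, io

# Hypothetical flat file contents: rows are objects, columns are features.
flat_file = io.StringIO(
    "age,blood_pressure,chest_pain_type,class\n"
    "63,145,1,sick\n"
    "41,120,3,healthy\n"
)

reader = csv.DictReader(flat_file)   # header row gives the feature names
dataset = list(reader)               # list of objects (dicts: feature -> value)

print(len(dataset))                  # 2 objects
print(dataset[0]["blood_pressure"])  # '145' -- one feature value of one object
```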
Data Repository UCI ML
Common source of benchmark datasets for data mining and machine learning; for example, the Labor Relations dataset is multivariate with categorical, integer, and real features.
Data Storage Options
Datasets can be stored in relational databases, flat files, or data warehouses.
Importance of Data Quality
Success of data science depends heavily on the quality and structure of input data.
Big Data
Refers to datasets with extremely large numbers of objects, features, and values that pose computational challenges.
Big Data Issues
Quantity and quality issues such as incompleteness, redundancy, noise, and missing values affect performance and scalability.
Scalability
Refers to how well an algorithm handles increasing data size, not storage efficiency.
Data Scale Dimensions
Three dimensions: number of objects, number of features, and number of feature values.
Number of Objects
Refers to the number of rows, examples, or instances in a dataset, ranging from hundreds to billions.
Number of Features
Refers to the number of columns, attributes, or dimensions, ranging from a few to thousands.
Feature Value Range
Refers to the number of distinct values a feature can assume, ranging from two to millions.
Asymptotic Complexity
Describes the growth rate of algorithm runtime as dataset size increases; used to compare scalability.
Linear Complexity O(n)
Algorithm runtime grows linearly with the number of objects; best suited for large data.
Log-Linear Complexity O(n log n)
Algorithm runtime grows slightly faster than linear; common in many scalable methods.
Quadratic Complexity O(n^2)
Runtime grows rapidly with data size; often infeasible for very large datasets.
Cubic Complexity O(n^3)
Runtime grows extremely fast; typically impractical for large n.
Runtime Growth Example
Doubling the dataset size doubles a linear runtime, multiplies a quadratic runtime by four, and multiplies a cubic runtime by eight.
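These factors follow from the complexity ratios when n doubles; a quick check (any n gives the same ratios):

```python
# Ratio of runtimes when the dataset size doubles from n to 2n.
n = 1000  # arbitrary; the ratios do not depend on n
print((2 * n) / n)           # 2.0 -> linear O(n)
print((2 * n)**2 / n**2)     # 4.0 -> quadratic O(n^2)
print((2 * n)**3 / n**3)     # 8.0 -> cubic O(n^3)
```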
Scalability Improvement Techniques
Algorithmic optimization and dataset partitioning are two main strategies to handle big data.
Algorithmic Optimization
Use heuristics, efficient data structures, and parallelization to reduce runtime.
Dataset Partitioning
Reduce dimensionality, sample representative subsets, and process the data in chunks, sequentially or in parallel.
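A minimal sketch of chunk-wise processing in Python; the column name, values, and chunk size are assumptions, and the per-chunk work (averaging one feature) stands in for any algorithm applied piecewise:

```python
import csv, io

CHUNK_SIZE = 2  # tiny here for illustration; real chunks hold e.g. 100_000 objects

def column_mean_in_chunks(file_obj, column, chunk_size=CHUNK_SIZE):
    """Compute the mean of one numeric feature while reading the data chunk by chunk."""
    reader = csv.DictReader(file_obj)
    total, count, chunk = 0.0, 0, []
    for row in reader:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:       # process a full chunk, then discard it
            total, count = total + sum(chunk), count + len(chunk)
            chunk = []
    if chunk:                               # leftover partial chunk
        total, count = total + sum(chunk), count + len(chunk)
    return total / count

# Demo with an in-memory file; in practice this would be an open file on disk.
data = io.StringIO("blood_pressure\n120\n145\n130\n110\n150\n")
print(column_mean_in_chunks(data, "blood_pressure"))  # 131.0
```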
Preprocessing Challenges Big Data
High noise, missing values, inconsistent formats, and redundant features require careful preprocessing.
Missing Data Causes
Common causes include human error, equipment malfunction, attrition, skip patterns, nonresponse, and noise removal.
MCAR
Missing completely at random: missingness is unrelated to the data; safe to delete or impute.
MAR
Missing at random: missingness is systematic across subgroups but random within each subgroup; can be handled with careful imputation or deletion.
MNAR
Missing not at random: missingness depends on unobserved values; cannot be reliably imputed and may require collecting additional data.
Listwise Deletion
Remove entire rows with missing values; quick, but reduces data size and may bias the analysis.
Variable Deletion
Remove features with many missing values; may lose important information.
Mean Imputation
Replace missing values with the mean of the observed values for that feature; fast, but reduces variance.
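A minimal mean-imputation sketch in plain Python (the course does imputation in Altair AI Studio; this only shows the idea, with None marking a missing value):

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

blood_pressure = [120, None, 145, 130, None]
print(mean_impute(blood_pressure))  # [120, 131.67, 145, 130, 131.67] (approx.)
```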
Hot Deck Imputation
Replace missing values using similar records (nearest neighbors); more accurate, but slower.
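A rough hot-deck sketch using nearest-neighbor donor selection over numeric features with Euclidean distance; the records and feature names are made up, and real implementations handle mixed types and ties more carefully:

```python
import math

def hot_deck_impute(records, target_idx, missing_feature):
    """Fill one missing value by copying it from the most similar complete record."""
    target = records[target_idx]

    def distance(r):
        # Compare only features observed in both records, excluding the missing one.
        shared = [k for k in target if k != missing_feature
                  and target[k] is not None and r.get(k) is not None]
        return math.sqrt(sum((target[k] - r[k]) ** 2 for k in shared))

    donors = [r for i, r in enumerate(records)
              if i != target_idx and r.get(missing_feature) is not None]
    donor = min(donors, key=distance)       # nearest complete record
    target[missing_feature] = donor[missing_feature]
    return target

records = [
    {"age": 63, "bp": 145, "chol": 230},
    {"age": 41, "bp": 120, "chol": 180},
    {"age": 60, "bp": 150, "chol": None},   # missing cholesterol value
]
print(hot_deck_impute(records, 2, "chol"))  # copies chol=230 from the closest donor
```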
Choosing Missing Data Method
Use deletion for small random gaps and imputation for important features; MNAR often requires collecting new data.
Data Warehouse Definition
A subject-oriented, integrated, time-variant, and nonvolatile collection of data for decision support.
Subject Oriented
Data is organized around major subjects, such as customer or sales, rather than around applications.
Integrated
Combines heterogeneous sources, cleans inconsistencies, and ensures uniformity.
Time Variant
Stores historical data, unlike operational databases, which store current snapshots.
Nonvolatile
The data warehouse is kept separate from operational databases; data is only loaded and read, with no updates or deletes.
OLTP vs OLAP
OLTP supports daily transactions with current data, many users, and read-write access, while OLAP supports analytical queries with historical data, fewer users, and read-only access.
Data Mart
Subset of a data warehouse focused on a specific business area.
Data Warehouse Architecture
Three tiers: bottom tier storage, middle tier OLAP server, and top tier front-end tools, plus a metadata repository.
Metadata Repository
Stores schema definitions, dimension hierarchies, data mart locations, operational metadata, summarization algorithms, and business definitions.
Multidimensional Data Model
Represents data as cubes with dimensions and measures for OLAP analysis.
Cuboid
One element in the lattice of data cube groupings; the apex cuboid is the zero-dimensional summary and the base cuboid is the n-dimensional full data.
Number of Cuboids Formula
The number of cuboids equals 2^d, where d is the number of dimensions.
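A small check of the formula for a hypothetical cube with dimensions time, item, and location (d = 3, so 2^3 = 8 cuboids):

```python
from itertools import combinations

dimensions = ["time", "item", "location"]  # d = 3, so 2**3 = 8 cuboids

# Every subset of the dimensions is one cuboid in the lattice.
cuboids = [combo for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

print(len(cuboids))   # 8
print(cuboids[0])     # () -> apex cuboid (total aggregate)
print(cuboids[-1])    # ('time', 'item', 'location') -> base cuboid
```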
Star Schema
Central fact table linked to denormalized dimension tables for fast querying.
Snowflake Schema
Normalized refinement of the star schema; uses more tables and less redundancy.
Protein Data Case Study
Protein Data Overview
Proteins have sequences and 3D structures stored in databases, which are growing exponentially due to sequencing technologies.
Protein Data Characteristics
Protein data is high-dimensional, heterogeneous, and noisy, and contains complex nonlinear patterns.
Protein Data Importance
Protein data provides a motivating example for preprocessing, feature selection, and scalable algorithms.
Altair AI Studio
GUI-based data science software, formerly RapidMiner, used to build and evaluate models without coding.
Altair Features
Supports data import, preprocessing, feature selection, sampling, partitioning, built-in algorithms, and evaluation tools.
Altair Algorithms
Supports k-NN, Decision Tree, Decision Stump, Naive Bayes, and some SVMs.
Altair Course Use
Used in assignments and projects for imputation, model building, and evaluation, with datasets like Labor Relations.
Labor Relations Dataset
Multivariate dataset with categorical, integer, and real attributes, used for practice in Altair.
Information Retrieval
Field concerned with finding relevant documents in large collections using term weighting and similarity measures.
Document
A text unit to be retrieved such as a web page or article.
Term
A word or token used to index documents.
Query
A set of terms representing a user’s information need.
Stop Words
Common words such as "and", "or", and "the" that are removed to reduce noise.
Stemming
Reducing words to their root forms to improve matching.
Inverted Index
Data structure that maps terms to lists of documents containing them.
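A minimal inverted-index sketch in Python, with a toy two-document corpus made up for illustration:

```python
from collections import defaultdict

# Toy corpus: document id -> already-tokenized text (stop words not removed here).
docs = {
    1: ["data", "mining", "is", "interesting"],
    2: ["data", "science", "uses", "data", "mining"],
}

inverted_index = defaultdict(set)   # term -> set of document ids containing it
for doc_id, terms in docs.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["data"]))     # [1, 2]
print(sorted(inverted_index["science"]))  # [2]
```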
TF Term Frequency
Number of times a term appears in a document, normalized by the maximum term frequency in that document.
IDF Inverse Document Frequency
Logarithm of the total number of documents divided by the number of documents containing the term; measures term rarity.
TF IDF Weighting
Multiplies term frequency and inverse document frequency to assign importance to each term in each document.
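A minimal TF-IDF sketch following the definitions above (TF normalized by the maximum frequency in the document, IDF as the log of total documents over document frequency); the natural log and the two-document corpus are assumptions for illustration, and the lecture example may use a different log base:

```python
import math
from collections import Counter

# Toy corpus: document id -> tokenized text.
docs = {
    1: ["data", "mining", "is", "interesting"],
    2: ["data", "science", "uses", "data", "mining"],
}

def tf_idf(docs):
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        max_count = max(counts.values())
        weights[doc_id] = {
            term: (count / max_count) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return weights

w = tf_idf(docs)
print(round(w[2]["science"], 3))  # 0.347 -- rare term, nonzero weight
print(w[2]["data"])               # 0.0 -- appears in every document, so IDF is 0
```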
Cosine Similarity
Measures similarity between document and query vectors as their dot product divided by the product of their magnitudes.
Cosine Similarity Formula
sim(d, q) = (d · q) / (||d|| ||q||), where d and q are the document and query vectors.
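A minimal cosine-similarity sketch over term-weight vectors, matching the formula above; the example weights are made up:

```python
import math

def cosine_similarity(d, q):
    """d . q / (||d|| * ||q||) for sparse term-weight dictionaries."""
    dot = sum(d[t] * q[t] for t in d if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

doc   = {"data": 1.0, "mining": 2.0, "science": 2.0}
query = {"data": 1.0, "mining": 1.0}
print(round(cosine_similarity(doc, query), 3))  # 0.707 -- closer to 1 means more relevant
```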
High Cosine Similarity
Indicates that a document is more relevant to the query.
TF IDF Example
"Science" and "mining" have IDF 2, "data" has IDF 1, and "interesting" has IDF 0; document weights equal term frequency times IDF.
Precision
Fraction of retrieved documents that are relevant: true positives divided by (true positives plus false positives).
Recall
Fraction of relevant documents that are retrieved: true positives divided by (true positives plus false negatives).
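A short worked example of precision and recall, with made-up retrieval counts:

```python
# Suppose a query retrieves 8 documents: 6 are relevant (true positives) and
# 2 are irrelevant (false positives); 4 relevant documents were missed (false negatives).
tp, fp, fn = 6, 2, 4

precision = tp / (tp + fp)   # fraction of retrieved documents that are relevant
recall    = tp / (tp + fn)   # fraction of relevant documents that were retrieved

print(precision)  # 0.75
print(recall)     # 0.6
```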
Low Precision High Recall
Means most relevant documents were retrieved but many irrelevant ones were also retrieved.
High Precision Low Recall
Means the retrieved documents are mostly relevant, but many relevant documents were missed.
Inductive Learning
Process of building general models from specific examples to make predictions on unseen data.
Supervised Learning
Learns a mapping from inputs to known outputs, for classification and regression.