Mean
The average in a dataset/distribution
Median
The middle value in a dataset/distribution
Mode
The most frequent value in a dataset/distribution
Range
The difference between the highest and lowest scores in a dataset/distribution
Understanding these measures helps in summarizing data and making comparisons between different datasets.
Why is it important to understand these measures?
Variance
Tells us how much the numbers in a dataset differ from the mean, showing the degree of spread or variability.
Standard Deviation (SD)
Used to measure how much the data points in a numerical dataset are spread out from the mean
Low SD = more concentrated, more consistent data
High SD = more spread out, less consistent data
It is the square root of the variance
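A minimal sketch, using Python's built-in statistics module on a made-up dataset, of how these measures can be computed:
import statistics as st

data = [4, 8, 6, 5, 3, 8]        # hypothetical scores
print(st.mean(data))             # mean: the average
print(st.median(data))           # median: the middle value
print(st.mode(data))             # mode: the most frequent value
print(max(data) - min(data))     # range: highest minus lowest
print(st.pvariance(data))        # population variance
print(st.pstdev(data))           # SD: square root of the variance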
Covariance
A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship (think of it like slope values)
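A quick illustration of covariance with NumPy (assumes NumPy is installed; the paired data is made up):
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(np.cov(x, y)[0, 1])  # positive value -> positive linear relationship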
Characteristics of a Gaussian distribution (normal distribution)
Has a "bell" curve defined by its mean and standard deviation
It underlies statistical tests that assume normality and makes it possible to estimate how likely an outcome of a certain value is
What is the importance of the Gaussian distribution (normal distribution)?
Empirical rule
What rule states that approximately 68% of data falls within one standard deviation of the mean, 95% within two, and 99.7% within three, helping in understanding data spread?
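A small check of the empirical rule on simulated normal data (assumes NumPy is installed):
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100_000)
for k in (1, 2, 3):
    print(f"within {k} SD: {np.mean(np.abs(data) <= k):.3f}")  # ~0.683, ~0.954, ~0.997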
Discrete variable
Takes integer values that come from counting
Continuous variable
Takes any value within a range of infinitely many values and comes from measurement
Boxplots
Display the distribution of data and help identify outliers
Histograms
Show frequency distributions.
Scatter plots
Effective for visualizing relationships between two continuous variables, helping to identify correlations.
Multivariable data analysis
Involves examining multiple variables simultaneously to understand relationships and dependencies.
Dependence method
A multivariate technique appropriate when one or more of the variables can be identified as dependent variable(s) and the remaining as independent variables
Interdependence method
Focuses on relationships among all variables without specifying dependent or independent ones.
Used for data reduction or structure detection.
Multiple linear regression
Models the relationship between one or more dependent variables and multiple independent (predictor) variables, where relationships are assumed to be linear.
Logistic regression
A dependence method used when the dependent variable (outcome) is categorical, typically binary (e.g., yes/no, success/failure, 0/1).
Its goal is classification.
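A minimal sketch of logistic regression for a binary outcome, assuming scikit-learn is installed (the feature and data are made up for illustration):
from sklearn.linear_model import LogisticRegression

X = [[25], [35], [45], [20], [50], [30]]   # hypothetical feature, e.g., hours studied
y = [0, 0, 1, 0, 1, 1]                     # binary outcome: fail/pass
model = LogisticRegression().fit(X, y)
print(model.predict([[40]]))        # predicted class (0 or 1)
print(model.predict_proba([[40]]))  # probability of each class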
Clean data enhances the reliability of analyses and the validity of insights drawn from data.
What is the importance of data cleaning?
Factors that can affect data quality
Duplicates
Incomplete datasets (missing values)
Inaccurate or incorrect data
Low-quality or unreliable data sources
Inconsistent data formats or entries
Outdated (untimely) data
Data entry errors
Data integration or merging issues
Lack of standardization (different units, formats)
Missing or unclear metadata/documentation
Human error during collection or processing
Technical issues (e.g., system failures or transmission errors)
Poor data governance or management practices
Lack of validation or verification checks
Ambiguous or undefined data definitions
Linear regression can be used in situations involving continuous numbers, such as predicting energy consumption.
Decision trees make predictions or classifications by splitting data into branches based on feature values; they can be used to diagnose diseases by classifying symptoms into possible illnesses.
k-means groups data points into clusters based on similarity and is used when there are no predefined labels, for example grouping cities by similar weather patterns.
Describe how data science algorithms are applied to real-world problems (e.g., linear regression, decision trees, k-means)
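A hedged sketch of all three algorithms on tiny made-up data, assuming scikit-learn is installed:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Linear regression: predict a continuous value (e.g., energy use).
reg = LinearRegression().fit([[1], [2], [3]], [10, 20, 30])
print(reg.predict([[4]]))                      # ~40

# Decision tree: classify from feature values (e.g., symptoms -> illness).
clf = DecisionTreeClassifier().fit([[1, 0], [0, 1], [1, 1]], ["flu", "cold", "flu"])
print(clf.predict([[1, 0]]))                   # ['flu']

# k-means: group unlabeled points into clusters (e.g., similar weather).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit([[1], [2], [10], [11]])
print(km.labels_)                              # two clusters of two points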
Selects all rows and columns (SELECT)
SELECT *
FROM employees;
Shows specific columns
SELECT first_name, last_name, department
FROM employees;
Filtering results to show only employees from the IT department (WHERE)
SELECT *
FROM employees
WHERE department = 'IT';
Using logical operators (AND) and comparison operators (>, =)
SELECT *
FROM employees
WHERE salary > 55000 AND department = 'IT';
Sorting results (ORDER BY)
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC;
Finding the average salary per department (AVG, AS, GROUP BY)
SELECT department, AVG(salary) AS average_salary
FROM employees
GROUP BY department;
Filtering groups (HAVING)
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department
HAVING COUNT(*) > 1;
Inserting new data (INSERT)
INSERT INTO employees (first_name, last_name, department, salary, hire_date)
VALUES ('Emily', 'Green', 'Sales', 52000, '2022-04-05');
Updating data (UPDATE)
UPDATE employees
SET salary = 65000
WHERE employee_id = 2;
Deleting data (DELETE FROM)
DELETE FROM employees
WHERE department = 'Sales';
NumPy (Numerical Python)
The foundational Python library for numerical and scientific computing.
Provides powerful multi-dimensional arrays (ndarray) for storing data.
Supports vectorized operations (fast element-wise math).
Includes linear algebra, random number generation, and Fourier transforms.
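A small sketch of these NumPy features (assumes NumPy is installed):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])       # 2-D ndarray
print(a * 2)                               # vectorized element-wise math
print(a.mean(axis=0))                      # column means
print(np.linalg.norm(a))                   # a linear algebra routine
print(np.random.default_rng(0).random(3))  # random number generation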
Pandas
Python library for data manipulation and analysis.
Uses Series (1D) and DataFrame (2D, like an Excel sheet) data structures
Used for: Cleaning, exploring, and analyzing structured data (spreadsheets, databases, CSVs).
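A small sketch of a Series and a DataFrame (assumes pandas is installed; the columns are made up):
import pandas as pd

s = pd.Series([10, 20, 30])                    # 1-D Series
df = pd.DataFrame({"name": ["Ana", "Ben"],     # 2-D DataFrame,
                   "salary": [52000, 61000]})  # like an Excel sheet
print(df.describe())             # quick summary of numeric columns
print(df[df["salary"] > 55000])  # filtering rows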
Matplotlib
Data visualization Python library
Core plotting library for Python.
Makes line plots, bar charts, histograms, etc.
Seaborn
Built on top of Matplotlib.
Adds aesthetic statistical plots with fewer lines of code.
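A brief sketch of the same data plotted with Matplotlib and then Seaborn (assumes both libraries are installed):
import matplotlib.pyplot as plt
import seaborn as sns

x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

plt.plot(x, y)                 # Matplotlib line plot
plt.xlabel("x")
plt.ylabel("y")
plt.show()

sns.scatterplot(x=x, y=y)      # Seaborn's styled statistical plot
plt.show()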
Scikit-learn
A traditional machine learning Python library
Implements algorithms for classification, regression, clustering, and dimensionality reduction.
Includes model selection, feature scaling, and evaluation metrics.
Used for: Building ML models like regression, decision trees, and k-means.
TensorFlow
A Python library for deep learning and large-scale numerical computation (by Google).
Supports neural networks, automatic differentiation, and GPU acceleration.
Often used with Keras, a high-level API for model building.
Used for: Neural networks, image recognition, NLP, and AI deployment.
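A minimal sketch of defining and compiling a small neural network with the Keras API (assumes TensorFlow is installed; the layer sizes are arbitrary):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # 4 input features
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),  # 3-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints the architecture; training would use model.fit(...)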
PyTorch
A Python library for deep learning and AI research (by Meta/Facebook).
Uses dynamic computation graphs (easy to debug).
Popular for research, computer vision, and NLP.
Used for: Custom AI model development and experimentation.
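A minimal PyTorch sketch of a tiny network and one forward pass (assumes torch is installed; sizes are arbitrary):
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16)   # input -> hidden
        self.fc2 = nn.Linear(16, 3)   # hidden -> 3-class output

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
print(net(torch.randn(2, 4)))  # forward pass on a random batch of 2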
Python is easy to understand, handles large and complex datasets with ease, integrates with machine learning, and has strong community support, all of which make it useful
Discuss the use of Python for cleaning and wrangling datasets
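A minimal cleaning and wrangling sketch with pandas (assumes pandas is installed; column names and values are hypothetical):
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ana", "ben ", None],
                   "salary": [52000, 52000, None, 47000]})
df = df.drop_duplicates()                                  # remove duplicate rows
df["salary"] = df["salary"].fillna(df["salary"].median())  # fill missing values
df = df.dropna(subset=["name"])                            # drop rows missing a key field
df["name"] = df["name"].str.strip().str.title()            # standardize text format
print(df)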
Has built-in statistical functions (e.g., regression, hypothesis testing).
Handles data cleaning, visualization, and modeling in one environment.
Offers thousands of packages for specialized analyses.
Integrates well with databases, big data systems, and Python.
Discuss the use of R for data science
Characteristics of relational databases
Data is organized in tables
There are relationships in that data
Uses SQL
Uses constraints to maintain data accuracy
Often normalized, or organized into multiple related tables to reduce redundancy
Supports user access control, encryption, and roles for data security
Can scale vertically or horizontally
Uses JOINs to combine data across related tables
Atomicity
Consistency
Isolation
Durability
What are the ACID properties?
Generative AI is an advanced form of artificial intelligence that learns from large datasets to create new and original content.
It blends creativity with computation, transforming how humans write, design, code, and communicate while raising important questions about ethics, originality, and human-AI collaboration.
Discuss the nature of generative AI
Capabilities of generative AI
Can create text, images, audio, video, and code from simple prompts.
Understands and responds to human language naturally.
Generates content tailored to user preferences or contexts (e.g., personalized ads, study materials, or health insights).
Assists in brainstorming ideas, designing logos, drafting stories, or generating new concepts that inspire human creativity.
Produces synthetic data to train or test other AI models when real data is limited.
Speeds up tasks such as coding, report writing, summarizing, or designing, saving time and resources.
Limitations of generative AI
AI models don't "think" or "understand" like humans — they generate output based on patterns, not reasoning or comprehension.
Can produce incorrect, fabricated, or misleading information (called AI hallucination).
If trained on biased data, the AI may reproduce or amplify stereotypes, discrimination, or misinformation.
Raises questions about authorship, copyright, and data privacy, since outputs may resemble copyrighted or personal content.
Quality of generated output is limited by the quality and diversity of the training data.
Struggles with long-term reasoning, emotions, or understanding real-world context beyond what's in the prompt.
Training and running large models require massive computing power and energy, which can be expensive and environmentally impactful.
Uses of Generative AI in the Real World
Creating study guides, quizzes, and explanations for students
Using tools like GitHub Copilot to suggest or generate code
Creating marketing slogans
Types of AI subfields
Computer Vision, Natural Language Processing, Human Interaction, Robotics, Machine Learning, and Expert Systems.
Large Language Models (LLMs)
AI models that learn from massive text datasets to produce human-like language and assist in communication, learning, and problem-solving.
Capabilities of LLMs
Natural language understanding
Text Generation
Summarization and Translation
Question Answering and Chatting
Coding and technical assistance
Information Retrieval and Reasoning
Personalization and Adaptation
It is data-driven, adaptive, predictive, iterative, and autonomous
What is the nature of machine learning?
Used to teach the model by showing examples of inputs and their corresponding outputs.
What is the use of training datasets for machine learning?
Used to tune the model and select the best version while training.
What is the use of validation datasets for machine learning?
Used to evaluate the final model's performance on completely unseen data.
What is the use of testing datasets for machine learning?
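A hedged sketch of a 60/20/20 train/validation/test split using scikit-learn (the proportions are a common convention, not a fixed rule):
from sklearn.model_selection import train_test_split

X = list(range(100))              # toy features
y = [v % 2 for v in X]            # toy labels
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20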
Machine learning algorithms
What works by identifying patterns, learning from data, and adapting internal parameters to make accurate predictions or decisions?
Neural networks
Inspired by the human brain, what part of machine learning consists of layers of interconnected nodes (neurons) that process information?
Supervised learning predicts outcomes from known labels.
Unsupervised learning uncovers hidden structures in data.
Reinforcement learning learns to make decisions by interacting with the environment and optimizing rewards.
How to tell the difference between unsupervised, supervised, and reinforcement learning algorithms
Supervised learning
If you have labeled data (inputs with known outputs), what machine learning algorithm would be the most appropriate one?
Unsupervised learning
If you have unlabeled data (needing to discover patterns and structure), what machine learning algorithm would be the most appropriate one?
Reinforcement learning
If it involves decision-making through trial and error in a dynamic environment, what machine learning algorithm would be the most appropriate one?
Deep learning
An AI approach where multi-layered neural networks learn complex patterns from data, enabling machines to perform tasks like image recognition, speech processing, and language understanding while automatically extracting features and making predictions.
Predicate logic in AI allows machines to represent facts, rules, and relationships, and reason about them to make intelligent decisions or derive new knowledge.
Explain how predicate logic is used in AI models
Examples of predicate logic
Representing Facts, Rules, and Relationships
Predicate Logic
A formal system used in AI to represent knowledge and reason about it.
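A toy sketch of predicate-logic-style facts and a rule written in plain Python (the parent/grandparent predicates are illustrative, not a library):
# Facts: parent(Ann, Bob), parent(Bob, Cal)
facts = {("parent", "Ann", "Bob"), ("parent", "Bob", "Cal")}

# Rule: grandparent(X, Z) <- parent(X, Y) AND parent(Y, Z)
def grandparents(facts):
    return {(x, z)
            for (p1, x, y1) in facts if p1 == "parent"
            for (p2, y2, z) in facts if p2 == "parent" and y1 == y2}

print(grandparents(facts))  # {('Ann', 'Cal')} -- new knowledge derived by reasoning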
Logic-based reasoning is precise and rule-driven (ideal for problems with clear facts), while probability-based reasoning handles uncertainty (making it suitable for real-world problems where information is incomplete or noisy)
What is the difference between logic based and probability based reasoning?
Bayesian Networks
Probabilistic graphical models that represent a set of variables and their conditional dependencies using a Directed Acyclic Graph (DAG). They are widely used in AI for reasoning under uncertainty.
Nodes
What component of a Bayesian network represents random variables in the domain and can be discrete or continuous?
Edges
What component of a Bayesian network consists of directed arrows that represent conditional dependencies between nodes?
Directed Acyclic Graph
What component of a Bayesian network is a graph with directed edges and no cycles, ensuring no variable depends on itself directly or indirectly?
Conditional Probability Tables (CPT)
What component of a Bayesian network quantifies the probability of a node given its parents?
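A tiny hand-rolled sketch of a two-node network, Rain -> WetGrass (all probabilities are made up for illustration):
p_rain = 0.2                           # prior at the Rain node
cpt_wet = {True: 0.9, False: 0.1}      # CPT: P(WetGrass | Rain)

# Marginal via the chain rule on the DAG:
p_wet = p_rain * cpt_wet[True] + (1 - p_rain) * cpt_wet[False]
# Bayes' rule gives reasoning under uncertainty: P(Rain | WetGrass)
p_rain_given_wet = p_rain * cpt_wet[True] / p_wet
print(round(p_wet, 3), round(p_rain_given_wet, 3))  # 0.26 0.692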
Nature of Knowledge Representation and Reasoning for AI
Focuses on how machines store, understand, and use knowledge to make intelligent decisions.
Reasoning for AI
How AI uses stored knowledge to draw conclusions or make decisions.
Possible dilemmas that can happen because of AI
Ethics, accountability, privacy, and bias
AI inherits bias through biased data, human assumptions, and algorithmic design choices, leading to unfair or discriminatory results.
How does AI inherit bias?
You need diverse data, transparent development, and ongoing monitoring.
What is needed to ensure fairness in AI?
Data Privacy Risks in LLMs
Training on Sensitive Data
Unintentional data leakage
Users sharing information that can be stored
Security Risks in LLMs
Inserting hidden instructions
Tricking models to reveal confidential information
Altering training data to produce false patterns
Copying the software
Credibility issues in AI
Hallucinations
Misinformation
Bias
Lack of transparency
Overconfidence
The nature of data science
The ability to transform raw data into meaningful insights using a mix of statistical analysis, programming, and critical thinking to guide decisions and innovations.
Structured data is neatly organized and easy to analyze, while unstructured data is more complex and varied, requiring advanced AI and machine learning methods to extract insights.
What is the difference between structured data and unstructured data?
Numeric data
What type of data uses measurable quantities, is used for mathematical calculations, and can be either an integer or a float (continuous)?
Categorical data
What type of data uses labels, names, or categories, is used to classify or group data, and can be nominal or ordinal?
Decimal to Binary
What conversion uses the method of repeatedly dividing the decimal number by 2, noting the remainders, and reading them bottom-up?
Binary to Decimal
What conversion uses the method of multiplying each bit by 2 raised to its position power, starting from 0 on the right, and then summing the results?
Decimal to Hexadecimal
What conversion uses the method of repeatedly dividing the decimal number by 16, noting the remainders (read bottom-up), and using letters A-F for values 10-15?
Hexadecimal to Decimal
What conversion uses the method of multiplying each digit by 16 raised to its position power, starting from 0 on the right, and then summing the results?
Binary to Hexadecimal
What conversion uses the method of grouping binary digits into groups of 4 bits starting from the right, then converting each group to its hex equivalent?
Hexadecimal to Binary
What conversion uses the method of converting each hex digit to its 4-bit binary equivalent?
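A quick sketch of these conversions in Python, with a manual version of the divide-by-2 method:
n = 45
print(bin(n))             # '0b101101'  decimal -> binary
print(hex(n))             # '0x2d'      decimal -> hexadecimal
print(int("101101", 2))   # 45          binary -> decimal
print(int("2D", 16))      # 45          hexadecimal -> decimal

# Manual decimal -> binary: divide by 2, read remainders bottom-up.
bits, m = [], n
while m:
    bits.append(m % 2)
    m //= 2
print("".join(str(b) for b in reversed(bits)))  # '101101'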
Structured data
Unstructured data
Semi-structured data
Sensor or machine data
Transactional data
Survey or observational data
What are the types of data that could be gathered from various sources?
It is important because it involves cleaning, organizing, and converting data into a usable format so it can be effectively analyzed or used by AI and machine learning models.
What is the importance of data wrangling and transformation?
First stage
What stage in the data science process is Data Collection: gathering data from various sources such as databases, surveys, sensors, or APIs to obtain relevant, high-quality data for the problem being studied?
Second stage
What stage in the data science process is Data Cleaning: removing errors, duplicates, and missing values and correcting inconsistencies to ensure the data is accurate, complete, and ready for analysis?
Third Stage
What stage in the data science process is Data Exploration and Analysis: examining data through visualization and statistical methods to identify patterns and trends and to understand the data's structure, relationships, and potential insights?
Fourth stage
What stage in the data science process is Data Transformation and Feature Engineering: converting data into usable formats and creating new variables (features) to prepare the data for modeling and enhance its predictive power?
Fifth Stage
What stage in the data science process is Modeling: applying machine learning or statistical algorithms to build models that can explain or predict outcomes?
Sixth Stage
What stage in the data science process is Evaluation: testing model accuracy on validation and test datasets to measure performance and ensure the model generalizes well to new data?
Seventh stage
What stage in the data science process is Deployment and Communication: implementing the model in real-world systems and sharing insights through reports or dashboards to support decision-making or automate processes?