Mean
The average in a dataset/distribution
Median
The middle value in a dataset/distribution
Mode
The most frequent value in a dataset/distribution
Range
The difference between the highest and lowest scores in a dataset/distribution
Understanding these measures helps in summarizing data and making comparisons between different datasets.
Why is it important to understand these measures?
Variance
Tells us how much the numbers in a dataset differ from the mean, showing the degree of spread or variability.
Standard Deviation (SD)
Used to measure how much the data points in a numerical dataset are spread out from the mean
Low SD = more concentrated, more consistent data
High SD = more spread out, less consistent data
It is the square root of the variance
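A minimal sketch, using Python's built-in statistics module on a made-up dataset, of how these measures can be computed:
import statistics as st

data = [4, 8, 6, 5, 3, 8]        # hypothetical scores
print(st.mean(data))             # mean: the average
print(st.median(data))           # median: the middle value
print(st.mode(data))             # mode: the most frequent value
print(max(data) - min(data))     # range: highest minus lowest
print(st.pvariance(data))        # population variance
print(st.pstdev(data))           # SD: square root of the variance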
Covariance
A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship (think of it like slope values)
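A quick illustration of covariance with NumPy (assumes NumPy is installed; the paired data is made up):
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(np.cov(x, y)[0, 1])  # positive value -> positive linear relationship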
Characteristics of a Gaussian distribution (normal distribution)
Has a "bell" curve defined by its mean and standard deviation
It underlies statistical tests that assume normality and makes it possible to estimate how likely an outcome of a certain value is
What is the importance of the Gaussian distribution (normal distribution)?
Empirical rule
What rule states that approximately 68% of data falls within one standard deviation of the mean, 95% within two, and 99.7% within three, helping in understanding data spread?
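A small check of the empirical rule on simulated normal data (assumes NumPy is installed):
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100_000)
for k in (1, 2, 3):
    print(f"within {k} SD: {np.mean(np.abs(data) <= k):.3f}")  # ~0.683, ~0.954, ~0.997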
Discrete variable
Takes integer values that come from counting
Continuous variable
Takes any value within a range of infinitely many values and comes from measurement
Boxplots
Display the distribution of data and help identify outliers
Histograms
Show frequency distributions.
Scatter plots
Effective for visualizing relationships between two continuous variables, helping to identify correlations.
Multivariable data analysis
Involves examining multiple variables simultaneously to understand relationships and dependencies.
Dependence method
A multivariate technique appropriate when one or more of the variables can be identified as dependent variable(s) and the remaining as independent variables
Interdependence method
Focuses on relationships among all variables without specifying dependent or independent ones.
Used for data reduction or structure detection.
Multiple linear regression
Models the relationship between one or more dependent variables and multiple independent (predictor) variables, where relationships are assumed to be linear.
Logistic regression
A dependence method used when the dependent variable (outcome) is categorical, typically binary (e.g., yes/no, success/failure, 0/1).
Its goal is classification.
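A minimal sketch of logistic regression for a binary outcome, assuming scikit-learn is installed (the feature and data are made up for illustration):
from sklearn.linear_model import LogisticRegression

X = [[25], [35], [45], [20], [50], [30]]   # hypothetical feature, e.g., hours studied
y = [0, 0, 1, 0, 1, 1]                     # binary outcome: fail/pass
model = LogisticRegression().fit(X, y)
print(model.predict([[40]]))        # predicted class (0 or 1)
print(model.predict_proba([[40]]))  # probability of each class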
Clean data enhances the reliability of analyses and the validity of insights drawn from data.
What is the importance of data cleaning?
Factors that can affect data quality
Duplicates
Incomplete datasets (missing values)
Inaccurate or incorrect data
Low-quality or unreliable data sources
Inconsistent data formats or entries
Outdated (untimely) data
Data entry errors
Data integration or merging issues
Lack of standardization (different units, formats)
Missing or unclear metadata/documentation
Human error during collection or processing
Technical issues (e.g., system failures or transmission errors)
Poor data governance or management practices
Lack of validation or verification checks
Ambiguous or undefined data definitions
Linear regression can be used in situations involving continuous numbers, such as predicting energy consumption.
Decision trees make predictions or classifications by splitting data into branches based on feature values; they can be used to diagnose diseases by classifying symptoms into possible illnesses.
k-means groups data points into clusters based on similarity and is used when there are no predefined labels, for example grouping cities by similar weather patterns.
Describe how data science algorithms are applied to real-world problems (e.g., linear regression, decision trees, k-means)
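A hedged sketch of all three algorithms on tiny made-up data, assuming scikit-learn is installed:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Linear regression: predict a continuous value (e.g., energy use).
reg = LinearRegression().fit([[1], [2], [3]], [10, 20, 30])
print(reg.predict([[4]]))                      # ~40

# Decision tree: classify from feature values (e.g., symptoms -> illness).
clf = DecisionTreeClassifier().fit([[1, 0], [0, 1], [1, 1]], ["flu", "cold", "flu"])
print(clf.predict([[1, 0]]))                   # ['flu']

# k-means: group unlabeled points into clusters (e.g., similar weather).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit([[1], [2], [10], [11]])
print(km.labels_)                              # two clusters of two points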
Selects all rows and columns (SELECT)
SELECT *
FROM employees;
Shows specific columns
SELECT first_name, last_name, department
FROM employees;
Filtering results to show only employees from the IT department (WHERE)
SELECT *
FROM employees
WHERE department = 'IT';
Using logical operators (AND) and comparison operators (>, =)
SELECT *
FROM employees
WHERE salary > 55000 AND department = 'IT';
Sorting results (ORDER BY)
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC;
Finding the average salary per department (AVG, AS, GROUP BY)
SELECT department, AVG(salary) AS average_salary
FROM employees
GROUP BY department;
Filtering groups (HAVING)
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department
HAVING COUNT(*) > 1;
Inserting new data (INSERT)
INSERT INTO employees (first_name, last_name, department, salary, hire_date)
VALUES ('Emily', 'Green', 'Sales', 52000, '2022-04-05');
Updating data (UPDATE)
UPDATE employees
SET salary = 65000
WHERE employee_id = 2;
Deleting data (DELETE FROM)
DELETE FROM employees
WHERE department = 'Sales';
NumPy (Numerical Python)
The foundational Python library for numerical and scientific computing.
Provides powerful multi-dimensional arrays (ndarray) for storing data.
Supports vectorized operations (fast element-wise math).
Includes linear algebra, random number generation, and Fourier transforms.
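A small sketch of these NumPy features (assumes NumPy is installed):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])       # 2-D ndarray
print(a * 2)                               # vectorized element-wise math
print(a.mean(axis=0))                      # column means
print(np.linalg.norm(a))                   # a linear algebra routine
print(np.random.default_rng(0).random(3))  # random number generation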
Pandas
Python library for data manipulation and analysis.
Uses Series (1D) and DataFrame (2D, like an Excel sheet) data structures
Used for: Cleaning, exploring, and analyzing structured data (spreadsheets, databases, CSVs).
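A small sketch of a Series and a DataFrame (assumes pandas is installed; the columns are made up):
import pandas as pd

s = pd.Series([10, 20, 30])                    # 1-D Series
df = pd.DataFrame({"name": ["Ana", "Ben"],     # 2-D DataFrame,
                   "salary": [52000, 61000]})  # like an Excel sheet
print(df.describe())             # quick summary of numeric columns
print(df[df["salary"] > 55000])  # filtering rows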
Matplotlib
Data visualization Python library
Core plotting library for Python.
Makes line plots, bar charts, histograms, etc.
Seaborn
Built on top of Matplotlib.
Adds aesthetic statistical plots with fewer lines of code.
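A brief sketch of the same data plotted with Matplotlib and then Seaborn (assumes both libraries are installed):
import matplotlib.pyplot as plt
import seaborn as sns

x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

plt.plot(x, y)                 # Matplotlib line plot
plt.xlabel("x")
plt.ylabel("y")
plt.show()

sns.scatterplot(x=x, y=y)      # Seaborn's styled statistical plot
plt.show()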
Scikit-learn
A traditional machine learning Python library
Implements algorithms for classification, regression, clustering, and dimensionality reduction.
Includes model selection, feature scaling, and evaluation metrics.
Used for: Building ML models like regression, decision trees, and k-means.
TensorFlow
A Python library for deep learning and large-scale numerical computation (by Google).
Supports neural networks, automatic differentiation, and GPU acceleration.
Often used with Keras, a high-level API for model building.
Used for: Neural networks, image recognition, NLP, and AI deployment.
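A minimal sketch of defining and compiling a small neural network with the Keras API (assumes TensorFlow is installed; the layer sizes are arbitrary):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # 4 input features
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),  # 3-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints the architecture; training would use model.fit(...)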
PyTorch
A Python library for deep learning and AI research (by Meta/Facebook).
Uses dynamic computation graphs (easy to debug).
Popular for research, computer vision, and NLP.
Used for: Custom AI model development and experimentation.
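A minimal PyTorch sketch of a tiny network and one forward pass (assumes torch is installed; sizes are arbitrary):
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16)   # input -> hidden
        self.fc2 = nn.Linear(16, 3)   # hidden -> 3-class output

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
print(net(torch.randn(2, 4)))  # forward pass on a random batch of 2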
Python is easy to understand, handles large and complex datasets with ease, integrates with machine learning, and has strong community support, all of which make it useful
Discuss the use of Python for cleaning and wrangling datasets
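A minimal cleaning and wrangling sketch with pandas (assumes pandas is installed; column names and values are hypothetical):
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ana", "ben ", None],
                   "salary": [52000, 52000, None, 47000]})
df = df.drop_duplicates()                                  # remove duplicate rows
df["salary"] = df["salary"].fillna(df["salary"].median())  # fill missing values
df = df.dropna(subset=["name"])                            # drop rows missing a key field
df["name"] = df["name"].str.strip().str.title()            # standardize text format
print(df)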
Has built-in statistical functions (e.g., regression, hypothesis testing).
Handles data cleaning, visualization, and modeling in one environment.
Offers thousands of packages for specialized analyses.
Integrates well with databases, big data systems, and Python.
Discuss the use of R for data science
Characteristics of relational databases
Data is organized in tables
There are relationships in that data
Uses SQL
Uses constraints to maintain data accuracy
Often normalized, or organized into multiple related tables to reduce redundancy
Supports user access control, encryption, and roles for data security
Can scale vertically or horizontally
Uses JOINs to combine data across related tables
Atomicity
Consistency
Isolation
Durability
What are the ACID properties?
Generative AI is an advanced form of artificial intelligence that learns from large datasets to create new and original content.
It blends creativity with computation, transforming how humans write, design, code, and communicate while raising important questions about ethics, originality, and human-AI collaboration.
Discuss the nature of generative AI
Capabilities of generative AI
Can create text, images, audio, video, and code from simple prompts.
Understands and responds to human language naturally.
Generates content tailored to user preferences or contexts (e.g., personalized ads, study materials, or health insights).
Assists in brainstorming ideas, designing logos, drafting stories, or generating new concepts that inspire human creativity.
Produces synthetic data to train or test other AI models when real data is limited.
Speeds up tasks such as coding, report writing, summarizing, or designing, saving time and resources.
Limitations of generative AI
AI models don't "think" or "understand" like humans — they generate output based on patterns, not reasoning or comprehension.
Can produce incorrect, fabricated, or misleading information (called AI hallucination).
If trained on biased data, the AI may reproduce or amplify stereotypes, discrimination, or misinformation.
Raises questions about authorship, copyright, and data privacy, since outputs may resemble copyrighted or personal content.
Quality of generated output is limited by the quality and diversity of the training data.
Struggles with long-term reasoning, emotions, or understanding real-world context beyond what's in the prompt.
Training and running large models require massive computing power and energy, which can be expensive and environmentally impactful.
Uses of Generative AI in the Real World
Creating study guides, quizzes, and explanations for students
Using tools like GitHub Copilot to suggest or generate code
Creating marketing slogans
Types of AI subfields
Computer Vision, Natural Language Processing, Human Interaction, Robotics, Machine Learning, and Expert Systems.
Large Language Models (LLMs)
AI models that learn from massive text datasets to produce human-like language and assist in communication, learning, and problem-solving.
Capabilities of LLMs
Natural language understanding
Text Generation
Summarization and Translation
Question Answering and Chatting
Coding and technical assistance
Information Retrieval and Reasoning
Personalization and Adaptation
It is data-driven, adaptive, predictive, iterative, and autonomous
What is the nature of machine learning?
Used to teach the model by showing examples of inputs and their corresponding outputs.
What is the use of training datasets for machine learning?
Used to tune the model and select the best version while training.
What is the use of validation datasets for machine learning?
Used to evaluate the final model's performance on completely unseen data.
What is the use of testing datasets for machine learning?
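A hedged sketch of a 60/20/20 train/validation/test split using scikit-learn (the proportions are a common convention, not a fixed rule):
from sklearn.model_selection import train_test_split

X = list(range(100))              # toy features
y = [v % 2 for v in X]            # toy labels
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20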
Machine learning algorithms
What works by identifying patterns, learning from data, and adapting internal parameters to make accurate predictions or decisions?
Neural networks
Inspired by the human brain, what part of machine learning consists of layers of interconnected nodes (neurons) that process information?
Supervised learning predicts outcomes from known labels.
Unsupervised learning uncovers hidden structures in data.
Reinforcement learning learns to make decisions by interacting with the environment and optimizing rewards.
How to tell the difference between unsupervised, supervised, and reinforcement learning algorithms
Supervised learning
If you have labeled data (inputs with known outputs), what machine learning algorithm would be the most appropriate one?
Unsupervised learning
If you have unlabeled data (needing to discover patterns and structure), what machine learning algorithm would be the most appropriate one?
Reinforcement learning
If it involves decision-making through trial and error in a dynamic environment, what machine learning algorithm would be the most appropriate one?
Deep learning
An AI approach where multi-layered neural networks learn complex patterns from data, enabling machines to perform tasks like image recognition, speech processing, and language understanding while automatically extracting features and making predictions.
Predicate logic in AI allows machines to represent facts, rules, and relationships, and reason about them to make intelligent decisions or derive new knowledge.
Explain how predicate logic is used in AI models
Examples of predicate logic
Representing Facts, Rules, and Relationships
Predicate Logic
A formal system used in AI to represent knowledge and reason about it.
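A toy sketch of predicate-logic-style facts and a rule written in plain Python (the parent/grandparent predicates are illustrative, not a library):
# Facts: parent(Ann, Bob), parent(Bob, Cal)
facts = {("parent", "Ann", "Bob"), ("parent", "Bob", "Cal")}

# Rule: grandparent(X, Z) <- parent(X, Y) AND parent(Y, Z)
def grandparents(facts):
    return {(x, z)
            for (p1, x, y1) in facts if p1 == "parent"
            for (p2, y2, z) in facts if p2 == "parent" and y1 == y2}

print(grandparents(facts))  # {('Ann', 'Cal')} -- new knowledge derived by reasoning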
Logic-based reasoning is precise and rule-driven (ideal for problems with clear facts), while probability-based reasoning handles uncertainty (making it suitable for real-world problems where information is incomplete or noisy)
What is the difference between logic based and probability based reasoning?
Bayesian Networks
Probabilistic graphical models that represent a set of variables and their conditional dependencies using a Directed Acyclic Graph (DAG). They are widely used in AI for reasoning under uncertainty.
Nodes
What component of a Bayesian network represents random variables in the domain and can be discrete or continuous?
Edges
What component of a Bayesian network consists of directed arrows that represent conditional dependencies between nodes?
Directed Acyclic Graph
What component of a Bayesian network is a graph with directed edges and no cycles, ensuring no variable depends on itself directly or indirectly?
Conditional Probability Tables (CPT)
What component of a Bayesian network quantifies the probability of a node given its parents?
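A tiny hand-rolled sketch of a two-node network, Rain -> WetGrass (all probabilities are made up for illustration):
p_rain = 0.2                           # prior at the Rain node
cpt_wet = {True: 0.9, False: 0.1}      # CPT: P(WetGrass | Rain)

# Marginal via the chain rule on the DAG:
p_wet = p_rain * cpt_wet[True] + (1 - p_rain) * cpt_wet[False]
# Bayes' rule gives reasoning under uncertainty: P(Rain | WetGrass)
p_rain_given_wet = p_rain * cpt_wet[True] / p_wet
print(round(p_wet, 3), round(p_rain_given_wet, 3))  # 0.26 0.692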
Nature of Knowledge Representation and Reasoning for AI
Focuses on how machines store, understand, and use knowledge to make intelligent decisions.
Reasoning for AI
How AI uses stored knowledge to draw conclusions or make decisions.
Possible dilemmas that can happen because of AI
Ethics, accountability, privacy, and bias
AI inherits bias through biased data, human assumptions, and algorithmic design choices, leading to unfair or discriminatory results.
How does AI inherit bias?
You need diverse data, transparent development, and ongoing monitoring.
What is needed to ensure fairness in AI?
Data Privacy Risks in LLMs
Training on Sensitive Data
Unintentional data leakage
Users sharing information that can be stored
Security Risks in LLMs
Inserting hidden instructions
Tricking models to reveal confidential information
Altering training data to produce false patterns
Copying the software
Credibility issues in AI
Hallucinations
Misinformation
Bias
Lack of transparency
Overconfidence
The nature of data science
The ability to transform raw data into meaningful insights using a mix of statistical analysis, programming, and critical thinking to guide decisions and innovations.
Structured data is neatly organized and easy to analyze, while unstructured data is more complex and varied, requiring advanced AI and machine learning methods to extract insights.
What is the difference between structured data and unstructured data?
Numeric data
What type of data uses measurable quantities, is used for mathematical calculations, and can be either an integer or a float (continuous)?
Categorical data
What type of data uses labels, names, or categories, is used to classify or group data, and can be nominal or ordinal?
Decimal to Binary
What conversion uses the method of repeatedly dividing the decimal number by 2, noting the remainders, and reading them bottom-up?
Binary to Decimal
What conversion uses the method of multiplying each bit by 2 raised to its position power, starting from 0 on the right, and then summing the results?
Decimal to Hexadecimal
What conversion uses the method of repeatedly dividing the decimal number by 16, noting the remainders (read bottom-up), and using letters A-F for values 10-15?
Hexadecimal to Decimal
What conversion uses the method of multiplying each digit by 16 raised to its position power, starting from 0 on the right, and then summing the results?
Binary to Hexadecimal
What conversion uses the method of grouping binary digits into groups of 4 bits starting from the right, then converting each group to its hex equivalent?
Hexadecimal to Binary
What conversion uses the method of converting each hex digit to its 4-bit binary equivalent?
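A quick sketch of these conversions in Python, with a manual version of the divide-by-2 method:
n = 45
print(bin(n))             # '0b101101'  decimal -> binary
print(hex(n))             # '0x2d'      decimal -> hexadecimal
print(int("101101", 2))   # 45          binary -> decimal
print(int("2D", 16))      # 45          hexadecimal -> decimal

# Manual decimal -> binary: divide by 2, read remainders bottom-up.
bits, m = [], n
while m:
    bits.append(m % 2)
    m //= 2
print("".join(str(b) for b in reversed(bits)))  # '101101'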
Structured data
Unstructured data
Semi-structured data
Sensor or machine data
Transactional data
Survey or observational data
What are the types of data that could be gathered from various sources?
It is important because it involves cleaning, organizing, and converting data into a usable format so it can be effectively analyzed or used by AI and machine learning models.
What is the importance of data wrangling and transformation?
First stage
What stage in the data science process is Data Collection: gathering data from various sources such as databases, surveys, sensors, or APIs to obtain relevant, high-quality data for the problem being studied?
Second stage
What stage in the data science process is Data Cleaning: removing errors, duplicates, and missing values and correcting inconsistencies to ensure the data is accurate, complete, and ready for analysis?
Third Stage
What stage in the data science process is Data Exploration and Analysis: examining data through visualization and statistical methods to identify patterns and trends and to understand the data's structure, relationships, and potential insights?
Fourth stage
What stage in the data science process is Data Transformation and Feature Engineering: converting data into usable formats and creating new variables (features) to prepare the data for modeling and enhance its predictive power?
Fifth Stage
What stage in the data science process is Modeling: applying machine learning or statistical algorithms to build models that can explain or predict outcomes?
Sixth Stage
What stage in the data science process is Evaluation: testing model accuracy on validation and test datasets to measure performance and ensure the model generalizes well to new data?
Seventh stage
What stage in the data science process is Deployment and Communication: implementing the model in real-world systems and sharing insights through reports or dashboards to support decision-making or automate processes?