Dataset
A collection of data used for analysis or experimentation.
Feature
An individual measurable property or characteristic of an observation, typically a column in a dataset.
Variable
A feature or attribute that can change or take different values.
Observation/Instance
A single row or data point in a dataset.
Label/Target
The variable being predicted or analyzed in a machine learning problem (often the output).
Descriptive Statistics
Techniques used to describe and summarize features of a dataset, like mean, median, variance, and standard deviation.
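These summary measures can be computed directly with Python's standard-library statistics module; the sample scores below are hypothetical:

```python
import statistics

scores = [82, 75, 93, 88, 75, 90]  # hypothetical sample of exam scores

mean = statistics.mean(scores)          # arithmetic average
median = statistics.median(scores)      # middle value of the sorted data
variance = statistics.variance(scores)  # sample variance
stdev = statistics.stdev(scores)        # sample standard deviation

print(mean, median, round(variance, 2), round(stdev, 2))
```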
Inferential Statistics
Methods that infer insights or make predictions about a larger population based on sample data.
Hypothesis Testing
A statistical method to test assumptions or hypotheses about a population parameter.
Correlation
The measure of the strength and direction of the relationship between two variables.
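The most common such measure, the Pearson correlation coefficient, can be computed from first principles; the paired samples here (hours studied vs. exam score) are made up for illustration:

```python
import math

x = [1, 2, 3, 4, 5]        # hypothetical: hours studied
y = [52, 60, 63, 71, 79]   # hypothetical: exam scores

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Pearson r = covariance / (product of standard deviations)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
var_x = sum((a - mean_x) ** 2 for a in x)
var_y = sum((b - mean_y) ** 2 for b in y)

r = cov / math.sqrt(var_x * var_y)  # always in [-1, 1]
print(round(r, 3))  # → 0.991 (strong positive correlation)
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.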
Supervised Learning
Machine learning approach where models learn from labeled data to make predictions or classifications.
Unsupervised Learning
Machine learning approach where models find patterns and structures in unlabeled data.
Feature Engineering
The process of creating new features or transforming existing ones to improve model performance.
Overfitting and Underfitting
Overfitting occurs when a model performs well on training data but poorly on new data; underfitting occurs when a model is too simple to capture the underlying patterns.
Cross-validation
Technique to assess the generalization performance of a model by splitting data into subsets for training and validation.
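A minimal sketch of the index-splitting step behind k-fold cross-validation (libraries such as scikit-learn's KFold do this, plus shuffling, for you):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        if fold == k - 1:
            stop = n_samples  # the last fold absorbs any remainder
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val

folds = list(k_fold_indices(10, 5))
print(folds[0])  # → ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Each observation is used for validation exactly once, so the model's performance estimate is less dependent on any single train/validation split.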
Exploratory Data Analysis (EDA)
Initial analysis to understand the dataset's main characteristics through visualizations and summary statistics.
Data Visualization
Presenting data graphically to communicate patterns, trends, and insights effectively.
Histogram
A graphical representation of the distribution of numerical data.
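In practice a plotting library (e.g. Matplotlib's plt.hist) draws the bars; the binning itself is simple counting, sketched here in plain Python over made-up data:

```python
from collections import Counter

data = [2.1, 3.5, 3.9, 4.2, 5.0, 5.1, 6.7, 7.3, 7.8, 9.9]  # hypothetical values
bin_width = 2.0
low = 2.0  # left edge of the first bin

# Assign each value to a bin index, then count occurrences per bin.
counts = Counter(int((v - low) // bin_width) for v in data)
for b in sorted(counts):
    left = low + b * bin_width
    print(f"[{left:.0f}, {left + bin_width:.0f}): {'#' * counts[b]}")
```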
Python/R
Programming languages commonly used for data manipulation, analysis, and machine learning.
Pandas
Python library for data manipulation and analysis.
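A small taste of the Pandas workflow, filtering rows and aggregating by group; the table below is hypothetical:

```python
import pandas as pd

# Hypothetical table: rows are observations, columns are features.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "temp_c": [4.0, 22.5, 6.1, 31.0],
})

cold = df[df["temp_c"] < 10]                      # boolean-mask row filtering
mean_by_city = df.groupby("city")["temp_c"].mean()  # group-wise aggregation

print(len(cold), mean_by_city["Oslo"])
```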
Scikit-learn
Python library providing machine learning algorithms and tools.
Jupyter Notebooks
Interactive environments for creating and sharing documents containing live code, visualizations, and narrative text.
Business Acumen
Ability to translate data insights into actionable business strategies and decisions, working closely with stakeholders to solve business problems.
Problem-solving
Strong analytical and problem-solving skills to tackle complex issues using data-driven approaches.
Communication Skills
Effective communication is vital to convey complex findings and insights to both technical and non-technical stakeholders.
Teamwork and Collaboration
Capability to work in multidisciplinary teams, collaborate with other professionals, and share knowledge effectively.
Curiosity and Continuous Learning
Given the evolving nature of technology and data science, a passion for learning and staying updated with new techniques and tools is essential.
Education
A bachelor's or master's degree in fields like computer science, statistics, mathematics, data science, or a related field. Some roles may require a Ph.D. for research-oriented positions.
Experience
Depending on the role, companies may seek candidates with a few years of relevant work experience in data analysis, machine learning, or a related field.
Python
Widely used for data analysis, machine learning, and statistical modeling. Libraries like Pandas, NumPy, SciPy, Matplotlib, and Scikit-learn are commonly used in Python.
R
Another popular language for statistical analysis, data manipulation, and visualization, with a wide range of packages like dplyr, ggplot2, and caret.
SQL (Structured Query Language)
Essential for managing and querying relational databases.
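SQL can be practised without installing a database server by using Python's built-in sqlite3 module with an in-memory database; the sales table here is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0)],
)

# A typical query: aggregate per group, ordered by the aggregate.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC"
).fetchall()
print(rows)  # → [('north', 180.0), ('south', 80.0)]
conn.close()
```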
Pandas
Python library for data manipulation and analysis, offering data structures and tools for cleaning and preprocessing.
dplyr
R package for data manipulation tasks like filtering, summarizing, and transforming data.
Apache Hadoop
Framework for distributed storage and processing of large datasets.
Apache Spark
Provides a fast and general-purpose cluster computing system for big data processing.
Scikit-learn
Python library offering various machine learning algorithms and tools for modeling and evaluation.
TensorFlow and Keras
Libraries for building and training neural networks and deep learning models.
PyTorch
Another deep learning framework used for building neural network architectures.
Matplotlib
Python library for creating static, animated, and interactive visualizations, including basic 3D plots.
Seaborn
Built on top of Matplotlib, Seaborn provides more visually appealing statistical graphics.
ggplot2
R package for creating elegant and complex data visualizations.
Tableau
User-friendly platform for data visualization and analytics.
Power BI
Microsoft's business analytics tool for visualizing and sharing insights from data.
QlikView/Qlik Sense
Platforms for data visualization, business intelligence, and data discovery.
OpenRefine
Tool for cleaning and transforming messy data.
Trifacta
Platform for data wrangling and preparation tasks.
Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Cloud services offering various tools and resources for data storage, processing, and analysis.
Descriptive Questions
What are the key characteristics or trends in the dataset? How is the data distributed across different categories or groups? What are the summary statistics for the variables in the dataset?
Diagnostic Questions
What factors are contributing to a particular outcome or phenomenon? Are there any anomalies, outliers, or patterns that need further investigation? What is the root cause of a specific problem in the dataset?
Predictive Questions
Can we predict future outcomes based on historical data? What variables are most predictive of a certain event or outcome? How accurate are our predictions using different models or algorithms?
Prescriptive Questions
What actions or interventions can be recommended based on predictive models? How can we optimize a process or system to achieve better outcomes? What changes can be made to improve a specific metric or result?
Exploratory Questions
Are there any hidden patterns or relationships in the data? Can we identify clusters or groups within the dataset? What variables are most correlated with each other?
Causal Questions
What is the cause-and-effect relationship between variables? Can we establish causation based on observational or experimental data? How does changing one variable affect another in the dataset?
Comparative Questions
How do different groups or categories in the dataset compare to each other? What are the differences or similarities between subsets of the data? Are there significant differences in outcomes between different treatments or conditions?
Primary Sources
Data collected firsthand for a specific purpose. It includes surveys, experiments, observations, interviews, and focus groups.
Secondary Sources
Data that already exists and is collected by someone else for their own purposes. This includes books, articles, official records, databases, and previously conducted research.
Tertiary Sources
Compilations or summaries of primary and secondary sources, such as encyclopedias, textbooks, and bibliographies.
Predictive Analysis
Using historical data to forecast or predict future outcomes or trends.
Diagnostic Analysis
Identifying reasons behind certain outcomes or patterns by investigating cause-and-effect relationships in data.
Prescriptive Analysis
Recommending actions or strategies based on analysis to optimize or improve future outcomes.
Variables
Containers for storing data values in programming languages.
Functions
Reusable blocks of code that perform specific tasks.
Control Structures
Statements that determine the flow of execution in a program.
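The three entries above (variables, functions, control structures) fit in a few lines of Python; the temperature thresholds are arbitrary:

```python
def classify(temperature_c):
    """Return a label for a temperature reading (a function)."""
    if temperature_c < 0:        # control structure: branching
        return "freezing"
    elif temperature_c < 25:
        return "mild"
    return "hot"

reading = 18.5                   # variable: a named container for a value
labels = []
for t in [-3.0, reading, 30.0]:  # control structure: looping
    labels.append(classify(t))

print(labels)  # → ['freezing', 'mild', 'hot']
```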
Data Types
Categories of data that determine the kind of values that can be stored and manipulated.
Modules
Files containing Python code that can be imported and used in other programs.
Libraries
Collections of modules that provide additional functionality for specific tasks.
Lists
Ordered collections of items in programming languages.
Tuples
Immutable ordered collections of items in programming languages.
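A quick contrast of the two entries above in Python (the sample values are arbitrary):

```python
# Lists are ordered and mutable; tuples are ordered and immutable.
readings = [3.2, 4.1, 2.8]   # list: can grow and change in place
readings.append(5.0)
readings[0] = 3.3

point = (59.91, 10.75)       # tuple: fixed once created (e.g. coordinates)
lat, lon = point             # tuples support unpacking into variables

print(readings, lat, lon)
```

Attempting `point[0] = 0.0` would raise a TypeError, which is exactly the immutability guarantee tuples provide.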
File Handling
Manipulating files in a program, such as reading from or writing to files.
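A minimal write-then-read round trip; a temporary directory is used so the example cleans up after itself:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'with' closes the file automatically, even if an error occurs.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\n")
    f.write("second line\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)  # → ['first line', 'second line']
```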
Dictionaries
Key-value pairs used to store and retrieve data in programming languages.
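Dictionaries in action, with made-up population figures:

```python
# A dictionary maps keys to values for fast lookup by key.
population = {"Oslo": 709_000, "Lima": 10_092_000}  # hypothetical figures

population["Pune"] = 7_400_000          # insert or update by key
oslo = population.get("Oslo", 0)        # safe lookup with a default
has_lima = "Lima" in population         # membership test on keys

print(oslo, has_lima, len(population))  # → 709000 True 3
```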
Plotting
Creating visual representations of data using graphs or charts.
Data Manipulation
Modifying or transforming data to make it suitable for analysis.
Visualization
Presenting data in a visual format to gain insights or communicate information effectively.
Data Cleaning
Preprocessing data by handling missing values, duplicates, outliers, and normalizing or standardizing data.
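A tiny cleaning pass over hypothetical records, covering three of the steps named above: dropping missing values, removing duplicates, and clipping an outlier to a plausible range:

```python
raw = [
    {"name": "a", "age": 34},
    {"name": "b", "age": None},   # missing value
    {"name": "a", "age": 34},     # exact duplicate
    {"name": "c", "age": 430},    # outlier (perhaps a typo for 43)
]

# 1. Drop records with missing values.
complete = [r for r in raw if r["age"] is not None]

# 2. Remove duplicates while preserving order.
seen, deduped = set(), []
for r in complete:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Clip ages into a plausible range rather than guessing the true value.
for r in deduped:
    r["age"] = min(max(r["age"], 0), 120)

print(deduped)
```

With larger tables the same steps are usually done with Pandas (dropna, drop_duplicates, clip).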
Model Accuracy
Techniques to improve the accuracy of a predictive model, such as cross-validation, ensemble methods, and increasing the quantity and quality of data.
Data Formats
Different ways in which data can be structured and represented, such as CSV, Excel, or JSON.
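CSV and JSON round trips are both covered by the standard library; the record below is invented:

```python
import csv
import io
import json

record = {"name": "sensor-1", "reading": 21.5}  # hypothetical record

# JSON: nested, text-based, common for APIs and config.
as_json = json.dumps(record)
back = json.loads(as_json)

# CSV: flat rows and columns, common for spreadsheets and exports.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "reading"])
writer.writeheader()
writer.writerow(record)
csv_text = buf.getvalue()

print(as_json)
print(csv_text)
```

Excel files need a third-party reader (e.g. openpyxl, or pandas.read_excel), since the format is binary rather than plain text.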
Data Science Applications
Examples of how data science can be used in different domains, such as student performance analysis in education or audience insights in the movie industry.
Python Skills
Proficiency in the Python programming language for data analysis and modeling.
Online Platforms
Tools like Jupyter Notebook or Google Colab for creating and running data science models.
Importing Datasets
Steps to import datasets into a data science environment using tools like Google Colab.
Creating Datasets
Generating or creating custom datasets for analysis using existing data sources.
Data Visualization
Techniques for visualizing data, such as box plots, histograms, and pie charts.
Necessary Libraries
Key libraries in Python for data analysis and modeling, such as pandas, scikit-learn, geopandas, and matplotlib.
Exporting Work
Methods to save or export the results or outputs of data analysis or modeling tasks.