DATA SCIENCE Summary

85 Terms

1

Dataset

A collection of data used for analysis or experimentation.

2

Feature

An individual measurable property or characteristic of the observations in a dataset, typically stored as one column.

3

Variable

A feature or attribute that can change or take different values.

4

Observation/Instance

A single row or data point in a dataset.

5

Label/Target

The variable being predicted or analyzed in a machine learning problem (often the output).

6

Descriptive Statistics

Techniques used to describe and summarize features of a dataset, like mean, median, variance, and standard deviation.
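
As a minimal sketch of these summary measures, the example below uses Python's standard `statistics` module. The `scores` list is made-up illustration data, and the population forms of variance and standard deviation (`pvariance`, `pstdev`) are used; a sample estimate would use `variance` and `stdev` instead.

```python
import statistics

# Hypothetical data: eight made-up scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value of the sorted data
variance = statistics.pvariance(scores)   # population variance
std_dev = statistics.pstdev(scores)       # population standard deviation

print(mean, median, variance, std_dev)
```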

7

Inferential Statistics

Methods that infer insights or make predictions about a larger population based on sample data.

8

Hypothesis Testing

A statistical method to test assumptions or hypotheses about a population parameter.
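
One possible illustration is an exact one-sided binomial test of whether a coin is fair. The function name `binomial_p_value`, the observed counts, and the 0.05 significance level are assumptions chosen for this sketch, not part of the card.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 9 heads in 10 flips.
# Null hypothesis: the coin is fair (probability of heads is 0.5).
p_value = binomial_p_value(successes=9, trials=10)

# Reject the null hypothesis if the p-value is below the chosen level.
reject_null = p_value < 0.05
print(p_value, reject_null)
```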

9

Correlation

A measure of the strength and direction of the relationship between two variables.
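
A common way to quantify this is the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The sketch below computes it by hand; the helper name `pearson_r` and the `hours`/`scores` data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
r = pearson_r(hours, scores)
print(round(r, 3))
```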

10

Supervised Learning

Machine learning approach where models learn from labeled data to make predictions or classifications.

11

Unsupervised Learning

Machine learning approach where models find patterns and structures in unlabeled data.

12

Feature Engineering

The process of creating new features or transforming existing ones to improve model performance.
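
A small sketch of both ideas, assuming hypothetical records with `height_m`, `weight_kg`, and `city` fields: a new numeric feature (body mass index) is derived from two existing ones, and a categorical feature is transformed into one-hot indicator columns.

```python
# Hypothetical raw records.
records = [
    {"height_m": 1.60, "weight_kg": 64.0, "city": "Lima"},
    {"height_m": 2.00, "weight_kg": 80.0, "city": "Oslo"},
]

# All distinct categories, in a stable order.
cities = sorted({r["city"] for r in records})

def engineer(record):
    """Return a new feature dict built from one raw record."""
    # New feature: combine two existing numeric features.
    features = {"bmi": record["weight_kg"] / record["height_m"] ** 2}
    # Transformed feature: one-hot encode the categorical column.
    for city in cities:
        features[f"city_{city}"] = 1 if record["city"] == city else 0
    return features

engineered = [engineer(r) for r in records]
print(engineered)
```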

13

Overfitting and Underfitting

Overfitting occurs when a model performs well on training data but poorly on new data; underfitting occurs when a model is too simple to capture the underlying patterns.

14

Cross-validation

Technique to assess the generalization performance of a model by splitting data into subsets for training and validation.
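
One widely used variant is k-fold cross-validation, where each subset (fold) serves once as the validation set while the rest is used for training. The sketch below only produces the index splits; `k_fold_indices` is a hypothetical helper, and it does not shuffle, which real code would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would be fitted on each `train` split and scored on the matching `validation` split, and the scores averaged.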

15

Exploratory Data Analysis (EDA)

Initial analysis to understand the dataset's main characteristics through visualizations and summary statistics.

16

Data Visualization

Presenting data graphically to communicate patterns, trends, and insights effectively.

17

Histogram

A graphical representation of the distribution of numerical data that groups values into bins and shows how many values fall into each bin.
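
A rough text-only sketch of the idea: count how many values fall between consecutive bin edges, then draw one bar per bin. The helper `histogram_counts`, the `ages` data, and the bin edges are assumptions made for illustration.

```python
def histogram_counts(values, bin_edges):
    """Count values per bin; bins are [left, right), the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts

# Hypothetical data.
ages = [21, 22, 25, 31, 34, 38, 39, 45, 50]
edges = [20, 30, 40, 50]
counts = histogram_counts(ages, edges)

# Draw one text bar per bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left}-{right}: {'#' * count}")
```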

18

Python/R

Programming languages commonly used for data manipulation, analysis, and machine learning.

19

Pandas

Python library for data manipulation and analysis.

20

Scikit-learn

Python library providing machine learning algorithms and tools.

21

Jupyter Notebooks

Interactive environments for creating and sharing documents containing live code, visualizations, and narrative text.

22

Business Acumen

Ability to translate data insights into actionable business strategies and decisions, working closely with stakeholders to solve business problems.

23

Problem-solving

Strong analytical and problem-solving skills to tackle complex issues using data-driven approaches.

24

Communication Skills

Ability to convey complex findings and insights clearly to both technical and non-technical stakeholders.

25

Teamwork and Collaboration

Capability to work in multidisciplinary teams, collaborate with other professionals, and share knowledge effectively.

26

Curiosity and Continuous Learning

Given the evolving nature of technology and data science, a passion for learning and staying updated with new techniques and tools is essential.

27

Education

A bachelor's or master's degree in computer science, statistics, mathematics, data science, or a related field. Some research-oriented positions may require a Ph.D.

28

Experience

Depending on the role, companies may seek candidates with a few years of relevant work experience in data analysis, machine learning, or a related field.

29

Python

Widely used for data analysis, machine learning, and statistical modeling. Commonly used libraries include Pandas, NumPy, SciPy, Matplotlib, and Scikit-learn.

30

R

Another popular language for statistical analysis, data manipulation, and visualization, with a wide range of packages like dplyr, ggplot2, and caret.

31

SQL (Structured Query Language)

Essential for managing and querying relational databases.
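
As a brief sketch, the example below runs SQL from Python using the standard `sqlite3` module and an in-memory database. The `sales` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Manage: define a table and insert rows.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# Query: total amount per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```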

32

Pandas

Python library for data manipulation and analysis, offering data structures and tools for cleaning and preprocessing.

33

dplyr

R package for data manipulation tasks like filtering, summarizing, and transforming data.

34

Apache Hadoop

Framework for distributed storage and processing of large datasets.

35

Apache Spark

Fast, general-purpose cluster computing system for big data processing.

36

Scikit-learn

Python library offering various machine learning algorithms and tools for modeling and evaluation.

37

TensorFlow and Keras

Libraries for building and training neural networks and deep learning models.

38

PyTorch

Another deep learning framework used for building neural network architectures.

39

Jupyter Notebooks

Interactive environments for creating and sharing documents containing live code, visualizations, and narrative text.

40

Matplotlib

Python library for creating static, interactive, and 3D visualizations.

41

Seaborn

Built on top of Matplotlib, Seaborn provides more visually appealing statistical graphics.

42

ggplot2

R package for creating elegant and complex data visualizations.

43

Tableau

User-friendly platform for data visualization and analytics.

44

Power BI

Microsoft's business analytics tool for visualizing and sharing insights from data.

45

QlikView/Qlik Sense

Platforms for data visualization, business intelligence, and data discovery.

46

OpenRefine

Tool for cleaning and transforming messy data.

47

Trifacta

Platform for data wrangling and preparation tasks.

48

Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Cloud services offering various tools and resources for data storage, processing, and analysis.

49

Descriptive Questions

What are the key characteristics or trends in the dataset? How is the data distributed across different categories or groups? What are the summary statistics for the variables in the dataset?

50

Diagnostic Questions

What factors are contributing to a particular outcome or phenomenon? Are there any anomalies, outliers, or patterns that need further investigation? What is the root cause of a specific problem in the dataset?

51

Predictive Questions

Can we predict future outcomes based on historical data? What variables are most predictive of a certain event or outcome? How accurate are our predictions using different models or algorithms?

52

Prescriptive Questions

What actions or interventions can be recommended based on predictive models? How can we optimize a process or system to achieve better outcomes? What changes can be made to improve a specific metric or result?

53

Exploratory Questions

Are there any hidden patterns or relationships in the data? Can we identify clusters or groups within the dataset? What variables are most correlated with each other?

54

Causal Questions

What is the cause-and-effect relationship between variables? Can we establish causation based on observational or experimental data? How does changing one variable affect another in the dataset?

55

Comparative Questions

How do different groups or categories in the dataset compare to each other? What are the differences or similarities between subsets of the data? Are there significant differences in outcomes between different treatments or conditions?

56

Primary Sources

Data collected firsthand for a specific purpose. It includes surveys, experiments, observations, interviews, and focus groups.

57

Secondary Sources

Data that already exists and is collected by someone else for their own purposes. This includes books, articles, official records, databases, and previously conducted research.

58

Tertiary Sources

Compilations or summaries of primary and secondary sources.

59

Predictive Analysis

Using historical data to forecast or predict future outcomes or trends.

60

Diagnostic Analysis

Identifying reasons behind certain outcomes or patterns by investigating cause-and-effect relationships in data.

61

Prescriptive Analysis

Recommending actions or strategies based on analysis to optimize or improve future outcomes.

62

Variables

Containers for storing data values in programming languages.

63

Functions

Reusable blocks of code that perform specific tasks.

64

Control Structures

Statements that determine the flow of execution in a program.
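
A short sketch combining a function with two kinds of control structure, a conditional (`if`/`elif`) and a loop (`for`). The function name `classify` and the grade cutoffs are hypothetical.

```python
def classify(score):
    """Return a letter grade for a numeric score (hypothetical cutoffs)."""
    # Conditional: only one branch runs for a given score.
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    return "C"

grades = []
# Loop: repeat the same steps for every item.
for score in [95, 83, 60]:
    grades.append(classify(score))

print(grades)
```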

65

Data Types

Categories of data that determine the kind of values that can be stored and manipulated.

66

Modules

Files containing Python code that can be imported and used in other programs.

67

Libraries

Collections of modules that provide additional functionality for specific tasks.

68

Lists

Ordered, mutable collections of items in programming languages.

69

Tuples

Immutable ordered collections of items in programming languages.
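
The difference between a Python list and a tuple can be sketched as follows; the variable names are made up for illustration.

```python
# A list is ordered and can be changed after creation.
scores = [70, 85]
scores.append(90)

# A tuple is ordered but cannot be changed after creation.
point = (3, 4)
try:
    point[0] = 10          # item assignment is not supported
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(scores, point, tuple_changed)
```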

70

File Handling

Manipulating files in a program, such as reading from or writing to files.
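
A minimal sketch of writing and then reading a text file in Python. The file name `notes.txt` is hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")

    # Write: mode "w" creates the file or overwrites an existing one.
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    # Read: iterate over the file line by line.
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

print(lines)
```

The `with` statement closes each file automatically, even if an error occurs inside the block.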

71

Dictionaries

Collections of key-value pairs used to store and retrieve data by key in programming languages.
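
For instance, a Python dictionary can count how often each word occurs, with the word as the key and the count as the value. The sentence used here is arbitrary.

```python
word_counts = {}
for word in "to be or not to be".split():
    # get() returns the current count, or 0 if the key is not present yet.
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
print(word_counts["to"])   # look up a value by its key
```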

72

Plotting

Creating visual representations of data using graphs or charts.

73

Data Manipulation

Modifying or transforming data to make it suitable for analysis.

74

Visualization

Presenting data in a visual format to gain insights or communicate information effectively.

75

Data Cleaning

Preprocessing data by handling missing values, duplicates, outliers, and normalizing or standardizing data.
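
The sketch below walks through three of these steps on made-up records: dropping duplicates, filling missing values, and standardizing. Filling with the mean is only one possible choice, and outlier handling is left out to keep the example short.

```python
import statistics

# Hypothetical raw records with a duplicate row and a missing age.
raw = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 2, "age": None},
    {"id": 3, "age": 50},
]

# Step 1: drop exact duplicates, keeping the first occurrence.
seen = set()
deduped = []
for row in raw:
    key = (row["id"], row["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(dict(row))

# Step 2: fill missing ages with the mean of the observed ages.
observed = [r["age"] for r in deduped if r["age"] is not None]
fill_value = statistics.mean(observed)
for r in deduped:
    if r["age"] is None:
        r["age"] = fill_value

# Step 3: standardize to z-scores (mean 0, standard deviation 1).
ages = [r["age"] for r in deduped]
mu = statistics.mean(ages)
sigma = statistics.pstdev(ages)
z_scores = [(a - mu) / sigma for a in ages]

print(deduped)
print(z_scores)
```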

76

Model Accuracy

How often a predictive model's predictions are correct. Accuracy can be estimated more reliably with cross-validation and improved with ensemble methods or by increasing the quantity and quality of data.

77

Data Formats

Different ways in which data can be structured and represented, such as CSV, Excel, or JSON.
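
To make the difference concrete, the sketch below reads the same made-up records from CSV text and re-encodes them as JSON using Python's standard `csv` and `json` modules.

```python
import csv
import io
import json

# Hypothetical CSV content: a header row followed by data rows.
csv_text = "name,score\nAda,90\nLin,85\n"

# CSV -> list of dicts, one per row (every value is read as a string).
rows = list(csv.DictReader(io.StringIO(csv_text)))

# List of dicts -> JSON text, and back again.
json_text = json.dumps(rows)
round_trip = json.loads(json_text)

print(json_text)
```

Note that CSV carries no type information, so `"90"` stays a string unless it is converted explicitly.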

78

Data Science Applications

Examples of how data science can be used in different domains, such as student performance analysis in education or audience insights in the movie industry.

79

Python Skills

Proficiency in the Python programming language for data analysis and modeling.

80

Online Platforms

Tools like Jupyter Notebook or Google Colab for creating and running data science models.

81

Importing Datasets

Steps to import datasets into a data science environment using tools like Google Colab.

82

Creating Datasets

Generating or creating custom datasets for analysis using existing data sources.

83

Data Visualization

Techniques for visualizing data, such as box plots, histograms, and pie charts.

84

Necessary Libraries

Key libraries in Python for data analysis and modeling, such as pandas, scikit-learn, geopandas, and matplotlib.

85

Exporting Work

Methods to save or export the results or outputs of data analysis or modeling tasks.
