DS-Platforms

Data Science Platforms

  • Berliner Hochschule für Technik

  • Instructor: Prof. Dr. P. Erdelt

  • Course: Data Science Platforms

  • Semester: WiSe 24/25

  • Last Change: September 9, 2024

Data Science Overview

Definition of Data Science Workflow

  • Data Science Workflow: A structured, iterative process for data analysis and knowledge generation, described by a common terminology and supported by software tools.

  • Key components of the workflow:

    1. Data Collection: Gathering raw data from sources such as databases, APIs, and sensors.

    2. Data Exploration: Analyzing data visually and statistically to discover patterns, anomalies, and insights.

    3. Data Preparation: Cleaning and transforming data so it is suitable for analysis, including handling missing values and normalizing datasets.

    4. Modeling: Applying statistical models and machine learning algorithms to analyze data and predict outcomes.

    5. Evaluation: Assessing model performance to determine accuracy and reliability.

    6. Deployment: Implementing the model in a real-world environment and keeping it effective as new data arrives.

Workflow Definition

  • Workflow:

    • An orchestrated and repeatable pattern of activities that organizes resources into processes to achieve a specific goal (a minimal sketch follows below).
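
To make the idea of an orchestrated, repeatable pattern concrete, the sketch below chains preparation, modeling, and evaluation into a single scikit-learn Pipeline. The synthetic data, the chosen steps, and the use of scikit-learn are illustrative assumptions, not part of the course material.

```python
# Minimal sketch of a repeatable data science workflow (assumes scikit-learn is available).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Data collection (here: synthetic data standing in for a real source).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Data preparation + modeling as one orchestrated, repeatable pipeline.
workflow = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # handle missing values
    ("scale", StandardScaler()),                  # normalize features
    ("model", LogisticRegression()),              # fit a predictive model
])

# Evaluation on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
workflow.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, workflow.predict(X_test)))
```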

Components of Data Science

What is Data Science?

  • Encompasses computer-based data analysis and knowledge generation.

  • Includes:

    • Business Intelligence

    • Business Analytics

    • Information Retrieval

    • Information Theory

    • Knowledge Discovery

    • Data Mining

    • Statistics

    • Machine Learning

Terminology

Business Intelligence
  • Business Intelligence is focused on:

    • Using IT and internal data.

    • Turning data into actionable insights to support management.

    • Key tools include ETL, Data Warehousing, and OLAP.

    • Major Questions:

      • What happened?

      • When?

      • How many?
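
As an illustration of these questions, the following minimal sketch aggregates a made-up sales table with pandas to answer "what happened, when, and how many". The table, column names, and library choice are assumptions for demonstration only.

```python
# Minimal OLAP-style aggregation sketch (assumes pandas; data is made up).
import pandas as pd

sales = pd.DataFrame({
    "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
    "product": ["A", "B", "A", "B"],
    "units":   [120, 80, 95, 130],
})

# "What happened? When? How many?" -> units sold per product and month.
report = sales.groupby(["month", "product"])["units"].sum()
print(report)
```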

Business Analytics
  • Business Analytics enhances business intelligence by incorporating:

    • Statistical Analysis

    • Data Mining

    • Predictive Modeling

    • Major Questions:

      • Why did it happen?

      • What will happen?
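
To illustrate the predictive side ("what will happen?"), here is a minimal sketch that fits a linear regression to a hypothetical monthly revenue series and extrapolates one step ahead; both the data and the choice of model are assumptions.

```python
# Minimal predictive-modeling sketch (assumes scikit-learn; data is hypothetical).
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 7).reshape(-1, 1)          # past six months
revenue = np.array([10.0, 11.2, 12.1, 13.0, 13.8, 15.1])

model = LinearRegression().fit(months, revenue)
print("forecast for month 7:", model.predict([[7]])[0])
```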

Additional Terms
  • Information Retrieval: Finding relevant material (usually unstructured) from large data collections.

  • Information Theory: Studies quantification, storage, and communication of information.

Knowledge Discovery Process

  1. Preparation: Define goals, collect domain knowledge, and gather data.

  2. Data Selection: Choose the relevant dataset.

  3. Data Preprocessing: Cleanse and prepare data.

  4. Data Transformation: Process data into a usable format.

  5. Data Mining: Extract patterns (this involves statistics).

  6. Interpretation: Understanding the results and applying them.
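
A minimal sketch of steps 2-4 (selection, preprocessing, transformation) on a tiny made-up table; the columns, the imputation rule, and the use of pandas are assumptions for illustration.

```python
# Minimal sketch of data selection, preprocessing, and transformation (assumes pandas).
import pandas as pd

raw = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "age":      [34, None, 51, 23],
    "spend":    [120.0, 80.0, None, 200.0],
})

selected = raw[["age", "spend"]]                               # data selection
cleaned = selected.fillna(selected.mean(numeric_only=True))    # preprocessing: impute missing values
transformed = (cleaned - cleaned.mean()) / cleaned.std()       # transformation: standardize
print(transformed)
```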

Data Mining

Definition

  • Data Mining: Involves methods for discovering non-trivial patterns in data.

    • Goal: Identify and explain relationships and patterns in large datasets.

Applications of Data Mining

  1. Regression Analysis: Predicting real values.

  2. Classification: Determining categorical membership.

  3. Cluster Analysis: Grouping similar data points.

  4. Association Analysis: Finding rules that describe large portions of data.
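
As one example of these tasks, the sketch below performs a cluster analysis on synthetic two-dimensional points with k-means; the data, the number of clusters, and the use of scikit-learn are assumptions.

```python
# Minimal cluster-analysis sketch (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),   # first group
    rng.normal(loc=(4, 4), scale=0.5, size=(50, 2)),   # second group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))
```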

Statistics

  • Descriptive Statistics summarizes the important characteristics of data.

  • Inferential Statistics helps make predictions about a population based on a sample.
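
The sketch below contrasts the two: descriptive statistics summarize a small made-up sample, while an inferential step derives a 95% confidence interval for the population mean; the sample values and the use of SciPy are assumptions.

```python
# Minimal descriptive vs. inferential statistics sketch (assumes scipy/numpy; data is made up).
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])

# Descriptive: summarize the sample itself.
print("mean:", sample.mean(), "std:", sample.std(ddof=1))

# Inferential: a 95% confidence interval for the population mean.
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI:", ci)
```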

Machine Learning

Definition

  • Machine Learning: A subset of artificial intelligence focused on developing algorithms that improve automatically through experience and data.

Types of Learning

  • Supervised Learning: Learning from labeled data (inputs/outputs).

  • Unsupervised Learning: Finding patterns in data without labeled inputs.

  • Reinforcement Learning: An agent learns to maximize rewards through its actions in an environment.
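
A minimal supervised-learning sketch: a nearest-neighbour classifier learns from labeled inputs and predicts the label of an unseen point. The toy data and the choice of classifier are assumptions for illustration.

```python
# Minimal supervised-learning sketch (assumes scikit-learn; toy data).
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # labeled inputs
y = [0, 0, 1, 1]                       # labels ("outputs")

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("predicted label for [0.9, 0.2]:", clf.predict([[0.9, 0.2]])[0])
```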

Evaluation in Data Science

Evaluation Metrics

  • Confusion Matrix: A table that describes the performance of a classification model.

    • Components include True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).

  • Key Metrics:

    • Accuracy: Ratio of correctly predicted instances to total instances.

    • Precision: Ratio of correctly predicted positive observations to the total predicted positives.

    • Recall: Ratio of correctly predicted positive observations to all actual positives (also known as Sensitivity).
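
These metrics follow directly from the confusion-matrix counts. The sketch below computes them for a made-up set of counts; the numbers are assumptions, the formulas are the standard definitions given above.

```python
# Accuracy, precision, and recall from confusion-matrix counts (counts are made up).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correct predictions over all predictions
precision = TP / (TP + FP)                   # correct positives over predicted positives
recall = TP / (TP + FN)                      # correct positives over actual positives (sensitivity)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```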

ROC Curve

  • ROC (Receiver Operating Characteristic) Curve: Illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

    • The area under the ROC curve (AUC) summarizes the classifier's ability to separate the classes in a single number (see the sketch below).
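
A minimal ROC/AUC sketch using scikit-learn: hypothetical classifier scores are compared against true labels as the threshold is swept. The labels, scores, and library choice are assumptions.

```python
# Minimal ROC/AUC sketch (assumes scikit-learn; labels and scores are hypothetical).
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per threshold
print("AUC:", roc_auc_score(y_true, y_score))
```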

Information Theory

Definition

  • Information Theory quantifies information, aiming to enhance data interpretation and transmission.

    • Involves concepts such as entropy and mutual information, which measure uncertainty and the amount of information one variable carries about another.

Key Concepts

  1. Entropy: The measure of uncertainty in a random variable.

  2. Mutual Information: A measure of the amount of information that knowing the value of one variable provides about another.
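
Both quantities can be computed directly for small discrete distributions. The sketch below evaluates the Shannon entropy of a biased coin and the mutual information of a tiny joint distribution; the probabilities are made-up examples.

```python
# Entropy and mutual information for small discrete distributions (probabilities are made up).
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Entropy of a biased coin.
print("H(coin):", entropy([0.9, 0.1]))

# Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) for a 2x2 joint distribution.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
mi = entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())
print("I(X;Y):", mi)
```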

Conclusion

  • Data Science combines knowledge from various fields including statistics, computer science, domain knowledge, and machine learning to analyze data and extract meaningful insights.
