data unit
In a database, a ___ is called a record (also known as a row or tuple). It is a single, structured data item that is stored in a table.
Data item
It is the equivalent of a column in a spreadsheet, while a dataset is the equivalent of a worksheet.
dataset
A ____ is a collection of related data units and information that is composed of separate elements but can be manipulated as a unit by a computer. This set is normally presented in a tabular pattern. It is also known as a table in a database management system.
Data Warehousing
It integrates data and information collected from various sources into one comprehensive database.
ETL (Extract, Transform, Load)
It is the process of extracting data from various sources, transforming it into a consistent format, and loading it into a data warehouse or another storage system for analysis. ___ is a process used in data integration and data warehousing.
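The three ETL steps can be sketched in a few lines. This is only an illustrative sketch: the CSV text, the `sales` table, and the function names are assumptions made up for the example, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read files, APIs, or other databases.
RAW_CSV = """name,amount
 alice ,10.50
BOB,3
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and title-case names, convert amounts to floats."""
    return [(row["name"].strip().title(), float(row["amount"])) for row in rows]

def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(rows)
```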
Data Lake
It is a centralized repository that allows organizations to store all their structured and unstructured data at any scale.
Data visualization
It is the graphical representation of data to facilitate understanding, analysis, and interpretation. It involves presenting data in visual formats such as charts, graphs, maps, and dashboards to communicate complex information clearly and effectively.
Data mining
It is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information.
Machine learning
It is a branch of artificial intelligence (AI) and computer science which focuses on the development of algorithms and statistical models that enable computers to learn and improve their performance on a specific task without being explicitly programmed.
Big data
It is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.
Data Analytics
____ is a wider concept of which data analysis is just one part. ___ is the broad field of using data and tools to make business decisions. Data analysis, a subset of _____, refers to specific actions.
Data science
It combines mathematics and statistics, scientific methods, algorithms, specialized programming (information technology), advanced analytics, artificial intelligence (AI), and machine learning (ML) with specific subject matter expertise to uncover actionable insights hidden in an organization's data.
Deep learning
It is a subset of machine learning, which itself is a subset of artificial intelligence (AI). It involves using neural networks with many layers (hence "deep") to model and understand complex patterns in data.
Artificial intelligence (AI)
It is the simulation of human intelligence processes by machines, especially computer systems.
Internet of Things (IoT)
The ____ describes the network of physical objects ("things") that are embedded with sensors, software, and other technologies for the purpose of connecting and exchanging data with other devices and systems over the internet.
ChatGPT
It is an artificial intelligence (AI) chatbot that uses natural language processing to create humanlike conversational dialogue.
Spreadsheet
It is a computer application or program used to organize, display, analyze, compute, manipulate, and store data in a tabular format, typically presented in rows and columns. The most popular spreadsheet applications today are Microsoft Excel, Google Sheets, Apple Numbers, and LibreOffice Calc.
Database Management System
This, often abbreviated as DBMS, is a software system that enables users to define, create, maintain, manipulate, and manage databases. Some of the most popular large-scale _____(s) are Microsoft SQL Server, MySQL, Oracle, Teradata, DB2 (Mainframe), and Adabas (Mainframe).
Databases
These store structured data in a format optimized for efficient storage, retrieval, and manipulation. Structured data is typically stored in tabular form and managed by a relational database management system (RDBMS).
Data Visualization Tools
These are software applications or platforms that allow users to create visual representations of data. Some popular data visualization tools are Microsoft Power BI, Tableau, Google Data Studio (now Looker Studio), and QlikView.
Various approaches to data analytics include
a. looking at what happened (descriptive analytics)
b. why something happened (diagnostic analytics)
c. what is going to happen (predictive analytics), or
d. what should be done next (prescriptive analytics).
Various approaches to data analytics include
a. looking at what happened (?)
b. why something happened (?)
c. what is going to happen (?), or
d. what should be done next (?).
Descriptive analytics
It involves analyzing historical data to understand and examine what happened in the past. It focuses on summarizing and visualizing data to provide insights into trends, patterns, and relationships.
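A minimal sketch of descriptive analytics: summarizing historical values with a few standard statistics. The `monthly_sales` figures are hypothetical and exist only for illustration.

```python
import statistics

# Hypothetical historical figures, one value per month.
monthly_sales = [120, 135, 150, 160, 155, 170]

# Summarize what happened: size, central tendency, and range.
summary = {
    "count": len(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}
print(summary)
```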
Diagnostic analytics
It helps explain why things happened the way they did. It's a more complex version of descriptive analytics, extending beyond what happened to why it happened.
Diagnostic analytics
It involves digging deeper into historical data to understand why certain events occurred.
Predictive analytics
It aims to predict likely outcomes and make educated forecasts using historical data. Simply put, it seeks to answer the question, "What will happen?" _____ uses probabilities instead of simply interpreting existing facts.
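As a rough illustration of forecasting from historical data, the sketch below fits a straight trend line by ordinary least squares and extends it one period ahead. The data and the `fit_line` helper are hypothetical, and a trend line gives only a point forecast; real predictive models usually also quantify uncertainty.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # projected sales for month 6
print(forecast)
```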
Prescriptive analytics
It is the use of advanced processes and tools to analyze data and content to recommend the optimal course of action or strategy moving forward. _____ is the most advanced of the four types of data analytics.
Predicting future trends
Optimizing business operations
Enhancing decision-making
Transforming raw data
The primary goals of data analytics are:
Cloud Computing
It is the delivery of different services through the Internet. These resources include tools and applications like data storage, servers, databases, networking, and software. Database as a Service (DBaaS) is a _____ service model that provides users with access to managed database services over the internet. Some of the leading cloud service providers are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Business Intelligence
It is about descriptive and diagnostic analytics while Business Analytics is about predictive and prescriptive analytics. ____ and Business Analytics (BA) are both subsets of Data Analytics.
Structured Query Language (SQL)
It is a standard language for managing and manipulating relational databases. Mastering ___ empowers data analysts to derive insights from large datasets and optimize the performance of data-related operations.
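For a feel of what a SQL query looks like, here is a small sketch that runs one through Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ana", 20.0), ("Ben", 15.0), ("Ana", 5.0)],
)

# Aggregate order amounts per customer, largest total first.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(totals)
```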
Data quality
determines the usability and trustworthiness of data.
quality data
The characteristics of ____ are validity, accuracy, completeness, consistency, timeliness, relevance, and reliability.
Data quality
____ issues include incomplete, duplicate, outdated, insecure, inaccurate, incorrect, inconsistent, and outlier data.
Data Validity
refers to the degree to which your data conforms to defined business rules or constraints.
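One way to picture validity is as a set of rules each field must pass. The rules below (an age range and a loose email pattern) are hypothetical examples of business rules, not a standard.

```python
import re

# Hypothetical business rules: each maps a field name to a check.
RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def invalid_fields(record):
    """Return the names of fields that break a rule (missing fields count as invalid)."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

good = invalid_fields({"age": 34, "email": "ana@example.com"})
bad = invalid_fields({"age": 150, "email": "not-an-email"})
print(good, bad)
```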
Data Accuracy
ensures that your data is close to the true values.
Data consistency
refers to the degree to which data values agree with one another and do not conflict. _____ ensures your data is stable within the same data set and/or across multiple data sets. ____ occurs when aggregated data is reconciled with detailed data at lower levels of granularity.
Data Uniformity
refers to the degree to which data is specified using the same unit of measure.
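A small sketch of enforcing uniformity: values recorded in mixed units are converted to one unit of measure. The sample weights and the `to_kilograms` helper are hypothetical; the pound factor is the exact definition (1 lb = 0.45359237 kg).

```python
# Conversion factors to kilograms.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def to_kilograms(value, unit):
    """Convert a weight in a known unit to kilograms."""
    return value * TO_KG[unit]

# Hypothetical measurements recorded in mixed units.
weights = [(2, "kg"), (500, "g"), (1, "lb")]
uniform = [round(to_kilograms(v, u), 3) for v, u in weights]
print(uniform)
```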
Data duplicate
also known as data redundancy, occurs when the same information is entered multiple times, sometimes in different formats. It can be avoided by implementing record validation checks within a program to ensure that a record does not already exist before it is added to a dataset or database.
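A record validation check of this kind might look like the sketch below. It assumes, purely for illustration, that a normalized email address identifies a record; the field names and helpers are hypothetical.

```python
def normalize_key(record):
    """Build a comparison key so format differences do not hide duplicates."""
    return record["email"].strip().lower()

def add_record(dataset, seen_keys, record):
    """Add the record only if no record with the same key exists yet."""
    key = normalize_key(record)
    if key in seen_keys:
        return False  # duplicate: rejected
    seen_keys.add(key)
    dataset.append(record)
    return True

dataset, seen = [], set()
results = [
    add_record(dataset, seen, {"name": "Ana", "email": "ana@example.com"}),
    # Same person entered again in a different format.
    add_record(dataset, seen, {"name": "ANA", "email": " Ana@Example.com "}),
]
print(results, len(dataset))
```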
Outlier data
refers to observations or data points that differ significantly from the other values in your data set.
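One common way to flag outliers (among several) is the interquartile range rule: values more than 1.5 times the IQR below the first quartile or above the third are treated as outliers. The sample values are hypothetical, and the 1.5 multiplier is a convention rather than a fixed rule.

```python
import statistics

# Hypothetical measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 95]

# Quartiles (statistics.quantiles requires Python 3.8+).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)
```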
Insecure data
refers to sensitive data that is not encrypted or access-controlled.
Incomplete data
occurs when you don't have data stored for certain variables or data items.
Incorrect data
can easily be prevented when data validation is in place.
Inconsistent data
occurs when there are multiple tables within a database that deal with the same data but may receive it from different inputs.
Inaccurate data
refers to data that contains errors and discrepancies that deviate from the true or expected values.
Constructive Transformation process
where data items are added, copied, or replicated.
Destructive Transformation process
where data items or records are trimmed or deleted.
Aesthetic Transformation process
where certain values are standardized to meet requirements or parameters.
Structural transformation process
which includes columns being renamed, moved, and combined.
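The four transformation types above can be seen side by side in one short sketch. The records and field names (`tmp`, `source`, `full_name`) are hypothetical.

```python
# Hypothetical source records.
records = [
    {"first": "ana", "last": "cruz", "country": "ph", "tmp": 1},
    {"first": "ben", "last": "lim", "country": "PH", "tmp": 2},
]

transformed = []
for r in records:
    row = dict(r)  # work on a copy so the source stays unchanged
    row["source"] = "crm"                    # constructive: a data item is added
    del row["tmp"]                           # destructive: a data item is deleted
    row["country"] = row["country"].upper()  # aesthetic: values are standardized
    # structural: two columns are combined into one
    row["full_name"] = f'{row.pop("first").title()} {row.pop("last").title()}'
    transformed.append(row)

print(transformed)
```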
Data Cleaning
is also known as data cleansing and data scrubbing.
Garbage-In Garbage-Out or GIGO
simply means the quality of output is determined by the quality of the input.
data completeness
The ____ is likely to be achieved when you make the important fields mandatory in both data entry and the data model.
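A mandatory-field check at data entry could be sketched as follows. The field names in `MANDATORY_FIELDS` are hypothetical; a database would typically enforce the same idea with NOT NULL constraints.

```python
# Hypothetical list of fields that every record must supply.
MANDATORY_FIELDS = ("customer_id", "email")

def missing_mandatory(record):
    """Return the mandatory fields that are absent or left empty."""
    return [f for f in MANDATORY_FIELDS if record.get(f) in (None, "")]

complete = missing_mandatory({"customer_id": 7, "email": "ana@example.com"})
incomplete = missing_mandatory({"customer_id": 7, "email": ""})
print(complete, incomplete)
```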
Data timeliness
refers to data that is available when it is required.
discovery stage
At the ________, data teams work to understand, identify, and find all applicable raw data and data types that need to be transformed. Data discovery includes identifying and understanding data in its original source format with the help of data profiling tools.
data mapping stage
At the ________ , data teams determine how individual fields are matched, filtered, joined, modified, and aggregated.
extraction stage
At the _______, data teams move data from its source system into the staging areas.
code generation and execution stage
At the _____ and _____, data teams generate and execute programs or code based on the mapping process using a programming language.
review output stage
At the _______, the transformed data is evaluated by the data teams to ensure the conversion has had the desired results in terms of the format of the data.
target stage
The send to _________ involves sending the transformed data to its target destination.
Data profiling
involves identifying patterns and inconsistencies in data. _______ helps identify data quality issues and assess the overall quality of the data.
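A very small profiling pass might count rows, missing values, distinct values, and value types per column. The sample rows and the `profile` helper are hypothetical; dedicated profiling tools report far more.

```python
from collections import Counter

# Hypothetical rows with a missing age and inconsistent city spelling.
rows = [
    {"city": "Manila", "age": 30},
    {"city": "manila", "age": None},
    {"city": "Cebu", "age": 41},
]

def profile(rows, column):
    """Summarize one column: row count, missing values, distinct values, types."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }

age_profile = profile(rows, "age")
city_profile = profile(rows, "city")
print(age_profile, city_profile)
```

Here the three distinct city values for what looks like two cities hint at an inconsistency worth cleaning.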
Data set
The _________ must be updated or refreshed to replace the obsolete data with the newer data.