COMSCI 2200 - UNDERSTANDING DATABASE ANALYTICS


42 Terms

1
New cards

CRM - Customer relationship management database

· is a resource containing all client information collected, governed, transformed, and shared across an organization

· it includes marketing and sales reporting tools, which are useful for running sales and marketing campaigns and increasing customer engagement

2
New cards

SCM - Supply Chain Management

· is management of the flow of goods, data, and finances related to a product or service, from the procurement of raw materials to the delivery of the product at its final destination

3
New cards

ETL - Extract, Transform and Load

· in the world of data warehousing, if you need to bring data from multiple different data sources into one centralized database, you must:

- EXTRACT data from its original source

- TRANSFORM data by deduplicating it, combining it, and ensuring quality, to then

- LOAD data into the target database
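The three steps above can be sketched in miniature with plain Python; the source records, field names, and dedup rule here are all hypothetical, just to make the pipeline concrete:

```python
# Minimal ETL sketch: two hypothetical source extracts -> one "warehouse" list.
sources = [
    [{"id": 1, "name": " Ada "}, {"id": 2, "name": "Grace"}],   # e.g. CRM export
    [{"id": 2, "name": "Grace"}, {"id": 3, "name": "Linus"}],   # e.g. SCM export
]

def extract(sources):
    # EXTRACT: pull raw rows out of every source.
    for source in sources:
        yield from source

def transform(rows):
    # TRANSFORM: deduplicate by id and normalize names (a toy quality check).
    seen = {}
    for row in rows:
        seen[row["id"]] = {"id": row["id"], "name": row["name"].strip()}
    return list(seen.values())

warehouse = []                         # LOAD target (stands in for the DW)
warehouse.extend(transform(extract(sources)))
print(sorted(r["id"] for r in warehouse))   # -> [1, 2, 3]
```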

4
New cards

DW - Data Warehouse

· also known as an enterprise data warehouse, is a system used for reporting and data analysis and is considered a core component of business intelligence

· DWs are central repositories of integrated data from one or more disparate sources

5
New cards

Data lake

· a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data

· it can store data in its native format and process any variety of it, without size limits

6
New cards

OLAP - Online Analytical Processing

· a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view

- ____ business intelligence queries often aid in trend analysis, financial reporting, sales forecasting, budgeting, and other planning purposes

7
New cards

OLTP - Online Transaction Processing

· a type of data processing that consists of executing a number of transactions occurring concurrently: online banking, shopping, order entry, or sending text messages, for example

8
New cards

GSP (Generalized Sequential Pattern) algorithm

· is an algorithm used for sequence mining. Algorithms for solving sequence mining problems are mostly based on the Apriori (level-wise) algorithm

· one way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion

9
New cards

Apriori algorithm

· for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database
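The level-wise idea can be sketched in a few lines of plain Python; the baskets and the minimum-support threshold below are made up for illustration:

```python
# Tiny Apriori sketch over hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
min_support = 2  # an itemset is "frequent" if it appears in >= 2 baskets

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent individual items.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Extend level by level: only supersets of frequent sets can be frequent.
k = 2
level = frequent
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level
    k += 1

print(sorted(tuple(sorted(s)) for s in frequent))
```

With this data, the frequent sets include all four single items, six pairs, and the triples {bread, diapers, beer} and {milk, diapers, beer}.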

10
New cards

data warehouse

- an integrated system or database that enables the user to instantly analyze, from multiple points of view and without separate programming, the internal and external data generated over time by an enterprise's operational systems, by integrating the data by subject.

11
New cards

- Subject-oriented
- Integrated
- Time variant
- Non-volatile

Characteristics (4) of a data warehouse

12
New cards

Subject-oriented

• Among the many types of operational system data managed by business functions, only the data on the specific subjects needed for decision-making activities from an enterprise perspective are saved; other data are not included.

13
New cards

Integrated

• The structure of a data warehouse is characterized by data consistency and physical unity through company-wide data standardization.

• When obtaining data from the operational system, a series of data conversion tasks is performed to integrate the data.

14
New cards

Time Variant

• To analyze past and present trends and forecast the future, a data warehouse retains data for a long time in the form of a series of snapshots.

• Users can understand the process of data change over time using the data history.

15
New cards

Non-volatile

• A data warehouse is a read-only database whose data cannot be deleted or modified once they have been loaded from the operational system database.

• When a modification occurs in the operational system, the existing data there are deleted (overwritten), whereas the data warehouse stores the history of the data at each point in time.

16
New cards

- Fact table
- Dimension table

Components (2) of data warehouse modeling

17
New cards

Fact table

- A core table composed of a set of highly relevant measures.

- A measure is measurement data that captures the target of information analysis, such as an amount, a count, or a time.

18
New cards

Dimension table

• A sub-table that provides a perspective for analyzing each fact.

• has multiple attributes, thus allowing data analysis from diverse perspectives

19
New cards

Data warehouse modeling technique

organizes data using fact tables and dimension tables to facilitate the analysis of information. The technique can be divided into the ‘star schema’ and the ‘snowflake schema’, depending on whether the dimension tables are normalized.

20
New cards

Star schema

• A modeling technique for designing data by separating it into fact tables and dimension tables.

• Data duplication occurs because dimension table data are not normalized.

• The schema is easy to understand and has few joins, thus improving query performance, but data consistency problems may occur.
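As a sketch, a star-schema query joining a fact table to a denormalized dimension table might look like this in plain Python (the table names, keys, and columns are all hypothetical):

```python
# Star-schema sketch: a fact table keyed into one denormalized dimension.
dim_product = {                       # dimension: one row per product key
    1: {"name": "pen", "category": "stationery"},
    2: {"name": "mug", "category": "kitchen"},
}
fact_sales = [                        # fact: foreign keys plus a measure
    {"product_key": 1, "amount": 3.0},
    {"product_key": 1, "amount": 2.0},
    {"product_key": 2, "amount": 7.5},
]

# A typical star-schema query: total amount per product category,
# i.e. one join from fact to dimension, then aggregate the measure.
totals = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    totals[category] = totals.get(category, 0.0) + row["amount"]

print(totals)   # -> {'stationery': 5.0, 'kitchen': 7.5}
```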

21
New cards

Snowflake schema

• A modeling technique that completely normalizes the dimension tables of the star schema.

• Data duplication is rare and less storage space is used owing to the normalization of the dimension tables, but there is some concern about performance degradation due to the greater number of joins compared to the star schema.

22
New cards

ETL (EXTRACTION, TRANSFORMATION, LOADING)

refers to the entire process by which data are extracted from the source system and stored in the data warehouse after cleansing and conversion.

It maintains data consistency and integrity among the components of the data warehouse, and is also called ETT (Extraction, Transformation, Transportation).

23
New cards

Extraction

• The phase in which data are extracted from the original file or operational system database and stored in the data warehouse.

• In the past, data were extracted on a daily or monthly basis, but in some recent cases data are extracted in real time from database logs, according to business requirements.

24
New cards

Transformation

• A phase in which extracted data are cleaned and converted into a data format suitable for the data warehouse.

• In the event of data quality problems, data are cleansed according to the reference data or business rules.

• The original data format is converted into a data format suitable for the data warehouse.

25
New cards

Loading

• A phase in which converted data are sent to the warehouse for storage and the necessary indexes are generated.

• Full and partial update techniques are available.

26
New cards

OLAP

refers to the process by which the end user directly accesses multi-dimensional information, without an intermediary or medium, and then analyzes the information interactively and uses it for decision-making.

That is, once the operational data extracted and converted by ETL are stored in the data warehouse or data mart, the end user analyzes them using ____.

27
New cards

- Drill Down
- Roll Up
- Drill Across
- Pivot
- Slice
- Dice

OLAP (6) search techniques

28
New cards

Drill Down

A search technique that approaches a specific analysis topic in phases from a high summary level to a low (detail) summary level.

E.g. Time dimension: Year -> Month -> Day

29
New cards

Roll Up

Concept opposite to Drill Down

A search technique that approaches a specific analysis topic in phases from a low summary level to a high summary level.

E.g. Time dimension: Day -> Month -> Year
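The two navigation directions can be sketched with a plain-Python aggregation; the daily sales figures and the (year, month, day) key layout are made up for illustration:

```python
from collections import defaultdict

# Roll-up / drill-down sketch over hypothetical daily sales.
daily = {
    ("2023", "01", "05"): 10,
    ("2023", "01", "20"): 15,
    ("2023", "02", "03"): 7,
}

def roll_up(data, levels):
    # Aggregate away the finer levels: keep only the first `levels` key parts.
    out = defaultdict(int)
    for key, value in data.items():
        out[key[:levels]] += value
    return dict(out)

monthly = roll_up(daily, 2)   # day -> month (roll up one level)
yearly = roll_up(daily, 1)    # month -> year (roll up again)
print(monthly)                # -> {('2023', '01'): 25, ('2023', '02'): 7}
print(yearly)                 # -> {('2023',): 32}
# Drill-down is the opposite direction: from `yearly` back toward `daily`.
```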

30
New cards

Drill Across

A search technique that uses a certain analysis viewpoint on one analysis topic to approach another analysis topic

31
New cards

Pivot

A search technique that changes the axis of the analysis perspective on a specific analysis topic.

32
New cards

Slice

A search technique that creates subsets by selecting specific values for the members at one level or the members above that level.

33
New cards

Dice

A search technique that creates subsets by slicing more than two dimensions.
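Slice and Dice can both be sketched as filters over a small cube; the dimensions (time, region, product) and values below are hypothetical:

```python
# Slice/dice sketch over a tiny 3-D cube stored as {(time, region, product): measure}.
cube = {
    ("2023", "EU", "pen"): 4,
    ("2023", "EU", "mug"): 6,
    ("2023", "US", "pen"): 5,
    ("2024", "EU", "pen"): 8,
}

# Slice: fix ONE dimension to a single value (time == "2023").
slice_2023 = {k: v for k, v in cube.items() if k[0] == "2023"}

# Dice: restrict TWO OR MORE dimensions to subsets of values.
dice = {k: v for k, v in cube.items()
        if k[0] in {"2023"} and k[1] in {"EU"}}

print(len(slice_2023), len(dice))   # -> 3 2
```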

34
New cards

Data mining

refers to a series of processes that identify a systematic statistical rule or pattern among a large amount of data, convert it into meaningful information, and apply it to corporate decision-making.

35
New cards

- Association
- Sequence
- Classification
- Clustering

Algorithms (4) of data mining

36
New cards

Association

• An analysis algorithm that discovers patterns from combinations of highly relevant data in transaction data, etc.

- Apriori algorithm, etc.

This algorithm is mainly used to plan product placement by analyzing offline store transactions, and to automatically recommend related products at online shopping malls, etc.

37
New cards

Sequence

An analysis algorithm that searches the correlation of items over time by adding the concept of time to association analysis.

The possibility of a given transaction occurring in the future is forecast by performing time series analysis on transaction history data.

- Apriori algorithm, Generalized Sequential Patterns (GSP), etc.

38
New cards

Classification

An analysis algorithm that, given a dataset, creates a tree-type model that classifies the values (category values) of a specific attribute (category type).

- Decision tree algorithm, etc.

39
New cards

decision tree

a structure consisting of a root node, branches, and leaf nodes. Each internal node represents a test on an attribute, each branch represents the outcome of a test, and each leaf node holds a class label.
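A hand-built sketch of such a tree, with internal nodes as attribute tests and leaves as class labels (the weather attributes and labels are hypothetical):

```python
# Decision-tree sketch: an internal node is (attribute, branches), a leaf is a label.
tree = ("outlook", {                       # root node: test on "outlook"
    "sunny":    ("humidity", {             # internal node: test on "humidity"
        "high":   "stay-in",               # leaf: class label
        "normal": "play",
    }),
    "overcast": "play",
    "rainy":    "stay-in",
})

def classify(node, record):
    if isinstance(node, str):              # leaf node: return its class label
        return node
    attribute, branches = node             # follow the branch for this record's value
    return classify(branches[record[attribute]], record)

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> play
```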

40
New cards

Clustering

An analysis algorithm that groups records with similar attributes, by considering several attributes of given records (customers, products).

- K-Means algorithm, EM algorithm, etc.

41
New cards

K-means algorithm

a technique for data clustering that may be used for unsupervised machine learning. It can classify unlabeled data into a predetermined number (k) of clusters based on similarity.
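A plain-Python sketch of the assign/update loop for k = 2 in one dimension (the points and the naive initialization are made up):

```python
# 1-D k-means sketch (k = 2): alternate assignment and center updates.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]          # naive initialization

for _ in range(10):                        # a few assignment/update rounds
    clusters = [[], []]
    for p in points:                       # assign each point to its
        nearest = min((0, 1), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)        # nearest center
    centers = [sum(c) / len(c) for c in clusters]  # recompute each center

print(centers)   # -> [1.5, 10.5]
```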

42
New cards

Expectation-Maximization (EM) Algorithm

an approach for maximum likelihood estimation in the presence of latent variables.

A general technique for finding maximum likelihood estimates in latent variable models.
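As a toy sketch, here is EM for a 1-D mixture of two unit-variance Gaussians with equal weights, where only the two means are re-estimated (the data points and initial guesses are made up):

```python
import math

# EM sketch: two-component 1-D Gaussian mixture, known variance 1, equal weights.
data = [-0.2, 0.1, 0.3, 4.8, 5.1, 5.3]
mu = [0.5, 4.0]                                  # initial mean guesses

def pdf(x, m):
    # Unnormalized unit-variance Gaussian density (constants cancel in E-step).
    return math.exp(-0.5 * (x - m) ** 2)

for _ in range(20):
    # E-step: responsibility of component 0 for each point (latent assignment).
    r0 = [pdf(x, mu[0]) / (pdf(x, mu[0]) + pdf(x, mu[1])) for x in data]
    # M-step: update each mean as a responsibility-weighted average.
    mu[0] = sum(r * x for r, x in zip(r0, data)) / sum(r0)
    mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / sum(1 - r for r in r0)

print([round(m, 2) for m in mu])   # converges near the two cluster means
```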