What is a data warehouse?
A centralized, structured data repository used for analysis and reporting; stores current and historical data that is cleansed and categorized.
What is the primary use of a data warehouse?
To serve as a single source of truth for operational and performance analytics.
What is a data mart?
A subsection of a data warehouse tailored for a specific business function or user group.
What is the benefit of a data mart?
Provides isolated security and performance for business-specific reporting and analytics.
What is a data lake?
A storage system for raw, structured, semi-structured, and unstructured data tagged with metadata.
What is the main purpose of a data lake?
Supports predictive and advanced analytics; retains all source data for flexible use.
What is ETL?
Extract, Transform, Load – a process that converts raw data into analysis-ready data.
What happens in the Extract step of ETL?
Raw data is collected from source systems via batch or stream processing.
What tools are used for batch data extraction?
Stitch, Blendo.
What tools are used for stream data extraction?
Apache Samza, Apache Storm, Apache Kafka.
What occurs during the Transform step of ETL?
Data is cleaned, standardized, enriched, validated, and converted into usable formats.
What happens in the Load step of ETL?
Processed data is delivered to a target system or repository (initial load, incremental, or full refresh).
What is load verification?
Checking for missing/null values, server performance, and load failures.
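The Extract, Transform, and Load steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production tool: the CSV source, `sales` table, and column names are made up for the example, with an in-memory SQLite database standing in for the target repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data standing in for an external system.
RAW_CSV = """id,name,revenue
1, Acme ,1000
2,Globex,
3,Initech,2500
"""

def extract(raw):
    """Extract: collect raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: validate, clean, and convert rows into usable types."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:               # validation: drop rows with missing values
            continue
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"].strip(),     # standardize text fields
            "revenue": float(row["revenue"]),
        })
    return cleaned

def load(rows, conn):
    """Load: deliver processed rows to the target repository."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (:id, :name, :revenue)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Load verification: confirm row counts and totals after the load.
print(conn.execute("SELECT COUNT(*), SUM(revenue) FROM sales").fetchone())  # (2, 3500.0)
```

Note how the row with a missing `revenue` value is rejected during Transform, and the final query doubles as a simple load-verification check.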
What is a data pipeline?
A system for moving data from source to destination; includes ETL but also supports broader operations.
How does a data pipeline differ from ETL?
A data pipeline is the broader term; ETL is one specific transformation process that can run within a pipeline.
What is the typical destination of a data pipeline?
Data lakes, applications, or visualization tools.
What tools are used for data pipelines?
Apache Beam, Google DataFlow.
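The source-to-destination flow of a pipeline can be sketched as chained stages. This toy example uses plain Python generators; the stage names and sample records are illustrative and not tied to any pipeline framework such as Apache Beam.

```python
def source():
    """Produce raw records, e.g. events read from a log or queue."""
    yield from [
        {"user": "a", "clicks": 3},
        {"user": "b", "clicks": 0},
        {"user": "c", "clicks": 7},
    ]

def transform(records):
    """One pipeline stage: filter out inactive users and reshape records."""
    for rec in records:
        if rec["clicks"] > 0:
            yield {"user": rec["user"], "active": True}

def sink(records):
    """Deliver records to a destination (here, just collect them into a list)."""
    return list(records)

# Stages compose source -> transform -> destination, like a pipeline DAG.
result = sink(transform(source()))
print(result)  # [{'user': 'a', 'active': True}, {'user': 'c', 'active': True}]
```

A real pipeline would swap the sink for a data lake, application, or visualization tool, as the card above notes.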
What is data integration?
The process of ingesting, transforming, combining, and provisioning data across various sources.
What are key use cases for data integration?
Data consistency, master data management, sharing, migration, and analytics.
How does data integration relate to ETL and pipelines?
Data integration uses pipelines to move and combine data; ETL is a process within integration.
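The "combining" part of data integration can be shown with a toy merge of two hypothetical sources on a shared key; the `crm` and `billing` records below are invented for illustration.

```python
# Two hypothetical source systems keyed by customer id.
crm = {"c1": {"name": "Acme"}, "c2": {"name": "Globex"}}
billing = {"c1": {"balance": 120.0}, "c2": {"balance": 0.0}}

# Combine records on the shared key to provision one unified view.
unified = {cid: {**crm[cid], **billing.get(cid, {})} for cid in crm}

print(unified["c1"])  # {'name': 'Acme', 'balance': 120.0}
```

Integration platforms do this at scale across many sources, adding connectors, governance, and batch/stream execution on top of the same basic idea.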
What are features of modern data integration platforms?
Pre-built connectors, open-source architecture, batch/stream optimization, cloud portability, governance tools.
Name a few commercial data integration tools.
IBM InfoSphere, Talend Data Fabric, SAP, Oracle, Microsoft, TIBCO.
Name some open-source or iPaaS integration tools.
Dell Boomi, SnapLogic, Jitterbit, Informatica Cloud.
What is an RDBMS?
Relational Database Management System; stores structured data in tables using predefined schemas.
What language is used to query RDBMS?
SQL (Structured Query Language).
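A quick sketch of querying an RDBMS with SQL, using Python's built-in `sqlite3` module; the `employees` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Predefined schema: structured data in a table with typed columns.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Ada", "Engineering"), ("Grace", "Engineering"), ("Edgar", "Research")],
)

# SQL query: filter, group, and aggregate over the predefined schema.
rows = conn.execute(
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('Engineering', 2), ('Research', 1)]
```

The fixed schema is what gives an RDBMS its consistency and integrity guarantees, and also what makes it rigid for semi-structured data.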
What are advantages of RDBMS?
Consistency, integrity, easy backup, and clear data relationships.
What are limitations of RDBMS?
Poor performance on very large, semi-structured, or unstructured data; rigid schemas; field-length limits.
What is a NoSQL database?
A flexible, scalable database for semi-structured and unstructured data.
What are the types of NoSQL databases?
Document-based, key-value, columnar, graph.
What is a document-based NoSQL database?
Stores semi-structured documents like JSON in collections.
What is a key-value NoSQL database?
Stores data as key-value pairs for fast retrieval.
What is a columnar NoSQL database?
Stores data by column for large-scale analytical workloads.
What is a graph NoSQL database?
Stores data in nodes and edges to capture complex relationships.
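The four NoSQL data models above can be illustrated with plain Python structures. This is only a shape sketch: real stores (e.g. MongoDB, Redis, Cassandra, Neo4j) add indexing, distribution, and query languages on top, and all names below are invented.

```python
# Document-based: semi-structured JSON-like documents in a collection.
documents = [{"_id": 1, "name": "Ada", "skills": ["SQL", "Python"]}]

# Key-value: opaque values retrieved quickly by key.
kv_store = {"session:42": "user=ada;expires=3600"}

# Columnar: values grouped by column, suiting large analytical scans.
columns = {"name": ["Ada", "Grace"], "dept": ["Eng", "Eng"]}

# Graph: nodes plus edges capturing relationships.
nodes = {"ada", "grace"}
edges = [("ada", "mentors", "grace")]

print(kv_store["session:42"])                          # fast lookup by key
print(sum(1 for d in columns["dept"] if d == "Eng"))   # scan a single column: 2
```

Choosing among these models follows the same rule the later cards give for storage in general: it depends on the data's type, volume, and intended use.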
What type of data does a relational database store?
Structured data.
When would you use a NoSQL database over RDBMS?
When handling semi-structured or unstructured data or needing schema flexibility.
What determines the data storage method you choose?
Data type, volume, and intended use.
What is the main benefit of using data warehouses, marts, and lakes?
They support different types of analytics across varied data structures and volumes.
Why is ETL important for data science?
It ensures raw data is transformed and ready for accurate analysis.
What does a data scientist need to understand about data systems?
Storage options, retrieval methods, data organization, and transformation processes.