What is a data warehouse?
A centralized, structured data repository used for analysis and reporting; stores current and historical data that is cleansed and categorized.
What is the primary use of a data warehouse?
To serve as a single source of truth for operational and performance analytics.
What is a data mart?
A subsection of a data warehouse tailored for a specific business function or user group.
What is the benefit of a data mart?
Provides isolated security and performance for business-specific reporting and analytics.
What is a data lake?
A storage system for raw, structured, semi-structured, and unstructured data tagged with metadata.
What is the main purpose of a data lake?
Supports predictive and advanced analytics; retains all source data for flexible use.
What is ETL?
Extract, Transform, Load – a process that converts raw data into analysis-ready data.
What happens in the Extract step of ETL?
Raw data is collected from source systems via batch or stream processing.
What tools are used for batch data extraction?
Stitch, Blendo.
What tools are used for stream data extraction?
Apache Samza, Apache Storm, Apache Kafka.
What occurs during the Transform step of ETL?
Data is cleaned, standardized, enriched, validated, and converted into usable formats.
What happens in the Load step of ETL?
Processed data is delivered to a target system or repository (initial load, incremental, or full refresh).
What is load verification?
Checking for missing/null values, server performance, and load failures.
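The Extract, Transform, and Load steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production tool: the CSV source, `sales` table, and column names are made up for the example, with an in-memory SQLite database standing in for the target repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data standing in for an external system.
RAW_CSV = """id,name,revenue
1, Acme ,1000
2,Globex,
3,Initech,2500
"""

def extract(raw):
    """Extract: collect raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: validate, clean, and convert rows into usable types."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:               # validation: drop rows with missing values
            continue
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"].strip(),     # standardize text fields
            "revenue": float(row["revenue"]),
        })
    return cleaned

def load(rows, conn):
    """Load: deliver processed rows to the target repository."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (:id, :name, :revenue)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Load verification: confirm row counts and totals after the load.
print(conn.execute("SELECT COUNT(*), SUM(revenue) FROM sales").fetchone())  # (2, 3500.0)
```

Note how the row with a missing `revenue` value is rejected during Transform, and the final query doubles as a simple load-verification check.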
What is a data pipeline?
A system for moving data from source to destination; includes ETL but also supports broader operations.
How does a data pipeline differ from ETL?
A data pipeline is the broader term; ETL is one specific transformation process that can run within a pipeline.
What is the typical destination of a data pipeline?
Data lakes, applications, or visualization tools.
What tools are used for data pipelines?
Apache Beam, Google DataFlow.
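The source-to-destination flow of a pipeline can be sketched as chained stages. This toy example uses plain Python generators; the stage names and sample records are illustrative and not tied to any pipeline framework such as Apache Beam.

```python
def source():
    """Produce raw records, e.g. events read from a log or queue."""
    yield from [
        {"user": "a", "clicks": 3},
        {"user": "b", "clicks": 0},
        {"user": "c", "clicks": 7},
    ]

def transform(records):
    """One pipeline stage: filter out inactive users and reshape records."""
    for rec in records:
        if rec["clicks"] > 0:
            yield {"user": rec["user"], "active": True}

def sink(records):
    """Deliver records to a destination (here, just collect them into a list)."""
    return list(records)

# Stages compose source -> transform -> destination, like a pipeline DAG.
result = sink(transform(source()))
print(result)  # [{'user': 'a', 'active': True}, {'user': 'c', 'active': True}]
```

A real pipeline would swap the sink for a data lake, application, or visualization tool, as the card above notes.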
What is data integration?
The process of ingesting, transforming, combining, and provisioning data across various sources.
What are key use cases for data integration?
Data consistency, master data management, sharing, migration, and analytics.
How does data integration relate to ETL and pipelines?
Data integration uses pipelines to move and combine data; ETL is a process within integration.
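The "combining" part of data integration can be shown with a toy merge of two hypothetical sources on a shared key; the `crm` and `billing` records below are invented for illustration.

```python
# Two hypothetical source systems keyed by customer id.
crm = {"c1": {"name": "Acme"}, "c2": {"name": "Globex"}}
billing = {"c1": {"balance": 120.0}, "c2": {"balance": 0.0}}

# Combine records on the shared key to provision one unified view.
unified = {cid: {**crm[cid], **billing.get(cid, {})} for cid in crm}

print(unified["c1"])  # {'name': 'Acme', 'balance': 120.0}
```

Integration platforms do this at scale across many sources, adding connectors, governance, and batch/stream execution on top of the same basic idea.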
What are features of modern data integration platforms?
Pre-built connectors, open-source architecture, batch/stream optimization, cloud portability, governance tools.
Name a few commercial data integration tools.
IBM InfoSphere, Talend Data Fabric, SAP, Oracle, Microsoft, TIBCO.
Name some open-source or iPaaS integration tools.
Dell Boomi, SnapLogic, Jitterbit, Informatica Cloud.
What is an RDBMS?
Relational Database Management System; stores structured data in tables using predefined schemas.
What language is used to query RDBMS?
SQL (Structured Query Language).
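A quick sketch of querying an RDBMS with SQL, using Python's built-in `sqlite3` module; the `employees` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Predefined schema: structured data in a table with typed columns.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Ada", "Engineering"), ("Grace", "Engineering"), ("Edgar", "Research")],
)

# SQL query: filter, group, and aggregate over the predefined schema.
rows = conn.execute(
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('Engineering', 2), ('Research', 1)]
```

The fixed schema is what gives an RDBMS its consistency and integrity guarantees, and also what makes it rigid for semi-structured data.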
What are advantages of RDBMS?
Consistency, integrity, easy backup, and clear data relationships.
What are limitations of RDBMS?
Poor performance on very large, semi-structured, or unstructured data; rigid schemas; field-length limits.
What is a NoSQL database?
A flexible, scalable database for semi-structured and unstructured data.
What are the types of NoSQL databases?
Document-based, key-value, columnar, graph.
What is a document-based NoSQL database?
Stores semi-structured documents like JSON in collections.
What is a key-value NoSQL database?
Stores data as key-value pairs for fast retrieval.
What is a columnar NoSQL database?
Stores data by column for large-scale analytical workloads.
What is a graph NoSQL database?
Stores data in nodes and edges to capture complex relationships.
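The four NoSQL data models above can be illustrated with plain Python structures. This is only a shape sketch: real stores (e.g. MongoDB, Redis, Cassandra, Neo4j) add indexing, distribution, and query languages on top, and all names below are invented.

```python
# Document-based: semi-structured JSON-like documents in a collection.
documents = [{"_id": 1, "name": "Ada", "skills": ["SQL", "Python"]}]

# Key-value: opaque values retrieved quickly by key.
kv_store = {"session:42": "user=ada;expires=3600"}

# Columnar: values grouped by column, suiting large analytical scans.
columns = {"name": ["Ada", "Grace"], "dept": ["Eng", "Eng"]}

# Graph: nodes plus edges capturing relationships.
nodes = {"ada", "grace"}
edges = [("ada", "mentors", "grace")]

print(kv_store["session:42"])                          # fast lookup by key
print(sum(1 for d in columns["dept"] if d == "Eng"))   # scan a single column: 2
```

Choosing among these models follows the same rule the later cards give for storage in general: it depends on the data's type, volume, and intended use.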
What type of data does a relational database store?
Structured data.
When would you use a NoSQL database over RDBMS?
When handling semi-structured or unstructured data or needing schema flexibility.
What determines the data storage method you choose?
Data type, volume, and intended use.
What is the main benefit of using data warehouses, marts, and lakes?
They support different types of analytics across varied data structures and volumes.
Why is ETL important for data science?
It ensures raw data is transformed and ready for accurate analysis.
What does a data scientist need to understand about data systems?
Storage options, retrieval methods, data organization, and transformation processes.