Data Engineering is ____ and Data Science & Analytics is ____.
upstream, downstream
Explain stage 1 of data maturity: Starting with Data
Challenges include losing organizational momentum. Engineers should focus on visible wins, communicate with stakeholders, use off-the-shelf solutions, and only build custom systems where they offer competitive advantage.
Goals of this stage are:
1. Gaining executive support for data architecture design.
2. Defining the right data architecture.
3. Identifying and inspecting data that will support key initiatives.
4. Operating the data architecture.
5. Building a solid foundation for future data analysts to generate reports and valuable models.
Explain stage 2 of data maturity: Scaling with Data
This stage is about scaling data practices, optimizing for growth, and laying the groundwork for broader, company-wide data-driven decision-making.
The Goals of Scaling with Data:
1. Establish formal data practices.
2. Create scalable and robust data architectures.
3. Adopt DevOps and DataOps practices like automation and monitoring.
4. Build systems that support machine learning.
5. Avoid unnecessary complexity and work unless there is competitive advantage.
Explain stage 3 of data maturity: Leading with Data
Stage 3 focuses on maintaining and enhancing data-driven capabilities, while ensuring that data is seamlessly available and usable across the organization.
Goals:
1. Deploy tools for data access to ensure that everyone can access the right data.
2. Foster collaboration so that engineers, analysts, and others openly work together and share data.
3. Enable self-service analytics that empower employees at all levels to access and use data independently for analysis.
4. WATCH OUT for complacency; companies need to continuously improve and avoid stagnating or regressing to earlier stages.
5. WATCH OUT for technology distractions; the temptation to explore new tech may not add business value.
Explain what type A and type B data engineers do:
Type A (abstraction): Avoids undifferentiated heavy lifting, keeping data architecture as abstract and straightforward as possible.
Type B (build): Builds data tools and systems that scale and leverage a company's core competency and competitive advantage.
What are Internal-Facing vs. External-Facing Data Engineers?
Internal-Facing Engineer: Focuses on activities crucial to the needs of the business and internal stakeholders.
External-Facing Engineer: Typically aligns with the users of external-facing applications like social media apps, e-commerce platforms, etc.
What are the 4 primary languages a data engineer should know?
1. SQL
The most common interface for databases and data lakes, essential for querying and managing data.
2. Python
The bridge between data engineering and data science, with strong support for data tools (pandas, Airflow, PySpark, etc.); a minimal sketch follows this list.
3. JVM languages such as Java and Scala
Used for performance and access to low-level features in open-source projects like Apache Spark, Hive, and Druid.
4. bash
Command-line scripting for Linux; useful for automating tasks and managing OS-level operations in data pipelines.
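A minimal sketch of Python's bridge role with pandas; the file and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical raw export: one row per order, with order_date and amount columns.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Summarize to daily revenue, the kind of handoff an analyst or scientist consumes.
daily_revenue = (
    orders.groupby(orders["order_date"].dt.date)["amount"]
    .sum()
    .reset_index(name="revenue")
)
print(daily_revenue.head())
```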
What are the 5 Major Stages of the Data Engineering Lifecycle?
Generation
Storage
Ingestion
Transformation
Serving
What is Data Engineering?
Data engineering is the development, implementation, and maintenance of systems that take in raw data and produce information that supports analysis and machine learning.
What is Data Maturity?
The progression toward higher data utilization, in three stages: starting with data, scaling with data, and leading with data.
What is the Data Engineering Lifecycle (term and step-by-step goals)?
Focuses on the stages a data engineer controls.
Generation → Storage {Ingestion → Transformation → Serving}
What is data generation?
The source system or the origin of the data used in the lifecycle.
IoT device
Phone, laptop
Application message queue
Transactional database
E.g. Create a record to transfer 5 dollars from account A to B.
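A minimal sketch of that example as one atomic transaction, using Python's built-in sqlite3 (the accounts table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 50.0)])

# Both updates commit together or roll back together: that is what makes
# the source system "transactional".
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 5 WHERE id = 'A'")
    conn.execute("UPDATE accounts SET balance = balance + 5 WHERE id = 'B'")

print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [('A', 95.0), ('B', 55.0)]
```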
What is a Schema?
A database schema is a blueprint that defines the structure and organization of a relational database.
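A minimal sketch of a schema as a blueprint, again with sqlite3; both tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema declares each table's columns, types, keys, and how tables relate.
conn.executescript("""
CREATE TABLE department (
    department_number INTEGER PRIMARY KEY,
    department_name   TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id       INTEGER PRIMARY KEY,
    name              TEXT NOT NULL,
    department_number INTEGER REFERENCES department(department_number)
);
""")
```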
What is important when considering how to STORE your data?
COMPATIBILITY AND SCALE
What is HOT, WARM, and COLD storage?
HOT:
Access: Very frequent
Storage Cost: HIGH
Retrieval Cost: Cheap
WARM:
Access: Infrequent
Storage Cost: Medium
Retrieval Cost: Medium
COLD:
Access: Infrequent
Storage Cost: Cheap
Retrieval Cost: HIGH

What is data INGESTION?
The process of collecting data from various sources and moving it to a central location for further processing and analysis.
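A minimal ingestion sketch with pandas; the endpoint and file names are assumptions, and writing Parquet requires an engine such as pyarrow:

```python
import pandas as pd

# Collect from two hypothetical sources...
api_df = pd.read_json("https://example.com/api/events.json")
file_df = pd.read_csv("exported_events.csv")

# ...and land them in one central staging file for later processing.
combined = pd.concat([api_df, file_df], ignore_index=True)
combined.to_parquet("events_staged.parquet")
```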

What is data TRANSFORMATION?
The process of changing data from its original form into something useful for downstream use cases.
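A minimal transformation sketch, continuing from the staged file above; the column names are hypothetical:

```python
import pandas as pd

raw = pd.read_parquet("events_staged.parquet")

# Clean types, drop bad rows, and keep only rows useful downstream.
clean = (
    raw.dropna(subset=["user_id"])
       .assign(event_time=lambda d: pd.to_datetime(d["event_time"]))
       .query("amount > 0")
)
clean.to_parquet("events_clean.parquet")
```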
What is data SERVING?
The stage where data has value and can be put to practical use. It can take the following forms:
E.g. — Data analytics: Build reports, dashboards, ad hoc analysis, etc.

What is a Reverse ETL?
Reverse ETL takes processed data from the output side of the data engineering lifecycle and feeds it back into source systems.
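A minimal reverse ETL sketch; the warehouse table and CRM endpoint are hypothetical:

```python
import pandas as pd
import requests

# Read a modeled table from the serving side of the lifecycle...
scores = pd.read_parquet("customer_scores.parquet")

# ...and push each record back into an operational source system (made-up CRM API).
for record in scores.to_dict(orient="records"):
    requests.post("https://crm.example.com/api/contacts", json=record, timeout=10)
```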

What is DATA SECURITY of the 6 Data Undercurrents Across the Data Engineering Lifecycle?
Security
Competency in managing the security of data. The first line of defense is to create a culture of security that teaches everyone with access to data that it is their responsibility to protect the company's sensitive data. Grant access only to those who need it, for the duration necessary to perform their work.
What is DATA MANAGEMENT of the 6 Data Undercurrents Across the Data Engineering Lifecycle?
Development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle.
What are DATAOPS of the 6 Data Undercurrents Across the Data Engineering Lifecycle?
A collection of technical practices that enable rapid innovation and experimentation, delivering new insights to customers with increasing velocity, low error rates, and clear monitoring.
What is DATA ARCHITECTURE of the 6 Data Undercurrents Across the Data Engineering Lifecycle?
A data architecture reflects the current and future state of data systems that support an organization’s long-term data needs and strategy.
What is DATA ORCHESTRATION of the 6 Data Undercurrents Across the Data Engineering Lifecycle?
Orchestration is the process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence.
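A minimal orchestration sketch with Apache Airflow 2.x; the DAG id and task bodies are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    ...  # placeholder: pull data from a source system

def transform():
    ...  # placeholder: clean and reshape the ingested data

with DAG(dag_id="daily_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    # transform runs only after ingest succeeds, on a daily cadence.
    ingest_task >> transform_task
```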
What is SOFTWARE ENGINEERING of the 6 Data Undercurrents Across the Data Engineering Lifecycle?
Software engineering is crucial for data engineers. Initially, they worked with low-level coding frameworks like MapReduce, but modern tools like Spark and SQL have made coding more abstract and user-friendly.
How is Data Architecture Defined?
Data architecture is the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.
What is Operational Architecture?
Operational architecture encompasses the functional requirements of what needs to happen related to people, processes, and technology.
What is Technical Architecture?
Outlines HOW data is ingested, stored, transformed, and served along the data engineering lifecycle.
What is principle 1 of Good Data Architecture? (WRITE DOWN OVER AND OVER TO MEMORIZE)
Choose common components wisely.
Common Components have Broad Applicability
Must be Accessible and Secure
What is principle 2 of Good Data Architecture?
Plan for failure
Availability
Reliability
Recovery Time Objective (RTO)
Recovery Point Objective (RPO)
What is principle 3 of Good Data Architecture?
Architect for Scalability
Consider Variability and Permanence of Scale
Elastic Systems – Scale Up, Scale Down, Scale to Zero
What is principle 4 of Good Data Architecture?
Architecture is Leadership
Architects are Decision Makers
Architects Delegate
Command and Control vs. Influence
What is principle 5 of Good Data Architecture?
Always be Architecting
Systems aren’t Static
Architects don’t just Maintain – They Evolve to Meet Business Demand
Baseline vs. Target Architecture
What is principle 6 of Good Data Architecture?
Build Loosely Coupled Systems
Test, Deploy and Change Independently
Services and Contracts
Details are Hidden, Changes don’t Affect Other Components
What is principle 7 of Good Data Architecture?
Make Reversible Decisions
Simplify Architecture, Enable Agility
Two-way Doors Analogy
Pace of Change Makes this a Necessity
What is principle 8 of Good Data Architecture?
Prioritize Security
Hardened-perimeter and zero-trust security models
The shared responsibility model
Data engineers as security engineers
What is principle 9 of Good Data Architecture?
Embrace FinOps
FinOps is the emerging professional movement that advocates a collaborative working relationship between DevOps and Finance, resulting in an iterative, data-driven management of infrastructure spending while simultaneously increasing the cost efficiency and, ultimately, the profitability of the cloud environment.
What are domains and services?
A domain can contain multiple services; services include things like orders, invoicing, and products.
For example, an accounting domain might be responsible for basic accounting functions: invoicing, payroll, and accounts receivable.
Tight Coupling VS. Loose Coupling
Tight = Extremely centralized dependencies and workflows, where every part of a domain and service is vitally dependent upon every other domain and service.
Loose = Decentralized domains and services that don't have strict dependence on each other. It's easy for decentralized teams to build systems whose data may not be usable by their peers.
What are the 4 characteristics of data systems?
Scalability
Elasticity
Availability
Reliability
What are the architecture tiers and explain each one?
Single-tier
Your database and application are tightly coupled, residing on a single server.
Multitier (n-tier)
The upper layers are dependent on the lower layers.
Three-tier
These tiers are Data → Application Logic → Presentation.
What are MONOLITHS?
A monolith in data refers to a system where all data-related processes—such as data storage, processing, transformation, and serving—are tightly integrated within a single, unified architecture. This often means that the entire data pipeline, from ingestion to analytics, is managed in one system or application, with little separation between components.
What are MICROSERVICES?
Microservices are the opposite of monoliths: an architecture comprising separate, decentralized, loosely coupled services. Each service has a specific function and is decoupled from the other services operating within its domain. If one service temporarily goes down, it won't affect the ability of other services to continue functioning, unlike in a monolith.
What is On Premises?
Data and computing resources are hosted locally within an organization’s own infrastructure. Companies maintain complete control over hardware, software, and security but bear all the costs and responsibilities for management and maintenance.
What is Cloud?
Data hosted on third-party cloud providers (like AWS, Azure, or Google Cloud), offering scalability and flexibility. Organizations pay for what they use, and infrastructure management is handled by the provider.
What is Hybrid Cloud?
A combination of on-premises and cloud, allowing organizations to use both environments. Critical data or legacy systems might stay on-prem, while other processes leverage the cloud for scalability and cost-effectiveness.
What is Event-Driven Architecture?
An event-driven workflow encompasses the ability to create, update, and asynchronously move events across various parts of the data engineering lifecycle.
For example, in an event-driven workflow, an event is produced, routed, and then consumed. In an event-driven architecture, events are passed between loosely coupled services.
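A toy sketch of the produce → route → consume flow using only the Python standard library; the event payload is made up:

```python
import queue
import threading

events = queue.Queue()  # stands in for a message broker / event router

def consumer():
    event = events.get()      # the consuming service reacts when the event arrives,
    print("handling", event)  # without being called directly by the producer
    events.task_done()

threading.Thread(target=consumer).start()
events.put({"type": "order_created", "order_id": 42})  # produce the event
events.join()  # wait until the consumer has processed it
```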
What are the two different types of data architecture projects?
Brownfield Projects
Refactoring and reorganizing an existing architecture.
Greenfield Projects
Fresh start, unconstrained by the history or legacy of a prior architecture.
What is a Data Warehouse Architecture?
A central repository that aggregates data from multiple sources. Data warehouse architecture can be single-tier, two-tier, or three-tier.
What is a Data Lake Architecture?
A flat architecture that stores large volumes of data in its native format. Data lakes can store any type of data, including text and images, from any source.
What should be considered when choosing data technologies?
Team size and capabilities
Speed to market
Interoperability
Ensure that it interacts and operates with other technologies.
Cost Optimization and Business Value
Gaining back the amount spent on data projects + more.
Location
On Premises
Cloud
Hybrid Cloud
Build Versus Buy
Monolith vs. modular
Serverless Versus Servers
Optimization, performance, and the benchmark wars
The undercurrents of the data engineering lifecycle
How do you measure the payback on technology?
Your organization expects a positive ROI from your data projects. The following should be considered:
Direct Costs
Indirect Costs
Capital expenses
Operational expenses
What is an entity?
A noun: a person, place, thing, or event that you want to track data on.
What is an attribute?
A property of an entity
What is a relationship?
How entities are related to one another, the association between entities
What is a One-to-many Relationship?
When one entity can be related to many (one or more) instances of another entity
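A minimal one-to-many sketch (one department, many employees) with hypothetical tables via sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dept_no INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT,
    dept_no INTEGER REFERENCES department(dept_no)  -- many employees, one department
);
""")
conn.execute("INSERT INTO department VALUES (1, 'Accounting')")
conn.executemany("INSERT INTO employee VALUES (?, ?, 1)", [(10, "Ada"), (11, "Grace")])
```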
What is a Repeating Group?
When there are multiple values for one or more attributes
Define first normal form:
A table in first normal form has no repeating groups, and each record is unique.
Without first normal form, there can be issues with duplicated data, and insert, update, and delete anomalies.
Define second normal form:
Builds on 1NF by ensuring that all non-key attributes are fully dependent on the entire primary key, eliminating partial dependencies (when only part of a composite key determines a non-key attribute).
Without second normal form, you can have update anomalies, inconsistent data, insert anomalies, and delete anomalies.
Define third normal form:
A database is in third normal form if it is in second normal form and all determinants are candidate keys (the strict version of this is BCNF).
A slightly weaker form states only that there are no transitive dependencies.
As with 2NF, if a database is not in 3NF, you can experience update anomalies, inconsistent data, insert anomalies, and delete anomalies. (A worked decomposition sketch follows.)
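A worked decomposition sketch; all tables and columns are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Unnormalized: an orders table with item1/item2/... columns (a repeating group)
-- and a customer_city column that depends on customer_id, not on order_id.

-- 1NF: one row per order item; the composite key makes each record unique.
CREATE TABLE order_item (
    order_id  INTEGER,
    item_id   INTEGER,
    quantity  INTEGER,
    PRIMARY KEY (order_id, item_id)
);

-- 2NF: item_name depends on item_id alone (part of the composite key),
-- so it moves to its own table, removing the partial dependency.
CREATE TABLE item (item_id INTEGER PRIMARY KEY, item_name TEXT);

-- 3NF: customer_city depends on customer_id, a non-key determinant,
-- so it moves out too, removing the transitive dependency.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_city TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)
);
""")
```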
Write the table definition using the common shorthand notation used in the book: e.g. For a department, store the department number and department name. Department number can be used as the primary key.
Department(Department Number, Department Name)