Exam DP-203: Data Engineering on Microsoft Azure

400 Terms

1
New cards

Which authorization types are supported for Blob Storage in Azure Synapse serverless SQL pools?

User Identity (with SAS token for non-firewalled storage), SAS, Managed Identity.

2
New cards

Which data processing framework will a data engineer use to ingest data onto cloud data platforms in Azure?

Extract, load, and transform (ELT)

3
New cards

For which type of data can the schema be defined at query time?

Unstructured data

4
New cards

Duplicating customer content for redundancy and meeting service-level agreements (SLAs) in Azure meets which cloud technical requirement?

High availability

5
New cards

Azure Blob

A scalable object store for text and binary data

6
New cards

Azure Files

Managed file shares for cloud or on-premises deployments

7
New cards

Azure Queue

A messaging store for reliable messaging between application components

8
New cards

Azure Table

A NoSQL store for schemaless storage of structured data

9
New cards

Which data platform technology is a globally distributed, multimodel database that can perform queries in less than a second?

Azure Cosmos DB

10
New cards

Which data store is the least expensive choice when you want to store data but don't need to query it?

Azure Storage

11
New cards

Which Azure service is the best choice to store documentation about a data source?

Azure Purview

12
New cards

Another term for structured data

Relational Data

13
New cards

Another term for semi-structured data

NoSQL

14
New cards

What is the process called for converting data into a format that can be transmitted or stored?

serialization

15
New cards

What are three common serialization languages?

XML, JSON, and YAML
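The serialization round trip described above can be sketched with Python's stdlib `json` module; the record contents are illustrative sample data:

```python
import json

# A record that could represent a row of semi-structured data (sample values).
record = {"id": 1, "name": "Contoso", "tags": ["retail", "eu"]}

# Serialize: convert the in-memory object to a JSON string suitable for
# transmission or storage.
payload = json.dumps(record)

# Deserialize: reconstruct an equivalent object from the stored string.
restored = json.loads(payload)
assert restored == record
```

The same round trip works with XML or YAML serializers; JSON is shown because it needs only the standard library.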

16
New cards

Examples of unstructured data

Media files, like photos, videos, and audio files
Microsoft 365 files, like Word documents
Text files
Log files

17
New cards

What type of data is a JSON file?

Semi-structured

18
New cards

What type of data is a video?

Unstructured

19
New cards

What is a transaction?

A logical group of database operations that execute together.

20
New cards

What does the acronym ACID stand for?

Atomicity, Consistency, Isolation, Durability

21
New cards

What is atomicity?

A transaction must execute exactly once, and it must be atomic. Either all of the work is done or none of it is. Operations within a transaction usually share a common intent and are interdependent.

22
New cards

What is consistency?

Ensures that the data is consistent both before and after the transaction.

23
New cards

What is Isolation?

Ensures that each transaction is unaffected by other transactions.

24
New cards

What is durability?

That changes made as a result of a transaction are permanently saved in the system. The system saves data that's committed so that even in the event of a failure and system restart, the data is available in its correct state.
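The atomicity and rollback behavior described in the ACID cards above can be sketched with Python's stdlib `sqlite3`; the table and account names are illustrative, not from any Azure service:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Atomicity: both updates succeed together or neither is applied.
try:
    with conn:  # the connection context manager commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure before the second update")
except RuntimeError:
    pass

# The partial transaction was rolled back: alice still has her full balance.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
assert balance == 100
```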

25
New cards

What does OLAP stand for?

Online analytical processing

26
New cards

What does OLTP stand for?

Online Transaction Processing

27
New cards

What do OLTP systems commonly support?

Many users, quick response times, and large volumes of data.

28
New cards

What do OLAP systems commonly support?

Fewer users and longer response times; they can be less available and typically handle large, complex queries.

29
New cards

Which type of transactional database system would work best for product data?

OLTP (product data is transactional, with frequent small reads and writes)

30
New cards

What does Azure Blob Storage support?

Supports storing files like photos and videos.

31
New cards

What is the data classification for Business data?

Structured

32
New cards

What are the operations of business data?

Read-only, complex analytical queries across multiple databases

33
New cards

What are the latency and throughput of business data?

Some latency in the results is expected based on the complex nature of the queries.

34
New cards

What is latency?

A performance metric that measures the time between a request and its response (for example, for disk reads and writes).

35
New cards

What is the best service for business data?

Azure SQL Database
Business data most likely will be queried by business analysts, who are more likely to know SQL than any other query language. You can use Azure SQL Database as a solution by itself, but if you pair it with Azure Analysis Services, data analysts can create a semantic model over the data in Azure SQL Database.

36
New cards

Three primary types of data

Structured, Semi-structured, and Unstructured

37
New cards

Structured data

Comes from table-based source systems such as a relational database or from a flat file such as a comma separated (CSV) file. The primary element of a structured file is that the rows and columns are aligned consistently throughout the file.

38
New cards

Semi-Structured data

Data such as JavaScript object notation (JSON) files, which may require flattening prior to loading into your source system. When flattened, this data doesn't have to fit neatly into a table structure.
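The flattening step mentioned above can be sketched in Python with stdlib `json`; the document shape and field names are hypothetical sample data:

```python
import json

# A nested JSON document, as might arrive from an application (sample data).
doc = json.loads("""
{
  "orderId": 42,
  "customer": {"name": "Fabrikam", "country": "NL"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ]
}
""")

# Flatten: one table-like row per order item, with parent fields
# repeated on each row so the result fits a rows-and-columns structure.
rows = [
    {
        "orderId": doc["orderId"],
        "customerName": doc["customer"]["name"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in doc["items"]
]
assert len(rows) == 2
```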

39
New cards

Unstructured data

Data that doesn't adhere to standard relational models, such as data stored as key-value pairs. Other common types of unstructured data include portable document format (PDF) files, word processor documents, and images.

40
New cards

Data Engineer tasks in Azure (Data Operations)

Data integration, data transformation, and data consolidation

41
New cards

Data integration

Establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.

42
New cards

Data transformation

Data usually needs to be transformed into suitable structure and format for analysis, often as part of an extract, transform, and load (ETL) process; though increasingly a variation in which you extract, load, and transform (ELT) the data is used to quickly ingest the data into a data lake and then apply "big data" processing techniques to transform it.

43
New cards

Data consolidation

Process of combining data that has been extracted from multiple data sources into a consistent structure - usually to support analytics and reporting.

44
New cards

Operational Data

Transactional data that is generated and stored by applications, often in a relational or non-relational database.

45
New cards

Streaming data

Streaming data refers to perpetual sources of data that generate data values in real-time, often relating to specific events.

46
New cards

Data pipelines

Used to orchestrate activities that transfer and transform data.

47
New cards

Data lakes

A storage repository that holds large amounts of data in native, raw formats.

48
New cards

Data warehouses

Centralized repository of integrated data from one or more disparate sources.

49
New cards

Apache Spark

Is a parallel processing framework that takes advantage of in-memory processing and a distributed file storage

50
New cards

Core Azure Technologies

Azure Synapse Analytics
Azure Data Lake Storage Gen2
Azure Stream Analytics
Azure Data Factory
Azure Databricks

51
New cards

In a data lake, data is stored in?

Files

52
New cards

Data in a relational database table is

Structured

53
New cards

Which of the following Azure services provides capabilities for running data pipelines AND managing analytical data in a data lake or relational data warehouse?

Azure Synapse Analytics

54
New cards

Benefit Azure Data Lake Storage

Data Lake Storage is designed to handle this variety and volume of data at exabyte scale while securely delivering hundreds of gigabits of throughput.

55
New cards

Azure Blob storage

Store large amounts of unstructured data in a flat namespace within a blob container.

56
New cards

Stages Processing Big Data

Ingest
Store
Prep and train
Model and serve

57
New cards

Stages Processing Big Data: Model and serve

Involves the technologies that will present the data to users.

58
New cards

Technologies of model and serve

Microsoft Power BI
Azure Synapse Analytics

59
New cards

Stages Processing Big Data: Prep and train

Identifies the technologies that are used to perform data preparation and model training and scoring for machine learning solutions.

60
New cards

Technologies of Prep and train

Azure Synapse Analytics
Azure Databricks
Azure HDInsight
Azure Machine Learning

61
New cards

Stages Processing Big Data: Store

Identifies where the ingested data should be placed.

62
New cards

Technologies of Store

Azure Data Lake Storage Gen2

63
New cards

Stages Processing Big Data: Ingest

Identifies the technology and processes that are used to acquire the source data.

64
New cards

Technologies for batch ingest

Azure Synapse Analytics
Azure Data Factory

65
New cards

Technologies for real-time ingest

Apache Kafka for HDInsight
Stream Analytics

66
New cards

Azure Data Lake Storage Gen2 stores data in...

An HDFS-compatible file system hosted in Azure Storage.

67
New cards

What option must you enable to use Azure Data Lake Storage Gen2?

Hierarchical namespace

68
New cards

Descriptive analytics

Answers the question "What is happening in my business?".

69
New cards

Diagnostic analytics

Answering the question "Why is it happening?".

70
New cards

Predictive analytics

Answer the question "What is likely to happen in the future based on previous trends and patterns?"

71
New cards

What is Azure Synapse Analytics?

Azure Synapse Analytics is a centralized service for data storage and processing with an extensible architecture. It integrates commonly used data stores, processing platforms, and visualization tools.

72
New cards

What is a Synapse Analytics workspace?

A Synapse Analytics workspace defines an instance of the Synapse Analytics service in which you manage the services and data resources for your analytics solution. You create it in an Azure subscription interactively using the Azure portal, Azure PowerShell, the Azure command-line interface (CLI), or an Azure Resource Manager or Bicep template.

73
New cards

What is a data lake in the context of Azure Synapse Analytics?

In a Synapse Analytics workspace, a data lake is a core resource where data files can be stored and processed at scale. A workspace typically has a default data lake, implemented as a linked service to an Azure Data Lake Storage Gen2 container.

74
New cards

What role do pipelines play in Azure Synapse Analytics?

Pipelines in Azure Synapse Analytics orchestrate activities necessary to retrieve data from sources, transform the data, and load the transformed data into an analytical store. They are based on the same underlying technology as Azure Data Factory.

75
New cards

How does Azure Synapse Analytics support SQL-based data querying and manipulation?

Azure Synapse Analytics supports SQL-based data querying and manipulation through two kinds of SQL pool: a built-in serverless pool for querying file-based data in a data lake, and custom dedicated SQL pools that host relational data warehouses.

76
New cards

How is Apache Spark used in Azure Synapse Analytics?

In Azure Synapse Analytics, you can create Spark pools and use interactive notebooks for data analytics, machine learning, and data visualization. Spark performs distributed processing of files in a data lake.

77
New cards

What is Azure Synapse Data Explorer?

Azure Synapse Data Explorer is a data processing engine in Azure Synapse Analytics, based on the Azure Data Explorer service. It uses Kusto Query Language (KQL) for high performance, low-latency analysis of batch and streaming data.

78
New cards

How can Azure Synapse Analytics be integrated with other Azure data services?

Azure Synapse Analytics can be integrated with other Azure data services for end-to-end analytics solutions. Integrations include Azure Synapse Link, Microsoft Power BI, Microsoft Purview, and Azure Machine Learning.

79
New cards

When is Azure Synapse Analytics used for large-scale data warehousing?

Azure Synapse Analytics is used for large-scale data warehousing when there's a need to integrate all data, including big data, for analytics and reporting purposes from a descriptive analytics perspective, independent of its location or structure.

80
New cards

How does Azure Synapse Analytics support advanced analytics?

Azure Synapse Analytics enables organizations to perform predictive analytics using both its native features and by integrating with other technologies such as Azure Machine Learning.

81
New cards

How is Azure Synapse Analytics used for data exploration and discovery?

The serverless SQL pool functionality in Azure Synapse Analytics enables Data Analysts, Data Engineers, and Data Scientists to explore data within the data estate. This supports data discovery, diagnostic analytics, and exploratory data analysis.

82
New cards

How does Azure Synapse Analytics support real-time analytics?

Azure Synapse Analytics can capture, store, and analyze data in real-time or near-real time with features like Azure Synapse Link, or through the integration of services like Azure Stream Analytics and Azure Data Explorer.

83
New cards

How does Azure Synapse Analytics facilitate data integration?

Azure Synapse Pipelines in Azure Synapse Analytics enables ingestion, preparation, modeling, and serving of data to be used by downstream systems.

84
New cards

What does integrated analytics mean in the context of Azure Synapse Analytics?

Integrated analytics in Azure Synapse Analytics refers to the ability to perform a variety of analytics on data in a cohesive solution, removing the complexity by integrating the analytics landscape into one service. This allows more focus on working with data to bring business benefit rather than spending time provisioning and maintaining multiple systems.

85
New cards

Which feature of Azure Synapse Analytics enables you to transfer data from one store to another and apply transformations to the data at scheduled intervals?

Pipelines

86
New cards

You want to create a data warehouse in Azure Synapse Analytics in which the data is stored and queried in a relational data store. What kind of pool should you create?

Dedicated SQL Pool

87
New cards

A data analyst wants to analyze data by using Python code combined with text descriptions of the insights gained from the analysis. What should they use to perform the analysis?

A notebook connected to an Apache Spark Pool

88
New cards

What are the two runtime environments offered by Azure Synapse SQL in Azure Synapse Analytics?

The two runtime environments are Serverless SQL pool, used for on-demand SQL query processing primarily with data in a data lake, and Dedicated SQL pool, used to host enterprise-scale relational database instances for data warehouses.

89
New cards

What are some benefits of using Serverless SQL pool in Azure Synapse Analytics?

Serverless SQL pool benefits include familiar Transact-SQL syntax, integrated connectivity from various BI and ad-hoc querying tools, distributed query processing, built-in query execution fault-tolerance, no infrastructure or clusters to maintain, and a pay-per-query model.

90
New cards

When should Serverless SQL pools in Azure Synapse Analytics be used?

Serverless SQL pools are best suited for querying data residing in a data lake, handling unplanned or "bursty" workloads, and when exact costs for each query need to be monitored and attributed. They are not recommended for OLTP workloads or tasks requiring millisecond response times.

91
New cards

What are some common use cases for Serverless SQL pools in Azure Synapse Analytics?

Common use cases include data exploration, where initial insights about the data are gathered, data transformation, which can be performed interactively or as part of an automated data pipeline, and creating a logical data warehouse where data is stored in the data lake but abstracted by a relational schema for use by client applications and analytical tools.

92
New cards

What is a serverless SQL pool used for?

Querying data files in various common file formats, including CSV, JSON, and Parquet.

93
New cards

Which SQL function is used to generate a tabular rowset from data in one or more files?

OPENROWSET

94
New cards

What does the BULK parameter do in an OPENROWSET function?

Specifies the full URL of the location in the data lake that contains the data files.

95
New cards

How do you specify the type of data being queried in OPENROWSET?

Using the FORMAT parameter.

96
New cards

How can you include or exclude files in the query using the BULK parameter?

By using wildcards in the BULK parameter.

97
New cards

How do you query a delimited text file using OPENROWSET?

By using the OPENROWSET function with the csv FORMAT parameter and other parameters as required to handle the specific formatting details.

98
New cards

What does the PARSER_VERSION parameter do?

Determines how the query interprets the text encoding used in the files.

99
New cards

How can you specify the rowset schema in OPENROWSET?

By using a WITH clause to override the default column names and inferred data types, providing a schema definition.
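The OPENROWSET parameters covered in the cards above can be put together as follows. This is a minimal Python sketch that only assembles the T-SQL string; the storage URL and column names are hypothetical, and actually running the query would require a SQL driver such as pyodbc connected to the workspace's serverless SQL endpoint:

```python
# Hypothetical data lake location (wildcard includes all matching CSV files).
url = "https://mydatalake.dfs.core.windows.net/files/sales/*.csv"

# BULK gives the file location, FORMAT the file type, PARSER_VERSION the
# text parser, and the WITH clause overrides the inferred rowset schema.
query = f"""
SELECT TOP 100 *
FROM OPENROWSET(
    BULK '{url}',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) WITH (
    OrderId INT,
    Amount DECIMAL(10, 2)
) AS rows
"""

assert "OPENROWSET" in query
```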

100
New cards

How do you query a JSON file using OPENROWSET?

Use the csv format with FIELDTERMINATOR, FIELDQUOTE, and ROWTERMINATOR all set to 0x0b, and a WITH clause that defines a single NVARCHAR(MAX) column; the JSON in that column can then be parsed with functions such as JSON_VALUE.
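The JSON-through-csv technique on this card can be sketched the same way, as a Python string holding the T-SQL; the storage URL and JSON property name are hypothetical placeholders:

```python
# Hypothetical data lake location for the JSON files.
url = "https://mydatalake.dfs.core.windows.net/files/products/*.json"

# Setting all three terminators to 0x0b makes the csv reader return each
# whole JSON document as one NVARCHAR(MAX) value, which JSON_VALUE can parse.
query = f"""
SELECT JSON_VALUE(doc, '$.product_name') AS product
FROM OPENROWSET(
    BULK '{url}',
    FORMAT = 'csv',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) AS rows
"""

assert "JSON_VALUE" in query
```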