How do you run notebooks in Databricks from Data Factory?
Generate an access token for your Azure Databricks workspace, then create a linked service in your Azure Data Factory resource that uses the access token to connect to Azure Databricks
How do you save a dataframe as a partitioned set of files?
use the partitionBy method; dated_df.write.partitionBy("Year").mode("overwrite").parquet("/data")
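A hedged follow-up sketch, assuming the /data path and Year column from the card above; reading the data back and filtering on the partition column lets Spark prune the folders for other years:

# Read the partitioned output back; the filter on the partition column
# means Spark only scans the matching Year=... folder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_2024 = spark.read.parquet("/data").filter("Year = 2024")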
How does ADLS organize data?
organizes the stored data into a hierarchy of directories and subdirectories, much like a file system. As a result, data processing requires fewer computational resources, reducing both time and cost
How does ADLS store data?
stores data in an HDFS-compatible file system hosted in Azure Storage; with this feature, you can store the data in one place and access it through compute technologies like Azure Databricks and Azure Synapse Analytics (ASA) without moving the data between environments
How is Azure Databricks used for Data Science and Engineering?
provides Apache Spark-based processing and analysis of large volumes of data in a data lake
How is Azure Databricks used for machine learning?
supports machine learning workloads that involve data exploration and preparation, training and evaluating machine learning models, and serving models to generate predictions for applications and analyses
How is Azure Databricks used for SQL?
supports SQL-based querying for data stored in tables in a SQL Warehouse. This capability enables data analysts to query, aggregate, summarize, and visualize data using familiar SQL syntax and a wide range of SQL-based data analytical tools
How is data in a lake database stored?
stored in the data lake as Parquet or CSV files. The files can be managed independently of the database tables
How is granting access done in Azure Data Lake Gen2?
through RBAC
How is the underlying data in delta lakes stored?
in Parquet format, which is commonly used in data lake ingestion pipelines
What are 4 stages for processing big data solutions?
Ingest, store, prep & train, model & serve
What are Apache Spark Clusters?
Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs
What are data lakes?
a storage repository that holds large amounts of data in native, raw formats; optimized for scaling to massive volumes of data
What are data pipelines?
used to orchestrate activities that transfer and transform data. Pipelines are the primary way in which DE's implement repeatable ETL solutions that can be triggered based on a schedule or in response to events
What are hopping windows?
hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size
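A hedged PySpark Structured Streaming analog (not Azure Stream Analytics syntax) contrasting tumbling and hopping windows; the events dataframe and its event_time timestamp column are assumptions:

from pyspark.sql.functions import window, count

# Tumbling: fixed 10-minute windows that repeat and never overlap.
tumbling = events.groupBy(window("event_time", "10 minutes")).agg(count("*"))

# Hopping: 10-minute windows that start every 5 minutes, so they overlap
# and a single event can fall into more than one window.
hopping = events.groupBy(window("event_time", "10 minutes", "5 minutes")).agg(count("*"))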
What are notebooks in Azure Databricks?
an interactive environment in which you can combine text and graphics in Markdown format with cells containing code that you run interactively in the notebook sessions
What are parameters in Azure Data Factory?
key-value pairs of read-only configuration. Parameters are defined in the pipeline
What are session windows?
group events that arrive at similar times, filtering out periods of time where there is no data
What are sliding windows?
output events only for points in time when the content of the window actually changes. In other words, when an event enters or exits the window.
What are the benefits of a delta lake?
Relational tables that support querying and data modification, support for ACID transactions, data versioning and time travel, support for batch and streaming data, standard formats and interoperability
What are the benefits of a serverless SQL pool?
Familiar Transact-SQL syntax, integrated connectivity, distributed query processing, built-in query execution fault tolerance, no infrastructure to set up or clusters to maintain
What are the two SQL pools that ASA supports?
Serverless SQL pool and Dedicated SQL pool
What are tumbling windows?
used to segment a data stream into distinct time segments and perform a function against them; the key differentiators of tumbling windows are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window
What are typical operations on a dataframe?
Filtering rows and columns, renaming columns, creating new columns (often derived from existing ones), replacing null or other values
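A minimal PySpark sketch of these operations; the orders dataframe and its cust_name, amount, and discount columns are assumptions:

from pyspark.sql.functions import col

cleaned = (orders
    .select("cust_name", "amount", "discount")            # filter columns
    .filter(col("amount") > 0)                            # filter rows
    .fillna({"discount": 0.0})                            # replace null values
    .withColumnRenamed("cust_name", "customer")           # rename a column
    .withColumn("net", col("amount") - col("discount")))  # derive a new column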
What can the LAG function be used for?
to retrieve a value from a prior event, for example to find the difference between the current event and the one before it
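The card refers to the LAG function in Azure Stream Analytics queries; as a hedged illustration of the same idea, PySpark's lag window function can compute the gap between consecutive events (the events dataframe and its device_id and event_time columns are assumptions):

from pyspark.sql.functions import col, lag
from pyspark.sql.window import Window

w = Window.partitionBy("device_id").orderBy("event_time")

# Seconds elapsed since the prior event from the same device.
with_gap = events.withColumn(
    "seconds_since_prev",
    col("event_time").cast("long") - lag("event_time").over(w).cast("long"))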
What data integration capabilities does integration runtime provide?
Data flow, data movement, activity dispatch, SSIS package execution
What does a Linked Service in Data Factory enable users to do?
enables you to ingest the data from a data source in readiness to prepare the data for transformation and/or analysis
What does it mean to define the data source?
Identify source details such as the resource group, subscription, and identity information such as a key or secret
What does it mean to define the data?
Identify the data to be extracted. Define the data by using a database query, a set of files, or an Azure Blob storage name for blob storage
What does RBAC mean?
In Azure Data Lake Gen2, role-based access control is available; built in security groups include ReadOnlyUsers, WriteAccessUsers, and FullAccessUsers
What does the lookup task allow you to do in a data flow activity?
allows you to take a value and search for it in a different source to retrieve another value or set of values
What function is used to define and use parameters in a notebook?
the dbutils.widgets functions; for example, dbutils.widgets.text defines a text widget and dbutils.widgets.get reads its value
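A short hedged sketch of the widgets API in a Databricks notebook cell, where dbutils is provided automatically by the notebook environment (the folder widget name is an assumption):

# Define a text widget with a default value, then read its current value.
dbutils.widgets.text("folder", "/data")
folder = dbutils.widgets.get("folder")
print(f"Processing files in {folder}")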
What function is used to find the time difference?
DATEDIFF
What is a control flow?
an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger
What is a databricks file system (DBFS)?
while each cluster node has its own local file system, the nodes in a cluster have access to a shared, distributed file system in which they can access and operate on data files
What is a delta lake?
an open-source storage layer that adds relational database semantics to Spark-based data lake processing
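A minimal hedged sketch of those semantics using the standard delta format in Spark; the /delta/products path is an assumption:

# Write a dataframe as a Delta table, then use time travel to read an
# earlier version of the data back.
df.write.format("delta").mode("overwrite").save("/delta/products")

previous = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/delta/products"))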
What is a Hive metastore?
The Hive metastore is a repository that stores the metadata for all tables in a Hive database. It allows that metadata to be queried, while also protecting it from writes that may ultimately undermine the structure of other tables within Hive.
What is a lake database?
provides a relational metadata layer over one or more files in a data lake
What is a managed table?
defined without a specified location, and the data files are stored within the storage used by the metastore; dropping the table not only removes its metadata from the catalog, but also deletes the folder in which its data files are stored
What is aggregate?
define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns
What is alter row?
set insert, delete, update, and upsert policies on rows. You can add one-to-many conditions as expressions
What is an activity in Azure Data Factory?
typically contain the transformation logic or the analysis commands of Azure Data Factory; activities include the Copy Activity, which can be used to ingest data from a variety of data sources
What is an external table?
defined for a custom file location, where the data for the table is stored; dropping the table deletes the metadata from the catalog, but doesn't affect the data files
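A hedged Spark sketch contrasting the table types from the last two cards; the table names and path are assumptions:

# Managed table: the metastore controls both the metadata and the data files.
df.write.saveAsTable("sales_managed")

# External table: metadata in the catalog, data at an explicit location;
# dropping the table leaves the files under /data/sales_ext intact.
df.write.option("path", "/data/sales_ext").saveAsTable("sales_ext")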
What is analytical data?
data that has been optimized for analysis and reporting, often in a data warehouse
What is Apache Spark?
distributed data processing framework that enables large-scale data analytics by coordinating work across multiple processing nodes in a cluster
What is Azure Cosmos DB?
a globally distributed, multi-model database
What is Azure Data Factory?
the cloud-based ETL and data integration service that allows you to create data-driven workflows (pipelines) for orchestrating data movement and transforming data at scale
What is Azure Data Lake Storage Gen2?
provides a cloud-based solution for data lake storage in Azure, and underpins many large-scale analytics solutions built on Azure
What is Azure Databricks?
a comprehensive platform that is optimized for three specific types of data workload and associated user personas: Data science and Engineering, Machine learning, and SQL
What is Azure HDInsight?
provides technologies to help you ingest, process, and analyze big data. It supports batch processing, data warehousing, IoT, and data science
What is Azure Purview?
a unified data governance service that helps you manage and govern your on-premises, multi-cloud and software-as-a-service (SaaS) data
What is Azure SQL Database?
supports structures such as relational data and unstructured formats such as spatial and XML data
What is Azure Stream Analytics used for?
process streaming data and respond to data anomalies in real time; use it for Internet of Things (IoT) monitoring, web logs, remote patient monitoring, and point of sale (POS) systems
What is Azure Synapse Analytics?
provides capabilities for running data pipelines and managing analytical data in a data lake or relational data warehouse; a cloud-based data platform that brings together enterprise data warehousing and Big Data analytics
What is conditional split?
route rows of data to different streams based on matching conditions
What is data consolidation?
the process of combining data that has been extracted from multiple data sources into a consistent structure usually to support analysis
What is data integration?
establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems
What is data transformation?
operational data usually needs to be transformed into a suitable structure and format for analysis, often as part of an extract, transform, and load (ETL) process
What is derived column?
generate new columns or modify existing fields using the data flow expression language
What is flatten?
take array values inside hierarchical structures such as JSON and unroll them into individual rows
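The flatten transformation belongs to ADF mapping data flows; a hedged PySpark equivalent uses explode to unroll array values into individual rows (the orders dataframe and its items array column are assumptions):

from pyspark.sql.functions import explode

# One output row per element of the items array.
flat = orders.select("order_id", explode("items").alias("item"))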
What is integration runtime in Azure Data Factory?
enables Azure Data Factory to bridge between the activity and linked services objects
What is lookup?
enables you to reference data from another source
What is managed identity?
this identity can be used to authorize requests for data access in Azure Storage; before accessing the data, the storage administrator must grant permissions to the managed identity
What is mapping data flow?
provides an environment for building a wide range of data transformations visually, without the need to write code
What is needed in the Azure SQL database to ensure lineage?
You need a master key in the Azure SQL database for lineage to work.
What is new branch?
apply multiple sets of operations and transformations against the same data stream
What is one of the core resources in a synapse analytics workspace?
data lake
What is operational data?
transactional data that is generated and stored by applications, often in a relational or non-relational database
What is partitioning?
an optimization technique that enables Spark to maximize performance across the worker nodes
What is pivot?
an aggregation where one or more grouping columns has distinct row values transformed into individual columns
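This card describes the pivot transformation in ADF mapping data flows; a hedged PySpark analog is shown below (the sales dataframe and its columns are assumptions):

# Distinct values of "year" become individual columns, aggregated by sum.
pivoted = sales.groupBy("product").pivot("year").sum("revenue")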
What is semi-structured data?
data such as JavaScript Object Notation (JSON) files, which may require flattening prior to loading into your source system
What is Shared access signature (SAS)?
provides delegated access to resources in a storage account without sharing account keys; gives you granular control over the type of access you grant to clients who hold the SAS
What is streaming data?
refers to perpetual sources of data that generate data values in real time, often relating to specific events
What is structured data?
primarily comes from table-based source systems such as a relational database or from a flat file such as a CSV file
What is the benefit of ELT?
you can store data in its original format, be it JSON, XML, PDF, or images; you can define the data's structure during the transformation phase, so you can use the source data in multiple downstream systems
What is the primary structure for working with data in Apache Spark?
dataframe
What is the right design pattern to follow when it comes to the folder structure and file naming convention for the streaming data?
\DataSet\YYYY\MM\DD\datafile_YYYY_MM_DD.csv
What is Type 1 SCD?
always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten; this approach is common for columns that store supplementary values
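A hedged Delta Lake sketch of a Type 1 overwrite-on-change load in Python; the dim_customer table, updates dataframe, and column names are assumptions:

from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "dim_customer")

# Type 1: matched rows are overwritten with the latest source values,
# so no change history is kept.
(dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())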
What is Type 2 SCD?
keeps a history of changes in dimension members by adding a new row to the table for each change
What is Type 3 SCD?
supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member
What is Type 6 SCD?
Combines Type 1, 2, and 3; in this design you also store the current value in all versions of that entity so you can easily report on the current value or the historical value
What is unpivot?
pivot columns into row values
What is unstructured data?
includes data stored as key-value pairs that don't adhere to standard relational models; other commonly used types of unstructured data include Portable Document Format (PDF)
What is useful about Azure Blob storage?
you can store large amounts of unstructured data in a flat namespace within a blob container
What is user identity?
also known as "pass-through", is an authorization type where the identity of the Azure AD user that logged into the serverless SQL pool is used to authorize access to the data
What key feature does Azure Synapse Analytics have?
the Data Movement Service (DMS) coordinates and transports data between compute nodes as necessary, but you can use a replicated table to reduce data movement and improve performance
What occurs during the extraction process?
data engineers define the data and its source
What occurs during the load process?
many Azure destinations can accept data formatted as JavaScript Object Notation (JSON), a file, or a blob. You might need to write code to interact with application APIs
What occurs during the transformation process?
splitting, combining, deriving, adding, removing, or pivoting columns; mapping fields between the data source and the data destination. You might also need to aggregate or merge data
What requirements must be met to create and manage data factory objects?
To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above; to create and manage resources with PowerShell or the SDK, the Contributor role at the resource level or above is sufficient
What service would you use for a requirement listed as such: provide an environment to support an Analytical data store
Azure Synapse Analytics
What statement can be used to implement column level security in the table?
GRANT
What statement can be used to implement row level security in the table?
CREATE SECURITY POLICY
What steps are typically performed for pipelines in Azure Data Factory?
Connect and collect, Transform and enrich, Publish, Monitor
What two types of authentication are supported in the serverless SQL pools?
SQL Authentication and Azure Active Directory Authentication