Which system is a more natural fit for OLTP?
A. Managed Machine Learning platforms
B. Data Warehouse
C. Data Lake
D. RDBMS
RDBMS
A Datacube is best thought of as a(n)
A. specialized hardware for fast analysis of massive data
B. data structure, more specifically, a sophisticated nested array
C. archival service provided by AWS
D. function that structures and compresses data
Data structure, more specifically, a sophisticated nested array
In general, it is very easy and straightforward to transform data from a SQL database into an OLAP cube.
[T/F]
False (Correct! OLAP cubes require that data teams manage complicated pipelines to transform data from a SQL database into OLAP cubes)
Select all that apply: What are some commonly available datacube operations?
A. Slicing
B. Dicing
C. Drill Up / Down
D. Roll-up
E. Pivot
A. Slicing
B. Dicing
C. Drill Up / Down
D. Roll-up
E. Pivot
Which of these would probably be best for storing data retrieved by a key or a sequence of keys?
A. MonetDB
B. SybaseIQ
C. Vertica
D. BigTable
BigTable: As a wide-column store, BigTable is specialized for access by a key
If your primary interest is the richest possible analysis capabilities, which of these two options would likely be the better choice?
A. Column-Oriented Data Warehouse
B. OLAP Datacubes
OLAP Datacubes: Correct! RDBMSs are often limited by the constraints of SQL
Suppose a table contains 10000 rows and 100 columns. A query that uses all of the rows and 5 columns will need to read approximately what percentage of the data contained in the table if you are using a traditional row-based RDBMS system?
A. Significantly less than 5%
B. Approximately 5%
C. Significantly more than 5%
Significantly more than 5%
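A quick back-of-the-envelope check of why the answer is "significantly more than 5%" (a small Python sketch; the numbers come straight from the question):

```python
# Row store: each of the 10,000 rows is read in full (all 100 columns),
# even though the query only needs 5 of them.
rows, cols, cols_needed = 10_000, 100, 5

row_store_fraction = (rows * cols) / (rows * cols)            # ~100% of the table
column_store_fraction = (rows * cols_needed) / (rows * cols)  # ~5% with a column store

print(row_store_fraction, column_store_fraction)              # 1.0 0.05
```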
A Data Lake is:
A. a new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
B. a new type of data repository for storing massive amounts of structured data in a single location, rather than spread over multiple datacenters, in order to exploit data locality to speed the analysis
C. a new type of data repository for storing massive amounts of unstructured data in a single location for processing, cleaning, and structuring
A new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
Today, OLAP cubes are always designed to fit in the hosting computer's main memory to maximize analytical performance [T/F]
False: Today, OLAP cubes refer specifically to contexts in which these data structures far outstrip the size of the hosting computer's main memory
Redshift, like most Columnar Stores, makes it easy to update blocks.
A. Redshift, like most Columnar Stores, makes it easy to update blocks.
B. False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
C. True. Redshift, like most Columnar Stores, are write-optimized, so updates are easy
D. False. Redshift is not a Columnar Store, but a data pipeline that connects Columnar Stores to analysis engines
False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
Why is MapReduce not efficient for large-scale graph processing?
A. The map function is a computational bottleneck.
B. It is not fault-tolerant enough.
C. It produces too much communication between stages.
D. It causes load imbalance.
C. It produces too much communication between stages.
MapReduce tends to be inefficient here because the graph state must be stored at each stage of the graph algorithm, and each computational stage produces a lot of communication between stages.
Graph computations involve local data (a small part of the graph surrounding a vertex), and the connectivity between vertices is sparse. The data may not all fit on one node, which makes graph algorithms hard to express in the map/reduce model.
You want to build a shortest path algorithm using parallel breadth-first search in Pregel. Which of the following pseudo-codes is the proper "compute function" for this program?
A. compute(list of edges) -> return list of messages
B. compute(list of vertexes) -> return list of messages
C. compute(list of messages) -> return list of messages
D. compute(graph) -> return list of messages
C. compute(list of messages) -> return list of messages
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
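A rough illustration of why option C has the right shape: in the Pregel model a vertex's compute() consumes the messages sent to it in the previous superstep and emits new messages to its neighbors. Below is a minimal single-machine sketch of BFS-style shortest paths (illustrative only; the graph and names are made up, and this is not the real Pregel/Giraph API):

```python
# Toy Pregel-style single-source shortest path (unweighted BFS variant).
graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}      # adjacency list (example graph)
INF = float("inf")
dist = {v: INF for v in graph}

def compute(vertex, messages):
    """One superstep for one vertex: consume incoming messages, emit new ones."""
    candidate = min(messages) if messages else INF
    if candidate < dist[vertex]:
        dist[vertex] = candidate
        # Tell every neighbor about a path that is one hop longer.
        return [(nbr, dist[vertex] + 1) for nbr in graph[vertex]]
    return []                                      # nothing improved: vertex stays quiet

# Superstep loop: the "master" delivers messages between supersteps.
inbox = {0: [0]}                                   # kick off the source with distance 0
while inbox:
    outbox = {}
    for vertex, messages in inbox.items():
        for nbr, d in compute(vertex, messages):
            outbox.setdefault(nbr, []).append(d)
    inbox = outbox

print(dist)                                        # {0: 0, 1: 1, 2: 1, 3: 2}
```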
How is checkpointing done in Pregel?
A. It regularly uses "ping" messages.
B. Each worker communicates with the other workers.
C. The workers all reload their partition state from the most recent available checkpoint.
D. The master periodically instructs the workers to save the state of their partitions to persistent storage.
D. The master periodically instructs the workers to save the state of their partitions to persistent storage.
How does Pregel detect the failure?
A. It regularly uses "ping" messages.
B. The master periodically instructs the workers to save the state of their partitions to persistent storage.
C. Each worker communicates with the other workers.
D. The workers all reload their partition state from the most recent available checkpoint.
A. It regularly uses "ping" messages.
How is recovery being done in Pregel?
A. Each worker communicates with the other workers.
B. The master periodically instructs the workers to save the state of their partitions to persistent storage.
C. It regularly uses "ping" messages.
D. The workers all reload their partition state from the most recent available checkpoint.
D. The workers all reload their partition state from the most recent available checkpoint.
During recovery, the master re-assigns the graph partitions to the currently available workers, so unfinished work is shared out among the workers that are still alive and can process it. The workers then reload their partition state from the most recent available checkpoint and continue.
What is ZooKeeper's role in task assignment in Giraph?
A. Responsible for coordination
B. Responsible for vertices
C. Communicate with other workers
D. Responsible for the state of computation
D. ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
Master: responsible for coordination
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is Master's role for task assignment in Giraph?
A. Communicate with other workers
B. Responsible for the state of computation
C. Responsible for vertices
D. Responsible for coordination
D. Responsible for coordination
ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
Master: responsible for coordination
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is Worker's role for task assignment in Giraph?
A. Responsible for vertices
B. Responsible for coordination
C. Communicate with other workers
D. Responsible for the state of computation
A. Responsible for vertices
ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
Master: responsible for coordination
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is graph processing?
A. A graph database is any storage system that provides index-free adjacency
B. A non-relational, distributed database
C. A distributed real-time computation system
D. A framework for distributed storage and processing of large data sets
A. A graph database is any storage system that provides index-free adjacency
Correct! A graph database holds all the information and provides some way of retrieving it, and typically a graph database provides index-free adjacency.
Graph Processing:
• A graph database is any storage system that provides index-free adjacency. Has pointers to adjacent elements...
• Nodes represent entities (people, businesses, accounts...)
• Properties are pertinent information that relate to nodes
• Edges interconnect nodes to nodes or nodes to properties, and they represent the relationship between the two
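A minimal sketch of what index-free adjacency means in practice: each node object holds direct references to its neighbors, so traversal is pointer chasing rather than index lookups. This is an illustrative Python toy (the class and property names are made up, not any particular graph database's API):

```python
class Node:
    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties      # pertinent information about the entity
        self.edges = []                   # direct pointers: (relationship, neighbor)

    def connect(self, relationship, other):
        self.edges.append((relationship, other))

alice = Node("Person", name="Alice", age=30)
acme = Node("Business", name="Acme Corp")
alice.connect("WORKS_AT", acme)

# Traversal starts from a node you already hold and just follows its pointers:
for rel, neighbor in alice.edges:
    print(rel, "->", neighbor.properties["name"])   # WORKS_AT -> Acme Corp
```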
Which of these is a property of a graph database?
A. Associative data sets
B. Uses a relational model of data
C. Entity type has its table
D. Performs the same operation on large numbers of data
A. Associative data sets
A graph database holds a collection of associative data sets: you look up one item and retrieve the items associated with it, such as a node and its connections along its edges.
What is an example of a collaborative filtering application?
A. Finding the frequent item sets frequently bought together
B. Placing new items into predefined categories
C. A recommendation engine working based on the user preferences and others with similar preferences
D. Grouping similar object together without knowing the groups ahead
C. A recommendation engine working based on the user preferences and others with similar preferences
Collaborative filtering combines the preferences of many users (multiple "filters" working together) to surface just the items relevant to a given user.
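A tiny user-based collaborative-filtering sketch of option C: find the user whose past ratings look most like the target user's, and recommend what that neighbor liked. The ratings and the similarity measure below are made up for illustration:

```python
ratings = {                               # user -> {item: rating}
    "ann":  {"a": 5, "b": 4, "c": 1},
    "bob":  {"a": 5, "b": 5, "d": 4},
    "carl": {"c": 5, "d": 1},
}

def similarity(u, v):
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return float("-inf")
    # Crude similarity: negative mean absolute rating difference on shared items.
    return -sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)

def recommend(user):
    nearest = max((v for v in ratings if v != user), key=lambda v: similarity(user, v))
    # Suggest items the nearest neighbor rated highly that the user hasn't seen yet.
    return [i for i, r in ratings[nearest].items() if i not in ratings[user] and r >= 4]

print(recommend("ann"))                   # ['d'], borrowed from the similar user "bob"
```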
How do cloud providers technically handle model deployment?
A. They keep a pool of virtual machines active, so that any time a HTTPS request for a model inference arrives, one of the VMs is ready to fetch the model artifacts from the model repository and run it.
B. Amazon SageMaker stores all trained models in DynamoDB. Upon an HTTPS request, it asks DynamoDB for the model and runs the proper pre-written algorithm along with parameters fetched from DynamoDB.
C. They make a docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time a HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
D. The models are stored in JavaScript. Any browser that wishes to run a model fetches the model parameters from a cloud-based BLOB storage and simply runs the model code locally.
C. They make a docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time a HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
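A skeletal sketch of the pattern in option C: the containerized inference code lazily fetches the trained parameters from blob storage on the first request, caches them, and then serves. Everything here (the URI, fetch_from_blob_storage, the linear model) is an illustrative placeholder, not a specific cloud provider's API:

```python
import json

MODEL_URI = "blob://models/example-model/params.json"    # hypothetical artifact location
_params = None                                           # cached after the first fetch

def fetch_from_blob_storage(uri):
    # Stand-in for an object-store download (in reality: an HTTPS GET to the bucket).
    return json.dumps({"weights": [0.4, -1.2], "bias": 0.1})

def handle_inference_request(features):
    global _params
    if _params is None:                                  # cold start: pull the artifact
        _params = json.loads(fetch_from_blob_storage(MODEL_URI))
    score = sum(w * x for w, x in zip(_params["weights"], features)) + _params["bias"]
    return {"score": score}

print(handle_inference_request([1.0, 2.0]))              # {'score': -1.9...}
```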
Which group does K-means fall into?
A. Collaborative filtering
B. Clustering
C. Frequent pattern mining
D. Classification
B. Clustering
Correct! k-means clustering aims to partition n observations into k clusters
Which of the following is not a classification mechanism?
A. Convolution Neural Network Classifier
B. K-means
C. Naïve Bayes
D. Decision forests
K-means
In a typical data science workflow, what are the steps involved?
A. Model training, model exploration, cleaning the outcomes, interpreting the results.
B. Cleaning data, exploring data, model training and evaluation, obtaining results, deploying the model
C. Obtaining data, data cleaning, model training, model exploration, model deployment
D. Obtaining data, scrubbing data, exploring the dataset, train and evaluate a model, and interpreting the results
D. Obtaining data, scrubbing data, exploring the dataset, train and evaluate a model, and interpreting the results
1. Gathering Data
2. Data Preparation
3. Data Wrangling
4. Analyse Data
5. Train Model
6. Test Model
7. Deployment
OSEMN Data Science:
1. Obtain
2. Scrub
3. Explore
4. Model
5. Interpret
What is an example of an FPM application?
A. Grouping similar object together without knowing the groups ahead
B. Finding the frequent item sets frequently bought together
C. A recommendation engine working based on the user preferences and others with similar preferences
D. Placing new items into predefined categories
B. Finding the frequent item sets frequently bought together
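A toy version of that frequent-itemset idea: count how often pairs of items appear in the same basket and keep the pairs above a minimum support. The baskets are made up, and this is only a counting sketch, not a full Apriori/FP-Growth implementation:

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]
min_support = 2                       # pair must appear in at least 2 baskets

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)                       # {('bread', 'butter'): 2, ('bread', 'milk'): 2}
```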
In K-means, what is the order of the following steps?
A. For each data point, assign to the closest centroid
B. If new centroids are different from the old, re-iterate through the loop
C. For each cluster, re-compute the centroids
D. Randomly select k centroids
D -> A -> B -> C
A -> B -> C -> D
D -> A -> C -> B
A -> D -> C -> B
D -> A -> C -> B
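A minimal k-means sketch following that D -> A -> C -> B order, on made-up 1-D data (illustrative only):

```python
import random

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
k = 2

centroids = random.sample(points, k)                 # D: randomly select k centroids
while True:
    # A: assign each data point to its closest centroid
    clusters = {i: [] for i in range(k)}
    for p in points:
        nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # C: for each cluster, re-compute the centroid as the mean of its points
    new_centroids = [sum(c) / len(c) if c else centroids[i] for i, c in clusters.items()]
    # B: if the new centroids differ from the old, re-iterate through the loop
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(sorted(centroids))                             # roughly [1.0, 8.07]
```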
Which of the following best describes how Naïve Bayes works?
A. A set of items is given, and the most frequent set of items is found.
B. A set of unlabeled data points are given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
C. A set of labeled data points are given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
D. A set of data points are given and it classifies them into k groups
C. A set of labeled data points are given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
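A tiny count-based Naïve Bayes sketch matching option C: build a model from labeled points, then predict the label of a new unlabeled point. The data is invented and there is no smoothing, so this is only an illustration of the idea:

```python
from collections import Counter, defaultdict

# (features, label) training examples: (weather, temperature) -> play?
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "yes"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "no"),
    (("sunny", "mild"), "yes"),
]

label_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)            # (position, label) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    best, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / len(data)                # prior P(label)
        for i, value in enumerate(features):
            score *= feature_counts[(i, label)][value] / count   # P(value | label), assumed independent
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("sunny", "mild")))                # 'yes'
```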
What are the definitions of hyperparameter optimization and AutoML?
A. Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
B. Hyperparameter optimization means adjusting the parameters of a search space using gradient descent, and is a technical term. AutoML is a special case of hyperparameter optimization, and is marketing jargon.
C. They are both the same and used interchangeably.
D. Hyperparameter optimization refers to adjusting the parameters of a hyper plane that divides the search space in equidistance quadrants. AutoML means the cloud provider takes care of orchestrating machine learning artifacts' deployment.
A. Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
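A contrived but concrete way to see the difference described in option A: hyperparameter optimization searches the knobs of one training algorithm, while AutoML additionally searches across algorithms. The two "training algorithms" below are fake scoring functions used purely for illustration:

```python
def train_knn(k):                        # stand-in for training algorithm #1
    return 1.0 / (1 + abs(k - 7))        # pretend k = 7 gives the best validation score

def train_tree(depth):                   # stand-in for training algorithm #2
    return 1.0 / (1 + abs(depth - 4))    # pretend depth = 4 is ideal

# Hyperparameter optimization: tune the parameters of a single algorithm.
best_k = max(range(1, 15), key=train_knn)

# AutoML: also try multiple algorithms (each with its own hyperparameter grid).
candidates = [("knn", train_knn, range(1, 15)), ("tree", train_tree, range(1, 10))]
best = max(((name, p, fn(p)) for name, fn, grid in candidates for p in grid),
           key=lambda t: t[2])

print(best_k)    # 7
print(best)      # ('knn', 7, 1.0) -- the first candidate reaching the top score
```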
If we want to find which set of items in a grocery shop are frequently bought together, which of the following approaches should we use?
A. Naïve Bayes
B. Decision Forests
C. K-Means
D. FPM
D. FPM
What is load shedding?
A. Enabling a system to continue operating properly in the case of the failure of some of its components
B. The process of eliminating events to keep up with the rate of events
C. Distributing applications across many servers
D. Distributing the data across different parallel computing nodes
B. The process of eliminating events to keep up with the rate of events
Why Real-Time Stream Processing?
Real-time data processing at massive scale is becoming a requirement for businesses
• Real-time search, high frequency trading, social networks
• Have a stream of events that flow into the system at a given data rate
The processing system must keep up with the event rate or degrade gracefully by eliminating events. This is typically called load shedding.
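A minimal picture of load shedding: when events arrive faster than the system can hold, the extra events are simply dropped. Purely illustrative (real systems shed by sampling, priorities, or back-pressure):

```python
from collections import deque

MAX_QUEUE = 3
queue = deque()
dropped = 0

def on_event(event):
    global dropped
    if len(queue) >= MAX_QUEUE:
        dropped += 1                 # load shedding: eliminate the event
    else:
        queue.append(event)

for e in range(10):                  # a burst of 10 events, but room for only 3
    on_event(e)

print(list(queue), "dropped:", dropped)   # [0, 1, 2] dropped: 7
```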
Which of the following is correct?
A. A topology is a network of tuples and streams.
B. A bolt processes input streams and produces new streams.
C. A stream connects a bolt to a spout.
D. A spout can receive output from many streams.
B. A bolt processes input streams and produces new streams.
Topologies
- graph of spouts and bolts that are connected with stream groupings
- runs indefinitely (no time/batch boundaries)
Streams
- unbounded sequence of tuples that is processed and created in parallel in a distributed fashion
Spouts
- input source of streams in topology
Bolts
- processing container, which can perform transformation, filter, aggregation, join, etc.
- sinks: special type of bolts that have an output interface
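A toy in-process imitation of those pieces: a spout that emits a stream of tuples and bolts that turn one stream into another. This is just an illustrative sketch in Python, not the actual Apache Storm API:

```python
def quote_spout():                           # spout: input source of the stream
    for quote in ["to be or not to be", "to be is to do"]:
        yield quote

def split_bolt(stream):                      # bolt: processes a stream, produces a new one
    for quote in stream:
        for word in quote.split():
            yield word

def count_bolt(stream):                      # bolt acting as a sink: aggregates counts
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# "Topology": spout -> split bolt -> count bolt, connected by streams of tuples.
print(count_bolt(split_bolt(quote_spout())))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'is': 1, 'do': 1}
```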
In a Storm program that produces a sorted list of the top K most frequent words encountered across all the documents streamed into it, four kinds of processing elements (bolts in Storm) might be created: QuoteSplitterBolt, WordCountBolt, MergeBolt, and SortBolt.
What is the order in which words flow through the program?
A. QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
B. WordCountBolt, QuoteSplitterBolt, SortBolt, MergeBolt
C. WordCountBolt, QuoteSplitterBolt, MergeBolt, SortBolt
D. QuoteSplitterBolt, SortBolt, WordCountBolt, MergeBolt
A. QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
What does Trident do?
A. Provides a persistent state for the bolts, with a predefined set of characteristics
B. Provides a persistent state for the bolts, but the exact implementation is up to the user
C. Provides a persistent state for the spout, but the exact implementation is up to the user
D. Provides a persistent state for the topology, with a predefined set of characteristics
B. Provides a persistent state for the bolts, but the exact implementation is up to the user
Trident:
- Provides exactly-once semantics
- In Trident, state is a first-class citizen, but the exact implementation of state is up to you
  - There are many prebuilt connectors to various NoSQL stores like HBase
- Provides a high-level API (similar to Cascading for Hadoop)
- Uses transactions to update state
  - Processes each record exactly once
  - Per-state transaction to an external database is slow
What are streams in Apache Storm?
A. Unbounded sequences of tuples
B. A network of spouts and bolts
C. Processors of input
D. Aggregators
A. Unbounded sequences of tuples
What are spouts in Apache Storm?
A. Network of spouts and bolts
B. Unbounded sequences of tuples
C. Sources of streams
D. Processors of input
C. Sources of streams
What are topologies in Apache Storm?
A. Sources of streams
B. Unbounded sequences of tuples
C. Processors of input
D. Networks of spouts and bolts
D. Networks of spouts and bolts
In the "At Least Once" message process, what happens if there is a failure?
A. You must create and implement your load-balance algorithm.
B. Storm's natural fault-tolerance takes over.
C. Events are double processed.
D. Storm's natural load-balancing takes over.
C. Events are double processed.
How does Thrift contribute to Storm?
A. Allows Storm to be used from many languages
B. Provides load-balancing functionality
C. Enables the usage of streams
D. Provides scalability
A. Allows Storm to be used from many languages
Thrift allows users to define and create services which are both consumable by and serviceable by numerous languages
Which of the following statements is true?
A. Spark Streaming chops a stream into small batches and processes each batch independently.
B. Spark Streaming has no support for state.
C. Spark Streaming uses transactions to update state.
D. Spark Streaming treats each tuple independently and replays a record if not processed.
A. Spark Streaming chops a stream into small batches and processes each batch independently.
Correct! This approach is called micro-batching.
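Micro-batching in miniature: chop an unbounded stream into small fixed-size batches and process each batch independently. An illustrative sketch only (Spark Streaming batches by time interval and runs distributed, which this toy does not attempt):

```python
import itertools

def event_stream():                       # pretend this never ends
    for i in itertools.count():
        yield i

def micro_batches(stream, batch_size):
    while True:
        yield list(itertools.islice(stream, batch_size))

stream = event_stream()
for batch in itertools.islice(micro_batches(stream, 4), 3):
    print("processing batch:", batch, "sum =", sum(batch))
# processing batch: [0, 1, 2, 3] sum = 6
# processing batch: [4, 5, 6, 7] sum = 22
# processing batch: [8, 9, 10, 11] sum = 38
```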
Which of the following best describes Lambda architecture? (Not to be confused with AWS Lambda).
A. A serial processing pipeline of first a streaming processing system and then a batch processing system
B. Only a stream processing pipeline but with the ability to handle failures
C. A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
D. A serial processing pipeline of first a batch processing system and then a stream processing system
C. A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
Which of the following best describes Kappa architecture?
A. Only one stream processing pipeline but with the ability to handle failures
B. A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
C. A serial processing pipeline of first a streaming processing system and then a batch processing system
D. A serial processing pipeline of first a batch processing system and then a stream processing system
A. Only one stream processing pipeline but with the ability to handle failures
Correct! The Kappa Architecture tries to get away from the two parallel paths and uses only the streaming path, but makes the streaming robust enough that failures do not corrupt the state.
Which system has a great graphical UI to design dataflows?
A. NiFi
B. Druid
C. Storm
D. Spark Streaming
A. NiFi
Correct! In NiFi you can visually design a dataflow graph to process your data.
Which type of virtualization is feasible for the following scenario?
"A service needs to run an unmodified OS on a basic processor, separate from the host operating sysetm."
A. Container
B. Para-virtualization
C. Full virtualization
C. Full virtualization
Which type of virtualization is feasible for the following scenario?
"A service needs to run an unknown and unmodified OS on an advanced processor."
A. Hardware-assisted
B. Para-virtualization
A. Hardware-assisted
Which type of virtualization provides better performance for the following scenario?
"Running multiple independent applications sharing the same kernel"
A. Containers
B. Hardware-assisted full virtualization
A. Containers
Which type of virtualization provides better performance for the following scenario?
"Running two independent applications, each needs a different version of a kernel module".
A. Containers
B. Full virtualization
B. Full virtualization
Who is responsible for scheduling and memory management when using containers?
A. Host OS (Base OS) kernel
B. Virtual Machine Manager
C. Supervisor
D. Hypervisor
Host OS (Base OS) kernel
Correct! In container-based systems, the same host kernel is shared among containers, and this kernel is responsible for scheduling and memory management.
Which type of virtualization is feasible for the following scenario?
"Application that needs different custom operating systems (kernels)"
A. Para-virtualization
B. Hardware-assisted full virtualization
C. Containers
B. Hardware-assisted full virtualization
Docker is used to:
A. Run a Java program
B. Guarantee that the software will always run the same irrespective of environment
C. Send messages from one machine to another
D. Monitor progress of jobs running on OpenStack
B. Guarantee that the software will always run the same irrespective of environment
Using the Dockerfile format, and relying on union filesystem technology, Docker images downloaded from a hub guarantee specific software environments for deployment.
Kubernetes provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
A. True
B. False
A. True
A user application is not allowed to load control registers when running in kernel mode.
A. True
B. False
B. False
In x86, kernel mode code runs in ring 0 while user processes run in ring 3.
A. True
B. False
A. True
When a user application needs to handle an interrupt, it has to enter kernel mode.
A. True
B. False
A. True
Xen does not require special hardware support, such as Intel "VT-x" or "AMD-V".
A. True
B. False
A. True
Paravirtualization is a software-only virtualization approach.
Binary translation modifies all instructions on the fly and does not require changes to the guest operating system kernel.
A. True
B. False
B. False
Binary translation only modifies sensitive instructions.
In a unikernel, a user application can transition to kernel mode using special instructions.
A. True
B. False
B. False
There is only one address space in a unikernel. The application can be seen as running in kernel mode the whole time.
Is it possible to install a second application with different dependencies into an existing unikernel?
A. True
B. False
B. False
Making changes to a unikernel requires recompilation. A unikernel normally runs only one application.
Which is not a reason that a microVM is faster than a normal VM?
A. Having a minimal device model.
B. Having a minimal security protection.
C. Having a minimal guest kernel configuration.
B. Having a minimal security protection.
Using a minimal device model and kernel configuration reduces the attack surface of a microVM; it does not reduce security protection.
AWS Lambda and AWS Fargate use:
A. Container
B. microVM
B. microVM
Which generation of hardware virtualization introduced IOMMU virtualization?
A. First
B. Second
C. Third
C. Third