RDBMS
Which system is a more natural fit for OLTP?
Data warehouse
Data Lake
Managed Machine Learning platforms
RDBMS
data structure, more specifically, a sophisticated nested array
A Datacube is best thought of as a(n)
function that structures and compresses data
data structure, more specifically, a sophisticated nested array
archival service provided by AWS
specialized hardware for fast analysis of massive data
False
Correct! OLAP cubes require that data teams manage complicated pipelines to transform data from a SQL database into OLAP cubes
In general, it is very easy and straightforward to transform data from a SQL database into an OLAP cube.
True
False
ALL
Select all that apply: What are some commonly available datacube operations?
Slicing
Dicing
Drill Up / Down
Roll-up
Pivot
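For intuition, here is a minimal pandas sketch of slice, roll-up, and pivot on a tiny cube; the sales table and its column names are invented for illustration, and dice / drill-down work analogously by filtering and disaggregating.

```python
# A sketch of slice, roll-up, and pivot on a tiny "cube" built with pandas.
# The sales table and column names are invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 150, 120, 130],
})

# Build a cube-like view: dimensions (year, region) x product, measure = amount.
cube = sales.pivot_table(values="amount", index=["year", "region"],
                         columns="product", aggfunc="sum")

print(cube.loc[2024])                         # slice: fix the year dimension
print(sales.groupby("year")["amount"].sum())  # roll-up: aggregate a dimension away
print(cube.T)                                 # pivot: rotate the cube's axes
```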
BigTable
As a wide-column store, BigTable is specialized for access by a key
Which of these would probably be best for storing data retrieved by a key or a sequence of keys?
MonetDB
SybaseIQ
BigTable
Vertica
OLAP Datacubes
RDBMS are often limited by the constraints of SQL
If your primary interest is the richest possible analysis capabilities, which of these two options would likely be the better choice?
Column-Oriented Data Warehouse
OLAP Datacubes
Significantly more than 5%
Suppose a table contains 10000 rows and 100 columns. A query that uses all of the rows and 5 columns will need to read approximately what percentage of the data contained in the table if you are using a traditional row-based RDBMS system?
Significantly less than 5%
Approximately 5%
Significantly more than 5%
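The arithmetic behind this card, as a quick sketch (assuming all columns are equally wide): a row store must read whole rows, while a column store would read only the queried columns.

```python
# Back-of-the-envelope arithmetic, assuming all columns are equally wide:
# a row store reads whole rows, a column store reads only the used columns.
rows, cols, cols_used = 10_000, 100, 5

row_store_fraction = (rows * cols) / (rows * cols)       # reads every full row
col_store_fraction = (rows * cols_used) / (rows * cols)  # reads only 5 columns

print(row_store_fraction)  # 1.0  -> ~100%, significantly more than 5%
print(col_store_fraction)  # 0.05 -> ~5%, what a columnar store would read
```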
a new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
A Data Lake is
a new type of data repository for storing massive amounts of unstructured data in a single location for processing, cleaning, and structuring
a new type of data repository for storing massive amounts of structured data in a single location, rather than spread over multiple datacenters, in order to exploit data locality to speed the analysis
a new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
False
Today, OLAP cubes refer specifically to contexts in which these data structures far outstrip the size of the hosting computer's main memory
Today, OLAP cubes are always designed to fit in the hosting computer's main memory to maximize analytical performance
True
False
False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
Redshift, like most Columnar Stores, makes it easy to update blocks.
True. Redshift, like most Columnar Stores, is write-optimized, so updates are easy
False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
True, though this property is rarely used in practice since Columnar Stores are primarily utilized by read-heavy applications
False. Redshift is not a Columnar Store, but a data pipeline that connects Columnar Stores to analysis engines
It produces too much communication between stages.
MapReduce tends to be inefficient because the graph state must be stored at each stage of the graph algorithm, and each computational stage produces a great deal of communication between the stages.
Why is MapReduce not efficient for large-scale graph processing?
It brings load imbalance.
It is not fault-tolerant enough.
It produces too much communication between stages.
The map function is computationally a bottleneck.
compute(list of messages) -> return list of messages
You want to build a shortest path algorithm using parallel breadth-first search in Pregel. Which of the following pseudo-codes is the proper "compute function" for this program?
compute(list of messages) -> return list of messages
compute(list of vertices) -> return list of messages
compute(list of edges) -> return list of messages
compute(graph) -> return list of messages
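Below is a sketch of what such a compute function might look like for single-source shortest paths, written in Python rather than Pregel's actual C++ API; the vertex class, the message format, and the convention that the framework seeds the source with a 0 message are simplifying assumptions.

```python
# A sketch of a Pregel-style compute(list of messages) -> list of messages
# for single-source shortest paths. Not Pregel's real API.
INF = float("inf")

class ShortestPathVertex:
    def __init__(self, vertex_id, out_edges):
        self.id = vertex_id
        self.out_edges = out_edges            # {neighbor_id: edge_weight}
        self.distance = INF                   # source is seeded with a 0 message

    def compute(self, messages):
        # Messages are candidate distances sent by in-neighbors last superstep;
        # the framework seeds the source vertex with a single 0 message.
        best = min(messages, default=INF)
        outgoing = []
        if best < self.distance:
            self.distance = best
            # Tell each neighbor the distance it could reach through us.
            outgoing = [(nbr, best + w) for nbr, w in self.out_edges.items()]
        return outgoing                       # empty list -> vote to halt
```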
The master periodically instructs the workers to save the state of their partitions to persistent storage.
How is checkpointing done in Pregel?
The workers all reload their partition state from the most recent available checkpoint.
The master periodically instructs the workers to save the state of their partitions to persistent storage.
Each worker communicates with the other workers.
It regularly uses “ping” messages.
It regularly uses "ping" messages.
It uses ping messages (keep-alives) to make sure that every worker is actually responding and doing processing.
How does Pregel detect failures?
The master periodically instructs the workers to save the state of their partitions to persistent storage.
Each worker communicates with the other workers.
It regularly uses “ping” messages.
The workers all reload their partition state from the most recent available checkpoint.
The workers all reload their partition state from the most recent available checkpoint.
The master re-assigns the failed workers' graph partitions to the currently available workers, so the unfinished work is shared out among workers that are still alive and can process it. Those workers then reload their partition state from the most recent available checkpoint and continue.
How is recovery being done in Pregel?
The workers all reload their partition state from the most recent available checkpoint.
The master periodically instructs the workers to save the state of their partitions to persistent storage.
It regularly uses “ping” messages.
Each worker communicates with the other workers.
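Putting the three Pregel fault-tolerance cards together, a master-side loop might look roughly like the sketch below; all method names are illustrative assumptions, not Pregel's real interface.

```python
# Sketch: periodic checkpointing, ping-based failure detection, and
# checkpoint-based recovery. Method names are illustrative assumptions.
CHECKPOINT_EVERY = 10   # supersteps between checkpoints (arbitrary choice)

def run(master, workers):
    step = 0
    while master.has_active_vertices():
        if step % CHECKPOINT_EVERY == 0:
            for w in workers:                     # checkpoint: save partition state
                w.save_partition(step)
        alive = [w for w in workers if w.ping()]  # detection: keep-alive pings
        if len(alive) < len(workers):
            master.reassign_partitions(alive)     # recovery: share out lost work
            for w in alive:
                w.load_partition(master.last_checkpoint())
            step = master.last_checkpoint()       # replay from the checkpoint
            workers = alive
            continue
        for w in workers:
            w.run_superstep(step)
        step += 1
```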
Responsible for the state of computation
ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
What is ZooKeeper’s role in task assignment in Giraph?
Responsible for the state of computation
Responsible for coordination
Responsible for vertices
Communicate with other workers
Responsible for coordination
Master is responsible for coordination:
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
What is Master’s role for task assignment in Giraph?
Responsible for the state of computation
Communicate with other workers
Responsible for vertices
Responsible for coordination
Responsible for vertices
Worker is responsible for vertices:
• Invokes active vertices' compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is Worker’s role for task assignment in Giraph?
Responsible for the state of computation
Communicate with other workers
Responsible for vertices
Responsible for coordination
A graph database is any storage system that provides index-free adjacency
A graph database holds all of the information and provides some way of retrieving it; typically, a graph database provides index-free adjacency.
What is a graph database?
A non-relational, distributed database
A distributed real-time computation system
A graph database is any storage system that provides index-free adjacency
A framework for distributed storage and processing of large data sets
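A small sketch of index-free adjacency: each node holds direct references to its neighbors, so traversal follows pointers rather than consulting a global index (the node names are invented).

```python
# Index-free adjacency: each node directly references its neighbors, so
# traversal follows pointers instead of index lookups.
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []          # direct references to neighbor Nodes

alice, bob = Node("alice"), Node("bob")
alice.edges.append(bob)          # traverse via alice.edges, no index needed
print([n.name for n in alice.edges])
```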
Associative data sets
A graph database has a bunch of associative data sets, so you look up one item and retrieve a different item connected to it, like a node and its connection to an edge.
Which of these is a property of a graph database?
Associative data sets
Performs the same operation on large amounts of data
Uses a relational model of data
Entity type has its table
A recommendation engine working based on the user preferences and others with similar preferences
Collaborative filtering means having multiple filters working together to extract just the information you want.
What is an example of a collaborative filtering application?
Finding the frequent item sets frequently bought together
Placing new items into predefined categories
Grouping similar objects together without knowing the groups ahead of time
A recommendation engine working based on the user preferences and others with similar preferences
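A toy user-based collaborative-filtering sketch; the ratings, the similarity measure, and the scoring rule are all invented for illustration.

```python
# Recommend items liked by users with similar preferences.
ratings = {
    "alice": {"book": 5, "film": 3},
    "bob":   {"book": 5, "film": 2, "game": 4},
    "carol": {"game": 1, "film": 5},
}

def similarity(u, v):
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    # Inverse of mean absolute rating difference on co-rated items.
    diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in common) / len(common)
    return 1.0 / (1.0 + diff)

def recommend(user):
    scores = {}
    for other in ratings:
        if other == user:
            continue
        w = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + w * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # -> ['game']
```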
They make a Docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time an HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
How do cloud providers technically handle model deployment?
They make a Docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time an HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
Amazon SageMaker stores all trained models in DynamoDB. Upon an HTTPS request, it asks DynamoDB for the model and runs the proper pre-written algorithm along with parameters fetched from DynamoDB.
They keep a pool of virtual machines active, so that any time an HTTPS request for a model inference arrives, one of the VMs is ready to fetch the model artifacts from the model repository and run it.
The models are stored in JavaScript. Any browser that wishes to run a model fetches the model parameters from a cloud-based BLOB storage and simply runs the model code locally.
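To make the container pattern concrete, here is a minimal inference-endpoint sketch using only Python's standard library; fetch_parameters() is a stand-in for a real BLOB-storage download, and the linear model is invented.

```python
# Sketch of the container pattern: lazily fetch parameters, serve inferences.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PARAMS = None  # populated on first request, as if fetched from BLOB storage

def fetch_parameters():
    # Stand-in for downloading trained weights from a storage bucket.
    return {"weights": [0.4, 0.6], "bias": 0.1}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global PARAMS
        if PARAMS is None:
            PARAMS = fetch_parameters()
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        x = body["features"]
        y = sum(w * xi for w, xi in zip(PARAMS["weights"], x)) + PARAMS["bias"]
        out = json.dumps({"prediction": y}).encode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(out)

if __name__ == "__main__":
    HTTPServer(("", 8080), InferenceHandler).serve_forever()
```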
Clustering
k-means clustering aims to partition n observations into k clusters
Which group does K-means fall into?
Collaborative filtering
Frequent pattern mining
Clustering
Classification
K-means
K-means is a clustering method
Which of the following is not a classification mechanism?
Naïve Bayes
K-means
Convolution Neural Network Classifier
Decision forests
Obtaining data, scrubbing data, exploring the dataset, training and evaluating a model, and interpreting the results
In a typical data science workflow, what are the steps involved?
Obtaining data, scrubbing data, exploring the dataset, training and evaluating a model, and interpreting the results
Obtaining data, data cleaning, model training, model exploration, model deployment
Model training, model exploration, cleaning the outcomes, interpreting the results.
Cleaning data, exploring data, model training and evaluation, obtaining results, deploying the model
Finding the frequent item sets frequently bought together
This is an example of frequent pattern mining (FPM).
What is an example of an FPM application?
Finding the frequent item sets frequently bought together
Placing new items into predefined categories
Grouping similar objects together without knowing the groups ahead of time
A recommendation engine working based on the user preferences and others with similar preferences
D -> A -> C -> B
In K-means, what is the order of the following steps?
A. For each data point, assign to the closest centroid
B. If new centroids are different from the old, re-iterate through the loop
C. For each cluster, re-compute the centroids
D. Randomly select k centroids
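A pure-Python K-means sketch following the D -> A -> C -> B order above, using 1-D points for brevity.

```python
# K-means in the card's order: D (pick centroids), A (assign), C (recompute),
# B (re-iterate if centroids changed).
import random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)                  # D: pick k centroids
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # A: assign to closest
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]     # C: recompute centroids
               for i, c in enumerate(clusters)]
        if new == centroids:                              # B: stop when unchanged
            break
        centroids = new
    return centroids, clusters

print(kmeans([1.0, 1.1, 5.0, 5.2, 9.9], k=2))
```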
A set of labeled data points is given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
Naive Bayes is a supervised classification method
Which of the following best describes how Naïve Bayes works?
A set of unlabeled data points is given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
A set of labeled data points is given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
A set of items is given, and the most frequent set of items is found.
A set of data points is given, and the method classifies them into k groups
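A tiny categorical Naive Bayes sketch: build a model from labeled points, then label a new point; the fruit data and the add-one smoothing are invented for illustration.

```python
# Categorical Naive Bayes: priors from label counts, naive per-feature
# likelihoods with simple add-one smoothing.
from collections import Counter, defaultdict

train = [({"color": "red", "shape": "round"}, "apple"),
         ({"color": "yellow", "shape": "long"}, "banana"),
         ({"color": "red", "shape": "long"}, "banana"),
         ({"color": "red", "shape": "round"}, "apple")]

label_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)          # (label, feature) -> value counts
for features, label in train:
    for f, v in features.items():
        feature_counts[(label, f)][v] += 1

def predict(features):
    best, best_p = None, 0.0
    for label, n in label_counts.items():
        p = n / len(train)                     # prior P(label)
        for f, v in features.items():          # naive likelihoods P(v | label)
            p *= (feature_counts[(label, f)][v] + 1) / (n + 2)
        if p > best_p:
            best, best_p = label, p
    return best

print(predict({"color": "red", "shape": "round"}))  # -> 'apple'
```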
Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
What are the definitions of hyperparameter optimization and AutoML?
Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
Hyperparameter optimization means adjusting the parameters of a search space using gradient descent, and is a technical term. AutoML is a special case of hyperparameter optimization, and is marketing jargon.
They are both the same and used interchangeably.
Hyperparameter optimization refers to adjusting the parameters of a hyperplane that divides the search space into equidistant quadrants. AutoML means the cloud provider takes care of orchestrating machine learning artifacts' deployment.
FPM
If we want to find which set of items in a grocery shop are frequently bought together, which of the following approaches should we use?
K-Means
Naïve Bayes
Decision Forests
FPM
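A simplified frequent-itemset sketch for the grocery example, counting item pairs against a support threshold; production FPM systems use algorithms such as Apriori or FP-Growth instead.

```python
# Count how often each pair of items appears together and keep pairs that
# meet a minimum support threshold. Basket data is invented.
from collections import Counter
from itertools import combinations

baskets = [{"bread", "butter", "milk"},
           {"bread", "butter"},
           {"milk", "eggs"},
           {"bread", "butter", "eggs"}]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2
print([p for p, n in pair_counts.items() if n >= min_support])
# -> [('bread', 'butter')]
```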
The process of eliminating events to keep up with the rate of events
What is load shedding?
Distributing applications across many servers
The process of eliminating events to keep up with the rate of events
Enabling a system to continue operating properly in the case of the failure of some of its components
Distributing the data across different parallel computing nodes
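A minimal load-shedding sketch: when events arrive faster than they can be processed, shed the excess rather than letting the queue grow without bound (the queue size is an arbitrary choice).

```python
# Load shedding with a bounded buffer: drop events when the queue is full.
import queue

events = queue.Queue(maxsize=100)   # bounded buffer; size is arbitrary

def ingest(event):
    try:
        events.put_nowait(event)    # keep the event if there is room
    except queue.Full:
        pass                        # shed it to keep up with the event rate
```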
A bolt processes input streams and produces new streams.
Which of the following is correct?
A topology is a network of tuples and streams.
A stream connects a bolt to a spout.
A spout can receive output from many streams.
A bolt processes input streams and produces new streams.
QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
In a Storm program that produces a sorted list of the top K most frequent words encountered across all the documents streamed into it, four kinds of processing elements (bolts in Storm) might be created: QuoteSplitterBolt, WordCountBolt, MergeBolt, and SortBolt.
What is the order in which words flow through the program?
WordCountBolt, QuoteSplitterBolt, SortBolt, MergeBolt
QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
QuoteSplitterBolt, SortBolt, WordCountBolt, MergeBolt
WordCountBolt, QuoteSplitterBolt, MergeBolt, SortBolt
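A plain-Python simulation of that bolt chain in the answer's order; this illustrates the dataflow only, not Storm's actual API, and merging per-partition top-k lists is an approximation of the exact global top-k.

```python
# Simulated bolt chain: split -> count -> sort local top-k -> merge.
from collections import Counter

def quote_splitter_bolt(document):
    return document.lower().split()              # emit one word per tuple

def word_count_bolt(words):
    return Counter(words)                        # per-partition counts

def sort_bolt(counts, k=3):
    return counts.most_common(k)                 # local top-k, sorted

def merge_bolt(local_topks, k=3):
    merged = Counter()
    for topk in local_topks:
        merged.update(dict(topk))
    return merged.most_common(k)                 # global top-k

partitions = [["to be or not to be"], ["be the change"]]
local = [sort_bolt(word_count_bolt([w for d in p for w in quote_splitter_bolt(d)]))
         for p in partitions]
print(merge_bolt(local))                         # [('be', 3), ('to', 2), ...]
```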
Provides a persistent state for the bolts, but the exact implementation is up to the user
What does Trident do?
Provides a persistent state for the topology, with a predefined set of characteristics
Provides a persistent state for the bolts, but the exact implementation is up to the user
Provides a persistent state for the spout, but the exact implementation is up to the user
Provides a persistent state for the bolts, with a predefined set of characteristics
Unbounded sequences of tuples
What are streams in Apache Storm?
A network of spouts and bolts
Unbounded sequences of tuples
Aggregators
Processors of input
Sources of streams
What are spouts in Apache Storm?
Unbounded sequences of tuples
Network of spouts and bolts
Processors of input
Sources of streams
Networks of spouts and bolts
What are topologies in Apache Storm?
Unbounded sequences of tuples
Sources of streams
Processors of input
Networks of spouts and bolts
Events are double processed.
In the “At Least Once” message process, what happens if there is a failure?
Events are double processed.
Storm's natural load-balancing takes over.
Storm's natural fault-tolerance takes over.
You must create and implement your load-balance algorithm.
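A sketch of why at-least-once delivery double-processes events after a failure, and the usual mitigation: make handling idempotent by remembering processed event ids (the handler and ids are invented).

```python
# Idempotent handling: a replayed event id is recognized and skipped.
processed = set()

def apply_side_effects(payload):
    print("processed", payload)

def handle(event_id, payload):
    if event_id in processed:
        return                        # duplicate delivery after a failure
    processed.add(event_id)
    apply_side_effects(payload)

handle(1, "a"); handle(1, "a")        # the second delivery is ignored
```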
Trident has first class support for state, but the exact implementation is up to the application developer.
How does Trident treat state?
Trident has first class support for state and is completely automatic without the need of any help from the application developer.
Trident has first class support for state, but the exact implementation is up to the application developer.
Trident does not have any support for state.
None of the above.
Allows Storm to be used from many languages
Thrift allows users to define and create services which are both consumable by and serviceable by numerous languages
How does Thrift contribute to Storm?
Enables the usage of streams
Provides load-balancing functionality
Provides scalability
Allows Storm to be used from many languages
Spark Streaming chops a stream into small batches and processes each batch independently.
This approach is called micro-batching.
Which of the following statements is true?
Spark Streaming treats each tuple independently and replays a record if not processed.
Spark Streaming chops a stream into small batches and processes each batch independently.
Spark Streaming has no support for state.
Spark Streaming uses transactions to update state.
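A toy micro-batch loop in plain Python, illustrating the idea rather than Spark Streaming's API: chop the stream into small batches and process each batch independently.

```python
# Micro-batching: group incoming events by time window, process per batch.
import time

def micro_batch(stream, batch_interval=1.0):
    batch, deadline = [], time.time() + batch_interval
    for event in stream:
        batch.append(event)
        if time.time() >= deadline:
            yield batch                       # hand off this batch as one unit
            batch, deadline = [], time.time() + batch_interval
    if batch:
        yield batch

for b in micro_batch(iter(range(5)), batch_interval=0.0):
    print(sum(b))                             # independent per-batch computation
```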
A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
Lambda architecture has two parallel data processing paths. The first path uses a stream event processing system like Storm, and the second, parallel path uses a batch processing system.
Which of the following best describes Lambda architecture? (Not to be confused with AWS Lambda).
A serial processing pipeline of first a batch processing system and then a stream processing system
Only a stream processing pipeline but with the ability to handle failures
A serial processing pipeline of first a streaming processing system and then a batch processing system
A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
Only one stream processing pipeline but with the ability to handle failures
In Kappa architecture, they try to get away from the two parallel paths and do only streaming, but they try to do the streaming well enough that if there are failures the state does not get corrupted.
Which of the following best describes Kappa architecture?
A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
A serial processing pipeline of first a streaming processing system and then a batch processing system
A serial processing pipeline of first a batch processing system and then a stream processing system
Only one stream processing pipeline but with the ability to handle failures
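A sketch of the Lambda-architecture merge step: a slow but accurate batch view plus a fast, approximate streaming view, combined at query time (the counters are invented).

```python
# Lambda architecture: merge the batch view and the speed (streaming) view.
batch_view = {"clicks": 1000}        # recomputed periodically from all data
speed_view = {"clicks": 42}          # updated in real time since the last batch

def query(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("clicks"))               # 1042
```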
HDFS
HDFS is the storage part of Hadoop, not a component of Storm
Which of the following is not a component of the Storm Architecture?
Worker
Supervisor
Zookeeper
HDFS
Nimbus
It is micro-batch, which increases the minimum end-to-end latency of the system.
The disadvantage is that it is not really streaming in the strict sense: it batches the data, then runs each batch very quickly.
What is the main disadvantage of Spark Streaming?
It does not handle failures.
It is micro-batch, which increases the minimum end-to-end latency of the system.
It lacks a rich eco-system of big data tools.
It does not support state
NiFi
In NiFi, you can design a dataflow graph to process your data.
Which system has a great graphical UI to design dataflows?
NiFi
Storm
Druid
Spark Streaming
Druid
Druid provides techniques that process OLAP queries fast
Which system is best for Online Analytical Processing (OLAP)?
NiFi
Storm
Druid
Spark Streaming
Full virtualization
Which type of virtualization is feasible for the following scenario?
“A service needs to run an unmodified OS on a basic processor, separate from the host operating system.”
Full virtualization
Container
Para-virtualization
Hardware-assisted
Which type of virtualization is feasible for the following scenario?
“A service needs to run an unknown and unmodified OS on an advanced processor.”
Para-virtualization
Hardware-assisted
Containers
Which type of virtualization provides better performance for the following scenario?
“Running multiple independent applications sharing the same kernel”
Hardware-assisted full virtualization
Containers
Full virtualization
Which type of virtualization provides better performance for the following scenario?
“Running two independent applications, each of which needs a different version of a kernel module.”
Full virtualization
Containers
Containers
Which type of virtualization provides better performance for the following scenario?
“Multiple applications with high memory usage”
Containers
Full virtualization through binary translation
Running each application in a separate container.
Which is the recommended practice in the following scenario?
"Multiple applications with communication requirements"
Running all applications in one container.
Running each application in a separate container.
No virtualization
Every sort of virtualization technology has some kind of performance impact. For the absolute best performance, no virtualization is the best option.
Which type of virtualization provides better performance for the following scenario?
“One application running on a single piece of hardware”
Full virtualization
Containers
No virtualization
JVM
True
A virtual machine using full virtualization can only run a guest OS designed for one type of CPU (the same as the host).
True
False
True
The VM simulates enough hardware to allow an unmodified guest OS (one designed for the same CPU) to be run in isolation.
Kubernetes provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
True
False
Host OS (Base OS) kernel
Who is responsible for scheduling and memory management when using containers?
Virtual Machine Manager
Hypervisor
Host OS (Base OS) kernel
Supervisor
Hardware-assisted full virtualization
Which type of virtualization is feasible for the following scenario?
“Application that needs different custom operating systems (kernels)”
Para-virtualization
Hardware-assisted full virtualization
Containers
Guarantee that the software will always run the same irrespective of environment
Using the Dockerfile format, and relying on union filesystem technology, Docker images downloaded from a hub guarantee specific software environments for deployment.
Docker is used to:
Monitor progress of jobs running on OpenStack
Send messages from one machine to another
Guarantee that the software will always run the same irrespective of environment
Run a Java program
True
Kubernetes provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
False
True
False
A user application is not allowed to load control registers when running in kernel mode.
True
False
True
In x86, kernel mode code runs in ring 0 while user processes run in ring 3.
True
False
True
When a user application needs to handle an interrupt, it has to enter kernel mode.
True
False
True
Paravirtualization is a software-only virtualization approach.
Xen does not require special hardware support, such as Intel "VT-x" or "AMD-V".
True
False
False
Binary translation only modifies sensitive instructions.
Binary translation modifies all instructions on the fly and does not require changes to the guest operating system kernel.
True
False
False
There is only one address space in a unikernel. The application can be seen as running in kernel mode the whole time.
In a unikernel, a user application can transition to kernel mode using special instructions.
True
False
False
Making changes to a unikernel requires recompilation. A unikernel normally runs only one application.
Is it possible to install a second application with different dependencies into an existing unikernel?
True
False
Having minimal security protection.
Using a minimal device model and kernel configuration can reduce the attack surface of a microVM and does not reduce security protection.
Which is not a reason that a microVM is faster than a normal VM?
Having a minimal device model.
Having minimal security protection.
Having a minimal guest kernel configuration.
microVM
AWS Lambda and AWS Fargate are using
Container
microVM
Third
Which generation of hardware virtualization introduced IOMMU virtualization?
First
Second
Third
5.4
The latest CentOS comes with Linux kernel version 4.18. If you are running the latest CentOS container on an Ubuntu host with kernel version 5.4, which kernel version would you see inside the container?
4.18
5.4
5.4
A container always uses the host kernel.
You built the latest CentOS container on an Ubuntu host with kernel version 4.18. After you upgrade the Ubuntu kernel to 5.4, what kernel version will the built CentOS container use?
5.4
4.18
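You can check this behavior directly: because a container shares the host kernel, the snippet below prints the host's kernel release even when run inside a CentOS container.

```python
# Containers share the host kernel, so inside a CentOS container on a host
# running kernel 5.4 this prints the host's 5.4 release, not CentOS's 4.18.
import platform
print(platform.release())
```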
True
Besides using a Dockerfile to create a container image, one can also start a container from an existing image and install the necessary packages on top of it to create a new image.
True
False
True
All containers without a --network specified are attached to the default bridge network. This is a risky operation, as it allows unrelated services to communicate.
True
False
bridge
Which is the default Docker network driver?
bridge
host
overlay
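As a sketch of creating such a network programmatically with the Docker SDK for Python (pip install docker; assumes a local daemon, and the network and container names are invented): create a user-defined bridge and attach a container to it rather than relying on the risky default bridge.

```python
# Sketch using the Docker SDK for Python: user-defined bridge networks let
# containers reach each other by name, unlike the default bridge.
import docker

client = docker.from_env()                       # talk to the local daemon
client.networks.create("app-net", driver="bridge")
client.containers.run("nginx", detach=True, name="web", network="app-net")
# Another container on "app-net" could now reach this one as http://web/
```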
False
Containers on different networks can communicate using the bridge network.
True
False
overlay network
For communication among containers running on different Docker daemon hosts, you should use
bridge network
overlay network
overlay network
Which type of network connects multiple Docker daemons together and enables swarm services to communicate with each other?
bridge network
host network
overlay network
macvlan network
True
In Docker Swarm, ingress is an overlay network that handles control and data traffic related to swarm services.
True
False
Virtual IP
When a service is requested, the resulting DNS query is forwarded to the Docker Engine, which in turn returns the IP of the service: a virtual IP.
Docker internal load balancing is done using
Virtual IP
Published port numbers to the host system
Service name
The routing mesh uses IP-based service discovery and load balancing.
The routing mesh uses port-based service discovery and load balancing.
Which of the following statements is NOT true about Docker routing mesh?
The routing mesh enables each node in the swarm to accept connections on published ports for any service running in the swarm.
The routing mesh uses IP-based service discovery and load balancing.
By default all nodes participate in an ingress routing mesh.
ALL
How can you mount a storage location on the host to a container?
Bind mount
Volume
tmpfs
When you grant a newly-created or running service access to a secret, the decrypted secret is mounted into the container in an in-memory filesystem
How does a service get access to secret information in Docker Swarm?
When you grant a newly-created or running service access to a secret, the decrypted secret is mounted into the container in an in-memory filesystem
When you grant a newly-created or running service access to a secret, the encrypted secret is mounted into the container in an in-memory filesystem. The program in the container needs to decrypt the secret using the appropriate master key.
When you grant a newly-created or running service access to a secret, the decrypted secret is mounted into the container in a disk mounted location
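Inside a swarm service container, a granted secret appears as a file under /run/secrets/<secret-name> in that in-memory filesystem, so reading it is ordinary file I/O; the secret name below is illustrative.

```python
# Inside the service container: the decrypted secret is a file in an
# in-memory filesystem mounted at /run/secrets/<name>.
from pathlib import Path

db_password = Path("/run/secrets/db_password").read_text().strip()
```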
Replaces the functionality of Dockerfile.
It still needs a Dockerfile to create images.
Which of the following statements about Docker Compose is incorrect?
Uses a YAML file to configure application’s services.
Replaces the functionality of Dockerfile.
Main tool by Docker for container orchestration
overlay network
For containers running on different hosts to communicate, you should use
bridge network
host network
overlay network
macvlan network
False
It's only possible to use container names in a user-defined bridge network.
For a container to communicate with another container running on the default bridge network, one can use either the target container's IP address or its container name directly.
True
False
False
The routing mesh will automatically route the incoming traffic to a node that has a service task running on it.
Docker Swarm routing mesh will report an error if an external load balancer reaches a node that does not have a task belonging to the requested service.
True
False
False
When you publish a service port, the swarm makes the service accessible at the target port only on nodes that have a task running for that service.
True
False
Bind mount
You are working on a course MP and want to use the IDE on the host to edit code and run the code inside a container. Which is the best way to make the code accessible inside the container?
Bind mount
Volume
tmpfs
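A sketch of the bind-mount workflow using the Docker SDK for Python; the paths and image are illustrative assumptions. The host directory is mounted into the container, so host-side edits are immediately visible inside.

```python
# Sketch: bind-mount the host's MP directory into the container so code
# edited on the host runs inside without rebuilding the image.
import docker

client = docker.from_env()
client.containers.run(
    "python:3.11",
    "python /app/main.py",
    volumes={"/home/user/mp": {"bind": "/app", "mode": "rw"}},  # bind mount
    detach=True,
)
```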
False
It's the opposite for both cases.
AWS ECS has two launch types: EC2 and Fargate. EC2 automatically manages all resource provisioning, while for Fargate it is managed by the customer.
True
False
Create container images based on user-specified requirements.
Which of the following is not a function of Kubernetes?
Schedule containers to run on physical or virtual machines.
Create container images based on user-specified requirements.
None of the above
Kubernetes can be classified as a
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
Software as a Service (SaaS)
None of the above
True
A node in Kubernetes can run several pods and each pod can run several containers.
True
False
True
Etcd is a key-value store that provides a consistent distributed state for a Kubernetes cluster.
True
False
False
In Kubernetes users need to take care of mapping container ports to host ports.
True
False
True
Environment variables and DNS are two primary modes for service discovery in Kubernetes.
True
False
ALL
In Docker Swarm, a service is used to handle
Launching and monitoring tasks
Rolling updates
Network routing
Launching and monitoring pods
In Kubernetes, a ReplicaSet takes care of
Launching and monitoring pods
Rolling updates
Network routing
All of the above
Sidecar pattern
Which design pattern will you use for adding HTTPS to a legacy service?
Sidecar pattern
Ambassador pattern
Adapter pattern
Pod
What is the smallest control unit in Kubernetes?
Deployment
Node
Pod