RDBMS
Which system is a more natural fit for OLTP?
Data warehouse
Data Lake
Managed Machine Learning platforms
RDBMS
data structure, more specifically, a sophisticated nested array
A Datacube is best thought of as a(n)
function that structures and compresses data
data structure, more specifically, a sophisticated nested array
archival service provided by AWS
specialized hardware for fast analysis of massive data
False
Correct! OLAP cubes require that data teams manage complicated pipelines to transform data from a SQL database into OLAP cubes
In general, it is very easy and straightforward to transform data from a SQL database into an OLAP cube.
True
False
ALL
Select all that apply: What are some commonly available datacube operations?
Slicing
Dicing
Drill Up / Down
Roll-up
Pivot
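For intuition, here is a minimal pandas sketch of slice, roll-up, and pivot on a tiny cube; the sales table and its column names are invented for illustration, and dice / drill-down work analogously by filtering and disaggregating.

```python
# A sketch of slice, roll-up, and pivot on a tiny "cube" built with pandas.
# The sales table and column names are invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 150, 120, 130],
})

# Build a cube-like view: dimensions (year, region) x product, measure = amount.
cube = sales.pivot_table(values="amount", index=["year", "region"],
                         columns="product", aggfunc="sum")

print(cube.loc[2024])                         # slice: fix the year dimension
print(sales.groupby("year")["amount"].sum())  # roll-up: aggregate a dimension away
print(cube.T)                                 # pivot: rotate the cube's axes
```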
BigTable
As a wide-column store, BigTable is specialized for access by a key
Which of these would probably be best for storing data retrieved by a key or a sequence of keys?
MonetDB
SybaseIQ
BigTable
Vertica
OLAP Datacubes
RDBMS are often limited by the constraints of SQL
If your primary interest is the richest possible analysis capabilities, which of these two options would likely be the better choice?
Column-Oriented Data Warehouse
OLAP Datacubes
Significantly more than 5%
Suppose a table contains 10000 rows and 100 columns. A query that uses all of the rows and 5 columns will need to read approximately what percentage of the data contained in the table if you are using a traditional row-based RDBMS system?
Significantly less than 5%
Approximately 5%
Significantly more than 5%
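The arithmetic behind this card, as a quick sketch (assuming all columns are equally wide): a row store must read whole rows, while a column store would read only the queried columns.

```python
# Back-of-the-envelope arithmetic, assuming all columns are equally wide:
# a row store reads whole rows, a column store reads only the used columns.
rows, cols, cols_used = 10_000, 100, 5

row_store_fraction = (rows * cols) / (rows * cols)       # reads every full row
col_store_fraction = (rows * cols_used) / (rows * cols)  # reads only 5 columns

print(row_store_fraction)  # 1.0  -> ~100%, significantly more than 5%
print(col_store_fraction)  # 0.05 -> ~5%, what a columnar store would read
```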
a new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
A Data Lake is
a new type of data repository for storing massive amounts of unstructured data in a single location for processing, cleaning, and structuring
a new type of data repository for storing massive amounts of structured data in a single location, rather than spread over multiple datacenters, in order to exploit data locality to speed the analysis
a new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
False
Today, OLAP cubes refer specifically to contexts in which these data structures far outstrip the size of the hosting computer's main memory
Today, OLAP cubes are always designed to fit in the hosting computer's main memory to maximize analytical performance
True
False
False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
Redshift, like most Columnar Stores, makes it easy to update blocks.
True. Redshift, like most Columnar Stores, is write-optimized, so updates are easy
False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
True, though this property is rarely used in practice since Columnar Stores are primarily utilized by read-heavy applications
False. Redshift is not a Columnar Store, but a data pipeline that connects Columnar Stores to analysis engines
It produces too much communication between stages.
MapReduce tends to be inefficient because the graph state must be stored at each stage of the graph algorithm, and each computational stage produces a great deal of communication between the stages.
Why is MapReduce not efficient for large-scale graph processing?
It brings load imbalance.
It is not fault-tolerant enough.
It produces too much communication between stages.
The map function is computationally a bottleneck.
compute(list of messages) -> return list of messages
You want to build a shortest path algorithm using parallel breadth-first search in Pregel. Which of the following pseudo-codes is the proper "compute function" for this program?
compute(list of messages) -> return list of messages
compute(list of vertices) -> return list of messages
compute(list of edges) -> return list of messages
compute(graph) -> return list of messages
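Below is a sketch of what such a compute function might look like for single-source shortest paths, written in Python rather than Pregel's actual C++ API; the vertex class, the message format, and the convention that the framework seeds the source with a 0 message are simplifying assumptions.

```python
# A sketch of a Pregel-style compute(list of messages) -> list of messages
# for single-source shortest paths. Not Pregel's real API.
INF = float("inf")

class ShortestPathVertex:
    def __init__(self, vertex_id, out_edges):
        self.id = vertex_id
        self.out_edges = out_edges            # {neighbor_id: edge_weight}
        self.distance = INF                   # source is seeded with a 0 message

    def compute(self, messages):
        # Messages are candidate distances sent by in-neighbors last superstep;
        # the framework seeds the source vertex with a single 0 message.
        best = min(messages, default=INF)
        outgoing = []
        if best < self.distance:
            self.distance = best
            # Tell each neighbor the distance it could reach through us.
            outgoing = [(nbr, best + w) for nbr, w in self.out_edges.items()]
        return outgoing                       # empty list -> vote to halt
```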
The master periodically instructs the workers to save the state of their partitions to persistent storage.
How is checkpointing done in Pregel?
The workers all reload their partition state from the most recent available checkpoint.
The master periodically instructs the workers to save the state of their partitions to persistent storage.
Each worker communicates with the other workers.
It regularly uses “ping” messages.
It regularly uses "ping" messages.
It uses ping messages (keep-alives) to make sure that every worker is actually responding and doing processing.
How does Pregel detect failures?
The master periodically instructs the workers to save the state of their partitions to persistent storage.
Each worker communicates with the other workers.
It regularly uses “ping” messages.
The workers all reload their partition state from the most recent available checkpoint.
The workers all reload their partition state from the most recent available checkpoint.
The master re-assigns the failed workers' graph partitions to the currently available workers, so the unfinished work is shared out among workers that are still alive and can process it. Those workers then reload their partition state from the most recent available checkpoint and continue.
How is recovery being done in Pregel?
The workers all reload their partition state from the most recent available checkpoint.
The master periodically instructs the workers to save the state of their partitions to persistent storage.
It regularly uses “ping” messages.
Each worker communicates with the other workers.
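Putting the three Pregel fault-tolerance cards together, a master-side loop might look roughly like the sketch below; all method names are illustrative assumptions, not Pregel's real interface.

```python
# Sketch: periodic checkpointing, ping-based failure detection, and
# checkpoint-based recovery. Method names are illustrative assumptions.
CHECKPOINT_EVERY = 10   # supersteps between checkpoints (arbitrary choice)

def run(master, workers):
    step = 0
    while master.has_active_vertices():
        if step % CHECKPOINT_EVERY == 0:
            for w in workers:                     # checkpoint: save partition state
                w.save_partition(step)
        alive = [w for w in workers if w.ping()]  # detection: keep-alive pings
        if len(alive) < len(workers):
            master.reassign_partitions(alive)     # recovery: share out lost work
            for w in alive:
                w.load_partition(master.last_checkpoint())
            step = master.last_checkpoint()       # replay from the checkpoint
            workers = alive
            continue
        for w in workers:
            w.run_superstep(step)
        step += 1
```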
Responsible for the state of computation
ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
What is ZooKeeper’s role in task assignment in Giraph?
Responsible for the state of computation
Responsible for coordination
Responsible for vertices
Communicate with other workers
Responsible for coordination
Master is responsible for coordination:
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
What is Master’s role for task assignment in Giraph?
Responsible for the state of computation
Communicate with other workers
Responsible for vertices
Responsible for coordination
Responsible for vertices
Worker is responsible for vertices:
• Invokes active vertices' compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is Worker’s role for task assignment in Giraph?
Responsible for the state of computation
Communicate with other workers
Responsible for vertices
Responsible for coordination
A graph database is any storage system that provides index-free adjacency
A graph database holds all of the information and provides some way of retrieving it; typically, a graph database provides index-free adjacency.
What is a graph database?
A non-relational, distributed database
A distributed real-time computation system
A graph database is any storage system that provides index-free adjacency
A framework for distributed storage and processing of large data sets
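A small sketch of index-free adjacency: each node holds direct references to its neighbors, so traversal follows pointers rather than consulting a global index (the node names are invented).

```python
# Index-free adjacency: each node directly references its neighbors, so
# traversal follows pointers instead of index lookups.
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []          # direct references to neighbor Nodes

alice, bob = Node("alice"), Node("bob")
alice.edges.append(bob)          # traverse via alice.edges, no index needed
print([n.name for n in alice.edges])
```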
Associative data sets
A graph database has a bunch of associative data sets, so you look up one item and retrieve a different item connected to it, like a node and its connection to an edge.
Which of these is a property of a graph database?
Associative data sets
Performs the same operation on large amounts of data
Uses a relational model of data
Entity type has its table
A recommendation engine working based on the user preferences and others with similar preferences
Collaborative filtering means having multiple filters working together to extract just the information you want.
What is an example of a collaborative filtering application?
Finding the frequent item sets frequently bought together
Placing new items into predefined categories
Grouping similar objects together without knowing the groups ahead of time
A recommendation engine working based on the user preferences and others with similar preferences
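A toy user-based collaborative-filtering sketch; the ratings, the similarity measure, and the scoring rule are all invented for illustration.

```python
# Recommend items liked by users with similar preferences.
ratings = {
    "alice": {"book": 5, "film": 3},
    "bob":   {"book": 5, "film": 2, "game": 4},
    "carol": {"game": 1, "film": 5},
}

def similarity(u, v):
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    # Inverse of mean absolute rating difference on co-rated items.
    diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in common) / len(common)
    return 1.0 / (1.0 + diff)

def recommend(user):
    scores = {}
    for other in ratings:
        if other == user:
            continue
        w = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + w * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # -> ['game']
```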
They make a Docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time an HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
How do cloud providers technically handle model deployment?
They make a Docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time an HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
Amazon SageMaker stores all trained models in DynamoDB. Upon an HTTPS request, it asks DynamoDB for the model and runs the proper pre-written algorithm along with parameters fetched from DynamoDB.
They keep a pool of virtual machines active, so that any time an HTTPS request for a model inference arrives, one of the VMs is ready to fetch the model artifacts from the model repository and run it.
The models are stored in JavaScript. Any browser that wishes to run a model fetches the model parameters from a cloud-based BLOB storage and simply runs the model code locally.
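To make the container pattern concrete, here is a minimal inference-endpoint sketch using only Python's standard library; fetch_parameters() is a stand-in for a real BLOB-storage download, and the linear model is invented.

```python
# Sketch of the container pattern: lazily fetch parameters, serve inferences.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PARAMS = None  # populated on first request, as if fetched from BLOB storage

def fetch_parameters():
    # Stand-in for downloading trained weights from a storage bucket.
    return {"weights": [0.4, 0.6], "bias": 0.1}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global PARAMS
        if PARAMS is None:
            PARAMS = fetch_parameters()
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        x = body["features"]
        y = sum(w * xi for w, xi in zip(PARAMS["weights"], x)) + PARAMS["bias"]
        out = json.dumps({"prediction": y}).encode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(out)

if __name__ == "__main__":
    HTTPServer(("", 8080), InferenceHandler).serve_forever()
```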
Clustering
k-means clustering aims to partition n observations into k clusters
Which group does K-means fall into?
Collaborative filtering
Frequent pattern mining
Clustering
Classification
K-means
K-means is a clustering method
Which of the following is not a classification mechanism?
Naïve Bayes
K-means
Convolution Neural Network Classifier
Decision forests
Obtaining data, scrubbing data, exploring the dataset, training and evaluating a model, and interpreting the results
In a typical data science workflow, what are the steps involved?
Obtaining data, scrubbing data, exploring the dataset, training and evaluating a model, and interpreting the results
Obtaining data, data cleaning, model training, model exploration, model deployment
Model training, model exploration, cleaning the outcomes, interpreting the results.
Cleaning data, exploring data, model training and evaluation, obtaining results, deploying the model
Finding the frequent item sets frequently bought together
This is an example of frequent pattern mining (FPM).
What is an example of an FPM application?
Finding the frequent item sets frequently bought together
Placing new items into predefined categories
Grouping similar objects together without knowing the groups ahead of time
A recommendation engine working based on the user preferences and others with similar preferences
D -> A -> C -> B
In K-means, what is the order of the following steps?
A. For each data point, assign to the closest centroid
B. If new centroids are different from the old, re-iterate through the loop
C. For each cluster, re-compute the centroids
D. Randomly select k centroids
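A pure-Python K-means sketch following the D -> A -> C -> B order above, using 1-D points for brevity.

```python
# K-means in the card's order: D (pick centroids), A (assign), C (recompute),
# B (re-iterate if centroids changed).
import random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)                  # D: pick k centroids
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # A: assign to closest
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]     # C: recompute centroids
               for i, c in enumerate(clusters)]
        if new == centroids:                              # B: stop when unchanged
            break
        centroids = new
    return centroids, clusters

print(kmeans([1.0, 1.1, 5.0, 5.2, 9.9], k=2))
```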
A set of labeled data points is given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
Naive Bayes is a supervised classification method
Which of the following best describes how Naïve Bayes works?
A set of unlabeled data points is given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
A set of labeled data points is given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
A set of items is given, and the most frequent set of items is found.
A set of data points is given, and the method classifies them into k groups
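A tiny categorical Naive Bayes sketch: build a model from labeled points, then label a new point; the fruit data and the add-one smoothing are invented for illustration.

```python
# Categorical Naive Bayes: priors from label counts, naive per-feature
# likelihoods with simple add-one smoothing.
from collections import Counter, defaultdict

train = [({"color": "red", "shape": "round"}, "apple"),
         ({"color": "yellow", "shape": "long"}, "banana"),
         ({"color": "red", "shape": "long"}, "banana"),
         ({"color": "red", "shape": "round"}, "apple")]

label_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)          # (label, feature) -> value counts
for features, label in train:
    for f, v in features.items():
        feature_counts[(label, f)][v] += 1

def predict(features):
    best, best_p = None, 0.0
    for label, n in label_counts.items():
        p = n / len(train)                     # prior P(label)
        for f, v in features.items():          # naive likelihoods P(v | label)
            p *= (feature_counts[(label, f)][v] + 1) / (n + 2)
        if p > best_p:
            best, best_p = label, p
    return best

print(predict({"color": "red", "shape": "round"}))  # -> 'apple'
```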
Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
What are the definitions of hyperparameter optimization and AutoML?
Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
Hyperparameter optimization means adjusting the parameters of a search space using gradient descent, and is a technical term. AutoML is a special case of hyperparameter optimization, and is marketing jargon.
They are both the same and used interchangeably.
Hyperparameter optimization refers to adjusting the parameters of a hyperplane that divides the search space into equidistant quadrants. AutoML means the cloud provider takes care of orchestrating machine learning artifacts' deployment.
FPM
If we want to find which set of items in a grocery shop are frequently bought together, which of the following approaches should we use?
K-Means
Naïve Bayes
Decision Forests
FPM
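A simplified frequent-itemset sketch for the grocery example, counting item pairs against a support threshold; production FPM systems use algorithms such as Apriori or FP-Growth instead.

```python
# Count how often each pair of items appears together and keep pairs that
# meet a minimum support threshold. Basket data is invented.
from collections import Counter
from itertools import combinations

baskets = [{"bread", "butter", "milk"},
           {"bread", "butter"},
           {"milk", "eggs"},
           {"bread", "butter", "eggs"}]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2
print([p for p, n in pair_counts.items() if n >= min_support])
# -> [('bread', 'butter')]
```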
The process of eliminating events to keep up with the rate of events
What is load shedding?
Distributing applications across many servers
The process of eliminating events to keep up with the rate of events
Enabling a system to continue operating properly in the case of the failure of some of its components
Distributing the data across different parallel computing nodes
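A minimal load-shedding sketch: when events arrive faster than they can be processed, shed the excess rather than letting the queue grow without bound (the queue size is an arbitrary choice).

```python
# Load shedding with a bounded buffer: drop events when the queue is full.
import queue

events = queue.Queue(maxsize=100)   # bounded buffer; size is arbitrary

def ingest(event):
    try:
        events.put_nowait(event)    # keep the event if there is room
    except queue.Full:
        pass                        # shed it to keep up with the event rate
```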
A bolt processes input streams and produces new streams.
Which of the following is correct?
A topology is a network of tuples and streams.
A stream connects a bolt to a spout.
A spout can receive output from many streams.
A bolt processes input streams and produces new streams.
QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
In a Storm program that produces a sorted list of the top K most frequent words encountered across all the documents streamed into it, four kinds of processing elements (bolts in Storm) might be created: QuoteSplitterBolt, WordCountBolt, MergeBolt, and SortBolt.
What is the order in which words flow through the program?
WordCountBolt, QuoteSplitterBolt, SortBolt, MergeBolt
QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
QuoteSplitterBolt, SortBolt, WordCountBolt, MergeBolt
WordCountBolt, QuoteSplitterBolt, MergeBolt, SortBolt
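A plain-Python simulation of that bolt chain in the answer's order; this illustrates the dataflow only, not Storm's actual API, and merging per-partition top-k lists is an approximation of the exact global top-k.

```python
# Simulated bolt chain: split -> count -> sort local top-k -> merge.
from collections import Counter

def quote_splitter_bolt(document):
    return document.lower().split()              # emit one word per tuple

def word_count_bolt(words):
    return Counter(words)                        # per-partition counts

def sort_bolt(counts, k=3):
    return counts.most_common(k)                 # local top-k, sorted

def merge_bolt(local_topks, k=3):
    merged = Counter()
    for topk in local_topks:
        merged.update(dict(topk))
    return merged.most_common(k)                 # global top-k

partitions = [["to be or not to be"], ["be the change"]]
local = [sort_bolt(word_count_bolt([w for d in p for w in quote_splitter_bolt(d)]))
         for p in partitions]
print(merge_bolt(local))                         # [('be', 3), ('to', 2), ...]
```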
Provides a persistent state for the bolts, but the exact implementation is up to the user
What does Trident do?
Provides a persistent state for the topology, with a predefined set of characteristics
Provides a persistent state for the bolts, but the exact implementation is up to the user
Provides a persistent state for the spout, but the exact implementation is up to the user
Provides a persistent state for the bolts, with a predefined set of characteristics
Unbounded sequences of tuples
What are streams in Apache Storm?
A network of spouts and bolts
Unbounded sequences of tuples
Aggregators
Processors of input
Sources of streams
What are spouts in Apache Storm?
Unbounded sequences of tuples
Network of spouts and bolts
Processors of input
Sources of streams
Networks of spouts and bolts
What are topologies in Apache Storm?
Unbounded sequences of tuples
Sources of streams
Processors of input
Networks of spouts and bolts
Events are double processed.
In the “At Least Once” message process, what happens if there is a failure?
Events are double processed.
Storm's natural load-balancing takes over.
Storm's natural fault-tolerance takes over.
You must create and implement your load-balance algorithm.
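A sketch of why at-least-once delivery double-processes events after a failure, and the usual mitigation: make handling idempotent by remembering processed event ids (the handler and ids are invented).

```python
# Idempotent handling: a replayed event id is recognized and skipped.
processed = set()

def apply_side_effects(payload):
    print("processed", payload)

def handle(event_id, payload):
    if event_id in processed:
        return                        # duplicate delivery after a failure
    processed.add(event_id)
    apply_side_effects(payload)

handle(1, "a"); handle(1, "a")        # the second delivery is ignored
```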
Trident has first class support for state, but the exact implementation is up to the application developer.
How does Trident treat state?
Trident has first class support for state and is completely automatic without the need of any help from the application developer.
Trident has first class support for state, but the exact implementation is up to the application developer.
Trident does not have any support for state.
None of the above.
Allows Storm to be used from many languages
Thrift allows users to define and create services which are both consumable by and serviceable by numerous languages
How does Thrift contribute to Storm?
Enables the usage of streams
Provides load-balancing functionality
Provides scalability
Allows Storm to be used from many languages
Spark Streaming chops a stream into small batches and processes each batch independently.
This approach is called micro-batching.
Which of the following statements is true?
Spark Streaming treats each tuple independently and replays a record if not processed.
Spark Streaming chops a stream into small batches and processes each batch independently.
Spark Streaming has no support for state.
Spark Streaming uses transactions to update state.
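A toy micro-batch loop in plain Python, illustrating the idea rather than Spark Streaming's API: chop the stream into small batches and process each batch independently.

```python
# Micro-batching: group incoming events by time window, process per batch.
import time

def micro_batch(stream, batch_interval=1.0):
    batch, deadline = [], time.time() + batch_interval
    for event in stream:
        batch.append(event)
        if time.time() >= deadline:
            yield batch                       # hand off this batch as one unit
            batch, deadline = [], time.time() + batch_interval
    if batch:
        yield batch

for b in micro_batch(iter(range(5)), batch_interval=0.0):
    print(sum(b))                             # independent per-batch computation
```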
A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
Lambda architecture has two parallel data processing paths. The first path uses a stream event processing system like Storm, and the second, parallel path uses a batch processing system.
Which of the following best describes Lambda architecture? (Not to be confused with AWS Lambda).
A serial processing pipeline of first a batch processing system and then a stream processing system
Only a stream processing pipeline but with the ability to handle failures
A serial processing pipeline of first a streaming processing system and then a batch processing system
A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
Only one stream processing pipeline but with the ability to handle failures
In Kappa architecture, they try to get away from the two parallel paths and do only streaming, but they try to do the streaming well enough that if there are failures the state does not get corrupted.
Which of the following best describes Kappa architecture?
A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
A serial processing pipeline of first a streaming processing system and then a batch processing system
A serial processing pipeline of first a batch processing system and then a stream processing system
Only one stream processing pipeline but with the ability to handle failures
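A sketch of the Lambda-architecture merge step: a slow but accurate batch view plus a fast, approximate streaming view, combined at query time (the counters are invented).

```python
# Lambda architecture: merge the batch view and the speed (streaming) view.
batch_view = {"clicks": 1000}        # recomputed periodically from all data
speed_view = {"clicks": 42}          # updated in real time since the last batch

def query(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("clicks"))               # 1042
```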
HDFS
HDFS is the storage part of Hadoop, not a component of Storm
Which of the following is not a component of the Storm Architecture?
Worker
Supervisor
Zookeeper
HDFS
Nimbus
It is micro-batch, which increases the minimum end-to-end latency of the system.
The disadvantage is that it is not really streaming in the strict sense: it batches the data, then runs each batch very quickly.
What is the main disadvantage of Spark Streaming?
It does not handle failures.
It is micro-batch, which increases the minimum end-to-end latency of the system.
It lacks a rich eco-system of big data tools.
It does not support state
NiFi
In NiFi, you can design a dataflow graph to process your data.
Which system has a great graphical UI to design dataflows?
NiFi
Storm
Druid
Spark Streaming
Druid
Druid provides techniques that process OLAP queries fast
Which system is best for Online Analytical Processing (OLAP)?
NiFi
Storm
Druid
Spark Streaming
Full virtualization
Which type of virtualization is feasible for the following scenario?
“A service needs to run an unmodified OS on a basic processor, separate from the host operating system.”
Full virtualization
Container
Para-virtualization
Hardware-assisted
Which type of virtualization is feasible for the following scenario?
“A service needs to run an unknown and unmodified OS on an advanced processor.”
Para-virtualization
Hardware-assisted
Containers
Which type of virtualization provides better performance for the following scenario?
“Running multiple independent applications sharing the same kernel”
Hardware-assisted full virtualization
Containers
Full virtualization
Which type of virtualization provides better performance for the following scenario?
“Running two independent applications, each of which needs a different version of a kernel module.”
Full virtualization
Containers
Containers
Which type of virtualization provides better performance for the following scenario?
“Multiple applications with high memory usage”
Containers
Full virtualization through binary translation
Running each application in a separate container.
Which is the recommended practice in the following scenario?
"Multiple applications with communication requirements"
Running all applications in one container.
Running each application in a separate container.
No virtualization
Every sort of virtualization technology has some kind of performance impact. For the absolute best performance, no virtualization is the best option.
Which type of virtualization provides better performance for the following scenario?
“One application running on a single piece of hardware”
Full virtualization
Containers
No virtualization
JVM
True
A virtual machine using full virtualization can only run a guest OS designed for one type of CPU (the same as the host).
True
False
True
The VM simulates enough hardware to allow an unmodified guest OS (one designed for the same CPU) to be run in isolation.
Kubernetes provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
True
False
Host OS (Base OS) kernel
Who is responsible for scheduling and memory management when using containers?
Virtual Machine Manager
Hypervisor
Host OS (Base OS) kernel
Supervisor
Hardware-assisted full virtualization
Which type of virtualization is feasible for the following scenario?
“Application that needs different custom operating systems (kernels)”
Para-virtualization
Hardware-assisted full virtualization
Containers
Guarantee that the software will always run the same irrespective of environment
Using the Dockerfile format, and relying on union filesystem technology, Docker images downloaded from a hub guarantee specific software environments for deployment.
Docker is used to:
Monitor progress of jobs running on OpenStack
Send messages from one machine to another
Guarantee that the software will always run the same irrespective of environment
Run a Java program
True
Kubernetes provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
False
True
False
A user application is not allowed to load control registers when running in kernel mode.
True
False
True
In x86, kernel mode code runs in ring 0 while user processes run in ring 3.
True
False
True
When a user application needs to handle an interrupt, it has to enter kernel mode.
True
False
True
Paravirtualization is a software-only virtualization approach.
Xen does not require special hardware support, such as Intel "VT-x" or "AMD-V".
True
False
False
Binary translation only modifies sensitive instructions.
Binary translation modifies all instructions on the fly and does not require changes to the guest operating system kernel.
True
False
False
There is only one address space in a unikernel. The application can be seen as running in kernel mode the whole time.
In a unikernel, a user application can transition to kernel mode using special instructions.
True
False
False
Making changes to a unikernel requires recompilation. A unikernel normally runs only one application.
Is it possible to install a second application with different dependencies into an existing unikernel?
True
False
Having minimal security protection.
Using a minimal device model and kernel configuration can reduce the attack surface of a microVM and does not reduce security protection.
Which is not a reason that a microVM is faster than a normal VM?
Having a minimal device model.
Having minimal security protection.
Having a minimal guest kernel configuration.
microVM
AWS Lambda and AWS Fargate are using
Container
microVM
Third
Which generation of hardware virtualization introduced IOMMU virtualization?
First
Second
Third
5.4
The latest CentOS comes with Linux kernel version 4.18. If you are running the latest CentOS container on an Ubuntu host with kernel version 5.4, which kernel version would you see inside the container?
4.18
5.4
5.4
A container always uses the host kernel.
You built the latest CentOS container on an Ubuntu host with kernel version 4.18. After you upgrade the Ubuntu kernel to 5.4, what kernel version will the built CentOS container use?
5.4
4.18
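You can check this behavior directly: because a container shares the host kernel, the snippet below prints the host's kernel release even when run inside a CentOS container.

```python
# Containers share the host kernel, so inside a CentOS container on a host
# running kernel 5.4 this prints the host's 5.4 release, not CentOS's 4.18.
import platform
print(platform.release())
```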
True
Besides using a Dockerfile to create a container image, one can also start a container from an existing image and install the necessary packages on top of it to create a new image.
True
False
True
All containers without a --network specified are attached to the default bridge network. This is a risky operation, as it allows unrelated services to communicate.
True
False
bridge
Which is the default Docker network driver?
bridge
host
overlay
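As a sketch of creating such a network programmatically with the Docker SDK for Python (pip install docker; assumes a local daemon, and the network and container names are invented): create a user-defined bridge and attach a container to it rather than relying on the risky default bridge.

```python
# Sketch using the Docker SDK for Python: user-defined bridge networks let
# containers reach each other by name, unlike the default bridge.
import docker

client = docker.from_env()                       # talk to the local daemon
client.networks.create("app-net", driver="bridge")
client.containers.run("nginx", detach=True, name="web", network="app-net")
# Another container on "app-net" could now reach this one as http://web/
```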
False
Containers on different networks can communicate using the bridge network.
True
False
overlay network
For communication among containers running on different Docker daemon hosts, you should use
bridge network
overlay network
overlay network
Which type of network connects multiple Docker daemons together and enables swarm services to communicate with each other?
bridge network
host network
overlay network
macvlan network
True
In Docker Swarm, ingress is an overlay network that handles control and data traffic related to swarm services.
True
False
Virtual IP
When a service is requested, the resulting DNS query is forwarded to the Docker Engine, which in turn returns the IP of the service: a virtual IP.
Docker internal load balancing is done using
Virtual IP
Published port numbers to the host system
Service name
The routing mesh uses IP-based service discovery and load balancing.
The routing mesh uses port-based service discovery and load balancing.
Which of the following statements is NOT true about Docker routing mesh?
The routing mesh enables each node in the swarm to accept connections on published ports for any service running in the swarm.
The routing mesh uses IP-based service discovery and load balancing.
By default all nodes participate in an ingress routing mesh.
ALL
How can you mount a storage location on the host to a container?
Bind mount
Volume
tmpfs
When you grant a newly-created or running service access to a secret, the decrypted secret is mounted into the container in an in-memory filesystem
How does a service get access to secret information in Docker Swarm?
When you grant a newly-created or running service access to a secret, the decrypted secret is mounted into the container in an in-memory filesystem
When you grant a newly-created or running service access to a secret, the encrypted secret is mounted into the container in an in-memory filesystem. The program in the container needs to decrypt the secret using the appropriate master key.
When you grant a newly-created or running service access to a secret, the decrypted secret is mounted into the container in a disk mounted location
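Inside a swarm service container, a granted secret appears as a file under /run/secrets/<secret-name> in that in-memory filesystem, so reading it is ordinary file I/O; the secret name below is illustrative.

```python
# Inside the service container: the decrypted secret is a file in an
# in-memory filesystem mounted at /run/secrets/<name>.
from pathlib import Path

db_password = Path("/run/secrets/db_password").read_text().strip()
```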
Replaces the functionality of Dockerfile.
It still needs a Dockerfile to create images.
Which of the following statements about Docker Compose is incorrect?
Uses a YAML file to configure application’s services.
Replaces the functionality of Dockerfile.
Main tool by Docker for container orchestration
overlay network
For containers running on different hosts to communicate, you should use
bridge network
host network
overlay network
macvlan network
False
It's only possible to use container names in a user-defined bridge network.
For a container to communicate with another container running on the default bridge network, one can use either the target container's IP address or its container name directly.
True
False
False
The routing mesh will automatically route the incoming traffic to a node that has a service task running on it.
Docker Swarm routing mesh will report an error if an external load balancer reaches a node that does not have a task belonging to the requested service.
True
False
False
When you publish a service port, the swarm makes the service accessible at the target port only on nodes that have a task running for that service.
True
False
Bind mount
You are working on a course MP and want to use the IDE on the host to edit code and run the code inside a container. Which is the best way to make the code accessible inside the container?
Bind mount
Volume
tmpfs
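A sketch of the bind-mount workflow using the Docker SDK for Python; the paths and image are illustrative assumptions. The host directory is mounted into the container, so host-side edits are immediately visible inside.

```python
# Sketch: bind-mount the host's MP directory into the container so code
# edited on the host runs inside without rebuilding the image.
import docker

client = docker.from_env()
client.containers.run(
    "python:3.11",
    "python /app/main.py",
    volumes={"/home/user/mp": {"bind": "/app", "mode": "rw"}},  # bind mount
    detach=True,
)
```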
False
It's the opposite for both cases.
AWS ECS has two launch types: EC2 and Fargate. EC2 automatically manages all resource provisioning, while for Fargate it is managed by the customer.
True
False
Create container images based on user-specified requirements.
Which of the following is not a function of Kubernetes?
Schedule containers to run on physical or virtual machines.
Create container images based on user-specified requirements.
None of the above
Kubernetes can be classified as a
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
Software as a Service (SaaS)
None of the above
True
A node in Kubernetes can run several pods and each pod can run several containers.
True
False
True
Etcd is a key-value store that provides a consistent distributed state for a Kubernetes cluster.
True
False
False
In Kubernetes users need to take care of mapping container ports to host ports.
True
False
True
Environment variables and DNS are two primary modes for service discovery in Kubernetes.
True
False
ALL
In Docker Swarm, a service is used to handle
Launching and monitoring tasks
Rolling updates
Network routing
Launching and monitoring pods
In Kubernetes, a ReplicaSet takes care of
Launching and monitoring pods
Rolling updates
Network routing
All of the above
Sidecar pattern
Which design pattern will you use for adding HTTPS to a legacy service?
Sidecar pattern
Ambassador pattern
Adapter pattern
Pod
What is the smallest control unit in Kubernetes?
Deployment
Node
Pod