Which system is a more natural fit for OLTP?
A. Managed Machine Learning platforms
B. Data Warehouse
C. Data Lake
D. RDBMS
RDBMS
A Datacube is best thought of as a(n)
A. specialized hardware for fast analysis of massive data
B. data structure, more specifically, a sophisticated nested array
C. archival service provided by AWS
D. function that structures and compresses data
Data structure, more specifically, a sophisticated nested array
In general, it is very easy and straightforward to transform data from a SQL database into an OLAP cube.
[T/F]
False (Correct! OLAP cubes require that data teams manage complicated pipelines to transform data from a SQL database into OLAP cubes)
Select all that apply: What are some commonly available datacube operations?
A. Slicing
B. Dicing
C. Drill Up / Down
D. Roll-up
E. Pivot
A. Slicing
B. Dicing
C. Drill Up / Down
D. Roll-up
E. Pivot
Which of these would probably be best for storing data retrieved by a key or a sequence of keys?
A. MonetDB
B. SybaseIQ
C. Vertica
D. BigTable
BigTable: As a wide-column store, BigTable is specialized for access by a key
If your primary interest is the richest possible analysis capabilities, which of these two options would likely be the better choice?
A. Column-Oriented Data Warehouse
B. OLAP Datacubes
OLAP Datacubes: Correct! RDBMSs are often limited by the constraints of SQL
Suppose a table contains 10000 rows and 100 columns. A query that uses all of the rows and 5 columns will need to read approximately what percentage of the data contained in the table if you are using a traditional row-based RDBMS system?
A. Significantly less than 5%
B. Approximately 5%
C. Significantly more than 5%
Significantly more than 5%
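A quick back-of-the-envelope check of why the answer is "significantly more than 5%" (a small Python sketch; the numbers come straight from the question):

```python
# Row store: each of the 10,000 rows is read in full (all 100 columns),
# even though the query only needs 5 of them.
rows, cols, cols_needed = 10_000, 100, 5

row_store_fraction = (rows * cols) / (rows * cols)            # ~100% of the table
column_store_fraction = (rows * cols_needed) / (rows * cols)  # ~5% with a column store

print(row_store_fraction, column_store_fraction)              # 1.0 0.05
```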
A Data Lake is:
A. a new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
B. a new type of data repository for storing massive amounts of structured data in a single location, rather than spread over multiple datacenters, in order to exploit data locality to speed the analysis
C. a new type of data repository for storing massive amounts of unstructured data in a single location for processing, cleaning, and structuring
A new type of data repository for storing massive amounts of raw data in its native form, in a single location (both structured and unstructured)
Today, OLAP cubes are always designed to fit in the hosting computer's main memory to maximize analytical performance [T/F]
False: Today, OLAP cubes refer specifically to contexts in which these data structures far outstrip the size of the hosting computer's main memory
Redshift, like most Columnar Stores, makes it easy to update blocks.
A. Redshift, like most Columnar Stores, makes it easy to update blocks.
B. False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
C. True. Redshift, like most Columnar Stores, are write-optimized, so updates are easy
D. False. Redshift is not a Columnar Store, but a data pipeline that connects Columnar Stores to analysis engines
False. In Redshift, blocks are immutable. In general, Columnar Stores are not good at updates compared to other approaches
Why is MapReduce not efficient for large-scale graph processing?
A. The map function is a computational bottleneck.
B. It is not fault-tolerant enough.
C. It produces too much communication between stages.
D. It causes load imbalance.
C. It produces too much communication between stages.
MapReduce tends to be inefficient here because the graph state must be stored at each stage of the graph algorithm, and each computational stage produces a lot of communication between stages.
Graph computations involve local data (a small part of the graph surrounding a vertex), and the connectivity between vertices is sparse. The data may not all fit on one node, which makes graph algorithms hard to express in the map/reduce model.
You want to build a shortest path algorithm using parallel breadth-first search in Pregel. Which of the following pseudo-codes is the proper "compute function" for this program?
A. compute(list of edges) -> return list of messages
B. compute(list of vertexes) -> return list of messages
C. compute(list of messages) -> return list of messages
D. compute(graph) -> return list of messages
C. compute(list of messages) -> return list of messages
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
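A rough illustration of why option C has the right shape: in the Pregel model a vertex's compute() consumes the messages sent to it in the previous superstep and emits new messages to its neighbors. Below is a minimal single-machine sketch of BFS-style shortest paths (illustrative only; the graph and names are made up, and this is not the real Pregel/Giraph API):

```python
# Toy Pregel-style single-source shortest path (unweighted BFS variant).
graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}      # adjacency list (example graph)
INF = float("inf")
dist = {v: INF for v in graph}

def compute(vertex, messages):
    """One superstep for one vertex: consume incoming messages, emit new ones."""
    candidate = min(messages) if messages else INF
    if candidate < dist[vertex]:
        dist[vertex] = candidate
        # Tell every neighbor about a path that is one hop longer.
        return [(nbr, dist[vertex] + 1) for nbr in graph[vertex]]
    return []                                      # nothing improved: vertex stays quiet

# Superstep loop: the "master" delivers messages between supersteps.
inbox = {0: [0]}                                   # kick off the source with distance 0
while inbox:
    outbox = {}
    for vertex, messages in inbox.items():
        for nbr, d in compute(vertex, messages):
            outbox.setdefault(nbr, []).append(d)
    inbox = outbox

print(dist)                                        # {0: 0, 1: 1, 2: 1, 3: 2}
```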
How is checkpointing done in Pregel?
A. It regularly uses "ping" messages.
B. Each worker communicates with the other workers.
C. The workers all reload their partition state from the most recent available checkpoint.
D. The master periodically instructs the workers to save the state of their partitions to persistent storage.
D. The master periodically instructs the workers to save the state of their partitions to persistent storage.
How does Pregel detect the failure?
A. It regularly uses "ping" messages.
B. The master periodically instructs the workers to save the state of their partitions to persistent storage.
C. Each worker communicates with the other workers.
D. The workers all reload their partition state from the most recent available checkpoint.
A. It regularly uses "ping" messages.
How is recovery being done in Pregel?
A. Each worker communicates with the other workers.
B. The master periodically instructs the workers to save the state of their partitions to persistent storage.
C. It regularly uses "ping" messages.
D. The workers all reload their partition state from the most recent available checkpoint.
D. The workers all reload their partition state from the most recent available checkpoint.
During recovery, the master re-assigns the graph partitions to the currently available workers, so unfinished work is shared out among the workers that are still alive and can process it. The workers then reload their partition state from the most recent available checkpoint and continue.
What is ZooKeeper's role in task assignment in Giraph?
A. Responsible for coordination
B. Responsible for vertices
C. Communicate with other workers
D. Responsible for the state of computation
D. ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
Master: responsible for coordination
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is Master's role for task assignment in Giraph?
A. Communicate with other workers
B. Responsible for the state of computation
C. Responsible for vertices
D. Responsible for coordination
D. Responsible for coordination
ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
Master: responsible for coordination
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is Worker's role for task assignment in Giraph?
A. Responsible for vertices
B. Responsible for coordination
C. Communicate with other workers
D. Responsible for the state of computation
A. Responsible for vertices
ZooKeeper is responsible for computation state:
• Partition/worker mapping
• Global state: #superstep
• Checkpoint paths, aggregator values, statistics
Master: responsible for coordination
• Assigns partitions to workers
• Coordinates synchronization
• Requests checkpoints
• Aggregates aggregator values
• Collects health statuses
Worker: responsible for vertices
• Invokes active vertices compute() function
• Sends, receives, and assigns messages
• Computes local aggregation values
What is graph processing?
A. A graph database is any storage system that provides index-free adjacency
B. A non-relational, distributed database
C. A distributed real-time computation system
D. A framework for distributed storage and processing of large data sets
A. A graph database is any storage system that provides index-free adjacency
Correct! A graph database holds all the information and provides some way of retrieving it, and typically a graph database provides index-free adjacency.
Graph Processing:
• A graph database is any storage system that provides index-free adjacency. Has pointers to adjacent elements...
• Nodes represent entities (people, businesses, accounts...)
• Properties are pertinent information that relate to nodes
• Edges interconnect nodes to nodes or nodes to properties, and they represent the relationship between the two
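A minimal sketch of what index-free adjacency means in practice: each node object holds direct references to its neighbors, so traversal is pointer chasing rather than index lookups. This is an illustrative Python toy (the class and property names are made up, not any particular graph database's API):

```python
class Node:
    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties      # pertinent information about the entity
        self.edges = []                   # direct pointers: (relationship, neighbor)

    def connect(self, relationship, other):
        self.edges.append((relationship, other))

alice = Node("Person", name="Alice", age=30)
acme = Node("Business", name="Acme Corp")
alice.connect("WORKS_AT", acme)

# Traversal starts from a node you already hold and just follows its pointers:
for rel, neighbor in alice.edges:
    print(rel, "->", neighbor.properties["name"])   # WORKS_AT -> Acme Corp
```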
Which of these is a property of a graph database?
A. Associative data sets
B. Uses a relational model of data
C. Entity type has its table
D. Performs the same operation on large numbers of data
A. Associative data sets
A graph database holds a collection of associative data sets: you look up one item and retrieve the items associated with it, such as a node and its connections along its edges.
What is an example of a collaborative filtering application?
A. Finding the frequent item sets frequently bought together
B. Placing new items into predefined categories
C. A recommendation engine working based on the user preferences and others with similar preferences
D. Grouping similar object together without knowing the groups ahead
C. A recommendation engine working based on the user preferences and others with similar preferences
Collaborative filtering combines the preferences of many users (multiple "filters" working together) to surface just the items relevant to a given user.
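A tiny user-based collaborative-filtering sketch of option C: find the user whose past ratings look most like the target user's, and recommend what that neighbor liked. The ratings and the similarity measure below are made up for illustration:

```python
ratings = {                               # user -> {item: rating}
    "ann":  {"a": 5, "b": 4, "c": 1},
    "bob":  {"a": 5, "b": 5, "d": 4},
    "carl": {"c": 5, "d": 1},
}

def similarity(u, v):
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return float("-inf")
    # Crude similarity: negative mean absolute rating difference on shared items.
    return -sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)

def recommend(user):
    nearest = max((v for v in ratings if v != user), key=lambda v: similarity(user, v))
    # Suggest items the nearest neighbor rated highly that the user hasn't seen yet.
    return [i for i, r in ratings[nearest].items() if i not in ratings[user] and r >= 4]

print(recommend("ann"))                   # ['d'], borrowed from the similar user "bob"
```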
How do cloud providers technically handle model deployment?
A. They keep a pool of virtual machines active, so that any time a HTTPS request for a model inference arrives, one of the VMs is ready to fetch the model artifacts from the model repository and run it.
B. Amazon SageMaker stores all trained models in DynamoDB. Upon an HTTPS request, it asks DynamoDB for the model and runs the proper pre-written algorithm along with parameters fetched from DynamoDB.
C. They make a docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time a HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
D. The models are stored in JavaScript. Any browser that wishes to run a model fetches the model parameters from a cloud-based BLOB storage and simply runs the model code locally.
C. They make a docker container of the inferencing code, and keep a reference to a BLOB storage bucket where the trained model's parameters are stored. Any time a HTTPS request for a model inference arrives, they launch the container, which fetches the parameters and runs the inference.
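A skeletal sketch of the pattern in option C: the containerized inference code lazily fetches the trained parameters from blob storage on the first request, caches them, and then serves. Everything here (the URI, fetch_from_blob_storage, the linear model) is an illustrative placeholder, not a specific cloud provider's API:

```python
import json

MODEL_URI = "blob://models/example-model/params.json"    # hypothetical artifact location
_params = None                                           # cached after the first fetch

def fetch_from_blob_storage(uri):
    # Stand-in for an object-store download (in reality: an HTTPS GET to the bucket).
    return json.dumps({"weights": [0.4, -1.2], "bias": 0.1})

def handle_inference_request(features):
    global _params
    if _params is None:                                  # cold start: pull the artifact
        _params = json.loads(fetch_from_blob_storage(MODEL_URI))
    score = sum(w * x for w, x in zip(_params["weights"], features)) + _params["bias"]
    return {"score": score}

print(handle_inference_request([1.0, 2.0]))              # {'score': -1.9...}
```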
Which group does K-means fall into?
A. Collaborative filtering
B. Clustering
C. Frequent pattern mining
D. Classification
B. Clustering
Correct! k-means clustering aims to partition n observations into k clusters
Which of the following is not a classification mechanism?
A. Convolution Neural Network Classifier
B. K-means
C. Naïve Bayes
D. Decision forests
K-means
In a typical data science workflow, what are the steps involved?
A. Model training, model exploration, cleaning the outcomes, interpreting the results.
B. Cleaning data, exploring data, model training and evaluation, obtaining results, deploying the model
C. Obtaining data, data cleaning, model training, model exploration, model deployment
D. Obtaining data, scrubbing data, exploring the dataset, train and evaluate a model, and interpreting the results
D. Obtaining data, scrubbing data, exploring the dataset, train and evaluate a model, and interpreting the results
1. Gathering Data
2. Data Preparation
3. Data Wrangling
4. Analyse Data
5. Train Model
6. Test Model
7. Deployment
OSEMN Data Science:
1. Obtain
2. Scrub
3. Explore
4. Model
5. Interpret
What is an example of an FPM application?
A. Grouping similar object together without knowing the groups ahead
B. Finding the frequent item sets frequently bought together
C. A recommendation engine working based on the user preferences and others with similar preferences
D. Placing new items into predefined categories
B. Finding the frequent item sets frequently bought together
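A toy version of that frequent-itemset idea: count how often pairs of items appear in the same basket and keep the pairs above a minimum support. The baskets are made up, and this is only a counting sketch, not a full Apriori/FP-Growth implementation:

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]
min_support = 2                       # pair must appear in at least 2 baskets

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)                       # {('bread', 'butter'): 2, ('bread', 'milk'): 2}
```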
In K-means, what is the order of the following steps?
A. For each data point, assign to the closest centroid
B. If new centroids are different from the old, re-iterate through the loop
C. For each cluster, re-compute the centroids
D. Randomly select k centroids
D -> A -> B -> C
A -> B -> C -> D
D -> A -> C -> B
A -> D -> C -> B
D -> A -> C -> B
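A minimal k-means sketch following that D -> A -> C -> B order, on made-up 1-D data (illustrative only):

```python
import random

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
k = 2

centroids = random.sample(points, k)                 # D: randomly select k centroids
while True:
    # A: assign each data point to its closest centroid
    clusters = {i: [] for i in range(k)}
    for p in points:
        nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # C: for each cluster, re-compute the centroid as the mean of its points
    new_centroids = [sum(c) / len(c) if c else centroids[i] for i, c in clusters.items()]
    # B: if the new centroids differ from the old, re-iterate through the loop
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(sorted(centroids))                             # roughly [1.0, 8.07]
```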
Which of the following best describes how Naïve Bayes works?
A. A set of items is given, and the most frequent set of items is found.
B. A set of unlabeled data points are given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
C. A set of labeled data points are given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
D. A set of data points are given and it classifies them into k groups
C. A set of labeled data points are given. A model based on those data points is built such that for any new unlabeled data point the label is determined.
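A tiny count-based Naïve Bayes sketch matching option C: build a model from labeled points, then predict the label of a new unlabeled point. The data is invented and there is no smoothing, so this is only an illustration of the idea:

```python
from collections import Counter, defaultdict

# (features, label) training examples: (weather, temperature) -> play?
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "yes"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "no"),
    (("sunny", "mild"), "yes"),
]

label_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)            # (position, label) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    best, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / len(data)                # prior P(label)
        for i, value in enumerate(features):
            score *= feature_counts[(i, label)][value] / count   # P(value | label), assumed independent
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("sunny", "mild")))                # 'yes'
```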
What are the definitions of hyperparameter optimization and AutoML?
A. Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
B. Hyperparameter optimization means adjusting the parameters of a search space using gradient descent, and is a technical term. AutoML is a special case of hyperparameter optimization, and is marketing jargon.
C. They are both the same and used interchangeably.
D. Hyperparameter optimization refers to adjusting the parameters of a hyper plane that divides the search space in equidistance quadrants. AutoML means the cloud provider takes care of orchestrating machine learning artifacts' deployment.
A. Hyperparameter optimization tunes the training parameters of a single training algorithm, while AutoML tries out multiple training algorithms on the input dataset.
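A contrived but concrete way to see the difference described in option A: hyperparameter optimization searches the knobs of one training algorithm, while AutoML additionally searches across algorithms. The two "training algorithms" below are fake scoring functions used purely for illustration:

```python
def train_knn(k):                        # stand-in for training algorithm #1
    return 1.0 / (1 + abs(k - 7))        # pretend k = 7 gives the best validation score

def train_tree(depth):                   # stand-in for training algorithm #2
    return 1.0 / (1 + abs(depth - 4))    # pretend depth = 4 is ideal

# Hyperparameter optimization: tune the parameters of a single algorithm.
best_k = max(range(1, 15), key=train_knn)

# AutoML: also try multiple algorithms (each with its own hyperparameter grid).
candidates = [("knn", train_knn, range(1, 15)), ("tree", train_tree, range(1, 10))]
best = max(((name, p, fn(p)) for name, fn, grid in candidates for p in grid),
           key=lambda t: t[2])

print(best_k)    # 7
print(best)      # ('knn', 7, 1.0) -- the first candidate reaching the top score
```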
If we want to find which set of items in a grocery shop are frequently bought together, which of the following approaches should we use?
A. Naïve Bayes
B. Decision Forests
C. K-Means
D. FPM
D. FPM
What is load shedding?
A. Enabling a system to continue operating properly in the case of the failure of some of its components
B. The process of eliminating events to keep up with the rate of events
C. Distributing applications across many servers
D. Distributing the data across different parallel computing nodes
B. The process of eliminating events to keep up with the rate of events
Why Real-Time Stream Processing?
Real-time data processing at massive scale is becoming a requirement for businesses
• Real-time search, high frequency trading, social networks
• Have a stream of events that flow into the system at a given data rate
The processing system must keep up with the event rate or degrade gracefully by eliminating events. This is typically called load shedding.
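A minimal picture of load shedding: when events arrive faster than the system can hold, the extra events are simply dropped. Purely illustrative (real systems shed by sampling, priorities, or back-pressure):

```python
from collections import deque

MAX_QUEUE = 3
queue = deque()
dropped = 0

def on_event(event):
    global dropped
    if len(queue) >= MAX_QUEUE:
        dropped += 1                 # load shedding: eliminate the event
    else:
        queue.append(event)

for e in range(10):                  # a burst of 10 events, but room for only 3
    on_event(e)

print(list(queue), "dropped:", dropped)   # [0, 1, 2] dropped: 7
```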
Which of the following is correct?
A. A topology is a network of tuples and streams.
B. A bolt processes input streams and produces new streams.
C. A stream connects a bolt to a spout.
D. A spout can receive output from many streams.
B. A bolt processes input streams and produces new streams.
Topologies
- graph of spouts and bolts that are connected with stream groupings
- runs indefinitely (no time/batch boundaries)
Streams
- unbounded sequence of tuples that is processed and created in parallel in a distributed fashion
Spouts
- input source of streams in topology
Bolts
- processing container, which can perform transformation, filter, aggregation, join, etc.
- sinks: special type of bolts that have an output interface
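A toy in-process imitation of those pieces: a spout that emits a stream of tuples and bolts that turn one stream into another. This is just an illustrative sketch in Python, not the actual Apache Storm API:

```python
def quote_spout():                           # spout: input source of the stream
    for quote in ["to be or not to be", "to be is to do"]:
        yield quote

def split_bolt(stream):                      # bolt: processes a stream, produces a new one
    for quote in stream:
        for word in quote.split():
            yield word

def count_bolt(stream):                      # bolt acting as a sink: aggregates counts
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# "Topology": spout -> split bolt -> count bolt, connected by streams of tuples.
print(count_bolt(split_bolt(quote_spout())))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'is': 1, 'do': 1}
```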
In a Storm program that produces a sorted list of the top K most frequent words encountered across all the documents streamed into it, four kinds of processing elements (bolts in Storm) might be created: QuoteSplitterBolt, WordCountBolt, MergeBolt, and SortBolt.
What is the order in which words flow through the program?
A. QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
B. WordCountBolt, QuoteSplitterBolt, SortBolt, MergeBolt
C. WordCountBolt, QuoteSplitterBolt, MergeBolt, SortBolt
D. QuoteSplitterBolt, SortBolt, WordCountBolt, MergeBolt
A. QuoteSplitterBolt, WordCountBolt, SortBolt, MergeBolt
What does Trident do?
A. Provides a persistent state for the bolts, with a predefined set of characteristics
B. Provides a persistent state for the bolts, but the exact implementation is up to the user
C. Provides a persistent state for the spout, but the exact implementation is up to the user
D. Provides a persistent state for the topology, with a predefined set of characteristics
B. Provides a persistent state for the bolts, but the exact implementation is up to the user
Trident:
- Provides exactly-once semantics
- In Trident, state is a first-class citizen, but the exact implementation of state is up to you
  - There are many prebuilt connectors to various NoSQL stores like HBase
- Provides a high-level API (similar to Cascading for Hadoop)
- Uses transactions to update state
  - Processes each record exactly once
  - Per-state transaction to an external database is slow
What are streams in Apache Storm?
A. Unbounded sequences of tuples
B. A network of spouts and bolts
C. Processors of input
D. Aggregators
A. Unbounded sequences of tuples
What are spouts in Apache Storm?
A. Network of spouts and bolts
B. Unbounded sequences of tuples
C. Sources of streams
D. Processors of input
C. Sources of streams
What are topologies in Apache Storm?
A. Sources of streams
B. Unbounded sequences of tuples
C. Processors of input
D. Networks of spouts and bolts
D. Networks of spouts and bolts
In the "At Least Once" message process, what happens if there is a failure?
A. You must create and implement your load-balance algorithm.
B. Storm's natural fault-tolerance takes over.
C. Events are double processed.
D. Storm's natural load-balancing takes over.
C. Events are double processed.
How does Thrift contribute to Storm?
A. Allows Storm to be used from many languages
B. Provides load-balancing functionality
C. Enables the usage of streams
D. Provides scalability
A. Allows Storm to be used from many languages
Thrift allows users to define and create services which are both consumable by and serviceable by numerous languages
Which of the following statements is true?
A. Spark Streaming chops a stream into small batches and processes each batch independently.
B. Spark Streaming has no support for state.
C. Spark Streaming uses transactions to update state.
D. Spark Streaming treats each tuple independently and replays a record if not processed.
A. Spark Streaming chops a stream into small batches and processes each batch independently.
Correct! This approach is called micro-batching.
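Micro-batching in miniature: chop an unbounded stream into small fixed-size batches and process each batch independently. An illustrative sketch only (Spark Streaming batches by time interval and runs distributed, which this toy does not attempt):

```python
import itertools

def event_stream():                       # pretend this never ends
    for i in itertools.count():
        yield i

def micro_batches(stream, batch_size):
    while True:
        yield list(itertools.islice(stream, batch_size))

stream = event_stream()
for batch in itertools.islice(micro_batches(stream, 4), 3):
    print("processing batch:", batch, "sum =", sum(batch))
# processing batch: [0, 1, 2, 3] sum = 6
# processing batch: [4, 5, 6, 7] sum = 22
# processing batch: [8, 9, 10, 11] sum = 38
```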
Which of the following best describes Lambda architecture? (Not to be confused with AWS Lambda).
A. A serial processing pipeline of first a streaming processing system and then a batch processing system
B. Only a stream processing pipeline but with the ability to handle failures
C. A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
D. A serial processing pipeline of first a batch processing system and then a stream processing system
C. A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
Which of the following best describes Kappa architecture?
A. Only one stream processing pipeline but with the ability to handle failures
B. A parallel processing pipeline of two branches: a stream processing pipeline and a batch processing pipeline
C. A serial processing pipeline of first a streaming processing system and then a batch processing system
D. A serial processing pipeline of first a batch processing system and then a stream processing system
A. Only one stream processing pipeline but with the ability to handle failures
Correct! The Kappa Architecture tries to get away from the two parallel paths and uses only the streaming path, but makes the streaming robust enough that failures do not corrupt the state.
Which system has a great graphical UI to design dataflows?
A. NiFi
B. Druid
C. Storm
D. Spark Streaming
A. NiFi
Correct! In NiFi you can visually design a dataflow graph to process your data.
Which type of virtualization is feasible for the following scenario?
"A service needs to run an unmodified OS on a basic processor, separate from the host operating sysetm."
A. Container
B. Para-virtualization
C. Full virtualization
C. Full virtualization
Which type of virtualization is feasible for the following scenario?
"A service needs to run an unknown and unmodified OS on an advanced processor."
A. Hardware-assisted
B. Para-virtualization
A. Hardware-assisted
Which type of virtualization provides better performance for the following scenario?
"Running multiple independent applications sharing the same kernel"
A. Containers
B. Hardware-assisted full virtualization
A. Containers
Which type of virtualization provides better performance for the following scenario?
"Running two independent applications, each needs a different version of a kernel module".
A. Containers
B. Full virtualization
B. Full virtualization
Who is responsible for scheduling and memory management when using containers?
A. Host OS (Base OS) kernel
B. Virtual Machine Manager
C. Supervisor
D. Hypervisor
Host OS (Base OS) kernel
Correct! In container-based systems, the same host kernel is shared among containers, and this kernel is responsible for scheduling and memory management.
Which type of virtualization is feasible for the following scenario?
"Application that needs different custom operating systems (kernels)"
A. Para-virtualization
B. Hardware-assisted full virtualization
C. Containers
B. Hardware-assisted full virtualization
Docker is used to:
A. Run a Java program
B. Guarantee that the software will always run the same irrespective of environment
C. Send messages from one machine to another
D. Monitor progress of jobs running on OpenStack
B. Guarantee that the software will always run the same irrespective of environment
Using the Dockerfile format, and relying on union filesystem technology, Docker images downloaded from a hub guarantee specific software environments for deployment.
Kubernetes provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
A. True
B. False
A. True
A user application is not allowed to load control registers when running in kernel mode.
A. True
B. False
B. False
In x86, kernel mode code runs in ring 0 while user processes run in ring 3.
A. True
B. False
A. True
When a user application needs to handle an interrupt, it has to enter kernel mode.
A. True
B. False
A. True
Xen does not require special hardware support, such as Intel "VT-x" or "AMD-V".
A. True
B. False
A. True
Paravirtualization is a software-only virtualization approach.
Binary translation modifies all instructions on the fly and does not require changes to the guest operating system kernel.
A. True
B. False
B. False
Binary translation only modifies sensitive instructions.
In a unikernel, a user application can transition to kernel mode using special instructions.
A. True
B. False
B. False
There is only one address space in a unikernel. The application can be seen as running in kernel mode the whole time.
Is it possible to install a second application with different dependencies into an existing unikernel?
A. True
B. False
B. False
Making changes to a unikernel requires recompilation. A unikernel normally runs only one application.
Which is not a reason that a microVM is faster than a normal VM?
A. Having a minimal device model.
B. Having a minimal security protection.
C. Having a minimal guest kernel configuration.
B. Having a minimal security protection.
Using a minimal device model and kernel configuration reduces the attack surface of a microVM; it does not reduce security protection.
AWS Lambda and AWS Fargate use:
A. Container
B. microVM
B. microVM
Which generation of hardware virtualization introduced IOMMU virtualization?
A. First
B. Second
C. Third
C. Third