Exam 2 5190

86 Terms

1.

What is the purpose of data splitting in machine learning?
A) To increase the dataset size
B) To ensure that the model generalizes to new data
C) To only train the model on one set
D) To avoid training and validation altogether

B)
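
A tiny illustration of the idea — a minimal sketch assuming scikit-learn, with an illustrative 70/15/15 split: hold out validation and test sets so generalization can be checked on data the model never trained on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)              # toy feature matrix
y = np.random.randint(0, 2, size=100)   # toy binary labels

# Hold out 30%, then split that holdout half-and-half into val/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```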

2.

What is the primary role of a hidden layer in a neural network?
A) To provide output to the model
B) To directly receive input data
C) To perform computations and extract features
D) To update weights during backpropagation

C)

3.

Which activation function is commonly used in binary classification for the output layer?
A) ReLU
B) Tanh
C) Sigmoid
D) Softmax

C)
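
In code (pure numpy sketch): a sigmoid squashes any logit into (0, 1), which is why it fits a binary output layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logit = 0.8            # raw last-layer output (illustrative value)
print(sigmoid(logit))  # ~0.69, readable as P(class = 1)
```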

4.

What is an epoch in neural network training?
A) One complete pass through the training dataset
B) A single calculation of the loss
C) A batch update of weights
D) The first phase of model testing

A)
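
A schematic, pure-Python loop showing what an epoch is: one complete pass over the training set, usually taken in mini-batches.

```python
dataset = list(range(100))   # stand-in for 100 training examples
batch_size = 10

for epoch in range(3):       # 3 epochs = 3 full passes over the data
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]
        # forward pass, loss, backward pass, and weight update go here
    print(f"epoch {epoch}: saw all {len(dataset)} examples once")
```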

5.

Which of the following techniques is specifically used to improve the model's generalization?
A) Using only training data
B) Batch updates
C) Hyperparameter tuning
D) Weight initialization

C)

6.

Why is the test dataset kept separate during model training?
A) To save computation time
B) To tune hyperparameters
C) To provide an unbiased evaluation of the model’s generalization ability
D) To increase training accuracy

C)

7.

What is the main purpose of an activation function in a neural network?
A) To initialize weights
B) To introduce non-linearity
C) To reduce computation time
D) To create a linear output

B)

8.

Which of these is a common optimizer used in training neural networks?
A) Tanh
B) SGD (Stochastic Gradient Descent)
C) Softmax
D) Cross-entropy

B)

9.

What happens during backpropagation in a neural network?
A) Calculation of gradients to update weights
B) Forward pass of data through the network
C) Prediction of outputs based on weights
D) Application of the activation function

A)
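
A minimal autograd sketch (assuming PyTorch): backward() is the backpropagation step that computes the gradients the optimizer will later use.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x, y_true = torch.tensor([3.0]), torch.tensor([10.0])

y_pred = w * x                  # forward pass
loss = (y_pred - y_true) ** 2   # squared error
loss.backward()                 # backprop: dloss/dw = 2*(w*x - y)*x
print(w.grad)                   # tensor([-24.])
```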

10.

Which metric is commonly used to evaluate model performance in regression tasks?
A) Cross-entropy
B) Mean Squared Error (MSE)
C) Accuracy
D) F1 Score

B)
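
MSE by hand in numpy: the mean of squared differences between predictions and targets.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
print(np.mean((y_true - y_pred) ** 2))  # (0.25 + 0 + 1) / 3 ≈ 0.417
```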

11.

What does overfitting in a model indicate?
A) The model performs well on new, unseen data
B) The model fails to capture patterns in the training data
C) The model has memorized the training data rather than generalizing patterns
D) The model has too few parameters

C)

12.

Which type of data is typically not shuffled during training?
A) Training data
B) Validation data
C) Test data
D) Both B and C

D)

13.

What is the role of the loss function in a neural network?
A) It calculates the error between predictions and actual values
B) It initializes weights randomly
C) It determines the architecture of the network
D) It sets the learning rate

A)

14.

Which of the following statements about the learning rate is true?
A) A high learning rate ensures accuracy but slows convergence
B) A low learning rate speeds up convergence
C) A high learning rate can cause overshooting the optimal solution
D) The learning rate is fixed and cannot be changed

C)
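
A toy gradient-descent run on f(w) = w**2 (minimum at w = 0) makes the overshooting concrete: a modest step size converges, a too-large one diverges.

```python
def descend(lr, steps=5, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2w
    return w

print(descend(lr=0.1))    # ~0.33: shrinking toward the optimum
print(descend(lr=1.1))    # ~-2.49 and growing: overshooting the minimum
```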

15.

In neural network training, what does “convergence” mean?
A) The model achieves 100% accuracy
B) The model’s performance on the training set decreases
C) The model’s performance stabilizes and no longer improves
D) The model's parameters are reset

C)

16.

What is the main advantage of zero-shot prompting in using large language models?
A) It uses external knowledge sources
B) It is a cost-efficient and quick way to test a model's ability on a task
C) It customizes the model with specific examples
D) It provides domain-specific knowledge

B)

17.

In few-shot prompting with dynamic examples, what is the primary difference compared to fixed examples?
A) Dynamic examples require model retraining
B) Fixed examples offer real-time context adaptability
C) Dynamic examples adapt based on input context at runtime
D) Fixed examples allow more customization than dynamic ones

C)

18.

Which of the following is a benefit of model quantization?
A) Improved model interpretability
B) Lower computational cost during inference
C) Increased accuracy on complex tasks
D) Increased need for high-precision hardware

B)

19.

What is the purpose of adding adapter layers in parameter-efficient fine-tuning?
A) They replace the model’s pre-trained layers
B) They reduce the number of layers needed for training
C) They adapt the model to new tasks with minimal added parameters
D) They increase the model’s complexity for specific tasks

C)

20.

What is a primary benefit of using prompt tuning over traditional fine-tuning?
A) It alters the model's pre-trained weights
B) It reduces computational cost by adjusting only soft prompts
C) It requires the model to retain only task-specific information
D) It is only applicable to binary classification tasks

B)

21.

Which statement best describes prefix tuning?
A) Prefix tuning modifies the output layer of the model
B) Prefix tuning appends trainable parameters at the start of each layer
C) Prefix tuning requires full retraining of the model
D) Prefix tuning applies only to the input sequence of the model

B)

22.

In Low-Rank Adaptation (LoRA), why are low-rank matrices used?
A) To replace the model’s existing weight matrices
B) To increase computational complexity for specific tasks
C) To make fine-tuning memory efficient by reducing trainable parameters
D) To improve model performance by adding more layers

C)
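
A back-of-the-envelope numpy sketch (dimensions are illustrative) of why low-rank matrices cut the trainable parameter count: the frozen weight W stays d x d, while the trainable update B @ A has rank r << d.

```python
import numpy as np

d, r = 4096, 8
W = np.zeros((d, d))        # frozen pre-trained weight
A = np.zeros((r, d))        # trainable
B = np.zeros((d, r))        # trainable; effective weight is W + B @ A
print(W.size)               # 16,777,216 params if fully fine-tuned
print(A.size + B.size)      # 65,536 trainable params with LoRA (~0.4%)
```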

23.

Which method combines quantization with LoRA to further reduce memory requirements during fine-tuning?
A) Flash Attention
B) Model Pruning
C) QLoRA
D) Adapter Tuning

C)

24.

What is the key purpose of Flash Attention?
A) To reduce memory and speed up training in long-sequence tasks
B) To increase model accuracy by adding more layers
C) To improve the model's pre-trained knowledge base
D) To reduce computational time for small datasets

A)

25.

What does Rotary Position Embedding (RoPE) aim to improve in language models?
A) Model accuracy on short tasks
B) Position-based embeddings for memory efficiency
C) Long-context understanding and sequence memory
D) Reducing training time for complex tasks

C)

26.

What is the main purpose of theta scaling in RoPE?
A) To enhance the model’s ability to interpret embeddings
B) To adapt the model for shorter sequences
C) To allow the model to handle very long sequences
D) To optimize model layers for parallel processing

C)

27.

In model fine-tuning, why is more memory required compared to inference?
A) Only the forward pass is used in fine-tuning
B) Fine-tuning involves additional backpropagation and gradient storage
C) Inference requires gradient calculations that fine-tuning skips
D) Fine-tuning operates with fewer parameters than inference

B)

28.

Why are task-specific prefixes used in prefix tuning?
A) They update all the model’s layers for a new task
B) They allow the same model to adapt to multiple tasks efficiently
C) They require full retraining for each task adaptation
D) They replace the model’s pre-trained weights entirely

B)

29.

What is one challenge associated with model quantization?
A) Limited compatibility with high-performance hardware
B) Increased memory requirements during inference
C) Increased model training time
D) Reduced ability to handle large datasets

A)

30.

Which type of prompt is typically optimized and used as an embedded token in prompt tuning?
A) Static prompts
B) Task-specific prefixes
C) Dynamic examples
D) Soft prompts

D)

31.

What effect does increasing sequence length have on GPU memory during fine-tuning?
A) Memory usage remains constant
B) Memory usage grows linearly
C) Memory usage grows quadratically
D) Memory usage decreases

C)

32.

Why is Flash Attention used in large language models?
A) To improve memory and computational efficiency in attention computations
B) To add more model layers
C) To enhance training accuracy
D) To reduce the number of tokens processed

A)

33.

Which of the following is a strategy for handling out-of-memory (OOM) issues?
A) Increasing the dataset size
B) Increasing batch size
C) Decreasing sequence length
D) Increasing sequence length

C)

34.

What is the primary purpose of gradient checkpointing in a training pipeline?
A) To skip gradient calculations for certain tokens
B) To store more gradients for each token
C) To increase model accuracy on larger datasets
D) To reduce memory usage by storing fewer intermediate activations

D)

35.

How does parameter-efficient fine-tuning (PEFT) help when training large models?
A) By training all model parameters equally
B) By replacing the pre-trained weights entirely
C) By updating only a small subset of the model's parameters
D) By decreasing the model's output dimensions

C)

36.

In a training pipeline, what is the role of CustomTrainingArguments?
A) To store default training configurations only
B) To handle model checkpoint saving
C) To define and manage command-line arguments for training parameters
D) To adjust the model architecture directly

C)

37.

What does the per_device_train_batch_size argument control?
A) The batch size for each individual GPU or CPU during training
B) The batch size for evaluation only
C) The total batch size across all devices
D) The memory usage for each training step

A)

38.

How does quantization help in deploying large models?
A) By reducing the dataset size needed for training
B) By lowering the precision of weights to save memory and computation
C) By increasing model accuracy
D) By replacing the model’s architecture entirely

B)
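
A toy symmetric int8 quantization in numpy: weights are stored at lower precision (1 byte instead of 4) and rebuilt from a single scale factor, at the cost of a small reconstruction error.

```python
import numpy as np

w = np.array([0.42, -1.30, 0.07, 0.95], dtype=np.float32)
scale = np.abs(w).max() / 127.0           # map the largest weight onto int8
q = np.round(w / scale).astype(np.int8)   # stored low-precision weights
w_hat = q.astype(np.float32) * scale      # dequantized approximation
print(q, np.abs(w - w_hat).max())         # small error, 4x less memory
```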

39.

What is the purpose of the output_dir argument in a training pipeline?
A) To specify where to save training logs
B) To define the directory for saving model checkpoints and outputs
C) To set the path for loading the dataset
D) To control the output sequence length

B)

40.

What effect does enabling load_best_model_at_end have on the training process?
A) It loads the latest checkpoint after training finishes
B) It saves only the best model checkpoint and deletes others
C) It saves checkpoints more frequently
D) It loads the best-performing model checkpoint at the end of training

D)
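
A hedged sketch tying several of these flags together (assuming Hugging Face transformers; exact argument names can vary slightly between library versions, and the values here are illustrative, not recommendations):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",      # where checkpoints and outputs are saved
    per_device_train_batch_size=8,   # batch size on each GPU/CPU
    gradient_accumulation_steps=4,   # effective batch of 8 x 4 per device
    eval_strategy="epoch",           # ("evaluation_strategy" in older versions)
    save_strategy="epoch",           # must align with eval for best-model reload
    load_best_model_at_end=True,     # reload the best checkpoint after training
)
```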

41.

Which of these techniques allows the model to handle longer text sequences with fewer OOM issues?
A) Batch size increase
B) Sparse attention
C) Simple truncation
D) Parameter-efficient fine-tuning

B)

42.

What is the main benefit of using a data collator in the training pipeline?
A) To initialize new model parameters
B) To handle batching, padding, and sequence formatting for inputs
C) To increase the number of tokens in each batch
D) To modify model weights directly

B)

43.

Why would you use dynamic batch sizing in a training pipeline?
A) To handle batches with longer sequences without running out of memory
B) To make batch size independent of sequence length
C) To reduce GPU memory requirements
D) To prevent data shuffling

A)

44.

What is gradient_accumulation_steps used for in training?
A) To increase model speed during training
B) To decrease memory usage by adjusting sequence length
C) To accumulate gradients over several batches before updating weights
D) To shuffle the training dataset

C)
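
A minimal runnable sketch (assuming PyTorch) of gradient accumulation: losses from several micro-batches contribute gradients before a single optimizer step, mimicking a larger batch without the memory cost.

```python
import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4

for step in range(8):                  # 8 micro-batches -> 2 weight updates
    x, y = torch.randn(2, 3), torch.randn(2, 1)
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                    # gradients add up across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # one update per accum_steps batches
        optimizer.zero_grad()
```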

45.

Why is a tokenizer setup necessary before training a language model?
A) To increase batch size during training
B) To increase the model's overall sequence length
C) To reduce the number of tokens needed for training
D) To ensure proper tokenization of input text for processing

D)

46.

Which characteristic is a limitation of greedy decoding in text generation?
A) Generates highly diverse outputs
B) Considers global sequence optimality
C) Lacks the ability to recover from mistakes
D) Requires more computation than beam search

C)

47.

In top-k sampling, how does this method introduce diversity in generated text?
A) By selecting only the highest probability token
B) By sampling from the k most likely tokens
C) By reducing the probability distribution to zero
D) By always choosing tokens randomly

B)
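
A toy top-k sampler in numpy: keep only the k most likely tokens, renormalize, and sample — diversity without ever drawing a very unlikely token.

```python
import numpy as np

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # toy next-token distribution
k = 3
top = np.argsort(probs)[-k:]                   # ids of the k likeliest tokens
token = np.random.choice(top, p=probs[top] / probs[top].sum())
print(token)                                   # always one of the top 3 ids
```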

48.

What effect does adjusting temperature have on text generation in language models?
A) Alters the spread of token probabilities
B) Directly influences model accuracy
C) Controls the number of tokens in output
D) Balances global versus local choices

A)
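
Softmax with temperature in numpy: T < 1 sharpens the distribution toward the top token, T > 1 flattens it — the spread changes but the ranking does not.

```python
import numpy as np

def softmax_t(logits, T):
    e = np.exp(logits / T)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax_t(logits, 0.5))  # peaked: near-greedy behavior
print(softmax_t(logits, 2.0))  # flatter: more varied sampling
```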

49.

Beam search in text generation primarily aims to:
A) Optimize the sequence by using exhaustive search
B) Sample tokens randomly to enhance creativity
C) Choose the longest possible sequence
D) Find a balance between locally and globally probable sequences

D)

50.

Why is a brevity penalty applied in BLEU score calculations?
A) To favor shorter translations
B) To exclude exact word matches
C) To prevent shorter translations from scoring artificially high
D) To prioritize longer n-gram matches

C)
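
The standard brevity-penalty formula in code: candidates shorter than the reference are scaled down, so brevity cannot inflate the precision-based score.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(6, 6))  # 1.0 — no penalty
print(brevity_penalty(3, 6))  # ~0.37 — a short candidate is penalized
```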

51.

Which component does ROUGE-N emphasize when evaluating generated summaries?
A) Embedding similarity
B) Overlap of n-grams between candidate and reference texts
C) Semantic coherence
D) Sentence fluency

B)
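
A toy ROUGE-1-style computation in pure Python: count unigram overlap and divide by the reference length (ROUGE is recall-oriented).

```python
reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

overlap = sum(min(reference.count(w), candidate.count(w))
              for w in set(reference))
print(overlap / len(reference))  # 5/6 ≈ 0.83
```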

52.

A low perplexity score in a language model suggests:
A) The model has high certainty in its predictions
B) The model lacks sufficient training data
C) High diversity in model outputs
D) The model is underperforming

A)
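
Perplexity from per-token probabilities in numpy: the exponential of the average negative log-likelihood, so confident predictions give values near 1.

```python
import numpy as np

token_probs = np.array([0.9, 0.8, 0.95])      # model's prob. for each true token
print(np.exp(-np.mean(np.log(token_probs))))  # ~1.13: high certainty

uncertain = np.full(3, 1 / 50000)             # near-uniform over a 50k vocab
print(np.exp(-np.mean(np.log(uncertain))))    # 50000.0: maximal uncertainty
```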

53.

What does top-p (nucleus) sampling focus on when selecting tokens?
A) Including a fixed number of tokens with high probability
B) Adding randomness by sampling from the entire vocabulary
C) Choosing tokens within a cumulative probability threshold
D) Sampling based on a specific temperature value

C)
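
A toy top-p (nucleus) sampler in numpy: keep the smallest set of tokens whose cumulative probability reaches p, so the candidate count adapts step by step (contrast with the fixed k above).

```python
import numpy as np

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])      # already sorted descending
p = 0.8
cutoff = np.searchsorted(np.cumsum(probs), p) + 1  # tokens needed to reach p
nucleus = probs[:cutoff] / probs[:cutoff].sum()    # renormalized nucleus
print(cutoff, np.random.choice(cutoff, p=nucleus)) # nucleus size 3 here
```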

54.

Compared to BLEU, the METEOR metric additionally considers:
A) Exact word order matching
B) Higher weight for longer sequences
C) Semantic matches like synonyms and stems
D) A penalty for too many n-grams

C)

55.

How does BERTScore evaluate text similarity?
A) By focusing on matching word embeddings for semantic similarity
B) By counting matching words between texts
C) By assessing the frequency of tokens in both texts
D) By penalizing differences in sentence length

A)

56.

Which sampling technique adjusts the number of candidate tokens based on cumulative probability?
A) Greedy decoding
B) Beam search
C) Top-k sampling
D) Top-p sampling

D)

57.

What is a primary benefit of using a high temperature in language model generation?
A) Increases predictability of the output
B) Sharpens focus on the most probable tokens
C) Introduces more variability and creativity
D) Ensures deterministic results

C)

58.

An MMLU score is considered significant for a language model because it measures:
A) Performance in a single specific domain
B) Multitask ability across diverse knowledge areas
C) Memory efficiency during training
D) Processing speed on large datasets

B)

59.

In MMLU testing, a score of 25 (i.e., 25% accuracy on four-option questions) generally represents:
A) Random guessing accuracy
B) High model performance
C) The maximum achievable score
D) Low probability of guessing correctly

A)

60.

What unique measure does ROUGE-L provide in comparison to ROUGE-N?
A) Counts all single-word matches
B) Evaluates the longest common subsequence in texts
C) Analyzes sentence fluency
D) Identifies non-consecutive bigrams

B)

61.

Why is gradient accumulation used in training large language models?
A) To reduce the model’s complexity by decreasing the number of layers
B) To achieve larger effective batch sizes without increasing memory usage
C) To prevent overfitting by reducing gradient updates
D) To increase the number of tokens in each input sequence

B)

62.

What is the main difference between multiclass and multilabel classification?
A) Multiclass classification assigns multiple labels to a single instance, while multilabel classification assigns only one label
B) Multiclass classification predicts one label from multiple classes, while multilabel classification can assign multiple labels to a single instance
C) Multilabel classification requires a larger dataset than multiclass classification
D) Multiclass classification is used only for binary classification, while multilabel handles more than two classes

B)

63.

Which of the following is a common indication that a model has underfit the data?
A) High training accuracy but low validation accuracy
B) Low training accuracy and low validation accuracy
C) High training accuracy and high validation accuracy
D) Low training accuracy but high validation accuracy

B)

64.

Which of the following is a common indication that a model has overfit the data?
A) High training accuracy but low validation accuracy
B) Low training accuracy and low validation accuracy
C) High training accuracy and high validation accuracy
D) Low training accuracy but high validation accuracy

A)

65.

What is a base model in the context of machine learning?
A) A model trained with all possible features and parameters tuned for optimal performance
B) A model trained only on a subset of data to validate a hypothesis
C) A final model that is used in production after extensive training
D) A simple, initial model that serves as a starting point before fine-tuning or transfer learning

D)

66.

What is an instruction model in machine learning?
A) A model designed to generate instructions for users based on data
B) A base model trained without any specific tasks or instructions
C) A model that is fine-tuned to follow specific instructions or prompts provided by the user
D) A model used solely for generating complex, unsupervised predictions

C)

67.

What is reference text in the context of evaluating language models?
A) The main dataset used to train the model
B) Text that serves as a benchmark for comparing and evaluating the quality of model-generated outputs
C) A summary of all outputs produced by the model during training
D) Text that the model uses to predict future tokens in a sequence

B)

68.

What is candidate text in the context of evaluating language models?
A) The original text used as input for training the model
B) A reference text created by human annotators as a benchmark
C) The text generated by the model, which is compared against a reference text for evaluation
D) A summary created from multiple model outputs for accuracy checks

C)

69.

What does perplexity measure in the context of evaluating language models?
A) The model's certainty in predicting the next word in a sequence
B) The ability of the model to handle multiple tasks simultaneously
C) The accuracy of a model's predictions on unseen data
D) The length of generated text sequences

A)

70.

What is a key difference between BERT and LDA in natural language processing?
A) BERT is a topic modeling algorithm, while LDA is a transformer-based language model
B) BERT uses contextual embeddings for understanding language, while LDA is a probabilistic model for discovering topics in text
C) BERT is primarily used for topic extraction, while LDA is used for sentiment analysis
D) BERT requires labeled data, while LDA can only be used with unlabeled data

B)

71.

How is precision used differently in BLEU versus ROUGE for evaluating text generation?
A) In BLEU, precision focuses on n-gram overlap from the candidate text to the reference, while in ROUGE, it emphasizes recall of n-grams in the reference text
B) In BLEU, precision measures the overlap of n-grams in the reference text, while in ROUGE, it measures n-grams in the candidate text
C) BLEU uses precision to prioritize recall, whereas ROUGE uses it to penalize brevity
D) BLEU calculates precision based on word order, whereas ROUGE ignores word order entirely

A)

72.

What is the main goal of Information Extraction (IE) in NLP?
A) To create language models that generate human-like text
B) To convert structured data into unstructured text
C) To extract structured information from unstructured text
D) To translate text from one language to another

C)

73.

In Named Entity Recognition (NER), what does the "B" in the BIO tagging scheme represent?
A) Beginning of an entity
B) Inside an entity
C) Boundary of an entity
D) Outside an entity

A)

74.

Which of the following is an example of Relation Extraction?
A) Identifying locations in a document
B) Summarizing the main events in a story
C) Detecting relationships between named entities, such as "works for"
D) Translating a document into another language

C)

75.

Which task would benefit most from coreference resolution?
A) Determining the overall sentiment of a text
B) Recognizing and linking different mentions of the same entity in a document
C) Calculating the similarity of sentences in a document
D) Identifying the main topic of a document

B)

76.

What is the primary purpose of entity linking in NLP?
A) Assigning unique identifiers to recognized entities
B) Creating a knowledge graph from unstructured text
C) Generating new sentences using extracted entities
D) Tokenizing text into words

A)

77.

Which NLP task is specifically designed to recognize events and their attributes, such as location and participants?
A) Relation Extraction
B) Named Entity Recognition
C) Event Extraction
D) Coreference Resolution

C)

78.

How does Multi-Task Learning (MTL) enhance model performance in NLP?
A) By optimizing only the main task, ignoring auxiliary tasks
B) By training separate models for each task
C) By sharing representations across related tasks, helping the model learn general features
D) By using only rule-based methods for task improvement

C)

79.

Which of the following describes the purpose of a knowledge graph in NLP?
A) Representing semantic relationships among entities in a structured format
B) Providing document summarization
C) Translating text from multiple languages
D) Generating responses in conversational AI

A)

80.

What advantage does a knowledge graph have over cosine similarity in entity retrieval?
A) Reduced need for data structure
B) Greater scalability with large datasets
C) Enhanced semantic understanding of relationships
D) Simplicity in algorithm implementation

C)

81.

Which model is adapted to produce sentence embeddings for semantic similarity, reducing the need for pairwise comparisons?
A) LDA
B) BERT
C) SBERT
D) NER

C)
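
A hedged sketch (assuming the sentence-transformers package and its public all-MiniLM-L6-v2 checkpoint): encode each sentence once, then compare embeddings with cosine similarity instead of running every pair through the full model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["The cat sat on the mat.",
                    "A cat is sitting on a mat.",
                    "Quantization lowers weight precision."])
print(util.cos_sim(emb[0], emb[1]))  # high: paraphrases
print(util.cos_sim(emb[0], emb[2]))  # low: unrelated sentences
```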

82.

Which topic modeling technique is commonly used to generate topics by analyzing word distributions in documents?
A) Named Entity Recognition
B) LDA
C) Coreference Resolution
D) Cosine Similarity

B)

83.

In BERTopic, what is the purpose of using dimensionality reduction techniques before clustering?
A) To increase the complexity of the clusters
B) To focus on the most informative aspects of high-dimensional data
C) To generate embeddings for each sentence independently
D) To reduce processing time for single-topic documents only

B)

84.

Which of the following best describes the c-TF-IDF approach in BERTopic?
A) Identifies important words for individual documents only
B) Measures term frequency across the entire dataset
C) Highlights words significant to specific clusters rather than individual documents
D) Applies dimensionality reduction to embeddings

C)

85.

What is the main goal of triplet loss in SBERT?
A) To train sentence embeddings for detecting entity boundaries
B) To minimize the distance between unrelated sentences
C) To optimize embeddings by reducing the distance between similar sentences and increasing it between dissimilar ones
D) To reduce the number of embedding dimensions

C)
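
Triplet loss by hand in numpy (toy 2-D embeddings): the loss is zero once the anchor is closer to the positive than to the negative by at least the margin.

```python
import numpy as np

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # paraphrase: should sit near the anchor
negative = np.array([0.0, 1.0])   # unrelated: should sit far away
margin = 0.5

d_pos = np.linalg.norm(anchor - positive)
d_neg = np.linalg.norm(anchor - negative)
print(max(0.0, d_pos - d_neg + margin))  # 0.0 — already well separated
```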

86.

What challenge does BERTopic address that LDA struggles with?
A) Handling polysemy and understanding word context
B) Generating embeddings for new languages
C) Processing individual sentences for entity recognition
D) Scaling efficiently for real-time analysis

A)