VL ?s - reliability + observability

Last updated 11:24 PM on 5/13/26

243 Terms

Reliability vs availability?

Reliability means the system works correctly over time, even with failures or bad inputs. Availability means the system is reachable and usable. A system can be available but unreliable if it returns wrong data.

Reliability vs correctness?

Reliability is about consistently functioning under real-world conditions. Correctness is about producing the right result. For manufacturing software, both matter because a system that stays online but shows wrong unit status is still dangerous.

Reliability vs fault tolerance?

Reliability is the overall ability to work correctly. Fault tolerance is the ability to keep working when part of the system fails. Example: if test-result processing fails, the API can still queue uploads for later processing.

What is graceful degradation?

Graceful degradation means the system still provides partial useful functionality when something fails. Example: if live dashboard refresh fails, show the last known data with a stale warning instead of showing a blank page.

What is graceful failure?

Graceful failure means failing in a controlled way without corrupting data. The system should return clear errors, log useful context, avoid partial writes, and leave data in a consistent state.

What is fail-safe design?

Fail-safe design means the system chooses the safer behavior when something goes wrong. Example: if shipment validation cannot confirm QC passed, the system should block shipment rather than allow it.

What does fail closed mean?

Fail closed means deny or stop an operation when validation or permissions cannot be verified. For manufacturing, shipment approval and status overrides should fail closed.
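
A minimal sketch of a fail-closed shipment check. The fetch_qc_status callable is hypothetical and stands in for whatever lookup confirms QC; the point is only that an error or an unknown answer blocks the shipment.

```python
from typing import Callable, Optional


def can_ship(serial_number: str,
             fetch_qc_status: Callable[[str], Optional[str]]) -> bool:
    """Return True only when QC is positively confirmed as passed.

    fetch_qc_status is a hypothetical lookup. Any error or unknown
    answer counts as "not verified", so the shipment is blocked.
    """
    try:
        status = fetch_qc_status(serial_number)
    except Exception:
        # Validation could not be performed: fail closed.
        return False
    return status == "passed"
```

A fail-open version would return True in the except branch, which is the behavior to avoid for shipment approval.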

What does fail open mean?

Fail open means allow an operation even when checks fail. It may improve convenience but can be risky. It is usually bad for security, QC, shipment, or permission-sensitive workflows.

What is data correctness?

Data correctness means stored data accurately represents the real-world workflow and business rules. Example: a unit marked ready_to_ship should actually have passed required tests and QC.

Why is data correctness critical in manufacturing software?

Incorrect data can cause real operational mistakes, such as shipping failed units, missing bottlenecks, ordering wrong inventory, or losing traceability for a physical product.

What is data consistency?

Data consistency means related data does not contradict itself. Example: units.current_stage should match the latest unit_stage_history event.

Correctness vs consistency?

Correctness means data reflects reality. Consistency means data agrees across tables or systems. A system can be internally consistent but still incorrect if it does not match the real production floor.

How do you design for data correctness?

Use backend validation, database constraints, transactions, allowed state transitions, idempotency keys, audit logs, reconciliation jobs, and tests for important workflows.

What is validation layering?

Validation layering means using multiple levels of protection. Frontend validation improves user experience, backend validation enforces business rules, and database constraints provide final protection.

Why is frontend validation not enough?

Users or services can bypass the frontend and call APIs directly. Frontend validation is helpful for fast feedback, but backend validation is required for security and correctness.

Why use database constraints if backend validates?

Backend validation gives good error messages, but database constraints protect against bugs, race conditions, and alternate code paths. Use both for important rules.

What are examples of database constraints for reliability?

Unique constraints on serial_number, foreign keys from test_results to units, NOT NULL on required fields, and CHECK constraints for valid ranges. Transactions complement these by keeping related writes together.
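
A sketch of these constraints using Python's built-in sqlite3 module. The table layout, the voltage column, and its 0 to 60 range are assumptions made for illustration, not a real schema.

```python
import sqlite3

# Hypothetical schema: column names and the voltage range are illustrative.
SCHEMA = """
CREATE TABLE units (
    id            INTEGER PRIMARY KEY,
    serial_number TEXT NOT NULL UNIQUE,
    current_stage TEXT NOT NULL
);
CREATE TABLE test_results (
    id      INTEGER PRIMARY KEY,
    unit_id INTEGER NOT NULL REFERENCES units(id),
    voltage REAL NOT NULL CHECK (voltage BETWEEN 0 AND 60)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, a duplicate serial_number, a test result for a missing unit, a NULL stage, and an out-of-range voltage are all rejected by the database even if application code forgets to check.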

What is a transaction?

A transaction groups multiple database operations so they all succeed together or fail together. It prevents partial updates from corrupting data.

Why use transactions for status updates?

Updating units.current_stage and inserting unit_stage_history must happen together. If one succeeds and the other fails, the system loses traceability or shows inconsistent state.
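
One way this could look with Python's sqlite3 module, where "with conn" commits on success and rolls back on an exception. The column layout is an assumption; only units.current_stage and unit_stage_history are named in the card.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical minimal schema for the two tables named in the card.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE units (
            serial_number TEXT PRIMARY KEY,
            current_stage TEXT NOT NULL
        );
        CREATE TABLE unit_stage_history (
            id            INTEGER PRIMARY KEY,
            serial_number TEXT NOT NULL,
            stage         TEXT NOT NULL
        );
    """)
    return conn


def move_unit(conn: sqlite3.Connection, serial_number: str, new_stage: str) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the UPDATE and the INSERT are applied together or not at all.
    with conn:
        conn.execute(
            "UPDATE units SET current_stage = ? WHERE serial_number = ?",
            (new_stage, serial_number),
        )
        conn.execute(
            "INSERT INTO unit_stage_history (serial_number, stage) VALUES (?, ?)",
            (serial_number, new_stage),
        )
```

If the history insert fails for any reason, the stage update is rolled back with it, so current_stage never gets ahead of the history.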

What is rollback?

Rollback undoes changes made inside a transaction after a failure. It returns the database to its previous consistent state.

What is atomicity?

Atomicity means a transaction is all-or-nothing. Either every operation succeeds or none are applied.

What is isolation?

Isolation means concurrent transactions should not interfere with each other in a way that causes incorrect results.

What is durability?

Durability means once a transaction commits, the data should persist even if the system crashes afterward.

What is a race condition?

A race condition happens when two or more operations overlap in time and the result depends on which one finishes first. Example: two operators update the same unit status at once.

How do you handle concurrent updates?

Use transactions, row locks, optimistic locking, version fields, updated_at checks, and state-transition validation.

What is optimistic locking?

Optimistic locking uses a version number or timestamp to detect whether a record changed after it was read. If it changed, the update is rejected or retried.
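
A small sketch of the version-number approach with sqlite3. The version column and the update_stage helper are assumptions for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical table with a version column used for conflict detection.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE units (serial_number TEXT PRIMARY KEY, "
        "current_stage TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO units VALUES ('VL001', 'assembly', 1)")
    conn.commit()
    return conn


def update_stage(conn: sqlite3.Connection, serial_number: str,
                 new_stage: str, read_version: int) -> bool:
    """Apply the update only if the row still has the version the caller read."""
    with conn:
        cur = conn.execute(
            "UPDATE units SET current_stage = ?, version = version + 1 "
            "WHERE serial_number = ? AND version = ?",
            (new_stage, serial_number, read_version),
        )
    # rowcount is 0 when another writer already bumped the version.
    return cur.rowcount == 1
```

When update_stage returns False, the caller would typically reload the record and either retry or show the user a conflict message.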

When is optimistic locking useful?

It is useful when conflicts are possible but not constant. Example: two users open the same unit detail page and both try to update status.

What is pessimistic locking?

Pessimistic locking locks a row while it is being updated so other transactions cannot modify it at the same time. It prevents conflicts but can reduce concurrency.

Optimistic locking vs pessimistic locking?

Optimistic locking allows more concurrency and detects conflicts later. Pessimistic locking prevents conflicts upfront but can block other users. Use optimistic for common web workflows and pessimistic for critical high-conflict operations.

What is a stale write?

A stale write happens when a user updates data based on an old version of the record. Optimistic locking can prevent this.

What is a lost update?

A lost update happens when one user's change silently overwrites another user's change, and neither of them notices. Version checks or locks can prevent it.

What is idempotency?

Idempotency means repeating the same operation multiple times has the same final effect as doing it once. It is essential for safe retries.

Why is idempotency important for reliability?

Network failures can cause clients or workers to retry. Without idempotency, retries can create duplicate test results, duplicate shipments, or repeated status changes.

How do you implement idempotency?

Use idempotency keys, unique event IDs, unique database constraints, upserts, and processed-event tracking.

What is an idempotency key?

An idempotency key is a unique value for one operation. If the same key is received again, the system returns the original result instead of repeating the operation.
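
A sketch of the replay behavior. IdempotentHandler is a hypothetical helper, and the in-memory dict stands in for what would need to be a shared, durable store with a unique constraint on the key.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Runs an operation once per key and replays the stored result after that.

    The in-memory dict is a stand-in. A real service would need a shared,
    durable store with a unique constraint on the key.
    """

    def __init__(self, operation: Callable[[dict], dict]) -> None:
        self._operation = operation
        self._results: Dict[str, dict] = {}

    def handle(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._results:
            # Retry of a request already handled: no new side effect.
            return self._results[idempotency_key]
        result = self._operation(payload)
        self._results[idempotency_key] = result
        return result
```

This single-process sketch ignores concurrent requests with the same key; in practice the unique constraint is what settles that race.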

What is retry safety?

Retry safety means an operation can be retried without causing duplicate or incorrect side effects. Idempotency is the main way to achieve it.

What is backoff?

Backoff means waiting before retrying a failed operation. Exponential backoff increases the wait after each failure.

What is a retry storm?

A retry storm happens when many clients retry at once and overload an already struggling system. Backoff and jitter help prevent this.
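
A sketch of exponential backoff with jitter. The "full jitter" variant used here (a random delay between zero and the exponential cap) is one common choice, and the default base, cap, and attempt count are arbitrary assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    uniform = (rng or random).uniform
    return uniform(0.0, min(cap, base * (2 ** attempt)))


def retry(operation: Callable[[], T], max_attempts: int = 5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached.

    Catching every Exception keeps the sketch short; real code would
    retry only errors known to be transient.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The randomness spreads retries from many clients over time instead of letting them all hit the struggling system at the same instant.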

What is a timeout?

A timeout is a maximum wait time for an operation. It prevents requests from hanging forever when a dependency is slow or broken.

Why are timeouts important?

Without timeouts, slow dependencies can consume resources and cause cascading failures. Every external call should have a reasonable timeout.

What is a circuit breaker?

A circuit breaker temporarily stops calls to a failing dependency after repeated failures. It prevents wasting resources and gives the dependency time to recover.
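
A simplified, single-threaded sketch of a circuit breaker. The class name, thresholds, and the injectable clock are assumptions; production libraries usually add locking and a more explicit half-open state.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are paused")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on a dependency that is likely to fail anyway.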

What is bulkhead isolation?

Bulkhead isolation separates resources so one failure does not take down the whole system. Example: separate worker pools for imports and test-result ingestion.

What is cascading failure?

A cascading failure happens when one failing component causes other components to fail. Example: a slow database causes API timeouts, which causes retries, which adds even more load.

How do you prevent cascading failures?

Use timeouts, circuit breakers, queues, backoff, rate limits, bulkheads, graceful degradation, and alerts.

What is load shedding?

Load shedding means intentionally rejecting or delaying some work to protect the system under high load. Example: reject noncritical report generation while keeping test uploads working.

What is rate limiting?

Rate limiting restricts how many requests a client can make in a time window. It protects systems from abuse, bugs, and overload.

What is throttling?

Throttling slows down request processing instead of immediately rejecting all excess traffic. It helps protect dependencies.

Rate limiting vs throttling?

Rate limiting sets a hard request limit, often returning 429. Throttling slows or delays requests to smooth load.
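
A sketch of a hard limit using a token bucket, which is one common rate-limiting algorithm and not the only one. The class and the injectable clock are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` requests per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # a caller would typically respond with HTTP 429
```

A throttling variant would wait until a token is available instead of returning False, which delays the request rather than rejecting it.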

What is a degraded mode?

A degraded mode is a limited version of the system during failures. Example: dashboard shows cached data while live aggregation is unavailable.

What is durability beyond a single transaction?

Durability means data is not lost after being accepted. Queues, transactions, backups, and persistent storage improve durability.

How do you avoid data loss?

Use durable queues, database transactions, idempotent writes, backups, raw event storage, dead-letter queues, and monitoring.

What is a dead-letter queue?

A dead-letter queue stores messages that failed processing too many times. It prevents bad messages from blocking good ones.

Why is a dead-letter queue important?

It gives engineers visibility into failed records and allows later reprocessing after fixing bugs or data issues.

What is a poison message?

A poison message is a message that repeatedly fails processing because it is malformed or triggers a bug.

How do you handle poison messages?

Retry a limited number of times, then move them to a dead-letter queue with error details and alert the owner.
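
A sketch of limited retries followed by dead-lettering, using an in-memory deque in place of a real message broker. The drain function and the "attempts" field are hypothetical; alerting the owner is left out.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def drain(messages: Iterable[dict], handler: Callable[[dict], None],
          max_attempts: int = 3) -> List[Dict]:
    """Process every message; return dead-lettered entries with error details."""
    queue = deque(messages)
    dead_letters: List[Dict] = []
    while queue:
        message = queue.popleft()
        try:
            handler(message)
        except Exception as exc:
            attempts = message.get("attempts", 0) + 1
            if attempts >= max_attempts:
                # Give up and keep the error so someone can investigate later.
                dead_letters.append(
                    {"message": message, "attempts": attempts, "error": repr(exc)}
                )
            else:
                # Requeue at the back so good messages are not blocked.
                queue.append({**message, "attempts": attempts})
    return dead_letters
```

The returned list plays the role of the dead-letter queue: good messages still get processed, and the bad one is set aside with its error.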

What is observability?

Observability is the ability to understand what a system is doing internally from outputs like logs, metrics, traces, health checks, and alerts.

Observability vs monitoring?

Monitoring watches known signals like error rate or latency. Observability helps investigate unknown problems by providing enough context to understand system behavior.

What are the three pillars of observability?

The common three pillars are logs, metrics, and traces. Health checks and alerts are also important in practical systems.

What are logs?

Logs are structured records of events, errors, and actions. They help developers debug what happened.

What is structured logging?

Structured logging records logs as key-value fields instead of plain text. Example fields include request_id, serial_number, user_id, endpoint, status_code, and error_code.
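
A sketch using Python's standard logging module with a small JSON formatter. JsonFormatter and build_logger are hypothetical helpers; the field names are the ones listed in the card.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Field names taken from the card; which ones to log is a design choice.
    FIELDS = ("request_id", "serial_number", "user_id",
              "endpoint", "status_code", "error_code")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):  # set via the `extra=` argument
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    logger = logging.getLogger("structured_example")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as build_logger().error("test upload failed", extra={"request_id": "req-1", "serial_number": "VL001"}) then produces one JSON line whose fields can be filtered individually.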

Why is structured logging useful?

It makes logs searchable and filterable. You can quickly find all errors for serial_number VL001 or all failed test uploads from workstation 3.

What should API logs include?

method, path, status_code, latency, request_id, user_id or service_id, relevant entity IDs, and error details if an error occurred.

What should pipeline logs include?

job_id, source, records processed, records failed, retry count, duration, error reasons, event IDs, serial numbers, and timestamps.

What should security logs include?

What should not be logged?

Do not log passwords, API keys, tokens, private user data, full secrets, or sensitive payloads unless explicitly protected and necessary.

What is a log level?

Log level describes severity. Common levels include DEBUG, INFO, WARNING, ERROR, and CRITICAL.

When to use DEBUG logs?

Use DEBUG for detailed troubleshooting information that is usually disabled in production because it can be noisy.

When to use INFO logs?

Use INFO for normal important events, such as job started, job completed, or unit status changed.

When to use WARNING logs?

Use WARNING for unusual but recoverable situations, such as delayed job processing or a retryable failure.

When to use ERROR logs?

Use ERROR when an operation fails and needs attention or investigation.

When to use CRITICAL logs?

Use CRITICAL for severe failures that may make the system unusable or cause major data loss.

What is a metric?

A metric is a numeric measurement over time. Examples include API error rate, request latency, queue depth, ingestion lag, and database query time.

What is a counter metric?

A counter only increases over time, apart from resetting to zero when the process restarts. Example: total_test_results_processed or total_api_errors.

What is a gauge metric?

A gauge can go up or down. Example: queue_depth, active_users, database_connections.

What is a histogram metric?

A histogram measures distribution of values, such as request latency buckets.

What is latency?

Latency is how long an operation takes. Example: an API request takes 250 milliseconds.

What is p95 latency?

p95 latency is the value that 95 percent of requests complete at or below; the slowest 5 percent take longer. It shows tail performance better than averages.
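
A sketch of computing a percentile with the nearest-rank method. This is one of several percentile definitions; monitoring systems often estimate percentiles from histogram buckets instead of raw samples.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For a made-up sample of 18 requests at 100 ms, one at 150 ms, and one at 2000 ms, percentile(samples, 95) is 150 while the mean is 197.5, so the single slow request distorts the average but shows up clearly only at the very top percentiles.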

Why is average latency sometimes misleading?

Averages hide slow outliers. Most users may be fast, but some requests could be extremely slow. Percentiles reveal those cases.

What is throughput in systems?

Throughput is how many requests, jobs, or events a system processes per unit time.

What is error rate?

Error rate is the percentage of operations that fail. Example: failed API requests divided by total API requests.

What is saturation?

Saturation measures how full or overloaded a resource is. Example: CPU usage, memory usage, database connection pool usage, or queue backlog.

What are golden signals?

Golden signals are common reliability metrics: latency, traffic, errors, and saturation.

What API metrics should you monitor?

request rate, p95 latency, error rate, status codes, timeout count, dependency latency, and request volume by endpoint.

What database metrics should you monitor?

query latency, slow queries, connection count, lock waits, storage usage, CPU, memory, and replication lag if applicable.

What queue metrics should you monitor?

queue depth, oldest message age, processing rate, retry count, failed jobs, dead-letter count, and worker errors.

What dashboard metrics should you monitor?

dashboard API latency, widget errors, refresh failures, data freshness age, cache hit rate, and slow queries.

What pipeline metrics should you monitor?

records processed, records failed, ingestion lag, job duration, last successful run, retry count, dead-letter count, and abnormal zero-record runs.

What is an alert?

An alert is a notification triggered when a metric or event indicates a problem. Good alerts are actionable and not too noisy.

What makes an alert good?

A good alert is actionable, specific, timely, and has clear severity. It should help someone decide what to do next.

What is alert fatigue?

Alert fatigue happens when people receive too many noisy alerts and start ignoring them.

How do you reduce alert fatigue?

Alert only on actionable problems, use severity levels, group duplicates, tune thresholds, and add runbooks.

What is an alert threshold?

A threshold is the value that triggers an alert. Example: alert if queue depth is above 1000 for 10 minutes.
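
A sketch of the "above 1000 for 10 minutes" rule from the example. The should_alert function and the (timestamp, depth) sample format are assumptions; real alerting systems evaluate rules like this on their own schedule.

```python
from typing import Iterable, Optional, Tuple


def should_alert(samples: Iterable[Tuple[float, float]],
                 threshold: float = 1000,
                 duration_s: float = 600) -> bool:
    """samples are (timestamp_seconds, queue_depth) pairs in time order.

    Returns True once the depth has stayed above threshold for at least
    duration_s seconds without dropping back.
    """
    breach_started: Optional[float] = None
    for timestamp, depth in samples:
        if depth > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= duration_s:
                return True
        else:
            breach_started = None  # any recovery resets the window
    return False
```

Requiring the breach to last for the full duration keeps short spikes from paging anyone, which helps with alert noise.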

What is alert severity?

Severity indicates urgency. Example: info, warning, critical. Critical alerts should require immediate response.

What is an alert runbook?

A runbook explains what an alert means, where to investigate, and what steps to take.

What is tracing?

Tracing follows a request or job across multiple services or components. It helps identify where time is spent and where failures occur.

What is a trace span?

A span is one operation within a trace, such as API validation, database query, or external API call.

Why is tracing useful?

Tracing helps debug slow or failing requests by showing the path of the request and timing for each component.

What is a request ID?

A request ID is a unique identifier for one request. It appears in logs so all events for that request can be connected.

What is a correlation ID?

A correlation ID connects related work across services, queues, or jobs. It helps trace workflows that continue asynchronously.

Request ID vs correlation ID?

A request ID usually identifies one request. A correlation ID may follow a larger workflow across multiple requests or jobs.