Reliability vs availability?
Reliability means the system works correctly over time, even with failures or bad inputs. Availability means the system is reachable and usable. A system can be available but unreliable if it returns wrong data.
Reliability vs correctness?
Reliability is about consistently functioning under real-world conditions. Correctness is about producing the right result. For manufacturing software, both matter because a system that stays online but shows wrong unit status is still dangerous.
Reliability vs fault tolerance?
Reliability is the overall ability to work correctly. Fault tolerance is the ability to keep working when part of the system fails. Example: if test-result processing fails, the API can still queue uploads for later processing.
What is graceful degradation?
Graceful degradation means the system still provides partial useful functionality when something fails. Example: if live dashboard refresh fails, show the last known data with a stale warning instead of showing a blank page.
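The stale-warning example above could look roughly like this in Python. This is only a sketch: `load_dashboard`, `fetch_live`, and the `cache` dict are hypothetical names, not part of any real dashboard API.

```python
def load_dashboard(fetch_live, cache):
    """Return live data when possible, else the last known data marked stale."""
    try:
        data = fetch_live()
    except Exception:
        if "data" not in cache:
            raise  # nothing useful to show, so surface the failure
        # Degrade: serve the last good response and flag it as stale.
        return {"data": cache["data"], "stale": True}
    cache["data"] = data  # remember the last good response
    return {"data": data, "stale": False}
```

The caller can use the `stale` flag to render a warning banner instead of a blank page.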
What is graceful failure?
Graceful failure means failing in a controlled way without corrupting data. The system should return clear errors, log useful context, avoid partial writes, and leave data in a consistent state.
What is fail-safe design?
Fail-safe design means the system chooses the safer behavior when something goes wrong. Example: if shipment validation cannot confirm QC passed, the system should block shipment rather than allow it.
What does fail closed mean?
Fail closed means deny or stop an operation when validation or permissions cannot be verified. For manufacturing, shipment approval and status overrides should fail closed.
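A minimal sketch of failing closed, assuming a hypothetical `can_ship` helper where `None` stands for "the QC lookup failed or timed out":

```python
from typing import Optional


def can_ship(qc_passed: Optional[bool]) -> bool:
    """Fail closed: only an explicit, verified QC pass allows shipment."""
    # None means the check could not be completed, so the answer is unknown.
    # Unknown is treated the same as a failure.
    return qc_passed is True
```

A fail-open version would return `True` for `None`, which is exactly the behavior to avoid for shipment approval.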
What does fail open mean?
Fail open means allow an operation even when checks fail. It may improve convenience but can be risky. It is usually bad for security, QC, shipment, or permission-sensitive workflows.
What is data correctness?
Data correctness means stored data accurately represents the real-world workflow and business rules. Example: a unit marked ready_to_ship should actually have passed required tests and QC.
Why is data correctness critical in manufacturing software?
Incorrect data can cause real operational mistakes, such as shipping failed units, missing bottlenecks, ordering wrong inventory, or losing traceability for a physical product.
What is data consistency?
Data consistency means related data does not contradict itself. Example: units.current_stage should match the latest unit_stage_history event.
Correctness vs consistency?
Correctness means data reflects reality. Consistency means data agrees across tables or systems. A system can be internally consistent but still incorrect if it does not match the real production floor.
How do you design for data correctness?
Use backend validation, database constraints, transactions, allowed state transitions, idempotency keys, audit logs, reconciliation jobs, and tests for important workflows.
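One item from that list, allowed state transitions, can be sketched as a lookup table. The stage names other than `ready_to_ship` are assumptions made up for illustration.

```python
# Hypothetical workflow: each stage maps to the stages it may move to next.
ALLOWED_TRANSITIONS = {
    "assembled": {"testing"},
    "testing": {"qc_review", "rework"},
    "rework": {"testing"},
    "qc_review": {"ready_to_ship", "rework"},
    "ready_to_ship": {"shipped"},
    "shipped": set(),
}


def validate_transition(current: str, target: str) -> None:
    """Raise ValueError unless current -> target is an allowed move."""
    allowed = ALLOWED_TRANSITIONS.get(current, set())
    if target not in allowed:
        raise ValueError(f"illegal transition: {current} -> {target}")
```

With this table a unit cannot jump from `assembled` straight to `shipped`, so `ready_to_ship` is only reachable through testing and QC.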
What is validation layering?
Validation layering means using multiple levels of protection. Frontend validation improves user experience, backend validation enforces business rules, and database constraints provide final protection.
Why is frontend validation not enough?
Users or services can bypass the frontend and call APIs directly. Frontend validation is helpful for fast feedback, but backend validation is required for security and correctness.
Why use database constraints if backend validates?
Backend validation gives good error messages, but database constraints protect against bugs, race conditions, and alternate code paths. Use both for important rules.
What are examples of database constraints for reliability?
Unique constraints on serial_number, foreign keys from test_results to units, NOT NULL on required fields, and CHECK constraints for valid ranges. Transactions complement these constraints by keeping related writes together.
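These constraints can be tried out with Python's built-in `sqlite3` module. The schema below is a sketch: the `voltage` column and its 0 to 50 range are assumptions, and `try_insert` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE units (
    id INTEGER PRIMARY KEY,
    serial_number TEXT NOT NULL UNIQUE
);
CREATE TABLE test_results (
    id INTEGER PRIMARY KEY,
    unit_id INTEGER NOT NULL REFERENCES units(id),
    voltage REAL NOT NULL CHECK (voltage BETWEEN 0 AND 50)
);
""")


def try_insert(sql: str, params: tuple) -> bool:
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

Duplicate serial numbers, results for unknown units, and out-of-range values are all rejected by the database itself, even if application code forgets to check.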
What is a transaction?
A transaction groups multiple database operations so they all succeed together or fail together. It prevents partial updates from corrupting data.
Why use transactions for status updates?
Updating units.current_stage and inserting unit_stage_history must happen together. If one succeeds and the other fails, the system loses traceability or shows inconsistent state.
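A sketch of the paired write using `sqlite3`. The `changed_by` column and the `change_stage` helper are assumptions added so the second statement has a realistic way to fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE units (
    serial_number TEXT PRIMARY KEY,
    current_stage TEXT NOT NULL
);
CREATE TABLE unit_stage_history (
    id INTEGER PRIMARY KEY,
    serial_number TEXT NOT NULL,
    stage TEXT NOT NULL,
    changed_by TEXT NOT NULL
);
INSERT INTO units VALUES ('VL001', 'testing');
INSERT INTO unit_stage_history (serial_number, stage, changed_by)
VALUES ('VL001', 'testing', 'op1');
""")


def change_stage(serial_number, new_stage, changed_by):
    """Update the unit and append its history row in one transaction."""
    with conn:  # commits if both statements succeed, rolls back if either fails
        conn.execute(
            "UPDATE units SET current_stage = ? WHERE serial_number = ?",
            (new_stage, serial_number),
        )
        conn.execute(
            "INSERT INTO unit_stage_history (serial_number, stage, changed_by) "
            "VALUES (?, ?, ?)",
            (serial_number, new_stage, changed_by),
        )
```

If the history insert fails (for example, `changed_by` is missing), the earlier `UPDATE` is rolled back too, so `units.current_stage` never gets ahead of the history.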
What is rollback?
Rollback undoes changes made inside a transaction after a failure. It returns the database to its previous consistent state.
What is atomicity?
Atomicity means a transaction is all-or-nothing. Either every operation succeeds or none are applied.
What is isolation?
Isolation means concurrent transactions should not interfere with each other in a way that causes incorrect results.
What is durability?
Durability means once a transaction commits, the data should persist even if the system crashes afterward.
What is a race condition?
A race condition happens when two operations happen at the same time and the result depends on timing. Example: two operators update the same unit status at once.
How do you handle concurrent updates?
Use transactions, row locks, optimistic locking, version fields, updated_at checks, and state-transition validation.
What is optimistic locking?
Optimistic locking uses a version number or timestamp to detect whether a record changed after it was read. If it changed, the update is rejected or retried.
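A version-number check can be sketched with `sqlite3`. The table layout and the `update_status` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE units ("
    "serial_number TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO units VALUES ('VL001', 'testing', 1)")
conn.commit()


def update_status(serial_number, new_status, expected_version) -> bool:
    """Apply the update only if the row still has the version the caller read."""
    with conn:
        cur = conn.execute(
            "UPDATE units SET status = ?, version = version + 1 "
            "WHERE serial_number = ? AND version = ?",
            (new_status, serial_number, expected_version),
        )
    # rowcount is 0 when someone else changed the row first.
    return cur.rowcount == 1
```

When the function returns `False`, the caller can reload the record and either retry or show a conflict message.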
When is optimistic locking useful?
It is useful when conflicts are possible but not constant. Example: two users open the same unit detail page and both try to update status.
What is pessimistic locking?
Pessimistic locking locks a row while it is being updated so other transactions cannot modify it at the same time. It prevents conflicts but can reduce concurrency.
Optimistic locking vs pessimistic locking?
Optimistic locking allows more concurrency and detects conflicts later. Pessimistic locking prevents conflicts upfront but can block other users. Use optimistic for common web workflows and pessimistic for critical high-conflict operations.
What is a stale write?
A stale write happens when a user updates data based on an old version of the record. Optimistic locking can prevent this.
What is a lost update?
A lost update happens when one user's change silently overwrites another user's change. Version checks or locks can prevent it.
What is idempotency?
Idempotency means repeating the same operation multiple times has the same final effect as doing it once. It is essential for safe retries.
Why is idempotency important for reliability?
Network failures can cause clients or workers to retry. Without idempotency, retries can create duplicate test results, duplicate shipments, or repeated status changes.
How do you implement idempotency?
Use idempotency keys, unique event IDs, unique database constraints, upserts, and processed-event tracking.
What is an idempotency key?
An idempotency key is a unique value for one operation. If the same key is received again, the system returns the original result instead of repeating the operation.
What is retry safety?
Retry safety means an operation can be retried without causing duplicate or incorrect side effects. Idempotency is the main way to achieve it.
What is backoff?
Backoff means waiting before retrying a failed operation. Exponential backoff increases the wait after each failure.
What is a retry storm?
A retry storm happens when many clients retry at once and overload an already struggling system. Backoff and jitter help prevent this.
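Exponential backoff with jitter can be sketched as follows. The base delay, cap, and the "full jitter" variant (pick a random wait between zero and the exponential ceiling) are assumptions chosen for illustration.

```python
import random


def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Random wait in [0, ceiling] so clients do not all retry at the same time."""
    return random.uniform(0, backoff_ceiling(attempt, base, cap))
```

The ceiling doubles with each attempt (0.5 s, 1 s, 2 s, ...) until it hits the cap, and the randomness spreads retries from many clients over time.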
What is a timeout?
A timeout is a maximum wait time for an operation. It prevents requests from hanging forever when a dependency is slow or broken.
Why are timeouts important?
Without timeouts, slow dependencies can consume resources and cause cascading failures. Every external call should have a reasonable timeout.
What is a circuit breaker?
A circuit breaker temporarily stops calls to a failing dependency after repeated failures. It prevents wasting resources and gives the dependency time to recover.
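A simplified circuit breaker might look like this. The class, its thresholds, and the injectable `clock` are all assumptions; production libraries usually track more states and statistics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are being skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are paused")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a dependency that is already struggling.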
What is bulkhead isolation?
Bulkhead isolation separates resources so one failure does not take down the whole system. Example: separate worker pools for imports and test-result ingestion.
What is cascading failure?
A cascading failure happens when one failing component causes other components to fail. Example: a slow database causes API timeouts, which causes retries, which adds even more load.
How do you prevent cascading failures?
Use timeouts, circuit breakers, queues, backoff, rate limits, bulkheads, graceful degradation, and alerts.
What is load shedding?
Load shedding means intentionally rejecting or delaying some work to protect the system under high load. Example: reject noncritical report generation while keeping test uploads working.
What is rate limiting?
Rate limiting restricts how many requests a client can make in a time window. It protects systems from abuse, bugs, and overload.
What is throttling?
Throttling slows down request processing instead of immediately rejecting all excess traffic. It helps protect dependencies.
Rate limiting vs throttling?
Rate limiting sets a hard request limit, often returning 429. Throttling slows or delays requests to smooth load.
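The hard-limit side can be sketched as a fixed-window counter. The class name, the per-client key, and the injectable `clock` are assumptions; old windows are never cleaned up here, which a real limiter would need to handle.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._counts = defaultdict(int)  # (client_id, window index) -> count

    def allow(self, client_id) -> bool:
        window = int(self.clock() // self.window_seconds)
        key = (client_id, window)
        if self._counts[key] >= self.limit:
            return False  # the API layer would typically answer with HTTP 429
        self._counts[key] += 1
        return True
```

A throttling variant would delay or queue the request when `allow` returns `False` rather than rejecting it.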
What is a degraded mode?
A degraded mode is a limited version of the system during failures. Example: dashboard shows cached data while live aggregation is unavailable.
What is durability at the system level?
At the system level, durability means data is not lost after being accepted. Queues, transactions, backups, and persistent storage improve durability.
How do you avoid data loss?
Use durable queues, database transactions, idempotent writes, backups, raw event storage, dead-letter queues, and monitoring.
What is a dead-letter queue?
A dead-letter queue stores messages that failed processing too many times. It prevents bad messages from blocking good ones.
Why is a dead-letter queue important?
It gives engineers visibility into failed records and allows later reprocessing after fixing bugs or data issues.
What is a poison message?
A poison message is a message that repeatedly fails processing because it is malformed or triggers a bug.
How do you handle poison messages?
Retry a limited number of times, then move them to a dead-letter queue with error details and alert the owner.
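The retry-then-park flow can be sketched with plain lists standing in for the queue and the dead-letter queue. `process_with_dlq` and `parse_voltage` are hypothetical helpers, and alerting is left out.

```python
def parse_voltage(message):
    """Example handler: fails with ValueError on a malformed reading."""
    return float(message["voltage"])


def process_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times, then dead-letter it."""
    processed, dead_letters = [], []
    for message in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(message)
                processed.append(message)
                break
            except Exception as exc:
                last_error = repr(exc)
        else:
            # Every attempt failed: park the message with context for later review.
            dead_letters.append(
                {"message": message, "error": last_error, "attempts": max_attempts}
            )
    return processed, dead_letters
```

The malformed message ends up in `dead_letters` with its error details, while the messages around it are still processed.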
What is observability?
Observability is the ability to understand what a system is doing internally from outputs like logs, metrics, traces, health checks, and alerts.
Observability vs monitoring?
Monitoring watches known signals like error rate or latency. Observability helps investigate unknown problems by providing enough context to understand system behavior.
What are the three pillars of observability?
The common three pillars are logs, metrics, and traces. Health checks and alerts are also important in practical systems.
What are logs?
Logs are timestamped records of events, errors, and actions. They help developers debug what happened.
What is structured logging?
Structured logging records logs as key-value fields instead of plain text. Example fields include request_id, serial_number, user_id, endpoint, status_code, and error_code.
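With Python's standard `logging` module, key-value logs can be sketched as one JSON object per line. `JsonFormatter`, the logger name, and the `fields` convention for `extra` are assumptions, not a standard API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        entry.update(getattr(record, "fields", {}))  # fields passed via `extra`
        return json.dumps(entry, sort_keys=True)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured_example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error(
    "test upload failed",
    extra={"fields": {"request_id": "req-123", "serial_number": "VL001", "status_code": 422}},
)
```

Because every field is a separate key, a log search tool can filter on `serial_number` or `status_code` directly instead of parsing free text.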
Why is structured logging useful?
It makes logs searchable and filterable. You can quickly find all errors for serial_number VL001 or all failed test uploads from workstation 3.
What should API logs include?
method, path, status_code, latency, request_id, user_id or service_id, relevant entity IDs, and error details if an error occurred.
What should pipeline logs include?
job_id, source, records processed, records failed, retry count, duration, error reasons, event IDs, serial numbers, and timestamps.
What should security logs include?
login attempts, failed auth, permission denials, role changes, API key usage, sensitive exports, and admin actions.
What should not be logged?
Do not log passwords, API keys, tokens, private user data, full secrets, or sensitive payloads unless explicitly protected and necessary.
What is log level?
Log level describes severity. Common levels include DEBUG, INFO, WARNING, ERROR, and CRITICAL.
When should you use DEBUG logs?
Use DEBUG for detailed troubleshooting information that is usually disabled in production because it can be noisy.
When should you use INFO logs?
Use INFO for normal important events, such as job started, job completed, or unit status changed.
When should you use WARNING logs?
Use WARNING for unusual but recoverable situations, such as delayed job processing or a retryable failure.
When should you use ERROR logs?
Use ERROR when an operation fails and needs attention or investigation.
When should you use CRITICAL logs?
Use CRITICAL for severe failures that may make the system unusable or cause major data loss.
What is a metric?
A metric is a numeric measurement over time. Examples include API error rate, request latency, queue depth, ingestion lag, and database query time.
What is a counter metric?
A counter only increases over time, apart from resetting to zero when the process restarts. Example: total_test_results_processed or total_api_errors.
What is a gauge metric?
A gauge can go up or down. Example: queue_depth, active_users, database_connections.
What is a histogram metric?
A histogram measures distribution of values, such as request latency buckets.
What is latency?
Latency is how long an operation takes. Example: an API request takes 250 milliseconds.
What is p95 latency?
p95 latency is the value that 95 percent of requests complete at or below; the slowest 5 percent take longer. It shows tail performance better than averages.
Why is average latency sometimes misleading?
Averages hide slow outliers. Most users may be fast, but some requests could be extremely slow. Percentiles reveal those cases.
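A small worked example with made-up latencies shows the gap between the mean, the median, and p95. The nearest-rank `percentile` helper is one of several common percentile definitions, picked here for simplicity.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 94 fast requests and 6 very slow ones, in milliseconds.
latencies_ms = [100] * 94 + [3000] * 6

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 274.0
p50_ms = percentile(latencies_ms, 50)            # 100
p95_ms = percentile(latencies_ms, 95)            # 3000
```

The mean of 274 ms describes no actual request, while p95 makes it visible that more than 5 percent of requests take 3 seconds.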
What is throughput in systems?
Throughput is how many requests, jobs, or events a system processes per unit time.
What is error rate?
Error rate is the percentage of operations that fail. Example: failed API requests divided by total API requests.
What is saturation?
Saturation measures how full or overloaded a resource is. Example: CPU usage, memory usage, database connection pool usage, or queue backlog.
What are golden signals?
Golden signals are common reliability metrics: latency, traffic, errors, and saturation.
What API metrics should you monitor?
request rate, p95 latency, error rate, status codes, timeout count, dependency latency, and request volume by endpoint.
What database metrics should you monitor?
query latency, slow queries, connection count, lock waits, storage usage, CPU, memory, and replication lag if applicable.
What queue metrics should you monitor?
queue depth, oldest message age, processing rate, retry count, failed jobs, dead-letter count, and worker errors.
What dashboard metrics should you monitor?
dashboard API latency, widget errors, refresh failures, data freshness age, cache hit rate, and slow queries.
What pipeline metrics should you monitor?
records processed, records failed, ingestion lag, job duration, last successful run, retry count, dead-letter count, and abnormal zero-record runs.
What is an alert?
An alert is a notification triggered when a metric or event indicates a problem. Good alerts are actionable and not too noisy.
What makes an alert good?
A good alert is actionable, specific, timely, and has clear severity. It should help someone decide what to do next.
What is alert fatigue?
Alert fatigue happens when people receive too many noisy alerts and start ignoring them.
How do you reduce alert fatigue?
Alert only on actionable problems, use severity levels, group duplicates, tune thresholds, and add runbooks.
What is an alert threshold?
A threshold is the value that triggers an alert. Example: alert if queue depth is above 1000 for 10 minutes.
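The "above 1000 for 10 minutes" rule can be sketched as a check over timestamped samples. `should_alert` and the sample format are assumptions; real alerting systems evaluate rules like this on their own schedule.

```python
def should_alert(samples, threshold=1000, duration_s=600):
    """samples: (timestamp_seconds, queue_depth) pairs, oldest first.

    Returns True once depth has stayed above threshold for duration_s.
    """
    breach_start = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # any recovery resets the timer
    return False
```

Requiring the breach to last for a duration, rather than firing on a single spike, is one way to keep the alert from being noisy.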
What is alert severity?
Severity indicates urgency. Example: info, warning, critical. Critical alerts should require immediate response.
What is an alert runbook?
A runbook explains what an alert means, where to investigate, and what steps to take.
What is tracing?
Tracing follows a request or job across multiple services or components. It helps identify where time is spent and where failures occur.
What is a trace span?
A span is one operation within a trace, such as API validation, database query, or external API call.
Why is tracing useful?
Tracing helps debug slow or failing requests by showing the path of the request and timing for each component.
What is a request ID?
A request ID is a unique identifier for one request. It appears in logs so all events for that request can be connected.
What is a correlation ID?
A correlation ID connects related work across services, queues, or jobs. It helps trace workflows that continue asynchronously.
Request ID vs correlation ID?
A request ID usually identifies one request. A correlation ID may follow a larger workflow across multiple requests or jobs.
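The difference can be sketched as a helper that builds logging context for each incoming request. `build_log_context` is hypothetical, and the `X-Correlation-ID` header name is a common convention assumed here, not something the deck specifies.

```python
import uuid


def build_log_context(incoming_headers):
    """Create a fresh request ID and reuse the caller's correlation ID if present."""
    request_id = str(uuid.uuid4())  # always new: identifies this one request
    # Reused across the workflow; created only when this request starts the workflow.
    correlation_id = incoming_headers.get("X-Correlation-ID") or str(uuid.uuid4())
    return {"request_id": request_id, "correlation_id": correlation_id}
```

A follow-up request or queued job that forwards the same correlation ID gets its own request ID, but its logs can still be tied back to the original workflow.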