DBD Final


Last updated 9:34 PM on 5/3/26

53 Terms

1
New cards

We are mainly concerned with transactions for _____ parallel executions

task

2
New cards

What does ACID stand for?

atomicity, consistency, isolation, durability

3
New cards

what are the two types of parallel execution we talked about?

  1. task parallel: each thread executes its own query independently. Needs concurrency control and isolation

  2. data parallel: multiple threads working on the same query at the same time

4
New cards

what is isolation? what kinds of schedules maintain isolation?

  • each transaction executing as if it is the only transaction running on the DBMS

  • serial execution, serializable schedule

5
New cards

define “schedule”

any interleaved sequence of operations for 2+ transactions

6
New cards

define “serializable schedule”

  • a schedule whose outcome is equivalent to the outcome of one of the many possible serial executions

  • maintains serializability

7
New cards

what does “conflict serializable” mean?

  • a schedule is conflict-serializable if we can obtain a serial execution by swapping adjacent non-conflicting operations without changing the results

  • create a precedence graph to detect whether a schedule is conflict-serializable or not

  • conflict-serializable is an easier-to-detect subset of serializable

8
New cards

what is the relation between serializable, conflict-serializable, and serial?

[serializable [conflict-serializable [serial]]]

9
New cards

how do you draw a precedence graph? what is a precedence graph used for?

  • each transaction is a node

  • if there is a conflict (any write+write or read+write from 2 threads on the same data) between t1 and t2 and t1’s operation comes before t2’s, add an edge from t1 to t2

  • if there is a cycle, the schedule is not conflict-serializable
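The rules above can be sketched as a small script: build the edge set from pairwise conflicts, then DFS for a cycle. The (txn, op, item) schedule format and all names are my own, not from the course.

```python
def conflicts(op1, op2):
    # Two operations conflict if they touch the same item, come from
    # different transactions, and at least one is a write.
    return (op1[2] == op2[2] and op1[0] != op2[0]
            and 'W' in (op1[1], op2[1]))

def precedence_graph(schedule):
    # schedule: list of (txn, 'R'|'W', item) in execution order.
    # For each conflict where a comes before b, add edge a.txn -> b.txn.
    edges = set()
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if conflicts(a, b):
                edges.add((a[0], b[0]))
    return edges

def has_cycle(edges):
    # DFS 3-coloring; a cycle means NOT conflict-serializable.
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in graph)
```

For example, the schedule R1(x), W2(x), W1(x) produces edges 1→2 and 2→1, a cycle, so it is not conflict-serializable.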

10
New cards

how do you know if a schedule is recoverable or not?

  • non-recoverable:

    • may or may not be conflict-serializable

    • a failure exposes the problem, e.g. a transaction commits a value it read from another transaction's write, and that writer later aborts, so the committed read is of a value that no longer exists

  • recoverable:

    • if t2 depends on the results of t1, do not commit t2 until after t1 has committed/aborted

    • may or may not have cascading abort problem

11
New cards

what is the cascading abort problem? how do you prevent it?

  • cascading abort problem

    • when 1 abort aborts many dependent transactions, since a transaction has read an intermediate value of the aborted transaction

  • cascadeless schedule

    • rule: if t2 reads the write of t1, only begin t2’s read after t1 commits/aborts

12
New cards

what is a lock manager?

  • hash table of < RID of the item we want to read/write, list of transactions that have acquired the lock or are waiting for it >

  • enforces s/x-lock compatibility matrix

  • useful for deadlock detection

13
New cards

What is the purpose of 2PL? What are the requirements for 2PL?

  • purpose: to produce conflict-serializable schedules

  1. lock before operation using s-lock for reads, x-lock for writes

  2. 2 phases: grow phase (only request or upgrade), followed by shrink phase (only release or downgrade)

14
New cards

What is the purpose of strong strict 2PL? What are the requirements?

  • purpose: to produce conflict-serializable, recoverable schedules that avoid cascading aborts by ensuring every transaction only reads committed data

  • requirement: only release locks after we have aborted/committed

15
New cards

What are the 2 categories of ways to deal with deadlock?

  1. use wait-for graph to detect deadlock, then abort one of them

    • transactions = nodes; use lock manager to figure out wait-for edges between transactions

    • cycle → deadlock

    • abort a transaction (typically newer) to break the cycle

  2. avoid deadlocks by assigning a timestamp to each transaction and allowing 1 directional wait

    1. “wait-die”: old waits for young. If I’m older, wait. If I’m younger, abort myself.

    2. “wound-wait”: young waits for old. If I’m younger, wait. If I’m older, abort the lock holder.
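Each policy boils down to a single timestamp comparison. A minimal sketch (smaller timestamp = older transaction; function and return-value names are mine):

```python
def wait_die(requester_ts, holder_ts):
    # "old waits for young": an older requester waits;
    # a younger requester dies (aborts itself).
    return 'wait' if requester_ts < holder_ts else 'abort_self'

def wound_wait(requester_ts, holder_ts):
    # "young waits for old": a younger requester waits;
    # an older requester wounds (aborts) the lock holder.
    return 'abort_holder' if requester_ts < holder_ts else 'wait'
```

In both policies waiting is only ever allowed in one direction of the age ordering, so no cycle of waits can form.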

16
New cards

T/F: Using 2PL, schedules may or may not be recoverable.

True. No guarantee that schedules are recoverable; need strong strict 2PL.

17
New cards

T/F: for wait-die or wound-wait deadlock prevention, after we have aborted a young transaction and it retries, it should be assigned a new timestamp.

False; it should maintain the same timestamp so it’s guaranteed to make progress eventually

18
New cards

How might atomicity be violated in a system without logging?

System failure occurs after a dirty page is pushed to disk. There's no way to undo the partial change when we only have the value on disk to work with.

19
New cards

How might durability be violated in a system without logging?

System failure occurs before a committed transaction's dirty pages are pushed to disk, so its committed updates are lost.

20
New cards

Logging can handle what types of failures?

  1. transaction failure (e.g. abort)

    • UNDO (rollback)

  2. system crash

    • UNDO uncommitted transactions

    • REDO committed transactions

21
New cards

For the purposes of logging, what are the two policies we assume the BP uses?

  1. no force policy: writes of an uncommitted transaction can be kept in memory

  2. steal policy: writes of an uncommitted transaction can be pushed to disk

22
New cards

What is a CLR?

  • a log record of the form <ti, data identifier, before value, undoNextLSN=prevLSN field of corresponding log>

  • means “this is an undo operation I want to do, as a result of an abort or during the undo phase of recovery”

  • used only during the redo phase of recovery; ignored during undo phase

23
New cards

How is a blocking checkpoint performed?

  1. pause execution of all transactions

  2. flush all log records to disk

  3. flush all dirty pages to disk

  4. append the log record <checkpoint, L>, where L is the list of transactions that were active (and are now paused) when the checkpoint started

  5. resume

24
New cards

Why is a blocking checkpoint better than no checkpoint at all? What does a blocking checkpoint signify?

  • all log records before the checkpoint have been (persistently) applied to the pages

  • start recovery process at this checkpoint, rather than the beginning of the log

25
New cards

What is the algorithm for transaction rollback in case of abort (non-ARIES, just normal)? Assume the transaction we want to roll back is t2.

Scan the log backward. For each log record in WAL:
	if <t2, x, v1, v2>
		// perform undo and append corresponding CLR
		update x = v1
		append CLR <t2, x, v1>
	if <t2, start>
		// backward scan is stopped; all changes made by t2 have been reverted
		append <t2, abort>
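A runnable version of the backward scan above, under assumed log-record formats (my own convention: updates are dicts with 'before'/'after' values; start, abort, and CLR records carry a 'type' field):

```python
def rollback(log, db, txn):
    # Scan the log backward; undo txn's updates, appending a CLR for each,
    # and stop at txn's start record, appending an abort record.
    for rec in reversed(list(log)):          # snapshot, so appends are safe
        if rec.get('txn') != txn:
            continue
        if rec.get('type') == 'start':
            log.append({'txn': txn, 'type': 'abort'})
            break
        if 'before' in rec and 'after' in rec:           # normal update
            db[rec['item']] = rec['before']              # undo it
            log.append({'txn': txn, 'type': 'clr',
                        'item': rec['item'], 'value': rec['before']})
```

For instance, if t2 wrote x: 5 → 9 and then aborts, rollback restores x to 5 and the log ends with a CLR followed by <t2, abort>.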

26
New cards

For recovery, what is the goal of the redo phase? What is the algorithm (non-ARIES)?

  • replay the log history from checkpoint and redo updates of all transactions (even if the transaction ends up rolled back or uncommitted)

  • figure out which transactions are uncommitted (active), for the purpose of the undo phase

init L = { uncommitted transactions at this point, from the checkpoint log record }
Scan the log forward from checkpoint. For each log record:
	if <ti, x, v1, v2> (normal update)
		update x = v2
	if <ti, x, v1> (CLR)
		update x = v1
	if <ti start>
		add ti to L
	if <ti commit> or <ti abort>
		// this transaction is finished. Either committed or all effects have been reverted.
		remove ti from L

// L is now the uncommitted transactions at the point of the crash. i.e. they’re either uncommitted in the middle of operation, or they’re in incomplete abortion

27
New cards

What is the algorithm for the undo phase of recovery (non-ARIES)?

Scan the log backward. For each log record:
	if <ti, x, v1, v2>.ti is in L
		// undo
		update x = v1
		append CLR <ti, x, v1>
	if <ti start>.ti is in L
		// we’ve finished the undo for ti (we’re scanning backwards)
		remove ti from L
		append <ti abort>
	if L is empty
		// there are no transactions to be undone
		stop
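The redo and undo phases above can be sketched together. Record formats are the same assumed convention as before (dict-based, my own, not the course's):

```python
def recover(log, db, checkpoint_active):
    # Redo phase: replay forward from the checkpoint, tracking active txns.
    active = set(checkpoint_active)
    for rec in log:
        t = rec.get('txn')
        if rec.get('type') == 'start':
            active.add(t)
        elif rec.get('type') in ('commit', 'abort'):
            active.discard(t)                 # finished; nothing to undo
        elif rec.get('type') == 'clr':
            db[rec['item']] = rec['value']    # CLRs are redone, not undone
        elif 'after' in rec:
            db[rec['item']] = rec['after']    # normal update: redo
    # Undo phase: scan backward, reverting updates of uncommitted txns.
    for rec in reversed(list(log)):
        t = rec.get('txn')
        if t not in active:
            continue
        if rec.get('type') == 'start':
            log.append({'txn': t, 'type': 'abort'})
            active.discard(t)
        elif 'before' in rec and 'after' in rec:
            db[rec['item']] = rec['before']
            log.append({'txn': t, 'type': 'clr',
                        'item': rec['item'], 'value': rec['before']})
        if not active:
            break                             # no transactions left to undo
```

Note that CLRs are applied during redo but skipped during undo, matching card 22.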

28
New cards

For ARIES, what are all the new data structures introduced?

  1. LSN

  2. PageLSN

  3. PrevLSN

  4. UndoNextLSN

  5. ATT

  6. DPT

  7. checkpoint <DPT, ATT>

29
New cards

For ARIES, what is LSN?

  • log sequence number

  • each log record is given a unique, monotonically increasing LSN

30
New cards

For ARIES, what is PageLSN? What is the purpose?

  • purpose: skip applying a redo record if it’s already on disk

  • embedded in the header of each page in disk

  • whenever we apply an update to a page, also assign the page’s PageLSN = LSN that represents this update

  • all log records up to PageLSN have been applied persistently to this page already

  • during redo phase of recovery, any log records <= PageLSN will be skipped

31
New cards

For ARIES, what is PrevLSN? What is the purpose?

  • purpose: function as a linkedlist pointer during undo for the relevant transaction

  • each log record (except CLR) contains PrevLSN field that points to the previous log record of the same transaction

  • when performing an undo, we can follow the pointers like a linked list instead of scanning all records

32
New cards

For ARIES, what is UndoNextLSN? What is the purpose?

  • purpose: ensure that if a system crashes during a rollback/recovery (undo phase), it skips over previously processed undo operations

  • in CLRs only

  • when appending a CLR, UndoNextLSN := the prevLSN field of the log the CLR corresponds to

  • after encountering a CLR, we can skip to the prevLSN of the CLR’s corresponding log, because we know the corresponding log (this CLR) has already been applied earlier in the redo phase

33
New cards

For ARIES, what is ATT? What is the purpose?

  • active transaction table

  • table of < tid of active transaction, LastLSN of that transaction >

    • LastLSN = the most recent LSN involving this transaction; the entry point into that transaction’s backward-linked list of log records

  • purpose: for undo, where we start from LastLSN and follow prevLSN/undoNextLSN to find all the log records we need to undo

  • entry is deleted when a transaction is aborted or committed

34
New cards

For ARIES, what is DPT? What is the purpose?

  • records the starting point when a page becomes dirty for the first time

  • hash table of < pageID of dirty page, RecentLSN >

    • RecentLSN = first log record that made this page dirty

  • entry is deleted when page is written to disk

35
New cards

What is a fuzzy checkpoint?

<checkpoint, ATT, DPT>

used for ARIES

36
New cards

What is the transaction rollback (for aborts) algorithm for ARIES? Assume the transaction we want to rollback is t2.

Look up ATT.lastLSN for t2
Undo from lastLSN {
	use prevLSN pointers to skip the log records that are irrelevant to t2
	after each undo, append a CLR: <t2, x, v1, undoNextLSN=(current log’s prevLSN)>
	follow prevLSN back until reaching <t2 start>
}
Append <t2 abort>

37
New cards

What is the recovery (for system crashes) algorithm for ARIES ?

Phase1: Analysis
Goal: read DPT and ATT from checkpoint. Go through all following log records and update DPT and ATT.

if log record for ti not in ATT
	add ti to ATT
if <ti commit or abort>
	remove ti from ATT
if a page is dirtied for the first time
	add to DPT
	(we never remove any page from DPT)
keep updating ATT.lastLSN for ti

RedoLSN = MIN of (recentLSN in DPT)
	the LSN where we will start the redo phase
	or RedoLSN = LSN of checkpoint if DPT is empty
Phase2: Redo
Goal: replay every action that is not already in the page on disk.

Scan the log starting at RedoLSN:
	For each log record Li (Li is an LSN)=<Ti, …> that updates page Pi
		if Pi is not in DPT || Li < DPT.recentLSN(Pi)
			skip Li 
			// bc it’s not dirtied, or the log record is earlier than the first time the page was dirtied
		else
			read Pi from disk
			if Pi.PageLSN < Li
				// the last time Pi was written to disk was a time before Li
				redo the log record
				update Pi.PageLSN

// Note: skipping any log record means the effects of the log record have already appeared on the page
Phase3: Undo
for each <ti, lastLSN> in ATT, invoke the rollback procedure:
	Look up ATT.lastLSN for ti
	Undo from lastLSN {
		use prevLSN pointer to skip the log records that are irrelevant to ti
		after each undo, append a CLR: <ti, x, v1, undoNextLSN=(current log’s prevLSN)>
		follow prevLSN back until reaching <ti start>
	}
	Append <ti abort>

38
New cards

What are 3 data partitioning methods for distributed databases?

  1. range-based

    • partition vector

    • serves point and range queries efficiently

  2. hash-based

    • hash function

    • serves point queries efficiently; must send range queries to all nodes

  3. round robin

    • randomly distributed

    • must send point and range queries to all nodes
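The three routing rules above can be made concrete with a toy router; the partition vector, node count, and use of Python's built-in hash are all assumptions:

```python
from bisect import bisect_right

NODES = 3
PARTITION_VECTOR = [100, 200]   # node 0: key < 100; node 1: 100..199; node 2: >= 200

def route_range(key):
    # Range-based: binary search the partition vector for the owning node.
    return bisect_right(PARTITION_VECTOR, key)

def route_hash(key):
    # Hash-based: one node per point lookup, chosen by hash.
    return hash(key) % NODES

def range_query_nodes(lo, hi, scheme):
    # Range partitioning serves a range query from only the covering nodes;
    # hash and round-robin must broadcast to every node.
    if scheme == 'range':
        return list(range(route_range(lo), route_range(hi) + 1))
    return list(range(NODES))
```

So the range query [50, 150] touches only nodes 0 and 1 under range partitioning, but all three nodes under the other two schemes.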

39
New cards

How is a global primary index built?

  • Partitioning function (hash or range) assigns different nodes to different partition keys

  • Each node builds a local index of tuples based on the partition key

40
New cards

How is a global secondary index built?

  • Partitioning function (hash or range) assigns different nodes to different values of a column A

  • Each node builds a table of

    • A value → [list of global primary index keys with this A value, out of all tuples in the entire database]

  • global secondary index is consulted to answer queries like get(dept=cs): use global secondary index to find all the IDs of tuples with dept=cs, then use global primary index to find the tuples with those IDs
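The two-step get(dept=cs) lookup can be sketched with plain dicts standing in for the two indexes (all data invented):

```python
# Global primary index: primary key -> tuple.
primary = {1: {'id': 1, 'dept': 'cs'},
           2: {'id': 2, 'dept': 'ee'},
           3: {'id': 3, 'dept': 'cs'}}

# Global secondary index on dept: value -> list of primary keys,
# drawn from all tuples across the whole database.
secondary = {'cs': [1, 3], 'ee': [2]}

def get_by_dept(dept):
    # Step 1: secondary index yields the matching primary keys.
    # Step 2: primary index resolves each key to its tuple.
    return [primary[pk] for pk in secondary.get(dept, [])]
```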

41
New cards

what are the two types of skewness in distributed databases?

  1. data skewness: some node has more data

  2. execution skewness: data is evenly distributed, but access hotness is different

42
New cards

What are the 2 ways to reconfigure distributed partitioning?

  1. Dynamic repartitioning (for range partitioning)

    • keep a partition table that maps ranges of a continuous keyspace to different “tablets” assigned to different nodes. Idea is to split into more tablets than there are nodes, and reassign tablets to diff nodes. Can also split a tablet/partition range.

  2. Consistent hashing (for hash partitioning)

    • organize large continuous keyspace into a ring. Assign keys to the next node in the clockwise direction on the ring.
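A toy consistent-hashing ring for the second scheme, assuming md5 as the hash function and made-up node names:

```python
import hashlib
from bisect import bisect

def h(s):
    # Map a string onto the (large, continuous) keyspace.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Place each node at its hash position on the ring.
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise: first node position >= hash(key), wrapping around.
        i = bisect(self.points, (h(key), ''))
        return self.points[i % len(self.points)][1]
```

The payoff: removing a node only moves the keys that node owned; every other key keeps its owner.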

43
New cards

In distributed transactions, every node has 2 components. What are these and what are their roles?

  1. TC (transaction coordinator)

    • where transactions come in

    • looks at partitioning table to determine which node’s TM to send each operation to; sends operations to correct TM to execute, collects results, returns to end user

  2. TM (transaction manager)

    • executes transaction operations

44
New cards

How is 2PL and deadlock dealt with in distributed transactions?

  • 2PL: each node has its own lock manager that manages locks for tuples stored locally on that node. When a transaction is done, a remote request to unlock the tuple will be sent to the lock manager on that node.

  • Deadlock: assign a timestamp to each transaction with node name as a tiebreaker. Do old waits for young or young waits for old.

45
New cards

What is the 2PC algorithm in distributed transactions?

Phase1:
TC logs <prepare T>
TC sends “prepare T” to all nodes in N={nodes that have received a write request and have log records for this transaction}
	see if T can be committed, according to each Ni’s local log records
for each Ni in N
	receive the “prepare T” request
	locally decide if T can commit
	if T can commit,
		log <ready T>
		return “ready”
	else
		log <no T>
		reply “abort”
Phase2:
TC collects all “ready T” messages
if TC receives all “ready T”
	TC logs <T commit>
	send “commit T” to all nodes
else 
	TC logs <T abort>
	sends “abort T” to all nodes
for each node Ni in N
	if TC said “commit T”
		commit
	else
		abort
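The vote collection and decision above can be sketched as follows; participants are modeled as callables that return True ("ready") or False ("abort"), and all names are invented:

```python
def two_phase_commit(tc_log, participants):
    # Phase 1: TC logs the prepare and collects one vote per node in N.
    tc_log.append('prepare T')
    votes = [p('prepare T') for p in participants]
    # Phase 2: commit only if every node voted ready. The logged
    # decision is the single ground truth (see card 46).
    if all(votes):
        tc_log.append('commit T')
        return 'commit T'   # broadcast "commit T" to all nodes
    tc_log.append('abort T')
    return 'abort T'        # broadcast "abort T" to all nodes
```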

46
New cards

T/F: For 2PC in distributed transactions, in case of system crash, if TC’s log has <T commit>, the transaction will be committed no matter what. If it does not have <T commit> log record, the transaction is aborted and we go into undo phase. <T commit> is the single ground truth that tells us whether we should abort or not.

T

47
New cards

What is the protocol if a node crashes in 2PC in distributed transactions?

Let’s say TC detects that N1 failed. TC does this:
if TC receives <ready T> from N1
	continue the commit process
else
	assume N1 will send “abort T” in phase1 of 2PC
	abort T
After recovering, N1 does this:
	look at my own local log
		local <commit T>
			commit
		local <abort T>
			abort and undo transaction
		only local <ready T>
			ask if TC has commit or abort log. If commit, then commit. If abort, then abort.
			else if TC is unreachable, ask if N2 has commit or abort log. If commit, then commit. If abort, then abort.
			else if N2 has only “ready”, wait for TC to give us the final answer, commit or abort.
		no <ready T>
			abort, since TC will abort too

48
New cards

What is the protocol if TC crashes in 2PC in distributed transactions?

for each Ni, look at my own log record
	if <commit T>
		commit
	elif <abort T>
		abort
	elif nothing, no <ready T>
		abort
	elif only <ready T>
		ask all other nodes
		if all <ready T>
			wait for TC to commit or not
		else
			abort

After recovering, TC does this:
	if <commit T>
		send “commit T” to all nodes
	else
		send “abort T” to all nodes

49
New cards

What is the diff between OLTP and OLAP?

  • OLTP: online transaction processing

    • short-lived transactions

    • only touches a couple of records

    • involves many writes

    • perform db operations on the current snapshot of the data

  • OLAP: online analytical processing

    • touch many rows

    • read-heavy

    • focused on historical data

50
New cards

What is column store?

  • store each column as separate files

  • can scan specific columns of all tuples

  • can find matching tuples for each col value using either

    • fixed offset math, if attributes are fixed length, or

    • embed the tuple ID into each “column” value

      • this method also allows row reordering, which is helpful for column compression

51
New cards

What is PAX hybrid row/column store?

  • divide rows into row groups

  • within each row group, store as column store

  • has benefit of col store: can scan by column, but are not able to skip columns as aggressively as pure col store

  • has benefit of row store: cols belonging to the same row are close to each other, so you only have to read 2 pages into memory if you’re touching 2 tuples

52
New cards

What are the 5 column compression techniques? Explain each one.

  1. Run-length encoding (+ sorting and embedded tuple ID)

    • represent repeated values as a (value, start offset, run length) triple, e.g. (y, 0, 3)

  2. Bit packing

    • use a special value to represent an outlier value, then store the outlier values in a separate place, and use a smaller datatype for the rest of the in-range values

  3. Bit map

    • for a column with n possible distinct values, keep one bit vector per value, with one bit per row. Useful for union (“or”) queries.

  4. Delta encoding (+ RLE)

    • store the starting point and the delta

    • can store delta entries with RLE

  5. Dictionary

    • deal with columns whose values are specific variable-length strings

    • store an enum basically elsewhere; map strings to e.g. int8
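Technique 1 can be sketched directly as a toy encoder/decoder over the (value, start offset, run length) triples (an illustration, not a real column-store format):

```python
def rle_encode(col):
    # Works best when the column is sorted so equal values are adjacent.
    runs = []
    for i, v in enumerate(col):
        if runs and runs[-1][0] == v:
            value, start, length = runs[-1]
            runs[-1] = (value, start, length + 1)   # extend current run
        else:
            runs.append((v, i, 1))                  # new run starts at i
    return runs

def rle_decode(runs):
    out = []
    for value, start, length in runs:
        out.extend([value] * length)
    return out
```

Three repeats of y followed by one x compress to just two triples, and decoding restores the original column.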

53
New cards