All Implemented Interfaces:
EnvConfigObserver, DaemonRunner, ExceptionListenerUser, java.lang.Runnable
public class Checkpointer
extends DaemonThread
implements EnvConfigObserver
The Checkpointer looks through the tree for internal nodes that must be
flushed to the log. Checkpoint flushes must be done in ascending order from
the bottom of the tree up.
Checkpoint and IN Logging Rules
-------------------------------
The checkpoint must log, and make accessible via non-provisional ancestors,
all INs that are dirty at CkptStart. If we crash and recover from that
CkptStart onward, any IN that became dirty (before the crash) after the
CkptStart must become dirty again as the result of replaying the action that
caused it to originally become dirty.
Therefore, when an IN is dirtied at some point in the checkpoint interval,
but is not logged by the checkpoint, the log entry representing the action
that dirtied the IN must follow either the CkptStart or the FirstActiveLSN
that is recorded in the CkptEnd entry. The FirstActiveLSN is less than or
equal to the CkptStart LSN. Recovery processes LNs from the
FirstActiveLSN to the end of the log. Other entries are processed only
from the CkptStart forward, and provisional entries are not processed.
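The recovery ranges just described can be sketched as a small predicate. This
is an illustration only, not JE code: the class and method names are
hypothetical, an LSN is simplified to a single long, and treating both range
boundaries as inclusive is an assumption.

```java
// Illustrative sketch only; not taken from the JE source.
final class RecoveryRangeSketch {

    /**
     * Returns whether recovery would process a log entry, following the
     * rules stated above. All parameters are simplifications: a real LSN
     * is a (file, offset) pair, modeled here as one long.
     */
    static boolean isProcessed(boolean isLN,
                               boolean provisional,
                               long lsn,
                               long firstActiveLsn,
                               long ckptStartLsn) {
        if (provisional) {
            return false; // provisional entries are not processed
        }
        // LNs are processed from FirstActiveLSN, other entries from CkptStart.
        long start = isLN ? firstActiveLsn : ckptStartLsn;
        return lsn >= start; // inclusive lower bound is an assumption
    }

    public static void main(String[] args) {
        long firstActive = 80, ckptStart = 100;
        System.out.println(isProcessed(true, false, 90, firstActive, ckptStart));
        System.out.println(isProcessed(false, false, 90, firstActive, ckptStart));
        System.out.println(isProcessed(false, true, 150, firstActive, ckptStart));
    }
}
```

In this sketch an LN at 90 is processed because it follows the FirstActiveLSN,
while an IN at the same position is skipped because it precedes CkptStart.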
Example: Non-transactional LN logging. We take two actions: 1) log the LN
and then 2) dirty the parent BIN. What if the LN is logged before CkptStart
and the BIN is dirtied after CkptStart? How do we avoid breaking the rules?
The answer is that we log the LN while holding the latch on the parent BIN,
and we don't release the latch until after we dirty the BIN. The
construction of the checkpoint dirty map requires latching the BIN. Since
the LN was logged before CkptStart, the BIN will be dirtied before the
checkpointer latches it during dirty map construction. So the BIN will
always be included in the dirty map and logged by the checkpoint.
Example: Abort. We take two actions: 1) log the abort and then 2) undo the
changes, which modifies (dirties) the BIN parents of the undone LNs. There
is nothing to prevent logging CkptStart in between these two actions, so how
do we avoid breaking the rules? The answer is that we do not unregister the
transaction until after the undo phase. So although the BINs may be dirtied
by the undo after CkptStart is logged, the FirstActiveLSN will be prior to
CkptStart. Therefore, we will process the Abort and replay the action that
modifies the BINs.
Exception: Lazy migration. The log cleaner will make an IN dirty without
logging an action that makes it dirty. This is an exception to the general
rule that actions should be logged when they cause dirtiness. The reasons
this is safe are:
1. The IN contents are not modified, so there is no information lost if the
IN is never logged, or is logged provisionally and no ancestor is logged
non-provisionally.
2. If the IN is logged non-provisionally, this will have the side effect of
recording the old LSN as being obsolete. However, the general rules for
checkpointing and recovery will ensure that the new version is used in
the Btree. The new version will either be replayed by recovery or
referenced in the active Btree via a non-provisional ancestor.
Checkpoint Algorithm
--------------------
The final checkpointDirtyMap field is used to hold (in addition to the dirty
INs) the state of the checkpoint and highest flush levels. Access to this
object is synchronized so that eviction and checkpointing can access it
concurrently. When a checkpoint is not active, the state is CkptState.NONE
and the dirty map is empty. When a checkpoint runs, we do this:
1. Get set of files from cleaner that can be deleted after this checkpoint.
2. Set checkpointDirtyMap state to DIRTY_MAP_INCOMPLETE, meaning that dirty
map construction is in progress.
3. Log CkptStart
4. Construct dirty map, organized by Btree level, from dirty INs in INList.
The highest flush levels are calculated during dirty map construction.
Set checkpointDirtyMap state to DIRTY_MAP_COMPLETE.
5. Flush INs in dirty map.
+ First, flush the bottom two levels a sub-tree at a time, where a
sub-tree is one IN at level two and all its BIN children. Higher
levels (above level two) are logged strictly by level, not using
subtrees.
o If je.checkpointer.highPriority=false, we log one IN at a
time, whether or not the IN is logged as part of a subtree,
and do a Btree search for the parent of each IN.
o If je.checkpointer.highPriority=true, for the bottom two
levels we log each sub-tree in a single call to the
LogManager with the parent IN latched, and we only do one
Btree search for each level two IN. Higher levels are logged
one IN at a time as with highPriority=false.
+ The Provisional property is set as follows, depending on the level
of the IN:
o level is max flush level: Provisional.NO
o level is bottom level: Provisional.YES
o Otherwise (middle levels): Provisional.BEFORE_CKPT_END
6. Flush VLSNIndex cache to make VLSNIndex recoverable.
7. Flush UtilizationTracker (write FileSummaryLNs) to persist all
tracked obsolete offsets and utilization summary info, to make this info
recoverable.
8. Log CkptEnd
9. Delete cleaned files from step 1.
10. Set checkpointDirtyMap state to NONE.
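The Provisional selection in step 5 can be sketched as follows. This is a
minimal illustration, not the JE implementation: the class name, the method
name and the integer level parameters are hypothetical, and the local enum
only mirrors the three values named above.

```java
// Illustrative sketch only; not taken from the JE source.
final class FlushLevelSketch {

    /** Local stand-in for the three Provisional values named in step 5. */
    enum Provisional { NO, YES, BEFORE_CKPT_END }

    /**
     * Chooses the Provisional value the checkpoint would use for an IN at
     * the given level. The max flush level is tested first, matching the
     * order of the rules in step 5.
     */
    static Provisional forCheckpoint(int level,
                                     int bottomLevel,
                                     int maxFlushLevel) {
        if (level == maxFlushLevel) {
            return Provisional.NO;
        }
        if (level == bottomLevel) {
            return Provisional.YES;
        }
        return Provisional.BEFORE_CKPT_END; // middle levels
    }

    public static void main(String[] args) {
        // Hypothetical tree with bottom level 1 and max flush level 3.
        for (int level = 1; level <= 3; level++) {
            System.out.println("level " + level + " -> "
                + forCheckpoint(level, 1, 3));
        }
    }
}
```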
Provisional.BEFORE_CKPT_END
---------------------------
See Provisional.java for a description of the relationship between the
checkpoint algorithm above and the BEFORE_CKPT_END property.
Coordination of Eviction and Checkpointing
------------------------------------------
Eviction can proceed concurrently with all phases of a checkpoint, and
eviction may take place concurrently in multiple threads. This concurrency
is crucial to avoid blocking application threads that perform eviction and
to reduce the amount of eviction required in application threads.
Eviction calls Checkpointer.coordinateEvictionWithCheckpoint, which calls
DirtyINMap.coordinateEvictionWithCheckpoint, just before logging an IN.
coordinateEvictionWithCheckpoint returns whether the IN should be logged
provisionally (Provisional.YES) or non-provisionally (Provisional.NO).
Other coordination necessary depends on the state of the checkpoint:
+ NONE: No additional action.
o return Provisional.NO
+ DIRTY_MAP_INCOMPLETE: The parent IN is added to the dirty map, exactly
as if it were encountered as dirty in the INList during dirty map
construction.
o IN level >= highest flush level: return Provisional.NO
o IN level < highest flush level: return Provisional.YES
+ DIRTY_MAP_COMPLETE:
o IN is root: return Provisional.NO
o IN is not root: return Provisional.YES
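The decision table above can be sketched as a single method. This is a
simplified illustration and not the actual
DirtyINMap.coordinateEvictionWithCheckpoint: the signature and parameters are
hypothetical, and the side effect of adding the parent IN to the dirty map in
the DIRTY_MAP_INCOMPLETE state is omitted.

```java
// Illustrative sketch only; not taken from the JE source.
final class EvictionCoordinationSketch {

    /** Local stand-ins for the states and return values named above. */
    enum CkptState { NONE, DIRTY_MAP_INCOMPLETE, DIRTY_MAP_COMPLETE }
    enum Provisional { NO, YES }

    /** Returns how eviction should log an IN, given the checkpoint state. */
    static Provisional coordinate(CkptState state,
                                  int inLevel,
                                  int highestFlushLevel,
                                  boolean isRoot) {
        switch (state) {
            case NONE:
                return Provisional.NO;
            case DIRTY_MAP_INCOMPLETE:
                // JE also adds the parent IN to the dirty map here; that
                // side effect is left out of this sketch.
                return inLevel >= highestFlushLevel
                    ? Provisional.NO : Provisional.YES;
            case DIRTY_MAP_COMPLETE:
                return isRoot ? Provisional.NO : Provisional.YES;
            default:
                throw new AssertionError(state);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            coordinate(CkptState.NONE, 1, 3, false));
        System.out.println(
            coordinate(CkptState.DIRTY_MAP_INCOMPLETE, 1, 3, false));
        System.out.println(
            coordinate(CkptState.DIRTY_MAP_COMPLETE, 3, 3, true));
    }
}
```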
In general this is designed so that eviction will use the same provisional
value that would be used by the checkpoint, as if the checkpoint itself were
logging the IN. However, there are several conditions where this is not
exactly the case.
1. Eviction may log an IN with Provisional.YES when the IN was not dirty at
the time of dirty map creation, if it became dirty afterwards. In this
case, the checkpointer would not have logged the IN at all. This is safe
because the actions that made that IN dirty are logged in the recovery
period.
2. Eviction may log an IN with Provisional.YES after the checkpoint has
logged it, if it becomes dirty again. In this case the IN is logged
twice, which would not have been done by the checkpoint alone. This is
safe because the actions that made that IN dirty are logged in the
recovery period.
3. An intermediate level IN (not bottom most and not the highest flush
level) will be logged by the checkpoint with Provisional.BEFORE_CKPT_END
but will be logged by eviction with Provisional.YES. See below for why
this is safe.
4. Between checkpoint step 8 (log CkptEnd) and 10 (set checkpointDirtyMap
state to NONE), eviction may log an IN with Provisional.YES, although a
checkpoint is not strictly active during this interval. See below for
why this is safe.
It is safe for eviction to log an IN as Provisional.YES for the last two
special cases, because this does not cause incorrect recovery behavior. For
recovery to work properly, it is only necessary that:
+ Provisional.NO is used for INs at the max flush level during an active
checkpoint.
+ Provisional.YES or BEFORE_CKPT_END is used for INs below the max flush
level, to avoid replaying an IN during recovery that may depend on a file
deleted as the result of the checkpoint.
You may ask why we don't use Provisional.YES for eviction when a checkpoint
is not active. There are two reasons, both related to performance:
1. This would be wasteful when an IN is evicted in between checkpoints, and
that portion of the log is processed by recovery later, in the event of a
crash. The evicted INs would be ignored by recovery, but the actions
that caused them to be dirty would be replayed and the INs would be
logged again redundantly.
2. Logging an IN provisionally will not count the old LSN as obsolete
immediately, so cleaner utilization will be inaccurate until a
non-provisional parent is logged, typically by the next checkpoint. It
is always important to keep the cleaner from stalling and spiking, to
keep latency and throughput as level as possible.
Therefore, it is safe to log with Provisional.YES in between checkpoints,
but not desirable.
Although we don't do this, it would be safe and optimal to evict with
BEFORE_CKPT_END in between checkpoints, because it would be treated by
recovery as if it were Provisional.NO. This is because the interval between
checkpoints is only processed by recovery if it follows the last CkptEnd,
and BEFORE_CKPT_END is treated as Provisional.NO if the IN follows the last
CkptEnd.
However, it would not be safe to evict an IN with BEFORE_CKPT_END during a
checkpoint, when logging of the IN's ancestors does not occur according to
the rules of the checkpoint. If this were done, then if the checkpoint
completes and is used during a subsequent recovery, an obsolete offset for
the old version of the IN will mistakenly be recorded. Below are two cases
where BEFORE_CKPT_END is used correctly and one showing how it could be used
incorrectly.
1. Correct use of BEFORE_CKPT_END when the checkpoint does not complete.
050 BIN-A
060 IN-B parent of BIN-A
100 CkptStart
200 BIN-A logged with BEFORE_CKPT_END
300 FileSummaryLN with obsolete offset for BIN-A at 050
Crash and recover
Recovery will process BIN-A at 200 (it will be considered
non-provisional) because there is no following CkptEnd. It is
therefore correct that BIN-A at 050 is obsolete.
2. Correct use of BEFORE_CKPT_END when the checkpoint does complete.
050 BIN-A
060 IN-B parent of BIN-A
100 CkptStart
200 BIN-A logged with BEFORE_CKPT_END
300 FileSummaryLN with obsolete offset for BIN-A at 050
400 IN-B parent of BIN-A, non-provisional
500 CkptEnd
Crash and recover
Recovery will not process BIN-A at 200 (it will be considered
provisional) because there is a following CkptEnd, but it will
process its parent IN-B at 400, and therefore the BIN-A at 200 will be
active in the tree. It is therefore correct that BIN-A at 050 is
obsolete.
3. Incorrect use of BEFORE_CKPT_END when the checkpoint does complete.
050 BIN-A
060 IN-B parent of BIN-A
100 CkptStart
200 BIN-A logged with BEFORE_CKPT_END
300 FileSummaryLN with obsolete offset for BIN-A at 050
400 CkptEnd
Crash and recover
Recovery will not process BIN-A at 200 (it will be considered
provisional) because there is a following CkptEnd, but no parent
IN-B is logged, and therefore the IN-B at 060 and BIN-A at 050 will be
active in the tree. It is therefore incorrect that BIN-A at 050 is
obsolete.
This last case is what caused the LFNF in SR [#19422], when BEFORE_CKPT_END
was mistakenly used for logging evicted BINs via CacheMode.EVICT_BIN.
During the checkpoint, we evict BIN-A and log it with BEFORE_CKPT_END, yet
neither it nor its parent is part of the checkpoint. After the old version
is counted obsolete, we crash and recover. Then the file containing the BIN
(BIN-A at
050 above) is cleaned and deleted. During cleaning, it is not migrated
because an obsolete offset was previously recorded. The LFNF occurs when
trying to access this BIN during a user operation.
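The three cases above can be replayed with a toy model of the log. This is a
sketch under loud assumptions, not JE recovery code: all names are
hypothetical, an LSN is a plain long, the log contains a single BIN of
interest, and any IN logged after that BIN is assumed to be its parent.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not taken from the JE source.
final class BeforeCkptEndSketch {
    enum Type { BIN, IN, CKPT_START, CKPT_END, FILE_SUMMARY_LN }
    enum Provisional { NO, YES, BEFORE_CKPT_END }

    static final class Entry {
        final long lsn; final Type type; final Provisional prov;
        Entry(long lsn, Type type, Provisional prov) {
            this.lsn = lsn; this.type = type; this.prov = prov;
        }
    }

    /** BEFORE_CKPT_END is provisional only if a CkptEnd follows the entry. */
    static boolean isReplayed(List<Entry> log, int i) {
        Entry e = log.get(i);
        if (e.prov == Provisional.NO) return true;
        if (e.prov == Provisional.YES) return false;
        for (int j = i + 1; j < log.size(); j++) {
            if (log.get(j).type == Type.CKPT_END) return false;
        }
        return true;
    }

    /**
     * The new BIN version is active after recovery if it is replayed itself
     * or if a later IN (assumed to be its parent) is replayed.
     */
    static boolean newVersionActive(List<Entry> log, int binIdx) {
        if (isReplayed(log, binIdx)) return true;
        for (int j = binIdx + 1; j < log.size(); j++) {
            if (log.get(j).type == Type.IN && isReplayed(log, j)) return true;
        }
        return false;
    }

    /** Builds the log of case 1, 2 or 3; the BIN at 200 is at index 3. */
    static List<Entry> buildCase(int n) {
        Entry[] head = {
            new Entry(50, Type.BIN, Provisional.NO),
            new Entry(60, Type.IN, Provisional.NO),
            new Entry(100, Type.CKPT_START, Provisional.NO),
            new Entry(200, Type.BIN, Provisional.BEFORE_CKPT_END),
            new Entry(300, Type.FILE_SUMMARY_LN, Provisional.NO)
        };
        if (n == 1) return Arrays.asList(head);
        Entry[] log = Arrays.copyOf(head, n == 2 ? 7 : 6);
        if (n == 2) {
            log[5] = new Entry(400, Type.IN, Provisional.NO);
            log[6] = new Entry(500, Type.CKPT_END, Provisional.NO);
        } else {
            log[5] = new Entry(400, Type.CKPT_END, Provisional.NO);
        }
        return Arrays.asList(log);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println("case " + n + ": obsolete offset correct = "
                + newVersionActive(buildCase(n), 3));
        }
    }
}
```

In this model the obsolete offset for BIN-A at 050 is justified in cases 1 and
2, where the new version at 200 ends up active, and unjustified in case 3.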
CacheMode.EVICT_BIN
-------------------
Unlike in JE 4.0 where EVICT_BIN was first introduced, in JE 4.1 and later
we do not use special rules when an IN is evicted. Since concurrent
eviction and checkpointing are supported in JE 4.1, the above rules apply to
EVICT_BIN as well as all other types of eviction.