public class VLSNIndex
extends java.lang.Object
A VLSN (Virtual LSN) is used to identify every log entry shared between
members of the replication group. Since entries in a JE log are identified
by LSNs, we must have a way to map a VLSN to its LSN in order to fetch a
replicated log record from the local log by VLSN. The VLSNIndex implements those
mappings. The VLSNIndex has these responsibilities:
Generating new VLSNs.
Only masters need to generate VLSNs, but any node has the potential to
become a master. The VLSN sequence must ascend over time and across
recoveries, so the VLSN id must be preserved much like the database, node
and txn ids.
Maintaining the VLSN range.
Although each node needs to receive and store each log entry from the
replication stream, over time the part of the stream that is stored can be
reduced, either by log cleaning, or by syncups which can truncate the
replication stream. A node always holds a contiguous portion of the
replication stream. The VLSN range identifies that portion by recording its
start and end VLSNs, as well as key landmarks such as the lastSync-able
log entry and the last commit log entry. VLSN range information is used by
elections and syncup.
Gatekeeper for waiting for the most recently logged entries.
Feeders block on the VLSNIndex when they are trying to fetch the most
recently logged entries. These recent log entries are held in a two-level
cache within the VLSNIndex. A call to VLSNIndex.waitForLsn() goes through
this sequence (a simplified sketch follows this list of responsibilities):
1) check the log item stored in the vlsn wait latch, if the call did wait
2) check the log item cache
If both fail, the FeederReader fetches the required log entry from the log
buffers or from disk.
Providing the LSN mapping for a log record identified by its VLSN.
The Feeders and the syncup protocol both need to retrieve log records
by VLSN. To do that, we need an LSN mapping.
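Returning to the waitForLsn() sequence above, the lookup order can be
pictured with a minimal sketch. The types below (LogItem, the wait-latch
item, the log item cache, and the FeederReader-style fetch) are hypothetical
stand-ins, not the real internal API; only the order of the checks follows
the description.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/* Sketch of the waitForLsn() lookup order; illustrative only. */
class WaitForLsnSketch {

    /* Hypothetical stand-in for a recently logged item. */
    static final class LogItem {
        final long vlsn;
        final byte[] entry;
        LogItem(long vlsn, byte[] entry) { this.vlsn = vlsn; this.entry = entry; }
    }

    /* Hypothetical stand-in for the log item cache, keyed by VLSN. */
    private final Map<Long, LogItem> logItemCache = new ConcurrentHashMap<>();

    LogItem fetch(long vlsn, LogItem itemFromWaitLatch) {
        /* 1) If the call had to wait, the vlsn wait latch may already
           hold the freshly logged item. */
        if (itemFromWaitLatch != null && itemFromWaitLatch.vlsn == vlsn) {
            return itemFromWaitLatch;
        }
        /* 2) Otherwise check the log item cache. */
        LogItem cached = logItemCache.get(vlsn);
        if (cached != null) {
            return cached;
        }
        /* 3) Both checks failed: fall back to the FeederReader, which reads
           from the log buffers or from disk. */
        return fetchFromLogBuffersOrDisk(vlsn);
    }

    private LogItem fetchFromLogBuffersOrDisk(long vlsn) {
        return new LogItem(vlsn, new byte[0]);   // placeholder for the real read
    }
}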
Mappings are added to the VLSNIndex when replicated log entries are written
into the local log. Although all mappings are registered, the VLSNIndex does
not keep every one; to save disk and in-memory storage, only a sparse set is
kept. When searching for a log entry by VLSN, the caller uses the closest
available mapping and then scans the log looking for that entry.
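As a rough illustration of that search pattern (a sketch only, with a plain
sorted map standing in for the real bucket-based index and the file-reader
scan left abstract):

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/* Sketch of a lookup against a sparse VLSN->LSN index: find the closest
   mapping at or below the target VLSN, then scan forward in the log. */
class SparseLookupSketch {

    /* Stand-in for the sparse index; the real index stores VLSNBuckets. */
    private final NavigableMap<Long, Long> vlsnToLsn = new TreeMap<>();

    long findEntry(long targetVlsn) {
        Map.Entry<Long, Long> closest = vlsnToLsn.floorEntry(targetVlsn);
        if (closest == null) {
            throw new IllegalStateException(
                "VLSN " + targetVlsn + " is below the index range");
        }
        /* Start at the closest known LSN and scan forward until the log
           entry tagged with targetVlsn is found. */
        return scanForward(closest.getValue(), targetVlsn);
    }

    private long scanForward(long startLsn, long targetVlsn) {
        return startLsn;   // placeholder for the file-reader scan
    }
}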
The VLSNIndex relies on the assumption that VLSN tagged log entries are
ordered and contiguous in the log. That is, the LSN for VLSN 1 is < the LSN
for VLSN 2 < LSN for VLSN 3, and there is never a gap in the VLSNs. However,
at node syncup, the replication stream may need to be truncated when rolling
back a non-committed log entry. We can't literally truncate the log files
because the JE logs contain intermingled transactional and non-transactional
information. Instead, the truncation is done both logically, by amending the
VLSNIndex, and physically, by overmarking those entries in the JE logs.
Because of that, a physical dump of the log may show some VLSN tagged
entries as duplicated and/or out of order, because they are abandoned log
entries that are no longer logically part of the replication stream.
For example, the log can look like this:
LSN 100, VLSN 1
LSN 200, VLSN 2 <- overmarked
LSN 300, VLSN 3 <- overmarked
--- syncup, rollback to VLSN 1, restart at VLSN 2
LSN 400, VLSN 2
LSN 500, VLSN 3
VLSN->LSN mappings are created under the log write latch, which ensures that
all VLSN tagged log entries are ordered in the logical replication stream in
the log. However, the mapping is added to the VLSNIndex outside the log
write latch, so the VLSNIndex database may have a momentary gap. For
example,
t0- thread 1 logs entry at VLSN=1, LSN=100, within log write latch
t1- thread 2 logs entry at VLSN=2, LSN=150, within log write latch
t2- thread 3 logs entry at VLSN=3, LSN=200, within log write latch
t3- thread 1 calls VLSNIndex.put(VLSN=1/LSN=100)
t4- thread 3 calls VLSNIndex.put(VLSN=3/LSN=200)
t5- thread 2 calls VLSNIndex.put(VLSN=2/LSN=150)
At t4, the VLSNIndex contains 1/100 and 3/200, but not 2/150. However, we know
that the VLSNIndex always represents a contiguous range of VLSNs, so the
fact that 2/150 is not yet present is handled; it is just like the case where
the VLSNIndex optimized away the mapping in order to keep the index sparse.
We do guarantee that the start and end VLSNs in the range have mappings, in
order to always be able to provide an LTE and GTE mapping for all valid
VLSNs. Because of that, if a VLSN arrives out of order, it does not update the
range.
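A minimal sketch of this behavior, assuming a simple synchronized put and a
range end that only moves forward (illustrative names; the real put() works
on buckets under the locking described later):

import java.util.NavigableMap;
import java.util.TreeMap;

/* Sketch: tolerate out-of-order put() calls while keeping the range end
   monotonic. A late-arriving VLSN is recorded but never moves the range. */
class OutOfOrderPutSketch {

    private volatile long rangeFirst = -1;   // -1 means the range is empty
    private volatile long rangeLast = -1;
    private final NavigableMap<Long, Long> sparseMappings = new TreeMap<>();

    synchronized void put(long vlsn, long lsn) {
        /* Keeping the individual mapping is optional; the index is sparse. */
        sparseMappings.put(vlsn, lsn);

        /* Only a VLSN that extends the range advances its end. */
        if (rangeFirst == -1) {
            rangeFirst = vlsn;
            rangeLast = vlsn;
        } else if (vlsn > rangeLast) {
            rangeLast = vlsn;
        }
    }

    long getRangeLast() {
        return rangeLast;   // read without the mutex, as a volatile field
    }
}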
Persistent storage:
The VLSN->LSN mappings in the range are grouped into instances of
com.sleepycat.je.util.VLSNBucket. Each bucket knows the first and last VLSN
within its mini-range. We observe these invariants:
- buckets are ordered by VLSN in the database and the bucket cache,
- only the last bucket is the target of updates at any time,
- a single bucket corresponds to a single file, but a single file may
have multiple buckets covering it.
While it would be nice to also guarantee that there are no gaps between
buckets, ie:
bucket(N-1).last == bucket(N).first - 1
bucket(N).last == bucket(N+1).first - 1
it is not possible to do so because the put() call is not serialized; we
don't want to add overhead to the log write latch. In order to permit out
of order put() calls, and to require that only the last bucket is updated,
we must permit gaps between buckets.
Buckets are both cached in memory by the VLSNIndex and stored
persistently in an internal, non-replicated database. The database holds
key/value pairs where
key = bucket.first
data = bucket
Since the first valid VLSN is 1, key = -1 is reserved for storage of the
VLSNRange.
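The on-disk layout can be pictured with a sorted map standing in for the
internal database. The key scheme (key = bucket.first, key = -1 reserved for
the range record) follows the description above; the stub types are purely
illustrative:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/* Sketch of the VLSN database layout: key = bucket.first, data = bucket,
   with key -1 reserved for the VLSNRange record. */
class VlsnDatabaseLayoutSketch {

    static final long RANGE_KEY = -1L;

    static final class BucketStub {                 // stand-in for VLSNBucket
        final long first;
        final long last;
        BucketStub(long first, long last) { this.first = first; this.last = last; }
    }

    static final class RangeStub {                  // stand-in for VLSNRange
        final long first;
        final long last;
        RangeStub(long first, long last) { this.first = first; this.last = last; }
    }

    private final NavigableMap<Long, Object> db = new TreeMap<>();

    void writeRange(RangeStub range) {
        db.put(RANGE_KEY, range);
    }

    void writeBucket(BucketStub bucket) {
        db.put(bucket.first, bucket);
    }

    /* Find the bucket whose mini-range may cover the given VLSN: the entry
       with the greatest key <= vlsn, skipping the reserved range record. */
    BucketStub findBucket(long vlsn) {
        Map.Entry<Long, Object> entry = db.floorEntry(vlsn);
        if (entry == null || entry.getKey() == RANGE_KEY) {
            return null;
        }
        return (BucketStub) entry.getValue();
    }
}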
Buckets are filled up as new VLSNs arrive (either because they've been
generated by write operations on the master, or because they're incoming
operations on the replica). They're flushed to disk periodically rather than
with every new VLSN, because the update rate would have too much of a
performance impact. Because of this caching, we must be careful to write
in-memory buckets to disk at well-known points to
support recoverability. The flushing must be instigated by a third party
activity, such as checkpointing, rather than by the action of adding a new
mapping. That's because mappings are registered by the logging system, and
although we are not holding the log write latch at that point, it seems
inadvisable to recursively generate another logging call on behalf of the
flush. Currently the VLSNIndex is flushed to disk at every checkpoint. It
can also optionally happen more often, and (TODO) we may want to do so
because we've seen cases where checkpoints take a very long time. Perhaps we
should flush when we flip to a new log file?
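A very small sketch of the checkpoint-driven flush; the synchronizer and
method names here are illustrative, not the real signatures:

/* Sketch: a third-party activity (the checkpointer, or possibly a log file
   flip) drives the flush; the mapping path never does. */
class CheckpointFlushSketch {

    private final Object flushSynchronizer = new Object();

    void flushToDatabase() {
        synchronized (flushSynchronizer) {
            /* Write the dirty in-memory buckets and the current range
               record to the internal, non-replicated database. */
            writeDirtyBucketsAndRange();
        }
    }

    private void writeDirtyBucketsAndRange() {
        /* Placeholder for the VLSNBucket.writeDatabase()-style work. */
    }
}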
Once written to disk, the buckets are generally not updated. Updates can
happen when the range is truncated, such as for syncup rollback, but the
system is quiescent at that time. Log cleaning can delete buckets, but will
not modify them. The VLSNRange does naturally change often, and that data
record does get updated.
Recovery:
The VLSN database is restored at recovery time just as all other databases
are. However, there may be a portion of the VLSN range that was not flushed
to disk. At recovery, we piggyback onto the log scan that recovery already
performs and re-track any mappings found within the recovery range. Those
mappings are merged
into those stored on disk, so that the VLSNIndex correctly reflects the
entire replication stream at startup. For example, suppose a log has:
LSN
100 firstActiveLSN
200 Checkpoint start
300 VLSN 78
400 VLSNIndex flushed here
500 Checkpoint end
600 VLSN 79
The VLSNIndex is initially populated with the version of the index found
at LSN 400. That doesn't include VLSN 79. A tracking pass is done from
checkpoint start -> end of log, which sweeps up VLSN 78 and VLSN 79 into
a temporary tracker. That tracker is merged into the VLSNIndex, bringing
its mappings up to VLSN 79.
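The startup merge can be sketched with sorted maps standing in for the
on-disk index and the recovery-time tracker (illustrative only):

import java.util.NavigableMap;
import java.util.TreeMap;

/* Sketch of the recovery-time merge: restore the index as flushed at the
   last checkpoint, re-track mappings seen between checkpoint start and the
   end of the log, and merge them in. */
class RecoveryMergeSketch {

    /* Mappings restored from the on-disk VLSN database; in the example
       above this includes VLSN 78 but not VLSN 79. */
    private final NavigableMap<Long, Long> restoredIndex = new TreeMap<>();

    /* Mappings swept up by the recovery scan from checkpoint start to the
       end of the log; VLSN 78 and VLSN 79 in the example above. */
    private final NavigableMap<Long, Long> recoveryTracker = new TreeMap<>();

    /* After the merge, the index reflects the whole replication stream,
       including VLSN 79. */
    void mergeAtStartup() {
        restoredIndex.putAll(recoveryTracker);
    }
}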
Note that the VLSNIndex flushed at a checkpoint must encompass all VLSN
mappings prior to that checkpoint's start. This follows the general
philosophy that a checkpoint flushes all metadata, and recovery reads from
checkpoint start onwards to add on any needed extra data.
Retrieving mappings:
Callers who need to retrieve mappings obtain a VLSNScanner, which acts as a
cursor over the VLSNIndex. A VLSNScanner finds and saves the applicable
VLSNBucket, and queries the bucket directly as long as it can provide
mappings. This reduces contention between multiple readers
(feeders) and writers (application threads, or the replay thread).
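A scanner-style cursor might be sketched like this; the interface and method
names are illustrative rather than the real VLSNScanner signatures:

/* Sketch of a scanner: cache the current bucket and only go back to the
   index (and its tracker) when the cached bucket cannot answer. */
class ScannerSketch {

    interface Bucket {
        boolean owns(long vlsn);         // does the bucket's mini-range cover vlsn?
        long getClosestLsn(long vlsn);   // closest known LSN at or below vlsn's entry
    }

    interface Index {
        Bucket getBucket(long vlsn);     // synchronizes on the tracker internally
    }

    private final Index index;
    private Bucket currentBucket;

    ScannerSketch(Index index) {
        this.index = index;
    }

    /* Feeders call this with ascending VLSNs; most calls are answered by
       the cached bucket, which keeps contention on the tracker low. */
    long getLsn(long vlsn) {
        if (currentBucket == null || !currentBucket.owns(vlsn)) {
            currentBucket = index.getBucket(vlsn);
        }
        return currentBucket.getClosestLsn(vlsn);
    }
}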
Synchronization hierarchy:
To write a new mapping, you must have the mutex on the VLSNIndex, and then
on the tracker, which lets you obtain the correct bucket, and then you must
have a mutex on the bucket. To read a mapping, you must have the tracker
mutex to obtain the right bucket. If you already have the right bucket in
hand, you only need the bucket mutex.
In truth, buckets which are not the "currentBucket" are not modified again,
so a future optimization would allow for reading a mapping on a finished
bucket without synchronization.
The VLSNRange is updated as an atomic assignment to a volatile field after
taking the mutex on the current bucket. It is read without a mutex, by
looking at it as a volatile field.
The hierarchy is
VLSNIndex -> VLSNTracker -> VLSNBucket
VLSNIndex -> VLSNTracker -> VLSNRange
VLSNIndex -> VLSNIndex.mappingSynchronizer
VLSNIndex.flushSynchronizer -> VLSNTracker
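The write path of this hierarchy can be pictured as nested synchronized
blocks (a sketch only; the fields stand in for the members named above):

/* Sketch of the locking order for adding a mapping:
   VLSNIndex -> VLSNTracker -> VLSNBucket. */
class LockOrderSketch {

    private final Object trackerMutex = new Object();  // stands in for the VLSNTracker
    private final Object bucketMutex = new Object();    // stands in for the current VLSNBucket

    /* Index mutex first, then the tracker mutex to find (or create) the
       right bucket, then the bucket mutex to modify it. */
    synchronized void put(long vlsn, long lsn) {
        synchronized (trackerMutex) {
            Object bucket = findOrCreateBucket(vlsn);
            synchronized (bucketMutex) {
                addMapping(bucket, vlsn, lsn);
            }
        }
    }

    private Object findOrCreateBucket(long vlsn) {
        return bucketMutex;   // placeholder
    }

    private void addMapping(Object bucket, long vlsn, long lsn) {
        // placeholder
    }
}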
Removing mappings vs reading mappings - sync on the range.
We also need to consider the fact that callers of the VLSNIndex may be
holding other mutexes or IN latches, and that the VLSNIndex methods may do
database operations to read or write the internal VLSN database. That can
result in a nested database operation, and we need to be careful to avoid
deadlocks. To be safe, we disable critical eviction [#18475] in
VLSNBucket.writeDatabase().
Writers
-------
Allocating a new VLSN: bump()
- sync on log write latch
Note that since there is no synchronization on the VLSNIndex itself,
[allocating a new VLSN, logging its entry] and [flushing the vlsn index
to disk] are not atomic with respect to each other. See awaitConsistency().
Adding a mapping: put()
- sync on VLSNIndex
- sync on VLSNTracker to access the right bucket, and possibly
create a new bucket. Atomically modify the VLSNRange.
Flushing mappings to disk: writeToDatabase()
- sync on VLSNIndex.flushSynchronizer -> VLSNTracker
Replica side syncup truncates the VLSNIndex from the end:
- no synchronization needed, the system is quiescent, and we can assume
that VLSNs are neither read nor written by other threads.
Log cleaning truncates the VLSNIndex from the beginning:
We assume that the log cleaner is prohibited from deleting files that are
being used for current feeding. We can also assume that the end of the
log is not being deleted, and that we don't conflict with put(). We do
have to worry about conflicting with backwards scans when executing
syncup as a feeder, and with flushing mappings to disk. Shall we
disable log file deletion at this point?
Steps to take:
First change the VLSNRange:
- sync on VLSNIndex
- atomically modify the VLSNRange to ensure that no readers or
writers touch the buckets that will be deleted.
- sync on VLSNTracker to delete any dead buckets. Do that before
updating the on-disk database, so that we don't lose any
buckets to writeToDatabase().
- without synchronization, scan the database and non-transactionally
delete any on-disk buckets that are <= the log cleaned file.
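Those steps can be summarized in a sketch; the names are illustrative and a
sorted map stands in for the tracker's bucket set:

import java.util.NavigableMap;
import java.util.TreeMap;

/* Sketch of truncating the index from the head after log cleaning: change
   the range first, then drop dead in-memory buckets under the tracker
   mutex, then delete on-disk buckets without further synchronization. */
class TruncateFromHeadSketch {

    private volatile long rangeFirst;                           // start of the VLSN range
    private final Object trackerMutex = new Object();           // stands in for the VLSNTracker
    private final NavigableMap<Long, Object> inMemoryBuckets = new TreeMap<>();

    synchronized void truncateFromHead(long lastVlsnInDeletedFiles) {
        /* 1) Atomically move the start of the range forward so readers and
           writers no longer touch the doomed buckets. */
        rangeFirst = lastVlsnInDeletedFiles + 1;

        /* 2) Under the tracker mutex, drop dead in-memory buckets before
           the on-disk database is touched, so writeToDatabase() cannot
           resurrect them. */
        synchronized (trackerMutex) {
            inMemoryBuckets.headMap(rangeFirst, false).clear();
        }

        /* 3) Without synchronization, scan the internal database and
           non-transactionally delete the on-disk buckets that fall below
           the new start of the range. */
        deleteOnDiskBucketsBelow(rangeFirst);
    }

    private void deleteOnDiskBucketsBelow(long firstKeptVlsn) {
        // placeholder for the non-transactional cursor scan
    }
}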
Readers
-------
Active forward feeder checks if a mapping exists, and waits if necessary
(a sketch of this check-then-wait pattern appears at the end of this section):
- read the current VLSNRange w/out a mutex. If not satisfactory
- sync on VLSNIndex
- sync on VLSNIndex.mappingSynchronizer
Active forward feeder reads a mapping:
first - getBucket()
- sync on VLSNTracker to access the right bucket
if bucket is in hand
- sync on target bucket to read bucket
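A sketch of that check-then-wait pattern, with a volatile range end and a
plain object monitor standing in for VLSNIndex.mappingSynchronizer:

/* Sketch: an unsynchronized read of the volatile range, followed by a
   synchronized re-check and wait if the VLSN is not yet covered. */
class FeederWaitSketch {

    private volatile long rangeLast = -1;                 // end of the VLSN range
    private final Object mappingSynchronizer = new Object();

    void waitForVlsn(long vlsn) throws InterruptedException {
        /* Fast path: read the current range without a mutex. */
        if (vlsn <= rangeLast) {
            return;
        }
        /* Slow path: re-check and wait under the synchronizer. */
        synchronized (mappingSynchronizer) {
            while (vlsn > rangeLast) {
                mappingSynchronizer.wait();
            }
        }
    }

    /* Called by the write path after a new mapping extends the range. */
    void advanceRange(long newLast) {
        synchronized (mappingSynchronizer) {
            rangeLast = newLast;
            mappingSynchronizer.notifyAll();
        }
    }
}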