HDDS-14721. Make OM bootstrap implementation configurable.#9830
HDDS-14721. Make OM bootstrap implementation configurable.#9830sadanand48 wants to merge 5 commits intoapache:masterfrom
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @sadanand48 for the patch.
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMBootstrapV2.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMBootstrap.java
Show resolved
Hide resolved
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMBootstrap.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMBootstrap.java
Show resolved
Hide resolved
| Files.delete(sourcePath); | ||
| } | ||
| } catch (IOException e) { | ||
| LOG.warn("Failed to delete source file {}: {}", sourcePath, e.getMessage()); |
There was a problem hiding this comment.
are we going to retry after the next start or files would stay there forever?
There was a problem hiding this comment.
We only delete the files in case of v2 , yeah they are not cleaned up later if not for the first time. Shouldn't cause functional issues as they are not sst files but will contribute to the space usage on OM.
There was a problem hiding this comment.
we need a solution for cleaning up files in case of getting exception here.
There was a problem hiding this comment.
we need a solution for cleaning up files in case of getting exception here.
I agree we should cleanup stale files on OM but would recommend addressing the failed cleanup in separate fix and keep this limited to making bootstrap version configurable.
| LOG.info("Creating om.db and moving {} active DB files for v1 format", toMove.size()); | ||
| Files.createDirectories(omDbDir); | ||
| for (Path source : toMove) { | ||
| Path target = omDbDir.resolve(source.getFileName()); |
There was a problem hiding this comment.
how do we handle exception when something cannot be moved? or resolved
There was a problem hiding this comment.
The IOException from the failed move would propagate to eventually return null here thereby leading to a failed installSnapshotAttempt. this would be retried from Ratis again.
public synchronized TermIndex installSnapshotFromLeader(String leaderId) {
if (omRatisSnapshotProvider == null) {
LOG.error("OM Snapshot Provider is not configured as there are no peer " +
"nodes.");
return null;
}
DBCheckpoint omDBCheckpoint;
try {
omDBCheckpoint = omRatisSnapshotProvider.
downloadDBSnapshotFromLeader(leaderId);
} catch (IOException ex) {
LOG.error("Failed to download snapshot from Leader {}.", leaderId, ex);
return null;
}| OZONE_OM_RATIS_SNAPSHOT_MAX_TOTAL_SST_SIZE_DEFAULT = 10737418240L; | ||
|
|
||
| public static final String OZONE_OM_DB_CHECKPOINT_USE_V2_KEY | ||
| = "ozone.om.db.checkpoint.use.v2"; |
There was a problem hiding this comment.
v2 is not descriptive name, it's not clear what is the difference between v1 and v2.
There was a problem hiding this comment.
v1 uses OMDbCheckpointServlet , v2 uses OMDBCheckpointServletInodeBasedXfer. I've changed it to ozone.om.db.checkpoint.use.inode.based.transfer now.
| Files.delete(sourcePath); | ||
| } | ||
| } catch (IOException e) { | ||
| LOG.warn("Failed to delete source file {}: {}", sourcePath, e.getMessage()); |
There was a problem hiding this comment.
we need a solution for cleaning up files in case of getting exception here.
| */ | ||
| private void moveActiveDbFilesToOmDbIfNeeded() throws IOException { | ||
| Path omDbDir = checkpointLocation.resolve(OzoneConsts.OM_DB_NAME); | ||
| if (Files.exists(omDbDir) && Files.isDirectory(omDbDir)) { |
There was a problem hiding this comment.
This only checks if the directory exists, What happens if the checkpointLocation/om.db already exists from a previous failed bootstrap attempt? Then we would skip this and we would have an inconsistent and incompleteom.db
The candidate directory is only cleaned up in two cases:
- Leader changes (
checkLeaderConsistencycallsinit()) downloadDBSnapshotFromLeaderreturns successfully (the finally block ininstallSnapshotFromLeadercallscleanupCheckpoint())
If downloadDBSnapshotFromLeader throws, the catch block returns null with no cleanup:
DBCheckpoint omDBCheckpoint;
try {
omDBCheckpoint = omRatisSnapshotProvider.
downloadDBSnapshotFromLeader(leaderId);
} catch (IOException ex) {
LOG.error("Failed to download snapshot from Leader {}.", leaderId, ex); --> No cleanup, Can have partial OM DB.
return null;
}
In case moveActiveDbFilesToOmDbIfNeeded fails in between in the previous checkpoint, The new checkpoint will skip this because we have checkpointLocation/om.db directory from the previous failed checkpoint.
What changes were proposed in this pull request?
With the recent bootstrap V2 impl done as part of HDDS-12984, make the impl controllable via a config.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14721
How was this patch tested?
unit and integration tests