ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files. #2601
QianyongY wants to merge 2 commits into apache:main
Pull request overview
Extends the ORC Java merge tool to optionally split merged output into multiple ORC part files under an output directory when --maxSize is provided, while keeping the existing single-output behavior by default.
Changes:
- Add a --maxSize option to merge to batch inputs (by on-disk size) into multiple output part files.
- Add a new unit test covering multi-part merge behavior.
- Update CLI help text and documentation to describe the new multi-output mode.
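The batching described above can be illustrated with a minimal sketch. It assumes a simple greedy rule: inputs are taken in order, and a new part starts whenever adding the next file would push the running total past the cap. The class name `SizeBatchingSketch`, the method `batchBySize`, and the sample sizes are hypothetical and are not taken from `MergeFiles.java`; the actual patch may batch differently.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeBatchingSketch {
  /**
   * Greedily groups file sizes (in bytes) into batches whose total stays within maxSize.
   * Assumption: a file larger than maxSize still gets a batch of its own,
   * because batching happens at whole-file boundaries and files are never split.
   */
  static List<List<Long>> batchBySize(List<Long> fileSizes, long maxSize) {
    List<List<Long>> batches = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentTotal = 0;
    for (long size : fileSizes) {
      // Close the current batch if adding this file would exceed the cap.
      if (!current.isEmpty() && currentTotal + size > maxSize) {
        batches.add(current);
        current = new ArrayList<>();
        currentTotal = 0;
      }
      current.add(size);
      currentTotal += size;
    }
    // Flush the last, possibly partial, batch.
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    // Hypothetical on-disk sizes in bytes.
    List<Long> sizes = List.of(400L, 300L, 500L, 1200L, 100L);
    System.out.println(batchBySize(sizes, 1000L));
    // prints [[400, 300], [500], [1200], [100]]
  }
}
```

Each inner list would correspond to one output part file. Under this rule the oversized 1200-byte input ends up alone in a part that exceeds the cap, which is one plausible reading of capping at whole-file boundaries.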
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| site/_docs/java-tools.md | Updates merge tool docs to describe single vs multi-output mode and examples. |
| java/tools/src/java/org/apache/orc/tools/MergeFiles.java | Implements --maxSize parsing and multi-part merge batching logic. |
| java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java | Adds a unit test to validate multi-part output behavior. |
| java/tools/src/java/org/apache/orc/tools/Driver.java | Updates driver help text for the merge subcommand description. |
Thank you for making a PR, @QianyongY.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
// Measure the size of the first source file to compute a threshold that forces a split.
long singleFileSize = fs.getFileStatus(new Path(sourceNames[0])).getLen();
// Threshold: slightly larger than one file so at most one file fits per part.
long maxSize = singleFileSize + 1;
Could the threshold instead be the combined size of the first two files plus one, so the test verifies that files are actually merged into a part? For example:
sourceNames[0] length + sourceNames[1] length + 1
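To make the difference between the two thresholds concrete, here is a small worked sketch. It assumes three source files of equal size (1000 bytes, a made-up figure) and an in-order greedy packing rule; `ThresholdSketch` and `countParts` are hypothetical names, not part of the patch or the test.

```java
public class ThresholdSketch {
  /**
   * Counts how many output parts result when files are packed in order under a size cap.
   * Assumption: a new part starts whenever the next file would push the total past maxSize.
   */
  static int countParts(long[] fileSizes, long maxSize) {
    int closedParts = 0;
    long currentTotal = 0;
    boolean partOpen = false;
    for (long size : fileSizes) {
      if (partOpen && currentTotal + size > maxSize) {
        closedParts++;      // finish the current part
        currentTotal = 0;
      }
      currentTotal += size; // the file always goes somewhere, even if oversized
      partOpen = true;
    }
    return partOpen ? closedParts + 1 : closedParts;
  }

  public static void main(String[] args) {
    long[] sizes = {1000L, 1000L, 1000L}; // hypothetical equal-size inputs

    // Threshold of one file + 1: every part holds exactly one input, so nothing is merged.
    System.out.println(countParts(sizes, sizes[0] + 1));            // prints 3

    // Threshold of two files + 1: the first part holds two inputs, so merging is exercised.
    System.out.println(countParts(sizes, sizes[0] + sizes[1] + 1)); // prints 2
  }
}
```

Under these assumptions, the single-file threshold only checks that splitting happens, while the two-file threshold also checks that at least one part combines more than one input.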
What changes were proposed in this pull request?
Extends the Java merge tool so that, for inputs sharing the same schema, you can still merge to one ORC file by default, or use -m / --maxSize to write multiple ORC files under an output directory as part-xxxxx.orc, batching by input file size.
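As a rough illustration of the part-xxxxx.orc naming, the sketch below generates sequential part names. The five-digit zero padding and the starting index of 0 are assumptions read off the placeholder pattern, and `PartNameSketch` and `partName` are hypothetical; the patch may number its parts differently.

```java
public class PartNameSketch {
  /** Builds a part file name; padding width and zero-based index are assumptions. */
  static String partName(int index) {
    return String.format("part-%05d.orc", index);
  }

  public static void main(String[] args) {
    for (int i = 0; i < 3; i++) {
      System.out.println(partName(i));
    }
    // prints:
    // part-00000.orc
    // part-00001.orc
    // part-00002.orc
  }
}
```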
Why are the changes needed?
Users often need to merge many compatible ORC files without producing a single huge output file. This adds an optional mode that caps output size at whole-file boundaries while keeping the existing single-file behavior when --maxSize is not set.
How was this patch tested?
Added a unit test.