ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files. #2601

Open
QianyongY wants to merge 2 commits into apache:main from QianyongY:features/ORC-2149

QianyongY (Contributor) commented Apr 15, 2026

What changes were proposed in this pull request?

Extends the Java merge tool so that, for inputs sharing the same schema, you can still merge to one ORC file by default, or use -m / --maxSize to write multiple ORC files under an output directory as part-xxxxx.orc, batching by input file size.

Why are the changes needed?

Users often need to merge many compatible ORC files without producing a single huge output file. This adds an optional mode that caps output size at whole-file boundaries while keeping the existing single-file behavior when --maxSize is not set.
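As a rough illustration of capping output size at whole-file boundaries, the sketch below groups inputs greedily by their on-disk size. It is not the code from this PR: `SizeBatcher`, `batch`, and `partName` are hypothetical names, and the five-digit, zero-based part numbering is only an assumption based on the `part-xxxxx.orc` pattern mentioned above. An input larger than the cap still gets a part of its own, since files are never split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not the PR's MergeFiles logic.
final class SizeBatcher {
  private SizeBatcher() {}

  /**
   * Groups input files (given by size in bytes) into batches, in input order.
   * A batch is closed as soon as adding the next file would exceed maxSize.
   * Returns the indexes of the inputs that belong to each batch.
   */
  static List<List<Integer>> batch(long[] fileSizes, long maxSize) {
    List<List<Integer>> batches = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long currentBytes = 0;
    for (int i = 0; i < fileSizes.length; i++) {
      // Never split a file: close the batch before the file that would overflow it.
      if (!current.isEmpty() && currentBytes + fileSizes[i] > maxSize) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(i);
      currentBytes += fileSizes[i];
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  /** Assumed naming scheme: zero-based index padded to five digits. */
  static String partName(int partIndex) {
    return String.format(Locale.ROOT, "part-%05d.orc", partIndex);
  }

  public static void main(String[] args) {
    long[] sizes = {40, 50, 30, 120, 10};
    List<List<Integer>> batches = batch(sizes, 100);
    for (int p = 0; p < batches.size(); p++) {
      System.out.println(partName(p) + " <- inputs " + batches.get(p));
    }
  }
}
```

With sizes 40, 50, 30, 120, 10 and a cap of 100, this sketch yields four batches: inputs 0 and 1 together, then inputs 2, 3, and 4 each on their own.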

How was this patch tested?

Added a unit test in TestMergeFiles that covers the multi-part merge.

Copilot AI left a comment

Pull request overview

Extends the ORC Java merge tool to optionally split merged output into multiple ORC part files under an output directory when --maxSize is provided, while keeping the existing single-output behavior by default.

Changes:

  • Add --maxSize option to merge to batch inputs (by on-disk size) into multiple output part files.
  • Add a new unit test covering multi-part merge behavior.
  • Update CLI help text and documentation to describe the new multi-output mode.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| site/_docs/java-tools.md | Updates merge tool docs to describe single vs multi-output mode and examples. |
| java/tools/src/java/org/apache/orc/tools/MergeFiles.java | Implements --maxSize parsing and multi-part merge batching logic. |
| java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java | Adds a unit test to validate multi-part output behavior. |
| java/tools/src/java/org/apache/orc/tools/Driver.java | Updates driver help text for the merge subcommand description. |

@dongjoon-hyun (Member)

Thank you for making a PR, @QianyongY.

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


// Measure the size of the first source file to compute a threshold that forces a split.
long singleFileSize = fs.getFileStatus(new Path(sourceNames[0])).getLen();
// Threshold: slightly larger than one file so at most one file fits per part.
long maxSize = singleFileSize + 1;
Contributor

Would it be possible to set the threshold to the combined size of the first two files plus 1, so that the test checks that inputs really get merged into one part?

sourceNames[0] len + sourceNames[1] len + 1
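To make the difference between the two thresholds concrete, here is a small standalone sketch. It assumes a greedy, whole-file batching rule (a part is closed when the next input would push it past the threshold), which may differ from what MergeFiles actually does; `ThresholdDemo` and `filesPerPart` are hypothetical names, and the sizes are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two test thresholds; not code from the PR.
final class ThresholdDemo {
  private ThresholdDemo() {}

  /** Returns how many input files land in each output part, in order. */
  static List<Integer> filesPerPart(long[] sizes, long maxSize) {
    List<Integer> counts = new ArrayList<>();
    int filesInPart = 0;
    long bytesInPart = 0;
    for (long size : sizes) {
      // Assumed rule: start a new part when this file would exceed the threshold.
      if (filesInPart > 0 && bytesInPart + size > maxSize) {
        counts.add(filesInPart);
        filesInPart = 0;
        bytesInPart = 0;
      }
      filesInPart++;
      bytesInPart += size;
    }
    if (filesInPart > 0) {
      counts.add(filesInPart);
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] sizes = {100, 100, 100}; // three equally sized inputs (made-up sizes)

    // Threshold from the current test: one file per part, so nothing is merged.
    System.out.println(filesPerPart(sizes, sizes[0] + 1));

    // Suggested threshold: the first two files share a part, so merging is exercised.
    System.out.println(filesPerPart(sizes, sizes[0] + sizes[1] + 1));
  }
}
```

Under these assumptions, the first threshold gives one file per part (1, 1, 1), while the suggested one gives a two-file part followed by a one-file part (2, 1). Only the second case would exercise combining several inputs into a single output.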

4 participants