Skip to content

Add more narrowly-scoped TimeBoundLoggers around syscalls#7836

Merged
achamayou merged 2 commits intomicrosoft:mainfrom
eddyashton:syscall_scoped_timeboundloggers
Apr 22, 2026
Merged

Add more narrowly-scoped TimeBoundLoggers around syscalls#7836
achamayou merged 2 commits intomicrosoft:mainfrom
eddyashton:syscall_scoped_timeboundloggers

Conversation

@eddyashton
Copy link
Copy Markdown
Member

No description provided.

Copilot AI review requested due to automatic review settings April 22, 2026 13:43
@eddyashton eddyashton requested a review from a team as a code owner April 22, 2026 13:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds more narrowly-scoped asynchost::TimeBoundLogger instrumentation around IO syscalls/filesystem operations so slow host-side disk operations can be identified with more precise log messages.

Changes:

  • Wraps snapshot and ledger file operations (open/read/write/fsync/close/rename/remove) in TimeBoundLogger scopes with more specific messages.
  • Adds similar slow-IO logging around LFS index file reads/writes and file cleanup operations.
  • Makes TimeBoundLogger::default_max_time a header-defined static inline with a 10ms default, removing per-TU definitions.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/snapshots/snapshot_manager.h Adds TimeBoundLogger scopes around snapshot directory creation, snapshot file open/write/close/fsync/rename.
src/snapshots/filenames.h Adds TimeBoundLogger around snapshot ignore-file rename.
src/host/time_bound_logger.h Moves default_max_time to static inline with default initial value.
src/host/test/ledger.cpp Removes out-of-line definition of default_max_time.
src/host/run.cpp Removes earlier out-of-line initialization (runtime config assignment remains elsewhere).
src/host/lfs_file_handler.h Adds TimeBoundLogger around LFS index dir operations and file read/write streams.
src/host/ledger.h Adds TimeBoundLogger around many stdio/filesystem operations while reading/writing/truncating/committing ledgers.
src/host/files_cleanup_timer.h Adds TimeBoundLogger around file open/read-loop hashing and remove() operations.
Comments suppressed due to low confidence (1)

src/host/lfs_file_handler.h:89

  • The get handler assumes ifstream open/seek/read always succeed once is_regular_file() is true. In practice open/seek/tellg/read can still fail (permissions, concurrent deletion, IO errors), and tellg() may return -1 leading to a huge allocation for LFSEncryptedContents. Add checks for f state and for tellg()/read success before allocating/reading, and fall back to not_found or an error response/log.
            TimeBoundLogger log_if_slow(
              fmt::format("Reading LFS file - ifstream({})", target_path));
            std::ifstream f(target_path, std::ios::binary);
            f.seekg(0, f.end);
            const auto file_size = f.tellg();
            LOG_TRACE_FMT(
              "Reading {} byte file from {}",
              static_cast<size_t>(file_size),
              target_path);
            f.seekg(0, f.beg);

            ccf::indexing::LFSEncryptedContents blob(file_size);
            f.read(reinterpret_cast<char*>(blob.data()), blob.size());
            f.close();

Comment thread src/host/lfs_file_handler.h
Comment thread src/snapshots/snapshot_manager.h
Comment thread src/snapshots/snapshot_manager.h
Comment thread src/snapshots/snapshot_manager.h
Comment thread src/host/ledger.h
@eddyashton
Copy link
Copy Markdown
Member Author

I think all of copilot's comments are pre-existing. Potentially worth exploring, but out-of-scope for this PR.

@achamayou
Copy link
Copy Markdown
Member

I think all of copilot's comments are pre-existing. Potentially worth exploring, but out-of-scope for this PR.

Let's fire off copilot fixes for the ones we think are worthwhile. Worth doing a quick bench a/b to confirm this comes at a negligible cost?

@achamayou
Copy link
Copy Markdown
Member

========================================================================================================================
                                            BENCHMARK COMPARISON: main vs PR                                            
========================================================================================================================

Metric                                                   Performance Change                             Values            
                                                                                                      main → PR           
------------------------------------------------------------------------------------------------------------------------
Basic - throughput                                         --|                     -6.9%    41927.70 → 39022.20           
Basic Blocking - throughput                                  |                     -0.2%    962.50 → 960.50               
Basic JS - throughput                                        |                     -0.5%    3751.80 → 3732.30             
Basic Multi-Threaded - throughput                            |                     +0.1%    71679.10 → 71724.20           
CHAMP get - rate                                             |                     +0.9%    25349671.99 → 25565167.46     
CHAMP put - rate                                             |+                    +2.7%    3591596.23 → 3690149.93       
Commit Latency (sig_ms_interval=16ms) -                    ++|                     -7.0%    5.58 → 5.18                   
Commit Latency (sig_ms_interval=1ms) -                       |                     +0.9%    2.01 → 2.03                   
Commit Latency (sig_ms_interval=256ms)                       |                     -0.0%    211.45 → 211.39               
Historical Queries - throughput                              |                     +0.9%    204267.73 → 206105.96         
KV deserialisation - rate                                    |                     -0.7%    1167542.32 → 1159285.88       
KV serialisation - rate                                      |                     +2.1%    1052853.23 → 1074460.08       
KV snapshot deserialisation - rate                           |                     -1.0%    3334.66 → 3300.36             
KV snapshot serialisation - rate                           --|                     -5.4%    5356.94 → 5066.86             
Logging - throughput                                         |+                    +3.8%    36646.20 → 38034.60           
Logging JWT - throughput                                     |                     +1.6%    7065.00 → 7176.50             
RBMap get - rate                                             |                     +0.5%    21742130.69 → 21846731.52     
RBMap put - rate                                             |+                    +3.4%    1048577.61 → 1084050.43       

Legend:
  +++|         = Improvement (left side, better performance)
      |---     = Regression (right side, worse performance)
      |         = No significant change (<2%)
  + = Better performance
  - = Worse performance
  Scale: ±50% change uses full bar width
  Note: For performance metrics, lower latency = better, higher throughput = better

Summary:
  Total metrics compared: 18
  Improvements: 5
  Regressions: 2
  No significant change: 11

Looks good.

@achamayou achamayou merged commit d5442e7 into microsoft:main Apr 22, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants