Skip to content

Logical FS performance: per-directory file-list cache and inum cache …#39

Open
crayy8 wants to merge 7 commits into
SleuthKitLabs:develop-4.1xfrom
crayy8:logical_fs_performance
Open

Logical FS performance: per-directory file-list cache and inum cache …#39
crayy8 wants to merge 7 commits into
SleuthKitLabs:develop-4.1xfrom
crayy8:logical_fs_performance

Conversation

@crayy8
Copy link
Copy Markdown
Member

@crayy8 crayy8 commented May 4, 2026

…improvements

  • Add per-directory file-list cache (DIR_FILE_LIST_CACHE, 500 FIFO slots) to avoid repeatedly enumerating the same directory when resolving K files in one folder during inum-based searches.

  • Store inum_cache paths relative to base_path instead of full OS paths. Saves memory and keeps LOGICAL_INUM_CACHE_MAX_PATH_LEN evaluating only the meaningful portion of the path.

  • Bump LOGICAL_INUM_CACHE_LEN from 3000 to 50000. Combined with break- on-first-empty-slot optimization in scan loops, lookups remain bounded by actual fill level, not array size.

  • Opportunistically cache visited directories during inum searches with always_cache=false (only fills empty slots; never evicts useful ones).

  • Use alloc-before-evict pattern for inum_cache and dir_file_list_cache to avoid leaving slots in a stale "key present, data empty" state on malloc failure.

  • Add get_path_relative_to_base helper with debug assertion to verify the base_path prefix invariant in one place rather than scattering unchecked pointer arithmetic across four call sites.

  • Misc fixes: initialize target_inum and check tsk_malloc result in create_path_search_helper; check GetFullPathNameW return value in create_search_path_long_path.

  • Add comment to load_dir_and_file_lists_win documenting future FindFirstFileEx + FIND_FIRST_EX_LARGE_FETCH optimization (Win7+).

…improvements

- Add per-directory file-list cache (DIR_FILE_LIST_CACHE, 500 FIFO slots)
  to avoid repeatedly enumerating the same directory when resolving K
  files in one folder during inum-based searches.

- Store inum_cache paths relative to base_path instead of full OS paths.
  Saves memory and keeps LOGICAL_INUM_CACHE_MAX_PATH_LEN evaluating only
  the meaningful portion of the path.

- Bump LOGICAL_INUM_CACHE_LEN from 3000 to 50000. Combined with break-
  on-first-empty-slot optimization in scan loops, lookups remain bounded
  by actual fill level, not array size.

- Opportunistically cache visited directories during inum searches with
  always_cache=false (only fills empty slots; never evicts useful ones).

- Use alloc-before-evict pattern for inum_cache and dir_file_list_cache
  to avoid leaving slots in a stale "key present, data empty" state on
  malloc failure.

- Add get_path_relative_to_base helper with debug assertion to verify
  the base_path prefix invariant in one place rather than scattering
  unchecked pointer arithmetic across four call sites.

- Misc fixes: initialize target_inum and check tsk_malloc result in
  create_path_search_helper; check GetFullPathNameW return value in
  create_search_path_long_path.

- Add comment to load_dir_and_file_lists_win documenting future
  FindFirstFileEx + FIND_FIRST_EX_LARGE_FETCH optimization (Win7+).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
crayy8 and others added 6 commits May 4, 2026 14:47
Two fixes to the logical fs cache code from the previous commit:

1. Treat dir_file_list_cache insertion as best-effort. By the time we
   reach the per-name allocations in get_or_load_cached_dir_files,
   file_names (the output parameter) is already populated with valid
   data from the disk enumeration. A malloc failure during cache
   insertion only loses the optimization for next time - the caller
   can proceed with the loaded data. Changed both error branches to
   free the partial temporaries and return TSK_OK instead of TSK_ERR.

2. Fix a pre-existing cache_path leak in load_path_from_inum. If
   find_path_for_inum_in_cache returned a non-NULL cache_path (cache
   hit on the parent dir but a_addr itself wasn't a directory) and
   create_inum_search_helper subsequently failed its tsk_malloc, we
   were returning NULL without freeing cache_path.
- Add path_len field to LOGICAL_INUM_CACHE to avoid TSTRLEN() in scan loops
- Replace cache_age counter with logical-clock LRU (last_used + inum_cache_clock):
  scan loops are now read-only on non-matching entries; only the winner gets
  a timestamp update; eviction finds LRU victim in a single pass
- Merge duplicate check and eviction scan into one loop in add_directory_to_cache
- Add early break on exact prefix match in find_closest_path_match_in_cache
- tsk_logical_fs_check_path / tsk_logical_fs_path2inum now take wide-char path
  pre-converted by caller to avoid redundant conversions
- Add TEMP instrumentation timers in ifind_lib.c and logical_fs.cpp for profiling
  (to be removed after validation)
1A. logicalfs_read_block: NULL-check load_path_from_inum result and
    release cache_lock before returning TSK_ERR on lookup failure.

1B. logicalfs_read_block: release cache_lock before returning TSK_ERR
    on get_win32_safe_path failure (lock was held but never released).

2.  get_win32_safe_path: fix UNC long-path corruption. Previously all
    long paths received a blind \?\ prefix; UNC paths (\server\...)
    require \?\UNC\ with the leading \ of the resolved path replaced.
    Now resolves via GetFullPathNameW into a temp buffer, detects UNC,
    and applies the correct prefix.

3B. logicalfs_dir_open_meta: free(path) before three TSK_ERR returns
    in the directory enumeration loop to plug memory leaks on error paths.

5.  load_dir_and_file_lists_win: replace malloc with tsk_malloc for the
    dir-file-list cache arrays for consistency with the rest of the file.
- Bump LOGICAL_INUM_CACHE_LEN from 5000 to 20000 to reduce DFS traversals
- Add \?\-prefixed base_path at logical_open time; remove per-call
  get_win32_safe_path() conversions throughout logical_fs.cpp
- Remove dead last_assigned_inum variable and redundant NULL guards in
  get_inum_from_directory_path
- Refactor find_closest_path_match_in_cache: split exact/partial match
  into separate blocks with early break on exact match; add const to param
- Fix C26451 arithmetic overflow: cast i to TSK_INUM_T before adding 1
- Minor whitespace and comment cleanup in ifind_lib.c and logical_fs.cpp

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All TEMP/measurement-only code removed from:
- tsk_fs_path2inum (ifind_lib.c): path2inum timing block
- find_closest_path_match_in_cache: prefix-scan hit/miss timing
- find_closest_sibling_match_in_cache: sibling-scan timing
- find_path_for_inum_in_cache: inum-lookup timing
- load_dir_and_file_lists_win: sort timing + <chrono> include
- logicalfs_read_block: all rb_* phase timing and summary

Production logic (CreateFileW, SetFilePointer, ReadFile, sort calls,
cache lookups) is unchanged.
Critical:
- logicalfs_file_add_meta: move free(path) after tsk_error_set_errstr so
  the error message can format the path rather than dereferencing the
  nulled pointer.
- logical_open: free logical_info->base_path on every GetFullPathNameW
  failure path before tsk_img_free (tsk_img_free does not run the close
  callback, so base_path was leaking).

High:
- load_dir_and_file_lists_win: fix malformed format string where
  PRIttocTSK was concatenated after the period (the base_path arg was
  silently dropped from the message).
- inum_cache_clock and LOGICAL_INUM_CACHE.last_used widened to uint64_t
  to prevent wraparound corrupting LRU semantics.
- find_max_inum: set tsk_error before returning LOGICAL_INVALID_INUM on
  load_path_from_inum failure.
- logical_open: set tsk_error on GetFullPathNameW failures so callers
  can diagnose.
- tsk_fs_path2inum: UTF-8 -> UTF-16 conversion failure now returns -1
  (system error) instead of 1 (not found), with tsk_error set.

Medium:
- base_path invariant documented in logical_img.h.
- get_inum_from_directory_path: NULL-check path_buf after tsk_malloc;
  free path_buf on find_closest_path_match_in_cache error.
- Fixed stale docstring on get_inum_from_directory_path; removed stale
  "CR: area of concern" scaffolding comment.
- tsk_logical_fs_path2inum: return -1 on three system-error paths that
  previously returned 1.
- dir_file_list_cache: recheck under lock before insert to drop
  duplicates from racing threads.

Low:
- case_insensitive_compare declared/defined static (was leaking external
  linkage to other TUs).
- find_path_for_inum_in_cache: break after match; inums are unique in
  the cache.
- dir_file_list_cache empty-slot sentinel uses LOGICAL_INVALID_INUM
  named constant; both cache-scan loops defend against empty slots
  independently of the can_cache gate.
- get_path_relative_to_base: runtime checks (NULL / length / prefix)
  return NULL on precondition violation; callers handle NULL with
  TSK_ERR + lock release. Removes release-mode UB introduced when the
  cache moved to storing relative paths.
- logical_open: replaced hard-coded \?\ and \?\UNC\ wide chars with
  named constants (LOGICAL_LONG_PATH_PREFIX, LOGICAL_UNC_LONG_PATH_PREFIX)
  and their computed lengths; memcpy from the named prefix.
- Renamed one-off locals img_info_fmi / img_info_lfp to the conventional
  logical_img_info used elsewhere.

Follow-up review:
- logicalfs_read_block: free(path) after CreateFileW (success or
  failure). The path string was leaked on every file_handle_cache miss.
- load_dir_and_file_lists_win: wrap the FindNextFileW enumeration in
  try / catch(std::bad_alloc) so the kernel handle and search-path
  buffer are released if std::vector::push_back throws.
- tsk_fs_ifind_path: free(cpath) after tsk_fs_path2inum returns.
  Pre-existing 2008-era leak surfaced while auditing this PR's
  ifind_lib.c changes; fix included since we are already touching the
  file.

Perf:
- tsk_logical_fs_path2inum: replace O(N) linear filename scan with
  O(log N) std::lower_bound using case_insensitive_compare (same
  predicate as the sort). Also fixes a sort/lookup inconsistency where
  the sort used towlower but the lookup used _wcsicmp.
- search_directory_recursive: same binary-search swap for the sibling
  dir_name lookup; tightened the TSK_WIN32 guard to wrap the full body
  since case_insensitive_compare's overload is wstring/string-specific.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant