Logical FS performance: per-directory file-list cache and inum cache …#39
Open
crayy8 wants to merge 7 commits into
Open
Logical FS performance: per-directory file-list cache and inum cache …#39crayy8 wants to merge 7 commits into
crayy8 wants to merge 7 commits into
Conversation
…improvements - Add per-directory file-list cache (DIR_FILE_LIST_CACHE, 500 FIFO slots) to avoid repeatedly enumerating the same directory when resolving K files in one folder during inum-based searches. - Store inum_cache paths relative to base_path instead of full OS paths. Saves memory and keeps LOGICAL_INUM_CACHE_MAX_PATH_LEN evaluating only the meaningful portion of the path. - Bump LOGICAL_INUM_CACHE_LEN from 3000 to 50000. Combined with break- on-first-empty-slot optimization in scan loops, lookups remain bounded by actual fill level, not array size. - Opportunistically cache visited directories during inum searches with always_cache=false (only fills empty slots; never evicts useful ones). - Use alloc-before-evict pattern for inum_cache and dir_file_list_cache to avoid leaving slots in a stale "key present, data empty" state on malloc failure. - Add get_path_relative_to_base helper with debug assertion to verify the base_path prefix invariant in one place rather than scattering unchecked pointer arithmetic across four call sites. - Misc fixes: initialize target_inum and check tsk_malloc result in create_path_search_helper; check GetFullPathNameW return value in create_search_path_long_path. - Add comment to load_dir_and_file_lists_win documenting future FindFirstFileEx + FIND_FIRST_EX_LARGE_FETCH optimization (Win7+). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes to the logical fs cache code from the previous commit: 1. Treat dir_file_list_cache insertion as best-effort. By the time we reach the per-name allocations in get_or_load_cached_dir_files, file_names (the output parameter) is already populated with valid data from the disk enumeration. A malloc failure during cache insertion only loses the optimization for next time - the caller can proceed with the loaded data. Changed both error branches to free the partial temporaries and return TSK_OK instead of TSK_ERR. 2. Fix a pre-existing cache_path leak in load_path_from_inum. If find_path_for_inum_in_cache returned a non-NULL cache_path (cache hit on the parent dir but a_addr itself wasn't a directory) and create_inum_search_helper subsequently failed its tsk_malloc, we were returning NULL without freeing cache_path.
- Add path_len field to LOGICAL_INUM_CACHE to avoid TSTRLEN() in scan loops - Replace cache_age counter with logical-clock LRU (last_used + inum_cache_clock): scan loops are now read-only on non-matching entries; only the winner gets a timestamp update; eviction finds LRU victim in a single pass - Merge duplicate check and eviction scan into one loop in add_directory_to_cache - Add early break on exact prefix match in find_closest_path_match_in_cache - tsk_logical_fs_check_path / tsk_logical_fs_path2inum now take wide-char path pre-converted by caller to avoid redundant conversions - Add TEMP instrumentation timers in ifind_lib.c and logical_fs.cpp for profiling (to be removed after validation)
1A. logicalfs_read_block: NULL-check load_path_from_inum result and
release cache_lock before returning TSK_ERR on lookup failure.
1B. logicalfs_read_block: release cache_lock before returning TSK_ERR
on get_win32_safe_path failure (lock was held but never released).
2. get_win32_safe_path: fix UNC long-path corruption. Previously all
long paths received a blind \?\ prefix; UNC paths (\server\...)
require \?\UNC\ with the leading \ of the resolved path replaced.
Now resolves via GetFullPathNameW into a temp buffer, detects UNC,
and applies the correct prefix.
3B. logicalfs_dir_open_meta: free(path) before three TSK_ERR returns
in the directory enumeration loop to plug memory leaks on error paths.
5. load_dir_and_file_lists_win: replace malloc with tsk_malloc for the
dir-file-list cache arrays for consistency with the rest of the file.
- Bump LOGICAL_INUM_CACHE_LEN from 5000 to 20000 to reduce DFS traversals - Add \?\-prefixed base_path at logical_open time; remove per-call get_win32_safe_path() conversions throughout logical_fs.cpp - Remove dead last_assigned_inum variable and redundant NULL guards in get_inum_from_directory_path - Refactor find_closest_path_match_in_cache: split exact/partial match into separate blocks with early break on exact match; add const to param - Fix C26451 arithmetic overflow: cast i to TSK_INUM_T before adding 1 - Minor whitespace and comment cleanup in ifind_lib.c and logical_fs.cpp Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All TEMP/measurement-only code removed from: - tsk_fs_path2inum (ifind_lib.c): path2inum timing block - find_closest_path_match_in_cache: prefix-scan hit/miss timing - find_closest_sibling_match_in_cache: sibling-scan timing - find_path_for_inum_in_cache: inum-lookup timing - load_dir_and_file_lists_win: sort timing + <chrono> include - logicalfs_read_block: all rb_* phase timing and summary Production logic (CreateFileW, SetFilePointer, ReadFile, sort calls, cache lookups) is unchanged.
Critical: - logicalfs_file_add_meta: move free(path) after tsk_error_set_errstr so the error message can format the path rather than dereferencing the nulled pointer. - logical_open: free logical_info->base_path on every GetFullPathNameW failure path before tsk_img_free (tsk_img_free does not run the close callback, so base_path was leaking). High: - load_dir_and_file_lists_win: fix malformed format string where PRIttocTSK was concatenated after the period (the base_path arg was silently dropped from the message). - inum_cache_clock and LOGICAL_INUM_CACHE.last_used widened to uint64_t to prevent wraparound corrupting LRU semantics. - find_max_inum: set tsk_error before returning LOGICAL_INVALID_INUM on load_path_from_inum failure. - logical_open: set tsk_error on GetFullPathNameW failures so callers can diagnose. - tsk_fs_path2inum: UTF-8 -> UTF-16 conversion failure now returns -1 (system error) instead of 1 (not found), with tsk_error set. Medium: - base_path invariant documented in logical_img.h. - get_inum_from_directory_path: NULL-check path_buf after tsk_malloc; free path_buf on find_closest_path_match_in_cache error. - Fixed stale docstring on get_inum_from_directory_path; removed stale "CR: area of concern" scaffolding comment. - tsk_logical_fs_path2inum: return -1 on three system-error paths that previously returned 1. - dir_file_list_cache: recheck under lock before insert to drop duplicates from racing threads. Low: - case_insensitive_compare declared/defined static (was leaking external linkage to other TUs). - find_path_for_inum_in_cache: break after match; inums are unique in the cache. - dir_file_list_cache empty-slot sentinel uses LOGICAL_INVALID_INUM named constant; both cache-scan loops defend against empty slots independently of the can_cache gate. - get_path_relative_to_base: runtime checks (NULL / length / prefix) return NULL on precondition violation; callers handle NULL with TSK_ERR + lock release. Removes release-mode UB introduced when the cache moved to storing relative paths. - logical_open: replaced hard-coded \?\ and \?\UNC\ wide chars with named constants (LOGICAL_LONG_PATH_PREFIX, LOGICAL_UNC_LONG_PATH_PREFIX) and their computed lengths; memcpy from the named prefix. - Renamed one-off locals img_info_fmi / img_info_lfp to the conventional logical_img_info used elsewhere. Follow-up review: - logicalfs_read_block: free(path) after CreateFileW (success or failure). The path string was leaked on every file_handle_cache miss. - load_dir_and_file_lists_win: wrap the FindNextFileW enumeration in try / catch(std::bad_alloc) so the kernel handle and search-path buffer are released if std::vector::push_back throws. - tsk_fs_ifind_path: free(cpath) after tsk_fs_path2inum returns. Pre-existing 2008-era leak surfaced while auditing this PR's ifind_lib.c changes; fix included since we are already touching the file. Perf: - tsk_logical_fs_path2inum: replace O(N) linear filename scan with O(log N) std::lower_bound using case_insensitive_compare (same predicate as the sort). Also fixes a sort/lookup inconsistency where the sort used towlower but the lookup used _wcsicmp. - search_directory_recursive: same binary-search swap for the sibling dir_name lookup; tightened the TSK_WIN32 guard to wrap the full body since case_insensitive_compare's overload is wstring/string-specific.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…improvements
Add per-directory file-list cache (DIR_FILE_LIST_CACHE, 500 FIFO slots) to avoid repeatedly enumerating the same directory when resolving K files in one folder during inum-based searches.
Store inum_cache paths relative to base_path instead of full OS paths. Saves memory and keeps LOGICAL_INUM_CACHE_MAX_PATH_LEN evaluating only the meaningful portion of the path.
Bump LOGICAL_INUM_CACHE_LEN from 3000 to 50000. Combined with break- on-first-empty-slot optimization in scan loops, lookups remain bounded by actual fill level, not array size.
Opportunistically cache visited directories during inum searches with always_cache=false (only fills empty slots; never evicts useful ones).
Use alloc-before-evict pattern for inum_cache and dir_file_list_cache to avoid leaving slots in a stale "key present, data empty" state on malloc failure.
Add get_path_relative_to_base helper with debug assertion to verify the base_path prefix invariant in one place rather than scattering unchecked pointer arithmetic across four call sites.
Misc fixes: initialize target_inum and check tsk_malloc result in create_path_search_helper; check GetFullPathNameW return value in create_search_path_long_path.
Add comment to load_dir_and_file_lists_win documenting future FindFirstFileEx + FIND_FIRST_EX_LARGE_FETCH optimization (Win7+).