[PowerRename] Fix Unicode characters and non-breaking spaces not being correctly normalized before matching (#43972)

## Summary of the Pull Request

Fixes PowerRename failing to normalise different Unicode forms before matching. As a result, filenames containing characters that are visually identical to the search term failed to match because their underlying binary representations differ. This affects renaming files created on macOS, which names files in NFD (decomposed form) rather than Windows' NFC (precomposed form).

Additionally, this fixes matching against filenames containing non-breaking space characters, which can be created by automated systems and web downloaders. Previously, the NBSP character would fail to match a normal space.

## PR Checklist

- [x] Closes: #43971
- [x] Closes: #43815
- [ ] **Communication:** I've discussed this with core contributors already. If the work hasn't been agreed, this work might be rejected
- [x] **Tests:** Added/updated and all pass
- [ ] **Localization:** All end-user-facing strings can be localized
- [ ] **Dev docs:** Added/updated
- [ ] **New binaries:** Added on the required places
- [ ] [JSON for signing](https://github.com/microsoft/PowerToys/blob/main/.pipelines/ESRPSigning_core.json) for new binaries
- [ ] [WXS for installer](https://github.com/microsoft/PowerToys/blob/main/installer/PowerToysSetup/Product.wxs) for new binaries and localization folder
- [ ] [YML for CI pipeline](https://github.com/microsoft/PowerToys/blob/main/.pipelines/ci/templates/build-powertoys-steps.yml) for new test projects
- [ ] [YML for signed pipeline](https://github.com/microsoft/PowerToys/blob/main/.pipelines/release.yml)
- [ ] **Documentation updated:** If checked, please file a pull request on [our docs repo](https://github.com/MicrosoftDocs/windows-uwp/tree/docs/hub/powertoys) and link it here: #xxx

## Detailed Description of the Pull Request / Additional comments

The underlying issue is a binary mismatch between:

1. Precomposed characters (NFC) typed by Windows users, e.g. `U+0439` - `й`.
2. Decomposed characters (NFD) found in filenames from other platforms (or copied from text), e.g. `U+0438` `U+0306` - `и` + `̆ `.
3. Standard spaces (`U+0020`) versus non-breaking spaces (`U+00A0`).

### Updates to PowerRenameRegEx.cpp

I added a `SanitizeAndNormalize` function which replaces all non-breaking spaces with standard spaces and normalises the string to **Normalization Form C** using Win32's `NormalizeString`.

`PutSearchTerm` and `PutReplaceTerm` now normalise input immediately before performing any other processing.

`Replace` now normalises the `source` filename before processing. I updated the RegEx path to ensure it runs against the normalised `sourceToUse` string instead of the raw `source` string; otherwise regex matches would fail.

## Validation Steps Performed

Manually tested the use case detailed in #43971 with the following filenames:

- `Testй NFC.txt`
- `Testй NFD.txt`

Result:

<img width="1097" height="542" alt="image" src="https://github.com/user-attachments/assets/55dd4f01-8ec9-462c-a20f-dd246c368cf5" />

There are two new unit tests which exercise both the non-breaking space and Unicode form normalisation issues. These run on both the Boost- and non-Boost test paths, adding four tests to the total.

All new tests fail as expected on the prior code and all PowerRename tests pass successfully with the changes in this PR:

<img width="606" height="276" alt="image" src="https://github.com/user-attachments/assets/08dc01f6-201c-4d56-8f34-e5043e3d1e86" />
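
For readers without a Windows toolchain at hand, the effect of normalising both sides before matching can be sketched in portable C++. This is only an illustration, not the code in this PR: `ComposeShortI`, `SanitizeSketch` and `MatchesAfterNormalization` are hypothetical names, and `ComposeShortI` is a toy stand-in for `NormalizeString(NormalizationC, ...)` that composes just the one pair used in the examples above (`U+0438` + `U+0306` to `U+0439`). Real NFC composition covers far more sequences.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for NormalizeString(NormalizationC, ...).
// Assumption: only the pair U+0438 + U+0306 -> U+0439 is composed here.
static std::wstring ComposeShortI(const std::wstring& input)
{
    std::wstring out;
    out.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const bool isDecomposedShortI =
            input[i] == L'\u0438' && i + 1 < input.size() && input[i + 1] == L'\u0306';
        if (isDecomposedShortI)
        {
            out.push_back(L'\u0439');
            ++i; // Skip the combining breve that was just folded in.
        }
        else
        {
            out.push_back(input[i]);
        }
    }
    return out;
}

// Replace non-breaking spaces with regular spaces, then compose.
static std::wstring SanitizeSketch(std::wstring text)
{
    std::replace(text.begin(), text.end(), L'\u00A0', L' ');
    return ComposeShortI(text);
}

// Plain substring match performed on the normalised forms of both sides.
static bool MatchesAfterNormalization(const std::wstring& fileName, const std::wstring& searchTerm)
{
    return SanitizeSketch(fileName).find(SanitizeSketch(searchTerm)) != std::wstring::npos;
}
```

With this sketch, a decomposed filename such as `L"Test\u0438\u0306 NFD.txt"` matches the precomposed search term `L"Test\u0439"`, and `L"Hello\u00A0World"` matches `L"Hello World"`. Normalising both the filename and the search term, rather than only one side, keeps the comparison symmetric regardless of which side arrives decomposed.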
2026-04-03 09:46:54 +02:00 · 2025-12-25 03:34:32 +00:00
parent d87dde132d
commit 48e95caf39
3 changed files with 112 additions and 15 deletions
--- a/src/modules/powerrename/lib/PowerRenameRegEx.cpp
+++ b/src/modules/powerrename/lib/PowerRenameRegEx.cpp
@@ -11,6 +11,48 @@
 using std::conditional_t;
 using std::regex_error;

+/// <summary>
+/// Sanitizes the input string by replacing non-breaking spaces with regular spaces and
+/// normalizes it to Unicode NFC (precomposed) form.
+/// </summary>
+/// <param name="input">The input wide string to sanitize and normalize. If empty, it is
+/// returned unchanged.</param>
+/// <returns>A new std::wstring containing the sanitized and NFC-normalized form of the
+/// input. If normalization fails, the function returns the sanitized string (with non-
+/// breaking spaces replaced) as-is.</returns>
+static std::wstring SanitizeAndNormalize(const std::wstring& input)
+{
+    if (input.empty())
+    {
+        return input;
+    }
+
+    std::wstring sanitized = input;
+    // Replace non-breaking spaces (0xA0) with regular spaces (0x20).
+    std::replace(sanitized.begin(), sanitized.end(), L'\u00A0', L' ');
+
+    // Normalize to NFC (Precomposed).
+    // Get the size needed for the normalized string, including null terminator.
+    int size = NormalizeString(NormalizationC, sanitized.c_str(), -1, nullptr, 0);
+    if (size <= 0)
+    {
+        return sanitized; // Normalization failed: return the sanitized (NBSP-replaced) string.
+    }
+
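+    // Perform the normalization. The size from the first call is only an estimate, so
+    // use the second call's return value as the actual length.
+    std::wstring normalized(size, L'\0');
+    int written = NormalizeString(NormalizationC, sanitized.c_str(), -1, &normalized[0], size);
+    if (written <= 0)
+    {
+        return sanitized; // Fall back to the sanitized string if normalization fails.
+    }
+    // Keep only what was written, minus the null terminator NormalizeString appends.
+    normalized.resize(normalized[written - 1] == L'\0' ? written - 1 : written);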
+
+    return normalized;
+}
+
 IFACEMETHODIMP_(ULONG)
 CPowerRenameRegEx::AddRef()
 {
@@ -94,18 +136,20 @@ IFACEMETHODIMP CPowerRenameRegEx::PutSearchTerm(_In_ PCWSTR searchTerm, bool for
    HRESULT hr = S_OK;
    if (searchTerm)
    {
+        std::wstring normalizedSearchTerm = SanitizeAndNormalize(searchTerm);
+
        CSRWExclusiveAutoLock lock(&m_lock);
-        if (m_searchTerm == nullptr || lstrcmp(searchTerm, m_searchTerm) != 0)
+        if (m_searchTerm == nullptr || lstrcmp(normalizedSearchTerm.c_str(), m_searchTerm) != 0)
        {
            changed = true;
            CoTaskMemFree(m_searchTerm);
-            if (lstrcmp(searchTerm, L"") == 0)
+            if (normalizedSearchTerm.empty())
            {
                m_searchTerm = NULL;
            }
            else
            {
-                hr = SHStrDup(searchTerm, &m_searchTerm);
+                hr = SHStrDup(normalizedSearchTerm.c_str(), &m_searchTerm);
            }
        }
    }
@@ -238,17 +282,19 @@ IFACEMETHODIMP CPowerRenameRegEx::PutReplaceTerm(_In_ PCWSTR replaceTerm, bool f
    HRESULT hr = S_OK;
    if (replaceTerm)
    {
+        std::wstring normalizedReplaceTerm = SanitizeAndNormalize(replaceTerm);
+
        CSRWExclusiveAutoLock lock(&m_lock);
-        if (m_replaceTerm == nullptr || lstrcmp(replaceTerm, m_RawReplaceTerm.c_str()) != 0)
+        if (m_replaceTerm == nullptr || lstrcmp(normalizedReplaceTerm.c_str(), m_RawReplaceTerm.c_str()) != 0)
        {
            changed = true;
            CoTaskMemFree(m_replaceTerm);
-            m_RawReplaceTerm = replaceTerm;
+            m_RawReplaceTerm = normalizedReplaceTerm;

            if ((m_flags & RandomizeItems) || (m_flags & EnumerateItems))
                hr = _OnEnumerateOrRandomizeItemsChanged();
            else
-                hr = SHStrDup(replaceTerm, &m_replaceTerm);
+                hr = SHStrDup(normalizedReplaceTerm.c_str(), &m_replaceTerm);
        }
    }

@@ -397,7 +443,10 @@ HRESULT CPowerRenameRegEx::Replace(_In_ PCWSTR source, _Outptr_ PWSTR* result, u
    {
        return hr;
    }
-    std::wstring res = source;
+
+    std::wstring normalizedSource = SanitizeAndNormalize(source);
+
+    std::wstring res = normalizedSource;
    try
    {
        // TODO: creating the regex could be costly.  May want to cache this.
@@ -438,9 +487,8 @@ HRESULT CPowerRenameRegEx::Replace(_In_ PCWSTR source, _Outptr_ PWSTR* result, u
            }
        }

-        std::wstring sourceToUse;
+        std::wstring sourceToUse = normalizedSource;
        sourceToUse.reserve(MAX_PATH);
-        sourceToUse = source;

        std::wstring searchTerm(m_searchTerm);
        std::wstring replaceTerm;
@@ -536,7 +584,7 @@ HRESULT CPowerRenameRegEx::Replace(_In_ PCWSTR source, _Outptr_ PWSTR* result, u
            replaceTerm = regex_replace(replaceTerm, zeroGroupRegex, L"$1$$$0");
            replaceTerm = regex_replace(replaceTerm, otherGroupsRegex, L"$1$0$4");

-            res = RegexReplaceDispatch[_useBoostLib](source, m_searchTerm, replaceTerm, m_flags & MatchAllOccurrences, isCaseInsensitive);
+            res = RegexReplaceDispatch[_useBoostLib](sourceToUse, m_searchTerm, replaceTerm, m_flags & MatchAllOccurrences, isCaseInsensitive);

            // Use regex search to determine if a match exists. This is the basis for incrementing
            // the counter.
@@ -669,17 +717,17 @@ PowerRenameLib::MetadataType CPowerRenameRegEx::_GetMetadataTypeFromFlags() cons
 {
    if (m_flags & MetadataSourceXMP)
        return PowerRenameLib::MetadataType::XMP;
-    
+
    // Default to EXIF
    return PowerRenameLib::MetadataType::EXIF;
 }

-// Interface method implementation  
+// Interface method implementation
 IFACEMETHODIMP CPowerRenameRegEx::GetMetadataType(_Out_ PowerRenameLib::MetadataType* metadataType)
 {
    if (metadataType == nullptr)
        return E_POINTER;
-        
+
    *metadataType = _GetMetadataTypeFromFlags();
    return S_OK;
 }
@@ -689,5 +737,3 @@ PowerRenameLib::MetadataType CPowerRenameRegEx::GetMetadataType() const
 {
    return _GetMetadataTypeFromFlags();
 }
-
-
--- a/src/modules/powerrename/unittests/CommonRegExTests.h
+++ b/src/modules/powerrename/unittests/CommonRegExTests.h
@@ -647,6 +647,54 @@ TEST_METHOD(VerifyCounterIncrementsWhenResultIsUnchanged)
    CoTaskMemFree(result);
 }

+// Helper function to verify normalization behavior.
+void VerifyNormalizationHelper(DWORD flags)
+{
+    CComPtr<IPowerRenameRegEx> renameRegEx;
+    Assert::IsTrue(CPowerRenameRegEx::s_CreateInstance(&renameRegEx) == S_OK);
+    Assert::IsTrue(renameRegEx->PutFlags(flags) == S_OK);
+
+    // 1. Unicode Normalization: NFD source with NFC search term.
+    PWSTR result = nullptr;
+    unsigned long index = 0;
+
+    // Source: "Test" + U+0438 (Cyrillic small letter i) + U+0306 (combining breve).
+    std::wstring sourceNFD = L"Test\u0438\u0306";
+    // Search: "Test" + U+0439 (Cyrillic small letter short i, the precomposed form).
+    std::wstring searchNFC = L"Test\u0439";
+
+    // A match should occur despite different normalization forms.
+    Assert::IsTrue(renameRegEx->PutSearchTerm(searchNFC.c_str()) == S_OK);
+    Assert::IsTrue(renameRegEx->PutReplaceTerm(L"Match") == S_OK);
+    Assert::IsTrue(renameRegEx->Replace(sourceNFD.c_str(), &result, index) == S_OK);
+    Assert::AreEqual(L"Match", result, L"Failed to match NFD source with NFC search term.");
+    CoTaskMemFree(result);
+
+    // 2. Whitespace Normalization: test non-breaking space versus regular space.
+    result = nullptr;
+    index = 0;
+
+    // Source: "Hello" + non-breaking space + "World".
+    std::wstring sourceNBSP = L"Hello\u00A0World";
+    // Search: "Hello" + regular space + "World".
+    std::wstring searchSpace = L"Hello World";
+
+    Assert::IsTrue(renameRegEx->PutSearchTerm(searchSpace.c_str()) == S_OK);
+    Assert::IsTrue(renameRegEx->Replace(sourceNBSP.c_str(), &result, index) == S_OK);
+    Assert::AreEqual(L"Match", result, L"Failed to match non-breaking space source with regular space search term.");
+    CoTaskMemFree(result);
+}
+
+TEST_METHOD(VerifyUnicodeAndWhitespaceNormalizationSimpleSearch)
+{
+    VerifyNormalizationHelper(0);
+}
+
+TEST_METHOD(VerifyUnicodeAndWhitespaceNormalizationRegex)
+{
+    VerifyNormalizationHelper(UseRegularExpressions);
+}
+
 #ifndef TESTS_PARTIAL
 };
 }