fix: unicode is handled properly in strdist by letFunny · Pull Request #280 · canonical/chisel

letFunny · 2026-04-02T09:52:23Z

Have you signed the CLA?

While doing more performance work I noticed a couple of bugs in our implementation of distance. I will fix each of them in a separate PR so that the perf work can land. I have added a TODO for the next bug. I don't want to fix both in the same PR because that would make the code much harder to reason about; the Unicode fix on its own is straightforward.

This PR fixes a common bug when handling strings in Go. Consider the for loop:

for bi, br := range b {
    ...
    lst[bi+1] = cost
}

It iterates over a string and uses bi, which is the byte offset, to index into lst instead of the current rune count. When every character is ASCII this works because each rune is 1 byte, but it fails for runes that take more than 1 byte (see the added test case).

letFunny · 2026-04-02T09:59:35Z

 		}
-		_ = stop
-		if cut != 0 && stop {
+		if cut != 0 && len(b) > 0 && stop {


This is not strictly related to Unicode, but it made the test fail when the second string was empty, so I fixed it as well.

For context, stop is meant to represent that the minimum edit cost is greater than the threshold. However, if the second string is empty, stop will not be set to false by the inner loop, which is a bug (see test case).

Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.

Your intuition was right: this was only a partial fix. I added several more test cases that were failing previously. I also changed the logic to finish as soon as cut is reached. The issue was the following: stop is computed by iterating over b[i] and checking whether any of these costs is less than cut:

the cost of swapping b[i] for a[i]

the cost of inserting b[i]

the cost of removing a[i]

0 if a[i] == b[i]

The problem is that the loop body only runs when b is not empty. The calculation above the for loop correctly accounts for the empty case in lst[0], which compares a[:j] to b[:0], but we were not using that cost when computing min.

You devised the algorithm, so I hope I understood it correctly; let me know if we need to discuss it in more depth in person.

upils

Thanks! Would you mind reworking the PR description to clarify what is actually fixed? I found it a little tricky to spot.

niemeyer

Thanks for the fix! A question about the added logic.

niemeyer · 2026-04-27T14:55:02Z

 		}
-		_ = stop
-		if cut != 0 && stop {
+		if cut != 0 && len(b) > 0 && stop {


Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.

niemeyer

👆

upils

We discussed offline some aspects of this function that this work uncovered (non-symmetric behavior, potential optimization) but I think this PR should focus on fixing the unicode-related bug, so it is good as-is. Thanks!

upils · 2026-04-29T12:27:39Z

+	{f: strdist.GlobCost, r: 3, a: "abc", b: ""},
+	{f: strdist.GlobCost, r: 1, cut: 1, a: "abc", b: ""},
+	{f: strdist.GlobCost, r: 2, cut: 3, a: "ab", b: ""},


These tests are unrelated to using globs. What about testing them with StandardCost instead to make it obvious?

Done, let me know what you think. I also added more tests following @niemeyer's suggestion in the sprint.

upils

Thanks @letFunny. I like that the additional tests give us a clearer view of the behavior for the corner cases.

upils · 2026-05-22T07:50:10Z

+	{f: strdist.StandardCost, r: 2, cut: 3, a: "ab", b: ""},
+	{f: strdist.StandardCost, r: 2, cut: 1, a: "", b: "ab"},


suggestion: add another comment after to make it clear the non-symmetry is only illustrated with these 2 cases?

I thought about doing that but could not think of an idiomatic way. The usual approach is // End of symmetric tests or something like that, but there are no more sections in the code. In Chisel we use a summary, but that too feels heavyweight.

fix: unicode is handled properly in strdist

fd23218

letFunny commented Apr 2, 2026

letFunny added the Bug An undesired feature ;-) label Apr 2, 2026

remove unneeded comment

548b573

letFunny mentioned this pull request Apr 2, 2026

perf: improve GlobPath performance #282

Draft

1 task

upils approved these changes Apr 10, 2026

Merge branch 'main' into bug-fix-distance-unicode

76960ac

niemeyer reviewed Apr 27, 2026

niemeyer requested changes Apr 27, 2026

letFunny added 2 commits April 27, 2026 17:09

address review

b6783cd

more tests

77ec162

letFunny requested a review from upils April 28, 2026 08:24

yet another condition ^ tm

66017cf

upils approved these changes May 22, 2026

more testing

c49ea24

upils approved these changes May 22, 2026

3 participants