fix: unicode is handled properly in strdist#280
Conversation
| } | ||
| _ = stop | ||
| if cut != 0 && stop { | ||
| if cut != 0 && len(b) > 0 && stop { |
There was a problem hiding this comment.
This is not strictly related to unicode but it made the test fail when the second string was empty so I fixed it as well.
For context, stop is meant to represent that the minimum edit cost is greater than the threshold. However, if the second string is empty, stop will not be set to false by the inner loop, which is a bug (see test case).
There was a problem hiding this comment.
Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.
There was a problem hiding this comment.
Your intuition was right this is only a partial fix. I added several more test cases that were failing previously. I also changed the logic to finish as soon as cut is reached. The issue was the following, stop is computed by iterating over b[i] and seeing if:
- cost of swapping it for
a[i] - cost of inserting
b[i] - cost of removing
a[i] - 0 if
a[i] == b[i]
any of these was less than cut. The problem is that to enter the loop b could not be empty. The calculation above the for loop correctly accounted for this case in lst[0], that is comparing a[:j] to b[:0]. The problem is that we were not using the cost here to compute min.
You were the one devising the algorithm so I hope I got a good understanding, let me know if we need to discuss it in more depth in person.
niemeyer
left a comment
There was a problem hiding this comment.
Thanks for the fix! A question about the added logic.
| } | ||
| _ = stop | ||
| if cut != 0 && stop { | ||
| if cut != 0 && len(b) > 0 && stop { |
There was a problem hiding this comment.
Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.
upils
left a comment
There was a problem hiding this comment.
We discussed offline some aspects of this function that this work uncovered (non-symmetric behavior, potential optimization) but I think this PR should focus on fixing the unicode-related bug, so it is good as-is. Thanks!
| {f: strdist.GlobCost, r: 3, a: "abc", b: ""}, | ||
| {f: strdist.GlobCost, r: 1, cut: 1, a: "abc", b: ""}, | ||
| {f: strdist.GlobCost, r: 2, cut: 3, a: "ab", b: ""}, |
There was a problem hiding this comment.
These tests are unrelated to using globs. What about testing them with StandardCost instead to make it obvious?
There was a problem hiding this comment.
Done, let me know what you think. I also added more tests following's @niemeyer suggestion in the sprint.
| {f: strdist.StandardCost, r: 2, cut: 3, a: "ab", b: ""}, | ||
| {f: strdist.StandardCost, r: 2, cut: 1, a: "", b: "ab"}, |
There was a problem hiding this comment.
suggestion: add another comment after to make it clear the non-symmetry is only illustrated with these 2 cases?
There was a problem hiding this comment.
I thought about doing that but I could not think of a way of being idiomatic. The usual is // End of symmetric tests or something like that but there are no more sections in the code. In Chisel we use a summary but that too fells heavyweight.
While doing more performance work I noticed a couple of bugs in our implementation of distance. I will fix each one of them in a different PR so that the perf work can land. I have added a TODO for the next bug. I don't want to fix both as part of the same PR as it will be much harder to reason about the code, the fix for unicode in itself is pretty easy.
This PR fixes a common bug when handling strings in Go. The for loop:
Is iterating over a string and using
biwhich is the byte offset to store things inlstinstead of the current rune count. In the case where the characters are ASCII it works because each rune is 1 byte but in the case of runes that take more than 1 byte this fails (see the added test case).