Unexpected implicit leading wildcard in sentence search

morbrorper · July 28, 2021, 5:59am

Searching for translations containing “hat” in the Italian from English Fluency Fast Track (FFT) collection (from the dashboard), I don’t just get results about hats, but also translations containing “what”, “that”, and the like; this makes the search virtually useless.

Searching in all sentences (using the icon at the top of the dashboard) gives the expected result though: only sentences about hats. But from all collections, of course.

zzcguns · July 28, 2021, 6:26am

Oh yes, well spotted!

A partial workaround would be to surround the word with spaces, but then you would lose any “hat” that is followed by punctuation.

A fuller workaround would be to make use of your regular expression tip, and search for something like [^a-z]hat[^a-z] which would only look for the word “hat” (there may be simpler regular expressions to do this, but I don’t know what regex engine is being used).
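To make the difference between these workarounds concrete, here is a minimal sketch in Python. It uses Python's `re` module purely as a stand-in, since the engine behind the search is not known at this point, and the sample sentences and the `search` helper are made up for illustration.

```python
import re

# Made-up sample sentences; not taken from any Clozemaster collection.
sentences = [
    "I bought a new hat yesterday.",
    "Is that what you want?",
    "Nice hat!",
    "Hat in hand, he left.",
]

def search(pattern, texts):
    """Return the texts in which the pattern matches anywhere, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [t for t in texts if rx.search(t)]

# Plain search: behaves like a substring match, so "that" and "what" are hits too.
plain = search("hat", sentences)

# Partial workaround: surround the word with spaces.
# This loses "hat!" and a sentence-initial "Hat".
spaced = search(" hat ", sentences)

# Fuller workaround: require a non-letter on both sides.
# This keeps "hat!" but still misses "Hat" at the very start of a sentence.
bracketed = search("[^a-z]hat[^a-z]", sentences)

print(plain)
print(spaced)
print(bracketed)
```

In this sketch the plain search returns all four sentences, the space-padded search returns only the first, and the character-class search returns the first and the third.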

The issue that you raise might be related to the fact that the general search (top of dashboard) doesn’t accept regular expressions, and the Manage Collections interface does.

zzcguns · July 28, 2021, 7:43am

Actually, I was just thinking about this while making a cup of tea and I don’t think this can be considered a bug. It is a “feature” that will look like a bug to someone who doesn’t know about the regex feature that you pointed us all to a few days ago, but it is actually an expected way that the search would work using regular expressions. In other words, it’s not that there is an unexpected implicit leading wildcard, but that there is no automatic word delimiter placed around the search term(s).

The choice therefore would be:

1. having no regular expression support, in which case the search would work in the same way as the general search (top of dashboard)
2. keeping regular expressions, and expecting the user to know that they need to introduce their own word delimiters around search terms (this is the current case, but there is no documentation provided to let people know that they’re using a regular expression search)
3. allowing regular expressions, but having Clozemaster automatically add word delimiters around the search terms

In this case, I think that you are suggesting that the third of these be implemented.

I’ve checked the search, and it appears to be using SQL regular expression syntax (which is not really a surprise I suppose).

Therefore, to match on word boundaries a person can use “[[:<:]]” (start of word) and “[[:>:]]” (end of word), which also allow searches to match at the very beginning and end of the sentence/translation. Note that the “[^a-z]hat[^a-z]” example wouldn’t match “hat” at the beginning or end of a sentence, although in practice “hat” is unlikely to start a sentence, and it shouldn’t end one either, since all sentences should end in punctuation.
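As a rough illustration of why the word-boundary markers behave better at the edges of a sentence, here is a small Python sketch. Python's `re` module does not understand “[[:<:]]” and “[[:>:]]”, so the hypothetical helper `to_python_regex` swaps them for Python's `\b`, which is only an approximation of the same idea. The sample strings are made up.

```python
import re

def to_python_regex(pattern):
    """Replace the two word-boundary markers with Python's word-boundary escape.

    This is an approximation for demonstration only: Python's escape matches
    at either edge of a word, while the markers distinguish start from end.
    """
    return pattern.replace("[[:<:]]", r"\b").replace("[[:>:]]", r"\b")

def matches(pattern, text):
    """True if the (translated) pattern matches anywhere in text, ignoring case."""
    return re.search(to_python_regex(pattern), text, re.IGNORECASE) is not None

boundary = "[[:<:]]hat[[:>:]]"
char_class = "[^a-z]hat[^a-z]"

# Word at the very start of the text: only the boundary version matches.
start_boundary = matches(boundary, "Hat in hand, he left.")
start_char_class = matches(char_class, "Hat in hand, he left.")

# Word at the very end of the text, with no trailing punctuation.
end_boundary = matches(boundary, "Nice hat")
end_char_class = matches(char_class, "Nice hat")

# Neither version is fooled by "that" or "what".
inside_word = matches(boundary, "Is that what you want?")

print(start_boundary, start_char_class, end_boundary, end_char_class, inside_word)
```

The character-class version needs an actual character on each side of the word, which is why it fails when the word sits at either end of the text.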

So what I believe we are asking is for an implicit “[[:<:]]” to be put at the start of the search, and then a “[[:>:]]” at the end of the search.

A person experienced with regular expressions could then still override this search in the ways you have previously described.

As an example in the Italian from English Fluency Fast Track, with the current search facility and searching on sentences (i.e. searching on the Italian phrases):

1. a search on “[[:<:]]correre[[:>:]]” will only pick up sentences where the word “correre” is on its own, including when it begins a sentence (and searches are not case sensitive so this picks up “Correre” as well)
2. a search on “[[:<:]][a-z]*correre[[:>:]]” will also pick up words such as “trascorrere”, “scorrere”, “percorrere” etc. in addition to instances of “correre”

If the implicit word boundary delimiters were provided, then the same two searches would be simplified to “correre” and “[a-z]*correre” respectively.
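The suggested behaviour could be sketched roughly as follows. This is not how Clozemaster implements its search; `add_implicit_boundaries`, `python_equivalent` and `find` are hypothetical helpers, and Python's `\b` again stands in for the two markers so that the effect can be checked locally against a short word list.

```python
import re

def add_implicit_boundaries(term):
    """Wrap a user's search term in start-of-word and end-of-word markers."""
    return "[[:<:]]" + term + "[[:>:]]"

def python_equivalent(pattern):
    """Approximate the markers with Python's word-boundary escape (demo only)."""
    return pattern.replace("[[:<:]]", r"\b").replace("[[:>:]]", r"\b")

def find(term, words):
    """Return the words matched by the term once boundaries are added, ignoring case."""
    rx = re.compile(python_equivalent(add_implicit_boundaries(term)), re.IGNORECASE)
    return [w for w in words if rx.search(w)]

words = ["correre", "Correre", "trascorrere", "scorrere", "percorrere"]

# The simplified searches the user would type.
exact = find("correre", words)
with_prefix = find("[a-z]*correre", words)

print(add_implicit_boundaries("correre"))
print(exact)
print(with_prefix)
```

Here the bare term matches only “correre” and “Correre”, while the term with the explicit “[a-z]*” prefix matches every word in the list.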

morbrorper · July 28, 2021, 9:50am

It’s a tough call between usability and flexibility.

Thanks for the “[[:<:]]correre[[:>:]]”; I had been trying “\b”, to no avail.

Floria7 · July 28, 2021, 10:08am

Ciao @morbrorper @zzcguns. This for me is as complicated as clitics, so I’m very glad that we less IT-savvy users can rely on your savvy for such queries.

I take my hat off to you!