Tatoeba search is slow

The option to search Tatoeba isn’t as useful as it could be: it triggers a loose search, which is slow and often returns too many irrelevant hits.

Here is one example; I searched for Italian perdono, which took about 20 seconds (note that my search term is not among the top results):

IMO it would be more useful to perform an exact search by default and maybe have the loose search as an option. The exact search is triggered by prepending an equal sign and enclosing the search term in quotes: ="perdono".
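For anyone who wants to try this outside the dialog, here is a minimal Python sketch of how such an exact-search link could be built. The base URL and the parameter names `from` and `query` are my assumptions about how Tatoeba's search page is addressed, and `exact_query` and `search_url` are just illustrative helper names, so check the result against the live site.

```python
from urllib.parse import urlencode

# Assumed base URL and parameter names; verify against the live site.
SEARCH_URL = "https://tatoeba.org/en/sentences/search"


def exact_query(term):
    """Wrap a term in the exact-match syntax: an equal sign plus quotes."""
    return f'="{term}"'


def search_url(query, lang):
    """Build a search URL for `query`, restricted to source language `lang`."""
    # urlencode percent-escapes the equal sign and the quotes in the query.
    return SEARCH_URL + "?" + urlencode({"from": lang, "query": query})


print(search_url(exact_query("perdono"), "ita"))
```

Passing the bare term instead of `exact_query(term)` would give the loose search that the dialog currently triggers.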

In languages that support loose searches (stemming), I believe that the loose indexes are built at the same time as the exact indexes, so a loose search may not be inherently slower than an exact search (though this would be hard to determine empirically over an Internet connection, because so many sources of variability could interfere). It’s probably true that searches that return fewer results generally take less time. But a narrower search has a cost: if you get too few results, you’re going to want to do another, looser search anyway, which just gets you back to where you started, except that now you’ve performed two searches and the entire process has taken more time.

OK, the exact search may not be much faster, but at least I wouldn’t have to wade through page after page of irrelevant hits. This particular search yields 1379 loose vs. 55 exact hits.

But nothing is forcing you to wade through all of those pages, unless for some reason the kind of sentences you’re looking for show up late in the results. In the particular instance you cited, I think the exact hits for “perdono” occur so late because there are lots of one-word sentences matching a stemmed version of “perdono” but no one-word sentences that contain “perdono” itself. The default sort order on Tatoeba is “Relevance”, and apparently short sentences that consist only of a stemmed version of the search word are ranked higher than longer sentences that contain the exact word. I personally like random sort, so from my point of view, having Clozemaster specify it in the Tatoeba search performed from the dialog would be useful. But I wouldn’t be surprised if other people dislike random sort for one reason or another. For instance, they might prefer short sentences and always want to see all of them at the top of the search results.

In this particular case, an exact search will happen to push the examples you want closer to the top of the results. However, there’s no guarantee in general that performing an exact search is going to filter out the sentences you don’t want and leave the ones you do. In fact, it could do the reverse. Perhaps it turns out that some sentence that happens to contain “perdo” is one that strikes you as particularly memorable, but you wouldn’t see it if you performed an exact search.
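To make the difference concrete, here is a toy Python sketch of exact versus stemmed matching. The suffix list is a made-up stand-in for a real Italian stemmer, the sample sentences are invented for illustration, and none of this is how Tatoeba actually implements its indexes.

```python
import re

# Hypothetical suffix list, for illustration only; a real stemmer is far more elaborate.
SUFFIXES = ("ono", "o", "i", "e", "a")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(sentence):
    """Split a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())


def exact_hits(term, sentences):
    """Sentences that contain the term itself."""
    return [s for s in sentences if term.lower() in tokens(s)]


def stemmed_hits(term, sentences):
    """Sentences that contain any word sharing the term's toy stem."""
    target = toy_stem(term)
    return [s for s in sentences if any(toy_stem(t) == target for t in tokens(s))]


# Invented sample sentences, not taken from Tatoeba.
SAMPLE = ["Chiedo perdono.", "Perdo sempre le chiavi.", "Perdi tempo.", "Buongiorno."]

print(exact_hits("perdono", SAMPLE))
print(stemmed_hits("perdono", SAMPLE))
```

With this toy stemmer, “perdono”, “perdo” and “perdi” all reduce to the same stem, so the stemmed search returns the first three sample sentences while the exact search returns only the first.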

In general, it’s easier to narrow down a search that’s a little wider than you might want than it is to do the reverse. If you want to widen the search, you have to think of variants in the first place. It’s easier to look at the variation present in the first few pages of a search and then figure out what part of it you want to eliminate. As a “power user”, you can make use of whichever filters you want. For instance, you can eliminate all sentences containing the word “Tom” by adding “-Tom” to your initial search.
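As a small sketch of that kind of narrowing, the snippet below appends minus-prefixed terms to a query string. The helper name `exclude` is hypothetical; only the “-word” syntax comes from the discussion above.

```python
def exclude(query, *words):
    """Append a minus-prefixed term for each word that should be filtered out."""
    return " ".join([query, *(f"-{w}" for w in words)])


print(exclude("perdono", "Tom"))  # perdono -Tom
```

Starting from the wide query and adding exclusions one at a time mirrors the workflow of looking at the first few pages and then deciding what to eliminate.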

I would love it if we could specify our own searches to fit the five spots on the dialog, as suggested by @LuciusVorenusX in this post (which I hope @mike has seen). Then we could define our own initial Tatoeba searches however we liked. But in the absence of such customizability, I think that a stemmed search makes a better default than an exact search.

You make a convincing argument, and customization would be great.

In this particular case, I should have mentioned that I was looking for uses of perdono as a noun, so the stemmed verb forms are just noise.
