Integration with Tatoeba, status?

I seem to remember @Mike mentioning some ongoing work on tighter integration with Tatoeba, so that corrections and new sentences would be brought over to Clozemaster on a regular basis. I’m wondering what the status is on that issue, which sorely needs to be addressed.

Over at Tatoeba, I can see people mentioning that they have abandoned Clozemaster because the corpus is old and not being updated. I don’t think I am going to do that yet, but I am thinking of giving up Japanese from English, because I would practically need to be a fluent Japanese speaker already to sort the wheat from the chaff. I am also reluctant to recommend Clozemaster to other people, unless they already know the target language intimately, because of the many issues with the corpus. Of course, some corpora are better than others, but even the good ones contain errors and need to be updated regularly.

5 Likes

Thanks for the post! We are indeed able to update from Tatoeba more easily/frequently now :raised_hands: though we’re still working out how best to handle updating sentences people have already played and whether to add / how best to handle the influx of sentences for some of the more common languages (Russian now has 800k+ sentences, Italian 700k+, etc. :exploding_head: - that’s a lot of data and TTS).

Over at Tatoeba, I can see people mentioning that they have abandoned Clozemaster because the corpus is old and not being updated.

Thanks for letting us know - where are you seeing that?

I am also reluctant to recommend Clozemaster to other people, unless they already know the target language intimately, because of the many issues with the corpus.

Thanks for letting us know this as well! We definitely want to change that - we’re improving our moderation process and growing our small team of moderators to help update/improve reported sentences, which includes contributing to Tatoeba as well, so hopefully you’ll start seeing the improvements within the next few weeks/months. If there’s anything else you think might help please let us know - we want you to be wildly excited to recommend Clozemaster! :slight_smile:

7 Likes

There were a few comments on the Wall recently. I don’t want to point out any names, so I’m not posting the links.

4 Likes

Regarding how to handle languages with a huge number of sentences, I know there are curated lists that I imagine could be useful as a starting point, at least for some languages.

2 Likes

Since it is easier now to stay up to date with Tatoeba, can we expect more language pairs to become available by the time the current pairs get updated?

2 Likes

Yep! We’re aiming to get North Frisian from English added soon with more to follow. Any in particular you’d like to request? Language pairings with >=1500 sentences with direct translations on Tatoeba are what we’re considering at the moment.
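For anyone curious what that threshold amounts to in practice, here is a minimal Python sketch of the idea. It assumes the data is available as a mapping from sentence ID to language code plus a list of direct-translation links between sentence IDs; the function names, the data layout, and the language codes are illustrative assumptions, not Clozemaster's actual pipeline.

```python
from collections import Counter


def count_direct_pairs(sentence_langs, links):
    """Count direct translation links per unordered language pair.

    sentence_langs: dict mapping sentence id -> language code (assumed layout)
    links: iterable of (sentence_id, translation_id) direct links
    """
    counts = Counter()
    seen = set()
    for a, b in links:
        # A link may be listed in both directions; count it only once.
        key = (min(a, b), max(a, b))
        if key in seen:
            continue
        seen.add(key)
        lang_a = sentence_langs.get(a)
        lang_b = sentence_langs.get(b)
        # Skip links to unknown sentences and same-language links.
        if lang_a is None or lang_b is None or lang_a == lang_b:
            continue
        counts[tuple(sorted((lang_a, lang_b)))] += 1
    return counts


def eligible_pairs(counts, minimum=1500):
    """Return the language pairs with at least `minimum` direct links."""
    return sorted(pair for pair, n in counts.items() if n >= minimum)
```

Only direct links are counted here; translations of translations would need a separate pass and, presumably, a review step before being used.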

4 Likes

Following up on morbrorper’s note, here are sample negative comments about Clozemaster on the Wall at Tatoeba:

… I originally came here from Clozemaster, and the orphan sentences turned up more often than not in the Japanese section, and some of them made no sense to me. I’m not worried about running out of sentences to translate.

Clozemaster is not enough good for learning a language.
iKnow has better sentence examples for words in order: iKnow!
I use clozemaster. I have an extension to mark words.
The ‘English from Magyar’ is not only contains translations of Hungarian, but translations of translations, and later if the sentence is changed in Tatoeba, in clozemaster it remains unchanged.

Learning Norwegian is similar fun; the database has not been updated from Tatoeba for a long time. On the other hand, I have enough background that I notice quite a few typos and that’s not the only method I use. [Translated from Finnish by GoogleTranslate]

I’ve stopped using Clozemaster months ago and switched to Anki today using confirmed sentences from here
I also have a wide array of resources for Japanese
I just mentioned them because I found out about this through there.

3 Likes

“As a starting point” is key. People have their own criteria for assembling these lists, which may not be immediately visible. For instance, the person who has assembled the longest list in English teaches beginners, so he favors short sentences with simple vocabulary and syntax. Diversity is not a priority for him, so there is a lot of repetition. Finally, although the list is advertised as being proofread, I continually find errors in the sentences it contains.

2 Likes

Spanish, Italian, French, Dutch, Turkish, Romanian.

I’d like to use Clozemaster on all combinations of those languages.

Turkish from Spanish is the one I’m looking forward to the most, though.

1 Like

It sounds like some of these complaints will hopefully be resolved if the new ‘pipeline’ for updates that Mike and the team have built works as advertised.

3 Likes

Any updates on this? I’m looking to add a ton of sentences or have updates on a regular basis for major languages (Spanish, French, German, etc.). I’m not the most tech-savvy guy, and adding a bunch of words by hand is pretty tedious.

There’s a list of 3,000,000 sentences I had found for Spanish taken from movies and news; it’s not written in the right format though LOL

Frisian! Wonderful :+1::grin::+1:

1 Like

Having finally looked at Tatoeba, I have some requests.

Berber and/or Kabyle - I’m assuming that “Berber” means Moroccan Tamazight… ?! What an amazing and huge resource.

Kirundi - since Swahili is not really working out, this would be an awesome entry into the Bantu languages.

Kotava - this auxlang has unique features, an ongoing community after more than 40 years, and there are several open source translated novels online.

Big plus for all the above languages - they are well represented not only in English translations, but also French, Russian, German, Spanish, Portuguese, Turkish, Arabic… and those are just the ones I checked. Some of these less obvious choices have thousands of translations - 4000 sentences from Kirundi to Turkish!

Cheers

I think the biggest challenge is how to prevent importing bad translations from Tatoeba. I took a quick look at Tatoeba and noticed that they link one translation to another, but when there is an incorrect translation it looks like there is nobody in control. For example, I saw a sentence correctly translated from Chinese to English, then badly translated to Spanish, and then the Spanish correctly translated to Dutch, French, and a lot of other languages. In the end they all wind up, one way or another, with bad translations… So IMHO at this point Tatoeba could be a big mess and maybe unreliable. :face_with_monocle:

How will Clozemaster handle this?

3 Likes

I think Tatoeba has a flawed system for quality control because it is dependent on the owner of a sentence responding to comments.

Now, in practice, errors are rare, and users typically do respond to comments, but when they don’t, there is not much recourse for new users.

I’m not quite sure how to fix this. I think in the end though I would rather Clozemaster be employing some sort of editorial checks rather than just bulk importing. I’m not sure how practical this is though.

I think Clozemaster is so awesome though, I’m a paying subscriber, and if we had enough paying subscribers you could easily employ a paid team of bilingual speakers to review the sentences one-by-one.

3 Likes

(1) Which sentence are you talking about?

(2) Is it possible that you’re confusing direct translations with indirect translations? The user interface distinguishes between sentences that have been marked as direct translations and those that are simply indirect translations. In the screenshot, I circled the translations (= direct translations), but not the indirect translations (= translations of translations).

[Screenshot: direct_vs_indirect_2]

In the scenario that you mentioned, where a sentence is badly translated from English to Spanish, then correctly translated from Spanish to other languages, there will be only one incorrect direct translation. It’s not a good idea to rely on indirect translations in general. I’m unclear as to whether Clozemaster uses indirect translations for certain language pairs. If it does, then they need to be reviewed before they are added to a collection.

(3)

So IMHO at this point Tatoeba could be a big mess and maybe unreliable.

Is that speculation based on a single error, or have you looked at a number of sentences and estimated the error rate? In the languages I know, I have seen (and reported/fixed) errors, but the overall quality of Tatoeba sentences and translations is high.

1 Like

Hi alanf, thank you for your response.

First of all, I’m new to Clozemaster and Tatoeba. I do have some concerns about the quality of the translations on Tatoeba, and therefore also on Clozemaster, but I hope that together we can make Clozemaster even better. I strongly believe in this concept of language immersion.

About your questions 1 and 2: the sentence “房间里家具齐全。” on Tatoeba has a direct translation to Dutch, which is incorrect, and a direct translation to English, which is correct. Then there are a lot of translations based on translations, but I don’t understand why people are allowed to do that, given the risk of more errors. If you don’t understand the source language, there is always a risk of misinterpretation.

About your question 3: sorry, I didn’t mean to judge them. Yes, I am speculating, because I visited Tatoeba and only looked at 3 or 4 sentences, but I noticed that they all had (direct or indirect) mistakes. Then I became very sad (maybe because my hopes were too high) and left.

1 Like

Thanks for that feedback. I unlinked the Dutch from the Chinese.

Translations of translations (indirect translations) are displayed automatically, but they are shown differently from direct translations. They are displayed (1) for users who know both languages well and want to see possible translations whose quality they can judge for themselves and (2) in case people want to mark them as good translations in their own right, which advanced contributors at Tatoeba can do with a single button click.

Indirect translations, as you’ve seen, can be invalid due to mistakes pertaining to direct translations made by people who don’t understand both languages perfectly. But they can also be invalid for “good” reasons, namely that one language in the pair makes a distinction not made by the other. For instance, the word “su” in Spanish can correspond to several possessive pronouns (“his”, “her”, “your”) in English. Thus, sentences containing “his” from English might be linked through sentences containing “su” in Spanish to sentences in German (which, like English, distinguishes between “his” and “her”) that contain “ihr”, meaning “her”. Someone who knows these languages well would understand that the sentences diverge in this respect, but would find the correspondences between the sentences useful because of the other vocabulary they contain.
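To make the distinction concrete, here is a small Python sketch that treats the links as a graph: direct translations are one hop away, indirect translations are two hops away. The sentence IDs, the `links` dictionary, and the function names are hypothetical and only mirror the “his” / “su” / “ihr” example; this is not how Tatoeba itself is implemented.

```python
def direct_translations(links, sentence_id):
    """Sentences explicitly linked to `sentence_id` (one hop)."""
    return set(links.get(sentence_id, ()))


def indirect_translations(links, sentence_id):
    """Translations of translations (two hops), excluding direct ones."""
    direct = direct_translations(links, sentence_id)
    indirect = set()
    for middle in direct:
        indirect |= set(links.get(middle, ()))
    # Drop the starting sentence and anything already linked directly.
    return indirect - direct - {sentence_id}


# Hypothetical sentence IDs; links are stored in both directions.
links = {
    "eng:his": {"spa:su"},
    "spa:su": {"eng:his", "deu:ihr"},
    "deu:ihr": {"spa:su"},
}

# "deu:ihr" ("her") is reachable from "eng:his" only through "spa:su",
# so it is an indirect translation and may not match the English sentence.
indirect = indirect_translations(links, "eng:his")
```

A tool that builds a language pair from this kind of data could keep only the one-hop set and hold the two-hop set back for review.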

If you see a significant number of errors at Tatoeba in direct translations, especially in a particular language pair, I’d like to know about it.

2 Likes

I see a significant number of such errors. Over the past few months I’ve caught sentences in German where there were grammatical errors in the sentences themselves, like disagreements in person/number between the verb ending and the subject. I’ve found a few sentences in English that, if not grammatically incorrect, read awkwardly and seemed like something a native speaker would not say.

More commonly though I find translations that aren’t egregiously wrong but are misleading or somehow off. Like, maybe the meaning is translated correctly but the connotation is very different and not captured accurately. A common problem is differences in formality…more commonly, a highly formal sentence in one language will be translated into a highly casual one in English. Occasionally I’ve found English idioms used the wrong way, like I found one recently where someone used “laughed up her sleeve” (which means laughing secretly or behind someone’s back) to translate a sentence that meant “couldn’t stop laughing”.

The fact that I’m not really fluent in any language other than English makes me strongly suspect that there are far more of these errors than I am actually detecting. Yeah, if I pored over each sentence with a fine-tooth comb I could probably find more errors, but I’m mainly just reporting (using Clozemaster, and also commenting on Tatoeba) the sentences that I know well enough in the language in question to strongly suspect are wrong or at least less-than-ideal.

The thing that bothers me about Tatoeba though is not that there are errors, but rather that they don’t seem to have a good quality-control mechanism. All I can do is comment on a sentence and hope that the owner fixes it. There is no way (to my knowledge) for me to flag a sentence as needing review, unlink translations, or flag a pairing of sentences as having a problem. And there is no way to see whether or not another user has flagged a sentence or pairing as problematic; all you can do is look to see if there is a comment on either sentence.

3 Likes

Actually, you can do all of this once you are an advanced contributor at Tatoeba. I encourage you to apply to become one, once you have spent some time there and are comfortable with how it works. An advanced contributor can unlink sentences that should not be linked, or place tags, such as “@change”, on a sentence. Corpus maintainers (the next level of community member) periodically review sentences with tags like these, fix the problems, and then remove the tags.

1 Like