"Low Hanging Fruit" Collection?

I was looking to make an “easier” first collection (my son is just starting French, so an utter beginner) and was considering using cognates to generate the easiest possible first sentences by:

  • taking the Tatoeba download
  • taking a cognates list (e.g. Cognates)
  • scoring the sentences from most to fewest cognates
  • making a “100 common words” collection using the easiest sentences
  • adding that first hundred words to the cognates / easy list
  • making a “Next 100 common words” collection using the easiest sentences
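In case it helps, here's a rough sketch of what the scoring step might look like in Python. The file names, the Tatoeba column layout, and the very naive tokenisation are all assumptions on my part:

```python
import csv

def load_cognates(path):
    """Read a cognates file with one word per line, lowercased."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def score_sentences(pairs_path, cognates):
    """Score each Tatoeba sentence pair by its fraction of cognate words.

    Assumes the "sentence pairs" TSV layout: id, text, translation id,
    translation text. Returns (fraction, sentence, translation) tuples,
    sorted most-cognates-first.
    """
    scored = []
    with open(pairs_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            sentence, translation = row[1], row[3]
            words = [w.strip(".,!?").lower() for w in sentence.split()]
            if not words:
                continue
            fraction = sum(w in cognates for w in words) / len(words)
            scored.append((fraction, sentence, translation))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored
```

From there, "easiest" sentences are simply the ones at the top of the sorted list.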

Has anyone tried anything like this, or will I find it’s not really much easier than the default common words lists?

Or, even better, could this be done centrally across many common languages? It would lower the barrier to entry for people using Clozemaster.


I don’t have an opinion on your proposed methodology, but I recently made myself an easier starting collection for Mandarin Chinese using sentences I studied on Duolingo and LingoDeer. After a month, I was able to start using the existing collections.

Mike asked a while ago whether people were interested in having collections tied to Duolingo’s lessons. Most people said no, as did I at the time, but now that I’m attempting a more difficult-to-me language, I see the value.


I hope you don’t mind if I just focus on the language-learning aspects, and leave it to others to discuss the part related to Clozemaster.

(1) French already has a very high concentration of English cognates. I’m not sure that developing an “enriched” list is worth the trouble. It does sound like a lot of effort on your part.

(2) The list of cognates you linked to includes some rows like these:
Cake n. Bolo (Portuguese) Torta (Spanish) Gâteau (French)
While people do talk about tortes in English, I don’t think “bolo” is a cognate at all, and “gâteau” is only a cognate in a convoluted sense. The Online Etymology Dictionary has this entry for the English word “gateau” (which I have never heard used in English):

1845, from French gâteau “cake,” from Old French gastel, from Frankish *wastil “cake,” from Proto-Germanic *was-tilaz, from PIE *wes- (5) “to eat, consume.”

Unless your son speaks Old French, Frankish, Proto-Germanic, or Proto-Indo-European :slight_smile: , I don’t think “gâteau” is going to be any easier for him as a beginning French learner than any other word. So you probably would have to filter that list to remove such words.

(3) When one learns from a skewed collection, there’s always the possibility that one’s knowledge and skills end up getting skewed. Some of that is not a bad thing for a beginning learner. But saving the noncognates for later could just mean that he hits a wall when he encounters them. If there are noncognates that are important for building everyday sentences, this will only delay the point when he can do that.


Thanks, the Duolingo words idea sounds good - it’s easy to pull them all from duome.eu, and I could use that to seed my “easy words list” to find the best sentences to use.

I’ll give it a whizz and see what comes up.

Thanks, I think that list does look a bit off.

I’m thinking I might take a simpler “most common / obvious” cognates list, use it only to find the easiest-to-read Tatoeba sentences, and then use those translations.

Maybe combine a simpler cognates list with Kadrian’s idea of the words from the first 10 duolingo levels. Hopefully that can give a large batch of sentences with ideally only one unfamiliar word each.

Also I was assuming I’d pick the one “new” unfamiliar word as the cloze, so it’d be the learning point for the sentence. But maybe I should just treat it as practice / reinforcement and make one of the easier words the cloze.

The other thing I’ve just realised I’m not accounting for here is child / teenager motivation. My third criterion should really be to filter to sentences that contain an exclamation mark and up my weighting for cognates such as adolescent and bikini… can’t beat some more tabloid-style sentences to keep the interest up :slight_smile:


I can report some success… the only thing that may need some work is that I’m using the full Tatoeba download file as an input, and it looks like it’s possible to download a version filtered to verified translations instead. The other inputs are a list of 300 exact-match cognates, a list of Duolingo words per level, and a “bonus” list of words that encourage the script to use a sentence.

If anyone’s interested, this is the logic I used, where:
words = number of words in the sentence
known = number of known or cognate words in it
current = number of words from the current Duolingo level
freq_words = the next 25 most common unknown words in the Tatoeba sentences
new_words = number of unknown words that appear in freq_words

   # for each Duolingo skill level:
   # for each Tatoeba sentence:
   # (use_sentence() below is a placeholder for adding the sentence to the output)

    # if all words are known and a word from the current level is present, use it
    if known == words and current > 0:
        use_sentence()
    # if all but one word is known, a word from the current level is present,
    # and the extra word is in the next 25 most common unknown words, use it
    elif known == words - 1 and current > 0 and new_words > 0:
        use_sentence()
    # if all but one word is known, a word from the current level is present,
    # the bonus flag is set, and the sentence has at least 4 words, use it
    elif known == words - 1 and current > 0 and bonus > 0 and words > 3:
        use_sentence()
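Filled out as runnable code, the selection pass above might look something like this. It's only a sketch: names like `bonus_words` are my own guesses, and I'm counting words with simple whitespace splitting:

```python
def pick_sentences(sentences, known, level_words, freq_words, bonus_words):
    """Select sentences per the three rules: fully known, one new frequent
    word, or one new word plus a bonus word in a longer sentence."""
    chosen = []
    for sent in sentences:
        words = [w.strip(".,!?").lower() for w in sent.split()]
        unknown = [w for w in words if w not in known]
        current = sum(w in level_words for w in words)
        if current == 0:
            continue  # must contain a word from the current level
        if not unknown:
            chosen.append(sent)                    # rule 1: all words known
        elif len(unknown) == 1:
            if unknown[0] in freq_words:
                chosen.append(sent)                # rule 2: one new frequent word
            elif any(w in bonus_words for w in words) and len(words) > 3:
                chosen.append(sent)                # rule 3: bonus word + length
    return chosen
```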

I assume I can’t post a whole python script here, if anyone’s interested I’ll upload it somewhere.

I limited the output to a maximum of 5 sentences with the same cloze word, and the resulting cloze collections came out at around 100-150 sentences per Duolingo skill level. Of these:

  • about half were 100% “known words”, with the cloze being a word in the current Duo level
  • about half had all but one word known; the extra word was mostly from the top 25 next most frequent words, otherwise a random word
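For what it's worth, the "max 5 sentences per cloze word" cap is easy to express with a Counter. A sketch, where pairing each candidate sentence with its cloze word is my assumption about how the data flows:

```python
from collections import Counter

def cap_per_cloze(candidates, max_per_cloze=5):
    """candidates: (cloze_word, sentence) pairs, already in preference
    order. Keep at most max_per_cloze sentences for each cloze word."""
    seen = Counter()
    kept = []
    for cloze, sentence in candidates:
        if seen[cloze] < max_per_cloze:
            seen[cloze] += 1
            kept.append((cloze, sentence))
    return kept
```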

Cool idea! I’d definitely be interested in checking out the script if you’re up for uploading it. Also curious to hear if your son finds the resulting collection useful.


Not bad, he seems engaged and looks to have played 101 sentences already - though we’re down to 1 batch of 10 a day. He’s also just doing 1 duo lesson a day, which means I probably need to aim for around 100 sentences per level to keep it all in sync.

My nice simple script has grown a bit, as I’ve made it mix in ~10% new words as clozes. It seems to generate around 150 sentences per Duo level, with all bar one word of each sentence a “known”.

I’ve sent you the script by email reply to my sign up email if you fancy a peek…


I’ve simplified it a bit as it was getting too complicated. Here’s a Dropbox link, cm.py, if anyone wants a go.

you’ll need:

  • the big sentence list from “Sentence pairs” here: https://tatoeba.org/eng/downloads
  • a file with a load of exact cognates (csv or one per line)
  • a file with the duo level words. format “level_name,word1,word2” etc per line
  • a file with bonus words, one per line. These simply encourage the script to use a sentence a bit.
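Loading those three input files in the formats described above might look like this (a sketch; the function names are mine, not the script's):

```python
import csv

def load_word_list(path):
    """Cognates or bonus words: CSV rows or one word per line."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f):
            words.update(w.strip().lower() for w in row if w.strip())
    return words

def load_duo_levels(path):
    """Duo level words: 'level_name,word1,word2,...' per line.
    Returns an ordered list of (level_name, set_of_words)."""
    levels = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f):
            if row:
                levels.append((row[0], {w.strip().lower() for w in row[1:]}))
    return levels
```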

Play with the file names and settings on lines 65-90. The main setting is known_word_repetitions, which sets how many times you get a word as a cloze; after that, target_sentences sets the maximum number of sentences per level.

Run it with “cm.py fr” or whatever you use as a file prefix.

The basic idea is that most sentences have a cloze word from the current level and are otherwise made up entirely of known words. If that’s not possible, it finds sentences with 1 new word, then finally sentences with 2 new words if there still aren’t enough.


Cognates tend to confuse me and trap me into the bad habit of translation, and there are many, many faux amis: words that seem to be cognates but are not at all the same, like assister, which in English means to attend, not to assist.

It seems to me that users of this app, and of others such as Duolingo, are far too focused on vocabulary, whereas LingQ users, and others involved in language acquisition by input, focus on listening, listening, listening, repeat, then listen some more. Even at the beginner level.


Users of this app are awesome and can learn however works best for them. Cognates work well for some people, as does focusing on vocabulary. Moreover it’s difficult to argue that Clozemaster is anything but language acquisition by input - the whole concept is about getting input from and exposure to thousands of sentences in your target language.

Aside from that - kindly stop plugging Lingq on this forum. :slight_smile: Discussion of other resources is welcome, but >50% of your posts are promoting them.


Thanks for uploading and sharing all this! Very cool. And thanks for the email. This is similar to how collections are created for Clozemaster, just using cognates and the duo level words in place of a frequency list. Perhaps we can make cognate collections like this available as shared collections.


No worries, Mike. My main thinking was just to try to make sentences contain only “known” words. I’ve actually shortened my cognates list to a small one, just so there’s a “seed” to make the first few hundred sentences interesting but still easy; they’re soon swamped by the learnt words.

I’ve just tried a totally different (simpler!) logic, essentially exactly the same as the “FFT” collection, but trying to make it “easier” by enforcing the rule that all words in all sentences are ones that have previously been a cloze word, and I think I like that better. I’m just doing a first run-through to see if it’s any good.

my logic here is:

  • sort all sentences by score (longer sentences containing “bonus” words score higher)
  • make a target words list by interleaving the Duo vocab list and the frequency list
  • find the first sentence that contains 100% known words, except one word from the next 100 “target words”
  • make that new word the cloze and add it to the known words list
  • repeat!
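If it's useful to anyone, that interleave-and-grow loop could be sketched like this. The function and variable names are my own guesses, and I've assumed the sentences arrive already sorted by score:

```python
from itertools import chain, zip_longest

def interleave(a, b):
    """Alternate items from two word lists, skipping gaps and duplicates."""
    seen, out = set(), []
    for w in chain.from_iterable(zip_longest(a, b)):
        if w is not None and w not in seen:
            seen.add(w)
            out.append(w)
    return out

def build_collection(sentences, known, targets, window=100):
    """Greedy pass: take a sentence whose only unknown word is within the
    next `window` target words, cloze that word, and mark it known."""
    collection = []
    for sent in sentences:  # assumed pre-sorted, best-scoring first
        words = {w.strip(".,!?").lower() for w in sent.split()}
        unknown = words - known
        if len(unknown) != 1:
            continue
        cloze = next(iter(unknown))
        if cloze in targets[:window]:
            collection.append((cloze, sent))
            known.add(cloze)   # the new word becomes "known" for later sentences
            targets.remove(cloze)
    return collection
```

Because each accepted cloze word joins the known set, later sentences can lean on earlier ones, which is what should eventually produce long sentences made entirely of known words.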

My hope is I’ll start getting nice long sentences made completely of known words… work in progress, but here’s the script anyhow:


Sorry, I’m not promoting; it’s just where I learned about polyglots and that type of input. And I do use Clozemaster for just such comprehensible input, at which it definitely excels. I guess I just want to encourage people to use Clozemaster differently, not to focus only on individual words; there’s so much great content to mine in each sentence here, in the user cloze collections, and the Radio (I’m hoping for that eventually in other languages).

But you’re quite right, I shouldn’t refer to another site but rather to the linguist Stephen Krashen, who seems to have published a great deal during his career about the input hypothesis. Many apologies; I can delete my previous comments. I was hoping to engage in fruitful discussions.


btw @mike I plug Clozemaster whenever I talk about language learning–I’m just very enthusiastic for those resources that I have found useful. Again, my apologies.


@Plovdiv no problem at all, no need to delete anything, and thanks for understanding! Perhaps my comment was a bit too harsh, sorry about that, and I definitely don’t mean to discourage you from posting - the enthusiasm is very much appreciated. :slight_smile: As far as Stephen Krashen and the input hypothesis, there is indeed lots to discuss there, and we’re aiming to continue expanding the Radio languages, including Bulgarian. Thanks again!


Thanks for sharing! Curious to hear what you think of it once you’ve played it a bit. The difficulty does tend to increase quickly for the FFT (though it seems to be an advantage the more intermediate/advanced you are since you get into more interesting/difficult content more quickly). Also curious to hear with the longer sentences if you think the grammar is more difficult (depends on the language of course), but I know I’ve run into the problem of “I know what all these words individually mean, just not in that order / the way they’re arranged here” :slight_smile:


@mike Thanks for replying.