More content for Irish and more?

sadiavt · January 4, 2022, 4:16am

Hi, I’m a relatively new Pro user, but started with Clozemaster in Feb. I find the content is either too easy for me or the simplest cloze selections are constantly given… it’s a little frustrating. I am starting to add my own content and copy sentences over to my own collection with better cloze selections. I’m concerned there’s such a small number of sentences available from Tatoeba (and many are wrong), and I have a BUNCH of questions:

Is random content only pulled from Tatoeba? Any other sources due to come available or use?
Is there any other random collection available for Irish, like the Fast Track that I’ve seen mentioned, or is that only for beginners (I’m at an intermediate+ level)?
Do the sentences with random clozes get recycled with other words chosen for the clozes?
I sometimes copy a sentence with a simple cloze over to my own collection and then mark the original sentence as ‘known’. If I leave the sentence be, will it be presented to me with a different cloze in the Random Collection or should I just mark all sentences with obvious clozes as known afterwards?
Lastly, your FAQ explains how to set up a CSV file for importing a large list, but I would really appreciate seeing a sample formatted list. Is there one available?

Thanks, enjoying the program and about to introduce what I’ve found I CAN tweak to a bunch of users in the Fluent in Three Months community, but it would be good to know what can be done with less common languages.

sindaco · January 4, 2022, 12:54pm

I can’t really answer most of your questions, though @mike did ask about the utility of perhaps including Reverso Context sentences here in Clozemaster, which @morbrorper did echo would greatly benefit languages with fewer Tatoeba sentences.

Fluency Fast Track, in my understanding, normally starts with the most common words (all of which you could now just mark as “known”) and then progresses to the less common words. In Italian, and other languages, we also have the Most Common [100 / 500 / 1000 / … / >50,000] Words Collections, where you could just jump in where you’d like. I was also beyond beginner level when arriving at Clozemaster, but still found it to be very beneficial, since, even if the first few clozes are perhaps relatively easy, I was still learning a lot just from seeing them in the context of the rest of the sentence. It’s not just about translating / learning the cloze word for me, but about the cloze in the context of the overall sentence, so I will still learn a lot, even if I already know the clozes in question.

In my experience, you will indeed see a lot of the same sentences again later on, with the “less common” clozes now selected instead of the previously “most common” cloze words, which is basically what you’re fast-tracking by adding them with different clozes to your own collections. I think you can safely mark the original sentence as “known” and still be confronted with the sentence with a “less common” cloze later on, since, in my understanding, the “known” marking should apply only to that specific cloze-iteration of the sentence.

Anyway, most of these things are just speculation on my part. I mainly wanted to show a sample csv file.

You can probably most easily make them using Excel (or a Google or Open Source or other equivalent), you just need to put the “Sentence” in the first column, which is the only required column for adding sentences yourself, but it’s generally also helpful to put the “Translation” in the second column, and select the “Cloze” word of your choice in the third column. I’ve never used/included the “Pronunciation” or “Notes” columns myself, but they would be the fourth and fifth columns. You can leave the Pronunciation column blank, and still add a Note, but it would have to be in the fifth column, in order to get read in correctly. Here’s a screenshot for reference of what that would look like:

Also note that I’ve added the same sentence twice here, but with different cloze words to be selected, as the first and second entries.

From Excel (or equivalent) you can then save this file as a .csv file, which is a “comma-separated values” file. If you were to open this file again in Excel, it would still show you the contents in the columns, as expected, but if you were to open it with a text editor, you would see the content of the “columns” separated by commas instead.

Note that the “Notes” column contained a comma for the second entry (the only one with a note), so Excel put quotation marks around that note. Otherwise the importer would think there was a sixth (undeclared) column with " just for illustratory purposes" as its content for that row. This is something to pay attention to if you’re not using Excel (or equivalent) to generate the csv file, but rather trying to create it yourself manually.
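
For anyone building the file with a script rather than a spreadsheet, here is a minimal sketch in Python of the column layout described above. The Irish sentence, its translation and the note text are made-up placeholders rather than the contents of the screenshot, and leaving out a header row is an assumption, so it is worth checking the FAQ for what the importer actually expects.

```python
import csv
import io

# Column order as described above:
# Sentence, Translation, Cloze, Pronunciation, Notes.
# These rows are illustrative placeholders, not Clozemaster or Tatoeba data.
rows = [
    # The same sentence twice, each time with a different cloze word.
    ["Tá an aimsir go maith inniu.", "The weather is good today.", "aimsir", "", ""],
    # Pronunciation is left blank; the note contains a comma.
    ["Tá an aimsir go maith inniu.", "The weather is good today.", "inniu", "",
     "A made-up note, just for illustration"],
]


def to_csv_text(rows):
    """Serialise rows to CSV text; csv.writer quotes any field containing a comma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)

# Reading the text back shows that the quotation marks only protect the comma:
# a CSV parser strips them again and still sees exactly five columns per row.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

To write to disk instead, the same `csv.writer` can be given a file opened with `open("sentences.csv", "w", newline="", encoding="utf-8")` (the filename is just an example). UTF-8 matters here so that accented letters such as á and é survive the round trip.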

If any of this is still unclear, or you’re running into any errors creating or adding the csv file to a collection, just give a shout and I’d be happy to try to help troubleshoot and/or try to explain things more clearly.

sadiavt · January 4, 2022, 5:54pm

@sindaco this is really helpful, especially seeing the example you posted with the spreadsheet and csv example. I do run across issues sometimes when I want to put a word with a ’ sign in a cloze… As in the word d’fhéach (I watched), Clozemaster sometimes breaks apart the d from the rest of the word… I am wondering if quotations around the sentence would help there too?

Oh other question… if one wanted to put a pronunciation in but it’s not offered as a feature for the language, how would or could I do that? A link to a sound file online?

Thanks again!

sindaco · January 5, 2022, 10:48am

Glad it’s helpful

I think this might be a bit trickier, because there are languages, like English, where apostrophes are sometimes used to denote contractions of multiple words (e.g. “it’s”, “wouldn’t”, though the latter would not be a correct contraction of “wouldn” + “’t”). I guess here it kind of depends on how Clozemaster chooses to parse these, either in general or language-specifically (e.g. in Dutch too the apostrophe can be part of the actual cloze word, if you are constructing the plural in certain cases, like “foto” → “foto’s”).

However, I’m just realising I’ve seen clozes in Italian in Custom Collections (added by fellow members) which had multiple words selected as the cloze and also contained apostrophes. There, however, I was not able to enter the answers, because the apostrophe was a different character from the one my keyboard could type. I could only enter it by manually selecting it from a symbol picker. In the end I just added an identical copy using the “normal” apostrophe as an “Alternative answer”. So I’m wondering if perhaps it’s something in the Clozemaster interface when selecting a cloze containing an apostrophe in an existing sentence. I’ll have a bit of a play around with it to see if I can discover anything more (and perhaps also try a d’fhéach cloze).
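
One workaround for the mismatched apostrophe, if you prepare sentences yourself before importing, is to convert typographic apostrophes to the plain keyboard one. The sketch below is only an illustration of that idea in Python: the function name is my own, the list of variants is not exhaustive, and it says nothing about how Clozemaster itself compares answers.

```python
# Characters that commonly stand in for an apostrophe in typeset text,
# each mapped to the plain ASCII apostrophe found on most keyboards.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark (the usual "curly" apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
}
_TRANSLATION_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalise_apostrophes(text: str) -> str:
    """Return text with the apostrophe look-alikes above replaced by "'"."""
    return text.translate(_TRANSLATION_TABLE)


# A cloze copied from a web page often carries the curly form.
curly_cloze = "d\u2019fhéach"
plain_cloze = normalise_apostrophes(curly_cloze)
print(plain_cloze)
```

Running every sentence and cloze through the same function before writing the csv keeps the two consistent, so the cloze text still matches the sentence it was taken from.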

I was also wondering how this might work. I actually did a quick internet search for this yesterday, before posting, but I couldn’t locate anything insightful quickly. I was wondering if it might just parse for instance the denotation in the International Phonetic Alphabet, with corresponding syllable stress marks. I might have a bit of a play around with this too. The idea of including a link to an online sound file sounds like a very valid option too.

sadiavt · January 7, 2022, 2:57pm

Thanks for your thoughts and experiments here! I will continue to experiment too.

mike · January 8, 2022, 2:42pm

To be sure - do you mean record yourself, or upload an mp3 for example, or both?

sadiavt · January 8, 2022, 2:58pm

Hi @mike either one would be great.

sindaco · January 8, 2022, 3:11pm

Just curious (not (necessarily) a feature request), would there be any possibility to provide e.g. the IPA notation and have it read out with TTS? I’m just wondering how the “pronunciation” column was officially intended to be used during the csv upload process. Perhaps just to display as a written clarification underneath the sentence during playing / reviewing, as I think I’ve seen in some existing sentences / collections?

sadiavt · January 9, 2022, 3:19pm

@mike would there be a way to use this word frequency list for Irish to add more content: michmech/irish-word-frequency on GitHub (about 6,500 Irish lemmas ordered by corpus frequency, with noise removed)? Or this: focal.ie? At least the 100 most common words, etc., and the Cloze Reading features that other languages have? I know there are Wikipedia articles written in Irish as well (Vicipéid).

MsFixer · January 16, 2022, 4:14am

Hello @sadiavt
As a native Japanese speaker who is learning Indonesian from English, I understand your point. (Relatively) “minor” languages have limited options for learners like us.
According to the FAQ page, Clozemaster refers to frequency lists suggested by Wiktionary (a sister project of Wikipedia) except for the Japanese course – in your case in Irish, yes, it refers to michmech’s list posted on GitHub. And in my case in Indonesian, hermitdave’s list based on OpenSubtitles.org. I was surprised that the Irish list has only 6,500 lemmas. And Tatoeba’s stats page says it has only 1,000+ sentences in Irish. They are too small.

I compared hermitdave’s Indonesian list (i.e. the one suggested by Wiktionary) with another corpus called the “Leipzig Corpora Collection” (LCC), provided by the Department of Computer Science at the University of Leipzig. It turned out that LCC is much better than the frequency lists suggested by Wiktionary in the following respects:

LCC crawled a massive number of real-world websites including news articles, government reports, corporate websites, Wikipedia articles, and personal blogs. The source is well-balanced. The frequency ranking of LCC makes much more sense to me than that of Wiktionary.
LCC provides which websites it crawled (i.e. the sentence source). Using LCC is legally safer than Wiktionary in terms of copyright matters (particularly, attribution rights).

I would suggest the following:
STEP 1) download the Irish frequency list from LCC (it’s free!)
STEP 2) re-order the list by frequency on your spreadsheet (e.g. Microsoft Excel)
STEP 3) flag which words you don’t know yet
STEP 4) look up such unfamiliar words in the LCC Irish online search (it’s also free!)
STEP 5) pick up real-world example sentences using the unfamiliar words
STEP 6) create your own sentence collection on Clozemaster as per sindaco’s wonderful instructions on data import. Don’t forget to include the source hyperlink of LCC in the “Notes” column of the CSV file.
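
If you would rather script STEP 2 and STEP 3 than do them in a spreadsheet, a rough sketch in Python could look like this. It assumes the downloaded word list is tab-separated with an id, the word and its frequency on each line, so please check the actual file first. The sample data and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the assumed layout: id <TAB> word <TAB> frequency.
SAMPLE = "1\tagus\t900\n2\tscannán\t12\n3\tan\t1500\n4\taimsir\t40\n"


def load_frequencies(handle):
    """Read (word, frequency) pairs from a tab-separated file-like object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[1], int(row[2])) for row in reader if len(row) >= 3]


def unknown_by_frequency(pairs, known_words):
    """Sort by descending frequency (STEP 2) and keep only unknown words (STEP 3)."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [(word, freq) for word, freq in ranked if word not in known_words]


pairs = load_frequencies(io.StringIO(SAMPLE))
known = {"agus", "an"}  # words you have already learned
todo = unknown_by_frequency(pairs, known)
print(todo)
```

For a real file, replace `io.StringIO(SAMPLE)` with `open(path, encoding="utf-8", newline="")`, where `path` points at the word list you downloaded.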

The LCC’s Terms of Use clearly state that you can use the downloadable data under CC-BY. This license means you can reuse the data even for commercial purposes (i.e. creating your own collection on the commercially driven Clozemaster is okay) IF you mention that your collection is based on LCC (i.e. BY means you need to display on your material who the original copyright holder is).

Please note that the LCC list also contains improper entries. For example, LCC is case sensitive, meaning “You”, “you” and “YOU” are regarded as three different items on LCC. But it is still helpful for learners to identify which words they don’t know yet and which words they should prioritize memorizing next.
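
The case-sensitivity issue can be reduced by merging the counts of entries that differ only in capitalisation. Here is a small sketch in Python of that idea. The function name and the sample counts are my own and do not come from LCC.

```python
from collections import Counter


def merge_case_variants(pairs):
    """Add up frequencies of entries that differ only by letter case.

    Returns (word, total) pairs sorted from most to least frequent,
    with every word reported in its case-folded form.
    """
    totals = Counter()
    for word, freq in pairs:
        totals[word.casefold()] += freq
    return totals.most_common()


# Made-up counts illustrating the "You" / "you" / "YOU" situation.
sample = [("You", 30), ("you", 100), ("YOU", 5), ("the", 120)]
merged = merge_case_variants(sample)
print(merged)
```

One caveat: folding case also merges proper nouns with ordinary words, and in Irish it flattens forms such as “nGaeilge” where a lowercase prefix sits before a capital letter. That is fine for counting, but it is worth keeping the original list at hand for the correct spelling.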

Hope this helps!

sadiavt · January 17, 2022, 9:50pm

@MsFixer Thanks for your very informative reply. I will definitely look into this. I’m already changing and making my own Cloze collections of sentences using @sindaco’s helpful schematics and that’s been great so far. I’m mining different sources for sentences and often changing them up. I need to pester some of my native Irish-speaking friends to see if they’d contribute sentences to Tatoeba.
Thanks again!