Universal language support in Kolibri

Published in

Learning Equality

8 min read · Feb 21, 2022

Devon Rueckner worked at Learning Equality as Product Development Lead and Product Manager from 2016 until the end of 2021. He is now starting a photography technology company called Basil Gang. In this guest blog post, Devon shares insider details about how Learning Equality approaches internationalization, writing systems, type rendering, and product design in Kolibri.

As of version 0.15, the Kolibri Learning Platform is available in 30 languages and can render educational content in nearly every written language. With consistent and legible typography. And fast loading times. On every major operating system. Without ever accessing the internet.

A GIF showing “Welcome to Kolibri!” in 30 languages

It wasn’t always this way. Follow along with this post to learn a bit about the challenges we faced in creating a fully internationalized application, why it matters, and how these efforts reflect the broader approach that Learning Equality takes to product design and engineering.

The early days

2018 was a big year for internationalizing Kolibri. Having just finished adding right-to-left language support (thanks in part to RTLCSS), we were able to add Arabic, Farsi, and Urdu to the existing left-to-right English, Portuguese, Spanish, and Swahili languages.

Riding this momentum and with support from the global Learning Equality community, within a few months, we had translated Kolibri into 11 more languages — for a total of 18! This work allowed Kolibri to reach many new learners. Because learning in a first language (or “mother tongue”) is more effective than learning in a second language, the new languages also increased Kolibri’s potential to support existing bilingual learners.

Unfortunately, not all was well in type town: as we added new languages, we began to notice cracks in our typographic foundations. An early sign was the Yorùbá language, which uses mostly Latin script along with an ‘underdot’ diacritic (◌̣) to modify pronunciation. When Yorùbá was initially included, any character with an underdot was rendered in an unsightly serif font:

Screenshot showing Kolibri in Yorùbá with incorrect rendering of some characters: characters with an underdot are rendered in a different font

Let’s take a brief detour to consider an important question: does it matter? Product managers have a wide variety of tools and tropes to help triage issues like these when trying to deliver great products on time: “Perfect is the enemy of done,” “prioritize the critical user journeys,” “focus on objectives and key results,” and one of my favorites, stolen from David Allen of GTD: “We can do anything, but not everything”. The text doesn’t look great, but it’s legible and better than nothing. Ship it, right?

While these aphorisms are certainly practical, they also need to be applied carefully or they risk becoming convenient excuses to cut corners at the expense of some subset of users. The percentage of people using Kolibri in Yorùbá may be relatively small, but for a particular Yorùbá speaker who uses Kolibri daily, it will look bad 100% of the time. When we acknowledge that we would not accept text rendering like this in our own languages, it becomes obvious that it’s equally unacceptable for any other language we purport to support.

That said, those errant serifs were just the tip of the iceberg; the problems ran much deeper.

Goals of a unified typeface

For Kolibri’s typeface, we chose Google’s Noto Sans, which claims to have a set of consistently designed, typographically correct glyphs for every single one of the 137,929 characters across the 150 writing systems cataloged in the 2019 Unicode Standard. That is to say, this font has “almost everything,” including the scripts of many endangered languages. This is the kind of audaciously ambitious project that benefits from the resources and scale of an organization like Google, and the open license made it a perfect fit for distribution and use in Kolibri.

Mixing fonts isn’t “wrong” per se, but it makes the text less harmonious across the application. Compare the examples below. On the left, there is a separate font for each writing system; on the right, there is consistent use of Noto for all writing systems:

Two renderings of a page that includes type from multiple language systems — Examples of Gujarati, Arabic, Latin, and Cyrillic scripts showing mixed versus consistent fonts

However, these are mainly aesthetic concerns, and falling back on system fonts is a good strategy for keeping the application lightweight.

A far worse outcome is what happens when a device doesn’t have any compatible font for some text. In this scenario, text is replaced with indecipherable “not defined” glyphs — typically some boxes like ⍰⍰⍰. (At some point, people started calling these glyphs “tofu” after the similar-looking blocks of bean curd, and some nerdy Googler named the “Noto” typeface after “no tofu”.) We have no control over which fonts are installed on the devices that Kolibri users are running, and often the users don’t have control, either! It became clear that we could not both rely on system fonts and also be able to guarantee that Kolibri will work for everyone in all languages.
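To make the failure mode concrete, here is a minimal Python sketch of per-character font fallback. It is a simplification: real text engines shape whole runs of text rather than single characters, and the font names and coverage sets below are hypothetical stand-ins for whatever happens to be installed on a device.

```python
TOFU = "\u25A1"  # WHITE SQUARE, used here as a stand-in for the "not defined" glyph


def render_with_fallback(text, font_stack):
    """Return (shown_char, font_name) pairs for each character in `text`.

    `font_stack` is an ordered list of (name, coverage_set) pairs.
    When no font covers a character, it is shown as tofu and font_name is None.
    """
    result = []
    for ch in text:
        font = next((name for name, coverage in font_stack if ch in coverage), None)
        result.append((ch if font else TOFU, font))
    return result


# Hypothetical device with only a Latin and a Cyrillic system font installed.
latin = set(map(chr, range(0x20, 0x250)))
cyrillic = set(map(chr, range(0x400, 0x500)))
stack = [("SystemLatin", latin), ("SystemCyrillic", cyrillic)]

# A Hindi word (six Devanagari code points), written with escapes for clarity.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"
shown = "".join(ch for ch, _ in render_with_fallback("Hi Привет " + hindi, stack))
print(shown)
```

The Latin and Cyrillic text survives, while every Devanagari code point turns into a box, because nothing in the stack covers that script.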

Our goal was clear and simple: embed Noto in Kolibri, and ensure that every writing system renders beautifully. Easy, right?

A few bytes at a time

The primary problem was performance.

Perhaps unsurprisingly, a set of fonts that has glyphs for every character in existence is going to be large. The basic set of font files is just under 500 MB, and this doesn’t even include the set of Chinese, Japanese, and Korean (CJK) characters, which are much, much bigger. Kolibri is designed to run on highly resource-constrained devices and networks. Even if we could embed 500 MB worth of fonts in the server (which we cannot), there’s no way we could require every user’s tablet, laptop, or phone to load them. We needed to lose weight.

To start, we decided to make a UI design compromise and constrain the font variants we would allow in Kolibri. While Noto has dozens of variants (‘normal’, ‘italic’, ‘bold’, ‘bold italic’, ‘condensed’, ‘extra-condensed semi-bold italic’…) we decided that we would limit ourselves to only ‘normal’ or ‘bold’. It turns out that — combined with sizing, color, and layout — this is still sufficient to establish a strong visual hierarchy in the application. Limiting the variants brought the size down to about 10 MB: still too big for the web clients, but small enough to distribute with the server.

Because it works offline, Kolibri needs to come with all languages pre-installed. However, most users will just choose a single one during setup and never touch the others. Therefore, the next thing we tried was to initially load only the font for the currently selected language while loading additional language fonts on an as-needed basis. The combined size for the ‘normal’ and ‘bold’ font variants for a single language is typically around 300 kB. This is not completely unreasonable, but it’s still too big for a fast initial page load.

The Kolibri UI has about 10,000 words of text that contain every user-facing message — from learner encouragement (“Keep up the great progress!”) to errors (“A lesson with this name already exists”) to simple actions (“Save” and “Cancel”). It turns out that all of this translated text can be rendered with only a fraction of a typical language’s font, so our next trick was to create a custom subset font for each language including only the characters needed. These subset fonts weigh around 30 kB. Now we’re getting somewhere — 30 kB down from 500 MB! (When Kolibri is updated and re-translated during major releases, these custom subset fonts need to be regenerated. Our automated font subset tool would never have been possible without the incredible fontTools library and its crew of brilliant maintainers.)
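As a rough illustration of the first step (not Kolibri’s actual tooling), the sketch below collects the distinct code points used by a set of UI messages and prints them as a comma-separated `U+XXXX` list. The function names are hypothetical. A list in this form is the kind of input that a subsetting tool can consume; fontTools’ `pyftsubset` command, for instance, has a `--unicodes` option for this purpose.

```python
def required_codepoints(messages):
    """Return the sorted, de-duplicated code points used across all messages."""
    return sorted({ord(ch) for text in messages for ch in text})


def as_unicode_list(codepoints):
    """Format code points as a comma-separated list such as 'U+0041,U+0042'."""
    return ",".join(f"U+{cp:04X}" for cp in codepoints)


# Two of the UI messages quoted above.
messages = ["Save", "Cancel"]
cps = required_codepoints(messages)
print(len(cps), as_unicode_list(cps))
```

Even these two words share letters, so ten characters of text need only eight distinct code points. Across a full translation the same effect is what lets a subset stay small.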

A minor digression: it turns out that we can’t just look at the characters used in the application text to make a working subset font; we need to look at the words. In many writing systems, the visual form of letters can change based on the other letters around them, similar to the concept of a ligature. For example, in the Devanagari writing system, adjacent consonants in a word will often merge to form a single conjunct consonant glyph such as र् + क = र्क. Without taking this into account, we would end up with some very unfortunate behaviors:

Screenshot showing Hindi text rendered with missing conjunct consonant glyphs: incorrect rendering of Hindi text due to missing conjunct glyphs
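A small Python sketch shows why the conjunct is a property of a sequence rather than of any one character. The conjunct र्क is stored as three code points, and the merged glyph is produced by the font from that sequence. The helper `conjunct_sequences` is hypothetical and deliberately crude: it only finds letter + virama + letter triples, which is enough to illustrate the idea but not a full model of Devanagari shaping.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, which joins adjacent consonants


def conjunct_sequences(word):
    """Return the letter + virama + letter sequences found in `word`."""
    found = set()
    for i in range(1, len(word) - 1):
        is_letter_pair = (
            unicodedata.category(word[i - 1]) == "Lo"
            and unicodedata.category(word[i + 1]) == "Lo"
        )
        if word[i] == VIRAMA and is_letter_pair:
            found.add(word[i - 1:i + 2])
    return found


conjunct = "\u0930\u094d\u0915"  # the conjunct from the example above
names = [unicodedata.name(ch) for ch in conjunct]
for ch, name in zip(conjunct, names):
    print(f"U+{ord(ch):04X} {name}")

# A Hindi word that contains one such sequence in the middle.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(conjunct_sequences(word))
```

None of the three code points is the conjunct itself, so a tool that only sees an unordered set of characters has no direct record of which merged glyphs the text relies on. Working from whole words keeps that information.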

With that problem solved, we seemed to have achieved our goal. We had constructed a set of small fonts, each containing the minimum set of glyphs necessary to render the application in one language, making initial page loads fast. Additional fonts are loaded only as necessary, such as when a user opens content with additional glyphs or switches to a different language.
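The post does not say which mechanism performs the on-demand loading. One standard option is the CSS `unicode-range` descriptor, which lets a browser skip downloading a font file until the page uses a character in the declared range. The Python sketch below generates such a rule from a set of code points, under that assumption. The family name and URL are placeholders.

```python
def to_ranges(codepoints):
    """Collapse code points into sorted, inclusive (start, end) runs."""
    ranges = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))  # start a new run
    return ranges


def unicode_range(codepoints):
    """Format code points as a CSS unicode-range value."""
    parts = []
    for start, end in to_ranges(codepoints):
        parts.append(f"U+{start:X}" if start == end else f"U+{start:X}-{end:X}")
    return ", ".join(parts)


def font_face(family, url, codepoints, weight="normal"):
    """Build an @font-face rule limited to the given code points."""
    return (
        "@font-face {\n"
        f"  font-family: '{family}';\n"
        f"  src: url('{url}') format('woff');\n"
        f"  font-weight: {weight};\n"
        f"  unicode-range: {unicode_range(codepoints)};\n"
        "}"
    )


# Placeholder family name and URL; the characters are arbitrary.
print(font_face("noto-subset", "fonts/example.woff", map(ord, "abcxz")))
```

With one rule per subset font, the stylesheet can list every font up front while the browser fetches only the files whose ranges actually appear on the page.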

That all worked great, except for one problem — the language-picker breaks everything:

Screenshot of the language-picker user interface: the list of languages uses characters from many writing systems

As soon as this dialog is shown, the browser needs to render glyphs from many writing systems simultaneously, triggering the download of every corresponding font! To avoid this, Kolibri has one last trick: it contains one more special subset font with essentially just what is necessary to render the set of language names, and embeds this in the page for all languages.
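The effect of that extra font can be sketched with a toy model in Python. Each font is described by the set of characters it covers, and each character is served by the first font in the list that covers it. The language names, font names, and block ranges are illustrative assumptions, and the ordering rule is a simplification of how browsers choose between overlapping fonts.

```python
def fonts_fetched(text, font_faces):
    """Return the names of the fonts needed to draw `text`, in first-use order.

    `font_faces` is an ordered list of (name, coverage_set) pairs; each
    character is served by the first face that covers it.
    """
    fetched = []
    for ch in text:
        for name, coverage in font_faces:
            if ch in coverage:
                if name not in fetched:
                    fetched.append(name)
                break
    return fetched


# A few sample entries from a language picker.
language_names = ["English", "Español", "العربية", "हिन्दी"]
picker_text = "".join(language_names)

# Hypothetical per-script fonts, approximated by Unicode block ranges.
faces = [
    ("latin", set(map(chr, range(0x20, 0x250)))),
    ("arabic", set(map(chr, range(0x600, 0x700)))),
    ("devanagari", set(map(chr, range(0x900, 0x980)))),
]

before = fonts_fetched(picker_text, faces)

# One extra subset that holds only the characters in the language names.
picker_face = ("language-names", set(picker_text))
after = fonts_fetched(picker_text, [picker_face] + faces)

print(before)
print(after)
```

Without the dedicated subset, showing the picker touches every script’s font. With it, a single small font covers the whole dialog.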

And with that, we are done! …almost. We haven’t addressed the CJK fonts yet…

The bigger picture

This post has been a deep dive into one very specific example of Learning Equality’s commitment to equitable access to education, and how this commitment drives design and engineering decisions on the product team.

This is representative of innumerable other examples over the years. These range from essential yet subtle features such as keyboard and screen reader accessibility details to flagship features such as take-home devices supporting learners affected by the pandemic. Some of this work gets wider visibility. However, most of it will be lost to old Figma boards, GitHub PRs, and Google Docs, perhaps unconsciously appreciated by the millions of learners and coaches using Kolibri around the world.

I was fortunate to have spent six wonderful years working on Kolibri at Learning Equality. One thing that constantly impressed me is how the team is both deeply idealistic and practical at the same time, because these traits can easily be in tension. In many complex situations, there may simply not be a “correct” answer, only a best effort given the information and resources and constraints at hand. Working at Learning Equality, I learned a lot about how to listen carefully, make decisions thoughtfully, move forward persistently, and then repeat.

Thank you to everyone who helped review drafts of this article!
