
AI-Driven Caption Generation in Kolibri Studio

Photo by Andy Kelly on Unsplash

This summer, I had the incredible opportunity to work with Learning Equality as a student developer, taking part in the Google Summer of Code program. Learning Equality is a nonprofit organization at the forefront of making quality education accessible to all. In this blog post, I will share my experience throughout Google Summer of Code 2023.

About the project

The project’s primary goal was to enhance Kolibri Studio, Learning Equality’s open source curricular tool, by automating the creation and editing of captions for audio and video resources. This would improve accessibility for learners. The main objectives were to develop front-end support for initiating the caption generation process for selected content (audio or video files) and to implement automatic caption generation on the backend. The main focus during this GSoC project was perfecting the automatic caption generation process, with potential future enhancements for a front-end editor.

change_event(plan) ≈ f(x)

By the end of the community bonding period, I had a plan: the user indicates that they want captions for ‘this’ video (or audio) → this information gets to the backend → the backend starts a Celery task (the actual generation) → when it finishes, we acknowledge the user → populate the frontend.

I needed to build on the already existing architecture in Kolibri Studio, so I had to leverage the change event architecture.

I had this understanding before the community bonding period started (here is my proposal), but it became much clearer when I met my mentors, Blaine Jester and Samson Akol.

Initiating the Django Model Design

It began with the idea of using a JSON Schema to validate caption edits before saving them to the model. The goal was to ensure the accuracy and integrity of the captions, making it possible to share them with other editors who might be working on the same content (multiple user edits).

An overview of the envisioned model schema with JSONField (which proved to be impractical):

GeneratedCaption

- id (UUID)
- language (string), e.g. 'en', 'hi'
- caption (JSONField), e.g.:

  {
    "captions": [
      {
        "id": "1",
        "text": "- Where are we now?",
        "startTime": 74.815,
        "endTime": 78.114
      },
      {
        "id": "2",
        "text": "- This is big bat country.",
        "startTime": 78.113,
        "endTime": 80.991
      },
      {
        "id": "3",
        "text": "- [ Bats Screeching ]\n- They won't get in your hair.",
        "startTime": 81.058,
        "endTime": 83.868
      }
    ]
  }
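For illustration, here is a minimal sketch of how such validation could work with the Python jsonschema package. The schema itself is my reconstruction from the field layout above, not code from Studio, and incoming_edit stands in for the edit payload sent by the frontend.

from jsonschema import validate

# Reconstructed schema for the envisioned JSONField payload shown above.
CAPTION_SCHEMA = {
    "type": "object",
    "properties": {
        "captions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "text": {"type": "string"},
                    "startTime": {"type": "number"},
                    "endTime": {"type": "number"},
                },
                "required": ["id", "text", "startTime", "endTime"],
            },
        }
    },
    "required": ["captions"],
}

# Raises jsonschema.exceptions.ValidationError if an incoming edit payload is malformed.
validate(instance=incoming_edit, schema=CAPTION_SCHEMA)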

But one of the main issues was that user edits are processed asynchronously on the backend, which made synchronously validating potential cue overlaps impractical.

The complexity arose from managing concurrent edits by multiple users in different browsers. The changes were processed serially on the backend, which meant that the frontend couldn’t rely on locking changes or rejecting edits. This posed significant hurdles in resolving conflicts.

The Need for a Structured Approach: SQL Database and Normalization

It became apparent that using a SQL database like PostgreSQL for structured data storage was the way forward. Storing extensive JSON data in a single field created inefficiencies.

For instance, in the case of a lengthy video with a substantial transcript, storing the entire transcript as a 1 MB JSON meant that every user edit required reading and writing the entire 1 MB of data. The current change event architecture wasn’t optimized for such operations.

Normalizing

As the discussions unfolded, my mentors and I contemplated the transition to a normalized structure. Two options were considered:

  1. Two Tables with Foreign Key Relationship: The first option involved creating a separate table for captions with foreign key relationships to another table containing metadata like file_id and language.
  2. Single Table with Grouping Fields: The second option suggested using a single table where each row included file_id and language as grouping fields to describe a caption set, ordered by start_time.

After careful consideration, we chose the first option, embracing a more normalized structure for caption management.

So now we have two tables with the following schema:

CaptionFile

- id (UUID)
- file_id (UUID): Associates an audio/video file with the caption file
- language (FK to Language model): Represents the language of the caption file
- modified (DateTime): Indicates the date and time of the last modification
- output_file (Foreign Key to File): Links to the associated generated WebVTT file

(`modified` and `output_file` will be used for publishing)

CaptionCue

- id (UUID)
- text (Text): Contains the caption text
- starttime (Float): Start time of the cue in seconds
- endtime (Float): End time of the cue in seconds
- caption_file (Foreign Key to CaptionFile): Establishes a relationship with the related caption file
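A minimal Django sketch of these two tables, assuming the fields listed above; details such as related_name values, on_delete behavior, and where the Language and File models live are illustrative rather than Studio's exact code.

import uuid

from django.db import models


class CaptionFile(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    # The audio/video file this caption set belongs to.
    file_id = models.UUIDField()
    language = models.ForeignKey("Language", related_name="caption_files", on_delete=models.CASCADE)
    # Used at publish time to detect whether the generated WebVTT output is stale.
    modified = models.DateTimeField(auto_now=True)
    # The generated WebVTT file, created during publishing.
    output_file = models.ForeignKey("File", null=True, blank=True, on_delete=models.SET_NULL)


class CaptionCue(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    text = models.TextField()
    starttime = models.FloatField()  # seconds
    endtime = models.FloatField()  # seconds
    caption_file = models.ForeignKey(CaptionFile, related_name="caption_cue", on_delete=models.CASCADE)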

DRF Viewsets for Data Mastery

Once we had Django models for storing caption files and cues, I created serializers and viewsets for these models. Here are some endpoints that I created for caption files:

  • GET /api/captions
  • GET /api/captions?contentnode__in=
  • GET /api/captions?file_id=
  • GET /api/captions?language=
  • POST /api/captions
  • POST /api/captions/<id>

and for caption cues:

  • GET /api/captions/<caption_id>/cues
  • POST /api/captions/<caption_id>/cues
  • GET /api/captions/<caption_id>/cues/<cue_id>

In practice, the frontend primarily uses the GET calls to retrieve data. Any significant changes or modifications typically go through the change event architecture.
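A rough sketch of the serializer and viewset behind these endpoints; the import path, base classes, and filter setup are assumptions (Studio's real viewsets build on its own sync/change-event base classes):

from rest_framework import serializers, viewsets

from contentcuration.models import CaptionFile  # assumed import path


class CaptionFileSerializer(serializers.ModelSerializer):
    class Meta:
        model = CaptionFile
        fields = ["id", "file_id", "language"]


class CaptionFileViewSet(viewsets.ModelViewSet):
    queryset = CaptionFile.objects.all()
    serializer_class = CaptionFileSerializer
    # Enables the ?file_id= and ?language= query parameters listed above
    # (requires django-filter's DjangoFilterBackend to be configured).
    filterset_fields = ["file_id", "language"]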

IndexedDB Resources and Vue Caption Modal

Okay, now using all the above pieces: whenever the EditModal is opened by the user in the ChannelEdit SPA (Single Page Application), we want to load all the caption cues of a content node, if they exist, from the backend and store them on the frontend. Studio already uses IndexedDB for this, so I created IndexedDB resources for both CaptionFile and CaptionCue and linked them with Vuex actions and the viewsets (via the URL's basename). This fills the IndexedDB rows for any CaptionFile associated with a content node (via GET).

In addition to that, I created a CTA in the Caption Modal with a language drop-down. The language drop-down shows only the intersection of:

  1. languages supported by Kolibri
  2. languages supported by the ASR model

Caption modal in the edit view with a CTA and a language drop-down

Synchronizing Change Events, Enqueuing Celery Tasks and Dexie Live Queries

After the CTA is clicked, we need a way to keep the frontend informed of what’s happening in the backend (generating captions or not). Actually, it’s easy — in Studio’s change event architecture, we have flags like __COPYING, __TASK_ID, and __LAST_FETCH that are used to track state. I created a new flag called __GENERATING_CAPTIONS to track the caption generation state.

When a user clicks the “generate captions” button, a new row is added to the IndexedDB resource on the frontend. The change event architecture detects changes to front-end resources and sends the updates to the backend. The backend then uses the POST viewset to save the new information (file ID, language) to the Django ORM in Postgres.

In addition to saving data in Postgres, a change event is created to set the __GENERATING_CAPTIONS flag to true for the content node when caption generation is triggered. This allows tracking which nodes have pending caption generation tasks.

To listen for changes to this flag, I set up a Dexie LiveQuery on the IndexedDB resources. This keeps the frontend updated on the caption generation status.

The Vuex store has two relevant maps:

/* List of caption files for a contentnode
 * [
 *   contentnode_id: {
 *     pk: {
 *       id: pk
 *       file_id: file_id
 *       language: language
 *       __generating_captions: boolean
 *     }
 *   },
 * ]
 */
captionFilesMap: [],

/* Caption cues for a contentnode
 * [
 *   contentnode_id: {
 *     caption_file_id: {
 *       id: id
 *       starttime: starttime
 *       endtime: endtime
 *       text: text
 *     }
 *   },
 * ]
 */
captionCuesMap: [],

By tracking the caption generation status flags in IndexedDB, the frontend can display up-to-date status for in-progress caption generation tasks as they are updated in the back-end database. The Vuex maps allow convenient access to caption files and cues for display, and a getter is used to drive the loading state in the frontend.

On the back-end side, after saving the file ID and language of the caption file to Postgres and setting the generating flag to true, we enqueue a Celery task which starts the caption generation. Once the task finishes, we set the flag back to false.
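In outline, that back-end flow looks roughly like the sketch below; the task body, helper names (run_speech_to_text, emit_generating_flag_change), and import paths are illustrative rather than Studio's actual implementation.

from celery import shared_task

from contentcuration.models import CaptionCue, CaptionFile  # assumed import path


@shared_task
def generate_captions(caption_file_id):
    """Generate cues for one CaptionFile and flip the generating flag when done."""
    caption_file = CaptionFile.objects.get(pk=caption_file_id)

    # Run speech-to-text via the adapter described in the next section (illustrative helper).
    cues = run_speech_to_text(caption_file)
    CaptionCue.objects.bulk_create(
        CaptionCue(caption_file=caption_file, **cue) for cue in cues
    )

    # Emit a change event so the Dexie live query on the frontend sees the
    # __GENERATING_CAPTIONS flag go back to false (illustrative helper).
    emit_generating_flag_change(caption_file.file_id, generating=False)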

AppNexus: Bridging Django App and Backend Alchemy

Running a deep learning model directly in a Celery worker is not ideal; we don’t want to do that in production or in local development. Whisper is not the only deep learning model that will be added to Studio to improve its capabilities: the LE team is also working on a search and recommendation engine. Also, different models should run in different environments, e.g. openai/whisper-large in production and openai/whisper-tiny in local development. We needed a flexible, plug-and-play way to choose the back-end service we want to use. Within the Contentcuration Django application, we want to build an API layer that acts as a bridge for communicating with different backends like Docker containers, Google Cloud Vertex AI, and VM instances. Our goal is to ensure this API layer can work with these backends regardless of where or how they run. As long as the input and output formats remain the same, this setup provides flexibility in choosing and leveraging back-end resources.

The standalone deployed back-end services will not have direct access to Contentcuration models. So the envisioned API layer must facilitate providing a URL to access required file resources and return a standardized response, irrespective of the backend used.

This GitHub issue provides a good description of the architecture for the API Layer (AppNexus): https://github.com/learningequality/studio/issues/4275

For the MVP, we can directly create an Adapter for the locally hosted Whisper Backend without using the manager class and generate transcriptions in the Celery task.

from contentcuration.utils.transcription import WhisperBackendFactory
from contentcuration.utils.transcription import WhisperAdapter

# The backend factory will import this backend from settings, e.g.:
# WHISPER_BACKEND = 'contentcuration.utils.transcription.LocalWhisper'
backend = WhisperBackendFactory().create_backend()
adapter = WhisperAdapter(backend=backend)

# Transcribe the media associated with the caption file, then read back the cues.
response = adapter.transcribe(caption_file_id)
response.get_cues(caption_file_id)
# [ {cue1}, {cue2} ... ]
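Only the factory/adapter names and the transcribe/get_cues calls above come from the actual code; as a hedged illustration of the plug-and-play idea, the classes behind them might be shaped like this:

class WhisperBackend:
    """Interface that every backend (local container, Vertex AI, VM, ...) implements."""

    def make_request(self, url):
        """Send the media file URL to the backend and return the raw transcription."""
        raise NotImplementedError


class LocalWhisper(WhisperBackend):
    def make_request(self, url):
        # In local development this could run a small model such as openai/whisper-tiny.
        ...


class WhisperAdapter:
    """Translates between Studio's caption models and whichever backend is configured."""

    def __init__(self, backend):
        self.backend = backend

    def transcribe(self, caption_file_id):
        # Resolve a URL for the media file (the standalone backend has no access to
        # Contentcuration models), send it to the backend, and wrap the result in a
        # standardized response object.
        ...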

Before setting the generating flag back to false, we create a change event to send these caption cues to the frontend.

Generation of WebVTT on publishing

When you click the “Publish Channel” button in Studio, it triggers a publishing workflow that generates files for importing your channel into Kolibri (the learning platform where users interact with the materials). During the publishing process, if a content node is a video or audio file with associated CaptionFiles, Studio performs two actions:

  1. If a CaptionFile does not have a WebVTT file (output_file), Studio creates a WebVTT file.
  2. If a CaptionFile already has a WebVTT file, Studio checks if the CaptionFile has been modified. If so, it updates the output file with a newly generated WebVTT file. The old WebVTT file is then collected by the garbage collector.
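A minimal sketch of serializing a CaptionFile's cues into WebVTT text; the helper name and the caption_cue reverse relation follow the illustrative model sketch earlier, and Studio's real publishing code additionally wires the result into its File model and storage.

def cues_to_webvtt(caption_file):
    """Serialize a CaptionFile's cues into WebVTT text, ordered by start time."""

    def timestamp(seconds):
        hours, remainder = divmod(seconds, 3600)
        minutes, secs = divmod(remainder, 60)
        return "{:02d}:{:02d}:{:06.3f}".format(int(hours), int(minutes), secs)

    lines = ["WEBVTT", ""]
    for cue in caption_file.caption_cue.order_by("starttime"):
        lines.append("{} --> {}".format(timestamp(cue.starttime), timestamp(cue.endtime)))
        lines.append(cue.text)
        lines.append("")  # a blank line terminates each cue block
    return "\n".join(lines)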

Future scope of the project

In the future, this project holds immense potential for further development. The addition of a front-end editor to preview and edit generated captions will provide even more flexibility and control. We can explore the integration of translation services to make content accessible to a global audience. Additionally, the summarization of transcriptions and the ability to export content in PDFs and documents are exciting possibilities. With the power of neural networks, the potential for innovation is limitless.

I am immensely grateful to Learning Equality for this opportunity, and I want to extend my heartfelt thanks to my mentors, Blaine Jester and Samson Akol, for their unwavering support and guidance throughout this journey.
