Transcribe an audio file to text

Use Interval's handy file upload functionality to quickly build a tool for transcribing an audio file to text. Interval is flexible enough to build on this functionality to fit your team's working style. Index meeting notes, post updates to Slack, or summarize YouTube videos.


Try it out

To create a fresh project based on this example, run one of the following:

npm:

npx create-interval-app --template=transcribe-audio --language=python

yarn:

yarn create interval-app --template=transcribe-audio --language=python

The full source code for this tool can also be found in our examples repo.

How it works

This example uses Whisper, OpenAI's open source speech recognition model. Install the library and make sure its command-line dependency, ffmpeg, is installed too.

First, we'll use the io.input.file I/O method to collect an audio file to transcribe. With the allowed_extensions option, we'll restrict uploads to files with an audio extension.

import os
from interval_sdk import Interval, IO, io_var, ctx_var

interval = Interval(
    os.environ.get('INTERVAL_API_KEY'),
)

@interval.action
async def transcribe_audio(io: IO):

    file = await io.input.file(
        "Upload a file to transcribe", allowed_extensions=[".wav", ".mp3"]
    )

    return "All done!"

interval.listen()

If you test out this tool with your personal development key, you'll be able to find it in your Development environment. When run, the action provides a friendly uploader for submitting the audio file.
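An extension filter only looks at the file name, so it can be useful to repeat the check inside the action before doing any expensive work. The following is a minimal sketch of such a check. The helper name has_allowed_extension and the ALLOWED_EXTENSIONS constant are hypothetical and not part of the Interval SDK.

```python
from pathlib import PurePath

# Hypothetical allow-list mirroring the allowed_extensions option above.
ALLOWED_EXTENSIONS = {".wav", ".mp3"}


def has_allowed_extension(filename: str, allowed=ALLOWED_EXTENSIONS) -> bool:
    """Return True if the file name's final suffix is in the allow-list.

    The comparison is case-insensitive, so "talk.MP3" is accepted.
    Only the last suffix counts, so "talk.mp3.zip" is rejected.
    """
    return PurePath(filename).suffix.lower() in allowed
```

Inside the action, this could be called as has_allowed_extension(file.name) right after the upload step. It still says nothing about the file's actual contents; ffmpeg will report an error later if the data is not decodable audio.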

Now that our action has a file to operate on, we can transcribe it using the machine learning model.

We'll add code to load the base version of the Whisper model. This will run once when you boot up your action (and will take some time to download the model the first time it runs).

Then, once a file is provided, we'll pass it to the model to transcribe its text content and display the results.

import os
from interval_sdk import Interval, IO, io_var, ctx_var

import whisper

interval = Interval(
    os.environ.get('INTERVAL_API_KEY'),
)

model = whisper.load_model("base")

@interval.action
async def transcribe_audio(io: IO):

    file = await io.input.file(
        "Upload a file to transcribe", allowed_extensions=[".wav", ".mp3"]
    )

    with open(file.name, "wb") as f:
        f.write(file.read())

    result = model.transcribe(file.name)

    await io.group(
        io.display.heading("Transcription results"),
        io.display.markdown(result["text"])
    )

    return "All done!"

interval.listen()
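The listing above writes the upload into the working directory under the name supplied by the uploader and never removes it. If you keep the save-to-disk approach, one possible variation is to write to a temporary file and delete it afterwards. This is a sketch under that assumption; transcribe_via_temp_file is a hypothetical helper, and the transcribe argument stands in for any function that accepts a file path, such as model.transcribe.

```python
import os
import tempfile
from pathlib import PurePath


def transcribe_via_temp_file(data: bytes, original_name: str, transcribe):
    """Write data to a temp file, pass its path to transcribe, then delete it."""
    # Keep the original suffix (for example ".mp3") on the temporary file name.
    suffix = PurePath(original_name).suffix
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return transcribe(path)
    finally:
        # Clean up even if transcription raises.
        os.remove(path)
```

In the action, the two lines that open and write file.name could then become a single call such as transcribe_via_temp_file(file.read(), file.name, model.transcribe).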

This works great! But we can improve things a bit.

First, we'll show a loading spinner while the model runs, so the user knows what's happening.

Next, our current implementation saves the uploaded file locally to simplify the transcription call, but with a bit more legwork we can avoid this and pass the file's contents directly to the model as bytes. For this to work, we'll use ffmpeg to preprocess the file and numpy to reformat the audio into an array of floats representing the audio waveform.
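To make the reformatting step concrete: ffmpeg is asked for raw signed 16-bit little-endian PCM, and each sample is then divided by 32768 to get a float between -1.0 and 1.0. The sketch below shows the same arithmetic using only the standard library. It is for illustration; the function name pcm16le_to_floats is hypothetical, and the actual tool uses numpy for this step.

```python
import struct


def pcm16le_to_floats(raw: bytes) -> list[float]:
    """Convert signed 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    count = len(raw) // 2  # two bytes per sample; a trailing odd byte is ignored
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    return [s / 32768.0 for s in samples]


# Four samples: silence, half scale, the most negative value, the most positive value.
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
floats = pcm16le_to_floats(raw)
```

Here the first three values are 0.0, 0.5, and -1.0, and the last is just below 1.0, because the largest 16-bit sample is 32767 rather than 32768.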

import os
from typing import cast

from interval_sdk import Interval, IO, ctx_var
from interval_sdk.internal_rpc_schema import ActionContext

import whisper
import numpy as np
import ffmpeg

interval = Interval(
    os.environ.get('INTERVAL_API_KEY'),
)

model = whisper.load_model("base")

def audio_from_bytes(inp: bytes, sr: int = 16000):
    try:
        out, _ = (
            ffmpeg.input('pipe:', threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=inp)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

@interval.action
async def transcribe_audio(io: IO):

    file = await io.input.file(
        "Upload a file to transcribe", allowed_extensions=[".wav", ".mp3"]
    )

    # Show a loading indicator while the model runs.
    ctx = cast(ActionContext, ctx_var.get())
    await ctx.loading.start("Transcribing audio...")

    audio = audio_from_bytes(file.read())
    result = model.transcribe(audio)

    await io.group(
        io.display.heading("Transcription results"),
        io.display.markdown(result["text"])
    )

    return "All done!"

interval.listen()
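One thing to keep in mind is that model.transcribe is an ordinary blocking call, and the action is an async function. A common asyncio pattern for this situation is to run the blocking work in a worker thread with asyncio.to_thread (Python 3.9+), so the event loop stays free. Whether this is needed for Interval's loading indicator is an assumption that has not been verified here. In the sketch below, slow_transcribe is a hypothetical stand-in for the model call.

```python
import asyncio
import time


def slow_transcribe(audio):
    """Stand-in for a blocking call such as model.transcribe (hypothetical stub)."""
    time.sleep(0.2)
    return {"text": f"{len(audio)} samples transcribed"}


async def heartbeat(ticks: list):
    """Record timestamps to show the event loop keeps running during the call."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    # The blocking function runs in a worker thread; the loop is not blocked.
    result = await asyncio.to_thread(slow_transcribe, [0.0] * 16000)
    beat.cancel()
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

In the action, the equivalent change would be replacing result = model.transcribe(audio) with result = await asyncio.to_thread(model.transcribe, audio).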

API methods used

io.input.file
io.group
io.display.heading
io.display.markdown
ctx.loading
