Transcribing audio buffers from react-native-audio-api in Real-time
expo-speech-transcriber is a library I originally built for on-device transcription in my app Spendly, but it has since morphed into a fun open-source project for easy, fast, on-device transcription in React Native/Expo apps targeting iOS!
In this blog post, we'll look at how I achieved real-time transcriptions based on audio buffers coming from Software Mansion's react-native-audio-api library.
react-native-audio-api is a feature-rich library for doing performant audio-related tasks in React Native. It is the go-to library for doing complex tasks that the expo-audio library wasn’t designed for — e.g., exposing audio buffers in real-time.
During my planning phase for this library, I thought about how I was going to pass the buffer channel data through Expo modules to the native Swift side to be converted into a native AVAudioPCMBuffer and appended to the bufferRecognitionRequest I created. See ExpoSpeechTranscriberModule.swift in the package’s GitHub repo for more info.
The approach I settled on was passing the buffer channel data along with a second argument, sampleRate, so I could reconstruct the AVAudioPCMBuffer on the Swift side of things. The buffer channel data is obtained through the buffer.getChannelData(0) method call, which returns a Float32Array, as stated in the react-native-audio-api docs for getChannelData.
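To make the shape concrete, here's a small TypeScript sketch of pulling those two values off a buffer. The AudioBufferLike interface and the stub buffer are my own illustrations of the Web Audio-style shape react-native-audio-api follows, not the library's actual types:

```typescript
// Minimal shape of the Web Audio-style AudioBuffer that
// react-native-audio-api follows (only the fields used here).
interface AudioBufferLike {
  sampleRate: number;
  getChannelData(channel: number): Float32Array;
}

// Pull out the two pieces the native side needs to rebuild
// an AVAudioPCMBuffer: the raw samples and the sample rate.
function extractTranscriptionInput(buffer: AudioBufferLike) {
  return {
    samples: buffer.getChannelData(0), // Float32Array of mono samples
    sampleRate: buffer.sampleRate,     // e.g. 44100 or 16000
  };
}

// Example with a stub buffer standing in for a real one:
const stub: AudioBufferLike = {
  sampleRate: 16000,
  getChannelData: () => new Float32Array([0.25, -0.5, 0.75]),
};
const input = extractTranscriptionInput(stub);
```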
With that settled, here is the native definition of my realtimeBufferTranscribe function, which receives data from JavaScript land:
```swift
private func realtimeBufferTranscribe(buffer: [Float32], sampleRate: Double) async -> Void {
    // rest of the code
}
```
And this is how it's exposed through Expo modules:

```swift
AsyncFunction("realtimeBufferTranscribe") { (buffer: [Float32], sampleRate: Double) async -> Void in
    await self.realtimeBufferTranscribe(buffer: buffer, sampleRate: sampleRate)
}
```
Now the first error I ran into was trying to pass a Float32Array through Expo modules to the native side, which isn’t supported (at least at the time of writing). This was the error from JSI:
```
Assertion failed: (runtime.isArray(*this)), function getArray, file jsi-inl.h, line 158.
```
meaning it probably expected a plain array like number[] rather than a Float32Array. Seeing this, I was like, “Mhmm, isn't copying arrays over the JSI bridge going to cause performance issues?” — like I saw in my experiment implementing the cosine similarity function from Vercel's AI SDK using Nitro Modules. TL;DR: copying large amounts of data using normal arrays causes performance issues, as stated in the Nitro docs.
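Converting the typed array to a plain array before the native call is a one-liner. A minimal sketch — the SpeechTranscriber call in the comment assumes the module and function name from earlier:

```typescript
// JSI rejects typed arrays here, so convert the Float32Array to a
// plain number[] before crossing the bridge. Array.from performs a
// shallow copy of the samples into an ordinary array.
const samples = new Float32Array([0.25, -0.5, 0.75]);
const plain: number[] = Array.from(samples);

// `plain` is now a normal array that Expo modules will accept, e.g.:
// await SpeechTranscriber.realtimeBufferTranscribe(plain, sampleRate);
```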
Nonetheless, I decided to try it anyway, and it worked pretty well — probably because the buffers arrive in tiny chunks from JavaScript land. But that's not all: the buffer copied into Swift was an array containing the audio sample values, not their memory addresses. I had to get the memory address of the Swift array, copy the data into the AVAudioPCMBuffer using memcpy for performant copying, and then safely append the pcmBuffer to the bufferRecognitionRequest. Example:
```swift
func realtimeBufferTranscribe(buffer: [Float32], ...) {
    // buffer = [0.1, -0.2, 0.3, -0.1, ...]
    // This is a Swift array containing the actual audio samples

    // Get the memory address of the Swift array
    buffer.withUnsafeBufferPointer { bufferPointer in
        // bufferPointer.baseAddress = memory address of the first element
        // Copy from the Swift array into the AVAudioPCMBuffer
        memcpy(
            channelBuffer,                             // Destination: PCM buffer memory
            bufferPointer.baseAddress,                 // Source: Swift array memory address
            buffer.count * MemoryLayout<Float32>.size  // Size: number of bytes to copy
        )
    }

    // Append the buffer to the recognition request
    bufferRecognitionRequest?.append(pcmBuffer)
}
```
MemoryLayout<Float32>.size gives the size of Float32 in memory.
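For intuition, the JavaScript side has the same arithmetic built into typed arrays. This little sketch mirrors the Swift byte-count calculation; the 480-sample chunk size is just an example value I picked:

```typescript
// The Swift side computes the byte count as
// buffer.count * MemoryLayout<Float32>.size. The JavaScript analogue:
const chunk = new Float32Array(480); // e.g. one 10 ms chunk at 48 kHz
const bytesPerSample = Float32Array.BYTES_PER_ELEMENT; // 4 bytes per Float32
const totalBytes = chunk.length * bytesPerSample;      // 480 * 4 = 1920

// Typed arrays track this for you:
console.log(totalBytes === chunk.byteLength); // true
```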
After that, subsequent transcriptions are sent using event emitters provided by Expo modules. Snippet from the code:
```swift
let recognizedText = result.bestTranscription.formattedString
self.sendEvent(
    "onTranscriptionProgress",
    ["text": recognizedText, "isFinal": result.isFinal]
)
```
These are then consumed on the JavaScript end with an easy-to-use hook that listens for the events. Example:
```tsx
import * as SpeechTranscriber from 'expo-speech-transcriber';

const App = () => {
  const { text, isFinal, error, isRecording } = SpeechTranscriber.useRealTimeTranscription();

  return (
    // JSX code
  );
};
```
See src/index.ts in the GitHub repo for the hook code.
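Conceptually, the hook is just a listener on the "onTranscriptionProgress" event that folds each payload into state. Here's a hedged sketch of that idea — MockEmitter is a stand-in I wrote for illustration, not Expo's actual EventEmitter API, and only the event name and payload shape come from the snippets above:

```typescript
// Payload shape emitted by the native side (matches the sendEvent call).
type TranscriptionEvent = { text: string; isFinal: boolean };
type Listener = (event: TranscriptionEvent) => void;

// Mock emitter standing in for the Expo module's event emitter.
class MockEmitter {
  private listeners: Listener[] = [];
  addListener(listener: Listener) {
    this.listeners.push(listener);
    return { remove: () => { /* unsubscribe on unmount */ } };
  }
  emit(event: TranscriptionEvent) {
    this.listeners.forEach((l) => l(event));
  }
}

// What the hook does conceptually: subscribe once, fold events into state.
const emitter = new MockEmitter();
let state: TranscriptionEvent = { text: '', isFinal: false };
emitter.addListener((event) => { state = event; });

// Simulating the native side firing "onTranscriptionProgress":
emitter.emit({ text: 'hello world', isFinal: false });
```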
Additionally, in theory, if someone manages to stream remotely stored audio using react-native-audio-api and pass that audio into the speech-transcriber module, they'll be able to transcribe it in real time! If you're reading this and want to try it out, please do — and reach out to me on X at 1804davey (DMs open!) so we can chat about your results.
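If you do try it, the main moving part is slicing the decoded samples into small chunks before handing them to the module, so the native side receives them the same way it would from a live recording. A sketch of that chunking — the 4096-sample chunk size is an arbitrary choice of mine, not a library default:

```typescript
// Slice a long run of decoded samples into fixed-size chunks,
// mimicking the tiny buffers a live recording would produce.
function* chunkSamples(samples: Float32Array, chunkSize = 4096) {
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray is a view, not a copy, so this is cheap
    yield samples.subarray(start, start + chunkSize);
  }
}

// For each chunk, the call would look like:
//   await SpeechTranscriber.realtimeBufferTranscribe(Array.from(chunk), sampleRate);
const track = new Float32Array(10000);
const chunks = Array.from(chunkSamples(track));
// 10000 samples → chunks of 4096, 4096, and 1808
```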
Thanks for reading this one — I had fun writing this post and building the project. If you don't mind, give it a star on GitHub and use it in your projects! Here are relevant links:
Lastly, props to the Expo team for the awesome DX of Expo modules — till next time, Happy Coding. 🛠️