auphonic.com - Entries for the category Development

Introducing the Auphonic Command Line Interface (CLI)

grh@auphonic.com (Georg) — Thu, 26 Mar 2026 06:55:42 +0000

Auphonic is now available from the command line:
The Auphonic CLI lets you process, manage, and automate audio productions without leaving the terminal. It's a free, single binary with no dependencies — just download it, authenticate, and start processing.

Everything you know from Auphonic is available: Noise Reduction, Loudness Normalization, Speech Recognition, Multitrack Processing, Presets, and publishing to External Services. All from a single command.

Use Cases

Whether you're producing a single episode or processing thousands of files, the CLI fits into a wide range of workflows:

File-based workflows: If you're a broadcaster, podcaster, or anyone working with file-based audio, the CLI lets you integrate Auphonic directly into your local workflow. Drop it into your existing production pipeline and process files right where they live — no browser required.

DAW integration: Pipe audio straight from ffmpeg, sox, or your DAW exports into Auphonic. This makes it easy to add post-processing as the final step in your editing workflow without switching tools.

Batch processing & scripting: Wrap the CLI in a shell script, cron job, or CI/CD pipeline to process entire folders of files automatically. Great for teams that need to handle high volumes of audio with consistent quality settings.

Optimized for humans and agents: A single command replaces what would otherwise be multiple API calls with HTTP requests, JSON payloads, and auth headers. The CLI is designed to be easy to use interactively, but its structured output also makes it a natural fit for AI agents and LLMs that need to integrate audio processing into their toolchains.

Examples

Processing a file is as simple as a single command. The CLI uploads your audio to Auphonic, applies our default audio algorithms automatically, and downloads the processed result back to your machine:

Terminal

# Process an audio file, wait for completion, and download the result
$ auphonic process interview.wav --wait --download
Processing interview.wav...
Status: Done
Downloaded: interview-auphonic.mp3

Presets save your production settings — output formats, loudness targets, metadata, publishing destinations — so you get consistent results across episodes without repeating yourself. The --preset flag accepts either the preset name or its UUID. You can also pipe audio from other tools like ffmpeg:

Terminal

# Apply a preset to your production
$ auphonic process episode.wav --preset "My Podcast" --wait --download
# Pipe audio from ffmpeg
$ ffmpeg -i recording.mkv -vn -f wav - | auphonic process - --preset "My Podcast"
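
For the batch processing use case, the same command can be wrapped in a small shell function that walks a folder. This is a minimal sketch rather than an official script: it uses only the process, --preset, --wait, and --download options shown above, while the function name process_folder and the AUPHONIC_BIN variable (a hook for dry runs) are assumptions made for this example.

```shell
# Process every .wav file in a folder with the same preset.
# AUPHONIC_BIN is a hypothetical override for dry runs, not a CLI feature.
process_folder() {
  dir=$1
  preset=$2
  for f in "$dir"/*.wav; do
    # Skip the unexpanded pattern when the folder contains no .wav files
    [ -e "$f" ] || continue
    "${AUPHONIC_BIN:-auphonic}" process "$f" --preset "$preset" --wait --download \
      || echo "failed: $f" >&2
  done
}

# Usage: process_folder ./episodes "My Podcast"
```

Setting AUPHONIC_BIN=echo prints the commands instead of running them, which is a cheap way to check the file selection before calling the function from a cron job or CI pipeline.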

Need to check on past productions or grab a result file? The list and download commands let you quickly browse your production history and pull output files. Add --open to open the production in your browser, or use auphonic open UUID to jump to any production at any time:

Terminal

# List recent productions
$ auphonic list
UUID                    Status    Title
aB3xKmNpQrStUvWxYz1234  Done      Episode 42
zY9wLcFgHjKmNpQrSt5678  Done      Interview
# Download a production's output files
$ auphonic download aB3xKmNpQrStUvWxYz1234
# Open a production in the browser
$ auphonic open aB3xKmNpQrStUvWxYz1234
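
The list and download commands can also be combined in a script. The sketch below assumes that auphonic list prints the whitespace-separated table shown in the sample output above (a header row, then UUID, Status, and Title columns); the helper name done_uuids is made up for this example, and the real output format may differ.

```shell
# Print the UUID of every production whose Status column reads "Done".
# Assumes the table layout from the sample output: header row, then UUID Status Title.
done_uuids() {
  awk 'NR > 1 && $2 == "Done" { print $1 }'
}

# Usage (not executed here):
#   auphonic list | done_uuids | while read -r uuid; do
#     auphonic download "$uuid"
#   done
```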

Multitrack productions let you process separate audio tracks for each speaker or source — Auphonic automatically balances levels, reduces crosstalk, and mixes them into a final file. This is ideal for interviews, panel discussions, or any recording with multiple microphones:

Terminal

# Multitrack production with labeled tracks
$ auphonic process \
  --track id=host,file=host.wav \
  --track id=guest,file=guest.wav \
  --track id=music,file=music.wav \
  --wait --download
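
When track files follow a naming convention, the --track arguments can be generated instead of typed. The following is a rough sketch under a few assumptions: each track id is derived from the file name without its extension, file names contain no commas, and AUPHONIC_BIN is a hypothetical override for dry runs. Only the --track id=...,file=... syntax is taken from the example above.

```shell
# Build one "--track id=NAME,file=PATH" pair per input file and start a
# multitrack production. The id is the file name without its extension.
multitrack_from_files() {
  n=$#
  for f in "$@"; do
    id=$(basename "$f")
    id=${id%.*}
    # Append the generated arguments after the original file list
    set -- "$@" --track "id=$id,file=$f"
  done
  # Drop the original file names, keeping only the generated arguments
  shift "$n"
  "${AUPHONIC_BIN:-auphonic}" process "$@" --wait --download
}

# Usage: multitrack_from_files host.wav guest.wav music.wav
```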

You can add Speech Recognition to generate transcripts alongside your processed audio, publish directly to an External Service like your podcast host or YouTube, or set default options so you don't have to repeat them every time:

Terminal

# Add speech recognition
$ auphonic process episode.wav --speech-recognition --wait
# Publish to your podcast host
$ auphonic process episode.wav --publish SERVICE_UUID --wait
# Set a default preset for all future productions
$ auphonic config set default-preset "My Podcast"

See the CLI reference documentation for the full command reference.

Auphonic API Updates

Alongside the CLI, we've made some improvements to the Auphonic API that benefit both API users and CLI users alike.

OpenAPI specification: The Auphonic API now has a full OpenAPI 3.0.3 specification covering all endpoints. In practice, this means you can auto-generate client libraries in your language of choice, get autocompletion in your IDE, and validate requests before sending them. You can also browse the full API interactively via ReDoc.

Preset names in API: You can now reference presets by name instead of UUID in API calls — no more copying opaque identifiers. The CLI uses this feature under the hood, which is why --preset "My Podcast" works out of the box. If multiple presets share a name, priority goes to personal presets first, then shared, then default. See the API documentation for details.
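
The priority rule for duplicate preset names can be illustrated with a small shell function. This only sketches the ordering described above (personal, then shared, then default) and is not Auphonic's implementation: the tab-separated input format of scope, UUID, and name, as well as the function name resolve_preset, are assumptions made for this example.

```shell
# Pick the UUID for a preset name, preferring personal over shared over default.
# Reads hypothetical lines of the form "SCOPE<TAB>UUID<TAB>NAME" from stdin.
resolve_preset() {
  awk -F '\t' -v name="$1" '
    BEGIN { rank["personal"] = 1; rank["shared"] = 2; rank["default"] = 3 }
    # Keep the matching entry with the lowest rank (highest priority)
    $3 == name && ($1 in rank) && (best == "" || rank[$1] < best_rank) {
      best = $2
      best_rank = rank[$1]
    }
    END {
      if (best == "") exit 1   # no preset with that name
      print best
    }
  '
}

# Usage: some_preset_listing | resolve_preset "My Podcast"
```

The function exits with a non-zero status when no preset matches, so a calling script can fall back to asking for a UUID.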

Get Started

The CLI runs on macOS, Linux, and Windows. Install it with a single command:

Terminal

$ curl -sSL https://auphonic.com/cli/install.sh | sh
# Authenticate with your Auphonic account
$ auphonic auth login

You can also download pre-built binaries directly from auphonic.com/cli, where you'll find setup instructions for all platforms.
For the full command reference, see the CLI documentation.

Feedback

The CLI is a new addition to Auphonic, and we'd love to hear how it works for you. If you have feedback, run into issues, or have ideas for new features, please visit our Contact Page or email us at support@auphonic.com.

New Auphonic Static and Music Denoiser

lukasm@auphonic.com (Lukas) — Thu, 31 Jul 2025 09:43:32 +0000

Technology is evolving fast and we’re the ones pushing it forward.

Our new Static Denoiser removes steady background noise like hiss, hum, or fan noise while keeping music, ambience, and sound design fully intact. Perfect for audio dramas, videos, music, meditations, podcasts, or anything where clarity matters and the atmosphere does too.

Auphonic’s Vision for Noise Reduction

We firmly believe that there is no “one button to fix it all.”

But how can we say that when it is our mission to build exactly that?

Our goal is precision: Giving users full control over what stays and what goes.

That’s why we offer different tools for different kinds of noise:

Speech Isolation removes everything but your voice. It keeps only the speech you care about.
Dynamic Denoiser adapts in real time to changing environments - ideal for unpredictable noise patterns, while leaving music intact.
Static Denoiser now updated to precisely target stationary noise - like constant hiss or hum - while preserving music, ambient effects, and subtle details.

Think of audio dramas where sound effects should stay untouched, or meditation recordings with soft tonal elements that must remain intact. Unlike non-stationary noises (e.g. coughs, chair squeaks or mouth clicks), stationary noise is consistent over time - making it ideal for static removal without harming your content.

While we believe in precise user control, there is a way to reduce your audio editing work to almost one button: Saving your favorite settings as Custom Presets and applying those to your productions. You can even share your Presets with friends!
But more on that later 😉

How to Use the New Static Denoiser

Cleaning up your audio is easier than ever: Just upload your file, choose the Static Denoiser in the Production Form, and let Auphonic do the rest. No deep tech knowledge required.

Fine-tune the sound by adjusting the Remove Noise and Remove Reverb sliders. Lower them slightly to retain some natural texture while still boosting clarity and speech intelligibility.

And for everyone still wanting to use the Original Static Denoiser: Don't worry, the Legacy Denoising Version is still available. We renamed it to "Classic" and you can use it as normal. Just select it through the drop-down menu.

Audio Examples

It’s hard to read audio, but easy to hear the difference. Here is a sound comparison of our denoising models - please use headphones to hear all details!

New Static Denoiser vs. Speech Isolation

We processed a snippet from the History of Jazz podcast using both Speech Isolation and the new Static Denoiser. Speech Isolation removes everything but the speech - including music, background vocals, and ambience - resulting in a clean, voice-only track. In contrast, the Static Denoiser keeps the musical texture intact while removing just the steady background noise. Hear the difference for yourself!

Original:
Static Denoised:
Speech Isolated:

New Static Denoiser vs. Dynamic Denoiser

In this excerpt from the German audio drama Der Graue, the “Static Denoiser” perfectly preserves all the sound effects while removing the reverb and static noise, whereas the “Dynamic Denoiser” (or “Speech Isolation”) removes everything from the audio that is not speech:

Original:
Static Denoised:
Dynamic Denoised:

New Static Denoiser vs. Legacy Classic Denoiser

To show how far we’ve come, we processed a short segment from the radio play Around the World in 80 Days using both the new Static Denoiser and the legacy Classic Denoiser.
The Classic Denoiser extracts noise prints in speech pauses, which is not possible in this example because of the background music, whereas the new model removes noise cleanly.

Original:
Old Classic Denoised:
New Static Denoised:

Try It Out

The new Static Denoiser is live! Perfect for cleaning up hiss and hum while keeping music and sound effects intact.

If you're used to the legacy model and want to keep using it: No worries, the Classic Denoiser is still available. Want more aggressive cleanup or even full music removal? Try our Dynamic Denoiser or Speech Isolation models.

Re-processing is free - so go wild:
As long as you don’t change your input file, you can tweak your production and test different settings without using extra credits. Try it. Break it. Compare models. Save your findings as your favorite preset. And please, tell us what you think!

Feedback

We drive innovation through your feedback and would love to hear how the new Static Denoiser works for you. Send us your thoughts through the production feedback form or reach out through our Contact Page.

Every bit of input helps us fine-tune things further.

Independently control Noise, Reverb and Breath Reduction Amounts

mpagavino@auphonic.com (Manuel) — Thu, 16 May 2024 13:20:55 +0000

In response to your feedback, we are proud to present separate parameters for noise, reverb, and breath reduction, giving you more flexible control over your output.
Find all the new parameters below and listen to the Audio Examples to get a closer impression of the upgrade.

What's the update about?

Before

Previously, you could only set the Denoising Method and a single reduction amount, which was used for all elements.
Depending on the selected method, you were already able to decide whether music, static, or changing noises should be removed, but there was no setting to keep the typewriter sound effects while removing the reverb, for example.

Now

With our latest upgrade, you can now set the reduction amounts separately for noise, reverb, and breathing sounds.
For example, you could completely remove the background noise while reducing the reverb just a little to enhance speech intelligibility but keep the atmosphere, as we did in Audio Example 1.
Many of you have also asked about the possibility of slightly reducing breath sounds rather than eliminating them completely. In Audio Example 2 we demonstrate how you can prevent your audio from sounding strange and unnatural by reducing instead of eliminating all breathing sounds.

To all of you who are happy with the results and don't want anything to change, relax:
If you don't change the default settings, the noise reduction algorithms work exactly the same as before.

Note: As the 'Static Denoiser' removes only stationary noise, there are no 'Remove Reverb' and 'Remove Breathings' parameters available for this denoising method.

New Parameters

Screenshot of the new Noise Reduction Parameters in the production form.

In order to use the new noise reduction features, you may separately set the following parameters:

Denoising Method: (unchanged) Select what kind of noise you want to remove.
[Dynamic Denoiser (default), Speech Isolation, Static Denoiser]
Note that the parameters 'Remove Reverb' and 'Remove Breathings' are NOT available for Static Denoiser!
Remove Noise: Select the amount of noise you want to remove.
[100 dB (default), Disable Denoise, 3 dB, 6 dB, ..., 100 dB (full)]
Remove Reverb: Select the amount of reverb you want to remove.
[100 dB (default), Disable Deverb, 3 dB, 6 dB, ..., 100 dB (full)]
Remove Breathings: Select the amount of breathings you want to remove.
[Off (default), 3 dB, 6 dB, ..., 100 dB (full)]

Feel free to experiment with all the options to find your preferred parameter settings! Editing and reprocessing existing productions does not cost any additional credits as long as you don't change the input file.

Listen to the results:

1. Reverb reduction with full noise elimination

For the first audio example by conduitministries.com, we set the 'Remove Noise' amount to 100 dB (full) and varied the 'Remove Reverb' amount from 0 dB (Off) to 12 dB (medium) and 100 dB (full). Listen to how the noise disappears first and the reverb then decreases step by step:

Original
-100dB Denoise -0dB Deverb
-100dB Denoise -12dB Deverb
-100dB Denoise -100dB Deverb

2. Breathing sound reduction

In the breathing reduction audio example by LibriVox.org, we increased the 'Remove Breathings' amount from 0 dB (Off, the original audio) to 12 dB (medium) and to 100 dB (full).
In the result files you can hear that the 100 dB (full) elimination leads to weird, unnatural-sounding pauses, which can be prevented by merely reducing the breathing sounds:

Original
-12dB Debreath
-100dB Debreath

Try it now on auphonic.com!

Feedback

We hope you like our upgraded version of the Noise Reduction Algorithms with new parameters for more control.
If you have more feature requests or feedback for us, please let us know! You can also leave a comment in the feedback section on the status page of your specific production. We're looking forward to hearing from you!

New Auphonic Transcript Editor

manuelw@auphonic.com (Manuel) — Thu, 21 Mar 2024 12:31:22 +0000

We're excited to roll out an upgraded version of our Transcript Editor, focusing on enhancing your transcription workflow and making it more intuitive, especially for mobile users. This overhaul introduces several key improvements and features designed to streamline the transcription process.

Click here for a Live Demo

What's new?

Line by Line Editing

Your transcript is rendered line by line. This allows for precise editing of every single timestamp. Depending on the speech recognition engine, editing can be done on word or phrase level.
For optimal results, we suggest utilizing our Auphonic Whisper ASR engine.

A paragraph with 9 lines, every line represents a "subtitle line" (.vtt, .srt).

You can split or combine paragraphs and lines using the Enter and Backspace keys. Our new Playback Slider enables seamless scrolling through the text, while we highlight the currently selected word as you go. With the switchable Play on Click function you can start your playback from anywhere in the transcript.

Automatic Shownotes and Chapters

If you enable Automatic Shownotes and Chapters in the Production form, we include AI-generated shownotes and chapters directly in the Transcript Editor. You can edit Chapter Times and Text directly within the Transcript Editor. Once you click Save (top right), any modifications made to the shownotes and chapters will also be saved back to the production.

Screenshot of Automatic Shownotes and Chapters within the Transcript Editor.

You are also able to edit chapter times directly within the transcript editor. Please note that this only works within the Transcript section of the editor to ensure precise placement of chapters.

Screenshot of Edit Chapter Time.

Local History: Undo and Redo

Our Local History feature offers convenient undo and redo functionality. This means you can effortlessly revert changes or redo them as needed, providing you with greater control and flexibility during the editing process.

Edit Speakers

Our revamped Transcript Editor automatically assigns speakers in Multitrack Productions. You can use the Track Identifier in our production form to assign speakers and easily edit, remove, or add new ones within the Transcript Editor. So it's clear who says what at any time!

Screenshot of Edit Speaker.

Confidence Highlighting

Areas of low confidence in the transcript are highlighted by our Highlight Low Confidence feature, so you can check up on the AI and edit words it is not entirely confident about! If you hover over a highlighted area, it also shows you how confident the AI is about it.

Offline Mode

This feature enables you to download and share the Transcript Editor for offline editing. This means you can share the *.html file with someone else for editing purposes. Simply use the Download Editor action to obtain an offline version of the Transcript Editor.

Export Transcript

You can export your transcript in all currently relevant formats (.srt, .vtt, .txt, .html, .pdf) and include them in your publishing process or video editing software.

Screenshot of Export Transcript.

Responsive Design

We want to make podcasting as easy as possible. The responsive design of our Transcript Editor allows for fast editing on mobile devices - so you can edit what you're saying anywhere, anytime!

Screenshot of Transcript Editor on a mobile device.

Why use it?

Apple Podcasts paved the way for all podcasts to feature transcripts, so users can read through what you are saying and look for specific parts of your talk they're interested in revisiting. That could be book recommendations, advertisements or one of your ideas you want to share with the world.

We're trying to make the transcription process as automated and easy for you as possible. Especially with Multitrack Productions, we strive to take all the work off your hands and let the AI do its magic.

Additionally, we aim to simplify manual corrections wherever automated processes encounter challenges, ensuring that even these adjustments are as effortless as possible.

Try it now on auphonic.com!

Feedback

If you have feature requests or feedback for our new Transcript Editor, please let us know! You can also leave a comment in the feedback section on the status page of your specific production.
We're looking forward to hearing from you!

Improve your Audio with our new Automatic Filler Word Cutter

isabell@auphonic.com (Isabell) — Wed, 04 Oct 2023 08:44:33 +0000

We all know the problem: the content is perfectly prepared, and everything is in place, but the moment you hit the record button, your brain freezes, and what pops out of your mouth is a rain of “ums”, “uhs”, and “mhs” that no listener would enjoy.
Cleaning up a recording like that by manually cutting out every single filler word is a painstaking task.

So we heard your requests to automate filler word removal, started implementing it, and are now very happy to release our new Automatic Filler Cutter feature. See our Audio Examples and Usage Instructions below.

What is removed?

The definition of filler words varies depending on who you ask, and some words can serve as filler as well as content. For example, “like”, “well”, “you know”, etc. cannot be removed without the risk of also removing content and destroying sentences, even if those words are used as filler words in some cases.

Therefore, we decided to focus on the removal of the obvious fillers, namely any kind of “ums”, “uhs”, “mhs”, German “ähm”, “äh”, “öh”, French “euh”, “euhm” and similar.

Audio Examples

1. English Male Speaker

The first audio example is an excerpt from the interview “From Racing Failure to Red Bull Champion: The Untold Christian Horner Story”. Our algorithm found and removed a remarkable ten filler words in this 45-second snippet:

Screenshot of the Auphonic Audio Inspector: each pale red shaded area corresponds to a cut-out filler word.

Original:
Cut:

2. Austrian-German Female Speaker

The following example is an interview with the Austrian Ex-Foreign Minister, Karin Kneissl, who uses seven filler words within 26 seconds:

Original:
Cut:

Usage Instructions

To use the Auphonic Automatic Filler Cutter feature, you just have to create a production or preset as you are used to and select “Cut Fillers” for “Automatic Cutting” in the section “Audio Algorithms”:

When your production is done, all cut-out filler words will appear as pale red shaded areas in the Auphonic Audio Inspector on the production status page, as you can see in the screenshot of the Audio Inspector above.

If you want to remove silent segments from your audio as well, please also enable our Automatic Silence Cutting feature.

NOTE: Our Automatic Cutting features (for filler and silence) are not available for video files!

Behind the Scenes

For the training of our Automatic Filler Cutter AI-Algorithm, we created datasets that contain manually labeled audio files, collected from 'real world' audio data. So far, we have labeled, trained, and tested the system with English, German, Spanish, and French data.

However, in the Auphonic Web Service, you can activate and test the Automatic Filler Word Cutter for all languages. We would be very happy to hear how the filler removal works out for completely different-sounding languages from, e.g., the Asian, African, or Slavic language families.

Please send us feedback on any problems or error patterns you discover! This will help us generate specific data for the training to improve the algorithm and eliminate your problems.

Conclusion

Automatic filler word cutting is a powerful tool for podcasters looking to enhance the quality of their content. It boosts clarity and professionalism, all while making your editing process more efficient. Some users, however, see a touch of authenticity in filler words within podcasts. So, we leave it up to you to enable or disable the Automatic Filler Cutter feature for your next Auphonic production, depending on your desired style.

We are currently working on filler word cutting optimizations for more languages, so watch our channels to get all the news on our upgrades!

If you have any feedback for us – how the filler cutter is working in your language, what you do or don't like, what you miss, what else you would want to remove from your audio besides silence and filler words, etc. – you are welcome to contact us via email or directly comment on our production interface!

Automatically generate Shownotes, Summaries and Chapters from Recordings

isabell@auphonic.com (Isabell) — Mon, 12 Jun 2023 13:03:18 +0000

We're thrilled to introduce our Automatic Shownotes and Chapters feature. This AI-powered tool effortlessly generates concise summaries, intuitive chapter timestamps and relevant keywords for your podcasts, audio and video files.
See our Examples and the How To section below for details.

Why do I need Shownotes and Chapters?

In addition to links and other information, shownotes contain short summaries of the main topics of your episode, and inserted chapter marks allow you to timestamp sections with different topics of a podcast or video. This makes your content more accessible and user-friendly, enabling listeners to quickly navigate to specific sections of the episode or find a previous episode to brush up on a particular topic.

Shownotes are also very likely to boost your show's Search Engine Optimization and eventually its popularity, leading to an increase in listeners.

However, structuring the content and finding useful positions for chapter marks is a very time-consuming process, which can be fully automated with our new feature.

Besides the obvious use of creating shownotes and chapters for podcasts, you can also use our new feature to easily generate an abstract of your lecture recording, take the summary of your show as the starting point for a social media post, or choose your favourite chapter title as the podcast name.

What happens behind the Scenes?

When the Automatic Shownotes and Chapters feature is selected, the first step is speech transcription by either our internal Auphonic Whisper ASR or any integrated External ASR Service of your choice.

Some open source tools and ChatGPT will then summarize the resulting ASR text at different levels of detail, analyze the content to identify sections covering the different topics discussed, and finally add timestamps to each section for easy navigation.
Starting from a generated Long Summary, the text is condensed into a Brief Summary, and from the Brief Summary a Subtitle and some Keywords for the main topics are extracted.

Depending on the duration of the input audio or video file, the level of detail of the thematic sections is also slightly adjusted, resulting in a reasonable number of chapters for very short 5-minute audio files as well as for long 180-minute audio files.

How to automatically generate Shownotes and Chapters in Auphonic

If you are a paying or beta user, you can automatically generate shownotes and chapters by checking the Automatic Shownotes and Chapters Checkbox in the Auphonic singletrack or multitrack Production Form with any of our ASR Services enabled.
Once your production is done, the generated data will show up in your transcript result files and in the well-known Auphonic Transcript Editor above the speech recognition transcript section.
By clicking on a chapter title in the Chapters section of the transcript editor, you can jump directly to that chapter in your transcript to review and edit that section.

Unless you have manually entered content before, the generated data will also be automatically stored in your audio files' metadata as follows:

Generated Long Summary stored in metadata field Summary.
Generated Subtitle stored in metadata field Subtitle.
Generated Keywords stored in metadata field Tags.
Generated Timestamps for thematic sections stored as Start Time of Chapter Marks.
Generated Headlines for thematic sections stored as Chapter Title of Chapter Marks.

The metadata is automatically displayed with your audio file wherever you import your audio for further editing.

Please note that not all of our supported Output File Formats are designed to use metadata.
For details see our previous blog posts: ID3 Tags Metadata (used in MP3 output files), Vorbis Comment Metadata (used in FLAC, Opus and Ogg Vorbis output files) and MPEG-4 iTunes-style Metadata (used in AAC, M4A/M4B/MP4 and ALAC output files).

Example

As a real-life example, we automatically generated shownotes and chapters for the Lex Fridman Podcast #367: "Sam Altman: OpenAI CEO on GPT-4, ChatGPT, and the Future of AI".

Check out our transcript and generated shownotes:
LexFridmanPodcast367-transcript.html

Conclusion

The automatic generation of shownotes and chapters is a huge time-saver for podcasters and video creators, as it speeds up the tedious process of manually structuring and summarizing your content.

For now it is available for all paying or beta users. If you would like to become a beta user, or have any questions or feedback, please do not hesitate to contact us!

New Auphonic AutoEQ Filtering (Beta)

grh@auphonic.com (Georg) — Tue, 24 Jan 2023 09:35:03 +0000

In addition to our Leveler, Denoiser, and Adaptive 'Hi-Pass' Filter, we now release the missing equalization feature with the new Auphonic AutoEQ.
The AutoEQ automatically analyzes and optimizes the frequency spectrum of a voice recording, to remove sibilance (De-esser) and to create a clear, warm, and pleasant sound - listen to the audio examples below to get an idea about what it does.

Screenshot of manually adjusted example settings for the equalizer plug-in 'Pro-Q3' by fabfilter.

What is Equalization and why is it difficult?

Equalization (EQ) in audio recording and reproduction is the process of adjusting the volume of different frequency bands within a signal.
The following vocal EQ cheat sheet, published by Producer Hive, gives you a small impression of what can be influenced by equalizing:

Vocal EQ Cheat Sheet by Producer Hive.

On the other hand, it is very easy to ruin a good voice recording with heavy-handed manual equalization, resulting in voices that sound very sharp or muddy, or even as if the speaker had a blocked nose.
Besides the skill and experience of an audio engineer, manual adjustments of frequencies also require a very good and linear studio playback device. For example, performing manual equalization with strongly bass-heavy speakers would most likely lead to a very sharp, unpleasant listening experience using treble-heavy headphones.

For singletrack productions with more than one speaker, equalizing is also a very complex and time-consuming process, as every voice has its unique frequency spectrum and needs its own equalization. One could separate speakers with cuts or create a track envelope to fade from one speaker to another, but either approach is very tedious if you do it by hand.

That is where the Auphonic AutoEQ comes in! All those steps are now available in just one click!

How does the Auphonic AutoEQ work?

The Auphonic Web Service analyzes your audio content and classifies the audio file into small and meaningful segments like music, silence, different speakers, etc. to process every single segment with the best matching algorithms.
All our features like the Adaptive Leveler, Dynamic Denoising, Adaptive 'Hi-Pass' Filtering, and now the new AutoEQ filter option are built on top of this basic processing.

Using Auphonic AutoEQ, spectral EQ profiles are created separately for each speaker and adapt continuously over time. The aim of those time-dependent EQ profiles is to create a constant, pleasant sound in the output file even if the voices change slightly during the recording, for example, due to modified speaker-microphone positions.

Audio Examples

Here are two short audio examples, which demonstrate some features of our AutoEQ.
We recommend listening with headphones so you can hear all the details.

Example 1. Female Speaker with Background Music

In the following example (BCB: The Voices of Bainbridge Island) of a female narrator speaking while background music is playing, you can easily recognize quite sharp 'sss' sounds in the female voice. This sharpness in the female voice is removed by the so-called De-essing feature of the Auphonic AutoEQ, while the background music is not changed.

Original:
AutoEQed:

Example 2. Dialog of Male and Female Speakers

The next example (BCB: The Voices of Bainbridge Island) shows how the AutoEQ optimizes a singletrack recording containing two speakers with different voice characteristics. Our AutoEQ algorithms analyze each voice separately and calculate the matching frequency adjustments to optimize the voice of every single speaker.

Original:
AutoEQed:

AutoEQ Beta Integration in the Auphonic Web Service

To use the Auphonic AutoEQ, you just have to create a production or preset as you are used to, toggle “Advanced Parameters” on the top right in the section “Audio Algorithms” and select “Voice AutoEQ” within “Filtering”:

For a first test period, the AutoEQ will only be available for Beta and paying users, to incorporate your feedback and finalize an optimized version.
If you are a free user but want to try Auphonic AutoEQ: please just ask for access!

Practical Tips

For best results with Auphonic AutoEQ, however, your audio content still needs to be of sufficiently good quality, as no equalizer can make up frequencies that are not there in the first place. Audio files with low bitrates often lack important frequencies that cannot be recovered by equalizing. AutoEQ is just a feature to boost or cut individual frequency bands, not a bandwidth extension. For more information about required audio quality, see our earlier blog post: Audio File Formats and Bitrates for Podcasts.

Another important topic is the definition of the 'best result'. Equalizing is a very subjective task, and opinions on what sounds best differ a lot. So Auphonic AutoEQ is set up to follow quite conservative equalizing rules: it applies subtle tweaks and removes obvious problems rather than catering to personal preferences. This also means that your recording will not change significantly with Auphonic AutoEQ if it already sounds reasonably OK or pretty good.

Conclusion

Auphonic audio post production algorithms have been improving by leaps and bounds lately, offering you new Beta Features: Beta Auphonic Denoiser, Beta Auphonic Speech Recognition, and Beta Auphonic AutoEQ.
Right now we are intensively fine-tuning all our current Beta Features to release a new upgraded version of our Auphonic Web Service as soon as possible.
Please watch this channel for further updates – soon to come.

If you have any feedback for us or want to become a Beta user, you are very welcome to comment directly in our production interface or to contact us via email!

Auphonic Speech Recognition Engine using Whisper by OpenAI (Beta)

grh@auphonic.com (Georg) — Tue, 08 Nov 2022 08:55:26 +0000

Today we release our first self-hosted Auphonic Speech Recognition Engine using the open-source Whisper model by OpenAI!
With Whisper, you can now integrate automatic speech recognition in 99 languages into your Auphonic audio post-production workflow, without creating an external account and without extra costs!

Whisper Speech Recognition in Auphonic

So far, Auphonic users had to choose one of our integrated external service providers (Wit.ai, Google Cloud Speech, Amazon Transcribe, Speechmatics) for speech recognition, so audio files were transferred to an external server and processed with external computing power that users had to pay for in their external accounts.

The new Auphonic Speech Recognition uses Whisper, which was published by OpenAI as an open-source project. Open source means that the publicly shared GitHub repository contains a complete Whisper package including source code, examples, and research results.
However, automatic speech recognition is a very time- and hardware-intensive process that can be incredibly slow on a standard home computer without special GPUs. So we decided to integrate this service and offer you automatic speech recognition (ASR) by Whisper processed on our own hardware, just like any other Auphonic processing task, which gives you several benefits:

No external account is needed anymore to run ASR in Auphonic.
Your data doesn't leave our Auphonic servers for ASR processing.
No extra costs for external ASR services.
Additional Auphonic pre- and post-processing for more accurate ASR, especially for Multitrack Productions.
The quality of Whisper ASR is absolutely comparable to the “best” services in our comparison table.

How to use Whisper?

To use the Auphonic Whisper integration, you just have to create a production or preset as you are used to and select “Auphonic Whisper ASR” as “Service” in the section Speech Recognition.
This option will automatically appear for Beta and paying users. If you are a free user but want to try Whisper: please just ask for access!

When your Auphonic speech recognition is done, you can download your transcript in different formats and may edit or share your transcript with the Auphonic Transcript Editor.
For more details about all our integrated speech recognition services, please visit our Speech Recognition Help and watch this channel for Whisper updates – soon to come.

Why Beta?

We decided to launch Whisper for Beta and paying users only, as Whisper was only published at the end of September and there was not enough time to test every single use case sufficiently.
Another issue is the required computing power: for suitable scaling of the GPU infrastructure, we need a beta phase to test the service while we are monitoring the hardware usage, to make sure there are no server overloads.

Conclusion

Automatic speech recognition services are evolving very quickly, and we've seen major improvements over the past few years.
With Whisper, we can now perform speech recognition without extra costs on our own GPU hardware; no external services are required anymore.

Auphonic Whisper ASR is available for Beta and paying users now, and free users can ask for Beta access.
You are very welcome to send us feedback (directly in the production interface or via email), whether you notice something that works particularly well or discover any problems.
Your feedback is a great help to improve the system!

New Speechmatics API Integration and Speech Recognition Services Comparison

freymat@auphonic.com (Matthias) — Thu, 08 Sep 2022 07:46:34 +0000

Speechmatics released a new API including an enhanced transcription engine (2h free per month!), which we have now integrated into the Auphonic Web Service.
In this blog post, we also compare the accuracy of all our integrated speech recognition services and present our results.

Automatic speech recognition is most useful to make audio searchable: Even if automatically generated transcripts are not perfect and might be difficult to read (spoken text is very different from written text), they are very valuable if you try to find a specific topic within a one-hour audio file or if you need the exact time of a quote in an audio archive.
Currently, Auphonic supports the integration of the following four speech recognition services: Wit.ai, Google Cloud Speech, Amazon Transcribe, and Speechmatics.
All speech recognition services are improving very quickly lately, and we'll do our best to keep you updated – getting closer and closer to perfection.

Most recently, Speechmatics developed a new Enhanced Model, which we added to our production services. So now you have the choice between the Standard Model, with faster results and medium accuracy, and the Enhanced Model, with slower results but very good accuracy.
For each transcription model, you can process two hours of speech recognition per month for free (= 4h free per month combined). If you exceed the two hours per month and model, you will be charged $1.25/h for Standard and $1.90/h for Enhanced Model. For high volumes, you may contact the Speechmatics support for a discount.
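The pricing described above comes down to simple arithmetic. As a rough sketch only: the function name and the rate table below are illustrative, and the prices are just the ones quoted in this post, which may have changed since.

```python
# Rates and free allowance as quoted in this post; treat them as a snapshot.
FREE_HOURS_PER_MODEL = 2.0
RATE_PER_HOUR_USD = {"standard": 1.25, "enhanced": 1.90}


def monthly_cost(hours_by_model):
    """Estimate a monthly Speechmatics bill in USD.

    hours_by_model maps a model name ("standard" or "enhanced") to the
    hours transcribed with that model in one month. The free allowance
    is applied separately to each model, as described in the text.
    """
    total = 0.0
    for model, hours in hours_by_model.items():
        billable_hours = max(0.0, hours - FREE_HOURS_PER_MODEL)
        total += billable_hours * RATE_PER_HOUR_USD[model]
    return round(total, 2)


# 5h Standard (3h billable) plus 3h Enhanced (1h billable)
print(monthly_cost({"standard": 5, "enhanced": 3}))
```

Volume discounts are not modeled here, since the post only says that they are available on request.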

How do other Speech Recognition Services compare to Speechmatics?

We tried to compare the relative ASR (Automatic Speech Recognition) quality of all services in English and German – 'best' means just the best one of our integrated services.
As speech recognition services are evolving very quickly, this is just a snapshot and may change again in the near future.

	Wit.ai	Google Speech API	Amazon Transcribe	Speechmatics Standard	Speechmatics Enhanced
Price	free, also for commercial use	1+1h free per month, (Enhanced + Default Model), then ~$0.96-$2.16/h (depending on user settings)	1h free per month, (first 12 months), then ~$1.44/h	Standard 2h free per month, then ~$1.25/h, much cheaper for high volumes	Enhanced 2h free per month, then ~$1.90/h much cheaper for high volumes
ASR Quality English	basic	good (Enhanced Model)	very good	very good	best
ASR Quality German	basic	basic (Default Model)	very good	very good	best
Keyword Support	No	Yes	Yes	Yes	Yes
Word Timestamps and Confidence	No	No	Yes	Yes	Yes
Speed	fast	fast	much slower	medium	slower
Supported Languages	ar, bn, my, zh, nl, en, fi, fr, de, hi, id, it, ja, ca, ko, ms, ml, mr, pl, pt, ru, si, es, sv, tl, ta, th, tr, ur, vi	most languages supported! 138 languages and dialects (see: Google Language Support)	af-ZA, ar-AE, ar-SA, zh-CN, zh-TW, da-DK, nl-NL, en-AU, en-GB, en-IN, en-IE, en-NZ, en-AB, en-ZA, en-US, en-WL, fr-FR, fr-CA, fa-IR, de-DE, de-CH, he-IL, hi-IN, id-ID, it-IT, ja-JP, ko-KR, ms-MY, pt-PT, pt-BR, ru-RU, es-ES, es-US, ta-IN, te-IN, th-TH, tr-TR	ar, bg, yue, ca, hr, cs, da, nl, en, fi, fr, de, el, hi, hu, it, id, ja, ko, lv, lt, ms, cmn, no, pl, pt, ro, ru, sk, sl, es, sv, tr, uk	ar, bg, yue, ca, hr, cs, da, nl, en, fi, fr, de, el, hi, hu, it, id, ja, ko, lv, lt, ms, cmn, no, pl, pt, ro, ru, sk, sl, es, sv, tr, uk

Try out Speechmatics in Auphonic

1. Connect Speechmatics to your Auphonic Account

Enter a display name for the Speechmatics service in your Auphonic account.
Sign up for a Speechmatics Account.
On the Speechmatics page, go to “Manage Access” on the left, choose a name for your key and click “Generate API Key”. This API Key will only be shown once, so make sure you keep it safe! Copy your generated API Key to your Auphonic account into the form field “API Key”.
For “Model Accuracy” please select “Standard Model” or “Enhanced Model”.

If you want to use both Standard and Enhanced Models of Speechmatics once in a while, you need to create two separate services (one service for each model) in your Auphonic account!

2. Add Speechmatics to your Auphonic Production

Once your Speechmatics and Auphonic Accounts are connected, you can either create a preset or directly start your production just like you are used to.
In section “Speech Recognition” you may set “Service” to “Speechmatics”, select the language of your audio, add “Keywords” if you want, and you are ready to “Start Production”! In your Speechmatics account menu “Track Usage” there is a detailed list of your usage for the current month. For more information, you can also watch the following Video Tutorial by Speechmatics about usage, limits, and billing.

3. Correct Results using the Auphonic Transcript Editor

Auphonic also includes a Transcript Editor directly in our HTML output file.
If you use Speechmatics or Amazon Transcribe, the editor displays word confidence values to instantly see which sections should be checked manually:
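To illustrate how confidence values can guide manual review, here is a minimal sketch. The word structure (a list of dictionaries with "word", "start", and "confidence" keys) and the 0.8 threshold are assumptions made for this example, not the format used by the Transcript Editor or by any of the speech recognition services.

```python
def low_confidence_words(words, threshold=0.8):
    """Return the words whose confidence is below the threshold.

    words is a hypothetical list of dicts with the keys "word",
    "start" (seconds), and "confidence" (0.0 to 1.0).
    """
    return [w for w in words if w["confidence"] < threshold]


transcript = [
    {"word": "Welcome", "start": 0.0, "confidence": 0.98},
    {"word": "to", "start": 0.4, "confidence": 0.95},
    {"word": "Auphonic", "start": 0.6, "confidence": 0.52},
]

# Print the start time and text of each word that should be checked.
for w in low_confidence_words(transcript):
    print(f'{w["start"]:.1f}s {w["word"]}')
```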

Conclusion

Automatic Speech Recognition Services are evolving very quickly, and we've seen great improvements since our last comparisons in 2018 – especially in recognizing sloppy language, accents, and dialects.

With the new Enhanced Transcription Model by Speechmatics, we can now pass on further optimizations to you at a very reasonable price (4h free per month) – and we guess there are more improvements to come pretty soon.

Also, please let us know if you get different results comparing ASR services or if you compare services in other languages!

New Advanced Leveler with Broadcast Parameters (MaxLRA, MaxS, MaxM)

grh@auphonic.com (Georg) — Fri, 04 Dec 2020 08:24:11 +0000

Today we are thrilled to introduce revised parameters for the Adaptive Leveler to move our advanced algorithms out of beta.
The leveler can now run in three modes, which allow detailed Leveler Strength control and also the use of Broadcast Parameters (Max. Loudness Range, Max. Short-term Loudness, Max. Momentary Loudness) to limit the amount of leveling.

Photo by Gemma Evans.

When we first introduced our advanced parameters, we used the Maximum Loudness Range (MaxLRA) value to control the strength of our leveler. This gave good results, but it turned out that only pure speech programs give reliable and comparable LRA values and it was sometimes difficult to set a loudness range target for diverse audio content. To resolve this issue, we reworked the parameter and called it Dynamic Range.
After discussions with our users, however, we received the feedback that the name Dynamic Range was too confusing, so we decided to call it Leveler Strength.

In our discussions with users, we were also told that they would like to be able to set a loudness range target to limit the amount of leveling, because MaxLRA is often used by broadcasters and in regulations.
As a solution, we added a Broadcast Mode, which makes it now possible to use the MaxLRA, MaxS, and MaxM values to control the strength of our leveler as well.

In this blog post, we will first discuss the new parameters - Leveler Strength, Music Gain, MusicSpeech Classifier Settings, MaxLRA, MaxS, and MaxM - of our Singletrack Advanced Leveler, then we will show how these settings can be used in the Multitrack Advanced Leveler.

Singletrack Advanced Leveler

The Adaptive Leveler normalizes all speakers to a similar loudness so that a consumer in a car or subway doesn't feel the need to reach for the volume control. However, in other environments (living room, cinema, etc.) or in dynamic recordings, you might want more level differences (Dynamic Range, Loudness Range / LRA) between speakers and within music segments.

Our new parameters let users control the Leveler Strength to adjust mid-term level differences, similar to a sound engineer using the faders of an audio mixer, and Compressor Settings for short-term dynamics control.
The Advanced Leveler can be used in three different Modes:
Default Mode, Separate MusicSpeech Parameters, and Broadcast Mode.

For more details, please see our Leveler Parameters Help.

Default Mode

Leveler Strength:: The Leveler Strength controls how much leveling is applied: 100% means full leveling, 0% means no leveling at all. Changing the Leveler Strength increases/decreases the Dynamic Range of the output file.

Example Use Case:
Lower Leveler Strength values should be used if you want to keep more loudness differences in dynamic narration or dynamic music recordings (live concert/classical).
Compressor Settings:: Here you can select a preset value for micro-dynamics compression.
A compressor reduces the volume of short and loud spikes like the pronunciation of "p" and "t" or laughter (short-term dynamics) and also shapes the sound of your voice (making the sound more or less "processed" or "punchy").

Separate MusicSpeech Parameters

In the Separate MusicSpeech Parameters mode, independent settings for music and speech segments (Music Leveler Strength, Music Compressor) can be selected.
These settings allow you to use, for example, more leveling in speech segments while keeping music and FX elements less processed.

You can also disable our music/speech classifier or add a gain to music segments:

MusicSpeech Classifier:: Use our speech/music classifier to level music and speech segments separately, or override the classifier decision and treat the whole audio file as speech or music.
Music Gain:: Add a gain to music segments, to make music louder or softer compared to the speech parts. Use the default setting (0 dB) to give music and speech parts a similar average loudness.

Broadcast Mode

The Broadcast Mode uses different parameters, which are often used by broadcasters and in regulations, to control the Leveler Strength:

Maximum Loudness Range (LRA):: The loudness range (LRA) indicates the variation of loudness throughout a program and is measured in LU (loudness units) - for more details see Loudness Measurement and Normalization or EBU Tech 3342.
The volume changes of our Leveler will be restricted so that the LRA of the output file is below the selected value (if possible).
High LRA values will result in very dynamic output files, whereas low LRA values will result in compressed output audio. If the LRA value of your input file is already below the maximum loudness range value, no leveling at all will be applied.

Loudness Range values are most reliable for pure speech programs: a typical LRA value for news programs is 3 LU; for talks and discussions, an LRA value of 5 LU is common. LRA values for features, radio dramas, movies, or music strongly depend on the individual character and might be in the range of 5 to 25 LU - for more information, please see Where LRA falls short.
Netflix, for instance, recommends an LRA of 4 to 18 LU for the overall program and 7 LU or less for dialog.
Maximum Short-term Loudness (MaxS):: Set a Maximum Short-term Loudness target (3s measurement window, see EBU Tech 3341, Section 2.2) relative to your Global Loudness Normalization Target.
Our Adaptive Leveler will ensure that the MaxS loudness values of the output file, which are measured with an integration time of 3s, will be below this target (if possible).
For example, if the MaxS value is set to +5 LU relative and the Loudness Target to -23 LUFS, then the absolute MaxS value of your output file will be restricted to -18 LUFS.

The Max Short-term Loudness is used in certain regulations for short-form content and advertisements.
See for example EBU R128 S1: Loudness Parameters for Short-form Content (advertisements, promos, etc.), which recommends a Max Short-term Loudness of +5 LU relative.
Maximum Momentary Loudness (MaxM):: Similar to the MaxS target, it's also possible to use a Maximum Momentary Loudness target (0.4s measurement window, see EBU Tech 3341, Section 2.2) relative to your Global Loudness Normalization Target.
Our Adaptive Leveler will ensure that the MaxM loudness values of the output file, which are measured with an integration time of 0.4s, will be below this target (if possible).

The Max Momentary Loudness is used in certain regulations by broadcasters. For example, CBC and Radio Canada require that the Momentary Loudness must not exceed +10 LU above the target loudness.

If it's not possible for the levels of the output file to be below the given MaxLRA, MaxS, or MaxM target values, you will receive a warning message via email and on the production page.
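Because MaxS and MaxM are set relative to the loudness target, it can help to see the conversion to absolute values spelled out. The following is only a sketch of that arithmetic and of a simple limit check. The function names and the `measured` dictionary are hypothetical, and this is not how the Adaptive Leveler itself is implemented.

```python
def absolute_limit(loudness_target_lufs, relative_max_lu):
    """Convert a relative MaxS or MaxM setting (LU) to an absolute limit (LUFS)."""
    return loudness_target_lufs + relative_max_lu


def exceeded_limits(measured, loudness_target_lufs,
                    max_lra=None, max_s=None, max_m=None):
    """List the broadcast parameters that the measured output exceeds.

    measured is a hypothetical dict with the keys "lra" (in LU) and
    "max_s" / "max_m" (absolute values in LUFS). Limits that are left
    as None are not checked.
    """
    exceeded = []
    if max_lra is not None and measured["lra"] > max_lra:
        exceeded.append("MaxLRA")
    if max_s is not None and measured["max_s"] > absolute_limit(loudness_target_lufs, max_s):
        exceeded.append("MaxS")
    if max_m is not None and measured["max_m"] > absolute_limit(loudness_target_lufs, max_m):
        exceeded.append("MaxM")
    return exceeded


# The example from the text: MaxS of +5 LU relative to a -23 LUFS target
print(absolute_limit(-23, 5))  # -18
```

A non-empty result from `exceeded_limits` corresponds to the situation in which a warning would be issued.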

Example Use Case:
The broadcast parameters can be used to generate automatic mixdowns with different LRA values for different target environments (very compressed environments like mobile devices or Alexa, or very dynamic ones like home cinema, etc.).

Multitrack Advanced Leveler

The new leveling parameters are also available in our multitrack version.
Here you can set separate leveling parameters per track, and also use broadcast parameters in the final mixdown, to ensure that your levels are below the given MaxLRA, MaxS, or MaxM target values.

Leveling Parameters per Track

The parameters Leveler Strength, Compressor, MusicSpeech Classifier, Stereo Panorama, and Track Gain allow you to customize which parts of the track audio should be leveled, how much they should be leveled, and how much dynamic range compression should be applied.

MusicSpeech Classifier Setting:: Select between the Speech Track and Music Track Adaptive Leveler.
If this is set to On, a classifier will decide if this is a music or speech track.
Stereo Panorama (Balance):: Change the stereo panorama (balance for stereo input files) of the current track.
Track Gain (in the Fore/Background section):: Increase/decrease the loudness of this track compared to other tracks.
This can be used to add gain to a music or a specific speech track, making it louder/softer compared to other tracks.

For more details, please see Multitrack Leveler Parameters Help.

Leveling Master Parameters

In addition to our track parameters, you can switch the Leveler Mode in the master algorithm settings to Broadcast Mode to control the combined leveling strength.
Volume changes of our leveling algorithms will be adjusted so that the final mixdown of the multitrack production meets the given MaxLRA, MaxS, or MaxM target values - as is done in the Singletrack Broadcast Mode.

For more details, please see Master Algorithm Parameters Help.

Summary

We revised the leveling parameters to end the beta phase of our advanced algorithms.
All advanced settings are stable and will not change significantly going forward.

The following leveling parameters are available:

Leveler Strength: control the strength of the leveling algorithm
Compressor: select a preset for short-term dynamics control
Music Gain / Track Gain: make music parts louder/softer compared to speech parts
MusicSpeech Classifier: use our classifier or set everything to speech/music
Separate MusicSpeech Parameters Mode: separate controls for speech and music parts
Broadcast Mode: use the parameters Maximum Loudness Range (MaxLRA), Maximum Short-term Loudness (MaxS), and Maximum Momentary Loudness (MaxM) to control the leveling strength

All new settings are also available in our API. Please see Singletrack and Multitrack Advanced API Settings.

Don't hesitate to contact us if you have any questions or feedback about our algorithms!

Advanced Multitrack Audio Algorithms Release (Beta)

grh@auphonic.com (Georg) — Fri, 29 Mar 2019 10:16:41 +0000

Last weekend, at the Subscribe10 conference, we released Advanced Audio Algorithm Parameters for Multitrack Productions:

We launched our advanced audio algorithm parameters for Singletrack Productions last year. Now these settings (and more) are available for Multitrack Algorithms as well, which gives you detailed control for each track of your production.

The following new parameters are available:

Fore/Background Settings: keep your music/clip tracks unchanged and set a custom background gain
Multitrack Leveler Parameters: control the stereo panorama, leveling algorithm, dynamic range and compression
Better Hum and Noise Reduction Controls for each track
Maximum True Peak Level setting for the final mixdown
Full API Support

Please join our private beta program and let us know how you use these new features or if you need even more control!

Fore/Background Settings

The parameter Fore/Background controls whether a track should be in foreground, in background, ducked, or unchanged, which is especially important for music or clip tracks.
For more details, please see Automatic Ducking, Foreground and Background Tracks.

We have now added the new option Unchanged and a new parameter to set the level of background segments/tracks:

Unchanged (Foreground):: We sometimes received complaints from users who produced very complex music or clip tracks that Auphonic changes the levels too much.
If you set the parameter Fore/Background to the new option Unchanged (Foreground), level relations within this track won’t be changed at all. It will be added to the final mixdown so that foreground/solo parts of this track will be as loud as (foreground) speech from other tracks.
Background Level:: It is now possible to set the level of background segments/tracks (compared to foreground segments) in background and ducking tracks. By default, background and ducking segments are 18dB softer than foreground segments.
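To get a feeling for what a level difference of 18dB means, the value can be converted to a linear amplitude ratio with the standard decibel formula. This is a generic conversion given as a sketch; the function and constant names are illustrative and not part of any Auphonic interface.

```python
def db_to_amplitude_ratio(level_db):
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10 ** (level_db / 20)


# Default from the text: background segments are 18dB softer than foreground.
DEFAULT_BACKGROUND_LEVEL_DB = -18.0

ratio = db_to_amplitude_ratio(DEFAULT_BACKGROUND_LEVEL_DB)
print(f"{ratio:.3f}")  # 0.126
```

In other words, with the default setting a background segment has roughly one eighth of the amplitude of a foreground segment.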

Leveler Parameters

Similar to our Singletrack Advanced Leveler Parameters (see this previous blog post), we have now also released leveling parameters for Multitrack Productions.
The following advanced parameters for our Multitrack Adaptive Leveler can be set for each track and allow you to customize which parts of the audio should be leveled, how much they should be leveled, how much dynamic range compression should be applied and to set the stereo panorama (balance):

Leveler Preset:: Select the Speech or Music Leveler for this track.
If set to Automatic (default), a classifier will decide if this is a music or speech track.
Dynamic Range:: The parameter Dynamic Range controls how much leveling is applied: Higher values result in more dynamic output audio files (less leveling). If you want to increase the dynamic range by 3dB (or LU), just increase the Dynamic Range parameter by 3dB.
For more details, please see Multitrack Leveler Parameters.
Compressor:: Select a preset for Micro-Dynamics Compression: Auto, Soft, Medium, Hard or Off.
The Compressor adjusts short-term dynamics, whereas the Leveler adjusts mid-term level differences.
For more details, please see Multitrack Leveler Parameters.
Stereo Panorama (Balance):: Change the stereo panorama (balance for stereo input files) of the current track.
Possible values: L100, L75, L50, L25, Center, R25, R50, R75 and R100.

If you understand German and want to know more about our Advanced Leveler Parameters and audio dynamics in general, watch our talk at the Subscribe10 conference:
Video: Audio Lautheit und Dynamik.

Better Hum and Noise Reduction Controls

We now offer three parameters to control the combination of our Multitrack Noise and Hum Reduction Algorithms for each input track:

Noise Reduction Amount:: Maximum noise and hum reduction amount in dB, higher values remove more noise.
In Auto mode, a classifier decides if and how much noise reduction is necessary (to avoid artifacts). Set to a custom (non-Auto) value if you prefer more noise reduction or want to bypass our classifier.
Hum Base Frequency:: Set the hum base frequency to 50Hz or 60Hz (if you know it), or use Auto to automatically detect the hum base frequency in each speech region.
Hum Reduction Amount:: Maximum hum reduction amount in dB, higher values remove more noise.
In Auto mode, a classifier decides how much hum reduction is necessary in each speech region. Set it to a custom value (> 0), if you prefer more hum reduction or want to bypass our classifier. Use Disable Dehum to disable hum reduction and use our noise reduction algorithms only.

Behavior of noise and hum reduction parameter combinations:

Noise Reduction Amount	Hum Base Frequency	Hum Reduction Amount	Behavior
Auto	Auto	Auto	Automatic hum and noise reduction
Auto or > 0	*	Disabled	No hum reduction, only denoise
Disabled	50Hz	Auto or > 0	Force 50Hz hum reduction, no denoise
Disabled	Auto	Auto or > 0	Automatic dehum, no denoise
12dB	60Hz	Auto or > 0	Always do dehum (60Hz) and denoise (12dB)
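The combinations in the table follow one simple pattern: each reduction step runs unless its amount is disabled. The sketch below generalizes the listed rows under that assumption. The constants and the function name are hypothetical and do not correspond to API parameter names.

```python
AUTO = "auto"
DISABLED = "disabled"


def plan_reduction(noise_amount, hum_frequency, hum_amount):
    """Describe which reduction steps run for a combination of settings.

    noise_amount and hum_amount are AUTO, DISABLED, or a positive dB value.
    hum_frequency is AUTO, 50, or 60 (Hz) and only matters if dehum runs.
    Returns a dict in which a value of None means that the step is skipped.
    """
    denoise = None if noise_amount == DISABLED else noise_amount
    if hum_amount == DISABLED:
        dehum = None
    else:
        dehum = {"frequency": hum_frequency, "amount": hum_amount}
    return {"denoise": denoise, "dehum": dehum}


# Last row of the table: always do dehum (60Hz) and denoise (12dB)
print(plan_reduction(12, 60, AUTO))
```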

Maximum True Peak Level

In the Master Algorithm Settings of your multitrack production, you can set the maximum allowed true peak level of the processed output file, which is controlled by the True Peak Limiter after our Loudness Normalization algorithms.

If set to Auto (which is the current default), a reasonable value according to the selected loudness target is used: -1dBTP for -23 LUFS (EBU R128) and higher, -2dBTP for -24 LUFS (ATSC A/85) and lower loudness targets.
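The Auto rule can be written down as a small function. This is a sketch of the rule as stated in this post only. The function name is hypothetical, and because the text does not say what happens for targets between -24 and -23 LUFS, the sketch rejects them instead of guessing.

```python
def auto_max_true_peak(loudness_target_lufs):
    """Return the Auto maximum true peak level (dBTP) for a loudness target.

    Follows the rule quoted in the text: -1 dBTP for targets of -23 LUFS
    and higher, -2 dBTP for targets of -24 LUFS and lower.
    """
    if loudness_target_lufs >= -23:
        return -1.0
    if loudness_target_lufs <= -24:
        return -2.0
    # Behavior between -24 and -23 LUFS is not specified in the post.
    raise ValueError("no Auto value defined between -24 and -23 LUFS")


print(auto_max_true_peak(-16))  # -1.0
print(auto_max_true_peak(-24))  # -2.0
```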

Full API Support

All advanced algorithm parameters, for Singletrack and Multitrack Productions, are available in our API as well, which allows you to integrate them into your scripts, external workflows and third-party applications.

Singletrack API:

Documentation on how to use the advanced algorithm parameters in our singletrack production API: Advanced Algorithm Parameters

Multitrack API:

Documentation of advanced settings for each track of a multitrack production:
Multitrack Advanced Audio Algorithm Settings

Join the Beta and Send Feedback

Please join our beta and let us know your case studies, if you need any other algorithm parameters or if you have any questions!

Here are some private beta invitation codes:

8tZPc3T9pH VAvO8VsDg9 0TwKXBW4Ni kjXJMivtZ1 J9APmAAYjT Zwm6HabuFw HNK5gF8FR5 Do1MPHUyPW CTk45VbV4t xYOzDkEnWP
9XE4dZ0FxD 0Sl3PxDRho uSoRQxmKPx TCI62OjEYu 6EQaPYs7v4 reIJVOwIr8 7hPJqZmWfw kti3m5KbNE GoM2nF0AcN xHCbDC37O5
6PabLBRm9P j2SoI8peiY olQ2vsmnfV fqfxX4mWLO OozsiA8DWo weJw0PXDky VTnOfOiL6l B6HRr6gil0 so0AvM1Ryy NpPYsInFqm
oFeQPLwG0k HmCOkyaX9R G7DR5Sc9Kv MeQLSUCkge xCSvPTrTgl jyQKG3BWWA HCzWRxSrgW xP15hYKEDl 241gK62TrO Q56DHjT3r4
9TqWVZHZLE aWFMSWcuX8 x6FR5OTL43 Xf6tRpyP4S tDGbOUngU0 5BkOF2I264 cccHS0KveO dT29cF75gG 2ySWlYp1kp iJWPhpAimF

We are happy to send further invitation codes to all interested users - please do not hesitate to contact us!

If you have an invitation code, you can enter it here to activate the Multitrack Advanced Audio Algorithm Parameters:
Auphonic Algorithm Parameters Private Beta Activation

More Languages for Amazon Transcribe Speech Recognition

grh@auphonic.com (Georg) — Thu, 31 Jan 2019 10:30:26 +0000

Until recently, Amazon Transcribe supported speech recognition in English and Spanish only.
They have now added French, Italian and Portuguese as well - and a few other languages (including German) are in private beta.

Update March 2019:
Now Amazon Transcribe supports German and Korean as well.

The Auphonic Audio Inspector on the status page of a finished Multitrack Production including speech recognition.
Please click on the screenshot to see it in full resolution!

Amazon Transcribe is integrated as a speech recognition engine within Auphonic and offers accurate transcriptions (compared to other services) at low cost, including keywords / custom vocabulary support, word confidence, timestamps, and punctuation.
See the following AWS blog post and video for more information about recent Amazon Transcribe developments: Transcribe speech in three new languages: French, Italian, and Brazilian Portuguese.

Amazon Transcribe is also a perfect fit if you want to use our Transcript Editor because you will be able to see word timestamps and confidence values to instantly check which section/words should be corrected manually to increase the transcription accuracy:

Screenshot of our Transcript Editor with word confidence highlighting and the edit bar.

These features are also available if you use Speechmatics, but unfortunately not in our other integrated speech recognition services.

About Speech Recognition within Auphonic

Auphonic has built a layer on top of a few external speech recognition services to make audio searchable:
Our classifiers generate metadata during the analysis of an audio signal (music segments, silence, multiple speakers, etc.) to divide the audio file into small and meaningful segments, which are processed by the speech recognition engine. The results from all segments are then combined, and meaningful timestamps, simple punctuation and structuring are added to the resulting text.

To learn more about speech recognition within Auphonic, take a look at our Speech Recognition and Transcript Editor help pages or listen to our Speech Recognition Audio Examples.

A comparison table of our integrated services (price, quality, languages, speed, features, etc.) can be found here: Speech Recognition Services Comparison.

Conclusion

We hope that Amazon and others will continue to add new languages, to get accurate and inexpensive automatic speech recognition in many languages.

Don't hesitate to contact us if you have any questions or feedback about speech recognition or our transcript editor!

Auphonic Adaptive Leveler Customization (Beta Update)

grh@auphonic.com (Georg) — Mon, 05 Nov 2018 11:42:22 +0000

In late August, we launched the private beta program of our advanced audio algorithm parameters. After feedback by our users and many new experiments, we are proud to release a complete rework of the Adaptive Leveler parameters:

In the previous version, we based our Adaptive Leveler parameters on the Loudness Range descriptor (LRA), which is included in the EBU R128 specification.
Although it worked, it turned out that it is very difficult to set a loudness range target for diverse audio content, which includes speech, background sounds, music parts, etc. The results were not predictable and it was hard to find good target values.
Therefore we developed our own algorithm to measure the dynamic range of audio signals, which works similarly for speech, music and other audio content.

The following advanced parameters for our Adaptive Leveler allow you to customize which parts of the audio should be leveled (foreground, all, speech, music, etc.), how much they should be leveled (dynamic range), and how much micro-dynamics compression should be applied.

To try out the new algorithms, please join our private beta program and let us know your feedback!

Leveler Preset

The Leveler Preset defines which parts of the audio should be adjusted by our Adaptive Leveler:

Default Leveler:
Our classic, default leveling algorithm as demonstrated in the Leveler Audio Examples. Use it if you are unsure.
Foreground Only Leveler:
This preset reacts slower and levels foreground parts only. Use it if you have background speech or background music, which should not be amplified.
Fast Leveler:
A preset which reacts much faster. It is built for recordings with fast and extreme loudness differences, for example, to amplify very quiet questions from the audience in a lecture recording, to balance fast-changing soft and loud voices within one audio track, etc.
Amplify Everything:
Amplify as much as possible. Similar to the Fast Leveler, but also amplifies non-speech background sounds like noise.

Leveler Dynamic Range

Our default Leveler tries to normalize all speakers to a similar loudness so that a consumer in a car or subway doesn't feel the need to reach for the volume control.
However, in other environments (living room, cinema, etc.) or in dynamic recordings, you might want more level differences (Dynamic Range, Loudness Range / LRA) between speakers and within music segments.

The parameter Dynamic Range controls how much leveling is applied: Higher values result in more dynamic output audio files (less leveling). If you want to increase the dynamic range by 3dB (or LU), just increase the Dynamic Range parameter by 3dB.
We also like to call this the Loudness Comfort Zone: as long as the level stays between a minimum and a maximum possible level (the comfort zone), no leveling is applied. So if your input file already has a small dynamic range (is within the comfort zone), our leveler will simply be bypassed.
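
The comfort zone idea can be illustrated with a small sketch. This is not Auphonic's actual leveling algorithm. The function name `comfort_zone_gain_db`, the symmetric zone around a target level, and the example values are all assumptions made for illustration only.

```python
def comfort_zone_gain_db(level_db: float, target_db: float, dynamic_range_db: float) -> float:
    """Return the gain in dB that moves a level into the comfort zone.

    Assumption: the comfort zone is symmetric around target_db and is
    dynamic_range_db wide. Levels already inside the zone get 0 dB gain,
    which corresponds to the leveler being bypassed.
    """
    half_width = dynamic_range_db / 2.0
    upper = target_db + half_width
    lower = target_db - half_width
    if level_db > upper:
        return upper - level_db  # negative gain: attenuate loud parts
    if level_db < lower:
        return lower - level_db  # positive gain: amplify quiet parts
    return 0.0  # inside the comfort zone: no leveling


if __name__ == "__main__":
    # Hypothetical segment levels in dB relative to an arbitrary reference.
    for level in (-35.0, -20.0, -10.0):
        print(level, comfort_zone_gain_db(level, target_db=-23.0, dynamic_range_db=10.0))
```

In this toy model, increasing `dynamic_range_db` by 3 widens the zone by 3 dB, so fewer segments are touched and the output stays more dynamic.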

Example Use Cases:
Higher dynamic range values should be used if you want to keep more loudness differences in dynamic narration or dynamic music recordings (live concert/classical).
It is also possible to utilize this parameter to generate automatic mixdowns with different loudness range (LRA) values for different target environments (very compressed ones like mobile devices or Alexa, very dynamic ones like home cinema, etc.).

Compressor

Controls Micro-Dynamics Compression:
The compressor reduces the volume of short and loud spikes like "p", "t" or laughter (short-term dynamics) and also shapes the sound of your voice (it will sound more or less "processed").
The Leveler, on the other hand, adjusts mid-term level differences, as done by a sound engineer, using the faders of an audio mixer, so that a listener doesn't have to adjust the playback volume all the time.
For more details please see Loudness Normalization and Compression of Podcasts and Speech Audio.

Possible values are:

Auto:
The compressor setting depends on the selected Leveler Preset. Medium compression is used in Foreground Only and Default Leveler presets, Hard compression in our Fast Leveler and Amplify Everything presets.
Soft:
Uses less compression.
Medium:
Our default setting.
Hard:
More compression, especially tries to compress short and extreme level overshoots. Use this preset if you want your voice to sound very processed, or if you have extreme and fast-changing level differences.
Off:
No short-term dynamics compression is used at all, only mid-term leveling. Switch off the compressor if you just want to adjust the loudness range without any additional micro-dynamics compression.
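
The Auto behavior described above amounts to a simple lookup from the Leveler Preset to a compressor setting. The following sketch only restates that mapping. The names `AUTO_COMPRESSOR` and `effective_compressor` are hypothetical and are not part of any Auphonic API.

```python
# Mapping taken from the description of the "Auto" setting above.
AUTO_COMPRESSOR = {
    "Default Leveler": "Medium",
    "Foreground Only Leveler": "Medium",
    "Fast Leveler": "Hard",
    "Amplify Everything": "Hard",
}


def effective_compressor(leveler_preset: str, compressor: str = "Auto") -> str:
    """Resolve the compressor setting that is actually used.

    Any explicit choice (Soft, Medium, Hard, Off) is kept as-is;
    only "Auto" is resolved through the selected Leveler Preset.
    """
    if compressor != "Auto":
        return compressor
    return AUTO_COMPRESSOR[leveler_preset]


if __name__ == "__main__":
    print(effective_compressor("Fast Leveler"))
    print(effective_compressor("Fast Leveler", "Off"))
```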

Separate Music/Speech Parameters

Use the switch Separate Music/Speech Parameters (top right) to see separate Adaptive Leveler parameters for music and speech segments and to control all leveling details separately for speech and music parts:

For dialog intelligibility improvements in films and TV, it is important that the speech/dialog level and loudness range is not too soft compared to the overall programme level and loudness range. This parameter allows you to use more leveling in speech parts while keeping music and FX elements less processed.
Note: Speech, music and overall loudness and loudness range of your production are also displayed in our Audio Processing Statistics!

Example Use Case:
Music live recordings or dynamic music mixes, where you want to amplify all speakers (speech dynamic range should be small) but keep the dynamic range within and between music segments (music dynamic range should be high).
Dialog intelligibility improvements for films and TV, without affecting music and FX elements.

Other Advanced Audio Algorithm Parameters

We also offer advanced audio parameters for our Noise and Hum Reduction and Global Loudness Normalization algorithms:

For more details, please see the Advanced Audio Algorithms Documentation.

Want to know more?

If you want to know more details about our advanced algorithm parameters (especially the leveler parameters), please listen to the following podcast interview with Chris Curran (Podcast Engineering School):
Auphonic’s New Advanced Features, with Georg Holzmann – PES 108

Advanced Parameters Private Beta and Feedback

At the moment the advanced algorithm parameters are for beta users only. This is to allow us to get user feedback, so we can change the parameters to suit user needs.
Please let us know your case studies, if you need any other algorithm parameters or if you have any questions!

Here are some private beta invitation codes:

jbwCVpLYrl 6zmLqq8o3z RXYIUbC6al QDmIZLuPKa JIrnGRZBgl SWQOWeZOBD ISeBCA9gTy w5FdsyhZVI qWAvANQ5mC twOjdHrit3
KwnL2Le6jB 63SE2V54KK G32AULFyaM 3H0CLYAwLU mp1GFNVZHr swzvEBRCVa rLcNJHUNZT CGGbL0O4q1 5o5dUjruJ9 hAggWBpGvj
ykJ57cFQSe 0OHAD2u1Dx RG4wSYTLbf UcsSYI78Md Xedr3NPCgK mI8gd7eDvO 0Au4gpUDJB mYLkvKYz1C ukrKoW5hoy S34sraR0BU
J2tlV0yNwX QwNdnStYD3 Zho9oZR2e9 jHdjgUq420 51zLbV09p4 c0cth0abCf 3iVBKHVKXU BK4kTbDQzt uTBEkMnSPv tg6cJtsMrZ
BdB8gFyhRg wBsLHg90GG EYwxVUZJGp HLQ72b65uH NNd415ktFS JIm2eTkxMX EV2C5RAUXI a3iwbxWjKj X1AT7DCD7V y0AFIrWo5l

We are happy to send further invitation codes to all interested users - please do not hesitate to contact us!

If you have an invitation code, you can enter it here to activate the advanced audio algorithm parameters:
Auphonic Algorithm Parameters Private Beta Activation

Resumable File Uploads to Auphonic

grh@auphonic.com (Georg) — Tue, 04 Sep 2018 09:39:53 +0000

Large file uploads in a web browser are problematic, even in 2018. If working with a poor network connection, uploads can fail and have to be retried from the start.

At Auphonic, our users have to upload large audio and video files, or multiple media files when creating a multitrack production. To minimize any potential issues, we integrated various external services which are specialized for large file transfers, like FTP, SFTP, Dropbox, Google Drive, S3, etc.

To further minimize issues, as of today we have also released resumable and chunked direct file uploads in the web browser to auphonic.com.

If you are not interested in the technical details, please just go to the section Resumable Uploads in Auphonic below.

The Problem with Large File Uploads in the Browser

If using either mobile networks (which remain fragile) or unstable WiFi connections, file uploads are often interrupted and will fail. There are also many areas in the world where connections are quite poor, which makes uploading big files frustrating.

After an interrupted file upload, the web browser must restart the whole upload from the start, which is a problem when it happens in the middle of a 4GB video file upload on a slow connection.
Furthermore, the longer an upload takes, the more likely it is to have a network glitch interrupting the upload, which then has to be retried from the start.

The Solution: Chunked, Resumable Uploads

To avoid user frustration, we need to be able to detect network errors and potentially resume an upload without having to restart it from the beginning.

To achieve this, we have to split a file upload into smaller chunks directly within the web browser, so that these chunks can then be sent to the server one by one.
If an upload fails or the user wants to pause, it is possible to resume it later and only send those chunks that have not already been uploaded.
If there is a network interruption or change, the upload will be retried automatically.
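
The chunk-and-resume logic can be sketched in a few lines. This is a minimal illustration of the general technique, not the resumable.js protocol and not Auphonic's server implementation. The names `iter_chunks`, `upload_missing`, the `send_chunk` callback, the chunk size and the retry count are all assumptions.

```python
import io
from typing import BinaryIO, Callable, Iterator, Set, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an arbitrary value chosen for illustration


def iter_chunks(f: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, data) pairs for consecutive chunks of a file."""
    index = 0
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        yield index, data
        index += 1


def upload_missing(
    f: BinaryIO,
    already_uploaded: Set[int],
    send_chunk: Callable[[int, bytes], None],
    chunk_size: int = CHUNK_SIZE,
    max_retries: int = 3,
) -> int:
    """Send only the chunks the server does not have yet.

    already_uploaded holds the chunk indices received in an earlier session.
    A chunk that fails with ConnectionError is retried up to max_retries times.
    Returns the number of chunks sent in this session.
    """
    sent = 0
    for index, data in iter_chunks(f, chunk_size):
        if index in already_uploaded:
            continue  # resume: skip what was uploaded before
        for attempt in range(max_retries):
            try:
                send_chunk(index, data)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise
        already_uploaded.add(index)
        sent += 1
    return sent


if __name__ == "__main__":
    received = {}
    source = io.BytesIO(b"abcdefghij")
    # Pretend chunk 0 was uploaded before the connection dropped.
    count = upload_missing(source, {0}, received.__setitem__, chunk_size=4)
    print(count, received)
```

In a real setup, `send_chunk` would be an HTTP request and the set of already uploaded chunks would be queried from the server before resuming.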

Companies like Dropbox, Google, Amazon AWS etc. all have their own protocols and APIs for chunked uploads, but there are also some open source implementations available, which offer resumable uploads:

resumable.js: "A JavaScript library providing multiple simultaneous, stable and resumable uploads via the HTML5 File API"
This solution is a JavaScript library only and requires that the protocol is implemented on the server as well.
tus.io: "Open Protocol for Resumable File Uploads"
Tus.io offers a simple, cheap and reusable stack for clients and servers (in many languages). They have a blog with further information about resumable uploads, see tus blog.
plupload: A JavaScript library, similar to resumable.js, which requires a separate server implementation.

We chose to use resumable.js and developed our own server implementation.

Resumable Uploads in Auphonic

If you upload files to a singletrack or multitrack production, you will see the upload progress bar and a pause button, which is one way to pause and resume an upload:

It is also possible to close the browser completely or shut down your computer during the upload, then edit the production and upload the file again later. This will just resume the file upload from the position where it was stopped before.
(Previously uploaded chunks are saved for 24h on our servers, after that you have to start the whole upload again.)

In case of a network problem or if you switch to a different connection, we will resume the upload automatically.
This should solve many problems which were reported by some users in the past!

You can of course also use any of our external services for stable incoming and outgoing file transfers!

Do you still have Uploading Issues?

We hope that uploads to Auphonic are much more reliable now, even on poor connections.

If you still experience any problems, please let us know.
We appreciate any bug reports and will do our best to fix them!

Codec2: a whole Podcast on a Floppy Disk

stewart.cuthew@gmail.com (Stewart) — Fri, 01 Jun 2018 09:28:53 +0000

In a previous blogpost we talked about the Opus codec, which offers very low bitrates. Another codec seeking to achieve even lower bitrates is Codec 2.

Codec 2 is designed for use with speech only, and although the bitrates are impressive, the results aren’t as clear as Opus, as you can hear in the following audio examples. However, there is some interesting work being done with Codec 2 in combination with neural networks (WaveNets) that is yielding great results.

Layers of a WaveNet neural network.

Background

Codec 2 is an open source codec designed for speech, and aims for compression rates between 700bps and 3200bps (bits per second).

The man behind it, David Rowe, is an electronic engineer currently living in South Australia. He started the project in September 2009, with the main aim of improving low-cost radio communication for people living in remote areas of the world. With this in mind, he set out to develop a codec that would significantly reduce file sizes and the bandwidth required when streaming.

Another motivation, according to David, was to be free from patented technologies used by closed source codecs, which he believes “require expensive and awkward licenses and are stifling innovation”. His belief is that this work can be done without requiring the use of patent protected codecs, so all his work is open source.

Potential Applications

Rowe’s perceived applications include VOIP trunking, voice over low bandwidth HF/VHF digital radio (especially for amateur radio, so as to avoid issues with the use of proprietary codecs), and developing world and remote area communications, including military, police and emergency services.

We at Auphonic are interested in it for its potential for longer podcasts, presentations and audiobooks, allowing for low storage requirements and minimizing the effect of bad network connections.

How it Works

To achieve the low bitrates sought, speech has to be reduced to the smallest possible amount of data, which means that the redundant information that is transmitted has to be minimized.

To do this, Codec 2 uses harmonic sinusoidal speech coding. This splits the speech into 10-30 ms segments, called frames. Each frame is then analysed for the fundamental frequency (or pitch), and the number of harmonics that fit into a 4 kHz bandwidth. Further, for each of the harmonics within the 4 kHz range, the amplitude and phase are recorded.

This information is then coded, and the decoder reconstructs the audio based on this data.
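
To make the harmonic model more concrete, here is a toy sketch. It is not the Codec 2 implementation. The function names, the 8 kHz sample rate and the example pitch are assumptions for illustration. It counts how many harmonics of a given pitch fit below 4 kHz and rebuilds one frame as a sum of sinusoids from per-harmonic amplitudes and phases.

```python
import math
from typing import List, Sequence

BANDWIDTH_HZ = 4000.0   # bandwidth mentioned in the text
SAMPLE_RATE_HZ = 8000   # assumed sample rate for this sketch


def harmonic_count(f0_hz: float, bandwidth_hz: float = BANDWIDTH_HZ) -> int:
    """Number of harmonics k * f0 (k = 1, 2, ...) that fit into the bandwidth."""
    return int(bandwidth_hz // f0_hz)


def synthesize_frame(
    f0_hz: float,
    amplitudes: Sequence[float],
    phases: Sequence[float],
    n_samples: int,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
) -> List[float]:
    """Rebuild one frame as a sum of harmonically related cosines."""
    frame = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        sample = sum(
            a * math.cos(2.0 * math.pi * (k + 1) * f0_hz * t + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases))
        )
        frame.append(sample)
    return frame


if __name__ == "__main__":
    f0 = 150.0  # hypothetical pitch of a voice in Hz
    count = harmonic_count(f0)
    # Made-up spectrum: amplitudes falling off with the harmonic number.
    amps = [1.0 / (k + 1) for k in range(count)]
    phs = [0.0] * count
    frame = synthesize_frame(f0, amps, phs, n_samples=160)  # 20 ms at 8 kHz
    print(count, len(frame))
```

The point of the model is that only the pitch plus one amplitude and phase per harmonic have to be described for each frame, instead of every audio sample.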

Codec 2 block diagrams: encoder (left) and decoder (right)
Figure from Rowtel.

Audio Examples and Comparison with other Codecs

Whilst it all sounds great in theory, how does the reality match up? Let’s have a listen.

Here is a short wav audio file:

intro-orig.wav - 1.3 MB (download):

Applying Codec 2 (without the WaveNet decoder) at the different rates available, 3200bps, 2400bps, 1600bps, 1200bps and 700bps, we get:

3200bps (download):

2400bps (download):

1600bps (download):

1200bps (download):

700bps (download):

These examples show significantly reduced file sizes.
Putting that information more meaningfully in terms of how much storage you would need for an hour of audio:

At 3200bps, 1 hour of audio requires only 1.37MB (this would fit on one old 3½-inch floppy disk!)
A rate of 2400bps equates to 1.03MB/h
A rate of 1600bps equates to 0.68MB/h (Or approximately 2 hours of audio on one floppy disk!)
A rate of 1200bps equates to 0.51MB/h
A rate of 700bps equates to 0.3MB/h
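
These storage figures follow directly from the bitrate. The short sketch below redoes the arithmetic, assuming that "MB" here means 2^20 bytes and using the formatted capacity of a 1.44 MB 3½-inch floppy disk (1,474,560 bytes). The function name is hypothetical, and the results match the list above up to rounding.

```python
SECONDS_PER_HOUR = 3600
BYTES_PER_MB = 1024 * 1024   # assumption: MB counted as 2**20 bytes
FLOPPY_BYTES = 1_474_560     # formatted capacity of a 1.44 MB 3.5-inch floppy


def megabytes_per_hour(bitrate_bps: int) -> float:
    """Storage needed for one hour of audio at a constant bitrate."""
    bytes_per_hour = bitrate_bps * SECONDS_PER_HOUR / 8
    return bytes_per_hour / BYTES_PER_MB


if __name__ == "__main__":
    for rate in (3200, 2400, 1600, 1200, 700):
        hours_per_floppy = FLOPPY_BYTES / (rate * SECONDS_PER_HOUR / 8)
        print(f"{rate} bps: {megabytes_per_hour(rate):.2f} MB/h, "
              f"{hours_per_floppy:.2f} h per floppy")
```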

So great compression, but the result is clearly not natural sounding.

As a comparison, here is the same audio as an 8kb/s MP3:

MP3 at 8 kb/s - 23kb file size (download):

The file size is significantly larger than Codec 2 and the quality is arguably still not usable. You can clearly hear what is sometimes called sizzle - the weird metallic sounds you hear on low quality MP3s.

There is a final codec which is worth comparing, one that seems to capture the two ideals of usable quality at low bitrates that we want: Opus.
Because of its convincing low-bitrate performance, Auphonic already offers Opus encoding all the way down to 6 kbps, the lowest bitrate that Opus supports.

Comparing Opus at this 6 kbps rate to the 8kbps MP3 shows a significant improvement - although slightly muffled, it still sounds natural:

Opus at 6kbps (download):

Returning to Codec 2, and purely as a bit of fun, here are some samples of Codec 2 on music! (Note that Codec 2 is not designed for music, it was only ever conceived for use on speech).

Original file (download):

As an 8kbps MP3 (download):

I personally couldn’t listen to the MP3 at this rate, so let’s listen to what Codec 2 does!

Codec 2 at different bitrates:

3200bps (download):

2400bps (download):

1600bps (download):

1200bps (download):

700bps (download):

As you can hear, it is not suitable for this application at all!

Codec 2 and WaveNet

As we have heard, despite the impressive bitrates achieved, the end result is not very natural sounding.
However, where it starts to get more interesting is the work done by W. Bastiaan Kleijn from Cornell University Library. He has been using Codec 2 running at 2400bps on the coding side, but replaced the Codec 2 decoder with a WaveNet deep learning generative model (for more information see the paper Wavenet based low rate speech coding).

Here are some samples from the authors:

Codec	Male Example
Original File
Codec 2
With WaveNet Decoder

Codec	Female Example
Original File
Codec 2
With WaveNet Decoder

Compared to Codec 2, you can hear a significant increase in quality, and compared to the original, there is no significant decrease in quality.

David Rowe himself has stated that he considers the result to be “a game changer for low bit rate speech coding” and “as good as an 8000bps wideband speech codec”.

Conclusion

Whilst the (original) Codec 2 project represents very interesting work, it is limited, and the end result is not suited for podcasting. Also as we heard in the audio examples, it can only be used for voice recordings, and not music.

However, Codec 2 in combination with a WaveNet decoder improves the quality a lot and the low bitrate (2400bps) would be extremely interesting for podcasts and audiobooks distribution as well: one hour of audio would require only 1.03MB of storage!

Auphonic will add support for Codec 2 output files when the WaveNet decoder is in a usable form. For now we have just added support for Codec 2 input files.