<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Not this&#8230;</title>
	<atom:link href="https://blog.timbunce.org/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.timbunce.org</link>
	<description>Listen. Reflect. Explore. Solve.</description>
	<lastBuildDate>Sat, 23 May 2020 12:03:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<site xmlns="com-wordpress:feed-additions:1">2562816</site><cloud domain='blog.timbunce.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>https://s0.wp.com/i/buttonw-com.png</url>
		<title>Not this&#8230;</title>
		<link>https://blog.timbunce.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="https://blog.timbunce.org/osd.xml" title="Not this..." />
	<atom:link rel='hub' href='https://blog.timbunce.org/?pushpress=hub'/>
	<item>
		<title>A Comparison of Automatic Speech Recognition (ASR) Systems, part 3</title>
		<link>https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/</link>
					<comments>https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Sun, 17 May 2020 18:45:31 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[transcription]]></category>
		<guid isPermaLink="false">http://blog.timbunce.org/?p=1782</guid>

					<description><![CDATA[In my two previous posts I evaluated a number of Automatic Speech Recognition systems and selected Google and Speechmatics as the best fit for my needs. Here, after another long gap, I&#8217;m returning with updated results and discussion, including excellent new results from Rev.ai, 3Scribe and AssemblyAI. For this evaluation I&#8217;m using the same method and the &#8230; <a href="https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/" class="more-link">Continue reading <span class="screen-reader-text">A Comparison of Automatic Speech Recognition (ASR) Systems, part&#160;3</span></a>]]></description>
										<content:encoded><![CDATA[<p>In my two <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/" target="_blank" rel="noopener">previous</a> <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/" target="_blank" rel="noopener">posts</a> I evaluated a number of Automatic Speech Recognition systems and selected Google and Speechmatics as the best fit for my needs. Here, after another long gap, I&#8217;m returning with updated results and discussion, including excellent new results from <a href="https://www.rev.ai/" target="_blank" rel="noopener">Rev.ai</a>, <a href="https://3scri.be/" target="_blank" rel="noopener">3Scribe</a> and <a href="https://www.assemblyai.com" rel="noopener" target="_blank">AssemblyAI</a>.</p>
<p><span id="more-1782"></span></p>
<p>For this evaluation I&#8217;m using the same method and the same twelve audio clips as I described in my <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/">previous post</a>.</p>
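<p>As a quick refresher: WER (word error rate) is the word-level edit distance between an ASR transcript and the ground truth, divided by the number of words in the ground truth. Here&#8217;s a minimal Python sketch of the calculation, not my actual analysis script (which also handles things like compound words):</p>
<pre><code>def wer(truth, hypothesis):
    """Word Error Rate: word-level edit distance / ground-truth word count."""
    t, h = truth.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words, one dynamic-programming row at a time.
    prev = list(range(len(h) + 1))
    for i, truth_word in enumerate(t, 1):
        cur = [i]
        for j, hyp_word in enumerate(h, 1):
            cost = 0 if truth_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # word dropped by the ASR
                           cur[j - 1] + 1,       # word inserted by the ASR
                           prev[j - 1] + cost))  # word substituted
        prev = cur
    return 100.0 * prev[-1] / len(t)

print(round(wer("the cat sat on the mat", "the cat sat mat"), 2))  # 33.33
</code></pre>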
<h2>Results</h2>
<p>The table below presents the results. The first column includes the approximate date of the ASR processing in YYYY-MM format. The rows are ordered by the median of their WER scores across all 12 files. Each cell is color coded according to the degree to which the WER score is better (lower, deeper green) or worse (higher, deeper red) than the median of this set of results for that file.</p>
<table style="border:1px #aaa;border-style:solid;border-collapse:collapse;border-spacing:0;">
<tbody>
<tr>
<td style="padding:2px;">Service</td>
<td style="padding:2px;">Median</td>
<td style="padding:2px;">F10<br />
A41</td>
<td style="padding:2px;">F11<br />
A97</td>
<td style="padding:2px;">F13<br />
B52</td>
<td style="padding:2px;">F14<br />
C18</td>
<td style="padding:2px;">F14<br />
C42</td>
<td style="padding:2px;">F15<br />
C96</td>
<td style="padding:2px;">F16<br />
D64</td>
<td style="padding:2px;">F17<br />
D83</td>
<td style="padding:2px;">F17<br />
E03</td>
<td style="padding:2px;">F18<br />
E82</td>
<td style="padding:2px;">F18<br />
E83</td>
<td style="padding:2px;">F18<br />
E84</td>
</tr>
<tr>
<td style="padding:2px;">3Scribe<br />
2020-05 <span style="float:right;text-align:right;">$7.8/hr</span></td>
<td style="text-align:center;padding:2px;" title="8.72">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="14.22">14</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="8.12">8</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="7.88">8</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="6.33">6</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="8.39">8</td>
<td style="background-color:#f5ffe2;text-align:center;padding:2px;" title="7.40">7</td>
<td style="background-color:#ecffc7;text-align:center;padding:2px;" title="12.10">12</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="8.77">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="11.51">12</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="14.15">14</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="8.67">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="10.59">11</td>
</tr>
<tr>
<td style="padding:2px;">Rev.ai<br />
2020-04 <span style="float:right;text-align:right;">$2.1/hr</span></td>
<td style="text-align:center;padding:2px;" title="10.98">11</td>
<td style="background-color:#fcfff8;text-align:center;padding:2px;" title="18.44">18</td>
<td style="background-color:#fff2eb;text-align:center;padding:2px;" title="11.67">12</td>
<td style="background-color:#fff9f6;text-align:center;padding:2px;" title="10.29">10</td>
<td style="background-color:#ebffc4;text-align:center;padding:2px;" title="6.51">7</td>
<td style="background-color:#efffd0;text-align:center;padding:2px;" title="9.14">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="6.59">7</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="14.47">14</td>
<td style="background-color:#f7ffe7;text-align:center;padding:2px;" title="9.27">9</td>
<td style="background-color:#f5ffe1;text-align:center;padding:2px;" title="12.18">12</td>
<td style="background-color:#f5ffe3;text-align:center;padding:2px;" title="16.79">17</td>
<td style="background-color:#ecffc8;text-align:center;padding:2px;" title="9.36">9</td>
<td style="background-color:#fcfff7;text-align:center;padding:2px;" title="12.18">12</td>
</tr>
<tr>
<td style="padding:2px;">AssemblyAI<br />
2020-05 <span style="float:right;text-align:right;">$0.9/hr</span></td>
<td title="11.04" style="text-align:center;padding:2px;">11</td>
<td title="20.06" style="background-color:#fffaf7;text-align:center;padding:2px;">20</td>
<td title="9.83" style="background-color:#fafff0;text-align:center;padding:2px;">10</td>
<td title="11.09" style="background-color:#ffede4;text-align:center;padding:2px;">11</td>
<td title="8.86" style="background-color:#fff8f5;text-align:center;padding:2px;">9</td>
<td title="10.48" style="background-color:#fafff1;text-align:center;padding:2px;">10</td>
<td title="9.73" style="background-color:#ffdccb;text-align:center;padding:2px;">10</td>
<td title="13.82" style="background-color:#fcfff6;text-align:center;padding:2px;">14</td>
<td title="9.88" style="background-color:#fff8f5;text-align:center;padding:2px;">10</td>
<td title="13.11" style="background-color:#fff5f1;text-align:center;padding:2px;">13</td>
<td title="16.00" style="background-color:#f2ffda;text-align:center;padding:2px;">16</td>
<td title="10.98" style="background-color:#f7ffe8;text-align:center;padding:2px;">11</td>
<td title="11.98" style="background-color:#fafff2;text-align:center;padding:2px;">12</td>
</tr>
<tr>
<td style="padding:2px;">Google<br />
2019-07</td>
<td style="text-align:center;padding:2px;" title="11.54">12</td>
<td style="background-color:#ffefe7;text-align:center;padding:2px;" title="20.64">21</td>
<td style="background-color:#f0ffd2;text-align:center;padding:2px;" title="8.94">9</td>
<td style="background-color:#f8ffeb;text-align:center;padding:2px;" title="9.34">9</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="8.38">8</td>
<td style="background-color:#fffcfa;text-align:center;padding:2px;" title="11.34">11</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="8.04">8</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="11.74">12</td>
<td style="background-color:#fbfff5;text-align:center;padding:2px;" title="9.45">9</td>
<td style="background-color:#f2ffd8;text-align:center;padding:2px;" title="12.00">12</td>
<td style="background-color:#ffdac8;text-align:center;padding:2px;" title="22.71">23</td>
<td style="background-color:#fffaf8;text-align:center;padding:2px;" title="13.97">14</td>
<td style="background-color:#ffddcd;text-align:center;padding:2px;" title="13.78">14</td>
</tr>
<tr>
<td style="padding:2px;">Google<br />
2020-02 <span style="float:right;text-align:right;">$1.4/hr</span></td>
<td style="text-align:center;padding:2px;" title="11.64">12</td>
<td style="background-color:#fff2ec;text-align:center;padding:2px;" title="20.29">20</td>
<td style="background-color:#f6ffe5;text-align:center;padding:2px;" title="9.76">10</td>
<td style="background-color:#fff6f1;text-align:center;padding:2px;" title="10.44">10</td>
<td style="background-color:#fff8f5;text-align:center;padding:2px;" title="8.68">9</td>
<td style="background-color:#fdfffa;text-align:center;padding:2px;" title="10.97">11</td>
<td style="background-color:#ffede5;text-align:center;padding:2px;" title="8.62">9</td>
<td style="background-color:#eeffcc;text-align:center;padding:2px;" title="12.31">12</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="9.57">10</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="12.74">13</td>
<td style="background-color:#ffdccb;text-align:center;padding:2px;" title="22.45">22</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="13.47">13</td>
<td style="background-color:#ffdac8;text-align:center;padding:2px;" title="13.91">14</td>
</tr>
<tr>
<td style="padding:2px;">Speechmatics<br />
2018-12</td>
<td style="text-align:center;padding:2px;" title="11.77">12</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="18.90">19</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="10.85">11</td>
<td style="background-color:#fcfff6;text-align:center;padding:2px;" title="9.71">10</td>
<td style="background-color:#fff7f3;text-align:center;padding:2px;" title="8.74">9</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="11.15">11</td>
<td style="background-color:#ffeae0;text-align:center;padding:2px;" title="8.74">9</td>
<td style="background-color:#ffe6da;text-align:center;padding:2px;" title="16.05">16</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="10.56">11</td>
<td style="background-color:#ffeee5;text-align:center;padding:2px;" title="13.23">13</td>
<td style="background-color:#fffbf9;text-align:center;padding:2px;" title="19.16">19</td>
<td style="background-color:#fff7f3;text-align:center;padding:2px;" title="14.35">14</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="12.38">12</td>
</tr>
<tr>
<td style="padding:2px;">Rev.ai<br />
2019-07</td>
<td style="text-align:center;padding:2px;" title="11.91">12</td>
<td style="background-color:#ffece3;text-align:center;padding:2px;" title="20.92">21</td>
<td style="background-color:#ffe8dd;text-align:center;padding:2px;" title="12.29">12</td>
<td style="background-color:#ffe4d7;text-align:center;padding:2px;" title="11.31">11</td>
<td style="background-color:#efffd0;text-align:center;padding:2px;" title="6.87">7</td>
<td style="background-color:#fff9f6;text-align:center;padding:2px;" title="11.53">12</td>
<td style="background-color:#f8ffec;text-align:center;padding:2px;" title="7.63">8</td>
<td style="background-color:#ffd5c0;text-align:center;padding:2px;" title="17.13">17</td>
<td style="background-color:#ffd7c4;text-align:center;padding:2px;" title="10.31">10</td>
<td style="background-color:#ffd2bc;text-align:center;padding:2px;" title="14.03">14</td>
<td style="background-color:#fcfff6;text-align:center;padding:2px;" title="18.17">18</td>
<td style="background-color:#f2ffd8;text-align:center;padding:2px;" title="10.54">11</td>
<td style="background-color:#ffe3d6;text-align:center;padding:2px;" title="13.52">14</td>
</tr>
<tr>
<td style="padding:2px;">Speechmatics<br />
2020-04 <span style="float:right;text-align:right;">$4.4/hr</span></td>
<td style="text-align:center;padding:2px;" title="12.10">12</td>
<td style="background-color:#fcfff7;text-align:center;padding:2px;" title="18.38">18</td>
<td style="background-color:#fffdfd;text-align:center;padding:2px;" title="10.92">11</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="10.00">10</td>
<td style="background-color:#ffe8dd;text-align:center;padding:2px;" title="9.46">9</td>
<td style="background-color:#fff2ec;text-align:center;padding:2px;" title="11.95">12</td>
<td style="background-color:#ffe8dd;text-align:center;padding:2px;" title="8.80">9</td>
<td style="background-color:#ffe1d3;text-align:center;padding:2px;" title="16.34">16</td>
<td style="background-color:#ffd0b9;text-align:center;padding:2px;" title="10.44">10</td>
<td style="background-color:#fff8f5;text-align:center;padding:2px;" title="12.92">13</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="18.76">19</td>
<td style="background-color:#fff6f2;text-align:center;padding:2px;" title="14.41">14</td>
<td style="background-color:#fdfffa;text-align:center;padding:2px;" title="12.25">12</td>
</tr>
</tbody>
</table>
<p>One of the few benefits of taking <em>years</em> to work through this process is that I can see how ASR results for a service change over time. While Google&#8217;s and Speechmatics&#8217; scores dropped a little, Rev.ai has improved significantly. 3Scribe and AssemblyAI are newcomers to my testing.</p>
<p>The prices shown are the approximate USD cost per hour, ignoring any free tier or bulk discounts.</p>
<p>It&#8217;s important to note that these results are <em>all very good</em>. The nature of the informal testing I&#8217;m doing means there&#8217;s really little value in distinguishing between small differences in WER scores. At this level the scores are significantly affected by differences in how &#8220;verbatim&#8221; the systems try to be, such as when a speaker hesitates and repeats a word or two. For example, here&#8217;s a section of vimdiff showing Google, Rev.ai, and Speechmatics making different choices:</p>
<p><a href="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png"><img data-attachment-id="1797" data-permalink="https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/untitled-2/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png" data-orig-size="1696,1098" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Example differences" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=676" class="aligncenter size-large wp-image-1797" src="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=676&#038;h=438" alt="Example differences" width="676" height="438" srcset="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=676 676w, https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=1352 1352w, https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=150 150w, https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=300 300w, https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=768 768w, https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=1024 1024w" sizes="(max-width: 676px) 100vw, 676px" /></a></p>
<p>The effect of actual transcription errors on the WER score has become less significant, and I don&#8217;t have the time to sift through which differences matter and which don&#8217;t. I&#8217;m content that these services are all good enough for my needs.</p>
<h2>Google</h2>
<p>Google&#8217;s score has dropped insignificantly (-0.1) since July last year. The problem from my previous test, where the transcript of the F18.E82 clip was missing a chunk of text, was still present.</p>
<h2>Speechmatics</h2>
<p>When I submitted each audio file to Speechmatics, a pop-up alert said &#8220;Duplicate file. You already have a job that used a file with this name. Are you sure you want to select it again?&#8221; I said yes. Speechmatics uploaded the files, took some time to transcribe each one, and charged me for the service. When I downloaded the transcripts I found that they were identical to the previous transcripts generated in 2018. This seemed suspicious, so I edited an audio file to remove a tiny moment of silence and tried again. This time the transcript was different, so I did the same for all the other files. That sure seems like a bug.</p>
<p>Speechmatics&#8217; score has dropped since December 2018. It&#8217;s a slightly larger drop (-0.3) than Google&#8217;s but still small.</p>
<h2>Rev.ai</h2>
<p>Rev.ai is a newcomer to my testing. Jay Lee, the General Manager of speech services at Rev.com, contacted me in July, prompted by my <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/">previous blog post</a>. Rev.ai is the enterprise version of their ASR API. We had a call where we talked over the project, my methods, and the results. Full disclosure: Jay very kindly donated enough minutes of Rev.ai time to cover my needs for this project.</p>
<p>I tested their Rev.ai service in July and the results were good then. They&#8217;re even better now (+0.9).</p>
<h2>3Scribe</h2>
<p>As I was drafting this post, Eddie Gahan from 3Scribe contacted me with an invitation to try out their new service. Their scores are impressive. They have an API but don&#8217;t yet offer features like word-level timings or confidence scores. They&#8217;re one to watch. I wish them well, not least because they&#8217;re <a href="https://3scri.be/aboutus" target="_blank" rel="noopener">an Irish company</a>.</p>
<h2>AssemblyAI</h2>
<p>I&#8217;d overlooked AssemblyAI until now. They only offer <a href="https://docs.assemblyai.com/overview/getting-started" rel="noopener" target="_blank">an API</a>, though it&#8217;s simple to use and well documented. They don&#8217;t provide speaker diarisation, but they do deliver excellent results at an excellent price, with punctuation, word-level timings, and confidence scores. Their free tier is 300 minutes/month.</p>
<h2>Differential Analysis</h2>
<p>Each service has a relatively high WER score when using the transcript from one of the others as the ground truth. <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/#differential-analysis" target="_blank" rel="noopener">This is good</a>. It means the services are making <em>different</em> mistakes/decisions and those differences could be used to highlight likely errors in the others.</p>
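<p>Here&#8217;s a rough sketch of that pairwise comparison. It uses a simple word-sequence similarity rather than my full WER calculation, and the transcript file names are placeholders:</p>
<pre><code>from difflib import SequenceMatcher
from itertools import combinations

def word_diff_rate(a, b):
    """Rough disagreement score: 100 minus the percentage of matching words."""
    return 100.0 * (1 - SequenceMatcher(None, a.lower().split(),
                                        b.lower().split()).ratio())

# Placeholder file names: one transcript of the same clip per service.
transcripts = {name: open(name + ".txt").read()
               for name in ("google", "revai", "speechmatics", "3scribe")}

# High pairwise scores mean the services make *different* mistakes,
# so their disagreements point a human editor at the likely errors.
for a, b in combinations(transcripts, 2):
    print(a, "vs", b, round(word_diff_rate(transcripts[a], transcripts[b]), 1))
</code></pre>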
<h2>Diversions</h2>
<p>A couple of issues diverted me for a while.</p>
<h4>Exploring the Parameter Space of the Google API</h4>
<p>Unlike the other services, the Google API offers <a href="https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig" target="_blank" rel="noopener"><em>many</em> configuration options</a> which provide &#8220;information to the recognizer that specifies how to process the request&#8221;. There&#8217;s an implication, to me at least, that providing more detailed configuration options <em>could</em> result in more accurate transcriptions. But which ones would have a significant effect?</p>
<p>I picked a number of parameters and ran transcriptions with various combinations of likely-looking values. I was especially hopeful that specifying a <a href="https://www.naics.com/search/" target="_blank" rel="noopener">NAICS code</a> for the topic of the podcast would have a positive effect. To cut a long story short, nothing made a significant difference except providing a vocabulary. I suspect Google may use the configuration details provided by users to help train their system.</p>
<h4>Extra Vocabulary</h4>
<p>Both the Google and Rev.ai APIs provide a way to improve transcription accuracy by specifying extra words and phrases.</p>
<p>Google&#8217;s <a href="https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig" target="_blank" rel="noopener">SpeechContext</a> has <code>phrases</code>: &#8220;A list of strings containing words and phrases &#8216;hints&#8217; so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases.&#8221; There&#8217;s also a numeric <code>boost</code> value: &#8220;Positive value will increase the probability that a specific phrase will be recognized over other similar sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. We recommend using a binary search approach to finding the optimal value for your use case.&#8221;</p>
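<p>As a sketch, here&#8217;s roughly how a phrase list and boost are passed via the google-cloud-speech Python client. The bucket URI and phrases are placeholders, and the exact client calls should be treated as approximate:</p>
<pre><code>from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
    speech_contexts=[speech.SpeechContext(
        phrases=["Speechmatics", "diarisation"],  # placeholder vocabulary
        boost=3.0,  # the level that worked best in my tests
    )],
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/clip.flac")  # placeholder
operation = client.long_running_recognize(config=config, audio=audio)
for result in operation.result().results:
    print(result.alternatives[0].transcript)
</code></pre>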
<p>Rev.ai describe <a href="https://www.rev.ai/docs#operation/SubmitTranscriptionJob" target="_blank" rel="noopener">custom_vocabularies</a> like this: &#8220;An array of words or phrases not found in the normal dictionary. Add your specific technical jargon, proper nouns and uncommon phrases as strings in this array to add them to the lexicon for this job.&#8221;</p>
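<p>And the Rev.ai equivalent, submitting an asynchronous job over their REST API (the token and media URL are placeholders):</p>
<pre><code>import requests

API = "https://api.rev.ai/speechtotext/v1"
headers = {"Authorization": "Bearer YOUR_REVAI_TOKEN"}  # placeholder token

job = requests.post(API + "/jobs", headers=headers, json={
    "media_url": "https://example.com/clip.mp3",  # placeholder audio URL
    "custom_vocabularies": [
        # Jargon, proper nouns, and uncommon phrases for this job.
        {"phrases": ["Speechmatics", "diarisation", "NAICS"]},
    ],
}).json()
print(job["id"], job["status"])
</code></pre>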
<p>Specifying a vocabulary of extra words seemed like a very appealing way to improve transcription accuracy, certainly worth spending some time exploring. I needed a list of words that the transcriptions tended to get wrong, so for each file I compared the ground-truth transcript with all the ASR-generated transcripts and extracted a list of all the words the ASRs had got wrong, regardless of circumstances. I called this a &#8216;commonly wrong words&#8217; list.</p>
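<p>The extraction itself was a blunt instrument, something like this sketch, which treats any ground-truth word absent from an ASR transcript as wrong:</p>
<pre><code>from collections import Counter

def commonly_wrong_words(truth_text, asr_texts):
    """Ground-truth words that more than one ASR transcript failed to produce."""
    truth_words = set(truth_text.lower().split())
    misses = Counter()
    for asr_text in asr_texts:
        misses.update(truth_words - set(asr_text.lower().split()))
    # Most-missed words first, keeping those at least two ASRs got wrong.
    return [word for word, n in misses.most_common() if n &gt;= 2]
</code></pre>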
<p>I tried Google first, submitting jobs with the commonly wrong words and various boost levels. A boost level of 3 had the best effect, reducing the WER from around 11.5 to 9.5. An impressive gain!</p>
<p>Then I tried the same with Rev.ai. This time the results got worse. I was rather puzzled and disappointed. I contacted Rev and they kindly arranged a meeting to explain how the feature worked. My take-aways: it&#8217;s useful for specific terms, and especially phrases, that aren&#8217;t already known to their system; my &#8220;shotgun&#8221; approach of using lots of individual words wasn&#8217;t a good fit; and the effect is hard to predict.</p>
<p>Later on it dawned on me that my &#8216;commonly wrong words&#8217; approach was simply not valid. I was effectively cheating by strongly hinting to Google which words it had got wrong. Moreover, it would not be possible to <em>automatically</em> generate a suitable list of words and phrases for each audio file to be transcribed. The closest viable approach might be to extract unusual words and phrases, such as uncommon names, from podcast show notes. I may return to exploring the creation and use of a custom vocabulary later, but for now I&#8217;m shelving it.</p>
<h2>Other ASR Services Using Rev?</h2>
<p>When talking to Rev.ai they mentioned <a href="https://fireflies.ai" target="_blank" rel="noopener">Fireflies.ai</a>, a service which simplifies recording and transcription of business meetings: you invite their bot, called Fred, to join the meeting via your calendar app and the rest is automatic. Fireflies use Rev.ai as the ASR. I tried them out and was puzzled to see some results were better than Rev.ai&#8217;s.</p>
<p>That reminded me of <a href="https://www.descript.com" target="_blank" rel="noopener">Descript</a> who, when I tested them <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/" target="_blank" rel="noopener">previously</a>, were using Google as the ASR yet had some results that were better than Google&#8217;s.</p>
<p>It seems there are two factors at play: pre-processing of the audio before it&#8217;s sent to the ASR, and post-processing of the raw ASR results.</p>
<p>Here&#8217;s a comparison of results from Fireflies.ai, Rev.ai, and Descript:</p>
<table style="border:1px #aaa;border-style:solid;border-collapse:collapse;border-spacing:0;">
<tbody>
<tr>
<td style="padding:2px;">Service</td>
<td style="padding:2px;">Median</td>
<td style="padding:2px;">F10<br />
A41</td>
<td style="padding:2px;">F11<br />
A97</td>
<td style="padding:2px;">F13<br />
B52</td>
<td style="padding:2px;">F14<br />
C18</td>
<td style="padding:2px;">F14<br />
C42</td>
<td style="padding:2px;">F15<br />
C96</td>
<td style="padding:2px;">F16<br />
D64</td>
<td style="padding:2px;">F17<br />
D83</td>
<td style="padding:2px;">F17<br />
E03</td>
<td style="padding:2px;">F18<br />
E82</td>
<td style="padding:2px;">F18<br />
E83</td>
<td style="padding:2px;">F18<br />
E84</td>
</tr>
<tr>
<td style="padding:2px;">Fireflies 2020-04</td>
<td style="text-align:center;padding:2px;" title="10.02">10</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="18.09">18</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="10.24">10</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="9.20">9</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="7.53">8</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="9.80">10</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="6.82">7</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="14.54">15</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="9.33">9</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="12.55">13</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="15.40">15</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="9.05">9</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="12.58">13</td>
</tr>
<tr>
<td style="padding:2px;">Rev.ai 2020-04</td>
<td style="text-align:center;padding:2px;" title="10.98">11</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="18.44">18</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="11.67">12</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="10.29">10</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="6.51">7</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="9.14">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="6.59">7</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="14.47">14</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="9.27">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="12.18">12</td>
<td style="background-color:#fffcfb;text-align:center;padding:2px;" title="16.79">17</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="9.36">9</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="12.18">12</td>
</tr>
<tr>
<td style="padding:2px;">Descript 2020-04</td>
<td style="text-align:center;padding:2px;" title="11.84">12</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="20.35">20</td>
<td style="background-color:#ffeee6;text-align:center;padding:2px;" title="12.22">12</td>
<td style="background-color:#ffdfd0;text-align:center;padding:2px;" title="11.09">11</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="6.75">7</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="11.46">11</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="7.34">7</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="15.98">16</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="9.33">9</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="12.31">12</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="16.72">17</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="8.86">9</td>
<td style="background-color:#ffccb2;text-align:center;padding:2px;" title="13.12">13</td>
</tr>
</tbody>
</table>
<p>I&#8217;ve included Descript because the differential WER scores between the three are low. Specifically, the WER between Descript and Rev.ai is half that between Descript and Google, suggesting that Descript is now using Rev in their ASR process.</p>
<p>It&#8217;s interesting that Fireflies.ai did especially well on the three oldest files, which have relatively lower-quality audio, and on the more recent F18.E82 file that had clipping. It made me wonder if I should experiment with some audio pre-processing of my own but, after spending four years getting this far, I&#8217;ll pass!</p>
<h2>Revisiting The Past</h2>
<p>I thought it would be interesting to revisit the 2-hour audio file I used in the <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/" target="_blank" rel="noopener">first</a> of my ASR comparison posts, back in May 2018.</p>
<p>The figures in the Sentences, Commas, and Questions columns are the number of full-stop, comma, and question-mark characters in the transcript. The figures in the Names column are a rough approximation of the number of proper nouns.</p>
<table class="tg" style="border:1px #aaa;border-style:solid;border-collapse:collapse;border-spacing:0;">
<tbody>
<tr>
<th class="tg-yw4l">Service</th>
<th class="tg-yw4l">WER</th>
<th class="tg-yw4l">Sentences</th>
<th class="tg-yw4l">Commas</th>
<th class="tg-yw4l">Questions</th>
<th class="tg-yw4l">Names</th>
</tr>
<tr>
<td class="tg-yw4l">Human range<br />
low – high</td>
<td class="tg-yw4l" style="text-align:right;">4.10<br />
— 5.10</td>
<td class="tg-yw4l" style="text-align:right;">840<br />
— 1261</td>
<td class="tg-yw4l" style="text-align:right;">1450<br />
— 1748</td>
<td class="tg-yw4l" style="text-align:right;">49<br />
— 76</td>
<td class="tg-yw4l" style="text-align:right;">1056<br />
— 1208</td>
</tr>
<tr>
<td class="tg-yw4l">Rev.ai<br />
2020-05</td>
<td class="tg-yw4l" style="text-align:right;">8.48</td>
<td class="tg-yw4l" style="text-align:right;">731</td>
<td class="tg-yw4l" style="text-align:right;">1383</td>
<td class="tg-yw4l" style="text-align:right;">59</td>
<td class="tg-yw4l" style="text-align:right;">884</td>
</tr>
<tr>
<td class="tg-yw4l">3Scribe<br />
2020-05</td>
<td class="tg-yw4l" style="text-align:right;">8.71</td>
<td class="tg-yw4l" style="text-align:right;">688</td>
<td class="tg-yw4l" style="text-align:right;">664</td>
<td class="tg-yw4l" style="text-align:right;">67</td>
<td class="tg-yw4l" style="text-align:right;">995</td>
</tr>
<tr>
<td class="tg-yw4l">AssemblyAI<br />
2020-05</td>
<td class="tg-yw4l" style="text-align:right;">9.34</td>
<td class="tg-yw4l" style="text-align:right;">805</td>
<td class="tg-yw4l" style="text-align:right;">643</td>
<td class="tg-yw4l" style="text-align:right;">69</td>
<td class="tg-yw4l" style="text-align:right;">996</td>
</tr>
<tr>
<td class="tg-yw4l">Speechmatics<br />
2020-05</td>
<td class="tg-yw4l" style="text-align:right;">9.61</td>
<td class="tg-yw4l" style="text-align:right;">667</td>
<td class="tg-yw4l" style="text-align:right;">0</td>
<td class="tg-yw4l" style="text-align:right;">0</td>
<td class="tg-yw4l" style="text-align:right;">931</td>
</tr>
<tr>
<td class="tg-yw4l">Google<br />
2018-04</td>
<td class="tg-yw4l" style="text-align:right;">10.03</td>
<td class="tg-yw4l" style="text-align:right;">641</td>
<td class="tg-yw4l" style="text-align:right;">421</td>
<td class="tg-yw4l" style="text-align:right;">29</td>
<td class="tg-yw4l" style="text-align:right;">1232</td>
</tr>
<tr>
<td class="tg-yw4l">Google<br />
2020-05</td>
<td class="tg-yw4l" style="text-align:right;">10.36</td>
<td class="tg-yw4l" style="text-align:right;">462</td>
<td class="tg-yw4l" style="text-align:right;">238</td>
<td class="tg-yw4l" style="text-align:right;">20</td>
<td class="tg-yw4l" style="text-align:right;">1325</td>
</tr>
<tr>
<td class="tg-yw4l">Speechmatics<br />
2018-02</td>
<td class="tg-yw4l" style="text-align:right;">11.65</td>
<td class="tg-yw4l" style="text-align:right;">672</td>
<td class="tg-yw4l" style="text-align:right;">0</td>
<td class="tg-yw4l" style="text-align:right;">0</td>
<td class="tg-yw4l" style="text-align:right;">892</td>
</tr>
</tbody>
</table>
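<p>The punctuation figures are simple character counts; the Names figure comes from a crude capitalisation heuristic, along these lines (a sketch, not my actual script):</p>
<pre><code>def transcript_stats(text):
    words = text.split()
    # Rough proper-noun proxy: capitalised words that don't start a sentence.
    names = sum(1 for prev, word in zip(words, words[1:])
                if word[:1].isupper() and prev[-1] not in ".?!")
    return {
        "sentences": text.count("."),
        "commas": text.count(","),
        "questions": text.count("?"),
        "names": names,
    }
</code></pre>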
<p>The Rev.ai results are particularly impressive. Beyond the good WER score they also come remarkably close to the Human transcripts in terms of identifying sentences, commas and questions. As do 3Scribe and AssemblyAI. Accurate recognition of sentences and especially questions should be helpful in segmenting the transcript into topics.</p>
<p>Over the last two years Google&#8217;s results have got slightly worse and Speechmatics have improved enough to jump ahead of them. (The 2018 figures for Google and Speechmatics don&#8217;t exactly match those in my earlier post due to small changes in my analysis scripts, such as recognizing more compound words.)</p>
<h2>Conclusions</h2>
<p>For my needs, on this project, <a href="https://www.rev.ai" target="_blank" rel="noopener">Rev.ai</a> have the best results. Their kind donation of time credits is the icing on the cake. <a href="https://www.assemblyai.com" rel="noopener" target="_blank">AssemblyAI</a> lacks speaker identification. <a href="https://www.speechmatics.com" target="_blank" rel="noopener">Speechmatics</a> has a better WER score than <a href="https://cloud.google.com/speech-to-text/" target="_blank" rel="noopener">Google</a> but doesn&#8217;t recognise questions. All except 3Scribe have word-level timing and confidence indicators.</p>
<p>It&#8217;s time for me to start doing some bulk processing. Finally.</p>
<p>I&#8217;ll start with a few recent 2-hour podcast audio files. Transcribe them via the <a href="https://www.rev.ai/docs" target="_blank" rel="noopener">Rev.ai API</a>. Then process the transcripts from the raw JSON returned by the API into various formats, including Markdown and basic HTML. Once I&#8217;ve got a pipeline set up I&#8217;ll start working backwards through older episodes. Then I&#8217;ll iterate on whatever extra features seem most interesting at the time. Probably starting with search, and probably using <a href="https://www.elastic.co/what-is/elasticsearch" target="_blank" rel="noopener">Elasticsearch</a>.</p>
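<p>As a sketch of that post-processing step, assuming the monologues-of-elements shape the Rev.ai API returns (the file name is a placeholder):</p>
<pre><code>import json

def revai_json_to_markdown(path):
    """Flatten a Rev.ai transcript into speaker-labelled Markdown paragraphs."""
    with open(path) as f:
        doc = json.load(f)
    paragraphs = []
    for mono in doc["monologues"]:
        # Each element is a word or punctuation token; join them verbatim.
        text = "".join(el["value"] for el in mono["elements"])
        paragraphs.append("**Speaker %d:** %s" % (mono["speaker"], text.strip()))
    return "\n\n".join(paragraphs)

print(revai_json_to_markdown("episode.json"))  # placeholder file name
</code></pre>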
<p>Remember, these are just my results for this project and with these specific audio files and subject matter. Your mileage <em>will</em> vary. Work out what features you need and do your own testing with your own audio to work out which services will work best for you. Have fun.</p>
<p>Updates:</p>
<ul>
<li>22nd May 2020: Added results for <a href="https://assemblyai.com/" target="_blank">AssemblyAI</a>.</li>
<li>23rd May 2020: Added the approximate cost per hour for the services.</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1782</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2020/04/untitled-2.png?w=676" medium="image">
			<media:title type="html">Example differences</media:title>
		</media:content>
	</item>
		<item>
		<title>A Comparison of Automatic Speech Recognition (ASR) Systems, part 2</title>
		<link>https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/</link>
					<comments>https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Mon, 11 Feb 2019 20:38:50 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[tech]]></category>
		<category><![CDATA[transcription]]></category>
		<guid isPermaLink="false">http://blog.timbunce.org/?p=1739</guid>

					<description><![CDATA[In my previous post I evaluated a number of Automatic Speech Recognition systems. That evaluation was useful but limited in an important way: it only used a single good quality audio file with a single pair of speakers (who both happened to be males with clear North American accents). Consequently there was no evaluation of &#8230; <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/" class="more-link">Continue reading <span class="screen-reader-text">A Comparison of Automatic Speech Recognition (ASR) Systems, part&#160;2</span></a>]]></description>
										<content:encoded><![CDATA[<p>In my <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/">previous post</a> I evaluated a number of Automatic Speech Recognition systems. That evaluation was useful but limited in an important way: it only used a single good-quality audio file with a single pair of speakers (who both happened to be male with clear North American accents). Consequently there was no evaluation of performance across a variety of accents, varying audio quality, and so on.</p>
<p>To address that limitation I&#8217;ve tested 14 ASR systems with 12 different audio files, covering a range of accents and audio quality. This post presents the results.</p>
<p><span id="more-1739"></span><br />
<strong>Update: In May 2020 I wrote a <a href="https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/">follow-up post, part 3, with updated results for the best systems</a>, including Rev.ai, AssemblyAI, Google, Speechmatics, and 3Scribe.</strong></p>
<h2>Audio Samples</h2>
<p>For this evaluation I picked a number of interviews, spread over a range of years with a mix of accents and audio qualities, and used a 10-minute section of each one. Below I&#8217;ve listed some details of the audio files. Label is the identifier for the audio file used in the results table; the first two digits are the year of the recording.</p>
<table>
<tbody>
<tr>
<th>Label</th>
<th>MP3 Attributes (all 16-bit)</th>
<th>Interviewees</th>
</tr>
<tr>
<td>F10.A41</td>
<td>48 kbps, 44.1 kHz, Joint Stereo</td>
<td>Female, Irish accent</td>
</tr>
<tr>
<td>F11.A97</td>
<td>96 kbps, 44.1kHz, Mono</td>
<td>Male, Caribbean accent</td>
</tr>
<tr>
<td>F13.B52</td>
<td>64 kbps, 44.1kHz, Joint Stereo</td>
<td>Female, British accent</td>
</tr>
<tr>
<td>F14.C18</td>
<td>96 kbps, 44.1kHz, Mono</td>
<td>Female, North American accent</td>
</tr>
<tr>
<td>F14.C42</td>
<td>96 kbps, 44.1kHz, Mono</td>
<td>Male, North American accent</td>
</tr>
<tr>
<td>F15.C96</td>
<td>96 kbps, 44.1kHz, Mono</td>
<td>Male, North American accent</td>
</tr>
<tr>
<td>F16.D64</td>
<td>64 kbps, 48kHz, Mono</td>
<td>Male, Indian accent</td>
</tr>
<tr>
<td>F17.D83</td>
<td>64 kbps, 48kHz, Mono</td>
<td>Male, North American accent</td>
</tr>
<tr>
<td>F17.E03</td>
<td>64 kbps, 48kHz, Mono</td>
<td>Female, North American accent</td>
</tr>
<tr>
<td>F18.E82</td>
<td>128 kbps, 44.1 kHz, Joint Stereo</td>
<td>One male, two female (<em>crosstalk, clipping</em>)</td>
</tr>
<tr>
<td>F18.E83</td>
<td>256 kbps, 48 kHz, Joint Stereo</td>
<td>Male, French accent</td>
</tr>
<tr>
<td>F18.E84</td>
<td>256 kbps, 48 kHz, Joint Stereo</td>
<td>Male, North American accent</td>
</tr>
</tbody>
</table>
<p>I used roughly the same methodology as <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/" target="_blank" rel="noopener noreferrer">before</a>. I purchased verbatim transcripts, made and checked by humans, from three services: <a href="https://www.rev.com" target="_blank" rel="noopener noreferrer">Rev</a>, <a href="https://scribie.com/" target="_blank" rel="noopener noreferrer">Scribie</a>, and <a href="https://cielo24.com" target="_blank" rel="noopener noreferrer">Cielo24</a>. I compared the transcripts and wherever they differed I listened to the audio and decided on the &#8216;ground truth&#8217; to use for the evaluation.</p>
<p>I want to take a moment to give credit to <a href="https://www.rev.com" target="_blank" rel="noopener noreferrer">Rev</a> for great service. They cost $1/min yet delivered all the transcripts within 4 hours and had the lowest WER score of 3.8, compared with 4.2 for Scribie ($1/min) and 5.5 for Cielo24&#8217;s top &#8220;Best+&#8221; service ($2/min).</p>
<p>For Microsoft I had to convert the files to WAV format (16-bit mono 16kHz) because that&#8217;s the only format their SDK supports. Similarly, for Google I converted the files to FLAC (16-bit mono 16kHz). Both WAV and FLAC are lossless formats, though the downsample to 16kHz mono does discard some detail. All the other services accepted the original MP3 format.</p>
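<p>Conversions like that are a one-liner with ffmpeg. Here&#8217;s the kind of invocation I mean, wrapped in Python (file names are placeholders; swap <code>.wav</code> for <code>.flac</code> to get FLAC):</p>
<pre><code>import subprocess

# Convert an MP3 to 16-bit mono 16kHz WAV, as required by the Microsoft SDK.
subprocess.run(
    ["ffmpeg", "-i", "clip.mp3",
     "-ac", "1",            # mix down to mono
     "-ar", "16000",        # resample to 16 kHz
     "-sample_fmt", "s16",  # 16-bit samples
     "clip.wav"],
    check=True,
)
</code></pre>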
<h2>Results</h2>
<p>The table below presents the results. The &#8216;Humans&#8217; row of the table shows the median WER score for the three human transcripts. The service rows are ordered by the median of their WER scores across all 12 files. Each cell is color coded according to the degree to which the WER score is better (lower, deeper green) or worse (higher, deeper red) than the median of the ASR results for that file (shown in a middle row).</p>
<table style="border:1px #aaa;border-style:solid;border-collapse:collapse;border-spacing:0;">
<tbody>
<tr>
<td style="padding:2px;">Service</td>
<td style="padding:2px;">Median</td>
<td style="padding:2px;">F10<br />
A41</td>
<td style="padding:2px;">F11<br />
A97</td>
<td style="padding:2px;">F13<br />
B52</td>
<td style="padding:2px;">F14<br />
C18</td>
<td style="padding:2px;">F14<br />
C42</td>
<td style="padding:2px;">F15<br />
C96</td>
<td style="padding:2px;">F16<br />
D64</td>
<td style="padding:2px;">F17<br />
D83</td>
<td style="padding:2px;">F17<br />
E03</td>
<td style="padding:2px;">F18<br />
E82</td>
<td style="padding:2px;">F18<br />
E83</td>
<td style="padding:2px;">F18<br />
E84</td>
</tr>
<tr>
<td style="padding:2px;">Humans</td>
<td style="text-align:center;padding:2px;" title="4.24">&nbsp;&nbsp;4.2</td>
<td style="text-align:center;padding:2px;" title="4.22">4</td>
<td style="text-align:center;padding:2px;" title="3.96">4</td>
<td style="text-align:center;padding:2px;" title="2.85">3</td>
<td style="text-align:center;padding:2px;" title="3.32">3</td>
<td style="text-align:center;padding:2px;" title="4.90">5</td>
<td style="text-align:center;padding:2px;" title="3.50">4</td>
<td style="text-align:center;padding:2px;" title="2.95">3</td>
<td style="text-align:center;padding:2px;" title="5.50">6</td>
<td style="text-align:center;padding:2px;" title="4.25">4</td>
<td style="text-align:center;padding:2px;" title="11.06">11</td>
<td style="text-align:center;padding:2px;" title="6.05">6</td>
<td style="text-align:center;padding:2px;" title="4.86">5</td>
</tr>
<tr>
<td style="padding:2px;">Google Enh. Video</td>
<td style="text-align:center;padding:2px;" title="11.27">11.3</td>
<td style="background-color:#e9ffbf;text-align:center;padding:2px;" title="20.64">21</td>
<td style="background-color:#c1ff45;text-align:center;padding:2px;" title="9.28">9</td>
<td style="background-color:#d9ff8f;text-align:center;padding:2px;" title="10.07">10</td>
<td style="background-color:#cfff71;text-align:center;padding:2px;" title="7.96">8</td>
<td style="background-color:#c6ff54;text-align:center;padding:2px;" title="10.91">11</td>
<td style="background-color:#ccff65;text-align:center;padding:2px;" title="8.33">8</td>
<td style="background-color:#beff3c;text-align:center;padding:2px;" title="12.24">12</td>
<td style="background-color:#b2ff19;text-align:center;padding:2px;" title="8.77">9</td>
<td style="background-color:#caff61;text-align:center;padding:2px;" title="11.63">12</td>
<td style="background-color:#ffe2d3;text-align:center;padding:2px;" title="22.45">22</td>
<td style="background-color:#f2ffd9;text-align:center;padding:2px;" title="13.85">14</td>
<td style="background-color:#f5ffe1;text-align:center;padding:2px;" title="13.12">13</td>
</tr>
<tr>
<td style="padding:2px;">Descript</td>
<td style="text-align:center;padding:2px;" title="11.38">11.4</td>
<td style="background-color:#e7ffb7;text-align:center;padding:2px;" title="20.35">20</td>
<td style="background-color:#beff3d;text-align:center;padding:2px;" title="9.15">9</td>
<td style="background-color:#c7ff57;text-align:center;padding:2px;" title="9.05">9</td>
<td style="background-color:#d4ff7f;text-align:center;padding:2px;" title="8.14">8</td>
<td style="background-color:#cfff70;text-align:center;padding:2px;" title="11.40">11</td>
<td style="background-color:#baff32;text-align:center;padding:2px;" title="7.63">8</td>
<td style="background-color:#c3ff4b;text-align:center;padding:2px;" title="12.67">13</td>
<td style="background-color:#b2ff19;text-align:center;padding:2px;" title="9.08">9</td>
<td style="background-color:#e1ffa5;text-align:center;padding:2px;" title="13.05">13</td>
<td style="background-color:#c6ff56;text-align:center;padding:2px;" title="18.10">18</td>
<td style="background-color:#c0ff42;text-align:center;padding:2px;" title="11.35">11</td>
<td style="background-color:#f7ffe9;text-align:center;padding:2px;" title="13.25">13</td>
</tr>
<tr>
<td style="padding:2px;">Speechmatics</td>
<td style="text-align:center;padding:2px;" title="11.77">11.8</td>
<td style="background-color:#daff90;text-align:center;padding:2px;" title="18.90">19</td>
<td style="background-color:#e1ffa5;text-align:center;padding:2px;" title="10.85">11</td>
<td style="background-color:#d3ff7c;text-align:center;padding:2px;" title="9.71">10</td>
<td style="background-color:#e4ffaf;text-align:center;padding:2px;" title="8.74">9</td>
<td style="background-color:#caff62;text-align:center;padding:2px;" title="11.16">11</td>
<td style="background-color:#d6ff84;text-align:center;padding:2px;" title="8.74">9</td>
<td style="background-color:#e9ffbd;text-align:center;padding:2px;" title="16.05">16</td>
<td style="background-color:#c7ff57;text-align:center;padding:2px;" title="10.56">11</td>
<td style="background-color:#e4ffae;text-align:center;padding:2px;" title="13.23">13</td>
<td style="background-color:#dcff97;text-align:center;padding:2px;" title="19.42">19</td>
<td style="background-color:#fcfff7;text-align:center;padding:2px;" title="14.35">14</td>
<td style="background-color:#e6ffb6;text-align:center;padding:2px;" title="12.38">12</td>
</tr>
<tr>
<td style="padding:2px;">TranscribeMe</td>
<td style="text-align:center;padding:2px;" title="11.89">11.9</td>
<td style="background-color:#d0ff72;text-align:center;padding:2px;" title="17.80">18</td>
<td style="background-color:#c6ff56;text-align:center;padding:2px;" title="9.56">10</td>
<td style="background-color:#d3ff7c;text-align:center;padding:2px;" title="9.71">10</td>
<td style="background-color:#dfffa1;text-align:center;padding:2px;" title="8.56">9</td>
<td style="background-color:#d6ff85;text-align:center;padding:2px;" title="11.77">12</td>
<td style="background-color:#c2ff48;text-align:center;padding:2px;" title="7.93">8</td>
<td style="background-color:#f2ffd8;text-align:center;padding:2px;" title="16.85">17</td>
<td style="background-color:#ccff68;text-align:center;padding:2px;" title="10.81">11</td>
<td style="background-color:#d0ff73;text-align:center;padding:2px;" title="12.00">12</td>
<td style="background-color:#e5ffb3;text-align:center;padding:2px;" title="20.01">20</td>
<td style="background-color:#e7ffb7;text-align:center;padding:2px;" title="13.29">13</td>
<td style="background-color:#edffca;text-align:center;padding:2px;" title="12.72">13</td>
</tr>
<tr>
<td style="padding:2px;">Temi</td>
<td style="text-align:center;padding:2px;" title="12.65">12.7</td>
<td style="background-color:#fbfff4;text-align:center;padding:2px;" title="22.60">23</td>
<td style="background-color:#ffbd9d;text-align:center;padding:2px;" title="13.92">14</td>
<td style="background-color:#fcfff8;text-align:center;padding:2px;" title="11.97">12</td>
<td style="background-color:#d3ff7b;text-align:center;padding:2px;" title="8.08">8</td>
<td style="background-color:#c8ff5c;text-align:center;padding:2px;" title="11.04">11</td>
<td style="background-color:#f4ffdf;text-align:center;padding:2px;" title="9.97">10</td>
<td style="background-color:#ffd8c5;text-align:center;padding:2px;" title="19.65">20</td>
<td style="background-color:#eaffc0;text-align:center;padding:2px;" title="12.11">12</td>
<td style="background-color:#ffceb6;text-align:center;padding:2px;" title="16.43">16</td>
<td style="background-color:#d2ff7a;text-align:center;padding:2px;" title="18.83">19</td>
<td style="background-color:#cdff6b;text-align:center;padding:2px;" title="12.04">12</td>
<td style="background-color:#f6ffe5;text-align:center;padding:2px;" title="13.18">13</td>
</tr>
<tr>
<td style="padding:2px;">Otter.ai</td>
<td style="text-align:center;padding:2px;" title="12.99">13.0</td>
<td style="background-color:#f8ffeb;text-align:center;padding:2px;" title="22.25">22</td>
<td style="background-color:#f0ffd3;text-align:center;padding:2px;" title="11.60">12</td>
<td style="background-color:#fffaf8;text-align:center;padding:2px;" title="12.19">12</td>
<td style="background-color:#e7ffb9;text-align:center;padding:2px;" title="8.86">9</td>
<td style="background-color:#f4ffdf;text-align:center;padding:2px;" title="13.37">13</td>
<td style="background-color:#ffece2;text-align:center;padding:2px;" title="10.78">11</td>
<td style="background-color:#f3ffdb;text-align:center;padding:2px;" title="16.92">17</td>
<td style="background-color:#f5ffe1;text-align:center;padding:2px;" title="12.60">13</td>
<td style="background-color:#fffefd;text-align:center;padding:2px;" title="14.95">15</td>
<td style="background-color:#daff90;text-align:center;padding:2px;" title="19.29">19</td>
<td style="background-color:#d5ff82;text-align:center;padding:2px;" title="12.41">12</td>
<td style="background-color:#fff1ea;text-align:center;padding:2px;" title="13.98">14</td>
</tr>
<tr>
<td style="padding:2px;">SimonSays.ai</td>
<td style="text-align:center;padding:2px;" title="13.43">13.4</td>
<td style="background-color:#fff8f4;text-align:center;padding:2px;" title="23.35">23</td>
<td style="background-color:#ffe1d3;text-align:center;padding:2px;" title="13.04">13</td>
<td style="background-color:#fbfff5;text-align:center;padding:2px;" title="11.90">12</td>
<td style="background-color:#ffad85;text-align:center;padding:2px;" title="11.27">11</td>
<td style="background-color:#f9ffed;text-align:center;padding:2px;" title="13.61">14</td>
<td style="background-color:#ffe0d1;text-align:center;padding:2px;" title="11.01">11</td>
<td style="background-color:#ffe7db;text-align:center;padding:2px;" title="19.01">19</td>
<td style="background-color:#e7ffb7;text-align:center;padding:2px;" title="11.98">12</td>
<td style="background-color:#fcfff7;text-align:center;padding:2px;" title="14.77">15</td>
<td style="background-color:#e8ffba;text-align:center;padding:2px;" title="20.14">20</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="14.47">14</td>
<td style="background-color:#f7ffe9;text-align:center;padding:2px;" title="13.25">13</td>
</tr>
<tr>
<td style="padding:2px;"><em>Median of ASR results</em></td>
<td style="text-align:center;padding:2px;" title="13.77"><em>13.8</em></td>
<td style="text-align:center;padding:2px;" title="22.98"><em>23</em></td>
<td style="text-align:center;padding:2px;" title="12.32"><em>12</em></td>
<td style="text-align:center;padding:2px;" title="12.08"><em>12</em></td>
<td style="text-align:center;padding:2px;" title="9.74"><em>10</em></td>
<td style="text-align:center;padding:2px;" title="13.92"><em>14</em></td>
<td style="text-align:center;padding:2px;" title="10.40"><em>10</em></td>
<td style="text-align:center;padding:2px;" title="17.97"><em>18</em></td>
<td style="text-align:center;padding:2px;" title="13.04"><em>13</em></td>
<td style="text-align:center;padding:2px;" title="14.92"><em>15</em></td>
<td style="text-align:center;padding:2px;" title="21.56"><em>22</em></td>
<td style="text-align:center;padding:2px;" title="14.47"><em>14</em></td>
<td style="text-align:center;padding:2px;" title="13.62"><em>14</em></td>
</tr>
<tr>
<td style="padding:2px;">Go Transcribe</td>
<td style="text-align:center;padding:2px;" title="14.22">14.2</td>
<td style="background-color:#ffdfcf;text-align:center;padding:2px;" title="24.74">25</td>
<td style="background-color:#d3ff7b;text-align:center;padding:2px;" title="10.17">10</td>
<td style="background-color:#fff8f5;text-align:center;padding:2px;" title="12.26">12</td>
<td style="background-color:#fcfff7;text-align:center;padding:2px;" title="9.64">10</td>
<td style="background-color:#fff3ee;text-align:center;padding:2px;" title="14.22">14</td>
<td style="background-color:#efffd1;text-align:center;padding:2px;" title="9.79">10</td>
<td style="background-color:#ffb18b;text-align:center;padding:2px;" title="21.38">21</td>
<td style="background-color:#ffcaaf;text-align:center;padding:2px;" title="14.21">14</td>
<td style="background-color:#fcfff7;text-align:center;padding:2px;" title="14.77">15</td>
<td style="background-color:#ffad84;text-align:center;padding:2px;" title="24.09">24</td>
<td style="background-color:#ffab82;text-align:center;padding:2px;" title="16.53">17</td>
<td style="background-color:#f5ffe1;text-align:center;padding:2px;" title="13.12">13</td>
</tr>
<tr>
<td style="padding:2px;">Spext</td>
<td style="text-align:center;padding:2px;" title="14.51">14.5</td>
<td style="background-color:#fff4ee;text-align:center;padding:2px;" title="23.58">24</td>
<td style="background-color:#efffcf;text-align:center;padding:2px;" title="11.54">12</td>
<td style="background-color:#f2ffd8;text-align:center;padding:2px;" title="11.39">11</td>
<td style="background-color:#ffaa80;text-align:center;padding:2px;" title="11.33">11</td>
<td style="background-color:#ffe5d9;text-align:center;padding:2px;" title="14.59">15</td>
<td style="background-color:#ffe0d1;text-align:center;padding:2px;" title="11.01">11</td>
<td style="background-color:#d8ff8a;text-align:center;padding:2px;" title="14.54">15</td>
<td style="background-color:#ffebe1;text-align:center;padding:2px;" title="13.47">13</td>
<td style="background-color:#ffb38d;text-align:center;padding:2px;" title="17.29">17</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="30.28">30</td>
<td style="background-color:#ffffff;text-align:center;padding:2px;" title="14.47">14</td>
<td style="background-color:#ff894f;text-align:center;padding:2px;" title="16.64">17</td>
</tr>
<tr>
<td style="padding:2px;">Happy Scribe</td>
<td style="text-align:center;padding:2px;" title="14.94">14.9</td>
<td style="background-color:#ddff9b;text-align:center;padding:2px;" title="19.31">19</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="28.53">29</td>
<td style="background-color:#fff2ec;text-align:center;padding:2px;" title="12.41">12</td>
<td style="background-color:#fffaf7;text-align:center;padding:2px;" title="9.83">10</td>
<td style="background-color:#ffe8dc;text-align:center;padding:2px;" title="14.53">15</td>
<td style="background-color:#f5ffe2;text-align:center;padding:2px;" title="10.02">10</td>
<td style="background-color:#eaffc0;text-align:center;padding:2px;" title="16.13">16</td>
<td style="background-color:#ffbf9f;text-align:center;padding:2px;" title="14.45">14</td>
<td style="background-color:#fefffd;text-align:center;padding:2px;" title="14.89">15</td>
<td style="background-color:#ffad84;text-align:center;padding:2px;" title="24.09">24</td>
<td style="background-color:#ffb895;text-align:center;padding:2px;" title="16.22">16</td>
<td style="background-color:#ffcaaf;text-align:center;padding:2px;" title="14.98">15</td>
</tr>
<tr>
<td style="padding:2px;">AWS Transcribe</td>
<td style="text-align:center;padding:2px;" title="17.43">17.4</td>
<td style="background-color:#ffaf88;text-align:center;padding:2px;" title="27.34">27</td>
<td style="background-color:#ffb28c;text-align:center;padding:2px;" title="14.20">14</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="17.74">18</td>
<td style="background-color:#ff9764;text-align:center;padding:2px;" title="11.69">12</td>
<td style="background-color:#ff864a;text-align:center;padding:2px;" title="17.11">17</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="13.52">14</td>
<td style="background-color:#ff945f;text-align:center;padding:2px;" title="22.68">23</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="18.22">18</td>
<td style="background-color:#ffe0d1;text-align:center;padding:2px;" title="15.88">16</td>
<td style="background-color:#ffd1ba;text-align:center;padding:2px;" title="22.98">23</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="21.27">21</td>
<td style="background-color:#ff915a;text-align:center;padding:2px;" title="16.44">16</td>
</tr>
<tr>
<td style="padding:2px;">Scribie Auto</td>
<td style="text-align:center;padding:2px;" title="18.65">18.7</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="37.23">37</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="20.55">21</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="20.36">20</td>
<td style="background-color:#ff9764;text-align:center;padding:2px;" title="11.69">12</td>
<td style="background-color:#ff9e6d;text-align:center;padding:2px;" title="16.49">16</td>
<td style="background-color:#ffe6da;text-align:center;padding:2px;" title="10.90">11</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="30.02">30</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="17.36">17</td>
<td style="background-color:#ff6c23;text-align:center;padding:2px;" title="19.51">20</td>
<td style="background-color:#f0ffd3;text-align:center;padding:2px;" title="20.67">21</td>
<td style="background-color:#ff9c6b;text-align:center;padding:2px;" title="16.91">17</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="17.78">18</td>
</tr>
<tr>
<td style="padding:2px;">Cielo24</td>
<td style="text-align:center;padding:2px;" title="19.07">19.1</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="32.14">32</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="17.54">18</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="20.22">20</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="14.23">14</td>
<td style="background-color:#ffa477;text-align:center;padding:2px;" title="16.31">16</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="13.69">14</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="32.25">32</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="17.91">18</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="20.37">20</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="30.68">31</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="24.58">25</td>
<td style="background-color:#ff6d24;text-align:center;padding:2px;" title="17.38">17</td>
</tr>
<tr>
<td style="padding:2px;">Microsoft</td>
<td style="text-align:center;padding:2px;" title="20.25">20.3</td>
<td style="background-color:#ff9662;text-align:center;padding:2px;" title="28.73">29</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="16.18">16</td>
<td style="background-color:#ff712b;text-align:center;padding:2px;" title="15.91">16</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="12.78">13</td>
<td style="background-color:#ffa97e;text-align:center;padding:2px;" title="16.19">16</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="21.04">21</td>
<td style="background-color:#ffb38d;text-align:center;padding:2px;" title="21.31">21</td>
<td style="background-color:#ff8c53;text-align:center;padding:2px;" title="15.57">16</td>
<td style="background-color:#ff6e26;text-align:center;padding:2px;" title="19.45">19</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="29.56">30</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="25.33">25</td>
<td style="background-color:#ff6519;text-align:center;padding:2px;" title="21.64">22</td>
</tr>
</tbody>
</table>
<p>I tested <a href="https://www.descript.com" target="_blank" rel="noopener noreferrer">Descript</a> as an afterthought. Descript <a href="https://help.descript.com/faq-and-troubleshooting/app-features-and-functionality/how-accurate-is-descripts-automatic-transcription" target="_blank" rel="noopener noreferrer">use Google as the backend ASR service</a> (with some custom post-processing, I&#8217;m told) and have a very nice app with a rich feature set. Testing Descript turned out to be helpful in highlighting what appears to be a bug in the Google service.</p>
<p>Let&#8217;s explore the odd results for F18.E82. That audio was by far the most challenging in this evaluation. There were four speakers, informal banter and cross-talking, and the audio was slightly <a href="https://en.wikipedia.org/wiki/Clipping_(audio)" target="_blank" rel="noopener noreferrer">clipped</a>. The Human WER score of 11 reflects differences in how the humans rendered the speakers talking over one another and their <a href="https://en.wikipedia.org/wiki/Speech_disfluency" target="_blank" rel="noopener noreferrer">disfluencies</a>.</p>
<p>Google&#8217;s unusually poor result for this file was due to missing chunks of the transcript. When I first tried it there were two large chunks (~50s each) and some smaller chunks missing, and the WER score was 35! I tried rerunning the transcription, and then again with different audio formats, but it didn&#8217;t help.</p>
<p>A few days later I tested Descript. It scored an inconsistent mix of good and bad results with a median of 14. That seemed odd for a service that uses Google, especially as it had a better score (28) than Google for F18.E82. I retested Google and it improved to 22 (with 254 more words than in the previous Google transcript). I retested Descript and it improved to 18 (with 183 more words than in the previous Descript transcript). Those results haven&#8217;t changed with further testing. Using Google directly for that file still gets a worse result than using Descript, mostly due to Google&#8217;s transcript missing a 16 second chunk. Odd.</p>
<p>I regenerated Descript transcripts for the six files that had much worse results than Google&#8217;s and they all improved (F10.A41 23.6→20.4; F11.A97 14.6→9.2; F14.B18 11.87→8.14; F14.C42 14.53→11.40; F15.C96 17.60→7.63; F16.D64 15.91→12.67).</p>
<p>This seems like a significant problem with the Google service. I&#8217;ve reported it to Descript and had an acknowledgement but haven&#8217;t heard back yet.</p>
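<p>If you want to check your own results for this problem, here&#8217;s a minimal sketch of one way to spot dropped chunks automatically, assuming you have word-level timings; the tuple shape and threshold below are illustrative:</p>
<pre><code># Flag suspiciously long silences in a word-timed transcript.
# A genuine pause is usually short; a multi-second gap in an
# interview is a hint that the service dropped a chunk of audio.

def find_gaps(words, threshold_secs=5.0):
    """words: list of (word, start_secs, end_secs), in time order."""
    gaps = []
    for prev, cur in zip(words, words[1:]):
        gap = cur[1] - prev[2]  # next word's start minus previous word's end
        if gap &gt;= threshold_secs:
            gaps.append((prev[2], cur[1], gap))
    return gaps

# Illustrative data with a 16-second hole in it:
words = [("so", 0.0, 0.3), ("anyway", 0.4, 0.9), ("right", 17.1, 17.4)]
for start, end, gap in find_gaps(words):
    print(f"possible missing chunk: {start:.1f}s to {end:.1f}s ({gap:.1f}s)")
</code></pre>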
<h2>Non-runners</h2>
<p>I didn&#8217;t retest <strong>Trint</strong> or <strong>Sonix</strong> because, as noted in my previous post, Trint, Sonix, and Speechmatics have very little difference between their transcripts, a differential WER of just 1.4. That suggests those three services are using very similar models and training data.</p>
<p><strong>VoiceBase</strong> are now represented by Cielo24, who have taken over the web service.</p>
<p>I had included IBM&#8217;s <strong>Watson</strong> service in this test, hoping it had improved (especially as it now takes MP3 so I didn&#8217;t need to transcode as I had before). It was consistently the worst performer, with a median WER of 24, so I dropped it from the results.</p>
<p>I&#8217;d also planned to include <a href="https://remeeting.com/" target="_blank" rel="noopener noreferrer"><strong>Remeeting</strong></a>, which I came across after my previous testing and which looked promising. Their results were generally similar to, or worse than, Cielo24&#8217;s, with a couple of transcripts much worse due to extra duplicated fragments of text. They seem to do a good job with speaker identification so I&#8217;ll include them in any future testing I do for that.</p>
<p>I was contacted by <a href="https://unravelhq.com">Unravel</a> shortly before posting this. They, like Descript, use Google to provide the transcripts. Their service is basic and their pricing is low ($15 for 300 mins/month) with a free tier (60 mins/month). While testing the service I encountered the same problem with missing chunks that I described above.</p>
<h2>Pre-trainers</h2>
<p>A valid concern with the previous evaluation was that a transcript for the audio I used was available on the internet and so may have been included in the training data for the ASR systems. I doubted that would make much difference in practice, given the quantity of training data needed by ASR systems, but wanted to check.</p>
<p>The last three files (F18.E82, F18.E83, and F18.E84) in this new evaluation were all transcribed before being published on the internet. It&#8217;s interesting to note that Scribie was one of the services I used to generate human transcripts and the Scribie Auto ASR service did unusually well on the F18.E82 file. Scribie also did well in my <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/">previous testing</a> where I&#8217;d also used them to generate the human transcript. (The F15.C96 file in this test is a 10-minute section of that same file and again Scribie Auto ASR did unusually well on that file.)</p>
<p>On the other hand, Scribie Auto ASR did poorly on all the other files even though I&#8217;d used Scribie for the human transcripts of them. Similarly Cielo24 doesn&#8217;t appear to have gained noticeable advantage from having generated human transcripts of the files.</p>
<p>Another data point is that Microsoft performed poorly for those last three files. If those files are removed from the results then Microsoft&#8217;s ranking rises above Amazon&#8217;s.</p>
<h2>Conclusions</h2>
<p>The clear winners in this test are Google&#8217;s enhanced video model <em>($0.048/min)</em> and Speechmatics <em>($0.08/min)</em>; Speechmatics came a close second on both accuracy <em>and</em> price. (Though clearly there&#8217;s an issue with Google missing chunks in the transcript.)</p>
<p>TranscribeMe <em>($0.25/min)</em> is relatively accurate but also three times the price and lacks features I want. Temi <em>($0.10/min)</em> is only slightly worse yet less than half the price of TranscribeMe. Otter.ai <em>($0 up to 600 mins/month, 6,000 mins for $9.99/mo)</em> is good, though not as good as they appeared to be in my previous test.</p>
<p>Remember, these are just my results with these specific audio files and subject matter. Your mileage <em>will</em> vary. Do your own testing with your own audio to work out which services will work best for you.</p>
<p>Automatic Speech Recognition is amazingly good, yet still <em>far</em> from human levels of accuracy, especially for poor quality audio. Comparing transcripts from multiple services still looks like an appealing way to identify likely errors to aid human editing.</p>
<h2>What Next?</h2>
<p>Now that there&#8217;s a clear winner (Google) that I have confidence in, the next step is to start generating transcripts for all the podcast episodes. Finally.</p>
<p>Once I&#8217;ve a workflow in place for that I can circle back and investigate how to add a workflow for human review and editing. That&#8217;s where I&#8217;d look more deeply into comparing the &#8216;master&#8217; transcript from Google with another, e.g. from Speechmatics, to identify and highlight likely errors.</p>
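<p>As a sketch of what that highlighting step could look like: the snippet below aligns the two word sequences and wraps the spans where the &#8216;master&#8217; transcript disagrees with the second one. The transcripts and the <code>&lt;mark&gt;</code> output are just illustrative choices:</p>
<pre><code>import difflib

def highlight_disagreements(master_text, other_text):
    """Wrap words in the master transcript that differ from a second
    transcript in &lt;mark&gt; tags, as candidates for human review."""
    master = master_text.split()
    other = other_text.split()
    out = []
    matcher = difflib.SequenceMatcher(a=master, b=other, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(master[i1:i2])
        elif master[i1:i2]:  # 'replace' or 'delete': master has words here
            out.append("&lt;mark&gt;" + " ".join(master[i1:i2]) + "&lt;/mark&gt;")
        # 'insert' means only the other transcript has words; nothing to mark
    return " ".join(out)

print(highlight_disagreements("we live just on the road from there",
                              "we live just down the road from there"))
# -&gt; we live just &lt;mark&gt;on&lt;/mark&gt; the road from there
</code></pre>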
<p>I also have ideas for a simple way to compare the quality of speaker identification across services, which will likely prompt another blog post, one day.</p>
<p>There are more of my rambling thoughts in the What Next? section of my <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/">previous post</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1782</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>A Comparison of Automatic Speech Recognition (ASR) Systems</title>
		<link>https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/</link>
					<comments>https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Tue, 15 May 2018 14:19:00 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[tech]]></category>
		<category><![CDATA[transcription]]></category>
		<guid isPermaLink="false">http://blog.timbunce.org/?p=1542</guid>

					<description><![CDATA[Back in March 2016 I wrote Semi-automated podcast transcription about my interest in finding ways to make archives of podcast content more accessible. Please read that post for details of my motivations and goals. Some 11 months later, in February 2017, I wrote Comparing Transcriptions describing how I was exploring measuring transcription accuracy. That turned out to &#8230; <a href="https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/" class="more-link">Continue reading <span class="screen-reader-text">A Comparison of Automatic Speech Recognition (ASR)&#160;Systems</span></a>]]></description>
										<content:encoded><![CDATA[<p>Back in March 2016 I wrote <a href="https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/">Semi-automated podcast transcription</a> about my interest in finding ways to make archives of podcast content more accessible. Please read that post for details of my motivations and goals.</p>
<p>Some 11 months later, in February 2017, I wrote <a href="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/">Comparing Transcriptions</a> describing how I was exploring measuring transcription accuracy. That turned out to be more tricky, and interesting, than I’d expected. Please read that post for details of the methods I&#8217;m using and what the WER (word error rate) score means.</p>
<p>Here, after another over-long gap, I&#8217;m returning to post the current results, and start thinking about next steps. One cause of the delay has been that whenever I returned to the topic there had been significant changes in at least one of the results, most recently when <a href="https://siliconangle.com/blog/2018/04/09/google-improves-transcription-new-training-models-cloud-speech-text/" target="_blank" rel="noopener noreferrer">Google announced</a> their <a href="https://cloud.google.com/speech-to-text/docs/enhanced-models" target="_blank" rel="noopener noreferrer">enhanced models</a>. In the end the delay turned out to be helpful.</p>
<p><span id="more-1542"></span></p>
<h1>The Scores</h1>
<p>The table below shows the results of my tests on many automated speech recognition services, ordered by WER score (lower is better). I&#8217;ll note a major caveat up front: <strong>I only used a single audio file for these tests</strong>: an almost two-hour interview in English between two North American males with no strong accents and good audio quality. I can&#8217;t be sure how the results would differ for female voices, more accented voices, lower audio quality etc. I plan to retest the top-tier services with at least one other file in due course.</p>
<p><strong>Updates:</strong></p>
<ul>
<li>In February 2019 I wrote a <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/">follow-up post, part 2, which presents the results of evaluating 14 ASR systems with 12 different audio files</a> covering a variety of speakers, accents, and audio quality. Naturally that gives more representative results.</li>
<li>In May 2020 I wrote a further <a href="https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/">follow-up post, part 3, with updated results for the best systems</a>, including Rev.ai, AssemblyAI, Google, Speechmatics, and 3Scribe.</li>
</ul>
<p>You can&#8217;t beat a human, at least not yet. All the human services scored between 4 and 6. I described them in my <a href="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/" target="_blank" rel="noopener noreferrer">previous post</a>, so I won&#8217;t dwell on them here.</p>
<table class="tg" style="border:1px #aaa;border-style:solid;border-collapse:collapse;border-spacing:0;">
<tbody>
<tr>
<th class="tg-yw4l">Service</th>
<th class="tg-yw4l">WER</th>
<th class="tg-yw4l">Punctuation<br />
( <code>.</code> / <code>,</code> / <code>?</code> / names )</th>
<th class="tg-yw4l">Timing</th>
<th class="tg-yw4l">Other Features</th>
<th class="tg-yw4l">Approx Cost<br />
(not bulk)</th>
</tr>
<tr>
<td class="tg-yw4l">Human (<a href="https://www.voicebase.com" target="_blank" rel="noopener noreferrer">Voicebase</a>)</td>
<td class="tg-yw4l">4.10</td>
<td class="tg-yw4l">1090/1626/57/1056</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">$1.5/min</td>
</tr>
<tr>
<td class="tg-yw4l">Human (<a href="http://www.3playmedia.com/" target="_blank" rel="noopener noreferrer">3PlayMedia</a>)</td>
<td class="tg-yw4l">4.11</td>
<td class="tg-yw4l">1261/1470/76/1064</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">$3/min</td>
</tr>
<tr>
<td class="tg-yw4l">Human (<a href="https://scribie.com/" target="_blank" rel="noopener noreferrer">Scribie</a>)</td>
<td class="tg-yw4l">4.72</td>
<td class="tg-yw4l">923/1450/49/1153</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">$0.75/min</td>
</tr>
<tr>
<td class="tg-yw4l">Human (Volunteer)</td>
<td class="tg-yw4l">5.10</td>
<td class="tg-yw4l">840/1748/60/1208</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">Goodwill</td>
</tr>
<tr style="border-top:3px double #aaa;">
<td class="tg-yw4l"><a href="https://cloud.google.com/speech-to-text/" target="_blank" rel="noopener noreferrer">Google Speech-to-Text</a> (video model, not enhanced)</td>
<td class="tg-yw4l">10.06</td>
<td class="tg-yw4l">792/421/29/1238</td>
<td class="tg-yw4l">Words</td>
<td class="tg-yw4l">C, A, V</td>
<td class="tg-yw4l">$0.048/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://www.spext.co/" target="_blank" rel="noopener noreferrer">Spext</a></td>
<td class="tg-yw4l">10.44</td>
<td class="tg-yw4l">813/369/30/1263</td>
<td class="tg-yw4l">Lines</td>
<td class="tg-yw4l">E</td>
<td class="tg-yw4l">$0.16/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://otter.ai/" target="_blank" rel="noopener noreferrer">Otter AI</a></td>
<td class="tg-yw4l">10.79</td>
<td class="tg-yw4l">786/1166/35/1030</td>
<td class="tg-yw4l">Pgfs</td>
<td class="tg-yw4l">E, S</td>
<td class="tg-yw4l">Free up to 600 mins/month</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://www.speechmatics.com" target="_blank" rel="noopener noreferrer">Speechmatics</a></td>
<td class="tg-yw4l">11.35</td>
<td class="tg-yw4l">955/0/0/929</td>
<td class="tg-yw4l">Words</td>
<td class="tg-yw4l">S, C</td>
<td class="tg-yw4l">$0.08/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://trint.com" target="_blank" rel="noopener noreferrer">Trint</a></td>
<td class="tg-yw4l">11.39</td>
<td class="tg-yw4l">968/0/0/894</td>
<td class="tg-yw4l">Lines</td>
<td class="tg-yw4l">E</td>
<td class="tg-yw4l">$0.33/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://go-transcribe.com" target="_blank" rel="noopener noreferrer">Go-Transcribe</a></td>
<td class="tg-yw4l">11.46</td>
<td class="tg-yw4l">979/0/0/922</td>
<td class="tg-yw4l">Pgfs</td>
<td class="tg-yw4l">E</td>
<td class="tg-yw4l">$0.22/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://simonsays.ai" target="_blank" rel="noopener noreferrer">SimonSays</a></td>
<td class="tg-yw4l">11.64</td>
<td class="tg-yw4l">941/0/0/893</td>
<td class="tg-yw4l">Line</td>
<td class="tg-yw4l">E, S</td>
<td class="tg-yw4l">$0.17/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://sonix.ai" target="_blank" rel="noopener noreferrer">Sonix</a></td>
<td class="tg-yw4l">11.66</td>
<td class="tg-yw4l">943/0/0/900</td>
<td class="tg-yw4l">Lines</td>
<td class="tg-yw4l">D, S, E</td>
<td class="tg-yw4l">$0.083/min+$15/mon</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://www.temi.com" target="_blank" rel="noopener noreferrer">Temi</a></td>
<td class="tg-yw4l">11.95</td>
<td class="tg-yw4l">915/1329/51/862</td>
<td class="tg-yw4l">Pgfs</td>
<td class="tg-yw4l">S, E</td>
<td class="tg-yw4l">$0.10/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://scribie.com/transcription/free" target="_blank" rel="noopener noreferrer">Scribie ASR</a></td>
<td class="tg-yw4l">12.36</td>
<td class="tg-yw4l">970/1307/48/973</td>
<td class="tg-yw4l">None</td>
<td class="tg-yw4l">E</td>
<td class="tg-yw4l">Currently free</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://transcribeme.com/machine-express/" target="_blank" rel="noopener noreferrer">TranscribeMe</a></td>
<td class="tg-yw4l">12.55</td>
<td class="tg-yw4l">1203/0/63/836</td>
<td class="tg-yw4l">Lines</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">$0.25/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://support.google.com/youtube/answer/6373554?hl=en&amp;ref_topic=7296114" target="_blank" rel="noopener noreferrer">YouTube Captions</a></td>
<td class="tg-yw4l">13.68</td>
<td class="tg-yw4l">0/0/0/1075</td>
<td class="tg-yw4l">Lines</td>
<td class="tg-yw4l">S</td>
<td class="tg-yw4l">Currently free</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://www.voicebase.com" target="_blank" rel="noopener noreferrer">Voicebase</a></td>
<td class="tg-yw4l">15.40</td>
<td class="tg-yw4l">116/0/0/1119</td>
<td class="tg-yw4l">Lines</td>
<td class="tg-yw4l">E, V</td>
<td class="tg-yw4l">$0.02/min</td>
</tr>
<tr style="border-top:3px double #aaa;">
<td class="tg-yw4l"><a href="https://aws.amazon.com/transcribe/" target="_blank" rel="noopener noreferrer">AWS Transcribe</a></td>
<td class="tg-yw4l">21.70</td>
<td class="tg-yw4l">772/0/85/67</td>
<td class="tg-yw4l">Words</td>
<td class="tg-yw4l">S, C, A, V</td>
<td class="tg-yw4l">$0.02/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://www.ibm.com/watson/services/speech-to-text/" target="_blank" rel="noopener noreferrer">IBM Watson</a></td>
<td class="tg-yw4l">24.50</td>
<td class="tg-yw4l">11/0/0/896</td>
<td class="tg-yw4l">Words</td>
<td class="tg-yw4l">C, A, V</td>
<td class="tg-yw4l">$0.02/min</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="http://shop.nuance.co.uk/store/nuanceeu/en_GB/Content/pbPage.microsite-dragon-mac" target="_blank" rel="noopener noreferrer">Dragon</a> +vocabulary</td>
<td class="tg-yw4l">24.86</td>
<td class="tg-yw4l">9/7/0/967</td>
<td class="tg-yw4l">None</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">Free + €300 for app</td>
</tr>
<tr>
<td class="tg-yw4l"><a href="https://www.deepgram.com" target="_blank" rel="noopener noreferrer">Deepgram</a></td>
<td class="tg-yw4l">27.54</td>
<td class="tg-yw4l">715/1262/52/443</td>
<td class="tg-yw4l">Pgfs</td>
<td class="tg-yw4l">S, E</td>
<td class="tg-yw4l">$0.0183</td>
</tr>
<tr style="border-top:3px double #ccc;">
<td class="tg-yw4l"><a href="https://www.spokendata.com" target="_blank" rel="noopener noreferrer">SpokenData</a></td>
<td class="tg-yw4l">35.92</td>
<td class="tg-yw4l">1457/0/0/680</td>
<td class="tg-yw4l">Words</td>
<td class="tg-yw4l">S, E</td>
<td class="tg-yw4l">$0.12/min</td>
</tr>
</tbody>
</table>
<ul>
<li><strong>WER</strong>: Word error rate (lower is better).</li>
<li><strong>Punctuation</strong>: Number of sentences / commas / question marks / capital letters not at the start of a sentence (a rough proxy for proper nouns).</li>
<li><strong>Timing</strong>: The approximate highest-precision timing available: <strong>Words</strong> typically means a data format like JSON or XML with timing information for each word, <strong>Lines</strong> typically means a subtitle format like SRT, <strong>Pgfs</strong> (paragraphs) means something lower precision still. (The sketch after this list shows turning word-level timings into coarser formats like SRT.)</li>
<li><strong>Other Features</strong>: <strong>E</strong>=online editor, <strong>S</strong>=speaker identification (diarisation), <strong>A</strong>=suggested alternatives, <strong>C</strong>=confidence score, <strong>V</strong>=custom vocabulary (not used in these tests).</li>
<li><strong>Approx Cost</strong>: base cost, before any bulk discount, in USD.</li>
</ul>
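<p>To illustrate why word-level timing is the most flexible of those: given per-word timestamps you can derive any coarser format yourself. Here&#8217;s a minimal sketch that emits SRT subtitle cues; the tuple shape is hypothetical, as each service uses its own JSON/XML layout:</p>
<pre><code>def srt_timestamp(secs):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(secs * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, words_per_cue=8):
    """words: list of (word, start_secs, end_secs), in time order."""
    cues = []
    for n, i in enumerate(range(0, len(words), words_per_cue), start=1):
        chunk = words[i:i + words_per_cue]
        text = " ".join(w for w, _, _ in chunk)
        cues.append(f"{n}\n{srt_timestamp(chunk[0][1])} --&gt; "
                    f"{srt_timestamp(chunk[-1][2])}\n{text}\n")
    return "\n".join(cues)

print(words_to_srt([("hello", 0.0, 0.4), ("and", 0.5, 0.6),
                    ("welcome", 0.7, 1.2)], words_per_cue=2))
</code></pre>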
<p>Note the clustering of WER scores. After the human services scoring from 4–6, the top-tier ASR services all score 10–16, with most around 12. The scores in the next tier are roughly double: 22–28. It seems likely that the top-tier systems are using more <a href="https://en.wikipedia.org/wiki/Speech_recognition#Modern_systems" target="_blank" rel="noopener noreferrer">modern technology</a>.</p>
<p>For <a href="https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/">my goals</a> I prioritise these features:</p>
<ul>
<li><strong>Accuracy</strong> is a priority, naturally, so most systems in the top tier would do.</li>
<li>A <strong>custom vocabulary</strong> would further improve accuracy.</li>
<li><strong>Cost</strong>. Clearly $0.02/min is <em>much</em> more attractive than $0.33/min when there are hundreds of hours of archives to transcribe. (I&#8217;m ignoring bulk discounts for now.)</li>
<li><strong>Word level timing</strong> enables accurate linking to audio segments and helps enable comparison/merging of transcripts from multiple sources (such as taking punctuation from one transcript and applying it to another).</li>
<li>Good <strong>punctuation</strong> reduces the manual review effort required to polish the automated transcript into something pleasantly readable. Recognition of questions would also help with <a href="https://en.wikipedia.org/wiki/Text_segmentation#Topic_segmentation" target="_blank" rel="noopener noreferrer">topic segmentation</a>.</li>
<li><strong>Speaker identification</strong> would also help identify questions and enable multiple &#8216;timelines&#8217; to help resolve transcripts where there&#8217;s cross-talk.</li>
</ul>
<p>Before Google released their updated Speech-to-Text service in April there wasn&#8217;t a clear winner for me. Now there is. Their new <code>video</code> premium model is significantly better than anything else I&#8217;ve tested.</p>
<p>I also tested their <a href="https://cloud.google.com/speech-to-text/docs/enhanced-models" target="_blank" rel="noopener noreferrer">enhanced models</a> a few weeks after I initially posted this. It didn&#8217;t help for my test file. I also tried setting <code>interactionType</code> and <a href="https://www.naics.com/search/" target="_blank" rel="noopener noreferrer"><code>industryNaicsCodeOfAudio</code></a> in the <a href="https://cloud.google.com/speech-to-text/docs/recognition-metadata" target="_blank" rel="noopener noreferrer">recognition metadata</a> of the video model but that made the WER slightly worse. Perhaps they will improve over time.</p>
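<p>For reference, here&#8217;s a minimal sketch of the kind of request involved, using the Google Cloud Speech-to-Text Python client. The exact API surface and field names may differ between client library versions, and the bucket path is hypothetical:</p>
<pre><code># Sketch of requesting the premium "video" model with recognition
# metadata, via the google-cloud-speech beta client. Field names follow
# the v1p1beta1 API; check the docs for your client library version.
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    language_code="en-US",
    model="video",                      # the premium video model
    use_enhanced=True,                  # opt in to the enhanced models
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,      # per-word timing
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.DISCUSSION,
        industry_naics_code_of_audio=519130,  # illustrative NAICS code
    ),
)
# Hypothetical bucket path; long audio must use the asynchronous API.
audio = speech.RecognitionAudio(uri="gs://my-bucket/episode.flac")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)
</code></pre>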
<p>Punctuation is clearly subjective but both Temi and Scribie get much closer than Google to the number of question marks and commas used by the human transcribers. Google did very well on capital letters though (a rough proxy for proper nouns).</p>
<p>I think we&#8217;ll see a growing ecosystem of tools and services using Google Speech-to-Text service as a backend. The <a href="https://www.descript.com" target="_blank" rel="noopener noreferrer">Descript app</a> is an interesting example.</p>
<h3 id="differential-analysis">Differential Analysis</h3>
<p>While working on <a href="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/">Comparing Transcriptions</a> I&#8217;d realized that comparing transcripts from multiple services is a good way to find errors because they tend to make different mistakes.</p>
<p>So for this post I also compared most of the top-tier services against one another, i.e. using the transcript from one as the &#8216;ground truth&#8217; for scoring others. <em>A higher WER score in this test is good</em>. It means the services are making <em>different</em> mistakes and those differences would highlight errors.</p>
<p>Google, Otter AI, Temi, Voicebase, Scribie, and TranscribeMe all scored a high WER, over 10, against all the others. Go-Transcribe vs Speechmatics had a WER of 6.1. SimonSays had a WER of 5.2 against Sonix, Trint, and Speechmatics. Trint, Sonix, and Speechmatics have very little difference between the transcripts, a WER of just 1.4. That suggests those three services are using very similar models and training data.</p>
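<p>Here&#8217;s a minimal sketch of that pairwise scoring, with toy transcripts standing in for the real ones; <code>wer()</code> is a compact Levenshtein distance over words:</p>
<pre><code>import itertools

def wer(ref, hyp):
    """Word error rate: minimum word edits (substitute/insert/delete)
    needed to turn hyp into ref, divided by the length of ref."""
    d = list(range(len(hyp) + 1))  # d[j] = distance(ref[:i], hyp[:j])
    for i, r in enumerate(ref, 1):
        diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            diag, d[j] = d[j], min(d[j] + 1,         # hyp dropped a word
                                   d[j - 1] + 1,     # hyp has an extra word
                                   diag + (r != h))  # substitution (or match)
    return d[-1] / max(len(ref), 1)

# Toy transcripts standing in for real ASR output:
transcripts = {
    "google": "we moved there in two thousand two".split(),
    "temi":   "we moved here in two thousand two".split(),
    "trint":  "we moved there in two thousand and two".split(),
}
for a, b in itertools.combinations(transcripts, 2):
    score = 100 * wer(transcripts[a], transcripts[b])
    print(f"{a} vs {b}: differential WER {score:.1f}")
</code></pre>
<p>Note that this simple form isn&#8217;t symmetric when the transcripts differ in length, since the denominator is always the first argument; for differential analysis that asymmetry doesn&#8217;t matter much.</p>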
<h3>What Next?</h3>
<p>My primary goal is to get the transcripts available and searchable, so the next phase would be developing a simple process to transcribe each podcast and convert the result into web pages. That much seems straightforward using the Google API. Then there&#8217;s working with the podcast host to integrate with their website, style, menus etc.</p>
<p>After that the steps are more fuzzy. I&#8217;ll be crossing the river by feeling the stones&#8230;</p>
<p>The automated transcripts will naturally have errors that people notice (and more that they won&#8217;t). To improve the quality it&#8217;s important to make it <em>very</em> easy for them to contribute corrections. Being able to listen to the corresponding section of audio would be a great help. All that will require a web-based user interface backed by a service and a suitable data model.</p>
<p>The suggested corrections will need reviewing and merging. That will require its own low-friction workflow. I have a vague notion of using <a href="https://guides.github.com" target="_blank" rel="noopener noreferrer">GitHub</a> for this.</p>
<p>Generating transcripts from at least one other service would provide a way to highlight possible errors, in both words and punctuation. Those highlights would be useful for readers and also encourage the contribution of corrections. Otter AI, Speechmatics and Voicebase are attractive low-cost options for these extra transcriptions, as are any contributed by volunteers. This kind of multi-transcription functionality has significant implications for the data model.</p>
<p>I&#8217;d like to directly support translations of the transcriptions. The original transcription is a moving target as corrections are submitted over time, so the translations would need to track corrections applied to the original transcription since the translation was created. Translators are also very likely to notice errors in the original, especially if they&#8217;re working from the audio.</p>
<p>Before getting into any design or development work, beyond the basic transcriptions, I&#8217;d want to do another round of due-diligence research, looking for what services and open source projects might be useful components or form good foundations. <a href="https://www.amara.org/en/" target="_blank" rel="noopener noreferrer">Amara</a> springs to mind. If you know of any existing projects or services that may be relevant please add a comment or let me know in some other way.</p>
<p>I&#8217;m not sure when, or even if, I&#8217;ll have any further updates on this hobby project. If you&#8217;re interested in helping out feel free to email me.</p>
<p>I hope you&#8217;ve found my rambling explorations interesting.</p>
<p>Updates:</p>
<ul>
<li>25th May 2018: Updated SimonSays.ai with much improved score</li>
<li>10th June 2018: Updated notes about Google enhanced model (not helping WER score).</li>
<li>8th September 2018: Added Otter AI, prompted by <a href="https://medium.com/descript/which-automatic-transcription-service-is-the-most-accurate-2018-2e859b23ed19" target="_blank" rel="noopener noreferrer">a note in a blog post by Descript comparing ASR systems</a>.</li>
<li>10th September 2018: Emphasised that I only used a single audio file for these tests. Noted that Otter.ai is free up to 600 mins/month.</li>
<li>14th September 2018: Added Spext.</li>
<li>14th September 2018: <a href="https://news.ycombinator.com/item?id=17986941">Discussion</a> about this post on Hacker News.</li>
<li>15th November 2018: Removed results for Vocapia at their request since they &#8220;do not consider that the testing was done in a scientifically rigorous manner&#8221;.</li>
<li>9th January 2019: Updated WER scores resulting from updated cleanup and normalization code. (The code now removes terms that are commonly <a href="https://en.wikipedia.org/wiki/Speech_disfluency">speech disfluencies</a> such as &#8220;you know&#8221; and &#8220;like&#8221;. This avoids penalizing services due to differences in how &#8220;verbatim&#8221; their results are.) All results got better, but some more than others. Spext overtook Otter.ai to take 2nd place. Trint overtook GoTranscribe and SimonSays to take 4th place, and Scribie ASR overtook TranscribeMe. You can see the <a href="https://web.archive.org/web/20181129211756/https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/">previous results on archive.org</a>. Note that I reused the same transcripts, only the scoring code changed. I&#8217;m working on a new blog post comparing many ASR services with 12 different audio files.</li>
<li>12th January 2019: Another update like the previous one, this time removing &#8220;yeah&#8221; which I had neglected to do previously. It&#8217;s one of the most frequent word errors in the ASR transcripts. (&#8220;Yeah&#8221; plays an interesting role in English discourse. Whole papers have been written about it, such as <a href="https://files.eric.ed.gov/fulltext/EJ1176966.pdf">Turn-initial Yeah in Nonnative Speakers’ Speech: A Routine Token for Not-so-routine Interactional Projects</a>.) Again, all the scores improved as expected, but some more than others. Speechmatics&#8217; score dropped from 11.71 to 11.35, raising it 3 places and overtaking SimonSays, GoTranscribe, and Trint. Otherwise the ASR ranking was unchanged.</li>
<li>Feb 11th 2019: I&#8217;ve written a <a href="https://blog.timbunce.org/2019/02/11/a-comparison-of-automatic-speech-recognition-asr-systems-part-2/">follow-up post which presents the results of evaluating 14 ASR systems with 12 different audio files</a> covering a variety of speakers, accents, and audio quality.</li>
<li>May 20th 2020: I&#8217;ve added a link to my <a href="https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/">follow-up post, part 3, which has updated results for the best systems</a>, including Rev.ai, AssemblyAI, Google, Speechmatics, and 3Scribe.</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/feed/</wfw:commentRss>
			<slash:comments>21</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1542</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Comparing Transcriptions</title>
		<link>https://blog.timbunce.org/2017/02/09/comparing-transcriptions/</link>
					<comments>https://blog.timbunce.org/2017/02/09/comparing-transcriptions/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Thu, 09 Feb 2017 18:18:57 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[tech]]></category>
		<category><![CDATA[transcription]]></category>
		<guid isPermaLink="false">http://blog.timbunce.org/?p=681</guid>

					<description><![CDATA[After a pause I am working again on my semi-automated podcast transcription project. The first part involves evaluating the quality of various methods of transcription. But how? In this post I&#8217;ll explore how I&#8217;ve been comparing transcripts to evaluate transcription services. I&#8217;ll include the results for some human-powered services. I&#8217;ll write up the results for automated services in a &#8230; <a href="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/" class="more-link">Continue reading <span class="screen-reader-text">Comparing Transcriptions</span></a>]]></description>
										<content:encoded><![CDATA[<p>After a pause I am working again on my <a href="https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/">semi-automated podcast transcription</a> project. The first part involves evaluating the quality of various methods of transcription. But how?</p>
<p>In this post I&#8217;ll explore how I&#8217;ve been comparing transcripts to evaluate transcription services. I&#8217;ll include the results for some human-powered services. I&#8217;ll write up the results for automated services in a later post.</p>
<p><span id="more-681"></span></p>
<h1>Accuracy</h1>
<p>The key metric for transcription is accuracy: how closely the words in the generated transcript match the spoken words in the original audio.</p>
<p>To compare the words in the transcript with the audio you need a <em>reference</em> transcript that&#8217;s deemed to be completely accurate; a <a href="https://en.wikipedia.org/wiki/Ground_truth" target="_blank">ground truth</a> upon which comparisons can be based. Then the sequence of words in that reference transcript can be compared against the sequence of words in the <em>hypothesis</em> transcript from each system being compared.</p>
<h2>Word Error Rate</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Word_error_rate" target="_blank">Word Error Rate</a> (WER) is a very simple and widely used metric for transcription accuracy. It&#8217;s a number, calculated as the number of words that need to be changed or inserted or deleted to convert the hypothesis transcript into the reference transcript, divided by the number of words in the reference transcript. (It&#8217;s the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">Levenshtein distance</a> for words, measuring the minimum number of single word edits to correct the transcript.) A perfect match has a WER of zero; larger values indicate lower accuracy and thus more editing.</p>
<p>Of course there are <a href="https://en.wikipedia.org/wiki/Word_error_rate#Other_metrics" target="_blank">other metrics</a>, and arguments pointing out that <a href="https://www.microsoft.com/en-us/research/publication/why-word-error-rate-is-not-a-good-metric-for-speech-recognizer-training-for-the-speech-translation-task/" target="_blank">not all words are equally important</a>. For our purposes though, the simplicity of WER is very appealing, and it&#8217;s <a href="https://medium.com/descript/challenges-in-measuring-automatic-transcription-accuracy-f322bf5994f">widely used in the industry</a>.</p>
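<p>A tiny worked example: if the reference reads &#8220;just down the road&#8221; and the hypothesis reads &#8220;just on the road again&#8221;, then converting the hypothesis into the reference takes one substitution (&#8220;on&#8221; becomes &#8220;down&#8221;) and one deletion (&#8220;again&#8221;): two edits against a four-word reference, giving a WER of 2/4 = 0.5, or 50%.</p>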
<h2>One Word Per Line</h2>
<p>I decided early on that I&#8217;d simply convert a transcript text file into a file containing one word per line, and then use a simple <a href="https://en.wikipedia.org/wiki/Diff_utility" target="_blank">diff command</a> to identify the words that need to be changed/inserted/deleted, and the <a href="https://linux.die.net/man/1/diffstat" target="_blank">diffstat command</a> to count them up.</p>
<p>Simple, in theory. In practice a significant amount of &#8216;normalization&#8217; work, which I&#8217;ll describe below, was needed to reduce spurious differences.</p>
<h2>Visualizing the Differences</h2>
<p>A very useful command for inspecting and comparing these &#8216;word files&#8217; is <a href="http://vimcasts.org/episodes/comparing-buffers-with-vimdiff/" target="_blank">vimdiff</a>. It gives a clear colour-coded view of the differences between up to four files.</p>
<p>Here&#8217;s an example comparing word files that haven&#8217;t been normalized. The left-most column is the transcript produced by a human volunteer. The other three columns, from left to right, were generated by <em>automated</em> systems (in this case VoiceBase API, Dragon by Nuance, and Watson by IBM).</p>
<p><img data-attachment-id="855" data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/vimdiff-nonorm/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png" data-orig-size="1954,1310" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="vimdiff-nonorm" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=676" class="alignnone size-full wp-image-855" src="https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=676" alt="vimdiff-nonorm.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png 1954w, https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=150&amp;h=101 150w, https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=300&amp;h=201 300w, https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=768&amp;h=515 768w, https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=1024&amp;h=687 1024w, https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png?w=1440&amp;h=965 1440w" sizes="(max-width: 1954px) 100vw, 1954px" /></p>
<p>This very small section of the transcript has several interesting differences.</p>
<p>Before I talk about normalization I want to draw your attention to the second column, the automated transcript by VoiceBase. Note the &#8220;two thousand <em>and</em> two&#8221; vs &#8220;two thousand two&#8221; in the fourth column, and &#8220;just <em>down</em> the road&#8221; vs &#8220;just <em>on</em> the road&#8221; in the other three columns. The phrases &#8220;just <em>down</em> the road&#8221; and &#8220;two thousand <em>and</em> two&#8221; would be more common than the alternatives, yet the alternatives are correct in this case.</p>
<p>I suspect this is an example of automated services giving too much weight to their training data when selecting the best hypothesis for a sentence. (It&#8217;s a similar situation to autocorrect correcting your misspellings with correctly spelt but <a href="http://www.damnyouautocorrect.com/75844/best-autocorrects-april-2014/" target="_blank">inappropriate replacements</a>.) A key point here is that, because the chosen hypothesis is likely to read well, it&#8217;s harder for a human to notice this kind of error.</p>
<h2>Normalization</h2>
<p>You can see from the vimdiff output above that numbers can cause differences to be flagged even though the words convey the correct meaning: &#8220;2,000&#8221; vs &#8220;2000&#8221; vs &#8220;two thousand&#8221;, and &#8220;2002&#8221; vs &#8220;two thousand and two&#8221; vs &#8220;two thousand two&#8221;. To address this I wrote some code to &#8216;normalize&#8217; the words. For numbers it converts them to word form and handles years (&#8220;1960&#8221; to &#8220;nineteen sixty&#8221;) and pluralized years (&#8220;1960s&#8221; to &#8220;nineteen sixties&#8221;) as special cases.</p>
<p>Words containing apostrophes are another source of differences: I&#8217;m splitting on any non-word character so &#8220;they&#8217;re&#8221; is being split into two words. Some transcription systems might produce &#8220;they are&#8221; (not shown in this example). <a href="http://www.grammarbook.com/punctuation/apostro.asp" target="_blank">Apostrophes are tricky</a>. To address this the normalization code expands common contractions, like &#8220;they&#8217;re&#8221; into &#8220;they are&#8221;, handles some non-possessive uses of &#8220;&#8216;s&#8221; and removes the apostrophe in all other cases. It&#8217;s a fudge of course but seems to work well enough.</p>
<p>Another significant area for normalization is <a href="https://www.grammarly.com/handbook/mechanics/compound-words/" target="_blank">compound words</a>. Some people, and systems, might write &#8220;audio book&#8221; while others write &#8220;audiobook&#8221; or &#8220;audio-book&#8221;. To address this the normalization code expands closed compound words, like &#8220;audiobook&#8221; into the separate words &#8220;audio book&#8221;. It seemed to be too much work to generalize this so the normalization code has a hard-coded list that detects around 70 compound words that I encountered while testing. (Remember that the goal here isn&#8217;t perfection, it&#8217;s simply reducing the number of insignificant differences for my test cases so the WER scores are more meaningful.)</p>
<p>Other normalizations the code handles include ordinals (&#8220;20th&#8221; becomes &#8220;twentieth&#8221;), informal terms (&#8220;gotta&#8221; becomes &#8220;got to&#8221;), spellings (&#8220;realise&#8221; becomes &#8220;realize&#8221;), and some abbreviations are collapsed (&#8220;L.A.&#8221; becomes &#8220;LA&#8221;).</p>
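<p>To give a flavour of these rules, here&#8217;s a minimal sketch of that kind of normalizer; the word tables are tiny illustrative stand-ins for the real, much longer, hard-coded lists:</p>
<pre><code>import re

# Tiny illustrative stand-ins for the real, much longer, hard-coded tables.
CONTRACTIONS = {"they're": "they are", "i'm": "i am", "gotta": "got to"}
COMPOUNDS = {"audiobook": "audio book"}
SPELLINGS = {"realise": "realize"}

ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = ("", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety")

def two_digits(n):
    t, o = divmod(n, 10)
    return ONES[n] if n &lt; 20 else TENS[t] + (" " + ONES[o] if o else "")

def year_words(year):
    """'1960' -&gt; 'nineteen sixty' (skipping edge cases like 1900 or 2005)."""
    hi, lo = divmod(int(year), 100)
    return two_digits(hi) + " " + two_digits(lo)

def normalize(text):
    out = []
    for word in text.lower().split():
        word = word.strip('.,?!"')
        m = re.fullmatch(r"(1[0-9]{3})(s?)", word)
        if m:  # a year like 1960, or a pluralized decade like 1960s
            spoken = year_words(m.group(1))
            word = spoken[:-1] + "ies" if m.group(2) else spoken
        word = CONTRACTIONS.get(word, word)
        word = COMPOUNDS.get(word, word)
        word = SPELLINGS.get(word, word)
        out.extend(word.split())
    return out

print(normalize("They're listening to an audiobook from the 1960s."))
# -&gt; ['they', 'are', 'listening', 'to', 'an', 'audio', 'book',
#     'from', 'the', 'nineteen', 'sixties']
</code></pre>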
<h2>Transcripts Produced by Humans</h2>
<p>For my evaluation I chose a podcast episode of just under two hours in length that had good quality audio and already had a transcript produced by a volunteer. The primary voice was an American male who spoke clearly but quickly and without a strong accent.</p>
<p>I suspected that a single human-produced transcript wouldn&#8217;t be sufficiently reliable so I ordered human-produced transcripts from three separate transcription services. This turned out to be more interesting, and more useful, than I&#8217;d anticipated.</p>
<ul>
<li><a href="http://www.3playmedia.com" target="_blank">3PlayMedia</a> &#8220;Extended (10 Days)&#8221;
<ul>
<li>Rate $3/min = $333.44 total.</li>
<li>Completed in 7 days.</li>
<li>Seven &#8220;flags&#8221; in the transcript, mostly &#8220;inaudible&#8221; or &#8220;interposing voices&#8221;.</li>
</ul>
</li>
<li><a href="https://scribie.com/" target="_blank">Scribie</a> &#8220;Flex 5 (3-5 Days)&#8221;
<ul>
<li>Rate $0.75/min = $83.35 total.</li>
<li>Completed in 3 days. Formats: TXT, DOC, ODF, PDF. (No timecodes)</li>
</ul>
</li>
<li><a href="http://www.voicebase.com/human-transcription/" target="_blank">VoiceBase</a> &#8220;Premium 99%, 5-7 Days, 3 human reviews&#8221;
<ul>
<li>Rate $1.5/min = $168.00 total.</li>
<li>Completed in 4 hours. Formats: PDF, RTF, SRT (timecodes).</li>
<li>Note that VoiceBase provide both human and automated transcription services.</li>
</ul>
</li>
</ul>
<p>The 4:1 difference in cost is notable! 3PlayMedia certainly charge a premium. Let&#8217;s see how their transcripts compare.</p>
<p>With the normalization I was expecting these transcripts, all produced by humans, to be closer to each other than they turned out to be. Here&#8217;s an example of some differences:</p>
<p><img data-attachment-id="1001" data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-6/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png" data-orig-size="1952,430" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=676" class="alignnone size-full wp-image-1001" src="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png 1952w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=150&amp;h=33 150w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=300&amp;h=66 300w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=768&amp;h=169 768w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=1024&amp;h=226 1024w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png?w=1440&amp;h=317 1440w" sizes="(max-width: 1952px) 100vw, 1952px" /></p>
<p>(In this vimdiff image, and all the ones that follow, the columns from left-to-right are: Volunteer, 3PlayMedia, Scribie, VoiceBase.)</p>
<p>That shows three different transcriptions of the &#8220;to use&#8221; phrase. It&#8217;s not very clear on the audio and I&#8217;d agree with the majority here and say that &#8220;you used&#8221; was correct. In this example it isn&#8217;t significant but it does illustrate that imperfect human judgement is needed when the audio isn&#8217;t completely clear. Transcribers have to write <em>something</em> and can&#8217;t easily express their degree of confidence. If their confidence falls too low they might write something like &#8220;[inaudible]&#8221; or &#8220;[crosstalk]&#8221;, but above that threshold they take a guess <em>and</em> <em>you don&#8217;t know that</em>. And because the guess is likely to read well it&#8217;s harder for a human to notice this kind of error.</p>
<p>The second difference shown in that image relates to the difference between a <em>clean transcript</em> and a <em>verbatim transcript</em>. In a clean transcript confirmational affirmations (“Uh-huh.”, “I see.”), filler words (&#8220;ah&#8221;, &#8220;um&#8221;) and other forms of <a href="https://en.wikipedia.org/wiki/Speech_disfluency" target="_blank">speech disfluency</a> aren&#8217;t included.</p>
<p>That &#8220;you are&#8221; is in the audio, so would be in a verbatim transcript, but three of the four humans decided that it wasn&#8217;t significant and should be left out of the clean transcript. But one of the four decided it should be kept in. Other common examples include &#8220;I mean&#8221; and &#8220;you know&#8221;. There&#8217;s no right answer here, it&#8217;s a judgement call case-by-case.</p>
<p>It works the other way as well. Sometimes the transcriber will <em>add</em> a word or two that they think makes the text more clear. Compare &#8220;everything submits to and is accountable to&#8221; with &#8220;everything submits to <em>it</em> and is accountable to&#8221;. Two of the four humans decided to add an &#8220;it&#8221; that wasn&#8217;t in the audio. Similarly, with &#8220;believe it&#8221; vs &#8220;believe <em>in</em> it&#8221;, again two of the four added an &#8220;in&#8221;, only this time it was not the same transcribers adding the word. Transcribers are likely to &#8220;clean&#8221; transcripts in a way that&#8217;s biased towards their own speaking style. Speakers have a distinct verbal style and changes like these by a transcriber can be more distracting than helpful if not in keeping with the speaker&#8217;s own style.</p>
<p>Generally these interpretations of the audio, and writing the corresponding text, are made with care and don&#8217;t alter the meaning for the reader. At least that&#8217;s what I was thinking until I came across this: <img loading="lazy" data-attachment-id="1071" data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-7/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png" data-orig-size="1952,108" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=676" class="alignnone size-full wp-image-1071" src="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png 1952w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=150&amp;h=8 150w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=300&amp;h=17 300w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=768&amp;h=42 768w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=1024&amp;h=57 1024w, https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png?w=1440&amp;h=80 1440w" sizes="(max-width: 1952px) 100vw, 1952px" /> To be fair this part of the audio is a little garbled due to crosstalk. I&#8217;m sure the speaker said &#8220;doesn&#8217;t matter&#8221; (which got normalized to &#8220;does not matter&#8221;). It&#8217;s another example of where the lack of confidence indicators in human transcripts is a problem.</p>
<p>Here are a few other examples of differences in these human transcripts that caught my eye:</p>
<p><img loading="lazy" data-attachment-id="1324" data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-12/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png" data-orig-size="1950,256" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=676" class="alignnone size-full wp-image-1324" src="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png 1950w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=150&amp;h=20 150w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=300&amp;h=39 300w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=768&amp;h=101 768w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=1024&amp;h=134 1024w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png?w=1440&amp;h=189 1440w" sizes="(max-width: 1950px) 100vw, 1950px" /><img loading="lazy" data-attachment-id="1314" data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-8/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png" data-orig-size="1946,110" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=676" class="alignnone size-full wp-image-1314" src="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png 1946w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=150&amp;h=8 150w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=300&amp;h=17 300w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=768&amp;h=43 768w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=1024&amp;h=58 1024w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png?w=1440&amp;h=81 1440w" sizes="(max-width: 1946px) 100vw, 1946px" /><img loading="lazy" data-attachment-id="1317" 
data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-9/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png" data-orig-size="1952,188" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=676" class="alignnone size-full wp-image-1317" src="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png 1952w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=150&amp;h=14 150w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=300&amp;h=29 300w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=768&amp;h=74 768w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=1024&amp;h=99 1024w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png?w=1440&amp;h=139 1440w" sizes="(max-width: 1952px) 100vw, 1952px" /><img loading="lazy" data-attachment-id="1319" data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-10/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png" data-orig-size="1950,184" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=676" class="alignnone size-full wp-image-1319" src="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png 1950w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=150&amp;h=14 150w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=300&amp;h=28 300w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=768&amp;h=72 768w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=1024&amp;h=97 1024w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png?w=1440&amp;h=136 1440w" sizes="(max-width: 1950px) 100vw, 1950px" /><img loading="lazy" data-attachment-id="1321" 
data-permalink="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/ssp_temp_capture-11/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png" data-orig-size="1948,114" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ssp_temp_capture" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=676" class="alignnone size-full wp-image-1321" src="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=676" alt="ssp_temp_capture.png"   srcset="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png 1948w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=150&amp;h=9 150w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=300&amp;h=18 300w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=768&amp;h=45 768w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=1024&amp;h=60 1024w, https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png?w=1440&amp;h=84 1440w" sizes="(max-width: 1948px) 100vw, 1948px" /></p>
<h2>Getting to Ground Truth</h2>
<p>To compare transcription services I needed a <em>reference</em> <em>transcript</em> &#8211; a <a href="https://en.wikipedia.org/wiki/Ground_truth" target="_blank">ground truth</a> &#8211; against which to compare the others.</p>
<p>Since the transcripts varied significantly I had little choice but to create my own &#8216;ground truth&#8217; transcript manually. I copied the transcript generated by the volunteer, then listened to the audio in the places where the various transcripts differed in a non-obvious way &#8211; well over 200 of them. For each place I decided what the most accurate transcription was and edited the ground truth transcript to match. The most difficult places were where multiple speakers talk over one another. It&#8217;s very hard to convey the intent clearly and accurately in a linear sequence of words. Often &#8216;clearly&#8217; and &#8216;accurately&#8217; are at odds with each other.</p>
<p>I repeated this process with the automated transcripts in order to add in the disfluencies, confirmational affirmations, etc. present in the audio &#8211; in other words, to shift the ground truth from a &#8216;clean&#8217; transcript to something closer to a &#8216;verbose&#8217; transcript. Without that work the apparent Word Error Rate of the automated transcripts would be unfairly inflated. They would all be equally affected, but that effect would reduce the visibility of the genuine differences. (With hindsight I should have used a separate file for this step, but the overall process was iterative and exploratory rather than the linear sequence outlined here.)</p>
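<p>If you want to reproduce this kind of comparison, locating the places where two transcripts differ is easy to automate. Here&#8217;s a minimal sketch using only the Python standard library (the file names are hypothetical):</p>
<pre>import difflib

def differing_spans(path_a, path_b):
    # Compare two transcripts word by word and print each differing span.
    words_a = open(path_a).read().split()
    words_b = open(path_b).read().split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            print(tag, repr(' '.join(words_a[i1:i2])), 'vs', repr(' '.join(words_b[j1:j2])))

differing_spans('ground_truth.txt', 'service_transcript.txt')  # hypothetical files</pre>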
<p>The final ground truth transcript has 21,629 words.</p>
<h1>Other Attributes</h1>
<h2>Speaker Diarisation</h2>
<p>The transcripts produced by the volunteer and by Scribie identified the speakers. The transcript from Voicebase identified transitions from one speaker to another, but didn&#8217;t identify the specific speaker. The transcript from 3PlayMedia didn&#8217;t identify speakers or transitions, despite costing three to four times as much.</p>
<h2>Quality Flags</h2>
<p>3PlayMedia flagged seven places in the transcript with [? &#8230; ?] where the transcriber was unsure of the words but had made a reasonable guess, plus three instances of [INTERPOSING VOICES] and nine [INAUDIBLE]. Voicebase flagged five [CROSSTALK] and two [INAUDIBLE]. Scribie flagged none.</p>
<p>Most of the text 3PlayMedia flagged as unsure turned out to be correct. For about half of the [INTERPOSING VOICES] and [INAUDIBLE] flags in the 3PlayMedia transcript, one or more of the other services had produced an accurate transcription.</p>
<h2>Segmentation / Punctuation</h2>
<pre>Service      Sentences   Paragraphs
==========   =========   ==========
Volunteer          842          158
3PlayMedia        1259          331
Scribie            915          223
Voicebase         1077          648</pre>
<p>I&#8217;m not sure what to make of those wide differences. The figures are a little noisy due to artifacts in the way the files were processed, but most of the differences seem to be due simply to style.</p>
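<p>Counts like these can be produced with some very simple text processing along the following lines &#8211; a naive sketch, which also illustrates why such figures are noisy:</p>
<pre>import re

def rough_counts(text):
    # Treat blank lines as paragraph breaks and . ! ? as sentence ends.
    # A naive sketch; abbreviations, ellipses, etc. will skew the counts.
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    sentences = re.findall(r'[.!?]+', text)
    return len(sentences), len(paragraphs)</pre>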
<p>The figures for the automated systems (in a future post) highlight those which do a very poor job of segmentation. The big downside of Dragon by Nuance, for example, is that it doesn&#8217;t do segmentation at all. You simply get a very long stream of words. So no matter how accurate the words might be, you&#8217;ll still have a lot of work to do to make the transcript usable.</p>
<h1>Results for Human Transcription</h1>
<pre>Service      WER   Diarisation     Timing      Cost
==========   ===   ===========     ======      =======
3PlayMedia   4.5   None            Subtitles   $333.44
VoiceBase    4.6   Transitions     Subtitles   $168.00
Scribie      5.1   Speaker names   Paragraph   $ 83.35
Volunteer    5.3   Speaker names   None        N/A</pre>
<p>The important number here is the Word Error Rate (WER). Lower is better. The difference between 4.5 and 5.3 is quite small in practice. Most of the &#8216;errors&#8217; are due to insignificant differences, or to ambiguous sections caused by cross-talk.</p>
<p>I suspect a WER around 5 represents a reasonable &#8216;best case&#8217; for transcription of an interview. For comparison, the best of the automated transcription services I&#8217;m testing have WERs of 12 to 16, with some in the 30 to 40 range.</p>
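<p>For reference, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. Here&#8217;s a minimal sketch of the standard dynamic-programming calculation (real scoring tools also normalise case, punctuation, and number formats before comparing):</p>
<pre>def wer(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / reference length
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer('the cat sat on the mat', 'the cat sat in the hat'))  # ~33.3</pre>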
<p>All this work was to understand how to judge the accuracy of a transcription in order to evaluate automated systems. Comparing human transcription services turned out to be a useful way to understand the issues and to develop the tools.</p>
<p>It&#8217;s clear that for the highest accuracy it&#8217;s very helpful to use more than one service and check the places where they differ. Of course that significantly increases the cost and effort.</p>
<p>I&#8217;m testing a number of automated systems currently and I&#8217;ll include those results in a later blog post.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2017/02/09/comparing-transcriptions/feed/</wfw:commentRss>
			<slash:comments>19</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">681</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2016/12/vimdiff-nonorm.png" medium="image">
			<media:title type="html">vimdiff-nonorm.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture5.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2016/12/ssp_temp_capture6.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture4.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture1.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture2.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2017/01/ssp_temp_capture3.png" medium="image">
			<media:title type="html">ssp_temp_capture.png</media:title>
		</media:content>
	</item>
		<item>
		<title>Semi-automated podcast transcription</title>
		<link>https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/</link>
					<comments>https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Tue, 22 Mar 2016 23:01:17 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[transcription]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=609</guid>

					<description><![CDATA[The medium of podcasting continues to grow in popularity. Americans, for example, now listen to over 21 million hours of podcasts per day. Few of those podcasts have transcripts available, so the content isn&#8217;t discoverable, searchable, linkable, reusable. It&#8217;s lost. The typical solution is to pay a commercial transcription service, which charge roughly $1/minute and &#8230; <a href="https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/" class="more-link">Continue reading <span class="screen-reader-text">Semi-automated podcast transcription</span></a>]]></description>
										<content:encoded><![CDATA[<p>The medium of podcasting <a href="http://www.journalism.org/2015/04/29/podcasting-fact-sheet/">continues to grow in popularity</a>. Americans, for example, now listen to <a href="http://www.macrumors.com/2015/02/25/podcasts-growth-2014-serial/">over 21 million hours of podcasts per day</a>. Few of those podcasts have transcripts available, so the content <a href="https://scribie.com/blog/2015/09/english-podcast-and-transcription/">isn&#8217;t discoverable, searchable, linkable, reusable</a>. It&#8217;s lost.</p>
<p>The typical solution is to pay a commercial transcription service. These charge roughly $1/minute and claim around 98% accuracy. For a podcast producing an hour of content a week, that would add an overhead of around $250 a month. A back catalogue of a year of podcasts would cost over $3,100 to transcribe.</p>
<p>When I remember fragments of some story or idea that I recall hearing on a podcast, I&#8217;d like to be able to find it again. Without searchable transcripts I can&#8217;t. It&#8217;s impractical to listen to hundreds of old episodes, so the content is effectively lost.</p>
<p>Given the advances in automated speech recognition in recent years, I began to wonder if some kind of automated transcription system would be practical. This led on to some thinking about interesting user interfaces.</p>
<p>This (long) post is a record of my research and ponderings around this topic. I sketch out some goals, constraints, and a rough outline of what I&#8217;m thinking of, along with links to many tools, projects, and references to information that might help. I&#8217;ve also been updating it as I&#8217;ve come across extra information and new services.</p>
<p>I&#8217;m hoping someone will tell me that such a system, or parts of it, already exist so that I can contribute to those existing projects. If not then I&#8217;m interested in starting a new project – or projects – and would welcome any help. Read on if you&#8217;re interested&#8230;<span id="more-609"></span></p>
<h1>My Goals</h1>
<p>Here is an outline of functionality that I&#8217;d like from a basic automated system:</p>
<ol>
<li>Produce podcast transcripts as plain text on static web pages that are indexed by search engines.</li>
<li>Provide anchors to make it easy for people to link to a particular section, or sections, in the transcript.</li>
<li>Provide buttons to play the audio/video from that point. This requires the transcription to have <a href="https://en.wikipedia.org/wiki/Timecode">timecode</a> data.</li>
<li>Identify and show who is speaking, e.g. via <a href="https://en.wikipedia.org/wiki/Speaker_diarisation">speaker diarisation</a>.</li>
</ol>
<p>Of course, an automated transcription is likely to have errors. <a href="https://scribie.com/blog/2015/06/humans-are-better-than-machines-for-transcription/">Perhaps many</a>. For a popular podcast there are likely to be <em>some</em> members of the audience (perhaps many) who are willing to contribute <em>some</em> amount of time to checking and correcting errors, somewhat like <a href="https://en.wikipedia.org/wiki/Wikipedia:Contributing_to_Wikipedia">Wikipedia</a>. A low-friction user experience makes that more likely.</p>
<p>In other words, <a href="https://en.wikipedia.org/wiki/Crowdsourcing">crowdsourcing</a> of error checking and correction may be a viable way to close the &#8220;quality gap&#8221; between manual and automated transcriptions. At this point I&#8217;ve no idea how big that gap will be, though I&#8217;m confident it can be made small enough for this whole endeavour to be worthwhile. (I&#8217;m assuming that the podcasts will have clear high-quality audio.)</p>
<p>I have explored the options for transcription in more detail below.</p>
<p>Beyond the basic transcription, presentation, and editing features there are many interesting possibilities for future enhancements.</p>
<h2>Natural language processing</h2>
<p>Automated <a href="https://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a> is becoming <a href="http://cacm.acm.org/magazines/2016/3/198856-deep-or-shallow-nlp-is-breaking-out/fulltext">a lot more powerful</a> and could be used to <em>enrich</em> the transcript with extra information. <a href="http://gitxiv.com/category/natural-language-processing-nlp">For example</a>:</p>
<ul>
<li>Using <a href="https://en.wikipedia.org/wiki/Keyword_extraction">keyword extraction </a>to automatically identify suitable keywords for indexing, to aid search and discovery (see the sketch at the end of this section). Also <a href="https://en.wikipedia.org/wiki/Named-entity_recognition" target="_blank" rel="noopener">entity extraction</a> to identify the names of things, such as people, companies, or locations.</li>
<li>Identification of <em><a href="https://en.wikipedia.org/wiki/Text_segmentation#Topic_segmentation">topic segments</a></em> within a podcast is much more difficult, but also more useful. This is an interesting area of research, e.g. <a href="http://maui-indexer.blogspot.ie/2009/05/what-is-maui-about.html">Maui</a> (<a href="http://www.medelyan.com/software" target="_blank" rel="noopener">software</a>). I&#8217;d like to support overlapping segments to cover both high-level themes and the specifics within them.</li>
<li>The keyword extraction could then be applied to individual segments, as well as whole podcasts, to aid finer-grained indexing.</li>
<li>Some kind of classification of topics into, or with, a <a href="https://en.wikipedia.org/wiki/Taxonomy_(general)">taxonomy</a> might also be helpful for someone exploring a large topic space.</li>
<li>Generate <a href="https://en.wikipedia.org/wiki/Automatic_summarization">automatic summaries</a> of segments. The summaries for all the segments would form a summary of the episode.</li>
</ul>
<p>Those would open up alternative ways to search and explore a collection of podcasts. You&#8217;d be able to easily read or listen to all the segments that touch on a given topic across many episodes. Perhaps stitching them into a thread or &#8216;playlist&#8217; you can share with others, somewhat like <a href="https://storify.com">Storify.com</a>.</p>
<p>There are also more immediate, practical problems such as <a href="https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation">recognising the boundary between sentences</a> and <a href="https://en.wikipedia.org/wiki/Truecasing">fixing the casing of words</a>. These aren&#8217;t critical but would significantly reduce the error checking and correction required to create a high quality transcript.</p>
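<p>As a toy illustration of the keyword extraction idea mentioned above, here&#8217;s a naive term-frequency sketch in Python. The real tools linked above are far more sophisticated, using phrase detection, corpus statistics, and controlled vocabularies:</p>
<pre>import collections
import re

# A tiny stopword list for illustration; real extractors use far more.
STOPWORDS = set('the a an and or of to in is it that this for on as are was be'.split())

def keywords(text, n=10):
    # Rank words by frequency, skipping stopwords and very short words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = collections.Counter(
        w for w in words if w not in STOPWORDS and len(w) &gt; 3)
    return [word for word, _ in counts.most_common(n)]</pre>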
<h2>Database Storage</h2>
<p>It should be clear by now that the underlying transcript data will need to be stored in some kind of database where it can be augmented with timecodes, speakers, segment details, keywords etc.</p>
<p>The database would also support user interfaces for error checking and correction, fine-tuning segments, and keywords etc.</p>
<p>From there the transcripts could be output in a variety of forms, from static web pages to rich interactive tools for exploration and sharing.</p>
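<p>To make that concrete, here&#8217;s one possible shape for such a store &#8211; a sketch using SQLite for illustration, with hypothetical table and column names:</p>
<pre>import sqlite3

db = sqlite3.connect('transcripts.db')
db.executescript('''
    CREATE TABLE IF NOT EXISTS words (
        episode_id  INTEGER,
        start_ms    INTEGER,   -- timecode of the word
        end_ms      INTEGER,
        speaker     TEXT,      -- filled in by diarisation
        word        TEXT,
        confidence  REAL       -- from the recogniser, if available
    );
    CREATE TABLE IF NOT EXISTS segments (
        episode_id  INTEGER,
        start_ms    INTEGER,
        end_ms      INTEGER,
        topic       TEXT,      -- from topic segmentation
        summary     TEXT       -- from automatic summarisation
    );
''')</pre>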
<h2>Full Text Search</h2>
<p>Web search engines like Google and Bing are very good at what they do. Yet they are very general tools, trying to do the best they can for <em>all</em> the web pages on the internet. There are better tools for specific jobs.</p>
<p>One that I&#8217;m familiar with is <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> which has a rich set of features for <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html">dealing with human language</a> and powerful <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/full-text-search.html">full-text search capabilities</a>. Beyond its general capabilities, it can be taught <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms.html">synonyms</a> <em>specific to the topics covered by the podcast</em>. This would significantly improve the quality of search results.</p>
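<p>As a sketch of what that might look like, here&#8217;s an index created with podcast-specific synonyms via the Elasticsearch REST API (the index name and synonym list are made up, and Elasticsearch is assumed to be running locally):</p>
<pre>import requests

settings = {
    'settings': {
        'analysis': {
            'filter': {
                'podcast_synonyms': {
                    'type': 'synonym',
                    'synonyms': [
                        'asr, speech recognition, speech-to-text',
                        'diarisation, diarization',
                    ],
                },
            },
            'analyzer': {
                'podcast_text': {
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'podcast_synonyms'],
                },
            },
        },
    },
}
requests.put('http://localhost:9200/podcasts', json=settings)</pre>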
<h2>Video Subtitles/Captions</h2>
<p>I generally listen to podcasts as audio, while driving or resting, even those that are available as videos. I hadn&#8217;t given any thought to subtitles as another output format until I started researching what transcription tools, projects and services already existed. I&#8217;ll talk more about it below.</p>
<h2>Schematic</h2>
<p>Here&#8217;s a simple schematic, for what it&#8217;s worth: <img style="display:block;margin-left:auto;margin-right:auto;" src="https://blog.timbunce.org/wp-content/uploads/2016/02/transcription-data-flow-001.png?w=392&#038;h=239" alt="Transcription Data Flow Schematic" width="392" height="239" border="0" /></p>
<h1>What&#8217;s Out There</h1>
<h2>Applications that Facilitate Manual Transcription</h2>
<p>These tools typically provide a user-interface that combines a media player with a text editor. You play the media and start typing what you hear (as fast as you can), pause, rewind a bit, repeat.</p>
<p>Here are a selection for reference, in no particular order:</p>
<ul>
<li><a href="https://www.inqscribe.com/compare.html">InqScribe</a> for Mac and Windows. $39-99.</li>
<li><a href="http://www.researchware.com/products/hypertranscribe.html">HyperTRANSCRIBE</a>, Mac and Windows. $40.</li>
<li><a href="http://www.transana.org">Transana</a>, Mac and Windows. $75.</li>
<li><a href="http://transcriber-pro.com/en">Transcriber Pro</a>, Windows only. €10/year</li>
<li><a href="http://www.transcriptiongear.com/gearplayer-transcription-software">GearPlayer</a>, Windows only. $120.</li>
<li><a href="https://pmtrans.codeplex.com">pmTrans</a>, open source for Linux, Mac, and Windows. Free.</li>
<li><a href="http://www.nch.com.au/scribe/">Express Scribe</a>, for Mac and Windows. Free.</li>
<li><a href="https://transcribe.wreally.com">Transcribe</a>, web service, $20/year.</li>
<li><a href="https://scribie.com/transcription/editor">Scribie transcription editor</a>, web service. Free.</li>
<li><a href="http://otranscribe.com">oTranscribe</a>, web app, <a href="https://github.com/oTranscribe">open source</a>.</li>
<li><a href="http://nowtranscribe.com">NowTranscribe</a> combines automatic generation of a draft with <em>predictive correction</em> and automatic control of the audio playback. It&#8217;s an innovative approach that&#8217;s worth <a href="https://www.youtube.com/watch?v=mja9A0KPxLA&amp;list=PLV3AgXByReXw312ApNjjedQYXXFwzjmjn">seeing in action</a>.</li>
</ul>
<p>If you&#8217;re performing manual transcription at the moment, especially with a standard word processor, I&#8217;d urge you to try some of these. They may smooth out the process in many small ways that accumulate to save you a lot of time and effort.</p>
<p>When performing manual transcription it obviously helps to be able to type fast, ideally fast enough to keep up with the speakers. Approximate <a href="https://en.wikipedia.org/wiki/Words_per_minute">words per minute</a> rates are around 150–200 for typical podcast speakers, and 40–80 for average-to-good typists. So speech arrives roughly two to four times faster than most people can type it. That difference creates a problem.</p>
<p>Users of <a href="https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard">Dvorak keyboards</a> often report significantly faster typing speeds. For maximum speed you might be interested in the <a href="http://www.openstenoproject.org">Open Stenography Project</a>.</p>
<p>Very few transcribers can keep up with typical speakers. The usual solution is to use a <a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Delectronics&amp;field-keywords=transcription+foot+pedal&amp;rh=n%3A172282%2Ck%3Atranscription+foot+pedal">foot pedal</a> to rewind the media by a few seconds whenever needed, so that your fingers can stay on the home row of the keyboard. Yet every time you rewind there&#8217;s a break in your flow and productivity falls.</p>
<p>An alternative approach is to slow the media playback down to match a comfortable typing rate. This can be done with <a href="https://en.wikipedia.org/wiki/Audio_time-scale/pitch_modification">audio time-scale/pitch modification</a> techniques such as <a href="https://en.wikipedia.org/wiki/PSOLA">PSOLA</a> which can change the speed without altering the pitch. Most of the tools I&#8217;ve listed above support variable speed playback, but only a few explicitly mention maintaining the correct pitch. The free web-based <a href="https://scribie.com/transcription/editor">Scribie transcription editor</a> seems particularly good at this.</p>
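<p>If you want to prepare slowed-down audio yourself, here&#8217;s a minimal sketch using a phase-vocoder approach rather than PSOLA (assuming the Python librosa and soundfile packages are installed, plus ffmpeg for MP3 decoding; the file names are hypothetical):</p>
<pre>import librosa
import soundfile

# Load the audio at its native sample rate.
audio, sample_rate = librosa.load('podcast.mp3', sr=None)

# Stretch to 75% speed without changing the pitch.
slowed = librosa.effects.time_stretch(audio, rate=0.75)

soundfile.write('podcast_slow.wav', slowed, sample_rate)</pre>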
<h2>Commercial Transcription Services</h2>
<p>These provide a service where you upload an audio or video file and get back a file containing the transcription. You&#8217;re paying some amount of money for someone to use an application (like those above) on your behalf, plus some level of quality checking. I&#8217;ll only list here a few services that provide timecoded transcriptions, including subtitling services.</p>
<p>At the very high-end, <a href="http://www.3playmedia.com/plans-pricing/">3play Media</a> are a traditional transcription service provider offering &#8220;Premium quality with +99% accuracy&#8221; for prices ranging from $2 to $3 per minute. They provide an API for upload/download.</p>
<p>At the very low-end, if you&#8217;re willing to handle the management of the work then <a href="https://www.fiverr.com/">Fiverr</a> have a number of <a href="https://www.fiverr.com/search/gigs?utf8=✓&amp;search_in=category&amp;source=top-bar&amp;locale=en&amp;query=transcribe&amp;category=5&amp;sub_category=109&amp;page=1&amp;layout=auto">people offering transcription services for $5</a> (typically for 10 to 20 minutes of transcription). Your mileage will vary.</p>
<p>In the innovative-middle-ground, <a href="https://scribie.com">Scribie</a> guarantee +98% accuracy, offer prices down to $0.70/min for 20-30 day turnaround, and include time-coding. There&#8217;s an additional charge of $1.00/minute for producing subtitles (SBV/SRT). They provide an <a href="https://scribie.com/docs/api">API</a> and have an interesting <a href="https://scribie.com/blog/">blog</a>. They also make their own <a href="https://scribie.com/transcription/editor">transcription editor</a> web application freely available for anyone to use. I like their technology and &#8216;<a href="https://scribie.com/blog/2014/10/how-is-crowdsourcing-better-than-outsourcing/">managed crowdsourcing</a>&#8216; approach.</p>
<h2>Commercial Transcription Services (behind the scenes)</h2>
<p>Speaking of crowdsourcing, while researching this post I came across <a href="http://crowdsurfwork.com">CrowdSurfWork</a>. This site is an interface for freelance transcribers to work on &#8220;micro-tasks&#8221; related to transcription. Their system is built on Amazon.com&#8217;s <a href="https://www.mturk.com">Mechanical Turk</a> service, which provides a marketplace for &#8220;Human Intelligence Tasks&#8221;. <a href="https://www.mturk.com/mturk/sortsearchbar?searchSpec=HITGroupSearch%23T%231%2310%23-1%23T%23%21keyword_list%212%21rO0ABXQACUNyb3dkc3VyZg--%21Reward%216%21rO0ABXQABDAuMDA-%21%23%21NumHITs%211%21%23%21&amp;selectedSearchType=hitgroups&amp;searchWords=Crowdsurf&amp;sortType=NumHITs%3A1&amp;pageSize=100">Typical micro-tasks</a> include transcribing a chunk of audio (&#8220;up to 35 seconds&#8221;), reviewing and scoring a chunk of transcript, quality checking a whole transcript etc. CrowdSurfWork don&#8217;t say who their clients are. They&#8217;re certainly <a href="https://www.mturk.com/mturk/searchbar?selectedSearchType=hitgroups&amp;searchWords=transcribe&amp;minReward=0.00&amp;x=0&amp;y=0">not the only ones</a> using Mechanical Turk for transcription work.</p>
<p>Commercial services provide a complete transcription service: audio in, high quality transcript out. Internally that work is usually broken down into a transcription phase and a quality check/edit phase. I wonder if some companies could offer a service that takes a raw initial transcript (e.g. generated by an automated transcription system) and just perform the quality check/edit phase, at a lower cost.</p>
<p><del>I also wonder if</del> It seems very likely that some companies are already using automated transcription systems, especially for regular clients where the system could be trained for the clients voice.</p>
<h2>Free Automated Transcription</h2>
<p>Automatic speech recognition has come a long way in recent years, with untrained <em>speaker-independent</em> systems achieving useful levels of accuracy.</p>
<p><strong>Google Docs</strong> now supports <a href="https://support.google.com/docs/answer/4492226?hl=en">Voice typing</a> which you can use to transcribe your voice, or other audio being played at the time. It only works in the Chrome browser, or the Docs app on iOS or Android. Here&#8217;s a <a href="https://www.youtube.com/watch?v=iWNCPj5jTWM">demo</a>. (See also <a href="https://speechlogger.appspot.com/en/">Speechlogger</a> which uses the same underlying Google technology and has some handy tips on improving the quality when transcribing audio files by using a &#8220;virtual line-in cable&#8221;. See also <a href="http://rogueamoeba.com/loopback/">Loopback</a> for Mac.)</p>
<p>Another relevant way to access Google&#8217;s speaker-independent speech recognition is to upload a video to <strong>YouTube</strong> and let it provide <a href="https://support.google.com/youtube/answer/2734796?hl=en&amp;ref_topic=3014331">Automatic Captioning</a> for you. More on that below.</p>
<p>On a <strong>Mac</strong> you can <a href="https://support.apple.com/en-ie/HT202584">use your voice to enter text</a> into almost any application. The default mode uses a web service but you can enable <a href="https://support.apple.com/en-ie/HT202584">Enhanced Dictation</a> which installs the recognition code locally so you don&#8217;t need an internet connection and can &#8220;dictate continuously&#8221;.</p>
<p>These don&#8217;t offer any customisation or training to improve the accuracy.</p>
<p>Microsoft <strong>Windows</strong> offers a similar <a href="http://windows.microsoft.com/en-ie/windows/dictate-text-speech-recognition#1TC=windows-7">Speech Recognition service</a>. It supports a customisable speech dictionary and <a href="https://en.wikipedia.org/wiki/Windows_Speech_Recognition#Overview_and_features">accuracy improves with usage</a>. As far as I can tell this is <a href="https://en.wikipedia.org/wiki/Microsoft_Speech_API">built in to the operating system</a> and doesn&#8217;t use a network service.</p>
<p>There are a number of <a href="https://en.wikipedia.org/wiki/Speech_recognition_software_for_Linux">speech recognition projects for <strong>Linux</strong></a>. I have not looked into them in detail. If you have experience with any that would fit this project I&#8217;d be grateful if you would get in touch with me.</p>
<h2>Commercial Speech-to-text Services</h2>
<p>The <strong>Google</strong> <a href="https://cloud.google.com/speech/" target="_blank" rel="noopener">Cloud Speech API</a> offers access to APIs for applications to “see, hear and translate”. It&#8217;s based on the same neural network tech that powers Google’s<a href="http://googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html" target="_blank" rel="noopener"> voice search</a> in the Google app and voice typing in Google’s Keyboard and Chrome described above. It offers some customization in the form of a list of <a href="https://cloud.google.com/speech/reference/rest/v1beta1/RecognitionConfig#SpeechContext">phrases</a> (up to 500, provided with the API request) that act as &#8220;hints&#8221; to the speech recognizer to favor specific words and phrases in the results. The current <a href="https://cloud.google.com/speech/limits">limits</a> cap audio length at 80 minutes and require use of <em>uncompressed</em> audio.</p>
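<p>As a sketch of how those phrase &#8216;hints&#8217; are supplied, here&#8217;s roughly what a request might look like, with field names taken from the v1beta1 documentation linked above (they may well have changed since; the bucket and API key are hypothetical):</p>
<pre>import requests

body = {
    'config': {
        'encoding': 'LINEAR16',
        'sampleRate': 16000,
        # Up to 500 phrases to bias the recogniser towards.
        'speechContext': {'phrases': ['diarisation', 'Speechmatics']},
    },
    'audio': {'uri': 'gs://my-bucket/podcast.wav'},
}
response = requests.post(
    'https://speech.googleapis.com/v1beta1/speech:syncrecognize',
    params={'key': 'YOUR_API_KEY'},
    json=body,
)
print(response.json())</pre>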
<p><strong>Nuance</strong>, who currently provide the <a href="http://www.forbes.com/sites/rogerkay/2014/03/24/behind-apples-siri-lies-nuances-speech-recognition/#b5c3956421c0">technology behind Apple&#8217;s Siri and dictation services</a>, offer a HTTP REST <a href="https://developer.nuance.com/public/Help/HttpInterface/HTTP_web_services_for_NCS_clients_1.0_programmer_s_guide.pdf">Cloud speech recognition</a> service that&#8217;s targeted at mobile devices. (I presume this is the service behind their <a href="https://www.yahoo.com/tech/dragon-dictation-offers-superior-voice-recognition-200101487.html">new and expensive, Dragon Anywhere</a> mobile dictation app.)</p>
<p>The service supports uploading <a href="https://developer.nuance.com/downloads/custom_vocabulary/Guide_to_Custom_Vocabularies_v1.5.pdf">custom phrases and vocabularies</a>. It also allows you to specify an ID for the speaker which is used for Speaker-Dependent Acoustic Model Adaptation (SD-AMA). This &#8220;creates adapted acoustic model profiles from audio collected from each user to improve recognition performance over time.&#8221; Both of these should help improve accuracy beyond what&#8217;s possible with speaker-independent services like those from Google or Apple.</p>
<p>The pricing is <a href="https://developer.nuance.com/public/index.php?task=memberServices">$.008 / transaction</a> where a &#8216;transaction&#8217; is a successful HTTP request, presumably about a sentence (I&#8217;ve seen references to 30 seconds as a maximum). Their terms require &#8216;Emerald Level&#8217; payment when the client isn&#8217;t a mobile device. Some negotiation might be required!</p>
<p><strong>Microsoft</strong> provide a <a href="https://www.microsoft.com/cognitive-services/en-us/speech-api">Bing Speech API</a>. The REST API only supports 10 seconds of audio per request, similar to Nuance &#8216;transactions&#8217; described above. Their Client Library supports streaming.</p>
<p><strong>IBM</strong> offers their Watson Developer Cloud <a href="https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html">Speech to Text</a> service. It has both <a href="https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/speech-to-text/http.shtml">HTTP REST</a> and <a href="https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/speech-to-text/websockets.shtml">WebSocket</a> APIs. The pricing is free for the first thousand minutes per month, then $0.02 per minute. The IBM service doesn&#8217;t support SD-AMA <del>or custom vocabularies</del>. Support for <a href="http://www.ibm.com/watson/developercloud/doc/speech-to-text/custom.shtml" target="_blank" rel="noopener">custom vocabularies</a> was <a href="http://www.ibm.com/watson/developercloud/doc/speech-to-text/relnotes.shtml#September2016" target="_blank" rel="noopener">added</a> in September 2016. (They&#8217;ve said they&#8217;re <a href="http://stackoverflow.com/a/36314354/77193">working on speaker diarization</a>.) The results include timestamps, confidence indicators and alternative suggestions. Here&#8217;s an <a href="https://github.com/dannguyen/watson-word-watcher">example use to translate a ProPublica podcast</a>.</p>
<p><strong>Vocapia</strong> provide a <a href="http://www.vocapia.com/speech-to-text-api.html">Speech to Text API service called VoxSigma</a>. It returns &#8220;XML with speaker diarization, language identification tags, word transcription, punctuation, confidence measures, numeral entities and other specific entities&#8221;.  They also support customization in the form of &#8216;Language Model Adaptation&#8217; by uploading sample text. I&#8217;ve requested technical documentation and pricing details, neither of which are on their web site. They&#8217;ve given me a trial account to test the service.</p>
<p><strong>Speechmatics</strong> provide <a href="https://www.speechmatics.com">speech to text services</a> with a simple <a href="https://www.speechmatics.com/api-details">REST API</a>. The transcript data includes speaker diarization, word transcription, punctuation, and confidence measures. They don&#8217;t offer any customization. <a href="https://speechmatics.com/pricing">Pricing</a> is £0.06/minute (£3.60/hour), with the first hour free. Speechmatics claim to be the <a href="https://www.youtube.com/watch?v=4KdhM_mCVo8">world&#8217;s most accurate</a> transcription service.</p>
<p><strong>Voicebase</strong> provide a <a href="https://voicebase.wpengine.com/transcription/">transcription service</a>. They&#8217;re using a <a href="https://blog.speechmatics.com/2016/09/02/speechmatics-is-considerably-better-concludes-the-sydney-morning-herald-when-comparing-cloud-based-transcription-services/">different version of Speechmatics technology</a> (with <a href="http://www.smh.com.au/digital-life/hometech/m20techknow-20160802-gqje2r.html">slightly lower accuracy</a> it seems). I&#8217;m including them here because they provide interesting keyword extraction features. From a two hour interview I uploaded they extracted 94 keywords (like &#8220;ecological limits&#8221;, &#8220;symbolic language&#8221; etc.) and grouped them under 170 headings (like &#8220;Bioethics&#8221;, &#8220;Ontology&#8221; etc.). Clicking on a keyword or group, or entering search terms manually, shows all the places in the audio timeline where the topic is spoken about. You can then easily listen to just those parts. As you do the relevant portion of the transcript is highlighted. When you sign up they give you $60 (US) free credit. I didn&#8217;t see any rates quoted but it appears to be $0.02/minute. Output formats are PDF, RTF, and SRT.</p>
<p><strong>SpokenData</strong> offers <a href="https://spokendata.com" target="_blank" rel="noopener">automated transcription</a> with an interactive transcription editor, API, and optional human transcription services. It&#8217;s a project of Czech company <a href="http://www.replaywell.com" target="_blank" rel="noopener">ReplayWell</a>. <a href="https://spokendata.com/pricing" target="_blank" rel="noopener">Pricing</a> is €0.10/minute down to under €0.05/minute for bulk. The first hour is free. Other services, including speaker segmentation (diarization), are currently free. Transcript formats include SRT, TXT, TRS, XML.</p>
<p><strong>Deepgram</strong> also provide an <a href="https://www.deepgram.com" target="_blank" rel="noopener">automated transcription</a> service. Pricing is under $0.02/minute. They have a basic transcription viewer and a minimal dashboard. To download a transcript you have to make an API call with a &#8220;get_object_transcript&#8221; action that&#8217;s not currently documented in their rather minimal <a href="https://api.deepgram.com/doc" target="_blank" rel="noopener">API documentation</a>. The transcript format is JSON with per-paragraph timings.</p>
<p><strong>Trint</strong> don&#8217;t yet have an API for their <a href="https://trint.com" target="_blank" rel="noopener">automated transcription</a> service, but they do have a nice interactive editor with pitch-corrected speed control. Pricing is $0.25-$0.20/minute. Trint &#8220;automatically <em>identifies different speakers</em> and segments them into separate paragraphs&#8221; (emphasis mine). That doesn&#8217;t seem quite right. The transcript is segmented into paragraphs but there&#8217;s no identification of speakers that I can find. The editor lets you label the speaker for each paragraph, but you still need to do that <em>manually for every single paragraph</em>. Transcript formats are DOCX, SRT, VTT or &#8220;Interactive Transcript&#8221; which is a zip containing HTML and JavaScript. So there&#8217;s no pure-data transcript format available. (The &#8220;Interactive Transcript&#8221; zip contains the transcript in the form of HTML with a span with attributes for each word.) <a href="http://www.johntedesco.net/blog/2017/01/21/how-to-transcribe-with-trint-an-interview-with-ceo-and-chief-beta-tester-jeff-kofman/" target="_blank" rel="noopener">Review</a>.</p>
<p><strong>Pop Up Archive</strong> offers a <a href="https://www.popuparchive.com" target="_blank" rel="noopener">service</a> that seems an ideal fit for these requirements. You upload a file and they tag, index &amp; transcribe it automatically, including timestamps and speaker <a href="https://popuparchiveorg.zendesk.com/hc/en-us/articles/204030770-How-do-I-assign-speakers-" target="_blank" rel="noopener">diarization</a>. They provide an interactive transcript editor synced to the audio, team plans allow concurrent editing by multiple people. Download transcripts in .TXT, .XML, .JSON, .WEBVTT, and .SRT formats, and there&#8217;s an <a href="https://www.popuparchive.com/developer" target="_blank" rel="noopener">API</a>. (Looking at the output it looks like they&#8217;re using Speechmatics as the backend transcription service.) They provide a <a href="https://www.popuparchive.com/explore" target="_blank" rel="noopener">search and browse interface</a> for the thousands of podcast transcripts they&#8217;re hosting, plus a HTML code generator for embedding players on your own website. <a href="https://www.popuparchive.com/pricing" target="_blank" rel="noopener">Pricing</a> ranges from $0.25/min down to $0.20/min on monthly plans. One hour free credit.</p>
<p>Pop Up Archive have an interesting project called <strong><a href="http://audiosear.ch/" target="_blank" rel="noopener">Audiosear.ch</a> </strong>which is billed as &#8220;a full–text search and intelligence engine for podcasts and radio&#8221;. It includes a <a href="http://blog.popuparchive.com/take-the-audiosear-ch-clipmaker-for-a-spin/" target="_blank" rel="noopener">ClipMaker</a> feature that makes it easy for anyone to search for and select a favorite podcast moment and share it on social media as a short auto-playing video of the audio and transcript. <a href="http://blog.popuparchive.com/take-the-audiosear-ch-clipmaker-for-a-spin/" target="_blank" rel="noopener">Take a look</a> and <a href="https://www.audiosear.ch/a/52763/director-jim-jarmusch--sundance-recap" target="_blank" rel="noopener">try it out</a>.</p>
<p><strong><a href="http://spreza.co" target="_blank" rel="noopener">Spreza</a></strong> and <strong><a href="https://Voyz.es" target="_blank" rel="noopener">Voyz.es</a></strong> are two other service providers in this space. They&#8217;re both currently in private beta. I&#8217;ve applied for access.</p>
<p>In November 2017, almost two years after I originally wrote this post, Amazon launched their <a href="https://aws.amazon.com/blogs/aws/amazon-transcribe-scalable-and-accurate-automatic-speech-recognition/">Amazon Transcribe</a> service, which adds inferred punctuation, word-level timestamps, and recognises multiple speakers.</p>
<p>See also Pop Up Podcasting&#8217;s <a href="https://popuppodcasting.ca/blog/automatic-transcription-services-compared">review of automated transcription tools</a>.</p>
<h2>Commercial Speech-to-text Applications</h2>
<p>These are applications which you install and run on your own machine. Modern machines and software are fast enough for high quality results in realtime. A key feature is the ability to <em>train</em> the software to improve the recognition <em>of a particular voice</em>. This, combined with custom vocabularies, greatly improves the accuracy.</p>
<p>Ignoring companies offering niche products (like <a href="http://www.vestec.com/products/">vestec</a>, <a href="http://www.speechatsri.com/products/dynaspeak.shtml">SRI</a>, and <a href="http://www.verbio.com/product/speech-recognition/">verbio</a>) which don&#8217;t provide documentation or prices online, there&#8217;s only one major player <a href="http://www.em-t.com/articles/nuance-acquire-part-ibms-speech-technology">left</a> in this field: <strong>Nuance</strong>, with their Dragon line of products for <a href="http://www.nuance.com/for-individuals/by-product/dragon-for-pc/premium-version/index.htm">PC</a> and <a href="http://www.nuance.com/for-individuals/by-product/dragon-for-mac/index.htm">Mac</a>.</p>
<p>Dragon can learn your vocabulary and likely phrases by <a href="http://www.nuance.com/products/help/dragon/dragon-for-mac/enx/Content/Accuracy/VocabularyTraining.html?Highlight=vocabulary">reading documents</a> or emails you&#8217;ve written. For transcribing podcasts it could be given some existing transcriptions, if you have any. It will also <a href="http://www.nuance.com/products/help/dragon/dragon-for-mac/enx/content/Correction/CorrectionMenu.html">learn from the corrections you make while dictating</a>. All this training is tied to a single voice profile so Dragon will only work well with a single voice at a time.</p>
<p>It&#8217;s also important to note that Dragon, unlike services such as Trint and Speechmatics, will only give you a bare stream of words. There&#8217;s no segmentation into sentences and paragraphs. You&#8217;ll have to do that by hand, along with capitalizing the first word. So even if Dragon was very accurate you&#8217;ll always be left with a lot of work to do.</p>
<h1>Anecdotal Accuracy</h1>
<p>This <a href="https://news.ycombinator.com/item?id=11347872">thread on Ycombinator</a> from March 2016 includes a variety of opinions, including &#8220;As someone who&#8217;s worked with a lot of these engines, Nuance and IBM are the only really high quality players in the space&#8221;; &#8220;If Nuance is 100%, I&#8217;d say CMUSphinx is at least 40%&#8221;; &#8220;As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I&#8217;ve tested.&#8221;</p>
<p>Spoiler: in my testing so far Trint.com and Speechmatics.com are <em>much</em> better than Watson and (untrained) Nuance. I&#8217;ll post detailed results when I&#8217;ve finished testing.</p>
<h1>State of the Art</h1>
<p>The state of the art in speech recognition is advancing very rapidly at the moment as <a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank" rel="noopener">Deep Learning</a> and other modern machine learning techniques are applied ever more successfully. One of the most difficult of all human speech recognition tasks is conversational telephone speech, very similar to the conversational podcast speech we&#8217;re exploring here. Recent research, published in October 2016, has shown that it is now possible to <a href="https://blog.acolyer.org/2016/11/22/achieving-human-parity-in-conversational-speech-recognition/" target="_blank" rel="noopener">achieve human parity in conversational speech recognition</a>. That&#8217;s a significant research milestone that should be reflected in commercial systems in the future.</p>
<h1>Verbatim vs Clean Transcription</h1>
<p>Informal speech is often littered with stutters, filler words (&#8216;ah&#8217;, &#8216;um&#8217;, &#8216;like&#8217; etc.), and other forms of <a href="https://en.wikipedia.org/wiki/Speech_disfluency">speech disfluency</a>. Conversational speech also often contains confirmational affirmations such as &#8220;Uh-huh&#8221; and &#8220;I see&#8221;.</p>
<p>Commercial transcription services will, by default, provide you with a &#8216;clean&#8217; transcript that doesn&#8217;t include every utterance in the audio. The disfluencies and confirmational affirmations are skipped. A &#8216;verbatim&#8217; transcription service is often available at a higher cost, to account for the extra work needed to capture those details.</p>
<p>Depending on the amount of disfluency, a clean transcript can be significantly easier to read than a verbatim transcript.</p>
<p>An automated transcription system will naturally produce a verbatim transcript. Cleaning up a verbatim transcript automatically is an <a href="http://research.microsoft.com/pubs/218310/IS14-hany.PDF">active</a> <a href="http://www.aclweb.org/anthology/E14-4009">area</a> <a href="http://www.cs.cmu.edu/~tanja/Papers/HonalSchultz_ICASSP05.pdf">of</a> <a href="https://www.sri.com/sites/default/files/publications/automatic_disfluency_removal_for_improving_spoken_language.pdf">research</a>. For our purposes in the short term I imagine some typical cases could be recognised and edited out automatically. The rest would have to be dealt with as part of the crowdsourced manual QA process.</p>
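<p>In the spirit of &#8220;typical cases recognised and edited out automatically&#8221;, here&#8217;s a deliberately naive sketch; anything beyond simple fillers quickly needs the research linked above:</p>
<pre>import re

# Common fillers; a crude list for illustration only.
FILLERS = r"\b(um+|uh+|ah+|er+|you know|i mean|sort of|kind of)[,.]?\s*"

def clean(verbatim):
    text = re.sub(FILLERS, '', verbatim, flags=re.IGNORECASE)
    return re.sub(r'\s{2,}', ' ', text).strip()

print(clean('So, um, I think, you know, it works.'))
# -&gt; 'So, I think, it works.'</pre>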
<h1>Podcast Transcription</h1>
<p>So can a viable automated podcast transcription solution be built from these options?</p>
<p>Dragon applications offer the highest accuracy but only work well with a single voice, don&#8217;t provide automatic timecodes, and are hard to automate.</p>
<p>Free automated transcription services offer no training or customisation and don&#8217;t provide timecodes directly.</p>
<p>The Watson Developer Cloud Speech to Text service offers timecodes but no training or customisation. It might be workable but is likely to be relatively poor quality, especially without <a href="https://en.wikipedia.org/wiki/Speaker_diarisation">diarisation</a>.</p>
<p>The Nuance Cloud Speech Recognition service would require me to pre-process the audio into small chunks, presumably based on pauses. That would mean I&#8217;d effectively generate timecodes myself but at the cost of significant extra audio processing upfront. Quality is bound to suffer, especially in segments where pauses aren&#8217;t clear.</p>
<p>Considering pre-processing the audio opens up extra possibilities. In addition to identifying pauses, I could also implement <a href="https://en.wikipedia.org/wiki/Speaker_diarisation">diarisation</a> (e.g. using one of the <a href="https://en.wikipedia.org/wiki/Speaker_diarisation#Open_source_speaker_diarisation_software">open source tools</a>). That would not only improve the chunking, where one speaker starts talking over another, but also open up interesting solutions for the single speaker problem&#8230;</p>
<p>Given the details of who is speaking when, <em>a separate audio file for each speaker</em> could be generated, with the voice of the other speaker replaced by silence. (An audio editor that supports a <a href="https://en.wikipedia.org/wiki/Cue_sheet_(computing)">cue sheet</a> would make that simple.) Each per-speaker file could then be fed to Dragon <em>with the appropriate profile for that speaker</em>. After a short period of training the rest of the transcription for that speaker could proceed automatically and with higher accuracy.</p>
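<p>Here&#8217;s a sketch of that slicing step, assuming the Python pydub package and a list of (start_ms, end_ms, speaker) turns from the diarisation step:</p>
<pre>from pydub import AudioSegment

def per_speaker_file(audio_path, turns, speaker, out_path):
    # Keep this speaker's turns; replace everyone else's with silence,
    # so (assuming the turns are contiguous) the file's timing is preserved.
    audio = AudioSegment.from_file(audio_path)
    result = AudioSegment.empty()
    for start_ms, end_ms, who in turns:
        if who == speaker:
            result += audio[start_ms:end_ms]
        else:
            result += AudioSegment.silent(duration=end_ms - start_ms)
    result.export(out_path, format='wav')</pre>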
<p>That would solve the single speaker problem but there&#8217;s still a lack of timecodes in the transcript. A few approaches spring to mind but the most interesting is to insert the timecodes into the audio stream as spoken words, e.g. &#8220;zero seven colon one five space&#8221;, perhaps using a text-to-speech tool. Then there would be no need to keep the periods of silence for the other speaker. The audio file would dictate its own timecodes!</p>
<p>The transcripts generated for each speaker could then be merged using the timecodes in the text to interleave them in the correct order. (Though in practice they&#8217;d probably simply be written into a database with the timecode as a key.)</p>
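<p>The merge itself is then trivial &#8211; a sketch, with each per-speaker transcript as a list of (timecode_ms, text) tuples:</p>
<pre>def merge_transcripts(**per_speaker):
    # Flatten to (timecode, speaker, text) and sort by timecode.
    entries = []
    for speaker, lines in per_speaker.items():
        for timecode_ms, text in lines:
            entries.append((timecode_ms, speaker, text))
    return sorted(entries)

merged = merge_transcripts(
    alice=[(0, 'Hello.'), (9500, 'Fine, thanks.')],
    bob=[(2300, 'Hi, how are you?')],
)</pre>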
<p>Slicing the audio per-speaker would also enable a neat solution to the problem of poor quality recordings of interviews where the remote person has a poor internet connection. If they made a <em>separate local recording</em> of their voice then that audio file could be sliced up and used for the transcription of the parts of the interview where they were speaking. Neat!</p>
<h1>Video Subtitles/Captions</h1>
<p>When you listen to someone you absorb more than when just reading their words. Transcriptions help you search and discover sections of interest, but then it should be easy to <em>listen</em> to the words.</p>
<p>This is why having timecodes is important. Having searched transcripts to discover sections of interest you could click a button to listen to <em>just those parts</em>. For video podcasts you might choose to <em>watch</em>, giving you the added dimension of all the <a href="https://en.wikipedia.org/wiki/Nonverbal_communication">non-verbal communication</a>.</p>
<p>Where do <a href="https://en.wikipedia.org/wiki/Subtitle_(captioning)">subtitles</a>, and their more feature-full modern cousin, <a href="https://en.wikipedia.org/wiki/Closed_captioning">captions</a>, fit in? For the deaf, the hard of hearing, and non-native speakers, they offer the opportunity to read the words in sync with the added richness of the non-verbal communication.</p>
<p>In theory subtitles/captions could be generated directly from a transcript if it has sufficiently frequent and accurate timecodes. Speaker diarisation would also help. That should be enough to generate at least a good quality draft. Which raises the &#8220;quality gap&#8221; question again: could automatically generated subtitles/captions be made &#8220;good enough&#8221; that the <em>effort of manual correction is significantly less than the effort of manual creation</em>? I think so.</p>
<p>Note that there will almost always be a need for some manual editing. For example, carefully condensing the number of words to fit within typical reading speeds, or adding captions for sounds, like &#8220;[dog barking]&#8221;.</p>
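<p>Generating the subtitle file itself is the easy part. Here&#8217;s a minimal sketch that emits SRT (see the appendix below for the format) from a list of timed cues; the hard parts &#8211; good cue boundaries, timing, and condensing &#8211; are exactly what&#8217;s discussed above and below:</p>
<pre>def to_timestamp(ms):
    # SRT timestamps look like 00:01:02,345
    hours, ms = divmod(ms, 3600000)
    minutes, ms = divmod(ms, 60000)
    seconds, ms = divmod(ms, 1000)
    return '%02d:%02d:%02d,%03d' % (hours, minutes, seconds, ms)

def to_srt(cues):
    # cues: a list of (start_ms, end_ms, text) tuples
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        blocks.append('%d\n%s --&gt; %s\n%s\n'
                      % (i, to_timestamp(start), to_timestamp(end), text))
    return '\n'.join(blocks)</pre>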
<p>Syncing the timing of the appearance (and disappearance) of each subtitle is a painstaking process that <a href="https://www.dcmp.org/ciy/converting-youtube-to-srt.html">consumes the most time of any portion of the captioning process</a>. Here&#8217;s an <a href="https://www.youtube.com/watch?v=7j4ytsunao8&amp;index=2&amp;list=PLjdLzz0k39ykXZJ91DcSd5IIXrm4YuGgE">example video of the manual syncing process</a>.</p>
<p>One way to avoid the effort is to let YouTube perform <a href="https://support.google.com/youtube/answer/2734796?hl=en&amp;ref_topic=3014331">Set Timings</a> on a plain-text transcript for you (<a href="https://www.youtube.com/watch?v=w4BRY56u2xw#t=41m29s">Video of announcement and demo</a> in 2009.) It&#8217;s &#8220;not recommended for videos that are over an hour long or have poor audio quality&#8221;. If that does work well then it would remove the need to generate timecodes myself.</p>
<p>I presume that having a &#8216;verbatim&#8217; transcript, rather than a &#8216;clean&#8217; one, would help the YouTube Set Timings processing to be more reliable.</p>
<h2>Applications and Services</h2>
<p>Wikipedia has a <a href="https://en.wikipedia.org/wiki/Comparison_of_subtitle_editors">comparison of subtitle editors</a> that provides an incomplete list of free and commercial editors for various platforms. There&#8217;s also a list in the &#8220;Use captioning software &amp; services&#8221; section of the <a href="https://support.google.com/youtube/answer/2734796?hl=en&amp;ref_topic=3014331">Add subtitles &amp; closed captions</a> YouTube help page.</p>
<p>I&#8217;ll just highlight a few interesting ones here:</p>
<p><strong>Voxcribe</strong> offer a commercial Windows application called <a href="https://voxcribe.com/Video%20Speech%20Recognition%20Captioning%20Subtitling%20Software%20VoxcribeCC.html">VoxcribeCC</a> that uses speaker-independent speech recognition technology to automatically caption a video. The first 60 minutes is free, then you pay-as-you-go for $7-$10 per hour. Output formats are Subrip (srt) and Timed Text (xml). It doesn&#8217;t support training or custom vocabularies.</p>
<p><strong>Amara</strong> deserves a special mention: <a href="https://www.amara.org/en/">Amara</a> is an open-source and non-profit collaboration community for captioning and subtitling video. A ‘Wikipedia for Subtitles’, Amara enables volunteers to make videos accessible for people who are deaf and hard of hearing and anyone who doesn’t speak the language of the original video. Amara has more than 100,000 subtitling volunteers and organizations like TED, Khan Academy, and PBS use it to make video accessible.</p>
<p>Amara is a project of the <a href="http://pculture.org">Participatory Culture Foundation</a>. (YouTube also supports <a href="https://support.google.com/youtube/answer/6052538?hl=en&amp;ref_topic=3014331">community-contributed</a> subtitles directly, along with <a href="https://support.google.com/youtube/answer/2780526?hl=en&amp;ref_topic=3014331">paid translations</a>.) The <a href="https://github.com/pculture/unisubs">open-source code</a> is being <a href="https://github.com/pculture/unisubs/pulse/monthly">actively developed</a> and includes a rich <a href="http://universal-subtitles.readthedocs.org/en/editor-review-with-actions/api.html">API</a>.</p>
<h2>Workflow</h2>
<p>Here&#8217;s an outline for one (of many) possible workflows; a small code sketch of the subtitle-generation step follows the list:</p>
<ul>
<li>Generate a verbatim transcript from the audio.</li>
<li>Generate and upload a transcript file <a href="https://support.google.com/youtube/answer/2734799?hl=en&amp;ref_topic=3014331">formatted for YouTube</a>.</li>
<li>Request YouTube to perform a Set Timings operation.</li>
<li>Download the subtitles and timecode data.</li>
<li>Clean up the verbatim transcript.</li>
<li>Combine with the speaker diarisation data, if available.</li>
<li>Generate the interactive transcript pages.</li>
<li>Condense subtitle wording to fit reading speed, if needed.</li>
<li>Upload condensed and diarised subtitles back to YouTube.</li>
</ul>
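<p>To make the subtitle-generation step concrete, here&#8217;s a minimal sketch that turns a (hypothetical) list of timecoded caption lines into a SubRip file. Real ASR output would first need cleaning and condensing, as noted above:</p>
<pre><code>use strict;
use warnings;

# Hypothetical input: caption lines with start/end times in seconds.
my @captions = (
    { start =&gt; 0.0, end =&gt; 2.4, text =&gt; 'Welcome back to the show.' },
    { start =&gt; 2.4, end =&gt; 5.1, text =&gt; '[dog barking]' },
);

# Format seconds as an SRT timestamp: HH:MM:SS,mmm
sub srt_time {
    my ($secs) = @_;
    my $ms = int(($secs - int($secs)) * 1000 + 0.5);
    return sprintf '%02d:%02d:%02d,%03d',
        int($secs / 3600), int($secs / 60) % 60, int($secs) % 60, $ms;
}

my $n = 1;
for my $c (@captions) {
    printf "%d\n%s --&gt; %s\n%s\n\n",
        $n++, srt_time($c-&gt;{start}), srt_time($c-&gt;{end}), $c-&gt;{text};
}
</code></pre>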
<h1>What Next?</h1>
<h2>Proof of Concept Testing</h2>
<p>I have mentioned lots of services in this post. Next I&#8217;m planning to do some very basic testing of the ones that seem likely to be useful. I&#8217;ll use some podcast audio for which I also have manual transcriptions. I want to get some experience with the various tools from the low-end (speaker independent) through to the high-end (Dragon with vocabulary and voice training). That will give me some sense of how big the &#8220;quality gap&#8221; really is. I&#8217;ll post some results when I have them.</p>
<p>I&#8217;ve written a follow-up post about how I&#8217;m <a href="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/">Comparing Transcriptions</a> &#8211; which turned out to be more tricky, and interesting, than I&#8217;d expected.</p>
<h2>A Project?</h2>
<p>Naturally I&#8217;m glossing over <em>lots</em> of details here, and I know there&#8217;s lots I don&#8217;t know. At this stage I&#8217;m very much in exploratory mode, discovering possibilities to see what might be viable. I&#8217;m encouraged by what I&#8217;ve found so far and can see interesting paths worth exploring.</p>
<p>I have no particular experience with audio processing or bulk transcription, but I am interested in helping more podcasts to have rich searchable transcripts available.</p>
<p>Are you? Great! Please get in touch.</p>
<hr />
<h2>Appendix of Random Notes</h2>
<p>Some of the most common Subtitle and Caption File Formats are listed below, with sample cues after the list:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/SubRip">SRT</a> &#8211; &#8220;SubRip Text&#8221; &#8211; a standard subtitle format supported by most video players</li>
<li><a href="https://en.wikipedia.org/wiki/SubStation_Alpha">SSA</a> &#8211; &#8220;SubStation Alpha&#8221; format that allows more advanced subtitles than the conventional SRT format</li>
<li><a href="https://en.wikipedia.org/wiki/Timed_Text_Markup_Language">TTML</a> &#8211; &#8220;Timed Text Markup Language&#8221;, an XML format that is one of W3C&#8217;s standards regulating timed text</li>
<li><a href="https://en.wikipedia.org/wiki/Timed_Text_Markup_Language">DFXP</a> &#8211; &#8220;Distribution Format Exchange Profile&#8221;, the old name for TTML</li>
<li><a href="https://en.wikipedia.org/wiki/SubViewer">SBV</a> &#8211; &#8220;SubViewer&#8221; plain text format, similar to SRT. Also known as .SUB</li>
<li><a href="https://en.wikipedia.org/wiki/WebVTT">VTT</a> &#8211; &#8220;Web Video Text Tracks&#8221;, very similar to SubRip, supported by most browsers</li>
</ul>
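<p>For a sense of how similar SRT and WebVTT are, here&#8217;s the same (hypothetical) cue in both. SRT uses a numeric counter and a comma before the milliseconds:</p>
<pre><code>1
00:00:00,000 --&gt; 00:00:02,400
Welcome back to the show.
</code></pre>
<p>while WebVTT adds a <code>WEBVTT</code> header and uses a dot:</p>
<pre><code>WEBVTT

00:00:00.000 --&gt; 00:00:02.400
Welcome back to the show.
</code></pre>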
<p>YouTube supports <a href="https://support.google.com/youtube/answer/2734698?hl=en">many subtitle and caption file formats</a>.</p>
<p>Updates:</p>
<ul>
<li>2016-04-23: Added <a href="https://www.speechmatics.com">Speechmatics</a>, <a href="http://nowtranscribe.com">NowTranscribe</a>, Google <a href="https://cloud.google.com/speech/">Cloud Speech API</a>, Microsoft <a href="https://www.microsoft.com/cognitive-services/en-us/speech-api">Bing Speech API</a>, and the Anecdotal Accuracy section.</li>
<li>2016-11-08: Updated IBM Watson entry to note that support for <a href="http://www.ibm.com/watson/developercloud/doc/speech-to-text/custom.shtml" target="_blank" rel="noopener">custom vocabularies</a> was <a href="http://www.ibm.com/watson/developercloud/doc/speech-to-text/relnotes.shtml#September2016" target="_blank" rel="noopener">added</a> in September 2016.</li>
<li>2016-11-22: Added &#8220;State of the Art&#8221; section with a link to the recent <a href="https://blog.acolyer.org/2016/11/22/achieving-human-parity-in-conversational-speech-recognition/" target="_blank" rel="noopener">Achieving human parity in conversational speech recognition</a> paper.</li>
<li>2016-11-27: Added <a href="https://voicebase.wpengine.com/transcription/">voicebase</a> with details of the keyword extraction UI.</li>
<li>2016-12-03: Updated Google Speech API details. Added a link to <a href="https://www.youtube.com/watch?v=4KdhM_mCVo8">a talk</a> where Speechmatics claim to be the world&#8217;s most accurate. Some other minor edits.</li>
<li>2016-12-30: Added  details for SpokenData, Deepgram, Trint, Spreza and Voyz.es.</li>
<li>2017-02-01: Added Pop Up Archive and AudioSear.ch. Plus a note on Dragon pointing out that there&#8217;s no segmentation into sentences.</li>
<li>2017-02-09: Added link to the <a href="https://blog.timbunce.org/2017/02/09/comparing-transcriptions/">Comparing Transcriptions</a> follow-up post.</li>
<li>2018-04-03: Added link to <a href="http://maui-indexer.blogspot.ie/2009/05/what-is-maui-about.html">Maui</a> topic-extraction <a href="http://www.medelyan.com/software" target="_blank" rel="noopener">software</a>, thanks to Rob Wilkinson.</li>
<li>2018-04-10: Added <a href="https://aws.amazon.com/blogs/aws/amazon-transcribe-scalable-and-accurate-automatic-speech-recognition/">Amazon Transcribe</a> and a link to <a href="https://popuppodcasting.ca/blog/automatic-transcription-services-compared">another review of tools</a>.</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/feed/</wfw:commentRss>
			<slash:comments>39</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">609</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2016/02/transcription-data-flow-001.png" medium="image">
			<media:title type="html">Transcription Data Flow Schematic</media:title>
		</media:content>
	</item>
		<item>
		<title>Introducing Data::Tumbler and Test::WriteVariants</title>
		<link>https://blog.timbunce.org/2014/03/23/introducing-datatumbler-and-testwritevariants/</link>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Sun, 23 Mar 2014 14:20:11 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[testing]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=600</guid>

					<description><![CDATA[For some time now Jens Rehsack (‎Sno‎), H.Merijn Brand (‎Tux‎) and I have been working on bootstrapping a large project to provide a common test suite for the DBI that can be reused by drivers to test their conformance to the DBI specification. This post isn&#8217;t about that. This post is about two spin-off modules &#8230; <a href="https://blog.timbunce.org/2014/03/23/introducing-datatumbler-and-testwritevariants/" class="more-link">Continue reading <span class="screen-reader-text">Introducing Data::Tumbler and Test::WriteVariants</span></a>]]></description>
										<content:encoded><![CDATA[<p>For some time now <a href="http://act.qa-hackathon.org/qa2014/user/8669">Jens Rehsack (‎Sno‎)</a>, <a href="http://act.qa-hackathon.org/qa2014/user/268">H.Merijn Brand (‎Tux‎)</a> and I have been working on bootstrapping a large project to provide a common test suite for the DBI that can be reused by drivers to test their conformance to the DBI specification.</p>
<p>This post isn&#8217;t about that. This post is about two spin-off modules that might seem unrelated: <a href="https://metacpan.org/pod/Data::Tumbler">Data::Tumbler</a> and <a href="https://metacpan.org/pod/Test::WriteVariants">Test::WriteVariants</a>, and the Perl QA Hackathon that saw them released.</p>
<p><span id="more-600"></span></p>
<hr />
<p>This was my first year attending a <a href="http://act.qa-hackathon.org/qa2014/">Perl QA Hackathon</a>, an annual event where <a href="http://www.flickr.com/photos/wendyga/13194549304/in/set-72157642437424235">key developers</a> get together to discuss and develop the code, services, and standards at the core of the Perl ecosystem.</p>
<p>See the <a href="http://act.qa-hackathon.org/qa2014/wiki?node=Results">Results</a> and <a href="http://act.qa-hackathon.org/qa2014/wiki?node=Blogs">Blogs</a> pages to get a sense of the important work that <em>gets done</em> at these events and in the weeks that follow. What&#8217;s less visible but just as important are the personal connections made and renewed there.</p>
<p>These events take a lot of work to put together. Special thanks are due to <a href="http://act.qa-hackathon.org/qa2014/user/1">Philippe Bruhat (BooK)</a> and <a href="https://metacpan.org/author/ELBEHO">Laurent Boivin (elbeho)</a> for organising it so well; to Wendy for looking after our nourishment and caffeination so joyfully; to <a href="http://booking.com">Booking.com</a> for the venue; and to all the other sponsors for helping to make this QA Hackathon the great success it was. In no particular order, <a href="http://www.splio.com/">SPLIO</a>, <a href="http://www.grantstreet.com/">Grant Street Group</a>, <a href="http://www.dyn.com/">DYN</a>, <a href="http://www.campusexplorer.com/">Campus Explorer</a>, <a href="http://www.evozon.com/">EVOZON</a>, <a href="http://www.elasticsearch.com/">elasticsearch</a>, <a href="http://www.eligo.co.uk/">Eligo</a>, <a href="http://www.mongueurs.pm/">Mongueurs de Perl</a>, WenZPerl for <a href="http://perl6.org/">the Perl6 Community</a>, <a href="http://www.procura.nl/">PROCURA</a>, <a href="http://madeinlove.co.uk/">Made In Love</a> and <a href="http://www.perlfoundation.org/">The Perl Foundation</a>. Thank you one and all.</p>
<p>My focus at the hackathon was on pushing the DBI Test project forward with Sno and Tux. Getting Data::Tumbler and Test::WriteVariants polished up and released was a key part of that. We also had valuable discussions with BooK about useful enhancements to <a href="https://metacpan.org/pod/Test::Database">Test::Database</a>.</p>
<hr />
<p>So, what are Data::Tumbler and Test::WriteVariants? To explain that I&#8217;ll start 10 years ago&#8230;</p>
<p>The DBI distribution includes <a href="https://metacpan.org/source/TIMB/DBI-1.631/lib/DBI/PurePerl.pm#L1099">DBI::PurePerl</a>, a fairly-complete implementation of DBI in pure-perl, and <a href="https://metacpan.org/pod/DBD::Gofer">DBD::Gofer</a>, a fairly-transparent proxy.</p>
<p>Both these modules need testing, and both should behave very much like using the normal DBI. The best way to test that was to re-run the DBI tests while using DBI::PurePerl, re-run them again using DBD::Gofer, and re-run them again using DBI::PurePerl and DBD::Gofer at the same time. So, since 2004, that&#8217;s what the DBI does.</p>
<p>When you run Makefile.PL in the DBI distribution it looks at the 44 test files and generates 141 new test files with various combinations of contexts. These generated test files look something like this:</p>
<blockquote>
<pre>#!perl -w
$ENV{DBI_AUTOPROXY} = 'dbi:Gofer:transport=null;policy=pedantic';
END { delete $ENV{DBI_AUTOPROXY}; }; # for VMS
$ENV{DBI_PUREPERL} = 2;
END { delete $ENV{DBI_PUREPERL}; }; # for VMS
require './t/06attrs.t';</pre>
</blockquote>
<p>They set up a &#8216;context&#8217; and then execute the original test. In this case the context is DBD::Gofer + DBI::PurePerl.</p>
<p>This arrangement has proved to be extremely effective. I&#8217;ve frequently made a change to the DBI and forgotten to make corresponding changes to DBD::Gofer and/or DBI::PurePerl, only to be forcefully reminded when tests that worked for plain DBI failed noisily in the extra test contexts.</p>
<p>It was clear that something like this was needed for the DBI Test project. We wanted to generate test variants not only for DBI::PurePerl and DBD::Gofer but also each available database driver. Each driver might also want to add test variants of their own. (DBD::DBM, for example, supports a number of <a href="https://metacpan.org/pod/DBD::DBM#dbm_type">DBM backends</a> and <a href="https://metacpan.org/pod/DBD::DBM#dbm_mldbm">serialization formats</a> that all need testing in combination).</p>
<p>After lots of experimentation and refactoring the relevant logic was extracted out into the Data::Tumbler and Test::WriteVariants modules, generalised, polished up and released during the hackathon.</p>
<hr />
<p>For some reason I struggle when trying to explain what <a href="https://metacpan.org/pod/Data::Tumbler">Data::Tumbler</a> is or does. The summary in the documentation says &#8220;Dynamic generation of nested combinations of variants&#8221;, which is a bit of a mouthful.</p>
<p>It&#8217;s basically a <a href="https://metacpan.org/source/TIMB/Data-Tumbler-0.003/lib/Data/Tumbler.pm#L141">single simple subroutine</a> that recurses into itself driven by the results of calling <em>provider</em> callbacks. As it recurses it builds up a <em>path</em> and a <em>context</em> from the keys and values returned by the providers.</p>
<p>The provider callbacks are passed the current path and context plus a cloned copy of a <em>payload</em> which they can edit. Because it&#8217;s cloned, any changes made to the payload will only be visible to any later providers and the <em>consumer</em>.</p>
<p>The recursion bottoms-out when there are no more providers. At this point a <em>consumer</em> callback is called with the current path, context, and payload.</p>
<p>That&#8217;s an abstract description, which is fitting as it&#8217;s an abstract algorithm. I hope it&#8217;s reasonably clear. There are a couple of examples in the <a href="https://metacpan.org/pod/Data::Tumbler#SYNOPSIS">documentation synopsis</a>. Currently Test::WriteVariants, described next, is the only use-case. I&#8217;d love to find some more, if only to help improve the documentation. Let me know if you can think of any!</p>
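<p>In the meantime, here&#8217;s a minimal hand-rolled sketch of the algorithm as described above &#8211; <em>not</em> the module&#8217;s actual API, just the shape of the recursion, with made-up provider names and values:</p>
<blockquote>
<pre>use strict;
use warnings;
use Storable qw(dclone);

# Recurse through the providers, extending the path and context
# from each provider's key/value pairs. When the providers run out,
# hand the accumulated path, context, and payload to the consumer.
sub tumble {
    my ($providers, $path, $context, $payload, $consumer) = @_;

    return $consumer-&gt;($path, $context, $payload)
        unless @$providers;

    my ($provider, @rest) = @$providers;
    my %variants = $provider-&gt;($path, $context, $payload);

    for my $key (sort keys %variants) {
        tumble(
            \@rest,
            [ @$path, $key ],               # extend the path
            [ @$context, $variants{$key} ], # extend the context
            dclone($payload),               # clone: edits stay local
            $consumer,
        );
    }
}

# Two providers yield a 2x2 set of leaf calls, as in the DBI example.
tumble(
    [
        sub { (pureperl =&gt; { DBI_PUREPERL =&gt; 2 }, xs =&gt; {}) },
        sub { (gofer =&gt; { DBI_AUTOPROXY =&gt; 'dbi:Gofer:transport=null' }, plain =&gt; {}) },
    ],
    [], [], { tests =&gt; [ 't/06attrs.t' ] },
    sub {
        my ($path, $context, $payload) = @_;
        print join('/', @$path), ": @{ $payload-&gt;{tests} }\n";
    },
);</pre>
</blockquote>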
<hr />
<p><a href="https://metacpan.org/pod/Test::WriteVariants">Test::WriteVariants</a> directly addresses the use-case of writing a tree of perl <code>.../*.t</code> test files, each setting up various combinations of context values before invoking the test code.</p>
<p>Hopefully you can see where Data::Tumbler fits in: the <em>payload</em> is a hash of tests for which you&#8217;d like extra variant tests written; the <em>providers</em> define variants of the contexts in which you&#8217;d like the tests executed, typically by setting environment variables. The <em>consumer</em> writes a new <code>*.t</code> file for each element in the payload hash, using the <em>path</em> to build a directory tree, and using the <em>context</em> to set environment variables, etc., in each test file written.</p>
<p>The providers can also remove tests from the <em>payload</em> that aren&#8217;t relevant in a given <em>context</em>, or add more that are only relevant to a given context.</p>
<p>Test::WriteVariants allows providers to be specified not just as code references but also as namespaces. In this case it uses <a href="https://metacpan.org/pod/Module::Pluggable::Object">Module::Pluggable::Object</a> to find installed plugins within that namespace and wraps them in a code reference for Data::Tumbler. This allows extra test variants to be added by installing other modules.</p>
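<p>From memory, typical usage looks roughly like the sketch below. The argument names are my best recollection rather than gospel &#8211; check the module&#8217;s synopsis, since the API is still evolving:</p>
<blockquote>
<pre>use Test::WriteVariants;

my $writer = Test::WriteVariants-&gt;new;
$writer-&gt;write_test_variants(
    # The tests to write variants of (the payload).
    input_tests =&gt; {
        'core/dbi' =&gt; { require =&gt; 't/core_dbi.t' },
    },
    # Code refs and/or plugin namespaces to search for providers.
    variant_providers =&gt; [ 'DBI::Test::VariantDBD' ],
    output_dir =&gt; 'xt/variants',
);</pre>
</blockquote>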
<p>This is used in DBI Test. The DBD::DBM driver, for example, can install <a href="https://github.com/perl5-dbi/DBI-Test/blob/28efa4e34e9a98fc7c5491631f6d5bbc0208bf9e/sandbox/tim/lib/DBI/Test/VariantDBD/DBM.pm">a provider plugin module that adds extra variants</a> when the context indicates that DBD::DBM is being tested. The plugin also arranges to <a href="https://github.com/perl5-dbi/DBI-Test/blob/master/sandbox/tim/lib/DBI/Test/VariantDBD/DBM.pm#L45">add DBD::DBM specific tests</a> in those contexts.</p>
<p>Although Test::WriteVariants is new, and still evolving quite fast, it&#8217;s already proving very useful. Jens is experimenting with using it for improving the testing of <a href="https://github.com/perl5-utils/List-MoreUtils">List::MoreUtils</a>, especially covering both the <a href="https://github.com/perl5-utils/List-MoreUtils/blob/155b94eb11ce9ac031cdb27cbf75f08e7bc317d5/inc/Tumble.pm#L137">XS and pure-perl</a> variants.</p>
<p>I hope you can see uses for Test::WriteVariants in improving the testing of your own modules. If so, please do try it out and let me know how it works out for you and whether there&#8217;s anything that needs improving.</p>
<p>Happy testing!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">600</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Migrating a complex search query from DBIx::Class to Elasticsearch</title>
		<link>https://blog.timbunce.org/2013/07/29/migrating-a-complex-search-query-from-dbixclass-to-elasticsearch/</link>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Mon, 29 Jul 2013 21:25:13 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[dbi]]></category>
		<category><![CDATA[elasticsearch]]></category>
		<category><![CDATA[postgresql]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=584</guid>

					<description><![CDATA[At the heart of one of our major web applications at TigerLead is a property listing search. The search supports all the obvious criteria, like price range and bedrooms, more complex ones like school districts, plus a &#8220;full-text&#8221; search field. This is the story of moving the property listing search logic from querying a PostgreSQL &#8230; <a href="https://blog.timbunce.org/2013/07/29/migrating-a-complex-search-query-from-dbixclass-to-elasticsearch/" class="more-link">Continue reading <span class="screen-reader-text">Migrating a complex search query from DBIx::Class to&#160;Elasticsearch</span></a>]]></description>
										<content:encoded><![CDATA[<p>At the heart of one of our major web applications at <a href="http://www.tigerlead.com">TigerLead</a> is a property listing search. The search supports all the obvious criteria, like price range and bedrooms, more complex ones like school districts, plus a &#8220;full-text&#8221; search field.</p>
<p>This is the story of moving the property listing search logic from querying a PostgreSQL instance to querying an ElasticSearch cluster.<span id="more-584"></span></p>
<p>The initial motivation for using ElasticSearch was to improve the full-text search feature. We&#8217;d been using the <a href="http://www.postgresql.org/docs/current/static/textsearch.html">full text search features built into PostgreSQL</a>, which were functional but limited. I&#8217;m sure we could have made better use of them but we wanted to take a bigger leap forward.</p>
<p>At the time, early in 2012, we looked at various options, including Sphinx and Solr. Elasticsearch was new and relatively immature but had a compelling feature set and momentum. The availability of powerful feature-rich APIs for Perl, i.e., the <a href="https://metacpan.org/module/ElasticSearch">ElasticSearch</a>, <a href="https://metacpan.org/module/ElasticSearch::SearchBuilder">ElasticSearch::SearchBuilder</a>, and <a href="https://metacpan.org/module/Elastic::Model">Elastic::Model</a> modules, was also a key factor. We began to see Elasticsearch as not just a solution for full-text search but as a strategic technology, applicable to a wide range of applications.</p>
<p>I found the learning curve quite steep. There was little in the way of guides and tutorials at the time and the reference documentation was patchy and often assumed familiarity with the terminology for Lucene, the foundation that underlies both Solr and Elasticsearch. Thankfully the <a href="http://www.elasticsearch.org/guide/">documentation</a> and other <a href="http://www.elasticsearch.org/resources/">resources</a> have improved since then. Also many companies are using Elasticsearch now (github, stackoverflow, foursquare, <a href="http://backstage.soundcloud.com/tag/elastic-search/">soundcloud</a>, <a href="http://blog.wajam.com/2013/08/scalable-architecture-behind-wajam-social-search/">Wajam</a> and <a href="http://www.kickstarter.com/backing-and-hacking/elasticsearch-at-kickstarter">kickstarter</a>, to name a few) and blogging about their experience of what to do and <a href="https://github.com/blog/1397-recent-code-search-outages">what not to do</a>.</p>
<p>I&#8217;d especially like to thank <a href="https://github.com/clintongormley">Clinton Gormley</a> for kindly giving me much help and support as I climbed the learning curve and stumbled over assorted issues.</p>
<h2>Index Building</h2>
<p>Our PostgreSQL database remains the &#8216;source of truth&#8217;. We build a new Elasticsearch index from the PostgreSQL data each day and feed changes into Elasticsearch every few minutes. Each new index has a name that includes the date and time it was created and we use <a href="http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/">aliases</a> to <a href="http://www.elasticsearch.org/guide/reference/api/search/">route</a> queries relating to subsets of the data to the appropriate index.</p>
<p>Index and alias definition and management, along with the loading of data, is managed via the delightful <a href="https://metacpan.org/module/Elastic::Manual::Intro">Elastic::Model</a> module. (Clinton&#8217;s <a href="http://www.slideshare.net/clintongormley/to-infinity-and-beyond-14027777">presentation</a> is well worth a look.)</p>
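<p>As a bare-bones illustration of the alias switch &#8211; using raw REST calls rather than Elastic::Model, and with made-up index names:</p>
<blockquote>
<pre>use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json);

# Atomically repoint the 'listings' alias from yesterday's index
# to the one we've just finished building.
my $actions = { actions =&gt; [
    { remove =&gt; { index =&gt; 'listings-20130728', alias =&gt; 'listings' } },
    { add    =&gt; { index =&gt; 'listings-20130729', alias =&gt; 'listings' } },
] };

my $res = HTTP::Tiny-&gt;new-&gt;post(
    'http://localhost:9200/_aliases',
    { content =&gt; encode_json($actions) },
);
die "alias switch failed: $res-&gt;{status}\n" unless $res-&gt;{success};</pre>
</blockquote>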
<p>Aside: We&#8217;re exploring the use of <a href="http://skytools.projects.pgfoundry.org/skytools-3.0/">PgQ</a> for in-PostgreSQL transaction-safe queuing. That would let us keep Elasticsearch in sync in near-realtime in a more efficient way using triggers on relevant tables to queue change messages.</p>
<p>We only sync on-market property listings to Elasticsearch, of which there are millions across the US and Canada. Each has many numeric fields, many boolean fields, a number of text fields, plus a number of lists-of-integers fields.</p>
<h2>Query Building</h2>
<p>We use <a href="https://metacpan.org/module/DBIx::Class">DBIx::Class</a> as an abstract interface to the property listings in PostgreSQL, and that means using <a href="https://metacpan.org/module/SQL::Abstract">SQL::Abstract</a> to construct the search query. So we have a module that, for each web search query parameter, adds the corresponding elements to the SQL::Abstract data structure.</p>
<p>Most are pretty trivial, like</p>
<blockquote><p><code>$sql_abstract-&gt;{price} = { '&gt;=' =&gt; $price_min };</code></p></blockquote>
<p>Others are a little more tricky, like our basic textsearch query:</p>
<blockquote><p><code>$sql_abstract-&gt;{ts_index_col} = { '@@' =&gt; \[ "plainto_tsquery(?)", [ ts_index_col =&gt; $plain_ft_query ] ] };</code></p></blockquote>
<p>Elasticsearch has a <em>very</em> rich query language. It&#8217;s not actually a language at all, but something more like an <a href="http://www.elasticsearch.org/guide/reference/query-dsl">Abstract Syntax Tree expressed in JSON</a>. The Perl interface to this is the <a href="https://metacpan.org/module/ElasticSearch::SearchBuilder">ElasticSearch::SearchBuilder</a>. It looks a little like SQL::Abstract but is much richer.</p>
<p>I thought for a while about translating the SQL::Abstract data structure that we already generated into a corresponding ElasticSearch::SearchBuilder structure. In the end I decided this wouldn&#8217;t leave us in a good place. It proved better to modify every place that built the SQL::Abstract data structure to also build an ElasticSearch::SearchBuilder structure, tuned to the semantics of the field. For example, in some cases it can be better to use &#8216;<code>lte</code>&#8216; and &#8216;<code>gte</code>&#8216; instead of &#8216;<code>&lt;=</code>&#8216; and &#8216;<code>&gt;=</code>&#8216; as comparison operators.</p>
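<p>For example, a simple &#8220;price between&#8221; criterion ends up expressed twice, once per back end (field and variable names here are illustrative):</p>
<blockquote>
<pre># SQL::Abstract structure, consumed by DBIx::Class search():
$sqla-&gt;{price} = { '&gt;=' =&gt; $price_min, '&lt;=' =&gt; $price_max };

# ElasticSearch::SearchBuilder structure, using range operators:
$esb-&gt;{price} = { gte =&gt; $price_min, lte =&gt; $price_max };</pre>
</blockquote>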
<h2>Full Records or IDs Only?</h2>
<p>Another early design decision was whether to store (and return) <em>all</em> the property details in Elasticsearch, or just enough to perform searches and return only IDs that could then be used to fetch the full details from PostgreSQL.</p>
<p>In the end I decided to store only search fields and return only IDs. The big downside was that every query would require two round-trips: one to query Elasticsearch to get the IDs and one to query PostgreSQL to get the full details. That might seem a little odd. The major motivation was how the new code would interface with the existing logic in the web application.</p>
<h2>Execution</h2>
<p>The existing code executed the listing search using the standard DBIx::Class search() method:</p>
<blockquote><p><code>$c-&gt;model($schema_name)-&gt;search( $sqla, \%attr )<br />
</code></p></blockquote>
<p>Here %attr contained two joins, two prefetches, paging, order by, cache control, and some extra fields via <code>'+select'</code>. The resulting resultset was then inflated via a series of five <code>with_*()</code> method calls based on <a href="https://metacpan.org/module/DBIx::Class::ResultSet::WithMetaData">DBIx::Class::ResultSet::WithMetaData</a> (which was fashionable at the time the code was written).</p>
<p>At this stage using Elasticsearch was just an experiment and, frankly, I didn&#8217;t want to mess with all that code! Returning just IDs let me integrate Elasticsearch with hardly any changes.</p>
<p>The trick was to replace the $sqla data structure that had been constructed to perform the full search with one that would just fetch the IDs that had been returned by Elasticsearch:</p>
<blockquote><p><code>$sqla = { 'me.id' =&gt; { -in =&gt; \@ids_from_es } };</code></p></blockquote>
<p>There was a little fiddling with paging and ordering, but that trick was the heart of it, and it made the integration quite simple.</p>
<p>Another benefit is that we have a simple way to recover from problems: if ES fails for any reason we simply don&#8217;t alter $sqla, so the original query runs against PG.</p>
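<p>Putting those pieces together, the integration amounts to something like this sketch, where <code>search_es_for_ids()</code> is a hypothetical helper wrapping the Elasticsearch query:</p>
<blockquote>
<pre># Try Elasticsearch first; on any failure leave $sqla untouched
# so the original query runs against PostgreSQL as before.
my @ids_from_es = eval { search_es_for_ids($esb_query) };
if (!$@ &amp;&amp; @ids_from_es) {
    $sqla = { 'me.id' =&gt; { -in =&gt; \@ids_from_es } };
}
my $rs = $c-&gt;model($schema_name)-&gt;search($sqla, \%attr);</pre>
</blockquote>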
<h2>Runtime Control</h2>
<p>We needed to be able to soft-launch this, to enable it for only a subset of requests. We already had the infrastructure for that, so once the code was in production we could enable it for specific users and control the overall percentage of search requests using Elasticsearch.</p>
<p>This was obviously very useful but, as it turned out, our initial timidity also hid some interesting performance behaviour.</p>
<h2>Performance</h2>
<p>Property search is a key feature of the service and performance was naturally a concern. Would the <em>presumed</em> benefits of Elasticsearch (ES) outweigh the cost of having to run two queries, on two different databases? I was fairly confident but not certain.</p>
<p>We were using a cluster of three ES nodes, each with 8GB memory and 4 CPUs. Once the code was ready I&#8217;d done some performance testing, firing randomly generated search requests at the web servers, in our staging environment. Those stress test results looked good.</p>
<p>When we started routing 2-5% of actual production search requests to ES, however, the results were not good. Here&#8217;s a chart of the performance of PostgreSQL (PG) in green with Elasticsearch+PostgreSQL (ES+PG) in red:<br />
<a href="https://blog.timbunce.org/wp-content/uploads/2013/07/es-vs-pg-low-traffic-hr.png" target="_blank"><img style="display:block;margin-left:auto;margin-right:auto;border:0;" alt="Chart of ES and PG low traffic" src="https://blog.timbunce.org/wp-content/uploads/2013/07/es-and-pg-low-traffic.png?w=600&#038;h=323" width="600" height="323" border="0" /></a></p>
<p>The mean search time using ES+PG was worse than the 90th percentile time for PG alone. That was disappointing and puzzling. I embarked on a review of all the (many) things that might not be optimal in the ES server configuration, in the <a href="http://www.elasticsearch.org/guide/reference/mapping/">mapping</a> applied to the fields, and the particular way we were constructing the queries. Here Clinton Gormley was beyond helpful, again. We found and tuned many little things, which was great, but none were clearly the cause of the apparent slowness.</p>
<p>To cut a long story short, the cause turned out to be the fact we were running the ES nodes in virtual machines (KVM). More specifically, although we&#8217;d configured ES to lock the physical memory pages via <a href="http://www.elasticsearch.org/guide/reference/setup/installation/">bootstrap.mlockall=true</a>, mlockall() within a <em>guest</em> operating system doesn&#8217;t stop the <em>host</em> operating system stealing the physical pages.</p>
<p>From the host&#8217;s point of view those memory pages weren&#8217;t busy enough to keep assigned to the ES VM, so the solution was simple: give more traffic to ES. Sure enough, as we increased the number of requests going to ES it got faster!</p>
<p>Here&#8217;s a chart showing the final ramp up from around 15% of requests going to ES up to 100%:</p>
<p><a href="https://blog.timbunce.org/wp-content/uploads/2013/07/es-vs-pg-high-traffic-hr.png" target="_blank"><img style="display:block;margin-left:auto;margin-right:auto;border:0;" alt="Chart ES and PG at higher traffic" src="https://blog.timbunce.org/wp-content/uploads/2013/07/es-and-pg-at-higher-traffic.png?w=600&#038;h=320" width="600" height="320" border="0" /></a></p>
<p>You can see that at 15% the mean and 90th percentile performance of ES+PG closely matched that of PG alone. At 100% ES+PG was not only clearly faster than PG alone, but the 90th percentile was close to the mean of PG alone. Since then we&#8217;ve upgraded ES to a more recent version and increased the memory on each node to 16GB. Now the mean search time is a steady 100ms and the 90th percentile hovers around 150ms.</p>
<h2>Scalability</h2>
<p>We&#8217;re using multicast discovery so there&#8217;s zero configuration. We can deploy a new server and the new Elasticsearch node will join the cluster and automatically distribute the data and query workload. It really is as simple as that.</p>
<h2>Reliability</h2>
<p>We&#8217;ve only had one problem that I can recall where Elasticsearch behaved strangely. Even that didn&#8217;t stop search requests, it only affected building a new index. Restarting the cluster fixed it.</p>
<p>That was with an early 0.20.x release and we&#8217;ve had no recurrence after upgrading. We&#8217;re on the latest 0.20.x now and plan to move to 0.90.x before long. (An upgrade that should significantly boost performance again.)</p>
<h2>Next Steps</h2>
<p>We&#8217;ve been impressed with Elasticsearch as a search solution,  in terms of functionality, reliability and performance. Delighted by the support from Clinton and the IRC community. And amazed at the range of <a href="http://www.elasticsearch.org/guide/reference/modules/plugins/">plugins</a> being developed.</p>
<p>We&#8217;re pushing full listing data into Elasticsearch now, and writing modules to better abstract the searching so it can be used more easily in other applications. We&#8217;re also happily cooking up plans to use more Elasticsearch features, like <a href="http://www.elasticsearch.org/guide/reference/api/percolate/">percolate</a>, in other projects.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">584</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2013/07/es-and-pg-low-traffic.png" medium="image">
			<media:title type="html">Chart of ES and PG low traffic</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2013/07/es-and-pg-at-higher-traffic.png" medium="image">
			<media:title type="html">Chart ES and PG at higher traffic</media:title>
		</media:content>
	</item>
		<item>
		<title>NYTProf v5 &#8211; Flaming Precision</title>
		<link>https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/</link>
					<comments>https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Mon, 08 Apr 2013 22:27:32 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[nytprof]]></category>
		<category><![CDATA[performance]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=557</guid>

					<description><![CDATA[As soon as I saw a Flame Graph visualization I knew it would make a great addition to NYTProf. So I&#8217;m delighted that the new Devel::NYTProf version 5.00, just released, has a Flame Graph as the main feature of the index page. In this post I&#8217;ll explain the Flame Graph visualization, the new &#8216;subroutine calls &#8230; <a href="https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/" class="more-link">Continue reading <span class="screen-reader-text">NYTProf v5 &#8211; Flaming&#160;Precision</span></a>]]></description>
										<content:encoded><![CDATA[<p>As soon as I saw a <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">Flame Graph</a> visualization I knew it would make a great addition to NYTProf. So I&#8217;m delighted that the new <a href="https://metacpan.org/module/Devel::NYTProf">Devel::NYTProf</a> version 5.00, just released, has a Flame Graph as the main feature of the index page.</p>
<p><a href="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png"><img loading="lazy" data-attachment-id="556" data-permalink="https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/nytprof-v5-flamegraph-png/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png" data-orig-size="1200,618" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="nytprof-v5-flamegraph.png" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=676" src="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=676" alt="nytprof-v5-flamegraph.png"   class="alignnone size-full wp-image-556" srcset="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=440&amp;h=227 440w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=880&amp;h=453 880w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=150&amp;h=77 150w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=300&amp;h=155 300w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png?w=768&amp;h=396 768w" sizes="(max-width: 440px) 100vw, 440px" /></a></p>
<p>In this post I&#8217;ll explain the Flame Graph visualization, the new &#8216;subroutine calls event stream&#8217; that makes the Flame Graph possible, and other recent changes, including improved precision in the subroutine profiler.<span id="more-557"></span></p>
<h2>Precision</h2>
<p>Let&#8217;s start with the improved precision. That work was actually released a few months ago in Devel::NYTProf 4.23 but not announced.</p>
<p>Devel::NYTProf started life as a line/statement profiler, writing a stream of events, one per statement. It&#8217;s important for speed that the stream is space efficient, so statement times were expressed as integer microseconds (a &#8216;tick&#8217;) and written in a compressed form. Values less than 128&micro;s use a single byte. This worked very well for v1. Back in early 2008 minimum statement times were typically just a few microseconds.</p>
<p>When I added the subroutine profiler I chose to use <a href="http://en.wikipedia.org/wiki/Double-precision_floating-point_format">double precision floating point</a> values to hold the subroutine call times with seconds as the units. I presume that seemed reasonable at the time as microseconds (multiples of 1e-6) can be stored accurately as double precision floating point values and are significantly above the typical <a href="http://en.wikipedia.org/wiki/Machine_epsilon">machine epsilon</a> of 2.220446e-16.</p>
<p>I&#8217;d assumed the values weren&#8217;t at risk from the pernicious effect of <a href="http://en.wikipedia.org/wiki/Floating_point#Machine_precision_and_backward_error_analysis">cumulative round-off errors</a>. The situation got worse with NYTProf v2 because that switched the clock &#8216;tick&#8217; from 1&micro;s to 100ns on some systems (those with POSIX realtime clock API and OS X). And then worse again when profiling of &#8216;slowops&#8217; was added in NYTProf v3 since slowops are often far from slow.</p>
<p><code>$ perl -we '$n=10_000_000; $t=0.0; $i=3/$n; $t+=$i while $n--; print "$t\n";'<br />
2.99999999961925</code></p>
<p>The way the subroutine profiler works, calculating inclusive and exclusive times as it goes, makes it sensitive to these accumulated errors. (Sometimes a subroutine that did nothing but call a very fast subroutine many times could be reported as having taken less time than the sum of the times in the subroutine it called.)</p>
<p>The subroutine profiler still uses double precision floating point values to accumulate the times, but now accumulates integer ticks instead of fractional seconds.</p>
<p><code>$ perl -we '$n=10_000_000; $t=0.0; $i=3.0; $t+=$i while $n--; $t/=10_000_000; print "$t\n";'<br />
3</code></p>
<p>(The <code>$t=0.0</code> and <code>$i=3.0</code> ensure perl is using floating point values in that example. I checked it with <a href="https://metacpan.org/module/Devel::Peek">Devel::Peek</a>.)</p>
<h2>Subroutine Call Events</h2>
<p>There&#8217;s one thing the old and <a href="https://blog.timbunce.org/2008/07/12/devel-dprof-broken-by-the-passage-of-time/">deeply flawed</a> Devel::DProf profiler can do that NYTProf hasn&#8217;t been able to: the DProf <a href="https://metacpan.org/module/dprofpp">dprofpp</a> utility can generate a <em>subroutine call tree</em>.</p>
<p>NYTProf hasn&#8217;t been able to do that because its subroutine profiler worked entirely in memory, accumulating aggregate data about each  <em>call arc</em>, but not outputting anything until the end of the profile. So all the calls on any given arc are merged together.</p>
<p>NYTProf v5 adds a new <code>calls</code> option that enables streaming of subroutine call events as they happen. With <code>calls=2</code> subroutine call and return events are generated. With <code>calls=1</code> (the default) only subroutine return events are generated. (A curious side effect of perl internals and the way NYTProf works means it can&#8217;t <em>reliably</em> know the name of the subroutine at call entry time. So the call entry event isn&#8217;t very useful at the moment.)</p>
<p>The call return events are sufficient to recreate a call tree, albeit with some expensive massaging of the data. NYTProf does this with the new <code>nytprofcalls</code> utility which reads and processes the stream of call return events. At the moment it&#8217;s undocumented, rather hackish, and only generates the call data in a collapsed form suitable for generating a flamegraph (more below). It could be extended to produce a call tree without too much work. Then, finally, the ghost of Devel::DProf can be laid to rest.</p>
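<p>For example, enabling call events via the <code>NYTPROF</code> environment variable and feeding the collapsed stacks to <code>flamegraph.pl</code> by hand might look something like this (the exact <code>nytprofcalls</code> invocation may differ &#8211; it&#8217;s undocumented, as noted):</p>
<pre><code>NYTPROF=calls=2 perl -d:NYTProf myscript.pl
nytprofcalls nytprof.out &gt; all_stacks_by_time.calls
flamegraph.pl all_stacks_by_time.calls &gt; calls.svg
</code></pre>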
<h2>Flame Graph</h2>
<p><a href="http://twitter.com/brendangregg">Brendan Gregg</a> developed the Flame Graph as a way to visualize very large volumes of stack traces sampled by <a href="http://dtrace.org/blogs/about/">DTrace</a>.</p>
<p>It&#8217;s a wonderfully compact and information-rich way to visualize where a program is spending its time. It&#8217;s also unusual and potentially confusing, so a little explanation is required. Keep in mind that it&#8217;s a visualization of <em>distinct call stacks</em> and that the colors are not meaningful.</p>
<p>The y-axis represents stack depth. Each box represents the time spent in a particular subroutine <em>when called by the subroutine below it</em>. So a particular subroutine will appear in multiple places if called via different call stacks.</p>
<p>The x-axis spans the time the profiler was running. It does not show the passing of time from left to right, as most graphs do. The left to right ordering has no meaning (it&rsquo;s sorted alphabetically).</p>
<p>The width of a box shows the inclusive time the subroutine was running, or was part of the ancestry of subroutines that were running (the boxes above it). A wider box may mean a slower subroutine, <em>or</em> simply one that&#8217;s called more often &#8211; the call count is not shown.</p>
<p>Brendan&#8217;s original flamegraph script generated an SVG that wasn&#8217;t well suited to embedding in an application like NYTProf. He&#8217;s kindly accepted a series of <a href="https://github.com/brendangregg/FlameGraph/pulls/timbunce?direction=desc&amp;page=1&amp;sort=created&amp;state=closed">pull requests</a> to add the key features I was looking for. The most important being the ability to make the boxes clickable: click on a box and you&#8217;ll be taken to the report for that subroutine!</p>
<p>Let&#8217;s take a closer look at a simple example using a recursive Fibonacci function:</p>
<blockquote>
<pre>sub fib {
    my $n = shift;
    return $n if $n &lt; 2;
    fib($n-1) + fib($n-2);
}
sub foo { fib(8) }
sub bar { fib(8) }
foo();
bar();</pre>
</blockquote>
<p>That gives us a Flame Graph like this:</p>
<p><a href="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png"><img loading="lazy" data-attachment-id="565" data-permalink="https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/nytprof-v5-flamegraph-fib/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png" data-orig-size="2397,449" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="nytprof-v5-flamegraph-fib" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=676" src="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=676" alt="nytprof-v5-flamegraph-fib"   class="alignnone size-full wp-image-565" srcset="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=440&amp;h=82 440w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=880&amp;h=165 880w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=150&amp;h=28 150w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=300&amp;h=56 300w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png?w=768&amp;h=144 768w" sizes="(max-width: 440px) 100vw, 440px" /></a></p>
<p>The line at the bottom that spans the full width represents the entire profile run. In this case it was 778&micro;s. (Hover over any block to see the time &#8211; you can see one in the image, along with the bold and bordered box it relates to).</p>
<p>The first line above that shows the calls to <code>foo</code> and <code>bar</code>. The line for those is shorter than the total line because the total includes the time perl spent compiling the script. It shows up clearly here because this script is so fast.</p>
<p>Then, above the blocks for both <code>foo</code> and <code>bar</code>, you can see the recursive calls to <code>fib</code> rising like flames (okay, with a little imagination). Two things to note here. Firstly <code>bar</code> is shown to the left of <code>foo</code> simply because the names at each level are in lexicographic order. There&#8217;s no deeper meaning in the ordering.</p>
<p>Secondly, you can easily see that <code>bar</code> was faster (narrower) than <code>foo</code>, even though they contain the same code. Why&#8217;s that? When <code>foo</code> ran first it would have paid the price for growing the stacks and warming the memory pages. Then when <code>bar</code> was called it gained from <code>foo</code>&#8216;s work.</p>
<h2>Flame Graph Generator</h2>
<p>Behind the scenes <code>nytprofhtml</code> runs <code>nytprofcalls</code> to generate a file in the report directory called <code>all_stacks_by_time.calls</code>. It then calls <code>flamegraph.pl</code> to read that file and generate the <code>all_stacks_by_time.svg</code> that&#8217;s shown in the report.</p>
<p>The <code>all_stacks_by_time.calls</code> has a very simple format. One line per distinct call stack, with subroutine names separated by semicolons, followed by a number (which is either in 1&micro;s or 100ns units depending on the platform). Here&#8217;s an example running the code above but calling <code>fib(2)</code> instead of <code>fib(8)</code> to keep it small:</p>
<pre><code>main::bar 37
main::bar;main::fib 45
main::bar;main::fib;main::fib 19
main::foo 416
main::foo;main::fib 222
main::foo;main::fib;main::fib 61
</code></pre>
<p>This simple format is perfect for grep&#8217;ing! You can effectively zoom-in on any subset of the call stacks by generating a flamegraph of just the stacks that contain the functions you&#8217;re interested in. For example, running this command on the profile of <a href="https://metacpan.org/module/Perl::Critic">perlcritic</a> shown at the top:</p>
<p><code>grep -w Perl::Critic::Policy::new nytprof/all_stacks_by_time.calls | flamegraph.pl &gt; tmp.svg &amp;&amp; open tmp.svg</code></p>
<p>gives you this Flame Graph:</p>
<p><a href="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png"><img loading="lazy" data-attachment-id="573" data-permalink="https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/nytprof-v5-flamegraph-grep/" data-orig-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png" data-orig-size="2399,769" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="nytprof-v5-flamegraph-grep" data-image-description="" data-image-caption="" data-medium-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=300" data-large-file="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=676" src="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=676" alt="nytprof-v5-flamegraph-grep"   class="alignnone size-full wp-image-573" srcset="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=440&amp;h=141 440w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=880&amp;h=282 880w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=150&amp;h=48 150w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=300&amp;h=96 300w, https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png?w=768&amp;h=246 768w" sizes="(max-width: 440px) 100vw, 440px" /></a></p>
<p>You can see that a lot of time is being spent gathering stack traces for exceptions (this is with perlcritic 1.118 on perl v5.14.2).</p>
<p>It would be nice to have a Flame Graph generated for each of the top-N files/modules, showing just the subset of call stacks that involve any of the subroutines defined in that file. I didn&#8217;t get around to that for v5.00. Feel free to <a href="https://github.com/timbunce/devel-nytprof">fork the code</a>, add that in, and send me a pull request!</p>
<h2>Minor Changes</h2>
<p>The very old and very limited <code>nytprofcsv</code> utility has been deprecated. Let me know if you use it, otherwise it won&#8217;t be around much longer.</p>
<p>The <code>blocks</code> option is no longer on by default &#8211; it seems that few people used the ability to view statement times rolled up at the block level. You can always enable it with <code>blocks=1</code> in the options.</p>
<h2>What Next?</h2>
<p>For NYTProf? I don&#8217;t know.</p>
<p>Next up on <em>my</em> to-do list is giving <a href="https://blog.timbunce.org/2012/10/05/introducing-develsizeme-visualizing-perl-memory-use/">Devel::SizeMe</a> the love it needs. There&#8217;s some deep work I&#8217;d really like to get done before <a href="http://www.yapcna.org">YAPC::NA</a> in June.</p>
<p>Maybe I&#8217;ll see you there.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">557</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph.png" medium="image">
			<media:title type="html">nytprof-v5-flamegraph.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-fib2.png" medium="image">
			<media:title type="html">nytprof-v5-flamegraph-fib</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2013/04/nytprof-v5-flamegraph-grep.png" medium="image">
			<media:title type="html">nytprof-v5-flamegraph-grep</media:title>
		</media:content>
	</item>
		<item>
		<title>Suggested Alternatives as a MetaCPAN feature</title>
		<link>https://blog.timbunce.org/2013/03/10/suggested-alternatives-as-a-metacpan-feature/</link>
					<comments>https://blog.timbunce.org/2013/03/10/suggested-alternatives-as-a-metacpan-feature/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Sun, 10 Mar 2013 22:16:06 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[cpan]]></category>
		<category><![CDATA[metacpan]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=549</guid>

					<description><![CDATA[I expressed this idea recently in a tweet and then started writing it up in more detail as a comment to Brendan Byrd&#8217;s The Four Major Problems with CPAN blog post. It grew in detail until I figured I should just write it up as a blog post of my own.(I fell out of the &#8230; <a href="https://blog.timbunce.org/2013/03/10/suggested-alternatives-as-a-metacpan-feature/" class="more-link">Continue reading <span class="screen-reader-text">Suggested Alternatives as a MetaCPAN&#160;feature</span></a>]]></description>
<content:encoded><![CDATA[<p>I expressed this idea recently in <a href="https://twitter.com/timbunce/status/310385898971869186">a tweet</a> and then started writing it up in more detail as a comment to Brendan Byrd&#8217;s <a href="http://blogs.perl.org/users/brendan_byrd/2013/03/the-four-major-problems-with-cpan.html">The Four Major Problems with CPAN</a> blog post. It grew in detail until I figured I should just write it up as a blog post of my own.<span id="more-549"></span></p>
<p>(I fell out of the way of blogging over the two years or so of focus and distraction that our major <a href="https://blog.timbunce.org/2011/06/29/building-a-different-kind-of-extension/">house extension</a> took to go from conception to reality. I&#8217;ve been meaning to start blogging again more regularly anyway. I&#8217;ve a few blog posts brewing in the back of my mind, so we&#8217;ll see how it goes.)</p>
<p>In Brendan&#8217;s <a href="http://blogs.perl.org/users/brendan_byrd/2013/03/the-four-major-problems-with-cpan.html">post</a> he describes four problems with CPAN:</p>
<ol>
<li>Too many modules are unmaintained; abandoned but not marked as such.</li>
<li>There is not enough data on what modules are mature; which ones are the &#8220;right ones&#8221; to use.</li>
<li>Many modules are only used for semi-private needs.</li>
<li>Modules cannot be renamed or deleted, even with a long-term deprecation process.</li>
</ol>
<p>I&#8217;d like to propose a feature that doesn&#8217;t seem to address these issues directly but would, I believe, greatly reduce the significance of all of them. </p>
<p>Olaf Alders responded to Brendan&#8217;s post with <a href="http://blogs.perl.org/users/olaf_alders/2013/03/sifting-through-the-cpan.html">Sifting Through the CPAN</a> and pointed out the need for better search tools and specifically suggests tagging. While tagging might be helpful in general I think we need a way to explicitly guide users from one module to another.</p>
<h2>Suggested Alternatives</h2>
<p>I&#8217;ve long thought that CPAN would benefit from a mechanism to track &#8220;suggested alternative modules&#8221;. (And/or perhaps &#8220;suggested alternative <em>distributions</em>&#8220;, but I&#8217;ll just talk about modules for now.)</p>
<p>I envisage a &#8220;Suggested Alternatives&#8221; section in the right sidebar on every module page. It would show the top-N suggestions, with a [++] icon beside each, ordered by the number of people who have made the suggestion or agreed to it by pressing the [++] icon. And naturally it would have a text field to enter an existing module name, with type-ahead suggestions. Finally, the Suggested Alternatives heading would be a link to a details page.</p>
<p>The details page would show, for that module, every instance of a suggestion being made or up-voted, with the user and the date. That would let people see who made the suggestion and when. Users would be able to remove their own suggestions.</p>
<p>For modules that are the suggested alternative for some other module, their page could show something like &#8220;Suggested as the alternative to X other modules by Y people&#8221; with a link to a page that would show the corresponding details.</p>
<p>With something like this in place &#8220;unmaintained, abandoned&#8221; modules would gather suggested alternatives. Mature &#8216;good&#8217; modules would tend to accumulate suggestions pointing towards them, while mature &#8216;poor&#8217; modules would tend to accumulate suggestions pointing away. Experiments and obscure &#8220;private needs&#8221; modules wouldn&#8217;t gather suggestions and that, combined with the higher ranking of modules with votes and inward pointing suggestions, means they&#8217;d languish in obscurity doing little harm.</p>
<h2>The Alternatives Graph</h2>
<p>This &#8220;alternatives&#8221; data creates a <em>graph of relationships</em> among similar modules in a powerful and directly useful way.</p>
<p>For search results it would be useful not only for ranking but also for widening the search. Modules that are the suggested alternatives for modules in the &#8216;natural&#8217; results could be included. That&#8217;s potentially a big win.</p>
<p>Of course it would be perfectly reasonable for a pair of modules to have suggestions pointing to each other. Or for there to be loops of suggestions. That&#8217;s fine and simply expresses the conflicting views of the users making the suggestions.</p>
<h2>Similar Modules (a digression)</h2>
<p>I also had the idea that there may be value in having a &#8216;similar modules&#8217; link that shows the list of modules produced by traversing the graph of suggestions for some number of hops in both directions, and ranked by some combination of votes and placement in the graph.</p>
<p>But then I wondered if that would be better implemented as an explicit way to suggest a &#8216;similar module&#8217;. In other words, generalize the idea of a &#8220;suggested alternative&#8221; into a &#8220;related module&#8221; relationship plus attributes like a &#8220;weight&#8221;, where a positive weight denotes a &#8220;suggested alternative&#8221; and a zero weight is simply a &#8220;similar module&#8221; or a &#8220;see also&#8221;. Perhaps there&#8217;s also value in having a &#8220;complementary module&#8221; relationship.</p>
<p>This is all a bit vague. It suggests to me that any code to support a &#8220;module relationship&#8221; mechanism should be kept generic to allow for other kinds of relationships in future.</p>
<h2>The Whys and Wherefores</h2>
<p>The primary data of the graph is a link from one module to another with a count of the number of people who agreed with that suggestion.</p>
<p>That surface data is built from a deeper layer that records, for each link, which users made the suggestion and when.</p>
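<p>To make this concrete, here&#8217;s a minimal sketch of how the two layers might be stored, using SQLite via DBI. The table and column names are hypothetical, and the <code>weight</code>, <code>reason</code> and <code>url</code> columns anticipate ideas discussed elsewhere in this post:</p>
<pre>use DBI;

my $dbh = DBI-&gt;connect("dbi:SQLite:dbname=alternatives.db", "", "",
    { RaiseError =&gt; 1 });

# Deeper layer: one row per user per suggestion (hypothetical schema)
$dbh-&gt;do(q{
    CREATE TABLE suggestion (
        from_module TEXT NOT NULL,
        to_module   TEXT NOT NULL,
        user        TEXT NOT NULL,
        made_at     TEXT NOT NULL,     -- ISO date
        weight      INTEGER DEFAULT 1, -- 1 = alternative, 0 = see-also
        reason      TEXT,              -- optional, kept very short
        url         TEXT,              -- optional supporting material
        PRIMARY KEY (from_module, to_module, user)
    )
});

# Surface layer: the per-link counts, derived as a view
$dbh-&gt;do(q{
    CREATE VIEW suggestion_count AS
    SELECT from_module, to_module, COUNT(*) AS votes
    FROM suggestion
    GROUP BY from_module, to_module
});</pre>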
<p>A helpful extra feature would be to let users optionally give a short reason for <em>why</em> <em>they</em> are suggesting <em>this</em> particular alternative. Perhaps because they feel it&#8217;s unmaintained, or lacks specific features that their suggested alternative has.</p>
<p>Suggestions without the whys would be very useful, and I&#8217;d suggest that that much is implemented first. But suggestions without explanations are also very limited. Knowing what motivated someone to suggest a particular alternative would be <em>very</em> helpful to others trying to pick a module for a task. For example, people might make multiple alternative suggestions recommending Bar instead of Foo if you want a certain feature, and Baz instead of Foo if you want another.</p>
<p>I don&#8217;t think there&#8217;s much risk of this becoming a comment battlefield because on any given page all the comments share the same direction &#8216;away&#8217; from the module. Someone with an opposing viewpoint would add a separate suggestion with their own comments on the &#8216;opposite&#8217; module.</p>
<p>I&#8217;d suggest the comment field be kept <em>very</em> short, say 50 characters, and provide a separate url field to encourage referencing supporting material such as a blog post or mailing list archive.</p>
<p>Other approaches might be to have a few checkboxes with typical reasons (very limited), or perhaps tags, or link in with <a href="http://cpanratings.perl.org">cpanratings</a> in some way (possibly complex).</p>
<h2>Alternative Distributions</h2>
<p>The best way to build and present Alternative Distributions data is probably to simply derive it from the Alternative Modules data.</p>
<p>It would be a read-only view that collapses the module-level graph data down to links between the corresponding distributions.</p>
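<p>Sticking with the hypothetical schema sketched earlier, the collapse could be a single extra view, given an (equally hypothetical) <code>module_dist</code> table mapping each module name to its distribution:</p>
<pre># Hypothetical: module_dist maps each module name to its distribution
$dbh-&gt;do(q{
    CREATE VIEW dist_suggestion_count AS
    SELECT f.dist AS from_dist, t.dist AS to_dist, COUNT(*) AS votes
    FROM suggestion s
    JOIN module_dist f ON f.module = s.from_module
    JOIN module_dist t ON t.module = s.to_module
    GROUP BY f.dist, t.dist
});</pre>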
<h2>Yanick Steps Up</h2>
<p>After writing a draft of this post I saw a <a href="https://twitter.com/yenzie/status/310839547250499586">tweet</a> from <a href="https://twitter.com/yenzie">Yanick</a> with a link to <a href="http://babyl.dyndns.org/techblog/entry/metacpan-recommendations">a specific proposal on his blog</a>. I skimmed it, realised it was similar to mine and replied saying I&#8217;d reference it here. I decided I&#8217;d finish my post before reading it properly.</p>
<p>So here are my thoughts on Yanick&#8217;s suggestions:</p>
<p>Distributions vs Modules: Modules are the fundamental unit of use and the natural focus of attention and reviews. It&#8217;s relatively easy to derive distribution suggestions from module suggestions, but not the other way around. Using modules as the focus also means the suggestions will still be valid if a module moves from one distribution to another.</p>
<p>Adding notes: I agree that comments are best avoided for the <em>initial</em> system. I also feel strongly that their value outweighs their risks if implemented and presented carefully, so they should at least be taken into account in the initial design work.</p>
<p>User interface for recommending an alternative: Having a button beside the existing high-profile vote button doesn&#8217;t feel right to me. The vote button is a positive action and encouraging low-friction drive-by voting makes sense. Suggesting an alternative is a more negative action, and one to be considered more carefully. Using the sidebar seems more appropriate.</p>
<p>User interface for viewing suggested alternatives: I&#8217;d rather not include any user names on the module page. It complicates the code and confuses the user experience (&#8220;which names are shown and why?&#8221; etc). The full details are available on the detail page if anyone wants to take the extra step to see them.</p>
<p>Volunteering to <em>do something</em>: Awesome!</p>
<p>Thanks Yanick.</p>
<p>Update: Implementation is being discussed on <a href="https://github.com/CPAN-API/cpan-api/issues/253">this cpan-api ticket</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2013/03/10/suggested-alternatives-as-a-metacpan-feature/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">549</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Introducing Devel::SizeMe &#8211; Visualizing Perl Memory Use</title>
		<link>https://blog.timbunce.org/2012/10/05/introducing-develsizeme-visualizing-perl-memory-use/</link>
					<comments>https://blog.timbunce.org/2012/10/05/introducing-develsizeme-visualizing-perl-memory-use/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Fri, 05 Oct 2012 11:49:03 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[sizeme]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=540</guid>

					<description><![CDATA[For a long time I&#8217;ve wanted to create a module that would shed light on how perl uses memory. This year I decided to do something about it. My research and development didn&#8217;t yield much fruit in time for OSCON in July, where my talk ended up being about my research and plans. (I also &#8230; <a href="https://blog.timbunce.org/2012/10/05/introducing-develsizeme-visualizing-perl-memory-use/" class="more-link">Continue reading <span class="screen-reader-text">Introducing Devel::SizeMe &#8211; Visualizing Perl Memory&#160;Use</span></a>]]></description>
										<content:encoded><![CDATA[<p>For a long time I&#8217;ve wanted to create a module that would shed light on how perl uses memory. This year I decided to do something about it.</p>
<p>My research and development didn&#8217;t yield much fruit in time for OSCON in July, where <a href="http://www.slideshare.net/Tim.Bunce/perl-memory-use-201207">my talk</a> ended up being about my research and plans. (I also tried to explain that RSS isn&#8217;t a useful measurement for this, and that malloc buffering means even total process size isn&#8217;t a very useful measurement.) I was invited to speak at <a href="http://yapcasia.org/2012/">YAPC::Asia</a> in Tokyo in September and <em>really</em> wanted to have something worthwhile to demonstrate there.</p>
<p>I&#8217;m delighted to say that some frantic hacking (aka Conference Driven Development) yielded a working demo just in time and, after a little more polish, I&#8217;ve now uploaded <a href="http://search.cpan.org/perldoc?Devel%3A%3ASizeMe">Devel::SizeMe</a> to CPAN.</p>
<p>In this post I want to introduce you to Devel::SizeMe, show some screenshots, a <a href="http://blip.tv/timbunce/perl-memory-use-and-devel-sizeme-at-yapc-asia-2012-6381282">screencast of the talk and demo</a>, and outline current issues and plans for future development.<span id="more-540"></span></p>
<p>For a while I thought <a href="https://blog.timbunce.org/tag/nytprof/">Devel::NYTProf</a> might be a useful framework for building some kind of &#8220;memory profiler&#8221;. Something that would measure changes in memory use over time between lines and subroutines. Nicholas Clark even created a clever experimental hack to demo the concept. Sadly the data just didn&#8217;t seem to be very useful. It turns out that knowing where memory is <em>allocated</em> and <em>freed</em> isn&#8217;t nearly as important as knowing where memory is being <em>held</em>.</p>
<h2>The Plan</h2>
<p>It was clear that some kind of &#8216;snapshot&#8217; mechanism was needed. Something that would:</p>
<ol>
<li>crawl <em>all</em> the data structures within a perl interpreter</li>
<li>have some way of <em>naming</em> the path to each data structure</li>
<li>stream the data out for external storage and processing</li>
<li>be fast enough that snapshots could be taken frequently</li>
<li>visualize the vast amount of data</li>
<li>compare different snapshots</li>
</ol>
<p>Luckily the hardest part, step 1, was already covered by <a href="https://metacpan.org/module/Devel::Size">Devel::Size</a>. Originally written by Dan Sugalski in 2005, then maintained by Tels and BrowserUK, it had been picked up and polished by Nicholas Clark to stay in sync with the many internal optimizations he and others were adding to the perl core. It&#8217;s not without problems, and I&#8217;ll outline those below, but it was a great base for me.</p>
<p>I added a callback mechanism, so my code and others could &#8220;hitch a ride&#8221; on the back of Devel::Size as it crawled the data structures, and came up with a very lightweight way to track and output the &#8220;name path&#8221;.</p>
<h2>Textual Output</h2>
<p>My initial code just wrote a tree-like textual representation to prove the concept:</p>
<pre style="line-height:normal;">$ SIZEME='' perl -MDevel::SizeMe=total_size -e 'total_size([ 1, "hi", [] ])'
SV(PVAV) fill=2/2		[#1 @0] 
:   +24 sv_head =24
:   +40 sv_body =64
:   +24 av_max =88
:   ~note av_len 2
:   AVelem-&gt;		[#2 @1] 
:   :   SV(RV)		[#3 @2] 
:   :   :   +24 sv_head =112
:   :   :   RV-&gt;		[#4 @3] 
:   :   :   :   SV(PVAV) fill=-1/-1		[#5 @4] 
:   :   :   :   :   +24 sv_head =136
:   :   :   :   :   +40 sv_body =176
:   :   ~note i 2
:   AVelem-&gt;		[#6 @1] 
:   :   SV(PV)		[#7 @2] 
:   :   :   +24 sv_head =200
:   :   :   +16 sv_body =216
:   :   :   +16 SvLEN =232
:   :   ~note i 1
:   AVelem-&gt;		[#8 @1] 
:   :   SV(IV)		[#9 @2] 
:   :   :   +24 sv_head =256
:   :   ~note i 0
</pre>
<p>There you can see the array (PVAV) &#8216;node&#8217; with &#8216;leaf&#8217; sizes for the sv_head (24 bytes), sv_body (40 bytes), and the array of element pointers (av_max, 24 bytes). Below that you can see a &#8216;link&#8217; called AVelem pointing to a reference (RV) to an array with no elements. The &#8220;~note&#8221; lines are &#8216;attributes&#8217; that can be used to provide extra information about nodes. The &#8216;<code>=<em>NNN</em></code>&#8216; gives a running total of the accumulated size.</p>
<p>The terminology here (sv_head, sv_body, av_max etc.) might not be familiar to you unless you&#8217;ve spent time <a href="http://cpansearch.perl.org/src/RURBAN/illguts-0.42/index.html">delving into perl guts</a>. Hopefully, though, it&#8217;s clear that Devel::SizeMe gives access to <em>immense detail</em>.</p>
<h2>Graph Visualization</h2>
<p>That detail can quickly become overwhelming for non-trivial data structures. Some kind of visualization was needed. So I added a more compact &#8216;raw&#8217; output format and a script (sizeme_store.pl) to process it. The script &#8216;decorates&#8217; the nodes with the leaf and attribute data, gives the links better names, and adds extra details like the total size of the children.</p>
<pre>$ SIZEME='|sizeme_store.pl --dot=sizeme.dot' perl -MDevel::SizeMe=total_size -e 'total_size([ 1, "hi", [] ])'</pre>
<p>The SIZEME env var gives the name of the file to write the raw data to, or in this case the name of a program to pipe the data into. Here I&#8217;m asking sizeme_store.pl to write a <a href="http://www.graphviz.org/content/dot-language">dot format</a> file which, when rendered by <a href="http://www.graphviz.org">Graphviz</a>, produces a graph like this:</p>
<p><img style="display:block;margin-left:auto;margin-right:auto;" src="https://blog.timbunce.org/wp-content/uploads/2012/10/screen-shot-2012-10-04-at-22-55-181.png?w=400&#038;h=228" alt="Screen Shot 2012-10-04 at 22.55.18.png" border="0" width="400" height="228" /></p>
<p>You can see the links have been labeled with the index attribute, and the nodes show how the size is calculated (self+children=total) and the sizes accumulate up the graph.</p>
<p>That&#8217;s lovely, and works well for modestly sized data structures. It doesn&#8217;t scale well though. You quickly find yourself looking at diagrams like this:</p>
<p><img style="display:block;margin-left:auto;margin-right:auto;" src="https://blog.timbunce.org/wp-content/uploads/2012/10/screen-shot-2012-10-04-at-22-31-09.png?w=600&#038;h=193" border="0" width="600" height="193" /></p>
<h2>Treemap Visualization</h2>
<p>The graph visualization is rather more impressive than it is practical. A more useful visualization for this kind of data is an interactive <a href="http://en.wikipedia.org/wiki/Treemapping">treemap</a>, where the size of the boxes represents the memory use and you can drill down into the data structures. To do that, and have it work on massive data dumps, I needed some kind of database and tree map code that supported on-demand loading. I opted for <a href="http://sqlite.org">SQLite</a> as the data store, the <a href="http://thejit.org">JavaScript InfoVis Toolkit</a> for the tree map code, and <a href="https://metacpan.org/module/Mojolicious::Lite">Mojolicious::Lite</a> as the web app framework.</p>
<pre>$ SIZEME='|sizeme_store.pl --db=sizeme.db' perl -MDevel::SizeMe=total_size -e 'total_size([ 1, "hi", [] ])'</pre>
<p>That&#8217;s asking sizeme_store.pl to produce a sizeme.db file. Then, to visualize the data you can run sizeme_graph.pl to launch the web app:</p>
<pre>$ sizeme_graph.pl --db=sizeme.db daemon</pre>
<p>then visit <a href="http://127.0.0.1:3000/" rel="nofollow">http://127.0.0.1:3000/</a> to see the result:</p>
<p><img style="display:block;margin-left:auto;margin-right:auto;" src="https://blog.timbunce.org/wp-content/uploads/2012/10/screen-shot-2012-10-04-at-22-58-18.png?w=600&#038;h=364" alt="Screen Shot 2012 10 04 at 22 58 18" border="0" width="600" height="364" /></p>
<p>The overall grey area, which has a title bar labeled &#8220;SV(PVAV)&#8221;, represents the total memory used by the structure. The area is divided into three parts for the three elements of the array. The smallest, labeled &#8220;[0]-&gt; SV(IV)&#8221;, is the integer. The next larger one, labeled &#8220;[1]-&gt; SV(PV)&#8221;, is the string. The largest area is the array reference. Because the referenced array was empty the logic in sizeme_graph.pl has &#8216;collapsed&#8217; the array into the parent node to simplify the tree map. This is reflected in the label &#8220;[2]-&gt; SV(RV) RV-&gt; SV(AV)&#8221;.</p>
<p>The darker box is a tooltip that moves with the pointer and displays extra detail about whatever node the pointer hovers over. In this case it&#8217;s showing that the total memory use is 88 bytes (the head and the body size of the RV and the AV have been summed up). The rest of the content is mostly debugging information. There&#8217;ll be more useful info here in future.</p>
<h2>The Whole Picture</h2>
<p>The total_size($ref) function dumps the contents of a particular data structure. But it&#8217;s not enough to get the whole picture. For that I wanted to be able to dump <em>everything</em> in a perl interpreter. Executing total_size(\%main::) gets closer to everything, but it&#8217;s still a long way off.</p>
<p>So I added a <code>perl_size()</code> function. That starts by dumping the stashes (<code>\%main::</code>, or in internals speak PL_defstash) but then goes on to dump many more items you might never have realized existed. PL_stashcache, PL_regex_padav, PL_encoding, PL_modglobal, and PL_parser to name but a few. It then records the amount of unused space in perl&#8217;s arenas.</p>
<p>Finally, it scans the arenas looking for any values that haven&#8217;t been seen yet. Currently this finds quite a lot because the <code>perl_size()</code> code isn&#8217;t complete yet. (Many thanks to <a href="https://metacpan.org/author/FLORA">rafl</a> for helping improve the coverage here.) Once it&#8217;s complete, any unseen values found in the arenas will be leaks. So Devel::SizeMe may turn into a useful leak detection tool.</p>
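<p>By analogy with the <code>total_size()</code> examples above, a whole-interpreter dump would presumably be invoked like this (same SIZEME conventions assumed):</p>
<pre>$ SIZEME='|sizeme_store.pl --db=sizeme.db' perl -MDevel::SizeMe=perl_size -e 'perl_size()'</pre>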
<p>Taking this idea further, there&#8217;s also a <code>heap_size()</code> function. The goal here is to try to account for everything in the heap. (See <a href="http://www.slideshare.net/Tim.Bunce/perl-memory-use-yapcasia2012">my slides</a> if you&#8217;re not familiar with that term.) The one key item here is asking malloc for information about how much memory it&#8217;s using and, especially, how much &#8216;free&#8217; memory it&#8217;s holding on to, for mallocs that support that.</p>
<h2>See It In Action</h2>
<p>This explanation is rather dry. To get a real sense of what Devel::SizeMe can do you need to see it in action with some non-trivial data. Here&#8217;s a screencast of my Perl Memory Use talk at YAPC::Asia (also available as a raw mov <a href="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov">here</a> and <a href="http://a20.video4.blip.tv/3990001922812/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov">here</a>, mv4 <a href="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v">here</a> and <a href="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v">here</a>, and mp4 <a href="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4">here</a> and <a href="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4">here</a>). The demonstration starts at 13:00.</p>

<h2>Simple Usage</h2>
<p>Just four steps:</p>
<ol>
<li>cpanm Devel::SizeMe # install the module</li>
<li>perl -d:SizeMe &#8230;your.script.here&#8230;</li>
<li>sizeme_graph.pl daemon</li>
<li>open <a href="http://127.0.0.1:3000/" rel="nofollow">http://127.0.0.1:3000/</a></li>
</ol>
<p>Devel::SizeMe notices that it&#8217;s been run as <code>perl -d:SizeMe</code> and arranges to automatically call <code>perl_size()</code> in an <code>END</code> block. Simple.</p>
<h2>Current Issues</h2>
<p>There are two weaknesses with the current Devel::Size logic that affect Devel::SizeMe.</p>
<p>The first is that it uses a simple depth-first search. That&#8217;s fine when just calculating a total, but for Devel::SizeMe it means that chasing references held by one named item, like a subroutine, can lead to all sorts of other items, including entire stashes, appearing to be &#8220;within&#8221; the item that held the reference. The second is that Devel::Size doesn&#8217;t have a well-defined sense of when to stop chasing references because it doesn&#8217;t consider reference counts.</p>
<p>So I plan to add a multi-phase search mechanism. References with a count of 1 will be followed immediately. References with a count greater than one will be queued, along with a count of how many times the reference has been seen so far. In this way all the &#8216;named&#8217; data reachable from <code>%main::</code> will be found first and identified with their natural names before the queued items are crawled. This should greatly improve the output.</p>
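<p>In pseudo-Perl the planned traversal might look something like this. It&#8217;s a sketch of the idea only, not Devel::Size&#8217;s actual internals; all the helper functions here are hypothetical:</p>
<pre>my @queue;  # references seen more than once, deferred
my %seen;   # addresses already visited

sub crawl {
    my ($sv, $path) = @_;
    return if $seen{ address_of($sv) }++;       # hypothetical helper
    record_size($sv, $path);                    # hypothetical helper
    for my $ref (references_held_by($sv)) {     # hypothetical helper
        if (ref_count($ref) == 1) {             # hypothetical helper
            crawl($ref, "$path/" . name_of($ref));  # follow at once
        }
        else {
            push @queue, $ref;                  # defer shared data
        }
    }
}

crawl($main_stash, '%main::');      # phase 1: named data gets natural names
crawl($_, '(shared)') for @queue;   # phase 2: crawl the deferred items</pre>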
<p>More coverage is needed in perl_size() to reduce the number of &#8216;unseen&#8217; items that show up in the arenas, as seen in the screencast.</p>
<h2>Future Plans</h2>
<p>A priority is to get my changes to the core of Devel::Size integrated back in. It would be crazy to have two modules duplicating this sometimes complex and perl-version-specific logic. My goal is to have a single C file that&#8217;s used by both modules. Each would compile it with different macros to enable the required behavior. This should ensure that Devel::Size suffers no performance loss from the extra logic that Devel::SizeMe has added.</p>
<p>I&#8217;ve already started adding some support for &#8220;named&#8221; runs. The idea is to enable the size functions to be called multiple times within a single process, and to store the data in separate tables within the database. This is an important step towards being able to compare multiple runs to see how the memory use has changed.</p>
<p>Lots of refactoring is needed to turn my conference-driven-dash-for-the-finish-line hacking into more robust and reusable code. In particular I&#8217;d like to get a reasonably stable and useful database schema so other people can write modules to process the data generated by Devel::SizeMe.</p>
<p>Further in the future I can imagine having an option to record the existence of pointers to data that&#8217;s already been seen. That information is currently discarded but would add a great deal of detail to the output. Reference loops would be much easier to see for example. It would turn the output &#8216;tree&#8217; into a <a href="http://en.wikipedia.org/wiki/Directed_graph">directed graph</a> and enable much richer visualizations.</p>
<p>We&#8217;re just at the start.</p>
<p>Enjoy.</p>
<hr />
<p>This page has been translated into <a href="http://www.webhostinghub.com/support/es/misc/presentando-devel-sizeme" rel="nofollow">Spanish</a> language by Maria Ramos  from <a href="http://www.webhostinghub.com/support/edu" rel="nofollow">Webhostinghub.com/support/edu</a>. Thank you Maria.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2012/10/05/introducing-develsizeme-visualizing-perl-memory-use/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="0" type="video/quicktime" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="0" type="video/quicktime" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://blip.tv/file/get/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a20.video4.blip.tv/3990001922812/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="124493455" type="video/quicktime" />
<enclosure url="http://a20.video4.blip.tv/3990001922812/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="124493455" type="video/quicktime" />
<enclosure url="http://a20.video4.blip.tv/3990001922812/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="124493455" type="video/quicktime" />
<enclosure url="http://a20.video4.blip.tv/3990001922812/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="124493455" type="video/quicktime" />
<enclosure url="http://a20.video4.blip.tv/3990001922812/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012311.mov" length="124493455" type="video/quicktime" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a1.video2.blip.tv/13610011292959/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012447.m4v" length="44043750" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />
<enclosure url="http://a11.video2.blip.tv/9620011293108/Timbunce-PerlMemoryUseAndDevelSizeMeAtYAPCAsia2012614.mp4" length="39601412" type="video/mp4" />

		<post-id xmlns="com-wordpress:feed-additions:1">540</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2012/10/screen-shot-2012-10-04-at-22-55-181.png" medium="image">
			<media:title type="html">Screen Shot 2012-10-04 at 22.55.18.png</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2012/10/screen-shot-2012-10-04-at-22-31-09.png" medium="image" />

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2012/10/screen-shot-2012-10-04-at-22-58-18.png" medium="image">
			<media:title type="html">Screen Shot 2012 10 04 at 22 58 18</media:title>
		</media:content>
	</item>
		<item>
		<title>A Space For Thought</title>
		<link>https://blog.timbunce.org/2012/04/08/a-space-for-thought/</link>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Sun, 08 Apr 2012 18:15:33 +0000</pubDate>
				<category><![CDATA[health]]></category>
		<category><![CDATA[life]]></category>
		<category><![CDATA[oscon]]></category>
		<category><![CDATA[toastmasters]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=511</guid>

					<description><![CDATA[This is the text of a speech I originally wrote for the International Speech Competition at my Toastmasters club in April 2012. (I won the club competition and came second in the area competition a week or so later.) In July I gave a slightly modified version, reproduced here, as a 5 minute Lightning Talk &#8230; <a href="https://blog.timbunce.org/2012/04/08/a-space-for-thought/" class="more-link">Continue reading <span class="screen-reader-text">A Space For&#160;Thought</span></a>]]></description>
										<content:encoded><![CDATA[<p>This is the text of a speech I originally wrote for the International Speech Competition at my <a href="http://www.toastmasters.org/" target="_blank">Toastmasters</a> club in April 2012. (I won the club competition and came second in the area competition a week or so later.)</p>
<p>In July I gave a slightly modified version, reproduced here, as a 5 minute Lightning Talk at OSCON in Portland OR.<span id="more-511"></span>I wrote early drafts in the first person, which I prefer to do for material rooted in personal experience, then changed it to be mostly second person as that seemed to be more effective in this case. In written form you&#8217;ll miss the gestures and delivery but hopefully the text is clear enough.</p>
<p>It&#8217;s written to be spoken quite slowly, with pauses, so please read it that way when you&#8217;ve some time to spare.</p>
<h2>A SPACE FOR THOUGHT</h2>
<p>What is the difference between thought, and the quiet awareness in the space between thoughts?</p>
<p>~pause~</p>
<p>I want to share with you the single most important thing I&#8217;ve learned in my life.</p>
<p>It&#8217;s a shift in how I relate to myself and the world around me.<br />
A change in perspective that has revealed answers to many mysteries;<br />
so much more of the world makes sense to me now.</p>
<p>I really want to share this with you, but I have a problem.</p>
<p>The key idea is so simple that, if you&#8217;re not familiar with it, you probably won&#8217;t believe me.</p>
<p>Or if you are, you may dismiss it as obvious and of no value. Missing the depth and implications of it.</p>
<p>To persuade you I could quote countless great examples from literature, science, art, and everyday life.<br />
Showing you how they fit together and make sense when viewed in this light.</p>
<p>But I don&#8217;t have time.</p>
<p>I only have time to give you a starting point, to plant a seed,<br />
and some suggestions for how to nurture it, in the hope that it can grow and blossom for you too.</p>
<p>Before I share this simple insight with you, before I plant this seed,<br />
I need <em>your</em> help to prepare the ground.<br />
I need <em>you</em> to <em>experience</em> something for yourself.</p>
<p>So please join me in a simple exercise in awareness. In paying attention.</p>
<p>Start paying attention now, to the feeling of your left foot.<br />
Just experience your left foot for a while, <em>without thinking about it</em>.</p>
<p>~pause~</p>
<p>What you&#8217;re paying attention <em>to</em> is your foot.<br />
What you&#8217;re paying attention <em>with</em> is in your head.</p>
<p>~pause~</p>
<p>We&#8217;ll do that again now but this time I&#8217;ll say something to prompt some thought.<br />
I want you to notice what happens to your attention when you start thinking.</p>
<p>Return your attention to your foot now.</p>
<p>~pause~</p>
<p>Nine plus seven.</p>
<p>~pause~</p>
<p>Did you notice your attention move away from your foot when you started thinking?<br />
The focus of your awareness moved from your foot into your mind.</p>
<p>Your full attention can&#8217;t be on a thought and something else at the same time.<br />
You need to be <em>aware</em> of the thought, just as you need to be aware of the feeling.</p>
<p>Awareness is primary. Thinking and feeling are secondary.</p>
<p>~pause~</p>
<p>So here&#8217;s the seed I want to plant:</p>
<p><em>You</em> are not your thoughts, just as you are not your feelings.<br />
You, the essence of who you really are, <em>is</em> the awareness.<br />
The conscious awareness within which your thoughts and feelings arise.</p>
<p>~pause~</p>
<p>That&#8217;s it.</p>
<p>It&#8217;s so simple, and yet so delicate.</p>
<p>Easily crushed by the weight of your own thoughts, that are constantly seeking to define you.</p>
<p>~pause~</p>
<p>Having planted the seed, I want to give you three tips for nurturing it, that I have found very helpful.</p>
<p>1st &#8211; Give your thoughts and opinions some space.</p>
<p>View them from a little distance.<br />
Note their contents but <em>don&#8217;t judge them</em>.<br />
Judging involves the thinking mind and you won&#8217;t break free.<br />
Simply note their contents and let them go.</p>
<p>Treat your thoughts as suggestions from a <em>much loved friend</em>.<br />
But a friend who you know is vain, insecure, and untrustworthy.</p>
<p>Noticing how this friend reacts to situations in your life<br />
is a <em>fascinating and rewarding</em> pastime.</p>
<p>You don&#8217;t need to watch a soap opera on TV<br />
when you can watch the one going on in your thinking mind!</p>
<p>2nd &#8211; Practice taking your attention away from thoughts<br />
whenever they&#8217;re unhappy, unproductive or unhelpful.<br />
Which, let&#8217;s face it, can be much of the time.</p>
<p>Simply bring your attention to your breathing, your foot,<br />
or anything else in the present moment.</p>
<p>3rd &#8211; Slow the momentum of the mind by bringing moments of stillness into your life regularly.</p>
<p>The phone rings &mdash; take a conscious breath with an empty mind before answering.<br />
Get in the car &mdash; take a few breaths before starting the engine.<br />
Look at nature, birds, trees, flowers, people, without labeling, judging, or other mental activity.</p>
<p>~pause~</p>
<p>The more often <em>I</em> remember to do these simple things,<br />
the more my sense of self shifts,<br />
from the noise and turmoil of the thinking mind,<br />
to being rooted in the peace beyond it.</p>
<p>So what is the difference between thought, and the quiet awareness in the space between thoughts?<br />
That&#8217;s for you to discover in your own way, if you want to,<br />
<em>but you won&#8217;t find out by thinking about it</em>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">511</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>What&#8217;s actually installed in that perl library?</title>
		<link>https://blog.timbunce.org/2011/11/16/whats-actually-installed-in-that-perl-library/</link>
					<comments>https://blog.timbunce.org/2011/11/16/whats-actually-installed-in-that-perl-library/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Wed, 16 Nov 2011 21:52:01 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[cpan]]></category>
		<category><![CDATA[metacpan]]></category>
		<category><![CDATA[presentation]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=502</guid>

					<description><![CDATA[A key part of my plan for Upgrading from Perl 5.8 is the ability to take a perl library installed for one version of perl, and reinstall it for a different version of perl. To do that you have to know exactly what distributions were installed in the original library. And not just which distributions, &#8230; <a href="https://blog.timbunce.org/2011/11/16/whats-actually-installed-in-that-perl-library/" class="more-link">Continue reading <span class="screen-reader-text">What&#8217;s actually installed in that perl&#160;library?</span></a>]]></description>
										<content:encoded><![CDATA[<p>A key part of my plan for <a href="https://blog.timbunce.org/2011/07/21/upgrading-from-perl-5-8/">Upgrading from Perl 5.8</a> is the ability to take a perl library installed for one version of perl, and reinstall it for a different version of perl.</p>
<p>To do that you have to know exactly what distributions were installed in the original library.  And not just which distributions, but which versions of those distributions.</p>
<p>I&#8217;ve a solution for that now. It turned out to be rather harder to solve than I&#8217;d thought&#8230; <span id="more-502"></span>As I mentioned <a href="https://blog.timbunce.org/2011/07/21/upgrading-from-perl-5-8/">previously</a>, I had developed a &#8220;distinctly hackish solution&#8221; that seemed to be working well. Sadly it didn&#8217;t withstand battle testing.</p>
<p>We have a library with almost 5000 modules installed from CPAN over many years. I ran that hackish script and it duly listed the distributions it thought were installed. Using that list I reinstalled them into a new library and ran <code>diff -r</code> to compare the two. That found a bunch of differences that led me into a vortex of hacking and rerunning. Eventually I had to admit that the whole approach wasn&#8217;t robust enough and I started to explore other ideas.</p>
<p>Some searching turned up <a href="http://search.cpan.org/perldoc?BackPAN::Version::Discover">BackPAN::Version::Discover</a> which is meant to &#8220;Figure out exactly which dist versions you have installed&#8221;. Perfect. Sadly it simply didn&#8217;t work well for me. Probably because it&#8217;s using a similarly flawed approach to my own.</p>
<p>I knew brian d foy&#8217;s <a href="http://blogs.perl.org/users/brian_d_foy/2011/03/recreating-a-perl-installation-with-mycpan.html">MyCPAN</a> project was working towards a similar goal. His approach required us either to run a large BackPAN indexing process ourselves or to pay to license the data to offset his costs for doing so. Neither seemed attractive.</p>
<p>I wondered about using <a href="https://github.com/gitpan">GitPAN</a> and the github API to match git blob hashes of local modules with files in the gitpan repos. Sadly GitPAN has fallen out of date and isn&#8217;t being maintained at the moment. With hindsight I&#8217;m thankful for that because it led me to a better solution.</p>
<h2>MetaCPAN</h2>
<p><a href="http://metacpan.org/about">MetaCPAN</a> is full of awesome. On the surface it looks like another kind of search.cpan.org site. Don&#8217;t be fooled. Underneath is a vast repository of CPAN metadata powered by an <a href="http://www.elasticsearch.org/">ElasticSearch</a> distributed database (based on Lucene). How vast? Every file in every distribution on CPAN (<em>and</em>, critically for me, the BackPAN archive) has been indexed in great detail. Including details like the file size and which spans of lines are code and which are pod.</p>
<p>The cherry on the cake is the <a href="https://github.com/CPAN-API/cpan-api/wiki/Beta-API-docs">RESTful API</a> that provides full access to <a href="http://www.elasticsearch.org/guide/reference/query-dsl/">ElasticSearch query expressions</a>.</p>
<p>The key &#8220;lightbulb over head&#8221; moment came when I realized I could ask MetaCPAN to &#8220;<a href="http://explorer.metacpan.org/?url=%2Ffile&amp;content=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22query%22%3A%7B%22match_all%22%3A%7B%7D%7D%2C%22filter%22%3A%7B%22and%22%3A%5B%7B%22term%22%3A%7B%22file.module.name%22%3A%22DBI%3A%3AProfile%22%7D%7D%2C%7B%22term%22%3A%7B%22file.module.version%22%3A%222.014123%22%7D%7D%5D%7D%7D%7D%2C%22fields%22%3A%5B%22release%22%5D%7D">find all releases that contain a particular version of a module</a>&#8220;. Bingo!</p>
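<p>Reproducing that query programmatically is straightforward. Here&#8217;s a minimal sketch using HTTP::Tiny and JSON::PP, posting the same kind of ElasticSearch query to the MetaCPAN <code>/file</code> endpoint as in the linked example (the endpoint path and response shape are assumed to match that example):</p>
<pre>use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json decode_json);

my ($module, $version) = ('DBI::Profile', '2.014123');

my $query = {
    query =&gt; { filtered =&gt; {
        query  =&gt; { match_all =&gt; {} },
        filter =&gt; { and =&gt; [
            { term =&gt; { 'file.module.name'    =&gt; $module  } },
            { term =&gt; { 'file.module.version' =&gt; $version } },
        ]},
    }},
    fields =&gt; [ 'release' ],
};

my $res = HTTP::Tiny-&gt;new-&gt;post(
    'http://api.metacpan.org/v0/file',  # endpoint as in the linked example
    { content =&gt; encode_json($query) },
);
die "$res-&gt;{status} $res-&gt;{reason}\n" unless $res-&gt;{success};

# Each hit names a release containing that exact module version
print "$_-&gt;{fields}{release}\n"
    for @{ decode_json($res-&gt;{content})-&gt;{hits}{hits} };</pre>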
<h2>The Method</h2>
<p>The next step was how to work out which of those candidates was the one actually installed. The key realization here was that I could use MetaCPAN to get version and file size info for all the modules in each candidate release and see how well they matched what was currently installed.</p>
<p>The whole process falls into several distinct phases&#8230;</p>
<p>The first phase finds the name, version, and file size of all the modules in the library being surveyed. (Taking care to handle an archlib nested within the main lib.)</p>
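<p>That first phase might look something like this sketch, using File::Find and <a href="https://metacpan.org/module/Module::Metadata">Module::Metadata</a> (how Dist::Surveyor actually implements it may well differ):</p>
<pre>use strict;
use warnings;
use File::Find;
use Module::Metadata;

my $lib = shift @ARGV or die "usage: $0 libdir\n";
my %installed;  # module name =&gt; { version, size }

find(sub {
    return unless /\.pm\z/;
    my $info = Module::Metadata-&gt;new_from_file($File::Find::name)
        or return;
    my $name = $info-&gt;name or return;
    $installed{$name} = {
        version =&gt; $info-&gt;version,
        size    =&gt; -s $File::Find::name,
    };
}, $lib);

printf "%s %s (%d bytes)\n",
    $_, $installed{$_}{version} || 'undef', $installed{$_}{size}
    for sort keys %installed;</pre>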
<p>Then, for every module it asks MetaCPAN for all the distribution releases that included that module version. For rarely changed modules in frequently released distributions there might be many candidates, so it tries to limit the number of candidates by also matching the file size. This is especially helpful for modules that don&#8217;t have a version number.</p>
<p>Then, for every candidate distribution release, MetaCPAN is queried to get the modules in the release, along with their version numbers and file sizes. These are compared to the data it gathered about the locally installed modules to yield a &#8220;fraction installed&#8221; figure between 0 and 1. The candidates that share the highest fraction installed are returned.</p>
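<p>The comparison itself is then simple. A sketch, assuming the <code>%installed</code> hash from the survey phase and a list of <code>{ name, version, size }</code> hashrefs for the release&#8217;s modules from MetaCPAN (both shapes are hypothetical):</p>
<pre># Returns 0..1: how much of this candidate release matches what's installed
sub fraction_installed {
    my ($installed, @release_modules) = @_;
    my $matched = 0;
    for my $m (@release_modules) {
        my $local = $installed-&gt;{ $m-&gt;{name} } or next;
        my $same_version = ($local-&gt;{version} // '') eq ($m-&gt;{version} // '');
        my $same_size    = $local-&gt;{size} == $m-&gt;{size};
        ++$matched if $same_version and $same_size;
    }
    return @release_modules ? $matched / @release_modules : 0;
}</pre>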
<p>Typically there&#8217;s just one candidate that has fraction installed of 1. A perfect match. Sometimes the fraction is less than 1 for various obscure but valid reasons. Sometimes life isn&#8217;t so simple. There may be multiple candidates that have the same highest fraction installed value. So the next phase attempts to narrow the choice from among the &#8220;best candidates&#8221; for each module. The results are gathered into a two level hash of distributions and candidate releases.</p>
<p>The final phase is the first to work in terms of distributions instead of modules. For each distribution it tries to choose among the candidate releases.</p>
<h2>The Results</h2>
<p>The method seems to work well. It identifies files with local changes. It deals gracefully with &#8216;remnant&#8217; modules that were included in an old release but not in later ones. And it copes with distributions that have been split into separate distributions.</p>
<p>It reports progress and anything unusual to stderr and writes the list of distributions to stdout. You should investigate anything that&#8217;s reported to ensure that the chosen distribution is the right one.</p>
<p>I checked the results by creating a new library (see below) and running <code>diff -r <em>old_lib new_lib</em></code>. I didn&#8217;t see any differences that I couldn&#8217;t account for.</p>
<p>The survey process is not fast. It can take a couple of hours on the first run for a large library. Most of that time is spent making MetaCPAN calls (<em>lots and lots</em> of MetaCPAN calls) so you&#8217;re dependent on network and MetaCPAN performance. Most of the calls are cached in an external file so later runs are much faster.</p>
<h2>Using The Results</h2>
<p>Using a list of distributions to recreate a library isn&#8217;t as straightforward as it might seem. You can&#8217;t just give the list to <a href="http://search.cpan.org/perldoc?cpanm">cpanm</a> because it would try to install the <em>latest</em> version of any prerequisites. I looked at using <code>--scandeps</code> or topological sorting to reorder the list to put the prerequisites first. It didn&#8217;t work out. I also looked at using <a href="http://search.cpan.org/perldoc?mcpani">CPAN::Mini::Inject</a> (and <a href="http://search.cpan.org/dist/OrePAN/" target="_blank">OrePAN</a> and <a href="http://search.cpan.org/perldoc?Pinto" target="_blank">Pinto</a>) to create a local MiniCPAN for cpanm to fetch from. They didn&#8217;t work out either, for various reasons.</p>
<p>In the end I added a <code>--makecpan <em>dir</em></code> option so that the surveyor script itself would fetch the distributions and create a MiniCPAN for cpanm to use.</p>
<p>So now a typical initial run looks like this:</p>
<p><code>    dist_surveyor --makecpan my_cpan /some/perl/lib/dir &gt; installed_dists.txt</code></p>
<p>followed by building a new library from the results:</p>
<p><code>    cpanm --mirror file:$PWD/my_cpan --mirror-only -l new_lib &lt; installed_dists.txt</code></p>
<p>If you need to rebuild the library, perhaps due to test failures, then it&#8217;s <em>much</em> faster to use a list of modules to drive cpanm. Fortunately dist_surveyor writes one for you:</p>
<p><code>    cpanm --mirror file:$PWD/my_cpan --mirror-only -l new_lib &lt; my_cpan/dist_surveyor/token_packages.txt</code></p>
<h2>Testing Bonus</h2>
<p>Speaking of test failures, I was surprised to see how often tests failed due to problems with prerequisites even though the distribution and its prerequisites had passed their tests when originally installed. For example, imagine distribution A v1, and its prerequisite B v1 are installed. Later, distribution B gets upgraded to v2 but the tests for distribution A don&#8217;t get rerun.</p>
<p>Reinstalling all the distributions forces all distributions to be tested with the prerequisites that are actually being used.</p>
<h2>Presentation Slides</h2>
<p>I gave a lightning talk on Dist::Surveyor at the <a href="http://conferences.yapceurope.org/lpw2011/">2011 London Perl Workshop</a> (always a great event) and uploaded the <a href="http://www.slideshare.net/Tim.Bunce/perl-distsurveyor-2011">slides</a>.</p>
<h2>Source Code</h2>
<p>The <a href="https://github.com/timbunce/Dist-Surveyor">repository</a> is on github and I&#8217;ve made a <a href="http://search.cpan.org/dist/Dist-Surveyor/">release</a> to CPAN.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2011/11/16/whats-actually-installed-in-that-perl-library/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">502</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Upgrading from Perl 5.8</title>
		<link>https://blog.timbunce.org/2011/07/21/upgrading-from-perl-5-8/</link>
					<comments>https://blog.timbunce.org/2011/07/21/upgrading-from-perl-5-8/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Thu, 21 Jul 2011 14:50:26 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[cpan]]></category>
		<category><![CDATA[testing]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=499</guid>

					<description><![CDATA[Imagine&#8230; You have a production system, with many different kinds of application services running on many servers, all using the perl 5.8.8 supplied by the system. You want to upgrade to use perl 5.14.1 You don&#8217;t want to change the system perl. You&#8217;re using CPAN modules that are slightly out of date but you can&#8217;t &#8230; <a href="https://blog.timbunce.org/2011/07/21/upgrading-from-perl-5-8/" class="more-link">Continue reading <span class="screen-reader-text">Upgrading from Perl&#160;5.8</span></a>]]></description>
										<content:encoded><![CDATA[<p>Imagine&#8230;</p>
<ol>
<li>You have a production system, with many different kinds of application services running on many servers, all using the perl 5.8.8 supplied by the system.
</li>
<li>You want to upgrade to use perl 5.14.1
</li>
<li>You don&#8217;t want to change the system perl.
</li>
<li>You&#8217;re using CPAN modules that are slightly out of date but you can&#8217;t upgrade them because newer versions have dependencies that require perl 5.10.
</li>
<li>The perl application codebase is large and has poor test coverage.
</li>
<li>You want developers to be able to easily test their code with different versions of perl.
</li>
<li>You don&#8217;t want a risky all-at-once &#8220;<a href="http://en.wikipedia.org/wiki/Big_bang_adoption">big bang</a>&#8221; upgrade. Individual production installations should be able to use different perl versions, even if only for a few days, and to switch back and forth easily.
</li>
<li>You want to simplify future perl upgrades.
</li>
</ol>
<p>I imagine there are lots of people in similar situations.</p>
<p>In this post I want to explore how I&#8217;m tackling a similar problem, both for my own benefit and in the hope it&#8217;ll be useful to others.<span id="more-499"></span></p>
<h2>Incremental Upgrades</h2>
<p>Perl now has an explicit <a href="http://metacpan.org/module/perlpolicy#BACKWARD-COMPATIBILITY-AND-DEPRECATION">deprecation policy</a> that requires a mandatory warning for at least one major perl version before a feature is removed. So a feature that&#8217;s removed in perl 5.14 will generate a mandatory warning, at compile time if possible, in perl 5.12.</p>
<p>This means we should <em>not</em> jump straight from perl 5.8.8 to 5.14.1. It&#8217;s important to test our code with the latest 5.10.x and 5.12.x releases along the way. That way if we do hit a problem it&#8217;ll be easier to determine the cause.</p>
<p>This also fits in with our desire to simplify future upgrades. Effectively we&#8217;re not doing one perl version upgrade but three, although we may only do one or two actual upgrades on production machines.</p>
<h2>Multiple Perls</h2>
<p>We want the developers to be able to easily test their code with different versions of perl, so we need to allow multiple versions to be installed at the same time. Fortunately <a href="http://metacpan.org/module/perlbrew">perlbrew</a> makes that easy.</p>
<p>We&#8217;ll probably have the systems team install ready-built and read-only perlbrew perls on all the machines via scp. We&#8217;ll use perlbrew as a way to get a set of perls installed, but we&#8217;ll handle the actual selection of a perl, via PATH etc., ourselves.</p>
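<p>For a local developer setup the perlbrew commands are simple (the version numbers are just examples):</p>
<pre>$ perlbrew install perl-5.10.1
$ perlbrew install perl-5.12.4
$ perlbrew install perl-5.14.1
$ perlbrew use perl-5.14.1   # select a perl for the current shell</pre>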
<h2>Multiple CPAN Install Trees</h2>
<p>Major versions of perl aren&#8217;t binary compatible with each other. This means extension modules, like DBI, which were installed for one major version of perl can&#8217;t be reused with another.</p>
<p>We keep all the code installed from CPAN in a repository, separate from the perl installation. Perl finds them using the PERL5LIB env var, and installers install there using the PERL_MB_OPT and PERL_MM_OPT env vars to set it as the &#8216;install_base&#8217;.</p>
<p>Since we want developers to switch easily between perl versions, this means we need multiple CPAN installation directories, one per <em>major</em> perl version. We&#8217;ll rebuild and reinstall the extension modules into each immediately after building and installing the corresponding perl version.</p>
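<p>For example, selecting the 5.14 tree might look like this (the directory layout here is hypothetical, but the env var semantics are standard ExtUtils::MakeMaker and Module::Build behavior):</p>
<pre>$ export PERL5LIB=/opt/cpan-5.14/lib/perl5
$ export PERL_MM_OPT="INSTALL_BASE=/opt/cpan-5.14"
$ export PERL_MB_OPT="--install_base /opt/cpan-5.14"</pre>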
<p>If we have to rebuild and reinstall the extension modules then we can easily rebuild and reinstall <em>all</em> our CPAN modules. That way we get to rerun all their test suites against each version of perl plus the specific versions of their prerequisite modules that we&#8217;re using.</p>
<h2>Reinstalling CPAN Distributions</h2>
<p>This is where it gets tricky. </p>
<p>Identifying what CPAN distributions we have installed is fairly easy. You can use tools like CPAN.pm or <a href="https://github.com/bingos/throwaway/blob/master/whatdists.pl">whatdists.pl</a> to generate a list. But there&#8217;s a catch. They&#8217;ll only tell you what <em>current distributions</em> you need to install to get the same set of modules. That&#8217;s not what we need.</p>
<p>We need a list of the <em>specific distribution versions</em> that are currently installed. It turns out that that information isn&#8217;t recorded in the installation and it&#8217;s amazingly difficult to recreate <em>reliably</em>. (The perllocal.pod file ought to have this information but isn&#8217;t updated by the Module::Build installer and doesn&#8217;t record the actual distribution name.)</p>
<p>In an extension of his MyCPAN work, brian d foy is trying to tackle this problem by creating MD5 hashes for the <em>millions</em> of files on BackPAN (the CPAN archive) but there&#8217;s <a href="http://blogs.perl.org/users/brian_d_foy/2011/03/recreating-a-perl-installation-with-mycpan.html">still much hard work ahead</a>.</p>
<p>Why do we need the specific versions, why not simply upgrade everything to the latest version first as a separate project? Two reasons.</p>
<p>First, we&#8217;re caught by the fact that some latest distributions, either directly or indirectly, require a later version of perl. (David Cantrell&#8217;s <a href="http://cpxxxan.barnyard.co.uk/">cpxxxan</a> project offers an interesting approach to this problem. E.g., use <a href="http://cp5.8.8an.barnyard.co.uk/" rel="nofollow">http://cp5.8.8an.barnyard.co.uk/</a> as the CPAN mirror to get a &#8220;latest that works on 5.8.8&#8221; view. [Thanks to ribasushi++ for the reminder.])</p>
<p>Second, having a complete list of <em>exactly</em> what we have installed also gives us easy reproducibility. Future installs will always yield exactly the same set of files, without risk of silent changes due to new releases on CPAN. The cpxxxan indices for older perls are much less likely to change, but still may. Also, if we upgraded everything to the latest using cp5.8.8an we&#8217;d need an extra testing cycle to check for problems with that upgrade before we even start on the perl upgrade.</p>
<p>After contemplating the large, ambitious, and incomplete MyCPAN project, I decided I&#8217;d try a distinctly hackish solution to this problem by extending the whatdists.pl script with a perllocal.pod parser and some heuristics. It <em>seems</em> to have worked out well. I&#8217;m going to check it by installing the distributions into a different directory and diff&#8217;ing that against the original. </p>
<p>If that works out I&#8217;ll release the code and write up a blog post about it.</p>
<h2>Installing Only Specific CPAN Distributions</h2>
<p>Normally when you install a distribution from CPAN you&#8217;re happy for the installer to fetch and install the latest version of any prerequisite modules it might need. In our situation we want to install only a specific version of each.</p>
<p>In theory we could arrange that by ordering the list such that the prerequisite modules are installed first. The <a href="http://metacpan.org/module/CPANDB">CPANDB</a> module combined with a topological sort of the requires, test_requires, and build_requires dependencies via the <a href="http://metacpan.org/module/Graph#Topological-Sort">Graph module</a> should do the trick. [Hat tip to ribasushi++ for the CPANDB suggestion.] But there&#8217;s a simpler approach&#8230;</p>
<p>I&#8217;ll probably simply duck that issue by using <a href="http://metacpan.org/module/CPAN::Mini::Inject">CPAN::Mini::Inject</a> to create a miniature CPAN that contains <em>only</em> the specific versions of the specific distributions we&#8217;re using. Then we can use the <a href="http://metacpan.org/module/cpanm">cpanm</a> <code>--mirror</code> and <code>--mirror-only</code> options to install from that mini CPAN.</p>
<h2>Extending Test Coverage</h2>
<p>All the above will give developers the ability to switch perl versions with ease, while keeping exactly the same set of CPAN modules. So now we can turn our attention to testing.</p>
<p>Our test coverage could charitably be described as spotty. Getting it up to a good level across all our code is simply not viable in the short term.</p>
<p>So for now I&#8217;m setting a <em>very</em> low goal: simply get <em>all</em> the perl modules and scripts compiled. You could say I&#8217;m aiming for 100% &#8220;compilation coverage&#8221; :-)</p>
<p>This will make all the developers aware of the basic mechanics of testing, like <a href="http://metacpan.org/module/Test::Most">Test::Most</a> and <a href="http://metacpan.org/module/prove">prove</a>, and it gives us a good baseline to increase coverage from. More importantly, in the short term it lets us detect any compile-time deprecation warnings as we test with perl 5.10 and 5.12.</p>
<p>To ensure 100% (compilation) coverage I&#8217;ll use <a href="http://metacpan.org/module/Devel::Cover">Devel::Cover</a> to do coverage analysis and write a utility, probably using <a href="http://metacpan.org/module/Devel::CoverX::Covered">Devel::CoverX::Covered</a>, to find <em>all</em> our perl scripts and modules and check that each of them has at least been compiled.</p>
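<p>The simplest form of that baseline is a single test file along these lines, assuming the modules live under <code>lib/</code> (scripts would get a similar loop running <code>perl -c</code>):</p>
<pre style="background-color:#ddd;margin:1em;padding:1em;">use strict;
use warnings;
use Test::Most;
use File::Find;

# Sketch of a t/00-compile.t: require() every module under lib/.
my @modules;
find(sub {
    return unless /\.pm$/;
    my $name = $File::Find::name;
    $name =~ s{^lib/}{};
    $name =~ s{/}{::}g;
    $name =~ s{\.pm$}{};
    push @modules, $name;
}, 'lib');

plan tests =&gt; scalar @modules;
require_ok($_) for @modules;
</pre>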
<h2>Summary</h2>
<ul>
<li>Multiple perl versions, via perlbrew.
</li>
<li>Multiple identical CPAN install trees, one per major perl version.
</li>
<li>Proven 100% compilation coverage as a minimum.
</li>
</ul>
<p>So, that&#8217;s the plan.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2011/07/21/upgrading-from-perl-5-8/feed/</wfw:commentRss>
			<slash:comments>10</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">499</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Building a different kind of extension</title>
		<link>https://blog.timbunce.org/2011/06/29/building-a-different-kind-of-extension/</link>
					<comments>https://blog.timbunce.org/2011/06/29/building-a-different-kind-of-extension/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Wed, 29 Jun 2011 21:44:29 +0000</pubDate>
				<category><![CDATA[local]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=496</guid>

					<description><![CDATA[For the past year I&#8217;ve been rather distracted, with little time to devote to Open Source projects. I&#8217;ve been working on a different kind of project, adding an extension to our home. It&#8217;s been quite a journey. After much planning (the plumbing Statement of Works, for example, covers four pages), and our fair share of &#8230; <a href="https://blog.timbunce.org/2011/06/29/building-a-different-kind-of-extension/" class="more-link">Continue reading <span class="screen-reader-text">Building a different kind of&#160;extension</span></a>]]></description>
										<content:encoded><![CDATA[<p>For the past year I&#8217;ve been rather distracted, with little time to devote to Open Source projects. I&#8217;ve been working on a different kind of project, adding an extension to our home. It&#8217;s been quite a journey.</p>
<p>After much planning (the plumbing Statement of Works, for example, covers four pages), and our fair share of trials and tribulations, the builders broke ground two weeks ago. Now, after days of digging and rock-breaking, the foundation trenches are all dug out and the concrete will be poured tomorrow morning. Finally, we&#8217;ll be &#8220;out of the ground&#8221;.</p>
<p><img style="display:block;margin-left:auto;margin-right:auto;" src="https://blog.timbunce.org/wp-content/uploads/2011/06/img_0404.jpg?w=600&#038;h=450" alt="Digging foundations" border="0" width="600" height="450" /></p>
<p>Naturally I want to be around to handle issues as they arise, so this year I won&#8217;t be going to OSCON or YAPC::EU. If all goes well we should be completed in time for me to attend the <a href="http://lanyrd.com/2011/london-perl-workshop/">London Perl Workshop</a> in November.</p>
<p>Meanwhile I hope to find a little time for catching up on outstanding issues with DBI and NYTProf and perhaps a little more blogging.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2011/06/29/building-a-different-kind-of-extension/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">496</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2011/06/img_0404.jpg" medium="image">
			<media:title type="html">Digging foundations</media:title>
		</media:content>
	</item>
		<item>
		<title>Looking for a Senior Developer job? TigerLead is Hiring again in West LA</title>
		<link>https://blog.timbunce.org/2011/04/14/looking-for-a-senior-developer-job-tigerlead-is-hiring-again-in-west-la/</link>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Thu, 14 Apr 2011 17:50:58 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[postgresql]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=486</guid>

					<description><![CDATA[The company I work for, TigerLead.com, has another job opening in West LA: As a Senior Developer, you will be playing a central role in the design, development, and delivery of cutting-edge web applications for one of the most heavily-trafficked networks of real estate sites on the web. You will work in a small, collaborative &#8230; <a href="https://blog.timbunce.org/2011/04/14/looking-for-a-senior-developer-job-tigerlead-is-hiring-again-in-west-la/" class="more-link">Continue reading <span class="screen-reader-text">Looking for a Senior Developer job? TigerLead is Hiring again in West&#160;LA</span></a>]]></description>
										<content:encoded><![CDATA[<p>The company I work for, <a href="http://www.tigerlead.com/">TigerLead.com</a>, has another job  opening in West LA: </p>
<blockquote><p>As a Senior Developer, you will be playing a central role in the design, development, and delivery of cutting-edge web applications for one of the most heavily-trafficked networks of real estate sites on the web. You will work in a small, collaborative environment with other seasoned pros and with the direct support of the company&rsquo;s owners and senior management. Your canvas and raw materials include rich data sets totaling several million property listings replenished daily by hundreds of external data feeds. This valuable data and our powerful end-user tools to access it are deployed across several thousand real estate search sites used by more than a million home-buyer leads and growing by 50K+ users each month. The 1M+ leads using our search tools are in turn tracked and cultivated by the several thousand real estate professionals using our management software. This is an outstanding opportunity to see your creations immediately embraced by a large community of users as you work within a creative and supportive environment that is both professional and non-bureaucratic, offering the positives of a start-up culture without the drama and instability.</p></blockquote>
<p>If that sounds like interesting work to you then take a look at the <a href="http://www.tigerlead.com/jobs/senior-web-developer.html">full job posting</a>.</p>
<p>TigerLead is a lovely company to work for and this is a great opportunity. Highly recommended.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">486</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>java2perl6api &#8211; Java to Perl 6 API translation &#8211; What, Why, and Whereto</title>
		<link>https://blog.timbunce.org/2010/07/16/java2perl6api-java-to-perl-6-api-tranalation-what-why-and-whereto/</link>
					<comments>https://blog.timbunce.org/2010/07/16/java2perl6api-java-to-perl-6-api-tranalation-what-why-and-whereto/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Fri, 16 Jul 2010 17:12:59 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[dbdi]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[perl6]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=460</guid>

					<description><![CDATA[In this post I&#8217;m going to talk about the java2perl6api project. What its goals are, why I think it&#8217;s important, how it relates to a Perl 6 DBI, what exists now, what needs doing, and how you can help. Firstly I&#8217;d like to point out that, funnily enough, I&#8217;m not very familiar with Java or &#8230; <a href="https://blog.timbunce.org/2010/07/16/java2perl6api-java-to-perl-6-api-tranalation-what-why-and-whereto/" class="more-link">Continue reading <span class="screen-reader-text">java2perl6api &#8211; Java to Perl 6 API translation &#8211; What, Why, and&#160;Whereto</span></a>]]></description>
										<content:encoded><![CDATA[<p>In this post I&#8217;m going to talk about the java2perl6api project. What its goals are, why I think it&#8217;s important, how it relates to a Perl 6 DBI, what exists now, what needs doing, and how you can help.<br />
<span id="more-460"></span></p>
<p>Firstly I&#8217;d like to point out that, funnily enough, I&#8217;m not very familiar with Java or Perl6. It&#8217;s entirely possible that I&#8217;ll make all sorts of errors in the following details. If you spot any do please let me know.</p>
<h2>Background</h2>
<p>The Java language ecosystem is big and mature after years of heavy investment of time and money.</p>
<p>It doesn&#8217;t have a central repository of Open Source modules like CPAN (though <a href="http://en.wikipedia.org/wiki/Apache_Maven">Maven</a> repositories <a href="http://download.java.net/maven/1/">like</a> <a href="http://repo1.maven.org/maven2/">these</a> are similar I guess). It does, however, have a number of mature high quality class libraries, and a very large number of developers familiar with those libraries (more on that below).</p>
<h2>Goals</h2>
<p>The primary goal of the java2perl6api project is to make it easy to create Perl 6 class libraries that <em>mirror</em> Java equivalents. By <em>mirror</em> I mean share the same method names and semantics at a high level (though not at a low-level, more on that below).</p>
<p>Secondary goals are to do that well enough that:</p>
<ul>
<li>the documentation for Java classes can serve as the primary documentation for the corresponding Perl 6 classes. The Perl 6 classes need only document the differences in behavior, which should be minimal and &#8216;natural&#8217;. The same applies to books describing the Java classes.
</li>
<li>Java developers familiar with the Java classes should feel comfortable working with the corresponding Perl 6 classes.
</li>
<li>and, hopefully, some way can be found to convert test suites for the Java classes into Perl 6 code that&#8217;ll test the corresponding Perl 6 classes. (I appreciate that this is a non-trivial proposition, but there are viable approaches available, like <a href="http://www.xmlvm.org/overview/">xmlvm</a>.) Even if that can&#8217;t be done, extracting and translating tests manually is less work, and more effective, than creating them from scratch for a new API.
</li>
</ul>
<h2>Why?</h2>
<p>Firstly, creating good APIs is hard. Java APIs like <a href="http://download.oracle.com/docs/cd/E17409_01/javase/6/docs/technotes/guides/jdbc/">JDBC 3.0</a> and <a href="http://java.sun.com/developer/technicalArticles/javase/nio/">NIO.2</a> are the result of years of professional effort and demanding commercial experience. Why not build on that experience?</p>
<p>I appreciate that Java APIs are often limited by the constraints of the language, such as the lack of closures, and that Perl 6 can probably express any given set of semantics more effectively than Java. My point here is that some Java APIs embody, however inelegantly, years of hard won experience that we can benefit from. I&#8217;d rather make new mistakes than repeat old ones.</p>
<p>Secondly, there are many more Java developers than Perl developers. Many <em>many</em> more if job vacancies are any indication:</p>
<p><img src="https://i0.wp.com/www.indeed.com/trendgraph/jobgraph.png" alt="job vacancy trends for perl developer and java developer" height="300" width="540" /></p>
<p>I think we&#8217;d be foolish not to try to smooth the path for any Java developers who might be interested in Perl 6. The java2perl6api project is just one small aspect of that.</p>
<p>I really hope someone starts writing a &#8220;Perl 6 for Java Developers&#8221; tutorial. Perl 6 has the potential to become a very popular language<sup><a href="#1">1</a></sup>. Getting just a tiny percentage of Java developers (and Computer Science majors and their teachers) interested in it could be a big help.</p>
<p>Thirdly, any future DBI for Perl 6 and Parrot needs a much better foundation than the very limited and poorly defined one that <a href="http://search.cpan.org/~timb/DBI-1.611/lib/DBI/DBD.pm">underlies the Perl 5 DBI</a>. I plan to adopt the JDBC 3.0 API <em>and test suite</em> for that <em>internal</em> role. (You could call this a &#8220;Test Suite Driven Strategy&#8221;.) I&#8217;ll talk more about that in a future blog post.</p>
<h2>The History of java2perl6api</h2>
<p>I&#8217;ve been kicking around various ideas for integrating Java and Perl6/Parrot for years. I think I first decided to use JDBC as the inspiration for the DBI-to-driver API in 2006.</p>
<p>You may remember back in 2004, around the 10th anniversary of the DBI, the <a href="http://www.perlfoundation.org/">Perl Foundation</a> set up a &#8220;DBI Development Fund&#8221; that people could <a href="http://dbi.perl.org/donate/">donate</a> to. I&#8217;ve never drawn any money from that fund. I want to use it to oil other people&#8217;s wheels.</p>
<p>In 2007 <a href="http://news.perlfoundation.org/2007/03/best-practical-sponsors-perl-6.html">Best Practical sponsored Perl 6 Microgrants</a> through the Perl Foundation. I asked if I could piggyback my idea for a Java to Perl 6 API translator onto their microgrant management process but using money from the DBI Development Fund. TPF and Best Practical kindly agreed. I posted a description of the task and Phil Crow volunteered and was <a href="http://news.perlfoundation.org/2007/04/phil-crow-to-create-jdbc-api-f.html">awarded the microgrant</a> in April 2007.</p>
<p>At OSCON in July 2007 I gave a lightning talk called &#8220;<a href="http://www.slideshare.net/Tim.Bunce/dbi-for-parrot-and-perl-6-lightning-talk-2007">Database interfaces for open source languages suck</a>&#8221; which explained the rationale for using JDBC as a foundation for the DBI-to-driver API and mentioned Phil&#8217;s java2perl6 project.</p>
<p>Development ground to a halt around the end of 2007 for various reasons. It picked up again for a few months after OSCON 2009 (where I gave a short lightning talk asking for help) then stalled again in October, partly because we seemed to have hit a limitation with Rakudo and partly because I was focussed on Devel::NYTProf <a href="https://blog.timbunce.org/2009/12/24/nytprof-v3-worth-the-wait/">version 3</a> and then <a href="https://blog.timbunce.org/2010/06/09/nytprof-v4-now-with-string-eval-x-ray-vision/">version 4</a>, which took <em>way</em> more time than I expected.</p>
<p>There&#8217;s life in the project again now. We&#8217;ve dodged the earlier problem, put the <a href="http://github.com/timbunce/java2perl6">code on github</a>, brought it into sync with current <a href="http://rakudo.org/">Rakudo</a> Perl 6 syntax, and generally instilled some momentum.</p>
<h2>The Current java2perl6api</h2>
<p>Let&#8217;s take a look at a simple example.</p>
<p>To generate a perl6 file that mirrors the API of the java.sql.Savepoint class you&#8217;d just execute java2perl6api like this:</p>
<pre style="background-color:#ddd;margin:2em;padding:1em;">$ java2perl6api java.sql.Savepoint
loading java.sql.Savepoint
wrote java/sql/Savepoint.pm6 - interface java.sql.Savepoint
checking java/sql/Savepoint.pm6 - interface java.sql.Savepoint
</pre>
<p>That&#8217;s loaded and parsed the description of the java.sql.Savepoint class (from the <a href="http://download.oracle.com/docs/cd/E17476_01/javase/1.5.0/docs/tooldocs/windows/javap.html">javap</a> command), generated a corresponding perl6 module, and run perl6 to validate it.</p>
<p>The generated module (with some whitespace and cruft removed) looks like this:</p>
<pre style="background-color:#ddd;margin:1em;padding:1em;">use v6;
role java::sql::Savepoint {
    method getSavepointId (
    --&gt; Int   #  int
    ) { ... }
    method getSavepointName (
    --&gt; Str   #  java.lang.String
    ) { ... }
};
=begin pod
=head1 Java
  Compiled from "Savepoint.java"
  public interface java.sql.Savepoint{
      public abstract int getSavepointId() throws java.sql.SQLException;
      public abstract java.lang.String getSavepointName() throws java.sql.SQLException;
  }
=end pod
</pre>
<p>The pod section shows the description of the class that javap returned. The java2perl6api utility parsed that <a href="http://download.oracle.com/docs/cd/E17409_01/javase/tutorial/java/concepts/interface.html">Java interface</a> and generated the corresponding <a href="http://perlcabal.org/syn/S14.html#Roles">Perl6 role</a>. The name &#8216;java.sql.Savepoint&#8217; has been mapped to &#8216;java::sql::Savepoint&#8217;. The generated methods are stubs using <code>...</code> (the &#8220;yada, yada, yada&#8221; operator). The types int and java.lang.String have been mapped to Int and Str. Because the only types used were built-ins, no type declarations were added.</p>
<p>Currently java2perl6api handles the above, plus overloaded methods (which generate <a href="http://perlcabal.org/syn/S12.html#Multisubs_and_Multimethods">multi methods</a>) and multiple implements clauses (which generate multiple <a href="http://perlcabal.org/syn/S14.html#Compile-time_Composition">does</a> clauses). There&#8217;s also partial support for class/interface constants (which currently generate exported methods).</p>
<p>The default behavior is to recursively process any Java types referenced by the class which aren&#8217;t mapped to Perl 6 types. So executing <code>java2perl6api java.sql.Connection</code>, for example, will generate 48 Perl 6 modules! (Because <code>java.sql.Connection</code> refers to many types, including <code>java.sql.Array</code> which refers to many types including <code>java.sql.ResultSet</code> which refers to <code>java.net.URL</code> which refers to <code>java.net.Proxy</code> etc. etc.) The <code>--norecurse</code> option disables this behavior.</p>
<p>Normally you&#8217;ll want to use the recursion but, instead of letting it drill <em>all</em> the way into the Java types, you can supply your own &#8216;typemap&#8217; specification via an option. That tells java2perl6api which Java types you want to map to which Perl 6 types. So instead of recursing into the <code>java.net.URL</code> type to generate a <code>java/net/URL.pm6</code> file, for example, you can tell java2perl6api to use a specific Perl 6 type. Perhaps just <code>Str</code> for now.</p>
<h2>How this relates to JDBC / DBDI / DBI v2</h2>
<p>I want to start applying java2perl6api to the <a href="http://download.oracle.com/docs/cd/E17409_01/javase/6/docs/technotes/guides/jdbc">JDBC</a> classes now to create a &#8220;Database Driver Interface&#8221; or &#8220;DBDI&#8221; for Perl 6.</p>
<p>Starting with the <a href="http://download-llnw.oracle.com/docs/cd/E17409_01/javase/6/docs/api/java/sql/DriverManager.html">DriverManager</a> class and the <a href="http://download-llnw.oracle.com/docs/cd/E17409_01/javase/6/docs/api/java/sql/Connection.html">Connection</a>  interface I&#8217;ll use java2perl6api to generate corresponding Perl 6 roles with <em>heavy</em> stubbing out of types. Basically anything I don&#8217;t need to think about right now will be mapped to the <code>Any</code> type.</p>
<p>I&#8217;ll start fleshing out some basic implementation logic for each in a Perl 6 class that <a href="http://perlcabal.org/syn/S14.html#Compile-time_Composition">does</a> the corresponding role. I&#8217;ll probably use PostgreSQL as the first driver and the guts of <a href="http://github.com/mberends/MiniDBI/blob/master/lib/MiniDBD/Pg.pm6">MiniDBD::Pg</a> as inspiration.</p>
<p>The first minor milestones will be creating connections, then executing non-selects, then selects, then prepared statements. Somewhere along the way I expect there&#8217;ll be a Perl 6 DBDI driver implemented for the <a href="http://blogs.perl.org/users/martin_berends/2010/06/rakudo-perl-6-gets-into-databases.html">Perl 6 MiniDBI project</a>. The next key step would be to start refactoring the code heavily so anyone wanting to implement a new driver should only have to implement the driver-specific parts. (There are some JDBC driver toolkits that can provide useful ideas for that.)</p>
<h2>What needs doing</h2>
<p>There&#8217;s a <a href="http://github.com/timbunce/java2perl6/blob/master/TODO">TODO file in the repository</a> that lists the current items that need working on.</p>
<p>One fairly simple item is to add a <code>--prefix</code> option to specify an extra leading name for the generated role. So <code>java.sql.Savepoint</code> with a prefix of <code>DBDI</code> would generate a <code>DBDI::java::sql::Savepoint</code> role.</p>
<p>Another item, less simple but more important, is to automatically discover the values of constants and embed them into the generated file. Probably the best way to do that is to extend <a href="http://github.com/timbunce/java2perl6/blob/master/lib/Java/Javap/javap.grammar">the parser</a> (which uses <a href="http://search.cpan.org/perldoc?Parse::RecDescent">Parse::RecDescent</a>) to parse the verbose-mode output of javap, which includes those details.</p>
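<p>For anyone unfamiliar with Parse::RecDescent, here&#8217;s a toy grammar in the same style (not the project&#8217;s actual grammar) parsing the kind of field declaration javap emits:</p>
<pre style="background-color:#ddd;margin:1em;padding:1em;">use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent-&gt;new(q{
    field    : modifier(s) type name ';'
        { $return = { type =&gt; $item{type}, name =&gt; $item{name} }; }
    modifier : 'public' | 'static' | 'final' | 'abstract'
    type     : /[\w.\[\]]+/
    name     : /\w+/
}) or die "bad grammar";

my $f = $parser-&gt;field('public static final int TYPE_FORWARD_ONLY;');
print "$f-&gt;{name} has type $f-&gt;{type}\n" if $f;
</pre>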
<p>There are <a href="http://github.com/timbunce/java2perl6/blob/master/TODO">plenty of others</a>.</p>
<h2>How you can get involved</h2>
<p>Firstly, come and say &#8220;Hi!&#8221; in the <a href="irc://chat.freenode.net/#dbdi">#dbdi</a> IRC channel on irc.freenode.net.</p>
<p>The code is on <a href="http://github.com/timbunce/java2perl6">github</a>. You can get commit access by asking on the <a href="irc://chat.freenode.net/#perl6">#perl6</a> channel.</p>
<p>There&#8217;s also a mailing list at <a href="mailto:dbdi-dev@perl.org">dbdi-dev@perl.org</a> which you can <a href="mailto:dbdi-dev-subscribe@perl.org">subscribe</a> to.</p>
<p>I look forward to hearing from you!</p>
<hr />
<ol>
<li><a name="1"></a><br />
When I say &#8220;Perl 6 has the potential to become a very popular language&#8221; I do so with typical British <a href="http://en.wikipedia.org/wiki/Understatement">Understatement</a>.
</li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2010/07/16/java2perl6api-java-to-perl-6-api-tranalation-what-why-and-whereto/feed/</wfw:commentRss>
			<slash:comments>10</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">460</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="http://www.indeed.com/trendgraph/jobgraph.png?q=%22perl+developer%22%2C%22java+developer%22" medium="image">
			<media:title type="html">job vacancy trends for perl developer and java developer</media:title>
		</media:content>
	</item>
		<item>
		<title>NYTProf 4.04 &#8211; Came, Saw Ampersand, and Conquered</title>
		<link>https://blog.timbunce.org/2010/07/09/nytprof-4-04-came-saw-ampersand-and-conquered/</link>
					<comments>https://blog.timbunce.org/2010/07/09/nytprof-4-04-came-saw-ampersand-and-conquered/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Fri, 09 Jul 2010 21:06:24 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[nytprof]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=457</guid>

					<description><![CDATA[Please forgive the title! Perl has three regular expression match variables ( $&#38; $` $' ) which hold the string that the last regular expression matched, the string before the match, and the string after the match, respectively. As you&#8217;re probably aware, the mere presence of any of these variables, anywhere in the code, even &#8230; <a href="https://blog.timbunce.org/2010/07/09/nytprof-4-04-came-saw-ampersand-and-conquered/" class="more-link">Continue reading <span class="screen-reader-text">NYTProf 4.04 &#8211; Came, Saw Ampersand, and&#160;Conquered</span></a>]]></description>
										<content:encoded><![CDATA[<p><em>Please forgive the title!</em></p>
<p>Perl has three regular expression match variables ( <code>$&amp; $` $'</code> ) which hold the string that the last regular expression matched, the string before the match, and the string after the match, respectively.</p>
<p>As you&#8217;re probably aware, the mere presence of <em>any</em> of these variables, <em>anywhere</em> in the code, even if never accessed, will slow down <em>all</em> regular expression matches in the <em>entire</em> program. (See the WARNING at the end of the <a href="http://perldoc.perl.org/perlre.html#Capture-buffers">Capture Buffers section of the perlre documentation</a> for more information.)</p>
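<p>A small illustration, including the <code>/p</code> flag that perl 5.10 introduced as a penalty-free alternative:</p>
<pre style="background-color:#ddd;margin:1em;padding:1em;">use strict;
use warnings;

my $string = "flocks of birds";

# Mentioning $&amp; anywhere makes *every* match in the program
# pay a copying penalty:
if ($string =~ /\bbirds\b/) {
    print "matched: $&amp;\n";
}

# On perl 5.10+ the /p flag provides per-match equivalents
# that avoid the global slowdown:
if ($string =~ /\bbirds\b/p) {
    print "matched: ${^MATCH}\n";
}
</pre>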
<p>Clearly this is not good.<br />
<span id="more-457"></span></p>
<p>I&#8217;ve long planned to add detection and reporting of this to <a href="http://search.cpan.org/dist/Devel-NYTProf/">Devel::NYTProf</a>, along with things like method cache invalidation, but it&#8217;s never risen to the top of the list. In fact, now I look, I see it never even got entered into the ever-growing collection of ideas recorded in the <a href="http://cpansearch.perl.org/src/TIMB/Devel-NYTProf-4.04/HACKING">HACKING</a> file.</p>
<p>After the 4.00 release, plus a few minor releases, I&#8217;d put NYTProf on hold and was starting to focus on my java2perl6 API translation project (more news on that soon).</p>
<p>Then I saw a recent <a href="http://www.effectiveperlprogramming.com/blog/140">blog post by Josh McAdams</a>, one of the authors of <a href="http://www.amazon.com/exec/obidos/ASIN/0321496949/theeffeperl-20">Effective Perl Programming</a> (along with Joseph N. Hall and brian d foy) about detecting these variables using the <a href="http://search.cpan.org/perldoc?Devel::SawAmpersand">Devel::SawAmpersand</a> and <a href="http://search.cpan.org/perldoc?Devel::FindAmpersand">Devel::FindAmpersand</a> modules. Firstly it reminded me of the issue, and then it struck me that few people would bother using those tools because they simply <em>wouldn&#8217;t know they had the problem</em> in the first place.</p>
<p>Someone with a performance problem is likely to use a profiler like NYTProf to see where time is being spent in their code. That might point out that significant time is being spent in regular expressions, but even then they might not make the leap to consider these special match variables as a possible cause. <em>The profiler should point it out to them!</em></p>
<p>NYTProf version 4.03 didn&#8217;t. Clearly that was not good. So NYTProf version 4.04 now does!</p>
<p>In the list of files on the index page it highlights the file and adds a comment:</p>
<p><img src="https://blog.timbunce.org/wp-content/uploads/2010/07/nytprof-highlighted-file-on-index-page.png?w=815&#038;h=157" alt="highlighted file on index page" border="0" width="815" height="157" /></p>
<p>On the report page for the file itself it adds an unmissable, and hopefully self-explanatory, note to the top of the page:</p>
<p><img src="https://blog.timbunce.org/wp-content/uploads/2010/07/nytprof-note-on-report-page.png?w=670&#038;h=186" alt="note on report page" border="0" width="670" height="186" /></p>
<p>I&#8217;d be very interested to hear from anyone who now discovers these problem variables lurking in their application code or any CPAN modules.</p>
<p>Go take a look!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2010/07/09/nytprof-4-04-came-saw-ampersand-and-conquered/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">457</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2010/07/nytprof-highlighted-file-on-index-page.png" medium="image">
			<media:title type="html">highlighted file on index page</media:title>
		</media:content>

		<media:content url="https://blog.timbunce.org/wp-content/uploads/2010/07/nytprof-note-on-report-page.png" medium="image">
			<media:title type="html">note on report page</media:title>
		</media:content>
	</item>
		<item>
		<title>Reflections on Perl and DBI from an Early Contributor</title>
		<link>https://blog.timbunce.org/2010/07/08/reflections-on-perl-and-dbi-from-an-early-contributor/</link>
					<comments>https://blog.timbunce.org/2010/07/08/reflections-on-perl-and-dbi-from-an-early-contributor/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Thu, 08 Jul 2010 12:48:27 +0000</pubDate>
				<category><![CDATA[perl]]></category>
		<category><![CDATA[cpan]]></category>
		<category><![CDATA[dbi]]></category>
		<category><![CDATA[perl4]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=449</guid>

					<description><![CDATA[The name Buzz Moschetti probably isn&#8217;t familiar to you. Buzz was the author of the Perl 4 database for Interbase known as Interperl. Back in those days Perl 5 was barely a twinkle in Larry&#8217;s eye and database interfaces for Perl 4 required building a custom perl binary. Buzz was one of the four people &#8230; <a href="https://blog.timbunce.org/2010/07/08/reflections-on-perl-and-dbi-from-an-early-contributor/" class="more-link">Continue reading <span class="screen-reader-text">Reflections on Perl and DBI from an Early&#160;Contributor</span></a>]]></description>
										<content:encoded><![CDATA[<p>The name Buzz Moschetti probably isn&#8217;t familiar to you. Buzz was the author of the Perl 4 database for <a href="http://en.wikipedia.org/wiki/InterBase#History">Interbase</a> known as <a href="http://cpan.perl.org/modules/dbperl/perl4/interperl/README">Interperl</a>.</p>
<p>Back in those days Perl 5 was barely a twinkle in Larry&#8217;s eye and <a href="http://cpan.perl.org/modules/dbperl/perl4/">database interfaces for Perl 4</a> required building a custom perl binary.</p>
<p>Buzz was one of the four people to get the email on September 29th 1992 from Ted Lemon that started the <a href="http://cpan.perl.org/modules/dbperl/DBI/perldb-interest/">perldb-interest</a> project which defined a <a href="http://cpan.perl.org/modules/dbperl/DBI/dbispec.v04">specification</a> that ultimately lead to the DBI. (The other people were Kurt Andersen re informix, Kevin Stock re <a href="http://cpan.perl.org/modules/dbperl/perl4/oraperl/">oraperl</a>, and Michael Peppler re <a href="http://cpan.perl.org/modules/dbperl/perl4/sybperl/">sybperl</a>. I joined a few days later.)</p>
<p><strong>Update</strong>: It turns out that it was actually Buzz who sent that original email, Ted just forwarded it on to others, including me. So Buzz can be said to have started the process that led to the DBI!</p>
<p>I hadn&#8217;t heard from Buzz for <em>many</em> years until he sent me an email recently.</p>
<p>This is his story:<span id="more-449"></span></p>
<hr />
<p>Thought I&#8217;d share a quick story with you.</p>
<p>Recently, I was frustrated with a development team&#8217;s efforts in putting together some DB-oriented reconciliations.  The candidate solution was a blend of precompiled SQL in COBOL code, file dumps and ftps, programs to read files, more programs to read other DBs, etc. etc.   Not only was the process orchestration a project in its own right, the end-to-end logic required to accurately perform the reconciliation was distributed across several programs and platforms, diluting the knowledge base.  I knew a perl program using multiple DBD drivers to different DB engines could do it in a much cleaner way, but over the years my job has changed and although I still use perl regularly, I don&#8217;t do much in the way of DBD/DBI.   To make matters worse, one of the targets was mainframe DB2 and very little work had been done here with DBD::DB2.   Also, the Sybperl module continues to be heavily used in addition to DBD::Sybase, so local DBD/DBI expertise in general is thin.  I decided to get it working on my own.</p>
<p>The infrastructure team spun up for me a Linux virtual machine with a modern build environment on it.  This had the latest gcc compilers and a firm-approved build of perl 5.8.5 right out of the box.  It took a few days of low-priority requests to get the appropriate 32bit Linux client-side SDKs for the DB2 and Sybase products but soon enough I had an environment set up with headers and shared libs.  I was ready to build some perl modules, something I haven&#8217;t done in years.</p>
<p>I went to CPAN and downloaded DBD::DB2, untar&#8217;d it, and ran perl Makefile.PL and make.  Everything worked perfectly and the whole exercise took minutes.  &#8216;make test&#8217; sets PERL_DL_NONLAZY and warned of some unused symbols not being found, but that was OK.  The rest of the tests that I expected to work with my level of permissions worked fine. &#8216;make install&#8217; worked perfectly.  Buoyed by this success, I wrote a 4-liner test program just to connect and fetch some data from a table I knew about.  Outside of the test environment, however, the shared libs for DB2 were not found so I cheated and relinked and reinstalled DB2.so with the -Wl,-rpath option to &#8220;cement in&#8221; the location of those libs so I wouldn&#8217;t have to fuss with LD_LIBRARY_PATH.   My test program now worked fine.  Newly comfortable with the process, I downloaded DBD::Sybase and built and installed the module in scarcely more time than it took for the compiler to run.  In my excitement I skipped over the DBD::Sybase 4-liner test program and went straight to a slightly bigger script that used both modules and grabbed data from both DBs.  It quietly and quickly executed.  </p>
<p>Total time from initial download with almost no clues to a running example: about 40 minutes.  Later, for grin&#8217;s sake, I threw in DBD::Oracle for good measure.  That went even faster &#8212; about 5 minutes &#8212; from CPAN download to printing &#8220;Oracle connected!&#8221; because I was more familiar with the connection string syntax that is bespoke for each engine.</p>
<p>As I watched the program run, it made me reflect on how far we&#8217;ve come and how easy yet sophisticated the perl module ecosystem has become. There is no question that this multi-DBD perl program is easier to understand and support than a solution involving a set of disconnected programs, platforms, and files.  But I think it is the organization and design of the resources as a whole &#8212; DBI, DBD, CPAN, MakeMaker, pod, binary and non-binary library locations, etc. &#8212; that makes the whole environment so clear, symmetric, and easy to use with confidence.  I think back to the build environment that I used to create <a href="http://cpan.perl.org/modules/dbperl/perl4/interperl/README">interperl</a>, and the progress that has been made in terms of both breadth of module functionality and depth of framework for module build portability is simply amazing.  Perl has grown far beyond just being another language. It has a value proposition as an able integrator of widely disparate functionality.  </p>
<p>I exited the Perl mainstream some time ago but I am watching from the side and I applaud the work you&#8217;ve done in this space.</p>
<p>Take care.</p>
<hr />
<p>Thanks Buzz!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2010/07/08/reflections-on-perl-and-dbi-from-an-early-contributor/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">449</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Looking for a new job? TigerLead is also Hiring in Ann Arbor MI</title>
		<link>https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-also-hiring-in-ann-arbor-mi/</link>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Fri, 02 Jul 2010 19:41:27 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[ruby]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=443</guid>

					<description><![CDATA[In addition to the job vacancy in West LA, the company I work for, TigerLead.com, has an opening for a &#8220;skilled developer&#8221; in Ann Arbor, Michigan: Our work involves manipulating and warehousing external data feeds and developing web interfaces to create home search tools for prospective buyers and lead management tools for real estate agents. &#8230; <a href="https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-also-hiring-in-ann-arbor-mi/" class="more-link">Continue reading <span class="screen-reader-text">Looking for a new job? TigerLead is also Hiring in Ann Arbor&#160;MI</span></a>]]></description>
										<content:encoded><![CDATA[<p>In addition to the <a href="https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-hiring-in-west-la/">job vacancy in West LA</a>, the company I work for, <a href="http://www.tigerlead.com/">TigerLead.com</a>, has an opening for a &#8220;skilled developer&#8221; in Ann Arbor, Michigan:</p>
<blockquote><p> Our work involves manipulating and warehousing external data feeds and developing web interfaces to create home search tools for prospective buyers and lead management tools for real estate agents. We&#8217;re looking for a skilled coder to join our small team of talented engineers in Ann Arbor. We hope to find an experienced programmer who is a good fit with our team, well-versed in multiple languages, able to learn quickly and work independently. We work in a Linux environment, and tools and languages we use include Perl, Ruby on Rails, PostgreSQL, and GIT. Perl experience is a significant plus, but your current comfort level with any of these specific tools is less important than overall technical aptitude and ability to learn quickly and fit in well with the current team.</p></blockquote>
<p>That&#8217;s a little thin on details partly because the work is varied. If you think you might be interested, take a look at the <a href="http://annarbor.craigslist.org/eng/1804836163.html">full job posting</a>.</p>
<p>TigerLead is a lovely company to work for and this is a great opportunity. Highly recommended.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">443</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
		<item>
		<title>Looking for a new job? TigerLead is Hiring in West LA</title>
		<link>https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-hiring-in-west-la/</link>
					<comments>https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-hiring-in-west-la/#comments</comments>
		
		<dc:creator><![CDATA[TimBunce]]></dc:creator>
		<pubDate>Fri, 02 Jul 2010 17:37:22 +0000</pubDate>
				<category><![CDATA[software]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[ruby]]></category>
		<guid isPermaLink="false">http://timbunce.wordpress.com/?p=439</guid>

					<description><![CDATA[The company I work for, TigerLead.com, has an opening for a &#8220;skilled coder / database wrangler&#8221;. We&#8217;re looking for a skilled coder / database wrangler to play a key role within our Operations and Engineering teams. The various responsibilities of the job include working with the large databases underlying our real estate search tools, setting &#8230; <a href="https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-hiring-in-west-la/" class="more-link">Continue reading <span class="screen-reader-text">Looking for a new job? TigerLead is Hiring in West&#160;LA</span></a>]]></description>
										<content:encoded><![CDATA[<p>The company I work for, <a href="http://www.tigerlead.com/">TigerLead.com</a>, has an opening for a &#8220;skilled coder / database wrangler&#8221;. </p>
<blockquote><p>We&#8217;re looking for a skilled coder / database wrangler to play a key role within our Operations and Engineering teams. The various responsibilities of the job include working with the large databases underlying our real estate search tools, setting up services for new clients, communicating with clients to evaluate bug reports, troubleshooting technical issues escalated by our client services team, and interfacing with the engineering team on systems maintenance and development. The scope of work that we do involves managing hundreds of external data feeds that feed into in-house databases totaling several million property listings. These listing databases power hundreds of real estate search sites used by more than a million home-buyer leads, who are tracked and cultivated by the thousands of Realtors using our management software. This position is critical to the robustness of these systems.</p></blockquote>
<p>If that sounds like interesting work to you then take a look at the <a href="http://losangeles.craigslist.org/wst/eng/1821042952.html">full job posting</a>.</p>
<p>TigerLead is a lovely company to work for and this is a great opportunity. Highly recommended.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.timbunce.org/2010/07/02/looking-for-a-new-job-tigerlead-is-hiring-in-west-la/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">439</post-id>
		<media:content url="https://0.gravatar.com/avatar/c1f8fff6645793f1615f748a0e33dfd3a4bf238f63095a180d01899515f628c7?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">TimBunce</media:title>
		</media:content>
	</item>
	</channel>
</rss>
