tag:blogger.com,1999:blog-52469877556510652862019-05-21T09:56:09.637-07:00cbloom rantscbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.comBlogger720125tag:blogger.com,1999:blog-5246987755651065286.post-36571360087865723202019-05-21T09:40:00.002-07:002019-05-21T09:56:09.627-07:00On the entropy of distance coding Let's look at a little math problem that I think is both amusing and illuminating, and is related to why LZ77 distance coding is fundamentally inefficient for sending a sequence of repeated tokens. <P> The toy problem we will study is this : <P> <FONT COLOR=GREEN> Given an array of random bytes, you for some silly reason decide to compress them by sending the distance to the most recent preceding occurance of that byte. <P> So to send 'A' you count backward from your current position P ; if there's an A at P-1, you send a 0, if it's at P-2 send a 1, etc. <P> What is the entropy of this distance encoding? </FONT> <P> This "compressor" is silly, it expands (it will send more than 8 bits per byte). How much does it expand? <P> First of all, why are we looking at this. It's a boiled down version of LZ77 on a specific type of data. LZ77 sends repeated strings by sending the distance to the previous occurance of that string. Imagine your data consists of a finite set of tokens. Each token is several bytes and imagine the tokens are drawn from a random source, and there are 256 of them. If you had the indices of the tokens, you would just have a random byte sequence. But LZ77 does not have that and cannot convert the stream back to the token indexes. Instead the best it can do is to parse on token boundaries, and send the distance to the previous occurance of that token. <P> This does occur in the real world. For example consider something like a palettized image. The palette indices are bytes that refer to a dictionary of 256 colors. If you are given the image to compress after de-palettizing, you would see something like 32-bit RGBA tokens that repeat. Once you have seen the 256 unique colors, you no longer need to send anything but distances to previous occurances of a color. English text is also a bit like this, with the tokens equal to a word+punction string of variable length. <P> So back to our toy problem. To get the entropy of the distance coding, we need the probability of the distances. <P> To find that, I think about coding the distance with a sequence of binary coding events. Start at distance = 0. Either the current byte matches ours, or it doesn't. Chance of matching is (1/256) for random bytes. The chance of no-match is (255/256). So we multiply our current probability by (1/256) and the probability of all future distances by (255/256), then move to the next distance. (perhaps it's easier to imagine that the preceding bytes don't exist yet; rather you are drawing a random number as you step back in distance; any time you get your own value you stop) <P> This gives you the probability distribution : <FONT COLOR=GREEN><PRE> P(0) = (1/256) P(1) = (255/256) * (1/256) P(2) = (255/256)^2 * (1/256) ... P(n) = (255/256)^n * (1/256) </PRE></FONT> an alternative way to get the same distribution is to look at distance D. First multiply by the probability that it is not a lower distance (one minus the sum of all lower distance probabilities). Then the probability that it is here is (1/256). That's : <FONT COLOR=GREEN><PRE> P(0) = (1/256) P(1) = (1 - P(0)) * (1/256) P(2) = (1 - P(0) - P(1)) * (1/256) ... P(n) = (1 - P(0) - P(1) .. - P(n-1)) * (1/256) </PRE></FONT> <P> which is equal to the first way. <P> Given this distribution we can compute the entropy : <FONT COLOR=GREEN><PRE> H = - Sum_n P(n) * log2( P(n) ) starting at n = 0 let x = (255/256) P(n) = (1-x) * x^n log2( P(n) ) = log2( 1-x ) + n * log2( x ) H = - Sum_n (1-x) * x^n * ( log2( 1-x ) + n * log2( x ) ) terms that don't depend on n pull out of the sum : H = - (1-x) * log2( 1-x ) * Sum_n x^n - (1-x) * log2( x ) * Sum_n n * x^n we need two sums : G = Sum_n x^n S = Sum_n n * x^n G is just the geometric series G = 1 / (1 - x) recall the trick to find G is to look at G - x*G we can use the same trick to find S S - x*S = G - 1 the other standard trick to find S is to take the d/dx of G either way you get: S = x / (1-x)^2 H = - (1-x) * log2( 1-x ) * G - (1-x) * log2( x ) * S H = - (1-x) * log2( 1-x ) * ( 1 / (1-x) ) - (1-x) * log2( x ) * ( x / (1-x)^2 ) H = - log2( 1-x ) - log2( x ) * ( x / (1-x) ) putting back in x = (255/256) 1-x = 1/256 the first term "- log2( 1-x )" is just 8 bits, send a byte H = 8 + 255 * log2( 256 / 255 ) H = 8 + 1.43987 bits about 9.44 bits or if we look at general alphabet size N now instead of source bytes H = log2(N) + (N-1) * log2( N / (N-1) ) recalling of course log2(N) is just the number of bits it would take to code a random symbol of N possible values we'll look at the expansion due to distance coding, H - log2(N) if we now take the limit of N large H - log2(N) -> (N-1) * log2( N / (N-1) ) H - log2(N) -> (N-1) * log2( 1 + 1 / (N-1) ) H - log2(N) -> log2( ( 1 + 1 / (N-1) ) ^ (N-1) ) H - log2(N) -> log2( ( 1 + 1/N ) ^ N ) ( 1 + 1/N ) ^ N -> e !! H - log2(N) -> log2( e ) = 1.44269 bits </PRE></FONT> for large alphabet, the excess bits sent due to coding distances instead of indices is log2(e) !! <P> I thought it was pretty entertaining for Euler to show up in this problem. <P> Why is distance coding fundamentally inefficient (vs just coding the indices of these repeated tokens) ? It's due to repetitions of values. <P> If our preceding bytes were "ABCB" and we now wish to code an "A" , it's at distance = 3 because we had to count the two B's. Our average distance is getting bumped up because symbols may occur multiple times before we find our match. If we did an LZ77 coder that made all substrings of the match length unique, we would not have this inefficiency. (for example LZFG which sends suffix trie node indices rather than distances does this) <P> We can see where this inefficiency appears in the probabilities: <FONT COLOR=GREEN><PRE> if you are drawing a random number from 256 at each step keep stepping until you get a match each time you have to draw from all 256 possibilities (repeats allowed) P(0) = (1/256) P(1) = (255/256) * (1/256) P(2) = (255/256)^2 * (1/256) ... P(n) = (255/256)^n * (1/256) (as above) instead imagine drawing balls from a hat once you draw a ball it is discarded so it cannot repeat stop when you match your byte first draw has the same 1/256 chance of match : P(0) = (1/256) as before multiply by 1-P to get the probability of continuing but now we only have 255 balls left, so the chance of a match is (1/255) P(1) = (255/256) * (1/255) = (1/256) current P was (1/255) so multiply the next by (254/255) now we have 254 , so we match (1/254) of the time : P(2) = (255/256) * (254/255) * (1/254) = (1/256) ... P(n) = (1/256) it's just a flat probability. </PRE></FONT> decrementing the alphabet size by one each time makes a big difference. <P> This theoretical coding loss is also observed exactly in the real world. <P> <FONT COLOR=GREEN><PRE> experiment : make 100,000 random bytes make a palette of 256 32-bit random dwords use the 100,000 random bytes to index the palette to output 400,000 bytes what can LZ77 on these 400,000 bytes? Our theoretical analysis says the best possible is 9.44 bits per palette index plus we have to send 1024 bytes of the palette data as well best = 100,000 * 9.44/8 + 1024 = 119,024 real world : Leviathan Optimal5 random_bytes_depal.bin : 400,000 -> 119,034 = 2.381 bpb = 3.360 to 1 </PRE></FONT> it turns out this theoretical limit is almost exactly achieved by Oodle Leviathan, only 10 bytes over due to headers and so on. <P> Leviathan is able to hit this limit (unlike simpler LZ77's) because it will entropy code all the match signals to nearly zero bits (all the matches will be "match len 4" commands, so they will have a probability near 1 and thus entropy code to zero bits). Also the offsets are in multiples of 4, but Leviathan will see the bottom bits are always zero and not send them. The result is that Leviathan sends the matches in this kind of data just as if it was sending a distance in tokens (as opposed to bytes) back to find the repeated token, which is exactly what our toy problem looked at. <P> We cannot do better than this with LZ77-style distance coding of matches. If we want to beat this we must induce the dictionary and send id references to whole words. Leviathan and similar LZ's cannot get as close to the optimum on text, because we don't handle tokens of varying length as well. In this kind of depalettized data, the match length is totally redundant and should be coded in zero bits. With text there is content-length correlation which we don't model at all. <P> Also note that we assume random tokens here. The loss due to sending distances instead of token id's gets even worse if the tokens are not equiprobable, as the distances are a poor way to capture the different token probabilities. <P> <B> Summary : </B> <P> The coding loss due to sending repeated tokens by distance rather than by index is at least log2(e) bits for large alphabet. This theoretical limit acts as a real world limit on the compression that LZ77 class algorithms can achieve on depalettized data. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/Hpn9Rdo6DKQ" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-88279480430662886892019-04-09T16:39:00.003-07:002019-04-10T08:04:45.982-07:00Oodle 2.8.0 Release Oodle 2.8.0 is now out. This release continues to improve the Kraken, Mermaid, Selkie, Leviathan family of compressors. There are no new compressors or major changes in this release. We continue to move towards retiring the pre-Kraken codecs, but they are still available in Oodle 2.8 <P> The major new features in Oodle 2.8 are : <P> <UL> <LI> "Job" threading of the encoders in Oodle Core. This allows you to get multi-threaded encoding of medium size buffers using Oodle Core on your own thread system (or optionally use the one provided in OodleX). See a whole section on this below. <P> <LI> Faster encoders, particularly the optimals, and particularly Optimal1. They're 1 - 2X faster even without Jobify threading. <P> <LI> Better limits on memory use, particularly for the encoders. You can now query the memory sizes needed and allocate all the memory yourself before calling Oodle, and Oodle will then do no allocations. see OodleLZ_GetCompressScratchMemBound, example_lz_noallocs, and the FAQ. </UL> <P> An example of the encoder speed improvement on the "seven" test set, measured with ect on a Core i7 3770, Kraken levels 5 (Optimal1) and 7 (Optimal3) : <FONT COLOR=GREEN><PRE> Oodle 2.7.5 : ooKraken5 : 3.02:1 , 3.3 enc MB/s , 1089.2 dec MB/s ooKraken7 : 3.09:1 , 1.5 enc MB/s , 1038.1 dec MB/s Oodle 2.8.0 : (without Jobify) ooKraken5 : 3.01:1 , 4.6 enc MB/s , 1093.6 dec MB/s ooKraken7 : 3.08:1 , 2.3 enc MB/s , 1027.6 dec MB/s Oodle 2.8.0 : (with Jobify) ooKraken5 : 3.01:1 , 7.2 enc MB/s , 1088.3 dec MB/s ooKraken7 : 3.08:1 , 2.9 enc MB/s , 1024.6 dec MB/s </PRE></FONT> <P> See <A HREF="http://www.radgametools.com/oodlehist.htm">the full change log</A> for more. <P><HR><P> <B> About the new Jobify threading of Oodle Core : </B> <P> Oodle Core is a pure code lib (as much as possible) that just does memory to memory compression and decompression. It does not have IO, threading, or other system dependencies. (that's provided by Oodle Ext). The system functions that Oodle Core needs are accessed through function pointers that the user can provide, such as for allocations and logging. We have extended this so you can now plug in a Job threading system which Oodle Core can optionally use to multi-thread operations. <P> Currently the only thing we will multi-thread is OodleLZ_Compress encoding of the new LZ compressors (Kraken, Mermaid, Selkie, Leviathan) at the Optimal levels, on buffers larger than one BLOCK (256 KB). In the future we may multi-thread other things. <P> Previously if you wanted multi-threaded encoding you had to split your buffers into chunks and multi-thread at the chunk level (with or without overlap), or by encoding multiple files simultaneously. You still can and should do that. Oodle Ext for example provides functions to multi-thread at this granularity. Oodle Core does not do this for you. I refer to this as "macro" parallelism. <P> The Oodle Core provides more "micro" parallelism that uses multiple cores even on individual buffers. It parallelizes at the BLOCK level, hence it will not get any parallelism on buffers <= one BLOCK (256 KB). <P> Threading of OodleLZ_Compress is controlled by the OodleLZ_CompressOptions:jobify setting. If you don't touch it, the default value (Jobify_Default) is to use threads if a thread system is plugged in to Oodle Core, and to not use threads if no thread system is plugged in. You may change that option to explicitly control which calls try to use threads and which don't. <P> OodleX_Init plugs the Oodle Ext thread system in to Oodle Core. So if you use OodleX and don't touch anything, you will now have Jobify threading of OodleLZ_Compress automatically enabled. You can specify Threads_No in OodleX_Init if you don't want the OodleX thread system. If you use OodleX you should NOT plug in your own thread system or allocators into Oodle Core - you must let OodleX provide all the plugins. The Oodle Core plugins allow people who are not using OodleX to provide the systems from their own engine. <P> WHO WILL SEE AN ENCODE PERF WIN : <P> If you are encoding buffers larger than 1 BLOCK (256 KB). <P> If you are encoding at level Optimal1 (5) or higher. <P> If you use the new LZ codecs (Kraken, Mermaid, Selkie, Leviathan) <P> If you plug in a job system, either with OodleX or your own. <P> CAVEAT : <P> If you are already heavily macro-threading, eg. encoding lots of files multi-threaded, using all your cores, then Jobify probably won't help much. It also won't hurt, and might help ensure full utilization of all your cores. YMMV. <P> If you are encoding small chunks (say 64 KB or 256 KB), then you should be macro-threading, encoding those chunks simultaneously on many threads and Jobify does not apply to you. Note when encoding lots of small chunks you should be passing pre-allocated memory to Oodle and reusing that memory for all your compress calls (but not sharing it across threads - one scratch memory buffer per thread!). Allocation time overhead can be very significant on small chunks. <P> If you are encoding huge files, you should be macro-threading at the chunk level, possibly with dictionary backup for overlap. Contact RAD support for the "oozi" example that demonstrates multi-threaded encoding of huge files with async IO. <P> NOTE : All the perf numbers we post about and shows graphs for are for single threaded speed. I will try to continue to stick to that. <P><HR><P> <B> A few APIs have changed & the CompressOptions struct has changed. </B> <P> This is why the middle version number (8) was incremented. When the middle ("major") version of Oodle is the same, the Oodle lib is binary link compatible. That means you can just drop in a new DLL without recompiling. When the major version changes you must recompile. <P> A few APIs have small signature changes : <FONT COLOR=GREEN><PRE> OodleLZ_GetDecodeBufferSize, OodleLZ_GetCompressedBufferSizeNeeded and OodleLZ_GetInPlaceDecodeBufferSize : take compressor argument to return smaller padding for the new codecs. OodleLZ_GetChunkCompressor API : take compressed size argument to ensure it doesn't read past end </PRE></FONT> these should give compile errors and be easy to fix. <P> The CompressOptions struct has new fields. Those fields may be zero initialized to get default values. So if you were initializing the struct thusly : <FONT COLOR=GREEN><PRE> struct OodleLZ_CompressOptions my_options = { 1, 2, 3 }; </PRE></FONT> the new fields on the end will be implicitly zeroed by C, and that is fine. <P> NOTE : I do NOT recommend that style of initializing CompressOptions. The recommended pattern is to GetDefaults and then modify the fields you want to change : <FONT COLOR=GREEN><PRE> struct OodleLZ_CompressOptions my_options = OodleLZ_CompressOptions_GetDefault(); my_options.seekChunkReset = true; my_options.dictionarySize = 256*1024; </PRE></FONT> then after you set up options you should Validate : <FONT COLOR=GREEN><PRE> OodleLZ_CompressOptions_Validate(&my_options); </PRE></FONT> Validate will clamp values into valid ranges and make sure that any constraints are met. Note that if Validate changes your options you should really look at why, you shouldn't be shipping code where you rely on Validate to clamp your options. <P><HR><P> <B> WARNINGS : </B> <P> example_lz before 2.8.0 had a bug that caused it to stomp the user-provided input file, if one was provided. <P> YES IT WOULD STOMP YOUR FILE! <P> That bug is not in the Oodle library, it's in the example code, so we did not issue a critical bug fix for it, but please beware running the old example_lz with a file argument. If you update to the 280 SDK please make sure you update the *whole* SDK including the examples, not just the lib! <P> On Windows it is very important to not link both Oodle Core and Ext. The Oodle Ext DLL includes a whole copy of Oodle Core - if you use OodleX you should ONLY link to the Ext DLL, *not* both. <P> Unfortunately because of the C linkage model, if you link to both Core and Ext, the Oodle Core symbols will be multiply defined and just silently link without a warning or anything. That is not benign. (It's almost never benign and C needs to get its act together and fix linkage in general). It's specifically not benign here, because Oodle Ext will be calling its own copy of Core, but you might be calling to the other copy of Core, so the static variables will not be shared. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/vvYDD6S-uFQ" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-58218488461886580492019-04-09T16:38:00.002-07:002019-04-09T16:38:32.708-07:00Benchmarking Oodle with ozip -b The <A HREF="https://cbloomrants.blogspot.com/2018/07/oodle-ozip-for-command-line-compression.html">ozip</A> utility is designed to act mostly like gzip. A compiled executable of ozip is provided with Oodle for easy testing, or you may download <A HREF="https://github.com/jamesbloom/ozip/">ozip source on github</A>. <P> ozip now has a benchmarking option (ozip -b) which is an easy way to test Oodle. <P> ozip -b runs encode & decode many times to provide accurate timing. It does not include IO. It was designed to be similar to zstd -b so that they are directly comparable. <P> ozip -b can take a file or a dir (plus wildcard), in which case it will test all the files in the dir. You can set up the specific compressor and options you want to test to see how they affect performance and compression ratio. <P> So for example you can test the effect of spaceSpeedTradeoffBytes on Kraken level Optimal1 : <FONT COLOR=GREEN><PRE> r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os512 K 5 mozilla : 51220480 -> 14288373 (3.585), 3.5 MB/s, 1080.4 MB/s r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla K 5 mozilla : 51220480 -> 14216948 (3.603), 3.5 MB/s, 1048.6 MB/s r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os128 K 5 mozilla : 51220480 -> 14164777 (3.616), 3.5 MB/s, 1004.6 MB/s </PRE></FONT> Or to test Kraken HyperFast3 on all the files in Silesia : <FONT COLOR=GREEN><PRE> r:\>ozip -b -c8 -ol-3 r:\testsets\silesia\* K-3 12 files : 211938580 -> 81913142 (2.587), 339.0 MB/s, 1087.6 MB/s </PRE></FONT> <P><HR><P> Another option for easy testing with Oodle is example_lz_chart, which is also provided as a pre-compiled exe and also as source code. <P> example_lz_chart runs on a single file you provide and prints a report of the compression ratio and speed of various Oodle compressors and encode levels. <P> This gives you an overview of the different performance points you can hit with Oodle. <P><HR><P> WARNING : <P> Do not try to profile Oodle by process timing ozip. <P> The normal ozip (not -b mode) uses stdio and is not designed to be as efficient as possible. It's designed for simplicity and to replicated gzip behavior when used for streaming pipes on UNIX. <P> In general it is not recommended to benchmark by timing with things like IO included because it's very difficult to do that right and can give misleading results. <P> See also : <P> <A HREF="https://cbloomrants.blogspot.com/2016/05/tips-for-benchmarking-compressor.html"> Tips for benchmarking a compressor </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2018/06/the-perils-of-holistic-profiling.html"> The Perils of Holistic Profiling </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2017/02/tips-for-using-oodle-as-efficiently-as.html"> Tips for using Oodle as Efficiently as Possible </A> <BR> <P><HR><P> NOTE : <P> ozip does not have any threading. ozip -b is benchmarking single threaded performance. <P> This is true even for the new Jobify threading because ozip initializes OodleX without threads : <FONT COLOR=GREEN><PRE> OodleX_Init_Default(OODLE_HEADER_VERSION,OodleX_Init_GetDefaults_DebugSystems_No,OodleX_Init_GetDefaults_Threads_No); </PRE></FONT> <P> I believe that zstd -b is also single threaded so they are apples to apples. However some compressors uses threads by default (LZMA, LZHAM, etc.) so if they are being compared they should be set to not use threads OR you should use Oodle with threads. Measuring multi-threaded performance is context dependent (for example are you encoding many small chunks simultaneously?) and I generally don't recommend it, it's much easier to compare fairly with single threaded performance. <P> For high performance on large files, ask for the "oozi" example. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/m1aTISCcYmA" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-40700859538888130482019-01-20T12:03:00.002-08:002019-01-20T12:03:26.922-08:00The Oodle LZNIB Algorithm Oodle LZNIB is an LZ77-family compressor which has very good decode speed (about 5X faster than Zip/deflate, about 1/2 the speed of LZ4) while getting better compression than Zip/deflate. This was quite unique at the time it came out. It is now made obsolete by Oodle Mermaid/Selkie. <P> I thought it might be interesting to write up LZNIB and look at some of the things we learned from it. <P> LZNIB is a non-entropy coded LZ (*) which writes values in nibbles and bytes. Literals are written uncompressed (8 bits). <P> (* = actually it's more like "semi entropy coded"; it doesn't use bit io, and doesn't use standard entropy coders like Huffman, but the variable length integer encoding was adjusted to minimize entropy, and can vary to fit the file's statistics; more details on this later) <P> LZNIB can send three actions : literal run ("lrl") , match with offset ("normal match"), match with repeat offset ("rep match"). <P> One of the key realizations for LZNIB is that even though there are three actions, only two are possible at any time. <FONT COLOR=GREEN><PRE> After Literals : LRL should not occur (would have just been a longer LRL) match & rep match are possible After Match or Rep Match : rep match should not occur (would have just been a longer match) normal match & lrl are possible </PRE></FONT> because there are only two actions possible at each point, we can send this using a nibble with a decision threshold value : <FONT COLOR=GREEN><PRE> Threshold T values in [0 , T-1] are action type 1 values in [T , 15] are action type 2 </PRE></FONT> So the core decode step of LZNIB is to grab the next nibble, do one branch, and then process that action. The values within your threshold group are the first part of your LRL or match len. There are at least two different thresholds, one for {match/rep-match} in the after-literals state, and one for {match/lrl} in the after-match state. In LZNIB we hard-coded the threshold for match/rep-match to 5 as this was found to be good on most data. The optimal {match/lrl} threshold is more dependent on the data. <P> Approximately, (T/16) should be the probability of action type 1. This is in fact exactly what entropy coding this binary action choice would do. What entropy coding does is take your range of possible encodable values and splits them by the probability of the binary choice. The remaining values in the word can then be split to send further choices. Here we just do the first choice using a semi-entropy-coding, then use the remainder of the word to send as much of our next value ("len") as we can. (the fact that we do not entropy code "len" and that len has probability peaks is why the probability based choice of T might not be actually optimal) <P> In LZNIB the threshold T could be transmitted per chunk so that it could be optimized to fit the data. In practice that was rarely used, and mostly the default value of T=8 was used. Part of why it was rarely used is due to the difficulty of parse-statistics feedback. The parse must make decisions based on some assumed T, because it affects coding cost. Then after you have your parse choices you can make a choice for optimal T for that data, but if that T is different you might want to re-parse with the new T. This is a mess and the simplest way to address it here is just to parse for all values of T. You can reduce the set of likely useful T values to around 8 or 10 and just do the parse 8-10X times to make a T choice. This in fact works great and helps compression a lot on some data, but is obviously slow. <P> In contrast to something like LZ4, LZNIB has a flexible action sequence. That is, LZ4 is always doing {LRL,match,LRL,match,...}. For example to send two matches in a row you must send an LRL of length 0 between them. LZNIB has a flexible action sequence, therefore requires a branch, it could send {LRL,match,match,LRL,rep match,match,...} <P> LZNIB uses unbounded offsets for matches. They are sent using a variable length integer scheme. Initially 12 bits are sent, then a byte is added as necessary. The scheme used is "encodemod", which treats each value sent as being divided in two ranges. One range is for values that can terminate in the current word, the other range is for values that don't fit and includes a portion of the remaining value. See the <A HREF="http://cbloomrants.blogspot.com/2015/11/flipped-encodemod.html"> encodemod post series </A> for details. <P> The encodemod scheme is very flexible and can be tuned to fit the entropy characteristics of the data (unlike traditional bit-packing variable length integer schemes, which have a less flexible knob). To do this we gathered LZ parses for many files and found the best encodemod thresholds. This was done for the offsets, for LRL, match len (after match), match len (after literal), and rep match len. <P> All the lens (LRL, match len, etc.) are sent first in the control nibble. If they don't fit, the maximum is sent in the control nibble, then the remainder is sent using encodemod. The encodemod used for lens sends first another nibble, then any further values are sent with bytes. <P> The full decoder for LZNIB in pseudocode is : <FONT COLOR=GREEN><PRE> Do a literal run to start. After-literal state : Get control nibble if nib < 5 { rep match if nib == 4 get further ml using encodemod, first nibble then bytes copy rep match } else { normal match if nib == 15 get further ml using encodemod, first nibble then bytes get offset using encodemod, first 12 bits then bytes copy match } After-match state : Get control nibble if nib < T { literal run if nib == T-1 get further lrl using encodemod, first nibble then bytes copy literal run goto After-literal } else { normal match if nib == 15 get further ml using encodemod, first nibble then bytes get offset using encodemod, first 12 bits then bytes copy match goto After-match } </PRE></FONT> That's it. <P><HR><P> LZNIB is simple to decode but it turned out to be quite hard to parse well. Making good choices in the encoder can have an absolutely huge affect on the decode speed, even at roughly equal compressed size. Some of those issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be important again for Kraken & Mermaid/Selkie. <P> One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in units of nibbles. Those ties need to be broken in favor of faster decode speed. <P> To good approximation, decode speed is just about minimizing the number of control nibbles. You want as few transitions between actions as possible, you prefer to stay in long literal runs and long matches. You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size with fewer mode switches. <P> Let's look at some simple examples to get a feel for these issues. <P> Consider a rep match of len 1. A rep match control takes 1 nibble, while sending a literal instead takes 2 nibbles, so sending a rep instead would save you 1 nibble. But, if the rep is followed by another literal run, you would have to send a new lrl control nibble there, while staying in an lrl might save you that. <P> <FONT COLOR=GREEN><PRE> L = literal , costs 2 nibbles + 1 at start of lrl R = rep match, costs 1 control nibble M = normal match, costs 1 control nibble + 3 or more nibbles for offset LLR = 1+2*2+1 = 6 nibbles LLL = 1+3*2 = 7 nibbles so LLR is best right? take the rep! LLRL = 1+2*2+1 + 1+2 = 9 nibbles LLLL = 1+4*2 = 9 nibbles No! Nibble count is the same but fewer controls = prefer the literal. It depends what follows the rep. LLRM LLLM in this case the rep is cheaper. So if a literal follows the rep, don't take it, right? LLLLRLLLL = 1+2*4 + 1 + 1+2*4 = 19 nibbles LLLLLLLLL = 1+2*9 + 1 = 20 nibbles No! In the second case the LRL of 9 can't fit in the control nibble, so an extra lrl nibble must be sent. So prefer the rep here. </PRE></FONT> So the simple choice of "do I take a rep of len 1 or stay in LRL" is not easy and can only be made non-locally. <P> A similar thing happens with normal matches. A normal match of len 3 with an offset that fits in 12 bits takes 4 nibbles, which saves you 2 vs sending 3 literals. But if you have to resume an LRL after the match, that costs you 1, so your savings is down to 1. There may be cheaper ways (in terms of decode speed) to get that 1 nibble savings, such as a rep of len 1 for example. Or if you can trade two len 3 matches for a single len 4 match : <FONT COLOR=GREEN><PRE> MMMLMMML = 4 + 3 + 4 + 3 = 14 LLMMMMLL = 5 + 4 + 5 = 14 same nibble count, but fewer mode switches = prefer the single len 4 match over two len 3's </PRE></FONT> A len 3 match that doesn't fit in the 12 bit offset (but does fit in the next threshold, at 20 bits) takes 6 nibbles to send, which is a tie with 3 literals. But again it could actually save you a nibble if it causes the LRL to fit in control. <P> You might be tempted to just punt this, eg. make rep matches have a minimum length of 2, and normal matches have a minimum length of 4. However that is leaving a lot of compression on the table. The shorter matches only save a nibble here and there, but on some files there are many of those possible. For the optimal parsers we wanted to be able to get those wins when they are possible. <P><HR><P> The challenge of optimal parsing LZNIB. <P> The standard approach for optimal parsing a format with rep matches is the "forward arrivals" style parse (as in LZMA). (this is in contrast to the classical backwards LZSS optimal parse which can only be used in formats that do not have adaptive state which is parse-path dependent). <P> See some posts on forward-arrivals parse : <A HREF="https://cbloomrants.blogspot.com/2015/01/01-23-15-lza-new-optimal-parse.html"> here </A> and <A HREF="https://cbloomrants.blogspot.com/2016/06/06-09-16-fundamentals-of-modern-lz-two.html"> here </A> <P> The standard forward arrivals parse does not work well with LZNIB. In the forward-arrivals parse, you take the cheapest way to arrive at pos P, you cost all ways to go forward from P (with a literal, or various match options), and fill out the costs at P+1, P+len, etc. <P> The problem is that it doesn't account for the fact that the best way to arrive at pos P depends on what comes next (in particular, do literals or matches come next, and if literals, how many?). It treats each action as being independent. <P> We'll look at some ways to improve the forward arrivals parse but first a detour. Fortunately in this case it's possible to solve this systematically. <P> Whenever I face a difficult heuristic like this where we know we're approximating in some way and don't know quite the best way, the first thing I always look for is - can we solve it exactly? (or with some bounded error). The more exact solution might be too slow and we won't actually be able to ship it, but it gives us a target, it lets us know how close our approximation is, and may give us guidance on how to get there. <P> In this case what we can do is a full dynamic programming parse with the LRL as part of the state matrix. <P> Optimal parsing is always a form of dynamic programming. We often approximate it by ignoring the internal state of the parse and making an array of only [pos]. What I mean by "full dynamic programming" is to make the state explicit and use a 2d array of [pos][state]. Then on each parse choice, you look at how that steps position (eg. by match len) and also how that changes the internal state, and you move to the next array slot. In this case the important state variable is LRL. <P> (we're treating a single literal as a coding action, which it is not, and correcting that by considering LRL as a state variable. The result is that we get correct code costs for all LRL steps from each pos.) <P> (note that the offset which is used for rep matches is also an important internal state variable, but we are continuing to ignore that as is usual; we do store a different rep arrival in each [pos][lrl] entry, but we don't differentiate between arrivals that only differ by rep offset) <P> We consider LRL's in [0,21]. This lets us capture the transition of LRL not fitting in the control nibble (at 7, with the default threshold of 8), and then again when it overflows the first nibble of encodemod (at 7+15=21). LRL value of 21 is a state for all higher LRL's, so we don't account for when later bytes of encodemod overflow. <P> We make a 2d matrix that's (file len) * 22 entries. <P> At each pos P, you can advance all the entries by adding one literal. This does : <FONT COLOR=GREEN><PRE> for all LRL costs[P+1][LRL+1] = 2 + costs[P][LRL] + LRL_Delta_Cost(LRL) (2 nibbles is the cost of a literal byte) where LRL_Delta_Cost = 1 at LRL = 0 1 at LRL = 7 1 at LRL = 21 otherwise zero </PRE></FONT> Or you can advance with a match. To advance with a match, start from the cheapest arrival with any LRL and step by the match len and fill costs[P+len]. You can also advance with a rep match, which is similar except that it cannot advance from the LRL=0 state. Each arrival stores where it came from (both pos and LRL). <P> When you reach the end of the file, you take the cheapest of all the LRL arrivals and trace it backwards to the root to find the parse. <P> This kind of full matrix dynamic programming parse completely captures the non-local effects caused by things like LRL running into thresholds that change the cost. Unfortunately it's too slow for production (and uses a lot of memory), but it is very useful as a reference point. It told us that a much better optimal parse was possible. <P> An important note : The dynamic programming parse needs to be making space-speed decisions. As noted previously in particular there are a lot of ties and those need to be broken in favor of decode speed. The primary driver for decode speed is the number of control tokens. What we did is just add a cost penalty to each control token. The number we used is (1/4) of a nibble penalty for each control token. That is, we will give up 1 nibble of compression if it saves 4 mode switches. If we can send {LRL} instead of {LRL,match,LRL,rep,match} we will do it if the penalty is only one nibble. <P> (modern Oodle now uses a rate-disortion-like Lagrange multiplier to optimize for the desired tradeoff of space & speed, which the user can control. This work in LZNIB predates that system and just used a hard-coded tradeoff that we found to greatly improve decode speed without much penalty to compressed size.) <P> So, armed with the dynamic programming solution, we could generate stats to look at how it was parsing files, and compare that to how forward-arrivals was parsing. What we saw was : <FONT COLOR=GREEN><PRE> dynamic programming : --------------------------------------------------------- 7.6 bytes per token ; 30.2% literals, 55.6% matches, 14.1% LO AM: 50.4% match / 49.6% lrl ; 8.9 average AM ML , 7.0 lrl AM: offlow 40.7% ml3 24.0% ml4 21.8% AL: 49.2% match / 50.8% LO ; 7.6 average AL ML , 6.4 LO len AL: offlow 53.5% ml3 24.6% ml4 19.9% ; LO len1 11.4% len2 24.9% --------------------------------------------------------- forward arrivals : --------------------------------------------------------- 5.9 bytes per token ; 21.0% literals, 61.1% matches, 17.9% LO AM: 46.4% match / 53.6% lrl ; 8.4 average AM ML , 3.5 lrl AM: offlow 43.1% ml3 37.5% ml4 19.9% AL: 39.5% match / 60.5% LO ; 7.5 average AL ML , 5.0 LO len AL: offlow 36.7% ml3 43.1% ml4 15.0% ; LO len1 30.1% len2 23.4% --------------------------------------------------------- key : --------------------------------------------------------- decompressed bytes per token ; % of control tokens of each type AM = in after-match state AM: % of tokens ; average len of that token AM: offlow = offset in 12 bits , ml3 = % of matches that have len 3 AL = in after-literal state LO means "last offset" which a synonym for "rep match" --------------------------------------------------------- </PRE></FONT> In this case the compressed size was very similar, but the dynamic programming parse was much faster to decode (about 20% faster). <P> We can easily see what's going on : <FONT COLOR=GREEN><PRE> DP parse has more bytes per token, eg. fewer tokens for the whole file. This is why it is faster. This is more of the end result of the problem rather than the source of the problem. DP parse has way more bytes in literals (30.2% vs 21%) DP parse has way longer average LRL (7.0 vs 3.5) forward-arrivals parse sends way more len-3 matches (37.5% vs 25.0% and 43.1% vs 24.6%) forward-arrivals parse sends way more rep len 1 matches (30.1% vs 11.4%) </PRE></FONT> I found it quite interesting that two ways to parse the file could produce nearly the same compressed size, but get there in very different ways. <P> Clearly the forward arrivals parse needs a way to find the longer literal runs when they are best in a non-local way. <P> When you are walking along in a forward-arrivals parse, you just see the single cheapest arrival to your pos; consider something like this : <FONT COLOR=GREEN><PRE> 1234 LLLR </PRE></FONT> At pos 4, the cheapest arrival is via rep-len1. The standard forward arrivals parse fills that spot with the rep-len1 arrival, and then continues forward from there. There's no way to go back in time. You no longer see the arrival at pos 3. <P> A key thing we should observe is that when a non-local cost effect makes us change a decision (away from just using cheapest arrival), it is always in favor of LRL. The standard arrivals parse is taking too many matches & reps and we want to take literal runs instead in some of those places. <P> The solution is pretty simple : <P> Store only match (and rep-match) arrivals at each pos (not literal arrivals). When considering going forward, consider starting at current pos P from the match arrival there, OR go from [P-1] with 1 LRL, or [P-2] with 2 LRL, etc. <FONT COLOR=GREEN><PRE> considering an example from before : 12345678 MMMLMMML LLMMMMLL at pos 8 I look at the ways to go forward (find matches & reps) say I find a match how should I arrive at pos 8 ? I can arrive via arrival (MMMLMMM) + LRL 1 or arrival (LLMMMM) + LRL 2 You should choose the latter. Even though arrival was the cheaper way to get to pos 7, arrival is the better way to get to pos 8. </PRE></FONT> You want to limit the LRL lookback to something reasonable. Perhaps 8 (or 22 as we did in the full matrix parse, but that isn't necessary). If you find no match arrivals in the 8-step lookback, you need a way to go back to the most recent preceding match arrival. <P> Instead of just looking at a forward step of {match} we consider {step back 1 + L + M} , {back2 + LLM }, etc. In LZNIB, costing the LRL lookback steps is simple because literals are just always 1 byte. <P> Essentially what we're doing here is treating the LZNIB parse as if it used an {LRL,match} combined packet. Even though the actual format has separate literal run and match actions, they act like they are combined. <P> In fact there's a different way to look at LZNIB. After an LRL nibble action token, there must always be a match or rep-match action. So we can think of those as instead being a byte action token for {LRL+match}. In that case rather than the after-literal / after-match states, there is only one state : <FONT COLOR=GREEN><PRE> LZNIB always after-match : nibble action : match (no rep match is possible) byte action : LRL then match byte action : LRL then rep-match </PRE></FONT> If you parse using these actions, then the non-local effects go away. <P> In the end we found huge improvements to the LZNIB parse that gave us ~20% faster decompression just through making better decisions in the encoder. The way that we investigated this and developed the parse was later fruitful in the development of Kraken & Mermaid/Selkie, which have similar issues but are even far more complicated to parse well. <P><HR><P> Some old references links related to LZNIB, and some perf reports showing LZNIB in the pre-oceanic-cryptozoology days : <P> <A HREF="https://cbloomrants.blogspot.com/2012/10/10-22-12-lz-bytewise-conclusions.html"> cbloom rants 10-22-12 - LZ-Bytewise conclusions </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2015/03/03-02-15-lz-rep-match-after-match.html"> cbloom rants 03-02-15 - LZ Rep-Match after Match Strategies </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2015/05/05-13-15-skewed-pareto-chart.html"> cbloom rants 05-13-15 - Skewed Pareto Chart </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2015/12/oodle-results-update.html"> cbloom rants Oodle Results Update </A> <BR> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/JeEgeviRb9w" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-52117307302497522072018-10-23T19:52:00.003-07:002018-10-23T19:52:39.045-07:00Oodle 2.7.3 on the Nintendo Switch It's been a while since I reported Switch performance, and we've made a lot of improvements to ARM performance since then (both for ARM32 and 64, for Switch, iOS, and Mobile), and of course added Leviathan. Time for an update! <P> I compare Oodle against the zlib implementation provided in the Nintendo SDK (nn deflate). Deflate is run at level 9, Oodle at level 8. I'm timing decode speed only. <P> On mixed content test file lzt99 : <P> <TABLE><TR><TD> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=FFB000,882787,A92FFF,FF00FF,B097F7&chxr=1,0,1000,100&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=nn_deflate-l9|Leviathan-z8|Kraken-z8|Mermaid-z8|Selkie-z8&chd=e:Em,N6,SH,ht,-L&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l"> </TD><TD> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=FFB000,882787,A92FFF,FF00FF,B097F7&chxr=1,0,3,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=nn_deflate-l9|Leviathan-z8|Kraken-z8|Mermaid-z8|Selkie-z8&chd=e:oL,7J,4o,z-,pc&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l"> </TD></TR></TABLE> <P> Plot of log compresion ratio vs. log decode speed : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,6,10,1|1,0,2,1&chd=e:CocNiTwp-y,dOvFtEpHeq&chdl=nn_deflate-l9|Leviathan-z8|Kraken-z8|Mermaid-z8|Selkie-z8&chdlp=r&chco=FFB000|882787|A92FFF|FF00FF|B097F7&chm=s,000000,0,0,0|o,FFB000,0,0,10|o,882787,0,1,10|o,A92FFF,0,2,10|o,FF00FF,0,3,10|o,B097F7,0,4,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|64|128|256|512|1024|1:|1|2|4|"> <P> Raw numbers : <P><PRE> lzt99 : nn_deflate-l9 : 1.883 to 1 : 71.70 MB/s lzt99 : Leviathan-z8 : 2.773 to 1 : 217.26 MB/s lzt99 : Kraken-z8 : 2.655 to 1 : 282.96 MB/s lzt99 : Mermaid-z8 : 2.437 to 1 : 526.64 MB/s lzt99 : Selkie-z8 : 1.943 to 1 : 971.69 MB/s </PRE> <P> Leviathan is about 3X faster to decode than zlib/deflate, with way more compression. The rest of the Oodle compressor family provides even faster decode speeds with lower compression ratios. The fastest, Oodle Selkie, is similar compression ratio to zlib/deflate but more than 10X faster to decode. <P> details : <P> This is with Switch SDK 5.5, clang-5.0.1 ; the nn:deflate speed has gotten a little worse since the last time I measured it (was 74.750 MB/s). The compression ratio of nn:deflate is the same. For reference, <A HREF="https://cbloomrants.blogspot.com/2017/01/oodle-on-nintendo-switch.html"> old numbers are here </A>, for Oodle 2.6.0 and 2.4.2. For example you can see that LZNA was at 24 MB/s with compression just slightly below Leviathan. Leviathan is truly a remarkable beast. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/1mdEBTo-eEc" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-58887579733153226952018-10-22T13:02:00.001-07:002018-10-22T13:02:56.489-07:00recip_arith without unused range Finishing up, today we'll go through how to make recip_arith take advantage of the un-mapped portion of the coding range, and see some more connections to past work. <P> recip_arith uses only the top bits of "range" in the map, rounding down the cdf to range scaling factor when it converts to fixed point. Thus it always maps less than range. <P> The classic "range coder" does something similar. With the map : <FONT COLOR=GREEN><PRE> uint32_t r_norm = ac->range >> cdf_bits; forward(cdf) = cdf * r_norm; </PRE></FONT> the bottom "cdf_bits" of range are not used in this map, so a small portion of range is unused. <P> The standard way to make use of this extra range is to assign to the last symbol. Most of the older papers show this as the default. They'll do something like : <FONT COLOR=GREEN><PRE> (range coder "map end") uint32_t r_norm = ac->range >> cdf_bits; forward(cdf) : if ( cdf == cdf_tot ) return range; else return cdf * r_norm; </PRE></FONT> that is, explicitly taking the last interval and making it go all the way to range. You may see this in other "range coder" implementations. <P> I do not do this in my range coder. The benefit is around 0.001 bpb (or less), and it does cost some speed. The benefit depends on whether you can give that extra range to a useful symbol. Ideally you would give it to the most probable symbol. What you are doing is reducing the code length of one symbol, you get the most benefit by reducing the code length of the symbol that occurs most often. The higher the probability of the MPS, the more benefit you will get, assuming you put the MPS at the end of the alphabet where it gets this range. (conversely if the end of the alphabet is a symbol that never/rarely occurs, you get no benefit). <P> I call this variant "map end". In recip_arith the tradeoff is a little more biased towards doing "map end", because the unmapped portion of range is larger. <P> (note that the "map end" excess range is less than the amount of range assigned per cdf; that is, it's equivalent to increasing that integer symbol frequency by some fractional amount, so it's not a huge distortion of the symbol probabilities) <P> But in recip_arith, just mapping that extra range to one symbol is more of a probability distortion than it is in the range coder, because the extra space is larger. <P> We can do something better. The scaling factor from cdf to range in recip_arith is the *floor* to fixed point. When range is near the top of that fractional truncation, you would much rather round up. But you can't round up because that would lead to using more than all of range, which is uncodeable : <FONT COLOR=GREEN><PRE> int range_clz = clz32(range); uint32_t r_top_down = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS); uint32_t r_norm_down = r_top_down << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); uint32_t r_top_up = r_top_down+1; uint32_t r_norm_up = r_top_up << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); r_norm_up == r_norm_down + (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)); r_norm_down * cdf_tot <= range r_norm_up * cdf_tot > range if you just use r_norm_down , you use too little of range unmapped_range = range - r_norm_down * cdf_tot = range & ( (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS)) - 1 ) conversely if you used r_norm_up for scaling, you would use too much range the amount is the ~ bit reverse of those bottom bits </PRE></FONT> So inspired by SM98 we can use a mix of round-down & round-up of the scaling factor, such that all of range is covered. <P> What we want is to use round_up as much as possible, because it gives lower codelens, and then for enough symbols to avoid over-using range, switch to round_down. Just like SM98 we can find a threshold such that we place the transmission so that all of range is used. <P> The nice way to think about it conceptually is that we construct two maps : we use r_norm_down as the scaling factor from the left side of range (starting at 0), and we use r_norm_up as the scaling factor from the right side, and we switch over whether they cross. <P> The implementation is simple : <FONT COLOR=GREEN><PRE> r_norm_down & r_norm_up as above uint32_t map_up_excess = (r_norm_up << cdf_bits) - range; forward(cdf) : uint32_t cdf_down = cdf * r_norm_down; int64_t cdf_up = cdf * r_norm_up - (int64_t) map_up_excess; // signed ret = MAX( cdf_down, cdf_up ); </PRE></FONT> and this can be simplified further with some algebra which I won't show. <P> Perhaps a picture is clearer : <P> <IMG SRC="https://www.cbloom.com/blogimages/recip_downup_map.jpg"> <P> I call this map "recip_arith down/up". The decoder is straightforward except for one point : <P> The reciprocal for both r_top and (r_top+1) have to be looked up. range is conceptually in [1,2) in fixed point, r_top can be exactly 1 but never 2. (r_top+1) can be exactly 2. The reciprocal table needs one more entry than (1<<RECIP_ARITH_TABLE_BITS). That means the reciprocal can take one more bit. In particular if you use the 32 bit reciprocal numerator as in recip_arith.h , the reciprocal will no longer fit in 32 bits. The easy solution is to change RECIP_ARITH_NUMERATOR_BITS to 31. <P> The coding loss of "recip_arith down/up" (at RECIP_ARITH_TABLE_BITS = 8) is extremely low. It's typically better than the "range coder" map, and comparable to the CACM87 map. <P> "recip_arith down/up" is even extremely good at RECIP_ARITH_TABLE_BITS = 4. That might be interesting for hardware implementations, because it means very tiny tables can be used, and the multipliers needed (for encoding) are also very low bit count. <P> Note that if you reduce RECIP_ARITH_TABLE_BITS all the way down to 1, then r_top_down is always just 1, and r_top_up is just 2. Then "recip_arith down/up" is choosing between a 1X and 2X mapping. This is exactly SM98 ! <P> At RECIP_ARITH_TABLE_BITS = 2, then "recip_arith down/up" is choosing 1X or 1.5X or 2X. (the top bits are either 10 or 11 , we're getting only 1 fractional bit of range, the top bit is always on). This is exactly the same as the "reduced overhead" coder of Daala. (see OD_EC_REDUCED_OVERHEAD, in entcode.h) <P> <FONT COLOR=GREEN><PRE> OBJ2 range coder: 246,814 -> 193,172 = 6.261 bpb = 1.278 to 1 recip_arith (8 bits): 246,814 -> 193,282 = 6.265 bpb = 1.277 to 1 recip_arith down/up (8 bits): 246,814 -> 193,171 = 6.261 bpb = 1.278 to 1 recip_arith down/up (4 bits): 246,814 -> 193,240 = 6.264 bpb = 1.277 to 1 recip_arith down/up (3 bits): 246,814 -> 193,436 = 6.270 bpb = 1.276 to 1 PAPER3 range coder: 46,526 -> 27,133 = 4.665 bpb = 1.715 to 1 recip_arith (8 bits): 46,526 -> 27,156 = 4.669 bpb = 1.713 to 1 recip_arith down/up (8 bits): 46,526 -> 27,133 = 4.665 bpb = 1.715 to 1 recip_arith down/up (4 bits): 46,526 -> 27,127 = 4.664 bpb = 1.715 to 1 recip_arith down/up (3 bits): 46,526 -> 27,155 = 4.669 bpb = 1.713 to 1 PIC H : 1.21464 range coder: 513,216 -> 78,408 = 1.222 bpb = 6.545 to 1 recip_arith (8 bits): 513,216 -> 78,651 = 1.226 bpb = 6.525 to 1 recip_arith down/up (8 bits): 513,216 -> 78,408 = 1.222 bpb = 6.545 to 1 recip_arith down/up (4 bits): 513,216 -> 78,395 = 1.222 bpb = 6.547 to 1 recip_arith down/up (3 bits): 513,216 -> 78,806 = 1.228 bpb = 6.512 to 1 PROGL range coder: 71,646 -> 42,723 = 4.770 bpb = 1.677 to 1 recip_arith (8 bits): 71,646 -> 42,757 = 4.774 bpb = 1.676 to 1 recip_arith down/up (8 bits): 71,646 -> 42,721 = 4.770 bpb = 1.677 to 1 recip_arith down/up (4 bits): 71,646 -> 42,724 = 4.771 bpb = 1.677 to 1 recip_arith down/up (3 bits): 71,646 -> 42,731 = 4.771 bpb = 1.677 to 1 TRANS range coder: 93,695 -> 64,806 = 5.533 bpb = 1.446 to 1 recip_arith (8 bits): 93,695 -> 64,851 = 5.537 bpb = 1.445 to 1 recip_arith down/up (8 bits): 93,695 -> 64,806 = 5.533 bpb = 1.446 to 1 recip_arith down/up (4 bits): 93,695 -> 64,820 = 5.535 bpb = 1.445 to 1 recip_arith down/up (3 bits): 93,695 -> 64,884 = 5.540 bpb = 1.444 to 1 </PRE></FONT> <P> recip_arith down/up (4 bits) has very low coding loss, it's very close to the standard range coder. (note that "range coder" here is not doing "map end" which would improve it slightly, particularly if end was assigned to the MPS, particularly so on "PIC" which is the only file where range coder noticeably misses the entropy) <P> Base recip_arith loss vs range coder is exactly 0.004 bpb in all cases here. This is with cdf_bits = 13 and a crappy frequency normalization heuristic, so there are some anomalies where eg. recip_arith down/up 4 bits gets more compression than 8 bits. recip_arith vanilla version at 4 bits gets very poor; coding loss is comparable to SM98, around 1.0% <P> recip_arith down/up is probably not practical for software implementation, but might be interesting for hardware. It's also interesting to me as a theoretical superset that connects recip_arith to SM98. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/_8lS7AtKT-s" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-45018494227592511642018-10-22T11:16:00.001-07:002018-10-22T11:16:29.046-07:00Working through recip_arith Time to look at the actual recip_arith map and how it works. <P> recip_arith takes the top N bits of range (always a high 1 bit, and then N-1 fractional bits in the fixed point view) and uses them to scale the (power of 2 total) CDF up to the arithmetic domain. <P> The recip_arith forward map is : <FONT COLOR=GREEN><PRE> int range_clz = clz32(range); uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS); uint32_t r_norm = r_top << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); forward(cdf) = cdf * r_norm; or forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); </PRE></FONT> r_norm is the scaling factor, close to (range/cdf_tot), differing in that all bits below RECIP_ARITH_TABLE_BITS are set to zero. Recall the standard "range coder" map is : <FONT COLOR=GREEN><PRE> uint32_t r_norm = ac->range >> cdf_bits; forward(cdf) = cdf * r_norm; </PRE></FONT> the only difference in recip_arith is that we are only using the top RECIP_ARITH_TABLE_BITS of "range" in the scaling. This is a floor of the ideal scaling factor, and means we use less of range than we would like. <P> Note that the "range coder" itself only uses (r_bits - cdf_bits) in its scaling factor, so it relies on r_bits reasonably larger than cdf_bits. This is in contract to CACM87 or full precision scaling. <P> The recip_arith map can be thought of as starting with the "identity map" from the last post, and then adding fractional bits to the fixed point in [1,2) , possibly 1.5X, 1.25X, etc. refining the scaling factor. <P> The point of course is to able to invert it. The naive inverse would be : <FONT COLOR=GREEN><PRE> inverse(code) = code / r_norm; </PRE></FONT> which works, but has the divide we are trying to avoid. Our idea is that because r_top is small, we can use it to look up a reciprocal multiply. How should we do that? There are two choices, where we invert the steps of the forward map one by one, in reverse order : <FONT COLOR=GREEN><PRE> forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); inverse(code) = (code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) / r_top; or forward(cdf) = (cdf << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) * r_top; inverse(code) = (code / r_top) >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); </PRE></FONT> if you actually do the divide, these are the same, but with the reciprocal multiply they are not. The issue is our reciprocal will only be exactly equal to the divide for some number of numerator bits. If we used the second form, we would need (code / r_top) , and "code" can be very large (24-31 bits or 56-63 bits). To do that reciprocal exactly would require an intermediate multiply larger than 64 bits. <P> Therefore we need the first form : first shift down code to only the necessary bits to resolve cdf boundaries, then do the divide : <FONT COLOR=GREEN><PRE> inverse(code) : uint32_t code_necessary_bits = code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); ret = code_necessary_bits / r_top; ret = ( code_necessary_bits * recip_table[r_top] ) >> RECIP_ARITH_NUMERATOR_BITS; </PRE></FONT> (for 32-bit systems you might want to choose RECIP_ARITH_NUMERATOR_BITS = 32 so that this is a 32x32 -> hi word multiply; on 64-bit that's moot) <P> "code_necessary_bits" has its top bit at cdf_bits + RECIP_ARITH_TABLE_BITS ; in practice this is something like 14 + 8 = 22 . This is just enough bits that after dividing by r_top , you still resolve every cdf value exactly. In order for recip_table[] to have an exact reciprocal, it needs to be 1 more bit than the numerator, eg. cdf_bits + RECIP_ARITH_TABLE_BITS + 1. This means the intermediate that we need to multiply easily fits in 64 bits. <P> This acts as the limit on RECIP_ARITH_TABLE_BITS, aside from the consideration of how much space in L1 you can afford to give it. <P> With RECIP_ARITH_TABLE_BITS = 1 , recip_arith is the "identity map". With RECIP_ARITH_TABLE_BITS = 8, coding loss is very low (0.1%). <P> The fraction of "range" that can be wasted by the recip_arith map is 1/2^RECIP_ARITH_TABLE_BITS , which means the coding loss is : <PRE>! coding loss @ maximum ~= -log2( 1 - 1/2^RECIP_ARITH_TABLE_BITS ) (eg. at RECIP_ARITH_TABLE_BITS = 8 this is 0.00565 bits) </PRE> The maximum occurs when all bits uner the top RECIP_ARITH_TABLE_BITS are 1; when they're all 0 this map is exact and there's no coding loss. The real world coding loss seems to be about 75% of the above formula. <P> Note that in a standard "range coder" you are already approximating to only use the bits of range between r_bits and cdf_bits; so eg. r_bits in 24-31 , cdf_bits of 14, means you are using 10-17 top bits of range in r_norm (the scaling factor). recip_arith just always uses 8 (for example) rather than the variable 10-17. <P> Note that recip_arith does not map all of range. We'll talk about variants that do use the whole range in the next post. (The standard range coder also doesn't map all of range). This means there are "code" values which the decoder cannot handle through its normal paths, where "code" is out of the mapped range. A valid stream produced by the encoder will never lead to those values being seen in the decoder, but a corrupt/attack/fuzzy stream could have them. Therefore a safe decoder must detect when code is out of the mapped range and at least handle it well enough not to crash. Testing and returning failure is an efficient way to handle it since it's just a branch that's never taken. <P> In review : the fully general CACM87 map uses 2 divides to encode, and 3 to decode. (encode is 2 forward maps, decode is 1 inverse map + 2 forward maps). By choosing cdf_tot to be a power of 2 we can easily eliminate the divide in the forward map. The divide in the inverse map is harder to remove, but we do it by reducing the number of bits used so that we can use a table of reciprocal multiplies. <P> The decoder critical path of instructions in recip_arith is : clz -> shift -> table lookup -> multiply . On modern desktops this is about the same speed or just a little faster than a divide. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/dCuOhSE3iWk" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-89481686217830588592018-10-21T15:14:00.000-07:002018-10-22T11:16:53.657-07:00Arithmetic coder division-free maps We have now narrowed in on the key issue that forces us to use division in arithmetic coders : the inverse map from arithmetic domain to CDF domain. Let's now start to look at division free options for that map. <P> Often henceforth I will talk about the map after normalizing the ratio to [1,2) by shifting CDFs. That is : <FONT COLOR=GREEN><PRE> while cdf_tot*2 < range cdf's *= 2 same as : left shift cdf's to put the top bit of CDF aligned with the top bit of range </PRE></FONT> Recall that in a typical implementation "range" has its top bit go from position 24-31. cdf_tot = (1<<cdf_shift) we have chosen to be a power of 2, and cdf_shift is typically in the 11-14 range. <P> We now think in terms of a fixed point fraction. cdf_tot after normalizing is "one" , and "range" is in [1,2) , that is : <FONT COLOR=GREEN><PRE> one = 1.00000000000000000 range = 1.01001010101100000 </PRE></FONT> essentially all we're getting at here is that in the scaling from CDF domain to "arithmetic domain", the binary powers of 2 to get us as close to range as possible are the easy part. We just shift the CDF's up. The fractional part of range in [1,2) is the hard part that makes us need a multiply/divide. <P> And that leads us directly to our first division free map : <FONT COLOR=GREEN><PRE> identity map : aka "just shift" map : r_bits = bsr(range) = position of top bit of range shift = r_bits - cdf_shift forward(cdf) = cdf << shift inverse(code) = code >> shift </PRE></FONT> If we look at how this map performs for range in [1,2) , when range is near 1 ("one" in fixed point, that is, range nearly 1<<r_bits), then this map is exactly right. As range grows towards 2, this map gets worse and worse. Near 2, this map is costing us exactly 1 bit of coding loss per symbol (because we're only using half of the range that we could). <P> (aside : in the ancient days, Rissanen actually used the "identity map") <P> The problem with this map is that it is failing to scale up cdf to use all of range when range is over 1. Obviously we can just check for that : <FONT COLOR=GREEN><PRE> "one or 1.5 map" : If range is in [1,1.5) , scale by 1X if range is in [1.5,2) , scale by 1.5X forward(cdf) : cdf_norm = cdf << shift; ret = cdf_norm; if ( range (fixed point) >= 1.5 ) (test 0.1 bit) ret += cdf_norm >> 1; inverse(code) : if ( range (fixed point) >= 1.5 ) ret = (code * 2)/3; else ret = code; </PRE></FONT> so we are now testing 1 fractional bit of range below the leading bit. This divides the "range" interval into two scaling zones. We could obviously test further fractional bits of range to divide the range into scaling zones like : <FONT COLOR=GREEN><PRE> test 2 fractional bits of range (below top 1) : [ 1X , 1.25X , 1.5X , 1.75X ] </PRE></FONT> but that's just the same as doing the normal range coder scaling, but only using the top 3 bits of range (top one + 2 fractional bits). And that is the same as the recip_arith forward map. <P> Note that in the "one or 1.5 map" it may look like we are still doing a divide to get from arithmetic domain back to cdf domain. But it's divide by the constant 3, which is implemented by the compiler (or us) as a reciprocal multiply. As we use more (fractional) bits of range, we need various other divides by constants (1/5,1/7,etc.) which are just reciprocal multiplies. Rather than branching, we can just use a table and take the top bits of range to look them up. This leads us directly to recip_arith, which we will flush out in the next post. <P> In these maps we are always using less than the full range. eg. say we do the "2 fractional bits of range" , if range is in the [1.25,1.5) fixed point interval, we will use 1.25X to scale from CDF to arithmetic domain. That is the correct scaling when range is the bottom of [1.25,1.5) but does not use all of range when range is larger. The reason is we can't do something like use a scaling factor that's the middle of the bucket (1.375), since when range is low that would give us forward(cdf) > range , which is uncodeable. Therefore we must also use the round-down or truncation of range to fewer top bits. This approach of using some number of fractional bits of range means that the map never makes use of all of range; as you add more bits, you can use more and more of range, but you are slowly approaching the limit from below. <P> There's an alternative approach due to Stuiver & Moffat ("Piecewise Integer Mapping for Arithmetic Coding"), commonly called SM98. <P> The SM98 map says : consider range and CDF normalized, so range is in [1,2) fixed point. If we just scale CDF by 1X everywhere ("identity map" above) we are not using all of range. We can't *ever* scale CDF by 2X uniformly, because range is strictly < 2 , that would make forward(cdf_tot) exceed range. What we can do is scale CDF by 2X for a portion of the interval, and 1X for the remainder, so that we use all of range : <FONT COLOR=GREEN><PRE> Choose some CDF threshold t such that when we make the map : cdf < t -> scale by 1X cdf >= t -> scale by 2X then we use the whole range in our map, eg. forward(cdf_tot) = range The map is : forward(cdf) : if ( cdf < t ) ret = cdf else ret = t + (cdf - t)*2 = cdf*2 - t t = cdf_tot*2 - range note that cdf_tot is normalized to be equal to range's top bit here, so 't' is the same as "2 - range" in fixed point that's the same as the ~ bit inverse of range's fractional bits forward(cdf) : if ( cdf < t ) ret = cdf; else ret = range - (cdf_tot - cdf)*2; forward(cdf) : ret = MAX( cdf , range - (cdf_tot - cdf)*2 ) </PRE></FONT> This final form with the branchless MAX is nicer for implementation, but it's also an alternate way to see the map. What we're doing is a 1X map for the early cdf's, starting at the left side of the arithmetic range. If we stuck with that map the whole way, it would not reach the end of range (and thus waste coding quality). We're simultaneously doing a 2X mapping of the late CDF's, starting at the *right* side of the arithmetic range. If we stuck with that map the whole way, it would overshoot zero and thus not be codeable. Where those two maps cross, we switch between them, thus using the more generous 2X mapping as much as possible. <P> (the inverse map is done similarly, just with >>1 instead of *2) <P> So, the SM98 mapping uses all of range, thus does not introduce coding loss due to failure to use all of range. It does, however, not assign intervals propertional to the probability. <P> When range is near 1, SM98 does the 1X map nearly everywhere, so its scaling is correct. When range is near 2, SM98 does the 2x map nearly everwhere, so again there is little coding loss. Intervals are proportional to the probability. The problem is when range is in the middle. Then some symbols will get a 1X scaling, and others will get a 2X scaling, distorting the probabilities, causing coding loss. <P> (aside: I did an experiment and confirmed that you can compensate for this somewhat with a skewed probability estimate. SM98 in this form gives too much code space to later symbols (you can of course choose to reverse that). To compensate you should increase the frequency more when you see early symbols than when you see later ones, so that the probability estimate is skewed in the opposite way of the coder. This does in fact cut the coding loss of SM98, roughly in half, from about 1.0% to 0.5%. Note this is totally evil and I'm not actually recommending this, just writing it down for the record.) <P> And I'll finish with a drawing : <P> <IMG SRC="https://www.cbloom.com/blogimages/simple_arithcoder_maps_small.jpg"> <P> Teaser : you can of course combine the ideas of "fractional bits of range" map and the SM98 map. When range is in the interval [1,1.5) you could use a 1X scaling for low CDF and 1.5X scaling for high CDF; in the [1.5,2) interval use an SM98 threshold to split the CDF's into intervals with 1.5X and 2X scaling. This was tried by Daala as the "reduced overhead" coder. We will come back to this later. <P> Oh, and another connection : <P> If you did the "fractional bits of range" scaling thing; first 1 bit giving you a 1.5X zone, then two bits adding 1.25X and 1.75X zones, etc. If you keep doing that all the way down, in a range coder framework you are trying to compute (range / cdf_tot). That means you need to look at (r_bits - cdf_bits). If you simply keep testing that number of bits and adding in the forward() map - you will wind up with the full range coder map. <P> That process is the Moffat-Neal-Witten DCC95 multiplication free coder. In that context you might want to choose r_bits = 14 or 15, bit renormalization. cdf_bits = 11 or so. The difference (r_bits - cdf_bits) is your coding precision (larger = less coding loss), and it's also the number of times you have to test bits of r and possibly do (ret += cdf>>n) in the forward map. <P> ADD : <P> I brought up thinking of range normalized to [1,2) as a conceptual aid, but that can also be a good implementation choice, particularly for coders like SM98 where you spend most of your time doing clz to find the top bit position of range. Instead of letting range float, like in Michael Schindler range coder 24-31 bits, you instead keep range pegged at 31 bits. That lets you avoid the clz to find the top bit, at the cost of doing a clz after encoding to find out how much range has shrunk to normalize it back up. <P> Now you might think this requires bit renorm, but it does not. Instead you can still do byte renorm, and keep a count of the number of spare bits at the *bottom*. So you are still doing 24-31 bit renorm, but the space is at the bottom instead of the top. <P> This implementation style is not shown in recip_arith for clarity, but I figure I better mention everything so Google can't patent it. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/9lW-3RtY_Mk" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-34484825351868625112018-10-18T10:45:00.000-07:002018-10-19T08:33:54.627-07:00About arithmetic coders and recip_arith in particular An arithmetic coder encodes a sequence of symbols as an infinite precision number. You divide the range of that number (I will call this the "arithmetic domain") into successively smaller intervals; each interval corresponds to a symbol. After encoding a symbol, your current coding interval shrinks to the low/high range of that symbol, and you encode further symbols within that range. The minimum number of bits required to distinguish the desired sequence from another is the code length of the sequence. <P> If you make each symbol's interval proportional to the probability of that symbol, the arithmetic coder can produce a code length equal to the entropy. (we're still assuming infinite precision encoding). If your estimated probabilities do not match the true symbol probabilities (they never do) you have some loss due to modeling. The difference between the arithmetic coder's output length and the sum of -log2(P) for all model probabilities is the coding loss. <P> In order to do arithmetic coding in finite precision, we track the low/high interval of the arithmetic coder. As the top bits (or bytes) of low & high become equal, we stream them out. This is like "zooming in" on a portion of the infinite number line where the top & bottom of the interval are in the same binary power of two buckets. <P> We must then confront the "underflow problem". That is, sometimes (high - low) ("range") can get very small, but the top bits of high and low never match, eg. if they happen to straddle a binary power of 2. They can be something like <FONT COLOR=GREEN><PRE> high = 1000000000aaaaa.. low = 0111111111bbbbb.. </PRE></FONT> There are several solutions to the underflow problem. For example CACM87 "bit plus follow" or the MACM / VirtQ approach which you can read about elsewhere, also the "just force a shrink" method (see links at end). <P> The method I use in "recip_arith" rather hides the underflow problem. Rather than checking for the top bits (bytes) of low & high being equal, we simply zoom in regardless. The renormalization step is : <FONT COLOR=GREEN><PRE> while ( ac->range < (1<<24) ) { *(ac->ptr)++ = (uint8_t)(ac->low>>24); ac->low <<= 8; ac->range <<= 8; } </PRE></FONT> low and range are 32 bit here, when range is less than 2^24 the top byte of low & high is the same and can be shifted out, *except* in the underflow case. In the underflow case, we could have the top byte of low is = FF , and high is actually 100 with an implicit bit above the 32nd bit (eg. low + range exceeds 2^32). What we do is go ahead and output the FF, then if we later find that we made a mistake we correct it by propagating a carry into the already sent bits. <P> (note that you could do range += FF here for slightly less coding loss, but the difference is small; the actual "high" of our arithmetic interval is 1 above range, range can approach that infinitely closely from below but never touch it; the coding interval is [low,high) inclusive on the bottom & exclusive on the top. Coders that don't quite get this right have a lot of +1/-1 adjustments around low & high) <P> Renormalization means we can send an infinite length number while only working on a finite precision portion of that number down in the active range of bits at the bottom. Renormalization also means that "range" is kept large enough to be able to distinguish symbols with only integer subdivision of the range, which we shall now address. Renormalization in and of itself does not introduce any coding loss; it is perfectly accurate (though failing to add FF is coding loss, and schemes like the "force shrink" method or "just dont renormalize" method of fpaq0p do contribute to coding loss). <P> The other way we must adapt to finite precision is the division of the interval into ranges proportional to symbol probabilities. The infinite precision refinement would be (real numbers!) : <FONT COLOR=GREEN><PRE> arithmetic_low += CDF_low * arithmetic_range / CDF_sum; arithmetic_range *= CDF_freq / CDF_sum; (real numbers, no floor divide) (CDF = cumulative distribution function, aka cumulative probability, sum of previous symbol frequencies) CDF_freq = CDF_high - CDF_low for the current symbol ; CDFs in [0,CDF_sum] </PRE></FONT> We don't want to do real numbers, so we just approximate them with integer math. But how exactly? <P> <B> The crucial distinguishing aspect of an arithmetic coder is how you map the CDF domain to the arithmetic domain </B> <P> The CDF domain is controlled by you; you have modeled probabilities somehow. The CDF domain always starts at 0 and goes to CDF_sum, which is under your control. In the decoder, you must search in the CDF domain to find what symbol is specified. Working in the CDF domain is easy. In contrast, the arithmetic interval is always changing; "low/range" is being shrunk by coding, and then zoomed in again by renormalization. <P> The forward map takes you from CDF domain to arithmetic domain. Adding on the arithmetic "low" is trivial and we will not include it in the map. The crucial thing is just scaling by (arithmetic_range / CDF_sum). <P> We can now write a very general arithmetic encoder : <FONT COLOR=GREEN><PRE> arithmetic_low += forward(CDF_low,arithmetic_range,CDF_sum); arithmetic_range = forward(CDF_high,arithmetic_range,CDF_sum) - forward(CDF_low,arithmetic_range,CDF_sum); </PRE></FONT> our "forward" map will be working on integers. Some properties forward must have : <FONT COLOR=GREEN><PRE> forward(x) should be monotonic forward(x+1) > forward(x) strictly (so that range can never shrink to zero) this may only be true for arithmetic_range > CDF_sum or some similar constraint forward(0) >= 0 forward(CDF_sum) <= arithmetic_range forward map of the CDF end points does not need to hit the end points of range, but it must be within them (failure to use all of range does contribute to coding loss) </PRE></FONT> The coding loss of our approximation is caused by the difference in forward(high) - forward(low) and the ideal scaling (which should be proportional to range & symbol probability). <P> The integer forward map with lowest coding loss is the "CACM87 map" : <FONT COLOR=GREEN><PRE> forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum; this is now integers (eg. floor division) CACM87 has forward(cdf_sum) = range ; eg. it uses the full range </PRE></FONT> coding loss is just due to the floor division not exactly matching the real number divide. (note that you might be tempted to say, hey add (cdf_sum/2) to get a rounded integer division instead of floor; the exact form here is needed to be able to construct an inverse map with the right properties, which we will get to later). <P> <B> Sketch of full arithmetic coding process </B> <P> A quick overview of what the decoder has to do. Most of the decoder just replicates the same work as the encoder; it narrows the arithmetic interval in exactly the same way. Rather than streaming out bytes in renormalization, the decoder streams them in. The decoder sees the arithmetic code value that the encoder sent, to some precision ahead. It needs enough bits fetched to be able to resolve the correct symbol (to tell which CDF bin is selected). <P> In implementation, rather than track the low/high arithmetic interval and the arithmetic number within that interval, we instead just track (arithmetic - low), the offset inside the interval. I call this the "code" in the decoder. <P> The decoder needs an extra step that the encoder doesn't do : given the current "code" , figure out what symbol that specifies. To do that, we have to take the "code" (in the arithmetic interval domain), map it back to CDF domain, then scan the CDF intervals to find which symbol's bin it falls in. <P> To do so requires an "inverse" map (arithmetic domain -> CDF domain), the opposite of the "forward" map (CDF -> arithmetic) we just introduced. <P> A full general purpose (multi-symbol) arithmetic coder is : <FONT COLOR=GREEN><PRE> (in integers now) Encode : look up CDF of the symbol you want to encode map CDF interval to range inverval : lo = forward(cdf_low,range,cdf_sum); hi = forward(cdf_high,range,cdf_sum); arithmetic_low += lo; arithmetic_range = hi - lo; propagate carry in "arithmetic_low" if necessary renormalize if necessary Decode : take current arithmetic "code" map it back to CDF domain : target = inverse(arithmetic_code,range,cdf_sum); find symbol from CDF target such that : CDF_low <= target < CDF_high rest proceeds like encoder: lo = forward(cdf_low,range,cdf_sum); hi = forward(cdf_high,range,cdf_sum); arithmetic_code -= lo; arithmetic_range = hi - lo; renormalize if necessary </PRE></FONT> The encoder needs "forward" twice, the decoder needs "forward" twice plus "inverse" once. <P> Naive implementation of forward & inverse both need division, which would mean 2 and 3 divides for encode & decode, respectively. <P> <B> The inverse map and when you don't need it </B> <P> First of all, why do you need the inverse map, and when do you not need it? <P> One common case where you don't need the inverse map at all is binary arithmetic coding. In that case it is common to just do the forward map and resolve the symbol in arithmetic domain, rather than CDF domain. <P> That is : <FONT COLOR=GREEN><PRE> binary decoder : arithetmetic_code is known map the threshold between bits 0 & 1 to arithmetic domain : arihmetic_mid = forward(cdf_min,range,cdf_sum); find bin in arithmetic domain : symbol = arithetmetic_code >= arithmetic_mid; lo/hi = { 0 , arithmetic_mid , range } </PRE></FONT> (in the binary case we also only need one forward map, not two, since one of the end points is always 0 or range). <P> Now, you can do the same technique for small alphabet multi-symbol, for 4 or 8 or 16 symbols (in SIMD vectors); rather than make a CDF target to look up the symbol, instead take all the symbol CDF's and scale them into arithmetic domain. In practice this means a bunch of calls to "forward" (ie. a bunch of multiplies) rather than one call to "inverse" (a divide). <P> But for full alphabet (ie 8 bit, 256 symbol), you don't want to scale all the CDF's. (exception: Fenwick Tree style descent). Typically you want a static (or semi-static, defsum) probability model, then you can do the CDF -> symbol lookup using just a table. In that case we can construct the symbol lookup in CDF domain, we need the map from arithmetic domain back to CDF domain. <P> The inverse map must have properties : <P> <FONT COLOR=GREEN><PRE> assume range >= cdf_sum so the forward map is a stretch , inverse map is a contraction should invert exactly at the CDF end points : y = forward(x); inverse(y) == x the CDF buckets should map to the lower CDF : lo = forward(x) hi = forward(x+1) (hi > lo , it can be much greater than 1 apart) inverse( anything in [lo,hi) ) = x </PRE></FONT> in hand wavey terms, you need inverse to act like floor division. eg : <FONT COLOR=GREEN><PRE> CDF domain  -> arithmetic domain  </PRE></FONT> For example, for the CACM87 forward map we used above, the inverse is : <FONT COLOR=GREEN><PRE> CACM87 forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum; inverse(code,range,cdf_sum) = ( code * cdf_sum + cdf_sum-1 ) / range; (integers, floor division) </PRE></FONT> In most general form, both forward & inverse map require division. The forward map is easy to make divide-free, but the inverse map not so. <P> <B> Getting rid of division and cdf_sum power of 2 </B> <P> We'll now start getting rid of pesky division. <P> The first thing we can do, which we will adopt henceforth, is to choose cdf_sum to be a power of 2. We can choose our static model to normalize the cdf sum to a power of 2, or with adaptive modeling use a scheme that maintains power of 2 sums. (defsum, constant sum shift, sliding window, etc.) <FONT COLOR=GREEN><PRE> cdf_sum = 1<<code>&lt;</code>cdf_shift; CACM87 forward(cdf,range,cdf_sum) = ( cdf * range ) >> cdf_shift; </PRE></FONT> So we have eliminated division from the forward map, but it remains in the inverse map. (the problem is that the CDF domain is our "home" which is under our control, while the arithmetic domain is constantly moving around, stretching and shrinking, as the "range" interval is modified). <P> We're fundamentally stuck needing something like "1 / range" , which is the whole crux of the problem that recip_arith is trying to attack. <P> I think we'll come back to that next time, as we've come far enough. <P> While I'm going into these forward/inverse maps lets go ahead and mention the standard "range coder" : <FONT COLOR=GREEN><PRE> range coder : forward(cdf,range,cdf_sum) = cdf * ( range / cdf_sum ); inverse(code,range,cdf_sum) = code / ( range / cdf_sum ); (integer, floor division) or with power of 2 cdf_sum and in a more C way : r_norm = range >> cdf_shift; forward(cdf,range,cdf_sum) = cdf * r_norm; inverse(code,range,cdf_sum) = code / r_norm; </PRE></FONT> where I'm introducing "r_norm" , which is just the ratio between "range" and "cdf_sum" , or the scaling from CDF to arithmetic domain. <P> Historically, the "range coder" map was attractive (vs CACM87) because it allowed 32 bit arithmetic state. In the olden days we needed to do the multiply and stay in 32 bits. In CACM87 you have to do (cdf * range) in the numerator, so each of those was limited to 16 bits. Because the arithmetic state (code/range) was only 16 bits, you had to do bit renormalization (in order to keep range large enough to do large CDF sums (actually byte renormalization was done as far back as 1984, in which case CDF sum was capped at 8 bits and only binary arithmetic coding could be done)). By adopting the "range coder" map, you could put the arithmetic state in 32 bits and still use just 32 bit registers. That meant byte renormalization was possible. <P> So, with modern processors with 64-bit registers there's actually very little advantage to the "range coder" map over the CACM87 map. <P> The range coder map has some coding loss. The fundamental reason is that the forward() map scaling is not exactly (Probability * range). Another way to think of that is that not all of range is used. In the "range coder" : <FONT COLOR=GREEN><PRE> forward(cdf_sum) = cdf_sum * r_norm = r_norm << cdf_shift = (range >> cdf_shift) << cdf_shift forward(cdf_sum) = range with bottom cdf_shift bits turned off unused region is (range % cdf_sum) </PRE></FONT> The coding loss is small in practice (because we ensure that range is much larger than cdf_sum). In typical use, range is 24-31 bits and cdf_shift is in 11-14 , then the coding loss is on the order of 0.001 bpb. You can make the coding loss of the range coder arbitrarily small by using larger range (eg. 56-63 bits) with small cdf_sum. <P> The "range coder" map is actually much simpler than the CACM87 map. It simply takes each integer step in the CDF domain, and turns that into a step of "r_norm" in the arithmetic domain. The inverse map then just does the opposite, each run of "r_norm" steps in the arithmetic domain maps to a single integer step in the CDF domain. <P> .. and that's enough background for now. <P> The entry page for recip_arith : <A HREF="https://cbloomrants.blogspot.com/2018/10/10-16-18-multi-symbol-division-free.html"> cbloom rants A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication </A> <P> Links : <P> <A HREF="http://cbloomrants.blogspot.com/2008/10/10-05-08-5.html"> cbloom rants 10-05-08 Rant on New Arithmetic Coders </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2008/10/10-06-08-followup-on-russian-range-coder.html"> cbloom rants 10-06-08 Followup on the Russian Range Coder </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2008/10/10-07-08-little-more-on-arithmetic.html"> cbloom rants 10-07-08 A little more on arithmetic coding ... </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2008/10/10-10-08-on-art-of-good-arithmetic.html"> cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2010/08/08-16-10-range-coder-revisited-oh-wait.html"> cbloom rants 08-16-10 - Range Coder Revisited .. oh wait, nevermind </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2010/09/09-16-10-modern-arithmetic-coding-from.html"> cbloom rants 09-16-10 - Modern Arithmetic Coding from 1987 </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2008/10/10-08-08.html"> cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places. </A> <BR> <A HREF="https://cbloomrants.blogspot.com/2010/08/08-11-10-huffman-arithmetic-equivalence.html"> cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence </A> <BR> <A HREF="https://people.xiph.org/~tterribe/notes/range.html"> On the Overhead of Range Coders </A> <BR> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/KMZEPY8HZTY" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-33279902590671915962018-10-18T10:43:00.003-07:002018-10-22T16:31:48.792-07:00A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication A standard multi-symbol arithmetic coder necessarily requires a costly divide to map the arithmetic code target back to a cdf (even when you have chosen the cdf sum to be a power of 2). <P> Doing binary arithmetic coding without a divide is trivial. Previous methods for division free multi-symbol arithmetic coding have been at the cost of low coding precision, that is coding in more bits than -log2 of the symbol probabilities (for example see DCC95 and SM98). <P> code is here : <P> <A HREF="https://github.com/thecbloom/recip_arith"> recip_arith on github </A> <P> recip_arith is public domain. <P> Our method uses an approximate map from the cdf interval to the range interval, scaling by only a few top bits of range. This approximate map introduces coding loss, but that can be made quite small with just 8 bits from range. <P> The advantage of this approximate map is that it can be exactly inverted using a table of reciprocal multiplies : <FONT COLOR=GREEN><PRE> x / y == (x * recip_table[y]) >> 32 recip_table[y] = ((1<<32) + y-1) / y </PRE></FONT> x/y in integers is the floor divide, and recip_table is the ceil reciprocal. (see Alverson, "Integer Division using reciprocals"). The choice of a 32 bit numerator is somewhat arbitrary, but we do want the reciprocals to fit in 32 bit words to minimize the size of recip_table in L1 cache. <P> Crucially because only the top 8 bits of "range" are used in the forward map, then "y" needed in the divide is only 8 bits, so the recip_table can be quite small. Furthermore, 8 bits + 24 bits of cdf in the numerator "x" can be inverted exactly. <P> <B> Coding Efficiency </B> <P> The coding loss of "recip_arith" with 8 bit tables is reliably under 0.005 bpb (around 0.1%) , in constrast to SM98 where the coding loss can be 10X higher (0.05 bpb or 1.0%) <FONT COLOR=GREEN><PRE> Calgary Corpus file "news" , len=377,109 cdf_bits = 13 (symbol frequencies sum to 1<<13) from smallest to largest output : recip_arith coder (down/up 8-bit) : 244,641 = 5.190 bpb cacm87: 244,642 = 5.190 bpb range coder: 244,645 = 5.190 bpb recip_arith coder (with end of range map) : 244,736 = 5.192 bpb recip_arith coder: 244,825 = 5.194 bpb recip_arith coder (down/up 2-bit) (aka Daala "reduced overhead") : 245,488 = 5.208 bpb SM98 : 245,690 = 5.212 bpb </PRE></FONT> (the down/up and "end of range map" variants will be discussed later) <P> The crucial step of recip_arith is the way the arithmetic coding interval is shrunk to encode a symbol. <P> The standard range coder does : <FONT COLOR=GREEN><PRE> uint32_t r_norm = ac->range >> cdf_bits; ac->low += cdf_low * r_norm; ac->range = cdf_freq * r_norm; </PRE></FONT> while recip_arith does : <FONT COLOR=GREEN><PRE> uint32_t range = ac->range; int range_clz = clz32(range); uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS); uint32_t r_norm = r_top << ( 32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits); ac->low += cdf_low * r_norm; ac->range = cdf_freq * r_norm; </PRE></FONT> where "r_top" is just the highest RECIP_ARITH_TABLE_BITS bits of range, and r_norm is those bits back in their original position, then shifted down by cdf_bits. In the end the way the interval is shrunk is exactly the same as the range coder, except that all bits below the top RECIP_ARITH_TABLE_BITS of "r_norm" are turned off. <P> <B> Is it useful ? </B> <P> It's unclear if there are really killer practical advantages to recip_arith at the moment. On many modern high end CPUs, 32 bit divides are pretty fast, so if you just take recip_arith and use it to replace a standard range coder you might not see much difference. On CPUs with slow divides there is a more significant advantage. <P> recip_arith provides an interesting continuum that connects very low precision coders like SM98 to the full precision range coder. The "up/down" variant with a very small reciprocal table (4 bits?) might be interesting for hardware implementations, where division is not desirable. <P> recip_arith works with 64-bit coding state (the standard range coder would require a 64-bit divide, which is often much slower than a 32-bit divide), which can provide advantages. recip_arith also works with the encoder & decoder not in lock step; eg. you could encode with 32-bit state and decode with 64-bit state, allowing you independence of the bitstream & implementations, which is very desirable; not all arithmetic coders have this property. <P> I think primarily it is theoretically interesting at the moment, and it remains to be seen if that turns into a compelling practical application. <P> <B> What's Next </B> <P> I'll be doing a couple of followup posts going into the details of how/why this works. We'll talk about the theory of arithmetic coders and how to think about them. Then we'll look at the up/down and "map end" variants. <P> <B> Index of posts </B> <P> <A HREF="https://github.com/thecbloom/recip_arith"> recip_arith on github </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2018/10/about-arithmetic-coders-and-reciparith.html"> 1. cbloom rants About arithmetic coders and recip_arith in particular </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2018/10/arithmetic-coder-division-free-maps.html"> 2. cbloom rants Arithmetic coder division-free maps </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2018/10/working-through-reciparith.html"> 3. cbloom rants Working through recip_arith </A> <BR> <A HREF="http://cbloomrants.blogspot.com/2018/10/reciparith-without-unused-range.html"> 4. cbloom rants recip_arith without unused range </A> <BR> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/JeowZUc-8tg" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-56241507232459233362018-10-13T12:18:00.003-07:002018-10-18T10:45:46.016-07:00If you have had difficulty sending me emailMy email has been extremely flaky over the last few months due to problems at dreamhost (who host cbloom.com). <P> If you sent me an email and I did not reply, it's possible I didn't get it. Check your spam for bounced emails, since in my experience many undelivered mail responses incorrectly wind up in spam folders these days. <P> I will be changing to a new host (because dreamhost sucks and has consistently failed to fix their email issues). There may be more missed emails in the transition. <P> If you want to send me something reliably, please try carrier pigeon or can on string, since technology has only gotten worse since then. (grumble grumble, we all as an industry fucking epically suck and computers are monumentally awful and depressing, grumble grumble) <P><HR><P> Transition in progress... looks like I'll be moving cbloom.com web hosting off dreamhost as well. Let me know if you see any problems. <P><HR><P> ... done. Now on fastmail. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/xUAFTaKlsdI" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-55164929142298914422018-10-01T13:30:00.003-07:002018-10-01T16:40:27.395-07:00Oodle Lossless Image We have released <A HREF="http://www.radgametools.com/oodlelimage.htm">Oodle Lossless Image Compression</A> (aka OLI) v1.0 <P> OLI is built on the blazing fast Oodle Data Compression engine, with a new very efficient image specific front end. OLI has a very simple native API, and it also has drop in look-alike APIs to replace stb_image or libpng, so it's very easy to integrate. <P> OLI gets much more compression than PNG. Its compression ratio is similar to higher compression codecs like webp-ll or FLIF. The big advantage of OLI is its blazing fast decode speed. OLI decodes way faster than any of the competition (3-10X faster), so you can compress big images smaller and load them faster. <P> (OLI is mostly by Jon Olick who did an awesome job on it) <P> OLI is currently for full color lossless images only. It's not for 3d game textures, it doesn't do BCn GPU compressed textures, it doesn't do things like half-float normal maps. It's for 24 bit RGB and 32 bit RGBA images, the kind of things you would have used PNG for before. OLI currently has only limited support for low color images (eg. palettized, 1-bit, and gray scale images). It's early days for OLI still and that support will get better, particularly if we hear from customers who need it. <P> For more about OLI visit <A HREF="http://www.radgametools.com/oodlelimage.htm">the RAD Game Tools web site</A>. <P> Image sizes on my "good" test set, smallest size for each image in bold : <P> <TABLE BORDER=2 CELLSPACING=2> <TR BGCOLOR=#BFBFBF><TD BGCOLOR=#BFBFBF><B></B></TD> <TD>oli</TD> <TD>webpll</TD> <TD>pngcrush</TD> <TD>png</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>4069138464_193981e0e2_o </TD> <TD><B>3140944</B></TD> <TD>3319222</TD> <TD>4019282</TD> <TD>4025911</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>9xkFD </TD> <TD><B>2524927</B></TD> <TD>2772626</TD> <TD>3850985</TD> <TD>3856541</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>A-Girl-With-Kind-Eyes </TD> <TD><B>1212598</B></TD> <TD>1307518</TD> <TD>1568931</TD> <TD>1620105</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Beach_Hurricane_WaLp_TW2011 </TD> <TD><B>1970283</B></TD> <TD>1997532</TD> <TD>2543245</TD> <TD>2606013</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Beach_Starfish_and_Conch_Shell_Wp_TW </TD> <TD><B>2266789</B></TD> <TD>2299242</TD> <TD>2591019</TD> <TD>2701580</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>CliffSideA01_cm0001 </TD> <TD>3475407</TD> <TD><B>3340540</B></TD> <TD>4302971</TD> <TD>4309175</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>CliffSideA03_cm0001 </TD> <TD>3626390</TD> <TD><B>3560460</B></TD> <TD>4697715</TD> <TD>4704495</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Girl_in_white </TD> <TD><B>1253577</B></TD> <TD>1313662</TD> <TD>1726569</TD> <TD>1729053</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>IMG_0060 </TD> <TD><B>2324810</B></TD> <TD>2601634</TD> <TD>3767787</TD> <TD>3773211</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>IMG_0152 </TD> <TD><B>2166388</B></TD> <TD>2389052</TD> <TD>3668094</TD> <TD>3673386</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>IMG_0155 </TD> <TD><B>1526228</B></TD> <TD>1613594</TD> <TD>2369898</TD> <TD>2373318</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>IMG_0195 </TD> <TD><B>1998851</B></TD> <TD>2115548</TD> <TD>2910248</TD> <TD>2914448</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>IMG_0339 </TD> <TD><B>1815852</B></TD> <TD>2021904</TD> <TD>3153675</TD> <TD>3158211</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>IMG_0341 </TD> <TD><B>1196634</B></TD> <TD>1267144</TD> <TD>1865148</TD> <TD>1867836</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>LOVE1 </TD> <TD><B>1170354</B></TD> <TD>1217124</TD> <TD>1598967</TD> <TD>1601271</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Lily+Allen+smile </TD> <TD><B>1156702</B></TD> <TD>1291656</TD> <TD>1850890</TD> <TD>1853554</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>PDI_1200 </TD> <TD><B>921293</B></TD> <TD>959710</TD> <TD>1260621</TD> <TD>1262433</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>POMSep2007 </TD> <TD><B>1061705</B></TD> <TD>1066320</TD> <TD>1520546</TD> <TD>1522742</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Portal2Win </TD> <TD>2992069</TD> <TD><B>2904416</B></TD> <TD>3695920</TD> <TD>3701248</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Spring_Beauty </TD> <TD><B>1749534</B></TD> <TD>1788812</TD> <TD>2041309</TD> <TD>2118280</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>Sunset_Rocky_Beach_Walp_TW </TD> <TD>2700270</TD> <TD><B>2665954</B></TD> <TD>3137206</TD> <TD>3192653</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>ainokura_bibble </TD> <TD>817631</TD> <TD><B>817516</B></TD> <TD>1085462</TD> <TD>1087022</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>artificial </TD> <TD>721090</TD> <TD><B>578668</B></TD> <TD>845572</TD> <TD>846796</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>big_building </TD> <TD><B>3772059</B></TD> <TD>3795954</TD> <TD>4569456</TD> <TD>4579407</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>big_tree </TD> <TD>3476811</TD> <TD><B>3466330</B></TD> <TD>3747685</TD> <TD>3753085</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>bmw </TD> <TD><B>1408411</B></TD> <TD>1429148</TD> <TD>1790859</TD> <TD>1793439</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>bridge </TD> <TD>1401722</TD> <TD><B>1400592</B></TD> <TD>1535838</TD> <TD>1538094</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>c77b75083b34 </TD> <TD><B>2516018</B></TD> <TD>2657536</TD> <TD>3818229</TD> <TD>3823737</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>cathedral </TD> <TD><B>1914743</B></TD> <TD>1926612</TD> <TD>2212063</TD> <TD>2215255</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>church-in-the-prairie </TD> <TD><B>1955457</B></TD> <TD>1984848</TD> <TD>2973585</TD> <TD>2977869</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>cool-orb </TD> <TD>379704</TD> <TD><B>326258</B></TD> <TD>492951</TD> <TD>493671</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>deer </TD> <TD>4753833</TD> <TD><B>4454234</B></TD> <TD>4757131</TD> <TD>4763983</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>flower_foveon </TD> <TD>1701007</TD> <TD><B>1672546</B></TD> <TD>2194479</TD> <TD>2213494</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>hdr </TD> <TD><B>1588182</B></TD> <TD>1610458</TD> <TD>1942456</TD> <TD>2011759</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>highclass_by_DivineError_lossless </TD> <TD>591236</TD> <TD><B>588602</B></TD> <TD>848270</TD> <TD>849494</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>leaves_iso_200 </TD> <TD><B>2537504</B></TD> <TD>2613286</TD> <TD>3120814</TD> <TD>3125314</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>moses </TD> <TD><B>2137994</B></TD> <TD>2183616</TD> <TD>3220571</TD> <TD>3225215</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>mysoup </TD> <TD><B>2473925</B></TD> <TD>2717346</TD> <TD>3631056</TD> <TD>3636300</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>nails2 </TD> <TD><B>2181876</B></TD> <TD>2300032</TD> <TD>3228688</TD> <TD>3233344</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>nightshot_iso_100 </TD> <TD>2023458</TD> <TD><B>1948254</B></TD> <TD>2293315</TD> <TD>2397815</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>phoenix </TD> <TD>110389</TD> <TD><B>59800</B></TD> <TD>83692</TD> <TD>131474</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>pulchra2.1 </TD> <TD><B>2421925</B></TD> <TD>2536570</TD> <TD>3482309</TD> <TD>3487337</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>record_only </TD> <TD><B>789644</B></TD> <TD>790842</TD> <TD>1211780</TD> <TD>1282677</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>school-shoot-moodboard-number-1 </TD> <TD><B>1587878</B></TD> <TD>1638888</TD> <TD>2199389</TD> <TD>2202557</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>space-desktop </TD> <TD>561563</TD> <TD><B>545412</B></TD> <TD>724571</TD> <TD>725615</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>spider_web </TD> <TD><B>2119921</B></TD> <TD>2203508</TD> <TD>2603659</TD> <TD>2715256</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>windmill_evsm </TD> <TD><B>1470104</B></TD> <TD>1487660</TD> <TD>2090287</TD> <TD>2093311</TD> </TR> <TR><TD BGCOLOR=#BFBFBF>yahoo </TD> <TD>1999722</TD> <TD><B>1700342</B></TD> <TD>2710221</TD> <TD>2849650</TD> </TR> <TR BGCOLOR=#BFBFBF><TD BGCOLOR=#BFBFBF>total</TD> <TD>91665412</TD> <TD>93248528</TD> <TD>121555414</TD> <TD>122618434</TD> </TR> </TABLE> <P> <FONT COLOR=GREEN><PRE> oli is oli_enc --super webpll is cwebp -lossless -z 9 pngcrush is default options png is made by bmp2png </PRE></FONT> <P><HR><P> <div style="background-color:#d0d0d0"> Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, <A HREF="http://www.radgametools.com/oodle.htm">visit the RAD Game Tools web site.</A> This is my personal blog where I post supplemental material about Oodle. </div> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/n7xx2XGmWbg" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-72607233862514586492018-08-06T12:36:00.003-07:002018-08-06T12:36:48.303-07:00Oodle 2.7.0 release : Network SDK separation Oodle 2.7.0 is now out. <P> The biggest change is that the Network and Data portions are now in two different SDKs. You may now use either one on its own, or both together. If you are a licensee of both Data and Network, you will get two SDKs. The two SDKS can be installed to the same place, and shared files may overwrite. <P> The compressed data bitstreams (for both Network & Data) are not changed from 2.6 to 2.7 ; I've bumped the major version just because of the large API change of splitting the libs. <P> We've also made some optimizations in the LZ decoders; Mermaid, Kraken & Leviathan are all a wee bit faster to decode. <P> Perf comparison to previous versions : <FONT COLOR=GREEN><PRE> Oodle 2.7.0 , Kraken 8 : PD3D : 4.02:1 , 1.0 enc MB/s , 1128.5 dec MB/s GTS : 2.68:1 , 1.0 enc MB/s , 1324.1 dec MB/s Silesia : 4.25:1 , 0.6 enc MB/s , 1062.2 dec MB/s ============================ PD3D : Kraken8 255 : 3.67:1 , 2.8 enc MB/s , 1091.5 dec MB/s Kraken8 260 -v5: 3.72:1 , 1.2 enc MB/s , 1079.9 dec MB/s Kraken8 260 : 4.00:1 , 1.0 enc MB/s , 1034.7 dec MB/s GTS : Kraken8 255 : 2.60:1 , 2.5 enc MB/s , 1335.8 dec MB/s Kraken8 260 -v5: 2.63:1 , 1.2 enc MB/s , 1343.8 dec MB/s Kraken8 260 : 2.67:1 , 1.0 enc MB/s , 1282.3 dec MB/s Silesia : Kraken8 255 : 4.12:1 , 1.4 enc MB/s , 982.0 dec MB/s Kraken8 260 -v5: 4.18:1 , 0.6 enc MB/s , 1018.7 dec MB/s Kraken8 260 : 4.24:1 , 0.6 enc MB/s , 985.4 dec MB/s </PRE></FONT> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/Q5-K8BCqJU4" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-66250376457708275312018-07-09T11:29:00.002-07:002018-07-09T11:29:21.355-07:00Oodle ozip for command line compression and piped streams Oodle now ships with "ozip", a command line compressor that acts like gzip (and bzip, and xz). <P> ozip can be used as a command line compressor to create or decode Oodle-compressed files; ozip can also be used to pipe Oodle-compressed streams. <P> The intention is that ozip should act similarly to gzip (in terms of command line arguments, but with more compression and faster decompression) so it can be dropped in to standard work flows that use gzip-like compressors. For example ozip can be used with tar "I" ("--use-compress-program") to pipe the tar package through ozip. <P> ozip works with pipes (particularly useful on Unix), so it can be used to pipe compressed data. (eg. with things like "zfs send" to a pipe). <P> A pre-compiled ozip binary is now distributed with the Oodle SDK. <P> You can also get and modify the ozip source code on github : <P> <A HREF="https://github.com/jamesbloom/ozip/"> ozip on github </A> <P> To build ozip from source code, you need the <A HREF="http://www.radgametools.com/oodle.htm"> Oodle SDK </A>. (ozip is open source and public domain, Oodle is not). <P> If you have corrections to ozip, we're happy to take pull requests, particularly wrst making sure we act like gzip and it's easy to drop in ozip to gzip-like work flows on Unix. When writing ozip, we found that gzip/bzip/xz don't have the exact same command line argument handling, and yet are treated as interchangeable by various programs. We tried to replicate the common intersection of their behavior. <P> ozip is a single file compressor, not a package archiver. It does not store file names or metadata. <P> ozip was written by my brother, James Bloom. <P><HR><P> If you have server workflows that involve streaming compressed data, or think you could benefit from Oodle, we'd be happy to hear from you. We're still evolving our solutions in this space. <P> If you are compressing very large files (not piped streams), you can get much higher performance by using threads and asynchronous IO to overlap IO with compression CPU time. If this is important to you, ask about the "oozi" example in Oodle for reference on how to do that. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/L_-Iu6Un868" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-90282965967777469432018-06-07T11:43:00.000-07:002018-06-07T12:01:09.909-07:00New in Oodle 2.6.3 : HyperFast Encode Speeds Oodle 2.6.3 now has faster encode levels ("hyperfast"), for uses where encode speed is crucial. <P> Previously the fastest Oodle encode level was "SuperFast" (level 1). The new "HyperFast" levels are below that (level -1 to -4). The HyperFast levels sacrifice some compression ratio to maximize encode speed. <P> An example of the performance of the new levels (on lzt99, x64, Core i7-3770) : <P> <TABLE><TR><TD> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=500x450&chbh=15,2,10&chxt=x,y&chco=A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7&chxr=1,0,700,100&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Kraken-z-3|Kraken-z-2|Kraken-z-1|Kraken-z1|Kraken-z2|Kraken-z3|Kraken-z4|Mermaid-z-3|Mermaid-z-2|Mermaid-z-1|Mermaid-z1|Mermaid-z2|Mermaid-z3|Mermaid-z4|Selkie-z-3|Selkie-z-2|Selkie-z-1|Selkie-z1|Selkie-z2|Selkie-z3|Selkie-z4&chd=e:mH,ee,Zn,PR,LA,Dj,CS,oI,gW,ZW,X5,Pz,I4,Dv,5Y,qp,h2,fJ,Su,J0,D.&chxl=0:|lzt99&chtt=encode+MB/s"> </TD><TD> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=500x450&chbh=15,2,10&chxt=x,y&chco=A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7&chxr=1,0,3,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Kraken-z-3|Kraken-z-2|Kraken-z-1|Kraken-z1|Kraken-z2|Kraken-z3|Kraken-z4|Mermaid-z-3|Mermaid-z-2|Mermaid-z-1|Mermaid-z1|Mermaid-z2|Mermaid-z3|Mermaid-z4|Selkie-z-3|Selkie-z-2|Selkie-z-1|Selkie-z1|Selkie-z2|Selkie-z3|Selkie-z4&chd=e:kg,oC,s2,wY,xf,y-,z7,ja,mQ,q5,ri,tM,uy,vF,e3,gj,jy,kO,lS,nH,nv&chxl=0:|lzt99&chtt=compression+ratio"> </TD></TR></TABLE> Higher CompressionLevels are to the right in the bar charts above; they get higher compression ratios at the cost of lower encode speed. Charts show three HyperFast levels (-1 to -3) and 4 normal levels (1 to 4). <P> <IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=750x380&chxt=x,y&chxr=0,4,10,1|1,0,2,1&chd=e:yKuusDkFfDNpG3y9vpr5q-knbwOc4dz5wWvDnOdSPc,YzdEiTlzm2oOpEXZa9gQg8ipkQkjRDThX5YcZyb.cu&chdl=Kraken-z-3|Kraken-z-2|Kraken-z-1|Kraken-z1|Kraken-z2|Kraken-z3|Kraken-z4|Mermaid-z-3|Mermaid-z-2|Mermaid-z-1|Mermaid-z1|Mermaid-z2|Mermaid-z3|Mermaid-z4|Selkie-z-3|Selkie-z-2|Selkie-z-1|Selkie-z1|Selkie-z2|Selkie-z3|Selkie-z4&chdlp=r&chco=A92FFF|A92FFF|A92FFF|A92FFF|A92FFF|A92FFF|A92FFF|FF00FF|FF00FF|FF00FF|FF00FF|FF00FF|FF00FF|FF00FF|B097F7|B097F7|B097F7|B097F7|B097F7|B097F7|B097F7&chm=s,000000,0,0,0|o,A92FFF,0,0,10|o,A92FFF,0,1,10|o,A92FFF,0,2,10|o,A92FFF,0,3,10|o,A92FFF,0,4,10|o,A92FFF,0,5,10|o,A92FFF,0,6,10|o,FF00FF,0,7,10|o,FF00FF,0,8,10|o,FF00FF,0,9,10|o,FF00FF,0,10,10|o,FF00FF,0,11,10|o,FF00FF,0,12,10|o,FF00FF,0,13,10|o,B097F7,0,14,10|o,B097F7,0,15,10|o,B097F7,0,16,10|o,B097F7,0,17,10|o,B097F7,0,18,10|o,B097F7,0,19,10|o,B097F7,0,20,10&chtt=loglog+ratio+vs+encode+MB/s&chxl=0:|16|32|64|128|256|512|1024|1:|1|2|4|"> <P> In the loglog plot, up = higher compression ratio, right = faster encode. <P> <FONT COLOR=GREEN><PRE> lzt99 : Kraken-z-3 : 1.711 to 1 : 416.89 MB/s lzt99 : Kraken-z-2 : 1.877 to 1 : 333.28 MB/s lzt99 : Kraken-z-1 : 2.103 to 1 : 280.09 MB/s lzt99 : Kraken-z1 : 2.268 to 1 : 167.01 MB/s lzt99 : Kraken-z2 : 2.320 to 1 : 120.39 MB/s lzt99 : Kraken-z3 : 2.390 to 1 : 38.85 MB/s lzt99 : Kraken-z4 : 2.434 to 1 : 24.98 MB/s lzt99 : Mermaid-z-3 : 1.660 to 1 : 438.89 MB/s lzt99 : Mermaid-z-2 : 1.793 to 1 : 353.82 MB/s lzt99 : Mermaid-z-1 : 2.011 to 1 : 277.35 MB/s lzt99 : Mermaid-z1 : 2.041 to 1 : 261.38 MB/s lzt99 : Mermaid-z2 : 2.118 to 1 : 172.77 MB/s lzt99 : Mermaid-z3 : 2.194 to 1 : 97.11 MB/s lzt99 : Mermaid-z4 : 2.207 to 1 : 40.88 MB/s lzt99 : Selkie-z-3 : 1.447 to 1 : 627.76 MB/s lzt99 : Selkie-z-2 : 1.526 to 1 : 466.57 MB/s lzt99 : Selkie-z-1 : 1.678 to 1 : 370.34 MB/s lzt99 : Selkie-z1 : 1.698 to 1 : 340.68 MB/s lzt99 : Selkie-z2 : 1.748 to 1 : 204.76 MB/s lzt99 : Selkie-z3 : 1.833 to 1 : 107.29 MB/s lzt99 : Selkie-z4 : 1.863 to 1 : 43.65 MB/s </PRE></FONT> <P> A quick guide to the Oodle CompressionLevels : <FONT COLOR=GREEN><PRE> -4 to -1 : HyperFast levels when you want maximum encode speed these sacrifice compression ratio for encode time 0 : no compression (memcpy pass through) 1 to 4 : SuperFast, VeryFast, Fast, Normal these are the "normal" compression levels encode times are ballpark comparable to zlib 5 to 8 : optimal levels increasing compression ratio & encode time levels above 6 can be slow to encode these are useful for distribution, when you want the best possible bitstream </PRE></FONT> Note that the CompressionLevel is a dial for encode speed vs. compression ratio. It does not have a consistent correlation to decode speed. That is, all of these compression levels get roughly the same excellent decode speed. <P> Comparing to Oodle 2.6.0 on Silesia : <FONT COLOR=GREEN><PRE> Oodle 2.6.0 : Kraken 1 "SuperFast" : 3.12:1 , 147.2 enc MB/s , 920.9 dec MB/s Kraken 2 "VeryFast" : 3.26:1 , 107.8 enc MB/s , 945.0 dec MB/s Kraken 3 "Fast" : 3.50:1 , 47.1 enc MB/s , 1043.3 dec MB/s Oodle 2.6.3 : Kraken -2 "HyperFast2" : 2.92:1 , 300.4 enc MB/s , 1092.5 dec MB/s Kraken -1 "HyperFast1" : 3.08:1 , 231.3 enc MB/s , 996.2 dec MB/s Kraken 1 "SuperFast" : 3.29:1 , 164.6 enc MB/s , 885.0 dec MB/s Kraken 2 "VeryFast" : 3.40:1 , 109.5 enc MB/s , 967.3 dec MB/s Kraken 3 "Fast" : 3.61:1 , 45.8 enc MB/s , 987.5 dec MB/s </PRE></FONT> Note that in Oodle 2.6.3 the normal levels (1-3) have also improved (much higher compression ratios). <P><HR><P> <div style="background-color:#d0d0d0"> Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, <A HREF="http://www.radgametools.com/oodle.htm">visit the RAD Game Tools web site.</A> This is my personal blog where I post supplemental material about Oodle. </div> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/x946RqjDz3I" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-4493071136673302562018-06-06T10:45:00.004-07:002018-06-06T10:45:46.070-07:00The Perils of Holistic Profiling I have found that many evaluators of Oodle are now trying to do timing of their entire process. That is, profiling the compressor by measuring the effect on total load time, or by profiling an entire game frame time (as opposed to profiling just the compression operation, perhaps in situ or perhaps in a test bench). <P> I believe what's happened is that many people have read about the dangerous of artificial benchmarks. (for example there are some famous papers on the perils of profiling malloc with synthetic use loads, or how profiling threading primitives in isolation is pretty useless). <P> While those warnings do raise important issues, the right response is not to switch to timing whole operations. <P> For example, while timing mallocs with bad synthetic data loads is not useful (and perhaps even harmful), similarly timing an entire application run to determine whether a malloc is better or not can also be misleading. <P> Basically I think the wrong lesson has been learned and people are over simplifying. They have taken one bad practice (time operations by running them in a synthetic test bench over and over), and replaced it with another bad practice (time the whole application). <P> The reality of profiling is far more complex and difficult. There is no one right answer. There is not a simple prescription of how to do it. Like any scientific measurement of a complex dynamic system, it requires care and study. It requires looking at the specific situation and coming up with the right measurement process. It requires secondary measurements to validate your primary measurements, to make sure you are testing what you think you are. <P> Now, one of the appealing things of whole-process timing is that in one very specific case, it is the right thing to do. <P> IF the thing you care about is whole-process time, and the process is always run the same way, and you do timing on the system that the process is run on, and in the same application state and environment, AND crucially this you are only allowed to make one change to the process - then whole process timing is right. <P> Let's first talk about the last issue, which is the "single change" problem. <P> Quite often a good change can appear to do nothing (or even be negative) for whole process time on its own. By looking at just the whole process time to evaluate the change, you miss a very positive step. Only if another step is taken will the value of that first step be shown. <P> A common case of this is if your process has other limiting factors that need to be fixed. <P> For example on the macroscopic level, if your game is totally GPU bound, then anything you do to CPU time will not show up at all if you are only measuring whole frame time. So you might profile a CPU optimization and see no benefit to frame time. You can miss big improvements this way, because they will only show up if you also fix what's causing the process to be GPU bound. <P> Similarly at a more microscopic level, it's common to have a major limiting factor in a sequence of code. For example you might have a memory read that typically misses cache, or an unpredictable branch. Any improvements you make to the arithmetic instructions in that area may be invisible, because the processor winds up stalling on a very slow cache line fill from memory. If you are timing your optimization work "in situ" to be "realistic" you can completely miss good changes because they are hidden by other bad code. <P> Another common example, maybe you convert some scalar code to SIMD. You think it should be faster, but you time it in app and it doesn't seem to be. Maybe you're bound in other places. Maybe you're suffering added latency from round-tripping from scalar to SIMD back to scalar. Maybe your data needs to be reformatted to be stored in SIMD friendly ways. Maybe the surrounding code needs to be also converted to SIMD so that they can hand off more smoothly. There may in fact be a big win there that you aren't seeing. <P> This is a general problem that greedy optimization and trying to look at steps one by one can be very misleading when measuring whole process time. Sometimes taking individual steps is better evaluated by measuring just those steps in isolation, because using whole process time obscures them. Sometimes you have to try a step that you believe to be good even if it doesn't show up in measurements, and see if taking more steps will provide a non-greedy multi-step improvement. <P> <B> Particular perils of IO timing </B> <P> A very common problem that I see is trying to measure data loading performance, including IO timing, which is fraught with pitfalls. <P> If you're doing repeated timings, then you'll be loading data that is already in the system disk cache, so your IO speed may just look like RAM speed. Is what's important to you cold cache timing (user's first run), or hot cache time? Or both? <P> Obviously there is a wide range of disk speeds, from very slow hard disks (as on consoles) in the 20 MB/s range up to SSD's and NVMe in the GB/s range. Which are you timing on? Which will your user have? Whether you have slow seeks or not can be a huge factor. <P> Timing on consoles with disk simulators (or worse : host FS) is particularly problematic and may not reflect real world performance at all. <P> The previously mentioned issue of high latency problems hiding good changes is very common. For example doing lots of small IO calls creates long idle times that can hide other good changes. <P> Are you timing on a disk that's fragmented, or nearly full? Has your SSD been through lots of write cycles already or does it need rebalancing? Are you timing when other processes are running hitting the disk as well? <P> Basically it's almost impossible to accurately recreate the environment that the user will experience. And the variation is not small, it can be absolutely massive. A 1 byte read could take anything between 1 nanosecond (eg. data already in disk cache) to 100 milliseconds (slow HD seek + other processes hitting disk). <P> Because of the uncertainty of IO timing, I just don't do it. I use a simulated "disk speed" and just set : <FONT COLOR=GREEN><PRE> disk time = data size / simulated disk speed </PRE></FONT> Then the question is, well if it's so uncertain, what simulated disk speed do you use? The answer is : all of them. You cannot say what disk speed the user will experience, there's a huge range, so you need to look at performance over a spectrum of disk speeds. <P> I do this by making a plot of what the total time for (load + decomp) is over a range of simulated disk speeds. Then I can examine how the performance is affected over a range of possible client systems, without trying to guess the exact disk speed of the client runtime environment. For more on this, see : <A HREF="http://cbloomrants.blogspot.com/2015/03/03-02-15-oodle-lz-pareto-frontier.html"> Oodle LZ Pareto Frontier </A> or <A HREF="http://cbloomrants.blogspot.com/2016/04/oodle-kraken-pareto-frontier.html"> Oodle Kraken Pareto Frontier </A>. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/Qlax6YcYE0w" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-8813145188977981102018-06-06T10:45:00.001-07:002019-03-25T10:00:27.837-07:00ZStd is faster than Leviathan ZStd is faster than Leviathan on some files ; well, no, it's not that simple. <P> This is another post about careful measurement, how to compare compressors, and about the unique way Oodle works. <P> (usual caveat: I don't mean to pick on ZStd here; I use it as a reference point because it is excellent, the closest thing to Oodle, and something we are often compared against. ZStd timing is done with lzbench; times are on x64 Core i7-3770) <P> There are two files in my "gametestset" where ZStd appears to be significantly faster to decode than Leviathan : <FONT COLOR=GREEN><PRE> e.dds : zstd 1.3.3 -22 3.32 MB/s 626 MB/s 403413 38.47% ooLeviathan8 : 1,048,704 -> 355,045 = 2.708 bpb = 2.954 to 1 decode : 1.928 millis, 6.26 c/b, rate= 544.03 MB/s Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 5.71 MB/s 4257 MB/s 16281301 84.18% ooLeviathan8 : 19,341,802 ->16,178,303 = 6.692 bpb = 1.196 to 1 decode : 8.519 millis, 1.50 c/b, rate= 2270.48 MB/s </PRE></FONT> Whoah! ZStd is a lot faster to decode than Leviathan on these files, right? (626 MB/s vs 544.03 MB/s and 4257 MB/s vs 2270.48 MB/s) <P> No, it's not that simple. Compressor performance is a two axis value of {space,speed}. It's a 2d vector, not a scalar. You can't simply take one component of the vector and just compare speeds at unequal compression. <P> All compressors are able to hit a range of {space,speed} points by making different decisions. For example with ZStd at level 22 you could forbid length 3 matches and that would bias it more towards decode speed and lower compression ratio. <P> Oodle is unique in being fundamentally built as a space-speed optimization process. The Oodle encoders can make bit streams that cover a range of compression ratios and decode speeds, depending on what the client asks it to prioritize. <P> Compressor performance is determined by two things : the fundamental algorithm, and the current settings. The settings will allow you to dial the 2d performance data point to different places. The algorithm places a limit on where those data points can be - it defines a Pareto Frontier. This Pareto curve is a fundamental aspect of the algorithm, while the exact space speed point on that curve is simply a choice of settings. <P> There is no such thing as "what is the speed of ZStd?". It depends how you have dialed the settings to reach different performance data points. The speed is not a fundamental aspect of the algorithm. The Pareto frontier *is* a fundamental aspect, the limit on where those 2d data points can reach. <P> One way to compare compression algorithms (as opposed to their current settings) is to plot many points of their 2d performance at different settings, and then inspect how the curves lie. One curve might strictly cover the other, then that algorithm is always better. Or, they might cross at some point, which means each algorithm is best in a different performance domain. <P> Another way to compare compression algorithms is to dial them to find points where one axis is equal (either they decode at the same speed, or they have the same compression ratio), then you can do a simple 1d comparison of the other value. You can also try to find points where one compressor is strictly better on both axes. The inconclusive situations is when one compressor is better on one axis, and the other is better on the other axis. <P> (note I have been talking about compressor performance as the 2d vector of {decode speed,ratio} , but of course you could also consider encode time, memory use, other factors, and then you might choose other axes, or have a 3d or 4d value to compare. The same principles apply.) <P> (there is another way to compare 2d compressor performance with 1d scalar; at RAD we internally use <A HREF="http://cbloomrants.blogspot.com/2018/01/05-17-16-weissman-score.html"> the corrected Weissman score </A>. One of the reasons we use the 1d Weissman score is because sometimes we make an improvement to a compressor and one of the axes gets worse. That is, we do some work, and then measure, and we see compression ratio went down. Oh no, WTF! But actually decode speed went up. From the 2d performance vector it can be hard to tell if you made an improvement or not, the 1d scalar Weissman score makes that easier.) <P> <B> Oodle is an optimizing compiler </B> <P> Oodle is fundamentally different than other compressors. There is no "Oodle has X performance". Oodle has whatever performance you ask it to have (and the compressed size will vary along the Pareto frontier). <P> Perhaps an analogy that people are more familiar with is an optimizing compiler. <P> The Oodle decoder is a virtual machine that runs a "program" to create an output. The compressed data is the program of commands that run in the Oodle interpreter. <P> The Oodle encoder is a compiler that makes the program to run on that machine (the Oodle decoder). The Oodle compiler tries to create the most optimal program it can, by considering different instruction sequences that can create the same output. Those different sequences may have different sizes and speeds. Oodle chooses them based on how the user has specified to consider the value of time vs. size. (this is a bit like telling your optimizing compiler to optimize for size vs. optimize for speed, but Oodle is much more fine grained). <P> For example at the microscopic level, Oodle might consider a sequence of 6 bytes. This can be sent as 6 literals, or a pair of length 3 matches, or a literal + a len 4 rep match + another literal. Each possibility is considered and the cost is measured for size & decode time. At the macroscopic level Oodle considers different encodings of the command sequences, whether to send somethings uncompressed or with different entropy coders, and different bit packings. <P> <B> Oodle is a market trader </B> <P> Unlike any other lossless compressor, Oodle makes these decisions based on a cost model. <P> It has been standard for a long time to make space vs. speed decisions in lossless compressors, but it has in the past always been done with hacky ad-hoc methods. For example, it's common to say something like "if the compressed size is only 1% less than the uncompressed size, then just send it uncompressed". <P> Oodle does not do that. Oodle considers its compression savings (bytes under the uncompressed size) to be "money". It can spend that money to get decode time. Oodle plays the market, it looks for the best price to spend its money (size savings) to get the maximum gain of decode time. <P> Oodle does not make ad-hoc decisions to trade speed for size, it makes an effort to get the best possible value for you when you trade size for speed. (it is of course not truly optimal because it uses heuristics and limits the search, since trying all possible encodings would be intractable). <P> Because of this, it's easy to dial Oodle to different performance points to find more fundamental comparisons with other compressors. (see, for example : <A HREF="http://cbloomrants.blogspot.com/2017/09/oodle-tuneability-with-space-speed.html"> Oodle tuneability with space-speed tradeoff </A> ) <P> Note that traditional ad-hoc compressors (like ZStd and everyone else) make mistakes in their space-speed decisions. They do not allocate time savings to the best possible files. This is an inevitable consequence of having simple thresholds in decision making (and this flaw is what led us to do a true cost model). That is, Leviathan decode speed is usually, say, 30% faster than ZStd. On some files that ratio goes way up or way down. When that happens, it is often because ZStd is making a mistake. That is, it's not paying the right price to trade size for speed. <P> Of course this relies on you telling Oodle the truth about whether you want decode speed or size. Since Oodle is aggressively trading the market, you must tell it the way you value speed vs. size. If you use Leviathan at default settings, Oodle thinks your main concern is size, not decode speed. If you actually care more about decode speed, you need to adjust the price (with "spaceSpeedTradeoffBytes") or possibly switch to another compressor (Kraken, Mermaid, or let <A HREF="http://cbloomrants.blogspot.com/2018/02/oodle-260-hydra-and-space-speed.html"> Hydra </A> switch for you). <P> <B> Back to the files where ZStd is faster </B> <P> Armed with our new knowledge, let's revist those two files : <P> <FONT COLOR=GREEN><PRE> e.dds : zstd 1.3.3 -22 : 2.600 to 1 : 625.72 MB/s e.dds : Leviathan -8 : 2.954 to 1 : 544.03 MB/s Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4257 MB/s Transistor_AudenFMOD_Ambience.bank : Leviathan -8 : 1.196 to 1 : 2270.48 MB/s </PRE></FONT> Is ZStd faster on these files? At this point we don't know. These are inconclusive data points. In both cases, Leviathan has more compression, but ZStd has more speed - the victor on each axis differs and we can't easily say which is really doing better. <P> To get a simpler comparison we can dial Leviathan to different performance points using Oodle's "spaceSpeedTradeoffBytes" parameter, which sets the relative cost of time vs size in Oodle's decisions. <P> That is, in both cases Oodle has size savings to spend. It can spend those size savings to get more decode speed. <P> On e.dds, let's take Leviathan and dial spaceSpeedTradeoffBytes up from the default of 256 in powers of two to favor decode speed more : <FONT COLOR=GREEN><PRE> e.dds : zstd 1.3.3 -22 : 2.600 to 1 : 625.72 MB/s e.dds : Leviathan 1 : 3.020 to 1 : 448.30 MB/s e.dds : Leviathan 256 : 2.954 to 1 : 544.03 MB/s e.dds : Leviathan 512 : 2.938 to 1 : 577.23 MB/s e.dds : Leviathan 1024 : 2.866 to 1 : 826.15 MB/s e.dds : Leviathan 2048 : 2.831 to 1 : 886.42 MB/s </PRE></FONT> What is the speed of Leviathan? There is no one speed of Leviathan. It can go from 448 MB/s to 886 MB/s depending on what you tell the encoder you want. The fundamental aspect is what compression ratio can be achieved at each decode speed. <P> We can see that ZStd is not fundamentally faster on this file; in fact Leviathan can get much more decode speed AND compression ratio at spaceSpeedTradeoffBytes = 1024 or 2048. <P> Similarly on Transistor_AudenFMOD_Ambience.bank : <FONT COLOR=GREEN><PRE> Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s Transistor_Aude...D_Ambience.bank : Leviathan 256 : 1.196 to 1 : 2270.48 MB/s Transistor_Aude...D_Ambience.bank : Leviathan 512 : 1.193 to 1 : 3701.30 MB/s Transistor_Aude...D_Ambience.bank : Leviathan 1024 : 1.190 to 1 : 4738.83 MB/s Transistor_Aude...D_Ambience.bank : Leviathan 2048 : 1.187 to 1 : 6193.92 MB/s zstd 1.3.3 -22 5.71 MB/s 4257 MB/s 16281301 84.18 Transistor_AudenFMOD_Ambience.bank Leviathan spaceSpeedTradeoffBytes = 2048 ooLeviathan8 : 19,341,802 ->16,290,106 = 6.738 bpb = 1.187 to 1 decode : 3.123 millis, 0.55 c/b, rate= 6193.92 MB/s </PRE></FONT> In this case we can dial Leviathan to get very nearly the same compressed size, and then just compare speeds (4275.38 MB/s vs 6193.92 MB/s). <P> Again ZStd is not actually faster than Leviathan here. If you looked at Leviathan's default setting encode (2270.48 MB/s) you were not seeing ZStd being faster to decode. What you are seeing is that you told Leviathan to choose an encoding that favors size over decode speed. <P> It doesn't make sense to tell Oodle to make a very small file, and then just compare decode speeds. That's like buying a truck to maximize cargo carrying and then complaining that it has poor gas mileage. You specifically asked me to optimize for the opposite goal! <P> Note that in the Transistor bank case, it looks like Oodle is paying a bad price to get a tiny compression savings; going from 6000 MB/s to 2000 MB/s seems like a lot. In fact that is a small time difference, while 1.187 to 1.196 ratio is actually a big size savings. The problem here is that ratio & speed are inverted measures of what we are really optimizing, which is time and size. Internally we always look at bpb (bits per byte) and cpb (cycles per byte) when measuring performance. <P> Bits and Cycles are your commidities that you are trading. If you look at compression ratio or speed, those are actually inverses of the commodities you care about, which can make numbers look weird. If we convert these ratio to commodites : <FONT COLOR=GREEN><PRE> Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s Transistor_Aude...D_Ambience.bank : Leviathan 256 : 1.196 to 1 : 2270.48 MB/s Transistor_Aude...D_Ambience.bank : Leviathan 512 : 1.193 to 1 : 3701.30 MB/s is Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 6.734 bits per byte : 0.795 cycles per byte Transistor_Aude...D_Ambience.bank : Leviathan 256 : 6.689 bits per byte : 1.497 cycles per byte Transistor_Aude...D_Ambience.bank : Leviathan 512 : 6.706 bits per byte : 0.919 cycles per byte </PRE></FONT> The setting "spaceSpeedTradeoffBytes" tells Leviathan how much it should pay in cycles to gain some compression in bits. <P> e.dds charts : <P> <TABLE><TR><TD> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=3782FF,882787,882787,882787,882787&chxr=1,0,900,100&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=zstd 1.3.3 -22|Leviathan256|Leviathan512|Leviathan1024|Leviathan2048&chd=e:sf,mr,pC,6v,.B&chxl=0:|e.dds&chtt=decode+MB/s&chts=303030,14,l"> </TD><TD> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=3782FF,882787,882787,882787,882787&chxr=1,0,3,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=zstd 1.3.3 -22|Leviathan256|Leviathan512|Leviathan1024|Leviathan2048&chd=e:3c,.A,-r,9I,8Z&chxl=0:|e.dds&chtt=compression+ratio&chts=303030,14,l"> </TD></TR></TABLE> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,9,10,1|1,0,2,1&chd=e:ShFnLEsLyr,sGx.xwwmwC&chdl=zstd 1.3.3 -22|Leviathan256|Leviathan512|Leviathan1024|Leviathan2048&chdlp=r&chco=3782FF|882787|882787|882787|882787&chm=s,000000,0,0,0|o,3782FF,0,0,10|o,882787,0,1,10|o,882787,0,2,10|o,882787,0,3,10|o,882787,0,4,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|512|1024|1:|1|2|4|"> <P> <P> See also : <A HREF="http://cbloomrants.blogspot.com/2018/01/the-natural-lambda.html"> The Natural Lambda </A> <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/LPLq_QROCRY" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-33427740150893649092018-05-07T08:28:00.002-07:002018-05-07T08:28:30.164-07:00Visualizing Probability Update Schemes Part 2 A little bonus so we can look at the weird properties of the geometric / PAQ mixer. <P> Recall from the tour of mixing : <P> Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference) domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't sum to one, and they can be negative. <P> In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them. Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]). <P> For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor that says - yes I like your prediction, but take it farther. <P> The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply amplification or suppression of the codelen difference. <P> Let's see this in a very simple case : just one expert. <P> The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate" to exaggerate the effect (1000X faster than typical). <P> Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's to make the effect more obvious. <P> The underlying expert adapts slowly : (P(1) in green, codelen difference in blue) <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=e:g.h9i6j0ksljmYnLn8ospbqIq0resHsvtVt7ufvCvkwFwlxDxhx-yay1zQzpyCweu9tfsEqspWoEm0lm&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=t:0.090,0.178,0.263,0.346,0.427,0.506,0.583,0.658,0.732,0.805,0.877,0.947,1.016,1.084,1.151,1.217,1.282,1.346,1.409,1.472,1.534,1.595,1.656,1.716,1.775,1.834,1.893,1.951,2.008,2.065,1.842,1.643,1.463,1.298,1.145,1.003,0.869,0.744,0.624,0.511&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0"> <P> Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start. Codelen difference is still positive (L(0) > L(1)). <P> With the PAQ mixer applied to just a single expert : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=paq mixed&chd=e:g.iJjjlYnxqYtfwbzX1.4T547X8U9H9s-M-i-1.C.Q.Z.g.m.r.v.x.0.1.3vZJoH1HbHjIPJQKlMFOM&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=paq mixed&chd=t:0.090,0.195,0.323,0.491,0.716,0.972,1.298,1.638,2.025,2.433,2.873,3.244,3.682,4.038,4.410,4.748,5.113,5.428,5.764,6.048,6.429,6.739,7.035,7.347,7.671,7.994,8.188,8.537,8.675,8.997,1.514,-2.498,-2.843,-2.930,-2.903,-2.760,-2.565,-2.336,-2.104,-1.812&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=ccaa00&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=weight&chd=t:1.000,1.074,1.216,1.415,1.661,1.935,2.225,2.512,2.783,3.028,3.242,3.422,3.577,3.703,3.810,3.898,3.973,4.034,4.086,4.129,4.166,4.196,4.221,4.243,4.261,4.276,4.288,4.299,4.308,4.316,0.814,-1.525,-1.948,-2.251,-2.508,-2.739,-2.960,-3.175,-3.383,-3.583&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0"> <P> In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning that the underlying expert has P(1) on the right side, so our weight should be positive, but it's P(1) is way too low, so we're scaling up the codelen difference by 4X. <P> In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*. P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50% <P> Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower (otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding). In practice the weight converge slowly to an ideal and stay there for long periods of time. <P> But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X). <P> For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P fo the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can learn that no, on this particular file the deterministic model is in fact better than that, so I like you and amplify you even by a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as predictable as you think).<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/FebRvhiX4RE" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-20584258300073191062018-05-06T16:54:00.000-07:002018-05-06T16:54:19.822-07:00Visualizing Probability Update Schemes Unlike the last "visualizing" post, I don't have a real clear point to this one. This is more like "let's look at some pretty pictures". <P> I do think it helps to get intuition by actually looking at charts & graphs, rather than just always look at the end result score for something like compression. <P> We're going to look at binary probability estimation schemes. <P> Binary probability estimation is just filtering the history of bits seen in some way. <P> Each bit seen is a digital signal of value 0 or 1. You just want some kind of smoothing of that signal. Boom, that's your probability estimate, P(1). All this nonsense about Poisson and Bernoulli processes and blah blah, what a load of crap. It's just filtering. <P> For example, the "Laplace" estimator <FONT COLOR=GREEN><PRE> P(1) = (n1 + c)/(n0 + n1 + 2c) </PRE></FONT> That's just the average of the entire signal seen so far. Add up all the bits, divide by number. (countless papers have been written about what c should be (c= 1? c = 1/2?), why not 1/4? or 3/4? Of course there's no a-priori reason for any particular value of c and in the real world it should just be tweaked to maximize results.) <P> That's the least history-forgetting estimator of all, it weights everything in the past equally (we never want an estimator where older stuff is more important). In the other direction you have lots of options - FIR and IIR filters. You could of course take the last N bits and average them (FIR filter), or take the last N and average with a weight that smoothly goes from 0 to 1 across the window (sigmoidal or just linear). You can of course take an IIR average, that's <FONT COLOR=GREEN><PRE> average <- (average*N + last)/(N+1) </PRE></FONT> Which is just the simplest IIR filter, aka geometric decay, "exponential smoothing" blah blah. <P> Anyway, that's just to get us thinking in terms of filters. Let's look at some common filters and how they behave. <P><HR><P> Green bars = probability of a 1 bit after the bit at bottom is processed. <P> Blue bars = log odds = codelen(0) - codelen(1) <P><HR><P> Laplace : count n0 & n1 : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=counting&chd=e:qqv.zM1V223.445m6L6q2JySu7r.panHlDjMhhgAemgAeugAe0gAe5gAe9gAfBeHfFgAfIeUdieZfOgA&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=counting&chd=t:1.000,1.585,2.000,2.322,2.585,2.807,3.000,3.170,3.322,3.459,2.459,1.874,1.459,1.138,0.874,0.652,0.459,0.290,0.138,0.000,-0.126,0.000,-0.115,0.000,-0.107,0.000,-0.100,0.000,-0.093,0.000,-0.087,-0.170,-0.082,0.000,-0.078,-0.152,-0.222,-0.144,-0.070,0.000&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> After 10 1's and 10 0's, P is back to 50/50 ; we don't like the way this estimator has no recency bias. <P><HR><P> Geometric IIR with updshift 2 (very fast) : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>2&chd=e:n.t.yf134Z6T7u8y9l-Mupi.aPTrOwLEITGOErDgCoR-NeaHTlerXBhQY8itaCThepm-dPV7QccVlPr7&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>2&chd=t:0.737,1.354,1.903,2.412,2.893,3.356,3.806,4.246,4.680,5.109,1.427,0.270,-0.525,-1.171,-1.738,-2.257,-2.746,-3.214,-3.667,-4.110,-4.546,-1.357,-1.906,-0.537,-1.182,-0.119,-0.833,0.114,-0.647,0.245,-0.545,-1.188,-0.122,0.640,-0.250,-0.940,-1.532,-0.332,0.478,1.131&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> When a fast geometric gets to the 010101 section is wiggles back and forth wildly. If you look at the codelen difference in that region you can see it's wasting something like 1.5 bits on each coding (because it's always wiggled in just the wrong way, eg. biased towards 0 right before a 1 occurs). <P> Note that geometric probabilities have an exponential decay shape. (specifically , 0.75^n , where 0.75 is from (1 - 1/4) and the 4 is because shift 2). HOWEVER the codelen difference steps are *linear*. <P> The geometric probability update (in the direction of MPS increasing probability) is very close to just adding a constant to codelen difference (logit). (this just because P *= lambda , so log(P) += log(lambda)). You can see after the 111 region, the first 0 causes a big negative step in codelen difference, and then once 0 becomes the MPS the further 0 steps are the same constant linear step. <P><HR><P> Geometric IIR with updshift 5 (fast) : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=e:g.h9i6j0ksljmYnLn8osnbmMlAj2iuhpglfkeldoctdzc4d-dCeIdMeRdVeaddcidpeudwc1b7dDeJfN&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=t:0.090,0.178,0.263,0.346,0.427,0.506,0.583,0.658,0.732,0.805,0.683,0.566,0.455,0.349,0.247,0.148,0.053,-0.038,-0.127,-0.214,-0.298,-0.198,-0.282,-0.183,-0.268,-0.169,-0.254,-0.155,-0.241,-0.143,-0.229,-0.313,-0.212,-0.115,-0.202,-0.286,-0.369,-0.266,-0.167,-0.071&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> Shift 5 is still fast by real world standards but looks slow compared to the crazy speed of geo 2. You can see that the lack of an initial adaptation range hurts the ramp up on the 1111 portion. That is, "geo 5" acts like it has 33 symbols in its history; at the beginning it actually has 0, so it has a bogus history of 33 times P = 0.5 which gives it a lot of intertia. <P><HR><P> FIR 8-tap window flat weight : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=fir flat 8&chd=e:n.n.v.v.3.3.........3.v.n.gAYAQAIAAAAAAAAAIAIAQAQAYAYAgAgAgAgAYAgAgAgAYAYAYAgAn.&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=fir flat 8&chd=t:0.737,0.737,1.585,1.585,2.807,2.807,6.629,6.629,6.629,6.629,2.807,1.585,0.737,0.000,-0.737,-1.585,-2.807,-6.629,-6.629,-6.629,-6.629,-2.807,-2.807,-1.585,-1.585,-0.737,-0.737,0.000,0.000,0.000,0.000,-0.737,0.000,0.000,0.000,-0.737,-0.737,-0.737,0.000,0.737&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> Just take the last 8 and average. (I initialize the window to 01010 which is why it has the two-tap stair step in the beginning). In practice you can't like the probabilities get to 0 or 1 completely, you have to clamp at some epsilon, and in fact you need a very large epsilon here because over-estimating P(1) and then observing a 0 bit is very very bad (codelen of 0 goes to infinity fast as P->1). The "codelen difference" chart here uses a P epsilon of 0.01 <P><HR><P> bilateral filter : <P> It's just filtering, right? Why not? This bilateral filter takes a small local average (4 tap) and weights contribution of those local averages back through time. The weight of each sample is multiplied by e^-dist * e^-value_difference. That is, two terms (hence bilateral), weight goes down as you go back in time, but also based on how far away the sample is from the most recent one in value. ("sample" is the 4-tap local average) <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=bilateral&chd=e:l0q-zH366-9B-Z.R.x..4oqsbjOmIpFRDKBvAyAOAAHXKTUuWqcCcDfPeigffVZudbf6fXdPWbXZdzml&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=bilateral&chd=t:0.532,1.032,1.989,2.790,3.556,4.363,5.291,6.473,6.629,6.629,2.945,1.003,-0.404,-1.758,-2.679,-3.477,-4.271,-5.161,-6.336,-6.629,-6.629,-2.945,-2.382,-1.062,-0.867,-0.359,-0.358,-0.069,-0.131,0.045,-0.060,-0.573,-0.232,-0.008,-0.057,-0.249,-0.890,-0.795,-0.199,0.603&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> What the bilateral does is that as you get into each region, it quickly forgets the previous region. So as you get into 000 it forgets the 111, and once it gets into 0101 it pretty solidly stabilizes at P = 0.5 ; that is, it's fast in forgetting the past when the past doesn't match the present (fast like geo 2), BUT it's not fast in over-responding to the 0101 wiggle like geo 2. <P> There are an infinity of different funny impulse responses you could do here. I have no idea if any of them would actually help compression (other than by giving you more parameters to be able to train to your data, which always helps, sort of). <P><HR><P> mixed : <P> Linear mix of geo 2 (super fast) and geo 5 (still quite fast). mixing weight starts at 0.5 and is updated with online gradient descent. The mixer here has an unrealistically fast learning rate to exaggerate the effect. <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=mixed&chd=e:kfoRrquxxr0Y285a7z-LpPljiofWbxYJUqRWOOLQIYYyVxciZYeUbIfLcEfjcoaNd2gddrbrZoc-e9hN&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=mixed&chd=t:0.408,0.763,1.102,1.442,1.795,2.175,2.604,3.124,3.837,5.101,0.859,0.506,0.238,-0.058,-0.383,-0.722,-1.069,-1.428,-1.807,-2.229,-2.731,-0.661,-0.956,-0.313,-0.605,-0.152,-0.443,-0.074,-0.356,-0.040,-0.305,-0.528,-0.194,0.041,-0.210,-0.392,-0.582,-0.274,-0.094,0.110&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=ccaa00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=weight&chd=e:gAhij7m3qKtxxs1.6u.9QIM1OIRyWSa6fdj7oatBx4bHdjXyYtVgVXTTScQ8PoQmNENbKgKsM0HmHRKF&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1"> <P> The weight shown is the weight of geo 2, the faster model. You can see that in the 111 and 000 regions the weight of geo 2 shoots way up (because geo 5 is predicting P near 0.5 still), and then in the 0101 region the weight of geo 2 gradually falls off because the slow geo 5 does better in that region. <P> The mixer immediately does something that none of the other estimators did - when the first 0 bit occurs, it takes a *big* step down, almost down to 50/50. It's an even faster step than geo 2 did on its own. (same thing with the first 1 after the 000 region). <P> Something quite surprising popped out to me. The probability steps in the 111 and 000 regions wind up linear. Note that both the underlying estimators (geo 2 and geo 5) has exponential decay curving probabilities, but the interaction with the mixing weight cancels that out and we get linear. I'm not sure if that's a coincidence of the particular learning rate, but it definitely illustrates to us that a mixed probability can behave unlike any of its underlying experts! <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/93rg44LrjLA" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-91096410431712252522018-05-04T18:08:00.003-07:002018-05-04T19:32:36.738-07:00Visualizing Binary Probability Updates Some time ago I noted that the standard way we use binary probabilities to model statistics has some odd quirks : <A HREF="http://cbloomrants.blogspot.com/2014/06/06-12-14-some-lzma-notes.html"> 06-12-14 - Some LZMA Notes </A> <P> I thought I would make some pictures to make that more intuitive. <P> What I have here is a 4 bit symbol (alphabet of 16). It is coded with 4 binary probabilities in a tree. That is : <FONT COLOR=GREEN><PRE> first code a bit for sym >= 8 or not (is it in [0-7] or [8-15]) then go to [0-3] vs [4-7] (for example) then [4-5] vs [6-7] lastly  vs  </PRE></FONT> One way you might implement this is something like : <FONT COLOR=GREEN><PRE> U32 p0; // sym is given sym |= 16; // set a place-value marker bit for 4 times { int ctx = sym >> 4; int bit = (sym>>3)&1; arithmetic_code( p0[ctx] , bit ); sym <<= 1; } </PRE></FONT> and note that only 15 p0's are actually used, p0 is not accessed; p0 is the root probability for [0-7] vs [8-15] , p0 is [0-3] vs [4-7] , etc. <P> The standard binary probability update is exponential decay (the simplest geometric IIR filter) : <FONT COLOR=GREEN><PRE> if ( bit ) { P0 -= P0 * lambda; } else { P0 += (1.0 - P0) * lambda; } </PRE></FONT> So what actually happens to the symbol probabilities when you do this? <P> Something a bit strange. <P> When you update symbol  (for example), his probability goes up. But so does the probability of his neighbor, . And also his next neighbors [6 and 7]. And so on up the binary tree. <P> Now the problem is not the binary decomposition of the coding. Fundamentally binary decomposition and full alphabet coding are equivalent and should be able to get the same results if you form the correct correspondence between them. <P> (aside : For example, if you use a normal n0/n1 count model for probabilities, AND you initialize the counts such that the parents = the sum of the children, then what you get is : <A HREF="http://www.cbloom.com/blog/visualizing_binary_probability_updates_n0_n1.html"> visualizing_binary_probability_updates_n0_n1.html </A> - only the symbols you observe increase in probability. Note this is a binary probability tree, not full alphabet, but it acts exactly the same way.) <P> The problem (as noted in the previous LZMA post) is that the update rate at the top levels is too large. <P> Intuitively, when a 5 occurs you should update the binary choice  vs  , because a 5 is more likely, that pair is more likely. The problem is that  did not get more likely, and the probability of the group  represents the chance of either occuring. One of those things did get more likely, but one did not. The 4 in the group should act as a drag that keeps the group  probability from going up so much. Approximately, it should go up by half the speed of the leaf update. <P> The larger the group, the greater the drag of symbols that did not get more likely. eg. in [0-7] vs [8-15] when a 5 occurs, all of 0123467 did not get more likely, so 7 symbols act as drag and the rate should be around 1/8 the speed. <P> (see appendix at end for the exact speeds you should use if you wanted to adjust only one probability) <P> Perhaps its best to just look at the charts. These are histograms of probabilities (in percent). It starts all even, then symbol 5 occurs 20 X, then symbol 12 occurs 20 X. The probabilities are updated with the binary update scheme. lambda here is 1.0/8 , which is rather fast, used to make the visualization more dramatic. <P> What you should see : when 5 first starts getting incremented, the probabilities of its neighbors go way up, 4, and 6-7, and the whole 0-7 side. After 5 occurs many times, the probabilities of its neighbors goes down. Then when 12 starts beeing seen, the whole 8-15 side shoots up. <P> Go scroll down through the charts then we'll chat more. <P> This is a funny thing. Most people in the data compression community think of binary coding symbols this way as just a way to encode an alphabet using binary arithmetic coding. They aren't thinking about the fact that what they're actually doing is a strange sort of group-probability update. <P> In fact, in many applications if you take a binary arithmetic coder like this and replace it with an N-ary full alphabet coder, results get *worse*. Worse !? WTF !? It's because this weird group update thing that the binary coder is doing is often actually a good thing. <P> You can imagine scenarios where that could be the case. In some types of data, when a new symbol X suddenly starts occuring (when it had been rare before), then that means (X-1) and (X+2) may start to be seen as well. We're getting some kind of complicated modeling that novel symbols imply their neighbors novel probability should go up. In some type of data (such as BWT post-MTF) the probabilities act very much in binary tree groups. (see Fenwick for example). In other types of data that is very bit structured (such as ascii text and executable code), when a symbol with some top 3 bits occurs, then other symbols with those top bits are also more likely. That is, many alphabets actually have a natural binary decomposition where symbol groups in the binary tree do have joint probability. <P> This is one of those weird things that happens all the time in data compression where you think you're just doing an implementation choice ("I'll use binary arithmetic coding instead of full alphabet") but that actually winds up also doing modeling in ways you don't understand. <P> The visuals : <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:EAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEA"> <P> 5 x 1 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:D8D8D8D8E.GaEbEbDgDgDgDgDgDgDgDg"> <P> 5 x 2 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DyDyDyDyFxJSErErDEDEDEDEDEDEDEDE"> <P> 5 x 3 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DkDkDkDkGTMhEvEvCrCrCrCrCrCrCrCr"> <P> 5 x 4 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DUDUDUDUGoP.EsEsCWCWCWCWCWCWCWCW"> <P> 5 x 5 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DDDDDDDDGwTkEiEiCDCDCDCDCDCDCDCD"> <P> 5 x 6 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CyCyCyCyGtXKEUEUBzBzBzBzBzBzBzBz"> <P> 5 x 7 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CiCiCiCiGhasEEEEBlBlBlBlBlBlBlBl"> <P> 5 x 8 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CSCSCSCSGQeHDxDxBYBYBYBYBYBYBYBY"> <P> 5 x 9 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CDCDCDCDF6hWDeDeBNBNBNBNBNBNBNBN"> <P> 5 x 10 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:B1B1B1B1FhkZDLDLBDBDBDBDBDBDBDBD"> <P> 5 x 11 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BoBoBoBoFHnPC5C5A7A7A7A7A7A7A7A7"> <P> 5 x 12 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BdBdBdBdEsp2CnCnA0A0A0A0A0A0A0A0"> <P> 5 x 13 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BSBSBSBSESsPCWCWAtAtAtAtAtAtAtAt"> <P> 5 x 14 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BJBJBJBJD4ubCGCGAnAnAnAnAnAnAnAn"> <P> 5 x 15 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BABABABADgwZB4B4AjAjAjAjAjAjAjAj"> <P> 5 x 16 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:A5A5A5A5DJyKBrBrAeAeAeAeAeAeAeAe"> <P> 5 x 17 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AyAyAyAyC0zwBfBfAaAaAaAaAaAaAaAa"> <P> 5 x 18 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AsAsAsAsCh1LBUBUAXAXAXAXAXAXAXAX"> <P> 5 x 19 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AnAnAnAnCP2dBLBLAUAUAUAUAUAUAUAU"> <P> 5 x 20 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AiAiAiAiB.3lBCBCASASASASASASASAS"> <P> 12 x 1 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AeAeAeAeBwwoA6A6BGBGBGBGBxBYBOBO"> <P> 12 x 2 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AaAaAaAaBiqjAzAzBmBmBmBmD7CcB-B-"> <P> 12 x 3 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AXAXAXAXBVlPAsAsB5B5B5B5GpDWChCh"> <P> 12 x 4 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AUAUAUAUBLglAnAnCCCCCCCCJ0EEC4C4"> <P> 12 x 5 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ASASASASBBcgAiAiCFCFCFCFNSElDFDF"> <P> 12 x 6 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:APAPAPAPA5Y8AeAeCCCCCCCCQ7E5DKDK"> <P> 12 x 7 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ANANANANAyV1AaAaB9B9B9B9UoFCDJDJ"> <P> 12 x 8 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AMAMAMAMAsTGAXAXB2B2B2B2YTFCDDDD"> <P> 12 x 9 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AKAKAKAKAmQtAUAUBtBtBtBtb3E7C6C6"> <P> 12 x 10 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AJAJAJAJAiOoARARBkBkBkBkfREvCuCu"> <P> 12 x 11 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AIAIAIAIAdMzAPAPBcBcBcBcifEfCiCi"> <P> 12 x 12 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AHAHAHAHAaLMANANBTBTBTBTlfENCVCV"> <P> 12 x 13 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AGAGAGAGAWJzAMAMBLBLBLBLoRD5CJCJ"> <P> 12 x 14 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AFAFAFAFAUIlAKAKBDBDBDBDq0DlB8B8"> <P> 12 x 15 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AFAFAFAFARHgAJAJA8A8A8A8tIDRBwBw"> <P> 12 x 16 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AEAEAEAEAPGkAIAIA2A2A2A2vPC-BlBl"> <P> 12 x 17 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AEAEAEAEANFvAHAHAwAwAwAwxICrBaBa"> <P> 12 x 18 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ADADADADAMFCAGAGAqAqAqAqy1CaBRBR"> <P> 12 x 19 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ADADADADAKEZAFAFAlAlAlAl0XCKBIBI"> <P> 12 x 20 <P> <IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ACACACACAJD2AFAFAhAhAhAh1uB7BABA"> <P> <P><HR><P> Appendix : <P> How to do a binary probability update without changing your neighbor : <FONT COLOR=GREEN><PRE> Consider the simplest case : 4 symbol alphabet : [0 1 2 3] A = P[0 1] vs [2 3] B = P vs  P(0) = A * B P(1) = A * (1 - B) now a 0 occurs alpha = update rate for A beta = update rate for B A' = A + (1-A) * alpha B' = B + (1-B) * beta we want P(1)/P(23) to be unchanged by the update (if that is unchanged, then P(1)/P(2) and P(1)/P(3) is unchanged) that is, the increase to P(0) should scale down all other P's by the same factor P(1)/P(23) = A * (1-B) / (1-A) so require : A' * (1-B') / (1-A') = A * (1-B) / (1-A) solve for alpha [...algebra...] alpha = A * beta / ( 1 - (1 - A) * beta ) </PRE></FONT> that's as opposed to just using the same speed at all levels, alpha = beta. <P> In the limit of small beta, (slow update), this is just alpha = A * beta. <P> The update rate at the higher level is decreased by the probability of the updated subsection. <P> In the special case where all the probabilities start equal, A = 0.5, so this is just alpha = beta / 2 - the update rates should be halved at each level, which was the intuitive rule of thumb that I hand-waved about before. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/abIj4WqJaec" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-24339955744653173292018-05-03T13:29:00.000-07:002018-05-03T13:29:07.865-07:00Whirlwind Tour of Mixing Part 3 On to online gradient descent. <P> First we need to talk about how our experts are mixed. There are two primary choices : <FONT COLOR=GREEN><PRE> Linear mixing : M(x) = Sum{ w_i P_i(x) } / Sum{ w_i } or = Sum{ w_i P_i(x) } if the weights are normalized Geometric mixing : M(x) = Prod{ P_i(x) ^ c_i } / Sum(all symbols y){ Prod{ P_j(y) ^ c_j } } (the coefficients c_i do not need to be normalized) P = probability from each expert M = mixed probability require Sum(all symbols y){ M(y) } = 1 </PRE></FONT> Linear mixing is just a weighted linear combination. Geometric mixing is also known as "product of expert" and "logarithmic opinion pooling". <P> Geometric mixing can be expressed as a linear combination of the logs, if normalization is ignored : <FONT COLOR=GREEN><PRE> log( M_unnormalized ) = Sum{ c_i * log( P_i ) } </PRE></FONT> In the large alphabet case, there is no simple logarithmic form of geometric M (with normalization included). <P> In the special case of a binary alphabet, Geometric mixing has a special simple form (even with normalization) in log space : <FONT COLOR=GREEN><PRE> M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ (1 - P_i) ^ c_i } ) M = 1 / ( 1 + Prod{ ((1 - P_i)/P_i) ^ c_i } ) M = 1 / ( 1 + Prod{ e^c_i * log((1 - P_i)/P_i) } ) M = 1 / ( 1 + e^ Sum { c_i * log((1 - P_i)/P_i) } ) M = sigmoid( Sum { c_i * logit(P_i) } ) </PRE></FONT> which we looked at in the previous post, getting some background on logit and sigmoid. <P> We looked before about how logit space strongly preserves skewed probabilities. We can see the same thing directly in the geometric mixer. <P> Obviously if an expert has a P near 0, that multiplies into the product and produces an output near 0. Zero times anything = zero. Less obviously if an expert has a P near 1, the denomenator normalization factor becomes zero in the other terms (that aren't the numerator), so the mix becomes 1, regardless of what the other P's. <P> So assume we are using these mixtures. We want to optimize (minimize) code length. The general procedure for gradient descent is : <FONT COLOR=GREEN><PRE> the code length cost of a previous coding event is -log2( M(x) ) evaluated at the actually coded symbol x (in compression the total output file length is just the sum of all these event costs) To improve the w_i coefficients over time : after coding symbol x, see how w_i could have been improved to reduce -log2( M(x) ) take a step of w_i in that direction the direction is the negative gradient : delta(w_i) ~= - d/dw_i { -log2( M(x) ) } delta(w_i) ~= (1/M) * d/dw_i{ M(x) } (~ means proportional; the actual delta we want is scaled by some time step rate) </PRE></FONT> so to find the steps we should take we just have to do some derivatives. <P> I won't work through them in ascii, but they're relatively simple. <P> (the geometric mixer update for the non-binary alphabet case is complex and I won't cover it any more here. For practical reasons we will assume linear mixing for non-binary alphabets, and either choice is available for binary alphabets. For the non-binary geometric step see Mattern "Mixing Strategies in Data Compression") <P> When you work through the derivatives what you get is : <FONT COLOR=GREEN><PRE> Linear mixer : delta(w_i) ~= ( P_i / M ) - 1 Geometric mixer (binary) : delta(w_i) ~= (1 - M) * logit(P_i) </PRE></FONT> Let's now spend a little time getting familiar with both of them. <P> First the linear mixer : <P> <FONT COLOR=GREEN><PRE> put r_i = P_i(x) / M(x) ratio of expert i's probability to the mixed probability, on the symbol we just coded (x) Note : Sum { w_i * r_i } = 1 (if the weights are normalized) Your goal is to have M = 1 That is, you gave 100% probability to the symbol that actually occured The good experts are the ones with high P(x) (x actually occured) The experts that beat the current mix have P > M Note that because of Sum { w_i * r_i } = 1 Whenever there is an expert with r_i > 1 , there must be another with r_i < 1 The linear mixer update is just : delta(w_i) ~= r_i - 1 experts that were better than the current mix get more weight experts worse than the current mix get less weight </PRE></FONT> I think it should be intuitive this is a reasonable update. <P> Note that delta(w) does not preserve normalization. That is : <FONT COLOR=GREEN><PRE> Sum{ delta(w_i) } != 0 , and != constant however Sum{ w_i * delta(w_i) } = Sum{ w_i * P_i / M } - Sum { w_i } = Sum {w_i} - Sum{w_i} = 0 the deltas *do* sum to zero when weighted. </PRE></FONT> We'll come back to that later. <P> If we write the linear update as : <FONT COLOR=GREEN><PRE> delta(w_i) ~= ( P_i - M ) / M </PRE></FONT> it's obvious that delta is proportional to (1/M). That is, when we got the overall mixed prediction M right, M is near 1, the step is small. When we got the step wrong, M is near 0, the step we take can be very large. (note that P_i does not have to also be near zero when M is near zero, the w_i for that expert might have been small). In particular, say most of the experts were confused, they predict the wrong thing, so for the actual coded event x, their P(x) is near zero. One guy said no, what about x! His P(x) was higher. But everyone ignores the one guy piping up (his weight is low). The mixed M(x) will be very low, we make a huge codelen (very bad log cost), the step (1/M) will be huge, and it bumps up the one guy who got the P(x) right. <P> Now playing with the geometric update : <FONT COLOR=GREEN><PRE> Geometric mixer (binary) : delta(w_i) ~= (1 - M) * logit(P_i) this is evaluated for the actually coded symbol x (x = 0 or 1) let's use the notation that ! means "for the opposite symbol" eg. if you coded a 0 , then ! means "for a 1" eg. !P = 1-P and !M = 1-M (in the previous post I put up a graph of the relationship between L and !L, codelens of opposing bits) We can rewrite delta(w_i) as : delta(w_i) ~= !M * ( !L_i - L_i ) recalling logit is the codelen difference that's delta(w_i) ~= (mixed probability of the opposite symbol) * (ith model's codelen of opposite - codelen of coded symbol) </PRE></FONT> I find this form to be more intuitive; let's talk through it with words. <P> The step size is proportional to !M , the opposite symbol mixed probability. This is a bit like the (1/M) scaling in the linear mixer. When we got the estimate right, the M(x) of the coded symbol will go to 1, so !M goes to zero, our step size goes to zero. That is, if our mixed probability M was right, we take no step. If it ain't broke don't fix it. If we got the M very wrong, we thought a 0 bit would not occur but then it did, M is near zero, !M goes to one. <P> While the general direction of scaling here is the same as the linear mixer, the scaling is very different. For example in the case where we were grossly wrong, M -> 0 , the linear mixer steps like (1/M) - it can be a huge step. The geometric mixer steps like (1-M) , it hits a finite maximum. <P> The next term (!L - L) means weights go up for experts with a large opposite codelen. eg. if the codelen of the actually coded symbol was low, that's a good expert, then the codelen of the opposite must be high, so his weight goes up. <P> Again this is similar to the linear mixer, which took a step ~ P(x) , so experts who were right go up. Again the scaling is different, and in the opposite way. In the linear mixer, if experts get the P right or wrong, the step just scales in [0,1] ; it varies, but not drastically. Conversely in the geometric case because step is proportional to codelen, it goes to infinity as P gets extremely wrong. <P> Say you're an expert who gets it all wrong and guesses P(x) = 0 (near zero) then x occurs. In the linear case the step is just (-1). In the geometric case, L(x) -> inf , !L -> 0 , your weight takes a negative infinite step. Experts who get it grossly wrong are not gradually diminished, they are immediately crushed. (in practice we clamp weights to some chosen finite range, or clamp step to a maximim) <P> Repeating again what the difference is : <P> In linear mixing, if the ensemble makes a big joint mistake, the one guy out of the set who was right can get a big weight boost. If the ensemble does okay, individuals experts that were way off do not get big adjustments. <P> In geometric mixing, individual experts that make a big mistake take a huge weight penalty. This is only mildly scaled by whether the ensemble was good or not. <P> Another funny difference : <P> With geometric mixing, experts that predict a 50/50 probability (equal chance of 0 or 1) do not change w at all. If you had a static expert that just always said "I dunno, 50/50?" , his weight would never ever change. But he would also never contribute to the mixture either, since as noted in the previous post equivocal experts simply have no effect on the sum of logits. (you might think that if you kept seeing 1 bits over and over, P(1) should be near 100% and experts that keep saying "50%" should have their weight go to crap; that does not happen in geometric mixing). <P> In contrast, with linear mixing, 50/50 experts would either increase or decrease in weight depending on if they were better or worse than the previous net mixture. eg. if M was 0.4 on the symbol that occurred, the weight of an 0.5 P expert would go up, since he's better than the mix and somebody must be worse than him. <P> Note again that in geometric mixing (for binary alphabets), there is no explicit normalization needed. The w_i's are not really weights, they don't sum to 1. They are not in [0,1] but go from [-inf,inf]. It's not so much "weighting of experts" as "adding of experts" but in delta-codelen space. <P> Another funny thing about geometric mixing is what the "learning rate" really does. <P> With the rate explicitly in the steps : <FONT COLOR=GREEN><PRE> Linear mixer : w_i += alpha * ( ( P_i / M ) - 1 ) (then w_i normalized) Geometric mixer (binary) : c_i += alpha * ( (1 - M) * logit(P_i) ) (c_i not normalized) </PRE></FONT> In the linear case, alpha really is a learning rate. It controls the speed of how fast a better expert takes weight from a worse expert. They redistribute the finite weight resource because of normalization. <P> That's not the case in the geometric mixer. In fact, alpha just acts as an overall scale to all the weights. Because all the increments to w over time have been scaled by alpha, we can just factor it out : <FONT COLOR=GREEN><PRE> no alpha scaling : c_i += ( (1 - M) * logit(P_i) ) factor out alpha : M = sigmoid( alpha * Sum { c_i * logit(P_i) } ) </PRE></FONT> The "learning rate" in the geometric mixer really just scales how the sum of logits is stretched back to probability. In fact you can think of it as a normalization factor more than a learning rate. <P> The geometric mixer has yet another funny property that it doesn't obviously pass through the best predictor. It doesn't even obviously pass through the case of a single expert. (with linear mixing it's obvious that crap predictors weight goes to zero, and a single expert would just pass through at weight=1) <FONT COLOR=GREEN><PRE> aside you may ignore; quick proof that it does : say you have only one expert P say that P is in fact perfectly correct 0 bits occur at rate P and P is constant initially you have some mixing coefficient 'c' (combining alpha in with c) The mixed probability is : M = sigmoid( c * logit(P) ) which is crap. c is applying a weird scaling we don't want. c should go to 1, because P is the true probability. Does it? delta(c) after one step of symbol 0 is (1-M) * logit(P) for symbol 1 it's M * logit(1-P) since P is the true probability, the probabilistic average step is : average_delta(c) = P * ( (1-M) * logit(P) ) + (1-P) * ( (M) * logit(1-P) ) logit(1-P) = - logit(P) average_delta(c) = P * logit(P) - M * logit(P) = (P - M) * logit(P) in terms of l = logit(P) using P = sigmoid(l) average_delta(c) = ( sigmoid(l) - sigmoid( c * l ) ) * l this is a differential equation for c(t) you could solve it (anyone? anyone?) but even without solving it's clear that it does go to c = 1 as t -> inf if c < 1 , average_delta(c) is > 0 , so c goes up if c > 1 , average_delta(c) is < 0 , so c goes down in the special case of l small (P near 50/50), sigmoid is near linear, then the diffeq is easy and c ~= (1 + e^-t) </PRE></FONT> Note that this was done with alpha set to 1 (included in c). If we leave alpha separate, then c should converge to 1/alpha in order to pass through a single predictor. That's what I mean by alpha not really being a "learning rate" but more of a normalization factor. <P> So anyway, despite geometric mixing being a bit odd, it works in practice; it also works in theory (the gradient descent can be proved to converge with reasonable bounds and so on). In practice it has the nice property of not needing any divides for normalization (in the binary case). Of course it doesn't require evaluation of logit & sigmoid, which are typically done by table lookup. <P><HR><P> Bonus section : Soft Bayes <P> The "Soft Bayes" method is not a gradient descent, but it is very similar, so I will mention it here. Soft Bayes uses a weighted linear combination of the experts, just like linear gradient descent mixing. <P> <FONT COLOR=GREEN><PRE> Linear mixer : d_i = ( P_i / M ) - 1 w_i += rate * d_i Soft Bayes : w_i *= ( 1 + rate * d_i ) </PRE></FONT> Instead of adding on the increment, Soft Bayes multiples 1 + the increment. <P> Multiplying by 1 + delta is obviously similar to adding delta and the same intuitive arguments we made before about why this works still apply. Maybe I'll repeat them here : <P> General principles we want in a mixer update step : <FONT COLOR=GREEN><PRE> The step should be larger when the mixer produced a bad result the worse M(x) predicted the observed symbol x, the larger the step should be this comes from scaling like (1/M) or (1-M) The step should scale up weights of experts that predicted the symbol well eg. larger P_i(x) should go up , smaller P_i(x) should go down a step proportional to (P-M) does this in a balanced way (some go up, some go down) but a step of (P - 0.5) works too as does logit(P) </PRE></FONT> We can see a few properties of Soft Bayes : <FONT COLOR=GREEN><PRE> Soft Bayes : w_i *= ( 1 + rate * d_i ) w_i' = w_i + w_i * rate * d_i w_i += rate * w_i * d_i This is the same as linear gradient descent step, but each step d_i is scaled by w_i Soft Bayes is inherently self-normalizing : Sum{w_i'} = Sum{w_i} + Sum{w_i * rate * d_i} Sum{ w_i * d_i } = Sum{ ( w_i * P_i / M ) - w_i } = ( M/M - 1 ) * Sum{ w_i } = 0 therefore : Sum{w_i'} = Sum{w_i} the sum of weights is not changed if they started normalized, they stay normalized </PRE></FONT> The reason that "soft bayes" is so named can be seen thusly : <FONT COLOR=GREEN><PRE> Soft Bayes : w_i *= ( 1 + rate * d_i ) d_i = ( P_i / M ) - 1 w_i *= (1-rate) + rate * ( P_i / M ) at rate = 1 : w_i *= P_i / M ignoring normalization this is w_i *= P_i </PRE></FONT> which is "Beta Weighting" , which is aka "naive Bayes". <P> So Soft Bayes uses "rate" as a lerp factor to blend in a naive Bayes update. When rate is 1, it's "fully Bayes", when rate is lower it keeps some amount of the previous weight lerped in. <P> Soft Bayes seems to work well in practice. Soft Bayes could have the bad property that if weights get to zero they can never rise, but in practice we always clamp weights to a non-zero range so this doesn't occur. <P> A quick aside on why Beta weighting is the naive Bayesian solution : <FONT COLOR=GREEN><PRE> ignoring normalization through this aside (also ignoring priors and all the proper stuff you should do if you're Bayesian) (see papers for thorough details) Mix experts : M = Sum{ w_i * P_i } P_i(x) is the probability of x given expert i ; P(x|i) The w_i should be the probability of expert i, given past data , that is : P(x|past) = Sum{ P(i|past) * P(x|i) } P(i|past) is just ~= P(past|i) (ignoring prior P(i) and normalizing terms P(past) that factor out in normalization) Now *if* (big if) the past events are all independent random events, then : P(past|i) = P(x0|i)*P(x1|i)*P(x2|i)... for past symbols x0,x1,... since P("ab") = P(a)*P(b) etc. which you can compute incrementally as : w_i *= P_i the Beta Weight update. </PRE></FONT> In words, the weight for the expert should be the probability of that expert having made the data we have seen so far. That probability is just the P_i that expert has predicted for all past symbols, multiplied together. <P> Now this is obviously very wrong. Our symbols are very much not independent, they have strong conditional probability, that's how we get compression out of modeling the stream. The failure of that assumption may be why true beta weighting is poor in practice. <P><HR><P> This Whirlwind Tour has been brought to you by the papers of Mattern & Knoll, PAQ of Mahoney, and the radness of RAD. <P> Small note about practical application : <P> You of course want to implement this with integers and store things like weights and probabilities in fixed point. I have found that the exact way you handle truncation of that fixed point is very important. In particular you need to be careful about rounding & biasing and try not to just lose small values. In general mixing appears to be very "fiddly"; it has a very small sweet spot of tuned parameters and small changes to them can make results go very bad. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/iHItA6QS73k" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-24680154859984477632018-05-02T15:48:00.000-07:002018-05-02T15:48:32.663-07:00Whirlwind Tour of Mixing Part 2 Before we dive into online gradient descent, let's look at a couple of building blocks. <P> In many cases we've been talking about mixers that work with arbitrary alphabets, but let's specialize now to binary. In that case, each expert just makes one P, say P(0), and implicitly P(1) = 1 - P(0). <P> We will make use of the logit ("stretch") and logistic sigmoid ("squash"), so let's look at them a bit first. <P> <FONT COLOR=GREEN><PRE> logit(x) = ln( x / (1-x) ) sigmoid(x) = 1 / (1 + e^-x ) (aka "logistic") logit is "log odds" of a probability logit takes [0,1] -> [-inf,inf] sigmoid takes [-inf,inf] -> [0,1] logit( sigmoid(x) ) = 1 </PRE></FONT> I'm a little bit fast and loose with the base of the logs and exponents sometimes, because the difference comes down to scaling factors which wash out in constants in the end. I'll try to be not too sloppy about it. <P> You can see plots of them <A HREF="https://nathanbrixius.wordpress.com/2016/06/04/functions-i-have-known-logit-and-sigmoid/"> here for example </A>. The "logistic sigmoid" is an annoyingly badly named function; "logistic" is way too similar to "logit" which is very confusion, and "sigmoid" is just a general shape description, not a specific function. <P> What's more intuitive in data compression typically is the base-2 variants : <FONT COLOR=GREEN><PRE> log2it(P) = log2( P / (1-P) ) log2istic(t) = 1 / (1 + 2^-t ) </PRE></FONT> which is just different by a scale of the [-inf,inf] range. (it's a difference of where you measure your codelen in units of "bits" or "e's"). <P> In data compression there's an intuitive way to look at the logit. It's the difference of codelens of symbol 0 and 1 arising from the probability P. <FONT COLOR=GREEN><PRE> log2it(P) = log2( P / (1-P) ) = log2( P ) - log2( 1 - P ) if P = P0 then L0 = codelen of 0 = - log2(P0) L1 = - log2(1-P0) log2it(P0) = L1 - L0 which I like to write as : log2it = !L - L where !L means the len of the opposite symbol </PRE></FONT> Working with the logit in data compression is thus very natural and has nice properties : <P> 1. It is (a)symmetric under exchange of the symbol alphabet 0->1. In that case P -> 1-P , so logit -> - logit. The magnitude stays the same, so mixers that act on logit behave the same way. eg. you aren't favoring 0 or 1, which of course you shouldn't be. This is why using the *difference* of codelen between L0 and L1 is the right thing to do, not just using L0, since the 0->1 exchange acts on L0 in a strange way. <P> Any mixer that we construct should behave the same way if we swap 0 and 1 bits. If you tried to construct a valid mixer like that using only functions of L0, it would be a mess. Doing it in the logit = (!L-L) it works automatically. <P> 2. It's linear in codelen. Our goal in compression is to minimize total codelen, so having a parameter that is in the space we want makes that natural. <P> 3. It's unbounded. Working with P's is awkward because of their finite range, which requires clamping and theoretically projection back into the valid cube. <P> 4. It's centered on zero for P = 0.5 and large for skewed P. We noted before that what we really want is to gather models that are skewed, and logit does this naturally. <P> Let's look at the last property. <P> Doing mixing in logit domain implicitly weights skewed experts, as we looked at before in "functional weighting". With logit, skewed P's near 0 or 1 can stretched out to -inf or +inf. This gives them very high implicit "weight", that is even after mixing with a P near 0.5, the result is still skewed. <FONT COLOR=GREEN><PRE> Mix two probabilities with equal weight : 0.5 & 0.9 linear : 0.7 logit : 0.75 0.5 & 0.99 linear : 0.745 logit : 0.90867 0.5 & 0.999 linear : 0.74950 logit : 0.96933 </PRE></FONT> Logit blending is sticky at skewed P. <P> A little plot of what the L -> !L function looks like : <P> <A HREF="http://www.cbloom.com/blogimages/len_to_len.png"> <IMG SRC="http://www.cbloom.com/blogimages/len_to_len_small.png"> </A> <P> Obviously it is symmetric around (1,1) where both codelens are 1 bit, P is 0.5 ; as P gets skewed either way one len goes to zero and the other goes to infinity. <P> logistic sigmoid is the map back from codelen delta to probability. It has the nice property for us that no matter what you give it, the output is a valid normalized probability in [0,1] ; you can scale or screw up your logistics in whatever way you want and you still get a normalized probability (which makes it codeable). <P> Linear combination of logits (codelen deltas) is equal to geometric (multiplicative) mixing of probabilities : <FONT COLOR=GREEN><PRE> M = mixed probability linear combination in logit space, then transform back to probability : M = sigmoid( Sum{ c_i * logit( P_i ) } ) M = 1 / ( 1 + e ^ - Sum{ c_i * logit( P_i ) } ) e ^ logit(P) = P/(1-P) e ^ -logit(P) = (1-P)/P e ^ - c * logit(P) = ((1-P)/P) ^ c M = 1 / ( 1 + Prod{ (( 1 - P_i ) / P_i)^c_i } ) M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ ( 1 - P_i )^c_i } ) </PRE></FONT> What we have is a geometric blend (like a geometric mean, but with arbitrary exponents, which are the mixing factors), and the denomenator is just the normalizing factor so that M0 and M1 some to one. <P> Note that the blend factors (c_i) do not need to be normalized here. We get a normalized probability even if the c_i's are not normalized. That is an important property for us in practice. <P> This is a classic machine learning method called "product of experts". In the most basic form, all the c_i = 1, the powers are all 1, the probabilities from each expert are just multiplied together. <P> I'm calling the c_i "coefficients" not "weights" to emphasize the fact that they are not normalized. <P> Note that while we don't need to normalize the c_i , and in practice we don't (and in the "product of experts" usage with all c_i = 1 they are of course not normalized), that has weird consequences and let us not brush it under the table. <P> For one thing, the overall scale of c_i changes the blend. In many cases we hand-waved about scaling because normalization will wipe out any overall scaling factors of weights. In this case the magnitude scale of the c_i absolutely matters. Furthermore, the scale of the c_i has nothing to do with a "learning rate" or blend strength. It's just an extra exponent applied to your probabilities that does some kind of nonlinear mapping to how they are mixed. e.g. doing c_i *= 2 is like mixing squared probabilities instead of linear probabilities. <P> For another, with unnormalized c_i - only skewed probabilities contribute. Mixing in more 50/50 experts does *nothing*. It does not bias your result back towards 50/50 at all! <P> We can see this both in the logit formulation and in the geometric mean formulation : <FONT COLOR=GREEN><PRE> M = sigmoid( Sum{ c_i * logit( P_i ) } ) Now add on a new expert P_k with some coefficient c_k M' = sigmoid( Sum{ c_i * logit( P_i ) + c_k * logit( P_k ) } ) if P_k = 50% , then logit(P_k) = 0 M' = M even if c_k is large. (if you normalized the c's, the addition of c_k would bring down the coffecient from the other contributions) The Sum of logits only adds up codelen deltas Experts with large codelen deltas are adding their scores to the vote small codelen deltas simply don't contribute at all In the geometric mean : Q = Prod{ P_i ^ c_i } M = Q / ( Q + !Q ) (where ! means "of the opposing symbol; eg. P -> (1-P) ) Add on another expert with P = 0.5 : Q' = Q * 0.5 ^ c_k !Q' = !Q * 0.5 ^ c_k M' = Q * 0.5 ^ c_k / ( Q * 0.5 ^ c_k + !Q * 0.5 ^ c_k) the same term multiplied around just factors out M' = M </PRE></FONT> This style of mixing is like adding experts, not blending them. Loud shouty experts with skewed codelen deltas add on. <P> Next up, how to use this for gradient descent mixing in compression. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/lC1jYjnwQPo" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-60300949689982632422018-05-02T12:10:00.000-07:002018-05-02T13:32:08.454-07:00Whirlwind Tour of Mixing Part 1 I'm going to try to run through mixing techniques in data compression in a wild quick tour. Part 1 will cover techniques up to but not including online gradient descent. <P> First some background : <P> We have probabilities output by some "experts" , P_i(x) for the ith expert for symbol x. <P> We wish to combine these in some way to create a mixed probability. <P> The experts are a previous stage of probability estimation. In data compression these are usually models; they could be different orders of context model, or the same order at different speeds, or other probability sources. <P> (you could also just have two static experts, one that says P(0) = 100% and one that says P(1) = 100%, then the way you combine them is actually modeling the binary probability; in that case the mixer is the model. What I'm getting at is there's actually a continuum between the model & the mixer; the mixer is doing some of the modeling work. Another common synthetic case used for theoretical analysis is to go continuous and have an infinite number of experts with P(0) = x for all x in [0,1], then the weight w(x) is again the model.) <P> Now, more generally, the experts could also output a confidence. This is intuitively obviously something that should be done, but nobody does it, so we'll just ignore it and say that our work is clearly flawed and always has room for improvement. (for example, a context that has seen n0 = 1 and n1 = 0 might give P(0) = 90% , but that is a far less confident expert that one that has n0 = 90 and n1 = 10 , which should get much higher weight). What I'm getting at here is the experts themselves have a lot more information than just P, and that information should be used to feed the mixer. We will not do that. We treat the mixer as independent of the experts and only give it the probabilities. <P> (you can fix this a bit by using side information (not P) from the model stages as context for the mixer coefficients; more on that later). <P> Why mixing? A lot of reasons. One is that your data may be switching character over time. It might switch from very-LZ-like to very-order-3-context like to order-0-like. One option would be to try to detect those ranges and transmit switch commands to select models, but that is not "online" (one pass) and the overhead to send the switch commands winds up being comparable to just mixing all those models (much research shows that these alternatives are asymptotically bounded). Getting perfect switch transmission right is actually way harder than mixing; you have to send the switch commands efficiently, you have to find the ideal switch ranges with consideration of the cost of transmitting the switch (which is a variant of a "parse-statistics feedback" problem which is one of the big unsolved bugbears in compression). <P> Another reason we like mixing is that our models might be good for part of the character of the data, but not all of it. For example it's quite common to have some data that's text-like (has strong order-N context properties) but is also LZ-like (perhaps matches at offset 4 or 8 are more likely, or at rep offset). It's hard to capture both those correlations with one model. So do two models and mix. <P> On to techniques : <P><HR><P> 1. Select <P> The most elementary mixer just selects one of the models. PPMZ used this for LOE (Local Order Estimation). PPMZ's primary scheme was to just select the order with highest MPS (most probable symbol) probability. Specifically if any order was deterministic (MPS P = 100%) it would be selected. <P> 2. Fixed Weights <P> The simplest mixer is just to combine stages using fixed weights. For example : <FONT COLOR=GREEN><PRE> P = (3 * P1 + P2)/4 </PRE></FONT> This is used very effectively in BCM, for example, which uses it in funny ways like mixing the probability pre-APM with the probability post-APM using fixed weight ratios. <P> Fixed linear combinations like this of multi-speed adaptive counters is not really mixing, but is a way to create higher order IIR filters. (as in the "two speed" counter from BALZ; that really makes a 2-tap IIR from 1-tap IIR geometric decay filters) <P> 3. Per-file/chunk transmitted Fixed Weights <P> Instead of fixed constants for mixing, you can of course optimize the weights per file (or per chunk) and transmit them. <P> If you think of the goal of online gradient descent as finding the best fit of parameters and converging to a constant - then this makes perfect sense. Why do silly neural net online learning when we can just statically optimize the weights? <P> Of course we don't actually want our online learners to converge to constants - we want them to keep adapting; there is no steady state in real world data compression. All the online learning research which assumes convergence and learning rate decreasing over time and various asymptotic gaurantees - it's all sort of wrong for us. <P> M1 (Mattern) uses optimized weights and gets quite high compression. This method is obviously very asymmetric, slow to encode & fast(er) to decode. You can also approximate it by making some pre-optimized weights for various types of data and just doing data type detection to pick a weight set. <P> 4. Functional Weighting <P> You can combine P's using only functions of the P's, with no separate state (weight variables that are updated over time). The general idea here is that more skewed P's are usually much more powerful (more likely to be true) than P's near 50/50. <P> If you have one expert on your panel saying "I dunno, everything is equally likely" and another guy saying "definitely blue!" , you are usually best weighting more to the guy who seems sure. (unless the positive guy is a presidential candidate, then you need to treat lack of considering both sides of an argument or admitting mistakes as huge character flaws that make them unfit to serve). <P> In PPMZ this was used for combining the three SEE orders; the weighting used there was 1/H (entropy). <FONT COLOR=GREEN><PRE> weight as only a function of P the idea is to give more weight to skewed P's W = 1 / H(P) = 1 / ( P * log(P) + (1-P) * log(1-P) ) e^ -H would be fine too very simple skews work okay too : W = ABS(P - 0.5) just higher weight for P away from 0.5 then linearly combine models : P_mixed = Sum{ W_i * P_i } / Sum{ W_i } </PRE></FONT> This method has generally been superceded by better approaches in most cases, but it is still useful for creating a prior model for initial seeding of mixers, such as for 2d APM mixing (below). <P> I'll argue in a future post that Mahoney's stretch/squash (logit/logistic sigmoid) is a form of this (among other things). <P> 5. Secondary Estimation Mixing / 2d APM Mixing <P> As mentioned at the end of the last post, you can use a 2d APM for mixing. The idea is very simple, instead of one P as input to the APM, you take two probabilities from different models and look them up as a 2d matrix. <P> You may want to use the stretch remap and bilinear filtering so that as few bits as possible are needed to index the APM table, in order to preserve as much density as possible. (eg. 6 bits from each P = 12 bit APM table) <P> This allows you to model completely arbitrary mixing relationships, not just "some weight to model 1, some to model 2". It can do the fully P-input dependent mixing, like if model 1 P is high, it gets most of the influence, but if model 1 P is low, then model 2 takes over. <P> The big disadvantage of this method is sparsity. It's really only viable for mixing two or three models, you can't do a ton. It's best on large stable data sources, not small rapidly changing data, because the mixing weight takes longer to respond to changes. <P> One of the big advantages of single-scalar weighted mixing is that it provides a very fast adapting stage to your compressor. Say you have big models with N states, N might be a million, so each individual context only gets an update every millionth byte. Then you might have a (1d) APM stage to adapt the P's; this has 100 entries or so, so it only gets 1/100 updates. But then you blend the models with a scalar mixing weight - that updates after every byte. The scalar mixing weight is always 100% up to date with recent changes in the data, it can model switches very quickly while the rest of the big model takes much more time to respond. 2d APM Mixing loses this advantage. <P> 6. Beta Weighting <P> Beta Weighting was introduced by CTW ; see also the classic work of Volf on switching, etc. (annoyingly this means something different in stock trading). <P> Beta Weighting is an incremental weight update. It does : <FONT COLOR=GREEN><PRE> P_mixed = Sum W_i * P_i / Sum W_i (linear weighted sum of model probabilities, normalized) after seeing symbol X W_i *= P_i(X) </PRE></FONT> We can understand this intuitively : <FONT COLOR=GREEN><PRE> If a model had a high probability for the actually occuring symbol (it was a good model), it will have P_i large W_i *= 0.9 or whatever if the model had low probability for the actual symbol (it was wrong, claiming that symbol was unlikely), it will get scaled down : W_i *= 0.1 for example </PRE></FONT> Furthermore, the weights we get this way are just negative exponentials of the cost in each model. That is : <FONT COLOR=GREEN><PRE> P = 2^ - L L = codelen The weights winds up being the product of all coded probabilities seen so far : W at time t = P_(t-1) * P(t-2) * P(t-3) * P(t_4) ... W = 2^( - Sum_t{ L_(t-1) + L(t-2) + L(t-3) ... } ) W = 2^( - codelen if you used this model for all symbols ) W ~ e^ - cost </PRE></FONT> If you like we can also think of this another way : <FONT COLOR=GREEN><PRE> since W is normalized, overall scaling is irrelevant at each update, find the model that produced the lowest codelen (or highest P) this is the model we would have chosen with an ideal selector so the cost of any other model is : cost_i = L_i - L_best the CTW Beta weight rule is : W_i *= 2^ - cost_i (and the best model weight stays the same, its cost is zero) or W = 2^ - total_cost total_cost += cost_i </PRE></FONT> <P> In the special case of two models we can write : <FONT COLOR=GREEN><PRE> The normalized weight for model 1 is : W1 = ( 2^-C1 ) / ( 2^-C1 + 2^-C2 ) = 1 / ( 1 + 2^(C1-C2) ) = log2istic( C2-C1 ) W2 = 1 - W1 = log2istic( C1-C2 ) "log2istic" = logistic sigmoid with base 2 instead of e We only need to track the delta of costs , dC = C1-C2 P = log2istic(-dC) * P1 + log2istic(dC) * P2 (equivalently, just store W1 and do W1 *= P1/P2) </PRE></FONT> We'll see in the next post how logistic is a transform from delta-codelens to probabilities, and this will become neater then. <P> Now Beta weighting is obviously wrong. It weights every coding event equally, no matter how old it is. In compression we always want strong recency modeling, so a better cost update is something like : <FONT COLOR=GREEN><PRE> dC' = dC * aging + rate * (C1-C2) where aging = 0.95 or whatever , something < 1 and rate is not so much the learning rate (it could be normalized out) as it is an exponential modifier to the weighting. </PRE></FONT> Beta weighting as used in CTW stores a weight per node. Each node weights counts as seen at that level vs. the next deeper level. In contrast, in context mixers like PAQ the weight for order N vs order N+1 is stored outside of the model, not per context but as a global property of the data. The PAQ way is much more flexible because you can always use some context for the weight if you like. <P> Next post we'll get into online gradient descent. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/GLaEYn1ZNm8" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-67539029806839499002018-05-01T11:58:00.002-07:002018-05-01T20:31:32.989-07:00Secondary Estimation : From PPMZ SEE to PAQ APM I want to ramble a bit about how Matt Mahoney uses Secondary Estimation in APM's (in his compressors PAQ and BBB), but to get there I thought I'd go through a little story about what secondary estimation is all about. <P> Secondary Estimation began as a way to apply a (small) correction to a (large) primary model, but we will see in our journey that it has transformed into a way of doing heavy duty modeling itself. <P> The simplest and perhaps earlier form of secondary estimation may have been bias factors to linear predictors. <P> Say you are predicting some signal like audio. A basic linear predictor is of the form : <FONT COLOR=GREEN><PRE> to code x[t] pred = 2*x[t-1] - x[t-2] delta = x[t] - pred transmit delta </PRE></FONT> Now it's common to observe that "delta" has remaining predicability. The simplest case is if it just has an overall bias; it tends to be > 0. Then you track the average observed delta and subtract it off to correct the primary predictor. A more interesting case is to use some context to track different corrections. For example : <FONT COLOR=GREEN><PRE> Previous delta as context : context = delta > 0 ? 1 : 0; pred = 2*x[t-1] - x[t-2] delta = x[t] - pred bias = secondary_estimation[ context ] transmit (delta - bias) secondary_estimation[ context ].update(bias) </PRE></FONT> Now we are tracking the average bias that occurs when previous delta was positive or negative, so if they tend to occur in clumps we will remove that bias. ("update" means "average in in some way" which we will perhaps revisit). <P> Another possibility is to use local shape information, something like : <FONT COLOR=GREEN><PRE> for last 4 values , track if delta was > 0 each of those is a bit this makes a 4 bit context look up bias using this context </PRE></FONT> Now you can detect patterns like the transition from flats to edges and apply different biases in each case. <P> The fundamental thing that we are trying to do here, which is a common theme is : <P> <CITE> The primary model is a mathematical predictor, which extracts some fundamental information about the data. <P> The secondary estimator tracks how well that primary model is fitting the current observation set and corrects for it. </CITE> <P> In CALIC a form of secondary estimation very similar to this is used. CALIC uses a shape-adaptive gradient predictor (DPCM lossless image coding). This is the primary model; essentially it tries to fix the local pixels to one of a few classes (smooth, 45 degree edge, 90 degree edge, etc.) and uses different linear predictors for each case. The prediction error from that primary model is then corrected using a table lookup in a shape context. The table lookup tracks the average bias in that shape context. While the primary model uses gradient predictors that are hard-coded (don't vary from image to image), the secondary model can correct for how well those gradient predictors fit the current image. <P> I'm going to do another hand wavey example before we get into modern data compression. <P> A common form of difficult prediction that everyone is familiar with is weather prediction. <P> Modern weather prediction is done using ensembles of atmospheric models. They take observations of various atmospheric variables (temperature, pressure, vorticity, clouds, etc.) and do a physical simulation to evolve the fluid dynamics equations forward in time. They form a range of possible outcomes by testing how variations in the input lead to variations in the output; they also create confidence intervals by using ensembles of slightly different models. The result is a prediction with a probability for each possible outcome. The layman is only shown a simplified prediction (eg. predict 80 degrees Tuesday) but the models actually assign a probability to each possibility (80 degrees at 30% , 81 degrees at 15%, etc.) <P> Now this physically based simulation is a primary model. We can adjust it using a secondary model. <P> Again the simplest form would be if the physical simulation just has a general bias. eg. it predicts 0.5 degrees too warm on average. A more subtle case would be if there is a bias per location. <FONT COLOR=GREEN><PRE> primary = prediction from primary model context = location track (observed - primary) store bias per context secondary = primary corrected by bias[context] </PRE></FONT> Another correction we might want is the success of yesterday's prediction. You can imagine that the prediction accuracy is usually better if yesterday's prediction was spot on. If we got yesterday grossly wrong, it's more likely we will do so again. Now in the primary physical model, we might be feeding in yesterday's atmospheric conditions as context already, but it is not using yesterday's prediction. We are doing secondary estimation using side data which is not in the primary model. <P> You might also imagine that there is no overall bias to the primary prediction, but there is bias depending on what value it output. And furthermore that bias might depend on location. And it might depend on how good yesterday's prediction was. Now our secondary estimator is : <FONT COLOR=GREEN><PRE> make primary prediction context = value of primary, quantized into N bits (eg. maybe we map temperature to 5 bits, chance of precipitation to 3 bits) context += location context += success of yesterday's prediction secondary = primary corrected by bias[context] </PRE></FONT> Now we can perform quite subtle corrections. Say we have a great overall primary predictor. But in Seattle, when the primary model predicts a temperature of 60 and rain at 10% , it tends to actually rain 20% of the time (and more likely if we got yesterday wrong). <P><HR><P> <B>SEE in PPMZ</B> <P> I introduced Secondary Estimation (for escapes) to context modeling in PPMZ. <P> (See "Solving the Problems of Context Modeling" for details; I was obviously quite modest about my work back then!) <P> Let's start with a tiny bit of background on PPM and what was going on in PPM development back then. <P> PPM takes a context of some previous characters (order-N = N previous characters) and tracks what characters have been seen in that context. eg. after order-4 context "hell" we may have seen space ' ' 3 times and 'o' 2 times (and if you're a bro perhaps 'a' too many times). <P> Classic PPM starts from deep high order contexts (perhaps order-8) and if the character to be coded has not been seen in that context order, it sends an "escape" and drops down to the next lower context. <P> The problem of estimating the escape probability was long a confounding. The standard approach was to use the count of characters seen in the context to form the escape probability. <FONT COLOR=GREEN><PRE> A nice summary of classical escape estimators from "Experiments on the Zero Frequency Problem" : n = the # of tokens seen so far u = number of unique tokens seen so far ti = number of unique tokens seen i times PPMA : 1 / (n+1) PPMB : (u - t1) / n PPMC : u / (n + u) PPMD : (u/2) / n PPMP : t1/n - t2/n^2 - t3/n^3 ... PPMX : t1 / n PPMXC : t1 / n if t1 < n u / (n + u) otherwise </PRE></FONT> eg. for PPMA you just leave the escape count at 1. For PPMC, you increment the escape count each time a novel symbol is seen. For PPMD, you increment the escape count by 1/2 on a novel symbol (and also set the novel characters count to 1/2) so the total always goes up by 1. <P> Now some of these have theoretical justifications based on Poisson process models and Laplacian priors and so on, but the correspondence of those priors to real world data is generally poor (at low counts). <P> The big difficult with PPM (and data compression in general) is that low counts is where we do our business. We can easily get contexts with high counts (sufficient statistics) and accurate estimators by dropping to lower orders. eg. if you did PPM and limit your max order to order-2 , you wouldn't have this problem. But we generally want to push our orders to the absolute highest breaking point, where our statistical density goes to shit. Big compression gains can be found there, so we need to be able to work where we may have only seen 1 or 2 events in a context before. <P> Around the time I was doing PPMZ, the PPM* and PPMD papers by Teahan came out ("Unbounded length contexts for PPM" and "Probability estimation for PPM"). PPMD was a new better estimator, but to me it just screamed that something was wrong. All these semi-justified mathematical forms for the escape probability seemed to just be the wrong approach. At the same time, PPM* showed the value of long deterministic contexts, where by definition "u" (the number of unique symbols) is always 1 and "n" (total) count is typically low. <P> We need to be able to answer a very important and difficult question. In a context that has only seen one symbol one time, what should the escape probability be? (equivalently, what is the probability that one symbol is right?). There is no simple formula of observed counts in that context that can solve that. <P> Secondary Estimation is the obvious answer. We want to know : <FONT COLOR=GREEN><PRE> P(esc) when only 1 symbol has been seen in a context </PRE></FONT> We might first observe that this P(esc) varies between files. On more compressible files, it is much lower. On near-random files, it can be quite high. This is a question of to what extent is this novel deterministic context a reflection of the end statistics, or just a mirage due to sparse statistics. <P> That is, <FONT COLOR=GREEN><PRE> In an order-8 context like "abcdefgh" we have only even seen the symbol 'x' once Is that a reflection of a true pattern? That "abcdefgh" will always predict 'x' ? Or is the following character in fact completely random, and if we saw more symbols we would see all of them with equal probability </PRE></FONT> Again you don't have the information within the context to tell the difference. Using an average across the whole file is a very rough guess, but obviously you could do better. <P> You need to use out-of-context observations to augment the primary model. Rather than just tracking the real P(esc) for the whole file we can obviously use some context to find portions of the data where it behaves differently. For example P(esc) (when only 1 symbol has been seen in a context) might be 75% if the parent context has high entropy; it might be 50% if the order is 6, it might be 10% if order is 8 and the previous symbols were also coded from high order with no escape. <P> In PPMZ, I specifically handled only low escape counts and totals; that is, it was specifically trying to solve this sparse statistics problem. The more modern approach (APM's, see later) is to just run all probabilities through secondary statistics. <P> The general approach to secondary estimation in data compression is : <FONT COLOR=GREEN><PRE> create a probability from the primary model take {probability + some other context} and look that up in a secondary model use the observed statistics in the secondary model for coding </PRE></FONT> The "secondary model" here is typically just a table. We are using the observed statistics in one model as the input to another model. In PPMZ, the primary model is large and sparse while the secondary model is small and dense, but that need not always be the case. <P> The secondary model is usually initialized so that the input probabilities pass through if nothing has been observed yet. That is, <FONT COLOR=GREEN><PRE> initially : secondary_model[ probability + context ] = probability </PRE></FONT> Secondary estimation in cases like PPM can be thought of as a way of sharing information across sparse contexts that are not connected in the normal model. <P> That is, in PPM contexts that are suffixes have a parent-child relationship and can easily share information; eg. "abcd" and "bcd" and "cd" are connected and can share observations. But some other context "xyzw" is totally unconnected in the tree. Despite that, by having the same statistics they may be quite related! That is <FONT COLOR=GREEN><PRE> "abcd" has seen 3 occurances of symbol 'd' (and nothing else) "xyzw" has seen 3 occurances of symbol 'd' (and nothing else) </PRE></FONT> these are probably very similar contexts even though they have no connection in the PPM tree. The next coding event that happens in either context, we want to communicate to the other. eg. say a symbol 'e' now occurs in "abcd" - that makes it more likely that "xyzw" will have an escape. That is, 3-count 'd' contexts are now less likely to be deterministic. The secondary estimation table allows us to accumulate these disparate contexts together and merge their observations. <P> The PPMZ SEE context is made from the escape & total count, as well as three different orders of previous symbol bits. The three orders and then blended together (an early form of weighted mixing!). I certainly don't recommend the exact details of how PPMZ does it today. The full details of the PPMZ SEE mechanism can be found in <A HREF="http://www.cbloom.com/papers/ppmz.pdf"> PPMZ (Solving the Problems of Context Modeling) (PDF) </A> . You may also wish to consult the later work by Shkarin "PPM: one step to practicality" (PPMd/PPmonstr/PPMii) which is rather more refined. Shkarin adds some good ideas to the context used for SEE, such as using the recent success so that switches between highly compressible and less compressible data can be modeled. <P> Let me repeat myself to be clear. Part of the purpose & function of secondary estimation is to find patterns where the observed counts in a state to do not linearly correspond to the actual probability in any way. <P> That is, secondary estimation can model things like : <FONT COLOR=GREEN><PRE> Assuming a binary coder The coder sees 0's and 1's Each context state tracks n0 and n1 , the # or 0's and 1's seen in that context On file A : if observed { n0 = 0, n1 = 1 } -> actual P1 is 60% if observed { n0 = 0, n1 = 2 } -> actual P1 is 70% if observed { n0 = 0, n1 = 3 } -> actual P1 is 90% On file B : if observed { n0 = 0, n1 = 1 } -> actual P1 is 70% if observed { n0 = 0, n1 = 2 } -> actual P1 is 90% if observed { n0 = 0, n1 = 3 } -> actual P1 is 99% The traditional models are of the form : P1 = (n1 + c) / (n0 + n1 + 2c) and you can make some Bayesian prior arguments to come up with different values of c (1/2 and 1 are common) </PRE></FONT> The point is there is *no* prior that gets it right. If the data actually came from a Laplacian or Poisson source, then observing counts you could make estimates of the true P in this way. But context observations do not work that way. <P> There are lots of different effects happening. One is that there is multiple "sources" in a file. There might be one source that's pure random, one source is purely predictable (all 0's or all 1's), another source is in fact pretty close to a Poisson process. When you have only a few observations in a context, part of the uncertainty is trying to guess which of the many sources in the file that context maps to. <P><HR><P> <B> Adaptive Probability Map (APM) ; secondary estimation & beyond. </B> <P> Matt Mahoney's PAQ (and many other modern compressors) make heavy use of secondary estimation, not just for escapes, but for every coding event. Matt calls it an "APM" , and while it is essentially the same thing as a secondary estimation table, the typical usage and some details are a bit different, so I will call them APMs here to distinguish. <P> PPM-type compressors hit a dead end in their compression ratio due to the difficult of doing things like mixing and secondary estimation in character alphabets. PAQ, by using exclusively binary coding, only has one probability to work with (the probability of a 0 or 1, and then the other is inferred), so it can be easily transformed. This allows you to apply secondary estimation not just to the binary escape event, but to all symbol coding. <P> An APM is a secondary estimation table. You take a probability from a previous stage, look it up in a table (with some additional context, optionally), and then use the observed statistics in the table, either for coding or as input to another stage. <P> By convention, there are some differences in typical implementation choices. I'll describe a typical APM implementation : <FONT COLOR=GREEN><PRE> probability P from previous stage P is transformed nonlinearly with a "stretch" [stretch(P) + context] is looked up in APM table observed P there is passed to next stage optionally : look up both floor(P) and ceil(P) in some fixed point to get two adjacent APM entries linearly interpolate them using fractional bits </PRE></FONT> A lot of the details here come from the fact that we are passing general P's through the APM, rather than restricting to only low counts as in PPMZ SEE. That is, we need to handle a wide range of P's. <P> Thus, you might want to use a fixed point to index the APM table and then linearly interpolate (standard technique for tabulated function lookup); this lets you use smaller, denser tables and still get fine precision. <P> The "stretch" function is to map P so that we quantize into buckets where we need them. That is, if you think in terms of P in [0,1] we want to have smaller buckets at the endpoints. The reason is that P steps near 0 or 1 make a very large difference in codelen, so getting them right there is crucial. That is, a P difference of 0.90 to 0.95 is much more important than from 0.50 to 0.55 ; instead of having variable bucket sizes (smaller buckets at the end) we use uniform quantization but stretch P first (the ends spread out more). One nice option for stretch is the "logit" function. <FONT COLOR=GREEN><PRE> LPS symbol codelen (LPS = less probable symbol) steps of P near 0.5 produce small changes in codelen : P = 0.5 : 1 bit P = 0.55 : 1.152 bits steps of P near 1.0 produce big changes in codelen : P = 0.9 : 3.32193 bits P = 0.95 : 4.32193 bits </PRE></FONT> The reason why you might want to do this stretch(P) indexing to the APM table (rather than just look up P) is again density. With the stretch(P) distortion, you can get away with only 5 or 6 bits of index, and still get good resolution where you need it. With linear P indexing you might need a 10 bit table size for the same quality. That's 32-64 entries instead of 1024 which is a massive increase in how often each slot gets updated (or a great decrease in the average age of statistics; we want them to be as fresh as possible). With such course indexing of the APM, the two adjacent table slots should be used (above and below the input P) and linearly interpolated. <P> APM implementations typically update the observed statistics using the "probability shift" method (which is equivalent to constant total count or geometric decay IIR). <P> An APM should be initialized such that it passes through the input probability. If you get a probability from the previous stage, and nothing has yet been observed in the APM's context, it passes through that probability. Once it does observe something it shifts the output probability towards the observed statistics in that context. <P> We can see something interesting that APM's can do already : <P> I wrote before about how secondary estimation is crucial in PPM to handle sparse contexts with very few events. It allows you to track the actual observed rates in that class of contexts. But we thought of SEE as less important in dense contexts. <P> That is true only in non-time-varying data. In the real world almost all data is time varying. It can shift character, or go through different modes or phases. In that sense all contexts are sparse, even if they have a lot of observed counts, because they are sparse in observation of recent events. That is, some order-4 context in PPM might have 100 observed counts, but most of those were long in the past (several megabytes ago). Only a few are from the last few bytes, where we may have shifted to a different character of data. <P> Running all your counts through an APM picks this up very nicely. <FONT COLOR=GREEN><PRE> During stable phase : look up primary model secondary = APM[ primary ] when primary is dense, secondary == primary will pass through Now imagine we suddenly transition to a chunk of data that is nearly random the primary model still has all the old counts and will only slowly learn about the new random data but the APM sees every symbol coded, so it can learn quickly APM[] will quickly tend towards APM[P] = 0.5 for all P that is, the input P will be ignored and all states will have 50/50 probability </PRE></FONT> This is quite an extreme example, but more generally the APM can pick up regions where we move to more or less compressible data. As in the PPM SEE case, what's happening here is sharing of information across parts of the large/sparse primary model that might not otherwise communicate. <P> Let me repeat that to be clear : <P> Say you have a large primary model, with N context nodes. Each node is only updated at a rate of (1/N) times per byte processed. The APM on the other hand is updated every time. It can adapt much faster - it's sharing information across all the nodes of the primary model. You may have very mature nodes in the primary model with counts like {n0=5,n1=10} (which we would normally consider to be quite stable). Now you move to a region where the probability of a 1 is 100%. It will take *tons* of steps for the primary model to pick that up and model it, because the updates are scattered all over the big model space. Any one node only gets 1/N of the updates, so it takes ~N*10 bytes to get good statistics for the change. <P> An interesting common way to use APM's is in a cascade, rather than a single step with a large context. <P> That is, you want to take some P from the primary model, and you have additional contexts C1 and C2 to condition the APM step. You have two options : <FONT COLOR=GREEN><PRE> Single step, large context : P_out = APM[ {C1,C2,P} ] Two steps, small context, cascade : P1 = APM1[ {C1,P} ] P_out = APM2[ {C2,P1} ] </PRE></FONT> That is, feed P through an APM with one context, then take that P output and pass it through another APM with a different context, as many times as you like. <P> Why use a cascade instead of a large context? The main reason is density. <P> If you have N bits of extra context information to use in your APM lookup, the traditional big table approach dilutes your statistics by 2^N. The cascade approach only dilutes by *N. <P> The smaller tables of the APM cascade mean that it cannot model certain kinds of correlations. It can only pick up ways that each context correlated to biases on P. It cannot pick up joint context correlations. <FONT COLOR=GREEN><PRE> The APM cascade can model things like : if C1 = 0 , high P's tend to skew higher (input P of 0.7 outputs 0.9) if C2 = 1 , low P's tend to skew lower But it can't model joint things like : if C1 = 0 and C2 = 1 , low P's tend towards 0.5 </PRE></FONT> One nice thing about APM cascades is that they are nearly a NOP when you add a context that doesn't help. (as opposed to the single table large context method, which has a negative diluting effecting when you add context bits that don't help). For example if C2 is just not correlated to P, then APM2 will just pass P through unmodified. APM1 can model the correlation of C1 and P without being molested by the mistake of adding C2. They sort of work independently and stages that aren't helping turn themselves off. <P> One funny thing you can do with an APM cascade is to just use it for modeling with no primary model at all. This looks like : <FONT COLOR=GREEN><PRE> Traditional usage : do order-3 context modeling to generate P modify observed P with a single APM : P = APM[P] APM Cascade as the model : P0 = order-0 frequency P1 = APM1[ o1 | P0 ]; P2 = APM2[ o2 | P1 ]; P3 = APM3[ o3 | P2 ]; </PRE></FONT> That is, just run the probability from each stage through the next to get order-N modeling. <P> We do this from low order to high, so that each stage can add its observation to the extent it has seen events. Imagine that the current order2 and order3 contexts have not been observed at all yet. Then we get P1 from APM1, pass it through APM2, since that has seen nothing it just passes P1 through, then APM3 does to. So we wind up getting the lowest order that has actually observed things. <P> This method of using an APM cascade is the primary model is used by Mahoney in BBB. <P> A 2d APM can also be used as a mixer, which I think I will leave for the next post. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/VVf6eOPhiuE" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-51053606408876715012018-04-18T15:34:00.003-07:002018-10-27T19:28:49.416-07:00Race in EOF on Windows Pipes There appears to be a race in the EOF check on Windows pipes. I haven't seen this written about elsewhere, so here it is. <P> TLDR : don't call eof on pipes in Windows! Use read returning zero instead to detect eof. <P> We observed that semi-randomly pipes would report EOF too soon and we'd get a truncated stream. (this is not the issue with ctrl-Z ; of course you have to make sure your pipe is set to binary). <P> To reproduce the issue you may use this simple piper : <FONT COLOR=GREEN><PRE> // piper : #include <code>&lt;</code>fcntl.h> #include <code>&lt;</code>io.h> int main(int argc,char *argv[]) { int f_in = _fileno(stdin); int f_out = _fileno(stdout); // f_in & out are always 0 and 1 _setmode(f_in, _O_BINARY); _setmode(f_out, _O_BINARY); char buf; // NO : sees early eof! // don't ask about eof until _read returns 0 while ( ! _eof(f_in) ) // YES : just for loop here //for(;;) { int n = _read(f_in,buf,sizeof(buf)); if ( n > 0 ) { _write(f_out,buf,n); } else if ( n == 0 ) { // I think eof is always true here : if ( _eof(f_in) ) break; } else { int err = errno; if ( err == EAGAIN ) continue; // respond to other errors? if ( _eof(f_in) ) break; } } return 0; } </PRE></FONT> the behavior of piper is this : <FONT COLOR=GREEN><PRE> vc80.pdb 1,044,480 redirect input & ouput : piper.exe < vc80.pdb > r:\out r:\out 1,044,480 consistently copies whole stream, no problems. Now using a pipe : piper.exe < vc80.pdb | piper > r:\out r:\out 16,384 r:\out 28,672 r:\out 12,288 semi-random output sizes due to hitting eof early </PRE></FONT> If the eof check marked with "NO" in the code is removed, and the for loop is used instead, piping works fine. <P> I can only guess at what's happening, but here's a shot : <P> If the pipe reader asks about EOF and there is nothing currently pending to read in the pipe, eof() returns true. <P> Windows anonymous pipes are unbuffered. That is, they are lock-step between the reader & writer. When the reader calls read() it blocks until the writer puts something, and vice-versa. The bytes are copied directly from writer to reader without going through an internal OS buffer. <P> In this context, what this means is if the reader drains out everything the writer had to put, and then races ahead and calls eof() before the writer puts anything, it sees eof true. If the writer puts something first, it seems eof false. It's just a straight up race. <FONT COLOR=GREEN><PRE> time reader writer ----------------------------------------------------------- 0 _write blocks waiting for reader 1 _eof returns false 2 _read consumes from writer 3 _write blocks waiting for reader 4 _eof returns false 5 _read consumes from writer 6 _eof returns true 7 _write blocks waiting for reader </PRE></FONT> at times 1 and 4, the eof check returns false because the writer had gotten to the pipe first. At time 6 the reader runs faster and checks eof before the writer can put anything, now it seems eof is true. <P> As a check of this hypothesis : if you add a Sleep(1) call immediately before the _eof check (in the loop), there is no early eof observed, because the writer always gets a chance to put data in the pipe before the eof check. <P> Having this behavior be a race is pretty nasty. To avoid the problem, never ask about eof on pipes in Windows, instead use the return value of read(). The difference is that read() blocks on the writer process, waiting for it to either put some bytes or terminate. I believe this is a bug in Windows. Pipes should never be returning EOF as long as the process writing to them is alive. <img src="http://feeds.feedburner.com/~r/CbloomRants/~4/q7BR7fmKl3E" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0