tag:blogger.com,1999:blog-52469877556510652862019-02-09T01:49:16.454-08:00cbloom rantscbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.comBlogger728125tag:blogger.com,1999:blog-5246987755651065286.post-40700859538888130482019-01-20T12:03:00.002-08:002019-01-20T12:03:26.922-08:00The Oodle LZNIB Algorithm
Oodle LZNIB is an LZ77-family compressor which has very good decode speed (about 5X faster than Zip/deflate, about 1/2 the speed of LZ4)
while getting better compression than Zip/deflate. This was quite unique at the time it came out. It is now made obsolete by Oodle
Mermaid/Selkie.
<P>
I thought it might be interesting to write up LZNIB and look at some of the things we learned from it.
<P>
LZNIB is a non-entropy coded LZ (*) which writes values in nibbles and bytes. Literals are written uncompressed (8 bits).
<P>
(* = actually it's more like "semi entropy coded"; it doesn't use bit io, and doesn't use standard entropy coders like Huffman,
but the variable length integer encoding was adjusted to minimize entropy, and can vary to fit the file's statistics; more details
on this later)
<P>
LZNIB can send three actions : literal run ("lrl") , match with offset ("normal match"), match with repeat offset ("rep match").
<P>
One of the key realizations for LZNIB is that even though there are three actions, only two are possible at any time.
<FONT COLOR=GREEN><PRE>
After Literals :
LRL should not occur (would have just been a longer LRL)
match & rep match are possible
After Match or Rep Match :
rep match should not occur (would have just been a longer match)
normal match & lrl are possible
</PRE></FONT>
because there are only two actions possible at each point, we can send this using a nibble with a decision threshold value :
<FONT COLOR=GREEN><PRE>
Threshold T
values in [0 , T-1] are action type 1
values in [T , 15] are action type 2
</PRE></FONT>
So the core decode step of LZNIB is to grab the next nibble, do one branch, and then process that action.
The values within your threshold group are the first part of your LRL or match len.
There are at least two different thresholds, one for {match/rep-match} in the after-literals state,
and one for {match/lrl} in the after-match state. In LZNIB we hard-coded the threshold for match/rep-match
to 5 as this was found to be good on most data. The optimal {match/lrl} threshold is more dependent on the data.
<P>
Approximately, (T/16) should be the probability of action type 1. This is in fact exactly what entropy coding this binary action
choice would do. What entropy coding does is take your range of possible encodable values and splits them by the probability of
the binary choice. The remaining values in the word can then be split to send further choices. Here we just do the first choice using a
semi-entropy-coding, then use the remainder of the word to send as much of our next value ("len") as we can. (the fact that we
do not entropy code "len" and that len has probability peaks is why the probability based choice of T might not be actually optimal)
<P>
In LZNIB the threshold T could be transmitted per chunk so that it could be optimized to fit the data. In practice that was
rarely used, and mostly the default value of T=8 was used. Part of why it was rarely used is due to the difficulty of parse-statistics
feedback. The parse must make decisions based on some assumed T, because it affects coding cost. Then after you have your parse
choices you can make a choice for optimal T for that data, but if that T is different you might want to re-parse with the new T.
This is a mess and the simplest way to address it here is just to parse for all values of T. You can reduce the set of likely
useful T values to around 8 or 10 and just do the parse 8-10X times to make a T choice. This in fact works great and helps compression
a lot on some data, but is obviously slow.
<P>
In contrast to something like LZ4, LZNIB has a flexible action sequence. That is, LZ4 is always doing {LRL,match,LRL,match,...}.
For example to send two matches in a row you must send an LRL of length 0 between them. LZNIB has a flexible action sequence,
therefore requires a branch, it could send {LRL,match,match,LRL,rep match,match,...}
<P>
LZNIB uses unbounded offsets for matches. They are sent using a variable length integer scheme. Initially 12 bits
are sent, then a byte is added as necessary. The scheme used is "encodemod", which treats each value sent as being
divided in two ranges. One range is for values that can terminate in the current word, the other range is for values
that don't fit and includes a portion of the remaining value. See the
<A HREF="http://cbloomrants.blogspot.com/2015/11/flipped-encodemod.html"> encodemod post series </A> for details.
<P>
The encodemod scheme is very flexible and can be tuned to fit the entropy characteristics of the data (unlike
traditional bit-packing variable length integer schemes, which have a less flexible knob). To do this we gathered LZ parses for many files and
found the best encodemod thresholds. This was done for the offsets, for LRL, match len (after match), match len
(after literal), and rep match len.
<P>
All the lens (LRL, match len, etc.) are sent first in the control nibble. If they don't fit, the maximum is
sent in the control nibble, then the remainder is sent using encodemod. The encodemod used for lens sends first
another nibble, then any further values are sent with bytes.
<P>
The full decoder for LZNIB in pseudocode is :
<FONT COLOR=GREEN><PRE>
Do a literal run to start.
After-literal state :
Get control nibble
if nib < 5
{
rep match
if nib == 4 get further ml using encodemod, first nibble then bytes
copy rep match
}
else
{
normal match
if nib == 15 get further ml using encodemod, first nibble then bytes
get offset using encodemod, first 12 bits then bytes
copy match
}
After-match state :
Get control nibble
if nib < T
{
literal run
if nib == T-1 get further lrl using encodemod, first nibble then bytes
copy literal run
goto After-literal
}
else
{
normal match
if nib == 15 get further ml using encodemod, first nibble then bytes
get offset using encodemod, first 12 bits then bytes
copy match
goto After-match
}
</PRE></FONT>
That's it.
<P><HR><P>
LZNIB is simple to decode but it turned out to be quite hard to parse well. Making good choices in the encoder can
have an absolutely huge affect on the decode speed, even at roughly equal compressed size. Some of those
issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be
important again for Kraken & Mermaid/Selkie.
<P>
One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in
units of nibbles. Those ties need to be broken in favor of faster decode speed.
<P>
To good approximation, decode speed is just about minimizing the number of control nibbles. You want as
few transitions between actions as possible, you prefer to stay in long literal runs and long matches.
You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size
with fewer mode switches.
<P>
Let's look at some simple examples to get a feel for these issues.
<P>
Consider a rep match of len 1. A rep match control takes 1 nibble, while sending a literal instead takes 2
nibbles, so sending a rep instead would save you 1 nibble. But, if the rep is followed by another literal
run, you would have to send a new lrl control nibble there, while staying in an lrl might save you that.
<P>
<FONT COLOR=GREEN><PRE>
L = literal , costs 2 nibbles + 1 at start of lrl
R = rep match, costs 1 control nibble
M = normal match, costs 1 control nibble + 3 or more nibbles for offset
LLR = 1+2*2+1 = 6 nibbles
LLL = 1+3*2 = 7 nibbles
so LLR is best right? take the rep!
LLRL = 1+2*2+1 + 1+2 = 9 nibbles
LLLL = 1+4*2 = 9 nibbles
No! Nibble count is the same but fewer controls = prefer the literal.
It depends what follows the rep.
LLRM
LLLM
in this case the rep is cheaper.
So if a literal follows the rep, don't take it, right?
LLLLRLLLL = 1+2*4 + 1 + 1+2*4 = 19 nibbles
LLLLLLLLL = 1+2*9 + 1 = 20 nibbles
No! In the second case the LRL of 9 can't fit in the control nibble, so an
extra lrl nibble must be sent. So prefer the rep here.
</PRE></FONT>
So the simple choice of "do I take a rep of len 1 or stay in LRL" is not easy and can only be made
non-locally.
<P>
A similar thing happens with normal matches. A normal match of len 3 with an offset that fits in 12 bits
takes 4 nibbles, which saves you 2 vs sending 3 literals. But if you have to resume an LRL after the match,
that costs you 1, so your savings is down to 1. There may be cheaper ways (in terms of decode speed) to get that
1 nibble savings, such as a rep of len 1 for example. Or if you can trade two len 3 matches for a single len 4 match :
<FONT COLOR=GREEN><PRE>
MMMLMMML = 4 + 3 + 4 + 3 = 14
LLMMMMLL = 5 + 4 + 5 = 14
same nibble count, but fewer mode switches = prefer the single len 4 match over two len 3's
</PRE></FONT>
A len 3 match that doesn't fit in the 12 bit offset (but does fit in the next threshold, at 20 bits) takes 6
nibbles to send, which is a tie with 3 literals. But again it could actually save you a nibble if it causes
the LRL to fit in control.
<P>
You might be tempted to just punt this, eg. make rep matches have a minimum length of 2, and
normal matches have a minimum length of 4. However that is leaving a lot of compression on the table. The shorter
matches only save a nibble here and there, but on some files there are many of those possible. For the optimal
parsers we wanted to be able to get those wins when they are possible.
<P><HR><P>
The challenge of optimal parsing LZNIB.
<P>
The standard approach for optimal parsing a format with rep matches is the "forward arrivals" style parse
(as in LZMA). (this is in contrast to the classical backwards LZSS optimal parse which can only be used in
formats that do not have adaptive state which is parse-path dependent).
<P>
See some posts on forward-arrivals parse :
<A HREF="https://cbloomrants.blogspot.com/2015/01/01-23-15-lza-new-optimal-parse.html"> here </A> and
<A HREF="https://cbloomrants.blogspot.com/2016/06/06-09-16-fundamentals-of-modern-lz-two.html"> here </A>
<P>
The standard forward arrivals parse does not work well with LZNIB. In the forward-arrivals parse, you take the
cheapest way to arrive at pos P, you cost all ways to go forward from P (with a literal, or various match options),
and fill out the costs at P+1, P+len, etc.
<P>
The problem is that it doesn't account for the fact that the best way to arrive at pos P depends on what comes
next (in particular, do literals or matches come next, and if literals, how many?). It treats each action as
being independent.
<P>
We'll look at some ways to improve the forward arrivals parse but first a detour.
Fortunately in this case it's possible to solve this systematically.
<P>
Whenever I face a difficult heuristic like this where we know we're approximating in some way and don't know
quite the best way, the first thing I always look for is - can we solve it exactly? (or with some bounded error).
The more exact solution might be too slow and we won't actually be able to ship it, but it gives us a target,
it lets us know how close our approximation is, and may give us guidance on how to get there.
<P>
In this case what we can do is a full dynamic programming parse with the LRL as part of the state matrix.
<P>
Optimal parsing is always a form of dynamic programming. We often approximate it by ignoring the internal state of
the parse and making an array of only [pos]. What I mean by "full dynamic programming" is to make the state explicit
and use a 2d array of [pos][state]. Then on each parse choice, you look at how that steps position (eg. by match len)
and also how that changes the internal state, and you move to the next array slot. In this case the important state
variable is LRL.
<P>
(we're treating a single literal as a coding action, which it is not, and correcting that by considering LRL as a
state variable. The result is that we get correct code costs for all LRL steps from each pos.)
<P>
(note that the offset which is used for rep matches is also an important internal state variable, but we are continuing
to ignore that as is usual; we do store a different rep arrival in each [pos][lrl] entry, but we don't differentiate
between arrivals that only differ by rep offset)
<P>
We consider LRL's in [0,21]. This lets us capture the transition of LRL not fitting in the control nibble (at 7,
with the default threshold of 8), and then again when it overflows the first nibble of encodemod (at 7+15=21).
LRL value of 21 is a state for all higher LRL's, so we don't account for when later bytes of encodemod overflow.
<P>
We make a 2d matrix that's (file len) * 22 entries.
<P>
At each pos P, you can advance all the entries by adding one literal. This does :
<FONT COLOR=GREEN><PRE>
for all LRL
costs[P+1][LRL+1] = 2 + costs[P][LRL] + LRL_Delta_Cost(LRL)
(2 nibbles is the cost of a literal byte)
where LRL_Delta_Cost =
1 at LRL = 0
1 at LRL = 7
1 at LRL = 21
otherwise zero
</PRE></FONT>
Or you can advance with a match. To advance with a match, start from the cheapest arrival with any LRL and step
by the match len and fill costs[P+len][0]. You can also advance with a rep match, which is similar except that it
cannot advance from the LRL=0 state. Each arrival stores where it came from (both pos and LRL).
<P>
When you reach the end of the file, you take the cheapest of all the LRL arrivals and trace it backwards to the
root to find the parse.
<P>
This kind of full matrix dynamic programming parse completely captures the non-local effects caused by things
like LRL running into thresholds that change the cost. Unfortunately it's too slow for production (and uses a lot
of memory), but it is very useful as a reference point. It told us that a much better optimal parse was possible.
<P>
An important note : The dynamic programming parse needs to be making space-speed decisions. As noted
previously in particular there are a lot of ties and those need to be broken in favor of decode speed. The primary
driver for decode speed is the number of control tokens. What we did is just add a cost penalty to each control
token. The number we used is (1/4) of a nibble penalty for each control token. That is, we will give up 1 nibble
of compression if it saves 4 mode switches. If we can send {LRL} instead of {LRL,match,LRL,rep,match} we will do it
if the penalty is only one nibble.
<P>
(modern Oodle now uses a rate-disortion-like Lagrange multiplier to optimize for the desired tradeoff of space & speed,
which the user can control. This work in LZNIB predates that system and just used a hard-coded tradeoff that we
found to greatly improve decode speed without much penalty to compressed size.)
<P>
So, armed with the dynamic programming solution, we could generate stats to look at how it was parsing files, and
compare that to how forward-arrivals was parsing. What we saw was :
<FONT COLOR=GREEN><PRE>
dynamic programming :
---------------------------------------------------------
7.6 bytes per token ; 30.2% literals, 55.6% matches, 14.1% LO
AM: 50.4% match / 49.6% lrl ; 8.9 average AM ML , 7.0 lrl
AM: offlow 40.7% ml3 24.0% ml4 21.8%
AL: 49.2% match / 50.8% LO ; 7.6 average AL ML , 6.4 LO len
AL: offlow 53.5% ml3 24.6% ml4 19.9% ; LO len1 11.4% len2 24.9%
---------------------------------------------------------
forward arrivals :
---------------------------------------------------------
5.9 bytes per token ; 21.0% literals, 61.1% matches, 17.9% LO
AM: 46.4% match / 53.6% lrl ; 8.4 average AM ML , 3.5 lrl
AM: offlow 43.1% ml3 37.5% ml4 19.9%
AL: 39.5% match / 60.5% LO ; 7.5 average AL ML , 5.0 LO len
AL: offlow 36.7% ml3 43.1% ml4 15.0% ; LO len1 30.1% len2 23.4%
---------------------------------------------------------
key :
---------------------------------------------------------
decompressed bytes per token ; % of control tokens of each type
AM = in after-match state
AM: % of tokens ; average len of that token
AM: offlow = offset in 12 bits , ml3 = % of matches that have len 3
AL = in after-literal state
LO means "last offset" which a synonym for "rep match"
---------------------------------------------------------
</PRE></FONT>
In this case the compressed size was very similar, but the dynamic programming parse was much faster to decode (about 20% faster).
<P>
We can easily see what's going on :
<FONT COLOR=GREEN><PRE>
DP parse has more bytes per token, eg. fewer tokens for the whole file. This is why it is faster. This is more of the end result
of the problem rather than the source of the problem.
DP parse has way more bytes in literals (30.2% vs 21%)
DP parse has way longer average LRL (7.0 vs 3.5)
forward-arrivals parse sends way more len-3 matches (37.5% vs 25.0% and 43.1% vs 24.6%)
forward-arrivals parse sends way more rep len 1 matches (30.1% vs 11.4%)
</PRE></FONT>
I found it quite interesting that two ways to parse the file could produce nearly the same
compressed size, but get there in very different ways.
<P>
Clearly the forward arrivals parse needs a way to find the longer literal runs when they are best in a non-local way.
<P>
When you are walking along in a forward-arrivals parse, you just see the single cheapest arrival to your pos;
consider something like this :
<FONT COLOR=GREEN><PRE>
1234
LLLR
</PRE></FONT>
At pos 4, the cheapest arrival is via rep-len1. The standard forward arrivals parse fills that spot with the rep-len1 arrival,
and then continues forward from there. There's no way to go back in time. You no longer see the arrival at pos 3.
<P>
A key thing we should observe is that when a non-local cost effect makes us change a decision (away from just using
cheapest arrival), it is always in favor of LRL.
The standard arrivals parse is taking too many matches & reps and we want to take literal runs instead in some of those places.
<P>
The solution is pretty simple :
<P>
Store only match (and rep-match) arrivals at each pos (not literal arrivals). When considering going forward, consider starting at current pos P
from the match arrival there, OR go from [P-1] with 1 LRL, or [P-2] with 2 LRL, etc.
<FONT COLOR=GREEN><PRE>
considering an example from before :
12345678
MMMLMMML
LLMMMMLL
at pos 8 I look at the ways to go forward (find matches & reps)
say I find a match
how should I arrive at pos 8 ?
I can arrive via
arrival[7] (MMMLMMM) + LRL 1
or
arrival[6] (LLMMMM) + LRL 2
You should choose the latter.
Even though arrival[7] was the cheaper way to get to pos 7, arrival[6] is the better way to get to pos 8.
</PRE></FONT>
You want to limit the LRL lookback to something reasonable. Perhaps 8 (or 22 as we did in the full matrix parse, but that isn't necessary).
If you find no match arrivals in the 8-step lookback, you need a way to go back to the most recent preceding match arrival.
<P>
Instead of just looking at a forward step of {match} we consider {step back 1 + L + M} , {back2 + LLM }, etc.
In LZNIB, costing the LRL lookback steps is simple because literals are just always 1 byte.
<P>
Essentially what we're doing here is treating the LZNIB parse as if it used an {LRL,match} combined packet. Even though the actual format
has separate literal run and match actions, they act like they are combined.
<P>
In fact there's a different way to look at LZNIB. After an LRL nibble action token, there must always be a match or rep-match action.
So we can think of those as instead being a byte action token for {LRL+match}. In that case rather than the after-literal / after-match states,
there is only one state :
<FONT COLOR=GREEN><PRE>
LZNIB always after-match :
nibble action : match
(no rep match is possible)
byte action : LRL then match
byte action : LRL then rep-match
</PRE></FONT>
If you parse using these actions, then the non-local effects go away.
<P>
In the end we found huge improvements to the LZNIB parse that gave us ~20% faster decompression just through making better decisions in the
encoder. The way that we investigated this and developed the parse was later fruitful in the development of Kraken & Mermaid/Selkie, which have
similar issues but are even far more complicated to parse well.
<P><HR><P>
Some old references links related to LZNIB, and some perf reports showing LZNIB in the pre-oceanic-cryptozoology days : <P>
<A HREF="https://cbloomrants.blogspot.com/2012/10/10-22-12-lz-bytewise-conclusions.html"> cbloom rants 10-22-12 - LZ-Bytewise conclusions </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2015/03/03-02-15-lz-rep-match-after-match.html"> cbloom rants 03-02-15 - LZ Rep-Match after Match Strategies </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2015/05/05-13-15-skewed-pareto-chart.html"> cbloom rants 05-13-15 - Skewed Pareto Chart </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2015/12/oodle-results-update.html"> cbloom rants Oodle Results Update </A> <BR>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/JeEgeviRb9w" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-52117307302497522072018-10-23T19:52:00.003-07:002018-10-23T19:52:39.045-07:00Oodle 2.7.3 on the Nintendo Switch
It's been a while since I reported Switch performance, and we've made a lot of improvements to ARM performance
since then (both for ARM32 and 64, for Switch, iOS, and Mobile), and of course added Leviathan. Time for an update!
<P>
I compare Oodle against the zlib implementation provided in the Nintendo SDK (nn deflate). Deflate is run at level 9,
Oodle at level 8. I'm timing decode speed only.
<P>
On mixed content test file lzt99 :
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=FFB000,882787,A92FFF,FF00FF,B097F7&chxr=1,0,1000,100&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=nn_deflate-l9|Leviathan-z8|Kraken-z8|Mermaid-z8|Selkie-z8&chd=e:Em,N6,SH,ht,-L&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=FFB000,882787,A92FFF,FF00FF,B097F7&chxr=1,0,3,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=nn_deflate-l9|Leviathan-z8|Kraken-z8|Mermaid-z8|Selkie-z8&chd=e:oL,7J,4o,z-,pc&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
Plot of log compresion ratio vs. log decode speed :
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,6,10,1|1,0,2,1&chd=e:CocNiTwp-y,dOvFtEpHeq&chdl=nn_deflate-l9|Leviathan-z8|Kraken-z8|Mermaid-z8|Selkie-z8&chdlp=r&chco=FFB000|882787|A92FFF|FF00FF|B097F7&chm=s,000000,0,0,0|o,FFB000,0,0,10|o,882787,0,1,10|o,A92FFF,0,2,10|o,FF00FF,0,3,10|o,B097F7,0,4,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|64|128|256|512|1024|1:|1|2|4|">
<P>
Raw numbers :
<P><PRE>
lzt99 : nn_deflate-l9 : 1.883 to 1 : 71.70 MB/s
lzt99 : Leviathan-z8 : 2.773 to 1 : 217.26 MB/s
lzt99 : Kraken-z8 : 2.655 to 1 : 282.96 MB/s
lzt99 : Mermaid-z8 : 2.437 to 1 : 526.64 MB/s
lzt99 : Selkie-z8 : 1.943 to 1 : 971.69 MB/s
</PRE>
<P>
Leviathan is about 3X faster to decode than zlib/deflate, with way more compression. The rest of the Oodle
compressor family provides even faster decode speeds with lower compression ratios. The fastest, Oodle Selkie,
is similar compression ratio to zlib/deflate but more than 10X faster to decode.
<P>
details :
<P>
This is with Switch SDK 5.5, clang-5.0.1 ; the nn:deflate speed has gotten a little worse since the last time I
measured it (was 74.750 MB/s). The compression ratio of nn:deflate is the same. For reference,
<A HREF="https://cbloomrants.blogspot.com/2017/01/oodle-on-nintendo-switch.html"> old numbers are here </A>, for
Oodle 2.6.0 and 2.4.2.
For example you can see that LZNA was at 24 MB/s with compression just slightly below Leviathan.
Leviathan is truly a remarkable beast.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/1mdEBTo-eEc" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-58887579733153226952018-10-22T13:02:00.001-07:002018-10-22T13:02:56.489-07:00recip_arith without unused range
Finishing up, today we'll go through how to make recip_arith take advantage of the un-mapped portion of the
coding range, and see some more connections to past work.
<P>
recip_arith uses only the top bits of "range" in the map, rounding down the cdf to range scaling factor when it
converts to fixed point. Thus it always maps less than range.
<P>
The classic "range coder" does something similar. With the map :
<FONT COLOR=GREEN><PRE>
uint32_t r_norm = ac->range >> cdf_bits;
forward(cdf) = cdf * r_norm;
</PRE></FONT>
the bottom "cdf_bits" of range are not used in this map, so a small portion of range is unused.
<P>
The standard way to make use of this extra range is to assign to the last symbol. Most of the older papers show this
as the default. They'll do something like :
<FONT COLOR=GREEN><PRE>
(range coder "map end")
uint32_t r_norm = ac->range >> cdf_bits;
forward(cdf) :
if ( cdf == cdf_tot ) return range;
else return cdf * r_norm;
</PRE></FONT>
that is, explicitly taking the last interval and making it go all the way to range. You may see this in other "range coder"
implementations.
<P>
I do not do this in my range coder. The benefit is around 0.001 bpb (or less), and it does cost some speed.
The benefit depends on whether you can give that extra range to a useful symbol. Ideally you would give it to the most
probable symbol. What you are doing is reducing the code length of one symbol, you get the most benefit by reducing the code
length of the symbol that occurs most often. The higher the probability of the MPS, the more benefit you will get,
assuming you put the MPS at the end of the alphabet where it gets this range. (conversely if the end of the alphabet is a
symbol that never/rarely occurs, you get no benefit).
<P>
I call this variant "map end". In recip_arith the tradeoff is a little more biased towards doing "map end", because the unmapped
portion of range is larger.
<P>
(note that the "map end" excess range is less than the amount of range assigned per cdf; that is, it's equivalent to increasing
that integer symbol frequency by some fractional amount, so it's not a huge distortion of the symbol probabilities)
<P>
But in recip_arith, just mapping that extra range to one symbol is more of a probability distortion than it is in the range coder,
because the extra space is larger.
<P>
We can do something better. The scaling factor from cdf to range in recip_arith is the *floor* to fixed point. When range is
near the top of that fractional truncation, you would much rather round up. But you can't round up because that would lead to using
more than all of range, which is uncodeable :
<FONT COLOR=GREEN><PRE>
int range_clz = clz32(range);
uint32_t r_top_down = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm_down = r_top_down << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
uint32_t r_top_up = r_top_down+1;
uint32_t r_norm_up = r_top_up << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
r_norm_up == r_norm_down + (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits));
r_norm_down * cdf_tot <= range
r_norm_up * cdf_tot > range
if you just use r_norm_down , you use too little of range
unmapped_range = range - r_norm_down * cdf_tot = range & ( (1 << (32 - range_clz - RECIP_ARITH_TABLE_BITS)) - 1 )
conversely if you used r_norm_up for scaling, you would use too much range
the amount is the ~ bit reverse of those bottom bits
</PRE></FONT>
So inspired by SM98 we can use a mix of round-down & round-up of the scaling factor, such that all of range is covered.
<P>
What we want is to use round_up as much as possible, because it gives lower codelens, and then for enough symbols to avoid
over-using range, switch to round_down. Just like SM98 we can find a threshold such that we place the transmission so that
all of range is used.
<P>
The nice way to think about it conceptually is that we construct two maps : we use r_norm_down as the scaling factor from
the left side of range (starting at 0), and we use r_norm_up as the scaling factor from the right side, and we switch over
whether they cross.
<P>
The implementation is simple :
<FONT COLOR=GREEN><PRE>
r_norm_down & r_norm_up as above
uint32_t map_up_excess = (r_norm_up << cdf_bits) - range;
forward(cdf) :
uint32_t cdf_down = cdf * r_norm_down;
int64_t cdf_up = cdf * r_norm_up - (int64_t) map_up_excess; // signed
ret = MAX( cdf_down, cdf_up );
</PRE></FONT>
and this can be simplified further with some algebra which I won't show.
<P>
Perhaps a picture is clearer :
<P>
<IMG SRC="https://www.cbloom.com/blogimages/recip_downup_map.jpg">
<P>
I call this map "recip_arith down/up". The decoder is straightforward except for one point :
<P>
The reciprocal for both r_top and (r_top+1) have to be looked up. range is conceptually in [1,2) in
fixed point, r_top can be exactly 1 but never 2. (r_top+1) can be exactly 2. The reciprocal table needs
one more entry than (1<<RECIP_ARITH_TABLE_BITS). That means the reciprocal
can take one more bit. In particular if you use the 32 bit reciprocal numerator as in recip_arith.h ,
the reciprocal will no longer fit in 32 bits. The easy solution is to change RECIP_ARITH_NUMERATOR_BITS to 31.
<P>
The coding loss of "recip_arith down/up" (at RECIP_ARITH_TABLE_BITS = 8) is extremely low.
It's typically better than the "range coder" map, and comparable to the CACM87 map.
<P>
"recip_arith down/up" is even extremely good at RECIP_ARITH_TABLE_BITS = 4. That might be interesting
for hardware implementations, because it means very tiny tables can be used, and the multipliers needed
(for encoding) are also very low bit count.
<P>
Note that if you reduce RECIP_ARITH_TABLE_BITS all the way down to 1, then r_top_down is always just 1,
and r_top_up is just 2. Then "recip_arith down/up" is choosing between a 1X and 2X mapping. This is exactly
SM98 !
<P>
At RECIP_ARITH_TABLE_BITS = 2, then "recip_arith down/up" is choosing 1X or 1.5X or 2X. (the top bits are either
10 or 11 , we're getting only 1 fractional bit of range, the top bit is always on). This is exactly the
same as the "reduced overhead" coder of Daala. (see OD_EC_REDUCED_OVERHEAD, in entcode.h)
<P>
<FONT COLOR=GREEN><PRE>
OBJ2
range coder: 246,814 -> 193,172 = 6.261 bpb = 1.278 to 1
recip_arith (8 bits): 246,814 -> 193,282 = 6.265 bpb = 1.277 to 1
recip_arith down/up (8 bits): 246,814 -> 193,171 = 6.261 bpb = 1.278 to 1
recip_arith down/up (4 bits): 246,814 -> 193,240 = 6.264 bpb = 1.277 to 1
recip_arith down/up (3 bits): 246,814 -> 193,436 = 6.270 bpb = 1.276 to 1
PAPER3
range coder: 46,526 -> 27,133 = 4.665 bpb = 1.715 to 1
recip_arith (8 bits): 46,526 -> 27,156 = 4.669 bpb = 1.713 to 1
recip_arith down/up (8 bits): 46,526 -> 27,133 = 4.665 bpb = 1.715 to 1
recip_arith down/up (4 bits): 46,526 -> 27,127 = 4.664 bpb = 1.715 to 1
recip_arith down/up (3 bits): 46,526 -> 27,155 = 4.669 bpb = 1.713 to 1
PIC
H : 1.21464
range coder: 513,216 -> 78,408 = 1.222 bpb = 6.545 to 1
recip_arith (8 bits): 513,216 -> 78,651 = 1.226 bpb = 6.525 to 1
recip_arith down/up (8 bits): 513,216 -> 78,408 = 1.222 bpb = 6.545 to 1
recip_arith down/up (4 bits): 513,216 -> 78,395 = 1.222 bpb = 6.547 to 1
recip_arith down/up (3 bits): 513,216 -> 78,806 = 1.228 bpb = 6.512 to 1
PROGL
range coder: 71,646 -> 42,723 = 4.770 bpb = 1.677 to 1
recip_arith (8 bits): 71,646 -> 42,757 = 4.774 bpb = 1.676 to 1
recip_arith down/up (8 bits): 71,646 -> 42,721 = 4.770 bpb = 1.677 to 1
recip_arith down/up (4 bits): 71,646 -> 42,724 = 4.771 bpb = 1.677 to 1
recip_arith down/up (3 bits): 71,646 -> 42,731 = 4.771 bpb = 1.677 to 1
TRANS
range coder: 93,695 -> 64,806 = 5.533 bpb = 1.446 to 1
recip_arith (8 bits): 93,695 -> 64,851 = 5.537 bpb = 1.445 to 1
recip_arith down/up (8 bits): 93,695 -> 64,806 = 5.533 bpb = 1.446 to 1
recip_arith down/up (4 bits): 93,695 -> 64,820 = 5.535 bpb = 1.445 to 1
recip_arith down/up (3 bits): 93,695 -> 64,884 = 5.540 bpb = 1.444 to 1
</PRE></FONT>
<P>
recip_arith down/up (4 bits) has very low coding loss, it's very close to the standard range coder.
(note that "range coder" here is not doing "map end" which would improve it slightly, particularly
if end was assigned to the MPS, particularly so on "PIC" which is the only file where range coder
noticeably misses the entropy)
<P>
Base recip_arith loss vs range coder is exactly 0.004 bpb in all cases here. This is with
cdf_bits = 13 and a crappy frequency normalization heuristic, so there are some anomalies
where eg. recip_arith down/up 4 bits gets more compression than 8 bits.
recip_arith vanilla version at 4 bits gets very poor; coding loss is comparable to SM98,
around 1.0%
<P>
recip_arith down/up is probably not practical for software implementation, but might be
interesting for hardware. It's also interesting to me as a theoretical superset that
connects recip_arith to SM98.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/_8lS7AtKT-s" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-45018494227592511642018-10-22T11:16:00.001-07:002018-10-22T11:16:29.046-07:00Working through recip_arith
Time to look at the actual recip_arith map and how it works.
<P>
recip_arith takes the top N bits of range (always a high 1 bit, and then N-1 fractional bits in the fixed point view) and
uses them to scale the (power of 2 total) CDF up to the arithmetic domain.
<P>
The recip_arith forward map is :
<FONT COLOR=GREEN><PRE>
int range_clz = clz32(range);
uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm = r_top << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
forward(cdf) = cdf * r_norm;
or
forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
</PRE></FONT>
r_norm is the scaling factor, close to (range/cdf_tot), differing in that all bits below RECIP_ARITH_TABLE_BITS are set to zero.
Recall the standard "range coder" map is :
<FONT COLOR=GREEN><PRE>
uint32_t r_norm = ac->range >> cdf_bits;
forward(cdf) = cdf * r_norm;
</PRE></FONT>
the only difference in recip_arith is that we are only using the top RECIP_ARITH_TABLE_BITS of "range" in the scaling.
This is a floor of the ideal scaling factor, and means we use less of range than we would like.
<P>
Note that the "range coder" itself only uses (r_bits - cdf_bits) in its scaling factor, so it relies on r_bits reasonably larger than cdf_bits.
This is in contract to CACM87 or full precision scaling.
<P>
The recip_arith map can be thought of as starting with the "identity map" from the last post, and then adding fractional bits to
the fixed point in [1,2) , possibly 1.5X, 1.25X, etc. refining the scaling factor.
<P>
The point of course is to able to invert it. The naive inverse would be :
<FONT COLOR=GREEN><PRE>
inverse(code) = code / r_norm;
</PRE></FONT>
which works, but has the divide we are trying to avoid. Our idea is that because r_top is small, we can use
it to look up a reciprocal multiply. How should we do that? There are two choices, where we invert the steps
of the forward map one by one, in reverse order :
<FONT COLOR=GREEN><PRE>
forward(cdf) = (cdf * r_top) << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
inverse(code) = (code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) / r_top;
or
forward(cdf) = (cdf << (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits)) * r_top;
inverse(code) = (code / r_top) >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
</PRE></FONT>
if you actually do the divide, these are the same, but with the reciprocal multiply they are not. The issue is
our reciprocal will only be exactly equal to the divide for some number of numerator bits. If we used the second
form, we would need (code / r_top) , and "code" can be very large (24-31 bits or 56-63 bits). To do that reciprocal
exactly would require an intermediate multiply larger than 64 bits.
<P>
Therefore we need the first form : first shift down code to only the necessary bits to resolve cdf boundaries,
then do the divide :
<FONT COLOR=GREEN><PRE>
inverse(code) :
uint32_t code_necessary_bits = code >> (32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
ret = code_necessary_bits / r_top;
ret = ( code_necessary_bits * recip_table[r_top] ) >> RECIP_ARITH_NUMERATOR_BITS;
</PRE></FONT>
(for 32-bit systems you might want to choose RECIP_ARITH_NUMERATOR_BITS = 32 so that this is a 32x32 -> hi word
multiply; on 64-bit that's moot)
<P>
"code_necessary_bits" has its top bit at cdf_bits + RECIP_ARITH_TABLE_BITS ; in practice this is something like 14 + 8 = 22 .
This is just enough bits that after dividing by r_top , you still resolve every cdf value exactly.
In order for recip_table[] to have an exact reciprocal, it needs to be 1 more bit than the numerator, eg.
cdf_bits + RECIP_ARITH_TABLE_BITS + 1. This means the intermediate that we need to multiply easily fits in 64 bits.
<P>
This acts as the limit on RECIP_ARITH_TABLE_BITS, aside from the consideration of how much space in L1 you can afford to give it.
<P>
With RECIP_ARITH_TABLE_BITS = 1 , recip_arith is the "identity map". With RECIP_ARITH_TABLE_BITS = 8, coding loss is
very low (0.1%).
<P>
The fraction of "range" that can be wasted by the recip_arith map is 1/2^RECIP_ARITH_TABLE_BITS , which means the
coding loss is :
<PRE>!
coding loss @ maximum ~= -log2( 1 - 1/2^RECIP_ARITH_TABLE_BITS )
(eg. at RECIP_ARITH_TABLE_BITS = 8 this is 0.00565 bits)
</PRE>
The maximum occurs when all bits uner the top RECIP_ARITH_TABLE_BITS are 1; when they're all 0 this map is exact and
there's no coding loss. The real world coding loss seems to be about 75% of the above formula.
<P>
Note that in a standard "range coder" you are already approximating to only use the bits of range between r_bits and cdf_bits;
so eg. r_bits in 24-31 , cdf_bits of 14, means you are using 10-17 top bits of range in r_norm (the scaling factor).
recip_arith just always uses 8 (for example) rather than the variable 10-17.
<P>
Note that recip_arith does not map all of range. We'll talk about variants that do use the whole range in the next post.
(The standard range coder also doesn't map all of range). This means there are "code" values which the decoder cannot handle
through its normal paths, where "code" is out of the mapped range. A valid stream produced by the encoder will never lead to
those values being seen in the decoder, but a corrupt/attack/fuzzy stream could have them. Therefore a safe decoder must detect
when code is out of the mapped range and at least handle it well enough not to crash. Testing and returning failure is an efficient
way to handle it since it's just a branch that's never taken.
<P>
In review : the fully general CACM87 map uses 2 divides to encode, and 3 to decode. (encode is 2 forward maps, decode is 1 inverse map +
2 forward maps). By choosing cdf_tot to be a power of 2 we can easily eliminate the divide in the forward map. The divide in the inverse
map is harder to remove, but we do it by reducing the number of bits used so that we can use a table of reciprocal multiplies.
<P>
The decoder critical path of instructions in recip_arith is : clz -> shift -> table lookup -> multiply . On modern desktops this is
about the same speed or just a little faster than a divide.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/dCuOhSE3iWk" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-89481686217830588592018-10-21T15:14:00.000-07:002018-10-22T11:16:53.657-07:00Arithmetic coder division-free maps
We have now narrowed in on the key issue that forces us to use division in arithmetic coders : the inverse map from arithmetic domain
to CDF domain. Let's now start to look at division free options for that map.
<P>
Often henceforth I will talk about the map after normalizing the ratio to [1,2) by shifting CDFs. That is :
<FONT COLOR=GREEN><PRE>
while cdf_tot*2 < range
cdf's *= 2
same as :
left shift cdf's to put the top bit of CDF aligned with the top bit of range
</PRE></FONT>
Recall that in a typical implementation "range" has its top bit go from position 24-31. cdf_tot = (1<<cdf_shift) we have chosen to be a power
of 2, and cdf_shift is typically in the 11-14 range.
<P>
We now think in terms of a fixed point fraction. cdf_tot after normalizing is "one" , and "range" is in [1,2) , that is :
<FONT COLOR=GREEN><PRE>
one = 1.00000000000000000
range = 1.01001010101100000
</PRE></FONT>
essentially all we're getting at here is that in the scaling from CDF domain to "arithmetic domain", the binary powers of 2 to get us
as close to range as possible are the easy part. We just shift the CDF's up. The fractional part of range in [1,2) is the hard part
that makes us need a multiply/divide.
<P>
And that leads us directly to our first division free map :
<FONT COLOR=GREEN><PRE>
identity map : aka "just shift" map :
r_bits = bsr(range) = position of top bit of range
shift = r_bits - cdf_shift
forward(cdf) = cdf << shift
inverse(code) = code >> shift
</PRE></FONT>
If we look at how this map performs for range in [1,2) , when range is near 1 ("one" in fixed point, that is, range nearly 1<<r_bits),
then this map is exactly right. As range grows towards 2, this map gets worse and worse. Near 2, this map is costing us exactly 1 bit
of coding loss per symbol (because we're only using half of the range that we could).
<P>
(aside : in the ancient days, Rissanen actually used the "identity map")
<P>
The problem with this map is that it is failing to scale up cdf to use all of range when range is over 1. Obviously we can just check for
that :
<FONT COLOR=GREEN><PRE>
"one or 1.5 map" :
If range is in [1,1.5) , scale by 1X
if range is in [1.5,2) , scale by 1.5X
forward(cdf) :
cdf_norm = cdf << shift;
ret = cdf_norm;
if ( range (fixed point) >= 1.5 ) (test 0.1 bit)
ret += cdf_norm >> 1;
inverse(code) :
if ( range (fixed point) >= 1.5 )
ret = (code * 2)/3;
else
ret = code;
</PRE></FONT>
so we are now testing 1 fractional bit of range below the leading bit. This divides the "range" interval into two scaling zones.
We could obviously test further fractional bits of range to divide the range into scaling zones like :
<FONT COLOR=GREEN><PRE>
test 2 fractional bits of range (below top 1) : [ 1X , 1.25X , 1.5X , 1.75X ]
</PRE></FONT>
but that's just the same as doing the normal range coder scaling, but only using the top 3 bits of range (top one + 2 fractional bits).
And that is the same as the recip_arith forward map.
<P>
Note that in the "one or 1.5 map" it may look like we are still doing a divide to get from arithmetic domain back to cdf domain.
But it's divide by the constant 3, which is implemented by the compiler (or us) as a reciprocal multiply.
As we use more (fractional) bits of range, we need various other divides by constants (1/5,1/7,etc.) which are just reciprocal
multiplies. Rather than branching, we can just use a table and take the top bits of range to look them up.
This leads us directly to recip_arith, which we will flush out in the next post.
<P>
In these maps we are always using less than the full range. eg. say we do the "2 fractional bits of range" , if range is in the
[1.25,1.5) fixed point interval, we will use 1.25X to scale from CDF to arithmetic domain. That is the correct scaling when range
is the bottom of [1.25,1.5) but does not use all of range when range is larger.
The reason is we can't do something like use a scaling factor that's the middle of the bucket (1.375), since when range is low
that would give us forward(cdf) > range , which is uncodeable. Therefore we must also use the round-down or truncation of range
to fewer top bits. This approach of using some number of fractional bits of range means that the map never makes use of all of range;
as you add more bits, you can use more and more of range, but you are slowly approaching the limit from below.
<P>
There's an alternative approach due to Stuiver & Moffat ("Piecewise Integer Mapping for Arithmetic Coding"), commonly called SM98.
<P>
The SM98 map says : consider range and CDF normalized, so range is in [1,2) fixed point. If we just scale CDF by 1X everywhere
("identity map" above) we are not using all of range. We can't *ever* scale CDF by 2X uniformly, because range is strictly < 2 , that would
make forward(cdf_tot) exceed range. What we can do is scale CDF by 2X for a portion of the interval, and 1X for the remainder, so that
we use all of range :
<FONT COLOR=GREEN><PRE>
Choose some CDF threshold t such that when we make the map :
cdf < t -> scale by 1X
cdf >= t -> scale by 2X
then we use the whole range in our map, eg. forward(cdf_tot) = range
The map is :
forward(cdf) :
if ( cdf < t ) ret = cdf
else ret = t + (cdf - t)*2 = cdf*2 - t
t = cdf_tot*2 - range
note that cdf_tot is normalized to be equal to range's top bit here,
so 't' is the same as "2 - range" in fixed point
that's the same as the ~ bit inverse of range's fractional bits
forward(cdf) :
if ( cdf < t ) ret = cdf;
else ret = range - (cdf_tot - cdf)*2;
forward(cdf) :
ret = MAX( cdf , range - (cdf_tot - cdf)*2 )
</PRE></FONT>
This final form with the branchless MAX is nicer for implementation, but it's also an alternate way to see the map.
What we're doing is a 1X map for the early cdf's, starting at the left side of the arithmetic range. If we stuck with that map the
whole way, it would not reach the end of range (and thus waste coding quality). We're simultaneously doing a 2X
mapping of the late CDF's, starting at the *right* side of the arithmetic range. If we stuck with that map the whole way, it would
overshoot zero and thus not be codeable. Where those two maps cross, we switch between them, thus using the more generous 2X mapping
as much as possible.
<P>
(the inverse map is done similarly, just with >>1 instead of *2)
<P>
So, the SM98 mapping uses all of range, thus does not introduce coding loss due to failure to use all of range. It does, however,
not assign intervals propertional to the probability.
<P>
When range is near 1, SM98 does the 1X map nearly everywhere, so its scaling is correct. When range is near 2, SM98 does the 2x map nearly
everwhere, so again there is little coding loss. Intervals are proportional to the probability. The problem is when range is in the
middle. Then some symbols will get a 1X scaling, and others will get a 2X scaling, distorting the probabilities, causing coding loss.
<P>
(aside: I did an experiment and confirmed that you can compensate for this somewhat with a skewed probability estimate. SM98 in this
form gives too much code space to later symbols (you can of course choose to reverse that). To compensate you should increase the
frequency more when you see early symbols than when you see later ones, so that the probability estimate is skewed in the opposite way
of the coder. This does in fact cut the coding loss of SM98, roughly in half, from about 1.0% to 0.5%. Note this is totally evil and
I'm not actually recommending this, just writing it down for the record.)
<P>
And I'll finish with a drawing :
<P>
<IMG SRC="https://www.cbloom.com/blogimages/simple_arithcoder_maps_small.jpg">
<P>
Teaser : you can of course combine the ideas of "fractional bits of range" map and the SM98 map. When range is in the interval [1,1.5)
you could use a 1X scaling for low CDF and 1.5X scaling for high CDF; in the [1.5,2) interval use an SM98 threshold to split the CDF's into
intervals with 1.5X and 2X scaling. This was tried by Daala as the "reduced overhead" coder. We will come back to this later.
<P>
Oh, and another connection :
<P>
If you did the "fractional bits of range" scaling thing; first 1 bit giving you a 1.5X zone, then two bits adding
1.25X and 1.75X zones, etc. If you keep doing that all the way down, in a range coder framework you are trying to
compute (range / cdf_tot). That means you need to look at (r_bits - cdf_bits). If you simply keep testing that
number of bits and adding in the forward() map - you will wind up with the full range coder map.
<P>
That process is the Moffat-Neal-Witten DCC95 multiplication free coder. In that context you might want to choose
r_bits = 14 or 15, bit renormalization. cdf_bits = 11 or so. The difference (r_bits - cdf_bits) is your coding
precision (larger = less coding loss), and it's also the number of times you have to test bits of r and possibly
do (ret += cdf>>n) in the forward map.
<P>
ADD :
<P>
I brought up thinking of range normalized to [1,2) as a conceptual aid, but that can also be a good implementation
choice, particularly for coders like SM98 where you spend most of your time doing clz to find the top bit position of
range. Instead of letting range float, like in Michael Schindler range coder 24-31 bits, you instead keep range pegged
at 31 bits. That lets you avoid the clz to find the top bit, at the cost of doing a clz after encoding to find out how
much range has shrunk to normalize it back up.
<P>
Now you might think this requires bit renorm, but it does not. Instead you can still do byte renorm, and keep a count
of the number of spare bits at the *bottom*. So you are still doing 24-31 bit renorm, but the space is at the bottom instead
of the top.
<P>
This implementation style is not shown in recip_arith for clarity, but I figure I better mention everything so Google
can't patent it.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/9lW-3RtY_Mk" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-34484825351868625112018-10-18T10:45:00.000-07:002018-10-19T08:33:54.627-07:00About arithmetic coders and recip_arith in particular
An arithmetic coder encodes a sequence of symbols as an infinite precision number. You divide the range of that number
(I will call this the "arithmetic domain") into successively smaller intervals; each interval corresponds to a symbol.
After encoding a symbol, your current coding interval shrinks to the low/high range of that symbol, and you encode further
symbols within that range. The minimum number of bits required to distinguish the desired sequence from another is the
code length of the sequence.
<P>
If you make each symbol's interval proportional to the probability of that symbol, the arithmetic coder can produce a
code length equal to the entropy. (we're still assuming infinite precision encoding). If your estimated probabilities do
not match the true symbol probabilities (they never do) you have some loss due to modeling. The difference between the
arithmetic coder's output length and the sum of -log2(P) for all model probabilities is the coding loss.
<P>
In order to do arithmetic coding in finite precision, we track the low/high interval of the arithmetic coder. As the top bits
(or bytes) of low & high become equal, we stream them out. This is like "zooming in" on a portion of the infinite number line
where the top & bottom of the interval are in the same binary power of two buckets.
<P>
We must then confront the "underflow problem". That is, sometimes (high - low) ("range") can get very small, but the top bits of
high and low never match, eg. if they happen to straddle a binary power of 2. They can be something like
<FONT COLOR=GREEN><PRE>
high = 1000000000aaaaa..
low = 0111111111bbbbb..
</PRE></FONT>
There are several solutions to the underflow problem. For example CACM87 "bit plus follow" or the MACM / VirtQ approach which you can read about
elsewhere, also the "just force a shrink" method (see links at end).
<P>
The method I use in "recip_arith" rather hides the underflow problem. Rather than checking for the top bits (bytes) of low & high
being equal, we simply zoom in regardless. The renormalization step is :
<FONT COLOR=GREEN><PRE>
while ( ac->range < (1<<24) )
{
*(ac->ptr)++ = (uint8_t)(ac->low>>24);
ac->low <<= 8;
ac->range <<= 8;
}
</PRE></FONT>
low and range are 32 bit here, when range is less than 2^24 the top byte of low & high is the same and can be shifted out, *except* in
the underflow case. In the underflow case, we could have the top byte of low is = FF , and high is actually 100 with an implicit bit above
the 32nd bit (eg. low + range exceeds 2^32). What we do is go ahead and output the FF, then if we later find that we made a mistake
we correct it by propagating a carry into the already sent bits.
<P>
(note that you could do range += FF here for slightly less coding loss, but the difference is small; the actual "high" of our arithmetic interval is 1 above range, range can approach that infinitely
closely from below but never touch it; the coding interval is [low,high) inclusive on the bottom & exclusive on the top. Coders that don't
quite get this right have a lot of +1/-1 adjustments around low & high)
<P>
Renormalization means we can send an infinite length number while only working on a finite precision portion of that number down in the
active range of bits at the bottom. Renormalization also means that "range" is kept large enough to be able to distinguish symbols with
only integer subdivision of the range, which we shall now address. Renormalization in and of itself does not introduce any coding loss; it is
perfectly accurate (though failing to add FF is coding loss, and schemes like the "force shrink" method or "just dont renormalize" method of
fpaq0p do contribute to coding loss).
<P>
The other way we must adapt to finite precision is the division of the interval into ranges proportional to symbol probabilities.
The infinite precision refinement would be (real numbers!) :
<FONT COLOR=GREEN><PRE>
arithmetic_low += CDF_low * arithmetic_range / CDF_sum;
arithmetic_range *= CDF_freq / CDF_sum;
(real numbers, no floor divide)
(CDF = cumulative distribution function, aka cumulative probability, sum of previous symbol frequencies)
CDF_freq = CDF_high - CDF_low for the current symbol ; CDFs in [0,CDF_sum]
</PRE></FONT>
We don't want to do real numbers, so we just approximate them with integer math. But how exactly?
<P>
<B> The crucial distinguishing aspect of an arithmetic coder is how you map the CDF domain to the arithmetic domain </B>
<P>
The CDF domain is controlled by you; you have modeled probabilities somehow. The CDF domain always starts at 0 and goes to CDF_sum,
which is under your control. In the decoder, you must search in the CDF domain to find what symbol is specified. Working in the CDF
domain is easy. In contrast, the arithmetic interval is always changing; "low/range" is being shrunk by coding, and then zoomed in again by
renormalization.
<P>
The forward map takes you from CDF domain to arithmetic domain. Adding on the arithmetic "low" is trivial and we will not include it in the
map. The crucial thing is just scaling by (arithmetic_range / CDF_sum).
<P>
We can now write a very general arithmetic encoder :
<FONT COLOR=GREEN><PRE>
arithmetic_low += forward(CDF_low,arithmetic_range,CDF_sum);
arithmetic_range = forward(CDF_high,arithmetic_range,CDF_sum) - forward(CDF_low,arithmetic_range,CDF_sum);
</PRE></FONT>
our "forward" map will be working on integers. Some properties forward must have :
<FONT COLOR=GREEN><PRE>
forward(x) should be monotonic
forward(x+1) > forward(x) strictly (so that range can never shrink to zero)
this may only be true for arithmetic_range > CDF_sum or some similar constraint
forward(0) >= 0
forward(CDF_sum) <= arithmetic_range
forward map of the CDF end points does not need to hit the end points of range, but it must be within them
(failure to use all of range does contribute to coding loss)
</PRE></FONT>
The coding loss of our approximation is caused by the difference in forward(high) - forward(low) and the ideal scaling
(which should be proportional to range & symbol probability).
<P>
The integer forward map with lowest coding loss is the "CACM87 map" :
<FONT COLOR=GREEN><PRE>
forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;
this is now integers (eg. floor division)
CACM87 has
forward(cdf_sum) = range ; eg. it uses the full range
</PRE></FONT>
coding loss is just due to the floor division not exactly matching the real number divide. (note that you might be tempted to
say, hey add (cdf_sum/2) to get a rounded integer division instead of floor; the exact form here is needed to be able to construct
an inverse map with the right properties, which we will get to later).
<P>
<B> Sketch of full arithmetic coding process </B>
<P>
A quick overview of what the decoder has to do. Most of the decoder just replicates the same work as the encoder; it narrows the
arithmetic interval in exactly the same way. Rather than streaming out bytes in renormalization, the decoder streams them in. The
decoder sees the arithmetic code value that the encoder sent, to some precision ahead. It needs enough bits fetched to be able to
resolve the correct symbol (to tell which CDF bin is selected).
<P>
In implementation, rather than track the low/high arithmetic interval and the arithmetic number within that interval, we instead just track
(arithmetic - low), the offset inside the interval. I call this the "code" in the decoder.
<P>
The decoder needs an extra step that the encoder doesn't do : given the current "code" , figure out what symbol that specifies.
To do that, we have to take the "code" (in the arithmetic interval domain), map it back to CDF domain, then scan the CDF intervals to
find which symbol's bin it falls in.
<P>
To do so requires an "inverse" map (arithmetic domain -> CDF domain), the opposite of the "forward" map (CDF -> arithmetic) we just
introduced.
<P>
A full general purpose (multi-symbol) arithmetic coder is :
<FONT COLOR=GREEN><PRE>
(in integers now)
Encode :
look up CDF of the symbol you want to encode
map CDF interval to range inverval :
lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);
arithmetic_low += lo;
arithmetic_range = hi - lo;
propagate carry in "arithmetic_low" if necessary
renormalize if necessary
Decode :
take current arithmetic "code"
map it back to CDF domain :
target = inverse(arithmetic_code,range,cdf_sum);
find symbol from CDF target such that :
CDF_low <= target < CDF_high
rest proceeds like encoder:
lo = forward(cdf_low,range,cdf_sum);
hi = forward(cdf_high,range,cdf_sum);
arithmetic_code -= lo;
arithmetic_range = hi - lo;
renormalize if necessary
</PRE></FONT>
The encoder needs "forward" twice, the decoder needs "forward" twice plus "inverse" once.
<P>
Naive implementation of forward & inverse both need division, which would mean 2 and 3 divides for encode & decode, respectively.
<P>
<B> The inverse map and when you don't need it </B>
<P>
First of all, why do you need the inverse map, and when do you not need it?
<P>
One common case where you don't need the inverse map at all is binary arithmetic coding. In that case it is common to just do
the forward map and resolve the symbol in arithmetic domain, rather than CDF domain.
<P>
That is :
<FONT COLOR=GREEN><PRE>
binary decoder :
arithetmetic_code is known
map the threshold between bits 0 & 1 to arithmetic domain :
arihmetic_mid = forward(cdf_min,range,cdf_sum);
find bin in arithmetic domain :
symbol = arithetmetic_code >= arithmetic_mid;
lo/hi = { 0 , arithmetic_mid , range }
</PRE></FONT>
(in the binary case we also only need one forward map, not two, since one of the end points is always 0 or range).
<P>
Now, you can do the same technique for small alphabet multi-symbol, for 4 or 8 or 16 symbols (in SIMD vectors); rather than make a CDF target
to look up the symbol, instead take all the symbol CDF's and scale them into arithmetic domain. In practice this means a bunch of calls to
"forward" (ie. a bunch of multiplies) rather than one call to "inverse" (a divide).
<P>
But for full alphabet (ie 8 bit, 256 symbol), you don't want to scale all the CDF's. (exception: Fenwick Tree style descent). Typically
you want a static (or semi-static, defsum) probability model, then you can do the CDF -> symbol lookup using just a table. In that case
we can construct the symbol lookup in CDF domain, we need the map from arithmetic domain back to CDF domain.
<P>
The inverse map must have properties :
<P>
<FONT COLOR=GREEN><PRE>
assume range >= cdf_sum
so the forward map is a stretch , inverse map is a contraction
should invert exactly at the CDF end points :
y = forward(x);
inverse(y) == x
the CDF buckets should map to the lower CDF :
lo = forward(x)
hi = forward(x+1)
(hi > lo , it can be much greater than 1 apart)
inverse( anything in [lo,hi) ) = x
</PRE></FONT>
in hand wavey terms, you need inverse to act like floor division. eg :
<FONT COLOR=GREEN><PRE>
CDF domain [012] -> arithmetic domain [00011122]
</PRE></FONT>
For example, for the CACM87 forward map we used above, the inverse is :
<FONT COLOR=GREEN><PRE>
CACM87
forward(cdf,range,cdf_sum) = ( cdf * range ) / cdf_sum;
inverse(code,range,cdf_sum) = ( code * cdf_sum + cdf_sum-1 ) / range;
(integers, floor division)
</PRE></FONT>
In most general form, both forward & inverse map require division. The forward map is easy to make divide-free, but the inverse map not so.
<P>
<B> Getting rid of division and cdf_sum power of 2 </B>
<P>
We'll now start getting rid of pesky division.
<P>
The first thing we can do, which we will adopt henceforth, is to choose cdf_sum to be a power of 2.
We can choose our static model to normalize the cdf sum to a power of 2, or with adaptive modeling use a scheme
that maintains power of 2 sums. (defsum, constant sum shift, sliding window, etc.)
<FONT COLOR=GREEN><PRE>
cdf_sum = 1<<code><</code>cdf_shift;
CACM87 forward(cdf,range,cdf_sum) = ( cdf * range ) >> cdf_shift;
</PRE></FONT>
So we have eliminated division from the forward map, but it remains in the inverse map. (the problem is that the CDF domain
is our "home" which is under our control, while the arithmetic domain is constantly moving around, stretching and shrinking,
as the "range" interval is modified).
<P>
We're fundamentally stuck needing something like "1 / range" , which is the whole crux of the problem that recip_arith is
trying to attack.
<P>
I think we'll come back to that next time, as we've come far enough.
<P>
While I'm going into these forward/inverse maps lets go ahead and mention the standard "range coder" :
<FONT COLOR=GREEN><PRE>
range coder :
forward(cdf,range,cdf_sum) = cdf * ( range / cdf_sum );
inverse(code,range,cdf_sum) = code / ( range / cdf_sum );
(integer, floor division)
or with power of 2 cdf_sum and in a more C way :
r_norm = range >> cdf_shift;
forward(cdf,range,cdf_sum) = cdf * r_norm;
inverse(code,range,cdf_sum) = code / r_norm;
</PRE></FONT>
where I'm introducing "r_norm" , which is just the ratio between "range" and "cdf_sum" , or the scaling from CDF to arithmetic domain.
<P>
Historically, the "range coder" map was attractive (vs CACM87) because it allowed 32 bit arithmetic state. In the olden days we needed
to do the multiply and stay in 32 bits. In CACM87 you have to do (cdf * range) in the numerator, so each of those was limited to 16 bits.
Because the arithmetic state (code/range) was only 16 bits, you had to do bit renormalization (in order to keep range large enough to do
large CDF sums (actually byte renormalization was done as far back as 1984, in which case CDF sum was capped at 8 bits and only binary
arithmetic coding could be done)). By adopting the "range coder" map, you
could put the arithmetic state in 32 bits and still use just 32 bit registers. That meant byte renormalization was possible.
<P>
So, with modern processors with 64-bit registers there's actually very little advantage to the "range coder" map over the CACM87 map.
<P>
The range coder map has some coding loss. The fundamental reason is that the forward() map scaling is not exactly (Probability * range).
Another way to think of that is that not all of range is used. In the "range coder" :
<FONT COLOR=GREEN><PRE>
forward(cdf_sum) = cdf_sum * r_norm = r_norm << cdf_shift = (range >> cdf_shift) << cdf_shift
forward(cdf_sum) = range with bottom cdf_shift bits turned off
unused region is (range % cdf_sum)
</PRE></FONT>
The coding loss is small in practice (because we ensure that range is much larger than cdf_sum). In typical use, range is 24-31 bits
and cdf_shift is in 11-14 , then the coding loss is on the order of 0.001 bpb. You can make the coding loss of the range coder
arbitrarily small by using larger range (eg. 56-63 bits) with small cdf_sum.
<P>
The "range coder" map is actually much simpler than the CACM87 map. It simply takes each integer step in the CDF domain, and turns
that into a step of "r_norm" in the arithmetic domain. The inverse map then just does the opposite, each run of "r_norm" steps in
the arithmetic domain maps to a single integer step in the CDF domain.
<P>
.. and that's enough background for now.
<P>
The entry page for recip_arith :
<A HREF="https://cbloomrants.blogspot.com/2018/10/10-16-18-multi-symbol-division-free.html"> cbloom rants A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication </A>
<P>
Links :
<P>
<A HREF="http://cbloomrants.blogspot.com/2008/10/10-05-08-5.html"> cbloom rants 10-05-08 Rant on New Arithmetic Coders </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2008/10/10-06-08-followup-on-russian-range-coder.html"> cbloom rants 10-06-08 Followup on the Russian Range Coder </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2008/10/10-07-08-little-more-on-arithmetic.html"> cbloom rants 10-07-08 A little more on arithmetic coding ... </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2008/10/10-10-08-on-art-of-good-arithmetic.html"> cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2010/08/08-16-10-range-coder-revisited-oh-wait.html"> cbloom rants 08-16-10 - Range Coder Revisited .. oh wait, nevermind </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2010/09/09-16-10-modern-arithmetic-coding-from.html"> cbloom rants 09-16-10 - Modern Arithmetic Coding from 1987 </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2008/10/10-08-08.html"> cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places. </A> <BR>
<A HREF="https://cbloomrants.blogspot.com/2010/08/08-11-10-huffman-arithmetic-equivalence.html"> cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence </A> <BR>
<A HREF="https://people.xiph.org/~tterribe/notes/range.html"> On the Overhead of Range Coders </A> <BR>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/KMZEPY8HZTY" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-33279902590671915962018-10-18T10:43:00.003-07:002018-10-22T16:31:48.792-07:00A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication
A standard multi-symbol arithmetic coder necessarily requires a costly divide to map the arithmetic code target back to a cdf
(even when you have chosen the cdf sum to be a power of 2).
<P>
Doing binary arithmetic coding without a divide is trivial.
Previous methods for division free multi-symbol arithmetic coding have been at the cost of low coding precision,
that is coding in more bits than -log2 of the symbol probabilities (for example see DCC95 and SM98).
<P>
code is here : <P>
<A HREF="https://github.com/thecbloom/recip_arith"> recip_arith on github </A>
<P>
recip_arith is public domain.
<P>
Our method uses an approximate map from the cdf interval to the range interval, scaling by only a few top bits of range.
This approximate map introduces coding loss, but that can be made quite small with just 8 bits from range.
<P>
The advantage of this approximate map is that it can be exactly inverted using a table of reciprocal multiplies :
<FONT COLOR=GREEN><PRE>
x / y == (x * recip_table[y]) >> 32
recip_table[y] = ((1<<32) + y-1) / y
</PRE></FONT>
x/y in integers is the floor divide, and recip_table is the ceil reciprocal. (see Alverson, "Integer Division using reciprocals").
The choice of a 32 bit numerator is somewhat arbitrary, but we do want the reciprocals to fit in 32 bit words to minimize the size of recip_table
in L1 cache.
<P>
Crucially because only the top 8 bits of "range" are used in the forward map, then "y" needed in the divide is only 8 bits, so the recip_table can
be quite small. Furthermore, 8 bits + 24 bits of cdf in the numerator "x" can be inverted exactly.
<P>
<B> Coding Efficiency </B>
<P>
The coding loss of "recip_arith" with 8 bit tables is reliably under 0.005 bpb (around 0.1%) , in constrast to SM98 where the coding loss can be 10X
higher (0.05 bpb or 1.0%)
<FONT COLOR=GREEN><PRE>
Calgary Corpus file "news" , len=377,109
cdf_bits = 13 (symbol frequencies sum to 1<<13)
from smallest to largest output :
recip_arith coder (down/up 8-bit) : 244,641 = 5.190 bpb
cacm87: 244,642 = 5.190 bpb
range coder: 244,645 = 5.190 bpb
recip_arith coder (with end of range map) : 244,736 = 5.192 bpb
recip_arith coder: 244,825 = 5.194 bpb
recip_arith coder (down/up 2-bit) (aka Daala "reduced overhead") : 245,488 = 5.208 bpb
SM98 : 245,690 = 5.212 bpb
</PRE></FONT>
(the down/up and "end of range map" variants will be discussed later)
<P>
The crucial step of recip_arith is the way the arithmetic coding interval is shrunk to encode a symbol.
<P>
The standard range coder does :
<FONT COLOR=GREEN><PRE>
uint32_t r_norm = ac->range >> cdf_bits;
ac->low += cdf_low * r_norm;
ac->range = cdf_freq * r_norm;
</PRE></FONT>
while recip_arith does :
<FONT COLOR=GREEN><PRE>
uint32_t range = ac->range;
int range_clz = clz32(range);
uint32_t r_top = range >> (32 - range_clz - RECIP_ARITH_TABLE_BITS);
uint32_t r_norm = r_top << ( 32 - range_clz - RECIP_ARITH_TABLE_BITS - cdf_bits);
ac->low += cdf_low * r_norm;
ac->range = cdf_freq * r_norm;
</PRE></FONT>
where "r_top" is just the highest RECIP_ARITH_TABLE_BITS bits of range, and r_norm is those bits back in their original position, then
shifted down by cdf_bits. In the end the way the interval is shrunk is exactly the same as the range coder, except that all bits below
the top RECIP_ARITH_TABLE_BITS of "r_norm" are turned off.
<P>
<B> Is it useful ? </B>
<P>
It's unclear if there are really killer practical advantages to recip_arith at the moment. On many modern high end CPUs, 32 bit divides are
pretty fast, so if you just take recip_arith and use it to replace a standard range coder you might not see much difference. On CPUs with slow
divides there is a more significant advantage.
<P>
recip_arith provides an interesting continuum that connects very low precision coders like SM98 to the full precision range coder.
The "up/down" variant with a very small reciprocal table (4 bits?) might be interesting for hardware implementations, where division is
not desirable.
<P>
recip_arith works with 64-bit coding state (the standard range coder would require a 64-bit divide, which is often much slower than a 32-bit
divide), which can provide advantages. recip_arith also works with the encoder & decoder not in lock step; eg. you could encode with 32-bit
state and decode with 64-bit state, allowing you independence of the bitstream & implementations, which is very desirable;
not all arithmetic coders have this property.
<P>
I think primarily it is theoretically interesting at the moment, and it remains to be seen if that turns into a compelling practical application.
<P>
<B> What's Next </B>
<P>
I'll be doing a couple of followup posts going into the details of how/why this works. We'll talk about the
theory of arithmetic coders and how to think about them. Then we'll look at the up/down and "map end" variants.
<P>
<B> Index of posts </B>
<P>
<A HREF="https://github.com/thecbloom/recip_arith"> recip_arith on github </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2018/10/about-arithmetic-coders-and-reciparith.html"> 1. cbloom rants About arithmetic coders and recip_arith in particular </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2018/10/arithmetic-coder-division-free-maps.html"> 2. cbloom rants Arithmetic coder division-free maps </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2018/10/working-through-reciparith.html"> 3. cbloom rants Working through recip_arith </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2018/10/reciparith-without-unused-range.html"> 4. cbloom rants recip_arith without unused range </A> <BR>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/JeowZUc-8tg" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-56241507232459233362018-10-13T12:18:00.003-07:002018-10-18T10:45:46.016-07:00If you have had difficulty sending me emailMy email has been extremely flaky over the last few months due to problems at dreamhost (who host cbloom.com).
<P>
If you sent me an email and I did not reply, it's possible I didn't get it. Check your spam for bounced emails, since in my experience many undelivered mail responses incorrectly wind up in spam folders these days.
<P>
I will be changing to a new host (because dreamhost sucks and has consistently failed to fix their email issues). There may be more missed emails in the transition.
<P>
If you want to send me something reliably, please try carrier pigeon or can on string, since technology has only gotten worse since then. (grumble grumble, we all as an industry fucking epically suck and computers are monumentally awful and depressing, grumble grumble)
<P><HR><P>
Transition in progress... looks like I'll be moving cbloom.com web hosting off dreamhost as well. Let me know if you see any problems.
<P><HR><P>
... done. Now on fastmail.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/xUAFTaKlsdI" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-55164929142298914422018-10-01T13:30:00.003-07:002018-10-01T16:40:27.395-07:00Oodle Lossless Image
We have released
<A HREF="http://www.radgametools.com/oodlelimage.htm">Oodle Lossless Image Compression</A> (aka OLI) v1.0
<P>
OLI is built on the blazing fast Oodle Data Compression engine, with a new very efficient image specific front end.
OLI has a very simple native API, and it also has drop in look-alike APIs to replace stb_image or libpng, so it's very easy to integrate.
<P>
OLI gets much more compression than PNG. Its compression ratio is similar to higher compression codecs like webp-ll or FLIF. The big advantage of
OLI is its blazing fast decode speed. OLI decodes way faster than any of the competition (3-10X faster),
so you can compress big images smaller and load them faster.
<P>
(OLI is mostly by Jon Olick who did an awesome job on it)
<P>
OLI is currently for full color lossless images only. It's not for 3d game textures, it doesn't do BCn GPU compressed textures, it doesn't do
things like half-float normal maps. It's for 24 bit RGB and 32 bit RGBA images, the kind of things you would have used PNG for before.
OLI currently has only limited support for low color images (eg. palettized, 1-bit, and gray scale images). It's early days for OLI still and
that support will get better, particularly if we hear from customers who need it.
<P>
For more about OLI visit <A HREF="http://www.radgametools.com/oodlelimage.htm">the RAD Game Tools web site</A>.
<P>
Image sizes on my "good" test set, smallest size for each image in bold :
<P>
<TABLE BORDER=2 CELLSPACING=2>
<TR BGCOLOR=#BFBFBF><TD BGCOLOR=#BFBFBF><B></B></TD> <TD>oli</TD> <TD>webpll</TD> <TD>pngcrush</TD> <TD>png</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>4069138464_193981e0e2_o </TD> <TD><B>3140944</B></TD> <TD>3319222</TD> <TD>4019282</TD> <TD>4025911</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>9xkFD </TD> <TD><B>2524927</B></TD> <TD>2772626</TD> <TD>3850985</TD> <TD>3856541</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>A-Girl-With-Kind-Eyes </TD> <TD><B>1212598</B></TD> <TD>1307518</TD> <TD>1568931</TD> <TD>1620105</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Beach_Hurricane_WaLp_TW2011 </TD> <TD><B>1970283</B></TD> <TD>1997532</TD> <TD>2543245</TD> <TD>2606013</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Beach_Starfish_and_Conch_Shell_Wp_TW </TD> <TD><B>2266789</B></TD> <TD>2299242</TD> <TD>2591019</TD> <TD>2701580</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>CliffSideA01_cm0001 </TD> <TD>3475407</TD> <TD><B>3340540</B></TD> <TD>4302971</TD> <TD>4309175</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>CliffSideA03_cm0001 </TD> <TD>3626390</TD> <TD><B>3560460</B></TD> <TD>4697715</TD> <TD>4704495</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Girl_in_white </TD> <TD><B>1253577</B></TD> <TD>1313662</TD> <TD>1726569</TD> <TD>1729053</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>IMG_0060 </TD> <TD><B>2324810</B></TD> <TD>2601634</TD> <TD>3767787</TD> <TD>3773211</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>IMG_0152 </TD> <TD><B>2166388</B></TD> <TD>2389052</TD> <TD>3668094</TD> <TD>3673386</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>IMG_0155 </TD> <TD><B>1526228</B></TD> <TD>1613594</TD> <TD>2369898</TD> <TD>2373318</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>IMG_0195 </TD> <TD><B>1998851</B></TD> <TD>2115548</TD> <TD>2910248</TD> <TD>2914448</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>IMG_0339 </TD> <TD><B>1815852</B></TD> <TD>2021904</TD> <TD>3153675</TD> <TD>3158211</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>IMG_0341 </TD> <TD><B>1196634</B></TD> <TD>1267144</TD> <TD>1865148</TD> <TD>1867836</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>LOVE1 </TD> <TD><B>1170354</B></TD> <TD>1217124</TD> <TD>1598967</TD> <TD>1601271</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Lily+Allen+smile </TD> <TD><B>1156702</B></TD> <TD>1291656</TD> <TD>1850890</TD> <TD>1853554</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>PDI_1200 </TD> <TD><B>921293</B></TD> <TD>959710</TD> <TD>1260621</TD> <TD>1262433</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>POMSep2007 </TD> <TD><B>1061705</B></TD> <TD>1066320</TD> <TD>1520546</TD> <TD>1522742</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Portal2Win </TD> <TD>2992069</TD> <TD><B>2904416</B></TD> <TD>3695920</TD> <TD>3701248</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Spring_Beauty </TD> <TD><B>1749534</B></TD> <TD>1788812</TD> <TD>2041309</TD> <TD>2118280</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Sunset_Rocky_Beach_Walp_TW </TD> <TD>2700270</TD> <TD><B>2665954</B></TD> <TD>3137206</TD> <TD>3192653</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>ainokura_bibble </TD> <TD>817631</TD> <TD><B>817516</B></TD> <TD>1085462</TD> <TD>1087022</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>artificial </TD> <TD>721090</TD> <TD><B>578668</B></TD> <TD>845572</TD> <TD>846796</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>big_building </TD> <TD><B>3772059</B></TD> <TD>3795954</TD> <TD>4569456</TD> <TD>4579407</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>big_tree </TD> <TD>3476811</TD> <TD><B>3466330</B></TD> <TD>3747685</TD> <TD>3753085</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>bmw </TD> <TD><B>1408411</B></TD> <TD>1429148</TD> <TD>1790859</TD> <TD>1793439</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>bridge </TD> <TD>1401722</TD> <TD><B>1400592</B></TD> <TD>1535838</TD> <TD>1538094</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>c77b75083b34 </TD> <TD><B>2516018</B></TD> <TD>2657536</TD> <TD>3818229</TD> <TD>3823737</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>cathedral </TD> <TD><B>1914743</B></TD> <TD>1926612</TD> <TD>2212063</TD> <TD>2215255</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>church-in-the-prairie </TD> <TD><B>1955457</B></TD> <TD>1984848</TD> <TD>2973585</TD> <TD>2977869</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>cool-orb </TD> <TD>379704</TD> <TD><B>326258</B></TD> <TD>492951</TD> <TD>493671</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>deer </TD> <TD>4753833</TD> <TD><B>4454234</B></TD> <TD>4757131</TD> <TD>4763983</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>flower_foveon </TD> <TD>1701007</TD> <TD><B>1672546</B></TD> <TD>2194479</TD> <TD>2213494</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>hdr </TD> <TD><B>1588182</B></TD> <TD>1610458</TD> <TD>1942456</TD> <TD>2011759</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>highclass_by_DivineError_lossless </TD> <TD>591236</TD> <TD><B>588602</B></TD> <TD>848270</TD> <TD>849494</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>leaves_iso_200 </TD> <TD><B>2537504</B></TD> <TD>2613286</TD> <TD>3120814</TD> <TD>3125314</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>moses </TD> <TD><B>2137994</B></TD> <TD>2183616</TD> <TD>3220571</TD> <TD>3225215</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>mysoup </TD> <TD><B>2473925</B></TD> <TD>2717346</TD> <TD>3631056</TD> <TD>3636300</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>nails2 </TD> <TD><B>2181876</B></TD> <TD>2300032</TD> <TD>3228688</TD> <TD>3233344</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>nightshot_iso_100 </TD> <TD>2023458</TD> <TD><B>1948254</B></TD> <TD>2293315</TD> <TD>2397815</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>phoenix </TD> <TD>110389</TD> <TD><B>59800</B></TD> <TD>83692</TD> <TD>131474</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>pulchra2.1 </TD> <TD><B>2421925</B></TD> <TD>2536570</TD> <TD>3482309</TD> <TD>3487337</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>record_only </TD> <TD><B>789644</B></TD> <TD>790842</TD> <TD>1211780</TD> <TD>1282677</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>school-shoot-moodboard-number-1 </TD> <TD><B>1587878</B></TD> <TD>1638888</TD> <TD>2199389</TD> <TD>2202557</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>space-desktop </TD> <TD>561563</TD> <TD><B>545412</B></TD> <TD>724571</TD> <TD>725615</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>spider_web </TD> <TD><B>2119921</B></TD> <TD>2203508</TD> <TD>2603659</TD> <TD>2715256</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>windmill_evsm </TD> <TD><B>1470104</B></TD> <TD>1487660</TD> <TD>2090287</TD> <TD>2093311</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>yahoo </TD> <TD>1999722</TD> <TD><B>1700342</B></TD> <TD>2710221</TD> <TD>2849650</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD BGCOLOR=#BFBFBF>total</TD> <TD>91665412</TD> <TD>93248528</TD> <TD>121555414</TD> <TD>122618434</TD> </TR>
</TABLE>
<P>
<FONT COLOR=GREEN><PRE>
oli is oli_enc --super
webpll is cwebp -lossless -z 9
pngcrush is default options
png is made by bmp2png
</PRE></FONT>
<P><HR><P>
<div style="background-color:#d0d0d0">
Oodle is an SDK for high performance lossless data compression.
For more about Oodle, or licensing inquiries,
<A HREF="http://www.radgametools.com/oodle.htm">visit the RAD Game Tools web site.</A>
This is my personal blog where I post supplemental material about Oodle.
</div>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/n7xx2XGmWbg" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-72607233862514586492018-08-06T12:36:00.003-07:002018-08-06T12:36:48.303-07:00Oodle 2.7.0 release : Network SDK separation
Oodle 2.7.0 is now out.
<P>
The biggest change is that the Network and Data portions are now in two different SDKs.
You may now use either one on its own, or both together.
If you are a licensee of both Data and Network, you will get two SDKs. The two SDKS can be installed to
the same place, and shared files may overwrite.
<P>
The compressed data bitstreams (for both Network & Data) are not changed from 2.6 to 2.7 ; I've bumped the
major version just because of the large API change of splitting the libs.
<P>
We've also made some optimizations in the LZ decoders; Mermaid, Kraken & Leviathan are all a wee bit faster
to decode.
<P>
Perf comparison to previous versions :
<FONT COLOR=GREEN><PRE>
Oodle 2.7.0 , Kraken 8 :
PD3D : 4.02:1 , 1.0 enc MB/s , 1128.5 dec MB/s
GTS : 2.68:1 , 1.0 enc MB/s , 1324.1 dec MB/s
Silesia : 4.25:1 , 0.6 enc MB/s , 1062.2 dec MB/s
============================
PD3D :
Kraken8 255 : 3.67:1 , 2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5: 3.72:1 , 1.2 enc MB/s , 1079.9 dec MB/s
Kraken8 260 : 4.00:1 , 1.0 enc MB/s , 1034.7 dec MB/s
GTS :
Kraken8 255 : 2.60:1 , 2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5: 2.63:1 , 1.2 enc MB/s , 1343.8 dec MB/s
Kraken8 260 : 2.67:1 , 1.0 enc MB/s , 1282.3 dec MB/s
Silesia :
Kraken8 255 : 4.12:1 , 1.4 enc MB/s , 982.0 dec MB/s
Kraken8 260 -v5: 4.18:1 , 0.6 enc MB/s , 1018.7 dec MB/s
Kraken8 260 : 4.24:1 , 0.6 enc MB/s , 985.4 dec MB/s
</PRE></FONT>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/Q5-K8BCqJU4" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-66250376457708275312018-07-09T11:29:00.002-07:002018-07-09T11:29:21.355-07:00Oodle ozip for command line compression and piped streams
Oodle now ships with "ozip", a command line compressor that acts like gzip (and bzip, and xz).
<P>
ozip can be used as a command line compressor to create or decode Oodle-compressed files; ozip can also be
used to pipe Oodle-compressed streams.
<P>
The intention is that ozip should act similarly to gzip (in terms of command line arguments, but with
more compression and faster decompression) so it can be dropped in to standard work flows that use gzip-like
compressors.
For example ozip can be used with tar "I" ("--use-compress-program") to pipe the tar package through ozip.
<P>
ozip works with pipes (particularly useful on Unix), so it can be used to pipe compressed data. (eg. with
things like "zfs send" to a pipe).
<P>
A pre-compiled ozip binary is now distributed with the Oodle SDK.
<P>
You can also get and modify the ozip source code on github :
<P>
<A HREF="https://github.com/jamesbloom/ozip/"> ozip on github </A>
<P>
To build ozip from source code, you need the
<A HREF="http://www.radgametools.com/oodle.htm"> Oodle SDK </A>. (ozip is open source and public domain,
Oodle is not).
<P>
If you have corrections to ozip, we're happy to take pull requests, particularly wrst making sure we act like
gzip and it's easy to drop in ozip to gzip-like work flows on Unix.
When writing ozip, we found that gzip/bzip/xz don't have the exact same command line argument handling, and yet
are treated as interchangeable by various programs. We tried to replicate the common intersection of their
behavior.
<P>
ozip is a single file compressor, not a package archiver. It does not store file names or metadata.
<P>
ozip was written by my brother, James Bloom.
<P><HR><P>
If you have server workflows that involve streaming compressed data, or think you could benefit from Oodle,
we'd be happy to hear from you. We're still evolving our solutions in this space.
<P>
If you are compressing very large files (not piped streams), you can get much higher performance by using
threads and asynchronous IO to overlap IO with compression CPU time. If this is important to you, ask about the "oozi" example in Oodle for
reference on how to do that.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/L_-Iu6Un868" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-90282965967777469432018-06-07T11:43:00.000-07:002018-06-07T12:01:09.909-07:00New in Oodle 2.6.3 : HyperFast Encode Speeds
Oodle 2.6.3 now has faster encode levels ("hyperfast"), for uses where encode speed is crucial.
<P>
Previously the fastest Oodle encode level was "SuperFast" (level 1). The new "HyperFast" levels are below that
(level -1 to -4). The HyperFast levels sacrifice some compression ratio to maximize encode speed.
<P>
An example of the performance of the new levels (on lzt99, x64, Core i7-3770) :
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=500x450&chbh=15,2,10&chxt=x,y&chco=A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7&chxr=1,0,700,100&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Kraken-z-3|Kraken-z-2|Kraken-z-1|Kraken-z1|Kraken-z2|Kraken-z3|Kraken-z4|Mermaid-z-3|Mermaid-z-2|Mermaid-z-1|Mermaid-z1|Mermaid-z2|Mermaid-z3|Mermaid-z4|Selkie-z-3|Selkie-z-2|Selkie-z-1|Selkie-z1|Selkie-z2|Selkie-z3|Selkie-z4&chd=e:mH,ee,Zn,PR,LA,Dj,CS,oI,gW,ZW,X5,Pz,I4,Dv,5Y,qp,h2,fJ,Su,J0,D.&chxl=0:|lzt99&chtt=encode+MB/s">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=500x450&chbh=15,2,10&chxt=x,y&chco=A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,A92FFF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,FF00FF,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7,B097F7&chxr=1,0,3,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=Kraken-z-3|Kraken-z-2|Kraken-z-1|Kraken-z1|Kraken-z2|Kraken-z3|Kraken-z4|Mermaid-z-3|Mermaid-z-2|Mermaid-z-1|Mermaid-z1|Mermaid-z2|Mermaid-z3|Mermaid-z4|Selkie-z-3|Selkie-z-2|Selkie-z-1|Selkie-z1|Selkie-z2|Selkie-z3|Selkie-z4&chd=e:kg,oC,s2,wY,xf,y-,z7,ja,mQ,q5,ri,tM,uy,vF,e3,gj,jy,kO,lS,nH,nv&chxl=0:|lzt99&chtt=compression+ratio">
</TD></TR></TABLE>
Higher CompressionLevels are to the right in the bar charts above; they get higher compression ratios at the cost of lower
encode speed. Charts show three HyperFast levels (-1 to -3) and 4 normal levels (1 to 4).
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=750x380&chxt=x,y&chxr=0,4,10,1|1,0,2,1&chd=e:yKuusDkFfDNpG3y9vpr5q-knbwOc4dz5wWvDnOdSPc,YzdEiTlzm2oOpEXZa9gQg8ipkQkjRDThX5YcZyb.cu&chdl=Kraken-z-3|Kraken-z-2|Kraken-z-1|Kraken-z1|Kraken-z2|Kraken-z3|Kraken-z4|Mermaid-z-3|Mermaid-z-2|Mermaid-z-1|Mermaid-z1|Mermaid-z2|Mermaid-z3|Mermaid-z4|Selkie-z-3|Selkie-z-2|Selkie-z-1|Selkie-z1|Selkie-z2|Selkie-z3|Selkie-z4&chdlp=r&chco=A92FFF|A92FFF|A92FFF|A92FFF|A92FFF|A92FFF|A92FFF|FF00FF|FF00FF|FF00FF|FF00FF|FF00FF|FF00FF|FF00FF|B097F7|B097F7|B097F7|B097F7|B097F7|B097F7|B097F7&chm=s,000000,0,0,0|o,A92FFF,0,0,10|o,A92FFF,0,1,10|o,A92FFF,0,2,10|o,A92FFF,0,3,10|o,A92FFF,0,4,10|o,A92FFF,0,5,10|o,A92FFF,0,6,10|o,FF00FF,0,7,10|o,FF00FF,0,8,10|o,FF00FF,0,9,10|o,FF00FF,0,10,10|o,FF00FF,0,11,10|o,FF00FF,0,12,10|o,FF00FF,0,13,10|o,B097F7,0,14,10|o,B097F7,0,15,10|o,B097F7,0,16,10|o,B097F7,0,17,10|o,B097F7,0,18,10|o,B097F7,0,19,10|o,B097F7,0,20,10&chtt=loglog+ratio+vs+encode+MB/s&chxl=0:|16|32|64|128|256|512|1024|1:|1|2|4|">
<P>
In the loglog plot, up = higher compression ratio, right = faster encode.
<P>
<FONT COLOR=GREEN><PRE>
lzt99 : Kraken-z-3 : 1.711 to 1 : 416.89 MB/s
lzt99 : Kraken-z-2 : 1.877 to 1 : 333.28 MB/s
lzt99 : Kraken-z-1 : 2.103 to 1 : 280.09 MB/s
lzt99 : Kraken-z1 : 2.268 to 1 : 167.01 MB/s
lzt99 : Kraken-z2 : 2.320 to 1 : 120.39 MB/s
lzt99 : Kraken-z3 : 2.390 to 1 : 38.85 MB/s
lzt99 : Kraken-z4 : 2.434 to 1 : 24.98 MB/s
lzt99 : Mermaid-z-3 : 1.660 to 1 : 438.89 MB/s
lzt99 : Mermaid-z-2 : 1.793 to 1 : 353.82 MB/s
lzt99 : Mermaid-z-1 : 2.011 to 1 : 277.35 MB/s
lzt99 : Mermaid-z1 : 2.041 to 1 : 261.38 MB/s
lzt99 : Mermaid-z2 : 2.118 to 1 : 172.77 MB/s
lzt99 : Mermaid-z3 : 2.194 to 1 : 97.11 MB/s
lzt99 : Mermaid-z4 : 2.207 to 1 : 40.88 MB/s
lzt99 : Selkie-z-3 : 1.447 to 1 : 627.76 MB/s
lzt99 : Selkie-z-2 : 1.526 to 1 : 466.57 MB/s
lzt99 : Selkie-z-1 : 1.678 to 1 : 370.34 MB/s
lzt99 : Selkie-z1 : 1.698 to 1 : 340.68 MB/s
lzt99 : Selkie-z2 : 1.748 to 1 : 204.76 MB/s
lzt99 : Selkie-z3 : 1.833 to 1 : 107.29 MB/s
lzt99 : Selkie-z4 : 1.863 to 1 : 43.65 MB/s
</PRE></FONT>
<P>
A quick guide to the Oodle CompressionLevels :
<FONT COLOR=GREEN><PRE>
-4 to -1 : HyperFast levels
when you want maximum encode speed
these sacrifice compression ratio for encode time
0 : no compression (memcpy pass through)
1 to 4 : SuperFast, VeryFast, Fast, Normal
these are the "normal" compression levels
encode times are ballpark comparable to zlib
5 to 8 : optimal levels
increasing compression ratio & encode time
levels above 6 can be slow to encode
these are useful for distribution, when you want the best possible bitstream
</PRE></FONT>
Note that the CompressionLevel is a dial for encode speed vs. compression ratio. It does not have a
consistent correlation to decode speed. That is, all of these compression levels get roughly the same
excellent decode speed.
<P>
Comparing to Oodle 2.6.0 on Silesia :
<FONT COLOR=GREEN><PRE>
Oodle 2.6.0 :
Kraken 1 "SuperFast" : 3.12:1 , 147.2 enc MB/s , 920.9 dec MB/s
Kraken 2 "VeryFast" : 3.26:1 , 107.8 enc MB/s , 945.0 dec MB/s
Kraken 3 "Fast" : 3.50:1 , 47.1 enc MB/s , 1043.3 dec MB/s
Oodle 2.6.3 :
Kraken -2 "HyperFast2" : 2.92:1 , 300.4 enc MB/s , 1092.5 dec MB/s
Kraken -1 "HyperFast1" : 3.08:1 , 231.3 enc MB/s , 996.2 dec MB/s
Kraken 1 "SuperFast" : 3.29:1 , 164.6 enc MB/s , 885.0 dec MB/s
Kraken 2 "VeryFast" : 3.40:1 , 109.5 enc MB/s , 967.3 dec MB/s
Kraken 3 "Fast" : 3.61:1 , 45.8 enc MB/s , 987.5 dec MB/s
</PRE></FONT>
Note that in Oodle 2.6.3 the normal levels (1-3) have also improved (much higher compression ratios).
<P><HR><P>
<div style="background-color:#d0d0d0">
Oodle is an SDK for high performance lossless data compression.
For more about Oodle, or licensing inquiries,
<A HREF="http://www.radgametools.com/oodle.htm">visit the RAD Game Tools web site.</A>
This is my personal blog where I post supplemental material about Oodle.
</div>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/x946RqjDz3I" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-4493071136673302562018-06-06T10:45:00.004-07:002018-06-06T10:45:46.070-07:00The Perils of Holistic Profiling
I have found that many evaluators of Oodle are now trying to do timing of their entire process. That is, profiling the compressor by
measuring the effect on total load time, or by profiling an entire game frame time (as opposed to profiling just the
compression operation, perhaps in situ or perhaps in a test bench).
<P>
I believe what's happened is that many people have read about the dangerous of artificial benchmarks. (for example there are some famous
papers on the perils of profiling malloc with synthetic use loads, or how profiling threading primitives in isolation is pretty useless).
<P>
While those warnings do raise important issues, the right response is not to switch to timing whole operations.
<P>
For example, while timing mallocs with bad synthetic data loads is not useful (and perhaps even harmful), similarly timing an entire application run to determine
whether a malloc is better or not can also be misleading.
<P>
Basically I think the wrong lesson has been learned and people are over simplifying. They have taken one bad practice (time operations
by running them in a synthetic test bench over and over), and replaced it with another bad practice (time the whole application).
<P>
The reality of profiling is far more complex and difficult. There is no one right answer. There is not a simple prescription of how
to do it. Like any scientific measurement of a complex dynamic system, it requires care and study. It requires looking at the specific
situation and coming up with the right measurement process. It requires secondary measurements to validate your primary measurements,
to make sure you are testing what you think you are.
<P>
Now, one of the appealing things of whole-process timing is that in one very specific case, it is the right thing to do.
<P>
IF the thing you care about is whole-process time, and the process is always run the same way, and you do timing on the system that the
process is run on, and in the same application state and environment, AND crucially this you are only allowed to make one change to the
process - then whole process timing is right.
<P>
Let's first talk about the last issue, which is the "single change" problem.
<P>
Quite often a good change can appear to do nothing (or even be negative) for whole process time on its own. By looking at just
the whole process time to evaluate the change, you miss a very positive step. Only if another step is taken will the value of that
first step be shown.
<P>
A common case of this is if your process has other limiting factors that need to be fixed.
<P>
For example on the macroscopic level, if your game is totally GPU bound, then anything you do to CPU time will not show up at all
if you are only measuring whole frame time. So you might profile a CPU optimization and see no benefit to frame time. You can miss
big improvements this way, because they will only show up if you also fix what's causing the process to be GPU bound.
<P>
Similarly at a more microscopic level, it's common to have a major limiting factor in a sequence of code. For example you might have a
memory read that typically misses cache, or an unpredictable branch. Any improvements you make to the arithmetic instructions in that
area may be invisible, because the processor winds up stalling on a very slow cache line fill from memory. If you are timing your
optimization work "in situ" to be "realistic" you can completely miss good changes because they are hidden by other bad code.
<P>
Another common example, maybe you convert some scalar code to SIMD. You think it should be faster, but you time it in app
and it doesn't seem to be. Maybe you're bound in other places. Maybe you're suffering added latency from round-tripping from
scalar to SIMD back to scalar. Maybe your data needs to be reformatted to be stored in SIMD friendly ways. Maybe the surrounding
code needs to be also converted to SIMD so that they can hand off more smoothly. There may in fact be a big win there that you
aren't seeing.
<P>
This is a general problem that greedy optimization and trying to look at steps one by one can be very misleading when measuring
whole process time. Sometimes taking individual steps is better evaluated by measuring just those steps in isolation, because
using whole process time obscures them. Sometimes you have to try a step that you believe to be good even if it doesn't show up
in measurements, and see if taking more steps will provide a non-greedy multi-step improvement.
<P>
<B> Particular perils of IO timing </B>
<P>
A very common problem that I see is trying to measure data loading performance, including IO timing, which is
fraught with pitfalls.
<P>
If you're doing repeated timings, then you'll be loading data that is already in the system disk cache, so your IO speed may
just look like RAM speed. Is what's important to you cold cache timing (user's first run), or hot cache time? Or both?
<P>
Obviously there is a wide range of disk speeds, from very slow hard disks (as on consoles) in the 20 MB/s range up to SSD's and NVMe
in the GB/s range. Which are you timing on? Which will your user have? Whether you have slow seeks or not can be a huge factor.
<P>
Timing on consoles with disk simulators (or worse : host FS) is particularly problematic and may not reflect real world performance at all.
<P>
The previously mentioned issue of high latency problems hiding good changes is very common. For example doing lots of small IO calls
creates long idle times that can hide other good changes.
<P>
Are you timing on a disk that's fragmented, or nearly full? Has your SSD been through lots of write cycles already or does it need
rebalancing? Are you timing when other processes are running hitting the disk as well?
<P>
Basically it's almost impossible to accurately recreate the environment that the user will experience. And the variation is not small,
it can be absolutely massive. A 1 byte read could take anything between 1 nanosecond (eg. data already in disk cache) to 100 milliseconds
(slow HD seek + other processes hitting disk).
<P>
Because of the uncertainty of IO timing, I just don't do it.
I use a simulated "disk speed" and just set :
<FONT COLOR=GREEN><PRE>
disk time = data size / simulated disk speed
</PRE></FONT>
Then the question is, well if it's so uncertain, what simulated disk speed do you use? The answer is : all of them. You cannot
say what disk speed the user will experience, there's a huge range, so you need to look at performance over a spectrum of disk speeds.
<P>
I do this by making a plot of what the total time for (load + decomp) is over a range of simulated disk speeds. Then I can examine
how the performance is affected over a range of possible client systems, without trying to guess the exact disk speed of the client
runtime environment.
For more on this, see :
<A HREF="http://cbloomrants.blogspot.com/2015/03/03-02-15-oodle-lz-pareto-frontier.html"> Oodle LZ Pareto Frontier </A> or
<A HREF="http://cbloomrants.blogspot.com/2016/04/oodle-kraken-pareto-frontier.html"> Oodle Kraken Pareto Frontier </A>.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/Qlax6YcYE0w" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-8813145188977981102018-06-06T10:45:00.001-07:002018-06-06T10:45:12.489-07:00ZStd is faster than Leviathan
ZStd is faster than Leviathan on some files ; well, no, it's not that simple.
<P>
This is another post about careful measurement, how to compare compressors, and about the unique way Oodle works.
<P>
(usual caveat: I don't mean to pick on ZStd here; I use it as a reference point because it is excellent, the closest thing to Oodle, and
something we are often compared against. ZStd timing is done with lzbench; times are on x64 Core i7-3770)
<P>
There are two files in my "gametestset" where ZStd appears to be significantly faster to
decode than Leviathan :
<FONT COLOR=GREEN><PRE>
e.dds :
zstd 1.3.3 -22 3.32 MB/s 626 MB/s 403413 38.47%
ooLeviathan8 : 1,048,704 -> 355,045 = 2.708 bpb = 2.954 to 1
decode : 1.928 millis, 6.26 c/b, rate= 544.03 MB/s
Transistor_AudenFMOD_Ambience.bank :
zstd 1.3.3 -22 5.71 MB/s 4257 MB/s 16281301 84.18%
ooLeviathan8 : 19,341,802 ->16,178,303 = 6.692 bpb = 1.196 to 1
decode : 8.519 millis, 1.50 c/b, rate= 2270.48 MB/s
</PRE></FONT>
Whoah! ZStd is a lot faster to decode than Leviathan on these files, right? (626 MB/s vs 544.03 MB/s and 4257 MB/s vs 2270.48 MB/s)
<P>
No, it's not that simple. Compressor performance is a two axis value of
{space,speed}.
It's a 2d vector, not a scalar. You can't simply take one component of the vector
and just compare speeds at unequal compression.
<P>
All compressors are able to hit a range of {space,speed} points by making
different decisions. For example with ZStd at level 22 you could forbid length 3 matches
and that would bias it more towards decode speed and lower compression ratio.
<P>
Oodle is unique in being fundamentally built as a space-speed optimization process.
The Oodle encoders can make bit streams that cover a range of compression ratios and decode speeds,
depending on what the client asks it to prioritize.
<P>
Compressor performance is determined by two things : the fundamental algorithm, and the current settings.
The settings will allow you to dial the 2d performance data point to different places. The algorithm places a
limit on where those data points can be - it defines a Pareto Frontier. This Pareto curve is a fundamental
aspect of the algorithm.
<P>
There is no such thing as "what is the speed of ZStd?". It depends how you have dialed the settings to reach
different performance data points. The speed is not a fundamental aspect of the algorithm. The Pareto frontier
*is* a fundamental aspect, the limit on where those 2d data points can reach.
<P>
One way to compare compression algorithms (as opposed to their current settings) is to plot many points of their
2d performance at different settings, and then inspect how the curves lie. One curve might strictly cover the other,
then that algorithm is always better. Or, they might cross at some point, which means each algorithm is best in a
different performance domain.
<P>
Another way to compare compression algorithms is to dial them to find points where one axis is equal (either they
decode at the same speed, or they have the same compression ratio), then you can do a simple 1d comparison of the other value.
You can also try to find points where one compressor is strictly better on both axes. The inconclusive situations is
when one compressor is better on one axis, and the other is better on the other axis.
<P>
(note I have been talking about compressor performance as the 2d vector of {decode speed,ratio} , but of course you could
also consider encode time, memory use, other factors, and then you might choose other axes, or have a 3d or 4d value to compare.
The same principles apply.)
<P>
(there is another way to compare 2d compressor performance with 1d scalar; at RAD we internally use
<A HREF="http://cbloomrants.blogspot.com/2018/01/05-17-16-weissman-score.html"> the corrected Weissman score </A>. One of
the reasons we use the 1d Weissman score is because sometimes we make an improvement to a compressor and one of the axes gets worse.
That is, we do some work, and then measure, and we see compression ratio went down. Oh no, WTF! But actually decode speed went up.
From the 2d performance vector it can be hard to tell if you made an improvement or not, the 1d scalar Weissman score makes that easier.)
<P>
<B> Oodle is an optimizing compiler </B>
<P>
Oodle is fundamentally different than other compressors. There is no "Oodle has X performance". Oodle has whatever performance you
ask it to have (and the compressed size will vary along the Pareto frontier).
<P>
Perhaps an analogy that people are more familiar with is an optimizing compiler.
<P>
The Oodle decoder is a virtual machine that runs a "program" to create an output. The compressed data is the program of commands that
run in the Oodle interpreter.
<P>
The Oodle encoder is a compiler that makes the program to run on that machine (the Oodle decoder). The Oodle compiler tries to create
the most optimal program it can, by considering different instruction sequences that can create the same output. Those different sequences
may have different sizes and speeds. Oodle chooses them based on how the user has specified to consider the value of time vs. size.
(this is a bit like telling your optimizing compiler to optimize for size vs. optimize for speed, but Oodle is much more fine grained).
<P>
For example at the microscopic level, Oodle might consider a sequence of 6 bytes. This can be sent as 6 literals, or a pair of length 3
matches, or a literal + a len 4 rep match + another literal. Each possibility is considered and the cost is measured for size & decode time.
At the microscopic level Oodle considers different encodings of the command sequences, whether to send somethings uncompressed or with
different entropy coders, and different bit packings.
<P>
<B> Oodle is a market trader </B>
<P>
Unlike any other lossless compressor, Oodle makes these decisions based on a cost model.
<P>
It has been standard for a long time to make space vs. speed decisions in lossless compressors, but it has in the past always been done
with hacky ad-hoc methods. For example, it's common to say something like "if the compressed size is only 1% less than the uncompressed
size, then just send it uncompressed".
<P>
Oodle does not do that. Oodle considers its compression savings (bytes under the uncompressed size) to be "money". It can spend that
money to get decode time. Oodle plays the market, it looks for the best price to spend its money (size savings) to get the maximum gain
of decode time.
<P>
Oodle does not make ad-hoc decisions to trade speed for size, it makes an effort to get the best possible value for you when you trade
size for speed. (it is of course not truly optimal because it uses heuristics and limits the search, since trying all possible encodings
would be intractable).
<P>
Because of this, it's easy to dial Oodle to different performance points to find more fundamental comparisons with other compressors.
(see, for example : <A HREF="http://cbloomrants.blogspot.com/2017/09/oodle-tuneability-with-space-speed.html"> Oodle tuneability with space-speed tradeoff </A> )
<P>
Note that traditional ad-hoc compressors (like ZStd and everyone else) make mistakes in their space-speed decisions. They do not allocate
time savings to the best possible files. This is an inevitable consequence of having simple thresholds in decision making (and this flaw is
what led us to do a true cost model). That is, Leviathan decode speed is usually, say, 30% faster than ZStd. On some files that ratio goes
way up or way down. When that happens, it is often because ZStd is making a mistake. That is, it's not paying the right price to trade
size for speed.
<P>
Of course this relies on you telling Oodle the truth about whether you want decode speed or size. Since Oodle is aggressively trading the
market, you must tell it the way you value speed vs. size. If you use Leviathan at default settings, Oodle thinks your main concern is
size, not decode speed. If you actually care more about decode speed, you need to adjust the price (with "spaceSpeedTradeoffBytes") or possible
switch to another compressor (Kraken, Mermaid, or let <A HREF="http://cbloomrants.blogspot.com/2018/02/oodle-260-hydra-and-space-speed.html"> Hydra </A> switch for you).
<P>
<B> Back to the files where ZStd is faster </B>
<P>
Armed with our new knowledge, let's revist those two files :
<P>
<FONT COLOR=GREEN><PRE>
e.dds : zstd 1.3.3 -22 : 2.600 to 1 : 625.72 MB/s
e.dds : Leviathan -8 : 2.954 to 1 : 544.03 MB/s
Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4257 MB/s
Transistor_AudenFMOD_Ambience.bank : Leviathan -8 : 1.196 to 1 : 2270.48 MB/s
</PRE></FONT>
Is ZStd faster on these files? At this point we don't know. These are inconclusive data points. In both cases, Leviathan has more
compression, but ZStd has more speed - the victor on each axis differs and we can't easily say which is really doing better.
<P>
To get a simpler comparison we can dial Leviathan to different performance points using Oodle's "spaceSpeedTradeoffBytes" parameter,
which sets the relative cost of time vs size in Oodle's decisions.
<P>
That is, in both cases Oodle has size savings to spend. It can spend those size savings to get more decode speed.
<P>
On e.dds, let's take Leviathan and dial spaceSpeedTradeoffBytes up from the default of 256 in powers of two to favor decode speed more :
<FONT COLOR=GREEN><PRE>
e.dds : zstd 1.3.3 -22 : 2.600 to 1 : 625.72 MB/s
e.dds : Leviathan 1 : 3.020 to 1 : 448.30 MB/s
e.dds : Leviathan 256 : 2.954 to 1 : 544.03 MB/s
e.dds : Leviathan 512 : 2.938 to 1 : 577.23 MB/s
e.dds : Leviathan 1024 : 2.866 to 1 : 826.15 MB/s
e.dds : Leviathan 2048 : 2.831 to 1 : 886.42 MB/s
</PRE></FONT>
What is the speed of Leviathan? There is no one speed of Leviathan. It can go from 448 MB/s to 886 MB/s depending on what you tell the
encoder you want. The fundamental aspect is what compression ratio can be achieved at each decode speed.
<P>
We can see that ZStd is not fundamentally faster on this file; in fact Leviathan can get much more decode speed AND compression ratio at
spaceSpeedTradeoffBytes = 1024 or 2048.
<P>
Similarly on Transistor_AudenFMOD_Ambience.bank :
<FONT COLOR=GREEN><PRE>
Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256 : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512 : 1.193 to 1 : 3701.30 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 1024 : 1.190 to 1 : 4738.83 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 2048 : 1.187 to 1 : 6193.92 MB/s
zstd 1.3.3 -22 5.71 MB/s 4257 MB/s 16281301 84.18 Transistor_AudenFMOD_Ambience.bank
Leviathan spaceSpeedTradeoffBytes = 2048
ooLeviathan8 : 19,341,802 ->16,290,106 = 6.738 bpb = 1.187 to 1
decode : 3.123 millis, 0.55 c/b, rate= 6193.92 MB/s
</PRE></FONT>
In this case we can dial Leviathan to get very nearly the same compressed size, and then just compare speeds
(4275.38 MB/s vs 6193.92 MB/s).
<P>
Again ZStd is not actually faster than Leviathan here. If you looked at Leviathan's default setting encode
(2270.48 MB/s) you were not seeing ZStd being faster to decode. What you are seeing is that you told Leviathan to
choose an encoding that favors size over decode speed.
<P>
It doesn't make sense to tell Oodle to make a very small file, and then just compare decode speeds. That's like
telling your waiter that you want the cheapest bottle of wine, then complaining that it doesn't taste as good as the $100
bottle. You specifically asked me to optimize for the opposite goal!
<P>
(note that in the Transistor bank case, it looks like Oodle is paying a bad price to get a tiny compression savings;
going from 6000 MB/s to 2000 MB/s seems like a lot. In fact that is a small time difference, while 1.187 to 1.196 ratio
is actually a big size savings. The problem here is that ratio & speed are inverted measured of what we are really optimizing,
which is time and size. Internally we always look at bpb (bits per byte) and cpb (clocks per byte) when measuring
performance.)
<P>
e.dds charts : <P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=3782FF,882787,882787,882787,882787&chxr=1,0,900,100&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=zstd 1.3.3 -22|Leviathan256|Leviathan512|Leviathan1024|Leviathan2048&chd=e:sf,mr,pC,6v,.B&chxl=0:|e.dds&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,5,10&chxt=x,y&chco=3782FF,882787,882787,882787,882787&chxr=1,0,3,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=zstd 1.3.3 -22|Leviathan256|Leviathan512|Leviathan1024|Leviathan2048&chd=e:3c,.A,-r,9I,8Z&chxl=0:|e.dds&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=s&chs=800x360&chxt=x,y&chxr=0,9,10,1|1,0,2,1&chd=e:ShFnLEsLyr,sGx.xwwmwC&chdl=zstd 1.3.3 -22|Leviathan256|Leviathan512|Leviathan1024|Leviathan2048&chdlp=r&chco=3782FF|882787|882787|882787|882787&chm=s,000000,0,0,0|o,3782FF,0,0,10|o,882787,0,1,10|o,882787,0,2,10|o,882787,0,3,10|o,882787,0,4,10&chtt=loglog+ratio+vs+MB/s&chxl=0:|512|1024|1:|1|2|4|">
<P>
<P>
See also :
<A HREF="http://cbloomrants.blogspot.com/2018/01/the-natural-lambda.html"> The Natural Lambda </A>
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/LPLq_QROCRY" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-33427740150893649092018-05-07T08:28:00.002-07:002018-05-07T08:28:30.164-07:00Visualizing Probability Update Schemes Part 2
A little bonus so we can look at the weird properties of the geometric / PAQ mixer.
<P>
Recall from the tour of mixing :
<P>
Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference)
domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't
sum to one, and they can be negative.
<P>
In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them.
Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]).
<P>
For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but
it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It
can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor
that says - yes I like your prediction, but take it farther.
<P>
The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent
optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply
amplification or suppression of the codelen difference.
<P>
Let's see this in a very simple case : just one expert.
<P>
The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but
it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate"
to exaggerate the effect (1000X faster than typical).
<P>
Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's
to make the effect more obvious.
<P>
The underlying expert adapts slowly : (P(1) in green, codelen difference in blue)
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=e:g.h9i6j0ksljmYnLn8ospbqIq0resHsvtVt7ufvCvkwFwlxDxhx-yay1zQzpyCweu9tfsEqspWoEm0lm&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=t:0.090,0.178,0.263,0.346,0.427,0.506,0.583,0.658,0.732,0.805,0.877,0.947,1.016,1.084,1.151,1.217,1.282,1.346,1.409,1.472,1.534,1.595,1.656,1.716,1.775,1.834,1.893,1.951,2.008,2.065,1.842,1.643,1.463,1.298,1.145,1.003,0.869,0.744,0.624,0.511&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0">
<P>
Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start.
Codelen difference is still positive (L(0) > L(1)).
<P>
With the PAQ mixer applied to just a single expert : <P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=paq mixed&chd=e:g.iJjjlYnxqYtfwbzX1.4T547X8U9H9s-M-i-1.C.Q.Z.g.m.r.v.x.0.1.3vZJoH1HbHjIPJQKlMFOM&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=paq mixed&chd=t:0.090,0.195,0.323,0.491,0.716,0.972,1.298,1.638,2.025,2.433,2.873,3.244,3.682,4.038,4.410,4.748,5.113,5.428,5.764,6.048,6.429,6.739,7.035,7.347,7.671,7.994,8.188,8.537,8.675,8.997,1.514,-2.498,-2.843,-2.930,-2.903,-2.760,-2.565,-2.336,-2.104,-1.812&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=ccaa00&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=weight&chd=t:1.000,1.074,1.216,1.415,1.661,1.935,2.225,2.512,2.783,3.028,3.242,3.422,3.577,3.703,3.810,3.898,3.973,4.034,4.086,4.129,4.166,4.196,4.221,4.243,4.261,4.276,4.288,4.299,4.308,4.316,0.814,-1.525,-1.948,-2.251,-2.508,-2.739,-2.960,-3.175,-3.383,-3.583&chxl=0:|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0">
<P>
In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning
that the underlying expert has P(1) on the right side, so our weight should be positive, but it's P(1) is way too
low, so we're scaling up the codelen difference by 4X.
<P>
In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*.
P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50%
<P>
Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower
(otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding).
In practice the weight converge slowly to an ideal and stay there for long periods of time.
<P>
But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X).
<P>
For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the
longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics
but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P fo
the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because
it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can
learn that no, on this particular file the deterministic model is in fact better than that, so I like you and
amplify you even by a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic
expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as
predictable as you think).<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/FebRvhiX4RE" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-20584258300073191062018-05-06T16:54:00.000-07:002018-05-06T16:54:19.822-07:00Visualizing Probability Update Schemes
Unlike the last "visualizing" post, I don't have a real clear point to this one. This is more like "let's look at some pretty pictures".
<P>
I do think it helps to get intuition by actually looking at charts & graphs, rather than just always look at the end result score for
something like compression.
<P>
We're going to look at binary probability estimation schemes.
<P>
Binary probability estimation is just filtering the history of bits seen in some way.
<P>
Each bit seen is a digital signal of value 0 or 1. You just want some kind of smoothing of that signal. Boom, that's your probability estimate,
P(1). All this nonsense about Poisson and Bernoulli processes and blah blah, what a load of crap. It's just filtering.
<P>
For example, the "Laplace" estimator
<FONT COLOR=GREEN><PRE>
P(1) = (n1 + c)/(n0 + n1 + 2c)
</PRE></FONT>
That's just the average of the entire signal seen so far. Add up all the bits, divide by number.
(countless papers have been written about what c should be (c= 1? c = 1/2?), why not 1/4? or 3/4?
Of course there's no a-priori reason for any particular value of c and in the real world it should just be tweaked to maximize results.)
<P>
That's the least history-forgetting estimator of all, it weights everything in the past equally
(we never want an estimator where older stuff is more important). In the other direction
you have lots of options - FIR and IIR filters. You could of course take the last N bits and average them (FIR filter), or take the last N and
average with a weight that smoothly goes from 0 to 1 across the window (sigmoidal or just linear). You can of course take an IIR average,
that's
<FONT COLOR=GREEN><PRE>
average <- (average*N + last)/(N+1)
</PRE></FONT>
Which is just the simplest IIR filter, aka geometric decay, "exponential smoothing" blah blah.
<P>
Anyway, that's just to get us thinking in terms of filters. Let's look at some common filters and how they
behave.
<P><HR><P>
Green bars = probability of a 1 bit after the bit at bottom is processed.
<P>
Blue bars = log odds = codelen(0) - codelen(1)
<P><HR><P>
Laplace : count n0 & n1 : <P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=counting&chd=e:qqv.zM1V223.445m6L6q2JySu7r.panHlDjMhhgAemgAeugAe0gAe5gAe9gAfBeHfFgAfIeUdieZfOgA&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=counting&chd=t:1.000,1.585,2.000,2.322,2.585,2.807,3.000,3.170,3.322,3.459,2.459,1.874,1.459,1.138,0.874,0.652,0.459,0.290,0.138,0.000,-0.126,0.000,-0.115,0.000,-0.107,0.000,-0.100,0.000,-0.093,0.000,-0.087,-0.170,-0.082,0.000,-0.078,-0.152,-0.222,-0.144,-0.070,0.000&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
After 10 1's and 10 0's, P is back to 50/50 ; we don't like the way this estimator has no recency bias.
<P><HR><P>
Geometric IIR with updshift 2 (very fast) : <P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>2&chd=e:n.t.yf134Z6T7u8y9l-Mupi.aPTrOwLEITGOErDgCoR-NeaHTlerXBhQY8itaCThepm-dPV7QccVlPr7&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>2&chd=t:0.737,1.354,1.903,2.412,2.893,3.356,3.806,4.246,4.680,5.109,1.427,0.270,-0.525,-1.171,-1.738,-2.257,-2.746,-3.214,-3.667,-4.110,-4.546,-1.357,-1.906,-0.537,-1.182,-0.119,-0.833,0.114,-0.647,0.245,-0.545,-1.188,-0.122,0.640,-0.250,-0.940,-1.532,-0.332,0.478,1.131&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
When a fast geometric gets to the 010101 section is wiggles back and forth wildly. If you look at the codelen difference
in that region you can see it's wasting something like 1.5 bits on each coding (because it's always wiggled in just
the wrong way, eg. biased towards 0 right before a 1 occurs).
<P>
Note that geometric probabilities have an exponential decay shape. (specifically , 0.75^n , where 0.75 is from (1 - 1/4) and
the 4 is because shift 2). HOWEVER the codelen difference steps are *linear*.
<P>
The geometric probability update (in the direction of MPS increasing probability) is very close to just adding a
constant to codelen difference (logit). (this just because P *= lambda , so log(P) += log(lambda)). You can see
after the 111 region, the first 0 causes a big negative step in codelen difference, and then once 0 becomes the MPS
the further 0 steps are the same constant linear step.
<P><HR><P>
Geometric IIR with updshift 5 (fast) : <P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=e:g.h9i6j0ksljmYnLn8osnbmMlAj2iuhpglfkeldoctdzc4d-dCeIdMeRdVeaddcidpeudwc1b7dDeJfN&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=geo>>5&chd=t:0.090,0.178,0.263,0.346,0.427,0.506,0.583,0.658,0.732,0.805,0.683,0.566,0.455,0.349,0.247,0.148,0.053,-0.038,-0.127,-0.214,-0.298,-0.198,-0.282,-0.183,-0.268,-0.169,-0.254,-0.155,-0.241,-0.143,-0.229,-0.313,-0.212,-0.115,-0.202,-0.286,-0.369,-0.266,-0.167,-0.071&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
Shift 5 is still fast by real world standards but looks slow compared to the crazy speed of geo 2.
You can see that the lack of an initial adaptation range hurts the ramp up on the 1111 portion. That is,
"geo 5" acts like it has 33 symbols in its history; at the beginning it actually has 0, so it has a bogus
history of 33 times P = 0.5 which gives it a lot of intertia.
<P><HR><P>
FIR 8-tap window flat weight : <P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=fir flat 8&chd=e:n.n.v.v.3.3.........3.v.n.gAYAQAIAAAAAAAAAIAIAQAQAYAYAgAgAgAgAYAgAgAgAYAYAYAgAn.&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=fir flat 8&chd=t:0.737,0.737,1.585,1.585,2.807,2.807,6.629,6.629,6.629,6.629,2.807,1.585,0.737,0.000,-0.737,-1.585,-2.807,-6.629,-6.629,-6.629,-6.629,-2.807,-2.807,-1.585,-1.585,-0.737,-0.737,0.000,0.000,0.000,0.000,-0.737,0.000,0.000,0.000,-0.737,-0.737,-0.737,0.000,0.737&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
Just take the last 8 and average. (I initialize the window to 01010 which is why it has the two-tap stair step in
the beginning). In practice you can't like the probabilities get to 0 or 1 completely, you have to clamp at some epsilon,
and in fact you need a very large epsilon here because over-estimating P(1) and then observing a 0 bit is very very bad
(codelen of 0 goes to infinity fast as P->1). The "codelen difference" chart here uses a P epsilon of 0.01
<P><HR><P>
bilateral filter : <P>
It's just filtering, right? Why not?
This bilateral filter takes a small local average (4 tap) and weights contribution of those local averages back
through time. The weight of each sample is multiplied by e^-dist * e^-value_difference. That is, two terms
(hence bilateral), weight goes down as you go back in time, but also based on how far away the sample is from the
most recent one in value. ("sample" is the 4-tap local average)
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=bilateral&chd=e:l0q-zH366-9B-Z.R.x..4oqsbjOmIpFRDKBvAyAOAAHXKTUuWqcCcDfPeigffVZudbf6fXdPWbXZdzml&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=bilateral&chd=t:0.532,1.032,1.989,2.790,3.556,4.363,5.291,6.473,6.629,6.629,2.945,1.003,-0.404,-1.758,-2.679,-3.477,-4.271,-5.161,-6.336,-6.629,-6.629,-2.945,-2.382,-1.062,-0.867,-0.359,-0.358,-0.069,-0.131,0.045,-0.060,-0.573,-0.232,-0.008,-0.057,-0.249,-0.890,-0.795,-0.199,0.603&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
What the bilateral does is that as you get into each region, it quickly forgets the previous region. So as you
get into 000 it forgets the 111, and once it gets into 0101 it pretty solidly stabilizes at P = 0.5 ; that is,
it's fast in forgetting the past when the past doesn't match the present (fast like geo 2), BUT it's not fast in
over-responding to the 0101 wiggle like geo 2.
<P>
There are an infinity of different funny impulse responses you could do here. I have no idea if any of them
would actually help compression (other than by giving you more parameters to be able to train to your data,
which always helps, sort of).
<P><HR><P>
mixed : <P>
Linear mix of geo 2 (super fast) and geo 5 (still quite fast). mixing weight starts at 0.5 and is updated with online
gradient descent. The mixer here has an unrealistically fast learning rate to exaggerate the effect.
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=mixed&chd=e:kfoRrquxxr0Y285a7z-LpPljiofWbxYJUqRWOOLQIYYyVxciZYeUbIfLcEfjcoaNd2gddrbrZoc-e9hN&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=3030b0&chxr=1,0,10,1&chds=-10,10&chxr=1,-10,10,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=mixed&chd=t:0.408,0.763,1.102,1.442,1.795,2.175,2.604,3.124,3.837,5.101,0.859,0.506,0.238,-0.058,-0.383,-0.722,-1.069,-1.428,-1.807,-2.229,-2.731,-0.661,-0.956,-0.313,-0.605,-0.152,-0.443,-0.074,-0.356,-0.040,-0.305,-0.528,-0.194,0.041,-0.210,-0.392,-0.582,-0.274,-0.094,0.110&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,3&chxt=x,y&chco=ccaa00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=weight&chd=e:gAhij7m3qKtxxs1.6u.9QIM1OIRyWSa6fdj7oatBx4bHdjXyYtVgVXTTScQ8PoQmNENbKgKsM0HmHRKF&chxl=0:|1|1|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1|0|1|0|1|0|0|1|1|0|0|0|1|1|1">
<P>
The weight shown is the weight of geo 2, the faster model. You can see that in the 111 and 000 regions the
weight of geo 2 shoots way up (because geo 5 is predicting P near 0.5 still), and then in the 0101 region the
weight of geo 2 gradually falls off because the slow geo 5 does better in that region.
<P>
The mixer immediately does something that none of the other estimators did - when the first 0 bit occurs,
it takes a *big* step down, almost down to 50/50. It's an even faster step than geo 2 did on its own.
(same thing with the first 1 after the 000 region).
<P>
Something quite surprising popped out to me. The probability steps in the 111 and 000 regions wind up linear.
Note that both the underlying estimators (geo 2 and geo 5) has exponential decay curving probabilities, but
the interaction with the mixing weight cancels that out and we get linear. I'm not sure if that's a
coincidence of the particular learning rate, but it definitely illustrates to us that a mixed probability
can behave unlike any of its underlying experts!
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/93rg44LrjLA" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-91096410431712252522018-05-04T18:08:00.003-07:002018-05-04T19:32:36.738-07:00Visualizing Binary Probability Updates
Some time ago I noted that the standard way we use binary probabilities to model statistics has some odd quirks :
<A HREF="http://cbloomrants.blogspot.com/2014/06/06-12-14-some-lzma-notes.html"> 06-12-14 - Some LZMA Notes </A>
<P>
I thought I would make some pictures to make that more intuitive.
<P>
What I have here is a 4 bit symbol (alphabet of 16). It is coded with 4 binary probabilities in a tree.
That is :
<FONT COLOR=GREEN><PRE>
first code a bit for sym >= 8 or not (is it in [0-7] or [8-15])
then go to [0-3] vs [4-7] (for example)
then [4-5] vs [6-7]
lastly [6] vs [7]
</PRE></FONT>
One way you might implement this is something like :
<FONT COLOR=GREEN><PRE>
U32 p0[16];
// sym is given
sym |= 16; // set a place-value marker bit
for 4 times
{
int ctx = sym >> 4; int bit = (sym>>3)&1;
arithmetic_code( p0[ctx] , bit );
sym <<= 1;
}
</PRE></FONT>
and note that only 15 p0's are actually used, p0[0] is not accessed; p0[1] is the root probability for
[0-7] vs [8-15] , p0[2] is [0-3] vs [4-7] , etc.
<P>
The standard binary probability update is exponential decay (the simplest geometric IIR filter) :
<FONT COLOR=GREEN><PRE>
if ( bit )
{
P0 -= P0 * lambda;
}
else
{
P0 += (1.0 - P0) * lambda;
}
</PRE></FONT>
So what actually happens to the symbol probabilities when you do this?
<P>
Something a bit strange.
<P>
When you update symbol [5] (for example), his probability goes up. But so does the probability of his neighbor,
[4]. And also his next neighbors [6 and 7]. And so on up the binary tree.
<P>
Now the problem is not the binary decomposition of the coding. Fundamentally binary decomposition and
full alphabet coding are equivalent and should be able to get the same results if you form the correct
correspondence between them.
<P>
(aside : For example, if you use a normal n0/n1 count model for probabilities, AND you initialize the counts
such that the parents = the sum of the children, then what you get is :
<A HREF="http://www.cbloom.com/blog/visualizing_binary_probability_updates_n0_n1.html">
visualizing_binary_probability_updates_n0_n1.html </A> - only the symbols you observe increase in probability.
Note this is a binary probability tree, not full alphabet, but it acts exactly the same way.)
<P>
The problem (as noted in the previous LZMA post) is that the update rate at the top levels is too large.
<P>
Intuitively, when a 5 occurs you should update the binary choice [45] vs [67] , because a 5 is more likely,
that pair is more likely. The problem is that [4] did not get more likely, and the probability of the group
[45] represents the chance of either occuring. One of those things did get more likely, but one did not.
The 4 in the group should act as a drag that keeps the group [45] probability from going up so much.
Approximately, it should go up by half the speed of the leaf update.
<P>
The larger the group, the greater the drag of symbols that did not get more likely. eg. in [0-7] vs [8-15]
when a 5 occurs, all of 0123467 did not get more likely, so 7 symbols act as drag and the rate should be
around 1/8 the speed.
<P>
(see appendix at end for the exact speeds you should use if you wanted to adjust only one probability)
<P>
Perhaps its best to just look at the charts. These are histograms of probabilities (in percent). It
starts all even, then symbol 5 occurs 20 X, then symbol 12 occurs 20 X. The probabilities are updated with
the binary update scheme. lambda here is 1.0/8 , which is rather fast, used to make the visualization
more dramatic.
<P>
What you should see : when 5 first starts getting incremented, the probabilities of its neighbors go way up,
4, and 6-7, and the whole 0-7 side. After 5 occurs many times, the probabilities of its neighbors goes down.
Then when 12 starts beeing seen, the whole 8-15 side shoots up.
<P>
Go scroll down through the charts then we'll chat more.
<P>
This is a funny thing. Most people in the data compression community think of binary coding symbols this way
as just a way to encode an alphabet using binary arithmetic coding. They aren't thinking about the fact that
what they're actually doing is a strange sort of group-probability update.
<P>
In fact, in many applications if you take a binary arithmetic coder like this and replace it with an N-ary
full alphabet coder, results get *worse*. Worse !? WTF !? It's because this weird group update thing that
the binary coder is doing is often actually a good thing.
<P>
You can imagine scenarios where that could be the case. In some types of data, when a new symbol X suddenly
starts occuring (when it had been rare before), then that means (X-1) and (X+2) may start to be seen as well.
We're getting some kind of complicated modeling that novel symbols imply their neighbors novel probability
should go up. In some type of data (such as BWT post-MTF) the probabilities act very much in binary tree
groups. (see Fenwick for example). In other types of data that is very bit structured (such as ascii text
and executable code), when a symbol with some top 3 bits occurs, then other symbols with those top bits are
also more likely. That is, many alphabets actually have a natural binary decomposition where symbol groups
in the binary tree do have joint probability.
<P>
This is one of those weird things that happens all the time in data compression where you think you're
just doing an implementation choice ("I'll use binary arithmetic coding instead of full alphabet") but that
actually winds up also doing modeling in ways you don't understand.
<P>
The visuals :
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:EAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEA">
<P>
5 x 1
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:D8D8D8D8E.GaEbEbDgDgDgDgDgDgDgDg">
<P>
5 x 2
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DyDyDyDyFxJSErErDEDEDEDEDEDEDEDE">
<P>
5 x 3
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DkDkDkDkGTMhEvEvCrCrCrCrCrCrCrCr">
<P>
5 x 4
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DUDUDUDUGoP.EsEsCWCWCWCWCWCWCWCW">
<P>
5 x 5
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:DDDDDDDDGwTkEiEiCDCDCDCDCDCDCDCD">
<P>
5 x 6
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CyCyCyCyGtXKEUEUBzBzBzBzBzBzBzBz">
<P>
5 x 7
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CiCiCiCiGhasEEEEBlBlBlBlBlBlBlBl">
<P>
5 x 8
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CSCSCSCSGQeHDxDxBYBYBYBYBYBYBYBY">
<P>
5 x 9
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:CDCDCDCDF6hWDeDeBNBNBNBNBNBNBNBN">
<P>
5 x 10
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:B1B1B1B1FhkZDLDLBDBDBDBDBDBDBDBD">
<P>
5 x 11
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BoBoBoBoFHnPC5C5A7A7A7A7A7A7A7A7">
<P>
5 x 12
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BdBdBdBdEsp2CnCnA0A0A0A0A0A0A0A0">
<P>
5 x 13
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BSBSBSBSESsPCWCWAtAtAtAtAtAtAtAt">
<P>
5 x 14
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BJBJBJBJD4ubCGCGAnAnAnAnAnAnAnAn">
<P>
5 x 15
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:BABABABADgwZB4B4AjAjAjAjAjAjAjAj">
<P>
5 x 16
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:A5A5A5A5DJyKBrBrAeAeAeAeAeAeAeAe">
<P>
5 x 17
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AyAyAyAyC0zwBfBfAaAaAaAaAaAaAaAa">
<P>
5 x 18
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AsAsAsAsCh1LBUBUAXAXAXAXAXAXAXAX">
<P>
5 x 19
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AnAnAnAnCP2dBLBLAUAUAUAUAUAUAUAU">
<P>
5 x 20
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AiAiAiAiB.3lBCBCASASASASASASASAS">
<P>
12 x 1
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AeAeAeAeBwwoA6A6BGBGBGBGBxBYBOBO">
<P>
12 x 2
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AaAaAaAaBiqjAzAzBmBmBmBmD7CcB-B-">
<P>
12 x 3
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AXAXAXAXBVlPAsAsB5B5B5B5GpDWChCh">
<P>
12 x 4
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AUAUAUAUBLglAnAnCCCCCCCCJ0EEC4C4">
<P>
12 x 5
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ASASASASBBcgAiAiCFCFCFCFNSElDFDF">
<P>
12 x 6
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:APAPAPAPA5Y8AeAeCCCCCCCCQ7E5DKDK">
<P>
12 x 7
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ANANANANAyV1AaAaB9B9B9B9UoFCDJDJ">
<P>
12 x 8
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AMAMAMAMAsTGAXAXB2B2B2B2YTFCDDDD">
<P>
12 x 9
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AKAKAKAKAmQtAUAUBtBtBtBtb3E7C6C6">
<P>
12 x 10
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AJAJAJAJAiOoARARBkBkBkBkfREvCuCu">
<P>
12 x 11
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AIAIAIAIAdMzAPAPBcBcBcBcifEfCiCi">
<P>
12 x 12
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AHAHAHAHAaLMANANBTBTBTBTlfENCVCV">
<P>
12 x 13
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AGAGAGAGAWJzAMAMBLBLBLBLoRD5CJCJ">
<P>
12 x 14
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AFAFAFAFAUIlAKAKBDBDBDBDq0DlB8B8">
<P>
12 x 15
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AFAFAFAFARHgAJAJA8A8A8A8tIDRBwBw">
<P>
12 x 16
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AEAEAEAEAPGkAIAIA2A2A2A2vPC-BlBl">
<P>
12 x 17
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:AEAEAEAEANFvAHAHAwAwAwAwxICrBaBa">
<P>
12 x 18
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ADADADADAMFCAGAGAqAqAqAqy1CaBRBR">
<P>
12 x 19
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ADADADADAKEZAFAFAlAlAlAl0XCKBIBI">
<P>
12 x 20
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=008a00&chxr=1,0,100,10&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chd=e:ACACACACAJD2AFAFAhAhAhAh1uB7BABA">
<P>
<P><HR><P>
Appendix :
<P>
How to do a binary probability update without changing your neighbor :
<FONT COLOR=GREEN><PRE>
Consider the simplest case : 4 symbol alphabet :
[0 1 2 3]
A = P[0 1] vs [2 3]
B = P[1] vs [1]
P(0) = A * B
P(1) = A * (1 - B)
now a 0 occurs
alpha = update rate for A
beta = update rate for B
A' = A + (1-A) * alpha
B' = B + (1-B) * beta
we want P(1)/P(23) to be unchanged by the update
(if that is unchanged, then P(1)/P(2) and P(1)/P(3) is unchanged)
that is, the increase to P(0) should scale down all other P's by the same factor
P(1)/P(23) = A * (1-B) / (1-A)
so require :
A' * (1-B') / (1-A') = A * (1-B) / (1-A)
solve for alpha [...algebra...]
alpha = A * beta / ( 1 - (1 - A) * beta )
</PRE></FONT>
that's as opposed to just using the same speed at all levels, alpha = beta.
<P>
In the limit of small beta, (slow update), this is just alpha = A * beta.
<P>
The update rate at the higher level is decreased by the probability of the updated subsection.
<P>
In the special case where all the probabilities start equal, A = 0.5, so this is just alpha = beta / 2 -
the update rates should be halved at each level, which was the intuitive rule of thumb that I hand-waved
about before.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/abIj4WqJaec" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-24339955744653173292018-05-03T13:29:00.000-07:002018-05-03T13:29:07.865-07:00Whirlwind Tour of Mixing Part 3
On to online gradient descent.
<P>
First we need to talk about how our experts are mixed. There are two primary choices :
<FONT COLOR=GREEN><PRE>
Linear mixing :
M(x) = Sum{ w_i P_i(x) } / Sum{ w_i }
or = Sum{ w_i P_i(x) } if the weights are normalized
Geometric mixing :
M(x) = Prod{ P_i(x) ^ c_i } / Sum(all symbols y){ Prod{ P_j(y) ^ c_j } }
(the coefficients c_i do not need to be normalized)
P = probability from each expert
M = mixed probability
require
Sum(all symbols y){ M(y) } = 1
</PRE></FONT>
Linear mixing is just a weighted linear combination. Geometric mixing is also known as
"product of expert" and "logarithmic opinion pooling".
<P>
Geometric mixing can be expressed as a linear combination of the logs, if normalization is
ignored :
<FONT COLOR=GREEN><PRE>
log( M_unnormalized ) = Sum{ c_i * log( P_i ) }
</PRE></FONT>
In the large alphabet case, there is no simple logarithmic form of geometric M (with normalization included).
<P>
In the special case of a binary alphabet, Geometric mixing has a special simple form
(even with normalization) in log space :
<FONT COLOR=GREEN><PRE>
M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ (1 - P_i) ^ c_i } )
M = 1 / ( 1 + Prod{ ((1 - P_i)/P_i) ^ c_i } )
M = 1 / ( 1 + Prod{ e^c_i * log((1 - P_i)/P_i) } )
M = 1 / ( 1 + e^ Sum { c_i * log((1 - P_i)/P_i) } )
M = sigmoid( Sum { c_i * logit(P_i) } )
</PRE></FONT>
which we looked at in the previous post, getting some background on logit and sigmoid.
<P>
We looked before about how logit space strongly preserves skewed probabilities. We can see the same
thing directly in the geometric mixer.
<P>
Obviously if an expert has a P near 0, that multiplies into the product and produces an output near 0.
Zero times anything = zero.
Less obviously if an expert has a P near 1, the denomenator normalization factor becomes zero in the other
terms (that aren't the numerator), so the mix becomes 1, regardless of what the other P's.
<P>
So assume we are using these mixtures. We want to optimize (minimize) code length. The general procedure for
gradient descent is :
<FONT COLOR=GREEN><PRE>
the code length cost of a previous coding event is -log2( M(x) )
evaluated at the actually coded symbol x
(in compression the total output file length is just the sum of all these event costs)
To improve the w_i coefficients over time :
after coding symbol x, see how w_i could have been improved to reduce -log2( M(x) )
take a step of w_i in that direction
the direction is the negative gradient :
delta(w_i) ~= - d/dw_i { -log2( M(x) ) }
delta(w_i) ~= (1/M) * d/dw_i{ M(x) }
(~ means proportional; the actual delta we want is scaled by some time step rate)
</PRE></FONT>
so to find the steps we should take we just have to do some derivatives.
<P>
I won't work through them in ascii, but they're relatively simple.
<P>
(the geometric mixer update for the non-binary alphabet case is complex and I won't cover it any more here.
For practical reasons we will assume linear mixing for non-binary alphabets, and either choice is available
for binary alphabets. For the non-binary geometric step see Mattern "Mixing Strategies in Data Compression")
<P>
When you work through the derivatives what you get is :
<FONT COLOR=GREEN><PRE>
Linear mixer :
delta(w_i) ~= ( P_i / M ) - 1
Geometric mixer (binary) :
delta(w_i) ~= (1 - M) * logit(P_i)
</PRE></FONT>
Let's now spend a little time getting familiar with both of them.
<P>
First the linear mixer :
<P>
<FONT COLOR=GREEN><PRE>
put r_i = P_i(x) / M(x)
ratio of expert i's probability to the mixed probability, on the symbol we just coded (x)
Note : Sum { w_i * r_i } = 1 (if the weights are normalized)
Your goal is to have M = 1
That is, you gave 100% probability to the symbol that actually occured
The good experts are the ones with high P(x) (x actually occured)
The experts that beat the current mix have P > M
Note that because of Sum { w_i * r_i } = 1
Whenever there is an expert with r_i > 1 , there must be another with r_i < 1
The linear mixer update is just :
delta(w_i) ~= r_i - 1
experts that were better than the current mix get more weight
experts worse than the current mix get less weight
</PRE></FONT>
I think it should be intuitive this is a reasonable update.
<P>
Note that delta(w) does not preserve normalization. That is :
<FONT COLOR=GREEN><PRE>
Sum{ delta(w_i) } != 0 , and != constant
however
Sum{ w_i * delta(w_i) } = Sum{ w_i * P_i / M } - Sum { w_i } = Sum {w_i} - Sum{w_i} = 0
the deltas *do* sum to zero when weighted.
</PRE></FONT>
We'll come back to that later.
<P>
If we write the linear update as :
<FONT COLOR=GREEN><PRE>
delta(w_i) ~= ( P_i - M ) / M
</PRE></FONT>
it's obvious that delta is proportional to (1/M). That is, when we got the overall mixed prediction M right, M is near 1, the step is small.
When we got the step wrong, M is near 0, the step we take can be very large. (note that P_i does not have to also be near zero when M is near zero,
the w_i for that expert might have been small). In particular, say most of the experts were confused, they
predict the wrong thing, so for the actual coded event x, their P(x) is near zero. One guy said no, what about x! His P(x) was higher.
But everyone ignores the one guy piping up (his weight is low). The
mixed M(x) will be very low, we make a huge codelen (very bad log cost), the step (1/M) will be huge, and it bumps up the one guy who got the
P(x) right.
<P>
Now playing with the geometric update :
<FONT COLOR=GREEN><PRE>
Geometric mixer (binary) :
delta(w_i) ~= (1 - M) * logit(P_i)
this is evaluated for the actually coded symbol x (x = 0 or 1)
let's use the notation that ! means "for the opposite symbol"
eg. if you coded a 0 , then ! means "for a 1"
eg. !P = 1-P and !M = 1-M
(in the previous post I put up a graph of the relationship between L and !L, codelens of opposing bits)
We can rewrite delta(w_i) as :
delta(w_i) ~= !M * ( !L_i - L_i )
recalling logit is the codelen difference
that's
delta(w_i) ~= (mixed probability of the opposite symbol) * (ith model's codelen of opposite - codelen of coded symbol)
</PRE></FONT>
I find this form to be more intuitive; let's talk through it with words.
<P>
The step size is proportional to !M , the opposite symbol mixed probability. This is a bit like the (1/M) scaling in the linear mixer.
When we got the estimate right, the M(x) of the coded symbol will go to 1, so !M goes to zero, our step size goes to zero. That is, if
our mixed probability M was right, we take no step. If it ain't broke don't fix it. If we got the M very wrong, we thought a 0 bit would
not occur but then it did, M is near zero, !M goes to one.
<P>
While the general direction of scaling here is the same as the linear mixer, the scaling is very different. For example in the case where
we were grossly wrong, M -> 0 , the linear mixer steps like (1/M) - it can be a huge step. The geometric mixer steps like (1-M) , it hits a
finite maximum.
<P>
The next term (!L - L) means weights go up for experts with a large opposite codelen. eg. if the codelen of the actually coded symbol was low,
that's a good expert, then the codelen of the opposite must be high, so his weight goes up.
<P>
Again this is similar to the linear mixer, which took a step ~ P(x) , so experts who were right go up. Again the scaling is different, and
in the opposite way. In the linear mixer, if experts get the P right or wrong, the step just scales in [0,1] ; it varies, but not drastically.
Conversely in the geometric case because step is proportional to codelen, it goes to infinity as P gets extremely wrong.
<P>
Say you're an expert who gets it all wrong and guesses P(x) = 0 (near zero) then x occurs. In the linear case the step is just (-1). In
the geometric case, L(x) -> inf , !L -> 0 , your weight takes a negative infinite step. Experts who get it grossly wrong are not gradually
diminished, they are immediately crushed.
(in practice we clamp weights to some chosen finite range, or clamp step to a maximim)
<P>
Repeating again what the difference is :
<P>
In linear mixing, if the ensemble makes a big joint mistake, the one guy out of the set who was right can get a big weight boost.
If the ensemble does okay, individuals experts that were way off do not get big adjustments.
<P>
In geometric mixing, individual experts that make a big mistake take a huge weight penalty. This is only mildly scaled by
whether the ensemble was good or not.
<P>
Another funny difference :
<P>
With geometric mixing, experts that predict a 50/50 probability (equal chance of 0 or 1) do not change w at all. If you had a static
expert that just always said "I dunno, 50/50?" , his weight would never ever change. But he would also never contribute to the mixture
either, since as noted in the previous post equivocal experts simply have no effect on the sum of logits.
(you might think that if you kept seeing 1 bits over and over, P(1) should be near 100% and experts that keep saying "50%" should have
their weight go to crap; that does not happen in geometric mixing).
<P>
In contrast, with linear mixing, 50/50 experts would either increase or decrease in weight depending on if they were better or worse than
the previous net mixture. eg. if M was 0.4 on the symbol that occurred, the weight of an 0.5 P expert would go up, since he's better than the mix
and somebody must be worse than him.
<P>
Note again that in geometric mixing (for binary alphabets), there is no explicit normalization needed. The w_i's are not really weights,
they don't sum to 1. They are not in [0,1] but go from [-inf,inf]. It's not so much "weighting of experts" as "adding of experts" but
in delta-codelen space.
<P>
Another funny thing about geometric mixing is what the "learning rate" really does.
<P>
With the rate explicitly in the steps :
<FONT COLOR=GREEN><PRE>
Linear mixer :
w_i += alpha * ( ( P_i / M ) - 1 )
(then w_i normalized)
Geometric mixer (binary) :
c_i += alpha * ( (1 - M) * logit(P_i) )
(c_i not normalized)
</PRE></FONT>
In the linear case, alpha really is a learning rate. It controls the speed of how fast a better expert takes weight from a worse
expert. They redistribute the finite weight resource because of normalization.
<P>
That's not the case in the geometric mixer. In fact, alpha just acts as an overall scale to all the weights. Because all the
increments to w over time have been scaled by alpha, we can just factor it out :
<FONT COLOR=GREEN><PRE>
no alpha scaling :
c_i += ( (1 - M) * logit(P_i) )
factor out alpha :
M = sigmoid( alpha * Sum { c_i * logit(P_i) } )
</PRE></FONT>
The "learning rate" in the geometric mixer really just scales how the sum of logits is stretched back to probability. In fact
you can think of it as a normalization factor more than a learning rate.
<P>
The geometric mixer has yet another funny property that it doesn't obviously pass through the best predictor. It doesn't even
obviously pass through the case of a single expert. (with linear mixing it's obvious that crap predictors weight goes
to zero, and a single expert would just pass through at weight=1)
<FONT COLOR=GREEN><PRE>
aside you may ignore; quick proof that it does :
say you have only one expert P
say that P is in fact perfectly correct
0 bits occur at rate P and P is constant
initially you have some mixing coefficient 'c' (combining alpha in with c)
The mixed probability is :
M = sigmoid( c * logit(P) )
which is crap. c is applying a weird scaling we don't want.
c should go to 1, because P is the true probability.
Does it?
delta(c) after one step of symbol 0 is (1-M) * logit(P)
for symbol 1 it's M * logit(1-P)
since P is the true probability, the probabilistic average step is :
average_delta(c) = P * ( (1-M) * logit(P) ) + (1-P) * ( (M) * logit(1-P) )
logit(1-P) = - logit(P)
average_delta(c) = P * logit(P) - M * logit(P) = (P - M) * logit(P)
in terms of l = logit(P)
using P = sigmoid(l)
average_delta(c) = ( sigmoid(l) - sigmoid( c * l ) ) * l
this is a differential equation for c(t)
you could solve it (anyone? anyone?)
but even without solving it's clear that it does go to c = 1 as t -> inf
if c < 1 , average_delta(c) is > 0 , so c goes up
if c > 1 , average_delta(c) is < 0 , so c goes down
in the special case of l small (P near 50/50), sigmoid is near linear,
then the diffeq is easy and c ~= (1 + e^-t)
</PRE></FONT>
Note that this was done with alpha set to 1 (included in c). If we leave alpha separate, then c should
converge to 1/alpha in order to pass through a single predictor. That's what I mean by alpha not really
being a "learning rate" but more of a normalization factor.
<P>
So anyway, despite geometric mixing being a bit odd, it works in practice; it also works in theory
(the gradient descent can be proved to converge with reasonable bounds and so on). In practice it
has the nice property of not needing any divides for normalization (in the binary case). Of course it
doesn't require evaluation of logit & sigmoid, which are typically done by table lookup.
<P><HR><P>
Bonus section : Soft Bayes
<P>
The "Soft Bayes" method is not a gradient descent, but it is very similar, so I will mention it here.
Soft Bayes uses a weighted linear combination of the experts, just like linear gradient descent mixing.
<P>
<FONT COLOR=GREEN><PRE>
Linear mixer :
d_i = ( P_i / M ) - 1
w_i += rate * d_i
Soft Bayes :
w_i *= ( 1 + rate * d_i )
</PRE></FONT>
Instead of adding on the increment, Soft Bayes multiples 1 + the increment.
<P>
Multiplying by 1 + delta is obviously similar to adding delta and the same intuitive arguments we made before about why this
works still apply. Maybe I'll repeat them here :
<P>
General principles we want in a mixer update step :
<FONT COLOR=GREEN><PRE>
The step should be larger when the mixer produced a bad result
the worse M(x) predicted the observed symbol x, the larger the step should be
this comes from scaling like (1/M) or (1-M)
The step should scale up weights of experts that predicted the symbol well
eg. larger P_i(x) should go up , smaller P_i(x) should go down
a step proportional to (P-M) does this in a balanced way (some go up, some go down)
but a step of (P - 0.5) works too
as does logit(P)
</PRE></FONT>
We can see a few properties of Soft Bayes :
<FONT COLOR=GREEN><PRE>
Soft Bayes :
w_i *= ( 1 + rate * d_i )
w_i' = w_i + w_i * rate * d_i
w_i += rate * w_i * d_i
This is the same as linear gradient descent step, but each step d_i is scaled by w_i
Soft Bayes is inherently self-normalizing :
Sum{w_i'} = Sum{w_i} + Sum{w_i * rate * d_i}
Sum{ w_i * d_i } = Sum{ ( w_i * P_i / M ) - w_i } = ( M/M - 1 ) * Sum{ w_i } = 0
therefore :
Sum{w_i'} = Sum{w_i}
the sum of weights is not changed
if they started normalized, they stay normalized
</PRE></FONT>
The reason that "soft bayes" is so named can be seen thusly :
<FONT COLOR=GREEN><PRE>
Soft Bayes :
w_i *= ( 1 + rate * d_i )
d_i = ( P_i / M ) - 1
w_i *= (1-rate) + rate * ( P_i / M )
at rate = 1 :
w_i *= P_i / M
ignoring normalization this is
w_i *= P_i
</PRE></FONT>
which is "Beta Weighting" , which is aka "naive Bayes".
<P>
So Soft Bayes uses "rate" as a lerp factor to blend in a naive Bayes update. When rate is 1, it's "fully Bayes", when rate is lower it keeps
some amount of the previous weight lerped in.
<P>
Soft Bayes seems to work well in practice. Soft Bayes could have the bad property that if weights get to zero they can never rise, but in
practice we always clamp weights to a non-zero range so this doesn't occur.
<P>
A quick aside on why Beta weighting is the naive Bayesian solution :
<FONT COLOR=GREEN><PRE>
ignoring normalization through this aside
(also ignoring priors and all the proper stuff you should do if you're Bayesian)
(see papers for thorough details)
Mix experts :
M = Sum{ w_i * P_i }
P_i(x) is the probability of x given expert i ; P(x|i)
The w_i should be the probability of expert i, given past data , that is :
P(x|past) = Sum{ P(i|past) * P(x|i) }
P(i|past) is just ~= P(past|i)
(ignoring prior P(i) and normalizing terms P(past) that factor out in normalization)
Now *if* (big if) the past events are all independent random events, then :
P(past|i) = P(x0|i)*P(x1|i)*P(x2|i)... for past symbols x0,x1,...
since P("ab") = P(a)*P(b) etc.
which you can compute incrementally as :
w_i *= P_i
the Beta Weight update.
</PRE></FONT>
In words, the weight for the expert should be the probability of that expert having made the data we have seen so far.
That probability is just the P_i that expert has predicted for all past symbols, multiplied together.
<P>
Now this is obviously very wrong. Our symbols are very much not independent, they have strong conditional probability, that's how we
get compression out of modeling the stream. The failure of that assumption may be why true beta weighting is poor in practice.
<P><HR><P>
This Whirlwind Tour has been brought to you by the papers of Mattern & Knoll, PAQ of Mahoney, and the radness of RAD.
<P>
Small note about practical application :
<P>
You of course want to implement this with integers and store things like weights and probabilities in fixed point.
I have found that the exact way you handle truncation of that fixed point is very important. In particular you need
to be careful about rounding & biasing and try not to just lose small values. In general mixing
appears to be very "fiddly"; it has a very small sweet spot of tuned parameters and small changes to them can make
results go very bad.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/iHItA6QS73k" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-24680154859984477632018-05-02T15:48:00.000-07:002018-05-02T15:48:32.663-07:00Whirlwind Tour of Mixing Part 2
Before we dive into online gradient descent, let's look at a couple of building blocks.
<P>
In many cases we've been talking about mixers that work with arbitrary alphabets, but let's
specialize now to binary. In that case, each expert just makes one P, say P(0), and
implicitly P(1) = 1 - P(0).
<P>
We will make use of the logit ("stretch") and logistic sigmoid ("squash"), so let's look
at them a bit first.
<P>
<FONT COLOR=GREEN><PRE>
logit(x) = ln( x / (1-x) )
sigmoid(x) = 1 / (1 + e^-x ) (aka "logistic")
logit is "log odds" of a probability
logit takes [0,1] -> [-inf,inf]
sigmoid takes [-inf,inf] -> [0,1]
logit( sigmoid(x) ) = 1
</PRE></FONT>
I'm a little bit fast and loose with the base of the logs and exponents sometimes, because the
difference comes down to scaling factors which wash out in constants in the end. I'll try to be
not too sloppy about it.
<P>
You can see plots of them
<A HREF="https://nathanbrixius.wordpress.com/2016/06/04/functions-i-have-known-logit-and-sigmoid/">
here for example </A>.
The "logistic sigmoid" is an annoyingly badly named function; "logistic" is way too similar to "logit"
which is very confusion, and "sigmoid" is just a general shape description, not a specific function.
<P>
What's more intuitive in data compression typically is the base-2 variants :
<FONT COLOR=GREEN><PRE>
log2it(P) = log2( P / (1-P) )
log2istic(t) = 1 / (1 + 2^-t )
</PRE></FONT>
which is just different by a scale of the [-inf,inf] range. (it's a difference of where you measure
your codelen in units of "bits" or "e's").
<P>
In data compression there's an intuitive way to look at the logit. It's the difference of codelens
of symbol 0 and 1 arising from the probability P.
<FONT COLOR=GREEN><PRE>
log2it(P) = log2( P / (1-P) )
= log2( P ) - log2( 1 - P )
if P = P0
then L0 = codelen of 0 = - log2(P0)
L1 = - log2(1-P0)
log2it(P0) = L1 - L0
which I like to write as :
log2it = !L - L
where !L means the len of the opposite symbol
</PRE></FONT>
Working with the logit in data compression is thus very natural and has nice properties :
<P>
1. It is (a)symmetric under exchange of the symbol alphabet 0->1. In that case P -> 1-P , so logit -> - logit.
The magnitude stays the same, so mixers that act on logit behave the same way. eg. you aren't
favoring 0 or 1, which of course you shouldn't be. This is why using the *difference* of codelen
between L0 and L1 is the right thing to do, not just using L0, since the 0->1 exchange acts on L0 in
a strange way.
<P>
Any mixer that we construct should behave the same way if we swap 0 and 1 bits. If you tried to
construct a valid mixer like that using only functions of L0, it would be a mess. Doing it in the
logit = (!L-L) it works automatically.
<P>
2. It's linear in codelen. Our goal in compression is to minimize total codelen, so having a parameter
that is in the space we want makes that natural.
<P>
3. It's unbounded. Working with P's is awkward because of their finite range, which requires clamping and
theoretically projection back into the valid cube.
<P>
4. It's centered on zero for P = 0.5 and large for skewed P. We noted before that what we really want
is to gather models that are skewed, and logit does this naturally.
<P>
Let's look at the last property.
<P>
Doing mixing in logit domain implicitly weights skewed experts, as we looked at before in "functional weighting".
With logit, skewed P's near 0 or 1 can stretched out to -inf or +inf. This gives them very high implicit "weight",
that is even after mixing with a P near 0.5, the result is still skewed.
<FONT COLOR=GREEN><PRE>
Mix two probabilities with equal weight :
0.5 & 0.9
linear : 0.7
logit : 0.75
0.5 & 0.99
linear : 0.745
logit : 0.90867
0.5 & 0.999
linear : 0.74950
logit : 0.96933
</PRE></FONT>
Logit blending is sticky at skewed P.
<P>
A little plot of what the L -> !L function looks like :
<P>
<A HREF="http://www.cbloom.com/blogimages/len_to_len.png">
<IMG SRC="http://www.cbloom.com/blogimages/len_to_len_small.png">
</A>
<P>
Obviously it is symmetric around (1,1) where both codelens are 1 bit, P is 0.5 ; as P gets skewed either way
one len goes to zero and the other goes to infinity.
<P>
logistic sigmoid is the map back from codelen delta to probability. It has the nice property for us that
no matter what you give it, the output is a valid normalized probability in [0,1] ; you can scale or screw up your
logistics in whatever way you want and you still get a normalized probability (which makes it codeable).
<P>
Linear combination of logits (codelen deltas) is equal to geometric (multiplicative) mixing of probabilities :
<FONT COLOR=GREEN><PRE>
M = mixed probability
linear combination in logit space, then transform back to probability :
M = sigmoid( Sum{ c_i * logit( P_i ) } )
M = 1 / ( 1 + e ^ - Sum{ c_i * logit( P_i ) } )
e ^ logit(P) = P/(1-P)
e ^ -logit(P) = (1-P)/P
e ^ - c * logit(P) = ((1-P)/P) ^ c
M = 1 / ( 1 + Prod{ (( 1 - P_i ) / P_i)^c_i } )
M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ ( 1 - P_i )^c_i } )
</PRE></FONT>
What we have is a geometric blend (like a geometric mean, but with arbitrary exponents, which are
the mixing factors), and the denomenator is just the normalizing factor so that M0 and M1 some to one.
<P>
Note that the blend factors (c_i) do not need to be normalized here. We get a normalized probability
even if the c_i's are not normalized. That is an important property for us in practice.
<P>
This is a classic machine learning method called "product of experts". In the most basic form, all the c_i = 1,
the powers are all 1, the probabilities from each expert are just multiplied together.
<P>
I'm calling the c_i "coefficients" not "weights" to emphasize the fact that they are not normalized.
<P>
Note that while we don't need to normalize the c_i , and in practice we don't (and in the "product of experts"
usage with all c_i = 1 they are of course not normalized), that has weird consequences and let us not
brush it under the table.
<P>
For one thing, the overall scale of c_i changes the blend. In many cases we hand-waved about scaling because
normalization will wipe out any overall scaling factors of weights. In this case the magnitude scale of the c_i
absolutely matters. Furthermore, the scale of the c_i has nothing to do with a "learning rate" or blend strength.
It's just an extra exponent applied to your probabilities that does some kind of nonlinear mapping to how they are
mixed. e.g. doing c_i *= 2 is like mixing squared probabilities instead of linear probabilities.
<P>
For another, with unnormalized c_i - only skewed probabilities contribute. Mixing in more 50/50 experts does *nothing*.
It does not bias your result back towards 50/50 at all!
<P>
We can see this both in the logit formulation and in the geometric mean formulation :
<FONT COLOR=GREEN><PRE>
M = sigmoid( Sum{ c_i * logit( P_i ) } )
Now add on a new expert P_k with some coefficient c_k
M' = sigmoid( Sum{ c_i * logit( P_i ) + c_k * logit( P_k ) } )
if P_k = 50% , then logit(P_k) = 0
M' = M
even if c_k is large.
(if you normalized the c's, the addition of c_k would bring down the coffecient from the other contributions)
The Sum of logits only adds up codelen deltas
Experts with large codelen deltas are adding their scores to the vote
small codelen deltas simply don't contribute at all
In the geometric mean :
Q = Prod{ P_i ^ c_i }
M = Q / ( Q + !Q )
(where ! means "of the opposing symbol; eg. P -> (1-P) )
Add on another expert with P = 0.5 :
Q' = Q * 0.5 ^ c_k
!Q' = !Q * 0.5 ^ c_k
M' = Q * 0.5 ^ c_k / ( Q * 0.5 ^ c_k + !Q * 0.5 ^ c_k)
the same term multiplied around just factors out
M' = M
</PRE></FONT>
This style of mixing is like adding experts, not blending them. Loud shouty experts with skewed codelen
deltas add on.
<P>
Next up, how to use this for gradient descent mixing in compression.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/lC1jYjnwQPo" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-60300949689982632422018-05-02T12:10:00.000-07:002018-05-02T13:32:08.454-07:00Whirlwind Tour of Mixing Part 1
I'm going to try to run through mixing techniques in data compression in a wild quick tour.
Part 1 will cover techniques up to but not including online gradient descent.
<P>
First some background :
<P>
We have probabilities output by some "experts" , P_i(x) for the ith expert for symbol x.
<P>
We wish to combine these in some way to create a mixed probability.
<P>
The experts are a previous stage of probability estimation. In data compression these are usually models; they could be
different orders of context model, or the same order at different speeds, or other probability sources.
<P>
(you could also just have two static experts, one that says P(0) = 100% and one that says P(1) = 100%, then the
way you combine them is actually modeling the binary probability; in that case the mixer is the model. What
I'm getting at is there's actually a continuum between the model & the mixer; the mixer is doing some of the
modeling work. Another common synthetic case used for theoretical analysis is to go continuous and have an
infinite number of experts with P(0) = x for all x in [0,1], then the weight w(x) is again the model.)
<P>
Now, more generally, the experts could also output a confidence. This is intuitively obviously something that
should be done, but nobody does it, so we'll just ignore it and say that our work is clearly flawed and always
has room for improvement. (for example, a context that has seen n0 = 1 and n1 = 0 might give P(0) = 90% ,
but that is a far less confident expert that one that has n0 = 90 and n1 = 10 , which should get much higher
weight). What I'm getting at here is the experts themselves have a lot more information than just P, and that
information should be used to feed the mixer. We will not do that. We treat the mixer as independent of
the experts and only give it the probabilities.
<P>
(you can fix this a bit by using side information (not P) from the model stages as context for the mixer
coefficients; more on that later).
<P>
Why mixing? A lot of reasons. One is that your data may be switching character over time. It might switch
from very-LZ-like to very-order-3-context like to order-0-like. One option would be to try to detect those
ranges and transmit switch commands to select models, but that is not "online" (one pass) and the overhead to
send the switch commands winds up being comparable to just mixing all those models (much research shows that
these alternatives are asymptotically bounded). Getting perfect switch transmission right is actually way
harder than mixing; you have to send the switch commands efficiently, you have to find the ideal switch ranges
with consideration of the cost of transmitting the switch (which is a variant of a "parse-statistics feedback"
problem which is one of the big unsolved bugbears in compression).
<P>
Another reason we like mixing is that our models might be good for part of the character of the data, but not
all of it. For example it's quite common to have some data that's text-like (has strong order-N context
properties) but is also LZ-like (perhaps matches at offset 4 or 8 are more likely, or at rep offset). It's hard
to capture both those correlations with one model. So do two models and mix.
<P>
On to techniques :
<P><HR><P>
1. Select
<P>
The most elementary mixer just selects one of the models. PPMZ used this for LOE (Local Order Estimation). PPMZ's primary scheme
was to just select the order with highest MPS (most probable symbol) probability. Specifically if any order was deterministic
(MPS P = 100%) it would be selected.
<P>
2. Fixed Weights
<P>
The simplest mixer is just to combine stages using fixed weights. For example :
<FONT COLOR=GREEN><PRE>
P = (3 * P1 + P2)/4
</PRE></FONT>
This is used very effectively in BCM, for example, which uses it in funny ways like mixing the probability pre-APM with the probability post-APM
using fixed weight ratios.
<P>
Fixed linear combinations like this of multi-speed adaptive counters is not really mixing, but is a way to create higher order IIR filters.
(as in the "two speed" counter from BALZ; that really makes a 2-tap IIR from 1-tap IIR geometric decay filters)
<P>
3. Per-file/chunk transmitted Fixed Weights
<P>
Instead of fixed constants for mixing, you can of course optimize the weights per file (or per chunk) and transmit them.
<P>
If you think of the goal of online gradient descent as finding the best fit of parameters and converging to a constant - then this makes perfect sense.
Why do silly neural net online learning when we can just statically optimize the weights?
<P>
Of course we don't actually want our online learners to converge to constants - we want them to keep adapting; there is no steady state in real world
data compression. All the online learning research which assumes convergence and learning rate decreasing over time and various asymptotic gaurantees -
it's all sort of wrong for us.
<P>
M1 (Mattern) uses optimized weights and gets quite high compression. This method is obviously very asymmetric, slow to encode & fast(er) to decode. You can also
approximate it by making some pre-optimized weights for various types of data and just doing data type detection to pick a weight set.
<P>
4. Functional Weighting
<P>
You can combine P's using only functions of the P's, with no separate state (weight variables that are updated over time).
The general idea here is that more skewed P's are usually much more powerful (more likely to be true) than P's near 50/50.
<P>
If you have one expert on your panel saying "I dunno, everything is equally likely" and another guy saying "definitely blue!" , you
are usually best weighting more to the guy who seems sure. (unless the positive guy is a presidential candidate, then you need to
treat lack of considering both sides of an argument or admitting mistakes as huge character flaws that make them unfit to serve).
<P>
In PPMZ this was used for combining the three SEE orders; the weighting used there was 1/H (entropy).
<FONT COLOR=GREEN><PRE>
weight as only a function of P
the idea is to give more weight to skewed P's
W = 1 / H(P) = 1 / ( P * log(P) + (1-P) * log(1-P) )
e^ -H would be fine too
very simple skews work okay too :
W = ABS(P - 0.5)
just higher weight for P away from 0.5
then linearly combine models :
P_mixed = Sum{ W_i * P_i } / Sum{ W_i }
</PRE></FONT>
This method has generally been superceded by better approaches in most cases, but it is still useful for creating a prior model
for initial seeding of mixers, such as for 2d APM mixing (below).
<P>
I'll argue in a future post that Mahoney's stretch/squash (logit/logistic sigmoid) is a form of this (among other things).
<P>
5. Secondary Estimation Mixing / 2d APM Mixing
<P>
As mentioned at the end of the last post, you can use a 2d APM for mixing. The idea is very simple, instead of one P as input to the APM, you
take two probabilities from different models and look them up as a 2d matrix.
<P>
You may want to use the stretch remap and bilinear filtering so that as few bits as possible are needed to index the APM table, in order to
preserve as much density as possible. (eg. 6 bits from each P = 12 bit APM table)
<P>
This allows you to model completely arbitrary mixing relationships, not just "some weight to model 1, some to model 2". It can do the fully P-input
dependent mixing, like if model 1 P is high, it gets most of the influence, but if model 1 P is low, then model 2 takes over.
<P>
The big disadvantage of this method is sparsity. It's really only viable for mixing two or three models, you can't do a ton. It's best on large stable
data sources, not small rapidly changing data, because the mixing weight takes longer to respond to changes.
<P>
One of the big advantages of single-scalar weighted mixing is that it provides a very fast adapting stage to your compressor. Say you have big models with N
states, N might be a million, so each individual context only gets an update every millionth byte. Then you might have a (1d) APM stage to adapt the P's;
this has 100 entries or so, so it only gets 1/100 updates. But then you blend the models with a scalar mixing weight - that updates after every byte. The
scalar mixing weight is always 100% up to date with recent changes in the data, it can model switches very quickly while the rest of the big model takes much
more time to respond. 2d APM Mixing loses this advantage.
<P>
6. Beta Weighting
<P>
Beta Weighting was introduced by CTW ; see also the classic work of Volf on switching, etc.
(annoyingly this means something different in stock trading).
<P>
Beta Weighting is an incremental weight update. It does :
<FONT COLOR=GREEN><PRE>
P_mixed = Sum W_i * P_i / Sum W_i
(linear weighted sum of model probabilities, normalized)
after seeing symbol X
W_i *= P_i(X)
</PRE></FONT>
We can understand this intuitively :
<FONT COLOR=GREEN><PRE>
If a model had a high probability for the actually occuring symbol (it was a good model),
it will have P_i large
W_i *= 0.9 or whatever
if the model had low probability for the actual symbol (it was wrong, claiming that symbol was unlikely),
it will get scaled down :
W_i *= 0.1 for example
</PRE></FONT>
Furthermore, the weights we get this way are just negative exponentials of the cost in each model.
That is :
<FONT COLOR=GREEN><PRE>
P = 2^ - L
L = codelen
The weights winds up being the product of all coded probabilities seen so far :
W at time t = P_(t-1) * P(t-2) * P(t-3) * P(t_4) ...
W = 2^( - Sum_t{ L_(t-1) + L(t-2) + L(t-3) ... } )
W = 2^( - codelen if you used this model for all symbols )
W ~ e^ - cost
</PRE></FONT>
If you like we can also think of this another way :
<FONT COLOR=GREEN><PRE>
since W is normalized, overall scaling is irrelevant
at each update, find the model that produced the lowest codelen (or highest P)
this is the model we would have chosen with an ideal selector
so the cost of any other model is :
cost_i = L_i - L_best
the CTW Beta weight rule is :
W_i *= 2^ - cost_i
(and the best model weight stays the same, its cost is zero)
or
W = 2^ - total_cost
total_cost += cost_i
</PRE></FONT>
<P>
In the special case of two models we can write :
<FONT COLOR=GREEN><PRE>
The normalized weight for model 1 is :
W1 = ( 2^-C1 ) / ( 2^-C1 + 2^-C2 ) = 1 / ( 1 + 2^(C1-C2) ) = log2istic( C2-C1 )
W2 = 1 - W1 = log2istic( C1-C2 )
"log2istic" = logistic sigmoid with base 2 instead of e
We only need to track the delta of costs , dC = C1-C2
P = log2istic(-dC) * P1 + log2istic(dC) * P2
(equivalently, just store W1 and do W1 *= P1/P2)
</PRE></FONT>
We'll see in the next post how logistic is a transform from delta-codelens to probabilities, and this will become neater then.
<P>
Now Beta weighting is obviously wrong. It weights every coding event equally, no matter how old it is. In compression we always want strong
recency modeling, so a better cost update is something like :
<FONT COLOR=GREEN><PRE>
dC' = dC * aging + rate * (C1-C2)
where aging = 0.95 or whatever , something < 1
and rate is not so much the learning rate (it could be normalized out) as it is an exponential modifier to the weighting.
</PRE></FONT>
Beta weighting as used in CTW stores a weight per node. Each node weights counts as seen at that level vs. the next deeper level.
In contrast, in context mixers like PAQ the weight for order N vs order N+1 is stored outside of the model, not per context but as a
global property of the data. The PAQ way is much more flexible because you can always use some context for the weight if you like.
<P>
Next post we'll get into online gradient descent.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/GLaEYn1ZNm8" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-67539029806839499002018-05-01T11:58:00.002-07:002018-05-01T20:31:32.989-07:00Secondary Estimation : From PPMZ SEE to PAQ APM
I want to ramble a bit about how Matt Mahoney uses Secondary Estimation in APM's (in his compressors PAQ and BBB), but to
get there I thought I'd go through a little story about what secondary estimation is all about.
<P>
Secondary Estimation began as a way to apply a (small) correction to a (large) primary model, but we will see in our journey
that it has transformed into a way of doing heavy duty modeling itself.
<P>
The simplest and perhaps earlier form of secondary estimation may have been bias factors to linear predictors.
<P>
Say you are predicting some signal like audio. A basic linear predictor is of the form :
<FONT COLOR=GREEN><PRE>
to code x[t]
pred = 2*x[t-1] - x[t-2]
delta = x[t] - pred
transmit delta
</PRE></FONT>
Now it's common to observe that "delta" has remaining predicability. The simplest case is if it just has
an overall bias; it tends to be > 0. Then you track the average observed delta and subtract it off to
correct the primary predictor. A more interesting case is to use some context to track different corrections.
For example :
<FONT COLOR=GREEN><PRE>
Previous delta as context :
context = delta > 0 ? 1 : 0;
pred = 2*x[t-1] - x[t-2]
delta = x[t] - pred
bias = secondary_estimation[ context ]
transmit (delta - bias)
secondary_estimation[ context ].update(bias)
</PRE></FONT>
Now we are tracking the average bias that occurs when previous delta was positive or negative, so if they
tend to occur in clumps we will remove that bias. ("update" means "average in in some way" which we will
perhaps revisit).
<P>
Another possibility is to use local shape information, something like :
<FONT COLOR=GREEN><PRE>
for last 4 values , track if delta was > 0
each of those is a bit
this makes a 4 bit context
look up bias using this context
</PRE></FONT>
Now you can detect patterns like the transition from flats to edges and apply different biases in each case.
<P>
The fundamental thing that we are trying to do here, which is a common theme is :
<P>
<CITE>
The primary model is a mathematical predictor, which extracts some fundamental information about the data.
<P>
The secondary estimator tracks how well that primary model is fitting the current observation set and
corrects for it.
</CITE>
<P>
In CALIC a form of secondary estimation very similar to this is used. CALIC uses a shape-adaptive gradient
predictor (DPCM lossless image coding). This is the primary model; essentially it tries to fix the local
pixels to one of a few classes (smooth, 45 degree edge, 90 degree edge, etc.) and uses different linear
predictors for each case. The prediction error from that primary model is then corrected using a table lookup
in a shape context. The table lookup tracks the average bias in that shape context. While the primary model
uses gradient predictors that are hard-coded (don't vary from image to image), the secondary model can correct
for how well those gradient predictors fit the current image.
<P>
I'm going to do another hand wavey example before we get into modern data compression.
<P>
A common form of difficult prediction that everyone is familiar with is weather prediction.
<P>
Modern weather prediction is done using ensembles of atmospheric models. They take observations of various
atmospheric variables (temperature, pressure, vorticity, clouds, etc.) and do a physical simulation to
evolve the fluid dynamics equations forward in time. They form a range of possible outcomes by testing how
variations in the input lead to variations in the output; they also create confidence intervals by using
ensembles of slightly different models. The result is a prediction with a probability for each possible
outcome. The layman is only shown a simplified prediction (eg. predict 80 degrees Tuesday) but the models
actually assign a probability to each possibility (80 degrees at 30% , 81 degrees at 15%, etc.)
<P>
Now this physically based simulation is a primary model. We can adjust it using a secondary model.
<P>
Again the simplest form would be if the physical simulation just has a general bias. eg. it predicts
0.5 degrees too warm on average. A more subtle case would be if there is a bias per location.
<FONT COLOR=GREEN><PRE>
primary = prediction from primary model
context = location
track (observed - primary)
store bias per context
secondary = primary corrected by bias[context]
</PRE></FONT>
Another correction we might want is the success of yesterday's prediction. You can imagine that
the prediction accuracy is usually better if yesterday's prediction was spot on. If we got yesterday
grossly wrong, it's more likely we will do so again. Now in the primary physical model, we might
be feeding in yesterday's atmospheric conditions as context already, but it is not using yesterday's
prediction. We are doing secondary estimation using side data which is not in the primary model.
<P>
You might also imagine that there is no overall bias to the primary prediction, but there is bias
depending on what value it output. And furthermore that bias might depend on location. And it might
depend on how good yesterday's prediction was. Now our secondary estimator is :
<FONT COLOR=GREEN><PRE>
make primary prediction
context = value of primary, quantized into N bits
(eg. maybe we map temperature to 5 bits, chance of precipitation to 3 bits)
context += location
context += success of yesterday's prediction
secondary = primary corrected by bias[context]
</PRE></FONT>
Now we can perform quite subtle corrections. Say we have a great overall primary predictor.
But in Seattle, when the primary model predicts a temperature of 60 and rain at 10% , it tends to
actually rain 20% of the time (and more likely if we got yesterday wrong).
<P><HR><P>
<B>SEE in PPMZ</B>
<P>
I introduced Secondary Estimation (for escapes) to context modeling in PPMZ.
<P>
(See "Solving the Problems of Context Modeling" for details; I was obviously quite modest about my work
back then!)
<P>
Let's start with a tiny bit of background on PPM and what was going on in PPM development back then.
<P>
PPM takes a context of some previous characters (order-N = N previous characters) and tracks what
characters have been seen in that context. eg. after order-4 context "hell" we may have seen space ' '
3 times and 'o' 2 times (and if you're a bro perhaps 'a' too many times).
<P>
Classic PPM starts from deep high order contexts (perhaps order-8) and if the character to be coded
has not been seen in that context order, it sends an "escape" and drops down to the next lower context.
<P>
The problem of estimating the escape probability was long a confounding. The standard approach was to
use the count of characters seen in the context to form the escape probability.
<FONT COLOR=GREEN><PRE>
A nice summary of classical escape estimators from
"Experiments on the Zero Frequency Problem" :
n = the # of tokens seen so far
u = number of unique tokens seen so far
ti = number of unique tokens seen i times
PPMA : 1 / (n+1)
PPMB : (u - t1) / n
PPMC : u / (n + u)
PPMD : (u/2) / n
PPMP : t1/n - t2/n^2 - t3/n^3 ...
PPMX : t1 / n
PPMXC : t1 / n if t1 < n
u / (n + u) otherwise
</PRE></FONT>
eg. for PPMA you just leave the escape count at 1. For PPMC, you increment the escape count each time a
novel symbol is seen. For PPMD, you increment the escape count by 1/2 on a novel symbol (and also set
the novel characters count to 1/2) so the total always goes up by 1.
<P>
Now some of these have theoretical justifications based on Poisson process models and Laplacian priors
and so on, but the correspondence of those priors to real world data is generally poor (at low counts).
<P>
The big difficult with PPM (and data compression in general) is that low counts is where we do our
business. We can easily get contexts with high counts (sufficient statistics) and accurate estimators
by dropping to lower orders. eg. if you did PPM and limit your max order to order-2 , you wouldn't have
this problem. But we generally want to push our orders to the absolute highest breaking point, where
our statistical density goes to shit. Big compression gains can be found there, so we need to be
able to work where we may have only seen 1 or 2 events in a context before.
<P>
Around the time I was doing PPMZ, the PPM* and PPMD papers by Teahan came out
("Unbounded length contexts for PPM" and "Probability estimation for PPM"). PPMD was a new better
estimator, but to me it just screamed that something was wrong. All these semi-justified mathematical
forms for the escape probability seemed to just be the wrong approach. At the same time, PPM* showed
the value of long deterministic contexts, where by definition "u" (the number of unique symbols) is
always 1 and "n" (total) count is typically low.
<P>
We need to be able to answer a very important and difficult question. In a context that has only seen one
symbol one time, what should the escape probability be? (equivalently, what is the probability that one
symbol is right?). There is no simple formula of observed counts in that context that can solve that.
<P>
Secondary Estimation is the obvious answer. We want to know :
<FONT COLOR=GREEN><PRE>
P(esc) when only 1 symbol has been seen in a context
</PRE></FONT>
We might first observe that this P(esc) varies between files. On more compressible files, it is much
lower. On near-random files, it can be quite high. This is a question of to what extent is this
novel deterministic context a reflection of the end statistics, or just a mirage due to sparse statistics.
<P>
That is,
<FONT COLOR=GREEN><PRE>
In an order-8 context like "abcdefgh" we have only even seen the symbol 'x' once
Is that a reflection of a true pattern? That "abcdefgh" will always predict 'x' ?
Or is the following character in fact completely random, and if we saw more symbols we
would see all of them with equal probability
</PRE></FONT>
Again you don't have the information within the context to tell the difference. Using an average
across the whole file is a very rough guess, but obviously you could do better.
<P>
You need to use out-of-context observations to augment the primary model.
Rather than just tracking the real P(esc) for the whole file we can obviously use some context to find
portions of the data where it behaves differently. For example P(esc) (when only 1 symbol has been seen in a context)
might be 75% if the parent context has high entropy; it might be 50% if the order is 6, it might be 10% if order is 8
and the previous symbols were also coded from high order with no escape.
<P>
In PPMZ, I specifically handled only low escape counts and totals; that is, it was specifically trying to solve this
sparse statistics problem. The more modern approach (APM's, see later) is to just run all probabilities through
secondary statistics.
<P>
The general approach to secondary estimation in data compression is :
<FONT COLOR=GREEN><PRE>
create a probability from the primary model
take {probability + some other context} and look that up in a secondary model
use the observed statistics in the secondary model for coding
</PRE></FONT>
The "secondary model" here is typically just a table. We are using the observed statistics in one model as the input to another model.
In PPMZ, the primary model is large and sparse while the secondary model is small and dense, but that need not always be the case.
<P>
The secondary model is usually initialized so that the input probabilities pass through if nothing has been observed yet. That is,
<FONT COLOR=GREEN><PRE>
initially :
secondary_model[ probability + context ] = probability
</PRE></FONT>
Secondary estimation in cases like PPM can be thought of as a way of sharing information across sparse contexts that are not
connected in the normal model.
<P>
That is, in PPM contexts that are suffixes have a parent-child relationship and can easily share information; eg. "abcd" and "bcd" and "cd"
are connected and can share observations. But some other context "xyzw" is totally unconnected in the tree. Despite that, by having
the same statistics they may be quite related! That is
<FONT COLOR=GREEN><PRE>
"abcd" has seen 3 occurances of symbol 'd' (and nothing else)
"xyzw" has seen 3 occurances of symbol 'd' (and nothing else)
</PRE></FONT>
these are probably very similar contexts even though they have no connection in the PPM tree. The next coding event that happens in either
context, we want to communicate to the other. eg. say a symbol 'e' now occurs in "abcd" - that makes it more likely that "xyzw" will have
an escape. That is, 3-count 'd' contexts are now less likely to be deterministic.
The secondary estimation table allows us to accumulate these disparate contexts together and merge their observations.
<P>
The PPMZ SEE context is made from the escape & total count, as well as three different orders of previous symbol bits. The three orders
and then blended together (an early form of weighted mixing!). I certainly don't recommend the exact details of how PPMZ does it today.
The full details of the PPMZ SEE mechanism can be found in
<A HREF="http://www.cbloom.com/papers/ppmz.pdf"> PPMZ (Solving the Problems of Context Modeling) (PDF) </A> .
You may also wish to consult the later work by
Shkarin "PPM: one step to practicality" (PPMd/PPmonstr/PPMii) which is rather more refined. Shkarin adds some good ideas to the context
used for SEE, such as using the recent success so that switches between highly compressible and less compressible data can be modeled.
<P>
Let me repeat myself to be clear. Part of the purpose & function of secondary estimation is to find patterns where the observed counts in a
state to do not linearly correspond to the actual probability in any way.
<P>
That is, secondary estimation can model things like :
<FONT COLOR=GREEN><PRE>
Assuming a binary coder
The coder sees 0's and 1's
Each context state tracks n0 and n1 , the # or 0's and 1's seen in that context
On file A :
if observed { n0 = 0, n1 = 1 } -> actual P1 is 60%
if observed { n0 = 0, n1 = 2 } -> actual P1 is 70%
if observed { n0 = 0, n1 = 3 } -> actual P1 is 90%
On file B :
if observed { n0 = 0, n1 = 1 } -> actual P1 is 70%
if observed { n0 = 0, n1 = 2 } -> actual P1 is 90%
if observed { n0 = 0, n1 = 3 } -> actual P1 is 99%
The traditional models are of the form :
P1 = (n1 + c) / (n0 + n1 + 2c)
and you can make some Bayesian prior arguments to come up with different values of c (1/2 and 1 are common)
</PRE></FONT>
The point is there is *no* prior that gets it right. If the data actually came from a Laplacian or Poisson source, then observing counts
you could make estimates of the true P in this way. But context observations do not work that way.
<P>
There are lots of different effects happening. One is that there is multiple "sources" in a file. There might be one source that's pure random,
one source is purely predictable (all 0's or all 1's), another source is in fact pretty close to a Poisson process. When you have only a few
observations in a context, part of the uncertainty is trying to guess which of the many sources in the file that context maps to.
<P><HR><P>
<B> Adaptive Probability Map (APM) ; secondary estimation & beyond. </B>
<P>
Matt Mahoney's PAQ (and many other modern compressors) make heavy use of secondary estimation, not just for escapes, but for every coding event.
Matt calls it an "APM" , and while it is essentially the same thing as a secondary estimation table, the typical usage and some details are a bit
different, so I will call them APMs here to distinguish.
<P>
PPM-type compressors hit a dead end in their compression ratio due to the difficult of doing things like mixing and secondary estimation in
character alphabets. PAQ, by using exclusively binary coding, only has one probability to work with (the probability of a 0 or 1, and then the
other is inferred), so it can be easily transformed. This allows you to apply secondary estimation not just to the binary escape event, but to
all symbol coding.
<P>
An APM is a secondary estimation table. You take a probability from a previous stage, look it up in a table (with some additional
context, optionally), and then use the observed statistics in the table, either for coding or as input to another stage.
<P>
By convention, there are some differences in typical implementation choices. I'll describe a typical APM implementation :
<FONT COLOR=GREEN><PRE>
probability P from previous stage
P is transformed nonlinearly with a "stretch"
[stretch(P) + context] is looked up in APM table
observed P there is passed to next stage
optionally :
look up both floor(P) and ceil(P) in some fixed point to get two adjacent APM entries
linearly interpolate them using fractional bits
</PRE></FONT>
A lot of the details here come from the fact that we are passing general P's through the APM, rather than restricting to only low counts
as in PPMZ SEE. That is, we need to handle a wide range of P's.
<P>
Thus, you might want to use a fixed point to index the APM table and then linearly interpolate (standard technique for tabulated function
lookup); this lets you use smaller, denser tables and still get fine precision.
<P>
The "stretch" function is to map P so that we quantize into buckets where we need them. That is, if you think in terms of P in [0,1] we
want to have smaller buckets at the endpoints. The reason is that P steps near 0 or 1 make a very large difference in codelen, so getting
them right there is crucial. That is, a P difference of 0.90 to 0.95 is much more important than from 0.50 to 0.55 ; instead of having variable
bucket sizes (smaller buckets at the end) we use uniform quantization but stretch P first (the ends spread out more). One nice option for
stretch is the "logit" function.
<FONT COLOR=GREEN><PRE>
LPS symbol codelen (LPS = less probable symbol)
steps of P near 0.5 produce small changes in codelen :
P = 0.5 : 1 bit
P = 0.55 : 1.152 bits
steps of P near 1.0 produce big changes in codelen :
P = 0.9 : 3.32193 bits
P = 0.95 : 4.32193 bits
</PRE></FONT>
The reason why you might want to do this stretch(P) indexing to the APM table (rather than just look up P)
is again density. With the stretch(P) distortion, you can get away with only 5 or 6 bits of index, and
still get good resolution where you need it. With linear P indexing you might need a 10 bit table size for
the same quality. That's 32-64 entries instead of 1024 which is a massive increase in how often each slot
gets updated (or a great decrease in the average age of statistics; we want them to be as fresh as possible).
With such course indexing of the APM, the two adjacent table slots should be used (above and below the input P)
and linearly interpolated.
<P>
APM implementations typically update the observed statistics using the "probability shift" method (which
is equivalent to constant total count or geometric decay IIR).
<P>
An APM should be initialized such that it passes through the input probability. If you get a probability
from the previous stage, and nothing has yet been observed in the APM's context, it passes through that
probability. Once it does observe something it shifts the output probability towards the observed
statistics in that context.
<P>
We can see something interesting that APM's can do already :
<P>
I wrote before about how secondary estimation is crucial in PPM to handle sparse contexts with very
few events. It allows you to track the actual observed rates in that class of contexts. But we
thought of SEE as less important in dense contexts.
<P>
That is true only in non-time-varying data. In the real world almost all data is time varying. It can shift
character, or go through different modes or phases.
In that sense all contexts are sparse, even if they have a lot of observed counts, because they are sparse
in observation of recent events. That is, some order-4 context in PPM might have 100 observed counts,
but most of those were long in the past (several megabytes ago). Only a few are from the last few bytes,
where we may have shifted to a different character of data.
<P>
Running all your counts through an APM picks this up very nicely.
<FONT COLOR=GREEN><PRE>
During stable phase :
look up primary model
secondary = APM[ primary ]
when primary is dense, secondary == primary will pass through
Now imagine we suddenly transition to a chunk of data that is nearly random
the primary model still has all the old counts and will only slowly learn about the new random data
but the APM sees every symbol coded, so it can learn quickly
APM[] will quickly tend towards APM[P] = 0.5 for all P
that is, the input P will be ignored and all states will have 50/50 probability
</PRE></FONT>
This is quite an extreme example, but more generally the APM can pick up regions where we move to
more or less compressible data. As in the PPM SEE case, what's happening here is sharing of
information across parts of the large/sparse primary model that might not otherwise communicate.
<P>
Let me repeat that to be clear :
<P>
Say you have a large primary model, with N context nodes. Each node is only updated at a rate of (1/N) times per byte
processed. The APM on the other hand is updated every time. It can adapt much faster - it's sharing information across
all the nodes of the primary model. You may have very mature nodes in the primary model with counts like {n0=5,n1=10}
(which we would normally consider to be quite stable). Now you move to a region where the probability of a 1 is 100%.
It will take *tons* of steps for the primary model to pick that up and model it, because the updates are scattered all
over the big model space. Any one node only gets 1/N of the updates, so it takes ~N*10 bytes to get good statistics for
the change.
<P>
An interesting common way to use APM's is in a cascade, rather than a single step with a large context.
<P>
That is, you want to take some P from the primary model, and you have additional contexts C1 and C2 to condition the APM step.
You have two options :
<FONT COLOR=GREEN><PRE>
Single step, large context :
P_out = APM[ {C1,C2,P} ]
Two steps, small context, cascade :
P1 = APM1[ {C1,P} ]
P_out = APM2[ {C2,P1} ]
</PRE></FONT>
That is, feed P through an APM with one context, then take that P output and pass it through another APM with a different context,
as many times as you like.
<P>
Why use a cascade instead of a large context? The main reason is density.
<P>
If you have N bits of extra context information to use in your APM lookup, the traditional big table approach dilutes your statistics by 2^N.
The cascade approach only dilutes by *N.
<P>
The smaller tables of the APM cascade mean that it cannot model certain kinds of correlations. It can only pick up ways that each context
correlated to biases on P. It cannot pick up joint context correlations.
<FONT COLOR=GREEN><PRE>
The APM cascade can model things like :
if C1 = 0 , high P's tend to skew higher (input P of 0.7 outputs 0.9)
if C2 = 1 , low P's tend to skew lower
But it can't model joint things like :
if C1 = 0 and C2 = 1 , low P's tend towards 0.5
</PRE></FONT>
One nice thing about APM cascades is that they are nearly a NOP when you add a context that doesn't help. (as opposed to the single table
large context method, which has a negative diluting effecting when you add context bits that don't help). For example if C2 is just not
correlated to P, then APM2 will just pass P through unmodified. APM1 can model the correlation of C1 and P without being molested by the
mistake of adding C2. They sort of work independently and stages that aren't helping turn themselves off.
<P>
One funny thing you can do with an APM cascade is to just use it for modeling with no primary model at all. This looks like :
<FONT COLOR=GREEN><PRE>
Traditional usage :
do order-3 context modeling to generate P
modify observed P with a single APM :
P = APM[P]
APM Cascade as the model :
P0 = order-0 frequency
P1 = APM1[ o1 | P0 ];
P2 = APM2[ o2 | P1 ];
P3 = APM3[ o3 | P2 ];
</PRE></FONT>
That is, just run the probability from each stage through the next to get order-N modeling.
<P>
We do this from low order to high, so that each stage can add its observation to the extent it has seen events. Imagine that the current
order2 and order3 contexts have not been observed at all yet. Then we get P1 from APM1, pass it through APM2, since that has seen nothing it
just passes P1 through, then APM3 does to. So we wind up getting the lowest order that has actually observed things.
<P>
This method of using an APM cascade is the primary model is used by Mahoney in BBB.
<P>
A 2d APM can also be used as a mixer, which I think I will leave for the next post.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/VVf6eOPhiuE" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-51053606408876715012018-04-18T15:34:00.003-07:002018-10-27T19:28:49.416-07:00Race in EOF on Windows Pipes
There appears to be a race in the EOF check on Windows pipes.
I haven't seen this written about elsewhere, so here it is.
<P>
TLDR : don't call eof on pipes in Windows! Use read returning zero instead to detect eof.
<P>
We observed that semi-randomly pipes would report EOF too soon and we'd get a truncated stream.
(this is not the issue with ctrl-Z ; of course you have to make sure your pipe is set to binary).
<P>
To reproduce the issue you may use this simple piper :
<FONT COLOR=GREEN><PRE>
// piper :
#include <code><</code>fcntl.h>
#include <code><</code>io.h>
int main(int argc,char *argv[])
{
int f_in = _fileno(stdin);
int f_out = _fileno(stdout);
// f_in & out are always 0 and 1
_setmode(f_in, _O_BINARY);
_setmode(f_out, _O_BINARY);
char buf[4096];
// NO : sees early eof!
// don't ask about eof until _read returns 0
while ( ! _eof(f_in) )
// YES : just for loop here
//for(;;)
{
int n = _read(f_in,buf,sizeof(buf));
if ( n > 0 )
{
_write(f_out,buf,n);
}
else if ( n == 0 )
{
// I think eof is always true here :
if ( _eof(f_in) )
break;
}
else
{
int err = errno;
if ( err == EAGAIN )
continue;
// respond to other errors?
if ( _eof(f_in) )
break;
}
}
return 0;
}
</PRE></FONT>
the behavior of piper is this :
<FONT COLOR=GREEN><PRE>
vc80.pdb 1,044,480
redirect input & ouput :
piper.exe < vc80.pdb > r:\out
r:\out 1,044,480
consistently copies whole stream, no problems.
Now using a pipe :
piper.exe < vc80.pdb | piper > r:\out
r:\out 16,384
r:\out 28,672
r:\out 12,288
semi-random output sizes due to hitting eof early
</PRE></FONT>
If the eof check marked with "NO" in the code is removed, and the for loop is used instead, piping
works fine.
<P>
I can only guess at what's happening, but here's a shot :
<P>
If the pipe reader asks about EOF and there is nothing currently pending to read in the pipe, eof() returns true.
<P>
Windows anonymous pipes are unbuffered. That is, they are lock-step between the reader & writer. When the reader
calls read() it blocks until the writer puts something, and vice-versa. The bytes are copied directly from writer to
reader without going through an internal OS buffer.
<P>
In this context, what this means is if the reader drains out everything the writer had to put, and then races ahead and
calls eof() before the writer puts anything, it sees eof true. If the writer puts something first, it seems eof false.
It's just a straight up race.
<FONT COLOR=GREEN><PRE>
time reader writer
-----------------------------------------------------------
0 _write blocks waiting for reader
1 _eof returns false
2 _read consumes from writer
3 _write blocks waiting for reader
4 _eof returns false
5 _read consumes from writer
6 _eof returns true
7 _write blocks waiting for reader
</PRE></FONT>
at times 1 and 4, the eof check returns false because the writer had gotten to the pipe first. At time 6 the
reader runs faster and checks eof before the writer can put anything, now it seems eof is true.
<P>
As a check of this hypothesis : if you add a Sleep(1) call immediately before the _eof check (in the loop), there is no early eof
observed, because the writer always gets a chance to put data in the pipe before the eof check.
<P>
Having this behavior be a race is pretty nasty. To avoid the problem, never ask about eof on pipes in Windows,
instead use the return value of read(). The difference is that read() blocks on the writer process, waiting for
it to either put some bytes or terminate. I believe this is a bug in Windows. Pipes should never be returning EOF as long as
the process writing to them is alive.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/q7BR7fmKl3E" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-9781041460656883642018-04-18T10:15:00.002-07:002018-04-18T10:15:16.231-07:00The Kraft Number, Binary Arithmetic, and Incremental Package Merge
The Kraft number that we compute for
<A HREF="http://cbloomrants.blogspot.com/2018/04/engel-coding-and-length-limited-huffman.html">length limited Huffman codelen construction</A>
can be thought of as a set of bit flags that tell you which codelens to change.
<P>
Recall
<FONT COLOR=GREEN><PRE>
K = Sum{ 2^-L_i }
K <= 1 is prefix codeable
</PRE></FONT>
you can think of K as the sum of effective coding probabilities (P = 2^-L), so K over 1 is a total probability over 100%.
<P>
when we initially make Huffman codelens and then apply a limit, we use too much code space. That corresponds to K > 1.
<P>
If you write K as a binary decimal, it will be something like :
<FONT COLOR=GREEN><PRE>
K = 1.00101100
</PRE></FONT>
Those 1 bits that are below K = 1.0 are exactly the excess codelens that we need to correct.
<P>
That is, if you have an extra symbol of length L, that is K too high by 2^-L , that's a 1 bit in the binary at position L to the right of
the decimal.
<P>
<FONT COLOR=GREEN><PRE>
codelen set of {1,2,2,3}
a : len 1
b : len 2
c : len 2
d : len 3
K = 2^-1 + 2^-2 + 2^-2 + 2^-3
K = 0.100 +
0.010 +
0.010 +
0.001
K = 1.001
take (K-1) , the part below the decimal :
K-1 = .001
we have an excess K of 2^-3 ; that's one len 3 too many.
To fix that we can change a code of len 2 to len 3
that does
-= 0.010
+= 0.001
same as
-= 0.001
That is, when (K-1) has 2 leading zeros, you correct it by promoting a code of len 2 to len 3
</PRE></FONT>
so if we compute K in fixed point (integer), we can just take K - one and do a count leading zeros on it, and it tells you which code len to
correct. The bits that are on in K tell us exactly what's wrong with our code.
<P>
Now, similarly, we can think of the compound operations in the same way.
<P>
Any time we need to do a correction of K by 2^-L we could do one code from (L-1) -> L , or we could do two codes from L -> (L+1), or ...
that can be seen as just an expression of the mathematics of how you can add bits together to make the desired delta.
<P>
That is :
<FONT COLOR=GREEN><PRE>
You want 0.001 (increment codelen 2 to 3)
that can be :
0.001 = 0.0001
+0.0001
(increment two lower lens)
or :
0.001 = 0.01
-0.001
increment a lower codelen then decrement one at your level
</PRE></FONT>
Now, thinking about it this way we can try to enumerate all the possible moves.
<P>
To reduce the space of all possible moves, we need a few assumptions :
<P>
1. No non-len-group changing moves are profitable. That is, the set of symbols for the current len groups are the best possible set.
eg. it's not profitable to do something like { symbol a from len 2 to 3 and symbol b from len 3 to 2 } . If there are any profitable moves
like that, do them separately. What this means is the counts are sorted; eg. if a symbol is at a higher codelen, its count is less equal
the count of any symbol at lower codelen.
<P>
2. I only need to enumerate the moves that can be the cheapest (in terms of total code len).
<P>
In that case I think that you can enumerate all the moves thusly :
<FONT COLOR=GREEN><PRE>
a change of 2^-L can be accomplished via
inc(L-1) (that means increment a codelen at L-1)
inc(L) can be substitituted with { inc(L+1), inc(L+1) }
And each of those inc(L+1) can also be substituted with a pair at L+2.
You take either a codelen at L or two at the next higher len,
whichever has a smaller effect on CL (total codelen).
a change of 2^-L can also be accomplished via :
inc(L-2) and dec(L-1)
OR
inc(L-3) and dec(L-2) and dec(L-1)
again these can be understood as binary decimals :
0.0001 = 0.0010 - 0.0001
0.0001 = 0.0100 - 0.0011
0.0001 = 0.1000 - 0.0111
and finally the decs are also a tree of pairs :
dec(L) can instead be { dec(L+1), dec(L+1) }
</PRE></FONT>
this seems like a lot, but compared to all possible ways to make the number X from adds & subtractions of any power of two, it's quite small.
The reason we can consider such a reduced set of moves is because we only need the one best way to toggle a bit of K, not all possible ways.
<P>
Really we just do :
<FONT COLOR=GREEN><PRE>
enumerate the position of the lowest codelen to inc
between 1 and (L-1)
decrement at all codelens below the one you incremented
down to (L-1)
this produce the desired change in K of 2^-L
each "inc" & "dec" can either be at that code len, or a pair of next codelen
</PRE></FONT>
(somebody smarter than me : prove that these are in fact all the necessary moves)
<P>
Let's look at how dynamic programming reduces the amount of work we have to do.
<FONT COLOR=GREEN><PRE>
Say we need to do an inc() at L = 3
(inc means increment a codelen, that decreases K by 2^-4)
We can either increment a single symbol at L = 3
or a pair from L = 4
(this is just the same kind of operation you do in normal Huffman tree building)
The best symbol at L = 3 is just the lowest count symbol (if any exists)
Ask for the best two nodes at L = 4
Those can also be a single symbol, or a pair from L = 5
When you ask for the first node at L = 4, it gets the best two at L = 5
but then imagine the single symbol at L = 4 was lower count and is taken
Now you ask for the second node at L = 4, it again needs the best two at L = 5
we already have them, no more work is needed.
Any time you chose a symbol rather than a pair of higher len, the 2^n tree stops growing.
[3] -> { [4a], [4b] }
[4a] -> sym(4) or { [5a], [5b] }
[4a] takes sym(4)
[4b] -> sym2(4) or { [5a], [5b] }
[4b] doesn't need a new evaluation at level 5
</PRE></FONT>
Another issue I need to mention is that as you increment and decrement codelens, they move between
the lists, so the cached dynamic programming lists cannot be retained, or can they?
(for example, you want to keep the symbols at each len sorted by count)
<P>
In fact the accounting for symbols moving is simple and doesn't need to invalidate the cached lists.
<P>
<FONT COLOR=GREEN><PRE>
When you do an inc(L) , that symbol moves to L+1 and is now available for a further inc(L+1)
(this does not occur with dec(L) since it moves in the opposite direction)
Say you wanted an inc(3) ; you consider doing a pair of { inc(4), inc(4) }
One of the inc(4)'s can be a pair of inc(5)'s , and one of those len 5 symbols can be the one you did inc 4 on.
That is, say you have 'A' at len 4 and 'B' at len 5
inc(3) <- { inc( 'A' @ 4) , { (inc 'A' @ 5) , inc( 'B' @ 5 } }
This is a legal move and something you have to consider.
But the rule for it is quite simple - if a symbol occurs earlier in the list of chosen increments, it is
available at the next level.
</PRE></FONT>
If you're familiar with the way Package Merge makes its lists, this is quite similar. It just means
when you choose the lowest count symbol at the current level, you can also draw from the previous
increments in your list if they have lower count.
<P>
These queues we are building are exactly the same thing you would need to do the full package merge
algorithm. The difference is, in traditional Package Merge you would start with all the symbols
at codelen = max (K is too low), and then incrementally apply the best decrements to increase K.
Here we are starting with K pretty close to 1 , with K greater than 1. The result is that in many cases
we can do far fewer package merge steps. I call this Incremental Package Merge. It allows you to start
from a nearly-Kraft set of codelens and get to the same optimal solution as if you did full package merge.
<P>
Let's look at a concrete example or two :
<FONT COLOR=GREEN><PRE>
codelen : symbol+count
len 3 : a+7
len 4 : b+3 , c+3
len 5 : d+1 , e+1
you need an inc(3) to get K -= 2^-4
you can :
inc(a) ; cost 7
inc(b + c) ; cost 6
inc(b + { d + e } ) ; cost 5
The best inc(4) is {b,d,e}
Another example :
len 3 : a+7
len 4 : b+2
len 5 : c+1
again say you want an inc(3)
the best is
inc( b + { b + c } ) ; cost 5
here the best option is to inc b twice
</PRE></FONT>
And finally let's think again about how Package Merge is related to the "numismatic" or "coin collector
problem".
<P>
If you play with this you will see what we're really doing is working on a two-cost accountancy.
Each symbol has a cost in K which is determined only by its current codelen (2^-L). It also has a cost
in total codelen CL which is detemined only by its symbol count. We are trying to pay off a debt in K
(or spend a credit in K) and maximize the value we get in CL.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/88lLoDX5TIs" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com0tag:blogger.com,1999:blog-5246987755651065286.post-66856562342387802012018-04-04T09:44:00.002-07:002018-04-04T17:11:38.090-07:00Engel Coding and Length Limited Huffman Heuristic Revisited
Joern Engel has written about a technique he described to me for forming prefix code lengths
<A HREF="https://github.com/JoernEngel/joernblog/blob/master/engel_coding.md">here on his blog</A>.
<P>
Engel Coding is a fast/approximate way of forming length limited prefix code lengths.
<P>
I'm going to back up first and remind us what a prefix code is and the use of the Kraft inequality.
<P>
We want to entropy code some alphabet using integer code lengths. We want those codes to be decodeable
without side information, eg. by only seeing the bits themselves, not transmitting the length in bits.
<FONT COLOR=GREEN><PRE>
0 : a
10 : b
11 : c
</PRE></FONT>
is a prefix code. If you see the sequence "11100" it can only decode to "11=c,10=b,0=a".
<FONT COLOR=GREEN><PRE>
0 : a
1 : b
11 : c
</PRE></FONT>
is not a prefix code. Any occurance of "11" can either be two b's or one c. They can't be resolved without
knowing the length of the code. There is no bit sequence you can possibly assign to c that works.
The fact that this is impossible is a function of the code lengths.
<P>
You can construct a prefix code from code lengths if and only if the lengths satisfy the Kraft inequality :
<FONT COLOR=GREEN><PRE>
Sum{ 2^-L_i } <= 1
</PRE></FONT>
It's pretty easy to understand this intuitively if you think like an arithmetic coder. 2^-L is the effective
probability for a code length, so this is just saying the probabilities must sum to <= 100%
<P>
That is, think of the binary code as dividing the range [0,1] like an arithmetic coder does. The first bit
divides it in half, so a single bit code would take half the range. A two bit code takes half of that, so a quarter
of the original range, eg. 2^-L.
<P>
The two numbers that we care about are the Kraft code space used by the code lengths, and the total code length
of the alphabet under this encoding :
<FONT COLOR=GREEN><PRE>
Kraft code space :
K = Sum{ 2^-L_i }
Total code length :
CL = Sum{ C_i * L_i }
L_i = code length in bits of symbol i
C_i = count of symbol i
Minimize CL subject to K <= 1 (the "Kraft inequality")
</PRE></FONT>
We want the minimum total code length subject to the prefix code constraint.
<P>
The well known solution to this problem is Huffman's algorithm. There are of course lots of other ways to
make prefix code lengths which do not minimize CL. A famous historical one is Shannon-Fano coding, but
there have been many others, particularly in the early days of data compression before Huffman's discovery.
<P>
Now for a length-limited code we add the extra constraint :
<FONT COLOR=GREEN><PRE>
max{ L_i } <= limit
</PRE></FONT>
now Huffman's standard algorithm can't be used. Again the exact solution is known; to minimize CL under the
two constraints of the Kraft inequality and the maximum codelength limit, the algorithm is "Package Merge".
<P>
In Oodle we (uniquely) actually use Package Merge at the higher compression levels, but it is too slow and
complicated to use when you want fast encoding, so at the lower compression levels we use a heuristic.
<P>
The goal of the heuristics is to find a set of code lengths that satisfy the contraints and get CL reasonably
close to the minimum (what Package Merge would find).
<P>
The Oodle heuristic works by first finding the true Huffman code lengths, then if any are over the limit, they are
changed to equal the limit. This now violates the Kraft inequality (they are not prefix decodeable), so we apply
corrections to get them to K = 1. ZStd uses a similar method (and I imagine lots of other people have in the past;
this is pretty much how length-limited near-Huffman is done). My previous post on the heuristic length limited code
is below, with some other Huffman background :
<P>
<A HREF="http://cbloomrants.blogspot.com/2010/07/07-03-10-length-limitted-huffman-codes.html"> cbloom rants: 07-03-10 - Length-Limitted Huffman Codes Heuristic </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2010/07/07-02-10-length-limitted-huffman-codes.html"> cbloom rants 07-02-10 - Length-Limitted Huffman Codes </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2009/05/05-22-09-little-more-huffman.html"> cbloom rants 05-22-09 - A little more Huffman </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2010/08/08-12-10-lost-huffman-paper.html"> cbloom rants 08-12-10 - The Lost Huffman Paper </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2015/10/huffman-performance.html"> cbloom rants Huffman Performance </A> <BR>
<A HREF="http://cbloomrants.blogspot.com/2016/04/huffman-correction.html"> cbloom rants Huffman Correction </A> <BR>
<P>
(Engel correctly points out that most of the places where I say "Huffman coding" I should really be saying "prefix coding". The decoding
methods and canonical code assignment and so on can be done with any prefix code. A Huffman code is only the special case of a prefix code
with optimal lengths. That is, Huffman's algorithm is only the part about code length assignment; the rest is just prefix coding.)
<P>
So Engel's idea is : if we're going to limit the code lengths and muck them up with some heuristic anyway, don't bother with first
finding the optimal non-length-limited Huffman code lengths. Just start with heuristic code lengths.
<P>
His heuristic is (conceptually) :
<FONT COLOR=GREEN><PRE>
L_i = round( log2( P_i ) )
</PRE></FONT>
which is intuitively a reasonable place to start. If your code lengths didn't need to be an integer number of bits, then you would
want them to be as close to log(P) as possible.
<P>
Then apply the limit and fix the lengths to satisfy the Kraft inequality. Note that in this case the tweaking of lens to satisfy Kraft
is not just caused by lens that exceed the limit. After the heuristic codelens are made, even if they are short, they might not be
Kraft. eg. you can get code lengths like { 2,3,3,3,3,3,4,4,4 } which are not prefix (one of the 3's need to be changed to a 4).
The idea is that unlike Huffman or Shannon-Fano which explicitly work by creating a prefix code by construction, Engel coding instead
makes code lengths which could be non-prefix and relies on a fix up phase.
<P>
When Joern told me about this it reminded me of "Polar Coding" (Andrew Polar's, not the more common use of the term for error correction).
Andrew Polar's code is similar in the sense that it tries to roughly assign log2(P) codelens to symbols, and then uses a fix-up phase to
make them prefix. The details of the heuristic are not the same.
(I suspect that there are lots of these heuristic entropy coders that have been invented over the years and usually not written down).
<P>
Obviously you don't actually want to do a floating log2; for the details of Engel's heuristic
<A HREF="https://github.com/JoernEngel/joernblog/blob/master/engel_coding.md">see his blog</A>.
<P>
But actually the details of the initial codelen guess is not very important to Engel coding. His codelen adjustment phase
is what actually determines the codelens. You can start the codelens all at len 1 and let the adjustment phase do all
the work to set them, and in fact that gives the same final codelens!
<P>
I tried four methods of initial codelen assignment and they all produced the exact same final codelens.
The only difference is how many steps of the iterative refinement were needed to get them to Kraft equality.
<FONT COLOR=GREEN><PRE>
all initial codelens = 1 : num_adjustment_iterations = 2350943
codelens = floor( log2(P) ) : num_adjustment_iterations = 136925
codelens = round( log2(P) ) : num_adjustment_iterations = 25823
Engel Coding heuristic : num_adjustment_iterations = 28419
</PRE></FONT>
The crucial thing is how the refinement is done.
<P>
To get to the refinement, let's go over some basics. I'll first describe the way we were actually doing
the length limit heuristic in Oodle (which is not the same as what I described in the old blog post above).
<P>
In the Oodle heuristic, we start with Huffman, then clamp the lens to the limit. At this point, the Kraft K
is too big. That is, we are using more code space than we are allowed to. We need to raise some codelens
somewhere to free up more code space. But raising codelens increases the total codelen (CL). So the goal
is to bump up some codelens to get K = 1, with a minimum increase to CL.
<FONT COLOR=GREEN><PRE>
If you do L_i ++
K -= 2^(-L_i)
K += 2^(-(L_i+1))
for a net change of :
K -= 2^(-(L_i+1))
(shorter codelen symbols makes a bigger change to K)
and CL does :
CL -= C_i * L_i
CL += C_i * (L_i + 1)
net :
CL += C_i
(lower count symbols hurt total codelen less)
K_delta_i = 2^(-(L_i+1))
CL_delta_i = C_i
</PRE></FONT>
To get under the K budget, we want to find the lowest CL_delta with the maximum K_delta.
That is, code space (K) is the resource you want to buy, and code len (CL) is the currency you use to
pay for that code space. You want the best price :
<FONT COLOR=GREEN><PRE>
price = CL_delta / K_delta
price = C_i * 2^(L_i+1)
</PRE></FONT>
What I was doing in Oodle was taking the step with the best "price" that didn't overshoot the target of K = 1.
<P>
If your symbols are sorted by count (which they usually are for Huffman codelen algorithms), then
you don't need to compute "price" for all your symbols. The minimum price will always occur at the lowest
count (first in the sorted list) at each codelen. So rather than making a full heap of up to 256 symbols
(or whatever your alphabet size is), you only need a heap of the 16 (or whatever codelen limit is) lowest
count symbols at each codelen.
<P>
The big improvement in
Engel's refinement heuristic is that it allows overshooting K if the absolute value of the distance to K decreases.
<P>
Consider K in fixed point with a 12 bit codelen limit. Then "one" is at K = 4096. Say you had K = 4099.
It's 3 too big. My heuristic could only consider K steps of -=2 and -=1 (only power of two steps are possible).
Engel can also take a step of -= 4 , changing K to 4095. It's now 1 too small (codeable but wasteful)
and rather than increasing codelens to fit in the code space, we can decrease a symbol codelen somewhere to gain some
total codelen.
<P>
Engel converges to K = one by (possibly) taking successively smaller overshooting steps, so K wiggles around
the target, delta going positive & negative. This does not always converge, so a bail out to a simpler
approach is needed. This overshooting lets it get to K by doing a combination of positive and negative steps
(eg. 3 = 4 - 1 , not just 3 = 1 + 2), which is a little bit of a step towards Package Merge (the difference being
that package merge find the cheapest path to get the desired sum, while Engel's heuristic is greedy, taking the
single cheapest step each time).
<P>
In practice this turns out to be much better than only taking non-overshooting steps.
<P>
Time to look at the results on some real data :
<FONT COLOR=GREEN><PRE>
"seven" test set, cut into 64k chunks, order 0 entropy coded
comparing code len to package merge (ideal)
The length in excess (percent) is :
excess percent = (codelen(heuristic) - codelen(packagemerge))*100/codelen(packagemerge)
Huffman then limit monotonic mean : 0.027% max : 2.512%
Huffman then limit overshooting mean : 0.002% max : 0.229%
Engel coding mean : 0.016% max : 10.712%
(codelen limit of 12, 256 symbol byte alphabet)
</PRE></FONT>
The heuristic (overshooting) limit is really very good, extremely close to package merge and even the maximum excess len is
small. Engel coding (non-Huffman initial code lengths) works fine on average but does have (very rare) bad
cases. This is not surprising; there's reason we use Huffman's algorithm to get the code lengths right.
<P>
In that bad 10.71% excess case, the package merge average code len is 1.604 but Engel coding produces an
average code length of 1.776
<P>
Note that many of the blocks in this test did not hit the codelen limit; in that case "Huffman then limit"
produces the best possible codelens, but Engel coding might not.
<P>
For most applications, it's probably best to make the true Huffman code and then limit the lengths with a
heuristic. The time saved from the approximate initial code lengths is pretty small compared to other
operations needed to do entropy coding (histogramming for example is annoyingly slow). Nevertheless I found
this technique to be an interesting reminder to keep an open mind about approximations and understand where
our algorithms came from and why we use the ones we do.
<P><HR><P>
Another thing I find interesting is how Engel Coding points back to Package Merge again.
<P>
First there's the fact that you can start Engel Coding with just all the codelens set at 1 , and let the Kraft
fixup make the codelens. That's how Package Merge works. It starts all codelens at 1, and the increments the
cheapest ones until it gets up to Kraft. The log2ish starting guess for Engel Coding is just a way of jump-starting
the codelens closer to the final answer to avoid lots of steps.
<P>
Engel Coding's overshooting heuristic improves on the monotonic heuristic by allowing you to take some +- steps.
That is, increment one codelen and decrement another. In particular it can do things like :
rather than increment a len 3 codelen , instead increment a len 2 codelen and decrement a len 3 codelen.
This is the kind of move that you need to make to get to real optimal code lens.
<P>
The key missing element is considering the costs of all possible steps and finding a path to the desired K.
That is, Engel coding takes greedy (locally cheapest price) steps, which may not give the optimal path overall.
The way to turn this greedy algorithm into an optimal one is dynamic programming. Lo and behold, that's what
Package Merge is.<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/T-EB-wvE2Jw" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com2tag:blogger.com,1999:blog-5246987755651065286.post-55997601243839555112018-03-26T10:35:00.001-07:002018-03-26T16:02:20.147-07:00Compression of Android Game APK Packages with OodleThis is a look at compression of Android Game APK Packages. In this study I'm mainly looking at the issue of transmission of the APK to
the client, not storage on the client and decompression at runtime. The key difference between the two is that for transmission over the
network, you want to compress the package as a "tar", that is without division into files, so that the compressor can use cross-file correlation.
For storage on disk and runtime loading, you might want to store files individually (or perhaps combined and/or split into paging units), and you might want some files uncompressed.
<P>
The Android APK package is just a zip (thanks to them for just using zip and not changing the header so that it can be easily manipulated with
standard tools).
<P>
I chose the list of games from this article :
<A HREF="https://www.engadget.com/2018/03/19/google-play-instant-games-demo/"> Google's instant app tech now lets you try games before you buy </A>
which is :
<CITE>
<Q>Clash Royale, Words With Friends 2, Bubble Witch 3 Saga, Final Fantasy XV: A New Empire, Mighty Battles and -- of course -- Solitaire</Q>
</CITE>
<P>
I discovered that "Mighty Battles" internally contains a large pre-compressed pak file. (it's named "content.mp3" but is not really an mp3, it's
some sort of compressed archive. They use the mp3 extension to get the APK package system to store it without further zlib
compression.) Because of that I exluded Might Battles from the test; it would be about the same size with every compressor, and is not reflective
of how it should be compressed (their internal package should be uncompressed if we're testing how well the outer compressor does). Later I also
saw that "Clash Royale" is also mostly pre-compressed content. Clash Royale has its content in ".sc" files that are opaque compressed data.
I left it in the test set, but it should also have those files uncompressed for real use with an outer compressor. I wasn't sure which Solitaire to
test; I chose the one by Zynga.
<P>
The "tar" is made by unpacking the APK zip and concatenating all the files together. I also applied
<A HREF="http://cbloomrants.blogspot.com/2016/08/png-without-zlib.html">PNGz0</A> to turn off zlib compression on any PNGs.
I then tested various compressors on the game tars.
<P>
<TABLE BORDER=2 CELLSPACING=2>
<TR BGCOLOR=#BFBFBF><TD BGCOLOR=#BFBFBF></TD> <TD><FONT COLOR=BLUE>original</FONT></TD> <TD>tar</TD> <TD><FONT COLOR=BLUE>zlib</FONT></TD> <TD>Leviathan</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>BubbleWitch3</TD> <TD ALIGN="right"><FONT COLOR=BLUE>78,032,875</FONT></TD> <TD ALIGN="right">304,736,621</TD> <TD ALIGN="right"><FONT COLOR=BLUE>67,311,666</FONT></TD> <TD ALIGN="right">54,443,823</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>ClashRoyale</TD> <TD ALIGN="right"><FONT COLOR=BLUE>101,702,690</FONT></TD> <TD ALIGN="right">124,031,098</TD> <TD ALIGN="right"><FONT COLOR=BLUE>98,386,824</FONT></TD> <TD ALIGN="right">93,026,161</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>FinalFantasyXV</TD> <TD ALIGN="right"><FONT COLOR=BLUE>58,933,554</FONT></TD> <TD ALIGN="right">144,668,500</TD> <TD ALIGN="right"><FONT COLOR=BLUE>57,104,802</FONT></TD> <TD ALIGN="right">41,093,459</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>Solitaire</TD> <TD ALIGN="right"><FONT COLOR=BLUE>14,814,888</FONT></TD> <TD ALIGN="right">139,177,140</TD> <TD ALIGN="right"><FONT COLOR=BLUE>14,071,999</FONT></TD> <TD ALIGN="right">8,337,863</TD> </TR>
<TR><TD BGCOLOR=#BFBFBF>WordsWithFriends2</TD> <TD ALIGN="right"><FONT COLOR=BLUE>78,992,339</FONT></TD> <TD ALIGN="right">570,621,614</TD> <TD ALIGN="right"><FONT COLOR=BLUE>78,784,623</FONT></TD> <TD ALIGN="right">53,413,494</TD> </TR>
<TR BGCOLOR=#BFBFBF><TD BGCOLOR=#BFBFBF>total</TD> <TD ALIGN="right"><FONT COLOR=BLUE>332,476,346</FONT></TD> <TD ALIGN="right">1,283,234,973</TD> <TD ALIGN="right"><FONT COLOR=BLUE>315,659,914</FONT></TD> <TD ALIGN="right">250,314,800</TD> </TR>
</TABLE>
<FONT COLOR=GREEN><PRE>
original = size of the source APK (per-file zip with some files stored uncompressed)
tar = unzipped files, with PNGz0, concatenated together
zlib = zip -9 applied to the tar ; slightly smaller than original
Leviathan = Oodle Leviathan level 8 (Optimal4) applied to the tar
</PRE></FONT>
You can see that Clash Royale doesn't change much because it contains large amounts of pre-compressed data internally.
The other games all get much smaller with Leviathan on a tar (relative to the original APK, or zlib on the tar).
eg. BubbleWitch3 was 78 MB, Leviathan can send it in 54.4 MB ; Solitaire can be sent in almost half the size.
<P>
Leviathan is very fast to decode on ARM. Despite getting much more compression than zlib, it is faster to decode.
More modern competitors (ZStd, brotli, LZMA) are also slower to decode than Leviathan on ARM, and get less
compression.
<P>
For reference, here is the performance on this test set of a few compressors (speeds on Windows x64 Core i7-3770) :
<P>
<TABLE><TR><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=FF0000,FFB000,FFD403,3782FF,A92FFF,882787&chxr=1,0,1800,200&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=lzma 16.04 -9|brotli24 2017-12-12 -11|zlib 1.2.11 -9|zstd 1.3.3 -22|Kraken8|Leviathan8&chd=e:DG,N1,O5,jJ,-W,wK&chxl=0:|total&chtt=decode+MB/s&chts=303030,14,l">
</TD><TD>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=320x500&chbh=25,2,10&chxt=x,y&chco=FF0000,FFB000,FFD403,3782FF,A92FFF,882787&chxr=1,0,6,1&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=lzma 16.04 -9|brotli24 2017-12-12 -11|zlib 1.2.11 -9|zstd 1.3.3 -22|Kraken8|Leviathan8&chd=e:1n,0l,rX,zV,1k,2r&chxl=0:|total&chtt=compression+ratio&chts=303030,14,l">
</TD></TR></TABLE>
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=FF0000,FFB000,FFD403,3782FF,A92FFF,882787&chxr=1,0,2700,300&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=lzma 16.04 -9|brotli24 2017-12-12 -11|zlib 1.2.11 -9|zstd 1.3.3 -22|Kraken8|Leviathan8&chd=e:CSAlBcGCDzCE,JeF5FjSSK7JO,KTJQGsL5KyJ7,caUPMrkPZMXc,wHpuY--jqIpk,nMeORrzIhOgH&chxl=0:|BubbleWitch3|ClashRoyale|FinalFantasyXV|Solitaire|WordsWithFriends2|total&chtt=decode+MB/s">
<P>
<IMG SRC="http://chart.apis.google.com/chart?cht=bvg&chs=900x320&chbh=15,2,10&chxt=x,y&chco=FF0000,FFB000,FFD403,3782FF,A92FFF,882787&chxr=1,0,17,2&chxs=0,000000,12,0,lt|1,000000,10,1,lt&chdl=lzma 16.04 -9|brotli24 2017-12-12 -11|zlib 1.2.11 -9|zstd 1.3.3 -22|Kraken8|Leviathan8&chd=e:U7E.NH.RloS7,VCFDMm6GjUSk,RDEwJilObRPT,UOE6L-31j8SH,UyE-M29TmtS6,VEFBNQ-1oNTT&chxl=0:|BubbleWitch3|ClashRoyale|FinalFantasyXV|Solitaire|WordsWithFriends2|total&chtt=compression+ratio">
<P>
Note that some of the wins here are not accessible to game developers.
When a mobile game developer uses Oodle on Android, they can apply Oodle to their own content and get the size
and time savings there. But they can't apply Oodle to their executable or Java files. The smaller they reduce
their content, the larger the proportion of their APK becomes that is made up of files they can't compress.
To compress all the content in the APK (both system and client files, as well as cross-file tar compression)
requires support from the OS or transport layer.
<P>
I'll also take this chance to remind clients that when using Oodle, you should always try to turn off any
previous compression on your data. For example, here we didn't just try the compressors directly on the APK
files (which are zip archives and have previous zlib compression), we first unpacked them. We then further
took the zlib compression off the PNG's so that the outer compressors in the test could have a chance to
compress that data better. The internal compressors used on Clash Royale and Mighty Battles should also have
been ideally turned off to maximize compression. On the other hand, turning off previous compression does not
apply to data-specific lossy compressors such as audio, image, and video codecs. That type of data should be
passed through with no further compression.
<img src="http://feeds.feedburner.com/~r/CbloomRants/~4/9F-oWpzYvPs" height="1" width="1" alt=""/>cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.com5