<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atomfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="0.3">
  <title>Cliff Click Jr.’s Blog</title>
  <link rel="alternate" type="text/html" href="http://blogs.azulsystems.com/cliff/" />
  <id>tag:typepad.com,2003:weblog-1244458</id>
  <link rel="service.post" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458" title="Cliff Click Jr.’s Blog" />
  <modified>2010-01-19T17:13:46Z</modified>

  <generator url="http://www.typepad.com/">TypePad</generator>
  <info type="application/xhtml+xml">
  <div xmlns="http://www.w3.org/1999/xhtml">This is an Atom formatted XML site feed. It is intended to be viewed in a Newsreader or syndicated to another site. Please visit <a href="http://blogs.azulsystems.com/cliff/">Cliff Click Jr.’s Blog</a> for more info.</div>
  </info>
  <link rel="start" type="application/atom+xml" href="http://feeds.feedburner.com/typepad/azulsystems/cliff" /><feedburner:info uri="typepad/azulsystems/cliff" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
    <title>I've Been Slashdot'd</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/d-B20963gTE/i-been-slashdotd-and-other-short-stories.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20120a7ecc2c7970b" title="I've Been Slashdot'd" />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20120a7ecc2c7970b</id>
    <issued>2010-01-19T09:13:46-08:00</issued>
    <modified>2010-01-19T17:13:46Z</modified>
    <created>2010-01-19T17:13:46Z</created>
    <summary>I've been Slashdot'd. The slides in question are also here. I gave the talk at the JVM Language Summit, which itself was a lot of fun. The talk is a repeat of one of the talks I did at JavaOne....</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><a href="http://hardware.slashdot.org/article.pl?sid=10/01/14/239229">I've been Slashdot'd.</a>  <a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_HardwareCrashCourse.pdf">The slides in question are also here.</a><p>I gave the talk at the <a href="http://openjdk.java.net/projects/mlvm/jvmlangsummit/agenda.html">JVM Language Summit</a>, which itself was a lot of fun.</p><p>The talk is a repeat of one of the talks I did at JavaOne.  I also <a href="http://blogs.azulsystems.com/cliff/2009/07/javaone-slides.html">gave two other talks</a>, but the Sun JavaOne website appears to be unable to deliver the video right now.  I also <a href="http://java.sun.com/javaone/2009/articles/rockstar_click.jsp">gave a short interview at JavaOne</a>.</p><p>One of the talks I mentioned on the InfoQ video is also available <a href="http://www.youtube.com/watch?v=5uljtqyBLxI">here as a Google Tech Talk; Java on 1000 cores: Tales of Hardware/Software Co-Design</a>.  I also mentioned a talk <a href="http://blogs.azulsystems.com/cliff/2009/02/and-now-some-hardware-transactional-memory-comments.html">Azul's Experiences with Hardware Transactional Memory, and my blog on that is here</a>.  Alas, I don't believe the HTM talk has been video'd for public consumption at any time.  If you are interested in HTM support, you should also <a href="http://www.azulsystems.com/events/mspc_2008/2008_MSPC.pdf">check out this short gem</a>.  The GC talk alluded too has slides all over the web; <a href="http://www.usenix.org/events/vee05/full_papers/p46-click.pdf">here's the original paper</a>, but I could not find a public video presentation.</p><p /><p /><p /><p>Cliff</p><p /><p /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2010/01/i-been-slashdotd-and-other-short-stories.html</feedburner:origLink></entry>
  <entry>
    <title>Biased Locking</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/Ntylm1ClHLU/biased-locking.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20120a7bb2e75970b" title="Biased Locking" />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20120a7bb2e75970b</id>
    <issued>2010-01-09T12:11:55-08:00</issued>
    <modified>2010-01-09T20:11:55Z</modified>
    <created>2010-01-09T20:11:55Z</created>
    <summary>Recently I re-did HotSpot's internal locking mechanism for Azul's JVM. The old locking mechanism is approaching 15 years old and features a number of design decisions that are now out-dated: Recursion counts are kept as a NULL word on the...</summary>
    <author>
      <name>Cliff Click</name>
    </author>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml">Recently I re-did HotSpot's internal locking mechanism for Azul's JVM.  The old locking mechanism is approaching 15 years old and features a number of design decisions that are now out-dated:<br /><br /><ol>
<li>Recursion counts are kept as a NULL word on the stack for every recursion depth (i.e., counting in Base 1 math) in order to save a few instructions and a few bits of memory.  Both are now in vast plentiful supply.  On the 1st lock of an object, it's header is moved into the stack word instead of a NULL and this means that GC or other locking threads (or threads installing a hash code) all need to find and update the header word - which can now be "displaced".  This mechanism is complex, racey and error prone.</li>
<li>The existing mechanism requires a <a href="http://en.wikipedia.org/wiki/Memory_barrier">strong memory fence</a> after a <a href="http://en.wikipedia.org/wiki/Compare-and-swap">Compare-And-Swap</a> (CAS) op, but on most machines the CAS also includes a memory fence.  I.e., HotSpot ends up fencing *<strong>twice</strong>* for each lock acquire, once to CAS the header and again moving the displaced header to the stack.  Each memory fence costs about a cache-miss on most X86 CPUs.</li>
<li>The existing mechanism uses <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.4724&amp;rep=rep1&amp;type=pdf">"Thin Locks"</a> to optimize for the very common case of a locked object never being contended.  New in Java7, <a href="http://blogs.sun.com/dagastine/entry/no_tuning_required_java_se">+UseBiasedLocking</a> is on by default.  This optimizes the common case even more by not using any fencing for locks which have never (yet) changed threads.  <a href="https://researchweb.watson.ibm.com/trl/people/kawatiya/pub/Kawachiya02oopsla.pdf">(See this nice IBM paper on how to do it).</a>  The downside in the OpenJDK implementation is that when an object DOES have to change thread-ownership, <a href="http://java.sun.com/javase/technologies/hotspot/publications/biasedlocking_oopsla2006_wp.pdf">the cost is so high that Sun has choosen to disable biased locking for whole classes of locks to avoid future thread-ownership-change costs</a>.</li>
<li>When a lock does see contention it "inflates" and then the "inflated" lock is much more expensive than a fast-path "thin lock".  So even the smallest bit of contention will cause a lock to be much more expensive than the good case.</li>
<li>JVM internal locks and locked Java objects use 2 utterly different code bases.  This adds a lot of complexity to an already complex system.  The two classes of locks are used in slightly different ways and do have different requirements, BUT they both fundamentally implement a fast-path locking protocol over the OS provided locking abstraction.  At Azul Systems, we found that these two locking systems have a lot more in common than they do in difference.</li>
</ol>
<br />My new locking implementation has met a number of goals:<br /><ol>
<li>No "displaced header", ever</li>
<li>Only one strong memory fence to lock contended locks instead of 2.</li>
<li>The normal fast-path is as good as the +BiasedLocking case.  Revoking a biased-lock is cheap enough to do it on a case-by-case basis.</li>
<li>Inflation happens due to a one-time contention, but then low-contention or no-contention behavior quickly reverts back to a cost which is nearly as good as the normal fast-path.  i.e., uncontended inflated locks pay only an extra cache-hitting indirection load.</li>
<li>One code base for all locks</li>
</ol>
<br />As with the OpenJDK, Azul's new locks are implemented as some bits in every Object's header word, plus an ObjectMonitor structure when the lock "inflates".  Internal VM locks use a plain Monitor which ObjectMonitor subclasses (actually most VM locks are a subclass of Monitor called Mutex which supports only plain lock actions; Monitors support wait &amp; notify as well).<br /><br />Within the 64-bit object header word (Azul uses only 1 header word instead of 2), we reserve 32 bits for locking &amp; hashcodes.  If all 32 bits are zero, the object is unlocked &amp; un-hashed; new objects appear this way.  If the low bit is set, the remaining 31 bits are a hashCode.  If the low 2 bits are '00' the object is BiasedLocked to a thread, and the remaining 30 bits are the thread ID of owning thread (we can compute the thread ID of a thread in 1 or 2 clock cycles).  If the low 2 bits are '10' the remaining 30 bits are the address of an ObjectMonitor structure; e.g. the low 32 bits of the header contains a C++ pointer to an ObjectMonitor, plus 2.  <br /><br />The low 32-bit patterns of the object header are:<br /><br /><span style="text-decoration: underline; font-family: Courier;"><strong>00000000</strong></span> - The most common case is that on object is never locked and never hashed; it's low 32bits of header remains zero throughout it's lifetime.<br /><br /><br /><span style="text-decoration: underline; font-family: Courier;"><strong>xxTIDx00</strong></span> - The next most common case is that the object is locked but not hashed, and never changes owners.  Many many Strings fall into this category.  The first lock acquire will CAS the owning thread into the header, and the object is now Biased-Locked.  Future lock attempts will simply do a load/compare/branch to confirm that the object remains biased-locked to the owning thread.  Literally the code to do this is:<br /><span style="font-family: Courier;">  // R01 - holds object</span><br /><span style="font-family: Courier;">  // R02 - holds self-thread-ID</span><br /><span style="font-family: Courier;">  ldu4  R03,[R01+4] // load low 32-bits</span><br /><span style="font-family: Courier;">  bne   R03,R02,slow-path-locking // thread-id bits do not match?</span><br /><span style="font-family: Courier;">  // bits match, so we own the lock!</span><br /><br /><br /><span style="text-decoration: underline; font-family: Courier;"><strong>xxHASHx01</strong></span> - The next case is that the object is hashed but not locked, and we are trying to get the hashCode:<br /><span style="font-family: Courier;">  // R01 - holds object</span><br /><span style="font-family: Courier;">  ldu4 R02,[R01+4]    // load low 32-bits</span><br /><span style="font-family: Courier;">  beq  R02,0,not_hash // zero header so no hash</span><br /><span style="font-family: Courier;">  shr4 R02,1,   // low-bit to carry flag</span><br /><span style="font-family: Courier;">  bcc  not_hash // "branch carry clear", if low bit was zero it was not_hash</span><br /><span style="font-family: Courier;">  // R02 - holds 31 bits of hash</span><br /><br /><br /><span style="text-decoration: underline;"><strong>xxMONx10</strong></span> - The next case is that the object requires an ObjectMonitor, either because it is both locked AND hashed, or because it was biased-locked once and saw contention.  The monitor is a 32-bit pointer (so limited to the low 4Gig), but can be directly used.  This snippet of code assumes we already failed the fastest path lock-acquire given above (the snippet is actually a called subroutine so the code size does not really matter):<br /><span style="font-family: Courier;">  // R01 - holds object</span><br /><span style="font-family: Courier;">  // R02 - holds self-thread-ID</span><br /><span style="font-family: Courier;">  // R03 - holds loaded header word which might be a monitor:</span><br /><span style="font-family: Courier;">slow-path-locking: <br />  extract R04,R03,2      // test bit#1 for being set</span><br /><span style="font-family: Courier;">  bne R04,1,not_monitor  // branch if we need to inflate the lock</span><br /><span style="font-family: Courier;">  // Here we have code to check for self-biased in the monitor:    </span><br /><span style="font-family: Courier;">  ldu4  R04,[R03-2+owner_field_offset] // load ObjectMonitor.owner</span><br /><span style="font-family: Courier;">  beq   R04,R02,lock_is_owned // thread-id bits match?</span><br /><span style="font-family: Courier;">  // and here we carry on with slow-path locking</span><br /><span style="font-family: Courier;">  // including testing for recursive locks and</span><br /><span style="font-family: Courier;">  // attempting CAS acquire, short spin-waiting, etc</span><br /><br /><p /><p /><p>And so on.  If all the various fast-path attempts fail we eventually end up calling into the VM and blocking on the OS-provided mutex.  Along the way we'll attempt a bunch of other tests (e.g. recursion &amp; spin-waiting) and if all things fail we need to block, we'll also do a fast/short crawl of the call stack for profiling.  There are a lot of other interesting issues here, including heuristics on when to make a lock not-biased by default (producer/consumer design patterns make objects which *always* change hands at least once), or when to deflate an inflated lock that has "settled" into a new thread (or unlocked state), or how to achieve a modicum of fairness under extreme contention.  Since this blog is already getting long (and as raw Assembly code ta-boot!) I'll stick with a short discussion of the bias-lock revoke mechanism.</p><p /><p /><p /><h3>Revoking a Biased Lock</h3><p>Suppose thread T1 holds the lock on Object O - via the very fast-path mechanism: O's header word contains the bits "<span style="text-decoration: underline;"><strong><span style="font-family: Courier;">xxT1xx00</span></strong></span>", and that thread T2 needs to acquire the lock on O.  What kind of interactions are possible?  For starters, assume T2 has failed the fast-path attempts (it's got the wrong thread ID!) and already entered the slower-paths in the VM.  A simple inspection shows T2 that O is biased-locked to T1.  What does T1 know about this situation?  Turns out that thread T1 knows almost nothing:</p><p>- T1 may have acquired O's lock in the misty past and may not now have access to a pointer to O.</p><p>- T1 may be rapidly acquiring and releasing the lock on O, all without fencing or atomic operations.  Moment by moment, cycle by cycle, the actual Java state for T1 may have it holding &amp; releasing O's lock.  T1 isn't really tracking this, he is just periodically observing that he holds the bias lock on O.</p><p>- T1 may be blocked in long latency I/O operations or on other locks.</p><p>- T1 may be a dead exited thread.</p><p>- The id for thread T1 may have been recycled to a new thread, T1', which has never seen a pointer to O cross it's JVM state, ever.  Yet the lock for O is now biased to T1'!</p><p>What T2 can do is to ask T1 to reach a safepoint - a point in the code which matches <strong>some </strong>Java bytecode, and then decide if T1 really holds the lock on O or not.  T2 does this by setting a self-service task on T1; T1 periodically polls for such tasks at safepoints, and when it sees the request it breaks out of it's normal execution and handles the task (periodic polling typically 1 or 2 cycles every few thousand instructions).  T2 then waits for T1 to "do something" with the lock (but not really: what if thread T1 has already exited?).  </p><p>Suppose T1 is still running and spots the poll request.  It then does a self-stack-crawl to count the number of times it's supposed to hold the lock.  JIT'd code includes annotations to describe what locks are held (and how often, if recursive), and the interpreter counts lock acquires in any case (with slow simple Base-1 counting but we don't care about speed in the interpreter).  If this lock count is positive (T1 really holds the lock now!) T1 inflates the lock, slaps in the recursion count into the new ObjectMonitor struct, goes back to running... but now each unlock by T1 will lower the recursion count.  When the lock is unlocked for the last time, T1 does the usual wakeup on T2 and T2 then competes to grab the now-free lock.  If this lock count is zero (T1 holds the lock biased but not for-real) then it releases the lock and notifies T2 as normal.  In any case, the bias on O is revoked and the lock reverts to a state where T2 is allowed to compete for ownership.</p><p>This is all great if T1 is actively running and polling for self-service tasks, but what if T1 is blocked or exited?  After T2 prods T1 with the poll request, T2 then attempts to do the very same self-service task on T1's stack remotely.  To do this T2 has to acquire the "lock" on T1's stack.  All Azul threads unlock their stack whenever they block (or otherwise stop polling) - this same stack-lock mechanism is used by GC threads. If T2 can get T1's stack-lock, then T2 can safely crawl T1's stack (and T1 is not allowed to execute random Java code until he gets his own stack-lock back).  If T2 canNOT get T1's stack-lock, it must be because T1 is busy executing Java code - and hence T1 is also polling for self-service tasks and will (shortly) get around to crawling his own stack.  In any case either T1 or T2 will shortly get around to crawling T1's stack and discovering the proper lock recursion count for Object O.</p><p>And if T1 is exited, T2 can also detect this from T1's old thread-id (thread-id's map back to some type-stable per-thread data).  In this case, T2 can just freely revoke the bias on O and bias O to himself.</p><p /><p>Well, that's enough for now!  Hope you enjoyed this (very long overly complex) discussion of our biased-locking implementation.</p><p /><p>Cliff</p><p /><p /><p /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2010/01/biased-locking.html</feedburner:origLink></entry>
  <entry>
    <title>Touching Base...</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/SachkUqfOko/touching-base.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20120a7724fc7970b" title="Touching Base..." />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20120a7724fc7970b</id>
    <issued>2009-12-22T08:48:40-08:00</issued>
    <modified>2009-12-30T18:24:53Z</modified>
    <created>2009-12-22T16:48:40Z</created>
    <summary>It's been awhile since I blogged, so I thought I'd touch base with people to let them know what's been going on. Azul Systems has been hard at work improving our JVM. This is a bigger statement than it sounds...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><div class="moz-text-html" lang="x-unicode">It's been awhile since I blogged, so I thought I'd touch base with
people to let them know what's been going on.  Azul Systems has been
hard at work improving our JVM.  This is a bigger statement than it
sounds - there are not many groups that have a large enough 'quorum' of
JVM engineers to do large-scale changes to the HotSpot JVM.  Azul has
nearly a dozen engineers doing core HotSpot work (not counting JDK work
or QA folks - counting only core JVM engineers)!  We've been doing
large-scale changes to HotSpot for nearly 8 years now.  Our HotSpot has
been improved over Sun's standard HotSpot or the OpenJDK in a large
number of ways, some more visible and some less so.   Some of the more
obvious stuff we've got working:<br />
<br />
<ul>
<li>A new complete replacement GC: <a href="http://www.azulsystems.com/products/whitepaper/wp_pgc.pdf">Generational
Pauseless GC</a> (and the older <a href="http://www.usenix.org/events/vee05/full_papers/p46-click.pdf">PauselessGC
paper is here</a>).  This is one of our core strengths.  GPGC handles
heaps from 60Megabytes to 600Gigabytes and allocation rates from
4Megabytes/sec to 40Gigabytes/sec, with MAX pause-times consistently
down in the
10-20msec range.  </li>
<li>GPGC requires <a href="www.research.ibm.com/ismm04/slides/blackburn.pdf">read barriers</a>,
and this means
instrumenting every read from the garbage-collected heap. 
Instrumenting the JIT'd reads is easy: we altered the JITs long ago to
emit the needed instructions.  Instrumenting the VM itself is a bigger
job; every time we integrate a new source drop from Sun we have to find
all the new heap-reads Sun has inserted into their new C++ code
(HotSpot itself is a large complex C++ program) and add read-barriers
to them.</li>
<li><a href="http://www.azulsystems.com/e2e/">Real Time Performance
Monitoring</a> - RTPM.  This
is our high-resolution always-on no-overhead integrated profiling tool
and is our 2nd major selling point.  Because it's no-overhead
(literally less than 1%; it's very hard to measure the overhead) we
leave it always on.  This means you can look at a JVM that's been up in
production for a week or a month and introspect it.  It's *<strong>common</strong>*
for a 1hr session with RTPM to answer performance questions that have
plagued production systems for years, or to have people walk away with
10-line fixes worth 30% speedups.  It's as-if you've been blind to what
your JVM has been doing and suddenly your eyes are opened.  Live stack
traces, heap contents, leaks, hot-locks with contending stack traces,
profiled JIT'd assembly, I/O bottlenecks, GC issues, etc, etc.  <a href="http://www.azulsystems.com/e2e/">See the
link for a demo</a>.</li>
<li>Virtualized JVM - We can take pretty much any old server, install
a new JDK, change JAVA_HOME to the new JDK and re-launch the
application... and it now runs on Azul's JVM backed by an Azul
appliance.  No hardware change and no OS change.  This is a great
solution for in-place speedups of older gear.</li>
<li>More recently of course, we've been hard at working porting our
JVM to our new hardware platform.  This work is going well; look for
more discussion here as we have things to announce!<br />
 </li>
</ul>
<br />
Here's some of the LESS obvious stuff we have working:<br />
<br />
<ul>
<li>Tiered Compilation.  Despite the fact that Sun has shipped
"-client" and "-server" configurations for years, they never integrated
these two JITs into a single system.  Most other JVMs have had a tiered
compilation configuration for years and Azul Systems did this to
HotSpot a few years
ago.  We consistently see a roughly 15% speed improvement over a
plain "-server" configuration.  We use the "-client" JIT (also known
internally as C1) to do fast high-resolution profiling; this
high-quality profile information allows the "-server" JIT (C2) to do a
much better job of inlining and compiling.</li>
<li>A complete replacement for the existing HotSpot CodeCache: the
holder of all JIT'd code in the system.  While *<strong>adding</strong>* code has
always been easy, *<strong>removing</strong>* code has always been tricky (well,
tricky to do it without blowing all code away at once and without
requiring all calls to indirect through a 'handle').  Most
large server apps slowly churn new code, so if you leak code you
eventually run out of memory.  The
new CodeCache uses GC to control code lifetimes and this results in a
vastly simpler and less buggy structure all around.  We also use GC to
manage all the
auxiliary data structures surrounding code, e.g. the list of "class
dependencies" for a piece of JIT'd code is a standard heap object now. 
(A "class dependency" lists the set of classes &amp; methods that a
piece of JIT'd code assumes are NOT overridden; if a new class and/or
method overrides one of these then some inlining decision made by the
JIT is now illegal and the JIT'd code needs to be deoptimized, removed
&amp;
recompiled).  Besides being a common management point for all code, the
CodeCache is pinned in the low 4-Gig.  This means all hardware Program
Counters can be limited to 32bits (in our otherwise 64-bit system) and
this is a
tidy cost savings (shorter instruction sequences for calls; less
I-cache space consumed, etc).</li>
<li>Tons of internal JVM scaling work.  We run on systems with 100's
of CPUs and so we've found (and fixed!) any number of internal JVM
scaling limitations.  GPGC can run with hundreds of worker CPUs if
needed.  The JITs compile in parallel with dozens of CPUs (50 is common
during a large application startup).  Many internal VM structures have
been made lock-free or have had their lock hold-times reduced by 10x or more. 
Self-tuning auto-sizing JIT/compiler thread pool.  Concurrent stub/native-wrapper generation.  Concurrent code-dependency insertion (during compilation) and checking (during class loading).  Self-tuning
finalizer work queues.  etc, etc, etc....<br />
 </li>
<li>Cooperative Safepointing allows thousands of *<strong>running</strong>*
threads
(not just alive-but-blocked-on-IO) to come to a Safepoint in
under a millisecond.  Merely safepointing 100's of threads is down in
the microseconds.  Note that a full-on Safepoint does not happen until
the last thread checks-in but the stall time starts when the first
thread stops for a Safepoint.  The time-to-safepoint pause is measured
from when the first running thread stops till when the last thread
checks-in.</li>
<li>The
ability to asynchronously stop &amp; signal individual
threads, to have them do various self-service tasks cheaper than a
remote thread can do it.  This includes, e.g. stack crawls for GC or
profiling (a thread's stack is hot in his own L1 cache and can be
crawled
vastly faster than by a remote thread), or to acknowledge GC phase
shifts or to allow code to be deoptimized (jargon word for what happens
to code that is no longer valid due to class loading). We can also
efficiently do "ragged safepoints" - this is like a full Safepoint
except we don't need to simultaneously stop all threads.  Instead we
merely need to know when all threads have acknowledged e.g. a GC phase
shift.  The threads "check in" as they individually acknowledge the
Safepoint and keep on running.  When the last thread has checked in,
the "ragged safepoint" (and GC phase shift) is complete.<br />
 </li>
<li>No more "perm-gen" space to run out or require a separate tuning
flag.  No more old-gen or young-gen either.  No GC-thread-count knobs,
or space/ratio tuning knobs or GC age or SurvivorXXX flags.  GPGC takes
no flags (except max total resources allowed), and runs well.  There Is
Only One Heap
Space, and GPGC Rules It All.</li>
<li>A new thread &amp; stack layout that lets us use the
stack-pointer also as a ThreadLocal storage pointer, the HotSpot
"JavaThread*", AND as a small dense integer thread-id (requires 1 or 2
integer ops to flip between these forms).  This frees up a CPU register
for general use, while still allowing 1-cycle access to performance
critical thread-local structures.<br />
 </li>
<li>A complete replacement for the existing HotSpot locking
mechanisms.  Our new locks are 'biased' (<a href="www.trl.ibm.com/projects/jit/paper/p020-kawachiya.ps">here's the
original paper idea</a>) similar in
theory to Sun's +BiasedLocking but based on entirely new code.  No more
"<a href="http://blogs.sun.com/dave/entry/lets_say_you_re_interested">displaced
header</a>" <a href="http://blogs.tedneward.com/CommentView,guid,eca26c5e-307c-4b7c-931b-2eaf5b176e98.aspx">madness</a>
(this <a href="http://docjar.com/docs/api/sun/jvm/hotspot/runtime/ObjectSynchronizer.html">comment</a>
is probably only relevant to
hard-core HotSpot engineers).  Biased locks do not require ANY atomic
operation or memory barrier during locking &amp; unlocking, unless the
lock needs to "change hands".  Since we can stop individual threads
asynchronously, we have a fairly cheap way to hand biased locks off
between threads.  Once individual locks demonstrate that they need to
"change hands", we inflate that one lock (not the whole class of locks)
and it  becomes a "thin lock" as long as the contention is low enough
switching over to a "thick lock" only when there are threads waiting to
acquire the lock.  The issues here are fairly complex and subtle and
deserve an entire 'nother blog!<br />
 </li>
</ul>
<br />
That's enough for this Blog.  More later...<br />
<br />
Cliff<br />
<br />
</div></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/12/touching-base.html</feedburner:origLink></entry>
  <entry>
    <title>Java vs C performance... again...</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/0hklaEYF8SE/java-vs-c-performance-again.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20120a554a531970b" title="Java vs C performance... again..." />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20120a554a531970b</id>
    <issued>2009-09-07T14:26:37-07:00</issued>
    <modified>2009-10-20T15:21:19Z</modified>
    <created>2009-09-07T21:26:37Z</created>
    <summary>[Sept 9: Still more good comments; I'll likely be folding replies into the main blog report all day] [Sept 8: Editor: Given the number of high-quality comments, I've decided to edit the comments &amp; replies into the main blog body....</summary>
    <author>
      <name>Cliff Click</name>
    </author>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p style="background-color: #c0ff80; font-family: Arial;">[Sept 9: Still more good comments; I'll likely be folding replies into the main blog report all day]</p><p style="background-color: #c0ff80; font-family: Arial;">[Sept 8: Editor: Given the number of high-quality comments, I've decided to edit the comments &amp; replies into the main blog body.  Let me know if you want your name attached to your comment.]</p><p><br />
I just foolishly got caught in a You-Tube discussion on Java vs C
performance.  Foolish because You-Tube comments are a lousy way to
present anything and because it's hard to keep the level of discourse
scholarly.  And foolish especially for me because I've had this
discussion so many times and it always comes out the same way... so
here's my attempt at distilling my arguments into something I can point
people the *next* time I get caught in this silly discussion.</p><p>
Is Java faster than C/C++?  The short answer is: <em>it depends</em>.</p>
<p><br />
<strong>Places where C/C++ beats Java for obvious reasons:</strong></p>
<ul>
<li><strong>Very small footprint is required </strong>(these days that does not include most phones).  You can get JVMs that run well in a few hundred KB.  Sometimes that's too much.</li>
<li><strong>Very small startup time </strong>(as opposed to very low response
time on a well-warmed-up JVM).  Things like pacemakers (if mine takes a
cosmic-ray-induced reboot, I want it restarted pretty darned quick!),
or perhaps military gear (e.g. guided missiles).  Note that this does
not include e.g. long running hard-real-time airplane control; I know
that at least one UAV uses Java as it's primary control mechanism. 
Startup of a JVM in microseconds is a very hard problem; startup in
milliseconds might be vaguely possible; but a more common time-frame
for a small program to get JIT'd is measured in a few seconds.  Once
you are past the profiling &amp; JIT'ing stage micro-second response
times are entirely doable.  Flash games beat Java games mostly because
it took 30+sec to load the JVM from disk... and so now the web-game
developer community has settled on Flash as the standard (and it still
takes 10+sec to load the JVM). <br /><span style="background-color: #ffff00; font-family: Arial;">[BJ81: I DO care if my IDE takes 10 seconds to start as opposed to 2. I DO
care if my word processor or computer game of choice takes 10 seconds
to start as opposed to 2. Startup speed is an important component of
the user experience for all end-user software.] <br />[Lots of other people complain about loading time]<br /><span style="background-color: #c0ff80; font-family: Arial;">My IDE stays up for days...and all my computer games take more than a minute to load already.   But yes, I like faster loading.  This mostly depends on things like disk speed... and the implementation of Java as a large (on disk) JVM and <strong>not </strong>a lot on things like the actual language or JIT'ing. </span><br /></span><span style="background-color: #ffff00; font-family: Arial;">[SS: </span><span style="background-color: #ffff00; font-family: Arial;">you can try JetBrains IDEA. From my experience it's faster than
Eclipse, less footprint, no lockups on GC. It's Swing :) The only
problem it's not free.
The real problem with perceived Java performance is that none seriously
optimized client Java performance before the most recent time.</span><span style="background-color: #ffff00; font-family: Arial;"><span style="background-color: #ffff00; font-family: Arial;">]</span><br /><span style="background-color: #c0ff80; font-family: Arial;">Also my experience with JetBrains.  It's amazingly fast.  </span></span></li>
<li><strong>Optimizations beyond "gcc -O2"</strong>, such as tiling &amp;
blocking for registers or cache.  These optimizations pay off
handsomely for dense linear algebra codes but no production JVM that I
know of does them (some research JVMs back onto common heavy-weight
optimizers which include these optimizations).  Auto-parallelizing
compilers also fall into this realm.<br /><span style="background-color: #ffff00; font-family: Arial;">[DB: One article that may be of interest to you regarding Java's transcendental performance: </span><a href="http://blogs.sun.com/jag/entry/transcendental_meditation">http://blogs.sun.com/jag/entry/transcendental_meditation</a><span style="background-color: #ffff00; font-family: Arial;">

The basic message: programs that extensivelly use transcendentals
(sin/cos) will experience notably slower performance in Java due to the
"correct vs. fast" implementation.]</span></li>
<li><strong>Value Types</strong>, such as a 'Complex' type require a full object in Java.  This has both code speed and memory overheads.  Note that there are theoretical optimizations for both (Object Inlining) but implementations available in production JVMs are very weak right now.  See my comment below about rotating arrays-of-small-structs 90-degrees to a small-struct-of-arrays as a workaround.</li>
<li><strong>Direct machine access</strong>, as is common in embedded systems. 
Memory mapped I/O, etc.  Note that JNI goes some of the way to
addressing this problem but JNI is clunky to use and definitely adds
some overhead crossing the C/Java boundary.<br /><span style="background-color: #ffff00; font-family: Arial;">[FR:Java only supports one floating point rounding mode, so it's not
suitable for scientific applications. This might fall under "direct
machine access" but FP rounding modes are really machine-independent
because the IEEE standard requires them.
"How Java's Floating-Point Hurts Everyone Everywhere"
</span><span style="background-color: #ffff00; font-family: Arial;">http://www.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf</span><span style="background-color: #ffff00; font-family: Arial;">]</span><br /><span style="background-color: #c0ff80; font-family: Arial;">I'm not so sure how Everyone got hurt: working the simpler subset of FP allowed Java the time to get everything else right.  Had JVM's been forced to implement the whole spec well, we not never have gotten the JVMs we have now.</span> </li>
<li><strong>Interpreters</strong>.  Interpreters written in pure Java or pure C appear to be mostly equal (I've got very few cases where both are written in a <em>pure </em>style), but it's easier to cross over into the non-pure realm and get 2x
speedups from C.  gcc label vars are the obvious use-case here, getting
a 2x speedup over pure C in my experience.  Dipping into pure assembly
gets me another 2x speedup.  Java's "Unsafe" classes allow access to
e.g. "compare-and-swap" instructions plus unchecked 'peek' and 'poke'
but do not support code-gen or hand-made control flow very well. </li>
<li><strong>On the fly code-gen-&amp;-execute</strong>.  You can make bytecodes
in Java &amp; and execute them but it's somewhat more difficult than
the same machine instructions on a simple RISC chip... and you need the
JIT to do a good job with your bytecodes (there are decent libraries to
make it easier but it's still harder than just doing some bit-twiddly
thing and slamming bits into a buffer).  Sometimes what you want is to
gen some specific machine instructions that do not resemble bytecodes
(see the above comments about direct machine access) or to have
fine-grained interleaving between the gen'd code and the static runtime
bits (I've done sort routines which gen'd the key-compare sequence from
a command-line description before sorting and called that from the
inner sort loop).</li>
<li><strong>OS's</strong>.  These need lots of direct machine access (e.g. hardware interrupt support; page table support), but they also dork
with standard execution stacks (like interpreters do) and also load
code and execute it (see above comment about making &amp; executing
code).  Yes there have been some brave attempts at a pure Java OS, and
Microsoft has had a similar research project in this area for a long
time which has made interesting inroads into the problem.  So this
might be a 'solved' problem some day.</li>
<li><strong>"Direct Code Gen from C"</strong>... carefully hand-crafted C code,
such that the author knows (and plans for) which exact machine
instructions will be generated from the C compiler.  This is harder to
do in Java because the translation layer is much more indirect (I've
done it successfully, but I know a *lot* about the JIT).  Examples are
things like kernels of crypto loops, or audio/video codecs or
gzip/compression routines.  Small bits of very hot code with complex
and unusual control flow where the author knows a lot more about what's
going on in the code than the compiler does.  This kind of coding
obviously does not scale to the 100KLOC program but works very nicely
for things that are both very hot and compartmentalize into libraries
well.</li>
</ul>
<p /><p class="blockquote" style="margin-left: 40px;">
 <span style="background-color: #ffff00; font-family: Arial;">[MB: Any tool has its own advantages and disadvantages.

You simply cannot say Java is superior to C/C++ because Java is written with C/C++ ;-).]</span></p><p class="blockquote" style="margin-left: 40px;"><span style="background-color: #c0ff80; font-family: Arial;">Actually there are several Java-in-Java's Out There and some are not too far off HotSpot's mark.</span></p><p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[FG:  I think you're missing two important issues with Java. One is memory
consumption. Java programs use at least twice as much memory as C++
programs because pointers (called references in Java) are used for
everything and all strings are UTF-16.]</p><p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[AM: Java requires 2.5 times the memory a C++ program requires.]</p>

<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">Yes and no...  Yes in general Java uses more memory than C++ but it's usually more due to coding styles and data structure choices than pointers-vs-primitives and fat strings.  Whenever I see the 2x data bloat it's always due to *<strong>really bad</strong>* data structure choices.  With reasonable choices the bloat is far far lower... and memory is cheap.  I'd really like to see some hard evidence here: really equivalent implementations of larger apps in both C &amp; Java.</p><p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[FG: It gets even worse on 64 bit CPUs. For some applications like in
memory databases, in memory data analysis, etc, this leads to an orders
of magnitude performance difference because the disk has to be touched
more frequently.]</p>

<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">I assume you mean 64-bit JVMs and not just 64-bit CPUs...</p><p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">I guess I'd argue that you just mentioned a strong point of Java: with 64 bit JVMs you can have heaps of 100's of Gigs of ram (and ram is cheap!) and cache all those DB accesses and touch disk less often.  Azul Systems routinely sells gear to do exactly that.  To be fair, the GCs on X86 JVMs do not handle such large heaps well.  Some Day they will, and then you'll be delighted to have a 64-bit JVM, pay $49.99 for a few hundred Gigs of ram and skip the disk.</p>
<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[FG: The second issue is cache locality. Obviously using only pointersinstead of values in all collections (other than arrays of primitives)
leads to a lot more cache misses.]</p>
<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">64-bit VMs with 64-bit pointers get pointer-size bloat, and that appeared to cost Sparc about 15% in performance. X86-64 picks up double the number of GPRs which offsets the cache footprint lossage somewhat.</p>
<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[IJ: Hey Cliff, Regarding the comments about memory usage on 64-bit JVMs, compressed references can help for heaps smaller than 32GB on HotSpot.]</p>
<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[FG: So I would say C++ is a better choice for any application that
benefits from keeping a lot of data in memory. By the way, C# is vastly
better than Java for such applications due to its value types.]</p>
<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">Yes and no again...  For many applications you'll pay only 1 more cache-miss per entire array.  For arrays-of-small-structs (e.g. arrays of Complex), you are correct: Java's lousy there (and I added that to C's strengths).  When doing performance sensitive arrays-of-small-structs I turn the implementation 90-degrees and implement a small-struct-of-arrays.  It's clearly a work-around over a clunky language issue... but the performance is solidly there (both in memory footprint and in access speed).</p>
<p><br />
<strong>Places where Java beats C/C++ for obvious reasons:</strong></p>
<ul>
<li><strong>For Most Programs Runtime profiling </strong>is able to pay off well.  This is true
of <strong>most </strong>large Java apps; one obvious case is where there's a generic
interface but at runtime only one implementation is ever used: after
profiling the JVM can turn the interface call into a direct call and
even inline the call.  It's well known in the C/C++ world that peak
benchmark scores come from using the profiling passes these compilers
support, but it's pain to use them in practice... so probably 99% of
all C/C++ programs compile without the benefit of profiling.  In
practice, all Java programs are well profiled and this profiling data
allows for better code generation - sometimes *much* better code
generation.</li>
<li><strong>Very large programs </strong>(1MLOC +).  This is totally anecdotal on my part but my not-very-rigorous survey of the industry tells me
that the Java large-program tool-chain (and language features like GC)
are more robust and complete than the C equivalents and this allows
teams to write larger programs quicker than they could in C/C++.  Yes
large programs are written in C/C++, and yes they get the memory usage
"right enough" that the programs run usefully well... but the same
program written in Java appears to come together quicker, with fewer
bugs and a shorter overall development cycle.  I see a *lot*more 1MLOC
Java programs than C ones and it isn't because Java programmers write
fluffier code (which might also be true...) these large Java programs
are really doing a "bigger" job than what can be squeezed into 100KLOC
of C code.  In this case "Java beats C/C++" really means: we can't
afford to build these systems in C/C++ but we can in Java... so there
isn't any C/C++ equivalent.  Where's the C/C++ version of WebSphere or
WebLogic?  Maybe somebody Out There can tell me the state of C/C++
Application Servers...<br /><span style="background-color: #c0ff80; font-family: Arial;">Got a comment that similar functionality comes from a bunch of separate cooperating C processes.  Not sure I believe that, as I haven't seen anything close to the level of integration e.g. WebLogic has in the C world.</span></li>
<li><strong>Garbage Collection </strong>is just easier to use than
malloc/free.  This is well documented in the industry, and yes it's not
entirely "free".  Yes the heap needs to be managed in production
environments, leaks are still an issue, GC pause times are an issue,
etc, etc... but overall it's vastly quicker to write using GC and in
the time saved make the program more performant or resilient to GC
pause issues and you'll come out far ahead.  (I'm blithely ignoring all
the C/C++ hand-rolled memory management techniques like "arenas" or
"resource areas"; these fall into the category of "malloc/free is so
hard to use so we rolled our own poor-mans GC but if the language had
GC we would probably have never bothered").</li>
<li><strong>GC allows for entirely
different algorithms</strong>, isn't just easier to use than malloc/free.  Many concurrent algorithms are very easy to
write with a GC and totally hard (to down right impossible) using
explicit free.  Reference counting is commonly used in these situations
in C/C++ but it adds a lot more overhead (sometimes a *lot* more
overhead) and is much harder to get right: e.g. a common mistake is keeping the count in
the structure being guarded.</li>
<li><strong>Very Good Multi-Threading Support.  </strong>Parallel programming is just easier in Java.  <br /><span style="background-color: #ffff00; font-family: Arial;">[SS: Next dimension of Java performance is the easier access to parallelism.
Using j.u.concurrent I can beat C/C++ most of the time just using more
processors in my 16 core server. Concurrent memory management by hand
is a pain in the ass and using GC kills the point of using C/C++]</span></li>
</ul>

<p style="background-color: #ffff00; font-family: Arial;" /><p><br />
<strong>Places where C/C++ proponents claim C beats Java, but it doesn't appear (to me) to do so:<br /></strong>
</p><ul>
<li><strong>Most 'plain' modest sized program.  </strong>This will be programs
requiring no more than the "usual" compiler optimizations and are not
so tightly constrained by machine size or startup time.  Examples might
be things from simple compute-bound loops (string hash, compress) up to
IDEs &amp; editors (and most visual tools); DB cache/drivers, etc.<br /><span style="background-color: #ffff00; font-family: Arial;">[SS: </span><span style="background-color: #ffff00; font-family: Arial;">For the example where Java beats C/C++ a number of times visit
www.caucho.org and their OSS PHP engine Quercus. You can check the
numbers yourself using my http://code.google.com/p/wikimark/
For the example of super-fast memory DB in Java visit
http://www.h2database.com it beats MySQL both in performance and
footprint :) Of course it's specifically tuned for in-memory use.
And Java is just an example of so called Managed Runtime (Microsoft
term). If we are talking about Java performance we are mostly
considering JITed code performance. But JIT can be effectively applied
to C/C++ as well, see http://llvm.org Apple Mac OpenGL implementation
is based on LLVM and all OpenCL implementations too. So anyone running
their games on Mac or planing to use OpenCL will use the JIT.
</span><span style="background-color: #ffff00; font-family: Arial;">]</span></li>
<li><strong>16-bit chars vs 8-bit chars.</strong>  It's true that 'char' in Java is similar to C's 'unsigned short', but 'byte' exists and is similar to C's 'signed char'.  This confusion appears to confuse some number of new C-to-Java converts, but many of the old C tricks for dealing with different sized data types work great in Java and the JIT is very good at bit-fiddly code.  There are generally lots of Strings in big Java programs and this leads to code bloat, but if you deal with high-volume String data and it's all ASCII (no internationalization!) then converting the data to byte arrays makes sense.<br /><span style="background-color: #ffff00; font-family: Arial;">[MI: Internally, bytes are stored as 16 bit integers but cast down to only 8 bits.]<br />
<span style="background-color: #c0ff80; font-family: Arial;">No,
'bytes' are stored in memory as 8 bits always, 'chars' and 'shorts' and 16 bits.  CPU register contents
always depend on the JIT'ing of the moment - for both C and Java.<br /></span> </span><span style="background-color: #ffff00; font-family: Arial;">[MI: it's necessary to add a check for the state of the byte and a math operation to correct it prior to every byte transformation, and a check for the state afterwards.]<br />
<span style="background-color: #c0ff80; font-family: Arial;">Just mask the results (no "testing" or control flow), or use byte-typed variables.</span> </span></li>
<li><strong>Raw Small Benchmark Speed</strong> - This is actually mostly a non-issue, as Real Programs rarely look like these things, nor run for &lt;1sec, nor have all their time spent in 3 lines of uber hot code, etc... But Java still looks fairly good here, despite the general static-compilation bias built in to tiny short running programs:<br /><span style="background-color: #ffff00; font-family: Arial;">[SS: First about Java performance. Java is the second fastest mainstream
language after C/C++ in the Benchmark game, see
</span><span style="background-color: #ffff00; font-family: Arial;">http://sho otout.alioth.debian.org/</span><span style="background-color: #ffff00; font-family: Arial;">
And it's less than 2 times slower than C/C++ in this mostly numerical
benchmark. It could also be notice that JRuby is faster than native
Ruby.
Latest versions of Sun JVM efficiently use SSE and are comparable with
C in FP performance. </span><span style="background-color: #ffff00; font-family: Arial;">http://math.nist.gov/scimark2/index.html</span><span style="background-color: #ffff00; font-family: Arial;">]</span></li>
<li><strong>First-Person "Shooter" PC Games </strong><span style="background-color: #c0ff80; font-family: Arial;">[Retracted: I've gotten several well-written posts explaining how games have changed over the past twenty years.  :-)]</span> using the same game card &amp; interfaces (e.g.
DirectX).  Most of the heavy lifting is done by the game card itself;
the game engine is more about managing other resources and running the
game state &amp; AI's... all of which seems to me to work nicely in
Java.  I've met at least one person working this approach for-real (and
it was *working* for real, but I haven't kept in touch so I don't know
how it came out in the end).    </li>
</ul>

<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[AD: Given that most games use 'Optimizations beyond "gcc -O2"', and tend to
have an interpreter or scripting language, and often require 'Direct
machine access', plus plenty of tricky computationally intensive maths,
that would put them squarely into the world of C++. Especially any game
designed for a console, or handheld device.]</p>
<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">For PCs: I'm still thinking Java is up to the task.  I've yet to see games that needed BLAS-style routines; simple 'saxpy' style loops should come really close to C performance (not heavily tested!  But I've routinely talked people into testing Java FP performance and routinely had them come back with a positive report).  If 'direct machine access' is limited to a handful of graphics-card calls per frame (so hundreds to thousands per second), then JNI can handle that no-problem.  The games I worked on long ago didn't use any scripting; back then we would have used Java instead of scripting, so I don't have a feel for how crucial scripting is to game development.</p>
<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">For consoles &amp; handhelds perhaps; I did games on console like devices a long time ago (20+ years) and my vague recollection is that if a modern JVM could be squeezed into such a device it would be able to do the job. I weakened my assertion to just PC games.</p>
<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[TNT: A surprisingly large part of a game is performance sensitive and requires C code. Many games are CPU bound (not GPU bound as you suggest).</p>
<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">Ahhh, but exactly what I am showing here is you cannot equate 'performance sensitive' and 'requires C code'.  Java is up to speed for most (if not all!) of the performance sensitive parts.</p>
<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[AJ: AAA games now typically use middleware or custom physics engines that have highly-tune collision detection code (also used by game "AI") and
custom nonlinear solvers. Both are often SIMDed, prefetched and cache
blocked, with the CD sometimes doing bit-fiddly decompression, and the
solvers sometimes using custom code gen (for derivatives and such).]</p>
<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">JVM: SIMDed: 1/2 yes, prefetched: yes, cache blocked: no.  Perhaps closer than you think but probably not close enough.</p><p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[AJ: Some PC games push 5-10K draw calls per frame, with very roughly 4
additional state-setting calls per draw. So @60fps, that's something
like 1.5-3M/sec. Games are often single-thread limited on this alone.]</p>

<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;">The obvious fix is to batch the graphics calls per JNI call, but this starts to look like a hybrid C/Java solution and those rarely look pretty.</p><p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[AJ: PC games are also chew plenty of CPU with real-time decompression
(unzip or custom) and sometimes recompression (say JPG-&gt;DXTC).]</p>

<p class="blockquote" style="background-color: #c0ff80; font-family: Arial; margin-left: 40px;"> Azul Systems can't use the X86 ZIP routines so we went with the Java ones: performance was about as good as 'gcc -O2' and it was easy to parallelize.  After parallelization the Java version was as fast as we cared to add CPUs.</p><p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">[JJJK:  As for Java and games, it actually works better than most people think. No problem for indie games. But there are some issues:</p>

<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">GC pauses ruin everything. Smooth framerates are difficult to
achieve with the usual Fully-OO-Java-Programming-Style. Also the
resources on the GPU are not garbage collected, so you'll have some
kind of paradigm-clash anyway.</p>

<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">If you want do do some transformations on vertex arrays on the CPU,
you'll have to do them on direct byte buffers, since they are the only
arrays that can be sent to the GPU. Or do them on java arrays and copy
them into byte buffers, I have no idea what that does to performance
though.</p>

<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">3D-Vector math in java is plain ugly. You can either make it
readable or fast. And if you don't pay attention, it will be neither.</p>

<p class="blockquote" style="background-color: #ffff00; font-family: Arial; margin-left: 40px;">On the other hand, with more data and computation going to the GPU
(and staying there for the most part), Java is at least becoming
moderately attractive for games. I worked in a company which is now
starting to release web-based 3D java games.]</p><p><br />
<strong>"Flaws" in most Java-vs-C performance methodologies:</strong></p><p>
These are ways in which many many people (wrongly) claim Java/C/Ruby/etc is
faster than C/Java/Python/etc.  Sometimes these issues aren't flaws at all,
but instead point out conflicting basic requirements that truly make
one language superior to another for a particular task.</p>
<ul>
<li><strong>Warmup.</strong>  Sometimes no-warmup is important (see comments
above about pacemakers), but more often a short warmup period is
irrelevant to the overall application.  If I'm using an IDE, I expect a
largish loading period... but then I'm using the IDE all day. I don't use an IDE for 0.1 sec.  If warmup is NOT important to the application, then allow the JVM a warmup period before comparing performance numbers.  Many of the benchmarks in the language shootout at <a href="http://shootout.alioth.debian.org">http://shootout.alioth.debian.org</a> fall into this camp: too short to measure something interesting.  Nearly all complete in 1 minute or less.  A very large set of Java programs (and programmers) write server programs that run for days and a combination of throughput and quick response under load are the key measures.</li>
<li><strong>Not comparing the same overall algorithm.  </strong>This is common
for larger programs where exact Java &amp; C equivalents do not
exist... sometimes one version or the other gets saddled with a really
bad implementation.  And sometimes you just can't do it but people try
to "fake it" with a straw-man for the "other" side.  E.g.
any-GC'd-language doing a bunch of concurrent algorithms versus
non-GC'd language, or direct code-gen especially of unusual machine
instructions (e.g. X86 codecs using MMX).  Again the <a href="http://shootout.alioth.debian.org">language shootout</a> suffers this problem... and so all results have to be carefully inspected.</li>
<li><strong>Nyquist sampling or low-res sampling errors.  </strong>e.g. using a
milli-second resolution clock when reporting times in the milli-second
range.  Both Java &amp; C have common timer APIs reporting times below
the millisecond (micro &amp; nanos), but actual real hardware &amp;
common OS's vary widely with what they implement.  </li>
<li><strong>Broken statistics.  </strong>This is a hard problem in Java and
easy to get wrong for subtle reasons, but people get it wrong in C/C++
and other languages also.  Running anything *once* suffers from a huge
variety of startup issues.  Re-running in the same program gets you
past one issue and into another: the JVM/JIT "compile plan" will vary
from run-to-run.  Within a single run you might repeatedly get "X
frobs/sec" (say for 10 runs in a row in the same JVM launch) but if you
restart the JVM you can easily see "Y frobs/sec" reliably (repeated 10
times in a row in the same JVM) one-in-10 runs.  This kind of variation
can only be managed with proper statistics. See "<a href="http://escher.elis.ugent.be/publ/Edocs/DOC/P108_138.pdf">Statistically rigorous Java performance evaluation.</a>"</li>
<li><strong>Flags, version numbers, environmental factors all matter:</strong>
"java" is not the same as "java -client" or "java -server", and might
make a 10x difference one way or another.  Same as using "gcc -O1" vs
Intel's reference compiler set at the max "-O" flag.  Linking your C
program as "ld b.o a.o" can give a 30% difference from linking as "ld
*.o" (link order affects I-cache layout).  Environment variable sizes (i.e., length of your username) can
push the initial stack positions up or down to where the stack collides
in the cache or not, etc.  See "<a href="http://www-plan.cs.colorado.edu/diwan/asplos09.pdf">Producing wrong data without doing anything obviously wrong!</a>". Again, well reported numbers and good
statistics are your friend here.</li>
<li><strong>Varying dataset sizes: </strong>I've seen test harnesses that
re-sized the dataset to keep the runtimes approximately equal to 1
second, but once the dataset no longer fit in cache performance dropped
by 10x.</li>
<li><strong>Claiming 'X' but testing 'Y': </strong> examples might be SpecJVM98 209_db claiming to be an in-memory DB test, but really it's a bad string-sort test, or writing a single-threaded program to test OS locking speed (with JVMs, uncontended locks will never hit the OS) or the Caffeinemark logic test (with a little loop unrolling the code is entirely dead).  <a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf"> See more examples here</a>. Modern larger benchmarks do fairly well here but many microbenchmarks run afoul of this.</li>
</ul>
<p style="background-color: #ffff00; font-family: Arial;">[SG: One thing I might suggest is dropping your local memory caches
before performing. Java gets a obvious speed boost beacsuse the JVM
code tends to be cached, and the C code also has the same. </p>

<p style="background-color: #ffff00; font-family: Arial;">sync &amp;&amp; echo 3 &gt; /proc/sys/vm/drop_caches</p>

<p style="background-color: #ffff00; font-family: Arial;">Then run each program twice. First one giving you the time it takes
to access and load the JVM and other libraries included, and the second
giving you what it takes once the libraries and everything is cached.]</p><p><span style="background-color: #c0ff80; font-family: Arial;">No.  Run at least *30* times to get a decent statistical regression.  See the above papers on the 'run it twice' methodology is <strong>*seriously*</strong> flawed.  However dropping the local memory caches probably helps.  Here's a nice writeup on Java perf testing issues: <a href="http://www.ibm.com/developerworks/java/library/j-benchmark1.html">http://www.ibm.com/developerworks/java/library/j-benchmark1.html</a>.</span></p><p /><p /><p><strong>Some Commonly Mentioned Non-Issues:</strong></p><ul>
<li><p style="background-color: #ffff00; font-family: Arial;">[AM: C++ stack allocation beats Java GC for allocation of small objects.]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Does
not.  You have evidence?  I have evidence that small object allocation
in Java is darned near free... but not totally free.  So C++ might win
here but not by any interesting amount.  And HotSpot does stack
allocation when it can.  You should do some testing here to convince
yourself.</span></p></li>
<li><p style="background-color: #ffff00; font-family: Arial;">[AM: Java apps have lots of dynamic casts in them.]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Yes,
and it's free.  Really 90% of all casts are removed by the JIT and the
other 10% take a 'load/compare/branch' sequence which is entirely
predicted by X86 hardware and runs in 1 clock cycle.<br /><br /><span style="background-color: #ffff00; font-family: Arial;">[EXO: </span></span><span style="background-color: #ffff00; font-family: Arial;">This is pretty off topic, but I'd like to know how you've reasoned that
a load/compare/branch sequence can be done in 1-cycle on x86. Of course
this varies by implementation, and I'm sure that there are x86's that
can handle the compare and branch in parallel despite having a
dependency chain on the flags, probably due to careful misalignment of
the relevant pipeline stages. But there's no way you can cmp on a
loaded value the same cycle it's loaded in. The L1 dcache's 2 or (more
likely) 3 cycle latency is going to get in your way.
Sure, the pipeline can be busy doing other independent things in the
meantime, but that's always the case and I don't think it's what you
were getting at.]<br /><br /><span style="background-color: #c0ff80; font-family: Arial;">You're close: indeed the cmp happens many cycles after the load.  Meanwhile the branch predicts - and correctly &gt;&gt;99.99% of the time- and X86 O-O-O dispatch carries on past the branch.  As long as the load isn't a full miss to memory the whole ld/cmp/jne sequence will retire long before it causes the X86 pipeline to stall, and consume perhaps 1 clock of decode/exec/retire work.</span><br /></span></p></li>
<li><p style="background-color: #ffff00; font-family: Arial;">[AM: Interface dispatch is slower than double dispatch through a vtable.]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Yes
and nearly never taken.  I've yet to see interface calls show up on any
profile of any interface-heavy crazy Java programs.  Inline-caches
replace 99+% of all interface calls.</span></p></li>
<li>Try to write "XXX" in Java with the same speed.  In this case: <a href="http://rapidxml.sourceforge.net/manual.html">http://rapidxml.sourceforge.net/manual.html</a>.  In general, these kinds of comments are useless, because who has the time to do it?  In this case, it looks to be about a month's worth of work (you gonna pay my salary?) ... and entirely doable... except I'd come up with an entirely parallel XML parser so I believe time to parse could be dropped from roughly 'strlen()' on a long string to 'strlen() divided by #CPUs'.  The thing these kinds of comments totally miss is this: plain 'olde Java-looks-like-C code translates to machine instructions same as C does.</li>
<li><p style="background-color: #ffff00; font-family: Arial;">[TNT: 
 Where does C# stand compared to the mentioned languages?]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Not
that I track C# all that closely but... I believe the JIT produces
substantially slower code than Java; Microsoft leans pretty heavily on
static compilers (and have a better statically-compiled-code
integration story than Java does).  Also, Java's Memory Model is well
tested &amp; well used; C# equivalent appears to be not so robust in
part because of the requirement to run all that old code.  A real C#
expert should chime in here, I'm not able to give C# a fair treatment.</span></p><p style="background-color: #ffff00; font-family: Arial;">[IK: C# would land in the same bucket as Java.  It might be slightly slower because more money and man-years were
put in JVM, or it might be slightly faster because of unsafe and
structs and stuff.</p>

<p style="background-color: #ffff00; font-family: Arial;">"Slightly" means "a few benchmarks would be ruined". Like, I've
heard, CLR did NOT do interface-vs-implementation profiling, thus some
code gained 10x boost by replacing all occurences of "IList" with
"List" (it couldn't figure out how to dispatch calls to concrete List
class really fast when the slot type was IList).]<br /><br />[PL: C# performance would fall into the roughly the same range as Java. I
have a RETE engine written in Java and C# and the java version is
faster. One area where CLR takes a performance hit is
autoboxing/unboxing. Atleast that's from my experience with my rule
engine. Aside from that, I would say the performance difference isn't
significant.] </p>
<p><span style="background-color: #c0ff80; font-family: Arial;">Last round of anecdotal evidence I gathered (now 2 yrs old) showed Java JITs well ahead of C# JITs.  Would love to see some hard numbers here.</span></p></li>
<li><p style="background-color: #ffff00; font-family: Arial;">[CC: I still don't understand why they do not cache the validated and
optimized memory image for next time the application is launched. .NET
can do this.]</p><span style="background-color: #c0ff80; font-family: Arial;">It's a really hard problem for multi-threaded programs - which typically do parallel class loading - and lazy compilation -which typically inlines the classes that are loaded *at the moment*.  Re-using previously JIT'd code will require, amongst other things, that your code loading order is the same, and the last code-load order you JIT'd from probably depended on stuff like network packet arrival order.  Given that startup time for big User GUI apps is typically NOT the JIT (it's the DISK), I personally have not been very motivated to try and do cached code optimizations.<br /></span></li>
</ul>
<ul>
<br />
</ul>
<p><br />
On purpose, I'm being sloppy in the numbers I report here... because I
don't want to spend my entire life beating this tired horse.  But if
somebody reports something widely different from what I'm seeing, I'm
happy to dig in further - if the reporter is also.  I don't have access
to every kind of compiler &amp; system on the planet so I can't repro
other peoples' results easily.  Also in my experience, the number one
reason for conflicting reports here is because the reporter has
something really simple wrong on their end and a short email session
will clear it up.</p><p style="background-color: #ffff00; font-family: Arial;">[WW: Here's a hint. Next time you write silly code for comparison, make them
do something more useful than integer operations and basic maths...
those will always be the same (or close) between a compiled language
and an interpreted one with JIT.]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Sorry but I have a life outside outside endless blogging... and 100KLOC examples aren't useful in a blog format anyways. </span></p>
<p /><hr size="2" width="100%" /><p><br />
<strong>String Hash Example:<br />
</strong><br />
Complete C++ code:  <a href="http://pastebin.com/d280c1cd4">http://pastebin.com/d280c1cd4</a><br />

Complete Java code: <a href="http://pastebin.com/m541c4655">http://pastebin.com/m541c4655</a><br />

Here's a bit of code computing string hashes that looks ideal for a
static compiler (ala C/C++), yet Java is tied in performance.  I used
fairly recent versions of 'g++ -O2' and 'java -server' on a new X86
(-server is the default for me, so no cmd-line argument needed).  The
inner loop is:</p>
<blockquote>
<p /><blockquote><p><code>  int h=0;<br />
  for( int i=0; i&lt;len; i++ )<br />
    h = 31*h+str[i];<br />
  return h;<br />
</code></p></blockquote>
</blockquote><p>
Here I ran it on a new X86 for 100 million loops:</p>
<blockquote><p><code>&gt; a.out         100000000<br />
100000000 hashes in 5.636362 secs<br />
&gt; java str_hash 100000000<br />
100000000 hashes in 5.745 secs<br />
</code></p></blockquote><p>
Last year I ran the same code on an older version of both gcc &amp;
Java, and Java beat C++ by 15%.  Today C++ is winning by 2%... so
essentially tied.  Back then the JVM did unrolling and "gcc -O2" did
not, and this code pays off well when unrolled.</p><p style="background-color: #ffff00; font-family: Arial;">[TD: 
 In the String Hash example, is the unrolling done by the JIT or javac?]</p><p style="background-color: #c0ff80; font-family: Arial;">Done by the JIT.</p><br /><p />
<hr size="2" width="100%" /><p><br />
<strong>Sieve of Erathosthenes:<br />
</strong><br />
Complete C++ code: <a href="http://pastebin.com/m3784c090">http://pastebin.com/m3784c090</a><br />
Complete Java code: <a href="http://pastebin.com/m4b414295">http://pastebin.com/m4b414295</a><br />
Here's a simple <a href="http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes">Sieve of Eratosthenes</a>, again compiled
with g++ -O2 and run with java -server.  Again this looks ideal for a
static C/C++ compiler and again Java is tied in performance:</p>
<blockquote>
<p /><blockquote>
<p /><blockquote><p><code>  bool *sieve = new bool[max];<br />
  for (int i=0; i&lt;max; i++) sieve[i] = true;<br />
  sieve[0] = false;<br />
  sieve[1] = false;<br />
  int lim = (int)sqrt(max);<br />
  for (int n=2; n&lt;lim; n++) {<br />
    if (sieve[n]) {<br />
      for (int j=2*n; j&lt;max; j+=n)<br />
        sieve[j] = false;<br />
    }<br />
  }</code></p>
<p /></blockquote>
<p /></blockquote>
</blockquote>
<p><br />
</p><p><br />
I computed the primes up to 100million:</p>
<blockquote><p><code>&gt; a.out      100000000<br />
100000000 primes in 1.568016 secs<br />
</code>
 <code>&gt; java sieve 100000000<br />
100000000 primes in 1.548 secs<br />
 <br />
 </code></p></blockquote><p>
So again essentially tied.</p><br /><p style="background-color: #ffff00; font-family: Arial;">[AJ: How do the sieve timings differ if you use an array of bits rather than of bytes?]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Good question, but I bet they're tied again.  The JIT does fine with bit-twiddley stuff.  Test it and let me know!</span></p><p style="background-color: #ffff00; font-family: Arial;">[AL:
 Note that your sieve is not a true one: <a href="Note%20that%20your%20sieve%20is%20not%20a%20true%20one:%20http://lambda-the-ultimate.org/node/3127">http://lambda-the-ultimate.org/node/3127</a>]</p><p><span style="background-color: #c0ff80; font-family: Arial;">Cute!  Just swap '2*n' for 'n*n'... but this doesn't change the C-vs-Java argument. <br /></span></p><hr size="2" width="100%" /><br /><p>
<strong>Profiling Enables Big Gains:</strong></p><p>
Complete C code:<br />
<code>vcall.cpp  <a href="http://pastebin.com/m70dbe7d6">http://pastebin.com/m70dbe7d6</a><br />
vcall.hpp  <a href="http://pastebin.com/m13055a8c">http://pastebin.com/m13055a8c</a><br />
A.cpp      <a href="http://pastebin.com/m5aa1b232">http://pastebin.com/m5aa1b232</a><br />
B.cpp      <a href="http://pastebin.com/m2e46ec23">http://pastebin.com/m2e46ec23</a><br />
</code>

<br />
Complete Java code:<br />
<code>vcall.java <a href="http://pastebin.com/m149bbdf0">http://pastebin.com/m149bbdf0</a><br />
A.java     <a href="http://pastebin.com/m2e33d6df">http://pastebin.com/m2e33d6df</a><br />
B.java     <a href="http://pastebin.com/m2b1d75bb">http://pastebin.com/m2b1d75bb</a><br />
</code></p><p>
This bit of code makes a virtual-call in the inner loop of a simple
vector-sum... and selects the v-call target based on a command-line
argument.  The JVM profiles, decides there's only 1 target and inlines
... and unrolls the loop and discovers a bunch of simple math that collapses in the optimizer.  The C/C++ compiler can't do it because there
really are 2 possible targets at static compile time.  Delaying the
compilation until after profiling can enable major optimizations
downstream.  I compiled the C++ version with 'g++ *cpp' and the java
version as 'javac *java'.</p>
<blockquote>
<p /><blockquote><p><code>    int sum=0;<br />
    for (int i = 0; i &lt; max; i++) <br />
      sum += val();  // virtual call<br />
    return sum;<br />
 </code></p></blockquote>
</blockquote><p>
Here I run it on the same X86:</p>
<blockquote><p><code>&gt; a.out 1000000000 0<br />
1000000000 adds in 2.657645 secs<br />
&gt; java vcall 1000000000 0<br />
1000000000 adds in 0.0 secs<br />
</code></p></blockquote><p>
The Java code is "infinitely" quick: after JIT'ing the -server compiler
essentially deduces a closed-form solution for the answer and can
report the correct result with a few bits of math... no looping.  This
example is ridiculous of course, but it points out the obvious place
where dynamic compilation beats static compilation.  This "make a
virtual call into a static call" optimization is a major common
frequent optimization JIT's can do that static compilers cannot.</p><p style="background-color: #ffff00; font-family: Arial;">[<span>QB: If your compiler can reason that the virtual function is pure,
then the entire loop can be folded into single vcall and multiply. </span>]</p><p style="background-color: #ffff00; font-family: Arial;">[ALJ: It's worth noting that you can achieve the same effect - with the
compiler turning multiple additions into a single multiply - by using
C++ templates.]</p>
<p style="background-color: #c0ff80; font-family: Arial;">My example is perhaps too trivial; I can get the same performance benefits with a non-pure function (so no chance of using 'pure' in static setting).  In practice, you get dozens to hundreds of Interfaces turned into concrete Classes, and hundreds to thousands of calls to such optimized... using a code-cloning technique like C++ templates is going to blow you out with exponential code.<br />
</p>
<p style="background-color: #ffff00; font-family: Arial;">[QB: <span>Although initially expensive and not quite thread safe self
modifying code will do the trick here. Alternatively, you could use
existing dynamic recompilation techniques to make it thread safe, which
I understand is probably veering off into dynamic compilation land...</span>]</p>
<p style="background-color: #c0ff80; font-family: Arial;">Exactly
inline-caches are thread-safe self-modifying code... but they still
look like a call (inline caches make vtable calls as cheap as plain
calls).  In this case the big gain comes from removing the call
altogether, which means knowing there's only 1 implementation of the
virtual.</p>
<hr size="2" width="100%" /><p><br />
</p><p><br />
I hope this clears up where I stand on this issue...  and I'm (sorta) looking forward to the flamefest which is surely coming...</p><p>
Cliff</p></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/09/java-vs-c-performance-again.html</feedburner:origLink></entry>
  <entry>
    <title>Panic Last Week</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/gRNE8M18b4o/panic-last-week.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20120a552323e970c" title="Panic Last Week" />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20120a552323e970c</id>
    <issued>2009-08-16T09:36:56-07:00</issued>
    <modified>2009-08-22T16:02:51Z</modified>
    <created>2009-08-16T16:36:56Z</created>
    <summary>(warning: no technical content) Prior Friday: my Mom calls my sister &amp; tells her my brother Eric (hiking the entire PCT this summer) is 2 days late/overdue. Sister calls me in a PANIC!!! I plan on Saturday driving 5 hrs...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>(warning: no technical content)</p><p><strong>Prior Friday:</strong> my Mom calls my sister &amp; tells her my brother Eric (hiking the entire <strong><a href="http://www.pcta.org">PCT</a></strong> this summer) is 2 days late/overdue. 
Sister calls me in a PANIC!!!  I plan on Saturday driving 5 hrs to
<a href="http://maps.google.com/maps?f=q&amp;source=s_q&amp;hl=en&amp;geocode=&amp;q=seiad+valley,+ca&amp;sll=37.0625,-95.677068&amp;sspn=49.978077,63.193359&amp;ie=UTF8&amp;ll=41.826595,-123.033142&amp;spn=1.481774,1.974792&amp;z=9">northern California</a> to calm my Mom &amp; help coordinate search &amp; rescue. 
It's not like I can get on the trail and expect to cover 100miles in 4
days like my brother's been doing.  I can't reach Mom (no doubt she's driving twisty mountain roads and
is out of cell phone range).  To make matters worse my wife has been sick all the
prior week while trying to run Vacation Bible School (her and 8 kids got it from somebody who brung it; why do people bring their sick kids to School?); she's still recovering.</p>
<p><strong>Saturday: no panic.</strong>  Mom calls; Eric's been found and is not really
late: he had told her he probably would not be able to call midway
through the 100 mile stretch and indeed had no cell coverage the entire
time.  Mom had the local Sheriff out canvassing everything and
almost ready to put up a Search&amp;Rescue plane when she finds Eric (and the Sheriff really didn't want to do that because there was already almost 100 planes in the air fighting the local wildfires).  Mom's too
old to panic like that.   Grrrrr.....</p>
<p><strong>More Saturday:</strong> I take the kids to the beach while my wife sleeps off the cold; my eldest daughter needs to photo <a href="http://www.parks.ca.gov/?page_id=541">tidepools</a> for her Bio project and we all need
to relax.  In the middle of the relaxation my Dad calls home (my folks are divorced so no relationship with my Mom anymore) and reaches my wife napping; Dad and new wife (of nearly 20 years) are
in Stockton and will be arriving tomorrow.  <em>PANIC!!!!  </em>Wife's
still recovering; the house is a total shambles AND her inlaws are
arriving in less than 24 hrs.  </p>
<p><strong>Sunday:</strong> Inlaws arrive.  Much grandparent fun.</p>
<p><strong>Monday:</strong> I leave my folks to have a looong planned meeting with Intel.  I arrive and <span style="text-decoration: underline;">PANIC!!!</span> the power is out.  We
relocate to the break room which has floor-to-ceiling windows.  Makes me feel like Azul is a high-powered
got-act-together startup to host Intel folks in the dark.  Dad &amp; wife move
on (an unannounced drive-by grand-parenting), as they are traveling the western US as part of their retirement.</p>
<p><strong>Tuesday.</strong>  I spend all day on training &amp; mentoring, and
recovering from repeated power failures.  All boxes need rebooting and
..."PANIC!!!"... patches get installed, don't play nice together, need
more patches, more screwing around.  Things more or less functional
again by late Tuesday.  Of course a company wide demo coming on Thursday is hitting an unexpected new crash, so the demo gods are panic'ing and I can't help them all day because of power outage crapping out my machines.</p>
<p><strong>Wednesday.</strong>  Unrelated at-home PANIC keeps me busy all morning
(plus bugs in high-scale-lib showing up a year late).  I finally get the fires out around
noon and my wife offers to go out to lunch w/me.  First time we'd eaten
together w/out kids in a month.  Of course, the housekeeper has hid my
scheduler book and there's something in it keeps nagging me.  Just
before we walk out the door I find it and stuff it in my briefcase
un-inspected - because a rare lunch opportunity w/my wife is not to be
missed.  5 minutes later Jeremy of Google calls: I missed my planned Google
lunch w/ high-powered Java hackers and he wants to know if I'm
still giving the tech-talk at 1?  <em><strong><span style="text-decoration: underline;">PANIC!!!!</span></strong></em>  I
turn the car around, drop off the wife, panic another 5 minutes trying to get latest
slides on wife's tiny new netbook because the standard presentation laptop is
sick before the laptop recovers and I grab it and roar out the door.  I make it to Google
just in time (no officer, I wasn't speeding I was just flying too low).  <a href="http://www.youtube.com/watch?v=sgxCb7fjauE">The talk goes well</a>.  (<a href="http://www.azulsystems.com/events/vee_2009/2009_VEE.pdf">slides are here</a>)  </p>
<p><strong>Thursday.</strong>  Demo goes OK.  I actually get some work done.  It's quite shocking.</p>
<p><strong>Friday:</strong> Planning on some kind of panic but it hasn't
happened yet.  Got a concert planned tonight in Santa Cruz withOUT
kids... so an hour drive away from any kind of rescue.  We're making
plans (got neighbors lined up, etc).  In the end wife and I are so exhausted that we bail on the concert and settle for a nice dinner out without kids.</p>
<p><strong>Saturday:</strong> We have a 7:30am Dr appointment for my youngest (no: I normally never hear of Dr's giving 7:30am Saturday appointments, but my youngest needs to see a serious specialist who happens also to be a really nice guy), then a long planned trip
to SF zoo (<strong>In theory</strong>: more grist for the daughter's Bio project.  <strong>In practice</strong>: she's got a long-distance romance going on with somebody who's 2 hours north of SF so they agreed to meet in the middle).   We all get too much sun wandering the SF zoo in perfect weather.  I finally come down with the dreaded VBS cold, and feel like crud the whole day but had a great time nonetheless.</p>
<p><strong>Sunday:</strong> We're all recovering from massive sunburns and a week of madness.  School starts tomorrow... plenty of time to panic then!!!</p>
<p><br />
Cliff</p><p>PS- Concerned that there might be a correlation between PANIC!!!! and my
receding hairline</p></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/08/panic-last-week.html</feedburner:origLink></entry>
  <entry>
    <title>2009 ECOOP</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/zy5Lpk6lc3U/2009-ecoop.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20115715efab9970c" title="2009 ECOOP" />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20115715efab9970c</id>
    <issued>2009-08-02T09:32:31-07:00</issued>
    <modified>2009-10-21T14:46:19Z</modified>
    <created>2009-08-02T16:32:31Z</created>
    <summary>Long delayed, but at last I found time to publish my notes. I got a free all-expense paid trip to Italy, and all I had to do was give a keynote speech (slides) at 2009 ECOOP. It's a nice gig...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Long delayed, but at last I found time to publish my notes.</p><p><br /><br />I got a free all-expense paid trip to Italy, and all I had to do was give a <a href="http://www.azulsystems.com/events/vee_2009/2009_VEE.pdf">keynote speech (slides)</a> at <a href="http://ecoop09.disi.unige.it/">2009 ECOOP</a>.  It's a nice gig if you can get it.  Travel to Genoa from San Francisco takes awhile; my 8:30am SF flight got me in Genoa about 2pm on the *next* day - plus 2 hrs lead time at SF, plus a 1hr drive to SF - I started my "day" at 6am on Friday and ended it at 8pm on Saturday.  I managed to stay up till dinner (barely) then crashed pretty hard.  Sunday was my day for sightseeing; I managed to wander pretty far all around Genoa; rode some <a href="http://en.wikipedia.org/wiki/Funicular">funiculars</a>; walked all over; took loads of pictures; ate weird food and generally acted like a typical American tourist.  </p><p>There have been loads of sidewalk vendors; all Blacks aggressively selling fake designer purses and sunglasses laid out on sheets on the sidewalk.  Monday afternoon I got to watch a bust go down; lots of yelling, they grabbed their sheet bundles and took off running with plain clothes police in hot pursuit.  Somebody lost his wares and the police quit chasing to grab the loot.  It's Tuesday evening and the Blacks haven't returned.  I promised my wife I'd get her a cheap Gucci purse and now I'm regretting I didn't do it earlier.</p><p><a href="http://www.hotelbristolpalace.it/">Hotel Bristol</a> is a nice older place (means: lots of paint chips and paint layers, big chandeliers, gold leaf trim, dark wood, creaky old elevators; rooms are a mix of really old and really new; follow the link and oogle at the pictures).  My room is vast with 15ft ceilings (with chandelier of course) and a large jacuzzi tub; enough room to put up a basketball hoop.  The conference food has been pretty nice (free good wine with every meal); lunches are served in the <a href="http://www.palazzoducale.genova.it/eng/naviga.asp">Palace Ducalle</a> - historic summer residence of Doges.  I can't do the setting enough justice; the website is nice but the pictures don't really tell the story.  50ft ceiling with 20ft chandeliers and a dozen 8ft marble statues; gold leaf trim on every fresco; stunning paintings the size of basketball courts...  I totally laid down on the floor and stared at the ceiling for a good hour.  The local restaurants, on the other hand, are a hit-or-miss affair; food is only served at certain hours; sometimes the food has been very good and sometimes no better than the local deli around the corner from Azul.</p><p>The <a href="http://www.icooolps.info/">ICOOOLPS</a> workshop was on Monday; I got invited to talk there at the last minute.  I had some misgivings but most of the papers at the workshop were pretty nice; I enjoyed myself.  I'm still fighting jet-lag (today is Tuesday) pretty hard, so I'm writing this to try and stay awake until dinner.  The papers are <a href="http://portal.acm.org/toc.cfm?id=1565824&amp;type=proceeding&amp;coll=GUIDE&amp;dl=GUIDE&amp;CFID=56999077&amp;CFTOKEN=32639676"><strong>here</strong></a>.</p><p><br />1st up - The ICOOOLPS workshop.  ECOOP notes are further down.  As usual I abbreviate ruthlessly; skip some talks; take notes in a stream-of-consciousness style.  Caveat Emptor.  And before I forget: </p><p class="blockquote" style="margin-left: 40px;">The "Nick Mitchell SimpleDateFormat Challenge": parse a sample common SDF and JIT code to translate Date objects to ASCII efficiently.  Similar to C/C++ compiler optimizations for 'printf' strings.</p>
<p>Let me know when you got something working.</p><p><br />
</p>
<p /><p />

<hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Towards an Actor-based Concurrent Machine Model</span></strong></p><p>"delegation based" machine model; dynamic separation of concerns "MDSOC"</p><p>Machine model: objects, msgs, delegation.  High lvl objects represented as at least 2 low-level objects; a "proxy" and a real object.  Indirection to allow message interception.  All msg-sends get forwarded thru the proxy ("message" here is a java-lvl function call, not an OpenMPI msg - but could be either in a distributed system).</p><p>Aspect-Oriented-Programming - intercept msg-send links (between Class-proxy and methods in the actual Class-body), etc.  How to do CHA with basically dynamic sub-class insertion?  Just re-JIT in the new class hierarchy?</p><p>Ok - now new stuff... want concurrent in the delegation function.  Actor is a collection of local objects (and ptrs to remote objects).  Fcn calls between local objects are fast; fcn calls across actors are basically remote-msg send. Then receiving a msg invokes a co-routine in the remote actor.</p><p>Ugh, using 'yield' with the co-routines.  Sounds buggy to me: semantics of dead-lock will depend on proper calling 'yield'.  </p><p>(assuming a VM w/ actors not thread support)</p><p /><p /><p /><hr />
<p><strong><span style="font-size: 15px; font-family: Arial;">An Efficient Lock-Aware Transactional Memory Implementation</span></strong><br />Justin Gottschlich</p><p>Trying to integrate into the "boost" STM system.</p><p>Locks+TM break things.<br />But locks are prevalent - must be able to compose with them.</p><p>Example: Locks outside Transaction<br />- T1 runs lock; T2 runs transaction<br />Example: swap code.  Works with only-locks or only-TMs but breaks if mixed</p><p>Locks inside Trans<br />- T1 runs trans, calls lock; T2 runs either trans or lock</p><p>Prior work: "full lock protection" - if you take a lock, then the XTNs are all stopped/aborted/etc and you only get locking behavior.</p><p>Offers a programmer solution: programmer lists lock/XTN conflicts and the XTN system deals when you take a conflicting lock.  (I believe this is unrealistic).</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Eric Jul</span></strong>? - Whiteboard not slides....</p><p>Blurring the line between compilers &amp; runtimes</p><p>Gives some examples already done by hotspot</p><p>Nice discussion, but nothing new.  Mostly trying to get a discussion going about crosing compilers/JITs &amp; runtimes.  He might not have been aware about what the JIT already does.</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Trace-Based Type Specialization in JavaScript</span></strong><br />Andreas Gal</p><p>Same basic talk as given at PLDI.  TraceMonkey? FireFox 3.5<br />JavaScript &amp; Flash instead of Java/C/C++</p><p>JS has been very slow (interpreted only).  But is here to stay; very popular &amp; growing.  ActiveX &amp; client-side Java dying out (not sure about client-side Java which was never popular in the 1st place).</p><p>Static typing makes life easy, but dynamic typing is required.  More complex data types and more runtime tests.  Tag bits to be checked; overflow to Double, etc.</p><p>Coming around to wanting a HotSpot like dynamic JIT'ing thing based on types that happen to be true at the moment.  Basically, types in program traces remain stable over time.</p><p>  - for( var i=0; i&lt;100; ++i ) { /* nothing */}</p><p>loop:<br />  if ( int(i) &gt;= 100 ) break;<br />  i = box(int(i)+1);<br />  goto loop;</p><p>So trace can record, e.g. that 'i' has always been an 'int' so far.  Trace has a guard on the input types, and are type-specific.  Function calls are inlined in the traces, along with guard statements to check that you are taking the same control-flow.  </p><p>Tracing loops, and can verify trace is type-stable across loop.  So can remove e.g. boxing in the loop.</p><p>But real traces have many exits and the many exits are really taken.  So trace along each exit.  Build trace-trees, rooted at loop headers (exponential growth of trace trees as the loop DAG is split out?).  Can only link back into original tree if types all match up - else need a new loop /tree header.</p><p>Suffering for lack of a language spec; JS programs are driving the language semantics (i.e., w/a new interpreter some JS programs fail so we call the interpreter buggy- even if it meets the loose "spec").  No nothing of a memory model or threading; lots of other holes in the language.  Echo's of what Java went through: the popular implementations defined the spec.</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Tracing the Meta-Level: PyPy's Tracing JIT Compiler</span></strong><br />Carl Bolz</p><p>PyPy is a tracing JIT compiler.  Now apply this tracing JIT to an interpreter. <br />Compiler "Restricted Python" RPython - can target C, Java, .Net<br />Various interpreters: python, prolog, smalltalk, scheme, etc</p><p>Basic idea: <br />trace loops; look for type-stable loop execution; look for similar code path loops.</p><p>But the dispatch loop in interpreters means you never execute the same loop code twice (because each time you are running a new different bytecode).  Goal: trace user-mode program, not the language interpreter.  Effective the tracing interpreter unrolls the bytecode dispatch loop.  Provde 3 hints to laguange interpreter.  - hint for position key; here is the language interpreter's BCI; here are backwards edges; here is the PC modification</p><p>Works fairly well to clean out the language interpreter from the traces.<br />Then get traces which are fairly clean &amp; can JIT them.</p><p>Bears resemblance to partial evaluation, arrived at by different means.  Future work: better optimization of traces; some escape analysis to remove boxing operations.  Optimize frame objects, apply to larger programs.</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Faster than C#: efficient implementation of dynamic languages on .NET</span></strong><br />Antonio Cuni</p><p>Trying to make e.g. Python faster on .Net.<br />Looking at IronPython, Jython, JRuby, Groovy<br />(also Self, JavaScript/TraceMonkey/V8)</p><p>Why so slow?  Hard to compile efficiently; lack of type info at runtime; VMs not optimized to run them.  .Net is a multi-language VM, right (sure, as long as the language is C# - his quote, not mine!).  JVM is in better shape, but still heavily optimized for Java.</p><p>JIT compiler?  Wait until you know what you need; interleave compile-time and runtime; exploit runtime info.  JIT layering; fight the VM...</p><p>PyPy; JIT compiler generator; Python semantics "for free".  JIT frontend not limited to python; JIT backend: x86 or CLI/.NET backends.  Fun games with partial eval: assume e.g. Python bytecodes to be constant &amp; constant-prop them into the python interpreter.</p><p>Do even more fun with constants than HotSpot+OnStackReplacement - totally doing speculative constant value JIT'ing - if this argument is of value '3' then here is the JIT'd code.  Trick is to pick which variable to constant- speculate on (and getting that spec value as well).</p><p>Not yet doing all of Python, but getting really great speedups.</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Strata Virtual Machine</span></strong></p><p>Software Dynamic Translation - read a binary, &amp; translate it/jit the translated code.</p><p>Using a code-based hash-table lookup.  Hash; jump to hash-table entry.  Miss: jump to 'strata' interpreter<br />Hit: jump to bucket; bucket checks target (like HS inline cache, check at target)</p><p>HotSpot could use this to make more efficient v-calls?  Hash; indirect jump to nearby code table; table jumps to target &amp; checks target; expect no misses after warmup.  Nah... still got indirect branch in there.</p><p>Looks like a standard binary dynamic translation type stuff (converting PPC instructions to Java bytecodes?)</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Automatic Vectorization in JIKES RVM</span></strong></p><p>Using SIMD ops on X86.  Can't use BURS/BURG to pattern-match vector ops.  Unroll loops to make the patterns more obvious.</p><p>Basically getting some SIMD stuff to work (but talk given by PC not by author, and PC believes this is not the right way to discover SIMD ops).  I also believe this... it's not a clean fit to BURS.</p><p>Code emit via simple bitmask/shift/and/or.</p><p /><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">JIT Compilation on ARM Processors</span></strong><br />Michele Tartara</p><p>ARM - 31 32-bit GPRs, 16 available at a time?.  SP, LNK, PC are GPRs.  Fixed format 32-bit ops.  Dynamic fast compilation (cell-phone targets).  No BURS or tiling, just greedy rules.  This is probably the right way to go always.</p><p /><p /><p /><hr /><p><br /><strong><span style="font-size: 15px; font-family: Arial;">ECOOP 2009 Proper Starts</span></strong></p><p>ECOOP is being held in the Piazzo Ducall (Ducal Palace) - the main conference presentation chamber has perhaps 40ft ceilings, acres of gold leaf on the walls in between the massive medieval paintings.  The speaker dias was clearly meant to hold a throne or the orchestra; it's a circular marble dias perhaps 20ft in diameter with marble balustrade and marble railings.  The high tech screen &amp; projector setup in the middle is really anachronistic.</p><p>The adjacent formal ballroom is much larger; 50ft+ ceilings; chandeliers of at least 20ft tall and 30ft in diameter; paintings &amp; gold leaf in plenty plus also perhaps a dozen 8ft marble statues with ancient greek themes.</p><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Keynote - Classes, Jim, but Not as we Know Them</span></strong><br />Simon P Jones</p><p>As usual for Simon, he gave a wonderful presentation.</p><p>History - Haskell is 20yrs old.  Lots of fun with new languages &amp; machines &amp; ideas (Functional Programming).  After Backus's Turing Award lecture opened the gates, a storm of new languages hit the field.  Took 10 yrs before Haskell formed out of the chaos of new languages.</p><p>Lifetime of most programming languages: </p><ul>
<li>Invented by 1 person, used by 1 person, dies within a year.</li>
<li> Slow death version: invited by 1 person, used by 100, dies in 5 yrs</li>
<li> Successful: used by 100K+ programmers, never dies, e.g. Fortran still alive</li>
<li>Haskell?  hovering around 1000 programmers, but sharp recent uptick?</li>
</ul>
<p>Haskell might cross over into immortal language (100K+ programmers)?  Hackage: user uploaded libs, reaching 300 new libs uploaded /month, and approaching 1million downloads.</p><p>Type Classes- <br />Shows examples of lots of things you want actions on a wide variety of types:</p><ul>
<li> equality (for set membership?)  how to do equality of functions?</li>
<li> ordering (sorting lists)</li>
<li> print/showing</li>
<li> serialization</li>
<li> computing hash function</li>
</ul>
<p /><p>(so far sounds like Java interfaces)<br />For all 'a' such that 'a' is an instance of class 'Num'<br />  square Num a ==&gt; return a*a</p><p>Num classes have the following methods: +,-,*,/, etc.<br />The implementation of Num classes binds '+' to eq 'plusInt' or 'plusFloat'</p><p>Implementation is perhaps more clever than Java invokeinterface bytecode lookup: is passed in a vector of fcns, of the correct type of interface 'Num'.  The fcn lookup happens on this particular 'vtable' - call it an itable, but it's really a unique v-table per interface.</p><p>For Haskell, this is all done with syntatic sugar over the basic Haskell.  Which itself is very cool (e.g. Haskell compiler (javac equivalent) is doing interface calls w/syntatic sugar).</p><p>CLIFF NOTES: This is a faster way to do Java interface calls!!!  Or maybe this is how they are impl'd?  Find the i-table from a list of implemented interfaces, then pull out the correct fcn from the i-index.  In his world the i-table is pulled out early and kept pulled out, which avoids the point-by-point lookup of interfaces.  Ahhh.... he's statically computing the i-table ahead of time during compilation, using some combo of type unification and the closed-universe assumption.  Not applicable to Java.</p><p>Able to handle fcns which themselves take arguments of fcns with any signature, as long as the fcn handles the correct interface.</p><p>Doing type-based dispatch instead of value-based dispatch.  Not the same as Java interfaces...<br />In Haskell, can bind a value arg to multiple interfaces all at once.<br />In Haskell, existing classes can be made instances of new interfaces (duck typing)</p><p>Haskell has NO subtyping, but uses the interface (well: type class) notion instead.</p><p>Haskell: types are inferenced typically, type annotations occasionally needed<br />Java: types are declared explicitly; inference occasionally possible</p><p>Haskell does type unification, so things like Java's 'equal' call - which is normally a fcn of (Object,Object)=&gt;Bool - in Haskell, we KNOW ahead of time that the 2 args are of the exact same type.  Hence the implementations of various equals calls do not need to ask the "instanceof" question that all Java equals call implementations all start with.</p><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Making Sense of Large Heaps</span></strong><br />(Nick Mitchell, IBM)</p><p>Where does the memory go in large java apps.</p><p>Example from 2008, Cobol to Java.  In Cobol: 600 bytes; in Java 4492 bytes.<br />Blow up from delegation, bulky collections, or too many data fields.</p><p>Size examples: 58M objects, 3Gig ram, 8000+ types.</p><p>Yeti Tool - does smart cluster of datatypes (String w/char[], or HashMap w/HashMapEntry[], HashMapEntry, HashMapEntrySet).</p><p>Yeti also does dominance of sharing; notices single-owner root/tree bits, and then breaks out the shared parts at the leaves.  Also detect e.g., similar kinds of sharing, such as array B[] points to a set of B's; but also a linked list of Entry's points to the elements of B[] in turn.  Really lumping together things with the same owner; maximal dominance w/same ownership relationship.  Does nice heap shape summaries, followed by size info.  </p><p>Picture is really alternate sandwich effect, alternating between the Collections you selected, and the user-data at that layer of the 'sandwich'.  Edges hold the various sizes of things.  They fold up collection 'backbones'.  </p><p>If we get the data structures right, can easily shrink heaps by 2x to 10x!!!  Huge speedups possible.  Shows examples of people using CHM to hold 1 element (but programmer doesn't want to think about expected size usage, so does the easy thing and grabs CHM).</p><p /><p /><hr /><p style="font-size: 15px; font-family: Arial;"><strong>Scaling CFL-Reachability-Based Points-To Analysis Using Context Sensitive Must-Not-Alias Analysis</strong></p><p>Basically computes a must-not-alias approximation; where it holds do not bother to run a more expensive pts-too analysis.</p><p>Some experiments are medium programs (SpecJVM - 7 progs, Dacapo - 4 progs, 8 other progs).  Can compute pts-to graph in 2x to 5x smaller.  Compute time for the pts-to analysis is 3x faster on average than prior work (but still in the several minutes range)... but what do you do with the info?  (this is my standard question when presented with better/faster/new&amp;improved points-to work: now that you have the info how much does it *really* help speed up programs?  especially compared to having strong typing in the first place)</p><p>Nice presentation of a pts-to analysis, but hit right at my sleepy jet-lag afternoon siesta; had trouble staying awake.</p><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Intel NePalTM - Design and Implementation of Nested Parallelism for Transactional Memory Systems.  </span></strong>(Intel)</p><p>Issue is using locks instead a XTN; the locks are for the nested parallelism, but want the whole operation to be atomic.  Azul makes no attempt to use the HTM to span across threads.  Can't work for us, because the threads need to communicate through shared memory - and that shared memory will blow out the HTM.  </p><p>This is Intel's compiler-gen'd STM system for C++.  Handles either optimistic (read-friendly) or pessimistic (write-friendly) concurrency; roll-back &amp; retry.</p><p>Sigh - shows basically no speedup over plain locking.</p><p>(and perhaps not the best name....)</p><p /><p /><hr /><p style="font-size: 15px; font-family: Arial;"><strong>Debugging Method Names</strong></p><p>What's a naming bug?  A mismatch between the name and what the code does.</p><p>Break names apart (CamelCase in Java), do some grammer abstraction.<br />getLastName ==&gt; get - last - name  ==&gt; (verb) (adj) (noun), etc</p><p>Then also do semantic analysis of code.  Look for things like: has a loop, has a incoming parm, returns a value (from multiple points), includes some run-time checks in the loop, etc.  Decide it's a search loop, returning a result or NULL.</p><p>So come up with lots of data points for names &amp; methods.<br />Then mine the rules from a large corpus of code.</p><p>E.g. a "contains-XXX" name method should NOT be a void return, because a "contains-XXX" name implies a question with an answer.  In fact, in probably ought to return a bool.</p><p>Then with rules, apply rules to a program and report violations.<br />Turns out pretty easy to suggestion candidate replacement names.</p><p>Example:  <br />  public void isCaching(bool b) { this.caching=b; }<br />Method isn't asking a question, it's setting value.<br />Tool suggestions name should be "setCaching".</p><p>Example:<br />  public bool equals( Object o) {<br />    if( this==o ) return true;<br />    if( o instanceof Value )<br />      return equals((Value)o);<br />    return equals(new Value(o.toString()));<br />  }<br />So not really an 'equals' method, because it makes a new 'Value' object</p><p>More examples, ends up finding interesting bugs as well.  Hard part is to decide whether or not to change the name or change the code.  Sometimes the name is correct, but the code is 'wrong' here - and also wrong elsewhere.  Basically finding a poor factoring of code.</p><p /><p /><hr /><p style="font-size: 15px; font-family: Arial;"><strong>Mapping &amp; Recommending API usage Patterns</strong></p><p>API, libraries - often complex, lots of classes &amp; methods; complex; hard to use.</p><p>People use: books &amp; docs, forums &amp; newgroups; code search engines.</p><p>Example: we wish to add an item to Eclipse menu.  Browse docs; find potential call: "appendToGroup".  Google search finds 151 code snippets (at time of paper submission) and 287 code snippets (at time of talk).  The returned code snippets at least 2 different API usages.  </p><p>'MAPO' Tool does search; parses code; clusters snippets based on used; mines patterns; recommends final patterns.  Tool needs to inline non-3rd-party methods (scattered implementation).</p><p>Examples hard to follow...  I've been bit here before repeatedly (complex API &amp; no way to learn how to use it easily).  So very sad that it's so hard to follow his talk.  </p><p /><p /><hr /><p style="font-size: 15px; font-family: Arial;"><strong>Supporting Framework Use via Automatically Extract Concept Implementation Templates</strong></p><p>Another talk about finding &amp; figuring out reuse of existing code.</p><p>Choice between Templates vs Documentation had much less impact on dev time than the task's concept complexity.</p><p /><p /><hr /><p>I skipped all the refactoring talks.  Exhaustion is setting in, plus I've seen one or two of these talks before.</p><p>Went to a couple of talks attempting to deal better with constructors.  Mostly it's issues like publishing objects before construction completes, or in Java call an overridden v-call from a constructor (which means the sub-class v-call executes before the sub-call constructor can work on the object).  Final fields, read-only objects still being constructed; not-null field properties before you set the value; et al; there's a bunch of problems that stem from constructors not being instantly-quick - and if they take time, what does the object look like in that in-between state.</p><p>This is a general language issue with constructors.  They are a nice notion, but unless they are instantly quick, you have to live with having objects which are not 'complete' or 'constructed' yet and hence do not meet the expected invariants for the object.</p><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Inlining Security Policies</span></strong></p><p>Want to inline code into the original program; the new program either runs or is truncated (if a security violation happens).  Must show no *new* behaviors from the inlining, also all old behaviors are allowed (except the secuirty ones), etc.  Single-threaded versions of these inlinings all exist and are well documented.</p><p>Multi-threaded is harder: general case isn't solvable (probably because the monitors themselves become non-deterministic).  But if the monitor is race-free, (and some other easy properties), then you can do it.  Monitor is normally a state-machine.</p><p /><p /><hr /><p><strong><span style="font-size: 15px; font-family: Arial;">Cliff Click Talks About Azul</span></strong></p><p>My keynote went well enough; lots of Q&amp;A from the senior people in the audience afterwards.  We had to stop the Q&amp;A after running quite a bit over.  I also got a lot of positive comments afterwards, so I think people really enjoyed the talk.  I'll add that I enjoyed both Simon Jones' and Dave Ungars' keynotes also.  This ECOOP was unusual in having 3 "gentleman hackers" give keynotes; we're all three both very scholarly and very prolific programmers.  Doug Lea added in a later email:</p><p class="blockquote" style="margin-left: 40px;"> "Cliff: definitely the best talk I've seen you give (which is
quite a few). Someday  you'll have to figure out how you packed in so much technical content yet still fully captured attention of people supposedly without the background to understand most of it."  </p><p>High praise from Doug Lea, so I musta done something right.   :-)</p><p>Trip home is uneventful, except for Paris's Charles DeGaul airport being the usual zoo.  I get up at 4:30am and am in a Genova taxi by 5am for the 1/2hr drive to the airport.  Except that there's no traffic at this hour and the driver thinks he's Mario Andretti.  We make the trip in about 10 minutes.</p><p>So I'm way early into Genova's airport for a 7:10am flight.  No power connections for my laptop, for-pay wi-fi (not worth it), no shops are open - not even coffee.  No air-partnership; I cannot check in for my USAirways flight from the Air France desk in Genova.  I blog for the next hour and a half.  The flight leaves on-time and the two hours flying are very smooth.  Kudos to Air France.  Then we land at Charles DeGaul and the madness begins.  We have to walk down the plane stairs and across the tarmac and about 1/4 mile more to the main terminal - but actually it's the small-plane terminal.  After figuring out I'm not in Terminal's 1, 2 OR 3, I get instructions to take bus Number 3 to the automatic train and take it to Terminal 1.  Of course the buses &amp; bus stop are labeled- with LOTS and LOTS of numbers and plenty of French - takes a question of the drivers to figure out which is the #3 bus.</p><p>The bus ride is longish (for an airport bus) and uneventful and then I find the train.  Thankfully the train is clearly marked and again a longer ride to Terminal 1 than I expect: Charles DeGaul is a really spread out airport.  Terminal 1 is a total zoo; I see lines of people snaking out all over filling the large hall, hundreds of people long.  Three times I ask people about which line they are in; none of the lines are marked and all snake back and forth through out the hall; there's a steady stream of people traffic cutting through all the lines and the lines branch and rejoin helter-skelter.  Finally I find the 'entrance' to the line for flight #771 to Charlotte - just next to the line for Philly which clearly orbits the entire hall.  The Charlotte line is actually fairly short, maybe 20 people long, and after 10 minutes I'm talking to the agent.  No problem checking in and then I head off for gate 55 (cutting through the Philly line again).  Then it's customs (another 10 minute line), and then a small shopping zone - I figure I'll come back to it for a snack (no breakfast yes; nothing was open in Genova).  But then I discover there's yet another line for security.  This one is about 20mins long and no way I'm going back for my snack and then back through security.  I figure it's like US airports - once past security there will be food and such... WRONG!  It's just the end of the lines; I'm in a seating area with about 10 gates and absolutely no way to get a drink or snack - or bathroom near as I can tell. It's now 10:10pm and it's taken me a solid hour to navigate Charles DeGaul to find my gate.  As I type, I'm standing by a laptop charging station (no chairs nearby), in the hopes I can do a little something on the laptop during the next 8 hrs of flight.</p><p>Later: flight is fairly smooth.  They serve some meals; I have another flight through Charlotte, NC to San Francisco.  Home at last!  For about 24 hrs that is, then I'm on vacation in Texas with my entire family.  More plane flights  for me!  Yumm, I can already taste the small salty snacks they are going to serve...</p><p /><p /><hr /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/08/2009-ecoop.html</feedburner:origLink></entry>
  <entry>
    <title>JavaOne Slides</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/XaJ11jMirDw/javaone-slides.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=6a00d83451bd7669e20115721f615d970b" title="JavaOne Slides" />
    <id>tag:typepad.com,2003:post-6a00d83451bd7669e20115721f615d970b</id>
    <issued>2009-07-21T08:40:08-07:00</issued>
    <modified>2009-07-21T15:40:08Z</modified>
    <created>2009-07-21T15:40:08Z</created>
    <summary>Slides for my 2009 JavaOne talks: Alternative Languages on the JVM This Is Not Your Father's Von Neumann Machine; How Modern Architecture Impacts Your Java Apps The Art of (Java) Benchmarking I'd like to say more on JavaOne, but I'm...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Slides for my 2009 JavaOne talks:</p>
    <ul>
<li><strong><a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_JVMLang.pdf" target="_blank">Alternative Languages on 
      the JVM<br /></a></strong></li>
<li><strong><a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_HardwareCrashCourse.pdf" target="_blank">This Is Not Your Father's 
      Von Neumann Machine; How Modern Architecture Impacts Your Java 
      Apps</a></strong> <a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_HardwareCrashCourse.pdf"><strong /></a></li>
<li><strong><a href="http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf" target="_blank">The Art of (Java) 
      Benchmarking<br /></a></strong></li>
</ul>
<br /><p><br />I'd like to say more on JavaOne, but I'm (re)discovering that if you're a speaker at JavaOne it's really hard to get into the other talks &amp; technology being presented.  I had my head full with my 3 talks, making sure they looked good, were relevant, etc.  To be sure, there were plenty of talks I wanted to attend, but I never seemed to find the time.   :-(</p><p>And of course shortly after JavaOne, I went to 2009 ECOOP in Italy for a week, and now I'm on a long deserved vacation.  </p><p>More blogging soon, on ECOOP at least,<br />Cliff</p></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/07/javaone-slides.html</feedburner:origLink></entry>
  <entry>
    <title>IFIP WG2.4 Trip Report</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/TIeSr5m3NiU/ifip-wg24-trip-report.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=66935961" title="IFIP WG2.4 Trip Report" />
    <id>tag:typepad.com,2003:post-66935961</id>
    <issued>2009-05-18T12:40:59-07:00</issued>
    <modified>2009-05-18T19:40:50Z</modified>
    <created>2009-05-18T19:40:59Z</created>
    <summary>2009 IFIP IFIP WG2.4 remains one of my favorite mini-conferences to attend. The group is eclectic &amp; varied; the discussion vigorous. Everybody is willing to interrupt the speaker &amp; ask questions. The talks are usually lively and pointed, and this...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="html" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="moz-text-html" lang="x-western"&gt;



&lt;br&gt;
&lt;strongig&gt;&lt;strong&gt;2009 IFIP&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
&lt;a href="http://wg24.cs.uvic.ca/new/"&gt;IFIP WG2.4&lt;/a&gt; remains one of my
favorite mini-conferences to attend.&amp;nbsp; The group is eclectic &amp;amp;
varied; the discussion vigorous. Everybody is willing to interrupt the
speaker &amp;amp; ask questions.&amp;nbsp; The talks are usually lively and pointed,
and this time was no exception.&lt;br&gt;
&lt;br&gt;
The conference was held in &lt;a href="http://www.parks.wa.gov/fortworden/"&gt;Fort Worden&lt;/a&gt;, a historic
Big-Gun (planes rapidly obsoleted the guns) fort of the Pacific
Northwest, defending &lt;a href="http://en.wikipedia.org/wiki/Admiralty_Inlet"&gt;Admiralty Inlet&lt;/a&gt;
and our shores from the marauding Russians ... or Japanese... or
somebody, surely.&amp;nbsp; Nobody ever attacked that way, so clearly the fort
was a big success.&lt;br&gt;
&lt;br&gt;
Meanwhile, the fort has been turned into a park and it's beautiful.&amp;nbsp;
The whitewashed buildings with green trim remind me of some mythical
50's era America that never was, the kind that shows up in the backdrop
of various movies or maybe "&lt;a href="http://www.leaveittobeaver.org/"&gt;Leave
It To Beaver&lt;/a&gt;".&amp;nbsp; We stayed in the officers quarters and held our
meeting in the old mess hall.&amp;nbsp; The officers lived quite nicely; the
quarters are actually old duplexes in great shape, with high ceilings
and crown molding and carved brass fittings everywhere - the house is
clearly larger than my house and with a vast green lawn ta-boot.&amp;nbsp; In
San Jose such a place would be upwards of $2m...&lt;br&gt;
&lt;br&gt;
Fort Worden overlooks Admiralty Inlet from some high bluffs (classic
fort-like location) and the surroundings are all gorgeous.&amp;nbsp; The weather
varied from overcast &amp;amp; chilly to clear &amp;amp; crisp (alas, overcast
and chilly was the mode on our all-day "whale" watching expedition.&amp;nbsp; We
did lots of watching but saw no&lt;br&gt;
whales of any kind.&amp;nbsp; And we bounced over heavy seas and got sprayed
with frigid water most of the day.)&lt;br&gt;
&lt;br&gt;
Since the Hood Canal bridge was out I got treated to a 3 hr drive the
long way around the peninsula, but somebody else drove and the scenery
was worth looking at.&amp;nbsp; The airplane portion of the trip was easy.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
As usual, my reports are (very) brief, stream-of-consciousness
writing.&amp;nbsp; I often skip around, or brutally summarize (or brutally
criticize - but this group is above the usual conference par, so less
of that is needed).&amp;nbsp; I skip talks; I sleep through ones I'm not
interested in, etc, etc.&amp;nbsp; Caveat Emptor.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Synchronization Talks&lt;/strong&gt;&lt;br&gt;
&lt;a href="#atomicset"&gt;Atomic Sets&lt;/a&gt; - another nice concurrency alternative&lt;br&gt;
&lt;a href="#fj_net"&gt;Fork/Join for .Net&lt;/a&gt;&lt;br&gt;
&lt;a href="#occam"&gt;occam / CSP&lt;/a&gt; - Uber lite-weight user-thread scheduling&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Java Analysis Talks&lt;/strong&gt;&lt;br&gt;
&lt;a href="#snugglebug"&gt;SnuggleBug&lt;/a&gt; - Better bug reporting for Java&lt;br&gt;
&lt;a href="#satire"&gt;SATIrE&lt;/a&gt; - A complete deep analysis toolkit for Java&lt;br&gt;
&lt;a href="#pointsto"&gt;PointsTo&lt;/a&gt; - A comparison of various points-to
algorithms for Java&lt;br&gt;
&lt;a href="#javascript"&gt;analyzing JavaScript&lt;/a&gt; - Has a type-lattice as
complex as C2's internal lattice&lt;br&gt;
&lt;a href="#jvm_perf"&gt;Performance variations in JVMs&lt;/a&gt; - Noticed the wild
variations in run-to-run performance of JVMs&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Misc&lt;/strong&gt;&lt;br&gt;
&lt;a href="#cheap_cores"&gt;Cheap Cores discussion&lt;/a&gt;&lt;br&gt;
&lt;a href="#fpga_variants"&gt;Compiler-directed FPGA variations&lt;/a&gt;&lt;br&gt;
&lt;a href="#clones"&gt;Clone Wars&lt;/a&gt; - huge study on cut-n-paste code in
large projects&lt;br&gt;
&lt;a href="#business"&gt;Business Process Modeling&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Security&lt;/strong&gt;&lt;br&gt;
&lt;a href="#pdf_attack"&gt;PDF Attack&lt;/a&gt; - "Adobe Reader is the new
Internet Explorer"&lt;br&gt;
&lt;a href="#britney_spears"&gt;The Britney Spears Problem&lt;/a&gt;&lt;br&gt;
&lt;a href="#squeeze_tab"&gt;Squeezing your Trusted Base&lt;/a&gt;&lt;br&gt;
&lt;a href="#ghost"&gt;Who you gonna call?&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Evolving Systems&lt;/strong&gt;&lt;br&gt;
&lt;a href="#swarms_in_space"&gt;Swarms in Space&lt;/a&gt;&lt;br&gt;
&lt;a href="#boids_occam"&gt;Finding emergent behavior&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="atomicset"&gt;&lt;/a&gt;Frank Tip,
Type-Based Data-Centric
Synchronization&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Locks are hard.&amp;nbsp; Auto-inference of correct sync in O-O programs.&lt;br&gt;
Even with no data-races still have atomicity races.&lt;br&gt;
&lt;br&gt;
Instead of code-centric locking paradigms, do data-centric locking.&lt;br&gt;
Tied to classes with fields, leverage O-O structures&lt;br&gt;
&lt;br&gt;
Group a subset of fields in a class into a *atomic_set*.&lt;br&gt;
Then find units_of_work for that atomic_set.&lt;br&gt;
&lt;br&gt;
Add language features to Java:&lt;br&gt;
&amp;nbsp; atomicset account;&lt;br&gt;
&amp;nbsp; atomic(account) int checking;&lt;br&gt;
&amp;nbsp; ...&lt;br&gt;
&amp;nbsp; atomicset logging;&lt;br&gt;
&amp;nbsp; atomic(logging) Log log;&lt;br&gt;
&amp;nbsp; atomic(logging) int logCount;&lt;br&gt;
&lt;br&gt;
So add 'atomicset' keyword, 'atomic' keyword and strongly type
variables with atomic_sets.&amp;nbsp; Units_of_work expand to reach the
syntactic borders anytime an atomic variable is mentioned.&amp;nbsp; Can expand
unit_of_work by hand also.&lt;br&gt;
&lt;br&gt;
Can at runtime cause atomic_sets to alias - to union together.&amp;nbsp; Used on
ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on
HashMap or HashMap$Entry become units_of_work for both).&lt;br&gt;
&lt;br&gt;
Atomic_sets can be unioned; e.g. for LinkList the entire backbone of
the list is one atomicset.&amp;nbsp; Lots of discussion - points out the issues
with hacking all the Code Out There.&lt;br&gt;
&lt;br&gt;
Has proven strong serializable properties on atomicsets.&amp;nbsp; Proven
various kinds of soundness, including &lt;strong&gt;*internal*&lt;/strong&gt; objects are
only
accessed by their owner, etc.&lt;br&gt;
&lt;br&gt;
All concurrency annotations appear in the libraries but not in the
client code - shows a nice concurrent example with 2 threads
manipulating the same linked list.&amp;nbsp; &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Has a working version using Eclipse &amp;amp; taking annotations as Java
comments.&amp;nbsp; Using j.u.c.Locks; can do lock aliasing by simply sharing
j.u.c Locks.&amp;nbsp; Applied this to a bunch of Java Collections; 63 types
&amp;amp; 10KLOC of code.&amp;nbsp; Needs about 1 line of annotation per 21 lines of
code for the libraries (and none in&lt;br&gt;
clients)&lt;br&gt;
&lt;br&gt;
About 15-30% slower than using plain sync wrappers on a single-threaded
testcase; probably because the "synchronization" keyword is very
optimized.&amp;nbsp; Performance is similar or slightly faster when running with
10 threads &amp;amp; high contention.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="fj_net"&gt;&lt;/a&gt;Daan Leijen - The
Task Parallel Library
for .Net&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
- A concurrent parallel library for .Net (think dl's Fork/Join).&lt;br&gt;
&lt;br&gt;
Infrastructure: locks/threads/STM, async agents (responsiveness), task
parallelism (performance)&lt;br&gt;
&lt;br&gt;
Gives a demo of a parallel matrix-multiply w/C#.&amp;nbsp; Very ugly code.&lt;br&gt;
&lt;br&gt;
So instead, write a nice combinator... &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Standard task-stealing work-queue approach.&amp;nbsp; (private set of tasks that
don't need CAS, public set of steal-able tasks that require CAS).&lt;br&gt;
&lt;br&gt;
Points out that want fine-grained parallelism, so can do work-stealing
and do auto-load-balancing. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
-- Effectful state.&amp;nbsp; Statically typed, with typed side-effects.&amp;nbsp; Type
signature includes all side-effects (yes!)&lt;br&gt;
&lt;br&gt;
&amp;nbsp; int fib(int n) { return n&amp;lt;=1 ? 1 : fib(n-1)+fib(n-2); }&lt;br&gt;
&lt;br&gt;
But this program&lt;br&gt;
&lt;br&gt;
&amp;nbsp; int fib(int n) { return n&amp;lt;=1 ? print"hi",1 : fib(n-1)+fib(n-2); }&lt;br&gt;
&lt;br&gt;
Does not type in Haskell - need the IO monad.&amp;nbsp; And now the function
returns 2 results: the IO "hi" and the int.&lt;br&gt;
&lt;br&gt;
So want some syntax to allow "auto-lifting" into a side-effect world -
same function syntax, but auto-includes the IO monad as a result.&lt;br&gt;
&lt;br&gt;
Then can prove that some stateful computation does not leave the
function - that the state is entirely encapsulated &amp;amp; unobservable
outside.&amp;nbsp; So can do type'ing which allows some state-ful work
internally, but it doesn't escape the function "box".&lt;br&gt;
&lt;br&gt;
&amp;nbsp; int fib(int n) { &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ref f1 = 1; // side effects, but entirely internal to the function&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ref f2 = 1;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; while( n ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; n--; }&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; return sum;&lt;br&gt;
&amp;nbsp; }&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="occam"&gt;&lt;/a&gt;Peter Welch, Multicore
Scheduling for
Lightweight Communicating Processes&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Google: kroc install&lt;br&gt;
&lt;br&gt;
Carl Ritson did all the work...&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Cliff Notes&lt;/strong&gt;: Insanely cheap process scheduling, plus "free" load
balancing, plus "free" communicating-process affinity.&amp;nbsp; Very similar to
work-stealing in GC marking, but with "threads".&amp;nbsp; OS guys take
note!&lt;br&gt;
&lt;br&gt;
Scheduler for OCCAM/PI.&amp;nbsp; Process-oriented computation (millions), very
fine grained processors.&amp;nbsp; Message passing over channels, plus
barriers.&amp;nbsp; Uses CSP and pi-calculus for reasoning.&amp;nbsp; Dynamic creation
&amp;amp; deletion of processes &amp;amp; channels.&lt;br&gt;
&lt;br&gt;
Goal: automatic exploitation of shared-memory multis.&amp;nbsp; "Nothing" for
the programmer to do: the program exposes tons of parallelism.&amp;nbsp;
"Nothing" for the compiler to do - it's all runtime scheduling. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Goals: scalable performance from 10's of processes to millions;
co-locate communicating processes - maximize cache, minimize atomics
(lock-free &amp;amp; wait-free where possible).&amp;nbsp; Heuristics to balance
spreading processes around on spare CPUs and co-locating the
communicating ones. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Usual sales job for CSP... which still looks good. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
A blocked process requires only *8* words of state (no stack for you!).
Scheduling is cooperative not preemptive.&amp;nbsp; No complex processor state,
no register dumps.&amp;nbsp; Up to 100M process contexts per second.&amp;nbsp; Typical
context switch is 10ns to 100ns.&amp;nbsp; Performance heavily depends on
getting good cache behavior.&amp;nbsp; So processes are batched.&amp;nbsp; Stacks are
spaghetti-stack (no that's why no stack saving...).&amp;nbsp; These 8 words
include the process run-queues.&lt;br&gt;
&lt;br&gt;
Batching: sets of e.g. 48 processes are run on a single core; that core
is the only core touching this queue &amp;amp; round-robins them until the
entire batch times out.&amp;nbsp; So no cache-misses.&amp;nbsp; A Batch is a Process,
plus a run-queue.&lt;br&gt;
&lt;br&gt;
A Core is a Batch, plus a queue of Batches.&amp;nbsp; Again, no other cores
touch this list either. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
A Core with no Batches does work-stealing.&amp;nbsp; There's some proposed
batches available for work-stealing from each Core.&amp;nbsp; Cores need some
atomic ops to manipulate the queue of Batches for these steal-able
Batches.&amp;nbsp; Careful design to ensure all the interesting structures fit
in a single cache line.&lt;br&gt;
&lt;br&gt;
So until you need to enqueue, dequeue or steal a Batch - it's all done
without any contention for process scheduling.&amp;nbsp; These Batch-switching
operations require atomics but are otherwise fairly rare. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Key bit missing: scaling up beyond 8 or 16 cores with some kind of log
structure.&amp;nbsp; But otherwise looks really good for very low cost context
switch &amp;amp; work steal.&amp;nbsp; Lots of questions about fairness.&lt;br&gt;
&lt;br&gt;
Batches are formed &amp;amp; split by the runtime (no compiler or
programmer).&amp;nbsp; Aim is to keep processes that communicating on the same
core (so it's all cache-hit communication), but spread the work across
cores.&amp;nbsp; Initial batches are kept small to encourage spreading.&amp;nbsp; Channel
ops require atomics, but typically only 1 (sometimes 2).&amp;nbsp; Choosing
amongst many channels might require up to 4 atomic ops.&amp;nbsp; Each time
communication happens on a channel, the sleeping process is pulled into
the run queue Batch of the awake process - thus naturally grouping
communicating processes.&amp;nbsp; But as batches get large, need to split. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Periodically can observe that if all the processes in a Batch are all
communicating with each other in some complex cycle - then the spare
run queue empties from time-to-time (with one runnable process).&amp;nbsp; So as
long as the run queue empties now &amp;amp; then, then the Batch should
stay together.&amp;nbsp; But if the run queue never empties, then
split-the-Batch.&amp;nbsp; Split off a single process into a new Batch - which
will drag it's connected component together into the new Batch.&amp;nbsp; But
eventually if the original Batch had 2 complete strong graphs,&lt;br&gt;
they will now be split - and can be stolen onto other CPUs. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
So this kernel is about 100x less overhead than the same design done on
a JVM.&lt;br&gt;
:-)&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="snugglebug"&gt;&lt;/a&gt;Satish Chandra,
Symbolic Analysis for Java&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
"Snugglebug"&lt;br&gt;
&lt;br&gt;
Communication between the programmer &amp;amp; the tool tends to stop
early: "Dear Programmer: you have a null-ptr bug here"...&lt;br&gt;
&lt;br&gt;
Going to the next step: "Dear Mr Programmer: try running your method
with &lt;strong&gt;*these*&lt;/strong&gt; values and you'll see the bug...".&amp;nbsp; Programmer:&amp;nbsp;
now that I think about it, these values will never arise.&lt;br&gt;
&lt;br&gt;
We just forced the programmer to define the legitmite range of values.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;More&lt;/strong&gt;: Bug Report is more believable if given a concrete input is
given.&amp;nbsp; &lt;br&gt;
&lt;strong&gt;API hardening&lt;/strong&gt;: gives the preconditions under which this library
will throw an exception&lt;br&gt;
&lt;strong&gt;Unit-test generation&lt;/strong&gt;: inputs needed to make a specific assert
fail.&lt;br&gt;
&lt;br&gt;
Works by weakest-precondition.&amp;nbsp; Given an assert in the program, work
backwards computing the weakest-precondition at each point until you
reach the top of the method.&amp;nbsp; If this chain works, then hand the WP to
a decision procedure; sometime the decision is "I don't know" but many
times you get an explicit counter example.&lt;br&gt;
&lt;br&gt;
Good match for the counter-example problem.&lt;br&gt;
- Yes loops are a problem, but don't really need the *weakest*
precondition.&lt;br&gt;
&lt;br&gt;
Most Java features can modeled:&lt;br&gt;
- heap r/w, arrays, allocation, subtyping, virtual dispatch&lt;br&gt;
- exceptions: many bugs are here so we modeled exception handling
try/catch&lt;br&gt;
&amp;nbsp; very closely.&lt;br&gt;
&lt;br&gt;
Have a path-explosion problem.&amp;nbsp; For heap updates, even a single path
includes branch effectively depending on aliasing.&lt;br&gt;
&lt;br&gt;
Has to do strong interprocedural analysis.&amp;nbsp; Lazily grow the call-graph
consistent with known symbolic constraints.&amp;nbsp; First look for a conflict
on a call-free path, then on call-paths but without needing to look
into the call.&amp;nbsp; Finally, if I must peek into the call - but use the
symbolic info the reduce the set of possible call targets.&lt;br&gt;
&lt;br&gt;
Generally, no fixed-point.&amp;nbsp; Cannot expect programs to give loop
invariants.&lt;br&gt;
Work-list management: avoid various common pathologies.&lt;br&gt;
Goal is to find &lt;strong&gt;*a*&lt;/strong&gt; contradiction, not to explore all options.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="satire"&gt;&lt;/a&gt;SATIrE&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
&lt;a href="http://www.complang.tuwien.ac.at/satire/"&gt;http://www.complang.tuwien.ac.at/satire/&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Shape analysis, points-to, flow-sensitive, context-sensitive,
annotations, can read them in &amp;amp; out.&amp;nbsp; Applications to e.g., slicing.&lt;br&gt;
&lt;br&gt;
Read C/C++ - large complex tool chain.&amp;nbsp; Reads in program x-forms in
either a functional programming language or in Prolog.&amp;nbsp; Can combine the
tools in any order at this point (very well integrated).&amp;nbsp; Output is
C/C++ w/annotations at each step (or internal "ROSE" ASTrees, etc). &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Has interval analysis for integers, shape analysis, loop (bounds)
analysis, points-to, etc.&amp;nbsp; Gives example of reaching-def's in the
functional programming language.&amp;nbsp; It's a domain-specific language for
writing compiler analyzes.&amp;nbsp; Can specify the strength of the analyzes in
many cases: e.g. the depth of the allowed call-chain for
context-senstive (set to zero for context insensitive).&lt;br&gt;
&lt;br&gt;
Shape analysis gets may/must alias pairs, etc - and all results put
back into the C program as comments/annotations.&lt;br&gt;
&lt;br&gt;
Basically describes a large complex tool chain for doing pointer
analysis on C &amp;amp; C++ programs.&amp;nbsp; The tool chain looks very complete
&amp;amp; robust.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="pointsto"&gt;&lt;/a&gt;Welf Lowe -
Comparing, Combining &amp;amp;
Improving Points-to Analysis&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
...skips explaining what points-to is.&lt;br&gt;
&lt;br&gt;
Clients: compilers, JIT compilers, refactoring tools, software
comprehension.&lt;br&gt;
Different clients have different needs: speed, precision, security.&lt;br&gt;
Granularity &amp;amp; conversatism also matter.&amp;nbsp; Clients might want, e.g.,
dead code elimination, or removing polymorphic calls, or rename methods
&amp;amp; calls.&amp;nbsp; Other people want object granularity - escape info &amp;amp;
side-effects (i.e., compilers and JITs).&amp;nbsp; Also static-vs-dynamic
analysis.&amp;nbsp; Dynamic analyzes typically run the program and produce
optimistic results.&lt;br&gt;
&lt;br&gt;
Doing careful experiments of 2 analysis w.r.t. accuracy.&amp;nbsp; Hard part:
can't get a perfect "gold standard" for real programs.&amp;nbsp; Hard even for a
human to inspect a proposed "gold standard" and declare it "gold" or
not.&amp;nbsp; Some special cases are easier: conservative analysis (no false
positives) &amp;amp; optimistic analysis&lt;br&gt;
(no false negatives).&amp;nbsp; Can under- and over-estimate the gold standard,
but this still messes with the results (messes with detecting which of
the 2 analysis is better w.r.t. accuracy).&lt;br&gt;
&lt;br&gt;
E.g., ask the question: is it worth increasing k&amp;gt;1 (i.e., becoming
context sensitive vs staying insensitive).&amp;nbsp; Checking size of computed
call-graph - smaller graph is more precise.&amp;nbsp; Very small increase in
accuracy.&amp;nbsp; Sometimes made things worse - because the original analysis
was optimistic sometimes.&lt;br&gt;
&lt;br&gt;
Open question: what is the use of an Uber Points-To - it's definitely
beyond the point of diminishing returns for JIT compilers.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="javascript"&gt;&lt;/a&gt;Anders Moller -
Type Analysis for
JavaScript&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Not much tool support for catching errors during programming.&lt;br&gt;
No nice IDE support.&amp;nbsp; Many bugs are browser specific, but we focus on
language bugs.&lt;br&gt;
&lt;br&gt;
Most JS programs are quite small, so throw a heavy-weight analysis at
it.&lt;br&gt;
&lt;br&gt;
Object based, dynamic typed, proto-type inheritance, 1st-class
functions, coercions, dynamic code construction, exceptions, plus 161
built-in functions.&amp;nbsp; Types include null &amp;amp; undefined, primitive
types, some properties "ReadOnly" and "DontDelete".&lt;br&gt;
&lt;br&gt;
Broken things: tracking down an "undefined" and where it came from,
reading an absent variable or absent property, etc.&lt;br&gt;
&lt;br&gt;
So do a dataflow analysis over the program, including a lattice,
transfer functions, etc.&amp;nbsp; Handle interprocedural stuff.&amp;nbsp; Tightly
interleaved control flow &amp;amp; data - so need a complex lattice. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Lattice is roughly as complex as C2's lattice, with more stuff on the
allocation sites and less on the class hierarchy.&amp;nbsp; Also then model all
the abstract machine state (C2 does this as well).&amp;nbsp; Full context info
(total inlining at analysis time except for recursion).&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Cliff Notes: &lt;/strong&gt;Nice idea: for each allocation site, model both
the &lt;strong&gt;*most recent object*&lt;/strong&gt; from that allocation site, and also
model &lt;strong&gt;*all the rest*&lt;/strong&gt; from that site.&amp;nbsp; You can do strong-update
on the singleton most-recent object, but weak update on all the rest.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="jvm_perf"&gt;&lt;/a&gt;Rhodes Brown -
Measuring the Performance
of Adaptive Virtual Machines&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Noticing the badness of the "compile plan" - 1 run in 10 had a &amp;gt;10%
performance reduction.&lt;br&gt;
&lt;br&gt;
JIKES - always compiled, no interpreter.&amp;nbsp; On demand at 1st invoke.&amp;nbsp;
Focused recompilation, lots of GC variants.&lt;br&gt;
&lt;br&gt;
Points out the places where you get variations - stack-profiles for
estimating the call-graph, basic-block counters, etc.&lt;br&gt;
&lt;br&gt;
Causes of variation?&lt;br&gt;
So far: GC &amp;amp; inlining&lt;br&gt;
Concerned that i-cache placement is a major issue of instability.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;*No*&lt;/strong&gt; correlation between total amount of compile-time spent vs
score.&amp;nbsp; You must compile the hot stuff, but apparently tons of
not-so-hot code is getting compiled and it doesn't help the score any.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="cheap_cores"&gt;&lt;/a&gt;Kurt William -
lots of cores for cheap&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
General discussion round, not really a "talk".&lt;br&gt;
&lt;br&gt;
Expect &amp;gt;32 core/die in 5 yr, &amp;gt;128 core/die in 10 yrs.&lt;br&gt;
What's "New" and what can we expect in 5 years?&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Theory&lt;/strong&gt;: contention is that there's nothing new here since the
70's &amp;amp; 80's (concerning concurrency).&amp;nbsp; (general agreement on
incremental progress but not breakthrough).&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Systems&lt;/strong&gt;: Lots of libraries (TBB, pthreads), lots of platforms
(webservers), all OS's support multi-core.&amp;nbsp; JVMs are New in this
space.&amp;nbsp; Might see OS-support for user-level thread scheduling.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Languages&lt;/strong&gt;: Support been there, but it's locks &amp;amp; threads - no
good.&amp;nbsp; Support for TM is coming, but will it help?&amp;nbsp; Then there's CSP...&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Applications: &lt;/strong&gt;Not much new here.... more games, more
interactive more user-interface.&amp;nbsp; Old stuff: lots of HPC, speech,
image.&amp;nbsp; Matching: easy parallelism; lots of interesting apps here.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="fpga_variants"&gt;&lt;/a&gt;Uwe Kastens,
Compiler-Driven Dynamic
Reconfiguration of Arch Variants&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
What is "reconfigurable"?&amp;nbsp; "Usual" approach - HW guys compile to a
netlist, do some mapping, place &amp;amp; route, make a FGPA/silicon, while
the SW guys write some assembler on the FPGA - then at the end they try
to run the SW on the HW.&amp;nbsp; Goal is to run some complex function faster
on a FPGA than in normal ops.&lt;br&gt;
&lt;br&gt;
Our approach: read source code, look at program structure, find hot
code &amp;amp; hot loops; compiler knows how to switch the hardware between
variants; then it compiles the code for a particular FPGA variant for
each loop, and reloads the FPGA with a new function between loops.&amp;nbsp; But
need to limit the FPGA to certain variants so don't actually need to
reload the whole FPGA. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Fixed small set of architecture variants.&amp;nbsp; Keeps cost to reconfigure
the FPGA very low.&amp;nbsp; Compiler can choose the variants.&amp;nbsp; Example: FPGA
runs either a small CPU or a small SIMD CPU or a MIMD CPU.&amp;nbsp; Or
reconfigure the pipeline structure, or the default register bank access
patterns.&amp;nbsp; In the different configurations can use more or less power
for problems that are more or less parallel.&amp;nbsp; e.g., in SIMD mode have
all ALUs active but turn all but 1 decoder.&lt;br&gt;
&lt;br&gt;
Pick variants &amp;amp; design space to those known to be efficiently
solved by a compiler - gets a fast compile time &amp;amp; good results
there (use a human elsewhere).&lt;br&gt;
&lt;br&gt;
Can map any arch register to any physical register - so can e.g. have
CPUs share registers, or do unbalanced register allocation - if some
CPU needs more registers in it's portion of code than others.&amp;nbsp; Does
some register allocation affinity games, to avoid having to reconfigure
register layouts between blocks of code. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Nicely flexible overall design.&amp;nbsp; Instructions name arch registers &amp;amp;
ops, but also can re-map real registers &amp;amp; ops periodically.&amp;nbsp; So use
targeted registers &amp;amp; ops for a particular chunk of code, then
reconfigure when the next chunk of code is different enough from the
last chunk- and the cost to reconfigure is lower than running with a
less-well-targeted registers &amp;amp; instructions.&lt;br&gt;
&lt;br&gt;
Ugh, can't read data charts - has some 4 embedded programs.&amp;nbsp; Seeing 30%
reduction in execution time AND power cost, by reconfiguring.&amp;nbsp; Code
size is a little larger, because either poor parallization(?) &amp;amp; the
reconfiguration overhead.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="clones"&gt;&lt;/a&gt;Jim - Clone Detection&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Exact clone'd code, white-space only changes, near-miss clones.&lt;br&gt;
Bug in 1 deserves inspection in all the rest.&lt;br&gt;
&lt;br&gt;
Lots of cloning studies out there -esecpially in-depth of the linux
file system.&amp;nbsp; But no wide-range-of-systems studies.&amp;nbsp; We did a bunch of
open-source systems, C, Java, C#.&amp;nbsp; OS, Web apps, DBs, etc.&amp;nbsp; Used their
own clone-detection, as a combo of AST-detection and text-based
detection.&lt;br&gt;
&lt;br&gt;
Text-based comparisons are sensitive to formating changes.&amp;nbsp; But AST
parsing is expensive (requires tree compares) and does not find "near
miss" clones.&amp;nbsp; Ended up doing "standard pretty-printing" - which uses
the AST to normalize the text.&amp;nbsp; Then do text-based comparisons.&amp;nbsp; Also
handle small diffs in the middle of clones.&lt;br&gt;
&lt;br&gt;
Did 24 systems, 10 C, 7 Java, 7 C#.&amp;nbsp; Linux 2.6 kernel, all of
j2sdk-swing, db4.&amp;nbsp; Varied from 4KLOC to 6MLOC.&lt;br&gt;
&lt;br&gt;
Java methods have more cloning than C methods - 15%-25% (depending all
the diff threshold) of Java methods are clones; C varies from 3% to
13%.&amp;nbsp; C# is more like Java, until the methods allow more diffs - and
then there is a lot more cloning.&amp;nbsp; Swing is hugely clone-ish, &amp;gt;70%.&lt;br&gt;
&lt;br&gt;
C systems - Linux has the same cloning behavior as postgres &amp;amp; bison
&amp;amp; wget &amp;amp; httpd. Java has no more clones as you allow more diffs
as compared to C - I suspect good Java refactoring tools detect &amp;amp;
remove clones earlier.&lt;br&gt;
&lt;br&gt;
C# shows a LOT more clones as more diff's are allowed.&amp;nbsp; Maybe
cut-n-paste is a coding style?&amp;nbsp; C#'s clones are larger methods as well.&lt;br&gt;
&lt;br&gt;
Something like 10-20% of all files in Java have 50% of functions are
clones.&amp;nbsp; In C# it's more like 20-35% of all files.&lt;br&gt;
&lt;br&gt;
Files with 100% cloned code: 15% of C# files are totally cloned.&amp;nbsp; For
Java whole file cloning is fairly low.&lt;br&gt;
&lt;br&gt;
Localization of cloned pairs: are they in the same file?&amp;nbsp; same
directory?&amp;nbsp; different directories?&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="business"&gt;&lt;/a&gt;Thomas Gschwind -
Business Process
Modeling Accelerators&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Business Process Modeling, patterns, x-forms, refactorings&lt;br&gt;
Process Structure Tree - &lt;br&gt;
&lt;br&gt;
&amp;nbsp;- Process is a large unstructured graph.&amp;nbsp; Visio-style graphical is
error prone for large process models.&lt;br&gt;
&lt;br&gt;
But can take the structure graph and build a tree from it.&lt;br&gt;
(The graph is all fork/join parallelism).&lt;br&gt;
A small part of the tree requires state-space exploration.&lt;br&gt;
&lt;br&gt;
After the tree is correct, can apply patterns &amp;amp; Xforms to the tree.&lt;br&gt;
&lt;br&gt;
Essentially Petri Net edit'ing on a structured graph/tree using Eclipse.&lt;br&gt;
&lt;br&gt;
100's of users inside of IBM; generated business models are much higher
quality, make them 2x faster.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="pdf_attack"&gt;&lt;/a&gt;Philip Levy, The
New Age of Software
Attacks&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Why? &lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Not enuf to do?&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Crime is Big Business&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Usable exploit: can be sold for $75K to $100K&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;So called "security researchers".&amp;nbsp; Some are respectable, most are
going after the $100K payoff.&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
"Visual Virus Studio" - actually exists.&amp;nbsp; PDF is a common vector for &lt;strong&gt;*targeted*&lt;/strong&gt;
attacks.&amp;nbsp; "Adobe Reader is the new Internet Explorer".&lt;br&gt;
&lt;br&gt;
Looked at all the fixed vulnerabilities in Adobe Reader&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Need to do a better job teaching students on how to write secure
software&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Stop writing in assembly language&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;ANY unmanaged code, include C/C++, should be banned&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Languages need to be upgraded to be more secure&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;New constructs &amp;amp; type systems&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;More sophisticated testing tools &amp;amp; methodologies&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Better analysis tools for legacy systems&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
How big is Adobe Reader?&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;100K FILES&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;42MLOC, no comments or blank lines: 25MLOC&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Estimated 75 man years of development per year over it's lifetime&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Problems are clustered - e.g. one module has had 25% of all security
problems.&lt;br&gt;
Top 6 modules have 80% of security problems.&lt;br&gt;
&lt;br&gt;
Break down problems another way:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;array index OOB - 32%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;unknown - 27%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;intege overflow - 14%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;range check missing - 8%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;capability check missing - 4%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;out of memory 3%&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
How found:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Fuzzing - 42%&amp;nbsp; (creating illegal inputs and throwing them at the
program)&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Code Review - 35%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;External report - 12%&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Some examples:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;asserting against null, but no asserts in debug build and null
possible in release build.&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Failed to check for error returns, sometimes even not catching
return value&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Casting "attacker controlled values" (values from the PDF file)
to valid enum&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Or using attacker-controlled value for loop &amp;amp; array bounds&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Sometimes a complex pre-condition needs to be checked first&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Some of the fixes are still incorrect&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
A Quick List:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Badly designed lib funcs (strcat, sprintf)&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Microsoft's banned API list is 4 pages&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;AIOOB&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Calling thru func ptrs&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Integer over/underflow.&amp;nbsp; Many protocols send N followed by N
bytes.&amp;nbsp; Send a really big number that overflows (after scaling for
malloc) into a small number - then malloc succeeds, but the data
overwrites the buffer.&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Pointer math&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Uninitialized vars&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Dangling ref pointer&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;Basically: it's BASIC STUFF that goes bad, not fancy stuff.&lt;br&gt;
&lt;br&gt;
All memory integrity or soundness type stuff.&lt;br&gt;
But it goes beyond AIOOB - &lt;br&gt;
e.g. Name string like "Robert); DROP TABLE Students;-- "&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="britney_spears"&gt;&lt;/a&gt;Jeff Fischer,
Solving the "Britney Spears
Problem"&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Users assigned roles; roles are assigned privileges. &lt;br&gt;
Benefit: users can come &amp;amp; go without changing policies.&lt;br&gt;
&lt;br&gt;
"Patients" can view medical records, but "doctors" can change them.&lt;br&gt;
&lt;br&gt;
Can add annotations to Java that specify roles.&lt;br&gt;
&lt;br&gt;
Problem: lack of expressiveness; solution: logging.&amp;nbsp; Violation happened
(people saw Britney Spears' data), but the logging allowed catching the
violators.&amp;nbsp; Another solution: manual checks.&amp;nbsp; But no compiler support,
so missed checks.&lt;br&gt;
&lt;br&gt;
Problem: Lack of explicit access policies.&lt;br&gt;
Really: need a way to "type" data (and hence methods using the data)
with some kind of security/access policy.&amp;nbsp; More problems: often checks
are hoisted to entry point for efficiency, and then violate policy
later.&amp;nbsp; Relying on manually-inserted runtime checking.&lt;br&gt;
&lt;br&gt;
Our stuff: parameterize classes by some kind of index value defining
Role. &lt;br&gt;
Each method includes an explicit policy (statically checked).&lt;br&gt;
Annotation processor can either statically type-check the policy, or
insert runtime checks.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="squeeze_tab"&gt;&lt;/a&gt;Jens Knoop -
Squeeze your Trusted
Annotation Base&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
"Interesting program properties are undecidable!"&lt;br&gt;
&lt;br&gt;
Compiler optimization - not so worried: just lose some optimality.&amp;nbsp;
Results still usable; but if we have better results we can produce
better code.&lt;br&gt;
&lt;br&gt;
Worse-Case execution time analysis for hard real time: Bang.&amp;nbsp; Your dead.&lt;br&gt;
WCET bounds must be safe (soundness), and wish to be tight (optimal or
useful).&lt;br&gt;
E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but
useless.&lt;br&gt;
Need, e.g., all loop bounds. &amp;nbsp;&lt;br&gt;
Need user annotations.&lt;br&gt;
User annotations are untrustworthy, but must be trusted.&lt;br&gt;
&lt;br&gt;
Squeeze the trusted annotation base - reduce user chance of screwing up.&lt;br&gt;
&lt;br&gt;
Try to replace trust by proof - as a side-effect generally improving
optimality.&amp;nbsp; Then take advantage of trade-off between power &amp;amp;
performance of analysis&amp;nbsp; algorithms.&lt;br&gt;
&lt;br&gt;
Use model-checking with care.&amp;nbsp; If something can't be bounded
automatically, as&lt;br&gt;
the user for some bound - then prove it.&amp;nbsp; BUt also believe the user's
bound isn't tight.&amp;nbsp; Try to prove a tighter bound with the
model-checking using binary search.&amp;nbsp; Indeed - can do away with user
assistance often; just guess and binary search to tighten it.&lt;br&gt;
&lt;br&gt;
Summary: can often completely empty the trusted annotation base,
usually shrink it; often improve WCET bounds.&lt;br&gt;
&lt;br&gt;
Start with the cheapest techniques (constant prop), move up to interval
analysis, then symbolic bounds &amp;amp; model checkers...&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="ghost"&gt;&lt;/a&gt;Joe Newcomer - Who do
you trust to run
code on your computer?&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
(Joe is a Windows attack expert)&lt;br&gt;
&lt;br&gt;
Attack Vectors - autoexec.bat, .cshrc &amp;amp; friends, autoplay,
device-driver installs (Sony), mac resource attacks&lt;br&gt;
&lt;br&gt;
Engineered attacks: phishing attacks, downloads, activeX, client-side
scripting&lt;br&gt;
&lt;br&gt;
Lots of "security" designs &amp;amp; codes are done by junior programmers
w/out adequate oversight.&amp;nbsp; Pervasive attitude: "software is secure - as
designed, as implemented, as maintained".&amp;nbsp; Small+fast is all that
matters...&amp;nbsp; not...&lt;br&gt;
&lt;br&gt;
Rant against all modern OS's, talking about #of security patches in
Mac, Linux, unix, Windows...&lt;br&gt;
&lt;br&gt;
Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc.&lt;br&gt;
&lt;br&gt;
What are we teaching people about security?&amp;nbsp; How about non-CS types? &lt;br&gt;
Graphics designers, website designers...&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="swarms_in_space"&gt;&lt;/a&gt;Stefan
Jahnichen, Swarm Systems in Space&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Build very small satellites, swarms of them to watch the Earth.&amp;nbsp; They
will spread out after launch to watch e.g. weather, bush fires, etc.&amp;nbsp;
Takes about 2 days after launching to get a communication link.&lt;br&gt;
&lt;br&gt;
Satellites have good earth coverage; orbit at 500 to 600km, orbit earth
every 3 hrs or so.&amp;nbsp; Wide range of sensors amongst different satellites.&lt;br&gt;
&lt;br&gt;
Have a Real Swarm of cameras; map a Virtual Swarm onto that; map
mission goals on the V.S. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Optimized synthesis of consensus algorithms.&lt;br&gt;
&lt;br&gt;
Special OS - distributed OS basically.&amp;nbsp; Used to maintain swarm
integrity (keep satellites together). &amp;nbsp;&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="boids_occam"&gt;&lt;/a&gt;Peter Welch,
Engineering Emergence...&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Thesis: in the future, systems will be to complex to design &lt;strong&gt;*explicitly*&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Instead engineer systems to have the desired properties "implicitly".&lt;br&gt;
&lt;br&gt;
"mob programming", ant mounds, etc.... simple rules of behavior leading
to complex behaviors that emerge.&amp;nbsp; Emergent Behavior&lt;br&gt;
&lt;br&gt;
Mechanisms Design - (game theory), &lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Rational actors have local private info&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Emergent: optimal allocation of scarce resources&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Optimal decisions rely on truth revelation&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Swarming behavior (flocks, etc)&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;local actors, local behaviors&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Emergent "swarm" behavior&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;UAV swarms &amp;amp; autonomous robots&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Social communication (gossip, epidemic algorithms)&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;large, ad hoc networks&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;emergent: min power to achieve eventual consistency&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;low power, low reliability sensors &amp;amp; data propagation&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Design for emergent behavior&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;occam process per "location", a 1024x1024 grid has 1mil processes&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;each location has a com channel to it's 4(8) neighbors&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;occam process per "mobile agent"&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;agents register with local process (or up to 4 if it's
straddling a local region) &amp;amp; local process tells the agent what's
nearby.&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
&lt;br&gt;
No idea how to design high-level behavior from low-level rules, but
busy experimenting and looking for some cases.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
Cliff&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;/div&gt;&lt;/div&gt;
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/05/ifip-wg24-trip-report.html</feedburner:origLink></entry>
  <entry>
    <title>DaCapo Trip Report</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/5_Yne_x903U/dacapo-trip-report.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=66395247" title="DaCapo Trip Report" />
    <id>tag:typepad.com,2003:post-66395247</id>
    <issued>2009-05-05T09:23:31-07:00</issued>
    <modified>2009-05-05T16:23:31Z</modified>
    <created>2009-05-05T16:23:31Z</created>
    <summary>http://www.cs.tufts.edu/research/dacapo DaCapo is a small (but growing) research group focused on GC &amp; Memory Management issues. They meet about every 9 months, and the meeting bounces around depending on who is hosting it. People mostly present work-in-progress, and the audience...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><div class="moz-text-html" lang="x-western">



<br />
<pre wrap=""><a class="moz-txt-link-freetext" href="http://www.cs.tufts.edu/research/dacapo">http://www.cs.tufts.edu/research/dacapo</a><br /></pre>

DaCapo is a small (but growing) research group focused on GC &amp;
Memory Management issues.  They meet about every 9 months, and the
meeting bounces around depending on who is hosting it.  People mostly
present work-in-progress, and the audience is up front with their
critiques... often interrupting the speakers or getting sidetracked
into loud off-topic conversations, i.e., it's a fun boisterous crowd.<br />
<br />
DaCapo is in the Boston area this year, held at <a href="http://www.tufts.edu/">Tufts University's Medford</a> campus
(everyone's heard of Harvard and MIT, but how about the Berklee College
of Architecture, Fisher College, Boston University, Suffolk University
and dozens and dozens more? - Boston has more colleges and universities
per square mile than any other place I know).  As usual for me
traveling to Boston, I skipped the car and just used public
transportation - the <a href="http://www.mbta.com/rider_tools/trip_planner/default.asp">T
works marvelously well</a> (as compared to e.g. VTA; crossing the
entire town from Logan Airport stuck way out in the water to Tufts some
8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose
on the VTA takes something like 2 hrs).  <br />
<br />
Anyways the weather was supposed to be highs of 65 and rainy with lows
in the 40's - so I packed my parka.  Turned out the weather report was
a <strong>little bit off</strong>, with highs in the upper 80's and sunny.  Man
did I roast.  But I got out and enjoyed the sun - we (the entire DaCapo
group of 40 or 50 people) walked about a mile to lunch each day, plus I
walked to and from my hotel &amp; the T stop (also about a mile). 
Enough ruminating, on to the talks!<br />
<br />As usual, I pay attention to some talks and chat w/neighbors through others.  I brutally summarize.  I'm opinionated &amp; narrow focused... so Caveat Emptor.<br /><br />
<hr size="2" width="100%" /><br />
<strong>Change Tracking</strong><br />
<br />
What changed?  Who did it?  When?  Points out the flaws of using 'diff'.<br />
<br />
Logical-Structure diff.  Auto-magically spotted systematic change
patterns.<br />
It's a better code-review diff.<br />
<br />
Can report diff's as reports like "for all XXX, replace XXX with YYY
except for ZZZ_XXX_ZZZ", etc.  Vastly better error reporting!!!  Vastly
better diff reporting!!!<br />
<br />
<strong>Cliff Notes: </strong>We <strong>must </strong>get this gizmo and check it
out!!!!<br />
<br />
Paper report: <a href="http://users.ece.utexas.edu/%7Emiryung/Publications/icse09-LSdiff.KimNotkin.pdf">http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf</a><br />
Sample output: <a href="http://users.ece.utexas.edu/%7Emiryung/LSDiff/carol429-430.htm">http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm</a><br />
<br />
<hr size="2" width="100%" /><br />
<strong>Inferred Call Path Profiling</strong><br />
<br />
Very low cost path profiling.  Hardware counters have very little cost,
but provide little context.  Software is reversed: high context &amp;
high
cost. <br />
<strong>Cliff Notes: </strong>Compare this to call/ret RPC hashing.<br />
<strong>Cliff Notes: </strong>Bond &amp; McKinely does something similar.<br />
<br />
Instead, look at number of bytes from main() to leaf call on stack. 
Basically PC &amp; SP (offset from stack base) provides all the context
we need.  Dynamically detect unique PC &amp; SP (they did it in a
training run) pairs, and reverse them to get a call-stack.<br />
<br />
Lots of duplicated stack heights, how to tell them apart?  On SpecInt
unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call
stack).  But huge spread; really sucks bad on some benchmarks.<br />
<br />
If have a list of entire call-graph, then can remove ambiguity by
growing some frames.  But requires some global analysis to know which
frames to grow in order to be unique.  "Solved" problem with a simple
greedy search.  Compute all call-stacks over whole program run, pick a
conflict, grow some frame, re-compute &amp; keep if more PC/SP pairs
are distinct.  Repeat until it's rare enough that can live with dups.<br />
<br />
<strong>Cliff Notes: </strong>The whole-program ahead-of-time thing isn't feasible for us.  How to do this on-the-fly during tick-profiling? 
Hash PC/SP, find a
tick record, 1-time-in-1024 check for dups (crawl stack the hard way
and confirm call-stack), otherwise assume unique &amp; increment tick
record.  If MISS in hash table, then crawl stack the hard way.  If
record is labeled as having dups, then crawl the call-stack &amp; find true record?)  Actually, probably expect the
call stacks to be common, but the PC's to be fairly unique.  So really
asking for unique RPC+SP (not PC+SP).<br />
<br />
His experiments: collect PC/SP pairs about 100 times/sec.  0.2%
slowdown (must have had careful measurements to detect that!)<br />
<br />
<hr size="2" width="100%" /><strong><br />
Breadcrumbs - Recovering Efficiently Computed Dynamic Calling Contexts</strong><br />
<br />
Also builds on probabilistic calling contexts.  <br />
<br />
Given a (nearly) unique ID (say, computed at each fcn entry via a
simple hash on the return PC).  IN the past, reversing these unique
ids: use a dictionary (very large &amp; high cost), use a
calling-context-tree (smaller, but requires you to track where you are
in the CC-tree at runtime).  Forward search, but undecidable and
limited options for pruning.<br />
<br />
Attempt to reverse the hash fcn (but only when we need to take the
semi-unique id) and convert it to a call-stack.  Straight-up reverse is
no good, but is easy to guide search.  Callers of leaf call themselves
have structure - e.g. expect to see a call-op from the call-site, etc. 
Still need a list of all call-sites (generally available anyways
because of inline-cache structures).  His function is "p' = (3p+c)", where "c" is RPC?  Requires a handful of instructions on function entry.<br />
<br />
<strong>Cliff Notes: </strong>If this reverse search is successful most of the
time, it says that during a profile tick we can record just the PC/SP
(or RPC/SP/PC?) and then reverse them to the entire call-stack later -
on demand when RTPM requests.  So a 'tick' is no more than record a few
bits (and handle buffer overflow), but<strong> no stack crawl</strong>.<br />
<br />
<strong>Cliff Notes: </strong>combined nicely with 'null-assignment-stack' info,
ie. collect the entire stack and 'color' all NULLs with a unique stack
id.  Then if get a NPC can report both the NPC at the crash site, but
also the call-stack that assigned the NULL in the first place.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Transparent Clustering - </strong><strong>Phil McGachey, Eliot Moss, Tony
Hosking</strong><br />
<br />
Attempt to auto-distribute multi-threaded programs.  Pure Java, no JVM
hacks.  <strong>No programmer involvement.  </strong>Dynamic class re-writing. 
JVMs can
vary.  Hardware can vary.<br />
<br />
All machines on a network run RuggedJ nodes.  RuggedJ includes a
rewriting class loader, some runtime libs, and some
application-specific partioning strategy.<br />
<br />
Rewriter: generate lots of classes &amp; interfaces; especially hook
methods &amp; fields.  Hook into the runtime.  Abstract over object
location; remote &amp; local objects.<br />
<br />
Ex: Standard class "X".  Rewrite into a X_Local and X_Stub classes.  X
objects on Node A are of type X_Local.  On Node B, X objects are really
X_Stub objects where all the calls are remote calls.  Both X_Local and
X_Stub implement the "X" interface.  Some other Y_Local can refer to
X_Local on Node A, but refer to X_Stub on Node B.  If want to migrate
from Node A to Node C, want to indirect the X_Locals; so for mobile
objects there's an indirection layer called "X_Proxy", that then be
switched from pointing to an X_Local vs an X_Stub (if the object
migrates).  Non-mobile objects skip the X_Proxy indirection layer.  <br />
<br />
Inheritance &amp; interfaces are maintained in the rewritten classes. 
Static classes conform to the RuggedJ model (I assume replication is
allowed?).  Arrays are also wrapped to support indirection.  <br />
<br />
Constraints: Native code breaks everything.  Try to not rewrite stuff
referred to by native code (maybe can intercept all native calls and
rewrite args?).  System code is also an issue: loaded by the system
loader, is not rewritten.  (Can they use -Xbootclasspath?).  But most
code run is actually system code!  (eg. String libs are very common!)<br />
<br />
So must deal with system code without changing the JVM.  <br />
<ol>
<li>Can wrap system class objects.  Must unwrap when passing to
system or native code; must re-box on return.  Re-wrapping is
expensive: need the exact same wrapper, so need some kind of
hash-lookup to do re-wrapping.  High overhead, so avoided where
possible.<br />
 </li>
<li>Extend: some classes can be extended and preserve the original
class &amp; signatures.  Extended class follows the RuggedJ model but
also the original system model. No need to unwrap/re-wrap.  Does not
work e.g. when system code generates the object (since sys code makes
instances of the original and not of the extended superclass).  <br />
 </li>
<li>Some system classes are only ever referenced from user code; can
just dup the code out of rt.jar and make a user-class (which now gets
rewritten by the normal RuggedJ model).  <br />
 </li>
<li>Immutable classes.  Just replicate.<br />
 </li>
</ol>
<br />
Apparently able to cluster some complex programs.<br />
No idea about performance as compared to explicitly clustered apps.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Jikes RVM Native Threads</strong><br />
<br />
Was Green threads, now Native threads<br />
(everybody else is dropping green threads too....)<br />
Motherhood &amp; Apple Pie.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Computer Science as an Experimental Science</strong><br />
<br />
Reproducibility?  Experimental bias?<br />
Show how these problems are solved in very-high-energy astrophysics.<br />
<br />
(1) No publish without multiple colleague confirmation.<br />
(2) Publish<br />
(3) MUST also now publish reproductions or <strong>non-</strong>reproductions's
of other peoples' experiments (or lose credibility as a research
entity).  But bar to publication is lower, and expectations are lower. 
i.e., people expect to publish in-progress work<br />
<br />
Biggest comp-sci problem now: lack of infrastructures.<br />
Examples of infrastructure bias: green-vs-native threads, some barrier
styles are hard to use (.Net embedded objects), stack-scan (OVM makes
fast stack scanning at the cost of everything else being slower). 
HotSpot (high-value compiler, good GC, good most everything) vs JIKES.<br />
<br />
<strong>Cliff Notes: </strong>This was actually a really fun talk (and <strong>lots </strong>of
discussion) despite my paucity of notes.  <br />
<br />
<hr size="2" width="100%" /><br />
<strong>Web 2.0 Profiling</strong><br />
<br />
Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite.<br />
Relies on browser to run Flash, JavaScript, REST, AJAX.<br />
Benchmarks: IBM social networking software; Java PetStore 2.0,
AJAX-based RIA using JEE5, uses Lucene search.<br />
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14),
selectTag (PetStore 9), etc.  <br />
Workload mix can be varied, but based on commonly observed usage
pattersn from internal deployment of Lotus Connections).<br />
<br />
Wanted to remove DB from the workload, put DB on ramdisk.  Same for
LDAP.  Used Grinder to generate load.  Main server running
J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5
1.66GHz.  Fairly reasonable setup for a <strong>small </strong>web/app load.<br />
<br />
See lots more chattiness; very short fast requests, high interrupt
rates, smaller requests but many more of them.  Looked at low-level
profiling.  See fewer I-cache misses as compared to a large legacy
apps.  Better I-cache behavior.  But data-cache miss rate is still very
significant.<br />
<br />
Also different transactions have different scaling limitations.  For
PetStore the Java persistence layer started falling off after 4 cpus.  <br />
<br />
<hr size="2" width="100%" /><br />
<strong>Stream Processing</strong><br />
<br />
<a href="http://www.cs.ucsb.edu/%7Eckrintz/papers/gedik_et_al_2008.pdf">"Spade"
- Cluster stream processing at IBM</a>.  Continuous high-level
streams.  Stream arrives for weeks &amp; months; want to process and
then emit stream data continously.  Want to express the problem with a
language.<br />
<br />
Priorities - fast streaming on a cluster; then Generality; finally
Usability.  Beyond <a href="http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-620.pdf">StreamIt</a>. 
Like StreamIt, works with streams that
can be split &amp; joined.  Language is typed; stream contents are
strongly typed.  <br />
<br />
<hr size="2" width="100%" /><strong><br />
Demystifying Magic - High-Level Low-Level Programming</strong><br />
<br />
<a href="http://domino.research.ibm.com/comm/research_people.nsf/pages/dgrove.vee09.html/$FILE/vee25-frampton.pdf">Seen
this talk before</a>, but this time given by Steve Blackburn.  Nice
historical perspective on writing system software in C vs asm - and the
parallels to writing sys software in Java vs asm.<br />
<br />
Java-In-Java<br />
<ul>
<li>Key Principle: Containment.  Write as much as possible (99%!) in
pure Java, and dive into the low level "magic" bits as little as
possible.</li>
<li>Extensibility: Requires change quickly, languages change slowly.</li>
<li>Encapsulation: contain low-level semantics</li>
<li>Fine-grained lower of semantics: minimize impedance, seperation
of concerns</li>
</ul>
<br />
Framework:<br />
<ul>
<li>Extend Java semantics; intrinsics</li>
<li>Scoped semantics changes</li>
<li>Extend types; un/boxing, ref-vs-value, architecture sizes</li>
<li>Pragmatic: retain syntax where possible to keep front-end tools
working</li>
</ul>
<br />
Introduce raw pointer type: "Address" looks like a Java type.<br />
E.g.: <strong><font face="Courier New, Courier, monospace">Address a =
...;   ... = a.loadByte(offset);</font></strong><br />
<br />
Semantic Regimes:<br />
Additive: "its ok here to have unchecked casts"<br />
Subtractive: "no allocation via new allowed", similar to HotSpot
VerifyNoGC.<br />
<br />
Allow "unboxed" types - no v-table, so it's a bare primitive type.  But
this isn't the default (but can be asked for).<br />
<br />
Abstraction results are good: can implement a JVM on bare hardware and
with a simple change essentially virtualize the same JVM inside a GC
debugging harness running inside another JVM (where the virtualized
JVM's heap is just a big array in the outer implementing JVM), etc.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Grace, Safe Multithreaded Programming for C/C++ - Emery Berger</strong><br />
<br />
Seen the paper before, but this is the talk by Emery.<br />
<br />
Fake sequential semantics on stock multi-threaded hardware using
fork/join and process protection.  <br />
<br />
Grace is a pthreads replacement: "g++ -lpthreads" becomes "g++ -lgrace".<br />
<br />
Speedups: in the absence of common benign races, Grace run programs run
about 10% slower than raw pthreads - but have full sequential
semantics.  It's "as-if" all the parallel programs are run sequentially
immediately at the thread spawn point.  <br />
<br />
CAN'T run applications where threads run forever, i.e. reactive or
server apps.<br />
Works well with fork/join, Cilk, TBB, etc.<br />
<br />
So thread spawn becomes a unix fork with COW, use mmap for allow memory
sharing.  At join points, smash in join'd threads memory updates via
mmap.  Also need scalable thread-local malloc heap, plus aligned
globals (to avoid false-sharing at the page granularity), plus some
improved I/O.<br />
<br />
Some simple measures remove nearly all false sharing.  Big one:
everybody mallocs into their own space.  2nd big one: spread the
globals out 1 per page.<br />
<br />
<strong>Thread-spawn is as cheap as a fork on linux</strong> (has experiments to
show it).  Due to thread-cpu affinity, if you spawn a bunch of threads
they share a single CPU.  At the low end of thread-grain-length, fork
is faster than spawn because the scheduler immediately spreads the
processes across CPUs, whereas threads share for awhile.  So fork is
actually faster than thread-spawn for awhile.<br />
<br />
Real performance limiter for Grace is conflicts &amp; rollbacks, and
not thread-spawn overheads.<br />
<br />
Performance is much better than e.g. STM, because after the 1st touch
on a new page (and the SEGV &amp; accounting), all accesses on that
page run at full speed.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>GC Assertions: Using the GC to check heap properties</strong><br />
<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&amp;rep=rep1&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&amp;rep=rep1&amp;type=pdf</a><br />
<br />
Global knowledge is required to understand seemingly local properties.<br />
JBB example:<br />
<br />
<strong><font face="Courier New, Courier, monospace">    Order order;<br />
    for(...)<br />
      order = ....;<br />
    delete orderTable;<br />
</font></strong>    ? are all 'orders' dead here? <br />
<br />
But actually 'order' is held by the last-customer-transaction (as an
optimization, or convienvence for customer?).  Leads to a leak of
'orders'.  Really: programmer does not understand program, too complex.<br />
<br />
Add assertions to the code, and use the GC to check the property.<br />
<br />
<strong>Cliff Notes: </strong>Azul is already gathering global heap properties
during GC - but not asserts.  Gathering liveness &amp; points-to info. 
If points-to included a handful of sample real instances (and all the
sames linked when possible), would be a very powerful way to instantly
check some heap properties.<br />
<br />
Sample asserts:<br />
'assert_dead(p)' - expect to reclaim 'p' at the next GC cycle<br />
Or region-collection: start_region(); ... assert_all_dead();<br />
Or shape property: assert_unshared(p);<br />
Or ownership properties (members of a collection are 'owned' by the
collection)<br />
<br />
If an assert fails, then do a slow crawl and provide the full path
through the heap showing the failure.   Asserts only checked at GC
points.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Proving Proving Correctness of Abstract Concurrency Control and Recovery
- Trek Palmer &amp; Eliot Moss</strong><br />
<br />
Transactions - closed nesting has issues<br />
Open nesting &amp; Boosting need from programmer: conflict info &amp;
roll-back info.<br />
These are hard to provide correctly.<br />
<br />
This work: a description language of abstract data types or structures.<br />
Can describe the conflict predicates &amp; inverses.<br />
Can prove correctness of the conflict &amp; inverse expressions.<br />
<br />
Working with the <strong>abstract</strong> description of the data structure,
and
NOT e.g. with the real Java implemention.  E.g., no loops, no
mutations, no recursion.  So the language isn't a 'real' language, but
can describe many kinds of ADTs in code that looks sorta like Java.<br />
<br />
Output isn't a functional program, instead the output is the result of
using a SAT solver to prove correctness.  <br />
<br />
XTN-semantics allows conflict detection to be proven correct pairwise,
instead of having to do full interleaving.  Formally, a conflict<br />
predicate is correct iff it is true when the operations do no commute.<br />
<br />
The conflict predicate tells when 2 XTNs conflict, and the inverse
allows an optimistic XTN to rollback.<br />
<br />
Obvious use-case: use this tool to write a transactional-version of
NBHM or other JDK concurrency utilities.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Hard Real-Time GC Scheduling</strong><br />
<br />
- Periodic Scheduling<br />
 - GC runs at highest priority, but periodically yields to mutator<br />
 - Metronome<br />
- Slack-based Scheduling<br />
 - GC runs at lowest priority<br />
 - Can be preempted at any time<br />
- Makinac (Sun RTS)<br />
- Work based Scheduling<br />
 - GC runs at allocation time<br />
 - Problem with allocation-rate jitter<br />
<br />
HRT - Deadlines must never be missed AND must be verifiable
analytically that no deadlines are missed.  Systems tend to therefore
be very simple, so that they can be proven.<br />
<br />
OVM - like Metronome.  Dynamic defrag; arraylets; Brooks style barrier;
replicating barrier/; incremental, concurrent, supporting slack-based
style scheduling.  <br />
<br />
<hr size="2" width="100%" /><br />
<strong>Understanding Performance of Interactive Applications</strong><br />
<br />
Typical profilers give the wrong numbers: they report total time spent,
not time spent between mouse-click and page updated.<br />
<br />
AWT/Swing &amp; SWT - similar: single-threaded, gui thread in "event
loop", relies heavily on Listener model.  So profile using gui
call-back Listeners between user events and the eventual display change.<br />
<br />
Really simple idea: profiling &amp; event visualizer tool between click
&amp; view times.<br />
<br />
<hr size="2" width="100%" /><strong>Program Metamorphsis</strong><br />
<br />
Trouble w/refactorings: must preserve program behavior with each step. 
But if we want multiple refactorings then at each step along the way
fixup code (needed to preserve program behavior) pollutes the code. 
Want to do multiple refactoring steps at once - while leaving the
program possibly broken in the in-between steps.<br />
<br />
So compute a program-semantics (names &amp; def-use chains), and let
the user make partial refactoring steps and then declare "I think I'm
done" - and the system compares the new program-semantics with the old,
and reports if they are not equal.  <br />
<br />
So actually comes up with primitive not-quite refactoring steps, which
he will compose to build larger "real" refactorings, or compose lots to
build a multi-stage refactoring.  <br />
<br />
<hr size="2" width="100%" /><strong><br />
Instruction Selectors</strong><br />
<br />
HotSpot C2 uses a BURS framework.  Very hard to optimize &amp; debug. 
So these guys have a semantics which can span both the ideal IR and
real machine instructions.  Describe both using "CISL" and the tool
does the mapping.<br />
<br />
Once the tool has a mapping, the user provides an adapter that converts
the mapping into the code needed by the compiler back-end.  This would
replace e.g. the ADLC.  Gives examples of machine encodings for PPC and
Java bytecodes.<br />
<br />
CISL is given an IR tree and produces a target tree with equivalent
semantics.<br />
<br />
Not sure it's entirely better than BURS (maybe no worse), but I'm still
sold on rewriting in e.g. C++/Java hand-written greedy match rules. 
These code-gen-gen tools are pain for long-term maintenance.<br />
<br />
<hr size="2" width="100%" /><strong><br />
Do Design &amp; Performance Relate?</strong><br />
<br />
Is fast code required to be ugly?  Is beautiful required to be slow?<br />
<br />
Pick 200 code metrics.  Also pick some performance metrics (cycles,
single threaded, objects created, etc).  Could not follow talk... or at
least he wasn't very clear with the objectives &amp; results (if any).  Interesting idea though.<br /><br />
<br />
</div></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/05/dacapo-trip-report.html</feedburner:origLink></entry>
  <entry>
    <title>Odds &amp; Ends</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/2akg9-qZ84w/odds-ends.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=65502891" title="Odds &amp; Ends" />
    <id>tag:typepad.com,2003:post-65502891</id>
    <issued>2009-04-15T10:29:14-07:00</issued>
    <modified>2009-04-15T18:36:45Z</modified>
    <created>2009-04-15T17:29:14Z</created>
    <summary>Various pending goodies... Load-Linked/Store-Conditional vs Compare-And-Swap Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!). There was a long period of debate in...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Various pending goodies...</p><h2><span style="text-decoration: underline;">Load-Linked/Store-Conditional vs Compare-And-Swap</span></h2><p><br />Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!).  There was a <a href="http://www.cs.huji.ac.il/course/2005/dist/p124-herlihy.pdf">long period of debate</a> in the CompSci community one what constituted the "minimal" needed instruction to do useful multi-CPU work.  Eventually the community has decided that the <a href="http://en.wikipedia.org/wiki/Compare-and-swap">Compare-And-Swap</a> (CAS) instruction and the <a href="http://en.wikipedia.org/wiki/Load-Linked/Store-Conditional">Load-Linked/Store-Conditional</a> (LL/SC) instruction combo both are (1) sufficient to do useful work ("infinite concensus number") and (2) relatively easy to implement in real hardware.  X86's, Sparc's and Azul's CPUs use CAS.  IBM's Power, Alpha, MIPS &amp; ARM6 all use LL/SC. ARM5 only has an atomic-swap.</p><p>LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms.  With LL/SC, you first Load-Linked a word.  The hardware marks the cache line as "linked".  You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter).  Finally you issue a Store-Conditional to the same address.  If the cache line is still "linked", the store happens.  If not, then not.  The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line.  Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented).  "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail.  I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure.  The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.</p><p>With CAS, you typical first load a word with a normal load, then manipulate it.  Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory.  CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic <a href="http://en.wikipedia.org/wiki/ABA_problem">ABA bug</a>.  In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless <strong>all</strong> copies of A die first - including the one loaded before the CAS.  Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS.  Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively.  With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link".  E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS).  </p><p>These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations.  Nothing in the academic literature requires any sort of global ordering on these atomic instructions.  Instead the usual academic assumption is that <strong>all </strong>memory operations are strongly ordered - which is obviously <strong>not</strong> true on all modern hardware.  Practitioners are required to insert memory fences as needed to achieve the desired ordering.  Nonetheless most <strong>implementations </strong>of CAS include a strong ordering: X86 and Sparc certainly do.  Azul's CAS does <strong>not</strong>, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations.  <em>(As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking.  Report back the %-tage of dropped counts and the throughput rate.  Then try it with an Atomic counter.  My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless).  </em>I am less familiar with the fence properties of common LL/SC implementations.  Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc?</p><p /><p /><p /><p /><h2><span style="text-decoration: underline;">build.java</span></h2><p><br /><a href="http://blogs.azulsystems.com/cliff/2008/01/i-hate-makefile.html">Some time ago I reported on a build tool</a> I've been using.  
Currently 'build' is being used to build HotSpot internally to Azul.  We've got it building 400+ files, plus build rules for building portions of the
JDK, compiling w/gcc &amp; javac and also 'javah', tar, jar, strip,
binary signing, etc.  In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc.  It's blazingly fast, it's easy to hack (being written in plain-olde Java).  I hunted around awhile and couldn't find a good place to dump a medium-large blob of code, so here's a sample build.java in 4 parts:</p><p><a href="http://snippets.dzone.com/posts/show/7206">Part 1</a><br /><a href="http://snippets.dzone.com/posts/show/7207">Part 2</a><br /><a href="http://snippets.dzone.com/posts/show/7208">Part 3</a><br /><a href="http://snippets.dzone.com/posts/show/7209">Part 4</a></p><p /><p><a href="http://snippets.dzone.com/posts/show/7209"><br /></a></p><p /><p /><h2><span style="text-decoration: underline;">A "HotSpot -server" aka <em>C2</em> compiler IR visualizer</span></h2><p>No I didn't do it!  <a href="http://ssw.jku.at/General/Staff/TW/">This deserving grad student did it</a>.</p><p>Here is the reference to the visualizer for the ideal graph of the 
HotSpot server compiler:
<br />
<br /><a class="moz-txt-link-freetext" href="http://ssw.jku.at/igv/master.jnlp">http://ssw.jku.at/igv/master.jnlp</a>
<br />
<br />It is a JNLP application hosted on a university server 
(<a class="moz-txt-link-freetext" href="http://ssw.jku.at/General/Staff/TW/">http://ssw.jku.at/General/Staff/TW/</a> has also a test file to 
download), but the easiest way to get a first impression. 
The source code of the tool is hosted on <a class="moz-txt-link-freetext" href="http://kenai.com/projects/igv">http://kenai.com/projects/igv</a> 
The server compiler instrumentation is part of OpenJDK and also 
included in Sun's weekly builds of Java 7.
<br />
</p><p /><p /><h2><span style="text-decoration: underline;">More C2 Goodies</span></h2><div style="margin-left: 40px;">
<em>John Cavazos wrote:
</em><br />
... One of the things I am 
currently looking
at is determining the right phase-ordering of optimizations applied to 
a particular program being optimized.  I have some nice (un-published) results for 
JikesRVM,
<br />
but it would be nice to replicate the research for HotSpot...
<br /><br /></div><p>I have strong arguments for nearly all orderings of our passes, so I'm 
curious to know about your phase-ordering results. 

<br />The only obvious transform we might be missing is PRE, but our peephole 
opts pretty nearly subsumes PRE (they are not mathematically equivalent 
- I can write programs where either PRE or the peephole stuff makes 
progress against the other).  In practice, PRE will find essentially 
nothing once the peephole opts are done.  You'll notice that we do the 
peephole pass alot; it's totally incremental and provably linear 
bounded.  In other words, if there's nothing to be gained then there's no 
cost in trying.  The peephole pass includes amongst other things 
all pessimistic forms of copy propagation, value equivalence, 
constant propagation (especially the null/not-null property), constant 
test folding, repeated test folding, dead code elimination, 
load-after-store opts (aliasing analysis is included for free during 
building of SSA form), algebraic properties, and lots more.
<br />
<br />
<br />For HotSpot, the optimization ordering is:
<br />
</p><ul>
<li>Build an inline tree, including class hierarchy analysis.  This is the 
one pass I'd be willing to move, as it happens too early.
</li>
<li>(a single unified pass:) parse bytecodes, inline, build SSA form (the 
IR remains in SSA form always), do peephole opts over SSA form.  This 
pass typically costs 40% of compile-time budget.
</li>
<li>Fixed-point the peephole opts over SSA form, once all backedges are 
known after parsing.
</li>
<li>Loop opts round 1: "beautify loops" (force polite nesting of 
ill-structured loops), build a loop-tree &amp; dominator tree, split-if 
(zipper-peel CFG diamonds with common tests, plus some local cloning 
where I can prove progress), peel loops (required to remove 
loop-invariant null checks)
</li>
<li>Fixed-point the peephole opts over SSA form
</li>
<li>Loop opts round 2: "beautify loops" (force polite nesting of 
ill-structured loops), build a loop-tree &amp; dominator tree, lock 
coarsening, split-if &amp; peel - but if these don't trigger because there's 
nothing to gain, the do iteration-splitting for range-check-elimination 
&amp; a 1st loop unrolling.
</li>
<li>Fixed-point the peephole opts over SSA form
</li>
<li>Conditional Constant Propagation (the optimistic kind, instead of the 
pessimistic kind done by the peephole pass)
</li>
<li>Iterate loop (split-if, peel, lock coarsen - but these typically never 
trigger again and take very little time to check), unrolling &amp; peephole 
passes, until loops are unrolled "enough".  On last pass, insert 
prefetches.  Typically this iterates once or twice, unless this is a 
microbenchmark and then unrolling might happen 8 or 16 times.
</li>
<li>Remove tail-duplication, and a bunch of other minor code-shaping 
optimizations e.g. absorb constant inputs into deoptimization-info in 
calls, or commuting Add ops so that 2-address machines can do 
update-in-place.
</li>
<li>Convert "ideal" IR into machine code IR.
</li>
<li>Build a real CFG for the 1st time, including a dominator tree, loop 
tree.  Populate with frequencies from earlier profiling.
</li>
<li>Global latency-aware (loop-structure-aware) scheduling.
</li>
<li>Replace null-checks with memory references where appropriate.
</li>
<li>Register allocate.  Internally this has many passes.  This pass 
typically costs 40% of compile-time budget.
</li>
<li>Sort basic blocks to get good control flow ordering (forward branches 
predicted not-taken, backwards predicted taken, etc)
</li>
<li>Some last-minute peephole opts.
</li>
<li>Emit code into a buffer, including OOP-maps &amp; deoptimization info.
</li>
</ul>
<p>















<br />
</p><p /><h2>Travel Plans</h2>
<p>I got too much travel coming up!</p><p>Apr 30, May 1st - DaCapo in Boston.  Favorite small group; GC &amp;
Java focused. Website:
<a class="moz-txt-link-freetext" href="http://www.cs.tufts.edu/research/dacapo/">http://www.cs.tufts.edu/research/dacapo/</a>.  I had to turn down the invite to an <a href="http://www.prog.uni-saarland.de/ssasem/">SSA seminar</a> in France because the dates conflicted.  Very sad.  I hope one of the attendees will post a trip report.</p><p>
May 11-May 15.  <a href="http://wg24.cs.uvic.ca/ContentWG24.shtml">IFIP WG2.4</a>  (International Federation of Information
Processors, Working Group 2.4 - a *really* old European group with a
random mix of industry &amp; academia.)  It's a nice group to preview
JavaOne talks, and always the meetings are a week-long rambling
discussion in <a href="http://www.parks.wa.gov/fortworden/">some quaint resort</a>.</p><p>
June 2-June 5.  <a href="http://java.sun.com/javaone/">JavaOne</a>.  I owe slides for 3 talks by end of this week.  'enuf said.</p><p>
July 6-10th - <a href="http://ecoop09.disi.unige.it/">ECOOP</a>, as a paid-for invited speaker.  A free vacation to
Italy!   :-)</p><p>Cliff</p><p /><p /><p /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/04/odds-ends.html</feedburner:origLink></entry>

</feed><!-- ph=1 --><!-- nhm:from_kauri -->
