<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atomfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="0.3">
  <title>Cliff Click Jr.’s Blog</title>
  <link rel="alternate" type="text/html" href="http://blogs.azulsystems.com/cliff/" />
  <id>tag:typepad.com,2003:weblog-1244458</id>
  <link rel="service.post" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458" title="Cliff Click Jr.’s Blog" />
  <modified>2009-05-18T19:40:59Z</modified>

  <generator url="http://www.typepad.com/">TypePad</generator>
  <info type="application/xhtml+xml">
  <div xmlns="http://www.w3.org/1999/xhtml">This is an Atom formatted XML site feed. It is intended to be viewed in a Newsreader or syndicated to another site. Please visit <a href="http://blogs.azulsystems.com/cliff/">Cliff Click Jr.’s Blog</a> for more info.</div>
  </info>
  <link rel="start" href="http://feeds.feedburner.com/typepad/azulsystems/cliff" type="application/atom+xml" /><entry>
    <title>IFIP WG2.4 Trip Report</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/TIeSr5m3NiU/ifip-wg24-trip-report.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=66935961" title="IFIP WG2.4 Trip Report" />
    <id>tag:typepad.com,2003:post-66935961</id>
    <issued>2009-05-18T12:40:59-07:00</issued>
    <modified>2009-05-18T19:40:50Z</modified>
    <created>2009-05-18T19:40:59Z</created>
    <summary>2009 IFIP IFIP WG2.4 remains one of my favorite mini-conferences to attend. The group is eclectic &amp; varied; the discussion vigorous. Everybody is willing to interrupt the speaker &amp; ask questions. The talks are usually lively and pointed, and this...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="html" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="moz-text-html" lang="x-western"&gt;



&lt;br&gt;
&lt;strongig&gt;&lt;strong&gt;2009 IFIP&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
&lt;a href="http://wg24.cs.uvic.ca/new/"&gt;IFIP WG2.4&lt;/a&gt; remains one of my
favorite mini-conferences to attend.&amp;nbsp; The group is eclectic &amp;amp;
varied; the discussion vigorous. Everybody is willing to interrupt the
speaker &amp;amp; ask questions.&amp;nbsp; The talks are usually lively and pointed,
and this time was no exception.&lt;br&gt;
&lt;br&gt;
The conference was held in &lt;a href="http://www.parks.wa.gov/fortworden/"&gt;Fort Worden&lt;/a&gt;, a historic
Big-Gun (planes rapidly obsoleted the guns) fort of the Pacific
Northwest, defending &lt;a href="http://en.wikipedia.org/wiki/Admiralty_Inlet"&gt;Admiralty Inlet&lt;/a&gt;
and our shores from the marauding Russians ... or Japanese... or
somebody, surely.&amp;nbsp; Nobody ever attacked that way, so clearly the fort
was a big success.&lt;br&gt;
&lt;br&gt;
Meanwhile, the fort has been turned into a park and it's beautiful.&amp;nbsp;
The whitewashed buildings with green trim remind me of some mythical
50's era America that never was, the kind that shows up in the backdrop
of various movies or maybe "&lt;a href="http://www.leaveittobeaver.org/"&gt;Leave
It To Beaver&lt;/a&gt;".&amp;nbsp; We stayed in the officers quarters and held our
meeting in the old mess hall.&amp;nbsp; The officers lived quite nicely; the
quarters are actually old duplexes in great shape, with high ceilings
and crown molding and carved brass fittings everywhere - the house is
clearly larger than my house and with a vast green lawn ta-boot.&amp;nbsp; In
San Jose such a place would be upwards of $2m...&lt;br&gt;
&lt;br&gt;
Fort Worden overlooks Admiralty Inlet from some high bluffs (classic
fort-like location) and the surroundings are all gorgeous.&amp;nbsp; The weather
varied from overcast &amp;amp; chilly to clear &amp;amp; crisp (alas, overcast
and chilly was the mode on our all-day "whale" watching expedition.&amp;nbsp; We
did lots of watching but saw no&lt;br&gt;
whales of any kind.&amp;nbsp; And we bounced over heavy seas and got sprayed
with frigid water most of the day.)&lt;br&gt;
&lt;br&gt;
Since the Hood Canal bridge was out I got treated to a 3 hr drive the
long way around the peninsula, but somebody else drove and the scenery
was worth looking at.&amp;nbsp; The airplane portion of the trip was easy.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
As usual, my reports are (very) brief, stream-of-consciousness
writing.&amp;nbsp; I often skip around, or brutally summarize (or brutally
criticize - but this group is above the usual conference par, so less
of that is needed).&amp;nbsp; I skip talks; I sleep through ones I'm not
interested in, etc, etc.&amp;nbsp; Caveat Emptor.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Synchronization Talks&lt;/strong&gt;&lt;br&gt;
&lt;a href="#atomicset"&gt;Atomic Sets&lt;/a&gt; - another nice concurrency alternative&lt;br&gt;
&lt;a href="#fj_net"&gt;Fork/Join for .Net&lt;/a&gt;&lt;br&gt;
&lt;a href="#occam"&gt;occam / CSP&lt;/a&gt; - Uber lite-weight user-thread scheduling&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Java Analysis Talks&lt;/strong&gt;&lt;br&gt;
&lt;a href="#snugglebug"&gt;SnuggleBug&lt;/a&gt; - Better bug reporting for Java&lt;br&gt;
&lt;a href="#satire"&gt;SATIrE&lt;/a&gt; - A complete deep analysis toolkit for Java&lt;br&gt;
&lt;a href="#pointsto"&gt;PointsTo&lt;/a&gt; - A comparison of various points-to
algorithms for Java&lt;br&gt;
&lt;a href="#javascript"&gt;analyzing JavaScript&lt;/a&gt; - Has a type-lattice as
complex as C2's internal lattice&lt;br&gt;
&lt;a href="#jvm_perf"&gt;Performance variations in JVMs&lt;/a&gt; - Noticed the wild
variations in run-to-run performance of JVMs&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Misc&lt;/strong&gt;&lt;br&gt;
&lt;a href="#cheap_cores"&gt;Cheap Cores discussion&lt;/a&gt;&lt;br&gt;
&lt;a href="#fpga_variants"&gt;Compiler-directed FPGA variations&lt;/a&gt;&lt;br&gt;
&lt;a href="#clones"&gt;Clone Wars&lt;/a&gt; - huge study on cut-n-paste code in
large projects&lt;br&gt;
&lt;a href="#business"&gt;Business Process Modeling&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Security&lt;/strong&gt;&lt;br&gt;
&lt;a href="#pdf_attack"&gt;PDF Attack&lt;/a&gt; - "Adobe Reader is the new
Internet Explorer"&lt;br&gt;
&lt;a href="#britney_spears"&gt;The Britney Spears Problem&lt;/a&gt;&lt;br&gt;
&lt;a href="#squeeze_tab"&gt;Squeezing your Trusted Base&lt;/a&gt;&lt;br&gt;
&lt;a href="#ghost"&gt;Who you gonna call?&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Evolving Systems&lt;/strong&gt;&lt;br&gt;
&lt;a href="#swarms_in_space"&gt;Swarms in Space&lt;/a&gt;&lt;br&gt;
&lt;a href="#boids_occam"&gt;Finding emergent behavior&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="atomicset"&gt;&lt;/a&gt;Frank Tip,
Type-Based Data-Centric
Synchronization&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Locks are hard.&amp;nbsp; Auto-inference of correct sync in O-O programs.&lt;br&gt;
Even with no data-races still have atomicity races.&lt;br&gt;
&lt;br&gt;
Instead of code-centric locking paradigms, do data-centric locking.&lt;br&gt;
Tied to classes with fields, leverage O-O structures&lt;br&gt;
&lt;br&gt;
Group a subset of fields in a class into a *atomic_set*.&lt;br&gt;
Then find units_of_work for that atomic_set.&lt;br&gt;
&lt;br&gt;
Add language features to Java:&lt;br&gt;
&amp;nbsp; atomicset account;&lt;br&gt;
&amp;nbsp; atomic(account) int checking;&lt;br&gt;
&amp;nbsp; ...&lt;br&gt;
&amp;nbsp; atomicset logging;&lt;br&gt;
&amp;nbsp; atomic(logging) Log log;&lt;br&gt;
&amp;nbsp; atomic(logging) int logCount;&lt;br&gt;
&lt;br&gt;
So add 'atomicset' keyword, 'atomic' keyword and strongly type
variables with atomic_sets.&amp;nbsp; Units_of_work expand to reach the
syntactic borders anytime an atomic variable is mentioned.&amp;nbsp; Can expand
unit_of_work by hand also.&lt;br&gt;
&lt;br&gt;
Can at runtime cause atomic_sets to alias - to union together.&amp;nbsp; Used on
ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on
HashMap or HashMap$Entry become units_of_work for both).&lt;br&gt;
&lt;br&gt;
Atomic_sets can be unioned; e.g. for LinkList the entire backbone of
the list is one atomicset.&amp;nbsp; Lots of discussion - points out the issues
with hacking all the Code Out There.&lt;br&gt;
&lt;br&gt;
Has proven strong serializable properties on atomicsets.&amp;nbsp; Proven
various kinds of soundness, including &lt;strong&gt;*internal*&lt;/strong&gt; objects are
only
accessed by their owner, etc.&lt;br&gt;
&lt;br&gt;
All concurrency annotations appear in the libraries but not in the
client code - shows a nice concurrent example with 2 threads
manipulating the same linked list.&amp;nbsp; &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Has a working version using Eclipse &amp;amp; taking annotations as Java
comments.&amp;nbsp; Using j.u.c.Locks; can do lock aliasing by simply sharing
j.u.c Locks.&amp;nbsp; Applied this to a bunch of Java Collections; 63 types
&amp;amp; 10KLOC of code.&amp;nbsp; Needs about 1 line of annotation per 21 lines of
code for the libraries (and none in&lt;br&gt;
clients)&lt;br&gt;
&lt;br&gt;
About 15-30% slower than using plain sync wrappers on a single-threaded
testcase; probably because the "synchronization" keyword is very
optimized.&amp;nbsp; Performance is similar or slightly faster when running with
10 threads &amp;amp; high contention.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="fj_net"&gt;&lt;/a&gt;Daan Leijen - The
Task Parallel Library
for .Net&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
- A concurrent parallel library for .Net (think dl's Fork/Join).&lt;br&gt;
&lt;br&gt;
Infrastructure: locks/threads/STM, async agents (responsiveness), task
parallelism (performance)&lt;br&gt;
&lt;br&gt;
Gives a demo of a parallel matrix-multiply w/C#.&amp;nbsp; Very ugly code.&lt;br&gt;
&lt;br&gt;
So instead, write a nice combinator... &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Standard task-stealing work-queue approach.&amp;nbsp; (private set of tasks that
don't need CAS, public set of steal-able tasks that require CAS).&lt;br&gt;
&lt;br&gt;
Points out that want fine-grained parallelism, so can do work-stealing
and do auto-load-balancing. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
-- Effectful state.&amp;nbsp; Statically typed, with typed side-effects.&amp;nbsp; Type
signature includes all side-effects (yes!)&lt;br&gt;
&lt;br&gt;
&amp;nbsp; int fib(int n) { return n&amp;lt;=1 ? 1 : fib(n-1)+fib(n-2); }&lt;br&gt;
&lt;br&gt;
But this program&lt;br&gt;
&lt;br&gt;
&amp;nbsp; int fib(int n) { return n&amp;lt;=1 ? print"hi",1 : fib(n-1)+fib(n-2); }&lt;br&gt;
&lt;br&gt;
Does not type in Haskell - need the IO monad.&amp;nbsp; And now the function
returns 2 results: the IO "hi" and the int.&lt;br&gt;
&lt;br&gt;
So want some syntax to allow "auto-lifting" into a side-effect world -
same function syntax, but auto-includes the IO monad as a result.&lt;br&gt;
&lt;br&gt;
Then can prove that some stateful computation does not leave the
function - that the state is entirely encapsulated &amp;amp; unobservable
outside.&amp;nbsp; So can do type'ing which allows some state-ful work
internally, but it doesn't escape the function "box".&lt;br&gt;
&lt;br&gt;
&amp;nbsp; int fib(int n) { &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ref f1 = 1; // side effects, but entirely internal to the function&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ref f2 = 1;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; while( n ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; n--; }&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; return sum;&lt;br&gt;
&amp;nbsp; }&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="occam"&gt;&lt;/a&gt;Peter Welch, Multicore
Scheduling for
Lightweight Communicating Processes&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Google: kroc install&lt;br&gt;
&lt;br&gt;
Carl Ritson did all the work...&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Cliff Notes&lt;/strong&gt;: Insanely cheap process scheduling, plus "free" load
balancing, plus "free" communicating-process affinity.&amp;nbsp; Very similar to
work-stealing in GC marking, but with "threads".&amp;nbsp; OS guys take
note!&lt;br&gt;
&lt;br&gt;
Scheduler for OCCAM/PI.&amp;nbsp; Process-oriented computation (millions), very
fine grained processors.&amp;nbsp; Message passing over channels, plus
barriers.&amp;nbsp; Uses CSP and pi-calculus for reasoning.&amp;nbsp; Dynamic creation
&amp;amp; deletion of processes &amp;amp; channels.&lt;br&gt;
&lt;br&gt;
Goal: automatic exploitation of shared-memory multis.&amp;nbsp; "Nothing" for
the programmer to do: the program exposes tons of parallelism.&amp;nbsp;
"Nothing" for the compiler to do - it's all runtime scheduling. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Goals: scalable performance from 10's of processes to millions;
co-locate communicating processes - maximize cache, minimize atomics
(lock-free &amp;amp; wait-free where possible).&amp;nbsp; Heuristics to balance
spreading processes around on spare CPUs and co-locating the
communicating ones. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Usual sales job for CSP... which still looks good. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
A blocked process requires only *8* words of state (no stack for you!).
Scheduling is cooperative not preemptive.&amp;nbsp; No complex processor state,
no register dumps.&amp;nbsp; Up to 100M process contexts per second.&amp;nbsp; Typical
context switch is 10ns to 100ns.&amp;nbsp; Performance heavily depends on
getting good cache behavior.&amp;nbsp; So processes are batched.&amp;nbsp; Stacks are
spaghetti-stack (no that's why no stack saving...).&amp;nbsp; These 8 words
include the process run-queues.&lt;br&gt;
&lt;br&gt;
Batching: sets of e.g. 48 processes are run on a single core; that core
is the only core touching this queue &amp;amp; round-robins them until the
entire batch times out.&amp;nbsp; So no cache-misses.&amp;nbsp; A Batch is a Process,
plus a run-queue.&lt;br&gt;
&lt;br&gt;
A Core is a Batch, plus a queue of Batches.&amp;nbsp; Again, no other cores
touch this list either. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
A Core with no Batches does work-stealing.&amp;nbsp; There's some proposed
batches available for work-stealing from each Core.&amp;nbsp; Cores need some
atomic ops to manipulate the queue of Batches for these steal-able
Batches.&amp;nbsp; Careful design to ensure all the interesting structures fit
in a single cache line.&lt;br&gt;
&lt;br&gt;
So until you need to enqueue, dequeue or steal a Batch - it's all done
without any contention for process scheduling.&amp;nbsp; These Batch-switching
operations require atomics but are otherwise fairly rare. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Key bit missing: scaling up beyond 8 or 16 cores with some kind of log
structure.&amp;nbsp; But otherwise looks really good for very low cost context
switch &amp;amp; work steal.&amp;nbsp; Lots of questions about fairness.&lt;br&gt;
&lt;br&gt;
Batches are formed &amp;amp; split by the runtime (no compiler or
programmer).&amp;nbsp; Aim is to keep processes that communicating on the same
core (so it's all cache-hit communication), but spread the work across
cores.&amp;nbsp; Initial batches are kept small to encourage spreading.&amp;nbsp; Channel
ops require atomics, but typically only 1 (sometimes 2).&amp;nbsp; Choosing
amongst many channels might require up to 4 atomic ops.&amp;nbsp; Each time
communication happens on a channel, the sleeping process is pulled into
the run queue Batch of the awake process - thus naturally grouping
communicating processes.&amp;nbsp; But as batches get large, need to split. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Periodically can observe that if all the processes in a Batch are all
communicating with each other in some complex cycle - then the spare
run queue empties from time-to-time (with one runnable process).&amp;nbsp; So as
long as the run queue empties now &amp;amp; then, then the Batch should
stay together.&amp;nbsp; But if the run queue never empties, then
split-the-Batch.&amp;nbsp; Split off a single process into a new Batch - which
will drag it's connected component together into the new Batch.&amp;nbsp; But
eventually if the original Batch had 2 complete strong graphs,&lt;br&gt;
they will now be split - and can be stolen onto other CPUs. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
So this kernel is about 100x less overhead than the same design done on
a JVM.&lt;br&gt;
:-)&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="snugglebug"&gt;&lt;/a&gt;Satish Chandra,
Symbolic Analysis for Java&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
"Snugglebug"&lt;br&gt;
&lt;br&gt;
Communication between the programmer &amp;amp; the tool tends to stop
early: "Dear Programmer: you have a null-ptr bug here"...&lt;br&gt;
&lt;br&gt;
Going to the next step: "Dear Mr Programmer: try running your method
with &lt;strong&gt;*these*&lt;/strong&gt; values and you'll see the bug...".&amp;nbsp; Programmer:&amp;nbsp;
now that I think about it, these values will never arise.&lt;br&gt;
&lt;br&gt;
We just forced the programmer to define the legitmite range of values.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;More&lt;/strong&gt;: Bug Report is more believable if given a concrete input is
given.&amp;nbsp; &lt;br&gt;
&lt;strong&gt;API hardening&lt;/strong&gt;: gives the preconditions under which this library
will throw an exception&lt;br&gt;
&lt;strong&gt;Unit-test generation&lt;/strong&gt;: inputs needed to make a specific assert
fail.&lt;br&gt;
&lt;br&gt;
Works by weakest-precondition.&amp;nbsp; Given an assert in the program, work
backwards computing the weakest-precondition at each point until you
reach the top of the method.&amp;nbsp; If this chain works, then hand the WP to
a decision procedure; sometime the decision is "I don't know" but many
times you get an explicit counter example.&lt;br&gt;
&lt;br&gt;
Good match for the counter-example problem.&lt;br&gt;
- Yes loops are a problem, but don't really need the *weakest*
precondition.&lt;br&gt;
&lt;br&gt;
Most Java features can modeled:&lt;br&gt;
- heap r/w, arrays, allocation, subtyping, virtual dispatch&lt;br&gt;
- exceptions: many bugs are here so we modeled exception handling
try/catch&lt;br&gt;
&amp;nbsp; very closely.&lt;br&gt;
&lt;br&gt;
Have a path-explosion problem.&amp;nbsp; For heap updates, even a single path
includes branch effectively depending on aliasing.&lt;br&gt;
&lt;br&gt;
Has to do strong interprocedural analysis.&amp;nbsp; Lazily grow the call-graph
consistent with known symbolic constraints.&amp;nbsp; First look for a conflict
on a call-free path, then on call-paths but without needing to look
into the call.&amp;nbsp; Finally, if I must peek into the call - but use the
symbolic info the reduce the set of possible call targets.&lt;br&gt;
&lt;br&gt;
Generally, no fixed-point.&amp;nbsp; Cannot expect programs to give loop
invariants.&lt;br&gt;
Work-list management: avoid various common pathologies.&lt;br&gt;
Goal is to find &lt;strong&gt;*a*&lt;/strong&gt; contradiction, not to explore all options.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="satire"&gt;&lt;/a&gt;SATIrE&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
&lt;a href="http://www.complang.tuwien.ac.at/satire/"&gt;http://www.complang.tuwien.ac.at/satire/&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Shape analysis, points-to, flow-sensitive, context-sensitive,
annotations, can read them in &amp;amp; out.&amp;nbsp; Applications to e.g., slicing.&lt;br&gt;
&lt;br&gt;
Read C/C++ - large complex tool chain.&amp;nbsp; Reads in program x-forms in
either a functional programming language or in Prolog.&amp;nbsp; Can combine the
tools in any order at this point (very well integrated).&amp;nbsp; Output is
C/C++ w/annotations at each step (or internal "ROSE" ASTrees, etc). &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Has interval analysis for integers, shape analysis, loop (bounds)
analysis, points-to, etc.&amp;nbsp; Gives example of reaching-def's in the
functional programming language.&amp;nbsp; It's a domain-specific language for
writing compiler analyzes.&amp;nbsp; Can specify the strength of the analyzes in
many cases: e.g. the depth of the allowed call-chain for
context-senstive (set to zero for context insensitive).&lt;br&gt;
&lt;br&gt;
Shape analysis gets may/must alias pairs, etc - and all results put
back into the C program as comments/annotations.&lt;br&gt;
&lt;br&gt;
Basically describes a large complex tool chain for doing pointer
analysis on C &amp;amp; C++ programs.&amp;nbsp; The tool chain looks very complete
&amp;amp; robust.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="pointsto"&gt;&lt;/a&gt;Welf Lowe -
Comparing, Combining &amp;amp;
Improving Points-to Analysis&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
...skips explaining what points-to is.&lt;br&gt;
&lt;br&gt;
Clients: compilers, JIT compilers, refactoring tools, software
comprehension.&lt;br&gt;
Different clients have different needs: speed, precision, security.&lt;br&gt;
Granularity &amp;amp; conversatism also matter.&amp;nbsp; Clients might want, e.g.,
dead code elimination, or removing polymorphic calls, or rename methods
&amp;amp; calls.&amp;nbsp; Other people want object granularity - escape info &amp;amp;
side-effects (i.e., compilers and JITs).&amp;nbsp; Also static-vs-dynamic
analysis.&amp;nbsp; Dynamic analyzes typically run the program and produce
optimistic results.&lt;br&gt;
&lt;br&gt;
Doing careful experiments of 2 analysis w.r.t. accuracy.&amp;nbsp; Hard part:
can't get a perfect "gold standard" for real programs.&amp;nbsp; Hard even for a
human to inspect a proposed "gold standard" and declare it "gold" or
not.&amp;nbsp; Some special cases are easier: conservative analysis (no false
positives) &amp;amp; optimistic analysis&lt;br&gt;
(no false negatives).&amp;nbsp; Can under- and over-estimate the gold standard,
but this still messes with the results (messes with detecting which of
the 2 analysis is better w.r.t. accuracy).&lt;br&gt;
&lt;br&gt;
E.g., ask the question: is it worth increasing k&amp;gt;1 (i.e., becoming
context sensitive vs staying insensitive).&amp;nbsp; Checking size of computed
call-graph - smaller graph is more precise.&amp;nbsp; Very small increase in
accuracy.&amp;nbsp; Sometimes made things worse - because the original analysis
was optimistic sometimes.&lt;br&gt;
&lt;br&gt;
Open question: what is the use of an Uber Points-To - it's definitely
beyond the point of diminishing returns for JIT compilers.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="javascript"&gt;&lt;/a&gt;Anders Moller -
Type Analysis for
JavaScript&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Not much tool support for catching errors during programming.&lt;br&gt;
No nice IDE support.&amp;nbsp; Many bugs are browser specific, but we focus on
language bugs.&lt;br&gt;
&lt;br&gt;
Most JS programs are quite small, so throw a heavy-weight analysis at
it.&lt;br&gt;
&lt;br&gt;
Object based, dynamic typed, proto-type inheritance, 1st-class
functions, coercions, dynamic code construction, exceptions, plus 161
built-in functions.&amp;nbsp; Types include null &amp;amp; undefined, primitive
types, some properties "ReadOnly" and "DontDelete".&lt;br&gt;
&lt;br&gt;
Broken things: tracking down an "undefined" and where it came from,
reading an absent variable or absent property, etc.&lt;br&gt;
&lt;br&gt;
So do a dataflow analysis over the program, including a lattice,
transfer functions, etc.&amp;nbsp; Handle interprocedural stuff.&amp;nbsp; Tightly
interleaved control flow &amp;amp; data - so need a complex lattice. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Lattice is roughly as complex as C2's lattice, with more stuff on the
allocation sites and less on the class hierarchy.&amp;nbsp; Also then model all
the abstract machine state (C2 does this as well).&amp;nbsp; Full context info
(total inlining at analysis time except for recursion).&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Cliff Notes: &lt;/strong&gt;Nice idea: for each allocation site, model both
the &lt;strong&gt;*most recent object*&lt;/strong&gt; from that allocation site, and also
model &lt;strong&gt;*all the rest*&lt;/strong&gt; from that site.&amp;nbsp; You can do strong-update
on the singleton most-recent object, but weak update on all the rest.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="jvm_perf"&gt;&lt;/a&gt;Rhodes Brown -
Measuring the Performance
of Adaptive Virtual Machines&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Noticing the badness of the "compile plan" - 1 run in 10 had a &amp;gt;10%
performance reduction.&lt;br&gt;
&lt;br&gt;
JIKES - always compiled, no interpreter.&amp;nbsp; On demand at 1st invoke.&amp;nbsp;
Focused recompilation, lots of GC variants.&lt;br&gt;
&lt;br&gt;
Points out the places where you get variations - stack-profiles for
estimating the call-graph, basic-block counters, etc.&lt;br&gt;
&lt;br&gt;
Causes of variation?&lt;br&gt;
So far: GC &amp;amp; inlining&lt;br&gt;
Concerned that i-cache placement is a major issue of instability.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;*No*&lt;/strong&gt; correlation between total amount of compile-time spent vs
score.&amp;nbsp; You must compile the hot stuff, but apparently tons of
not-so-hot code is getting compiled and it doesn't help the score any.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="cheap_cores"&gt;&lt;/a&gt;Kurt William -
lots of cores for cheap&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
General discussion round, not really a "talk".&lt;br&gt;
&lt;br&gt;
Expect &amp;gt;32 core/die in 5 yr, &amp;gt;128 core/die in 10 yrs.&lt;br&gt;
What's "New" and what can we expect in 5 years?&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Theory&lt;/strong&gt;: contention is that there's nothing new here since the
70's &amp;amp; 80's (concerning concurrency).&amp;nbsp; (general agreement on
incremental progress but not breakthrough).&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Systems&lt;/strong&gt;: Lots of libraries (TBB, pthreads), lots of platforms
(webservers), all OS's support multi-core.&amp;nbsp; JVMs are New in this
space.&amp;nbsp; Might see OS-support for user-level thread scheduling.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Languages&lt;/strong&gt;: Support been there, but it's locks &amp;amp; threads - no
good.&amp;nbsp; Support for TM is coming, but will it help?&amp;nbsp; Then there's CSP...&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;Applications: &lt;/strong&gt;Not much new here.... more games, more
interactive more user-interface.&amp;nbsp; Old stuff: lots of HPC, speech,
image.&amp;nbsp; Matching: easy parallelism; lots of interesting apps here.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="fpga_variants"&gt;&lt;/a&gt;Uwe Kastens,
Compiler-Driven Dynamic
Reconfiguration of Arch Variants&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
What is "reconfigurable"?&amp;nbsp; "Usual" approach - HW guys compile to a
netlist, do some mapping, place &amp;amp; route, make a FGPA/silicon, while
the SW guys write some assembler on the FPGA - then at the end they try
to run the SW on the HW.&amp;nbsp; Goal is to run some complex function faster
on a FPGA than in normal ops.&lt;br&gt;
&lt;br&gt;
Our approach: read source code, look at program structure, find hot
code &amp;amp; hot loops; compiler knows how to switch the hardware between
variants; then it compiles the code for a particular FPGA variant for
each loop, and reloads the FPGA with a new function between loops.&amp;nbsp; But
need to limit the FPGA to certain variants so don't actually need to
reload the whole FPGA. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Fixed small set of architecture variants.&amp;nbsp; Keeps cost to reconfigure
the FPGA very low.&amp;nbsp; Compiler can choose the variants.&amp;nbsp; Example: FPGA
runs either a small CPU or a small SIMD CPU or a MIMD CPU.&amp;nbsp; Or
reconfigure the pipeline structure, or the default register bank access
patterns.&amp;nbsp; In the different configurations can use more or less power
for problems that are more or less parallel.&amp;nbsp; e.g., in SIMD mode have
all ALUs active but turn all but 1 decoder.&lt;br&gt;
&lt;br&gt;
Pick variants &amp;amp; design space to those known to be efficiently
solved by a compiler - gets a fast compile time &amp;amp; good results
there (use a human elsewhere).&lt;br&gt;
&lt;br&gt;
Can map any arch register to any physical register - so can e.g. have
CPUs share registers, or do unbalanced register allocation - if some
CPU needs more registers in it's portion of code than others.&amp;nbsp; Does
some register allocation affinity games, to avoid having to reconfigure
register layouts between blocks of code. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Nicely flexible overall design.&amp;nbsp; Instructions name arch registers &amp;amp;
ops, but also can re-map real registers &amp;amp; ops periodically.&amp;nbsp; So use
targeted registers &amp;amp; ops for a particular chunk of code, then
reconfigure when the next chunk of code is different enough from the
last chunk- and the cost to reconfigure is lower than running with a
less-well-targeted registers &amp;amp; instructions.&lt;br&gt;
&lt;br&gt;
Ugh, can't read data charts - has some 4 embedded programs.&amp;nbsp; Seeing 30%
reduction in execution time AND power cost, by reconfiguring.&amp;nbsp; Code
size is a little larger, because either poor parallization(?) &amp;amp; the
reconfiguration overhead.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="clones"&gt;&lt;/a&gt;Jim - Clone Detection&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Exact clone'd code, white-space only changes, near-miss clones.&lt;br&gt;
Bug in 1 deserves inspection in all the rest.&lt;br&gt;
&lt;br&gt;
Lots of cloning studies out there -esecpially in-depth of the linux
file system.&amp;nbsp; But no wide-range-of-systems studies.&amp;nbsp; We did a bunch of
open-source systems, C, Java, C#.&amp;nbsp; OS, Web apps, DBs, etc.&amp;nbsp; Used their
own clone-detection, as a combo of AST-detection and text-based
detection.&lt;br&gt;
&lt;br&gt;
Text-based comparisons are sensitive to formating changes.&amp;nbsp; But AST
parsing is expensive (requires tree compares) and does not find "near
miss" clones.&amp;nbsp; Ended up doing "standard pretty-printing" - which uses
the AST to normalize the text.&amp;nbsp; Then do text-based comparisons.&amp;nbsp; Also
handle small diffs in the middle of clones.&lt;br&gt;
&lt;br&gt;
Did 24 systems, 10 C, 7 Java, 7 C#.&amp;nbsp; Linux 2.6 kernel, all of
j2sdk-swing, db4.&amp;nbsp; Varied from 4KLOC to 6MLOC.&lt;br&gt;
&lt;br&gt;
Java methods have more cloning than C methods - 15%-25% (depending all
the diff threshold) of Java methods are clones; C varies from 3% to
13%.&amp;nbsp; C# is more like Java, until the methods allow more diffs - and
then there is a lot more cloning.&amp;nbsp; Swing is hugely clone-ish, &amp;gt;70%.&lt;br&gt;
&lt;br&gt;
C systems - Linux has the same cloning behavior as postgres &amp;amp; bison
&amp;amp; wget &amp;amp; httpd. Java has no more clones as you allow more diffs
as compared to C - I suspect good Java refactoring tools detect &amp;amp;
remove clones earlier.&lt;br&gt;
&lt;br&gt;
C# shows a LOT more clones as more diff's are allowed.&amp;nbsp; Maybe
cut-n-paste is a coding style?&amp;nbsp; C#'s clones are larger methods as well.&lt;br&gt;
&lt;br&gt;
Something like 10-20% of all files in Java have 50% of functions are
clones.&amp;nbsp; In C# it's more like 20-35% of all files.&lt;br&gt;
&lt;br&gt;
Files with 100% cloned code: 15% of C# files are totally cloned.&amp;nbsp; For
Java whole file cloning is fairly low.&lt;br&gt;
&lt;br&gt;
Localization of cloned pairs: are they in the same file?&amp;nbsp; same
directory?&amp;nbsp; different directories?&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="business"&gt;&lt;/a&gt;Thomas Gschwind -
Business Process
Modeling Accelerators&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Business Process Modeling, patterns, x-forms, refactorings&lt;br&gt;
Process Structure Tree - &lt;br&gt;
&lt;br&gt;
&amp;nbsp;- Process is a large unstructured graph.&amp;nbsp; Visio-style graphical is
error prone for large process models.&lt;br&gt;
&lt;br&gt;
But can take the structure graph and build a tree from it.&lt;br&gt;
(The graph is all fork/join parallelism).&lt;br&gt;
A small part of the tree requires state-space exploration.&lt;br&gt;
&lt;br&gt;
After the tree is correct, can apply patterns &amp;amp; Xforms to the tree.&lt;br&gt;
&lt;br&gt;
Essentially Petri Net edit'ing on a structured graph/tree using Eclipse.&lt;br&gt;
&lt;br&gt;
100's of users inside of IBM; generated business models are much higher
quality, make them 2x faster.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="pdf_attack"&gt;&lt;/a&gt;Philip Levy, The
New Age of Software
Attacks&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Why? &lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Not enuf to do?&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Crime is Big Business&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Usable exploit: can be sold for $75K to $100K&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;So called "security researchers".&amp;nbsp; Some are respectable, most are
going after the $100K payoff.&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
"Visual Virus Studio" - actually exists.&amp;nbsp; PDF is a common vector for &lt;strong&gt;*targeted*&lt;/strong&gt;
attacks.&amp;nbsp; "Adobe Reader is the new Internet Explorer".&lt;br&gt;
&lt;br&gt;
Looked at all the fixed vulnerabilities in Adobe Reader&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Need to do a better job teaching students on how to write secure
software&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Stop writing in assembly language&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;ANY unmanaged code, include C/C++, should be banned&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Languages need to be upgraded to be more secure&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;New constructs &amp;amp; type systems&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;More sophisticated testing tools &amp;amp; methodologies&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Better analysis tools for legacy systems&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
How big is Adobe Reader?&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;100K FILES&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;42MLOC, no comments or blank lines: 25MLOC&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Estimated 75 man years of development per year over it's lifetime&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Problems are clustered - e.g. one module has had 25% of all security
problems.&lt;br&gt;
Top 6 modules have 80% of security problems.&lt;br&gt;
&lt;br&gt;
Break down problems another way:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;array index OOB - 32%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;unknown - 27%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;intege overflow - 14%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;range check missing - 8%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;capability check missing - 4%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;out of memory 3%&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
How found:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Fuzzing - 42%&amp;nbsp; (creating illegal inputs and throwing them at the
program)&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Code Review - 35%&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;External report - 12%&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Some examples:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;asserting against null, but no asserts in debug build and null
possible in release build.&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Failed to check for error returns, sometimes even not catching
return value&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Casting "attacker controlled values" (values from the PDF file)
to valid enum&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Or using attacker-controlled value for loop &amp;amp; array bounds&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Sometimes a complex pre-condition needs to be checked first&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Some of the fixes are still incorrect&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
A Quick List:&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Badly designed lib funcs (strcat, sprintf)&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Microsoft's banned API list is 4 pages&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;AIOOB&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Calling thru func ptrs&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Integer over/underflow.&amp;nbsp; Many protocols send N followed by N
bytes.&amp;nbsp; Send a really big number that overflows (after scaling for
malloc) into a small number - then malloc succeeds, but the data
overwrites the buffer.&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Pointer math&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Uninitialized vars&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Dangling ref pointer&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;Basically: it's BASIC STUFF that goes bad, not fancy stuff.&lt;br&gt;
&lt;br&gt;
All memory integrity or soundness type stuff.&lt;br&gt;
But it goes beyond AIOOB - &lt;br&gt;
e.g. Name string like "Robert); DROP TABLE Students;-- "&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="britney_spears"&gt;&lt;/a&gt;Jeff Fischer,
Solving the "Britney Spears
Problem"&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Users assigned roles; roles are assigned privileges. &lt;br&gt;
Benefit: users can come &amp;amp; go without changing policies.&lt;br&gt;
&lt;br&gt;
"Patients" can view medical records, but "doctors" can change them.&lt;br&gt;
&lt;br&gt;
Can add annotations to Java that specify roles.&lt;br&gt;
&lt;br&gt;
Problem: lack of expressiveness; solution: logging.&amp;nbsp; Violation happened
(people saw Britney Spears' data), but the logging allowed catching the
violators.&amp;nbsp; Another solution: manual checks.&amp;nbsp; But no compiler support,
so missed checks.&lt;br&gt;
&lt;br&gt;
Problem: Lack of explicit access policies.&lt;br&gt;
Really: need a way to "type" data (and hence methods using the data)
with some kind of security/access policy.&amp;nbsp; More problems: often checks
are hoisted to entry point for efficiency, and then violate policy
later.&amp;nbsp; Relying on manually-inserted runtime checking.&lt;br&gt;
&lt;br&gt;
Our stuff: parameterize classes by some kind of index value defining
Role. &lt;br&gt;
Each method includes an explicit policy (statically checked).&lt;br&gt;
Annotation processor can either statically type-check the policy, or
insert runtime checks.&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="squeeze_tab"&gt;&lt;/a&gt;Jens Knoop -
Squeeze your Trusted
Annotation Base&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
"Interesting program properties are undecidable!"&lt;br&gt;
&lt;br&gt;
Compiler optimization - not so worried: just lose some optimality.&amp;nbsp;
Results still usable; but if we have better results we can produce
better code.&lt;br&gt;
&lt;br&gt;
Worse-Case execution time analysis for hard real time: Bang.&amp;nbsp; Your dead.&lt;br&gt;
WCET bounds must be safe (soundness), and wish to be tight (optimal or
useful).&lt;br&gt;
E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but
useless.&lt;br&gt;
Need, e.g., all loop bounds. &amp;nbsp;&lt;br&gt;
Need user annotations.&lt;br&gt;
User annotations are untrustworthy, but must be trusted.&lt;br&gt;
&lt;br&gt;
Squeeze the trusted annotation base - reduce user chance of screwing up.&lt;br&gt;
&lt;br&gt;
Try to replace trust by proof - as a side-effect generally improving
optimality.&amp;nbsp; Then take advantage of trade-off between power &amp;amp;
performance of analysis&amp;nbsp; algorithms.&lt;br&gt;
&lt;br&gt;
Use model-checking with care.&amp;nbsp; If something can't be bounded
automatically, as&lt;br&gt;
the user for some bound - then prove it.&amp;nbsp; BUt also believe the user's
bound isn't tight.&amp;nbsp; Try to prove a tighter bound with the
model-checking using binary search.&amp;nbsp; Indeed - can do away with user
assistance often; just guess and binary search to tighten it.&lt;br&gt;
&lt;br&gt;
Summary: can often completely empty the trusted annotation base,
usually shrink it; often improve WCET bounds.&lt;br&gt;
&lt;br&gt;
Start with the cheapest techniques (constant prop), move up to interval
analysis, then symbolic bounds &amp;amp; model checkers...&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="ghost"&gt;&lt;/a&gt;Joe Newcomer - Who do
you trust to run
code on your computer?&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
(Joe is a Windows attack expert)&lt;br&gt;
&lt;br&gt;
Attack Vectors - autoexec.bat, .cshrc &amp;amp; friends, autoplay,
device-driver installs (Sony), mac resource attacks&lt;br&gt;
&lt;br&gt;
Engineered attacks: phishing attacks, downloads, activeX, client-side
scripting&lt;br&gt;
&lt;br&gt;
Lots of "security" designs &amp;amp; codes are done by junior programmers
w/out adequate oversight.&amp;nbsp; Pervasive attitude: "software is secure - as
designed, as implemented, as maintained".&amp;nbsp; Small+fast is all that
matters...&amp;nbsp; not...&lt;br&gt;
&lt;br&gt;
Rant against all modern OS's, talking about #of security patches in
Mac, Linux, unix, Windows...&lt;br&gt;
&lt;br&gt;
Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc.&lt;br&gt;
&lt;br&gt;
What are we teaching people about security?&amp;nbsp; How about non-CS types? &lt;br&gt;
Graphics designers, website designers...&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="swarms_in_space"&gt;&lt;/a&gt;Stefan
Jahnichen, Swarm Systems in Space&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Build very small satellites, swarms of them to watch the Earth.&amp;nbsp; They
will spread out after launch to watch e.g. weather, bush fires, etc.&amp;nbsp;
Takes about 2 days after launching to get a communication link.&lt;br&gt;
&lt;br&gt;
Satellites have good earth coverage; orbit at 500 to 600km, orbit earth
every 3 hrs or so.&amp;nbsp; Wide range of sensors amongst different satellites.&lt;br&gt;
&lt;br&gt;
Have a Real Swarm of cameras; map a Virtual Swarm onto that; map
mission goals on the V.S. &amp;nbsp;&lt;br&gt;
&lt;br&gt;
Optimized synthesis of consensus algorithms.&lt;br&gt;
&lt;br&gt;
Special OS - distributed OS basically.&amp;nbsp; Used to maintain swarm
integrity (keep satellites together). &amp;nbsp;&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;hr size="2" width="100%"&gt;&lt;strongig&gt;&lt;strong&gt;&lt;a name="boids_occam"&gt;&lt;/a&gt;Peter Welch,
Engineering Emergence...&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Thesis: in the future, systems will be to complex to design &lt;strong&gt;*explicitly*&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
Instead engineer systems to have the desired properties "implicitly".&lt;br&gt;
&lt;br&gt;
"mob programming", ant mounds, etc.... simple rules of behavior leading
to complex behaviors that emerge.&amp;nbsp; Emergent Behavior&lt;br&gt;
&lt;br&gt;
Mechanisms Design - (game theory), &lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;Rational actors have local private info&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Emergent: optimal allocation of scarce resources&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Optimal decisions rely on truth revelation&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Swarming behavior (flocks, etc)&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;local actors, local behaviors&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;Emergent "swarm" behavior&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;UAV swarms &amp;amp; autonomous robots&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Social communication (gossip, epidemic algorithms)&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;large, ad hoc networks&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;emergent: min power to achieve eventual consistency&lt;/big&gt;&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;low power, low reliability sensors &amp;amp; data propagation&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
Design for emergent behavior&lt;br&gt;
&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;occam process per "location", a 1024x1024 grid has 1mil processes&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;each location has a com channel to it's 4(8) neighbors&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strongig&gt;occam process per "mobile agent"&lt;/big&gt;&lt;ul&gt;
&lt;li&gt;&lt;strongig&gt;agents register with local process (or up to 4 if it's
straddling a local region) &amp;amp; local process tells the agent what's
nearby.&lt;/big&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;strongig&gt;&lt;br&gt;
&lt;br&gt;
No idea how to design high-level behavior from low-level rules, but
busy experimenting and looking for some cases.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
Cliff&lt;br&gt;
&lt;br&gt;
&lt;/big&gt;&lt;/div&gt;&lt;/div&gt;
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/05/ifip-wg24-trip-report.html</feedburner:origLink></entry>
  <entry>
    <title>DaCapo Trip Report</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/5_Yne_x903U/dacapo-trip-report.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=66395247" title="DaCapo Trip Report" />
    <id>tag:typepad.com,2003:post-66395247</id>
    <issued>2009-05-05T09:23:31-07:00</issued>
    <modified>2009-05-05T16:23:31Z</modified>
    <created>2009-05-05T16:23:31Z</created>
    <summary>http://www.cs.tufts.edu/research/dacapo DaCapo is a small (but growing) research group focused on GC &amp; Memory Management issues. They meet about every 9 months, and the meeting bounces around depending on who is hosting it. People mostly present work-in-progress, and the audience...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><div class="moz-text-html" lang="x-western">



<br />
<pre wrap=""><a class="moz-txt-link-freetext" href="http://www.cs.tufts.edu/research/dacapo">http://www.cs.tufts.edu/research/dacapo</a><br /></pre>

DaCapo is a small (but growing) research group focused on GC &amp;
Memory Management issues.  They meet about every 9 months, and the
meeting bounces around depending on who is hosting it.  People mostly
present work-in-progress, and the audience is up front with their
critiques... often interrupting the speakers or getting sidetracked
into loud off-topic conversations, i.e., it's a fun boisterous crowd.<br />
<br />
DaCapo is in the Boston area this year, held at <a href="http://www.tufts.edu/">Tufts University's Medford</a> campus
(everyone's heard of Harvard and MIT, but how about the Berklee College
of Architecture, Fisher College, Boston University, Suffolk University
and dozens and dozens more? - Boston has more colleges and universities
per square mile than any other place I know).  As usual for me
traveling to Boston, I skipped the car and just used public
transportation - the <a href="http://www.mbta.com/rider_tools/trip_planner/default.asp">T
works marvelously well</a> (as compared to e.g. VTA; crossing the
entire town from Logan Airport stuck way out in the water to Tufts some
8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose
on the VTA takes something like 2 hrs).  <br />
<br />
Anyways the weather was supposed to be highs of 65 and rainy with lows
in the 40's - so I packed my parka.  Turned out the weather report was
a <strong>little bit off</strong>, with highs in the upper 80's and sunny.  Man
did I roast.  But I got out and enjoyed the sun - we (the entire DaCapo
group of 40 or 50 people) walked about a mile to lunch each day, plus I
walked to and from my hotel &amp; the T stop (also about a mile). 
Enough ruminating, on to the talks!<br />
<br />As usual, I pay attention to some talks and chat w/neighbors through others.  I brutally summarize.  I'm opinionated &amp; narrow focused... so Caveat Emptor.<br /><br />
<hr size="2" width="100%" /><br />
<strong>Change Tracking</strong><br />
<br />
What changed?  Who did it?  When?  Points out the flaws of using 'diff'.<br />
<br />
Logical-Structure diff.  Auto-magically spotted systematic change
patterns.<br />
It's a better code-review diff.<br />
<br />
Can report diff's as reports like "for all XXX, replace XXX with YYY
except for ZZZ_XXX_ZZZ", etc.  Vastly better error reporting!!!  Vastly
better diff reporting!!!<br />
<br />
<strong>Cliff Notes: </strong>We <strong>must </strong>get this gizmo and check it
out!!!!<br />
<br />
Paper report: <a href="http://users.ece.utexas.edu/%7Emiryung/Publications/icse09-LSdiff.KimNotkin.pdf">http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf</a><br />
Sample output: <a href="http://users.ece.utexas.edu/%7Emiryung/LSDiff/carol429-430.htm">http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm</a><br />
<br />
<hr size="2" width="100%" /><br />
<strong>Inferred Call Path Profiling</strong><br />
<br />
Very low cost path profiling.  Hardware counters have very little cost,
but provide little context.  Software is reversed: high context &amp;
high
cost. <br />
<strong>Cliff Notes: </strong>Compare this to call/ret RPC hashing.<br />
<strong>Cliff Notes: </strong>Bond &amp; McKinely does something similar.<br />
<br />
Instead, look at number of bytes from main() to leaf call on stack. 
Basically PC &amp; SP (offset from stack base) provides all the context
we need.  Dynamically detect unique PC &amp; SP (they did it in a
training run) pairs, and reverse them to get a call-stack.<br />
<br />
Lots of duplicated stack heights, how to tell them apart?  On SpecInt
unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call
stack).  But huge spread; really sucks bad on some benchmarks.<br />
<br />
If have a list of entire call-graph, then can remove ambiguity by
growing some frames.  But requires some global analysis to know which
frames to grow in order to be unique.  "Solved" problem with a simple
greedy search.  Compute all call-stacks over whole program run, pick a
conflict, grow some frame, re-compute &amp; keep if more PC/SP pairs
are distinct.  Repeat until it's rare enough that can live with dups.<br />
<br />
<strong>Cliff Notes: </strong>The whole-program ahead-of-time thing isn't feasible for us.  How to do this on-the-fly during tick-profiling? 
Hash PC/SP, find a
tick record, 1-time-in-1024 check for dups (crawl stack the hard way
and confirm call-stack), otherwise assume unique &amp; increment tick
record.  If MISS in hash table, then crawl stack the hard way.  If
record is labeled as having dups, then crawl the call-stack &amp; find true record?)  Actually, probably expect the
call stacks to be common, but the PC's to be fairly unique.  So really
asking for unique RPC+SP (not PC+SP).<br />
<br />
His experiments: collect PC/SP pairs about 100 times/sec.  0.2%
slowdown (must have had careful measurements to detect that!)<br />
<br />
<hr size="2" width="100%" /><strong><br />
Breadcrumbs - Recovering Efficiently Computed Dynamic Calling Contexts</strong><br />
<br />
Also builds on probabilistic calling contexts.  <br />
<br />
Given a (nearly) unique ID (say, computed at each fcn entry via a
simple hash on the return PC).  IN the past, reversing these unique
ids: use a dictionary (very large &amp; high cost), use a
calling-context-tree (smaller, but requires you to track where you are
in the CC-tree at runtime).  Forward search, but undecidable and
limited options for pruning.<br />
<br />
Attempt to reverse the hash fcn (but only when we need to take the
semi-unique id) and convert it to a call-stack.  Straight-up reverse is
no good, but is easy to guide search.  Callers of leaf call themselves
have structure - e.g. expect to see a call-op from the call-site, etc. 
Still need a list of all call-sites (generally available anyways
because of inline-cache structures).  His function is "p' = (3p+c)", where "c" is RPC?  Requires a handful of instructions on function entry.<br />
<br />
<strong>Cliff Notes: </strong>If this reverse search is successful most of the
time, it says that during a profile tick we can record just the PC/SP
(or RPC/SP/PC?) and then reverse them to the entire call-stack later -
on demand when RTPM requests.  So a 'tick' is no more than record a few
bits (and handle buffer overflow), but<strong> no stack crawl</strong>.<br />
<br />
<strong>Cliff Notes: </strong>combined nicely with 'null-assignment-stack' info,
ie. collect the entire stack and 'color' all NULLs with a unique stack
id.  Then if get a NPC can report both the NPC at the crash site, but
also the call-stack that assigned the NULL in the first place.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Transparent Clustering - </strong><strong>Phil McGachey, Eliot Moss, Tony
Hosking</strong><br />
<br />
Attempt to auto-distribute multi-threaded programs.  Pure Java, no JVM
hacks.  <strong>No programmer involvement.  </strong>Dynamic class re-writing. 
JVMs can
vary.  Hardware can vary.<br />
<br />
All machines on a network run RuggedJ nodes.  RuggedJ includes a
rewriting class loader, some runtime libs, and some
application-specific partioning strategy.<br />
<br />
Rewriter: generate lots of classes &amp; interfaces; especially hook
methods &amp; fields.  Hook into the runtime.  Abstract over object
location; remote &amp; local objects.<br />
<br />
Ex: Standard class "X".  Rewrite into a X_Local and X_Stub classes.  X
objects on Node A are of type X_Local.  On Node B, X objects are really
X_Stub objects where all the calls are remote calls.  Both X_Local and
X_Stub implement the "X" interface.  Some other Y_Local can refer to
X_Local on Node A, but refer to X_Stub on Node B.  If want to migrate
from Node A to Node C, want to indirect the X_Locals; so for mobile
objects there's an indirection layer called "X_Proxy", that then be
switched from pointing to an X_Local vs an X_Stub (if the object
migrates).  Non-mobile objects skip the X_Proxy indirection layer.  <br />
<br />
Inheritance &amp; interfaces are maintained in the rewritten classes. 
Static classes conform to the RuggedJ model (I assume replication is
allowed?).  Arrays are also wrapped to support indirection.  <br />
<br />
Constraints: Native code breaks everything.  Try to not rewrite stuff
referred to by native code (maybe can intercept all native calls and
rewrite args?).  System code is also an issue: loaded by the system
loader, is not rewritten.  (Can they use -Xbootclasspath?).  But most
code run is actually system code!  (eg. String libs are very common!)<br />
<br />
So must deal with system code without changing the JVM.  <br />
<ol>
<li>Can wrap system class objects.  Must unwrap when passing to
system or native code; must re-box on return.  Re-wrapping is
expensive: need the exact same wrapper, so need some kind of
hash-lookup to do re-wrapping.  High overhead, so avoided where
possible.<br />
 </li>
<li>Extend: some classes can be extended and preserve the original
class &amp; signatures.  Extended class follows the RuggedJ model but
also the original system model. No need to unwrap/re-wrap.  Does not
work e.g. when system code generates the object (since sys code makes
instances of the original and not of the extended superclass).  <br />
 </li>
<li>Some system classes are only ever referenced from user code; can
just dup the code out of rt.jar and make a user-class (which now gets
rewritten by the normal RuggedJ model).  <br />
 </li>
<li>Immutable classes.  Just replicate.<br />
 </li>
</ol>
<br />
Apparently able to cluster some complex programs.<br />
No idea about performance as compared to explicitly clustered apps.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Jikes RVM Native Threads</strong><br />
<br />
Was Green threads, now Native threads<br />
(everybody else is dropping green threads too....)<br />
Motherhood &amp; Apple Pie.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Computer Science as an Experimental Science</strong><br />
<br />
Reproducibility?  Experimental bias?<br />
Show how these problems are solved in very-high-energy astrophysics.<br />
<br />
(1) No publish without multiple colleague confirmation.<br />
(2) Publish<br />
(3) MUST also now publish reproductions or <strong>non-</strong>reproductions's
of other peoples' experiments (or lose credibility as a research
entity).  But bar to publication is lower, and expectations are lower. 
i.e., people expect to publish in-progress work<br />
<br />
Biggest comp-sci problem now: lack of infrastructures.<br />
Examples of infrastructure bias: green-vs-native threads, some barrier
styles are hard to use (.Net embedded objects), stack-scan (OVM makes
fast stack scanning at the cost of everything else being slower). 
HotSpot (high-value compiler, good GC, good most everything) vs JIKES.<br />
<br />
<strong>Cliff Notes: </strong>This was actually a really fun talk (and <strong>lots </strong>of
discussion) despite my paucity of notes.  <br />
<br />
<hr size="2" width="100%" /><br />
<strong>Web 2.0 Profiling</strong><br />
<br />
Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite.<br />
Relies on browser to run Flash, JavaScript, REST, AJAX.<br />
Benchmarks: IBM social networking software; Java PetStore 2.0,
AJAX-based RIA using JEE5, uses Lucene search.<br />
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14),
selectTag (PetStore 9), etc.  <br />
Workload mix can be varied, but based on commonly observed usage
pattersn from internal deployment of Lotus Connections).<br />
<br />
Wanted to remove DB from the workload, put DB on ramdisk.  Same for
LDAP.  Used Grinder to generate load.  Main server running
J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5
1.66GHz.  Fairly reasonable setup for a <strong>small </strong>web/app load.<br />
<br />
See lots more chattiness; very short fast requests, high interrupt
rates, smaller requests but many more of them.  Looked at low-level
profiling.  See fewer I-cache misses as compared to a large legacy
apps.  Better I-cache behavior.  But data-cache miss rate is still very
significant.<br />
<br />
Also different transactions have different scaling limitations.  For
PetStore the Java persistence layer started falling off after 4 cpus.  <br />
<br />
<hr size="2" width="100%" /><br />
<strong>Stream Processing</strong><br />
<br />
<a href="http://www.cs.ucsb.edu/%7Eckrintz/papers/gedik_et_al_2008.pdf">"Spade"
- Cluster stream processing at IBM</a>.  Continuous high-level
streams.  Stream arrives for weeks &amp; months; want to process and
then emit stream data continously.  Want to express the problem with a
language.<br />
<br />
Priorities - fast streaming on a cluster; then Generality; finally
Usability.  Beyond <a href="http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-620.pdf">StreamIt</a>. 
Like StreamIt, works with streams that
can be split &amp; joined.  Language is typed; stream contents are
strongly typed.  <br />
<br />
<hr size="2" width="100%" /><strong><br />
Demystifying Magic - High-Level Low-Level Programming</strong><br />
<br />
<a href="http://domino.research.ibm.com/comm/research_people.nsf/pages/dgrove.vee09.html/$FILE/vee25-frampton.pdf">Seen
this talk before</a>, but this time given by Steve Blackburn.  Nice
historical perspective on writing system software in C vs asm - and the
parallels to writing sys software in Java vs asm.<br />
<br />
Java-In-Java<br />
<ul>
<li>Key Principle: Containment.  Write as much as possible (99%!) in
pure Java, and dive into the low level "magic" bits as little as
possible.</li>
<li>Extensibility: Requires change quickly, languages change slowly.</li>
<li>Encapsulation: contain low-level semantics</li>
<li>Fine-grained lower of semantics: minimize impedance, seperation
of concerns</li>
</ul>
<br />
Framework:<br />
<ul>
<li>Extend Java semantics; intrinsics</li>
<li>Scoped semantics changes</li>
<li>Extend types; un/boxing, ref-vs-value, architecture sizes</li>
<li>Pragmatic: retain syntax where possible to keep front-end tools
working</li>
</ul>
<br />
Introduce raw pointer type: "Address" looks like a Java type.<br />
E.g.: <strong><font face="Courier New, Courier, monospace">Address a =
...;   ... = a.loadByte(offset);</font></strong><br />
<br />
Semantic Regimes:<br />
Additive: "its ok here to have unchecked casts"<br />
Subtractive: "no allocation via new allowed", similar to HotSpot
VerifyNoGC.<br />
<br />
Allow "unboxed" types - no v-table, so it's a bare primitive type.  But
this isn't the default (but can be asked for).<br />
<br />
Abstraction results are good: can implement a JVM on bare hardware and
with a simple change essentially virtualize the same JVM inside a GC
debugging harness running inside another JVM (where the virtualized
JVM's heap is just a big array in the outer implementing JVM), etc.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Grace, Safe Multithreaded Programming for C/C++ - Emery Berger</strong><br />
<br />
Seen the paper before, but this is the talk by Emery.<br />
<br />
Fake sequential semantics on stock multi-threaded hardware using
fork/join and process protection.  <br />
<br />
Grace is a pthreads replacement: "g++ -lpthreads" becomes "g++ -lgrace".<br />
<br />
Speedups: in the absence of common benign races, Grace run programs run
about 10% slower than raw pthreads - but have full sequential
semantics.  It's "as-if" all the parallel programs are run sequentially
immediately at the thread spawn point.  <br />
<br />
CAN'T run applications where threads run forever, i.e. reactive or
server apps.<br />
Works well with fork/join, Cilk, TBB, etc.<br />
<br />
So thread spawn becomes a unix fork with COW, use mmap for allow memory
sharing.  At join points, smash in join'd threads memory updates via
mmap.  Also need scalable thread-local malloc heap, plus aligned
globals (to avoid false-sharing at the page granularity), plus some
improved I/O.<br />
<br />
Some simple measures remove nearly all false sharing.  Big one:
everybody mallocs into their own space.  2nd big one: spread the
globals out 1 per page.<br />
<br />
<strong>Thread-spawn is as cheap as a fork on linux</strong> (has experiments to
show it).  Due to thread-cpu affinity, if you spawn a bunch of threads
they share a single CPU.  At the low end of thread-grain-length, fork
is faster than spawn because the scheduler immediately spreads the
processes across CPUs, whereas threads share for awhile.  So fork is
actually faster than thread-spawn for awhile.<br />
<br />
Real performance limiter for Grace is conflicts &amp; rollbacks, and
not thread-spawn overheads.<br />
<br />
Performance is much better than e.g. STM, because after the 1st touch
on a new page (and the SEGV &amp; accounting), all accesses on that
page run at full speed.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>GC Assertions: Using the GC to check heap properties</strong><br />
<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&amp;rep=rep1&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&amp;rep=rep1&amp;type=pdf</a><br />
<br />
Global knowledge is required to understand seemingly local properties.<br />
JBB example:<br />
<br />
<strong><font face="Courier New, Courier, monospace">    Order order;<br />
    for(...)<br />
      order = ....;<br />
    delete orderTable;<br />
</font></strong>    ? are all 'orders' dead here? <br />
<br />
But actually 'order' is held by the last-customer-transaction (as an
optimization, or convienvence for customer?).  Leads to a leak of
'orders'.  Really: programmer does not understand program, too complex.<br />
<br />
Add assertions to the code, and use the GC to check the property.<br />
<br />
<strong>Cliff Notes: </strong>Azul is already gathering global heap properties
during GC - but not asserts.  Gathering liveness &amp; points-to info. 
If points-to included a handful of sample real instances (and all the
sames linked when possible), would be a very powerful way to instantly
check some heap properties.<br />
<br />
Sample asserts:<br />
'assert_dead(p)' - expect to reclaim 'p' at the next GC cycle<br />
Or region-collection: start_region(); ... assert_all_dead();<br />
Or shape property: assert_unshared(p);<br />
Or ownership properties (members of a collection are 'owned' by the
collection)<br />
<br />
If an assert fails, then do a slow crawl and provide the full path
through the heap showing the failure.   Asserts only checked at GC
points.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Proving Proving Correctness of Abstract Concurrency Control and Recovery
- Trek Palmer &amp; Eliot Moss</strong><br />
<br />
Transactions - closed nesting has issues<br />
Open nesting &amp; Boosting need from programmer: conflict info &amp;
roll-back info.<br />
These are hard to provide correctly.<br />
<br />
This work: a description language of abstract data types or structures.<br />
Can describe the conflict predicates &amp; inverses.<br />
Can prove correctness of the conflict &amp; inverse expressions.<br />
<br />
Working with the <strong>abstract</strong> description of the data structure,
and
NOT e.g. with the real Java implemention.  E.g., no loops, no
mutations, no recursion.  So the language isn't a 'real' language, but
can describe many kinds of ADTs in code that looks sorta like Java.<br />
<br />
Output isn't a functional program, instead the output is the result of
using a SAT solver to prove correctness.  <br />
<br />
XTN-semantics allows conflict detection to be proven correct pairwise,
instead of having to do full interleaving.  Formally, a conflict<br />
predicate is correct iff it is true when the operations do no commute.<br />
<br />
The conflict predicate tells when 2 XTNs conflict, and the inverse
allows an optimistic XTN to rollback.<br />
<br />
Obvious use-case: use this tool to write a transactional-version of
NBHM or other JDK concurrency utilities.<br />
<br />
<hr size="2" width="100%" /><br />
<strong>Hard Real-Time GC Scheduling</strong><br />
<br />
- Periodic Scheduling<br />
 - GC runs at highest priority, but periodically yields to mutator<br />
 - Metronome<br />
- Slack-based Scheduling<br />
 - GC runs at lowest priority<br />
 - Can be preempted at any time<br />
- Makinac (Sun RTS)<br />
- Work based Scheduling<br />
 - GC runs at allocation time<br />
 - Problem with allocation-rate jitter<br />
<br />
HRT - Deadlines must never be missed AND must be verifiable
analytically that no deadlines are missed.  Systems tend to therefore
be very simple, so that they can be proven.<br />
<br />
OVM - like Metronome.  Dynamic defrag; arraylets; Brooks style barrier;
replicating barrier/; incremental, concurrent, supporting slack-based
style scheduling.  <br />
<br />
<hr size="2" width="100%" /><br />
<strong>Understanding Performance of Interactive Applications</strong><br />
<br />
Typical profilers give the wrong numbers: they report total time spent,
not time spent between mouse-click and page updated.<br />
<br />
AWT/Swing &amp; SWT - similar: single-threaded, gui thread in "event
loop", relies heavily on Listener model.  So profile using gui
call-back Listeners between user events and the eventual display change.<br />
<br />
Really simple idea: profiling &amp; event visualizer tool between click
&amp; view times.<br />
<br />
<hr size="2" width="100%" /><strong>Program Metamorphsis</strong><br />
<br />
Trouble w/refactorings: must preserve program behavior with each step. 
But if we want multiple refactorings then at each step along the way
fixup code (needed to preserve program behavior) pollutes the code. 
Want to do multiple refactoring steps at once - while leaving the
program possibly broken in the in-between steps.<br />
<br />
So compute a program-semantics (names &amp; def-use chains), and let
the user make partial refactoring steps and then declare "I think I'm
done" - and the system compares the new program-semantics with the old,
and reports if they are not equal.  <br />
<br />
So actually comes up with primitive not-quite refactoring steps, which
he will compose to build larger "real" refactorings, or compose lots to
build a multi-stage refactoring.  <br />
<br />
<hr size="2" width="100%" /><strong><br />
Instruction Selectors</strong><br />
<br />
HotSpot C2 uses a BURS framework.  Very hard to optimize &amp; debug. 
So these guys have a semantics which can span both the ideal IR and
real machine instructions.  Describe both using "CISL" and the tool
does the mapping.<br />
<br />
Once the tool has a mapping, the user provides an adapter that converts
the mapping into the code needed by the compiler back-end.  This would
replace e.g. the ADLC.  Gives examples of machine encodings for PPC and
Java bytecodes.<br />
<br />
CISL is given an IR tree and produces a target tree with equivalent
semantics.<br />
<br />
Not sure it's entirely better than BURS (maybe no worse), but I'm still
sold on rewriting in e.g. C++/Java hand-written greedy match rules. 
These code-gen-gen tools are pain for long-term maintenance.<br />
<br />
<hr size="2" width="100%" /><strong><br />
Do Design &amp; Performance Relate?</strong><br />
<br />
Is fast code required to be ugly?  Is beautiful required to be slow?<br />
<br />
Pick 200 code metrics.  Also pick some performance metrics (cycles,
single threaded, objects created, etc).  Could not follow talk... or at
least he wasn't very clear with the objectives &amp; results (if any).  Interesting idea though.<br /><br />
<br />
</div></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/05/dacapo-trip-report.html</feedburner:origLink></entry>
  <entry>
    <title>Odds &amp; Ends</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/2akg9-qZ84w/odds-ends.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=65502891" title="Odds &amp; Ends" />
    <id>tag:typepad.com,2003:post-65502891</id>
    <issued>2009-04-15T10:29:14-07:00</issued>
    <modified>2009-04-15T18:36:45Z</modified>
    <created>2009-04-15T17:29:14Z</created>
    <summary>Various pending goodies... Load-Linked/Store-Conditional vs Compare-And-Swap Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!). There was a long period of debate in...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Various pending goodies...</p><h2><span style="text-decoration: underline;">Load-Linked/Store-Conditional vs Compare-And-Swap</span></h2><p><br />Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!).  There was a <a href="http://www.cs.huji.ac.il/course/2005/dist/p124-herlihy.pdf">long period of debate</a> in the CompSci community one what constituted the "minimal" needed instruction to do useful multi-CPU work.  Eventually the community has decided that the <a href="http://en.wikipedia.org/wiki/Compare-and-swap">Compare-And-Swap</a> (CAS) instruction and the <a href="http://en.wikipedia.org/wiki/Load-Linked/Store-Conditional">Load-Linked/Store-Conditional</a> (LL/SC) instruction combo both are (1) sufficient to do useful work ("infinite concensus number") and (2) relatively easy to implement in real hardware.  X86's, Sparc's and Azul's CPUs use CAS.  IBM's Power, Alpha, MIPS &amp; ARM6 all use LL/SC. ARM5 only has an atomic-swap.</p><p>LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms.  With LL/SC, you first Load-Linked a word.  The hardware marks the cache line as "linked".  You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter).  Finally you issue a Store-Conditional to the same address.  If the cache line is still "linked", the store happens.  If not, then not.  The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line.  Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented).  "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail.  I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure.  The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.</p><p>With CAS, you typical first load a word with a normal load, then manipulate it.  Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory.  CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic <a href="http://en.wikipedia.org/wiki/ABA_problem">ABA bug</a>.  In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless <strong>all</strong> copies of A die first - including the one loaded before the CAS.  Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS.  Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively.  With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link".  E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS).  </p><p>These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations.  Nothing in the academic literature requires any sort of global ordering on these atomic instructions.  Instead the usual academic assumption is that <strong>all </strong>memory operations are strongly ordered - which is obviously <strong>not</strong> true on all modern hardware.  Practitioners are required to insert memory fences as needed to achieve the desired ordering.  Nonetheless most <strong>implementations </strong>of CAS include a strong ordering: X86 and Sparc certainly do.  Azul's CAS does <strong>not</strong>, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations.  <em>(As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking.  Report back the %-tage of dropped counts and the throughput rate.  Then try it with an Atomic counter.  My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless).  </em>I am less familiar with the fence properties of common LL/SC implementations.  Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc?</p><p /><p /><p /><p /><h2><span style="text-decoration: underline;">build.java</span></h2><p><br /><a href="http://blogs.azulsystems.com/cliff/2008/01/i-hate-makefile.html">Some time ago I reported on a build tool</a> I've been using.  
Currently 'build' is being used to build HotSpot internally to Azul.  We've got it building 400+ files, plus build rules for building portions of the
JDK, compiling w/gcc &amp; javac and also 'javah', tar, jar, strip,
binary signing, etc.  In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc.  It's blazingly fast, it's easy to hack (being written in plain-olde Java).  I hunted around awhile and couldn't find a good place to dump a medium-large blob of code, so here's a sample build.java in 4 parts:</p><p><a href="http://snippets.dzone.com/posts/show/7206">Part 1</a><br /><a href="http://snippets.dzone.com/posts/show/7207">Part 2</a><br /><a href="http://snippets.dzone.com/posts/show/7208">Part 3</a><br /><a href="http://snippets.dzone.com/posts/show/7209">Part 4</a></p><p /><p><a href="http://snippets.dzone.com/posts/show/7209"><br /></a></p><p /><p /><h2><span style="text-decoration: underline;">A "HotSpot -server" aka <em>C2</em> compiler IR visualizer</span></h2><p>No I didn't do it!  <a href="http://ssw.jku.at/General/Staff/TW/">This deserving grad student did it</a>.</p><p>Here is the reference to the visualizer for the ideal graph of the 
HotSpot server compiler:
<br />
<br /><a class="moz-txt-link-freetext" href="http://ssw.jku.at/igv/master.jnlp">http://ssw.jku.at/igv/master.jnlp</a>
<br />
<br />It is a JNLP application hosted on a university server 
(<a class="moz-txt-link-freetext" href="http://ssw.jku.at/General/Staff/TW/">http://ssw.jku.at/General/Staff/TW/</a> has also a test file to 
download), but the easiest way to get a first impression. 
The source code of the tool is hosted on <a class="moz-txt-link-freetext" href="http://kenai.com/projects/igv">http://kenai.com/projects/igv</a> 
The server compiler instrumentation is part of OpenJDK and also 
included in Sun's weekly builds of Java 7.
<br />
</p><p /><p /><h2><span style="text-decoration: underline;">More C2 Goodies</span></h2><div style="margin-left: 40px;">
<em>John Cavazos wrote:
</em><br />
... One of the things I am 
currently looking
at is determining the right phase-ordering of optimizations applied to 
a particular program being optimized.  I have some nice (un-published) results for 
JikesRVM,
<br />
but it would be nice to replicate the research for HotSpot...
<br /><br /></div><p>I have strong arguments for nearly all orderings of our passes, so I'm 
curious to know about your phase-ordering results. 

<br />The only obvious transform we might be missing is PRE, but our peephole 
opts pretty nearly subsumes PRE (they are not mathematically equivalent 
- I can write programs where either PRE or the peephole stuff makes 
progress against the other).  In practice, PRE will find essentially 
nothing once the peephole opts are done.  You'll notice that we do the 
peephole pass alot; it's totally incremental and provably linear 
bounded.  In other words, if there's nothing to be gained then there's no 
cost in trying.  The peephole pass includes amongst other things 
all pessimistic forms of copy propagation, value equivalence, 
constant propagation (especially the null/not-null property), constant 
test folding, repeated test folding, dead code elimination, 
load-after-store opts (aliasing analysis is included for free during 
building of SSA form), algebraic properties, and lots more.
<br />
<br />
<br />For HotSpot, the optimization ordering is:
<br />
</p><ul>
<li>Build an inline tree, including class hierarchy analysis.  This is the 
one pass I'd be willing to move, as it happens too early.
</li>
<li>(a single unified pass:) parse bytecodes, inline, build SSA form (the 
IR remains in SSA form always), do peephole opts over SSA form.  This 
pass typically costs 40% of compile-time budget.
</li>
<li>Fixed-point the peephole opts over SSA form, once all backedges are 
known after parsing.
</li>
<li>Loop opts round 1: "beautify loops" (force polite nesting of 
ill-structured loops), build a loop-tree &amp; dominator tree, split-if 
(zipper-peel CFG diamonds with common tests, plus some local cloning 
where I can prove progress), peel loops (required to remove 
loop-invariant null checks)
</li>
<li>Fixed-point the peephole opts over SSA form
</li>
<li>Loop opts round 2: "beautify loops" (force polite nesting of 
ill-structured loops), build a loop-tree &amp; dominator tree, lock 
coarsening, split-if &amp; peel - but if these don't trigger because there's 
nothing to gain, the do iteration-splitting for range-check-elimination 
&amp; a 1st loop unrolling.
</li>
<li>Fixed-point the peephole opts over SSA form
</li>
<li>Conditional Constant Propagation (the optimistic kind, instead of the 
pessimistic kind done by the peephole pass)
</li>
<li>Iterate loop (split-if, peel, lock coarsen - but these typically never 
trigger again and take very little time to check), unrolling &amp; peephole 
passes, until loops are unrolled "enough".  On last pass, insert 
prefetches.  Typically this iterates once or twice, unless this is a 
microbenchmark and then unrolling might happen 8 or 16 times.
</li>
<li>Remove tail-duplication, and a bunch of other minor code-shaping 
optimizations e.g. absorb constant inputs into deoptimization-info in 
calls, or commuting Add ops so that 2-address machines can do 
update-in-place.
</li>
<li>Convert "ideal" IR into machine code IR.
</li>
<li>Build a real CFG for the 1st time, including a dominator tree, loop 
tree.  Populate with frequencies from earlier profiling.
</li>
<li>Global latency-aware (loop-structure-aware) scheduling.
</li>
<li>Replace null-checks with memory references where appropriate.
</li>
<li>Register allocate.  Internally this has many passes.  This pass 
typically costs 40% of compile-time budget.
</li>
<li>Sort basic blocks to get good control flow ordering (forward branches 
predicted not-taken, backwards predicted taken, etc)
</li>
<li>Some last-minute peephole opts.
</li>
<li>Emit code into a buffer, including OOP-maps &amp; deoptimization info.
</li>
</ul>
<p>















<br />
</p><p /><h2>Travel Plans</h2>
<p>I got too much travel coming up!</p><p>Apr 30, May 1st - DaCapo in Boston.  Favorite small group; GC &amp;
Java focused. Website:
<a class="moz-txt-link-freetext" href="http://www.cs.tufts.edu/research/dacapo/">http://www.cs.tufts.edu/research/dacapo/</a>.  I had to turn down the invite to an <a href="http://www.prog.uni-saarland.de/ssasem/">SSA seminar</a> in France because the dates conflicted.  Very sad.  I hope one of the attendees will post a trip report.</p><p>
May 11-May 15.  <a href="http://wg24.cs.uvic.ca/ContentWG24.shtml">IFIP WG2.4</a>  (International Federation of Information
Processors, Working Group 2.4 - a *really* old European group with a
random mix of industry &amp; academia.)  It's a nice group to preview
JavaOne talks, and always the meetings are a week-long rambling
discussion in <a href="http://www.parks.wa.gov/fortworden/">some quaint resort</a>.</p><p>
June 2-June 5.  <a href="http://java.sun.com/javaone/">JavaOne</a>.  I owe slides for 3 talks by end of this week.  'enuf said.</p><p>
July 6-10th - <a href="http://ecoop09.disi.unige.it/">ECOOP</a>, as a paid-for invited speaker.  A free vacation to
Italy!   :-)</p><p>Cliff</p><p /><p /><p /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/04/odds-ends.html</feedburner:origLink></entry>
  <entry>
    <title>YATR2YAC (yet another trip report to yet another conference)</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/Oc02LmcCVGU/yatr2yac-yet-another-trip-report-to-yet-another-conference.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=65099863" title="YATR2YAC (yet another trip report to yet another conference)" />
    <id>tag:typepad.com,2003:post-65099863</id>
    <issued>2009-04-05T10:40:35-07:00</issued>
    <modified>2009-04-05T20:24:58Z</modified>
    <created>2009-04-05T17:40:35Z</created>
    <summary>'Tis the Season to be Conferencing... Blah, I'd rather be hacking, but here goes... I went to EclipseCon, to CGO and back to EclipseCon. Here's how it can down: I got invited out of the blue to talk at EclipseCon...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>'Tis the Season to be Conferencing...<br />Blah, I'd rather be hacking, but here goes...   </p><p>I went to <a href="http://www.eclipsecon.org/"> EclipseCon</a>, to <a href="http://www.cgo.org/cgo2009/">CGO </a>and back to <a href="http://www.eclipsecon.org/">EclipseCon</a>.  Here's how it can down: I got invited out of the blue to talk at EclipseCon by <a href="http://wiki.eclipse.org/Darin_Wright">Darin Wright</a>, who had seen me speak earlier at the <a href="http://openjdk.java.net/projects/mlvm/jvmlangsummit/">JVM Language Summit</a>.  Like a dummy I accepted it instantly instead of checking my calander.  Naturally it conflicts with CGO - and I was on the CGO PC this year, so I felt obligated to attend CGO as well.  Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) - so I scheduled a 1-day only trip to Seattle.  Monday I wandered EclispeCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths.  The flights on Alaska Air were smooth and easy.  No WiFi at CGO - the Hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?).  (<a href="http://blogs.azulsystems.com/cliff/2009/03/once-again-i-put-down-my-hackers-keyboard-to-bring-you-this-trip-report.html">Remember my grief over Hotel WiFi in the last blog?</a>)  So over lunch break (which was Chinese food and I don't like Asian food of any kind) I headed over to Starbucks.  Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT - provided you register $5 on a Starbucks card with them.  Yes!!!  Suddenly I have a $5/month nation-wide WiFi plan!  I snacked on some salad'y thing, drank my triple-latte, and happily Surfed the Internet until show-time at CGO came 'round again.</p><p>1st up: EclispeCon, mostly cause I got nothing to say.  These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer).  Makes me wonder if the cure is really better than the disease. I gave my talk, but it was too much talk in too little a time window.  Both me and the attendees would have been better off if I could have spent more time on the details.  <em><span style="font-family: Arial;">C'est la vie</span></em><span style="font-size: 13px; font-family: Arial;">.  Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend.  Also the staff was exceptional friendly and easy to work with.<br /></span></p><p>Next up: CGO.  Despite my 1 day fly-in-fly-out I got a lot more out of CGO.  </p><p>----------------------</p><p><strong>Keynote: Vikram Adve</strong></p><p>Missed!  Arrived barely in time for the last 10 minutes.  That's the penalty for attempting a 1-day fly-by.  I hear he said something good about Azul systems (he did ping me the day before for some details).  Basically it's a Call To Arms for improving C compilers and C runtimes.  He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!).</p><p>----------------------</p><p><strong>Revisiting Out-Of-SSA Translation</strong></p><pre style="font-family: Arial;" wrap=""><strong>Cliff Notes: </strong>HotSpot "comes out of" SSA form during register allocation, where it really preserves SSA form throughout the entire compile process, but adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... <strong>but there's a key new piece</strong>.<br /><br />Linear-time colaescing congruence classes; speed &amp; memory footprints.<br />Why is coming out of SSA hard? (<strong>Cliff Notes:</strong> It's not!)<br /><br />And now they show what HS has been using for years: insert copies as needed for a simple out-of-SSA translation, then aggressively coalesce away the extra copies. <br /><br />Nice diagrams for IFGs + copy-affinities.<br /><br />Notices the value-equivalence property for defining IFGs (also done by HS).<br /><br /><strong>Ok: here's what's new compared to HotSpot -server:</strong><br /><br />Fast interference test w/out building an IFG!<br />Only need to test against closest dominating var?<br />So scan the dom tree, with a stack-based DFS traverse of dom tree. <br />Uses 2 sorted lists (?); more complexity for value-based IFG tests.<br />But no building (and rebuilding?) the quadratic IFG.<br />Claim 2x faster than Sreedhar for coming out of SSA.<br />Big memory footprint reduction, because no need for live-ness sets &amp; IFG!!!<br /><br />40% of HS's total JIT budget is in reg-alloc! A faster reg-alloc is definitely interesting. A large part of that 40% is building the liveness &amp; IFG sets<br /></pre><p style="font-family: Arial;">----------------------</p><pre wrap=""><strong><span style="font-family: Arial;">Wave Propagation and Deep Propagation for Pointer Analysis</span></strong><br /><br /></pre><div style="text-align: left;"><span style="font-family: Arial;">Wave Prop is very fast and uses huge memory (working on parallel version)</span><br /><span style="font-family: Arial;">Deep prop is faster for small &amp; precise memory sets - works better</span> <span style="font-family: Arial;">on desktop (and JIT?)</span><br /><br /><span style="font-family: Arial;">There's a zoo of ptr-analysis names:</span><br /><span style="font-family: Arial;">- context-sensitive or not</span><br /><span style="font-family: Arial;">- flow-sensitive or not</span><br /><span style="font-family: Arial;">- field-sensitive or not</span><br /><span style="font-family: Arial;">- inclusion-based (Andersen-based) or unification-based (Steesgaard-based)</span><br /><br /><span style="font-family: Arial;"><strong>Cliff Notes</strong>: HotSpot -server uses context-INsenstive,</span> <span style="font-family: Arial;">flow-INsensitive, field-Sensitive, unification-based</span><br /><br /><span style="font-family: Arial;">Build a graph of constraints, and do contraint-propagation.</span><br /><span style="font-family: Arial;">Code: "a = &amp;b;" Constraint: "a superset {b}" - means b element-of P(a)</span><br /><span style="font-family: Arial;">Code "a = b" constaint: "a superset b"- means P(b) contained in P(a)</span><br /><span style="font-family: Arial;">Code "a = *b" or "*a = b" - complex constraint, must add edges to the</span> <span style="font-family: Arial;">graph as constraint propagation adds new sets to the nodes.</span><br /><br /><span style="font-family: Arial;">Now build a graph: nodes are P(a) - sets of things pted-at by 'a', and</span><span style="font-family: Arial;"> edges represent subsets.</span><br /><br /><span style="font-family: Arial;">Cycles are common; collapsing them is crucial for performance. All</span> <span style="font-family: Arial;">things in the cycle must be equal. Uses Nuutila to find strongly</span> <span style="font-family: Arial;">connected components in a single pass (so does HotSpot - Vick's algorithm).</span>  <span style="font-family: Arial;">After collapsing cycles, you get a DAG - and Nuutila also produces a</span> <span style="font-family: Arial;">topo-sort as output.</span><br /><br /><span style="font-family: Arial;">(Still looks like a complex diff/delta propagation algorithm - but</span> <span style="font-family: Arial;">makes me wonder if standard bit-vector stuff is faster).</span><br /><br /><span style="font-family: Arial;">Not a bad-looking ptr-analysis. Might work in a JIT.</span><br /><span style="font-family: Arial;">Manesh suggests using BDD's.</span><br /><br /><span style="font-family: Arial;"><strong>Cliff Notes: </strong>Real question of course, is if all the extra alias</span> <span style="font-family: Arial;">precision actually buys you any performance. PLDI paper of a few</span> <span style="font-family: Arial;">years ago suggestions otherwise: that HS's style of alias-analysis</span> <span style="font-family: Arial;">gets you 95% of all the possible gain. </span><br /><br />----------------------<br /><br /><div style="text-align: left;"><strong><span style="font-family: Arial;">Fast Track: A Software System for Speculative Program Optimization</span></strong><br /><br /><span style="font-family: Arial;">Chen Ding: another process-level fork/join speculative program</span> <span style="font-family: Arial;">optimization (contrast this to <a href="http://www.cs.umass.edu/%7Etingy/publications/grace.ps">Emery Burger's "Grace"</a> system). Gives</span> <span style="font-family: Arial;">sequential program semantics. Add annotations to "suggest" fork/join</span><br /><span style="font-family: Arial;">points. Actually fork a process, use COW &amp; <a href="http://www.rt.com/man/mmap.2.html">mmap</a> to guarantee</span> <span style="font-family: Arial;">sequential semantics in the face of races (end speculation, or force</span> <span style="font-family: Arial;">commit-ordering on speculative threads, etc).</span><br /><br /><span style="font-family: Arial;"><em>New Idea</em>: User can write an alternative implementation of some module;</span> <span style="font-family: Arial;">run both versions in parallel; keep the faster version &amp; kill the</span> <span style="font-family: Arial;">slower version:</span> <span style="font-family: Arial;">"return do_whichever_is_faster( methodX(), fast_methodX() )"</span><br /><br /><span style="font-family: Arial;"><em>New Idea</em>: "fast_methodX" might be speculative and *<strong>wrong</strong>*. So run slow-version to</span> <span style="font-family: Arial;">completion and do error checking after the fact. But let the fast</span> <span style="font-family: Arial;">version run ahead (and if it's faster), it can spawn off new</span> <span style="font-family: Arial;">speculative fast versions. When slow version completes, use</span> <span style="font-family: Arial;">page-level protections to know what's been changed, and verify the</span> <span style="font-family: Arial;">correct semantics of the fast version. Run all the correctness checks</span> <span style="font-family: Arial;">in parallel with all the normal fast-case stuff. Needs some careful</span> <span style="font-family: Arial;">throttling.</span><br /><br /><span style="font-family: Arial;">Example using gcc's "<a href="http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging">mudflap</a>". Array under/overflow, leaks, un-init'd</span> <span style="font-family: Arial;">memory, etc. Run 2 versions of all code: the existing code is "fast</span> <span style="font-family: Arial;">track" and the mudflap version is the "slow correct track". Goal:</span> <span style="font-family: Arial;">same speed as the "fast track" code with all the safety checks of the</span> <span style="font-family: Arial;">"slow track using mudflap". In some cases can get real nice speedups.</span><br /><br /><span style="font-family: Arial;"><strong>Cliff Notes: </strong>can we turn on mudflag for Azul?</span><br /></div><pre style="font-family: Arial;" wrap="">----------------------<br /></pre><div style="text-align: left;"><strong><span style="font-family: Arial;">Scenario Based Optimization</span></strong><br /><br /><span style="font-family: Arial;">Profile the hot methods. Tweak various parameters (e.g. loop</span> <span style="font-family: Arial;">unrolling, or cache-tiling parameters). Profile both old and new</span> <span style="font-family: Arial;">versions. Pick the best winner, and stick with it for awhile. Keep</span> <span style="font-family: Arial;">profiling continously, and switching code as needed.  Some JVM JITs do this.  HotSpot only sorta does.</span><br /><br /><span style="font-family: Arial;">The programmer has to pick what the new &amp; old versions are, and the</span> <span style="font-family: Arial;">runtime picks the fastest one (no support for correctness - what if</span> <span style="font-family: Arial;">one of the versions is busted? Or tickles a timing bug which breaks</span> <span style="font-family: Arial;">things?)</span><br /><br /><span style="font-family: Arial;">Seeing some nice gains, some of the time. Avoids losing when turning</span> <span style="font-family: Arial;">on some aggressive optimizations. But generally unrealistic.</span><br /><br /><pre style="font-family: Arial;" wrap="">----------------------<br /></pre>
<div style="text-align: left;"><strong><span style="font-family: Arial;" /></strong><strong><span style="font-family: Arial;">Software Pipelined Execution of Stream Programs on GPUs</span></strong><br /></div><br /><span style="font-family: Arial;">Basically <a href="http://www.cag.lcs.mit.edu/streamit/">StreamIt</a> on a GPU</span><br /><br /><span style="font-family: Arial;">StreamIt is much easier to use than e.g. CUDA or OpenCL.</span><br /><br /><span style="font-family: Arial;">StreamIt model is a collection of connected serial parts; connected</span> <span style="font-family: Arial;">via pipes. Due to splits &amp; joins, might need to exec some parts more</span> <span style="font-family: Arial;">often than than others (e.g. part Y "eats" 2 ouputs from part "X"</span> <span style="font-family: Arial;">before outputing a result itself). Get complex patterns of execution;</span> <span style="font-family: Arial;">difficult to map down to GPGPUs. Have bounded-buffer problems: need</span> <span style="font-family: Arial;">to map to a steady-state schedule using a fixed count of buffers.</span> <span style="font-family: Arial;">Then can arrange to do no allocation during execution.</span><br /><br /><span style="font-family: Arial;">- Profile each filter standalone, to figure out it's work load.</span><br /><span style="font-family: Arial;">- Configure selection</span><br /><span style="font-family: Arial;">- Add constraints </span><br /><span style="font-family: Arial;">- Use CPLEX to solve</span><br /><span style="font-family: Arial;">- Output CUDA code.</span><br /><br /><span style="font-family: Arial;">Looks like a classic software-pipeline, just done on a GPGPU. Compute</span> <span style="font-family: Arial;">minimum initialization interval, compile resource constraints, etc.</span> <span style="font-family: Arial;">Have issues with data-layout, so can do parallel data access.</span><br /><br /><span style="font-family: Arial;">Impressive compile times (30sec to 178sec).</span><br /><span style="font-family: Arial;">Can get pretty good utilization on some kernels, so seeing 100x speedups.</span><br /><br />Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt's model - then StreamIt is the fastest language for you!  If what you got doesn't look like StreamIt, then Too Bad.<br /><br /><pre style="font-family: Arial;" wrap="">----------------------</pre><strong><span style="font-family: Arial;">Stream Compilation for Real-Time Embedded Systems</span></strong><br /><br /><span style="font-family: Arial;">StreamIt Again, but for Cell.</span><br /><span style="font-family: Arial;">New constraint: out of memory per cell (distributed memory).</span><br /><span style="font-family: Arial;">Different from GPU - GPU has shared memory, but very slow if doing</span> <span style="font-family: Arial;">bank-conflicting access - but can pile on more memory (more buffers)</span> <span style="font-family: Arial;">for some GPUs and less for others. Not so for Cell - must have a </span><br /><span style="font-family: Arial;">uniform split of memory.</span><br /><br /><br />Thanks,<br />Cliff<br /><br />PS - Thanks Doug for inspiring this blog<br /><br /><br /></div></div><p><span style="font-size: 13px; font-family: Arial;" /></p></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/04/yatr2yac-yet-another-trip-report-to-yet-another-conference.html</feedburner:origLink></entry>
  <entry>
    <title>Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/CBBsgyTIa_Q/once-again-i-put-down-my-hackers-keyboard-to-bring-you-this-trip-report.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=64380481" title="Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report" />
    <id>tag:typepad.com,2003:post-64380481</id>
    <issued>2009-03-19T14:59:09-07:00</issued>
    <modified>2009-03-21T18:14:33Z</modified>
    <created>2009-03-19T21:59:09Z</created>
    <summary>Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report... sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene. I went to DC for the...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report... sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene.  I went to DC for the ASPLOS and VEE conferences and all you get is this lousy blog.</p>
<div class="moz-text-html" lang="x-western">



<br />
<strongig><strong>2009 ASPLOS &amp; VEE Conferences</strong><br />
<br />
<hr size="2" width="100%" /><strongig><br />
ASPLOS &amp; VEE are co-located in D.C. this year.  Mostly the trip
itself was uneventful.  I was very busy either attending the conference
sessions or preparing for my own talk and I didn't get around much. 
The flights themselves were mostly uneventful (ComAir's HQ in the
midwest lost power from a tornado, so I was stuck on the tarmac at JFK
for an hour waiting to get permission to fly into DC - first time I've
been delayed by bad weather when the weather between the source and
destination airports was perfect).  The most notable point was that the
hotel used WayPort as their internet access.  WayPort sucks horribly;
I've used them in the past and swore I'd never use them again - but I
forgot to ask when checking in to the hotel!  WayPort charges
$13/night, and service is filtered through some horrible layout routing
through TimBuk Too - average ping times to hit a server in the DC area
was about 600msec (ping times from AT&amp;T wireless in the conference
room were ~50ms).  To add insult to injury, some other conference
attendees stayed at a nearby hotel - which was cheaper and had free
WiFi AND the free stuff was faster!  Definitely having WayPort as your
WiFi provider is sufficient for me to change hotels.<br />
<br />
<strong>Cliff Notes: </strong>I had a wonderful 1.5-long 1-on-1 talk with Monica
Lam (<a href="http://en.wikipedia.org/wiki/Monica_S._Lam">http://en.wikipedia.org/wiki/Monica_S._Lam</a>);
she didn't know what Azul does.<br />
<strong>Cliff Notes: </strong>More interesting talks with Jon Shapiro (<a href="http://www.eros-os.com">http://www.eros-os.com</a>, who's
busy implementing a high-level low-level language for high-availability
systems).<br />
<br />
Reality Check: yes again a Mac failed to be able to output to the
projector, and we all waited while the presenter hotswapped to a PC<br />
<br />
Caveat Emptor: As usual for these talks, I only attend some sessions, I
write
stream-of-consciousness reports which I only lightly edit and the
reports tend to be brutal and short.  They are roughly sorted here by
my interest level, and <strong>not</strong> by e.g. time order.<br />
<br />
<span style="text-decoration: underline;"><strong>General Cool Stuff:</strong></span><br />
<strong><a href="#keynote_lam">Keynote: Monica Lam's Vision of the Future of
Human / Computers Interaction</a></strong><br />
<a href="#wrong"><strong>Getting Wrong Answers Without Doing
Anything Wrong</strong></a> - <a href="http://en.wikipedia.org/wiki/All_your_base_are_belong_to_us">all
your base are belong to us</a><br />
<a href="#par_text"><strong>Text Processing in Parallel in a Register</strong></a>
- wicked fast<br />
<a href="#nanoscale"><strong>Nanoscale Computing</strong></a> - <a href="http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky">wicked scary</a><br />
<a href="#high_low"><strong>Demystifying High-level Low-level Programming</strong></a>
- how to write Java-in-Java<br />
<a href="#asymmetric"><strong>Asymmetric Chip Multiprocessor</strong></a> - 1st
good use I've seen for these<br />
<a href="#leaks"><strong>Leak Pruning</strong></a> - Reported on before, but
still good<br />
<a href="#TRIPS"><strong>TRIPS</strong></a> - Funky dataflow/VLIW hybrid (but not
the weirdest I've recently seen!), too bad it didn't pan out<br />
<br />
<span style="text-decoration: underline;"><strong>HTM Talks</strong></span><br />
Which I read but due to a snafu failed to attend the talks.   :-(<strong><br />
<a href="#min_htm">Maximum Benefit from Minimum HTM</a></strong> - not so
minimum (bigger than Azul's HTM!) and lousy results<br />
<a href="#rock"><strong>Rock's HTM</strong></a> - a peek at Sun's experience with
Rock's HTM; mostly Rock's HTM is really really weak<br />
<br />
<span style="text-decoration: underline;"><strong>Multithreaded Bug Finders</strong></span><br />
<a href="#CTrigger"><strong>CTrigger</strong></a> - The Winner in this section:
cleverly induced
sleeps trigger race bugs reliably<br />
<strong><a href="#Anomaly">Anomaly-based Bug Detection &amp;
Prediction</a></strong> - It's an, ahhhh, <em>interesting</em> hack<br />
<strong><a href="#isolator">Isolator</a></strong> - silently prevent race bugs<br />
<br />
<span style="text-decoration: underline;"><strong>Making Programs Deterministic</strong></span><br />
<strong>Cliff Notes: </strong>Major flaw in the thinking here: Real Programs
have Real Random Inputs.  Unless you capture &amp; replay ALL external
events, minor things like network-packet-arrival-order and the chaotic
butterfly effect (<a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Butterfly_effect">http://en.wikipedia.org/wiki/Butterfly_effect</a>)
ruin
your determinism.  So I think all these papers are tilting at windmills.<br />
<a href="i#kendo"><strong>Kendo</strong></a> - Stock hardware, but requires a
race-free program<br />
<a href="#replay"><strong>Deterministic Replay</strong></a> - Goes only 1/2-way
to preventing the Butterfly Effect<br />
<a href="#DMP"><strong>DMP - Deterministic Shared Multiprocessing</strong></a> -
No replay attempt, just full-speed determinism (assuming no Butterflies)<br />
<br />
<span style="text-decoration: underline;"><strong>Other Stuff</strong></span><br />
Talks I attended but aren't otherwise interested in...<br />
<a href="#buddies"><strong>Memory Buddies</strong></a> - Yet Another Use for
Bloom Filters<br />
<a href="#livemigration"><strong>Live VM Migration</strong></a> - pre-copy push
still beats post-copy pull<br />
<a href="#validate"><strong>Validate</strong></a> - Microsoft patches Are Bad (tm)<br />
<a href="#raytracing"><strong>Raytracing vs GPUs</strong></a> - Doom!<br />
<a href="#commute"><strong>Commutativity Analysis for Software
Parallelization</strong></a> - <a href="http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf">the
original work</a> is WAY more interesting than
this follow-on<br />
<a href="#gcyield"><strong>Predicting Yield of GCs</strong></a> - Yea Olde Hat
Trick<br />
<a href="#phatombtb"><strong>Phantom BTB</strong></a> - winner of the Most
Unrealistic Hardware award<br />
<strong><br />
</strong><strong><br />
</strong>
<hr size="2" width="100%" /><strongig><strong><a name="keynote_lam" />Keynote: Monica
Lam</strong><br />
<br />
Why do computers break?  When we buy something, we expect it to work. 
If it breaks, you buy another one.  But with computers they break on
their own (e.g. u$oft update) or you can break it doing "normal" stuff
- and if it breaks it doesn't help to buy another one - the new one is
broken already.<br />
<br />
Virtual Desktop - idea: keep the virtual machine in the datacenter, and
do remote login.  Does not work well - requires WiFi everywhere.<br />
<br />
With each new comp generation, 10x cheaper &amp; 10x more users.<br />
Mainframe-&gt; mini -&gt; worksta -&gt; PC -&gt; laptop -&gt; cell<br />
Also see 3 industries: internet, media/TV, telephone<br />
Also see 3 networks: cable/satillite, broadband/phone, wifi/wimax<br />
Also see 3 computing models: cell (always on, pocket), PC (rich legacy,
offline exec, locality of reference), Cloud (device independent,
central management).  <br />
Things like GoogleGears or Virtual Desktop are trying to split the
Cloud/PC model.<br />
<br />
Guess where we are going?<br />
<br />
1- (Man) a phone-y thing: key, cache, window into digital cloud, ID,
personality, asserts, internet access.  Small, always on.  Can be lost
easily &amp; replaced easily.  No <em>important </em>state, but many
Gigs of flash cache.  No real compute power.  Small visual display
&amp; keyboard.<br />
<br />
2- (Earth) Can (1) plug into a PC.  Get big screen &amp; keyboard, get
fast processor.  Get hot DRAM, teribyte disk drive.<br />
<br />
3- (Heaven) Backs onto the internet/cloud.<br />
<br />
<br />
This model has some different security issues.  DataCenter is probably
secure (or can be?)  You hold key in your hands.  Can we make path from
hand-held-key to datacenter secure?<br />
<br />
Trends - green, Bring-Your-Own-PC, Mac-vs-Windows.  End of Corporate PC
model.  80% of people use personal PC for work; 45% of people use work
PC for personal use.  25% of workers change security settings, etc.<br />
<br />
SO... want: {User, Ubiquity, IT}  Separation of Concerns.<br />
<br />
x86 virtualization seperates: hardware from VM; hardware from OS;
different hardware configs; different user roles<br />
<br />
MokiaFive company - Desktop-As-A-Service.  LivePC images "in the
cloud".  Managed on central server.  Can go up to any mac/pc and run
"your desktop" from a USB stick. USB stick/phone is a local cache only
- backed up on the cloud. Think of USB memory as a "network
accelerator".<br />
<br />
Nice live demo.  From USB stick to a selection of PC desktops is not
very long (+2 mins for a new PC to install software from stick).  No
hotel wifi?  Claims to be sync'ing with cloud??? <br />
<br />
Nice integrity model: basically re-install the OS clean each time on a
fresh VM-ware layer with anti-viral.  Intent is to clean out root-kits
with each fresh login to machine. <br />
<br />
Graphics is hard: many games are wired to DirectX, so losing much 3D
performance.  Trying to bypass all the virtualization layers for
DirectX.  But... guest &amp; host have different address spaces; but
graphics has lots of pointers; fast pointer munging is hard.<br />
<br />
Other services: LDAP, revocation, offline access, logging, dlp.
Encyption, Storage Virtualization: reju, incremental &amp; atomic
restore; secure access.  Uniquification, single-sign-on.<br />
<br />
Seperate-of-Concerns Principles - 90% of code handles exceptions (only
10% does the normal stuff).  "Re-Install the OS" as a root-kit
prevention.  e.g. system is structured as a fresh-OS-install because of
the rare root-kit issue.<br />
<br />
Bootstrap from a small trusted base: baremetal VM; LivePC player; 
IT-managed base image (Windows+AntiVirus); finally user-apps &amp; data<br />
<br />
Like the Azul philosophy about GC - make the hard case (FullGC,
infection from rootkit) common and then get it right - then all else
follows.<br />
<br />
Need to keep users ownership of own data - avoid the Big Brother
mentality.  Give users a choice on where data is kept.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="wrong" />Getting Wrong Answers
Without Doing
Anything Wrong</strong><br />
<br />
Variations in unix env variable size, can see 15% variation in
performance!  Running gcc -O2 vs gcc-O3, 400.perlbench.  Bunch of runs
at each setting very repeatable per-run.  But with different env sizes
see huge variation on 400.perlbench &amp; 'ibm' benchmark.  <br />
<br />
<strong>Conclusion: setting unix env variables can change your conclusions.</strong><br />
<br />
Also, sorting object files on the link-lines differently changes layout
in memory, and in I-cache layout.<br />
<br />
<strong>Conclusion: linking with 'ld *.o' is different from 'ld a.o b.o'.</strong><br />
<br />
Bias holds true even if looked at a large benchmark suite, or across
different CPU #'s from Intel.<br />
<br />
See a prior paper (which I now cannot find!!!) which studies this
carefully.  Note that there are a bunch of JVM/Java benchmark issues
here that also have been written up recently. <br />
<br />
<strong>Cliff Notes: </strong>I've dealt with both of these in the past; so has
Mark Moir (Sun) and Bean Anderson (SGI).  I discovered the need for I-cache layout while working for
Motorola on PPC chips (somebody else actually did the implementation). 
The results gave us an average 10% performance boost on a bunch of
benchmarks, plus removed a huge source of variation.  Later at Azul,
with our shared low-associativity caches we discovered that stack
placement was crucial to avoid conflicts, fixed with a little random
offset to each stack base.<br />
<br />
<strong>Cliff Notes: </strong>What are other large systemic biases?  Are these
two it?  And obvious question for a JIT - should I start doing I-cache
layout games for performance?  How about making copies of an existing
hot method and running both old &amp; new and doing sampling, and pick
the winner?<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="par_text" />Text Processing in
Parallel in a Register </strong><br />
<br />
<strong>Cliff Notes: this technique is insanely fast</strong><br />
Like SIMD, but char/bytes within a standard GPR.  So 4 or 8 chars in a
GPR?<br />
<br />
Doubling: field width from n-bit to 2n-bit <br />
Inductive: repeated doubling.<br />
<br />
Parallel bit streams; 1-to-1 correspondence to byte stream.  Vastly
faster XML parsing; better cache locality; better branch prediction.<br />
<br />
So: take 1st bit from each byte, and jam 64 bytes (bit #0 only) into 1
GPR.  Take 2nd bit into 2nd GPR, etc.  Basically, load 64 bytes into 8
GPRs, but bit-striped.  Need architecture support for bit-striping
pack/unpack.<br />
<br />
Now break up desired operations (e.g., compare for lexical token
boundaries) into bit-field operations.  Can do compares of 64 bytes for
tokens by simple ops on 8 GPRs.<br />
<br />
<strong>Cliff Notes: </strong>really lousy presentation.  Hope the paper is
better. Guesswork: take this string "java/lang/Object(Crunkify)" and
find the '(' tokens.<br />
<br />
  GPR #0 - xxxxxxxx - bit 0 from all bytes<br />
  GPR #1 - xxxxxxxx - bit 1 from all bytes<br />
  GPR #2 - xxxxxxxx - bit 2 from all bytes<br />
  GPR #3 - xxxxxxxx - bit 3 from all bytes<br />
<br />
My turn for a lousy presentation: Look at bits for '('.  Want some
combo of 0's &amp; 1's from the GPRs.  So e.g. XOR w/-1 all GPRs
needing 0's - now need them as 1's. Then do 7 AND's (parallel log tree
AND).  Any bits still set represent the '('s.  Nicely wide-issues per
64-bytes.  Expect to compare 64-bytes for a single char in maybe 6-8
clocks!?!?!?!  Plus evil pack/unpack issues/costs.<br />
<br />
Wrote some code to demo finding '(' in a long string using maybe 1/10th
clock per character, plus whatever the packing costs are (which
probably dominate!).  In this code, 'b0' through 'b7' are 64-bit values
holding bits 0 (through 7) for 64 chars in turn.  Here's the
test-64-chars-for-paren check:<br />
<code>      // The open-paren char, '(', hex 0x28:<br />
      //           0     0    1     0    1     0     0     0<br />
      long res = ~b7 &amp; ~b6 &amp; b5 &amp; ~b4 &amp; b3 &amp; ~b2
&amp; ~b1 &amp; ~b0;<br />
      // A non-zero res means a '(' is found, and the bit-number is<br />
      // the index into the 64-byte array where it was found.</code><br />
<br />
See attached source code for TextScan.java<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="nanoscale" />Nanoscale
Computing </strong><br />
128 bytes of ram; 5micrometers in area<br />
<br />
Smaller version: 8 bytes of ram; 1.5micrometers in area; similar in
size to largest virus.  Chemical sensors report results by directly
changing bits in the instruction stream (e.g., flipping a bit changes a
JMP to a NOP, in effect making a conditional branch).  Can detect
adequately running at 100Hz (<strong>not</strong> 100MHz).  Can write small
simple test/loop codes in this form factor.  Can e.g. count molecule
rates or virus rates in a solution (injected in bloodstream).  <br />
<br />
<br />
<strong>Cliff Notes: </strong>Very Very Scary.  See Vernor Vinge's dystopia "A
Deepness in the Sky". 
<a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky">http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky</a><br />
<a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Smartdust">http://en.wikipedia.org/wiki/Smartdust</a><br />
<br />
<hr size="2" width="100%" /><a name="high_low" /><strong>Demystifying High-level Low-level Programming</strong><br />
<br />
<span style="font-size: 14px; font-family: Arial;">Goal: Abstraction w/out guilt.  Separation of concerns.  Nice
historical perspective.  Issues go back as far as 1975 - using C for
writing an OS was a big issue back then.  So it's time for the modern
revolution of the same thing.  JVMs for OS's.  </span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Various C support tools: coding styles; Purify; idioms; tools
(Valgrind).  Also Systems-languages - BLISS, Modula-3, Cyclone, etc Or
2 languages: JNI+Java JikesRVM extends Java, also OVM, Moxie, DRLVM, C#
(Bartok).  Usually in context of runtime research.  This talk: Jikes,
Java-in-Java.</span><br />
<br />
<span style="font-size: 14px; font-family: Arial;">Containment:</span><br /><span style="font-size: 14px; font-family: Arial;">
- minimize exposure to low-level coding</span><br /><span style="font-size: 14px; font-family: Arial;">
- extensible; requires change fast, but language changes slowly</span><br />
<span style="font-size: 14px; font-family: Arial;">Encapsulate: </span><span style="font-size: 14px; font-family: Arial;">contained low-level semantics.</span><br />
<span style="font-size: 14px; font-family: Arial;">Fine-grained; </span><span style="font-size: 14px; font-family: Arial;">separation of concerns.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
"Address" is Magic, like unsafe.</span><br /><span style="font-size: 14px; font-family: Arial;">
"Address foo; int x = foo.loadInt(offset)"</span><br /><span style="font-size: 14px; font-family: Arial;">
More types for "void*" in Java, or generic OOP pointers.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Intrinsics: contract between you &amp; the compiler; at the Java level
it has an empty implementation.  But the compiler will always insert
the correct code for the e.g. prefetch intrinsic.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Also annotations to do e.g., turn off bounds-checks or null-checks
(performance hints, but break safety).</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Scoped semantic regions - scoped rule changes, e.g. "NoNew" - no
allocation in this scope (HotSpot uses this technique </span><span style="font-size: 14px; font-family: Arial;">extensively</span><span style="font-size: 14px; font-family: Arial;">),
or "NoBoundsChecks" in this scope, etc.  Unboxed is another annotation
- no class header, no v-table,  basically a C struct.  Also can do
direct field layout.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
@RawStorage - basic static C memory layout.  Backed by native width raw
storage, only accessed by native-width C intrinsics.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Can both compile "fast" - down to raw bits, and "virtual" - to some
virtual implementations allowing full GC on full simulation.</span><br />
<br />

<hr size="2" width="100%" /><strongig><strong><strongig><a name="asymmetric" /><strong>Asymmetric Chip
Multiprocessor</strong><br />
<br />
</strongig></strong><span style="font-size: 14px; font-family: Arial;">Idea: instead of e.g. 16 cores on a die, have 12 for normal work and 1
"fat" core for hot contended locks.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Cross grid of threads vs open-tables in MySQL. (looks like ref-counting
of MySQL tables).  Blah - implementation looks alot of NmethodBuckets
in HotSpot.  2-d grid.  Updates are complex &amp; require complex
locking.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Look at some locking traces.  If we could accelerate the critical
sections then we get a Amdhal's Law approach - removes the serial
portion.</span><br />
<br /><span style="font-size: 14px; font-family: Arial;">
Idea: switch to the one fast CPU on a contended lock.</span><br /><span style="font-size: 14px; font-family: Arial;">
Seeing speedups of 50% or so for 12-16 cpu system, where program was
seeing e.g. 5x speedup and now gets e.g. 8x speedup.  One of the
benefits: locks sections always executed by same CPU, so caches remain
'hot' for hot-lock code.</span><br />
<br />
<span style="font-size: 14px; font-family: Arial;">Cliff Notes: This is the first real use of Asymmetric
Multiprocessors that I believe in.</span><strongig><strong><strongig><br />
<br />
</strongig></strong><hr size="2" width="100%" /><strongig><strong><a name="leaks" />Leak Pruning </strong><br />
<br />
See it &amp; reported it before.  Survive OutOfMemory &amp; leaks by
reclaiming predicted dead objects.  Used a simple X86 read-barrier w/6%
slowdown.  Hoping to see new slowdown numbers for the read-barrier: see
3% slowdown on Core2 &amp; 5% on Pentium4.  Using 2 bits per reference;
bit 0 is for stale-refs; bit 1 is for pruned refs.<br />
<br />
So instead of throwing an OOM, look thru heap and see if find a
dominate object.  Assume dominate object is dead &amp; leaked; "Cut"
link to object and replace with a poisoned pointer.  Touching this
pointer<br />
throws the original OOM.  But if never touch object, then never throw
OOM.  Cut links allows dead object (and his transitive closure) to be
reclaimed.<br />
<br />
Try to predict roots of dead datastructures.  So profile max time
between last use (using something like GC age bits, but exponential
counter bits).  Also need to track which refs are keeping most bytes
alive.  <br />
<br />
<tt>HashEntry -&gt; PreparedStatement -&gt; ParserInfo -&gt; byte[][]
-&gt; byte[]<br />
                               -&gt; ResultSet -&gt; Field[]<br />
</tt><br />
<em>Compute during Phase 1:</em><br />
Sometimes the HashTable grows, and PreparedStatement gets touched.  So
max-stale-then-used for HashTable-&gt; PreparedStatement is 16-32 GC
cycles.  But max-stale-then-used for  PreparedStatement-&gt; ParserInfo
is 0-1 GC cycles.<br />
<br />
<em>Compute during Phase 2:</em><br />
Count bytes kept alive for each type of thing that has a low MSTU
number.  See PPS-&gt;ParserInfo has 420M kept alive.  So kills all
edges from old PPS-&gt;ParserInfo (but not young ones, because MSTU is
kept per object?  I now think he keeps MSTU per-class, so no bits in
object but can suffer from common types like Object[] being used both
in leaks and in hot code).  Don't worry about DAG sharing - ends up
blaming the leak on 1 ref when it's really kept alive by several refs;
just means on 1st go-around you prune one edge - but leak remains
alive.  On 2nd go-around all the leak is blamed on object type#2 and
pruned then.<br />
<br />
Results are decent - keeping some complex real apps alive essentially
forever.  Performance is decent, but near OOM conditions things slow
down as first come frequent GCs, then LeakPruning kicks in and cleans
the heap back out.<br />
<br />
<strong>Cliff Notes: </strong>Wondering if could do read-barrier check on low
bits by e.g. loading from the loaded pointer with an X86 op which
faults on odd addresses?  Means faulting on stale &amp; poisoned refs;
faulting on poisoned is OK (means throwing OOM anyways), but faulting
on stale might be too frequent still.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="TRIPS" />TRIPS, McKinley et al </strong><br />
<br />
TRIPs is an alternative processor architecture, with a square 4x4 grid
of functional units (GPUs); registers are along one "wall", D-cache is
along another 2 sides, etc.  Inter-GPU traffic is explicitly scheduled,
but with a data-flow-like flavor (static scheduling, dynamic timing). 
I've reported on this project before.  This talk is essentially an
end-of-project summary.  And the summary is: not much better than a
Core2 (if the chip had been built with Core2 technology).<br />
<br />
Each chip has 2 processors; 1M L2.  16GPUs, 128 registers, 32K L1
cache.  Processor is distributed.  4x4 grid of GPUs; registers along
one side; D-cache along other sides. Window size of 1024 instructions.<br />
<br />
Block atomic execution model; dataflow between blocks.  Blocks are
single-entry, multi-exit.  1 non-speculative block; up to 7 speculative
blocks.<br />
<br />
Heavy use of predication to build larger blocks.  Direct instruction
communication model.  Special instructions to handle large fan-out. 
Large block of conditional/predicted stuff mapped onto the 4x4 grid of
FPUs; using reads from registers and reads/writes to d-cache.<br />
<br />
Larger blocks require more scheduling; harder to place.  What is the
utilization of the large instruction window?  Both scheduling, plus
e.g. cache-miss &amp; branch mis-predict.  Static delay: scheduling
issues; 1-clk to move data from FPU to FPU.<br />
<br />
Benchmarks: 53 compiled benchmarks; 13 hand-optimized benchmarks; 2
hand-optimized and hand-scheduled.  Gather numbers of the actual chip,
except when needs to use SIM numbers (because chip didn't gather the
right data).<br />
<br />
Generally, out of the possible 128 instructions in a block compiler is
able to get about 60 instructions.  Hand-optimized code uses smaller
blocks because better scalar optimizations reduces instruction count.  <br />
About 1/4 of all instructions are predicated &amp; never executed.<br />
About 1/4 of all cycles are lost due to scheduling.<br />
About 1/2 of all cycles do useful work.<br />
<br />
Very large code-bloat: 55% bloat from nops, or direct reg read/write
ops, 18% from predicated useless, 27% from instruction-replication
optimizations (loop unrolling, inlining).  4 SPEC benchmarks have
i-cache miss rates &gt; 10%.  Direct memory accesses are much lower (no
spill ops?)<br />
<br />
Need variable-sized blocks; would reduce the code bloat from 11x to 6x
(over canonical Alpha code).<br />
<br />
L1 is partitioned.  Compiler can acheive 2 mem-ops /cycle on vector
add; hand-optimized can hit the max of 4 mem-ops / cycle.  Again for
matrix-multiple the compiler hits 1 flops/cycle; hand optimized hits 7
flops/cycle.  <br />
<br />
Clock speed is 366Mhz; mem speed is 200Mhz; process tech is 130nm. 
Compare to a Core2 but boost TRIPS "as if" if gets the 65nm process.
Compiled versions are competitive to Core2.  Hand-optimized about 3x
faster than Core2.<br />
<br />
What works: dramatic reduction in data-cache &amp; register file refs. 
Compiler is able to <em>work</em> (to schedule to this weirdo thing). 
But performance isn't very big over a Core2.  Challenges: dynamic code
size; dataflow overhead instructions; I-cache efficiency; predication
overhead; generating efficient code for control intensive codes.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="min_htm" />Maximum Benefit from
Minimum HTM</strong><br />
<br />
Running a (supposedly) simple HTM on a transaction-aware Linux kernel. 
Their proposed HTM is somewhat stronger than Azuls', in that they
support parallel independent transactions (the main one, and ones run
inside interrupt handlers), a special version of CAS, and graceful
overflow to software.  Also they holding pending transactions in a much
more generous L1 than Azuls'.<br />
<br />
Results look pretty anemic and fragile, which is on par with Azuls'
results (4x speedups on 16-way boxes; 60% of exec cycles wasted on
retries).  On the plus side, they at least attempt a reasonable
benchmark set.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="rock" />The Rock's HTM</strong><br />
<br />
A look at Sun's Rock's HTM.  It's a lot weaker than Azul's - max of 16
(or 32, depending on how many hardware threads you use) pending writes;
aborts on function calls, TLB misses, mispredicted branches &amp; some
fp ops, - plus the same reasons Azul's HTM can fail: eviction from the
L1 (due to transaction being too large, or conflict from another CPU).<br />
<br />
Major discovery (just like Azul discovered!): feedback is crucial to
decent heuristics.  Your failed HTMs need to report as much as possible
about what went wrong so the heuristic stands a chance of getting it
right.  This required a hardware spin (Rock2).  One useful heuristic is
the standard backoff mechanism: if there's true contention, then
backing off can spread the memory references far enough apart to allow
the HTM to succeed (presumably succeed with less overhead than a simple
spin-lock solution).  <br />
<br />
They report hashtable numbers, but wildly concurrent hashtables are
commonly available.  <br />
They report using the Rock HTM for doing an N-CAS.  This is largely
successful, and might simplify many libraries.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="CTrigger" />CTrigger -
Atomicity Violation Bugs </strong><br />
<br />
Software testing is inadequate.  Poor interleaving during standard
stress testing.<br />
<br />
CTrigger - for any given program input, <em>force interesting
interleavings</em>.  There is an exponential interleaving space for each
input, but stress testing is using random interleavings.   Previous
work only controlled the context-switch interleavings.  CTrigger
focuses on interleavings that likely hide atomicity bugs and rarely
happen during normal testing.<br />
<br />
Focus on interleavings that likely hide bugs: e.g. 2 nearby refs to the
same memory in one thread - and another write to same ref in another
thread.  Force these accesses to interleave by injecting "sleeps".<br />
<br />
Tried on 7 apps (server, client, scientific) on 8-core machine.  143
total Unserializable Interleavings (program interleavings deemed
interesting from atomicity standpoint).  Stress testing only hits about
80 to 100 of the U.I., out of 143, even for many many runs.  Basically
some (1/3rd) of the U.I.s are very unlikely to hit with stress testing
- this demonstrates the key weakness of standard stress testing.<br />
<br />
CTrigger looks at the access traces and attempts to force the
interleaving.  Needs to trim out infeasible paths: locking forces many
orderings to be impossible (i.e., the code is correct and no race
happens).  Some refs are inside the same lock; some are Happens-Before
on thread-create or thread-join.  Typically 90% of apparent bad
interleavings are infeasible.<br />
<br />
Focus on rare interleavings also (not-rare races happen easily in
stress testing).  If the "local gap" is small - time between 2 refs in
the same thread is short, then the race is narrow - and hard to hit in
stress testing.  Also look at the distance between external ref
distance in time to the gap.  Sort these potential races by
likelihood.  Instrument the program to add an artificial delay at a few
points to try &amp; force these rare races to happen often.  <br />
<br />
Then run program as normal.  Program speed is similar to unmodified
program, because artificial delays are small and few in number. 
Program runs fine on multi-core machines; using the multi-cores as
before.  But now the rare races are common, and stress testing finds
them quickly.<br />
<br />
Typically find bugs now 100x faster than before!!!  Found bugs in all
these apps.  Was able to repro the breakage 100% of the time, generally
very quickly.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="Anomaly" />Anomaly-based Bug
Detection &amp;
Prediction</strong><br />
<br />
Defect - the Bug<br />
Infection - Defect is triggered ==&gt; program state is corrupt<br />
Failure - An observable error (Crash, hang, wrong result)<br />
<br />
Infections spread over time, or get overwritten (hence bug is squashed
in this case).<br />
<br />
Software debugging: hypothesis bug; change code; attempt to repro bug,
etc.<br />
<br />
- detecting anomalies; out-of-bounds addresses, unusual control paths,
page faults or redundant  computations.  So profile program to teach
the detector what the "good" behavior is.  Some values are in fixed
ranges during the teaching runs, or some mem locations only accessed
from a few ld/st ops; or abnormal number of loop iterations.  Current
crop of anomaly detectors have a fairly high false-positive rate.<br />
<br />
- Now isolate relevant anomalies: compute dynamic forward slices from
anomalies to crash point.  Ouch - this step looks really expensive. 
Rerun the broken program, propagating tokens from anomaly site to crash
site.  Use binary search on the anomaly set.  Can reduce the count of
false positive anomalies by a factor of 10.<br />
<br />
- Now "validate" that have found buggy line of code: NOP that line out
and see that you can execute further.  <br />
<br />
Still don't know what the fix is, but can point out the step that e.g.
corrupted memory.<br />
<br />
Appears to scale up to e.g. gcc (330KLOC) found a bug in distributive
laws causing a loop in RTL tree optimization (C2 used to suffer these
routinely).<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="isolator" />Isolator</strong><br />
<br />
Isolation in concurrent programs: detect isolation guarantees; i.e.
access in critical regions from non-locking threads (or incorrect
locking).  Bad threads can break well-behaved threads, (the usual
motivation talk for debugging data races, I wish people could skip this
or at least limit it to 1 slide &amp; 1 minute).  Also STMs only have
weak atomicity, so bugs still happen for well-behaved STM threads.  <strong>Cliff
Notes: </strong>This is the key: some other buggy thread breaks your
well-behaved concurrent thread; studying the break point is useless.<br />
<br />
Isolator: Program executes as-if no isolation violation.  Uses a combo
of data-replication, virtual memory &amp; code instrumentation.  Does
not require the whole program.  Requires "guarded by" annotations.<br />
<br />
<strong>Cliff Notes: </strong>just having thread-level TLB protection would let
you do some COW-style (<a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Copy-on-write">http://en.wikipedia.org/wiki/Copy-on-write</a>)
protections.  Just write-protect all pages touched during a locked
region (except allow writes from self).  All other threads that attempt
to write without holding the lock will get a TLB fault and can be
stalled (and reported as racing).<br />
<br />
Stalling can cause deadlocks, so need dynamic deadlock detector. 
(Youch!)<br />
Either allow racing access OR throw a DeadlockError.<br />
<br />
This above option is done by Isolator, except requires annotations
rather than lazily discovering touched variables.  <br />
<br />
Issues: expensive to change protection levels on every lock.  Fix:
"lazy unprotect" - only remove protections on-demand, as other threads
need them.  Could "shatter" big pages, or split unrelated objects to
private pages (via GC).  <br />
<strong>Cliff Notes: </strong>similar to HotSpot's Biased Locking, but this
wants per-thread page-protections; really want user-mode control over
per-thread page protections.<br />
<br />
Issues: page-size granularity - has fragmentation overhead for
fine-grained locks.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="kendo" />Kendo</strong><br />
<br />
Only works for data-race free programs.  Very efficient deterministic
interleaving.  Progress is made with "Deterministic logical clock" time.<br />
<br />
Work on SPLASH benchmarks.  See 18% to 27% overhead, but deterministic
always.  Requires no special hardware.  <br />
<br />
Each thread maintains a "logical clock" similar to Azul's COUNT
register or TSC on X86.  Could be e.g., a "bytecodes executed"
counter.  Globally, the thread with the lowest "logical clock" has the
right to do a e.g. lock acquire (non-determinstic action).  Assumes all
inter-thread communication happens nicely inside locks.<br />
<br />
Enforces lock-acquire ordering, except for nested locks (have to
prevent deadlock, since basically adding another locking layer).<br />
<br />
<strong>Cliff Notes: </strong>Mostly interesting because this could work on
stock hardware (but it is limited to race-free programs which are
precisely those that you'd like this to work with).<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="replay" />Deterministic Replay</strong><br />
"Time Travel" for concurrent debugging<br />
<br />
Phase 1 - record non-deterministic events<br />
Phase 2 - replay<br />
<br />
Software versions have been around for ~15 years.  Now commercially
available.  Also available are libraries, compilers &amp; OS's to
help.  SW stack flaw: limited to speed of 1 cpu<br />
<br />
How do you integrate HW &amp; SW stacks?  i.e., If HW has captured a
log, where does it go?  Do not want to capture or replay the SW stack
that is doing the SW recording.<br />
<br />
Also need the HW to be running across the whole machine, so that all
processes are using it because interacting processes need to all be
deterministic (see butterfly effect above).<br />
<br />
So some abstraction: not all elements of the system are being recorded
and can be replayed.  May need to simo record multiple user-mode
processes in the same recording "sphere".  Only user-mode threads are
being recorded (OS support needed to virtualize HW, but also OS needs
SW record/replay that is process specific).  Really he's making a weak
stab at recording everything to avoid the butterfly effect.<br />
<br />
Now record 2 different kinds of info:<br />
HW info: memory interleave log<br />
SW info: system stuff log; scheduling; context switch; I/O; signals<br />
<br />
Replay also has to handle, e.g. malloc has to be deterministic.<br />
<strong>Cliff Notes: </strong>but not GC, although HashCode has to be
deterministic.<br />
<br />
Key Issues:<br />
1- Deterministic injection of data into controlled "sphere" (I/O)<br />
2- Using fewer processors during replay than were used during recording<br />
3- Emulating vs re-execute system call (e.g. fork requires a real
system call) - but this makes a lot of OS state changes.<br />
<br />
Key Issue #1 - Capturing ALL I/O<br />
- Random OS calls write into user-space (i.e. setting return values in
memory).  But during replay need e.g. the SAME "fd" as before.  But OS
under no obligation to hand out the same "fd" at an open().  Or OS
writes data into a buffer - BUT another user-thread is ALSO writing
into the same buffer.  <br />
<br />
Key Issue #2 - Not enough CPUs during replay<br />
- I <strong>think</strong> he's expecting HW to do the memory interleaving on
replay!!  But basically, it's time for thread#1 to run - but the OS
wants to schedule thread#2 - just need to block #2 and allow #1 to 
run.  Very slow (constant OS thread scheduling overhead).<br />
<br />
Log size - in bits per kilo-instructions.  <br />
Small: 3.15 bits/kilo-instruction.<br />
Recording performance overhead: 20-50%.<br />
Replay performance: using replay HW!?!?!  About 2x slowdown during
replay.<br />
<br />
How to handle large datasets?<br />
 <br />
<hr size="2" width="100%" /><strongig><strong><a name="DMP" />DMP - Deterministic
Shared Multiprocessing</strong><br />
<br />
Hard to debug, etc, etc...<br />
So just make the whole thing deterministic.  No "Heisenbugs"; can use
time-travel debugging.  Can test like it's a serial program.  No
attempt at record/replay really - just force same ordering every time. 
<strong>Cliff Notes: </strong>Suffers butterfly effect.<br />
<br />
Instructions communicate via shared memory.  Only care about
"communicating" instructions.  Can serialize - but that blows
performance. <br />
<br />
So really want to figure out which instructions <em>really </em>communicate.
Requires HW support.  <br />
<br />
Conceptually have a "deterministic token" which is handed out to
processors and then is passed around all the CPUs in round-robin
format.  This token totally orders all instructions.  Each "quantum" is
roughly 1000 instructions, but varies.  Now filter out
non-communicating instructions.  Allow the non-comm instructions to run
in parallel.  Leverage cache coherency to track when cache lines move. 
Build a "sharing table" to help order memory accesses.<br />
<br />
Now add some TM support to help get more parallelism.<br />
Now come some quantum-building optimizations: break at sync/lock
boundaries.<br />
Communication is often bursty: so break quantum after a series of
stores to shared memory.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="buddies" />Memory Buddies</strong><br />
<br />
Sharing memory between VMware VMs(?)  Hypervisor finds identical pages
and shares them under the hood.  (Probably good for e.g. shared
libraries). See 1/3 reduction in memory usage in VMware ESX server. Get
sharing only if you run similar apps on the same hardware.  Placement
is hard in datacenter; lots of clients &amp; software stacks; things
change over time.  Manual placement is tough.<br />
<br />
Goal: estimate sharing ability across many physical machines and many
virtual machines across the datacenter.<br />
<br />
Create 32bit hash per 4K page.  1MB hashs per 1GB of ram.  Forward to
other hosts.  Comparing lists is slow.  So use a Bloom filter.  With
hash lists, just sort &amp; compare.  With Bloom filters you use a
dot-product of bits; very accurate in estimating the number of elements
shared between Bloom filters. <br />
<br />
New VM launch - run on staging host.  Figure out it's memory
footprint.  Compute it's Bloom filter per page.  Compare it's filter
against other Filters on all other hosts in data center.  Find closest
fit (on machine with enuf memory &amp; cpu) and move the new VM there.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="livemigration" />Live VM
Migration</strong><br />
<br />
How to migrate VMs cheaper (force their free memory pools to shrink
before migration; similar to having a JVM do a GC &amp; heap-shrink
before migration, then inflate back to normal size in the new host).<br />
<br />
Contrast to the usual approach: <br />
- 1st iteration copy all<br />
- later iterations copy any dirty pages<br />
- after dirty workset is "small enuf", stop VM, copy active dirty pages,<br />
- restart on new host.<br />
Goal: reduce downtime.<br />
Con: Write-intensive apps have long migration times (because of
constant page copying) and high network overhead.<br />
<br />
Alternative: copy minimal CPU state &amp; pages, start running on
target.  VM demand-pages from source.  Also actively push pages from
source to target.  Good: copy pages only once; lower total migration
time &amp; network overhead.  Works well for both read-intensive &amp;
write-intensive.  Con: resumed app suffers while pages demand-paged in.<br />
<br />
Can make some predictions on future needed pages from apps page-faults
- likely will want other pages nearby in virtual-memory-addresses
soon.  Examples are walking an array or working a stack.<br />
<br />
Make a psudeo-swap device in the guest OS for demand-pull pages
(basically swapped over the network).  Allows the VM job to keep
running in other threads or processes - only 1 thread is blocked on a
page fault.  (Actual implementation has lots of warts due to nature of
existing Xen infrastructure).  e.g. we have a much more direcy way to
detect memory-full conditions and politely share memory between JVMS.<br />
<br />
Results: feasible, but clearly more downtime over pre-copy in the
common
case.  But works well w/write-intensive VMs.<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="validate" />Validate</strong><br />
<br />
Patches are dangerous.  Delaying change is a safer bet even for
security bugs: more likely for downtime from patches than your system
gets hacked while you wait.  <br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="raytracing" />Raytracing vs GPUs</strong><br />
<br />
Raytracing is more expensive per-ray, but less expensive per-object. 
Can be parallelized.  Makes really great shadows &amp; reflections. 
Want
to improve predictability in ray-tracing to get "good enough" frame
rates.<br />
<br />
Trying to use SIMD.  Primary rays are parallel, easy to SIMD.<br />
Seconday rays are incoherent, not good for SIMD.<br />
<br />
Cliff Notes: no surprise but modern GPUs are not well suited to
raytracing work<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="commute" />Commutativity
Analysis for Software
Parallelization: Letting Program Transformations See the Big Picture</strong><br />
<br />
Sources of Commutativity - e.g. sum, malloc<br />
<br />
Can detect it?  Check memory state before &amp; after running fcn X.<br />
<br />
But misses e.g. set's (via linked-list) - linked-list order is
different when swapping 'put(A)' and 'put(B)' calls, but semantically
the same: both elements are in the set, but the set's implementation
stores them differently.<br />
<br />
How to test for commutativity? Symbolically execute 'put' calls in
different order. See if memory contents are same: "put(A);put(B)" vs
"put(B);put(A)".  If yes, then Yes - but suppose not.<br />
<br />
Now check reader fcns: try is_member() (on what values???)  <br />
Also check writers like 'remove'.<br />
<br />
Problem: exponential search space for equivalence.<br />
Use Random Interpretation- <br />
- choose random input values<br />
- interpret till conditional branch<br />
- follow which way it goes; tests program<br />
- Then back up to cond branch<br />
- Adjust memory &amp; values such that test fails<br />
- Now follow thru on other branch; test program<br />
<br />
Now combine memory states for each branch of program.  Using an 'affine
join' to combine memory states???  Somehow correctness bug becomes an
exponentially low; very very unlikely the Random Interpretation ever
computes some wrong answer (declares as commutative some function which
is not really commutative).<br />
<br />
Looking at MediaBench &amp; SpecInt on an infinite issue machine.  See
20% to 30% of all functions are commutative (with themselves?).  Only
see speedup when commutative fcns are called near each other, but then
can see parallel speedup (log tree combining?).  Average is 50%
speedup.  Apparently assuming an extremely wide machine, as opposed to
threads (ignoring communication costs &amp; thread start/stop).<br />
<br />
<strong>Cliff Notes: </strong>Random Interpretation looks interesting, not sure
about the rest of <em>this </em>paper.  In fact, this paper looks like
a small delta (and very difficult to use correctly) over the R.I&gt;
work.   Instead see 2003 paper: "Discovering Affine Equalities Using
Random Interpretation"
<a class="moz-txt-link-freetext" href="http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf">http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf</a><br />
<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="gcyield" />Predicting Yield of
GCs</strong><br />
<br />
Amount of memory reclaimed by GC *can* be low.  But looking at minimal
GC footprints.  When GC is not productive, then time to extend the
heap.  Estimate GC yield by using Virtual Memory to track pages which
are touched and which are not touched - not-touched pages are probably
full of mostly dead stuff.<br />
<br />
<strong>Cliff Notes: </strong>Yawn, Olde Hat Trick.<br />
<strong>Cliff Notes: </strong>got bored; symptom of Day 3 at a conference;
worked on slides<br />
<br />
<hr size="2" width="100%" /><strongig><strong><a name="phatombtb" />Phantom BTB</strong><br />
<br />
<strong>Cliff Notes: </strong>Winner of the most ridiculous proposed hardware,
backed up a big speedup on the most-unrealistic simulated hardware.<br />
<br />
Virtual BTB Design - as in Virtual Memory not as in VMWare.<br />
BTBs today are {small,fast,inaccurate}.  Want {big,fast,accurate}.<br />
Allocate space for the BTB in the L2(!?!?!).  <br />
But latency to the L2 is large, so much prefetch the BTB data.<br />
<br />
Getting tidy 10%+ speedups on unrealistic hardware.<br />
<br />
BTB in L2 cache is large+slow.  Missing BTB entries are evicted to L2.
Then prefetch from L2 based on previous missing branches (assuming that
an upcoming BTB miss can be predicted from a just-prior missing
branch).  Attempted lots of other predictors; most failed miserably.<br />
<br />
Pay-as-you go model; BTB gets effectively larger as more pressure
happens on the on-chip BTB.<br />
<br />
<strong>Cliff Notes: </strong>HotSpot inline-caches wipe out the need for faster
jump-register support in hardware; standard branch predictors work well
with HotSpot.  Total clocks spent doing jump-register instructions are
less than 1% of all clocks on realistic hardware.<br />
<br />
<br />
</strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></strongig></div><div class="moz-text-plain" graphical-quote="true" lang="x-western" style="font-family: -moz-fixed; font-size: 13px;" wrap="true"><pre wrap=""><strongig><br /><hr size="4" width="90%" /><strong><strongig><br />import java.lang.*;<br /><br />class TextScan {<br /><br /> static String test1 = "scan_for_some_token(like_this_open_paren";<br /> static String junk0 = "scan_for_some_token_like_this_open_paren";<br /> static String junk1 = junk0.concat(junk0);<br /> static String junk2 = junk1.concat(junk1);<br /> static String test2 = junk2.concat(test1);<br /><br /> // A TextScan is a lite object that's really 8 longs representing<br /> // the bits of 64 bytes split 90-degrees from verticle.<br /> long b0,b1,b2,b3,b4,b5,b6,b7;<br /><br /> public static void main( String args[] ) {<br /> TextScan[] bits = bit_split(test2);<br /><br /> System.out.print("bits="+bits.length+"@(");<br /> for( int i = 0; i&lt;bits.length; i++ )<br /> System.out.print(bits[i]);<br /> System.out.println(")");<br /><br /> // Scan for byte <br /> System.out.println("Scanning for byte "+(byte)'(' + (char)0x28);<br /> System.out.print("[");<br /> for( int i=0; i&lt;bits.length; i++ ) {<br /> // bit pattern 0x28 - 0 0 1 0 1 0 0 0<br /> TextScan ts = bits[i];<br /> // The open-paren char, '(', hex 0x28:<br /> // 0 0 1 0 1 0 0 0<br /> long res = (~ts.b7) &amp; (~ts.b6) &amp; ts.b5 &amp; (~ts.b4) &amp; ts.b3 &amp; (~ts.b2) &amp; (~ts.b1) &amp; (~ts.b0);<br /> // A non-zero res means a '(' is found, and the bit-number is<br /> // the index into the 64-byte array where it was found.<br /> System.out.print(Long.toHexString(res)+",");<br /> }<br /> System.out.println("]");<br /> }<br /><br /> // Split a String into an array of 64-bit longs.<br /> // Parallel-split bits: bit 0 from 64 bytes go in 'b0',<br /> // bit 7 from 64 bytes go in 'b7', etc.<br /> public static TextScan[] bit_split( String s ) {<br /> int slen = s.length();<br /> TextScan[] bits = new TextScan[(slen+63)&gt;&gt;6];<br /> for( int i=0 ; i&lt;bits.length; i++ ) {<br /> TextScan ts = bits[i] = new TextScan();<br /> // Suck 64 bytes into bytearray from String.<br /> for( int j=0; j&lt;64; j++ ) {<br />	int idx = (i&lt;&lt;6) + j;<br />	byte b = idx&lt;slen ? (byte)s.charAt(idx) : 0;<br />	if( (b&amp;(1&lt;&lt;0)) != 0 ) ts.b0 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;1)) != 0 ) ts.b1 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;2)) != 0 ) ts.b2 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;3)) != 0 ) ts.b3 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;4)) != 0 ) ts.b4 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;5)) != 0 ) ts.b5 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;6)) != 0 ) ts.b6 |= (1L&lt;&lt;j);<br />	if( (b&amp;(1&lt;&lt;7)) != 0 ) ts.b7 |= (1L&lt;&lt;j);<br /> }<br /> }<br /> return bits;<br /> }<br /><br /> // --- toString<br /> // Reverse bit-swapping to a String.<br /> public String toString() {<br /> String s = "[";<br /> for( int i=0; i&lt;64; i++ ) {<br /> byte b = 0;<br /> b |= ((b0&gt;&gt;i)&amp;1) &lt;&lt; 0;<br /> b |= ((b1&gt;&gt;i)&amp;1) &lt;&lt; 1;<br /> b |= ((b2&gt;&gt;i)&amp;1) &lt;&lt; 2;<br /> b |= ((b3&gt;&gt;i)&amp;1) &lt;&lt; 3;<br /> b |= ((b4&gt;&gt;i)&amp;1) &lt;&lt; 4;<br /> b |= ((b5&gt;&gt;i)&amp;1) &lt;&lt; 5;<br /> b |= ((b6&gt;&gt;i)&amp;1) &lt;&lt; 6;<br /> b |= ((b7&gt;&gt;i)&amp;1) &lt;&lt; 7;<br /> if( b == 0 ) break;<br /> s += (char)b;<br /> }<br /> s += "]";<br /><br /> return s;<br /> }<br /><br />}<br /><br /><br /><br /></strongig></strong></strongig></pre></div></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/03/once-again-i-put-down-my-hackers-keyboard-to-bring-you-this-trip-report.html</feedburner:origLink></entry>
  <entry>
    <title>And now some Hardware Transactional Memory comments...</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/GtSQaKrkmXs/and-now-some-hardware-transactional-memory-comments.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=63331867" title="And now some Hardware Transactional Memory comments..." />
    <id>tag:typepad.com,2003:post-63331867</id>
    <issued>2009-02-25T10:16:13-08:00</issued>
    <modified>2009-02-25T18:20:16Z</modified>
    <created>2009-02-25T18:16:13Z</created>
    <summary>(sorry for the long gap between postings; work's gotten interesting and I got busy) I recently attended the Bay Area Workshop on Transactional Memory at Stanford generously hosted by HP. Slides are here; my slides are helpfully titled "2009_TMW.pdf". In...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>(sorry for the long gap between postings; work's gotten interesting and I got busy)</p><p>I recently attended the <a href="https://vsci.hpl.hp.com/marcs/tm_workshop.html">Bay Area Workshop on Transactional Memory</a> at Stanford generously hosted by HP.  <a href="https://vsci.hpl.hp.com/marcs/tmPresentations.zip">Slides are here</a>; my slides are helpfully titled "2009_TMW.pdf".  In this workshop I gave a brief overview of Azul Systems' <a href="http://en.wikipedia.org/wiki/Hardware_transactional_memory">Hardware Transactional Memory</a> support plus our experiences in using it.  This was followed shortly by a <a href="http://blogs.sun.com/dave/resource/asplos09-Rock-Final2.pdf">blog posting</a> from <a href="http://blogs.sun.com/dave/">David Dice</a> discussing Sun's HTM support in the <a href="http://en.wikipedia.org/wiki/Rock_processor">Rock processor</a>.</p><p>For Azul Systems' certainly (and to a lesser extent Rock), the name of the game is throughput: we appear to be <a href="http://www.cs.virginia.edu/stream/top20/Bandwidth.html">generously over-provisioned with bandwidth</a>, we can sustain 30G/sec allocation on 600G heaps with max pause times on the order of 10's of milliseconds (and unlike <a href="http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html">IBM's Metronome hard-real-time collector</a>, our MMU past 20msec is 0.99; see MMU definition at bottom).  Each of our 864 cpus can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time.  We have a lite microkernel style OS; we can easily handle 100K <em>runnable</em> threads (not just blocked ones).  Our JVM &amp; GC scales easily to the whole box.  In short: the bottleneck is NOT the platform.  We need our users to be able to write scalable concurrent code.</p><p>The goal of our HTM has always been to accelerate "dusty deck" Java programs via Lock Elision.  (We've
never been tempted to enter into the language wars and implement some
form of full-blown transactional programming via an "atomic" keyword). We see a lot of defensive uses of the "synchronization" keyword; legacy libraries using "sync" on all methods or code maintainers sprinkling "sync" about liberally in order to squash some bug.  In practice, 99% of all locks are never contended - and these we run very fast (our <a href="http://en.wikipedia.org/wiki/Compare_and_swap">CAS</a> &amp; FENCE instructions can hit in cache; an uncontended lock requires only a few clock cycles).  But the <a href="http://en.wikipedia.org/wiki/Lock_%28computer_science%29#Granularity">locks that are contended</a> prevent parallel execution (<a href="http://en.wikipedia.org/wiki/Amdahl%27s_Law">Amdahl's Law</a> and all that)- and we observed that much of the time the contention is on the lock itself and not on the data it protects.  So we added in HTM support in our first processors and this support has been shipping and turned on at all costumer sites for over 4 years now.  By now we have a <em>lot</em> of field experience with using HTM.</p><p>But first, a brief overview of our HTM support: we use a few extra bits on each L1 cache line to track "speculatively read" and "speculatively written" cache lines.  Each CPU can be put in a "speculate" mode; loads and stores then set these bits accordingly.  If a speculative line is evicted from the L1 <em>for any reason</em> then the transaction "aborts" (there is also an "abort" instruction).  Speculatively-written lines are marked invalid (meaning: they no longer cache any data) and speculatively-read lines have their spec-bit cleared.  Also the CPU branches to a fixed trap vector for software fixup.  </p><p>If the CPU makes it to a "commit" instruction then the spec bits are cleared - including the speculative-write bits - which atomically makes all those writes visible to other CPUs. In other words, the XTN commits.</p><p>Nothing aborts a transaction except the "abort" op or losing a speculative line out of L1.  We do not abort on function calls/returns, nor TLB misses, nor exceptions nor rain nor snow nor sleet nor dark of night...  Our XTNs are limited therefore by the size and associativity of the 16K 4-way L1 cache (and the shared inclusive L2).  We routinely see successful HTMs of many thousands of instructions with many hundreds of cache-hitting memory ops. (contrast this to what I can figure out about Rock: Rock appears to abort on function call/return, on running out of store-buffer depth (which limits an XTN to 16 stores?), on TLB misses, on L1 associativity, on common nested locking patterns (because of abort on function calls)).  </p><p>Software heuristics determine when to try the HTM (uncontended locks are much cheaper using a cache-hitting CAS).  The software measures the success/fail ratio of the HTM on contented locks and determines when to flip to a standard blocking implementation and when to use the HTM.  </p><p>While some apps get a nice 2x speedup (e.g. <a href="http://www.azulsystems.com/products/compute_appliance/performance_breakthrough.htm">Trade6</a>), most apps are speedup only slightly (we had lots of teething problems with the software heuristic that used to cause 10-20% slowdowns due to endless fail/retry loops, although these all appear to be fixed now).  We nearly always see a handful of locks using the HTM support, but these appear to only rarely be the "right" locks to get more CPUs really busy.  HTM failure appears to nearly always be due to <em>conflict </em>and <strong>not </strong><em>capacity</em>.</p><p style="font-size: 15px; font-family: Arial;"><strong>Issues</strong></p><p>In short, users' don't write "TM-friendly" code.  Neither do library writers.  For example the "modcount" field in Hashtable is bumped for every update in the Hashtable.  For a large Hashtable we can expect most mods to be in unrelated portions of the Hashtable.  i.e., without the modcount field update we would expect the HTM to allow parallel 'put' operations.  With the modcount field update, 'put' operations always have a true data conflict - and this aborts the HTM.  After some number of failed 'put' operations we have to start really locking the Hashtable (to allow the puts to make forward progress) and this now single-threads the 'get' operations as well.  This general pattern of updates to peripherial shared values (usually performance counters of one sort or another) is very common and it kills the HTM - because it presents a <em>true data conflict</em>.</p><p>Many times a small rewrite to remove the conflict makes the HTM useful.  But this blows the "dusty deck" code - people just want their old code to run faster.  The hard part here is getting customers to accept that a code rewrite is needed.  Once they are over that mental hump, once a code rewrite is "on the table" - then the customers go whole-hog.  Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)?  Also locks have well understood performance characteristics, unlike TM's which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying "sweet spots" such that code which performs well on one STM might be really unusably slow on another STM).  </p><p>Really what the customers want to know is: "which locks do I need to 'crack' to get performance?".  Once they have that answer they are ready and willing to write fine-grained locking code.  And nearly always the fine-grained locking is a very simple step up in complexity over what they had before.  It's <strong>not</strong> the case that they need to write some uber-hard-to-maintain code to get performance.  Instead it's the case that they have no clue which locks need to be "cracked" to get a speedup, and once that's pointed out the fixes are generally straightforward. (e.g., replacing sync/HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc)</p><p /><h2><span style="font-size: 15px;">Summary</span></h2><ul>
<li>Very modest gains for HTM; usually &lt;10% (although 2x for some large apps)</li>
<li>Rewrites of "dumb" direct contention (e.g. perf counters) would help a lot</li>
<li>Azul did this for some common JDK pieces Out There (e.g. Hashtable), but there's just too much of it</li>
<li>Customers not willing in general to touch code for HTM<ul>
<li>If code is being hacked for performance, it will be with fine-grained locking instead of simply making it "HTM friendly".  </li>
</ul>
</li>
</ul>
<p /><p /><p /><p /><p>Thanks,<br />Cliff
</p><p style="font-size: 12px; font-family: Arial;">[MMU is a common measure of GC performance and stands for "minimum mutator utilization".  It shows the worst-case amount of time a worker thread gets for a timeslice of a given length vs time spent in GC.  Thus an MMU graph plots utilization (a number from 0 to 1 where 1 is better) versus timeslice size.  A program using Metronome might see a max GC pause of 1microsecond; it's MMU at 1 microsecond is then zero because the mutator can do no work for that microsecond.  However the MMU at 1 second might be 70% showing that in the long run Metronome's GC exacts a 30% performance "tax" on mutator operations.  Contrast this to Azul's Pauseless GC: at 10msec our MMU is zero but at 1sec our MMU is 0.99; in the long run there is no GC "tax" but in the short run our max pauses are much larger than Metronome's.  Metronome is suitable for e.g. hard-real-time applications like audio-playback, whereas Azul Systems' is suitable for e.g. soft-real-time transactional apps with high throughput.]</p><p /><p /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2009/02/and-now-some-hardware-transactional-memory-comments.html</feedburner:origLink></entry>
  <entry>
    <title>A Brief Conversation with David Moon</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/DOIu8H9YoX0/a-brief-conversation-with-david-moon.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=58671546" title="A Brief Conversation with David Moon" />
    <id>tag:typepad.com,2003:post-58671546</id>
    <issued>2008-11-18T10:05:08-08:00</issued>
    <modified>2008-11-18T18:05:08Z</modified>
    <created>2008-11-18T18:05:08Z</created>
    <summary>I had an email conversation between myself, David Moon &amp; Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago more-or-less attempted the same sorts of things Azul...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><div class="moz-text-plain" graphical-quote="true" lang="x-western" style="width: 470px;" wrap="true"><span style="font-family: Verdana;">I had an email conversation between myself, David Moon &amp; Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago more-or-less attempted the same sorts of things Azul is doing now; i.e. language-specific hardware support. I lightly edited the emails for clarity and ordering (otherwise the nested interleavings get horrendous to follow). <br />[Ed: 11/20/2008 - Some small follow-on conversation has been appended to the end]</span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Mostly I felt inspired by your Symbolics work. Azul Systems makes gear for running Java (but actually it runs a C/C++ program - we just happen to run a large C++ that implements Java). One of the biggest impact changes we made was a hardware read-barrier for GC - a simple instruction that tests invariants of freshly loaded pointers and takes a fast-trap if the test fails. Fixup is entirely in software. It's a RISC-y approach to the GC hardware that Symbolics had 20 years ago. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> I remember this now; this was one of the things that impressed me when I read about your machine a couple years ago. As you may or may not know, the Symbolics 3600 had fairly complex GC hardware and the Symbolics Ivory went 90% of the way towards riscifying it, eliminating all the large hardware tables. I think there was still a 16x1 bit (NOT 16Kx1 !) table to indicate which portions of address space are oldspace; with 64 address bits instead of 32 you were able to reduce that to a single bit, which is nice. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> We use very simple 3-address 32-register 64-bit RISC cores, with 54 cores per die. All chips are cache-coherent and uniform (medicore) memory access time for all; the upper bound is 16 chips or 864 cpus. The "big" box is about the size of a dorm room fridge, 14U form-factor. The big box also has top-20 super-computer bandwidth (barely). Modest 16K I- &amp; 16K D-caches for all cpus, plus groups of 9 share a 2-meg L2. Chips are fully interconnected. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> 3-address 32-register 64-bit address, 32-bit instruction RISC does seem to be the "sweet spot" for instruction set architectures. <span style="background-color: #ffff80;">[<strong>Cliff</strong>: Yes!]</span> I hope you avoid needless complexity like condition codes, link registers, maybe even separate integer and floating point registers. <span style="background-color: #ffff80;">[<strong>Cliff</strong>: Yes, yes &amp; yes]</span> I don't think Java needs floating point rounding and trapping modes either. <span style="background-color: #ffff80;">[<strong>Cliff</strong>: Yes, it does not]</span> For your application you probably don't even need separate user and supervisor modes since all executable code is generated by your just-in-time compiler from a safe language. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> That's the theory but the practice is a little different: JVM's crash and 1 crashing JVM should not bring down the whole box. So we in fact have a user/kernel split and it's saved our butts any number of times. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Maybe you don't need any kind of mode bits register at all? Being completely modeless would be cool. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> There's also a speculative-memory mode: writes are buffered in the L1 but don't become visible until a 'COMMIT' instruction. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> That's a pissload o' cores. Are those 54 real cores per die? Or is it perhaps 6 real cores, each executing 9 interleaved instruction streams, in the style of Denelcor HEP? </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Nope, real cores all the way. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Does the precisely defined Java memory model give you any help in making a more simple implementation of cache coherency than is usually done e.g. for Intel x86 architecture? Or did that memory model come out after your hardware was designed? </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> The Java Memory Model came first, and we designed the hardware around it. Actually the hardware implements something somewhat weaker than the JMM by default, and the JIT inserts memory FENCE ops as needed. More like Sparc RMO or Power's memory model. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> I assume uniform memory access doesn't mean cached accesses aren't any faster than cache misses, but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network. <span style="background-color: #ffff80;">[<strong>Cliff</strong>: Yup]</span> <br /><br />It must be a real challenge to get enough memory bandwidth to keep 864 instruction streams going at full speed, even more so with such tiny caches. </span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> Moon says: "... but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network." Actually, I heard a talk about a year ago, at the first New England Database Day, suggesting that the way information moves between the L1 caches of the cores is by a sequence of moves, each being up, down, right, or left, in the direction of the destination core. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> No, we're a full direct interconnect on the L2's. 9 L1's share an L2. One step to the L2, then one step to the correct remote L2 - across the box in any direction (including another L2 on the same chip; it was easier to run outside the chip and back in than arrange for a special faster-path for within-chip communication. 15/16ths of your L2 misses are remote and the gain for speeding up the remaining 1/16th of the misses isn't worth it. </span></span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> If so, the times would not actually be uniform. Taking advantage of such knowledge in the software would be challenging except for special-purpose domains, I would think. I don't know how the 54 L1 caches (or maybe fewer if you're doing the piplined business that Moon described) are interconnected, but there is a limit to how many can participate in a full crossbar NxN connection. <span style="background-color: #ffff80;">[<strong>Cliff</strong>: Key word: challenging]</span> </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> We have a <strong>lot</strong> of memory bandwidth - one of our bigger boxes is on </span><a href="http://www.cs.virginia.edu/stream/top20/Bandwidth.html"><span style="font-family: Verdana;">the top-20 supercomputer bandwidth list</span></a><span style="font-family: Verdana; background-color: #ffff80;">. The chips are fully interconnected. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Meaning 
every chip has 15 ports directly connected to another chip? <span style="background-color: #ffff80;">[<strong>Cliff</strong>: 
Yup]</span> That's a lot of pins. You might want to look at </span><a href="http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect"><span style="font-family: Verdana;">http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect</span></a> <span style="font-family: Verdana;">for 
an interesting slimmer approach. </span><span style="background-color: #ffff80;"><br />
<br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> I'm sure we looked at sicortex at one point. </span></span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> I agree about the challenge of the memory bandwidth for all those cores. Do you have very wide paths between the chips and the main memory? </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> There are 4 memory controllers per chip (and up to 16 chips so 64 controllers). All memory is striped across all controllers at a fine grained level so each core's memory requests are spread across the box - mostly eliminating "hotspots". </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Also they are a message passing architecture and you have to be a shared memory architecture, but I think that's a separate issue from the interconnect topology. I am no longer sure which I believe in. On the one hand, I am a shared memory guy from way, way back. On the other hand, it is becoming increasingly expensive to pretend that memory is uniform and random-access, and the illusion is becoming increasingly unconvincing. The von Neumann architecture may be reaching the end of its life. I like the way the Cell treats main memory as essentially an I/O device, but I have never attempted to program it. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> By all accounts, programming the Cell is a nightmare. Since we can do shared-memory with so many cores, other people can too. The more difficult issue is parallel programming that uses all those cores. Shared memory is one way to make programming easier. </span></span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> What about cache coherency over multiple chips? <br /><br />These days, as Moon has pointed out to me, the importance of caches means that you gain a lot by not looking at your address space as a sequence of co-equal words, but rather as a sort of block-oriented device (a little like the way one looks at a disk), in which the block size is the size of the cache line in some relevant cache (probably L2 but it depends a lot on the hardware if the hardware is novel). (See the word "illusion" in his mail.) <br /><br />Does your Java implementation specifically take this into account, e.g. allocating objects aligned on cache line boundaries, implementing collections as B-trees with nodes being an integral number of cache lines, and that sort of thing? If not, you might consider benchmarking it; I'd love to know the results of such a test. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Our hardware has short cache lines (32b) so there's less to be gained by aligning objects to cache lines. For certain benchmarks (not to mention SpecJBB or anything) such alignment is probably worth alot - but not for most use cases. Perhaps it makes sense to expose cache-line alignment to the Java/JDK programmer? </span></span><br /><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Things we do specifically for Java: </span>
<ul>
<li><span style="font-family: Verdana; background-color: #ffff80;">GC read-barrier enables a fully parallel &amp; concurrent GC; we can sustain 40G/sec allocation on a 400G heap indefinitely, with max-pause times on the order of 10-20msec. This uber-GC is partially made possible because of the read barrier (and partially possible because we 'own' the OS and can play major page-mapping tricks). </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> That is indeed impressive! Even if I divide 40 GB/sec by 864 it's still fast. Of course you need it, much Java code really likes to generate garbage all over the place. My pet peeve is no way to return multiple values from a method without generating garbage. </span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> Dave, you say: "My pet peeve is no way to return multiple values from a method without generating garbage." In Rich Hickey's talk about </span><a href="http://clojure.org/"><span style="font-family: Verdana;">Clojure</span></a><span style="font-family: Verdana; background-color: #c0ff80;">(Cliff: a new Lisp dialect implemented on the JVM, mainly having been run on the Sun JVM but he ran it on an Azul at least once) addressed this issue. <span style="background-color: #ffff80;">[<strong>Cliff</strong>: I'm familar with Clojure]</span> He left out multiple value returns exactly on the grounds that the consing cost for a small returned value is so cheap these days, ephemeral GC's being so damned good. I do not know to what extent he has done actual metering, though. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Yes, annoying. And it's embedded into the JVM bytecodes that way, so you can't even rescue the problem by changing languages while keeping the JVM. The only hope here is for a combo of JIT'ing &amp; smart memory allocation: either inline &amp; optimize away the junk object or do some sort of stack-based allocation for the very-short-lived object. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Can you say anything about what page-mapping tricks are good for? <br /><br />Actually I am a little surprised you bother with virtual memory at all. If memory gets full surely it is faster to do a garbage collection than to swap pages out to disk (I think you don't even have a disk if I remember correctly). Maybe page mapping is just to avoid core shuffling if there are multiple heaps? Is the page size large to cut down on mapping overhead? </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Ok, a bunch of things here: </span>
<ul>
<li><span style="font-family: Verdana; background-color: #ffff80;">Page sizes are 1Meg. </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">No swap; swapping is death for GC. If you are out of memory, then you are OutOfMemory. To put it another way: you are going to blow your performance guarantees so badly, the JVM might as well have crashed. Better to make the user allocate enough resources up front to keep the performance guarantees. </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">The page mapping is integral to the GC; we free physical memory earlier in the GC cycle than virtual memory. This lets us recycle physical memory much sooner, and makes it much harder for an application to "surprise" the concurrent GC - i.e. an app whose allocation rate suddenly accelerates might run a different concurrent GC out of physical memory - and force an application stall. </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">The mappings are also used to deny user-mode threads access to a page where the objects are being concurrently relocated, but still allow GC-mode access to that page. GC-mode is a subset of user-mode (so we sorta are like a 3-privilege mode machine), except that user-mode can promote itself to GC-mode anytime it wants. </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">The read-barrier will take a fast-trap on failure, and by default get promoted to GC-mode in the trap handler. This lets the faulting CPU fixup the object reference and continue, without needing to wait for GC to catch up. </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">We also have a write barrier, but it's merely a slight efficiency hack. The read-barrier enables the good GC. </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> I would expect the write barrier to be more than a slight improvement, unless your memory scanning is incredibly fast. The main thing the write barrier accomplishes is to greatly decrease the number of pages (or cards or other chunks) that have to be scanned for pointers to the youngest generation. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Ahh.... here's the difference: a 'card-mark write barrier' can be fairly cheaply done with plain-o-instructions, so the write-barrier instruction is a delta above that. The biggest change is our write-barrier op 'filters' card-marks that don't need to happen which in turn cuts down on useless card-mark writes. Writes are generally really cheap - unless there's a ton of contention caused by rapidly updating of an old-gen object holding a pointer into young-gen. In that case the filtering is worth a lot. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Since the GC is concurrent the time required to scan for pointers of interest to the GC shouldn't slow down the application unless the application uses all the cores, except surely you run out of memory bandwidth eventually and then the GC's memory scanning interferes with the application's accesses to memory. So even with all that concurrency and massive processing power and memory bandwidth I would think you would still get a lot of benefit from doing less scanning because of the write barrier. But I haven't measured it and you probably have. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> We're definitely doing a generational collector, so we'd be using a software write-barrier if we didn't have the hardware one handy. Or maybe: "We have a write-barrier. The fast-path is done in hardware. There is a slow-path done in software." </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> I don't know if this was published, but the write barrier on the Symbolics Ivory was really simple. Each page table word contains an encoding (in 4 bits?) of the age of the youngest generation pointed to by this page; the write barrier just causes a trap when those bits need to be updated. I don't remember if we kept another index or just had the GC scan the page table. Since it's an IBM-style inverted page table the page table is dense even when a lot of pages are swapped out so scanning it is not too slow. It would have worked better if the Ivory page size hadn't been way too small. <br /><br />Did you ever look at Eliot Moss's "train" garbage collection ideas to do ephemeral-GC-type optimization for long-lived data? If I remember correctly there is a bug: memory consumption can become arbitrarily large in the worst case, but the idea might still be good. I think Microsoft is using a form of it in .NET. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> HotSpot tried out the train collector years ago, and never got it to perform well in practice. There are a bunch of odd-ball issues, (remembered sets with popular objects?) that made it not very performant. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Things we do specifically for Java (continued): </span>
<ul>
<li><span style="font-family: Verdana;"><span style="background-color: #ffff80;">Simple 3-adr RISC is easy to JIT for; it's HotSpot's JIT and makes quite good code - easily comparable to "gcc -O2". </span>[<strong>David</strong>: Yeah.]<span style="background-color: #ffff80;"> </span></span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">Just-In-Time zero'ing of L1 cache for allocation - and we don't issue a memory read for the cache-line getting zap'd to zeros. This is worth about 1/3 less bandwidth than the usual "store a bunch of zeros in a row" implementations. </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> </span><a href="http://www.ibm.com/developerworks/power/library/pa-memory/index.html"><span style="font-family: Verdana;">dcbz</span></a><span style="font-family: Verdana;">instruction on POWER. Clearly a good idea. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> dcbz doesn't make the grade: </span>
<ul>
<li><span style="font-family: Verdana; background-color: #ffff80;">it's not a prefetch so you can't issue it far enough in advance without causing the stall you are trying to cover, and </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">any java lock that happens will invoke a fence, which will also stall </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">it <strong>does</strong> avoid the memory read bandwidth. </span></li>
</ul>
<span style="font-family: Verdana; background-color: #ffff80;">Our version gets all of the above right. </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Dedicated register for the allocation pointer? </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> No. The main issue with allocation is avoiding the mandatory cache-miss as you stream through memory. You can hide tons of dumb int-ops and cache-hitting-loads underneath that cache-miss. So we pay a little bit of integer-cycles to setup our J-I-T-zeroing, and getting the allocation pointer is one of those cycles. In short: it's too small picken's to be worth holding a register for. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Things we do specifically for Java (continued): </span>
<ul>
<li><span style="font-family: Verdana;"><span style="background-color: #ffff80;">GC bits in each pointer (old-gen, young-gen, stack-allocated, normal-C-ptr) </span>[<strong>David</strong>: I remember that now. Good.] </span>
</li>
<li><span style="font-family: Verdana; background-color: #ffff80;">Java type-heirarchy bits in each pointer. This saves a trip to memory to fetch the Java type when making virtual calls. </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> So you're saying that since 64 bits is way too many, you can use a bunch of those bits to put a rather wide tag in each pointer? Maybe 20 bits wide? Seems like a great idea. No problem with having too many classes defined and running out of bits? </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Yes, wide tag per pointer. No problem (yet) with running out of classes. Big Java Apps these days seem to have about 2^15 classes. </span></span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> I am curious about how you do threads. On the Lisp machine, the stack had to be contiguous in virtual memory. So whenever we had to allocate one for a new thread, we had to trade off between wasting VM versus worrying about running out of stack. I don't know if we ever considered making the stack such that it could be a linked list of (big) blocks, but I suppose anything that slows down the function call path is very dicey. Dave, did you ever think about this? It would have been nice to have cheap full coroutines ("stack groups"). </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Nope - we're using the fairly classic stack layout for HotSpot. And yes this means you have to decide up front the stack-vs-heap memory split. </span></span><span style="background-color: #c0ff80;"><br /><br /><span style="font-family: Verdana; background-color: #c0ff80;"><strong>Daniel Weinreb wrote:</strong> Cliff, have you heard anything about the possibility that the JVM will be extended to be able to do tail-calling? Rich Hickey says that there was recently a </span><a href="http://openjdk.java.net/projects/mlvm/jvmlangsummit/"><span style="font-family: Verdana;">"JVM summit" (something like that)</span></a><span style="font-family: Verdana; background-color: #c0ff80;">of people writing non-Java languages for the JVM, and there was a lot of interest in tail-calling. It would be great for the Lisp world, since Scheme depends on it so much; it would be nice to re-unite the Scheme and Lisp communities as much as possible. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> I'm well aware of the interest in tail-calling optimizations - but it has to compete for scarce engineering resources like everything else. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Things we do specifically for Java (continued): </span>
<ul>
<li><span style="font-family: Verdana; background-color: #ffff80;">Really big interconnect &amp; memory bandwidth. Java has lousy locality, so things miss in cache - a lot. </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Would that be true even with 4 times larger caches? </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> I think so. We did a bunch of studies on cache-sizes, looking at fewer-cores-more-cache vs more-cores-less-cache. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Things we do specifically for Java (continued): </span>
<ul>
<li><span style="font-family: Verdana;"><span style="background-color: #ffff80;">Support for 2 outstanding memory misses per cpu, up to 24 per L2 cache counting prefetches - so 2304 outstanding memory ops across the whole box. </span>[<strong>David:</strong> "Non-blocking caches are good."] </span>
</li>
<li><span style="font-family: Verdana;"><span style="background-color: #ffff80;">Minor tweaks to help do Java-specific math: array addressing is sign-extend, THEN shift-and-add - which we do in 1 alu op. Also we have a Java range-check op, faster virtual-call support, the IEEE754 subset needed for Java FP, etc. </span>[<strong>David</strong>: Good.] </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> Do you have any kind of specialized cache to avoid memory references to find the address of a virtual method to call? I haven't measured it but I get the sense that all those not very predictable indirections through vtbls really hurt C++ speed. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Yes... specialized hardware, not specialized cache. <br /><br />We do the standard HotSpot </span><a href="http://research.sun.com/self/papers/pics.html"><span style="font-family: Verdana;">"inline cache"</span></a><span style="font-family: Verdana; background-color: #ffff80;">trick that's done on all CPUs. <br /><br />The cache is a 1-entry cache, inlined in the code. The Key is the expected class of the object, the Value is the target of the function call encoded as a plain "CALL" instruction. So you do a Key-compare inline then a plain static call. If the Key compare hits the hardware will Do The Right Thing with the static call. In practice, 95% of all call-sites (that can't be optimized by the JIT) never fail the 1-entry cache. The other 5% tend to fail for having lots and lots of targets. <br /><br />So there IS a "cache", but it's not specialized. It's the normal HotSpot cache. Instead, we have an instruction to do the Key-compare in 1 cycle (and the call is another cycle). So we do cache-hitting v-calls in 2 clks &amp; 2 ops. On a wide-issue O-O-O X86 that same sequence is maybe 4 dependent ops but issues also in 2 or 3 clks - except loading the object header is typically a cache-miss; but the branch predicts right and the O-O-O hardware lets the X86 proceed despite the miss. </span></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Things we do specifically for Java (continued): </span>
<ul>
<li><span style="font-family: Verdana; background-color: #ffff80;">Simple hardware transactional memory, used to speculate through Java locks. </span></li>
</ul>
</span><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> I'd be interested in reading more about that. </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> We have a white-paper here. </span><a href="http://www.azulsystems.com/products/whitepaper/wp_otc.pdf"><span style="font-family: Verdana;">http://www.azulsystems.com/products/whitepaper/wp_otc.pdf</span></a><span style="font-family: Verdana; background-color: #ffff80;" /></span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Ok, I'm done boasting. :-) </span></span><br /><br /><span style="font-family: Verdana;"><strong>David A Moon wrote:</strong> It's plenty worth boasting about. <br /><br />--Dave Moon </span><span style="background-color: #ffff80;"><br /><br /><span style="font-family: Verdana; background-color: #ffff80;"><strong>Cliff Click wrote:</strong> Thanks, Cliff </span></span>

<br /><br />===========================================
<br />11/20/2008 - A little more:
<br /><br /><strong>David A Moon wrote:</strong>
Then I guess your thread scheduler is not aware of the partitioning of the hardware and doesn't try to put related threads onto cores in the same chip.

<br /><br /><span style="background-color: #ffff80;"><strong>Cliff Click wrote:</strong>
The thread scheduler is well aware of the hardware layout. It's no mean feat to schedule 800+ cpus with 1000's of runnable threads.

<br /><br />Instead, nobody has a clue what counts as "related threads". Certainly there's no Java-level hint. You might imagine doing L2-to-L2 miss-rate profiling and try to decide which set of threads are communicating through memory instead of through an L2. It's a non-trivial task for which there are some non-trivial academic papers out there showing modest gains for lots of hard work.
</span>

<br /><br /><strong>David A Moon wrote:</strong>
But you said your memory access time is mediocre. That's the first symptom of the uniform memory illusion breaking down. Then you find that no matter how big you make your caches they aren't big enough. Then you find that each new generation has slower memory, as a ratio of memory access time to processor speed. Then you find that programs are too slow unless they are written with detailed knowledge of the hardware memory structure. You probably know way better than me how far we are down that slope today.
 
<br /><br /><span style="background-color: #ffff80;"><strong>Cliff Click wrote:</strong>
Split out the notions of 'uniform' from 'fast'. Everybody pays attention to caches (or should): the mindset is "<span style="text-decoration: underline;">memory is slow</span>". But only some people pay attention to memory layout: the mindset is "<span style="text-decoration: underline;">memory is uniform</span>".

<br /><br />For some parallel scientific programs, with well understood access patterns people can build both the machine and the program such that most accesses are "near" the CPU. In such a case it makes sense to build a NUMA machine: "near" accesses are faster than "far" accesses.

<br /><br />Azul builds are gear as 'UMA' because our programs do not have well understood access patterns. Instead, the patterns are mostly random (after cache filtering) and so it makes sense to have uniform mediocre speeds instead of 1/16th of memory fast and 15/16ths as slow.

<br /><br />In both the NUMA and UMA case everybody has given up on the illusion that memory is fast. Instead we all acknowledge the presence of caches; we prefetch (or expect the hardware to do so); we know that "1st access is slow, 2nd one is free"; we do cache-smart object layout and so on. No illusions about memory there.
</span>

</div></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2008/11/a-brief-conversation-with-david-moon.html</feedburner:origLink></entry>
  <entry>
    <title>Some Source Forge Threads on NBHM</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/U0wjJIEHtBQ/some-source-forge-threads.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=58472094" title="Some Source Forge Threads on NBHM" />
    <id>tag:typepad.com,2003:post-58472094</id>
    <issued>2008-11-13T11:10:14-08:00</issued>
    <modified>2008-11-13T19:10:14Z</modified>
    <created>2008-11-13T19:10:14Z</created>
    <summary>A couple of interesting threads running on high-scale-lib's forums I thought I'd post to a wider audience. Josh Dybnis (jdybnis) - 2008-11-10 11:46 I wrote an implementation of the non-blocking hash map in C. I put it up at http://code.google.com/p/nbds/...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>A couple of interesting threads running on <a href="https://sourceforge.net/projects/high-scale-lib/">high-scale-lib's forums</a> I thought I'd post to a wider audience.</p><p><hrule /></p><p /><p><strong>Josh Dybnis (jdybnis) - 2008-11-10 11:46</strong><br />I wrote an implementation of the non-blocking hash map in C. I put it up at <a href="http://code.google.com/p/nbds/">http://code.google.com/p/nbds/</a></p><p><br /><strong><span style="background-color: #ffff80; font-family: Arial;">Cliff Click (cliffclickProject Admin) - 2008-11-10 17:59</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">I never know what to make of such statements - </span></p><p><span style="background-color: #ffff80; font-family: Arial;">- if the table has "almost no synchronization" then it's not non-blocking (but it might be highly concurrent). Non-blocking has a specific meaning to the academic community; if you say "non-blocking" you should also be able to say "no synch". If you are based on NBHM then you need to provide e.g. a non-blocking malloc and a way to reclaim storage. Both are things NBHM uses GC for, and are not-trivial to get non-blocking. NBHM does indeed do very little allocation - so the blocking embedded in "malloc" will probably never hurt you. </span></p><p><span style="background-color: #ffff80; font-family: Arial;">- No "C" program can ever implement any reasonable concurrent program, because "C" has no memory model (and the proposed model is a major punt and not much better than nothing). Instead you can implement NBHM using "C using gcc_ver_XXX on X86_model_YYY". If you change the "XXX" or the "YYY", then all bets are off, and the program can fail. In short, the "volatile" keyword in C does not have nearly the semantics as it does in Java, and C compilers routinely do things with "volatile" variables that Java does not allow. Changing the C compiler version or the underlying hardware can break things.  </span></p><p><span style="background-color: #ffff80; font-family: Arial;">Despite my complaints a C version of NBHM on X86 is surely a very useful and non-trivial thing. </span><br /><span style="background-color: #ffff80; font-family: Arial;">Congrats! </span><br /><span style="background-color: #ffff80; font-family: Arial;">Cliff </span></p><p><br /><strong>Josh Dybnis (jdybnis) - 2008-11-11 01:12</strong><br />Thanks for the feedback. I really appreciate you sharing the algorithm. It's one of the most elegant pieces of work I've come across in a few years.  </p><p>- In theory, I agree 100% with you about writing concurrent programs in C. And people need to be clear about the limitations before attempting such a thing. In practice I think it is reasonable to do "porting" work for each platform/compiler. My implementation is targeted to gcc 4.x on 64-bit x86 (Intel does specify a memory model in their Software Developer's Manual). I think it could be ported to most other architectures without major modifications. </p><p>- My implementation includes a non-blocking malloc() and a method to do safe memory reclamation.  </p><p>- My statement about "almost no synchronization" was badly stated. I was referring to my malloc implementation and not the NBHM itself. Also I was also using "synchronization" to refer to x86 operations having the "lock" prefix, not any non-blocking property of the program. </p><p>- A couple of more caveats about my malloc and the safe memory reclamation. They aren't the most robust implementations. They need a bit of work. But they are non-blocking, and should be very low overhead on systems without too many cores (say less than 32). The malloc does call mmap() from multiple threads, it could block in there. So I guess technically it is not "non-blocking". But that is an artifact of the implementation. In theory one could use an asynchronous method of invoking the kernel to make it truly non-blocking.  </p><p>Josh</p><p><strong><span style="background-color: #ffff80; font-family: Arial;">By: Cliff Click (cliffclickProject Admin) - 2008-11-11 07:41</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">How do you know when e.g. the NBHM Big Array is free? </span><br /><span style="background-color: #ffff80; font-family: Arial;">Cliff </span></p><p><br /><strong>By: Josh Dybnis (jdybnis) - 2008-11-12 17:06</strong><br />&gt;How do you know when e.g. the NBHM Big Array is free?  <br />Missed this earlier.  <br />After I promote a new array the old one gets queued up for a deferred free. (The same with keys in the old array that are tombstoned and were not copied to the new array.) The frees actually happens at a later point when all the threads are guaranteed not to be holding references to the memory. I accomplish this using a technique from RCU. Each thread that might be holding a reference periodically announces that it is in a quiescent state. I have a way of detecting when every thread has made such an announcement. At that point I can safely free the array. My implementation of all that is not particularly robust. If a thread blocks, and never announces it is in a quiescent state nothing more will ever get freed. This is straightforward to fix with a more complex implementation. I'm having an on-going discussion about it on comp.programming.threads. <a href="http://groups.google.com/group/comp.programming.threads/browse_thread/thread/702aedc141b34fa1#%20">http://groups.google.com/group/comp.programming.threads/browse_thread/thread/702aedc141b34fa1# </a></p><p>Also, one could also use GC in a C program too. Or one of the safe memory reclamation techniques documented in the academic literature (e.g. hazard pointers, smr, etc.).  </p><p><br /><strong><span style="background-color: #ffff80; font-family: Arial;">By: Cliff Click (cliffclickProject Admin) - 2008-11-12 17:21</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">Sounds good. Just was wondering. </span><br /><span style="background-color: #ffff80; font-family: Arial;">Lack-of-GC generally brings about various broken attempts to free memory in concurrent programs. </span></p><p><span style="background-color: #ffff80; font-family: Arial;">Cliff <br /></span></p><p>---------------------------<br /><span style="background-color: #ffff80; font-family: Arial;" /></p><p><span style="background-color: #ffff80; font-family: Arial;"><strong>By: Cliff Click (cliffclickProject Admin) - 2008-11-11 13:52</strong><br />&gt; In theory, I agree 100% with you about writing concurrent programs in C..... I think it could be ported to most other architectures without major modifications.  <br /><br />Itanium. :-) <br /><br />Cliff <br /><br /><br /><strong><span style="background-color: #ffffff; font-family: Arial;">By: Josh Dybnis (jdybnis) - 2008-11-11 19:47</span></strong><br /><span style="background-color: #ffffff; font-family: Arial;">Yeah Itanium, the Alpha would be even worse. It is your NBHM algorithm though; I don't think there is anything dependent on having a strong memory model. You would just have to put memory barriers in the right places. I'm not saying that it would be easy to figure out the optimal solution, but it wouldn't require any major rework to the code. The lazy way to do it would be to change all the CAS calls to include a full memory fence. Doing it that way would constrict performance of course, but it would be correct, and probably still more scalable than a traditional lock-based hash table.</span></span></p><p><span style="background-color: #ffff80; font-family: Arial;"><br /><strong>By: Cliff Click (cliffclickProject Admin) - 2008-11-11 21:25</strong><br />The NBHM State Machine is "correct" independent of any memory model. I still need fences &amp; a Memory Model around the promotion-logic &amp; table-copy areas. It's here with an IA64 would cause you grief. If you didn't need table-copy (i.e., pre-made fixed-sized table), then it would be correct on any shared-memory machine.  <br /><br />The NBHM as presented as stronger semantics than a minimalistic non-blocking hash table - Keys are treated with Java volatile semantics, so e.g. you can make &amp; install a fresh Key on 1 thread, and have a 2nd thread see the Key from the table and also see the Key's contents so it can correctly run a Key compare. A naive C IA64 implementation can screw up there, although the X86 strong ordering saves it. Same issue happens with the returned Value (made in Thr1, returned in Thr2, but does Thr2 see the initialized contents of the Value?). </span></p><p><span style="background-color: #ffff80; font-family: Arial;">Here's a sample IA64 table-copy screwup: <br /></span></p><ul>
<li><span style="background-color: #ffff80; font-family: Arial;">- T1 copies last K/V from old table to new table. </span></li>
<li><span style="background-color: #ffff80; font-family: Arial;">- T1 incrs the last count of copied values. Bug: No ordering between the stores. </span></li>
<li><span style="background-color: #ffff80; font-family: Arial;">- T2 sees the last count (read of T1's 2nd store),  </span></li>
<li><span style="background-color: #ffff80; font-family: Arial;">- T2 promotes (stores new table as default) </span></li>
<li><span style="background-color: #ffff80; font-family: Arial;">- T3 sees the new table (read of T2's store) </span></li>
<li><span style="background-color: #ffff80; font-family: Arial;">- T3 probes for K, but does not see T1's store. </span></li>
<li><span style="background-color: #ffff80; font-family: Arial;">- T3 reports K as not-in-table, when in fact it is. </span></li>
</ul>
<p><span style="background-color: #ffff80; font-family: Arial;">Cliff <br /></span></p><p>---------------------------</p><p /><p><strong>By: Josh Dybnis (jdybnis) - 2008-11-12 16:36</strong><br />What happens if a thread copies a slot and then dies before updating _copyDone? Does the copy never get promoted?<br /><br /><strong><span style="background-color: #ffff80; font-family: Arial;">By: Cliff Click (cliffclickProject Admin) - 2008-11-12 16:39</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">The fall-back case is that some thread (all threads?) scan the entire array and discover that all things are copied, despite any failed counts. At this point the "discovering" thread can promote. </span></p><p><br /><span style="background-color: #ffff80; font-family: Arial;">Cliff </span><br /><br /><br /><strong>By: Josh Dybnis (jdybnis) - 2008-11-12 17:15</strong><br />I'm missing where that is in the implementation. I see where a thread enters panic-mode and does the scan, but it looks like it always compares its workdone with _copyDone and the size of the array. So if every slot is copied, but a thread didn't get around to updating _copyDone, none of the panicking threads will ever terminate their scans.</p><p><br /><br /><strong><span style="background-color: #ffff80; font-family: Arial;">By: Cliff Click (cliffclickProject Admin) - 2008-11-12 17:23</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">Welcome to the difference between algorithm and implementation. :-) </span></p><p><span style="background-color: #ffff80; font-family: Arial;">Feel free to post a fix. It should be a 1-liner. </span><br /><span style="background-color: #ffff80; font-family: Arial;">Cliff </span><br /><br /><br /><strong>By: Josh Dybnis (jdybnis) - 2008-11-12 17:44</strong><br />Heh. I have a similar bug in my implementation.</p><p />
<hline><p /><p>=============================<br />
</p><p><strong>compareAndSwap memory fence (New)</strong></p><p><strong>Andrew Trick (andrewtrick) - 2008-11-12 11:16</strong><br />My interpretation of your slide-set/blog comments is that the NBHM state machine--ignore table copying for now--does not require any memory fences or store ordering. CAS_key, CAS_val gets the job done as long as your machine has atomic compare-exchange. Correct? <br /> <br />In fact, in NBHM source you bypass util.concurrent.atomic, calling Unsafe.compareAndSwapXXX presumably to avoid the memory fence. However, the Sun JDK and HotSpot assumes that Unsafe.compareAndSwapXXX enforces a full memory fence. So would it be fair to say that you've made Azul-specific changes to the JDK/JVM to optimize NBHM? Should Sun get rid of the full-fence semantics of Unsafe.compareAndSwap before more Java/JDK code begins to rely on it? <br /> <br />-Andy<br /><br /><br /><strong><span style="background-color: #ffff80; font-family: Arial;">By: Cliff Click (cliffclickProject Admin) - 2008-11-12 11:20</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">Yes, I need no fencing (ignoring table copy) for some kind of hashtable semantics. You DO need very carefully placed fences to get CHM semantics - which amount to 'volatile' semantics on values used in put &amp; get calls. </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">I bypassed the Atomic classes to get the no-fencing CAS. The Unsafe calls specifically have weak no-fencing versions. On an X86 &amp; Sparc you are stuck with the fencing anyways, but on Azul the weak no-fence version actually does not fence. </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">So I did not do any Azul-specific changes to the JDK here (Azul's JDK IS different from Sun's - but not here). </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">Cliff </span><br /><br /><br /><strong>Andrew Trick (andrewtrick) - 2008-11-12 19:05</strong><br />Thanks for the reply and clarification. <br /> <br />The JDK does indeed have AtomicReference.weakCompareAndSet. Sorry to be pedantic about API details but... What I meant by "JDK assumes a fence" is that AtomicReference.compareAndSet for Sun is identical to weakCompareAndSet, so the underlying Unsafe.compareAndSwapObject must have a fence. What I meant by "HotSpot assumes a fence" is that the Unsafe.compareAndSwap intrinsic generates abstract memory barriers, which you would have to materialize as a real barrier on any machine that doesn't have implicit CAS fences. <br /> <br />But really I was just trying to understand the intention behind your NBHM source, which is pretty clear now. It's not important whether some careful JVM porting is needed to make it fast on certain h/w.</p><p /><p /><p><strong><span style="background-color: #ffff80; font-family: Arial;">By: Cliff Click (cliffclickProject Admin) - 2008-11-12 21:35</span></strong><br /><span style="background-color: #ffff80; font-family: Arial;">Letsee... </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">- sun.misc.Unsafe does not specify any memory ordering properties for compareAndSet. </span><br /><span style="background-color: #ffff80; font-family: Arial;">- AtomicReference.compareAndSet is doc'd to have volatile semantics, and operates on a volatile variable. </span><br /><span style="background-color: #ffff80; font-family: Arial;">- AtomicReference.weakCompareAndSet is doc'd to NOT have volatile semantics. </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">A legit JDK implementation could then </span><br /><span style="background-color: #ffff80; font-family: Arial;">- not fence for sun.misc.Unsafe </span><br /><span style="background-color: #ffff80; font-family: Arial;">- but yes fence around volatile ops, including Unsafe ops on volatiles. </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">This is exactly the intent of the docs &amp; implementation, and it specifically allows Azul to not fence on sun.misc.Unsafe &amp; still meet the spec - which we do (on both fronts: not-fence and meet-the-spec). We also typically do not suffer from other peoples' bad assumptions here - where they expect Unsafe to fence and are surprised when it doesn't. I suspect Unsafe.CAS users are a rare breed. </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">Removing the fence for Unsafe ops is a tidy win for Azul on some highly parallel contention-heavy codes. </span><br /><span style="background-color: #ffff80; font-family: Arial;"> </span><br /><span style="background-color: #ffff80; font-family: Arial;">Cliff </span></p><p /><p /><p /><p>That's all,</p><p>Cliff</p><p /></hline></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2008/11/some-source-forge-threads.html</feedburner:origLink></entry>
  <entry>
    <title>Cliff's New England Fall Travel Classic </title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/890ia1whK4s/cliffs-new-england-fall-travel-classic.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=57690117" title="Cliff's New England Fall Travel Classic " />
    <id>tag:typepad.com,2003:post-57690117</id>
    <issued>2008-10-28T14:13:42-07:00</issued>
    <modified>2008-10-28T21:13:42Z</modified>
    <created>2008-10-28T21:13:42Z</created>
    <summary>Or how I went to New England to view the Fall Colors in Perfect Weather, and All You Got was this Lousy Blog Azul sent me to Manhattan (my least favorite place on the planet) to speak at the NYC...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="xhtml" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
<div xmlns="http://www.w3.org/1999/xhtml"><p /><h2>Or how I went to New England to view the Fall Colors in Perfect Weather, <br />and All You Got was this Lousy Blog</h2><p><strong><span style="font-size: 16px; font-family: Arial;"><span style="font-size: 18px;" /></span></strong>Azul sent me to Manhattan (my least favorite place on the planet) to speak at the <a href="http://www.javasig.com/meeting/home.xhtml">NYC Java Users Group</a>.  I discover that the earliest flight out of San Jose (plus 3 time zones) gets in NYC at 4pm - and that you can't get from a NYC airport to downtown Manhattan in 2hrs during rush hour - so I can't make a 6pm talk if I leave San Jose on the same day.  Sigh - it's the Red-Eye for me.</p><p><strong>Wednesday: </strong>I end up on the red-eye from Tuesday the night before and arrive at JFK at a delightful 6:30am local time.  No sleep for me.  I've done the NYC subway system a number of times so it's not much problem for me to buy a MetroCard and make my way to downtown.  I arrive at the hotel at maybe 8am in the morning.  Lo!  The hotel is full and they have no room for me - and indeed have no room until 3pm. Certainly by then I am one shabby dude - unshaven for 36 hours, same clothes, no shower, no sleep - not quite up (down?) to the standards of the street people hanging around the hotel but definitely looking grubby.  And I'm exhausted.</p><p>3pm arrives and I finally get my room and my shower, change clothes, lay down.... bad mistake.  I barely scrape up enough brain cells to force myself out of bed, and stumble, shaking from sleep deprivation, to the clock: only 90 minutes gone in a blink.  I splash some water on my face and go to meet the JUG.</p><p>The talks go very well; lots of good questions from an animated audience.  I presented <strong><a href="http://www.azulsystems.com/events/javaone_2008/2008_Challenges.pdf" target="_blank"><strong>Challenges and Directions in Java Virtual Machines</strong></a></strong> and <strong><a href="http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf" target="_blank"><strong>Towards a Scalable Non-Blocking Coding Style</strong></a></strong>, and I got quite animated myself - all in all very good.  Dinner was some close-by expensive steak joint, but excellent food.  NYC definitely does steaks well.  Finally the hotel bed looms... and dawn breaks.</p><p><strong>Thursday: </strong>Today is 2 customer visits, one planned and one directly due to the JUG talk the evening before.  Alas the new customer has micro-second response time requirements; we can only provide millisecond response times (albeit with massive throughput).  The planned customer visit goes much better; it's a tools&amp;libraries group looking for advice on scalable Collections.  After that I take a train to visit my Uncle's farm in Connecticut.  Penn Central is full of these little train ticket kiosks; I happily approach one.  It takes my credit card, works me down through the menus to New London, CT - then hangs.  I try another; this one takes 30sec before posting loudly "Out of Servce".  I'm 0 for 2 so I sigh and get in the infinite line to see a human.  The line takes a full hour to buy a ticket.  Then I go wait for my (now a full hour later) train.  Just when it gets near the head of the list of arriving trains, the train board reports my train as "delayed: 10 minutes".  This goes on for 20 minutes before it changes to "stalled", then finally "now on track #9".  But all goes well after that, my Uncle picks me up in New London and we drive off to the farm.</p><p><strong>Friday, Sat, Sun:</strong> The next few days are a real delight. The weather is perfect - highs maybe 65, lows in the 50's, breezy, no clouds and the trees in full color.  My cousins are all married and with kids all the same ages as my kids so it's a great time to visit.  I go to a soccer game, the end of a 5K run, a local chocolate &amp; wine tasting, buy from a craft fair and shop at the local antiques shops; help (interfere?) with some hay harvesting, take a stab at fixing 2 Windoze machines (bizarre XP SP3 things about not being able to install Beethoven.wm3 after about of hour of disk grinding), and just plain visiting.</p><p><strong>Monday: </strong>I pack up my suitcase, both physical and mental, and prepare to jump back into the fray.  The flight to Nashville is easy - except for my last child is finally going off to public school (we've home-schooled all our kids for years with great success), and my wife is in a frenzy getting him ready - and I'm a thousand miles away.  So I get to hear about it between plane flights, in airports, in the cab, etc. The Nashville hotel is nice, and only $150/night for a room twice the size of the NYC $500/night room, and a much nicer neighborhood.  </p><p><strong>Tuesday, Wednesday, Thursday:</strong> <a href="http://www.oopsla.org/oopsla2008/">OOPSLA</a>.  Paid $710 at the door - I forgot to pre-reg.  I had great a lunch conversation Doug Lea, Bill Pugh, &amp; Ben Zorn.  I take lots of notes.  Lots of great hallway and dinner conversation covering reader/write locks, the state of the JCP &amp; Exec Committee, whether Sun should drop the JCP process altogether, whether the market will implode &amp; we should all run up our credit cards to the max, end of civilization, etc. Ben tries to convince me that Microsoft can be trusted as the sole holder of all my critical personal information (sorry Ben) and that totally transparent sync'ing of all my portable &amp; desktop devices is coming Real Soon Now.  Perhaps - but I'm <em><strong>never</strong></em> going to trust my personal info to some single central server - even if it does get old-school Ma-Bell-land-line-style 5-9's reliability for both the server and the sync'ing.  Too many companies have said "it's safe with us" and been wrong.  Too much motivation for crooks if all the eggs are so close together.  Other conversations that stuck in my mind:</p>

<ul>
<li>Azul could Open Source or JCP our hacks for <a href="http://blogs.azulsystems.com/cliff/2007/03/finding_data_ra.html">+UseLockedCollections</a> (see about 1/2-way down the blog) and +UseCheckedCollections.  This could be a Good Thing for Azul as it will reduce the number of apps which fail "out of the box" on Azul as well as raise awareness of the issue.  Note that all major App-Servers (and many a large 3-letter-acronym company's major product) fail on Azul without +UseLockedCollections (and runs great with it) and the issue has nothing to do with Azul except we have more cores.  I.e., I predict these crashes will soon be coming to an X86 multi-core Near You!  I can hand out stack traces for any interested parties; email me for details (and yes, bug reports have been filed but between me and you I suspect most of them get filed into the round bin). </li>
<li>Bill Pugh offers that at Google they have a 3rd option: log&amp;continue.  The logging is only invoked where now I would throw a ConcurrentModificationException - but there's no standard JDK place for logging such things.  We certainly could add an Azul logging option where the output is available on a RTPM data-race screen.</li>
<li>Doug Lea desperately needs 3 things to make his Fork-Join parallelism have a sane API - note that Fork-Join has a really nice approach to handling large complex parallel problems - there's no real limit on scaling plus it has a fantastic story on automatic load balancing which is typically really crucial for us.
<ol>
<li>Tail-call optimization.  Part and parcel of the solution when doing recursive decomposition.</li>
<li>Some kind of auto-boxing-removal story.  See also <a href="http://openjdk.java.net/projects/mlvm/">John Rose's DaVinci Machine</a>.</li>
<li>Some kind of forced inlining story, so that a complex F-J iterator gets cloned at the site where the loop-body is declared.  This will allow the (highly megamorphic) inner loop function to be specialized per call site, inlined into the iterator loop and decently optimized.</li>
</ol>
</li>
<li>Several people asked for a Weak-Key'd AND Weak-Value'd NonBlocking HashMap.  
</li>
</ul>
<ul>


</ul>
<p><strong>Friday: </strong>I make last minute plans back to NYC to visit the tools &amp; libraries customer group again.  It's a 6:45am flight out of Nashville, which means be at airport by 5:am so leave hotel at 4:30am and get up at 3:30am.  Sign, another day with no sleep.  The flight to NYC is rocky as all heck, the clouds have finally arrived to ruin my days.  We bump for 2hrs out of a 2:45hr flight.  Then it's AirTrain to Jamaca Station, Long Island Railroad to Penn Station, Subway#3 to Wall Street - I'm coming quit good at navigating NYC.  I'm almost an hour early to my customer visit - which is really an open-ended tutorial on Azul's profiling tool: <a href="http://www.azulsystems.com/e2e/">RTPM</a>.</p><p>These guys are doing some fun stuff.  They have their own versions of the standard JDK collections where the expected collection size is 1million to 10+million elements.  They have <em>very</em> efficient striped iterators (100x speedups on a 100-way Azul box) and are looking at Doug Lea's Fork-Join for describing their inner loop closures.  RTPM goes a long way to helping them figure out what's going on.</p><p>There's also a nifty in-memory DB thingy where we discover that turning on "-tiered" mode (contrast to HotSpot's -client and -server modes) is worth a stunning 5x speedup (Azul usually sees something like a 15% speedup for -tiered).  We try to make the in-memory DB run that fast under a ReadWriteLock.  The customer has a very light-weight roll-your-own version (new in Java6: lots of new features in the JDK <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html">ReentrantReadWriteLock</a> class which slows it down by 50%).  The customers' lock doesn't stripe the reader-counts and instantly suffers massively on the highly contended single reader count word (which the JDK version does as well).  I step their engineer through details of reader-lock-striping; it's definitely ticklish stuff to get right.</p><p>Finally 5pm arrives and everybody is going home - although for me "home" is really a subway ride to Penn Station, then a train back to New London, CT for another weekend on the farm.  I am sooooo ready to be back in San Jose.</p><p><strong>Saturday, Sunday: </strong>Another great weekend on the farm; I finally get to see some rain (it's not rainy season in San Jose yet so I haven't see rain since March).  The wind blows down a tree over the power lines and we lose power from 9pm Sat to 5am Sunday (when my room lights pop on and wake me up, grumble).  With the power out and the internet gone it really feels like a farm.  On Sunday I take the train in to Boston, then have dinner with David Bacon at Legal Seafoods &amp; take the T to my hotel.  Gosh I almost feel like a savvy New England traveler.  </p><p><strong>Monday </strong>is the 2009 VEE PC meeting.  Despite missing 1/3 of the PC members (two have new babies, one got food poisoning, several had other issues that made them dial-in only for a few minutes at a time), the meeting went very well.  We got done early and I got to take the new Boston T Silverline to the airport - it's labeled like a subway but it's really an electric bus (the long bendy kind) in it's own bus-sized tunnel.  Logan was easy, the flight was smooth and I made it home (finally) at 1am local time.  It's good to be home.</p><p>
</p><hrule />
<p style="font-family: Arial;" /><h1>OOPSLA</h1><p><strong>Some (very raw) OOPSLA notes.  As usual with these notes, it's my stream-of-conscious thinking with a heavy Cliff bias.  Caveat Emptor.<br /></strong><br />========<br />I sat through some talks about new language designs/features over JVMs - they looked klunky.  I liked this one: <a href="http://www.oopsla.org/oopsla2008/program-overview/research-papers.html#res0000068">Mixing Source and Bytecode</a>.  Sorta like GNU ASM with C++, except for Java.  The bytecodes are fully verifiable.  This allows you to write, i.e., a naked monitorenter bytecode - something you can't do with pure Java.  Useful, e.g., for hand-over-hand locking or read/write locks.  Can nest Java &amp; bytecodes so can use Java to build e.g. string constants to be used in 'ldc' bytecodes.</p><p><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/tuesday.html#res0000061">Tolerating Memory Leaks</a> - <a href="http://www.cs.utexas.edu/%7Emikebond/">Mike Bond</a> - (graduating this fall, Go Mike!)<br /><br />Great story about a memory leak in a 10000 line C# Darpa Grand Challenge vehicle - would run out of memory in 45minutes - Quick Fix: stop truck &amp; reboot machine after 40 minutes.  Alas, in the real challenge (and not the test bed) garbage accumulated faster &amp; machine crashed after 28 minutes (while truck was still rolling and so IT ran off course &amp; crashed). <br /><br />I've reported on this before (Tool is called 'Melt') and liked it then as a Neat Idea.  He basically pickles to disk stale "leaking" objects, and can tolerate leaks until he runs out of disk space. He's using software read barriers that are costing them 6% app slowdown using Jikes 2.9.2.  The barrier is a bit-test, then if it's set they have to CAS it off:<br /><strong><span style="font-family: Courier;">  b = a.f;      // load object field</span><br /><span style="font-family: Courier;">  if( b &amp; 1 ) { // low-bit is set?</span><br /><span style="font-family: Courier;">    b &amp;= ~1;    // clr low bit</span><br /><span style="font-family: Courier;">    if( b &amp; 2 ) ...slow path...</span><br /><span style="font-family: Courier;">    CAS a.f,b   // clr in memory as well; no need to retry failure</span><br /><span style="font-family: Courier;">  }</span><br /></strong>So this little-bitty read-barrier costs 6% at the app-level???  I suspect a failure to optimize in the JIT.<br /><br />Cool - most leaks are in HashMaps - <em>and by </em>stale objects are never accessed again EXCEPT to re-hash(!?!?!) the table when it grows.  Since the objects are actually accessed (by the re-hash logic) they appear to be live - but the only use is to re-hash them.  Perhaps we'd like a HashMap where re-hash does not involve calling hashCode but just uses a cached prior copy of the hashCode, and we can avoid calling hashCode on a stale key (you still have to copy the K/V to the new table but that can be done without touching the Key again).<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/tuesday.html#res0000111">Jolt - Reducing Object Churn in Large Programs</a> - Matthew Arnold (IBM labs)<br />(why not a simple Escape Analysis or a Gen-GC???)<br />(but also: this is why SBA 'works' - there IS lots of local churn)<br /><br />Implemented in IBM's J9 &amp; run on large programs.  Better than Escape Analysis (EA) alone?  Getting up to a 15% speedup.  Some EA's do better: add context &amp; go up call-stack.<br /><br />Really it's doing a EA on the root of a call-tree that contains churn.  Same as C2 (HotSpot's -server) would do after inlining.  They are doing some fairly tricky analysis to decide where to run their more aggressive EA, using dynamic info to decide that they have a 'large enough' subprogram to get the right amount of context.  Then they force inlining to get the right 'context' to get good a EA; inline children with the largest ratio of bytes-allocated to code-size (trying to avoid code-bloat blowout).<br /><br />Looks a lot weaker than Azul's Escape Detection solution - we'd first let SBA identify places where stack-local objects are being churned, and then we'd inline &amp; do a trivial EA.  But IBM has a working solution (in the lab) for deciding what to inline to enable EA (and J9 has an existing compile-local EA which is working fine).<br /><br />EA by itself in J9 - from 1% to 9% of objects eliminated.  Just not enough context to do good EA.  With JOLT seeing 5%-30% objects eliminated (30% is for SpecJBB2005).  (Note: Azul's SBA sees 60-70% of all objects get stack-allocated).<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/tuesday.html#res0000110">QVM- Dynamic Analysis for Debugging</a> - Matthew Arnold (IBM labs)<br /><br />Two camps: </p><ol>
<li><em>testing </em>- high overhead tolerable in QA; deep properties relating to program correctness</li>
<li><em>production </em>- must be low overhead (or not used) so limited info comes back</li>
</ol>
<p>How far can we go with low-overhead but checking deep program properties.  Ask user: "how much overhead is ok for this info?"  Give as much info as possible within this budget - so may miss errors, etc.  But approximately good info.<br /><br />So have an "overhead manager" to meet users'budget.  Then have 3 analysis that can be managed: <span style="text-decoration: underline;">Typestate properties</span>, <span style="text-decoration: underline;">heap probes &amp; assertions</span>, and finally running some <span style="text-decoration: underline;">Java assertions</span>.  Overhead manager watches the sampling costs &amp; adjusts sampling rate to keep costs within budget.  Must have fine-grained timers, must have total CPU time across all things; must have direct access to RDTSC or else accumlative errors kill you.  Total CPU was hard for them to get from Linux (I believe Azul has this via azps/aztop).  <br /><br />Different program points are sampled more or less heavy.  Really, sample at 100% at each program point until the OHM starts to see bad overheads for some program point - then start lowering the samping rate.  So init code gets 100% coverage, but hot loops maybe 1% or 0.1% sampling.  Thus get very good coverage of rare events and "good enough" coverage for highly repetitive events.<br /><br />Type-state sampling is performed not at the method-level, but at the object granularity (steal a bit in object header - set a bit to track an object).<br /><br />Heap-profiling asserts, e.g. - "is_shared(Object <em><strong>o</strong></em>)" asks the GC "is Object <strong><em>o</em></strong> shared or am I the only Thread with a ptr to <em><strong>o</strong></em>".  Requires a full heap traversal.  Put it in the Overhead Manager, which throttles the full GC rate into budget. <br /><br />Experiments: it works.  Overhead IS kept within budget for these large apps &amp; complex tests.  Able to find several bugs in interesting large apps.<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/wednesday.html#res0000063">Contention-Aware Scheduler: Unlocking Execution Performance in Multithreaded Java Programs</a><br /><br />Have a global per-thread lock-held-count.  If it's above zero, hint to scheduler to avoid losing the quantum; allow up to 5 quantums to release all locks.  5 quanta is typically enough to release most Java locks.  Also - raise priority of threads holding locks; more locks held ==&gt; raise priority more.<br /><br />ALSO - if T1 &amp; T2 are sharing the same lock/object then schedule them on the same processor.  Actually limit this heuristic to just simply enqueuing T1 &amp; T2 on the same runqueue (if already all processors queues are busy).  (You don't want to deny a runnable T2 a shot at an idle CPU, but if all CPUs are already time-slicing...)<br /><br />Thread Clustering: record contended sync events in a per-thread ring buffer.  Each time the ring-buffer wraps, summarize the per-thread per-lock counts and then compare them to other threads.  Look for affinities; create clusters of threads sharing locks.  Migrate whole thread clusters onto a single CPU's run-queue.  Increasing the size of the per-thread ring buffer lowers the rate at which clusters are recomputed (so can adjust the overhead; using 2K; reports 2-3% overhead (but is that per-thread or only-when-blocked, etc)).  Load balance threads not involved in clusters.<br /><br />Bench'd on SpecJAppServer2005, ECPerf on JBoss4 (and others).  Run on a 16-way AMD (8xdual).  Seeing a 20% to 40% contention-reduction (what's that?)  when using both scheduling notions (clusters of related threads, plus lock-aware quantums).  Seeing overall perf improvement by 5 to 15% on these big app servers.  Also seeing better scaling (up to their 16-way box); but not hugely better scaling.  Azul clearly gets better scaling here - but we might get even better with this kind of lock-contention-aware scheduling.<br /><br />Was pointed out that since threads sharing locks are now on the same CPU, it might be that the speedup is really because of cache locality of other shared objects (not just the locking ones).<br /><br />We might could do better by tracking L2-to-L2 miss rate and associating the misses with the threads on the CPUs at the time - and come up with some sort of "these 2 threads are talking to each other" metric.<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/wednesday.html#res0000042">Dynamic Optimization for Efficient Strong Atomicity</a><br />(Yet Another STM Paper; can you tell I'm not fond of STM?)<br /><br />Getting Strong Atomicity is expensive (well maybe or maybe not: see the <a href="http://clojure.org">Clojure</a> model).<br /><br />Looking at whole-program analysis to discover objects NOT being accessed inside transactions - plus some dynamic analysis to find that most objects do not participate in a transaction.  Then the STM does not need to annotate/track accesses to these objects.  <br /><br />NOTE that Clojure gets this "for free".<br /><br />They manage to get performance withOUT any XTN's (e.g. single-threaded SpecJVM98) to <strong>only</strong> be 10% slower.  It's very complex - involving compiling with "phantom barriers" around most memory updates, then stop-the-world code-patching to install barriers as he discovers objects are being used in XTN's, etc.<br /><br />========<a href="http://www.oopsla.org/oopsla2008/schedule/wednesday.html#res0000076"><br /></a></p><p><a href="http://www.oopsla.org/oopsla2008/schedule/wednesday.html#res0000076">Design and Implementation of Transactional Constructs for C/C++</a></p><p>Next talk is about Intel adding STM into C/C++ code - alas the new C/C++ memory model gives no semantics to racey programs and all large multi-threaded programs are racey.  Indeed the authors had to hack the handful of XTN-friendly programs they had access too to remove races.<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/wednesday.html#res0000008">Renaming for Java</a><br /><br />"isn't this solved already"?  </p><p>All major Java IDE's implement renaming for Java; a very basic and very important refactoring.  But all these IDE's have bugs.  He gives short demos &amp; spares no IDEs.  Basically: renaming is hard.  The lookup rules are complex.  Gives a nice javac/JAST hack for doing it right  (and sometimes you *must* give up, the renaming rules are so complex).<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/thursday.html#res0000028">Java Performance Evaluation through Rigorous Replay Compilation</a><br /><br />Show nice violin plots of 30 measurements of SpecJVM98 &amp; DaCapo.  About 10% variation overall, but more for e.g. mtrt.  It's Standard Issue: samplers &amp; tick counts (and scheduler variation) cause JIT compilations to occur at different times &amp; places; so the basic code-gen is different.  After this variant JIT compilation performance from run-to-run in the same VM is reliable faster or slower.<br /><br />Attempts to remove variation by fixing compilation load.  Replay compilation: 1st run, record compilation decisions.  Later runs: read in the "replay" and compile exactly the same way (ala Marty's FAM).  Actually, run the same benchmark over &amp; over; during the 1st run do all the compilation (because you can't compile until classes load, and loading requires execution).  Later runs in the same VM no longer are allowed to JIT.<br /><br />Issues with choosing a good compilation play (pick plan giving best score, or majority plan or....)  moves the non-deterimism from runtime to compilation-plan-choosing-time.<br /><br />Show significant differences amongst compile plans (yes, well experienced by me; often I get "good" runs and "bad" runs based on stupid compiler decisions).  <br /><br />Run 'q' benchmark iterations for performance, and also 'p' VM invocations yielding 'p' compilation plans plus the statistical variation during the 'q' runs-per-plan.  End up with p x q measurements (bleah!  Lots and lots of runs to control for variations!).  <br /><br />Show violin plots for 10 plans for jython.  A lot of variation is observed for most benchmarks; i.e. "the plan matters".  Compile plans have little overlap on which methods are the 'root' of compilations.<br /><br />Nice really good statistical analysis of the situation.  Can show that picking different plans can significantly change the results of apparently unrelated experiments (i.e., which GC is better).  Can show that about 10% of compile plan choices on jython will cause conflicting ("lying") results of which GC is better.  i.e., if I pick plan A then I can show MarkSweep is better than GenMS, but if I picked plan B then instead GenMS appears better than MarkSweep.<br /><br />Can do paired measurements: run GCxxx and GCyyy with plan A &amp; diff the results.  Diff the results for plan B, etc...  Now I can see if picking a plan makes a difference. Show that it does about 85% of the time.  </p><p><strong><em>Really shows: </em></strong>you should pick more compile plans before running the benchmark more iterations.  I.e., If my time budget requires me to pick between running the benchmark in the same harness 10 times vs 10-from-scratch runs, I need to do the from-scratch runs because this will capture the non-determinsm in the compilation plan.<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/thursday.html#res0000031">Analysis &amp; Reduction of Memory Inefficiencies in Java Strings</a><br /><br />Trade6 - 40% of heap is strings.  Many dup'd strings &amp; many unused literals.  Unused literals - due to error messaes for error paths never taken.  (but I suspect unused literals are not important for Azul because their count is fixed with the application and does not scale with heap size).  For Trade6 - about 1/2 of heap strings are wasted - 1/4 are dups and 1/4 are wasted in fragmentation.  Azul does not suffer fragmentation because we don't share string bodies.<br /><br />He proposes interning strings during GC (but this changes semantics ever so slightly - things that were not "==" before become "==" afterwards.  Have a "tenured string" hash table - long lived strings go also into this table; then young-gen strings can be checked against the table by GC.  Most of the experts in the audience want to claim that the change is ok - no self-respecting Java program will be broken by this change (myself included, but I don't want to put words in other peoples' mouths).</p><p>Note that Azul has already rewritten the String implementation to remove the offset &amp; count fields and we don't share String bodies.  This hack LOWERED our average String footprint by maybe 10% and brought us a similar speedup (10% smaller heap footprint means 10% smaller working set means 10% more effective D-cache usage, etc).</p><p>They also hacked the VM to deal with the large count of unused literals.  Make a special version of String which holds the constant-pool index and a null char array.  When exploding on the null-char-array access they look for this special String &amp; lazily make the body as needed.  Removes ~20,000 string literals from Trade6.  But note he is using tiny heaps - and the literals will not scale with heap-size, but the dup'd strings will.<br /><br />When run on SpecJVM98, DB becomes vastly faster but no change on the other benchmarks.  Reduces live heap on DaCapo by 10% of heap.  For Trade6 &amp; Tuscany reduce live heap by 20%.<br /><br />========<br /><a href="http://www.oopsla.org/oopsla2008/schedule/thursday.html#res0000053">Analyzing the Perfomance of Code-Copying Virtual Machines</a><br /><br />Interesting hack to speed up interpreter-only languages.</p><p><br />========<br /><a href="http://www.oopsla.org/oopsla2008/program-overview/practitioners-reports.html#pra0000016">Performance of translated COBOL programs on a JVM</a><br /><br />Some things translate nicely; can make all COBOL variables Java instance variables for thread-safety.  COBOL divisions become Java classes, etc.  COBOL "picture" clause specifies amount of storage to be allocated, but mostly has direct translation to e.g. Java's BigDecimal class or fixed-length Strings.  <br /><br />Can specify arrays-of-structs in COBOL; again fairly easy to make a Java struct array and fill in all structures (but lots of wasted space for Java object headers).  Actually seeing <strong>lots of wastage</strong> - gives 7x space blowup example.  Seeing vast wastage in practice.<br /><br />Can '#include' COBOL header files, often done for just one or two fields in the header.  But must eagerly construct all defined storage space.  <br /><br />COBOL subroutines include "open" and "closed" (normal) subroutines; open subroutines tend to be translated as ending with exception throws - result is most subroutines end in an exception throw.<br /><br />Opt 1: share common init'd String, essentially interning all Strings of blanks.  <br /><br />Opt 2: Same thing for BigDecimal for value zero but with different scale/decimal-point values.  Instead just return a common cached version.<br /><br />Opt 3: All internal fields of structs &amp; arrays are lazily constructed.  Use getters/settings, and get'ers lazily convert null's to real storage only if asked for - for each array element or field value.<br /><br />Opt 4: Remove extra throws; notice when calls just return normally and allow callee to just return.<br /><br />Opt 5: Translator by default uses a lot of reflection, they now make class-specific versions of many things to avoid reflection.<br /><br />Opt 6: Lots of time/date String manipulation.  So made some custom date/string conversion functions to match the specific COBOL requirements instead of using the default Java Date-conversion libs.<br /><br />Overall, seeing 10-20% perf improvement, with nearly 75% of objects no longer being allocated.</p><p /><p /></div>
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2008/10/cliffs-new-england-fall-travel-classic.html</feedburner:origLink></entry>
  <entry>
    <title>JVM Language Summit</title>
    <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/azulsystems/cliff/~3/PStEouaUW1o/jvm-language-su.html" />
    <link rel="service.edit" type="application/x.atom+xml" href="http://www.typepad.com/t/atom/weblog/blog_id=1244458/entry_id=56302173" title="JVM Language Summit" />
    <id>tag:typepad.com,2003:post-56302173</id>
    <issued>2008-09-29T16:22:38-07:00</issued>
    <modified>2008-09-29T23:22:38Z</modified>
    <created>2008-09-29T23:22:38Z</created>
    <summary>I spent last week at the JVM Language Summit, a small conference focused on non-Java JVM-based languages. It was populated almost solely by either JVM implementors or bytecodes-are-my-new-assembler types. It was a very fun &amp; geeky conference for me; some...</summary>
    <author>
      <name>Cliff Click</name>
    </author>
    <dc:subject>Web/Tech</dc:subject>

    <content type="html" xml:lang="en-US" xml:base="http://blogs.azulsystems.com/cliff/">
&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;I spent last week at the &lt;a href="http://openjdk.java.net/projects/mlvm/jvmlangsummit/"&gt;JVM Language Summit&lt;/a&gt;, a small conference focused on non-Java JVM-based languages.&amp;nbsp; It was populated almost solely by either JVM implementors or bytecodes-are-my-new-assembler types.&amp;nbsp; It was a very fun &amp;amp; geeky conference for me; some of the best technical discussions I've had in a long time.&amp;nbsp; My favorite philosophical talk was definitely &lt;a href="http://research.microsoft.com/~emeijer/"&gt;Erik Meijer&lt;/a&gt; talking about Fundamentalist Functional Programming.&amp;nbsp; In the end, he's all for non-pure side-effecting programming BUT he wants the type system to &amp;quot;stop lying&amp;quot; about side-effects.&amp;nbsp; We both agreed that we'd love to see a functional programming language which had types for side-effects (e.g. &lt;a href="http://www.haskell.org/pipermail/haskell-cafe/2008-September/047069.html"&gt;Haskell's IO &amp;quot;sin bin&amp;quot;&lt;/a&gt;), but with more elaborate side-effects.&amp;nbsp; I.e., if I'm writing my program with few-to-none side effects I might want to track them exactly (e.g. this function has type 'side-effects field _code') but if I call into the Swing or AWT libs I might want a type like 'side-effects all Java variables'.&lt;/p&gt;

&lt;p&gt;My favorite techy/JVM talk was &lt;a href="http://research.sun.com/people/mybio.php?c=540"&gt;Bernd's&lt;/a&gt; &lt;a href="http://research.sun.com/projects/maxine"&gt;Maxine VM&lt;/a&gt;.&amp;nbsp; &amp;nbsp;I think he's got a really good nearly pure-Java layer figured out for generating ASM code.&amp;nbsp; You want something that looks like running plain Java code (and does field reads &amp;amp; writes) that guaranteed must inline down to where the JIT turns it into atomic machine code loads &amp;amp; stores.&amp;nbsp; He's missing lots &amp;amp; lots of major features (e.g. good JIT, good GC) but the framework looks nice.&amp;nbsp; If we ever are to get a true JVM-in-JVM (and skip bootstrapping your JVM via a C++ compiler which is what HotSpot does), I think Maxine has the best chance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most desired JVM feature&lt;/em&gt;: &lt;strong&gt;invokedynamic&lt;/strong&gt;; this would cut huge chunks of evil code out of most JVM-based languages (but not Clojure &amp;amp; JPC, both of which target 1.5 JVMs more closely).&lt;br /&gt;&lt;em&gt;Most needed JVM feature without changing the spec&lt;/em&gt;: a trivial Escape Analysis suitable for dealing with endless Fixnum/BigNum math.&amp;nbsp; Most of these non-Java JVM languages do fixnum/bignum math (automatically overflowing fixed-int math into bignums) - which means every trivial bit of integer math creates a new Fixnum object.&lt;/p&gt;

&lt;p&gt;I showed off &lt;a href="http://www.azulsystems.com/e2e/"&gt;Azul's RTPM&lt;/a&gt; and got lots of oooo's and aahhhh's.&amp;nbsp; I convinced lots of people to sign up for academic accounts mostly for access to RTPM.&amp;nbsp; I was able to quickly diagnose a big problem w/&lt;a href="jruby.codehaus.org"&gt;JRuby&lt;/a&gt; (failure to inline a key trampoline, masked by +PrintInlining lying), and also got 20% speedup on the &lt;a href="http://asm.objectweb.org/"&gt;ASM&lt;/a&gt; library - the core bytecode parsing library in widespread use.&lt;/p&gt;

&lt;p&gt;My own talk was well received:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Brian Goetz wrote:&lt;/p&gt;&lt;blockquote type="cite"&gt;&lt;p&gt;your talk was both the most popular and highest rated, according to the comment cards.&amp;nbsp; &lt;/p&gt;&lt;/blockquote&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;a href="http://wiki.jvmlangsummit.com/pdf/36_Click_fastbcs.pdf"&gt;The slides are here&lt;/a&gt; although the postscript conversion mostly screwed up my transparent ovals.&amp;nbsp; Mostly I ran &lt;a href="http://pastebin.com/m648330ef"&gt;this trivial program&lt;/a&gt; on Azul's JVM using these languages: &lt;a href="http://www.scala-lang.org/"&gt;Scala&lt;/a&gt;, &lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt;, &lt;a href="http://www.jython.org/"&gt;Jython&lt;/a&gt;, &lt;a href="http://jruby.codehaus.org/"&gt;JRuby&lt;/a&gt;, &lt;a href="http://www.mozilla.org/rhino"&gt;JavaScript/Rhino&lt;/a&gt;, &amp;amp; &lt;a href="http://www-jpc.physics.ox.ac.uk/"&gt;JPC&lt;/a&gt;. I profiled them using RTPM and report on the profiling.&amp;nbsp; See the slides for the in-depth discussion.&amp;nbsp; 3 things popped out:&lt;/p&gt;

&lt;ol style="width: 500px;"&gt;&lt;li style="width: 470px;"&gt;Clojure, JRuby, Jython, Javascript/Rhino all using fixnum/bignums for every trivial piece of math - which ends up allocating new FixNums for each simple integer operation. A major fix for this would be to implement the simplest possible Escape-Analysis. You'll still end up with the overflow checks but the major cost here is the Fixnum allocation (and NOT the GC costs; the fixnums trivally die in the young-gen; direct GC costs are quite modest).&lt;/li&gt;

&lt;li&gt;JRuby was thinking that a key trampoline function was inlining -
and it wasn't.&amp;nbsp; See the slides for details, but hopefully we'll see a
JRuby with a substantial performance boost soon.&lt;/li&gt;

&lt;li&gt;I ran Clojure's STM Traveling Salesman Problem using 600 worker
'ants' to optimize the path.&amp;nbsp; It ran straight-up without a hitch, no
tweaks - and with great scaling (it did allocate about 20Gig/sec on a
200Gig heap but this is easy for an Azul box to cope with).&amp;nbsp; There may be hope for STM's after all - IF the set of STM variables is limited to a handful of named shared variables, and not for dusty-desks that must assume all-is-shared.&lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;Cliff&lt;/p&gt;&lt;/div&gt;
</content>



  <feedburner:origLink>http://blogs.azulsystems.com/cliff/2008/09/jvm-language-su.html</feedburner:origLink></entry>

</feed><!-- ph=1 --><!-- nhm:from_kauri -->
