JFKBits

Code Turbulence

2010-03-08T11:43:00.000-06:00

Code turbulence: the chaotic, stochastic changes in the code base resulting from functionally equivalent changes to names, function signatures, etc., which force everyone to stop their progress and synchronize.

Speedlink: Engine Yard SHA-1 Contest Winner Writeup

2009-07-23T15:57:00.002-05:00

Another blog post from the Engine Yard contest's winning team: Paul Wins an iPhone.

This fills in some of the details left out in the more prominent post, such as the hand-optimized C code that gets a 10x improvement in SHA-1 on ordinary CPUs (not the GPU).

A Crack at Cracking the EngineYard's SHA-1 Challenge

2009-07-21T20:10:00.003-05:00

That was fun. This is a brief report on my strategy for the just-concluded Engine Yard programming contest. My farm of thirty-something CPU cores, with Polar SSL's SHA-1 hash code and my wrappings, managed to come in about #57 with a Hamming distance of 36 for the key "dijkstra parameters format i10n expo dedicated crispin programming output inspect beta".

I wasn't originally going to enter, but after reading some of the blogs and forum discussions, my interest was piqued and I started thinking about the problem.

I began to get pulled in by counting up the order of magnitude for the search space, and the claim of someone in a discussion forum that he could "solve" the contest in about 5 minutes with about 30,000 nodes. My estimate was that the search space was about 10^63, large enough to make a lot of decisions pretty clear.

Also, I had higher priority activities (work, family, intermittent power outages that were traced back to a tree rubbing away the insulation on one wire of our service line and neutral), so I didn't start coding for the contest until it was already underway, which of course is a bit self-defeating, but I'm happy with where I ended up.

Hardware: I have a collection of nodes at work that are typically unused, most of them of relatively recent vintage meaning somewhere around 3GHz. Mac, Linux, and Windows are all represented, most of them dual core with a couple of 4- or 8-core machines, all of them AMD 64 chips. At the peak I had 37 cores running.

Software: Prototyped in Mathematica, then recoded in C. Polar SSL's SHA1, a perfectly acceptable Hamming distance function based on a bytewise bit-counting lookup table, and a simple enumeration technique for generating candidate phrases that ignores the capitalization and random character suffix parts of the search space.

Parallelism: Start the contest executable by hand:


% nohup contest &

Not scalable, but fine for the dozen-odd machines I had. I did two or three rounds of manually killing all the executables and restarting them with an updated version, which did get old. To distribute the workload, the executable picks a random initial phrase. The random generator is seeded from the current time and the process ID.
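The seeding idea can be sketched in a few lines. This is a Java illustration rather than the contest's C code; the class name, the method names, and the multiplier used to mix in the PID are all assumptions made for the example (ProcessHandle needs Java 9 or later).

```java
import java.util.Random;

class StartingPoint {
    // Mix time and PID so that processes launched in the same instant
    // on the same machine still get different seeds. The odd constant is
    // an arbitrary choice for this sketch, not something from the post.
    static long seedFor(long timeMillis, long pid) {
        return timeMillis ^ (pid * 0x9E3779B97F4A7C15L);
    }

    // Pick a random starting index for each word slot in the phrase.
    static int[] randomWordIndices(long seed, int phraseWords, int dictionarySize) {
        Random rng = new Random(seed);
        int[] word = new int[phraseWords];
        for (int i = 0; i < phraseWords; i++) {
            word[i] = rng.nextInt(dictionarySize);
        }
        return word;
    }

    public static void main(String[] args) {
        long seed = seedFor(System.currentTimeMillis(), ProcessHandle.current().pid());
        int[] word = randomWordIndices(seed, 12, 1000);
        System.out.println("first word index: " + word[0]);
    }
}
```

With a search space this large, independent random starting points make overlap between nodes very unlikely, which is why no coordination between processes is needed.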

Tracking minimum: contest executable is compiled with the current best known minimum.

Status reporting: All programs are started from the same network file system directory and write updates there. When an executable finds a new minimum, it writes the phrase to a file using the Hamming distance, hostname, and PID in the filename. My Windows Explorer window, sorted by name, looks something like this, and I just ignore or delete files with higher scores from nodes that don't know a better score has been found:


36-mylinuxbox-10485.txt
36-my8corewindowsbox-7361.txt
38-mylinuxbox-9587.txt
38-mymacmini-82075.txt
38-an8corelinuxbox-30256.txt
...
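A minimal Java sketch of that naming scheme follows. The class and method names are hypothetical, and the real program was written in C. Note that sorting by name only puts the best score first while all distances have the same number of digits, which held for the scores shown here.

```java
import java.util.Arrays;

class StatusFileSketch {
    // Build a name of the form distance-hostname-pid.txt, as in the listing above.
    static String fileName(int distance, String host, long pid) {
        return String.format("%d-%s-%d.txt", distance, host, pid);
    }

    public static void main(String[] args) {
        String[] names = {
            fileName(38, "mymacmini", 82075),
            fileName(36, "mylinuxbox", 10485),
            fileName(38, "mylinuxbox", 9587),
        };
        Arrays.sort(names); // lexical sort, like a file browser sorted by name
        System.out.println("best so far: " + names[0]);
    }
}
```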

Performance: the SHA-1 algorithm (of course) seemed to be the limiting factor for my C code, clocking in somewhere around 1M hashes/sec. My initial simple and stupid method of forming candidate phrases was responsible for cutting performance in about half:

sprintf(phrase,"%s %s %s %s %s %s %s %s %s %s %s %s",
  contestDictionaryWords[word[11]], contestDictionaryWords[word[10]],
  contestDictionaryWords[word[9]], contestDictionaryWords[word[8]],
  contestDictionaryWords[word[7]], contestDictionaryWords[word[6]],
  contestDictionaryWords[word[5]], contestDictionaryWords[word[4]],
  contestDictionaryWords[word[3]], contestDictionaryWords[word[2]],
  contestDictionaryWords[word[1]], contestDictionaryWords[word[0]]);

All this copying is horribly slow, but it got me underway. Before retiring for the night on Monday, I refined it so this copy was only done every 1000 times.

When I went back and wrote something that optimally copies only what's changed, I saw around a 50% to 75% speed improvement. One of the tricks in the improvement is to make two copies of the dictionary: one with a space appended to each word (for the first 11 words) and one without the space. Here's how the final phrase initialization code looks:


void initPhrase()
{
  int i, w;
  unsigned char *p;

  initRandom();

  /* Start with a random enumeration to partition work */
  for(i=0; i < 12; i++)
    word[i] = random() % 1000;

  p = phrase;
  for(i=11; i > 0; i--){
    w = word[i];
    strcpy(p, contestDictionaryWordsWithSpace[w]);
    phrasepartend[i] = p += contestDictionaryWordWithSpaceLen[w];
  }

  w = word[0];
  strcpy(p, contestDictionaryWords[w]);
  p += contestDictionaryWordLen[w];
  phraselen = p-phrase;

  printf("initial phrase: %s\n", phrase);
}

Hamming distance: my initial code for this seemed to perform well enough that I never bothered optimizing it further, e.g. performing the XORs a 32-bit word at a time:


unsigned challengeDistance(unsigned char hash[20])
{
  int i, distance;
  distance = 0;
  for(i=0; i < 20; i++){
    distance += numOnes[challengeHash[i] ^ hash[i]];
  }
  return distance;
}

For each byte, XOR the hash with the corresponding byte in the challenge hash, and use a precomputed lookup table to count the ones.
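The post does not show how the numOnes table is filled. One common way to precompute it, sketched here in Java with hypothetical names, is to build each entry from the entry for the value shifted right by one bit. The mask with 0xFF is needed only because Java bytes are signed.

```java
class HammingSketch {
    // NUM_ONES[b] is the number of 1 bits in the byte value b (0..255).
    static final int[] NUM_ONES = new int[256];
    static {
        // The bit count of b is the bit count of b without its lowest bit,
        // plus that lowest bit. Entry 0 is already 0.
        for (int b = 1; b < 256; b++) {
            NUM_ONES[b] = NUM_ONES[b >> 1] + (b & 1);
        }
    }

    // Hamming distance between two equal-length byte arrays (20 bytes for SHA-1).
    static int distance(byte[] a, byte[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            d += NUM_ONES[(a[i] ^ b[i]) & 0xFF];
        }
        return d;
    }

    public static void main(String[] args) {
        byte[] x = { (byte) 0xFF, 0x0F };
        byte[] y = { 0x00, 0x00 };
        System.out.println(distance(x, y)); // 8 + 4 = 12
    }
}
```

The word-at-a-time variant mentioned above would XOR 32 bits at once and count bits in the result, for example with Integer.bitCount in Java.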

Performance reporting: The main program structure is a while(1) loop that generates and evaluates candidate phrases in batches of a few million. These batches are timed and current results are printed to the console. I had dozens of console windows where I could keep an eye on progress:


390392448 evals, 1000000.000000 eval/sec, 40 session min, 36 min
400392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
410392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
420392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
430392448 evals, 1000000.000000 eval/sec, 40 session min, 36 min
440392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
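A note on reading these numbers: the evals column advances by 10,000,000 per line, and the two rates that appear are exactly 10,000,000/10 and 10,000,000/11. That is consistent with each batch being timed by a clock with whole-second resolution, though this is an inference from the output and not something stated here. The arithmetic, as a small Java sketch with hypothetical names:

```java
import java.util.Locale;

class RateSketch {
    // Rate for one batch: evaluations divided by elapsed whole seconds,
    // printed with six decimals like the log lines above.
    static String evalsPerSec(long evals, long elapsedWholeSeconds) {
        return String.format(Locale.ROOT, "%f", evals / (double) elapsedWholeSeconds);
    }

    public static void main(String[] args) {
        System.out.println(evalsPerSec(10_000_000L, 10)); // 1000000.000000
        System.out.println(evalsPerSec(10_000_000L, 11)); // 909090.909091
    }
}
```

If that reading is right, the true rate sits somewhere between the two printed values, and the alternation is a timer artifact and not a real swing in throughput.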

Conclusion: I put together a modest-performing program that performed about 100 billion searches during the contest and found a phrase that had a score of 36. One thing I'd like to do is analyze the performance of the SHA-1 hash implementation itself and understand why it takes a few thousand machine cycles, what the limiting factor is, and whether there's a better performing implementation.

What Tools do I Need to Program from a Winnebago

2009-04-28T13:30:00.007-05:00

Geez, does this guy get it: Patrick Collison blog : Hacking for fun and profit with Mathematica and the Google Analytics API. It's always nice to see Mathematica used in more traditional programming and software prototyping senses.

The maps in Patrick's post prompted me to post a fun result I got yesterday.

Below is a map of the United States showing me where all the comfy cities are --- according to my definition of comfortable. These are the cities that show up in CityData where the current temperature, grabbed from WeatherData, lies within my personally-defined comfort zone of 74-84 degrees Fahrenheit (23.3C-28.9C).

The post's title is a stackoverflow.com nod and hints at the real question I had. What I was curious to do but didn't have time is to explore the historic temperatures and answer the question "if I lived in a motor home, where could I go during the year so that I was always living in a temperature range that I specify?" I don't really want to do that, I'm used to the four seasons and lots of weather variety, but I started to wonder what those places would be.

There's absolutely nothing clever to the code for this, it's sheer brute force actually. We just ask for all the U.S. cities in the database, look up their temperature, select those in the right range, and then plot them.


Needs["Units`"];

CityHasDesirableTemperatureQ[city_, low_, high_] := 
 With[{lowC = ConvertTemperature[low, Fahrenheit, Celsius], 
   highC = ConvertTemperature[high, Fahrenheit, Celsius], 
   temp = WeatherData[city, "Temperature"]},
  lowC <= temp <= highC];

allCities = CityData[{All, "UnitedStates"}];

Graphics[{Gray, CountryData["UnitedStates", "Polygon"], 
  PointSize[Large], Red, 
  Point[Reverse[CityData[#, "Coordinates"]]] & /@ 
   Select[allCities, CityHasDesirableTemperatureQ[#, 74, 84] &]}]

Ship It

2009-03-18T09:55:00.007-05:00

We just shipped the server side of one of my products, the Wolfram Lightweight Grid System.

I am reminded of the immortal words of Steve Bjork, famed CoCo game programmer, when he said you know your program is going to be good because you start to hate it. He seemed to capture that feeling you get when the thrill of seeing something cool come to life is long gone, and you're grinding through all the details that need to be nailed down so you can ship.

Check out some of the screencasts. Also, check out Roman's blog post for some of his longer-term perspective on parallel computing within Mathematica.

Speedlink: Harvest

2009-03-09T16:36:00.003-05:00

Harvest is a time-tracking webapp that has a broader scope than Time Tracking Tool, and so far I like it a bit better for my needs. The web interface is just a tad more efficient than that of TTT.

Harvest is geared for businesses and generating invoices for billable hours, so it has this concept of Timesheets that let you pick tasks from your projects. This matches how I work pretty well, even though I'm not directly billing anybody. On my timesheet I can see at a glance where my time has gone for that day, and how much time total I've spent so far. This is the number one thing I need on an ongoing basis because my schedule is so irregular and I task switch so often between home and work.

I haven't looked into what support it has for planning, and tracking how estimates match actual figures, but I imagine the support is not there.

Speedlink: Time Tracking Tool

2009-03-07T10:49:00.003-06:00

Time Tracking Tool, a program that I'm evaluating. This is pretty close to what I'd want to use. It has the Play/Pause notion, as well as a task hierarchy. When you activate a task, it shows the time you started, to help remind you if (when) you forget to pause it when you leave the computer. That's probably the chief drawback to the whole problem, when you get distracted from a task (phone call, someone stops by your desk to ask a question), by nature it's hard to remember to tell the computer what's going on.

Stephen Lifts the Lid on Wolfram|Alpha

2009-03-05T10:00:00.003-06:00

We finally have said something public about Wolfram|Alpha. The tagline for the product is "Computational Knowledge Engine", and Stephen's blog post introduces the concept this way:

Fifty years ago, when computers were young, people assumed that they’d quickly be able to handle all these kinds of things.

And that one would be able to ask a computer any factual question, and have it compute the answer.

Quote of the Day

2008-07-18T17:10:00.003-05:00

We just wrapped up the Advanced Mathematica Summer School, two weeks of lectures and on-site consulting and coaching for attendees in developing a Mathematica project idea. As a developer, this was an awesome opportunity to work hands-on with real users and develop a better feeling for (1) real world problems, and (2) ways that people use our products, versus what we may assume they want and what they know.

During the lightning presentations this morning, one attendee wanted to express his appreciation:

"I want to thank Wictor for writing some really cool code for me,
and I want to thank Paul-Jean for helping me understand the code that Wiktor wrote."

Yes We Controlled A Rover

2008-07-16T23:22:00.004-05:00

I'm just recovering from this weekend's ICFP Programming Contest in which we had to write a controller for a simulated Mars Rover. I'm pretty sure we're not winning anything, even assuming we ever get Mathematica in the hands of the contest organizers. After the rules were announced the week before, I frantically went around trying to figure out if we could even enter a contest with Mathematica, a commercial product that requires a license key to run, finally coming up with two workable options. Only to discover on Saturday that one of the options wasn't going to work because the submission form was capped at 20MB. I had counted on being able to upload a stripped down copy of Mathematica weighing in at around 140MB. The contest staffer on the IRC chat channel told me we would work it out later.

We used Java to do the socket stuff, and JLink to go between Java and Mathematica. This isn't as hard as it might sound, especially since all three of us on the team develop Java/Mathematica hybrid products (Wolfram Workbench is Eclipse-based and webMathematica works with Tomcat).

The code we worked on during the contest had Java in control and called Mathematica to process each event from the server (i.e. the rover). This worked fine, but I wish I'd listened more carefully to the 3 senior people at work who advised Mathematica be in control and call Java.

Last night I inverted the control and rewrote the little event loop in Mathematica, and suddenly the light went on that we really missed out by not having the Mathematica front end part of our interactive development process. I was quickly able to write a Dynamic-based dashboard: literally we could watch the controller's state updated in real time, including calculated state. We saw things like instantaneous speed, heading, and the rover's steering and acceleration state. It seems it would have been so much more helpful to see these numbers as our rover wandered around, to get more familiar with the problem. As it was we stared at static log files of telemetry messages after the run was over.

I hope I can write a little more about our experience later. It was fun though.

Replacing One Line with Two in Eclipse

2008-06-25T16:57:00.006-05:00

Eclipse 3.4 is out today. In reading through the list of new features, I find one that I literally could have used yesterday.

Simply, I was doing a refactoring where I wanted to replace one import declaration by two, so my desired replacement text was two lines. Eclipse has a nice multi-file search/replace feature. However, I couldn't find or figure out if the replace text field would accept some syntax to indicate "I want a line break here." Trying the obvious \n resulted in a literal backslash-n.

In Eclipse 3.4, we can now use \R to indicate a platform-dependent line break, and \n and \r now work as expected. Usually I hate being on the bleeding edge, but this time I curse my reluctance to have been trying out the release candidates.

Breakout

2008-06-17T13:44:00.006-05:00

In relation to the release of Firefox 3.0, someone at work mentioned that we have screencasts too, and I was happy to see one where Theo shows off Luc Berthelet's version of Breakout. Theo scrolls through the code on the screencast, so by judicious pausing you can probably copy it, but I wish I could find the notebook somewhere, for, you know, research purposes.

In a serendipitous connection, Wikipedia tipped me off that in the past month or so Luc has broken out of his position at Electronic Arts to start Tir Nua, something in the virtual world line of goods. One of his new employees that he brought with him is a name familiar around my workplace, former Wolfram Research employee Sarah Flannery. I remember hearing Sarah give a talk last fall about the webMathematica-based Wiki she'd worked up for Sims Online players, and how she'd tackled the runaway inflation in the Sims Online economy (all sources and no sinks).

People like Luc and Sarah are fun users for Wolfram employees like me who consider themselves mainline software designers and like to see Mathematica applied more like a regular programming language in areas very different from what you might typically expect.

Picking a CLOS Implementation to Play With

2008-06-04T22:02:00.004-05:00

Being provoked into trying to learn a bit more about CLOS's method of multiple dispatch, and wanting to find an implementation, led me on this search path in a few spare minutes:

1. Search Google for CLOS

2. Click http://www.dreamsongs.com/CLOS.html, which is on the site of Richard Gabriel ("worse is better")

3. Click the intriguing title CLOS: Integrating Object-Oriented and Functional Programming

4. Read with interest on the first page the essence of multi methods:

A generic function, or polymorphic function as it is sometimes called, is one whose implementation depends on the types of its arguments. That is, the code that is executed for a particular set of arguments is mechanically determined by the types of the arguments.

In strictly functional languages, an operation is defined by a single, monolithic piece of code; any argument-type conditionality is expressed as code explicitly programmed in by the user. In contrast, the CLOS notion of generic functions supports automatic selection from among separately defined, type-specific implementational parts. Yet, from the client’s point of view, generic functions and traditional ordinary functions are called in the same way: The procedural abstraction barrier is still in force.

5. Start looking for implementations. Search reddit for "lisp implementations".

6. Click Common Lisp Implementations: A Survey.

7. The survey lists 11 Common Lisp implementations. Of these, only 5 are listed as available on the Unixes and Windows. Of those, only 3 do not have commercial licenses. Two of the remaining choices have weird names: Armed Bear Common Lisp, which does not look promising (its own home page says, under "Bugs", that "ABCL's CLOS is intolerably slow"), and Embedded Common Lisp (ECL). It's difficult to tell immediately from Wikipedia or Google what kind of support it has. But the remaining candidate, GNU clisp, appears to be in good standing, judging from the Wikipedia article (where it is claimed this is the platform Paul Graham used for Viaweb), and the Sourceforge stats which claim 100 downloads a day in the past week.

8. Find the Clisp site. Another Lisp site without an obvious Download link.

Multiple Definitions

2008-05-30T21:14:00.005-05:00

We had an interesting discussion at work yesterday debating whether Mathematica supports multimethods with its variety of pattern-matching. So I'm taking the opportunity to do a mini-survey of the multiple dispatch spectrum, starting with overloading. As for Mathematica, it clearly gives you the power of selecting from multiple definitions based on runtime information; more on this in a minute.

Overloading allows you to write different definitions of a function or method, and the definition used when you call the function depends on the number and types of the arguments you pass to it. That is, the overload resolution is done by the compiler at compile time with static type analysis. Whatever static types your arguments have, and exactly those types, will determine which definition is chosen.

Multimethods let you write different definitions of a function or method, and the definition used when you call the function depends on the number and types of the arguments you pass to it, as examined at runtime. Now you can write one definition for a class, and specializations for its subclasses if so desired, and the definition will be chosen based on the actual type of the arguments at runtime.
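The difference shows up clearly in Java, which has overloading but not multimethods. In this sketch the Shape and Circle classes and both method names are made up for illustration. The overload is picked from the declared type of the variable, and a hand-written runtime check stands in for what a multimethod system would do automatically.

```java
class DispatchSketch {
    static class Shape {}
    static class Circle extends Shape {}

    // Two overloads: the compiler picks one from the static argument type.
    static String describe(Shape s)  { return "shape"; }
    static String describe(Circle c) { return "circle"; }

    // Hand-coded dispatch on the runtime type, emulating a multimethod.
    static String describeDynamic(Shape s) {
        if (s instanceof Circle) {
            return describe((Circle) s);
        }
        return describe(s);
    }

    public static void main(String[] args) {
        Shape s = new Circle();                 // static type Shape, runtime type Circle
        System.out.println(describe(s));        // shape
        System.out.println(describeDynamic(s)); // circle
    }
}
```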

You get into a fuzzy gray third area if the runtime values can also be used to select different definitions. This is where Mathematica lies, because its pattern matches can be used to differentiate between a list of at least one element and a list of at least two elements, or between the symbols Null and Infinity. What's useful for writing symbolic algorithms turns out to be useful for regular programmers.

It seems that ML's structural pattern matching is also in this fuzzy gray third area, and that helps me make an interesting connection. For my purposes, multiple dispatch is interesting because it's the way to do expression tree traversal. That is, it lets you write pretty printers and type checkers and the like without needing to code the dispatch yourself (if the node is actually a lambda abstraction, do this, but if it's a cons cell, do that). What I'm noticing now is that one way or another, multimethods and pattern matching are giving you the notational convenience that I enjoy in writing tree traversals, with still perhaps an edge to pattern matching on that score.

The ucc Compiler

2008-05-28T22:11:00.002-05:00

Someone with the name dreamAnders has posted announcements to comp.compilers and comp.lang.c for his small open-source C compiler, ucc, the chief attraction of which is that the source code is meant to be small and self-explanatory.

On "I can't believe I'm praising Tcl"

2008-05-21T21:41:00.005-05:00

Today we look at a refreshing use case for a programming language, where economy of expression in calling functions from a read-eval-print loop is prized. Raganwald recently tagged "I can't believe I'm praising Tcl", in which the embedded systems author helped us understand how Tcl made for a great debugger command environment, and the "pop infix languages (C/Java/Python/Ruby/you name it)" don't.

In this case the author wants to define some glue function or functions, and then in the language's interactive interpreter, call his function over and over. He's not programming; he's commanding, so the function calls need to be short and sweet, so he doesn't mind typing them for hour after hour as he thinks about the real problem, a buggy piece of embedded hardware. The author wants a command shell, where he uses his command interface to an embedded device as a kind of gdb replacement. An example session to set breakpoints, inspect memory, etc., looks like this:


$ pmem 0 stat
IDLE
$ pmem 0 bkpt 0 0xbff
$ pmem 0 bkpt 1 0xa57
$ pmem 0 cmd run
$ pmem 0 stat
DEBUG
$ pmem 0 pc
0xbff
$ pmem 0 rstack
3 return addresses
addr 0: 0x0005
addr 1: 0x05a8
addr 2: 0x0766
$ pmem 0 cmd stp
$ pmem 0 pc
0xc00

The question then is whether a language you may be designing or using could support something close to this syntactic economy for calling functions.

The argument for Tcl over the pop infix language may perhaps be best summarized by this quote:

And then we have interactive shells. And in Python it’s doit("xx","yy"). And in Lisp it’s (doit "xx" "yy"), or (doit :xx :yy), or (doit xx yy) if you make it a macro. And in Ruby it’s doit :xx :yy, if you use symbols and omit parens. And that’s about as good as you can get without using your own parser as in doit "xx yy", which can suck in the (more rare) case when you do need to evaluate expressions before passing parameters, and doesn’t completely remove overhead. Also note how all these languages use (), which makes you press Shift, instead of [] which doesn’t. Ruby and Perl let you omit (), but it costs in readability. And [] is unanimously reserved for less important stuff than function calls.

Analysis

First we see the emphasis is not on defining functions, on programming, but on using, on the syntax for calling, functions. The author wants an internal DSL (Domain Specific Language).

Second, it should be noted in discussing () that Scheme lets you use [] as well as (). There's good Scheme style, where [] is reserved for the let blocks, but if you open up DrScheme or Chez Scheme, define some choice Turtle graphics functions, and start typing commands like [penup] [rt 45] [pendown] [fd 100] it will work fine.

One thing the author noted is that Tcl's preference for strings over variables makes bkpt a string and $bkpt a variable, whereas in the pop infix languages, it's the variables that get lexical economy and strings that need delimiters. Because of this preference, calling a Tcl command lets you pass in what look like symbols, but you treat them as strings in the command definition. Hence, for the author's use case, a chief consideration seemed to be a way to write symbolic arguments, where the command in question may take a one-of-several subcommand or option name, without lexical decoration like string delimiters or single quotes or colons. I wonder if this was really a language design goal of Tcl, because it's hard to understand the motivation for the string-vs-variable syntax any other way. For all that, enumerated types or sum types are a known language feature that meets the author's criterion. In Standard ML you could define a datatype Subcommand = bkpt | status | memset or the like, and now undecorated references like bkpt can appear as arguments.

Note that if you do define your functions as Scheme macros, to address the symbol/string problem, and if you modified Scheme to accept an S-expression forest on each line (i.e. no need to delimit a top-level input line with parens), you'd have the economical expression of Tcl. I think this is worth considering in some circles where Scheme may be more familiar.

Footnote

This could be a nice motif for the "language design of the day": extend the basic Scheme-like interpreter to support an extensible debugger command interface.

Call by Need Lambda a Poor Man's Macro?

2008-05-16T22:39:00.004-05:00

I've been seriously considering why more languages don't include a call-by-need lambda (hereafter called "lazy lambda"). With its delayed evaluation it offers macro-like powers for writing non-looping forms (they're bad for loop forms since the arguments are evaluated but once by design), but they don't have the bloating effect of macro expansion (if code size is more important to you than the time overhead of a function call), and they are hygienic. They're not a cure-all, but this seems to be an approach which can still be wielded effectively by trained professionals.

How to Use Lazy Lambda

Here's how a lazy lambda would work in an existing language like Scheme. You have a new abstraction lazy-lambda with precisely the same syntax as lambda. When applied, lazy-lambda follows call by need evaluation rules. That means arguments to the lazy procedure are not evaluated at the call site, but only when their corresponding formal parameter is first referenced inside the body. On this reference, the evaluated argument value is remembered for future references inside the body. Here's how you might write something like Groovy's Elvis operator:


(define ?: (lazy-lambda (expr default) 
  (if (null? expr) default expr)))

The thing I like is that this is automatically hygienic: it works fine even if you call it in an environment with variables named expr or default.

Implementation

I like to divide my thinking about new features into two phases: introduction and use. When lazy-lambda is introduced, the parser needs to create a structure essentially identical to that of a lambda, namely an arguments list and a body expression, but of course it needs to be marked as a different type from lambda so the evaluator can distinguish the two. Lazy lambda is used in two ways, once when it is evaluated (e.g. when a lazy-lambda expression is returned from a function) and once when it is applied.

Summary of Lazy Lambda Implementation
Introduction: '(lazy-lambda args body)
Evaluation: '(closure (lazy-lambda args body) env)
Application: Bind args to thunks, evaluate thunks in a memoizing way

When lazy-lambda is evaluated, it should create a closure, the pair of the lazy-lambda expression and the environment in which it was evaluated. This closure needs to be marked as a different type from a regular closure. Alternatively the evaluator can be arranged to check the type of the "lambda" expression: a closure may look like '(closure (lambda args body) env) or '(closure (lazy-lambda args body) env).

What happens when a lazy-lambda closure is applied? We know you don't evaluate the arguments, but what then? As with eager closures, you first create a new environment scope. Then, instead of binding the formal arguments to the evaluated values of the arguments, you bind the formal arguments to some box (container) of the unevaluated argument expressions. The container needs to be distinct from any other language type so that the evaluator knows how to treat it. That is, once we set up the new environment scope, we will simply evaluate the body of the lazy-lambda under this new environment, and the references to the formal arguments need to be handled in this special memoizing way. So let's introduce an expression type delayed, which is not directly creatable in the language, and we bind each formal argument x with actual argument expression A to the "value" (delayed A env). The env value will be needed when we evaluate A, because we will need to evaluate it in the calling environment, not whatever environment is in effect when the symbol is first referenced. (Think about what gets returned by (lambda (x) ((lazy-lambda (x) (begin (set! x (+ 1 x)) x)) x)).) Then when the evaluator handles variable references and gets a delayed value back from the environment lookup, it's time to do the memoizing: evaluate A in environment env, rebind (set!) the referenced variable with its evaluated value, and return that.
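The memoizing container can be sketched outside of an interpreter too. The Java below is only an analogy, with hypothetical names: a Supplier plays the role of the pair (A, env), since a Java lambda already closes over the environment where it was written, and force() does the evaluate-once-then-remember step. The elvis method mirrors the ?: example, with Java's null standing in for Scheme's null? test.

```java
import java.util.function.Supplier;

class LazySketch {
    // Stand-in for (delayed A env): an unevaluated expression plus a cache.
    static final class Delayed<T> {
        private Supplier<T> expr; // the unevaluated argument expression
        private T value;          // remembered result after the first reference
        private boolean forced;

        Delayed(Supplier<T> expr) { this.expr = expr; }

        T force() {
            if (!forced) {
                value = expr.get(); // evaluate in the calling environment
                forced = true;
                expr = null;        // the expression is no longer needed
            }
            return value;
        }
    }

    // Like the ?: example: the default is only evaluated if expr is null.
    static <T> T elvis(Delayed<T> expr, Delayed<T> dflt) {
        return expr.force() != null ? expr.force() : dflt.force();
    }

    public static void main(String[] args) {
        int[] evaluations = { 0 };
        Delayed<String> name = new Delayed<>(() -> { evaluations[0]++; return "rover"; });
        Delayed<String> fallback = new Delayed<>(() -> { evaluations[0]++; return "unnamed"; });
        System.out.println(elvis(name, fallback)); // rover
        System.out.println(evaluations[0]);        // 1: name once, fallback never
    }
}
```

Here the caller has to wrap arguments by hand; the point of lazy-lambda is that the evaluator does this wrapping at the call site automatically.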

Conclusion

None of the popular scripting languages I can think of (JavaScript, Perl, Python, Ruby) have a macro facility, but most of them have anonymous functions which evaluate to closures in the Scheme sense. On the other hand, they also tend to have a richer set of control structures (Perl's unless), and they have other mechanisms (dare I say classes and objects?) which address most of the killer applications for macros, and hence for lazy-lambda. But for all that, I'd have to figure that those languages, and the tinier languages or DSLs, could add this feature.

Macros are subtle things to get right, and I'm sure there are deficiencies I haven't addressed here. But that shouldn't stop us from thinking about these issues, and I think there's some potential value in the call by need lambda.

Thoughts on an S-Expression Parser

2008-05-14T22:48:00.003-05:00

In this post we look at a tiny Scheme parser, and generalize it to a non-language-specific S-expression parser that can be subclassed to handle any language design, as a base class for studying language varieties.

As I mentioned last time, I recently wrote a tiny Schemish interpreter, and then started extracting from it a basic interpreter platform that other people or myself could use to try out different language designs or implementations. One practical goal would be for instructors to provide the framework and have students modify it. The approach is not mine, it's from Sam Kamin's "Programming Languages: An Interpreter-Based Approach" (1990).

The original Schemish parser had these two parsing methods:


protected Expr parseExpr() throws ParseError
{
  Token token = consumeToken();
  switch(token.type)
  {
  case '"': return new StringLit(token.text);
  case '(': return parseExprList(token);
  default:
    // Test the token to see if it's a numeric
    Int intExpr = Int.fromString(token.text);
    if(intExpr == null)
      return new Atom(token.text);
    else
      return intExpr;
  }
}
  
protected Expr parseExprList(Token openParen) throws ParseError
{
  Vector acc = new Vector();
  while(peekToken().type != ')')
  {
    Expr expr = parseExpr();
    acc.add(expr);
  }
  Token closeParen = consumeToken();
  
  // Handle special forms
  ExprList retval = null;
  if(acc.size() > 0 && ((Expr)acc.firstElement()).isAtom())
  {
    Expr head = (Expr)acc.firstElement();
    if(head.isAtom("lambda"))
    {
      String headName = head.getAtom().getText();
      String lambdaUsage = "Syntax error: "+
        "expected ("+headName+" (arg ...) body)";
      if(acc.size() != 3)
        throwError(openParen, lambdaUsage);
      Expr argExpr = (Expr)acc.get(1);
      if(!argExpr.isExprList())
        throwError(openParen, lambdaUsage);
      ExprList argList = (ExprList)argExpr;
      Expr[] args = argList.getElements();
      HashSet argSet = new HashSet(); // to check for duplicates
      for(int i=0; i < args.length; ++i)
      {
        if(!args[i].isAtom())
          throwError(openParen, lambdaUsage);
        boolean wasAdded = argSet.add(args[i]);
        if(!wasAdded)
          throwError(openParen, "Syntax error: argument "+
            args[i].getAtom()+" appears more than once");
      }
      Expr bodyExpr = (Expr)acc.get(2);
      retval = new Lambda(argList, bodyExpr);
    }
  }
    
  if(retval == null)
    retval = new ExprList(acc);
  
  retval.filename = m_filename;
  retval.firstLine = openParen.line;
  retval.lastLine = closeParen.line;
  return retval;
}

The lambda-handling code is big, and it obscures the structure of the list parsing. That may not matter if you're working on this one interpreter: the basic list-parsing structure is simple enough that you won't forget it, and you'll care more about the special forms. But my purpose is to write many interpreters from the same code base and to be able to talk about how they differ, and for that this layout is awkward: you're always sharing code patches and having to explain where they go. Instead, why not provide a sufficient base class and share complete subclasses?

The improvement gives a parser class which handles only S-expressions; it knows only atoms and expression lists, it doesn't know any language constructs at all: it knows no keywords. You can think of it as generating just a tree structure, like an XML parser. Parsers for a particular language design will be written as subclasses, and will override the methods constructAtom and constructExprList. These methods are reminiscent of the actions in YACC-like parser generators, the blocks of code that construct parse tree elements given data named by the grammar rule symbols and from the general lexer and parser state (e.g. line numbers in our case).

Thus, parseExpr and parseExprList reduce to tiny fragments and subclasses can flesh out the meat of special forms in constructExprList:


public Expr parseExpr() throws ParseException
{
  Token token = consumeToken();
  Expr retval = (token.type == '(')? 
    parseExprList(token) 
    : constructAtom(token);
  return retval;
}

protected Expr parseExprList(Token openParen)
  throws ParseException
{
  Vector acc = new Vector();
  while(peekToken().type != ')')
  {
    Expr element = parseExpr();
    acc.add(element);
  }
  Token closeParen = consumeToken();

  Expr retval = constructExprList(acc, m_filename, 
    openParen.line, closeParen.line);
  return retval;
}

protected Expr constructAtom(Token token)
{
  return new Atom(token.text);
}

protected Expr constructExprList(
  AbstractCollection exprs, String srcId, 
  int startLine, int endLine)
{
  ExprList retval = new ExprList(exprs);
  retval.filename = srcId;
  retval.firstLine = startLine;
  retval.lastLine = endLine;
  return retval;
}

Now the Schemish interpreter can subclass Parser and call it SchemeParser, overriding constructExprList to handle special syntactic forms, and overriding constructAtom to handle language-specific literals, such as character literals or rational number literals (2/3).
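To make the subclassing idea concrete, here is a minimal, self-contained sketch. It is not the code from the interpreter: the Expr classes are stripped-down stand-ins, the parser reads a pre-split list of token strings instead of Token objects, and the construct methods drop the filename and line-number arguments. All of those simplifications are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SubclassSketch
{
  // Hypothetical, stripped-down stand-ins for the post's Expr hierarchy.
  static class Expr {}
  static class Atom extends Expr
  {
    final String text;
    Atom(String text) { this.text = text; }
  }
  static class Int extends Expr
  {
    final int value;
    Int(int value) { this.value = value; }
  }
  static class ExprList extends Expr
  {
    final List<Expr> elements;
    ExprList(List<Expr> elements) { this.elements = elements; }
  }
  static class Lambda extends ExprList
  {
    Lambda(List<Expr> elements) { super(elements); }
  }

  // Language-neutral base class: knows only atoms and lists, no keywords.
  static class Parser
  {
    private final Iterator<String> m_tokens;
    Parser(List<String> tokens) { m_tokens = tokens.iterator(); }

    Expr parseExpr()
    {
      String token = m_tokens.next();
      return token.equals("(") ? parseExprList() : constructAtom(token);
    }

    private Expr parseExprList()
    {
      List<Expr> acc = new ArrayList<Expr>();
      for(String token = m_tokens.next(); !token.equals(")"); token = m_tokens.next())
        acc.add(token.equals("(") ? parseExprList() : constructAtom(token));
      return constructExprList(acc);
    }

    protected Expr constructAtom(String text) { return new Atom(text); }
    protected Expr constructExprList(List<Expr> exprs) { return new ExprList(exprs); }
  }

  // Language-specific subclass: integer literals and one special form.
  static class SchemeParser extends Parser
  {
    SchemeParser(List<String> tokens) { super(tokens); }

    protected Expr constructAtom(String text)
    {
      try { return new Int(Integer.parseInt(text)); }
      catch(NumberFormatException e) { return super.constructAtom(text); }
    }

    protected Expr constructExprList(List<Expr> exprs)
    {
      boolean isLambda = !exprs.isEmpty() && exprs.get(0) instanceof Atom
        && ((Atom)exprs.get(0)).text.equals("lambda");
      if(!isLambda)
        return super.constructExprList(exprs);
      if(exprs.size() != 3 || !(exprs.get(1) instanceof ExprList))
        throw new IllegalArgumentException("expected (lambda (arg ...) body)");
      return new Lambda(exprs);
    }
  }

  public static void main(String[] args)
  {
    List<String> tokens = Arrays.asList("(", "lambda", "(", "x", ")", "42", ")");
    // The base parser builds a plain ExprList; the subclass builds a Lambda.
    System.out.println(new Parser(tokens).parseExpr().getClass().getSimpleName());
    System.out.println(new SchemeParser(tokens).parseExpr().getClass().getSimpleName());
  }
}
```

The same token list goes through the same parsing loop in both cases; only the two construct methods decide what kind of tree node comes out.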

Thoughts and Code for the S-Expression Lexer

2008-05-09T23:15:00.004-05:00

My recent project has been a tiny Schemish interpreter, and more recently I have been considering a design that would work well as the Java counterpart to Kamin's chapter 1 interpreter, a calculator language using Lisp syntax, which is to be morphed into any one of a variety of different languages based on S-expression syntax.

So, this post is just for sharing some observations in working on the lexical analyzer part, as well as its code.

Just so it's abundantly clear, the problem a lexical analyzer solves, for S-expressions anyway, is to get a stream of parens, quotes, and symbols out of a character input source.

The apostrophe quote operator, as for creating list literals like '(lambda (x) x), appears to be a token not requiring whitespace. I've never seen the apostrophe not have preceding whitespace, but after testing MzScheme, ChezScheme and a few other minor Scheme implementations, it's apparent that a'b'c is the same as a 'b 'c. I've not done any serious development in Scheme, but I wonder whether this is common knowledge among Scheme programmers.
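A small sketch of what this means for a StreamTokenizer-based lexer: when the apostrophe is marked as an ordinary character, it ends the word in progress, so a'b'c and a 'b 'c produce the same tokens. The class name QuoteTokenDemo and the helper tokens are made up for this illustration; the syntax settings mirror the ones used for the LispTokenizer in this post.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class QuoteTokenDemo
{
  /** Scans src and returns the text of each token, in order. */
  static List<String> tokens(String src)
  {
    StreamTokenizer tzr = new StreamTokenizer(new StringReader(src));
    tzr.resetSyntax();
    tzr.whitespaceChars(0, ' ');
    tzr.wordChars(' '+1, 255);
    tzr.ordinaryChar('(');
    tzr.ordinaryChar(')');
    tzr.ordinaryChar('\''); // the quote operator is its own token

    List<String> out = new ArrayList<String>();
    try
    {
      while(tzr.nextToken() != StreamTokenizer.TT_EOF)
      {
        if(tzr.ttype == StreamTokenizer.TT_WORD)
          out.add(tzr.sval);
        else
          out.add(String.valueOf((char)tzr.ttype));
      }
    }
    catch(IOException e)
    {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args)
  {
    // Both lines print [a, ', b, ', c]
    System.out.println(tokens("a'b'c"));
    System.out.println(tokens("a 'b 'c"));
  }
}
```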

Similarly, Scheme read-eval-print loops appear to accept forests of S-expressions at the prompt, not just one expression. If you type 1 2 3 4, each of the values gets echoed back. This is useful for multiple defines, or for pasting code. Obviously, any REPL should support this behavior if at all possible.

I was happy to see that StreamTokenizer tracks line numbers. Having line numbers available "for free" helps error messages instantly. For what it's worth, I want my scanner to track column numbers too, but I understand if James Gosling (marked as the author of StreamTokenizer) didn't want to get into arguments about what a tab is worth.

The xUnit test frameworks are a much welcome tool for testing language processors. In 1998 I was fiddling with diff-based Perl scripts for automated regression testing. Writing JUnit tests is a lot more fun, productive, and exact than relying on textual comparison with blessed output.

Rather than subclass StreamTokenizer, I wanted to change the consumer's model of looping from "while the current token is not end-of-file" to a standard Iterator terminating on !hasNext(). This required making use of pushBack() on every token get, but I considered the ease of use for the client had a slight edge.

Using an iterator means you need an object to return, so there is a small Token class returned by the iterator. Token bundles the public fields of the StreamTokenizer that represent the token value. I opted not to have StreamTokenizer parse numbers. Since the intended use is ultimately interpreters for arbitrary languages merely based on S-expression syntax, I needed to let them have their own numeric literal syntax and domain of representation (double? float? BigInteger?). Now for some code. Token looks like this:


package jfkbits;
import java.io.StreamTokenizer;

public class Token
{
  public static final int SYMBOL = StreamTokenizer.TT_WORD;
  public int type;
  public String text;
  public int line;

  public Token(StreamTokenizer tzr)
  {
    this.type = tzr.ttype;
    this.text = tzr.sval;
    this.line = tzr.lineno();
  }

  public String toString()
  {
    switch(this.type)
    {
    case SYMBOL:
    case '"':
      return this.text;
    default:
      return String.valueOf((char)this.type);
    }
  }
}

Here's how the type field, an int, as defined in StreamTokenizer, works: for "ordinary characters", the type is the character literal cast to an int. The LispTokenizer marks '(', ')', and the apostrophe quote operator '\'' as ordinary. In code using Token, this reads very naturally, as in if(token.type == '(') return parseExprList();. For "words", atoms in our case, the type is a negative int defined by StreamTokenizer.TT_WORD, which Token redefines as SYMBOL. If we read a Token t with t.type==Token.SYMBOL, the good stuff (like "42", "x", "eval", or "lambda") is in t.text. String literals have the type code of the delimiter, so t.type=='"' means we've got a string literal, the contents of which (without the delimiter!) are also in t.text.

And what about string literals? Strictly speaking, the same decision I made about numeric literals should also apply to string literals, namely that different languages have different syntaxes and potentially different representations. Perhaps I should not configure StreamTokenizer to recognize string literals. In that case, the parser would get atoms containing the double quotes themselves, and the parser would be expected to split them apart. Currently, I don't expect this tool to be used for studying string literals very much.

And finally, for the code itself:


package jfkbits;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.Iterator;

public class LispTokenizer implements Iterator
{
  // Instance variables have default access to allow unit tests access.
  StreamTokenizer m_tokenizer;
  IOException m_ioexn;

  /** Constructs a tokenizer that scans input from the given string.
   * @param src A string containing S-expressions.
   */
  public LispTokenizer(String src)
  {
    this(new StringReader(src));
  }

  /** Constructs a tokenizer that scans input from the given Reader.
   * @param r Reader for the character input source
   */
  public LispTokenizer(Reader r)
  {
    if(r == null)
      r = new StringReader("");
    BufferedReader buffrdr = new BufferedReader(r);
    m_tokenizer = new StreamTokenizer(buffrdr);
    m_tokenizer.resetSyntax(); // We don't like the default settings

    m_tokenizer.whitespaceChars(0, ' ');
    m_tokenizer.wordChars(' '+1,255);
    m_tokenizer.ordinaryChar('(');
    m_tokenizer.ordinaryChar(')');
    m_tokenizer.ordinaryChar('\'');
    m_tokenizer.commentChar(';');
    m_tokenizer.quoteChar('"');
  }

  public boolean hasNext()
  {
    if(m_ioexn != null)
      return false;
    try
    {
      m_tokenizer.nextToken();
    }
    catch(IOException e)
    {
      m_ioexn = e;
      return false;
    }
    if(m_tokenizer.ttype == StreamTokenizer.TT_EOF)
      return false;
    m_tokenizer.pushBack();
    return true;
  }

  /** Returns the most recently caught IOException, if any.
   * 
   * @return the caught IOException, or null if none has occurred
   */
  public IOException getIOException()
  {
    return m_ioexn;
  }

  public Token nextToken()
  {
    return (Token)next();
  }

  public Object next()
  {
    try
    {
      m_tokenizer.nextToken();
    }
    catch(IOException e)
    {
      m_ioexn = e;
      return null;
    }

    Token token = new Token(m_tokenizer);
    return token;
  }

  public void remove()
  {
    // Tokens cannot be removed from the input stream.
    throw new UnsupportedOperationException();
  }
}

A token stream can be processed something like this:


LispTokenizer tzr = new LispTokenizer("(define x 42)");
while(tzr.hasNext()) {
   // nextToken() is declared on LispTokenizer, not on Iterator
   Token token = tzr.nextToken();
   processToken(token);
}

And finally, some unit tests.


package jfkbits;

import java.io.StreamTokenizer;
import java.util.Iterator;

import junit.framework.TestCase;

public class LispTokenizerTest extends TestCase
{
  public LispTokenizerTest(String name)
  {
    super(name);
  }

  public void testLispTokenizerIterator()
  {
    LispTokenizer tzr;
    
    tzr = new LispTokenizer("");
    assertFalse(tzr.hasNext());

    tzr = new LispTokenizer(" ");
    assertFalse(tzr.hasNext());

    tzr = new LispTokenizer("\n");
    assertFalse(tzr.hasNext());
    
    tzr = new LispTokenizer("7");
    assertTrue(tzr.hasNext());
    checkToken(1, "7", Token.SYMBOL, tzr.next());
    assertFalse(tzr.hasNext());

    tzr = new LispTokenizer("()");
    assertTrue(tzr.hasNext());
    checkToken(1, null, '(', tzr.next());
    checkToken(1, null, ')', tzr.next());
    assertFalse(tzr.hasNext());

    tzr = new LispTokenizer("(newline)");
    assertTrue(tzr.hasNext());
    checkToken(1, null, '(', tzr.next());
    checkToken(1, "newline", Token.SYMBOL, tzr.next());
    checkToken(1, null, ')', tzr.next());
    assertFalse(tzr.hasNext());
  }

  private void checkToken(int line, String text, int type, Object tokenObj)
  {
    assertNotNull(tokenObj);
    assertTrue(tokenObj instanceof Token);
    Token token = (Token)tokenObj;
    assertEquals(line, token.line);
    if(text != null && token.type == StreamTokenizer.TT_WORD)
      assertEquals(text, token.text);
    assertEquals(type, token.type);
  }

  public void testCharacterMapping()
  {
    assertEquals((int)'(', mkTokenizer("(").nextToken().type);
    assertEquals((int)')', mkTokenizer(")").nextToken().type);
    assertEquals((int)'\'', mkTokenizer("'").nextToken().type);

    assertEquals(StreamTokenizer.TT_WORD, mkTokenizer("0").nextToken().type);
  }

  public void testSimpleLispExpressions()
  {
    test("",new String[]{});
    test("()",new String[]{"(",")"});
    test(" ()",new String[]{"(",")"});
    test("\n()",new String[]{"(",")"});
    test("() ",new String[]{"(",")"});
    test("()\n",new String[]{"(",")"});
  }

  public void testLispExpressionsWithComments()
  {
    test(";Comment here\n()", new String[]{"(",")"});
  }

  public void testLispExpressionsWithStrings()
  {
    test("\"\"",new String[]{""});
    test("\"a\"",new String[]{"a"});
    test("\" a\"",new String[]{" a"});
    test("\"a \"",new String[]{"a "});
    test("(print \"Hello world.\\n\");",
      new String[]{"(","print","Hello world.\n",")"});
  }

  public void testFactorial()
  {
    String src = 
      ";;\n"+
      ";; Compute the factorial of a given number\n"+
      "\n"+
      "(defun fact (n)\n"+ 
      "  (if (< n 2)\n"+ 
      "      1\n"+
      "    (* n (fact (- n 1)))\n"+
      "    )\n"+
      "  )\n"
      ;
    String[] expected = {
      "(","defun","fact","(","n",")",
      "(","if","(","<","n","2",")",
      "1",
      "(","*","n","(","fact","(","-","n","1",")",")",")",
      ")",
      ")"
    };
    test(src,expected);
  }

  static void test(String src, String[] expectedTokens)
  {
    LispTokenizer tzr = mkTokenizer(src);
    int i = 0;
    for(Iterator it=tzr; it.hasNext(); i++)
    {
      Token token = (Token)it.next();
      assertNotNull(token);
      assertTrue("Expected "+expectedTokens.length+" tokens, got more", 
        i < expectedTokens.length);
      assertEquals(expectedTokens[i], token.toString());
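    }
    // Also fail if the tokenizer produced fewer tokens than expected
    assertEquals(expectedTokens.length, i);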
  }

  static LispTokenizer mkTokenizer(String src)
  {
    return new LispTokenizer(src);
  }
}

Motivations for Little Interpreters

2008-05-07T22:25:00.003-05:00

We're continuing a series about writing little interpreters. Although it is fun and cool, so are isolated guitar licks for about the first five minutes; long-term you probably will want to play whole songs if you want to have listeners. Similarly, you need an application, killer or not, for an interpreter or it will lie around collecting bit dust and require "maintenance" (right, Dijkstra?).

I suppose you could say I'm talking about an interpreter for a Domain Specific Language here, but really all I mean is that if you care enough to write an interpreter, you may as well take the opportunity to build into it something you can use, and for this post I'll address the area of a useful domain.

Specifically, if you care about an interpreter it means it will support a domain you care about: photos, 3D graphics, matrices.

In the implementation language of the interpreter, you will want to define
1. Data structures
2. Ways to construct and operate on them

and you'll make provisions for these in your language. Thinking these through carefully, and probably writing them into a runtime library complete with unit tests, is a respectable way to approach things before sticking an interpreter front end onto the data and code of your domain of interest.

Let me give you one simple example of an interpreter I wrote for personal productivity. In 1995 I wrote "Joel's Calculator" to help me write status reports for my manager, and also to do some basic statistical analysis. (No, I didn't have Excel; I was working with an HP-UX workstation.) Our embedded systems group had gotten a rare opportunity for a complete product rewrite, and we needed to track our time and estimates carefully. We had to track time spent on particular modules and count overhead, and in our weekly status report submit an account of time on our various modules, with overhead (anything not in a particular module) evenly divided among our projects. I would keep a log of time spent that looked something like this:

Tue May 7
0840-0920 Email
0920-1135 Serial driver
1250-1300 Ovr
1300-1400 Phase 2 mtg
1400-1600 Serial driver testing
1600-1715 Code review

It was an easy way to keep track, in a text editor. Then at the end of the week, I would sum the "billable hours" for modules like Serial driver. What I ended up wanting was a way to do "time arithmetic":


serial = 11:35-9:20 + 16:00-14:00
ovr = 9:20-8:40+0:10+14:00-12:50+17:15-16:00

This worked very well for me as a workflow. I could write simple expressions like this, assign them to variables, combine the results gradually, and so on. This was really a desk calculator, but I was pleased with the effect that I could enter times textually in a very natural way, use simple infix arithmetic notation, and have it just work.

In this case my chief domain of interest was pretty simple, a pair of hour and minutes, and the operations on the domain were also fairly simple, but a little challenging to get right.
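As an illustration of how small this domain is, here is a hedged sketch of time arithmetic in Java. It is not the original calculator: the class TimeArithmetic and its methods are hypothetical, times are held as a count of minutes, and the evaluator handles only H:MM terms joined by + and -, applied left to right, with no leading sign, variables, or parentheses.

```java
class TimeArithmetic
{
  /** Parses "H:MM" (surrounding spaces allowed) into a count of minutes. */
  static int parseTime(String s)
  {
    int colon = s.indexOf(':');
    int hours = Integer.parseInt(s.substring(0, colon).trim());
    int minutes = Integer.parseInt(s.substring(colon+1).trim());
    return hours*60 + minutes;
  }

  /** Formats a count of minutes as "H:MM". */
  static String format(int total)
  {
    String sign = total < 0 ? "-" : "";
    int abs = Math.abs(total);
    return sign + (abs/60) + ":" + String.format("%02d", abs%60);
  }

  /** Evaluates H:MM terms joined by + and -, left to right. */
  static int eval(String expr)
  {
    int total = 0;
    int sign = 1;
    int start = 0;
    for(int i = 0; i <= expr.length(); ++i)
    {
      boolean atEnd = (i == expr.length());
      if(atEnd || expr.charAt(i) == '+' || expr.charAt(i) == '-')
      {
        // Close off the term that ends here, then note the next sign.
        total += sign * parseTime(expr.substring(start, i));
        if(!atEnd)
          sign = (expr.charAt(i) == '+') ? 1 : -1;
        start = i+1;
      }
    }
    return total;
  }

  public static void main(String[] args)
  {
    // Prints 4:15
    System.out.println(format(eval("11:35-9:20 + 16:00-14:00")));
    // Prints 3:15
    System.out.println(format(eval("9:20-8:40+0:10+14:00-12:50+17:15-16:00")));
  }
}
```

Storing everything as minutes keeps addition and subtraction trivial; the part that takes care is at the edges, in parsing and in formatting results such as negative durations.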

Of course, you may be coming at this from the perspective that the goal and the domain are already well defined. For example, you're working with a library of interest and you'd like to play with it interactively; you can envision making Scheme bindings and writing abstractions to make it do something useful. Or you even have a very powerful system written in an existing but limited language, but you need to bust out of the limits. Using a lightweight Lisp/Scheme interpreter that you can modify to script some macros to generate code in another language may solve some scaling problems for you.

So, if you want to start a little interpreter project, which I'd like to encourage, before you get started, pick a goal, pick a domain, and refine the domain as the first step.

Google AppEngine Activation Arrives

2008-05-06T15:05:00.003-05:00

I signed up for Google AppEngine, well-covered on programming.reddit.com, about a month ago, but got the "don't email us, we'll email you" message at that time. Today I got my email letting me know I can start farming on the Google spread.

Little Lisps: Programming Candy or Spinach?

2008-05-02T21:42:00.004-05:00

Last time (StreamTokenizer and Declarative Lexing), I mentioned an idea of presenting language designs in this space as a puzzle to be solved, like a crossword puzzle. I invented the idea there and it's been growing on me.

The previous week or so I'd been reading Sam Kamin's "Programming Languages: An Interpreter-Based Approach", so this didn't seem like such a crazy idea as it may sound. In "Programming Languages", Kamin starts with an interpreter for an extremely simple language using Lisp syntax, and proceeds with each chapter to show what modifications need to be made to get interpreters for Lisp, Scheme, SASL, APL, Smalltalk, Prolog and others. They all use Lisp syntax, so the code changes are kept quite manageable, and the reader can focus on the essential differences in scoping, evaluation strategies, and the like.

It seems, as a first step, if the Language Design Crossword Puzzle were to be a reality, that making available a standard interpreter source such as Kamin's is a reasonable idea. "Given interpreter0, add macros" would be a puzzle. "Given interpreter0, add object serialization." This is similar in spirit to comments I made earlier about the utility of lambda calculus for studying language features.

But is the Language Design Crossword Puzzle a good idea? What's the point? Arcane Sentiment, a blog I discovered and subscribed to today, introspects on writing little language implementations. He describes certain parts of the exercise as "programming candy", and ironically, they're often the parts written in C, a series of little programming victories. The hard, ill-defined problems to be written in Lisp are the parts that tend to slow him down and demotivate him. (Arcane, I hope I'm fairly characterizing that post. Please correct me if not.)

I had to chuckle in self-recognition, as earlier this week I was watching the first few SICP videos, evaluating the examples in a Scheme-ish interpreter I'd whipped up on Monday and Tuesday, extending it during pauses in dialog while Sussman wrote on the blackboard. Until he hit the pi fraction example and I realized I wasn't at all sure if I wanted to right at that moment be writing code to rationalize denominators and factor fractions or whatever else Scheme might do to support exact rational number (fraction) arithmetic (e.g. (/ 1/2 (+ 1/8 1/3))). That problem at that moment was not interesting for me, and was not well-specified; are rationals always reduced to lowest terms? How is conversion to and from reals handled? I'd have to go study R5RS to learn the expected behavior. Handling the number tower was not my goal going into this project.
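For a sense of the size of that side problem, here is a minimal sketch of exact fraction arithmetic in Java. The Rational class is hypothetical and makes one particular choice on the open question above: it always reduces to lowest terms and keeps the sign in the numerator. It uses long arithmetic with no overflow checks, and it does not attempt conversion to or from reals.

```java
class Rational
{
  final long num;
  final long den;

  /** Builds num/den reduced to lowest terms, with a positive denominator. */
  Rational(long n, long d)
  {
    if(d == 0)
      throw new ArithmeticException("zero denominator");
    if(d < 0) { n = -n; d = -d; }
    long g = gcd(Math.abs(n), d);
    num = n/g;
    den = d/g;
  }

  /** Euclid's algorithm; gcd(0, b) is b. */
  static long gcd(long a, long b)
  {
    while(b != 0) { long t = a % b; a = b; b = t; }
    return a;
  }

  Rational plus(Rational o)
  {
    return new Rational(num*o.den + o.num*den, den*o.den);
  }

  Rational dividedBy(Rational o)
  {
    return new Rational(num*o.den, den*o.num);
  }

  public String toString()
  {
    return den == 1 ? Long.toString(num) : num + "/" + den;
  }

  public static void main(String[] args)
  {
    // (/ 1/2 (+ 1/8 1/3)): the sum is 11/24, the quotient 24/22 reduces to 12/11
    Rational sum = new Rational(1, 8).plus(new Rational(1, 3));
    System.out.println(new Rational(1, 2).dividedBy(sum)); // prints 12/11
  }
}
```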

Why does Arcane Sentiment dabble in Lispy implementations? Why do I? Why reinvent the wheel? For me, it's a way to learn, to study. Toy implementations are rewarding, as they let you discard parts of a language implementation system that are indeed hard, and focus on particular points of interest. You need to be careful not to oversimplify, if you intend to take your lessons back to a real system. But this approach is something we advise junior programmers all the time: if you're struggling with how a library or language feature is working in the application, try writing a small example program first until you understand how it works.

So, I propose we have the best of both worlds. Language design problems can be programming candy, as well as programming spinach, that is, something good for you. My wife has been making a spinach recipe from her Spain and Portugal cookbook which features raisins and pine nuts. It rocks.

The other question is, is there interest in a "language design puzzle" feature? Before we get to that, let me ask a more relevant question: what aspects of programming language implementation or operation are of interest to you? Macros? Evaluation strategy? Optimizations? Drop me a line to the Google mail account jfkbits and let me know.

Blog challenge: write a post using the phrase "free as in spinach".

StreamTokenizer and Declarative Lexing

2008-04-30T13:26:00.005-05:00

One thing I've noticed about marketing is how frequently an appeal is made to change your lifestyle to include more of whatever is being sold. I first noticed this when I read in a catalog that Ikea wants me out of that rat race of the outside world, so I can spend more time at home. And by the way, don't I want my home to be a comfortable inviting place to spend time? After that, I've seen it everywhere. Here at JFKBits we want to know why the world doesn't write more language processors. Everyone can enjoy a world of recreational symbol translation. I was on vacation last week, reading a programming language book by the beach. Maybe someday we'll run programming language designs in this space and JFKBits readers can implement them as a kind of crossword puzzle of the day, in an hour on the train home from work. So today, we're thinking about how to get a lexer up and going more quickly.

When you're designing a syntax, it's typically done at a high level:

Comment syntax: line-oriented or delimited, or both? do comments nest?

Identifier syntax: limited choice for the first character (e.g. letters and '_'), followed by a mix of alphanumerics and underscores. In Lisp, symbol syntax is much more flexible.

Operators and grouping syntax: parentheses, brackets, multi-character operators like ++, ->, and &&=.

It might be useful, in terms of quickly prototyping a domain-specific language or language-aware utility (e.g. a tool that finds buggy code patterns), to have a lexer generator that lets you declare properties in these terms rather than translating to regular expressions.

Java's StreamTokenizer, while not a universal tool, is a step in that direction.

What I like about StreamTokenizer is that it raises the level of programming closer to the problem domain. Regular expressions are the traditional and well-understood means of specifying a lexer's accepted language, but this declarative style is possibly a better programmer's tool. (Note well that I'm talking about StreamTokenizer, not StringTokenizer which is a much simpler state machine.)

For example, in playing with Kamin's interpreters with their S-expression-based syntax, I've been using this configuration to scan LISP-type code in Java:


tokenizer = new StreamTokenizer(new BufferedReader(reader));
tokenizer.resetSyntax(); // We don't like the default settings
tokenizer.whitespaceChars(0, ' ');
tokenizer.wordChars(' '+1,255);
tokenizer.ordinaryChar('(');
tokenizer.ordinaryChar(')');
tokenizer.commentChar(';');
tokenizer.quoteChar('"');

The basic idea is that StreamTokenizer is a state machine with hard-coded states and transition arcs, and you configure it with what characters are in the relevant character classes. For example, from the state of skipping whitespace, there's a transition to a string literal state, when a character from the "quoteChar" class is encountered. The string literal state accumulates, transitioning back to itself on characters not in the quoteChar class. It's simply up to you to configure which character or characters constitute a quote delimiter. The essential observation is that lexers for many useful languages share state machines of identical shapes, and they differ only in the definitions of character classes.

Of course, not every language fits the preprogrammed state machine shape. StreamTokenizer can't even handle Java, because it has no capacity for multi-character operators. It can be configured so that '+' is an "ordinary character", meaning it is lexed as a single-character token, but there's no way for a parser to know if two '+' tokens in sequence came from the input "++" or "+ +". This is what I mean when I say StreamTokenizer is not a universal lexing tool.
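The limitation is easy to see in a short sketch. The class name PlusPlusDemo and the helper tokens are made up for this illustration; the settings follow the S-expression configuration above, with '+' marked as an ordinary character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PlusPlusDemo
{
  /** Scans src and returns the text of each token, in order. */
  static List<String> tokens(String src)
  {
    StreamTokenizer tzr = new StreamTokenizer(new StringReader(src));
    tzr.resetSyntax();
    tzr.whitespaceChars(0, ' ');
    tzr.wordChars(' '+1, 255);
    tzr.ordinaryChar('+'); // each '+' becomes a single-character token

    List<String> out = new ArrayList<String>();
    try
    {
      while(tzr.nextToken() != StreamTokenizer.TT_EOF)
      {
        if(tzr.ttype == StreamTokenizer.TT_WORD)
          out.add(tzr.sval);
        else
          out.add(String.valueOf((char)tzr.ttype));
      }
    }
    catch(IOException e)
    {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args)
  {
    // Both lines print [i, +, +]: the whitespace between the plus signs is gone
    System.out.println(tokens("i++"));
    System.out.println(tokens("i+ +"));
  }
}
```

By the time the parser sees the tokens, nothing records whether the two plus signs were adjacent in the input.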

But I still wonder if there's room for this declarative manner of input, building on higher-level concepts like identifiers and operators, for a more general lexing tool. This could be combined with a programmatic knowledge base of some of the standard lexical idioms running around. You could say "give me Python-style identifiers, with the standard Java operators except these four which don't apply to my language." I'm not at all sure there is need for such generality, but I think it's worth writing down, as an idea for future inspiration.

Small Tokenizing World

2008-04-17T11:48:00.004-05:00

I wanted to see what Java programs out there parse Lisp syntax using the handy StreamTokenizer, and the first one I found is Mike Bayne's, circa 1997.

Mike was one of the bright guys who did version 2 of the Prizm IDE that I helped work on for my senior project at Rose-Hulman. Now he runs his own online game company, ThreeRings.net, out of San Francisco.

AP Computer Science Language

2008-04-15T23:40:00.004-05:00

I've been reading Sam Kamin's "Programming Languages: An Interpreter Based Approach" from 1990, which compares programming languages through the device of writing an interpreter in Pascal for a form of the language using Lisp-based syntax.

On reading all that Pascal source code, it takes me back. I reflect that I learned Pascal as a senior in high school, to prepare for the AP Computer Science test. That was in 1990. I see that the AP tests now use Java as their language. I wonder, was there any other language between Pascal in 1990 and Java in 2008?

If I had to bet on what language would succeed Java as that used for the AP exam, I'd bet on Python.