Web, Development, Functional Programming, and Scheme

Flapjax - Functional Reactive Programming in Javascript

2011-03-03T19:10:00.000-08:00

[This preview is posted here during the transition phase; once the transition is complete I will stop posting here]

What if programming could be as simple as writing spreadsheet formulas?

Now - there are some complex spreadsheets out there for sure, but the power of a spreadsheet comes from the fact that its values are self-updating, i.e. changing the value in one cell automatically causes all of its dependent cells to reflect the change. If spreadsheets could not do this auto-propagation, even the simplest spreadsheet would be much more complex.

Most programming languages do not have the auto-propagating capability, but the ones that do are called reactive programming languages. And if the language is also functional, then it's called a functional reactive programming language.

Since most mainstream languages do not have reactive features built in, the feature needs to be "bolted on", as a library or a transformer. The one for Javascript is called Flapjax.

[Continue to read the rest of the post @ the new site....]

Administrivia: Moving of Weblambda to New Blog

2011-03-01T22:38:00.000-08:00

I am consolidating my blogging over to my personal site, http://yinsochen.com, in the near future.

The main site will contain more than just scheme development - if that interests you, that's great. You can grab the whole feed from http://yinsochen.com/feed/.

In case you just want to have scheme-related posts - use http://yinsochen.com/tag/scheme (RSS feed at http://yinsochen.com/tag/scheme/feed/).

There will be time to make the switch - I will be linking from here to the new site for the next few posts, before I fully make the move.

Cheers.

BZLIB/SHP Web API & Converters

2010-07-18T04:27:00.000-07:00

The previous post describes the basics on how to use the web API. This post will focus on integrating your module with the web API.

As shown before, you can create a web API by creating an SHP script as follows:


;; /api/add2 
(:api-args (a number?) (b number?)) 
(+ a b)

Here both a and b are validated as number?. It would be nice if we could validate any type of scheme value, as long as the value can be created from the request input.

For example - let's say that you have a struct with the following definition:

(define-struct foo (bar baz))

We want to do the following:

(:api-args (foo foo?)) ;; takes in the foo struct

And let the web API handle the rest. This is achieved via converters.

Converters

The mappings between the request and the api args are done via converters, which are keyed by the type's test function (such as number?). When you want to use a converter in the api-args expression, you specify the <test?> function in the parameter position in one of the following forms:


(<name> <test?> <default>) ;; this form means the parameter is optional, and if no value is passed in it will take the default value.

(<name> <test?>) ;; this form means the parameter is required - if no value is passed in it will error.  Any value passed in will be validated and converted (if it fails the conversion it will error)

<name> ;; this form means the parameter does not have any validation - any value will be passed verbatim.

Because the <test?> function is used as the key to the actual underlying converter object, we will call the underlying converter object the "<test?> converter", i.e., the actual converter mapped by number? is called the "number? converter" (the quotes are dropped going forward).

You can define your own converters to use in the api-args expression to validate your own objects. To do so, you can define either a scalar or a struct converter as of bzlib/shp:1:3.

Scalar vs. Struct Converters

There are two different types of converters - scalar and struct. They differ in the type of input required. This is best explained with a JSON request.

Scalar converters map to JSON numbers and strings, while struct converters map to JSON objects (JSON arrays are automatically mapped to lists).

As an example - the following JSON object has a scalar value for both the foo and bar keys, but a struct value for the baz key:


{ foo : 1 , bar : "test me" , baz : { abc : 1 , def : 2 } }

So we will use a scalar converter for both the foo and bar fields (number? and string? respectively), and we will use a struct converter for the baz field that takes in the abc and def fields as number?.

The mapping is pretty straightforward for XMLRPC requests as well, except that an XMLRPC method call does not handle named parameters, so we map the arguments by position. Otherwise, XMLRPC's strings and integers map to scalars, and structs map to struct converters (the XMLRPC array maps to lists automatically).

The mapping in query-based requests is a bit more complicated: since the query string consists only of key/value pairs, it does not handle a nested hierarchy of objects by default. We solve this problem by using the dot notation (familiar to most OOP developers) to simulate the hierarchy in the key names themselves. As an example, the query string below maps to the JSON object above:


&foo=1&bar=test%20me&baz.abc=1&baz.def=2

The baz JSON object is flattened into baz.abc and baz.def key/value pairs. This flattening can be done for arbitrary levels.
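
To make the flattening rule concrete, here is a minimal sketch in plain Racket of how dotted keys could be folded back into a nested structure. The names nested-set and unflatten are made up for this illustration and are not part of bzlib/shp; the sketch also assumes the query string has already been split into key/value pairs, and it leaves the values as strings (turning them into numbers is the converters' job).

```racket
#lang racket

;; Hypothetical helper (not part of bzlib/shp): insert VALUE into a nested
;; immutable hash, following the list of key segments in PATH.
(define (nested-set h path value)
  (if (null? (cdr path))
      (hash-set h (car path) value)
      (hash-set h (car path)
                (nested-set (hash-ref h (car path) (hash))
                            (cdr path)
                            value))))

;; Turn an association list of dotted keys into a nested hash.
(define (unflatten pairs)
  (for/fold ([h (hash)]) ([kv (in-list pairs)])
    (nested-set h (string-split (car kv) ".") (cdr kv))))

;; The key/value pairs of &foo=1&bar=test%20me&baz.abc=1&baz.def=2
(unflatten '(("foo" . "1") ("bar" . "test me")
             ("baz.abc" . "1") ("baz.def" . "2")))
```

Because nested-set recurses on the remaining segments, the same code handles keys such as a.b.c.d without any changes.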

Hence - you need to ensure that the requests are constructed correctly based on the above rules. The rules for JSON & XMLRPC are specified in their corresponding specs, and the query-string rules are specified here.

Define a Scalar Converter

To define a scalar converter, we use the define-scalar-converter! syntax.


(define-scalar-converter! <test?> 
  (<type?> <transform-from-type?>) ...)

For example – the following is the definition of the number? converter:


(define-scalar-converter! number? (string? string->number))

It roughly means the following:


(lambda (x) 
  (cond ((number? x) x) 
        ((string? x) 
         ;; bind the result to a new name so the error can report the original input
         (let ((n (string->number x))) 
           (if n
               n 
               (error 'invalid-conversion "~s" x))))
        (else (error 'invalid-type "~s" x))))

If the passed-in value is already the desired type, we just let it pass through. Otherwise we test to see if it is one of the known source types (and error out if it is not), and try to convert the value. If the conversion succeeds, we return the converted value; otherwise, we throw an exception.
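
The same logic can be written as an ordinary function that builds a converter from a list of rules. This is only a sketch of the idea in plain Racket, not how define-scalar-converter! is actually implemented; make-scalar-converter and to-number are names invented for the example, and it assumes each transform returns #f when it fails.

```racket
#lang racket

;; Hypothetical stand-in for what define-scalar-converter! sets up:
;; TEST? is the target predicate; each rule pairs a source-type predicate
;; with a transform that returns #f on failure.
(define (make-scalar-converter test? . rules)
  (lambda (x)
    (if (test? x)
        x                                   ; already the desired type
        (let loop ([rules rules])
          (cond [(null? rules) (error 'invalid-type "~s" x)]
                [((car (car rules)) x)      ; a known source type
                 (or ((cdr (car rules)) x)
                     (error 'invalid-conversion "~s" x))]
                [else (loop (cdr rules))])))))

(define to-number
  (make-scalar-converter number? (cons string? string->number)))

(to-number 50)     ; passes through unchanged
(to-number "90")   ; converted from a string
```

Calling (to-number "not-a-number") raises an invalid-conversion error, and (to-number 'abc) raises invalid-type, since a symbol matches none of the rules.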

Although the other type of converter is called a struct converter, we can define a scalar converter for a struct as well, as long as the struct can be mapped from a scalar value. NOTE - you cannot define both a scalar and a struct converter for the same struct, since all converters share the same namespace.

Here's an example of a scalar converter for a struct – we will create a scalar converter for url? from net/url.


(define-scalar-converter! url? 
  (string? string->url) 
  (bytes? (compose string->url bytes->string/utf-8)))

Define Struct Converters

To define a struct converter, we use the define-struct-converter! syntax.


(define-struct-converter! name ((field converter?) ...))

For example – the following defines a struct converter for the foo struct above:


(define-struct-converter! foo ((bar number?) (baz string?)))

The struct converter looks like the struct form in provide/contract, but each field is accompanied by a converter instead of a contract expression.

The field converters must already have been defined, or else the definition will fail and throw an exception.

The field converter can of course be either a scalar converter or a struct converter as well.
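
As a rough picture of what a struct converter does, the sketch below builds one by hand in plain Racket. It assumes the incoming object is represented as a hash keyed by field name; make-struct-converter, ->number, ->string and hash->foo are hypothetical names for this illustration and do not come from bzlib/shp.

```racket
#lang racket

(struct foo (bar baz) #:transparent)

;; Hypothetical sketch: build a converter that reads each field from a
;; hash, runs that field's converter, and calls the struct constructor
;; with the converted values in field order.
(define (make-struct-converter ctor . fields)
  (lambda (h)
    (apply ctor
           (for/list ([f (in-list fields)])
             ((cdr f) (hash-ref h (car f)))))))

;; Simple field converters standing in for the number? and string? converters.
(define (->number x) (if (number? x) x (string->number x)))
(define (->string x) (if (string? x) x (format "~a" x)))

(define hash->foo
  (make-struct-converter foo (cons 'bar ->number) (cons 'baz ->string)))

(hash->foo (hash 'bar "1" 'baz "test me"))
```

Since a field converter is just a function from a raw value to a converted value, another struct converter can be used in a field position to handle nested objects.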

Converter Definition Orders and Other Design Decisions

Converters should be defined in the order from the most general types to the most specific types, similarly to how object hierarchies are defined. For example, if you have a sub struct, you should define the converter for the sub struct after you have defined the converter for the parent struct. And if you want to have converter for both number? and integer?, you should first define the number? converter and then define the integer? converter.

One of the design decisions for the converters is that once a converter is defined it is immutable – future definitions are simply ignored.

Furthermore, scalar converters and struct converters share the same namespace.

The above two points mean that you can only define either a scalar converter or a struct converter for any given type, and once it is defined it cannot be changed.
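
These two rules can be pictured as a single table in which the first definition wins. The following is a sketch under that assumption, in plain Racket; registry, define-converter! and converter-for are names made up for the example rather than the library's internals.

```racket
#lang racket

;; Hypothetical registry: one shared table keyed by the test predicate,
;; so scalar and struct converters compete for the same slot.
(define registry (make-hasheq))

;; Register CONVERTER under TEST? unless something is already there.
;; Returns #t when the definition took effect, #f when it was ignored.
(define (define-converter! test? converter)
  (cond [(hash-has-key? registry test?) #f]
        [else (hash-set! registry test? converter) #t]))

(define (converter-for test?)
  (hash-ref registry test?
            (lambda () (error 'converter-for "no converter for ~s" test?))))

(define-converter! number? string->number)   ; takes effect
(define-converter! number? (lambda (x) 0))   ; ignored: number? is already taken
```

After both calls, (converter-for number?) still returns string->number.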

Where to Define Converters

Since the design philosophy behind bzlib/shp is that shp scripts should be used for presentation and the application logic should reside in regular PLT Scheme/Racket modules/packages, converters are best defined in the respective modules.

However, for ad hoc purposes, the required script (the single shp script that is responsible for loading the required modules) has been extended in bzlib/shp 0.4 so that it can now handle ad hoc definitions as well. So if you want to define converters in the required script you can do so, but remember that if these definitions look like they should be shared, they may eventually need to migrate into their respective modules.

That's it for now. Feel free to leave a comment if there are any questions. Enjoy.

BZLIB/SHP.plt 0.4 now available - Web API

2010-07-08T12:31:00.000-07:00

A new version of SHP.plt is now available via planet. This is a major rewrite of SHP and provides two main upgrades:

a "web API" interface - your web script can now be exposed as an "API" (think XMLRPC/JSON), and it automatically works with either XMLRPC or JSON (details below)
general performance enhancement - the scripts are now compiled and cached to reduce disk IO. If the scripts are updated then they are automatically recompiled

As usual, the code is released under LGPL.

Installation

(require (planet bzlib/shp:1:3))

SHP requires some newer dependencies (bzlib/base:1:6, bzlib/date:1:3, bzlib/parseq:1:3, bzlib/xml:1:3, bzlib/mime:1:0), and the current versions of PLT Scheme and Racket have issues with version dependencies (the link: module mismatch bug), so you might have to clear out the planet cache and recompile them again.

As usual, SHP comes with a small example site that you can play with under the example sub directory - cd to the example directory and run (require "web.ss") to start the example site. The example site is still just trivial code right now - it will eventually be enhanced and separated into its own package.

Cached Compiled Script

All of the scripts are now compiled and cached. This has some potential performance benefit, since we will only access the file content when the file timestamp changes (meaning the file has been touched and/or modified). As this is a non-visible feature, we won't spend much time discussing it, except to note that the change is not just done for performance reasons - it is also done to enable and simplify the design of web api, which is discussed below.
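
The timestamp check can be sketched in a few lines of plain Racket. This is an illustration of the caching idea only, not SHP's actual code: make-script-cache is a hypothetical name, the compile step is passed in as a function, and the modification time lookup is a parameter (defaulting to Racket's file-or-directory-modify-seconds) so the behavior is easy to try without touching real files.

```racket
#lang racket

;; Hypothetical sketch of a timestamp-checked cache. COMPILE turns a path
;; into a compiled value; MTIME returns the file's modification time.
(define (make-script-cache compile [mtime file-or-directory-modify-seconds])
  (define cache (make-hash))            ; path -> (cons timestamp compiled)
  (lambda (path)
    (define stamp (mtime path))
    (define entry (hash-ref cache path #f))
    (if (and entry (= (car entry) stamp))
        (cdr entry)                     ; unchanged: reuse the compiled value
        (let ([compiled (compile path)])
          (hash-set! cache path (cons stamp compiled))
          compiled))))
```

The function returned by make-script-cache is called with a script path on every request; the compile step only runs the first time, or when the timestamp differs from the one stored with the cached value.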

Web API

Under the example site you can find the script shp/api/add2, which contains the following:


;; -*- scheme -*- -p 
(:api-args (a number?) (b number?)) 
(+ a b)

This is the new *web api* - it takes in 2 numbers, a and b, and returns their sum. To write an api script, you must use the :api-args expression, and then supply the arguments inside. The arguments can be specified in the following forms:


(:api-args a b) ;; both a & b are non-validating and you get what's passed in

(:api-args (a number?) (b string?)) ;; a expects a number, and b expects a string 

(:api-args (a number? 3) (b number? 5)) ;; a & b both expect numbers, and both have default values if they are not passed in (a defaults to 3, and b defaults to 5).
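
The three forms can be read as three lookup strategies against the request. Below is a minimal sketch in plain Racket, assuming the request is a hash from parameter name to raw value and that a converter is a plain function; bind-arg is a hypothetical name, and returning #f for a missing unvalidated parameter is an assumption of this sketch, not documented behavior.

```racket
#lang racket

;; Hypothetical sketch of resolving one :api-args spec against the request.
;; SPEC is a bare name, (name convert), or (name convert default); CONVERT
;; stands in for the converter that the test function maps to.
(define (bind-arg spec request)
  (match spec
    [(list name convert default)        ; optional: fall back to the default
     (if (hash-has-key? request name)
         (convert (hash-ref request name))
         default)]
    [(list name convert)                ; required and validated
     (if (hash-has-key? request name)
         (convert (hash-ref request name))
         (error 'required "~a" name))]
    [name (hash-ref request name #f)])) ; no validation (#f if missing is assumed)

(define request (hash 'a "50"))
(bind-arg (list 'a string->number) request)     ; a is present, so it is converted
(bind-arg (list 'b string->number 5) request)   ; b is missing, so the default is used
```

With the two-element form, a missing parameter raises an error instead of falling back to a default.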

When you run the example site you can access the api via the following http call:


GET /api/add2 HTTP/1.0

When running the above in browser you should get back an XMLRPC response:


Content-Type: text/xml; charset=utf-8 

<methodResponse>
<fault>
<value>
<string>required: a</string>
</value>
</fault>
</methodResponse>

XMLRPC is the default response mode for web APIs. What it returns above is an error message, because neither a nor b is passed in.

To pass in the values - you just need to specify them in the query string as follows:


GET /api/add2?&a=50&b=90 HTTP/1.0

which will return the following:


Content-Type: text/xml; charset=utf-8

<methodResponse>
<params>
<param>
<value>
<int>140</int>
</value>
</param>
</params>
</methodResponse>

The add2 script contains the args (a number?) and (b number?), which means that both a & b expect a number as input. So if we pass a non-number to the api


GET /api/add2?&a=not-a-number&b=also-not-a-number HTTP/1.0

we get the following:


<methodResponse>
<fault>
<value>
<string>invalid-conversion: "not-a-number"</string>
</value>
</fault>
</methodResponse>

Which shows that the validation takes place for each of the arguments. We will talk about how the underlying validation magic works later.

XMLRPC Request Payload

Now the API doesn't just work for query string parameters - you can pass in an XMLRPC payload as well. To do so you need to use POST instead of GET, and put the following XMLRPC request into the payload:


POST /api/add2 HTTP/1.0 
Content-Type: text/xml; charset=utf-8 

<methodCall>
<methodName>add2</methodName>
<params>
<param><name>a</name><value><int>50</int></value></param>
<param><name>b</name><value><int>90</int></value></param>
</params>
</methodCall>

Which will return the same result:


Content-Type: text/xml; charset=utf-8

<methodResponse>
<params>
<param>
<value>
<int>140</int>
</value>
</param>
</params>
</methodResponse>

Partial Path Dispatch with XMLRPC methodName

You might have noticed in the above that the token add2 is specified twice in the request - once in the path, and once in the methodName parameter in the payload. And if you were to use a bogus method name such as add3 in the methodName parameter, you would see that it is ignored - the path has precedence.

So - the dispatch rule is - if the path matches the script exactly, the methodName parameter will be ignored.

This is designed to follow the same dispatching rule as regular SHP scripts, and to ensure that one script compiles into one api.

But if you would like to use the regular XMLRPC dispatch, it also works - you just need to remove add2 from the path, as follows:


POST /api HTTP/1.0 # notice - no add2 here 
Content-Type: text/xml; charset=utf-8 

<methodCall>
<methodName>add2</methodName>
<params>
<param><name>a</name><value><int>50</int></value></param>
<param><name>b</name><value><int>90</int></value></param>
</params>
</methodCall>

So instead of posting to /api/add2, just post to /api, and specify add2 in the methodName parameter. This is called partial path dispatch, for lack of a better term for now.

NOTE - in order for partial dispatch to work, the /api directory must not contain an index script, because otherwise the index script will be matched first prior to the partial dispatch kicking in.

Partial Path Dispatch With Query String

There is another way of doing partial path dispatch, and that involves the use of the query string. Yes, you can specify a query string with POST requests, and the query string values will be parsed properly in SHP.

The way to do it is to specify a query key that starts with **. So for example, if we want to dispatch to /api/add2, we can dispatch as follows:


POST /api?**add2 HTTP/1.0 # note the **add2 key in the query string. 
Content-Type: text/xml; charset=utf-8 

<methodCall>
<methodName>add3</methodName><!-- note the wrong method name --> 
<params>
<param><name>a</name><value><int>50</int></value></param>
<param><name>b</name><value><int>90</int></value></param>
</params>
</methodCall>

The above has exactly the same effect as if you had done a direct dispatch - the ** prefix is stripped, and the rest of the key is appended to the path. And this rule, along with the direct dispatch, supersedes the method name in the xmlrpc payload.
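
Put together, the dispatch rules so far amount to a short decision list. Here is a sketch in plain Racket, assuming the known scripts are given as a list of paths; resolve-script is a hypothetical name, and checking the exact path match before the ** key is an assumption of this sketch (the rules above only say that both take precedence over methodName).

```racket
#lang racket

;; Hypothetical sketch of the dispatch rules described above.
;; SCRIPTS is the list of known script paths, PATH the request path,
;; QUERY-KEYS the query-string keys, METHOD-NAME the XMLRPC methodName or #f.
(define (resolve-script scripts path query-keys method-name)
  (define star-key
    (for/first ([k (in-list query-keys)]
                #:when (string-prefix? k "**"))
      (substring k 2)))                 ; strip the ** prefix
  (cond [(member path scripts) path]    ; exact match: methodName is ignored
        [star-key (string-append path "/" star-key)]
        [method-name (string-append path "/" method-name)]
        [else #f]))

(define scripts '("/api/add2"))
(resolve-script scripts "/api/add2" '() "add3")      ; "/api/add2"
(resolve-script scripts "/api" '("**add2") "add3")   ; "/api/add2"
(resolve-script scripts "/api" '() "add2")           ; "/api/add2"
```

The sketch leaves out the index-script caveat from the NOTE above and does not check that the computed path actually names an existing script.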

This design exists for a reason - to deal with web forms that have multiple submit buttons.

Web forms with multiple submit buttons usually map each button to a different action. Instead of requiring you to dispatch manually on the server side based on which button was clicked (the name & value of the clicked submit button get submitted to the server along with the data), SHP lets you partition the API calls based on the name of the button. Partitioning on the name rather than the value is better, because it lets you localize the value without worrying about breaking the script.

For example, if you have a server-side-based calculator that has the +, -, *, /, = submit buttons, you can theoretically have the following scripts:


/api/add 
/api/subtract
/api/multiply
/api/divide
/api/equal

And then have the buttons mapped to the name of **add, **subtract, **multiply, **divide, and **equal, and map the form's action to /api. SHP will do the dispatching for you correctly.

JSON Request & Responses

Besides working with query strings and XMLRPC requests and responses, the same api script can also handle JSON requests & responses - you just need to POST the payload with the appropriate content-type, which is text/json.


POST /api/add2 HTTP/1.0 
Content-Type: text/json; charset=utf-8 

{ a : 50 , b : 90 }

And the response will automatically be a JSON as well:


Content-Type: text/json; charset=utf-8 

140

Since JSON does not have a method name parameter, it does not have a corresponding partial path dispatch rule, but the query string's partial path dispatch rule remains in effect.

And of course if you are using JSON you may want to use JSONP, which you can do by specifying an additional ~jsonp query string parameter as follows:


POST /api/add2?~jsonp=myCallback HTTP/1.0 
Content-Type: text/json; charset=utf-8 

{ a : 50 , b : 90 }

Which will generate the following response as a JSONP result:


Content-Type: text/json; charset=utf-8

return myCallback( 140 );

That explains the basic rules with regards to using the web api. The next post will discuss the syntax of the web API, as well as how to extend it so you can use other types.

BZLIB/PLANET.plt - A Local PLANET Proxy/Repository Server

2010-01-18T18:48:00.000-08:00

PLT's planet system is great - if you want to install a planet module, you just declare it in the code, and it will automatically manage the download and the install for you, without you having to separately run another module installation program like Perl's CPAN, Ruby's GEM, PHP's PEAR, etc.

However, planet can still be improved. Specifically, as there is currently only a single central repository, you might experience an inconvenient server outage from time to time. And it is difficult to take advantage of planet's automatic code distribution power unless you plan on releasing the code for public consumption.

That is - until today.

BZLIB/PLANET.plt is designed to solve exactly the issue of a single planet repository. Going forward, you can install bzlib/planet and run a local planet proxy and repository. bzlib/planet is, of course as usual, available via planet under LGPL ;)

Usage

bzlib/planet contains a proxy server that you can set up and run. The first thing to do is to install it via require:


(require (planet bzlib/planet/proxy))

After you have installed proxy, you can start the proxy server via:


(parameterize ((current-command-line-arguments #("-r" "/path/to/local/planet/repo" "-p" "9000")))
    (start!))

Alternatively, you can also choose to start it via the included unix script (in the root directory of the installed location for bzlib/planet.plt) called "proxy" if you are in Unix or Mac OSX. And you can start it with something like:


# assume you are in the package root directory
# you might have to run chmod +x proxy first 
$ chmod +x proxy 
$ ./proxy -r </path/to/local/planet/repo> -p <port=9000>

which will parameterize the command line arguments and call start! for you.

The -r argument points to the local repository path - it defaults to /var/data/plt/planet-repo, and can be controlled via -r or via the environment variable named BZLIB_PLANET_REPO_PATH. Make sure the path exists before starting the server.

The -p argument holds the port number that the server will bind against, and it defaults to 9000.
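
The lookup of these two settings can be sketched as below. This is plain Racket written for illustration, not the proxy's own option handling: repo-path and port-number are hypothetical names, each takes the command line value (or #f when the flag is absent), and letting -r take precedence over the environment variable is an assumption, since the order is not stated above.

```racket
#lang racket

;; Hypothetical sketch of the option lookup. Assumption: when both -r and
;; BZLIB_PLANET_REPO_PATH are set, the command line value wins.
(define (repo-path cli-value)
  (or cli-value
      (getenv "BZLIB_PLANET_REPO_PATH")
      "/var/data/plt/planet-repo"))

;; CLI-VALUE is the string given to -p, or #f when -p is absent.
(define (port-number cli-value)
  (if cli-value (string->number cli-value) 9000))

(repo-path "/tmp/repo")
(port-number #f)
```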

PLT Scheme Planet Configuration

To have your DrScheme or mzscheme call the planet proxy instead of the central planet repository, on version 4.2.3 or later, you can do so via the environment variable PLTPLANETURL. So if you have your planet proxy running on http://planet-proxy:9000, you should set the environment variable PLTPLANETURL to http://planet-proxy:9000.

If you are running a prior version of PLT, you can make manual modifications to the collects/planet/config.ss, and change the parameter HTTP-DOWNLOAD-SERVLET-URL to the url of your planet proxy.

Proxy Server Behavior

That's all you have to do to set up and use the proxy. Your regular require statements will work exactly the same way with the proxy. When you request an uninstalled module, the planet module system will now call the proxy server, and the proxy server will check whether the package is available locally. If it is, and the version constraint matches, the package is sent to the planet module system. Otherwise, the proxy server forwards the request to the central repository. If the central repository holds the module, it is downloaded, stored in the local repository, and sent to the planet module system.

Hence, subsequent calls will not require another trip to the central repository. This decouples the proxy from the central repository so you do not need to maintain a live internet connection. The downside is that the download count on the central repository for your module will no longer be accurate going forward ;)
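
The lookup order boils down to "local first, then central, and remember what you fetched". The sketch below shows that flow in plain Racket; it is not the proxy's code. make-proxy is a hypothetical name, a mutable hash stands in for the on-disk repository, the call to the central server is passed in as a function, and version constraint matching is left out to keep the example short.

```racket
#lang racket

;; Hypothetical sketch of the lookup order. LOCAL is a mutable hash standing
;; in for the local repository; FETCH-CENTRAL stands in for the HTTP call to
;; the central server and returns the package data or #f.
(define (make-proxy local fetch-central)
  (lambda (package)
    (or (hash-ref local package #f)             ; served from the local repository
        (let ([data (fetch-central package)])
          (when data
            (hash-set! local package data))     ; store it for next time
          data))))
```

The second request for the same package never reaches fetch-central, which is why the download count on the central repository stops being accurate.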

Private Packages

A derived benefit of running a local planet proxy is that you can now take advantage of the automatic code distribution power without having to make your code publicly available, i.e., you can release private code via planet now.

The way to do it is to simply create the appropriate directory structures, and drop your planet packages in them.

The planet repository has the following directory structure:
/<planet-user>/<package-name>/<major-version>/<minor-version>/<package-name>. As an example, bzlib/planet.plt has the following directory structure: /bzlib/planet.plt/1/0/planet.plt. So just drop your packages in the appropriate directories, and the proxy will treat them as if they were released through the central repository.
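
If you want to script the drop, the target location can be computed from the package coordinates. A small sketch in plain Racket, where package-path is a hypothetical helper name and the repository root is whatever you passed to -r:

```racket
#lang racket

;; Hypothetical helper: where a package file would live under the
;; repository root, following the layout described above.
(define (package-path root user package major minor)
  (build-path root user package
              (number->string major) (number->string minor)
              package))

(path->string
 (package-path "/var/data/plt/planet-repo" "bzlib" "planet.plt" 1 0))
```

Creating the parent directories (for example with make-directory*) and copying the .plt file into place is left to the caller.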

There are other enhancements that we can add to the package, such as mirroring the central repository, or automating the drop process, etc. If you have other ideas, I would love to hear them. For now - enjoy.

BZLIB/PARSEQ.plt - (4) Token-Based Parsers API

2010-01-08T19:00:00.000-08:00

Previously we have looked at the fundamental parsers and the combinators API, as well as common parsers for characters, numbers, and strings. Now it is time to look at the token-based parsers provided by bzlib/parseq.

A huge class of parsing involves tokenizing the stream by skipping over whitespace. For example, if we want to parse a list of 3 integers, separated by commas, we currently have to write:


(define three-int-by-comma 
  (seq whitespaces 
       i1 <- integer 
       whitespaces 
       #\, 
       whitespaces 
       i2 <- integer 
       whitespaces 
       #\, 
       whitespaces 
       i3 <- integer 
       (return (list i1 i2 i3))))

The code above looks messy, and it would be nice if we did not have to explicitly specify the parsing of whitespace. token allows us to abstract away the parsing of whitespace:


(define (token parser (delim whitespaces)) 
  (seq delim 
       t <- parser 
       (return t)))

The above code can now be rewritten as:


(define three-int-by-comma2 
  (seq i1 <- (token integer) 
       (token #\,) 
       i2 <- (token integer) 
       (token #\,) 
       i3 <- (token integer) 
       (return (list i1 i2 i3))))

Which looks a lot better. But given tokenizing is such a common parsing task, we have a shorthand for the above called tokens:


(define-macro (tokens . exps) 
  (define (body exps) 
    (match exps 
      ((list exp) (list exp)) 
      ((list-rest v '<- exp rest) 
       `(,v <- (token ,exp) . ,(body rest)))
      ((list-rest exp rest) 
       `((token ,exp) . ,(body rest)))))
  `(seq . ,(body exps)))

Which will reduce the above parsing to the following:


(define three-int-by-comma3
  (tokens i1 <- integer 
          #\,
          i2 <- integer 
          #\, 
          i3 <- integer 
          (return (list i1 i2 i3))))

There is a case-insensitive version of tokens called tokens-ci that allows character and string tokens to be parsed in a case-insensitive fashion.

Besides tokenizing, another common need in token-based parsing is to handle delimited sets. In the above example, the 3 integers are delimited by commas. delimited generalizes the pattern:


(define (delimited parser delim) 
  (tokens v <- parser 
          v2 <- (zero-many (tokens v3 <- delim
                                   v4 <- parser
                                   (return v4)))
          (return (cons v v2))))

The following parses a list of comma-delimited integers:


(delimited integer #\,)
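
To see the shape of token and delimited in isolation, here is a toy model written from scratch in plain Racket. It is NOT the bzlib/parseq implementation and does not use its API: in this sketch a parser is simply a function from a list of characters to (cons value rest), or #f on failure, and char-when, zero-many, token, integer and delimited are simplified re-creations that only borrow the names (integer here handles unsigned digits only).

```racket
#lang racket

;; A toy model of parser combinators, not the bzlib/parseq implementation.
;; A parser takes a list of characters and returns (cons value rest),
;; or #f when it fails.
(define ((char-when ok?) in)
  (and (pair? in) (ok? (car in)) (cons (car in) (cdr in))))

(define ((zero-many p) in)              ; collect results until p fails
  (let loop ([in in] [acc '()])
    (define r (p in))
    (if r
        (loop (cdr r) (cons (car r) acc))
        (cons (reverse acc) in))))

(define ((token p) in)                  ; skip leading whitespace, then run p
  (p (cdr ((zero-many (char-when char-whitespace?)) in))))

(define (integer in)                    ; one or more digits, as a number
  (define r ((zero-many (char-when char-numeric?)) in))
  (and (pair? (car r))
       (cons (string->number (list->string (car r))) (cdr r))))

(define ((delimited p delim) in)        ; p, then zero or more (delim p)
  (define head ((token p) in))
  (and head
       (let ([tail ((zero-many
                     (lambda (in)
                       (define d ((token delim) in))
                       (and d ((token p) (cdr d)))))
                    (cdr head))])
         (cons (cons (car head) (car tail)) (cdr tail)))))

(define comma (char-when (lambda (c) (char=? c #\,))))

((delimited integer comma) (string->list "1, 2 ,3"))
```

The last expression returns the list (1 2 3) paired with the empty remaining input; the whitespace around the commas is absorbed by token, just as in three-int-by-comma2 above.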

Another common pattern is to parse brackets that surround the value that you need. Just about all programming languages have such constructs. bracket handles such parses:


(define (bracket open parser close) 
  (tokens open
          v <- parser 
          close 
          (return v)))

And bracket/delimited handles the case where you need to parse a bracketed list of delimited values:


(define (bracket/delimited open parser delim close) 
  (tokens open ;; even the parser is optional...  
          v <- (zero-one (delimited parser delim) '()) 
          close 
          (return v)))

That's it for the bzlib/parseq API. If you find anything missing, please let me know.

Enjoy.

BZLIB/PARSEQ.plt - (3) Common Parsers API

2010-01-08T11:03:00.000-08:00

Previously we have looked at the fundamental parsers and the combinators API. Now it is time to look at some common parsers provided by bzlib/parseq.

In this case, since we are constructing these parsers on top of the fundamental parsers and combinators, we will show the definitions accordingly.

Character Category Parsers

digit is a character between #\0 and #\9.


(define digit (char-between #\0 #\9))

not-digit is a character not between #\0 and #\9.


(define not-digit (char-not-between #\0 #\9))

lower-case is a character between #\a and #\z.


(define lower-case (char-between #\a #\z))

upper-case is a character between #\A and #\Z.


(define upper-case (char-between #\A #\Z))

alpha is either a lower-case or an upper-case character.


(define alpha (choice lower-case upper-case))

alphanumeric is either an alpha character or a digit character.


(define alphanumeric (choice alpha digit))

whitespace is either a space, return, newline, tab, or vertical tab.


(define whitespace (char-in '(#\space #\return #\newline #\tab #\vtab)))

not-whitespace is a character that is not a whitespace.


(define not-whitespace (char-not-in '(#\space #\return #\newline #\tab #\vtab)))

whitespaces parses for zero or more whitespace characters:


(define whitespaces (zero-many whitespace))

ascii is a character with a code between 0 and 127:


(define ascii (char-between (integer->char 0) (integer->char 127)))

word is either an alphanumeric or an underscore:


(define word (choice alphanumeric (char= #\_)))

not-word is a character that is not a word:


(define not-word (char-when (lambda (c) 
                              (not (or (char<=? #\a c #\z)
                                       (char<=? #\A c #\Z)
                                       (char<=? #\0 c #\9) 
                                       (char=? c #\_))))))

Finally, newline parses for either CR, LF, or CRLF:


(define newline 
  (choice (seq r <- (char= #\return) 
               n <- (char= #\newline)
               (return (list r n)))
          (char= #\return)
          (char= #\newline)))

Number Parsers

sign parses for either + or -, and defaults to +.


(define sign (zero-one (char= #\-) #\+))

natural parses for 1+ digits:


(define natural (one-many digit))

decimal parses for a number with decimal points:


(define decimal (seq number <- (zero-many digit)
                     point <- (char= #\.)
                     decimals <- natural 
                     (return (append number (cons point decimals)))))

positive parses for either natural or decimal. Note decimal needs to be placed first since natural will succeed when parsing a decimal:


(define positive (choice decimal natural))

The above parsers return the characters that represent positive numbers. To get them to return numbers, as well as to parse both positive and negative numbers, we have a couple of helpers:


;; make-signed will parse for the sign and the number.
(define (make-signed parser)
  (seq +/- <- sign
       number <- parser 
       (return (cons +/- number)))) 

;; make-number will convert the parsed digits into number. 
(define (make-number parser)
  (seq n <- parser 
       (return (string->number (list->string n)))))
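
The conversion step inside make-number is ordinary Racket and can be tried on its own. The sketch below applies it by hand to character lists of the shape the parsers above return; chars->number is a name made up for this illustration.

```racket
#lang racket

;; The conversion step used by make-number, applied by hand to character
;; lists of the shape the number parsers above return.
(define (chars->number chars)
  (string->number (list->string chars)))

(chars->number '(#\- #\1 #\2))        ; -12
(chars->number '(#\+ #\4 #\. #\5))    ; 4.5
(chars->number '(#\. #\5))            ; 0.5
```

This works because string->number already understands a leading sign and a leading decimal point, so the parsers only need to hand over the characters in order.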

Then natural-number parses and returns a natural number:


(define natural-number (make-number natural))

integer will parse and return an integer (signed):


(define integer (make-number (make-signed natural)))

positive-number will parse and return a positive number (integer or real):


(define positive-number (make-number positive))

real-number will parse and return a signed number, integer or real:


(define real-number (make-number (make-signed positive)))

String Parsers

The following parsers parse quoted strings and return the inner content as a string.

escaped-char parses for characters that were part of an escaped sequence. This exists for characters such as \n (which should return a #\newline), and character such as \" (which should return just "):


(define (escaped-char escape char (as #f)) 
  (seq (char= escape) 
       c <- (if (char? char) (char= char) char)
       (return (if as as c)))) 

;; e-newline 
(define e-newline (escaped-char #\\ #\n #\newline)) 

;; e-return 
(define e-return (escaped-char #\\ #\r #\return)) 

;; e-tab 
(define e-tab (escaped-char #\\ #\t #\tab)) 

;; e-backslash 
(define e-backslash (escaped-char #\\ #\\))

quoted parses for the quoted string pattern (including escapes):


;; quoted 
;; a specific string-based bracket parser 
(define (quoted open close escape)
  (seq (char= open) 
       atoms <- (zero-many (choice e-newline 
                                   e-return 
                                   e-tab 
                                   e-backslash 
                                   (escaped-char escape close) 
                                   (char-not-in  (list close #\\)))) 
       (char= close)
       (return atoms)))

make-quoted-string abstracts the use of quoted.


(define (make-quoted-string open (close #f) (escape #\\)) 
  (seq v <- (quoted open (if close close open) escape)
       (return (list->string v))))

Then single-quoted-string and double-quoted-string look like the following:


(define single-quoted-string (make-quoted-string #\'))

(define double-quoted-string (make-quoted-string #\"))

Finally, quoted-string will parse both single-quoted-string and double-quoted-string:


(define quoted-string 
  (choice single-quoted-string double-quoted-string))

That is it for now - we will talk about parsing tokens next. Enjoy.

BZLIB/PARSEQ.PLT - (2) Parser Combinators API

2010-01-06T18:14:00.000-08:00

[Continuing the previous post on the API of bzlib/parseq]

Previously we have looked at the basic parsers that peek into the input. Now it's time to look at the combinators.

Basic Parser Combinators

bind is the "bind operator" for the parsers, with the following signature:


(-> Parser/c (-> any/c Parser/c) Parser/c)

It takes a parser and a "transform" function that consumes the value returned by the first parser and turns it into another parser, and combines both into a new parser.

As there are other more specialized combinators, you should not need to use bind directly unless you are trying to compose functions during run-time.

result simplifies the transform function you need to write, so that it only needs to return a value rather than another parser. Below is the definition of result:


(define (result parser helper)
  (bind parser 
        (lambda (v) 
          (if v 
              (return (helper v))  
              fail))))
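
To make the relationship between bind, return, and result concrete, here is a minimal toy model. It is a sketch only and not the library's implementation: it assumes the input is a plain list of characters, that a parser returns (values value remaining-input), and that a value of #f means failure. The char= here is a simplified stand-in as well.

```racket
#lang racket
;; Toy model, NOT bzlib/parseq's implementation: the input is a list of
;; characters and a parser returns (values value remaining-input),
;; where a value of #f means the parse failed.
(define (return v) (lambda (in) (values v in)))
(define fail (lambda (in) (values #f in)))

;; toy char=: succeeds when the next character is c
(define (char= c)
  (lambda (in)
    (if (and (pair? in) (char=? (car in) c))
        (values c (cdr in))
        (values #f in))))

;; bind: run the parser, then hand its value to transform, which must
;; produce the parser to run on the remaining input
(define (bind parser transform)
  (lambda (in)
    (define-values (v in2) (parser in))
    (if v
        ((transform v) in2)
        (values #f in))))

;; result, as defined above: the helper returns a plain value
(define (result parser helper)
  (bind parser
        (lambda (v)
          (if v
              (return (helper v))
              fail))))

(define upcased-a (result (char= #\a) char-upcase))
(define-values (v remaining) (upcased-a (string->list "ab")))
;; v is #\A and remaining is (#\b)
```

Note how the helper passed to result never has to mention parsers at all; bind and return take care of threading the remaining input.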

result* works like result, but it only works with a return value that is a list, and it applies the transform function to the elements of that list. This comes in handy when you know the parser returns a list and you want to bind the list to individual arguments in your transform function. Below is an example (the sequence* combinator is explained later):


(result* (sequence* (char= #\b) (char= #\a) (char= #\r)) 
         (lambda (b a r) ;; b maps to #\b, a maps to #\a, and r maps to #\r 
            (list->string (list b a r))))

Multi-Parser Combinators

seq is a macro-based combinator that binds multiple parsers in succession, and returns the result only if all parsers succeed. The design of this combinator is inspired by Shaurz's Haskell-style parser combinator.

The default form of seq is (seq <parser>), which is equivalent to <parser>. A seq expression must end in the first form.

The second form of seq allows you to create a lexical binding for the value of an intermediate parser so you can manipulate the result in the last parser. An example that parses two separate digits and sums them, with the first multiplied by 2 and the second multiplied by 4:


(seq a <- (char-between #\0 #\9) 
     b <- (char-between #\0 #\9) 
     (return (+ (* 2 (string->number (string a))) 
                (* 4 (string->number (string b))))))

This form, however, cannot be the last statement in seq, i.e., you must have other parser statements after a binding (otherwise why do you need the binding?).

The last form of seq behaves like the second form but does not create a binding, so it looks like the first form.

An example is to parse away the parentheses surrounding a digit:


(seq (char= #\() ;; 3rd form - no lexical binding 
     digit <- (char-between #\0 #\9) ;; 2nd form - binds digit 
     (char= #\)) ;; 3rd form - no lexical binding 
     (return (string->number (string digit))) ;; 1st form 
     )

seq is perhaps the most versatile combinator that you will use to create other combinators and parsers.

sequence is the functional version of seq that takes a list of parsers as its argument:


(sequence (list parser ...))

sequence has the following limitations:

sequence does not allow for custom bindings
sequence cannot be used to define a recursive parser (its inner parsers must be defined first)
sequence does not handle custom transformation of the results - you should use it with result* to bind the transformation (see above).

sequence* is the variable args version of sequence, i.e. (sequence* parser ...).

choice is a macro combinator that combines multiple parsers so the first one that succeeds will return the value:


;; returns if the next char is either #\a, #\b, or #\c.  Else fail. 
(choice (char= #\a) (char= #\b) (char= #\c))

one-of is the functional version of choice that takes a list of parsers as its argument:


(one-of (list parser ...))

Since this is a function combinator, unlike choice, you cannot define a recursive parser with it.

one-of* is the variable arg version of one-of, i.e., (one-of* parser ...).

all-of requires all of the inner parsers to succeed, and it returns the result from the last parser:


;; a contrived example: parse successfully if the next char is #\e 
(all-of (list alpha (char-between #\d #\f) (char= #\e)))

Remember that it is the last result that is returned, so if your inner parsers match different numbers of bytes, make sure the one that peeks the most bytes is matched last.

all-of* is the variable arg list version of all-of.

Repetition-Based Combinators

repeat is the most flexible of the repetition-based combinators:


(repeat <parser> <minimum> <maximum>)

The minimum argument defaults to 1, and the maximum argument defaults to positive infinity, which corresponds to the one-many combinator:


(define (one-many parser) 
  (repeat parser 1 +inf.0))

And zero-many maps to:


(define (zero-many parser) 
  (repeat parser 0 +inf.0))

Note that all three parsers will return a list that contains each of the successful parses. zero-many will return a list even if the inner parser does not match, so zero-many will always successfully parse.
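
The behaviour described above can be sketched with a toy repeat. This is an illustration under simplifying assumptions, not the library's code: a parser here maps a list of characters to (values value remaining), #f means failure, and char= is a simplified stand-in.

```racket
#lang racket
;; Toy model, NOT bzlib/parseq's implementation.
(define (char= c)
  (lambda (in)
    (if (and (pair? in) (char=? (car in) c))
        (values c (cdr in))
        (values #f in))))

;; repeat: collect matches until the parser fails or hi is reached;
;; fail as a whole if fewer than lo matches were found
(define (repeat parser (lo 1) (hi +inf.0))
  (lambda (in)
    (let loop ((acc '()) (count 0) (cur in))
      (define-values (v next)
        (if (< count hi) (parser cur) (values #f cur)))
      (cond (v (loop (cons v acc) (add1 count) next))
            ((>= count lo) (values (reverse acc) cur))
            (else (values #f in))))))

(define (one-many parser) (repeat parser 1 +inf.0))
(define (zero-many parser) (repeat parser 0 +inf.0))

(define-values (matched rest-a)
  ((one-many (char= #\a)) (string->list "aab")))
;; matched is (#\a #\a), rest-a is (#\b)

(define-values (none rest-b)
  ((zero-many (char= #\a)) (string->list "b")))
;; none is the empty list, which still counts as a successful parse
```

Because the empty list is a true value in Scheme, zero-many can report "no matches" without being mistaken for a failure.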

zero-one is a special repetition-based combinator in that it does not return a list. Instead, it allows you to parse for a single occurrence, and if it fails, you can specify a default value to substitute for the failure:


;; example of parsing for "foo" but returns "bar" if the parse failed. 
(zero-one (string= "foo") "bar")

Hence zero-one will always succeed as well.

That's it for the combinators. We'll discuss some pre-defined parsers in the next post. Stay tuned.

BZLIB/PARSEQ.plt - a Monadic Parser Combinator Library

2010-01-05T18:53:00.000-08:00

BZLIB/PARSEQ.plt is now available via PLANET. Inspired by Haskell's Parsec and Shaurz's Haskell-style parser combinator, bzlib/parseq provides a monadic parser combinator library that can handle both character and binary data parsing.

If you need a refresher on parser combinators, you can read my previous posts for a quick tour.

Installation


(require (planet bzlib/parseq))

The package includes the following examples that you can inspect for the usage of the API:


(require (planet bzlib/parseq/example/csv) ;; a customizable csv parser 
         (planet bzlib/parseq/example/calc) ;; a simple calculator with parens 
         (planet bzlib/parseq/example/regex) ;; a simplified regular expression parser & evaluator 
         (planet bzlib/parseq/example/sql) ;; parsing SQL's create table statement 
         (planet bzlib/parseq/example/json) ;; parses JSON data format 
         )

Parser Type Signature & Input
A parser is a function that has the following signature:


(-> Input/c (values any/c Input/c)) ;; returns the value and the next input

If the value is #f then the parse has failed. This might be changed in the future to another value so you can return #f as a parsed value.

The input is a struct with the following structure:


(define-struct input (source pos) #:prefab)

It is an abstraction over an input-port so you can keep track of the current position on the port. The function make-input will take in an input-port, a string, or bytes and return an input struct with the position initialized to 0.

During parsing, the values are peeked instead of read so we can backtrack to the beginning. That means when you finish parsing, all of the data is still in the port. make-reader wraps a parser so you can just pass an input-port instead of needing to create an input struct, and it also consumes the bytes if the parse is successful:


(make-reader parser) ;; => (-> input-port? any/c)

Fundamental Parsers (input not peeked)

The following parsers do not peek into the input.

(return <v>) returns <v> that you specify.

fail returns a failed struct that includes the position of the parse when the parser fails. The failed struct is currently defined as follows:


(define-struct failed (pos) #:prefab)

succeeded? and failed? test whether the returned value is a failed struct (succeeded? equals (compose not failed?)).

SOF (start of file) will return 'sof if it is the start of the input (i.e., position = 0).

Fundamental Parsers (input peeked)

item peeks the input, tests the peeked value to ensure it satisfies a criterion, and if so, returns the value and advances the input by the size of the peeked value:


(item <peek> <isa?> <satisfy?> <size>) 
peek => (-> Input/c any/c) 
isa? => (-> any/c boolean?) 
satisfy? => (-> any/c any) 
size => (-> any/c exact-integer?)

isa? tests for the return value's type so you can simplify the writing of satisfy?, which can assume the value is of the right type.

You use item only when you need to create new parsers that the library does not already provide.
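
As a rough illustration of how the four arguments to item fit together, here is a self-contained sketch. It is not the library's implementation: the input wraps a string rather than a port, failure is signalled with a failed struct, and peek-char* and digit are hypothetical names made up for this example.

```racket
#lang racket
;; Simplified stand-ins, NOT bzlib/parseq's definitions:
;; the input wraps a string plus a position rather than a port.
(struct input (source pos) #:prefab)
(struct failed (pos) #:prefab)

;; item: peek a value, check its type and the criterion, and on success
;; return the value together with the input advanced by its size
(define (item peek isa? satisfy? size)
  (lambda (in)
    (define v (peek in))
    (if (and (isa? v) (satisfy? v))
        (values v (input (input-source in) (+ (input-pos in) (size v))))
        (values (failed (input-pos in)) in))))

;; hypothetical peek: the char at the current position, or eof
(define (peek-char* in)
  (define s (input-source in))
  (define p (input-pos in))
  (if (< p (string-length s)) (string-ref s p) eof))

;; a digit parser built from item; every char counts as size 1 here
(define digit (item peek-char* char? char-numeric? (lambda (c) 1)))

(define-values (v next-input) (digit (input "42" 0)))
;; v is #\4 and next-input is positioned at 1
```

Since isa? runs first, char-numeric? is never called on eof, which is exactly the convenience described above.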

Non-Character Type Parsers

bzlib/parseq allows non-character parsers so you can use it to parse binary data instead of just text streams. You can mix them together of course.

(bytes= <bytes>) returns when the next set of bytes equals the passed in bytes. For example:


> ((bytes= #"foo") (make-input "foo bar"))
#"foo"
#s(input # 3)

(string= <string>) returns when the next set of bytes (in string) equals the passed in string.

(string-ci= <string>) is the case-insensitive version of string=.

(byte= <byte>) returns when the next byte equals the passed in byte.

(bits= <bits>) returns the next byte when it equals the passed in bits (a list of 8 0's or 1's).

byte= and bits= are built on top of byte-when, which is built on top of item. byte-when has the following signature:


(byte-when <satisfy?> (<isa?> byte?) (<size> (the-number 1))) 
;; (the-number <n>) returns a lambda that returns <n>

EOF is also built on top of byte-when as (byte-when identity eof-object? (the-number 0)).

Character-Based Parsers

The counterpart to byte-when for character-based parsers is char-when, which has the following signature:


(char-when <satisfy?> (<isa?> char?) (<size> char-utf-8-length))

The following are built on top of char-when:

(char= <c>) returns <c> when the next character equals <c>.

(char-not= <c>) is the opposite of char=.

(char-ci=? <c>) is the ci (case-insensitive) version of char=.

(char-not-ci=? <c>) is the opposite of char-ci=?.

(char-between <lc> <hc>) returns the next char when it falls between <lc> and <hc>.

(char-not-between <lc> <hc>) is the opposite of char-between.

(char-ci-between <lc> <hc>) is the ci version of char-between.

(char-ci-not-between <lc> <hc>) is the opposite of char-ci-between.

(char-in <chars>) returns the next char when it is one of the characters in <chars>.

(char-not-in <chars>) is the opposite of char-in.

(char-ci-in <chars>) is the ci version of char-in.

(char-ci-not-in <chars>) is the opposite of char-ci-in.

Literal Parsers

literal abstracts over the parsers that basically perform an equality comparison (e.g., string=, byte=, etc.) and also allows an inner parser to pass through, so you do not have to explicitly choose between char=, string=, etc., based on the argument. Example:


(literal #\a) ;; => (char= #\a) 
(literal "abc") ;; => (string= "abc") 
(literal any-byte) ;; => any-byte

literal-ci is the case-insensitive version of literal; the difference is that it will return the case-insensitive parser for characters and strings:


(literal-ci #\a) ;; => (char-ci= #\a) 
(literal-ci "abc") ;; => (string-ci= "abc") 
(literal-ci #"abc") ;; => (literal #"abc")

That about sums it up for the basic parsers. The next post will document the combinators. Stay tuned.

BZLIB/XML.plt - An XML Utility for Xexpr and SXML

2009-12-22T14:47:00.000-08:00

BZLIB/XML.plt is now available via PLANET. It provides the following utilities to help with XML manipulation:

converting between xexpr and sxml
reading sxml/xexpr from html or xml sources
managing html and xml entities

Installation


(require (planet bzlib/xml))

Xexpr and SXML

Although Xexpr is the default xml representation in PLT Scheme (and the web-server), it lacks the toolkits that SXML enjoys. bzlib/xml helps by providing conversion functions to convert between sxml and xexpr:


;; convert from xexpr to sxml 
(xexpr->sxml `(p ((class "default")) "an xexpr instance"))
;; => `(p (@ (class "default")) "an xexpr instance") 

; convert from sxml to xexpr 
(sxml->xexpr `(p (@ (class "default")) "an xexpr instance"))
;; => `(p ((class "default")) "an xexpr instance")

Converting from xexpr to sxml will allow you to use facilities such as sxpath, ssax, and sxml-match with xexpr, and converting from sxml to xexpr will allow you to feed sxml into web-server to generate output based on sxml.

Reading and Writing Xexpr/SXML

bzlib/xml provides read-xexpr and read-sxml to simplify the conversion from html sources to either xexpr or sxml:


;; reading xexpr 
(read-xexpr <input-port?>) 

;; reading sxml
(read-sxml <input-port?>)

The <input-port?> can be an http-client-response structure defined in bzlib/http, which provides a content-type header that helps determine whether this is an html or xml document, for example:


(read-sxml (http-get "http://www.google.com/"))

There are corresponding write-sxml and write-xexpr functions:


;; write-xexpr 
(write-xexpr <xexpr?> <output-port?>) 
;; write-sxml
(write-sxml <sxml?> <output-port?>)

Managing Entities

As part of converting from xexpr to sxml you'll need to deal with normalizing the xml entities. Since xexpr simply converts entities into symbols and numeric entities into numbers instead of converting them into final strings, bzlib/xml provides an entity->string routine that'll convert the entities into strings for you.


(entity->string <symbol or number entity>)

This is automatically called by xexpr->sxml, read-xexpr, and read-sxml, so you generally do not have to use it explicitly, except to extend the entity mapping.

entity->string converts numeric entities via the unicode character map, and symbol entities via two separate entity mapping tables: one for predefined HTML entities, and a parameterizable one for XML entities.

The HTML entity table contains a set of pre-defined HTML entities that were mapped to the underlying character numeric code. Generally you should not have to modify this set of entities, but if you need to, you can do so via set-html-entities!:


(set-html-entities! <list of symbol/integer pairs>) 
;; example
(set-html-entities! `((nbsp . 160) (lt . 60) (gt . 62)))

The XML entity table (xml-entities) is parameterizable, and it takes a list of symbol and string pairs:


(parameterize ((xml-entities '((lt . "<") (gt . ">")))) 
  (read-sxml ...))

If you use the same symbol entity in both tables, the xml-entities takes precedence.

Any unknown entity is mapped to the null character.
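
The lookup order described above can be sketched as follows. This is an illustration only: entity->string*, the html-entities table, and the local xml-entities parameter are stand-ins defined here so the snippet runs on its own, and they are not bzlib/xml's actual definitions.

```racket
#lang racket
;; Hypothetical stand-ins for the two tables described above;
;; these are NOT bzlib/xml's actual definitions.
(define html-entities
  (make-hasheq '((nbsp . 160) (lt . 60) (gt . 62) (amp . 38))))
(define xml-entities (make-parameter '()))

(define (entity->string* e)
  (cond ((number? e) (string (integer->char e)))   ; numeric entity
        ((assq e (xml-entities)) => cdr)           ; xml table takes precedence
        ((hash-ref html-entities e #f)             ; then the html table
         => (lambda (code) (string (integer->char code))))
        (else (string #\nul))))                    ; unknown entity

(entity->string* 'lt)    ; "<" from the html table
(entity->string* 169)    ; the copyright sign, via its code point
(parameterize ((xml-entities '((lt . "&lt;"))))
  (entity->string* 'lt)) ; "&lt;" because the xml table wins
```

Keeping the XML table in a parameter means an override only applies within the dynamic extent of the parameterize, which matches the usage shown above with read-sxml.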

That's it for now. Enjoy.

BZLIB/SESSION.plt - a Session Store via BZLIB/DBI.plt

2009-11-15T22:31:00.000-08:00

I originally planned to write a series on the development of a session store, but it turned out that there aren't that many things to write about, so I am just going to release the code via planet. As usual, this is released under LGPL.

Installation

To download the planet package:


(require (planet bzlib/session))

The package comes with three separate database installation scripts (one each for sqlite, mysql, and postgresql). You can call them via the following:


;; installing to a sqlite database
(require (planet bzlib/session/setup/jsqlite)) 
(setup-session-store/jsqlite! <path-to-sqlite-db>) 

;; installing to a mysql database 
(require (planet bzlib/session/setup/jazmysql)) 
(setup-session-store/jazmysql! <host> <port> <user>  
                               <password> <schema>)

;; installing to a postgresql database 
(require (planet bzlib/session/setup/spgsql)) 
(setup-session-store/spgsql! <host> <port> <user>
                             <password> <database>)

Known Issue: the script currently can only be run once and it assumes the table session_t does not exist in the target database. This will be rectified in the future.

Session ID

The session store uses uuids for session IDs. bzlib/base provides an API for manipulating uuids.

To create a uuid, just run (make-uuid). It optionally takes a parameter, either a uuid in string form or another uuid structure, and creates the corresponding (equivalent) uuid structure.


> (require (planet bzlib/base))
> (make-uuid)
#<uuid:4ba52eac-a0b4-415a-88f5-57d1fadd1aba>
> (make-uuid "4ba52eac-a0b4-415a-88f5-57d1fadd1aba")
#<uuid:4ba52eac-a0b4-415a-88f5-57d1fadd1aba>
> (make-uuid (make-uuid "4ba52eac-a0b4-415a-88f5-57d1fadd1aba"))
#<uuid:4ba52eac-a0b4-415a-88f5-57d1fadd1aba>

bzlib/session.plt does not handle parsing cookies into uuids.

Creating a Session Object

In order to create a session object, you first need to make a dbi handle with either the dbd-spgsql, dbd-jazmysql, or dbd-jsqlite driver, connected to the database where you have set up the session_t table, and you need to pass in the corresponding query script so the prepared statements ('make-session!, 'load-session, 'save-session!, and 'destroy-session!) can be loaded:


(require (planet bzlib/session) (planet bzlib/dbi)) 
;; loading 'jsqlite 
(require (planet bzlib/dbd-jsqlite)) 
(define h (connect 'jsqlite <path-to-db> 
                   '#:load (session-query-path/sqlite))) 
;; loading 'spgsql 
(require (planet bzlib/dbd-spgsql)) 
(define h (connect 'spgsql <spgsql-parameters> ... 
                   '#:load (session-query-path/postgres)))
;; loading 'jazmysql
(require (planet bzlib/dbd-jazmysql)) 
(define h (connect 'jazmysql <jazmysql-parameters> ...
                   '#:load (session-query-path/mysql)))

Then you can pass the handle to build-session to create the session object.


;; create a new session with a new uuid 
(define s (build-session h)) 
;; create a new session with a known uuid 
(define s (build-session h <uuid>))

As soon as build-session is called, you'll find a corresponding session record in session_t.

Accessing Session Key/Value Pairs

session-ref, session-set!, and session-del! read and modify the session key/value pairs:


(session-ref <session> <key> <optional-default-value>) 

(session-set! <session> <key> <value>) 

(session-del! <session> <key>)

Writing Out Sessions

save-session! saves the sessions out to the database:


(save-session! <session>)

refresh-session! will reload the session from the database:


(refresh-session! <session>)

And destroy-session! will delete the session record from session_t:


(destroy-session! <session>)

call-with-session and with-session will help you manage the call to save-session! so you do not have to write it with every request:


(call-with-session (build-session h) 
  (lambda (session) 
    <... do something with session ...>))

(with-session (build-session h) 
  (lambda () 
    <... do something with session via (current-session) ...>))

with-session works via (current-session), which is a parameter that holds either #f or a session structure. By passing the session object via with-session it will automatically parameterize current-session.

Session Expirations

You can use session-expired? to test whether the session object has expired:


(session-expired? <session>)

The expiration value is stored as a number (in Julian days) in session_t, which by default will be 14 days in the future from the time when save-session! is called.

The default session expiration value of 14 days can be controlled via the session-expiration-interval parameter.
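
The arithmetic behind a Julian-day expiration can be sketched like this. All names other than session-expiration-interval are hypothetical, and the parameter is redefined locally so the snippet is self-contained; how the library actually computes the value is an assumption here.

```racket
#lang racket
;; A sketch of the arithmetic only; the helper names are hypothetical.
(define session-expiration-interval (make-parameter 14)) ; in days

;; Julian day for a Unix timestamp:
;; the epoch 1970-01-01T00:00:00Z corresponds to JD 2440587.5
(define (seconds->julian-day secs)
  (+ 2440587.5 (/ secs 86400.0)))

;; the expiration to store when saving at time `now`
(define (next-expiration (now (current-seconds)))
  (+ (seconds->julian-day now) (session-expiration-interval)))

;; a session has expired once its stored value lies in the past
(define (expired? expiration (now (current-seconds)))
  (< expiration (seconds->julian-day now)))

;; shorten the interval to one day for a block of code
(parameterize ((session-expiration-interval 1))
  (next-expiration 0)) ; one day after the epoch, as a Julian day
```

Because the interval is a parameter, a single request handler can tighten or relax it without affecting other threads.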

Web Server Continuation

To simplify the usage of bzlib/session with web-server continuation calls, the following wrappers are exported to replace the send/suspend family:


(require (planet bzlib/session/web-server)) 
;; you then have access to... 
send/back 
send/finish
send/suspend
send/suspend/url
send/forward
send/suspend/dispatch
redirect/get
redirect/get/forget

These wrappers have the same APIs as their web-server counterparts; they'll call save-session! before making the continuation, and call refresh-session! before returning from the continuation.

That's it for now, enjoy.

Building a Web Session Store (1)

2009-11-09T20:54:00.000-08:00

Given that we have previously determined the need for a web session store even if we are using continuations, we'll go ahead and build it on top of our DBI stack, so the session data can be persisted as long as necessary.

Quick Word About Session Store Performance

One thing to note about session data is that it is both read and write intensive, which can put strain on the database. It's write-intensive because with each request we'll extend the expiration time on the session itself, and it's read-intensive because the data is needed for every request, even though it changes with every request.

For now we'll assume that our database is capable of handling such needs (and it will be until you have a sufficiently large amount of traffic), but it's something to keep in mind. The nice thing about building the session logic on top of DBI is that when we need to deal with the performance issue, we can easily add logic into the DBI tier by developing a custom driver, for example, by integrating memcached as an intermediate store that flushes the changes out to the database once in a while instead of with every request.

Active Record

The active record pattern is not just for OOP fanatics - we schemers know that you can craft your own OOP with FP easily. In DBI today there is a base structure for active record definition:


(define-struct active-record (handle id))

Such a definition is a lot simpler than the usual OOP representations, which usually try to construct the data model in memory, along with dynamically constructed SQL statements. Although such OOP records provide simplicity for the simple cases, they have proven to be a leaky abstraction due to the object vs relational paradigm mismatch, and to carry a significant performance overhead. Our simple definition will do us just fine right now.

What would our session API look like then?


;; expiration is a julian-day 
;; store is a hash table 
(define-struct (session active-record) (expiration store) #:mutable) 

;; the session key/value manipulation calls... 
(define (session-ref session key (default #f)) ...) 
(define (session-set! session key val) ...) 
(define (session-del! session key) ...) 
;; the persistence calls 
(define (build-session handle ...) ...) 
(define (save-session! session) ...) 
(define (refresh-session! session) ...) 
(define (destroy-session! session) ...)

We'll go through and flesh out the definitions in detail.

The Store in Memory

A hash table is a good internal representation of the key/value pairs that the session will hold (for now we'll assume the held data are serializable... we'll deal with this problem later), and this immediately tells us what session-ref, session-set!, and session-del! will look like:


(define (session-ref session key (default #f)) 
  (hash-ref (session-store session) key default)) 

(define (session-set! session key val) 
  (set-session-store! session 
                      (hash-set (session-store session) key val)))

(define (session-del! session key) 
  (set-session-store! session 
                      (hash-remove (session-store session) key)))

And yes - we are using immutable hash rather than mutable hash.
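
To see what the immutable hash buys us, here is a small self-contained run. The struct and accessor definitions follow the ones sketched in this post, but the handle is just a placeholder symbol and the id and expiration values are made up for the example.

```racket
#lang racket
;; Stand-in definitions so the snippet runs on its own; 'no-handle is a
;; placeholder rather than a real DBI handle.
(define-struct active-record (handle id))
(define-struct (session active-record) (expiration store) #:mutable)

(define (session-ref session key (default #f))
  (hash-ref (session-store session) key default))

(define (session-set! session key val)
  (set-session-store! session
                      (hash-set (session-store session) key val)))

(define (session-del! session key)
  (set-session-store! session
                      (hash-remove (session-store session) key)))

(define s (make-session 'no-handle "some-id" 0 (make-immutable-hash '())))
(define before (session-store s))  ; snapshot of the current table
(session-set! s 'user "alice")
(session-ref s 'user)              ; "alice"
(hash-ref before 'user #f)         ; #f: the snapshot is unchanged
(session-del! s 'user)
(session-ref s 'user 'gone)        ; 'gone
```

Each update swaps in a new table rather than mutating the old one, so any code still holding an earlier snapshot sees a consistent view.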

When to Persist

You have probably noticed that session-set! and session-del! do not persist out to the database. So if you have multiple concurrent connections for the same session, it might be possible for the session object to get out of sync.

While this is possible, the chance of it happening isn't great, since for the majority of the time users are going to make one main request at a time, with many auxiliary requests for accompanying images and css files that should not modify session values.

On the other hand, saving every change with each single session-set! call could drastically increase the read & write access for the session object (what's the point of saving with each write if you are not doing the same for read?) and could have a detrimental impact on performance unless we are ready to implement an intermediate cache.

And finally such decoupling actually simplifies the code (I have written the code with the other approach for comparison) and makes it look more refactored. So for now we'll go with this approach.

Hence we'll persist at the end of the request with a call to save-session!. A simple wrapper so you do not have to explicitly write the separate call would be:


(define (call-with-session session proc) 
  (dynamic-wind void 
                (lambda () 
                  (proc session)) 
                (lambda ()
                  (save-session! session))))

And with a current-session parameter we can simplify it as:


(define current-session (make-parameter #f))

(define (with-session session proc) 
  (call-with-session session 
                     (lambda (session)
                       (parameterize ((current-session session)) 
                         (proc)))))

Except for one bug, the above will work as you would expect in a web-server environment. If you have an idea of what the bug will be - please feel free to make a comment. I'll discuss the bug and how to fix it in the next post of the series.

Web Sessions vs. Continuations

2009-11-07T17:36:00.000-08:00

The recent "session info in web server applications" thread on the plt-scheme list has an undertone that continuations are equivalent to web sessions as understood in other languages and frameworks. This undertone is highlighted by the fact that the web-server collection lacks the session-like capability found in other web frameworks.

This got me thinking: are continuations equivalent to sessions?

The original intent (indicated in Shriram's research paper) of web-server's continuations is to correctly and succinctly model an interactive web application's flow. The paper cites examples of incorrectly implemented web apps that would do something like the following:

user browses a list of goods
user opens a new window to get the details of goods A
user goes back to the original window
user then opens another new window to get the details of goods B
user then goes to goods A and clicks "Buy Now"
an incorrectly implemented app will cause the user to buy goods B instead of goods A

The traditional solution to the above interaction would be to use sessions, and since continuation models such interactions as well, there is no question that in this case continuations supplant the needs of sessions.

But for other scenarios involving sessions it might be more natural to model the computations by using the traditional session concepts.

For example - identifying the user across visits after significant time lapse (this is generally toggled by a "remember me" checkbox during login). Normally web sites accomplish this by persisting the user's authenticators via cookies or sessions.

This process is awkward to model with continuations, since the user is likely to come back to the site via a top level link that has no captured continuations, instead of digging up the last continuation url for the site, and the continuations might have expired between the visits if you use stateful servlets.

If you use web-server's stateless servlet language, an approach is probably to serialize the continuation into a cookie so it can model the above scenario, but you'll have to write your code in the stateless language or convert your code over, and it feels like a more complex solution compared to simply having a regular session capability. This is similar to using continuations to model non-interactive web links - it can work, but it does not follow Occam's razor.

Furthermore - if your site uses Ajax extensively, your use of continuations will decrease, since Ajax models the interactions as well and supplants the need for continuations, and in that case you might regain the need for sessions that continuations had reduced.

So, as far as I can tell, continuations are not equivalent to web sessions and do not eliminate the need for session capabilities.

Using DBI to Run Scripts & Load Prepared Statement Scripts

2009-11-06T23:37:00.000-08:00

There are a couple of utility functions that I have designed into DBI that were not previously discussed. They are both oriented to work with SQL scripts.

I don't know about you, but I like to write SQL statements in SQL scripts:


-- a sql query inside a sql script
select * 
  from table1;

instead of embedding them as strings in programming languages:


;; a sql query embedded in scheme code 
(exec handle "select * from table1")

- it just looks so much nicer.

If you have a ton of complex prepared statements, you'll find you'll have such statements littered everywhere, which makes them difficult to maintain.

Of course - a possible solution to this problem is to create a DSL so the SQL strings can be generated. We might eventually entertain such solution, but it's clear that any such DSL will be non-trivial.

A simpler but equally powerful idea is to move all such queries into SQL scripts, and have the database handle load the scripts and convert them into prepared statements.

Loading Prepared Statement Scripts

You can supply a #:load parameter to the three RDBMS drivers' connect procedure:


;; example - connect to spgsql 
(define handle 
  (connect 'spgsql <regular-args> ... 
           '#:load <path-string? or (listof path-string?)>))

The #:load parameter takes either a path-string? or a list of path-string?, all of which need to point to valid prepared statement scripts. The connect procedure will then load each of the scripts and convert them into prepared statements.

Format of the Prepared Statement Scripts

In order for this to work - you'll need to follow some conventions:

You can have multiple prepared statements in a single script, and they must follow the order of <name-of-the-statement>\n+ <statement>\n+ ...
the name of the statement must be on a line starting with --;;-- , followed by the name itself, and then nothing else (besides additional whitespace)
the name of the statement is basically a regular scheme symbol, but more restricted - like a regular scheme function name, it consists of alphanumeric characters, -, _, ?, and !

An example will make it more clear - below is a sample script that contains 2 prepared statements:


--;;-- session-ref 
select session_value
  from session_value_t 
 where session_uuid = ?uuid 
   and session_key = ?key 

--;;-- session-update! 
update session_value_t 
   set session_value = ?val
 where session_uuid = ?uuid 
   and session_key = ?key

The first statement's name is session-ref, and the second is session-update!. If you need more in a single script - just keep extending it with the same format. You can add other regular SQL comment lines in between, just make sure it does not start with --;;--.
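
To illustrate the convention, here is a sketch of how such a script could be split into named statements. split-statements and flush are hypothetical helpers written for this example and are not DBI's actual loader; the regular expression simply encodes the naming rules listed above.

```racket
#lang racket
;; A sketch only: split-statements and flush are hypothetical helpers,
;; NOT the loader that DBI uses.

;; a name line: --;;-- followed by the name and optional whitespace
(define name-line #px"^--;;--\\s*([A-Za-z0-9_?!-]+)\\s*$")

;; push the statement collected so far (if any) onto the result list
(define (flush stmts name body)
  (if name
      (cons (cons name (string-trim (string-join (reverse body) "\n")))
            stmts)
      stmts))

;; returns an association list of (name . sql-text) in script order
(define (split-statements text)
  (define-values (stmts name body)
    (for/fold ((stmts '()) (name #f) (body '()))
              ((line (in-list (string-split text "\n"))))
      (cond ((regexp-match name-line line)
             => (lambda (m)
                  (values (flush stmts name body)
                          (string->symbol (cadr m))
                          '())))
            (else (values stmts name (cons line body))))))
  (reverse (flush stmts name body)))

(define script
  (string-join '("--;;-- session-ref "
                 "select session_value"
                 "  from session_value_t"
                 ""
                 "--;;-- session-update! "
                 "update session_value_t"
                 "   set session_value = ?val")
               "\n"))

(map car (split-statements script)) ; the two statement names, in order
```

Ordinary SQL comment lines do not match the name pattern, so in this sketch they simply stay part of the statement body they appear in.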

Because the #:load parameter can load a list of scripts, you can organize your scripts however you want.

Run a SQL Script

The other script-related capability that DBI has is running a SQL script. This is useful if you want to have installation scripts that contain a set of SQL queries that need to be executed sequentially (like creating a bunch of tables). All you need to do is to write up the script, and then:


(run-script! handle <path-to-script> <query-args>)

The args have the same format as the args for query (a list of key/value pairs). This means that your script can also contain placeholders like your prepared statements - they'll get passed to the statements as usual. Note because such script will contain multiple queries, if you do not want to pass the same value to a subsequent query, you must use a different placeholder name for the subsequent query.

Now - such a script also has a special format to ensure it works correctly. Basically, you must use the semicolon (;) terminator for the queries - below is an example script - notice the semicolon:


create table session_t 
( session_id integer primary key auto_increment not null 
, session_uuid varchar(32) unique not null 
, expiration date not null) ; -- query terminator 

create table session_value_t 
( session_value_id integer primary key auto_increment not null 
, session_uuid varchar (32) not null 
, session_key varchar (128) not null 
, session_value text null 
, unique ( session_uuid , session_key ) 
)

You'll also notice that the second statement does not have a terminator so there won't be an empty query.

This capability is only available since planet version (1 3), i.e.


(require (planet bzlib/dbi:1:3))

These two capabilities should help you a lot in writing complex SQL queries and scripts.

That's it for now. Enjoy.

Latest DBI.plt Available - Handling Last Inserted ID and Side Effects

2009-11-05T23:07:00.000-08:00

The newest version of DBI (and the 3 RDBMS drivers) has now been made available on planet - it addresses the issue of side effects and last inserted id.

As usual - they are released under LGPL.

To download & install - use require:


(require (planet bzlib/dbi:1:3) 
         (planet bzlib/dbd-jazmysql:1:2) 
         (planet bzlib/dbd-jsqlite:1:3) 
         (planet bzlib/dbd-spgsql:1:2))

Unifying the side effects and the last inserted id turns out to be a non-trivial task - I have already voiced the options on the plt-scheme list, repeating here for the sake of completeness:

the underlying drivers return different values for side effects
there might be legacy code that relies on the underlying side effect results - which can increase the effort of porting to another database
not all drivers provide ways to retrieve all side effect values
last inserted id does not work consistently when dealing with the multi-records-insert-at-once scenario - luckily this is generally not how insertion is used
for postgresql - we'll need to derive the underlying table to sequence name mapping in order to determine the last inserted id, and this requires a separate query that you might not want to "pay for" unless you have needs for last inserted id

Based on the above design constraints, I have chosen the following:

make available multiple drivers for each database - one for different types of the side effect results
default the side effect results to a unified effect structure, which is inspired by jazmysql
provide last-inserted-id for the 3 RDBMS drivers (SPGSQL has a special requirement, described below)

The Different Side Effects

There are 3 separate side effect types:

past-thru-effect - this is the side effect type provided for backward compatibility. Whatever side effect objects the underlying driver returns are passed through directly, and you can use the underlying driver's code to access the values
effect - this is the unified side effect object and the default going forward
effect-set - this converts the effect structure into a result set

For example, if you want to make a connection with the first side effect type with bzlib/dbd-spgsql, you can do the following:


(connect 'spgsql/past-thru-effect <rest-of-args> ...)

And for the other two types:


;; use the effect structure
(connect 'spgsql/effect <rest-of-args> ...)
;; use result set as the effect structure
(connect 'spgsql/effect-set <rest-of-args> ...)

The default is 'spgsql/effect. That means when you pass in 'spgsql as the driver name, you are passing in the equivalent of 'spgsql/effect.

The same goes for the other two drivers:


(connect 'jsqlite/past-thru-effect ...)
(connect 'jsqlite/effect ...) ;; the default; same as 'jsqlite
(connect 'jsqlite/effect-set ...) 

(connect 'jazmysql/past-thru-effect ...)
(connect 'jazmysql/effect ...) ;; the default; same as 'jazmysql
(connect 'jazmysql/effect-set ...)

If you choose */past-thru-effect you'll have to use the side effect structures returned by the underlying driver - I won't discuss this since it is meant for backward compatibility. If you are writing new code I would encourage you to use either */effect or */effect-set.

The effect Structure

Inspired by jaz/mysql, the effect structure has the following definition:


(define-struct effect 
  (rows ;; the # of rows affected or #f
   insert-id ;; the last inserted id or #f
   status ;; the status of the underlying connection or #f
   warning-count ;; the warning count or #f 
   message ;; the message returned with the query or #f 
   error ;; the error message (or exception object) or #f 
  ))

You can use the appropriate struct accessor functions to access the values if you use the */effect drivers.

In */effect-set drivers, the returned effect object is converted to a result set, with the first row holding the following column names:

affected rows
insert id
status
warning count
message
error

And the second row would contain the value converted from the effect structure, but with #f mapped to '(), based on the convention of the result set's handling of NULL.
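As a rough sketch of that conversion (the function name effect->result-set and the use of strings for the column names are my assumptions here, not the driver's actual code):

```racket
#lang racket

;; Same fields as the effect structure shown in the post.
(define-struct effect
  (rows insert-id status warning-count message error))

;; Hypothetical converter: a header row followed by one row of values,
;; with #f replaced by '() to follow the result set's NULL convention.
(define (effect->result-set e)
  (define (null-if-false v) (if (eq? v #f) '() v))
  (list (list "affected rows" "insert id" "status"
              "warning count" "message" "error")
        (map null-if-false
             (list (effect-rows e) (effect-insert-id e) (effect-status e)
                   (effect-warning-count e) (effect-message e)
                   (effect-error e)))))
```

For an insert that affected one row and produced id 42, the second row would come out as (1 42 () () () ()).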

Last Inserted ID

For both dbd-jsqlite and dbd-jazmysql, the */effect & */effect-set drivers correctly capture the last inserted ID in the effect structure's insert-id field. It is also correctly returned in the */past-thru-effect version, since the underlying driver directly supports the concept of last inserted id (jsqlite/past-thru-effect will return the last-inserted-id as a number, and jazmysql/past-thru-effect will return the last-inserted-id as part of the side-effect structure).

The dbd-spgsql driver is more complicated, however. spgsql/past-thru-effect does not return the last-inserted-id, because in postgresql you need to know the name of the underlying sequence object that the table uses to manage the auto increment, and you also need to make a secondary query, which adds overhead.

dbd-spgsql handles this issue by taking in an additional parameter, identified by keyword #:t2s (table to sequence), when you make the connection:


(connect 'spgsql/effect <arg> ... '#:t2s 
         <procedure-to-translate-table-name-to-sequence-name>)

The #:t2s parameter takes a procedure that takes in a string (the table name) and returns a string (the sequence name). If you supply the parameter, and the query is an insert query, then the driver will help you to automatically make the subsequent query to retrieve the last inserted id. If you do not supply the parameter, then no overhead for accessing the last inserted id will be incurred.
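Here is a small sketch of what such a procedure could look like. It assumes a naming scheme like the session tables from the earlier example (table "<base>_t" with an auto-increment column "<base>_id") and PostgreSQL's default "<table>_<column>_seq" name for serial columns; table->sequence is a hypothetical name, so adapt it to your own schema:

```racket
#lang racket

;; Hypothetical mapping: "session_t" -> "session_t_session_id_seq".
;; Assumes the auto-increment column is named after the table minus "_t".
(define (table->sequence table)
  (let ((base (regexp-replace #rx"_t$" table "")))
    (string-append table "_" base "_id_seq")))

;; It would then be passed at connect time in the form shown above:
;; (connect 'spgsql/effect <arg> ... '#:t2s table->sequence)
```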

That's it for now. Enjoy.

BZLIB/DATE & BZLIB/DATE-TZ 0.2 Now Available

2009-10-20T14:05:00.000-07:00

bzlib/date and bzlib/date-tz are now available via planet. They are released under LGPL. See the earlier documentation for the existing usage.

The changes included are:

re-exports for SRFI-19 functions
Wrapper functions for PLT date objects (you can use the functions with PLT date objects instead of with SRFI date objects)
day comparison functions
RFC822 date parsers and generators
additional date manipulation functions

SRFI19 Re-export
Previously you had to explicitly include srfi/19 to use the functions within SRFI19 as follows:


(require srfi/19 (planet bzlib/date))

Now you just have to do the following:


(require (planet bzlib/date/srfi))

And almost all srfi/19 functions will be re-exported along with the functions within bzlib/date.

The exceptions are date->string and string->date, neither of which is re-exported from srfi/19. This is because we may want to use those names for our own date parser and generator functions. I'll examine the details before deciding whether to re-export those or create our own.

PLT Date Wrappers

You now can use PLT date objects instead of srfi/19 date objects (I do not really know why they are different date objects in the first place...). You can just do the following:


(require (planet bzlib/date/plt))

This will export functions with the same names, but they take (and return) PLT date objects instead of SRFI date objects. Because the exports have the same names, you cannot require it along with the SRFI-date version.

Besides wrapper over all of the bzlib/date functions, it also wraps over the srfi/19 functions, so you can use, for example, current-date and it'll now return a PLT date object.

This version also does not export string->date and date->string.

bzlib/date-tz also exports its functions in PLT wrapper form:


(require (planet bzlib/date-tz/plt))

It does not re-export bzlib/date/plt so you will need to explicitly require it if you want to use its functions.

There is a matching module in bzlib/date-tz/srfi but it is exactly the same as bzlib/date-tz. This is provided so that you can write mirroring code:


;; in one file... 
(require (planet bzlib/date/srfi)
         (planet bzlib/date-tz/srfi)) 

;; then you can change it to 
(require (planet bzlib/date/plt)
         (planet bzlib/date-tz/plt))

Different Types of Date Comparisons

The date comparison functions (date=?, date<?, date>?, date<=?, date>=?, date!=?) can now be used to compare the dates in the following fashions:

You can now compare multiple dates at once (previously - just two)
You can use it to compare for:
- day-only - just compare the day (year month and day)
- day+time - compare both the day & the time (hour minute second), but without comparing timezone
- date - this is the default behavior - compare the date & time as well as accounting for the timezone.
- date+tz - this would require the dates being compared all have the same time zone

To control the behavior, you use the date-comp-type parameter:


(parameterize ((date-comp-type 'day-only ;; or 'day+time 'date 'date+tz 
                ))
  (date<? d1 d2 d3 ...))

day=?, day!=?, day<?, day<=?, day>?, day>=? are provided as helper functions that parameterize the date-comp-type to 'day-only for you.
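To illustrate what a day-only comparison over several dates amounts to, here is a minimal sketch built directly on srfi/19. The names day-key and day-only<? are made up for this example and are not the bzlib/date implementation:

```racket
#lang racket
(require (prefix-in s19: srfi/19))

;; Hypothetical stand-in for the 'day-only mode: reduce each date to a
;; sortable number built from year, month and day, ignoring the time.
(define (day-key d)
  (+ (* 10000 (s19:date-year d))
     (* 100 (s19:date-month d))
     (s19:date-day d)))

;; Variadic, like the comparison functions described above.
(define (day-only<? d1 d2 . rest)
  (apply < (map day-key (list* d1 d2 rest))))

;; make-date takes: nanosecond second minute hour day month year zone-offset
(define morning  (s19:make-date 0 0 0 9 20 10 2009 0))
(define evening  (s19:make-date 0 0 0 21 20 10 2009 0))
(define next-day (s19:make-date 0 0 0 9 21 10 2009 0))
```

With these definitions, morning and evening are not ordered by day-only<? because they fall on the same day, while morning comes before next-day.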

Additional Utility Functions for Timezone

current-date/tz returns a date based on the optional tz value (which defaults to (current-tz)):


> (list (parameterize ((current-tz "America/Los_Angeles"))
          (current-date/tz))
        (parameterize ((current-tz "Europe/Brussels"))
          (current-date/tz)))
;; notice the date difference... 
(#(struct:tm:date 0 56 18 13 20 10 2009 -25200)
 #(struct:tm:date 0 56 18 22 20 10 2009 7200))

daylight-saving-time? returns true or false depending on the optional date and tz (which default to current-date & current-tz):


> (list (parameterize ((current-tz "America/Los_Angeles"))
          (daylight-saving-time?))
        (parameterize ((current-tz "Asia/Taipei"))
          (daylight-saving-time?)))
(#t #f)

That's it for now. Enjoy.

Building Parser Combinators in Scheme (2) - Higher Order Combinators

2009-10-19T00:58:00.000-07:00

Previously we started building parser combinators for parsing a symbol that has the following signature (seq alpha (zero-more (one-of alpha numeric))), and we stopped at refactoring the numeric and alpha parsers, and ended up with return, fail, and char-test.

char-test Expanded

char-test gives us a good base for building a bunch of other character-based parsers:


;; char= returns a parser to test whether the next char equals c
(define (char= c)
  (char-test (lambda (it)
               (char=? it c)))) 

;; in-chars returns a parser to match against the list of chars
(define (in-chars lst)
  (char-test (lambda (it) 
               (member it lst)))) 

;; not-in-chars is the reverse of in-chars 
(define (not-in-chars lst)
  (char-test (lambda (it) 
               (not (member it lst))))) 

;; in-char-range returns a parser to match char between from & to chars
(define (in-char-range from to)
  (char-test (lambda (it)
               (char<=? from it to))))

;; the opposite of in-char-range
(define (not-in-char-range from to)
  (char-test (lambda (it)
               (not (char<=? from it to)))))

So we can build parsers such as the following:


;; a parser to test for the backslash
(define backslash (char= #\\)) 
;; writing numeric using in-chars
(define numeric (in-chars '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)))
;; writing equivalent to regexp \S (non-whitespace) 
(define non-whitespace (not-in-chars '(#\tab #\return #\newline #\space #\vtab)))
;; lower-case alpha using in-char-range 
(define lower-case (in-char-range #\a #\z))
;; upper-case alpha using in-char-range
(define upper-case (in-char-range #\A #\Z))

It would be nice to write alpha in terms of upper-case and lower-case as defined above:


(define alpha (one-of upper-case lower-case))

So let's see how we can write one-of.

Higher order Parsers

The idea of one-of is straightforward - take in a list of parsers and test them against the input, one parser at a time. The result of the first one that succeeds is returned:


(define (one-of . test)
  (lambda (in (skip 0)) 
    (let loop ((rest test))
      (if (null? rest)
          (fail in skip)
          (let-values (((v count)
                        ((car rest) in skip)))
            (if (not count)
                (loop (cdr rest))
                ((return v) in count)))))))

Now we can write the following:


(define alpha (one-of (in-char-range #\a #\z)
                      (in-char-range #\A #\Z)))

(define alpha-numeric (one-of numeric alpha))

one-of is basically a parser combinator that acts like an or expression. We can also have a combinator that acts like an and:


(define (all-of . test)
  (lambda (in (skip 0)) 
    (let loop ((rest test)
               (v #f)
               (count skip))
      (if (null? rest)
          ((return v) in count)
          (let-values (((v count)
                        ((car rest) in skip)))
            (if (not count)
                (fail in skip)
                (loop (cdr rest) v count)))))))

Notice that all-of will return the value and the count from the last match (the same behavior as and), but all the previous tests also need to match. Since every test starts from the same position, it is of course the user's responsibility to construct a valid combination that can pass.
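As a quick illustration of all-of, the sketch below builds a consonant parser out of two character tests. It carries minimal copies of return, fail, char-test and all-of so that it runs on its own; non-vowel, consonant and the run helper are names I made up for this example:

```racket
#lang racket

;; Minimal copies of the building blocks so the example is self-contained.
(define (fail in (skip 0)) (values #f #f))
(define (return v) (lambda (in (skip 0)) (values v skip)))
(define (char-test test?)
  (lambda (in (skip 0))
    (let ((c (peek-char in skip)))
      (if (and (char? c) (test? c))
          ((return c) in (add1 skip))
          (fail in skip)))))
(define (all-of . test)
  (lambda (in (skip 0))
    (let loop ((rest test) (v #f) (count skip))
      (if (null? rest)
          ((return v) in count)
          (let-values (((v count) ((car rest) in skip)))
            (if (not count)
                (fail in skip)
                (loop (cdr rest) v count)))))))

;; A consonant is a lower-case letter that is also not a vowel.
(define lower-case (char-test (lambda (c) (char<=? #\a c #\z))))
(define non-vowel
  (char-test (lambda (c) (not (memv c '(#\a #\e #\i #\o #\u))))))
(define consonant (all-of lower-case non-vowel))

;; Helper (hypothetical) that collects the two return values into a list.
(define (run parser str)
  (call-with-values (lambda () (parser (open-input-string str))) list))
```

Here (run consonant "xyz") succeeds on #\x with a count of 1, while (run consonant "abc") fails because #\a passes lower-case but not non-vowel.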

If we want to parse multiple matches in succession (i.e. one after another), we need a sequence combinator:


(define (seq . test)
  (lambda (in (skip 0)) 
    (let loop ((rest test)
               (acc '()) 
               (count skip))
      (if (null? rest)
          ((return (reverse acc)) in count)
          (let-values (((v count)
                        ((car rest) in count)))
            (if (not count) ;; one parser failed, so the whole sequence fails
                (fail in skip)
                (loop (cdr rest) (cons v acc) count)))))))

This will allow us to parse a sequence of tokens. For example, the code below parses a social security number in nnn-nn-nnnn form:


(define dash (char= #\-)) 
(define ssn (seq numeric numeric numeric dash numeric numeric dash numeric numeric numeric numeric))

Now it can be bothersome to write repeated numerics like above, so we can have a repeat parser:


(define (repeat test n) 
  (lambda (in (skip 0)) 
    (let loop ((i 0)
               (acc '())
               (count skip))
      (let-values (((v count)
                    (test in count))) 
        (cond ((not count) ;; failed before reaching n
               (fail in skip))
              ((= i (sub1 n)) ;; succeeded
               ((return (reverse (cons v acc))) in count))
              (else
               (loop (add1 i) (cons v acc) count)))))))

Then ssn can be written as:


(define NN (repeat numeric 2)) 
(define NNN (repeat numeric 3)) 
(define NNNN (repeat numeric 4)) 
(define ssn (seq NNN dash NN dash NNNN))

repeat parses for fixed numbers of repeats - what if we want to have unbounded repeats? Let's try to build zero-many that'll match for zero or more occurrences:


(define (zero-many test) 
  (lambda (in (skip 0)) 
    (let loop ((acc '())
               (count skip)) 
      (let-values (((v new-count)
                    (test in count)))
        (if (not new-count) ;; we are done... 
            ((return (reverse acc)) in count)
            (loop (cons v acc) new-count))))))

Which we can then use to build one-many that must have at least one match:


(define (one-many test)
  (lambda (in (skip 0))
    (let-values (((v count)
                  (test in skip))) 
      (if (not count)
          (fail in skip)
          (let-values (((out count)
                        ((zero-many test) in count)))
            ((return (cons v out)) in count))))))

And of course there is a special test of zero or one occurrence:


(define (zero-one test)
  (lambda (in (skip 0)) 
    (let-values (((v count)
                  (test in skip))) 
      (if (not count) 
          ((return #f) in skip)
          ((return v) in count)))))

With all the above we can now finally construct the symbol parser:


(define symbol (seq alpha (zero-many (one-of alpha numeric))))

Which will return the following when parsing a symbol:


> (symbol (open-input-string "asymbol1 "))
(#\a (#\s #\y #\m #\b #\o #\l #\1))
8

The first is the read value - notice that the results are listed according to their position within the sequence (#\a matches alpha, and (#\s #\y #\m #\b #\o #\l #\1) matches (zero-many (one-of alpha numeric))). The second value indicates the number of bytes "peeked", which is 8.

The final step is to have a parser that allows us to take in the values and transform them according to our needs:


(define (make-parser test return)
  (lambda (in (skip 0)) 
    (let-values (((v count)
                  (test in skip))) 
      (cond ((not count)
             (values #f #f))
            (else
             (read-bytes count in)
             (values (apply return v) count))))))

So we can build the final symbol parser as follows:


(define parse-symbol 
  (make-parser (seq alpha (zero-many (one-of alpha numeric)))
    (lambda (alpha lst) 
      (list->string (cons alpha lst)))))

which will return a string for us instead of the nested list of characters above:


> (parse-symbol (open-input-string "asymbol1 "))
"asymbol1"
8

At this point we have most of the basic parser combinators constructed, and the rest is to fill in the details as necessary. We'll take a look at how to improve the parser to handle more complex scenarios in future posts. Stay tuned.

Building Parser Combinators in Scheme

2009-10-16T11:09:00.000-07:00

I do not write code for the sake of writing code, but sometimes the best way to understand a concept is to write it up. I can think of many situations where I had no clue what the heck was going on from reading the documentation or even the tutorials, but started to understand once I began to hack up a solution.

Parser combinator is one of those things for me. Based on what I can find, Haskell was where the term was coined and expanded, but unfortunately I cannot read Haskell well enough to fully follow the excellent monadic parser combinator paper, let alone try to understand Parsec and think about how it would be translated into scheme.

PLT Scheme contains a parser combinator library, but its documentation is written for someone who is completely familiar with the parser combinator and how it would work within scheme, so I couldn't start with it (in contrast, I was able to figure out parser-tools in short order, even though my previous experience with lex & yacc isn't a lot either).

Searching online, there are a couple of parser combinators written in scheme:

It wasn't much, but it was something that I can build on. Let's get started...

Immediate Obstacle - Reading from Port Instead of String or List

Parser combinator promises top-down parser development in almost BNF-like fashion, including infinite lookahead and unlimited backtracking, all of which are extremely fascinating powers.

But reading the monadic parser combinator paper leaves me scratching my head wondering how this would work with ports. The parser combinator appears to derive its backtracking power from reading the string into a list of characters. This is nice and all but it doesn't work with ports, since once a character is read from a port it is gone and you lose the ability to backtrack (yes - with some ports it is possible to backtrack but this does not work with all ports).

I wasn't able to find an answer to this problem via the above URLs either, so I'll have to come up with my own solution.

What I came up with was the following:

we'll use the peek-* functions instead of the read-* functions to access the port data
we'll keep track of the byte counts that we have accessed so far, and we'll pass the byte count as a skip-ahead to the next parser
we'll only do a single read at the end to remove everything we have found so far - i.e. if the parse failed we'll never remove a single character from the port

By adhering to the above approach we can now backtrack without side effects on ports as well (the side effect only occurs once, at the end of the parse, when we have a successful match).
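A tiny demonstration of the idea, using only the standard port functions (note that the skip argument of peek-char counts bytes, so this sketch sticks to ASCII input):

```racket
#lang racket

(define in (open-input-string "abc"))

;; Peeking with a skip count looks ahead without consuming anything.
(define first-peek (peek-char in 0)) ; the first character, #\a
(define third-peek (peek-char in 2)) ; two bytes further on, #\c

;; The port is untouched, so a failed parse can simply give up here.
;; On success we commit with a single read of the bytes we peeked.
(define consumed (read-bytes 3 in))
(define at-end (read-char in))       ; nothing left after the commit
```

Only the final read-bytes changes the state of the port; everything before it can be abandoned at no cost.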

With the above in mind - let's start building our parser combinators.

A Simple Example

Let's try to parse a very simple example - a symbol that contains only alphanumeric characters, with the first character being alpha:


(parse-symbol (open-input-string "asymbol1 ")) ;; => "asymbol1"

So what should parse-symbol look like? If it looks something like the following it would be good.


(make-parser (seq alpha (zero-more (one-of alpha numeric))) 
  (lambda (a1 lst) 
    (list->string (cons a1 lst))))

Look at the first line above - you can almost read it as the definition of the token: a sequence of an alpha character, followed by zero or more of either an alpha or a numeric character. All we need to do is to build seq, alpha, numeric, one-of, and zero-more.

We should start with the simplest of the above, which would be the numeric parser - we can implement it as the following:


(define (numeric in (skip 0)) 
  (let ((c (peek-char in skip)))
    (if (char<=? #\0 c #\9)
        (values c (add1 skip)) 
        (values #f #f))))

Basically - if the next character is one of the numeric characters, we return the character and the updated count (the skip count incremented by one). Otherwise we return #f for both, with the second indicating that we did not read anything. Note that it is the second value that indicates whether or not we have a successful parse, because we want to allow #f as a legal return value from a successful parse. This means if you want to backtrack you'll have to keep track of the skip in your calling function in case the parse fails.

Although this example is "simple" - it already tells us exactly what a parser would look like:

peek something from the port
do some test to determine whether the peek returns the desired data
if it does - return the data and update the count appropriately - this is a "success"
otherwise return #f #f - this is a "fail"

All of the parsers will basically follow this structure.

Now - both the success and the fail can be refactored out as their own parsers:


(define (fail in (skip 0))
  (values #f #f)) 

(define (return v) ;; we return the value without consuming from the port 
  (lambda (in (skip 0)) 
    (values v skip)))

Then numeric can be rewritten as:


(define (numeric in (skip 0)) 
  (let ((c (peek-char in skip)))
    (if (char<=? #\0 c #\9)
        ((return c) in (add1 skip))
        (fail in skip))))

This makes it easier to build higher level parser combinators out of simpler pieces.

The alpha can be written as follows:


(define (alpha in (skip 0)) 
  (let ((c (peek-char in skip))) 
    (if (or (char<=? #\a c #\z) (char<=? #\A c #\Z))
        ((return c) in (add1 skip)) 
        (fail in skip))))

It ought to be clear that we can refactor alpha and numeric to make it more succinct:


(define (char-test test?) 
  (lambda (in (skip 0)) 
    (let ((c (peek-char in skip)))
      (if (and (char? c) (test? c)) 
          ((return c) in (add1 skip)) 
          (fail in skip)))))

(define alpha 
  (char-test (lambda (c) 
               (or (char<=? #\a c #\z) (char<=? #\A c #\Z)))))

(define numeric 
  (char-test (lambda (c)
               (char<=? #\0 c #\9))))

In this case - char-test is a higher order parser that takes in a predicate to create a parser. We now have a base to create more character-based parsers. We'll do so in the next post. Stay tuned.

Parsing "Encoded Word" in RFC Headers (3) - Parsing & Decoding

2009-10-15T16:02:00.000-07:00

In the previous two posts we have built capabilities to generate encoded words, now it's time to decode them. We'll start with detecting whether a string is an encoded word.

Testing for Encoded Word

Remember that an encoded word has the following format:


=?<charset>?<Q or B>?<encoded data>?=

The above format can be succinctly specified via regular expression:


(define encoded-word-regexp #px"=\\?([^\\?]+)\\?(?i:(b|q))\\?([^\\?]+)\\?=")

(define (encoded-word? str)
  (regexp-match encoded-word-regexp str))

Since the format is not recursive, a regular expression is good enough, even though it can be considered ugly. You can certainly try other approaches, such as a hand-written lexer, or using parser-tools to do the job. For now we keep things simple.
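To see what the regular expression hands back, here is a small check using the same pattern (the sample strings are just illustrations):

```racket
#lang racket

(define encoded-word-regexp
  #px"=\\?([^\\?]+)\\?(?i:(b|q))\\?([^\\?]+)\\?=")

;; On a match we get the whole encoded word followed by three captures:
;; the charset, the encoding letter, and the encoded data.
(define groups
  (regexp-match encoded-word-regexp "=?ISO-8859-1?Q?Andr=E9_?="))

;; A plain string does not match at all.
(define no-match
  (regexp-match encoded-word-regexp "just a plain header value"))
```

The three captures are exactly the arguments a decoder needs, which is why taking the cdr of the match result is convenient.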

Decoding

Once we can test whether a string is an encoded word, we can then use it to handle the decoding:


(define (encoded-word->string str)
  (if-it (encoded-word? str)
         (apply decode-encoded-word (cdr it))
         str))

If the string is an encoded word, we decode it; otherwise we return it verbatim. This allows regular strings to be passed into this function as well.
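The if-it form used above is not defined in this post. It reads like an anaphoric if, which binds the value of the test to the name it; a minimal sketch of such a macro, as an assumption about its behavior rather than the actual definition, could be:

```racket
#lang racket

;; Hypothetical definition of an anaphoric if: the value of the test is
;; bound to the identifier `it`, which is visible in both branches.
(define-syntax (if-it stx)
  (syntax-case stx ()
    ((_ test then-expr else-expr)
     (with-syntax ((it (datum->syntax stx 'it)))
       #'(let ((it test))
           (if it then-expr else-expr))))))

;; Example use: return the value for a key, or 'missing.
(define (lookup key alist)
  (if-it (assq key alist)
         (cdr it)
         'missing))
```

This is why (cdr it) in encoded-word->string can refer to the list returned by encoded-word? without naming it explicitly.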

The decode-encoded-word function looks like the following:


(define (decode-encoded-word charset encode str) 
  (bytes/charset->string ((cond ((string-ci=? encode "q") q-decode)
                                ((string-ci=? encode "b") b-decode)) (string->bytes/utf-8 str))
                         (string-downcase charset) 
                         ))

Of course - if the charset and the bytes do not match, it would error out, which is a sensible choice since the only time that would have occurred would be due to bugs in the generation.

Now that we can handle decoding a single encoded word, we need to handle decoding a string with multiple encoded words intermixing with non encoded words.

Decoding Multiple Encoded Words

While RFC822 does not define an actual maximum length for the header values, it considers headers > 72 characters as "long" since the users wanted (back then) to be able to read the headers in a terminal setting, and hence they built in the ability to "fold" a line into multiple lines with the use of LFWS (\r\n\s).

So a line of


"this is a line and it continues \r\n
 on the next line"

Should be unfolded into


"this is a line and it continues on the next line"

And since an encoded word can have a maximum length of 75 bytes, having multiple of them means that the line will most likely be folded, with a high likelihood that each single line within consists of a single encoded word (or is not encoded at all).

We have previously discussed how to read such a folded line with read-folded-line, so we can use it as a basis for reading in the folded line first and then try to parse out the encoded words from the result, but this requires quite a bit of work since:

our regex test for encoded word will consume and throw away bytes that are not encoded words, which is not what we want
if we do not want to throw away the bytes we will have to look for a different approach - either writing a custom lexer or use parser-tools
if we take that approach then what we have written so far is useless

Or is it? Let's see how far we can salvage what we have before having to look for another solution.

As we stated above, a very likely scenario for a multi-encoded-word line is that each encoded word will be on its own line (and if one of the lines is not encoded it should not contain encoded words), so a very simple approach would be to let encoded-word->string handle the conversion while read-folded-line is accumulating and folding over the lines. This will require us to modify read-folded-line:


(define (read-folded-line in (convert identity)) 
  (define (folding? c)
    (or (equal? c #\space)
        (equal? c #\tab)))
  (define (return lines) 
    (apply string-append "" (reverse lines)))
  (define (convert-folding lines)
    (let ((c (peek-char in)))
      (cond ((folding? c) 
             (read-char in)
             (convert-folding lines))
            (else
             (helper lines)))))
  (define (helper lines)
    (let ((l (read-line in 'return-linefeed)))
      (if (eof-object? l) 
          (return lines)
          (let ((c (peek-char in)))
            (if (folding? c) ;; we should keep going but first let's convert all folding whitespaces... 
                (convert-folding (cons (convert l) lines))
                ;; otherwise we are done... 
                (return (cons (convert l) lines)))))))
  (helper '()))

Then we can write the decoder as follows:


(define (encoded-word-string->string str)
  (read-folded-line (open-input-string str) encoded-word->string))

Which will handle encoded word string that is generated "normally" where each encoded word will reside on its own line.

Handling General Case of Multiple Encoded Words on the same Line

While the above encoded-word-string->string should handle normally generated encoded word strings out there, it still cannot handle situations where multiple encoded words reside on the same line, or where encoded words sit next to non encoded words on the same line. Such a situation can occur if the generation strategy is to encode each word individually (in a way this is why it's called "encoded word") - it's there in the RFC1342 example:


... 
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard 
...

If we try to decode it with what we have we'll lose the non encoded word:


> (encoded-word-string->string "=?ISO-8859-1?Q?Andr=E9_?= Pirard")
"André " ;; we lost Pirard

How can we solve this problem? Can we push what we have further or do we need to buckle down and look at using parser-tools?

Fortunately the format of encoded words helps us out. As defined in RFC1342, the only way the above situation can exist is if the words are separated by spaces (which are significant) on the same line. Hence we can split the line by space, decode the individual words, and then join them back with spaces:


(define (encoded-word-string->string str)
  (define (helper line)
    (string-join (map encoded-word->string (regexp-split #px" " line)) 
                 " "))
  (read-folded-line (open-input-string str) helper))

That's it - now we can generate and parse encoded words in RFC message headers. Enjoy.

Parsing "Encoded Word" in RFC Headers (2) - Charset Handling & Multiple Encoded Words

2009-10-14T18:53:00.000-07:00

In the previous post we discussed the Q and B encodings, and ended with a bug: the charset is mismatched if it is not utf-8. Let's try to fix the bug here.

It would be nice if we could use local charsets such as iso-8859-1 or big5 when we know for sure that the charset contains all of the characters that appear in the string (of course, it is the developer's responsibility to choose the right charset; the code will error out if the charset does not match the data).

PLT Scheme provides convert-stream to help convert bytes from one charset to another. On top of this function we can build helpers that take strings or bytes and return strings or bytes. What we want is something like:


(bytes/charset->string #"this is a string" "ascii") ;; => returns a string
(bytes/charset->bytes/utf-8 <bytes> <charset>) ;; => returns a bytes

The idea is that we'll convert the input data to input-port, and then retrieve the data from the output-port, which will be a bytes port.

So let's start with a helper function that'll take in an input-port, and the charsets and then return a bytes:


(define (port->bytes/charset in charset-in charset-out)
  (call-with-output-bytes 
   (lambda (out)
     (convert-stream charset-in in charset-out out))))

Then we can have the following:


(define (bytes->bytes/charset bytes charset-in charset-out)
  (port->bytes/charset (open-input-bytes bytes) charset-in charset-out))

And we can define converting bytes to and from utf-8:


(define (bytes/charset->bytes/utf-8 bytes charset)
  (bytes->bytes/charset bytes charset "utf-8")) 

(define (bytes/utf-8->bytes/charset bytes charset)
  (bytes->bytes/charset bytes "utf-8" charset))

And finally we can then return strings on top of these two functions:


;; there are more to handle (specifically charsets).
(define (bytes/charset->string bytes charset)
  (bytes->string/utf-8 (bytes/charset->bytes/utf-8 bytes charset)))

(define (string->bytes/charset string charset)
  (bytes/utf-8->bytes/charset (string->bytes/utf-8 string) charset))

With the above functions, we can now make sure the encoded word is generated in the correct charset:


(define (encode-encoded-word charset encode str)
  (format "=?~a?~a?~a?=" 
          (string-downcase charset)
          (string-upcase encode)
          ((cond ((string-ci=? encode "q") q-encode)
                 ((string-ci=? encode "b") b-encode)) 
           (string->bytes/charset str charset))))

Notice now that converting the same string with different charsets results in different encoded words:


> (encode-encoded-word "iso-8859-1" "q" "Keld Jørn Simonsen")
"=?iso-8859-1?Q?Keld_J=F8rn_Simonsen?="
> (encode-encoded-word "utf-8" "q" "Keld Jørn Simonsen")
"=?utf-8?Q?Keld_J=C3=B8rn_Simonsen?="

So now the bug is fixed.
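
To see where the two results above differ, it helps to look at the raw bytes. The snippet below is a small sketch that uses only the built-in string->bytes/latin-1 and string->bytes/utf-8 (rather than the string->bytes/charset helper above); the variable names are just for illustration:

```racket
#lang racket/base

;; "J\u00F8rn" contains U+00F8 (ø): one byte in Latin-1, two bytes in UTF-8.
(define name "J\u00F8rn")
(define latin-1-bytes (string->bytes/latin-1 name))
(define utf-8-bytes (string->bytes/utf-8 name))

(bytes->list latin-1-bytes) ; => '(74 248 114 110)      248 = #xF8
(bytes->list utf-8-bytes)   ; => '(74 195 184 114 110)  195 184 = #xC3 #xB8
```

Those are exactly the =F8 and =C3=B8 sequences that show up in the two encoded words.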

Convert a String of Arbitrary Length into Encoded Word String

When a string exceeds the maximum encoded word length of 75 bytes, we need to convert it into multiple encoded words, separated by linear folding whitespace (\r\n\s).

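Since both Q and B encoding lengthen the actual bytes (by 33% in the case of B), we cannot fit 75 bytes of data into one encoded word; we can only use 75 bytes minus the delimiters (12 bytes for utf-8 with B), divided by 133%, which the code below rounds to 48 bytes of characters per encoded word. Strictly speaking, 63 / 1.33 is about 47, and since base64 emits 4 characters for every 3 input bytes, 48 input bytes become 64 encoded characters, or 76 bytes with the delimiters; staying within 75 would require 45 bytes.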
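
The per-word budget can also be worked out in code. The following is only a sketch: max-b-payload-bytes is a hypothetical helper, not part of the code in this post, and it assumes B encoding, where every 3 raw bytes become 4 encoded characters and the delimiters cost the length of the charset name plus 7 bytes:

```racket
#lang racket/base

;; Largest number of raw bytes whose B (base64) encoding fits in one
;; encoded word of at most `limit` bytes, delimiters included.
(define (max-b-payload-bytes charset (limit 75))
  (let* ((overhead (+ (string-length charset) 7)) ; "=?" + "?B?" + "?="
         (room (- limit overhead))                ; bytes left for base64 text
         (groups (quotient room 4)))              ; whole 4-character groups
    (* 3 groups)))                                ; 3 raw bytes per group

(max-b-payload-bytes "utf-8")      ; => 45
(max-b-payload-bytes "iso-8859-1") ; => 42
```

Under that strict reading utf-8 gets 45 bytes per encoded word, a little less than the 48 used in the code below.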

Also - since some of the characters will be multi-byte, we want to make sure we do not break up the string in the middle of a character; we break only between characters.

Let's get started.

The following function will split a string up according to a maximum bytes length:


(define (split-string-by-bytes-count str num)
  (define (maker chars)
    (list->string (reverse chars)))
  (define (helper str i chars blen acc)
    (if (= i (string-length str)) ;; we are done here!!!... 
        (reverse (if (null? chars) acc
                     (cons (maker chars) acc)))
        (let* ((c (string-ref str i))
               (count (char-utf-8-length c))) 
          (if (> (+ count blen) num) ;; we are done with this version....
              (if (= blen 0) ;; this means the character itself is greater than the count.  
                  (helper str (add1 i) '() 0 (cons (maker (cons c chars)) acc))
                  (helper str i '() 0 (cons (maker chars) acc)))
              (helper str (add1 i) (cons c chars) (+ count blen) acc)))))
  (helper str 0 '() 0 '()))

It accumulates characters up to the maximum bytes count; if adding the next character's byte length would exceed the maximum, that character is left out of the current split and starts the next one. In the case where the maximum bytes count is lower than the character's byte length, that character gets its own string (i.e. if you pass in 0 you'll get a per-character split).


> (split-string-by-bytes-count "孫中山畢業於香港西醫書院" 0)
("孫" "中" "山" "畢" "業" "於" "香" "港" "西" "醫" "書" "院")
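
The per-character split also shows why we count bytes rather than characters. A quick check with the built-in char-utf-8-length (the same function the splitter uses); the sample characters are only for illustration:

```racket
#lang racket/base

(char-utf-8-length #\a)     ; => 1
(char-utf-8-length #\u00F8) ; => 2  (ø)
(char-utf-8-length #\u5B6B) ; => 3  (孫)

;; three characters, nine bytes
(bytes-length (string->bytes/utf-8 "\u5B6B\u4E2D\u5C71")) ; => 9
```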

Once we can split the string according to maximum bytes count, we can separately encode the split strings (and then join them together with \r\n\s):


(define (string->encoded-words s charset)
  (define (helper s)
    (case (string-type s)
      ((ascii) s)
      ((latin-1) (encode-encoded-word "iso-8859-1" "q" s))
      (else (encode-encoded-word charset "b" s))))
  (map helper (split-string-by-bytes-count s 48))) 

(define (string->encoded-word-string s (charset "utf-8"))
  (string-join (string->encoded-words s charset) "\r\n "))

Notice that in the above we test whether the string is an ascii string or a latin-1 string, because we do not have to encode ascii, and Q is a better encoding for latin-1 strings. Also notice that charset only affects the encoding of strings that contain characters outside of latin-1.

string-type is defined as follows:


(define (char-type c)
  (let ((i (char->integer c))) 
    (cond ((< i 128) 'ascii)
          ((< i 256) 'latin-1)
          (else 'unicode))))

(define (string-type s)
  (define (helper len i prev)
    (if (= len i) prev
        (let ((type (char-type (string-ref s i))))
          (case type 
            ((unicode) type)
            ((latin-1) 
             (helper len (add1 i) (case prev
                                    ((ascii) type)
                                    (else prev))))
            (else (helper len (add1 i) prev))))))
  (helper (string-length s) 0 'ascii))
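
The same classification can also be written as a single fold. This is an alternative sketch, not the definition used above; string-type/fold is a made-up name, and unlike the version above it keeps scanning after it sees a unicode character:

```racket
#lang racket/base

;; Hypothetical fold-based variant of string-type.
(define (string-type/fold s)
  (for/fold ((type 'ascii)) ((c (in-string s)))
    (let ((i (char->integer c)))
      (cond ((>= i 256) 'unicode)                         ; widest class wins
            ((and (>= i 128) (eq? type 'ascii)) 'latin-1) ; upgrade ascii only
            (else type)))))                               ; otherwise keep as is

(string-type/fold "Keld")             ; => 'ascii
(string-type/fold "J\u00F8rn")        ; => 'latin-1
(string-type/fold "\u4E2D\u56FD abc") ; => 'unicode
```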

With the above, we can now encode strings into encoded words:


> (string->encoded-word-string "Keld Jørn Simonsen")
;; => 
=?iso-8859-1?Q?Keld_J=F8rn_Simonsen?=
> (string->encoded-word-string "伦敦(英文:London,讀音:/ˈlʌndən/ 文件-播放)是英格蘭和英國的首都、第一大城及第一大港")
;; => 
=?utf-8?B?5Lym5pWmKOiLseaWhzpMb25kb24s6K6A6Z+zOi/LiGzKjG5kyZluLyDmlofku7Yt?=
 =?utf-8?B?5pKt5pS+KeaYr+iLseagvOiYreWSjOiLseWci+eahOmmlumDveOAgeesrOS4gA==?=
 =?utf-8?B?5aSn5Z+O5Y+K56ys5LiA5aSn5riv?=
> (string->encoded-word-string "China (simplified Chinese: 中国; traditional Chinese: 中國; Hanyu Pinyin: zh-zhongguo.ogg Zhōngguó (help·info); Tongyong Pinyin: Jhongguó; Wade-Giles: Chung1kuo2) is a cultural region, an ancient civilization, and, depending on perspective, a national or multinational entity extending over a large area in East Asia.")
;; => 
=?utf-8?B?Q2hpbmEgKHNpbXBsaWZpZWQgQ2hpbmVzZTog5Lit5Zu9OyB0cmFkaXRpb25hbCBD?=
 =?utf-8?B?aGluZXNlOiDkuK3lnIs7IEhhbnl1IFBpbnlpbjogemgtemhvbmdndW8ub2dnIFpo?=
 =?utf-8?B?xY1uZ2d1w7MgKGhlbHDCt2luZm8pOyBUb25neW9uZyBQaW55aW46IEpob25nZ3U=?=
 =?iso-8859-1?Q?=F3;_Wade-Giles:_Chung1kuo2=)_is_a_cultural_region?=
 , an ancient civilization, and, depending on per
 spective, a national or multinational entity ext
 ending over a large area in East Asia.

At this point, the generation of encoded word string is complete. Our next step is to parse such an encoded word string back into its original form. Stay tuned.

Parsing "Encoded Word" in RFC Headers

2009-10-13T23:08:00.000-07:00

If you want to correctly handle internet message headers as defined in RFC822 or as improved by RFC2822, you'll find that you currently have no way of handling encoded words, which are defined separately in RFC1342 (since superseded by RFC2047).

Below is an example of encoded words in message headers from RFC1342:


From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

which should be decoded into


From: Keith Moore <moore@cs.utk.edu>
To: Keld Jørn Simonsen <keld@dkuug.dk>
CC: André Pirard <PIRARD@vm1.ulg.ac.be>
Subject: If you can read this you understand the example.

But currently, net/head cannot handle encoded words and they are not parsed:


(extract-all-fields <the-above-string>)
;; => 
'(("From" . "=?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>")
 ("To" . "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>")
 ("CC" . "=?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>")
 ("Subject"
  .
  "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="))

So we'll need to handle it ourselves. Let's get started.

The Format of an Encoded Word

An encoded word has the following format:


=?<charset>?<Q or B>?<encoded data>?=

And an encoded word should not exceed 75 bytes (including all the delimiters). If the string being encoded cannot fit in that length, it is split into multiple encoded words, separated by a space or linear folding whitespace (\r\n\s*). Encoded words can coexist with plain text in the same header (shown above in the Cc header).
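
As a rough illustration of the format, a regular expression can pull a single encoded word apart. This is a sketch only: encoded-word-rx and split-encoded-word are hypothetical names, and the pattern does not enforce the 75 byte limit:

```racket
#lang racket/base

;; =?<charset>?<Q or B>?<encoded data>?=
(define encoded-word-rx #px"^=\\?([^?]+)\\?([QqBb])\\?([^?]*)\\?=$")

;; Returns (list charset encoding data), or #f if s is not an encoded word.
(define (split-encoded-word s)
  (let ((m (regexp-match encoded-word-rx s)))
    (and m (cdr m))))

(split-encoded-word "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?=")
;; => '("ISO-8859-1" "Q" "Keld_J=F8rn_Simonsen")
(split-encoded-word "plain text") ; => #f
```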

There are only two encodings defined for the encoded words, Q and B. They are almost identical to quoted-printable and base64, with some minor exceptions:

Q uses _ to substitute for space
B is not terminated by \r\n

We'll first generate encoded words, and then we'll parse them back.

Q Encoding
Since Q works more or less the same as quoted-printable, we can use net/qp as the base, and wrap qp-encode and qp-decode.

The decoding is more straightforward, since we just have to replace _ with #x20, which is space in ASCII, before decoding:


(define (q-decode bstr)
  ;; convert all _ to #\space first...
  (qp-decode (regexp-replace* #px"_" bstr (list->bytes (list #x20)))))
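
If net/qp is not at hand, the Q decoding rules are small enough to sketch without it. The following is not the implementation used in this post; q-decode/sketch is a hypothetical name, and it assumes that a malformed =XX escape should simply be passed through:

```racket
#lang racket/base

;; Dependency-free sketch of Q decoding (hypothetical, not net/qp).
(define (q-decode/sketch bstr)
  (let loop ((bs (bytes->list bstr)) (acc '()))
    (cond ((null? bs) (list->bytes (reverse acc)))
          ((= (car bs) 95)                        ; _ stands for a space
           (loop (cdr bs) (cons 32 acc)))
          ((and (= (car bs) 61)                   ; = followed by two more bytes
                (pair? (cdr bs))
                (pair? (cddr bs)))
           (let ((n (string->number
                     (bytes->string/latin-1 (bytes (cadr bs) (caddr bs)))
                     16)))
             (if (byte? n)
                 (loop (cdddr bs) (cons n acc))   ; valid =XX escape
                 (loop (cdr bs) (cons 61 acc))))) ; malformed: keep the =
          (else (loop (cdr bs) (cons (car bs) acc))))))

(bytes->string/latin-1 (q-decode/sketch #"Keld_J=F8rn_Simonsen"))
;; decodes to the name from the To header above
```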

The encoding works similarly, except that we need to encode more characters than qp-encode does, since the output must not conflict with the encoded word delimiters (=, ?), and it cannot include spaces, tabs, newlines, etc:



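;; convert a character into the bytes of its =XX escape (uppercase hex),
;; e.g. #\_ becomes the bytes of "=5F" and #\tab the bytes of "=09"
(define (char->q-bytes c)
  (let ((hex (string-upcase (number->string (char->integer c) 16))))
    (bytes->list
     (string->bytes/utf-8
      (string-append "=" (if (< (string-length hex) 2) "0" "") hex)))))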

(define BYTE:_ (char->integer #\_))
(define Q-BYTES:_ (char->q-bytes #\_))
(define BYTE:space (char->integer #\space))
(define Q-BYTES:space (list (char->integer #\_)))
(define BYTE:tab (char->integer #\tab))
(define Q-BYTES:tab (char->q-bytes #\tab))
(define BYTE:open-paren (char->integer #\())
(define Q-BYTES:open-paren (char->q-bytes #\())
(define BYTE:close-paren (char->integer #\)))
(define Q-BYTES:close-paren (char->q-bytes #\)))
(define BYTE:? (char->integer #\?))
(define Q-BYTES:? (char->q-bytes #\?))

(define (q-encode bstr)
  (define (push bytes acc)
    (cond ((null? bytes) acc)
          (else
           (push (cdr bytes) (cons (car bytes) acc)))))
  (define (helper in acc)
    (let ((c (read-byte in)))
      (cond ((eof-object? c) ;; we are done...
             (list->bytes (reverse acc)))
            ((= c BYTE:_)
             (helper in (push Q-BYTES:_ acc)))
            ((= c BYTE:space)
             (helper in (push Q-BYTES:space acc)))
            ((= c BYTE:tab)
             (helper in (push Q-BYTES:tab acc)))
            ((= c BYTE:open-paren)
             (helper in (push Q-BYTES:open-paren acc)))
            ((= c BYTE:close-paren)
             (helper in (push Q-BYTES:close-paren acc)))
            ((= c BYTE:?)
             (helper in (push Q-BYTES:? acc)))
            (else (helper in (cons c acc))))))
  (helper (open-input-bytes (qp-encode bstr)) '()))

B Encoding

Similarly, B works almost the same as base64, which is provided by net/base64. Decoding works exactly the same, so we just rename base64-decode to b-decode; for encoding, we just need to trim the \r\n at the end of the base64 output:


(define (b-encode bstr)
  (let ((bout (base64-encode bstr)))
    (subbytes bout 0 (- (bytes-length bout) 2))))

With the encoding mechanisms now available, it is straightforward to generate a single encoded word:


(define (encode-encoded-word charset encode str)
  (format "=?~a?~a?~a?=" 
          (string-downcase charset)
          (string-upcase encode)
          ((cond ((string-ci=? encode "q") q-encode)
                 ((string-ci=? encode "b") b-encode)) (string->bytes/utf-8 str))))

Calling it would generate the following result:


> (encode-encoded-word "utf-8" "q" "Keld Jørn Simonsen")
"=?utf-8?Q?Keld_J=C3=B8rn_Simonsen?="
> (encode-encoded-word "utf-8" "b" "If you can read this you understand the example.")
"=?utf-8?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW91IHVuZGVyc3RhbmQgdGhlIGV4YW1wbGUu?="

The above code, however, has a bug: the declared charset will not match the actual charset of the encoded bytes if the charset is not utf-8, because the string is always converted with string->bytes/utf-8.

Generally speaking this is not a big issue, since utf-8 is really superior to just about every other charset and is normally the proper choice. However, we should not have such a bug in our code, so we'll fix it in the next post. Stay tuned.

Introducing BZLIB/DATE & BZLIB/DATE-TZ - Date Time Manipulation Libraries

2009-10-05T11:50:00.000-07:00

BZLIB/DATE & BZLIB/DATE-TZ are now available on planet. They provide additional date manipulation capability on top of SRFI/19, including timezone manipulation.

They are released under LGPL.

Usage and Installation


(require (planet bzlib/date)) 
(require (planet bzlib/date-tz))

bzlib/date provides date manipulations. bzlib/date-tz provides timezone manipulations. Their usages are discussed separately below.

bzlib/date

To create a date object, you can use build-date, which provides a more natural year/month/day/hour/minute/second order of the parameters:


(build-date <year> <month> <day> <hour> <minute> <second> #:tz <offset>)

And it would do the right thing if you enter February 31st:


(build-date 2009 2 31) ;; => #(struct:tm:date 0 0 0 0 3 3 2009 0)

By default, the tz offset is 0, which equates to GMT (see below for timezone support). Only year, month, and day are required.

Date Comparisons
The following functions compare two dates to determine their order:

(date>? <d1> <d2>)
(date<? <d1> <d2>)
(date>=? <d1> <d2>)
(date<=? <d1> <d2>)
(day=? <d1> <d2>)
(date!=? <d1> <d2>)
(date===? <d1> <d2>)

day=? only compares the year/month/day values, and date===? means they are the same date with the same tz offset.

Conversions

You can convert between date and seconds with (date->seconds <date>) and (seconds->date <seconds>).

You can add to a date with (date+ <date> <number-of-days>). The number of days can be a non-integer.

You can find the difference between two dates with (date- <date1> <date2>).

You can create an alarm event with date via (date->alarm <date>) or (date->future-alarm <date>). The difference between the two is that date->future-alarm will return false if the date is in the past.

Dealing with Weekdays

To find out the weekday of a particular date, you can use (week-day <date>).

To find out the date of the nth-weekday (e.g., first sunday, 3rd wednesday, last Friday) of a particular month, use nth-week-day:


(nth-week-day <year> <month> <week-day> <nth> <hour> <minute> <second> #:tz <offset>)

For the week-day argument, use 0 for Sunday, and 6 for Saturday. For the nth argument, use 1, 2, 3, 4, 5, or 'last. hour, minute, second, and offset are optional (same as build-date, and the other functions below that have them).

To find out the date of a particular weekday relative to another date, use one of the following:

week-day>=mday
week-day<=mday
week-day>mday
week-day<mday

They all share the same arguments, which are year, month, week-day, month-day, hour, minute, second, and offset.

The usage is something like:


;; the sunday after May 15th, 2009
(week-day>mday 2009 5 0 15) ;; 5/17/2009 
;; the friday before September 22nd, 2008 
(week-day<mday 2008 9 5 22) ;; 9/19/2008 

The hour, minute, second, and offset parameters are there for you to customize the return values:


;; the sunday after May 15th, 2009
(week-day>mday 2009 5 0 15 15 0 0 #:tz -28800) ;; 5/17/2009 15:00:00-28800
;; the friday before September 22nd, 2008 
(week-day<mday 2008 9 5 22 8 30 25 #:tz 14400) ;; 9/19/2008 08:30:25+14400
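
If you do not have bzlib/date installed, the arithmetic behind these examples can be approximated with racket/date. The helpers below (week-day-of, day-after, day-before) are hypothetical and only return a day of the month; they make no attempt to roll over into the neighboring month:

```racket
#lang racket/base
(require racket/date) ; find-seconds

;; Day of the week (0 = Sunday ... 6 = Saturday) of a calendar date, in UTC.
(define (week-day-of year month day)
  (date-week-day (seconds->date (find-seconds 0 0 0 day month year #f) #f)))

;; Day of the month of the first `week-day` strictly after `mday`.
(define (day-after year month week-day mday)
  (let ((delta (modulo (- week-day (week-day-of year month mday)) 7)))
    (+ mday (if (zero? delta) 7 delta))))

;; Day of the month of the last `week-day` strictly before `mday`.
(define (day-before year month week-day mday)
  (let ((delta (modulo (- (week-day-of year month mday) week-day) 7)))
    (- mday (if (zero? delta) 7 delta))))

(day-after 2009 5 0 15)  ; => 17, the sunday after May 15th, 2009
(day-before 2008 9 5 22) ; => 19, the friday before September 22nd, 2008
```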


bzlib/date-tz



By default, the timezone functions use the current-tz parameter, which defaults to America/Los_Angeles; parameterize it to work in another timezone.  The timezone names are the names available in the Olson database.



To determine the offset of any date for a particular timezone, use tz-offset:


(parameterize ((current-tz "America/New_York")) 
  (tz-offset (build-date 2008 3 9))) ;; => -18000 
(parameterize ((current-tz "America/New_York")) 
  (tz-offset (build-date 2008 3 10))) ;; => -14400 

If you want to separate the standard offset from the daylight saving offset, you can use tz-standard-offset or tz-daylight-saving-offset:


(let ((d1 (build-date 2008 3 9))
        (d2 (build-date 2008 3 10)))
    (parameterize ((current-tz "America/New_York"))
      (values (tz-standard-offset d1)
              (tz-daylight-saving-offset d1)
              (tz-daylight-saving-offset d2))))
;; => -18000 (std)
;; => 0 (dst on 3/9/2008)
;; => 3600 (dst on 3/10/2008) 


Conversion



To reset a date's tz offset, you can use the helper function date->tz, which will reset the offset for you:


(let ((date (build-date 2008 3 10 #:tz -18000))) 
  (parameterize ((current-tz "America/New_York")) 
    (date->tz date))) 
;; => #(struct:tm:date 0 0 0 0 10 3 2008 -14400)

This function is meant for you to fix the offsets for dates that belong to a particular timezone but did not correctly account for the offset - it does not switch the timezone for you. 



A couple of other functions make it even easier to work with timezones.  


(build-date/tz <year> <month> ...)
(date+/tz <date> <number-of-days>)

They basically create the date object and call date->tz so the offset is properly adjusted based on the timezone. 



Besides using current-tz, you can also pass the timezone explicitly to tz-offset, tz-standard-offset, tz-daylight-saving-offset, date->tz, build-date/tz, and date+/tz.  You pass it in one of the following forms:


(tz-offset <date> "America/Los_Angeles")
(tz-daylight-saving-offset <date> "Asia/Kolkata")
(tz-standard-offset <date> "Europe/London")
(date->tz <date> "Europe/London")
(build-date/tz <year> <month> <day> #:tz "America/New_York")
(date+/tz <date> <number-of-days> "America/Los_Angeles")


Convert from One Timezone to Another



To convert the timezone of a date so you get the same point in time expressed in a different timezone, use tz-convert:


(tz-convert <date> <from-timezone> <to-timezone>)

All parameters are required.  


(tz-convert (build-date 2008 3 10 15) "America/New_York" "America/Los_Angeles")
;; => #(struct:tm:date 0 0 0 12 10 3 2008 -25200) ;; 2008/3/10 12:00:00-25200
(tz-convert (build-date 2008 3 10 15) "America/New_York" "GMT")
;; ==> #(struct:tm:date 0 0 0 19 10 3 2008 0) ;; 2008/3/10 19:00:00+00:00
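
The arithmetic behind these two results is simple once the offsets are known. A sketch, assuming offsets are seconds east of GMT as in the date structs above; convert-wall-seconds is a hypothetical helper that ignores rollover past midnight, while tz-convert looks the offsets up for you:

```racket
#lang racket/base

;; Re-express a wall-clock time (seconds since local midnight) that was
;; recorded under from-offset as wall-clock time under to-offset.
(define (convert-wall-seconds wall from-offset to-offset)
  (+ (- wall from-offset) to-offset))

(define three-pm (* 15 3600))

;; New York is at -14400 and Los Angeles at -25200 on 2008/3/10.
(quotient (convert-wall-seconds three-pm -14400 -25200) 3600) ; => 12
(quotient (convert-wall-seconds three-pm -14400 0) 3600)      ; => 19
```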


That's it for now - enjoy.



Handling Time Zone in Scheme (5): Pre-Convert the Rules
2009-10-01T07:42:00.000-07:00
This is part five of the timezone series - you can find the previous posts on this subject to get up to speed on the details:

motivation and overview
parsing the zoneinfo database
compute the offsets
convert the rules into dates


We previously finished the first draft of tz-offset, and it works correctly in the majority of situations.  But unfortunately, there is one issue with it:

The code will not work correctly when there are gaps between the years where the rules are applicable. 
Fortunately, under regular use of the code, we will not encounter this issue.  But the issue definitely exists.  Below is the US rule: 


  ("US"
   (1918 1919 - 3 (last 0) (2 0 0 w) 3600 "D")
   (1918 1919 - 10 (last 0) (2 0 0 w) 0 "S")
   (1942 1942 - 2 9 (2 0 0 w) 3600 "W")
   (1945 1945 - 8 14 (23 0 0 u) 3600 "P")
   (1945 1945 - 9 30 (2 0 0 w) 0 "S")
   (1967 2006 - 10 (last 0) (2 0 0 w) 0 "S")
   (1967 1973 - 4 (last 0) (2 0 0 w) 3600 "D")
   (1974 1974 - 1 6 (2 0 0 w) 3600 "D")
   (1975 1975 - 2 23 (2 0 0 w) 3600 "D")
   (1976 1986 - 4 (last 0) (2 0 0 w) 3600 "D")
   (1987 2006 - 4 (match 0 >= 1) (2 0 0 w) 3600 "D")
   (2007 +inf.0 - 3 (match 0 >= 8) (2 0 0 w) 3600 "D")
   (2007 +inf.0 - 11 (match 0 >= 1) (2 0 0 w) 0 "S"))

You can see that there is a gap between 1945 and 1967 in the applicable rules.  You probably are not going to calculate the offsets for 1948 most of the time, but it would be nice if we did not have to face the potential issue. 



What we want is to have the last rule of 1945 continue to be applicable until 1967 in this case.  And that means we need to be able to skip over multiple years, which our current design does not account for.  The nice thing is that the situation is not as dire as it sounds, since most of the timezones that I inspected visually do not make use of the "US" rule during this gap.  But it would be nice to know that such a potential bug does not exist. 



As we found out in the previous posts, inferring the "previous" rule from the data format is difficult, while computing the applicable rules for a given year is easy.  So the easiest solution is to pre-compute the dates for every year that we need to worry about.  Any gap will then automatically be filled with the previously applicable rule, paired with the year it is applied to. 
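
The idea can be sketched on a toy version of the rule list. Everything here is simplified and hypothetical: the rules carry only (from to month letter), they are assumed to be listed in chronological order, and fill-years is not the function we end up with below:

```racket
#lang racket/base
(require racket/list) ; last

;; Simplified rules: (from-year to-year month letter), chronological order.
(define sample-rules
  '((1945 1945 8 "P")
    (1945 1945 9 "S")
    (1967 1973 4 "D")
    (1967 2006 10 "S")))

(define (rules-for-year rules year)
  (filter (lambda (r) (<= (car r) year (cadr r))) rules))

;; Pair every year in [from, to] with its rules; a year with no rule of
;; its own inherits the last rule that was in effect before it.
(define (fill-years rules from to)
  (let loop ((year from) (carry '()) (acc '()))
    (if (> year to)
        (reverse acc)
        (let* ((hit (rules-for-year rules year))
               (use (if (null? hit) carry hit)))
          (loop (add1 year)
                (if (null? use) '() (list (last use)))
                (cons (cons year use) acc))))))

(cdr (assv 1950 (fill-years sample-rules 1945 1967)))
;; => '((1945 1945 9 "S"))
```

The gap year 1950 picks up the last rule of 1945, which is the behavior we are after.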



The Applicable Years



It should be obvious that the timezone concept has a definite bound toward the past, as GMT was not established until 1675, and the US did not adopt timezones until 1918.  The minimum year in the zoneinfo database is 1916:




(define (zone-year (aggregate min))
  (define (until-helper zone) 
    (let ((until (zone-until zone))) 
      (if (date? until) (date-year until) #f)))
  (let ((rules (filter identity 
                       (flatten 
                        (hash-map zones
                                  (lambda (key zone)
                                    (map zone-rule zone))))))
        (untils (hash-map zones (lambda (key zone)
                                  (map until-helper zone)))))
    (apply aggregate 
           (filter (lambda (date)
                     (and (number? date) 
                          (not (equal? date +inf.0))
                          (not (equal? date -inf.0)))) 
                   (flatten (append untils 
                                    (map rule-from rules)
                                    (map rule-to rules)))))))

(zone-year min) ;; ==> 1916. 


And while there is no clear upper bound for the time zones, we can follow some conventions to establish one and avoid generating rules indefinitely, which would likely be incorrect anyway, since timezones and daylight saving times can easily be changed in the future at the whim of politicians and governments.  A good upper bound is 2038, since that is the year of the Unix equivalent of the Y2K problem (a signed 32-bit time_t overflows), and it will give us plenty of time to update the library.  This number coincidentally is also the largest year in the zoneinfo database:


(zone-year max) ;; ==> 2038 

The two years now form our range for calculating all applicable rules:


(define ZIC-MIN (zone-year min)) 
(define ZIC-MAX (zone-year max)) 

(define (all-applicable-rule/years rules)
  (sort (apply append 
               (for/list ((year (in-range ZIC-MIN ZIC-MAX))) 
                 (applicable-rules-by-year rules year)))
        rule/year
With the above we'll be able to convert zones with rules.  What about the zones without rules?  We'll just have to fill out the list of the rules ourselves:


(define (zone->rule/years zone)
  (if (zone-rule zone)
      (all-applicable-rule/years (zone-rule zone)) 
      (for/list ((year (in-range ZIC-MIN ZIC-MAX))) 
        (cons (make-rule year year 1 1 '- '(0 0 0 w) 0 "S") year))))
Now each zone is basically mapped to a full list of rule/year pairs. We want the final outcome to be a list of structures that contain the boundary (a date), the standard offset, and the daylight saving offset:


(define-struct normalized (bound std dst)) 
And our rule/year pair can be converted to normalized with the following:


(define (rule/year->normalized rule/year std-offset offset)
  (define (helper rule year)
    (let ((date (on/year->date rule year)))
      (match (rule-time rule) 
        ((list hour minute second type) ;; wall clock 
         (make-normalized (build-date (date-year date)
                                      (date-month date)
                                      (date-day date)
                                      hour 
                                      minute
                                      second 
                                      #:tz (case type 
                                             ((g u z) 0)
                                             ((s) std-offset)
                                             (else ;; wall-clock requires the previous rule...  
                                              (+ std-offset offset))))
                          std-offset
                          (rule-offset rule))))))
  (helper (car rule/year) (cdr rule/year)))
Which looks quite similar to rule/year->date/offset, which we might retrofit with this newer function.



The next challenge is to convert a zone into a list of normalized structures, following the offset dependency between the rules, as well as filtering against the upper and the lower bounds (the lower bound comes from the previous zone). 



Without filtering it looks like the following:


(define (zone->normalized zone (prev '()))
  (define (helper rule/years acc) 
    (cond ((null? rule/years) ;; we are done... 
           acc)
          (else
           (let ((normalized (rule/year->normalized (car rule/years)
                                                    (zone-offset zone)
                                                    (if (null? acc) 
                                                        0
                                                        (normalized-dst (car acc))))))
             (helper (cdr rule/years) (cons normalized acc))))))
  (helper (zone->rule/years zone) prev))
To filter for the upper bound, we'll have to test against the UNTIL field.  If it's #f, there is no upper bound. Otherwise we first compare the years: a rule/year pair whose year exceeds UNTIL's year is dropped right away, and a pair from an earlier year is kept.  For a boundary in the same year as UNTIL, we convert UNTIL into a date object, using the boundary's wall time, and compare the two:


(define (zone->normalized zone (prev '()))
  (define (until-helper offset)
    (define (helper year month day hour minute second type)
      (build-date year month day hour minute second
                                    #:tz (case type 
                                           ((u g z) 0)
                                           ((s) (zone-offset zone))
                                           ((w) (+ offset (zone-offset zone))))))
    (apply helper (zone-until zone)))
  (define (helper rule/years acc) 
    (cond ((null? rule/years) ;; we are done... 
           acc)
          ((and (zone-until zone) ;; if until & date > until's year. 
                (> (cdar rule/years) (car (zone-until zone))))
           (helper (cdr rule/years) acc))
          (else
           (let ((normalized (rule/year->normalized (car rule/years)
                                                    (zone-offset zone)
                                                    (if (null? acc) 
                                                        0
                                                        (normalized-dst (car acc))))))
             (cond ((not (zone-until zone)) ;; infinite upper bound 
                    (helper (cdr rule/years) (cons normalized acc)))
                   ((< (cdar rule/years) (car (zone-until zone))) ;; less than until's year 
                    (helper (cdr rule/years) (cons normalized acc)))
                   (else ;; we need to ensure the bound is lower than the year
                    (let ((until (until-helper (if (null? acc) 
                                                   0
                                                   (normalized-dst (car acc))))))
                      (if (date<? (normalized-bound normalized) until)
                          (helper (cdr rule/years) (cons normalized acc))
                          (helper (cdr rule/years) acc)))))))))
  (helper (zone->rule/years zone) prev))
To filter out the lower bound, we will test to see if the new normalized bound is greater than the previous batch's bound:


(define (zone->normalized zone (prev '()))
  ... 
  (define (helper rule/years acc) 
    (cond ...
          (else
           (let ((normalized (rule/year->normalized (car rule/years)
                                                    (zone-offset zone)
                                                    (if (null? acc) 
                                                        0
                                                        (normalized-dst (car acc))))))
             (cond ((date<? (normalized-bound normalized) 
                            (if (null? prev) 
                                (build-date 1 1 1) 
                                (normalized-bound (car prev)))) 
                    (helper (cdr rule/years) acc))
                   ...)))))
  (helper (zone->rule/years zone) prev)) 
With the above, we can just fold over the list of zones to get to the list of normalized:


(define (zones->normalized zones)
  (foldl zone->normalized 
         '()
         zones))
From this point on, what remains is to update the serialization code to write out the list, and then to modify tz-offset to make use of the new list.  Stay tuned.


DBI and SQL Escape
2009-09-30T11:56:00.000-07:00
Scott Hickey has discovered a bug in DBI: 


;; assume you have a table1 with an id and a date field. 
(exec h "insert into table1 values (?id , ?date)" `((id . 1) (date . ,(srfi19:current-date))))
;; => regexp-replace*: expects type  as 2nd argument, given:
;;    #(struct:tm:date 9150000 8 19 0 30 9 2009 -18000); other arguments were: #px"\\'" "''"
This issue is now fixed, and the newer versions of DBI and the drivers are now available through planet:

bzlib/dbi:1:2
bzlib/dbd-jazmysql:1:1
bzlib/dbd-jsqlite:1:2
bzlib/dbd-spgsql:1:1
This post documents the issue and the resolution of the bug. 



This issue is caused by the default SQL escape code that does not know how to handle srfi date objects.  The default SQL escape code is quite primitive for the reasons below. 
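
To make the failure concrete, here is a sketch of what a primitive escape routine looks like. This is not DBI's actual code (sql-escape/sketch is a hypothetical name); it only shows why a value such as a srfi date has nowhere to go:

```racket
#lang racket/base

;; A deliberately small escape routine: it only knows strings and numbers.
(define (sql-escape/sketch v)
  (cond ((string? v)
         ;; double every single quote, then wrap the value in quotes
         (string-append "'" (regexp-replace* #rx"'" v "''") "'"))
        ((number? v) (number->string v))
        (else (error 'sql-escape/sketch "cannot escape value: ~e" v))))

(sql-escape/sketch "O'Brien") ; => "'O''Brien'"
(sql-escape/sketch 42)        ; => "42"
```

Every additional type (dates, blobs, ...) needs its own branch, and the right output differs from one database to the next.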



To work around the problem - you can use prepared statements:


(prepare h 'insert-table1 "insert into table1 values (?id , ?date)")
(exec h 'insert-table1 `((id . 1) (date . ,(srfi19:current-date))))
The default preparation code exists as a prepared statement proxy for those database drivers that have no prepared statement capabilities.  This is one of the selling points of DBI over other APIs - the query always allows parameterization.  But that means DBI cannot delegate the task of SQL escaping back to the user.



Because my usage of databases has always revolved around prepared statements, I did not write an extensive SQL escaping library (and hence the bug). Plus, there are technical reasons why prepared statements are superior:

SQL escapes do not work uniformly across all databases, especially for types such as blobs and date objects, for which each database has its own syntax (and some basically discourage using SQL escapes for blobs)
SQL escapes are prone to SQL injections if done poorly.  One of my previous gigs was to weed out SQL injection bugs in a client code base, and while the concept is simple, many implementations still got it wrong
In general prepared statements will perform better for multiple uses (but this unfortunately is not always true if the cached query plan results in a miss by the server)

Prepared statements (and stored procedures) are superior to SQL escapes in just about all aspects, including performance and security.  There are only three downsides that I am aware of for prepared statements:

it might cause the database to hold onto the referred objects so they cannot be dropped - this mainly impacts the development environment, since it actually protects your production environment from tragic accidents such as dropping tables, views, etc.
it might not work well for code that creates dynamic SQL statements that refer to tables with unique prefixes (wordpress and a bunch of PHP code fall into this design style), since there might be thousands of such unique prefixes in a given database.  In general, such design really should be discouraged, since databases are designed more for a few large tables than for many small tables
it's not all that useful and can potentially be slower for one-call statements, but most of the time this is a non-issue

Anyhow - the reason I am highlighting the merits of prepared statements over SQL escapes is that I believe prepared statements are the way to go, especially for databases that already have them.  So I decided to make dbd-spgsql, dbd-jsqlite, and dbd-jazmysql implicitly create prepared statements if you do not explicitly name the query via the prepare call. 



So - you can just write: 


(exec h "insert into table1 values (?id , ?date)" `((id . 1) (date . ,(srfi19:current-date))))
And it will behave as if you do the following:


;; note - prepare only takes a symbol as the key, so you cannot do this manually yourself
(prepare h "insert into table1 values (?id , ?date)" "insert into table1 values (?id , ?date)")
(exec h "insert into table1 values (?id , ?date)" `((id . 1) (date . ,(srfi19:current-date))))
So dbd-spgsql, dbd-jazmysql, and dbd-jsqlite will no longer use SQL escapes for parameterization going forward. The updated drivers have been made available via planet.



Thank you Scott for discovering and reporting this issue.


Handling Time Zone in Scheme (4): Rule Conversion
2009-09-30T09:34:00.000-07:00
Previously we have discussed timezone handling in a series of posts:

motivation and overview
parsing zoneinfo database
calculating offsets
To continue from the third post, where we were in the midst of calculating daylight saving offsets: we have figured out the applicable rules, and we now need to convert them into date structs so we can determine the exact boundaries.



Going back to our two rule examples for America/Los_Angeles:


  (2007 +inf.0 - 3 (match 0 >= 8) (2 0 0 w) 3600 "D")
  (2007 +inf.0 - 11 (match 0 >= 1) (2 0 0 w) 0 "S")
We want to convert them into the applicable values (for both 2009 and the previous year - 2008): 


2009/3/8 02:00:00-08:00 
2009/11/1 02:00:00-07:00
2008/3/9 02:00:00-08:00 
2008/11/2 02:00:00-07:00 
In order to do so, we'll first have to convert the ON field (day of the month) into the correct date value, and then we'll have to convert the AT field (time of day) into the correct time value. Let's get started.



Day of the Month



The simplest ON format is a day number (ranging from 1-31), and for that we do not have to do too much.  But there are also two other formats that are based on weekdays:


'(last 0) ;; => last sunday (sunday = 0, monday = 1 ..., saturday = 6) 
'(match 0 >= 5) ;; => sunday on or after 5th 
'(match 2 <= 10) ;; => tuesday on or before 10th 


That means we need to be able to convert them to the appropriate day of the month based on the year and the month. 



Doomsday Algorithm and Weekday Calculation



To be able to calculate the weekday-based date values, we first need to be able to calculate the weekday of a particular date. For that we can make use of the doomsday algorithm, which is based on the concept that there is a doomsday every month (dates that all fall on the same weekday within a given year), and that they are easy to remember with a mnemonic (4/4, 6/6, 8/8, 10/10, 12/12, ...). The linked explanation makes it sound more complicated than it actually is - below is the one-line doomsday algorithm in Scheme:


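(define (doomsday y)
  ;; floor each division separately (quotient), otherwise leap years such as 2008 come out a day early
  (modulo (+ 2 y (quotient y 4) (- (quotient y 100)) (quotient y 400)) 7))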
Then with doomsday we can calculate the weekday of a date:




(define (leap-year? year)
  ;; every 4th year is a leap year
  (or (and (= (modulo year 4) 0)
           ;; unless it's divisible by 100
           (not (= (modulo year 100) 0)))
      ;; but if it's divisible by 400 then it's a leap year after all
      (= (modulo year 400) 0)))

(define (week-day d)
  ;; note: the -3/-2 adjustment assumes year-day is zero-based (January 1st => 0)
  (modulo (+ (doomsday (date-year d)) (year-day d)
             (if (leap-year? (date-year d)) -3 -2)) 7))
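As a quick sanity check, here is a minimal, self-contained sketch of the same calculation that works on plain year/month/day integers rather than date structs. The names leap?, day-of-year, doomsday*, and weekday are hypothetical stand-ins (they are not part of the library), day-of-year is assumed to be zero-based, and each division is floored on its own:

```racket
#lang racket/base

;; Hypothetical helpers for illustration only; dates are plain integers here.
(define (leap? y)
  (or (and (zero? (modulo y 4)) (not (zero? (modulo y 100))))
      (zero? (modulo y 400))))

;; Zero-based day of the year: January 1st => 0.
(define (day-of-year y m d)
  (define days-before '#(0 31 59 90 120 151 181 212 243 273 304 334))
  (+ (vector-ref days-before (sub1 m))
     (if (and (leap? y) (> m 2)) 1 0)
     (sub1 d)))

;; Doomsday of the year (sunday = 0), flooring each division separately.
(define (doomsday* y)
  (modulo (+ 2 y (quotient y 4) (- (quotient y 100)) (quotient y 400)) 7))

;; Weekday of a date, sunday = 0 ... saturday = 6.
(define (weekday y m d)
  (modulo (+ (doomsday* y) (day-of-year y m d) (if (leap? y) -3 -2)) 7))

(weekday 2009 3 8)  ; => 0, the second Sunday of March 2009
(weekday 2008 11 2) ; => 0, the first Sunday of November 2008
(weekday 2000 1 1)  ; => 6, a Saturday
```

The two Sundays are the same transition dates we are trying to derive from the America/Los_Angeles rules above.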
Then to figure out the weekday on or after a particular date, we just need to do the following:

figure out the weekday of the date
figure out the difference between the weekday of the date and the weekday of your choice
add the difference to the date
The following figures out the difference between two weekdays:


(define (week-day-diff to from)
  (modulo (- to from) 7))
And assuming we have a date+ that takes in a date and a number as the days to add, then the following will determine the date on or after a particular date by weekday:


(define (week-day>=? year month wday mday) 
  (define (helper date)
    (date+ date (week-day-diff wday (week-day date))))
  (helper (build-date year month mday)))
The other combinations (week-day>?, week-day<=?, and week-day<?) are left as exercises.
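To make the arithmetic concrete, here is a rough sketch of the "on or after" and "on or before" cases that returns a bare day-of-month number instead of a date. It swaps the doomsday code for Racket's own find-seconds and seconds->date to get the weekday, and the names weekday, mday-on-or-after, and mday-on-or-before are made up for this example. Note that the result can fall outside the month (which the date+ approach above handles by rolling over into the neighbouring month):

```racket
#lang racket/base
(require racket/date) ; for find-seconds

;; Weekday (sunday = 0) via the standard library, computed in UTC at noon.
(define (weekday y m d)
  (date-week-day (seconds->date (find-seconds 0 0 12 d m y #f) #f)))

;; Day of the month of the first `wday` on or after `mday`.
(define (mday-on-or-after y m wday mday)
  (+ mday (modulo (- wday (weekday y m mday)) 7)))

;; Day of the month of the last `wday` on or before `mday`.
(define (mday-on-or-before y m wday mday)
  (- mday (modulo (- (weekday y m mday) wday) 7)))

(mday-on-or-after 2009 3 0 8)   ; => 8  (match 0 >= 8) for March 2009
(mday-on-or-after 2008 3 0 8)   ; => 9  (match 0 >= 8) for March 2008
(mday-on-or-after 2008 11 0 1)  ; => 2  (match 0 >= 1) for November 2008
(mday-on-or-before 2009 3 2 10) ; => 10 (match 2 <= 10), the 10th is a Tuesday
```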



To determine the last (or nth) weekday of the month, we can employ a similar algorithm:

figure out the weekday of the first of the month
figure out the difference between that weekday and the weekday of your choice
add the number of weeks on top of the date to get the desired date
if the date exceeds the month, subtract a week to get to the last weekday within the month boundary
The following accomplishes the above:


(define (nth-week-day year month wday nth)
  ;; the way to do so is to figure out the weekday for the first of the month, and then work toward 
  ;; the nth wday 
  (define (date-helper date)
    (if (not (= (date-month date) month))
        (date+ date -7)
        date))
  (define (helper date)
    (date-helper 
     (date+ date 
            (+ (week-day-diff wday (week-day date)) 
               (* (sub1 (case nth
                          ((first) 1)
                          ((last) 5)
                          (else nth))) 
                  7)))))
  (helper (build-date year month 1)))
With the above, we can now finally convert the ON field into the correct date value:


(define (on/year->date rule year) 
  (match (rule-date rule) 
    ((? integer? date)
     (build-date year (rule-month rule) date))
    ((list 'last (? number? wday))
     (nth-week-day year (rule-month rule) wday 'last)) 
    ((list 'match (? number? wday) (? symbol? test) (? number? day))
     ((case test 
        ((>=) week-day>=?)
        ((>) week-day>?)
        ((<=) week-day<=?)
        ((<) week-day<?)) year (rule-month rule) wday day))))


Determining "Wall Clock" Time



We are almost able to convert an applicable rule into a date object, but first we still have to convert the AT field into the corresponding time of day.



Unfortunately, the AT field holds more than just a representation of hour:minute:second. It also holds the type of the clock, which can be one of the following:

universal time - no offsets
standard time - time-zone offsets only; no daylight saving offsets
"wall clock" time - time-zone offsets + daylight saving offsets
The first two are straightforward - universal time has an offset of 0, and standard time has the default offset that we should pass in from the zone. But the "wall clock" time is more complicated. It basically means we need to know which rule was in effect just before the current rule comes into effect, since at the moment of a daylight saving transition the previous rule is still affecting the wall clock.



Yes - it means that in order for us to arrive at the correct wall-clock time, we need to figure out the previous rule's offsets. 



Serendipitously, we have already generated the previous year's applicable rules. But since the previous year's rules also depend on their own previous year's rules, we will need to calculate the dates two years prior to ensure we get the wall clock time right for the previous year (we can drop the two-years-prior rules from final consideration once they have served their purpose in calculating the offsets).


(define (applicable-rules date rules)
  ... 
  (let ((year (date-year date))) 
    (append (by-year year rules)
            (by-year (sub1 year) rules) 
            (by-year (- year 2) rules))))


Let's first convert a single rule/year pair to be a date, based on the previous applicable rule:


(define (rule/year->date rule year prev-rule std-offset) 
  (let ((date (on/year->date rule year)))
    (match (rule-time rule) 
      ((list hour minute second type) ;; type is the clock type: universal, standard, or wall clock
       (build-date (date-year date)
                   (date-month date)
                   (date-day date)
                   hour 
                   minute
                   second 
                   #:tz (case type 
                          ((g u z) 0)
                          ((s) std-offset)
                          (else ;; wall-clock requires the previous rule...  
                           (+ std-offset (rule-offset prev-rule)))))))))
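To see what the #:tz case works out to for Los Angeles, here is a tiny standalone sketch of just that piece. The name at-clock-offset and its prev-save parameter (the daylight saving offset of the previous rule) are invented for this illustration; the clock type symbols are the ones used in the code above:

```racket
#lang racket/base

;; Offset (in seconds) of the clock that an AT time is expressed in.
(define (at-clock-offset type std-offset prev-save)
  (case type
    ((g u z) 0)                       ; universal time
    ((s) std-offset)                  ; standard time
    (else (+ std-offset prev-save)))) ; wall clock

(define pst (* -8 3600)) ; standard offset for America/Los_Angeles

(at-clock-offset 'w pst 0)    ; => -28800, March: previous rule is "S", so -08:00
(at-clock-offset 'w pst 3600) ; => -25200, November: previous rule is "D", so -07:00
(at-clock-offset 's pst 3600) ; => -28800
(at-clock-offset 'u pst 3600) ; => 0
```

That is why the March boundaries listed earlier are written with -08:00 while the November ones are written with -07:00.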
Then we will sort the rule/year pairs in descending order (latest first), and call rule/year->date on each rule, passing in the rule that precedes it in time as the previous rule.


(define (rule/year>? r/y1 r/y2) 
  (define (date-helper r1 r2 year) 
    (date>? (on/year->date r1 year) (on/year->date r2 year)))
  (define (month-helper r1 r2 year)
    (cond ((> (rule-month r1) (rule-month r2)) #t)
          ((= (rule-month r1) (rule-month r2))
           (date-helper r1 r2 year)) 
          (else #f))) 
  (let ((r1 (car r/y1)) 
        (y1 (cdr r/y1))
        (r2 (car r/y2))
        (y2 (cdr r/y2))) 
    (cond ((> y1 y2) #t)
          ((= y1 y2) 
           (month-helper r1 r2 y1)) 
          (else #f))))

(define (rule/years->date/offsets rule/years std-offset) 
  (define (helper rest acc)
    (cond ((null? rest) (reverse acc)) 
          ((null? (cdr rest)) ;; we have the last one... 
           (reverse acc)) 
          (else ;; the next pair in the descending list holds the previous rule
           (helper (cdr rest)
                   (cons (cons (rule/year->date (caar rest) (cdar rest) (caadr rest) std-offset)
                               (rule-offset (caar rest)))
                         acc)))))
  (helper (sort rule/years rule/year>?)
          '()))
With rule/years->date/offsets we can finally map the rules into an ordered list of date-boundary/offset pairs that we can use to determine the correct offset:


(define (tz-rules-offset date rules std-offset) 
  (define (helper date/offsets)
    (cond ((null? date/offsets) 0) 
          ((date>? date (caar date/offsets)) 
           (cdar date/offsets))
          (else
           (helper (cdr date/offsets)))))
  (helper (rule/years->date/offsets (applicable-rules date rules) std-offset)))
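The lookup itself is just a walk down the descending list until we pass a boundary. Below is a simplified sketch that uses plain UTC seconds instead of date structs; lookup-offset and the hand-built boundaries list are assumptions made for this example (the boundaries are the Los Angeles transitions from earlier, converted to UTC by hand), and find-seconds comes from racket/date:

```racket
#lang racket/base
(require racket/date) ; for find-seconds

;; boundaries: list of (utc-seconds . daylight-saving-offset), latest first.
(define (lookup-offset t boundaries)
  (cond ((null? boundaries) 0)
        ((> t (caar boundaries)) (cdar boundaries))
        (else (lookup-offset t (cdr boundaries)))))

;; 02:00 at -08:00 is 10:00 UTC; 02:00 at -07:00 is 09:00 UTC.
(define boundaries
  (list (cons (find-seconds 0 0 9 1 11 2009 #f) 0)
        (cons (find-seconds 0 0 10 8 3 2009 #f) 3600)
        (cons (find-seconds 0 0 9 2 11 2008 #f) 0)
        (cons (find-seconds 0 0 10 9 3 2008 #f) 3600)))

(lookup-offset (find-seconds 0 0 0 4 7 2009 #f) boundaries)   ; => 3600
(lookup-offset (find-seconds 0 0 0 25 12 2009 #f) boundaries) ; => 0
(lookup-offset (find-seconds 0 0 0 1 1 2009 #f) boundaries)   ; => 0
(lookup-offset (find-seconds 0 0 0 4 7 2008 #f) boundaries)   ; => 3600
```

A date that falls before every boundary in the list gets the default of 0, which is one more reason to include the rules from the earlier years.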
And tz-daylight-saving-offset needs to be updated accordingly, since tz-rules-offset now requires an additional std-offset argument:


(define (tz-daylight-saving-offset date zone-name)
  (define (until/rules-helper until offset rules)
    (define (until->date year month day hour minute second type)
      (date->seconds (build-date year month day hour minute second #:tz offset)))
    (list (if (not until) 
              +inf.0
              (apply until->date until))
          rules
          offset))
  (define (match-until/rules-helper date u/r)
    (cond ((null? u/r) 
           (error 'tz-daylight-saving-offset "invalid zone ~a for date ~a" zone-name date))
          ((<= date (caar u/r))
           (let ((until/rules (car u/r)))
             (tz-rules-offset (seconds->date date) (cadr until/rules) 
                              (caddr until/rules)))) 
          (else
           (match-until/rules-helper date (cdr u/r)))))
  ...) 
Now we can combine the tz-daylight-saving-offset and tz-standard-offset to determine the actual offset for a particular date:


(define (tz-offset date tz)
  (+ (tz-standard-offset date tz)
     (tz-daylight-saving-offset date tz))) 
Now we can finally correctly calculate the actual offsets.  Stay tuned.