<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>jonisalonen.com</title>
	
	<link>http://jonisalonen.com</link>
	<description />
	<lastBuildDate>Mon, 30 Apr 2012 00:13:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/JoniSalonen" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="jonisalonen" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Your Bash Prompt Needs This</title>
		<link>http://jonisalonen.com/2012/your-bash-prompt-needs-this/</link>
		<comments>http://jonisalonen.com/2012/your-bash-prompt-needs-this/#comments</comments>
		<pubDate>Fri, 16 Mar 2012 15:13:47 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Bash]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=344</guid>
		<description><![CDATA[Have you ever had this happen to you? You start a program and then interrupt it with Ctrl-C, and Bash prints its prompt after the ^C you have typed: prompt$ a very long command ^Cprompt$ █ Then you hit the Up key to retrieve very long command from the history and try to edit it, [...]]]></description>
			<content:encoded><![CDATA[<p>Have you ever had this happen to you? You start a program and then interrupt it with Ctrl-C, and Bash prints its prompt <em>after</em> the <code>^C</code> you have typed:</p>
<p><code>prompt$ <strong>a very long command</strong><br />
<strong>^C</strong>prompt$ █<br />
</code></p>
<p>Then you hit the Up key to retrieve <code>very long command</code> from the history and try to edit it, only to discover that the text you see on the screen is shifted by two characters from where it appears to be, and you can&#8217;t see where you&#8217;re typing anymore. A major annoyance!</p>
<p>Here&#8217;s how you can fix it: <em>make the Bash prompt always start at the first column</em>. To do this, include <code>\033[G</code> in your <code>$PS1</code>:</p>
<pre>PS1="\[\033[G\]$PS1"</pre>
<p>This code is the <a href="http://en.wikipedia.org/wiki/ANSI_escape_code">ANSI escape code</a> for moving the cursor to the first column. Your prompt will now start from the first column and write over the <code>^C</code> you typed. The <code>\[</code> and <code>\]</code>, on the other hand, are needed so that Bash does not count these movement codes when calculating the length of the prompt.</p>
<p><a href="http://tldp.org/HOWTO/Bash-Prompt-HOWTO/index.html">The Bash Prompt HOWTO</a> is a great resource if you want to learn more about how and why to customize your prompt.</p>
<p><em>Update: Wow, I never expected this would spark such a <a href="http://news.ycombinator.com/item?id=3899507">lively discussion on Hackernews</a>! People there suggest many alternative solutions to this problem, such as simply hitting Ctrl-C or Enter to get a clean line whenever this occurs, configuring a two-line prompt, or ditching Bash for Zsh.<br />
</em></p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/your-bash-prompt-needs-this/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Blaming the tool</title>
		<link>http://jonisalonen.com/2012/blaming-the-tool/</link>
		<comments>http://jonisalonen.com/2012/blaming-the-tool/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 07:00:18 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Fallacies]]></category>
		<category><![CDATA[Project management]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=310</guid>
		<description><![CDATA[I was thinking about the following scenario the other day: Give an experienced cook a dull knife. The blade slips and he cuts himself. Give an inexperienced cook a sharp knife. He doesn&#8217;t hold it right and he cuts himself. It occurred to me that there&#8217;s no way to tell the difference between the two [...]]]></description>
			<content:encoded><![CDATA[<p>I was thinking about the following scenario the other day:</p>
<blockquote><p>Give an experienced cook a dull knife. The blade slips and he cuts himself. Give an inexperienced cook a sharp knife. He doesn&#8217;t hold it right and he cuts himself.</p></blockquote>
<p>It occurred to me that there&#8217;s no way to tell the difference between the two from the outside. All we see is a cook with a bloody hand, cursing the knife. In the same way there&#8217;s no way to tell the difference between programmers using tools too sharp or too dull for their skills.</p>
<p>The scary part is this: often there&#8217;s no way to tell the difference from the inside either. As programmers we tend to be pretty smart people, or at least we think we are. But smart people suffer from <a title="Quora: What are some stupid things smart people do?" href="http://www.quora.com/What-are-some-stupid-things-smart-people-do" target="_blank">several crippling fallacies</a>, two being:</p>
<ul>
<li>being experts in one field, they think they are experts in others, and</li>
<li>they tend to underestimate the value of experience and the effort it takes to become really good at something.</li>
</ul>
<p>Together these imply that when forced to use a new tool we expect to be as productive as before a lot sooner than we really will be. We try to use it in ways that are unnatural and unsafe, get hurt, and then we blame the tool. We may even claim that the knife is dull, not knowing any better. Re-appropriating Paul Graham: &#8220;That Java project was bound to fail. How could it not? Java <a title="Paul Graham: Beating the Averages" href="http://www.paulgraham.com/avg.html" target="_blank">doesn&#8217;t even have X</a>.&#8221;</p>
<p>We think knowing all sorts of things about programming languages means we don&#8217;t have to know how databases work. &#8220;All they do is push and pull bytes from the disk, right?&#8221; Or that knowing everything there is to know about pointers and recursion means <a title="Joel on Software: The Perils of Java Schools" href="http://www.joelonsoftware.com/articles/ThePerilsofJavaSchools.html" target="_blank">we should ignore OOAD</a>. (News flash! 99% of code written today doesn&#8217;t use recursion or pointers. Well, not directly anyway.)</p>
<p><em>Next time, try holding the sharp edge downwards</em>. There are no bad programming languages, only bad programmers.</p>
<p><strong>Update:</strong> Commentary on <a href="http://news.ycombinator.com/item?id=3656130" target="_blank">Hackernews</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/blaming-the-tool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Calling C From Java Is Easy</title>
		<link>http://jonisalonen.com/2012/calling-c-from-java-is-easy/</link>
		<comments>http://jonisalonen.com/2012/calling-c-from-java-is-easy/#comments</comments>
		<pubDate>Wed, 15 Feb 2012 07:00:25 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[JNI]]></category>
		<category><![CDATA[Tutorials]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=282</guid>
		<description><![CDATA[Sometimes we need to access operating system functions that the standard Java API doesn&#8217;t expose, or use non-Java libraries. Although it&#8217;s well known that you can call this &#8220;native code&#8221; from Java using JNI, there is not so much entry-level material on how it&#8217;s actually done. It is often left out of introductory material&#8211;including the official [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes we need to access operating system functions that the standard Java API doesn&#8217;t expose, or use non-Java libraries. Although it&#8217;s well known that you can call this &#8220;native code&#8221; from Java using JNI, there is not so much entry-level material on how it&#8217;s actually done. It is often left out of introductory material&#8211;including the official Java Tutorial. Here I hope to give a short introduction to get you started.</p>
<p>Programming in JNI starts with defining a class with methods declared as <code>native</code>. Next you generate a C header file that declares the functions that implement them. Then you define these functions in a separate file and compile it into a shared library. We&#8217;ll use Linux and GCC in this example.</p>
<p>Suppose we need to know the name of the terminal device from which the JVM was launched. (A bit of a contrived example since <a title="Runtime.exec with Unix console programs" href="http://jonisalonen.com/2012/runtime-exec-with-unix-console-programs/">you could just use <code>/dev/tty</code></a>.) From C you would do this by calling the POSIX functions <code>isatty</code> and <code>ttyname</code>. Let&#8217;s make a class that gives us access to them:</p>
<pre>package ex;
public class TTYUtil {
    static { System.loadLibrary("ttyutil"); }
    public static native boolean isTTY();
    public static native String getTTYName();
}</pre>
<p>The call to <code>System.loadLibrary</code> in the class initializer looks for a shared library and links it to the JVM. The file name depends on the operating system: on Windows this code would look for <code>ttyutil.dll</code>, on Linux and Solaris <code>libttyutil.so</code>.</p>
<p>The next step is compiling the Java code and generating the C header file:</p>
<pre>$ javac ex/TTYUtil.java
$ javah ex.TTYUtil</pre>
<p>The <code>javah</code> tool will create a C header file called <code>ex_TTYUtil.h</code>. This file contains the declarations for the C functions we have to define:</p>
<pre>JNIEXPORT jboolean JNICALL Java_ex_TTYUtil_isTTY
  (JNIEnv *, jclass);

JNIEXPORT jstring JNICALL Java_ex_TTYUtil_getTTYName
  (JNIEnv *, jclass);</pre>
<p>The idea is that the header file should be generated automatically during the build process so you should not modify it manually. If you need other declarations or <code>#include</code> statements you should use a different file.</p>
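<p>The function names in the header are not arbitrary: JNI derives them from the fully qualified class name and the method name. As a rough sketch of that naming scheme (the class <code>JniNames</code> and its methods are made up for illustration, and overloaded methods, which get an extra signature suffix, are left out):</p>

```java
// Hypothetical helper, not part of the JDK: rebuilds the C function name
// that javah generates for a non-overloaded native method.
final class JniNames {

    // Escape one name the way JNI does: '.' becomes '_', and characters
    // that would be ambiguous in a C identifier get a numbered escape.
    static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c == '.' || c == '/') {
                sb.append('_');
            } else if (c == '_') {
                sb.append("_1");
            } else if (c == ';') {
                sb.append("_2");
            } else if (c == '[') {
                sb.append("_3");
            } else if (c < 128 && Character.isLetterOrDigit(c)) {
                sb.append(c);
            } else {
                // any other character: _0 followed by four lower-case hex digits
                sb.append(String.format("_0%04x", (int) c));
            }
        }
        return sb.toString();
    }

    static String functionName(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    public static void main(String[] args) {
        System.out.println(functionName("ex.TTYUtil", "isTTY"));
        System.out.println(functionName("ex.TTYUtil", "getTTYName"));
    }
}
```

<p>Run against the class above, this prints the same two function names that appear in <code>ex_TTYUtil.h</code>.</p>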
<p>Create <code>ex_TTYUtil.c</code> to define the C functions:</p>
<pre>#include "ex_TTYUtil.h"
#include &lt;unistd.h&gt;

JNIEXPORT jstring JNICALL Java_ex_TTYUtil_getTTYName
  (JNIEnv *env, jclass cls)
{
    char *name = ttyname(STDOUT_FILENO);
    /* ttyname returns NULL when stdout is not a terminal */
    if (name == NULL) return NULL;
    return (*env)-&gt;NewStringUTF(env, name);
}

JNIEXPORT jboolean JNICALL Java_ex_TTYUtil_isTTY
  (JNIEnv *env, jclass cls)
{
    return isatty(STDOUT_FILENO)? JNI_TRUE: JNI_FALSE;
}</pre>
<p>These functions receive the JNI environment object as the first argument. The second argument is the class for static native methods and the object for non-static methods. The rest of the arguments, if any, are the method arguments from Java. The JNI environment is used for interacting with the virtual machine: here, for example, the <code>NewStringUTF</code> function creates a new Java <code>String</code> object from a C string.</p>
<p>To compile and link the C code you can use</p>
<pre>$ gcc -fPIC -c ex_TTYUtil.c -I "$JAVA_HOME/include" -I "$JAVA_HOME/include/linux"
$ gcc ex_TTYUtil.o -shared -o libttyutil.so -Wl,-soname,ttyutil</pre>
<p>Now you should have <code>libttyutil.so</code> in your working directory. Let&#8217;s try using the library.</p>
<pre>import ex.TTYUtil;
public class Test {
    public static void main(String[] args) {
        if (TTYUtil.isTTY()) {
            System.out.println("TTY: "+TTYUtil.getTTYName());
        } else {
            System.out.println("Not a TTY");
        }
    }
}</pre>
<p>Compile this class like you normally would and then run it:</p>
<pre>$ export LD_LIBRARY_PATH=.
$ java Test
TTY: /dev/pts/3</pre>
<p>And what if the output is not connected to a terminal?</p>
<pre>$ java Test | cat
Not a TTY</pre>
<p>The JVM looks for native libraries in the paths specified in the system property <code>java.library.path</code> in addition to what&#8217;s normal for the operating system. Here we made the shared library available to the JVM temporarily by adding the current directory to <code>LD_LIBRARY_PATH</code>. To permanently install a shared library on Linux you would copy the <code>.so</code> to <code>/usr/lib</code> (or any other directory mentioned in <code>/etc/ld.so.conf</code>) and then run <code>ldconfig</code>.</p>
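<p>If <code>System.loadLibrary</code> fails with an <code>UnsatisfiedLinkError</code>, it can help to print what the JVM is actually looking for. A small diagnostic sketch (the class name <code>LibraryPath</code> is made up for this example); it relies only on <code>System.mapLibraryName</code> and the <code>java.library.path</code> property:</p>

```java
import java.io.File;

// Illustrative helper: shows the file name and directories the JVM uses
// when resolving System.loadLibrary("ttyutil").
final class LibraryPath {

    // Platform-specific file name, e.g. a "lib" prefix and ".so" suffix on Linux.
    static String fileNameFor(String library) {
        return System.mapLibraryName(library);
    }

    // Directories listed in java.library.path, in search order.
    static String[] searchDirs() {
        String path = System.getProperty("java.library.path", "");
        return path.split(File.pathSeparator);
    }

    public static void main(String[] args) {
        System.out.println("Looking for " + fileNameFor("ttyutil") + " in:");
        for (String dir : searchDirs()) {
            System.out.println("  " + dir);
        }
    }
}
```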
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/calling-c-from-java-is-easy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Runtime.exec with Unix console programs</title>
		<link>http://jonisalonen.com/2012/runtime-exec-with-unix-console-programs/</link>
		<comments>http://jonisalonen.com/2012/runtime-exec-with-unix-console-programs/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 07:00:36 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Console]]></category>
		<category><![CDATA[Linux tips]]></category>
		<category><![CDATA[Runtime.exec]]></category>
		<category><![CDATA[Sample code]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=274</guid>
		<description><![CDATA[Ever wanted to launch less or vi from a console Java program to show or edit a file, only to find that they won&#8217;t work like when you launch them from the terminal? The problem is that these programs need to communicate with a TTY (teletypewriter) device to find the screen size and to be [...]]]></description>
			<content:encoded><![CDATA[<p>Ever wanted to launch <code>less</code> or <code>vi</code> from a console Java program to show or edit a file, only to find that they won&#8217;t work like when you launch them from the terminal?</p>
<p>The problem is that these programs need to communicate with a TTY (teletypewriter) device to find the screen size and to be able to write anywhere on the terminal window. You don&#8217;t notice this when using them from a terminal because the shell sets up their stdin and stdout so they are connected to the terminal.</p>
<p>When you launch a program from Java using <code>Runtime.exec</code> the <code>stdin</code> and <code>stdout</code> are connected to <em>pipes</em> handled by the JVM, not to a TTY device: it is as if you tried to launch less with something like <code>less file.txt &lt;jvm.in &gt;jvm.out</code>. Needless to say that wouldn&#8217;t work even from a terminal.</p>
<p>What you can do is redirect the <code>stdin</code> and <code>stdout</code> streams <em>back</em> to the original terminal device. To find the actual TTY device we would have to call the POSIX <code>ttyname</code> function with JNI, but luckily that&#8217;s not necessary: we can use <code>/dev/tty</code>, which is the <a href="http://tldp.org/HOWTO/Text-Terminal-HOWTO-7.html#ss7.3">controlling terminal for the current process</a>.</p>
<p>An interesting application of this is to use <code>less</code> as a pager to show lengthy messages to a user, like database result sets:</p>
<pre>
import java.io.OutputStream;

public class Test {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] {"sh", "-c",
                "less &gt;/dev/tty"});
        OutputStream out = p.getOutputStream();
        out.write("Lengthy message".getBytes());
        out.close();
        System.out.println("=&gt; "+p.waitFor());
    }
}</pre>
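<p>On Java 7 and later there is an alternative that the example above does not use: <code>ProcessBuilder.inheritIO()</code> lets the child process share the JVM&#8217;s own stdin, stdout and stderr instead of pipes. This only helps when the program should talk to the terminal directly, not when Java needs to feed it data as above. A minimal sketch, assuming the JVM itself was started from a terminal (the class name and the file path are placeholders):</p>

```java
import java.io.IOException;

// Sketch: run a console program with the JVM's own stdin/stdout/stderr.
final class InheritIoDemo {

    // Starts the command, waits for it to finish and returns its exit code.
    static int run(String... command) {
        try {
            Process p = new ProcessBuilder(command).inheritIO().start();
            return p.waitFor();
        } catch (IOException e) {
            throw new IllegalStateException("could not start " + command[0], e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder file; going through sh mirrors the example above.
        System.out.println("=> " + run("sh", "-c", "less /etc/passwd"));
    }
}
```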
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/runtime-exec-with-unix-console-programs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Software Design Maxims</title>
		<link>http://jonisalonen.com/2012/maxims/</link>
		<comments>http://jonisalonen.com/2012/maxims/#comments</comments>
		<pubDate>Sun, 05 Feb 2012 11:00:00 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[DRY]]></category>
		<category><![CDATA[YAGNI]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=242</guid>
		<description><![CDATA[max·im [mak-sim] noun (now rare) A self-evident axiom or premise; a pithy expression of a general principle or rule. A precept; a succinct statement or observation of a rule of conduct or moral teaching. These are my maxims when designing software. DONE is better. A design can never be perfect; you&#8217;ll always think of a way to make it better. A design should be just good enough to ship, rather than delay the release by months to [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><strong>max·im </strong>[mak-sim] noun</p>
<ol>
<li>(now rare) A self-evident axiom or premise; a pithy expression of a general principle or rule.</li>
<li>A precept; a succinct statement or observation of a rule of conduct or moral teaching.</li>
</ol>
</blockquote>
<p>These are my maxims when designing software.</p>
<p><strong>DONE is better.</strong> A design can never be perfect; you&#8217;ll always think of a way to make it better. A design should be <em>just good enough</em> to ship, rather than delay the release by months to make everything perfect. Consistency can be sacrificed for simplicity temporarily; you can (and should) refactor later.</p>
<p>A related concept is <strong>YAGNI (You Aren&#8217;t Gonna Need It).</strong> You think you need to build a frobnicator at some point for the project? Don&#8217;t start working on it yet: There&#8217;s a good chance you won&#8217;t need it, or you&#8217;ll find one that someone has already made. Don&#8217;t build the framework for the bug tracker for the project before creating something that can be shipped.</p>
<p><strong>DRY (Don&#8217;t Repeat Yourself)</strong> at every level. Information should have only one source. Functionality and control structures should not be repeated. Create abstractions and use ideas from other programming paradigms. In the spirit of YAGNI apply the <em>rule of three</em>: don&#8217;t spend an enormous effort on refactoring repetition until you see it 3 times.</p>
<p><strong>Conservation of complexity</strong>, or<strong> &#8220;no silver bullet&#8221;:</strong> You can move complexity around but you can never reduce it beyond the level inherent to the problem domain. Good design is about deciding how to break complexity into reasonable chunks and distributing it between components: You can make the processing module simpler by moving some logic to the I/O modules. You can make the code simpler by moving complexity to the type system. You can make the interface simpler by making the code more complex. You decide what&#8217;s better.</p>
<p><strong>Think about usability.</strong> Even if you are creating a library that interfaces with a banking application, far removed from the end user with a monitor and a keyboard, you have a user: the person writing code that uses your library. Arrange the design so that it&#8217;s possible to create a reasonable user interface, whether it&#8217;s a GUI or an API.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/maxims/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Can your interpreter do this?</title>
		<link>http://jonisalonen.com/2012/can-your-interpreter-do-this/</link>
		<comments>http://jonisalonen.com/2012/can-your-interpreter-do-this/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 11:00:49 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[Hotspot]]></category>
		<category><![CDATA[JIT]]></category>
		<category><![CDATA[OSR]]></category>
		<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=234</guid>
		<description><![CDATA[Consider this code: Object obj = new Object(); WeakReference&#60;Object&#62; ref = new WeakReference&#60;Object&#62;(obj); List&#60;byte[]&#62; filler = new LinkedList&#60;byte[]&#62;(); while (ref.get() != null) { filler.add(new byte[1000]); } System.out.println("Filler size " + filler.size()); When executed you would expect it to run out of memory: ref.get() will never return null because the referenced object cannot be garbage collected. [...]]]></description>
			<content:encoded><![CDATA[<p>Consider this code:</p>
<pre>Object obj = new Object();
WeakReference&lt;Object&gt; ref = new WeakReference&lt;Object&gt;(obj);

List&lt;byte[]&gt; filler = new LinkedList&lt;byte[]&gt;();
while (ref.get() != null) {
    filler.add(new byte[1000]);
}
System.out.println("Filler size " + filler.size());</pre>
<p>When executed you would expect it to run out of memory: <code>ref.get()</code> will never return null because the referenced object cannot be garbage collected. The local variable <code>obj</code> still holds a reference to it. The filler increases in size indefinitely, and the program will ultimately crash with an OutOfMemoryError. </p>
<p>What really happens is this:</p>
<pre>$ java test
Filler size 28186</pre>
<p>Wait, what?!</p>
<p>It turns out that the Hotspot virtual machine analyzes the code, sees that the variable <code>obj</code> is not used after the <code>while</code>-loop, and rewrites the method <em>while it is running</em> so that the local variable is effectively removed. If you added something like <code>print(obj)</code> after the loop you would get the expected OOME. </p>
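<p>The flip side is that an object stays reachable as long as the code still uses the variable later on. A small sketch of that rule (the class <code>KeepAlive</code> is invented for this example, and <code>System.gc()</code> is only a request that the VM may ignore):</p>

```java
import java.lang.ref.WeakReference;

// Sketch: a local variable that is still used later keeps its object alive.
final class KeepAlive {

    // Returns { what the weak reference saw after the GC request, the object itself }.
    static Object[] probe() {
        Object obj = new Object();
        WeakReference<Object> ref = new WeakReference<Object>(obj);

        System.gc(); // a hint only; the VM decides whether to collect

        Object seen = ref.get();
        // obj is used here, after the GC request, so it had to stay reachable.
        return new Object[] { seen, obj };
    }

    public static void main(String[] args) {
        Object[] result = probe();
        System.out.println("still reachable: " + (result[0] == result[1]));
    }
}
```

<p>Because <code>obj</code> is returned at the end, the weak reference cannot be cleared before <code>ref.get()</code> is called. Remove that later use and the outcome depends on what the compiler decides, as in the example above.</p>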
<p>The behaviour is kind of fickle: the thresholds for compiling code depend on the VM options. I haven&#8217;t been able to reproduce this with the server VM for example. You also get an OOME if you make the filler grow faster by adding bigger arrays of bytes: you have to give the compiler enough loop iterations to be triggered.</p>
<p>This is called <strong>OSR (On Stack Replacement)</strong> compilation in the Hotspot VM. Quoting Kris Mok in <a href="https://gist.github.com/1165804#file_notes.md">About Printcompilation</a>:</p>
<blockquote><p>OSR in HotSpot is used to help improve performance of Java methods stuck in loops [6]. Without OSR, a method running in the interpreter can&#8217;t transfer to its compiled version even if there is one available, until the next time this method is invoked. With OSR, though, a Java method with long-running loops can run in the interpreter, trigger an OSR compilation in one of its loops, keep running in the interpreter until the compilation completes, and jump right into the compiled version without having to wait for &#8220;the next invocation&#8221;.</p></blockquote>
<p>The process of &#8220;jumping right into the compiled version&#8221; sounds simple but in reality is anything but. The new method body does not start from the beginning but from the &#8220;back edge&#8221; of the running loop. The stack frame created by the interpreter is replaced by the one created by the JIT compiler. It is this process that is capable of removing local variables from the method.</p>
<p>You can see when OSR happens by using the PrintCompilation flag:</p>
<pre>$ java -XX:+PrintCompilation test
    125   1       java.lang.String::hashCode (60 bytes)
    133   2       sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (490 bytes)
    158   3       java.lang.String::charAt (33 bytes)
    159   4       java.lang.String::indexOf (151 bytes)
    174   5       java.lang.Object::&lt;init&gt; (1 bytes)
    182   6       java.util.LinkedList$Entry::&lt;init&gt; (20 bytes)
    183   7       java.util.LinkedList::add (12 bytes)
    185   8       java.util.LinkedList::addBefore (52 bytes)
    348   1%      test::main @ 25 (78 bytes)
Filler size 28082</pre>
<p>The <code>%</code>-flag in the second column tells us that OSR compilation happened.</p>
<h3>Implications</h3>
<p>This optimization has a big implication for when finalizers are run. Since <code>obj</code> is a local variable you would expect that its finalizer would not be called until the method returns at the earliest. But since the object is GC&#8217;d before the method ends, it is finalized as well. This special case is also noted in <a href="http://java.sun.com/docs/books/jls/third_edition/html/execution.html#12.6.1">JLS 12.6.1</a>:</p>
<blockquote><p>A reachable object is any object that can be accessed in any potential continuing computation from any live thread. Optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable. For example, a compiler or code generator may choose to set a variable or parameter that will no longer be used to null to cause the storage for such an object to be potentially reclaimable sooner.</p></blockquote>
<p>The lesson is <strong>you can&#8217;t depend in <em>any</em> way on when objects are finalized</strong>.</p>
<p>On-Stack-Replacement also affects performance in unexpected ways. The <a href="http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good">Azul Systems Blog</a> has a very good post on this subject.</p>
<p>(Inspired by the Stackoverflow question <a href="http://stackoverflow.com/questions/8818424/are-weakhashmap-cleared-during-a-full-gc">Are WeakHashMap Cleared During A Full GC?</a> Thanks to berry120 and jalopaba for contributing detailed answers. Tests run on OpenJDK 1.6.0_23.)</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/can-your-interpreter-do-this/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL Collations in practice</title>
		<link>http://jonisalonen.com/2012/mysql-collations-in-practice/</link>
		<comments>http://jonisalonen.com/2012/mysql-collations-in-practice/#comments</comments>
		<pubDate>Tue, 31 Jan 2012 08:00:55 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[I18N]]></category>
		<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=254</guid>
		<description><![CDATA[Collations are important for how text is ordered by indices and ORDER BY clauses; in fact a collation defines how they work. Sadly developers seem to often ignore collations in data models, and the database default settings are used. If you are using MySQL and haven&#8217;t configured any particular collation, your database is probably sorted according [...]]]></description>
			<content:encoded><![CDATA[<p>Collations are important for how text is ordered by indices and ORDER BY clauses; in fact a collation <strong>defines</strong> how they work. Sadly developers seem to often ignore collations in data models, and the database default settings are used. If you are using MySQL and haven&#8217;t configured any particular collation, <strong>your database is probably sorted according to Swedish</strong>.</p>
<p>Let&#8217;s check what MySQL collations mean in practice. First create a table with three fields, each with the same data but a different collation:</p>
<pre>CREATE TABLE test (
  en varchar(3) CHARACTER SET utf8 COLLATE utf8_general_ci,
  se varchar(3) CHARACTER SET utf8 COLLATE utf8_swedish_ci,
  es varchar(3) CHARACTER SET utf8 COLLATE utf8_spanish_ci
);</pre>
<p>Then let&#8217;s fill the columns with some data:</p>
<pre>INSERT INTO test (en, se, es) VALUES ('A','A','A');
INSERT INTO test (en, se, es) VALUES ('Ä','Ä','Ä');
INSERT INTO test (en, se, es) VALUES ('N','N','N');
INSERT INTO test (en, se, es) VALUES ('Ñ','Ñ','Ñ');
INSERT INTO test (en, se, es) VALUES ('Z','Z','Z');
INSERT INTO test (en, se, es) VALUES ('Ö','Ö','Ö');
INSERT INTO test (en, se, es) VALUES ('Nz','Nz','Nz');
INSERT INTO test (en, se, es) VALUES ('Az','Az','Az');</pre>
<p>Now comes the fun part. In English Ä=A, N=Ñ, and O=Ö so we get this ordering:</p>
<pre>SELECT en FROM test ORDER BY en;
-- Result: A Ä Az N Ñ Nz Ö Z</pre>
<p>In Spanish N and Ñ are different letters so we get this ordering:</p>
<pre>SELECT es FROM test ORDER BY es;
-- Result: A Ä Az N Nz Ñ Ö Z</pre>
<p>In Swedish Ä and Ö are not only separate from A and O; they are the last letters of the alphabet:</p>
<pre>SELECT se FROM test ORDER BY se;
-- Result: A Az N Ñ Nz Z Ä Ö</pre>
<p>Ok, that&#8217;s it for ORDER BY. What about indices?</p>
<pre>mysql&gt; ALTER TABLE test ADD UNIQUE INDEX idx_en (en);
ERROR 1062 (23000): Duplicate entry 'Ñ' for key 'idx_en'</pre>
<p>Makes sense, N and Ñ are the same according to English! With Swedish you would get the same error. What about Spanish?</p>
<pre>mysql&gt; ALTER TABLE test ADD UNIQUE INDEX idx_es (es);
ERROR 1062 (23000): Duplicate entry 'A' for key 'idx_es'</pre>
<p>Here N and Ñ are different, but A and Ä are still considered the same.</p>
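<p>The orderings and duplicate-key errors above can be reproduced with a toy model of a collation: an ordered list of groups, where letters in the same group count as equal. This is only a sketch in Java, not how MySQL implements collations; the class <code>ToyCollation</code> and the three group lists are assumptions chosen to match the results shown above, and ties are simply kept in insertion order.</p>

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model of a collation, not MySQL's real algorithm: an ordered list of
// groups where letters in the same group compare as equal.
final class ToyCollation implements Comparator<String> {

    // Escaped so that the source file stays ASCII-only.
    static final String A_UML = "\u00C4";   // A with diaeresis
    static final String N_TILDE = "\u00D1"; // N with tilde
    static final String O_UML = "\u00D6";   // O with diaeresis

    private final Map<Character, Integer> weight = new HashMap<Character, Integer>();

    ToyCollation(String... groups) {
        for (int i = 0; i < groups.length; i++) {
            for (char c : groups[i].toCharArray()) {
                weight.put(c, i);
            }
        }
    }

    private int weightOf(char c) {
        Integer w = weight.get(c);
        return w == null ? Integer.MAX_VALUE : w; // unknown letters sort last
    }

    @Override
    public int compare(String a, String b) {
        String x = a.toUpperCase(Locale.ROOT);
        String y = b.toUpperCase(Locale.ROOT);
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            int d = Integer.compare(weightOf(x.charAt(i)), weightOf(y.charAt(i)));
            if (d != 0) return d;
        }
        return Integer.compare(x.length(), y.length()); // a prefix sorts first
    }

    // Sorts the rows of the test table, in insertion order, with this collation.
    String sortedRows() {
        String[] rows = {"A", A_UML, "N", N_TILDE, "Z", O_UML, "Nz", "Az"};
        Arrays.sort(rows, this); // stable, so equal rows keep insertion order
        return String.join(" ", rows);
    }

    static final ToyCollation GENERAL = new ToyCollation("A" + A_UML, "N" + N_TILDE, "O" + O_UML, "Z");
    static final ToyCollation SPANISH = new ToyCollation("A" + A_UML, "N", N_TILDE, "O" + O_UML, "Z");
    static final ToyCollation SWEDISH = new ToyCollation("A", "N" + N_TILDE, "O", "Z", A_UML, O_UML);

    public static void main(String[] args) {
        System.out.println("general: " + GENERAL.sortedRows());
        System.out.println("spanish: " + SPANISH.sortedRows());
        System.out.println("swedish: " + SWEDISH.sortedRows());
        // compare(...) == 0 is what a unique index would treat as a duplicate
        System.out.println("N equals N-tilde in general? " + (GENERAL.compare("N", N_TILDE) == 0));
    }
}
```

<p>The point of the model is that sorting and uniqueness come from the same comparison: whenever <code>compare</code> returns 0 the two values tie in <code>ORDER BY</code> and collide in a unique index.</p>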
<p>The default collation on many installations is Swedish because MySQL AB, the company that created the database, is a Swedish company. Next time you create a table don&#8217;t forget to set the collation to something that makes sense for your users.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/mysql-collations-in-practice/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Ultimate Guide To UTF-8 and MySQL</title>
		<link>http://jonisalonen.com/2012/ultimate-guide-to-utf8-and-mysql/</link>
		<comments>http://jonisalonen.com/2012/ultimate-guide-to-utf8-and-mysql/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 21:02:31 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[I18N]]></category>
		<category><![CDATA[latin1]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=232</guid>
		<description><![CDATA[How character encodings work in MySQL seem to continue baffle people, at least based on the number of questions posted on Stackoverflow. Read on if you have an application that still outputs funny text like &#8220;seÃ±or&#8221; or &#8220;se�or&#8221; where &#8220;señor&#8221; is expected. The truth about text in MySQL Strings in MySQL consist of an encoding [...]]]></description>
			<content:encoded><![CDATA[<p>How character encodings work in MySQL seems to continue to baffle people, at least based on the number of questions posted on Stackoverflow. Read on if you have an application that still outputs funny text like &#8220;seÃ±or&#8221; or &#8220;se�or&#8221; where &#8220;señor&#8221; is expected.</p>
<h3>The truth about text in MySQL</h3>
<p>Strings in MySQL consist of an encoding marker and a list of bytes. The <code>CHARSET</code> function reads the encoding while <code>HEX</code> can be used to read bytes:</p>
<pre>SELECT CHARSET("€"), HEX("€")
=&gt; latin1, 80</pre>
<p>This enables MySQL to use different encodings while working on a query. You can write a query that compares text columns in different encodings and returns the results where the text is the same, rather than the bytes that are stored.</p>
<h3>I have everything set to <code>utf8</code> and still the database gives me garbage!</h3>
<p>That&#8217;s because of the MySQL client encoding. One query can return text in several different encodings, and it&#8217;s unreasonable to expect the user software to check the encoding of each individual string. So all text returned by the MySQL client is converted to a single encoding.</p>
<p><strong>The fast and easy solution to encoding problems: set the MySQL client encoding</strong>.</p>
<p>How this is done depends on what you use to connect. If you work in PHP with the old MySQL functions you should use <code>mysql_set_charset("utf8")</code>. If you have an old version that lacks this function, try <code>mysql_query("SET NAMES utf8")</code>. If you use MySQLi it would be <code>mysqli_set_charset($link, "utf8")</code> (or <code>$link-&gt;set_charset("utf8")</code> if you use the OO API).</p>
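<p>The mechanism behind the garbage can be reproduced outside MySQL. The following Python sketch is an illustration only: it assumes the stored bytes are UTF-8 and uses Python&#8217;s <code>cp1252</code> codec as a stand-in for what MySQL calls latin1. The variable names are made up for the example.</p>

```python
# Text as the application intends it.
original = "señor"

# Case 1: the bytes are UTF-8, but the reader assumes cp1252.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")

# Case 2: the bytes are cp1252, but the reader assumes UTF-8.
# The lone byte 0xF1 is not valid UTF-8, so it becomes U+FFFD.
cp1252_bytes = original.encode("cp1252")
replaced = cp1252_bytes.decode("utf-8", errors="replace")

print(garbled)
print(replaced)
```

<p>Both kinds of funny text come from the same cause: the reader and the writer disagree about which encoding the bytes are in.</p>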
<h3>What are the encodings in the tables good for, then?</h3>
<p>When you create a table you can specify a <strong>character encoding</strong> for a column. It determines how the MySQL server converts text like &#8220;abc123&#8221; to bytes that can be written to disk. The encoding affects two things:</p>
<ol>
<li>the <strong>range of characters</strong> that can be stored,  and</li>
<li>the <strong>number of bytes </strong>the text will occupy on disk.</li>
</ol>
<div>Popular encodings for storage include latin1 and utf8:</div>
<ul>
<li><strong>latin1</strong>: the characters that you can store cover most European languages, and the text occupies 1 byte per character.</li>
<li><strong>utf8</strong>: you can store any(*) Unicode characters, and the text occupies from 1 to 3 bytes per character, depending on the character.</li>
</ul>
<p>Other encodings exist, of course. For example <strong>latin2</strong> has characters used in Eastern European languages. If you use <strong>ucs2</strong> you can store any Unicode characters and the text occupies 2 bytes per character. If you are storing a lot of text in Japanese you may want to use ucs2 instead of utf8, because Japanese characters take 3 bytes each in utf8 and ucs2 saves a third of the storage.</p>
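<p>The storage sizes can be checked with a few lines of Python. This is a sketch under assumptions: Python&#8217;s <code>cp1252</code>, <code>utf-8</code> and <code>utf-16-be</code> codecs stand in for MySQL&#8217;s latin1, utf8 and ucs2, and <code>stored_size</code> is a hypothetical helper, not a MySQL function.</p>

```python
def stored_size(text, codec):
    """Return the encoded length in bytes, or None if the codec
    cannot represent every character of the text."""
    try:
        return len(text.encode(codec))
    except UnicodeEncodeError:
        return None

samples = ["abc123", "señor", "日本語"]
codecs = ["cp1252", "utf-8", "utf-16-be"]

for text in samples:
    sizes = [stored_size(text, codec) for codec in codecs]
    print(text, dict(zip(codecs, sizes)))
```

<p>The three Japanese characters need 9 bytes in UTF-8 but only 6 with two bytes per character, which is where the saving of a third comes from.</p>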
<p>You can also specify a <strong>collation</strong> like <code>utf8_general_ci</code> for a column. It determines how MySQL will compare and <strong>sort</strong> the data. This affects <strong>indices</strong> and <strong>ORDER BY</strong> clauses. Cultures have different rules for alphabetic order: for example in Swedish Ä is the second to last letter of the alphabet, while in English it&#8217;s equivalent to A. So with Swedish collation you get a &lt; b &lt; ä and with English collation you get a = ä &lt; b. <em>Use the ordering your users expect.</em> You can read <a title="MySQL Collations in practice" href="http://jonisalonen.com/2012/mysql-collations-in-practice/">more about collations</a> in my other post.</p>
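<p>The difference between the two orderings can be imitated with sort keys. The Python sketch below is a toy model, not how MySQL implements collations: <code>swedish_key</code> and <code>english_key</code> are hypothetical helpers, and the Swedish key only handles the letters listed in <code>SWEDISH_ALPHABET</code>.</p>

```python
import unicodedata

# Lowercase Swedish alphabet: å, ä and ö come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Sort key where each letter is ranked by its alphabet position."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(ch) for ch in normalized]

def english_key(word):
    """Sort key that ignores accents, so ä compares equal to a."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["b", "ä", "a"]
swedish_order = sorted(words, key=swedish_key)
english_order = sorted(words, key=english_key)
print(swedish_order)
print(english_order)
```

<p>With the accent-insensitive key, ä and a are ties, so their relative order in the result simply follows the input order.</p>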
<p>What if you don&#8217;t specify an encoding and collation for the column? Then MySQL will use the default specified for the table, the database, or the server.</p>
<h3>Converting encodings</h3>
<p>Strings in MySQL consist of an encoding marker and a list of bytes. To convert strings from one encoding to another you can use the CONVERT(&#8230; USING &#8230;) function.  What CONVERT actually does is change both the marker and the bytes so that the resulting string represents the same text as the original, which is not very useful for fixing broken text.</p>
<p>The solution is <em>stripping the encoding marker</em> from the string before passing it to CONVERT. This is done by converting the string to &#8220;binary&#8221; first:</p>
<pre>CONVERT(BINARY broken_text USING utf8)</pre>
<p>I&#8217;ve written a detailed case study on <a title="Fixing mangled characters in MySQL" href="http://jonisalonen.com/2010/mysql-character-encoding/">fixing a particularly complicated MySQL encoding problem</a> in another post.</p>
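<p>The idea of stripping the marker and reinterpreting the raw bytes can be sketched in Python as well. This is only an analogy to <code>CONVERT(BINARY broken_text USING utf8)</code>: it assumes the text is UTF-8 that was wrongly read as cp1252, and <code>repair_mojibake</code> is a hypothetical helper name.</p>

```python
def repair_mojibake(broken, wrong_codec="cp1252", real_codec="utf-8"):
    """Undo a wrong decoding: get the raw bytes back by encoding with the
    codec that was wrongly applied, then decode them with the real one."""
    raw_bytes = broken.encode(wrong_codec)
    return raw_bytes.decode(real_codec)

fixed = repair_mojibake("seÃ±or")
print(fixed)
```

<p>If the text was not actually mangled in the assumed way, either step can raise a <code>UnicodeError</code>, so a repair like this should be tried on a copy of the data first.</p>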
<h3>Practical demonstration</h3>
<p>Let&#8217;s create a table with columns in different encodings and fill it with some data.</p>
<pre>
CREATE TABLE test_encoding (
  utf8   varchar(4) CHARACTER SET utf8,
  latin1 varchar(4) CHARACTER SET latin1,
  latin2 varchar(4) CHARACTER SET latin2
);
INSERT INTO test_encoding (utf8,latin1,latin2) values ('A','A','A');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('Ñ','Ñ','Ñ');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('Ŕ','Ŕ','Ŕ');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('☺','☺','☺');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('€','€','€');
</pre>
<p>When you select the data from this table with <code>select * from test_encoding</code> you get the following results:</p>
<pre>
+------+--------+--------+
| utf8 | latin1 | latin2 |
+------+--------+--------+
| A    | A      | A      |
| Ñ    | Ñ      | ?      |
| Ŕ    | ?      | Ŕ      |
| ☺    | ?      | ?      |
| €    | €      | ?      |
+------+--------+--------+
</pre>
<p>Even though each column is encoded differently everything is still readable. This is thanks to the MySQL client encoding. The characters that could not be encoded are replaced with question marks: there is no Ñ in latin2, nor is there a Ŕ in latin1. The smiley face is not present in either.</p>
<p>The last line is interesting: the euro sign <em>can be encoded in latin1 although technically it&#8217;s not really a latin1 character</em>. This shows that MySQL&#8217;s latin1 is not really the standard ISO-8859-1 encoding, commonly known as &#8220;ISO Latin 1&#8221;. In reality it is <a href="http://en.wikipedia.org/wiki/Windows-1252">Windows-1252</a>, the &#8220;Western European&#8221; encoding that also has curly quotes and other niceties where ISO-8859-1 has unprintable control characters.</p>
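<p>The difference between the two encodings is easy to see in Python, where they are separate codecs. A small sketch, assuming Python&#8217;s <code>cp1252</code> and <code>iso-8859-1</code> codecs correspond to Windows-1252 and ISO-8859-1:</p>

```python
import unicodedata

euro = "€"

# Windows-1252 has a slot for the euro sign.
cp1252_bytes = euro.encode("cp1252")

# ISO-8859-1 does not, so encoding fails.
try:
    euro.encode("iso-8859-1")
    iso_can_encode = True
except UnicodeEncodeError:
    iso_can_encode = False

# In ISO-8859-1 the same byte value is a control character.
iso_char = cp1252_bytes.decode("iso-8859-1")
iso_category = unicodedata.category(iso_char)

print(cp1252_bytes.hex(), iso_can_encode, iso_category)
```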
<p>To see that MySQL really is storing the columns in different encodings you could do this:</p>
<pre>mysql&gt; select utf8,hex(utf8),hex(latin1) from test_encoding;
+------+-----------+-------------+
| utf8 | hex(utf8) | hex(latin1) |
+------+-----------+-------------+
| A    | 41        | 41          |
| Ñ    | C391      | D1          |
| Ŕ    | C594      | 3F          |
| ☺    | E298BA    | 3F          |
| €    | E282AC    | 80          |
+------+-----------+-------------+
</pre>
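<p>The same byte values can be reproduced outside the database. The Python sketch below is an approximation under assumptions: <code>cp1252</code> and <code>iso8859_2</code> stand in for MySQL&#8217;s latin1 and latin2, and <code>errors="replace"</code> imitates the question mark substitution. <code>stored_hex</code> is a hypothetical helper.</p>

```python
def stored_hex(text, codec):
    """Hex dump of the encoded text; characters the codec lacks become '?'."""
    return text.encode(codec, errors="replace").hex().upper()

rows = ["A", "Ñ", "Ŕ", "☺", "€"]
for ch in rows:
    print(ch,
          stored_hex(ch, "utf-8"),
          stored_hex(ch, "cp1252"),
          stored_hex(ch, "iso8859_2"))
```

<p>The value <code>3F</code> is the ASCII code of the question mark, which is how a character that could not be encoded shows up in the hex output.</p>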
<p>What happens when we compare two columns encoded differently?</p>
<pre>
mysql&gt; select utf8,latin1 from test_encoding where utf8=latin1;
+------+--------+
| utf8 | latin1 |
+------+--------+
| A    | A      |
| Ñ    | Ñ      |
| €    | €      |
+------+--------+
</pre>
<p>MySQL knows that the <a href="http://dev.mysql.com/doc/refman/5.6/en/charset-repertoire.html">character repertoire</a> of latin1 is a subset of utf8, so it converts the latin1 column into Unicode so that it can be compared. If however you try to compare latin1 with latin2 this is what happens:</p>
<pre>
mysql&gt; select latin1,latin2 from test_encoding where latin1=latin2;
ERROR 1267 (HY000): Illegal mix of collations (latin1_bin,IMPLICIT) and (latin2_bin,IMPLICIT) for operation '='
</pre>
<p>Since neither latin1 nor latin2 is a subset of the other, MySQL cannot compare the columns. To be able to compare them you have to convert at least one to Unicode, e.g. by using <code>CONVERT(.. USING ..)</code>:</p>
<pre>select latin1, latin2 from test_encoding
where convert(latin1 using utf8)=latin2;</pre>
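<p>Why the bytes cannot be compared directly is easy to show in Python. This is an illustrative sketch with made-up variable names, assuming <code>cp1252</code> and <code>iso8859_2</code> correspond to MySQL&#8217;s latin1 and latin2:</p>

```python
# The same character stored in two encodings has different bytes ...
latin1_value = "Ñ".encode("cp1252")
utf8_value = "Ñ".encode("utf-8")
same_bytes = latin1_value == utf8_value

# ... but once both are decoded to Unicode the text is equal.
same_text = latin1_value.decode("cp1252") == utf8_value.decode("utf-8")

# Conversely, one byte value means different characters in latin1 and latin2.
as_latin1 = b"\xd1".decode("cp1252")
as_latin2 = b"\xd1".decode("iso8859_2")

print(same_bytes, same_text, as_latin1, as_latin2)
```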
<p>I hope this information is enough to solve any character encoding problems you may have with MySQL once and for all.</p>
<p>(*) Did I say <code>utf8</code> can store <em>any</em> Unicode characters? That&#8217;s not really true. MySQL&#8217;s utf8 and ucs2 only store Unicode characters from the basic multilingual plane (BMP), which includes &#8220;only&#8221; the characters U+0000 through U+FFFF. If you have MySQL 5.5 or later you have additional encodings to choose from: utf16, utf32, utf8mb4. These encodings can also store characters outside the BMP. You can read the details, as usual, in the <a href="http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html">official MySQL documentation</a>.</p>
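<p>Whether a piece of text stays within the BMP can be checked before it reaches the database. A minimal Python sketch, where <code>fits_in_bmp</code> is a hypothetical helper and the emoji is just an example of a character beyond U+FFFF:</p>

```python
def fits_in_bmp(text):
    """True if every character is in the range U+0000 through U+FFFF."""
    return all(ord(ch) in range(0x10000) for ch in text)

smiley = "☺"      # U+263A, inside the BMP
emoji = "😀"       # U+1F600, outside the BMP

print(fits_in_bmp(smiley), len(smiley.encode("utf-8")))
print(fits_in_bmp(emoji), len(emoji.encode("utf-8")))
```

<p>Characters outside the BMP take 4 bytes in UTF-8, which is one more than the 3-byte maximum described for MySQL&#8217;s utf8 above.</p>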
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/ultimate-guide-to-utf8-and-mysql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Grouping data</title>
		<link>http://jonisalonen.com/2012/grouping-data/</link>
		<comments>http://jonisalonen.com/2012/grouping-data/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 08:19:59 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Idioms]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=223</guid>
		<description><![CDATA[Suppose you are faced with the task of taking a list of things and grouping them by some attribute, like turning the following list of country-city pairs&#8230; input=[("USA", "New York"), ("USA", "San Francisco"), ("USA", "Los Angeles"), ("UK", "London"), ("UK", "Manchester"), ("UK", "Edinburgh"), ("Spain", "Madrid"), ("Spain", "Barcelona"), ("Spain", "Granada"), ("Finland", "Helsinki"), ("Finland", "Oulu"), ("Finland", "Kuopio")] into lists of [...]]]></description>
			<content:encoded><![CDATA[<p>Suppose you are faced with the task of taking a list of things and grouping them by some attribute, like turning the following list of country-city pairs&#8230;</p>
<pre>input=[("USA", "New York"), ("USA", "San Francisco"), ("USA", "Los Angeles"),
      ("UK", "London"), ("UK", "Manchester"), ("UK", "Edinburgh"),
      ("Spain", "Madrid"), ("Spain", "Barcelona"), ("Spain", "Granada"),
      ("Finland", "Helsinki"), ("Finland", "Oulu"), ("Finland", "Kuopio")]</pre>
<p>into lists of cities grouped by country like this:</p>
<pre>output=[("USA", ["New York", "San Francisco", "Los Angeles"]),
        ("UK", ["London", "Manchester", "Edinburgh"]),
        ("Spain", ["Madrid", "Barcelona", "Granada"]),
        ("Finland", ["Helsinki", "Oulu", "Kuopio"])]</pre>
<p>Also suppose that the application that you are working on is about preparing data for reports and that a lot of the application will be dedicated to creating such groupings. Often the data is accumulated by summing numbers instead of building lists. Sometimes groups have to be divided into subgroups, like regions within countries. You had better have some kind of a generic pattern for this grouping algorithm instead of inventing a new way of doing it each time, right?</p>
<p>One of the worst ways of doing it I have seen is this:</p>
<pre>country = None
isFirst = True
output  = []
cities = None

if len(input) &gt; 0:
    for countryCity in input:
        if country != countryCity[0]:
            if not isFirst:
                output.append((country, cities))

            isFirst = False
            country = countryCity[0]
            cities = [countryCity[1]]
        else:
            cities.append(countryCity[1])

    output.append((country, cities))</pre>
<p>It gets the job done and didn&#8217;t take more than 5 minutes to write, but it&#8217;s brittle. The group is closed in two different places. It&#8217;s not immediately obvious what to change to group by multiple levels. If the processing logic were any longer, doing this would be like leaving a time bomb in the code base.</p>
<p>What I have found clearest in these cases is a nested loop like this:</p>
<pre>while there is more input:
    open new group
    while there is more input and we're in the group:
        accumulate data for the group
        get next input
    close the group</pre>
<p>An example in Python for the problem at hand would be something like this:</p>
<pre>output = []
inputIter = iter(input)
countryCity = next(inputIter, None)
while countryCity is not None:
    # open new group
    country = countryCity[0]
    cities = []

    while countryCity is not None and country == countryCity[0]:
        # accumulate data for group
        cities.append(countryCity[1])
        countryCity = next(inputIter, None)

    # close group
    output.append((country, cities))</pre>
<p>Here it&#8217;s pretty clear where a group is opened, where the data is accumulated, and where it is closed. Also it&#8217;s clear what to do if a second level of grouping is needed: the loop that accumulates data has to have this same structure inside. For example:</p>
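<pre># input now holds (country, region, city) triples
output = []
inputIter = iter(input)
countryCity = next(inputIter, None)
while countryCity is not None:
    country = countryCity[0]
    regions = []
    # group by country
    while countryCity is not None and country == countryCity[0]:
        region = countryCity[1]
        cities = []
        # group by region, staying within the current country
        while (countryCity is not None and country == countryCity[0]
               and region == countryCity[1]):
            cities.append(countryCity[2])
            countryCity = next(inputIter, None)
        regions.append((region, cities))
    output.append((country, regions))</pre>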
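<p>For comparison, Python&#8217;s standard library has <code>itertools.groupby</code>, which follows the same idea of consuming runs of consecutive items with an equal key. The sketch below is an alternative to the hand-written loops, not the pattern described above; the names <code>pairs</code> and <code>group_cities</code> are made up for the example, and like the loops it assumes the input is already ordered by country.</p>

```python
from itertools import groupby
from operator import itemgetter

pairs = [("USA", "New York"), ("USA", "San Francisco"), ("USA", "Los Angeles"),
         ("UK", "London"), ("UK", "Manchester"), ("UK", "Edinburgh"),
         ("Spain", "Madrid"), ("Spain", "Barcelona"), ("Spain", "Granada"),
         ("Finland", "Helsinki"), ("Finland", "Oulu"), ("Finland", "Kuopio")]

def group_cities(pairs):
    """Group consecutive (country, city) pairs into (country, [cities])."""
    return [(country, [city for _, city in group])
            for country, group in groupby(pairs, key=itemgetter(0))]

output = group_cities(pairs)
for country, cities in output:
    print(country, cities)
```

<p>The explicit nested loop is still useful when opening or closing a group involves more work than building a list, because each step has an obvious place in the code.</p>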
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/grouping-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Crazy fast AJAX search suggest in CakePHP</title>
		<link>http://jonisalonen.com/2011/crazy-fast-ajax-search-suggest-in-cakephp-using-browser-cache/</link>
		<comments>http://jonisalonen.com/2011/crazy-fast-ajax-search-suggest-in-cakephp-using-browser-cache/#comments</comments>
		<pubDate>Wed, 02 Mar 2011 22:10:35 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[AJAX]]></category>
		<category><![CDATA[Autocomplete]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Sample code]]></category>
		<category><![CDATA[Scriptaculous]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=190</guid>
		<description><![CDATA[The fastest AJAX call is the one that is never made. Network latency makes using AJAX applications like Google instant search, or even the common autocomplete field, feel like running in treacle*. Having the browser run to the server for more data every time the user pushes a key just doesn&#8217;t work well. Many times [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The fastest AJAX call is the one that is never made.</strong></p>
<p>Network latency makes using AJAX applications like Google instant search, or even the common autocomplete field, feel like running in treacle<a href="#footnote1">*</a>. Having the browser run to the server for more data every time the user pushes a key just doesn&#8217;t work well. Many times the data doesn&#8217;t even change that often, so one solution is to let the browser cache it. This is how you do it in <a href="http://cakephp.org">CakePHP</a>.</p>
<p>One way to implement the search suggest feature is with the Scriptaculous AJAX autocomplete field. While you type it sends frequent requests to the server. In a high-latency setting you may have to wait seconds to obtain the response. Worse still, sometimes the responses arrive out of order.</p>
<p>You get much better responsiveness if you preload the data and <strong>use <a href="http://madrobby.github.com/scriptaculous/autocompleter-local/">Autocompleter.Local</a></strong>. It works a little differently from AJAX autocomplete in that the data has to be a JavaScript array. If you create a controller action /search/suggestData that returns all data as a JSON array, you can load the &#8220;local&#8221; autocomplete data with AJAX. So you could add this to your default layout:</p>
<pre>&lt;input id="search" autocomplete="off"/&gt;
&lt;div id="search_suggest" style="display:none;"&gt;&lt;/div&gt;
&lt;script type="text/javascript"&gt;
new Ajax.Request('&lt;?php echo $html-&gt;url('/search/suggestData') ?&gt;', {
    method: 'get', // Important! Only GET requests are cached.
    onSuccess: function(response) {
        var suggestData = response.responseJSON;
        // The first argument must be the id of the text field defined above.
        new Autocompleter.Local('search', 'search_suggest', suggestData, { });
    }
});
&lt;/script&gt;</pre>
<p>But that&#8217;s not the full story. A search suggest feature should have thousands of suggestions to be useful. Adding all of that for <em>every single page view</em> amounts to a significant increase in bandwidth use. Since the data doesn&#8217;t change that often, you can make the browser store it in its local cache.</p>
<p>Basically all you need is the <code>Expires</code> HTTP header, and you have to <strong>make sure you make GET requests</strong> because POST requests are not cached. The next time the call is made the browser will serve it from disk and it will be ultra fast. For example, to generate the JSON necessary for the search suggest feature you could use this controller action and view:</p>
<pre>// app/controllers/search_controller.php
class SearchController extends AppController {
    function suggestData() {
        // HTTP dates must be expressed in GMT, e.g. "Sat, 03 Mar 2011 22:10:35 GMT"
        header('Expires: '.gmdate('D, d M Y H:i:s', strtotime('+1 day')).' GMT');
        header('Content-type: application/json; charset=utf-8');
        // The top 5000 search words as an array()
        $this-&gt;set('data', $this-&gt;Search-&gt;top5kSearches());
    }
}

// app/views/search/suggestData.ctp
&lt;?php echo json_encode($data) ?&gt;</pre>
<p>If the data doesn&#8217;t change at all you can set the cache expiry date a lot further into the future. Here it&#8217;s set to 1 day.</p>
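<p>The value of the <code>Expires</code> header is just a timestamp in the HTTP date format, expressed in GMT. As an illustration outside CakePHP, here is a Python sketch that builds such a header line; <code>expires_header</code> is a hypothetical helper, and the optional <code>now</code> argument exists only to make the result reproducible.</p>

```python
import time
from email.utils import formatdate

ONE_DAY = 24 * 60 * 60  # cache lifetime in seconds

def expires_header(seconds_from_now, now=None):
    """Build an Expires header line for a moment in the future."""
    if now is None:
        now = time.time()
    # usegmt=True produces the "... GMT" form that HTTP expects.
    return "Expires: " + formatdate(now + seconds_from_now, usegmt=True)

print(expires_header(ONE_DAY))
```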
<p>How many search suggestions should you include, then? As many as possible, but bear in mind that Autocompleter.Local finds its matches by doing a <em>linear search</em> through the data array. With 5000 entries in the array you still get decent performance. It&#8217;s not hard to replace the search logic to use e.g. a binary search though.</p>
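<p>To show what a binary search could look like, here is a sketch in Python rather than in the JavaScript the autocompleter runs. It makes two assumptions that are not part of Scriptaculous: the suggestions are kept sorted, and only matches at the beginning of a suggestion count. <code>prefix_matches</code> is a hypothetical function name.</p>

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`.

    Binary search finds the first candidate; because the list is
    sorted, all matches follow it directly.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if len(matches) == limit or not term.startswith(prefix):
            break
        matches.append(term)
    return matches

terms = sorted(["madrid", "london", "los angeles", "helsinki", "barcelona"])
print(prefix_matches(terms, "lo"))
print(prefix_matches(terms, "x"))
```

<p>Finding the starting point takes a number of steps proportional to the logarithm of the list size instead of a scan of the whole array, so the cost stays low even with many more suggestions.</p>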
<p>I have observed this trick to shorten response times from several seconds to about zero for users that have a slow link to the web server. It makes a world of difference for autocomplete fields.</p>
<p><a name="footnote1"></a>*) Oh, by the way. Apparently it&#8217;s easier to swim in treacle than in water. Go figure.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2011/crazy-fast-ajax-search-suggest-in-cakephp-using-browser-cache/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

