<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>appsintheopen.com articles</title>
    <link>https://appsintheopen.com/posts</link>
    <description>Apps In The Open blog articles</description>
    <item>
      <title>tailwindcss core dump on Ampere Arm</title>
      <link>https://appsintheopen.com/posts/68-tailwindcss-core-dump-on-ampere-arm</link>
      <description>
<![CDATA[<p>I was recently deploying a Rails app on an Ampere Arm instance that used Tailwind 4.1.18. During the Rails assets precompile build step, tailwindcss failed with:</p>

<pre><code>Command failed with SIGABRT (signal 6) (core dumped): /usr/local/bundle/gems/tailwindcss-ruby-4.1.18-aarch64-linux-gnu/exe/aarch64-linux-gnu/tailwindcss
#16 1.394 /usr/local/bundle/gems/tailwindcss-rails-4.4.0/lib/tasks/build.rake:11:in &#39;Kernel#system&#39;
#16 1.394 /usr/local/bundle/gems/tailwindcss-rails-4.4.0/lib/tasks/build.rake:11:in &#39;block (2 levels) in &lt;main&gt;&#39;
#16 1.394 /usr/local/lib/ruby/gems/4.0.0/gems/rake-13.3.1/lib/rake/task.rb:281:in &#39;block in Rake::Task#execute&#39;
#16 1.394 /usr/local/lib/ruby/gems/4.0.0/gems/rake-13.3.1/lib/rake/task.rb:281:in &#39;Array#each&#39;
#16 1.394 /usr/local/lib/ruby/gems/4.0.0/gems/rake-13.3.1/lib/rake/task.rb:281:in &#39;Rake::Task#execute&#39;
#16 1.394 /usr/local/lib/ruby/gems/4.0.0/gems/rake-13.3.1/lib/rake/task.rb:219:in &#39;block in Rake::Task#invoke_with_call_chain&#39;
...
</code></pre>

<p>I didn&#39;t dig in far enough to find the root cause, and searching around turned up no other reports of this problem. I did find a workaround: build with the node version of Tailwind instead. My deployment was all in Docker, so I first added these environment variables to point the tailwindcss-ruby gem at the node executable:</p>

<pre><code> TAILWINDCSS_INSTALL_DIR=/rails/node/node_modules/.bin \
    NODE_PATH=/rails/node/node_modules \
</code></pre>
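<p>In the Dockerfile these can be set with an <code>ENV</code> instruction, matching the <code>/rails/node</code> location used by the npm install step:</p>

```dockerfile
# Point the tailwindcss-ruby gem at the node-installed executable
ENV TAILWINDCSS_INSTALL_DIR=/rails/node/node_modules/.bin \
    NODE_PATH=/rails/node/node_modules
```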

<p>Ensure <code>npm</code> is installed via <code>apt-get</code>, and add the following steps to the Docker image build:</p>

<pre><code>RUN mkdir /rails/node; npm install --prefix /rails/node tailwindcss @tailwindcss/cli
...
RUN bundle exec bootsnap precompile --gemfile
...
# Cleaning up the tailwindcss workaround
RUN rm -rf /rails/node
</code></pre>

<p>After that, the build worked fine.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/68-tailwindcss-core-dump-on-ampere-arm</guid>
    </item>
    <item>
      <title>Backing up my Ubuntu home directory to Remote Borg</title>
      <link>https://appsintheopen.com/posts/67-backing-up-my-ubuntu-home-directory-to-remote-borg</link>
      <description>
        <![CDATA[<p>Continuing on the <a href="https://appsintheopen.com/posts/66-backing-up-sqlite-database-with-borg-and-de-duplication">Borg</a> and <a href="https://appsintheopen.com/posts/65-running-syncthing-on-flint-openwrt-router">Flint Router</a> theme, the latest task is to backup my Ubuntu laptop to storage attached to the Flint router.</p>

<p>The best way to back up to a remote location is to install Borg on the remote host, but with larger backup repositories it can need significant memory, and the router hasn&#39;t much to spare.</p>

<p>However, Borg can also back up to a remote filesystem mounted with sshfs, and that works well for my limited requirements.</p>

<p>The plan is to use an hourly cron task to run a small script that performs the backup. The tricky thing about backups is knowing when they fail. On a server, something like <a href="https://cronitor.io/">Cronitor</a> is a perfect way to alert on failed jobs.</p>

<p>On the desktop, backups will only run when the machine is on and likely being used, so a failed cron job can raise a desktop notification via notify-send.</p>

<h2>Steps to Working Backups</h2>

<p>Create a new public/private key pair without a passphrase and add it to the authorized_keys on the router. The lack of passphrase is important, as there is no way for cron to enter one. Then init a new borg repo at that location.</p>
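<p>The key setup looks something like this (paths and addresses are illustrative):</p>

```shell
# Generate a key with no passphrase; cron has no way to supply one
mkdir -p "$HOME/.ssh"
ssh-keygen -t ed25519 -N "" -q -f "$HOME/.ssh/id_ed25519_backups"

# Then append the public key to authorized_keys on the router, eg:
#   ssh-copy-id -i ~/.ssh/id_ed25519_backups.pub sodonnell@192.168.8.1
```

<p>Once the sshfs mount is in place, the repo can be created with <code>borg init --encryption=repokey &lt;path&gt;</code>.</p>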

<p>Create a script somewhat like the following:</p>

<pre><code>#!/bin/bash

set -e

log () {
  echo &quot; $(date &#39;+%d/%m/%Y %H:%M:%S&#39;): $1&quot;
}

BACKUP_FOLDER=/home/sodonnell

export BORG_REPO=/home/sodonnell/backup/thinkpad-borg
export BORG_PASSPHRASE=&#39;SomePassword&#39;

log &quot;mounting remote backup location&quot;

sshfs sodonnell@192.168.8.1:/mnt/backup/thinkpad-borg /home/sodonnell/backup -o uid=1000 -o gid=1000 -o IdentityFile=/home/sodonnell/.ssh/id_ed25519_backups

log &quot;Running borg backup&quot;

/usr/bin/borg create                      \
    --verbose                             \
    --filter AME                          \
    --list                                \
    --stats                               \
    --show-rc                             \
    --compression zstd,5                  \
    --exclude-caches                      \
    --exclude $BACKUP_FOLDER/Downloads    \
    --exclude $BACKUP_FOLDER/scratch      \
    --exclude $BACKUP_FOLDER/.cache       \
    --exclude $BACKUP_FOLDER/.local/share/JetBrains     \
    --exclude $BACKUP_FOLDER/.local/share/Trash         \
                                          \
    ::&#39;thinkpad-home-{now}&#39;               \
    $BACKUP_FOLDER

/usr/bin/borg prune                   \
    --list                            \
    --glob-archives &#39;thinkpad-home-*&#39; \
    --show-rc                         \
    --keep-daily    7                 \
    --keep-weekly   4                 \
    --keep-monthly  2

/usr/bin/borg compact

log &quot;backup completed&quot;

fusermount -u /home/sodonnell/backup

log &quot;Backup FS unmounted&quot;
</code></pre>

<p>I also created a wrapper script to call this one, and send the notification if it fails, eg:</p>

<pre><code># Needed for notify-send to work
export DBUS_SESSION_BUS_ADDRESS=&quot;unix:path=/run/user/1000/bus&quot;
/home/sodonnell/source/scripts/backup/backup-home.sh || /usr/bin/notify-send -u critical -a backups &quot;Backups failed. Check log&quot;
</code></pre>

<p>The <code>DBUS_SESSION_BUS_ADDRESS</code> environment variable was obtained from my shell environment, and is needed for notify-send to work.</p>

<p>Finally, schedule this wrapper in cron:</p>

<pre><code>$ crontab -l
SHELL=/bin/bash
0 * * * * /home/sodonnell/source/scripts/backup/backup-home-wrapper.sh &amp;&gt;&gt; /home/sodonnell/scratch/backup.log
</code></pre>

<p>The default shell is /bin/sh, which doesn&#39;t support the <code>&amp;&gt;&gt;</code> syntax, hence the <code>SHELL=/bin/bash</code> line in the crontab.</p>
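<p>The difference is easy to demonstrate: under bash, <code>&amp;&gt;&gt;</code> appends both stdout and stderr to the file, while POSIX sh parses the <code>&amp;</code> as a background operator and would not capture stderr at all. A quick check:</p>

```shell
log=$(mktemp)
# Run the line under bash explicitly so &>> gets its bash meaning
bash -c '{ echo out; echo err >&2; } &>> "$1"' _ "$log"
cat "$log"
```

<p>Both lines appear in the log file.</p>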
]]>
      </description>
      <guid>https://appsintheopen.com/posts/67-backing-up-my-ubuntu-home-directory-to-remote-borg</guid>
    </item>
    <item>
      <title>Backing up Sqlite Database with Borg and De-duplication</title>
      <link>https://appsintheopen.com/posts/66-backing-up-sqlite-database-with-borg-and-de-duplication</link>
      <description>
<![CDATA[<p>I have a reasonably large sqlite database where daily backups are fine, as losing a day&#39;s worth of data would not be terrible. I&#39;d also like to keep 5 to 7 days of backups, &quot;just in case&quot;.</p>

<p>The current backup strategy is to create a date-stamped file each day, gzip it, delete the oldest, and sync the lot to an S3-like store.</p>

<p>This works, but each daily backup is mostly a duplicate of the previous, and the backups are not encrypted.</p>

<p>So I thought it would be interesting to see how Borg Backup behaves with this sort of backup.</p>

<p>The database in question is about 325MB gzipped, and 1.8GB uncompressed.</p>

<h2>Borg and Gzipped Databases</h2>

<p>For my first try, I took 5 daily gzipped database backups, and added them to borg in turn:</p>

<pre><code>borg create --stats /home/sodonnell/Downloads/backup/compressed_dbs::1st ./current
...
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              340.64 MB            320.21 MB            320.21 MB
All archives:              340.64 MB            320.21 MB            320.22 MB

                       Unique chunks         Total chunks
Chunk index:                     146                  146
</code></pre>

<p>Then I added the next 4 databases:</p>

<pre><code>------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              342.98 MB            322.51 MB            322.51 MB
All archives:                1.71 GB              1.61 GB              1.61 GB

                       Unique chunks         Total chunks
Chunk index:                     667                  667
------------------------------------------------------------------------------
</code></pre>

<p>Notice there are 667 total and unique chunks, so with these gzipped files, Borg is not able to de-duplicate any of the backups, even though the databases are mostly identical.</p>

<h2>Borg and Uncompressed Databases</h2>

<p>Next I repeated the test, adding each database in turn without gzipping them first. After adding the same 5 DBs the stats looked like:</p>

<pre><code>------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                1.93 GB            553.39 MB              8.00 MB
All archives:                9.62 GB              2.76 GB            590.26 MB

                       Unique chunks         Total chunks
Chunk index:                     787                 3699
------------------------------------------------------------------------------
</code></pre>

<p>This looks much better. The per-archive compressed size is larger than with gzip, but de-duplication is working: all five archives take 590MB in total, versus 1.6GB for the previous attempt.</p>

<h2>Experimenting With Compression</h2>

<p>The above test used the default compression, LZ4, which is fast but has modest compression ratios. Borg supports different <a href="https://borgbackup.readthedocs.io/en/stable/quickstart.html#backup-compression">compression types</a>, so I performed a few tests with zstd, starting with level 15:</p>

<pre><code>borg create --stats --compression zstd,15 /home/sodonnell/Downloads/backup/uncompressed_zstd15::1st ./current
------------------------------------------------------------------------------
Repository: /home/sodonnell/Downloads/backup/uncompressed_zstd15
Archive name: 1st
Archive fingerprint: 05bc3fb4666357daa2da63094213e933b7104bd7c29fb60e8e5c5db2e66fde85
Time (start): Wed, 2025-01-29 21:45:03
Time (end):   Wed, 2025-01-29 21:48:01
Duration: 2 minutes 57.69 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                1.92 GB            225.38 MB            225.38 MB
All archives:                1.92 GB            225.38 MB            225.41 MB

                       Unique chunks         Total chunks
Chunk index:                     747                  747
------------------------------------------------------------------------------
</code></pre>

<p>Repeating the test at level 10:</p>

<pre><code>borg create --stats --compression zstd,10 /home/sodonnell/Downloads/backup/uncompressed_zstd15::1st ./current
------------------------------------------------------------------------------
Repository: /home/sodonnell/Downloads/backup/uncompressed_zstd15
Archive name: 1st
Archive fingerprint: dd26f83b534330b4a4444c42e5f74c6a390a280d1ebbc833a63c016c8e2c3407
Time (start): Wed, 2025-01-29 21:48:42
Time (end):   Wed, 2025-01-29 21:49:51
Duration: 1 minutes 9.46 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                1.92 GB            225.64 MB            225.64 MB
All archives:                1.92 GB            225.64 MB            225.67 MB

                       Unique chunks         Total chunks
Chunk index:                     721                  721
------------------------------------------------------------------------------
</code></pre>

<p>And then level 5:</p>

<pre><code>borg create --stats --compression zstd,5 /home/sodonnell/Downloads/backup/uncompressed_zstd15::1st ./current
Enter passphrase for key /home/sodonnell/Downloads/backup/uncompressed_zstd15:
------------------------------------------------------------------------------
Repository: /home/sodonnell/Downloads/backup/uncompressed_zstd15
Archive name: 1st
Archive fingerprint: 4f7f10988eef17bc97f3052efe42d2641686b7f08e3ead4352a4872ed6b4a27b
Time (start): Wed, 2025-01-29 21:50:43
Time (end):   Wed, 2025-01-29 21:51:19
Duration: 35.35 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                1.92 GB            267.63 MB            267.63 MB
All archives:                1.92 GB            267.63 MB            267.66 MB

                       Unique chunks         Total chunks
Chunk index:                     751                  751
------------------------------------------------------------------------------
</code></pre>

<p>So level 15 took nearly 3 minutes, compressing to 225MB.</p>

<p>Level 10 was faster at 1m 9s, also compressing to 225MB.</p>

<p>Level 5 took only 35 seconds and compressed to 267MB: significantly faster for only a modest increase in size.</p>

<h2>Compression and the Next File</h2>

<p>Borg operates by splitting a file into chunks and hashing each chunk. Each hash is checked against previous hashes to see if the chunk already exists in the repository; only if it does not is the chunk compressed and encrypted. Therefore, even with expensive compression, adding a second, mostly duplicate file should be much faster. Adding the second database copy at level 15 compression:</p>

<pre><code>------------------------------------------------------------------------------
Repository: /home/sodonnell/Downloads/backup/uncompressed_zstd15
Archive name: 2nd
Archive fingerprint: 7d2e0b837ba60ccea38c69c4d45b23b0ebebae4aa7d5aeed3bf1995bdfd0b5cd
Time (start): Wed, 2025-01-29 22:40:52
Time (end):   Wed, 2025-01-29 22:41:04
Duration: 11.99 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                1.92 GB            225.62 MB              2.78 MB
All archives:                3.84 GB            450.85 MB            228.08 MB

                       Unique chunks         Total chunks
Chunk index:                     725                 1428
------------------------------------------------------------------------------
</code></pre>

<p>So 12 seconds compared to 3 minutes for the first file into the archive.</p>
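<p>The reason the second copy is so cheap can be illustrated with a toy example. Borg uses content-defined chunking rather than the fixed-size split below, but the principle is the same: chunks with identical hashes are only stored once.</p>

```shell
# Build a 4 MiB file from the same 1 MiB block repeated four times
head -c 1048576 /dev/urandom > block
cat block block block block > bigfile

# Split into fixed 1 MiB chunks and hash each one
split -b 1048576 bigfile chunk.
sha256sum chunk.* | awk '{print $1}' | sort -u | wc -l
```

<p>Four chunks in total, but only one unique hash, so only one chunk&#39;s worth of data needs compressing and storing.</p>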

<h2>MySQL Dump</h2>

<p>Keeping with the subject of database backups, a basic way to backup mysql is using mysqldump, which writes to stdout. Borg can create an archive with a single file reading from stdin, eg:</p>

<pre><code>mysqldump &lt;options&gt; | borg create --compression zstd,5 --stdin-name &#39;mysql-dump.sql&#39; repo::archive -
</code></pre>
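<p>The same trick should work for sqlite, as the <code>sqlite3</code> CLI can write a SQL dump to stdout. A sketch (the repo name and paths are placeholders):</p>

```shell
sqlite3 /path/to/app.db .dump | borg create --compression zstd,5 --stdin-name 'sqlite-dump.sql' repo::'db-{now}' -
```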

<h2>Conclusion</h2>

<p>Despite the sqlite databases being mostly identical, gzipping them before adding them to Borg yields files that share no chunks, and results in the largest archives.</p>

<p>For my database (mostly new inserts, few if any deletes), de-duplication works very well, and even level 5 zstd compression beats gzip. Even using expensive compression only hurts on the first file. Later nearly duplicate files are stored much more quickly.</p>

<p>I suspect mileage will vary depending on how much the database changes between backups, so running a few tests against a specific database is the best way to find the right compression settings and way to use Borg for a particular use case.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/66-backing-up-sqlite-database-with-borg-and-de-duplication</guid>
    </item>
    <item>
      <title>Running Syncthing on Flint OpenWRT Router</title>
      <link>https://appsintheopen.com/posts/65-running-syncthing-on-flint-openwrt-router</link>
      <description>
<![CDATA[<p>My GL.iNet GL-AXT1800 (Flint) router is currently running firmware version 4.6.8, which equates to a somewhat modified OpenWRT 21.02.</p>

<p>I recently heard about <a href="https://syncthing.net/">Syncthing</a> on the <a href="https://twit.tv/shows/security-now">Security Now</a> podcast, and decided to set it up on my Ubuntu laptop and router as a kind of poor man&#39;s backup.</p>

<h2>Install Syncthing on the Router</h2>

<p>From the Flint web UI, if you go to Applications -&gt; Plug-ins and search for Syncthing, version 1.18.2.1 is available to install. This version is a little old, but I used it rather than installing or building the latest one.</p>

<p>After installing, Syncthing does not run automatically, so you have to ssh onto the router to make a few changes.</p>

<p>Also note there is very limited free storage on the Flint, so you will need to add a USB disk - I have a 256GB NVMe drive in a USB enclosure for this purpose.</p>

<h2>Getting Syncthing to Run</h2>

<p>Syncthing is installed as a service, so you can start it with:</p>

<pre><code>service syncthing start
</code></pre>

<p>However, it would not start with that simple command. Searching the filesystem for syncthing references yielded:</p>

<pre><code>/etc/config/syncthing
/etc/init.d/syncthing
/etc/syncthing
/lib/upgrade/keep.d/syncthing
/overlay/upper/etc/init.d/syncthing
/overlay/upper/etc/syncthing
/overlay/upper/etc/config/syncthing
/overlay/upper/lib/upgrade/keep.d/syncthing
/overlay/upper/usr/bin/syncthing
/overlay/upper/root/.config/syncthing
/usr/bin/syncthing
</code></pre>

<p>Digging further, the <code>/etc/config/syncthing</code> file contained:</p>

<pre><code>config syncthing &#39;syncthing&#39;
        option enabled &#39;1&#39;

        option gui_address &#39;http://192.168.8.1:8384&#39;

        # Use internal flash for evaluation purpouses. Use external stor
        #   for production.
        # This filesystem must either support ownership/attributes or
        #   be readable/writable by the user specified in
        #   &#39;option user&#39;.
        # Consult syslog if things go wrong.
        option home &#39;/tmp/mountd/disk1_part1/syncthing&#39;

        # Changes to &quot;niceness&quot;/macprocs are not picked up by &quot;reload_co
        #   nor by &quot;restart&quot;: the service has to be stopped/started
        #   for those to take effect
        option nice &#39;19&#39;

        # 0 to match the number of CPUs (default)
        # &gt;0 to explicitly specify concurrency
        option macprocs &#39;0&#39;

        # Running as &#39;root&#39; is possible, but not recommended
        option user &#39;syncthing&#39;
</code></pre>

<p>The <code>option enabled</code> setting defaulted to &#39;0&#39;, preventing the service from starting, so changing it to &#39;1&#39; got it going. I also modified the <code>gui_address</code> to the LAN IP of the router so I can access the UI easily.</p>
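<p>I edited the file directly, but the same changes can be scripted with OpenWRT&#39;s <code>uci</code> tool, assuming the section name shown in the file above:</p>

```shell
uci set syncthing.syncthing.enabled='1'
uci set syncthing.syncthing.gui_address='http://192.168.8.1:8384'
uci commit syncthing
service syncthing restart
```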

<p>Next I changed <code>option home</code> to a folder on the mounted USB disk (<code>/tmp/mountd/disk1_part1/syncthing</code>), and set the owner:group of that folder to syncthing, as the process runs as that user by default. This folder is where Syncthing stores its indexes, keys etc, so it is better kept on external storage than on the router&#39;s internal flash. After starting the service, Syncthing created its required files in that location:</p>

<pre><code>ls -al /tmp/mountd/disk1_part1/syncthing
drwx------    4 syncthin syncthin      4096 Jan 25 23:16 .
drwxr-xr-x    7 root     root          4096 Jan 24 22:34 ..
-rw-r--r--    1 syncthin syncthin       794 Jan 24 22:38 cert.pem
-rw-------    1 syncthin syncthin      9879 Jan 25 23:16 config.xml
-rw-------    1 syncthin syncthin      7273 Jan 24 22:38 config.xml.v0
-rw-------    1 syncthin syncthin        66 Jan 25 10:51 csrftokens.txt
-rw-r--r--    1 syncthin syncthin       794 Jan 24 22:38 https-cert.pem
-rw-------    1 syncthin syncthin       288 Jan 24 22:38 https-key.pem
drwxr-xr-x    2 syncthin syncthin      4096 Jan 26 10:45 index-v0.14.0.db
-rw-------    1 syncthin syncthin       288 Jan 24 22:38 key.pem
</code></pre>

<p>The GUI can also be accessed at http://192.168.8.1:8384 with no password by default.</p>

<h2>Default Folder Location</h2>

<p>By default, Syncthing wants to create a default synced folder called <code>Sync</code>. In my setup it was unable to, as it was attempting to create it at /Sync. To fix that, edit the <code>config.xml</code> file in the home folder and locate the folder definition at the top of the file. I modified it to create the folder in a synced_data folder under the home directory:</p>

<pre><code>&lt;configuration version=&quot;35&quot;&gt;
    &lt;folder id=&quot;default&quot; label=&quot;Default Folder&quot; path=&quot;/tmp/mountd/disk1_part1/syncthing/synced_data/Sync&quot; type=&quot;sen...
</code></pre>

<h2>Default Location For Remote Folders</h2>

<p>I encountered another problem when I set up Syncthing on my laptop and shared a folder with the router. The router was not able to create the folder, as it did not know where to store it. To fix that, in the router&#39;s Syncthing GUI, navigate to Actions in the top right, then Settings, and under the first tab, General, click <code>Edit Folder Defaults</code>. Under <code>Folder Path</code>, add the root folder you would like remote shares to be created under, which is <code>/tmp/mountd/disk1_part1/syncthing/synced_data</code> in my case.</p>

<h2>External Discovery, NAT, Port Forwarding</h2>

<p>When on my LAN, Syncthing connects perfectly. With the default settings, when I move the laptop off the LAN (eg to a mobile hotspot), Syncthing still manages to connect, but via a relay server. If I disable relay servers, it can no longer connect the router and laptop unless both are on the LAN. This makes sense - my ISP uses CGNAT and my laptop has no firewall ports opened.</p>

<p>I have Tailscale configured, and the router node &quot;advertises routes&quot; of the local LAN to the Tailnet. When I connect to the Tailnet, Syncthing can connect to the LAN address 192.168.8.1.</p>

<p>What I am not sure about is what would happen in such a setup if I had multiple hosts outside the LAN that I want to talk to each other, but only when both are on the Tailnet. As it is working well enough for what I need now, I will leave that as a task for another day!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/65-running-syncthing-on-flint-openwrt-router</guid>
    </item>
    <item>
      <title>Rails Page Caching Cache Headers and Thruster</title>
      <link>https://appsintheopen.com/posts/64-rails-page-caching-cache-headers-and-thruster</link>
      <description>
        <![CDATA[<p>I recently updated this blog from an antique Rails version to the latest, in part so I could use <a href="https://github.com/basecamp/thruster">Thruster</a>.</p>

<p>This site uses Rails full page caching. This works by writing a copy of the rendered page into the public folder of the Rails app. When a request comes in, a middleware checks if a cached response exists and returns it rather than invoking the Rails controller.</p>

<p>The Rails stack serves this cached page via x-sendfile if the upstream proxy supports it, which Thruster does.</p>

<p>Rails serves all static files using the <a href="https://github.com/rails/rails/blob/892955b5c9e647b957d49eb00854df56d15f0ab3/actionpack/lib/action_dispatch/middleware/static.rb">static middleware</a>, which includes images, assets and cached files.</p>

<p>This middleware has a single cache header setting, which is controlled by the following setting in application.rb, and caches all files for 1 year:</p>

<pre><code> config.public_file_server.headers = { &quot;cache-control&quot; =&gt; &quot;public, max-age=#{1.year.to_i}&quot; }
</code></pre>

<p>Thruster also performs asset caching, and if Rails returns a cache header for a request, Thruster will store the content in memory and serve the next request from its own cache instead of sending it to Rails.</p>

<p>Requests served with x-sendfile and a cache header appear to be a bit more complex in Thruster. I found <a href="https://github.com/basecamp/thruster/issues/51">an issue</a> where Thruster appears to cache the x-sendfile header; later, when the cached file has expired, Thruster returns a 404 if it cannot find the original file.</p>

<p>Aside from that problem, I don&#39;t want the full pages to be cached upstream indefinitely, as then any edits would be somewhat invisible. So we need a way to avoid setting a cache header on &quot;full page cached&quot; files, while retaining the cache header on static assets.</p>

<p>Out of the box, Rails cannot do this, but we can make it work by adding another middleware.</p>

<p>First, disable all cache headers for the static middleware, and configure our new middleware:</p>

<pre><code> # application.rb

 # Remove or comment out the default cache headers
 # config.public_file_server.headers = { &quot;cache-control&quot; =&gt; &quot;public, max-age=#{1.year.to_i}&quot; }

 # Default all static files to no cache for full page caching.
 config.public_file_server.headers = { &quot;cache-control&quot; =&gt; &quot;max-age=0, private, must-revalidate&quot; }

 # Add a new middleware before the Static file server. Then reinstate the cache header for asset
 # paths
 config.middleware.insert_before ActionDispatch::Static, StaticFileHeaderOverride, /^\/assets\/.*/,
                                    { &quot;cache-control&quot; =&gt; &quot;public, max-age=#{1.year.to_i}&quot; }
</code></pre>

<p>Note the original <code>config.public_file_server.headers</code> was defined in <code>development.rb</code> and <code>production.rb</code>, but I centralized the setting in <code>application.rb</code>.</p>

<p>I added the code for the new middleware into <code>lib/middleware/static_file_header_override.rb</code>:</p>

<pre><code>class StaticFileHeaderOverride
  # Pass a pattern to match the paths we want to override the heads on, and the
  # headers to merge in, which will overwrite any existing with the same name.
  def initialize(app, pattern, headers)
    @app = app
    @pattern = pattern
    @headers = headers
  end

  def call(env)
    # Call the next middleware, and apply any overrides as the request is returned.
    status, headers, response = @app.call(env)
    if @pattern.match? env[&#39;REQUEST_PATH&#39;]
      headers.merge! @headers
    end

    [status, headers, response]
  end
end
</code></pre>
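<p>As a quick sanity check, the middleware can be exercised outside Rails with a stub Rack app. This is a sketch - the class body is repeated so the snippet is self-contained, and the lambda merely stands in for the rest of the middleware stack:</p>

```ruby
class StaticFileHeaderOverride
  def initialize(app, pattern, headers)
    @app = app
    @pattern = pattern
    @headers = headers
  end

  def call(env)
    status, headers, response = @app.call(env)
    headers.merge!(@headers) if @pattern.match?(env['REQUEST_PATH'])
    [status, headers, response]
  end
end

# Stub app returning the no-cache default from application.rb
app = ->(env) { [200, { 'cache-control' => 'max-age=0, private, must-revalidate' }, ['ok']] }

mw = StaticFileHeaderOverride.new(app, /^\/assets\/.*/,
                                  { 'cache-control' => "public, max-age=#{365 * 24 * 3600}" })

_, asset_headers, = mw.call('REQUEST_PATH' => '/assets/app.css')
_, page_headers,  = mw.call('REQUEST_PATH' => '/posts/1')

puts asset_headers['cache-control'] # long-lived cache for the asset path
puts page_headers['cache-control']  # no-cache default everywhere else
```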

<p>This also required adding the new <code>middleware</code> folder to the autoload_lib setting in <code>application.rb</code>:</p>

<pre><code>-    config.autoload_lib(ignore: %w[assets tasks])
+    config.autoload_lib(ignore: %w[assets tasks middleware])
</code></pre>

<p>Now, only static files under /assets get a cache header, and all others return no-cache.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/64-rails-page-caching-cache-headers-and-thruster</guid>
    </item>
    <item>
      <title>Create a bootable Ubuntu installer on a partitioned USB drive</title>
      <link>https://appsintheopen.com/posts/63-create-a-bootable-ubuntu-installer-on-a-partitioned-usb-drive</link>
      <description>
        <![CDATA[<p>Most guides to create a bootable live CD for Ubuntu will wipe the entire drive, creating a single partition in the process.</p>

<p>The ISO for Ubuntu 22.04 is about 4.5GB, and these days most USB sticks are way bigger than that. It&#39;s a shame to waste the rest of the space.</p>

<p>After some googling I came across <a href="https://askubuntu.com/questions/423300/live-usb-on-a-2-partition-usb-drive">this post</a>, which describes how to do this. However, the top answer seemed overly complex, and <a href="https://askubuntu.com/a/971205">this simpler answer</a> saved me from an error when running the <code>isohybrid --partok</code> command.</p>

<p>Following a link to <a href="https://theartofmachinery.com/2016/04/21/partitioned_live_usb.html">this post</a>, it seems things are even simpler.</p>

<ol>
<li><p>Partition the drive with the first partition being the one to use for storage - I formatted as FAT32 as I wanted to use it with Windows and my TV. Apparently, if the storage partition is second, Windows will not see it.</p></li>
<li><p>Create a partition with the remaining space, format as EXT4.</p></li>
<li><p>Make the second partition bootable.</p></li>
<li><p>Copy the ISO over using dd:</p></li>
</ol>

<pre><code>sudo dd if=/home/sodonnell/Downloads/ubuntu-22.04.5-desktop-amd64.iso of=/dev/sda2 bs=1M
</code></pre>

<ol>
<li>Install an MBR onto the disk - note this command runs on the disk <code>sda</code>, not the partition. The install-mbr command can be installed with <code>apt install mbr</code>.</li>
</ol>

<pre><code>sudo install-mbr /dev/sda
</code></pre>

<p>After that, I rebooted my system and was able to run &quot;Try Ubuntu&quot; successfully.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/63-create-a-bootable-ubuntu-installer-on-a-partitioned-usb-drive</guid>
    </item>
    <item>
      <title>Modelling Meter Rates in Rails</title>
      <link>https://appsintheopen.com/posts/62-modelling-meter-rates-in-rails</link>
      <description>
        <![CDATA[<p>I have been working on a small Rails app to display household energy readings. The application receives power usage updates for a set of Meters. Each Meter can have a Rate attached to it, to calculate the cost of the energy.</p>

<p>Over time the rates can change, and we need to keep the old (or future) rates to calculate historical usage.</p>

<h2>Rates with Start and End Date</h2>

<p>We need a model that allows:</p>

<ol>
<li>A date the rate becomes valid</li>
<li>A date the rate ends</li>
<li>A potentially open ended rate for the present day into the future.</li>
</ol>

<p>Initially I considered modelling this with a simple table:</p>

<pre><code>create_table :meter_rates do |t|
      t.references :meter, null: false, foreign_key: true
      t.decimal :rate, precision: 10, scale: 2, null: false
      t.date :start_on, null: false
      t.date :end_on
end
</code></pre>

<p>This meets all the requirements, where an open-ended &quot;present day&quot; tariff has a null end date. However, there are a number of hidden complexities:</p>

<ol>
<li><p>We need a validation, ideally at the database level, that start_on &lt;= end_on. This is easily solved with a simple constraint.</p></li>
<li><p>Ideally, we need a constraint to ensure no tariffs start on the same day. Again easily achieved with a unique index.</p></li>
<li><p>An open ended tariff can be represented with a null end date, so adding a new tariff requires ending the current one and adding a new row. Adding a rate in the middle of two existing rates involves modifying the previous end date, adding a new row and then modifying the start date of the later row. This may be further complicated as below.</p></li>
<li><p>Ideally, we need to ensure tariffs do not overlap, ie the start date of a new tariff must not fall between the start and end of another tariff. This is where things start to get tricky. Constraints and indexes only cover the newly inserted row, so a database trigger is needed to validate new rows against the others.</p></li>
<li><p>When adding a new tariff, we should ensure there is no gap between the rates. At the database level, this would require a trigger. </p></li>
</ol>
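
<p>The first two complexities can be addressed directly in a migration. This is only a sketch: <code>add_check_constraint</code> assumes Rails 6.1 or newer and a database that enforces check constraints, and the constraint name is illustrative:</p>

<pre><code>class AddMeterRateConstraints &lt; ActiveRecord::Migration[7.0]
  def change
    # 1. start_on must not be after end_on (a NULL end_on passes the check)
    add_check_constraint :meter_rates, &quot;start_on &lt;= end_on&quot;,
                         name: &quot;meter_rates_dates_ordered&quot;
    # 2. no two rates for the same meter can start on the same day
    add_index :meter_rates, [:meter_id, :start_on], unique: true
  end
end
</code></pre>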

<p>For 4 and 5, Rails can validate this before insert, but application-level validations don&#39;t protect against concurrent inserts, so a data integrity problem could creep in.</p>

<h2>Who Needs End Date Anyway?</h2>

<p>What if we remove end_on from the model entirely? Interestingly the problem is greatly simplified.</p>

<p>A tariff change is indicated by a new row with the start date of the tariff.</p>

<p>Reviewing the earlier problems:</p>

<ol>
<li><p>We no longer have an end_on field to worry about</p></li>
<li><p>Uniqueness of the tariff start_on is still enforced via a unique index.</p></li>
<li><p>Adding a tariff for the future or in the past between existing tariffs is a simple additional row.</p></li>
<li><p>Tariffs can no longer overlap.</p></li>
<li><p>Tariffs can no longer have gaps. The end of one tariff is signaled by the next tariff&#39;s start date.</p></li>
</ol>

<p>Finding the tariff for a given date is simple enough, and should be efficient on a large table with an index on meter_id, start_on:</p>

<pre><code>select *
from tariffs
where meter_id = ?
and   start_on &lt;= ?
order by start_on desc
limit 1;
</code></pre>
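
<p>To make the rule concrete, the same lookup can be expressed in plain Ruby over an in-memory array: the applicable tariff is the latest one whose start_on falls on or before the given date. The names and sample rates here are illustrative:</p>

<pre><code>require 'date'

Tariff = Struct.new(:meter_id, :rate, :start_on)

# The latest tariff for the meter starting on or before the date.
def tariff_for(tariffs, meter_id, date)
  candidates = tariffs.select { |t| t.meter_id == meter_id }
  candidates = candidates.select { |t| date >= t.start_on }
  candidates.max_by { |t| t.start_on }
end

tariffs = [
  Tariff.new(1, 0.30, Date.new(2023, 1, 1)),
  Tariff.new(1, 0.25, Date.new(2024, 1, 1)),
]

tariff_for(tariffs, 1, Date.new(2023, 6, 1)).rate  # 0.30, the 2023 rate
</code></pre>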

<p>There is nothing groundbreaking in this post, but I thought it was interesting how much more complex adding an explicit tariff end date made the problem.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/62-modelling-meter-rates-in-rails</guid>
    </item>
    <item>
      <title>Load a mysql table into sqlite quickly</title>
      <link>https://appsintheopen.com/posts/61-load-a-mysql-table-into-sqlite-quickly</link>
      <description>
<![CDATA[<p>I recently had a need to copy about 18M rows from MySQL to a SQLite database. The MySQL dump measured about 1.2GB.</p>

<p>First, dump the data from MySQL using the following:</p>

<pre><code>mysqldump -uroot -p --compatible=ansi --skip-extended-insert --compact --single-transaction --no-create-info schema table &gt; table_dump.sql
</code></pre>

<p>This creates a file of individual insert statements. The key to loading these quickly in SQLite is to load them all in a single transaction, as <a href="https://www.sqlite.org/faq.html#q19">mentioned in the FAQ</a>. It also makes sense to allocate additional cache memory for SQLite. Additional pragma options can be used to disable the journal and avoid waiting on disk, but I did not need them to get the speed I needed.</p>

<pre><code>PRAGMA cache_size = 400000;

BEGIN;
.read table_dump.sql
END;
</code></pre>
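
<p>If you need to script the load, the same steps can be driven from Ruby. This is a minimal sketch, assuming the sqlite3 command line tool is installed and the dump file exists; the file and database names are illustrative:</p>

<pre><code>def sqlite_load_script(dump_file, cache_pages = 400_000)
  [
    "PRAGMA cache_size = #{cache_pages};",
    "BEGIN;",
    ".read #{dump_file}",
    "END;",
  ].join("\n")
end

# Pass the generated script to the sqlite3 CLI:
# system('sqlite3', 'target.db', sqlite_load_script('table_dump.sql'))
</code></pre>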
]]>
      </description>
      <guid>https://appsintheopen.com/posts/61-load-a-mysql-table-into-sqlite-quickly</guid>
    </item>
    <item>
      <title>Hibernate on Ubuntu 22.04 and 24.04 without uswsusp</title>
      <link>https://appsintheopen.com/posts/60-hibernate-on-ubuntu-22-04-and-24-04-without-uswsusp</link>
      <description>
        <![CDATA[<p>Getting my laptop to hibernate successfully with Ubuntu 22 was a little tricky.</p>

<p>Most of the online resources mention a command, swap-offset, which I think comes from the uswsusp package. That package is no longer part of Ubuntu.</p>

<p>These are the steps I used to get Hibernate working on a Lenovo T490 with Ubuntu 22.</p>

<h2>Sizing the Swap File</h2>

<p>Ubuntu installs a small swapfile by default (2GB on my system) rather than a swap partition.</p>

<p>The swapfile needs to be at least as large as the system memory.</p>

<p>First check the size of current swap:</p>

<pre><code>$ swapon -s
Filename                Type        Size        Used        Priority
/swapfile                               file        16778236    0       -2

# Or alternatively check swap and memory size together:

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           15640        1755       10890         581        2994       12994
Swap:          16384           0       16384
</code></pre>

<p>Here we can see the total system memory is just under 16GB and the swap file is sized at 16GB as I have already sized it to be at least as large as memory. To resize swap:</p>

<pre><code>sudo swapoff -a
sudo dd if=/dev/zero of=/swapfile bs=1M count=16385
sudo mkswap /swapfile
sudo swapon /swapfile
</code></pre>

<p>Then restart the system.</p>

<h2>Swapfile Details</h2>

<p>To enable hibernate, you need to know the UUID of the partition the swapfile resides on, and the offset of the swapfile on the partition.</p>

<p>To get the partition UUID:</p>

<pre><code>$ findmnt -no UUID -T /swapfile
2f29b7c3-cfdb-44a0-9ec6-2f141b7f581e
</code></pre>

<p>To get the offset:</p>

<pre><code>$ sudo filefrag -v /swapfile
Filesystem type is: ef53
File size of /swapfile is 17180917760 (4194560 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   12287:   39702528..  39714815:  12288:            
   1:    12288..   14335:    1984512..   1986559:   2048:   39714816:
   2:    14336..   16383:   39825408..  39827455:   2048:    1986560:
   3:    16384..   18431:   39829504..  39831551:   2048:   39827456:
   4:    18432..   22527:   39839744..  39843839:   4096:   39831552:
</code></pre>

<p>The offset you want is the first physical offset on the first extent line, i.e. 39702528 here.</p>
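
<p>Rather than picking the offset out by eye, it can be extracted from the filefrag output with a small script. This is a sketch that parses the output format shown above:</p>

<pre><code>def resume_offset(filefrag_output)
  filefrag_output.each_line do |line|
    # The extent 0 line looks like:
    #    0:        0..   12287:   39702528..  39714815:  12288:
    match = line.match(/^\s*0:\s*\d+\.\.\s*\d+:\s*(\d+)\.\./)
    return match[1].to_i if match
  end
  nil
end

# resume_offset(`sudo filefrag -v /swapfile`)
</code></pre>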

<h2>Update Grub2</h2>

<p>Next, update the grub boot options using the values obtained above. Edit <code>/etc/default/grub</code> and find the line that looks like <code>GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet splash&quot;</code>. Edit the line so it looks like:</p>

<pre><code>GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet splash resume=UUID=2f29b7c3-cfdb-44a0-9ec6-2f141b7f581e resume_offset=39702528&quot;
</code></pre>

<p>Note that the resume line looks like <code>resume=UUID=2f...</code>, as I missed the UUID part on my first try at setting this up. Then run:</p>

<pre><code>sudo update-grub
</code></pre>

<p>Next, edit <code>/etc/initramfs-tools/conf.d/resume</code> (e.g. <code>sudo vi /etc/initramfs-tools/conf.d/resume</code>) and add:</p>

<pre><code>RESUME=UUID=2f29b7c3-cfdb-44a0-9ec6-2f141b7f581e resume_offset=39702528
</code></pre>

<p>In all the guides I found, the first RESUME was in uppercase. I am not sure whether it must be.</p>

<p>Now, run:</p>

<pre><code>sudo update-initramfs -c -k all
</code></pre>

<p>At this stage reboot, then you should be able to hibernate via:</p>

<pre><code>sudo systemctl hibernate
</code></pre>

<h2>Hibernate From The UI</h2>

<p>My Ubuntu runs the Gnome desktop environment (this must be the default, as I did not select it). To get the hibernate option from the usual power menu, you can add a Gnome extension. First, enable Firefox to install the extensions:</p>

<pre><code>sudo apt install gnome-shell-extensions
sudo apt install chrome-gnome-shell
</code></pre>

<p>Logout, then install the <a href="https://addons.mozilla.org/en-US/firefox/addon/gnome-shell-integration/">Gnome Shell Integration extension</a>.</p>

<p>Restart Firefox and install the <a href="https://extensions.gnome.org/extension/755/hibernate-status-button/">status button extension</a>, and toggle it ON.</p>

<p>Then logout and in again and the Hibernate and Hybrid Sleep options should be available.</p>

<h2>Updates for Ubuntu 24.04</h2>

<p>On Ubuntu 24.04, a few things have changed.</p>

<p>First, the default swapfile name has changed from <code>/swapfile</code> to <code>/swap.img</code>. I updated /etc/fstab to reference <code>/swapfile</code>.</p>

<p>Second, the step <code>sudo update-initramfs -c -k all</code> is no longer required, or was never required!</p>

<p>Third, there seems to be <a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2057687">a bug</a> which causes an error:</p>

<pre><code>systemctl hibernate
error: &quot;Call to Hibernate failed: Invalid argument&quot; 
</code></pre>

<p><s>This is apparently triggered by a kernel update. To fix it, remove the hibernate settings from <code>/etc/default/grub</code> reboot, and add them back again.</s></p>

<p>As of 2024-09-09 this bug is resolved and the above steps work again. To confirm the system has the fix:</p>

<pre><code>apt list --installed
initramfs-tools/noble-updates,noble-updates,now 0.142ubuntu25.2 all [installed,automatic]
</code></pre>

<p>Ensure the version is 0.142ubuntu25.2 or greater.</p>

<p>Fourth (and finally), the extension to enable the hibernate button in the power menu did not work for me until I added the following, <a href="https://www.reddit.com/r/Kubuntu/comments/1c2frkd/enabling_hibernation_on_2404/?rdt=59532">found here</a>, to <code>/etc/polkit-1/rules.d/10-enable-hibernate.rules</code>:</p>

<pre><code>polkit.addRule(function(action, subject) {
    if (action.id == &quot;org.freedesktop.login1.hibernate&quot; ||
        action.id == &quot;org.freedesktop.login1.hibernate-multiple-sessions&quot; ||
        action.id == &quot;org.freedesktop.upower.hibernate&quot; ||
        action.id == &quot;org.freedesktop.login1.handle-hibernate-key&quot; ||
        action.id == &quot;org.freedesktop.login1.hibernate-ignore-inhibit&quot;)
    {
        return polkit.Result.YES;
    }
});
</code></pre>

<h2>Hybrid Sleep</h2>

<p>Googling about hybrid sleep suggests it is like suspend / sleep and hibernate together. The system memory state is kept for quick startup, but the memory is also persisted to disk in case of power failure.</p>

<p>With Windows 11 and this same laptop, when Windows sleeps it is initially suspended, but then hibernates after some time.</p>

<p>I figured the Linux Hybrid Sleep was like this, but initial testing and some research suggests it is not.</p>

<p>The hybrid sleep settings can be controlled via <code>/etc/systemd/sleep.conf</code>; the hibernate delay defaults to 180 minutes:</p>

<pre><code>[Sleep]
#AllowSuspend=yes
#AllowHibernation=yes
#AllowSuspendThenHibernate=yes
#AllowHybridSleep=yes
#SuspendMode=
#SuspendState=mem standby freeze
#HibernateMode=platform shutdown
#HibernateState=disk
#HybridSleepMode=suspend platform shutdown
#HybridSleepState=disk
#HibernateDelaySec=180min
</code></pre>

<p>I tried changing this to 5min, and the laptop did not hibernate even after 15 minutes. Some reports suggest this used to work this way, and no longer does. Other reports suggest the laptop will wake the system when the battery reaches 5% and then it will hibernate. For now I will leave this as a problem to be solved!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/60-hibernate-on-ubuntu-22-04-and-24-04-without-uswsusp</guid>
    </item>
    <item>
      <title>Intellij development on Windows with WSL and Docker and Ruby</title>
      <link>https://appsintheopen.com/posts/59-intellij-development-on-windows-with-wsl-and-docker-and-ruby</link>
      <description>
<![CDATA[

<p>Having left Windows behind some 10 years ago, I recently heard about WSL via <a href="https://twitter.com/dhh?lang=en">DHH</a> tweeting about trying it and leaving Mac OS behind. So I decided to give it a try. Most of my development these days is using Intellij for Java (or Ruby on the side). I&#39;m also trying to push my dev environment <a href="/posts/58-rails-development-in-docker-with-rubymine-or-intellij">into containers</a> to avoid complexities around environment setup.</p>

<p>So my requirements are attempting to setup a development environment:</p>

<ul>
<li>Using a Windows 11 machine</li>
<li>With WSL2</li>
<li>With docker running inside WSL2 (ie not Docker Desktop)</li>
<li>Using Intellij as the IDE</li>
</ul>

<h2>Getting WSL Running.</h2>

<p>Starting from a clean Windows 11 install:</p>

<ul>
<li>Control Panel -&gt; Turn Windows Features On and Off</li>
<li>Ensure Hyper-V, Virtual Machine Platform and Windows Subsystem for Linux are enabled.</li>
</ul>

<p>Windows will most likely need to be restarted. Then in Windows Powershell:</p>

<pre><code>wsl --update
wsl --install -d ubuntu
</code></pre>

<p>At this stage WSL2 should be running, and you can enter the WSL environment with the <code>wsl</code> command.</p>

<h2>WSLg and Intellij</h2>

<p>WSL allows for Linux GUI applications to run seamlessly inside Windows. First install the <a href="https://www.intel.com/content/www/us/en/download/19344/intel-graphics-windows-dch-drivers.html">vGPU driver</a>, then in wsl:</p>

<pre><code>sudo apt-get install x11-apps libxrender1 libxtst6 libxi6
sudo snap install intellij-idea-ultimate --classic
</code></pre>

<p>Now running <code>intellij-idea-ultimate</code> in a terminal should open Intellij. On my system (Lenovo T490) Windows automatically scales the display and Intellij appears with very small fonts when started this way. It looks fine on my external monitor.</p>

<p>Following the comments in <a href="https://www.tomaszmik.us/2020/01/26/intellij-on-wsl/">this post</a> turned up the following instructions to get WSLg apps to scale too:</p>

<ul>
<li>As administrator <code>wsl --shutdown</code></li>
<li>Create / edit the file <code>.wslgconfig</code> in your users home directory, ie <code>c:\users\&lt;user&gt;\</code> and add:</li>
</ul>

<pre><code>[system-distro-env]
WESTON_RDP_DISABLE_FRACTIONAL_HI_DPI_SCALING=false
WESTON_RDP_FRACTIONAL_HI_DPI_SCALING=true
</code></pre>

<p>However after this, Intellij still looked terrible. Rolling that back, it turns out using the zoom in Intellij sorted it.</p>

<h2>Docker On Windows</h2>

<p>The easiest path to Docker on Windows is of course <a href="https://docs.docker.com/desktop/install/windows-install/">Docker Desktop</a>. This actually runs a VM in WSL to deploy docker on. But with Ubuntu available in WSL already, Docker Desktop seems unnecessary. As long as any docker commands are executed inside of WSL, docker can be in the same WSL instance, and run with virtually no overhead. Some years back, I read posts indicating it was difficult to run Docker inside WSL due to the lack of systemd, but that is no longer the case.</p>

<h2>Where Does the Source Code Live?</h2>

<p>With WSL, the Windows C drive is mounted into the Linux VM under <code>/mnt/c</code> so WSL can access source files inside Windows. However there appears to be quite a large performance penalty for this. It is better to keep the source inside WSL if the execution environment is inside WSL.</p>

<h2>Intellij On Windows, source in WSL</h2>

<p>What I ideally wanted was Intellij running on Windows, avoiding WSLg, which I fear may be somewhat slow (though I have not used it enough to validate that). The source would then live in WSL, mounted into a container running on Docker in WSL with all the dependencies.</p>

<p>Trying this out in Intellij, Docker can be configured to reside in WSL, and works fine. The source can be loaded from WSL directly. For Ruby development, attempting to configure a remote docker interpreter however fails as <a href="https://youtrack.jetbrains.com/issue/RUBY-32642/Ruby-project-in-WSL-with-Docker-in-WSL-remote-docker-interpreter-cannot-be-added">I reported</a> with an error like:</p>

<pre><code>Cannot run program &quot;docker&quot; (in directory &quot;\\\\wsl.localhost\\Ubuntu\\home\\sodonnell\\rails\\shiny_new_app&quot;): CreateProcess error=2, The system cannot find the file specified
</code></pre>

<p>It seems like the Ruby integration doesn&#39;t understand how to deal with code in WSL and docker in WSL. If the source was in Windows, it would run into a different problem, attempting to mount a windows path into the docker container inside WSL. This is where Docker Desktop would come in, translating the windows paths to make this work, but then the source would still be outside WSL.</p>

<p>Another option Intellij offers is to have the interpreter inside WSL, but that would require abandoning docker and setting up a WSL instance with all dependencies.</p>

<p>So it looks like I am stuck with putting everything inside WSL - source, Docker and Intellij and running Intellij through WSLg! Or, just installing Ubuntu and going all in on Linux instead!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/59-intellij-development-on-windows-with-wsl-and-docker-and-ruby</guid>
    </item>
    <item>
      <title>Rails development in docker with Rubymine or Intellij</title>
      <link>https://appsintheopen.com/posts/58-rails-development-in-docker-with-rubymine-or-intellij</link>
      <description>
        <![CDATA[<p>Nearly 14 years ago, I wrote about <a href="/posts/4-choosing-a-text-editor-for-rails-development">setting up emacs for Rails development</a>. Having a need for a new Rails app, I figured it was time to look at a more modern IDE to make things easier. </p>

<p>The last time I tried to install Rails on my mac, I got into all sorts of build and compile errors that made it painful to even get started. These days I try to use Docker to make setup and laptop moves easier.</p>

<p>I thought it would be interesting to try a fully dockerized dev environment. <a href="https://www.jetbrains.com/ruby/">Rubymine</a> supports developing in such a way, and this post describes how to set it up.</p>

<h2>Generate the Rails App</h2>

<p>Without Ruby locally, and by extension Rails, the first challenge is to generate a new Rails app. Doing that is as simple as:</p>

<pre><code>docker run --rm -v &quot;$PWD&quot;:/usr/src/app -w /usr/src/app --user $UID:$UID ruby:3.2.3 sh -c &quot;gem install rails &amp;&amp; /usr/local/bundle/bin/rails new shiny_new_app&quot;
</code></pre>

<p>This uses the Ruby 3.2.3 base image to install Rails and generate the new application.</p>

<h2>A Development Docker Image</h2>

<p>Next, you need a container with Ruby and all the gems needed to run the Rails application. This can be defined in a Dockerfile at the root of the Rails project, which I have named Dockerfile.dev. With bind mounts (ie mounting the source into the container), on Mac OS the file permissions are translated by Docker Desktop. If you are running on Linux or WSL2 using Docker directly, the container needs to run as your own user ID, otherwise files created inside the container on the bind mount will be owned by root. Much of this Dockerfile is taken from <a href="https://hint.io/blog/rails-development-with-docker">this blog post</a>.</p>

<pre><code># An ARG defined before FROM can only be used in FROM.
ARG RUBY_VERSION=3.2.3
FROM ruby:$RUBY_VERSION

# Needed if you need to install some OS / Container packages for Gems
# Adding some generally useful tools here
RUN apt-get update &amp;&amp; apt-get install -y \
    build-essential \
    git \
    netcat-traditional \
    vim \
    sudo

# Non root user. Pass in the uid and gid for your local user for build
ARG UID
ENV UID $UID
ARG GID
ENV GID $GID
ARG USER=ruby
ENV USER $USER

RUN groupadd -g $GID $USER &amp;&amp; \
    useradd -u $UID -g $USER -m $USER &amp;&amp; \
    usermod -p &quot;*&quot; $USER &amp;&amp; \
    usermod -aG sudo $USER &amp;&amp; \
    echo &quot;$USER ALL=NOPASSWD: ALL&quot; &gt;&gt; /etc/sudoers.d/50-$USER


# throw errors if Gemfile has been modified since Gemfile.lock
# RUN bundle config --global frozen 1
#

# Bundler and gem install here.

ENV LANG C.UTF-8

ENV BUNDLE_PATH /gems
ENV BUNDLE_HOME /gems
ENV BUNDLE_BIN /gems/bin

ARG BUNDLE_JOBS=20
ENV BUNDLE_JOBS $BUNDLE_JOBS
ARG BUNDLE_RETRY=5
ENV BUNDLE_RETRY $BUNDLE_RETRY

ENV GEM_HOME /gems
ENV GEM_PATH /gems

ENV PATH /gems/bin:$PATH

RUN mkdir -p &quot;$GEM_HOME&quot; &amp;&amp; chown $USER:$USER &quot;$GEM_HOME&quot;
RUN mkdir -p /usr/src/app &amp;&amp; chown $USER:$USER /usr/src/app

USER $USER

WORKDIR /usr/src/app

COPY Gemfile Gemfile.lock ./
RUN bundle install

# This was needed to get around an error with Rubymine, but I don&#39;t recall what.
#RUN mkdir -p /usr/local/bundle/bin &amp;&amp; ln -s /usr/local/bin/bundle /usr/bin/bundle

CMD [&quot;/usr/local/bin/bundle&quot;, &quot;exec&quot;, &quot;rails&quot;, &quot;server&quot;]
</code></pre>

<p>This Dockerfile uses the Gemfile from the Rails app and installs all the required gems, which include Rails. At this stage, this container can run <code>rails server</code> and hence your Rails app.</p>

<p>As the image has a few build arguments, the easiest way to build it is inline with a docker-compose definition, rather than the usual command (shown here without the arguments):</p>

<pre><code>docker build . -f Dockerfile.dev -t &lt;imageName&gt;:&lt;version&gt;
</code></pre>

<h2>Compose Environment</h2>

<p>Having the image, we now need a docker-compose.yaml file to be used by Rubymine. This is fairly simple:</p>

<pre><code>services:
  web:
    build:
      context: ./
      dockerfile: ./Dockerfile.dev
      args:
      # Check these for your current user ID. They don&#39;t matter
      # when running for Docker Desktop on Mac
        - UID=1000
        - GID=1000  
    image: testapp:1
    volumes:
      - .:/usr/src/app
    ports:
      - &quot;127.0.0.1:3000:3000&quot;
      # Needed for debugging
      - &quot;1234:1234&quot;
      - &quot;26166:26166&quot;
    command: tail -f /dev/null
</code></pre>

<p>Notice we map container port 3000 to localhost on the host so we can access the Rails app locally. The other ports are for the Rubymine debugger.</p>

<p>The command is set to <code>tail -f /dev/null</code> so the container stays running and the IDE can run the server and various commands within it, somewhat like a remote host.</p>

<h2>Rubymine / Intellij Setup</h2>

<p>The setup for Rubymine is the same as for Intellij Ultimate with the Ruby and Docker plugins. There is also a pretty nice <a href="https://www.youtube.com/watch?v=BHniRaZ0_JE">youtube video</a> from Jetbrains that walks through the setup and IDE features.</p>

<ol>
<li><p>Open the project via File -&gt; New Project From Existing Sources. When the import completes, the IDE may give a warning about &quot;No Ruby Interpreter configured for the project&quot;. Before configuring the interpreter, we should confirm docker is working with the IDE.</p></li>
<li><p>Open View -&gt; Tool Windows -&gt; Services (command-8 / ALT-8 on Windows) to reveal the docker window. Confirm docker shows as connected.</p></li>
<li><p>Open the docker-compose.yaml, and using the small green arrows in the margin, start the environment.</p></li>
<li><p>Access the project preferences (command-; / ctrl-alt-shift-s on Windows), and go to SDKs. Click +, &quot;Add Ruby SDK&quot;, &quot;Remote Interpreter or version manager&quot;. Choose docker-compose, select &quot;web&quot; as the service (from above docker-compose.yaml).</p></li>
<li><p>Now goto Project (still in project structure via command-;), and select the newly added remote interpreter.</p></li>
<li><p>The final thing to check is that any commands run inside the established docker-compose container. Using command-, / ctrl-alt-s on Windows, go to the general preferences and then Build, Execution, Deployment -&gt; Docker -&gt; Ruby Settings. Ensure &quot;docker-compose-exec, run the project with docker-compose up if needed&quot; is selected.</p></li>
</ol>

<p>With that all done, you should be able to double press control to open the &quot;run anything&quot; menu and then run rails server, rake tasks, migrations, generators etc.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/58-rails-development-in-docker-with-rubymine-or-intellij</guid>
    </item>
    <item>
<title>Whitelisting a Dynamic IP with iptables</title>
      <link>https://appsintheopen.com/posts/57-whitelisting-a-dynamic-ip-in-with-iptables</link>
      <description>
<![CDATA[<p>I recently got a <a href="https://shellystore.co.uk/product/shelly-em-50a-and-120a/">Shelly EM</a> to monitor the electricity usage at my house. Shelly sends data over MQTT, and I intend to run an MQTT server on a small cloud server to receive it. However, I am not comfortable with an MQTT server open to the wider internet. To avoid this, I&#39;d like to configure iptables to allow only traffic from my home IP address.</p>

<h2>No IP</h2>

<p>My home IP address is dynamic and can change at any time. <a href="https://www.noip.com">No IP</a> to the rescue! No IP provides a service which maps a domain name to the dynamic IP. It does this with a short TTL DNS record, but also provides an API to update the address when it changes. My home router has a built in integration with No IP, so it can automatically update the IP when it changes.</p>

<h2>Updating iptables</h2>

<p>Iptables works on IP addresses, not hostnames, so when the IP changes, iptables needs to be updated. In <a href="/posts/56-docker-and-the-iptables-firewall">an earlier post</a> I described how to set up iptables to filter traffic for Docker containers. In short, we have an iptables chain called WHITELIST-IP. The intention is for this chain to hold a single IP and ACCEPT its traffic. To make things work, we need to do 3 things:</p>

<ol>
<li>Resolve the No IP hostname to get the current IP of my home internet service.</li>
<li>Query iptables to find the current whitelist IP if any.</li>
<li>If there is no whitelisted IP or the current IP does not match the one in iptables, flush the chain and add a new entry.</li>
</ol>

<p>The first is a simple DNS lookup.</p>

<p>The second can be achieved by running <code>iptables -nL WHITELIST-IP -t filter</code> and looking for lines like:</p>

<pre><code>ACCEPT     all  --  109.xxx.xxx.xxx      0.0.0.0/0
</code></pre>

<p>Finally, update iptables:</p>

<pre><code>iptables -F WHITELIST-IP
iptables -A WHITELIST-IP -s #{new_address} -j ACCEPT
</code></pre>

<p>Putting this all together in a short Ruby script looks like below. Simply schedule this in cron, and the dynamic IP will be whitelisted in iptables anytime it changes.</p>

<pre><code>require &quot;resolv&quot;

hostname = &quot;hidden.ddns.net&quot;

def existing_iptables_address
  # iptables -nL WHITELIST-IP -t filter
  # Chain WHITELIST-IP (1 references)
  # target     prot opt source               destination
  # ACCEPT     all  --  109.158.126.154      0.0.0.0/0
  output = `iptables -nL WHITELIST-IP -t filter`
  unless $?.success?
    raise &quot;failed to run iptables command successfully&quot;
  end

  output.each_line do |l|
    if l =~ /^ACCEPT/
      parts = l.split(/\s+/, 5)
      return parts[3]
    end
  end
  nil
end

def update_iptables(new_address)
  puts &quot;Switching the whitelist IP to #{new_address}&quot;
  system &quot;iptables -F WHITELIST-IP&quot;
  system &quot;iptables -A WHITELIST-IP -s #{new_address} -j ACCEPT&quot;
end

begin
  current_ip = Resolv.getaddress(hostname)
  whitelist_ip = existing_iptables_address
  if whitelist_ip.nil? || current_ip != whitelist_ip
    update_iptables current_ip
  end
rescue Resolv::ResolvError =&gt; e
  puts &quot;Failed to get the address: #{e}&quot;
end
</code></pre>
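
<p>To schedule it, a root cron entry along these lines would keep the whitelist in sync (the schedule and script path are illustrative):</p>

<pre><code>*/5 * * * * /usr/bin/ruby /usr/local/bin/whitelist_home_ip.rb
</code></pre>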
]]>
      </description>
      <guid>https://appsintheopen.com/posts/57-whitelisting-a-dynamic-ip-in-with-iptables</guid>
    </item>
    <item>
      <title>Docker and the iptables firewall</title>
      <link>https://appsintheopen.com/posts/56-docker-and-the-iptables-firewall</link>
      <description>
<![CDATA[<p>Docker likes to make things simple. If you expose a port on a host, then by default it is open to anything which can connect to the host, even if the host firewall by default drops all incoming requests. Many people have been surprised and burned by this over the years. Docker&#39;s effect on iptables is <a href="https://docs.docker.com/network/iptables/">documented</a>, but the documentation doesn&#39;t make it super clear that if your firewall is set to drop by default, docker exposed services are still publicly accessible.</p>

<p>To understand how Docker bypasses the firewall, we need to look into how iptables works.</p>

<h2>Tables and Filter Chains</h2>

<p>Iptables has a concept of tables and filter chains. A table can have a series of chains within it, and the chains can have filter rules, which can accept, drop or reject packets.</p>

<p>A packet first enters the RAW table, then the MANGLE table. On my fairly default system, there are no rules in either of these tables. Next it hits the NAT table. Docker is running on this host, and here we can see where Docker inserts its first rule, in the PREROUTING chain, directing all traffic into the DOCKER chain. Within the DOCKER chain we can see rules which correspond to ports exposed on running containers, sending the traffic to DNAT:</p>

<pre><code>$ sudo iptables --line-numbers -n -L -t nat
Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination         
1    DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
num  target     prot opt source               destination         
1    MASQUERADE  all  --  172.25.0.0/16        0.0.0.0/0           
2    MASQUERADE  all  --  172.17.0.0/16        0.0.0.0/0           
3    MASQUERADE  tcp  --  172.25.0.3           172.25.0.3           tcp dpt:443
4    MASQUERADE  tcp  --  172.25.0.3           172.25.0.3           tcp dpt:80
5    MASQUERADE  tcp  --  172.25.0.8           172.25.0.8           tcp dpt:8080

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination         
1    DOCKER     all  --  0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain DOCKER (2 references)
num  target     prot opt source               destination         
1    RETURN     all  --  0.0.0.0/0            0.0.0.0/0           
2    RETURN     all  --  0.0.0.0/0            0.0.0.0/0           
3    DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:443 to:172.25.0.3:443
4    DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 to:172.25.0.3:80
5    DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.25.0.8:8080
</code></pre>

<p>After traversing the NAT table, the packets enter the FILTER table. Traffic handled by NAT skips the usual INPUT chain, which is normally where incoming packets land, and goes to the FORWARD chain instead. This explains why the usual firewall rules applied to the INPUT chain in the FILTER table are bypassed by Docker. Looking at the filter table, we can see docker has inserted chains and rules in the FORWARD chain:</p>

<pre><code>$ sudo iptables --line-numbers  -L -t filter 
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         

Chain FORWARD (policy DROP)
num  target     prot opt source               destination         
1    DOCKER-USER  all  --  anywhere             anywhere            
2    DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere            
3    ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
4    DOCKER     all  --  anywhere             anywhere            
5    ACCEPT     all  --  anywhere             anywhere            
6    ACCEPT     all  --  anywhere             anywhere            
7    ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
8    DOCKER     all  --  anywhere             anywhere            
9    ACCEPT     all  --  anywhere             anywhere            
10   ACCEPT     all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination         

Chain DOCKER (2 references)
num  target     prot opt source               destination         
1    ACCEPT     tcp  --  anywhere             172.25.0.3           tcp dpt:https
2    ACCEPT     tcp  --  anywhere             172.25.0.3           tcp dpt:http
3    ACCEPT     tcp  --  anywhere             172.25.0.8           tcp dpt:webcache

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
num  target     prot opt source               destination         
1    DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
2    DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
3    RETURN     all  --  anywhere             anywhere            

Chain DOCKER-ISOLATION-STAGE-2 (2 references)
num  target     prot opt source               destination         
1    DROP       all  --  anywhere             anywhere            
2    DROP       all  --  anywhere             anywhere            
3    RETURN     all  --  anywhere             anywhere            

Chain DOCKER-USER (1 references)
num  target     prot opt source               destination         
1    RETURN     all  --  anywhere             anywhere 
</code></pre>

<p>As documented, the traffic is first sent to the DOCKER-USER chain, where we have a chance to add custom rules, then into DOCKER-ISOLATION-STAGE-1 and then later into the DOCKER chain where we see the traffic gets accepted on our exposed containers / ports. Note in the above output it looks like there are duplicate rules, but changing the command to <code>iptables --line-numbers  -vL -t filter</code> shows there are some extra conditions attached to these rules, so they are not really duplicates.</p>

<p>Now that we know how Docker works, we can devise a way to lock down the firewall using the DOCKER-USER chain.</p>

<h2>Custom Firewall Rules</h2>

<p>Ideally, we would like one set of rules which can be applied to both Docker containers and other non-Docker services running on the host. To do that, we can create a new FILTERS chain. From the DOCKER-USER chain, we jump into the FILTERS chain to apply our rules; if no rule matches, the traffic is denied by default.</p>

<p>First, jump to the FILTERS chain from DOCKER-USER for all traffic arriving on the external interface (ens3 here):</p>

<pre><code>-A DOCKER-USER -i ens3 -j FILTERS
</code></pre>

<p>Inside FILTERS, allow the ports we want to open, then drop everything else. We no longer need to worry about the interface, as we only jump to FILTERS for traffic arriving on ens3.</p>

<p>Note that we use the connection tracking module to match on the original destination port. Docker can publish port 80 externally and forward it to port 8080 inside the container. Without connection tracking, the rule would fail to match, as by this point the destination port has already been rewritten to 8080:</p>

<pre><code>-A FILTERS -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 22 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 80 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 443 -j ACCEPT
-A FILTERS -j REJECT --reject-with icmp-host-prohibited
</code></pre>

<p>If you wish, you can also add a rule to the INPUT chain to jump to FILTERS, reusing the same rules.</p>

<h2>Complete Firewall Script</h2>

<p>Individual rules are great, but how can we put this into a full firewall script? Iptables allows its rules to be saved in a text file, and then restored. We can use that feature to create a firewall script which we can reload as required.</p>

<pre><code># ens3 is the external interface. Adjust accordingly if the external 
# interface has a different name.

*filter

# Lines beginning with : are chain creation
:FILTERS - [0:0]
:WHITELIST-IP - [0:0]
:DOCKER-USER - [0:0]

# -F (flush) deletes all rules in the chain.
-F DOCKER-USER
-F WHITELIST-IP
-F FILTERS

# External interface is ens3, so send all traffic to filters.
-A DOCKER-USER -i ens3 -j FILTERS

-A FILTERS -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
# Will be updated separately with a whitelist IP
-A FILTERS -j WHITELIST-IP
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 22 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 80 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 443 -j ACCEPT
-A FILTERS -j REJECT --reject-with icmp-host-prohibited

COMMIT
</code></pre>

<p>To load these firewall rules, run <code>iptables-restore -n /etc/iptables.conf</code>. The <code>-n</code> flag is important, as otherwise the restore command will flush all existing firewall rules. With <code>-n</code>, nothing is flushed unless the script explicitly requests it, which means this script will not affect rules in other tables and chains, e.g. those added by Docker.</p>

<p>The rules above only impact Docker containers, and it should be possible to load and reload them without impacting Docker itself, or any other firewall rules on the system.</p>

<p>Alternatively, a complete firewall script can be created that covers both Docker and other services running on the host. This allows the INPUT chain and the DOCKER-USER chain to share the same FILTERS chain, so the same ports are exposed for Docker containers and for services running outside of Docker. It also ensures that external traffic is dropped by default unless a rule explicitly allows it:</p>

<pre><code># ens3 is the external interface. Adjust accordingly if the external 
# interface has a different name.

*filter
# Lines beginning with : are chain creation
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:WHITELIST-IP - [0:0]
:FILTERS - [0:0]
:DOCKER-USER - [0:0]

# -F (flush) deletes all rules in the chain.
-F INPUT
-F DOCKER-USER
-F WHITELIST-IP
-F FILTERS
-F OUTPUT

# Accept all traffic from localhost
-A INPUT -i lo -j ACCEPT
# Note this will filter both internal and external interfaces.
# Add &quot;-i ens3&quot; (where ens3 is the external interface) to the
# following rule to restrict filtering to external traffic only.
-A INPUT -j FILTERS

# Filter only docker traffic arriving on the external interface ens3
-A DOCKER-USER -i ens3 -j FILTERS

# Open ports on the host
-A FILTERS -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
# Will be updated separately with a whitelist IP
-A FILTERS -j WHITELIST-IP
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 22 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 80 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 81 -j ACCEPT
-A FILTERS -m tcp -p tcp -m conntrack --ctorigdstport 443 -j ACCEPT
-A FILTERS -j REJECT --reject-with icmp-host-prohibited

COMMIT
</code></pre>

<h2>What About Boot Time?</h2>

<p>We can make sure these rules are added at boot time by creating a simple Systemd unit file to run the restore when the system starts up. Create a file <code>/lib/systemd/system/firewall-rules.service</code>:</p>

<pre><code>[Unit]
Description=Restore custom firewall rules
Before=network-pre.target
Wants=network-pre.target
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/sbin/iptables-restore -n /etc/firewall-rules.conf

[Install]
WantedBy=multi-user.target
</code></pre>

<p>Then enable and start it. Newer versions of systemd can do both in one command; otherwise enable and start it separately:</p>

<pre><code>$ sudo systemctl enable --now firewall-rules

OR

$ sudo systemctl enable firewall-rules
$ sudo systemctl start firewall-rules
</code></pre>

<p>If you need to change the firewall, simply edit the script, and run:</p>

<pre><code>$ sudo systemctl restart firewall-rules
</code></pre>

<p>Note that my system originally had firewalld running on the host, and it clobbered these rules even if I had set it to start before these rules were applied. As I did not need firewalld, I simply disabled it and went with the setup here instead.</p>

<p><a href="/posts/57-whitelisting-a-dynamic-ip-in-with-iptables">This post</a> explains how I am using the WHITELIST-IP chain.</p>

<h2>References</h2>

<p><a href="https://unrouted.io/2017/08/15/docker-firewall/">https://unrouted.io/2017/08/15/docker-firewall/</a></p>

<p><a href="https://github.com/docker/docs/issues/8087">https://github.com/docker/docs/issues/8087</a></p>

<p><a href="https://www.booleanworld.com/depth-guide-iptables-linux-firewall/">https://www.booleanworld.com/depth-guide-iptables-linux-firewall/</a></p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/56-docker-and-the-iptables-firewall</guid>
    </item>
    <item>
      <title>How to stop your iphone charging at a defined percentage</title>
      <link>https://appsintheopen.com/posts/55-how-to-stop-your-iphone-charging-at-a-defined-percentage</link>
      <description>
<![CDATA[<p>A lot has been said about managing battery health on laptops, phones and even electric cars. There seems to be some consensus that charging to 100% all the time puts stress on the battery. Many electric car manufacturers allow the charger to stop at a user defined level, and Apple even introduced optimized charging on its phones and MacBooks.</p>

<p>From reading around the topic and how Apple&#39;s optimized charging works, it doesn&#39;t seem to be only charging to 100% that stresses the battery, but taking it to 100% and then holding it there by leaving the phone plugged in for a long duration.</p>

<p>You can see this in action on a Macbook that is usually plugged in. Sometimes the battery will drain down to 80% and stay there for some time, or charging will stop at 80%.</p>

<p>Once an iPhone learns your normal usage patterns, when plugged in overnight it tends to charge to 80% and then holds back the final 20% until your normal &quot;unplug&quot; time.</p>

<p>Apple doesn&#39;t give any fine grained control over this process. For example, it might be nice to have a &quot;max charge&quot; setting, but it isn&#39;t available. On laptops there are 3rd party applications you can install to set the limits but on the iPhone, nothing exists.</p>

<p>My last iPhone was an iPhone 7, which I often had plugged in during the day for tethering, or when at my desk and left plugged in overnight. The second battery barely lasted 2 years before it was losing charge very quickly.</p>

<p>After replacing the iPhone 7 with a new iPhone 13, I wondered how I could stop it charging at 80%.</p>

<p>Apple provides no way to configure the charge limit, and no 3rd party Apps can do it either.</p>

<p>Then I came across <a href="https://chargie.org/">Chargie</a>. This is a clever device you plug into your charger, and then plug the charging wire into it. It connects to the phone via bluetooth and monitors the current battery level. When the battery level reaches a configured percentage, it switches off the charging. At about 35 euro, it seemed a bit expensive for what I wanted.</p>

<p>After some more research, I discovered <a href="https://support.apple.com/en-gb/guide/shortcuts/apd690170742/ios">Personal Automation / Shortcuts</a>. It is possible to create an automation that is triggered when the charge level crosses a threshold. Already owning several <a href="https://www.tapo.com/uk/product/smart-plug/tapo-p100/">Tapo Smart Plugs</a> I discovered that if the Tapo app is installed, an automation action can turn on or off a Tapo plug. So if you plug your charger into the smart plug, you can stop charging at whatever percentage you like.</p>

<p>Assuming you already have the Tapo app installed and your smart plug configured, go to the Shortcuts app and select Automation from the bottom middle. Click the + symbol to create a new automation and select Create Personal Automation. Find &quot;Battery Level&quot;, select &quot;Rises Above&quot; and set your desired percentage. On the next screen, click &quot;Add Action&quot; and go to the Apps tab. Find the Tapo app, choose &quot;Turn on/off a device&quot; and pick the plug you want to use.</p>

<p>After creating the automation, edit it, and deselect &quot;Ask Before Running&quot; so it runs automatically. Now your charger should switch off when the automation runs!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/55-how-to-stop-your-iphone-charging-at-a-defined-percentage</guid>
    </item>
    <item>
      <title>Resetting ByteBuffers to zero in Java</title>
      <link>https://appsintheopen.com/posts/53-resetting-bytebuffers-to-zero-in-java</link>
      <description>
<![CDATA[<p>I have an application that makes use of ByteBuffers to buffer and process data read from various sources. I wanted to answer the question: when you need a clean ByteBuffer, which is faster:</p>

<ul>
<li>Allocate a new buffer, and allow the existing one to be garbage collected</li>
<li>Reset the position on the existing buffer and zero out the contents</li>
</ul>

<p>It&#39;s obviously better to avoid both of the above. Just clear the buffer (reset the position to zero and the limit to the capacity), fill it with available bytes and then only use from zero to the filled position. That allows a buffer to be reused without any expensive operations. There may be times when a new buffer is easier, or an existing buffer needs to be zero padded to some limit, so it&#39;s useful to know the fastest way to do this.</p>
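<p>As a minimal sketch of that reuse pattern (the channel and input data here are just stand-ins for whatever source the buffer is normally filled from):</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ReuseSketch {
  public static void main(String[] args) throws Exception {
    ByteBuffer buf = ByteBuffer.allocate(8);
    // Stand-in data source; in a real application this would be a file or socket.
    ReadableByteChannel ch = Channels.newChannel(
        new ByteArrayInputStream(new byte[]{1, 2, 3}));

    buf.clear();             // position = 0, limit = capacity; contents are untouched
    int read = ch.read(buf); // fill with whatever bytes are available
    buf.flip();              // limit = bytes read, position = 0; use only this range

    System.out.println(read + " " + buf.limit()); // prints "3 3"
  }
}
```

<p>No zeroing happens anywhere in that cycle, which is why it is the cheap path.</p>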

<h2>Zeroing Methods</h2>

<p>In these tests, I am using the java.nio.ByteBuffer class, allocating 6x1MB buffers in an array as follows:</p>

<pre><code>  public static ByteBuffer[] allocateBuffers(int count, int ofSize) {
    ByteBuffer[] buf = new ByteBuffer[count];
    for (int i=0; i&lt;count; i++) {
      buf[i] = ByteBuffer.allocate(ofSize);
    }
    return buf;
  }
</code></pre>

<p>Then I have a few different ways of zeroing the buffer:</p>

<pre><code>  // Note if zeroing on a single buffer, then you may as well
  // allocate a new one, as this method needs to allocate 1 new buffer
  // to use to zero all the others.
  public static void zeroBuffers(ByteBuffer[] buf) {
    ByteBuffer newBuf = ByteBuffer.allocate(buf[0].capacity());
    for (ByteBuffer b : buf) {
      b.position(0);
      newBuf.position(0);
      b.put(newBuf);
      b.position(0);
    }
  }

  public static void zeroBuffersByte(ByteBuffer[] buf) {
    for (ByteBuffer b : buf) {
      b.position(0);
      while (b.hasRemaining()) {
        b.put((byte)0);
      }
      b.position(0);
    }
  }

  // Note will not work correctly if the buffer is not an exact multiple of 1024,
  // but its good enough for a benchmark test
  public static void zeroBuffersByteArray(ByteBuffer[] buf) {
    byte[] bytes = new byte[1024];
    for (ByteBuffer b : buf) {
      b.position(0);
      while (b.hasRemaining()) {
        b.put(bytes);
      }
      b.position(0);
    }
  }

  public static void zeroBuffersArray(ByteBuffer[] buf) {
    for (ByteBuffer b : buf) {
      Arrays.fill(b.array(), (byte)0);
      b.position(0);
    }
  }
</code></pre>

<p>Finally I ran some benchmark code, which performs the following tests:</p>

<ul>
<li>Allocate a new array of 6x1MB buffers, rather than zeroing the existing ones</li>
<li>Reset an existing set of buffers by allocating one new ByteBuffer and using it to zero all others</li>
<li>Simply writing one zero byte at a time to the buffer from 0 to its capacity</li>
<li>Allocate a single byte[] of 1024 and write it until the buffer is filled.</li>
<li>Obtain the internal array from the byte buffer and use Arrays.fill() to fill it with zeros</li>
<li>Just reset the buffer position to zero, to compare how much faster that is.</li>
</ul>

<p>The results look like:</p>

<pre><code>Benchmark                                         Mode  Cnt          Score         Error  Units
BenchmarkBufferAllocate.allocateNewBuffers       thrpt    5       2306.443 ±     465.750  ops/s
BenchmarkBufferAllocate.zeroBufferWithBuffer     thrpt    5       2156.215 ±     436.713  ops/s
BenchmarkBufferAllocate.zeroBufferWithByte       thrpt    5        459.383 ±      77.800  ops/s
BenchmarkBufferAllocate.zeroBufferWithByteArray  thrpt    5       4170.109 ±     401.827  ops/s
BenchmarkBufferAllocate.zeroBuffersArray         thrpt    5       4985.363 ±     597.730  ops/s
BenchmarkBufferAllocate.resetPosition            thrpt    5  137490972.804 ± 2829717.621  ops/s
</code></pre>

<p>From these results, we can see that the Arrays.fill() method (zeroBuffersArray) is faster than any of the other zeroing approaches, and so is the preferred one. An additional advantage is that it would be equally efficient for a single ByteBuffer, as it does not allocate any new objects.</p>

<p>A simple position reset with no zeroing (resetPosition) is much faster than any other approach, and hence should be preferred where possible.</p>

<p>It is also interesting to add the flags &quot;-prof gc&quot; when running the benchmarks to see the memory allocation rate. Unsurprisingly, the options which allocate more objects perform many more memory allocations per second:</p>

<pre><code>BenchmarkBufferAllocate.allocateNewBuffers:·gc.alloc.rate                      thrpt    5       9335.400 ±    2415.879  MB/sec
BenchmarkBufferAllocate.zeroBufferWithBuffer:·gc.alloc.rate                    thrpt    5       1399.931 ±     319.943  MB/sec
BenchmarkBufferAllocate.zeroBufferWithByte:·gc.alloc.rate                      thrpt    5         ≈ 10⁻⁴                MB/sec
BenchmarkBufferAllocate.zeroBufferWithByteArray:·gc.alloc.rate                 thrpt    5          2.373 ±       1.312  MB/sec
BenchmarkBufferAllocate.zeroBuffersArray:·gc.alloc.rate                        thrpt    5         ≈ 10⁻⁴                MB/sec
BenchmarkBufferAllocate.resetPosition:·gc.alloc.rate                           thrpt    5         ≈ 10⁻⁴                MB/sec

</code></pre>

<p>For completeness, here is the benchmark code:</p>

<pre><code>import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.ByteBuffer;

import static java.util.concurrent.TimeUnit.MILLISECONDS;

public class BenchmarkBufferAllocate {

  @State(Scope.Benchmark)
  public static class BenchmarkState {
    public ByteBuffer[] buffers = ECValidateUtil.allocateBuffers(6, 1024*1024);

    @Setup(Level.Trial)
    public void setUp() {
    }

  }

  public static void main(String[] args) throws Exception {
    String[] opts = new String[2];
    opts[0] = &quot;-prof&quot;;
    opts[1] = &quot;gc&quot;;
    org.openjdk.jmh.Main.main(opts);
  }

  @Benchmark
  @Threads(1)
  @Warmup(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @Fork(value = 1, warmups = 0)
  @Measurement(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @BenchmarkMode(Mode.Throughput)
  public void allocateNewBuffers(Blackhole blackhole) throws Exception {
    ByteBuffer[] buffers = ECValidateUtil.allocateBuffers(6, 1024*1024);
    blackhole.consume(buffers);
  }

  @Benchmark
  @Threads(1)
  @Warmup(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @Fork(value = 1, warmups = 0)
  @Measurement(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @BenchmarkMode(Mode.Throughput)
  public void zeroBufferWithBuffer(Blackhole blackhole, BenchmarkState state) throws Exception {
    ECValidateUtil.zeroBuffers(state.buffers);
    blackhole.consume(state.buffers);
  }

  @Benchmark
  @Threads(1)
  @Warmup(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @Fork(value = 1, warmups = 0)
  @Measurement(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @BenchmarkMode(Mode.Throughput)
  public void zeroBufferWithByte(Blackhole blackhole, BenchmarkState state) throws Exception {
    ECValidateUtil.zeroBuffersByte(state.buffers);
    blackhole.consume(state.buffers);
  }

  @Benchmark
  @Threads(1)
  @Warmup(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @Fork(value = 1, warmups = 0)
  @Measurement(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @BenchmarkMode(Mode.Throughput)
  public void zeroBufferWithByteArray(Blackhole blackhole, BenchmarkState state) throws Exception {
    ECValidateUtil.zeroBuffersByteArray(state.buffers);
    blackhole.consume(state.buffers);
  }

  @Benchmark
  @Threads(1)
  @Warmup(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @Fork(value = 1, warmups = 0)
  @Measurement(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @BenchmarkMode(Mode.Throughput)
  public void zeroBuffersArray(Blackhole blackhole, BenchmarkState state) throws Exception {
    ECValidateUtil.zeroBuffersArray(state.buffers);
    blackhole.consume(state.buffers);
  }

  @Benchmark
  @Threads(1)
  @Warmup(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @Fork(value = 1, warmups = 0)
  @Measurement(iterations = 5, time = 1000, timeUnit = MILLISECONDS)
  @BenchmarkMode(Mode.Throughput)
  public void resetPosition(Blackhole blackhole, BenchmarkState state) throws Exception {
    ECValidateUtil.resetBufferPosition(state.buffers, 0);
    blackhole.consume(state.buffers);
  }

}
</code></pre>
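<p>One helper the benchmark calls, ECValidateUtil.resetBufferPosition, was not listed earlier. I have not reproduced the original here, but a minimal sketch matching how the benchmark uses it (just moving each buffer&#39;s position, with no zeroing) would be:</p>

```java
import java.nio.ByteBuffer;

public class ResetPositionSketch {
  // Hypothetical version of the resetBufferPosition helper used by the
  // resetPosition benchmark: set each buffer's position without touching contents.
  public static void resetBufferPosition(ByteBuffer[] buf, int position) {
    for (ByteBuffer b : buf) {
      b.position(position);
    }
  }

  public static void main(String[] args) {
    ByteBuffer[] bufs = { ByteBuffer.allocate(16), ByteBuffer.allocate(16) };
    bufs[0].put((byte) 1);        // advance the position
    resetBufferPosition(bufs, 0); // move it back without zeroing
    System.out.println(bufs[0].position()); // prints 0
  }
}
```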
]]>
      </description>
      <guid>https://appsintheopen.com/posts/53-resetting-bytebuffers-to-zero-in-java</guid>
    </item>
    <item>
<title>The Memory Overhead of Java Objects</title>
      <link>https://appsintheopen.com/posts/52-the-memory-overhead-of-java-ojects</link>
      <description>
        <![CDATA[<h1>The Size of Java Objects</h1>

<p>Note: Everything I tested here is on Java 8 running on Mac OS</p>

<p>Also Note: The memory overheads are most likely implementation, OS and version dependent, so this is really only a guide. If memory usage is important, you should do your own tests using your own objects and versions.</p>

<p>These days I work a lot on the HDFS part of <a href="https://hadoop.apache.org/">Hadoop</a>. HDFS is a &quot;big data&quot; file system which stores all the filesystem meta data in memory and never pages any of this data out to disk. It is not uncommon for the Namenode service within HDFS to run with 100 - 250gb of JVM heap configured, generally using the CMS GC.</p>

<p>When storing potentially 100s of millions of objects in memory like this, it can be important to consider the overhead a Java object can add to the actual data you want to store, and consider how to reduce it.</p>

<h2>References under and over 32GB heap</h2>

<p>Internally, Java uses references to store objects in various structures. When the JVM is running with less than 32GB of heap, these references are 4 bytes in size. For any heap of 32GB or greater, the references increase to 8 bytes. This means there is often little value in increasing your heap size from 30GB to 34GB, as the extra reference overhead will likely leave you with less usable memory than before. If you are already close to the 32GB boundary, any heap increase needs to be significant.</p>

<h2>Object Overhead</h2>

<p>Consider this simple program, which creates 10k empty objects:</p>

<pre><code>public class BasicObject {

  public static void main(String[] argv) throws InterruptedException {

    BasicObject[] objs = new BasicObject[10000];

    for (int i=0; i&lt;10000; i++) {
      objs[i] = new BasicObject();
    }

    Thread.sleep(1000000);
  }

}
</code></pre>

<p>Using jmap, we can dump the live objects on the heap:</p>

<pre><code>$ jmap -histo:live 27071

 num     #instances         #bytes  class name
----------------------------------------------
   1:         10000         160000  com.sodonnell.BasicObject
   2:          1070          96760  [C
   3:           488          55736  java.lang.Class
   4:             1          40016  [Lcom.sodonnell.BasicObject;
</code></pre>

<p>We have 10k instances of BasicObject, which occupy 160,000 bytes, or 16 bytes per object. Therefore, it is reasonable to conclude the object overhead on this platform is 16 bytes.</p>

<p>We can also see a single instance of an array to hold these objects, occupying 40016 bytes. Based on each object reference being 4 bytes, we can see a 16 byte overhead plus 4 bytes per entry. It is important to remember, a Java array occupies all the memory for its defined size, even if it is empty. This is easy to prove with the above program by removing the for loop and leaving the array empty.</p>

<p>If we extend the above program, to make BasicObject store a reference to another object, then we can see how that affects its memory usage:</p>

<pre><code>public class BasicObject {

  private Object other = new Object();

  public static void main(String[] argv) throws InterruptedException {

    BasicObject[] objs = new BasicObject[10000];

    for (int i=0; i&lt;10000; i++) {
      objs[i] = new BasicObject();
    }

    Thread.sleep(1000000);
  }

}


 num     #instances         #bytes  class name
----------------------------------------------
   1:         10035         160560  java.lang.Object
   2:         10000         160000  com.sodonnell.BasicObject
   3:          1070          96760  [C
   4:           488          55736  java.lang.Class
   5:             1          40016  [Lcom.sodonnell.BasicObject;

</code></pre>

<p>Ignoring the additional space used by the &quot;Object&quot; objects, the space used by BasicObject has not changed, which seems surprising at first.</p>

<p>Extending the program to store two objects, gives:</p>

<pre><code> num     #instances         #bytes  class name
----------------------------------------------
   1:         20035         320560  java.lang.Object
   2:         10000         240000  com.sodonnell.BasicObject
   3:          1070          96760  [C
   4:           488          55736  java.lang.Class
   5:             1          40016  [Lcom.sodonnell.BasicObject;
</code></pre>

<p>And 3 Objects:</p>

<pre><code> num     #instances         #bytes  class name
----------------------------------------------
   1:         30035         480560  java.lang.Object
   2:         10000         240000  com.sodonnell.BasicObject
   3:          1070          96760  [C
   4:           488          55736  java.lang.Class
   5:             1          40016  [Lcom.sodonnell.BasicObject;
</code></pre>

<p>An empty object needs 16 bytes, as does an object with a single reference. Storing 2 or 3 references pushes the usage up to 24 bytes.</p>

<p>A key detail of JVM memory is that it tends to round object sizes up to a multiple of 8 bytes. From this, we can conclude that the object overhead is really 12 bytes on this platform, which gets rounded up to 16. If we store a single reference in the object, it uses that otherwise wasted 4 bytes. Storing a second reference takes the size to 20 bytes, which gets rounded up to 24, allowing a third reference to use the wasted space, and so on.</p>
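<p>That rounding rule can be captured in a small helper; this is just a sketch based on the measurements above, assuming an 8 byte alignment:</p>

```java
public class AlignSketch {
  // Round a raw object size up to the next multiple of 8 bytes,
  // matching the alignment observed in the jmap measurements.
  static int align(int rawBytes) {
    return (rawBytes + 7) & ~7;
  }

  public static void main(String[] args) {
    System.out.println(align(12));     // 12 byte header only: prints 16
    System.out.println(align(12 + 4)); // header + one reference: prints 16
    System.out.println(align(12 + 8)); // header + two references: prints 24
  }
}
```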

<h3>Conclusion</h3>

<p>The memory overhead of an object is 12 bytes, plus 4 bytes for each object it stores a reference to. If the object holds any primitives, it will use the number of bytes the primitive occupies, eg:</p>

<ul>
<li>boolean - 1 byte</li>
<li>char / short - 2 bytes</li>
<li>int - 4 bytes</li>
<li>long - 8 bytes</li>
</ul>

<p>After adding up all these parts, the space is rounded up to a multiple of 8. As a test:</p>

<pre><code>public class BasicObject {

  private Object other = new Object();
  private int   myint = 1234;
  private short myshort = (short)1;
  private long  mylong = 1234567;

  public static void main(String[] argv) throws InterruptedException {

    BasicObject[] objs = new BasicObject[10000];

    for (int i=0; i&lt;10000; i++) {
      objs[i] = new BasicObject();
    }

    Thread.sleep(1000000);
  }

}
</code></pre>

<p>Running the above, we expect each instance of BasicObject to require 12 (overhead) + 4 (reference) + 4 (int) + 2 (short) + 8 (long) = 30 bytes, rounded to 32:</p>

<pre><code> num     #instances         #bytes  class name
----------------------------------------------
   1:         10000         320000  com.sodonnell.BasicObject
</code></pre>

<h2>What About Arrays?</h2>

<p>Looking at an array, it turns out to be a special type of object. We can see the array usage above is always 16 bytes plus 4 bytes per reference. This equates to 12 bytes for the object overhead, plus a 4 byte integer to hold the array size, giving a 16 byte overhead.</p>

<p>If you create an array of primitives, instead of 4 bytes per entry, it will instead be the size of the primitive:</p>

<ul>
<li>boolean - 1 byte</li>
<li>char / short - 2 bytes</li>
<li>int - 4 bytes</li>
<li>long - 8 bytes</li>
</ul>

<p>It is more difficult to see from jmap what space a 10,000 element array filled with longs is using, but we get:</p>

<pre><code> num     #instances         #bytes  class name
----------------------------------------------
   1:          1070          96760  [C
   2:             2          80064  [J
   3:           487          55632  java.lang.Class
   4:            23          28496  [B
   5:           528          26424  [Ljava.lang.Object;

</code></pre>

<p>It appears to be the &quot;[J&quot; class which represents this array object. There are two instances of it, and from earlier tests, without this array of longs, there seems to be an existing entry used for some internal purposes holding 48 bytes:</p>

<pre><code> 103:             1             48  [J
</code></pre>

<p>This means the array of 10k longs uses 80016 bytes, which is as expected - 16 byte overhead plus 8 bytes per entry.</p>
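<p>The same arithmetic can be expressed as a quick sketch (the 16 byte array overhead comes from the measurements above):</p>

```java
public class ArraySizeSketch {
  // Estimated heap usage of a long[] on this platform:
  // 12 byte object header + 4 byte length field + 8 bytes per element.
  static long longArrayBytes(int length) {
    return 16L + 8L * length;
  }

  public static void main(String[] args) {
    System.out.println(longArrayBytes(10_000)); // prints 80016, matching the jmap output
  }
}
```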

<h2>Hashes</h2>

<p>Now that we know the overhead of an object, we can consider a HashMap by storing our simple object into one:</p>

<pre><code>public class BasicObject {

  public static void main(String[] argv) throws InterruptedException {

    Map&lt;BasicObject, BasicObject&gt; hmap = new HashMap&lt;BasicObject, BasicObject&gt;();

    for (int i=0; i&lt;10000; i++) {
      BasicObject o = new BasicObject();
      hmap.put(o, o);
    }

    Thread.sleep(1000000);
  }

}


 num     #instances         #bytes  class name
----------------------------------------------
   1:         10041         321312  java.util.HashMap$Node
   2:         10000         160000  com.sodonnell.BasicObject
   3:          1075          97064  [C
   4:            16          66816  [Ljava.util.HashMap$Node;
   5:           487          55632  java.lang.Class

</code></pre>

<p>We have our 10k BasicObjects and about 10k HashMap$Node objects, occupying 32 bytes each. There are also 16 arrays of HashMap$Node. Running a simple program that does not create any user defined HashMaps shows Java creates 15 of these in the background somewhere:</p>

<pre><code> num     #instances         #bytes  class name
----------------------------------------------
...
17:            15           1264  [Ljava.util.HashMap$Node;
</code></pre>

<p>Therefore we have an overhead of 32 bytes per entry in the Node objects, plus an array occupying 65552 bytes (66816 - 1264). We know this array has a 16 byte overhead and, as it stores object references, 4 bytes per entry. Therefore (65552 - 16) / 4 = 16384 entries, which is a power of 2.</p>

<p>This means that to store these 10k entries, our overhead is (320,000 + 65,552) / 10,000 ≈ 38 bytes per entry.</p>
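<p>As a quick check on that arithmetic (all figures are taken from the jmap output above):</p>

```java
public class HashMapOverheadSketch {
  static final int ENTRIES = 10_000;
  static final long NODE_BYTES = 32L * ENTRIES;   // 32 bytes per HashMap$Node
  static final long TABLE_BYTES = 66816L - 1264L; // user table minus the 15 background arrays

  // Slots in the backing table: 16 byte array overhead, then 4 bytes per reference.
  static long tableSlots() {
    return (TABLE_BYTES - 16) / 4;
  }

  static double overheadPerEntry() {
    return (NODE_BYTES + TABLE_BYTES) / (double) ENTRIES;
  }

  public static void main(String[] args) {
    System.out.println(tableSlots());       // prints 16384, a power of 2
    System.out.println(overheadPerEntry()); // prints 38.5552
  }
}
```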

<h3>How HashMap Works Internally</h3>

<p>A HashMap, by default, starts with an array of 16 entries, but this can be controlled by its constructor. This array is called a table.</p>

<p>An entry is stored into the table by taking its hashcode and truncating it into one of these 16 table entries.</p>

<p>As more entries are added, their hash code hopefully spreads them throughout the table evenly.</p>

<p>If two entries hash to the same table entry, the entries are chained using a linked list via the Node object.</p>

<p>This node object, which we know occupies 32 bytes per entry, holds a &quot;next&quot; reference for the linked list, the hash value, the key and the value being stored. From that we can see where the 32 byte overhead comes from:</p>

<pre><code>   static class Node&lt;K,V&gt; implements Map.Entry&lt;K,V&gt; {
        final int hash;
        final K key;
        V value;
        Node&lt;K,V&gt; next;
</code></pre>

<p>12 byte object overhead + 4 (int) + 4 (reference to key) + 4 (reference to value) + 4 (reference to next) = 28 rounded to 32.</p>

<p>When the table in the HashMap fills up to a certain point, the table is doubled in size and all the entries are re-hashed to their new slot in the table.</p>

<p>As an aside, Java 8 has a nice optimisation where a linked list that gets too long is converted into a tree (a red-black tree).</p>

<p>Knowing how the HashMap works, we can understand the space used by the array. It has space for 16384 entries, as it started with 16 slots, and was doubled several times as we loaded the HashMap. In the example with 10k entries, there is space in the table for more elements before it needs to be expanded to a 32k element array.</p>

<p>It&#39;s also worth noting that I used the same object as both the key and the value in this HashMap. Often an ID or other object is used as the key. As HashMap cannot use a primitive as the key, if you want to do lookups based on a long, for example, it needs to be wrapped in a Long object, giving another 24 bytes of overhead (8 for the long + 12 for the object header, rounded up to 24).</p>
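<p>For example, using a long as the lookup key forces autoboxing into a java.lang.Long (the map contents here are made up for illustration):</p>

```java
import java.util.HashMap;
import java.util.Map;

public class BoxedKeys {
    static Map<Long, String> sampleMap() {
        Map<Long, String> byId = new HashMap<>();
        // 42L is autoboxed into a java.lang.Long here, costing roughly
        // 24 bytes (8 byte value + 12 byte header, rounded up) on top
        // of the 32 byte Node overhead discussed above.
        byId.put(42L, "value");
        return byId;
    }

    public static void main(String[] args) {
        System.out.println(sampleMap().get(42L));
    }
}
```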

<h3>Conclusion</h3>

<p>The overhead of storing an object into a HashMap is about 38 bytes per entry, provided the same object acts as both the key and the value. If a different object is the key, the overhead will be higher, but often this other object would exist anyway, making little difference.</p>

<h2>Sorted Hash - TreeMap</h2>

<p>A TreeMap is a structure which implements the Java Map interface, but stores the objects sorted in order of their keys. Running the same program using it gives:</p>

<pre><code> num     #instances         #bytes  class name
----------------------------------------------
   1:         10000         400000  java.util.TreeMap$Entry
   2:         10000         160000  com.sodonnell.BasicObject
</code></pre>

<p>This gives an overhead of 40 bytes per entry, provided the same object is used as both the key and the value.</p>

<h2>HashSet</h2>

<p>You may think that using a HashSet will reduce the overhead compared to a HashMap, if you are storing the same object as the key and the value. However, behind the scenes, Java uses a HashMap to implement a HashSet, setting a dummy object as the value. Therefore the overhead of a HashSet is exactly the same as that of a HashMap. In theory, HashSet could use 8 bytes less by using a different object for the Map.Entry, as it does not need the value, but I guess for most cases this is not an enhancement worth making.</p>
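<p>A minimal sketch of that delegation (illustrative - the real java.util.HashSet works the same way, sharing one dummy value object across every entry):</p>

```java
import java.util.HashMap;

// Mirrors the delegation inside java.util.HashSet: every key maps to
// one shared dummy object, so each Node still stores a value reference.
public class SimpleHashSet<E> {
    private static final Object PRESENT = new Object(); // shared dummy value
    private final HashMap<E, Object> map = new HashMap<>();

    public boolean add(E e)      { return map.put(e, PRESENT) == null; }
    public boolean contains(E e) { return map.containsKey(e); }
    public int size()            { return map.size(); }
}
```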

<h2>Can we do better?</h2>

<p>For most applications, saving a few bytes of memory per entry is just not worth the hassle. As I pointed out earlier, the HDFS namenode needs to store potentially hundreds of millions of entries in a sorted hash table like structure. With a TreeMap, at 40 bytes per entry, this can easily add up to several GB of memory.</p>

<h3>FoldedTreeSet</h3>

<p>Within Hadoop lives a data structure called a <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/FoldedTreeSet.java">FoldedTreeSet</a>, which somewhat mirrors the Java TreeMap. It implements the SortedSet interface and uses a red-black tree like TreeMap does. Instead of storing each entry inside an object with 40 bytes of overhead, it allocates a 64 element array for each tree node, and stores up to 64 entries within the array.</p>

<p>Given a 64 element array uses 64 * 4 + 12 + 4 = 272 bytes, and each tree node uses 56 bytes, the best case overhead, assuming all array entries are filled, is (272 + 56) / 64 = 5.125 bytes per entry.</p>

<p>Compared to a TreeMap, this would save about 3GB of memory per 100M objects.</p>
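<p>The arithmetic above can be sanity checked with a few lines (a sketch using the sizes assumed in this post, not measured values):</p>

```java
public class FoldedOverhead {
    // Sizes taken from the discussion above, not measured.
    static final int ARRAY_BYTES = 64 * 4 + 12 + 4; // 272: 64 refs + object header + length
    static final int NODE_BYTES  = 56;              // per tree node

    static double overheadPerEntry(int entriesInNode) {
        return (ARRAY_BYTES + NODE_BYTES) / (double) entriesInNode;
    }

    public static void main(String[] args) {
        double best = overheadPerEntry(64); // 5.125 bytes when the array is full
        double savedPer100M = (40.0 - best) * 100_000_000 / (1024.0 * 1024 * 1024);
        System.out.printf("best case: %.3f bytes/entry, saving vs TreeMap per 100M entries: %.1f GB%n",
                best, savedPer100M);
    }
}
```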

<h3>LightWeightResizableGSet</h3>

<p>Also within Hadoop is a structure called <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightResizableGSet.java">LightWeightResizableGSet</a>. This provides an interface like Set, except that it also allows a get() operation, which Set does not. It is therefore somewhat like a HashMap where the key and the value are the same object, exposed through a Set style interface.</p>

<p>Rather than wrapping the object to be stored in another object (which is what HashMap does), it keeps an array to use as the hash table, and stores the objects directly in the table. To handle collisions, it forces the stored objects to implement an interface called &quot;LinkedElement&quot;, allowing them to be chained into a singly linked list.</p>

<p>This means that the overhead of GSet is either 0, 4 or 8 bytes added to the stored object for the link reference (depending on rounding to the next 8 byte boundary), plus the overhead of the array. Assuming the array is 50% filled, that is about 8 bytes per entry (4 for the reference, plus 4 wasted at 50% filled), giving a total overhead somewhere between 4 and 16 bytes. The array usage would be approximately the same in HashMap, so the saving vs HashMap (32 bytes per Node entry) is between 24 and 32 bytes, depending on rounding to the 8 byte object boundary.</p>
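<p>A stripped-down sketch of the intrusive approach described above (the Item class and the fixed 16 slot table are hypothetical - the real LightWeightResizableGSet also resizes its table as it fills):</p>

```java
// Sketch of an intrusive hash table: stored objects carry their own
// "next" link, so no per-entry wrapper object is needed.
public class IntrusiveSet {
    interface LinkedElement {
        LinkedElement getNext();
        void setNext(LinkedElement next);
    }

    // Hypothetical stored object implementing the intrusive link.
    static final class Item implements LinkedElement {
        final long id;
        private LinkedElement next;
        Item(long id) { this.id = id; }
        public LinkedElement getNext() { return next; }
        public void setNext(LinkedElement n) { next = n; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Item && ((Item) o).id == id;
        }
    }

    private final LinkedElement[] table = new LinkedElement[16];

    private int slot(Object o) {
        return o.hashCode() & (table.length - 1);
    }

    public void put(LinkedElement e) {
        int i = slot(e);
        e.setNext(table[i]); // chain through the element itself
        table[i] = e;
    }

    public LinkedElement get(Object key) {
        for (LinkedElement e = table[slot(key)]; e != null; e = e.getNext()) {
            if (e.equals(key)) return e;
        }
        return null;
    }
}
```

<p>With this layout the only per-entry cost beyond the table array is the next field inside the stored object itself, which is where the 0 to 8 byte figure above comes from.</p>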
]]>
      </description>
      <guid>https://appsintheopen.com/posts/52-the-memory-overhead-of-java-ojects</guid>
    </item>
    <item>
      <title>Merge Empty HBase Regions</title>
      <link>https://appsintheopen.com/posts/51-merge-empty-hbase-regions</link>
      <description>
<![CDATA[<p>Sometimes you see HBase tables with ascending row keys (eg an increasing sequence, or a date stamp) used for storing time series data, where the table has a TTL set on the rows to age out old data. With a setup like this, new data will always be added to the last region in the table, which will split when it reaches a certain size, leading to a new last region.</p>

<p>As the older rows age out, you will end up with regions that are empty, but HBase will not automatically remove them, potentially leaving you with tables containing hundreds or even thousands of empty regions.</p>

<p>The only way to remove these regions is to merge adjacent regions over and over until you get the number down to a manageable level, but with many regions, this process can take a long time.</p>

<p>The following script can make this process much easier. What it does is:</p>

<ul>
<li>Get a list of all regions in the table</li>
<li>Then get a list of all non-empty regions, ie those with store files over 1MB</li>
<li>Remove the non empty regions from the first list</li>
<li>Make one pass over the empty regions, merging adjacent pairs so that the number of empty regions should be halved on each run</li>
</ul>

<p>After completing one run of the script, simply run it again to make another pass over the empty regions. It would be possible to make this script loop until all empty regions have been merged, but the following works and suits my needs OK.</p>

<p>To run the script, you should first run it in test mode so it will print out what it plans to do:</p>

<pre><code>hbase org.jruby.Main merge_empty_regions.rb namespace.tablename
</code></pre>

<p>When you are happy the output looks OK, add the &#39;merge&#39; option to actually do the merge:</p>

<pre><code>hbase org.jruby.Main merge_empty_regions.rb namespace.tablename merge
</code></pre>

<p>To test the script out, you can create a table with a given number of empty regions as follows:</p>

<pre><code>hbase(main):002:0&gt; create &#39;t1&#39;, &#39;f1&#39;, {NUMREGIONS =&gt; 15, SPLITALGO =&gt; &#39;HexStringSplit&#39;}
0 row(s) in 2.4890 seconds

hbase(main):004:0&gt; put &#39;t1&#39;, &#39;key1&#39;, &#39;f1:c1&#39;, &#39;val&#39;
0 row(s) in 0.1420 seconds

hbase(main):006:0&gt; flush &#39;t1&#39;
0 row(s) in 0.3560 seconds
</code></pre>

<p>Now we have a table with 15 regions, 14 of which are empty, but all are under 1MB, so all 15 will be considered for merging. Run the merge script in test mode:</p>

<pre><code>hbase org.jruby.Main merge_empty_regions.rb t1
...
Total Table Regions: 15
Total non empty regions: 0
Total regions to consider for Merge: 15
3cc4e2dc6fb5878aaf5eb7588ae367d3 is adjacent to d316a9284372859a94d97c3532c1bd85
3df36d4f3ffcec81e4119fb2cfc23401 is adjacent to 7fdfabff5e7391e51afd91a3a7bd3196
5d625b3d5c9761c8280c6d6a802e443a is adjacent to 21b363273911448aba8df7a1e9b4a13d
f1da23259ced30138aba15cf6fed5406 is adjacent to d9e5c55fe2990679ca2b9f0078af65f1
a42cc7197467665eff4ef4eadc5299ff is adjacent to 325b36c018a34c6ee524f99c0624d1d0
dfc021c2aedc58b3497529bc625ea2b1 is adjacent to 030d525ef0e9ae77180b2b8b9325f36c
b3cf0198943c7562bd05622bbb1227e2 is adjacent to ff556c5c60ff71ceb81c9b68133fa2cc
</code></pre>

<p>Finally merge the regions:</p>

<pre><code>hbase org.jruby.Main merge_empty_regions.rb t1 merge
...
Total Table Regions: 15
Total non empty regions: 0
Total regions to consider for Merge: 15
3cc4e2dc6fb5878aaf5eb7588ae367d3 is adjacent to d316a9284372859a94d97c3532c1bd85
Successfully Merged 3cc4e2dc6fb5878aaf5eb7588ae367d3 with d316a9284372859a94d97c3532c1bd85
3df36d4f3ffcec81e4119fb2cfc23401 is adjacent to 7fdfabff5e7391e51afd91a3a7bd3196
Successfully Merged 3df36d4f3ffcec81e4119fb2cfc23401 with 7fdfabff5e7391e51afd91a3a7bd3196
5d625b3d5c9761c8280c6d6a802e443a is adjacent to 21b363273911448aba8df7a1e9b4a13d
Successfully Merged 5d625b3d5c9761c8280c6d6a802e443a with 21b363273911448aba8df7a1e9b4a13d
f1da23259ced30138aba15cf6fed5406 is adjacent to d9e5c55fe2990679ca2b9f0078af65f1
Successfully Merged f1da23259ced30138aba15cf6fed5406 with d9e5c55fe2990679ca2b9f0078af65f1
a42cc7197467665eff4ef4eadc5299ff is adjacent to 325b36c018a34c6ee524f99c0624d1d0
Successfully Merged a42cc7197467665eff4ef4eadc5299ff with 325b36c018a34c6ee524f99c0624d1d0
dfc021c2aedc58b3497529bc625ea2b1 is adjacent to 030d525ef0e9ae77180b2b8b9325f36c
Successfully Merged dfc021c2aedc58b3497529bc625ea2b1 with 030d525ef0e9ae77180b2b8b9325f36c
b3cf0198943c7562bd05622bbb1227e2 is adjacent to ff556c5c60ff71ceb81c9b68133fa2cc
Successfully Merged b3cf0198943c7562bd05622bbb1227e2 with ff556c5c60ff71ceb81c9b68133fa2cc
</code></pre>

<p>The script to perform this merge is below:</p>

<pre><code># Test Mode:
#
# hbase org.jruby.Main merge_empty_regions.rb namespace.tablename
#
# Non Test - ie actually do the merge:
#
# hbase org.jruby.Main merge_empty_regions.rb namespace.tablename merge
#
# Note: Please replace namespace.tablename with your namespace and table, eg NS1.MyTable. This value is case sensitive.

require &#39;digest&#39;
require &#39;java&#39;
java_import org.apache.hadoop.hbase.HBaseConfiguration
java_import org.apache.hadoop.hbase.client.HBaseAdmin
java_import org.apache.hadoop.hbase.TableName
java_import org.apache.hadoop.hbase.HRegionInfo;
java_import org.apache.hadoop.hbase.client.Connection
java_import org.apache.hadoop.hbase.client.ConnectionFactory
java_import org.apache.hadoop.hbase.client.Table
java_import org.apache.hadoop.hbase.util.Bytes

def list_non_empty_regions(admin, table)
  cluster_status = admin.getClusterStatus()
  master = cluster_status.getMaster()
  non_empty = []
  cluster_status.getServers.each do |s|
    cluster_status.getLoad(s).getRegionsLoad.each do |r|
      # getRegionsLoad returns an array of arrays, where each array
      # is 2 elements

      # Filter out any regions that don&#39;t match the requested
      # tablename
      next unless r[1].get_name_as_string =~ /#{table}\,/
      if r[1].getStorefileSizeMB() &gt; 0
        if r[1].get_name_as_string =~ /\.([^\.]+)\.$/
          non_empty.push $1
        else
          raise &quot;Failed to get the encoded name for #{r[1].get_name_as_string}&quot;
        end
      end
    end
  end
  non_empty
end

# Handle command line parameters
table_name = ARGV[0]
do_merge = false
if ARGV[1] == &#39;merge&#39;
  do_merge = true
end

config = HBaseConfiguration.create();
connection = ConnectionFactory.createConnection(config);
admin = HBaseAdmin.new(connection);

non_empty_regions = list_non_empty_regions(admin, table_name)
regions = admin.getTableRegions(Bytes.toBytes(table_name));

puts &quot;Total Table Regions: #{regions.length}&quot;
puts &quot;Total non empty regions: #{non_empty_regions.length}&quot;

filtered_regions = regions.reject do |r|
  non_empty_regions.include?(r.get_encoded_name)
end

puts &quot;Total regions to consider for Merge: #{filtered_regions.length}&quot;

if filtered_regions.length &lt; 2
  puts &quot;There are not enough regions to merge&quot;
end

r1, r2 = nil
filtered_regions.each do |r|
  if r1.nil?
    r1 = r
    next
  end
  if r2.nil?
    r2 = r
  end
  # Skip any region that is a split region
  if r1.is_split()
    r1 = r2
    r2 = nil
    next
  end
  if r2.is_split()
    r2 = nil
    next
  end
  if HRegionInfo.are_adjacent(r1, r2)
    # only merge regions that are adjacent
    puts &quot;#{r1.get_encoded_name} is adjacent to #{r2.get_encoded_name}&quot;
    if do_merge
      admin.mergeRegions(r1.getEncodedNameAsBytes, r2.getEncodedNameAsBytes, false)
      puts &quot;Successfully Merged #{r1.get_encoded_name} with #{r2.get_encoded_name}&quot;
      sleep 2
    end
    r1, r2 = nil
  else
    # Regions are not adjacent, so drop the first one and iterate again
    r1 = r2
    r2 = nil
  end
end
admin.close
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/51-merge-empty-hbase-regions</guid>
    </item>
    <item>
      <title> Writing Data To HDFS From Java</title>
      <link>https://appsintheopen.com/posts/50-writing-data-to-hdfs-from-java</link>
      <description>
        <![CDATA[<p>When you want to write a file into HDFS, things are quite different from writing to a local file system. Writing to a file on any file system is an operation that can fail, but with HDFS there are many more potential problems than with a local file, so your code should be designed to handle failures.</p>

<p>At a very high level, when you want to open a file for write on HDFS, these are the steps the client must go through:</p>

<ol>
<li><p>Contact the namenode and tell it you want to create file /foo/bar</p></li>
<li><p>Assuming you have the relevant permissions, the Namenode will reply with the list of datanodes to write the data to.</p></li>
<li><p>The client will then open a TCP connection to the first datanode to establish a write pipeline</p></li>
<li><p>The first part of this write pipeline involves starting a thread within the datanode process (called an xciever). This thread will open a local file to store the data into, and it will also make a TCP connection to the next datanode in the pipeline, which will start up a similar xciever thread. Assuming a replication factor of 3, this second datanode will open a connection to the third and final datanode.</p></li>
<li><p>The client will start writing data to the first datanode, which will save it to disk and forward to the second datanode, which will also write to disk and forward to the third.</p></li>
<li><p>After the client has written the blocksize of data (128MB generally), this pipeline will be closed and the client will ask the namenode where to write the next block, and this process will repeat until the file is closed.</p></li>
</ol>

<p>There are a few not so obvious things to consider here:</p>

<ol>
<li><p>The TCP connection from the client to the first datanode and then the chain onto the second and third node will remain open indefinitely, assuming the client process stays alive, the file is not closed and no routers or firewalls are involved to drop the TCP connection.</p></li>
<li><p>While the file is still open, there is a thread, an open file and a TCP socket tied up on all three datanodes. So if you have a very large number of open files (either for read or write) against a small number of datanodes, the xciever thread limit (dfs.datanode.max.xcievers) and OS open file limits may need increasing.</p></li>
</ol>

<h2>Locking</h2>

<p>HDFS is an append only file system - you cannot seek to a position in a file and perform writes in random locations. For that reason it only makes sense that a single process can have a file open for writing at one time. To enforce this, the Namenode grants a lease to the process which opens the file for writing. If a second process attempts to open the same file for append or writing, an error like the following will be returned:</p>

<pre><code>org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/user/vagrant/testwriter] for [DFSClient_NONMAPREDUCE_-642075008_1] for client [192.168.33.6], because this file is already being created by [DFSClient_NONMAPREDUCE_-608672003_1] on [192.168.33.6]
</code></pre>

<p>When a client is writing a file, it is responsible for renewing the lease periodically. Even if a client has many files open, it only requires a single lease for all of them. The namenode tracks all the current leases, and if a client does not renew its lease for 1 minute, another process can forcibly take over the lease and open the file. After 1 hour of no updates from the client, the namenode assumes the client has died and closes the file automatically. Luckily, the HDFS client code takes care of lease renewal, so end users of the API don&#39;t need to worry about it, but it is important to be aware of it.</p>

<p>If your client process crashes, or exits in such a way that it does not close any HDFS files that are already open for writing, you may find that when the process is restarted it cannot reopen the file, giving the same error as above. Eventually (after about 1 hour) the namenode will close this file, but it may not be convenient to wait that long. To work around this, you can use the hdfs debug command to force the file closed:</p>

<pre><code>$ hdfs debug recoverLease -path /user/vagrant/testwriter  -retries 5
recoverLease SUCCEEDED on /user/vagrant/testwriter
</code></pre>

<p>This blog post on the <a href="http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/">hdfs block recovery process</a> gives a very good overview of the lease and what happens if all the block replicas on each node do not match.</p>

<h2>Simple Java Code for Writing to HDFS</h2>

<p>The following code snippet is all that is required to write to HDFS:</p>

<pre><code>package com.sodonnel.Hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriter {
  public static void main(String[] args) throws IOException, InterruptedException {

    Configuration configuration = new Configuration();
    FileSystem hdfs = FileSystem.get( configuration );
    Path file = new Path(&quot;/user/vagrant/testwriter&quot;);

    FSDataOutputStream os;

    if ( hdfs.exists( file )) {
      //hdfs.delete( file, true );
      os = hdfs.append( file ); 
    } else {
      os = hdfs.create( file );
    }
    // Note writeChars writes the string in UTF-16, so each
    // character occupies two bytes in the file
    os.writeChars(&quot;this is a string to write&quot;);

    os.close();
    hdfs.close();
   }
}
</code></pre>

<p>To run it, you can compile the code into a JAR, set the classpath to include the Hadoop jars and run it like a normal Java program:</p>

<pre><code>export CLASSPATH=$(hadoop classpath):HdfsWriter-1.0-SNAPSHOT.jar:.
java com.sodonnel.Hadoop.HdfsWriter
</code></pre>

<h2>Current File Size</h2>

<p>One side effect of how HDFS works is that you cannot really tell from the normal hadoop fs -ls command how large a file currently is while it is being written. The namenode decides which datanodes will receive the blocks, but it is not involved in tracking the data written to them, and it is only updated periodically. After poking through the DFSClient source and running some tests, there appear to be 3 scenarios where the namenode gets an update on the file size:</p>

<ol>
<li>When the file is closed</li>
<li>When a block is filled and a new one is created, the namenode size will be incremented by the blocksize. This means that the file will look like it is growing in block sized chunks.</li>
<li>The first time a sync / hflush operation is called on a block, it updates the size in the namenode too. If you write 1MB, and then sync it, the Namenode will report the size as 1MB. If you then write another 1MB and sync again, the Namenode will still report 1MB. Assuming a 128MB blocksize, the next update to size on the Namenode will be at 128MB, then 129MB, then 256MB and so on.</li>
</ol>

<p>You can see this process in action with a simple test program:</p>

<pre><code>public class HdfsWriter {
  public static void main(String[] args) throws IOException, InterruptedException {

    Configuration configuration = new Configuration();
    FileSystem hdfs = FileSystem.get( configuration );
    Path file = new Path(&quot;/user/vagrant/testwrites&quot;);

    int ONE_MB = 524288;

    FSDataOutputStream os;

    if ( hdfs.exists( file )) {
      hdfs.delete( file, true );
    }
    // Create a file with a rep factor of 1 and blocksize of 1MB
    os = hdfs.create( file, true, 2, (short)1, (long)1048576 );

    // Blocks size is 1MB, 524288 is 0.5MB, each write is 2 bytes
    // so 524288 writes will be 1MB, so in this case we will write 3MB
    // into 3 blocks in total
    for(int i = 0; i &lt; 3*ONE_MB; i++) {
      os.writeChars(&quot;a&quot;);

      // Sync every 256KB, but not at the 1MB boundaries
      if ( i &gt; 0 &amp;&amp; (i+1) % (ONE_MB/4) == 0 &amp;&amp; (i+1) % ONE_MB != 0 ) {
        System.out.println(&quot;SYNC&quot;);
        os.hsync();
      } 
      // Print the file size every 64KB written to see where it changes
      if (i &gt; 0 &amp;&amp; (i+1) % (ONE_MB / 16) == 0) {
        FileStatus[] files = hdfs.listStatus(file);
        System.out.println(&quot;Data written: &quot;+ (i+1)*2/1024.0/1024.0 +&quot; MB; Current file size: &quot;+ files[0].getLen()/1024.0/1024.0 +&quot; MB&quot;);
      } 
    }
    os.close();
    FileStatus[] files = hdfs.listStatus(file);
    System.out.println(&quot;Size on closing: &quot;+ files[0].getLen()/1024.0/1024.0 +&quot; MB&quot;);
    hdfs.close();
   }
}
</code></pre>

<p>This program creates a file with a 1MB block size, and then writes a single 2 byte character over and over to create a file exactly 3MB and 3 blocks in length. After writing each 64KB, we print out the bytes written and query the Namenode for the size of the file. Additionally, every 256KB we perform a file sync, unless we have written an even MB, which is a block boundary. The output looks like this:</p>

<pre><code>Data written: 0.0625 MB; Current file size: 0.0 MB
Data written: 0.125 MB; Current file size: 0.0 MB
Data written: 0.1875 MB; Current file size: 0.0 MB
SYNC
Data written: 0.25 MB; Current file size: 0.25 MB
Data written: 0.3125 MB; Current file size: 0.25 MB
Data written: 0.375 MB; Current file size: 0.25 MB
Data written: 0.4375 MB; Current file size: 0.25 MB
SYNC
Data written: 0.5 MB; Current file size: 0.25 MB
Data written: 0.5625 MB; Current file size: 0.25 MB
Data written: 0.625 MB; Current file size: 0.25 MB
Data written: 0.6875 MB; Current file size: 0.25 MB
SYNC
Data written: 0.75 MB; Current file size: 0.25 MB
Data written: 0.8125 MB; Current file size: 0.25 MB
Data written: 0.875 MB; Current file size: 0.25 MB
Data written: 0.9375 MB; Current file size: 0.25 MB
Data written: 1.0 MB; Current file size: 0.25 MB
Data written: 1.0625 MB; Current file size: 1.0 MB
Data written: 1.125 MB; Current file size: 1.0 MB
Data written: 1.1875 MB; Current file size: 1.0 MB
SYNC
Data written: 1.25 MB; Current file size: 1.25 MB
Data written: 1.3125 MB; Current file size: 1.25 MB
Data written: 1.375 MB; Current file size: 1.25 MB
Data written: 1.4375 MB; Current file size: 1.25 MB
SYNC
Data written: 1.5 MB; Current file size: 1.25 MB
Data written: 1.5625 MB; Current file size: 1.25 MB
Data written: 1.625 MB; Current file size: 1.25 MB
Data written: 1.6875 MB; Current file size: 1.25 MB
SYNC
Data written: 1.75 MB; Current file size: 1.25 MB
Data written: 1.8125 MB; Current file size: 1.25 MB
Data written: 1.875 MB; Current file size: 1.25 MB
Data written: 1.9375 MB; Current file size: 1.25 MB
Data written: 2.0 MB; Current file size: 1.25 MB
Data written: 2.0625 MB; Current file size: 2.0 MB
Data written: 2.125 MB; Current file size: 2.0 MB
Data written: 2.1875 MB; Current file size: 2.0 MB
SYNC
Data written: 2.25 MB; Current file size: 2.25 MB
Data written: 2.3125 MB; Current file size: 2.25 MB
Data written: 2.375 MB; Current file size: 2.25 MB
Data written: 2.4375 MB; Current file size: 2.25 MB
SYNC
Data written: 2.5 MB; Current file size: 2.25 MB
Data written: 2.5625 MB; Current file size: 2.25 MB
Data written: 2.625 MB; Current file size: 2.25 MB
Data written: 2.6875 MB; Current file size: 2.25 MB
SYNC
Data written: 2.75 MB; Current file size: 2.25 MB
Data written: 2.8125 MB; Current file size: 2.25 MB
Data written: 2.875 MB; Current file size: 2.25 MB
Data written: 2.9375 MB; Current file size: 2.25 MB
Data written: 3.0 MB; Current file size: 2.25 MB
Size on closing: 3.0 MB
</code></pre>

<p>We can see that the file looks like it has zero bytes until a sync is performed at 256KB. Then, despite two further syncs, the reported size stays at 256KB, proving that further syncs against the same block do not update the Namenode size. It then updates to 1MB at slightly over 1MB written. The namenode size is actually updated when the new block is created, but you may have to write up to 64KB over the block size before that happens, due to the default packet size of 64KB in DFSClient.java. The same pattern then repeats for the second and third blocks.</p>

<p>One thing that sometimes confuses users is that if you perform a get on the file while it is being written, the size of the file pulled from HDFS will be the actual size at the time it was pulled, and not the size reported by the namenode.</p>

<h2>When Datanodes Fail</h2>

<p>If you have a long running process writing data to HDFS, you need to be concerned with what happens when a datanode fails. In general each piece of data written to HDFS is persisted onto 3 datanodes in a write pipeline, as the default replication factor is 3, but it is possible to override this to a lower value if you wish.</p>

<p>With the default replication factor, if one datanode in the pipeline fails, the write should continue unaffected, and the client will produce a WARN message like the following:</p>

<pre><code>17/02/04 04:19:00 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1223970327-10.17.81.191-1483542430690:blk_1073767685_26861
java.io.IOException: Bad response ERROR for block BP-1223970327-10.17.81.191-1483542430690:blk_1073767685_26861 from datanode DatanodeInfoWithStorage[10.17.81.194:50010,DS-a1e673dd-4d4f-4c27-b437-2a61657e2c97,DISK]
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1022)
17/02/04 04:19:00 WARN hdfs.DFSClient: Error Recovery for block BP-1223970327-10.17.81.191-1483542430690:blk_1073767685_26861 in pipeline DatanodeInfoWithStorage[10.17.81.193:50010,DS-7049acbc-93ae-4574-9f39-2c6a9a0e81ac,DISK], DatanodeInfoWithStorage[10.17.81.194:50010,DS-a1e673dd-4d4f-4c27-b437-2a61657e2c97,DISK]: bad datanode DatanodeInfoWithStorage[10.17.81.194:50010,DS-a1e673dd-4d4f-4c27-b437-2a61657e2c97,DISK]
</code></pre>

<p>It is possible for two of the datanodes to fail and the write to continue, but if the third also fails, you will get an error like the following and the write will fail:</p>

<pre><code>17/02/04 04:39:27 ERROR hdfs.DFSClient: Failed to close inode 42255
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.17.81.193:50010,DS-7049acbc-93ae-4574-9f39-2c6a9a0e81ac,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1386)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
</code></pre>

<p>Remember that writes happen on a block by block basis, so if a couple of the datanodes fail and you are left with only one, that is only the case until the next block is started. Assuming there are still enough live nodes in the cluster, a new write pipeline containing 3 datanodes will then be established.</p>

<p>If you are writing a file with a replication factor of 3, and one or two of the datanodes fail, there will only be one complete replica of the affected block on the cluster. The namenode will notice this and create new replicas quite quickly, but until it does there is a higher than usual risk of data loss on that file.</p>

<h3>Replacing Failed Nodes</h3>

<p>The HDFS client code has some options to handle datanode failures during writes in different ways, known as the &quot;DataNode Replacement Policy&quot;. This <a href="http://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/">blog post</a> has some good information about the replacement policies. Assuming you are writing a file with the default replication factor, the general case is that if 1 datanode fails, the writes will continue with just 2 datanodes. However if another fails, then the client will attempt to replace one of the failed nodes. If there are not enough datanodes left in the cluster to do the replacement, then the write will still continue. There are options to make this more strict, in that the failed node must always be replaced or an exception will be thrown.</p>

<h3>Replication Factor 1</h3>

<p>If you create a file with replication factor 1, and the only datanode in the write pipeline crashes, then the file write will fail, either with one of the errors shown above, or one like the following:</p>

<pre><code>17/02/04 12:02:44 ERROR hdfs.DFSClient: Failed to close inode 17733
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/vagrant/testwrites could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1622)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3325)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:679)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:214)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:489)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
</code></pre>

<p>There isn&#39;t really much else that can happen in this scenario, as the datanode that failed will have a partially written block on it, and it will not be possible to replicate it onto another node if the original data source is no longer present.</p>

<p>Writing files with a replication factor of 1 can be faster, but there is a much higher risk of data loss, both during the write and later when the file needs to be read. Therefore only temporary files, or files that can be easily reconstructed from another source, should use a replication factor of 1.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/50-writing-data-to-hdfs-from-java</guid>
    </item>
    <item>
      <title>Setup Open LDAP on Centos 6</title>
      <link>https://appsintheopen.com/posts/49-setup-open-ldap-on-centos-6</link>
      <description>
<![CDATA[<p>Getting Open LDAP working on Centos 6 was a painful experience for me. The syntax of the config files and the way you load users etc - none of it seemed obvious. I got it working, at least as far as a POC environment goes, and have produced this post in the hope it saves someone else some pain.</p>

<h2>Install Packages</h2>

<pre><code>$ yum install -y openldap openldap-clients openldap-servers
</code></pre>

<h2>Config Files</h2>

<p>Some older posts on the web refer to /etc/slapd.conf, which doesn&#39;t seem to exist any more. All the config now lives under /etc/openldap/slapd.d in several cryptic files.</p>

<p>First, pick what domain you want your &#39;dc&#39; (domain component) to be; I am using appsintheopen.</p>

<p>Then generate a hash for the ldap root password (copy the value as you will need it in a moment):</p>

<pre><code>$ slappasswd
New password: 
Re-enter new password: 
{SSHA}ST60lR0GBhedeMm+70nJV00VzKjtyxwp
</code></pre>
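<p>For the curious, the {SSHA} value slappasswd prints is just a salted SHA-1 hash: base64(SHA1(password + salt) + salt). A short Python sketch (illustrative only - slappasswd does all of this for you) shows how such a hash is generated and verified:</p>

```python
import base64
import hashlib
import os

def ssha_hash(password, salt=None):
    # {SSHA} = base64(SHA1(password + salt) + salt)
    if salt is None:
        salt = os.urandom(4)
    digest = hashlib.sha1(password + salt).digest()
    return "{SSHA}" + base64.b64encode(digest + salt).decode("ascii")

def ssha_verify(password, hashed):
    raw = base64.b64decode(hashed[len("{SSHA}"):])
    digest, salt = raw[:20], raw[20:]  # a SHA-1 digest is 20 bytes
    return hashlib.sha1(password + salt).digest() == digest

hashed = ssha_hash(b"secret")
```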

<h3>/etc/openldap/slapd.d/cn=config/olcDatabase={2}bdb.ldif</h3>

<p>There are a few lines to add and change in this file.</p>

<p>Find the line for olcSuffix, and change it (or add if it does not exist):</p>

<pre><code># /etc/openldap/slapd.d/cn=config/olcDatabase={2}bdb.ldif

olcSuffix: dc=appsintheopen,dc=com
</code></pre>

<p>Find the line for olcRootDN and change it:</p>

<pre><code># /etc/openldap/slapd.d/cn=config/olcDatabase={2}bdb.ldif

olcRootDN: cn=Manager,dc=appsintheopen,dc=com
</code></pre>

<p>Add a line for the password generated above:</p>

<pre><code># /etc/openldap/slapd.d/cn=config/olcDatabase={2}bdb.ldif

olcRootPW: {SSHA}ST60lR0GBhedeMm+70nJV00VzKjtyxwp
</code></pre>

<p>Add the following two lines to the bottom of the file - this will let a user update their own password later:</p>

<pre><code># /etc/openldap/slapd.d/cn=config/olcDatabase={2}bdb.ldif

olcAccess: {0}to attrs=userPassword by self write by dn.base=&quot;cn=Manager,dc=appsintheopen,dc=com&quot; write by anonymous auth by * none
olcAccess: {1}to * by dn.base=&quot;cn=Manager,dc=appsintheopen,dc=com&quot; write by self write by * read
</code></pre>

<p>You can test the config is OK with the following command:</p>

<pre><code>$ slaptest -u
</code></pre>

<p>Ignore any checksum errors - they are expected because the files were edited by hand rather than through the config backend.</p>

<p>Now you can start up the ldap server (and configure it to come up at boot time):</p>

<pre><code>$ service slapd start
$ chkconfig --levels=345 slapd on
</code></pre>

<h2>Setup the Base Structures in LDAP</h2>

<p>Next, you need to create the overall organisation, users and groups entries in the ldap hierarchy. To do this, create a file like the following (/tmp/base.ldif). Note that the blank lines in the file are important!</p>

<pre><code>dn: dc=appsintheopen,dc=com
objectClass: dcObject
objectClass: organization
o: appsintheopen.com
dc: appsintheopen

dn: ou=users,dc=appsintheopen,dc=com
objectClass: organizationalUnit
objectClass: top
ou: users

dn: ou=groups,dc=appsintheopen,dc=com
objectClass: organizationalUnit
objectClass: top
ou: groups
</code></pre>

<p>Now add the config you created in that file:</p>

<pre><code>ldapadd -x -W -D &quot;cn=Manager,dc=appsintheopen,dc=com&quot; -f base.ldif
Enter LDAP Password:
adding new entry &quot;dc=appsintheopen,dc=com&quot;
adding new entry &quot;ou=users,dc=appsintheopen,dc=com&quot;
adding new entry &quot;ou=groups,dc=appsintheopen,dc=com&quot;
</code></pre>

<p>Now you can query the ldap server to ensure it returns the objects you just added:</p>

<pre><code>$ ldapsearch -x -W -D &quot;cn=Manager,dc=appsintheopen,dc=com&quot; -b &quot;dc=appsintheopen,dc=com&quot; &quot;(objectclass=*)&quot;
</code></pre>
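<p>If you ever want to sanity-check output like this programmatically, LDIF is simple enough to pick apart by hand. A minimal Python sketch (ignoring continuation lines and base64-encoded values, which real LDIF allows):</p>

```python
def parse_ldif(text):
    # Split on blank lines; each block of "attr: value" lines is one entry
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            attr, _, value = line.partition(": ")
            entry.setdefault(attr, []).append(value)
        entries.append(entry)
    return entries

sample = ("dn: ou=users,dc=appsintheopen,dc=com\n"
          "objectClass: organizationalUnit\n"
          "objectClass: top\n"
          "ou: users")
entries = parse_ldif(sample)
```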

<h2>Add some Users and Groups</h2>

<p>To add users and groups to the LDAP server, create another LDIF file and load it with the same ldapadd command as above. For instance, I will create three groups - staff, hadoop-users and hadoop-admin - then create three users and assign them to the groups.</p>

<pre><code>dn: cn=staff,ou=groups,dc=appsintheopen,dc=com
objectClass: top
objectClass: posixGroup
gidNumber: 1000

dn: cn=hadoop-users,ou=groups,dc=appsintheopen,dc=com
objectClass: top
objectClass: posixGroup
gidNumber: 1001

dn: cn=hadoop-admin,ou=groups,dc=appsintheopen,dc=com
objectClass: top
objectClass: posixGroup
gidNumber: 1002

dn: uid=sam,ou=users,dc=appsintheopen,dc=com
objectClass: top
objectClass: account
objectClass: posixAccount
objectClass: shadowAccount
cn: sam
uid: sam
uidNumber: 20000
gidNumber: 1000
homeDirectory: /home/sam
loginShell: /bin/bash
gecos: sam
userPassword: {crypt}x
shadowLastChange: 0
shadowMax: 0
shadowWarning: 0

dn: uid=bob,ou=users,dc=appsintheopen,dc=com
objectClass: top
objectClass: account
objectClass: posixAccount
objectClass: shadowAccount
cn: bob
uid: bob
uidNumber: 20001
gidNumber: 1000
homeDirectory: /home/bob
loginShell: /bin/bash
gecos: bob
userPassword: {crypt}x
shadowLastChange: 0
shadowMax: 0
shadowWarning: 0

dn: uid=jim,ou=users,dc=appsintheopen,dc=com
objectClass: top
objectClass: account
objectClass: posixAccount
objectClass: shadowAccount
cn: jim
uid: jim
uidNumber: 20002
gidNumber: 1000
homeDirectory: /home/jim
loginShell: /bin/bash
gecos: jim
userPassword: {crypt}x
shadowLastChange: 0
shadowMax: 0
shadowWarning: 0

dn: cn=staff,ou=groups,dc=appsintheopen,dc=com
changetype: modify
add: memberuid
memberuid: sam

dn: cn=staff,ou=groups,dc=appsintheopen,dc=com
changetype: modify
add: memberuid
memberuid: bob

dn: cn=staff,ou=groups,dc=appsintheopen,dc=com
changetype: modify
add: memberuid
memberuid: jim

dn: cn=hadoop-users,ou=groups,dc=appsintheopen,dc=com
changetype: modify
add: memberuid
memberuid: bob

dn: cn=hadoop-admin,ou=groups,dc=appsintheopen,dc=com
changetype: modify
add: memberuid
memberuid: jim

</code></pre>

<p>Now add the new users and groups:</p>

<pre><code>$ ldapadd -x -W -D &quot;cn=Manager,dc=appsintheopen,dc=com&quot; -f all.ldif 
Enter LDAP Password: 
adding new entry &quot;cn=staff,ou=groups,dc=appsintheopen,dc=com&quot;

adding new entry &quot;cn=hadoop-users,ou=groups,dc=appsintheopen,dc=com&quot;

adding new entry &quot;cn=hadoop-admin,ou=groups,dc=appsintheopen,dc=com&quot;

adding new entry &quot;uid=sam,ou=users,dc=appsintheopen,dc=com&quot;

adding new entry &quot;uid=bob,ou=users,dc=appsintheopen,dc=com&quot;

adding new entry &quot;uid=jim,ou=users,dc=appsintheopen,dc=com&quot;

modifying entry &quot;cn=staff,ou=groups,dc=appsintheopen,dc=com&quot;

modifying entry &quot;cn=staff,ou=groups,dc=appsintheopen,dc=com&quot;

modifying entry &quot;cn=staff,ou=groups,dc=appsintheopen,dc=com&quot;

modifying entry &quot;cn=hadoop-users,ou=groups,dc=appsintheopen,dc=com&quot;

modifying entry &quot;cn=hadoop-admin,ou=groups,dc=appsintheopen,dc=com&quot;
</code></pre>

<p>Finally, you can set the password for the new users:</p>

<pre><code>ldappasswd -s newpass123 -W -D &quot;cn=Manager,dc=appsintheopen,dc=com&quot; -x &quot;uid=sam,ou=users,dc=appsintheopen,dc=com&quot;
</code></pre>

<h2>Setup a Client Machine</h2>

<p>To setup another machine to authenticate with the ldap server, first install the ldap packages:</p>

<pre><code>$ yum install -y openldap openldap-clients
</code></pre>

<p>Then a single command takes care of configuring the necessary config files and processes:</p>

<pre><code>$ authconfig --enableldap --enableldapauth --ldapserver=ldap://192.168.33.5:389/ --ldapbasedn=&quot;dc=appsintheopen,dc=com&quot; --enablecache --disablefingerprint --kickstart
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/49-setup-open-ldap-on-centos-6</guid>
    </item>
    <item>
      <title>Loading Stack Exchange Data Dumps to Hadoop and Hive</title>
      <link>https://appsintheopen.com/posts/47-loading-stack-exchange-data-dumps-to-hadoop-and-hive</link>
      <description>
        <![CDATA[<p>These days I am learning a lot about Hadoop, and as part of that I need some data to play with. A lot of Hadoop examples run against Twitter data, or <a href="http://stat-computing.org/dataexpo/2009/the-data.html">airline data</a>, but I decided it would be fun to look at the <a href="http://stackexchange.com">StackExchange</a> data dumps instead.
The people at StackExchange kindly supply <a href="https://archive.org/details/stackexchange">data dumps</a> in XML format for the various StackExchange sites. </p>

<p>The Stack Overflow dump is quite large, so I have been experimenting with the Serverfault dump instead, which is about 350MB compressed.</p>

<p>As I write this, the dump archive contains the following files:</p>

<ul>
<li>Badges.xml</li>
<li>Comments.xml</li>
<li>PostHistory.xml</li>
<li>PostLinks.xml</li>
<li>Posts.xml</li>
<li>Tags.xml</li>
<li>Users.xml</li>
<li>Votes.xml</li>
</ul>

<p>I guess the file format can change from time to time, but the overall definition seems fairly consistent with <a href="http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede">this post</a>. The major difference is that some of the tables mentioned in the post are not in the dump, and the static lookup tables are missing, but they can easily be created.</p>

<p>Hive can deal with XML files, and the XML from the data dumps is pretty simple, with each row wrapped in a &lt;row /&gt; tag and the column values encoded as attributes, for example:</p>

<pre><code>&lt;row Id=&quot;1&quot; PostId=&quot;1&quot; VoteTypeId=&quot;2&quot; CreationDate=&quot;2009-04-30T00:00:00.000&quot; /&gt;
</code></pre>
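<p>Since the values are plain XML attributes, each row is trivial to pick apart. As a quick illustration (not part of the actual pipeline, which uses Java), Python&#39;s standard library parser turns a row into a dict of column values:</p>

```python
import xml.etree.ElementTree as ET

row = '<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2009-04-30T00:00:00.000" />'

# attrib is a plain dict mapping attribute names to their string values
attrs = ET.fromstring(row).attrib
```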

<p>In an effort to learn a bit about Avro, and make things hard for myself, I decided to convert all the XML files into avro files and then load them to Hive.</p>

<p>The easiest way to do this is to create a Hive table pointing at each XML file and then create a new Avro table using a CREATE TABLE AS SELECT statement. I decided it would be more educational to write a map reduce job to do the conversion instead, making things even more difficult for myself.</p>

<h2>The Short Version</h2>

<ol>
<li>Download a <a href="https://archive.org/details/stackexchange">data dump</a>, decompress and put into the input folder in your hdfs home directory. Serverfault is a manageable size.</li>
<li>Clone the <a href="https://github.com/sodonnel/map-reduce-samples">git repo</a></li>
<li><code>mvn package -Dmaven.test.skip=true</code>. A jar file will be created inside the target directory (WordCount-1.0-SNAPSHOT.jar)</li>
<li>Download avro-1.7.7.jar and avro-mapred-1.7.7-hadoop2.jar from <a href="http://apache.mirror.anlx.net/avro/avro-1.7.7/">here</a></li>
<li>On your Hadoop client box run the map reduce job:</li>
</ol>

<pre><code>export LIBJARS=avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar
export HADOOP_CLASSPATH=avro-1.7.7.jar:avro-mapred-1.7.7-hadoop2.jar
hadoop jar WordCount-1.0-SNAPSHOT.jar com.sodonnel.stackOverflow.xmlToAvro -libjars $LIBJARS input output
</code></pre>

<ol start="6">
<li>Create the hive structures by running <code>data_set/stackoverflow/create_directories.sh</code> in the git repo</li>
</ol>

<p>For more explanation, read on ...</p>

<h2>Load Raw Data To Hadoop</h2>

<p>The Stack Exchange dumps come compressed in 7z format (which I had never used before). The first step is to decompress them and load to Hadoop. I put them into a folder called input in my HDFS home directory:</p>

<pre><code>[vagrant@standalone serverfault.com]$ ls -ltrh
total 2.0G
-rw-r--r-- 1 vagrant vagrant 168M Sep 23 15:23 Comments.xml
-rw-r--r-- 1 vagrant vagrant  39M Sep 23 15:23 Badges.xml
-rw-r--r-- 1 vagrant vagrant 2.9M Sep 23 15:23 PostLinks.xml
-rw-r--r-- 1 vagrant vagrant 916M Sep 23 15:23 PostHistory.xml
-rw-r--r-- 1 vagrant vagrant 139M Sep 23 15:23 Votes.xml
-rw-r--r-- 1 vagrant vagrant  64M Sep 23 15:23 Users.xml
-rw-r--r-- 1 vagrant vagrant 224K Sep 23 15:23 Tags.xml
-rw-r--r-- 1 vagrant vagrant 625M Sep 23 15:23 Posts.xml

[vagrant@standalone serverfault.com]$ hadoop fs -put *.xml input/

[vagrant@standalone serverfault.com]$ hadoop fs -ls input
Found 8 items
-rw-r--r--   3 vagrant vagrant   40661120 2015-09-23 15:24 input/Badges.xml
-rw-r--r--   3 vagrant vagrant  176150364 2015-09-23 15:24 input/Comments.xml
-rw-r--r--   3 vagrant vagrant  959680230 2015-09-23 15:24 input/PostHistory.xml
-rw-r--r--   3 vagrant vagrant    2988058 2015-09-23 15:24 input/PostLinks.xml
-rw-r--r--   3 vagrant vagrant  654708460 2015-09-23 15:24 input/Posts.xml
-rw-r--r--   3 vagrant vagrant     228933 2015-09-23 15:24 input/Tags.xml
-rw-r--r--   3 vagrant vagrant   66168226 2015-09-23 15:24 input/Users.xml
-rw-r--r--   3 vagrant vagrant  144769493 2015-09-23 15:25 input/Votes.xml

</code></pre>

<h2>Convert XML to Avro</h2>

<p>Next step is to run a map only map reduce job to convert the XML files to Avro. I have <a href="/posts/48-map-reduce-with-xml-input-and-multiple-avro-outputs">documented that process</a> previously.</p>

<p>After running that job, there should be a bunch of avro files in the output directory:</p>

<pre><code>[vagrant@standalone]$ hadoop fs -ls output
Found 33 items
-rw-r--r--   3 vagrant vagrant          0 2015-09-23 18:28 output/_SUCCESS
-rw-r--r--   3 vagrant vagrant   22478186 2015-09-23 18:28 output/badges-m-00016.avro
-rw-r--r--   3 vagrant vagrant  111145713 2015-09-23 18:21 output/comments-m-00001.avro
-rw-r--r--   3 vagrant vagrant   35046851 2015-09-23 18:28 output/comments-m-00015.avro
-rw-r--r--   3 vagrant vagrant    1330504 2015-09-23 18:28 output/postlinks-m-00018.avro
-rw-r--r--   3 vagrant vagrant  141311747 2015-09-23 18:25 output/posts-m-00009.avro
-rw-r--r--   3 vagrant vagrant  140294479 2015-09-23 18:25 output/posts-m-00010.avro
-rw-r--r--   3 vagrant vagrant  139147884 2015-09-23 18:26 output/posts-m-00011.avro
-rw-r--r--   3 vagrant vagrant  138334514 2015-09-23 18:27 output/posts-m-00012.avro
-rw-r--r--   3 vagrant vagrant  120587513 2015-09-23 18:27 output/posts-m-00013.avro
-rw-r--r--   3 vagrant vagrant     119501 2015-09-23 18:28 output/tags-m-00019.avro
-rw-r--r--   3 vagrant vagrant   78929530 2015-09-23 18:28 output/users-m-00014.avro
-rw-r--r--   3 vagrant vagrant   83896117 2015-09-23 18:20 output/votes-m-00000.avro
</code></pre>

<h2>Create Hive Directory Structure</h2>

<p>A Hive table points at a directory, and we want a table for each of the original Stack Exchange files, so the next step is to create a simple directory structure and move the files from the output directory into the correct place. The following bash script should do the trick:</p>

<pre><code>BASE_DIR=/user/vagrant/hive/stackoverflow

dirs=(&quot;users&quot; &quot;posts&quot; &quot;comments&quot; &quot;tags&quot; &quot;votes&quot; &quot;badges&quot; &quot;postlinks&quot;)
for i in &quot;${dirs[@]}&quot;
do
    hadoop fs -mkdir -p ${BASE_DIR}/${i}
    hadoop fs -mv output/${i}*.avro ${BASE_DIR}/${i}/
done
</code></pre>

<h2>Create Hive Tables</h2>

<p>Now that the data is converted and moved into place, all that is left is to create some Hive tables. Current Hive versions (&gt;= 0.14, I think) make creating tables over Avro files very simple - you don&#39;t even need to specify the Avro schema in the table definition, as it is derived from the columns in the create table statement. So this step is as simple as running a create table statement for each file, eg:</p>

<pre><code>CREATE EXTERNAL TABLE users(
  id string,
  reputation string,
  creationdate string,
  displayname string,
  lastaccessdate string,
  websiteurl string,
  location string,
  aboutme string,
  views string,
  upvotes string,
  downvotes string,
  age string,
  accountid string,
  profileimageurl string
  )
STORED AS AVRO location &#39;/user/vagrant/hive/stackoverflow/users&#39;;
</code></pre>

<p>Get the <a href="https://github.com/sodonnel/map-reduce-samples/blob/master/data_set/stackoverflow/hive_schema_avro.hql">full script</a> on Github.</p>

<h2>Run Queries in Hive</h2>

<p>Now you should be able to run queries against the data in Hive:</p>

<pre><code>hive&gt; select count(*) from posts;
Query ID = vagrant_20150923184848_60c6c182-9b95-438d-875d-0ed433b4c7ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=&lt;number&gt;
In order to set a constant number of reducers:
  set mapreduce.job.reduces=&lt;number&gt;
Starting Job = job_1442929182717_0016, Tracking URL = http://standalone:8088/proxy/application_1442929182717_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1442929182717_0016
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2015-09-23 18:48:43,080 Stage-1 map = 0%,  reduce = 0%
2015-09-23 18:48:53,565 Stage-1 map = 12%,  reduce = 0%, Cumulative CPU 6.78 sec
2015-09-23 18:48:54,599 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 8.02 sec
2015-09-23 18:49:05,033 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 14.75 sec
2015-09-23 18:49:07,105 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 16.33 sec
2015-09-23 18:49:15,493 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.12 sec
2015-09-23 18:49:19,695 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 22.33 sec
MapReduce Total cumulative CPU time: 22 seconds 330 msec
Ended Job = job_1442929182717_0016
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 22.33 sec   HDFS Read: 680169669 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 330 msec
OK
537109
Time taken: 46.837 seconds, Fetched: 1 row(s)
</code></pre>

<h2>TODOs</h2>

<p>You may have noticed I have cheated in a major area - every column in every table is a string, including the dates. This would be much more useful if those were true Hive timestamp columns.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/47-loading-stack-exchange-data-dumps-to-hadoop-and-hive</guid>
    </item>
    <item>
      <title>Map Reduce with XML Input and Multiple Avro Outputs</title>
      <link>https://appsintheopen.com/posts/48-map-reduce-with-xml-input-and-multiple-avro-outputs</link>
      <description>
        <![CDATA[<p>Continuing my learning on Map Reduce (after quite a long break) I decided to figure out how to take the <a href="http://stackexchange.com">Stack Exchange</a> data dumps and convert them from XML to Avro.</p>

<p>The data dumps have one file per table, with the data for each row encoded as attributes within a row tag, eg:</p>

<pre><code>&lt;votes&gt;
  ...
  &lt;row Id=&quot;1&quot; PostId=&quot;1&quot; VoteTypeId=&quot;2&quot; CreationDate=&quot;2009-04-30T00:00:00.000&quot; /&gt;
  ...
&lt;/votes&gt;
</code></pre>

<p>The first challenge is therefore how to read an XML file with a map reduce job.</p>

<h2>StreamXmlRecordReader and XmlInputFormat</h2>

<p>Reading large XML documents with Hadoop has the potential to be tricky as the XML document can span multiple splits, and hence cannot be processed by a single mapper. Often a large XML document is actually a collection of relatively small records, which is the case here. The Hadoop Definitive Guide mentions StreamXmlRecordReader as a way to process these sort of documents. You specify the start and end tags that describe how to split the large XML document into records, and then a mapper gets handed a full record at a time. It also mentions an improved XML input format from the Mahout project called XmlInputFormat. Eventually I found a project that used XmlInputFormat, and the entire XmlInputFormat is about 140 lines. It&#39;s a pretty good example of how to create a custom input format. The <a href="https://github.com/sodonnel/map-reduce-samples/blob/master/src/main/java/com/sodonnel/stackOverflow/XmlInputFormat.java">source is on github</a>. Just copying that class into your project is much easier than including the entire Mahout project!</p>

<p>To use this, all you need is the following in your driver:</p>

<pre><code>// Create configuration
Configuration conf = this.getConf();
conf.set(&quot;xmlinput.start&quot;, &quot;&lt;row&quot;);
conf.set(&quot;xmlinput.end&quot;, &quot;/&gt;&quot;);

// Create job
Job job = Job.getInstance(conf, &quot;StackOverflow XML to Avro&quot;);
job.setJarByClass(getClass());
job.setInputFormatClass(XmlInputFormat.class);
</code></pre>

<p>Notice that I have not used full tags as my delimiters; since each record is a self-closing &lt;row ... /&gt; element, matching on the partial start tag and the closing /&gt; is exactly what this example requires.</p>
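<p>To see what the xmlinput.start and xmlinput.end settings achieve, here is a simplified Python sketch of the record-splitting idea (the real XmlInputFormat also has to cope with HDFS split boundaries, which this ignores):</p>

```python
def split_records(text, start="<row", end="/>"):
    # Yield each substring from a start delimiter through the next end
    # delimiter - roughly how XmlInputFormat hands records to a mapper.
    pos = 0
    while True:
        s = text.find(start, pos)
        if s == -1:
            return
        e = text.find(end, s)
        if e == -1:
            return
        yield text[s:e + len(end)]
        pos = e + len(end)

doc = '<votes><row Id="1" /><row Id="2" /></votes>'
records = list(split_records(doc))
```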

<h2>Avro Outputs</h2>

<p>An Avro file is a file that contains a list of Avro records, where each record conforms to the same schema. The schema can be defined in JSON format and is stored in the Avro output file along with the records. For example, the schema for the Votes record is:</p>

<pre><code>{
  &quot;type&quot;:  &quot;record&quot;,
  &quot;name&quot;:  &quot;StackOverflowVoteRecord&quot;,
  &quot;doc&quot;:   &quot;A Record&quot;,
  &quot;fields&quot;: [
    { &quot;name&quot;: &quot;id&quot;, &quot;type&quot;: &quot;string&quot; },
    { &quot;name&quot;: &quot;postid&quot;, &quot;type&quot;: [&quot;null&quot;, &quot;string&quot;] },
    { &quot;name&quot;: &quot;votetypeid&quot;, &quot;type&quot;: [&quot;null&quot;, &quot;string&quot;] },
    { &quot;name&quot;: &quot;userid&quot;, &quot;type&quot;: [&quot;null&quot;, &quot;string&quot;] },
    { &quot;name&quot;: &quot;creationdate&quot;, &quot;type&quot;: [&quot;null&quot;, &quot;string&quot;] },
    { &quot;name&quot;: &quot;bountyamount&quot;, &quot;type&quot;: [&quot;null&quot;, &quot;string&quot;] }
  ]
}
</code></pre>
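<p>The [&quot;null&quot;, &quot;string&quot;] entries are Avro unions, meaning those fields may legitimately be absent. As a rough illustration of the union semantics (a hand-rolled check over a trimmed copy of the schema - real code would let the avro library validate records), the schema can be read like this:</p>

```python
import json

# A trimmed copy of the Votes schema above, three fields for brevity
schema = json.loads("""
{
  "type": "record",
  "name": "StackOverflowVoteRecord",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "postid", "type": ["null", "string"] },
    { "name": "votetypeid", "type": ["null", "string"] }
  ]
}
""")

def conforms(record, schema):
    # A field typed ["null", "string"] may be None; plain "string" may not.
    for field in schema["fields"]:
        types = field["type"] if isinstance(field["type"], list) else [field["type"]]
        value = record.get(field["name"])
        if value is None and "null" not in types:
            return False
        if value is not None and not isinstance(value, str):
            return False
    return True
```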

<p>Each entry in an Avro file is a record, but there are output types that simulate writing key value pairs into the file by writing a record with two fields, &#39;key&#39; and &#39;value&#39;, where the value can be another record.</p>

<p>To configure an Avro job, you should use the AvroJob class, which allows you to set the key and value schema for mapper and job output, for example:</p>

<pre><code>AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.INT));
AvroJob.setMapOutputValueSchema(job, SCHEMA);
AvroJob.setOutputKeySchema(job, SCHEMA);

job.setOutputFormatClass(AvroKeyOutputFormat.class);
</code></pre>

<p>In this job, I have several types of input file to process, and hence several types of output file, so I used AvroMultipleOutputs instead of AvroJob. This allows me to add a named output, each with a different schema, for each input file type:</p>

<pre><code>AvroMultipleOutputs.addNamedOutput(job, &quot;badges&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;badges&quot;), null);
AvroMultipleOutputs.addNamedOutput(job, &quot;users&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;users&quot;), null);
AvroMultipleOutputs.addNamedOutput(job, &quot;posts&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;posts&quot;), null);
AvroMultipleOutputs.addNamedOutput(job, &quot;comments&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;comments&quot;), null);
AvroMultipleOutputs.addNamedOutput(job, &quot;tags&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;tags&quot;), null);
AvroMultipleOutputs.addNamedOutput(job, &quot;votes&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;votes&quot;), null);
AvroMultipleOutputs.addNamedOutput(job, &quot;postlinks&quot;, AvroKeyOutputFormat.class, SCHEMAS.get(&quot;postlinks&quot;), null);
AvroMultipleOutputs.setCountersEnabled(job, true);
</code></pre>

<p>The complete driver code is <a href="https://github.com/sodonnel/map-reduce-samples/blob/master/src/main/java/com/sodonnel/stackOverflow/xmlToAvro.java">on github</a></p>

<h2>The Mapper</h2>

<p>The mapper is now pretty simple, aside from one complexity. The mapper must identify which input file type it is processing (from its filename) and select the correct schema and named output to write the Avro records into. The filename can be obtained from the map reduce context, and the schema is then looked up in a hashmap stored in the driver class. This is all taken care of by the mapper setup method:</p>

<pre><code>public void setup(Context context) {
  amos = new AvroMultipleOutputs(context);
  InputSplit split = context.getInputSplit();
  String fileName = ((FileSplit) split).getPath().getName().toLowerCase().split(&quot;\\.&quot;)[0];
  if (xmlToAvro.SCHEMAS.containsKey(fileName)) {
    record = new GenericData.Record(xmlToAvro.SCHEMAS.get(fileName));
    amos_name = fileName;
  }
}
</code></pre>

<p>The <a href="https://github.com/sodonnel/map-reduce-samples/blob/master/src/main/java/com/sodonnel/stackOverflow/StackOverflowMapper.java">remainder of the mapper code</a> handles splitting the XML records into an Avro record and writing them to the correct output.</p>

<h2>Maven Dependencies</h2>

<p>To use Avro in a map reduce job, you need to add the following to your pom.xml:</p>

<pre><code>&lt;dependency&gt;
  &lt;groupId&gt;org.apache.avro&lt;/groupId&gt;
  &lt;artifactId&gt;avro&lt;/artifactId&gt;
  &lt;version&gt;1.7.7&lt;/version&gt;
&lt;/dependency&gt;

&lt;dependency&gt;
  &lt;groupId&gt;org.apache.avro&lt;/groupId&gt;
  &lt;artifactId&gt;avro-mapred&lt;/artifactId&gt;
  &lt;version&gt;1.7.7&lt;/version&gt;
  &lt;classifier&gt;hadoop2&lt;/classifier&gt;
&lt;/dependency&gt;
</code></pre>

<p>Notice the classifier &#39;hadoop2&#39; - this selects the avro-mapred jar built for MR2 (yarn).</p>

<h2>Running The Job</h2>

<h3>Load Raw Data To Hadoop</h3>

<p>The <a href="https://archive.org/details/stackexchange">Stack Exchange dumps</a> come compressed in 7z format (which I had never used before). The first step is to decompress them and load to Hadoop. I put them into a folder called input in my HDFS home directory:</p>

<pre><code>[vagrant@standalone serverfault.com]$ ls -ltrh
total 2.0G
-rw-r--r-- 1 vagrant vagrant 168M Sep 23 15:23 Comments.xml
-rw-r--r-- 1 vagrant vagrant  39M Sep 23 15:23 Badges.xml
-rw-r--r-- 1 vagrant vagrant 2.9M Sep 23 15:23 PostLinks.xml
-rw-r--r-- 1 vagrant vagrant 916M Sep 23 15:23 PostHistory.xml
-rw-r--r-- 1 vagrant vagrant 139M Sep 23 15:23 Votes.xml
-rw-r--r-- 1 vagrant vagrant  64M Sep 23 15:23 Users.xml
-rw-r--r-- 1 vagrant vagrant 224K Sep 23 15:23 Tags.xml
-rw-r--r-- 1 vagrant vagrant 625M Sep 23 15:23 Posts.xml

[vagrant@standalone serverfault.com]$ hadoop fs -put *.xml input/

[vagrant@standalone serverfault.com]$ hadoop fs -ls input
Found 8 items
-rw-r--r--   3 vagrant vagrant   40661120 2015-09-23 15:24 input/Badges.xml
-rw-r--r--   3 vagrant vagrant  176150364 2015-09-23 15:24 input/Comments.xml
-rw-r--r--   3 vagrant vagrant  959680230 2015-09-23 15:24 input/PostHistory.xml
-rw-r--r--   3 vagrant vagrant    2988058 2015-09-23 15:24 input/PostLinks.xml
-rw-r--r--   3 vagrant vagrant  654708460 2015-09-23 15:24 input/Posts.xml
-rw-r--r--   3 vagrant vagrant     228933 2015-09-23 15:24 input/Tags.xml
-rw-r--r--   3 vagrant vagrant   66168226 2015-09-23 15:24 input/Users.xml
-rw-r--r--   3 vagrant vagrant  144769493 2015-09-23 15:25 input/Votes.xml

</code></pre>

<h3>Compile and Run Map Reduce</h3>

<p>Clone the <a href="https://github.com/sodonnel/map-reduce-samples">git repo</a>.</p>

<p>Use Maven to build the jar:</p>

<pre><code>$ mvn package -Dmaven.test.skip=true
</code></pre>

<p>The Avro libraries are not part of the core Hadoop install, so you need to download the correct Jars and add them to both LIBJARS and HADOOP_CLASSPATH. Then the job can be started in the usual way:</p>

<pre><code>export LIBJARS=avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar
export HADOOP_CLASSPATH=avro-1.7.7.jar:avro-mapred-1.7.7-hadoop2.jar
hadoop jar WordCount-1.0-SNAPSHOT.jar com.sodonnel.stackOverflow.xmlToAvro -libjars $LIBJARS input output
</code></pre>

<p>After running the job, you should end up with a few Avro files for each of the input XML files in the output directory:</p>

<pre><code>$ hadoop fs -ls output
Found 13 items
-rw-r--r--   3 vagrant vagrant          0 2015-09-24 14:11 output/_SUCCESS
-rw-r--r--   3 vagrant vagrant   22477814 2015-09-24 14:11 output/badges-m-00016.avro
-rw-r--r--   3 vagrant vagrant  110111079 2015-09-24 14:03 output/comments-m-00001.avro
-rw-r--r--   3 vagrant vagrant   34758424 2015-09-24 14:11 output/comments-m-00015.avro
-rw-r--r--   3 vagrant vagrant    1330504 2015-09-24 14:11 output/postlinks-m-00018.avro
-rw-r--r--   3 vagrant vagrant  124593825 2015-09-24 14:07 output/posts-m-00009.avro
-rw-r--r--   3 vagrant vagrant  123105305 2015-09-24 14:08 output/posts-m-00010.avro
-rw-r--r--   3 vagrant vagrant  121213033 2015-09-24 14:09 output/posts-m-00011.avro
-rw-r--r--   3 vagrant vagrant  119791251 2015-09-24 14:09 output/posts-m-00012.avro
-rw-r--r--   3 vagrant vagrant  104216815 2015-09-24 14:10 output/posts-m-00013.avro
-rw-r--r--   3 vagrant vagrant     119501 2015-09-24 14:11 output/tags-m-00019.avro
-rw-r--r--   3 vagrant vagrant   69758697 2015-09-24 14:10 output/users-m-00014.avro
-rw-r--r--   3 vagrant vagrant   83896117 2015-09-24 14:03 output/votes-m-00000.avro
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/48-map-reduce-with-xml-input-and-multiple-avro-outputs</guid>
    </item>
    <item>
      <title>Create a RPM from a Ruby gem</title>
      <link>https://appsintheopen.com/posts/46-create-a-rpm-from-a-ruby-gem</link>
      <description>
<![CDATA[<p>I needed to get a few Ruby gems integrated into my team&#39;s developer VM builds recently, so I thought I would see how difficult it was to turn a gem into an RPM. Puppet is much happier dealing with RPMs than anything else. Plus, I have a rather annoying proxy in the way when attempting to install from rubygems.org, which makes an RPM on a locally hosted yum repo much easier.</p>

<p>After a bit of googling, I came across a few gists and tools that claim to do the job, but it wasn&#39;t immediately clear how to use them.</p>

<p>Having previously built RPMs for <a href="/posts/28-building-a-ruby-2-0-0-rpm">Ruby</a>, <a href="/posts/37-creating-a-rpm-from-the-java-jdk-tar-file">Java</a>, Maven and a few other bits and pieces, I know it&#39;s a pretty simple process. Instead of using any special gem rpm tools, I decided to use rpm-build and a spec file, just like I have in the past. Using this method, creating the RPM is simple.</p>

<h2>Setup</h2>

<p>First, make sure the rpm-build tool is installed:</p>

<pre><code>$ yum install rpm-build
</code></pre>

<p>Then as root, in /root create the rpmbuild directory structure:</p>

<pre><code>$ mkdir -p ~/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
</code></pre>

<h2>Spec file</h2>

<p>Taking the ruby-oci8 gem (Oracle drivers) as an example, I used the following spec file in rpmbuild/SPECS/ruby-oci8.spec:</p>

<pre><code>Summary: ruby-oci8
Name: ruby-oci8
Version: 2.1.8
Release: 0
Group: Software Development
Distribution: ruby-oci8 for Ruby
Vendor: Redhat
Packager: Stephen ODOnnell
License: GPL
# Skip autogenerating RPM dependencies
AutoReqProv: no

%description
The Ruby gem ruby-oci8 packaged as an RPM

%prep
rm -rf $RPM_BUILD_ROOT/*

%build

%install
mkdir -p $RPM_BUILD_ROOT/usr/lib/ruby/gems/1.8/gems
gem install -y -V --no-ri --no-rdoc --install-dir $RPM_BUILD_ROOT/usr/lib/ruby/gems/1.8 --local $RPM_SOURCE_DIR/%{name}-%{version}.gem

%files
/usr/lib/ruby/gems/1.8/gems/%{name}-%{version}
/usr/lib/ruby/gems/1.8/cache/%{name}-%{version}.gem
/usr/lib/ruby/gems/1.8/specifications/%{name}-%{version}.gemspec

%post

</code></pre>

<p>Download the ruby-oci8 gem and copy it into the rpmbuild/SOURCES directory and then build the RPM:</p>

<pre><code>$ rpmbuild -ba SPECS/ruby-oci8.spec
</code></pre>

<p>If successful, the final RPM will be in rpmbuild/RPMS/x86_64/ruby-oci8-2.1.8-0.x86_64.rpm</p>

<h2>Potential Issues</h2>

<p>This RPM is coupled to a fairly specific Ruby install path, which is probably OK, but is something to be aware of.</p>

<p>Watch out for new releases of the source gem - you need to edit the Version field in the spec file each time the gem version changes.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/46-create-a-rpm-from-a-ruby-gem</guid>
    </item>
    <item>
      <title>Map Reduce Counters</title>
      <link>https://appsintheopen.com/posts/45-map-reduce-counters</link>
      <description>
        <![CDATA[<p>When you run a map reduce job, at the end it prints out a bunch of statistics about the job, for example:</p>

<pre><code>File System Counters
        FILE: Number of bytes read=215
        FILE: Number of bytes written=243859
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=2
    Map output records=0
    Input split bytes=119
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=0
    Total committed heap usage (bytes)=192937984
File Input Format Counters 
    Bytes Read=42
File Output Format Counters 
        Bytes Written=8
</code></pre>

<p>These statistics are tracked by Counters. Notice that we have 4 counter groups (File System Counters, Map-Reduce Framework, ...) and each group has one or more counters within it. These counters are all created and maintained automatically by the map reduce framework, but you can easily add your own custom counters, allowing you to track all sorts of details about your job.</p>

<h2>Creating a Custom Counter</h2>

<p>All you need to do is define an enum type to hold your counters. When you define an enum in Java it behaves somewhat like a class, so you can create it in its own file:</p>

<pre><code>package com.sodonnel.Hadoop.Fan;

public enum OtherCounters {
    GOOD_RECORDS,
    BAD_RECORDS
}
</code></pre>

<p>This will create a new counter group called OtherCounters, which contains two counters - GOOD_RECORDS and BAD_RECORDS.</p>

<p>To use the counters all you have to do is increment them in your mapper or reducer:</p>

<pre><code>context.getCounter(OtherCounters.GOOD_RECORDS).increment(1);
</code></pre>

<p>That&#39;s really all there is to it. Now if you run your job, the custom counter will be displayed in the output.</p>

<h2>Getting The Counts</h2>

<p>Aside from the framework displaying the value of the counter, you can get the values in the driver after the job has completed, via the Counters object returned by job.getCounters(). For instance, you can access a counter by its name:</p>

<pre><code>System.out.println(&quot;Good Records: &quot;+counters.findCounter(&quot;com.sodonnel.Hadoop.Fan.OtherCounters&quot;, &quot;GOOD_RECORDS&quot;).getValue());
</code></pre>

<p>Or, you can even loop over all the counter groups and their counters:</p>

<pre><code>for (CounterGroup group : counters) {
    System.out.println(&quot;Counter Group: &quot; + group.getDisplayName() + &quot; (&quot; + group.getName() + &quot;)&quot;);
    System.out.println(&quot;  number of counters in this group: &quot; + group.size());
    for (Counter counter : group) {
        System.out.println(&quot;   -&gt; &quot; + counter.getDisplayName() + &quot;: &quot; + counter.getName() + &quot;: &quot;+counter.getValue());
    }
}
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/45-map-reduce-counters</guid>
    </item>
    <item>
      <title>Map Reduce Multiple Outputs</title>
      <link>https://appsintheopen.com/posts/44-map-reduce-multiple-outputs</link>
      <description>
        <![CDATA[<p>Recently I wrote a couple of articles about <a href="/posts/39-creating-a-simple-map-reduce-program-for-cloudera-hadoop">writing</a> and <a href="/posts/40-unit-testing-map-reduce-programs-with-mrunit">testing</a> simple map reduce programs.</p>

<p>That got me thinking about some more advanced things you might want to do, so in this article and hopefully a few more in the future I will look at more features of the map reduce framework, starting with multiple outputs.</p>

<h2>Multiple Outputs</h2>

<p>Most of the early map reduce examples I came across only worked with a single input and a single output directory. It didn&#39;t take me long to come across a few scenarios where it would be useful to output to several directories. For instance, I might want to fan out the records in a large file into several smaller files, based on the contents of each record. Or, if I am processing a batch of records, instead of failing the entire job thanks to a single badly formatted record, I could write the bad record out to an error file and continue on processing as normal.</p>

<p>Hadoop allows writing to multiple output files very easily, and there are (at least) two ways of doing it.</p>

<h3>Named Outputs</h3>

<p>In the driver class, specify any number of named outputs.</p>

<pre><code>MultipleOutputs.addNamedOutput(job, &quot;goodrecords&quot;, TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, &quot;badrecords&quot;,  TextOutputFormat.class, NullWritable.class, Text.class);
</code></pre>

<p>Note that each output can have a different format and key / value type.</p>

<p>In the mapper (or reducer) you need to do a couple of things to make multiple outputs work correctly.</p>

<p>First, in the mapper setup method, instantiate a multiple output object:</p>

<pre><code>@Override
public void setup(Context context) {
  mos = new MultipleOutputs&lt;NullWritable, Text&gt;(context);
}
</code></pre>

<p>A very important point is that you must also close the MultipleOutputs object, or the output files will not be written correctly. You do this in the cleanup method:</p>

<pre><code>@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  mos.close();
}
</code></pre>

<p>Then in the map (or reduce) method, you can write to either of the outputs previously created, by name:</p>

<pre><code>mos.write(&quot;goodrecords&quot;, NullWritable.get(), value);
mos.write(&quot;badrecords&quot;,  NullWritable.get(), value);
</code></pre>

<p>You can still write to the context as usual too, so this is quite flexible.</p>

<p>An important point when writing to multiple outputs in this way is that all the files end up in the job&#39;s default output directory, distinguished only by their filename prefixes. You don&#39;t have the option to specify a different path for each named output.</p>

<h3>Unnamed Outputs</h3>

<p>If you want to output to a different file based on the content of the input record, then you can use unnamed multiple outputs. In this case, you don&#39;t have to set up anything in the driver class except the usual output directory.</p>

<p>In the mapper, add a setup and cleanup method exactly the same as above, and then use an alternative version of the write command:</p>

<pre><code>String keyChar = value.toString().substring(0,1).toLowerCase();
mos.write(NullWritable.get(), value, keyChar);
</code></pre>

<p>In this example, the output does not have a name and the record is written to an output file determined by the first character of the record (keyChar in the example above).</p>

<p>Unlike with the named outputs, you can have data written into sub directories, by passing a path instead of just a filename as keyChar. I <em>think</em> the subdirectories must be within the job&#39;s normal output directory.</p>

<p>Another difference from the named outputs is that each named output can have a different key, value and file format, while the unnamed version seems to inherit these attributes from the job&#39;s default output settings.</p>

<h3>Lazy Output Format</h3>

<p>There is one more minor annoyance when working with multiple outputs. Even if you never write any records to the context for standard output, the map reduce framework will create a zero byte file in the output directory. You can avoid this by adding the following setting in the driver:</p>

<pre><code>LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
</code></pre>

<h3>The Complete Code</h3>

<pre><code>//
// DRIVER
//

package com.sodonnel.Hadoop.Fan;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Fan extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = Job.getInstance(conf, &quot;Fan&quot;);
        job.setJarByClass(getClass());

        // Setup MapReduce
        job.setMapperClass(FanOutMapper.class);
        job.setNumReduceTasks(0);

        // Specify key / value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        //
        // To prevent a zero byte file from being created when you use
        // multiple outputs, comment out the next line and uncomment
        // the LazyOutputFormat line after it
        //
        job.setOutputFormatClass(TextOutputFormat.class);
        //LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        //
        // If you want to have named outputs, then define them upfront here
        //
        //  MultipleOutputs.addNamedOutput(job, &quot;badRecords&quot;, TextOutputFormat.class,
        //          NullWritable.class, Text.class);
        //  MultipleOutputs.addNamedOutput(job, &quot;goodRecords&quot;, TextOutputFormat.class,
        //          NullWritable.class, Text.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Fan(), args);
        System.exit(exitCode);
    }
}

//
// MAPPER
//

package com.sodonnel.Hadoop.Fan;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class FanOutMapper extends
    Mapper&lt;LongWritable, Text, NullWritable, Text&gt; {

    private MultipleOutputs&lt;NullWritable, Text&gt; mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs&lt;NullWritable, Text&gt;(context);
    }           

    // 
    // You must override the cleanup method and close the multi-output object
    // or things do not work correctly.
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Throw away totally blank lines
        if (value.equals(new Text(&quot;&quot;))) {
            return;
        }

        // Fan the records out into a file that has the first character of the 
        // string as the filename.
        // You can also use named outputs (defined in the job runner class)
        // instead of deriving the filename based on the input lines.
        // If you pass a path with / characters in it, the data will go into subdirs
        // eg 20150304/data etc
        String keyChar = value.toString().substring(0,1).toLowerCase();

        // In this example, the keyChar string indicates the filename the data is written
        // into. You can write the same data to many files, and the filename can 
        // contain slashes to make it into a path. The path is relative to the output dir
        // setup in the job config.
        mos.write(NullWritable.get(), value, keyChar);
        // mos.write(&quot;goodRecords&quot;, NullWritable.get(),value);
        // context.write(NullWritable.get(), value);
    }
}
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/44-map-reduce-multiple-outputs</guid>
    </item>
    <item>
      <title>Comparing Sequence Files, ORC Files and Parquet Files</title>
      <link>https://appsintheopen.com/posts/43-comparing-sequence-files-orc-files-and-parquet-files</link>
      <description>
<![CDATA[<p>Back when I started working with Hadoop, I did some benchmarks around different file types, mainly thinking about how much they compressed the data and whether they were splittable formats or not. I quickly learned that just loading files as gzipped text was not a good idea, thanks to it being a non-splittable format. Eventually we settled on using compressed sequence files (using gzip) in our project, which was probably not the optimal choice.</p>

<p>Since then, both Parquet and ORC files have been getting a lot of press, and I thought it was about time I had a good look at them.</p>

<h1>Test Platform and Plan</h1>

<p>I wanted to do some basic checks on each of the file types using real-world data from our application. I did not make any effort to change any of the default settings, except to set PARQUET_COMPRESSION_CODEC=snappy (on my system it seemed to default to NONE). My main areas of interest are how big the resulting files become, and how much CPU is consumed creating them and later querying them.</p>

<p>I ran these tests on Cloudera Hadoop version 5.2.1 and Hive 0.13. One thing to note is that in this version Parquet does not support the Timestamp data type, which will hurt its compression statistics. All of my test tables have at least one Timestamp column. Hopefully I can re-run these tests once my cluster is upgraded.</p>

<h2>Log Table</h2>

<p>The first table I looked at holds application log data. Typically a day of data is about 80GB, stored as a gzipped compressed Sequence file.</p>

<p>I created one day of data using both ORC and PARQUET:</p>

<pre><code>SEQUENCE FILE: 80.9 G created in 1344 seconds, 68611 CPU seconds
ORC FILE     : 33.9 G created in 1710 seconds, 82051 CPU seconds
PARQUET FILE : 49.3 G created in 1421 seconds, 86263 CPU seconds
</code></pre>

<p>Both ORC and Parquet compress much better than Sequence files, with ORC the clear winner, although it does take slightly more CPU to create the ORC file. It is interesting, though not really surprising, that creating Sequence files is much more efficient than either of the other two formats.</p>

<p>The next thing to be concerned with is query performance. Additional overhead creating the files is easy to accept if queries benefit over and over again. Even ignoring their special features, I expect both ORC and Parquet to do much better than Sequence files, due to the large difference in file sizes.</p>

<h3>Simple count(*)</h3>

<pre><code>SEQUENCE FILE:  202 seconds; 9316 CPU (second run 242 seconds)
ORC FILE     :  148 seconds; 1839 CPU (second run 122 seconds)
PARQUET FILE :  139 seconds; 2801 CPU (second run 117 seconds)
</code></pre>

<h3>Filter on 1 column and group by</h3>

<pre><code>SEQUENCE FILE:  340 seconds; 14318 CPU (second run 373 seconds)
ORC FILE     :  165 seconds; 2978  CPU (second run 157 seconds)
PARQUET FILE :  165 seconds; 4490  CPU (second run 171 seconds )
</code></pre>

<h3>Filter on 3 columns plus a lookup join</h3>

<pre><code>SEQUENCE FILE:  526 seconds; CPU 14031 (second run 491 seconds)
ORC FILE     :  201 seconds; CPU 5329  (second run 204 seconds)
PARQUET FILE :  240 seconds; CPU 8797  (second run 312 seconds)
</code></pre>

<p>In terms of CPU, ORC is the clear winner in all these tests, and it is just about edging it in response time too. As the runtime can be quite variable on a Hadoop Cluster, I am more concerned with the CPU used as a performance benchmark.</p>

<h2>Wide Transaction Table with Array of Structs</h2>

<p>Another important table in our application is a very wide table that also makes use of an array of structs embedded in each row. This table has many fewer rows than the log table, coming in at about 1.5GB a day:</p>

<pre><code>SEQUENCE FILE: 1.5 G
ORC FILE     : 835.9 M created in 414 seconds; 1705 CPU seconds
PARQUET FILE : 919.3 M created in 290 seconds; 1510 CPU seconds
</code></pre>

<h3>Select count(*)</h3>

<pre><code>SEQUENCE FILE: 76 seconds;  109 CPU (second run 68 seconds)
ORC FILE     : 70 seconds;  27  CPU (second run 73 seconds)
PARQUET FILE : 69 seconds;  42  CPU (second run 58 seconds)
</code></pre>

<h3>Expand lateral view, filter and count</h3>

<pre><code>SEQUENCE FILE: 98 seconds; 240 CPU (second run 80 seconds)
ORC FILE     : 85 seconds; 114 CPU (second run 77 seconds)
PARQUET FILE : 84 seconds; 196 CPU (second run 94 seconds)
</code></pre>

<p>Again, in both these tests ORC seems to be the winner for queries, but is the most costly file to create.</p>

<h2>Transaction Table Copied From RDBMS</h2>

<p>This is a fairly typical database table, storing about 1.1GB compressed each day:</p>

<pre><code>SEQUENCE FILE: 1.1 G
ORC FILE     : 667.0 M created in 202 seconds; 989 CPU seconds
PARQUET FILE : 691.0 M created in 202 seconds; 853 CPU seconds
</code></pre>

<p>I didn&#39;t run any queries on this table, but again ORC creates the smallest files, at the cost of the largest overhead at file creation time.</p>

<h1>Conclusion</h1>

<p>Clearly you should not use Sequence files to store Hive tables. While they are efficient to create, the additional disk space and CPU overhead when reading them is a heavy cost.</p>

<p>Based on this quick set of tests, ORC files win for me. They are more expensive to create than Parquet files, but the compression techniques are better for my data, along with lower CPU overhead for my test queries.</p>

<p>There is one more thing to consider - at this time, Impala cannot make use of ORC files, so it may make sense to go with Parquet if that is something you will need now or in the future.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/43-comparing-sequence-files-orc-files-and-parquet-files</guid>
    </item>
    <item>
      <title>Experimenting with Flume Performance</title>
      <link>https://appsintheopen.com/posts/42-experimenting-with-flume-performance</link>
      <description>
        <![CDATA[<p>I was evaluating Flume for a Hadoop integration project recently, and as part of my investigation I needed to see how many messages per second it could handle.</p>

<p>The Flume manual points out that Flume performance will vary greatly depending on your hardware, message size, disk speed and configuration, so it is important to evaluate performance based on your own application.</p>

<p>The manual also points out that a bigger batch size when passing messages into Flume should give higher performance.</p>

<p>In order to perform some benchmarks, I created a simple flume injector that allowed me to send a given number of messages to Flume, controlling the length of each message and the batch size.</p>

<h1>Testing Strategy</h1>

<p>For the following tests, I am only concerned about the message input rate - therefore I am using a null sink to remove the messages from the channel.</p>

<p>I am also using a single injector with a single connection to Flume - maybe I could get better performance out of many injectors each connecting to Flume separately, but I am not concerned with going into that level of detail.</p>

<p>The Flume test box is a 2 core, 4GB RAM VM with no internal disks, so it is fairly basic hardware. The injector runs on a similar VM, sending messages to Flume over the network.</p>

<h1>Memory Channel Tests</h1>

<p>For these tests I inject 500k messages, varying the message size or the batch size. The Flume configuration uses an Avro source, a memory channel and a null sink:</p>

<pre><code>agent.sources  = avro
agent.sinks    = nullsink
agent.channels = memchannel

agent.sources.avro.type = avro
agent.sources.avro.bind = 0.0.0.0
agent.sources.avro.port = 41414

agent.channels.memchannel.type                = memory
agent.channels.memchannel.capacity            = 10000
agent.channels.memchannel.transactionCapacity = 1000
agent.channels.memchannel.byteCapacity        = 100000000

agent.sinks.nullsink.type = null

agent.sources.avro.channels = memchannel
agent.sinks.nullsink.channel = memchannel
</code></pre>

<h2>Vary The Batch Size</h2>

<p>For this test I inject 500K messages of approximately 500 bytes each, varying the batch size: </p>

<table><thead>
<tr>
<th>Batch Size</th>
<th>Runtime (seconds)</th>
<th>TPS</th>
</tr>
</thead><tbody>
<tr>
<td>1</td>
<td>259</td>
<td>1930</td>
</tr>
<tr>
<td>10</td>
<td>43</td>
<td>11627</td>
</tr>
<tr>
<td>20</td>
<td>24</td>
<td>20833</td>
</tr>
<tr>
<td>40</td>
<td>16</td>
<td>31250</td>
</tr>
<tr>
<td>80</td>
<td>12.5</td>
<td>40000</td>
</tr>
<tr>
<td>160</td>
<td>11.8</td>
<td>42372</td>
</tr>
<tr>
<td>320</td>
<td>11.5</td>
<td>43748</td>
</tr>
<tr>
<td>640</td>
<td>11.2</td>
<td>44642</td>
</tr>
<tr>
<td>1000</td>
<td>11</td>
<td>45454</td>
</tr>
</tbody></table>

<p>Increasing the batch size has a notable impact on performance, up to a batch size of between 80 and 160 messages where it seems to flatten out.</p>
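<p>For reference, the TPS column is simply the 500K message count divided by the runtime. For example, the batch size 80 row works out as follows:</p>

```shell
# TPS = messages / runtime. The 12.5 second runtime is scaled by 10
# so the calculation stays in integer shell arithmetic.
MESSAGES=500000
RUNTIME_TENTHS=125   # 12.5 seconds
echo $(( MESSAGES * 10 / RUNTIME_TENTHS ))   # prints 40000
```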

<h2>Vary Message Size</h2>

<p>For this test, I used the same Flume config as above and set the batch size to 80, varying the message length:</p>

<table><thead>
<tr>
<th>Message Length</th>
<th>Runtime (seconds)</th>
<th>TPS</th>
</tr>
</thead><tbody>
<tr>
<td>100</td>
<td>10.5</td>
<td>47619</td>
</tr>
<tr>
<td>200</td>
<td>10.6</td>
<td>47169</td>
</tr>
<tr>
<td>500</td>
<td>12.3</td>
<td>40650</td>
</tr>
<tr>
<td>800</td>
<td>14.5</td>
<td>34482</td>
</tr>
<tr>
<td>1600</td>
<td>17.5</td>
<td>28517</td>
</tr>
<tr>
<td>3200</td>
<td>24.2</td>
<td>20661</td>
</tr>
<tr>
<td>6400</td>
<td>38</td>
<td>13157</td>
</tr>
<tr>
<td>12800</td>
<td>68</td>
<td>7352</td>
</tr>
</tbody></table>

<p>As the message length increased, the TPS reduced, which is probably to be expected. For small messages (under 500 bytes) the effect of going from 100 to 500 bytes is not too noticeable. For longer messages, doubling the length seems to almost halve the TPS.</p>

<h1>File Channel Tests</h1>

<p>For these tests, I changed the Flume configuration to use a file channel instead of a memory channel:</p>

<pre><code>agent.sources  = avro
agent.sinks    = nullsink
agent.channels = filech

agent.sources.avro.type = avro
agent.sources.avro.bind = 0.0.0.0
agent.sources.avro.port = 41414

agent.channels.filech.type = file
agent.channels.filech.checkpointDir = /var/flume/filech/checkpoint
agent.channels.filech.dataDirs = /var/flume/filech/data
agent.channels.filech.capacity = 1000000
agent.channels.filech.transactionCapacity = 1000

agent.sinks.nullsink.type = null

agent.sources.avro.channels = filech
agent.sinks.nullsink.channel = filech
</code></pre>

<p>Note that, as the file channel is much slower than the memory channel, I have changed the tests to load 100K messages instead of 500K.</p>

<h2>Vary Batch Size</h2>

<p>Load 100K messages of length 500 bytes, varying the batch size:</p>

<table><thead>
<tr>
<th>Batch Size</th>
<th>Time (seconds)</th>
<th>TPS</th>
</tr>
</thead><tbody>
<tr>
<td>1</td>
<td>140</td>
<td>714</td>
</tr>
<tr>
<td>10</td>
<td>23.3</td>
<td>4291</td>
</tr>
<tr>
<td>20</td>
<td>15.5</td>
<td>6451</td>
</tr>
<tr>
<td>40</td>
<td>11.5</td>
<td>8695</td>
</tr>
<tr>
<td>80</td>
<td>9.3</td>
<td>10752</td>
</tr>
<tr>
<td>160</td>
<td>8.6</td>
<td>11627</td>
</tr>
<tr>
<td>320</td>
<td>9.4</td>
<td>10638</td>
</tr>
<tr>
<td>640</td>
<td>8.7</td>
<td>11494</td>
</tr>
<tr>
<td>1000</td>
<td>7.7</td>
<td>12987</td>
</tr>
</tbody></table>

<p>Notice that the file channel test exhibits a similar performance profile to the memory channel as the batch size increases, but at a much lower TPS.</p>

<h2>Vary Message Size</h2>

<p>Load 100K messages of varying size into a file channel using a batch size of 80.</p>

<table><thead>
<tr>
<th>Message Size</th>
<th>Time (seconds)</th>
<th>TPS</th>
</tr>
</thead><tbody>
<tr>
<td>100</td>
<td>7.5</td>
<td>13333</td>
</tr>
<tr>
<td>200</td>
<td>7.8</td>
<td>12820</td>
</tr>
<tr>
<td>400</td>
<td>8.8</td>
<td>11363</td>
</tr>
<tr>
<td>500</td>
<td>8.7</td>
<td>11494</td>
</tr>
<tr>
<td>800</td>
<td>9.8</td>
<td>10204</td>
</tr>
<tr>
<td>1600</td>
<td>12.6</td>
<td>7936</td>
</tr>
<tr>
<td>3200</td>
<td>17.5</td>
<td>5714</td>
</tr>
<tr>
<td>6400</td>
<td>25.5</td>
<td>3921</td>
</tr>
<tr>
<td>12800</td>
<td>40</td>
<td>2500</td>
</tr>
</tbody></table>

<p>Again, the performance profile looks similar to the memory channel test, but at lower TPS.</p>

<h2>Replicated File Channels</h2>

<p>The final test I ran against file channels examines the effect of replicating events into multiple channels. I loaded 100K messages using a batch size of 80 and a message length of 500. The flume config is:</p>

<pre><code>agent.sources  = avro
agent.sinks    = nullsink nullsink2 nullsink3
agent.channels = filech filech2 filech3

agent.sources.avro.type = avro
agent.sources.avro.bind = 0.0.0.0
agent.sources.avro.port = 41414

agent.channels.filech.type = file
agent.channels.filech.checkpointDir = /var/flume/filech/checkpoint
agent.channels.filech.dataDirs = /var/flume/filech/data
agent.channels.filech.capacity = 1000000
agent.channels.filech.transactionCapacity = 1000

agent.channels.filech2.type = file
agent.channels.filech2.checkpointDir = /var/flume/filech2/checkpoint
agent.channels.filech2.dataDirs = /var/flume/filech2/data
agent.channels.filech2.capacity = 1000000
agent.channels.filech2.transactionCapacity = 1000

agent.channels.filech3.type = file
agent.channels.filech3.checkpointDir = /var/flume/filech3/checkpoint
agent.channels.filech3.dataDirs = /var/flume/filech3/data
agent.channels.filech3.capacity = 1000000
agent.channels.filech3.transactionCapacity = 1000


agent.sinks.nullsink.type = null
agent.sinks.nullsink2.type = null
agent.sinks.nullsink3.type = null

agent.sources.avro.selector = replicating
agent.sources.avro.channels = filech filech2 filech3
agent.sinks.nullsink.channel = filech
agent.sinks.nullsink2.channel = filech2
agent.sinks.nullsink3.channel = filech3
</code></pre>

<p>The time taken to load 100K messages to 1, 2 and 3 replicated channels is given below:</p>

<table><thead>
<tr>
<th>Single Channel</th>
<th>2 Replicated Channels</th>
<th>3 Replicated Channels</th>
</tr>
</thead><tbody>
<tr>
<td>9.3</td>
<td>13.4</td>
<td>21</td>
</tr>
</tbody></table>

<p>It looks like each replicated channel hurts performance significantly. I suspect I am hitting contention on disk writes with the replicated channels - the machine I am testing on is a VM with disk stored on SAN, so the disk performance is not going to be great. If I get time in the future I may try running this test again with SSD disks or on a machine with several internal disks to see the effect.</p>

<h1>Conclusion</h1>

<p>The TPS Flume is capable of handling varies significantly depending on the batch size and message size. Messages under 500 bytes seem pretty efficient, and a batch size of around 100 seems to be optimal in these tests.</p>

<p>It&#39;s also significant to note the performance impact a persistent file channel has - cutting throughput by almost 4 times.</p>

<p>I should point out that the hardware these tests were run on is nothing fantastic. I suspect file channel performance would be much better on SSD machines, with a separate disk for each channel.</p>

<p>I also didn&#39;t make any effort to tune any Flume settings. I did turn on Java GC logging to ensure Flume was not suffering from excessive full GC runs, which it was not.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/42-experimenting-with-flume-performance</guid>
    </item>
    <item>
      <title>Creating Centos or Redhat init scripts</title>
      <link>https://appsintheopen.com/posts/41-creating-centos-or-redhat-init-scripts</link>
      <description>
<![CDATA[<p>Init scripts are used to start and stop daemon processes on Linux systems. It turns out that, like most things in Linux, they are pretty simple. Following a few rules allows you to quickly create a script that plays well with how the system starts up and shuts down.</p>

<p>There is a pretty good <a href="http://fedoraproject.org/wiki/Packaging:SysVInitScript">guide</a> that explains all the parts of the init scripts in much more detail than I will repeat here.</p>

<h2>Things an Init Script Should Do</h2>

<h3>Script Location and Naming</h3>

<p>Your init script should be given the same name as the process you are trying to start, and it should be stored in <code>/etc/rc.d/init.d</code>. For example, if you are creating an init script for a process called flume, you should create a file <code>/etc/rc.d/init.d/flume</code> and set its permissions to 755.</p>

<p>Once you do this, you can use the service command to start and stop the service, for example:</p>

<pre><code># /etc/rc.d/init.d/testservice
echo &quot;executed the service start script&quot;

$ service testservice start
executed the service start script

</code></pre>

<h3>Manage /var/lock/subsys</h3>

<p>For a few reasons, an init script should create a lock file in /var/lock/subsys/&lt;service name&gt; upon starting, and clear it when stopped. If the OS is shut down or rebooted, or the run level is changed, these lock files are used to determine which processes to stop and start.</p>

<h3>Create a Pidfile</h3>

<p>It should write a pidfile (i.e. a file containing the PID of the parent process of the service) into /var/run. What I found is that if the process runs as a user other than root, the user starting the process probably won&#39;t have permission to write into /var/run. If that is the case, you can create a sub-directory and store the pidfile there - /var/run/&lt;process&gt;/&lt;process&gt;.pid.</p>
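<p>Putting the lock file and pidfile conventions together, the start and stop paths of a script look something like the following sketch. This is illustrative only - it uses /tmp paths so it can run as a normal user, where a real script would use /var/lock/subsys and /var/run:</p>

```shell
# Start: launch the daemon, record its PID, create the lock file.
# /tmp stands in for /var/run and /var/lock/subsys in this sketch.
PIDFILE=/tmp/testservice.pid
LOCKFILE=/tmp/testservice.lock

sleep 60 &                  # stand-in for the real daemon process
echo $! > "$PIDFILE"
touch "$LOCKFILE"

# Stop: kill the recorded PID and clear the lock file.
kill "$(cat "$PIDFILE")"
rm -f "$LOCKFILE"
```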

<h3>Chkconfig Header</h3>

<p>For Redhat systems, all init scripts should contain a header line for chkconfig, normally it looks something like this:</p>

<pre><code># chkconfig: - 20 80
</code></pre>

<p>This means the script is off by default on all run levels, which can be changed using the chkconfig utility.</p>

<p>Additionally, a chkconfig description line should be included:</p>

<pre><code># description: This is a description for my service. Multiple lines \
#              should be ended with a backslash as shown.
</code></pre>

<h2>The library functions</h2>

<p>On Redhat (and Centos) systems, there are a few library functions that you should use in any init script. They are stored in /etc/rc.d/init.d/functions, and should be sourced into the init script:</p>

<pre><code># source function library
. /etc/rc.d/init.d/functions
</code></pre>

<p>This library provides a number of functions; the four most useful are described below.</p>

<h3>daemon</h3>

<p>Used to start a process that correctly daemonizes itself. Interestingly, the daemon function does not write a pidfile for the process it starts - I think it expects the process to create its own pidfile. It takes a few options in a command-line-like format, including the program to start and any options to pass to it, for instance:</p>

<pre><code>daemon --user=httpd --pidfile=/var/run/httpd.pid /usr/local/bin/&lt;service&gt; &lt;any service start options&gt;
</code></pre>

<p>The user switch is only required if the process does not run as root. The pidfile parameter is also optional - note that daemon does not actually create a pidfile for the process it starts.</p>

<h3>killproc</h3>

<p>Used to shutdown (or kill) a running process. Generally you pass it the pid file of the process and it shuts it down.</p>

<pre><code>killproc -p /var/run/process.pid /usr/local/bin/process [-signal]
</code></pre>

<h3>pidofproc</h3>

<p>Can be used to find the pid of the process, if it is running:</p>

<pre><code>pidofproc -p /var/run/process.pid /usr/local/bin/process
</code></pre>

<h3>status</h3>

<p>Tests to see if the process is running or not:</p>

<pre><code>status -p /var/run/process.pid /usr/local/bin/process
</code></pre>
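<p>Under the hood, status follows the LSB exit-code convention - 0 means the service is running, 3 means it is stopped. A hand-rolled sketch of the same idea (this is not the actual library code, just an illustration using a pidfile in /tmp):</p>

```shell
# Simplified stand-in for the library's status function.
# Returns 0 if the pidfile names a live process, 3 otherwise.
my_status() {
    pidfile="$1"
    if [ -f "$pidfile" ]; then
        if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            echo "running (pid $(cat "$pidfile"))"
            return 0
        fi
    fi
    echo "stopped"
    return 3
}

# Demo: the current shell is certainly running.
pf="${TMPDIR:-/tmp}/demo-status.pid"
echo $$ > "$pf"
my_status "$pf"
```

<p>kill -0 sends no signal at all - it only checks whether the PID exists and can be signalled, which is why it is a common liveness test in init scripts.</p>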

<h2>Init Script Template</h2>

<p>Putting all these points together gives a fairly generic init script template (taken directly from <a href="http://fedoraproject.org/wiki/Packaging:SysVInitScript">the guide</a> I mentioned earlier):</p>

<pre><code>#!/bin/sh
#
# &lt;daemonname&gt; &lt;summary&gt;
#
# chkconfig:   - 20 80
# description: &lt;description, split multiple lines with \
#              a backslash&gt;

# Source function library.
. /etc/rc.d/init.d/functions

exec=&quot;/path/to/&lt;daemonname&gt;&quot;
prog=&quot;&lt;service name&gt;&quot;
config=&quot;&lt;path to major config file&gt;&quot;

[ -e /etc/sysconfig/$prog ] &amp;&amp; . /etc/sysconfig/$prog

lockfile=/var/lock/subsys/$prog

start() {
    [ -x $exec ] || exit 5
    [ -f $config ] || exit 6
    echo -n $&quot;Starting $prog: &quot;
    # if not running, start it up here, usually something like &quot;daemon $exec&quot;
    retval=$?
    echo
    [ $retval -eq 0 ] &amp;&amp; touch $lockfile
    return $retval
}

stop() {
    echo -n $&quot;Stopping $prog: &quot;
    # stop it here, often &quot;killproc $prog&quot;
    retval=$?
    echo
    [ $retval -eq 0 ] &amp;&amp; rm -f $lockfile
    return $retval
}

restart() {
    stop
    start
}

reload() {
    restart
}

force_reload() {
    restart
}

rh_status() {
    # run checks to determine if the service is running or use generic status
    status $prog
}

rh_status_q() {
    rh_status &gt;/dev/null 2&gt;&amp;1
}


case &quot;$1&quot; in
    start)
        rh_status_q &amp;&amp; exit 0
        $1
        ;;
    stop)
        rh_status_q || exit 0
        $1
        ;;
    restart)
        $1
        ;;
    reload)
        rh_status_q || exit 7
        $1
        ;;
    force-reload)
        force_reload
        ;;
    status)
        rh_status
        ;;
    condrestart|try-restart)
        rh_status_q || exit 0
        restart
        ;;
    *)
        echo $&quot;Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload}&quot;
        exit 2
esac
exit $?
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/41-creating-centos-or-redhat-init-scripts</guid>
    </item>
    <item>
      <title>Unit Testing Map Reduce Programs With MRUnit</title>
      <link>https://appsintheopen.com/posts/40-unit-testing-map-reduce-programs-with-mrunit</link>
      <description>
        <![CDATA[<p><a href="/posts/39-creating-a-simple-map-reduce-program-for-cloudera-hadoop">Last time</a>, I described how to create a very simple map reduce program in Java. The next problem you run into is how to write some unit tests for this program.</p>

<p>The nice thing about testing many Map Reduce programs is that each stage of the process is generally very simple. For instance, a mapper receives a line of a file and does some transform on it to output a set of key value pairs.</p>

<p>This means the mapper can be tested in isolation from the reducer, and even a job that runs through many map and reduce phases can be tested one stage at a time.</p>

<p>To make testing Map Reduce programs easier, the Apache <a href="https://mrunit.apache.org/">MRUnit</a> project provides a testing library. MRUnit is based on JUnit, so its syntax should be pretty familiar.</p>

<p>To include MRUnit in your project add the following to the pom.xml:</p>

<pre><code>&lt;dependency&gt;
    &lt;groupId&gt;org.apache.mrunit&lt;/groupId&gt;
    &lt;artifactId&gt;mrunit&lt;/artifactId&gt;
    &lt;version&gt;1.1.0&lt;/version&gt;
    &lt;classifier&gt;hadoop2&lt;/classifier&gt;
    &lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;
</code></pre>

<p>Building on the code from my last article, it&#39;s pretty simple to create some unit tests for the mapper:</p>

<pre><code>package com.sodonnel.Hadoop;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver; 
import org.junit.*;

public class WordCountMapperTest {

    MapDriver&lt;Object, Text, Text, IntWritable&gt; mapDriver;

    @Before
    public void setup() {
        WordCountMapper mapper = new WordCountMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void splitValidRecordIntoTokens() throws IOException, InterruptedException {
        Text value = new Text(&quot;the,quick,brown,fox,the&quot;);
        mapDriver.withInput(new LongWritable(), value)
                .withOutput(new Text(&quot;the&quot;), new IntWritable(1)) 
                .withOutput(new Text(&quot;quick&quot;), new IntWritable(1)) 
                .withOutput(new Text(&quot;brown&quot;), new IntWritable(1)) 
                .withOutput(new Text(&quot;fox&quot;), new IntWritable(1)) 
                .withOutput(new Text(&quot;the&quot;), new IntWritable(1)) 
                .runTest();
    } 

    @Test
    public void recordWithSingleWordIsValid() throws IOException, InterruptedException {
        Text value = new Text(&quot;the&quot;);
        mapDriver.withInput(new LongWritable(), value)
                .withOutput(new Text(&quot;the&quot;), new IntWritable(1)) 
                .runTest();
    }

    @Test
    public void recordWithEmptyLineOutputsNothing() throws IOException, InterruptedException {
        Text value = new Text(&quot;&quot;);
        // If you don&#39;t specify any &#39;withOutput&#39; lines, then it EXPECTs no output. 
        // If there is output it will fail the test
        mapDriver.withInput(new LongWritable(), value)
                .runTest();
    }
}
</code></pre>

<p>Notice that the setup method creates a MapDriver object, parameterized with the same key and value types as the Mapper class under test.</p>

<p>Then, using the MapDriver, you define the input, any expected output and run the test - simple.</p>

<h2>Reducer Tests</h2>

<p>Writing Reducer tests is just as easy as mapper tests:</p>

<pre><code>package com.sodonnel.Hadoop;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver; 
import org.junit.*;

public class WordCountReducerTest {

    ReduceDriver&lt;Text, IntWritable, Text, IntWritable&gt; reduceDriver;

    @Before
    public void setup() {
        WordCountReducer reducer = new WordCountReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void splitValidRecordIntoTokens() throws IOException, InterruptedException {
        reduceDriver.withInput(new Text(&quot;the&quot;), Arrays.asList(new IntWritable(1), new IntWritable(2)))
                .withOutput(new Text(&quot;the&quot;), new IntWritable(3)) 
                .runTest();
    } 

}
</code></pre>

<p>This time you create a ReduceDriver object and pass it the expected input and output in a similar way.</p>

<h2>Testing The Driver</h2>

<p>After testing mapper and reducer in isolation, it is a good idea to have a couple of checks on the driver class that bolts the map reduce job together.</p>

<p>It turns out this is pretty simple, and it doesn&#39;t even require MRUnit to perform the tests.</p>

<p>Hadoop allows you to run map reduce jobs in a local mode, where files are read from and written to the local file system instead of HDFS. It is also possible to build a configuration object in code and pass it to the driver, instead of requiring all the usual xml config files.</p>

<p>To test the driver, you build a configuration (making sure it sets local mode), instantiate the driver class, pass the config and any command line arguments and then run the job - the following code gives an example:</p>

<pre><code>@Test
public void test() throws Exception {
    Configuration conf = new Configuration(); 
    conf.set(&quot;fs.defaultFS&quot;, &quot;file:///&quot;); 
    conf.set(&quot;mapreduce.framework.name&quot;, &quot;local&quot;);
    Path input = new Path(&quot;input&quot;); 
    Path output = new Path(&quot;output&quot;);

    FileSystem fs = FileSystem.getLocal(conf); 
    fs.delete(output, true); // delete old output
    WordCount driver = new WordCount();
    driver.setConf(conf);
    int exitCode = driver.run(new String[] { input.toString(), output.toString() });
    assertThat(exitCode, is(0));
    // checkOutput(conf, output);
}
</code></pre>

<p>Notice that in the conf object, the map reduce framework name is set to local - this tells Hadoop that it is running in a local, single JVM mode.</p>

<p>Also notice that an array is passed in the parameters to the run method - this is how you simulate passing command line parameters to the job. The parameters are received by the driver class just as if they were passed by the hadoop command line program.</p>

<p>I have commented out the last line, as it is specific to each test - you will probably want to run the job with a known input and an expected output, and validate that the actual output matches what is expected.</p>

<p>Note - I was not able to get this sort of test working on Windows. It worked fine on OS X and Linux.</p>

<h2>Running The Job against Local Files</h2>

<p>Once you have unit tested a map reduce job, the next stage is to pass some actual data to the program and see if it works end to end.</p>

<p>Hadoop has a local mode that allows you to run an entire map reduce program in a single JVM, and hence without a full Hadoop cluster - you can get away with just the Hadoop Client libraries installed.</p>

<p>To do this, you override the job tracker (Map Reduce V1) or the mapreduce.framework.name (YARN), setting it to the special value of local, which is actually the default.</p>

<p>You can also override the default filesystem, telling it to use the local filesystem instead of HDFS.</p>

<p>Place the following in a file called config/hadoop-local.xml:</p>

<pre><code>&lt;?xml version=&quot;1.0&quot;?&gt;
  &lt;configuration&gt;
    &lt;property&gt;
      &lt;name&gt;fs.default.name&lt;/name&gt;
      &lt;value&gt;file:///&lt;/value&gt;
    &lt;/property&gt;
    &lt;property&gt;
      &lt;name&gt;mapreduce.framework.name&lt;/name&gt;
      &lt;value&gt;local&lt;/value&gt;
    &lt;/property&gt;
&lt;/configuration&gt;
</code></pre>

<p>Then create directories called input and output, and put a file containing some CSV data into the input directory.</p>

<p>This will give you a directory structure that looks like:</p>

<pre><code>MapReduce-0.0.1-SNAPSHOT.jar
input
  data.csv
output
config
  hadoop-local.xml
</code></pre>

<p>Now you can run the local job using the following command:</p>

<pre><code>$ hadoop --config config jar MapReduce-0.0.1-SNAPSHOT.jar com.sodonnel.MapReduce.WordCount input output
</code></pre>

<p>The job should run very quickly (assuming your csv file is small) compared to running the job on a real cluster, and the output will be written into the output directory.</p>

<p>Apparently you can use this local Hadoop mode to run the job inside an IDE and debug it, but I have yet to get that working.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/40-unit-testing-map-reduce-programs-with-mrunit</guid>
    </item>
    <item>
      <title>Creating a Simple Map Reduce Program for Cloudera Hadoop</title>
      <link>https://appsintheopen.com/posts/39-creating-a-simple-map-reduce-program-for-cloudera-hadoop</link>
      <description>
        <![CDATA[<p><a href="http://www.amazon.co.uk/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=sr_1_1?s=books&ie=UTF8&qid=1418666238&sr=1-1">The Hadoop Definitive Guide</a> has a pretty good tutorial on creating simple map reduce programs.</p>

<p>The first thing to learn is that the typical map reduce program is made up of at least 3 different classes:</p>

<ul>
<li>The Driver class - this is the program entry point, and is used to setup the flow of the job</li>
<li>The Mapper class - this implements the map phase of the map reduce task</li>
<li>The Reducer class - this implements the reduce phase of the job</li>
</ul>

<p>Generally, a driver class should implement the Tool interface and extend the Configured class - this seems to be a fairly common pattern to bootstrap a map reduce program.</p>

<p>The Hadoop Definitive Guide provides a very simple example that prints out the configuration of the Hadoop cluster you are running on:</p>

<pre><code>package com.sodonnel.MapReduce;

import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class ConfigurationPrinter extends Configured implements Tool {

    static {
        Configuration.addDefaultResource(&quot;hdfs-default.xml&quot;);
        Configuration.addDefaultResource(&quot;hdfs-site.xml&quot;);
        Configuration.addDefaultResource(&quot;mapred-default.xml&quot;);
        Configuration.addDefaultResource(&quot;mapred-site.xml&quot;);
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        for (Entry&lt;String, String&gt; entry: conf) {
            System.out.printf(&quot;%s=%s\n&quot;, entry.getKey(), entry.getValue());
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
        System.exit(exitCode);
    }       

}
</code></pre>

<p>Notice that the main method does not invoke its own run method, it uses ToolRunner to do it instead, which performs some setup to bootstrap the application.</p>

<p>If you compile this code into a JAR and then run it against a cluster, it will print a rather long list of configuration variables:</p>

<pre><code>$ hadoop jar MapReduce-0.0.1-SNAPSHOT.jar com.sodonnel.MapReduce.ConfigurationPrinter

mapreduce.shuffle.ssl.enabled=false
mapreduce.tasktracker.report.address=127.0.0.1:0
mapreduce.tasktracker.http.threads=40
dfs.stream-buffer-size=4096
tfile.fs.output.buffer.size=262144
fs.permissions.umask-mode=022
dfs.client.datanode-restart.timeout=30
io.bytes.per.checksum=512
ha.failover-controller.graceful-fence.connection.retries=1
dfs.datanode.drop.cache.behind.writes=false
yarn.app.mapreduce.am.resource.cpu-vcores=1
hadoop.common.configuration.version=0.23.0
mapreduce.job.ubertask.enable=false
dfs.namenode.replication.work.multiplier.per.iteration=2
mapreduce.job.acl-modify-job=
io.seqfile.local.dir=${hadoop.tmp.dir}/io/local
fs.s3.sleepTimeSeconds=10
mapreduce.client.output.filter=FAILED
&lt;snip&gt;
</code></pre>

<h1>A Full Map Reduce Program</h1>

<p>Printing the Hadoop configuration might be useful for debugging configuration problems, but it&#39;s not very interesting. A full map reduce program is probably more instructive. To show how to create one, I came across a variation of the word count example - read a csv file where each row is a comma separated list of words, and output the total count of each word.</p>
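<p>The intended behaviour can be sketched in plain shell, which makes a handy reference when checking the job&#39;s output later (the sample line is made up):</p>

```shell
# Word count over a comma-separated line, as a shell pipeline:
# tr plays the mapper (one word per line), sort + uniq -c play the
# shuffle and reduce phases.
counts="$(printf 'the,quick,brown,fox,the\n' |
    tr ',' '\n' |
    sort | uniq -c |
    awk '{ print $2 "\t" $1 }')"
echo "$counts"
```

<p>This prints each distinct word with its count (here, the appears twice) - the same tab-separated format the MapReduce job writes to its output files.</p>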

<p>This program will make use of 3 classes:</p>

<ul>
<li>WordCount - this is the driver program</li>
<li>WordCountMapper - This will receive the data a line at a time, split it into words (the key) and counts (the value, always 1) to pass onto the reducer</li>
<li>WordCountReducer - This will receive the list of words created by the mapper and sum up the count of each word, before writing it to the output directory.</li>
</ul>

<p>WordCount.java:</p>

<pre><code>package com.sodonnel.MapReduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount extends Configured implements Tool {



    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = Job.getInstance(conf, &quot;WordCount&quot;);
        job.setJarByClass(getClass());

        // Setup MapReduce
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }


}
</code></pre>

<p>WordCountMapper.java</p>

<pre><code>package com.sodonnel.MapReduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends
        Mapper&lt;Object, Text, Text, IntWritable&gt; {

    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] csv = value.toString().split(&quot;,&quot;);
        for (String str : csv) {
            word.set(str);
            context.write(word, ONE);
        }
    }
}
</code></pre>

<p>WordCountReducer.java:</p>

<pre><code>package com.sodonnel.MapReduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
        Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {

    public void reduce(Text text, Iterable&lt;IntWritable&gt; values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(text, new IntWritable(sum));
    }
}
</code></pre>

<h2>Compile the JAR</h2>

<pre><code>$ mvn clean install -Dmaven.test.skip=true
</code></pre>

<h2>Add Input Data</h2>

<p>On HDFS, create the input directory in your home directory, and put some CSV data into it:</p>

<pre><code>$ hadoop fs -mkdir input
$ hadoop fs -put csvdata.csv input/
</code></pre>

<h2>Run the Map Reduce Program</h2>

<pre><code>$ hadoop jar MapReduce-0.0.1-SNAPSHOT.jar com.sodonnel.MapReduce.WordCount input output
</code></pre>

<p>If all goes well, the output directory will be created, and the resulting word counts will be in a file in that directory.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/39-creating-a-simple-map-reduce-program-for-cloudera-hadoop</guid>
    </item>
    <item>
      <title>Maven Config For Cloudera Map Reduce Programs</title>
      <link>https://appsintheopen.com/posts/38-maven-config-for-cloudera-map-reduce-programs</link>
      <description>
        <![CDATA[<p>I have been working with Hadoop for a while now, and I have been able to achieve everything I need with a combination of Sqoop, Oozie, Hive and shell scripts. Given a bit of free time, I decided it would be worth exploring how to create simple map reduce jobs in Java.</p>

<p>First install <a href="http://maven.apache.org/">Maven</a> (Java build tool and dependency manager) and <a href="http://www.eclipse.org">Eclipse</a> (Java IDE) on your development machine.</p>

<p>Then create a new Maven project using Eclipse or from the command line:</p>

<pre><code>$ mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.sodonnel.Hadoop -DartifactId=WordCount
</code></pre>

<p>This will create a directory called WordCount and inside it you will find a java project structure and a pom.xml file, which is the maven config file.</p>

<p>As we want to create a Hadoop Map Reduce program, we need to add the Hadoop dependencies to our project. Searching the web, people seem to put all sorts of dependencies into their pom.xml for Hadoop jobs, but I found I only need a few entries - one to specify the Cloudera Maven repo, another to bring in the Hadoop dependencies and then a couple more to allow me to write unit tests against map reduce jobs. My complete pom.xml is:</p>

<pre><code>&lt;project xmlns=&quot;http://maven.apache.org/POM/4.0.0&quot; xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;
  xsi:schemaLocation=&quot;http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd&quot;&gt;
  &lt;modelVersion&gt;4.0.0&lt;/modelVersion&gt;

  &lt;groupId&gt;com.sodonnel.Hadoop&lt;/groupId&gt;
  &lt;artifactId&gt;WordCount&lt;/artifactId&gt;
  &lt;version&gt;1.0-SNAPSHOT&lt;/version&gt;
  &lt;packaging&gt;jar&lt;/packaging&gt;

  &lt;name&gt;WordCount&lt;/name&gt;
  &lt;url&gt;http://maven.apache.org&lt;/url&gt;

  &lt;repositories&gt;
    &lt;repository&gt;
      &lt;id&gt;cloudera&lt;/id&gt;
      &lt;url&gt;https://repository.cloudera.com/artifactory/cloudera-repos/&lt;/url&gt;
    &lt;/repository&gt;
  &lt;/repositories&gt;

  &lt;properties&gt;
    &lt;project.build.sourceEncoding&gt;UTF-8&lt;/project.build.sourceEncoding&gt;
  &lt;/properties&gt;

  &lt;dependencies&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;junit&lt;/groupId&gt;
      &lt;artifactId&gt;junit&lt;/artifactId&gt;
      &lt;version&gt;4.12&lt;/version&gt;
      &lt;scope&gt;test&lt;/scope&gt;
    &lt;/dependency&gt;

    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.mrunit&lt;/groupId&gt;
      &lt;artifactId&gt;mrunit&lt;/artifactId&gt;
      &lt;version&gt;1.1.0&lt;/version&gt;
      &lt;classifier&gt;hadoop2&lt;/classifier&gt; 
    &lt;/dependency&gt;

    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt;
      &lt;artifactId&gt;hadoop-client&lt;/artifactId&gt;
      &lt;version&gt;2.5.0-cdh5.2.1&lt;/version&gt;
    &lt;/dependency&gt;

  &lt;/dependencies&gt;
&lt;/project&gt;
</code></pre>

<p>There are a couple of things to watch out for.</p>

<p>Cloudera have a different version of the hadoop-client library for Map Reduce V1 and YARN clusters, and the version you should use is determined by the version identifier.</p>

<ul>
<li>2.5.0-cdh5.2.1 for YARN</li>
<li>2.5.0-mr1-cdh5.2.1 for Map Reduce V1</li>
</ul>

<p>The second thing to be aware of is picking the correct version for your Cloudera cluster - <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_vd_cdh5_maven_repo.html#concept_qyl_clc_sp_unique_2">this link</a> is useful to figure that out.</p>

<h2>Download Project Dependencies and Compile</h2>

<p>At this point, you will want to install all the project dependencies into your local maven repo:</p>

<pre><code>$ mvn clean install
</code></pre>

<p>This should pull down all packages required to run and compile your application. This command will also compile your application and build it into a JAR, ready for execution. We have not added any source files to this project as yet, but maven put a &#39;hello world&#39; class in for us. Inside the target directory of your project, you should find a file called WordCount-1.0-SNAPSHOT.jar, which is the compiled application.</p>

<h2>Adding To Eclipse</h2>

<p>If you created the project from the command line and want to import it into Eclipse, then run the following command to generate the Eclipse project files:</p>

<pre><code>$ mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
</code></pre>

<p>This ran for quite a while on my system as the Java Doc files were quite large. Finally, import the project into Eclipse using File &gt; Import, then expand Maven and select &#39;Existing Maven Projects&#39;.</p>

<p>My next article will have some information on creating a simple map reduce job and running it against Hadoop.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/38-maven-config-for-cloudera-map-reduce-programs</guid>
    </item>
    <item>
      <title>Creating a RPM From the Java JDK Tar File</title>
      <link>https://appsintheopen.com/posts/37-creating-a-rpm-from-the-java-jdk-tar-file</link>
      <description>
        <![CDATA[<p>I have been doing some work with Cloudera Hadoop recently, and as part of building a cluster I took the opportunity to automated it using Puppet.</p>

<p>For Cloudera, pretty much all of the setup is done via RPMs and config files, which are easily deployed with Puppet, but there was one initial step that did not have an RPM available - the specific Java version that needs to be installed.</p>

<p>Java is available as a download from Oracle as a tar.gz file - you can extract the tar file anywhere on your system, point your PATH at the extracted directory, and Java will work. I decided it would be interesting to turn the tar file downloaded from Oracle into an RPM that is easily deployed from Puppet.</p>

<h2>Step 1 - Understand How To Build an RPM</h2>

<p>Building an RPM is pretty simple using the rpmbuild command. By default, there is a directory structure in <code>/root/rpmbuild</code> and inside it are several directories:</p>

<ul>
<li>BUILD - This is where the expanded source goes, i.e. the result of untarring the Java archive</li>
<li>BUILDROOT - This is where the built package is deployed</li>
<li>RPMS - The final RPM will be here after the build process completes</li>
<li>SOURCES - This is where source code goes, in this case the tar.gz file downloaded from Oracle</li>
<li>SPECS - This is where the spec file that controls the RPM build is stored</li>
<li>SRPMS - If your build also produces a source RPM, it will be stored here.</li>
</ul>
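<p>If the tree does not already exist, it can be recreated by hand (the rpmdev-setuptree utility from the rpmdevtools package does the same job); a minimal sketch, using a /tmp path so it is safe to run as a normal user:</p>

```shell
# Recreate the standard rpmbuild directory tree.
# ${TMPDIR:-/tmp}/rpmbuild-demo stands in for /root/rpmbuild.
top="${TMPDIR:-/tmp}/rpmbuild-demo"
for d in BUILD BUILDROOT RPMS SOURCES SPECS SRPMS; do
    mkdir -p "$top/$d"
done
ls "$top"
```
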

<p>BUILDROOT is interesting - imagine it as an empty root file system. If you need to compile source code as part of your RPM build (using the traditional configure, make, make install sequence), you should set the --prefix parameter of configure to $RPM_BUILD_ROOT/usr/local - in this way, your package will be installed into $RPM_BUILD_ROOT/usr/local and will not affect anything else installed on your machine.</p>

<p>When the RPM is built, rpmbuild will grab all the files inside $RPM_BUILD_ROOT and strip the $RPM_BUILD_ROOT off, leaving you with an RPM that puts the files in the correct place.</p>

<h2>Step 2 - Decide On Install Location</h2>

<p>Hadoop wants Java to be installed in /usr/java, eg:</p>

<pre><code>/usr/java/jdk-7u55-linux-x64
</code></pre>

<p>With a symlink <code>/usr/java/default</code> pointing at the active Java version.</p>
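<p>The target layout can be sketched with plain shell commands (a /tmp path stands in for /usr/java here, so this can be tried without root):</p>

```shell
# Sketch of the layout the RPM will create; /tmp stands in for /usr/java.
root="${TMPDIR:-/tmp}/demo-usr-java"
mkdir -p "$root/jdk-7u55-linux-x64"

# ln -sfn replaces any existing "default" link, pointing it at the
# active JDK version - the same thing the RPM %post step does.
ln -sfn "$root/jdk-7u55-linux-x64" "$root/default"
readlink "$root/default"
```
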

<h2>Step 3 - Create a Spec File</h2>

<p>The spec file is where you outline how to turn your sources into a working build, and also list out the files that should be included in the RPM. Create the file jdk-7u55-linux-x64.spec in the SPECS directory containing:</p>

<pre><code>Summary: Java JDK
Name: javajdk
Version: 1.7.0
Release: 55
Group: Software Development
Distribution: Java 7 for RedHat Linux
Vendor: Oracle
Packager: Stephen ODOnnell
License: GPL
# Skip autogenerating RPM dependencies
AutoReqProv: no

%description
Java JDK packaged as an RPM

%prep
rm -rf $RPM_BUILD_DIR/jdk%{version}_%{release}
rm -rf $RPM_BUILD_ROOT/*
tar zxf $RPM_SOURCE_DIR/jdk-7u55-linux-x64.tar.gz -C $RPM_BUILD_DIR

%build

%install
mkdir -p $RPM_BUILD_ROOT/usr/java
cp -r $RPM_BUILD_DIR/jdk%{version}_%{release} $RPM_BUILD_ROOT/usr/java/

%files
/usr/java/jdk%{version}_%{release}

%post
ln -s /usr/java/jdk%{version}_%{release} /usr/java/default
</code></pre>

<p>The prep step simply untars the source archive into the BUILD_DIR.</p>

<p>Normally the build step is where you would compile the source using make etc, but in this case that isn&#39;t necessary.</p>

<p>The install step is generally where you would run the make install command, but in this case, we just copy the contents of the BUILD_DIR into the BUILD_ROOT.</p>

<p>The files step lists all the files you want to be included in the RPM - notice that you do not include the RPM_BUILD_ROOT prefix at the start of the file paths - rpmbuild is smart enough to know where to find the files.</p>

<p>The final step in this RPM is the post step - this is executed at the end of RPM installation and creates the symlink we require.</p>

<h2>Step 4 - Create the RPM</h2>

<pre><code>$ cd /root/rpmbuild/SPECS
$ rpmbuild -ba jdk-7u55-linux-x64.spec
</code></pre>

<p>The resulting RPM will be generated in <code>/root/rpmbuild/RPMS/x86_64/javajdk-1.7.0-55.x86_64.rpm</code> and can be installed as usual:</p>

<pre><code>rpm -ivh /root/rpmbuild/RPMS/x86_64/javajdk-1.7.0-55.x86_64.rpm
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/37-creating-a-rpm-from-the-java-jdk-tar-file</guid>
    </item>
    <item>
      <title>Speeding up Ember CLI build times on OS X</title>
      <link>https://appsintheopen.com/posts/36-speeding-up-ember-cli-build-times-on-os-x</link>
      <description>
        <![CDATA[<p>I started playing with <a href="http://emberjs.com/">Ember</a> recently, and I decided to develop using <a href="http://www.ember-cli.com/">Ember CLI</a>. This live compiles your JavaScript as each file changes and adds some hooks so the webpage refreshes when the code changes.</p>

<p>One thing that nearly made me give up on Ember CLI is that it just seemed too slow. I am developing on a 3ish year old Macbook Pro with an old fashioned spinning disk, so its not exactly state of the art hardware. Each time I changed a file it took 3 to 4 seconds to compile, and that was on a very small app.</p>

<p>I came across <a href="https://github.com/stefanpenner/ember-cli/issues/538">this page</a> that suggested slowness is often caused by lots of files that need to be read and written in the project tmp directory, so that got me thinking - what if the tmp folder wasn&#39;t on disk, and just lived in memory?</p>

<p>On OS X creating a RAM disk is easy, I created one using the following script:</p>

<pre><code># The number at the end is 512 MB, set by 2048*512 = 1048576
diskutil erasevolume HFS+ &#39;RAM Disk&#39; `hdiutil attach -nomount ram://1048576`
mkdir /Volumes/RAM\ Disk/ember
</code></pre>

<p>Then I deleted my Ember project&#39;s tmp directory and created a symlink to /Volumes/RAM Disk/ember - my compile times instantly dropped to about 1 second, which I think is tolerable.</p>
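<p>The tmp swap looks something like this - /Volumes/RAM Disk only exists after the diskutil step above, so a stand-in directory is used here to keep the commands runnable anywhere:</p>

```shell
# Replace the project's tmp directory with a symlink onto the RAM disk.
# ${TMPDIR:-/tmp}/ram-disk-demo stands in for "/Volumes/RAM Disk".
ramdir="${TMPDIR:-/tmp}/ram-disk-demo/ember"
mkdir -p "$ramdir"

proj="$(mktemp -d)"     # stands in for the Ember project root
cd "$proj"
rm -rf tmp              # drop the on-disk tmp directory
ln -s "$ramdir" tmp     # Ember CLI now does its build I/O in RAM
ls -l tmp
```

<p>Note that the RAM disk contents vanish on reboot, which is fine for Ember CLI&#39;s tmp directory - it is rebuilt on the next build.</p>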

<p>I&#39;m hoping to get an SSD in this old machine soon, but until then this hack keeps my build times tolerable.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/36-speeding-up-ember-cli-build-times-on-os-x</guid>
    </item>
    <item>
      <title>Emacs Setup for Version 24.3</title>
      <link>https://appsintheopen.com/posts/35-emacs-setup-for-version-24-3</link>
      <description>
        <![CDATA[<p>A few years ago now, I created a guide to getting started with <a href="http://appsintheopen.com/articles/1-setting-up-emacs-for-rails-development">Emacs for Rails development</a>, and one of the key things I setup was ECB - The Emacs Code Browser.</p>

<p>It turns out things have moved on since the first time I did this (Emacs 22), and even since my updated instructions for Emacs 23. I recently did a clean setup of Emacs 24.3 on Centos 7, and found things are much easier now - mostly because of <a href="http://melpa.milkbox.net/">Melpa</a> and <a href="http://marmalade-repo.org/">Marmalade</a>, which make installing packages almost as simple as installing a Ruby gem!</p>

<p>So, this is my guide to getting Emacs 24.3 working for my development, starting with ECB and then a few other things.</p>

<h2>A few Essential Settings</h2>

<p>These are some settings I have picked up over the years that I stick at the top of my .emacs:</p>

<pre><code>;;; Allows syntax highlighting to work, I think
(global-font-lock-mode 1)
;;; Prevents the startup screen displaying
(setq inhibit-splash-screen t) 
;;; Prompts when you attempt to quit (C-x C-c) in case you do it by accident
(setq confirm-kill-emacs &#39;yes-or-no-p) 
;;; Turns off the tool and menu bars
(tool-bar-mode -1)
(menu-bar-mode -1)
;;; Don&#39;t use tabs when indenting code
(setq indent-tabs-mode nil)
;;; When you hit Return, also indent the line (saves having to type tab too)
(define-key global-map (kbd &quot;RET&quot;) &#39;newline-and-indent)
</code></pre>

<h2>Configure Package Manager</h2>

<p>First, add both the Melpa and Marmalade repos to your .emacs:</p>

<pre><code>(require &#39;package)
(add-to-list &#39;package-archives
             &#39;(&quot;marmalade&quot; . &quot;http://marmalade-repo.org/packages/&quot;))
(add-to-list &#39;package-archives
             &#39;(&quot;melpa&quot; . &quot;http://melpa.milkbox.net/packages/&quot;) t)
</code></pre>

<p>Now, you can list all the packages available with the M-x package-list-packages command.</p>

<h2>Install ECB</h2>

<p>From the list of packages above you can scroll around and find ECB. Luckily, its dependency CEDET is already installed by default in Emacs 24.3, so ECB installation is a simple matter of installing its package! If you select it from the list of packages, then select the install button, it should download and install.</p>

<p>In my .emacs, I only need to add the following code block that sets it up the way I like it:</p>

<pre><code>(custom-set-variables
 ;; custom-set-variables was added by Custom.
 ;; If you edit it by hand, you could mess it up, so be careful.
 ;; Your init file should contain only one such instance.
 ;; If there is more than one, they won&#39;t work right.
 &#39;(ecb-layout-name &quot;left14&quot;)
 &#39;(ecb-layout-window-sizes nil)
  ;; Adjust the layout: 22% of the width for the directories window, 78% for the code window; shrink the history window down to only 5% of the height
 &#39;(ecb-layout-window-sizes (quote ((&quot;left14&quot; (ecb-directories-buffer-name 0.22631578947368424 . 0.7821428571428571) (ecb-history-buffer-name 0.22631578947368424 . 0.05)))))
 &#39;(ecb-primary-secondary-mouse-buttons (quote mouse-1--C-mouse-1))
 &#39;(ecb-options-version &quot;2.40&quot;)
 &#39;(ecb-tip-of-the-day nil)
 &#39;(ecb-tree-buffer-style (quote ascii-guides))
)

(setq ecb-source-path &#39;(&quot;~/source&quot;))
</code></pre>

<p>Notice the ecb-source-path - you can add many different source paths here, depending on where your code is stored etc.</p>

<h2>Enhanced Ruby Mode</h2>

<p>Back when I first started with Ruby, Emacs did not come with a built-in Ruby mode - fortunately, now it does. However, there is a better one called enh-ruby-mode - Enhanced Ruby Mode. Again, install it from the package manager, and then add the following to .emacs to configure it to open whatever types of Ruby files you want:</p>

<pre><code>(add-to-list &#39;auto-mode-alist &#39;(&quot;\\.rb$&quot; . enh-ruby-mode))
(add-to-list &#39;auto-mode-alist &#39;(&quot;\\.rake$&quot; . enh-ruby-mode))
(add-to-list &#39;auto-mode-alist &#39;(&quot;Gemfile$&quot; . enh-ruby-mode))
(eval-after-load &quot;enh-ruby-mode&quot;
 ;; Use wavy underlines for syntax errors and warnings, not a box.
  (custom-set-faces
   &#39;(erm-syn-warnline ((t (:underline (:style wave :color &quot;orange&quot;)))))
   &#39;(erm-syn-errline ((t (:underline (:style wave :color &quot;red&quot;)))))))
</code></pre>

<p>Enhanced Ruby Mode also does live syntax checking as you type, which is pretty handy.</p>

<h2>Javascript</h2>

<p>There is a built-in Javascript mode, but again there is a better one called js2-mode - install it using the package manager. The advantage of js2-mode is that it does syntax checking as you type, and I previously used it as my major Javascript mode. However, based on these <a href="http://yoo2080.wordpress.com/2012/03/15/js2-mode-setup-recommendation/">recommendations</a>, it is better to let it work as a minor mode. The default Javascript mode indent is 4, so I changed it to 2:</p>

<pre><code>(add-hook &#39;js-mode-hook &#39;js2-minor-mode)
(setq js-indent-level 2)
</code></pre>

<h2>Color Scheme</h2>

<p>I like the Tango color theme, so I install it with the package manager (both color-theme and color-theme-tango), and add the following to my .emacs. For some reason I had to explicitly load these files, unlike with all the other packages I installed:</p>

<pre><code>;; Color Theme
(add-to-list &#39;load-path &quot;~/.emacs.d/elpa/color-theme-20080305.34/&quot;)
(load-file &quot;~/.emacs.d/elpa/color-theme-20080305.34/color-theme.el&quot;)

(add-to-list &#39;load-path &quot;~/.emacs.d/elpa/color-theme-tango-0.0.2/&quot;)
(load-file &quot;~/.emacs.d/elpa/color-theme-tango-0.0.2/color-theme-tango.el&quot;)
(require &#39;color-theme)
(color-theme-initialize)
(color-theme-tango)
</code></pre>

<h2>Markdown Mode</h2>

<p>Install markdown-mode from the package manager and then add:</p>

<pre><code>(add-to-list &#39;auto-mode-alist &#39;(&quot;\\.text\\&#39;&quot; . markdown-mode))
(add-to-list &#39;auto-mode-alist &#39;(&quot;\\.markdown\\&#39;&quot; . markdown-mode))
(add-to-list &#39;auto-mode-alist &#39;(&quot;\\.md\\&#39;&quot; . markdown-mode))
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/35-emacs-setup-for-version-24-3</guid>
    </item>
    <item>
      <title>Check Apache or Nginx Compression Is Working</title>
      <link>https://appsintheopen.com/posts/34-check-apache-or-nginx-compression-is-working</link>
      <description>
<![CDATA[<p>If you go to the bother of setting up gzip compression on your webserver, it&#39;s a good idea to check it is actually working. It can make a pretty big difference to the responsive feel of a site, especially if the content is on the large side.</p>

<p>Using curl, you pass a header indicating the client accepts gzip format files:</p>

<pre><code>curl -I -H &#39;Accept-Encoding: gzip,deflate&#39; http://website.to.check.com/
</code></pre>

<p>The -I switch fetches only the document headers. The -H switch passes a custom header, in this case specifying that the client accepts gzip output. If compression is working, look for the following in the output:</p>

<pre><code>Content-Encoding: gzip
</code></pre>

<p>or</p>

<pre><code>Content-Encoding: deflate
</code></pre>

<p>If you don&#39;t see a Content-Encoding line, or it doesn&#39;t indicate gzip or deflate, then your webserver is not compressing your content.</p>
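<p>If you want to run this check from a monitoring script, it is easy to reproduce in Ruby with the standard net/http library. This is a minimal sketch - website.to.check.com is a placeholder host, just as in the curl example:</p>

```ruby
require 'net/http'
require 'uri'

# A response counts as compressed if Content-Encoding names a scheme
# we advertised in Accept-Encoding.
def compressed?(content_encoding)
  %w[gzip deflate].include?(content_encoding.to_s.downcase)
end

# Send a HEAD request advertising gzip/deflate, mirroring `curl -I -H ...`,
# and report whether the server compressed the response.
def check_compression(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri, 'Accept-Encoding' => 'gzip,deflate')
  end
  compressed?(response['Content-Encoding'])
end

# check_compression('http://website.to.check.com/')
```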
]]>
      </description>
      <guid>https://appsintheopen.com/posts/34-check-apache-or-nginx-compression-is-working</guid>
    </item>
    <item>
      <title>Ruby Performance Analysis Tools</title>
      <link>https://appsintheopen.com/posts/33-ruby-performance-analysis-tools</link>
      <description>
        <![CDATA[<p>I was watching <a href="https://www.youtube.com/watch?v=d2QdITRRMHg">a video</a> by Aaron Patterson, where he highlighted a few Ruby performance tuning tools. I thought it would be useful to write a quick post to gather them all together in the same place.</p>

<h2>Benchmark</h2>

<p>The classic tool for testing if one block of code is faster than the other in Ruby is the <a href="http://ruby-doc.org/stdlib-2.1.0/libdoc/benchmark/rdoc/Benchmark.html">Benchmark module</a>. It comes built in, and is pretty simple to use:</p>

<pre><code>require &#39;benchmark&#39;

h = { foo: 1, bar: 1, baz: 1, foobar: 1 }
a = [:foo, :bar, :baz, :foobar]


n = 50000
Benchmark.bm(6) do |x|
  x.report(&#39;hash:&#39;)  { n.times do; h.include?(:foobar); end }
  x.report(&#39;array:&#39;) { n.times do; a.include?(:foobar); end }
end
</code></pre>

<pre><code>$ ruby benchmark.rb
             user     system      total        real
hash:    0.460000   0.000000   0.460000 (  0.459387)
array:   1.040000   0.010000   1.050000 (  1.049521)
</code></pre>

<p>Benchmark does an OK job, but the problem with it is that you need to guess how many iterations to run the code block, so that it accrues enough time for the result to be meaningful rather than reported as zero.</p>

<h2>Benchmark/IPS</h2>

<p>Often, a more interesting metric is Iterations Per Second, rather than the wall clock time of one method versus another. For this, the <a href="https://github.com/evanphx/benchmark-ips">benchmark-ips</a> gem is useful. It figures out how many iterations to run the code so that the results are meaningful, and also calculates the standard deviation of the results.</p>

<pre><code>require &#39;benchmark/ips&#39;

h = { foo: 1, bar: 1, baz: 1, foobar: 1 }
a = [:foo, :bar, :baz, :foobar]

Benchmark.ips do |x|
  x.report(&quot;hash&quot;)  { h.include?(:foobar) }
  x.report(&quot;array&quot;) { a.include?(:foobar) }

  x.compare
end       
</code></pre>

<pre><code>$ ruby benchmark_ips.rb
Calculating -------------------------------------
                hash     24983 i/100ms
               array     24084 i/100ms
-------------------------------------------------
                hash  6430567.4 (±8.8%) i/s -   31378648 in   4.937493s
               array  3636150.6 (±4.0%) i/s -   18014832 in   4.963605s

Comparison:
                hash:  6430567.4 i/s
               array:  3636150.6 i/s - 1.77x slower

</code></pre>

<h2>Stackprof</h2>

<p><a href="https://github.com/tmm1/stackprof">Stackprof</a> is a sampling call-stack profiler that works with Ruby 2.1 and above. It samples the call stack at regular, tunable intervals, and on each sample it captures the method that is executing at that moment.</p>

<p>By doing this, a slow method will be encountered by the sampler more often than a fast one, and hence will be reported as consuming a bigger percentage of the time.</p>

<p>This technique is not perfect, as it could miss some very fast method calls completely, but generally when profiling you are only concerned with slow methods anyway.</p>
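<p>The sampling idea itself can be sketched in a few lines of Ruby. This is purely illustrative - it is not how Stackprof is implemented - but it shows why a slow method accumulates more samples:</p>

```ruby
# Toy sampling profiler: a background thread periodically records the
# top stack frame of the target thread while the block runs.
def sample_profile(interval = 0.001)
  counts = Hash.new(0)
  target = Thread.current
  sampler = Thread.new do
    loop do
      frame = target.backtrace&.first
      counts[frame] += 1 if frame
      sleep interval
    end
  end
  yield
  sampler.kill
  counts
end

def slow_method
  sleep 0.05
end

samples = sample_profile { 3.times { slow_method } }
# Frames inside slow_method should dominate the sample counts.
```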

<pre><code>require &#39;stackprof&#39;

def slow_method
  sleep(0.1)
end

def fast_method
end

StackProf.run(mode: :cpu, out: &#39;stackprof-output.dump&#39;) do
  100.times do
    slow_method
    fast_method
  end
end
</code></pre>

<p>The captured call data is written to the stackprof-output.dump file, which can be analyzed by the stackprof command line tool that is installed with the gem:</p>

<pre><code>$ stackprof stackprof-output.dump
==================================
  Mode: cpu(1000)
  Samples: 14 (0.00% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
        14 (100.0%)          14 (100.0%)     Object#slow_method
        14 (100.0%)           0   (0.0%)     block in &lt;main&gt;
        14 (100.0%)           0   (0.0%)     block (2 levels) in &lt;main&gt;
        14 (100.0%)           0   (0.0%)     &lt;main&gt;
        14 (100.0%)           0   (0.0%)     &lt;main&gt;
</code></pre>

<p>This example is a bit contrived - something more useful would be to profile some requests in a Rails application to see how things look.</p>

<p>To use Stackprof in a Rails application, you simply insert it as an extra piece of middleware, by adding the following to application.rb:</p>

<pre><code>config.middleware.insert_before(Rack::Sendfile, StackProf::Middleware, enabled: true, mode: :cpu, interval: 1000, save_every: 5)
</code></pre>

<p>The save_every parameter tells stackprof to save the trace to disk after processing every save_every requests.</p>

<p>When you run the Rails server, it will write stackprof trace files into your application&#39;s tmp directory, and these can be profiled as normal:</p>

<pre><code>$ stackprof stackprof-cpu-27132-1407514856.dump
==================================
  Mode: cpu(1000)
  Samples: 26 (0.00% miss rate)
  GC: 1 (3.85%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
         3  (11.5%)           3  (11.5%)     ActiveSupport::Subscriber#start
         3  (11.5%)           3  (11.5%)     block in ActionView::PathResolver#find_template_paths
         2   (7.7%)           2   (7.7%)     block in Rack::Utils::KeySpaceConstrainedParams#to_params_hash
         1   (3.8%)           1   (3.8%)     block (4 levels) in Class#class_attribute
         1   (3.8%)           1   (3.8%)     block in ActiveRecord::ConnectionAdapters::AbstractAdapter#lease
         1   (3.8%)           1   (3.8%)     ThreadSafe::NonConcurrentCacheBackend#[]
         1   (3.8%)           1   (3.8%)     ActiveSupport::TaggedLogging::Formatter#pop_tags
         1   (3.8%)           1   (3.8%)     Rack::BodyProxy#initialize
         1   (3.8%)           1   (3.8%)     ActionView::PathResolver#extract_handler_and_format_and_variant
         1   (3.8%)           1   (3.8%)     ActionView::Resolver::Path#to_str
         1   (3.8%)           1   (3.8%)     ActionDispatch::Journey::Route#matches?
         1   (3.8%)           1   (3.8%)     ActiveSupport::Duration.===
         1   (3.8%)           1   (3.8%)     Logger#format_severity
         1   (3.8%)           1   (3.8%)     ActiveSupport::Configurable::ClassMethods#config
         1   (3.8%)           1   (3.8%)     ActiveSupport::Inflector#underscore
         1   (3.8%)           1   (3.8%)     ActiveSupport::FileUpdateChecker#max_mtime
         1   (3.8%)           1   (3.8%)     Hash#extractable_options?
         9  (34.6%)           1   (3.8%)     Benchmark#realtime
         1   (3.8%)           1   (3.8%)     ActiveSupport::TaggedLogging::Formatter#current_tags
        25  (96.2%)           1   (3.8%)     Rack::Runtime#call
         2   (7.7%)           0   (0.0%)     Logger#add
        18  (69.2%)           0   (0.0%)     ActionDispatch::Journey::Router#call
        18  (69.2%)           0   (0.0%)     ActionDispatch::Routing::RouteSet#call
        18  (69.2%)           0   (0.0%)     Rack::ETag#call
        18  (69.2%)           0   (0.0%)     Rack::ConditionalGet#call
        18  (69.2%)           0   (0.0%)     Rack::Head#call
        18  (69.2%)           0   (0.0%)     ActionDispatch::ParamsParser#call
        18  (69.2%)           0   (0.0%)     ActionDispatch::Flash#call
        18  (69.2%)           0   (0.0%)     Rack::Session::Abstract::ID#context
        18  (69.2%)           0   (0.0%)     Rack::Session::Abstract::ID#call
</code></pre>

<h2>Allocation Tracer</h2>

<p>Often, a piece of code that allocates fewer objects is more efficient. This is because less memory needs to be allocated, and the garbage collector needs to run less frequently. The <a href="https://github.com/ko1/allocation_tracer">Allocation_Tracer</a> gem allows you to see how many of each type of object your application creates, and also which file and line number they were created on.</p>

<pre><code>require  &#39;allocation_tracer&#39;
require &#39;pp&#39;

ObjectSpace::AllocationTracer.setup(%i{path line type})

result = ObjectSpace::AllocationTracer.trace do
  str = &#39;hello &#39;
  10_000.times{|i|
    str &lt;&lt; &#39;hello &#39;
  }
  str = &#39;hello &#39;
  10_000.times{|i|
    str = str + &#39;hello &#39;
  }
end

pp result
</code></pre>

<p>The results show that using &lt;&lt; to concatenate a string produces far fewer objects than the + method - this is because &lt;&lt; does an in-place concatenation of the string, while + creates a new object:</p>

<pre><code>{[&quot;allocation.rb&quot;, 13, :T_STRING]=&gt;[20000, 30, 19463, 0, 3, 282839776],
 [&quot;allocation.rb&quot;, 9, :T_STRING]=&gt;[10000, 0, 10000, 1, 1, 0],
 [&quot;allocation.rb&quot;, 7, :T_STRING]=&gt;[1, 0, 1, 1, 1, 102399],
 [&quot;allocation.rb&quot;, 11, :T_STRING]=&gt;[1, 1, 10, 10, 10, 0]}
</code></pre>
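<p>The in-place behaviour is easy to confirm directly with object_id - &lt;&lt; mutates the receiver, while + allocates a new String:</p>

```ruby
s = 'hello '.dup            # dup in case frozen string literals are enabled
original_id = s.object_id

s << 'world'                # in-place append - s is still the same object
raise unless s.object_id == original_id

t = s + '!'                 # + builds and returns a brand new String
raise unless t.object_id != s.object_id
```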

<p>If you have a large application, you can aggregate the counts at a higher level, by grouping by object type and ignoring the file and line number. Out of interest, I created a simple piece of Rails middleware to see how many objects Rails allocates on each request:</p>

<pre><code>require &#39;pp&#39;

class Allocation

  def initialize(app)
    ObjectSpace::AllocationTracer.setup(%i{type})
    @app = app
  end

  def call(env)
    res = nil
    result = ObjectSpace::AllocationTracer.trace do      
      res = @app.call(env)
    end
    pp result
    res
  end

end
</code></pre>

<p>Notice that I asked allocation tracer to capture only the type of the object - adding the file or line number for a Rails application produced way too much output. This produced the following output for a single Rails request:</p>

<pre><code>Started GET &quot;/about&quot; for 127.0.0.1 at 2014-08-10 13:23:33 +0100
Processing by AboutController#index as HTML
  Rendered about/index.html.erb within layouts/application (0.1ms)
Completed 200 OK in 16ms (Views: 15.5ms)
{[:T_STRING]=&gt;[2728, 0, 0, 0, 0, 0],
 [:T_NODE]=&gt;[425, 0, 0, 0, 0, 0],
 [:T_ARRAY]=&gt;[989, 0, 0, 0, 0, 0],
 [:T_OBJECT]=&gt;[53, 0, 0, 0, 0, 0],
 [:T_HASH]=&gt;[367, 0, 0, 0, 0, 0],
 [:T_DATA]=&gt;[416, 0, 0, 0, 0, 0],
 [:T_MATCH]=&gt;[67, 0, 0, 0, 0, 0],
 [:T_REGEXP]=&gt;[6, 0, 0, 0, 0, 0],
 [:T_STRUCT]=&gt;[2, 0, 0, 0, 0, 0],
 [:T_FILE]=&gt;[2, 0, 0, 0, 0, 0]}
</code></pre>

<h2>Tracepoint</h2>

<p><a href="http://ruby-doc.org/core-2.0/TracePoint.html">TracePoint</a> is an interesting built-in library. It allows you to trace many events that occur in your Ruby application, which has many potential uses. For instance, in a large application it can be difficult to find all the places a certain class is instantiated. With TracePoint, you can capture the file and line number where any given class is referenced, by watching for events against the class you are interested in and then dumping the call stack. For instance, this code (taken from Aaron&#39;s talk):</p>

<pre><code>require &#39;active_support/all&#39;

trace = TracePoint.new(:c_call, :call) { |tp|
  if tp.defined_class == ActiveSupport::SafeBuffer &amp;&amp;
    tp.method_id == :initialize
    puts &quot;#&quot; * 90
    puts tp.binding.eval &quot;caller&quot;
  end
}
trace.enable
&quot;balblablalbal&quot;.html_safe
ActiveSupport::SafeBuffer.new &quot;omgee&quot;
</code></pre>

<p>This produces the following output:</p>

<pre><code>##########################################################################################
tracer.rb:7:in `eval&#39;
tracer.rb:7:in `block in &lt;main&gt;&#39;
/Users/sodonnel/.rvm/gems/ruby-2.1.1/gems/activesupport-4.1.4/lib/active_support/core_ext/string/output_safety.rb:158:in `initialize&#39;
/Users/sodonnel/.rvm/gems/ruby-2.1.1/gems/activesupport-4.1.4/lib/active_support/core_ext/string/output_safety.rb:237:in `new&#39;
/Users/sodonnel/.rvm/gems/ruby-2.1.1/gems/activesupport-4.1.4/lib/active_support/core_ext/string/output_safety.rb:237:in `html_safe&#39;
tracer.rb:11:in `&lt;main&gt;&#39;
##########################################################################################
tracer.rb:7:in `eval&#39;
tracer.rb:7:in `block in &lt;main&gt;&#39;
/Users/sodonnel/.rvm/gems/ruby-2.1.1/gems/activesupport-4.1.4/lib/active_support/core_ext/string/output_safety.rb:158:in `initialize&#39;
tracer.rb:12:in `new&#39;
tracer.rb:12:in `&lt;main&gt;&#39;
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/33-ruby-performance-analysis-tools</guid>
    </item>
    <item>
      <title>Profiling a Mysql Query</title>
      <link>https://appsintheopen.com/posts/32-profiling-a-mysql-query</link>
      <description>
        <![CDATA[<p>In my day job, I do a lot of work with Oracle databases. One of the things I really like about Oracle is that the database is exceptionally well instrumented. If you are not sure whether designing a table or query is better using one method or another, you can turn on this instrumentation, run each method and gather all sorts of statistics about the query.</p>

<p>When tuning a query, I tend to focus on the number of logical reads performed to generate the results. Fewer is pretty much always better, as it gives a good approximation of how much work the database performed running the query.</p>

<p>On a recent project that required Mysql, I wanted to do something similar, assuming this would be easy in Mysql.</p>

<p>It turns out it is pretty easy, but it is not immediately obvious how, at least in Mysql 5.5.</p>

<h2>Set Profiling = 1</h2>

<p>After some googling, I thought I had found the Mysql equivalent of <a href="http://betteratoracle.com/posts/10-using-autotrace">Autotrace</a> in Oracle:</p>

<pre><code>mysql&gt; set profiling = 1;
Query OK, 0 rows affected (0.00 sec)

mysql&gt; select count(*) from id_tab where user_id = 100;
+----------+
| count(*) |
+----------+
|      998 |
+----------+
1 row in set (0.02 sec)

mysql&gt; show profile;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.014555 |
| checking permissions | 0.000038 |
| Opening tables       | 0.000066 |
| System lock          | 0.000195 |
| init                 | 0.000070 |
| optimizing           | 0.000059 |
| statistics           | 0.000146 |
| preparing            | 0.000053 |
| executing            | 0.000014 |
| Sending data         | 0.000455 |
| end                  | 0.000022 |
| query end            | 0.000012 |
| closing tables       | 0.000016 |
| freeing items        | 0.000065 |
| logging slow query   | 0.000009 |
| cleaning up          | 0.000010 |
+----------------------+----------+
16 rows in set (0.00 sec)

</code></pre>

<p>This timing information is great, but it is not really what I am after, which is: how many database block / page reads were required to answer the query?</p>

<p>It turns out you can get <a href="http://dev.mysql.com/doc/refman/5.5/en/show-profile.html">more information</a> from the profile, such as block IO and page faults, but neither of these tells me what I want to know.</p>

<p>The block IO option is potentially interesting, but it doesn&#39;t give any information about cached reads, which are important too.</p>

<h2>Innodb Counters</h2>

<p>Assuming you are using Innodb tables, there are another set of counters to consider.</p>

<pre><code>mysql&gt; show status like &#39;Inno%&#39;;
+---------------------------------------+----------+
| Variable_name                         | Value    |
+---------------------------------------+----------+
| Innodb_buffer_pool_pages_data         | 692      |
| Innodb_buffer_pool_bytes_data         | 11337728 |
| Innodb_buffer_pool_pages_dirty        | 0        |
| Innodb_buffer_pool_bytes_dirty        | 0        |
| Innodb_buffer_pool_pages_flushed      | 1        |
| Innodb_buffer_pool_pages_free         | 7500     |
| Innodb_buffer_pool_pages_misc         | 0        |
| Innodb_buffer_pool_pages_total        | 8192     |
| Innodb_buffer_pool_read_ahead_rnd     | 0        |
| Innodb_buffer_pool_read_ahead         | 63       |
| Innodb_buffer_pool_read_ahead_evicted | 0        |
| Innodb_buffer_pool_read_requests      | 42158    |
| Innodb_buffer_pool_reads              | 630      |
| Innodb_buffer_pool_wait_free          | 0        |
| Innodb_buffer_pool_write_requests     | 1        |
| Innodb_data_fsyncs                    | 7        |
| Innodb_data_pending_fsyncs            | 0        |
&lt;snip&gt;
</code></pre>

<p>Some of these look a bit more useful, especially:</p>

<ul>
<li>Innodb_buffer_pool_read_requests - The number of logical read requests InnoDB has done</li>
<li>Innodb_buffer_pool_reads - The number of logical reads that InnoDB could not satisfy from the buffer pool, and had to read directly from the disk. </li>
</ul>

<p>There is one very important thing to note about these counters - they are global across all sessions on the database. To reliably compare two queries, you need to ensure nothing else is running on the database.</p>

<h2>Benchmark Code</h2>

<p>Another frustration is that Mysql doesn&#39;t give an out-of-the-box way to get the difference between the counters before and after running a query. This means that you need to write some code to grab the value of the counters, run your test, grab the value of the counters again, and finally calculate the differences.</p>

<p>I am sure it would be possible to write a stored procedure to do this, but I put together a small Ruby class to do what I needed:</p>

<pre><code>require &#39;mysql2&#39;

module MySQL

  class Profile

    def initialize(connection_hash)
      @client = Mysql2::Client.new(connection_hash)
      @profile = Hash.new
    end

    def run_test(query)
      snapshot_stats
      results = @client.query(query)
      read_results(results)
      snapshot_stats
    end

    def print_profile
      @profile.keys.each do |k|
        diff = @profile[k][-1] - (@profile[k][-2] || 0)
        if diff &gt; 0
          puts &quot;#{k.ljust(50, &#39; &#39;)} #{diff}&quot;
        end
      end
    end

    def read_results(results)
      results.each do |row|
        # do nothing, just want to read them from the db
      end
    end

    def snapshot_stats
      results = @client.query(&quot;show status like &#39;Inno%&#39;&quot;)
      results.each do |row|
        (@profile[row[&#39;Variable_name&#39;]] ||= []) &lt;&lt; row[&#39;Value&#39;].to_i
      end
    end
  end

end
</code></pre>

<p>Putting this code into action:</p>

<pre><code>test = MySQL::Profile.new(:host =&gt; &quot;localhost&quot;, :username =&gt; &quot;root&quot;, :database =&gt; &#39;test&#39;)
test.run_test(&quot;select * from user_tab where user_id = 50&quot;)

puts &quot;User tab ID&quot;
test.print_profile

puts &quot;&quot;
puts &quot;ID tab ID&quot;
test.run_test(&quot;select * from id_tab where user_id = 50&quot;)
test.print_profile

User tab ID
Innodb_buffer_pool_read_requests                   141
Innodb_rows_read                                   998

ID tab ID
Innodb_buffer_pool_read_requests                   3007
Innodb_rows_read                                   998

</code></pre>

<p>In this case we can see that the first query performed about 20 times fewer reads from the cache than the second query, so we can conclude the first query is much more efficient.</p>

<p>In this post, I didn&#39;t explain what data is in my tables, or what I am testing here - that will be the topic of another post. I just wanted to illustrate how to compare one approach over another.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/32-profiling-a-mysql-query</guid>
    </item>
    <item>
      <title>Statically Compiling Git</title>
      <link>https://appsintheopen.com/posts/31-statically-compiling-git</link>
      <description>
<![CDATA[<p>If you don&#39;t have access to a C compiler, or the ability to install an RPM on a machine, but you still want to use git, you need to create a statically linked binary on another machine and copy it over.</p>

<p>The process to do this is supposed to be pretty easy, but there are a host of libraries that you need to install. I already had a working Centos 6.4 VM, with gcc, openssl, curl etc installed. Basically all the dependencies required to compile git with the usual commands:</p>

<pre><code>$ make
$ make install
</code></pre>

<p>Before attempting a static build, you will also need to install at least the following with yum:</p>

<ul>
<li>glibc-static</li>
<li>zlib-static</li>
<li>libssh2-devel</li>
<li>openldap-devel</li>
<li>curl-devel</li>
</ul>

<p>Then, you can create a static build in ~/bin using the following commands:</p>

<pre><code># Create the configure executable
$ make configure
# Configure the build
$ ./configure --prefix=/home/sodonnel/bin CFLAGS=&quot;${CFLAGS} `pkg-config --static --libs libcurl`&quot;
# make and install as usual
$ make
$ make install
</code></pre>

<p><a href="http://www.lyraphase.com/wp/projects/how-to-build-git-for-a-host-with-no-compiler/">Another guide</a> suggested using the following for the configure step:</p>

<pre><code>$ ./configure --prefix=/home/sodonnel/bin CFLAGS=&quot;${CFLAGS} -static&quot;
</code></pre>

<p>I tried this, but I was not able to clone a repo over https so I tried the version above that was mentioned in the comments.</p>

<p>If all goes well, you should have a working git in /home/sodonnel/bin/bin/git.</p>

<p>To move this onto another machine, simply tar up /home/sodonnel/bin and then untar it at the destination.</p>

<p>If you get problems during the configure stage, look at config.log - it will probably give an error indicating a library is missing. Install it and try again.</p>

<p>You also need to make sure you build on a machine with the same kernel version, OS version and architecture (32 or 64 bit) as the target host.</p>

<p>After following these steps and transferring the build to my target machine, I still get one warning I cannot resolve. Each time I run a git command, it prints the following:</p>

<pre><code>git: /lib64/libz.so.1: no version information available (required by git) static 
</code></pre>

<p>The commands all seem to work OK despite that error message, but if anyone has any ideas on how to make this message go away I would be very grateful!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/31-statically-compiling-git</guid>
    </item>
    <item>
      <title>Masking data in Hive</title>
      <link>https://appsintheopen.com/posts/30-masking-data-in-hive</link>
      <description>
        <![CDATA[<p>I had a problem recently where I needed to mask a bunch of sensitive production data to create a database performance test environment. By chance the data was both in Oracle and in Hadoop.</p>

<p>The application in question just works on strings - it doesn&#39;t really care what the format of the strings is. Therefore, to mask the data, as long as all occurrences of the same string turn into the same other string, the application will work perfectly.</p>

<p>To turn one string into another string the obvious choice is to use a one way function, such as SHA1 or SHA256. This on its own is not overly secure, as someone could reverse engineer some of my sensitive data using a brute force attack. Adding a salt to the hash would make it much more secure.</p>

<p>Then I recalled something I had heard about on <a href="http://twit.tv/sn">Security Now</a> some time ago called <a href="http://en.wikipedia.org/wiki/Hash-based_message_authentication_code">HMAC</a>. It adds a secret key (which is much like a salt) to the data and hashes it twice.</p>

<p>If I generated a random key for the HMAC function, masked all my data and then threw away the key, there should be no way for anyone to reverse engineer the original data from its hash.</p>

<p>Oracle doesn&#39;t have a built-in HMAC function, but I did a test on some data using the built-in sha1 function (which is part of the dbms_crypto package). It used a lot of CPU and took quite a long time to do the hashing - not a great thing to do on your production database.</p>

<p>Then I tried Hive - it doesn&#39;t have a builtin HMAC UDF either, but building on my <a href="/posts/29-creating-a-basic-hive-udf">last post</a>, it was pretty easy to create a UDF to HMAC some data:</p>

<pre><code>package com.sodonnel.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

import org.apache.commons.codec.binary.Hex;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;


public class Hmac extends UDF
{
  public String evaluate(String key, String message)
    throws java.security.InvalidKeyException, java.security.NoSuchAlgorithmException {

    // if a null or empty string is input, return empty string
    if ((null == message) || (message.isEmpty())) {
      return &quot;&quot;;
    }

    SecretKeySpec keySpec = new SecretKeySpec(key.getBytes(), &quot;HmacSHA1&quot;);

    Mac mac = Mac.getInstance(&quot;HmacSHA1&quot;);
    mac.init(keySpec);
    byte[] rawHmac = mac.doFinal(message.getBytes());

    return Hex.encodeHexString(rawHmac);

  }
}
</code></pre>

<p>For this to compile you need to have the Apache commons-codec jar on the CLASSPATH. In the Cloudera install I am using, it is at:</p>

<pre><code>/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hadoop/lib/commons-codec-1.4.jar
</code></pre>

<p>Using this, I was able to hash about 30M rows, with 7 hashes per row, in about 30 seconds - not too shabby at all.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/30-masking-data-in-hive</guid>
    </item>
    <item>
      <title>Creating a Basic Hive UDF</title>
      <link>https://appsintheopen.com/posts/29-creating-a-basic-hive-udf</link>
      <description>
        <![CDATA[<p>Creating and using a basic Hive UDF is pretty simple.</p>

<p>First locate the hive-exec and hadoop-core jars on your system, and add them to the class path:</p>

<pre><code>CLASSPATH=$CLASSPATH:/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hive/lib/hive-exec-0.10.0-cdh4.2.1.jar:/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hadoop/client-0.20/hadoop-core-2.0.0-mr1-cdh4.2.1.jar:.
</code></pre>

<p>Next create a directory structure for the java files:</p>

<pre><code>mkdir -p udf_test/src/com/sodonnel/udf
mkdir -p udf_test/classes
</code></pre>

<p>Create the most basic hello world UDF in udf_test/src/com/sodonnel/udf/HelloWorld.java:</p>

<pre><code>package com.sodonnel.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;


public class HelloWorld extends UDF
{
  public String evaluate(String v) {
    return &quot;Hello World!&quot;;
  }
}
</code></pre>

<p>In the src directory, compile the Java class:</p>

<pre><code>javac -d ../classes com/sodonnel/udf/HelloWorld.java
</code></pre>

<p>This will create the directories and class file under the classes folder. Now we need to create a JAR out of the class file. In the classes directory run the following command:</p>

<pre><code>jar cf HelloWorld.jar com
</code></pre>

<p>The final step is to load this jar file into Hive:</p>

<pre><code>hive&gt; add jar /export/home/sodonnel/udf/src/com/sodonnel/udf_test/classes/HelloWorld.jar;
Added /export/home/sodonnel/udf/src/com/sodonnel/udf_test/classes/HelloWorld.jar to class path
Added resource: /export/home/sodonnel/udf/src/com/sodonnel/udf_test/classes/HelloWorld.jar

hive&gt; create temporary function hello_world as &#39;com.sodonnel.udf.HelloWorld&#39;;
OK
Time taken: 0.0040 seconds
</code></pre>

<p>Now call the function when selecting some rows from a table:</p>

<pre><code>hive&gt; select hello_world(&#39;any string&#39;) from my_table limit 10;
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
</code></pre>

<p>Not a very useful UDF, but it opens the door for more interesting things.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/29-creating-a-basic-hive-udf</guid>
    </item>
    <item>
      <title>Building a Ruby 2.0.0 RPM</title>
      <link>https://appsintheopen.com/posts/28-building-a-ruby-2-0-0-rpm</link>
      <description>
        <![CDATA[<p>I recently started to learn a little about how Puppet can be used to setup and configure servers. One task I wanted Puppet to automate was the install of Ruby 2.0.0.</p>

<p>It didn&#39;t take much reading for me to realise that using Puppet to grab the Ruby source and compile it on each machine is not seen as best practice - a better idea is to build an RPM once and use Puppet to deploy the RPM on each machine as required.</p>

<p>That left me with the problem of how to build an RPM.</p>

<h1>Mock</h1>

<p>I came across some <a href="http://byrnejb.wordpress.com/2013/01/30/building-ruby-1-9-3-for-centos-6-3/">good instructions</a> that suggest using a tool called mock to help with building RPMs. Mock creates a clean chroot environment to build an RPM package in - it will only contain the packages the RPM declares as build requirements, so it is easy to spot if something is missing.</p>

<p>You can also build the RPMs using the rpmbuild command without mock - I think the advantage of mock is that it enables the dependency check. Aside from that, it does pretty much the same job as rpmbuild, as it actually invokes rpmbuild behind the scenes.</p>

<h1>Setup Mock</h1>

<p>To use mock, you need to install a few packages:</p>

<pre><code>$ sudo yum install rpm-build redhat-rpm-config rpmdevtools mock
</code></pre>

<p>And create an unprivileged user to run mock:</p>

<pre><code>$ sudo adduser builder --home-dir /home/builder \
  --create-home  --groups mock \
  --shell /bin/bash --comment &quot;rpm package builder&quot;
</code></pre>

<p>Next switch to the new builder user and create an RPM directory structure:</p>

<pre><code>$ su - builder
$ rpmdev-setuptree
</code></pre>

<h1>Source Code and Spec File</h1>

<p>Now we need to get the Ruby source code, and put it into the SOURCES directory:</p>

<pre><code>$ cd ~/rpmbuild/SOURCES
$ wget ftp://ftp.ruby-lang.org/pub/ruby/2.0/ruby-2.0.0-p195.tar.gz
</code></pre>

<p>The final thing we need is an RPM spec file that describes how to build Ruby and any dependencies. Thanks to the instructions above, I located a <a href="https://github.com/imeyer/ruby-1.9.3-rpm">spec file</a> for Ruby-1.9.3 on Github, so I took it and modified it for what I needed.</p>

<p>One difference between my spec file and the one on Github is that I have specified an installation prefix for my Ruby install, while the Github one is intended to replace the system Ruby. I decided to put my Ruby install in /opt/rubies/ruby-2.0.0-p195, which is the location <a href="https://github.com/postmodern/chruby">chruby</a> prefers to put different Ruby versions.</p>

<p>Put the following spec file into ~/rpmbuild/SPECS/ruby20.spec</p>

<pre><code>%define rubyver 2.0.0
%define rubyminorver p195

Name: ruby20
Version: %{rubyver}%{rubyminorver}
Release: 1%{?dist}
License: Ruby License/GPL - see COPYING
URL: http://www.ruby-lang.org/
BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
BuildRequires: automake zlib zlib-devel readline libyaml libyaml-devel readline-devel ncurses ncurses-devel gdbm gdbm-devel glibc-devel tcl-devel gcc unzip openssl-devel db4-devel byacc make libffi-devel
Requires: libyaml
Source0: ftp://ftp.ruby-lang.org/pub/ruby/ruby-%{rubyver}-%{rubyminorver}.tar.gz
Summary: An interpreter of object-oriented scripting language
Group: Development/Languages

%description
Ruby is the interpreted scripting language for quick and easy
object-oriented programming. It has many features to process text
files and to do system management tasks (as in Perl). It is simple,
straight-forward, and extensible.

%prep
%setup -n ruby-%{rubyver}-%{rubyminorver}

%build
export CFLAGS=&quot;$RPM_OPT_FLAGS -Wall -fno-strict-aliasing&quot;

./configure --prefix=/opt/rubies/ruby-2.0.0-p195

make %{?_smp_mflags}

%install
# installing binaries ...
make install DESTDIR=$RPM_BUILD_ROOT

#we don&#39;t want to keep the src directory
rm -rf $RPM_BUILD_ROOT/usr/src

%clean
rm -rf $RPM_BUILD_ROOT

%files
%defattr(-, root, root)
/opt/rubies/ruby-2.0.0-p195

%changelog
* Tue May 21 2013
- Initial Version

</code></pre>

<h1>Build the RPM</h1>

<p>With all the setup done, it is time to use mock to build the RPM.</p>

<p>First we initialize mock to create a clean chroot area to build the RPM, then we go ahead and build the actual RPM, which requires two steps, the first to build the source RPM and the second to build the binary RPM:</p>

<pre><code>
$ cd ~
$ mock --init
$ mock --buildsrpm   --spec=./rpmbuild/SPECS/ruby20.spec --sources=./rpmbuild/SOURCES
$ mock --no-clean --rebuild /var/lib/mock/epel-5-x86_64/result/ruby-2.0.0p195-1.el5.centos.src.rpm

</code></pre>

<p>On my system, building Ruby takes about 5 minutes, but when it is done you should have a few RPMs in /var/lib/mock/epel-5-x86_64/result/ along with a couple of log files. The binary RPM is called ruby-2.0.0p195-1.el5.centos.x86_64.rpm and is what you actually want to install to provide Ruby.</p>

<pre><code>$ ls -l /var/lib/mock/epel-5-x86_64/result/

-rw-rw-r-- 1 builder mock   548108 May 21 13:20 build.log
-rw-rw-r-- 1 builder mock    51366 May 21 13:20 root.log
-rw-rw-r-- 1 builder mock 13148404 May 21 13:13 ruby-2.0.0p195-1.el5.centos.src.rpm
-rw-rw-r-- 1 builder mock 11011799 May 21 13:19 ruby-2.0.0p195-1.el5.centos.x86_64.rpm
-rw-rw-r-- 1 builder mock  7246129 May 21 13:20 ruby-debuginfo-2.0.0p195-1.el5.centos.x86_64.rpm
-rw-rw-r-- 1 builder mock     1835 May 21 13:20 state.log
</code></pre>

<p>Now you can install the RPM with the usual command:</p>

<pre><code>[root@localhost result]# rpm -ivh ruby-2.0.0p195-1.el5.centos.x86_64.rpm
Preparing...                ########################################### [100%]
   1:ruby                   ########################################### [100%]

$ /opt/rubies/ruby-2.0.0-p195/bin/ruby -v
ruby 2.0.0p195 (2013-05-14 revision 40734) [x86_64-linux]
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/28-building-a-ruby-2-0-0-rpm</guid>
    </item>
    <item>
      <title>Test Hard Disk Speed With dd</title>
      <link>https://appsintheopen.com/posts/27-test-hard-disk-speed-with-dd</link>
      <description>
<![CDATA[<p>I recently wanted to test the speed of the disk attached to a Linux server I was using, as it seemed somewhat slower than I expected. I had never needed to test disk speeds before, and I assumed it would be simple. It turns out a straight speed test is fairly easy, once you figure out what you actually want to test.</p>

<h2>All Disk Writes Are Not The Same</h2>

<p>One thing I knew, but had never investigated in detail is that there are several ways to write a file to disk.</p>

<p>On Linux, if you mount an ext3 filesystem with no options, the OS will use any free memory to buffer file contents in memory. That makes it faster to access a file that was previously read. It also makes it faster to write to a file - the OS can write the file contents into memory, and then it can lazily stream the data to disk after the write call has completed.</p>

<p>This means that after a write to a file has completed, the data may not yet have made it to disk and could be lost if the machine suddenly lost power.</p>

<p>You can see this effect in action if you run the free command on Linux:</p>

<pre><code>$ free -m
             total       used       free     shared    buffers     cached
Mem:         24082      23654        428          0        373       7430
-/+ buffers/cache:      15849       8232
Swap:         2047        553       1494
</code></pre>

<p>This machine has 24GB of RAM, and at first look, it would appear that almost all the memory is used. However, according to the cached column, 7430MB of memory is being used to cache file system buffers. As more memory is required on the machine by other processes, it will free this memory automatically, ensuring the machine doesn&#39;t run out of memory.</p>

<p>The Linux kernel provides a system call, fsync(), that forces the contents of a file cached in memory to be written to the underlying disk. That means that to ensure the data is safely on disk and not partially written, a write to a file needs to be followed by an fsync() call.</p>

<p>It is possible to open files in different modes, such as direct, which forces writes to bypass the cache, or dsync, which uses the cache but forces the data to disk before each write call returns.</p>
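<p>As an illustration of the pattern (a Ruby sketch - any language with an fsync binding behaves the same way; the filename is just an example), a durable write is a normal write followed by an explicit fsync:</p>

```ruby
# Write a file and force its contents to disk before continuing.
# Without the fsync, the data may exist only in the OS page cache
# when the block exits.
File.open('writetest.tmp', 'w') do |f|
  f.write('some important data')
  f.flush   # flush Ruby's own userspace buffer to the OS
  f.fsync   # ask the kernel to write the cached pages to disk
end

contents = File.read('writetest.tmp')
File.delete('writetest.tmp')
```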

<h2>Speed Tests with dd</h2>

<p>Now we know there are different ways to write a file, we need a tool that allows data to be written to a file to compare the speed of the different approaches.</p>

<p>This is where the dd command comes in. It allows data to be written to a file in various modes, reporting the transfer speed when it completes.</p>

<h2>The Fastest way to write a file</h2>

<pre><code>$ dd if=/dev/zero of=writetest bs=8k count=131072
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 1.10978 seconds, 968 MB/s
</code></pre>

<p>968MB/s - that is pretty fast! Well, that is because this test wrote the file into memory, and the contents were not actually on disk when the command completed. Therefore this test didn&#39;t really test the speed of the disks in the machine at all.</p>

<p>Before moving onto a more realistic test, a definition of the fields passed to dd is required:</p>

<ul>
<li>if=/dev/zero - the if option stands for input file. This is the data that will be written to disk in our speed test. In this case we have used the /dev/zero device, which simply streams null bytes into the file.</li>
<li>of=writetest - this is the filename of the file that will be written to. In this case, it will contain the contents /dev/zero, which won&#39;t be very interesting if you view it!</li>
<li>bs=8k - This tells dd to write the data to the file in 8KB chunks. Each write to the file will contain 8KB of data.</li>
<li>count=131072 - This is how many 8KB writes to perform on the file. In this case it results in a file of 1GB.</li>
</ul>
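<p>As a quick sanity check of those numbers, the final file size is simply the block size multiplied by the count:</p>

```ruby
block_size = 8 * 1024   # bs=8k, in bytes
count      = 131072     # count=131072
total      = block_size * count

# 8KB * 131072 writes = 1073741824 bytes, i.e. 1GB, matching the
# "1073741824 bytes (1.1 GB) copied" line in the dd output above.
puts total
```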

<h2>The Fastest Way To Actually Write To Disk</h2>

<p>The last test proved nothing about the speed of the disk the file was supposedly written to, but earlier I mentioned the need to use a call to fsync() to force the file onto the disk. Luckily dd gives us a way to do this with the conv option:</p>

<pre><code>$ dd if=/dev/zero of=writetest bs=8k count=131072 conv=fsync
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 18.1336 seconds, 59.2 MB/s
</code></pre>

<p>Now I have a more realistic disk speed of 59MB/s, which is likely to be the top speed this disk can write at.</p>

<p>The conv=fsync parameter changed the behavior of dd so that it calls fsync() after writing the data, forcing it all to disk before dd completes.</p>

<p>It is also possible to call dd with conv=fdatasync which could be slightly faster for small files, but is about the same when writing a 1GB file in this test:</p>

<pre><code>$ dd if=/dev/zero of=writetest bs=8k count=131072 conv=fdatasync
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 18.0217 seconds, 59.6 MB/s
</code></pre>

<h2>Other (Slower) Ways To Write A File</h2>

<p>In the test above, I made 131,072 small writes to a file, but the OS buffered these writes in memory until the fsync() call streamed all the data to disk in large writes. In some applications, you need to make many small writes and know that each of them is securely stored on disk. This is why my test used a block size of 8KB - my disks were being used to store Oracle database files, and Oracle works on a block size of 8KB. So while my disk can write at almost 60MB/s, how does it do with 8KB writes that must be synchronized to disk after each write? The dd command gives us a way to test this too:</p>

<pre><code>$ dd if=/dev/zero of=writetest bs=8k count=131072 oflag=sync
</code></pre>

<p>The oflag=sync tells dd to open the output file with the sync option, which means that every write to the file must be synced to disk. The OS still buffers the writes in memory, but it must flush them to disk before the write call returns. In this case, that means about 131,000 writes are going to be issued to disk, which is going to be slow:</p>

<pre><code>dd if=/dev/zero of=writetest bs=8k count=131072 oflag=sync
66324+0 records in
66324+0 records out
543326208 bytes (543 MB) copied, 66.9737 seconds, 8.1 MB/s
</code></pre>

<p>With small synced writes my top speed has dropped to 8.1MB/s - much slower than the top speed of the disk.</p>

<p>Another option is to turn on direct IO, which skips the OS buffering of the writes and puts them straight onto disk:</p>

<pre><code>$ dd if=/dev/zero of=writetest bs=8k count=131072 oflag=direct
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 34.6159 seconds, 31.0 MB/s
</code></pre>

<p>Now my write speed is a more impressive 31.0MB/s, so it looks like direct IO is much faster than the sync option. From what I&#39;ve read, direct IO should store your data just as safely as sync mode, but I am not 100% certain on that one.</p>

<h2>Conclusion</h2>

<p>It is pretty simple to test the speed of writing a large file with the dd command, but you need to know about the subtle options available when writing a file to actually test the speed of a storage device.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/27-test-hard-disk-speed-with-dd</guid>
    </item>
    <item>
      <title>PLSQL Unit Test</title>
      <link>https://appsintheopen.com/posts/26-plsql-unit-test</link>
      <description>
        <![CDATA[<p>Continuing on my track of creating Ruby Gems to help with database interactions, I created <a href="http://rubygems.org/gems/plsql_unit_test">PLSQL Unit Test</a>. As the name suggests, it is designed to test PLSQL code.</p>

<p>It is a very simple gem, probably about 100 lines of code. In a nutshell it does three things:</p>

<ul>
<li>Monkey patches Test::Unit::TestCase to add a few extra assert methods for testing the contents of database tables</li>
<li>Adds a method to create a database connection that is shared across all test classes</li>
<li>Loads <a href="http://rubygems.org/gems/simpleOracleJDBC">Simple Oracle JDBC</a> and <a href="http://rubygems.org/gems/data_factory">Data Factory</a> so they are available when coding tests.</li>
</ul>

<p>The readme file and documentation bundled with the gem do a pretty good job of explaining how it works, and I added a <a href="http://betteratoracle.com/posts/41-unit-testing-plsql-with-ruby">post</a> to my <a href="http://betteratoracle.com">Oracle blog</a> giving an overview of how to get up and running.</p>

<p>As I write this, I have probably written about 400+ test cases using this framework, and it works well. It is certainly not an all singing and dancing test framework, as I only added features to cover scenarios I encountered as I was writing tests. The key for me when writing tests was to have a clean interface to communicate with the database (Simple Oracle JDBC) and a way to easily stage data (Data Factory) - after that Test::Unit provided almost everything I required.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/26-plsql-unit-test</guid>
    </item>
    <item>
      <title>Data Factory</title>
      <link>https://appsintheopen.com/posts/25-data-factory</link>
      <description>
<![CDATA[<p>When I was writing PLSQL unit tests with Ruby, I came across a need to stage test data into tables. I didn&#39;t really want to involve an ORM like Active Record, but working with raw insert statements quickly became a chore.</p>

<p>One reason is that some of my tables had a lot of columns - many of the columns were not nullable, but for a given test it didn&#39;t really matter what values were in most of the columns. For instance if a test needed to prove records with a status of &#39;pending&#39; were picked up by a stored procedure, I just needed to ensure I had a few records with a status of &#39;pending&#39;, and all the other columns could be set to anything.</p>

<p>After wasting too much time with long insert statements and missing not null fields, I created the Data Factory gem to remove the need to write insert statements at all.</p>

<h1>What is Data Factory</h1>

<p>DataFactory is a simple Ruby gem that generates random test data and inserts it into database tables.</p>

<p>DataFactory reads the table definition from the database, and generates random values for all not null columns. It inserts this data into the table, while providing the option of specifying non-random defaults to meet integrity constraints etc.</p>

<h1>Usage</h1>

<p>DataFactory is a simple gem, so a few examples explore a lot of the functionality. Note that these examples use Simple Oracle JDBC as the database access layer. </p>

<p>For a more complete manual, have a look at the documentation on <a href="https://rubygems.org/gems/data_factory">Rubygems</a>. </p>

<p>For these examples to run, create a table on the database as follows:</p>

<pre><code>create table employees (emp_id     integer,
                        dept_id    integer,
                        first_name varchar2(50),
                        last_name  varchar2(50),
                        email      varchar2(50),
                        ssn        varchar2(10) not null);
</code></pre>

<h2>Define a DataFactory Class</h2>

<p>To use DataFactory, create a class for each table you want to interface with, and make it a sub-class of DataFactory::Base:</p>

<pre><code>class Employee &lt; DataFactory::Base

  set_table_name &quot;employees&quot;

  set_column_default :last_name, &quot;Smith&quot;
  set_column_default :email,   begin    
                                 &quot;#{rand(10000)}@#{rand(10000)}.com&quot;
                               end
end
</code></pre>

<p>In the class definition, use the set_table_name method to map the class to a particular table on the database.</p>

<p>Optionally, you can specify default values for columns in the table with the set_column_default method, which takes the column name followed by either a fixed value, or a block that generates a value each time it is called, as with the email example.</p>

<h2>Creating a Row</h2>

<p>The first requirement is to connect to the database, and hand an instance of the database interface to DataFactory:</p>

<pre><code>interface = SimpleOracleJDBC::Interface.create(&#39;sodonnel&#39;,
                                               &#39;sodonnel&#39;,
                                               &#39;local11gr2.world&#39;,
                                               &#39;localhost&#39;,
                                               &#39;1521&#39;)

DataFactory::Base.set_database_interface(interface)
</code></pre>

<p>Then a row can be created using the create! method, for example:</p>

<pre><code>f = Employee.create!(&quot;emp_id&quot; =&gt; 1001)
</code></pre>

<p>The create! call will take the column defaults defined in the Employee class, and merge in any column values passed into the create! method. Then it will generate a value for any other non-nullable columns in the table, and insert the row into the database.</p>

<p>An Employee instance is returned, containing all the generated values.</p>

<p>There is also a create method that works just like create! but does not issue a commit.</p>

<p>Finally there is a build method that creates an instance of the class with default and generated values, but does not insert it into the database at all.</p>

<h2>Accessing The Column Values</h2>

<p>When an instance of a DataFactory class is created, you can access the generated values for the columns with the column_values method, which returns a hash. The keys of the hash are the uppercase column names and the values contain the generated data:</p>

<pre><code>f.column_values.keys.each do |k|
  puts &quot;#{k} :: #{f.column_values[k]}&quot;
end

# EMP_ID :: 1001
# DEPT_ID ::
# FIRST_NAME ::
# LAST_NAME :: Smith
# EMAIL :: 4506@5941.com
# SSN :: Gb3
</code></pre>

<p>Notice how columns that are nullable, have no default value and were not passed a value are left null.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/25-data-factory</guid>
    </item>
    <item>
      <title>Simple Oracle JDBC</title>
      <link>https://appsintheopen.com/posts/24-simple-oracle-jdbc</link>
      <description>
        <![CDATA[<p>Over the years, I have attempted PLSQL unit testing with some success using different tools, but they always left me frustrated. The best tool was probably <a href="http://utplsql.sourceforge.net/">utplsql</a>, but the test code was so verbose it got annoying very quickly. </p>

<p>These days there seems to be a push toward doing PLSQL unit testing using a GUI (built into Toad and SQL Developer), which doesn&#39;t fill me with joy either.</p>

<p>I decided that Ruby would make a pretty good tool to test stored procedure calls, as it is easily able to execute the stored procedures and can query the database pretty effectively too. I started out with the Ruby OCI8 gem. At the time I quickly hit a problem, as it didn&#39;t support PLSQL array types, which I needed to test.</p>

<p>At that point I figured it would be nice if I could use the Oracle JDBC drivers along with JRuby and get my testing done that way. I quickly realized that the endless set_int, get_int and mapping between Java native types and Ruby native types was going to be tedious, and <a href="https://rubygems.org/gems/simpleOracleJDBC">Simple Oracle JDBC</a> was born.</p>

<h1>A Thin Wrapper</h1>

<p>The idea behind this gem is that it provides a thin wrapper around a JDBC connection. It provides an interface to quickly execute SQL statements and stored procedures, but leaves the raw JDBC connection available if anything more complicated is required, such as binding array types.</p>

<p>Values can be bound to SQL statements and procedures by simply passing an array of Ruby types into the execute call, and they are mapped automatically into the correct Java SQL types. The same happens when values are returned from procedures and queries.</p>

<h1>More Than Just Testing</h1>

<p>While the gem was created to help me with Unit Testing PLSQL code, there is nothing preventing it being used for other quick scripts or prototypes. I haven&#39;t looked at performance at all, so if you decided to use it in a production application, test it thoroughly first!</p>

<h1>Usage</h1>

<p>The best way to learn how to use Simple Oracle JDBC is to read through the sample code below, and then checkout the documentation over at <a href="https://rubygems.org/gems/simpleOracleJDBC">rubygems.org</a>.</p>

<pre><code class="ruby">require &#39;simple_oracle_jdbc&#39;

conn = SimpleOracleJDBC::Interface.create(&#39;sodonnell&#39;,   # user
                                          &#39;sodonnell&#39;,   # password
                                          &#39;tuned&#39;,       # service
                                          &#39;192.168.0.1&#39;, # host
                                          &#39;1521&#39;)        # port

# ... or create with an existing JDBC connection
# conn = SimpleOracleJDBC.create_with_existing_connection(conn)

# Create a SimpleOracleJDBC::SQL object
sql = conn.prepare_sql(&quot;select 1 c1, &#39;abc&#39; c2, sysdate c3, 23.56 c4
                        from dual
                        where 1 = :b1
                        and   2 = :b2&quot;)

# execute the query against the database, passing any binds as required
sql.execute(1, 2)

# get the results back as an array of arrays. Note that the resultset
# and statement will be closed after this call, so the SQL cannot
# be executed again.
results = sql.all_array
puts &quot;The returned row is #{results[0]}&quot;

# &gt; The returned row is [1.0, &quot;abc&quot;, 2013-02-12 22:00:23 +0000, 23.56]

# Run the same SQL statement again
sql = conn.prepare_sql(&quot;select 1 c1, &#39;abc&#39; c2, sysdate c3, 23.56 c4
                        from dual
                        where 1 = :b1
                        and   2 = :b2&quot;)

sql.execute(1, 2)

# This time fetch the results as an array of hashes
results = sql.all_hash
puts &quot;The returned row is #{results[0]}&quot;
puts results[0][&quot;C3&quot;].class

# Notice how the column names are the keys of the hash, and the date is converted
# into a Ruby Time object.
#
# &gt; The returned row is {&quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:03:02 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; Time

# If you need to iterate over a large result set, then pass a block to the each_array
# or each_hash method
sql = conn.prepare_sql(&quot;select level rnum, 1 c1, &#39;abc&#39; c2, sysdate c3, 23.56 c4
                        from dual
                        where 1 = :b1
                        and   2 = :b2
                        connect by level &lt;= 4&quot;)

sql.execute(1, 2).each_hash do |row|
  puts row
end

# &gt; {&quot;RNUM&quot;=&gt;1.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:07:14 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; {&quot;RNUM&quot;=&gt;2.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:07:14 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; {&quot;RNUM&quot;=&gt;3.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:07:14 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; {&quot;RNUM&quot;=&gt;4.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:07:14 +0000, &quot;C4&quot;=&gt;23.56}

# Finally you can ask for each row one at a time, with each_array or each_hash
sql = conn.prepare_sql(&quot;select level rnum, 1 c1, &#39;abc&#39; c2, sysdate c3, 23.56 c4
                        from dual
                        where 1 = :b1
                        and   2 = :b2
                        connect by level &lt;= 4&quot;)
sql.execute(1, 2)

# If you fetch to the end of the result set, then the statement and
# and result set will be closed. Otherwise, call the close method:
#
# sql.close
while row = sql.next_array do
  puts &quot;The row is #{row}&quot;
end

# &gt; The row is [1.0, 1.0, &quot;abc&quot;, 2013-02-12 22:11:38 +0000, 23.56]
# &gt; The row is [2.0, 1.0, &quot;abc&quot;, 2013-02-12 22:11:38 +0000, 23.56]
# &gt; The row is [3.0, 1.0, &quot;abc&quot;, 2013-02-12 22:11:38 +0000, 23.56]
# &gt; The row is [4.0, 1.0, &quot;abc&quot;, 2013-02-12 22:11:38 +0000, 23.56]


# Executing Stored Procedures is easy too, just take care of out and inout parameters.
#
# create or replace function test_func(i_var integer default null)
# return integer
# is
# begin
#   if i_var is not null then
#     return i_var;
#   else
#     return -1;
#   end if;
# end;
# /
#
# Execute a function with a returned parameter. Notice how the
# out/returned parameter is passed as a 3 element array.
# The first element defines the Ruby type which is mapped into a SQL type as follows:
#
#    RUBY_TO_JDBC_TYPES = {
#      Date       =&gt; OracleTypes::DATE,
#      Time       =&gt; OracleTypes::TIMESTAMP,
#      String     =&gt; OracleTypes::VARCHAR,
#      Fixnum     =&gt; OracleTypes::INTEGER,
#      Integer    =&gt; OracleTypes::INTEGER,
#      Bignum     =&gt; OracleTypes::NUMERIC,
#      Float      =&gt; OracleTypes::NUMERIC,
#      :refcursor =&gt; OracleTypes::CURSOR
#    }
#
# The second element is the value, which should be nil for out parameters and can take a
# value for inout parameters.
#
# The third parameter should always be :out
#
# Also notice how the value is retrieved using the [] method, which is indexed from 1 not zero.
# In, out and inout parameters can be accessed using the [] method.
proc = conn.prepare_proc(&quot;begin :return := test_func(); end;&quot;)
proc.execute([String, nil, :out])
puts &quot;The returned value is #{proc[1]}&quot;

# &gt; The returned value is -1

# To pass parameters into the function, simply pass plain Ruby values:
proc = conn.prepare_proc(&quot;begin :return := test_func(:b1); end;&quot;)
proc.execute([String, nil, :out], 99)
puts &quot;The returned value is #{proc[1]}&quot;
proc.close

# &gt; The returned value is 99

# A refcursor is returned from a stored procedure as a SimpleOracleJDBC::SQL object, so it can
# be accessed in the same way as the SQL examples above:
#
# create or replace function test_refcursor
# return sys_refcursor
# is
#    v_refc sys_refcursor;
# begin
#   open v_refc for
#   select level rnum, 1 c1, &#39;abc&#39; c2, sysdate c3, 23.56 c4
#   from dual
#   connect by level &lt;= 4;
#
#   return v_refc;
# end;
# /
#
proc = conn.prepare_proc(&quot;begin :return := test_refcursor; end;&quot;)
proc.execute([:refcursor, nil, :out])
sql_object = proc[1]
sql_object.each_hash do |row|
  puts row
end
proc.close

# &gt; {&quot;RNUM&quot;=&gt;1.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:32:48 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; {&quot;RNUM&quot;=&gt;2.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:32:48 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; {&quot;RNUM&quot;=&gt;3.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:32:48 +0000, &quot;C4&quot;=&gt;23.56}
# &gt; {&quot;RNUM&quot;=&gt;4.0, &quot;C1&quot;=&gt;1.0, &quot;C2&quot;=&gt;&quot;abc&quot;, &quot;C3&quot;=&gt;2013-02-12 22:32:48 +0000, &quot;C4&quot;=&gt;23.56}
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/24-simple-oracle-jdbc</guid>
    </item>
    <item>
      <title>One Large Redis or Many Smaller Shards?</title>
      <link>https://appsintheopen.com/posts/23-one-large-redis-or-many-smaller-shards</link>
      <description>
<![CDATA[<p>After experimenting with a <a href="/posts/22-calculating-velocity-scores-at-the-speed-of-redis">simple proof of concept</a> Redis backed application, I turned my attention to some of the more practical aspects of attempting to run an application like it in production.</p>

<p>For the use case I have in mind for Redis, I will need to store a lot of data, potentially several hundred gigabytes. It is possible to get machines with 256GB or more of RAM - we use some of them today to host Oracle databases - but is it sensible to run a single Redis process that is 100GB or more in size?</p>

<h2>Startup Time</h2>

<p>The problem with Redis is that the entire dataset needs to live in memory - once running, it cannot read any data from disk. Due to this limitation, before Redis can accept any connections, it needs to load the entire database from disk into memory. So how long would this take?</p>

<p>I created a Redis instance and populated it with about 5.2GB of random data. I created a snapshot RDB file on disk, which was about 4.8GB in size. My keys and data values were randomly generated, so the RDB file did not compress much. I have heard that the RDB file for many real world applications can be about 10x smaller than the in memory database size.</p>

<p>Start up time for this 5.2GB database was 1 minute and 3 seconds. A guy on the Redis mailing list stated that his real world 50GB Redis instance took 20 minutes to start up, running on EC2.</p>

<p>I ran my test on a 24GB 3.57GHz Intel Xeon box, with the datafile stored on SAN, so maybe it outperforms an EC2 box. Either way, I am looking at about 20 - 40 minutes to start up a 100GB Redis instance.</p>

<h2>Slaves</h2>

<p>If you have slaves, then the start up time may be tolerable. Promote a slave to the master, and then re-point all your connections. How hard is it to create Redis slaves?</p>

<p>The way Redis creates slaves is as follows:</p>

<ul>
<li>Master creates a snapshot of the entire database (an RDB file) on disk</li>
<li>Master transmits that file across the network to the slave, while buffering all new commands received on the master <em>in memory</em> for later sending to the slave.</li>
<li>Slave loads the RDB file. This will take about the same length of time it would take the master to start up from the RDB file.</li>
<li>Slave receives the stream of pending commands from the master</li>
<li>The slave is now online</li>
</ul>

<p>With a very large Redis instance, it will take quite a long time to transmit the large RDB file to the slave. Then it will take the slave some time to load it. Meanwhile the master needs to keep a buffer of all the commands it received in the 20 - 40 minutes this all takes. If the master is receiving a lot of writes during this time, the buffer needed on the master may overflow, and the slave synchronization process will need to start again.</p>

<p>To make things worse, if a slave loses contact with the master, even for a few seconds, then it must reload the entire database from the master once again. There is no concept of an incremental refresh, although a partial resync feature is <a href="http://antirez.com/news/47">under development</a>.</p>

<p>For me, bringing a new slave online is an even bigger problem than the start up time. It is going to take a long time to get a slave up and running, and if the master is very busy, it may be difficult to get the slave to sync at all.</p>
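<p>For completeness, pointing a slave at a master is only a line of configuration - a minimal sketch, with a placeholder host and port:</p>

```
# redis.conf on the slave - attach to the master and start a full sync
slaveof master-host 6379
```

<p>The same thing can be done at runtime with the SLAVEOF command from redis-cli, and SLAVEOF NO ONE promotes a slave to a standalone master during a fail-over. It is the full resync behind these commands, not the configuration, that is the hard part at this scale.</p>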

<h2>Persistence</h2>

<p>Redis offers two ways to ensure your data is available if you restart the process - RDB files and the Append Only File (AOF).</p>

<h3>RDB Files</h3>

<p>An RDB file is a consistent snapshot of the entire Redis database. You can configure Redis to create a new RDB snapshot after the database receives a given number of changes, or you can kick one off manually at any time. When running large Redis instances, there are a few potential problems:</p>

<ul>
<li>The RDB file is generated by forking the Redis process, and the new process dumps its memory to disk. On very large instances, this fork may take a little time, maybe a second or slightly more in extreme cases, which blocks the entire Redis instance as it happens.</li>
<li>When Redis forks, it uses the standard Unix copy on write technique to mirror the parent process&#39;s memory, giving the forked process a copy of the original data without using any more memory. However, each memory page that changes in the parent process while the child process is still working results in that page being duplicated in the child. If you have a large Redis instance, creating the RDB file is going to take some time, and if the instance is under heavy write load while this happens, Redis will use quite a lot of extra memory until the RDB file is completed.</li>
<li>If you care about your data, then RDB files are really only good for consistent backups. If it takes 20 or more minutes to create a new RDB file, then in the best case, you can only secure data that is over 20 minutes old using RDB files.</li>
</ul>

<p>The third point here is crucial, and if you cannot afford to lose any data, then you need to look at the AOF.</p>
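<p>For reference, the snapshotting schedule is controlled by the save directives in redis.conf - the thresholds below are just example values:</p>

```
# Snapshot if 1 key changed in 900s, 10 keys in 300s or 10000 keys in 60s
save 900 1
save 300 10
save 60 10000
```

<p>A snapshot can also be forced at any time with the BGSAVE command, and LASTSAVE reports when the most recent one completed.</p>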

<h3>Append Only File</h3>

<p>If Redis is running in AOF mode, all operations are written to the AOF after they have completed changing the in memory data. If the AOF is set to sync to disk every second, then at most 1 second of data could be lost if the Redis instance is killed. The AOF <a href="http://garantiadata.com/blog/does-amazon-ebs-affect-redis-performance">does not significantly impact performance</a>, so you should probably turn it on.</p>

<p>As every operation that changes the dataset is written to the AOF, it is going to get big quickly. If Redis needs to restart, it must read the AOF from the beginning, applying every change to get the database back to how it was before the shutdown. This might take a very long time.</p>

<p>To make this start up time shorter, Redis allows the AOF to be rewritten periodically or on demand. This works in a similar way to creating an RDB file: in simple terms, Redis dumps the current in memory dataset to a new file, while buffering all the writes that occur during the dump, then appends the buffered writes and switches the AOFs around. Again, on large instances this process is going to take a long time, making it problematic, and if you want Redis to start in a reasonable time, you probably need to rewrite the log a few times each day.</p>
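<p>The AOF behaviour is also driven by a few redis.conf directives - a minimal sketch, with example rewrite thresholds:</p>

```
# Enable the append only file, syncing it to disk once per second
appendonly yes
appendfsync everysec

# Rewrite the AOF once it doubles in size, but not before it hits 64mb
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```

<p>A rewrite can also be triggered on demand with the BGREWRITEAOF command.</p>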

<p>From a persistence point of view, the AOF will get the job done, and by allocating enough memory to buffer pending writes, it can be rewritten in a reasonable time. It will however make the start up time even longer.</p>

<h2>Sharding Is Better</h2>

<p>While none of the issues I outlined here are show stoppers, they do make things difficult. My conclusion is that it makes much more sense to run many small Redis processes, probably on the same machine, as a cluster. Backing up each of them is easier, starting a slave off a smaller master is easier, and recreating the AOF is easier too. You also get more potential performance: one large Redis instance can only use a single CPU, while if you shard it across many instances you can use many CPUs.</p>

<p>One big negative is that if your application depends on all the data being in the same instance (for set intersect operations, for example), you may not be able to shard, or at least not easily.</p>

<p>Another negative is that Redis doesn&#39;t currently offer built in clustering - it all has to be done in the application. Monitoring and running all those Redis processes is also more complex than a single instance.</p>

<p>That said, it can be done - the team at Craigslist <a href="http://blog.zawodny.com/2011/02/26/redis-sharding-at-craigslist/">documented their strategy</a> which provides some interesting information in a real world application. </p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/23-one-large-redis-or-many-smaller-shards</guid>
    </item>
    <item>
      <title>Calculating Velocity Scores at the Speed of Redis</title>
      <link>https://appsintheopen.com/posts/22-calculating-velocity-scores-at-the-speed-of-redis</link>
      <description>
        <![CDATA[<p>To really get a feel for a technology, you have to at least build a proof of concept application using it. Building a simple application that uses Redis as an in memory hash table isn&#39;t very interesting, so I was searching for something that would test some of the more advanced features of Redis.</p>

<h2>Velocity Check</h2>

<p>A problem that came to mind is implementing a Velocity Check that could be used as part of a fraud checking system. According to <a href="http://www.fraudpractice.com/gl-veluse.html">the first result I got on google</a>, the definition of a Velocity Check is:</p>

<blockquote>
<p>The intent of velocity of use is to look for suspicious behaviour based on the number of associated transactions a consumer is attempting. It works based on counting the number of uses of a data element within a predetermined time frame. The theory is the higher the number of uses on a data element (e.g., credit cards) in a predefined time period (e.g., 24 hours), the higher the risk of taking an order.  </p>
</blockquote>

<p>So if we consider customers ordering goods from a company, there are various &#39;data items&#39; we might like to track - delivery address, email and credit card number. If the same value keeps occurring over and over in an unrealistic time frame, it may suggest fraud.</p>

<h2>Requirements</h2>

<p>From the description above, and a bit more reading around the linked web page, the minimum set of requirements for a multi tenant Velocity scoring engine are:</p>

<ul>
<li>Each tenant can track many values</li>
<li>Each value should be tracked for a defined time period</li>
<li>To calculate a score, a rule and data value will be supplied, and the Velocity Engine must count how many times the value occurred in the defined time period.</li>
</ul>

<p>There are obviously more requirements to think about around on-boarding accounts to use the system and configuring the rules etc, but I don&#39;t want to think about that yet. The goal of this post is to explore the core scoring engine, assuming we have a way to configure the rule definitions.</p>

<h2>Time To Live</h2>

<p>As this is a multi tenant system, each user may configure different time periods - one may want to track things over a 60 minute period, while another may want to track over a 6 month period. If we were to implement this in a relational database, we would need to think about how to delete the old data to keep the size of the database in check - probably by looking up each rule and its retention, and then deleting any data held for the rule that is past the retention time.</p>

<p>With Redis, we can set a time to live (TTL) on a key. If we set the TTL on each data item when we store it, it will automatically be removed from the database when that time has passed, so that is something to keep in mind.</p>
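<p>As a quick sketch of the TTL commands in redis-cli (the key, value and retention here are made up):</p>

```
SET 1546:somevalue trans_234
EXPIRE 1546:somevalue 3600     # delete the key automatically after 1 hour
TTL 1546:somevalue             # seconds remaining before expiry
```

<p>Calling EXPIRE again on a later hit resets the clock, which is exactly the behaviour we want for repeating values.</p>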

<h2>Many Distinct Values</h2>

<p>If you consider transactions that are not fraudulent, for every field tracked there will be a large number of distinct values, and only a few values that actually repeat. The goal of the velocity scoring engine is to find those repeating values. We also have to keep the data retention requirement in mind, and set the TTL appropriately.</p>

<h2>Modelling In Redis</h2>

<p>The following are the pieces of data we need to track:</p>

<ul>
<li>For each rule, we will have many (possibly millions) of data values, only a few of which will repeat.</li>
<li>Each data value will have an associated transaction date, and there is a retention period defined by the rule</li>
<li>A Velocity score for a rule is calculated by counting how many times a given value has been seen within the time window</li>
</ul>

<p>After thinking about this for a while, I decided that Redis Sorted Sets would be perfect for modelling this problem. We are going to need one sorted set per rule, and one sorted set per data item.</p>

<h2>Sorted Sets</h2>

<p>A sorted set is like a list, only each member can only appear in it once. A member is simply a string, so a sorted set is a list of unique strings.</p>

<p>Associated with each member is a numeric score, which controls where the member sits in the set - Redis keeps members ordered by score (ZRANGE returns them smallest score first), and members with the same score are sorted in binary order of the member string.</p>
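<p>The ordering rule is easy to mimic in plain Ruby. This toy sketch models the order ZRANGE returns members in: lowest score first, with ties broken by comparing the member strings:</p>

```ruby
# A toy model of sorted set ordering: unique string members,
# each with a numeric score.
members = {
  "trans_237" => 30,
  "trans_001" => 10,
  "trans_236" => 10,
}

# Sort by score, then by the member string for equal scores
ordered = members.sort_by { |member, score| [score, member] }.map(&:first)
puts ordered.inspect
# => ["trans_001", "trans_236", "trans_237"]
```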

<p>We can create a sorted set for each data item, where the key is &quot;rule_id:data_value&quot; and each member is a transaction_id, and give each member a score which is the transaction date in unix timestamp format. That will give us a series of keys that look like:</p>

<pre><code>1546:joeblogs@nowhere.com =&gt; [trans_id_234(Timestamp)]
1546:james@nowhere.com    =&gt; [trans_id_235(Timestamp)]
1546:cheeky@nowhere.com   =&gt; [trans_id_237(Timestamp) trans_id_236(Timestamp) trans_id_001(Timestamp)]
</code></pre>

<p>Notice how the third set has 3 transactions in it, suggesting there may be something strange about that email address.</p>

<h2>Purging The Data</h2>

<p>There are two things to think about around purging data. In the majority of cases, when a data value is seen, it will not be seen again within the retention period for a rule. So if the TTL is set on the sorted set when it is created, it will be automatically removed when necessary.</p>

<p>There will be some data values that repeat, possibly a lot - in this case, the TTL should be reset each time a data value is added. To stop the list getting too long, we need to figure out how to remove elements from the list that have gotten too old.</p>

<p>Redis has commands to purge elements from a sorted set where the members scores fall within a range. As the scores here are timestamps, it is pretty easy to purge members older than the retention time.</p>

<h2>Indexing the Data Values</h2>

<p>It may be useful to have a way of viewing the latest data items added to a rule, and for this we can use another sorted set. This one would have a key of the rule_id, and the sorted set would contain the data value, along with the date of the transaction in unix timestamp format as the score. The newest data values can then be read off one end of the set (ZREVRANGE returns the highest scores, i.e. the newest values, first). We can also purge the old values out of this set in the same way as the other data lists. This set will potentially contain millions of members, but it is not strictly necessary to make the application work.</p>

<h2>Show me some code already!</h2>

<p>Once you figure out how to model the problem in Redis, the code doesn&#39;t really come to much, which is pretty nice. The method below can be used to add a new record to the database. Obviously this could be coded in a much better OO style, but putting it all into a single method like this makes it easier to see what is going on.</p>

<pre><code>def load_data_value(rule_id, retention_time, data_value, transaction_id, transaction_date)
  data_key = &quot;#{rule_id}:#{data_value}&quot;

  @redis.multi do
    # Create a sorted set for each value attached to a rule
    # The key is rule_id:rule_value - this is what we will lookup
    # to check the velocity, summing up the last N records, determined
    # by the score. 
    @redis.zadd data_key, transaction_date.to_i, transaction_id

    # Test if the set needs any items purged from it as they are past
    # its TTL
    @redis.zremrangebyscore data_key, &quot;-inf&quot;, Time.now.to_i - retention_time

    # Each time an element is added to the set, set the TTL on it to be
    # the retention time for the set
    @redis.expire data_key, retention_time

    # Optionally, create a sorted set to index all the data values, and
    # trim the number of items in it
    @redis.zadd rule_id, transaction_date.to_i, data_value
    @redis.zremrangebyscore rule_id, &quot;-inf&quot;, Time.now.to_i - retention_time
  end
end
</code></pre>

<p>If the records are loaded in the format above, then calculating a score immediately after adding an item is pretty simple too:</p>

<pre><code>def calculate_score(rule_id, data_value)
  @redis.zcard &quot;#{rule_id}:#{data_value}&quot;
end
</code></pre>

<p>This makes sense, as generally you will want to add a value and calculate its score at the same time. The Redis zcard command simply counts the number of members in the set with the given key, or returns zero if the key doesn&#39;t exist. This code is very much proof-of-concept code. To calculate a real Velocity Score, you will need to check more than one rule and sum them all up, but my goal here is to show how Redis can be used to solve the problem, so I am keeping it simple.</p>

<h2>Performance</h2>

<p>The thing I am really interested in is how this solution performs. To test this out, I am going to assume a transaction_id of 10 bytes and a data value of 100 random characters. I will also assume there are 1000 rules with IDs between 10,000 and 11,000, all having a TTL of 3600 seconds. The following Ruby code can be used to simulate this scenario:</p>

<pre><code>require &#39;redis&#39;

class SimulateVelocity

  CHARS = [&#39;a&#39;..&#39;z&#39;].map{|r|r.to_a}.flatten

  def initialize(host)
    @redis = Redis.new(:host =&gt; host)
  end

  def run_operation
    rule_id = random_number(1000) + 10000
    retention_time   = 3600
    transaction_id   = random_string(10)
    data_value       = random_string(100)
    transaction_date = Time.now

    load_data_value(rule_id,
                    retention_time,
                    data_value,
                    transaction_id,
                    transaction_date)

    calculate_score(rule_id,
                    data_value)
  end

  def load_data_value(rule_id, retention_time, data_value, transaction_id, transaction_date)
    data_key = &quot;#{rule_id}:#{data_value}&quot;

    @redis.multi do
      # Create a sorted set for each value attached to a rule
      # The key is rule_id:rule_value - this is what we will lookup
      # to check the velocity, summing up the last N records, determined
      # by the score.
      @redis.zadd data_key, transaction_date.to_i, transaction_id

      # Test if the set needs any items purged from it as they are past
      # its TTL
      @redis.zremrangebyscore data_key, &quot;-inf&quot;, Time.now.to_i - retention_time

      # Each time an element is added to the set, set the TTL on it to be
      # the retention time for the set
      @redis.expire data_key, retention_time

      # Optionally, create a sorted set to index all the data values, and
      # trim the number of items in it
      @redis.zadd rule_id, transaction_date.to_i, data_value
      @redis.zremrangebyscore rule_id, &quot;-inf&quot;, Time.now.to_i - retention_time
    end
  end

  def calculate_score(rule_id, data_value)
    @redis.zcard &quot;#{rule_id}:#{data_value}&quot;
  end

  private

  def random_string(length)
    str = &#39;&#39;
    1.upto(length) do
      str &lt;&lt; CHARS[rand(CHARS.size)]
    end
    str
  end

  def random_number(max_value)
    rand(max_value)
  end

end

sv = SimulateVelocity.new(&#39;redishost&#39;)
while(1) do
  puts sv.run_operation
end
</code></pre>

<p>With this code, I can plug it into some simple benchmarking code I have and see how many scores we can calculate per second.</p>

<p>I get a sustained rate of about 12K - 13K scores per second across 15 threads. In this setup, each score consists of 6 Redis calls, which means Redis is running at about 72K requests per second - not too bad on a single CPU.</p>

<p>If I remove the optional additional index for all the data values, performance jumps to 15K - 16K per second.</p>

<p>If you really need to score more than 10K velocity scores per second, which is almost 1 billion per day, it would be pretty trivial to design a system where different accounts are held on different Redis instances. Until Redis Cluster comes out, this would need to be done in the application.</p>

<h2>What About Memory</h2>

<p>Doing 12 - 16K requests per second is one thing, but what about memory? Remember, Redis requires the entire data set to be in memory.</p>

<p>Some stats I gathered on memory:</p>

<ul>
<li>100 byte data values with additional index - 490 bytes per key</li>
<li>50 byte data values with additional index - 393 bytes per key</li>
<li>100 byte data values, no index - 270 bytes per key</li>
<li>50 byte data values, no index - 227 bytes per key</li>
</ul>

<p>Storing 40M records is going to cost between 8.5GB and 18GB of memory, depending on your data, the retention time for each rule and whether you keep the additional index, so that is something to think about before deciding if this technique is feasible or not.</p>
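<p>The arithmetic behind those figures is simple to sanity check in Ruby:</p>

```ruby
# Rough memory cost of 40 million keys at the measured per-key sizes
records = 40_000_000

with_index    = records * 490 / (1024.0 ** 3)  # 100 byte values, with index
without_index = records * 227 / (1024.0 ** 3)  # 50 byte values, no index

puts format("with index: %.1fGB, without: %.1fGB", with_index, without_index)
# => with index: 18.3GB, without: 8.5GB
```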
]]>
      </description>
      <guid>https://appsintheopen.com/posts/22-calculating-velocity-scores-at-the-speed-of-redis</guid>
    </item>
    <item>
      <title>A Quick Overview Of Redis</title>
      <link>https://appsintheopen.com/posts/21-a-quick-overview-of-redis</link>
      <description>
<![CDATA[<p>Recently I decided to get myself somewhat up to speed on some of the many nosql databases. Redis is a tool I have heard people raving about for several years now, so I decided it would be a great place to start.</p>

<h1>What is Redis?</h1>

<p>Redis is an in memory key value store, but with a difference. Many key value stores allow a string value to be stored against a key, and that is it. Redis supports this, but it also supports other simple types too:</p>

<ul>
<li>Lists - an in memory array allowing elements to be pushed and popped</li>
<li>Hashes - not much more to say about this really</li>
<li>Sets - like a list, only each element can only be stored once. Pushing a duplicate element onto a set will still result in only one occurrence of that element being stored.</li>
<li>Sorted Sets - Similar to sets, except each element has a numeric score, and the elements are stored in score order.</li>
</ul>
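
<p>A quick redis-cli session gives the flavour of each type - the key names are made up:</p>

```
RPUSH mylist a b c        # a list: a, b, c
HSET myhash field1 hello  # a hash with one field
SADD myset a a b          # a set: a, b - the duplicate a is ignored
ZADD myzset 2 two 1 one   # a sorted set, ordered by score: one, two
```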

<p>Plenty has been written about these types in other places, and I am not going to say much more about them. Read up in the <a href="http://redis.io/documentation">Redis Manual</a> and have a scan over all the available commands.</p>

<h1>All In Memory</h1>

<p>With Redis, all of your data needs to fit in memory. It has a couple of persistence options to ensure your data is not lost if Redis crashes, but this data is never read except at startup time.</p>

<p>In Redis 2.2 an experimental feature was added called Virtual Memory, but it was later removed in Redis 2.6. The Redis team decided that they wanted to do one thing well - serve data from memory - and not be concerned with reading data from disk. </p>

<h1>Simple</h1>

<p>One interesting thing about Redis is that it is very simple. It is a single threaded server, and in version 2.2 was only <a href="http://redis.io/topics/internals">20k lines of code</a>. Due to the single threaded design, Redis will use at most 1 CPU core (except when it is background saving, when it could use another one), and this helps keep the code simple - Redis doesn&#39;t need to worry about latches and locks to prevent concurrent processes stomping over each other. </p>

<p>It can only run a single command at a time - in other words each command is atomic and blocks the entire server while it is being processed. When you first hear this, it sounds like a terrible, unscalable design - until you realize just how fast typical Redis commands are. 50K - 100K requests per second are typical on commodity hardware on a single CPU.</p>

<h1>Single Node</h1>

<p>Right now, Redis is a single node database. If your data is too large to fit on a single machine, then sharding it across multiple machines is a job for the application. Apparently Redis Cluster is coming and should provide sharding capabilities.</p>

<p>One Redis node can be replicated to another very easily, so having a fail-over instance that is identical to the master is pretty easy right now.</p>

<h1>Getting Started Is Easy</h1>

<p>Getting started with Redis is so simple, it is literally as easy as:</p>

<pre><code>$ make
$ cd src
$ ./redis-server
</code></pre>

<p>Unlike many databases, you can be up and running in about 5 minutes. Setting up replication is just about as easy.</p>

<p>You can read the entire Redis manual and understand it all in under a day, which is one of the reasons I decided to investigate it.</p>

<h1>Performance</h1>

<p>Redis is the first non-relational database I have ever experimented with, and I was sceptical about being able to hit the performance numbers that were being suggested. Luckily Redis comes with a handy benchmarking tool, which allows you to see how it performs on your hardware.</p>

<p>Simply running:</p>

<pre><code>$ ./redis-benchmark
</code></pre>

<p>will run a bunch of tests you can use to compare your setup with others. The default tests are not terribly real world in my opinion, as the value set for any key is always 2 bytes, but this can be changed with the -d switch. </p>

<p>Running the benchmark on my hardware, with a payload size of 200 bytes, gives the following results for set and get operations:</p>

<pre><code>====== SET ======
  10000 requests completed in 0.07 seconds
  50 parallel clients
  200 bytes payload
  keep alive: 1

100.00% &lt;= 0 milliseconds
151515.16 requests per second

====== GET ======
  10000 requests completed in 0.08 seconds
  50 parallel clients
  200 bytes payload
  keep alive: 1

99.51% &lt;= 8 milliseconds
99.97% &lt;= 9 milliseconds
100.00% &lt;= 9 milliseconds
131578.95 requests per second
</code></pre>

<p>At 150K sets per second and 131K gets per second, the single threaded nature of Redis doesn&#39;t seem so bad anymore.</p>

<p>The benchmark tool also lets you run any Redis command you like. For instance, to test the cost of pushing 1 million items onto a sorted set (a more complex operation than a simple key-value set operation), try the following command:</p>

<pre><code>$ ./redis-benchmark -r 1000000 -n 1000000 zadd sortedset 10 random_value_that_is_a_little_long_:rand:000000000000

====== zadd sortedset 10     random_value_that_is_a_little_long_:rand:000000000000 ======
  1000000 requests completed in 10.70 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1

99.96% &lt;= 1 milliseconds
99.99% &lt;= 2 milliseconds
100.00% &lt;= 9 milliseconds
100.00% &lt;= 10 milliseconds
100.00% &lt;= 10 milliseconds
93466.68 requests per second
</code></pre>

<p>93K requests per second - not bad at all.</p>

<h1>Use Cases</h1>

<p>Redis has plenty of potential use cases - I cannot possibly list them all. At its simplest, it can be used as a cache for a web application; it also has applications in queuing and distributed object stores, and with some thought the lists, sets and sorted sets open up many more.</p>
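<p>The queuing case, for example, needs nothing more than a list - a producer pushes jobs onto one end and workers block waiting to pop them off the other (the key name is made up):</p>

```
LPUSH jobs "job-payload-1"   # producer enqueues a job
BRPOP jobs 0                 # worker blocks until a job arrives
```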
]]>
      </description>
      <guid>https://appsintheopen.com/posts/21-a-quick-overview-of-redis</guid>
    </item>
    <item>
      <title>Nosql isn't just Hype</title>
      <link>https://appsintheopen.com/posts/20-nosql-isn-t-just-hype</link>
      <description>
        <![CDATA[<p>Nosql - this is a term I have been hearing for some years now, and I pretty much ignored all the fuss. I&#39;ve spent quite a large part of the last decade becoming good at building applications around Oracle databases, and I assumed &quot;nosql&quot; was something people fell back on when they failed to scale MySQL or Oracle properly.</p>

<p>However, more recently I encountered a colleague who was well versed in all things Oracle, but who also sings the praises of nosql databases.</p>

<p>So I asked him to give me an overview to see what it was all about, and then spent some time reading about nosql databases.</p>

<h2>Not Only SQL</h2>

<p>One of the first things I learned was that the term nosql does not stand for No SQL (or at least it shouldn&#39;t). A better definition is Not Only SQL, and that is interesting. It suggests that maybe there is room in your organization (or application) for more than one database, and they don&#39;t all have to be queried with SQL. So for me, nosql refers to databases that store data in a non-relational way.</p>

<h2>Many Different Options</h2>

<p>In the relational database space, you have the open source options, MySQL and Postgres, and the commercial options, Oracle, SQL Server and Sybase. Arguably, each of these databases is as good as the others, and largely the same techniques can be used to design a successful application around any of them.</p>

<p>In the nosql space, there are quite a few options and some of them are good at different things.</p>

<h2>Key Value Stores</h2>

<p>Some are good at serving data out of memory. <a href="http://redis.io/">Redis</a> is a good example, along with <a href="http://memcached.org/">Memcached</a>. Redis can persist the in memory dataset to disk, but it never reads from it - you need to have enough memory to hold the entire dataset. Memcached on the other hand is just a memory store - perfect for caching, but if you shut the server down the data is gone forever.</p>

<p>Typically these key value stores are in memory hash tables - everything is accessed by a key, and there are no other indexes.</p>

<h2>Big Data</h2>

<p>After the key value stores come the Big Data solutions. <a href="http://hadoop.apache.org/">Hadoop</a> is the first thing that comes to mind. The premise of Big Data databases is that it is much cheaper to store many terabytes of data than it is with a relational database. At first I struggled to understand how this could be the case - surely disk is disk - but it turns out it isn&#39;t really. </p>

<p>For a relational database, data generally lives on SAN which is never expected to fail. This is expensive. In Hadoop, data lives on commodity disks in cheap commodity servers - these disks and servers are expected to fail, so the data is duplicated across many nodes of the cluster, typically 3. The drawback is that it often isn&#39;t very efficient to query small amounts of data in Hadoop in real time, so it doesn&#39;t lend itself to OLTP applications. It is good for analysing log files to generate reports and many other things.</p>

<h2>Clusters</h2>

<p>There are other databases that have elements of big data, key value and relational stores. <a href="http://cassandra.apache.org/">Cassandra</a>, <a href="http://hbase.apache.org/">HBase</a>, <a href="http://basho.com/products/riak-overview/">Riak</a> and <a href="http://www.couchbase.com/">CouchBase</a> may fall into this category. They are designed to answer queries in real time to suit OLTP applications, while handling node failures gracefully. Many of them are eventually consistent and use consistent hashing, and each makes a different trade-off under the CAP theorem. I still have a lot to learn here I think.</p>
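
<p>Consistent hashing is the piece that lets these clusters add or remove nodes without remapping most keys. A minimal sketch of the idea in Ruby (node names and the virtual-node count are invented for illustration):</p>

<pre><code>require 'digest'

# A toy consistent-hash ring: each node is hashed onto the ring many
# times (virtual nodes), and a key belongs to the first node found at
# or after the key's own position, wrapping around to the start.
class HashRing
  def initialize(nodes, replicas = 100)
    @ring = {}
    nodes.each do |node|
      replicas.times { |i| @ring[hash_value("#{node}:#{i}")] = node }
    end
    @sorted = @ring.keys.sort
  end

  def node_for(key)
    h = hash_value(key)
    point = @sorted.find { |p| p >= h } || @sorted.first
    @ring[point]
  end

  private

  def hash_value(s)
    Digest::MD5.hexdigest(s)[0, 8].to_i(16)
  end
end

ring = HashRing.new(["node-a", "node-b", "node-c"])
puts ring.node_for("account:1234")
</code></pre>

<p>Losing a node only remaps the keys that lived on it - the rest stay put, which is exactly what you want when nodes are expected to fail.</p>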

<h2>Others</h2>

<p><a href="http://www.neo4j.org/">Neo4J</a> is a graph database. It stores data in graphs instead of tables, and is good for applications that lend themselves to graphs - notice how vague my description is. There is more to learn here too.</p>

<h2>Resources</h2>

<p>Some places I have found useful in helping me to learn about nosql:</p>

<ul>
<li><a href="http://nosqltapes.com">NoSQL Tapes</a></li>
<li><a href="http://nosql.mypopescu.com">myNoSQL</a></li>
</ul>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/20-nosql-isn-t-just-hype</guid>
    </item>
    <item>
      <title>Connecting to Sybase from Java with the JTDS drivers</title>
      <link>https://appsintheopen.com/posts/19-connecting-to-sybase-from-java-with-the-jtds-drivers</link>
      <description>
<![CDATA[<p>A short piece of Java code that uses the jTDS drivers to connect to Sybase and issue a query:</p>

<pre><code>import java.sql.*;
import net.sourceforge.jtds.jdbc.*;

public class SimpleProcSybase {

    private static Connection conn;

    public static void main(String[] args)
    throws ClassNotFoundException, SQLException, InterruptedException
    {
        connect();
        System.out.println (&quot;Got connected OK&quot;);
        PreparedStatement stmt = conn.prepareStatement(&quot;select &#39;abc&#39;&quot;);
        ResultSet res = stmt.executeQuery();
        while(res.next()) {
            System.out.println(res.getString(1));
        }
    }

    public static void connect()
    throws ClassNotFoundException, SQLException
    {
        DriverManager.registerDriver
            (new net.sourceforge.jtds.jdbc.Driver());

        String url = &quot;jdbc:jtds:sybase://localhost:4100&quot;;

        conn = DriverManager.getConnection(url,&quot;user&quot;,&quot;password&quot;);
    }

}
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/19-connecting-to-sybase-from-java-with-the-jtds-drivers</guid>
    </item>
    <item>
      <title>Installing the libv8 Ruby gem on Centos 5.8</title>
      <link>https://appsintheopen.com/posts/18-installing-the-libv8-ruby-gem-on-centos-5-8</link>
      <description>
        <![CDATA[<p>First, Centos 5.8 ships with gcc 4.1.4, but to compile libv8 you need 4.4. Luckily, this step is easy:</p>

<pre><code>$ yum install gcc44-c++
</code></pre>

<p>Next, you need to tell the build to make use of it. The easiest thing to do here, is to export a couple of environment variables:</p>

<pre><code>$ export CC=/usr/bin/gcc44
$ export CXX=/usr/bin/g++44
</code></pre>

<p>Now if you attempt gem install libv8, you will get an error along the lines of:</p>

<pre><code>$ gem install libv8
creating Makefile
Using compiler: /usr/bin/g++44
Traceback (most recent call last):
  File &quot;build/gyp/gyp&quot;, line 15, in ?
    import gyp
  File &quot;build/gyp/pylib/gyp/__init__.py&quot;, line 8, in ?
    import gyp.input
  File &quot;build/gyp/pylib/gyp/input.py&quot;, line 14, in ?
    import gyp.common
  File &quot;build/gyp/pylib/gyp/common.py&quot;, line 375
    with open(source_path) as source_file:
            ^
SyntaxError: invalid syntax
gmake: *** [out/Makefile.x64] Error 1
GYP_GENERATORS=make \
    build/gyp/gyp --generator-output=&quot;out&quot; build/all.gyp \
                  -Ibuild/standalone.gypi --depth=. \
                  -Dv8_target_arch=x64 \
                  -S.x64 -Dhost_arch=x64
</code></pre>

<p>This is because the version of Python that is shipped with Centos is too old. Upgrading Python is not too hard, but be warned - do not under any circumstances replace the shipped Centos Python - lots of stuff depends on it, and if you replace it, lots of stuff will break.</p>

<p>To install Python 2.7:</p>

<pre><code>$ yum install bzip2
$ yum install bzip2-devel
$ wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz
$ tar -xf Python-2.7.3.tgz
$ cd Python-2.7.3
$ ./configure
$ make
$ make altinstall
</code></pre>

<p>The final step is very important - this stops it overwriting the default Centos Python. We are on the home straight now. To get the libv8 install to use Python 2.7 instead of Python 2.4, I thought I could create a symlink in my local directory, and then slip my local directory onto the front of my path: </p>

<pre><code>$ ln -s /usr/local/bin/python2.7 python
$ export PATH=.:$PATH
$ python --version
Python 2.7.3
</code></pre>

<p>However, that didn&#39;t work. </p>

<p>I don&#39;t know why - maybe the Makefile explicitly references /usr/bin/python? What I did was move the existing Python executable out of the way, and symlink the Python 2.7 in its place:</p>

<pre><code>$ mv /usr/bin/python /usr/bin/python_
$ ln -s /usr/local/bin/python2.7 /usr/bin/python
</code></pre>

<p>Finally:</p>

<pre><code>$ gem install libv8
( about 5 minutes later)
Building native extensions.  This could take a while...
Successfully installed libv8-3.11.8.4
1 gem installed
Installing ri documentation for libv8-3.11.8.4...
Installing RDoc documentation for libv8-3.11.8.4...
</code></pre>

<p>Now remember to put Python back the way you found it:</p>

<pre><code>$ rm /usr/bin/python
$ mv /usr/bin/python_ /usr/bin/python
</code></pre>

<p>Job done - finally.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/18-installing-the-libv8-ruby-gem-on-centos-5-8</guid>
    </item>
    <item>
      <title>Ruby 1.9.3 libyaml centos 5.6</title>
      <link>https://appsintheopen.com/posts/17-ruby-1-9-3-libyaml-centos-5-6</link>
      <description>
<![CDATA[<p>Installing Ruby used to be easy, but I came across this tricky problem when doing some upgrades on my rails server. Any time I ran the gem command on my new ruby install, I got the warning:</p>

<pre><code>It seems your ruby installation is missing psych (for YAML output).
To eliminate this warning, please install libyaml and reinstall your ruby.
</code></pre>

<p>I came across <a href="http://collectiveidea.com/blog/archives/2011/10/31/install-ruby-193-with-libyaml-on-centos/">this post</a>, which claims to fix the problem, but  it didn&#39;t for me.</p>

<p>Anyway, what I did to get it working was follow the instructions in the post above:</p>

<pre><code>$ wget http://pyyaml.org/download/libyaml/yaml-0.1.4.tar.gz
$ tar xzvf yaml-0.1.4.tar.gz
$ cd yaml-0.1.4
$ ./configure --prefix=/usr/local
$ make
$ make install
</code></pre>

<p>Then install Ruby:</p>

<pre><code>$ ./configure --prefix=/usr/local/lib/ruby_1.9.3 --enable-shared --disable-install-doc --with-opt-dir=/usr/local/lib
$ make
$ make install
</code></pre>

<p>However, the problem persisted. To fix it, all I did was:</p>

<pre><code>$ cp /usr/local/lib/libyaml* /usr/local/lib/ruby_1.9.3/lib
</code></pre>

<p>Problem solved - 2 hours into a 15 minute task :-(</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/17-ruby-1-9-3-libyaml-centos-5-6</guid>
    </item>
    <item>
      <title>DBGeni - Better database installs</title>
      <link>https://appsintheopen.com/posts/16-dbgeni-better-database-installs</link>
      <description>
        <![CDATA[<p>Having worked as a developer on projects with (Oracle) databases at their core for around 10 years, I have seen quite a number of releases that needed to change database objects.</p>

<p>Database changes are different from application server changes, in that when you build a new version of the app, it completely replaces the old version, even if only a couple of lines of code changed. In many cases you can even have the old and new versions of the application live at the same time, maintaining service through an upgrade.</p>

<p>With databases, it is not so simple - each release builds on the existing database, generally by applying a script to modify its structure. The hard part of database releases is how to produce this script, and many teams do it in different ways, from diffing databases, to exporting code and doing manual comparisons, and everything in between.</p>

<h1>Inspired by Rails</h1>

<p>Quite a few years ago now, Ruby on Rails appeared on the scene, and with it came a new-to-me way to manage database changes with migration scripts. The idea was simple: starting with an empty database, you should be able to take a series of scripts, arranged in date-stamped order, and apply each in turn to the database to get the current version. These migration scripts are version controlled with the application code, so when the code is branched or tagged, they can be run against an empty database to produce the database version the application depends on. Better still, given version 1 of the application and database, only the additional scripts needed to move the database to version 2 have to be applied - which is straightforward if the migrations already applied are logged in the database.</p>
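
<p>The core of that idea fits in a few lines of Ruby - this is a toy sketch, not Rails&#39; actual implementation, with invented timestamps and an in-memory log standing in for the database table:</p>

<pre><code># Toy migration runner: apply date-stamped scripts in order, skipping
# any whose timestamp is already logged as applied.
def pending_migrations(all_migrations, applied)
  all_migrations.sort_by { |ts, _name| ts }
                .reject  { |ts, _name| applied.include?(ts) }
end

all = [
  ["20120301170000", "create_orders"],
  ["20120101120000", "create_accounts"],
  ["20120215093000", "add_index_on_accounts"],
]
applied = ["20120101120000"]  # logged during a previous release

pending_migrations(all, applied).each do |ts, name|
  puts "applying #{ts} #{name}"  # a real tool would run the script here
  applied << ts
end
</code></pre>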

<p>One thing Rails didn&#39;t need to care about is stored procedures, but they are actually easier to handle than tables and indexes, because when they change they can be completely replaced, just like application code.</p>

<p>In my then day job, I took the ideas from Rails, and applied them to a broken build process, creating an installer to manage changes in a massive PLSQL application developed by a team of over 100 people.</p>

<h1>DBGeni</h1>

<p>This first version of the installer was pretty horrible. I used it to teach myself Ruby, and what I produced was Rubyized Perl code - i.e. not very object oriented - and worse, it was tightly coupled to our application. Still, it was better than what we had before, and it transformed the release process. I later left that job and joined a new company, and it turned out the database release process was just as bad there.</p>

<p>From this, <a href="http://dbgeni.com">DBGeni</a> was born - the DataBase GENeric Installer. DBGeni is a Ruby gem that provides a simple command line interface to apply (and rollback) database migrations and stored procedure code.</p>

<ul>
<li>It works with Oracle, Sybase and MySQL. </li>
<li>It is designed in such a way that other databases can easily be plugged in.</li>
<li>It is designed with convention over configuration in mind, getting the job done with minimal config.</li>
<li>It supports plugins and is script-able, making it quite flexible.</li>
</ul>

<h1>Why Build it?</h1>

<p>Do I think DBGeni is my path to internet fame and riches? Probably not, but coding it was a fun project. It allowed me to build a non-trivial Ruby application, package it in a gem, employ a TDD approach to development, hone my OO skills on a greenfield project and generally improve my Ruby knowledge.</p>

<p>I also took the time to create a simple website and a fairly complete manual, which currently sees very little traffic (more on that later), but it let me see how much non-coding work is involved in producing something other people can use!</p>

<h1>Alternatives?</h1>

<p>After starting on DBGeni, I came across <a href="http://dbdeploy.com/">DBDeploy</a>, which does something very similar to DBGeni, but in a slightly different way. This didn&#39;t put me off working on DBGeni; it further reinforced my belief that managing database changes in this way is a good idea. DBDeploy seems to have at least a few users, so there is potential for DBGeni to get some adoption too.</p>

<h1>Adoption?</h1>

<p>Right now, I am fairly certain that I am the sole user of DBGeni. The website is not getting much organic traffic from Google, and to be honest, I haven&#39;t promoted it at all, except on my own <a href="http://betteratoracle.com">Oracle blog</a> and this one.</p>

<p>At one point, I had visions of charging for DBGeni, but now I am not so sure. First of all, I would need to get a few teams to use the tool to see if they like it and give it some thorough field testing, which means spending some time on promotion.</p>

<p>Right now, I am happy to have produced a tool as a side project and I think it solves a real problem. At the very least, I can use it as an example of my work, and going forward I will spend some time on promotion and see if I can get anyone to use it.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/16-dbgeni-better-database-installs</guid>
    </item>
    <item>
      <title>Oracle JDBC connections slow to connect /dev/urandom</title>
      <link>https://appsintheopen.com/posts/15-oracle-jdbc-connections-slow-to-connect-dev-urandom</link>
      <description>
        <![CDATA[<p>If your JDBC connections to Oracle take a long time (many seconds) to connect, it could be due to a problem with the random number generator on Linux.</p>

<p>Apparently /dev/random is known to be slow in some instances at generating random numbers, so the solution is to change it to use /dev/urandom instead. </p>

<p>However, there is a further bug in Java 5 (and perhaps above) - if the value is set to /dev/urandom, it will be ignored, falling back to the default /dev/random.</p>

<p>The solution is to set it to /dev/./urandom instead.</p>

<p>To set the value, edit the java.security file in your JAVA_HOME and find the line starting securerandom.source and change it as below:</p>

<pre><code>securerandom.source=file:/dev/./urandom
</code></pre>

<p>This simple change cut my connection times to Oracle from 5 - 10 seconds to almost instant on CentOS 5.6 with Java 1.6.0 and the Oracle ojdbc6 drivers.</p>

<p>More info <a href="http://docs.oracle.com/cd/E12529_01/wlss31/configwlss/jvmrand.html">here</a>, <a href="http://bugs.sun.com/view_bug.do;jsessionid=ff625daf459fdffffffffcd54f1c775299e0?bug_id=6202721">here</a> and <a href="http://stackoverflow.com/questions/137212/how-to-solve-performance-problem-with-java-securerandom">here</a></p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/15-oracle-jdbc-connections-slow-to-connect-dev-urandom</guid>
    </item>
    <item>
      <title>JDBC, JRuby and Oracle</title>
      <link>https://appsintheopen.com/posts/14-jdbc-jruby-and-oracle</link>
      <description>
        <![CDATA[<p>Connecting to Oracle from MRI Ruby is pretty easy, thanks to the excellent <a href="http://ruby-oci8.rubyforge.org/en/">OCI8 drivers</a>.</p>

<p>Sometimes it can be useful to use the very robust Oracle JDBC drivers and connect to Oracle over JDBC, which is possible using JRuby.</p>

<ul>
<li>First install JRuby.</li>
<li>Next get the Oracle JDBC drivers by downloading ojdbc6.jar from <a href="http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html">Oracle</a>.</li>
<li>Put the ojdbc6.jar file into the lib directory inside the JRuby install, eg C:\jruby-1.6.5\lib.</li>
</ul>

<p>With all that installed, the following code should connect successfully to Oracle:</p>

<pre><code>require &#39;java&#39;

java_import &#39;oracle.jdbc.OracleDriver&#39;
java_import &#39;java.sql.DriverManager&#39;

oradriver = OracleDriver.new
DriverManager.registerDriver oradriver
conn = DriverManager.get_connection(&#39;jdbc:oracle:thin:@localhost/local11gr2.world&#39;,
                           &#39;user&#39;, &#39;password&#39;)
conn.auto_commit = false

stmt = conn.prepare_statement(&#39;select * from dual&#39;)

rowset = stmt.executeQuery()
while (rowset.next()) do
  puts rowset.getString(1)
end
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/14-jdbc-jruby-and-oracle</guid>
    </item>
    <item>
      <title>Connecting to Sybase with JRuby using the jtds drivers</title>
      <link>https://appsintheopen.com/posts/13-connecting-to-sybase-with-jruby-using-the-jtds-drivers</link>
      <description>
        <![CDATA[<p>From the searching I have done, getting a working gem for Sybase is close to impossible for MRI Ruby, but it is pretty simple for JRuby.</p>

<ul>
<li>First install JRuby.</li>
<li>Then install the dbi gem (it requires the deprecated-2.0.1 gem, but the gem command should install it automatically).</li>
<li>Next install the dbd-jdbc gem.</li>
<li>Finally, get the <a href="http://jtds.sourceforge.net/">jTDS</a> open source and pure Java drivers for Sybase (and SQL Server). Extract the jtds-1.2.5.jar file from the download and drop it into the lib directory inside the jruby install, eg C:\jruby-1.6.5\lib</li>
</ul>

<p>With that all installed and working, the following sample code should connect to a Sybase instance (changing the hostname, port, user, and password):</p>

<pre><code>require &#39;java&#39;
java_import &#39;net.sourceforge.jtds.jdbc.Driver&#39;
require &#39;rubygems&#39;
require &#39;dbi&#39;

dbh = DBI.connect(&#39;dbi:Jdbc:jtds:sybase://hostname:port/cfg&#39;, &#39;user&#39;, &#39;password&#39;, {&#39;driver&#39; =&gt; &#39;net.sourceforge.jtds.jdbc.Driver&#39;} )

stmt = dbh.prepare(&quot;select account_number from account&quot;)
stmt.execute    
while (r = stmt.fetch) do
  puts r
end
stmt.finish
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/13-connecting-to-sybase-with-jruby-using-the-jtds-drivers</guid>
    </item>
    <item>
      <title>Installing ruby-oci8 on 64 bit Windows</title>
      <link>https://appsintheopen.com/posts/12-installing-ruby-oci8-on-64-bit-windows</link>
      <description>
<![CDATA[<p>To use the Ruby OCI8 Oracle bindings, you need to have at least an Oracle client installed. As I needed an Oracle database on my system, I just did a full database install. Everything worked fine as far as Oracle and SQL*Plus were concerned, so I forged ahead and installed the ruby-oci8 gem:</p>

<pre><code>gem install ruby-oci8
</code></pre>

<p>This completed without any errors. However, when I attempted to run it, I got the error:</p>

<pre><code>C:\Users\sodonnel&gt; ruby -r oci8 -e &quot;OCI8.new(&#39;scott&#39;, &#39;tiger&#39;, &#39;local11gr2&#39;).exec(&#39;select * from emp&#39;) do |r| puts r.join(&#39;,&#39;); end&quot;
C:/Ruby192/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require&#39;: OCI.DLL: 193(%1 is not a valid Win32 application.  ) (LoadError)
    from C:/Ruby192/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require&#39;
    from C:/Ruby192/lib/ruby/gems/1.9.1/gems/ruby-oci8-2.0.6-x86-mingw32/lib/oci8.rb:38:in `&lt;top (required)&gt;&#39;
    from &lt;internal:lib/rubygems/custom_require&gt;:33:in `require&#39;
    from &lt;internal:lib/rubygems/custom_require&gt;:33:in `rescue in require&#39;
    from &lt;internal:lib/rubygems/custom_require&gt;:29:in `require&#39;
</code></pre>

<p>The problem here is that the ruby gem is a pre-built binary for 32 bit Windows - it does not attempt to compile against the Oracle libraries (as that would require too many Windows build tools). When it attempts to load OCI.DLL from the Oracle install, it finds a 64 bit library, which the 32 bit build cannot use.</p>

<p>The solution is to somehow get a working 32 bit OCI.DLL. The easiest way to do this is to install the Oracle Instant Client alongside your Oracle install. Download both instantclient-basic-nt-11.2.0.2.0.zip and instantclient-sdk-nt-11.2.0.2.0.zip from <a href="http://www.oracle.com/technetwork/topics/winsoft-085727.html">Oracle</a>. Then extract them both into a folder, and add that folder to your path. I put them into:</p>

<pre><code>C:\app\sodonnel\product\instant_client
</code></pre>

<p>Then the OCI8 gem should work correctly.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/12-installing-ruby-oci8-on-64-bit-windows</guid>
    </item>
    <item>
      <title>On getting stuff done in companies</title>
      <link>https://appsintheopen.com/posts/11-on-getting-stuff-done-in-companies</link>
      <description>
<![CDATA[<p>Interesting <a href="http://news.ycombinator.com">Hacker News</a> comment that struck a chord with me today. On the topic of how to hire people who get stuff done, one insightful reader commented that it could just as easily be the organization that stops people - people with a track record of getting stuff done who just seem to stall in their new position:</p>

<blockquote>
<p>I had this experience working at Google. I had a horrible time getting anything done there. Now I spent a bit of time evaluating that since it had never been the case in my career, up to that point, where I was unable to move the ball forward and I really wanted to understand that. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.</p>

<p>The fear was risk and safety. Folks moved around a lot and so you had people in charge of systems they didn&#39;t build, didn&#39;t understand all the moving parts of, and were apt to get a poor rating if they broke. When dealing with people in that situation one could either educate them and bring them along, or steam roll over them. Education takes time, and during that time the &#39;teacher&#39; doesn&#39;t get anything done. This favors steamrolling evolutionarily :-)</p>

<p>So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren&#39;t up for that, well they aren&#39;t going to be nearly as successful as you would like them to be.</p>

<p>The other risk is of course the people who &#39;get a lot done&#39; but don&#39;t need to. Which is to say they can rewrite your CRM system and push it out to the world in a week but only by writing it from scratch.</p>
</blockquote>

<p>Couldn&#39;t have put this better myself. The <a href="http://news.ycombinator.com/item?id=3018643">link to the comment</a> and the <a href="http://news.ycombinator.com/item?id=3018073">entire discussion</a>.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/11-on-getting-stuff-done-in-companies</guid>
    </item>
    <item>
      <title>Achievements in 2010</title>
      <link>https://appsintheopen.com/posts/10-achievements-in-2010</link>
      <description>
        <![CDATA[<p>The end of the year is almost here, so maybe it&#39;s worthwhile having a think about what I achieved this year in general life (ie not just behind a computer terminal).  So here goes:</p>

<ul>
<li>I learned to juggle - 3 balls well, 4 balls badly. It&#39;s still a work in progress and is surprisingly fun.</li>
<li>I launched a web application to do group code reviews, but subsequently closed it down. I could have stuck with it, but it wasn&#39;t really good enough and getting users was hard. Still, it was a first step and I have more ideas in the pipeline.</li>
<li>I broke my personal best swimming times. My PB 1km time is now 15 minutes 5 seconds, a 20 second improvement. My 100M and 200M times have improved a lot too.</li>
<li>I started this blog.</li>
<li>I went hiking for the first time, up actual mountains and immediately wondered why it took me 30 years to discover it.</li>
<li>I am doing better than ever in my day job, but have realised with every day that goes past that I need to do something other than work for a mega-corporation for the rest of my days.</li>
<li>I tried surfing and loved it, despite being a total novice.</li>
<li>I played Frisbee for the first time in years.</li>
<li>I went sledging in the snow in the park for the first time since I was a child.</li>
<li>I learned a few new tricks in the kitchen.</li>
<li>I turned 30 years old.</li>
</ul>

<h2>So what is on the horizon for 2011?</h2>

<ul>
<li>Start getting some content onto this blog.</li>
<li>Launch at least another web application or online business idea.</li>
<li>Learn to surf well.</li>
<li>Break 15 minutes for a 1km swim.</li>
<li>Go on a great holiday or two.</li>
<li>Adopt a new &#39;give anything a go&#39; attitude.</li>
<li>Have fun at every opportunity.</li>
</ul>

<p>I achieved or learned a few other things which I am not going to reveal publicly, but overall I think it&#39;s not a bad list. I could have done more, but I certainly could have done a lot less.</p>

<p>Always be learning and always be trying new things - that is the motto for 2011!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/10-achievements-in-2010</guid>
    </item>
    <item>
      <title>A few general emacs tips</title>
      <link>https://appsintheopen.com/posts/9-a-few-general-emacs-tips</link>
      <description>
        <![CDATA[<p>To finish off the series on Emacs, this post outlines a few tips and tricks I have picked up over the years.</p>

<h2>Javascript</h2>

<p>If you are coding Javascript, the default &#39;js-mode&#39; in Emacs could do with upgrading. What you really want is <a href="http://code.google.com/p/js2-mode/">js2-mode</a>. This mode checks the syntax of your Javascript <em>as you type</em>. I am no Javascript guru, but working with Javascript using js2-mode is like night and day compared to working without it.</p>

<h2>PLSQL Mode</h2>

<p>I do a lot of Oracle PLSQL development. Once upon a time I used  <a href="http://www.quest.com/toad/">TOAD</a> as my IDE. However, once you start using Emacs and get used to it, you want to use it for everything.  Enter <a href="http://www.emacswiki.org/emacs/plsql.el">plsql-mode</a>. This will syntax highlight plsql, and let you compile it with C-c c. Any errors are marked and displayed in a split window, and clicking on an error jumps you to the line of code causing it (standard emacs compile-mode behaviour).</p>

<h2>Perl Mode</h2>

<p>I found the default emacs Perl mode to be a bit clunky, but I cannot remember why. The improved <a href="http://www.emacswiki.org/emacs/CPerlMode">cperl-mode</a> is much better.</p>

<h2>Keyboard Macros</h2>

<p>These just blew my mind. You know the story:</p>

<blockquote>
<p>Here are a bunch of 2000 account numbers, can you query their status on the database please?</p>
</blockquote>

<p>So to do this, you can write a program that opens a file, reads it a line at a time and runs an SQL query for each one. </p>

<p>Or</p>

<p>You could load all the accounts into a database table and join them to produce a report.  To do this, for each account you need to take</p>

<pre><code>1234567890
</code></pre>

<p>and turn it into </p>

<pre><code>insert into tempAccounts values (&#39;1234567890&#39;);
</code></pre>

<p>Enter keyboard macros. You &#39;record&#39; a series of keystrokes for one line, and then have emacs repeat it for the rest of the lines. There is <a href="http://www.emacswiki.org/emacs/KeyboardMacros">more information</a> on the emacswiki. Once you know about this tool, repetitive changes to text files becomes a breeze.</p>
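
<p>The same transformation can of course also be scripted - here in Ruby, reusing the table name from the example above (the account numbers are made up):</p>

<pre><code># Wrap each account number in an insert statement - exactly what the
# keyboard macro does, line by line.
accounts = ["1234567890", "2345678901"]
inserts = accounts.map { |a| "insert into tempAccounts values ('#{a}');" }
puts inserts
</code></pre>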

<h2>Better Colors</h2>

<p>I was never happy with the default emacs color scheme. There are quite a few good ones out there. First, install <a href="http://www.emacswiki.org/emacs/ColorTheme">color-theme</a>, then pick a color scheme you like. Personally, I like <a href="http://www.emacswiki.org/emacs/color-theme-tango.el">color-theme-tango</a>.</p>

<h2>Sane Defaults</h2>

<p>The number of times I have quit emacs by accident is just too many. I like it to prompt me and ask &#39;if I really want to quit&#39;:</p>

<pre><code>; stops me killing emacs by accident!
(setq confirm-kill-emacs &#39;yes-or-no-p)
</code></pre>

<p>Once you get used to the emacs keyboard commands, you won&#39;t have much need for the menu and tool bars. They take up valuable screen space, so I get rid of them:</p>

<pre><code>; turn off the tool and menu bars
(tool-bar-mode -1)
(menu-bar-mode -1)
</code></pre>

<p>Finally, the default splash screen that appears each time emacs loads can be a bit of a pain, so to get rid of it, simply add this to .emacs:</p>

<pre><code>(setq inhibit-splash-screen t)
</code></pre>

<h2>Fly Modes</h2>

<p>Watch out for &#39;flymake&#39; modes. These highlight syntax errors in your code as you type, without doing a full compile, and they can be real time savers. Ruby has a flymake mode, so does Perl, and so does Javascript (as part of js2-mode). If you are typing text, you can even have flyspell-mode, which I am using right now to type this!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/9-a-few-general-emacs-tips</guid>
    </item>
    <item>
      <title>Rails options for Emacs</title>
      <link>https://appsintheopen.com/posts/8-rails-options-for-emacs</link>
      <description>
<![CDATA[<p>There are several options when it comes to Rails development in Emacs, and which one you use probably depends on how much time you want to invest in learning how to use the editor.</p>

<h2>Ruby Mode plus rhtml mode and ECB</h2>

<p>Working in Rails without some sort of file browser in the editor is not fun, so at the very least you want to <a href="/articles/1-setting-up-emacs-for-rails-development/part/6-setting-up-the-emacs-code-browser">setup the Emacs Code Browser</a>. I find that, combined with ruby-mode and rhtml-mode, is about all I need. If you have been following the tutorial so far, adding rhtml is simple. <a href="https://github.com/eschulte/rhtml">Download it</a> from github and copy all the &#39;.el&#39; files into your emacs includes directory (C:/emacs-23.2/includes in my setup). Then add the following to your _emacs file:</p>

<pre><code>;;; rhtml mode
(require &#39;rhtml-mode)
; put rhtml templates into rhtml-mode
(setq auto-mode-alist  (cons &#39;(&quot;\\.erb$&quot; . rhtml-mode) auto-mode-alist))
; put any rjs scripts into ruby-mode, as they are basically ruby
(setq auto-mode-alist  (cons &#39;(&quot;\\.rjs$&quot; . ruby-mode) auto-mode-alist))
</code></pre>

<p>This assumes you have added your includes directory as a load path in emacs by adding <code>(add-to-list &#39;load-path &quot;C:/emacs-23.2/includes&quot;)</code> right at the top of your _emacs file.</p>

<p>Rhtml-mode simply syntax highlights the html in your templates while adding ruby highlighting inside the erb blocks.</p>

<h2>More advanced Options</h2>

<p>When I first set up emacs and started Rails development, I tried out a few other options. These options provide all sorts of features that allow you to jump straight to a related view from the controller, or straight to a unit test, run the Rails server within emacs and so on. It was all just too much to learn, and I quickly fell back into using my simple setup above. These days, there are two options for turning emacs into a full blown &#39;Rails IDE&#39;:</p>

<ul>
<li>rails-mode</li>
<li>Rinari</li>
</ul>

<p>As I don&#39;t use either of these, I am not going to explore installing them.  The Emacs Wiki has a <a href="http://www.emacswiki.org/emacs/RubyOnRails">good page</a> that links to both the projects. If you have followed this tutorial until now, you will be more than skilled enough to get one of these tools working. There is also my <a href="http://sodonnell.wordpress.com/2007/07/03/emacs-and-rails/">old post</a> where I talk about setting up rails-mode, but not in great detail.</p>

<h2>Summary</h2>

<p>In 5 posts, this tutorial has demonstrated how to get Emacs installed, configure the daunting but <em>very</em> useful Emacs Code Browser, hack ruby-mode into shape and explore some options for Rails development.</p>

<p>If you have been following along and are using emacs 23.2 or newer, then your _emacs file should look something like the one below. Enjoy using emacs and Rails - there is plenty more to learn.</p>

<pre><code>(global-font-lock-mode 1)

(add-to-list &#39;load-path &quot;C:/emacs-23.2/includes&quot;)

; ruby mode is bundled by default in emacs 23.2.  This makes it work
; quite a lot better though.
(add-hook &#39;ruby-mode-hook
      (lambda()
        (add-hook &#39;local-write-file-hooks
              &#39;(lambda()
             (save-excursion
               (untabify (point-min) (point-max))
               (delete-trailing-whitespace)
               )))
        (set (make-local-variable &#39;indent-tabs-mode) &#39;nil)
        (set (make-local-variable &#39;tab-width) 2)
        (imenu-add-to-menubar &quot;IMENU&quot;)
        (define-key ruby-mode-map &quot;\C-m&quot; &#39;newline-and-indent) ;Not sure if this line is 100% right!
;            (require &#39;ruby-electric)
;            (ruby-electric-mode t)
        ))

;;; rhtml mode
(require &#39;rhtml-mode)

(setq auto-mode-alist  (cons &#39;(&quot;\\.erb$&quot; . rhtml-mode) auto-mode-alist))
(setq auto-mode-alist  (cons &#39;(&quot;\\.rjs$&quot; . ruby-mode) auto-mode-alist))


(load-file &quot;C:/emacs-23.2/plugins/cedet-1.0/common/cedet.el&quot;)

;; Enable EDE (Project Management) features
(global-ede-mode 1)

;; Enable EDE for a pre-existing C++ project
;; (ede-cpp-root-project &quot;NAME&quot; :file &quot;~/myproject/Makefile&quot;)


;; Enabling Semantic (code-parsing, smart completion) features
;; Select one of the following:

;; * This enables the database and idle reparse engines
(semantic-load-enable-minimum-features)

;; * This enables some tools useful for coding, such as summary mode
;;   imenu support, and the semantic navigator
(semantic-load-enable-code-helpers)

;; * This enables even more coding tools such as intellisense mode
;;   decoration mode, and stickyfunc mode (plus regular code helpers)
;; (semantic-load-enable-gaudy-code-helpers)

;; * This enables the use of Exuberent ctags if you have it installed.
;;   If you use C++ templates or boost, you should NOT enable it.
;; (semantic-load-enable-all-exuberent-ctags-support)
;;   Or, use one of these two types of support.
;;   Add support for new languges only via ctags.
;; (semantic-load-enable-primary-exuberent-ctags-support)
;;   Add support for using ctags as a backup parser.
;; (semantic-load-enable-secondary-exuberent-ctags-support)

;; Enable SRecode (Template management) minor-mode.
;; (global-srecode-minor-mode 1)

(add-to-list &#39;load-path &quot;C:/emacs-23.2/plugins/ecb-2.40&quot;)
(load-file &quot;C:/emacs-23.2/plugins/ecb-2.40/ecb.el&quot;)

(custom-set-variables
  ;; custom-set-variables was added by Custom.
  ;; If you edit it by hand, you could mess it up, so be careful.
  ;; Your init file should contain only one such instance.
  ;; If there is more than one, they won&#39;t work right.
 &#39;(ecb-layout-name &quot;left14&quot;)
 &#39;(ecb-options-version &quot;2.40&quot;)
 &#39;(ecb-primary-secondary-mouse-buttons (quote mouse-1--C-mouse-1))
 &#39;(ecb-source-path (quote (&quot;C:/rails&quot;)))
 &#39;(ecb-tip-of-the-day nil))
(custom-set-faces
  ;; custom-set-faces was added by Custom.
  ;; If you edit it by hand, you could mess it up, so be careful.
  ;; Your init file should contain only one such instance.
  ;; If there is more than one, they won&#39;t work right.
 )
</code></pre>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/8-rails-options-for-emacs</guid>
    </item>
    <item>
      <title>Emacs Ruby Foo</title>
      <link>https://appsintheopen.com/posts/7-emacs-ruby-foo</link>
      <description>
        <![CDATA[<p>Before worrying about Rails, you need to get Emacs to behave sanely with Ruby code.  That means it should indent it automatically and syntax highlight the code.</p>

<h2>Emacs 22.1</h2>

<p>For most languages, Emacs will syntax highlight your source code right out of the box. It does this by switching into a language specific mode when you open files with certain extensions. Back in Emacs 22 days, Ruby Mode was not present out of the box, but adding support is trivial.</p>

<p>In the Ruby source distribution, under the <a href="http://svn.ruby-lang.org/repos/ruby/tags/v1_9_2_0/misc/">misc directory</a>, you will find several files. The one called ruby-mode.el is the one that adds Ruby syntax highlighting support to Emacs.</p>

<p><a href="http://svn.ruby-lang.org/repos/ruby/tags/v1_9_2_0/misc/ruby-mode.el">Download it</a> and place it somewhere in your Emacs load path. On windows I created a directory called C:\emacs-22.1\includes and copied ruby-mode.el into it.</p>

<p>First ensure the &#39;includes directory&#39; is in the emacs load path by defining it in your _emacs file (at or near the top as it needs to be declared before you use anything in it):</p>

<pre><code>; directory to put various el files into
(add-to-list &#39;load-path &quot;C:/emacs-22.1/includes&quot;)
</code></pre>

<p>Now add the following lines to your _emacs:</p>

<pre><code>; loads ruby mode when a .rb file is opened.
(autoload &#39;ruby-mode &quot;ruby-mode&quot; &quot;Major mode for editing ruby scripts.&quot; t)
(setq auto-mode-alist  (cons &#39;(&quot;\\.rb$&quot; . ruby-mode) auto-mode-alist))
</code></pre>

<p>Now when you open a file with an rb extension, ruby-mode will automatically activate.</p>

<h2>Emacs 23.2</h2>

<p>I don&#39;t know when it happened, but in some later version of Emacs, ruby-mode started being bundled by default, and you can skip the steps in the section above. If you open a file with a .rb extension, ruby-mode should start automatically.</p>

<h2>Improving Ruby Mode</h2>

<p>In both Emacs 23 and 22, if you set things up as above there are a few annoyances with ruby-mode. Mainly, it does not auto-indent the next line after you hit return, meaning you have to press tab each time. To fix it, add the following block to your _emacs file:</p>

<pre><code>(add-hook &#39;ruby-mode-hook
      (lambda()
        (add-hook &#39;local-write-file-hooks
                  &#39;(lambda()
                     (save-excursion
                       (untabify (point-min) (point-max))
                       (delete-trailing-whitespace)
                       )))
        (set (make-local-variable &#39;indent-tabs-mode) &#39;nil)
        (set (make-local-variable &#39;tab-width) 2)
        (imenu-add-to-menubar &quot;IMENU&quot;)
        (define-key ruby-mode-map &quot;\C-m&quot; &#39;newline-and-indent) ;Not sure if this line is 100% right!
     ;   (require &#39;ruby-electric)
     ;   (ruby-electric-mode t)
        ))
</code></pre>

<p>This code makes ruby-mode much better, but I don&#39;t pretend to be an expert on it! Also note the two commented-out lines for ruby-electric-mode, which I will cover next.</p>

<h2>Adding Electric</h2>

<p>Before saying anything more about ruby-mode, we may as well install ruby-electric mode. I started off using this, but eventually turned it off as it annoyed me more than it helped. Basically, electric-mode defines &#39;electric keys&#39;. For example, if you type a quote, the closing quote will automatically be inserted. The same goes for {, [, and (. Typing &#39;def&#39; will result in the corresponding &#39;end&#39; being inserted too. It&#39;s easier if you play with it a little to see what I mean!</p>

<p>To install electric-mode, download the <a href="http://svn.ruby-lang.org/repos/ruby/tags/v1_9_2_0/misc/ruby-electric.el">ruby-electric.el</a> file from the Ruby misc directory into your include directory, and uncomment the two lines mentioning ruby-electric in the _emacs setting above.</p>

<p>Now, we have Ruby Syntax highlighting, electric keys, and auto-indenting, but we can do more ...</p>

<h2>Messed Up Your Indentation?</h2>

<p>If you let your code get into a bit of a twist, you can fix the indenting by pressing the TAB key while on the line – give it a try, moving over each line and pressing TAB to correct the indents!</p>

<h2>Comment a block</h2>

<p>If you want to comment out a large section of code, you can select it and then type ALT-x ruby-encomment-region and it will place a &#39;#&#39; symbol at the beginning of each line. ALT-x ruby-decomment-region does the opposite.</p>

<p>This is very specific to ruby-mode. A better way of commenting blocks is to highlight the region and use the ALT-; command (M-;).  This will work in many language modes, and will either comment or uncomment the block as appropriate.</p>
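<p>To make the effect concrete, here is a small made-up Ruby snippet. Selecting the two live lines and pressing M-; turns them into the commented form shown beneath them:</p>

```ruby
# Live code: sums a small array.
numbers = [1, 2, 3]
total = numbers.sum

# After selecting the two lines above and pressing M-;,
# comment-dwim rewrites them as:
# numbers = [1, 2, 3]
# total = numbers.sum
```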

<h2>Compile and Run Code in Emacs</h2>

<p>There are many ways to run code in Emacs. The easiest out-of-the-box way is to type ALT-x compile. Emacs will prompt for a compile command, which by default is ‘make -k’. Replace this with:</p>

<pre><code>ruby -w my_file_name.rb
</code></pre>

<p>The emacs window will split in two and the results of running the script will be in the new window.</p>
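<p>For example, here is a tiny throwaway script (the file name hello.rb is just an illustration) that you could run this way with the compile command ruby -w hello.rb:</p>

```ruby
# hello.rb - a minimal script to try with ALT-x compile.
def greet(name)
  "Hello, #{name}!"
end

# This output appears in the new compile window.
puts greet("Emacs")
```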

<p>If you are writing a text document that includes code snippets, and you want to ensure they are correct, you can select the region and type ALT-| to run it. Again, you need to enter the ruby command to run the code.</p>

<h2>Smarter Compile?</h2>

<p>All that typing ruby -w filename will start to annoy you after a while – the solution is <a href="http://perso.tls.cena.fr/boubaker/Emacs/">mode-compile</a> which has some brains built in. It can tell when you are editing a Ruby file (or many other types of file) and run it with the correct compiler/interpreter automatically. <a href="http://perso.tls.cena.fr/%7Eboubaker/distrib/mode-compile.el">Download mode-compile.el</a> and put it in your includes directory. As usual, add the following to your _emacs:</p>

<pre><code>; Install mode-compile to give friendlier compiling support!
(autoload &#39;mode-compile &quot;mode-compile&quot;
  &quot;Command to compile current buffer file based on the major mode&quot; t)
(global-set-key &quot;\C-cc&quot; &#39;mode-compile)
(autoload &#39;mode-compile-kill &quot;mode-compile&quot;
  &quot;Command to kill a compilation launched by `mode-compile&#39;&quot; t)
(global-set-key &quot;\C-ck&quot; &#39;mode-compile-kill)
</code></pre>

<p>Now you can compile/run code by typing CTRL-c c (and if your code enters a nasty infinite loop, you can kill it with CTRL-c k). The output will again appear in a new split window. If there are any compile errors in the output, you can move the cursor over them and hit return and emacs will jump to the offending line in your source file – pretty neat eh?</p>
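<p>If you want to see the error-jumping in action, a script with a deliberate warning will do. This one is purely illustrative; running it with CTRL-c c (which should invoke something like ruby -w for a Ruby buffer) ought to produce an &#39;assigned but unused variable&#39; warning you can jump to:</p>

```ruby
# warn_demo.rb - illustrative only; "ruby -w warn_demo.rb" warns
# about the unused local variable below.
def add(a, b)
  unused = 0  # deliberately unused so -w has something to flag
  a + b
end

puts add(2, 3)
```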

<p>That is enough for this post – next time I will add some Rails specific config.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/7-emacs-ruby-foo</guid>
    </item>
    <item>
      <title>Setting up the emacs code browser</title>
      <link>https://appsintheopen.com/posts/6-setting-up-the-emacs-code-browser</link>
      <description>
        <![CDATA[<h2>Update September 2014</h2>

<p>Installing ECB in 24.3 emacs is much simpler than described here - <a href="/posts/35-emacs-24-3-setup">see my updated post</a>.</p>

<p>One of the key goals of this tutorial is to setup a file browsing pane in Emacs, similar to what <a href="http://www.macromates.com/">Textmate</a> offers. For Rails development this is essential as each project requires a lot of different files.</p>

<h2>The Emacs Code Browser</h2>

<p>Emacs has been around so long that it has plug ins for just about every use case, and browsing files is no exception.  To create a Textmate style file browsing pane, all you need to do is install the <a href="http://ecb.sourceforge.net/">Emacs code browser</a> (ECB).  This can be a bit daunting, as there are so many configuration options and settings required to get it working well. Fortunately, this tutorial will get you something pretty useful working very quickly.</p>

<h2>Downloads</h2>

<p>You need to download two packages to set up ecb:</p>

<ul>
<li><a href="https://sourceforge.net/projects/cedet/files/cedet/cedet-1.0.tar.gz/download">CEDET</a> - This is a requirement of ecb, don&#39;t worry about it, just install it.</li>
<li><a href="http://sourceforge.net/projects/ecb/files/ecb/ECB%202.40/ecb-2.40.tar.gz/download">ecb</a> - The code for ecb itself.</li>
</ul>

<p>Now that you have the two downloads, you need to put them somewhere. On Windows, I created a directory called plugins inside my emacs directory, giving me C:\emacs-23.2\plugins</p>

<p>On the Mac, I created a directory called .emacs_includes/plugins in my home directory.</p>

<p>I am pretty sure Emacs gurus would tell me these files should go somewhere else, but this works for me and the location is not really important. Extract each of the downloads into the plugins directory.</p>

<h2>Compiling CEDET</h2>

<p>There are a few ways of compiling CEDET, but the most portable involves doing it inside emacs itself.</p>

<p>Assuming CEDET has been extracted to <code>c:\emacs-23.2\plugins\cedet-1.0</code>, open emacs and then open the file <code>c:\emacs-23.2\plugins\cedet-1.0\cedet-build.el</code>. Don&#39;t edit anything in this file, just run the following two emacs commands:</p>

<ul>
<li>M-x eval-buffer</li>
<li>M-x cedet-build-in-default-emacs</li>
</ul>

<p>(M-x means press and hold the alt key, and then press x and release both keys.  Then enter the text &#39;eval-buffer&#39; which will appear at the very bottom of the emacs window, then hit return).  The second command will take several minutes to run and it will open a new instance of emacs while it does it. When I ran it, lots of warnings scrolled past, but I ignored them and things worked fine. Emacs also prompted to answer y or n for creating a new directory. Just answer y. When the compilation is complete, you should see a window that has &#39;done&#39; at the bottom of it. At this point just close emacs as the compilation is complete.</p>

<p>Note that this method will not work in a unix terminal, however the solution is easy. Just ensure the correct emacs is on your path, switch into the cedet directory and enter the command <code>make</code>, which will get everything byte compiled.</p>

<p>Installing Emacs plugins generally involves putting some files in a known location, and telling Emacs about them using the .emacs file. Now you need to tell emacs about CEDET. Open your .emacs file (or _emacs on windows) and add the following (this is a Windows example, edit the paths for OS X):</p>

<pre><code>; allows syntax highlighting to work
 (global-font-lock-mode 1)

;; Load CEDET.
;; This is required by ECB which will be loaded later.
;; See cedet/common/cedet.info for configuration details.
(load-file &quot;C:/emacs-23.2/plugins/cedet-1.0/common/cedet.el&quot;)

;; Enable EDE (Project Management) features
(global-ede-mode 1)

;; * This enables the database and idle reparse engines
(semantic-load-enable-minimum-features)

;; * This enables some tools useful for coding, such as summary mode
;;   imenu support, and the semantic navigator
(semantic-load-enable-code-helpers)
</code></pre>

<p>After saving the .emacs file, restart emacs and hopefully it should load without displaying any config errors.</p>

<h2>Installing ECB</h2>

<p>Installing ecb is pretty simple compared to CEDET.  All you have to do is add two more lines to your .emacs:</p>

<pre><code>(add-to-list &#39;load-path &quot;C:/emacs-23.2/plugins/ecb-2.40&quot;)
(load-file &quot;C:/emacs-23.2/plugins/ecb-2.40/ecb.el&quot;)
</code></pre>

<p>Once again, restart emacs and start ecb by entering the command:</p>

<pre><code>M-x ecb-activate
</code></pre>

<p>The Emacs window will change, adding a new section with 4 windows on the left side. In this default mode, the top window shows directories, next files, then history and finally methods in the current file.</p>

<p>To close ECB enter M-x ecb-deactivate</p>

<p>I changed this default layout to show only a combined directory and file listing plus the open file history, by adding the following lines to my .emacs:</p>

<pre><code>(custom-set-variables
;; custom-set-variables was added by Custom.
;; If you edit it by hand, you could mess it up, so be careful.
;; Your init file should contain only one such instance.
;; If there is more than one, they won&#39;t work right.
&#39;(ecb-layout-name &quot;left14&quot;)
&#39;(ecb-layout-window-sizes (quote ((&quot;left14&quot; (0.2564102564102564 . 0.6949152542372882) (0.2564102564102564 . 0.23728813559322035)))))
&#39;(ecb-options-version &quot;2.40&quot;))
</code></pre>

<p>There really are tons of options for ecb, and I only know a handful of them. First of all, you will want to add the location of your code files so they are quickly accessible in ECB:</p>

<pre><code>&#39;(ecb-source-path (quote (&quot;d:/myRailsProject&quot; &quot;d:/useful scripts&quot;)))
</code></pre>

<p>By default, ecb opens files using the middle mouse button. If you want to use the left mouse button, add the following option:</p>

<pre><code>&#39;(ecb-primary-secondary-mouse-buttons (quote mouse-1--C-mouse-1))
</code></pre>

<p>I also wanted to get rid of the annoying tip-of-the-day and use ascii style directory listings:</p>

<pre><code>&#39;(ecb-tip-of-the-day nil)
&#39;(ecb-tree-buffer-style (quote ascii-guides)))
</code></pre>

<p>If you go for all the options I have set, your .emacs should look like:</p>

<pre><code>; allows syntax highlighting to work
 (global-font-lock-mode 1)

;; Load CEDET.
;; This is required by ECB which will be loaded later.
;; See cedet/common/cedet.info for configuration details.
(load-file &quot;C:/emacs-23.2/plugins/cedet-1.0/common/cedet.el&quot;)

;; Enable EDE (Project Management) features
(global-ede-mode 1)

;; * This enables the database and idle reparse engines
(semantic-load-enable-minimum-features)

;; * This enables some tools useful for coding, such as summary mode
;;   imenu support, and the semantic navigator
(semantic-load-enable-code-helpers)

(add-to-list &#39;load-path &quot;C:/emacs-23.2/plugins/ecb-2.40&quot;)
(load-file &quot;C:/emacs-23.2/plugins/ecb-2.40/ecb.el&quot;)

(custom-set-variables
;; custom-set-variables was added by Custom.
;; If you edit it by hand, you could mess it up, so be careful.
;; Your init file should contain only one such instance.
;; If there is more than one, they won&#39;t work right.
&#39;(ecb-layout-name &quot;left14&quot;)
&#39;(ecb-layout-window-sizes (quote ((&quot;left14&quot; (0.2564102564102564 . 0.6949152542372882) (0.2564102564102564 . 0.23728813559322035)))))
&#39;(ecb-options-version &quot;2.40&quot;)
&#39;(ecb-source-path (quote (&quot;d:/myRailsProject&quot; &quot;d:/useful scripts&quot;)))
&#39;(ecb-primary-secondary-mouse-buttons (quote mouse-1--C-mouse-1))
&#39;(ecb-tip-of-the-day nil)
&#39;(ecb-tree-buffer-style (quote ascii-guides)))
</code></pre>

<p>There is a lot to configure in ECB if you wish. All the details are available in the packaged manual, which you can view by typing M-x ecb-show-help.</p>

<p>Having used ecb for a few years now, the only commands I ever use are:</p>

<ul>
<li>Jump to the directory window with CTRL-c . gd (i.e. type ctrl and c together, release and press &#39;.&#39;, release and press &#39;g&#39; then &#39;d&#39;)</li>
<li>Jump to the history window CTRL-c . gh</li>
<li>Jump to the last window you were in CTRL-c . gl</li>
<li>Jump to the first editor window CTRL-c . g1</li>
</ul>

<p>The directory browser can be controlled without using the mouse too – just use the arrow keys and enter – give it a go!</p>

<p>As you can see, setting up ECB is not difficult, and it&#39;s well worth it in my opinion.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/6-setting-up-the-emacs-code-browser</guid>
    </item>
    <item>
      <title>Installing Emacs on Windows and OS X</title>
      <link>https://appsintheopen.com/posts/5-installing-emacs-on-windows-and-os-x</link>
      <description>
        <![CDATA[<h2>OS X</h2>

<p>If you are on a Mac, there are a couple of choices when it comes to emacs. You can go for <a href="http://aquamacs.org/">Aquamacs</a>, which is emacs, but polished up to make it more Mac like. In the past I recommended using Aquamacs, but I have since changed my mind and think it&#39;s better to stick with <a href="http://emacsformacosx.com/">plain old emacs</a>. Whatever you choose, just download it and install it like any other OS X application.</p>

<h2>Windows</h2>

<p>On Windows it&#39;s almost as simple. The current version of Emacs is 23.2, so grab the download from <a href="http://ftp.gnu.org/pub/gnu/emacs/">here</a> (for the current version, the file you want is emacs-23.2-bin-i386.zip), and extract it into any folder on your machine. I chose C:\, which placed Emacs in C:\emacs-23.2</p>

<h2>.emacs</h2>

<p>Emacs is highly customisable, and all the settings relevant to your setup are stored in a file called .emacs or _emacs. Emacs searches for this file in your home directory, which is fine on OS X and Linux, but what about Windows where you don’t really have a home directory?</p>

<p>The trick is to create an environment variable called HOME that contains the location of a directory you wish to use. A sensible place to store your _emacs is in the Application Data folder, normally located at C:\Documents and Settings\<em>username</em>\Application Data</p>

<p>To create the environment variable:</p>

<ol>
<li>Right Click on My Computer and select properties</li>
<li>Click on the Advanced tab</li>
<li>Click the environment variable button at the bottom</li>
<li>Click the new button under the User variables for username pane</li>
<li>Enter HOME as the variable name, and the location of the directory that will contain your _emacs file as the value</li>
</ol>

<h2>Playtime</h2>

<p>Now it&#39;s time to fire up emacs. On the Mac, run the Emacs application you have just installed; on Windows, double click on C:\emacs-23.2\bin\emacs.exe</p>

<p>The first thing you will notice is that Emacs behaves just like most other text editors – you can type stuff, open, close and save files using the menus etc. However, to start on the road to becoming a power user, you have to start learning the keyboard commands of which there are many!</p>

<p>The best way to learn is to work your way through the built-in tutorial. While reading it, you will quickly learn that most Emacs commands are accessed by holding the CTRL or ALT key and issuing some number of key presses. To open the tutorial, press and hold CTRL and type &#39;h&#39;. Release all the keys and type &#39;t&#39; – the tutorial will then open, and it can teach you much more than I can!</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/5-installing-emacs-on-windows-and-os-x</guid>
    </item>
    <item>
      <title>Choosing a text editor for Rails development</title>
      <link>https://appsintheopen.com/posts/4-choosing-a-text-editor-for-rails-development</link>
      <description>
        <![CDATA[<p>A few years ago I wrote a series of posts called <a href="http://sodonnell.wordpress.com/the-emacs-newbie-guide-for-rails/">the emacs newbie guide to rails</a> about getting emacs up and running on OS X, primarily for Ruby on Rails development.  The posts turned out to be quite popular, as there were no other similar tutorials around at the time. As with all things in software, things change, so I have reproduced the posts here with a few updates and a few more things I have learned along the way.</p>

<h2>Textmate</h2>

<p>Anyone who has been around the Rails scene for a while knows that a lot of Rails developers use OS X, and <a href="http://macromates.com/">Textmate</a> seems to be the editor of choice on that platform. For me, the problem is that Textmate is Mac only, and I often have to edit files on Windows, Linux and in terminal sessions. Learning a new editor well is quite a lot of work, and if you are going to take the time to learn one, it had better be available on any OS you need to edit files on. For that reason I quickly gave up on Textmate, but not without falling in love with its file browsing pane.</p>

<h2>Enter Emacs</h2>

<p>If you want to learn an editor that is available everywhere, then your choices boil down to Vim or Emacs. I decided on Emacs. Vim is probably just as good, but I had to choose something and Emacs won.</p>

<h2>The Goals</h2>

<p>Emacs out of the box is nowhere near as user friendly as Textmate, and a file browsing pane is non-existent. So I set about learning enough about Emacs to get an editor that:</p>

<ul>
<li>Provides a file browser pane like in Textmate</li>
<li>Ruby (and other languages) syntax highlighting</li>
<li>Automatic code indenting and automatic closing braces, quotes, if statements would be nice</li>
<li>Compile/Run code inside the editor</li>
<li>Do all of this on Windows, OS X and inside a decent terminal</li>
</ul>

<p>Other things that would be nice to have:</p>

<ul>
<li>Rails code snippets like in Textmate</li>
<li>Spell check as you type</li>
<li>Customisable color schemes to suit tired eyes</li>
</ul>

<p>If you follow all the parts of this tutorial, you will quickly have Emacs set up to meet the goals above, and a little bit more.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/4-choosing-a-text-editor-for-rails-development</guid>
    </item>
    <item>
      <title>Sell Yourself</title>
      <link>https://appsintheopen.com/posts/3-sell-yourself</link>
      <description>
        <![CDATA[<p>Lets face it, you could come up with 101 app ideas, and build many of them to a high standard only to find a lack of customers or need for your product.  If this happens, and  your money is running out, and you need to go get a job, then it would be good to have something to point to that shows off the skills you have learned.  Building those failed apps was not a waste of time if you learned new skills doing it, so you should really show them off, customers or revenue or not.</p>

<h2>Blog About it</h2>

<p>The first &#39;App&#39; any software entrepreneur should consider creating is a place on the internet to sell themselves, which is what this website is for me (although this is only the second post, so I have a way to go yet). It doesn&#39;t have to be anything special; any old blog will do. Just link to your work and talk intelligently about your development problems and ideas. Try to make it the first result in Google when someone types in your name.</p>

<h2>Knowing What Is Out There</h2>

<p>Even by building this <em>very</em> simple blogging website, I have demonstrated a certain level of proficiency with Rails 3, and all the code is mine, so I can share it in an interview if I am asked. Even simple things, like knowing about <a href="http://feedburner.com">Feedburner</a>, <a href="http://disqus.com">Disqus</a>, <a href="http://www.google.com/analytics/">Google Analytics</a> and <a href="http://www.blueprintcss.org/">Blueprint</a> show I know what is going on in the internet today, and know how to exploit these tools to get more done faster.</p>

<p>Who knows what opportunities your blog will bring, so if you don&#39;t have one, maybe it&#39;s time to start spending an hour a week on it and see what happens.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/3-sell-yourself</guid>
    </item>
    <item>
      <title>The new blog is ready</title>
      <link>https://appsintheopen.com/posts/2-the-new-blog-is-ready</link>
      <description>
        <![CDATA[<p>Well, another new blog.  Hopefully I will stick with it longer than the last two, where I pretty much ran out of steam after only a few posts.</p>

<h1>Techy Details</h1>

<p>This time, I decided to write my own super simple blogging engine in <a href="http://rubyonrails.org">Rails 3</a>.  While it&#39;s not much of a challenge to build a blogging engine, there are some interesting problems, like caching pages, expiring caches and a password protected admin area that help you to get to grips with the Rails 3 framework.</p>

<p>I short circuited the biggest challenge of all, which is making it look &#39;good enough&#39; by borrowing a design from somewhere else (I don&#39;t remember where now) and using the <a href="http://www.blueprintcss.org/">Blueprint</a> grid layout, which gives pretty good default typography settings.</p>

<p>Another area I side stepped was blog comments - I outsourced that to <a href="http://disqus.com/">disqus</a>, as dealing with spam and captchas was more than I wanted to handle.</p>

<p>Keeping with the outsourcing trend, I intend to wire Google Analytics into the site to track visits and use Feedburner to track RSS stats once I get a few minutes.</p>

<p>One other thing to note - <a href="http://godaddy.com">Go Daddy</a> offers very cheap domains (this one cost me less than £1) if you search for voucher codes on google before paying. However, their website is just awful. Awful.</p>

<h1>The Theme</h1>

<p>The idea of this blog is to provide a place to post ideas for web applications or businesses I come up with, a place for general software related posts and articles, and more generally a place to capture all my bits and pieces on the web in one place. Who knows where it will go, or if it will be interesting; only time will tell.</p>
]]>
      </description>
      <guid>https://appsintheopen.com/posts/2-the-new-blog-is-ready</guid>
    </item>
  </channel>
</rss>
