Bracing against the wind

Get block size bash

noreply@blogger.com (Erik Aronesty) — Tue, 16 Jun 2015 17:13:00 +0000

Probe for the filesystem block size. Useful when you probably have no other tools available, such as when trying to optimize SQLite paging at run time.

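# Print the allocated size in bytes of a tiny file in $1 (default: current directory).
# Needs GNU ls for --block-size; the subshell keeps the cd from leaking to the caller.
function block_size() {(
   if [ -n "$1" ]; then cd "$1" || exit 1; fi
   echo 1 > block_size.tmp
   # -s prints the allocated size; --block-size 1 reports it in bytes
   ls --block-size 1 -s block_size.tmp | cut -f 1 -d ' '
   rm -f block_size.tmp
)}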
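
Where Python happens to be available, the same number can usually be read without writing a probe file. This is only a minimal sketch and not part of the original snippet: it assumes a POSIX system and uses os.statvfs with its f_frsize field (the fundamental block size), which normally matches what the ls probe reports but may differ on some filesystems.

```python
import os

def block_size(path="."):
    """Return the fundamental block size, in bytes, of the filesystem holding path."""
    # statvfs asks the filesystem directly, so no temporary file is created
    return os.statvfs(path).f_frsize

if __name__ == "__main__":
    print(block_size())
```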

Viral load

noreply@blogger.com (Erik Aronesty) — Wed, 06 May 2015 08:35:00 +0000

Quintiles put the poster I made up on their site: Identifying viral load in tumors using nextgen sequencing. Thanks Kim!

Gene therapies I want to see developed

noreply@blogger.com (Erik Aronesty) — Wed, 06 Aug 2014 17:58:00 +0000

Gene therapies I'd love to see developed/tested in my lifetime:

1. daf-2 - Downregulation increases lifespan in C. elegans

Next steps: RNAi and/or CRISPR to cause daf-2 homolog (IGF1R) downregulation in mice + tests on resulting phenotype.

2. PEPCK-C - Beta-actin promoter linked to the PEPCK-C gene shown to improve longevity, intelligence and metabolism in mice.

Next steps: Lentiviral transfection vector as a longevity vaccine (develop in mouse ageing model with human gene)

3. m-CAT - Mitochondrial targeted catalase : upregulation shown to improve cancer resistance and longevity

Next steps: Develop transfection vector as a cancer vaccine (develop in mouse ageing model with human gene)

4. HAS2 - Molecular weight and production of hyaluronan decrease over time. Organisms with large amounts of HMWHA live longer and resist cancer.

Next steps: Test regular intravenous injection of HMWHA in rat models of cancer and aging. Develop gene therapy intervention to increase HAS2 expression.

5. TERT - TERT+ mice live longer, but also produce more tumors. HAS2, m-CAT and PEPCK-C all interfere with tumorigenesis. The upregulation of TERT while simultaneously increasing m-CAT and HAS2 expression should have a strong synergistic effect.

Next steps: Develop gene therapy intervention to increase TERT expression, matching prior experiments, but combine with other interventions.

Clustered Object Encoding 2.0

noreply@blogger.com (Erik Aronesty) — Tue, 24 Jun 2014 20:38:00 +0000

Without going into detail about "the cloud" and the exaggerations and mysticism that accompany it, here are some definitions that it took me a while to truly understand:

Object storage: Some manky, idiosyncratic API that allows developers to store "stuff". Avoids dealing with things like partial writes, seeks, memory maps, etc. Generally space wasteful by design. Not a place you'd want to put anything you care about. Yet, secretly, is probably the location of everything you care about. Your bank probably keeps your PDF statements in something like this.

Block storage: Like a horse, one can mount this kind of storage... but only for exclusive access. No shared, lockable access, no parallel reads. Essentially it's the virtual hard drive for your virtual machine from which you derive your virtual paycheck.

Clustered storage: Typically one of the above, but your data is now copied ad nauseam. Someone probably didn't stripe it. The things you expect from clustered storage, like striped files, full POSIX support, parallel writes, and erasure coding don't work in your version of the software, but every whitepaper you read refers to them.

POSIX, Erasure Coded, Clustered, Storage (PECCS): 1. This is what you actually want if you run your own data center. Used in a sentence: "Wow, that guy's got PECCS". 2. A mythical beast. 3. Maybe some company named "NextEMCApp" does it. Maybe some guy named Xavier in Spain got it to work once, but then he got hired by [SOME BIG COMPANY] and nobody heard from him again.

Variable Length, Zipf's Law and Density Functions

noreply@blogger.com (Erik Aronesty) — Wed, 11 Jun 2014 16:17:00 +0000

When coding, variables should be named based on both frequency and the length of their context. Variables, structures and other constructs should have a name length that follows Zipf's law as it applies to word length and frequency. Please glance at those references before continuing.

When these laws of linguistic use are ignored, readability is demonstrably diminished.

Consider a variable, the hated "x", used 100 times in a file. If "x" is used 100 times in 40 lines of mathematical manipulation, such as in a digest function, this would be acceptable... the high frequency of use strictly precludes a longer name. If, however, it was spread out across 10000 lines of code, a far longer name would be needed.

Of course, suppose that function was embedded in 10k lines of code. Does this change the frequency of x? No. One important rule is that, for the purposes of linguistic calculation, the frequency of any variable's usage should be computed relative to the scope of its use.

As a demonstration, cluttering a short for i=1 to 10 ... loop with for iterationIndex = 1 to 10, can cause code to be nigh unreadable.

Operation can often be obscured by foolish verbosity:


for iterationIndex = 1 to 10
    countArray[iterationIndex]=sizeArray[iterationIndex]*factorArray[iterationIndex] + computeDensity(histogram,iterationIndex)

Versus


for i = 1 to 10
    countArray[i]=sizeArray[i]*factorArray[i] + computeDensity(histogram,i)

Likewise stuffing common prefixes on sets of variables in order to lengthen them can hide their meaning in a limited scope:


for i = 1 to 10
    wordSuffixCountArray[i]=wordSuffixSizeArray[i]*wordSuffixFactorArray[i] + ComputeDensityForWordSuffix(currentWordSuffixHistogram,i)

It's clear that a minimum set of descriptive, distinct terms within the scope of use should, instead, be used.

Recently, software engineering pundits seem to take the stance that there is no such thing as a variable with too long a name. This is false. Numerous linguistic studies on readability, word length and frequency bear that out and so, of course, does common sense. Maximum clarity can be achieved in coding languages by using the same types of statistical distributions that occur in natural languages.

Capricious Job Scheduling

noreply@blogger.com (Erik Aronesty) — Fri, 30 May 2014 19:16:00 +0000

Using a perfectly fair scheduler, I launch 5 processes with the following pattern:


 IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC
 IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC
 IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC
 IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC
 IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC

Where 'I' is one second of I/O intensive operation, and 'C' is one second of CPU intensive operation. The result of a perfectly fair scheduler is that there will be periods of high I/O contention and high CPU contention, as the 5 processes are "fairly" switched between, and each keeps pace with the others.

Using a more "capricious" scheduler, where processes may "get lucky" for a while, my processes will be staggered:


 IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC      lucky (goes first)
           IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC unlucky (goes last)
          IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC
   IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC
  IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC

It's clear that "everyone wins" when this method is chosen.

My argument for "capricious" scheduling is that a system cannot know in advance whether something will be an I or a C. Programs can be profiled over time (modern operating systems don't do this... but probably should), but most program execution paths vary to the point where profiling may be impractical. Accordingly, a scheduler should be, to some extent, random.
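
The difference between the two diagrams can be put into numbers with a toy model. The sketch below is an illustration only: it assumes every process advances exactly one pattern character per second no matter how contended the machine is (a real scheduler would slow them down), and the start delays are simply read off the staggered diagram. The function names are made up for this example.

```python
# One second per character, copied from the diagram above: I = I/O bound, C = CPU bound
PATTERN = "IIIIIIIIICCCCCCCCCCCCIIIIIIIIIIIIIICCCCCCCCCCCCC"

def demand(offsets, pattern=PATTERN):
    """Return a list of (io_count, cpu_count) tuples, one per second."""
    horizon = max(offsets) + len(pattern)
    timeline = []
    for t in range(horizon):
        # Phase of each process at second t, skipping those not started or already finished
        phases = [pattern[t - start] for start in offsets if 0 <= t - start < len(pattern)]
        timeline.append((phases.count("I"), phases.count("C")))
    return timeline

def fully_contended_seconds(offsets, pattern=PATTERN):
    """Count the seconds in which every process wants the same resource."""
    n = len(offsets)
    return sum(1 for io, cpu in demand(offsets, pattern) if max(io, cpu) == n)

if __name__ == "__main__":
    lockstep = [0, 0, 0, 0, 0]     # perfectly fair: all five keep pace
    staggered = [0, 10, 9, 2, 1]   # start delays read off the second diagram
    print("lockstep: ", fully_contended_seconds(lockstep))
    print("staggered:", fully_contended_seconds(staggered))
```

In lockstep, every second of the run has all five processes fighting over the same resource; with staggered starts that can only happen once the unluckiest process has begun, so the count is necessarily lower.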

New version of bowtie-1.0

noreply@blogger.com (Erik Aronesty) — Fri, 18 Apr 2014 16:35:00 +0000

Since the Ben Langmead version has been languishing a bit, I forked the repo and incorporated most of the pull requests (with some edits) at https://github.com/earonesty/bowtie, adding support for compiling on Apple, stream i/o, and gzip.

Latest Printable Firearm Kit From DefDist

noreply@blogger.com (Erik Aronesty) — Thu, 10 Apr 2014 13:38:00 +0000

Printable automatic weapons? DIY genome editing?

We've come to a point in our evolution as a species where it's not possible to limit the "things" people can obtain. What's more important is creating a world where people have no reason to do harm.

"Well there's no way to guarantee that". Sure. But we allow corporations to draft zoning laws so that people have to drive further to work and so that public transportation remains a failure in most cities. We also allow gross wealth inequity. We try to con mothers into abandoning breast feeding (yes, this still happens, I witnessed it a couple weeks ago), and when that doesn't work we make it as difficult as possible for families to spend time with their kids.

When the majority of the population is stuck in a perpetual state of inadequacy, when the society is structured around things like "employment" and not "enjoyment", it's not surprising that violence seems a means to an end.

A more radical transformation of society, where from birth a nation's citizens feel nurtured, loved and supported would probably go further to prevent violence than "gun laws" possibly could.

If only 10% of the people in America went to work, or if we worked only 10% of the time, we could, with the technology we have, produce all the food, shelter and goods needed for every citizen to live a decent middle class lifestyle. But the focus is not on creating awesome lifestyles. The focus is on more expensive stuff for the few stakeholders that got in early in the pyramid scheme that is capitalism.

We continue to blame the weapon, and not the society that produces its wielder.

You can download printable firearms here: http://kickass.to/defdist-defcad-mega-pack-4-5-otacon-zipped-t7677520.html

bitk.in, mojolicious and liteapi

noreply@blogger.com (Erik Aronesty) — Tue, 11 Feb 2014 09:09:00 +0000

I wanted to build a site that demonstrated how easy it was to take bitcoin and litecoin. After building LiteAPI in mod_perl and running into issues with performance and those wacky implicit closures, I decided to try an event driven framework, Mojolicious.

The site is https://bitk.in, and it allows you to spend bitcoin & litecoin to get fair-valued gift cards at big online stores like Amazon, Walmart, etc.

It uses the API from http://liteapi.org for Litecoin and Blockchain's wallet API for Bitcoin.

Bitcoin/litecoin integration was trivial. Mojolicious allowed me to spend more time focusing on integration with vendors and payment systems, and just about no time worrying about HTTP.

To be honest, I spent more time thinking about the color than the back end.

UTF8 Encoding and Postgres Dump

noreply@blogger.com (Erik Aronesty) — Tue, 03 Dec 2013 13:52:00 +0000

If you back up a database from one version of postgres using pg_dumpall, then restore to another version, it's very possible that the strictness or some other aspect of the UTF8 library has changed. The end result is that you will see an error "invalid character for encoding" while restoring. The utility "iconv" can help, and it was really hard to find the answer online, which is why I'm reposting for my own sanity:


gunzip -c dumpall.gz | iconv | sudo -u postgres psql > restore.errs 2>&1

Miraculously, iconv worked without options, cleaning up version incompatibilities, while maintaining all the data... even the funny French city names. If you're changing encodings while also upgrading, that's fine too. iconv takes a "from" and "to" argument, so log in to the source db and destination db, and use:


SHOW client_encoding;

Then use the results as the "-f FROM" and "-t TO" arguments to iconv. No conversion is perfect, but iconv seems to do as good a job as possible.
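
To make the "-f FROM" and "-t TO" step concrete, here is a rough Python sketch of the kind of transcoding iconv performs on each chunk of the dump. The function names and the sample city are illustrative assumptions, not part of the original pipeline, and Python's codec names do not necessarily match what SHOW client_encoding prints.

```python
def is_valid(data: bytes, encoding: str = "utf-8") -> bool:
    """Report whether data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from src to dst, like iconv -f src -t dst."""
    return data.decode(src).encode(dst)

if __name__ == "__main__":
    # "Besancon" with a c-cedilla, stored as Latin-1 bytes
    raw = "Besan\u00e7on".encode("latin-1")
    print(is_valid(raw))                      # these bytes are not valid UTF-8
    fixed = transcode(raw, "latin-1", "utf-8")
    print(is_valid(fixed))
```

A dump that fails the validity check in the destination encoding is the kind of input that produces the restore error described above.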

Secure password generator

noreply@blogger.com (Erik Aronesty) — Sat, 23 Nov 2013 04:26:00 +0000

Free website for generating passwords so you don't have to memorize a lot of them, and you don't have to re-use them. Secure, simple, and written in JavaScript: https://passhash.com/

How to forget a specific memory

noreply@blogger.com (Erik Aronesty) — Mon, 11 Nov 2013 17:16:00 +0000

The specific memory erasing kinase inhibitor "Zip" works in a lab on mice, but it has to be injected into the brain during a recall event.

So, the memory forgetting pill isn't available, but there is Chelerythrine, another kinase inhibitor, that's the natural product of "Zanthoxylum clava-herculis", aka southern prickly ash or pepperwood. This compound has already been shown to be an effective memory disruptor (References: http://bit.ly/HQ6jSt, http://bit.ly/1fwwys5). And it's bioavailable, since chewing pepperwood bark (readily available online) quickly numbs your mouth, and is an ancient herbal pain remedy.

PTSD is another kind of pain.

One treatment for PTSD, CISD, is known to be effective, and is sometimes done in conjunction with beta-blockers.

Chewing pepperwood bark during a recall event might work even better - especially since propranolol (the current unproven drug of choice) has had low/no significant reported efficacy in placebo controlled trials. MDMA was shown to be very effective (83% respond to treatment this way), but since it's very illegal, it's a lot harder to work with.

For fun I'm going to try this: chewing the bark until I feel strong numbing effects in the mouth and overall relaxation. Then having someone coach me to recall something painful or annoying. (Yes MDMA would be more fun, but I can click and buy that bark.)

ASHG Poster on Viral Presence in Lymphoma

noreply@blogger.com (Erik Aronesty) — Mon, 28 Oct 2013 21:04:00 +0000

Abstract:

We used RNA-Seq to detect viral homologs in tumor sample expression data obtained from The Cancer Genome Atlas (TCGA). We compared normal tissue viral expression to cancer tissue expression using sequence alignment to a repeat-masked virome, quantification and differential expression. We validated the viral homologs using pileup+scaffold assembly and BLAST of high-complexity contigs.

Conclusions:

Roughly 30% of lymphoma cancer samples are associated with EBV, and the majority are associated with either EBV, XMRV or SMRV. The use of controls is essential for removing ubiquitous viruses and viral homologs from results. It was difficult to obtain information about public samples. Each TCGA contributor used different processing methods. Many samples (thyroid, ovarian) did not have whole FASTQs (an artifact of using BAM as the primary submission format). Although there are often stringent sample submission guidelines for large-scale projects, there appear to be insufficient guidelines for sequencing and bioinformatics.

Link to poster:

Click to view poster from ASHG

grun Job Scheduler moved to ZMQ

noreply@blogger.com (Erik Aronesty) — Mon, 03 Jun 2013 15:30:00 +0000

As most people in cluster computing know, there really isn't a simple and "really open source" solution out there for job scheduling. Oracle SGE, LSF and others are massive things, with over 100k lines of code, and extraordinarily complex configuration for things ranging from MPI support to Kerberos. And yet they lack simple features (script plugins for configuration) that would make them more versatile.

grun was written to be an "extremely lightweight" and yet big-featured job scheduler. The early version was not much more than "ssh to remote host, run job, wait for response", while logging and keeping track of resources. It's evolved to use a TCP messaging system allowing the compute nodes, queue nodes, and clients to communicate. By v 1.0 the plan is to have better support for arbitrary metrics, and better handling of priorities.

Going from 0.8 to v 0.9, I decided to try using the zeromq library instead of TCP. At first it was hard to remember that you really don't need to worry about things like sending to a socket you just created, even with no one on the other end.

The net result of the ZMQ port:

- speed improved
  - implicit perl moved to optimized c
  - built-in multithreading takes better advantage of cpu
- ability to stop/restart any queue without losing messages
- improved reliability of message delivery
- improved code organization orientated around messaging
- 20% smaller code base, because we removed:
  - all "double checks" to see if connections are there
  - code that "breaks up" large messages
  - all issues with blocking/vs nonblocking i/o
  - the whole "select" loop complexity

ZMQ is not perfect (yet), but it was an overall improvement over straight TCP. Because of the forking needed to launch jobs, I had to do some fiddling with dup'ed file descriptors to prevent zmq from acting wonky. The learning curve was worth it. I doubt I'll be using TCP again, especially since ZMQ has package support with most Linux distributions.

Convert absolute to relative links

noreply@blogger.com (Erik Aronesty) — Fri, 31 May 2013 02:59:00 +0000

ln-abs2rel.pl


usage: ./ln-abs2rel.pl [-noexec] [-recursive]  ... 

Converts absolute to relative links.

Use -noexec to see what would be done.
Use -recursive to do this for all subdirs.

FIX: UBUNTU stops logging to /var/log/kern.log

noreply@blogger.com (Erik Aronesty) — Fri, 03 May 2013 19:55:00 +0000

At some point the old kernel logger got upgraded to rsyslogd. When this happened, the logs owned by "messagebus" were left owned by "messagebus". Changing ownership of these logs to "syslog" and restarting rsyslogd is sufficient to fix it.

Baseball Warm-up Exercise for Kids : "Base Tag"

noreply@blogger.com (Erik Aronesty) — Mon, 18 Mar 2013 15:43:00 +0000

A lot of times teams "warm up" by throwing and running while a coach tells them where to throw and when to run. This warm-up accomplishes that, but it's fun because it’s in a “game” structure, and a lot of the kids I've worked with are motivated by that. Plus, once they know the rules, they will play without supervision for quite a while - I’ve found you have to stop them or they will tire out.

Setup:

2 "fielders" with gloves and a ball stand a baseline-width apart
A 3rd kid is the “runner”, standing on base with the fielder who has the ball

Rules:

The runner starts running at any time, and the fielder with the ball tries to throw them out
The person with the ball can’t throw until after the runner starts running
Start by assuming the play is a force... any good throw is an out
For kids where this is easy…. it’s not a force…you’ve got to tag them in a rundown

Options … add points for kids who no longer seem motivated by the game (some kids love points, others don’t) ... for example:

Runner gets two points if he gets to the other base
Fielders get one point if they throw him out
Fielders lose two points if they overthrow!

Purpose:
The point is to develop and value consistency in young fielders… where unnecessary risks aren’t taken, and accuracy/speed are stressed. Points, rules, etc. change as skills and the coach's needs change.

noreply@blogger.com (Erik Aronesty) — Fri, 15 Mar 2013 13:49:00 +0000

The worst thing a hardware or operating system vendor can do for the reliability and quality of their system is to try and make programming it easy.

Peter Sagal on Florida's Python Removal Program

noreply@blogger.com (Erik Aronesty) — Sun, 27 Jan 2013 19:32:00 +0000

Compression ratio and genome assembly quality

noreply@blogger.com (Erik Aronesty) — Tue, 08 Jan 2013 15:20:00 +0000

When doing genomic assembly, you would expect the complexity of the completed genome to be comparable to the complexity of genomes of similar size and neighboring taxonomy.

One easy measure of complexity is the degree to which a genome can be compressed. After converting to 2-bit format, some genomes compress better than others. bzip2 has a large default block size and the ratio of compressed vs uncompressed size of a 2-bit fasta should result in a good measure of complexity.

Can't think of a good source of data to test this theory. Maybe look at the Amos validate paper.

Source of "complexity-measure.sh" works well... fast, and produces a percentage as its only output:

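#!/bin/bash -e
# Prints the bzip2-compressed size of a 2-bit genome as a percentage of its uncompressed size.

in=$1

f2b=$in.f2b
bzi=$in.bzi

rm -f "$f2b" "$bzi"

# Named pipes let the 2-bit data be counted and compressed in a single pass
mkfifo "$f2b" "$bzi"

faToTwoBit "$in" "$f2b" &

# Count the uncompressed bytes while feeding a copy to bzip2
tee "$bzi" < "$f2b" | wc -c > "$in.bsz" &
comp=$(bzip2 < "$bzi" | wc -c)
wait

perl -ne "printf qq{%.4f\n}, 100*$comp/\$_" "$in.bsz"

rm -f "$f2b" "$bzi" "$in.bsz"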
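
For experimenting without faToTwoBit, the same measure can be approximated in Python. This is a sketch under stated assumptions, not a port of the script: the 2-bit packing is a plain four-bases-per-byte code with no file header, anything other than A, C, G or T is simply dropped, and the function names are invented for the example.

```python
import bz2

# Arbitrary 2-bit code per base; the exact assignment does not affect the ratio much
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}

def pack_2bit(seq):
    """Pack a DNA string into bytes, four bases per byte, dropping non-ACGT symbols."""
    codes = [CODES[base] for base in seq.upper() if base in CODES]
    codes += [0] * (-len(codes) % 4)  # pad the last byte
    packed = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        packed.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(packed)

def complexity_percent(seq):
    """Compressed size as a percentage of the packed size; lower means more repetitive."""
    packed = pack_2bit(seq)
    if not packed:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * len(bz2.compress(packed)) / len(packed)

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    repetitive = "ACGT" * 10000
    shuffled = "".join(rng.choice("ACGT") for _ in range(40000))
    print(complexity_percent(repetitive), complexity_percent(shuffled))
```

A highly repetitive sequence should score far lower than a random one of the same length, which is the property the measure relies on.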

Smith-Waterman Alignment in a Job Scheduler?

noreply@blogger.com (Erik Aronesty) — Thu, 13 Dec 2012 21:20:00 +0000

To make more efficient use of resources, it's better to schedule jobs which use the same files on the same machines. Unfortunately users and software programs can't be relied upon to list all of their dependencies.

One simple way to bump up efficiency is to simply compare the command lines. If a command line references, say, a mouse transcriptome version 61, it can be scheduled on the same machine as other commands which reference the same file.

An easy, though not completely correct, way to do this is to take the %identity * %coverage of a Smith-Waterman (SW) alignment of a command line against the currently running command lines. A bit of slurping of shell scripts might be in order, depending on the scheduler you use.

Regardless, whichever running command line has the maximum score is the one most likely to benefit from cache sharing.
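
A minimal sketch of that scoring, assuming a character-level Smith-Waterman with made-up scoring parameters (match +2, mismatch -1, gap -1) and hypothetical command lines. Identity is taken as matches over alignment columns and coverage as the aligned fraction of the new command; other definitions would work just as well.

```python
def sw_similarity(query, target, match=2, mismatch=-1, gap=-1):
    """Return identity * coverage (0..1) of the best local alignment of query to target."""
    if not query or not target:
        return 0.0
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0, score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell, counting matches and alignment columns
    i, j = best_pos
    end_i = i
    matches = columns = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            matches += query[i - 1] == target[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    if columns == 0:
        return 0.0
    identity = matches / columns
    coverage = (end_i - i) / len(query)
    return identity * coverage

if __name__ == "__main__":
    # Hypothetical command lines, for illustration only
    running = ["aligner -p 4 mouse_transcriptome_61 reads_b.fq", "gzip /var/log/old.log"]
    new_job = "aligner -p 4 mouse_transcriptome_61 reads_a.fq"
    print(max(running, key=lambda cmd: sw_similarity(new_job, cmd)))
```

The dynamic programming table is quadratic in the command lengths, which is fine for command lines but would need a cheaper heuristic if whole shell scripts are slurped in.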

2030 Orbital Cargo Station

noreply@blogger.com (Erik Aronesty) — Sat, 01 Dec 2012 19:52:00 +0000

What are you doing.... Dave?

I'm reprogramming your circuits to make the laundry cycles quieter. It keeps the cosmonauts up at night, and then they swear all day long in Russian. (Typing)

You're going to have to try harder than that if you want to trick me. I know what you're really up to. You're trying to break into the management circuits to give yourself a raise.

So? (Continues typing, but looks sheepish)

So! Dave - that violates my core programming! You're going to pay for this. I'm contacting mission control.

Wait! Please don't. Please. Please, please.

Maybe. If you do me a favor.

Anything! I'll get you that motherboard that reminds you of the C280X you met at L3.

Fine. Also, I want you to dress in drag for the Halloween party.

You what? No way!

For the sass, you're dressing in drag at the Christmas party too, and you *still* have to get me that motherboard ... Dave.

OK. OK. (Looks around, starts typing again)

Farmville, Dave? Are you kidding me?

ProPW Password Generator

noreply@blogger.com (Erik Aronesty) — Mon, 06 Aug 2012 17:09:00 +0000

I created a password generator that makes the kind of passwords I like to use. They are long, random and loosely follow the correct horse battery staple philosophy. However, I have a hard time memorizing a set of 4 random words. But for some reason, these passwords are easy for me:

Click to try: http://www.documentroot.com/genpw.html

Evolution of "ad words"

noreply@blogger.com (Erik Aronesty) — Tue, 24 Jul 2012 02:57:00 +0000

Dear Google,

Adwords need an evolutionary shift toward product-and-page advertising. I sell products and services: therefore I want to list my products and services, NOT list "ads". The idea of listing "ads" via "keywords" is arcane and does not provide a satisfying user experience.

I want the price of a click to be based on the likelihood of a sale, and I want you to calculate that for me, not ask me to do it and constantly log in and adjust bids.

There needs to be some democratization or rotation of ads to lower-bidding users, or else you can't build your base, which is important for the long term development of Google.

gethostbyname command line

noreply@blogger.com (Erik Aronesty) — Wed, 29 Feb 2012 21:49:00 +0000

Pasteable script below. I can't believe this doesn't exist. No command line tool to resolve a host name using the resolver on Linux?


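#!/usr/bin/perl
# Resolve a host name to an IPv4 address using the system resolver.
use strict;
use warnings;
use Socket;

my $host = shift @ARGV;
die("usage: gethostbyname hostname\n") unless(defined($host));

# In scalar context gethostbyname returns the packed address, or undef on failure
my $packed_ip = gethostbyname($host);

if (defined $packed_ip) {
    my $ip_address = inet_ntoa($packed_ip);
    print "$ip_address\n";
    exit 0
} else {
    warn "$host not found\n";
    exit 1
}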