<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <title>Yashh</title>
  
  
  <updated>2011-09-23T20:39:12-07:00</updated>
  <id>http://yashh.com</id>
  <author>
    <name>Yashh Nelapati</name>
    <email>me[at]yashh.com</email>
  </author>
  
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/yashh" /><feedburner:info uri="yashh" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
    <title>Python S3 BlobStore</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/4N2SJptD_RI/python-S3-BlobStore" />
    <updated>2011-09-23T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/python-S3-BlobStore</id>
    <content type="html">
      &lt;p&gt;There are scenarios where you have a huge JSON or binary payload that you want to save. You can certainly write it to a BLOB column in MySQL or Postgres, and that works well as long as the dataset stays bounded. With an ever-expanding dataset, I bet you will see a performance hit on the database. This usually hurts when you want to bring up a slave for that particular database: it takes a ton of time to copy the data over, and the table ends up taking a huge amount of disk space. In this scenario you could use MongoDB or something similar, since it is schemaless. But MongoDB ends up pulling its indexes and data into memory, which causes quite a lot of overhead. This is where you need a blobstore, which is a dumb key-value store. Let&amp;#8217;s say you are writing a contact importer that imports your Facebook, Twitter and Google contacts. You end up receiving a ton of contact data from these services. You can JSON-encode all the received data and write it to a blobstore with the user id as part of the key.&lt;/p&gt;

&lt;p&gt;If you are using &lt;a href='http://code.google.com/appengine/'&gt;Google App Engine&lt;/a&gt;, you can use Google&amp;#8217;s &lt;a href='http://code.google.com/appengine/docs/python/blobstore/overview.html'&gt;blobstore&lt;/a&gt;. But if you are hosting on Amazon EC2, you can take advantage of S3 as a blobstore. The latency between EC2 and S3 is pretty good.&lt;/p&gt;

&lt;h2 id='presenting_blobstorepy'&gt;Presenting BlobStore.py&lt;/h2&gt;

&lt;p&gt;Here is the &lt;a href='https://gist.github.com/1238913'&gt;blobstore.py&lt;/a&gt; I use to store huge data objects. It uses the boto library to read from and write to S3. You can easily subclass BlobStore and write your own model that defines which S3 bucket to use and the key prefix.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
class MyNewDataBlob(BlobStore):
    def __init__(self):
        super(MyNewDataBlob, self).__init__(&amp;quot;s3_bucket_name&amp;quot;)

    def get_key(self, user_id): return &amp;quot;userdata:&amp;quot;+str(user_id)

# usage, from a Python shell
&amp;gt;&amp;gt;&amp;gt; from blobstore_models import MyNewDataBlob
&amp;gt;&amp;gt;&amp;gt; MyNewDataBlob().get(123)
    # return parsed JSON from S3 or None if it does not exist
&amp;gt;&amp;gt;&amp;gt; MyNewDataBlob().set(123, {&amp;quot;image&amp;quot;: &amp;quot;/path/to/image.jpg&amp;quot;, &amp;quot;so_many_other_keys&amp;quot;: &amp;quot;....&amp;quot;})
&amp;gt;&amp;gt;&amp;gt; MyNewDataBlob().delete(123)
    # deletes the entry for 123&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All we need is a simple get, set and delete. You can get a little fancy by implementing get_many and set_many. Send a pull request if you think you have added some cool stuff.&lt;/p&gt;
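
&lt;p&gt;A minimal sketch of what get_many and set_many could look like. It assumes only the get / set / delete interface shown above. DictBlobStore is a hypothetical in-memory stand-in, used so the example runs without S3 or boto; it is not part of blobstore.py.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import json


class DictBlobStore(object):
    """Hypothetical in-memory stand-in exposing get / set / delete."""

    def __init__(self):
        self._blobs = {}

    def get_key(self, user_id):
        return "userdata:" + str(user_id)

    def get(self, user_id):
        raw = self._blobs.get(self.get_key(user_id))
        return json.loads(raw) if raw is not None else None

    def set(self, user_id, data):
        self._blobs[self.get_key(user_id)] = json.dumps(data)

    def delete(self, user_id):
        self._blobs.pop(self.get_key(user_id), None)

    # the "fancy" part: batch helpers layered on top of get / set
    def get_many(self, user_ids):
        return dict((uid, self.get(uid)) for uid in user_ids)

    def set_many(self, mapping):
        for uid, data in mapping.items():
            self.set(uid, data)


store = DictBlobStore()
store.set_many({1: {"name": "a"}, 2: {"name": "b"}})
# id 3 was never written, so it maps to None
found = store.get_many([1, 2, 3])&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Against S3 every get is a separate round trip, so a real get_many would probably want to issue the reads concurrently rather than in a loop.&lt;/p&gt;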
&lt;div class='entry_comments'&gt;
&lt;h5&gt;Follow me on &lt;a href='http://twitter.com/yashh'&gt;twitter&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;I am tired of deleting spam comments and decided to get rid of comments altogether. If you have any questions you can reach me at &lt;strong&gt;me [at] yashh [dot] com&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/python-S3-BlobStore</feedburner:origLink></entry>
  
  <entry>
    <title>Monitoring Python Processes</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/wFW0fKMNGXk/monitoring-python-process" />
    <updated>2011-08-23T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/monitoring-python-process</id>
    <content type="html">
      &lt;p&gt;It&amp;#8217;s common to see a Python process, such as a web server worker or a long-running script, and have no idea what it is currently doing. Logging and print statements do not help much in getting a sense of where the latency comes from. I personally encounter this every day. You bring in a new database or algorithm that worked well in your local / dev / staging environment, but as soon as you push to production, DANG&amp;#8230; stuck. You try a number of things in a shell and take a guess at what might be wrong, but you can&amp;#8217;t find for sure where exactly it is getting stuck.&lt;/p&gt;

&lt;p&gt;For these situations we came up with a couple of ways to know what&amp;#8217;s happening in production.&lt;/p&gt;

&lt;h3 id='setproctitle'&gt;setproctitle&lt;/h3&gt;

&lt;p&gt;&lt;a href='http://pypi.python.org/pypi/setproctitle'&gt;Setproctitle&lt;/a&gt; is a Python library that lets you change the name of the running process. Postgres does this natively, and projects like Celery and gunicorn use the library to show which process is the master and which are workers. We took this a step further to see which worker is serving which request and how long each request takes.&lt;/p&gt;

&lt;p&gt;In your middleware&amp;#8217;s process_request, record the start time on the request object and change the process title to request.path. In process_response, set the process title back to idle and append the time spent in the request.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
# some custom middleware
import time
import setproctitle

def process_request(self, request):
    ...
    # remember when the request started so the response hook can time it
    request.start_time = time.time()
    setproctitle.setproctitle(&amp;quot;gunicorn -&amp;gt; &amp;quot; + request.path)
    ...

def process_response(self, request, response):
    ...
    setproctitle.setproctitle(&amp;quot;gunicorn -&amp;gt; [idle] [%0.3fs]&amp;quot;
                              % (time.time() - request.start_time))
    return response&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now &lt;strong&gt;ps aux | grep gunicorn&lt;/strong&gt; will show you the request each worker is currently handling, or the time it took to process its last request. With this you can tell which workers are stuck. The next method will tell you where they are stuck.&lt;/p&gt;

&lt;h3 id='register_sigusr1_signal'&gt;Register SIGUSR1 signal&lt;/h3&gt;

&lt;p&gt;Another way to see what a Python process is doing is to register a SIGUSR1 signal handler that dumps a traceback of the current execution into a file. All you need to do is call &lt;em&gt;register_signals&lt;/em&gt; in any Python script.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import signal, traceback

def sigusr1_handler(signum, frame):
    print(&amp;quot;Received SIGUSR1 -- Printing stack trace...&amp;quot;)

    # append the stack of the interrupted code to a file
    with open(&amp;#39;/tmp/current.traceback.txt&amp;#39;, &amp;#39;a&amp;#39;) as f:
        traceback.print_stack(frame, file=f)

def register_signals():
    signal.signal(signal.SIGUSR1, sigusr1_handler)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now call register_signals in settings.py, views.py or any other Python script. Then run ps aux, get the pid of the process you want to know about and send a SIGUSR1 signal to it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ps aux
kill -SIGUSR1 8932
cat /tmp/current.traceback.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will give you a full traceback of where the process is stuck, thereby giving you a clear idea of why it&amp;#8217;s stuck. Kudos to &lt;a href='http://pinterest.com/martaaay'&gt;marty@pinterest&lt;/a&gt; for sharing this technique.&lt;/p&gt;
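
&lt;p&gt;To try the technique without a production process, here is a small self-contained sketch in which the script signals itself. The file name and busy_function are made up for the demo, and it assumes a Unix-like system where SIGUSR1 exists.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import os
import signal
import tempfile
import time
import traceback

# illustrative path; the post uses /tmp/current.traceback.txt
TRACE_PATH = os.path.join(tempfile.gettempdir(), "sigusr1_demo.traceback.txt")


def sigusr1_handler(signum, frame):
    # frame is the code that was running when the signal arrived
    with open(TRACE_PATH, "a") as f:
        traceback.print_stack(frame, file=f)


def busy_function():
    # stand-in for the code that is "stuck": signal ourselves from inside it
    os.kill(os.getpid(), signal.SIGUSR1)
    # give the handler a chance to run while we are still in here
    time.sleep(0.1)


signal.signal(signal.SIGUSR1, sigusr1_handler)
if os.path.exists(TRACE_PATH):
    os.remove(TRACE_PATH)
busy_function()

with open(TRACE_PATH) as f:
    dump = f.read()
print(dump)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The dump should name busy_function as the innermost frame, which is exactly the information you want from a stuck worker.&lt;/p&gt;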
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/monitoring-python-process</feedburner:origLink></entry>
  
  <entry>
    <title>PyMysql</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/AsEXVVEoXKc/PyMysql" />
    <updated>2011-08-21T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/PyMysql</id>
    <content type="html">
      &lt;p&gt;&lt;a href='https://github.com/petehunt/PyMySQL'&gt;PyMysql&lt;/a&gt; is a pure Python MySQL client that aims to be a drop-in replacement for python-mysqldb. Calling its &lt;strong&gt;install_as_MySQLdb&lt;/strong&gt; function makes PyMysql stand in for MySQLdb. I am excited about this project because it&amp;#8217;s so easy to install and get started with. Installing MySQLdb is never straightforward, because it is a C binding and looks for the mysql_config binary.&lt;/p&gt;

&lt;p&gt;Once your site starts growing fast you will see a ton of bottlenecks at your SQL layer. Usually it is your ORM doing some wacky stuff, and once you tweak it to do simple queries you are fine. But as your traffic grows and you have about 20-30 app servers each running 20-30 app processes, you are under pressure again, because you are making a ton of connections to your MySQL layer. Even if you pool your connections you are handling quite a lot of them. At this point you will need to move your MySQL access into a separate data layer, that is, put it behind a Thrift API or a REST-like service that serves the data.&lt;/p&gt;

&lt;p&gt;I personally think that writing SQL is easy, and it becomes even easier as your site grows, because you cannot JOIN tables and expect that to work at scale, so the queries stay simple. Here is a quick example of using PyMysql:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import pymysql

def get_mysql_connection(shard_name):
    # really important to set autocommit to True, or else reads on a
    # long-lived connection can come back stale
    conn = pymysql.connect(**settings.DATABASES[shard_name])
    conn.autocommit(1)
    return conn

# in your python controller (the function name is illustrative)
def unfollow(user_id1, user_id2):
    conn = get_mysql_connection(&amp;quot;slave001&amp;quot;)
    cursor = conn.cursor()
    sql = &amp;quot;DELETE FROM follow_user WHERE user_id1=%s AND user_id2=%s&amp;quot;
    # execute returns the number of affected rows
    data = cursor.execute(sql, (user_id1, user_id2))
    cursor.connection.commit()
    cursor.close()
    conn.close()
    return str(data)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Just get a connection, open a cursor and run your query, letting PyMysql escape your SQL parameters (to avoid injections), and boom, done. Here is a simple SQL insert.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
sql = &amp;quot;&amp;quot;&amp;quot;INSERT INTO users (username, password) VALUES (%s, %s)&amp;quot;&amp;quot;&amp;quot;
data = cursor.execute(sql, (username, password))
cursor.connection.commit()
user_id = cursor.connection.insert_id()
cursor.close()
return user_id&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id='asynchronous'&gt;Asynchronous&lt;/h5&gt;

&lt;p&gt;Now comes the most interesting part: running your queries asynchronously. Because PyMysql is pure Python and talks to MySQL through the standard socket module, you can use gevent / eventlet to monkey patch the socket module. That gives you the power to run multiple queries in parallel with greenlets.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import gevent
from gevent import monkey
monkey.patch_socket()

def query(sql, database):
    conn = get_mysql_connection(database)
    cursor = conn.cursor()
    cursor.execute(sql)
    data = cursor.fetchall()
    cursor.close()
    conn.close()
    return data

# in your controllers
jobs = [gevent.spawn(query, sql, database) for sql, database in some_array]
# blocks until all queries complete, or for at most 2 seconds
gevent.joinall(jobs, timeout=2)
# job.value is None for a query that did not finish in time
what_you_want = [job.value for job in jobs]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Recently &lt;a href='http://tarekziade.wordpress.com/2011/07/12/firefox-sync-python/'&gt;Tarek Ziade&lt;/a&gt; blogged about how they used gevent+PyMysql+Gunicorn for Mozilla&amp;#8217;s sync server. I want to cover connection pooling in a separate blog post.&lt;/p&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/PyMysql</feedburner:origLink></entry>
  
  <entry>
    <title>Monitoring memcached</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/4ctcSvjVEeg/monitoring-memcached" />
    <updated>2011-05-15T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/monitoring-memcached</id>
    <content type="html">
      &lt;p&gt;Memcached is a key component of a web stack that cannot be neglected. Having some sort of caching in front of your database layer is almost certain to improve your performance. But setting up memcached in a production environment is slightly tricky, especially when you have a growing dataset.&lt;/p&gt;

&lt;p&gt;I dislike installing memcached via apt-get, since it creates a config file and is disabled by default. Instead I install memcached from source and always start it with the &lt;strong&gt;-d&lt;/strong&gt; flag to daemonize it. Another reason I build from source is the extra debug tools that come with libmemcached, like memstat and memdump. They are extremely useful for knowing how well your cache layer is performing.&lt;/p&gt;

&lt;h2 id='installing_memcache_from_source'&gt;Installing memcache from source&lt;/h2&gt;

&lt;p&gt;Since memcached depends on libevent, let&amp;#8217;s install that first, followed by memcached itself and libmemcached.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wget http://www.monkey.org/~provos/libevent-2.0.10-stable.tar.gz
tar xzf libevent-2.0.10-stable.tar.gz 
cd libevent-2.0.10-stable/
./configure --prefix=/usr/local
make
sudo make install
cd ..

wget http://memcached.googlecode.com/files/memcached-1.4.5.tar.gz
tar xzf memcached-1.4.5.tar.gz
cd memcached-1.4.5
./configure --with-libevent=/usr/local/
make
sudo make install
cd ..

wget http://launchpad.net/libmemcached/1.0/0.44/+download/libmemcached-0.44.tar.gz
tar xzf libmemcached-0.44.tar.gz
cd libmemcached-0.44/
./configure
make
sudo make install&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now you can start memcached and allocate a fixed amount of memory to it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;memcached -u nobody -d -m 512&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One thing I learned the hard way is to always specify the amount of memory memcached should use. If you don&amp;#8217;t specify it, it defaults to 64MB, which is pretty small. If you have a growing dataset and traffic, I suggest you dedicate some boxes to memcached and start it with 90% of the available memory.&lt;/p&gt;

&lt;h2 id='few_tools_to_monitor_memcached'&gt;Few tools to monitor memcached&lt;/h2&gt;

&lt;p&gt;Now that your memcached is up and running, we can see all the keys it holds with &lt;strong&gt;memdump&lt;/strong&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;memdump --servers=127.0.0.1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Specify the list of servers for memdump to inspect. This is useful for development purposes.&lt;/p&gt;

&lt;p&gt;Now on to my favorite tool -&amp;gt; &lt;strong&gt;memstat&lt;/strong&gt;. Memstat collects a bunch of statistics about how your memcached is performing.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;memstat --servers=127.0.0.1
Server: 127.0.0.1 (11211)
	 pid: 10394
	 time: 1305424990
	 version: 1.4.5
	 pointer_size: 64
	 curr_connections: 42
	 total_connections: 177741649
	 connection_structures: 308
	 cmd_get: 1633514457
	 cmd_set: 439392889
	 get_hits: 1493612276
	 get_misses: 137902181
	 delete_misses: 9951479
	 delete_hits: 8166456
	 bytes_read: 770672258573
	 bytes_written: 2206367771075
	 limit_maxbytes: 6878658560
	 conn_yields: 11785
	 bytes: 6163615697
	 curr_items: 3736681
	 total_items: 235136283
	 evictions: 181308239
	 reclaimed: 4226382&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I have trimmed the output to only include the useful stats. The most important ones I look at are &lt;strong&gt;get_hits&lt;/strong&gt; and &lt;strong&gt;get_misses&lt;/strong&gt;. Together they give you your memcached hit ratio. Another important one to keep in mind is &lt;strong&gt;evictions&lt;/strong&gt;. If your evictions keep increasing over time, it is time to scale up your memcached servers. You can also see that memcached has accepted a lot of connections over time (total_connections), so connection pooling might improve performance as well.&lt;/p&gt;
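
&lt;p&gt;As a worked example, the hit ratio for the output above is get_hits / (get_hits + get_misses), which comes out at roughly 91.5%. The sketch below does that arithmetic with the numbers copied from the memstat output. The evictions-per-item figure is my own rough indicator of memory pressure, not something memstat reports.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
def hit_ratio(stats):
    """Fraction of get requests that were served from the cache."""
    hits = stats["get_hits"]
    misses = stats["get_misses"]
    total = hits + misses
    return float(hits) / total if total else 0.0


# values copied from the memstat output above
stats = {
    "get_hits": 1493612276,
    "get_misses": 137902181,
    "evictions": 181308239,
    "total_items": 235136283,
}

ratio = hit_ratio(stats)
# evictions per stored item: a rough, assumed indicator of memory pressure
eviction_rate = float(stats["evictions"]) / stats["total_items"]
print("hit ratio: %.1f%%" % (ratio * 100))
print("evictions per item stored: %.2f" % eviction_rate)&lt;/code&gt;&lt;/pre&gt;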

&lt;p&gt;Along with these tools there are a few handy tools to remove keys from a cluster, copy keys etc.&lt;/p&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/monitoring-memcached</feedburner:origLink></entry>
  
  <entry>
    <title>Tracking with statsd</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/HSzGhugfQKs/tracking-with-statsd" />
    <updated>2011-05-14T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/tracking-with-statsd</id>
    <content type="html">
      &lt;p&gt;&lt;strong&gt;&lt;a href='https://github.com/etsy/statsd'&gt;Statsd&lt;/a&gt;&lt;/strong&gt; is a &lt;a href='http://nodejs.org/'&gt;nodejs&lt;/a&gt; daemon for easy stats aggregation from the folks at Etsy. If you haven&amp;#8217;t come across statsd, I suggest you read the blog post from the Etsy devops team &lt;a href='http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/'&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I like the fact that they chose UDP to aggregate stats, since it&amp;#8217;s blazing fast. &lt;a href='http://graphite.wikidot.com/'&gt;Graphite&lt;/a&gt; is the backend that stores the tracked data. Graphite receives data and creates a data point every minute.&lt;/p&gt;

&lt;p&gt;Statsd is a nodejs daemon which accepts UDP packets, stores the data in memory and flushes it into Graphite once every X seconds. This &lt;strong&gt;flushInterval&lt;/strong&gt; can be configured via the config file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I suggest you install node.js version 0.3.8, as other versions threw some errors and warnings regarding vm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;pre&gt;&lt;code&gt;$ node stats.js config.js&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once the statsd daemon is up and running, install Graphite and its dependencies. Installing Graphite is quite some work, as it depends on django, pycairo and several other modules. &lt;a href='http://agiletesting.blogspot.com/'&gt;Grig&lt;/a&gt; has written an excellent post on &lt;a href='http://agiletesting.blogspot.com/2011/04/installing-and-configuring-graphite.html'&gt;&amp;#8220;Installing and configuring Graphite&amp;#8221;&lt;/a&gt;. Instead of using Apache to serve django I chose gunicorn, which is pretty easy to set up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# run this command using runit or daemontools
cd /opt/graphite/webapp/graphite &amp;amp;&amp;amp; gunicorn_django -b 0.0.0.0:8000&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After you install Graphite, make sure to run syncdb and start the carbon-cache server to start collecting data.&lt;/p&gt;

&lt;h2 id='write_to_statsd_with_python'&gt;Write to Statsd with python&lt;/h2&gt;

&lt;p&gt;Now comes the interesting part: writing data into statsd. I used &lt;a href='https://github.com/sivy/py-statsd'&gt;py-statsd&lt;/a&gt;, a Python statsd client available on github.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pystatsd import Client
sc = Client(&amp;quot;graphite_host.yashh.com&amp;quot;, 8125)
sc.increment(&amp;quot;stats.blog_post.track-with-statsd&amp;quot;)
sc.decrement(&amp;quot;stats.users.count&amp;quot;)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This code creates a client object &lt;strong&gt;sc&lt;/strong&gt; for statsd, increments one key and decrements another. You can do quite a lot of tracking with the statsd package:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track logins on the site&lt;/li&gt;
&lt;li&gt;Track 500 rate on the site&lt;/li&gt;
&lt;li&gt;Track user invite rate&lt;/li&gt;
&lt;li&gt;Track emails sent out, and many more&lt;/li&gt;
&lt;/ul&gt;
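
&lt;p&gt;Under the hood a counter update is just a tiny UDP datagram in the statsd line format key:value|c. The sketch below builds and sends such datagrams with nothing but the standard library. The host and key names are illustrative, and this is a bare-bones stand-in rather than how py-statsd is implemented.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import socket


def statsd_counter(key, delta=1):
    """Build a counter datagram in the statsd line format: key:value|c"""
    return ("%s:%d|c" % (key, delta)).encode("ascii")


def send_counter(host, port, key, delta=1):
    # UDP is fire-and-forget: no connection and no reply to wait for
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_counter(key, delta), (host, port))
    finally:
        sock.close()


# illustrative keys; 8125 is the port used in the example above
send_counter("127.0.0.1", 8125, "stats.logins")
send_counter("127.0.0.1", 8125, "stats.users.count", -1)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because nothing waits for an answer, tracking calls like these add very little latency to the request that triggers them, which is what makes it cheap to sprinkle them all over a codebase.&lt;/p&gt;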

&lt;p&gt;py-statsd also comes with a server which collects data just like statsd, but I use it only for development purposes. We run all our Graphite + statsd packages on an EC2 small instance. Pretty sweet.&lt;/p&gt;

&lt;p&gt;Make sure you configure Graphite&amp;#8217;s retention so that it keeps about &lt;a href='http://graphite.wikidot.com/getting-your-data-into-graphite'&gt;6 months to 1 year&lt;/a&gt; of history.&lt;/p&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/tracking-with-statsd</feedburner:origLink></entry>
  
  <entry>
    <title>Scaling up with nginx</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/Jhf5sO05gb4/scaling-up-with-nginx" />
    <updated>2011-05-13T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/scaling-up-with-nginx</id>
    <content type="html">
      &lt;p&gt;More than 6% of the web is powered by nginx today. Nginx out of the box is very fast and extremely good at serving static content. But once your traffic grows, you need to monitor all of your infrastructure and make sure nothing is gating your speed.&lt;/p&gt;

&lt;h2 id='few_tips_to_scale_up_nginx'&gt;Few tips to scale up nginx&lt;/h2&gt;

&lt;p&gt;The first thing you need to monitor is your error.log. Nginx writes all its errors into error.log, so make sure you set the error log path in your conf file or at configure time. The first thing you might want to tune is the number of nginx workers -&amp;gt; &lt;strong&gt;worker_processes 4;&lt;/strong&gt;. This will spawn 4 workers to handle your traffic. In most cases there is no need to go beyond 2, but if you have a lot of traffic, match this value to the number of cores on your machine.&lt;/p&gt;

&lt;p&gt;After your nginx starts to handle a lot of connections you might start to see this in your error.log&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;2011/05/09 12:32:36 [alert] 4671#0: accept() failed (24: Too many open files)
2011/05/09 16:13:55 [alert] 7245#0: 1024 worker_connections are not enough&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;By default nginx is allowed to have only 1024 file descriptors open at a time. If it exceeds that limit, bump the number of file descriptors your machine allows. If you are on Ubuntu, edit your &lt;strong&gt;/etc/security/limits.conf&lt;/strong&gt; file and add this just before the end of the file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;*       soft    nofile  16384
*       hard    nofile  32768
# End of file&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To apply the changes in /etc/security/limits.conf on Ubuntu, add the following line in the /etc/pam.d/common-session file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;session required  pam_limits.so&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this you are raising the number of file descriptors a process on your machine can open. After that change, edit your nginx.conf so that nginx itself can handle more file descriptors.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;worker_rlimit_nofile 32768;
events {
    worker_connections  4096;
    use epoll;
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can also increase your worker_connections to handle more connections at a time. Make sure you read the &lt;a href='http://wiki.nginx.org/CoreModule'&gt;nginx docs&lt;/a&gt; before you change more settings.&lt;/p&gt;
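
&lt;p&gt;To check whether the limits.conf change took effect, you can read the limits from a fresh login session. This is a small sketch using Python&amp;#8217;s standard resource module. It reports the limits of the process running the script, not those of nginx, so treat it as a sanity check for the session rather than proof of what the nginx workers got. The threshold of 4096 simply mirrors worker_connections above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import resource

# (soft, hard) limits on open file descriptors for the current process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft nofile limit: %s" % soft)
print("hard nofile limit: %s" % hard)

# 4096 matches worker_connections in the nginx.conf above
wanted = 4096
if soft != resource.RLIM_INFINITY and wanted > soft:
    print("soft limit is below %d, raise it in limits.conf" % wanted)&lt;/code&gt;&lt;/pre&gt;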

&lt;p&gt;If you are using nginx as a proxying server make sure your backends are up or else you might end up with &lt;strong&gt;connect() failed (111: Connection refused) while connecting to upstream, client: xx.xx.xx.xx&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once you are past the default file descriptor limit, it is better to move the static content onto a CDN, or else make sure there is a dedicated nginx server serving just the static files.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And make sure you monitor nginx with munin. I use the &lt;a href='https://github.com/jnstq/munin-nginx-ubuntu'&gt;munin-nginx plugin&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;pre&gt;&lt;code&gt;server {
   listen 127.0.0.1;
   server_name localhost;
   location /nginx_status {
           stub_status on;
           access_log   off;
           allow 127.0.0.1;
           deny all;
   }
}

## In your munin plugin-conf.d/munin
[nginx*]
user root
env.url http://localhost/nginx_status&lt;/code&gt;&lt;/pre&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/scaling-up-with-nginx</feedburner:origLink></entry>
  
  <entry>
    <title>Installing nginx from source</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/oOGk0H5nBbg/installing-nginx-from-source" />
    <updated>2011-05-07T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/installing-nginx-from-source</id>
    <content type="html">
      &lt;p&gt;Every once in a while we install nginx on a box. Usually apt-get or a similar package manager is a good friend here, but there are times when we want to install from source to benefit from newer stable features, and compiling from source gives you much more flexibility. Say you want to serve https with nginx: you need to pass --with-http_ssl_module to configure to enable SSL. Of course there are a ton of &lt;a href='http://library.linode.com/web-servers/nginx/configuration/ssl'&gt;articles&lt;/a&gt; out there on how to do that.&lt;/p&gt;

&lt;p&gt;Download nginx from &lt;a href='http://nginx.org/en/download.html'&gt;here&lt;/a&gt; and untar it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apt-get install libpcre3 libpcre3-dev libpcrecpp0 libssl-dev zlib1g-dev
wget http://nginx.org/download/nginx-1.0.1.tar.gz
tar xvzf nginx-1.0.1.tar.gz
cd nginx-1.0.1
./configure --prefix=/etc/nginx --conf-path=/etc/nginx/nginx.conf \
            --sbin-path=/usr/local/sbin --pid-path=/etc/nginx \
            --error-log-path=/var/log/nginx/error.log \
            --http-log-path=/var/log/nginx/access.log
make
sudo make install&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will install nginx into /etc/nginx. Make sure you copy &lt;a href='https://gist.github.com/960290#file_proxy.conf'&gt;proxy.conf&lt;/a&gt; into /etc/nginx to proxy requests to various backends. Also here is the &lt;a href='https://gist.github.com/960290'&gt;init&lt;/a&gt; file to stop|start|restart nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For High Performance&lt;/strong&gt; nginx you need to tweak your configuration to handle more connections.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ulimit -a&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Make sure to increase the file descriptor limit (open files) to 2048 or 4096. If you are serving static files with nginx, set &lt;strong&gt;&amp;#8220;expires 30d;&amp;#8221;&lt;/strong&gt; so that nginx sends an expires header with the files and browsers can cache them. If you have huge traffic, I would suggest serving your static files from a CDN like &lt;a href='http://aws.amazon.com/cloudfront/'&gt;CloudFront&lt;/a&gt; or &lt;a href='http://www.rackspace.com/cloud/cloud_hosting_products/files/'&gt;Rackspace&lt;/a&gt;, so that each webpage makes only one request to your nginx, thereby keeping your number of connections low.&lt;/p&gt;

&lt;p&gt;Make sure to tweak your &lt;strong&gt;&amp;#8220;worker_processes 2;&amp;#8221;&lt;/strong&gt; setting to match the number of cores, to make the best use of your CPU. Turn off the &lt;a href='http://wiki.nginx.org/NginxHttpLogModule#access_log'&gt;access log&lt;/a&gt; if you do not use it, as it causes quite a bit of IO on the server. Turn on gzip so that nginx compresses the data before sending it to the client. Keep in mind that increasing &lt;strong&gt;gzip_comp_level&lt;/strong&gt; uses more CPU cycles.&lt;/p&gt;
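
&lt;p&gt;If you generate nginx.conf from a deploy script, the worker count can be derived from the machine instead of being hard-coded. A tiny sketch, assuming the one-worker-per-core rule of thumb described above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!python
import multiprocessing

# one worker per core, as suggested above
cores = multiprocessing.cpu_count()
directive = "worker_processes %d;" % cores
print(directive)&lt;/code&gt;&lt;/pre&gt;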
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/installing-nginx-from-source</feedburner:origLink></entry>
  
  <entry>
    <title>Blog V2</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/A5H-flykDBM/blog-v2" />
    <updated>2011-04-01T00:00:00-07:00</updated>
    <id>http://yashh.com/posts/2011/blog-v2</id>
    <content type="html">
      &lt;p&gt;I launched this blog about 3 years ago as a mission to learn web development. I tried to write a lot of code and implement a ton of features which I didn&amp;#8217;t even care to use. But it really did help me understand a lot of things. Finally I decided to retire my old codebase and wrote a new one which is very lightweight and simple.&lt;/p&gt;

&lt;p&gt;This new incarnation runs on &lt;a href='http://flask.pocoo.org/'&gt;Flask&lt;/a&gt; which is a lightweight python framework. The source code of this blog will be open-sourced soon under the name &amp;#8220;flaker&amp;#8221;.&lt;/p&gt;

&lt;p&gt;Even though I liked my two-column layout, some readers complained that it was not usable, so I decided to give it up. I had never upgraded my Ubuntu server, which was running 8.10 with all the old gear. I took this as an opportunity to format it and install 10.10 along with a brand new MySQL 5.5 and many other tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Got rid of comments&lt;/strong&gt; Yup, that was quite a relief. Every day I had to delete one or two spam comments, which kind of sucks. Even though I checked every comment with &lt;a href='http://akismet.com/'&gt;Akismet&lt;/a&gt;, a few seemed to get past it. I would love to participate in discussions on &lt;a href='http://news.ycombinator.com/user?id=yashh'&gt;hackernews&lt;/a&gt; and &lt;a href='https://convore.com/users/yashh/'&gt;convore&lt;/a&gt;.&lt;/p&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2011/blog-v2</feedburner:origLink></entry>
  
  <entry>
    <title>Tips from MongoSV conference</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/HZhDlegqh4Q/tips-mongosv-conference" />
    <updated>2010-12-08T00:00:00-08:00</updated>
    <id>http://yashh.com/posts/2010/tips-mongosv-conference</id>
    <content type="html">
      &lt;p&gt;I attended the MongoSV conference on Dec 3rd and learned a bunch of cool things about using, deploying and monitoring MongoDB. First, I was impressed by the variety of use cases MongoDB was applied to. EventBrite was using Mongo to recommend events to its users. They perform background processing on their data, store the nearest neighbours and their related scores in MongoDB, and query it to show recommended events.&lt;/p&gt;

&lt;p&gt;ShareThis uses MongoDB as their primary counting system. Whenever a link is shared they increment that link&amp;#8217;s counter in MongoDB using atomic increments, which gives them near-realtime counts almost all the time. They also mentioned that they use pymongo to build internal queues which aggregate 100 or more writes and then dump them all as one single write. This made a high write throughput possible. They also wrote a fallback system which detects a MongoDB failure and writes the data into memcachedb instead. Later, when mongo is back up, they copy the data back from memcachedb.&lt;/p&gt;
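
&lt;p&gt;A minimal sketch of that batching idea in Python. This is not ShareThis code: the CountBuffer class, its flush_fn callback and the default batch size of 100 are hypothetical names chosen for illustration. In a real setup flush_fn would be the place to issue the atomic increments through pymongo.&lt;/p&gt;

```python
from collections import Counter


class CountBuffer:
    '''Aggregate share counts in memory and flush them as one batch.

    Hypothetical helper: flush_fn stands in for whatever performs the
    real write (for example, atomic increments through pymongo).
    '''

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.pending = Counter()   # url -> increments not yet written
        self.buffered = 0          # number of buffered write requests

    def incr(self, url, amount=1):
        self.pending[url] += amount
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # Hand the aggregated increments over as a single write.
        self.flush_fn(dict(self.pending))
        self.pending.clear()
        self.buffered = 0
```

&lt;p&gt;The trade-off is that anything still sitting in the buffer is lost if the process dies, so a real version would probably also flush on a timer and on shutdown.&lt;/p&gt;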

&lt;p&gt;When deploying MongoDB, make sure you deploy on 64-bit machines. 32-bit builds are limited to roughly 2GB of data, because MongoDB memory-maps its data files. Since mongo also preallocates fixed-size data files upon the first write, it is much better to be on 64-bit systems.&lt;/p&gt;

&lt;p&gt;If you are hosting on EC2 you must use an EBS volume for the data directory. For best performance use the ext4 or xfs filesystem. Striped EBS volumes also give a high write throughput, which is great for MongoDB. Make sure you give the data directory enough space.&lt;/p&gt;

&lt;p&gt;Once you have mongo running in production, it is important to check a few things on MongoDB regularly.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;db.serverStatus()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first things to look at in serverStatus() are connections: current and available. You need to make sure you are not using too many connections. pymongo and other drivers use connection pooling by default, opening a set of connections and reusing them. indexCounters: This shows index accesses, hits and misses, giving you a view of how well your indexes are serving your queries.&lt;/p&gt;

&lt;p&gt;opcounters: This gives you a view of how many inserts, queries, updates and deletes are made. From those counters you can judge whether your app is read heavy or write heavy and fine tune accordingly.&lt;/p&gt;
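
&lt;p&gt;As a rough sketch, that read heavy / write heavy judgement can be turned into a small Python helper. The function name read_write_mix is hypothetical, and it assumes you pass in the opcounters document from serverStatus() as a plain dict with insert, query, update and delete keys. Keep in mind the counters are cumulative since the server started, so comparing two samples taken some time apart gives a better picture than a single reading.&lt;/p&gt;

```python
def read_write_mix(opcounters):
    '''Return (reads, writes, read_fraction) from an opcounters dict.

    Only the four counters mentioned above are considered; missing
    keys are treated as zero.
    '''
    reads = opcounters.get('query', 0)
    writes = sum(opcounters.get(key, 0)
                 for key in ('insert', 'update', 'delete'))
    total = reads + writes
    # Avoid dividing by zero on a server that has seen no traffic.
    read_fraction = reads / total if total else 0.0
    return reads, writes, read_fraction
```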

&lt;p&gt;Another command which gives you some details about the data is&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;db.stats()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command gives you an overview of the number of collections and objects, plus the dataSize and indexSize of your data. Observe dataSize and indexSize closely. The output is in bytes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is recommended to allocate memory such that your whole data and indexes can fit into memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That way you get the best performance out of MongoDB. If you cannot do that, at least make sure the indexes fit into memory.&lt;/p&gt;
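
&lt;p&gt;The sizing rule above can be checked with a few lines of Python. This is just a sketch: memory_fit is a hypothetical helper, and it assumes you hand it the db.stats() output as a dict (reading only dataSize and indexSize, both in bytes) along with the amount of RAM you are willing to give MongoDB.&lt;/p&gt;

```python
def memory_fit(stats, ram_bytes):
    '''Classify how much of a database fits into ram_bytes of memory.

    stats is a dict shaped like the output of db.stats(); only the
    dataSize and indexSize fields (both in bytes) are read.
    '''
    data = stats['dataSize']
    index = stats['indexSize']
    if ram_bytes >= data + index:
        return 'data and indexes fit'
    if ram_bytes >= index:
        return 'only indexes fit'
    return 'indexes do not fit'
```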

&lt;p&gt;Make sure you install the &lt;a href='https://github.com/erh/mongo-munin'&gt;munin-plugin&lt;/a&gt;, which monitors the number of connections, operations, memory size and btrees. And if you are using replication, make use of Nagios alerts to detect database downtime and replication lag.&lt;/p&gt;

&lt;p&gt;Apart from this, if you are more of a sysadmin type, there is mongostat, a top-like tool that shows operations per second broken down into inserts / deletes / updates / queries, along with connections, flushes, locks, fsync and other valuable information which helps you understand your system.&lt;/p&gt;
    </content>
  <feedburner:origLink>http://yashh.com/posts/2010/tips-mongosv-conference</feedburner:origLink></entry>
  
  <entry>
    <title>Backup Mysql with XtraBackup</title>
    <link href="http://feedproxy.google.com/~r/yashh/~3/7eV9hePjTkc/backup-mysql-xtrabackup" />
    <updated>2010-12-08T00:00:00-08:00</updated>
    <id>http://yashh.com/posts/2010/backup-mysql-xtrabackup</id>
    <content type="html">
      &lt;p&gt;It&amp;#8217;s pretty common to back up MySQL with the mysqldump utility, but a dump locks tables while it runs. Especially when your database is huge, a backup can take several minutes. When you have only one database / shard in production, running a dump will effectively result in downtime. Another popular way to hot backup MySQL is using an LVM snapshot. &lt;a href='http://www.howtoforge.com/how-to-back-up-mysql-databases-with-mylvmbackup-on-ubuntu-8.10'&gt;mylvmbackup&lt;/a&gt; is a Perl script that obtains a read lock on all tables, flushes all server caches to disk, creates a snapshot of the volume containing the MySQL data directory, and then unlocks the tables again. But FLUSH TABLES WITH READ LOCK may take a while, especially on systems with long-running queries. On EC2 this process becomes even messier, since you have to unmount /mnt/ and then create a physical volume (pvcreate) and a few logical volumes (lvcreate).&lt;/p&gt;

&lt;p&gt;Apart from these there is &lt;a href='http://www.innodb.com/doc/hot_backup/manual.html'&gt;innodb hot backup&lt;/a&gt;, a commercial tool for making hot backups of both MyISAM and InnoDB tables. Then there is xtrabackup, an open source implementation of the same idea from Percona. The &lt;strong&gt;biggest advantage&lt;/strong&gt; of xtrabackup is that it doesn&amp;#8217;t lock your database during the backup (except for MyISAM tables).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;: Installing xtrabackup is kind of tricky. Installing from source requires you to compile InnoDB, and this process varies slightly between 5.0.x and 5.1.x. Alternatively, you can use one of the binaries from the &lt;a href='http://www.percona.com/downloads/XtraBackup/XtraBackup-1.3-beta/'&gt;percona website&lt;/a&gt;. Since I was on CentOS I tried the RPM binaries for RedHat first, but rpm -i errored out with &amp;#8220;requires mysql server&amp;#8221;. Even though we had a MySQL compiled from source, the RPM requires one installed from yum or something similar. I ended up using the generic linux binary, which worked for me. Copy the binaries from its bin directory to /usr/local/bin or some other directory on your PATH.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: Using xtrabackup is pretty easy.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;innobackupex-1.5.1 /mnt/something/ --user=root --password=xxxxx&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will back up all databases into the &amp;#8220;something&amp;#8221; directory, in a subdirectory named after the current timestamp. You might want to pass a list of databases to be backed up instead. There are a few more options you can pass to innobackupex, like the datadir, a defaults file, incremental backup etc. Check all the options &lt;a href='http://www.percona.com/docs/wiki/percona-xtrabackup:xtrabackup_manual#xtrabackup_options'&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;innobackupex-1.5.1 --defaults-file=/etc/my.cnf /mnt/something/ --user=root --password=xxxxx&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Since I installed MySQL from source I had to pass my.cnf explicitly. Make sure your my.cnf has the datadir option defined in it.&lt;/p&gt;

&lt;p&gt;Instead of backing up to a directory, gzipping it and copying it to another server, you can stream the backup directly to another host, which makes it easy to set up slaves. (Nevertheless, streaming never worked for me.)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;innobackupex-1.5.1: Created backup directory /root/src/xtrabackup-1.3/bin
101106 17:34:44  innobackupex-1.5.1: Starting mysql with options:  --defaults-file=&amp;quot;/etc/my.cnf&amp;quot; --unbuffered --
101106 17:34:44  innobackupex-1.5.1: Connected to database with mysql child process (pid=16855)
101106 17:34:52  innobackupex-1.5.1: Connection to database server closed
101106 17:34:52  innobackupex-1.5.1: Starting mysql with options:  --defaults-file=&amp;quot;/etc/my.cnf&amp;quot; --unbuffered --
101106 17:34:52  innobackupex-1.5.1: Connected to database with mysql child process (pid=16865)
101106 17:34:56  innobackupex-1.5.1: Connection to database server closed&lt;/code&gt;&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;If you had any success with streaming backups please post a comment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After making a backup you need to apply the InnoDB logs that were captured during the backup, otherwise the restored database will not function properly. You can do this with the same innobackupex by passing the --apply-log flag.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;innobackupex-1.5.1 --defaults-file=/etc/my.cnf --user=root --password=xxx --apply-log /mnt/something/2010-12-08_18-01-12/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will apply all the InnoDB logs to the backup. How long this takes depends on your InnoDB log size. After the logs are applied your backup is ready. All you need to do is copy the timestamped folder to the destination machine. You can use &lt;strong&gt;scp / rsync&lt;/strong&gt; to copy the directory over. If you are on EC2, you can attach and mount an EBS volume, copy the backup onto the volume, unmount it and mount it again on the destination server. After the transfer is complete there is only one last step: copying the data back into the new database.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;innobackupex-1.5.1 --defaults-file=/etc/my.cnf --user=root --password=xxx --copy-back /mnt/something/2010-12-08_05-21-36/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will copy the new data into the database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You might see an error saying it cannot copy mysql. This is because it&amp;#8217;s trying to copy the mysql database to the new server, where one already exists. You might want to mv the mysql directory in /var/lib/mysql to mysql.bak and retry the operation, which will then succeed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Right after the copy-back, make sure you make the mysql user the owner of /var/lib/mysql, since you might have run these operations as root or some other user.&lt;/p&gt;
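
&lt;p&gt;To recap the whole flow, here is a small Python sketch that only builds the three innobackupex command lines used in this post. It runs nothing itself. The backup_commands function and its parameter names are hypothetical, the flags are the ones shown above, and the password is deliberately left out. The first two commands are meant for the source server and the last one for the destination.&lt;/p&gt;

```python
def backup_commands(backup_root, stamped_dir, user='root',
                    defaults_file='/etc/my.cnf',
                    tool='innobackupex-1.5.1'):
    '''Return the three innobackupex invocations as argv lists.

    Nothing is executed here. The password is left out on purpose;
    supply it however you normally do (the post passes --password).
    '''
    common = [tool, '--defaults-file=' + defaults_file, '--user=' + user]
    return [
        common + [backup_root],                 # 1. take the backup
        common + ['--apply-log', stamped_dir],  # 2. apply the InnoDB logs
        common + ['--copy-back', stamped_dir],  # 3. restore on the target
    ]
```

&lt;p&gt;Each list can be handed to subprocess.check_call if you want to script the steps, but keep in mind that the timestamped directory name is only known after the first command has finished.&lt;/p&gt;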
    </content>
  <feedburner:origLink>http://yashh.com/posts/2010/backup-mysql-xtrabackup</feedburner:origLink></entry>
  
</feed>

