Time To Pull The Plug

A Way to Scale Your Blog

Scaling a website or blog is a constant battle. One of the first things people do is set up a two-tiered system on their webservers, with a lighter proxy httpd sitting in front of the heavy-duty backend application httpd. This way, the lighter webserver can handle the common, cheap requests and reserve the backend for the harder dynamic tasks. Serving images, JavaScript files, CSS files, and other static content from a lightweight process, without the overhead of a full mod_perl/mod_python/etc. installation, can increase performance dramatically.

The two-tiered setup with webservers is common knowledge. As time goes by and your site’s traffic starts increasing, however, performance will begin to suffer again and you’ll have to find different ways to scale.

A common way to do that is to cache dynamic content for anonymous users. A good rule of thumb is that anonymous browsers make up 90% of your website’s traffic, so if you can avoid regenerating pages for anonymous visitors you can theoretically reduce your website’s load to a tenth of what it was. Writing pages out to disk is a good way to cache them, but it isn’t perfect. If you’re going to cache more than just the index page and other pages found at a consistent URL, you’ll have to find a way to dump the pages out to disk and use some rewrite trickery to serve the cached copies properly. If you have more than one webserver in a pool, the cache isn’t shared between the machines. Finally, and this was becoming a problem on Daily Kos, writing the files to disk can itself cause performance problems when multiple webservers are all updating their caches simultaneously.
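The "tenth of what it was" figure assumes every anonymous request is answered from the cache. As a rough illustration (Python is used here only because the backend language varies from site to site), the sketch below shows how the remaining backend load depends on the cache hit rate. The function name and the 80% hit rate are hypothetical, and the model treats every request as costing the same.

```python
def backend_load_fraction(anon_share: float, hit_rate: float) -> float:
    """Fraction of requests that still reach the backend.

    anon_share: share of traffic from anonymous visitors (0..1)
    hit_rate:   share of anonymous requests answered from the cache (0..1)
    """
    if not (0.0 <= anon_share <= 1.0 and 0.0 <= hit_rate <= 1.0):
        raise ValueError("shares and rates must be between 0 and 1")
    return 1.0 - anon_share * hit_rate


# A perfect cache with 90% anonymous traffic leaves a tenth of the load.
print(round(backend_load_fraction(0.9, 1.0), 3))  # 0.1
# A hypothetical 80% hit rate leaves noticeably more.
print(round(backend_load_fraction(0.9, 0.8), 3))  # 0.28
```

In practice the hit rate depends on how many distinct URLs you cache and how short the timeout is, so the real saving will be somewhat less than the rule of thumb suggests.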

I solved this problem by moving from an Apache 1.3 proxy to a hacked up lighttpd proxy using mod_magnet and memcached to cache anonymous pages, sharing the work of caching them between all the webservers, and greatly expanding how much content I could cache. Below the fold is a description of how it works, how to set it up for yourself, caveats, and ideas for future development.

Note: This works for me, but like all things computer, it may not work for you for no clear reason whatsoever. Always make backups and have a disaster recovery plan in place.

HOW IT WORKS: Using Daily Kos as an example, the lighttpd proxy server sits in front of the mod_perl Apache that runs Scoop. When a request comes in, lighttpd checks whether it carries a session cookie. If it does not, the visitor is anonymous, and lighttpd then checks whether the URI matches the pattern of cacheable URIs. If it matches, the request goes to a mod_magnet Lua script that queries the memcached server for the page. If the page is present, memcached returns it, and lighttpd gunzips it if necessary, sets the content type, and returns it to the user. If the page is not in memcached, the request proceeds to the backend. There the page is generated, and if the request is anonymous and fits the same URI pattern that lighttpd checks, the backend places a copy of the page into memcached before serving it to the user. While that page is live in memcached, any of the other webservers can retrieve it, saving them the work of regenerating it themselves.
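The flow above can be modeled in a few lines. This is only a sketch of the decision logic, not the actual lighttpd or Scoop code: a plain dict stands in for memcached (so nothing expires), the proxy lookup and the backend store are folded into one hypothetical function, and the cookie name, URI pattern, and key prefix are borrowed from the example configuration and script further down.

```python
import re
from typing import Callable, Dict, Tuple

# Assumed values, taken from the example configuration in this post.
SESSION_COOKIE = "mysite_session"
CACHEABLE = re.compile(r"^/story|^/main|^/comments")
KEY_PREFIX = "mysite-cacheuri_"


def handle_request(
    uri: str,
    cookie_header: str,
    cache: Dict[str, str],
    render: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (page, source), where source is 'proxy-cache' or 'backend'."""
    anonymous = SESSION_COOKIE not in cookie_header
    cacheable = anonymous and CACHEABLE.search(uri) is not None
    key = KEY_PREFIX + uri

    # Proxy side: only anonymous requests for cacheable URIs consult the cache.
    if cacheable:
        page = cache.get(key)
        if page is not None:
            return page, "proxy-cache"

    # Backend side: build the page, and store a copy if the same rules apply.
    page = render(uri)
    if cacheable:
        cache[key] = page
    return page, "backend"


cache: Dict[str, str] = {}
render = lambda uri: "<html>" + uri + "</html>"
print(handle_request("/story/1", "", cache, render)[1])                    # backend
print(handle_request("/story/1", "", cache, render)[1])                    # proxy-cache
print(handle_request("/story/1", "mysite_session=abc", cache, render)[1])  # backend
```

The important property is that both sides apply the same rules and build the same key. If they disagree, the proxy will look for pages the backend never stores.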

HOW TO SET UP: First, you’ll need to download lighttpd. Make sure you get the 1.4.x version, not the 1.5 development tree. Once you’ve downloaded lighttpd, apply the patch for 1.4.19 (in case you need it, there’s also a patch for 1.4.18). This patch contains both the modified mod_magnet with memcached functionality and the backported mod_deflate for compressing pages before sending them out. Compiling and configuring lighttpd is beyond the scope of this document, but the lighttpd docs will explain everything you’ll need. Make sure you enable PCRE, Lua, and memcache.

Once you have lighttpd compiled and set up, it’s time to configure the memcached bits. The example configuration below can be used as a starting point. Just adapt it to your setup and place it in your vhost’s configuration.

Example vhost configuration for mod_magnet and memcached

    # In this example, if the visitor does not have a session cookie, check
    # memcached to see if a copy of the page requested is there or not.
    $HTTP["cookie"] !~ ".*mysite_session" {
        # To simplify this example, we're using two memcached hosts
        # and any URLs that begin with /story, /main, or /comments
        magnet.memcache-hosts = ( "10.0.0.1:11211", "10.0.0.2:11211" )
        $HTTP["url"] =~ "^/story|^/main|^/comments" {
            magnet.attract-raw-url-to = ( "/etc/lighttpd/lua/mc_magnet_example.lua" )
        }
    }
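Before deploying patterns like these, it can help to try them against a few sample requests. The sketch below does that in Python under two assumptions: that these simple patterns behave the same in Python's `re` as in lighttpd's PCRE matching, and that a request with no Cookie header is compared as an empty string. The function name and the sample URLs are made up for illustration.

```python
import re

# The two patterns from the example vhost configuration.
SESSION_PATTERN = re.compile(r".*mysite_session")
URL_PATTERN = re.compile(r"^/story|^/main|^/comments")


def would_check_memcached(url: str, cookie: str) -> bool:
    """True if both conditions in the example configuration would match."""
    has_session = SESSION_PATTERN.search(cookie) is not None
    return (not has_session) and URL_PATTERN.search(url) is not None


samples = [
    ("/story/2008/5/1/123", ""),          # anonymous, cacheable
    ("/main", "mysite_session=abc123"),   # logged in, skipped
    ("/user/login", ""),                  # anonymous, not a cached URL
    ("/storyboard", ""),                  # also matches: the pattern is a prefix
]
for url, cookie in samples:
    print(url, would_check_memcached(url, cookie))
```

The last sample shows a point worth checking on your own site: `^/story` matches anything that starts with those characters, so tighten the pattern if unrelated URLs share a prefix.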

Then, you need to set up the lua script. It doesn’t need to do much, and I’ve provided an example lua script that does everything it needs to do. Make sure, of course, that you put it where the vhost configuration is going to look for it.

Example Lua/memcached script

-- looks like we can control what comes here from the lighty config, which
-- will be quite a bit easier than doing it in here I think
local content = mc_get("mysite-cacheuri_" .. lighty.env["request.orig-uri"])
if content then
    lighty.content = { content }
    -- This should only ever be text/html
    -- If you want it to be something else, sometimes, add some
    -- conditionals for it.
    lighty.header["Content-Type"] = "text/html"
    -- print("Getting anon page for " .. lighty.env["request.orig-uri"])
    return 200
-- else
--     print("No memcached love")
end

return nil

Finally, and this is the most important part, you need to modify your application to put the pages into memcached. How you do that varies between applications, but it needs to happen once the completed page has been built for output to the user. At that stage, place it into memcached with a name like “mysite-cacheuri_/the/url/requested” and an appropriate timeout. A timeout of 2 to 5 minutes is good. The memcached key can be anything you want, so long as the backend application and lighttpd use it consistently.

The patch supports gzipped memcached pages too, so you can compress very large pages. I would recommend setting the compression threshold fairly high, but you may need to adjust it to fit your needs. Remember that gzipping and gunzipping have overhead of their own. By default, memcached can only hold objects up to 1MB in size, so I compress pages above 768KB. Depending on how much memcached space you have, you may find a smaller value works better for you. There’s a tradeoff between space and gzip overhead, but either way it’s better than regenerating a page with hundreds and hundreds of comments each time it’s loaded because the full page doesn’t fit into memcached.
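As a rough sketch of the backend side, here is what that store step might look like in Python. The real code lives in your application and its memcached client, so everything here is illustrative: `FakeCache` is a hypothetical in-memory stand-in exposing a `set(key, value, ttl)` method, the constants mirror the values mentioned above, and the 1MB check ignores key and per-item overhead. How the patched lighttpd recognizes a compressed value is not covered here, so the sketch simply stores the gzip bytes.

```python
import gzip
from typing import Dict, Tuple

KEY_PREFIX = "mysite-cacheuri_"     # must match the key built in the Lua script
TTL_SECONDS = 180                   # inside the suggested 2 to 5 minute range
COMPRESS_THRESHOLD = 768 * 1024     # compress pages larger than 768KB
MAX_ITEM_SIZE = 1024 * 1024         # memcached's default 1MB object limit


class FakeCache:
    """Hypothetical in-memory stand-in for a memcached client."""

    def __init__(self) -> None:
        self.items: Dict[str, Tuple[bytes, int]] = {}

    def set(self, key: str, value: bytes, ttl: int) -> None:
        self.items[key] = (value, ttl)


def cache_anonymous_page(cache: FakeCache, uri: str, html: str) -> bool:
    """Store a finished page; return False if it is too large to cache."""
    body = html.encode("utf-8")
    if len(body) > COMPRESS_THRESHOLD:
        body = gzip.compress(body)
    if len(body) > MAX_ITEM_SIZE:
        return False  # still too big, so this page is regenerated each time
    cache.set(KEY_PREFIX + uri, body, TTL_SECONDS)
    return True


cache = FakeCache()
cache_anonymous_page(cache, "/story/1", "<html>short page</html>")
cache_anonymous_page(cache, "/story/2", "<p>comment</p>" * 100_000)  # about 1.4MB
print(sorted(cache.items))
```

Call something like this only for anonymous requests whose URL fits the same pattern lighttpd checks. Otherwise you spend memcached space on pages the proxy will never ask for.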

Once all that’s done, fire it up and test it out. If you want to see if it’s working, uncomment the debugging print lines in the lua script and check your error log for messages. If it’s working, your anonymous pages should be served out of memcached and much faster than before.

CAVEATS: Be careful with bandwidth. If you have a lot of traffic and pages being stored and fetched, you can generate a lot of traffic between memcached and your webservers. If you’re not careful, you can saturate your network or get stuck with a lot of bandwidth charges. If necessary, have one memcached instance per machine and only have the local webserver use it, or get a dedicated backend network set up for your webservers to communicate.
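To get a feel for the numbers, a back-of-the-envelope estimate is enough. The sketch below multiplies page transfers by average page size. The function name and the sample figures (500 fetches and 20 stores per second, 60KB pages) are hypothetical, and protocol overhead is ignored.

```python
def memcached_traffic_mbit(
    fetches_per_sec: float, stores_per_sec: float, avg_page_bytes: float
) -> float:
    """Approximate webserver <-> memcached traffic in megabits per second."""
    bytes_per_sec = (fetches_per_sec + stores_per_sec) * avg_page_bytes
    return bytes_per_sec * 8 / 1_000_000


# Hypothetical load: 500 cache hits and 20 stores per second, 60KB pages.
print(memcached_traffic_mbit(500, 20, 60_000))  # 249.6
```

With these made-up figures the cache traffic alone would be roughly 250 Mbit/s, which is more than a 100 Mbit link can carry and a sizeable share of a gigabit one.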

THE FUTURE: The obvious thing to do is to move this functionality out of mod_magnet and get rid of the lua dependency. All the lua script does here is retrieve the content and set the content type. I’ve been working on a standalone module using libmemcached rather than libmemcache for this, but haven’t had time to work on it much. Once that module is done, it will need to be ported to lighttpd 1.5.

This configuration has helped Daily Kos out a lot, but it isn’t for everyone. If you think the complexity and setup are worth it for your organization or site, though, give it a whirl. I hope that it can help others as much as it’s helped me.