time to pull the plug

This is a subtitle. There are many like it, but this one is here.

New mod_mcpage patch

If it seems like all I'm writing about right now is mod_mcpage, well, it's what I've been doing the most work on. Thanks to the helpful folks in #lighttpd on freenode, who made some suggestions and found some problems, there have been some changes. I've made a new patch for mod_mcpage, this time against the svn branch of lighttpd 1.4.x, although it should apply against 1.4.19 and 1.4.20 as well. However, if you apply this patch against an earlier release (which I haven't actually tested yet), you would at minimum need to rerun autoreconf -fi, and possibly take other build steps.

New, improved version of mod_mcpage

Update: And that's what I get for not upgrading to the latest version of libmemcached before making the patch; MEMCACHED_HASH_KETAMA was dropped from libmemcached a little while ago, and the previous version of the patch didn't reflect that. A new version of the patch has been uploaded.

Updated Update: Slightly updated patch uploaded again - a small bug turned up in production use that didn't show up in testing, where pages would occasionally be stored in memcached when they shouldn't have been, despite never actually being served up from memcached down the road. The latest version fixes this.

You can get it from here. The patch applies to lighttpd 1.4.19 (I haven't tried it against 1.4.20 yet, but I'd be surprised if it had much trouble). There's a README that explains what it does and how to set it up in more detail, but here's the short version:

With the older version of mod_mcpage, the backend application was responsible for placing the page into memcached, gzipping it if necessary, so that lighttpd + mod_mcpage could serve it up directly without having to hit the backend at all. Now, no backend changes are required -- mod_mcpage will place pages into memcached along with their content types (so it's no longer "text/html" only), compress them if the page would otherwise be too large, and serve up pages from memcached. If lighttpd + mod_mcpage finds a page in memcached, it will decompress the page if necessary, set the content type, and return the page to the user. In production use, serving pages to anonymous users out of memcached has been a massive help for Daily Kos. This new version should simplify things a bit by not requiring the backend app and lighttpd to coordinate the page caching. Note: I have only used this with lighttpd running as a proxy in front of a webserver running a backend app (Scoop in this case). Also, mod_mcpage uses libmemcached, not the older libmemcache. Make sure libmemcached is installed before you try to install this.

TODO:

As always, there's more stuff to do:

- At the moment, expiration headers aren't set. That might be nice, although it's probably of somewhat limited utility with dynamic pages that are likely to change soon.

- The plugin only supports reading in write_queue content that's in memory, not from a file. This works fine with content returned by mod_proxy, which is what I wanted it for, but it does not currently support putting files that would be served from disk into memcached. In my opinion, that wouldn't be particularly useful, but it might be desirable behavior; small, frequently loaded files can be placed in a tmpfs mount and read out of memory that way without the overhead of memcached.

- Still more memcached behaviors could be set from the configuration file, but aren't.

- Only text files are supported currently - binary data gets garbled up. This should be fixed, but the workaround is to exclude those from being served by memcached (and why are you proxying them anyway?).

Lessons from mod_mcpage

I made a bunch of progress over the weekend on mod_mcpage, and now the plugin is placing pages into memcached as well as retrieving them, and setting the content type correctly. It's pretty cool, although there are still some rough edges to wear down, and I'll need to run it through valgrind to make sure there aren't horrid memleaks somewhere. While I was working on getting mod_mcpage to put pages into memcached, I also figured out some stuff that seems painfully obvious now.

Getting the content lighttpd's returning from a request

One of the things I want to do with the mod_mcpage plugin for lighttpd is have it be able to place pages into memcached itself, along with being able to serve pages out of memcached, but it was never very clear how to get a hold of the page content being returned. It was clearly in con->write_queue, but it took me a while to figure out how to read it out. Once I found it, though, it's not very hard.

Note: Here, I'm working with content being sent back by mod_proxy, which arrives in chunks held in memory. Files are chunked in a different fashion, but support for them could easily be added. This code is adapted from network_write_chunkqueue_linuxsendfile in network_linux_sendfile.c.

First, the plugin needs a handler attached to handle_subrequest; for mod_mcpage, I added p->handle_subrequest = mod_mcpage_handle_subrequest; to mod_mcpage_plugin_init. This subrequest handler will run as mod_proxy sends content back.
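
For reference, plugin_init in a lighttpd 1.4 module is just the function that fills in the plugin's hook slots. Here's a minimal sketch, modeled on the stock modules (the mcpage-specific helper names are illustrative, not necessarily what's in the patch):


int mod_mcpage_plugin_init(plugin *p) {
    p->version           = LIGHTTPD_VERSION_ID;
    p->name              = buffer_init_string("mcpage");

    p->init              = mod_mcpage_init;              /* allocate the plugin's data */
    p->handle_subrequest = mod_mcpage_handle_subrequest; /* runs as mod_proxy sends content back */
    p->cleanup           = mod_mcpage_free;              /* tear the plugin's data down */

    p->data = NULL;
    return 0;
}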

Then, there needs to be a handler_ctx struct holding the connection-specific buffers for storing the content type and the page.
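
Mine holds a buffer for each, along the lines of this sketch (the init and free helpers follow the usual lighttpd plugin pattern):


typedef struct {
    buffer *content_type; /* content type the backend sent */
    buffer *outpg;        /* the page itself */
} handler_ctx;

static handler_ctx * handler_ctx_init(void) {
    handler_ctx *hctx = calloc(1, sizeof(*hctx));
    hctx->content_type = buffer_init();
    hctx->outpg = buffer_init();
    return hctx;
}

static void handler_ctx_free(handler_ctx *hctx) {
    buffer_free(hctx->content_type);
    buffer_free(hctx->outpg);
    free(hctx);
}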

Now, as the request proceeds, the subrequest handler gets called repeatedly. Each time it's called, it checks to see whether the file is finished and mod_proxy is done. When it is, it walks through the chunks of the file and appends them to a char pointer, which is then placed into the handler_ctx struct.


char *cret = malloc(1);
size_t crlen = 0;
chunk *c;
*cret = '\0'; /* freshly malloc'd memory isn't a string yet, and strcat needs one to append to */
if (con->file_finished == 1) {
    for (c = con->write_queue->first; c; c = c->next) {
        crlen += c->mem->used; /* buffer->used counts the trailing '\0' */
        cret = realloc(cret, crlen);
        strcat(cret, c->mem->ptr);
    }
}
if (crlen != 0)
    buffer_copy_string(hctx->outpg, cret);
free(cret);

Now the page returned by mod_proxy is in hctx->outpg, where you could do whatever you want with it.
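
And "whatever you want" here ultimately means storing it. As a rough sketch of that step (assuming a memcached_st *memc set up elsewhere in the plugin and a key based on the request path -- both illustrative assumptions, not necessarily how mod_mcpage does it):


/* needs #include <libmemcached/memcached.h> at the top of the file */
if (con->file_finished == 1 && hctx->outpg->used) {
    memcached_return rc;
    rc = memcached_set(memc,
            con->uri.path->ptr, con->uri.path->used - 1, /* key: the request path */
            hctx->outpg->ptr, hctx->outpg->used - 1,     /* value: the page, minus the trailing '\0' */
            600, /* expiration in seconds -- purely illustrative */
            0);  /* flags */
    if (rc != MEMCACHED_SUCCESS) {
        /* a failed store just means a cache miss later; don't fail the request */
    }
}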

Amusingly enough, while I was writing this up, I realized that I was doing it wrong by not waiting until the file was finished to get the file chunks, yet managed to get it right by screwing up and using strcpy instead of strcat. Oops.  In fact, now I see that a bunch of stuff can be shifted around and moved inside of that conditional that checks if the file is finished. Fortunately I'm still at a point where I only just figured out how to get at the page being returned, so I'm still filling out the rest of the pieces.

Update -- Further thoughts: Having the handler_ctx struct isn't actually necessary, now that the page data doesn't have to persist across calls to the subrequest handler. It could just be a pair of char pointers to hold the content type and the page. I'm a little torn, though, because using the handler_ctx struct might work better for memory management since it has its own init and free functions to handle taking it down properly, and the free function can be called if the connection gets reset. I'm also wondering now if using memcpy would be better than strcat. I don't have any particular plan to put non-text files into memcached, but someone else might. Bears investigating.
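
If I go that route, the chunk walk above would shift from string functions to an explicit offset, something like this untested sketch (this is what binary-safe appending would need, since memcpy doesn't stop at '\0' the way strcat does):


size_t off = 0;
char *cret = NULL;
chunk *c;
for (c = con->write_queue->first; c; c = c->next) {
    size_t len = c->mem->used ? c->mem->used - 1 : 0; /* drop the trailing '\0' lighttpd buffers carry */
    cret = realloc(cret, off + len);
    memcpy(cret + off, c->mem->ptr, len);
    off += len; /* off ends up as the true byte count, embedded nulls and all */
}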

Handy one liners: counting lines of code

Just a little one liner to count lines of code in your source tree. This assumes you're working with perl - adjust the 'iname' argument as necessary. It also isn't smart enough to skip over pod documentation, but does skip over blank lines and lines that are only comments (but doesn't skip over lines with comments after actual code, or $#array variables). This should work fine with python and ruby, but probably not real well with Java, C, or PL/I.

find . -iname '*.pm' -print | xargs perl -e 'while (<>) { next if /^\s*$|^\s*#/; print $_; }' | wc -l

Oh, this also assumes you work in a Unixy environment. That one liner won't be too useful in Windows unless you have Cygwin installed. The results it returns won't be entirely accurate (because of the issue with pod documentation), but should give a reasonably close to correct figure.

DKos API: Story Posting now working on dev box, further thoughts

A story posting method's working on the dev box now, along with the proper permission checks and everything. However, there's a few little things that have to be worked out before it can be even vaguely public, and I'm still kind of dragging along here with this cold that's lasted a week and a half. Blugh. It's a nice demonstration that posting stuff can work through this API without too much hassle, but there's still some things that have to be taken care of before it becomes practical.

DKos API Authentication Scheme

This isn't active here yet, but I've settled on an authentication scheme for the API. It's waiting on me cleaning up the Perl module that will act as a reference client implementation and finishing the scoop.stories.post method, which will be the first thing you'd actually need authentication for. It took some thinking to come up with a way to handle authentication via the API, because of certain details of Scoop's native authentication measures.

DK Tech Preview: a RESTful API

Someone emailed me a while back asking me if DKos had an API for people to access the site programmatically. It had never occurred to me to have something like that before, but I can see how it would be handy. I also wrote a perl client for Voxel's hAPI for some of their services, so I had been thinking about it again over the last couple of weeks. I'm still feeling kind of achy from being sick since last week, so I decided to start working on one (since hopefully I wouldn't have to think too hard), using Voxel's hAPI as inspiration.

Working on mod_mcpage - Stupid Programming Tricks Edition

One of the things I want to do with mod_mcpage is move placing pages into memcached out of the backend application and into lighttpd, so it's all handled there. To do so, the page's content type will need to be stored along with the page in memcached. The easiest way to store it was to just put the content type at the beginning of the string to store, but I had to think for a bit to figure out the best way. I came up with two. Each involves jumping through some hoops, but at different points.

The First Way

This is probably the more correct way to store the content type, but it requires somewhat fancier footwork further down the line. Here, we store the content type and the page together, with a null byte separating them (and a second one terminating the whole thing).


char *content_type = "text/plain";
char *page = "Some text and whatnot";

char *store = malloc(strlen(content_type) + strlen(page) + 2);

char *i; /* We'll want this later to reset the pointer */
i = store;

strcpy(store, content_type); /* strcpy, not strcat -- freshly malloc'd memory isn't a string yet */
store += strlen(content_type) + 1; /* skip past the '\0' strcpy wrote */
strcpy(store, page);

/* Reset *store back to the beginning */

store = i;

The hoop jumping here is that strlen() won't work anymore, since there's a null byte in the middle of the string, so you'll need to get the length of the string to be stored in two steps. You'll also need to extract the content type and the page in the same manner. Below, we'll go ahead and do both at the same time to illustrate.


size_t contype_len = strlen(store);
i = store; /* Using char *i from our earlier example - if you want to free the stored string, you'll need the pointer somewhere. */

char *contype = malloc(contype_len + 1);
strcpy(contype, store); /* Now we have the content type in its own string */

/* Advance *store past the end of the content type. */

store += contype_len + 1;

/* Now let's get the length of the page, and the overall length of the stored string */
size_t page_len = strlen(store);
size_t overall_len = contype_len + page_len + 2;

/* Now we have both the content type and the original page (accessed from *store) so we can use them. */

... do stuff ...

/* Once we're done, we can free them up. This is why we saved the pointer in i -- freeing *store after it's been advanced will lead to Bad Things. */

free(i);
free(contype);

Again, pretty straightforward. The downside is that if you want the length of the string to store, you need to make sure you count both null-terminated strings inside of it.
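
Concretely, with store pointing at the front of the block, counting both strings looks like this:


size_t ct_len = strlen(store);              /* counts up to the separator '\0' */
size_t pg_len = strlen(store + ct_len + 1); /* the page, just past that '\0' */
size_t to_store = ct_len + pg_len + 2;      /* both strings plus both null bytes */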

The Second Way

This was actually the first way I thought of, but it's probably the less correct way to do it. However, it avoids having null bytes in the middle of the string. With this method, we store the length of the content type in one and a half chars at the beginning of the string we're storing. Using one and a half chars does limit the possible content type length to 4095 bytes (the first char is ORed with 0xF0 to avoid a null byte at the beginning of the string). The bit manipulation here may also cause problems with endianness - if I end up using this way, I'll have to set up Debian/390 under Hercules and see what happens.


char *content_type = "text/plain";
char *page = "Some text and whatnot";

/* Two length chars + both strings + the trailing '\0' */
char *store = malloc(strlen(content_type) + strlen(page) + 3);

/* Get the content type length and transfer it to the chars. */

#define MAX_CONTENT_TYPE_LEN 4095
size_t c = strlen(content_type);
/* Make sure that the length of the content type string isn't over 4K - 1. Unlikely, but you never know. */
if(c > MAX_CONTENT_TYPE_LEN){
   ... handle the error ...
   }

unsigned char b, d;
unsigned int fl = 0x00000F00;
unsigned int fg = 0x000000FF;
b = (c & fl) >> 8; /* mask out everything but the second byte, shift right by 8 bits so the second byte becomes the first, and assign to b. */
d = c & fg; /* mask out everything but the first byte. (Note this byte can still be zero if the length is a multiple of 256.) */
b |= 0xF0; /* Don't want to end up with a null byte there. */

*store++ = b;
*store++ = d; /* d, not c -- we want the low byte computed above */
strcpy(store, content_type); /* strcpy, not strcat -- freshly malloc'd memory isn't a string yet */
strcat(store, page);
store -= 2; /* get it back down to where we started. */

So now the string is ready for storage. What do we do when we want to use it?


char *i; /* For storing *store's starting point for later */
i = store;

unsigned char n = *store++;
unsigned char m = *store++;
n &= 0x0F; /* clear the bits in the top half of that byte that were there to prevent having a null byte. */

unsigned int clen = 0;
clen = n << 8;
clen |= m;

int y;
char *content_type = malloc(clen + 1);
char *bb;
bb = content_type;
for(y = 0; y < clen; y++)
   *content_type++ = *store++;
*content_type = '\0';
content_type = bb;

/* Now we have the content type string and the page (accessed, as in the previous example, from *store). */

.... do stuff ....

/* We remembered to save the stored string's pointer in i earlier, because we need it to free our memory. */
free(i);
free(content_type);

Those were the two ways I'd come up with to handle storing the content type. The second one occurred to me because I wanted to avoid dealing with embedded nulls, but it's also pretty complicated. The first requires remembering to handle the null bytes, but is a lot less complicated (and can handle theoretical content type strings longer than 4K, should they come up). I'm planning on using the first method for storing the pages in memcached, but I'm keeping the second in reserve should there end up being some overriding reason. Also, it's so ridiculous, I couldn't help but share it with the world (although the world will likely point and laugh, and I'll deserve it).