Hooking a CDN up to S3

As I wrote in my last post, we’ve been going through the process of adding a content delivery network (PantherExpress) to SlideShare. Since there isn’t a lot of information on the net about how to use a CDN to accelerate content that is hosted on S3, I thought I’d publish it here.
The way you integrate with any CDN is pretty much the same.
1) Create a subdomain (e.g. static.slideshare.net) for your domain.
2) Point that subdomain to your CDN (e.g. 132.pantherexpress.com) using a CNAME entry in your DNS.
3) Configure the CDN to know that the “origin server” they should use is your amazon S3 bucket.
4) When the CDN gets a live request for a piece of content, it serves the content if it has it in its cache. Otherwise, it fetches the content from the origin server and then serves it.
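The steps above can be sketched as a toy model. The hostnames are the examples from this post; the IP address, record values, and file contents are made up for illustration.

```python
# Toy model of the CDN integration described above.
DNS = {
    # Steps 1+2: the subdomain CNAMEs to the CDN's hostname.
    "static.slideshare.net": ("CNAME", "132.pantherexpress.com"),
    "132.pantherexpress.com": ("A", "203.0.113.10"),  # a hypothetical CDN edge node
}

def resolve(name):
    """Follow CNAME records until an A record is reached."""
    while True:
        rtype, value = DNS[name]
        if rtype == "A":
            return value
        name = value  # follow the CNAME

# Step 3: the origin server stands in for the S3 bucket.
ORIGIN = {"/logo.png": b"<png bytes>"}
edge_cache = {}

def serve(path):
    """Step 4: serve from cache, else fetch from the origin and cache it."""
    if path not in edge_cache:
        edge_cache[path] = ORIGIN[path]  # cache miss: pull from S3
    return edge_cache[path]

assert resolve("static.slideshare.net") == "203.0.113.10"
assert serve("/logo.png") == b"<png bytes>"  # first request: fetched from origin
assert "/logo.png" in edge_cache             # now cached for next time
```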
This is great if you’re starting out. But what if you’ve already launched? You NEED to be able to try out your CDN integration, and then quickly back it out if it isn’t working. There are two ways to go here.
1) If you’ve thought ahead and named your S3 bucket after your subdomain (e.g. static.slideshare.net), then you can point your CNAME entry at the bucket. To switch to your CDN, change the CNAME entry. If there’s a problem, switch back. The switch will take however long your “time to live” (TTL) is set to in DNS.
2) OTOH, you probably WEREN’T forward-thinking enough to name your S3 bucket after your subdomain. In this case (the normal case), you have to make sure your webapp is written so that you can quickly change the location where it expects to find external content. We didn’t do that (we had hard-coded the S3 bucket url into the code), so we had to externalize that into a property file that could be easily edited.
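Here’s a minimal sketch of what externalizing that setting might look like, so that switching between S3 and the CDN is a config edit rather than a code change. The file name, section, and key are made up for illustration (our actual app isn’t written in Python).

```python
# Write a sample property file, then read the static-content host from it.
import configparser

with open("assets.properties", "w") as f:
    f.write("[assets]\nstatic_host = static.slideshare.net\n")

config = configparser.ConfigParser()
config.read("assets.properties")

def asset_url(path):
    # Every asset URL is built from the configured host; nothing hard-coded.
    return "http://%s/%s" % (config["assets"]["static_host"], path.lstrip("/"))

print(asset_url("images/logo.png"))
# -> http://static.slideshare.net/images/logo.png
```

To back out the CDN, you’d edit `static_host` back to the S3 bucket’s hostname and reload the config, with no code deploy needed.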
So far we’ve accelerated our thumbnails, our javascript / css / navigation images, and all our content players. Still to come is the actual content (the slides themselves). Measured load times on media-heavy pages have dropped from 10 seconds to 4 seconds. I’m hopeful that we can use the CDN to accelerate our slides as well.

Panther Express and S3

One thing I’ve been working on in the last month is accelerating the serving of SlideShare content using a “content delivery network” (or CDN). You use a CDN so your content can be cached in RAM, in a place that is geographically near your customers, instead of on disk, in a place that is far away from your customer. This makes a BIG difference in terms of page load time. There isn’t much on the net about hooking a CDN up to Amazon S3, so here’s what I learned:
Frankly, the process of shopping for a CDN vendor is *really* annoying, especially for someone who has become used to buying these cloud-based services like S3 that are priced openly and on the basis of usage. The process is very “enterprise procurement”: lots of high-pressure salesmen trying to get you to sign two-year contracts, and with no price transparency. One way to win is to get them to bid against each other. But the whole thing feels like an unnecessary amount of work.
Fortunately, we found a company that had transparent pricing that seemed fair to us, and that wasn’t about locking us into a long-term contract: PantherExpress! Their pricing is standardized, is per-gigabyte, and gets cheaper the more you use it. Given that Amazon doesn’t provide a CDN, this is the next best thing for serving up content fast. It costs $.28/GB for the first 8 TB/month, $.24/GB for the next 8 TB, and so on. More expensive than Amazon, but a decent price for global content delivery.
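To make the tiered pricing concrete, here’s the arithmetic for a sample month. The post only quotes the first two tiers ($.28/GB for the first 8 TB, $.24/GB for the next 8 TB), so this sketch stops there; the deeper tiers are unknown.

```python
# Tiered CDN pricing: (tier size in GB, price per GB). Only the two
# published tiers are modeled; usage beyond 16 TB raises an error.
TIERS = [(8192, 0.28), (8192, 0.24)]

def monthly_cost(gb):
    total, remaining = 0.0, gb
    for size, price in TIERS:
        chunk = min(remaining, size)
        total += chunk * price
        remaining -= chunk
        if remaining <= 0:
            break
    assert remaining <= 0, "usage beyond the published tiers"
    return total

print(round(monthly_cost(1024), 2))   # 1 TB/month  -> 286.72
print(round(monthly_cost(10240), 2))  # 10 TB/month -> 2785.28
```

So a 10 TB month costs 8192 GB at the first-tier rate plus 2048 GB at the second-tier rate, roughly $2,785.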
Hooking PantherExpress up to S3 was pretty easy, and I imagine the same procedure would work with other CDNs. I’ll cover that in my next post.

Tufte joins SlideShare advisory board

I’m really pleased to announce a new member of the SlideShare advisory board. Edward Tufte, author of such seminal works as “The Visual Display of Quantitative Information”, has agreed to join the SlideShare team.
Tufte’s work on presentation design is obviously especially relevant to us. His critique of current approaches to presentations (The cognitive style of PowerPoint) was a major driver of the new styles of presentation that have cropped up in the last few years.
This will result in some changes to the SlideShare experience as you currently know it. Most importantly, we’re implementing some filters that will block the most egregious examples of PowerPoint abuse from our system. You can read the official announcement for the whole story.

SlideShare down due to ServePath outage

SlideShare is down and has been down for the last several hours. Our dedicated hosting provider (ServePath) is experiencing catastrophic network problems. For a while we were able to keep the site live by pointing our DNS to specific servers that were available on the network: this strategy is no longer working (the paths that still work on the network are changing as ServePath technicians try to fix the problem).
My sincere apologies to all SlideShare users. We’ll be taking stock once this outage is resolved, and we’ll evaluate what to do long-term at that point. Right now there’s not much we can do besides wait for the network to get back online.
UPDATE: as of 7:45, we seem to be back in business! Doing the happy dance (and checking the servers every 5 minutes to make sure we can still get to them).

AJAX and Flash mistakes

I gave a little talk on “AJAX and Flash mistakes: lessons learned building SlideShare” at SXSW08 last weekend. The audience was great, and the conference in general was a helluva lot of fun (as always).
I talked a lot about the interplay between AJAX and server-side performance … how AJAX often seems like a performance solution, but often introduces new problems. I also talked about the need to design entire processes, rather than simple modal dialogs (which are so easy to design that your design skills for building multi-panel flows can get quite rusty if you aren’t careful).

Simple DB: the final piece of the puzzle falls into place

Amazon just announced “SimpleDB”, which sounds a lot like the rumored “SDS” or “Simple Database Service” that we’ve all been waiting for.
This is huge: the single biggest thing stopping you from running a webapp on EC2 is the fact that there’s nowhere safe for your database to live. EC2 is a virtual hosting service, so if a machine crashes and is rebooted, any data written to the hard drive simply disappears. Not good. As a result, EC2 was framed as a great solution for back-end processing (think transcoding videos for youtube), but not a great fit for an entire web application.
Solutions for this problem (including continually backing up your database to S3) were never very convincing. But it was always clear that SOME major initiative to solve it was planned.
Now we know. This isn’t a vanilla mysql clustering service: it’s something a little weirder (it’s conceptually similar to a database, but lacks many of the features of a database, and works somewhat differently). As a result, you’ll have to build your app from the ground up as an Amazon app: this isn’t a drop-in replacement for mysql cluster.
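To give a flavor of how different the model is, here’s a toy, in-memory illustration of a SimpleDB-style data store: no schema, no joins, and attributes that can hold multiple string values. This is a conceptual sketch only, not the real API, and all the names and data are invented.

```python
from collections import defaultdict

class Domain:
    """Rough analogue of a SimpleDB 'domain' (loosely, a table)."""
    def __init__(self):
        # item name -> attribute name -> set of string values
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item, attr, value):
        # Attributes are multi-valued; there's no schema to declare up front.
        self.items[item][attr].add(value)

    def query(self, attr, value):
        # Lookups scan attribute values; there are no joins across domains.
        return [i for i, attrs in self.items.items() if value in attrs[attr]]

slides = Domain()
slides.put("deck-1", "tag", "ajax")
slides.put("deck-1", "tag", "flash")  # same attribute, second value
slides.put("deck-2", "tag", "ajax")

assert sorted(slides.query("tag", "ajax")) == ["deck-1", "deck-2"]
assert slides.query("tag", "flash") == ["deck-1"]
```

Notice what’s missing compared to MySQL: no schemas, no joins, no transactions. That’s why you have to design for it from the ground up.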
But the benefits are potentially huge. Imagine you’re building a facebook application. You could use SimpleDB, EC2, and S3 to provide the backend, and pay very little in infrastructure costs until you actually started getting real traction. Your system would transparently scale (simply add more EC2 nodes as web/app servers as your server load increases), and you would never, ever have to worry about the huge P.I.T.A. (pain in the ass) that is setting up a database cluster, designing schemas for federating data across multiple databases, etc.
There’s never been a better time to be a software entrepreneur. Amazon has once again lowered the upfront cost of starting up a new web business, and at the same time dramatically increased the number of use cases that their other services can be used for.
Coverage from techcrunch, and gigaom here. Marcelo Calbucci frames the service as a “directory service rather than a database service”.

Using a CDN with S3

I’ve started shopping for a content delivery network for SlideShare. It’s a market with pretty opaque pricing: if you’re making the jump to using a CDN for the first time it’s not easy to get a real sense of what monthly costs will be.
Conceptually, integration between a CDN and Amazon S3 is pretty straightforward. Here are the basic steps:
1) Dedicate a subdomain (say static.slideshare.net) to serving up all the content you want to serve via the delivery network.
2) Make a CNAME entry in your DNS to tell traffic going to that subdomain to go to your CDN instead.
3) Tell the CDN which bucket on Amazon S3 you’re saving your static content in.
The CDN receives the request for content at a geographically local server (so Europeans hit a node in Europe, Asians hit a node in Asia, etc). The node first looks in its own (in-memory) cache. If it doesn’t find the requested content, it fetches it from S3 and saves it so that it will have it cached for next time. How long content is cached is typically configurable, and APIs are typically provided that allow you to flush the cache.
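Here’s a minimal sketch of that edge-node behavior, with a configurable TTL and a flush operation standing in for the purge APIs CDNs provide. The class, origin data, and TTL value are all made up for illustration.

```python
import time

class EdgeCache:
    def __init__(self, fetch_from_origin, ttl=3600):
        self.fetch = fetch_from_origin
        self.ttl = ttl        # how long to cache, in seconds (configurable)
        self.store = {}       # path -> (content, time fetched)

    def get(self, path, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(path)
        if entry and now - entry[1] < self.ttl:
            return entry[0]          # cache hit: serve from memory
        content = self.fetch(path)   # miss or expired: go to the origin (S3)
        self.store[path] = (content, now)
        return content

    def flush(self, path):
        self.store.pop(path, None)   # force the next request to the origin

origin = {"/deck.swf": b"swf-bytes"}          # stands in for the S3 bucket
cache = EdgeCache(origin.__getitem__, ttl=60)

assert cache.get("/deck.swf", now=0) == b"swf-bytes"   # miss: origin fetch
assert cache.get("/deck.swf", now=30) == b"swf-bytes"  # hit within the TTL
cache.flush("/deck.swf")
assert "/deck.swf" not in cache.store                  # purged from the cache
```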
In my investigations so far the following companies have turned up as potential vendors:
Akamai (the biggest company in the space)
Limelight is another big contender (famously used by Youtube and other Web 2.0 video companies)
PantherExpress is a smaller contender. I’ve had the most conversations with these guys.
Level 3 is interesting in that they’ve recently announced that they’ll be selling CDN bandwidth at normal bandwidth rates. I haven’t talked to them yet, but I probably should. ;->
CDNetworks
Internap
Peer1
EdgeCast
If anyone has any other recommendations for vendors I should check out, feel free to reply on this post! Frankly I really wish Amazon would just provide this as a service on top of S3: that way we wouldn’t have to change any of our code at all! Unfortunately, it doesn’t seem like this is going to happen in the near future.

MediaWiki SlideShare extension

Sergey Chernyshev just released a much-needed piece of code last week: an extension that makes it easy to embed slideshare slideshows into MediaWiki, the open-source wiki software that powers wikipedia.
This is pretty huge: for an organization trying to build a knowledge repository, easy integration between wiki content and social document sharing is really important. A good example of how this can be used can be found on Sergey’s site TechPresentations.org, which archives presentations from tech conferences worldwide.
A company that wanted to run a private mediawiki could even upload slideshows to slideshare, not share them publicly, and embed them into their corporate wiki. This would provide a wiki that supported embedded office documents, which would be a killer knowledge-management tool.
Just like Chris Hellman’s slideshare ego widget, this mashup does its work without using our API. I’m reminded that RSS and embed codes are powerful integration points with any system. It’s easy to forget that a lot of the time, a formal REST API isn’t even necessary in order to build a mashup!