On Friday, we migrated Soundslice from Heroku to direct use of Amazon Web Services (AWS). I'm very, very happy with this change and want to spread the word about how we did it and why you should consider it if you're in a similar position.
My Heroku experience
Soundslice had been on Heroku since the site launched in November 2012. I decided to use it for a few reasons:
- Being a sysadmin is not my thing. I don't enjoy it, and I'm not particularly good at it.
- Soundslice is a two-man operation (developer and designer), and my time is much better spent working on the product than doing sysadmin work.
- Heroku had the promise of easy setup and easy scaling in cases of high traffic.
While I was getting Soundslice up and running on Heroku, I ran into problems immediately. For one, their automatic detection of Python/Django didn't work. I had to rejigger my code four or five times ("Should settings.py go in this directory? In a subdirectory? In a sub-subdirectory?") in order for it to pick up my app -- and this auto-detection stuff is the kind of thing that's very hard to debug.
Then I spent a full day and a half (!) trying to get Django error emails working. I verified that the server could send email, and all the necessary code worked from the Python shell, but errors just didn't get emailed out from the app for some reason. I never did figure out the problem -- I ended up punting and using Sentry/Raven (highly recommended).
These experiences, along with a few other oddities, made me weary of Heroku, but I kept with it.
To its credit, Heroku handled the Soundslice launch well, with no issues -- and using heroku:ps scale from the command line was super cool. In December, Soundslice made it to the Reddit homepage and 350,000 people visited the site in a period of a few hours. Heroku handled it nicely, after I scaled up the number of dynos.
But over the next few months, I got burned a few more times.
First, in January, they broke deployment. Whenever I tried to deploy, I got ugly error messages. I ended up routing around their bug by installing a different "buildpack" thanks to a tip from Jacob, but this left a sour taste in my mouth.
Then, one April evening, I deployed my app, and Heroku decided to upgrade the Python version during the deploy, from 2.7.3 to 2.7.4. (In itself, that's vaguely upsetting, as I didn't request an upgrade. But my app code worked just as well on the new version, so I was OK with it.) When the deployment was done, my site was completely down -- a HARD failure with a very ugly Heroku error message being shown to my users. I had no idea what happened. I raced through my recent commits, looking for problems. I looked at the Heroku log output, and it just said some stuff about my "soundslice" package not being found. I ran the site locally to make sure it was working. It was working fine. I had deployed successfully earlier in the day, and I had made no fundamental changes to package layout.
After several minutes of this futzing around, with the site being completely down, after I had just sent the link to some potential partners who, for all I know, were evaluating the site that very moment -- I deployed again and the site worked again. So it was nothing on my end. Clearly just something busted with the Heroku deployment process.
That's when Heroku lost my trust. From then on, whenever I deployed, I got a little nervous that something bad would happen, out of my control.
Around the same time, Soundslice began using some Python modules with compiled C extensions and other various non-Python code that was not deployable on Heroku with their standard requirements.txt process. Heroku offers a way to compile and package binaries, which I used successfully, but it was more work using that proprietary process than running a simple apt-get command on a server I had root access to.
With all of this, I decided it was time to leave Heroku. I'm still using Heroku for this blog, and I might use it in the future for small/throwaway projects, but I personally wouldn't recommend using it for anything more substantial. Especially now that I know how easy it is to get a powerful AWS stack running.
My AWS setup
I'm lucky to be friends with Scott VanDenPlas, who was director of dev ops for the Obama reelection tech team -- you know, the one that got a ton of attention for being awesome. Scott helped me set up a fantastic infrastructure for Soundslice on AWS. Despite having used Amazon S3 and EC2 a fair amount over the years, I had no idea how powerful Amazon's full suite of services really were until Scott showed me. Unsolicited advertisement: You should definitely hire Scott if you need any AWS work done. He's one of the very best.
The way we set up Soundslice is relatively simple. We made a custom AMI with our code/dependencies, then set up an Elastic Load Balancer with auto-scaling rules that instantiate app servers from that AMI based on load. I also converted the app to use MySQL. In detail:
Step 1: "Bake" an AMI. I grabbed an existing vanilla Ubuntu AMI (basically a frozen image of a Linux box) and installed the various packages Soundslice needs with apt-get and pip. I also compiled a few bits of code I needed that aren't in apt-get, and I got our app's code on there by cloning our Git repository. After that instance had all my code/dependencies on it, I created an AMI from it ("Create Image (EBS AMI)" in the EC2 dashboard).
Step 2: Set up auto-scaling rules. This is the real magic. We configured a load balancer (using Amazon ELB) to automatically spawn app servers based on load. This involves setting up things called "Launch configurations" and "scaling policies" and "metric alarms." Check out my Python code here to see the details. Basically, Amazon constantly monitors the app servers, and if any of them reaches a certain CPU usage, Amazon will automatically launch X new server(s) and associate them with the load balancer when they're up and running. Same thing applies if traffic levels go down and you need to terminate an instance or two. It's awesome.
Step 3: Change app not to use shared cache. Up until the AWS migration, Soundslice used memcache for Django session data. This introduces a few wrinkles in an auto-scaled environment, because it means each server needs access to a common memcache instance. Rather than have to deal with that, I changed the app to use cookie-based sessions, so that session data is stored in signed cookies rather than in memcache. This way, the web app servers don't need to share any state (other than the database). Plus it's faster for end users because the app doesn't have to hit memcache for session data.
Step 4: Migrate to MySQL. Eeeek, I know. I have been a die-hard PostgreSQL fan since Frank Wiles showed me the light circa 2003. But the only way to use Postgres on AWS is to do the maintenance/scaling yourself...and my distaste for doing sysadmin work is greater than my distate for MySQL. :-) Amazon offers RDS, which is basically hosted MySQL, with point-and-click replication. I fell in love with it the moment I scaled it from one to two availability zones with a couple of clicks on the AWS admin console. The simplicity is amazing. (UPDATE, spring 2014: Amazon now supports PostgreSQL in its RDS product!)
Step 5: Add nice API with Fabric. Deployment was stupidly simple with Heroku, but it's easy to make it equally simple using a custom AWS environment -- I just had to do some upfront work by writing Fabric tasks. The key is, because you don't know how many servers you have at a given moment, or what their host names are, you query the Amazon API (using the excellent boto library) to get the hostnames dynamically. See here for the relevant parts of my fabfile.
Ongoing: Update AMI as needed. Whenever there's a new bit of code that my app needs -- say, a new apt-get package -- I make a one-off instance of the AMI, install the package, then freeze it as a new AMI. Then I associated the load balancer with the new AMI, and each new app server from then on will use the new AMI. I can force existing instances to use the new AMI by simply terminating them in the Amazon console; the load balancer will detect that they're terminated and, based on the scaling rules, will bring up a new instance with the new AMI.
Another approach would be to use Chef or Puppet to automatically install the necessary packages on each new server at instantiation time, instead of "baking" the packages into the AMI itself. We opted not to do this, because it was unnecessary complexity. The app is simple enough that the baked-AMI approach works nicely.
Put all this together, and you have a very powerful setup that I would argue is just as easy to use as Heroku (once it's set up!), with the full power of root access on your boxes, the ability to install whatever you want, set your scaling rules, etc. Try it!