Achieving consistent, worry-free, super-fast deploys using AWS and ELBs

At PipelineDeals, we deploy code frequently, usually 2-3x per week, and sometimes even more often. As all web application developers know, deploying is a nerve-wracking process. Sure, 99.99% of the time everything goes perfectly smoothly: all your tests pass, your deploy to staging went perfectly, all the tickets have been verified. There is no reason to fear hitting the button. And the vast majority of the time, this is true.

But all web application developers also know that sometimes, there is a snag. Sometimes the fates are against you, and for whatever reason, something goes bust. Perhaps you have a series of sequenced events that must occur to deploy, and one of the events silently failed because the volume that /tmp is mounted on temporarily had a 100% full disk. Perhaps that upload of the new assets to S3 did not work. Perhaps you did not deploy to ALL the servers you needed to deploy to.

And then, the worst happens. For a short period while you are scrambling to revert, your customers see your mistake. They start questioning the reliability of your system. Your mistake (and it is yours, even if some bug in some server caused the problem) is clearly visible to your customers, your bosses, and your peers.

Taking advantage of the tools we have

PipelineDeals runs on Amazon AWS. We utilize EC2 for our server instances, ELB for our load balancing, and ElastiCache for our memcached storage. We are also major proponents of Opscode's Chef, and use it to spin up and configure every type of instance that makes up our stack.

Since we have all these fantastic tools, we decided to use them in a way that makes deploying easy. We wrote a simple Rakefile, called the Deployer, that orchestrates a seamless app server deploy.

Using the Deployer script

The first thing the Deployer does is create new app servers that have the new code on them. Once the new app servers have completed their configuration, the Deployer rakefile registers them with a test ELB load balancer.

From there, we can do a final walkthrough of exactly what is going into production, and verify that the new app servers are up, awake, and ready to receive requests.

After that final validation, we simply run rake deploy, which adds the new app servers to the current load balancers, verifies their health, then removes the old app servers from the production LB. This all runs in about 3 seconds, so the transition is smooth and seamless.

If anything is wrong with our code, or we find it generating an error we did not expect, we can simply run rake rollback, which does the opposite.

Or, if we are completely satisfied that everything looks ok, we can run rake cleanup which will tag the new app servers as the current production servers, and terminate the old app servers.


Originally we designed the Deployer for large project launches and risky chunks of code, but we have started using it for nearly every deploy because it is so easy.

If your company utilizes Chef, EC2, and ELB, check out the deployer. It might work great for your deployment workflow!


What it means to be truly geographically redundant on AWS

PipelineDeals is hosted on Amazon's AWS cloud platform, and has been since 2007. During those years we have been through two separate mass outages, both of which affected the US-EAST region.

Compared to Amazon's other regions, US-EAST is the busiest, has the cheapest hourly server rates, and happens to be the most prone to massive outages.

Given that the majority of our customer base is closer to the east coast, we keep our servers hosted in US-EAST. What this means, however, is that we must be prepared to jump ship to another region with as little downtime as possible.

Setting up for geographic redundancy

PipelineDeals relies heavily on Chef for our server configuration management. Over the past couple of months we have made deep investments in our Chef cookbooks to ensure that we can bring up any type of server in our stack to 100%, by simply running a rake command. Because we can do this, we can effortlessly bring up any type of server in any AWS region.

An ounce of prevention

We noticed that the past two massive outages were related to each other. Both had to do with Amazon's EBS system, which provides external storage and also acts as the root device for certain types of instances. During the outages we noticed that our EBS-backed instances were experiencing problems.

We decided to become proactive about it. None of our production servers rely on EBS-backed instances any longer. Because we use Chef to bring up instances, we do not need the ability to stop and reboot servers, which is one of the major benefits EBS-backed instances provide. In fact, removing our dependence on EBS has forced us to use Chef to its fullest potential.

We switched our Chef recipes to run against instance store-backed instances, which do not rely on the EBS subsystem. Since the outage we have replaced all running EBS-backed instances with instance store-backed ones.

Guarding against the 'mad rush'

Typically when an outage occurs in US-EAST, there is a massive rush to fire up instances in other regions. So much so that it can take tens of minutes, possibly even hours, to fire up servers, all in the midst of an emergency.

To guard against that, we keep a "skeleton crew" set of servers that is always on in US-WEST-1. It includes a database slave, an application server, and a background process server. If we need to fail over to our skeleton crew in the west, we can be up and running while assessing the situation in the east and/or firing up more powerful hardware in the west.

Essentially what would happen is we would promote our slave in the west to become a master, fire up a new west coast slave, and start firing up the ancillary servers that make up the full PipelineDeals experience.

Practice makes perfect

The multi-day threat of Hurricane Sandy facilitated multiple practice runs of failing over to the west. These included exercising every single one of the Chef recipes that make up the entire infrastructure of PipelineDeals.

During that time we made many, many commits to our Chef repository, and ensured that we could bring up servers to 100%, time after time.

In addition, we were able to combine server roles, reducing the number of servers needed to make up the entire PipelineDeals experience.



Announcing V3 of the PipelineDeals API

The developers at PipelineDeals are happy to announce a new version of our API. V3 introduces many changes, including the following:

  • Read-only access to all of your admin data
  • A new, simplified way of handling custom fields for Deals, People, and Companies
  • New, more powerful filtering for Deals, People and Companies

You can get the full scoop by going to our API documentation.

Starting out

Access to our API is now disabled by default. To ensure the security of your account, you must enable the new API before you can use it. Account admins can enable the API by going to the API section of the Admin menu.

Read-only access to all of your admin data

We now provide read-only access to all of your admin data. This includes things like person statuses, sources, deal stages, note and event categories, as well as all custom fields defined for Deals, Companies, and People.

This will enable you to do things with the API that you previously could not, like tagging or assigning a status to a person, setting the stage of a deal, or assigning a category id to a note, because you can now look up the ids those operations require.

Handling custom fields

We have greatly simplified how to store and retrieve custom fields for all of our entities that have custom fields, including Deals, People and Companies.

For instance, say your company is set up with three deal custom fields. The first, Region, is a multi-select dropdown; the second, Square footage, is numeric; and the third, Product type, is a single-select dropdown.

When retrieving a single deal, you'll get back the custom fields for that deal in a new, clean way.

curl "" | prettify_json.rb

{
  "id": 3,
  "name": "Grant's Deal",
  "custom_fields": {
    "custom_label_82": [28,29,30], // this is our region custom field
    "custom_label_83": null,       // this is our square footage field
    "custom_label_90": 49          // this is our product type dropdown field
  }
}
The region field contains an array of custom field label dropdown entry ids. You can get your custom field label dropdown choices by using the deal custom field labels list api call.
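Once you have the dropdown entries from the labels list call, resolving ids to display names is a small lookup. A hypothetical Ruby sketch: the shape of `labels_response` below is an assumption for illustration, so consult the API documentation for the real field names.

```ruby
# Hypothetical labels-list response; the exact keys are an assumption here.
labels_response = {
  "custom_label_82" => {
    "name" => "Region",
    "dropdown_entries" => [
      { "id" => 28, "name" => "Northeast" },
      { "id" => 29, "name" => "Southeast" },
      { "id" => 30, "name" => "Midwest" }
    ]
  }
}

# Map an array of dropdown entry ids (as stored on a deal) to their names.
def resolve_dropdown(labels, key, ids)
  entries = labels.fetch(key).fetch("dropdown_entries")
  ids.map { |id| entries.find { |e| e["id"] == id }&.fetch("name") }
end

resolve_dropdown(labels_response, "custom_label_82", [28, 29, 30])
# => ["Northeast", "Southeast", "Midwest"]
```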

This makes working with custom fields very easy. To update your fields, simply PUT the data back to us with the ids you want set:

curl -X PUT "" -d '
  {"deal": {
    "custom_fields": {
      "custom_label_82": [31, 32],
      "custom_label_83": 456,
      "custom_label_90": null
    }
  }}'

The deal will be updated with the correct custom fields. This custom_fields object is available for Deals, People, and Companies.

New, more powerful filtering for Deals, People, and Companies

You can filter and sort a list of Deals, People or Companies using our new powerful filtering framework.

All available filters are listed in the API documentation.


Javascript error reporting for fun and profit

Here at PipelineDeals, our app is very JavaScript intensive. In fact, for our Jupiter release, we moved much of our logic from Rails to the browser, using Backbone.js, CoffeeScript, and a host of other cutting-edge technologies.

During deploys and other edge cases, we want to know if a user has received a JavaScript error, just as we would with any other type of exception. We deploy very frequently, and a few weeks ago we ran into a problem where one of our app servers did not minify and concatenate one of our JavaScript files correctly, which led to a host of issues that customers saw. To prevent this in the future, we decided to search around and see if we could have JavaScript errors reported to one of our favorite tools, New Relic RPM.

As it turns out, catching and reporting JavaScript errors is pretty easy. We ended up hooking into window.onerror, which we define right after we add jQuery, at the top of the page.

// Determine if the error occurred before or after document ready
jQuery(function() { window.documentIsReady = true; });

// report a maximum of 5 errors
window.MaximumErrorCount = 5;

window.onerror = function(errorMsg, file, lineNumber) {
  window.errorCount || (window.errorCount = 0);

  // this example assumes a ppd object that has various information about
  // our environment and our logged in user.
  if (ppd.env == 'production' && window.errorCount <= window.MaximumErrorCount) {
    window.errorCount += 1;

    // post the error with all the information we need.
    $.post('/javascript_error', {
      error: errorMsg,
      file: file,
      location: window.location.href,
      lineNumber: lineNumber,
      documentReady: window.documentIsReady,
      ua: navigator.userAgent
    });
  }
};

The above code executes when the user receives a JavaScript error. It then makes an ajax POST to our app, which handles the error.

class JavascriptErrorController < ApplicationController

  def javascript_error
    # post the error to New Relic.
    # You could also send an email, notify HipChat, whatever.
    NewRelic::Agent.notice_error("Javascript error: #{params[:error]}",
                                 {:uri => params[:location], :params => params})
    head :ok
  end

end

Not only has this been invaluable for catching that rare but painful asset problem, it also catches legitimate issues and race conditions where a variable might not yet be defined.

In New Relic, JavaScript errors get reported just like any other error, and if we do have an asset problem, our error rate shoots through the roof and all the devs are notified.


Improving Rails Performance with Twitter's Kiji Ruby

In March, Twitter unveiled "Kiji", an effort to significantly reduce the impact of running the garbage collector in the Ruby Enterprise Edition (REE) runtime. Many have criticized the effort because it focused on the older 1.8.7 instead of the newer 1.9.x version of Ruby. But for older, larger apps that still run Rails 2.x or need Ruby 1.8, Kiji may be worth exploring.


MRI relies on a single heap that basically resembles a slab allocator. New slabs are created via malloc() and carved up into fixed 40-byte slots, each of which can hold any of a variety of Ruby objects. When the heap becomes full, GC is invoked to clean things up by looking for unreachable objects. If the heap is still full after GC, a new slab is allocated.


MRI's GC uses a very simple, naive mark-and-sweep algorithm. When it runs, all code execution stops until GC is finished. The implementation does not (easily) leave the door open to more advanced GC techniques.


For large applications, and especially web-based applications, this can become a major issue. There is a direct relationship between the number of objects and garbage collection runtime: more objects means more time spent garbage collecting and less time serving customer requests.
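You can see this relationship with a quick experiment. The sketch below uses GC::Profiler, which modern MRI provides but REE 1.8.7 does not, so it illustrates the general mark-and-sweep cost rather than REE's exact behavior: retain different numbers of live objects, then time a forced full collection.

```ruby
# Illustrates that mark-and-sweep time grows with the number of live
# objects. Uses modern MRI's GC::Profiler (not available in REE 1.8.7).
def full_gc_time(live_object_count)
  retained = { }  # live objects the mark phase must walk
  GC::Profiler.enable
  GC::Profiler.clear
  GC.start                                       # force a full collection
  time = GC::Profiler.total_time                 # seconds spent in that GC
  GC::Profiler.disable
  retained.clear
  time
end

puts full_gc_time(1_000)      # a small heap collects quickly...
puts full_gc_time(2_000_000)  # ...a large one takes noticeably longer
```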


This is where Kiji comes in.


Kiji's GC acts like a generational algorithm by splitting objects into heaps based on their age, on the theory that younger objects are more likely to become garbage. While it is not a "true" generational GC, because objects cannot move between heaps, there are still benefits to using it. Kiji's GC allows the CPU to continue executing Ruby code while GC runs. For web-based applications this is a huge win, as your users do not have to wait for a full GC run during the request cycle.


Kiji also changes the way frequently-used objects are stored in the heap. For web-based Ruby applications, the majority of "live" objects in the heap are RNodes, which hold the application's actual source code. Kiji implements a long-life heap that solely stores RNodes and is garbage collected much less frequently than the primary heap. Because the primary heap now holds fewer objects, GC runs complete much faster, at the expense of the application consuming more memory.


Twitter's initial announcement and subsequent followup both included results from running their webapp under synthetic load, but having run Kiji in production for nearly two months, we are excited to share the noticeable difference it has made within our own stack.


In the figure below we have collected several weeks' worth of runtime data from our production application servers. Each point in the figure represents the average of one day's worth of data, collected in one-minute samples. To understand the effect Kiji might have on our application's runtime, we captured both the average time a customer's request spent solely within the Ruby runtime and our application's Apdex, an open standard that converts response-time measurements into the degree to which user expectations are met. An application's Apdex ranges from zero (no users satisfied) to one (all users satisfied).
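For concreteness, the Apdex standard counts requests at or under a threshold T as "satisfied", requests over T but at or under 4T as "tolerating", and anything slower as "frustrated"; the score is (satisfied + tolerating/2) / total. A tiny sketch:

```ruby
# Apdex score for a set of response times (seconds) and a threshold T:
# satisfied: t <= T, tolerating: T < t <= 4T, frustrated: t > 4T.
def apdex(response_times, threshold)
  satisfied  = response_times.count { |t| t <= threshold }
  tolerating = response_times.count { |t| t > threshold && t <= 4 * threshold }
  (satisfied + tolerating / 2.0) / response_times.size
end

apdex([0.2, 0.4, 1.1, 3.5], 0.5)  # 2 satisfied, 1 tolerating, 1 frustrated
# => 0.625
```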


Days spent running REE or Kiji cluster neatly along expected lines, with REE posting significantly higher response times and subsequently lower Apdex scores. While utilizing nearly fifty percent less CPU than REE, Kiji did, however, bring a nearly threefold memory increase. Overall, Kiji shaved nearly one second off our page load time, bringing us to an average of three seconds per page, well below today's global average.


If you would like to take Kiji for a spin with your own application you can find it on Github. We would love to hear about your experiences using Kiji (or another alternative Ruby runtime) with Rails.


Technical Notes: 


* Our production application servers run Debian 5.0; we have successfully compiled Kiji on RPM-based distributions (RedHat/CentOS/Fedora) as well. To date we have been unable to successfully build Kiji on an Ubuntu-based box (Lucid 10.04 and later tested) due to a bug triggered by libc6.


* We built Kiji against Google's TCMalloc instead of malloc(), producing the same binary as Twitter, using the method provided in their README.