For those in the Lean Startup world, the utopian version of product development is continuous deployment. It means every engineer is deploying code multiple times a day, often even on their first day of work. It’s also heavily test driven, which reduces the risk of a bug taking down the entire system. While many startups aspire to this, few have succeeded, which is why it was so exciting to have Brett Durrett of IMVU come and speak to the Lean Startup Circle Boston Thursday night. Brett is VP of Engineering at IMVU, which is also the startup that Lean Startup guru Eric Ries spent many years helping build before becoming the movement’s biggest evangelist.
{Note: Brett’s presentation was awesome but hasn’t been posted yet. I’m embedding his Lean LA version as a reference until it can be posted.}
Continuous Deployment at Lean LA
In a nutshell, continuous deployment breaks down into 3 steps:
- Develop a feature
- Test it
- Deploy it
But why would you do continuous deployment?
The reason for using continuous deployment hits the core of lean startups: more iterations. Whether funded or bootstrapped, there’s a limited amount of time to iterate, and nothing speeds iteration like getting new features, site tweaks and updates out faster. Continuous deployment forces you to break every feature into bite-size chunks, which can save you from building a massive feature when a much smaller step would confirm whether it is worth building. It also minimizes version control issues, since no one is working on a long-term project based on old code.
It also makes your engineers more efficient. Is it easier to find the problem with freshly deployed code that has 10,000 lines in it or 10 lines? Is it easier to get engineers up to speed on a system that expects them to write a micro feature or build a major piece of the system? If everyone writes their own test code, you have greater accountability across the entire engineering team (you have to fix what you break) and you don’t have to hire a QA team stuck cleaning up everyone else’s mess.
The Continuous Deployment Process for Engineers at IMVU
After convincing us why it matters, Brett walked us through the process for an IMVU engineer. Once an engineer has finished building their bite size feature, they walk through all of the following in less than 15 minutes!
1) Engineer runs the test in their sandbox.
To keep from clogging the deployment and testing systems everyone shares, engineers first run some basic PHP tests on their own system to ensure the code is ready. (Note: they don’t do branches in the repository; they branch in the code instead.)
2) Engineer runs testing on main system
After passing the tests on their own machines, engineers enter the queue for Buildbot, the main testing system for all of IMVU. There are so many tests that you tag your code based on which parts of the system it affects and what it’s for. The tags determine which tests need to run and which can be skipped. (Running every test would take over 8 hours.)
On average, it takes about 8 minutes to run all the tests needed. They’ve achieved this speed because they have 40-50 instances running just for testing. They’ve also found that keeping testing to 12 minutes or less is what keeps engineers happy.
The most common slip at this stage is a missing tag, which means a needed test isn’t run.
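To make the tagging idea concrete, here is a minimal sketch in Python. It is not IMVU’s or Buildbot’s actual mechanism; the test names, the tag names and the `select_tests` helper are all invented for illustration. It assumes each test declares the tags it covers and that a test runs when it shares at least one tag with the commit.

```python
from typing import Dict, List, Set

# Hypothetical mapping from test name to the tags that test covers.
TEST_TAGS: Dict[str, Set[str]] = {
    "test_login_form": {"web", "accounts"},
    "test_checkout_total": {"web", "payments"},
    "test_chat_delivery": {"chat"},
    "test_avatar_render": {"client", "avatars"},
}


def select_tests(commit_tags: Set[str], test_tags: Dict[str, Set[str]]) -> List[str]:
    """Return, in sorted order, every test that shares a tag with the commit."""
    return sorted(name for name, tags in test_tags.items() if tags & commit_tags)
```

Calling `select_tests({"payments"}, TEST_TAGS)` picks only the checkout test, while a commit with no tags selects nothing at all, which is exactly the missing-tag slip described above.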
3A) If all tests pass, the engineer now deploys the code.
3B) If any of the tests fail, the engineer reverses their commit.
With the rate at which the whole team is deploying and testing, there’s no time to have people fixing code while it’s in the system. Therefore, regardless of the issue, the engineer reverses their commit, goes back to fixing the problem on their own machine, and starts the test and deploy process from the beginning.
4) Deployment occurs in pieces
Currently, IMVU (with over 50 Million registered members) has 800 servers in use. When they deploy new code, it starts out on just 35 servers. This ensures that if something goes wrong, it doesn’t take the entire site down.
5) Testing continues after deployment thanks to the Cluster Immune System
Even after deployment, they’re still testing, just in a different fashion. They’ve developed their own tool called the “Cluster Immune System” which monitors key site metrics (speed, system performance, etc) and customer metrics (revenue, registrations, etc) to make sure there hasn’t been a dramatic change. Even the best tests won’t notice that an engineer accidentally put a blue button on a blue page; the tests will see that the button is still there and works, but won’t realize a user can’t see the critical sign up button.
This system runs on those 35 servers they use as a live test bed. If anything goes wrong there, they prevent it from deploying to the rest of the system. If not, it’s deployed system wide to all 800 servers.
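As a rough illustration of the “has anything changed dramatically?” check, here is a small Python sketch. It is not the Cluster Immune System itself, whose internals weren’t covered in the talk; the `find_anomalies` function, the metric names and the 20% threshold are all assumptions made up for the example.

```python
from typing import Dict, List


def find_anomalies(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    max_relative_change: float = 0.2,  # assumed threshold, not an IMVU number
) -> List[str]:
    """Return the metrics whose canary value drifted too far from the baseline."""
    anomalies = []
    for name, expected in baseline.items():
        observed = canary.get(name)
        if observed is None:
            # A metric that stopped reporting is treated as a failure.
            anomalies.append(name)
        elif expected == 0:
            # Avoid dividing by zero: any movement away from zero counts.
            if observed != 0:
                anomalies.append(name)
        elif abs(observed - expected) / abs(expected) > max_relative_change:
            anomalies.append(name)
    return anomalies
```

In the blue-button scenario, page speed and revenue on the test servers would look normal while registrations collapse, so a check like this would flag the registrations metric and the deploy would stop there.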
An audience member asked, “What if you don’t have massive traffic you can segment to test a new deploy?” Brett said segmenting is an advantage when you’re bigger, but until then, you can just run a Cluster Immune System to monitor a system-wide deploy.
The best news of all of this? IMVU plans to open source the Cluster Immune System (CIS) soon.
6) If all CIS testing is passed, deploy to all servers, but continue monitoring
Even after deploying to all servers, they still monitor for anything unexpected. If they see anything alarming, they’ll roll back and remove the feature.
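Steps 4 through 6 can be sketched as one small function. This is only a sketch under stated assumptions: `staged_deploy` and its `deploy`, `rollback` and `is_healthy` callables are hypothetical stand-ins for whatever tooling actually pushes code and reads the monitoring, not anything Brett showed.

```python
from typing import Callable, List, Sequence


def staged_deploy(
    servers: Sequence[str],
    canary_count: int,
    deploy: Callable[[List[str]], None],
    rollback: Callable[[List[str]], None],
    is_healthy: Callable[[List[str]], bool],
) -> str:
    """Deploy to a small group first, then to everyone, rolling back on trouble."""
    canary = list(servers[:canary_count])
    rest = list(servers[canary_count:])

    # Step 4: only the small live test bed gets the new code at first.
    deploy(canary)
    # Step 5: if monitoring looks wrong, stop before the rest of the cluster.
    if not is_healthy(canary):
        rollback(canary)
        return "rolled back at canary"

    # Step 6: deploy everywhere, but keep watching and roll back if needed.
    deploy(rest)
    if not is_healthy(list(servers)):
        rollback(list(servers))
        return "rolled back after full deploy"
    return "deployed"
```

With 800 servers and `canary_count=35`, a bad change only ever reaches the first 35 machines before it is pulled.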
—
This entire process takes only 10-12 minutes. Only one engineer can be in the Buildbot testing phase at a time, but as soon as you enter deployment (step 4, above) someone else can enter Buildbot.
—
This process sounds great, right? But it seems so sophisticated…how do you get started? Brett covered that too.
Getting Started – How do you actually do this?!?
Getting started is a different process depending on whether you’re an established company or just an infant startup with limited traffic, but either way, there are great ways to get started:
- If you’re a small startup – Start with a sandbox for each of your engineers and just focus on pushing code quickly and in small chunks. You can develop your testing as things break; that’s how IMVU built their system.
- If you’re an established company – Start with production (i.e., the last step before deploying code) and automate that process. Start building tests for whether something should be deployed or not. Work to get the automated tests as good as the human part of the process. Once you’ve accomplished this, keep working backward to continue to remove humans from the deployment flow. At first, err on the side of preventing problems, then clean up your tests to make them efficient.
Whether big or small, the same key rule applies: Anything can break once. Then you have to make it so the same thing can’t happen again by writing a test for your mistake. This builds accountability and ensures you write only the tests you really need…one step at a time.
—
Pitfalls
Like any system, this isn’t perfect. There are challenges both with getting personnel buy in and in scaling this:
1) Philosophy- Blameless systems
It’s not “Ned broke this!” it’s “How did we let that failure get through our system?” Brett really emphasized this important difference. It takes a philosophical buy-in to make this work at its best.
At IMVU, they hold regular “Blameless Post Mortems” to discuss issues that slipped through.
2) Optimize your testing
As you grow, more tests will be required. You should work out which tests actually need to run for a specific change (as they did with tagging) and purchase sufficient hardware to make tests fast. IMVU also found they could save a tremendous amount of time simply by optimizing the order in which tests run: running the slower ones first gave a 22% time saving. They then built dependencies into the testing (i.e., if Test B requires Test A to pass first, make sure Test B runs second). Finally, sandbox testing of high-level issues kept a lot of code from entering their “1 engineer at a time” test system, because everyone could test on their own machines first.
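Here is a small Python sketch of why test order matters when tests run in parallel, plus one way to respect dependencies. The durations and test names are made up, the `makespan` and `dependency_order` helpers are hypothetical, and none of this is IMVU’s scheduler. It assumes each test starts on whichever worker frees up first, and it needs Python 3.9 or later for the standard library `graphlib` module.

```python
import heapq
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def makespan(durations: List[float], workers: int) -> float:
    """Total wall-clock time when tests start, in the given order, on the first free worker."""
    free_at = [0.0] * workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def dependency_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order tests so that each one runs after the tests it depends on."""
    return list(TopologicalSorter(depends_on).static_order())
```

With four 1-minute tests and one 4-minute test on two workers, `makespan([1, 1, 1, 1, 4], 2)` is 6 minutes because the slow test starts last, while the slowest-first order `[4, 1, 1, 1, 1]` finishes in 4. For dependencies, `dependency_order({"test_b": {"test_a"}})` puts `test_a` ahead of `test_b`.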
3) Outsourcing doesn’t work well
This sort of system requires a cultural buy in that IMVU found couldn’t be instilled well remotely. It also proved difficult to manage the testing system whenever some engineers were in a different time zone.
4) Complex fails are harder to find
Brett still occasionally finds issues that don’t fail during the work day, and as VP of Engineering he is the one who gets the 3am calls about them. They also struggle with MySQL and memcached issues, which have led to separate systems being developed to deal with them.
—
The key principle in all of this is that continuous deployment is a constantly evolving system. It is not perfect. However, taking it one step at a time will help you build a formidable engineering process that allows your company to move faster than you ever thought possible.
Ready to try to bring continuous deployment to your startup?