Scalable Infrastructure for Whitehouse.gov at DrupalCon SF 2010

Frank Febbraro is the CTO of Phase2 and the architect of the transition of whitehouse.gov to Drupal. He gave this talk at DrupalCon San Francisco.

Due to NDAs I can't go into great detail about the things I'm presenting, but I'm happy to be able to share what I can with you guys.

Overview

Whitehouse.gov scales to lots and lots of page views. But scaling means more than adding servers and increasing page views; it means planning and processes, and growing and maturing the process of delivery. We'll be talking about infrastructure, but about other things as well.

Why replace the White House website?

Before, they had a CMS that provided a good website. When this project was over, they had a platform to build on, one that let them tap into and participate in the open source community.

Why Drupal?

They had a clear vision of what they wanted to do, and of how they wanted to tell the story of the presidency. Drupal allowed full control of the platform, with open and transparent functionality.

We had two dedicated teams developing the site. One team was focused on the Drupal side; another team of the same size built up the infrastructure: integration, setting up servers, load testing, and performance work. On top of that we had analysts and project managers.

Interestingly, the infrastructure and development teams spent about the same amount of time on their respective projects.

There was tremendous dependency between these teams, and while it sometimes seemed like there were too many cooks in the kitchen, there was very strong collaboration between our vendors.

Ingredients

  • Great Design
  • Drupal 6
  • Performance patches
  • Lots of contrib modules
  • Custom features and integration

Features

  • Each department can have its own custom content, blogs, and administrators; that's all managed through Context to get different layouts.

  • Apache Solr search was a massive improvement over the original search and a key feature of the site.
  • We built a custom media browser on top of Solr which degrades gracefully without JavaScript, although there are a lot of cool AJAX features.
  • We improved content workflows, for example the process for creating photo slideshows in the photo gallery.
  • The Node Embed module, which we contributed back, takes the guesswork out of how items are visually placed in the site: editors embed a reference to another node right in the body, and the module renders it in place.
  • We integrated with Akamai, which helps us scale the site. The cache is cleared automatically when content changes; individual pages can be cleared from the cache with a button on the page, and there's a bulk cache clear as well. (A sketch of the purge-on-change idea follows this list.)
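
To make the purge-on-change integration concrete, here's a minimal Drupal 6 sketch of the idea. This is not the actual whitehouse.gov code: mymodule_akamai_purge() is a hypothetical helper that wraps Akamai's cache-control API, though the contributed Akamai module provides plumbing along these lines.

    <?php
    // A minimal sketch of purge-on-save, not the actual whitehouse.gov code.
    function mymodule_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
      if ($op == 'update' || $op == 'delete') {
        // Always purge the canonical URL.
        $paths = array('node/' . $node->nid);
        // Also purge any path alias that points at this node.
        $alias = drupal_get_path_alias('node/' . $node->nid);
        if ($alias != 'node/' . $node->nid) {
          $paths[] = $alias;
        }
        // Hypothetical wrapper around Akamai's cache-control API.
        mymodule_akamai_purge($paths);
      }
    }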

What do you do when it breaks?

We developed a series of failure tests and recovery plans. What things can break? What services can fail? We went out into our environment and started turning off servers and services to find out how the site would respond, and discover the fastest ways to recover. It's really important to develop this fault tolerance.

We had a number of fall-back plans to mitigate the risks of our assumptions. Think about what your risks are, and have contingency plans for those.

How do you launch it?

Intense collaboration. We started with weekly meetings and moved to daily calls with all the vendors on the phone. We started in the middle of July and launched at the end of October. We had to plan for delays with background investigations and security audits, and we had to have plans for intrusion and penetration testing. All of this was compressed into a three-month period.

The big elephant was the certification and accreditation process. You have to have mitigation strategies and validate the controls for each risk; we ended up with a 900-page document to prove that we had done our homework.

We launched Saturday, October 24 at 1pm. This was locked in four hours prior.

Numbers

The site gets hundreds of thousands of unique visitors per day; we've had over 100k peak live streams for the State of the Union. The site gets over 15,000 webform submissions per day.

Datacenters

There are two completely separate environments: production and disaster recovery. It's a typical high-performance Drupal setup: we have MySQL, Varnish, Akamai, monitoring, and Puppet.

Servers

The servers run RHEL 5 in a virtualized cloud hosting environment. They're hardened to NSA guidelines, and everything is provisioned by a tool called Puppet, which provides data center automation. We put the configuration for each server into code, which we can revision control, and run that recipe on our servers; when it's done, every database and web server comes out exactly the same.

Compliance is the best part about this, since we had to undergo security scans. When we spawn a new server, it meets all the specifications that our other servers already meet.

CDN

We use Akamai. They provide SiteAccelerator, which is a reverse proxy cache; NetStorage, which is like Amazon's S3; and LiveStream for live streaming support. We have tight integration through their Cache Control Utility, and at this point Akamai handles 90% of our traffic.

Drupal

We're using Drupal 6 with the Pressflow patches and other community patches. It supports MySQL replication and read/write splitting, and the web servers use a shared filesystem.
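
As a rough illustration of what the read/write split looks like at the Drupal layer, here's a sketch under assumptions: the hostnames are invented, and since the team extended the Pressflow patches (see the Q&A below), the exact API surface of their core may differ. Pressflow 6 lets you declare slave connections in settings.php and route lag-tolerant reads to them.

    <?php
    // settings.php: one writable master plus a list of read slaves.
    // Hostnames and credentials here are illustrative.
    $db_url = 'mysqli://drupal:secret@db-master.internal/whitehouse';
    $db_slave_url = array(
      'mysqli://drupal:secret@db-slave1.internal/whitehouse',
      'mysqli://drupal:secret@db-slave2.internal/whitehouse',
    );

    // In module code, reads that can tolerate a little replication lag
    // use the slave variant; db_query() writes always hit the master.
    $result = db_query_slave("SELECT nid, title FROM {node} WHERE status = 1");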

Caching

We use Memcached via the Drupal Memcache API module, with a cluster of Memcached servers.
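
Wiring Drupal 6 to Memcached is mostly a settings.php exercise. A minimal sketch, assuming the Memcache API module lives at its usual path (the server addresses are illustrative):

    <?php
    // settings.php: route Drupal's cache layer through the Memcache API
    // module instead of the database.
    $conf['cache_inc'] = './sites/all/modules/memcache/memcache.inc';

    // The Memcached cluster; each server is assigned to a named cluster.
    $conf['memcache_servers'] = array(
      '10.0.1.10:11211' => 'default',
      '10.0.1.11:11211' => 'default',
    );

    // Map Drupal cache bins onto the cluster ('default' catches the rest).
    $conf['memcache_bins'] = array(
      'cache' => 'default',
    );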

Search

We use Apache Solr with masters and slaves; nginx runs in front of them, doing weighted load balancing across the slaves.
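
On the Drupal side, this can be as simple as pointing the search module at the nginx balancer rather than at any single Solr slave. A hedged sketch: the variable names come from the Drupal 6 apachesolr module, and the hostname is an invented placeholder.

    <?php
    // settings.php: send all search traffic to the nginx front end, which
    // load-balances across the Solr slaves with configured weights.
    $conf['apachesolr_host'] = 'solr-balancer.internal';
    $conf['apachesolr_port'] = '8983';
    $conf['apachesolr_path'] = '/solr';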

Databases

We run MySQL Enterprise with InnoDB, and use a RAM filesystem for temporary tables (typically done by pointing MySQL's tmpdir at a tmpfs mount), which helps with performance. We also have a lot of performance optimizations in the typical caches and buffers.

Replication

We have two tiers of replication: master/master replication, with one active and one passive master. The passive master serves as the master for all the slaves, so all the replication load is shifted onto the passive master.

Monitoring

We use Nagios for infrastructure monitoring: free memory, swap space, how many people have logged in to the servers, how long it's been since Cron has run. We also use MySQL Enterprise Monitor and Cacti. This lets you see trends over time, so you can predict when you need more resources.

Replication Monitoring

We have a set of custom scripts to manage replication with MySQL and tie it in with Drupal. Drupal knows which slaves are available for active reading, and that list is constantly updated. Scripts monitor the status of every slave in the environment; when they notice a slave is not in a state where it should be receiving reads, they pull it from the pool, reinitialize it, and move it back in. The scripts also manage the replication hierarchy, so you can swap out the passive master, for example. (A sketch of the health check follows.)
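
The actual scripts are custom and weren't released, but the core health check is easy to sketch: poll SHOW SLAVE STATUS on every slave, and only publish caught-up, healthy slaves to wherever Drupal builds its read pool. Everything below (hosts, credentials, thresholds, and the handoff file) is an illustrative assumption.

    <?php
    // Sketch of a slave health check; run periodically (e.g. from cron).
    // The monitoring user needs the REPLICATION CLIENT privilege.
    $slaves = array('db-slave1.internal', 'db-slave2.internal');
    $healthy = array();

    foreach ($slaves as $host) {
      try {
        $pdo = new PDO("mysql:host=$host", 'monitor', 'secret', array(
          PDO::ATTR_TIMEOUT => 2,
          PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        ));
        $status = $pdo->query('SHOW SLAVE STATUS')->fetch(PDO::FETCH_ASSOC);
        // A slave takes reads only if both replication threads are running
        // and it isn't lagging too far behind the master.
        if ($status
            && $status['Slave_IO_Running'] === 'Yes'
            && $status['Slave_SQL_Running'] === 'Yes'
            && (int) $status['Seconds_Behind_Master'] < 30) {
          $healthy[] = $host;
        }
      }
      catch (PDOException $e) {
        // Unreachable slaves simply drop out of the pool.
      }
    }

    // Hand the healthy list to Drupal, e.g. as a file that settings.php
    // includes when building its slave connection list.
    file_put_contents('/etc/drupal/active_slaves.php',
      '<?php return ' . var_export($healthy, TRUE) . ';');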

Environmental Sync

A disaster recovery environment is no good if the data is 25 minutes old; it has to be as recent as it can possibly be. We sync static assets to Akamai NetStorage, use Solr's built-in replication, and use MySQL replication to carry data changes across the entire tier.

Release Process

We had to scale this process too. We have automatic branch and integration sites, plus a full-featured staging environment with a complete front end and the same replication hierarchies, so we can test that anything we roll out will work in that environment.

Since launch, we've done at least one deployment every week, and we've been able to keep this pace up by using JIRA and Fisheye/Crucible to help feed our process and make sure we're doing things right.

What have you done since launch?

  • White House visitor records online
  • Mobile version of the site
  • iPhone app with live streams
  • HTML5 version of the site with video for iPads

What next?

We have a lot of obstacles ahead. How do we scale user authentication and user-generated content? The content issue is at the top of our minds: we've had over two million webform submissions, and we need to scale and segregate that data.

A site like Whitehouse.gov was great for the Drupal community, and it set a new bar for large site deployments. From security processes to site reviews, not all of these are Drupal-related, but the process is key. As Drupal moves up-market into government and the enterprise, it's important for us to scale that process. These organizations are ready for us, but we need to be ready for them too.

We took Drupal from Dries's dorm room with a sombrero to the huge community where we are today, so thank you everybody for that.

Q & A

Q: Do you know of any other large-scale Drupal sites being developed?

A: I know that the Examiner is moving to Drupal, and they get some insane traffic numbers (2 million a day).

Q: What do you think about support for Drupal 6 being stopped in two years when Drupal 8 comes out?

A: We'll put it in the list and prioritize it, and as it comes closer we'll knock it out. There's a lot of modules that need to be updated, and we'll need to do another security audit, but it will fit into the process.

Q: Did you take that into consideration when choosing Drupal?

A: Drupal is what they wanted to use, but you have this problem with every system. Drupal 7 won't go live unless there is a migration path for people who are on Drupal 6.

Q: How often do you get DDOS attacks?

A: Can't answer that. [laughter]

Q: How is the uptime?

A: We haven't had any major outages, so uptime is pretty good.

Q: You're using a shared filesystem, what is it? NFS?

A: We have a NAS device, mounted using NFS.

Q: Do you have to go through an approval process for your weekly releases?

A: Yes, we have an internal team that reviews each release. Part of that process is a review committee.

Q: Do you have enough authenticated users that you have to deal with scaling?

A: Currently, no public user can log into the site, so it's not an issue, but we're looking at that.

Q: Do you use the Nagios module on Drupal.org?

A: Yes.

Q: How much is the Akamai module that you released a part of your daily work?

A: Every time a node is changed, we catch that event and purge it from the cache, both its canonical URL and any aliases. If you want to clear an image, you can clear the cache by path.

Q: Did you find the multi-tier replication model added extra latency?

A: We're not doing a lot of writes, so things don't get backed up too often. We find that by the time I insert a row on the master and turn around to query it, it's already replicated everywhere.

Q: Do you have plans to serve from multiple datacenters through Akamai?

A: No, but we do realize it's a possibility.

Q: You mentioned that MySQL proxy didn't work for you and you're handling your read/writes through Drupal. How?

A: We took the Pressflow patches and added a little bit more to it.

Q: Do you have any comparison of pre-Drupal-transition traffic to post-launch traffic?

A: We do, but we don't think it's necessarily a fair comparison, since we've been rolling out a bunch of new features.

Q: Do you have any core performance patches in the queue that have not been committed to Drupal 7?

A: We're using patches that are already out there. There was enough work out there to take care of all of the things we had to address. We used some Drupal 7 patches that were back-ported to Drupal 6.

Q: How many people and how much did you spend?

A: We've interacted with over 50 people in the course of the project, but we can't release the cost.

Thanks to Frank for this talk. You can see the site live at whitehouse.gov.

2 Comments

Great write up

Could you talk a bit more about how you split the reads/writes? The Pressflow patches seem to introduce new methods for reading from slaves and masters. But this is an interface change that would seem to require all modules to leverage the new methods to get any value from the read slaves. How did you get around this? Better yet - is the code available? :)

Staging

Ooh, one more question ... you mentioned a staging environment. Do you merely stage new features / code to that environment or are you guys using it to stage content as well (like the deploy module's model)? I'm very interested because it's a common request from enterprise users. Thanks again for sharing!
