Archive for the 'Velocity08' Category

23
Jun

Velocity: Wikipedia Operations Talk

A few interesting notes from the Wikipedia talk, which for my money was the most interesting ops talk of the conference so far, from a Wikipedia engineer and trustee:

  • Lots of talk at the beginning about how many mistakes they make and how often an employee takes the site down. Refreshing, if strange, to hear such a perspective. Hard to find volunteers who aren’t too scared about breaking the site. (Mentioned later that there are 6 site engineers, two of them volunteers. All of them were in the room.)
  • An early graph showed that non-cached requests were a super-tiny % of all traffic: even with what’s no doubt a very sparse hit matrix, it seems like they’re able to cache just about everything.
  • The speaker didn’t know what their uptime was – he said (without prompting) “99 point something” – and very much said that “lots of nines” wasn’t a priority for them. He was an engineer and not a manager type, but it was still an interesting perspective – obsess over efficient use of non-profit resources (and people), not about high availability. This theme continues throughout the talk.
  • “I don’t know how important the last few seconds of changes are” – people can just re-enter the changes.
  • When the site goes down, their Gone Fishin’ page shows a donation box, and they make a huge amount of money. It’s the most profitable time. “It’s sometimes better to torture people.”
  • They have an external developer community and like it when those developers add features, but their optimization operator is // – always be ready to disable something. Heavy tasks get eliminated, delayed, reduced, and otherwise made smaller.
  • Some very funny/snarky comments, like pointing out how they split their database into multiple slaves based on use cases/languages, and then saying something like “now that’s called sharding. We didn’t call it sharding, we just do things and then people come up with names for them.”
  • Uses Lucene for its search engine: since Lucene was based on Java, and Java wasn’t open-source friendly, they got it running on an open-source Java clone and on Mono (!). Now back to Lucene proper and very focused on keeping things open.
  • Every feature anyone builds needs to allow caching and think about invalidation. OK to think of the database as a cache if the in-memory cache hit rate will be very low. Memcached for an object cache with a lot of careful planning, aggressive Squid usage as an HTTP cache with a month-long TTL (and an app that purges outdated data), multiple layers and geographically distributed.
  • Have some old P4s that they won’t use b/c of their power usage and heat distribution. (We have some as well, but they aren’t sitting idle yet…)
  • Some crazy high numbers (80K SQL queries/sec, 50K pages/sec, >1TB of compressed revisions, etc.) that I didn’t fully catch – I’ll look for this talk and update this post later.
  • “Please open source your stuff, so we can use it!”
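The caching note above (a very long TTL, plus an application that purges outdated entries when content changes) can be sketched roughly as follows. This is a minimal in-process illustration, not Wikipedia's actual Squid or Memcached setup; `PurgeableTTLCache`, `render`, and the `pages` dictionary are hypothetical names invented for the example.

```python
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # "month-long TTL", in seconds


class PurgeableTTLCache:
    """Toy cache: entries live for a long TTL unless explicitly purged."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (expires_at, value)

    def get(self, key, render):
        """Return the cached value, calling render(key) only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = render(key)
        self._store[key] = (now + self.ttl, value)
        return value

    def purge(self, key):
        """Drop a stale entry; the edit path calls this after a change."""
        self._store.pop(key, None)


# Hypothetical backing data and an "expensive" render step.
pages = {"Velocity": "first draft"}
render_calls = []


def render(title):
    render_calls.append(title)
    return "<html>" + pages[title] + "</html>"


cache = PurgeableTTLCache(THIRTY_DAYS)
first = cache.get("Velocity", render)
second = cache.get("Velocity", render)   # cache hit, no second render

pages["Velocity"] = "edited text"
cache.purge("Velocity")                  # invalidate on edit
third = cache.get("Velocity", render)    # re-rendered with the new text
```

The design point is that the TTL can be very long precisely because invalidation is explicit: readers never wait for an entry to expire in order to see an edit.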

All in all, a very memorable talk.

23
Jun

Velocity: Introducing Jiffy – Performance Tools for the Rest of Us

During my time at Amazon, I drank the performance obsession kool-aid. Geniuses like John did the hard-core data analysis to show that milliseconds matter to encourage customers to buy things. Using the great toolset Amazon had built for analyzing page performance, my teams worked to improve the response time at the average, the 99.9th percentile, etc.
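To make the "average versus 99.9th percentile" distinction concrete, here is a small sketch using the nearest-rank percentile definition. The latency numbers are made up for illustration, and `percentile` is a hypothetical helper, not part of any Amazon or WhitePages tool.

```python
import math
from fractions import Fraction


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Fraction(str(p)) avoids float rounding surprises such as 99.9 / 100.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented page latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [120] * 990 + [400] * 8 + [9000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99.9={p999_ms} ms")
```

With data shaped like this, the mean sits close to the typical request while the 99.9th percentile exposes the slow tail, which is why both are worth tracking.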

Amazon’s challenges were very specific – pages were built by hundreds of service calls to hundreds of services across thousands of boxes, and so the page framework needed to be able to respond to failing or slowing services quickly, with appropriate backoff, etc. These were fun and heady problems to work on and follow.

When I left Amazon for WhitePages, I realized that the toolset I had grown to rely on both solved Amazon’s unique problems and was kept inside Amazon. I learned that other large companies had similar systems, and I could pay the usual suspects for some kinds of services, but there was nothing just off-the-shelf that could give me 10% of the information Amazon had.

Smaller web publishers, then, have the double-whammy of fewer tools to measure page performance and fewer engineers to build the tools to make that better.

That’s not right for the web, and so at WhitePages, we’ve built a toolset to help engineers build faster web sites, based on real-world data from real clients.

Today, we released that system as an open source project – Jiffy. Jiffy’s an end-to-end system for measuring page and component performance across real data.

I’ve blogged in detail on the WhitePages Developer Blog about Jiffy, and the code is immediately available for download. I announced the release at the O’Reilly Velocity conference this morning – here are the slides.

More to come about Velocity later, which has already had a few interesting moments, even if the room is very hard to make laugh. (I killed a few jokes before I started, likely a good call, since the ones I left fell flat.)



