Velocity: Wikipedia Operations Talk

A few notes from the Wikipedia talk, given by a Wikipedia engineer and trustee. For my money it was the most interesting ops talk of the conference so far:

  • Lots of talk at the beginning about how many mistakes they make and how often an employee takes the site down. Refreshing, if strange, to hear such a perspective. Hard to find volunteers who aren’t too scared about breaking the site. (Mentioned later that there are 6 site engineers, two of them volunteers. All of them were in the room.)
  • An early graph showed that the % of non-cached data was a super-tiny % of all traffic: even with what’s no doubt a very sparse hit matrix, it seems like they’re able to cache just about everything.
  • The speaker didn’t know what their uptime was – he said (without prompting) “99 point something” – and very much said that “lots of nines” wasn’t a priority for them. He was an engineer and not a manager type, but it was still an interesting perspective – obsess over efficient use of non-profit resources (and people), not about high availability. This theme continues throughout the talk.
  • “I don’t know how important the last few seconds of changes are” – people can just re-enter the changes.
  • When the site goes down, their Gone Fishin’ page shows a donation box, and they make a huge amount of money. It’s the most profitable time. “It’s sometimes better to torture people.”
  • They have an external developer community and like it when those developers add features, but their optimization operator is // – always be ready to disable something. Heavy tasks get eliminated, delayed, reduced, and otherwise made smaller.
  • Some very funny/snarky comments, like pointing out how they split their database into multiple slaves based on use cases/languages, and then said something like “now that’s called sharding. We didn’t call it sharding, we just do things and then people come up with names for them.”
  • Uses Lucene for its search engine: since Lucene was based on Java, and Java wasn’t open-source friendly, they got it running on an open-source Java clone and on Mono (!). Now back to Lucene proper and very focused on keeping things open.
  • Every feature anyone builds needs to allow caching and think about invalidation. OK to think of the database as a cache if the in-memory cache hit rate will be very low. Memcached for an object cache with a lot of careful planning, aggressive Squid usage as an HTTP cache with a month-long TTL (and an app that purges outdated data), multiple layers and geographically distributed.
  • Have some old P4s that they won’t use b/c of their power usage and heat distribution. (We have some as well, but they aren’t sitting idle yet…)
  • Some crazy high numbers (80K SQL queries/sec, 50K pages/sec, >1TB of compressed revisions, etc.) that I didn’t fully catch – I’ll look for this talk and update this post later.
  • “Please open source your stuff, so we can use it!”
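
To make the caching bullet above a bit more concrete, here is a rough Python sketch of the "long TTL plus explicit purge" idea as I understood it. It is not Wikipedia's code: `PageCache` is a made-up in-process stand-in for an HTTP cache like Squid, `Wiki` and its `/wiki/` URL scheme are hypothetical, and the dict of pages stands in for the database. The only detail taken from the talk is the month-long TTL and the fact that the app purges outdated data itself.

```python
import time

MONTH_SECONDS = 30 * 24 * 60 * 60  # "month-long TTL" from the talk; 30 days is my assumption


class PageCache:
    """Hypothetical in-process stand-in for an HTTP cache such as Squid."""

    def __init__(self, ttl=MONTH_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock          # injectable so expiry can be simulated
        self._entries = {}          # url -> (expires_at, body)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        expires_at, body = entry
        if self.clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._entries[url]
            return None
        return body

    def put(self, url, body):
        self._entries[url] = (self.clock() + self.ttl, body)

    def purge(self, url):
        # Explicit invalidation; purging a missing URL is a no-op.
        self._entries.pop(url, None)


class Wiki:
    """Hypothetical app layer: renders pages and purges the cache on edit."""

    def __init__(self, cache):
        self.cache = cache
        self.pages = {}    # title -> text; stands in for the database
        self.renders = 0   # counts expensive renders (cache misses)

    def view(self, title):
        url = "/wiki/" + title
        body = self.cache.get(url)
        if body is None:
            self.renders += 1
            body = "<h1>%s</h1><p>%s</p>" % (title, self.pages.get(title, ""))
            self.cache.put(url, body)
        return body

    def edit(self, title, text):
        self.pages[title] = text
        # The long TTL is only safe because every write purges the stale copy.
        self.cache.purge("/wiki/" + title)
```

The point of the design is that reads almost never reach the renderer, and a month-long TTL doesn't serve stale pages because every edit purges the cached copy right away. In a real deployment the purge would be a request sent to each cache layer rather than a dict operation.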

All in all, a very memorable talk.
