A few notes from the Wikipedia talk, given by a Wikipedia engineer and trustee, which for my money was the most interesting ops talk of the conference so far:
- Lots of talk at the beginning about how many mistakes they make and how often an employee takes the site down. Refreshing, if strange, to hear such a perspective. It's hard for them to find volunteers who aren’t too scared of breaking the site. (Mentioned later that there are six site engineers, two of them volunteers. All of them were in the room.)
- An early graph showed that non-cached data was a super-tiny % of all traffic: even with what’s no doubt a very sparse hit matrix, it seems like they’re able to cache just about everything.
- The speaker didn’t know what their uptime was – he said (without prompting) “99 point something” – and very much said that “lots of nines” wasn’t a priority for them. He was an engineer and not a manager type, but it was still an interesting perspective – obsess over efficient use of non-profit resources (and people), not about high availability. This theme continues throughout the talk.
- “I don’t know how important the last few seconds of changes are” – people can just re-enter them.
- When the site goes down, their Gone Fishin’ page shows a donation box, and they make a huge amount of money. It’s the most profitable time. “It’s sometimes better to torture people.”
- They have an external developer community and like it when outside developers add features, but their optimization operator is // (comment it out) – always be ready to disable something. Heavy tasks get eliminated, delayed, reduced, and otherwise made smaller.
- Some very funny/snarky comments, like pointing out how they split their database into multiple slaves based on use cases/languages, and then saying something like “now that’s called sharding. We didn’t call it sharding, we just do things and then people come up with names for them.”
- Uses Lucene for its search engine: since Lucene was based on Java, and Java wasn’t open-source friendly, they got it running on an open-source Java clone and on Mono (!). Now back to Lucene proper and very focused on keeping things open.
- Every feature anyone builds needs to allow caching and think about invalidation. It's OK to think of the database as a cache if the in-memory cache hit rate will be very low. Memcached serves as an object cache with a lot of careful planning, and Squid is used aggressively as an HTTP cache with a month-long TTL (plus an app that purges outdated data). The caches are layered and geographically distributed.
- Have some old P4s that they won’t use b/c of their power usage and heat distribution. (We have some as well, but they aren’t sitting idle yet…)
- Some crazy high numbers (80K SQL queries/sec, 50K pages/sec, >1TB of compressed revisions, etc.) that I didn’t fully catch – I’ll look for this talk and update this post later.
- “Please open source your stuff, so we can use it!”
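
The "month-long TTL plus explicit purge" idea from the caching note above is easier to see in code. Here's a minimal sketch in Python. It is not Wikipedia's actual setup: `PurgeableCache` and `Wiki` are hypothetical names, and an in-memory dict stands in for Squid. The point is only that a very long TTL is safe when every edit purges the affected entry.

```python
import time


class PurgeableCache:
    """Toy stand-in for an HTTP cache with a long TTL and explicit purges."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, render):
        """Return the cached value, calling render() only on a miss or expiry."""
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = render()
        self._entries[key] = (now + self.ttl, value)
        return value

    def purge(self, key):
        """Drop an entry so the next read re-renders it."""
        self._entries.pop(key, None)


class Wiki:
    """Hypothetical app: a dict as the 'database', plus a page cache in front."""

    def __init__(self, cache):
        self.cache = cache
        self.pages = {}   # title -> page text
        self.renders = 0  # counts how often we did the expensive work

    def _render(self, title):
        self.renders += 1
        return "<h1>%s</h1><p>%s</p>" % (title, self.pages[title])

    def view(self, title):
        return self.cache.get(title, lambda: self._render(title))

    def edit(self, title, text):
        # Write first, then purge, so readers never wait a month for the change.
        self.pages[title] = text
        self.cache.purge(title)


MONTH = 30 * 24 * 3600  # assumed "month-long" TTL, in seconds
wiki = Wiki(PurgeableCache(ttl_seconds=MONTH))
wiki.edit("Squid", "A caching proxy.")
first = wiki.view("Squid")    # rendered
second = wiki.view("Squid")   # served from cache
wiki.edit("Squid", "A caching HTTP proxy.")
third = wiki.view("Squid")    # re-rendered because the edit purged the entry
```

In this sketch the TTL is only a backstop; freshness comes from the purge in `edit`, which is why every feature has to think about invalidation up front.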
All in all, a very memorable talk.