Disaster recovery planning.
Apr. 10th, 2007 06:15 pm
The Wikimedia Foundation is in no danger of collapse. There's all sorts of deeply problematic things about it, but no more than at any other small charity. Situation normal, all fouled up.
But it would be prudent to be quite sure that the Foundation failing — through external attack or internal meltdown — would not be a disaster.
The projects' content: The dumps are good for small wikis, but not for English Wikipedia — they notoriously take ages and frequently don't work. There are no good dumps of English Wikipedia available from Wikimedia. (I asked Brion about this and he says the backup situation should improve pretty soon, and Jeff Merkey has been putting backups up for BitTorrent.)
The English Wikipedia full text history is about ten gigabytes. The image dumps (which ahahaha you can't get at all from Wikimedia) are huge, as in hundreds of gigabytes. It'll be a few years before hard disks are big enough for interested geeks to download this stuff for the sake of it. What can be done to encourage widespread BitTorrenting right now?
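At minimum, anyone grabbing a dump to seed should check it against the published checksums before passing it on. A minimal sketch of that check, assuming you already have the dump and its published MD5 sum locally (the file name and checksum value below are placeholders, not real ones):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Stream the file through MD5 so we never hold gigabytes in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder names: substitute whichever dump you actually downloaded,
# and the checksum published alongside it.
dump_file = "enwiki-pages-articles.xml.bz2"
expected = "0123456789abcdef0123456789abcdef"

if md5_of(dump_file) == expected:
    print("Checksum matches; safe to seed this copy.")
else:
    print("Checksum mismatch; don't redistribute this copy.")
```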
The easiest way for a hosting organisation to proprietise a wiki, despite the licence, is simply not to make dumps available or usable, and to block spidering before anyone can pull down enough of the database to substitute for it. This is happening inadvertently now; it would be too easy to do deliberately.
Who are you? The user-password database is private to the Foundation, for obvious good reason. But I really hope the devs trusted with access to it are keeping backups in case of Foundation failure.
In the longer term, going to something like OpenID may be a less bad idea for identifying editors.
Hosting it somewhere that can handle it: MediaWiki is a resource hog. Citizendium got lots of media interest and their servers were crippled by the load, with the admin having to scramble to reconfigure things. Conservapedia was off the air for days at a time just from blogosphere interest. Who could put up a copy of English Wikipedia quickly and not be crippled by it?
Suitable country for hosting: What is a good legal regime for the hosting to be under? The UK is horrible. The US seems workable. The Netherlands is fantastic if you can afford the hosting fees. Others? (I fear languages going to the countries they're spoken in would be a disaster for NPOV.)
Multiple forks: No-one will let a single organisation be the only Wikipedia host again. So we'll end up with multiple forks for the content. In the short term we'll have gaffer-and-string kludges for content merging ... and lots of POV forking. A Foundation collapse would effectively "publish" Wikipedia as of the collapse date — or as of the previous good dump — as the final result of all this work.
(The English Wikipedia community could certainly do with a reboot. Hopefully that would be a benefit. It could, of course, get worse.)
In the longer term, for content integrity, we'll need a good distributed database backend. (There's apparently-moribund academic work to this end, and Wikileaks note they'll need something similar.)
Worst case scenario: A 501(c)(3) can only be eaten by another 501(c)(3), but the assets of a dead one (domains, trademarks, logos, servers) can be bought by anyone. Causing the Foundation to implode could be a very profitable endeavour for a commercial interest, particularly if they smelt blood in the water.
Second worst case scenario: The Wikimedia Foundation's assets (particularly the trademarks and logos) go to another 501(c)(3): Google.org. Wikipedia's hosting problems are solved forever and Google further becomes the Internet. Google gets slack about providing database dumps ...
What we need:
- Good database dumps more frequently. This is really important right now. If the Foundation fails tomorrow, we lose the content.
- People to want to and be able to BitTorrent these routinely.
- Backups of the user database.
- A user identification mechanism that isn't a single point of failure.
- Multiple sites not just willing but ready to host it.
- Content merging mechanisms between the multiple redundant installations (a bare-bones sketch follows this list).
- A good distributed database backend.
- The trademarks to become generic should the Foundation fail.
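On the content-merging item above: at its very crudest, merging between forks looks something like a whole-page three-way merge with conflict flagging. The page text in this sketch is made up, and real merging would have to work at a much finer grain, but it shows the shape of the problem:

```python
def merge_page(base, ours, theirs):
    """Trivial three-way merge at whole-page granularity.

    Returns (merged_text, conflict). base is the last revision both forks
    share; ours and theirs are each fork's current text.
    """
    if ours == theirs:
        return ours, False      # both forks agree
    if ours == base:
        return theirs, False    # only the other fork edited the page
    if theirs == base:
        return ours, False      # only we edited the page
    return ours, True           # both edited: needs a human, or a cleverer merger

merged, conflict = merge_page(
    base="The cat sat on the mat.",
    ours="The cat sat on the mat. It purred.",
    theirs="The black cat sat on the mat.",
)
print(conflict)  # True: both forks touched the page since the common base
```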
I'd like your ideas and participation here. What do we do if the Foundation breaks tomorrow?
(See also the same question on my Wikimedia blog.)
Correction: Google.org is not a 501(c)(3). So it couldn't gobble up Wikimedia directly.
(no subject)
Date: 2007-04-10 06:07 pm (UTC)
1) It runs out of money. Given that it's been pretty close to 'out of money' all along, this doesn't seem like a high hurdle, unless, say, you get a multi-zillion-squillion-dollar legal judgment against you, in which case you're boned unless you can shit a pricey lawyer or three. So, let's forget that for now.
2) Infighting Schism Heresy Insurrection Fork. This doesn't seem terribly likely to me either, but there's probably procedural incorporation-y things you can do to ensure that no random mad genius (not even a founding random mad genius) can take his ball and go home in such a fashion that fucks up everything. And I assume that a lot of that stuff has already happened, so let's forget that too.
3) Catastrophic infrastructure failure. This seems like the only likely possibility, and the one that you spend the most time worrying about, although it's really just an edge-case of problem #1. You're a huge dork, so surely you can suggest all of the ways to ensure that it takes at least a statistically impossible set of coincidences to fuck up everything beyond the point of repair, so that's not really much of an issue either...
...so, what problem is it exactly that we're trying to solve here, anyways?
(no subject)
Date: 2007-04-10 06:20 pm (UTC)
Other than the trademark issue (which is a big one), what do we lose, in practice, if we lose everything but a text dump? (Heck, lose the edit history, lose the talk pages, lose all non-article namespaces.)
As I see it, we lose the images, where most of our copyright problems exist. We lose the written legacy of wankery. We have two million encyclopedia articles of various qualities and, in one way or another, a clear sense of what went wrong. We might have a lapse of a few days before the mirrors go up, but after that, it seems to me, we have a good position - a dozen or so forks that have to start with two million encyclopedia articles and figure out a way to make something useful out of them. Some of them will quickly try to recreate the structures of en. Others will start working in different ways, and hopefully do a better job of it.
There is, to my mind, a larger problem for obscure language encyclopedias, but it tends to be my sense that if an encyclopedia wouldn't mutate into competing forks it probably doesn't have a healthy enough community to survive the experience productively.
(no subject)
Date: 2007-04-10 06:43 pm (UTC)
The main thing at present, though, is to have a good database dump at all. (Which is why I made sure to ask Brion.)
(no subject)
Date: 2007-04-10 07:39 pm (UTC)

(no subject)
Date: 2007-04-10 07:45 pm (UTC)
If Commons could be preserved that would be nice, but incidental - the preference would be to start with Commons from the beginning.
(no subject)
Date: 2007-04-10 07:47 pm (UTC)
I'd love to obliterate all the fair use shit.
(no subject)
Date: 2007-04-10 07:54 pm (UTC)

(no subject)
Date: 2007-04-12 11:21 am (UTC)

(no subject)
Date: 2007-04-12 12:13 pm (UTC)
Is there a way to fork at all, that's in compliance with the GFDL? If not, then the GFDL isn't a free licence on a wiki, even without invariant sections.
(no subject)
Date: 2007-04-10 07:04 pm (UTC)
If the problem is internecine warfare within the Foundation, well, perhaps a bloodless coup is called for: get a trusted person to change passwords and seize control.
(no subject)
Date: 2007-04-10 07:05 pm (UTC)
Now, the English wiki dump report (http://download.wikimedia.org/enwiki/20070402/) shows that the dump of "All pages, current versions only" is 4.1 GB in size.
Now, while you would lose the revision history of the content, you would have a complete copy of the current version, which is better than nothing, and actually a good place to start in the event of a 'catastrophic disaster'. It is also of a size that would fit onto a DVD. So, one snapshot burnt to DVD means that you at least have the core content of the site.
The stub metadata comes in at about 4.8 gigs. Another DVD and that's the revision history snapshotted.
The rest of it looks like it'd fit easily onto yet another DVD.
So, three DVDs and you have a complete current snapshot of the database (and a certain level of peace of mind). Get someone with full direct access to the systems to do an fssnap and burn that onto DVD, and then store the DVDs somewhere safe.
After that, see if you can get an incremental change file comprising only pages that have been added, deleted or updated since the last full snapshot. That should be significantly smaller, and by applying it to the current snapshot you would be able to recreate the entire database to the point of the last backup.
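A rough sketch of how you might produce that incremental list, assuming two "current versions only" dumps in the MediaWiki XML export format (the tag names below follow that schema; the file names are placeholders). It streams each dump, records the revision timestamp per page title, and then diffs the two maps:

```python
import bz2
import xml.etree.ElementTree as ET

def page_timestamps(dump_path):
    """Map page title -> timestamp of its revision in a 'current versions
    only' dump, assumed to be MediaWiki export XML compressed with bz2.
    Namespace prefixes on tags are stripped for simplicity."""
    stamps = {}
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.rsplit("}", 1)[-1] != "page":
                continue
            title = stamp = None
            for child in elem.iter():
                tag = child.tag.rsplit("}", 1)[-1]
                if tag == "title":
                    title = child.text
                elif tag == "timestamp":
                    stamp = child.text
            if title is not None:
                stamps[title] = stamp
            elem.clear()  # keep memory roughly bounded while streaming
    return stamps

# Placeholder file names for the previous and current snapshots.
old = page_timestamps("enwiki-old-pages-articles.xml.bz2")
new = page_timestamps("enwiki-new-pages-articles.xml.bz2")

added_or_changed = sorted(t for t, s in new.items() if old.get(t) != s)
deleted = sorted(t for t in old if t not in new)
print(len(added_or_changed), "pages to refresh;", len(deleted), "pages gone")
```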
As for BitTorrent, it is a valid way of doing things. With three DVDs' worth of content you'd be able to do it (you should see how big some of the movie files available through BT are).
You would have to worry about the issue of "ownership" of the data. If some other organisation DID want to 'acquire' Wikipedia, then all they'd have to do is jack into the torrent and they'd have a copy of your 'product'. So, things such as copyright or encryption of the data would need to be thought about.
No-one will let a single organisation be the only Wikipedia host again. So we'll end up with multiple forks for the content
This could make the whole backup concept easier, as you could assign certain "parts" of the database to each individual system. For example, stuff that is labelled as "Geography" would get put onto System A while stuff marked as "History" would go to System B. (Horribly simplistic example.) It would be an administrative nightmare in some ways, as trying to ensure all parts of the wiki are available would be a job in itself. However, it would mean that backups would (should?) be easier to manage and distribute. It also has the (sort of) benefit that if one system was to crash then the whole of the wiki would not be lost, just that one part. You could then use the central wiki "site" as the authoritative index that people would go to to search for the items they want and then be directed to the satellite systems.
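Rather than hand-assigning subject areas, one mechanical way to split pages across systems is to hash the title, so every party computes the same assignment without a central allocation table. A sketch with invented host names (caveat: adding or removing a system reshuffles most pages unless you move to proper consistent hashing):

```python
import hashlib

MIRRORS = ["system-a.example.org", "system-b.example.org", "system-c.example.org"]

def home_for(title):
    """Pick the system responsible for a page by a stable hash of its title."""
    digest = hashlib.sha1(title.encode("utf-8")).digest()
    return MIRRORS[int.from_bytes(digest[:4], "big") % len(MIRRORS)]

for t in ["History of France", "Python (programming language)", "Geography of Wales"]:
    print(t, "->", home_for(t))
```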
As for the other stuff, I haven't got a scooby!
(no subject)
Date: 2007-04-10 07:14 pm (UTC)
The image database on en: probably isn't worth the disk space - however, to lose Commons would be terrible.
The en:image collection also isn't hostable outside the US because of all the fair use shit we can't get rid of.
If there are no internal offsite backups the sysadmins want shooting.
ps did you read the stuff re:wikileaks from the guy who runs cryptome? He reckons it's a con.
(no subject)
Date: 2007-04-10 07:23 pm (UTC)

(no subject)
Date: 2007-04-10 09:19 pm (UTC)
An efficient protocol for sharing changes would give you:
- an efficient means by which I can keep my local archive in sync with Wikipedia
- an efficient way of making a distributed Wikipedia so you didn't need a distributed database
- potentially, an efficient way for me to keep a fork mostly-in-sync
When we're in the pub sometime I'll describe how Monotone's netsync protocol works...
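Not Monotone's actual netsync, just the general shape of the idea: each side advertises the set of revision hashes it holds, and only revisions the other side lacks cross the wire. A toy sketch with in-memory archives standing in for the network:

```python
import hashlib

def rev_hash(text):
    """Content-address a revision; a stand-in for a real revision identifier."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def sync(local, remote):
    """One reconciliation round between two archives, each a dict of
    hash -> revision text. Only revisions the other side lacks are copied."""
    want_from_remote = remote.keys() - local.keys()
    want_from_local = local.keys() - remote.keys()
    for h in want_from_remote:
        local[h] = remote[h]
    for h in want_from_local:
        remote[h] = local[h]
    return len(want_from_remote), len(want_from_local)

a = {rev_hash(t): t for t in ["rev one", "rev two"]}
b = {rev_hash(t): t for t in ["rev two", "rev three"]}
pulled, pushed = sync(a, b)
print(pulled, pushed)  # 1 1: each side had one revision the other lacked
```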
(no subject)
Date: 2007-04-10 09:54 pm (UTC)

(no subject)
Date: 2007-04-11 05:25 am (UTC)

Out of the box suggestion
Date: 2007-04-11 07:20 am (UTC)
Work out how much it costs to:
- buy external hard drives
- pay someone to copy dumps onto them
- post them to wherever
Then put a copy of it on eBay for a "buy it now" price, making it clear that this is an experiment to assess whether people are interested in paying for copies of the database for posterity's sake. It should ideally come with enough instructions to let the buyer make it useful to them (i.e. how to configure it for sophisticated local text search or other things you can't do with the online version). Depending on how much it costs, I'm potentially willing to go backstop on that one and cough up the money if no geeks think it's cool to buy a copy of Wikipedia. If you can make it break even or vaguely profitable to sell copies of the database, you largely solve the problem of copies of the database being saved for posterity.
In an ideal scenario, you need a distributed model where you end up with many copies of the database spread over the world in local altruistic hosting centres. You may be able to convince ISPs to take a copy of it and mirror it on a schedule; then, when requests come in from specific subnets supplied by the ISP, you redirect them back to the ISP's mirror. This saves the ISP using its Internet link for the content, and it reduces your hosting requirements too. There is a real win for the ISPs, which is why they might do it, and they can potentially claim to be helping you for a bit of PR spin.
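The subnet-to-mirror redirection could start out as simple as this sketch; the subnets and mirror URLs are invented for illustration, and Python's ipaddress module does the containment test:

```python
import ipaddress

# Invented mapping from the subnets an ISP registered to its local mirror.
MIRRORS = {
    ipaddress.ip_network("192.0.2.0/24"): "http://wikipedia-mirror.example-isp.net",
    ipaddress.ip_network("198.51.100.0/24"): "http://mirror.another-isp.example",
}
DEFAULT = "http://en.wikipedia.org"

def mirror_for(client_ip):
    """Send a client to its ISP's mirror when its address falls inside a
    registered subnet; otherwise fall back to the main site."""
    addr = ipaddress.ip_address(client_ip)
    for net, url in MIRRORS.items():
        if addr in net:
            return url
    return DEFAULT

print(mirror_for("192.0.2.57"))   # served by the first ISP's mirror
print(mirror_for("203.0.113.9"))  # falls through to the main site
```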
An interesting aside. This is from memory; I have no references to back it up. Apparently US Government departments can't hold copyright. The CIA logo was designed by, and is owned by, an external graphic design company; otherwise anyone would be able to legitimately reproduce the CIA logo without breaching copyright. I'm sure there are still issues with breaking laws by using it to pretend to be them, but it's still something you don't want them to be able to do. Could you start a small company owned by the Wikimedia players (maybe create it with 50 shares and divvy them out to 50 people who care about Wikipedia), sell all the important intellectual property such as logos to this company for a token price, and grant Wikimedia the right to use these logos from now until the end of time for no cost?
I need to go to work to design my employer's IT systems for the next 8 hours, but I'll ponder this more this evening and see if I can make any more suggestions. Maybe there will be some more posts here with some ideas to bounce off.
(no subject)
Date: 2007-04-11 07:55 am (UTC)

(no subject)
Date: 2007-04-11 01:01 pm (UTC)
myself, you, the French and Germans, everyone involved.

Wikipedia vs Encyclopaedia Britannica
Date: 2007-04-12 01:45 pm (UTC)
Encyclopaedia Britannica never thought that an open source product like Wikipedia would seriously challenge the credibility of its brand. They were wrong, and Encyclopaedia Britannica's staff seriously misread the global market. They are now very concerned about the widespread use of a free Wikipedia vs their paid subscription model. Industry analysis shows that the accuracy of both encyclopedic databases is similar.
It is interesting that Wikipedia founder Jimmy Wales is developing a new search engine. It is the combination of a) improved search engines and b) the success of Wikipedia that has put financial pressure on Encyclopaedia Britannica over recent years. Many institutions and individuals are questioning the need to pay to subscribe to Encyclopaedia Britannica when the content is free on the internet. Google even has free direct links to Encyclopaedia Britannica's main database!