Apr. 9th, 2008

reddragdiva: (Wikipedia)

(Geek post, call for computer scientist assistance.)

Last Thursday at London.PM, I got asked a lot why MediaWiki wikitext, as used in Wikipedia, doesn't have a WYSIWYG editor. The answer is that a WYSIWYG editor would need to know wikitext grammar, and there is no defined grammar. The MediaWiki "parser" is not actually a parser: it's a twisty series of regular expressions (PHP's flavour of PCRE).
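To make "a twisty series of regular expressions" concrete, here is a minimal sketch in Python of what a regex-pass converter looks like. The rules and the `render` function are hypothetical and heavily simplified; this is not MediaWiki's actual code (which is PHP), just an illustration of the general shape.

```python
import re

# Hypothetical, heavily simplified passes. These are NOT MediaWiki's real
# rules; they only illustrate the "series of substitutions" approach.
PASSES = [
    # Bold has to run before italic, because ''' contains ''.
    (re.compile(r"'''(.+?)'''"), r"<b>\1</b>"),
    (re.compile(r"''(.+?)''"), r"<i>\1</i>"),
    # Piped link [[Target|label]] has to run before the plain [[Target]] form.
    (re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]"), r'<a href="/wiki/\1">\2</a>'),
    (re.compile(r"\[\[([^\]|]+)\]\]"), r'<a href="/wiki/\1">\1</a>'),
]


def render(wikitext, passes=PASSES):
    """Apply each substitution in turn; the output of one pass feeds the next."""
    out = wikitext
    for pattern, replacement in passes:
        out = pattern.sub(replacement, out)
    return out


if __name__ == "__main__":
    print(render("'''bold''' and ''italic'' and [[Perl|the language]]"))
```

Nothing in there describes the language as a whole. The meaning of the markup lives in the order of the passes: run the same list reversed (`render(text, PASSES[::-1])`) and `'''bold'''` comes out differently. That is roughly why reverse-engineering a grammar from this kind of code is hard.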

So any grammar effort requires reverse-engineering that, and lots of people have tried and gotten 90% of the way before stalling. It doesn't help that wikitext is (I'm told) provably impossible to just put into a single lump of EBNF.

It occurred to me that there must exist tools to convert regexps into EBNF. And that if we can get it into even a few disparate lumps of hideous EBNF, there should be tools to take those and simplify them somewhat. (Presumably with steps to say what given bits mean.) Or possibly things other than EBNF, just as long as the result is parseable.

I am not (even slightly) a computer scientist, but some of you are. Does anyone have any ideas on this? Or pointers to anyone having done anything even remotely similar? Or knowledgeable friends they could point this query at?

(The goal is to replace the twisty series of regexps with something generated from the grammar. Tim Starling, MediaWiki's second-in-command, has said: "We can't change wikitext. Go away and write something that (a) covers almost all of it (b) is comparably fast in PHP." Which is harsh, but fair.)

Update: en_ki just made the obvious suggestion: the unit tests. See the docs on running maintenance scripts, the scripts themselves (look for parserTests), and the list of tests.
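The appeal of the unit tests is that they act as an executable spec: any replacement parser can be scored against the same input/output pairs. A rough sketch in Python of that idea follows. The `!! test` / `!! input` / `!! result` / `!! end` block layout is my assumption about how the test file is organised (check the real file before relying on it), and `parse_cases`, `run_cases` and the sample case are made up for illustration.

```python
# Sketch of a harness that scores a candidate converter against test cases.
# The "!! marker" block layout below is an assumption, not a verified spec.

SAMPLE = """!! test
Bold text
!! input
'''bold'''
!! result
<b>bold</b>
!! end
"""


def parse_cases(text):
    """Split a test file into dicts with 'test', 'input' and 'result' keys."""
    cases, current, section = [], None, None
    for line in text.splitlines():
        if line.startswith("!!"):
            marker = line[2:].strip()
            if marker == "test":
                # Start a new case; following lines are the test name.
                current, section = {"test": [], "input": [], "result": []}, "test"
            elif marker == "end" and current is not None:
                cases.append({k: "\n".join(v) for k, v in current.items()})
                current, section = None, None
            elif current is not None:
                # Any other marker opens a new section within the case.
                section = marker
                current.setdefault(section, [])
        elif current is not None and section is not None:
            current[section].append(line)
    return cases


def run_cases(cases, convert):
    """Return (name, expected, actual) for every case the converter fails."""
    failures = []
    for case in cases:
        actual = convert(case["input"])
        if actual != case["result"]:
            failures.append((case["test"], case["result"], actual))
    return failures


if __name__ == "__main__":
    # The identity function stands in for a (failing) candidate parser.
    for name, expected, actual in run_cases(parse_cases(SAMPLE), lambda s: s):
        print(f"FAIL {name}: expected {expected!r}, got {actual!r}")
```

Anything generated from a grammar could be plugged in as `convert`, and the failure list then shows how much of "almost all of it" is actually covered.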
