MediaWiki is a powerful democratic publishing platform, making it a rich resource for computation over user-generated content. Wikipedia is by far the best-known site powered by this platform, but because MediaWiki is open source and freely available, it also powers a number of other notable sites. Tens of thousands of wikis are listed as powered by MediaWiki on wikiindex.org. Even Project Gutenberg is based on MediaWiki. Wikia, where I work, hosts over 200,000 wikis covering a wide variety of subjects and domains -- from gaming, to entertainment, to things like genealogy and music lyrics. And let's not forget that the Wikimedia Foundation has several other projects beyond Wikipedia, all of which run on MediaWiki. In other words, if you develop with MediaWiki in mind, instead of just Wikipedia, you make a lot more content available to your project. For instance, content on Wikia, just like on Wikipedia, is Creative Commons licensed, which means you can use it for a heck of a lot of purposes. To hammer the point home: Wikipedia currently has a little under 28.5 million pages; Wikia has over 120 million pages across all of its wikis.
Imagine developing your text scraping application against a single newsgroup, only to find out that you could have used all of Usenet! That's exactly what tends to happen in the wiki space. Numerous projects use Wikipedia as a textual source, and code specifically against Wikipedia to that end. You can see it in class and variable naming, as well as in hard-coded Wikipedia URLs in classes that really could consume any MediaWiki-based site's data.
Fortunately, if you're working with Wikipedia's API, you're well on your way to developing towards the MediaWiki platform, with a handful of minor tweaks. In many cases, it should just be a matter of updating the API endpoint you're accessing. Here are a few additional tips for engineering your solution in a manner that takes advantage of the wide range of content available from resources powered by MediaWiki.
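To illustrate how small that tweak can be, here's a minimal sketch of endpoint-agnostic request building. The `api_url` helper is hypothetical, and which hosts you point it at is up to you; the point is that only the base URL changes between wikis.

```python
from urllib.parse import urlencode

def api_url(base, **params):
    """Build a MediaWiki API request URL against any endpoint."""
    params.setdefault("format", "json")
    return base + "?" + urlencode(sorted(params.items()))

WIKIPEDIA = "https://en.wikipedia.org/w/api.php"
# A Wikia wiki exposes the same api.php, just under a different host:
WOOKIEEPEDIA = "https://starwars.wikia.com/api.php"

# The same query, two different resources:
url = api_url(WOOKIEEPEDIA, action="query", titles="Chewbacca", prop="links")
```

The rest of your request and parsing logic stays identical; the endpoint becomes configuration rather than code.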
Use the API, Seriously
Try to avoid screen-scraping any MediaWiki-based site. Both you and the host you're requesting from will save a significant amount of bandwidth and CPU by requesting only the data that you're interested in. Some sites will load complex skins, ads, and other additional markup, when all you may want is a list of the pages that the page you're loading links to. There's a specific API method for that.
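As a sketch, that "list of links" lookup is a single small JSON request against the API's `prop=links` query module, rather than a full HTML page load. The helper names here are illustrative, not from any library:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def fetch_links(endpoint, title):
    """Ask the API for one page's outbound links via prop=links."""
    params = urlencode({
        "action": "query", "prop": "links", "titles": title,
        "pllimit": "max", "format": "json",
    })
    with urlopen(endpoint + "?" + params) as resp:
        return extract_links(json.load(resp))

def extract_links(response):
    """Pull link titles out of a prop=links query response."""
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        titles.extend(link["title"] for link in page.get("links", []))
    return titles
```

Compare the size of that JSON payload to a rendered article page with its skin and ads, and the bandwidth argument makes itself.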
Another instance of this over-engineering, and of squandering server and network resources, is crawling a MediaWiki-based site. Many such crawlers are hand-rolled, and prone to accidentally DDoSing the site they target. Odds are, you can get all of the data you need from the API alone, and you'll get it a lot faster. Before architecting a solution to the problem of getting data from a MediaWiki-based site, take a look at the MediaWiki API documentation, and try to find a way to utilize it effectively and with the least number of requests. This helps make your solution scalable and keeps your data source available.
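Minimizing requests usually means leaning on the API's continuation protocol: one logical query becomes a handful of batched responses instead of thousands of page fetches. Here's a sketch; the `fetch` callable is injected so the loop stays transport-agnostic, and the `query-continue` key matches the 1.19-era response shape (newer releases moved to a `continue` key).

```python
def all_results(fetch, params):
    """Yield each batched API response, following 'query-continue' until done.

    fetch: a callable taking a params dict and returning the decoded JSON
    response -- e.g. a thin wrapper around urllib against your endpoint.
    """
    params = dict(params)  # don't mutate the caller's dict
    while True:
        data = fetch(params)
        yield data
        cont = data.get("query-continue")
        if not cont:
            break
        # Fold the continuation parameters into the next request,
        # e.g. {'links': {'plcontinue': '1234|0|Foo'}}
        for module in cont.values():
            params.update(module)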
Support Minimum and Maximum Versions
Not all MediaWiki-based sites are alike. Project Gutenberg is still on 1.13. Wikia is currently running 1.19.2. Version 1.20 is in beta, with 1.21 following close on its heels. In fact, Wikipedia is currently powered by the Wikimedia Foundation's latest bi-weekly release of 1.21. API functionality does change over time, so it's worth working off of the appropriate version's documentation.
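You don't have to guess at the version, either: the API's `meta=siteinfo` query returns a `generator` string like `"MediaWiki 1.19.2"`. A sketch of gating your features on it (the helper names are mine, not a library's):

```python
import re

def parse_version(generator):
    """Extract (major, minor) from a siteinfo 'generator' string."""
    match = re.search(r"MediaWiki (\d+)\.(\d+)", generator)
    if match is None:
        raise ValueError("unrecognized generator string: %r" % generator)
    return int(match.group(1)), int(match.group(2))

def supports(generator, minimum):
    """True if the remote wiki meets your minimum supported version."""
    return parse_version(generator) >= minimum
```

With that in hand, your client can degrade gracefully on a 1.13 wiki instead of failing on an API module that didn't exist yet.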
Be Smart with Design Patterns
There are a number of ways you can design your solution to be resource-agnostic. Class inheritance may be a good way to handle this, assuming there are some subtle variations between what you want from Wikipedia and what you want from Wookieepedia. One thing to keep in mind is that since MediaWiki supports extensions, different resources may have different API functionalities, despite having identical versions. This means that you can develop towards a baseline behavior with a parent MediaWiki class, supporting API functionality of a given version, and then extend that class for different resources, based on what extensions they have.
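A minimal sketch of that inheritance approach, with entirely illustrative class and method names: the parent encodes baseline API behavior, and a subclass layers on whatever a particular resource's extensions add.

```python
class MediaWiki(object):
    """Baseline client for any MediaWiki API endpoint of a given version."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def query_params(self, **params):
        """Parameters for a core action=query request, common to all wikis."""
        params.update(action="query", format="json")
        return params


class WikiaSite(MediaWiki):
    """A Wikia wiki: same core API, plus room for extension-specific methods."""

    def __init__(self, subdomain):
        super(WikiaSite, self).__init__(
            "https://%s.wikia.com/api.php" % subdomain)
    # Methods wrapping Wikia-specific extension APIs would go here.
```

The baseline class never mentions a specific site, so adding support for a new wiki is a subclass, not a rewrite.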
If class inheritance doesn't fit your development paradigm, you may want to consider using dependency injection to control the resource endpoint and supported functionalities. Which approach is most suitable really depends on what you're trying to get done. For an example of what moving from strict Wikipedia support to broad MediaWiki support looks like, you can take a look at the pull request I issued for Pattern, a Python-based web mining library.
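A sketch of the dependency-injection alternative: rather than subclassing, hand one client class its endpoint and its set of supported API modules at construction time. Again, the names here are illustrative.

```python
class WikiClient(object):
    """One client class; the wiki's capabilities are injected, not inherited."""

    def __init__(self, endpoint, modules=("query",)):
        self.endpoint = endpoint
        self.modules = frozenset(modules)

    def build_request(self, action, **params):
        """Return (endpoint, params) for an action this wiki supports."""
        if action not in self.modules:
            raise NotImplementedError(
                "%r is not supported at %s" % (action, self.endpoint))
        params["action"] = action
        params.setdefault("format", "json")
        return self.endpoint, params

# The per-wiki differences live in configuration, not in code:
wikipedia = WikiClient("https://en.wikipedia.org/w/api.php",
                       modules=("query", "parse"))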
Now that you have some ideas on how to more effectively access and interact with MediaWiki-based platforms, here are a few ideas for what you can do once numerous web mining resources are at your disposal:
- Compare a resource with a scholarly tone (e.g. Wikipedia) to a resource with a comedic tone (e.g. Uncyclopedia) for different topics or discourse markers
- Compare graph density between different domain areas
- Generate language models for a specific wiki community
- Aggregate information about a video game across general gaming, game-specific, and cheat-code-specific wikis
- Write a predictive model for categorizing new pages using their content
As far as working with user-generated content goes, MediaWiki provides an effective and collaborative way to expound on what interests you. This is what makes it such a compelling platform. If you harness the power of MediaWiki, rather than just the content of the Wikipedia community, you increase your access to UGC by an order of magnitude, essentially for free. It's a win-win for everybody.