This user account is a bot operated by The Anome (talk). It is used to make repetitive automated or semi-automated edits that would be extremely tedious to do manually, in accordance with the bot policy. The bot is approved and currently active – the relevant request for approval can be seen here. Administrators: if this bot is malfunctioning or causing harm, please block it.
Note: Blocking will stop further edits. The bot will intermittently retry errors for several minutes, but should then shut itself down automatically until restarted manually; please use a block of ten minutes or longer to be sure of stopping it.
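For reference, the retry-then-shutdown behaviour is roughly of the following form (a minimal sketch, not the bot's actual code; the constants and helper names are illustrative):

```python
import time

MAX_RETRIES = 10   # consecutive failures tolerated (illustrative)
RETRY_DELAY = 60   # seconds between retries; total retry window ~10 minutes

def run_edit_loop(edit_queue, try_edit):
    """Process queued edits; shut down after sustained failures (e.g. a block)."""
    failures = 0
    for edit in edit_queue:
        while True:
            if try_edit(edit):           # returns True on success
                failures = 0
                break
            failures += 1
            if failures >= MAX_RETRIES:  # still failing after ~10 minutes
                print("Persistent errors -- shutting down until restarted.")
                return
            time.sleep(RETRY_DELAY)
```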
This bot is designed to add standardized machine-readable geodata records to relevant articles in the English-language Wikipedia, using GNS and GNIS data, OSGB grid coordinates in UK articles, plaintext geodata scraped from article text, and interwiki-linked geotag data from other-language Wikipedias. -- The Anome 12:13, 22 September 2007 (UTC)
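For illustration, the records emitted are {{coord}}-style templates; generating one from a parsed record might look like this (a minimal sketch, assuming decimal-degree input; the function and field names are hypothetical, not the bot's actual code):

```python
def make_coord_template(lat, lon, type_=None, region=None, display="title"):
    """Format decimal-degree coordinates as a {{coord}} template string."""
    parts = ["{:.4f}".format(lat), "{:.4f}".format(lon)]
    geohack = []
    if type_:
        geohack.append("type:" + type_)      # e.g. "city", "mountain"
    if region:
        geohack.append("region:" + region)   # ISO 3166 code, e.g. "GB"
    if geohack:
        parts.append("_".join(geohack))
    return "{{coord|" + "|".join(parts) + "|display=" + display + "}}"

# e.g. make_coord_template(51.5074, -0.1278, type_="city", region="GB")
# -> "{{coord|51.5074|-0.1278|type:city_region:GB|display=title}}"
```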
Currently backfilling a number of corner cases missed by earlier over-cautious heuristics, using:
machine parsing of plaintext geodata found in dumps (see the sketch after this list)
automatched GNS data
interwiki-matched machine-readable geodata from other language editions
This is very laborious for the bot, as it requires re-scanning large numbers of false positives, and it will result in only a few hundred articles being geocoded; but machine time is cheap, the re-scans are necessary in any case, and this will lay the foundations for larger systematic efforts to come later.
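A minimal sketch of the plaintext parsing mentioned above (one common degrees-minutes-seconds form only; the real scanner must handle many more notations):

```python
import re

# Matches e.g. "45°30′15″N 73°34′20″W"; degree/minute/second marks
# vary widely in practice, so this is one pattern of many.
DMS = re.compile(
    r"(\d{1,3})°(\d{1,2})′(?:(\d{1,2})″)?([NS])\s*"
    r"(\d{1,3})°(\d{1,2})′(?:(\d{1,2})″)?([EW])"
)

def to_decimal(d, m, s, hemi):
    value = int(d) + int(m) / 60.0 + int(s or 0) / 3600.0
    return -value if hemi in "SW" else value

def scan_for_coords(text):
    """Yield (lat, lon) pairs for plaintext DMS coordinates found in text."""
    for match in DMS.finditer(text):
        d1, m1, s1, ns, d2, m2, s2, ew = match.groups()
        yield to_decimal(d1, m1, s1, ns), to_decimal(d2, m2, s2, ew)
```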
Standardize existing geotags: "coor title *" is now done, "coor *" is pending (a conversion sketch follows this list).
Finish adding "coord missing" to all eligible articles.
Ancient sites should be taggable with "coord missing", while still being blocked from automatic coordinate assignment.
Go back and use CatScan to find any remaining franchises mis-tagged as "coord missing".
Rebuild article state map from log file and other stored data.
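The "coor title *" standardization mentioned above amounts to rewrites of roughly this shape (a sketch for the dms variant only; the real conversion covers all the coor templates and their parameter forms, and must cope with nested markup that this simple regex would not):

```python
import re

# {{coor title dms|D|M|S|N/S|D|M|S|E/W|params}} -> {{coord|...|display=title}}
COOR_TITLE_DMS = re.compile(r"\{\{coor title dms\|([^}]*)\}\}", re.IGNORECASE)

def standardize_coor_title_dms(wikitext):
    """Rewrite old coor-title-dms tags as equivalent coord tags."""
    return COOR_TITLE_DMS.sub(
        lambda m: "{{coord|" + m.group(1) + "|display=title}}",
        wikitext,
    )

# e.g. "{{coor title dms|51|30|26|N|0|7|39|W|region:GB}}"
# ->   "{{coord|51|30|26|N|0|7|39|W|region:GB|display=title}}"
```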
Interwiki:
In the absence of up-to-date Kolossus data, start using the externallinks table (via the API) to live-scan non-en: Wikipedia editions for URLs, in order to obtain interwiki patterns (see the sketch after this list)
Use full interwiki data to regenerate fuller tags where only KML data was used for earlier tagging.
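The externallinks live-scan mentioned above could use the MediaWiki exturlusage API module, along these lines (a present-day sketch; the URL pattern in the usage example is illustrative):

```python
import requests

def exturl_pages(lang, url_pattern, limit=100):
    """List pages on a given language edition that cite a URL pattern,
    via the MediaWiki exturlusage API module."""
    api = "https://{}.wikipedia.org/w/api.php".format(lang)
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": url_pattern,   # search string, without protocol
        "euprotocol": "http",
        "eulimit": limit,
        "format": "json",
    }
    data = requests.get(api, params=params).json()
    return [page["title"] for page in data["query"]["exturlusage"]]

# e.g. exturl_pages("de", "fallingrain.com")
# -> de: articles citing that gazetteer, candidates for interwiki matching
```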
Consistency and correctness:
Use 1-degree-tile binning to look for outliers (see the sketch after this list)
Look for misuse of coord tags for off-planet locations: report to WikiProject for fixing
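The tile-binning outlier check might be sketched as follows (illustrative only: it flags articles that sit alone in a tile while the rest of their category clusters elsewhere; the thresholds are arbitrary):

```python
import math
from collections import Counter, defaultdict

def tile(lat, lon):
    """Bin a coordinate into its 1-degree tile."""
    return (math.floor(lat), math.floor(lon))

def find_outliers(places, min_cluster=5):
    """places: iterable of (title, category, lat, lon).
    Yield articles whose tile is isolated within their category."""
    by_category = defaultdict(list)
    for title, cat, lat, lon in places:
        by_category[cat].append((title, tile(lat, lon)))
    for cat, members in by_category.items():
        counts = Counter(t for _, t in members)
        for title, t in members:
            # A tile occupied by this article alone, in a category whose
            # other members cluster elsewhere, is a likely outlier.
            if counts[t] == 1 and len(members) >= min_cluster:
                yield cat, title, t
```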
Matching:
Hierarchical matching with disambiguation by subnational entities; rejected some time ago as ineffective, but it may have become feasible given the greater systematization of navboxes over the last year
Open research topic: Bayesian inference of relative locality from the link graph -- this may be an effective way of handling the above; use places with known locations as a training set (see the sketch after this list)
Properly handle undersea features and disputed territories with no applicable recognized country
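The link-graph idea above might be prototyped as an iterative weighted-centroid estimate rather than full Bayesian inference (purely illustrative):

```python
def estimate_locations(links, known, iterations=10):
    """links: dict article -> set of linked articles.
    known: dict article -> (lat, lon) for the training set.
    Iteratively place unknown articles at the centroid of their located
    neighbours -- a crude stand-in for Bayesian locality inference."""
    est = dict(known)
    for _ in range(iterations):
        for article, neighbours in links.items():
            if article in known:
                continue                      # training set stays fixed
            coords = [est[n] for n in neighbours if n in est]
            if coords:
                est[article] = (
                    sum(lat for lat, _ in coords) / len(coords),
                    sum(lon for _, lon in coords) / len(coords),
                )
    return est
```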
Types of places not yet keyword-matched during graph traversal (a keyword-pattern sketch follows this list):
Casinos
Resorts
Historic districts [?]
Ports and harbo[u]rs by country
Bus and some metro stations
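Picking these up would mean extending the traversal's keyword table, roughly as below (a sketch; the patterns and geotag type codes are assumptions, not the bot's actual tables):

```python
import re

# Category-name patterns -> geotag type, for the classes listed above.
KEYWORD_PATTERNS = [
    (re.compile(r"\bcasinos?\b", re.I), "landmark"),
    (re.compile(r"\bresorts?\b", re.I), "landmark"),
    (re.compile(r"\bhistoric districts?\b", re.I), "landmark"),
    (re.compile(r"\b(?:ports|harbou?rs) (?:of|in|by)\b", re.I), "landmark"),
    (re.compile(r"\b(?:bus|metro) stations?\b", re.I), "railwaystation"),
]

def match_category(name):
    """Return the geotag type for a category name, or None if no match."""
    for pattern, type_ in KEYWORD_PATTERNS:
        if pattern.search(name):
            return type_
    return None
```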
New data sources:
Collect lists of country-specific coordinate data
Mine geodata from images included in articles (thanks to User:Planemad for the suggestion)
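The image-geodata idea could draw on EXIF GPS tags, roughly as follows (a sketch using the Pillow library; many uploaded images lack or mangle these tags):

```python
from PIL import Image

GPSINFO_TAG = 34853  # EXIF tag id for the GPS sub-IFD

def dms_to_decimal(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) rationals plus an
    N/S/E/W reference to signed decimal degrees."""
    degrees = float(dms[0]) + float(dms[1]) / 60.0 + float(dms[2]) / 3600.0
    return -degrees if ref in ("S", "W") else degrees

def exif_gps(path):
    """Return (lat, lon) from an image's EXIF GPS tags, or None."""
    exif = Image.open(path).getexif()
    gps = exif.get_ifd(GPSINFO_TAG)
    if not gps:
        return None
    # GPS IFD tags: 1 = LatitudeRef, 2 = Latitude, 3 = LongitudeRef, 4 = Longitude
    try:
        return dms_to_decimal(gps[2], gps[1]), dms_to_decimal(gps[4], gps[3])
    except KeyError:
        return None
```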
With >70,000 data points, I now have enough data to do a spatial analysis of the category tree, and to generate lists of possibly misclassified or mislocated outliers. The cleaned-up bounding data could then be used as a Bayesian classifier for future work. -- The Anome 10:14, 24 August 2007 (UTC)
The category+link graph may be a better choice for this. -- The Anome (talk) 13:59, 12 October 2008 (UTC)
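Either way, a first screening pass might compute per-category centroids and flag far-out members (a sketch; flat-earth distances and an arbitrary sigma threshold, good enough for screening):

```python
import math
from collections import defaultdict

def category_outliers(tagged, sigma=3.0):
    """tagged: iterable of (title, category, lat, lon).
    Yield members more than `sigma` standard deviations from their
    category's centroid (flat-earth distances, adequate for screening)."""
    groups = defaultdict(list)
    for row in tagged:
        groups[row[1]].append(row)
    for cat, rows in groups.items():
        if len(rows) < 10:
            continue                          # too small to estimate spread
        clat = sum(r[2] for r in rows) / len(rows)
        clon = sum(r[3] for r in rows) / len(rows)
        dists = [math.hypot(r[2] - clat, r[3] - clon) for r in rows]
        mean = sum(dists) / len(dists)
        std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
        for r, d in zip(rows, dists):
            if std and d > mean + sigma * std:
                yield cat, r[0], d
```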