March 9, 2007 - AIML Distinguished Speaker Series Sam Roweis, PhD Making the sky searchable: Fast geometric hashing for automated astrometry Joined work with Ph.D students Dustin Lang and Keir Mierle http://astrometry.net (should "live" in a few weeks) +++ Other work: * statistical analysis / medical data : * automatic alignment of curves (time series) (in order to average them): unsupervised learning/HMM applications: aligmnent of multiple speaker audio / liquid chromatography curves * blindsensor fusion and identification: estimate the intensity/exposure of photos * projection/visualization of labeled data: good linear projections which keep things of the same class nearby +++ Problem: * take from a picture of the night sky where on the sky it came from +++ Rules: * start with a catalogue of stars and built an index from it (the index is used to locate new test images) +++ Issues: * high amount of data * noisy data * query images may contain some extra stars that are not in the catalogue (*distractors*) and conversly (*dropouts*) * scale +++ Robust matching: covering (almost) exactly except for distractors/dropout ie. explaining enough of the test image for the probability of it not matching dropping. +++ Solving the search problem: * exhaustive serarch absolutely impossible * *(Inverted) Index of Features*: - define a set of features for any view on the sky - build an index that tells us which views exhibit certain (combinations of) features values - idem webpage search (find "machine learning" = cross the indices of "machine" and "learning" - basically) - matching a test image: compute which features are present. Each feature generates a candidate list. Intersecting these lists zeroes inon the true matching views. Practically, any 2 lists that intercepts are usually enough. - geometric hashing: features have to be invariant toscale, rotation and translation,robusut to small positional noise features = relative positions of nearby quadruples of stars (*quads*) +++ *Quads* - quadruples chosen as minimum size that really tells something (enough memory to store all the features but they're still somehow informative) - relative positions: use a coordinate system definedby the most widely separated pair - 4 features (if A-B is the most widely separated pair, 2 coordinates for C, 2 coordinates for D) - only nearby starts - if A-B is the biggest pair, C and B have to lie in the circle of diameter AB +++ Training data: - US Naval Observatory (10^9 objects, mostly stars) - TYCHO (2.5.10^6 brightest stars) to compensate saturation in the USNO-B catalogue - browsable based on Google Maps - cut to get a spatially uniform set of the 150M brightest stars and galaxies: grid and take the K brightest sin each cell +++ Building the Index: - *kdtrees* recursive division of vector space in which each node is a bounding volume (a box or space) that countains a fraction of the data - then build the quads code +++ Locate a new test image: - identify objects (bright points) and createa list of their 2D positions - cycle through all possible quads (brightests firsts), compute the correspondig codes, look-up in the code kdtree (_and find matches with some tolerance_, made possible thanks to the kdtree structure). Note: you cannot reduce yourself to close points for quads because you don't know the scale - each code returns a candidate postition/rotation. When 2 quads agree on a candidate, go to the verification step - verification step: propose an alignment based on the quads hash alignments and check than p% of the bright points from the test image sit on top of the known card of the sky. p=50 (it's enough to make sure this alignment does not happen by chance) +++ Results: the Astronomy Picture of the Day (NASA) - locates it perfectly (even if part of the image is eclipsed, for instance by the Moon) +++ Results: SDSS (http://www.sdss.org) - no false positive - 99.84% solved - 99.27% could be solved with 60 brightest objects in the pic +++ Parameter settings - range search sizes in code-space, agreement and verification tolerances, etc... - tune with examining histograms of what happens across a number of known cases +++ Speed/Memory/Disk: - on a desktop computer - indexing 12hours, 2GB of memory, 100GB of disk - processing time << 1sec once the image is pre-processed (ie. objects detection has been performed) +++ Advantages: - No false positives - Produces a location header that is professional/standard so professionals can use amateur images - Basically saves time for astronomers - Distorsion (wide angle camera, etc...) does not really influe because quads are local - but you'll have to do some fitting at verification stage +++ Notes: - The tolerance ball of a given quad contains a few hundred images