The above diagram gives a high level overview of what my approach is to estimating the user's pose within an urban environment. I have so far been focusing on the computer vision side of things, trying to get a robust pose estimate from keypoint correspondences between what's seen through the camera's lens and a rectified image that serves as a natural marker. Getting this to work well on a mobile device has been quite a project in itself, and there are still plenty of challenges to solve there. But the past couple of weeks, I have been diving into the other sensors found on an iPad, iPhone, many of the top-of-the-line Android phones and tablets, and likely most portable media devices of the future: GPS, compass, gyroscopes, and accelerometers. These sensors can be combined to give a pose estimate as well (and this is extremely easy on a platform like iOS... the CoreMotion framework abstracts away a lot of the details, and I believe there is some rugged sensor fusion going on at the hardware level). Most of the existing "locative" augmented reality apps out there (like Layar, Wikitude, or Yelp Monocle) only use these sensors. This is problematic mainly because GPS does not give very precise or accurate position information, especially in an urban environment. GPS drifts, can be offset several meters due to multipath effects, and generally doesn't get you "close enough" to do a true pixel-perfect visual overlay onto the real world, so most apps that use sensor-based AR simply display floating information clouds and textual annotations rather than 3D graphics. Thus, I aim to combine vision- and sensor-based pose estimates for better results.
This video gives a nice overview of what the different sensors do and what they're each good and bad at.