Monday, May 23, 2011

Website + Documentation Draft

Working on the organization and layout for the toolkit website. Gave it a name, a look, and some bold words to describe the major functions.

Of course, I haven't even finished the code yet. But I think tackling the challenge of how to organize the documentation at this stage will guide me in refactoring the code and making it as clear as possible as I finish writing the first release.

The (home - download - docs - gallery - forum) navigation items come from a survey of the homepages for some of the toolkits that inspired this, such as ProcessingOpenFrameworks, and Cinder.
I want a demo video to go on the right side. I guess I need to make that some time, too.

The class overview page for "augment." Not as explicit as a UML diagram, not as verbose as a Javadoc or doxygen documentation, I'd like to present the most relevant information up front and hide the details until clicked upon. The cyan entries are the public interface-- either public methods or members with at least a getter/possibly a setter. The private stuff should be hidden by default. This could look really nice with some jQuery sliding menu magic.

Saturday, May 21, 2011

Aspect Ratio of a Rectangle in Perspective

One of the things I left unimplemented in my image tagger input screen was recovering the aspect ratio of the selected rectangle. Before, it just squished everything into a 640x480 image. But now, thanks to this paper,  I can automatically calculate the aspect ratio from a given set of four corners. The OpenCV implementation is below. Note the strange ordering of the rectangle's corners (M_i (i = 1...4), are (0, 0), (w, 0), (0, h), and (w, h) ).

// Get aspect ratio
// Input corners c0,c1,c2,c3 are given as a percent of the original image height/width
// Using equations from:
  cv::Mat A = (cv::Mat_(3,3) << 786.42938232, 0, imageSize.width/2,
            0, 786.42938232, imageSize.height/2,
  float k2, k3;
  float ratio;
  cv::Mat _ratio;
  cv::Mat n2, n3;
  cv::Mat m1 = (cv::Mat_(3,1) << imageSize.width * (float)c0.x, imageSize.height * (float)c0.y, 1);
  cv::Mat m2 = (cv::Mat_(3,1) << imageSize.width * (float)c3.x, imageSize.height * (float)c3.y, 1);
  cv::Mat m3 = (cv::Mat_(3,1) << imageSize.width * (float)c1.x, imageSize.height * (float)c1.y, 1);
  cv::Mat m4 = (cv::Mat_(3,1) << imageSize.width * (float)c2.x, imageSize.height * (float)c2.y, 1);
  k2 = (m1.cross(m4).dot(m3)) / ((m2.cross(m4)).dot(m3));
  k3 = (m1.cross(m4).dot(m2)) / ((m3.cross(m4)).dot(m2));
  n2 = (k2*m2) - m1;
  n3 = (k3*m3) - m1;
  _ratio = (n2.t()*(A.inv().t())*(A.inv())*n2) / (n3.t()*(A.inv().t())*(A.inv())*n3);
  ratio = sqrt(,0));

Monday, May 16, 2011

Offloading Processing to the Cloud!

I've long realized that truly city-wide exploration with AR would require some sort of client-server infrastructure. If an app were to contain all the possible facade-markers to recognize, it would require a single monolithic download. The reality is, the most interesting augmentations are going to require a network connection anyway (because they may be user-generated, or reflect up-to-the-minute information), and downloading only the facade-markers that are nearby will limit the app's initial size. This also means an app set to work in one city can be expanded into another without needing a new program, just need data.

Once there's a remote server in the mix, I realized I could use it to offload some of the image processing so the mobile device doesn't have to work so hard. This is especially important when I'm using Fern classifiers as they require a long training step (~1 minute on the device) that just isn't realistic in terms of user experience. So I wrote some server-side scripts to accept new facade images (obtained via an interface like the one I described earlier), process them, store their data in a database, and spit out stored facades that are near a user's current location. The diagram of how it all works is below:

A few fun things I'm trying out here:

First, I'm using Amazon EC2 which is awesome because I get root access on a virtual server somewhere in cloud-land. It's a little strange to get set up and wrap your head around data-persistence issues (i.e. If you "terminate" a server, everything goes bye-bye, but to "stop" it seems ok...) and it took a while to get everything set up (basically I started with a blank Ubuntu install, needed to get and build OpenCV, install Apache/MySQL/PHP) but now I'm happily working from the command line on a machine that exists mainly as an IP address.

Second, I'm writing the high-level API stuff in PHP because it's really easy to process HTTP requests, write out json, and talk to the MySQL server. But the low-level image processing and Fern classifier processing has to happen in C++ (I wanted to use OpenCV's Python interface, but it doesn't cover all the latest stuff, including Fern classifiers). So I have my PHP scripts call up the OpenCV C++ program using the exec() command. Maybe this isn't an optimal arrangement, but it works just fine.

Third, I wanted to do the Ferns processing asynchronously so that when a new facade image is uploaded, the user gets an immediate confirmation and can carry on their merry way without waiting for the processor to finish. This is achieved by writing a PHP script that acts as a daemon process, using a PEAR extension called System::Daemon. The daemon sits in a loop, checking the database every few seconds for any facade entries flagged as unprocessed. It then sends these images down to the processor script and updates the database when they are complete.

An interesting note about Amazon EC2 is that I'm using their "micro" instance which is free for a year. As best I can tell, the amount of processing power allocated to me is equivalent to a single-core 1Ghz processor. Which is actually less than what I have on the iPad 2. So Ferns processing still takes a couple of minutes, but at least it's not burning the iPad's battery and blocking the user interface.

Finally, you can check out all the server code on GitHub.

Wednesday, May 4, 2011

Sensor Fusion Video

Demo of sensor fusion running on an iPad 2. The front of this building has been preprocessed to serve as a visual marker. The iPad's camera detects the image to get an initial estimate of where the user is standing and how the iPad is oriented in space. After that, the camera and the gyros/accelerometer in the iPad work together to keep the overlay aligned, even when the building goes out of view or isn't detected by the vision algorithm.

Right now it's not rending anything interesting-- the red-green-blue lines represent the x-y-z axis as calculated by the camera and sensors. The background grid is drawn as a large cube surrounding the user-- you can see the corners when the camera pans up and left. The white rectangle with the X in the center only shows up when the camera detects the building facade; as you can see, it isn't detecting the facade every frame, but it doesn't have to as the gyros provide plenty of readings to fill in between the camera estimates. As a result, the animation runs at a nice smooth 60fps.

Pipeline as of now: FAST corner detector - Ferns keypoint classifier - RANSAC homography estimator - Kalman filter (with CoreMotion attitude matrix) - OpenGL Modelview matrix

Monday, May 2, 2011

Fusion + Interface

Sensor fusion is sort of hard to capture in images. I'll try to get some video up here some time soon. But it's working to some extent-- once getting a pose estimate from the camera, the device's gyros will take over on frames where the camera can't detect the object. As long as the device only rotates and does not translate (or translates very little relative to the distance between it and the object it's detecting, as is the case when looking at a building a few dozen meters away), the gyros keep the image registered nicely.

The image above shows the beginning of the interface that will allow a user to take a photo of a building, select the corners of the facade to use as a marker, rectify the image and apply a mask (to remove trees, people, etc), geotag the image by placing it on the map, and finally set its elevation (not yet shown)-- all with the nice touch interface on the iPad/iPhone. After this, the rectified image and its metadata will be sent to a server, where it will be processed as the training image for the ferns classifier. I'll have to draw up a diagram of this later. In the meantime, here's a picture I drew to rough out the idea of how this would work:

One thing this allows me to do is experiment with training images of different sizes and aspect ratios. Right now, everything gets squished into a 640x480 image (my video resolution). This means if I select a square region for the training image and try to find it in a scene, the homography it calculates must somehow represent anisotropic scaling (because in reality, the object to detect is square again, while the training image of it is 4:3). Well, it calculates the homography just fine, and when I multiply the image bounds by the homography directly to find their 2D coordinates, it draws the correctly, but when I decompose the homography matrix to get the OpenGL transform, it has an additional rotation added in. This is strange, and maybe means I'm calculating the OpenGL transformation matrix incorrectly (which might explain some weird results I was getting earlier...)  Below is a picture of the issue.

Cropping a roughly square region
White rectangle with a cross represents homography applied to 2D points. RGB coordinate system is drawn using the OpenGL transformation matrix. Note the offset in rotation. White homography looks correct...
I know this has something to do with the assumption that the homography matrix H = K * [R | T] -- meaning a combination of the camera properties, a rotation, and a translation (i.e. no scaling that isn't just a result of translation in the z-axis). But beyond that... Not sure what to do about it just now. Maybe simply keeping all training images at the same aspect ratio, padded with black, is the way to go about this. We'll see...

Wednesday, April 20, 2011

Pose Estimation and Sensor Fusion

The above diagram gives a high level overview of what my approach is to estimating the user's pose within an urban environment. I have so far been focusing on the computer vision side of things, trying to get a robust pose estimate from keypoint correspondences between what's seen through the camera's lens and a rectified image that serves as a natural marker. Getting this to work well on a mobile device has been quite a project in itself, and there are still plenty of challenges to solve there. But the past couple of weeks, I have been diving into the other sensors found on an iPad, iPhone, many of the top-of-the-line Android phones and tablets, and likely most portable media devices of the future: GPS, compass, gyroscopes, and accelerometers. These sensors can be combined to give a pose estimate as well (and this is extremely easy on a platform like iOS... the CoreMotion framework abstracts away a lot of the details, and I believe there is some rugged sensor fusion going on at the hardware level). Most of the existing "locative" augmented reality apps out there (like Layar, Wikitude, or Yelp Monocle) only use these sensors. This is problematic mainly because GPS does not give very precise or accurate position information, especially in an urban environment. GPS drifts, can be offset several meters due to multipath effects, and generally doesn't get you "close enough" to do a true pixel-perfect visual overlay onto the real world, so most apps that use sensor-based AR simply display floating information clouds and textual annotations rather than 3D graphics. Thus, I aim to combine vision- and sensor-based pose estimates for better results.

This video gives a nice overview of what the different sensors do and what they're each good and bad at.

Friday, April 8, 2011

Some sort of results

Outdoors, recognizing a building facade, on an iPad, at a reasonable framerate, just like I always wanted. Though in the frame I screencaptured, things aren't registering properly just yet, but oh well. It was light outside then, but we all know the best work gets done after sunset.