Monday, May 23, 2011

Website + Documentation Draft

Working on the organization and layout for the toolkit website. Gave it a name, a look, and some bold words to describe the major functions.

Of course, I haven't even finished the code yet. But I think tackling the challenge of how to organize the documentation at this stage will guide me in refactoring the code and making it as clear as possible as I finish writing the first release.

The (home - download - docs - gallery - forum) navigation items come from a survey of the homepages for some of the toolkits that inspired this, such as Processing, openFrameworks, and Cinder.
I want a demo video to go on the right side. I guess I need to make that some time, too.

The class overview page for "augment." It is not as explicit as a UML diagram and not as verbose as Javadoc or Doxygen documentation; I'd like to present the most relevant information up front and hide the details until clicked upon. The cyan entries are the public interface-- either public methods or members with at least a getter and possibly a setter. The private stuff should be hidden by default. This could look really nice with some jQuery sliding menu magic.

Saturday, May 21, 2011

Aspect Ratio of a Rectangle in Perspective

One of the things I left unimplemented in my image tagger input screen was recovering the aspect ratio of the selected rectangle. Before, it just squished everything into a 640x480 image. But now, thanks to this paper, I can automatically calculate the aspect ratio from a given set of four corners. The OpenCV implementation is below. Note the strange ordering of the rectangle's corners: M_i (i = 1...4) are (0, 0), (w, 0), (0, h), and (w, h).

// Get aspect ratio
// Input corners c0,c1,c2,c3 are given as a fraction (0 to 1) of the original image width/height
// Using equations from:
  // Camera intrinsic matrix: calibrated focal length, principal point at the image center
  cv::Mat A = (cv::Mat_<float>(3,3) << 786.42938232, 0, imageSize.width/2,
            0, 786.42938232, imageSize.height/2,
            0, 0, 1);
  float k2, k3;
  float ratio;
  cv::Mat _ratio;
  cv::Mat n2, n3;
  // Corners in homogeneous pixel coordinates, reordered to match M_1...M_4
  cv::Mat m1 = (cv::Mat_<float>(3,1) << imageSize.width * (float)c0.x, imageSize.height * (float)c0.y, 1);
  cv::Mat m2 = (cv::Mat_<float>(3,1) << imageSize.width * (float)c3.x, imageSize.height * (float)c3.y, 1);
  cv::Mat m3 = (cv::Mat_<float>(3,1) << imageSize.width * (float)c1.x, imageSize.height * (float)c1.y, 1);
  cv::Mat m4 = (cv::Mat_<float>(3,1) << imageSize.width * (float)c2.x, imageSize.height * (float)c2.y, 1);
  k2 = (m1.cross(m4).dot(m3)) / ((m2.cross(m4)).dot(m3));
  k3 = (m1.cross(m4).dot(m2)) / ((m3.cross(m4)).dot(m2));
  n2 = (k2*m2) - m1;
  n3 = (k3*m3) - m1;
  // _ratio is a 1x1 matrix holding (w/h)^2
  _ratio = (n2.t()*(A.inv().t())*(A.inv())*n2) / (n3.t()*(A.inv().t())*(A.inv())*n3);
  ratio = sqrt(_ratio.at<float>(0,0));
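For checking the math without OpenCV, here is a minimal dependency-free sketch of the same equations. It assumes square pixels (a single focal length f) and corners already in the M_1...M_4 order; the names aspectRatio, backProjectedNormSq and projectPoint are made up for this illustration and do not come from the original code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]};
}

double dot(const Vec3& a, const Vec3& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Squared length of A^-1 * n for A = [f 0 cx; 0 f cy; 0 0 1].
double backProjectedNormSq(const Vec3& n, double f, double cx, double cy) {
    const double x = (n[0] - cx * n[2]) / f;
    const double y = (n[1] - cy * n[2]) / f;
    return x*x + y*y + n[2]*n[2];
}

// m1..m4 are the images (x, y, 1) of the rectangle corners
// (0,0), (w,0), (0,h), (w,h). Returns the estimated w/h.
double aspectRatio(const Vec3& m1, const Vec3& m2, const Vec3& m3, const Vec3& m4,
                   double f, double cx, double cy) {
    const double d2 = dot(cross(m2, m4), m3);
    const double d3 = dot(cross(m3, m4), m2);
    assert(d2 != 0.0 && d3 != 0.0);  // corners must not be collinear
    const double k2 = dot(cross(m1, m4), m3) / d2;
    const double k3 = dot(cross(m1, m4), m2) / d3;
    const Vec3 n2 = {k2*m2[0] - m1[0], k2*m2[1] - m1[1], k2*m2[2] - m1[2]};
    const Vec3 n3 = {k3*m3[0] - m1[0], k3*m3[1] - m1[1], k3*m3[2] - m1[2]};
    return std::sqrt(backProjectedNormSq(n2, f, cx, cy) /
                     backProjectedNormSq(n3, f, cx, cy));
}

// Hypothetical test helper: pinhole projection of a camera-space point.
Vec3 projectPoint(double X, double Y, double Z, double f, double cx, double cy) {
    return {f * X / Z + cx, f * Y / Z + cy, 1.0};
}
```

Projecting the corners of a tilted 4x2 rectangle with projectPoint and feeding the four image points to aspectRatio should give back a ratio very close to 2, which is a quick way to confirm the corner ordering before trusting it on real photos.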

Monday, May 16, 2011

Offloading Processing to the Cloud!

I've long realized that truly city-wide exploration with AR would require some sort of client-server infrastructure. If an app were to contain all the possible facade-markers to recognize, it would require a single monolithic download. The reality is, the most interesting augmentations are going to require a network connection anyway (because they may be user-generated, or reflect up-to-the-minute information), and downloading only the facade-markers that are nearby will limit the app's initial size. This also means an app set to work in one city can be expanded into another without needing a new program, just new data.
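The "only the facade-markers that are nearby" idea boils down to a distance filter on geotagged records. A minimal sketch, assuming each facade is stored with a latitude and longitude in degrees; the Facade struct, facadesNear and the search radius are hypothetical names for illustration, not the actual server code (which is PHP and MySQL):

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical record for one geotagged facade-marker.
struct Facade {
    std::string id;
    double latDeg;
    double lonDeg;
};

// Great-circle distance in meters between two lat/lon points (haversine formula).
double haversineMeters(double lat1Deg, double lon1Deg, double lat2Deg, double lon2Deg) {
    const double kEarthRadiusM = 6371000.0;
    const double kDegToRad = 3.14159265358979323846 / 180.0;
    const double lat1 = lat1Deg * kDegToRad;
    const double lat2 = lat2Deg * kDegToRad;
    const double dLat = lat2 - lat1;
    const double dLon = (lon2Deg - lon1Deg) * kDegToRad;
    const double sLat = std::sin(dLat / 2.0);
    const double sLon = std::sin(dLon / 2.0);
    const double a = sLat * sLat + std::cos(lat1) * std::cos(lat2) * sLon * sLon;
    return 2.0 * kEarthRadiusM * std::asin(std::sqrt(a));
}

// Return only the facades within radiusM meters of the user's position.
std::vector<Facade> facadesNear(const std::vector<Facade>& all,
                                double userLatDeg, double userLonDeg, double radiusM) {
    assert(radiusM >= 0.0);
    std::vector<Facade> nearby;
    for (const Facade& f : all) {
        if (haversineMeters(userLatDeg, userLonDeg, f.latDeg, f.lonDeg) <= radiusM) {
            nearby.push_back(f);
        }
    }
    return nearby;
}
```

A linear scan like this is fine for a handful of facades; a real database would more likely do the filtering in the query itself.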

Once there's a remote server in the mix, I realized I could use it to offload some of the image processing so the mobile device doesn't have to work so hard. This is especially important when I'm using Fern classifiers as they require a long training step (~1 minute on the device) that just isn't realistic in terms of user experience. So I wrote some server-side scripts to accept new facade images (obtained via an interface like the one I described earlier), process them, store their data in a database, and spit out stored facades that are near a user's current location. The diagram of how it all works is below:

A few fun things I'm trying out here:

First, I'm using Amazon EC2 which is awesome because I get root access on a virtual server somewhere in cloud-land. It's a little strange to get set up and wrap your head around data-persistence issues (i.e. If you "terminate" a server, everything goes bye-bye, but to "stop" it seems ok...) and it took a while to get everything set up (basically I started with a blank Ubuntu install, needed to get and build OpenCV, install Apache/MySQL/PHP) but now I'm happily working from the command line on a machine that exists mainly as an IP address.

Second, I'm writing the high-level API stuff in PHP because it's really easy to process HTTP requests, write out JSON, and talk to the MySQL server. But the low-level image processing and Fern classifier processing has to happen in C++ (I wanted to use OpenCV's Python interface, but it doesn't cover all the latest stuff, including Fern classifiers). So I have my PHP scripts call up the OpenCV C++ program using the exec() command. Maybe this isn't an optimal arrangement, but it works just fine.

Third, I wanted to do the Ferns processing asynchronously so that when a new facade image is uploaded, the user gets an immediate confirmation and can carry on their merry way without waiting for the processor to finish. This is achieved by writing a PHP script that acts as a daemon process, using a PEAR package called System_Daemon. The daemon sits in a loop, checking the database every few seconds for any facade entries flagged as unprocessed. It then sends these images down to the processor script and updates the database when they are complete.
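The polling logic itself is small. Here is a rough sketch of one pass of such a loop, written in C++ rather than PHP to match the other snippets on this blog; FacadeRow and processPending are hypothetical names, and the in-memory vector stands in for the MySQL table:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the facade table.
struct FacadeRow {
    int id;
    std::string imagePath;
    bool processed;
};

// One polling pass: run `processor` on every unprocessed row and flag the
// rows it handles successfully. Returns how many rows were processed.
int processPending(std::vector<FacadeRow>& table,
                   const std::function<bool(const FacadeRow&)>& processor) {
    assert(static_cast<bool>(processor));
    int done = 0;
    for (FacadeRow& row : table) {
        if (row.processed) continue;   // already handled on an earlier pass
        if (processor(row)) {          // e.g. launch the Ferns trainer
            row.processed = true;
            ++done;
        }
    }
    return done;
}
```

A daemon would call processPending repeatedly with a sleep of a few seconds between passes. Rows whose processing fails stay flagged as unprocessed, so they get retried on the next pass.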

An interesting note about Amazon EC2 is that I'm using their "micro" instance which is free for a year. As best I can tell, the amount of processing power allocated to me is equivalent to a single-core 1 GHz processor, which is actually less than what I have on the iPad 2. So Ferns processing still takes a couple of minutes, but at least it's not burning the iPad's battery and blocking the user interface.

Finally, you can check out all the server code on GitHub.

Wednesday, May 4, 2011

Sensor Fusion Video

Demo of sensor fusion running on an iPad 2. The front of this building has been preprocessed to serve as a visual marker. The iPad's camera detects the image to get an initial estimate of where the user is standing and how the iPad is oriented in space. After that, the camera and the gyros/accelerometer in the iPad work together to keep the overlay aligned, even when the building goes out of view or isn't detected by the vision algorithm.

Right now it's not rendering anything interesting-- the red-green-blue lines represent the x-y-z axes as calculated by the camera and sensors. The background grid is drawn as a large cube surrounding the user-- you can see the corners when the camera pans up and left. The white rectangle with the X in the center only shows up when the camera detects the building facade; as you can see, it isn't detecting the facade every frame, but it doesn't have to as the gyros provide plenty of readings to fill in between the camera estimates. As a result, the animation runs at a nice smooth 60fps.

Pipeline as of now: FAST corner detector - Ferns keypoint classifier - RANSAC homography estimator - Kalman filter (with CoreMotion attitude matrix) - OpenGL Modelview matrix

Monday, May 2, 2011

Fusion + Interface

Sensor fusion is sort of hard to capture in images. I'll try to get some video up here some time soon. But it's working to some extent-- once it gets a pose estimate from the camera, the device's gyros take over on frames where the camera can't detect the object. As long as the device only rotates and does not translate (or translates very little relative to the distance between it and the object it's detecting, as is the case when looking at a building a few dozen meters away), the gyros keep the image registered nicely.
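The "gyros take over between camera estimates" behavior can be illustrated with a scalar Kalman filter on a single angle. This is a deliberately simplified sketch, not the filter used in the app (which works with the full attitude); AngleFilter and the noise parameters are assumptions made up for the example.

```cpp
#include <cassert>
#include <cmath>

// One-dimensional Kalman filter tracking a single rotation angle.
struct AngleFilter {
    double angle;     // current estimate, radians
    double variance;  // uncertainty of the estimate

    // Gyro step, run every frame: integrate the angular rate.
    // Uncertainty grows because gyro integration drifts over time.
    void predict(double gyroRate, double dt, double processNoise) {
        assert(dt >= 0.0);
        angle += gyroRate * dt;
        variance += processNoise * dt;
    }

    // Camera step, run only on frames where the marker was detected:
    // blend in the absolute angle measured by the vision pipeline.
    void correct(double measuredAngle, double measurementVariance) {
        const double gain = variance / (variance + measurementVariance);
        angle += gain * (measuredAngle - angle);
        variance *= (1.0 - gain);
    }
};
```

On frames without a detection only predict is called, so the overlay keeps moving with the gyro; whenever a camera estimate arrives, correct pulls the angle back toward it and shrinks the accumulated uncertainty.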

The image above shows the beginning of the interface that will allow a user to take a photo of a building, select the corners of the facade to use as a marker, rectify the image and apply a mask (to remove trees, people, etc), geotag the image by placing it on the map, and finally set its elevation (not yet shown)-- all with the nice touch interface on the iPad/iPhone. After this, the rectified image and its metadata will be sent to a server, where it will be processed as the training image for the ferns classifier. I'll have to draw up a diagram of this later. In the meantime, here's a picture I drew to rough out the idea of how this would work:

One thing this allows me to do is experiment with training images of different sizes and aspect ratios. Right now, everything gets squished into a 640x480 image (my video resolution). This means if I select a square region for the training image and try to find it in a scene, the homography it calculates must somehow represent anisotropic scaling (because in reality the object to detect is square, while the training image of it is 4:3). Well, it calculates the homography just fine, and when I multiply the image bounds by the homography directly to find their 2D coordinates, it draws them correctly, but when I decompose the homography matrix to get the OpenGL transform, it has an additional rotation added in. This is strange, and maybe means I'm calculating the OpenGL transformation matrix incorrectly (which might explain some weird results I was getting earlier...). Below is a picture of the issue.

Cropping a roughly square region
White rectangle with a cross represents homography applied to 2D points. RGB coordinate system is drawn using the OpenGL transformation matrix. Note the offset in rotation. White homography looks correct...
I know this has something to do with the assumption that the homography matrix H = K * [R | T] -- meaning a combination of the camera properties, a rotation, and a translation (i.e. no scaling that isn't just a result of translation in the z-axis). But beyond that... Not sure what to do about it just now. Maybe simply keeping all training images at the same aspect ratio, padded with black, is the way to go about this. We'll see...
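One way to test whether a given homography fits the H = K * [R | T] assumption is to look at the first two columns of K^-1 * H. If they are scaled copies of two rotation columns, they must have equal length, so a length ratio far from 1 points to anisotropic scaling baked into the reference image. A minimal sketch, assuming square pixels and a known principal point; columnNormRatio is a hypothetical helper name:

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;  // row-major: H[row][col]

// Length of K^-1 * (column j of H), for K = [f 0 cx; 0 f cy; 0 0 1].
double backProjectedColumnNorm(const Mat3& H, int j, double f, double cx, double cy) {
    const double x = (H[0][j] - cx * H[2][j]) / f;
    const double y = (H[1][j] - cy * H[2][j]) / f;
    const double z = H[2][j];
    return std::sqrt(x*x + y*y + z*z);
}

// Ratio of the first to the second back-projected column length.
// Close to 1.0 when H really is K * [r1 r2 t] up to an overall scale.
double columnNormRatio(const Mat3& H, double f, double cx, double cy) {
    const double second = backProjectedColumnNorm(H, 1, f, cx, cy);
    assert(second > 0.0);
    return backProjectedColumnNorm(H, 0, f, cx, cy) / second;
}
```

If the ratio comes out well away from 1, rescaling the reference image coordinates back to the true aspect ratio before computing the pose (or padding the training image, as suggested above) should bring it back.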

Wednesday, April 20, 2011

Pose Estimation and Sensor Fusion

The above diagram gives a high-level overview of my approach to estimating the user's pose within an urban environment. I have so far been focusing on the computer vision side of things, trying to get a robust pose estimate from keypoint correspondences between what's seen through the camera's lens and a rectified image that serves as a natural marker. Getting this to work well on a mobile device has been quite a project in itself, and there are still plenty of challenges to solve there. But for the past couple of weeks, I have been diving into the other sensors found on an iPad, iPhone, many of the top-of-the-line Android phones and tablets, and likely most portable media devices of the future: GPS, compass, gyroscopes, and accelerometers. These sensors can be combined to give a pose estimate as well (and this is extremely easy on a platform like iOS... the CoreMotion framework abstracts away a lot of the details, and I believe there is some rugged sensor fusion going on at the hardware level).

Most of the existing "locative" augmented reality apps out there (like Layar, Wikitude, or Yelp Monocle) only use these sensors. This is problematic mainly because GPS does not give very precise or accurate position information, especially in an urban environment. GPS drifts, can be offset several meters due to multipath effects, and generally doesn't get you "close enough" to do a true pixel-perfect visual overlay onto the real world, so most apps that use sensor-based AR simply display floating information clouds and textual annotations rather than 3D graphics. Thus, I aim to combine vision- and sensor-based pose estimates for better results.

This video gives a nice overview of what the different sensors do and what they're each good and bad at.

Friday, April 8, 2011

Some sort of results

Outdoors, recognizing a building facade, on an iPad, at a reasonable framerate, just like I always wanted. Though in the frame I screencaptured, things aren't registering properly just yet, but oh well. It was light outside then, but we all know the best work gets done after sunset.

Tuesday, April 5, 2011

From Homography to OpenGL Modelview Matrix

This is the challenge of the week-- how do I get from a 3x3 homography matrix (which relates the plane of the source image to the plane found in the scene image) to an OpenGL modelview transformation matrix so I can start, you know, augmenting reality? The tricky thing is that while I can use the homography to project a 3D point onto the 2D image plane, I need separated rotation and translation vectors to feed OpenGL so it can set the location and orientation of the camera in the scene.

The easy answer seemed to be using OpenCV's cv::solvePnP() (or its C equivalent, cvFindExtrinsicCameraParams2()) by inputting four corners of the detected object calculated from the homography. But I'm getting weird memory errors with this function for some reason ("incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug", but setting a breakpoint on malloc_error_break didn't really help, and it isn't an Objective-C object giving me trouble, so NSZombieEnabled won't be any help, etc etc arghhh....) AND it looks like it's possible to decompose a homography matrix into rotation and translation vectors which is all I really need (as long as I have the camera intrinsic matrix, which I found in the last post). solvePnP looks useful if I wanted to do pose estimation from a 3D structure, but I'm sticking to planes for now as a first step. OpenCV's solvePnP() doesn't look like it has an option to use RANSAC which seems important if many points are likely to be outliers-- an assumption that the Ferns-based matcher relies upon.

Now to figure out the homography decomposition... There are some equations here and some code here. I wish this were built into OpenCV. I will update as I find out more.

Update 1: The code found here was helpful. I translated it to C++ and used the OpenCV matrix libraries, so it required a little more work than a copy-and-paste. The 3x3 rotation matrix it produces is made up of the three orthogonal vectors that OpenGL wants (so they imply a rotation, but they're not three Euler angles or anything) which this image shows nicely:
Breakdown of the OpenGL modelview matrix (via)
The translation vector seems to be translating correctly as I move the camera around, but I'm not sure how it's scaled. Values seem to be in the +/- 1.0 range, so maybe they are in screen widths? Certainly they aren't pixels. Maybe if I actually understood what was going on I'd know better... Well, time to set up OpenGL ES rendering and try this out.
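For reference, the textbook version of this decomposition is short enough to sketch without OpenCV. This is not the translated code mentioned in Update 1; it is a minimal version assuming square pixels, a noise-free homography, and a marker in front of the camera. Pose and decomposeHomography are names made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // row-major: M[row][col]

struct Pose {
    Mat3 R;  // columns: marker x axis, y axis and normal, in camera coordinates
    Vec3 t;  // marker origin in camera coordinates
};

double length(const Vec3& v) {
    return std::sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
}

// Decompose H ~ K * [r1 r2 t] for K = [f 0 cx; 0 f cy; 0 0 1].
Pose decomposeHomography(const Mat3& H, double f, double cx, double cy) {
    Vec3 g[3];  // g[j] = K^-1 * (column j of H)
    for (int j = 0; j < 3; ++j) {
        g[j] = Vec3{(H[0][j] - cx * H[2][j]) / f,
                    (H[1][j] - cy * H[2][j]) / f,
                    H[2][j]};
    }
    const double n1 = length(g[0]);
    const double n2 = length(g[1]);
    assert(n1 > 0.0 && n2 > 0.0);
    // H is only known up to scale: normalize so r1 and r2 have (average) unit length.
    double s = 2.0 / (n1 + n2);
    if (g[2][2] < 0.0) s = -s;  // keep the marker in front of the camera (t.z > 0)

    const Vec3 r1 = {s * g[0][0], s * g[0][1], s * g[0][2]};
    const Vec3 r2 = {s * g[1][0], s * g[1][1], s * g[1][2]};
    const Vec3 r3 = {r1[1]*r2[2] - r1[2]*r2[1],   // r3 = r1 x r2
                     r1[2]*r2[0] - r1[0]*r2[2],
                     r1[0]*r2[1] - r1[1]*r2[0]};
    Pose p;
    for (int i = 0; i < 3; ++i) {
        p.R[i] = Vec3{r1[i], r2[i], r3[i]};
        p.t[i] = s * g[2][i];
    }
    return p;
}
```

With this normalization the translation comes out in the same units as the reference-plane coordinates that were used to compute H, which may explain values in the +/- 1.0 range if those coordinates were normalized. With a noisy homography the columns of R will not be exactly orthonormal, so implementations commonly add a re-orthogonalization step, which this sketch leaves out.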

Update 2: Forgot for a minute that OpenGL's fixed pipeline requires two transformation matrices: a modelview matrix (which I figure out above, based on the camera's EXtrinsic properties) and a projection matrix (which is based on the camera's INtrinsic properties). These resources might be helpful in getting the projection matrix.
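One common way to build that projection matrix from the intrinsics is sketched below. It assumes the usual OpenGL camera (looking down -z, y up), image coordinates with the origin at the top-left, and hypothetical near/far clipping distances; the signs of the two off-center terms depend on those conventions, so treat them as an assumption to verify. projectionFromIntrinsics is a name made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Returns a column-major 4x4 matrix, the layout glLoadMatrixf expects.
std::array<float, 16> projectionFromIntrinsics(float fx, float fy, float cx, float cy,
                                               float width, float height,
                                               float zNear, float zFar) {
    assert(width > 0.0f && height > 0.0f && zNear > 0.0f && zFar > zNear);
    std::array<float, 16> m{};  // all elements start at zero
    m[0]  = 2.0f * fx / width;                       // x focal scale
    m[5]  = 2.0f * fy / height;                      // y focal scale
    m[8]  = 1.0f - 2.0f * cx / width;                // principal point offset in x
    m[9]  = 2.0f * cy / height - 1.0f;               // principal point offset in y
    m[10] = -(zFar + zNear) / (zFar - zNear);        // depth remapping
    m[11] = -1.0f;                                   // perspective divide by -z
    m[14] = -2.0f * zFar * zNear / (zFar - zNear);
    return m;
}
```

When the principal point sits exactly at the image center (for example 320, 240 for a 640x480 frame), both offset terms are zero and the result reduces to an ordinary symmetric perspective matrix.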

Update 3: Ok, got it figured out. It's not pretty, but it works. I think I came across the same thing as this guy. Basically I needed to switch the sign on four out of nine elements of the modelview rotation matrix and two of the three components of the translation vector. The magnitudes were correct, but it was rotating backwards in the z-axis and translating backwards in the x- and y- axes. This was extremely frustrating. So, I hope the code after the jump helps someone else out...

Monday, April 4, 2011

Offline camera calibration for iPhone/iPad-- or any camera, really

Creating a GUI to perform camera calibration on a mobile device like an iPhone or iPad sounded like more work than it would be worth, so I wrote a short program to do it offline. The cameras used on these devices can be assumed to be consistent within the same model, so it makes more sense for an app developer to have several precomputed calibration matrices available rather than asking the user to do this step on their own device.

The program I wrote is adapted from a tutorial I found to do camera calibration from live input. My version instead looks for a sequence of files named 00.jpg, 01.jpg, etc. and calibrates from those. So the way I used it was to take several pictures of the checkerboard pattern from my iPad, upload them to my computer, edit out the rest of the stuff on my desktop in Photoshop so finding the corners was more likely to be correct, and rename them. The output of the program is two XML files which include the camera intrinsic parameters and distortion coefficients. The code for the program is attached after the jump.

And results:
For the camera matrix:

f_x = 786.42938232
f_y = 786.42938232
c_x = 311.25384521 // See update below
c_y = 217.01358032 // See update below

And distortion coefficients were: -0.10786291, 1.23078966, -4.54779295e-03, -3.28966696e-03, -5.54199600

I hope I did that right. The center is slightly off from where it ideally should be (320, 240).

Note: I found this precompiled private framework of OpenCV built for OS X rather handy. It is only built with 32-bit support, so set your target in Xcode accordingly.

UPDATE: The principal point (c_x, c_y) obtained from this calibration was wrong! It was throwing off the pose estimates at glancing angles. I set it to 320,240 and everything works better now...

The Approach So Far

This project has been under development for a few months prior to the beginning of this blog, so I might as well explain some of the approach as it stands so far.

There are three "big picture" technical components to this project: The first is developing an efficient markerless camera-based AR system consisting of a keypoint detector, feature matcher and pose estimator on the mobile platform. The second is sensor fusion with the other sensors available on a modern mobile device-- compass, GPS, gyroscope, and accelerometer. The third is user interface design integrating these technologies into an easy-to-use app that can both view augmented data and contribute new user-generated data to grow the database.

So far, I have focused on building the general-purpose markerless AR system. I am using FAST Corner Detection to find keypoints in an image, followed by the Ferns classifier to match the keypoints as seen through the camera with those in a reference image, and then using RANSAC to calculate the homography mapping the reference image to the camera image. All the algorithms at this point are built into OpenCV, which I have compiled for iOS with some outside help.

(Speaking of iOS, I have tested this code both on an iPhone 4 and an iPad 2. The iPad 2 is significantly faster even without specific multithreaded programming techniques to take advantage of the dual core processor. I'm not yet sure exactly why this is, but maybe discovering why would reveal some unexpected bottlenecks in my code...)

The immediate next step is to convert the homography I find into an OpenGL modelview transformation matrix so I can start rendering something more interesting than a rectangle over my scene. Though rectangles are nice and satisfying after fighting compilation errors for days, and seeing new rectangles drawn at ~12fps is great after a few weeks trying out SURF descriptors. And even though I have something that remotely functions, there is plenty of room for optimization, especially in terms of compressing the Ferns so they don't hog so much memory. The ideas in this paper look like they might be half-implemented in the OpenCV Ferns code, though they are commented out.

More details as things progress...

Monday, March 28, 2011


The goal of this project is to develop a platform for mobile augmented reality (AR) applications that enable a user to explore and manipulate the image of an urban environment in real time. Most vision-based AR tools that exist today (including ARToolkit and Qualcomm's AR SDK) assume that the objects to be tracked are not stationary and therefore no external frame of reference is used to guide object recognition and tracking. If, however, the objects to recognize are facades of buildings and signs in a city, we can use sensor data (GPS/compass/etc) to roughly estimate our position and pose (as is done with Layar) and use vision techniques to more precisely align a 3D data overlay.

The ultimate goal is to produce a set of tools packaged as a unified toolkit that can prove useful to the growing community of "creative coders" working in the fields of art and design.

Check out the project proposal for more.