
1. Introduction

00:00 So in the second part of this lecture, we are going to look at localization techniques that use image processing. Image processing allows you to detect objects directly: persons or landmarks, for instance. You could detect textures. For instance, a given floor texture might give you information about where you are inside a building. Or you could detect 2D barcodes, so-called fiducials or markers, to know where you are located and where exactly the device is looking.

00:29 There are more techniques that make use of image processing, and we will look at them today. Image processing has become paramount, essentially because cameras are now very inexpensive and available in many devices. Any modern smartphone will feature not only one but multiple cameras, and possibly even a depth camera. The problem with camera-based sensing is that there are lighting issues and occlusion. So if there's no direct line of sight, then you have problems.

00:57 Also, it's still true that image processing is energy intensive, which can be a problem for your battery in mobile situations. Nevertheless, it is in very widespread use, and I bet every one of you has used image-based localization and tracking somewhere, on your mobile phone or on gaming consoles, for instance. So let's first of all clarify a few definitions. There is the term tracking and the term registration, both of which are often used. And I want to clarify first of all what the overall goal of image-based localization is.

01:32 So here's a simple example. You have a mobile phone with a camera, and in this case it is used for spatial augmented reality, meaning that this camera is observing some real scene. In this case, it's a brochure or magazine that we see here. The camera is identifying part of it and then displaying some augmented virtual content. So what we see here is that this 2D image that is printed in the magazine is augmented with some 3D content. What is the technical problem behind this, or the goal here?

02:08 Well, the problem is that our camera operates in a sensor coordinate system, a camera coordinate system. So basically, all it sees is color pixels inside its camera coordinate system. The goal of tracking is to relate this information to the world coordinate system, so you can identify the real-world coordinates that specific elements are at. And this is then used for two purposes that oftentimes come together.

02:38 One of them is tracking. Tracking means that we're continuously locating the user's position or, put in more technical terms, the camera's viewpoint. And we do so while the user, or the camera, is moving. So we keep this position updated, and this is tracking. Registration means that we position virtual objects in a way that is aligned with the real view.

03:02 This is what you can see here. We have this real-world image, which is visible in the camera feed, and we augment it with virtual objects at pixel-level precision. So these virtual objects are precisely registered and not off. I mean, the content could be displayed somewhere over here and the registration would not be good, but of course the goal is to have very accurate registration. And once registered, you keep tracking this over time, so the user can move the camera, can move the mobile device, and the information is still correctly displayed.
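To make the relation between world coordinates and camera pixels a bit more concrete, here is a minimal sketch. It assumes an ideal pinhole camera without lens distortion and a camera that is only rotated about the vertical axis. The function names, the focal length of 800 pixels and the image center are illustrative assumptions, not values from the lecture.

```python
import math


def world_to_camera(point_w, cam_pos, yaw_rad):
    """Express a world-space point in the camera coordinate system.

    Assumption: the camera sits at cam_pos and is rotated only about the
    vertical (y) axis by yaw_rad. A full pose would use a 3x3 rotation.
    """
    dx = point_w[0] - cam_pos[0]
    dy = point_w[1] - cam_pos[1]
    dz = point_w[2] - cam_pos[2]
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    # Apply the inverse of the camera rotation to the offset vector.
    return (c * dx - s * dz, dy, s * dx + c * dz)


def camera_to_pixel(point_c, focal_px, center_px):
    """Project a camera-space point onto the image (ideal pinhole model)."""
    x, y, z = point_c
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_px * x / z + center_px[0],
            focal_px * y / z + center_px[1])


# A point 1 m in front of the camera and 10 cm to the right (assumed values).
p_cam = world_to_camera((0.1, 0.0, 1.0), (0.0, 0.0, 0.0), 0.0)
print(camera_to_pixel(p_cam, focal_px=800.0, center_px=(320.0, 240.0)))
```

Tracking amounts to keeping `cam_pos` and the rotation up to date while the device moves. Registration then means drawing the virtual content at the pixel that this projection returns.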

03:38 There are two main types of tracking we can distinguish. One of them is called inside-out. Here the idea is that you are wearing a tracking device which identifies its location by looking at outside references. And so this process of positioning is done locally. This is what you can see here.

03:57 In this example, the user is wearing a camera, and a projector as well. And this camera is tracking, in this case, some hand movement and some information about the environment, which is used for tracking. The same is true if you use your mobile phone with some augmented reality framework to identify the phone's location: you have inside-out tracking. GPS does the same: you have an active receiver, which is tracking its location based on information that comes from the environment.

04:29 The opposite is outside-in tracking. In this case, it is the infrastructure that observes the user, using radio, IR, or acoustics; multiple types of approaches can be used here. And so the user's part can be more or less passive. It could be active emitters or just passive points that can be tracked. And it's the environment that identifies the user's position, and it knows about all user positions.

2. Original Camera Image

04:55 And the environment actually does the active processing of the perceived signals. So here's an example of an augmented fashion store, where the user is tracked once he or she comes close to a specific active device. Outside-in tracking is also used with optical motion capture setups, where you have lots of cameras that surround the user, that surround the scenery, and that track what is going on inside. And the user or the tracked object itself can be augmented with passive markers that just reflect signals coming from infrared emitters. So let's have a look at several techniques that are oftentimes used for vision-based tracking.

05:40 The simplest and very widely used approach is based on optical markers. The term fiducial is often used in this context. It is a specific object or specific visual pattern that is seen by a camera and acts as a point of reference. This marker can optionally encode a unique identifier. And of course, you know many markers: classical barcodes on products serve this purpose of being seen by a camera and encoding an identifier.

06:08 QR codes do the same. But the ones we care most about in the context of augmented reality are markers that look more like this one here: markers of so-called augmented reality toolkits. These allow you to identify the 3D position and the 3D orientation of a marker with respect to the camera. So here the main goal is not first and foremost to identify a marker.

06:30 That is what you do with a barcode or QR code. Here the main goal is to identify 3D position and 3D orientation. And this then allows you to spatially register virtual imagery on top of this marker. Of course, as I mentioned, these markers could also have a unique identifier, and you see multiple examples here. These markers look different, so one knows which of those markers is currently present in the scene.

06:58 The technique is fairly simple. It has been around for 20 years. And there are a number of open source solutions available, which started with ARToolKit. Now there are more advanced solutions available. The main idea is that you have a predefined pattern that acts as your marker or your fiducial, and this pattern can be placed somewhere in the physical scenery.

07:20 So for instance, it is a physical sheet of paper that is placed inside the physical scenery, and then the camera sees it in a camera frame. The marker is recognized in the image, and because we know that this marker has a specific real-world size, say 3 by 3 centimeters, we can use the four corner points and from those four corner points identify the camera's pose. And then in turn, we can use this information to augment those markers with virtual content which is spatially registered in the camera frame. So here's the workflow. We would start with a video image of the marker that is captured by a camera.

08:03 This image is then sent to a tracking system, and this tracking system uses several dedicated algorithms to identify the position and pose of this marker. So, for instance, it would threshold the image to create a binary black-and-white image, then use edge detection. The edges of the marker are known, so we know where the corners are. And from this information, the position of the marker can be detected. Next, a local coordinate system is calculated based on the position of this marker; this is indicated here.
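The thresholding step can be sketched in a few lines. This is only an illustration: the image is a plain list of gray values, the cutoff of 128 is an assumed value, and a simple bounding box of the dark pixels stands in for the real edge and corner detection that a marker tracker would perform.

```python
def threshold(gray, cutoff=128):
    """Turn a grayscale image (rows of 0..255 values) into a binary image.

    1 means bright (paper), 0 means dark (marker ink). The cutoff is an
    assumed value; real systems often adapt it to the lighting.
    """
    return [[1 if px >= cutoff else 0 for px in row] for row in gray]


def dark_bounding_box(binary):
    """Return (top, left, bottom, right) of all dark pixels, or None.

    Simplified stand-in for edge and corner detection.
    """
    coords = [(r, c)
              for r, row in enumerate(binary)
              for c, value in enumerate(row)
              if value == 0]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Tiny hypothetical camera frame: a dark 2x2 blob on bright paper.
frame = [
    [200, 200, 200, 200, 200],
    [200,  30,  40, 200, 200],
    [200,  20,  10, 200, 200],
    [200, 200, 200, 200, 200],
]
print(dark_bounding_box(threshold(frame)))  # (1, 1, 2, 2)
```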

08:35 So you see a 3D coordinate system that is relative to the marker. And given that we use tracking here, you can imagine that as soon as this marker moves or is rotated in 3D space, or if the camera moves or rotates, this information gets updated in real time. So we always have this local coordinate system. This then gives us a transformation matrix, and we can use it to render 3D objects that are aligned with the marker. And finally, this 3D object rendering can be overlaid on the original camera image, and we have a spatially registered augmented reality view that can be displayed.
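One ingredient of the pose computation is the known physical size of the marker. As a rough sketch, assuming an ideal pinhole camera and a marker that faces the camera more or less frontally, the distance follows from similar triangles. The function name and the focal length are illustrative assumptions; a full tracker would solve for all six degrees of freedom from the four corners.

```python
import math


def marker_distance(corners_px, marker_size_m, focal_px):
    """Estimate the camera-to-marker distance from its four image corners.

    Assumptions: ideal pinhole camera, marker roughly parallel to the
    image plane. corners_px lists the corners in order around the marker.
    """
    n = len(corners_px)
    sides = [math.dist(corners_px[i], corners_px[(i + 1) % n])
             for i in range(n)]
    mean_side_px = sum(sides) / n
    # Similar triangles: size_px / focal_px = size_m / distance_m
    return focal_px * marker_size_m / mean_side_px


# A 3 cm marker that appears 60 pixels wide, with an assumed focal length.
corners = [(100, 100), (160, 100), (160, 160), (100, 160)]
print(marker_distance(corners, marker_size_m=0.03, focal_px=800.0))
```

The same marker seen at half the pixel size would be estimated at twice the distance, which is why the real-world size has to be known in advance.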

09:14 Of course, the downside of markers, or fiducials, is that we need to have markers. So if you have a scene that does not have markers available, that is not equipped with them, or that simply does not offer enough free space for placing markers, then things get more difficult. This is why natural feature tracking has been invented. The main idea here is that rather than using a dedicated artificial marker, you make use of visual features that are part of some natural image. So it could be any predefined pattern that has sufficiently distinguishable features.

09:49 For instance, in this case, it could be the cover of a board game that you see here, of Lego bricks. And once this is identified in the camera frame, one can use dedicated algorithms that again identify the specific location and rotation, the pose, of that object. Then again, we can use this to augment it with 3D graphics. In this case, it's just a simple cube, but you can imagine that, depending on the 3D model that you have, you can present more interesting things like avatars or advanced options. There are a couple of advantages with natural features.

3. Simple RGB Camera

10:26 Of course, you don't need markers. This is obvious, but natural features can also be more robust against partial occlusion. The problem with markers is that, because the system has to identify the corners, it will stop working as soon as a marker is even partially occluded. That's not the case here: you can partially occlude this image, and the system will still be fairly robust. So for practical use, this of course is a huge benefit.

10:54 Also, we have natural features almost everywhere, so we don't need to instrument the scene with markers. The downside is that the image processing is heavier than with markers, and you need to have some database or some information about the visual key points that you want to identify. So this is about identifying objects and knowing about an object's pose. But what if we are in a totally unknown environment and we want to understand where we are in this environment? This is where SLAM techniques come into play. SLAM stands for simultaneous localization and mapping.

11:32 It's a technique that originated in robotics and that does not need any additional infrastructure. It allows us to track a device's or robot's position in unknown environments. And very importantly, these environments do not need to be instrumented. So simultaneous localization and mapping essentially means that we can navigate this unknown environment while keeping track of the position and orientation relative to the environment map. So basically, we know where we are located inside this environment and how we are oriented inside it, with full 6 degrees of freedom information.

12:09 At the same time, as we're navigating through this environment, we simultaneously build a map of the environment. So the more we navigate through the environment, the more we learn about it and the better we can map it. It's quite a powerful technique. In more technical terms, the aim is that we compute both the camera pose and the structure of the environment, which is called a map, without prior knowledge about either of them.

12:38 And so the basic principle is that you keep taking camera views of the environment and you triangulate from visual features that you see in the environment. You solve an optimization problem to minimize the reprojection error, which essentially means that you calculate the difference between a feature point's tracked position and where the tracker expects it to be. And the more you map the environment, the more this gets minimized and the better you know the environment. So here's an example of SLAM, a recent video that you can see here. It uses a simple RGB camera, just a color camera, that is moved in 3D space. It is a handheld camera, and it uses the natural features that we have inside this 3D space.
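The reprojection error mentioned here can be written down directly. The sketch below assumes an ideal pinhole camera and feature points that are already expressed in camera coordinates. The function names and numbers are illustrative assumptions, and a real SLAM system would minimize this quantity over many keyframes and points at once.

```python
import math


def project(point_c, focal_px, center_px):
    """Ideal pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point_c
    return (focal_px * x / z + center_px[0],
            focal_px * y / z + center_px[1])


def mean_reprojection_error(points_c, observed_px, focal_px, center_px):
    """Average pixel distance between predicted and tracked feature positions."""
    errors = [math.dist(project(p, focal_px, center_px), obs)
              for p, obs in zip(points_c, observed_px)]
    return sum(errors) / len(errors)


# Two mapped feature points and where the tracker actually saw them.
mapped = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]
tracked = [(320.0, 240.0), (523.0, 244.0)]
print(mean_reprojection_error(mapped, tracked, 800.0, (320.0, 240.0)))  # 2.5
```

An optimizer would adjust the estimated camera pose and point positions until this number becomes as small as possible.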

13:29 As you can see, after a few seconds of mapping, the system knows about the environment. So the idea is that you move the camera around the scene, and while you do so, the environment is mapped. Keyframes are identified and sent to a server for localization, and the results are then overlaid on the image. These are the yellow annotations that you see here. And as the map expands, the more you move the camera and the more you walk through uncharted territory, new keyframes are processed and the map is refined.

14:02 And as you can see here, this works not only in an office, but also outdoors in real-world scenes that, of course, have not been augmented in any way. It's just those visual features, visualized here with the small points, that help the system identify the geometry of this environment. So the big benefit of SLAM is that no augmentation is required at all, and it can work with simple RGB cameras, but also with depth cameras and other cameras. This brings me to the next technique that is in widespread use with interactive systems, which is called structured light. What does structured light mean?

14:39 Structured light means that you use a specific technical approach to get more information than just the RGB image. Using structured light, you can infer the geometry of a scene. And this is what is used in many modern depth cameras. You may know that a depth camera gives you, for each pixel in the camera view, not only the RGB information but also the depth information. So you see how far the point that you see in the camera image is from the camera.

15:10 The basic principle is visualized here in this illustration. You have a light source that emits what is called structured light, meaning a specific pattern of light, into the 3D scene. So for instance, if you have an IR emitter here that is emitting just one ray, then you have a single dot that is captured in your camera. If you project a line, then you see that line over here. If you project a stripe pattern, then of course you see stripes, et cetera.

15:40 And depending on the geometry where this light is reflected, the pattern that you see on your 2D image sensor looks distorted. So in this example, for instance, we have projected a regular grid, but the reflection that comes off this circular cylinder looks this way, looks distorted. And from this distorted pattern, one can identify the original geometry. This is the idea. So basically you say this is a deviation from the plane, and from this deviation we can infer back what the geometry is.
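How a deviation in the image turns into depth can be sketched with simple triangulation. The sketch assumes that the emitter and the camera sit side by side with a known baseline and behave like an ideal stereo pair, so that a projected dot shifts sideways in the image depending on its distance. The baseline, focal length and shift below are made-up illustrative values, not the parameters of any specific device.

```python
def depth_from_shift(shift_px, baseline_m, focal_px):
    """Depth of a projected dot from its sideways shift in the camera image.

    Assumptions: emitter and camera are horizontally offset by baseline_m,
    ideal pinhole geometry, shift measured relative to a dot at infinity.
    """
    if shift_px <= 0:
        raise ValueError("shift must be positive for a finite depth")
    # Triangulation by similar triangles: depth = focal * baseline / shift
    return focal_px * baseline_m / shift_px


# Hypothetical numbers: 7.5 cm baseline, 580 px focal length, 29 px shift.
print(depth_from_shift(29.0, baseline_m=0.075, focal_px=580.0))  # 1.5
```

Nearby surfaces produce large shifts and distant surfaces produce small ones, so depth resolution gets coarser the farther away the surface is.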

4. Inexpensive Depth Camera

16:17 And this happens to be the principle that has been widely deployed with the original Microsoft Kinect camera. It is based on the structured light principle. You may have seen these cameras before and used them with the Xbox. We also use them a lot for seminars and research. The key benefit here is that it's a very inexpensive depth camera, €100 to €200 only, that gives you really good depth information.

16:41 So how does it work? Well, it projects a light pattern, in this case a dot pattern, onto the scenery. It's called a speckle pattern. The dots look randomly placed, but there is a specific principle behind it. This is the IR emitter that we see here, which projects this dot pattern.

17:00 The camera features an IR sensor, basically an IR camera, that observes the infrared image and sees the dot pattern. This is what a dot pattern looks like in the infrared view. As you see, there are all these fine dots that are projected into the scenery and reflected back. Now, depending on how this original dot pattern appears distorted in the camera view, the algorithm can compute what is called a depth map.

17:27 That means the calculated distance of each camera pixel from the sensor. This works at a lower resolution than in classical RGB cameras. In this case, the resolution is 320 by 240. It's still very reasonable for identifying depth geometry. And this works at a distance between 40 centimeters and 4.5 meters.

17:51 Of course, the limiting factor here is that you have this projector that is projecting an infrared pattern into the environment. So if the scene is too far away, the intensity would be too low. The same is true in outdoor scenes, where again intensity could be an issue. And so from this infrared pattern, the camera can then compute a depth view. And this is what you see here: you see a user, and this is the color image.

18:16 And you see the depth image of the user. The shades of gray that you see encode the distance between the user and the camera. At the same time, what you also see is that from this information one can very easily fit a simple skeleton. This is important for all kinds of body tracking that is used in the Xbox and many other systems. So we have a handy way of knowing not only the user's location, but also the spatial configuration of the body.
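Encoding distance as shades of gray is a simple linear mapping. The sketch below uses the working range mentioned above (40 centimeters to 4.5 meters). The choice that nearer points are brighter and that out-of-range pixels become black is an assumption made for illustration.

```python
NEAR_M = 0.4   # closest measurable distance mentioned in the lecture
FAR_M = 4.5    # farthest measurable distance mentioned in the lecture


def depth_to_gray(depth_m, near=NEAR_M, far=FAR_M):
    """Map a depth in meters to an 8-bit gray value (assumed: near = bright).

    Depths outside the working range are shown as black (0).
    """
    if depth_m < near or depth_m > far:
        return 0
    fraction = (depth_m - near) / (far - near)
    return int(255 * (1.0 - fraction))


def depth_map_to_image(depth_map):
    """Convert a 2D depth map (rows of meters) into a gray image."""
    return [[depth_to_gray(d) for d in row] for row in depth_map]


# Tiny hypothetical depth map: a person at about 1 m in front of a far wall.
print(depth_map_to_image([[4.5, 1.0, 4.5],
                          [4.5, 1.0, 4.5]]))
```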

18:54 And SLAM does not only work with RGB cameras; you can also use depth cameras. This led to a widely used and widely cited piece of work called KinectFusion, which uses the Kinect camera to do SLAM-based mapping and tracking inside real-world sceneries. And as you will see when you check out this video, there are a number of important benefits of this KinectFusion implementation that make it very powerful. I recommend that you check out the video yourself. You would not have the sound when I play this back here.

19:26 So just click on the link down here and watch the video on YouTube yourself. A very different approach is optical motion capture. These are dedicated, fairly expensive systems that allow you to track users' movements or track objects inside a defined, confined space, and do so with a very high frame rate and a very high accuracy. These systems are used, for instance, in the movies, when you have modern 3D graphics and actors are either replaced or augmented with 3D graphics. In these cases, actors wear a specific motion capture suit and those systems capture the actors.

20:12 We're also using such a system a lot for our own research, for quickly building interactive systems that have contextual information, and also for user studies. The main idea is that you have a set of high-speed infrared cameras that you deploy in a space such that they observe a scene from many different perspectives. You see this here: multiple infrared cameras attached around the ceiling that observe the scenery. And we have a similar setup in our lab that we use for multiple purposes. Now, the interesting thing is that these cameras are active, have a really high frame rate, and are, as I mentioned, quite expensive.

20:45 But the big benefit is that the devices, objects, or users they track can be passive. So you simply attach a number of these markers, which are just retroreflective dots or retroreflective balls. They reflect infrared light that is emitted from those cameras. And then each camera sees these points appearing bright white in its camera image. And so, from multiple cameras and using triangulation, the system can then identify 3D positions.
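Triangulation from two cameras can be sketched as finding the point where two viewing rays meet. Because of noise, real rays rarely intersect exactly, so this sketch returns the midpoint of the shortest segment between them. The camera positions and ray directions are assumed to come from a prior calibration step, and all names and numbers are illustrative.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate(origin1, dir1, origin2, dir2):
    """Midpoint of the closest points between two 3D viewing rays.

    Each ray starts at a calibrated camera position (origin) and points
    toward the bright marker blob seen in that camera's image (direction).
    """
    w0 = tuple(p - q for p, q in zip(origin1, origin2))
    a, b, c = _dot(dir1, dir1), _dot(dir1, dir2), _dot(dir2, dir2)
    d, e = _dot(dir1, w0), _dot(dir2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, cannot triangulate")
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    q1 = tuple(o + t1 * v for o, v in zip(origin1, dir1))
    q2 = tuple(o + t2 * v for o, v in zip(origin2, dir2))
    return tuple((u + v) / 2.0 for u, v in zip(q1, q2))


# Two cameras 2 m apart, both seeing the same marker.
print(triangulate((0.0, 0.0, 0.0), (1.0, 0.5, 2.0),
                  (2.0, 0.0, 0.0), (-1.0, 0.5, 2.0)))  # (1.0, 0.5, 2.0)
```

With more than two cameras, a system would typically combine all rays in a least-squares sense, which is one reason why adding cameras improves accuracy.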

5. Own Tracking System

21:17 Of course, these systems are very well calibrated. There's a calibration step you do first, once your cameras are deployed in a specific configuration. And then the system gives you, off the shelf, 3D coordinates of these points at a very high frame rate. You can also identify markers that encode an ID by placing several of these markers in a known physical geometry. Then you can track this as one identifier with 6 DoF information, so 3D location and 3D rotation.

21:49 The systems can be very fast. 100 hertz is standard. It can go up to several hundred hertz, and the accuracy largely depends on the resolution of the cameras that you have and the number of cameras that you have, but also on the size of the captured volume. So you can go down to submillimeter-level accuracy, or in typical setups you would have maybe 1 or 2 centimeters of accuracy. Vicon is the pioneer of these systems, and typically this comes with a price tag of definitely more than €100,000.

22:20 OptiTrack is now one of the other leaders on the market. They're offering systems that are a bit more affordable, between €10,301,000. And these are the classes of systems that you would see most frequently deployed at research labs. So here's one example that I would like to show, where an OptiTrack system has been used for tracking quadcopters in 3D space. You see the quadcopter, and the quadcopter is augmented with multiple retroreflective markers.

22:51 And so this allows the system to identify its 3D location. This is the actual camera view that you see inside the system. So in real time, you get a 3D model that shows the cameras and the respective marker locations. These are the cameras that have been deployed. And this allowed the creators of the system to realize a very beautiful orchestration of drones that are very precisely tracked in 3D space and position-corrected, so they can fly in an orchestrated manner.

23:22 Again, I invite you to check this out on YouTube yourself to see what can be done with modern optical motion capture systems. One last important system in this class is the Lighthouse tracking system. Many of you might have heard about the HTC Vive, or maybe even own one. It has been developed for gaming in the living room, and the HTC Vive comes with its own tracking system, which is quite interesting. So let me explain how this works.

23:58 The main idea of a Vive setup is that you have multiple base stations, which they call lighthouses, and I will say in a second why this term applies very well. So you have, for instance, two of these base stations and one or multiple users. And these users are using the Vive headset that you can see here. The Vive headset receives signals from these base stations, and based on the timing of the signals, it can precisely identify its location in space. So let's have a closer look at how this works.

24:34 This is a Lighthouse base station, the inside view if you remove the cover. And what we see here is three main components. The first component is the one you see here. These are essentially just infrared LEDs. They're used as a flash to illuminate the scene in the infrared spectrum.

24:52 So you cannot see this with your bare eyes. And then we see two wheels here. One wheel is oriented vertically. It is actually a laser mirror that you can see here. Basically, it is a mirror that reflects the laser light.

25:12 And it's rotating. The second one is a similar mirror that rotates in a horizontal manner. And that's the basic trick of how location tracking works with the Lighthouse. It's quite neat. So you have two or more of these base stations, and they emit precisely timed infrared pulses using these IR LED flashes.

25:36 The tracked objects must be active. This is the downside. Each of these objects has five or more infrared sensors. And the device calculates its position and orientation from the time difference between the pulses that it receives. So how does this work precisely?

6. Conclusion

25:53 In each individual step, the base station first of all sends an LED flash. This marks the start time. And then this laser here rotates horizontally, with a fixed angular speed. Then the IR flash flashes again and marks the start time, and then the vertical one rotates.

26:21 So what does this mean? It means that no matter where you are in the scene, you always receive the flash at the same time. And the flash is always followed by a second signal that is realized by these rotating laser mirrors. And depending on where you are, this signal hits you earlier or later. Why is that?

26:42 Well, because the signal rotates horizontally or vertically. And then, based on the timing difference between the initial flash and the laser signal that you receive, we can calculate angular information that tells us where we are in space at submillimeter accuracy. Why is that? Well, because the updates are really fast and the calculations are really precise. So here you see a slow-motion view of the Lighthouse tracking system in the infrared channel.

27:12 So what you see here is always first a flash, and then you see how this lighthouse beam, really like the classical light from a lighthouse, moves through the scene horizontally and vertically. And whenever it hits your sensor, the sensor notices the signal and then calculates the time difference, based on how long it took between the flash and the rotating signal to arrive.
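The timing-to-angle step can be sketched as follows. Since the laser sweeps with a fixed angular speed, the time between the flash and the moment the sweep hits a sensor is proportional to the angle of that sensor as seen from the base station. The rotor speed of 60 rotations per second is an assumption for illustration, and the function names are hypothetical.

```python
def sweep_angle_deg(dt_s, rotor_hz=60.0):
    """Angle swept by the laser between the sync flash and the sensor hit.

    dt_s is the measured time difference in seconds. rotor_hz is the
    assumed number of mirror rotations per second.
    """
    return 360.0 * rotor_hz * dt_s


def sensor_angles(t_flash_h, t_hit_h, t_flash_v, t_hit_v, rotor_hz=60.0):
    """Horizontal and vertical angles of one sensor from two timed sweeps."""
    horizontal = sweep_angle_deg(t_hit_h - t_flash_h, rotor_hz)
    vertical = sweep_angle_deg(t_hit_v - t_flash_v, rotor_hz)
    return horizontal, vertical


# A sensor hit 1/240 s after the flash lies a quarter rotation into the sweep.
print(sweep_angle_deg(1.0 / 240.0))
```

With these two angles for each of several sensors at known positions on the headset, and with two or more base stations, the device can solve for its full position and orientation.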