Paper 3 Case Study - Rescue Robots
Bundle adjustment is an optimization technique used in computer vision and photogrammetry to refine the parameters of a 3D reconstruction or camera calibration model. It involves simultaneously adjusting the 3D positions of points (landmarks) in a scene and the camera parameters to minimize the discrepancy between the predicted and observed image projections of those points. Bundle adjustment is commonly used in applications such as 3D reconstruction, camera calibration, and structure from motion.
Computer vision is a field of study that focuses on enabling computers to understand, analyze, and interpret visual data from images or videos. It involves developing algorithms and techniques to extract meaningful information from visual input, recognize objects, understand scenes, track motion, and perform tasks such as image classification, object detection, and scene understanding. Computer vision finds applications in various domains, including autonomous vehicles, surveillance systems, augmented reality, robotics, and image/video analysis.
Dead reckoning data refers to the information obtained through the process of dead reckoning or inertial navigation. It involves estimating an object's current position, velocity, or orientation by using previously known data and incorporating measurements of acceleration, rotation, and time. Dead reckoning data is often used in navigation systems, robotics, and motion tracking, where the data helps estimate an object's motion when other external position sensors like GPS are unavailable or unreliable.
Edge computing is a distributed computing paradigm that brings computation and data storage closer to the edge of the network, closer to where data is generated or consumed, rather than relying solely on centralized cloud computing infrastructure. By processing and analyzing data locally at the edge devices or edge servers, edge computing reduces latency, optimizes bandwidth usage, enhances privacy and security, and enables offline operation. It is commonly used in applications such as IoT, real-time analytics, and autonomous systems where low latency and real-time processing are crucial.
Global map optimization, also known as map refinement or map fusion, refers to the process of improving the accuracy, consistency, and completeness of a map representation of the environment by incorporating information from multiple sensor sources or multiple observations over time. It involves optimizing the positions of landmarks or scene points and camera poses to minimize errors and inconsistencies. Global map optimization is commonly used in mapping, localization, and navigation systems to create and maintain accurate and reliable maps of the environment.
The Global Positioning System (GPS) signal refers to the radio frequency signals transmitted by GPS satellites that provide positioning, navigation, and timing information to GPS receivers on Earth. GPS signals allow receivers to calculate their precise location, velocity, and time synchronization. The GPS signal consists of components such as the navigation message, carrier wave, and spread spectrum signal. GPS signals are utilized in various applications, including navigation systems, surveying, aviation, and location-based services.
A GPS-degraded environment refers to a situation or location where GPS signals are compromised or significantly degraded, leading to challenges or limitations in accurate positioning and navigation using GPS receivers. This degradation can occur due to factors such as signal obstruction, multipath interference, signal jamming, or adverse atmospheric conditions. In a GPS-degraded environment, alternative positioning methods or technologies may be used to supplement or replace GPS for reliable positioning and navigation.
A GPS-denied environment refers to a situation or location where GPS signals are entirely unavailable or inaccessible. It can occur in environments such as underground structures, indoor areas, or areas where intentional signal jamming is present. In a GPS-denied environment, alternative positioning and navigation techniques, such as inertial navigation systems (INS), sensor fusion, or visual-based localization, may be used to estimate position and navigate accurately.
Human pose estimation (HPE) is a computer vision task that involves estimating the positions and orientations of human body joints or body parts from images or videos. The goal is to understand and analyze human movement and posture. HPE has applications in various fields, including action recognition, motion capture, human-computer interaction, virtual reality, and augmented reality.
An Inertial Measurement Unit (IMU) is an electronic sensor device that combines multiple sensors, typically including accelerometers, gyroscopes, and sometimes magnetometers, to measure the linear and angular motion of an object. IMUs are commonly used in robotics, navigation systems, virtual reality, and motion tracking applications. They provide information about an object's acceleration, rotation rate, and often orientation.
Keyframe selection is a process in computer vision and video processing where specific frames from a sequence of video frames are chosen as keyframes. Keyframes capture important or representative information of a video sequence while omitting redundant or less informative frames. Keyframe selection is commonly used in video compression, summarization, and streaming to reduce storage space, processing requirements, and transmission bandwidth. Keyframes are often selected based on criteria such as scene changes, content diversity, or visual saliency.
Key points or key point pairs are distinctive and identifiable points or features in an image that are robust to changes in scale, rotation, or viewpoint. They are often extracted using feature detection algorithms and can be used for tasks such as image matching, object recognition, or 3D reconstruction. Key points serve as reference points to establish correspondences between images or to track objects across frames.
Light detection and ranging (LIDAR) is a remote sensing technology that measures distances and generates 3D representations of objects or environments using laser light. LIDAR systems emit laser pulses and measure the time it takes for the reflected light to return, allowing the calculation of distances to objects. LIDAR is commonly used in applications such as mapping, autonomous vehicles, robotics, and environmental monitoring.
Object occlusion refers to a situation in computer vision where an object or part of an object is partially or entirely obscured by another object or occluding surface in the scene. Occlusion poses challenges in tasks such as object detection, tracking, and recognition since the occluded regions may not be visible or fully observed. Dealing with object occlusion requires techniques that can handle partial information or infer occluded regions based on context or prior knowledge.
An odometry sensor, also known as a motion sensor or wheel encoder, is a device used to measure the motion or displacement of a vehicle or robot. It typically involves measuring wheel rotations or changes in wheel speed to estimate the distance traveled. Odometry sensors are commonly used in robotics for estimating relative motion and position changes over short distances.
Optimization refers to the process of finding the best solution or set of parameters that minimize or maximize an objective function or satisfy certain constraints. In computer vision and robotics, optimization techniques are often used to refine models, align data, estimate parameters, or solve complex problems. Common optimization methods include gradient descent, least squares, non-linear optimization, and convex optimization.
Relocalization occurs when a system loses tracking (or is initialized in a new environment) and needs to assess its location based on currently observable features. If the system is able to match the features it observes against the available map, it will localize itself to the corresponding pose in the map and continue the SLAM process. Relocalization, also known as pose recovery, is the process of estimating the position or pose of a sensor (e.g., a camera or a robot) within a known map or reference frame. It involves matching observed sensor data, such as images or 3D point clouds, with features or landmarks in the map to determine the sensor's position and orientation accurately. Relocalization is important for robust and accurate localization in scenarios where a sensor's position is temporarily lost or needs to be re-established.
Rigid pose estimation (RPE) refers to the task of estimating the position and orientation (pose) of a rigid object, such as a rigid body or object instance, in a 3D space. It involves determining the translation and rotation parameters that align a model or template of the object with the observed data, such as images or point clouds. RPE is used in applications such as object tracking, augmented reality, robotics, and camera calibration.
Robot drift, also known as sensor drift or localization drift, refers to the cumulative error or deviation that occurs over time in the estimated position or pose of a robot or autonomous system. It can be caused by inaccuracies in sensor measurements, noise, systematic errors, or limitations of the localization algorithms. Robot drift can lead to inaccurate positioning, navigation, or mapping if not properly accounted for or corrected.
Simultaneous localization and mapping (SLAM) is a technique used in robotics and computer vision to create a map of an unknown environment while simultaneously estimating the robot's position within that map. SLAM algorithms leverage sensor measurements, such as odometry, GPS, LIDAR, or camera data, to iteratively build the map and refine the robot's localization. SLAM is commonly used in autonomous navigation, robotics, and augmented reality.
A sensor fusion model combines data from multiple sensors or sources to improve the accuracy, robustness, or completeness of the information obtained. In computer vision and robotics, sensor fusion models integrate measurements from different sensors, such as cameras, LIDAR, IMUs, or GPS, to obtain a more comprehensive understanding of the environment, object detection, tracking, or localization. Fusion techniques can include data association, filtering algorithms (e.g., Kalman filter), probabilistic methods, or deep learning approaches.
Visual simultaneous localization and mapping (vSLAM) modules are components or stages within a vSLAM system that work together to perform the tasks of mapping the environment and estimating the camera's pose (position and orientation) in real time using visual information. The vSLAM modules include:
1. Initialization: The initialization module is responsible for setting up the vSLAM system at the beginning of the operation. It typically involves detecting or extracting visual features from the initial frames, estimating the camera's pose, and creating an initial map representation. Initialization provides the starting point for subsequent mapping and localization processes.
2. Local mapping: The local mapping module builds a local map of the environment based on the camera's motion and observed visual data. It extracts visual features from the current frame, estimates their 3D positions relative to the camera, and associates them with the existing map. Local mapping updates the map representation incrementally as new observations become available, allowing for continuous refinement of the map.
3. Loop closure: The loop closure module handles the detection and correction of errors or inconsistencies that may arise when revisiting previously seen parts of the environment. It identifies when the camera has returned to a previously visited location or scene, matches the current observations with previously stored keyframes or landmarks, and corrects accumulated errors. Loop closure helps maintain map consistency, improves localization accuracy, and reduces drift in the vSLAM system.
4. Relocalization: Relocalization is the process of re-establishing the camera's position and orientation within the existing map when it is temporarily lost or needs to recover after tracking failures. The relocalization module matches the current visual observations with features or landmarks in the map, allowing the system to determine the camera's pose relative to the map again. Relocalization enables robust and accurate localization even in challenging situations.
5. Tracking: The tracking module continuously estimates the camera's pose in real time as it moves through the environment. It analyzes the incoming visual data, matches the current frame's features with those in the map, and updates the camera's pose accordingly. The tracking module is responsible for tracking the camera's motion, predicting its pose between frames, and providing a continuous estimate of the camera's position and orientation.
These vSLAM modules work collaboratively to enable real-time mapping and localization in dynamic environments using visual information from cameras or sensors. They provide the necessary functionalities for building and maintaining a map of the environment and accurately estimating the camera's pose during camera motion.
“VideoRay’s search-and-rescue/recovery system is remote-operated. The video-enabled, joystick-controlled vehicle is a versatile submersible that employs high-powered lights, multi-beam sonar imaging, GPS and metal gauges to help rescue and recovery missions. The NYC Harbor Unit has used it for cargo inspections and Bertram Yachts once turned it into a sport-fishing accessory. But it really shines in the hands of clients like the Sheriff’s Office of St. Louis County, Minnesota, whose rescue squad used the device to find a drowning victim who collapsed into an iced-over lake.”
The United States Navy Explosive Ordnance Disposal (EOD) Units have one of the most dangerous missions assigned to military operatives – locating and disposing of underwater explosives. Compared to torpedoes, small boat attacks and missiles, underwater mines have caused more than four times more damage to U.S. Navy ships. Crude but effective mines, Water Borne Improvised Explosive Devices (WBIEDs) and Underwater Hazardous Devices (UHDs) are cheap, easy to stockpile, and easily concealed in holds of ships and fishing boats. Previous methods for underwater EOD were time-consuming, extremely hazardous, and labor-intensive, often with little to no verification of mission success or completion.
— https://videoray.com/u-s-navy-eod/
The advantages of integrating vSLAM into VideoRay robots
Video - Stills - 3D: Simultaneously provides a low-latency video stream for GVI (General Visual Inspection) and piloting, high resolution stills images for inspection, and a real-time 3D model to evaluate survey coverage and data quality during acquisition.
Enhanced Situational Awareness: The vSLAM solution provides ROV pilots with a visualization of the ROV's position relative to the 3D environment, enabling contextual piloting and the ability to effectively maintain a consistent speed and standoff distance.
Survey Quality Control: Visual 3D data is processed in real time, directly validating that coverage and image quality are sufficient for accurate photogrammetry prior to survey completion.
NaviSuite Integration: Uniting Discovery’s cutting-edge imaging capabilities with EIVA’s proven NaviSuite software to incorporate visual data into existing survey workflows and tools.
More than a decade after the 2011 Fukushima nuclear disaster, many areas of the facility remained unexplored. Some of the doors to rooms in the facility had not been opened since the disaster, and officials had little idea what was on the other side. In 2022, decommissioning crews began using Spot, the quadruped robot from Boston Dynamics, to collect data, shoot video, measure radiation dose, and gather debris samples for radiation testing. Although officials were already using other tracked and wheeled robots, the superior mobility and the automated arm of Spot proved to be “game changers,” says Brad Bonn, head of nuclear programs for Spot.
— https://bostondynamics.com/case-studies/spot-in-fukushima-daiichi/
“The fire that devastated France's Notre-Dame Cathedral in 2019 was a monumental loss in many ways. But it could have been far worse if not for Shark Robotics’ Colossus. Armed with its WALL-E-like treads and the power to blast 660 gallons of water per minute, the 1,100-pound fireproof robot was summoned by the Paris Fire Brigade when conditions proved too treacherous for firefighters. The brigade’s commander, Jean-Claude Gallet, would later say that Colossus had saved the lives of his crew.
Aside from extinguishing fire, the joystick-controlled Colossus can also haul firefighting equipment, transport wounded victims and trigger its 360-degree, high-definition thermal camera to assess a scene.”
The original abstract:
This paper presents the Visual Simultaneous Localization and Mapping (vSLAM™) algorithm, a novel algorithm for simultaneous localization and mapping (SLAM). The algorithm is vision- and odometry-based, and enables low-cost navigation in cluttered and populated environments. No initial map is required, and it satisfactorily handles dynamic changes in the environment, for example, lighting changes, moving objects and/or people. Typically, vSLAM recovers quickly from dramatic disturbances, such as “kidnapping”.
Visual SLAM systems are designed to map the environment around the sensors while simultaneously determining the precise location and orientation of those sensors within their surroundings. It relies entirely on visual data for estimating sensor motion and reconstructing environmental structures (Taketomi et al., 2017). This approach has attracted attention in the literature because it is cost-effective, easy to calibrate, and has low power consumption in monocular cameras while also allowing depth estimation and high accuracy in RGB-D and stereo cameras (Macario Barros et al., 2022; Abbad et al., 2023).
— A review of visual SLAM for robotics: evolution, properties, and future applications
Feature-based vSLAM is what our case study is referring to.
This method involves detecting and tracking distinct features in the environment, such as corners or edges, across multiple frames of video. Algorithms like ORB-SLAM and PTAM (Parallel Tracking and Mapping) fall under this category.
Within visual-only SLAM, there also exist other approaches besides feature-based methods, such as direct and semi-direct methods, which operate on raw pixel intensities rather than on extracted features.
The initialization module is responsible for setting up the vSLAM system at the beginning of the operation. It typically involves detecting or extracting visual features from the initial frames, estimating the camera's pose, and creating an initial map representation. Initialization provides the starting point for subsequent mapping and localization processes.
The local mapping module builds a local map of the environment based on the camera's motion and observed visual data. It extracts visual features from the current frame, estimates their 3D positions relative to the camera, and associates them with the existing map. Local mapping updates the map representation incrementally as new observations become available, allowing for continuous refinement of the map.
The loop closure module handles the detection and correction of errors or inconsistencies that may arise when revisiting previously seen parts of the environment. It identifies when the camera has returned to a previously visited location or scene, matches the current observations with previously stored keyframes or landmarks, and corrects accumulated errors. Loop closure helps maintain map consistency, improves localization accuracy, and reduces drift in the vSLAM system.
Relocalization is the process of re-establishing the camera's position and orientation within the existing map when it is temporarily lost or needs to recover after tracking failures. The relocalization module matches the current visual observations with features or landmarks in the map, allowing the system to determine the camera's pose relative to the map again. Relocalization enables robust and accurate localization even in challenging situations.
The tracking module continuously estimates the camera's pose in real-time as it moves through the environment. It analyzes the incoming visual data, matches the current frame's features with those in the map, and updates the camera's pose accordingly. The tracking module is responsible for tracking the camera's motion, predicting its pose between frames, and providing a continuous estimate of the camera's position and orientation.
These vSLAM modules work collaboratively to enable real-time mapping and localization in dynamic environments using visual information from cameras or sensors. They provide the necessary functionalities for building and maintaining a map of the environment and accurately estimating the camera's pose during camera motion.
Feature extraction and mapping happen initially in the initialization module, then subsequently in each cycle of the local mapping module.
Camera distortions are corrected for, and the image is preprocessed in order to enhance feature extraction (e.g. noise reduction, sharpening, contrast increase).
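As a rough illustration of this step, the sketch below undistorts a frame and enhances it for feature extraction with OpenCV. The calibration values are placeholders, not ones from any particular robot; a real system would load its own calibration.

```python
import cv2
import numpy as np

# Placeholder calibration values -- a real system loads these from its own
# camera calibration rather than hard-coding them.
camera_matrix = np.array([[700.0, 0.0, 320.0],
                          [0.0, 700.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.25, 0.1, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

def preprocess(frame_bgr):
    """Undistort the frame and enhance it for feature extraction."""
    # Correct lens distortion using the calibration parameters.
    undistorted = cv2.undistort(frame_bgr, camera_matrix, dist_coeffs)
    # Work in grayscale for feature detection.
    gray = cv2.cvtColor(undistorted, cv2.COLOR_BGR2GRAY)
    # Mild denoising followed by local contrast enhancement (CLAHE).
    denoised = cv2.GaussianBlur(gray, (3, 3), 0)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(denoised)
```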
Key points (features) are detected in the image.
This can be done using a deep learning approach, where objects (and their bounding boxes) are identified, or with traditional approaches such as edge detection, which can be done with image processing kernels.
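For the traditional route, a minimal sketch of feature detection and matching using ORB (the detector family used by ORB-SLAM) might look like the following; the input is assumed to be a preprocessed grayscale frame.

```python
import cv2

# ORB combines FAST corner detection with binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)

def detect_features(gray_image):
    """Return key points and their descriptors for one preprocessed frame."""
    keypoints, descriptors = orb.detectAndCompute(gray_image, None)
    return keypoints, descriptors

def match_features(desc_a, desc_b):
    """Brute-force Hamming matching between two frames' descriptors."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)
    # Keep the most reliable matches first (smaller distance = better match).
    return sorted(matches, key=lambda m: m.distance)
```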
The change in position of the extracted features over time is combined with the dead reckoning data from the IMU to determine the location of the robot within the environment.
One big challenge occurs in this process, however:
As the camera moves through space, there is increasing noise and uncertainty between the images the camera captures and its associated motion.
Kalman filters reduce the effects of noise and uncertainty among different measurements to model a linear system more accurately by continually making predictions, updating and refining the model against the observed measurements.
For SLAM systems, we typically use extended Kalman filters (EKFs), which handle nonlinear systems by linearizing the predictions and measurements around their mean.
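A heavily simplified EKF skeleton is sketched below for a planar robot whose state is (x, y, heading). The motion and measurement models are illustrative stand-ins; a real vSLAM EKF would also carry landmark positions in its state vector.

```python
import numpy as np

class SimpleEKF:
    """Minimal EKF for a planar robot state [x, y, theta]."""

    def __init__(self):
        self.x = np.zeros(3)            # state estimate
        self.P = np.eye(3) * 0.1        # state covariance
        self.Q = np.eye(3) * 0.01       # process noise (tuning value)
        self.R = np.eye(2) * 0.05       # measurement noise (tuning value)

    def predict(self, v, omega, dt):
        """Propagate the state with a (nonlinear) unicycle motion model."""
        x, y, th = self.x
        self.x = np.array([x + v * np.cos(th) * dt,
                           y + v * np.sin(th) * dt,
                           th + omega * dt])
        # Jacobian of the motion model, linearized at the current mean.
        F = np.array([[1, 0, -v * np.sin(th) * dt],
                      [0, 1,  v * np.cos(th) * dt],
                      [0, 0, 1]])
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z):
        """Correct the prediction with a position measurement z = [x_obs, y_obs]."""
        H = np.array([[1, 0, 0],
                      [0, 1, 0]])             # measurement Jacobian
        innovation = z - H @ self.x
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(3) - K @ H) @ self.P
```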
Loop closure: When the localization algorithms detect features (and groups of features) that have already been seen, or when the internal map of the robot indicates that it is in a previously traversed region, the map can be refined to improve the accuracy of the robot’s navigation.
Relocalization: When the robot loses tracking (loses track of its position in the environment), it tries to match the features seen by the camera with its internal map. If it recognizes certain groups of features, it can place itself back into the map, and continue tracking.
Bundle adjustment: By comparing the projected point of the feature (based on the internal map/world state) with the actual location of the feature (based on input data from the sensors), the reprojection error can be calculated, and using an optimization algorithm, the parameters of the internal map can be adjusted in order to minimize this reprojection error.
“The reprojection error is a geometric error corresponding to the image distance between a projected point and a measured one. It is used to quantify how closely an estimate of a 3D point recreates the point's true projection.”
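In code, the reprojection error for a single landmark might be computed roughly as below, using a simple pinhole model with illustrative intrinsics. Bundle adjustment then minimizes the sum of these squared errors over many points and frames (for example with scipy.optimize.least_squares).

```python
import numpy as np

K = np.array([[700.0, 0.0, 320.0],     # illustrative pinhole intrinsics
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

def reprojection_error(point_3d, R, t, observed_px):
    """Distance between a projected map point and its measured pixel location."""
    # Transform the landmark into the camera frame, then project it.
    p_cam = R @ point_3d + t
    p_img = K @ p_cam
    projected = p_img[:2] / p_img[2]    # perspective division
    return np.linalg.norm(projected - observed_px)

# Bundle adjustment jointly adjusts R, t and point_3d for every camera and
# landmark to minimize the sum of squared reprojection errors.
```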
Bundle adjustment is used in the following contexts:
Locally (local bundle adjustment): the robot optimizes a subset of the most recent frames and points to quickly improve the local accuracy without excessive computation.
Globally (global bundle adjustment): occasionally run to refine the entire map and trajectory, ensuring consistency and accuracy over long sequences.
Keyframes are select observations by the camera that capture a “good” representation of the environment. Some approaches perform a bundle adjustment after every keyframe. Filtering-based approaches become extremely computationally expensive as the map model grows; keyframe-based approaches, by contrast, allow more feature points or larger maps, with a balanced tradeoff between accuracy and efficiency.
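Keyframe selection policies differ between systems; one common heuristic, sketched here with made-up thresholds, is to insert a new keyframe when the overlap with the last keyframe drops or when enough frames have passed.

```python
def should_insert_keyframe(n_tracked, n_in_last_keyframe,
                           frames_since_keyframe,
                           min_overlap=0.7, max_gap=20):
    """Illustrative keyframe policy: low overlap or a long gap triggers one."""
    overlap = n_tracked / max(n_in_last_keyframe, 1)
    return overlap < min_overlap or frames_since_keyframe > max_gap
```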
Analysis of state-of-the-art visual odometry/visual simultaneous localization and mapping (VSLAM) system exposes a gap in balancing performance (accuracy and robustness) and efficiency (latency). Feature-based systems exhibit good performance, yet have higher latency due to explicit data association; direct and semidirect systems have lower latency, but are inapplicable in some target scenarios or exhibit lower accuracy than feature-based ones.
— Zhao et al., Semantic Scholar Reader
Ideally, we want vSLAM to work as fast as possible, and as accurately as possible.
(these are more about the processing/overall implementation of vSLAM rather than the specific sensor paradigm/configuration)
Having a specialized processor dedicated to performing vSLAM allows it to run faster and more efficiently, since the hardware is designed specifically for the vSLAM workload.
“This letter introduces a dedicated processor architecture, called MEGACORE, which leverages vector technology to enhance tracking performance in visual simultaneous localization and mapping (VSLAM) systems. By harnessing the inherent parallelism of vector processing and incorporating a floating point unit (FPU), MEGACORE achieves significant acceleration in the tracking task of VSLAM. Through careful optimizations, we achieved notable improvements compared to the baseline design.”
— Li et al., https://doi.org/10.1109/LES.2023.3298900
Having a lightweight, on-device, self-supervised network that extracts local features for the vSLAM algorithm
“This article proposes a quantized self-supervised local feature for the indirect VSLAM to handle the environmental interference in robot localization tasks. A joint feature detection and description network is built in a lightweight manner to extract local features in real time. The network is iteratively trained by a self-supervised learning strategy, and the extracted local features are quantized by an orthogonal transformation for efficiency.”
— Li et al., https://doi.org/10.1109/TMECH.2021.3085326 (different Li)
Applying the concept of edge computing
Distributing the computation of vSLAM across multiple processors or even across devices in a network can reduce latency and increase the system's scalability.
This approach is particularly useful in collaborative robotic systems, e.g. a fleet of rescue robots mapping/working in the same environment
Integrating multiple types of sensors can significantly enhance the robustness and accuracy of vSLAM systems. Commonly, vSLAM systems are augmented with inertial measurement units (IMUs), depth sensors, or LiDAR to provide additional data points that compensate for the limitations of visual data alone.
“VI-SLAM is a technique that combines the capabilities of visual sensors, such as stereo cameras, and inertial measurement sensors (IMUs) to achieve its SLAM objectives and operations (Servières et al., 2021; Leut et al., 2015). This hybrid approach allows a comprehensive modeling of the environment, where robots operate (Zhang et al., 2023). It can be applied to various real-world applications, such as drones and mobile robotics (Taketomi et al., 2017). The integration of IMU data enhances and augments the information available for environment modeling, resulting in improved accuracy and reduced errors within the system’s functioning (Macario Barros et al., 2022; Mur-Artal and Tardós 2017b).”
But in what cases may other approaches perform better than visual-inertial SLAM?
In terms of choosing the approach, it really depends:
Specific application needs: Different applications prioritize different aspects like accuracy, real-time performance, computational cost, and robustness.
Environmental conditions: Different environments (indoor vs. outdoor, structured vs. unstructured) may favor different types of sensors and algorithms.
In environments with poor lighting or highly dynamic scenes, LiDAR SLAM might outperform VI-SLAM due to its ability to provide accurate distance measurements regardless of lighting conditions.
Beyond visual and inertial sensors, integrating additional modalities like GPS, odometry, or depth sensors can provide complementary data that helps in scenarios where VI-SLAM might struggle, such as featureless or highly dynamic environments.
For instance, RGB-D SLAM
This integrates RGB-D cameras with depth sensors to estimate and build models of the environment (Ji et al., 2021; Macario Barros et al., 2022). This technique has found applications in various domains, including robotic navigation and perception (Luo et al., 2021). It demonstrates efficient performance, particularly in well-lit indoor environments, providing valuable insights into the spatial landscape (Dai et al., 2021).
The incorporation of RGB-D cameras and depth sensors enables the system to capture both color and depth information simultaneously. This capability is advantageous in indoor applications, addressing the challenge of dense reconstruction in areas with low-textured surfaces (Zhang et al., 2021b).
HPE is crucial for rescue robots, and can help determine whether victims are in need of immediate assistance. Certain poses can be compared with certain conditions:
if someone is walking, they are likely fine, whereas if someone is lying or struggling on the ground, they are more likely to need assistance.
The detection of arm waving or other physical distress signals helps rescuers pinpoint those who need help.
Joint position error analysis from HPE data can help rescuers pinpoint the nature of trauma.
if a joint is severely out of place and not positioned normally based on degrees of freedom, broken or dislocated limbs can be identified.
Rescue robots can navigate through debris or confined spaces without causing harm to (stepping on) victims.
2D: This involves estimating the positions of various body parts in two dimensions. This is often done by predicting the positions of key points, or "joints," like the elbows or knees, in the image.
3D: This is a more challenging task that involves estimating the positions of body parts in three dimensions. It not only requires understanding where the body parts are in an image, but also how far they are from the camera.
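As a toy illustration of how 2D HPE output could feed the kind of triage heuristic described earlier (upright vs. lying down), the sketch below classifies a pose from the relative positions of hypothetical head and ankle key points. The keypoint names and layout are assumptions for illustration; real systems use far richer models.

```python
import numpy as np

# Hypothetical keypoint layout: each pose is a dict of named 2D image points.
def is_lying_down(keypoints):
    """Rough heuristic: the body axis is closer to horizontal than vertical."""
    head = np.array(keypoints["head"], dtype=float)
    ankles = (np.array(keypoints["left_ankle"], dtype=float) +
              np.array(keypoints["right_ankle"], dtype=float)) / 2.0
    dx, dy = np.abs(ankles - head)
    # In image coordinates, an upright person spans more vertical (y) extent.
    return dx > dy

pose = {"head": (120, 200), "left_ankle": (320, 215), "right_ankle": (330, 230)}
print(is_lying_down(pose))  # True: the head-to-ankle axis is mostly horizontal
```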
Bottom-up vs. top-down HPE
(a) — top down, (b) — bottom up
Modern approaches to HPE are based on neural networks, which are trained (taught to recognize humans) based on labeled data.
The poses of occluded components are estimated based on the position of visible limbs (key points), edge lengths (for 3D HPE), and temporal convolution (extrapolation of previous and future positions when the limbs are visible).
Occlusion leads to distorted visual data, making it difficult for the HPE algorithms to accurately identify and interpret human poses. HPE algorithms must be designed with occlusion in mind.
When traditional HPE methods are limited by occlusion and crowded scenes, deep-learning-based methods can estimate or correct the estimated poses based on temporal and other adjacent information.
(top-down approach)
(bottom-up approach)
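One simple instance of using temporal information is interpolating an occluded joint from the frames where it was visible; the sketch below assumes per-frame keypoint tracks in which occluded frames are marked as None.

```python
import numpy as np

def interpolate_occluded(track):
    """Fill missing joint positions by linear interpolation over time.

    `track` is a list of (x, y) tuples, with None for frames where the joint
    was occluded; missing frames outside the visible range are clamped to
    the nearest visible value.
    """
    xs = np.array([np.nan if p is None else p[0] for p in track], dtype=float)
    ys = np.array([np.nan if p is None else p[1] for p in track], dtype=float)
    t = np.arange(len(track))
    visible = ~np.isnan(xs)
    xs[~visible] = np.interp(t[~visible], t[visible], xs[visible])
    ys[~visible] = np.interp(t[~visible], t[visible], ys[visible])
    return list(zip(xs, ys))

# Frames 1 and 2 are occluded and get interpolated from frames 0 and 3.
print(interpolate_occluded([(10, 50), None, None, (40, 80)]))
```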
Using multiple datasets to train deep-learning-based approaches can also bring advantages (but also disadvantages) over training on a single dataset.
There is no doubt that deep-learning-based approaches overall are significantly more robust than traditional methods. Within DL, however, there are multiple pathways:
While CNNs are commonly used in HPE studies for their effectiveness in implicit feature extraction from images, a few studies have explored other deep learning methods such as GANs, GNNs, and RNNs. The relative performance of these methods is unclear and warrants further research.
Many studies have found that detection-based approaches outperform regression-based approaches for estimating single poses. Recently, Gu et al. [162] analyzed these two approaches to determine why detection-based methods are superior to regression-based methods. They ultimately proposed a technique that showed regression-based approaches could outperform detection-based approaches, especially when facing complex problems. Further study of this work may open new directions for estimating single-person poses.
Multi-person pose estimation is significantly harder due to occlusion and varying person sizes, and finding the best solutions remains an open problem:
Optical flow has been used by some studies to track motion in videos. However, it is easily affected by noise and can have difficulty tracking human motion in noisy environments. To improve performance, a few works have replaced optical flow with other techniques, such as RNNs or temporal consistency.
Many studies use post-processing steps such as search algorithms or graphical models to group predicted key points into individual humans in bottom-up approaches. However, some recent works have incorporated graphical information into neural networks to make the training process differentiable.
Improving the efficiency of HPE tasks is not limited to enhancing models; dataset labels also play a significant role. In addition to keypoint position labels, only a few datasets provide additional labels, such as visibility of body joints, that can help address the challenge of occlusions.
As occlusion is one of the main challenges in 2D HPE, researchers need to increase the number of occluded labels in datasets. Unsupervised/semi-supervised and data augmentation methods are currently used to address this limitation;
Another challenge in 2D HPE is crowded scenes. Only a few datasets provide data with crowded scenarios (e.g., CrowdPose and COCO), and their data consist only of images. Recently, a dataset called HAJJv2 [163] was introduced that provides more than 290,000 videos for detecting abnormal behaviors during Hajj religious events. The data in this dataset are diverse in terms of race, as many people from all over the world [164,165] perform Hajj rituals. They also have a large crowd scale, providing nine classes with normal and abnormal behaviors for each category. This dataset may help train 2D HPE models.
For example, models such as YOLO and OpenPose are used for detecting and estimating poses to identify suspicious behavior during Hajj events. However, these models still face challenges in handling large numbers of poses in real-time. Developing methods to address this problem remains an open challenge.
— Samkari et al., https://doi.org/10.3390/make5040081
Recall that this challenge mentions a scenario where “rubble is still shifting after an earthquake.”
While building maps when robot poses are known is a tractable problem requiring limited computational complexity, the simultaneous estimation of the trajectory and the map of the environment (known as SLAM) is much more complex and requires many computational resources.
Moreover, SLAM is generally performed in environments that do not vary over time (called static environments), whereas real applications commonly require navigation services in changing environments (called dynamic environments).
Many real robotic applications require updated maps of the environment that vary over time, starting from a given known initial condition.
In this context, classical SLAM approaches are generally not directly applicable: such approaches only apply in static environments or in dynamic environments where it is possible to model the environment dynamics. We are interested here in long-term mapping operativity in presence of variations in the map, as in the case of robotic applications in logistic spaces, where rovers have to track the presence of goods in given areas.
— https://ieeexplore.ieee.org/document/5756810
“Dynamic objects such as people and cars are often unavoidable in scenarios such as classrooms, hospitals, and outdoor shopping places. Those vSLAM systems built on a static environment have poor adaptability to dynamic and complex scenes, leading to substantial errors in the obtained map points and pose matrix (Cheng et al., 2019). Indirectly, it will cause problems such as drift of virtual objects registered in the world coordinate system.”
— https://www.frontiersin.org/articles/10.3389/fnbot.2022.990453/full
One solution is (in feature-based vSLAM) to remove any features that are moving (dynamic). Thus, the goal is to figure out what part of a frame is static and what is dynamic, so we can treat them differently.
Using HPE to perform semantic segmentation of dynamic vs static objects in a frame, the vSLAM system can be told to “ignore” any moving objects like humans.
This makes the mapping itself more accurate, as we are only mapping the static world.
→ This seems to be a common solution - simply knowing what is dynamic and what is static is helpful to both the vSLAM and HPE algorithms and the human controller. (for instance, a heatmap of dynamic elements could be useful to quickly identify humans, in addition to HPE).
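A minimal sketch of this idea: given a binary mask of pixels belonging to people (from HPE or a segmentation network, both assumed here rather than specified by the case study), discard any detected key points that fall inside the mask before they are used for tracking or mapping.

```python
def filter_static_keypoints(keypoints, person_mask):
    """Keep only key points that do not lie on a (dynamic) person pixel.

    `keypoints` is a list of cv2.KeyPoint-like objects with a .pt attribute;
    `person_mask` is an HxW array that is non-zero where a person was detected.
    """
    static = []
    h, w = person_mask.shape
    for kp in keypoints:
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if 0 <= x < w and 0 <= y < h and person_mask[y, x] == 0:
            static.append(kp)
    return static
```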
“The optimized vSLAM algorithm adds the modules of dynamic region detection and semantic segmentation to ORB-SLAM2. First, a dynamic region detection module is added to the vision odometry. The dynamic region of the image is detected by combining single response matrix and dense optical flow method to improve the accuracy of pose estimation in dynamic environment.”
— Wei et al., A Semantic Information-Based Optimized vSLAM in Indoor Dynamic Environments
“Moreover, a new dynamic feature detection method called semantic and geometric constraints was proposed, which provided a robust and fast way to filter dynamic features. The semantic bounding box generated by YOLO v3 (You Only Look Once, v3) was used to calculate a more accurate fundamental matrix between adjacent frames, which was then used to filter all of the truly dynamic features.”
— Yang et al., https://doi.org/10.3390/s20082432
Rather than completely removing them, we can also weight dynamic features differently so that the algorithm knows to pay less attention to moving features.
“A robust visual SLAM system that utilizes weighted features, named WF-SLAM, is proposed in this paper, which is based on ORB-SLAM2, and significantly decreases mismatch and improves the accuracy of localization.”
— Zhong et al., https://doi.org/10.1109/jsen.2022.3169340
Rather than weighting features, we can also weight the data from the different sensors.
The system can assign different weights to the sensor measurements based on their reliability and relevance in the current environment. For example, in areas with high dynamics, the system can give more weight to LiDAR or radar data compared to visual features.
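One simple way to express this is inverse-variance weighting of pose estimates from different sensors, where the variance assigned to the visual estimate is inflated when the scene is judged highly dynamic. The thresholds and variances below are illustrative assumptions, not values from the case study.

```python
import numpy as np

def fuse_pose_estimates(visual_pose, lidar_pose, dynamic_ratio):
    """Weighted fusion of two 2D position estimates.

    `dynamic_ratio` is the fraction of image features judged dynamic; the
    more dynamic the scene, the less the visual estimate is trusted.
    """
    var_lidar = 0.05                                   # assumed LiDAR variance
    var_visual = 0.05 * (1.0 + 10.0 * dynamic_ratio)   # inflate with dynamics
    w_visual = 1.0 / var_visual
    w_lidar = 1.0 / var_lidar
    fused = (w_visual * np.asarray(visual_pose) +
             w_lidar * np.asarray(lidar_pose)) / (w_visual + w_lidar)
    return fused

print(fuse_pose_estimates([1.0, 2.0], [1.2, 2.1], dynamic_ratio=0.6))
```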
Similarly, the criteria for selecting keyframes can also be adapted based on the level of dynamics in the environment. In highly dynamic scenes, more frequent keyframe updates might be necessary to capture the changing structure of the rubble.
Map management: We can divide the map into chunks, and each time we update the map we can update only the affected chunk rather than the entire map. This allows us to have a full representation of the environment while adjusting to dynamic changes.
In the occupancy grid method, each cell in the occupancy grid map is assigned a probability of being occupied or free. As new sensor data arrives, the probabilities are updated using Bayesian inference, allowing the map to adapt to the changing environment. This is an example of a probabilistic mapping technique, which represents the uncertainty and dynamics of the environment.
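A standard way to implement this probabilistic update is in log-odds form, as sketched below; each observed cell is nudged toward "occupied" or "free", so repeated observations let the map follow a changing environment. The update increments and clamp limits are illustrative tuning values.

```python
import numpy as np

class OccupancyGrid:
    """Tiny log-odds occupancy grid (values > 0 mean 'probably occupied')."""

    def __init__(self, shape, l_occ=0.85, l_free=-0.4):
        self.logodds = np.zeros(shape)
        self.l_occ, self.l_free = l_occ, l_free

    def update_cell(self, i, j, hit):
        """Bayesian update for one observed cell: hit=True means occupied."""
        self.logodds[i, j] += self.l_occ if hit else self.l_free
        # Clamp so old evidence can still be overturned when rubble shifts.
        self.logodds[i, j] = np.clip(self.logodds[i, j], -5.0, 5.0)

    def probability(self, i, j):
        """Convert log-odds back to an occupancy probability."""
        return 1.0 - 1.0 / (1.0 + np.exp(self.logodds[i, j]))

grid = OccupancyGrid((100, 100))
grid.update_cell(10, 12, hit=True)
print(round(grid.probability(10, 12), 2))  # ~0.70 after one 'occupied' hit
```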
“To avoid a continuous re-mapping, the map can be updated to obtain a consistent representation of the current environment. In this paper, we propose a novel LIDAR-based occupancy grid map updating algorithm for dynamic environments. The proposed approach allows robust long-term operations as it can detect changes in the working area even in presence of moving elements.”
— Stefanini et al., doi:10.5281/zenodo.7531326
“This work presents a semantic map management approach for various environments by triggering multiple maps with different simultaneous localization and mapping (SLAM) configurations. A modular map structure allows to add, modify or delete maps without influencing other maps of different areas. The hierarchy level of our algorithm is above the utilized SLAM method. Evaluating laser scan data (e.g. the detection of passing a doorway) triggers a new map, automatically choosing the appropriate SLAM configuration from a manually predefined list. Single independent maps are connected by link-points, which are located in an overlapping zone of both maps, enabling global navigation over several maps. Loop closures between maps are detected by an appearance-based method, using feature matching and iterative closest point (ICP) registration between point clouds.”
— Ehlers et al., doi: 10.1109/ICRA40945.2020.9196997
Instead of relying on a single, static map, the vSLAM system can continuously update the map based on the latest sensor data. This approach involves detecting changes in the environment and incrementally updating the affected regions of the map.
By comparing the current sensor data with the existing map, the system can identify regions where significant changes have occurred. Techniques like point cloud registration, occupancy grid comparison, or appearance-based methods can be used for change detection.
Once the changed regions are identified, the map can be incrementally updated by incorporating the new data and discarding the outdated information. This process helps maintain a more accurate representation of the current state of the environment.
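As a rough sketch of change detection against a chunked occupancy map (building on the grid idea above), a chunk is flagged for re-mapping when the new observations disagree with the stored map beyond a threshold; the chunk size and threshold are illustrative assumptions.

```python
import numpy as np

def changed_chunks(old_map, new_obs, chunk=16, threshold=0.2):
    """Return indices of map chunks where the new observation disagrees.

    `old_map` and `new_obs` are occupancy-probability arrays of equal shape;
    only chunks whose mean absolute difference exceeds `threshold` are
    re-integrated, leaving the rest of the map untouched.
    """
    changed = []
    h, w = old_map.shape
    for i in range(0, h, chunk):
        for j in range(0, w, chunk):
            diff = np.abs(old_map[i:i+chunk, j:j+chunk] -
                          new_obs[i:i+chunk, j:j+chunk])
            if diff.mean() > threshold:
                changed.append((i // chunk, j // chunk))
    return changed
```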
“Qualitative text analysis [of the existing body of literature on rescue robot ethical concerns] identified seven core ethically relevant themes: fairness and discrimination; false or excessive expectations; labor replacement; privacy; responsibility; safety; and trust.”
Discrimination was mostly looked at in terms of disaster victims; in one paper, instead, it was considered as relating to rescue operators. As Amigoni and Schiafonati point out:
Hazards and benefits should be fairly distributed (…) to avoid the possibility of some subjects incurring only costs while other subjects enjoy only benefits. This condition is particularly critical for search and rescue robot systems, e.g., when a robot makes decisions about prioritizing the order in which the detected victims are reported to the human rescuers or about which detected victim it should try to transport first (Amigoni & Schiafonati, 2018).
Stakeholders are generally unable to make sound assessments about the capabilities and limitations of rescue robots. This inability can lead stakeholders to overestimate or underestimate the capabilities of rescue robots.
In the first case, this may translate into unjustified reliance on their performance, and thus, for example, into false hopes that the robots may save certain victims, or into their deployment for tasks for which they are not suitable or under inappropriate conditions.
In the second case, when robots' capabilities are underestimated, they may be underutilized, leading to a waste of precious resources (Harbers et al., 2017).
Stakeholders predict that rescue robots will likely replace human operators in the most physically challenging or high-risk rescue missions. Researchers express concerns that replacing humans with robots may lead to degraded performance concerning victim contact, situation awareness, manipulation capabilities, etc., pointing out that robot-mediated contact with victims may interfere with medical personnel's ability to perform triage or provide medical advice or support.
The use of robots generally leads to an increase in information gathering, which can jeopardize the privacy of personal information. This may be personal information about rescue workers, such as images or data about their physical and mental stress levels, but also about victims or people living or working in the disaster area. Harbers et al. add that the loss of privacy potentially associated with the deployment of robots in disaster scenarios does not necessarily result in an ethical dilemma: indeed, given the critical nature of search and rescue operations, the benefits of collecting information in such settings largely outweigh any harms it may cause. This will require, however, that the information gathered by the robots is not shared with anyone outside professional rescue organizations and is exclusively used for rescue purposes.
In the paper by Tanzi et al. issues of responsibility are viewed as associated with liability in the event of technical failures or accidents and injuries to victims (Tanzi et al., 2015). Harbers and colleagues instead focus on responsibility assignment problems, which, they say, can apply to both moral and legal responsibility, where moral responsibility concerns blame and legal responsibility, instead, concerns accountability.
Such problems, according to the authors, can arise when robots act with no human supervision. If a robot malfunctions, behaves incorrectly, makes a mistake or causes harm, it may be unclear who is responsible for the damage caused: the operator, the software developer, the manufacturer or the robot itself. Responsibility assignment problems, they continue, become particularly complicated when the robot has some degree of autonomy, self-learning capabilities or is capable of making choices that were not explicitly programmed (Harbers et al., 2017).
Harbers and colleagues acknowledge that although attention to safety is clearly one of the key priorities that need to be taken into account when deploying rescue robots, this priority will often have to be balanced against other values, as rescue missions necessarily involve safety risks. Certain of these risks can be mitigated by replacing operators with robots, but robots themselves, in turn, may introduce other safety risks, mainly because they can malfunction. Even when they perform correctly, robots can still be harmful: they may, for instance, fail to identify and collide with a human being. In addition, robots can hinder the well-being of victims in subtler ways. For example, the authors argue, being trapped under a collapsed building, wounded and lost, and suddenly being confronted with a robot, especially if there are no humans around, can in itself be a shocking experience (Harbers et al., 2017).
Focusing specifically on the use of UAVs, Tanzi et al. also emphasize the risks associated with collisions and accidents, pointing out that even high-end military drones like the Predator crash with some frequency, although injuries are rare, and that in urban environments, small UAVs can still cause injury or property damage (Tanzi et al., 2015).
The question of trust in autonomous systems is the focus of one of the papers identified by our review. In his paper, Stormont highlights how trust by an agent in another agent requires two beliefs: that an agent that can perform a task to help another achieve a goal has a) the ability to perform the task and b) the desire to perform it (Stormont, 2008).
He then points out that two main components of trust have been identified in the literature: confidence and reputation.
Stormont claims that autonomous systems and robots in general tend to not have a good reputation, and that confidence must be involved. In the author's view, humans lack confidence in autonomous robots because they are unpredictable.
Humans working together are generally able to anticipate each other's actions in a wide range of circumstances— especially if they have trained together, as is the case in rescue crews. Autonomous systems, instead, often surprise even those who designed them, and such unpredictability can be both concerning and unwelcome in dangerous situations like those that are typical of disaster scenarios.
In 2002, Gianmarco Veruggio coined the term ‘roboethics’ which establishes ethical standards for the design, production, and use of robots. It is important to bring both human (programmers, designers, or users) and robot behaviour into regulations so that it can be controlled by law and code. Leenes et al. [28], distinguished code or law into four categories:
Regulating robot design, production through law.
Regulating user behaviour through the robot’s design.
Regulating the effects of robot behaviour through law.
Regulating robot behaviour through code.
Roboethics is distinct from “machine ethics”, which would require robots to follow ethical guidelines, and is still in the theoretical stage because autonomous robots are not yet capable of making moral judgements [27]. This is even more relevant regarding robotics in SAR operations. For instance, robots doing inexperienced first aid may cause extra problems because of their moral incapability and uncertain nature, such as harming or breaking people’s parts of body. SAR robotics is highly human-centric, the mismatch between robots and human cognitive abilities is often the limiting factor in SAR; regardless of the technical capabilities of robots in terms of locomotion, communication, and sensing [29]. In this former field, some studies discussed ethical issues.
— Chitikena et al., https://doi.org/10.3390/app13031800
“As robots are becoming increasingly human-like, this issue will continue to gain importance over time. The following question thus emerges: Should we act in order to maintain human life as the most valuable from the legal perspective? For example, if we accept that human life should always be at the top of hierarchies of value, perhaps manufacturers should be forced to mark robots such that they can be easily differentiated from humans in emergencies. In unforeseen traffic accidents, drivers only have seconds to decide what to do and what they can avoid. Robot drivers and human drivers should know that robots should be sacrificed in collisions involving both humans and robots. From another perspective, we should ask whether robots have any properties that make them equal to humans with regard to legal protections, such as a human-like intelligence, and whether we could in fact decide that robots should be granted more protection than humans.”
— Whether to Save a Robot or a Human: On the Ethical and Legal Limits of Protections for Robots
The iRobotSurgeon survey aimed to explore public opinion towards the issue of liability with robotic surgical systems. The survey included five hypothetical scenarios where a patient comes to harm and the respondent needs to determine who they believe is most responsible: the surgeon, the robot manufacturer, the hospital, or another party.
A total of 2,191 completed surveys were gathered evaluating 10,955 individual scenario responses from 78 countries spanning 6 continents. The survey demonstrated a pattern in which participants were sensitive to shifts from fully surgeon-controlled scenarios to scenarios in which robotic systems played a larger role in decision-making such that surgeons were blamed less.
However, there was a limit to this shift with human surgeons still being ascribed blame in scenarios of autonomous robotic systems where humans had no role in decision-making. Importantly, there was no clear consensus among respondents where to allocate blame in the case of harm occurring from a fully autonomous system.
The iRobotSurgeon Survey demonstrated a dilemma among respondents on who to blame when harm is caused by a fully autonomous surgical robotic system. Importantly, it also showed that the surgeon is ascribed blame even when they have had no role in decision-making which adds weight to concerns that human operators could act as “moral crumple zones” and bear the brunt of legal responsibility when a complex autonomous system causes harm.
— Autonomous surgical robotic systems and the liability dilemma
Bundle adjustment is an optimization technique used in computer vision and photogrammetry to refine the parameters of a 3D reconstruction or camera calibration model. It involves simultaneously adjusting the 3D positions of points (landmarks) in a scene and the camera parameters to minimize the discrepancy between the predicted and observed image projections of those points. Bundle adjustment is commonly used in applications such as 3D reconstruction, camera calibration, and structure from motion.
Computer vision is a field of study that focuses on enabling computers to understand, analyze, and interpret visual data from images or videos. It involves developing algorithms and techniques to extract meaningful information from visual input, recognize objects, understand scenes, track motion, and perform tasks such as image classification, object detection, and scene understanding. Computer vision finds applications in various domains, including autonomous vehicles, surveillance systems, augmented reality, robotics, and image/video analysis.
Dead reckoning data refers to the information obtained through the process of dead reckoning or inertial navigation. It involves estimating an object's current position, velocity, or orientation by using previously known data and incorporating measurements of acceleration, rotation, and time. Dead reckoning data is often used in navigation systems, robotics, and motion tracking, where the data helps estimate an object's motion when other external position sensors like GPS are unavailable or unreliable.
Edge computing is a distributed computing paradigm that brings computation and data storage closer to the edge of the network, closer to where data is generated or consumed, rather than relying solely on centralized cloud computing infrastructure. By processing and analyzing data locally at the edge devices or edge servers, edge computing reduces latency, optimizes bandwidth usage, enhances privacy and security, and enables offline operation. It is commonly used in applications such as IoT, real-time analytics, and autonomous systems where low latency and real-time processing are crucial.
Global map optimization, also known as map refinement or map fusion, refers to the process of improving the accuracy, consistency, and completeness of a map representation of the environment by incorporating information from multiple sensor sources or multiple observations over time. It involves optimizing the positions of landmarks or scene points and camera poses to minimize errors and inconsistencies. Global map optimization is commonly used in mapping, localization, and navigation systems to create and maintain accurate and reliable maps of the environment.
The Global Positioning System (GPS) signal refers to the radio frequency signals transmitted by GPS satellites that provide positioning, navigation, and timing information to GPS receivers on Earth. GPS signals allow receivers to calculate their precise location, velocity, and time synchronization. The GPS signal consists of components such as the navigation message, carrier wave, and spread spectrum signal. GPS signals are utilized in various applications, including navigation systems, surveying, aviation, and location-based services.
A GPS-degraded environment refers to a situation or location where GPS signals are compromised or significantly degraded, leading to challenges or limitations in accurate positioning and navigation using GPS receivers. This degradation can occur due to factors such as signal obstruction, multipath interference, signal jamming, or adverse atmospheric conditions. In a GPS-degraded environment, alternative positioning methods or technologies may be used to supplement or replace GPS for reliable positioning and navigation.
A GPS-denied environment refers to a situation or location where GPS signals are entirely unavailable or inaccessible. It can occur in environments such as underground structures, indoor areas, or areas where intentional signal jamming is present. In a GPS-denied environment, alternative positioning and navigation techniques, such as inertial navigation systems (INS), sensor fusion, or visual-based localization, may be used to estimate position and navigate accurately.
Human pose estimation (HPE) is a computer vision task that involves estimating the positions and orientations of human body joints or body parts from images or videos. The goal is to understand and analyze human movement and posture. HPE has applications in various fields, including action recognition, motion capture, human-computer interaction, virtual reality, and augmented reality.
An Inertial Measurement Unit (IMU) is an electronic sensor device that combines multiple sensors, typically including accelerometers, gyroscopes, and sometimes magnetometers, to measure the linear and angular motion of an object. IMUs are commonly used in robotics, navigation systems, virtual reality, and motion tracking applications. They provide information about an object's acceleration, rotation rate, and often orientation.
Keyframe selection is a process in computer vision and video processing where specific frames from a sequence of video frames are chosen as keyframes. Keyframes capture important or representative information of a video sequence while omitting redundant or less informative frames. Keyframe selection is commonly used in video compression, summarization, and streaming to reduce storage space, processing requirements, and transmission bandwidth. Keyframes are often selected based on criteria such as scene changes, content diversity, or visual saliency.
Key points or key point pairs are distinctive and identifiable points or features in an image that are robust to changes in scale, rotation, or viewpoint. They are often extracted using feature detection algorithms and can be used for tasks such as image matching, object recognition, or 3D reconstruction. Key points serve as reference points to establish correspondences between images or to track objects across frames.
Light detection and ranging (LIDAR) is a remote sensing technology that measures distances and generates 3D representations of objects or environments using laser light. LIDAR systems emit laser pulses and measure the time it takes for the reflected light to return, allowing the calculation of distances to objects. LIDAR is commonly used in applications such as mapping, autonomous vehicles, robotics, and environmental monitoring.
Object occlusion refers to a situation in computer vision where an object or part of an object is partially or entirely obscured by another object or occluding surface in the scene. Occlusion poses challenges in tasks such as object detection, tracking, and recognition since the occluded regions may not be visible or fully observed. Dealing with object occlusion requires techniques that can handle partial information or infer occluded regions based on context or prior knowledge.
An odometry sensor, also known as a motion sensor or wheel encoder, is a device used to measure the motion or displacement of a vehicle or robot. It typically involves measuring wheel rotations or changes in wheel speed to estimate the distance traveled. Odometry sensors are commonly used in robotics for estimating relative motion and position changes over short distances.
Optimization refers to the process of finding the best solution or set of parameters that minimize or maximize an objective function or satisfy certain constraints. In computer vision and robotics, optimization techniques are often used to refine models, align data, estimate parameters, or solve complex problems. Common optimization methods include gradient descent, least squares, non-linear optimization, and convex optimization.
Relocalization occurs when a system loses tracking (or is initialized in a new environment) and needs to assess its location based on currently observable features. If the system is able to match the features it observes against the available map, it will localize itself to the corresponding pose in the map and continue the SLAM process. Relocalization, also known as pose recovery, is the process of estimating the position or pose of a sensor (e.g., a camera or a robot) within a known map or reference frame. It involves matching observed sensor data, such as images or 3D point clouds, with features or landmarks in the map to determine the sensor's position and orientation accurately. Relocalization is important for robust and accurate localization in scenarios where a sensor's position is temporarily lost or needs to be re-established.
Rigid pose estimation (RPE) refers to the task of estimating the position and orientation (pose) of a rigid object, such as a rigid body or object instance, in a 3D space. It involves determining the translation and rotation parameters that align a model or template of the object with the observed data, such as images or point clouds. RPE is used in applications such as object tracking, augmented reality, robotics, and camera calibration.
Robot drift, also known as sensor drift or localization drift, refers to the cumulative error or deviation that occurs over time in the estimated position or pose of a robot or autonomous system. It can be caused by inaccuracies in sensor measurements, noise, systematic errors, or limitations of the localization algorithms. Robot drift can lead to inaccurate positioning, navigation, or mapping if not properly accounted for or corrected.
Simultaneous localization and mapping (SLAM) is a technique used in robotics and computer vision to create a map of an unknown environment while simultaneously estimating the robot's position within that map. SLAM algorithms leverage sensor measurements, such as odometry, GPS, LIDAR, or camera data, to iteratively build the map and refine the robot's localization. SLAM is commonly used in autonomous navigation, robotics, and augmented reality.
A sensor fusion model combines data from multiple sensors or sources to improve the accuracy, robustness, or completeness of the information obtained. In computer vision and robotics, sensor fusion models integrate measurements from different sensors, such as cameras, LIDAR, IMUs, or GPS, to obtain a more comprehensive understanding of the environment, object detection, tracking, or localization. Fusion techniques can include data association, filtering algorithms (e.g., Kalman filter), probabilistic methods, or deep learning approaches.
Visual simultaneous localization and mapping (vSLAM) modules are components or stages within a vSLAM system that work together to perform the tasks of mapping the environment and estimating the camera's pose (position and orientation) in real-time using visual information. The vSLAM modules include:
1. Initialization: The initialization module is responsible for setting up the vSLAM system at the beginning of the operation. It typically involves detecting or extracting visual features from the initial frames, estimating the camera's pose, and creating an initial map representation. Initialization provides the starting point for subsequent mapping and localization processes.
2. Local mapping: The local mapping module builds a local map of the environment based on the camera's motion and observed visual data. It extracts visual features from the current frame, estimates their 3D positions relative to the camera, and associates them with the existing map. Local mapping updates the map representation incrementally as new observations become available, allowing for continuous refinement of the map.
3. Loop closure: The loop closure module handles the detection and correction of errors or inconsistencies that may arise when revisiting previously seen parts of the environment. It identifies when the camera has returned to a previously visited location or scene, matches the current observations with previously stored keyframes or landmarks, and corrects accumulated errors. Loop closure helps maintain map consistency, improves localization accuracy, and reduces drift in the vSLAM system.
4. Relocalization: Relocalization is the process of re-establishing the camera's position and orientation within the existing map when it is temporarily lost or needs to recover after tracking failures. The relocalization module matches the current visual observations with features or landmarks in the map, allowing the system to determine the camera's pose relative to the map again. Relocalization enables robust and accurate localization even in challenging situations.
5. Tracking: The tracking module continuously estimates the camera's pose in real-time as it moves through the environment. It analyzes the incoming visual data, matches the current frame's features with those in the map, and updates the camera's pose accordingly. The tracking module is responsible for tracking the camera's motion, predicting its pose between frames, and providing a continuous estimate of the camera's position and orientation.
These vSLAM modules work collaboratively to enable real-time mapping and localization in dynamic environments using visual information from cameras or sensors. They provide the necessary functionalities for building and maintaining a map of the environment and accurately estimating the camera's pose during camera motion.
“VideoRay’s search-and-rescue/recovery system is remote-operated. The video-enabled, joystick-controlled vehicle is a versatile submersible that employs high-powered lights, multi-beam sonar imaging, GPS and metal gauges to help rescue and recovery missions. The NYC Harbor Unit has used it for cargo inspections and Bertram Yachts once turned it into a sport-fishing accessory. But it really shines in the hands of clients like the Sheriff’s Office of St. Louis County, Minnesota, whose rescue squad used the device to find a drowning victim who collapsed into an iced-over lake.”
The United States Navy Explosive Ordnance Disposal (EOD) Units have one of the most dangerous missions assigned to military operatives – locating and disposing of underwater explosives. Compared to torpedoes, small boat attacks and missiles, underwater mines have caused more than four times more damage to U.S. Navy ships. Crude but effective mines, Water Borne Improvised Explosive Devices (WBIEDs) and Underwater Hazardous Devices (UHDs) are cheap, easy to stockpile, and easily concealed in holds of ships and fishing boats. Previous methods for underwater EOD were time-consuming, extremely hazardous, and labor-intensive, often with little to no verification of mission success or completion.
— https://videoray.com/u-s-navy-eod/
The advantages of integrating vSLAM into VideoRay robots
Video - Stills - 3D: Simultaneously provides a low-latency video stream for GVI (General Visual Inspection) and piloting, high resolution stills images for inspection, and a real-time 3D model to evaluate survey coverage and data quality during acquisition.
Enhanced Situational Awareness: The vSLAM solution provides ROV pilots with a visualization of the ROV's position relative to the 3D environment, enabling contextual piloting and the ability to maintain a consistent speed and standoff distance.
Survey Quality Control: Visual 3D data is processed in real-time, directly validating that coverage and image quality are sufficient for accurate photogrammetry prior to survey completion.
NaviSuite Integration: Uniting Discovery’s cutting-edge imaging capabilities with EIVA’s proven NaviSuite software to incorporate visual data into existing survey workflows and tools.
More than a decade after the 2011 Fukushima nuclear disaster, much of the facility remained unexplored. Some of the doors to rooms in the facility had not been opened since the disaster, and officials had little idea what was on the other side. In 2022, decommissioning crews began using Spot, the quadruped robot from Boston Dynamics, to collect data, shoot video, measure radiation dose, and gather debris samples for radiation testing. Although officials were already using other tracked and wheeled robots, the superior mobility and the automated arm of Spot proved to be “game changers,” says Brad Bonn, head of nuclear programs for Spot.
— https://bostondynamics.com/case-studies/spot-in-fukushima-daiichi/
“The fire that devastated France's Notre-Dame Cathedral in 2019 was a monumental loss in many ways. But it could have been far worse if not for Shark Robotics’ Colossus. Armed with its WALL-E-like treads and the power to blast 660 gallons of water per minute, the 1,100-pound fireproof robot was summoned by the Paris Fire Brigade when conditions proved too treacherous for firefighters. The brigade’s commander, Jean-Claude Gallet, would later say that Colossus had saved the lives of his crew.
Aside from extinguishing fire, the joystick-controlled Colossus can also haul firefighting equipment, transport wounded victims and trigger its 360-degree, high-definition thermal camera to assess a scene.”
The original abstract:
This paper presents the Visual Simultaneous Localization and Mapping (vSLAMTM) algorithm, a novel algorithm for simultaneous localization and mapping (SLAM). The algorithm is vision- and odometry-based, and enables low-cost navigation in cluttered and populated environments. No initial map is required, and it satisfactorily handles dynamic changes in the environment, for example, lighting changes, moving objects and/or people. Typically, vSLAM recovers quickly from dramatic disturbances, such as “kidnapping”.
Visual SLAM systems are designed to map the environment around the sensors while simultaneously determining the precise location and orientation of those sensors within their surroundings. They rely entirely on visual data for estimating sensor motion and reconstructing environmental structures (Taketomi et al., 2017). This approach has attracted attention in the literature because it is cost-effective, easy to calibrate, and has low power consumption in monocular cameras while also allowing depth estimation and high accuracy in RGB-D and stereo cameras (Macario Barros et al., 2022; Abbad et al., 2023).
— A review of visual SLAM for robotics: evolution, properties, and future applications
Feature-based vSLAM is what our case study is referring to.
This method involves detecting and tracking distinct features in the environment, such as corners or edges, across multiple frames of video. Algorithms like ORB-SLAM and PTAM (Parallel Tracking and Mapping) fall under this category.
Within visual-only SLAM, there also exists:
Direct methods, which estimate motion from raw pixel intensities rather than from explicitly extracted features, and semi-direct (hybrid) methods that combine the two (see the latency/accuracy comparison quoted further below).
The initialization module is responsible for setting up the vSLAM system at the beginning of the operation. It typically involves detecting or extracting visual features from the initial frames, estimating the camera's pose, and creating an initial map representation. Initialization provides the starting point for subsequent mapping and localization processes.
The local mapping module builds a local map of the environment based on the camera's motion and observed visual data. It extracts visual features from the current frame, estimates their 3D positions relative to the camera, and associates them with the existing map. Local mapping updates the map representation incrementally as new observations become available, allowing for continuous refinement of the map.
The loop closure module handles the detection and correction of errors or inconsistencies that may arise when revisiting previously seen parts of the environment. It identifies when the camera has returned to a previously visited location or scene, matches the current observations with previously stored keyframes or landmarks, and corrects accumulated errors. Loop closure helps maintain map consistency, improves localization accuracy, and reduces drift in the vSLAM system.
Relocalization is the process of re-establishing the camera's position and orientation within the existing map when it is temporarily lost or needs to recover after tracking failures. The relocalization module matches the current visual observations with features or landmarks in the map, allowing the system to determine the camera's pose relative to the map again. Relocalization enables robust and accurate localization even in challenging situations.
The tracking module continuously estimates the camera's pose in real-time as it moves through the environment. It analyzes the incoming visual data, matches the current frame's features with those in the map, and updates the camera's pose accordingly. The tracking module is responsible for tracking the camera's motion, predicting its pose between frames, and providing a continuous estimate of the camera's position and orientation.
These vSLAM modules work collaboratively to enable real-time mapping and localization in dynamic environments using visual information from cameras or sensors. They provide the necessary functionalities for building and maintaining a map of the environment and accurately estimating the camera's pose during camera motion.
Feature extraction happens initially in the initialization module, then subsequently in each cycle of the local mapping module.
Camera distortions are corrected for, and the image is preprocessed to enhance feature extraction (e.g., noise reduction, sharpening, contrast increase).
Key points (features) are detected in the image.
This can be done using a deep learning approach, where objects (and their bounding boxes) are identified, or with traditional approaches such as corner and edge detection, which can be implemented with image-processing kernels (a minimal front-end sketch follows below).
The change in position of the extracted features over time is combined with the dead reckoning data from the IMU to determine the location of the robot within the environment.
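A minimal sketch of this visual front end is shown below, using OpenCV and assuming the camera intrinsics K and distortion coefficients dist are already known from calibration; the function names, parameter values, and calibration numbers are illustrative placeholders rather than the pipeline of any particular vSLAM system.

```python
import cv2
import numpy as np

# Illustrative calibration values; in practice K and dist come from camera calibration.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0,   0.0,   1.0]])
dist = np.array([-0.25, 0.07, 0.0, 0.0, 0.0])   # lens distortion coefficients

orb = cv2.ORB_create(nfeatures=1000)             # traditional (non-deep-learning) detector
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def extract_features(frame_bgr):
    """Undistort, preprocess, and detect ORB key points in one frame."""
    undistorted = cv2.undistort(frame_bgr, K, dist)
    gray = cv2.cvtColor(undistorted, cv2.COLOR_BGR2GRAY)
    gray = cv2.createCLAHE(clipLimit=2.0).apply(gray)   # local contrast enhancement
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors

def match_frames(desc_prev, desc_curr):
    """Match descriptors between consecutive frames (Hamming distance for ORB)."""
    return matcher.match(desc_prev, desc_curr)
```

The frame-to-frame displacement of the matched key points is what gets fused with the IMU's dead reckoning data in the next step.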
One big challenge occurs in this process, however:
As the camera moves through space, there is increasing noise and uncertainty in the relationship between the images the camera captures and its estimated motion.
Kalman filters reduce the effects of noise and uncertainty across different measurements by continually making predictions and then updating and refining the model against the observed measurements, allowing a linear system to be estimated more accurately.
For SLAM systems, we typically use the extended Kalman filter (EKF), which handles nonlinear systems by linearizing the motion and measurement models around the current mean estimate.
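A minimal, generic sketch of the EKF predict/update cycle is given below. The motion model f, measurement model h, and their Jacobians F and H are placeholders that would be specific to the robot's state representation and sensors, so this illustrates the structure of the filter rather than any particular vSLAM implementation.

```python
import numpy as np

def ekf_predict(x, P, f, F, Q):
    """Propagate the state x and covariance P through the (nonlinear) motion model f.
    F is the Jacobian of f evaluated at x; Q is the process noise covariance."""
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, h, H, R):
    """Correct the prediction with measurement z using the measurement model h.
    H is the Jacobian of h at x_pred; R is the measurement noise covariance."""
    y = z - h(x_pred)                    # innovation: observed minus predicted
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new
```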
Loop closure: When the localization algorithms detect features (and groups of features) that have already been seen, or when the internal map of the robot indicates that it is in a previously traversed region, the map can be refined to improve the accuracy of the robot’s navigation.
Relocalization: When the robot loses tracking (loses track of its position in the environment), it tries to match the features seen by the camera with its internal map. If it recognizes certain groups of features, it can place itself back into the map, and continue tracking.
Bundle adjustment: By comparing the projected point of a feature (based on the internal map/world state) with the actual location of that feature (based on input data from the sensors), the reprojection error can be calculated. Using an optimization algorithm, the parameters of the internal map can then be adjusted to minimize this reprojection error.
“The reprojection error is a geometric error corresponding to the image distance between a projected point and a measured one. It is used to quantify how closely an estimate of a 3D point recreates the point's true projection.”
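As an illustration of the quantity being minimized, the sketch below evaluates the reprojection error for one camera view using OpenCV's pinhole projection; the variable names are assumptions, and a real bundle adjustment jointly optimizes many poses and points with a sparse least-squares solver rather than evaluating a single error value.

```python
import cv2
import numpy as np

def reprojection_error(points_3d, observed_2d, rvec, tvec, K, dist):
    """Sum of squared distances between observed key points and the projections
    of their estimated 3D positions: the objective bundle adjustment minimizes."""
    projected, _ = cv2.projectPoints(points_3d, rvec, tvec, K, dist)
    residuals = projected.reshape(-1, 2) - observed_2d
    return float(np.sum(residuals ** 2))
```

Bundle adjustment then searches over the camera poses (rvec, tvec) and the 3D points to drive this error down, for example with scipy.optimize.least_squares applied to the stacked residuals.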
Bundle adjustment is used in the following contexts:
Locally (local bundle adjustment): the robot optimizes a subset of the most recent frames and points to quickly improve local accuracy without excessive computation.
Globally (global bundle adjustment): occasionally run to refine the entire map and trajectory, ensuring consistency and accuracy over long sequences.
Keyframes are select observations by the camera that capture a “good” representation of the environment. Some approaches perform a bundle adjustment after every keyframe. Filtering becomes extremely computationally expensive as the map model grows; keyframes, by contrast, enable more feature points or larger maps with a balanced tradeoff between accuracy and efficiency.
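A sketch of one common style of keyframe test is shown below; the criteria and thresholds are purely illustrative, since each system (e.g., ORB-SLAM, PTAM) defines its own rules.

```python
def is_new_keyframe(n_tracked, n_tracked_in_last_kf, frames_since_last_kf,
                    min_ratio=0.7, max_gap=20):
    """Illustrative keyframe test: add a keyframe when tracking quality has dropped
    relative to the last keyframe, or when too many frames have passed."""
    if frames_since_last_kf >= max_gap:
        return True   # keep the map temporally dense enough
    if n_tracked < min_ratio * n_tracked_in_last_kf:
        return True   # the view has changed: few features shared with the last keyframe
    return False
```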
Analysis of state-of-the-art visual odometry/visual simultaneous localization and mapping (VSLAM) system exposes a gap in balancing performance (accuracy and robustness) and efficiency (latency). Feature-based systems exhibit good performance, yet have higher latency due to explicit data association; direct and semidirect systems have lower latency, but are inapplicable in some target scenarios or exhibit lower accuracy than feature-based ones.
— Zhao et al., Semantic Scholar reader
Ideally, we want vSLAM to work as fast as possible, and as accurately as possible.
(these are more about the processing/overall implementation of vSLAM rather than the specific sensor paradigm/configuration)
Having a specialized processor dedicated to performing vSLAM allows the system to run faster and more efficiently, since the hardware is designed specifically for the vSLAM workload.
“This letter introduces a dedicated processor architecture, called MEGACORE, which leverages vector technology to enhance tracking performance in visual simultaneous localization and mapping (VSLAM) systems. By harnessing the inherent parallelism of vector processing and incorporating a floating point unit (FPU), MEGACORE achieves significant acceleration in the tracking task of VSLAM. Through careful optimizations, we achieved notable improvements compared to the baseline design.”
— Li et al., https://doi.org/10.1109/LES.2023.3298900
Having a lightweight, on-device, self-supervised network that acts as an adversary to the vSLAM algorithm
“This article proposes a quantized self-supervised local feature for the indirect VSLAM to handle the environmental interference in robot localization tasks. A joint feature detection and description network is built in a lightweight manner to extract local features in real time. The network is iteratively trained by a self-supervised learning strategy, and the extracted local features are quantized by an orthogonal transformation for efficiency.”
— Li et al., https://doi.org/10.1109/TMECH.2021.3085326 (a different Li)
Applying the concept of edge computing
Distributing the computation of vSLAM across multiple processors or even across devices in a network can reduce latency and increase the system's scalability.
This approach is particularly useful in collaborative robotic systems, e.g., a fleet of rescue robots mapping and working in the same environment.
Integrating multiple types of sensors can significantly enhance the robustness and accuracy of vSLAM systems. Commonly, vSLAM systems are augmented with inertial measurement units (IMUs), depth sensors, or LiDAR to provide additional data points that compensate for the limitations of visual data alone.
“VI-SLAM is a technique that combines the capabilities of visual sensors, such as stereo cameras, and inertial measurement sensors (IMUs) to achieve its SLAM objectives and operations (Servières et al., 2021; Leut et al., 2015). This hybrid approach allows a comprehensive modeling of the environment, where robots operate (Zhang et al., 2023). It can be applied to various real-world applications, such as drones and mobile robotics (Taketomi et al., 2017). The integration of IMU data enhances and augments the information available for environment modeling, resulting in improved accuracy and reduced errors within the system’s functioning (Macario Barros et al., 2022; Mur-Artal and Tardós 2017b).”
But in what cases may other approaches perform better than visual-inertial SLAM?
In terms of choosing the approach, it really depends:
Specific application needs: Different applications prioritize different aspects like accuracy, real-time performance, computational cost, and robustness.
Environmental conditions: Different environments (indoor vs. outdoor, structured vs. unstructured) may favor different types of sensors and algorithms.
In environments with poor lighting or highly dynamic scenes, LiDAR SLAM might outperform VI-SLAM due to its ability to provide accurate distance measurements regardless of lighting conditions.
Beyond visual and inertial sensors, integrating additional modalities like GPS, odometry, or depth sensors can provide complementary data that helps in scenarios where VI-SLAM might struggle, such as featureless or highly dynamic environments.
For instance, RGB-D SLAM
This integrates RGB-D cameras with depth sensors to estimate and build models of the environment (Ji et al., 2021; Macario Barros et al., 2022). This technique has found applications in various domains, including robotic navigation and perception (Luo et al., 2021). It demonstrates efficient performance, particularly in well-lit indoor environments, providing valuable insights into the spatial landscape (Dai et al., 2021).
The incorporation of RGB-D cameras and depth sensors enables the system to capture both color and depth information simultaneously. This capability is advantageous in indoor applications, addressing the challenge of dense reconstruction in areas with low-textured surfaces (Zhang et al., 2021b).
HPE is crucial for rescue robots, and can help determine whether victims are in need of immediate assistance. Certain poses can be compared with certain conditions (a minimal rule-based sketch follows this list):
if someone is walking, they are likely fine, whereas if someone is lying or struggling on the ground, they are more likely to need assistance.
The detection of arm waving or other physical distress signals helps rescuers pinpoint those who need help.
Joint position error analysis from HPE data can help rescuers pinpoint the nature of trauma.
if a joint is severely out of place and not positioned normally based on degrees of freedom, broken or dislocated limbs can be identified.
Rescue robots can navigate through debris or confined spaces without causing harm to (stepping on) victims.
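A minimal rule-based sketch of this kind of triage logic is given below, assuming 2D key points in the COCO ordering and image coordinates where y grows downward; the rules and indices are illustrative assumptions, not a validated triage procedure.

```python
import numpy as np

# Assumed COCO key point indices: 0 nose, 5/6 shoulders, 9/10 wrists, 11/12 hips.
def flag_for_assistance(keypoints_xy):
    """Very rough heuristic over a single 2D pose estimate."""
    kp = np.asarray(keypoints_xy, dtype=float)
    shoulders = kp[[5, 6]].mean(axis=0)
    hips = kp[[11, 12]].mean(axis=0)
    torso = hips - shoulders
    lying_down = abs(torso[1]) < abs(torso[0])          # torso closer to horizontal than vertical
    waving = bool((kp[[9, 10], 1] < kp[0, 1]).any())    # a wrist raised above the head
    return lying_down or waving
```

In practice such rules would be combined with temporal information (how the pose changes over several frames) before flagging anyone.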
2D: This involves estimating the positions of various body parts in two dimensions. This is often done by predicting the positions of key points, or "joints," like the elbows or knees, in the image.
3D: This is a more challenging task that involves estimating the positions of body parts in three dimensions. It not only requires understanding where the body parts are in an image, but also how far they are from the camera.
Bottom-up vs. top-down HPE
(a) — top down, (b) — bottom up
Modern approaches to HPE are based on neural networks, which are trained (taught to recognize humans) based on labeled data.
The poses of occluded components are estimated based on the position of visible limbs (key points), edge lengths (for 3D HPE), and temporal convolution (extrapolation of previous and future positions when the limbs are visible).
Occlusion leads to distorted visual data, making it difficult for the HPE algorithms to accurately identify and interpret human poses. HPE algorithms must be designed with occlusion in mind.
When traditional HPE methods are limited by occlusion and in crowd scenes, deep-learning-based methods can estimate/correct the estimated poses based on temporal and other adjacent information.
(top-down approach)
(bottom-up approach)
Using multiple datasets to train deep-learning-based approaches can also bring advantages (but also disadvantages) over training on a single dataset.
There is no doubt that deep-learning-based approaches overall are significantly more robust than traditional methods. Within DL, however, there are multiple pathways:
While CNNs are commonly used in HPE studies for their effectiveness in implicit feature extraction from images, a few studies have explored other deep learning methods such as GANs, GNNs, and RNNs. The relative performance of these methods is unclear and warrants further research.
Many studies have found that detection-based approaches outperform regression-based approaches for estimating single poses. Recently, Gu et al. [162] analyzed these two approaches to determine why detection-based methods are superior to regression-based methods. They ultimately proposed a technique that showed regression-based approaches could outperform detection-based approaches, especially when facing complex problems. Further study of this work may open new directions for estimating single-person poses.
Multipose estimation has been significantly harder due to occlusion and varying sizes, but finding the best solutions remains an open problem:
Optical flow has been used by some studies to track motion in videos. However, it is easily affected by noise and can have difficulty tracking human motion in noisy environments. To improve performance, a few works have replaced optical flow with other techniques, such as RNNs or temporal consistency.
Many studies use post-processing steps such as search algorithms or graphical models to group predicted key points into individual humans in bottom-up approaches. However, some recent works have incorporated graphical information into neural networks to make the training process differentiable.
Improving the efficiency of HPE tasks is not limited to enhancing models; dataset labels also play a significant role. In addition to keypoint position labels, only a few datasets provide additional labels, such as visibility of body joints, that can help address the challenge of occlusions.
As occlusion is one of the main challenges in 2D HPE, researchers need to increase the number of occluded labels in datasets. Unsupervised/semi-supervised and data augmentation methods are currently used to address this limitation;
Another challenge in 2D HPE is crowded scenes. Only a few datasets provide data with crowded scenarios (e.g., CrowdPose and COCO), and their data consist only of images. Recently, a dataset called HAJJv2 [163] was introduced that provides more than 290,000 videos for detecting abnormal behaviors during Hajj religious events. The data in this dataset are diverse in terms of race, as many people from all over the world [164,165] perform Hajj rituals. They also have a large crowd scale, providing nine classes with normal and abnormal behaviors for each category. This dataset may help train 2D HPE models.
For example, models such as YOLO and OpenPose are used for detecting and estimating poses to identify suspicious behavior during Hajj events. However, these models still face challenges in handling large numbers of poses in real-time. Developing methods to address this problem remains an open challenge.
— Samkari et al., https://doi.org/10.3390/make5040081
Recall that this challenge mentions a scenario where “rubble is still shifting after an earthquake.”
While building maps when robot poses are known is a tractable problem requiring limited computational complexity, the simultaneous estimation of the trajectory and the map of the environment (known as SLAM) is much more complex and requires many computational resources.
Moreover, SLAM is generally performed in environments that do not vary over time (called static environments), whereas real applications commonly require navigation services in changing environments (called dynamic environments).
Many real robotic applications require updated maps of the environment that vary over time, starting from a given known initial condition.
In this context, classical SLAM approaches are generally not directly applicable: such approaches only apply in static environments or in dynamic environments where it is possible to model the environment dynamics. We are interested here in long-term mapping operativity in presence of variations in the map, as in the case of robotic applications in logistic spaces, where rovers have to track the presence of goods in given areas.
— https://ieeexplore.ieee.org/document/5756810
“Dynamic objects such as people and cars are often unavoidable in scenarios such as classrooms, hospitals, and outdoor shopping places. Those vSLAM systems built on a static environment have poor adaptability to dynamic and complex scenes, leading to substantial errors in the obtained map points and pose matrix (Cheng et al., 2019). Indirectly, it will cause problems such as drift of virtual objects registered in the world coordinate system.”
— https://www.frontiersin.org/articles/10.3389/fnbot.2022.990453/full
One solution is (in feature-based vSLAM) to remove any features that are moving (dynamic). Thus, the goal is to figure out what part of a frame is static and what is dynamic, so we can treat them differently.
Using HPE to perform semantic segmentation of dynamic vs static objects in a frame, the vSLAM system can be told to “ignore” any moving objects like humans.
This makes the mapping itself more accurate, as we are only mapping the static world.
→ This seems to be a common solution: simply knowing what is dynamic and what is static is helpful to the vSLAM and HPE algorithms and to the human controller. (For instance, a heatmap of dynamic elements could be useful to quickly identify humans, in addition to HPE.)
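A hedged sketch of this idea is shown below, assuming a per-frame person mask is already available from the HPE/segmentation stage (how that mask is produced is outside the sketch); key points are then only detected on the static parts of the image.

```python
import cv2

def static_keypoints(gray, person_mask, orb):
    """Detect ORB key points only outside the segmented person/dynamic regions.
    person_mask: uint8 image, 255 where a person or dynamic object was segmented."""
    static_mask = cv2.bitwise_not(person_mask)        # 255 marks the static region
    keypoints, descriptors = orb.detectAndCompute(gray, static_mask)
    return keypoints, descriptors
```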
“The optimized vSLAM algorithm adds the modules of dynamic region detection and semantic segmentation to ORB-SLAM2. First, a dynamic region detection module is added to the vision odometry. The dynamic region of the image is detected by combining single response matrix and dense optical flow method to improve the accuracy of pose estimation in dynamic environment.”
— Wei et al., A Semantic Information-Based Optimized vSLAM in Indoor Dynamic Environments
“Moreover, a new dynamic feature detection method called semantic and geometric constraints was proposed, which provided a robust and fast way to filter dynamic features. The semantic bounding box generated by YOLO v3 (You Only Look Once, v3) was used to calculate a more accurate fundamental matrix between adjacent frames, which was then used to filter all of the truly dynamic features.”
— Yang et al., https://doi.org/10.3390/s20082432
Rather than completely removing them, we can also weight them differently so that the algorithm knows to pay less attention to moving features.
“A robust visual SLAM system that utilizes weighted features, named WF-SLAM, is proposed in this paper, which is based on ORB-SLAM2, and significantly decreases mismatch and improves the accuracy of localization.”
— Zhong et al., https://doi.org/10.1109/jsen.2022.3169340
Rather than weighting features, we can also weight the data from the different sensors.
The system can assign different weights to the sensor measurements based on their reliability and relevance in the current environment. For example, in areas with high dynamics, the system can give more weight to LiDAR or radar data compared to visual features.
Similarly, the criteria for selecting keyframes can also be adapted based on the level of dynamics in the environment. In highly dynamic scenes, more frequent keyframe updates might be necessary to capture the changing structure of the rubble.
Map management: We can divide the map into chunks, and each time we update the map we can update only the affected chunk rather than the entire map. This allows us to have a full representation of the environment while adjusting to dynamic changes.
In the occupancy grid method, each cell in the occupancy grid map is assigned a probability of being occupied or free. As new sensor data arrives, the probabilities are updated using Bayesian inference, allowing the map to adapt to the changing environment. This is an example of a probabilistic mapping technique, which represents the uncertainty and dynamics of the environment.
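A minimal log-odds sketch of this kind of Bayesian occupancy update is given below; the inverse sensor model values are illustrative assumptions.

```python
import numpy as np

class OccupancyGrid:
    """Minimal log-odds occupancy grid: each cell stores log(p / (1 - p))."""
    def __init__(self, shape, l_occ=0.85, l_free=-0.4):
        self.logodds = np.zeros(shape)
        self.l_occ, self.l_free = l_occ, l_free   # illustrative inverse sensor model

    def update(self, hit_cells, free_cells):
        """Bayesian update from one scan: cells with returns become more likely
        occupied, traversed cells more likely free."""
        for r, c in hit_cells:
            self.logodds[r, c] += self.l_occ
        for r, c in free_cells:
            self.logodds[r, c] += self.l_free

    def probabilities(self):
        """Convert log-odds back to occupancy probabilities."""
        return 1.0 - 1.0 / (1.0 + np.exp(self.logodds))
```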
“To avoid a continuous re-mapping, the map can be updated to obtain a consistent representation of the current environment. In this paper, we propose a novel LIDAR-based occupancy grid map updating algorithm for dynamic environments. The proposed approach allows robust long-term operations as it can detect changes in the working area even in presence of moving elements.”
— Stefanini et al., doi:10.5281/zenodo.7531326
“This work presents a semantic map management approach for various environments by triggering multiple maps with different simultaneous localization and mapping (SLAM) configurations. A modular map structure allows to add, modify or delete maps without influencing other maps of different areas. The hierarchy level of our algorithm is above the utilized SLAM method. Evaluating laser scan data (e.g. the detection of passing a doorway) triggers a new map, automatically choosing the appropriate SLAM configuration from a manually predefined list. Single independent maps are connected by link-points, which are located in an overlapping zone of both maps, enabling global navigation over several maps. Loop-closures between maps are detected by an appearance-based method, using feature matching and iterative closest point (ICP) registration between point clouds.”
— Ehlers et al., doi: 10.1109/ICRA40945.2020.9196997.
Instead of relying on a single, static map, the vSLAM system can continuously update the map based on the latest sensor data. This approach involves detecting changes in the environment and incrementally updating the affected regions of the map.
By comparing the current sensor data with the existing map, the system can identify regions where significant changes have occurred. Techniques like point cloud registration, occupancy grid comparison, or appearance-based methods can be used for change detection.
Once the changed regions are identified, the map can be incrementally updated by incorporating the new data and discarding the outdated information. This process helps maintain a more accurate representation of the current state of the environment.
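As an illustration of occupancy-grid comparison for change detection, the sketch below flags cells whose occupancy probability has shifted by more than a threshold and overwrites only those regions; the threshold value is an assumption.

```python
import numpy as np

def changed_cells(stored_probs, current_probs, threshold=0.3):
    """Boolean mask of grid cells whose occupancy probability changed significantly."""
    return np.abs(current_probs - stored_probs) > threshold

def incremental_update(stored_probs, current_probs, threshold=0.3):
    """Overwrite only the changed regions, keeping the rest of the map intact."""
    updated = stored_probs.copy()
    mask = changed_cells(stored_probs, current_probs, threshold)
    updated[mask] = current_probs[mask]
    return updated
```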
“Qualitative text analysis [of the existing body of literature on rescue robot ethical concerns] identified seven core ethically relevant themes: fairness and discrimination; false or excessive expectations; labor replacement; privacy; responsibility; safety; and trust.”
Discrimination was looked at mostly in terms of disaster victims and, in one paper, as relating to rescue operators. As Amigoni and Schiafonati point out:
Hazards and benefits should be fairly distributed (…) to avoid the possibility of some subjects incurring only costs while other subjects enjoy only benefits. This condition is particularly critical for search and rescue robot systems, e.g., when a robot makes decisions about prioritizing the order in which the detected victims are reported to the human rescuers or about which detected victim it should try to transport first (Amigoni & Schiafonati, 2018).
Stakeholders are generally unable to make sound assessments about the capabilities and limitations of rescue robots. This inability can lead stakeholders to overestimate or underestimate the capabilities of rescue robots.
In the first case, this may translate into unjustified reliance on their performance, and thus, for example, into false hopes that the robots may save certain victims, or into their deployment for tasks for which they are not suitable or under inappropriate conditions.
In the second case, when robots' capabilities are underestimated, they may be underutilized, leading to a waste of precious resources (Harbers et al., 2017).
Stakeholders predict that rescue robots will likely replace human operators in the most physically challenging or high-risk rescue missions. Researchers express concerns that replacing humans with robots may result in degraded performance concerning victim contact, situation awareness, manipulation capabilities, etc., pointing out that robot-mediated contact with victims may interfere with medical personnel's ability to perform triage or provide medical advice or support.
The use of robots generally leads to an increase in information gathering, which can jeopardize the privacy of personal information. This may be personal information about rescue workers, such as images or data about their physical and mental stress levels, but also about victims or people living or working in the disaster area. Harbers et al. add that the loss of privacy potentially associated with the deployment of robots in disaster scenarios does not necessarily result in an ethical dilemma: indeed, given the critical nature of search and rescue operations, the benefits of collecting information in such settings largely outweigh any harms it may cause. This will require, however, that the information gathered by the robots is not shared with anyone outside professional rescue organizations and is exclusively used for rescue purposes.
In the paper by Tanzi et al. issues of responsibility are viewed as associated with liability in the event of technical failures or accidents and injuries to victims (Tanzi et al., 2015). Harbers and colleagues instead focus on responsibility assignment problems, which, they say, can apply to both moral and legal responsibility, where moral responsibility concerns blame and legal responsibility, instead, concerns accountability.
Such problems, according to the authors, can arise when robots act with no human supervision. If a robot malfunctions, behaves incorrectly, makes a mistake or causes harm, it may be unclear who is responsible for the damage caused: the operator, the software developer, the manufacturer or the robot itself. Responsibility assignment problems, they continue, become particularly complicated when the robot has some degree of autonomy, self-learning capabilities or is capable of making choices that were not explicitly programmed (Harbers et al., 2017).
Harbers and colleagues acknowledge that although attention to safety is clearly one of the key priorities that need to be taken into account when deploying rescue robots, this priority will often have to be balanced against other values, as rescue missions necessarily involve safety risks. Certain of these risks can be mitigated by replacing operators with robots, but robots themselves, in turn, may introduce other safety risks, mainly because they can malfunction. Even when they perform correctly, robots can still be harmful: they may, for instance, fail to identify a human being and collide with them. In addition, robots can hinder the well-being of victims in subtler ways. For example, the authors argue that being trapped under a collapsed building, wounded and lost, and suddenly being confronted with a robot, especially if there are no humans around, can in itself be a shocking experience (Harbers et al., 2017).
Focusing specifically on the use of UAVs, Tanzi et al. also emphasize the risks associated with collisions and accidents, pointing out that even high-end military drones like the Predator crash with some frequency, although injuries are rare, and that in urban environments, small UAVs can still cause injury or property damage (Tanzi et al., 2015).
The question of trust in autonomous systems is the focus of one of the papers identified by our review. In his paper, Stormont highlights how trust by an agent in another agent requires two beliefs: that an agent that can perform a task to help another achieve a goal has a) the ability to perform the task and b) the desire to perform it (Stormont, 2008).
He then points out that two main components of trust have been identified in the literature: confidence and reputation.
Stormont claims that autonomous systems and robots in general tend not to have a good reputation, so trust in them must rest largely on confidence. In the author's view, humans lack confidence in autonomous robots because they are unpredictable.
Humans working together are generally able to anticipate each other's actions in a wide range of circumstances, especially if they have trained together, as is the case in rescue crews. Autonomous systems, instead, often surprise even those who designed them, and such unpredictability can be both concerning and unwelcome in dangerous situations like those that are typical of disaster scenarios.
In 2002, Gianmarco Veruggio coined the term ‘roboethics’, which establishes ethical standards for the design, production, and use of robots. It is important to bring both human (programmers, designers, or users) and robot behaviour into regulations so that both can be controlled by law and code. Leenes et al. [28] distinguished four categories of regulation by code or law:
Regulating robot design, production through law.
Regulating user behaviour through the robot’s design.
Regulating the effects of robot behaviour through law.
Regulating robot behaviour through code.
Roboethics is distinct from “machine ethics”, which would require robots themselves to follow ethical guidelines and is still in the theoretical stage because autonomous robots are not yet capable of making moral judgements [27]. This is even more relevant for robotics in SAR operations. For instance, robots performing first aid without sufficient competence may cause additional problems, such as injuring people, because of their moral incapability and uncertain nature. SAR robotics is highly human-centric; the mismatch between robot and human cognitive abilities is often the limiting factor in SAR, regardless of the technical capabilities of robots in terms of locomotion, communication, and sensing [29]. Some studies in this field have discussed ethical issues.
— Chitikena et al., https://doi.org/10.3390/app13031800
“As robots are becoming increasingly human-like, this issue will continue to gain importance over time. The following question thus emerges: Should we act in order to maintain human life as the most valuable from the legal perspective? For example, if we accept that human life should always be at the top of hierarchies of value, perhaps manufacturers should be forced to mark robots such that they can be easily differentiated from humans in emergencies. In unforeseen traffic accidents, drivers only have seconds to decide what to do and what they can avoid. Robot drivers and human drivers should know that robots should be sacrificed in collisions involving both humans and robots. From another perspective, we should ask whether robots have any properties that make them equal to humans with regard to legal protections, such as a human-like intelligence, and whether we could in fact decide that robots should be granted more protection than humans.”
— Whether to Save a Robot or a Human: On the Ethical and Legal Limits of Protections for Robots
The iRobotSurgeon survey aimed to explore public opinion towards the issue of liability with robotic surgical systems. The survey included five hypothetical scenarios where a patient comes to harm and the respondent needs to determine who they believe is most responsible: the surgeon, the robot manufacturer, the hospital, or another party.
A total of 2,191 completed surveys were gathered evaluating 10,955 individual scenario responses from 78 countries spanning 6 continents. The survey demonstrated a pattern in which participants were sensitive to shifts from fully surgeon-controlled scenarios to scenarios in which robotic systems played a larger role in decision-making such that surgeons were blamed less.
However, there was a limit to this shift, with human surgeons still being ascribed blame in scenarios of autonomous robotic systems where humans had no role in decision-making. Importantly, there was no clear consensus among respondents on where to allocate blame in the case of harm occurring from a fully autonomous system.
The iRobotSurgeon Survey demonstrated a dilemma among respondents on who to blame when harm is caused by a fully autonomous surgical robotic system. Importantly, it also showed that the surgeon is ascribed blame even when they have had no role in decision-making which adds weight to concerns that human operators could act as “moral crumple zones” and bear the brunt of legal responsibility when a complex autonomous system causes harm.
— Autonomous surgical robotic systems and the liability dilemma