Object detection from remote sensing images

Multi-view observation of strawberry fruits from ground-based images: objects that are invisible in the orthophoto are visible from oblique views

Object detection in remote sensing images is one of the most critical computer vision tasks for various earth observation applications. Previous studies applied object detection models to orthomosaic images generated from SfM (Structure-from-Motion) analysis to perform object detection and counting. However, some small objects that are occluded in the vertical view but observable from oblique views in the raw images cannot be detected in the orthomosaic image, an occlusion issue that the traditional orthophoto-based approach cannot resolve. Taking strawberry detection as a case study, the objective of this study is to detect small objects directly from multi-view raw images. First, an object detection model (Faster R-CNN in this study) was applied to each raw image to identify strawberry fruit and flower objects. Each unique strawberry object on the ground can be detected multiple times across the raw images because the images have forward and side overlap. To find the unique objects among the detections from the first step, an improved FaceNet model was proposed that combines image and position information to calculate the feature distance between detected objects, and a clustering algorithm was then used to group the detections belonging to each unique strawberry based on the distances output by the FaceNet model, from which the final positions and counts of strawberry fruits and flowers were obtained. Compared with the orthomosaic image alone, this approach using multi-view images effectively solved the occlusion problem and improved the overall recognition accuracy of strawberry flowers, unripe fruits, and ripe fruits from 76.28% to 96.98%, 71.64% to 99.09%, and 69.81% to 97.17%, respectively, highlighting the potential of multi-view stereovision (MVS) in small object detection.
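As a minimal sketch of the per-image detection step, the snippet below runs an off-the-shelf torchvision Faster R-CNN on a single raw image; the checkpoint file name, class list, and score threshold are illustrative assumptions rather than the configuration used in the study.

    # Sketch of the per-image detection step using torchvision's Faster R-CNN.
    # The weights file, class names, and score threshold are illustrative assumptions.
    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    CLASSES = ["background", "flower", "unripe_fruit", "ripe_fruit"]  # assumed label set

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=len(CLASSES))
    model.load_state_dict(torch.load("strawberry_frcnn.pth"))  # hypothetical fine-tuned weights
    model.eval()

    def detect(image_path, score_thresh=0.5):
        """Return (boxes, labels, scores) kept above the score threshold for one raw image."""
        img = to_tensor(Image.open(image_path).convert("RGB"))
        with torch.no_grad():
            out = model([img])[0]
        keep = out["scores"] >= score_thresh
        return out["boxes"][keep], out["labels"][keep], out["scores"][keep]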

The anchor and positive image pair belong to the same identity, while the anchor and negative image come from different identities. The triplet loss used to train the FaceNet embedding aims to minimize the distance between the anchor and the positive image and maximize the distance between the anchor and the negative image. The length of the black arrow indicates the distance.
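For reference, the triplet loss illustrated here takes the standard FaceNet form, where f is the embedding function, alpha is the margin, and a, p, n index the anchor, positive, and negative images:

    L = \sum_{i} \max\Big( \lVert f(x_i^{a}) - f(x_i^{p}) \rVert_2^2 - \lVert f(x_i^{a}) - f(x_i^{n}) \rVert_2^2 + \alpha,\ 0 \Big)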
Examples of strawberries obscured by leaves in the orthomosaic image but visible in raw images: flowers with red bounding boxes, unripe fruits with yellow bounding boxes, and ripe fruits with blue bounding boxes. Nine samples are displayed. For each sample, the unlabeled image on the left is cropped from the orthomosaic image, and the image on the right is cropped from a raw image.

Faster R-CNN was first applied to each individual raw image to obtain the location information and image patches of strawberry fruits and flowers. The duplicate strawberries detected across the multi-view images were then projected onto the ground using a photogrammetric model to form object (i.e., fruit or flower) clusters, followed by clustering algorithms to find the final position and number of unique strawberry fruits and flowers. We used FaceNet to map the detected strawberries to feature embeddings in which duplicate strawberries lie a small distance apart and non-duplicates lie a larger distance apart.
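A sketch of the duplicate-grouping step, assuming the FaceNet embeddings and projected ground positions are already available, is shown below; the distance-fusion weights and the eps value are illustrative assumptions, and DBSCAN stands in for whichever clustering algorithm the pipeline actually uses.

    # Sketch of the duplicate-grouping step: cluster multi-view detections of the
    # same physical strawberry using a pairwise distance matrix (assumed here to
    # combine FaceNet embedding distance and projected ground distance).
    import numpy as np
    from sklearn.cluster import DBSCAN

    def group_detections(embeddings, ground_xy, w_img=1.0, w_pos=1.0, eps=0.8):
        """embeddings: (N, D) FaceNet features; ground_xy: (N, 2) projected positions (m).
        Returns one cluster label per detection; each cluster = one unique strawberry.
        The weights, eps, and the fusion rule are illustrative assumptions."""
        feat_d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
        pos_d = np.linalg.norm(ground_xy[:, None, :] - ground_xy[None, :, :], axis=-1)
        dist = w_img * feat_d + w_pos * pos_d
        labels = DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(dist)
        return labels  # number of unique strawberries = len(set(labels))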

Landcover mapping using drones

Multi-view images of a study site in the southeastern United States. The study site is a wetland area with a variety of land covers, including invasive vegetation, such as Cogongrass, and native vegetation

The methods used to analyze sUAS images have not changed significantly to accommodate the wide adoption of these images in the natural resource management field. Currently, traditional pixel-based and object-based classification of an orthoimage produced through photogrammetric processing of hundreds or thousands of sUAS images is still the most common approach to sUAS image classification. Images captured by sUAS differ from those captured by other remote sensing platforms: compared to satellite or piloted-aircraft images, they tend to have a smaller extent, higher spatial resolution, and large image-to-image overlap with varying object-sensor geometry. Unlike satellite acquisitions, a typical sUAS mission captures many overlapping images within a very short time from different viewing angles, potentially facilitating a way to study the bi-directional reflectance distribution function (BRDF) of the land cover. Nevertheless, taking advantage of this redundancy in image classification is in itself an important opportunity to explore.

With the rapid evolution of deep learning classifiers and the increased availability of computing power (e.g., GPUs and cloud computing), deep learning has become one of the most active topics in the field of sUAS image classification. This is motivated not only by its successful performance in computer vision, but also by its operational advantages over traditional classifiers. For example, deep learning classifiers do not require manual extraction of features, whereas manually selecting appropriate features is important for traditional classifiers to achieve good performance. Deep learning classifiers, however, are not without shortcomings. Generally, they require large amounts of training data and a computationally intensive training process.

Maps generated for the (a) study site using (b) Ortho-OBIA-12st and (c) MV-OBIA-12st. For the area severely impacted by invasive vegetation (Cogongrass), the corresponding maps generated by Ortho-OBIA-12st and MV-OBIA-12st are shown in the second and third columns, respectively.

We introduce a new OBIA approach that utilizes the multi-view information in the original UAS images and compare its performance with that of traditional OBIA, which uses only the orthophoto (Ortho-OBIA). The proposed approach, called multi-view object-based image analysis (MV-OBIA), classifies the multi-view object instances on the UAS images corresponding to each orthophoto object and utilizes a voting procedure to assign a final label to the orthophoto object. The proposed MV-OBIA is also compared with classification approaches based on bidirectional reflectance distribution function (BRDF) simulation. Finally, to reduce the computational burden of multi-view object-based data generation for MV-OBIA and make the proposed approach more operational in practice, this study proposes two window-based implementations of MV-OBIA that extract features from a window positioned at the geometric centroid of each object instance instead of from the object instance itself. The first window-based MV-OBIA adopts a fixed window size (denoted FWMV-OBIA), while the second uses an adaptive window size (denoted AWMV-OBIA). Our results show that MV-OBIA substantially improves the overall accuracy compared with Ortho-OBIA, regardless of the features used for classification and the types of wetland land covers in our study site. Furthermore, MV-OBIA utilizes the multi-view information for classification far more efficiently, as shown by its considerably higher overall accuracy compared with BRDF-based methods. Lastly, FWMV-OBIA and AWMV-OBIA both show potential to achieve an equal if not higher overall accuracy compared with MV-OBIA at substantially reduced computational costs.
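The voting step at the core of MV-OBIA can be sketched as a simple majority vote over the class labels predicted for the multi-view object instances of each orthophoto object; the data layout and tie-breaking below are illustrative assumptions.

    # Sketch of the MV-OBIA voting step: each orthophoto object receives the label
    # predicted most often across its multi-view object instances on the raw UAS
    # images. The data layout and tie-breaking are illustrative assumptions.
    from collections import Counter

    def vote_labels(instance_predictions):
        """instance_predictions: dict mapping orthophoto object id ->
        list of class labels predicted for its multi-view instances."""
        final = {}
        for obj_id, labels in instance_predictions.items():
            final[obj_id] = Counter(labels).most_common(1)[0][0]  # majority label
        return final

    # Example: object 7 seen in five raw images, classified as Cogongrass three times.
    print(vote_labels({7: ["Cogongrass", "Cogongrass", "native", "Cogongrass", "water"]}))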

Graph structure for the CRF model: given an object to be classified (i.e., the object with the yellow boundary), its surrounding objects were extracted to construct a graph. For simplicity, only the context information between the central node and its surrounding nodes was considered, while the information among the surrounding nodes was disregarded, as indicated by the missing edges between the surrounding nodes.

Context information is rarely used in object-based landcover classification. Previous models that attempted to utilize this information usually required the user to input empirical values for critical model parameters, leading to suboptimal performance. Multi-view image information is useful for improving classification accuracy, but methods to assimilate multi-view information so that it can be used by context-driven models have not been explored in the literature. Here we propose a novel method to exploit multi-view information for generating class membership probabilities. Moreover, we develop a new conditional random field (CRF) model that integrates multi-view information and context information to further improve landcover classification accuracy. This model does not require the user to manually input parameters because all parameters in the CRF model are fully learned from the training dataset using a gradient descent approach. Using multi-view data extracted from small Unmanned Aerial Systems (UASs), we experimented with Gaussian Mixture Model (GMM), Random Forest (RF), Support Vector Machine (SVM), and Deep Convolutional Neural Network (DCNN) classifiers to test model performance. The results showed that our model improved average overall accuracies from 58.3% to 74.7% for the GMM classifier, 75.8% to 87.3% for the RF classifier, 75.0% to 84.4% for the SVM classifier, and 80.3% to 86.3% for the DCNN classifier. Although the degree of improvement may depend on the specific classifier, the proposed model can significantly improve classification accuracy irrespective of classifier type.
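In generic form (our notation here, not necessarily the exact potentials used in the model), a CRF over the star graph shown above scores a labeling y of the central object and its surrounding objects as

    P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i} \psi_u(y_i, \mathbf{x}) + \sum_{(i,j) \in \mathcal{E}} \psi_p(y_i, y_j, \mathbf{x}) \Big)

where the unary potential psi_u is derived from the multi-view class membership probabilities, the pairwise potential psi_p captures context between the central node and each surrounding node, E is the edge set of the star graph, and Z(x) is the partition function; the parameters of these potentials are the quantities learned from the training data by gradient descent.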

Finding individual trees in a forest with LiDAR

Given the challenges posed by a fluctuating global economy, environmental concerns, and diminishing forest resources, there is an urgent demand for detailed forest data that is both highly precise and cost-efficient. In response to this need, the concept of precision forestry has been introduced, aiming to enhance forest productivity by accurately defining and geographically linking essential forest management and product information through sophisticated information technology. Consequently, acquiring forest information on an individual-tree basis becomes critical for implementing precision forestry. This involves collecting data on individual tree characteristics such as trunk and crown size, location, height, and crown base height. Such information is vital for assessing forest attributes like stem density, crown closure, biomass, and carbon stocks at the larger plot or stand levels. However, detecting individual trees from remotely sensed data presents numerous challenges, especially in accurately establishing the ground-truth distribution and dimensions of each tree. Advances in LiDAR technology have made it easier to acquire point cloud data beneath forest canopies using handheld scanners. Our strategy focuses on extracting individual tree data from these point clouds to establish a ground truth for developing methods to identify trees from overhead data. We found that current algorithms struggle to process under-canopy point clouds for tree detection because they require numerous sensitive parameters. Our approach instead leverages deep learning to recognize 3D tree stems in the forest point cloud (demonstrated in the accompanying video), followed by a region-growing method that links each point to its respective tree stem, offering a more robust and user-friendly solution.
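The stem-association step can be sketched as a nearest-neighbor region-growing pass that repeatedly attaches unlabeled points to the closest already-labeled stem; the search radius and data layout below are illustrative assumptions rather than the exact procedure used.

    # Sketch of the region-growing step: starting from points already labeled as
    # detected stems, iteratively absorb nearby unlabeled points so every reachable
    # point ends up attached to a tree stem. The radius is an illustrative assumption.
    import numpy as np
    from scipy.spatial import cKDTree

    def grow_regions(points, stem_labels, radius=0.3):
        """points: (N, 3) array; stem_labels: (N,) array with a stem id or -1 if unlabeled.
        Assigns each unlabeled point the stem id of its nearest labeled neighbor
        within `radius`, repeating until no more points can be added."""
        labels = stem_labels.copy()
        while True:
            labeled = np.where(labels >= 0)[0]
            unlabeled = np.where(labels < 0)[0]
            if len(unlabeled) == 0 or len(labeled) == 0:
                break
            tree = cKDTree(points[labeled])
            dist, idx = tree.query(points[unlabeled], k=1)
            grow = dist <= radius
            if not np.any(grow):
                break
            labels[unlabeled[grow]] = labels[labeled[idx[grow]]]
        return labels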

Bounding boxes on the stems in a slice of the post-processed point cloud
Students test the GeoSLAM LiDAR scanner mounted on the DJI M600 Pro drone