Vision-Based Preharvest Yield Mapping for Apple Orchards
Abstract
We present an end-to-end computer vision system for mapping yield in an apple orchard using images captured from a single camera. Our proposed system is platform independent and does not require any specific lighting conditions. Our main technical contributions are 1) a semi-supervised clustering algorithm that utilizes colors to identify apples and 2) an unsupervised clustering method that utilizes spatial properties to estimate fruit counts from apple clusters having arbitrarily complex geometry. Additionally, we utilize camera motion to merge the counts across multiple views. We verified the performance of our algorithms by conducting multiple field trials on three tree rows consisting of trees at the University of Minnesota Horticultural Research Center. Results indicate that the detection method achieves measure for multiple color varieties and lighting conditions. The counting method achieves an accuracy of . Additionally, we report merged fruit counts from both sides of the tree rows. Our yield estimation method achieves an overall accuracy of across different datasets.
Keywords:
Yield estimation, Apple detection, Apple counting, Semi-supervised image segmentation, Machine vision, Unsupervised clustering
Efficient control and management of agricultural farms are essential to cope with a growing population and numerous environmental and economic issues. Precision agriculture techniques provide farmers with methods to monitor the status of their crops on demand and enable them to make crucial decisions (trimming, pruning, application of fertilizers and pesticides, etc.). For many commodity crops (rice, corn, wheat, etc.), these techniques are already mature enough to provide on-demand maps (yield, crop stress, water quality, etc.) as well as help create an infrastructure to implement crucial tasks. However, for specialty crops (such as fruits, vegetables, and flowers), these techniques are still evolving. Specialty crops are particularly good candidates for precision farming and phenotyping studies because of their species diversity, high value, high management cost, and high variability in growth. One of the main bottlenecks for both these applications is the lack of a convenient yield mapping system.
The variability in plant size, color, shape, etc. makes it difficult to develop a general yield mapping method. Instead, researchers have developed yield mapping systems specific to each crop (Senthilnath et al. (2016); Wang et al. (2013); Das et al. (2015); Gongal et al. (2016); Silwal et al. (2014)). In this paper, we focus on apple orchards and present a complete system for yield monitoring. Our system has a couple of distinguishing features:

It is platform independent and relies only on images captured from a single camera. The camera can be mounted on a ground robot or an Unmanned Aerial Vehicle (UAV). It can also be handheld.

It does not require any special lighting condition. It operates at daytime in a completely natural environment that is convenient for field applications.
We have two main technical contributions in this paper. 1) We present a novel semi-supervised segmentation method to separate the apple pixels from others in the input images. Varying colors of apples (red, golden, green, yellow, along with mixtures of their shades), different lighting conditions (based on the position of the sun, clouds, etc.), and shadows and occlusions created by leaves and branches make it difficult to identify the image regions that belong to apples. Our method (Fig. 7) uses minimal user interaction to train a classification model, which is then used to identify the apples. Additionally, the system is capable of storing trained models that can be reused for a similar variety of apples and lighting conditions. A key aspect of our segmentation method is that it does not rely on detecting every apple in a single frame (Fig. 16). We utilize multiple views for detection and counting based on the assumption that for every apple there is “some” view from which it can be detected. This provides robustness against false positives (caused by shadows and specularities) and helps us achieve high precision and recall (Fig. 18, 19). We investigate the sensitivity of our detection method with respect to user input, the impact of occlusions, and performance for different color varieties and lighting conditions. We observed that, except for extreme situations (Fig. 1, rightmost image), most of the difficulties caused by occlusions and lighting conditions can be eliminated with the help of user supervision. Our algorithm achieves score of for different color varieties and lighting conditions, outperforming all existing methods.
2) We present a novel method for counting the apples from clusters with arbitrarily complex geometry and merge the apple counts across multiple frames utilizing camera motion. Counting apples from a running sequence of images is difficult, as apples can be found in arbitrarily shaped clusters in which almost all apples overlap with each other (Fig. 2). Furthermore, because of specularities as well as occlusions due to leaves and branches, some apples are not detectable at all. We present a method for counting apples based on a classic clustering technique: Gaussian Mixture Models (GMM) (Bilmes et al. (1998)). Our method provides both the counts and the locations of individual apples in an input image. We model each apple with a Gaussian probability distribution function (pdf) and each apple cluster with a mixture of Gaussians. We present a novel heuristic to find the correct number of components (i.e., the number of apples) in the mixture model. Additionally, apples detected in individual frames are tracked across frames to obtain accurate counts (Fig. 4). We merge the information obtained from the per-frame operations across multiple frames utilizing camera motion (approximated by pairwise homography). We validate our counting performance both individually and coupled with tracking. We observe that the accuracy of the per-frame counting method is . When coupled with tracking, we achieve a varying accuracy of for seven different videos.
One of the main takeaways from this paper is that the number of visible apples from a single side of a row varies a lot from dataset to dataset () (see Section 5.4). To obtain a consistent estimate of fruit counts, it is essential to have a coherent geometric representation of the entire tree row. With the help of our recent work (Dong and Isler (2018); Roy et al. (2018)), we report merged apple counts from both sides of fruit tree rows with a varying accuracy of across three different datasets.
The rest of the paper is organized as follows. In Section 1 we discuss relevant works in the literature. In Section 2 we formalize our problem definition and provide a brief overview of our entire computer vision system. Sections 3 and 4 present our segmentation and counting methods. Section 5 presents the experimental results. Finally, Section 6 presents the conclusion and future work. We start with the related work in the next section.
1 Related Work
There has been significant recent activity in automating yield estimation in apple orchards (Wang et al. (2013); Das et al. (2015); Hung et al. (2015); Gongal et al. (2016)). While some of these existing works focus on entire yield estimation systems, others focus on specific components only. In this section, we first discuss complete yield estimation systems and then focus on individual components. Besides apples, our discussion includes yield estimation systems and components for fruits similar to apples, such as citrus, pomegranates, and tomatoes.
Complete Yield Estimation Systems: Wang et al. (2013) presented a complete system for apple yield estimation. Their system used a side-facing, wide-baseline vertical stereo rig mounted on an autonomous ground vehicle. The system operated in controlled artificial illumination settings at night. It used flashlights to illuminate the apples and classified them using HSV color thresholds. In addition to count information, the stereo system extracted fruit sizes. Hung et al. (2015) presented a feature-based learning approach for identification of red and green apples and extraction of fruit count. They used a conditional random field classifier to segment fruits using color and depth data. Das et al. (2015) developed a sensor suite for extracting plant morphology, canopy volume, leaf area index, and fruit counts that consists of a laser range scanner, multispectral cameras, a thermal imaging camera, and navigational sensors. They used a Support Vector Machine (SVM) trained on pixel color to locate apples in images. Gongal et al. (2016) developed a new sensor system with an over-the-row platform integrated with a tunnel structure, which acquired images from opposite sides of apple trees. The tunnel structure is used to minimize illumination of apples by direct sunlight and to reduce the variability in lighting. In contrast to these systems, our goal is to develop a general computer vision system to extract the count of apples in an orchard row from a monocular camera. The system is independent of any particular hardware platform and can be used with various types of robotic and non-robotic systems. In the rest of this section, we discuss systems developed for tackling specific portions of the yield estimation pipeline.
Apple Detection: The problem of locating fruits in captured photographs has been studied extensively in the literature. Early systems relied on hard color thresholds, and the early machine learning methods were targeted at learning these thresholds. Jimenez et al. (2000) presented a survey of these early computer vision methods for locating fruits on trees. More recent approaches include Zhou et al. (2012), who presented a method that uses RGB and HSV color thresholds to detect Gala apples. Linker et al. (2012) used a K-nearest neighbor (KNN) classifier along with blob and arc fitting to find red and green apples. The authors avoid specular lighting conditions by capturing images close to sunset. Changyi et al. (2015) developed an apple detection method that uses a backpropagation neural network (Hecht-Nielsen (1989)) trained on color features. Similarly, Liu et al. (2016) presented a method for detecting apples at night using pixel color and position as features in neural networks. The earliest work close to our method is that of Tabb et al. (2006), who developed a method for detecting apples from videos using background modeling by global Gaussian mixture models (GMM). The method relied on images collected using an over-the-row harvester platform that provides a consistent background and illumination setting. In contrast, we operate in natural settings, use GMM for unsupervised clustering in every image, and classify the resulting clusters using pre-trained models.
Deep Learning for Detecting Apples: Recent advancements in deep learning have inspired researchers to apply these techniques to identifying apples. Bargoti and Underwood (2017) used a Multi-Scale Multi-Layered Perceptron network (MLP) for apple and almond detection. Stein et al. (2016) used a similar deep learning technique to identify mangoes. Chen et al. (2017) used a Fully Convolutional Neural Network (Long et al. (2015)) for apple segmentation. Though deep learning methods are accurate, they require a large amount of training data. For example, we trained a Fully Convolutional Network (FCN) using the data from Bargoti and Underwood (2017), and the network did not perform well on our data; when we used some of our own data for training, performance improved drastically. Generating such training data for different varieties of apples in different lighting conditions is tedious and cumbersome. In contrast, our method generalizes to any variety of apples and different lighting conditions (as long as apples can be distinguished from other objects by color) using a modest amount of user assistance.
Registering Fruits Across Images: Compared to locating fruits in images, studies on registering fruits across multiple images have been limited. Only the systems with full yield estimation pipelines studied this problem. Wang et al. (2013) used stereo cameras and point cloud alignment (using odometry and GPS) to avoid double counting. They aligned the apples in 3D space and removed the ones which are within five centimeters of a previously registered apple. Hung et al. (2015) used sampling at certain intervals to remove overlap between images. Das et al. (2015) used optical flow and navigational sensors to avoid duplicate apples. In our previous work (Roy and Isler (2016)), we presented a novel method for registering apples based on affine tracking (Baker and Matthews (2004)) and incremental structure from motion (SfM) (Sinha et al. (2014)). In this work, we use homographies between frames to track fruits and to avoid double counting.
To obtain an accurate yield estimate, in addition to registering fruits from a single side, we need to register the fruits from both sides of the row. Only Wang et al. (2013) handled this problem, using point cloud alignment based on odometry and GPS. In this paper, we use a novel technique that exploits global and local semantic information (ground plane, tree trunks, the silhouette of foliage, etc.) for merging the apple counts from both sides (Dong and Isler (2018); Roy et al. (2018)).
Counting Fruits: Most of the yield estimation work described above reports counting results. The methods for counting fruits from segmented images are dominated by circular Hough transforms (CHT) (Silwal et al. (2014); Changyi et al. (2015); Liu et al. (2016)). The main bottleneck of using CHT is that the parameters need to be tuned across different datasets. Another issue with this method is occlusion: many apples are not fully visible and cannot be approximated by a circle. Different from these methods, Senthilnath et al. (2016) used K-means, a mixture of Gaussians, and Self-Organizing Maps (SOM) to detect individual tomatoes within a cluster. They used the Bayesian Information Criterion (BIC) (Chen and Gopalakrishnan (1998)) to select the optimal number of components in these methods. Chen et al. (2017) used a neural network for determining the count of the apples. In contrast, we use a Gaussian mixture model to count the apples from registered and segmented apple clusters. Unlike CHT, our method does not require any parameter tuning and can handle arbitrarily complex clusters. Unlike the neural network based methods, it does not require any training. The method yields competitive results compared to most of the state-of-the-art methods. In this paper, we present a complete computer vision system for yield estimation in apple orchards. We present a novel segmentation method for detecting apples, a tracking method based on homography, and a non-parametric counting algorithm. We start with an overview of the entire system in the next section.
2 Problem Formulation and Overview of Our Computer Vision System
In this section, we formalize our problem definition and present a brief overview of our system. We start with the problem definition.
Problem Formulation: Given a set of images from a calibrated monocular camera facing one side of a row in an apple orchard (where it is assumed that the variance in viewing direction is very small), we want to compute the total apple counts from the entire image sequence.
To solve this problem, we proceed in a per-frame manner: we separate the apple pixels from others in every frame and count the number of apples in each of them. Afterward, we merge the information across multiple frames by utilizing the approximate camera motion from pairwise homographies. We discuss the components of our system briefly in the rest of this section.
2.1 Segmentation
The segmentation component takes as input a color image for each frame and outputs a binary mask which marks whether each pixel in the image belongs to the class apple (Fig. 7). This component is presented in detail in Section 3.
First, the image is oversegmented into SLIC superpixels (Achanta et al. (2012)) using the LAB colorspace. A single representative color (the mean LAB color of the pixels within the superpixel) is assigned to each superpixel. Then superpixels are clustered by color into approximately color classes. Finally, each class is labeled as apple or non-apple based on its KL divergence (Goldberger et al. (2003)) from hand-labeled classes. These hand-labeled classes can be obtained from the unsupervised clusters of the first few frames of a particular video, which makes it easy to account for the current lighting conditions and the color of the particular apple variety at its particular ripeness.
2.2 Per Frame Counting
The per-frame counting component takes as input the binary segmented mask for each frame and outputs a set of bounding boxes and associated integers for each frame, where each bounding box represents a connected cluster of apples and the integer is the estimated number of apples in that cluster (Fig. 2, 3). This component is presented in detail in Section 4.1.
First, a connected component analysis is performed on the binary apple mask. Each connected component is examined separately, to determine how many apples it contains. We perform a Gaussian Mixture Model (GMM) based clustering to estimate the number of apples contained within the bounding box, as well as their positions.
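The connected component step can be illustrated with a minimal sketch. It assumes 8-connectivity and a (row_min, col_min, row_max, col_max) box convention, neither of which is stated in the text; the function name is hypothetical and this is not the implementation used in our system.

```python
from collections import deque

import numpy as np


def connected_component_boxes(mask):
    """Return one (row_min, col_min, row_max, col_max) box per 8-connected blob.

    mask: 2D array whose nonzero entries mark apple pixels.
    8-connectivity is an assumption made for this sketch.
    """
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask, dtype=bool)
    height, width = mask.shape
    boxes = []
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0, c0] or visited[r0, c0]:
                continue
            # Breadth-first flood fill starting from an unvisited apple pixel.
            queue = deque([(r0, c0)])
            visited[r0, c0] = True
            rmin, cmin, rmax, cmax = r0, c0, r0, c0
            while queue:
                r, c = queue.popleft()
                rmin, rmax = min(rmin, r), max(rmax, r)
                cmin, cmax = min(cmin, c), max(cmax, c)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < height and 0 <= cc < width
                                and mask[rr, cc] and not visited[rr, cc]):
                            visited[rr, cc] = True
                            queue.append((rr, cc))
            boxes.append((rmin, cmin, rmax, cmax))
    return boxes
```

Each returned box delimits one candidate apple cluster, which would then be passed to the GMM-based counting step.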
2.3 Camera Motion Approximation
The camera motion approximation component takes as input the detected SIFT (Lowe (1999)) features in the original input images. It computes a pairwise homography using the SIFT feature matches.
2.4 Merging the Counts From Multiple Views
The merging component takes as input a sequence of per-frame bounding boxes with apple counts, as well as estimated frame-to-frame homographies of the approximately planar scene. The output is a total count of unique apples seen in the frame sequence (Fig. 4, 12). This component is presented in Section 4.2.
Essentially, it propagates the computed bounding boxes forward and recomputes the counts when a bounding box overlaps with another one.
3 Apple Segmentation
In the segmentation stage, the goal is to identify pixels which are likely to belong to an apple. Specifically, the segmentation algorithm takes as input an RGB image and produces a binary mask where pixels belonging to apples are marked as ones and all other pixels are marked as zeros.
One approach to segmentation is to simply choose apple pixels based on predetermined thresholds on color values. While this approach has the advantage of simplicity, in field conditions it often fails due to the variability of lighting conditions. In recent years, deep-learning-based approaches such as convolutional neural networks have emerged as powerful, general methods for image classification. Recently, their extension to Fully Convolutional Networks (FCN) trained for pixel-wise prediction has gained popularity for fruit and crop segmentation (Chen et al. (2017)). While these models achieve high accuracy, training a network general enough to accommodate variations in light and visibility conditions remains challenging. Further, training FCNs for each orchard video separately requires a significant amount of human labor to generate ground truth labels for each apple.
We propose a semi-supervised image segmentation method based on color and shape features of apples. By itself, the segmentation algorithm runs at to frames per second on images of size and 15 frames per second for images of the size of on a DELL XPS laptop with 16GB RAM and 2GB GPU memory. The training required from the user is minimal and is assisted by a simple, convenient user interface. As we show in Section 5.3, the method generalizes well to cases where the training and testing data are from different portions of an orchard taken on different days. Our method is expected to work for cases where apples are visibly distinguishable from the surrounding vegetation based on color. Some working and challenging conditions are shown in Fig. 1.
Details of the segmentation process: For each frame in the image sequence, we convert it from RGB to LAB (Gauch and Hsia (1992)) color space and perform SLIC superpixel segmentation (Achanta et al. (2012)) (Fig. 7). This generates a set of superpixels for each image.
Here each superpixel is represented by , the mean L, A, B values for all pixels . We assume that the set of superpixels can be modeled as a density function governed by the set of parameters . For soft segmentation, we model as a Gaussian Mixture Model (GMM) (Chuang et al. (2001); Ruzon and Tomasi (2000)) with components. Hence is represented by a set of Gaussian components and parameters as the respective mean and covariances for each .
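As an illustration of the representative-color step, the following minimal sketch computes the mean LAB color of every superpixel. It assumes that the image has already been converted to LAB and that a superpixel label map is available (for instance from an off-the-shelf SLIC implementation such as the `slic` function in scikit-image); the function name and the array layout are assumptions made for this example.

```python
import numpy as np


def superpixel_mean_colors(lab_image, labels):
    """Return an (S, 3) array holding the mean LAB color of each superpixel.

    lab_image: (H, W, 3) float array, assumed to be already in LAB space.
    labels:    (H, W) int array of superpixel ids in 0..S-1 (assumed input).
    """
    flat_labels = np.asarray(labels).ravel()
    n_superpixels = int(flat_labels.max()) + 1
    # Number of pixels that fall in each superpixel.
    counts = np.bincount(flat_labels, minlength=n_superpixels)
    means = np.zeros((n_superpixels, 3))
    for channel in range(3):
        # Per-superpixel sum of one color channel.
        sums = np.bincount(flat_labels,
                           weights=lab_image[..., channel].ravel(),
                           minlength=n_superpixels)
        means[:, channel] = sums / np.maximum(counts, 1)
    return means
```

The rows of the returned array are the per-superpixel color vectors that are subsequently clustered by the mixture model.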
such that
The likelihood function of the parameters can be written as:
The resultant Gaussian mixture density estimation problem is:
where and . Expectation Maximization (EM) is used (Bilmes et al. (1998)) to estimate for each Gaussian cluster . Each Gaussian cluster thus generated contains similar colored superpixels (Fig. 5) and represents different semantic entities of an orchard frame such as the sky, soil, apples, leaves, branches, etc.
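For reference, the standard Gaussian mixture formulation described in this paragraph can be written as follows. The symbols used here (x_j for the color vector of superpixel j, N for the number of superpixels, K for the number of components, and alpha_i, mu_i, Sigma_i for the weight, mean, and covariance of component i) are assumptions made for this sketch, following the usual textbook presentation (e.g., Bilmes et al. (1998)), and may differ from the notation intended in the text.

```latex
\begin{align}
  p(x \mid \Theta) &= \sum_{i=1}^{K} \alpha_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),
  \qquad \text{such that } \sum_{i=1}^{K} \alpha_i = 1, \;\; \alpha_i \ge 0, \\
  \mathcal{L}(\Theta \mid X) &= \prod_{j=1}^{N} p(x_j \mid \Theta), \\
  \Theta^{*} &= \arg\max_{\Theta} \, \mathcal{L}(\Theta \mid X),
\end{align}
```

where, under these assumed symbols, X = {x_1, ..., x_N} and Theta = {alpha_i, mu_i, Sigma_i} for i = 1, ..., K.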
Next, we identify which among these capture superpixels belonging to apples. We divide this step into two parts: (1) initialization and (2) usage (Fig. 6). In the initialization phase, we provide an interface for a user to interact with the . Here the user is allowed to:

select components which completely capture apples. For each selected , their respective and are pushed in a list in memory.

delete components from the list stored in memory. This step is generally needed when there is a sudden illumination change and the previously stored components no longer describe the apples (Fig. 6). We can then update the old list and continue the segmentation process from the current frame.
In the usage phase, the user interface is not invoked. To find components belonging to apples, we perform a simple matching between the current frame’s and the saved . We use KL divergence as the distance measure for comparing Gaussians from and . For a matched Gaussian , all the superpixels within the confidence bounds are identified as superpixels belonging to apples. A normal usage paradigm is shown in Fig. 6.
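The matching step can be sketched with the closed-form KL divergence between two multivariate Gaussians. In the sketch below, the direction of the divergence, the `threshold` parameter, and the function names are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians, in nats."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - k + logdet1 - logdet0)


def match_apple_components(frame_components, saved_components, threshold):
    """Indices of current-frame components close to any saved apple model.

    Each component is a (mean, covariance) pair. `threshold` is a
    hypothetical tuning value; the text does not state one.
    """
    matched = []
    for idx, (mu, cov) in enumerate(frame_components):
        if any(kl_gaussians(mu, cov, s_mu, s_cov) < threshold
               for s_mu, s_cov in saved_components):
            matched.append(idx)
    return matched
```

Because the KL divergence is asymmetric, a symmetrized variant (the sum of both directions) would be an equally plausible choice for the distance measure.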
4 Counting Apples and Merging the Counts across Multiple Frames
After the segmentation phase, we count the number of apples from each segmented frame and merge the counts across multiple frames utilizing the estimated camera motion. In this section, we present both of these methods in detail. We start with per frame counting in the following section.
4.1 Per Frame Counting
Given a segmented input image, we would like to find the number and location of all the apples in the image. We use a Gaussian Mixture Model (GMM) based clustering method. Instead of color, we now focus on the spatial components of the image. This method holds numerous advantages over the Circular Hough Transform (CHT) based techniques (Pedersen (2007)): it does not require manual parameter tuning, can handle a significant level of occlusion, and can find apples of rapidly varying size. A preliminary version of this algorithm was presented in Roy and Isler (2017), where we compared this method with a baseline method similar to CHT and showed that it outperformed the baseline significantly (the accuracy of baseline method , the accuracy of GMM based method ). In this paper, we further validate the method with ground truth from multiple datasets and implement it in a close to real-time system performing at 23 fps.
In our method, each apple is modeled by a Gaussian probability distribution function (pdf) and apple clusters are modeled as a mixture of Gaussians. We start by converting the input cluster image to binary. Let this binary image be denoted by . The locations of the nonzero pixels in the binary image are used as input to GMM.
Let X represent the set of apples we are trying to find. Then, we can convert our problem to a Gaussian mixture model formulation in the following way:
(1) 
Here, is a Gaussian mixture model with components, and is the th component of the mixture. and are the mean and covariance of the component. The covariance matrix is diagonal. is the weight of the component where and .
Given model parameters , the problem of finding the location of the center of the apples and their pixel diameters can be formulated as computing the world model which maximizes .
Each component of the mixture model represents an apple with center at , equatorial radius and axial radius .
A common technique to solve for is the expectation maximization (EM) algorithm (Bilmes et al. (1998)). As is well known, EM provides us a local greedy solution to the problem. Since EM is susceptible to local maxima, initialization is very important. We used K-means++ (Hartigan and Wong (1979)) (which uses randomly selected seeds to avoid local maxima) for initialization of EM.
Selecting the Number of Components: In our problem formulation, the number of components is the total number of apples in image . EM enables us to find the optimal location of the apples given the total number of apples . Our main technical contribution is a method to calculate the correct . Let the correct number of apples in the input image be . We tried different state-of-the-art techniques (Akaike Information Criterion (AIC) (Grunwald (2004)), Minimum Description Length (MDL) (Grunwald (2004)), etc.) for finding . None of them worked out of the box for our purposes (Fig. 8). Therefore, we propose a new heuristic, based on MDL, for evaluating mixture models with different numbers of components.
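A minimal sketch of EM for a diagonal-covariance mixture over pixel coordinates, seeded in the K-means++ style, is given below. It fixes the number of components and does not include the component selection heuristic; the iteration count, the variance floor `min_var`, the random seed, and the function names are assumptions made for this example rather than values from our implementation.

```python
import numpy as np


def kmeans_pp_seeds(points, k, rng):
    """Pick k seeds: the first uniformly at random, each later one with
    probability proportional to squared distance to the nearest seed so far."""
    seeds = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        d2 = np.min([np.sum((points - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(seeds, dtype=float)


def fit_diag_gmm(points, k, n_iter=100, seed=0, min_var=1e-3):
    """Fit a k-component GMM with diagonal covariances to (N, 2) coordinates.

    Returns (weights, means, variances). n_iter, seed, and min_var are
    hypothetical settings chosen for this sketch.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    means = kmeans_pp_seeds(points, k, rng)
    variances = np.tile(points.var(axis=0) + min_var, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities from weighted diagonal Gaussian densities.
        diff = points[:, None, :] - means[None, :, :]
        log_prob = (np.log(weights)
                    - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                    - 0.5 * np.sum(diff ** 2 / variances, axis=2))
        log_prob -= log_prob.max(axis=1, keepdims=True)
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and per-axis variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ points) / nk[:, None]
        diff = points[:, None, :] - means[None, :, :]
        variances = (np.einsum('nk,nkd->kd', resp, diff ** 2) / nk[:, None]
                     + min_var)
    return weights, means, variances
```

In this sketch the fitted means play the role of apple centers and the per-axis variances describe the extent of each apple; the input points would be the coordinates of the nonzero pixels of one segmented cluster.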
Let and . Using the mean and covariances of the th component we define a Gaussian kernel where is the variance. Let denote the response of the kernel when placed at the center in the original input image and denote the total number of pixels clustered by . For each component , of the mixture model we define the reward in the following way,
(2) 
For most of the images, we only capture the frontal views of the apples, which can be easily approximated by circles lying on a plane. All four terms in equation (2) reward specific spatial characteristics of the Gaussian pdf related to this fact. represents the strength of the distribution in terms of pixel values and is present in the first three terms. The second term rewards circular shaped distributions using the eccentricity of the pdf. As the eccentricity for circles is zero, we use as the rewarding factor. The third term rewards coverage. The fourth term penalizes Gaussian pdfs that cover a large area while clustering very few points.
Now if we find out the reward for all the components , the total reward for the mixture model can be computed by summing them together.
Next, we define the penalty term. The traditional MDL penalty term is where is the number of parameters in the model, is the total size of the input data, and is a constant. Based on this principle, our penalty term is defined as follows:
(3) 
where represents the pixel index across the image . Compared to the traditional MDL based penalty we have the constant instead of . This is attributed to the fact that the reward expression (2) has three terms compared to one. The number of components is multiplied by three as each Gaussian has three parameters . With these terms defined, we choose the correct number of components in the following way:
(4) 

To have a better understanding of the selection procedure, we demonstrate a synthetic example in Fig. 9. From Fig. 9, it is evident that except for , other mixtures have low circularity. The coverage rewards and pixel density components increase with and converge to a steady state. While the penalty for minimum description length principle increases with , generally the penalty for coverage decreases with . In this example, the crucial factors in determining the score are circularity and the coverage penalty. For circularity is at the peak and coverage penalty is lowest and consequently, the score was maximum for . The plots of the corresponding rewards, penalties, and final scores are shown in Fig. 9. We show sample results from our datasets in Fig. 10.
After executing the counting algorithm on each of the bounding boxes, we have the location of the apples within each box and the number of apples per box. This is the input to our merging method that merges the apple counts across multiple frames. We describe this procedure in detail in the next section.
4.2 Merging Count Across Multiple Frames
The per-frame counting method provides us with the apple counts for a single frame. In natural settings, though, the visibility of a particular cluster can change drastically with a change in camera position. A cluster might be completely invisible or only partially visible from a particular view and yet clearly visible from other views. For these reasons, our segmentation method may not be able to obtain a perfect segmentation in each frame and, consequently, the predicted number of apples might be incorrect. Therefore, to obtain the correct apple count for each cluster, we merge the counts from different frames. For this operation, we need to establish the correspondence between the clusters across multiple frames, which we do by utilizing the camera motion.
As our camera viewing direction does not change much, and the scene is roughly planar, we model the camera motion between consecutive frames by pairwise homography (Eshel and Moses (2008)). The homography between frames is estimated by matching SIFT (Lowe (1999)) features above the ground. Using homography we will keep track of the boxes generated by the connected component analysis across the entire image sequence (Fig. 11).
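To make the homography step concrete, the sketch below estimates a homography from point correspondences with the direct linear transform (DLT) and applies it to points. It assumes that feature matches are already available and free of outliers, and it omits coordinate normalization; in practice the matches would come from SIFT descriptors and a robust estimator such as RANSAC would typically wrap this step. The function names are assumptions made for this example.

```python
import numpy as np


def estimate_homography(src_pts, dst_pts):
    """Direct linear transform: find H such that dst ~ H @ src (homogeneous).

    Needs at least four correspondences with no three points collinear.
    Assumes outlier-free matches and a nonzero bottom-right entry of H.
    """
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src_pts, float),
                              np.asarray(dst_pts, float)):
        # Two linear constraints on the nine entries of H per correspondence.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def apply_homography(h, pts):
    """Map (N, 2) points through H and return the dehomogenized (N, 2) result."""
    pts = np.asarray(pts, float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ np.asarray(h, float).T
    return homog[:, :2] / homog[:, 2:3]
```

Applying `apply_homography` to the center of a bounding box gives its predicted position in the next frame, which is the operation used when boxes are propagated.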
Let be the bounding boxes generated by connected component analysis for frame . Let the apple count for each of these boxes be (computed by the counting method). When we find a bounding box for the first time, we initialize a counting list that contains the computed counts from the first frame.
Now for frame , let the bounding boxes be and the counts be . Let the homography that maps frame to frame be . We propagate all the bounding boxes in frame to frame using . This is executed by multiplying the center of each bounding box with . Next, we check the overlap between these propagated bounding boxes and the original bounding boxes on frame . If the overlap is more than we assume that these bounding boxes correspond to the same cluster. We will add the apple count to the list initialized previously using the following rules:

When a bounding box in the current frame does not overlap with any of the propagated bounding boxes, a new counting list for this box is initialized with the current count.

When only one bounding box in the current frame overlaps with a propagated bounding box, the count from the bounding box in the current frame is added to the counting list for the propagated box.

Otherwise, when two or more bounding boxes overlap with a bounding box from the previous frame, their counts are added to obtain the total count to be recorded in the list. To have a better understanding of this rule, we consider the following scenario: Let be overlapping with . Prior to frame the counting list of , had only one entry, . As there is overlap, a new entry will be inserted. This new entry is .

The overlapping bounding boxes are unioned to obtain a new bounding box covering all of them. These new bounding boxes will be propagated to the next frame.
At the end of the image sequence, we have a set of unique boxes with count lists. We compute the median of the count list for each box, and the sum of these medians is reported as the total count (Fig. 12).
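The merging rules above can be sketched as follows. The sketch makes several assumptions that are not taken from the text: boxes are stored as (x_min, y_min, x_max, y_max), a box is propagated by translating it with the displacement of its center under the homography, overlap is measured as the intersection area divided by the smaller box area, `min_overlap` is a hypothetical threshold, and the union is taken over the overlapping boxes of the current frame. All function names are likewise illustrative.

```python
import numpy as np


def shift_box_by_homography(box, h):
    """Translate a box by the displacement of its center under H."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    px, py, pw = np.asarray(h, float) @ np.array([cx, cy, 1.0])
    dx, dy = px / pw - cx, py / pw - cy
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def overlap_ratio(a, b):
    """Intersection area over the smaller box area (an assumed measure)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller


def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_counts(frames, homographies, min_overlap=0.5):
    """Sum over tracked clusters of the median per-frame count.

    frames: list of per-frame lists of (box, count) pairs.
    homographies[i] maps frame i to frame i + 1.
    """
    tracks = []  # each track holds the latest box and its counting list
    for i, detections in enumerate(frames):
        if i > 0:
            for t in tracks:
                t["box"] = shift_box_by_homography(t["box"], homographies[i - 1])
        used = set()
        for t in tracks:
            hits = [j for j, (box, _) in enumerate(detections)
                    if j not in used and overlap_ratio(t["box"], box) > min_overlap]
            if not hits:
                continue
            used.update(hits)
            # One hit: record its count. Several hits: record their sum.
            t["counts"].append(sum(detections[j][1] for j in hits))
            merged = detections[hits[0]][0]
            for j in hits[1:]:
                merged = union_box(merged, detections[j][0])
            t["box"] = merged
        # Boxes that matched no propagated box start new counting lists.
        for j, (box, count) in enumerate(detections):
            if j not in used:
                tracks.append({"box": box, "counts": [count]})
    return float(sum(np.median(t["counts"]) for t in tracks))
```

Using the median of each counting list makes the final count robust to frames in which a cluster is partially occluded or poorly segmented.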
5 Experimental Results
In this section, we present experimental results validating our algorithms. We start with the datasets.
5.1 Datasets
To verify the performance of our algorithms, we collected multiple datasets. We group these datasets into two categories, namely “Validation Datasets” and “Training Datasets”. As their names suggest, they are used for validation and training, respectively. All data were collected at the University of Minnesota Horticultural Research Center in Victoria, Minnesota (Fig. 13), over the course of two years (2015 and 2016).
Each of the videos collected from these datasets is tagged sunny/shady/cloudy based on the weather condition and whether the particular side of the row (captured in the video) was facing/opposing the sun.
Validation Datasets
Since this is a research site, it is home to many different kinds of apple trees. We arbitrarily chose four different sections in the orchard. We collected seven videos from these four sections, all of which were annotated manually with fruit locations (the annotation procedure is discussed in the next section). We also collected ground truth for all of them by physically labeling the apples with stickers and measuring their diameter after harvest. The details of these datasets are the following:
 Dataset1

This dataset contains six trees. Most of the apples on these trees were fully red and the trees were mostly planar (most of the apples are visible from both sides). We collected videos from both sides (the side facing the sun and opposing the sun) of the row. In total, there were apples in these six trees. See Fig. 14 (leftmost) for a sample image from this dataset.
 Dataset2

This dataset contains four trees. The trees had a mixture of red and green apples and complex (nonplanar) geometry. We collected a video from a single side of the row (the side facing the sun). In total, there were apples in these four trees. See Fig. 14 (second from left) for a sample image from this dataset.
 Dataset3

This dataset contains ten trees. Apples in these trees were mostly red and the trees had nonplanar geometry. In total, there were apples in these ten trees. We collected videos from both sides (the side facing the sun and opposing the sun) of the row. See Fig. 14 (third from left) for a sample image from this dataset.
 Dataset4

This dataset contains six trees. Fruits in these trees were a mixture of red and green apples and the trees had nonplanar geometry. We collected videos from both sides of the row (the side facing the sun and opposing the sun). In total, there were apples in these six trees. See Fig. 14 (fourth from left) for a sample image from this dataset.
All of the videos collected from these datasets are used to validate our segmentation and counting algorithms. Additionally, Dataset1, Dataset3 and Dataset4 are used to verify the total fruit counts from both sides. As we only collected video from a single side of the row for Dataset2, it is not used to verify the fruit counts from both sides.
Training Datasets
To validate how our algorithms perform without user supervision, we picked a dataset from 2015. This dataset contains trees. Fruits in these trees were a mixture of red and green apples and the trees had nonplanar geometry. We collected a video from a single side of a row (it was a cloudy day, so both sides were illuminated similarly). This video is not annotated manually and ground truth for this dataset is unknown. We only used this dataset for developing a model to identify apples without any user intervention. In other words, the user input was provided for the 2015 dataset and used without modification for the 2016 datasets.
The four validation datasets were captured using a Samsung Galaxy camera in September 2016. The training dataset was collected in 2015 using a Garmin VR camera. Notably, our data collection facility is not a commercial orchard. Consequently, there is a great amount of variability in the shape of the trees even within the same crop row, which makes the yield estimation problem harder. In the next section, we look at the process of annotating the datasets.
5.2 Manual Annotation of Apples for Verifying Detection and Counting
In order to validate our segmentation and counting methods, we need image level ground truth (detected apples in each individual input image). This is different from the number of harvested apples from trees. For this purpose, we annotated the boundary of individual apples in the input images. This provides us with the ability to compare the bounding boxes generated by our algorithm with bounding boxes drawn manually.
For manual annotation, frames were selected arbitrarily every second for the test videos (frame rate fps), depending on how much the camera moved since the last annotated frame. For each of these frames, apples were tagged as clearly visible or marginally visible based on visibility guidelines (Fig. 15). An apple was considered clearly visible if more than half of its cross-sectional area and more than half of its perimeter were unoccluded. Otherwise, if it was still recognized as an apple by the annotator, it was marked as marginally visible. The marginally visible apples have more ambiguous bounding boxes, and might not even have a one-to-one mapping between boxes and apples. In addition to these guidelines, apples that were growing on trees in the rows behind the main row of interest, and apples that had fallen to the ground, were not tagged.
Seven videos collected from the validation datasets (Dataset1 to Dataset4), described in the previous section, were tagged in this manner. The manually drawn bounding boxes were then propagated to the rest of the frames using camera motion between the frames. In the next sections, we investigate the performance of our segmentation and counting algorithms using these hand-labeled datasets. Additionally, we will look at the relationship between computed yield and actual yield for Dataset1, Dataset3, and Dataset4.
Our segmentation algorithm detects the ground apples as well, whereas the manual annotation procedure ignores them. To remove ground apples, we let users choose a single line at the start of each video. This line is propagated using homography for the entire image sequence, and any apples detected below this line are labeled as ground apples and ignored for both segmentation and counting.
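A minimal sketch of this ground-apple filter is given below. It assumes, purely for illustration, that the user-chosen line is stored as two endpoints, that boxes are `(x0, y0, x1, y1)` tuples in image coordinates with y growing downward, and that an apple counts as "below the line" when its box center is below it; the function names are hypothetical.

```python
def apply_homography(H, point):
    """Map an image point (x, y) with a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def propagate_line(H, line):
    """Carry the two endpoints of the ground line into the next frame."""
    return (apply_homography(H, line[0]), apply_homography(H, line[1]))

def is_below(point, line):
    """True if the point lies below the line (image y grows downward)."""
    (x0, y0), (x1, y1) = line
    if x0 == x1:
        raise ValueError("ground line must not be vertical")
    t = (point[0] - x0) / (x1 - x0)
    return point[1] > y0 + t * (y1 - y0)

def drop_ground_apples(boxes, line):
    """Keep only boxes whose center is not below the ground line."""
    kept = []
    for x0, y0, x1, y1 in boxes:
        center = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        if not is_below(center, line):
            kept.append((x0, y0, x1, y1))
    return kept
```

In use, `propagate_line` would be called once per frame with that frame's homography, and `drop_ground_apples` applied to the detections before counting.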
5.3 Performance Evaluation of the Segmentation Method
In this section, we study the performance of our segmentation method. In particular, we investigate its sensitivity with respect to user supervision across the validation datasets. We use three metrics for this purpose: precision, recall and $F_1$-measure. As is well known in the literature, these metrics are obtained from the numbers of true positives (TP), false positives (FP) and false negatives (FN). Formally, $\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$ and $F_1\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. We define TP, FP and FN using manually tagged apples and the apples detected by our algorithm. Specifically, TP = apples detected by our algorithm and tagged manually, FP = apples detected by the algorithm but not tagged manually, FN = apples tagged manually but not detected by our algorithm.
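As a small worked example, the three metrics can be computed from the TP, FP and FN counts as follows. The function name and the convention of returning 0 when a denominator is zero are assumptions made for this sketch.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts.

    Empty denominators yield 0.0, which is a convention chosen for this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 90 correct detections, 10 spurious ones, 30 missed apples.
precision, recall, f1 = detection_metrics(90, 10, 30)
```

For these illustrative counts, precision is 0.9, recall is 0.75 and the F1 value is 9/11, which lies between the two but closer to the lower one.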
We compare the bounding boxes computed by our algorithm to manually drawn bounding boxes (guidelines for the manual annotation procedure were described in the previous section). To evaluate the quality of a particular bounding box generated by our algorithm, we use a metric well known in the literature, the intersection over union (IoU) threshold (Hariharan et al. (2014)). This metric factors in how much of each manually drawn bounding box is covered by the box detected by our algorithm (Fig. 16). We report the performance of our algorithm in terms of precision, recall and $F_1$-measure over the entire range of intersection over union thresholds. For the purposes of counting though, we utilize a nominal nonzero intersection over union threshold (0.01).
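The following sketch shows how intersection over union can be computed for two axis-aligned boxes and used to obtain TP, FP and FN at a given threshold. The greedy one-to-one matching is an assumption made for illustration; the text does not specify how detected and tagged boxes are paired.

```python
def area(box):
    """Area of a box given as (x0, y0, x1, y1)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = w * h if (w > 0 and h > 0) else 0.0
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def count_matches(detected, tagged, threshold=0.01):
    """Greedily pair each detection with its best unmatched tagged box.

    Returns (TP, FP, FN) for the given IoU threshold.
    """
    unmatched = list(tagged)
    tp = 0
    for box in detected:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    return tp, len(detected) - tp, len(unmatched)
```

Sweeping `threshold` from the nominal 0.01 up to 1 and recomputing the counts at each value gives precision and recall curves of the kind reported in the figures.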
One advantage of our algorithm is that it does not need to detect all the apples in a single image, because it utilizes the multiple views available in the video. It is evident from Fig. 16 that for single frames, even at the nominal nonzero intersection over union threshold (that is, when a detected bounding box merely touches the ground truth box), the per-frame apple detection rate (recall) varies significantly. Nevertheless, our per-frame precision is always over and, with the help of multiple views of each apple cluster, we achieve high precision and recall (Figs. 17, 18, 19 and 20).
In order to test the performance without user intervention, we trained a classification model from the dataset collected in 2015 and used this model for detecting apples in the validation datasets collected in 2016. To test the effect of user supervision on specific videos, we let users choose the apples from the first fifty frames of the input video. We utilize the clusters chosen by the users to build the classification model and use this model for segmenting apples. For the rest of the paper, we refer to these two types of classification models as the semisupervised and user-supervised models, respectively.
First, we evaluate the computed recall for clearly and marginally visible apples. According to the visibility guidelines discussed in Section 5.2, the recall (hit rate) for clearly visible apples should be higher than the recall for marginally visible apples. Fig. 17 shows that this is indeed the case for both the semisupervised and user-supervised classification models. We show the results for two different videos from the validation datasets (Dataset1 and Dataset2) in Fig. 17. For the figures in the top row, both models achieve high recall. For the figures in the bottom row, the semisupervised model has low recall and the user-supervised model has high recall. Importantly though, in all the figures the recall for clearly visible apples is higher than the recall for marginally visible apples, and the overall recall lies between these two (especially for high IoU thresholds).
Second, we investigate the sensitivity to user input. For the user-supervised case, we only allow choosing apples in the first fifty frames. Figs. 18 and 19 show that for the first five videos, collected from Dataset1, Dataset3 and Dataset4, the precision and recall of the user-supervised and semisupervised models are similar. However, in the video collected from Dataset2, the apples are a mixture of red and green, and the semisupervised classification model does not generalize well. Consequently, the precision is similar but the recall drops by .
Finally, it is desirable to have a single number that summarizes the performance of the algorithm. We use the $F_1$-measure for this purpose, which is the harmonic mean of precision and recall. Our $F_1$-measure ranges from for six videos (Fig. 20, Table 1) for a nonzero intersection over union threshold. The best known $F_1$-measure was obtained by Bargoti and Underwood (2016). As the datasets are different, and our method does not maximize detection for a single image, a direct comparison is not possible. In the next section, we investigate the performance of our counting algorithm.
Table 1: $F_1$-measure of the semisupervised and user-supervised models for the six videos.
Datasets  Semisupervised  User-supervised
(Sunny)  .9662  .9658
(Shady)  .9585  .9609
(Sunny)  .9738  .9711
(Shady)  .9541  .9592
(Sunny)  .9774  .9775
(Sunny)  .8931  .9710
5.4 Performance Evaluation of the Counting Method
In this section, we quantify the performance of our counting method. We investigate how the algorithm (per-frame counting) performs on individual segmented images and how the overall performance (merging counts across frames) compares to human-perceived counts and ground truth. Note that, following the segmentation step, all the bounding boxes above the nominal nonzero intersection over union threshold are used in the counting step.
For evaluating both per-frame counting and merging, we utilize the videos collected from the validation datasets (Dataset1 to Dataset4). We start with the per-frame counting method.
5.4.1 Evaluation of Per-frame Counting Method
To evaluate the performance of the per-frame counting algorithm, we took all the segmented images from the seven videos collected from the four datasets. Afterward, we performed a connected component analysis on them, randomly selected components and marked each one with the count perceived by a human. These counts are then compared to the counts obtained from the algorithm. At this stage, we want the segmented images to be accurate and consequently, we use the user-supervised model for segmentation.

Essentially, we have three key insights from this experiment. First, it is evident from the confusion matrix (Fig. 21) that recall drops with increasing cluster size (varies from ) but precision stays over for any cluster size (varies from ). Second, for a large portion of the data, namely single apples ( of entire data (Fig. 21)), the precision and recall of our algorithm are and respectively. Consequently, the overall accuracy of our method is high () (shown in the bottom-right cell of the confusion matrix). Third, low recall rates for larger clusters do not affect the overall performance. As the next section shows, by utilizing the high precision and multiple views we achieve good recall on the entire data (Fig. 22).
Next, we quantify the effect of lighting conditions (sunny, shady, cloudy etc.) on the counting method. We computed the accuracy of the per-frame counting method across all the collected videos (which were collected in different lighting conditions). Our counting accuracy varied from . Undercounting percentage varies from and overcounting varies from . These results are presented in Fig. 21. In the next section, we evaluate the performance of merging the apple counts using camera motion.
5.4.2 Merging Counts Across Multiple Frames
To verify the performance of the merging method, we utilize the manual annotations. We treat the manually annotated fruits as human perceived ground truth. Afterward, we track these fruits across frames using camera motion (3D camera poses) to avoid double counting and find the number of unique apples. The counts obtained in this manner are then compared to the counts from our merging algorithm.
First, we quantify the amount of overcounting due to approximating the camera motion using homography. We performed this evaluation by simply comparing the number of unique manually labeled apples obtained from homography to the number of unique apples obtained by utilizing full 3D camera motion. The camera motions were computed using the commercial photogrammetry software Agisoft PhotoScan (Agisoft (2017)). We found that, across different datasets, using homography can lead to overcounting. Fig. 22 shows these results.
Next, we evaluate the accuracy of the merged counts (tracking by homography) with respect to human-perceived counts (tracking by 3D camera poses). To understand the importance of user interaction, we perform this analysis for both semisupervised and user-supervised models (Fig. 22). Our accuracy with respect to human-perceived ground truth varies from for the user-supervised model and for the semisupervised model. The drop in accuracy for the semisupervised model was propagated from the segmentation phase. The main takeaways from these results are: 1) accurate segmentation is very important for obtaining correct counts, and 2) depending on the geometry of the environment and lighting conditions, we count of the visible apples from a single side. In the next section, we investigate how the single-side counts correlate with the actual yield.
5.5 Yield Estimation
Our original goal is to get an accurate yield estimate. For that, we need to correlate the counts from our algorithm (from a single side of a row) with the actual ground truth (number of harvested fruits). Toward this, we first try to find a correlation between the number of visible apples from a single side and the actual yield. As mentioned earlier, we determine the total number of visible apples by tracking the manually labeled apples from a single side, utilizing estimated camera motion. For this step, we use the datasets where videos from both sides were collected (Dataset1, Dataset3, Dataset4). From Fig. 24, it is evident that the number of visible apples from a single side varies greatly across different datasets (). This is expected, as the orchard from which we collected the data is not well trimmed and the size and shape of the trees vary significantly even within the same row.
Another simple solution is adding the apple counts from both sides and finding a correlation with the actual yield. Fig. 24 shows these results and again, the summed yields vary considerably across datasets (). Therefore, to get a useful estimate we need to merge the fruit counts from both sides of the tree rows. Next, we discuss this procedure in detail.
To merge counts from both sides, we utilize the 3D geometry of the environment. We reconstruct each side of the row using the captured images. We use the photogrammetry software Agisoft PhotoScan (Agisoft (2017)) for this purpose. Afterward, we merge the reconstructions from both sides using semantic constraints (Dong and Isler (2018); Roy et al. (2018)). Fig. 23 shows an example (merged reconstruction for Dataset1).
Next, we detect the fruits using our segmentation method (Section 3) and backproject the detected fruits in the images to obtain the fruit location in the 3D reconstruction. Fig. 25 shows an example. We perform a connected component analysis to detect the apple clusters in 3D. Then we project individual 3D clusters back to the images by utilizing the recovered camera motion. We count the fruits from these reprojected images using our counting method developed in Section 4. A 3D cluster can be tracked over many frames. We choose three frames with the highest amount of detected apple pixels (from the 3D cluster) and report the median count of these three frames as the fruit count for the cluster. We follow this procedure for all the detected 3D clusters and aggregate the fruit count from a single side.
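The per-cluster frame selection described above (median count over the three frames with the most detected apple pixels) can be sketched as follows. The input format, a list of `(apple_pixels, count)` pairs per 3D cluster, and the function name are assumptions made for illustration.

```python
from statistics import median

def cluster_count(observations, top_k=3):
    """Median count over the top_k frames with the most detected apple pixels.

    observations: list of (apple_pixels, count) pairs, one per frame in which
    the 3D cluster was reprojected (hypothetical input format).
    """
    best = sorted(observations, key=lambda obs: obs[0], reverse=True)[:top_k]
    return median(count for _, count in best)

# One cluster seen in five frames; the three largest views hold 4, 5 and 4 apples.
views = [(120, 3), (300, 4), (50, 1), (280, 5), (200, 4)]
single_cluster_estimate = cluster_count(views)
```

Restricting the median to the frames with the most apple pixels favors views in which the cluster is least occluded. Summing `cluster_count` over all 3D clusters gives the single-side count.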
To merge the counts from both sides, we compute the intersection of the connected components from both sides. Afterward, we compute the total counts by using the inclusion-exclusion principle (Andreescu and Feng (2004)). Essentially, we sum up the counts from all the connected components, compute the intersection areas among them and add or subtract the weighted parts accordingly. Fig. 26 shows our result. The counting accuracies from both sides for Dataset1, Dataset3 and Dataset4 are respectively. Compared to the merged counts from both sides, if we just add the single-side counts we overcount significantly ( for Dataset1, Dataset3, and Dataset4 respectively). Table 2 summarizes the final yield results. It indicates that merging the counts from both sides is essential to obtain an accurate yield estimate.
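For two sides, the inclusion-exclusion principle reduces to the identity below. The symbols are introduced here only for illustration: $A$ denotes the set of apples counted from one side of the row and $B$ the set counted from the other side.

```latex
|A \cup B| = |A| + |B| - |A \cap B|
```

Simply adding the single-side counts corresponds to dropping the last term, which is why it overcounts whenever some apples are visible from both sides. In the merging step, $|A \cap B|$ is not observed directly but estimated from the intersections of the connected components.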
Harvested fruit counts  Merged fruit counts from both sides  Sum of fruit counts from single sides  

Dataset1  ()  ()  
Dataset3  ()  ()  
Dataset4  ()  () 
6 Conclusion and Future Work
In this paper, we presented a complete yield estimation system for apple orchards from monocular images. From a purely technical point of view, our main contributions are a semisupervised clustering method relying on color for identifying the apples and an unsupervised clustering method relying on shape to estimate the number of apples in a cluster. We verified the performance of our algorithms on multiple small datasets. Results indicate that these algorithms perform well in practice and outperform most of the existing methods in terms of detection and counting accuracy.
As reported in Section 5.4, we count of the visible apples from a single side of the row in different datasets. To be of practical use though, we needed to correlate these single-side counts with the harvested yield. With the help of our recent work (Dong and Isler (2018); Roy et al. (2018)), we merged the fruit counts from both sides of fruit tree rows. Our method achieved a varying accuracy of across different datasets.
In the future, we would like to couple our system more closely with the 3D geometry of the environment. We would like to develop techniques to localize each individual apple in a cluster, find the pose of the fruit and measure its diameter.
Acknowledgement
The authors thank Joshua Anderson and Professors Emily Hoover and Cindy Tong from the Department of Horticultural Science, University of Minnesota, for their expertise and help with the experiments. This work is supported in part by NSF grant # 1317788, USDA NIFA MIN98G02 and the MnDrive initiative.
References
 Achanta et al. (2012) Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11):2274–2282
 Agisoft (2017) Agisoft L (2017) Agisoft PhotoScan. http://www.agisoft.com/, accessed: 2017-09-15
 Andreescu and Feng (2004) Andreescu T, Feng Z (2004) Inclusion-exclusion principle. In: A Path to Combinatorics for Undergraduates, Springer, pp 117–141
 Baker and Matthews (2004) Baker S, Matthews I (2004) Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision 56(3):221–255
 Bargoti and Underwood (2016) Bargoti S, Underwood JP (2016) Image segmentation for fruit detection and yield estimation in apple orchards. CoRR abs/1610.08120
 Bargoti and Underwood (2017) Bargoti S, Underwood JP (2017) Image segmentation for fruit detection and yield estimation in apple orchards. Journal of Field Robotics
 Bilmes et al. (1998) Bilmes JA, et al. (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute 4(510):126
 Changyi et al. (2015) Changyi X, Lihua Z, Minzan L, Yuan C, Chunyan M (2015) Apple detection from apple tree image based on BP neural network and Hough transform. International Journal of Agricultural and Biological Engineering 8(6):46–53
 Chen and Gopalakrishnan (1998) Chen SS, Gopalakrishnan PS (1998) Clustering via the Bayesian information criterion with applications in speech recognition. In: Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, IEEE, vol 2, pp 645–648
 Chen et al. (2017) Chen SW, Shivakumar SS, Dcunha S, Das J, Okon E, Qu C, Taylor CJ, Kumar V (2017) Counting apples and oranges with deep learning: A data-driven approach. IEEE Robotics and Automation Letters 2(2):781–788
 Chuang et al. (2001) Chuang YY, Curless B, Salesin DH, Szeliski R (2001) A Bayesian approach to digital matting. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, IEEE, vol 2, pp II–II
 Das et al. (2015) Das J, Cross G, Qu C, Makineni A, Tokekar P, Mulgaonkar Y, Kumar V (2015) Devices, systems, and methods for automated monitoring enabling precision agriculture. In: Proceedings of IEEE Conference on Automation Science and Engineering
 Dong and Isler (2018) Dong W, Isler V (2018) Tree morphology for phenotyping from semantics-based mapping in orchard environments. Tech. rep., Technical Report TR18006, University of Minnesota, Computer Science & Engineering Department
 Eshel and Moses (2008) Eshel R, Moses Y (2008) Homography based multiple camera detection and tracking of people in a dense crowd. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp 1–8
 Gauch and Hsia (1992) Gauch JM, Hsia CW (1992) Comparison of three-color image segmentation algorithms in four color spaces. In: Applications in optical science and engineering, International Society for Optics and Photonics, pp 1168–1181
 Goldberger et al. (2003) Goldberger J, Gordon S, Greenspan H, et al. (2003) An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In: ICCV, vol 3, pp 487–493
 Gongal et al. (2016) Gongal A, Silwal A, Amatya S, Karkee M, Zhang Q, Lewis K (2016) Apple crop-load estimation with over-the-row machine vision system. Computers and Electronics in Agriculture 120:26–35
 Grunwald (2004) Grunwald P (2004) A tutorial introduction to the minimum description length principle. arXiv preprint math/0406077
 Hariharan et al. (2014) Hariharan B, Arbeláez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, Springer, pp 297–312
 Hartigan and Wong (1979) Hartigan JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society Series C (Applied Statistics) 28(1):100–108
 Hecht-Nielsen (1989) Hecht-Nielsen R (1989) Theory of the backpropagation neural network. In: Neural Networks, 1989. IJCNN., International Joint Conference on, IEEE, pp 593–605
 Hung et al. (2015) Hung C, Underwood J, Nieto J, Sukkarieh S (2015) A feature learning based approach for automated fruit yield estimation. In: Field and Service Robotics, Springer, pp 485–498
 Jimenez et al. (2000) Jimenez A, Ceres R, Pons J (2000) A survey of computer vision methods for locating fruit on trees. Transactions of the ASAE-American Society of Agricultural Engineers 43(6):1911–1920
 Linker et al. (2012) Linker R, Cohen O, Naor A (2012) Determination of the number of green apples in RGB images recorded in orchards. Comput Electron Agric 81:45–57
 Liu et al. (2016) Liu X, Zhao D, Jia W, Ruan C, Tang S, Shen T (2016) A method of segmenting apples at night based on color and position information. Computers and Electronics in Agriculture 122:118–123
 Long et al. (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3431–3440
 Lowe (1999) Lowe DG (1999) Object recognition from local scale-invariant features. In: Computer vision, 1999. The proceedings of the seventh IEEE international conference on, IEEE, vol 2, pp 1150–1157
 Pedersen (2007) Pedersen SJK (2007) Circular Hough transform. Aalborg University, Vision, Graphics, and Interactive Systems 123
 Roy and Isler (2016) Roy P, Isler V (2016) Surveying apple orchards with a monocular vision system. In: Automation Science and Engineering (CASE), 2016 IEEE International Conference on, IEEE, pp 916–921
 Roy and Isler (2017) Roy P, Isler V (2017) Vision-based apple counting and yield estimation. In: Kulić D, Nakamura Y, Khatib O, Venture G (eds) 2016 International Symposium on Experimental Robotics, Springer International Publishing, Cham, pp 478–487
 Roy et al. (2018) Roy P, Dong W, Isler V (2018) Registering reconstructions of the two sides of fruit tree rows. Tech. rep., Technical Report TR18008, University of Minnesota, Computer Science & Engineering Department
 Ruzon and Tomasi (2000) Ruzon MA, Tomasi C (2000) Alpha estimation in natural images. In: Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, IEEE, vol 1, pp 18–25
 Senthilnath et al. (2016) Senthilnath J, Dokania A, Kandukuri M, Ramesh K, Anand G, Omkar S (2016) Detection of tomatoes using spectral-spatial methods in remotely sensed RGB images captured by UAV. Biosystems Engineering
 Silwal et al. (2014) Silwal A, Gongal A, Karkee M (2014) Apple identification in field environment with over the row machine vision system. Agricultural Engineering International: CIGR Journal 16(4):66–75
 Sinha et al. (2014) Sinha SN, Steedly DE, Szeliski RS (2014) Multistage linear structure from motion. US Patent 8,837,811
 Stein et al. (2016) Stein M, Bargoti S, Underwood J (2016) Image based mango fruit detection, localisation and yield estimation using multiple view geometry. Sensors 16(11):1915
 Tabb et al. (2006) Tabb AL, Peterson DL, Park J (2006) Segmentation of apple fruit from video via background modeling. In: 2006 ASAE Annual Meeting, American Society of Agricultural and Biological Engineers, p 1
 Wang et al. (2013) Wang Q, Nuske S, Bergerman M, Singh S (2013) Automated crop yield estimation for apple orchards. In: Desai JP, Dudek G, Khatib O, Kumar V (eds) Experimental Robotics, Springer Tracts in Advanced Robotics, vol 88, Springer International Publishing, pp 745–758
 Zhou et al. (2012) Zhou R, Damerow L, Sun Y, Blanke MM (2012) Using colour features of cv. ‘Gala’ apple fruits in an orchard in image processing to predict yield. Precision Agriculture 13(5):568–580