# Joint Multi-Leaf Segmentation, Alignment, and Tracking for Fluorescence Plant Videos

###### Abstract

This paper proposes a novel framework for fluorescence plant video processing. The plant research community is interested in the leaf-level photosynthetic analysis within a plant. A prerequisite for such analysis is to segment all leaves, estimate their structures, and track them over time. We identify this as a joint multi-leaf segmentation, alignment, and tracking problem. First, leaf segmentation and alignment are applied on the last frame of a plant video to find a number of well-aligned leaf candidates. Second, leaf tracking is applied on the remaining frames with leaf candidate transformation from the previous frame. We form two optimization problems with shared terms in their objective functions for leaf alignment and tracking respectively. A quantitative evaluation framework is formulated to evaluate the performance of our algorithm with four metrics. Two models are learned to predict the alignment accuracy and detect tracking failure respectively in order to provide guidance for subsequent plant biology analysis. The limitation of our algorithm is also studied. Experimental results show the effectiveness, efficiency, and robustness of the proposed method.

Keywords: plant phenotyping, Arabidopsis, leaf segmentation, alignment, tracking, multi-object, Chamfer matching

## 1 Introduction

Plant phenotyping [1] refers to a set of methodologies and protocols used to measure plant growth [2], architecture [3], composition [4], etc. In contrast to manual observation-based methods, automatic image-based approaches for plant phenotyping have gained more attention recently [5, 6]. As shown in Figure 1, plant researchers conduct large-scale experiments in a chamber with controlled temperature and lighting conditions. Ceiling-mounted fluorescence cameras capture images of a plant during its growth period [7]. The pixel intensities of the image indicate the photosynthetic efficiency (PE) of the plant. Given such a high-throughput imaging system, the massive data calls for advanced visual analysis in order to study a wide range of plant physiological problems [8], e.g., the heterogeneity of PE among the leaves. Therefore, leaf-level visual analysis is fundamental to automatic plant phenotyping.

This paper focuses on the processing of rosette plants, where the leaves are at a similar height and form a circular arrangement. Our experiments are mainly conducted on Arabidopsis thaliana, which is the first plant to have its genome sequenced [9]. Due to its rapid life cycle, prolific seed production, and ease of cultivation in restricted space, Arabidopsis is the most popular and important model plant [10] in the plant research community. An automatic image analysis method for Arabidopsis, which is the main focus of this paper, is of essential importance for high-throughput plant phenotyping studies. Given a fluorescence plant video, our method performs multi-leaf Segmentation, Alignment, and Tracking (SAT) jointly. Specifically, leaf segmentation [11] segments each leaf from the plant. Leaf alignment [12] estimates the leaf structure. Leaf tracking [13] associates the leaves over time. This multi-leaf analysis is a challenging problem due to several factors. First, the image resolution is low, so small leaves are hard even for humans to recognize. Second, there are various degrees of overlap among leaves, which makes it difficult to segment each leaf boundary. Third, leaves within a plant exhibit various shapes, sizes, and orientations, which also change over time. Therefore, an effective algorithm should be developed to handle all these challenges.

To the best of our knowledge, there is no previous work focusing on leaf SAT simultaneously from plant videos. To solve this new problem, we develop two optimization algorithms. Specifically, leaf segmentation and alignment are based on Chamfer Matching (CM) [14], which is a well-known algorithm to align one object in an image with a given template. However, classical CM does not work well for aligning multiple overlapping leaves. Motivated by crowd segmentation [15], where the number and locations of the pedestrians are estimated simultaneously, we propose a novel framework to jointly align multiple leaves in an image. First, we generate a large set of leaf templates with various shapes, sizes, and orientations. Applying all templates to the edge map of a plant image leads to the same number of transformed leaf templates. We adopt the local search method for optimization to select a subset of leaf candidates that can best explain the edge map of the test image.

While leaf segmentation and alignment work well for one image, applying them to every video frame independently does not enable tracking, i.e., associating aligned leaves over time. Therefore, we formulate leaf tracking on one frame as a problem of transforming multiple aligned leaf candidates from the previous frame. The tracking optimization, initialized with the results of the previous frame, converges very fast and thus results in enhanced leaf association and computational efficiency.

In order to estimate the alignment and tracking accuracy, two quality prediction models are learned respectively. We develop a quantitative analysis with four metrics to evaluate the multi-leaf SAT performance. Furthermore, the limitation of our algorithm is studied. In summary, we make four contributions:

We identify a new computer vision problem of joint multi-leaf SAT from plant videos. We collect a dataset of Arabidopsis and make it publicly available.

We propose two optimization algorithms to solve this multi-leaf SAT problem.

We develop two quality prediction models to predict the alignment accuracy and tracking failure.

We set up a quantitative evaluation framework to jointly evaluate the performance.

Compared to our earlier work [12, 16], we have made five main changes: One term is modified in the tracking objective function. The proposed method is superior to [12, 16] on a larger dataset. We develop two quality prediction models. We enhance the performance evaluation procedure and add one metric to evaluate segmentation accuracy. We study the limitation of our tracking algorithm and show its robustness to leaf template transformation. We extend our method to apply on RGB images [5] and compare the segmentation results to [17].

## 2 Prior Work

Plant image analysis has been studied in computer graphics [18, 19, 20] and computer vision [21, 11]. For example, a leaf shape and appearance model is proposed to render photo-realistic images of a plant [18]. A data-driven leaf synthesis approach is developed to produce realistic reconstructions of dense foliage [19]. These models may not be applicable to fluorescence images due to the lack of leaf appearance information. There is prior computer vision work on tasks such as leaf segmentation [21, 11], alignment [22, 12], tracking [13, 16], and identification [23, 24, 25]. However, most previous studies focus on only one or two of these tasks. In contrast, our method addresses all three tasks of leaf SAT.

Leaf Segmentation can be classified into two categories: segmentation of a detached leaf from a natural [26, 27, 28, 29, 22] or clean background [24], and pixel-wise segmentation of each leaf from a plant [25, 17]. Methods in the first category are usually used as the first step for leaf classification or species identification. [24] uses pixel-based color classification for leaf segmentation from a white background. [25] proposes an active contour deformation method for compound leaf segmentation and identification.

Our work belongs to the second category. It is very challenging due to leaf variation and overlap. Tsaftaris et al. organized a collation study of leaf segmentation on rosette plants in 2015 [30, 31]. Our method is evaluated alongside three other methods. Two of them are based on superpixels and watershed transformation segmentation. [17] uses distance map-based leaf center detection and leaf split point detection for leaf segmentation.

Leaf Alignment aims to find the structure of a leaf, which is useful for leaf segmentation. [26] deforms a polygonal model to fit the leaf shape, where the base and tip points are used to define a leaf template. The same points are used in [22] to model leaf shapes and deform templates. Similarly, we use these two points on our leaf templates for alignment. Our novelty lies in solving leaf segmentation and alignment jointly by extending CM to align multiple potentially overlapping leaves in an image.

Leaf Tracking models leaf transformation over time. A probabilistic parametric active contour model is applied for leaf segmentation and tracking to automatically measure the temperature of leaves in [13]. However, leaves in those images are well separated without any overlap, and the active contours are initialized via the ground truth segments, which is hard to achieve in real-world applications. [32] segments all leaves in a video separately and employs a merging procedure to group the segments by exploiting the angle properties of the leaves. [33] proposes a graph-based tracking algorithm by linking leaf detections across neighboring frames. All of them treat tracking as a post-processing step after leaf segmentation on individual frames. In contrast, we employ a leaf template transformation to transfer the segmentation and alignment results between consecutive frames.

## 3 Our Method

Figure 2 shows our framework. Given a plant video, we first apply leaf segmentation and alignment on the last frame to generate a number of well-aligned leaf candidates. Leaf tracking is considered as an alignment problem with the leaf candidates initialized from a previous frame. During tracking, a leaf candidate whose size is smaller than a threshold is deleted. A new candidate is detected and added for tracking when there is a certain region of the image mask that is not covered by the existing leaf candidates. Two prediction models are learned to investigate the alignment and tracking quality respectively. All notations are summarized in Table I.

| Notation | Definition |
|---|---|
| | the edge map and mask of a test image |
| | the edge map and mask of a leaf template |
| | the edge map and mask of a transformed leaf template |
| | the distance transform image of |
| | the numbers of template shapes, sizes, and orientations |
| | the total number of leaf templates, |
| | the number of transformed leaf templates for optimization |
| | the objective functions for alignment and tracking |
| | the diagonal length of |
| | the center of a plant image |
| | the center of the leaf candidate |
| | the sets of and transformed templates |
| | a matrix collecting all from |
| | the number of pixels in the test image |
| | a -dim - indicator vector |
| | a -dim vector of CM distances in |
| | a constant value used in |
| | the number of maximum iterations in tracking |
| , | the number of estimated and labeled leaves in a frame |
| | the collection of selected leaf candidates |
| | a set of transformation parameters |
| | is the parameter for |
| | the estimated and labeled tips for one leaf |
| | the estimated and labeled tips for one frame |
| | the collections of estimated and labeled tips for all videos |
| | the collections of estimated and ground truth segmentation masks for all videos |
| | the total number of labeled leaves |
| | the tip-based error normalized by the leaf length |
| | a matrix of leaf correspondence |
| | a matrix of tip-based errors in one frame |
| | the collection of all for labeled frames |
| | the number of leaves without correspondence |
| | the tip-based errors used in Algorithm 2 |
| | a threshold for comparing with tip-based errors |
| | the performance metrics |
| | the quality to predict alignment and tracking |
| | the features to learn quality prediction models |
| | the weights used in and |
| | the step sizes in the gradient descent of and |
| | the smallest leaf size we use |

### 3.1 Multi-Leaf Segmentation and Alignment

Our segmentation and alignment algorithm consists of two steps. First, a pre-defined set of leaf templates is applied to the edge map of a test image to generate an over-complete set of transformed leaf templates. Second, we formulate an optimization process to select an optimal subset of leaf candidates.

#### 3.1.1 Candidate nomination via Chamfer matching

Chamfer Matching (CM) [14] is a well-known method used to find the best alignment between two edge maps. Let and be the edge maps of a test image and a template respectively. CM distance is computed as the average distance of each edge point in with its nearest edge point in :

(1) |

where is the number of edge points in . CM distance can be computed efficiently via a pre-computed distance transform image , which calculates the distance of each coordinate to its nearest edge point in . During the CM process, an edge template is superimposed on and the average value sampled by the template edge points equals the CM distance, i.e., .
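
As a minimal illustration of the CM distance and its distance-transform shortcut, the following Python sketch uses brute-force nearest-neighbor search on small point sets. The function names and the (x, y) tuple representation are assumptions made for this example; they are not taken from the paper's implementation, which is not shown in the text.

```python
import math


def chamfer_distance(template_pts, image_pts):
    """Average distance from each template edge point to its nearest image edge point."""
    if not template_pts or not image_pts:
        raise ValueError("both edge point sets must be non-empty")
    total = 0.0
    for ux, uy in template_pts:
        total += min(math.hypot(ux - vx, uy - vy) for vx, vy in image_pts)
    return total / len(template_pts)


def distance_transform(image_pts, width, height):
    """Brute-force distance transform: distance of every pixel to its nearest edge point."""
    return [[min(math.hypot(x - vx, y - vy) for vx, vy in image_pts)
             for x in range(width)]
            for y in range(height)]


def chamfer_from_dt(template_pts, dt):
    """Same CM distance, read off a pre-computed distance transform (integer pixel coordinates)."""
    return sum(dt[y][x] for x, y in template_pts) / len(template_pts)
```

For integer template coordinates, sampling the distance transform gives the same value as the direct computation, which is why the transform only needs to be computed once per test image.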

A fluorescence plant image is first transformed to a binary image by applying a threshold. The Sobel edge detector is applied to to generate an edge map . The goal of leaf alignment is to transform the 2D edge coordinates of a template in the leaf template space to a new set of 2D coordinates in the test image space so that the CM distance is small, i.e., the leaf template is well aligned with .

Image Warping: In our framework, two types of transformations are involved: forward and backward warping. We use an affine transformation that consists of scaling, rotation, and translation.

As shown in Figure 3, let be a forward warping function that transfers the 2D edge points from the template space to the test image space, parameterized by :

(2) |

where is the in-plane rotation angle, is the scaling factor, and are the translations along and axis respectively. is the center of the leaf, i.e., the average of all coordinates of , which is used to model the leaf scaling and rotation w.r.t. the leaf center.
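
To make the forward warping concrete, the sketch below rotates and scales template points about the leaf center and then translates them. The exact composition order in Equation 2 is an assumption here, and the names `leaf_center` and `forward_warp` are hypothetical.

```python
import math


def leaf_center(points):
    """Leaf center as the average of all edge coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def forward_warp(points, theta, scale, tx, ty):
    """Rotate by theta and scale about the leaf center, then translate by (tx, ty).

    Assumed form: w = scale * R(theta) * (u - c) + c + t.
    """
    cx, cy = leaf_center(points)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    warped = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        wx = scale * (cos_t * dx - sin_t * dy) + cx + tx
        wy = scale * (sin_t * dx + cos_t * dy) + cy + ty
        warped.append((wx, wy))
    return warped
```

Because rotation and scaling are applied relative to the center, changing only the angle or the scale leaves the leaf center in place, which keeps the four parameters loosely decoupled.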

Let be the backward warping from the image space to the template space. We denote as a matrix including all coordinates in the test image space. Thus, are the corresponding coordinates of in the template space. The purpose of this backward warping is to generate a -dim vector , which is the warped version of the original template mask .

Leaf Templates: Since there is a large variation in leaf shapes, it is infeasible to match all leaves with a single template. We manually select basic templates (the row in Figure 4) with representative leaf shapes from the plant videos and compute their individual edge map and mask . We synthesize an over-complete set of transformed templates by selecting a discrete set of and , which are expected to cover all potential leaf configurations in . This leads to an array of leaf templates where and are the numbers of leaf scales and orientations respectively (Figure 4). The yellow and green points in Figure 4 are the two labeled leaf tips , which are used to find the corresponding leaf tips in via Equation 1.

For each template , it scans through all possible locations on and the location with the minimum CM distance is selected, which provides and optimal to . Therefore, with the manually selected , , and exhaustively chosen and , transformed templates are generated from basic templates. For each transformed template, we record the 2D edge coordinates of its basic template, warped template mask, transformation parameters, CM distance and the estimated leaf tips as . Note that is an over-complete set of transformed leaf templates including the true leaf candidates as its subset. Hence, the critical question is how to select such a subset of candidates from .

#### 3.1.2 Objective function

The goal of leaf segmentation and alignment is to segment each leaf and estimate the structure precisely. If the leaf candidates are well selected, there should be no redundant or missing leaves. Each leaf candidate should be well aligned with the edge map of the test image. This rationale leads to a three-term objective function, which seeks the minimal number of leaf candidates () with small CM distances () to best cover the test image mask ().

Each image contains around leaves while the number of potential candidates in is in our case. The selection space needs to be narrowed down substantially. To do this, we compute the CM distance and the overlap ratio of each template to the test image mask. We remove leaf templates whose CM distance is larger than the average of all templates or whose overlap ratio is smaller than . Finally, we generate a new set with (a few hundred) templates. RANSAC [34] is not applicable here for two reasons. First, it is difficult to define a model or evaluation criterion for a random subset of leaf templates. Second, we have more outliers than inliers, which makes it hard to select the correct set in consensus.
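
The pruning rule above can be sketched as follows. This is an illustrative example only: the dictionary layout, the `min_overlap` parameter (the actual threshold value is not given here), and the definition of the overlap ratio as the fraction of template pixels inside the test mask are all assumptions.

```python
def overlap_ratio(template_mask, image_mask):
    """Fraction of template pixels that fall inside the image mask (assumed definition)."""
    area = sum(1 for t in template_mask if t)
    if area == 0:
        return 0.0
    inside = sum(1 for t, m in zip(template_mask, image_mask) if t and m)
    return inside / area


def prune_templates(templates, image_mask, min_overlap):
    """Drop templates with above-average CM distance or too little overlap with the mask.

    Each template is a dict with a scalar "cm" and a flat binary "mask".
    """
    mean_cm = sum(t["cm"] for t in templates) / len(templates)
    return [t for t in templates
            if t["cm"] <= mean_cm
            and overlap_ratio(t["mask"], image_mask) >= min_overlap]
```

Both tests are cheap per template, so the filter can be applied to the full over-complete set before the more expensive joint optimization.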

The objective function is defined on a -dim indicator vector , where means that the transformed template is selected and otherwise. Hence uniquely specifies a combination of transformed templates from . The first term is the number of the selected leaf candidates .

We concatenate from to form a -dim vector . The second term, i.e., the average CM distance of the selected leaf candidates, is formulated as:

(3) |

The third term is the comparison between the synthesized mask and the test image mask. As shown in Figure 5, we convert the binary mask to a -dim row vector by raster scan. Similarly, each warped template mask is also a -dim vector. The collection of from all transformed templates is denoted as a matrix . Note that is indicative of the synthesized mask except that the pixel values of the overlapping leaves are larger than . We employ the function, similar to [35, 36], to convert all elements in to be in the range of ,

(4) |

where is a constant controlling how close approximates the step function. Note that the actual step function cannot be used here since it is not differentiable and thus is difficult to optimize. The constant within the parentheses is a flip point separating where the value of will be pushed toward either or . Therefore, the third term becomes:

(5) |

Finally, our objective function is:

(6) |

where and are the weights. These three terms jointly provide guidance on what constitutes an optimal combination of leaf candidates.

#### 3.1.3 Local search method for optimization

Equation 6 is a pseudo-Boolean function. The basic algorithm [37] is not applicable because our objective cannot be written in the required polynomial form. We adopt the widely used local search method to optimize Equation 6. The local search algorithm [38] for pseudo-Boolean functions iteratively searches a small neighborhood of and updates to the neighbor that leads to a smaller function value.

First, all elements in are initialized as , i.e., all transformed templates are selected. We fix one element in at each iteration by searching the neighborhood of with the element being or , denoted as and . According to the proposition in [38], a positive gradient indicates in the corresponding element of the local optimal solution. Therefore, in each iteration, we select the element with the maximum gradient to remove redundant leaf templates. The gradient of the objective w.r.t. is:

(7) |

where is a function returning the sign of each element, and is the element-wise division of vectors. In each iteration, is updated by . The element with the largest gradient is chosen and fixed to be or based on the smaller value of and . Once this element is fixed, its value remains unchanged in the future iterations. The total number of iterations is the number of transformed leaf templates . Finally, those elements in equal to provide the combination of leaf candidates.
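
The fix-one-element-per-iteration procedure can be illustrated with the toy sketch below. It departs from the text in two labeled ways: it ranks elements by the finite difference between the on and off states instead of the analytic gradient of Equation 7, and its toy objective uses a hard coverage test and a count term normalized by the number of templates. All names and weights are hypothetical.

```python
def make_objective(cm, masks, image_mask, lam1, lam2):
    """Toy three-term objective: candidate count, mean CM distance, mask mismatch."""
    n, k = len(cm), len(image_mask)

    def objective(x):
        selected = [i for i in range(n) if x[i]]
        count_term = len(selected) / n
        cm_term = sum(cm[i] for i in selected) / len(selected) if selected else 0.0
        mismatch = 0
        for p in range(k):
            covered = 1 if any(masks[i][p] for i in selected) else 0
            mismatch += (covered - image_mask[p]) ** 2
        return count_term + lam1 * cm_term + lam2 * mismatch / k

    return objective


def local_search(objective, n):
    """Start with every template selected; fix one element per iteration."""
    x = [1] * n
    free = set(range(n))
    while free:
        best = None
        for i in sorted(free):
            j_on = objective(x[:i] + [1] + x[i + 1:])
            j_off = objective(x[:i] + [0] + x[i + 1:])
            score = j_on - j_off  # finite-difference stand-in for the gradient
            if best is None or score > best[0]:
                best = (score, i, j_on, j_off)
        _, i, j_on, j_off = best
        x[i] = 1 if j_on < j_off else 0  # keep whichever state gives the smaller value
        free.remove(i)
    return x
```

In a small case with two true leaves, one duplicate template, and one template outside the mask, the search removes the outlier first, then the duplicate, and keeps the two templates that jointly cover the mask.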

This joint leaf segmentation and alignment is applied on the last frame of a plant video to generate leaf candidates that are used for tracking in the remaining video frames. We denote the set of leaf candidates selected from as , which means the basic leaf template is transformed by to result in a leaf candidate that is well-aligned with the edge map.

### 3.2 Multi-Leaf Tracking Algorithm

Leaf tracking aims to assign the same leaf ID to the same leaf through an entire video. In order to track all leaves over time, one way is to apply leaf segmentation and alignment framework on every frame of the video and then build leaf correspondence between consecutive frames. However, the leaf tracking consistency is an issue due to the potentially different leaf segmentation results on different frames. Therefore, we form an optimization problem for leaf tracking based on template transformation.

#### 3.2.1 Objective function

Similar to Equation 6, we formulate a three-term objective function parameterized by a set of transformation parameters , where is the transformation parameters for leaf candidate .

First, is updated so that the transformed leaf candidates are well aligned with the edge map . The first term is computed as the average CM distance of the transformed leaf candidates:

(8) |

The second term encourages the synthesized mask from all transformed candidates to be similar to the test mask . Denoting the synthesized mask of one transformed leaf candidate as , we formulate the second term as:

(9) |

One property of rosette plants such as Arabidopsis is that the long axes of most leaves point toward the center of the plant. To take advantage of this domain-specific knowledge, the third term encourages the rotation angle to be similar to the direction of the leaf center to the plant center. Figure 6 shows the geometric relation of the angle difference, which can be computed as , where and are the geometric centers of a plant and a leaf, i.e., the average coordinates of all points in and respectively, is the distance between the leaf center and the plant center, and is the rotation angle. Furthermore, since this property is more dominant for leaves far away from the plant center, we weight the above angle difference by and normalize it by the image size. The third term is the average weighted angle difference:

(10) |

Finally, the objective function is formulated as:

(11) |

where and are the weights.

Note the differences in two objective functions and . Since the number of leaves is fixed for tracking, is not needed in the formulation of . The number of leaves is relatively small during tracking. Therefore, is not needed since the synthesized mask is already comparable to the test image mask.

#### 3.2.2 Gradient descent optimization

Given the objective function in Equation 11, our goal is to minimize it by estimating , i.e., . Since involves texture warping, it is a nonlinear optimization problem without a closed-form solution. We use gradient descent to solve this problem. The derivative of w.r.t. can be written as:

(12) |

where and are the gradient images of at and axis respectively. These two gradient images only need to be computed once for each frame. and can be easily computed from Equation 2 w.r.t. , , and separately.

Similarly, the derivative of w.r.t. is:

(13) |

where and are the gradient images of the template mask at and axis respectively. and can be computed based on the inverse function of Equation 2.

The derivative of w.r.t. is more complex than those w.r.t. the other three transformation parameters. For clarity, we only present the derivative over :

(14) |

During optimization, is initialized as the transformation parameters of the leaf candidates from the previous frame and updated as for each leaf at iteration . Note that this is a multi-leaf joint optimization problem because the computation of involves all leaf candidates. The optimization stops when does not decrease or it reaches the maximum iteration .
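
The update and stopping rules can be sketched generically as below. This is a simplified illustration, not the tracking optimizer itself: it uses a central-difference numerical gradient in place of the analytic derivatives of Equations 12 to 14, and the function names and per-parameter step sizes are assumptions.

```python
def numerical_gradient(f, p, eps=1e-5):
    """Central-difference gradient of a scalar function f at parameter list p."""
    grad = []
    for i in range(len(p)):
        hi = p[:i] + [p[i] + eps] + p[i + 1:]
        lo = p[:i] + [p[i] - eps] + p[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_descent(f, p0, step_sizes, max_iter):
    """Descend until the objective stops decreasing or max_iter is reached."""
    p = list(p0)
    best = f(p)
    for _ in range(max_iter):
        grad = numerical_gradient(f, p)
        candidate = [pi - a * gi for pi, a, gi in zip(p, step_sizes, grad)]
        value = f(candidate)
        if value >= best:  # no decrease: stop and keep the previous parameters
            break
        p, best = candidate, value
    return p, best
```

Initializing `p0` from the previous frame's result places the starting point close to the optimum, which is the reason the text gives for fast convergence.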

#### 3.2.3 Leaf candidates update

Given a multi-day plant video, we apply leaf segmentation and alignment algorithm on the last frame to generate and employ the leaf tracking toward the first frame. Due to plant growth and leaf occlusion, the number of leaves may vary throughout the video. If the area of any leaf candidate at one frame is less than a threshold (defined as the number of pixels), we remove it from the leaf candidates.

On the other hand, a new leaf candidate can be detected and added to . To do this, we compute the synthesized mask of all leaf candidates and subtract it from the test image mask to generate a residue image for each frame. Connected component analysis is applied to find components that are larger than . We then apply a subset of leaf templates to find a leaf candidate based on the edge map of the residue image. The new candidate is assigned an existing leaf ID if its overlap with a previously disappeared leaf is larger than a threshold. Otherwise it is assigned a new leaf ID. The new candidate is added into and tracked in the remaining frames.
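
The residue image and the connected component step can be sketched as follows. The 4-connectivity choice, the nested-list mask representation, and the names `residue_mask` and `large_components` are assumptions made for this example.

```python
from collections import deque


def residue_mask(image_mask, synthesized_mask):
    """Pixels in the test mask that no current leaf candidate covers."""
    return [[1 if m and not s else 0 for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(image_mask, synthesized_mask)]


def large_components(mask, min_area):
    """Return 4-connected components with more than min_area pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            component = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width \
                            and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > min_area:
                components.append(component)
    return components
```

Each returned component marks a region where template matching could be rerun on the residue edge map to propose a new leaf candidate.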

### 3.3 Quality Prediction

While many algorithms strive for perfect results, it is inevitable that unsatisfactory or failed results are obtained on challenging samples. It is critical for an algorithm to be aware of this situation so that future analysis does not rely on poor results. One approach to achieve this goal is to perform quality prediction for the task, similar to quality estimation for fingerprint [39] and face [40]. The key tasks in our work are leaf alignment (estimating the two tips of a leaf) and leaf tracking (keeping leaf identity consistent over time). Therefore, we learn two quality prediction models to predict the alignment accuracy and detect the tracking failure respectively. The prediction can be used to select a subset of leaves with high quality for subsequent plant biology analysis [41].

#### 3.3.1 Alignment quality

Suppose is the alignment accuracy of a leaf, which indicates how well the two tips are aligned. We envision what factors may influence the estimation of the two tips. First, the CM distance indicates how well the template and the test image are aligned. Second, a well-aligned leaf candidate should have large overlap with the test image mask and small overlap with the neighboring leaves. Third, the leaf area, angle, and distance to the plant center may influence the alignment result. Therefore, we extract a -dim feature vector including: the CM distance , the overlap ratio with the test image mask , the overlap ratio with the other leaves , the area normalized by test image mask , the angle difference and the distance to the plant center . A linear regression model is learned by optimizing the following objective on training leaves with ground truth , which is proportional to the alignment error (details in Sec. ).

(15) |

where is a -dim weighting vector to predict the alignment accuracy of each leaf.

#### 3.3.2 Tracking quality

Due to the limitation of our algorithm, it is possible that one leaf might diverge to the location of an adjacent leaf and result in tracking inconsistency. We call this a tracking failure. One example is shown in Figure 7, where labeled leaf has been assigned two different IDs ( and ) during tracking. The change happens from frame to frame . The goal of tracking quality prediction is to detect the moment when tracking starts to fail. We denote tracking quality as , where means a tracking failure of one leaf and means tracking success.

Similar to Section 3.3.1, we first extract a -dim feature vector for one leaf. However, alone cannot predict the tracking failure because it does not include temporal information. So we compare the features of one frame with those of a reference frame , which is frames before . Since a tracking failure may result in abnormal changes in leaf area, angle, and distance to the center, we compute the leaf angle difference, leaf center distance, and leaf overlap ratio between the current and the reference frame. Finally, we form a -dim feature vector denoted as , including , , the leaf angle difference , the leaf center distance , and the leaf overlap ratio . Given a training set with , an SVM classifier is learned as the tracking quality model.
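
The three temporal features can be sketched as below. The exact definitions are not spelled out in the text, so this example assumes a wrapped absolute angle difference, a Euclidean center distance, and an intersection-over-union overlap ratio; the dictionary keys are hypothetical.

```python
import math


def angle_difference(a, b):
    """Smallest absolute difference between two angles in radians, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def temporal_features(current, reference):
    """Compare one leaf in the current frame with the same leaf in a reference frame.

    Each leaf is a dict with "angle" (radians), "center" (x, y), and "pixels" (set of (x, y)).
    """
    d_angle = angle_difference(current["angle"], reference["angle"])
    d_center = math.hypot(current["center"][0] - reference["center"][0],
                          current["center"][1] - reference["center"][1])
    union = current["pixels"] | reference["pixels"]
    inter = current["pixels"] & reference["pixels"]
    overlap = len(inter) / len(union) if union else 0.0  # assumed IoU definition
    return [d_angle, d_center, overlap]
```

A sudden jump in the first two values together with a drop in the third is the kind of abnormal change that a classifier trained on these features would be expected to pick up.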

## 4 Performance Evaluation

Leaf segmentation is to segment each leaf from the image. Leaf alignment is to correctly estimate two tips of each leaf. Leaf tracking is to keep the leaf ID consistent over the video. In order to quantitatively evaluate the performance of joint multi-leaf SAT, we need to provide the ground truth of the pixel-level leaf segments in each frame, the two tips of each leaf, and the leaf IDs for all leaves in the video.

As shown in Figure 7, we label the two tips of each leaf and manually assign their IDs in several frames of one video. We record the label results in one frame as a matrix , where is the number of labeled leaves and records tip coordinates of leaf in this frame. The collection of all labeled frames in all videos is denoted as , where , is the number of labeled videos and is the number of labeled frames in each video. The total number of labeled leaves in is .

During template transformation, the corresponding points of the transformed template tips in become the estimated leaf tips . The leaf ID is assigned in the last frame starting from to the total number of selected leaves and kept the same during tracking. Similar to the data structure of , the tracking results of all videos over the labeled frames are written as . Given and , Algorithm 2 provides our detailed performance evaluation, which is also illustrated by a synthetic example in Figure 7.

There are two concepts involved: frame-to-frame and video-to-video correspondence. For each estimated leaf, we need to find one corresponding leaf in the labeled frame. Frame-to-frame correspondence aims to assign a unique leaf ID to each leaf in the frame so that the IDs are consistent with our labeled IDs. As mentioned before, the frame-to-frame correspondence may not be consistent over the whole video: due to tracking failures, more than one leaf ID can be assigned to the same leaf in the video. Video-to-video correspondence aims to assign a consistent and unique leaf ID to the same leaf in the entire video.

We start by building frame-to-frame leaf correspondence, as in Algorithm 1 and the red dotted box in Figure 7. To build the leaf correspondence of estimated leaves with labeled leaves, a matrix is computed, which records the tip-based errors between the tips of each estimated leaf and the tips of every labeled leaf, normalized by the labeled leaf length:

(16) |

We build the leaf correspondence by finding a number of minimum errors in that do not share columns or rows, which results in leaf pairs and leaves without correspondence. Finally, it outputs the number of unmatched leaves , recording tip-based errors and recording the leaf correspondence. This frame-to-frame correspondence is built on all frames and the results are added into and . We build the video-to-video leaf correspondence using the accumulated . and are the tip-based errors of leaf pairs with frame-to-frame and video-to-video correspondence respectively. The difference of and is from estimated leaf . While it is well aligned with labeled leaf in frame , it does not have leaf correspondence in all frames together.
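
The frame-to-frame matching step can be sketched with a greedy selection of minimum errors. Two points are assumptions here: the tip-based error is taken as the mean distance of the two tips divided by the labeled leaf length, and the unmatched count is the size difference between the estimated and labeled sets. Function names are hypothetical.

```python
import math


def tip_error(estimated, labeled):
    """Mean tip distance normalized by the labeled leaf length (assumed form)."""
    (e1, e2), (l1, l2) = estimated, labeled
    length = math.dist(l1, l2)
    return (math.dist(e1, l1) + math.dist(e2, l2)) / (2 * length)


def error_matrix(estimated_leaves, labeled_leaves):
    """Rows index estimated leaves, columns index labeled leaves."""
    return [[tip_error(e, l) for l in labeled_leaves] for e in estimated_leaves]


def greedy_correspondence(errors):
    """Pick minimum errors that share no row or column; return pairs and unmatched count."""
    n_est = len(errors)
    n_lab = len(errors[0]) if errors else 0
    entries = sorted((errors[i][j], i, j) for i in range(n_est) for j in range(n_lab))
    used_rows, used_cols, pairs = set(), set(), []
    for err, i, j in entries:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j, err))
            used_rows.add(i)
            used_cols.add(j)
    return pairs, abs(n_est - n_lab)
```

Greedy selection is simple and matches the description of repeatedly taking the smallest remaining error; an optimal assignment solver would be a different technique and is not what is sketched here.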

Finally, we compute three metrics by varying a threshold . Unmatched leaf rate is the percentage of unmatched leaves w.r.t. the total number of labeled leaves . is attributed to two sources, leaves without correspondence and correspondent leaves with tip-based errors larger than . Landmark error is the average of all tip-based errors in that are smaller than . Tracking consistency is the percentage of leaf pairs whose tip-based errors in are smaller than w.r.t. . These three metrics jointly estimate the accuracy in leaf counting (), alignment (), and tracking ().
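
Under the definitions above, the three metrics can be computed as in the following sketch. The argument names are hypothetical, results are returned as fractions instead of percentages, and errors exactly equal to the threshold are treated as unmatched, which is an assumption.

```python
def sat_metrics(frame_errors, video_errors, n_without_match, n_labeled, tau):
    """Unmatched leaf rate, landmark error, and tracking consistency at threshold tau.

    frame_errors: tip-based errors of pairs from frame-to-frame correspondence.
    video_errors: tip-based errors of pairs from video-to-video correspondence.
    n_without_match: number of leaves that received no correspondence at all.
    n_labeled: total number of labeled leaves.
    """
    good = [e for e in frame_errors if e < tau]
    too_large = len(frame_errors) - len(good)
    unmatched_rate = (n_without_match + too_large) / n_labeled
    landmark_error = sum(good) / len(good) if good else float("nan")
    tracking_consistency = sum(1 for e in video_errors if e < tau) / n_labeled
    return unmatched_rate, landmark_error, tracking_consistency
```

Sweeping `tau` over a range of values traces out how the three quantities trade off, since a larger threshold lowers the unmatched leaf rate but admits larger errors into the landmark average.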

In order to quantitatively evaluate the segmentation accuracy, we annotate each image to generate a leaf segmentation mask where the pixels of the same leaf are assigned with the same number over the video. We add the metric “Symmetric Best Dice” (SBD) [5] to compute the similarity between the estimated and the ground truth segmentation masks. It is averaged across all labeled frames. These four metrics are used to evaluate the performance of our joint framework.
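
A minimal sketch of Symmetric Best Dice for a single frame follows, based on its standard definition: for each leaf in one labeling, take the best Dice score against any leaf in the other labeling, average these, and take the minimum over both directions. The flat label lists with 0 as background are an assumed input format.

```python
def best_dice(labels_a, labels_b):
    """Average, over leaves in labels_a, of the best Dice score against leaves in labels_b."""
    ids_a = set(labels_a) - {0}
    ids_b = set(labels_b) - {0}
    if not ids_a:
        return 0.0
    total = 0.0
    for a in ids_a:
        pix_a = {k for k, v in enumerate(labels_a) if v == a}
        best = 0.0
        for b in ids_b:
            pix_b = {k for k, v in enumerate(labels_b) if v == b}
            dice = 2 * len(pix_a & pix_b) / (len(pix_a) + len(pix_b))
            best = max(best, dice)
        total += best
    return total / len(ids_a)


def symmetric_best_dice(estimated, ground_truth):
    """SBD: the smaller of the two directional best-Dice averages."""
    return min(best_dice(estimated, ground_truth), best_dice(ground_truth, estimated))
```

Taking the minimum of the two directions penalizes both over-segmentation and under-segmentation, since either one lowers the score in at least one direction.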

## 5 Experiments and Results

### 5.1 Dataset and Templates

Our dataset includes Arabidopsis thaliana videos taken in a -day period, which is sufficient to model the plant growth [42]. Each video has frames, with the image resolution ranging from to . For each video, we label the two tips of all visible leaves and segmentation masks of frames, each being the middle frame of a day. In total we labeled leaves. We select videos to form the training set for template generation and parameter tuning. The remaining videos are used for testing. The collections of all labeled tips and segmentation masks are denoted as and , respectively.

To generate leaf templates, we select leaves with representative shapes and label the two tips for each leaf, as in Figure 4.
We select scales for each leaf shape to guarantee the scaled templates can cover all possible leaf sizes in the dataset.
For each scaled leaf template, we rotate it every in the space.
Finally, the total number of leaf templates is with . (The dataset, labels, and templates are publicly available at http://cvlab.cse.msu.edu/project-plant-vision.html.)
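The template set is the Cartesian product of shapes, scales, and orientations, which a short sketch can make concrete. The code below is only an illustration: it assumes that rotations cover the full circle, represents each template by a list of 2-D points, and uses hypothetical names (`transform`, `build_templates`) and placeholder values rather than the settings of this paper.

```python
import math
from itertools import product

def transform(points, scale, angle_deg):
    """Scale and rotate 2-D points about the origin."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(scale * (c * x - s * y), scale * (s * x + c * y)) for x, y in points]

def build_templates(shapes, scales, angle_step_deg):
    """One template per (shape, scale, orientation) combination."""
    angles = range(0, 360, angle_step_deg)  # assumed full-circle coverage
    return [
        {"shape": h, "scale": sc, "angle": ang, "points": transform(pts, sc, ang)}
        for (h, pts), sc, ang in product(enumerate(shapes), scales, angles)
    ]
```

The number of templates returned equals the number of shapes times the number of scales times the number of orientations.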

### 5.2 Experimental Setup

For each testing video, we apply our approach and compare with four methods: Baseline Chamfer Matching, Prior Work [12], [16], and Manual Results.

Proposed Method templates are applied to the edge map of the last video frame to generate the same number of transformed templates. Leaf segmentation and alignment generate leaf candidates for leaf tracking, which iteratively updates according to Equation 11 towards the first frame.

Baseline Chamfer Matching The basic idea of CM is to align one object in an image. To align multiple leaves in a plant image, we design the baseline CM to iteratively align one leaf at a time. In each iteration, we apply all templates to the edge map of a test image to generate transformed leaf templates, which is the same as our first step. The transformed template with the minimum CM distance is selected and denoted as a leaf candidate. We update the edge map by deleting the matched edge points of the selected leaf candidate. The iteration continues until of the edge points are deleted. We apply this method to the labeled frames of each video and build the leaf correspondence based on leaf centers.
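The iterative baseline can be sketched as follows. This is a simplified illustration, not the evaluated implementation: templates are assumed to be already transformed point lists, the CM distance is computed by brute force rather than with a distance transform, an edge point counts as matched when it lies within a hypothetical `match_radius` of the selected template, and `stop_fraction` stands in for the stopping percentage.

```python
import math

def chamfer_distance(template_pts, edge_pts):
    """Mean distance from each template point to its nearest edge point."""
    nearest = [min(math.dist(t, e) for e in edge_pts) for t in template_pts]
    return sum(nearest) / len(nearest)

def baseline_cm(templates, edge_pts, stop_fraction, match_radius):
    """Select one leaf candidate per iteration and delete its edge points."""
    edges = set(edge_pts)
    total = len(edges)
    candidates = []
    while edges and (total - len(edges)) / total < stop_fraction:
        best = min(templates, key=lambda t: chamfer_distance(t, edges))
        matched = {e for e in edges
                   if any(math.dist(e, p) <= match_radius for p in best)}
        if not matched:
            break  # no edge point can be removed, so stop to avoid looping
        candidates.append(best)
        edges -= matched
    return candidates
```

Since each iteration considers only the CM distance of a single template, nothing in this loop discourages candidates from overlapping or drifting along edges, which is the weakness the joint formulation addresses.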

Multi-leaf Alignment [12] The optimization in [12] is the same as our proposed leaf alignment on the last frame. We apply [12] on all labeled frames and build the leaf correspondence based on leaf center distances.

Multi-leaf Tracking [16] The proposed method differs from [16] in the modified in Equation 11. In addition, [16] does not have a scheme to generate a new leaf candidate during tracking.

Manual Results In order to find the upper bound of our proposed method, we use the ground truth labels to find the optimal set of . For each labeled leaf, we find the leaf candidate with the smallest tip-based error from transformed templates.

For all methods, we record the estimated tip coordinates of all leaf candidates in the labeled frames as . The transformed template masks are used to generate an estimated segmentation mask for each frame. We record the estimated segmentation masks of all labeled frames as . and are used to evaluate , , and . and are used to evaluate SBD.

### 5.3 Experimental Results

#### 5.3.1 Performance comparison

Qualitative Results Figure 8 shows the results on the labeled frames of one video. Since the baseline CM only considers the CM distance and segments each leaf separately, leaf candidates are likely to be aligned around the edge points, which results in large landmark errors. While [16] can keep the leaf ID consistent, it does not include a scheme to generate a new leaf candidate during tracking (e.g., leaf in Figure 8). Our proposed method performs substantially better than the others. It has the same segmentation as the labeled results and all leaves are well tracked. Leaf is deleted when it gets too small. Due to the limitation of a finite number of templates, the manual results are not perfect. In our tracking method, however, we allow template transformation under any parameters in without limiting them to a finite number.

Quantitative Results We first evaluate the SAT accuracy w.r.t. , , and . We set the threshold to vary in and generate the accuracy curves for all methods, as shown in Figure 9. When is small, i.e., we have very strict requirements on the accuracy of tip estimation, all methods work well for easy-to-align leaves. With the increase of , more and more hard-to-align leaves with relatively large tip-based errors are considered as well-aligned leaves and contribute to the landmark error and tracking consistency . Therefore, detecting more leaves will result in higher and . It is noteworthy that our method achieves lower landmark error and higher tracking consistency while segmenting more leaves.

The baseline CM segments fewer leaves with higher landmark error and lower tracking consistency. The manual results are the upper bound of our algorithm. Obviously will be and will be with the increase of because we enforce the correspondence of all labeled leaves. But will not be due to the limitation of a finite template set. Overall, the proposed method performs much better than the baseline CM and our prior work. The improvement over [12] is mainly in a higher , and it improves on [16] in all three metrics. However, there is still a gap between the proposed method and the manual results, which calls for future research.

The SBD-based segmentation accuracy is shown in Table II. The proposed method is again superior to the baseline algorithm and the prior work.

Efficiency Results Table II shows the average execution time, which is calculated based on a Matlab implementation on a conventional computer. Our method is superior to the baseline CM and [12]. It is a little slower than [16] because of the updated and the scheme to add leaf candidates during tracking.

Table II: SBD-based segmentation accuracy and average execution time of each method.

| | Baseline | [12] | [16] | Proposed | Manual |
|---|---|---|---|---|---|
| SBD | 61.0 | 63.0 | 64.4 | 65.2 | 74.9 |
| Time | 51.28 | 16.42 | 1.98 | 2.15 | - |

Table III: SBD comparison with [17] on the LSC dataset.

| | A1 | A2 | A3 | all |
|---|---|---|---|---|
| [17] | 74.2(7.7) | 80.6(8.7) | 61.8(19.1) | 73.5(11.5) |
| Ours | 78.5(5.5) | 77.4(8.1) | 76.1(14.1) | 78.0(7.8) |

Segmentation Accuracy While no prior work focuses on the joint multi-leaf SAT, leaf segmentation has been studied, especially on RGB imagery. For example, state-of-the-art performance [17] is reported in the 2014 Leaf Segmentation Challenge (LSC) [30]. We apply our segmentation and alignment algorithm to the LSC dataset [5], which consists of three sets of Arabidopsis () and tobacco (). Two examples from and are shown in Figure 11. Note that pre-processing of the RGB imagery is employed in order to extend our proposed method to this LSC dataset. We compare the segmentation accuracy with [17] in Table III. Our algorithm achieves higher SBD in A1, A3, and on average. The segmentation accuracy on the LSC dataset is much higher than that of our fluorescence dataset because images in [5] are of higher resolution.


#### 5.3.2 Parameter tuning

We explore the sensitivity of the parameters in our method. We use the training videos for parameter tuning in our framework. For alignment parameter tuning, we test on all labeled frames independently and evaluate the accuracy without using tracking consistency . For tracking parameter tuning, we test on the labeled frames of each video and evaluate the accuracy using all four metrics.

Figure 10 (a) shows the alignment parameter tuning results of the weights for each objective term in Equation 6. We first search for the optimal setting to be: and . We then fix one parameter and change the other and evaluate the performance at . We observe that is relatively robust with some improvement from to . The performance increases tremendously as increases, indicating that is crucial. Without either term ( or ), the performance is not optimal.

In order to analyze the impact of the number of leaf templates, we reduce the value in one of , , and at a time. As shown in Figure 10, the performance increases as the number of templates increases in all three parameters. However, orientation is the most important as leaves with different orientations are more likely to have higher CM distances than leaves with different shapes or scales.

Figure 12 (a,b) shows the parameter tuning results in leaf tracking framework. Similarly, we first find the optimal weights to be: and . We fix one parameter and change the other and evaluate the performance at . and are relatively robust to changes. However, they are still useful as without either term, the performance decreases.

To study the impact of the number of iterations between two frames, we change and evaluate the performance. As shown in Figure 12 (c), the performance increases as increases. However, it stabilizes when is larger than because the algorithm already converges before reaching the maximum iteration.

In summary, all parameters used in our algorithm are set as: , , , , , , , , and .

#### 5.3.3 Quality prediction

Alignment Quality Model Data samples for evaluating our alignment quality model are selected from in Algorithm 2, which contains the tip-based errors of all leaf pairs, of which are less than . We select samples from for each interval of tip-based error within . Sample duplication is employed when the number of samples in a particular interval is less than . All samples with tip-based error larger than are also selected, but without duplication. Finally, we select samples and extract features for each sample. We assign to make the output in the range of , and for all samples with . We randomly select samples as the test set and the remaining samples are used to train the model. Figure 13 (a) shows the results of the model on both training and testing samples.

We use to measure how well the model fits our data. It is defined as:

(17) |

where is the predicted quality value and is the mean of . In our model, and the correlation coefficient for all testing samples is . Both values indicate a high correlation between and . This quality model is used to predict the alignment accuracy and generate one predicted curve for each leaf, as shown in Figure 2.
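The two goodness-of-fit values can be illustrated with a short sketch. It assumes that the fit measure is the standard coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares) and that the correlation coefficient is Pearson's; the function names and the sample values in the usage note are hypothetical.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

For example, `r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])` evaluates to approximately 0.98, since the residual sum of squares is 0.10 against a total of 5.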

Tracking Quality Model We visualize the results of our method and find videos that have a tracking failure of one leaf. As the goal of the tracking quality model is to detect when a tracking failure starts, we label two frames in each video: the frame where the failure starts and the frame where it ends. The starting frame is when a leaf candidate starts to change its location toward its neighboring leaves. The ending frame is when a leaf candidate totally overlaps another leaf. Among all failure samples, the shortest tracking failure length is frames and the average length is frames.

We select - frames near the ending frame as the negative training samples with and frames evenly distributed before the failure starts as the positive training samples with . The features are extracted as discussed in Section 3.3.2 and used to train an SVM classifier. The learned model is applied to all frames to predict the tracking quality. Figure 13 (b) shows an example of the output. We apply a Gaussian filter to remove outliers and delete the failure whose length is less than frames (the shortest length of failure samples).

We compare the first frame of a predicted failure with that of a labeled failure.
When their distance is less than frames (the average length of the failure samples), it is considered as a true detection.
Otherwise it is a false detection.
Using the leave-one-video-out testing scheme, the quality model generates true detections and false detections over labeled failures.
Similarly, this quality model is applied during tracking and outputs a prediction curve for each leaf (shown in Figure 2).
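The post-processing and scoring of predicted failures can be sketched as below. This is an illustration under stated assumptions: the per-frame classifier output is already smoothed and given as a list of booleans (True meaning a predicted failure), and the names `failure_runs`, `score_detections`, `min_length`, and `max_gap` are hypothetical stand-ins for the shortest and average failure lengths mentioned above.

```python
def failure_runs(is_failure, min_length):
    """Return (start, end) frame indices of failure runs of sufficient length."""
    runs, start = [], None
    # A trailing False closes a run that reaches the last frame.
    for i, flag in enumerate(list(is_failure) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    return runs

def score_detections(predicted_starts, labeled_starts, max_gap):
    """Count true and false detections by comparing failure start frames."""
    true_det = sum(
        1 for p in predicted_starts
        if any(abs(p - l) < max_gap for l in labeled_starts)
    )
    return true_det, len(predicted_starts) - true_det
```

A predicted failure is counted as true when its first frame lies within `max_gap` frames of a labeled failure start, and as false otherwise.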

#### 5.3.4 Limitation analysis

Any algorithm has its limitations. Hence, it is important to explore the limitations of the proposed method. First, one interesting question is to what extent our segmentation and alignment method can correctly segment leaves in the overlapping region. We answer this question using a simple synthetic example. As shown in Figure 14, our method performs well when the overlap ratio is less than . Otherwise it identifies two leaves as one leaf, which appears to be reasonable when the overlap ratio is high (e.g., ).

Second, our leaf tracking starts from a good initialization of the leaf candidates from the previous frame. Another interesting question is to what extent our tracking method can succeed with bad initializations. To study this, frames with good tracking results are selected from videos (one for each). We change the transformation parameters to synthesize different amounts of distortion and apply our tracking algorithm to these frames. The leaf candidate is deleted only if it becomes one point and the tip-based error is set to be . We compute the average tip-based error of all leaf candidates.

We vary the rotation angle , the scaling factor , and the translation ratio , which is defined as and the direction is randomly selected. The average and range of the tip-based errors for all frames are shown in Figure 15. Our tracking method reduces the initial tip-based error to a small value. It is most robust to and most sensitive to .
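To make the synthesized distortions concrete, the sketch below perturbs the two tips of one leaf candidate. It rests on several assumptions that are not confirmed by the text: rotation and scaling are applied about the leaf center, and the translation ratio is interpreted as the shift distance divided by the leaf length. The function name `distort_leaf` is hypothetical.

```python
import math
import random

def distort_leaf(tips, angle_deg, scale, translation_ratio, rng):
    """Rotate, scale, and shift the two tips of a leaf candidate.

    tips: ((x1, y1), (x2, y2)); rng: a random.Random instance that
    picks the (random) translation direction.
    """
    (x1, y1), (x2, y2) = tips
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    leaf_length = math.dist(tips[0], tips[1])
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    direction = rng.uniform(0.0, 2.0 * math.pi)
    shift = translation_ratio * leaf_length  # assumed meaning of the ratio
    dx, dy = shift * math.cos(direction), shift * math.sin(direction)
    distorted = []
    for x, y in tips:
        rx, ry = x - cx, y - cy  # coordinates relative to the leaf center
        nx = scale * (cos_a * rx - sin_a * ry) + cx + dx
        ny = scale * (sin_a * rx + cos_a * ry) + cy + dy
        distorted.append((nx, ny))
    return tuple(distorted)
```

Passing a seeded generator such as `random.Random(0)` keeps the random translation direction reproducible across runs.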

Figure 16 shows some examples. For rotation angle less than , our method works well for different amounts of leaf rotations. For the scaling factor, as long as the leaf candidate is not too small, our method is very robust even if we enlarge the original leaf candidates to be times larger. For the translation ratio, it is sensitive because the direction is randomly selected and leaf candidates are very likely to shift to the locations of the neighboring leaves. Furthermore, changing the initialization of and for separate leaves (leaf in Figure 16) leads to better performance than that of neighboring leaves (leaf in Figure 16) because neighboring leaves will have overlap with each other and therefore influence the tracking results. Overall, as the distortion increases, the average tip-based error increases while some of the leaf candidates can still be well aligned.

## 6 Conclusions

In this paper, we identify a new computer vision problem of leaf segmentation, alignment, and tracking from fluorescence plant videos. Leaf alignment and tracking are formulated as two optimization problems based on Chamfer matching and leaf template transformation. Two models are learned to predict the quality of leaf alignment and tracking. A quantitative evaluation scheme is designed to evaluate the performance. The limitations of our algorithm are studied and experimental results show the effectiveness, efficiency, and robustness of the proposed method.

With the leaf boundary and structure information over time, the photosynthetic efficiency can be computed for each leaf, which paves the way for leaf-level photosynthetic analysis and high-throughput plant phenotyping. The proposed method and the evaluation scheme are potentially applicable to other plant videos, as shown in the results on the LSC dataset.

## References

- [1] Fabio Fiorani and Ulrich Schurr, “Future scenarios for plant phenotyping,” Annual Review of Plant Biology.
- [2] Marcus Jansen et al., “Simultaneous phenotyping of leaf growth and chlorophyll fluorescence via growscreen fluoro allows detection of stress tolerance in arabidopsis thaliana and other rosette plants,” Functional Plant Biology, 2009.
- [3] Samuel Trachsel, Shawn M Kaeppler, Kathleen M Brown, and Jonathan P Lynch, “Shovelomics: high throughput phenotyping of maize (zea mays l.) root architecture in the field,” Plant and Soil, 2011.
- [4] Larissa M Wilson, Sherry R Whitt, Ana M Ibáñez, Torbert R Rocheford, Major M Goodman, and Edward S Buckler, “Dissection of maize kernel composition and starch production by candidate gene association,” The Plant Cell, 2004.
- [5] Hanno Scharr, Massimo Minervini, Andreas Fischbach, and Sotirios A Tsaftaris, “Annotated image datasets of rosette plants,” Tech. Rep. FZJ-2014-03837, 2014.
- [6] Anja Hartmann, Tobias Czauderna, Roberto Hoffmann, Nils Stein, and Falk Schreiber, “Htpheno: an image analysis pipeline for high-throughput plant phenotyping,” BMC Bioinformatics, 2011.
- [7] Ladislav Nedbal and John Whitmarsh, “Chlorophyll fluorescence imaging of leaves and fruits,” in Chlorophyll a Fluorescence. 2004.
- [8] Xu Zhang, Ronald J. Hause, and Justin O. Borevitz, “Natural genetic variation for growth and development revealed by high-throughput phenotyping in Arabidopsis thaliana,” G3: Genes, Genomes, Genetics, 2012.
- [9] Arabidopsis Genome Initiative et al., “Analysis of the genome sequence of the flowering plant arabidopsis thaliana.,” Nature.
- [10] http://www.arabidopsis.org/portals/education/aboutarabidopsis.jsp#hist.
- [11] Chin-Hung Teng, Yi-Ting Kuo, and Yung-Sheng Chen, “Leaf segmentation, its 3D position estimation and leaf classification from a few images with very close viewpoints,” in Image Analysis and Recognition. 2009.
- [12] Xi Yin, Xiaoming Liu, Jin Chen, and David M Kramer, “Multi-leaf alignment from fluorescence plant images,” in WACV, 2014.
- [13] Jonas Vylder, Daniel Ochoa, Wilfried Philips, Laury Chaerle, and Dominique Straeten, “Leaf segmentation and tracking using probabilistic parametric active contours,” in Computer Vision/Computer Graphics Collaboration Techniques. 2011.
- [14] Harry G. Barrow, Jay M. Tenenbaum, Robert C. Bolles, and Helen C. Wolf, “Parametric correspondence and Chamfer matching: Two new techniques for image matching,” Tech. Rep., DTIC Document, 1977.
- [15] Bastian Leibe, Edgar Seemann, and Bernt Schiele, “Pedestrian detection in crowded scenes,” in CVPR, 2005.
- [16] Xi Yin, Xiaoming Liu, Jin Chen, and David M Kramer, “Multi-leaf tracking from fluorescence plant videos,” in ICIP, 2014.
- [17] Jean-Michel Pape and Christian Klukas, “3-d histogram-based segmentation and leaf detection for rosette plants,” in ECCV Workshops, 2014.
- [18] Long Quan, Ping Tan, Gang Zeng, Lu Yuan, Jingdong Wang, and Sing Bing Kang, “Image-based plant modeling,” in ACM Transactions on Graphics, 2006.
- [19] Derek Bradley, Derek Nowrouzezahrai, and Paul Beardsley, “Image-based reconstruction and synthesis of dense foliage,” ACM Transactions on Graphics.
- [20] Yangyan Li, Xiaochen Fan, Niloy J Mitra, Daniel Chamovitz, Daniel Cohen-Or, and Baoquan Chen, “Analyzing growing plants from 4d point cloud data,” ACM Transactions on Graphics, 2013.
- [21] Yann Chéné, David Rousseau, Philippe Lucidarme, Jessica Bertheloot, Valérie Caffier, Philippe Morel, Étienne Belin, and François Chapeau-Blondeau, “On the use of depth camera for 3D phenotyping of entire plants,” Computers and Electronics in Agriculture, 2012.
- [22] Guillaume Cerutti, Laure Tougne, Julien Mille, Antoine Vacavant, and Didier Coquin, “Understanding leaves in natural images–a model-based approach for tree species identification,” CVIU, 2013.
- [23] Sofiene Mouine, Itheri Yahiaoui, and Anne Verroust-Blondet, “Advanced shape context for plant species identification using leaf image retrieval,” in Proc. ACM Int. Conf. Multimedia Retrieval (ICMR), 2012.
- [24] Neeraj Kumar, Peter N. Belhumeur, Arijit Biswas, David W. Jacobs, W. John Kress, Ida C. Lopez, and João VB. Soares, “Leafsnap: A computer vision system for automatic plant species identification,” in ECCV. 2012.
- [25] Guillaume Cerutti, Laure Tougne, Julien Mille, Antoine Vacavant, Didier Coquin, et al., “A model-based approach for compound leaves understanding and identification,” in ICIP, 2013.
- [26] Guillaume Cerutti, Laure Tougne, Antoine Vacavant, and Didier Coquin, “A parametric active polygon for leaf segmentation and shape estimation,” in Advances in Visual Computing. 2011.
- [27] Siqi Chen, Daniel Cremers, and Richard J Radke, “Image segmentation with one shape prior - a template-based formulation,” Image and Vision Computing, 2012.
- [28] Xiao-Feng Wang, De-Shuang Huang, Ji-Xiang Du, Huan Xu, and Laurent Heutte, “Classification of plant leaf images with complicated background,” Applied Mathematics and Computation, 2008.
- [29] Xianghua Li, Hyo-Haeng Lee, and Kwang-Seok Hong, “Leaf contour extraction based on an intelligent scissor algorithm with complex background,” in International Conference on Future Computers in Education, 2012.
- [30] http://plant-phenotyping.org/CVPPP2014-challenge.
- [31] Hanno Scharr, Massimo Minervini, Andrew P French, Christian Klukas, David M Kramer, Xiaoming Liu, Imanol Luengo, Jean-Michel Pape, Gerrit Polder, Danijela Vukadinovic, et al., “Leaf segmentation in plant phenotyping: a collation study,” Machine Vision and Applications, 2015.
- [32] Eren Erdal Aksoy, Alexey Abramov, Florentin Wörgötter, Hanno Scharr, Andreas Fischbach, and Babette Dellen, “Modeling leaf growth of rosette plants using infrared stereo image sequences,” Computers and Electronics in Agriculture, 2015.
- [33] Babette Dellen, Hanno Scharr, and Carme Torras, “Growth signatures of rosette plants from time-lapse video,” 2015.
- [34] Martin A Fischler and Robert C Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981.
- [35] Xiaoming Liu, Ting Yu, Thomas Sebastian, and Peter Tu, “Boosted deformable model for human body alignment,” in CVPR, 2008.
- [36] Xiaoming Liu, “Discriminative face alignment,” PAMI, 2009.
- [37] Yves Crama, Pierre Hansen, and Brigitte Jaumard, “The basic algorithm for pseudo-boolean programming revisited,” Discrete Applied Mathematics, 1990.
- [38] Endre Boros and Peter L Hammer, “Pseudo-boolean optimization,” Discrete Applied Mathematics, 2002.
- [39] Eyung Lim, Xudong Jiang, and Weiyun Yau, “Fingerprint quality and validity analysis,” in ICIP, 2002.
- [40] Kamal Nasrollahi and Thomas B. Moeslund, “Face quality assessment system in video sequences,” in Biometrics and Identity Management. 2008.
- [41] Kathleen Greenham, Ping Lou, Sara E Remsen, Hany Farid, and C Robertson McClung, “Trip: Tracking rhythms in plants, an automated leaf movement analysis program for circadian period estimation,” Plant Methods, 2015.
- [42] Oliver L Tessmer, Yuhua Jiao, Jeffrey A Cruz, David M Kramer, and Jin Chen, “Functional approach to high-throughput plant growth analysis,” BMC systems biology, 2013.

Xi Yin received the B.S. degree in Electronic and Information Science from Wuhan University, China, in 2013. Since August 2013, she has been working toward her Ph.D. degree in the Department of Computer Science and Engineering, Michigan State University, USA. Her paper on plant segmentation won the Best Student Paper Award at the Winter Conference on Applications of Computer Vision (WACV) 2014. Her research interests include computer vision and deep learning.

Xiaoming Liu is an Assistant Professor in the Department of Computer Science and Engineering at Michigan State University (MSU). He received the B.E. degree from Beijing Information Technology Institute, China and the M.E. degree from Zhejiang University, China, in and respectively, both in Computer Science, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University in . Before joining MSU in Fall , he was a research scientist at General Electric Global Research Center. His research areas are face recognition, biometrics, image alignment, video surveillance, computer vision and pattern recognition. He has authored more than scientific publications, and has filed U.S. patents. He is a member of the IEEE.

Jin Chen received the BS degree in computer science from Southeast University, China, in , and the Ph.D. degree in computer science from the National University of Singapore, Singapore, in . He is an Assistant Professor in the Department of Energy Plant Research Laboratory and the Department of Computer Science and Engineering at Michigan State University. His general research interests are in computational biology, as well as its interface with data mining and computer vision.

David M. Kramer is the Hannah Distinguished Professor of Bioenergetics and Photosynthesis in the Biochemistry and Molecular Biology Department and the MSU-Department of Energy Plant Research Lab at Michigan State University. In 1990, he received his Ph.D. in Biophysics at University of Illinois, Urbana-Champaign, followed by Post-doc at the Institute de Biologie Physico-Chimique in Paris and a 15-year tenure as a faculty member at Washington State University. His research seeks to understand how plants convert light energy into forms usable for life, how these processes function at both molecular and physiological levels, how they are regulated and controlled, how they define the energy budget of plants and the ecosystem and how they have adapted through evolution to support life in extreme environments. This work has led his research team to develop a series of novel spectroscopic tools for probing photosynthetic reactions in vivo.