
DeepSolarEye: Power Loss Prediction and Weakly Supervised Soiling Localization via Fully Convolutional Networks for Solar Panels


The impact of soiling on solar panels is an important and well-studied problem in the renewable energy sector. In this paper, we present the first convolutional neural network (CNN) based approach for solar panel soiling and defect analysis. Our approach takes an RGB image of a solar panel and environmental factors as inputs and predicts power loss, soiling localization, and soiling type. In computer vision, localization is a complex task that typically requires manually labeled training data such as bounding boxes or segmentation masks. Our proposed approach consists of four specialized stages that completely avoid localization ground truth and need only panel images with power loss labels for training. The impact region obtained from the predicted localization masks is classified into soiling types using webly supervised learning. To improve the localization capabilities of CNNs, we introduce a novel bi-directional input-aware fusion (BiDIAF) block that reinforces the input at different levels of the CNN to learn input-specific feature maps. Our empirical study shows that BiDIAF improves the power loss prediction accuracy by about 3% and the localization accuracy by about 4%. Our end-to-end model yields a further improvement of about 24% on localization when learned in a weakly supervised manner. Our approach is generalizable and showed promising results on web-crawled solar panel images. Our system runs at 22 fps (including all steps) on an NVIDIA TitanX GPU. Additionally, we collected a first-of-its-kind dataset for solar panel image analysis consisting of 45,000+ images.


1 Introduction

The surge in solar photovoltaic (PV) based renewable energy in recent years has revolutionized the energy sector across the globe by greatly reducing the cost of energy [28]. The growing number of large- and mid-sized solar farms often face operations and maintenance challenges. Environment-induced soiling on solar panels (accumulation of dust, pollen, leaves, bird drop, snail trail, and snow) and defects, such as cracks, substantially hamper power generation [24, 19, 18, 33]. Automatic visual inspection-based solutions can play a vital role in efficient solar farm operations, maintenance, and asset warranty.

Figure 1: Six images depicting the performance of our method. First three images are from our dataset while the remaining three are downloaded from the Internet. Our method can efficiently localize PV soiling and type, even in the wild. See Appendix E for more images. Best viewed in color.

The type of soiling or defect on a panel can be recognized instantly and effortlessly by merely looking at the PV panel image. However, to analyze the impact of soiling or a defect on the performance of the solar panel, detailed information, such as soiling amount and coverage, type of material, and location on the panel, is required. This information is useful not only for estimating the impact on the performance of the solar panel, but also for recommending corrective measures; together these are critical for efficient solar farm maintenance. As an example, wiping is the more appropriate cleaning action when a solar panel is covered with bird drop, whereas air blow is effective for cleaning dust (see Figure 1(c) and (d)). This choice follows from the nature of the material, e.g., bird drop has a stickier and oilier composition than dust.

Figure 2: Overview of our method, DeepSolarEye, that predicts impact on the power loss and the soiling area simultaneously.

Besides soiling coverage and type, soiling location also plays a vital role in impact prediction due to the physics of cell connections in a panel. As an example, the dust cover on the panel in Figure 1(b) is much less than that in Figure 1(a); however, the power loss is similarly high in both cases. By design, solar panels have cells connected in series in vertical columns and one bypass diode per two columns. Therefore, even if only one cell is fully covered by dust, it can block the current for the entire bypass diode segment, resulting in high power loss.
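The effect of soiling location can be illustrated with a deliberately simplified toy model. The sketch below is not the physical model of a PV panel; it assumes (hypothetically) that a bypass diode segment produces no power as soon as any of its cells is fully covered, and full power otherwise. The function name `toy_power_loss` and its parameters are illustrative only.

```python
import numpy as np

def toy_power_loss(coverage, cols_per_diode=2, block_threshold=1.0):
    """Toy estimate of the fractional power loss of a panel.

    coverage: 2D array (rows x columns of cells); each entry in [0, 1] is
              the fraction of that cell covered by soiling.
    Assumption: a bypass-diode segment (cols_per_diode adjacent columns)
    yields no power if any of its cells is covered at or above
    block_threshold, and full power otherwise.
    """
    coverage = np.asarray(coverage, dtype=float)
    n_rows, n_cols = coverage.shape
    if n_cols % cols_per_diode != 0:
        raise ValueError("column count must be a multiple of cols_per_diode")
    # Group adjacent columns into segments: (rows, segments, cols_per_diode)
    segments = coverage.reshape(n_rows, n_cols // cols_per_diode, cols_per_diode)
    blocked = (segments >= block_threshold).any(axis=(0, 2))
    return float(blocked.mean())

# A 10 x 6 panel with a single fully covered cell loses one of three segments.
panel = np.zeros((10, 6))
panel[3, 0] = 1.0
single_cell_loss = toy_power_loss(panel)
```

Under these assumptions, one covered cell out of sixty removes a third of the panel output, which mirrors the observation that a small soiled area can cause a large power loss.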

We aim to develop a model that provides information about the soiling and defects on a panel using an RGB image of the solar panel. Convolutional neural networks (CNNs) have performed well in visual recognition tasks such as object detection and image segmentation. Yet, the application of CNNs in solar farm monitoring and management remains a challenge, primarily due to a lack of labeled datasets.

In this paper, we present a novel end-to-end fully convolutional neural network, DeepSolarEye, that simultaneously predicts the power loss and localizes soiling area from an image of a solar panel. A simplified overview of DeepSolarEye is presented in Figure 2. Ideally, we need power loss and localization mask labels to train such a model. In our approach, we bypass the explicit localization mask requirement. We use power loss (or classification label) as weak supervision for generating the localization masks.

The main contributions of our paper are:

  • CNNs for solar panel analysis: We adapt existing CNNs for a new domain of solar panel soiling and defect analysis. To the best of our knowledge, ours is the first CNN-based approach for this task.

  • Weakly supervised learning: DeepSolarEye consists of four steps: (a) train a CNN-based classification network, ImpactNet, for predicting the power loss, (b) create a candidate soiling mask from the classification network using a pyramid-based approach, (c) train a multi-task network, Mask FCNN, for simultaneously predicting the power loss and localizing the soiling area using the generated candidate mask, and (d) predict the soiling category using a webly supervised neural network, WebNN.

  • BiDIAF for accurate localization: CNNs learn features at multiple spatial levels by performing convolution and down-sampling operations. However, feature maps learned by CNNs do not encode localization information explicitly, which hinders localizing the impact area. Localization becomes even more difficult because spatial information is lost due to these operations. To overcome these challenges, we introduce a novel convolutional unit, a bi-directional input-aware fusion (BiDIAF) block, that reinforces the input at different layers of a CNN to learn input-specific feature maps. We show that BiDIAF substantially improves the localization capabilities of CNNs.

  • Dataset: We created a first-of-its-kind dataset for solar panel image analysis, comprising 45,754 solar panel images with labels of power loss and solar irradiance, as well as timestamps. DeepSolarEye achieves a classification accuracy of 83.32% and a localization Jaccard index (weakly supervised) of 66% on this dataset. Further, our experimental results suggest that DeepSolarEye learns generalizable representations of different soiling types and is able to identify and localize the soiling on solar panels, even in the wild. For example, crack (Figure 1(e)) and snow (Figure 1(f)) were identified and localized despite not being present in our dataset.

2 Related Work

Image-based solutions for solar panel analysis:

Some prior work analyzes different soiling types [1, 31, 8, 2]. These methods take an RGB or IR image as input and apply a traditional image processing algorithm, such as histogram matching, filtering, or color-space conversion. The output of the image processing algorithm is then thresholded to locate the impact area. As these methods are threshold-bound, they do not scale to identifying different types of soiling. Further, these methods are not able to capture the complex relationships between different factors, such as particle size, thickness, and coverage along with environmental factors (e.g. solar irradiance and humidity), that are required for analyzing the impact on power loss.

CNN for visual recognition tasks:

Convolutional Neural Networks (CNNs) are the state-of-the-art methods for image classification [26, 29, 14, 15]. Recent classification architectures have explored different types of connectivity patterns, such as bypass connections in [14] and dense connections in [15], to improve the information flow inside the network; thereby enabling end-to-end training of very deep CNNs. Further, these classification networks have been used as the base feature extractors for several visual recognition tasks including object detection [21] and segmentation [6].

Region-based CNNs or R-CNNs [10] have proven to be effective for both detection and segmentation tasks. However, the accuracy of R-CNNs is dependent on the region proposal method. Unlike R-CNNs, fully convolutional networks (FCNs) have gained attention as they enable end-to-end training and are fast (e.g. [21, 4]). The features learned by the classification network at lower spatial resolutions are coarse and lead to coarse output (e.g. segmentation masks of FCN-32s [25]). To address this limitation, several techniques have been proposed, such as fully convolutional region proposal networks (e.g. [21]), skip-connections (e.g. [25, 22]), deconvolutional networks (e.g. [4]), dilated convolutions (e.g. [6, 32]), and multiple-input networks (e.g. [17]).

Several FCN-based supervised object classification and localization networks exist in the literature. Yet, extending these approaches to new domains is challenging, primarily due to the lack of large labeled datasets. In this paper, we propose a method for PV solar image analysis to address these challenges. Our method consists of novel and carefully designed components that allow extending prior work on image classification and localization to a new dataset without any manually labeled localization data.

3 Dataset

We create a first-of-its-kind dataset, PV-Net, comprising 45,754 images of solar panels with power loss labels. Our experimental setup consists of two identical solar panels, which are kept side by side with an RGB camera facing them. Soiling experiments were conducted on the first panel (close to the camera) while the other panel was used for reference. Images were captured every 5 seconds and the power generated by the panels was recorded. Soiling impact is reported as the percentage power loss with respect to the reference panel. We use soiling impact and power loss interchangeably in this paper.
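As a small worked example of the label definition, the sketch below computes the percentage power loss of the soiled panel relative to the reference panel. The function name and the assumption that both readings are taken at the same instant and in the same unit are ours.

```python
def soiling_impact_percent(p_soiled, p_reference):
    """Percentage power loss of the soiled panel relative to the reference panel.

    Assumes both power readings are simultaneous and in the same unit.
    """
    if p_reference <= 0:
        raise ValueError("reference power must be positive")
    return 100.0 * (p_reference - p_soiled) / p_reference

# A soiled panel producing 30 W next to a reference panel producing 120 W.
example_impact = soiling_impact_percent(30.0, 120.0)
```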

Our data recording methodology was aimed at capturing various types of soiling and their impact on the PV panel. For this, we exposed the panel to different types of soiling in terms of color (red, brown, and gray), particle size (sand, dust, and talcum powder), and thickness under natural environmental conditions [24]. Some of the thick patches correspond to power losses as high as 90%. The dataset was collected over about a month and contains large variations in soiling due to both experimental (e.g. dust with varying thickness, blob sizes, and patches) and natural means (e.g. wind and precipitation). Besides the power loss corresponding to the soiled panel, our dataset also contains information about environmental factors (solar irradiance and timestamps).

4 DeepSolarEye

We propose an end-to-end system, DeepSolarEye, which is based on fully convolutional networks. The architecture of DeepSolarEye is visualized in Figure 3. The input to our system is an RGB image with environmental factors (such as solar irradiance and timestamps), while the outputs of our system are: (1) soiling impact, (2) soiling localization, and (3) soiling category. We propose a four-step approach that enables the training of DeepSolarEye: (1) train a CNN-based classification network, ImpactNet, for predicting soiling impact, (2) create candidate masks by aggregating the feature maps learned by the classification network at different spatial levels using a pyramid-based approach, (3) train a multi-task network, Mask FCNN, for simultaneously predicting the soiling impact and the soiling localization mask, and (4) predict the category of soiling using a webly supervised neural network, WebNN. These steps are discussed below.

Figure 3: DeepSolarEye: An end-to-end system for predicting the soiling impact, the soiling localization, and the soiling type simultaneously. Number of feature maps used by each block are reported next to it. Note that AU doesn’t have . Best viewed in color.
Figure 4: Visualization of feature maps with and without BiDIAF. Due to the down-sampling operations, dataset-specific features are lost. BiDIAF reinforces the input at different spatial levels to learn data-specific features. Best viewed in color.

4.1 ImpactNet: Image to Impact Analysis

Traditional CNNs encode the spatial information about the objects in an image by performing convolution and down-sampling operations in a top-down fashion. Although these CNNs do not encode localization information explicitly, one may combine the feature maps at multiple spatial levels using a bottom-up approach to localize the object. However, the down-sampling operations tend to lose spatial information and may hinder object localization (see Figure 4). To address this limitation, we introduce a novel “bi-directional input-aware fusion block (BiDIAF)” that reinforces the input inside the network to compensate for the loss of spatial information, thereby helping the network learn the relevant features with respect to the input. The BiDIAF block takes an input from the main CNN branch and shares one of its outputs with the same branch; hence we call it a bi-directional input-aware fusion unit.

Our proposed unit can be integrated with any CNN (such as VGG [26] or ResNet [14]). Following the success of ResNet [14] in different visual recognition tasks, we choose ResNet as our baseline network. ResNet stacks residual convolutional units (RCUs) to aggregate feature maps at different spatial levels. The input and output of an RCU are connected through an identity mapping, which improves the information flow inside the network and prevents the vanishing gradient issue. We add the BiDIAF unit between two RCUs, as shown in Figure 3. BiDIAF takes the output of the previous RCU, the output of the previous BiDIAF unit (if it exists), and the input image as inputs, and produces two outputs that are given as input to the next RCU and the next BiDIAF block. The BiDIAF unit is built from three functions.

The first function projects the output of the previous BiDIAF unit to the same dimensionality as the output of the previous RCU using convolution. The second function first sub-samples the input image to the same spatial dimensions as the RCU output using an average pooling operation and then projects the sub-sampled image to the same dimensionality as the RCU output using convolution. Apart from projection, this convolution also learns input-relevant feature maps. The third function concatenates the RCU output with the feature maps produced by the first two functions, followed by a convolution that projects the concatenated feature maps to the same dimensionality as the RCU output.
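The description above can be written compactly in equation form. The notation below is assumed here purely for illustration: $R_l$ denotes the output of the previous RCU, $B_{l-1}$ the output of the previous BiDIAF unit, $I$ the input image, and $F_B$, $F_I$, and $F_C$ the three functions in the order in which they are described. Reading the output shared with the main branch as an element-wise sum is likewise an assumption, chosen to be consistent with the later remark that ImpactNet-B alters Eq. 2 so that input-aware feature maps are not shared with the main branch.

```latex
\begin{align}
B_l       &= F_C\bigl(R_l,\; F_B(B_{l-1}),\; F_I(I)\bigr) \tag{1} \\
\hat{R}_l &= R_l + B_l \tag{2}
\end{align}
```

Under this reading, $B_l$ is passed to the next BiDIAF block and $\hat{R}_l$ to the next RCU; dropping the $B_l$ term in the second line would correspond to not sharing input-aware features with the main branch.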

CNNs for classification do not encode the localization information explicitly. Most of the existing methods (e.g. [12, 11, 9, 25, 4, 20, 32, 6]) use labeled data to learn the localization mask. Data labeling is an expensive task and therefore, we propose a two-fold strategy to generate localization mask: (1) aggregate the feature maps at different spatial-levels to create a candidate mask, and (2) refine the localization masks by jointly training a classification and localization network, assuming candidate masks as ground truth during training.

4.2 Parameter-free Candidate Mask Creation

Our approach for localizing the region of impact is motivated by Burt and Adelson’s Laplacian pyramid-based method [5], which encodes and decodes the image information using top-down (analysis) and bottom-up (synthesis) pyramids respectively. Standard CNNs aggregate feature maps in a top-down fashion, which suggests their resemblance to the analysis network. Therefore, we can decode the encoded feature maps using a synthesis pyramid, i.e. in a bottom-up fashion, for localizing the impact area [9, 20, 4].

Our synthesis pyramid fuses the feature maps of the main and auxiliary branches at each spatial level (Eq. 3) to produce a localization mask, which is then up-sampled to the size of the next finer level using bilinear interpolation. This process is repeated until the localization mask has the same size as the input image.

The fusion in Eq. 3 uses element-wise multiplication and element-wise addition. The element-wise multiplication gives high importance to values only when the feature maps from both the main and auxiliary branches agree; multiplicative gating therefore helps in suppressing irrelevant features. The resultant feature map is then combined with the feature map from the auxiliary branch using element-wise addition, which boosts the values of the relevant features identified by the multiplicative gating. Our inspection of feature maps at different spatial levels reveals that the auxiliary branch has much more descriptive power than the main branch, and therefore we boost the feature maps using the auxiliary branch (see Figure 5).
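A minimal NumPy sketch of this parameter-free synthesis is given below. It is an illustration under stated assumptions rather than the exact procedure: each level is represented by a single-channel 2D map (for example, a channel average), levels are ordered from coarsest to finest, and the up-sampled mask from the coarser level is carried forward by element-wise addition. The helper names `upsample_bilinear`, `fuse`, and `candidate_mask` are hypothetical.

```python
import numpy as np

def upsample_bilinear(x, out_h, out_w):
    """Separable linear (bilinear, corner-aligned) resize of a 2D array."""
    h, w = x.shape
    rows = np.linspace(0, h - 1, out_h)
    cols = np.linspace(0, w - 1, out_w)
    # Interpolate along the width for every row, then along the height.
    tmp = np.stack([np.interp(cols, np.arange(w), x[i]) for i in range(h)])
    return np.stack(
        [np.interp(rows, np.arange(h), tmp[:, j]) for j in range(out_w)], axis=1
    )

def fuse(main, aux):
    """Multiplicative gating followed by an additive boost from the auxiliary map."""
    return main * aux + aux

def candidate_mask(main_maps, aux_maps, image_hw):
    """Bottom-up synthesis over levels ordered coarsest first.

    Assumption: the coarser mask is up-sampled to the current level and added
    to the fused map of that level.
    """
    mask = None
    for main, aux in zip(main_maps, aux_maps):
        level = fuse(main, aux)
        if mask is not None:
            level = level + upsample_bilinear(mask, *level.shape)
        mask = level
    return upsample_bilinear(mask, *image_hw)

# Two levels (2x2 and 4x4) synthesized into an 8x8 mask.
main_maps = [np.ones((2, 2)), np.ones((4, 4))]
aux_maps = [np.full((2, 2), 2.0), np.full((4, 4), 1.0)]
demo_mask = candidate_mask(main_maps, aux_maps, (8, 8))
```

No parameters are learned in this step; the mask depends only on the feature maps produced by the trained classification network.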

Mask to label image: A solar panel image can be split into three classes: background, solar panel, and soiling area. We detect the solar panel by performing Gaussian filtering and edge detection operations on the RGB image. The area outside the detected solar panel is assigned a label of 1 (corresponding to background). The area within the detected panel is then thresholded using the mean of the mask. If the pixel value (inside the panel) in the mask is less than the mean, it is assigned a label of 2 (corresponding to panel). Otherwise, we assign a label of 3 (corresponding to soiling area).
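The labeling rule can be sketched as follows. The sketch assumes that panel detection has already produced a boolean `panel_region` array, and that the threshold is the mean of the mask values inside the panel (the text leaves open whether the mean is taken over the panel or over the whole mask). The function name is hypothetical.

```python
import numpy as np

BACKGROUND, PANEL, SOILING = 1, 2, 3

def mask_to_labels(mask, panel_region):
    """Convert a localization mask into a 3-class label image.

    mask:         2D float array (candidate localization mask).
    panel_region: 2D boolean array, True inside the detected panel
                  (assumed to come from a separate detection step).
    """
    mask = np.asarray(mask, dtype=float)
    panel_region = np.asarray(panel_region, dtype=bool)
    labels = np.full(mask.shape, BACKGROUND, dtype=np.uint8)
    if panel_region.any():
        # Assumption: the mean is computed over panel pixels only.
        threshold = mask[panel_region].mean()
        labels[panel_region & (mask < threshold)] = PANEL
        labels[panel_region & (mask >= threshold)] = SOILING
    return labels

demo_labels = mask_to_labels(
    [[0.9, 0.1, 0.2], [0.8, 0.1, 0.9]],
    [[False, True, True], [False, True, True]],
)
```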

Figure 5: Visualization of feature maps in main and auxiliary branches in the ImpactNet network. For visualization, we have scaled the feature maps to the same scale. Best viewed in color.

4.3 Mask FCNN

Mask FCNN is a fully convolutional network that aims to simultaneously predict the soiling impact and the soiling area (localization). Our approach is motivated by the recently proposed Mask R-CNN [12], which applies classification and masking in parallel. Unlike Mask R-CNN, which adopts a two-stage procedure (a region proposal network followed by a classification and masking network), our method is fully convolutional, i.e. we do not use any region proposal network.

Mask FCNN is composed of two networks: (1) a classification network, ImpactNet, and (2) a synthesis network, as shown in Figure 3. For the synthesis network, we follow the fully convolutional bottom-up approach (e.g. [4, 20, 22]). Bottom-up approaches up-sample the feature maps to invert the loss of information due to down-sampling operations. Our bottom-up architecture is a stack of synthesis units (SUs).

Each SU is a composite function comprising convolution, deconvolution, and convolution operations. The first convolution reduces the dimensionality of the incoming feature maps, while the deconvolution up-samples them to the spatial dimensions of the next level and projects them to the required number of channels. Mask FCNN was trained by minimizing a multi-task loss made up of two multinomial cross-entropy loss functions, one for classification (soiling impact) and one for masking (soiling area).
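To make the training objective concrete, the following NumPy sketch evaluates a multi-task loss as the sum of two multinomial cross-entropy terms. The unweighted sum, the function names, and the tensor layouts (class scores in the last axis) are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean multinomial cross-entropy.

    logits:  float array of shape (..., C) with unnormalized class scores.
    targets: integer array of shape (...) with class indices in [0, C).
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(-picked.mean())

def multi_task_loss(class_logits, class_targets, mask_logits, mask_targets):
    """Assumed form: classification loss plus masking loss, equally weighted."""
    return cross_entropy(class_logits, class_targets) + cross_entropy(
        mask_logits, mask_targets
    )

# Uninformative predictions: 8 impact classes and 3 per-pixel mask classes.
demo_loss = multi_task_loss(
    np.zeros((2, 8)), np.zeros(2, dtype=int),
    np.zeros((2, 4, 4, 3)), np.zeros((2, 4, 4), dtype=int),
)
```

With uniform predictions, the loss equals log 8 + log 3, the value an untrained network with 8 impact classes and 3 mask classes would be expected to start near.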

Cleaning actions for solar panels depend on the soiling type. For example, the potential cleaning actions for bird drop and dust are wiping and air blow respectively. Therefore, it becomes critical to determine the soiling type for efficiently managing the workforce at solar farms. To determine the soiling type, we use a webly supervised classification network (WebNN), which is discussed next.

4.4 WebNN: Webly Supervised Neural Network

Given an image of a solar panel with a soiling mask, WebNN determines the soiling type. WebNN utilizes a large amount of web data to train a soiling type classifier and is inspired by [7]. We first collected images (with and without solar panels) from the Internet by querying the most common soiling categories, such as dust (brown, gray, red, and black), white chalk powder, bird drop, snow, and crack. These images include soiling categories that were not available in our dataset. A 24-dimensional RGB histogram was extracted from each of these images as a feature vector. A small 3-layered neural network, with 50, 100, and 150 hidden neurons per layer respectively, was trained on these feature vectors to predict the soiling type.

To assign a label to the soiling area, we crop the RGB area (referred to as ROI in Figure 3) corresponding to the soiling area in the localization mask. A 24-dimensional RGB histogram is computed for this ROI as a feature vector, which is then classified using WebNN to predict the soiling type.
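The feature extraction step can be sketched as below. The split of the 24 dimensions into 8 bins per color channel and the normalization to unit sum are assumptions; the text only states that the histogram is 24-dimensional. The function name is hypothetical.

```python
import numpy as np

def rgb_histogram(image, roi_mask=None, bins_per_channel=8):
    """Concatenated per-channel histogram of an 8-bit RGB image.

    image:    uint8 array of shape (H, W, 3).
    roi_mask: optional boolean array of shape (H, W); only True pixels count.
    Returns a vector of length 3 * bins_per_channel (24 by default),
    normalized to sum to 1 (normalization is an assumption).
    """
    image = np.asarray(image)
    pixels = image[roi_mask] if roi_mask is not None else image.reshape(-1, 3)
    feats = []
    for channel in range(3):
        hist, _ = np.histogram(
            pixels[:, channel], bins=bins_per_channel, range=(0, 256)
        )
        feats.append(hist)
    feats = np.concatenate(feats).astype(float)
    total = feats.sum()
    return feats / total if total > 0 else feats

# A pure-red 4x4 patch: mass in the top R bin and the bottom G and B bins.
red_patch = np.zeros((4, 4, 3), dtype=np.uint8)
red_patch[..., 0] = 255
red_features = rgb_histogram(red_patch)
```

A color histogram discards spatial layout, which is one reason the same classifier can be trained on web images with and without panels and then applied to cropped soiling regions.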

5 Experiments and Results

We performed thorough experiments with various training and model choices on the PV-Net data. We compared classification and localization accuracies of our model, DeepSolarEye. We found that the proposed BiDIAF block improves the classification and localization capabilities of ResNet. We tested the generalizability of our method on images downloaded from the Internet and found that our method was extensible to incorporate the soiling types that were not present in our dataset.

5.1 Classification Models and Results


We trained and tested three classification models in a single-input, single-output (SISO) setting (panel image as input and power loss level as output). Our proposed method (ImpactNet) was compared with two alternatives. For the first alternative (referred to as ImpactNet-A), the BiDIAF block was removed. The resulting network is the same as ResNet [14] and has only one CNN branch, i.e. the main branch (Figure 6). For the second alternative (referred to as ImpactNet-B), we modified Eq. 2 of BiDIAF so that the input-aware feature maps were not shared with the main branch (Figure 7).



Figure 6: Analysis unit in ImpactNet-A
Figure 7: Analysis unit in ImpactNet-B
Figure 8: Different types of analysis units used in our experiments. Notations are the same as in Figure 3.

Our model (ImpactNet) was also trained and tested in a multiple-input, single-output (MISO) setting (panel image and environmental factors as inputs and power loss level as output). In our experiments, we used solar irradiance and the time of day from the image timestamp as environmental factors. To fuse these multiple inputs, we tried two alternatives: element-wise sum (ImpactNet-C) and concatenation (ImpactNet-D). We emphasize that solar irradiance is an important factor as it indirectly captures environmental conditions (such as cloudy and sunny), which influence the power loss [3].


We trained all of our models end-to-end for 90 epochs using SGD with an initial learning rate of 0.01 that was decayed by a factor of 10 after every 30 epochs, a momentum of 0.9, a weight decay of 0.0005, and a batch size of 32 on a single NVIDIA TitanX GPU. We used spatial dropout [30] with a dropout probability of 0.2 after every analysis unit (Figure 3). The PV-Net dataset (N=45,754) was split randomly into training (N=27,537) and validation (N=18,217) sets. We binned the normalized power loss into equal bins, with each bin representing a soiling impact level (or class). In our experiments, we varied the number of bins from 2 to 16. An inverse class probability weighting scheme was used in the loss function to address the class-imbalance issue. Further, we augmented the training data using standard augmentation techniques such as horizontal flips, vertical flips, and random rotations. We did not use any color-based augmentation strategies because the color and power generation capacity of the solar panel are directly influenced by environmental factors, such as sunlight. We would like to highlight that our classification approach was motivated by solar farm maintenance practices, where maintenance actions are categorized based on severity levels that are captured by the power loss bins.
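The label construction and schedule described above can be sketched in a few lines. This is an illustration under assumptions: the normalized power loss is taken to lie in [0, 1], the class weight is taken as the plain reciprocal of the empirical class probability, and all function names are hypothetical.

```python
import numpy as np

def power_loss_to_class(normalized_loss, num_bins):
    """Map normalized power loss in [0, 1] to one of num_bins equal-width bins."""
    loss = np.clip(np.asarray(normalized_loss, dtype=float), 0.0, 1.0)
    # A loss of exactly 1.0 is placed in the last bin.
    return np.minimum((loss * num_bins).astype(int), num_bins - 1)

def inverse_class_probability_weights(labels, num_bins):
    """Per-class loss weights equal to 1 / empirical class probability.

    Classes that never occur receive a weight of 0 (an assumption).
    """
    counts = np.bincount(labels, minlength=num_bins).astype(float)
    probs = counts / counts.sum()
    weights = np.zeros(num_bins)
    seen = probs > 0
    weights[seen] = 1.0 / probs[seen]
    return weights

def step_learning_rate(epoch, base_lr=0.01, decay=0.1, step=30):
    """Learning rate decayed by a factor of 10 after every 30 epochs."""
    return base_lr * decay ** (epoch // step)

demo_classes = power_loss_to_class([0.0, 0.1, 0.3, 0.6, 1.0], num_bins=4)
demo_weights = inverse_class_probability_weights(demo_classes, num_bins=4)
```

With 8 bins, each class spans 12.5 percentage points of power loss, so the second class covers losses from 12.5% to 25%.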

Weight initialization:

We trained ResNet-18 on a subset of our dataset (training and validation sets each having 2,000 images) for a 4-class classification task, with two different initialization strategies: (1) random weight initialization [13], and (2) fine-tuning a model that was trained on ImageNet. Both networks attained similar accuracy on the 4-class classification task. Our informal inspection of feature maps revealed that the network initialized with random weights paid attention to the dataset-specific features (see Figure 9). Therefore, we initialized the network weights randomly.

Figure 9: Visualization of feature maps at different spatial resolutions with two different weight initialization strategies. Random weight initialization helped in learning dataset-specific feature maps. For visualization, we scaled the feature maps to the same scale. Best viewed in color.


Classification results are presented in Table 1 and Table 2. We can make the following observations:

  • Convolutional block type: Replacing the VGG-type blocks with ResNet-type blocks improved the classification accuracy by about 7%.

  • Effect of BiDIAF: Replacing the analysis unit in the ImpactNet-A network (Figure 6) with the analysis unit of ImpactNet-B (Figure 7) improved the accuracy by about 1%. Further, when we replaced the analysis unit in the ImpactNet-A network with the analysis unit in Figure 3 (RCU + BiDIAF), the accuracy improved by about 2% (for both 8- and 16-class networks). The increase in classification accuracy with the BiDIAF unit is likely due to the fact that it promotes data-specific feature learning, even at low spatial resolutions (see Figure 4).

  • SISO vs MISO: By adding environmental factors as input, the accuracy of ImpactNet improved by about 1.3 percentage points for 8 classes and about 2.2 percentage points for 16 classes (ImpactNet vs. ImpactNet-D in Table 2). The improvement in accuracy is not drastic, suggesting that ImpactNet was able to learn the complex relationships between environmental factors and soiling that lead to power loss.

To further check the performance of our method (ImpactNet-D) in the real world, we tested it on data collected using our experimental setup over an additional 3 weeks. Our method attained an accuracy of 84.5% for 8 classes.

                 Convolutional block type
                 VGG-type [26]    ResNet-type [14]
w/o BiDIAF       73.8             80.03
w/ BiDIAF        75.4             82.02
Table 1: Top-1 accuracies (in %) of different types of convolutional blocks on our dataset (for 8 classes). The VGG-type block is the same as the ResNet-type block (Figure 8), except for the skip connection.

Setting   Model          2 classes   4 classes   8 classes   16 classes   # Params (in millions)
SISO      ImpactNet      97.56       93.39       82.02       68.43        1.97
SISO      ImpactNet-A    97.77       93.24       80.03       66.68        1.96
SISO      ImpactNet-B    97.61       93.18       80.99       67.88        1.97
MISO      ImpactNet-C    97.64       93.10       82.97       70.19        1.99
MISO      ImpactNet-D    97.82       93.28       83.32       70.59        1.99
Table 2: Top-1 accuracies (in %) of different models on our dataset.

5.2 Localization Results

Model and training details:

For localization experiments, we first computed the candidate masks using pyramid-based method, which were then refined using Mask FCNN. We used the same training and augmentation strategy as discussed in Section 5.1. Note that Mask FCNN learned about 2.12 million parameters.

Evaluation metrics:

We evaluated the localization performance of our method both subjectively and objectively. For subjective assessment, we computed a mean opinion score (MOS), while for objective assessment, we measured Jaccard Index (JI), a widely used metric for measuring the localization accuracy.

For measuring the MOS, we divided our entire dataset into non-overlapping intervals of 10 minutes and selected an image randomly from every such interval, resulting in 579 images. For each image, we asked users the following four questions to determine the localization accuracy:

  • How many regions of dust were present in the RGB image, but not in the localization mask?

  • How many regions of dust were detected in the localization mask, but not present in the RGB image?

  • On a scale of 0 to 10, rate the level of under-segmentation with 0 being perfectly segmented and 10 being fully under-segmented.

  • On a scale of 0 to 10, rate the level of over-segmentation with 0 being perfectly segmented and 10 being fully over-segmented.

If the area detected in a localization mask was half (or double) of the region of dust in an RGB image, then it was fully under-segmented (or over-segmented).

For measuring the JI, we selected a subset of 241 images out of 579 images. This subset includes all images where we noted high variance in the subjective assessment. These images were then annotated by participants using LabelMe [23]. We asked participants to annotate the dust regions on the solar panel image along with the dust type (such as brown and gray). These images were used as a ground truth for measuring the localization accuracy (JI).

Subjective assessment results:

For sufficient cultural, gender, and racial diversity, we used Amazon Mechanical Turk for conducting this experiment. A total of 172 unique users participated in our study, with each image being rated by 5 different users. MOS is shown in Figure 10.

  • Across all questions, overall MOS was lower than MOS for soiled (or dusty) panels; suggesting most of the mistakes were made on the soiled panels. However, MOS for soiled images was very low. On soiled images, our method was not able to locate on average 2 soiled patches (Q1) while falsely detecting about 0.5 soiled patches (Q2) per image.

  • MOS for under-segmented (Q3) and over-segmented (Q4) images was almost the same and close to 0 (on a scale of 0 to 10), suggesting that over- and under-segmentation were not severe.

Figure 10: Subjective assessment results. Best viewed in color.

Objective assessment results:

Table 3 compares the performance of three methods. First two methods used pyramid-based approach for generating the candidate masks while the third method was our end-to-end method, Mask FCNN. For a fair comparison between these models, we masked the background area (non-panel area) and did not consider it while measuring JI. From Table 3, we see that BiDIAF unit increased the JI of ImpactNet-A (or ResNet) by about 4%. This indicates that BiDIAF unit helped in learning input-specific features which resulted in good localization capabilities. Further, Mask FCNN improved the JI of pyramid-based method (with BiDIAF) by about 24%; suggesting joint learning enabled efficient aggregation of feature maps from the classification network.

Method                       JI (in %)
Pyramid-based, w/o BiDIAF    38
Pyramid-based, w/ BiDIAF     42
Mask FCNN                    66
Table 3: Objective assessment results
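For reference, the Jaccard index with the background masked out can be computed as in the sketch below. The function name, the boolean mask representation, and the convention of returning 1.0 when both masks are empty are assumptions.

```python
import numpy as np

def jaccard_index(pred, truth, panel_region=None):
    """Intersection over union of two binary soiling masks.

    pred, truth:  2D arrays, nonzero where soiling is predicted / annotated.
    panel_region: optional boolean array; pixels outside it (background)
                  are ignored when computing the index.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if panel_region is not None:
        keep = np.asarray(panel_region, dtype=bool)
        pred, truth = pred[keep], truth[keep]
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(np.logical_and(pred, truth).sum() / union)

pred = np.array([[1, 1, 0, 0]])
truth = np.array([[0, 1, 1, 0]])
panel = np.array([[False, True, True, True]])
ji_full = jaccard_index(pred, truth)
ji_panel = jaccard_index(pred, truth, panel)
```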

Webly supervised labeling results:

For measuring the labeling accuracy of WebNN, we used the same 241 images that were annotated by participants along with the type of dust. With webly supervised labeling, we achieved a classification accuracy of about 96.24%.

To test the flexibility of WebNN in the wild, we queried the Internet for solar panel images with a soiling type as a keyword (e.g. solar panel image with snow) and downloaded 150 images. Images of low quality (either overlaid with text or of insufficient resolution) were manually discarded, which left us with 50 images. Some of these 50 images contained multiple panels and were manually cropped to the panel area corresponding to the given keyword. These images were then fed into Mask FCNN to produce localization masks, and the localized regions were classified using WebNN. Our method attained an accuracy of about 87% on these web images.

DeepSolarEye was able to learn generalizable representations of different soiling types impacting the power loss. When trained on our dataset and tested on the web images, DeepSolarEye was able to localize the impact area (using Mask FCNN) and classify soiling type (using WebNN) even on the soiling types (e.g. snow, crack, and bird drop) that were not present in our dataset (Figure 1).

6 Application in Solar Farm Maintenance

DeepSolarEye provides enriched information about solar panel soiling and defects (soiling impact, soiling localization, and soiling type). Further, the soiling localization can be used to compute the soiling coverage area. Such information can be used for efficient solar farm monitoring and maintenance. Workforce management is one of the crucial tasks in solar farm management, especially when the farm is spread across several acres. Managing such a farm requires addressing two main questions: 1) How to clean? and 2) When to clean?

The first question can be answered using the soiling type, while the second can be answered using the soiling impact and the soiling coverage area (computed from the soiling localization mask). To achieve this, we built a simple decision tree based on the soiling impact, soiling localization, and soiling type. Some results of this decision tree are shown in Figure 1. In Figure 1(c), DeepSolarEye correctly identified the soiling type (dust) and suggested the correct cleaning type (air blow). Though the soiling impact is low (12.5% to 25%), the soiling coverage is about 30% of the actual panel area. Therefore, the suggested action was assigned a high cleaning priority. The low impact level was primarily due to environmental factors (cloudy day).
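
A decision rule of this kind can be sketched as follows. Only the dust to air blow pairing and the Figure 1(c) example (impact in the 12.5% to 25% bin, coverage of about 30%, high priority) come from the text; the thresholds, the fallback action, and all function names are hypothetical choices made for illustration and are not the decision tree used in the paper.

```python
# Hypothetical lookup: only the dust -> air blow pairing comes from the text.
CLEANING_BY_TYPE = {"dust": "air blow"}


def coverage_percent(mask):
    """Percentage of panel pixels flagged as soiled in a binary mask (2-D list)."""
    total = sum(len(row) for row in mask)
    soiled = sum(sum(row) for row in mask)
    return 100.0 * soiled / total


def recommend(soiling_type, impact_percent, coverage, impact_cut=25.0, coverage_cut=25.0):
    """Return (how_to_clean, priority). Both cut-offs are illustrative assumptions."""
    # "How to clean?" is answered from the soiling type alone.
    how = CLEANING_BY_TYPE.get(soiling_type, "manual inspection")  # assumed fallback
    # "When to clean?" is answered from impact and coverage together.
    if impact_percent >= impact_cut or coverage >= coverage_cut:
        priority = "high"
    else:
        priority = "low"
    return how, priority


# A toy mask with 3 of 10 panel pixels soiled, i.e. 30% coverage.
mask = [[1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0]]
print(recommend("dust", 18.75, coverage_percent(mask)))
```

With these assumed cut-offs, a dusty panel whose impact lies in the 12.5% to 25% bin but whose coverage is 30% is still given high priority, which mirrors the Figure 1(c) case where a cloudy day hides the effect of the soiling on power output.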

7 Conclusion

In this paper, we presented the first CNN-based application for the new domain of solar panel soiling and defect analysis. Our method, DeepSolarEye, takes an RGB image of a solar panel and environmental factors as inputs and predicts the power loss, soiling localization, and soiling category in real time. We proposed a four-stage methodology to train DeepSolarEye in a weakly supervised fashion that completely avoids manually labeled localization data. We introduced a novel BiDIAF block for superior localization capabilities. We further leveraged web-crawled data for categorizing the soiling type, which allowed the inclusion of new soiling types without re-training the Mask FCNN. Our empirical study suggests that the BiDIAF module improves the classification and localization capabilities of ResNet (or ImpactNet-A) by about 3% and 4%, respectively. Our end-to-end model yielded a further improvement of about 24% on the localization task when trained in a weakly supervised manner. Additionally, we constructed a new dataset for solar panel image analysis consisting of 45,000+ images.

Our classification model generalizes well to different domains. We found that the proposed BiDIAF unit improves the classification and localization capabilities of ResNet on the plant disease dataset [27] and the Cifar-10/100 dataset [16]. Please see Appendix C and Appendix D for more details.


The authors thank Mohit Jain for helping in designing the subjective assessment study.

Appendix A Experimental Set-up

We introduce a first-of-its-kind dataset, PV-Net, comprising 45,754 images of solar panels with their power loss. We bin the power loss into 8 bins (classes k) of equal size. The distribution of the data is shown in Figure 11.

Our experimental setup consists of two identical solar panels, which are kept side by side with an RGB camera facing the panels. Soiling experiments were conducted on the first panel (close to the camera), while the other panel was used as a reference. Images were captured every 5 seconds, and the corresponding power generated by the panels was also recorded (see Figure 12). Soiling impact is reported as the percentage power loss with respect to the reference panel.
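
The labeling described above can be written down as a short sketch. It assumes that the 8 equal-size bins span 0% to 100% (12.5% each, consistent with the 12.5% to 25% bin mentioned in Appendix B), that each bin is half-open at its upper boundary, and that readings are clamped to this range; the function names and the clamping are illustrative assumptions rather than the pipeline used to build PV-Net.

```python
def power_loss_percent(p_reference, p_experimental):
    """Soiling impact: percentage power loss relative to the reference panel."""
    if p_reference <= 0:
        raise ValueError("reference power must be positive")
    loss = 100.0 * (p_reference - p_experimental) / p_reference
    return min(max(loss, 0.0), 100.0)  # assumed: clamp noisy readings into [0, 100]


def impact_class(loss_percent, num_classes=8):
    """0-based index of the equal-width bin that contains loss_percent."""
    width = 100.0 / num_classes  # 12.5 for 8 classes
    # A loss of exactly 100% is folded into the last bin.
    return min(int(loss_percent // width), num_classes - 1)


# Reference panel produces 200 W, soiled panel 160 W: a 20% loss, second bin.
loss = power_loss_percent(200.0, 160.0)
print(loss, impact_class(loss))
```

Because both panels are identical and see the same irradiance, dividing by the reference power removes most of the time-of-day and weather variation from the label.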

Figure 11: Distribution of class labels for the 8-class case.
Figure 12: Graph showing the power generated by the reference and experimental panels throughout the day.

Appendix B Classification Error Analysis on the PV-Net dataset

Figure 15 visualizes the confusion matrix for the 8-class case. We can see that the majority of the mistakes are made with the neighboring classes. Since we binned the classes at fixed intervals, it is important to understand whether these mistakes are made near the bin boundary or towards the far end of the neighboring class. We introduced a relaxing variable that relaxes the boundaries of each impact level bin. For example, if the range of the bin is 12.5% to 25%, then after relaxation the range is widened at both boundaries by the relaxing variable. Figure 15 also shows a graph of the relaxing variable against accuracy. When we increased the relaxing variable from 0 to 0.01, the accuracy of our method increased from about 83% to about 88%, suggesting that the mistakes lie at the border of the bins and are tolerable.
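
One way to compute such a relaxed accuracy is sketched below. It assumes that power loss is expressed as a fraction in [0, 1] (so that a relaxation of 0.01 corresponds to one percentage point) and that the relaxation is added symmetrically to both boundaries of the predicted bin; both are assumptions made for illustration, as are the function and parameter names.

```python
def relaxed_accuracy(true_losses, predicted_classes, relax=0.0, num_classes=8):
    """Share of predictions whose bin, widened by `relax` on each side, contains the true loss.

    true_losses: power loss as fractions in [0, 1].
    predicted_classes: 0-based bin indices predicted by the classifier.
    """
    width = 1.0 / num_classes
    hits = 0
    for loss, k in zip(true_losses, predicted_classes):
        lower = k * width - relax        # relaxed lower boundary (assumed additive)
        upper = (k + 1) * width + relax  # relaxed upper boundary (assumed additive)
        if lower <= loss < upper:
            hits += 1
    return hits / len(true_losses)


# Toy data: the second sample (0.249) sits just below the 0.25 boundary
# but was predicted as the neighboring class 2.
losses = [0.13, 0.249, 0.30, 0.60]
predictions = [1, 2, 2, 2]
print(relaxed_accuracy(losses, predictions, relax=0.0))
print(relaxed_accuracy(losses, predictions, relax=0.01))
```

In the toy data, the near-boundary mistake is forgiven once the bins are relaxed, while the prediction that is far from its bin (0.60 predicted as class 2) still counts as an error.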



Figure 13: Relaxing variable vs. accuracy
Figure 14: Confusion matrix
Figure 15: Impact of relaxing the boundary conditions.

Appendix C Experiments with Plant Disease Dataset

To study how well our model generalizes to localization tasks in other domains, we experimented with the publicly available plant disease dataset. We trained and tested our model, ImpactNet, with and without BiDIAF on the plant disease classification task. In both cases, our model attained an accuracy comparable to that of the method proposed by [27]. However, on casual visual inspection, we found that ImpactNet without BiDIAF pays more attention to the leaf area (such as the midrib and veins), while with BiDIAF it pays more attention to the disease area (Figure 17). This suggests that BiDIAF has promising localization capabilities in other domains, which we intend to explore in detail in the future.

Figure 16: Feature map visualization with and without BiDIAF.
Figure 17: More feature map visualizations with BiDIAF unit on the plant disease dataset. Red circles denote the disease area.

Appendix D Results on the Cifar Dataset

To show the efficacy of the proposed BiDIAF block, we performed experiments on the Cifar image classification dataset. We trained our model using the same training strategy as in [14]. Experimental results are given in Table 4. We observed that the proposed BiDIAF unit improves the accuracy of ResNet by about 1% and 2.5% across different depth levels on Cifar-10 and Cifar-100 datasets respectively.

         Cifar-10                     Cifar-100
Depth    ResNet   ResNet w/ BiDIAF    ResNet   ResNet w/ BiDIAF
20       91.25    92.03               71.95    73.15
56       93.03    93.84               72.94    75.93
110      93.57    94.68               74.84    77.23
Table 4: Top-1 test accuracies (in %) of ResNet with and without the BiDIAF unit on the Cifar-10 and Cifar-100 datasets. The BiDIAF unit establishes a long-range connection between an input image and any convolutional layer, thereby promoting the learning of dataset-specific features and improving the flow of information inside the network.
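
The per-depth gains behind the summary figures can be recomputed directly from Table 4. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout and function name are illustrative.

```python
# Top-1 test accuracies (%) copied from Table 4: (ResNet, ResNet w/ BiDIAF).
TABLE_4 = {
    "Cifar-10": {20: (91.25, 92.03), 56: (93.03, 93.84), 110: (93.57, 94.68)},
    "Cifar-100": {20: (71.95, 73.15), 56: (72.94, 75.93), 110: (74.84, 77.23)},
}


def gains(dataset):
    """Accuracy gain in percentage points from adding BiDIAF, per depth."""
    return {
        depth: round(with_bidiaf - base, 2)
        for depth, (base, with_bidiaf) in TABLE_4[dataset].items()
    }


for name in TABLE_4:
    per_depth = gains(name)
    mean_gain = round(sum(per_depth.values()) / len(per_depth), 2)
    print(name, per_depth, "mean:", mean_gain)
```

The gains range from 0.78 to 1.11 percentage points on Cifar-10 and from 1.20 to 2.99 percentage points on Cifar-100, so the benefit of BiDIAF is larger on the harder 100-class task.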

Appendix E Qualitative Results on Solar Panel Images

Figures 18 and 19 depict the performance of our method on different solar panel images (from our dataset as well as the Internet). We can see that DeepSolarEye has good localization and soiling classification properties.

Figure 18: Images (from our dataset) depicting the performance of our method. Best viewed in color.
Figure 19: Images (from the Internet) depicting the performance of our method in the wild. Best viewed in color.


  1. IR-based systems exploit the overheating phenomenon, while RGB-based systems exploit the color information to locate the hot-spots.
  2. More details about the project can be found here: https://deep-solar-eye.github.io/.


  1. M. Aghaei, A. Gandelli, F. Grimaccia, S. Leva, and R. Zich. Ir real-time analyses for pv system monitoring by digital image processing techniques. In Event-based Control, Communication, and Signal Processing (EBCCSP), 2015 International Conference on, pages 1–6. IEEE, 2015.
  2. M. Aghaei, S. Leva, and F. Grimaccia. PV power plant inspection by image mosaicing techniques for IR real-time images. In 43rd IEEE Photovoltaic Specialists Conference (PVSC), pages 3100–3105, June 2016.
  3. B. Andò, S. Baglio, A. Pistorio, G. M. Tina, and C. Ventura. Sentinella: Smart monitoring of photovoltaic systems at panel level. IEEE Transactions on Instrumentation and Measurement, 64(8):2188–2199, 2015.
  4. V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE transactions on pattern analysis and machine intelligence, 2017.
  5. P. Burt and E. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on communications, 31(4):532–540, 1983.
  6. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  7. X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1431–1439, 2015.
  8. S. Dotenco, M. Dalsass, L. Winkler, T. Würzner, C. Brabec, A. Maier, and F. Gallwitz. Automatic detection and analysis of photovoltaic modules in aerial infrared imagery. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
  9. G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision, pages 519–534. Springer, 2016.
  10. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  11. B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
  12. K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. IEEE International Conference on Computer Vision, 2017.
  13. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  14. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  15. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  16. A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  17. G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. CVPR, 2017.
  18. M. R. Maghami, H. Hizam, C. Gomes, M. A. Radzi, M. I. Rezadad, and S. Hajighorbani. Power loss due to soiling on solar panel: A review. Renewable and Sustainable Energy Reviews, 59:1307 – 1316, 2016.
  19. M. Mani and R. Pillai. Impact of dust on solar photovoltaic (pv) performance: Research status, challenges and recommendations. Renewable and Sustainable Energy Reviews, 14(9):3124 – 3131, 2010.
  20. H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
  21. S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  22. O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  23. B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1):157–173, May 2008.
  24. T. Sarver, A. Al-Qaraghuli, and L. L. Kazmerski. A comprehensive review of the impact of dust on the use of solar energy: History, investigations, results, literature, and mitigation approaches. Renewable and Sustainable Energy Reviews, 22:698 – 733, 2013.
  25. E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(4):640–651, 2017.
  26. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  27. S. Sladojevic, M. Arsenovic, A. Anderla, D. Culibrk, and D. Stefanovic. Deep neural networks based recognition of plant diseases by leaf image classification. Computational intelligence and neuroscience, 2016, 2016.
  28. Solar Energy Industries Association(USA). Solar Market Insight. http://www.seia.org/research-resources/solar-market-insight-2015-q4, 2015.
  29. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  30. J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
  31. W. K. Yap, R. Galet, and K. C. Yeo. Quantitative analysis of dust and soiling on solar pv panels in the tropics utilizing image-processing methods. In Asia Pacific Solar Research Conference 2015. Australian PV Institute, 2015.
  32. F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.
  33. J. W. Zapata, M. A. Perez, S. Kouro, A. Lensu, and A. Suuronen. Design of a cleaning program for a pv plant based on analysis of energy losses. IEEE Journal of Photovoltaics, 5(6):1748–1756, Nov 2015.