Defect prediction with bad smells in code


Background. Defect prediction can be highly beneficial for software development projects, provided that it is effective and defect-prone areas are predicted correctly. One of the key elements of effective software defect prediction is the proper selection of metrics used for dataset preparation.

Objective. The purpose of this research is to verify whether code smell metrics, collected using the Microsoft CodeAnalysis tool and added to a basic metric set, can improve defect prediction in an industrial software development project.

Results. We verified whether extending the dataset with code smell metrics changes the effectiveness of defect prediction by comparing prediction results for datasets with and without code smell metrics. We observed only a small improvement in prediction effectiveness when the dataset extended with bad smell metrics was used: the average accuracy increased by 0.0091 and stayed within the margin of error. However, when only code smell metrics were used for prediction (without the basic set of metrics), the process yielded surprisingly high accuracy (0.8249) and F-measure (0.8286). We also describe the data anomalies and problems we observed when two different metric sources were used to prepare one consistent dataset.

Conclusion. Extending the dataset with code smell metrics does not significantly improve prediction effectiveness. The achieved result did not compensate for the effort needed to collect the additional metrics. However, we observed that defect prediction based on code smells alone is still highly effective and can be used especially where other metrics can hardly be used.

1 Introduction

Among the different aspects of the software defect prediction process, one of the key elements is the proper selection of metrics for preparing the training and verification datasets. The most popular data are source code metrics [6, 11], but other types of metrics are also considered effective in terms of defect prediction, such as design metrics [24], change metrics [21], mining metrics [22] or process metrics [18, 13].

1.1 Related work and goal

A separate group of design metrics are metrics based on code smells, also known as bad smells or code bad smells. The term was coined by Kent Beck in the late 1990s [1] and popularized by Martin Fowler in his book Refactoring: Improving the Design of Existing Code [5], in which Kent Beck co-authored the chapter on code smells.

Kent Beck explains the idea of code smells on his website: “Note that a Code Smell is a hint that something might be wrong, not a certainty. A perfectly good idiom may be considered a Code Smell because it’s often misused, or because there’s a simpler alternative that works in most cases. Calling something a Code Smell is not an attack; it’s simply a sign that a closer look is warranted.” [1]

Due to the nature of code smells described above, there is an ongoing discussion about whether code smells can be used effectively in quality assurance during code development [27, 26]. The major motivation for this research was to investigate whether code smells can improve software defect prediction.

In industrial software development, only Holschuh et al. have investigated the effectiveness of code smell metrics in the defect prediction process, for the Java programming language [7]. The authors are not aware of any use of code smell metrics for defect prediction in .NET-oriented industrial software projects. Thus, we decided to use the long-term defect prediction research project run in Volvo Group [9, 10] as an opportunity to conduct an experiment: we introduced bad smell based metrics into the prediction process and observed whether they improved prediction effectiveness:

RQ: How do code bad smell based metrics impact defect prediction in an industrial software development project?

1.2 Research environment: Industrial software development project

The study was conducted on the development of PROSIT+, a critical industrial system used in Volvo Group vehicle factories. It is built on a client-server architecture. The main functions of the PROSIT+ system are programming, testing, calibration and electrical assembly verification of Electronic Control Units (ECUs) in Volvo’s vehicle production process.

The PROSIT+ system consists of a few coexisting applications. The most important one, the desktop application “PROSIT Operator”, communicates in real time with a mobile application, running on a palmtop computer used by vehicle factory workers, to transfer all production-related information to a local server. The server is responsible for the storage and distribution of configuration-, system- and product-related data. Such communication can generate extremely heavy data transfer loads in large factories, where more than 100 mobile applications are used. Other applications include “PROSIT Designer”, “PROSIT Factory Manager” and the web application “PROSIT Viewer”. All of them are also connected to the same server.

The development of each PROSIT+ version lasts one year. After this period, the software is released to the end user. As this period is tied to the factory production cycle, the release can be neither brought forward nor postponed.

All applications within the PROSIT+ system were developed using Microsoft .NET technology, with Microsoft Visual Studio as the integrated development environment. Microsoft Team Foundation Server was used for version control. Before the release of version 11 of the PROSIT+ system, IBM ClearQuest was used for software defect management. Since the development of version 11, Team Foundation Server has been used for defect tracking.

The project is free of the bottlenecks described by Hryszko and Madeyski [8], which could hinder or prevent the application of a defect prediction process. However, we observed a relatively high number of naming issues in the project. We attribute this mainly to the high maturity of the software system: naming conventions have changed over time. We consider naming issues a negligible problem and exclude them from further investigation.

2 Research process

Defect prediction was already an ongoing process in the investigated project. It used the SourceMonitor software as the metric source and, as the prediction tool, the KNIME-based DePress Extensible Framework proposed by Madeyski and Majchrzak [19]. This tool, based on KNIME [17], provides a wide range of data-mining techniques, including defect prediction, for various IT projects, independently of the technology and programming language used. We also use KNIME/DePress in our research.

To investigate the possible impact of code smell metrics on defect prediction, we followed this plan:

  1. Generate metrics from SourceMonitor;

  2. Generate code smells metrics from CodeAnalysis;

  3. Parse results from CodeAnalysis and merge them with metrics from SourceMonitor;

  4. Link check-ins to defects;

  5. Link classes from check-ins to defects (the assumption is that if a class was changed while fixing a defect, that class was partially or fully responsible for that defect);

  6. Merge list of classes with merged metrics from CodeAnalysis and SourceMonitor;

  7. Test combinations of different software defect prediction approaches to select the optimal prediction set-up for evaluation purposes;

  8. Divide PROSIT+ code into 20 sub-modules and run prediction model training and evaluation using data from each module separately;

  9. Collect and interpret the results.
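
Steps 4 and 5 of the plan can be illustrated with a minimal sketch. It is not the tooling used in the project: the `CheckIn` record layout and all names are hypothetical, and the sketch assumes that each check-in already carries the identifiers of the defects it fixes.

```python
from dataclasses import dataclass, field


@dataclass
class CheckIn:
    """Hypothetical check-in record; the real version-control schema may differ."""
    checkin_id: int
    defect_ids: list = field(default_factory=list)       # defects fixed by this check-in
    changed_classes: list = field(default_factory=list)  # classes touched by this check-in


def defect_prone_classes(checkins):
    """Return the set of classes changed in at least one defect-fixing check-in."""
    prone = set()
    for checkin in checkins:
        if checkin.defect_ids:                     # step 4: check-in linked to a defect
            prone.update(checkin.changed_classes)  # step 5: its classes are marked
    return prone


history = [
    CheckIn(1, defect_ids=[101], changed_classes=["Class1", "Class2"]),
    CheckIn(2, defect_ids=[], changed_classes=["Class3"]),  # no defect attached
    CheckIn(3, defect_ids=[102], changed_classes=["Class2", "Class4"]),
]
print(sorted(defect_prone_classes(history)))
```

Every class returned by `defect_prone_classes` would be labeled defect-prone in the dataset; all remaining classes would be labeled defect-free.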

2.1 SourceMonitor as basic metrics source

The defect prediction process in PROSIT+ is based on metrics gathered using the SourceMonitor tool [12]. The tool performs static code analysis on complete files and extracts 24 different kinds of metrics. Example metrics are:

  • Lines of code,

  • Methods per class,

  • Percentage of comments,

  • Maximum Block Depth,

  • Average Block Depth.

2.2 CodeAnalysis tool as code smells metrics source

In our experiment, we decided to use the Microsoft CodeAnalysis tool to gather code smell metrics. The primary deciding factor was cost: the CodeAnalysis tool is delivered as part of the Microsoft Visual Studio software development suite for .NET-based projects. Thus, introducing the tool into the investigated software development project incurred no additional cost.


CodeAnalysis for managed code analyzes managed assemblies and reports information about the assemblies, such as violations of the programming and design rules set forth in the Microsoft .NET Framework Design Guidelines [20].

According to the documentation, there are approximately two hundred rules in CodeAnalysis [20], triggering 11 kinds of warnings (Table 1). The tool can be run from the command line, and the results are then stored in an .xml file that can later be parsed and analyzed further.

Bad smell warning  Area covered
Design             Correct library design as specified by the .NET Framework Design Guidelines
Globalization      World-ready libraries and applications
Interoperability   Interaction with COM clients
Maintainability    Library and application maintenance
Mobility           Efficient power usage
Naming             Adherence to the naming conventions of the .NET Framework Design Guidelines
Performance        High-performance libraries and applications
Portability        Portability across different platforms
Reliability        Library and application reliability, such as correct memory and thread usage
Security           Safer libraries and applications
Usage              Appropriate usage of the .NET Framework
Table 1: Bad smell warnings in CodeAnalysis
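
Turning such a report into per-class metrics amounts to counting warnings by category. The sketch below illustrates the idea only: the XML layout in `SAMPLE_REPORT` is a simplified, hypothetical stand-in and not the actual CodeAnalysis report schema, so the element and attribute names would have to be adapted to the real file.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Simplified, hypothetical report layout; the real CodeAnalysis schema differs.
SAMPLE_REPORT = """
<Report>
  <Type Name="Class1">
    <Message Category="Design"/>
    <Message Category="Performance"/>
    <Message Category="Design"/>
  </Type>
  <Type Name="Class2">
    <Message Category="Security"/>
  </Type>
</Report>
"""


def warnings_per_class(xml_text):
    """Count the warnings of each category for every class in the report."""
    counts = defaultdict(Counter)
    root = ET.fromstring(xml_text.strip())
    for type_node in root.iter("Type"):
        for message in type_node.iter("Message"):
            counts[type_node.get("Name")][message.get("Category")] += 1
    return counts


metrics = warnings_per_class(SAMPLE_REPORT)
print(dict(metrics["Class1"]))
```

Each warning category from Table 1 then becomes one numeric metric per class, with a value of zero when no warning of that kind was reported.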

3 Results

We conducted our experiment by following the plan presented in the previous section. Here we present the results.

3.1 Automatically generated code: observed anomaly, cause and solution

After analyzing the relation between the number of reported code smell issues and file length for the complete software system, in datasets prepared from CodeAnalysis and SourceMonitor output, we observed that different numbers of issues were reported for the same large file length values (Figure 1). As the software contains only a small number of large files, we interpreted this as an anomaly: different total numbers of code bad smell issues were reported for the same files. After investigation, we found that in the investigated system, files with more than 1000 lines of code (LOC) are in most cases generated automatically and contain more than one class per file, while the CodeAnalysis tool calculates the number of issues per class. This discrepancy produced the abnormal relation between the number of issues and file length: different issue counts were collected for the same LOC value, because they were calculated for different classes located in the same file and identified by the same LOC value.

As automatically generated code files exist only for installation and deployment purposes, are not covered by tests and are not reachable by end users of the system, we treated them as a source of noise and removed them from further analysis. The relation between the number of issues and file length improved after that step (Figure 2).

Figure 1: Anomalies in number of issues metric per file length (measured in LOC) relation, introduced by automatically generated code, later removed from analysis
Figure 2: Number of issues metric per file length (measured in LOC) relation for investigated software, with automatically generated code removed

3.2 Metrics breakdown difference: problem and solution

After a thorough investigation of the above problem, we found that the different values of the issue count metric for the same LOC value were caused by the different granularity of metrics in the two tools used to generate the datasets: CodeAnalysis gathers data per class, while SourceMonitor gathers data per file. When the results from the two tools were merged into a single dataset, SourceMonitor metrics, fixed for each file, were artificially divided among the classes in the file (Table 2).

To counteract the metric anomalies described in section 3.1, as well as the possible introduction of noise into the training dataset, we changed the approach and rearranged the datasets into a layout with one file per record. To achieve this, metrics gathered by CodeAnalysis had to be aggregated (summed; Table 3).

File      Class   SourceMonitor LOC  CodeAnalysis Issues
File1.cs  Class1  33                 3
File1.cs  Class2  33                 20
File1.cs  Class3  33                 6
File2.cs  Class4  30                 15
Table 2: Example of dataset from first approach: single class per record (SourceMonitor metrics are artificially divided among the classes in a file)

File      Class      SourceMonitor LOC  CodeAnalysis Issues
File1.cs  Class1…3   100                29
File2.cs  Class4     30                 15
Table 3: Example of dataset from second approach: single file per record (CodeAnalysis metrics are artificially summed per file)
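
The aggregation behind Table 3 can be sketched as follows, using the example values from Tables 2 and 3. The function and variable names are illustrative assumptions rather than part of the original tooling.

```python
from collections import defaultdict

# Class-level CodeAnalysis records: (file, class, issues), values from Table 2.
code_analysis = [
    ("File1.cs", "Class1", 3),
    ("File1.cs", "Class2", 20),
    ("File1.cs", "Class3", 6),
    ("File2.cs", "Class4", 15),
]
# File-level SourceMonitor records: file -> LOC, values from Table 3.
source_monitor = {"File1.cs": 100, "File2.cs": 30}


def merge_per_file(class_records, file_loc):
    """Sum class-level issue counts per file and attach the file-level LOC."""
    issues = defaultdict(int)
    for file_name, _class_name, issue_count in class_records:
        issues[file_name] += issue_count
    return {name: {"loc": loc, "issues": issues[name]}
            for name, loc in file_loc.items()}


merged = merge_per_file(code_analysis, source_monitor)
print(merged["File1.cs"])  # {'loc': 100, 'issues': 29}
```

Keeping the file as the record key avoids splitting file-level SourceMonitor values across classes, at the price of losing the per-class detail of the CodeAnalysis warnings.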

3.3 Optimal prediction mechanism selection

To choose the optimal prediction mechanism, we tested combinations of different classifiers, feature selection algorithms and class balancing options (Table 4) against two datasets: with and without the code bad smell metrics collected by the CodeAnalysis tool.

We used the SMOTE algorithm [4] to balance the defective and non-defective classes.
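
The core idea of SMOTE is to create synthetic minority-class records by interpolating between a minority record and one of its nearest minority neighbours. The sketch below is a simplified variant for illustration (it picks the base record at random instead of iterating over all minority records) and is not the KNIME implementation used in the experiment; the parameter names `k` and `seed` are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic records by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base record (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


defective = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
new_records = smote(defective, n_synthetic=6, k=2)
print(len(new_records))  # 6
```

Because every synthetic record lies on a segment between two existing minority records, oversampling adds no values outside the range already observed for the defective class.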

Some of the available metrics should seemingly have little impact on the presence of true software defects, e.g. the Mobility warning (efficient power usage; Table 1). To select the most important metrics, we used two feature selection algorithms: KNIME’s built-in reversed (backward) elimination greedy algorithm [16] and the simulated annealing meta-heuristic by Kirkpatrick et al. [15] in the form proposed by Brownlee [3].
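
A greedy backward elimination can be sketched as below. This is an illustration of the general technique under stated assumptions, not the KNIME node itself: `score` stands for any function that evaluates a feature subset (for example, the F-measure of a model trained on it), and `toy_score` with its weights is entirely made up.

```python
def backward_elimination(features, score):
    """Greedily drop one feature at a time; return the best-scoring subset seen."""
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > 1:
        # Try removing each remaining feature and keep the least harmful removal.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score)
        current_score = score(current)
        if current_score >= best_score:
            best_subset, best_score = list(current), current_score
    return best_subset, best_score


# Hypothetical scoring: two informative metrics, every other metric adds noise.
USEFUL = {"LOC": 0.5, "Design": 0.3}


def toy_score(subset):
    return sum(USEFUL.get(name, -0.05) for name in subset)


subset, value = backward_elimination(["LOC", "Design", "Mobility", "Naming"], toy_score)
print(subset)  # ['LOC', 'Design']
```

Simulated annealing explores the same search space but accepts occasional worse subsets, which lets it escape local optima that a purely greedy elimination cannot leave.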

As classifiers, we used Naive Bayes and the Probabilistic Neural Network (PNN), both popular in defect prediction studies [6, 21, 14, 23], as well as the Random Forest classifier [2].

The results of testing combinations of the above machine learning elements are presented in Table 5. The two datasets, with and without code bad smell metrics, were each divided using stratified sampling into two equal subsets, for training and evaluation. Prediction models were evaluated using the F-measure [25].

The highest F-measure value (0.9713) was observed for the dataset with code bad smell metrics, when the SMOTE algorithm and the reversed elimination feature selection mechanism were used to select the optimal subset for training and evaluation of the Random Forest classifier. This combination was selected for the final evaluation of code smell based metrics in the defect prediction process.

Classifier     Feature selection    SMOTE    Bad smells metrics
Naive Bayes    None                 With     Present
Random Forest  Elimination          Without  Absent
PNN            Simulated annealing
Table 4: Combinations of different approaches

                                         Without bad smells metrics        With bad smells metrics
Classifier     SMOTE  Feature selection  F-meas.  TP    FP   TN    FN      F-meas.  TP    FP   TN    FN
Naive Bayes    No     Annealing          0.1318   23    161  3330  142     0.1149   15    81   3410  150
Naive Bayes    No     Elimination        0        0     1    3490  165     0        0     0    3491  165
Naive Bayes    No     None               0.1314   31    276  3215  134     0.1389   40    371  3120  125
Naive Bayes    Yes    Annealing          0.5474   1497  482  3008  1994    0.586    1695  599  2891  1796
Naive Bayes    Yes    Elimination        0.5961   1736  598  2892  1755    0.6106   1807  621  2869  1684
Naive Bayes    Yes    None               0.5424   1476  475  3015  2015    0.5775   1657  591  2899  1834
PNN            No     Annealing          0        0     0    3491  165     0        0     0    3491  165
PNN            No     Elimination        0        0     0    3491  165     0        0     0    3491  165
PNN            No     None               0        0     0    3491  165     0        0     0    3491  165
PNN            Yes    Annealing          0.7313   2187  303  3187  1304    0.7582   2335  333  3157  1156
PNN            Yes    Elimination        0        0     0    3491  165     0.8051   2568  320  3170  923
PNN            Yes    None               0.7253   2147  282  3208  1344    0.7333   2204  316  3174  1287
Random Forest  No     Annealing          0.0963   9     13   3478  156     0.1538   15    15   3476  150
Random Forest  No     Elimination        0.1587   15    9    3482  150     0.1405   13    7    3484  152
Random Forest  No     None               0.1429   14    17   3474  151     0.1538   15    15   3476  150
Random Forest  Yes    Annealing          0.9551   3364  189  3301  127     0.9696   3407  130  3360  84
Random Forest  Yes    Elimination        0.9654   3390  142  3348  101     0.9713   3435  147  3343  56
Random Forest  Yes    None               0.9601   3383  173  3317  108     0.9696   3407  130  3360  84
Table 5: Results for optimal prediction set-up selection (defect-prone class)
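
The F-measure values in Table 5 follow directly from the reported confusion-matrix counts. As a small check, the sketch below recomputes the two Random Forest results (SMOTE, reversed elimination) from their TP, FP and FN values; the function name is our own choice.

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) for the defect-prone class."""
    if tp == 0:
        return 0.0  # no true positives: reported as 0 in Table 5
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Random Forest with SMOTE and reversed elimination (Table 5).
without_smells = f_measure(tp=3390, fp=142, fn=101)
with_smells = f_measure(tp=3435, fp=147, fn=56)
print(round(without_smells, 4), round(with_smells, 4))  # 0.9654 0.9713
```

The difference between the two values is about 0.0059, the gap discussed in section 4.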

3.4 Datasets evaluation: CodeAnalysis (bad smells metrics) against SourceMonitor

For the final evaluation of whether code bad smell based metrics could be valuable for defect prediction, we divided all available code in the considered industrial software development project into 20 smaller sub-modules of similar size (ca. 700 records each after SMOTE oversampling). Greater fragmentation of the system’s code was not technically possible. For each sub-module we collected metrics using SourceMonitor and/or CodeAnalysis to create different datasets:

  • 20 datasets of SourceMonitor metrics only;

  • 20 datasets of CodeAnalysis (code smells) metrics only;

  • 20 datasets of combined metrics: SourceMonitor + CodeAnalysis.

Additionally, we tested each kind of dataset with and without the feature selection (FS) process. During the evaluation, we collected Accuracy and Cohen’s kappa for overall results (Table 6), and F-measure and Recall for the defect-prone class (Table 7).

Dataset                                  Measure        Mean    Std. deviation
SourceMonitor without FS                 Accuracy       0.9422  0.0187
SourceMonitor without FS                 Cohen’s kappa  0.8844  0.0374
CodeAnalysis without FS                  Accuracy       0.676   0.0451
CodeAnalysis without FS                  Cohen’s kappa  0.3518  0.0904
SourceMonitor + CodeAnalysis without FS  Accuracy       0.9487  0.0226
SourceMonitor + CodeAnalysis without FS  Cohen’s kappa  0.8973  0.0453
SourceMonitor with FS                    Accuracy       0.97    0.0122
SourceMonitor with FS                    Cohen’s kappa  0.9399  0.0245
CodeAnalysis with FS                     Accuracy       0.8249  0.059
CodeAnalysis with FS                     Cohen’s kappa  0.6497  0.1180
SourceMonitor + CodeAnalysis with FS     Accuracy       0.9791  0.0135
SourceMonitor + CodeAnalysis with FS     Cohen’s kappa  0.9582  0.027
Table 6: Final results of datasets evaluation

Dataset                                  Measure    Mean    Std. deviation
SourceMonitor without FS                 Recall     0.9608  0.0278
SourceMonitor without FS                 F-measure  0.9433  0.0188
CodeAnalysis without FS                  Recall     0.666   0.2961
CodeAnalysis without FS                  F-measure  0.6447  0.1157
SourceMonitor + CodeAnalysis without FS  Recall     0.9637  0.0303
SourceMonitor + CodeAnalysis without FS  F-measure  0.9494  0.0228
SourceMonitor with FS                    Recall     0.9824  0.0146
SourceMonitor with FS                    F-measure  0.9704  0.012
CodeAnalysis with FS                     Recall     0.8424  0.0542
CodeAnalysis with FS                     F-measure  0.8286  0.0559
SourceMonitor + CodeAnalysis with FS     Recall     0.9859  0.0206
SourceMonitor + CodeAnalysis with FS     F-measure  0.9792  0.0136
Table 7: Measures for records marked as defect-prone
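
For reference, the two overall measures in Table 6 can be computed from a confusion matrix as sketched below. The function name and the example counts are illustrative assumptions; they are not data from the experiment.

```python
def accuracy_and_kappa(tp, fp, tn, fn):
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix."""
    total = tp + fp + tn + fn
    observed = (tp + tn) / total  # observed agreement, i.e. accuracy
    # Agreement expected by chance, from the actual and predicted class totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up, perfectly balanced example: 50 defective and 50 defect-free records.
accuracy, kappa = accuracy_and_kappa(tp=45, fp=5, tn=45, fn=5)
print(round(accuracy, 4), round(kappa, 4))  # 0.9 0.8
```

When both classes are equally frequent, the chance agreement is 0.5 and kappa reduces to 2 × accuracy − 1. The means in Table 6 follow this pattern closely (for example, 2 × 0.9422 − 1 = 0.8844), which is what one would expect for datasets balanced with SMOTE.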

3.5 Threats to validity

Conclusion validity. In our research, we tested 20 datasets collected from different software modules. More research using larger datasets collected from different sources is needed to confirm our findings.

Internal validity. We aggregated CodeAnalysis metrics for each file by summing the metrics collected for each class. This solution was introduced to solve the metric granularity mismatch and make the combination of the two metric sources possible; however, it could have affected the final result of our research.

External validity. Our research is based only on metrics gathered from one software development project. Although we were able to collect 34 different kinds of metrics for 20 different program modules, we were still constrained by a single environment: the development team and its programming habits, programming language, tools used, etc. For this reason, more research is needed to verify our findings in other software development environments (contexts).

4 Discussion

When selecting the optimal defect prediction set-up for verifying whether code smell based metrics can improve prediction results, we observed that the best result was achieved for the dataset with bad smell metrics included (F-measure = 0.9713). However, for the same set-up without code smell metrics, the F-measure was lower by only 0.0059 (Table 5), which makes the difference between the SourceMonitor-only and the combined SourceMonitor + CodeAnalysis results negligible. The final results collected from 20 different software sub-modules confirmed this: the average accuracy for prediction based on the dataset built from both sources was only 0.0091 higher than for SourceMonitor-only metrics (average F-measure difference = 0.0088), while the standard deviation was 0.0136. Also worth noting is the drop in CodeAnalysis-only prediction results when the feature selection (FS) process was removed from the experimental set-up.
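
The figures quoted above can be reproduced from the means reported in Tables 6 and 7 for the set-ups with feature selection. The snippet is only an arithmetic check; the variable names are our own.

```python
# Mean values with feature selection, taken from Tables 6 and 7.
mean_accuracy = {"SourceMonitor": 0.9700, "SourceMonitor + CodeAnalysis": 0.9791}
mean_f_measure = {"SourceMonitor": 0.9704, "SourceMonitor + CodeAnalysis": 0.9792}
f_measure_std_combined = 0.0136  # Table 7, SourceMonitor + CodeAnalysis with FS

accuracy_gain = round(
    mean_accuracy["SourceMonitor + CodeAnalysis"] - mean_accuracy["SourceMonitor"], 4
)
f_measure_gain = round(
    mean_f_measure["SourceMonitor + CodeAnalysis"] - mean_f_measure["SourceMonitor"], 4
)
print(accuracy_gain, f_measure_gain)            # 0.0091 0.0088
print(f_measure_gain < f_measure_std_combined)  # True
```

Both gains are smaller than one standard deviation of the combined result across the 20 sub-modules, which is the basis for treating the improvement as negligible.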

The results of our experiment show that extending the basic (SourceMonitor-based) dataset with CodeAnalysis metrics has, in our opinion, an insignificant impact on the effectiveness of defect prediction: even though the effectiveness measures are slightly higher, they stay within the margin of error. However, when only CodeAnalysis-based metrics were used for prediction (without the basic set of SourceMonitor-based metrics), the process yielded high accuracy (0.8249) and F-measure (0.8286).

Thus, answering the research question (How do code bad smell based metrics impact defect prediction in an industrial software development project?), we state that in an industrial environment such as the PROSIT+ software development project, the impact of code bad smell based metrics is negligibly small, and adding CodeAnalysis-based metrics should not be considered useful, because the additional effort needed to introduce code smell based metrics into the software defect prediction process is not compensated by the relatively small increase in prediction effectiveness.

However, we observed surprisingly high prediction effectiveness when the dataset based on CodeAnalysis only was used. We believe that code bad smells can be used effectively for defect prediction especially where other metrics are not available, or where computing power is insufficient to handle large sets of different metrics: SourceMonitor provides 24 kinds of metrics, while the CodeAnalysis metric set used in our research contained only 11. Given these promising results, the use of metrics based only on code bad smells in defect prediction should be investigated further.


References

  1. Beck, K.: Code Smell (2016), accessed: 2016.05.08
  2. Breiman, L.: Random Forests. Machine Learning pp. 5–32 (2001)
  3. Brownlee, J.: Clever Algorithms. Nature-Inspired Programming Recipes. Jason Brownlee (2011)
  4. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research pp. 321–357 (2002)
  5. Fowler, M., Beck, K., Brant, J., Opdyke, W., Roberts, D.: Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional (2006)
  6. Hall, T., Beecham, S., Bowes, D., Gray, D., Counsell, S.: A Systematic Literature Review on Fault Prediction Performance in Software Engineering. IEEE Transactions on Software Engineering 38(6), 1276–1304 (2012)
  7. Holschuh, T., Pauser, M., Herzig, K., Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects in SAP Java code: An experience report. In: ICSE-Companion 2009. 31st International Conference on Software Engineering. pp. 172–181 (2009)
  8. Hryszko, J., Madeyski, L.: Bottlenecks in Software Defect Prediction Implementation in Industrial Projects. Foundations of Computing and Decision Sciences 40(1), 17–33 (2015)
  9. Hryszko, J., Madeyski, L.: Assessment of the Software Defect Prediction Cost Effectiveness in an Industrial Project. In: Software Engineering: Challenges and Solutions, Advances in Intelligent Systems and Computing, vol. 504, pp. 77–90. Springer (2017)
  10. Hryszko, J., Madeyski, L., Samlik, R.: Application of Defect Prediction-Driven Quality Assurance Methodology in Industrial Software Development Project (2016), pre-print
  11. Jaechang, N.: Survey on Software Defect Prediction (2014), HKUST PhD Qualifying Examination
  12. Jim Holmes: SourceMonitor Site (2016), accessed: 2016.05.06
  13. Jureczko, M., Madeyski, L.: A Review of Process Metrics in Defect Prediction Studies. Metody Informatyki Stosowanej 30(5), 133–145 (2011)
  14. Khoshgoftaar, T.M., Pandya, A.S., Lanning, D.L.: Application of Neural Networks for Predicting Faults. Annals of Software Engineering 1(1), 141–154 (1995)
  15. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220(4598), 671–680 (1983)
  16. KNIME.COM AG: Backward Feature Elimination (2016), accessed: 2016.06.28
  17. KNIME.COM AG: KNIME Framework Documentation (2016), accessed: 2016.01.06
  18. Madeyski, L., Jureczko, M.: Which Process Metrics Can Significantly Improve Defect Prediction Models? An Empirical Study. Software Quality Journal 23(3), 393–422 (2015)
  19. Madeyski, L., Majchrzak, M.: Software Measurement and Defect Prediction with Depress Extensible Framework. Foundations of Computing and Decision Sciences pp. 249–270 (2014)
  20. Microsoft: Code Analysis for Managed Code Overview (2016), accessed: 2016.05.06
  21. Moser, R., Pedrycz, W., Succi, G.: A Comparative Analysis of The Efficiency of Change Metrics and Static Code Attributes for Defect Prediction. In: Software Engineering, 2008. ICSE ’08. ACM/IEEE 30th International Conference on. pp. 181–190 (2008)
  22. Nagappan, N., Ball, T., Zeller, A.: Mining Metrics to Predict Component Failures. In: Proceedings of the 28th International Conference on Software Engineering. pp. 452–461 (2006)
  23. Selby, R.W., Porter, A.: Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis. IEEE Transactions on Software Engineering 14(12), 1743–1756 (1988)
  24. Succi, G., Pedrycz, W., Stefanovic, M., Miller, J.: Practical Assessment of the Models for Identification of Defect-Prone Classes in Object-Oriented Commercial Systems Using Design Metrics. Journal of Systems and Software 65(1), 1–12 (2003)
  25. Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)
  26. Zhang, M., Hall, T., Baddoo, N.: Code Bad Smells: a review of current knowledge. Journal of Software Maintenance and Evolution: Research and Practice pp. 179–202 (2011)
  27. Zhang, M., Hall, T., Baddoo, N., Wernick, P.: Do bad smells indicate “trouble” in code? In: DEFECTS ’08 Proceedings of the 2008 workshop on Defects in large software systems. pp. 43–44. ACM (2008)