
Early studies on bug prediction focused on software complexity measured in lines of code. We aim to predict whether a particular file associated with a change is buggy or not. Traditionally, such techniques follow four steps:

1) Training Data Extraction: for each change, label it as buggy or clean by mining the project's revision history and issue tracking system. A buggy change contains one or more bugs, while a clean change contains none.

2) Feature Extraction: extract the values of various features from each change. Many different features have been used in past change classification studies.

3) Model Learning: build a model using a classification algorithm on the labeled changes and their corresponding features.

4) Model Application: for a new change, extract the values of the same features and feed them to the learned model to predict whether the change is buggy or clean.

The 14 basic change measures are:

Name     Description
NS       Number of modified subsystems
ND       Number of modified directories
NF       Number of modified files
Entropy  Distribution of modified code across the changed files
LA       Lines of code added
LD       Lines of code deleted
LT       Lines of code in a file before the change
FIX      Whether or not the change is a defect fix
NDEV     Number of developers that changed the modified files
AGE      Average time interval between the last and the current change
NUC      Number of unique changes to the modified files
EXP      Developer experience
REXP     Recent developer experience
SEXP     Developer experience on the subsystem
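To make this representation concrete, the following is a minimal sketch, assuming Python, of one change encoded as the 14 measures above; the class name, field names, and helper are hypothetical and only illustrate the data layout.

from dataclasses import dataclass, astuple

@dataclass
class ChangeMeasures:
    """One code change described by the 14 basic measures (hypothetical field names)."""
    ns: int        # number of modified subsystems
    nd: int        # number of modified directories
    nf: int        # number of modified files
    entropy: float # distribution of modified code across the changed files
    la: int        # lines of code added
    ld: int        # lines of code deleted
    lt: int        # lines of code in the file before the change
    fix: int       # 1 if the change is a defect fix, else 0
    ndev: int      # developers that changed the modified files
    age: float     # average interval between the last and the current change
    nuc: int       # unique changes to the modified files
    exp: float     # developer experience
    rexp: float    # recent developer experience
    sexp: float    # developer experience on the subsystem

    def to_vector(self):
        """Return the 14 measures as a plain list, ready for a classifier."""
        return list(astuple(self))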


The framework mainly contains two phases: a model building phase and a prediction phase. In the model building phase, we aim to build a classifier from historical changes with known labels using deep learning techniques. In the prediction phase, we use the trained model to predict whether an incoming change is buggy or not. Our framework extracts the 14 basic features described above from a training set of changes. We perform data pre-processing to normalize and resample the data; because clean changes greatly outnumber buggy changes, we apply random undersampling. A deep learning network, the Deep Belief Network (DBN), is then used to generate and integrate more advanced features, each a combination of the initial features. Based on these advanced features, we build a logistic regression model to predict whether a change is buggy or not; a sketch of the whole pipeline follows.
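The following is a minimal sketch of this pipeline, assuming scikit-learn and NumPy; it uses BernoulliRBM as a stand-in for the DBN layers, the random_undersample helper is hypothetical, and all layer sizes and hyperparameters are arbitrary choices, not the authors' settings.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class (clean) changes until both classes are the same size."""
    rng = np.random.default_rng(seed)
    buggy = np.where(y == 1)[0]
    clean = np.where(y == 0)[0]
    minority, majority = (buggy, clean) if len(buggy) < len(clean) else (clean, buggy)
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]

def build_and_train(X, y):
    """X: n_changes x 14 matrix of basic measures; y: 1 = buggy, 0 = clean (NumPy arrays)."""
    X_bal, y_bal = random_undersample(X, y)
    model = Pipeline([
        ("scale", MinMaxScaler()),  # normalize each measure to [0, 1]
        ("rbm1", BernoulliRBM(n_components=20, learning_rate=0.05, n_iter=30, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=12, learning_rate=0.05, n_iter=30, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return model.fit(X_bal, y_bal)

# Prediction phase: model.predict(new_changes) labels each incoming change buggy (1) or clean (0).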

Data preprocessing: Because the magnitudes of the 14 features are not of the same order, we normalize them. The data is scaled to a fixed range, usually 0 to 1. The cost of this bounded range, in contrast to standardization, is that we end up with smaller standard deviations, which can suppress the effect of outliers. Min-Max scaling is typically done via

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
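A direct rendering of this formula, applied column-wise to the feature matrix (the small epsilon is an added guard against constant columns, not part of the original description):

import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X into [0, 1] using (X - X_min) / (X_max - X_min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)  # epsilon avoids division by zero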

Random undersampling: Random undersampling randomly removes observations from the majority class until the data set is balanced. This step is important for defect prediction because it keeps the learned classifier from becoming biased toward the majority class, which improves its performance.

RBM: A Boltzmann machine is a stochastic recurrent neural network with stochastic binary units and undirected edges between units. Unfortunately, learning for Boltzmann machines is impractical and has scalability issues.

As a result, the Restricted Boltzmann Machine (RBM) was introduced; it has one layer of hidden units and no connections between hidden units, which allows a more efficient learning algorithm. The structure of an RBM is depicted in Figure 1.

Figure 1: The structure of RBM

Given this configuration, the joint probability of the visible units v and hidden units h is defined in terms of the energy function

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

P(v, h) = \frac{e^{-E(v, h)}}{Z}

where a_i and b_j are the biases, w_{ij} are the connection weights, and Z is the normalizing constant. Maximum likelihood learning can then train the network by simply alternating between updating all the hidden units in parallel and all the visible units in parallel. To speed up learning for an RBM, the contrastive divergence (CD) algorithm is used: update all the hidden units in parallel starting from the visible units, reconstruct the visible units from the hidden units, and finally update the hidden units again.
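A minimal sketch of one CD-1 update for a single RBM, assuming NumPy and sigmoid units; this follows the generic textbook procedure rather than the authors' implementation, and the learning rate is arbitrary.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.05, rng=None):
    """One contrastive-divergence (CD-1) step for a batch of visible vectors v0.

    W: visible x hidden weight matrix, a: visible biases, b: hidden biases.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    # Up pass: sample hidden units from the data.
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Down pass: reconstruct visible units, then recompute hidden probabilities.
    v1_prob = sigmoid(h0 @ W.T + a)
    h1_prob = sigmoid(v1_prob @ W + b)
    # Gradient approximation: positive (data) phase minus negative (reconstruction) phase.
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b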

Deep Belief Networks: As the name indicates, a Deep Belief Network (DBN) is a multi-layer belief network: each layer is an RBM, and the RBMs are stacked on top of each other to construct the DBN. The first step of training a DBN is to learn a layer of features from the visible units using the Contrastive Divergence (CD) algorithm. The next step is to treat the activations of the previously trained features as visible units and learn features of features in a second hidden layer. Finally, the whole DBN is trained once learning of the final hidden layer is complete. This simple greedy learning algorithm works for training a DBN because training each layer's RBM with the CD algorithm finds a local optimum, and the next stacked RBM layer takes those trained values and again looks for a local optimum. At the end of this procedure, the network is likely to approach the global optimum, as each layer is consistently trained toward an optimum value.

Figure 2: Greedy learning for DBN
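A hedged sketch of this greedy layer-by-layer procedure, reusing the sigmoid and cd1_update helpers from the sketch above; the layer sizes and epoch count are arbitrary.

import numpy as np

def train_dbn(X, layer_sizes=(20, 12), epochs=30, rng=None):
    """Greedy layer-wise DBN pre-training: train an RBM, then feed its hidden
    activations as 'visible' data to the next RBM."""
    rng = rng if rng is not None else np.random.default_rng(0)
    layers, data = [], X
    for n_hidden in layer_sizes:
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        a, b = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_update(data, W, a, b, rng=rng)
        layers.append((W, b))
        data = sigmoid(data @ W + b)  # activations become the next layer's input
    return layers, data               # 'data' now holds the advanced features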

Classifier: Logistic regression models the relationship between features and labels as a parametric distribution P(y|x), where y is the label of a data point (in our case, a change) and x is the data point represented as a set of features. The parameters of this distribution are estimated directly from the training data.

Experiments and Results: We evaluate Deeper on six datasets from six well-known open source projects: Bugzilla, Columba, Eclipse JDT, Eclipse Platform, Mozilla, and PostgreSQL. These datasets were also used by Kamei et al. We use ten-fold cross validation, repeated 10 times, to evaluate the performance of Deeper.

We randomly divide the dataset into 10 folds: 9 folds are used as the training set and the remaining fold as the test set. To further reduce bias due to training set selection, we run ten-fold cross validation 10 times and record the average performance. Cross validation is a standard evaluation setting that is widely used in software engineering studies.

Evaluation Metrics: We use two evaluation metrics to evaluate the performance of our approach: cost effectiveness and F1-score. 1) Cost Effectiveness: Cost effectiveness is often used to evaluate defect prediction approaches. It is measured by computing the percentage of buggy changes found when reviewing a specific percentage of the lines of code. To compute it, given a number of changes, we first sort them according to their likelihood of being buggy.

We then simulate reviewing the changes one by one, from the highest ranked change to the lowest, and record the buggy changes found. Using this process we obtain the percentage of buggy changes found when reviewing different percentages of lines of code (1% to 100%).

TABLE III. CONFUSION MATRIX

             Predicted Buggy   Predicted Clean
True Buggy   TP                FN
True Clean   FP                TN

2) F1-Score: In the statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r: p is the number of correct positive results divided by the number of all positive results returned, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall; it reaches its best value at 1 and its worst at 0.
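In terms of the confusion matrix above, these quantities can be written explicitly (a standard restatement of the definitions in the text):

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot P \cdot R}{P + R}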

   We use the above two evaluation metrics, i.e., cost effectiveness and F1-score, to make comparisons. They are commonly-used measures to evaluate the performance of a defect prediction approach. To make our results more convincing, we perform 10-fold cross validation 10 times and report the average results.
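As an illustration, the cost-effectiveness curve and the PofB20 value reported below could be computed along the following lines; this is a sketch under the assumption that changes are ranked by predicted likelihood and that tie handling is ignored, not necessarily the authors' exact procedure.

import numpy as np

def cost_effectiveness_curve(scores, is_buggy, loc):
    """Percentage of buggy changes found at 1%..100% of LOC reviewed.

    scores: predicted likelihood of being buggy; is_buggy: true labels (1/0);
    loc: size of each change in lines of code (all NumPy arrays).
    """
    order = np.argsort(-scores)                    # review most suspicious changes first
    loc_frac = np.cumsum(loc[order]) / loc.sum()   # cumulative fraction of LOC reviewed
    bug_frac = np.cumsum(is_buggy[order]) / is_buggy.sum()
    targets = np.arange(1, 101) / 100.0
    idx = np.minimum(np.searchsorted(loc_frac, targets), len(loc_frac) - 1)
    return 100.0 * bug_frac[idx]

def pofb20(scores, is_buggy, loc):
    """PofB20: percentage of buggy changes found when reviewing 20% of the LOC."""
    return cost_effectiveness_curve(scores, is_buggy, loc)[19]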

For cost effectiveness, we record the percentage of buggy instances found at every one percent of lines of code reviewed, giving 100 average values corresponding to the percentage of buggy instances found when reviewing 1% to 100% of the lines of code.

TABLE IV. POFB20 VALUES OF DEEPER AND THE TWO BASELINES

Project      LR (%)   Kamei et al.'s (%)   Deeper (%)
Bugzilla     21.35    21.44                43.80
Columba      12.52    12.29                40.00
JDT          18.31    17.79                52.77
Platform     25.71    24.92                63.87
Mozilla      18.85    18.13                59.09
PostgreSQL   17.82    18.38                49.70
Average      19.09    18.82                52.04

TABLE V. PRECISION OF DEEPER AND THE TWO BASELINES

Project      LR       Kamei et al.'s   Deeper
Bugzilla     0.7019   0.5475           0.5157
Columba      0.6312   0.4865           0.4193
JDT          0.4527   0.2490           0.2197
Platform     0.5271   0.2321           0.2040
Mozilla      0.5836   0.1240           0.1021
PostgreSQL   0.6988   0.5036           0.4073
Average      0.5992   0.3571           0.3164

TABLE VI. RECALL OF DEEPER AND THE TWO BASELINES

Project      LR       Kamei et al.'s   Deeper
Bugzilla     0.4026   0.7027           0.6207
Columba      0.3104   0.6486           0.6803
JDT          0.0304   0.6613           0.6283
Platform     0.0320   0.7085           0.7103
Mozilla      0.0397   0.6084           0.6320
PostgreSQL   0.2819   0.6016           0.6199
Average      0.1828   0.6552           0.6003

TABLE VII. F1-SCORE OF DEEPER AND THE TWO BASELINES

Project      LR       Kamei et al.'s   Deeper
Bugzilla     0.5106   0.6147           0.6164
Columba      0.4148   0.5550           0.5593
JDT          0.0568   0.3616           0.3869
Platform     0.0603   0.3496           0.3733
Mozilla      0.0742   0.2058           0.2113
PostgreSQL   0.4014   0.5480           0.5263
Average      0.2530   0.4391           0.4406

Results:

Tables IV, V, VI and VII present the PofB20, Precision, Recall and F1-score values of Deeper as compared with those of the two baselines, respectively. From these tables, we can conclude several points. First, from Table IV, we can see that using our approach, on average, over 50% of the buggy instances can be found by reviewing only 20% of the lines of code, which is a substantial improvement as compared to the results achieved by the two baselines. The PofB20 values of our approach range from 41% to 62%.

For each dataset, the values substantially exceed those of the two baselines. Second, from Tables V to VII, we find that in terms of Precision, LR is the best performer, achieving an average precision of 60%. However, in terms of Recall, LR performs the worst, while Deeper is the best performer, achieving an average recall of 69%. Also, in terms of F1-score, which summarizes the two indicators above, Deeper is the best performer, achieving an average F1-score of 45%. Third, although LR has the best precision, its recall is very low. This is especially the case for the three datasets in which the proportion of buggy changes is less than 15%: for those datasets, LR's recall values are only about 3%, while Deeper and Kamei et al.'s approach achieve much larger recall values of more than 60%. This result indicates that LR alone is not good enough for defect prediction and that handling unbalanced data is essential.

Threats to Validity: Threats to internal validity relate to errors in our experiments. We have double checked our experiments and implementations; still, there could be errors that we did not notice. Threats to external validity relate to the generalizability of our results. We have evaluated our approach on 137,417 changes from six open source projects. In the future, we plan to reduce this threat further by analyzing more datasets from more open source and commercial software projects.

Threats to construct validity refer to the suitability of our evaluation metrics.

Fig. 3. Cost effectiveness trends for the six datasets: (a) Bugzilla, (b) Columba, (c) JDT, (d) Platform, (e) Mozilla, (f) Postgres.

Conclusion: In this paper, we propose a deep learning approach for just-in-time defect prediction. The approach first extracts a set of expressive features from an initial set of basic change measures using a Deep Belief Network (DBN), and then trains a classifier on the extracted features using Logistic Regression. We evaluate our approach on datasets taken from six large open source projects using two evaluation metrics, F1-score and cost effectiveness.

We compare our approach with two baselines, i.e., a standard Logistic Regression algorithm and the approach proposed by Kamei et al. The results show that our approach is the best in terms of the two metrics.

Our approach achieves an average recall of 69% and an average F1-score of 45%. For cost effectiveness, our approach can identify over 50% of defective changes by reviewing only 20% of the lines of code, far more than the two baselines can identify. In the future, we plan to improve the performance of our approach by optimizing the parameters of the DBN and to perform experiments on more datasets to reduce the threats to external validity.

We also plan to try other classifiers to see whether a better classifier exists when combined with the DBN.

Future Work: With the ever-increasing scale and complexity of modern software, reliability assurance has become a significant challenge. To enhance software reliability, this work focuses on predicting potential code defects in the implementation of software, thereby reducing the workload of software maintenance.
