The State of Empirical Evaluation in Static Feature Location
Context: Feature location (FL) is the task of finding the source code that implements a specific, user-observable functionality in a software system. It plays a key role in many software maintenance tasks, and maintenance is the most time-consuming phase of the software life-cycle. Consequently, FL is an area of much research, and researchers have proposed a wide variety of Feature Location Techniques (FLTs) that rely on source code structure or textual analysis.
Objective: As FLTs evolve and more novel FLTs are introduced, it is important to perform comparison studies to investigate which FLTs are best. However, an initial reading of the literature suggests that performing such comparisons would be arduous, given the large number of techniques to be compared, the heterogeneous nature of the empirical designs, and the lack of transparency in the literature. This paper first assesses whether these observations from the literature hold for FLTs that rely on code structure and/or textual analysis. It then proposes a more standard set of empirical components for all FLTs, towards more comparable evaluations.
Method: This paper presents a systematic review of 170 FLT papers published between 2000 and 2015. The review focuses on the core empirical design elements of FLT evaluations to investigate the degree to which heterogeneity and lack of transparency in empirical evaluations hinder FLT comparison.
Results: The systematic review indicates that 95% of the papers studied are directed towards novelty, in that they propose a novel FLT. 69% of these novel FLTs are evaluated through standard empirical methods but, of those, only 9% use baseline technique(s) in their evaluations to allow cross-comparison with other techniques. The heterogeneity of empirical evaluation is also clearly apparent: altogether, over 60 different FLT evaluation metrics are used across the 170 papers, 272 subject systems are used, and 235 different benchmarks are employed. The review also identifies numerous user input formats as contributing to the heterogeneity. These heterogeneities make it very difficult to compare across FLT evaluations. An alternative, then, is to reproduce the novel FLTs themselves and repeat the empirical evaluations against other approaches. However, analysis of the existing research suggests that only 27% of the FLTs presented could be reproduced from the published material.
Conclusion: These findings suggest that comparison across the existing body of FLT evaluations is very difficult. We conclude by providing guidelines for the empirical evaluation of FLTs that may ultimately help to standardise empirical research in the field; these guidelines are cognisant of FLTs with different goals, leverage common practices in existing empirical evaluations, and are accompanied by rationales. This is seen as a step towards standardising evaluation in the field, thus facilitating comparison across FLTs.
Supplementary material:
- Systematic Survey Material
- Collected Data
- Intermediate Results