PhD Defense: Steps Towards Semantic Code Search

Please click on this link and join the online PhD defense.

Members of the defense committee:

Professor, Dr. Yves Le Traon, University of Luxembourg, Luxembourg, Chairman
Assistant Professor, Dr. Dongsun Kim, Kyungpook National University, Korea, Vice Chairman
Associate Professor, Dr. Lingxiao Jiang, Singapore Management University, Singapore, Member
Dr. Xin Xia, Software Engineering Application Technology Lab at Huawei, China, Member
Associate Professor, Dr. Jacques Klein, Université du Luxembourg, Luxembourg, Expert
Associate Professor, Dr. Tegawendé Bissyandé, Université du Luxembourg, Luxembourg, Dissertation Supervisor

Abstract:

Code search is an unavoidable activity in software development. Developers commonly reuse existing source code fragments by searching codebases available in local or global repositories. Code search helps developers ease implementation or understand specific concepts more deeply during software development. In addition, reading real-world examples (the results of code search) helps developers make programs more reliable, faster, or more secure, as the examples have been tested and reused by many other developers. However, code search is becoming more challenging as codebases grow larger. Thus, the research community has invested substantial effort in developing new techniques, combining methods, and using larger datasets to improve the performance and efficiency of code search.

Despite the significant efforts made by researchers in the field, code search still has many open problems that the community needs to address, such as a lack of benchmarks, vocabulary mismatch (between natural language and source code), and low extensibility across programming languages. Our work focuses on these open issues and on the momentum of the domain toward semantic code search, which considers the meaning of the user query rather than the syntactic similarity that most other studies have relied on. The thesis begins by exploring general issues in code search through a systematic literature review. The survey organizes and classifies semantic-based approaches and various directions such as learning-based, feedback-driven, and dynamic approaches. It reveals insights and new research directions. Following the research directions identified by the survey, we concentrate on an approach that alleviates the vocabulary mismatch problem by augmenting the user's natural language queries. Then, we go further, reformulating the user's code query with real-world code snippets. This makes it possible to capture the semantics of the source code. Given this semantic information, users can search for the desired source code by using their own code fragments.

In this context, the present dissertation aims to explore semantic code search by contributing to the following three building blocks:

Review of the state of the art: Despite the growing interest in code search, comprehensive surveys or systematic literature reviews of the field remain scarce. We conducted a large-scale systematic literature review on internet-scale code search. Our objective in this study was to devise a grounded approach to understanding the procedure that code search approaches follow. We built an operational taxonomy on top of each procedure to categorize the approaches and provide insights into the selection of various approaches. Our investigation of the open issues in the literature guides researchers and practitioners toward future research directions.

CoCaBu: Source code terms such as method names and variable types are often different from the conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. We presented COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with relevant but missing structural code entities, in order to improve the matching of relevant code examples within large code repositories. To instantiate this approach, we built GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. Experimental results, collected via several comparisons against state-of-the-art code search approaches and existing online search engines such as Google, show that CoCaBu provides qualitatively better results. Furthermore, our live study in the developer community indicates that it can retrieve acceptable or attractive answers to developers' questions.
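The query augmentation idea described above can be illustrated with a minimal sketch. It is not the CoCaBu or GitSearch implementation: the toy Q&A index, the stopword list, the word-overlap scoring, and the names QA_POSTS and augment_query are all assumptions made for illustration only.

```python
import re

# Hypothetical toy Q&A index (an assumption for this sketch): each entry pairs
# a question title with code entities taken from an accepted answer.
QA_POSTS = [
    ("how to read a text file line by line", ["BufferedReader", "FileReader", "readLine"]),
    ("read file into string", ["Files", "readAllBytes", "Paths"]),
    ("generate md5 hash of a string", ["MessageDigest", "getInstance", "digest"]),
]

# Small illustrative stopword list, not taken from the source.
STOPWORDS = {"a", "an", "the", "to", "of", "how", "by", "into"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def augment_query(query, posts=QA_POSTS, top_posts=2):
    """Add code entities from the Q&A posts whose titles best match the query."""
    query_terms = tokenize(query)
    scored = []
    for title, entities in posts:
        overlap = len(query_terms & tokenize(title))
        if overlap > 0:
            scored.append((overlap, entities))
    # Highest word overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    code_terms = []
    for _, entities in scored[:top_posts]:
        for entity in entities:
            if entity not in code_terms:
                code_terms.append(entity)
    return {"text": sorted(query_terms), "code": code_terms}


if __name__ == "__main__":
    print(augment_query("read a file line by line"))
```

In this sketch, the free-form query gains code entities such as class and method names that never appeared in the user's wording, which a code index can then match directly. A real system would replace the word-overlap score with a proper retrieval model.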

FaCoY: Most existing approaches focus on serving user queries provided as free-form natural language input. However, there exists a wide range of use-case scenarios where a code-to-code approach would be most beneficial. For example, research directions in code transplantation, code diversity, and patch recommendation can leverage a code-to-code search engine to find essential ingredients for their techniques. Given the wide range of use cases for code-to-code search, we propose FaCoY, a novel approach for statically finding code snippets that may be semantically similar to user input code. FaCoY implements a query alternation strategy: instead of directly matching code query tokens with code in the search space, FaCoY first attempts to identify other tokens that may also be relevant in implementing the functional behavior of the input code. The experimental results show that FaCoY is more effective than all the existing online code-to-code search engines, and that it can also be used to find semantic code clones (i.e., Type-4). Moreover, the results showed that FaCoY could be helpful in code/patch recommendation.
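A rough sketch of what query alternation could look like follows. It assumes, for illustration only, that the additional tokens are found through Q&A posts: snippets sharing tokens with the input code point to questions, and similar questions point to other snippets whose tokens are collected. The toy corpus, the thresholds, and the names POSTS and alternate_query are hypothetical and are not FaCoY's actual implementation.

```python
import re

# Hypothetical toy corpus (an assumption for this sketch): each entry pairs
# a question title with a code snippet from one of its answers.
POSTS = [
    ("compute md5 hash of string",
     'MessageDigest md = MessageDigest.getInstance("MD5"); md.digest(bytes);'),
    ("md5 hash of a string using a library",
     "String h = DigestUtils.md5Hex(input);"),
    ("read file line by line",
     "BufferedReader r = new BufferedReader(new FileReader(path)); r.readLine();"),
]


def code_tokens(code):
    """Return the set of identifier-like tokens in a code fragment."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))


def words(text):
    """Return the set of lowercase words in a natural language title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def alternate_query(code_query, posts=POSTS, min_code_overlap=2, min_title_overlap=2):
    """Return (original tokens, extra tokens) for a code query."""
    original = code_tokens(code_query)
    # Step 1: find posts whose snippets share enough tokens with the query,
    # and keep the words of their question titles as seeds.
    seed_titles = [
        words(title)
        for title, snippet in posts
        if len(original & code_tokens(snippet)) >= min_code_overlap
    ]
    # Step 2: collect tokens from snippets of posts whose questions resemble
    # a seed question, even if the snippets look syntactically different.
    extra = set()
    for title, snippet in posts:
        title_words = words(title)
        if any(len(title_words & seed) >= min_title_overlap for seed in seed_titles):
            extra |= code_tokens(snippet)
    return original, extra - original


if __name__ == "__main__":
    query = 'MessageDigest d = MessageDigest.getInstance("SHA"); d.digest(data);'
    print(alternate_query(query))
```

With this toy corpus, a query built on MessageDigest also picks up DigestUtils and md5Hex, tokens from a syntactically different snippet that answers a similar question. This is the kind of alternative implementation a purely token-matching search would miss.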
