This is the technical blog of Keyvan Nayyeri, a 29-year-old software engineer at Match.Com, speaker, and author. You will find content about computer science, programming, and technology here.
After our first meeting to discuss ICSE 2010 papers, where I presented Software Traceability with Topic Modeling, yesterday we had our second presentation, on another paper titled A Search Engine for Finding Highly Relevant Applications. The implementation of the idea introduced in this paper (which I’ll describe shortly in this post) is available on the web and is called the Exemplar code search engine.
As you probably know, code reusability is one of the main principles of software development and an important aspect of object-oriented programming. Software developers try to reuse components or pieces of code in their programs in order to speed up development and reduce costs. Besides, code reusability can help improve the quality of code by encouraging better design and implementation of smaller components.
As part of their daily work, software developers in industry search for relevant components, libraries, or code snippets to use in their projects. They often search on code search engines and repositories like SourceForge, Google Code, Koders, CodePlex, and many other services.
Most of these code search engines rely heavily on textual values that project coordinators enter on the websites, such as the title, description, category, tags, or other attributes.
However, these search engines share a common problem: the relevance of their results depends on two major factors, the careful selection of keywords by the user and the richness of the textual attributes entered by project owners. The first can be addressed simply by better training of users, but the second is harder. Whatever you enter for a project, even something very rich, some parts of the codebase may still be missing from the description, especially for bigger projects that consist of various components.
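To make the problem concrete, here is a minimal sketch of a metadata-only search. It is not how any of the services above actually work; the `Project` class, the `metadata_search` function, and the two sample projects are all made up for illustration. The hypothetical MediaSuite project is assumed to bundle a zip component that its owner never mentioned in the description.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # Hypothetical record holding only what a project owner typed in.
    name: str
    description: str
    tags: tuple = ()


def metadata_search(query, projects):
    """Return names of projects whose description or tags contain every query word."""
    words = query.lower().split()
    hits = []
    for project in projects:
        text = (project.description + " " + " ".join(project.tags)).lower().split()
        if all(word in text for word in words):
            hits.append(project.name)
    return hits


projects = [
    Project("ZipTool", "compress files into zip archives", ("zip", "compression")),
    # Assumed to contain a zip component that the description does not mention.
    Project("MediaSuite", "a multimedia player", ("audio", "video")),
]

print(metadata_search("zip compress", projects))
```

Only ZipTool is returned. MediaSuite stays invisible to this query no matter what its code does, because the search never looks beyond the owner's text.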
There have been some attempts to solve this issue with different techniques. The paper that we discussed, which was recently published at ICSE 2010, tries to provide an improvement in this area. Its technique searches not only the textual properties of a project on a repository, but also the API calls that the project uses and the relationships between them, based on the help documents written for those APIs.
In this paper, the authors apply this idea using two approaches: a pure search in the help documents for the APIs, and an advanced search in the API documents based on data flow analysis between the API calls.
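The two approaches can be sketched roughly as follows. This is only an illustration under my own assumptions, not the ranking that the paper or Exemplar uses: the help texts, the project data, the data flow links, and the simple counting score are all hypothetical.

```python
# Hypothetical help-document text for a few API calls.
API_DOCS = {
    "ZipOutputStream.putNextEntry": "begins writing a new zip file entry",
    "FileInputStream.read": "reads bytes of data from a file",
    "AudioSystem.getClip": "obtains a clip for audio playback",
}

# Hypothetical API calls found in each project's source code.
PROJECT_CALLS = {
    "MediaSuite": {
        "AudioSystem.getClip",
        "FileInputStream.read",
        "ZipOutputStream.putNextEntry",
    },
    "Player": {"AudioSystem.getClip"},
}

# Hypothetical data flow links: the output of the first call reaches the second.
PROJECT_FLOWS = {
    "MediaSuite": {("FileInputStream.read", "ZipOutputStream.putNextEntry")},
    "Player": set(),
}


def matching_apis(query):
    """API calls whose help text shares at least one word with the query."""
    words = set(query.lower().split())
    return {api for api, doc in API_DOCS.items() if words & set(doc.lower().split())}


def rank(query, use_data_flow=False):
    """Score projects by matched API calls, optionally adding data flow links."""
    apis = matching_apis(query)
    scores = {}
    for project, calls in PROJECT_CALLS.items():
        score = len(apis & calls)
        if use_data_flow:
            # One extra point per link that connects two matched API calls.
            score += sum(
                1 for src, dst in PROJECT_FLOWS[project] if src in apis and dst in apis
            )
        if score > 0:
            scores[project] = score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(rank("zip file"))
print(rank("zip file", use_data_flow=True))
```

Here a project whose own description says nothing about zip files is still found, because the API calls in its code are documented with those words. The data flow variant only changes the score when two matched calls are actually connected in the code.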
To implement this idea, the authors aggregated around 30,000 Java projects from SourceForge, processed their APIs with the approaches mentioned above, and published the resulting code search engine, called Exemplar, on the web. Then they asked a group of 39 Java developers with different levels of experience to search for some common programming tasks using this search engine under a time limit. In the next step they asked the developers to evaluate the results and rate their relevance, as well as their own confidence in their answers.
The results of this experiment were analyzed using statistical methods, and the authors report that using the API descriptions improves the relevance of search results, but that the use of data flow analysis doesn’t have a big impact.
However, it appears that not enough work has been done on the data flow analysis, and its implementation is weak and superficial. The authors seem to agree, because they talk about future work on a stronger implementation of data flow analysis to improve the relevance of search results.
All in all, I think that this new approach has good potential to improve the results of code search engines, but a more advanced implementation of data flow analysis would be costly, and much work would be needed to fine-tune the search engine in this area.