SLQA: Sequence Labeling Based Question Answering for Mining Aligned Natural Language and Programming Language Data
Mining high-quality natural language and code pairs from Stack Overflow is essential for many downstream tasks such as code retrieval, code generation, and code summarization. Existing approaches treat it mainly as a nearest neighbor search problem, where natural language and code are mapped into the same vector space. These approaches often require models with knowledge of specific programming languages, which necessitates data annotation for each programming language, and are thus challenging to transfer to other programming languages.
In this paper, we propose a novel approach to systematically mine aligned natural language and code pairs. The approach leverages well-established sequence labeling techniques as well as large pre-trained NLP models, and is programming-language-agnostic. We achieved results competitive with the current state of the art and mined a total of 10 million such pairs, two orders of magnitude more than the current benchmark dataset.
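As a rough illustration of what sequence labeling means in this setting, the sketch below turns per-token labels into extracted spans. It assumes a BIO tagging scheme (B = begin, I = inside, O = outside), which the abstract does not specify; the function name `decode_bio_spans` and the example tokens and tags are hypothetical, and the tags stand in for the output of a trained labeling model.

```python
from typing import List


def decode_bio_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Join tokens tagged B/I into spans; O (or a stray I) ends a span."""
    spans: List[List[str]] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close any span still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # "O", or an "I" with no open span: treat as outside.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return [" ".join(span) for span in spans]


# Hypothetical tokenized answer text with hand-written tags.
tokens = ["Use", "sorted", "(", "xs", ")", "to", "sort", "."]
tags = ["O", "B", "I", "I", "I", "O", "O", "O"]
code_spans = decode_bio_spans(tokens, tags)
```

Because the decoding step only looks at the labels, nothing in it depends on the programming language of the labeled tokens.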