Pliny: Big Code Analytics

About Pliny

Pliny: Mining Big Code to help programmers

Pliny is a family of systems — currently under development — that will automatically detect program defects, suggest bug fixes, and complete drafts of programs using code and specifications mined from vast repositories of existing code.

Systems in Pliny will analyze a large corpus of code from different languages and application domains, and create a database that records inferred specifications and executable implementations for a vast number of code snippets, as well as similarity relationships between these snippets. This database will support queries that allow the extraction of relevant code fragments and code specifications from the database. Pliny's analytics engines will use such queries, together with deep logic-based inference, to empower end-users with a rich collection of capabilities for correctness analysis, repair, and synthesis of programs.

From drafts to predictions

Programmers will interact with Pliny by writing drafts: possibly ambiguous expressions of computational intent. For instance, a draft can consist of constraints, examples, or incomplete code that illustrates an intended program's behavior. Alternatively, it can be a complete program of uncertain correctness. Different systems in the Pliny family will support different languages of drafts. The programmer can then ask Pliny questions about a draft, for instance whether it is correct, how to repair it, or how to complete it.

In the real world, programmers seldom start out with an unambiguous logical specification of what they want to do. Consequently, drafts in Pliny are expected to have uncertain semantics. Indeed, a key technical challenge in Pliny is to "read the programmer's mind" based on the draft.
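To make the idea of a draft concrete, here is a toy sketch in Python. All names, the example set, and the candidate pool are invented for illustration; they are not Pliny's actual interface. A "hole" in a program plus a few input-output examples stands in for an ambiguous draft, and a small pool of mined snippets stands in for the corpus-derived database:

```python
# Hypothetical sketch of a "draft": a hole in a program plus
# input-output examples that pin down its intended behavior.
# The names and the toy search below are illustrative, not Pliny's API.

def satisfies(candidate, examples):
    """Check whether a candidate function agrees with every example."""
    return all(candidate(x) == y for x, y in examples)

# The draft: the programmer leaves a hole and supplies examples.
examples = [(1, 2), (3, 6), (5, 10)]  # intended behavior: double the input

# A tiny stand-in for a database of snippets mined from a corpus.
candidates = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x ** 2,
]

# Resolve the draft's ambiguity by finding a snippet consistent
# with all of the examples.
completion = next(c for c in candidates if satisfies(c, examples))
```

Even this toy version shows why "reading the programmer's mind" is hard: several candidates may agree with the examples, and the system must rank them by likelihood of matching the programmer's intent.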

How will Pliny work?

Internally, each tool in the Pliny family will have four components: an artifact generator, an artifact database, a statistical engine capable of probabilistic inference, and a logical engine that performs automated logical reasoning.

The goal of Pliny's artifact generation engines is to generate reusable knowledge from pre-existing code. Specifically, these engines will generate a large number of useful "program elements" (for example, procedures, symbolic traces, or data definitions), compute a feature vector for each element, and compute relationships between elements. These elements and relationships will be arranged in Pliny's artifact database.
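The sketch below illustrates the shape of such a database, under assumptions of ours: each program element gets a crude bag-of-tokens feature vector, and similarity relationships between elements are recorded as cosine-similarity edges. The element names and token scheme are invented for illustration:

```python
import math
from collections import Counter

# Illustrative sketch (our names, not Pliny's): represent each program
# element by a bag-of-tokens feature vector, then store the vectors and
# pairwise similarity relationships in a small in-memory "database".

def feature_vector(snippet: str) -> Counter:
    """A crude feature vector: token frequencies of the snippet."""
    return Counter(snippet.replace("(", " ").replace(")", " ").split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Three toy program elements.
elements = {
    "e1": "open ( path ) ; read ( f ) ; close ( f )",
    "e2": "open ( path ) ; write ( f ) ; close ( f )",
    "e3": "sort ( xs ) ; reverse ( xs )",
}

# Artifact database: feature vectors plus similarity edges.
vectors = {k: feature_vector(v) for k, v in elements.items()}
similar = {
    (a, b): cosine(vectors[a], vectors[b])
    for a in vectors for b in vectors if a < b
}
```

A real artifact database would of course use far richer features (symbolic traces, data definitions) and a persistent store, but the element/feature/relationship structure is the same.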

The statistical and logical engines are the twin workhorses of inference in Pliny. Pliny's statistical inference engine will mine the artifact database for information relevant to queries from programmers about their drafts. For instance, to answer queries about the correctness of a programmer's use of an API, Pliny will learn a statistical model of how the API was used in programs in the corpus; for synthesis tasks, Pliny will statistically learn program elements that are likely to be useful in producing a robust executable.
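As a toy stand-in for statistically learning an API-usage model, the sketch below estimates bigram frequencies of API calls from a corpus of call sequences and scores how typical a new sequence looks. The corpus, API names, and scoring rule are invented for illustration:

```python
from collections import Counter

# Toy "statistical model of API use": bigram frequencies of API calls,
# estimated from an invented corpus of call sequences.

corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]

bigrams = Counter(
    (a, b) for seq in corpus for a, b in zip(seq, seq[1:])
)
total = sum(bigrams.values())

def typicality(seq):
    """Average observed frequency of the sequence's call bigrams."""
    pairs = list(zip(seq, seq[1:]))
    return sum(bigrams[p] for p in pairs) / (len(pairs) * total)

# A pattern common in the corpus scores higher than an unusual one.
common = typicality(["open", "read", "close"])
odd = typicality(["close", "open", "read"])
```

A low typicality score does not by itself prove a bug; it flags usages that diverge from how the corpus uses the API, which is exactly where logical analysis is then brought to bear.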

However, because programs are entities with rigorous semantics, purely statistical inference is not enough; it needs to be complemented by logical reasoning about program correctness. Pliny's logical inference engines will accomplish this. For instance, to detect whether a programmer is using an API incorrectly, Pliny will perform a static, logic-based program analysis that checks whether any feasible program execution violates the statistically learned model of the API. In synthesis, logical inference will allow Pliny to stitch together statistically learned code elements into a robust executable program.
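One minimal way to picture this check, under assumptions of ours: encode the learned API model as a finite automaton over API calls, then verify a (straight-line) program's call sequence against it. The states, transitions, and call names below are illustrative, not a real API protocol:

```python
# Minimal sketch of checking code against a learned API model.
# The model is a finite automaton over API calls: a file must be
# opened before it is read or written, and closed at the end.

PROTOCOL = {
    ("closed", "open"): "opened",
    ("opened", "read"): "opened",
    ("opened", "write"): "opened",
    ("opened", "close"): "closed",
}

def violates(calls, state="closed"):
    """Return the first call that falls outside the learned protocol,
    or None if every transition is permitted."""
    for call in calls:
        nxt = PROTOCOL.get((state, call))
        if nxt is None:
            return call
        state = nxt
    return None
```

A real analysis must reason about all feasible executions through branches and loops rather than a single straight-line sequence, which is where the "deep logic-based inference" mentioned above comes in.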