Inspired by Co-Pilot, recently published by GitHub and OpenAI, in this RECIPE we collect resources and guidelines on how a machine-learning-based source code generation system can be built.

GitHub Co-Pilot is a collaboration between GitHub and OpenAI that uses open-source code from GitHub and the GPT family of language models to auto-complete code snippets typed by a developer.

For a good review of what it can and cannot do, see this

For further details, look at the items in this RECIPE, but first, here are some high-level points about each stage.

Collecting / Curating Training Dataset

  1. To see the differences between GPT-3, GPT-Neo, and GPT-J, see https://nlpcloud.io/gpt-3-open-source-alternatives-gpt-j-gpt-neo.html
  2. Use https://github.com/cjb/GitTorrent for a large-scale download of source code and metadata from GitHub
  3. A more efficient way is to use https://seart-ghs.si.usi.ch/ to search for high-quality repositories and then scrape them
  4. If you want to use only the dataset that was used to train GPT-Neo and GPT-J, use this script: https://github.com/noanabeshima/github-downloader
  5. Make sure the code doesn't have security issues: https://github.com/ossf/scorecard
  6. Make sure it doesn't include API keys or other sensitive information
  7. Make sure it only includes recent repositories
  8. Make sure the licenses are compatible with this kind of use
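The filtering steps above (secrets, recency, licensing) can be sketched in a few lines. The credential regexes and license list below are illustrative assumptions, not an exhaustive scanner; a real pipeline would use a dedicated tool such as Scorecard alongside checks like these:

```python
import re
from datetime import datetime

# A few common credential formats (illustrative, not exhaustive).
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),     # AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # GitHub personal access token
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
]

# Licenses assumed acceptable for training use in this sketch.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}

def contains_secrets(source: str) -> bool:
    """True if the source text matches any known credential pattern."""
    return any(p.search(source) for p in SECRET_PATTERNS)

def keep_repo(metadata: dict, cutoff_year: int = 2019) -> bool:
    """Keep only recently-pushed repos with a permissive license.

    `metadata` is assumed to follow GitHub-style fields:
    an ISO-8601 `pushed_at` timestamp and a `license` SPDX id.
    """
    pushed = datetime.fromisoformat(metadata["pushed_at"])
    return (pushed.year >= cutoff_year
            and metadata.get("license", "").lower() in PERMISSIVE_LICENSES)
```

Files flagged by `contains_secrets` would be dropped (or redacted) before training, and `keep_repo` would gate which repositories are scraped at all.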

Model Training

  • You pre-train the model so that it understands the programming language
  • What you generate has to be able to compile. You can generate a syntax tree, but it's unclear whether that's what they did
  • You can use GPT-J (it performs better than GPT-3 on many tasks, including code generation).
  • It is not exclusively trained for this task, though.
  • You need access to heavyweight compute. You can try a smaller model on Colab, but the results won't be satisfactory
  • It needs periodic retraining to deal with situations where methods get deprecated
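The "must compile" point above can be enforced cheaply at inference time by syntax-checking each sampled completion and discarding the ones that fail to parse. A minimal sketch for Python targets using the standard library's `ast` module (the sampling step itself is assumed to happen elsewhere):

```python
import ast

def is_valid_python(candidate: str) -> bool:
    """Return True if the candidate snippet parses as Python source."""
    try:
        ast.parse(candidate)
        return True
    except SyntaxError:
        return False

def filter_compilable(candidates: list[str]) -> list[str]:
    """Keep only syntactically valid completions from a batch of samples."""
    return [c for c in candidates if is_valid_python(c)]
```

This only guarantees syntactic validity, not that the code runs or is correct; functional checks belong in the evaluation stage below.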

Model Evaluation

Evaluation metrics for code generation models can be match-based or functional correctness based.

  • Match-based metrics include the BLEU score, adapted to this task as 'CodeBLEU'
  • Functional correctness metrics include 'pass@k', which involves running k generated samples on test problems, where a problem counts as solved if any of the k code samples passes the unit tests
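For illustration, the unbiased pass@k estimator from "Evaluating Large Language Models Trained on Code" (item 14 below) can be computed as follows, where n samples are generated per problem and c of them pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations passes,
    given that c of the n generations pass the unit tests.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # fewer failures than the budget: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all problems in a benchmark gives the reported pass@k score.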

Acknowledgement

Some of the resources presented in this RECIPE are sourced from the HuggingFace Flax/JAX Community Week Discord.

Covers: theory of Source Code Generation
Estimated time needed to finish: 10 minutes
Questions this item addresses:
  • What is the overall process of source code generation?

Data Collection, Training, and Evaluation for Large Scale Code Generation (Copilot)

Contributors
Total time needed: ~6 hours
Objectives
This recipe walks you through the steps necessary to build a pipeline that generates source code. It is an analysis of the GitHub / OpenAI "Co-Pilot" service, will help you understand how that service works, and provides the steps necessary to reproduce it.
Potential Use Cases
source code generation, code auto-complete
Who Is This For?
ADVANCED: Data Scientists, Machine Learning Engineers
Click on each of the following annotated items to see details.
Resource Assets (20)
WRITEUP 1. An Introduction to Source Code Generation
  • What is the overall process of source code generation?
10 minutes
OTHER 2. Advanced Search for GitHub
  • How to search for the most relevant GitHub repos based on a series of criteria?
30 minutes
REPO 3. Repo for creating a code dataset by scraping GitHub
  • How to scrape GitHub for source code?
30 minutes
REPO 4. GPT-J Autoregressive Text Generation
  • How to train an autoregressive model on source code?
30 minutes
REPO 5. Evaluation framework and harness for code generation
  • How to evaluate the quality of generated source code?
30 minutes
REPO 6. A plugin for code generation in PyCharm/IntelliJ using tranX
  • How to set up code generation as an IDE plugin?
30 minutes
OTHER 7. GitHub Co-Pilot
  • What is GitHub Co-Pilot?
10 minutes
OTHER 8. HuggingFace Discussion about co-pilot
  • Can Co-Pilot be reproduced in open-source?
10 minutes
PAPER 9. A Syntactic Neural Model for General-Purpose Code Generation
  • How can ASTs be used in neural programming language generation?
30 minutes
PAPER 10. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  • How to find data for source code related tasks?
30 minutes
REPO 11. CodeSearchNet
  • How to access data that can be used for ML on source code?
10 minutes
PAPER 12. The Adverse Effects of Code Duplication in Machine Learning Models of Code
  • How does code duplication affect the performance of ML on Code?
10 minutes
REPO 13. Security Score Card
  • How to ensure the training set for ML on Code doesn't include insecure code?
10 minutes
PAPER 14. Evaluating Large Language Models Trained on Code
10 minutes
PAPER 15. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
  • How to evaluate quality of generated source code?
10 minutes
REPO 16. GitHub Decentralized Over BitTorrent
  • How to download GitHub repos using Torrent?
10 minutes
OTHER 17. Repo of Papers on ML for Source Code related Tasks
  • Where can I find everything about ML on Code?
10 minutes
USE_CASE 18. Implementation of Language Modelling with three common algorithms
  • How do we implement language models?
20 minutes
USE_CASE 19. 8 Leading Language Models For NLP In 2020
  • What are the pre-trained models you can use in the market?
  • What are the details about each pre-trained model (architecture, achievement, use cases)?
10 minutes
REPO 20. Code for detecting duplicate code
  • How to remove duplicate code from dataset for ML on Code?
10 minutes
