Covers: Programming Language Theory
Estimated time needed to finish: 30 minutes
Questions this item addresses:
  • How can ASTs be used in neural programming language generation?
How to use this item?

Traditionally, code generation models have been trained to produce Abstract Syntax Tree (AST) nodes.
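
To make "AST nodes" concrete, here is a minimal sketch that uses Python's standard `ast` module to turn a one-line snippet into the depth-first sequence of node types. It only illustrates the kind of structure a syntax-driven generator would target; it is not the model from the referenced paper, and the helper name `node_sequence` is hypothetical.

```python
import ast


def node_sequence(source: str) -> list:
    """Return the AST node type names of `source` in depth-first pre-order."""
    tree = ast.parse(source)
    names = []

    def visit(node: ast.AST) -> None:
        names.append(type(node).__name__)
        # iter_child_nodes yields only child AST nodes, not plain field values
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return names


snippet = "x = 1 + 2"
sequence = node_sequence(snippet)
print(sequence)

# A well-formed tree can be turned back into source text (Python 3.9+).
regenerated = ast.unparse(ast.parse(snippet))
print(regenerated)
```

A model that emits nodes in this order, rather than raw tokens, can be constrained so that every output corresponds to a well-formed tree.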

Author(s) / creator(s) / reference(s)
Pengcheng Yin

Data Collection, Training, and Evaluation for Large Scale Code Generation (Copilot)

Total time needed: ~6 hours
This recipe walks you through the steps necessary to build a pipeline that generates source code. It analyzes the GitHub / OpenAI "Copilot" service, helps you understand how that service works, and provides the steps needed to reproduce it.
Potential Use Cases
source code generation, code auto-complete
Who Is This For?
ADVANCED: Data Scientists, Machine Learning Engineers
WRITEUP 1. An Introduction to Source Code Generation
  • What is the overall process of source code generation?
10 minutes
OTHER 2. Advanced Search for GitHub
  • How to search for the most relevant GitHub repos based on a series of criteria?
30 minutes
REPO 3. Repo for creating a code dataset by scraping GitHub
  • How to scrape GitHub for source code?
30 minutes
REPO 4. GPT-J Autoregressive Text Generation
  • How to train an autoregressive model on source code?
30 minutes
REPO 5. Evaluation framework and harness for code generation
  • How to evaluate the quality of generated source code?
30 minutes
REPO 6. A plugin for code generation in PyCharm/IntelliJ using tranX
  • How to set up code generation as an IDE plugin?
30 minutes
OTHER 7. GitHub Co-Pilot
  • What is GitHub Co-Pilot?
10 minutes
OTHER 8. HuggingFace Discussion about co-pilot
  • Can Co-Pilot be reproduced in open-source?
10 minutes
PAPER 9. A Syntactic Neural Model for General-Purpose Code Generation
  • How can ASTs be used in neural programming language generation?
30 minutes
PAPER 10. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  • How to find data for source code related tasks?
30 minutes
REPO 11. CodeSearchNet
  • How to access data that can be used for ML on source code?
10 minutes
PAPER 12. The Adverse Effects of Code Duplication in Machine Learning Models of Code
  • How does code duplication affect the performance of ML on code?
10 minutes
REPO 13. Security Score Card
  • How to ensure the training set for ML on code doesn't include insecure code?
10 minutes
PAPER 14. Evaluating Large Language Models Trained on Code
10 minutes
PAPER 15. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
  • How to evaluate the quality of generated source code?
10 minutes
REPO 16. GitHub Decentralized Over BitTorrent
  • How to download GitHub repos using Torrent?
10 minutes
OTHER 17. Repo of Papers on ML for Source Code related Tasks
  • Where can I find everything about ML on code?
10 minutes
USE_CASE 18. Implementation of Language Modelling with three common algorithms
  • How do we implement language models?
20 minutes
USE_CASE 19. 8 Leading Language Models For NLP In 2020
  • What pre-trained models are available on the market?
  • What are the details of each pre-trained model (architecture, achievements, use cases)?
10 minutes
REPO 20. Code for detecting duplicate code
  • How to remove duplicate code from a dataset for ML on code?
10 minutes

Concepts Covered
