Covers: theory of Source Code Generation
Estimated time needed to finish: 10 minutes
Questions this item addresses:
  • How to access data that can be used for ML on source code?
How to use this item?

Yet another set of benchmark datasets and tools for code retrieval

Author(s) / creator(s) / reference(s)
GitHub
Programming Languages: Python
Recipe

Data Collection, Training, and Evaluation for Large Scale Code Generation (Copilot)

Contributors
Total time needed: ~6 hours
Objectives
This recipe walks you through the steps needed to build a pipeline that generates source code. It analyzes the GitHub / OpenAI Copilot service, explains how it works, and lays out the steps needed to reproduce it.
Potential Use Cases
source code generation, code auto-complete
Who Is This For?
ADVANCED: Data Scientists, Machine Learning Engineers
WRITEUP 1. An Introduction to Source Code Generation
  • What is the overall process of source code generation?
10 minutes
VIDEO 2. Machine Learning on Source Code - GitHub / OpenAI Copilot
  • What are the nuances and opportunities in applying ML to source code?
28 minutes
OTHER 3. Advanced Search for GitHub
  • How to search for the most relevant GitHub repos based on a series of criteria?
30 minutes
REPO 4. Repo for creating a code dataset by scraping GitHub
  • How to scrape GitHub for source code?
30 minutes
REPO 5. GPT-J Autoregressive Text Generation
  • How to train an autoregressive model on source code?
30 minutes
REPO 6. Evaluation framework and harness for code generation
  • How to evaluate the quality of generated source code?
30 minutes
REPO 7. A plugin for code generation in PyCharm/IntelliJ using tranX
  • How to set up code generation as an IDE plugin?
30 minutes
OTHER 8. GitHub Copilot
  • What is GitHub Copilot?
10 minutes
OTHER 9. HuggingFace Discussion about Copilot
  • Can Copilot be reproduced in open source?
10 minutes
PAPER 10. A Syntactic Neural Model for General-Purpose Code Generation
  • How can an AST be used in neural programming language generation?
30 minutes
PAPER 11. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  • How to find data for source code related tasks?
30 minutes
REPO 12. CodeSearchNet
  • How to access data that can be used for ML on source code?
10 minutes
PAPER 13. The Adverse Effects of Code Duplication in Machine Learning Models of Code
  • How does code duplication affect the performance of ML on code?
10 minutes
REPO 14. Security Score Card
  • How to ensure the training set for ML on code doesn't include insecure code?
10 minutes
PAPER 15. Evaluating Large Language Models Trained on Code
10 minutes
PAPER 16. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
  • How to evaluate quality of generated source code?
10 minutes
REPO 17. GitHub Decentralized Over BitTorrent
  • How to download GitHub repos using Torrent?
10 minutes
OTHER 18. Repo of Papers on ML for Source Code related Tasks
  • Where can I find everything about ML on code?
10 minutes
USE_CASE 19. Implementation of Language Modelling with three common algorithms
  • How do we implement language models?
20 minutes
USE_CASE 20. 8 Leading Language Models For NLP In 2020
  • What pre-trained models are available on the market?
  • What are the details about each pre-trained model (architecture, achievement, use cases)?
10 minutes
REPO 21. Code for detecting duplicate code
  • How to remove duplicate code from dataset for ML on Code?
10 minutes
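Items 13 and 21 concern removing duplicated code from a training set. As a rough illustration of the idea only, the sketch below drops files whose content is identical after whitespace normalization. This is an assumption-laden stand-in: it is not the method of the paper in item 13 or the tool in item 21, which target near-duplicates rather than exact matches. The function names (`normalize`, `deduplicate`) and the sample corpus are hypothetical.

```python
import hashlib
import re


def normalize(source: str) -> str:
    """Collapse whitespace runs and drop blank lines (hypothetical helper).

    Note: this also discards indentation, so it is only suitable for
    comparing files, not for producing code to train on.
    """
    lines = (re.sub(r"\s+", " ", line).strip() for line in source.splitlines())
    return "\n".join(line for line in lines if line)


def deduplicate(files: dict[str, str]) -> dict[str, str]:
    """Keep the first file seen for each distinct normalized-content hash."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for path, source in files.items():
        digest = hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = source
    return unique


# Made-up corpus: the second file differs from the first only in whitespace.
corpus = {
    "a/util.py": "def add(x, y):\n    return x + y\n",
    "b/util_copy.py": "def add(x, y):\n\n    return x  +  y\n",
    "c/other.py": "def sub(x, y):\n    return x - y\n",
}
kept = deduplicate(corpus)
print(sorted(kept))
```

Exact hashing is cheap and catches verbatim copies and reformatted forks, but it misses files that differ by a renamed variable or an added comment; token-level similarity would be needed for those cases.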

Concepts Covered
