Covers: theory of Source Code Generation
Estimated time needed to finish: 28 minutes
Questions this item addresses:
  • What are the nuances and opportunities of applying ML to source code?
How to use this item?

In this conversation, we talk to two NLP experts about how NLP techniques can be applied to formal languages instead of natural language. We start with the general topic of ML on source code, then move on to the GitHub Copilot project and what it has achieved.

Timestamps:

  • 00:57 Introductions
  • 02:17 What are common ML tasks on source code?
  • 06:58 What is GitHub / OpenAI Copilot and how does it work?
  • 11:22 Resources for ML on Source Code (RECIPE)
  • 12:39 Data Collection and Filtering
  • 16:36 Data Formatting and Training
  • 19:37 Evaluation
  • 22:40 Copilot Criticism and Challenges
If the video fails to play, open the link directly: https://youtu.be/eSgnnWcC6gE
Author(s) / creator(s) / reference(s)
Amir Feizpour, Suhas Pai, Ehsan Amjadian
Recipe

Data Collection, Training, And Evaluation For Large Scale Code Generation (Copilot)

Contributors
Total time needed: ~6 hours
Objectives
This recipe walks you through the steps necessary to build a pipeline that generates source code. It analyzes the GitHub / OpenAI Copilot service, helps you understand how it works, and provides the steps needed to reproduce it.
Potential Use Cases
source code generation, code auto-complete
Who Is This For?
ADVANCED: Data Scientists, Machine Learning Engineers
Resources
WRITEUP 1. An Introduction to Source Code Generation
  • What is the overall process of source code generation?
10 minutes
VIDEO 2. Machine Learning on Source Code - GitHub / OpenAI Copilot
  • What are the nuances and opportunities of applying ML to source code?
28 minutes
OTHER 3. Advanced Search for GitHub
  • How to search for most relevant GitHub repos based on a series of criteria?
30 minutes
REPO 4. Repo for creating a code dataset by scraping GitHub
  • How to scrape GitHub for source code?
30 minutes
REPO 5. GPT-J Autoregressive Text Generation
  • How to train an autoregressive model on source code?
30 minutes
REPO 6. Evaluation framework and harness for code generation
  • How to evaluate the quality of generated source code?
30 minutes
REPO 7. A plugin for code generation in PyCharm/IntelliJ using tranX
  • How to set up code generation as an IDE plugin?
30 minutes
OTHER 8. GitHub Copilot
  • What is GitHub Copilot?
10 minutes
OTHER 9. Hugging Face discussion about Copilot
  • Can Copilot be reproduced in open source?
10 minutes
PAPER 10. A Syntactic Neural Model for General-Purpose Code Generation
  • How can abstract syntax trees (ASTs) be used in neural code generation?
30 minutes
PAPER 11. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  • How to find data for source code related tasks?
30 minutes
REPO 12. CodeSearchNet
  • How to access data that can be used for ML on source code?
10 minutes
PAPER 13. The Adverse Effects of Code Duplication in Machine Learning Models of Code
  • How does code duplication affect the performance of ML on code?
10 minutes
REPO 14. Security Score Card
  • How to ensure the training set for ML on code doesn't include insecure code?
10 minutes
PAPER 15. Evaluating Large Language Models Trained on Code
10 minutes
PAPER 16. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
  • How to evaluate quality of generated source code?
10 minutes
REPO 17. GitHub Decentralized Over BitTorrent
  • How to download GitHub repos using Torrent?
10 minutes
OTHER 18. Repo of Papers on ML for Source Code related Tasks
  • Where can I find everything about ML on code?
10 minutes
USE_CASE 19. Implementation of Language Modelling with three common algorithms
  • How do we implement language models?
20 minutes
USE_CASE 20. 8 Leading Language Models For NLP In 2020
  • What pre-trained models are available on the market?
  • What are the details about each pre-trained model (architecture, achievement, use cases)?
10 minutes
REPO 21. Code for detecting duplicate code
  • How to remove duplicate code from dataset for ML on Code?
10 minutes

Concepts Covered
