Covers: implementation of Source Code Generation
Estimated time needed to finish: 30 minutes
Questions this item addresses:
  • How to scrape GitHub for source code?
How to use this item?

Use this repo to download repositories from GitHub for training. EleutherAI adapted this repo to generate their training dataset 'The Pile', which includes GitHub code among other sources.

A few notes to pay attention to:

  • The people who trained the model didn't pay much attention to licenses, but you should
  • Pay attention to the diversity of the source code dataset as well
  • You can't include every programming language, but you can start with a few
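
The notes above can be sketched as a filtering step applied to repository metadata before anything is downloaded. This is a minimal illustration and not the repo's actual code: the record fields (`name`, `license`, `language`), the license allow-list, the language list, and the per-language cap are all assumptions chosen for the example.

```python
from collections import defaultdict

# Assumed allow-lists for illustration only; pick your own after checking
# your legal and project requirements.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}
TARGET_LANGUAGES = {"Python", "JavaScript", "Go"}


def select_repos(repos, max_per_language=2):
    """Keep allow-listed repos in target languages, capped per language.

    `repos` is a list of dicts with hypothetical keys
    "name", "license", and "language".
    """
    kept = []
    per_language = defaultdict(int)
    for repo in repos:
        license_id = (repo.get("license") or "").lower()
        language = repo.get("language")
        if license_id not in PERMISSIVE_LICENSES:
            continue  # missing or non-allow-listed license: skip
        if language not in TARGET_LANGUAGES:
            continue  # start with a few languages only
        if per_language[language] >= max_per_language:
            continue  # cap keeps one language from dominating the dataset
        per_language[language] += 1
        kept.append(repo)
    return kept


if __name__ == "__main__":
    sample = [
        {"name": "alpha", "license": "MIT", "language": "Python"},
        {"name": "beta", "license": None, "language": "Python"},
        {"name": "gamma", "license": "apache-2.0", "language": "Go"},
    ]
    for repo in select_repos(sample):
        print(repo["name"])
```

The cap per language is only one crude way to encourage diversity; sampling by repository size or topic would be equally reasonable.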
Author(s) / creator(s) / reference(s)
Programming Languages: Python

Data Collection, Training, and Evaluation for Large Scale Code Generation (Copilot)

Total time needed: ~6 hours
This recipe walks you through the steps necessary to build a pipeline that generates source code. It is an analysis of the GitHub / OpenAI Copilot service that will help you understand how it works, and it provides the steps needed to reproduce it.
Potential Use Cases
source code generation, code auto-complete
Who is This For?
ADVANCED: Data Scientists, Machine Learning Engineers
WRITEUP 1. An Introduction to Source Code Generation
  • What is the overall process of source code generation?
10 minutes
OTHER 2. Advanced Search for GitHub
  • How to search for most relevant GitHub repos based on a series of criteria?
30 minutes
REPO 3. Repo for creating code dataset by scraping github
  • How to scrape GitHub for source code?
30 minutes
REPO 4. GPT-J Autoregressive Text Generation
  • How to train an autoregressive model on source code?
30 minutes
REPO 5. Evaluation framework and harness for code generation
  • How to evaluate the quality of generated source code?
30 minutes
REPO 6. A plugin for code generation in PyCharm/IntelliJ using tranX
  • How to set up code generation as an IDE plugin?
30 minutes
OTHER 7. GitHub Co-Pilot
  • What is GitHub Co-Pilot?
10 minutes
OTHER 8. HuggingFace Discussion about co-pilot
  • Can Co-Pilot be reproduced in open-source?
10 minutes
PAPER 9. A Syntactic Neural Model for General-Purpose Code Generation
  • How can an AST be used in neural programming language generation?
30 minutes
PAPER 10. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  • How to find data for source code related tasks?
30 minutes
REPO 11. CodeSearchNet
  • How to access data that can be used for ML on source code?
10 minutes
PAPER 12. The Adverse Effects of Code Duplication in Machine Learning Models of Code
  • How does code duplication affect the performance of ML on code?
10 minutes
REPO 13. Security Score Card
  • How to ensure the training set for ML on code doesn't include insecure code?
10 minutes
PAPER 14. Evaluating Large Language Models Trained on Code
10 minutes
PAPER 15. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
  • How to evaluate quality of generated source code?
10 minutes
REPO 16. GitHub Decentralized Over BitTorrent
  • How to download GitHub repos using Torrent?
10 minutes
OTHER 17. Repo of Papers on ML for Source Code related Tasks
  • Where can I find everything about ML on code?
10 minutes
USE_CASE 18. Implementation of Language Modelling with three common algorithms
  • How do we implement language models?
20 minutes
USE_CASE 19. 8 Leading Language Models For NLP In 2020
  • What pre-trained models are available on the market?
  • What are the details of each pre-trained model (architecture, achievements, use cases)?
10 minutes
REPO 20. Code for detecting duplicate code
  • How to remove duplicate code from dataset for ML on Code?
10 minutes

Concepts Covered
