Covers: theory of data augmentation
Estimated time needed to finish: 20 minutes
Questions this item addresses:
  • What are all the ways text data can be augmented for NLP tasks?

Notes by Yan Nusinovich:

  • Data augmentation (DA) refers to strategies for increasing the diversity of training examples without explicitly collecting new data.
  • DA’s adaptation for natural language processing (NLP) seems underexplored, perhaps due to challenges presented by the discrete nature of language.
  • GitHub repository with a paper list that will be continuously updated: https://github.com/styfeng/DataAug4NLP.
  • The distribution of augmented data should be neither too similar to nor too different from the original. Augmented data that is too similar may lead to greater overfitting, while augmented data that is too different may hurt performance because the model trains on examples that are not representative of the given domain. Effective DA approaches should aim for a balance.
  • Techniques:
    • Rule-based
    • Example interpolation-based
    • Model-based
  • Applications:
    • Low-resource languages
    • Mitigating bias
    • Fixing class imbalance
    • Few-shot learning
    • Adversarial examples
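
To make the first two technique families in the notes above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method of any specific paper: the function names (`random_swap`, `random_deletion`, `mixup`) are hypothetical, the rule-based operations work on whitespace-split tokens, and the interpolation operates on already-computed feature vectors with one-hot labels. Model-based augmentation is not shown because it needs a trained model.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Rule-based: swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Rule-based: drop each token with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def mixup(x_a, y_a, x_b, y_b, lam):
    """Interpolation-based: convex combination of two feature vectors
    and of their one-hot label vectors, weighted by lam in [0, 1]."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "data augmentation increases the diversity of training examples".split()
    print(random_swap(sentence, 2, rng))
    print(random_deletion(sentence, 0.2, rng))
    print(mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], 0.5))
```

The swap count, the deletion probability `p`, and the mixing weight `lam` control how far the augmented examples drift from the originals, which is one way to tune the similarity trade-off described in the notes.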
Author(s) / creator(s) / reference(s)
Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, Diyi Yang
Recipe

Data-centric Natural Language Processing

Contributors
Total time needed: ~5 hours
Objectives
In this recipe, you’ll learn the fundamentals of data-centric ML and advanced techniques for data quality assurance with NLP implementations, plus bonus material on how large companies apply a data-centric mindset at scale in production.
Potential Use Cases
Low-resource NLP, Small language models
Who is This For?
INTERMEDIATE: Data Scientists, Machine Learning Engineers
Resources
WRITEUP 1. Introduction to Data-centric NLP
  • Why should I care about data-centric NLP?
5 minutes
VIDEO 2. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • What is data-centric machine learning?
60 minutes
REPO 3. cleanlab: Python package for machine learning with noisy labels
  • How can I prepare my noisy data for ML?
20 minutes
USE_CASE 4. Find label issues with confident learning for NLP
  • How do you find label issues with text data?
10 minutes
ARTICLE 5. A Visual Survey of Data Augmentation in NLP
  • What are the common ways to augment text data?
11 minutes
ARTICLE 6. Monitoring Data Quality at Scale with Statistical Modeling
  • How to ensure quality of data at large scale?
15 minutes
OTHER 7. Data-Centric AI Competition
  • How can I get hands-on practice with data-centric ML on a simple dataset?
10 minutes
ARTICLE 8. A deep-dive into Andrew Ng data-centric competition
  • What does Andrew's "data-centric ML" mean more concretely?
10 minutes
OTHER 9. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
  • What is the impact of errors in data labelling?
10 minutes
USE_CASE 10. Find label issues with confident learning for NLP
10 minutes
ARTICLE 11. Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing
  • How can one get reliable labels for large-scale projects?
15 minutes
PAPER 12. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
  • What are all the ways text data can be augmented for NLP tasks?
20 minutes
REPO 13. AugLy: audio, image, text & video Data Augmentation
  • How to augment audio, image, text, and video data?
20 minutes
REPO 14. TextAttack: Generating adversarial examples for NLP models
  • How to generate adversarial text examples?
20 minutes
REPO 15. nlpaug: Data Augmentation for NLP
  • How to augment text data for NLP tasks?
20 minutes
ARTICLE 16. A Visual Survey of Data Augmentation in NLP
10 minutes
