  • What are the common ways to augment text data?
Summary of Techniques

This article surveys some of the most common ways to augment text data. It covers techniques such as Lexical Substitution, Back Translation, Text Surface Transformation, Random Noise Injection, Instance Crossover Augmentation, Syntax-tree Manipulation, MixUp for Text, and Generative Methods.

A Note on Swapping Techniques

The blog below covers one random swapping technique, word swapping, in section 4g (Random Swap). Swapping, however, can be applied to many other sequences of tokens and structures in text, including but not limited to:

  • Phrases
  • Sentences
  • Paragraphs
  • Pages
  • Documents (for cases where the network has a very large number of parameters)
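
The idea of swapping at different granularities can be sketched as follows. This is a minimal illustration, not the article's implementation: the function names (`random_swap`, `swap_words`, `swap_sentences`, `swap_paragraphs`) are hypothetical, and it assumes whitespace tokenization, a naive regex sentence splitter, and paragraphs separated by blank lines.

```python
import random
import re


def random_swap(units, n_swaps=1, rng=None):
    """Return a copy of `units` with `n_swaps` random pairs of positions exchanged."""
    rng = rng or random.Random()
    units = list(units)
    if len(units) < 2:
        return units  # nothing to swap
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their contents.
        i, j = rng.sample(range(len(units)), 2)
        units[i], units[j] = units[j], units[i]
    return units


def swap_words(text, n_swaps=1, rng=None):
    # Assumption: words are whatever str.split() yields.
    return " ".join(random_swap(text.split(), n_swaps, rng))


def swap_sentences(text, n_swaps=1, rng=None):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(random_swap(sentences, n_swaps, rng))


def swap_paragraphs(text, n_swaps=1, rng=None):
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    return "\n\n".join(random_swap(paragraphs, n_swaps, rng))
```

The same `random_swap` helper handles every level; only the splitting and joining step changes. Passing a seeded `random.Random` as `rng` makes the augmentation reproducible. Whether swapping larger units preserves the label depends on the task, so it is worth checking a sample of augmented examples by hand.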
Amit Chaudhary
Data-centric Natural Language Processing

In this recipe, you’ll learn the fundamentals of data-centric ML and advanced techniques for data quality assurance with NLP implementations, plus bonus material on how big companies put a data-centric mindset into production at large scale.
Low-resource NLP, Small language models
Data Scientists, Machine Learning Engineers
WRITEUP 1. Introduction to Data-centric NLP
  • Why should I care about data-centric NLP?
VIDEO 2. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • What is data-centric machine learning?
REPO 3. cleanlab: python package for machine learning with noisy labels
  • How can I prepare my noisy data for ML?
USE_CASE 4. Find label issues with confident learning for NLP
  • How do you find label issues with text data?
ARTICLE 5. A Visual Survey of Data Augmentation in NLP
  • What are the common ways to augment text data?
ARTICLE 6. Monitoring Data Quality at Scale with Statistical Modeling
  • How to ensure quality of data at large scale?
OTHER 7. Data-Centric AI Competition
  • How can I get hands-on practice with data-centric ML on a simple dataset?
ARTICLE 8. A deep-dive into Andrew NG data-centric competition
  • What does Andrew's "data-centric ML" mean more concretely?
OTHER 9. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
  • What's the impact of error in data-labelling?
USE_CASE 10. Find label issues with confident learning for NLP
ARTICLE 11. Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing
  • How can one get reliable labels for large-scale projects?
PAPER 12. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
  • What are all the ways text data can be augmented for NLP tasks?
REPO 13. AugLy: audio, image, text & video Data Augmentation
  • How to augment audio, image, text, and video data?
REPO 14. TextAttack: Generating adversarial examples for NLP models
  • How to generate text examples?
REPO 15. nlpaug: Data Augmentation for NLP
  • How to augment text data for NLP tasks?
Concepts Covered

