Covers: theory of data augmentation
Estimated time needed to finish: 11 minutes
Questions this item addresses:
  • What are the common ways to augment text data?
How to use this item?

Summary of Techniques

This article surveys some of the most common ways to augment text data. It covers techniques such as lexical substitution, back translation, text surface transformation, random noise injection, instance crossover augmentation, syntax-tree manipulation, MixUp for text, and generative methods.
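
To make two of the listed techniques concrete, here is a minimal sketch of lexical substitution and random noise injection (shown as random word deletion). The `SYNONYMS` table, the function names, and the probabilities are assumptions made for illustration and are not taken from the article; a real pipeline would more likely draw substitutes from a thesaurus or word embeddings.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}


def lexical_substitution(text, rng, p=0.5):
    """Replace each word found in SYNONYMS with a random synonym, with probability p."""
    out = []
    for word in text.split():
        choices = SYNONYMS.get(word.lower())
        if choices and rng.random() < p:
            out.append(rng.choice(choices))
        else:
            out.append(word)
    return " ".join(out)


def random_deletion(text, rng, p=0.2):
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else rng.choice(words)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    sentence = "the quick dog is happy in the big yard"
    print(lexical_substitution(sentence, rng, p=1.0))
    print(random_deletion(sentence, rng, p=0.3))
```

Both functions take an explicit `random.Random` instance so that augmented datasets can be regenerated from a seed. Whether the label still holds after substitution or deletion depends on the task, so augmented samples are worth spot-checking.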

A Note on Swapping Techniques

The blog covers one random swapping technique, word swapping, under section 4g, Random Swap. Swapping, however, can be applied to other sequences of tokens and structures in text, including but not limited to:

  • Phrases
  • Sentences
  • Paragraphs
  • Pages
  • Documents (cases where the network parameter size is very large)
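
As a rough illustration of swapping at more than one granularity, the sketch below reuses one generic `random_swap` helper for words, sentences, and paragraphs. The function names, the regex-based sentence splitter, and the blank-line paragraph convention are assumptions made for this example, not something the blog prescribes.

```python
import random
import re


def random_swap(units, rng, n_swaps=1):
    """Return a copy of `units` with n_swaps random pairs of positions exchanged."""
    units = list(units)
    if len(units) < 2:
        return units  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(units)), 2)  # two distinct positions
        units[i], units[j] = units[j], units[i]
    return units


def swap_words(sentence, rng, n_swaps=1):
    """Swap whitespace-separated words within one sentence."""
    return " ".join(random_swap(sentence.split(), rng, n_swaps))


def swap_sentences(paragraph, rng, n_swaps=1):
    """Swap sentences within a paragraph."""
    # Naive splitter: breaks after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return " ".join(random_swap(sentences, rng, n_swaps))


def swap_paragraphs(document, rng, n_swaps=1):
    """Swap paragraphs, assuming they are separated by blank lines."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return "\n\n".join(random_swap(paragraphs, rng, n_swaps))


if __name__ == "__main__":
    rng = random.Random(42)
    print(swap_words("data augmentation helps small models", rng))
    print(swap_sentences("It rained. We stayed in. The game was fun!", rng))
```

The same helper would extend to pages or whole documents given a suitable splitter. Reordering larger units can change the meaning for order-sensitive tasks, so it is probably best suited to tasks where the label does not depend on ordering.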
Author(s) / creator(s) / reference(s)
Amit Chaudhary
Recipe

Data-centric Natural Language Processing

Total time needed: ~5 hours
Objectives
In this recipe, you’ll learn the fundamentals of data-centric ML and advanced techniques for data quality assurance with NLP implementations, plus bonus material on how big companies apply a data-centric mindset at scale in production.
Potential Use Cases
Low-resource NLP, Small language models
Who Is This For?
INTERMEDIATE: Data Scientists, Machine Learning Engineers
Resources
WRITEUP 1. Introduction to Data-centric NLP
  • Why should I care about data-centric NLP?
5 minutes
VIDEO 2. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • What is data-centric machine learning?
60 minutes
REPO 3. cleanlab: Python package for machine learning with noisy labels
  • How can I prepare my noisy data for ML?
20 minutes
USE_CASE 4. Find label issues with confident learning for NLP
  • How do you find label issues with text data?
10 minutes
ARTICLE 5. A Visual Survey of Data Augmentation in NLP
  • What are the common ways to augment text data?
11 minutes
ARTICLE 6. Monitoring Data Quality at Scale with Statistical Modeling
  • How to ensure quality of data at large scale?
15 minutes
OTHER 7. Data-Centric AI Competition
  • How can I get hands-on practice with data-centric ML on a simple dataset?
10 minutes
ARTICLE 8. A deep-dive into Andrew Ng's data-centric competition
  • What does Andrew's "data-centric ml" mean more concretely?
10 minutes
OTHER 9. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
  • What's the impact of error in data-labelling?
10 minutes
USE_CASE 10. Find label issues with confident learning for NLP
10 minutes
ARTICLE 11. Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing
  • How can one get reliable labels for large-scale projects?
15 minutes
PAPER 12. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
  • What are all the ways text data can be augmented for NLP tasks?
20 minutes
REPO 13. AugLy: audio, image, text & video Data Augmentation
  • How to augment audio, image, text, and video data?
20 minutes
REPO 14. TextAttack: Generating adversarial examples for NLP models
  • How to generate text examples?
20 minutes
REPO 15. nlpaug: Data Augmentation for NLP
  • How to augment text data for NLP tasks?
20 minutes
ARTICLE 16. A Visual Survey of Data Augmentation in NLP
10 minutes

