Why Data-centric NLP?

If you haven’t heard, Data-Centric Machine Learning, promoted by Andrew Ng, is the next big thing. Data-Centric ML has confused many professionals, since data quality has always been part of the package for data scientists — but there is more to explore when data quality has to be assured at a scale where it can no longer be done manually by a single data scientist.

The ultimate purpose of data-centric approaches to ML is to ensure data quality: the data itself, and any labels provided with it, are cleaned or otherwise kept to a very high standard so that the burden of de-noising doesn't fall on the models.

This is especially important in NLP, where language models keep growing larger and are now well outside the reach of the typical ML practitioner. So, here we document some resources and instructions on how textual data can be cleaned and prepared for NLP use cases, reducing the need for ever larger language models.

Follow this recipe step by step, but then also check out the "additional resources" section.

Importance of Exploratory Data Analysis, Data Quality Control

by: Ian Yu

I know EDA sounds like it’s “obvious”, but there are two things that I personally think would be great here:

  1. First is the actual quality control, because even the most commonly cited datasets contain mislabelled data. This is actually how I stumbled upon Andrew Ng’s data-centric approach, while looking for a more programmatic way to do this. I haven’t tested them myself, but I'm providing some libraries that are supposed to do a good job on this.
  2. The other thing almost always missing from EDA notebooks is external research (i.e. actually acquiring subject-matter knowledge); most EDA focuses on statistical measures within the dataset. External research helps us be more creative with data augmentation (including feature engineering), and lets us gauge whether an augmentation makes sense under human-level inspection.
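On the quality-control point, the core idea behind confident-learning libraries such as cleanlab is to compare a model's out-of-sample predicted probabilities against the given labels and flag examples whose own label scores suspiciously low. Below is a minimal sketch of that idea — a simplified thresholding heuristic, not cleanlab's actual API, and `find_label_issues` here is just an illustrative name:

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Flag likely mislabelled examples, in the spirit of confident learning.

    labels:     (n,) array of given (possibly noisy) integer labels
    pred_probs: (n, k) array of out-of-sample predicted class probabilities
    Returns a boolean mask: True where the given label looks suspect.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    n, k = pred_probs.shape

    # Per-class threshold: average predicted probability of class j over
    # examples actually labelled j (the "self-confidence" of that class).
    thresholds = np.array([
        pred_probs[labels == j, j].mean() if np.any(labels == j) else 1.0
        for j in range(k)
    ])

    # How strongly the model believes each example's own label.
    self_conf = pred_probs[np.arange(n), labels]

    # Suspect = own label scores below its class threshold, while some
    # *other* class clears its own threshold (a plausible true label).
    above = pred_probs >= thresholds          # (n, k) boolean
    above[np.arange(n), labels] = False       # ignore the given label itself
    return (self_conf < thresholds[labels]) & above.any(axis=1)
```

For example, a point labelled class 1 whose model probabilities are 0.9 / 0.1 would be flagged, since class 0 clears its threshold while the given label does not.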

As datasets get bigger, it’s hard to QA every single data point, so QA should be implemented both before and after labelling. When labelling through vendors, there are a few things that must happen for each data point:

  • There shouldn’t be a single person performing labelling, but at least two
  • If both labels agree, the data point is passed through
  • If there is a conflict, a third person comes in and labels again

Anecdotally, conflicts happen about 10% of the time, which is why the cost is often calculated as 2.1 labels per data point. But this doesn’t have to be done through vendors. For example, Walmart implemented the Chimera system (see the Chimera article in the resources below) to handle data labelling at their enormous scale through a combination of labelling vendors, business analysts, and machine learning.
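The two-annotator protocol and the 2.1× cost figure can be sketched directly. `resolve_label` and `expected_labels_per_point` are hypothetical helper names for illustration, not any vendor's API:

```python
def resolve_label(label_a, label_b, adjudicate):
    """Two annotators label each point; a third breaks ties.

    label_a, label_b: labels from the first two annotators
    adjudicate:       callable that fetches a third annotator's label
    Returns (accepted_label, labels_spent).
    """
    if label_a == label_b:
        return label_a, 2      # agreement: pass the point through
    return adjudicate(), 3     # conflict: third annotator decides

def expected_labels_per_point(conflict_rate=0.10):
    """Expected annotation cost per data point: two labels always,
    plus a third label on the fraction of points with a conflict."""
    return 2 + conflict_rate
```

At a 10% conflict rate the expected cost is 0.9 × 2 + 0.1 × 3 = 2.1 labels per point, which is where the 2.1× figure comes from.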

Covers: theory of Data-centric Machine Learning
Estimated time needed to finish: 5 minutes
Questions this item addresses:
  • Why should I care about data-centric NLP?

Data-centric Natural Language Processing

Contributors
Total time needed: ~5 hours
Objectives
In this recipe, you’ll learn the fundamentals of Data-Centric ML and advanced techniques for data quality assurance with NLP implementations, as well as bonus material on how big companies bring a large-scale data-centric mindset into production.
Potential Use Cases
Low-resource NLP, Small language models
Who is This For?
INTERMEDIATE: Data Scientists, Machine Learning Engineers
Resources (16)
WRITEUP 1. Introduction to Data-centric NLP
  • Why should I care about data-centric NLP?
5 minutes
VIDEO 2. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • What is data-centric machine learning?
60 minutes
REPO 3. cleanlab: python package for machine learning with noisy labels
  • How can I prepare my noisy data for ML?
20 minutes
USE_CASE 4. Find label issues with confident learning for NLP
  • How do you find label issues with text data?
10 minutes
ARTICLE 5. A Visual Survey of Data Augmentation in NLP
  • What are the common ways to augment text data?
11 minutes
ARTICLE 6. Monitoring Data Quality at Scale with Statistical Modeling
  • How to ensure quality of data at large scale?
15 minutes
OTHER 7. Data-Centric AI Competition
  • How can I get hands-on practice with data-centric ML on a simple dataset?
10 minutes
ARTICLE 8. A deep-dive into Andrew Ng's data-centric competition
  • What does Andrew's "data-centric ml" mean more concretely?
10 minutes
OTHER 9. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
  • What's the impact of error in data-labelling?
10 minutes
USE_CASE 10. Find label issues with confident learning for NLP
10 minutes
ARTICLE 11. Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing
  • How can one get reliable labels for large-scale projects?
15 minutes
PAPER 12. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
  • What are all the ways text data can be augmented for NLP tasks?
20 minutes
REPO 13. AugLy: audio, image, text & video Data Augmentation
  • How to augment audio, image, text, and video data?
20 minutes
REPO 14. TextAttack: Generating adversarial examples for NLP models
  • How to generate adversarial text examples?
20 minutes
REPO 15. nlpaug: Data Augmentation for NLP
  • How to augment text data for NLP tasks?
20 minutes
ARTICLE 16. A Visual Survey of Data Augmentation in NLP
10 minutes

Concepts Covered
