
In this talk, we present a data-centric view of NLP operations and tooling that bridges different styles of software libraries, different user personas, and additional infrastructures such as those for visualization and distributed training. We propose a highly universal data representation called DataPack, built on a flexible type ontology that is morphable and extensible enough to subsume the data formats commonly used in all known (and, hopefully, future) NLP tasks, yet remains invariant as a software data structure that can be passed between any NLP building blocks. Based on this abstraction, we develop Forte, a data-centric framework for composable NLP workflows, with rich in-house processors, standardized third-party API wrappers, and operation logic implemented at the right level of abstraction to facilitate rapid composition of sophisticated NLP solutions from heterogeneous components.
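The composable, data-centric design described above can be illustrated with a minimal sketch (the class names below are illustrative only, not Forte's actual API): a single shared DataPack-like container flows unchanged through interchangeable processors, each of which only adds annotations.

```python
# Minimal sketch of a data-centric pipeline: one shared container
# ("DataPack") passes unchanged through heterogeneous processors.
# Hypothetical classes for illustration -- not Forte's real API.

class DataPack:
    """Holds raw text plus typed annotations added by processors."""
    def __init__(self, text):
        self.text = text
        self.annotations = {}  # e.g. {"tokens": [...], "sentiment": ...}

class Tokenizer:
    def process(self, pack):
        pack.annotations["tokens"] = pack.text.split()
        return pack

class SentimentTagger:
    def process(self, pack):
        # Toy lexicon-based processor; a real one could wrap any 3rd-party model.
        positive = {"good", "great", "excellent"}
        hits = sum(t.lower() in positive for t in pack.annotations["tokens"])
        pack.annotations["sentiment"] = "positive" if hits else "neutral"
        return pack

class Pipeline:
    def __init__(self, processors):
        self.processors = processors

    def run(self, text):
        pack = DataPack(text)
        for p in self.processors:
            pack = p.process(pack)
        return pack

pack = Pipeline([Tokenizer(), SentimentTagger()]).run("Forte makes great pipelines")
print(pack.annotations["sentiment"])  # positive
```

Because every processor consumes and returns the same container type, components can be reordered or swapped without changing the surrounding code, which is the composability the talk argues for.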

Video: https://www.youtube.com/watch?v=JV2y4cT56YE
Data-centric Natural Language Processing

Contributors
Total time needed: ~5 hours
Objectives
In this recipe, you’ll learn the fundamentals of data-centric ML and advanced techniques for data quality assurance with NLP implementations, plus bonus material on how large companies put a data-centric mindset into production at scale.
Potential Use Cases
Low-resource NLP, Small language models
Who Is This For?
INTERMEDIATE: Data Scientists, Machine Learning Engineers
Click on each of the following annotated items to see details.
Resources
WRITEUP 1. Introduction to Data-centric NLP
  • Why should I care about data-centric NLP?
5 minutes
VIDEO 2. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • What is data-centric machine learning?
60 minutes
REPO 3. cleanlab: python package for machine learning with noisy labels
  • How can I prepare my noisy data for ML?
20 minutes
USE_CASE 4. Find label issues with confident learning for NLP
  • How do you find label issues with text data?
10 minutes
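The confident-learning idea behind items 3 and 4 can be sketched in a few lines of plain Python (a simplified illustration, not cleanlab's actual implementation): flag an example as a likely label issue when the model confidently predicts a class other than the given label.

```python
# Simplified confident-learning sketch (not cleanlab's implementation):
# flag an example as a likely label issue when the model assigns high
# probability to a class other than the given label.

def find_likely_label_issues(labels, pred_probs, threshold=0.5):
    """Return indices of examples whose given label looks wrong.

    labels: list of int class ids
    pred_probs: per-example lists of per-class probabilities
    threshold: minimum confidence required in the competing class
    """
    issues = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != label and probs[best] >= threshold:
            issues.append(i)
    return issues

labels = [0, 1, 0]
pred_probs = [
    [0.9, 0.1],  # agrees with given label 0
    [0.8, 0.2],  # given label 1, but model is confident in class 0
    [0.6, 0.4],  # agrees with given label 0
]
print(find_likely_label_issues(labels, pred_probs))  # [1]
```

The real method additionally calibrates per-class confidence thresholds and estimates a joint distribution of given vs. true labels; the resource above covers those details.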
ARTICLE 5. A Visual Survey of Data Augmentation in NLP
  • What are the common ways to augment text data?
11 minutes
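Two of the simplest techniques surveyed above, random word deletion and random word swap, can be sketched in plain Python (a toy illustration of the idea, not any library's implementation):

```python
import random

# Sketch of two common text-augmentation operations:
# random word deletion and random word swap.

def random_deletion(words, p=0.2, rng=random):
    """Drop each word with probability p (always keep at least one word)."""
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

def random_swap(words, n_swaps=1, rng=random):
    """Swap n_swaps random pairs of word positions."""
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(0)  # seeded for reproducibility
sent = "data quality matters more than model size".split()
print(" ".join(random_swap(sent, n_swaps=1, rng=rng)))
```

Both operations produce label-preserving variants for most classification tasks, which is why they appear in nearly every augmentation survey.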
ARTICLE 6. Monitoring Data Quality at Scale with Statistical Modeling
  • How to ensure data quality at large scale?
15 minutes
OTHER 7. Data-Centric AI Competition
  • How can I get hands-on practice with data-centric ML on a simple dataset?
10 minutes
ARTICLE 8. A deep-dive into Andrew Ng's data-centric competition
  • What does Andrew Ng's "data-centric ML" mean more concretely?
10 minutes
OTHER 9. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
  • What is the impact of errors in data labelling?
10 minutes
USE_CASE 10. Find label issues with confident learning for NLP
10 minutes
ARTICLE 11. Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing
  • How can one get reliable labels for large-scale projects?
15 minutes
PAPER 12. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
  • What are all the ways text data can be augmented for NLP tasks?
20 minutes
REPO 13. AugLy: audio, image, text & video Data Augmentation
  • How to augment audio, image, text, and video data?
20 minutes
REPO 14. TextAttack: Generating adversarial examples for NLP models
  • How to generate adversarial text examples for NLP models?
20 minutes
REPO 15. nlpaug: Data Augmentation for NLP
  • How to augment text data for NLP tasks?
20 minutes
ARTICLE 16. A Visual Survey of Data Augmentation in NLP
10 minutes
VIDEO 17. Neural Search for Low Resource Scenarios
  • How to train NLP with very few examples?
40 minutes
VIDEO 18. ICML 2021 invited talk: A Data-Centric View for Composable Natural Language Processing by Eric Xing
10 minutes
