Covers: theory of Data Quality Control
Estimated time needed to finish: 15 minutes
Questions this item addresses:
  • How do you ensure data quality at large scale?
How to use this item?

Uber facilitates 14 million trips per day, but not all of that data flows into its pipelines without interruption. Network outages or malfunctions in certain data centers can cause massive shifts in the data, and they are inevitable. Making large-scale automated decisions on poor-quality data can lead to disastrous outcomes, which is why Uber built the Data Quality Monitor system to perform automated data quality analysis.

Given the sheer volume of data the company receives every day, one-off statistical analysis with Confident Learning does not scale. Uber instead built the Data Stats Service (DSS) to compute time-series quality metrics for each table column at the data source. These metrics feed a dedicated anomaly detection platform, whose results are surfaced on a front-end dashboard that alerts data users whenever a data quality concern arises.
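As a rough illustration of the first step, the sketch below computes simple per-column, per-day quality metrics (row count, null rate, distinct count, mean) with pandas. The function and column names are hypothetical; Uber's actual Data Stats Service is internal and the article does not describe it at the code level.

```python
# Hypothetical sketch (not Uber's actual code): compute simple per-column,
# per-day quality metrics of the kind a "Data Stats Service" might emit.
import pandas as pd

def daily_column_metrics(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Return one row per (date, column) with basic quality metrics."""
    records = []
    days = pd.to_datetime(df[date_col]).dt.date
    for day, chunk in df.groupby(days):
        for col in df.columns:
            if col == date_col:
                continue
            series = chunk[col]
            records.append({
                "date": day,
                "column": col,
                "row_count": len(series),
                "null_rate": series.isna().mean(),
                "distinct_count": series.nunique(),
                "mean": series.mean() if pd.api.types.is_numeric_dtype(series) else None,
            })
    return pd.DataFrame(records)

# Usage (the DataFrame and column names are made up):
# metrics = daily_column_metrics(trips_df, date_col="trip_date")
# Each (column, metric) pair then forms one time series for anomaly detection.
```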

Statistically speaking, Uber uses Principal Component Analysis and time series analysis to condense the hundreds of metrics generated by DSS and to monitor anomalies in variance across the time series. To reduce the noise from metric-level anomalies, Uber has also implemented an unweighted scoring system that alerts on table-level anomalies instead, as in the sketch below.
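The snippet below is a minimal sketch of that idea, assuming the metrics have already been standardized into a day-by-metric matrix: PCA compresses the metrics, days with unusually large reconstruction error are flagged per metric, and the unweighted table-level score is simply the fraction of metrics flagged on a given day. It illustrates the technique and is not Uber's implementation.

```python
# Illustrative only: PCA-based variance monitoring with an unweighted
# table-level anomaly score (Uber's real system is not public code).
import numpy as np
from sklearn.decomposition import PCA

def table_anomaly_scores(metric_matrix: np.ndarray,
                         n_components: int = 5,
                         z_threshold: float = 3.0) -> np.ndarray:
    """metric_matrix: shape (n_days, n_metrics), metrics already standardized."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(metric_matrix)
    reconstructed = pca.inverse_transform(reduced)

    # Per-day, per-metric squared reconstruction error.
    errors = (metric_matrix - reconstructed) ** 2

    # Flag a metric on a given day if its error is far above that metric's norm.
    flags = errors > errors.mean(axis=0) + z_threshold * errors.std(axis=0)

    # Unweighted table-level score: fraction of metrics flagged that day.
    return flags.mean(axis=1)

# Usage: scores = table_anomaly_scores(standardized_metrics)
# Raise an alert for any day whose score exceeds an operator-chosen cutoff.
```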

While advanced techniques like Confident Learning are a great leap forward from manual inspection, they still solve only part of the problem once the data feed reaches a much larger scale. This material helps data scientists anticipate the design choices involved in building a data pipeline at that scale.

Author(s) / creator(s) / reference(s)
Uber Engineering
Recipe

Data-centric Natural Language Processing

Contributors
Total time needed: ~5 hours
Objectives
In this recipe, you'll learn the fundamentals of Data-Centric ML and advanced techniques for data quality assurance in NLP, plus bonus material on how big companies implement a large-scale data-centric mindset in production.
Potential Use Cases
Low-resource NLP, Small language models
Who is This For?
INTERMEDIATE: Data Scientists, Machine Learning Engineers
Resources (16)
WRITEUP 1. Introduction to Data-centric NLP
  • Why should I care about data-centric NLP?
5 minutes
VIDEO 2. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • What is data-centric machine learning?
60 minutes
REPO 3. cleanlab: python package for machine learning with noisy labels
  • How can I prepare my noisy data for ML?
20 minutes
USE_CASE 4. Find label issues with confident learning for NLP
  • How do you find label issues with text data?
10 minutes
ARTICLE 5. A Visual Survey of Data Augmentation in NLP
  • What are the common ways to augment text data?
11 minutes
ARTICLE 6. Monitoring Data Quality at Scale with Statistical Modeling
  • How do you ensure data quality at large scale?
15 minutes
OTHER 7. Data-Centric AI Competition
  • How can I get hands-on practice with data-centric ML on a simple dataset?
10 minutes
ARTICLE 8. A deep-dive into Andrew NG data-centric competition
  • What does Andrew Ng's "data-centric ML" mean more concretely?
10 minutes
OTHER 9. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
  • What's the impact of errors in data labelling?
10 minutes
USE_CASE 10. Find label issues with confident learning for NLP
10 minutes
ARTICLE 11. Chimera: Large-Scale Classification Using Machine Learning, Rules, and Crowdsourcing
  • How can one get reliable labels for large-scale projects?
15 minutes
PAPER 12. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
  • What are all the ways text data can be augmented for NLP tasks?
20 minutes
REPO 13. AugLy: audio, image, text & video Data Augmentation
  • How to augment audio, image, text, and video data?
20 minutes
REPO 14. TextAttack: Generating adversarial examples for NLP models
  • How to generate adversarial text examples for NLP models?
20 minutes
REPO 15. nlpaug: Data Augmentation for NLP
  • How to augment text data for NLP tasks?
20 minutes
ARTICLE 16. A Visual Survey of Data Augmentation in NLP
10 minutes

Concepts Covered
