If you haven’t heard, Data-Centric Machine Learning, promoted by Andrew Ng, is the next big thing. Data-Centric ML has confused many professionals, since data quality has always been part of the package for data scientists, but there is more to explore when data quality has to be assured at a scale where a data scientist can no longer check it manually.
The ultimate purpose of data-centric approaches to ML is to ensure data quality: the data itself, and any labels provided with it, are cleaned or otherwise kept to a very high standard so that the burden of de-noising doesn't fall on the models.
This is especially important in NLP, where we have been seeing larger and larger language models that are well outside the reach of the typical ML practitioner. So, here we document some resources and instructions on how textual data can be cleaned and prepared for NLP use cases, reducing the need for ever larger language models.
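As a concrete starting point, here is a minimal sketch of the kind of text cleaning the recipe has in mind. The specific steps (Unicode normalization, stripping HTML remnants, collapsing whitespace) are my own illustrative choices, not a prescribed pipeline from this guide:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize a raw document before labelling or model training."""
    # Normalize Unicode so visually identical characters compare equal
    # (NFKC also folds compatibility characters like circled digits)
    text = unicodedata.normalize("NFKC", text)
    # Strip leftover HTML tags from scraped pages
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_text("Caf\u00e9   <b>menu</b>\n\u2460 soup"))
```

Real pipelines usually add domain-specific steps on top of this (deduplication, language filtering, boilerplate removal), but the principle is the same: fix the text once, upstream, rather than asking the model to absorb the noise.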
Follow this recipe step by step, but then also check out the "additional resources" section.
by: Ian Yu

I know EDA sounds like it’s “obvious”, but there are two things that I personally think would be great here:
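Even a few lines of EDA surface problems before labelling starts. This is a minimal sketch of two common checks, label balance and text length, over a toy dataset of my own invention; the sample texts and labels are illustrative, not from this guide:

```python
from collections import Counter
import statistics

# Toy labelled corpus (assumption); in practice this comes from your dataset
samples = [
    ("great product, works as advertised", "positive"),
    ("arrived broken", "negative"),
    ("ok i guess", "neutral"),
    ("great value", "positive"),
]

# Label balance: a heavily skewed distribution often means sampling bias
label_counts = Counter(label for _, label in samples)

# Token-length stats: extreme outliers are often scraping artifacts
lengths = [len(text.split()) for text, _ in samples]

print("label balance:", dict(label_counts))
print("token lengths: min=%d max=%d mean=%.1f"
      % (min(lengths), max(lengths), statistics.mean(lengths)))
```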
As data gets bigger, it’s hard to QA every single data point, so QA should be implemented both before and after labelling. When labelling through vendors, there are a few things that must happen for each data point:
Anecdotally, conflicts happen about 10% of the time, which is why the cost is often calculated as 2.1 annotations per data point: two labels for every point, plus a tie-breaking third for the roughly 10% that conflict. But this doesn’t have to be done through vendors. For example, Walmart has implemented the Chimera system to handle data labelling at their enormous scale through a combination of labelling vendors, business analysts, and machine learning.
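The double-annotation-plus-tie-break scheme above can be sketched in a few lines. The `resolve` helper and the example labels are my own illustration of the majority-vote logic, not code from Chimera or any vendor:

```python
from collections import Counter

def resolve(labels):
    """Return the majority label, or None when annotators are tied."""
    label, freq = Counter(labels).most_common(1)[0]
    return label if freq * 2 > len(labels) else None

# Two annotators per point; a tie escalates to a third annotator
pair = ("spam", "ham")
final = resolve(pair)
if final is None:
    final = resolve(pair + ("spam",))  # third vote breaks the tie

# Expected annotation cost per point, matching the 2.1 figure in the text
conflict_rate = 0.10                       # anecdotal, from above
labels_per_point = 2 + conflict_rate * 1   # = 2.1 on average

print(final, labels_per_point)
```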