Uber facilitates 14 million trips per day, but not all data feeds into its pipeline without interruption. Network connectivity failures or malfunctions in individual data centers can cause massive shifts in the data, yet they are inevitable. Making large-scale automated decisions on poor-quality data can lead to disastrous outcomes, so Uber built the Data Quality Monitor system to perform automated data quality analysis.
Due to the sheer volume of data the company receives every day, one-off statistical analysis with Confident Learning is not scalable. Uber first built the Data Stats Service (DSS) to generate time-series quality metrics for each table column at the data sources. These metrics feed into a dedicated anomaly detection platform, which in turn surfaces its analysis on a front-end dashboard that alerts data users when there is concern about data quality.
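To make the first stage concrete, here is a minimal sketch of what per-column, per-day quality metrics might look like. It assumes a pandas DataFrame partitioned by a date column; the metric names (`row_count`, `null_rate`, `distinct_count`) are illustrative choices, not Uber's actual DSS schema.

```python
import pandas as pd

def column_quality_metrics(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Compute simple per-day, per-column quality metrics.

    This is an illustrative stand-in for a stats service: for each
    partition (day) and each column, it records row count, null rate,
    and distinct count, yielding one time series per (column, metric).
    """
    records = []
    for day, chunk in df.groupby(date_col):
        for col in df.columns:
            if col == date_col:
                continue
            records.append({
                "day": day,
                "column": col,
                "row_count": len(chunk),
                "null_rate": chunk[col].isna().mean(),
                "distinct_count": chunk[col].nunique(),
            })
    return pd.DataFrame(records)

# Hypothetical trips table with a partial outage on the second day:
trips = pd.DataFrame({
    "ds": ["2024-01-01"] * 4 + ["2024-01-02"] * 4,
    "fare": [10.0, 12.5, 8.0, 9.5, 11.0, None, None, None],
    "city": ["SF", "SF", "NY", "NY", "SF", "NY", None, "NY"],
})
metrics = column_quality_metrics(trips, "ds")
```

Collected daily, each (column, metric) pair becomes a time series that a downstream anomaly detector can watch; a sudden jump in `null_rate`, as in the second day above, is exactly the kind of signal an outage would produce.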
Statistically speaking, Uber utilizes Principal Component Analysis and time series analysis to bundle the hundreds of metrics generated by DSS and to monitor anomalies in variance across the time series. To reduce the noise from metric-level anomalies, Uber has also implemented an unweighted scoring system that alerts on table-level anomalies instead.
While advanced techniques like Confident Learning are a great leap forward from manual inspection, such techniques still solve only part of the problem when the data feed operates at a greater scale. This material helps data scientists anticipate what is possible when designing a data pipeline at a grander scale.