Khatri P, Shakya KS, Kumar P.
Environ Sci Pollut Res Int. 2024 Oct;31(49):59534-59570.
Just as the value of crude oil is unlocked through refining, the true potential of air quality data is realized through systematic processing, analysis, and application. This refined data is critical for making informed decisions that may protect health and the environment. However, ground-based air quality monitoring data often face quality control issues, notably outliers. Outliers in air quality data are classified as either error-based or event-based. Error-based outliers are due to instrument failure, self-calibration, and sensor drift over time, whereas event-based outliers are associated with sudden changes in meteorological conditions. Event-based outliers are meaningful, while error-based outliers are noise that needs to be eliminated and replaced after detection. In this study, we address error-based outlier detection in air quality data, particularly targeting particulate pollutants (PM2.5 and PM10) across various monitoring sites in Delhi. Our research specifically examines data from sites with less than 5% missing values and identifies four distinct types of error-based outliers: extreme values due to measurement errors, consecutive constant readings and low variance due to instrument malfunction, periodic outliers from self-calibration exceptions, and anomalies in the PM2.5/PM10 ratio indicative of issues with the instruments' dryer unit. We developed a robust methodology for outlier detection by fitting a non-linear filter to the data, calculating residuals between observed and predicted values, and then assessing these residuals using a standardized Z-score to determine their probability. Outliers are flagged based on a probability threshold established through sensitivity testing. This approach helps distinguish normal data points from suspicious ones, ensuring the refined quality of data necessary for accurate air quality modeling.
This method is essential for improving the reliability of statistical and machine learning models that depend on high-quality environmental data.
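
The filter-residual-Z-score procedure described in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the abstract does not name the non-linear filter, so a centered rolling median is assumed here, and the function names (`rolling_median`, `flag_outliers`), the window length, the normal-distribution assumption for converting Z-scores to probabilities, and the 0.01 threshold are all hypothetical choices. In the study the threshold is set through sensitivity testing.

```python
import math
import numpy as np

def rolling_median(values, window=5):
    """Centered rolling median (assumed stand-in for the non-linear filter).
    Edges are padded by repeating the first and last values."""
    half = window // 2
    padded = np.pad(values, half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(values))])

def flag_outliers(values, window=5, p_threshold=0.01):
    """Flag points whose filter residual is improbable under a normal model.
    `window` and `p_threshold` are hypothetical defaults, not values from the study."""
    values = np.asarray(values, dtype=float)
    predicted = rolling_median(values, window)   # filter output ("predicted" values)
    residuals = values - predicted               # observed minus predicted
    std = residuals.std()
    if std == 0:                                 # constant residuals: nothing to flag
        return np.zeros(len(values), dtype=bool)
    z = (residuals - residuals.mean()) / std     # standardized Z-score
    # Two-sided tail probability of each Z-score under a standard normal
    p = np.array([math.erfc(abs(zi) / math.sqrt(2)) for zi in z])
    return p < p_threshold

# Illustrative hourly PM2.5 readings (made-up numbers) with one spike at index 5
pm25 = [40, 42, 41, 43, 42, 400, 44, 43, 45, 44, 46, 45]
flags = flag_outliers(pm25)
```

In this toy series only the spike at index 5 falls below the probability threshold. A sketch like this covers only the extreme-value case; constant readings, periodic calibration artifacts, and PM2.5/PM10 ratio anomalies would need their own checks.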