Data mining methods for prediction of air pollution

Krzysztof Siwek; Stanisław Osowski

International Journal of Applied Mathematics and Computer Science (2016)

  • Volume: 26, Issue: 2, page 467-478
  • ISSN: 1641-876X

Abstract

The paper discusses methods of data mining for prediction of air pollution. Two tasks in such a problem are important: generation and selection of the prognostic features, and the final prognostic system of the pollution for the next day. An advanced set of features, created on the basis of the atmospheric parameters, is proposed. This set is subject to analysis and selection of the most important features from the prediction point of view. Two methods of feature selection are compared. One applies a genetic algorithm (a global approach), and the other applies a linear method of stepwise fit (a locally optimized approach). On the basis of such analysis, two sets of the most predictive features are selected. These sets take part in prediction of the atmospheric pollutants PM10, SO2, NO2 and O3. Two approaches to prediction are compared. In the first one, the features selected are directly applied to the random forest (RF), which forms an ensemble of decision trees. In the second case, intermediate predictors built on the basis of neural networks (the multilayer perceptron, the radial basis function and the support vector machine) are used. They create an ensemble integrated into the final prognosis. The paper shows that preselection of the most important features, cooperating with an ensemble of predictors, allows increasing the forecasting accuracy of atmospheric pollution in a significant way.
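
The locally optimized selection mentioned in the abstract can be illustrated with a small sketch. It is not the authors' procedure: it assumes a simplified forward-only variant that adds, one at a time, the feature that most reduces the residual sum of squares of a linear least-squares fit, and it stops at a fixed feature count instead of using the significance tests of a full stepwise fit. The function names (`solve`, `sse`, `forward_select`) and the synthetic data are hypothetical and stand in for real atmospheric measurements.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse(X, y, cols):
    """Residual sum of squares of a least-squares fit on the given columns plus an intercept."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(cols) + 1
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    w = solve(ata, aty)
    return sum((t - sum(wi * ri for wi, ri in zip(w, r))) ** 2
               for r, t in zip(rows, y))


def forward_select(X, y, n_keep):
    """Greedily pick n_keep feature indices, each time adding the best remaining one."""
    selected = []
    remaining = list(range(len(X[0])))
    while len(selected) < n_keep:
        best = min(remaining, key=lambda c: sse(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data (an assumption for illustration): five candidate features,
# of which only features 1 and 3 actually drive the target.
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
y = [3.0 * r[1] - 2.0 * r[3] + random.gauss(0.0, 0.1) for r in X]

selected = forward_select(X, y, 2)
print("selected feature indices:", selected)
```

Each step here only looks for the best single addition to the current set, which is what makes the approach local; a genetic algorithm instead searches over whole feature subsets at once.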

How to cite

Krzysztof Siwek and Stanisław Osowski. "Data mining methods for prediction of air pollution." International Journal of Applied Mathematics and Computer Science 26.2 (2016): 467-478. <http://eudml.org/doc/280109>.

@article{KrzysztofSiwek2016,
abstract = {The paper discusses methods of data mining for prediction of air pollution. Two tasks in such a problem are important: generation and selection of the prognostic features, and the final prognostic system of the pollution for the next day. An advanced set of features, created on the basis of the atmospheric parameters, is proposed. This set is subject to analysis and selection of the most important features from the prediction point of view. Two methods of feature selection are compared. One applies a genetic algorithm (a global approach), and the other applies a linear method of stepwise fit (a locally optimized approach). On the basis of such analysis, two sets of the most predictive features are selected. These sets take part in prediction of the atmospheric pollutants PM10, SO2, NO2 and O3. Two approaches to prediction are compared. In the first one, the features selected are directly applied to the random forest (RF), which forms an ensemble of decision trees. In the second case, intermediate predictors built on the basis of neural networks (the multilayer perceptron, the radial basis function and the support vector machine) are used. They create an ensemble integrated into the final prognosis. The paper shows that preselection of the most important features, cooperating with an ensemble of predictors, allows increasing the forecasting accuracy of atmospheric pollution in a significant way.},
author = {Krzysztof Siwek and Stanisław Osowski},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {computational intelligence; feature selection; neural networks; random forest; air pollution forecasting},
language = {eng},
number = {2},
pages = {467-478},
title = {Data mining methods for prediction of air pollution},
url = {http://eudml.org/doc/280109},
volume = {26},
year = {2016},
}

TY - JOUR
AU - Krzysztof Siwek
AU - Stanisław Osowski
TI - Data mining methods for prediction of air pollution
JO - International Journal of Applied Mathematics and Computer Science
PY - 2016
VL - 26
IS - 2
SP - 467
EP - 478
AB - The paper discusses methods of data mining for prediction of air pollution. Two tasks in such a problem are important: generation and selection of the prognostic features, and the final prognostic system of the pollution for the next day. An advanced set of features, created on the basis of the atmospheric parameters, is proposed. This set is subject to analysis and selection of the most important features from the prediction point of view. Two methods of feature selection are compared. One applies a genetic algorithm (a global approach), and the other applies a linear method of stepwise fit (a locally optimized approach). On the basis of such analysis, two sets of the most predictive features are selected. These sets take part in prediction of the atmospheric pollutants PM10, SO2, NO2 and O3. Two approaches to prediction are compared. In the first one, the features selected are directly applied to the random forest (RF), which forms an ensemble of decision trees. In the second case, intermediate predictors built on the basis of neural networks (the multilayer perceptron, the radial basis function and the support vector machine) are used. They create an ensemble integrated into the final prognosis. The paper shows that preselection of the most important features, cooperating with an ensemble of predictors, allows increasing the forecasting accuracy of atmospheric pollution in a significant way.
LA - eng
KW - computational intelligence; feature selection; neural networks; random forest; air pollution forecasting
UR - http://eudml.org/doc/280109
ER -

