Lookahead selective sampling for incomplete data

Loai Abdallah; Ilan Shimshoni

International Journal of Applied Mathematics and Computer Science (2016)

  • Volume: 26, Issue: 4, pages 871-884
  • ISSN: 1641-876X

Abstract

Missing values in data are common in real-world applications. There are several methods that deal with this problem. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first integrates a distance function that can measure the similarity between pairs of incomplete points within the framework of the LSS algorithm. The second uses ensemble clustering to represent the data as a cluster matrix without missing values, and then runs the LSS algorithm on the ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, specially modified to deal with incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the LSS and LSS-EC algorithms for incomplete data with two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.
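
The abstract names two building blocks without spelling them out: a distance function that stays defined when coordinates are missing, and a complete cluster-label matrix built by ensemble clustering, on which LSS-EC then runs. The Python sketch below illustrates both ideas under our own assumptions and is not the authors' implementation: NaN marks a missing value, mean imputation plus plain Lloyd iterations stand in for the paper's k-means and mean shift variants that handle missing values directly, and mde_distance (all function names here are ours) is only one plausible reading of a mean-Euclidean-distance construction for incomplete pairs.

import numpy as np

def mde_distance(x, y, attr_var):
    """Distance between two points that may contain NaNs (hypothetical sketch).

    A coordinate observed in both points contributes the usual squared
    difference; a coordinate missing in either point contributes the expected
    squared difference for that attribute, crudely estimated as twice its
    sample variance (the expected squared gap between two independent draws).
    This simplification is ours, not the paper's.
    """
    d2 = 0.0
    for j in range(x.shape[0]):
        if np.isnan(x[j]) or np.isnan(y[j]):
            d2 += 2.0 * attr_var[j]          # expected contribution of a missing value
        else:
            d2 += (x[j] - y[j]) ** 2
    return np.sqrt(d2)

def ensemble_cluster_matrix(X, n_runs=100, k_range=(2, 10), seed=0):
    """Map incomplete data X (n x d, NaN = missing) to a complete n x n_runs
    matrix of cluster labels: row i lists the cluster of point i in each run."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    col_means = np.nanmean(X, axis=0)
    X_imp = np.where(np.isnan(X), col_means, X)      # mean imputation (stand-in only)
    labels = np.empty((n, n_runs), dtype=int)
    for r in range(n_runs):
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        centers = X_imp[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(20):                          # a few Lloyd iterations
            d2 = ((X_imp[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for c in range(k):
                members = X_imp[assign == c]
                if len(members) > 0:                 # empty clusters keep their old centers
                    centers[c] = members.mean(axis=0)
        labels[:, r] = assign
    return labels

def ec_distance(labels, i, j):
    """Fraction of clusterings that separate points i and j: a distance on the
    complete label representation, on which a selective sampler can operate."""
    return float(np.mean(labels[i] != labels[j]))

The appeal of the label matrix is that it is complete even when X is not, so the lookahead sampler can score candidate queries on it without any special handling of missing values; the label disagreement in ec_distance follows the spirit of the ensemble-clustering-based distance of Abdallah and Shimshoni (2013), listed in the references below.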

How to cite


Loai Abdallah and Ilan Shimshoni. "Lookahead selective sampling for incomplete data." International Journal of Applied Mathematics and Computer Science 26.4 (2016): 871-884. <http://eudml.org/doc/287176>.

@article{LoaiAbdallah2016,
abstract = {Missing values in data are common in real-world applications. There are several methods that deal with this problem. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first integrates a distance function that can measure the similarity between pairs of incomplete points within the framework of the LSS algorithm. The second uses ensemble clustering to represent the data as a cluster matrix without missing values, and then runs the LSS algorithm on the ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, specially modified to deal with incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the LSS and LSS-EC algorithms for incomplete data with two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.},
author = {Loai Abdallah and Ilan Shimshoni},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {selective sampling; missing values; ensemble clustering},
language = {eng},
number = {4},
pages = {871--884},
title = {Lookahead selective sampling for incomplete data},
url = {http://eudml.org/doc/287176},
volume = {26},
year = {2016},
}

TY - JOUR
AU - Loai Abdallah
AU - Ilan Shimshoni
TI - Lookahead selective sampling for incomplete data
JO - International Journal of Applied Mathematics and Computer Science
PY - 2016
VL - 26
IS - 4
SP - 871
EP - 884
AB - Missing values in data are common in real-world applications. There are several methods that deal with this problem. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first integrates a distance function that can measure the similarity between pairs of incomplete points within the framework of the LSS algorithm. The second uses ensemble clustering to represent the data as a cluster matrix without missing values, and then runs the LSS algorithm on the ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, specially modified to deal with incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the LSS and LSS-EC algorithms for incomplete data with two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.
LA - eng
KW - selective sampling; missing values; ensemble clustering
UR - http://eudml.org/doc/287176
ER -

References

  1. Abdallah, L. and Shimshoni, I. (2013). An ensemble-clustering-based distance metric and its applications, International Journal of Business Intelligence and Data Mining 8(3): 264-287. 
  2. Abdallah, L. and Shimshoni, I. (2014). Mean shift clustering algorithm for data with missing values, 16th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Munich, Germany, pp. 426-438. 
  3. Abdallah, L. and Shimshoni, I. (2016). k-means over incomplete datasets using mean Euclidean distance, 12th International Conference on Machine Learning and Data Mining, New York, NY, pp. 113-127. 
  4. Bai, X., Zhang, M., Wu, Q., Zheng, R., Zhao, H. and Wei, W. (2015). A novel data filling algorithm for incomplete information system based on valued limited tolerance relation, International Journal of Database Theory and Application 8(6): 149-164. 
  5. Clark, P.G., Grzymala-Busse, J.W. and Rzasa, W. (2013). Consistency of incomplete data, 2nd International Conference on Data Technologies and Applications, Marrakech, Morocco, pp. 80-87. 
  6. Clustering datasets (2008). http://cs.joensuu.fi/sipu/datasets/, University of Eastern Finland, Joensuu. 
  7. Dasgupta, S. and Hsu, D. (2008). Hierarchical sampling for active learning, 25th International Conference on Machine Learning, Helsinki, Finland, pp. 208-215. 
  8. Dekel, O., Gentile, C. and Sridharan, K. (2012). Selective sampling and active learning from single and multiple teachers, Journal of Machine Learning Research 13(1): 2655-2697. Zbl06276195
  9. Donders, A.R.T., van der Heijden, G.J., Stijnen, T. and Moons, K.G. (2006). Review: A gentle introduction to imputation of missing values, Journal of Clinical Epidemiology 59(10): 1087-1091. 
  10. Grzymala-Busse, J. and Hu, M. (2001). A comparison of several approaches to missing attribute values in data mining, in W. Ziarko et al. (Eds.), Rough Sets and Current Trends in Computing, Springer, Berlin/Heidelberg, pp. 378-385. Zbl1014.68558
  11. Grzymala-Busse, J.W. (2006). A rough set approach to data with missing attribute values, in J.F. Peters and Y. Yao (Eds.), Rough Sets and Knowledge Technology, Springer, Berlin/Heidelberg, pp. 58-67. 
  12. Hospedales, T.M., Gong, S. and Xiang, T. (2013). Finding rare classes: Active learning with generative and discriminative models, IEEE Transactions on Knowledge and Data Engineering 25(2): 374-386. 
  13. Ibrahim, J.G., Chen, M.-H., Lipsitz, S.R. and Herring, A.H. (2005). Missing-data methods for generalized linear models: A comparative review, Journal of the American Statistical Association 100(469): 332-346. Zbl1117.62360
  14. Lewis, D. and Gale, W. (1994). A sequential algorithm for training text classifiers, 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, pp. 3-12. 
  15. Li, H., Shi, Y., Liu, Y., Hauptmann, A.G. and Xiong, Z. (2012). Cross-domain video concept detection: A joint discriminative and generative active learning approach, Expert Systems with Applications 39(15): 12220-12228. 
  16. Lindenbaum, M., Markovitch, S. and Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers, Machine Learning 54(2): 125-152. Zbl1057.68087
  17. Little, R.J. (1988). Missing-data adjustments in large surveys, Journal of Business & Economic Statistics 6(3): 287-296. 
  18. Little, R.J. and Rubin, D.B. (2014). Statistical Analysis with Missing Data, John Wiley & Sons, Hoboken, NJ. Zbl0665.62004
  19. Lughofer, E. (2012). Hybrid active learning for reducing the annotation effort of operators in classification systems, Pattern Recognition 45(2): 884-896. 
  20. MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations, 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, pp. 281-297. Zbl0214.46201
  21. Magnani, M. (2004). Techniques for dealing with missing data in knowledge discovery tasks, technical report (retrieved 15/01/2007). 
  22. Nowicki, R.K. (2010). On classification with missing data using rough-neuro-fuzzy systems, International Journal of Applied Mathematics and Computer Science 20(1): 55-67, doi: 10.2478/v10006-010-0004-8. Zbl1300.93106
  23. Nowicki, R.K., Nowak, B.A. and Woźniak, M. (2016). Application of rough sets in k nearest neighbours algorithm for classification of incomplete samples, in S. Kunifuji et al. (Eds.), Knowledge, Information and Creativity Support Systems, Springer, Berlin/Heidelberg, pp. 243-257. 
  24. Stefanowski, J. and Tsoukias, A. (2001). Incomplete information tables and rough classification, Computational Intelligence 17(3): 545-566. 
  25. Strehl, A. and Ghosh, J. (2002). Cluster ensembles: A knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3: 583-617. Zbl1084.68759
  26. Tan, M. and Schlimmer, J. (1990). Two case studies in cost-sensitive concept acquisition, 8th National Conference on Artificial Intelligence, Boston, MA, USA, pp. 854-860. 
  27. Turney, P. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research 2(1): 369-409. 
  28. Xu, Z., Akella, R. and Zhang, Y. (2007). Incorporating diversity and density in active learning for relevance feedback, in G. Amati et al. (Eds.), Advances in Information Retrieval, Springer, Berlin/Heidelberg, pp. 246-257. 
  29. Zhang, S., Qin, Z., Ling, C. and Sheng, S. (2005). Missing is useful: Missing values in cost-sensitive decision trees, IEEE Transactions on Knowledge and Data Engineering 17(12): 1689-1693. 
  30. Zhang, Y., Wen, J., Wang, X. and Jiang, Z. (2014). Semi-supervised learning combining co-training with active learning, Expert Systems with Applications 41(5): 2372-2378. 
