Parallel and Distributed Data Mining: An Introduction

The explosive growth in data collection in business and scientific fields has made it necessary to analyze this data and mine useful knowledge from it. Data mining refers to the entire process of extracting useful and novel patterns or models from large datasets. Because of the huge size of the data and the amount of computation involved, high-performance computing is an essential component of any successful large-scale data mining application. This chapter surveys large-scale parallel and distributed data mining algorithms and systems and serves as an introduction to the rest of this volume. It also discusses the issues and challenges that must be overcome in designing and implementing successful tools for large-scale data mining.

Author information

Authors and Affiliations

Mohammed J. Zaki, Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180