Parallel and Distributed Data Mining: An Introduction
The explosive growth of data collection in business and science has made it necessary to analyze these data and mine useful knowledge from them. Data mining refers to the entire process of extracting useful and novel patterns or models from large datasets. Because of the sheer size of the data and the amount of computation involved, high-performance computing is an essential component of any successful large-scale data mining application. This chapter surveys large-scale parallel and distributed data mining algorithms and systems and serves as an introduction to the rest of this volume. It also discusses the issues and challenges that must be overcome to design and implement successful tools for large-scale data mining.
Author information
Authors and Affiliations
- Mohammed J. Zaki, Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180