Data and Dimension Reduction

Mapping Financial Stability

Part of the book series: Computational Risk Management

Abstract

Data and dimension reduction techniques hold promise for representing data in easily understandable formats, as their wide range of applications has shown. Data reductions summarize data by compressing information into fewer partitions, whereas dimension reductions provide low-dimensional overviews of similarity relations in the data. These techniques thus provide means for exploratory data analysis (EDA). From a broader perspective, EDA is only one of many approaches to data mining, and knowledge discovery includes data mining as only one of its steps. To provide a holistic view in a top-down manner, we start with the broader concepts and end with discussions of data reductions, dimension reductions, and their combination. As Chap. 5 compares early dimension reduction methods, this chapter also presents the so-called first-generation methods in more detail, including Multidimensional Scaling (MDS), Sammon’s mapping, and the Self-Organizing Map (SOM).
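
As a rough illustration of the two families of techniques (not part of the chapter itself), the following R sketch applies a data reduction (k-means partitioning) and two first-generation dimension reductions (classical MDS via cmdscale and Sammon’s mapping via MASS::sammon) to a generic numeric data matrix; the data set and all parameter choices are arbitrary. A corresponding SOM example appears after the notes below.

  library(MASS)                               # provides sammon()

  X <- scale(as.matrix(unique(iris[, 1:4])))  # any numeric data matrix; duplicate rows
                                              # dropped because sammon() requires
                                              # strictly positive pairwise distances
  D <- dist(X)                                # pairwise Euclidean distances

  km  <- kmeans(X, centers = 3)               # data reduction: three partitions
  mds <- cmdscale(D, k = 2)                   # classical (metric) MDS
  sam <- sammon(D, k = 2)                     # Sammon's nonlinear mapping

  plot(sam$points, col = km$cluster,
       xlab = "Dimension 1", ylab = "Dimension 2",
       main = "Sammon's mapping, coloured by k-means partition")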

The eye, which is called the window of the soul, is the principal means by which the central sense can most completely and abundantly appreciate the infinite works of nature

– Leonardo da Vinci

Notes

  1. There are several software implementations of the SOM. The seminal packages (SOM_PAK, the SOM Toolbox for Matlab, Nenet, etc.) are no longer regularly updated or adapted to their environment. Among the newer implementations, Viscovery SOMine provides the means needed for interactive exploratory analysis. The most recent addition to the list is the interactive, web-based implementation provided by infolytika (http://risklab.fi/demo/macropru/); for a description, see Sarlin (2014a). For a practical discussion of SOM software and an early version of the implementation in Viscovery SOMine, see Deboeck (1998b, a). See also Moehrmann et al. (2011) for a comparison of SOM implementations. The first analyses of this book were performed in the Viscovery SOMine 5.1 package owing to its easily interpretable visual representation and its interaction features, not least when introducing it to practitioners in general and policymakers in particular. Recently, the packages available in the statistical computing environment R have improved significantly, in particular regarding the visualization of SOM outputs. Thus, the final parts of the research in this book, including the figures, have been produced in R; a minimal example with one such package follows these notes. Moreover, the above-mentioned interface by infolytika provides an interactive implementation of the R-based models.

  2. In the literature, learning of the SOM has been defined across the entire spectrum of supervision. For instance, van Heerden and Engelbrecht (2008) define semi-supervised SOMs as similar to supervised ones, except that the class variables are not included in the matching phase (Eq. 4.9), whereas the semi-supervised version herein corresponds to their supervised SOM. However, as the SOM is never fully supervised, we retain only the distinction between an unsupervised and a semi-supervised version; a sketch of the matching step after these notes illustrates the difference.
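
As mentioned in the first note, R now offers well-developed SOM packages. The following minimal sketch uses the kohonen package, which is one commonly used option rather than the specific implementation behind the book's figures; the data set and map size are arbitrary.

  library(kohonen)

  X   <- scale(as.matrix(iris[, 1:4]))                    # any numeric data matrix
  map <- som(X, grid = somgrid(6, 4, "hexagonal"), rlen = 500)

  plot(map, type = "codes")                               # codebook vectors per map unit
  plot(map, type = "dist.neighbours")                     # U-matrix-style distance plot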
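
To make the spectrum of supervision discussed in the second note concrete, here is a self-contained R sketch (illustrative only; the function and argument names are hypothetical, not the book's implementation) of one online SOM training step in which the class block can either enter or be excluded from the matching phase, while it always enters the update. Iterating such steps with decreasing alpha and sigma trains the map; dropping W_y and y altogether gives the plain unsupervised SOM.

  # W_x: units-by-inputs codebook, W_y: units-by-classes codebook,
  # x, y: one observation's input and class vectors,
  # grid_xy: units-by-2 matrix of map coordinates.
  som_step <- function(W_x, W_y, x, y, grid_xy,
                       match_on_class = FALSE, alpha = 0.05, sigma = 1.5) {
    # Matching phase (cf. Eq. 4.9): squared distance over the input block,
    # optionally augmented with the class block.
    d2 <- rowSums(sweep(W_x, 2, x)^2)
    if (match_on_class) d2 <- d2 + rowSums(sweep(W_y, 2, y)^2)
    bmu <- which.min(d2)

    # Gaussian neighbourhood around the best-matching unit on the map grid.
    h <- alpha * exp(-rowSums(sweep(grid_xy, 2, grid_xy[bmu, ])^2) / (2 * sigma^2))

    # Update phase: both codebooks move towards the observation, so class
    # information shapes the map even when it is excluded from matching.
    list(W_x = W_x - h * sweep(W_x, 2, x),
         W_y = W_y - h * sweep(W_y, 2, y))
  }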

References

  • Anand, S., & Buchner, A. (1998). Decision support using data mining. London: Financial Times Management.
  • Baddeley, A., & Logie, R. (1999). Working memory: The multiple-component model. In A. Miyake & P. Shah (Eds.), Models of working memory (pp. 28–61). New York: Cambridge University Press.
  • Barreto, G. (2007). Time series prediction with the self-organizing map: A review. In P. Hitzler & B. Hammer (Eds.), Perspectives on neural-symbolic integration. Heidelberg: Springer-Verlag.
  • Bederson, B., & Shneiderman, B. (2003). The craft of information visualization: Readings and reflections. San Francisco, CA: Morgan Kaufmann.
  • Belkin, M., & Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich, S. Becker & Z. Ghahramani (Eds.), Advances in neural information processing systems (Vol. 14, pp. 586–691). Cambridge, MA: MIT Press.
  • Bertin, J. (1983). Semiology of graphics. Madison, WI: The University of Wisconsin Press.
  • Bezdek, J. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press.
  • Bishop, C., Svensson, M., & Williams, C. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215–234.
  • Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data mining: From concepts to implementation. New Jersey: Prentice Hall.
  • Card, S., Mackinlay, J., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. San Diego, CA: Academic Press.
  • Card, S., Robertson, G., & Mackinlay, J. (1991). The information visualizer, an information workspace. In Proceedings of CHI ’91, ACM Conference on Human Factors in Computing Systems, New Orleans (pp. 181–188).
  • Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA: MIT Press.
  • Chen, L., & Buja, A. (2009). Local multidimensional scaling for nonlinear dimension reduction, graph drawing and proximity analysis. Journal of the American Statistical Association, 104, 209–219.
  • Cottrell, M., & Letrémy, P. (2005). Missing values: Processing with the Kohonen algorithm. In Proceedings of Applied Stochastic Models and Data Analysis (ASMDA 05), Brest, France (pp. 489–496).
  • Cox, T., & Cox, M. (2001). Multidimensional scaling. Boca Raton, Florida: Chapman & Hall/CRC.
  • Deboeck, G. (1998a). Best practices in data mining using self-organizing maps. In G. Deboeck & T. Kohonen (Eds.), Visual explorations in finance with self-organizing maps (pp. 201–229). Berlin: Springer-Verlag.
  • Deboeck, G. (1998b). Software tools for self-organizing maps. In G. Deboeck & T. Kohonen (Eds.), Visual explorations in finance with self-organizing maps (pp. 179–194). Berlin: Springer-Verlag.
  • Demartines, P., & Hérault, J. (1997). Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8, 148–154.
  • Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39(1), 1–38.
  • Dunn, J. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters. Cybernetics and Systems, 3, 32–57.
  • Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han & U. Fayyad (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 96) (pp. 226–231). AAAI Press.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996a). From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 1–34). Menlo Park, CA: AAAI Press / The MIT Press.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996b). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27–34.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996c). Knowledge discovery and data mining: Towards a unifying framework. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR (pp. 82–88).
  • Fekete, J.-D., van Wijk, J., Stasko, J., & North, C. (2008). The value of information visualization. In Information visualization: Human-centered issues and perspectives (pp. 1–18). Springer.
  • Forte, J., Letrémy, P., & Cottrell, M. (2002). Advantages and drawbacks of the batch Kohonen algorithm. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN 02), Bruges, Belgium (pp. 223–230).
  • Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(3), 57–70.
  • Gisbrecht, A., Hofmann, D., & Hammer, B. (2012). Discriminative dimensionality reduction mappings. In Proceedings of the International Symposium on Intelligent Data Analysis (pp. 126–138). Helsinki, Finland: Springer-Verlag.
  • Haroz, S., & Whitney, D. (2012). How capacity limits of attention influence information visualization effectiveness. IEEE Transactions on Visualization and Computer Graphics, 18(12), 2402–2410.
  • Havre, S., Hetzler, B., & Nowell, L. (2000). ThemeRiver: Visualizing theme changes over time. In Proceedings of the IEEE Symposium on Information Visualization (pp. 115–123).
  • Hoaglin, D., Mosteller, F., & Tukey, J. (1983). Understanding robust and exploratory data analysis. New York: Wiley.
  • Jain, A. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.
  • Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
  • Kaser, O., & Lemire, D. (2007). Tag-cloud drawing: Algorithms for cloud visualization. In Proceedings of the Tagging and Metadata for Social Information Organization Workshop, Banff, Alberta, Canada.
  • Keim, D. (2001). Visual exploration of large data sets. Communications of the ACM, 44(8), 38–44.
  • Keim, D., Kohlhammer, J., Ellis, G., & Mansmann, F. (2010). Mastering the information age: Solving problems with visual analytics. Goslar: Eurographics Association.
  • Keim, D., & Kriegel, H.-P. (1996). Visualization techniques for mining large databases: A comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), 923–938.
  • Keim, D., Mansmann, F., Schneidewind, J., & Ziegler, H. (2006). Challenges in visual data analysis. In Proceedings of the IEEE International Conference on Information Visualization (iV 06) (pp. 9–16). London, UK: IEEE Computer Society.
  • Keim, D., Mansmann, F., & Thomas, J. (2009). Visual analytics: How much visualization and how much analytics? SIGKDD Explorations, 11(2), 5–8.
  • Koffka, K. (1935). Principles of gestalt psychology. London: Routledge & Kegan Paul.
  • Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
  • Kohonen, T. (1991). The hypermap architecture. In T. Kohonen, K. Mäkisara, O. Simula & J. Kangas (Eds.), Artificial neural networks (Vol. II, pp. 1357–1360). Amsterdam, Netherlands: Elsevier.
  • Kohonen, T. (2001). Self-organizing maps (3rd ed.). Berlin: Springer-Verlag.
  • Kruskal, J. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.
  • Kurgan, L., & Musilek, P. (2006). A survey of knowledge discovery and data mining process models. The Knowledge Engineering Review, 21(1), 1–24.
  • Lampinen, J., & Oja, E. (1992). Clustering properties of hierarchical self-organizing maps. Journal of Mathematical Imaging and Vision, 2(2–3), 261–272.
  • Larkin, J., & Simon, H. (1987). Why a diagram is (sometimes) worth ten thousand words. Cognitive Science, 11, 65–99.
  • Lee, J., & Verleysen, M. (2007). Nonlinear dimensionality reduction. Information science and statistics series. Heidelberg, Germany: Springer-Verlag.
  • Lin, X. (1997). Map displays for information retrieval. Journal of the American Society for Information Science, 48(1), 40–54.
  • Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1), 702–710.
  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). Berkeley, CA: University of California Press.
  • Moehrmann, J., Burkovski, A., Baranovskiy, E., Heinze, G., Rapoport, A., & Heideman, G. (2011). A discussion on visual interactive data exploration using self-organizing maps. In J. Laaksonen & T. Honkela (Eds.), Proceedings of the 8th International Workshop on Self-Organizing Maps (pp. 178–187). Helsinki, Finland: Springer-Verlag.
  • Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6), 559–572.
  • Pölzlbauer, G. (2004). Survey and comparison of quality measures for self-organizing maps. In Proceedings of the 5th Workshop on Data Analysis (WDA 2004), Sliezsky dom, Vysoké Tatry, Slovakia (pp. 67–82).
  • Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
  • Rubin, D. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley & Sons.
  • Sammon, J. (1969). A non-linear mapping for data structure analysis. IEEE Transactions on Computers, 18(5), 401–409.
  • Sarlin, P. (2014a). Macroprudential oversight, risk communication and visualization. arXiv:1404.4550.
  • Shannon, C., & Weaver, W. (1963). A mathematical theory of communication. Champaign: University of Illinois Press.
  • Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 15(4), 13–19.
  • Shepard, R. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27, 125–140, 219–246.
  • Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages, Boulder, CO (pp. 336–343).
  • Tenenbaum, J., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.
  • Thomas, J., & Cook, K. (2005). Illuminating the path: Research and development agenda for visual analytics. Los Alamitos: IEEE Press.
  • Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419.
  • Treisman, A. (1985). Preattentive processing in vision. Computer Vision, Graphics and Image Processing, 31(2), 156–177.
  • Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
  • Tukey, J. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
  • van der Maaten, L., & Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
  • van Heerden, W., & Engelbrecht, A. (2008). A comparison of map neuron labeling approaches for unsupervised self-organizing feature maps. In Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 2139–2146). Hong Kong: IEEE Computer Society.
  • Venna, J., & Kaski, S. (2006). Local multidimensional scaling. Neural Networks, 19, 889–899.
  • Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM Toolbox for Matlab 5. Technical Report A57, Helsinki University of Technology.
  • Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.
  • Ware, C. (2004). Information visualization: Perception for design. San Francisco, CA: Morgan Kaufmann.
  • Ware, C. (2005). Visual queries: The foundation of visual thinking. In S. Tergan & T. Keller (Eds.), Knowledge and information visualization (pp. 27–35). Berlin, Germany: Springer.
  • Weinberger, K., & Saul, L. (2005). Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1), 77–90.
  • Wismüller, A. (2009). A computational framework for non-linear dimensionality reduction and clustering. In J. Principe & R. Miikkulainen (Eds.), Proceedings of the Workshop on Self-Organizing Maps (WSOM 09) (pp. 334–343). St. Augustine, Florida, USA: Springer.
  • Yin, H. (2008). The self-organizing maps: Background, theories, extensions and applications. In J. Fulcher & L. Jain (Eds.), Computational intelligence: A compendium (pp. 715–762). Heidelberg, Germany: Springer-Verlag.
  • Young, G., & Householder, A. S. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3, 19–22.
  • Zhang, J., & Liu, Y. (2005). SVM decision boundary based discriminative subspace induction. Pattern Recognition, 38(10), 1746–1758.
  • Zhang, L., Stoffel, A., Behrisch, M., Mittelstädt, S., Schreck, T., Pompl, R., et al. (2012). Visual analytics for the big data era—a comparative review of state-of-the-art commercial systems. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), Seattle, WA (pp. 173–182).

Author information

Correspondence to Peter Sarlin.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Sarlin, P. (2014). Data and Dimension Reduction. In: Mapping Financial Stability. Computational Risk Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54956-4_4

  • DOI: https://doi.org/10.1007/978-3-642-54956-4_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54955-7

  • Online ISBN: 978-3-642-54956-4

  • eBook Packages: Business and Economics, Economics and Finance (R0)
