Big Health Data Exploration - Analyzing over 20 million death records from the USA with Viscovery SOMine

Visualizing Big Data has always been a challenge for data scientists and analysts. In one of our past projects, we demonstrated how even large datasets can be analyzed quickly and efficiently by focusing on smaller samples.

The Dataset

We worked with a huge dataset containing over 22.5 million death records from the U.S. from 2006 to 2014. The data included various details such as age, gender, education level, and the cause of death. With such a massive amount of information, the challenge was to find a way to analyze it both quickly and accurately.

Our Approach

Instead of processing the entire dataset, we used just 5% as a sample. This sample was enough to create a self-organizing map that closely reflected the patterns in the full dataset. This approach saved a lot of time while still maintaining accuracy. For example, analyzing the 5% sample took just 1.2 hours, whereas processing the entire dataset would have taken 24.5 hours. That is roughly a 20-fold reduction, which matters when quick decision-making is important.
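The sampling idea can be illustrated with a minimal Python sketch. It does not use the actual death records or train a self-organizing map: the cause labels, their weights and the record count are made-up assumptions. It only shows that the category shares in a 5% simple random sample stay close to those of the full data, and it works out the speed-up implied by the two runtimes quoted above.

```python
import random
from collections import Counter

# Synthetic stand-in for the real records: labels and weights are
# illustrative assumptions, not figures from the project.
CAUSES = ["circulatory", "neoplasms", "respiratory", "external", "other"]
WEIGHTS = [0.30, 0.23, 0.10, 0.07, 0.30]


def shares(records):
    """Return the relative frequency of each cause in `records`."""
    counts = Counter(records)
    return {cause: counts[cause] / len(records) for cause in CAUSES}


rng = random.Random(42)  # fixed seed so the run is repeatable
full = rng.choices(CAUSES, weights=WEIGHTS, k=200_000)

# Draw a 5% simple random sample without replacement.
sample = rng.sample(full, k=len(full) // 20)

full_shares = shares(full)
sample_shares = shares(sample)
max_gap = max(abs(full_shares[c] - sample_shares[c]) for c in CAUSES)

# Speed-up implied by the reported runtimes (1.2 h vs. 24.5 h).
speedup = 24.5 / 1.2

print(f"sample size: {len(sample)}")
print(f"largest share difference: {max_gap:.4f}")
print(f"speed-up factor: {speedup:.1f}")
```

Whether a 5% sample is sufficient in practice depends on how rare the categories of interest are; very small groups may need a larger or stratified sample.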

The Benefits

This method allowed us to gain deep insights into the data in a small fraction of the time. Despite the size of the dataset, the analysis remained flexible and interactive. We were able to identify patterns and clusters that might otherwise have been overlooked.

Preprocessing the Data

Before starting the visualization, we had to do some data preprocessing. For instance, we combined different education variables and grouped thousands of ICD-10 codes (which describe health conditions and causes of death) into 30 main categories. This made the analysis clearer and more manageable.
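As a rough illustration of this grouping step, the following Python sketch maps individual ICD-10 codes to broad categories by their chapter range. The five ranges shown follow the standard ICD-10 chapter layout, but they are only an illustrative subset: the 30 categories actually used in the project are not listed here, and the function name `icd10_group` is hypothetical.

```python
import re

# Illustrative subset of ICD-10 chapter ranges. The project's actual
# 30 categories are not given in the text, so this mapping is an assumption.
CHAPTER_RANGES = [
    (("A", 0), ("B", 99), "infectious diseases"),
    (("C", 0), ("D", 48), "neoplasms"),
    (("I", 0), ("I", 99), "circulatory diseases"),
    (("J", 0), ("J", 99), "respiratory diseases"),
    (("V", 1), ("Y", 98), "external causes"),
]

# An ICD-10 code starts with one letter followed by two digits;
# anything after that (with or without a dot) is a subcategory.
CODE_PATTERN = re.compile(r"^([A-Z])(\d{2})")


def icd10_group(code):
    """Map an ICD-10 code such as 'I21.9' or 'I219' to a broad category."""
    match = CODE_PATTERN.match(code.strip().upper())
    if match is None:
        return "invalid"
    key = (match.group(1), int(match.group(2)))
    for low, high, label in CHAPTER_RANGES:
        # Tuples compare letter first, then number, so a range can
        # span several letters (for example C00 to D48).
        if low <= key <= high:
            return label
    return "other"


for code in ["I21.9", "C349", "J44.9", "X70", "D50"]:
    print(code, "->", icd10_group(code))
```

Codes outside the listed ranges fall into "other" here; a complete mapping would cover every chapter explicitly.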

Results and Takeaways

Our approach showed that with Big Data, you don’t always need to spend hours processing and analyzing the data to get valuable insights. A well-chosen sample and the right tools can often give you a solid understanding of the whole dataset and allow you to make informed decisions.

If you're interested in the details of the project, you can find our showcase here.

Combining Viscovery SOMine and Python for outcome prediction in treatment of major depression

[News Viscovery]

Munich (DE) / Melbourne (AU) / Vienna (AT), 6 February 2023

Together with Nicolas Rost and his colleagues from the Max Planck Institute of Psychiatry, we published the article Multimodal predictions of treatment outcome in major depression: A comparison of data-driven predictors with importance ratings by clinicians in the Journal of Affective Disorders.

The aim of this article is to find reliable models to predict treatment outcomes for patients with major depressive disorder. To this end, multiple outcome data were clustered using Viscovery's SOM-Ward algorithm and a consensus clustering approach based on a Python k-medoids algorithm. The resulting cluster models are in good accordance with each other and define useful outcome classes. In a second step, supervised machine learning methods, namely logistic regression and random forest, were used to predict outcome classes based on the patients' baseline assessments.
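For readers unfamiliar with k-medoids, here is a minimal Python sketch of the basic idea: each cluster is represented by one of its own data points (the medoid), and points and medoids are reassigned in turn until nothing changes. This is a simple alternating variant with a farthest-point start, written with the standard library only. It is not the consensus clustering pipeline or the package used in the article, and the two-dimensional points are invented for illustration.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Simple alternating k-medoids; returns (medoid indices, labels)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Deterministic start: first point, then repeatedly the point that is
    # farthest from the medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        farthest = max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        medoids.append(farthest)

    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            best = min(members, key=lambda i: sum(dist[i][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical outcome scores for six patients (two obvious groups).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
medoids, labels = k_medoids(points, k=2)
print("medoids:", medoids)
print("labels:", labels)
```

Because medoids are actual observations, each cluster can be described by a real, representative case, which is one reason the method is popular for clinical data.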

Find the full article at sciencedirect.com

Magic Maps of Winter-Wonder-Land 

Christmas is just around the corner. The days are short and the nights are long. And with a bit of luck, there will even be snow. The crystals that slowly drift from the sky and cover everything in a white blanket only look the same at first glance.

The American Wilson Bentley was fascinated by these white ice formations even as a child, and not just for building snowmen or sledding. What captivated him was the uniqueness and the variety of shapes that snowflakes reveal only under the microscope. In 1885, when he was about 20 years old, he achieved something sensational: he photographed a single snowflake in its purest form. At that time, black-and-white photography was just experiencing its first heyday. There were no high-resolution SLR cameras yet, and the exposure time was between eight and one hundred seconds.

Figure 1: Bentley W. (1902), Studies among the snow crystals during the winter of 1901-02. Monthly Weather Review

It took years of technical difficulties and numerous failures before the first snowflake was photographed. During his lifetime, Bentley photographed another 5,000 snow crystals, no two of which were identical.

Yet snowflakes have one thing in common with us humans: they all look more or less similar. But what does “similar” mean?

To answer this, statistics has developed a whole series of concepts known as “cluster analysis”. Roughly speaking, the idea is that individuals (people, animals, plants, or even snowflakes) should be more similar to each other within a single cluster than to individuals in a different cluster. Similarity is measured as the distance between their characteristics: the smaller the distance, the greater the similarity. For example, two children are more similar in age than a father and his son, but not necessarily in hair colour.

However, such clustering problems can quickly become very complex, especially if you have no idea how many clusters there are, which features are relevant for the clustering and to what extent, and whether the clusters are sharply separated or have rather smooth transitions.
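The example with the children and the father can be written down in a few lines of Python. The ages, hair colours and the two distance functions are invented for illustration; they only show that "similar" depends on which characteristic is measured.

```python
def age_distance(a, b):
    """Distance on a numeric feature: absolute age difference in years."""
    return abs(a["age"] - b["age"])


def hair_distance(a, b):
    """Distance on a categorical feature: 0 if equal, 1 otherwise."""
    return 0 if a["hair"] == b["hair"] else 1


# Hypothetical people.
child = {"age": 6, "hair": "blond"}
son = {"age": 8, "hair": "black"}
father = {"age": 40, "hair": "black"}

# By age, the two children are closer than father and son ...
print(age_distance(child, son), age_distance(father, son))
# ... but by hair colour, father and son are closer.
print(hair_distance(child, son), hair_distance(father, son))
```

A clustering method has to combine such per-feature distances into one overall measure, which is where the choice and weighting of features comes in.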

This is where an AI technique comes into play: the so-called “self-organising map”, or SOM. With a combination of a SOM and classical statistical methods, the hidden patterns in the snow crystals become visible. Such a map of the snow landscape can be generated remarkably fast with Viscovery, an intuitive software package from the Viennese AI company of the same name.
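To give a feel for how a self-organising map works, here is a deliberately tiny Python sketch of the classic online Kohonen algorithm with a one-dimensional chain of five nodes. It is not Viscovery's implementation; the data points, the node count and all training parameters are assumptions chosen for illustration, and the inputs are assumed to be scaled to the range 0 to 1.

```python
import math


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(data, n_nodes=5, epochs=50, lr0=0.5, radius0=2.0):
    """Train a 1-D chain of nodes with the online Kohonen update rule."""
    dim = len(data[0])
    # Deterministic start: nodes spaced evenly along the unit diagonal.
    weights = [[j / (n_nodes - 1)] * dim for j in range(n_nodes)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink over time.
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Nodes near the winner on the chain move more than distant ones.
                h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Two invented groups of points.
data = [(0.10, 0.12), (0.15, 0.08), (0.08, 0.14),
        (0.90, 0.88), (0.85, 0.92), (0.92, 0.86)]
weights = train_som(data)

bmu_a = best_matching_unit(weights, (0.1, 0.1))
bmu_b = best_matching_unit(weights, (0.9, 0.9))
print("node for first group:", bmu_a)
print("node for second group:", bmu_b)
```

After training, the two groups should land on different nodes of the chain. A real map uses a two-dimensional grid and many more nodes, so that neighbouring map positions hold similar items.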

Figure 2: Self-organizing map from Viscovery Software GmbH, https://www.viscovery.net/demos/snow-crystals-classification

As a tribute to the “Snowflake Man” and his breathtaking photographs, Viscovery has compiled a selection of 974 photographs by Wilson Bentley and coded them so that they become machine-readable data sets. Each data set consists of the rotation-invariant image moments up to the third order and the mean amplitudes of 7 different frequency bands. In the SOM, the snowflakes are then grouped into 18 clusters. Clicking on any point in the map shows the snow crystal that best represents that point. Clicking on another point then shows the snow crystal that is visually most similar to the one selected before.
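Why rotation-invariant moments are useful as features can be seen in a small Python sketch. It computes the first two of Hu's moment invariants, which use only second-order moments, for a tiny binary image and for the same image rotated by 90 degrees. Whether the showcase uses Hu's invariants specifically is not stated, so treat the choice of invariants, the function names and the toy image as assumptions.

```python
import math


def hu_first_two(img):
    """Return the first two Hu invariants of a 2-D intensity grid."""
    pixels = [(x, y, v) for y, row in enumerate(img)
              for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in pixels)
    cx = sum(x * v for x, _, v in pixels) / m00  # centroid
    cy = sum(y * v for _, y, v in pixels) / m00

    def eta(p, q):
        # Central moment, normalised so that it is scale-invariant.
        mu = sum((x - cx) ** p * (y - cy) ** q * v for x, y, v in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


def rotate90(img):
    """Rotate the grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


# Toy binary image (an asymmetric L shape).
img = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

original = hu_first_two(img)
rotated = hu_first_two(rotate90(img))
print("original:", original)
print("rotated: ", rotated)
```

Both calls should give the same pair of values, so a snow crystal is described by the same features regardless of how it was oriented under the microscope.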

Figure 3: Photo of a snow crystal under the microscope, https://www.viscovery.net/demos/snow-crystals-classification

We spent quite a bit of time with the snowflake map and realised: there’s a bit of Wilson Bentley in all of us. We now see snow through different eyes: the smallest and most ephemeral things on our planet can hide the greatest miracles.

Have a peaceful Christmas and a happy and healthy New Year 2023!

Click here for the Viscovery Showcase!

This demo shows how photographic images can be ordered in a map with respect to their visual characteristics. In this example, images of snow crystals are ordered based on previous image preprocessing to determine similarity.

Viscovery introduces new major release of its visual data-mining suite

[News Viscovery]

Vienna (AT), 13 December 2022

Data mining specialist Viscovery has released the latest Viscovery® software suite designed to help customers uncover high-value insights in complex data sets. Viscovery SOMine 8 provides new algorithms, an enhanced tool palette of analytical features, numerous usability enhancements and increased performance.

Building on Viscovery’s unique technology, new cluster algorithms provide the best approach to diverse clustering applications. A new classification method is introduced, which comes with an additional workflow. Numerous features for advanced analysis are provided, along with an R interface and improvements in user experience when working with SOMine. Download a summary of the new features from viscovery.net/data-sheets.

Viscovery SOMine version 8.0 is a full major release. Each modular configuration is available as a perpetual license or as a term license for a limited period of one year. In addition to single-user licenses, network licenses are also available, which allow concurrent users to run the software in a local network.

A major novelty of version 8 is the decision to offer the basic Visual Explorer module for free to give users easy access to Viscovery’s core technology. Visit viscovery.net/visual-explorer to download Visual Explorer for free. Licenses for the extension modules can be purchased online at viscovery.net/somine or by contacting sales@viscovery.net.