Big Health Data Exploration - Analyzing over 20 million death records from the USA with Viscovery SOMine

Visualizing Big Data has always been a challenge for data scientists and analysts. In one of our past projects, we demonstrated how even large datasets can be analyzed quickly and efficiently by focusing on smaller samples.

The Dataset

We worked with a dataset of over 22.5 million U.S. death records from 2006 to 2014. The data included details such as age, gender, education level, and cause of death. With this much information, the challenge was to find a way to analyze it both quickly and accurately.

Our Approach

Instead of processing the entire dataset, we used just 5% as a sample. This sample was enough to create a self-organizing map that closely reflected the patterns in the full dataset. This approach saved a lot of time while still maintaining accuracy. For example, analyzing the 5% sample took just 1.2 hours, whereas processing the entire dataset would have taken 24.5 hours, roughly 20 times longer. That’s a big difference, especially when quick decision-making is important.
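To make the sampling idea concrete, here is a minimal Python sketch. It assumes simple random sampling, which this article does not specify, and it runs on synthetic stand-in records rather than the real death data. The names `draw_sample` and `proportions` and the `age_group` field are hypothetical and exist only for illustration. The sketch draws a 5% sample, compares category shares between sample and full data, and computes the speed-up implied by the two timings mentioned above.

```python
import random
from collections import Counter


def draw_sample(records, fraction=0.05, seed=42):
    """Return a simple random sample containing `fraction` of the records."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = round(len(records) * fraction)
    return rng.sample(records, k)


def proportions(records, key):
    """Return the share of records for each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return {value: n / total for value, n in counts.items()}


# Synthetic stand-in data (NOT the real death records).
data_rng = random.Random(0)
records = [
    {"age_group": data_rng.choice(["0-39", "40-64", "65+"])}
    for _ in range(100_000)
]

sample = draw_sample(records)
full_share = proportions(records, "age_group")
sample_share = proportions(sample, "age_group")

# Print full-data share next to sample share for each group.
for group in sorted(full_share):
    print(group, round(full_share[group], 3), round(sample_share.get(group, 0.0), 3))

# Speed-up implied by the reported timings (24.5 hours vs. 1.2 hours).
speedup = 24.5 / 1.2
print(round(speedup, 1))
```

If the shares in the sample stay close to those in the full data, a map trained on the sample can be expected to show similar broad patterns. Rare categories are the usual weak spot of small samples and are worth checking separately.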

The Benefits

This method allowed us to gain deep insights into the data in just a small fraction of the time. Despite the large dataset, the analysis remained flexible and interactive. We were able to identify patterns and clusters that might otherwise have been overlooked.

Preprocessing the Data

Before starting the visualization, we had to do some data preprocessing. For instance, we combined different education variables and grouped thousands of ICD-10 codes (which describe health conditions and causes of death) into 30 main categories. This made the analysis clearer and more manageable.
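The grouping step can be sketched roughly as follows in Python. The mapping below is an assumption for illustration only: it covers just a handful of standard ICD-10 chapter ranges, not the 30 categories used in the project, and the function name `categorize` is hypothetical.

```python
# Illustrative, deliberately incomplete mapping of ICD-10 chapter ranges
# to broad categories. These are NOT the project's 30 categories.
ICD10_GROUPS = [
    ("A00", "B99", "Infectious and parasitic diseases"),
    ("C00", "D48", "Neoplasms"),
    ("I00", "I99", "Diseases of the circulatory system"),
    ("J00", "J99", "Diseases of the respiratory system"),
    ("V01", "Y98", "External causes"),
]


def categorize(code):
    """Map an ICD-10 code such as 'I21.9' or 'I219' to a broad category."""
    # Use the three-character category (letter plus two digits) for the lookup.
    prefix = code.strip().upper().replace(".", "")[:3]
    for start, end, label in ICD10_GROUPS:
        # Plain string comparison works because all prefixes share the
        # same letter-digit-digit layout.
        if start <= prefix <= end:
            return label
    return "Other"  # anything outside the listed ranges


for code in ["I21.9", "C349", "X70", "D50"]:
    print(code, "->", categorize(code))
```

Collapsing thousands of codes into a few dozen labels keeps the cause-of-death attribute readable on a map, at the cost of hiding distinctions within each group.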

Results and Takeaways

Our approach showed that with Big Data, you don’t always need to spend hours processing and analyzing the full dataset to get valuable insights. A well-chosen sample and the right tools can often give you a solid understanding of the whole dataset and allow you to make informed decisions.

If you're interested in the details of the project, you can find our showcase here.