Data Science & Security Visualization

Data science and security visualization require the skills described in the Venn diagram: the space where hacking skills, statistical knowledge and domain knowledge meet.

Substantive Expertise – This is the security domain knowledge that enables the security practitioner to understand the data, determine what is expected, and find anomalies or metrics from the visualization.

Hacking Skills – These are the data scientist's technical skills for working with massive amounts of data that must be acquired, cleaned and sanitized.

Math & Statistics Knowledge – This knowledge is critical for choosing the right tools and for understanding the spread and other characteristics of the data in order to derive insight from it. Security practitioners will usually be comfortable with domain knowledge and hacking skills. Statistics is the one aspect they still have to understand, both to gain insight from data and to ask the right questions that lead to the right security visualization.
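As a small illustration of what "understanding the spread" can mean in practice, the sketch below summarizes a list of incident resolution times using only the Python standard library. The data and the function name summarize are hypothetical and exist purely for illustration; they are not taken from any real incident data set.

```python
import statistics

# Hypothetical resolution times (in hours) for closed security incidents.
resolution_hours = [2.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.5, 48.0]


def summarize(values):
    """Return basic measures of centre and spread for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,                     # interquartile range
    }


summary = summarize(resolution_hours)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this made-up sample a single 48-hour incident pulls the mean well above the median, which is the kind of characteristic a practitioner would want to notice before choosing a chart or a metric.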

Security Data Visualization Process 

At a very high level, the security visualization process consists of the following five steps:

Step 1 – Visualization Goals

Step 2 – Data Preparation phase

Step 3 – Exploration phase

Step 4 – Visualization phase

Step 5 – Feedback and fine-tune

The sections below review the activities involved in each step:

Step 1 – Visualization Goals 

It is important to understand the goals and what the team is trying to achieve before jumping into any security visualization. Visualization should be goal driven and use case driven, not data driven. By thinking through and documenting the goals, security analysts start with the end objective in mind, which is essential for capturing the right data and using the right tools for visualization.

Step 2 – Data Preparation 

The process starts with finding the data and preparing it for analysis. The next step is to explore the data with the right questions, then visualize the data to develop insights and act on them. The most important step before starting visualization is data cleansing, that is, making the data available in a usable format. In real life this is the biggest challenge, since the data might be in an incompatible format, some parts may be missing, or there may be other similar problems. This means a good amount of time has to be spent on data cleaning.
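The cleaning problems mentioned above (incompatible formats, missing parts) can be sketched as follows. This is a minimal example under stated assumptions: the record layout, the field names, the timestamp format in TIME_FORMAT and the function clean are all hypothetical, and a real export from a ticketing or SIEM tool will look different.

```python
from datetime import datetime

# Hypothetical raw incident records, as they might arrive from a ticketing export.
raw_records = [
    {"id": "1", "type": " Malware ", "opened": "2023-01-05 09:00", "closed": "2023-01-05 13:30"},
    {"id": "2", "type": "phishing", "opened": "2023-01-06 10:15", "closed": ""},
    {"id": "3", "type": "PHISHING", "opened": "06/01/2023 11:00", "closed": "2023-01-06 12:00"},
    {"id": "4", "type": "", "opened": "2023-01-07 08:00", "closed": "2023-01-07 20:00"},
    {"id": "5", "type": "Insider Threat", "opened": "2023-01-08 09:00", "closed": "2023-01-09 09:00"},
]

# Assumed canonical timestamp format for this sketch.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def clean(records):
    """Split records into usable rows and rejected rows with a reason."""
    usable, rejected = [], []
    for rec in records:
        # Normalize the category so "Malware" and " malware " compare equal.
        incident_type = rec["type"].strip().lower()
        if not incident_type:
            rejected.append((rec["id"], "missing type"))
            continue
        try:
            opened = datetime.strptime(rec["opened"], TIME_FORMAT)
            closed = datetime.strptime(rec["closed"], TIME_FORMAT)
        except ValueError:
            # Empty or differently formatted timestamps end up here.
            rejected.append((rec["id"], "missing or unparseable timestamp"))
            continue
        hours = (closed - opened).total_seconds() / 3600
        usable.append({"id": rec["id"], "type": incident_type, "hours_to_resolve": hours})
    return usable, rejected


usable, rejected = clean(raw_records)
print(f"usable: {len(usable)}, rejected: {len(rejected)}")
for record_id, reason in rejected:
    print(f"record {record_id}: {reason}")
```

Keeping the rejected rows together with a reason, rather than silently dropping them, makes it easier to see how much of the data is unusable and why.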

Step 3 – Explore 

Asking the right questions leads to further exploration and visualization using statistical or probabilistic models and algorithms, and ultimately to useful insights and decisions. The explore phase covers analytical activities that enable security teams to ask the right questions and examine the data to see how they can achieve their goals.

1. “Retrieve Value – Given a set of specific cases, find attributes of those cases.
   o What is the number of security incidents per day due to malware?
   o How long does it take to resolve a security incident?

2. Filter – Given some concrete conditions on attribute values, find data cases satisfying those conditions.
   o Which types of security incidents did not meet the defined Service Level Agreement?

3. Compute Derived Value – Given a set of data cases, compute an aggregate numeric representation of those data cases.
   o What is the average time taken to resolve security incidents?

4. Find Extremum – Find data cases possessing an extreme value of an attribute over its range within the data set.
   o Which office location has the most security incidents?

5. Sort – Given a set of data cases, rank them according to some ordinal metric.
   o Order the security incidents by severity and impact.

6. Determine Range – Given a set of data cases and an attribute of interest, find the span of values within the set.
   o What is the time taken during the various phases of the Cyber Kill Chain during incident response?

7. Characterize Distribution – Given a set of data cases and a quantitative attribute of interest, characterize the distribution of that attribute’s values over the set.
   o What is the distribution of phishing/malware/insider threat incidents?

8. Find Anomalies – Identify any anomalies within a given set of data cases with respect to a given relationship or expectation, e.g. statistical outliers.
   o Are there any outliers in the types of incidents?

9. Cluster – Given a set of data cases, find clusters of similar attribute values.
   o Are there groups of incidents with similar TTPs?
   o Is there a cluster of incidents that take a long time to resolve?

10. Correlate – Given a set of data cases and two attributes, determine useful relationships between the values of those attributes.
   o Is there a trend of increasing time to resolve security incidents?”
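Several of the tasks in the list above map directly onto a few lines of code. The sketch below works through Retrieve Value, Filter, Compute Derived Value, Find Extremum, Sort, Determine Range and Characterize Distribution on a tiny in-memory data set. The incident records, field names and values are hypothetical and chosen only to keep the example readable.

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": 1, "type": "malware",  "office": "London", "severity": 3, "hours": 4.0,  "sla_hours": 8},
    {"id": 2, "type": "phishing", "office": "Austin", "severity": 2, "hours": 12.0, "sla_hours": 8},
    {"id": 3, "type": "malware",  "office": "London", "severity": 4, "hours": 30.0, "sla_hours": 24},
    {"id": 4, "type": "insider",  "office": "Pune",   "severity": 5, "hours": 6.0,  "sla_hours": 24},
    {"id": 5, "type": "phishing", "office": "London", "severity": 1, "hours": 2.0,  "sla_hours": 8},
]

# Retrieve Value: how many incidents were due to malware?
malware_count = sum(1 for i in incidents if i["type"] == "malware")

# Filter: which incidents did not meet their SLA?
sla_misses = [i for i in incidents if i["hours"] > i["sla_hours"]]

# Compute Derived Value: average time taken to resolve an incident.
average_hours = mean(i["hours"] for i in incidents)

# Find Extremum: which office location has the most incidents?
office_counts = Counter(i["office"] for i in incidents)
busiest_office, busiest_count = office_counts.most_common(1)[0]

# Sort: order incidents by severity, most severe first.
by_severity = sorted(incidents, key=lambda i: i["severity"], reverse=True)

# Determine Range: span of resolution times.
hours = [i["hours"] for i in incidents]
hours_range = (min(hours), max(hours))

# Characterize Distribution: how are incidents spread across types?
type_distribution = Counter(i["type"] for i in incidents)

print("malware incidents:", malware_count)
print("SLA misses:", [i["id"] for i in sla_misses])
print("average hours:", average_hours)
print("busiest office:", busiest_office, busiest_count)
print("by severity:", [i["id"] for i in by_severity])
print("hours range:", hours_range)
print("type distribution:", dict(type_distribution))
```

Each of these results is a candidate input for a chart: the type distribution suggests a bar chart, for example, while the SLA misses suggest a filtered table or a highlighted subset.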

Understanding these tasks and applying them across different visualization techniques allows the security analyst to think through the possibilities, arrive at the right question, and identify the results the organization is looking for. These analytical tasks form the foundation for understanding the statistical possibilities that can be used to explore the data.
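Two of the more statistical tasks, Find Anomalies and Correlate, can be sketched in the same spirit. The example below flags outliers with a simple z-score rule and measures a trend with a hand-written Pearson correlation. Both techniques are assumptions chosen for illustration, not a method prescribed here; the data, the threshold of 2.0 and the function names zscore_outliers and pearson are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical daily incident counts; one day is clearly unusual.
daily_counts = [12, 14, 11, 13, 15, 12, 48, 13, 14, 12]

# Hypothetical weekly averages of hours taken to resolve incidents.
weekly_avg_hours = [5.0, 5.5, 6.0, 6.2, 7.0, 7.4, 8.1, 8.0]


def zscore_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [(idx, v) for idx, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Find Anomalies: which days have an unusual number of incidents?
outliers = zscore_outliers(daily_counts)

# Correlate: is time to resolve increasing week over week?
weeks = list(range(1, len(weekly_avg_hours) + 1))
trend = pearson(weeks, weekly_avg_hours)

print("outlier days:", outliers)
print(f"correlation between week number and hours to resolve: {trend:.2f}")
```

A correlation close to 1 between week number and resolution time would support the "increasing time to resolve" question from the list. A z-score rule is only a rough first pass on small samples, where a single extreme value also inflates the standard deviation it is measured against.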

Step 4 – Visualize 

There are two aspects to visualization theory; one is aesthetics. There is literature on how to use color, hue, thickness and other attributes to make images visually pleasing to the intended audience.

Step 5 – Feedback and fine-tune

This step involves continuous improvement based on feedback from the stakeholders and the availability of new data.