4.8.18

Kubernetes Threat Modelling: Hack more, you're actually making your company safer.

The moral of the story: Play.  Play as much as you can.  Hack as much as you can.    Don't do things the way they've been done before.

Why?

Because, if you're monitoring security scenarios in your staging environments, then you are passively making your company safer, just by playing with new technologies.  This post shows the data to back that claim up, and provides a not-so-subtle nod to our OpsSight Connector product, which is a completely open source platform for securing your cloud native data center by scanning each and every container, in a way that is non-invasive.

Please note the key thing here: a non-invasive threat detection model, for security scanning in staging environments, is critical.  It encourages experimentation over time, in a way that gives you real time oversight of, and insight into, the dynamics of your production threat model.




I'll dive more into what a cloud native threat model actually is later in this post, but the 10,000 foot view is below... The green line is the number of unscanned threats in your cluster over time.  Without loss of generality, the red line is a model of "high" criticality threats, with roughly the same characteristics.  The sum of your low, medium, and high criticality threats tells you how vulnerable your cluster is at any given time.

Diving a little deeper...

In Blackduck's OpsSight connector, we plot realtime scan throughput for all images in a cluster via prometheus.  The overall picture often looks something like this:

"Real world" threat detection metrics.  Scan throughput goes up, and down, over time, leaving different types of gaps in your vulnerability awareness.  In this post we'll cover an abstraction of this system which allows you to estimate, plan, and model your threats in a production data center for cloud native, containerized applications.
A cluster's vulnerability status can be thought of as 100% known once the red line converges with the line at the top...
... However, in this plot, we see that the red line temporarily flatlines (i.e. your scan tool may have been offline, or maybe you had a net split, ...), and during that time you are not actively increasing your scan coverage.  Even once it picks back up, there's a long period of time before it converges with the line at the top (the total # of scans that need to happen).
But even still, note that this is a 'flat' cluster, where no new containers are being introduced.  If new containers are being introduced, there's an 'average' latency for scanning them, which will lead to 'vulnerability windows' that always exist, coming and going at some frequency over time.
This brings us to a "threat modeller" I've recently built in golang, which outputs prometheus metrics about the current threat level in a cluster, with the following configurable elements (sketched as a struct below):


- container churn: assuming this is directly proportional to the # of apps... i.e. if you have 2000 apps, maybe containers will churn at a rate of 1000 a month.
- total app count (in a large cluster, 2000 to 10000 is reasonable).
- container vulnerability probability (low) (roughly 1/9 containers will have these)
- container vulnerability probability (medium) (roughly 1/9 containers will have these)
- container vulnerability probability (high) (1/9 will have these)
- rate of scanning (finding vulnerabilities), reasonably around 100-1000 a day, depending on how you scan.
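
To make those knobs concrete, here is a minimal sketch of how they might look as a Go configuration struct.  The field names and comments are illustrative - they are not necessarily the exact ones used in the vuln-sim repo:

// SimConfig holds the tunable parameters described above.
// Names are illustrative, not the simulator's exact API.
type SimConfig struct {
    TotalApps        int     // 2000 - 10000 for a large cluster
    ContainersPerApp int     // roughly 9 containers per app
    MonthlyChurn     int     // containers replaced per month, proportional to app count
    ProbLowVuln      float64 // ~1.0/9 of containers carry a low criticality vulnerability
    ProbMediumVuln   float64 // ~1.0/9
    ProbHighVuln     float64 // ~1.0/9
    ScansPerDay      int     // 100 - 1000, depending on how you scan
}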

To just get down with the code and run it: 

1) first start a local prometheus:

rm -rf ./data/* ; ./prometheus

2) git clone https://github.com/jayunit100/vuln-sim.git
cd clustersim/
go build ./ &&  ./clustersim

Then open up localhost:9090 and check out the graphs.  Details below... You should ultimately see a chart like this (explained later on).
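
For a sense of how a simulator like this feeds prometheus, here is a minimal sketch of exporting a threat gauge with the prometheus/client_golang library.  The metric name and port are made up for illustration; the real simulator's metric names may differ:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// unscannedVulns tracks how many vulnerable-but-unscanned containers the
// simulated cluster currently has.  The metric name here is hypothetical.
var unscannedVulns = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "sim_unscanned_vulnerable_containers",
    Help: "Vulnerable containers that have not yet been scanned (simulated).",
})

func main() {
    prometheus.MustRegister(unscannedVulns)
    // ... the simulation loop would call unscannedVulns.Set(...) as cluster state changes ...
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":9091", nil)
}

Prometheus then scrapes that /metrics endpoint (assuming it is configured as a scrape target), and the charts you see at localhost:9090 are just queries over those time series.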


Premise

Before moving forward into the stochastic nature of finding vulnerabilities in your clusters, we have to make one thing clear ~ this article assumes that (1) you cannot, at any given time, keep up with the potential number of new images in your cluster, and (2) you never will be able to.

If either of these assumptions is false, then the need to think deeply about the probability of a vulnerability being introduced to your cluster is obviated by the fact that, at any given time, you have the 'god view' of all vulnerabilities.

However, given that an application might have gigabytes of code, libraries, and binaries in it - and that apps continually change the libraries they depend on, and even their base images - you'll likely not ever be in a place where you have such a view.  So you need to create a threat model, which tells you what you may or may not be at risk for.

Thus, we need to talk a little bit about probability before we dive into it.

Distributions

Note: this is just an initial treatment of probabilistic selection of events.  For something more sophisticated, check out RJ's blog post at http://rnowling.github.io/math/2015/07/06/bps-product-markov-model.html.  Markov models, which allow you to select random elements in a more hierarchical fashion, can be even more realistic, but are still ultimately based on probability distributions.

Before we go through the findings from a simulator I've built for cluster vulnerability, let's talk about the backbone of that simulator: the normal distribution.

In a 'normal' environment, where people tend to make similar decisions (either because of social pressure or because there are simply inherent similarities in the way most people perform a given task), picking a random element from a collection to simulate human behavior should be done against a probability distribution... i.e. like this one:


In the above scenario, if the Y axis is "odds of a given developer deploying a given image to your cluster", and the X axis represents the entire spectrum of docker containers available in all registries, then you can see that the "green stuff" is going to be selected most often.  This means the red stuff and the yellow stuff - if they ever have vulnerabilities - will be discovered at a lower overall rate over time.
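
As a sketch of what that selection might look like in Go: popular images sit near the center of the distribution and get picked far more often than the tails.  The mean and standard deviation choices here are assumptions, not the exact values the simulator uses:

package main

import (
    "fmt"
    "math/rand"
)

// pickImage chooses an image index in [0, total) using a normal distribution,
// so a narrow band of "popular" images is selected far more often than the rest.
func pickImage(total int) int {
    mean := float64(total) / 2
    stdDev := float64(total) / 10 // assumption: ~10% of images see most of the reuse
    for {
        i := int(rand.NormFloat64()*stdDev + mean)
        if i >= 0 && i < total {
            return i
        }
    }
}

func main() {
    counts := make(map[int]int)
    for i := 0; i < 100000; i++ {
        counts[pickImage(1000)]++
    }
    // an image near the mean gets picked orders of magnitude more often than one in the tail
    fmt.Println("image 500:", counts[500], "picks; image 10:", counts[10], "picks")
}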

Blackduck's OpsSight Connector is a really powerful tool for securing your openshift or kubernetes environments.  It uses Blackduck APIs to scan *every* container, new or old, in your cluster over time.

However, as we all know... you're never 100% secure, and you need to use your intuition to decide on how aggressively you should tune any security product, including OpsSight itself, for low latency.

So how should you integrate a product like OpsSight into your data center?

The simplest thing, in my opinion, is to take some reasonable statistical realities of your cluster, and run a  simulation of how your threat model changes over time.

From wikipedia:

 Threat modeling is a process by which potential threats, such as structural vulnerabilities, can be identified, enumerated, and prioritized – all from a hypothetical attacker's point of view.

Threat modeling with simulators.

So, what does your vulnerability profile look like in a kubernetes cluster, where you're scanning every image that comes in reactively?

Initially, assuming a normal distribution of images (i.e. there is a large segment of reuse), you see something like this:

Ignore the X axis; we are plotting time series data where milliseconds map to days.  In any case, assuming you scan 100-200 images a day, even as far out as day 70, with a 10% modulation in 100 apps a day (average containers / app = 9), we can see periodic vulnerabilities that pop up and stick around for a short period of time.  These spikes below happen long after the above drop off...


The integral under the below curve is your actual vulnerability over time.

The integral of this metric over a given time series gives you the vulnerability over that time period - i.e.
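
Roughly speaking, if v(t) is the number of unscanned vulnerable containers at time t, the exposure over a window [t_0, t_1] looks something like:

V(t_0, t_1) = \int_{t_0}^{t_1} v(t)\,dt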

This blog post is now officially legit: It has a math thingy in it.






Are there other ways to model which containers are selected?  Yes: you could assume that apps have a completely randomly distributed set of containers.  In that case, you get an entirely different profile for vulnerabilities!

In my simulator, by replacing the "image simulator" based on a normal distribution with a random one, I got a much higher initial vulnerability scenario... see the one on the right ->...
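
The swap itself is tiny - something along these lines (again a sketch, not the simulator's exact code): replace the normal-distribution pick with a uniform one.

// pickImageUniform chooses any image with equal probability, modelling
// developers who experiment broadly instead of reusing a few popular base images.
func pickImageUniform(total int) int {
    return rand.Intn(total)
}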

However: given the same churn, you actually get much less vulnerability in the long run.  I.e., if your developers are more experimental early on, your cluster will be safer in the long run if you're monitoring the whole time.

I feel like this has broader implications than just security: if your developers experiment, innovate, and take risks on a daily basis, it de-risks your products - not only from a security standpoint, but also from a stability, performance, and innovation standpoint, over time.
