Why do we use the cloud?
I mean, generally speaking, if you are lifting and shifting computers from your data center to a cloud service on a one-for-one basis, the major cloud providers are probably more expensive (depending on how you measure your ops team's time), so it shouldn't be about saving money.
There are a few major reasons to use cloud services for your infrastructure.
The first is speed and flexibility: the ability to flex your infrastructure, to buy new machines, build new networks and move things around as fast as you can write code to make it happen (or click buttons in a UI if you are a monster). That flexibility is a business advantage in outmanoeuvring your competition, and it will save your bacon when there are issues or unexpected surges that you can simply flex your way through.
The second is based on the advantages of being on a shared platform, and the security benefits that you get from that. All of the major hyperscale providers have patched underlying Xen, VM or hardware bugs before most customers were even aware that they were a thing. If you were on Google Compute, you probably didn't even care that Spectre or Meltdown were a thing; you were protected.
But as well as that, the platform providers are constantly seeing the traffic and behaviours of the systems they run for you, at scale. That means they get a better idea of what normal looks like, and what abnormal is. They are far more likely to see criminal or malicious activity at scale, the sort of untargeted attacks that affect all of us, and be able to remediate it, or at least tell us when they detect it for us.
AWS rather famously was able to tell customers that they were vulnerable to Heartbleed and, in many cases, block the exploits from reaching them, giving their customers time to fix it.
There are lots of reasons that I'm a cloud infrastructure convert, but those are the two main ones, and to me, for most organisations, those should be no-brainers. There are lots of worries and threats with using cloud, and of course, as referenced below, there are new attacks that can come via the cloud providers' systems themselves, but given how many attack mechanisms they block off, it's almost always worth it.
Looking at the various clusters that the above image ran on showed that most of them run Kubeflow. This fact implies that the access vector in this attack is the machine-learning framework.
The question is how can Kubeflow be used as an access vector for such an attack?
Kubeflow framework consists of many different services. Some of those services include: frameworks for training models, Katib and Jupyter notebook server, and more.
The dashboard is exposed by Istio ingress gateway, which is by default accessible only internally. Therefore, users should use port-forward to access the dashboard (which tunnels the traffic via the Kubernetes API server).
In some cases, users modify the setting of the Istio Service to Load-Balancer which exposes the Service (istio-ingressgateway in the namespace istio-system) to the Internet. We believe that some users chose to do it for convenience.
Lovely discovery of a bug based on scaled attacks on customer infrastructure. When people ask me why we should be using hyperscale providers like Microsoft and AWS, it's things like that that make the value clear.
Note that this attack was detected against tens of customer platforms. Not hundreds, not thousands, tens of them. Even if we are generous and assume that this was around 90 customers who were exploited, how many of those individual customers would have noticed? The use of the platform to detect and remediate common attacks is a big win for the platforms.
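If you want to check your own clusters for the misconfiguration described in that quote, the logic is simple enough to sketch. This is a minimal example, assuming you have already pulled your Service manifests out of the Kubernetes API and parsed them into dicts; the function name and the sample manifests are mine, not Microsoft's. Note that a Service of type LoadBalancer is not always internet-facing (cloud-specific annotations can make it internal), so treat a hit as a prompt to go and look rather than proof of exposure.

```python
INGRESS_NAME = "istio-ingressgateway"
INGRESS_NAMESPACE = "istio-system"


def exposed_ingress_gateways(services):
    """Return "namespace/name" for each Istio ingress gateway Service of type LoadBalancer.

    `services` is a list of Service manifests already parsed into dicts
    (hypothetical input; how you fetch them is up to you).
    """
    flagged = []
    for svc in services:
        meta = svc.get("metadata", {})
        spec = svc.get("spec", {})
        is_gateway = (
            meta.get("name") == INGRESS_NAME
            and meta.get("namespace") == INGRESS_NAMESPACE
        )
        # Kubernetes treats a missing type as ClusterIP, which is internal only.
        if is_gateway and spec.get("type", "ClusterIP") == "LoadBalancer":
            flagged.append(f"{meta['namespace']}/{meta['name']}")
    return flagged


# Two made-up manifests: the default setup and the "convenient" one.
internal = {
    "metadata": {"name": INGRESS_NAME, "namespace": INGRESS_NAMESPACE},
    "spec": {"type": "ClusterIP"},
}
exposed = {
    "metadata": {"name": INGRESS_NAME, "namespace": INGRESS_NAMESPACE},
    "spec": {"type": "LoadBalancer"},
}
print(exposed_ingress_gateways([internal, exposed]))
```

The point is less the ten lines of Python and more that a platform provider can run this sort of check across every cluster it hosts, which is exactly the advantage being described.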
Among the wide range of actors that Microsoft tracks—from digital crime groups to nation-state activity groups—HOLMIUM is one of the most proficient in using cloud-based attack vectors. Attributed to a Middle East-based group and active since at least 2015, HOLMIUM has been performing espionage and destructive attacks targeting aerospace, defense, chemical, mining, and petrochemical-mining industries. HOLMIUM’s activities and techniques overlap with what other researchers and vendors refer to as APT33, StoneDrill, and Elfin.
HOLMIUM has been observed using various vectors for initial access, including spear-phishing email, sometimes carrying archive attachments that exploit the CVE-2018-20250 vulnerability in WinRAR, and password-spraying. Many of their recent attacks, however, have involved the penetration testing tool Ruler used in tandem with compromised Exchange credentials.
These HOLMIUM attacks typically started with intensive password spray against exposed Active Directory Federation Services (ADFS) infrastructure; organizations that were not using multi-factor authentication (MFA) for Office 365 accounts had a higher risk of having accounts compromised through password spray.
Once the group has taken control of the endpoint (in addition to the cloud identity), the next phase was hours of exploration of the victim’s network, enumerating user accounts and machines for additional compromise, and lateral movement within the perimeter. HOLMIUM attacks typically took less than a week from initial access via the cloud to obtaining unhampered access and full domain compromise, which then allowed the attackers to stay persistent for long periods of time, sometimes for months on end.
Spearphishing, password spraying. This is not advanced stuff, but we can see that Microsoft's threat protection team are already on it, and working on detection and defences. Remind me why you want to run your own infrastructure again?
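To give a feel for why this is detectable at all, here is a toy sketch of password-spray detection. It is in no way Microsoft's method; it assumes you have failed-login events as (timestamp, source IP, username) tuples, and the threshold and window are numbers I made up. The idea is that a spray looks like one source failing against many different accounts in a short period, which is the opposite of a user mistyping their own password.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_spray_sources(failed_logins, min_accounts=10, window=timedelta(minutes=30)):
    """Return source IPs that failed against >= min_accounts distinct users within `window`.

    `failed_logins` is an iterable of (timestamp, source_ip, username) tuples.
    The defaults are illustrative assumptions, not recommended values.
    """
    by_source = defaultdict(list)
    for ts, ip, user in failed_logins:
        by_source[ip].append((ts, user))

    suspects = set()
    for ip, events in by_source.items():
        events.sort()
        start = 0
        counts = defaultdict(int)  # username -> failures inside the current window
        for ts, user in events:
            counts[user] += 1
            # Slide the window forward, dropping events that are now too old.
            while ts - events[start][0] > window:
                old_user = events[start][1]
                counts[old_user] -= 1
                if counts[old_user] == 0:
                    del counts[old_user]
                start += 1
            if len(counts) >= min_accounts:
                suspects.add(ip)
                break
    return suspects


# Made-up data: one source trying twelve accounts, one user fumbling a password.
base = datetime(2020, 6, 1, 9, 0)
events = [(base + timedelta(seconds=i), "203.0.113.7", f"user{i}") for i in range(12)]
events += [(base + timedelta(minutes=i), "198.51.100.4", "alice") for i in range(12)]
print(find_spray_sources(events))
```

A real attacker will spread the spray across many source addresses and go slowly, which is where a provider that sees sign-ins across all of its tenants has a view that you simply don't have on your own.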
But where can someone start when tasked with “threat modelling” such a complex system, made up by numerous components? Starting from scratch is probably not really feasible (unless you have countless hours available), so I had a look at what’s already available for public consumption.
So far, three main initiatives (performed by as many independent organisations) took upon the challenge of threat modelling a Kubernetes cluster.
In this blog post, I will summarise the outcome produced by each one of those initiatives, so that anyone can use them as a starting point for their own (custom) threat modelling exercise.
If you are looking at threat modelling cloud platforms, this is essential reading. Regardless of whether you think any of these threat models work, there are several different views on how to threat model a platform, and what the risks or threats actually are. This covers the main ones, and shows you a good example of one that has been done.
If you are doing this yourself, then one of these will make an excellent starting point.
The Naive (and non-secure) Approach: IAM long-term credentials
Export AWS Access Key and Secret Key for some AWS IAM User, and inject AWS credentials into the orchestration job, either as a credentials file or environment variables. Probably not doing this directly, but using Kubernetes Secrets resource protected with RBAC authorization policy. The risk here is that these credentials never expire. They have to be transferred somehow from the AWS environment to the GCP environment, and in most cases, people want them to be stored somewhere so that they can be used to re-create the orchestration job later if required.
The basic idea is to assign AWS IAM Role to GKE Pod, similarly to Workload Identity and EKS IAM Roles for Service Accounts cloud-specific features. Luckily for us, AWS allows to create an IAM role for OpenID Connect Federation OIDC identity providers instead of IAM users. On the other hand, Google implements OIDC provider and integrates it tightly with GKE through Workload Identity feature. Providing a valid OIDC token to GKE pod, running under Kubernetes Service Account linked to a Google Cloud Service Account. All these may come in handy to implement GKE-to-AWS secure access.
This is a problem that hadn't occurred to me, but one I can totally see arising as we move to multi-cloud systems. You want to allow access to resources on one cloud system from another, without opening it to the internet. The obvious answer is exactly as outlined: use static keys, but that makes rotation hard. Using the native OpenID Connect support instead lets you create clear trust mechanisms that are easy to rotate and update.
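For the curious, the AWS side of that trust is mostly a role trust policy. This is a minimal sketch, assuming Google is used as the web identity provider and that the token's subject is the unique ID of the Google Cloud Service Account; the function name and the example ID are placeholders of mine, so check the linked article and the AWS documentation before relying on the details.

```python
import json


def gke_trust_policy(google_sa_unique_id):
    """Build an IAM role trust policy that lets one Google identity assume the role.

    `google_sa_unique_id` is assumed to be the unique numeric ID of the
    Google Cloud Service Account linked to the pod's Kubernetes Service Account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust Google's OIDC identity provider instead of an IAM user.
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only tokens issued for this specific service account are accepted.
                "Condition": {
                    "StringEquals": {"accounts.google.com:sub": google_sa_unique_id}
                },
            }
        ],
    }


# Placeholder ID for illustration only.
print(json.dumps(gke_trust_policy("123456789012345678901"), indent=2))
```

There is no long-lived secret anywhere in that policy, which is the whole point: the pod presents a short-lived token, AWS hands back short-lived credentials, and there is nothing to copy between clouds or forget to rotate.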
Let’s look at AWS Access Keys inside a Lambda function, from how they are populated into the function’s execution context, how long they last, how to exfiltrate them out and use them, and how we might detect compromised access keys.
This is a marvelous deep dive into how your Lambda functions run, what happens before your function is invoked, and how all of the underlying infrastructure is hooked up to your code. Useful if you are considering the security boundaries of your code.
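If you haven't looked before, the keys turn up in the function's environment variables, which is why any code execution bug in your function is also a credential leak. Here is a small sketch of what an attacker (or a curious developer) would see. The variable names and the key prefixes are standard AWS behaviour; the function name and the fake values are mine, and the sketch deliberately avoids printing the secret itself.

```python
import os

# Lambda populates these for the function's execution role.
CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def describe_credentials(env=None):
    """Summarise which AWS credential variables are present, without revealing secrets."""
    env = os.environ if env is None else env
    key_id = env.get("AWS_ACCESS_KEY_ID", "")
    if key_id.startswith("ASIA"):
        kind = "temporary (STS)"  # what a Lambda execution role gives you
    elif key_id.startswith("AKIA"):
        kind = "long-term (IAM user)"
    else:
        kind = "unknown"
    return {
        "present": [name for name in CREDENTIAL_VARS if name in env],
        "kind": kind,
    }


# Fake environment for illustration; inside a real function, call it with no arguments.
fake_env = {
    "AWS_ACCESS_KEY_ID": "ASIAEXAMPLEKEYID0000",
    "AWS_SECRET_ACCESS_KEY": "not-a-real-secret",
    "AWS_SESSION_TOKEN": "not-a-real-token",
}
print(describe_credentials(fake_env))
```

The keys being temporary limits how long a stolen set is useful, but as the article shows, they work perfectly well from outside Lambda until they expire.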
In the previous part we talked about the Azure Stack architecture and mentioned that it can be extended with features that are not part of its core. Using the ability to research cloud components offline, we took this opportunity to research Azure App Service. In this part, we take a deep dive into Azure App Service internals. We examine its architecture, attack vectors, and demonstrate how a critical vulnerability we found in one of its components affected Azure Cloud.
This is a long and very technical article that could do with a good summary.
The short of it is that they found a vulnerability in App Service, which is Microsoft's PaaS implementation, that could allow a malicious tenant to break out of the container and attack the host.
Given that Microsoft offers shared tenancy hosts as one of the options, that means that your applications could be attacked via a compromised underlying host.
This level of attack isn't simple or easy. A typical attacker isn't going to have access to it, and weaponising it would be harder still (ensuring you were on the same host as your target is hard in itself), but it's the first indication that these attacks are possible and findable by security specialists.
There is another view of multi-cloud however, which the industry hasn’t really given a name yet. In this view of the world organisations will use different clouds for different jobs. Different parts of the same organisation are going to make different choices. Mergers and acquisition activity are almost certainly going to render any single vendor cloud strategy moot over a reasonable time frame. And different clouds are frankly really good at different things. AWS primitives are excellent, but if you want to bet on a platform for its data services the natural choice is GCP. Spanner, BigTable, BigQuery, PubSub, is a powerful, well integrated platform from Google. Azure continues to make great strides in key areas, and Microsoft is beginning to make the GitHub advantage count. GitHub will become a dominant player in DevOps and CI/CD over time, and Microsoft’s focus on Developer Experience is unmatched by its competitors. Visual Studio Code is everyone’s favourite editor. CodeSpaces is going to make everything easier. Azure has some great IoT tools. Every cloud platform has its strengths. These strengths will be leveraged. Any major organisation is going to be using at least two of the major cloud platforms, but not for portability reasons, just because that’s the nature of tools choices.
This is a good analysis of the coming multi-cloud solutions for organisations. The idea of using multiple cloud vendors as underlying interchangeable commodity suppliers of virtual machines misses the value that you get from a given cloud provider.
Instead, this is an acceptance of cloud verticals: you might use different cloud providers for different purposes rather than turn them into a bland "fabric". This is what the present looks like. Maybe we need a new name for it. Using Trello, Miro and GitHub doesn't feel like "cloud" as we consider it today, because we restrict cloud thinking to virtual machines and infrastructure rather than services and the higher level abstractions.
Many organizations are looking to modernize their IT investments and transition quickly to the cloud. However, determining how to accredit cloud services can be a challenge. If the organizational model is too laborious or is seen as an obstacle to cloud adoption and cloud-first policies, this can delay the transition to cloud. Understanding the best practices of early cloud adopters and the organizational models that support their accreditation programs helps leaders make well-informed decisions.
“Accreditation Models for Secure Cloud Adoption” provides an overview of three organizational models for cloud accreditation: decentralized, centralized, and hybrid. It differentiates them based on who determines and approves risk decisions. Regardless of the organizational model used, four recommended best practices help cloud adopters balance speed, efficiency, and cost of adoption with security.
I feel like this is addressed directly to government customers, the biggest users of accreditation in their IT systems selection. This walkthrough should be useful for teams who are required to get their platform accredited, and help them use the right language and make the right decisions.
I'll also note that accreditation does come in these different modes, and the paper is worth a read through even for just that insight.
“OK, smart guy. If they’re all wrong, what should CI/CD look like?” asks my imaginary friend after reading my last blog post about how large cloud providers don’t seem to be providing best guidance on secure CI/CD workflows.
This is nice. People like me have a tendency to say why everything is wrong, why other people's solutions suck, so it's nice to see someone coming out and saying what they think it should look like.
I agree with pretty much every step in this workflow. If you are building a new secure CI/CD workflow in 2020 (and why did you leave it so late?) then this is the template you should be following.
Then he stands up straight and waves at me a hitherto unknown part of my sink. “You know what this is?” I naturally answer in the negative. “Neither do I. Time to go see my guy at the plumbing store. He’s the guy who’s seen everything and done everything, the XK-587 or whatever here. Doesn’t matter what I bring in, he knows exactly what to do.”
My expert has his own expert when he doesn’t know something. The guy who’s “seen and done it all” still has someone he rings up, because that’s what experts do, sourcing from upstream and relaying that information back down to another node on the graph. If I’m lucky, I’ll retain enough information to use or maybe even pass along to someone else.
Which brings us back to software engineering. We’re always so surprised when the expert manages to jump to the right config, flip a switch, and set the machine back in order. We assume it’s innate talent, some magic imbued that resides only within them and goes no further. I’m deeply appreciative of expertise, but we hold too many folks up on pedestals, when in truth they have their own gurus who they rely upon. They remember their screws dropped down the drain and ask their authority in plumbing, combined into charting a best path forward.
This rings true for me. I've been lucky enough to work with some seriously smart people in my life, and there are areas where I would turn to them in a heartbeat to understand something, because I know that they'll know, they'll have seen it and they'll be able to help me with minimal brain power needed.
If I'm any good, I'll try to learn from that, make sure that I never have to ask them more than twice for any solution, and try to learn from them every time I have to consult them. Sometimes people tell me that I'm smart, that I know the answers, but having that network, that's really my superpower.