Since the early days of Monzo, we have built our banking platform around a microservices-based architecture, using mostly containerised workloads that can be distributed and redundant. We run our infrastructure on AWS, and as an early adopter of Kubernetes, we have been able to scale a banking platform to more than 20,000 containerised workloads across more than 2,000 microservices.
In this blog post, we’ll talk about the principles our Security Infrastructure team follows to design security in Monzo’s fast-moving engineering environment, how we apply these principles robustly in practice, and how we work with other engineering teams to keep our platform and customers safe.
When engineering teams introduce new infrastructure components into Monzo’s banking platform, our security engineers work with them to perform threat modelling as a collaborative exercise. It’s important to understand new risks, and to agree on the necessary security controls that both keep us safe and continue to allow us to move quickly. We use the STRIDE model paired with OWASP Threat Dragon to draw diagrams and document our models.
Once new infrastructure components are introduced into the platform, security engineers conduct regular threat modelling sessions for larger parts of our platform as a whole. This helps us keep threat models up to date as time passes.
We condense the threat models for various parts of our platform into higher-level risks and necessary security controls, and derive our security design principles from that. Monzo security engineers then in turn follow these principles when advising engineering teams or making key security design decisions.
Don’t trust workloads by default
When we deploy a new microservice, it cannot talk to other services or access the internet by default. It has no access to a database, secrets or other resources on the network. Access is explicitly granted. We have Kubernetes Network Policies for traffic between all our microservices, we apply network and/or application access rules on all shared resources in the private network, and we allowlist egress to the internet on a DNS-name basis and at a microservice level.
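A minimal sketch of what an explicit ingress allowlist could look like, built as a Kubernetes NetworkPolicy manifest in Python. The `app` label, the service names and the `ingress_policy` helper are assumptions made for illustration, not our production configuration.

```python
import json


def ingress_policy(service: str, allowed_callers: list[str]) -> dict:
    """Build a NetworkPolicy manifest that lets only the named callers reach `service`.

    Assumes (hypothetically) that every pod carries an `app` label with its service name.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{service}-ingress"},
        "spec": {
            # Select the pods of the service being protected.
            "podSelector": {"matchLabels": {"app": service}},
            # Declaring the Ingress type means any traffic not matched below is dropped.
            "policyTypes": ["Ingress"],
            # One rule per explicitly allowed caller; an empty list allows nothing.
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": {"app": caller}}}]}
                for caller in sorted(allowed_callers)
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(ingress_policy("service.ledger", ["service.payments"]), indent=2))
```

With an empty caller list the manifest selects the service's pods but allows no ingress at all, which matches the default-deny starting point described above.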
We make sure that these controls scale, both in terms of platform performance and developer productivity. For example, we have more than 2,000 microservices running on our platform. Because our deployment pipeline automatically analyses service code on the main branch to identify legitimate network paths to other services and the internet, there is no more security overhead now than there was in 2019.
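To illustrate the idea of deriving network paths from code, here is a deliberately simplified sketch. It assumes a hypothetical convention in which calls to other services appear as `http://service.<name>/...` URLs and internet calls as `https://<host>/...` URLs; a real analyser would inspect the code far more carefully than a regular expression can.

```python
import re

# Assumed convention (hypothetical): internal calls use http://service.<name>/...
INTERNAL_CALL = re.compile(r"http://(service\.[a-z0-9-]+)/")
# Assumed convention (hypothetical): internet egress uses https://<host>/...
EXTERNAL_CALL = re.compile(r"https://([a-z0-9.-]+)/")


def discover_paths(source: str) -> dict[str, list[str]]:
    """Return the de-duplicated services and internet hosts referenced in `source`."""
    return {
        "services": sorted(set(INTERNAL_CALL.findall(source))),
        "egress_hosts": sorted(set(EXTERNAL_CALL.findall(source))),
    }


SAMPLE_SOURCE = '''
resp := client.Get("http://service.ledger/balance")
more := client.Get("http://service.ledger/entries")
hook := client.Post("https://api.example.com/notify")
'''

if __name__ == "__main__":
    print(discover_paths(SAMPLE_SOURCE))
```

The output of a step like this could then feed the generation of network policies and egress allowlists, so engineers do not have to maintain them by hand.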
For data and secret access, we have templated policies where each service can only access resources namespaced to its own name. A Vault policy, for example, can be templated on the service name.
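As a hedged sketch of that templating, the snippet below renders a Vault policy in HCL from a service name. The `secret/data/<service>/*` path layout, the read-only capability and the naming rule are assumptions for illustration; the real paths and capabilities will differ.

```python
import re
from string import Template

# Hypothetical path layout: each service may only read secrets under its own name.
POLICY_TEMPLATE = Template(
    'path "secret/data/$service/*" {\n'
    '  capabilities = ["read"]\n'
    '}\n'
)

# Assumed naming rule, used to reject names that could escape the namespace.
SERVICE_NAME = re.compile(r"^service\.[a-z0-9-]+$")


def render_policy(service: str) -> str:
    """Render a per-service Vault policy, refusing unexpected service names."""
    if not SERVICE_NAME.match(service):
        raise ValueError(f"unexpected service name: {service!r}")
    return POLICY_TEMPLATE.substitute(service=service)


if __name__ == "__main__":
    print(render_policy("service.ledger"))
```

Validating the name before substitution matters here: a name containing `..` or `*` would otherwise widen the policy beyond the service's own namespace.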
The same approach can be used to assign our services new cryptographic identities via a private public-key infrastructure (PKI), supported by strong hardware security measures.
For the internet-facing traffic we receive, only the load balancers terminating traffic for the backend endpoints they serve have public IP addresses, and firewall rules for subnets and resources make sure that they are the only components in our private network that can receive internet traffic. We are then able to focus on securing just a small set of network ingress paths.
We authenticate connections from Monzo staff to all internal systems and infrastructure using their associated staff profiles before allowing access to any private network interfaces. All access to our internal systems requires multiple authentication factors, which are less likely to be compromised at the same time. We are also mindful of the toil that security measures generate. An example of this is how we improved our staff VPN system using a combination of client credentials secured on laptops and a dynamic factor using mobile devices.
For all network access controls we implement, we also make sure we have reliable visibility of these controls in action. We export metrics and set up alerts to flag situations where the controls might be misbehaving or where there might be indications of an attack, and these feed into our on-call system to make sure that security engineers respond quickly to potential problems.
Automation, automation, automation
We strongly prefer automating infrastructure changes to having engineers make manual changes. Our change management policy dictates that at least one other person must review a change, and if we allowed manual changes from a laptop it would be much harder to enforce that policy.
For example, any changes to AWS have to go through peer review, after which they are automatically applied to our staging environments with Concourse. Once a change is approved, merged and verified, engineers can trigger the same change to apply to production.
We try to provide self-service security as much as possible at Monzo, by giving engineering teams tools and guard rails so they can implement security without us being a blocker. For example, if an engineering team wants to build a new pipeline in Concourse delivering changes for another part of our infrastructure, there are well-defined guidance and templates for them to follow, which will then only require review and approval.
When human intervention is required, usually because of an incident, we have “break glass” tools which allow us to respond faster. We’ll discuss these later in this post.
Technical controls to support security policies
We prefer technically enforceable policies over paper-driven processes. Paper policies tell us what our technical controls should be, but they can’t be relied upon to prevent a security incident. When we say that an action requires multi-party authorisation, we mean technical controls are in place to make sure that the action can’t be performed unless it has been approved. This follows the “policy-as-code” idea.
Our change management policy requires at least one engineer to have approved changes to production, so we’ve made our deployment pipeline require review from the owning teams and the PR to be merged before allowing deployments.
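A minimal policy-as-code sketch of such a deployment gate follows. The `PullRequest` type and `may_deploy` function are hypothetical names; the check simply encodes "merged, and approved by at least one member of the owning team other than the author".

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical view of a pull request as seen by the deployment pipeline."""
    author: str
    merged: bool
    approvers: set[str] = field(default_factory=set)


def may_deploy(pr: PullRequest, owning_team: set[str]) -> bool:
    """Allow deployment only for merged PRs with an independent owning-team approval."""
    # The author's own approval never counts towards the requirement.
    independent_approvals = (pr.approvers - {pr.author}) & owning_team
    return pr.merged and bool(independent_approvals)


if __name__ == "__main__":
    team = {"alice", "bob"}
    print(may_deploy(PullRequest("alice", merged=True, approvers={"bob"}), team))
    print(may_deploy(PullRequest("alice", merged=True, approvers={"alice"}), team))
```

Because the check runs inside the pipeline, the policy cannot be skipped by someone who has simply not read it.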
We’ve scaled this beyond engineering. Sensitive actions by our customer operations team and other parts of the company also require multi-party authorisation.
Log everything and make use of the logs
Every layer of our infrastructure produces audit logs, for example CloudTrail events from AWS and Kubernetes audit events from our Kubernetes clusters. Various network components of our infrastructure also generate their own access logs or traffic logs. Across all audit sources, we log all actions by both human users and service accounts, with the highest level of detail recorded for actions that potentially affect the safety of our customers’ money and data. Components handling data for the pipeline enforce append-only permissions to protect the integrity of audit logs and logging systems.
We have pipelines delivering audit events into a centralised system for detecting potentially suspicious patterns, which sends alerts to our security teams for further investigation. Because we build and operate most of our own event pipelines, we can enrich the stateless audit events that reference specific resources or users with extra contextual information from the infrastructure that is related to their subjects. This information allows our alerting system to make sophisticated decisions and improves the signal-to-noise ratio of our alerts.
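The sketch below illustrates the enrichment idea under simplified assumptions: a stateless event is joined with a hypothetical staff directory, and the added context decides whether an alert is worth raising. The field names, the directory and the alerting rule are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    """Hypothetical context held about a member of staff."""
    team: str
    on_call: bool


DIRECTORY = {
    "alice": UserContext(team="platform", on_call=True),
    "bob": UserContext(team="design", on_call=False),
}


def enrich(event: dict, directory: dict[str, UserContext]) -> dict:
    """Attach context about the acting user to a stateless audit event."""
    ctx = directory.get(event["user"])
    return {
        **event,
        "known_user": ctx is not None,
        "team": ctx.team if ctx else None,
        "on_call": ctx.on_call if ctx else False,
    }


def should_alert(event: dict) -> bool:
    """Example rule: shell sessions are expected only from on-call platform engineers."""
    if not event["known_user"]:
        return True
    if event["action"] == "shell_session":
        return not (event["team"] == "platform" and event["on_call"])
    return False


if __name__ == "__main__":
    raw = {"user": "bob", "action": "shell_session", "resource": "node-1"}
    print(should_alert(enrich(raw, DIRECTORY)))
```

Without the enrichment step, both shell sessions in this example would look identical to the alerting system; the added context is what separates routine activity from something worth investigating.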
The benefit of having a centralised system for analysing events is that we can easily integrate events that are not directly related to actions by people, and process them in the same way as we handle audit events generated from sources such as AWS CloudTrail and Kubernetes.
Finally, when it comes to auditing shell sessions in our production compute instances and Kubernetes workloads, we require more than the usual metadata-level logs on who accessed which server at what time.
Engineers at Monzo make use of extensive automation and internal tooling to ship backend changes safely and more efficiently without infrastructure access. Direct infrastructure access is limited to exceptional situations only, and when we do allow engineers access, they have to go through Teleport as an auditing proxy so that we maintain full visibility over their actions.
Breaking the glass safely
Our banking platform works reliably the vast majority of the time, and so do the infrastructure access control systems running on the platform. But when there’s a problem, our engineers need to raise an incident and fix it as quickly as possible. Our security controls work in a “fail-closed” manner, so if the infrastructure access systems go down with the platform, they won’t expose any more access than when they were up.
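A small sketch of the fail-closed behaviour, assuming a hypothetical `lookup` callable that asks a policy backend for a decision. Any failure to obtain an explicit "yes" results in a denial.

```python
from typing import Callable


class PolicyBackendDown(Exception):
    """Raised by the hypothetical policy backend when it cannot answer."""


def is_allowed(user: str, resource: str, lookup: Callable[[str, str], bool]) -> bool:
    """Fail closed: grant access only on an explicit True from the backend."""
    try:
        return lookup(user, resource) is True
    except Exception:
        # An outage or unexpected error must never widen access.
        return False


def healthy_backend(user: str, resource: str) -> bool:
    return (user, resource) == ("alice", "cluster-admin")


def broken_backend(user: str, resource: str) -> bool:
    raise PolicyBackendDown("policy store unreachable")


if __name__ == "__main__":
    print(is_allowed("alice", "cluster-admin", healthy_backend))
    print(is_allowed("alice", "cluster-admin", broken_backend))
```

The trade-off is availability: when the backend is down, even legitimate users are denied, which is exactly the gap the backup access systems described next are meant to cover.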
However, we do need to let our platform engineers go in and fix the platform when this happens, and to do this we have implemented a series of backup access systems we call “break glass” systems. These systems give us the ability to access infrastructure components and internal tooling in an emergency, and are tested regularly by our engineers to ensure we can rely on them in the event of a major incident.
Break glass systems need to have minimal dependencies on other platform components so they can work reliably in an incident, but this sometimes means that we can’t implement as many controls and logging systems as we would like for the regular access systems. However, we always make sure that when someone uses a break-glass access system, it will:
trigger very loud alerts, and immediately page security on-call engineers
require multi-party authorisation if necessary
provide the same level of audit scope and ability to identify the person as the regular system
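The three properties above can be sketched as follows. The `BreakGlass` class, its in-memory pager and audit log, and the approval rule are hypothetical stand-ins for real paging and logging systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BreakGlass:
    """Hypothetical break-glass gate with in-memory stand-ins for paging and audit."""
    pages: list[str] = field(default_factory=list)
    audit_log: list[dict] = field(default_factory=list)

    def request_access(self, user: str, reason: str,
                       approver: Optional[str] = None,
                       needs_approval: bool = True) -> bool:
        # 1. Always page security on-call, whether or not access is granted.
        self.pages.append(f"BREAK GLASS requested by {user}: {reason}")
        # 2. Require a second person when the action calls for it.
        granted = (not needs_approval) or (approver is not None and approver != user)
        # 3. Record who asked, who approved, and the outcome.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "approver": approver,
            "reason": reason,
            "granted": granted,
        })
        return granted


if __name__ == "__main__":
    gate = BreakGlass()
    print(gate.request_access("alice", "access proxy outage", approver="bob"))
    print(gate.pages)
```

Paging and auditing happen before the decision is returned, so a denied or self-approved attempt is just as visible as a successful one.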
Permissions are least privileged
Internal applications use role-based access controls where each role is scoped to a specific responsibility at the company. As a customer support specialist you might have dozens of roles based on your area of expertise and training, and the same applies to engineers.
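As a small illustration of narrowly scoped roles, the sketch below maps hypothetical role names to the handful of permissions each one grants and checks a request against the roles a person holds. The role and permission names are invented.

```python
# Hypothetical roles, each scoped to one narrow responsibility.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "card-support": {"card:view", "card:freeze"},
    "payments-support": {"payment:view"},
}


def permitted(user_roles: set[str], permission: str) -> bool:
    """Allow an action only if one of the user's roles explicitly grants it."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)


if __name__ == "__main__":
    print(permitted({"card-support"}, "card:freeze"))
    print(permitted({"card-support"}, "payment:view"))
```

Holding many small roles rather than one broad one keeps each grant easy to review and easy to revoke.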
For SaaS products we aim to achieve the same where possible. AWS is one of our main SaaS products, so we write IAM policies that are as tightly scoped as practical for our AWS accounts. However, not all SaaS products have a policy language. For those products we still try to keep the permissions assigned within them as close to least privilege as possible.
We hold access review sessions at a regular cadence, during which each team responsible for a SaaS product or an internal product reviews the users of that system, their roles and the associated policies. The process is highly structured and audited at least once per year.
No shared credentials
We require personal accounts for all staff so we can attribute actions to individuals. Where available, we enforce hardware tokens because of their many security benefits, including the ability to enforce our policy of no credential sharing internally.
This applies to internal applications such as our customer support portal, but also to all of our SaaS products that support a Single Sign-On (SSO) protocol. We use a centralised Identity Provider (IdP) which we configure to always require multiple factors, including a possession factor such as a hardware token device.
This blog post provides an overview of the different principles we follow to protect the lower layers of our banking platform. The Security Infrastructure team is just one of the many teams at Monzo who work on different parts of our banking platform to keep our customers safe.
If you’re interested in building infrastructure as well as security, we are always looking for people who would be a great addition to our team, including backend engineers.