Establishing a Security Operation Center is a great way to reduce the risk of cyber attacks damaging your organization by detecting and investigating suspicious events derived from infrastructure and network data. In traditionally heavily regulated industries such as banking, the motivation to establish a SOC is often further complimented by a regulatory requirement. It is therefore no wonder that SOCs have been and still are on the rise. As for In-House SOCs, “only 30 percent of organizations had this capability in 2017 and 2018, that number jumps to over half (51%)” (DomainTools).
But as usual, increased security and risk reduction comes at a cost, and a SOC’s price tag can be significant. Adding up to the cost of SIEM tools are in-demand cyber security professionals whose salaries reflect their scarcity on the job market, the cost of setting up and maintaining the systems, developing processes and procedures as well as regular trainings and awareness measures.
It is only fair to expect the return on investment to reflect the large sum of money spent – that is for the SOC to run effectively and efficiently in order to secure further funding. But what does that mean?
I would like to briefly discuss a few key points when it comes to properly evaluate a SOC’s performance and capabilities. I will refrain from proposing a one size fits all-approach, but rather outline which common issues I have encountered and which approach I prefer to avoid them.
I will take into account that – like many security functions – a well-operating SOC can be perceived as a bit of a black box, as it will prevent large-scale security incidents from occurring, making it seem like the company is not at risk and is spending too much on security. Cost and budget are always important factors when it comes to risk management, the right balance between providing clear and understandable numbers and sticking to performance indicators that actually signify performance has to be found.
The limitations of security by numbers and metrics-based KPIs
To demonstrate performance, metrics and key performance indicators (KPI) are often employed. A metric is an atomic data point (e.g. the number of tickets an analyst closed in a day) while a KPI sets an expected or acceptable range for the KPI to fall into (e.g. each analyst is supposed to close from x – x+y tickets in a day).
The below table from the SANS institute’s 2019 SOC survey conveys that the top 3 metrics used to track and report a SOC’s performance are the number of incidents/cases handled, the time from detection to containment to eradication (i.e. the time from detection to full closure) and the number of incidents/cases closed per shift.
Metrics are popular because they quantify complex matters into one or several simple numbers. As the report states, “It’s easy to count; it’s easy to extract this data in an automated fashion; and it’s an easy way to proclaim, ‘We’re doing something!’ Or, ‚We did more this week than last week!‘“ (SANS Institute). But busy does not equal secure.
There are 3 main issues that can arise when using metrics and KPIs to measure a SOC’s performance:
- Picking out metrics commonly associated with a high workload or -speed does not ensure that the SOC is actually performing well. This is most apparent with the second-most used metric of the time it takes to fully resolve an incident as this will vary greatly depending on the complexity of the cases. Complex incidents may take months to actually resolve (including a full scoping, containment, communication and lessons learned). Teams should not be punished for being diligent where they should be.
As a metric, e.g. the number of cases handled or closed are atomic pieces of information without much context and meaning to it. This data point could be made into a KPI by defining a range the metric would need to fall into to be deemed acceptable. This works well if the expected value range can be foreseen and quantified, as in ‘You answered 8 out of 10 questions correctly’. For a SOC there is no fixed number of cases supposed to reliably come up each shift.
- Furthermore, the number of alerts processed and tickets closed can easily be influenced via the detection rules configuration. While generally the “most prominent challenge for any monitoring system—particularly IDSes—is to achieve a high true positive rate” (MITRE), a KPI based on alert volume creates an incentive to work in an opposite direction. As shown below in Figure 2, more advanced detection capabilities will likely reduce the amount of alerts generated by the SIEM, allowing analysts to spend more time to drill down on remaining key alerts and on complementary threat hunting.
- Lessons learned and the respective improvement of the SOC’s capabilities are rarely rewarded with such metrics, resulting in less incentive to perform these essential activities regularly and diligently.
Especially when KPIs are used to evaluate individual people’s performance and eventually affect bonus or promotion decisions, great care must be taken to not create a conflict of interest between reaching an arbitrary target and actually improving the quality of the SOC. Bad KPIs can result in inefficiencies being rewarded and even increase risk.
Metrics and KPIs certainly have their use, but they must be chosen wisely in order to actually indicate risk reduction via the SOC as well as to avoid conflicting incentives.
Below I will highlight strategies on how to rethink KPIs and SOC performance evaluation.
Operating model-based targets
To understand how to evaluate if the SOC is doing well, it is crucial to focus on the SOCs purpose. To do so, the SOC target operating model is the golden source. A target operating model should be mandatory for each and every SOC, especially at the early stages. It details how the SOC integrates into the organization, why it was established and what it will and will not do. Clearly outlining the purpose of the SOC in the operating model, as well as establishing how the SOC plans to achieve this goal, can help to set realistic and strategically sound measures of performance and success. If you don’t know what goal the SOC is supposed to achieve, how can you measure if it got there?
One benefit of this approach is that it allows for a more holistic view on what constitutes ‘the SOC’, taking into account the maturity of the SOC as well as the people, processes and technology trinity that makes up the SOC.
A target operating model-based approach will work from the moment a SOC is being established. Which data sources are planned to be onboarded (and why)? How will detection capabilities be linked to risk, e.g. via a mapping to MITRE? Do you want to automate your response activities? These are key milestones that provide value to the SOC and reaching them can be used as indicators of performance especially in the first few years of establishing and running the SOC.
Formulating Objectives and Key Results (OKR)
From the target operating model, you can start deriving objectives and key results (OKRs) for the SOC. The idea of OKRs is to define an objective (what should be accomplished) and associate key results with it that have to be achieved to get there. KPIs can fit into this model by serving as key results, but linking them with an objective makes sure that they are meaningful and help to achieve a strategic goal (Panchadsaram).
The objectives chosen can be either project or operations-oriented. A project-oriented objective can refer to a new capability that is to be added to the SOC, e.g. the integration of SOAR capabilities for automation. The key results for this objective are then a set of milestones to complete, e.g. selecting a tool, creating an automation framework and completing a POC.
KPIs are generally well suited when it comes to daily operations. Envisioning the SOC as a service within the organization can help to define performance-oriented baselines to monitor the SOC’s health as well as to steer operational improvements.
- While the number of cases handled is not a good measure of efficiency on its own, it would be odd if a SOC had not even a single case in a month or two, allowing this metric to act as one component to an overall health and plausibility check. If you usually get 15-25 cases each day and suddenly there is radio silence, you may want to check your systems.
- The total number of cases handled and the number of cases closed per shift can serve to steer operational efficiency by indicating how many analysts the SOC should employ based on the current case volume.
To implement operational KPIs, metrics can be documented over a period of time to be analyzed at the end of a review cycle – e.g. once per quarter – to decide where the SOC has potential for improvement. This way, realistic targets can be defined tailored to the specific SOC.
Testing the SOC’s capabilities
While metrics and milestones can serve as a conceptional indicator of the SOC’s ability to effectively identify and act on security incidents, it is simply impossible to be sure without seeing the SOC’s capabilities applied in an actual incident. You would need to wait for an actual incident to strike, which is not something you can plan, foresee, or even want to happen. In reality, some SOCs may never face a large incident. This means that they got very lucky – or that they missed something critical. Which of these is true, they will never know. It is very possible to be compromised without knowing.
Purple teaming is a great exercise to see how the SOC is really doing. Purple teaming refers to an activity where the SOC (the ‘blue team’) and penetration testers (the ‘red team’) work together in order to simulate a realistic attack scenario. The actual execution can vary from a complete surprise test where the red teamers act without instructions – just like a real attacker would – , to more defined approaches where specific attack steps are performed in order to confirm if and when they are being detected.
When you simulate an attack in this way, you know exactly what the SOC should have detected and what it actually found. If there is a gap, the exercise provides good visibility on where to follow up in improving the SOC’s capabilities. Areas of improvement can range from a missing data source in the SIEM to a lack of training and experience for analysts. There is rarely a better opportunity to cover people, processes and technology in one single practical assessment.
It is important that these tests are not being seen as a threat to the SOC, especially if it turns out that the SOC does not detect the red team’s activities. Red teaming may therefore be understood as “a practical response to a complex cultural problem” (DCDC), where an often valuable team-oriented culture revolving around cohesion under stress can “constrain thinking, discourage people from speaking out or exclude alternative perspectives” (DCDC). The whole purpose of the exercise is to identify such blind spots, which – especially when conducted for the first times – can be larger than expected. This may discourage some SOC managers from conducting these tests, fearing that they will make them look bad in front of senior management.
Management should therefore encourage such exercises from an early stage and clearly express what they expect as an outcome: That gaps are closed after a proper assessment, not that no gaps will ever show up. If “done well by the right people using appropriate techniques, red teaming can generate constructive critique of a project, inject broader thinking to a problem and provide alternative perspectives to shape plans” (DCDC).
Conducting such testing early on and on a regular basis – at least once a year – can help improve the SOCs performance as well as steering investments the right way, eventually saving money for the organization. Budget can be used effectively to close gaps and to set priorities instead of blindly adding capabilities such as tools or data sources that end up underused and eventually discarded.
Establishing and running a SOC is a complex and expensive endeavor that should yield more benefit to a company then a couple of checks on compliance checklists. Unfortunately classic SOC metrics are often insufficient to indicate actual risk reduction. Furthermore, metrics can set incentives to work inefficiently and thus waste money and provide a wrong sense of security.
A strategy focused approach on measuring whether the SOC is reaching targets as an organizational unit facilitated by a target operating model complemented by well-defined OKRs and operational KPIs can be of great benefit to lead the SOC to reduce risk more efficiently.
To really know if the SOC is capable of identifying and responding to incidents, regular tests should be conducted in a purple team manner, starting early on and making them a habit as the SOC improves its maturity.