Monitoring and Improving Managed Security Services inside a Security Operation Center

Monitoring and improving the performance of Security Operation Centers (SOC) are becoming crucial due to the emerging need of benefiting from Managed Security Services (MSS) rather than hiring in-house security experts. In this paper, by observing workflows of a real-world SOC, a system consisting of three different modules is designed for monitoring analysts’ activities, analysis performance measurement


Introduction
Advantages of employing Managed Security Services (MSS), such as cost-effectiveness, skilled security experts, appropriate facilities, up-to-date security awareness, and continuous 24-hour service, encourage companies to outsource their security services rather than employ in-house security staff [1]. Network security monitoring (NSM) is an MSS service in which human experts continuously monitor networks, rather than relying solely on installed security appliances. The official definition of NSM is "the collection, analysis, and escalation of indications and warnings to detect and respond to intrusions" [2]. Therefore, in order to provide the NSM service, Managed Security Service Providers (MSSP) deploy various sensors at the client site, such as Intrusion Detection Systems (IDS), to gather suspicious alerts from each client's computer network and send them to the Security Operation Center (SOC). The SOC, as the heart of NSM, then correlates the alerts and has its human security analysts analyze them to confirm whether they are successful exploits. A security incident is detected and confirmed by True Positive (TP) alerts as indications. In case of an incident, the results of the analysis need to be presented to decision makers, in a process called escalation, so that they can react appropriately.
The emerging demand for outsourced security services makes the business world increasingly competitive for MSS providers. Monitoring and improving the performance of the SOC therefore becomes more crucial for managers who want to optimize their resources and improve the quality of their service.
The ability to monitor and improve the performance of security analysts inside the SOC requires measurable performance metrics for human activities. In order to have quantitative performance metrics, we need to carefully analyze analysts' tasks, which mainly form a security investigation workflow, to model their behavior. Modeling the workflow should result in trackable and measurable actions. Task analysis techniques help extract characteristics of human activities, which in turn reveals potential improvement options [3].
Monitoring the SOC performance lets managers know whether they conform to their Service Level Agreements (SLA) with their clients, whether the number of analysts working in the SOC during each work shift is enough to serve all customers appropriately, and how efficiently each analyst is working. Afterwards, with historical performance metrics at hand, managers can identify and assess potential options to improve the current performance.
Challenges faced by the operational SOC include:
• To the best of our knowledge, there is no model of the SOC analysis workflow that elaborates analysts' detailed tasks. By modeling the analysis workflow, we can obtain a clear picture of the analysis steps, allowing the system to track analysts' activities.
• Automated systems for monitoring and evaluating SOC performance are still lacking in existing work. Consequently, there is no clear understanding of SOC capability and of different analysts' performance.
• Studies that simulate potential improvement options of the SOC in order to assess their effectiveness remain neglected.
• A convenient system for knowledge transfer among analysts is still missing. Analysts usually possess different knowledge, since they gain different knowledge during each investigation related to different clients.
The aforementioned challenges prevent managers from prioritizing efforts to improve SOC performance.
There are three main categories of related work (a more detailed review is given in Section 7). The first category discusses different aspects of MSS. Different designs for a SOC architecture have been proposed, such as employing the recognition mechanism of the immune system, cloud-based NSM, and hierarchical mobile-agent-based approaches [4][5][6][7][8][9][10].
Researchers have also studied various aspects of different operational SOCs to compare their functionality [11][12][13]. However, these works are fundamentally different from ours: our study focuses on providing a solution for SOC managers to evaluate and improve SOC capability without modifying the SOC architecture. The second category covers alert correlation techniques, which help to provide more accurate alerts and reduce the rate of False Positives (FP) [14][15][16][17]. The third category covers studies of Call Centers (CC), since SOCs and CCs are similar with regard to performance evaluation. In a CC, operators answer calls in a queue, whereas in a SOC, analysts analyze incoming logs in a queue. Different queueing models are employed in this area to solve staffing, scheduling, and job routing problems [18][19][20][21]. To the best of our knowledge, there is little work related to performance measurement and improvement of a SOC.
In this paper, we propose a system that helps managers by evaluating and monitoring the performance of a SOC, and helps analysts by easing knowledge transfer among them. The designed system consists of a Graphical User Interface (GUI) for managers with three modules: monitoring, measuring, and simulation. Additionally, a background service is proposed that generates feedback for the SOC's analysts about anomalies and informative issues, in order to transfer knowledge among analysts. The monitoring module helps managers to observe the current activities of the SOC's analysts. This module illustrates details of recent investigations that are in progress or were recently completed by the analysts. Managers can check overall and detailed SOC performance with the measuring module. Consequently, they can make decisions about adding new analysts, recognize demanding clients, or optimize the average duration of certain analysis steps. With the simulation module, managers can assess different performance improvement options and see the potential effect of each possible improvement on performance results computed from real production data, without affecting the normal operation of the SOC. The simulation results also provide insight for the development team to improve SOC console applications and provide proof of concept for clients. Analysts benefit from the feedback module, where they get hints about probable next steps. Feedback includes the range of task times and alerts for missing steps. Moreover, the knowledge of one analyst can be transferred to another through the feedback module.
Specifically, the contributions of our work are:
• First, we model the alert analysis workflow based on a real operational SOC.
• Second, we develop a practical system to monitor the analysis workflow, measure analysts' performance based on diverse performance metrics, simulate possible improvement options, and enable knowledge transfer among analysts through the feedback module.
• Finally, we conduct three case studies based on real-life activity logs of analysts over a period of 57 days to show how the performance would be affected by various simulation scenarios.
The rest of the paper is organized as follows. Section 2 describes the background knowledge required to understand this work. Section 3 illustrates our methodology for modeling the analysis workflow and the logging phase of analysts' activities. Section 4 provides an overview of the system's functionality and the methodology employed by each module. Section 5 demonstrates the implementation of the whole system. Section 6 evaluates three case studies to assess different options for improving SOC performance. Section 7 reviews related work. Section 8 concludes our work and addresses future work.

Preliminaries
In this section, the operational SOC workflow and its related characteristics and notations are explained to provide background knowledge. Our study is based on a real operational SOC. Besides illustrating SOC characteristics, a brief description of the designed system and related definitions are given to show how the proposed system functions alongside the main SOC workflow.
Figure 1 illustrates a sample deployment model from an infrastructure perspective. The operational SOC uses a Virtual Private Network (VPN) to connect the client site to the SOC. Sensors can be placed at various locations in the client network, such as outside the firewall facing the Internet, behind the firewall inside the demilitarized zone (DMZ), which contains the internal points of the client's network that are accessible from outside, or in the local network behind the firewall.
Analysts rely on the SOC console to review the related alerts of each client and conduct the analysis. The workflow of alert analysis starts from the SOC console and consists of tasks such as receiving alerts, cross-referencing the network map of a client, examining a

List of investigation types related to the incident categories
There are eight investigation types (an investigation is the analysis, by an analyst, of sensor-generated alerts) in two main categories from the SOC perspective: security-related incidents and policy violation incidents. Table 1 lists the different investigation types.
For each investigation type, multiple approaches may exist. We collect all possible analysis approaches in one integrated model consisting of different investigation paths. The complete investigation workflow modeling phase is described in detail in Section 3.1.
Sub-steps in each step are defined as actions. Actions are trackable points for each step performed by analysts. They correspond to single mouse clicks on the SOC console, listed in Table 3. To avoid discussing unnecessary details, we mostly focus on steps rather than actions throughout the paper.

Modeling and Logging the Investigation Workflow
In this section, we present the modeling phase of the investigation workflow in Section 3.1, followed by the logging phase of analysts' activities in Section 3.2.

Modeling the Investigation Workflow
The modeling of an investigation workflow is performed in two phases. The first phase gathers expert knowledge from analysts to identify different investigation types and relevant approaches. To model the different tasks (steps) and their relationships, a UML activity diagram is employed. The second phase visualizes the model for the designed system. Graphviz V.2.38 [22] is employed to lay out the activity diagram that represents the investigation workflow model in the system.
Figure 2 demonstrates one integrated model. Each node is labeled with the combination of step ID and action ID (step ID: action ID). A sequence of nodes from the first node to the last node of the model forms an investigation path. The logical relationship among different nodes of the model can be AND or OR: all nodes on a single investigation path have an AND relationship, whereas nodes on different investigation paths have an OR relationship. The AND relationship implies mandatory nodes. If an analyst misses a mandatory node between the previous and next node of a path, the feedback module warns the analyst about the missing steps. The feedback module is discussed in detail later in the paper.
In our system, we use Graphviz [22] to generate the investigation models and output them in the DOT file format. The DOT files are then used in our system as the standard against which analysts' activities are tracked. Since preparing activity diagrams with Graphviz requires writing code in a special format, we introduce an open-source extension for Microsoft Visio called GraphVisio [23] to simplify this procedure. With this extension, analysts draw activity diagrams in Visio and then export their model directly to a DOT file.
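As a minimal sketch of how such a model could be held in memory, the following Python fragment represents the model as a directed graph and enumerates its investigation paths. The edge list, the Start and End node names, and the function names are illustrative assumptions, not the model of Figure 2; nodes on one enumerated path correspond to the AND relationship, and the alternative paths correspond to the OR relationship.

```python
from collections import defaultdict

# Hypothetical edges of a small investigation model. Node labels follow the
# "StepID:ActionID" convention; "Start" and "End" are assumed terminal nodes.
EDGES = [
    ("Start", "A:130"), ("A:130", "A:105"),
    ("A:105", "C:104"), ("A:105", "E:119"),
    ("C:104", "E:119"), ("E:119", "End"),
]

def build_graph(edges):
    """Turn an edge list into an adjacency mapping node -> successors."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph

def investigation_paths(graph, node="Start", prefix=()):
    """Yield every path from `node` to End (assumes an acyclic model)."""
    prefix = prefix + (node,)
    if node == "End":
        yield prefix
        return
    for successor in graph.get(node, []):
        for path in investigation_paths(graph, successor, prefix):
            yield path

paths = list(investigation_paths(build_graph(EDGES)))
for path in paths:
    print(" -> ".join(path))
```

In practice the edge list would be read from the exported DOT file rather than written by hand.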

Logging Analysts' Activities
Based on the model extracted in Section 3.1, each task is defined by different actions, which are single mouse clicks on the SOC console. These mouse clicks are logged automatically in our system. Each log represents one single action performed by an analyst through the SOC console. Our data gathering process follows the extract, transform, load (ETL) process; this section mainly focuses on data extraction, and Section 6.1 completes the data processing. Each log row consists of eleven attributes, which are described in Table 4. For example, the start and end time of an action are stored as two attributes, TimeStart and TimeEnd. Based on the actions identified through the SOC console, the logging script captures when actions start and end. The start time of an action is bound to a specific button, and the end time of an action is bound to the start time of the next action recognizable by the system. In this way, no time in the middle of an investigation is skipped when the analyst uses other tools that are not monitored by our logging system. The end time of the last action is bound to a specific button that closes the investigation. A tuning step is required to adjust the logs during the pre-processing phase before storing them in the database, which is demonstrated in Section 6.1.
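The rule that binds each action's end time to the start of the next recognized action can be illustrated with a short sketch. The function name, the dictionary layout, and the sample timestamps are assumptions made for illustration; only the TimeStart and TimeEnd attribute names come from the text.

```python
from datetime import datetime

def assign_end_times(actions, close_time):
    """Fill in TimeEnd for the chronologically sorted actions of one investigation.

    Each action ends when the next recognized action starts; the last action
    ends at `close_time`, the click that closes the investigation.
    """
    completed = []
    for index, action in enumerate(actions):
        if index + 1 < len(actions):
            end = actions[index + 1]["TimeStart"]
        else:
            end = close_time
        completed.append(dict(action, TimeEnd=end))
    return completed

# Made-up sample data, not taken from the paper's logs.
example = assign_end_times(
    [
        {"StepID": "A", "ActionID": 130, "TimeStart": datetime(2020, 1, 1, 9, 0, 0)},
        {"StepID": "A", "ActionID": 105, "TimeStart": datetime(2020, 1, 1, 9, 0, 30)},
    ],
    close_time=datetime(2020, 1, 1, 9, 2, 0),
)
```

Because every interval ends exactly where the next begins, time spent in unmonitored tools is attributed to the preceding action instead of being lost.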
As shown in Table 4, each row of the activity logs has a StepID and an ActionID, by which the row is matched to the investigation model. Therefore, by extracting StepID:ActionID from a log, we can map the log to a node of our model. By tracking analysts' activities, our system is able to understand the progress of each investigation, such as which step the analyst is currently working on or what the next action is. By mapping all logs of one investigation to the model, the system can also identify missing steps and alert analysts.
Table 5 shows a sequence of logs belonging to one investigation, containing three steps and eight actions (A:130 → A:105 → C:104 → C:110 → A:107 → E:119 → C:112 → E:121). This investigation path is illustrated in the model shown in Figure 2, as specified by the double boxes. By comparing the combined key of each log, which is StepID:ActionID, the system can map the log to a node of the model (that is, which step and action the analyst is performing). By looking at all logs and mapping them to the related nodes in the investigation model, we can see which path is followed by the analyst. Furthermore, by identifying the path followed by an analyst in the model, the feedback module can notify analysts about probable next steps or missing steps (discussed in Section 5.4).
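The mapping from log rows to model nodes can be sketched as follows. The first path is the sequence quoted from Table 5; the second path, the function names, and the row layout are hypothetical and serve only to show how an observed prefix narrows down the candidate paths.

```python
def node_key(row):
    """Compose the StepID:ActionID key used to look up a node in the model."""
    return "{}:{}".format(row["StepID"], row["ActionID"])

def followed_paths(logs, model_paths):
    """Return the model paths whose beginning matches the observed logs."""
    keys = [node_key(row) for row in logs]
    return [path for path in model_paths if list(path[:len(keys)]) == keys]

TABLE5_PATH = ["A:130", "A:105", "C:104", "C:110", "A:107", "E:119", "C:112", "E:121"]
ALTERNATIVE_PATH = ["A:130", "A:105", "E:119", "E:121"]  # hypothetical second path
MODEL_PATHS = [TABLE5_PATH, ALTERNATIVE_PATH]

observed = [
    {"StepID": "A", "ActionID": 130},
    {"StepID": "A", "ActionID": 105},
    {"StepID": "C", "ActionID": 104},
]
candidates = followed_paths(observed, MODEL_PATHS)
```

After the first two logs both paths are still possible; the third log rules out the alternative path.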

Methodology
In this section, we elaborate on the system modules and their functionalities. Then, we describe the methodology and detailed implementation of each module.

Overview of The System
The designed system assists both managers and analysts as a value-added component of the SOC console. The general workflow of the designed system is shown in Figure 3, where activity logs are gathered from analysts' machines and stored in the database. The main GUI, which includes the monitoring, measuring, and simulation modules, continuously reads the database and provides different capabilities to the managers. The feedback module works as a background service to give feedback to analysts in real time.
Generally, the monitoring module tracks and monitors analysts' activities. The measuring module allows managers to check the SOC's performance with the diverse metrics provided by the system. Furthermore, customizable Online Analytical Processing (OLAP) analysis is integrated into the system to provide more detailed analysis performance results. The simulation module facilitates evaluating improvement options through two different approaches: first, studying the impact of analysis duration by modifying the duration of analysis steps; second, applying a different queueing model and alert dispatching approach to evaluate the overall performance. The feedback module assists analysts of the SOC by notifying them with different information through the Microsoft Windows operating system tray while they are performing investigations through the SOC console.
Monitoring Module. Figure 4 shows the monitoring module of the managers' GUI. This module presents detailed information about the SOC's ongoing investigations in an easy-to-understand way; its main purpose is to visualize them. The GUI keeps only the investigations started within a predefined period (e.g., 30 minutes).
The y-axis shows the different investigations and the x-axis shows the corresponding investigation durations. Each
Measuring Module. We propose the measuring module in order to have a better understanding of the performance of the SOC. Qualitative measurement is usually ignored, as it is difficult or expensive to scale. However, in our system, the measuring module provides managers with quantitative measurements of the SOC performance.
To the best of our knowledge, this work is one of the first studies of SOC analysis performance in the literature. Jacobs et al. [12] examine different aspects of three real-world operational SOCs. The performance metric in their work only counts the number of analyzed incidents per analyst per day. However, counting incidents alone, without taking the duration of the analysis into consideration, is not a good metric, because the effort spent on difficult investigations is not properly captured. As a result, analysts are not motivated to analyze carefully, since careful analysis makes them appear less productive from the managers' point of view. In a university SOC assessed in the same work, a ticketing system that dispatches alert tickets to analysts provides a performance metric, namely the time spent on each ticket; however, a detailed study is missing. To address the limitations of existing work, our measuring module is designed to provide various performance metrics for evaluating SOC behavior.
We now enumerate the metrics provided by the measuring module. Important results, such as the maximum values, are shown in the main GUI, while detailed analyses are reachable through different buttons. Figure 5 shows the measuring module dashboard, with two detailed analysis reports open in separate windows.
The first group of metrics provides the average analysis time from different perspectives. More specifically:
• Total average analysis time.
• Average analysis time of each investigation type/analyst/client.
• Average analysis time of different analysts for the same investigation type.
• Average analysis time of different analysts for the same client.
The second group of performance metrics concentrates on the average duration of different investigation steps. This group shows the average duration of steps in general and for different analysts and clients. More specifically:
• Ranking of analysis steps according to the time spent executing them.
• Average analysis time of different analysts for the same step.
• Average analysis time of different clients for the same step.
The third group shows maximum values for selected performance metrics in the managers' GUI. This group helps managers to determine whether maximum values differ significantly from historical average durations. For example:
• Which investigation type takes the most time to be analyzed.
• Which step of which investigation type takes the most time to be analyzed.
• Which analyst has the highest average time for which investigation type.
• Which client has the highest average time for which investigation type.
• Which analyst has the highest average time for which client.
The fourth group is the number of created and updated incidents for different parameters. For example:
• The number of created and updated incidents for each investigation type.
• The number of created and updated incidents for each analyst.
• The number of created and updated incidents for each client.
OLAP Component. An OLAP tool is employed alongside the designed measuring module inside the managers' GUI. OLAP empowers managers to create new analysis queries by dragging and clicking with the mouse instead of modifying code or writing complicated SQL queries. An open-source web-based OLAP engine, Community Edition Saiku (CE Saiku) [24], is integrated into the GUI to provide customized data analysis. CE Saiku is implemented with the Pivot4J Java API using the Mondrian OLAP server. We integrate web-based Saiku into the managers' desktop GUI by embedding a browser. The Saiku configuration for SOC performance metrics is discussed with an example in Section 5.2.
Simulation Module. The simulation module is designed to show the effects of potential changes to the investigation workflow. It simulates the effect of a change on real activity logs and recalculates the performance metrics. It helps managers to prioritize efforts on potential improvements of the SOC. Two potential changes are provided as simulation options: one modifies the duration of current steps, and the other changes the dispatching method of alerts.
The first simulation capability modifies the duration of specific steps by a specified percentage to see how different performance metrics would be affected. This simulation helps managers to find out whether optimizing one specific step is the best option for reducing the time taken by that task. By assessing optimization options for each step, the manager decides which step durations to modify and by what percentage. An optimization option can be a possible automation of a specific step to reduce analysts' tasks. The reduction percentage is the manager's prediction of the result of that potential change in the investigation workflow. For example, if one step is going to be automated completely, the duration of that step should be removed completely (the reduction percentage is 100%). The simulator considers all historical investigations in the database that contain the specific step, modifies the durations of all of its sub-steps (actions) by the given percentage, and recalculates all performance metrics based on the manager's assumption.
The second simulation capability simulates the dispatching of incoming alerts among analysts with a different queueing model. Different ways of dispatching services (alerts) among servers (analysts) are usually studied as queueing models [25]. In our work, a different alert dispatching method is simulated to assess the effect of the employed approach on the SOC performance metrics.
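The first simulation capability, scaling the durations of one step by a predicted reduction percentage and recomputing a metric, could be sketched as below. The function names, the row layout with numeric TimeStart and TimeEnd values in seconds, and the sample rows are assumptions for illustration; the system described here performs this calculation with SQL queries on the database.

```python
def simulate_step_reduction(logs, step_id, reduction_pct):
    """Return per-action durations with one step shortened by reduction_pct percent."""
    factor = 1.0 - reduction_pct / 100.0
    simulated = []
    for row in logs:
        duration = float(row["TimeEnd"] - row["TimeStart"])
        if row["StepID"] == step_id:
            duration *= factor  # 100% removes the step's duration entirely
        simulated.append({"InvID": row["InvID"], "StepID": row["StepID"],
                          "Duration": duration})
    return simulated

def average_investigation_time(rows):
    """Average total duration per investigation."""
    totals = {}
    for row in rows:
        totals[row["InvID"]] = totals.get(row["InvID"], 0.0) + row["Duration"]
    return sum(totals.values()) / len(totals)

# Made-up logs: two investigations, times in seconds.
LOGS = [
    {"InvID": 1, "StepID": "A", "TimeStart": 0, "TimeEnd": 100},
    {"InvID": 1, "StepID": "C", "TimeStart": 100, "TimeEnd": 300},
    {"InvID": 2, "StepID": "A", "TimeStart": 0, "TimeEnd": 200},
]
baseline = average_investigation_time(simulate_step_reduction(LOGS, "C", 0))
halved = average_investigation_time(simulate_step_reduction(LOGS, "C", 50))
automated = average_investigation_time(simulate_step_reduction(LOGS, "C", 100))
```

The original logs are never modified, which mirrors the requirement that the simulation leaves production data untouched.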
The results of both simulation scenarios are shown side by side with the real ones (measuring module) in the GUI to ease comparison for managers. Figure 6 shows an example of the simulation results beside the measuring module. For instance, as we see in Figure 6, the simulation reduces the average analysis time from 6:41 to 5:53. According to the measuring module, the most time-consuming investigation type is PV with an average of 10:17; the simulation reduces this to 6:03, although the most time-consuming investigation type does not change. Moreover, we can easily see that the largest average duration for an investigation type belongs to AnalystID 2 for PV, with an average of 45:05, and that the simulation changes it to AnalystID 4 for MI, with an average of 11:42.
Feedback Module. The feedback module is designed for knowledge transfer among SOC analysts. Knowledge can be a hint about the next action required to proceed in the investigation, a notification about anomalous durations and missing steps, or the results of previous investigations of a similar type performed by other analysts. Analysts are first notified of this information in the Windows operating system tray, and a summary of all notifications is accessible in a desktop GUI.
The feedback module runs continuously as a background service while analysts' activity logs are being stored in the database. It keeps reading the logs from the database and mapping them to the investigation model. As a result, the system can notify analysts about probable next steps, find anomalies, and warn analysts about them. The module reminds the analyst of the probable next step based on the step currently being performed. Anomaly notifications concern abnormal step durations and missing actions. For instance, if the analyst takes 50% (a configurable threshold) less or more time than the historical average duration of the step, a warning is issued. If the analyst's activities do not match one of the investigation model's paths, the analyst receives a warning about the missing steps.
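The duration check described above can be expressed in a few lines. This is only a sketch: the function name, the return values, and the choice to treat a duration exactly on the boundary as normal are assumptions, and the historical average is taken as a given input.

```python
def duration_anomaly(duration, historical_avg, tolerance=0.5):
    """Classify a step duration against the historical average.

    `tolerance` is the configurable fraction (0.5 means 50%) by which a
    duration may deviate from the average before a warning is raised.
    """
    lower = historical_avg * (1.0 - tolerance)
    upper = historical_avg * (1.0 + tolerance)
    if duration < lower:
        return "too short"
    if duration > upper:
        return "too long"
    return None  # within the accepted range, no warning

# With a historical average of 100 seconds the accepted range is 50 to 150.
print(duration_anomaly(40, 100))
print(duration_anomaly(160, 100))
print(duration_anomaly(100, 100))
```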
Moreover, once an analyst starts an investigation, the feedback module shows the activity logs of previous investigations of the same type by different analysts. This feature provides background knowledge from other analysts' approaches with different clients, or the results of similar investigations for the same client. The results can be filtered by a specific client to show whether the client recently had the same event type, what the result was, which source or destination IP addresses were involved, and so on. The main investigation approaches adopted by analysts are the same. However, analysts' knowledge improves over time with experience and with better knowledge of clients' environments. Moreover, the feedback module shows the results of investigations, which can help the next analyst draw inferences about similar situations, for instance when the same event is generated for the same source and destination IP addresses and was a false positive before. Knowing the client's history with this specific event and the result of the previous analyst's investigation, the next analyst can act faster.
The feedback module is shown in Figure 7. As shown, different hints are provided to the analyst. Analysts receive notifications about their ongoing investigations in real time in the Windows operating system tray, and all notification reports are accessible through the GUI.

Implementation
In this section, the methodology and implementation details of each module are discussed. All modules are implemented in the Python 2.7 programming language using existing libraries.

Monitoring Module
Visualization in the monitoring module is implemented with the Matplotlib package [26] in Python. In each update, the module fetches the StepID and ActionID from the last log of the investigation and composes the mapping key. Then, it maps the key value to the relevant node of the investigation model. By traversing the investigation model, it can recognize the next probable action. If the next node is End, it shows that the investigation is completed; otherwise it shows the next probable action and step. If the predicted steps contain multiple possible actions, a black step is displayed on the GUI; otherwise a white step is shown.
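The update logic of the monitoring module might look roughly like the following sketch, which leaves out the Matplotlib drawing. The adjacency mapping, the End node name, and the returned dictionary are illustrative assumptions rather than the actual data structures of the system.

```python
# Hypothetical fragment of an investigation model: node -> possible next nodes.
MODEL = {
    "A:105": ["C:104", "E:119"],
    "C:104": ["E:119"],
    "E:121": ["End"],
}

def monitoring_status(model, step_id, action_id):
    """Derive what the GUI should show from the last log of an investigation."""
    key = "{}:{}".format(step_id, action_id)
    successors = model.get(key, [])
    if successors == ["End"]:
        return {"completed": True, "next": [], "color": None}
    # Black marks several possible next actions, white a single one.
    color = "black" if len(successors) > 1 else "white"
    return {"completed": False, "next": successors, "color": color}

print(monitoring_status(MODEL, "A", 105))
print(monitoring_status(MODEL, "E", 121))
```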

Measuring Module
The measuring module is designed to provide SOC performance statistics with different metrics, which are described in Section 4.1. Because the investigation workflow is measurable, performance metrics can be calculated by running SQL queries on the database. The duration of an investigation can be measured from the analysts' activity logs and their attributes. In our implementation, we use PostgreSQL [27] to support standard SQL queries.
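To give a feel for such a query, the sketch below computes the average analysis time per investigation type. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server; the table name, the InvType and AnalystID columns, and the numeric time values in seconds are assumptions, and only TimeStart, TimeEnd, and InvID are attribute names mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_log ("
    "InvID INTEGER, AnalystID INTEGER, InvType TEXT, TimeStart REAL, TimeEnd REAL)"
)
# Made-up rows: one row per logged action, times in seconds.
conn.executemany(
    "INSERT INTO activity_log VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "PV", 0, 60), (1, 1, "PV", 60, 180),
        (2, 2, "PV", 0, 300),
        (3, 2, "MI", 0, 120),
    ],
)

# First sum action durations per investigation, then average per type.
QUERY = """
SELECT InvType, AVG(duration) AS avg_duration
FROM (
    SELECT InvID, InvType, SUM(TimeEnd - TimeStart) AS duration
    FROM activity_log
    GROUP BY InvID, InvType
) AS per_investigation
GROUP BY InvType
ORDER BY InvType
"""
avg_by_type = dict(conn.execute(QUERY).fetchall())
```

Replacing InvType with AnalystID in both SELECT and GROUP BY clauses would give the per-analyst variant of the same metric.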
Saiku OLAP Configuration. CE Saiku is integrated into the measuring module of the managers' GUI to provide easily customizable metrics on the investigation logs. Leveraging this open-source OLAP tool allows us to connect directly to the relational database of the developed system, which is PostgreSQL, access data easily, and design the desired multidimensional schema by mapping database table fields to cubes. The package we use to integrate the CE Saiku browser into the desktop GUI is CEF Python [28]. In this way, managers can work with the CE Saiku web application directly through the GUI.
The multidimensional Mondrian [29] OLAP schema is defined in XML format for use by the CE Saiku engine. It is designed by considering multiple attributes of analysts' activity logs to provide detailed investigation measurements to the SOC managers. The attributes considered for the schema are investigation types, analysts, and clients. Quantitative attributes in the calculation include TimeStart, TimeEnd, and InvID. For each quantitative attribute, we need to set a proper calculation function, called an aggregator. For example, a sum function can be applied to the TimeStart and TimeEnd attributes, resulting in a total investigation duration. The calculation function for InvID counts the number of distinct values, indicating how many investigations were done.
Once the basic measures are defined, a method called calculated member is used to write formulas that combine different measures. For instance, as a basic measure, the durations of different investigations are calculated. Then, using a calculated member, we can calculate the average investigation duration per client, analyst, and investigation type. The calculated member in our example is Average per Investigation. The results shown in Figure 8 are for each client, analyst, and investigation type. This example demonstrates the potential benefit of using such an OLAP tool to obtain customizable performance metrics as cubes, rather than hard-coded metrics.
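The idea behind the calculated member can be mimicked in plain Python: two base measures (a summed duration and a distinct count of InvID) are aggregated along one dimension and then divided. This is an illustration of the arithmetic only, not of the Mondrian schema syntax; the function name, the Client and AnalystID keys, and the sample rows are assumptions.

```python
from collections import defaultdict

def average_per_investigation(logs, dimension):
    """Total duration divided by the distinct investigation count, per dimension value."""
    total_duration = defaultdict(float)   # base measure 1: sum of durations
    investigations = defaultdict(set)     # base measure 2: distinct InvID values
    for row in logs:
        key = row[dimension]
        total_duration[key] += row["TimeEnd"] - row["TimeStart"]
        investigations[key].add(row["InvID"])
    return {key: total_duration[key] / len(investigations[key])
            for key in total_duration}

# Made-up rows, times in seconds.
ROWS = [
    {"InvID": 1, "Client": "X", "AnalystID": 1, "TimeStart": 0, "TimeEnd": 60},
    {"InvID": 1, "Client": "X", "AnalystID": 1, "TimeStart": 60, "TimeEnd": 180},
    {"InvID": 2, "Client": "X", "AnalystID": 2, "TimeStart": 0, "TimeEnd": 300},
    {"InvID": 3, "Client": "Y", "AnalystID": 2, "TimeStart": 0, "TimeEnd": 120},
]
per_client = average_per_investigation(ROWS, "Client")
per_analyst = average_per_investigation(ROWS, "AnalystID")
```

Switching the dimension argument corresponds to dragging a different dimension onto the cube, with no change to the measure definitions.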

Simulation Module
The simulation results only provide a hypothesis evaluation for end users; the production database stays the same. The alert dispatching simulation is performed on a replicated database in which the simulated logs are stored, and the step duration simulation is performed by SQL queries in real time.
Modifying Steps' Duration. Performance metrics are calculated based on the modified time durations for comparison with the measuring module. The goal of this modification is to identify steps that are candidates for automation and thereby significantly reduce human effort in the analysis.
Alert Dispatching Method. The default alert dispatching method assigns the same number of clients to each available analyst of a work shift. Analysts are responsible for their assigned clients and work independently. For instance, with five available analysts and 15 clients, three clients are assigned to each analyst without any further consideration. This approach has some pitfalls, such as the possibility of assigning several clients with a high alert load to the same analyst while another analyst is assigned clients with a low alert load. It is also possible that two clients assigned to the same analyst are under attack at the same time and the analyst cannot handle both simultaneously, while none of the clients of another analyst are in such a critical situation.
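The pitfall of the default method can be made concrete with a toy example. The round-robin assignment, the client and analyst names, and the alert counts below are all assumptions chosen to show how an even client count can still produce a very uneven alert load.

```python
def assign_clients_evenly(clients, analysts):
    """Give every analyst the same number of clients, ignoring alert load."""
    assignment = {analyst: [] for analyst in analysts}
    for index, client in enumerate(clients):
        assignment[analysts[index % len(analysts)]].append(client)
    return assignment

def alert_load(assignment, alerts_per_client):
    """Total number of alerts each analyst has to handle."""
    return {analyst: sum(alerts_per_client[c] for c in assigned)
            for analyst, assigned in assignment.items()}

ANALYSTS = ["a1", "a2", "a3", "a4", "a5"]
CLIENTS = ["c{}".format(i) for i in range(1, 16)]
# Hypothetical loads: three busy clients that happen to land on the same analyst.
ALERTS = {client: 10 for client in CLIENTS}
ALERTS.update({"c1": 100, "c6": 100, "c11": 100})

assignment = assign_clients_evenly(CLIENTS, ANALYSTS)
load = alert_load(assignment, ALERTS)
```

Every analyst serves three clients, yet in this made-up case one analyst faces ten times the alert volume of each colleague.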
The simulation methods follow a single-queue, multiple-servers methodology. A system in which several servers serve clients from a single queue is known as the M/M/c model, where c is the number of servers [25]. The discipline in these systems is first-come, first-served, and the arrival rate of jobs is based on the Poisson process. The job routing method is Fastest Server First [30], which distributes the incoming jobs to the fastest server first.
In this simulation, two metrics are provided to assess the effect of the simulated method. One is the average analysis duration, and the other is the alert waiting time. The waiting time is the time the alert stays in the SOC console (queue) before being analyzed. It is the difference between the start time of an investigation and the arrival time of the corresponding alert in the queue. In our study, the waiting time reflects the response time of the SOC to incidents, which is an important factor for managers in respecting the SLA of different clients. By considering the waiting time, we can also assess how the new approach would affect the waiting time of the alerts. The simulation is detailed in a case study in Section 6.4.
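A minimal sketch of such a single-queue, multiple-servers replay with Fastest Server First routing is given below. It is not the simulation code of the system: it replays a fixed list of alerts instead of sampling Poisson arrivals, and the speed factors, the function name, and the sample values are hypothetical.

```python
def simulate_single_queue(alerts, analyst_speed):
    """Replay alerts through one FCFS queue served by several analysts.

    alerts: list of (arrival_time, base_duration), sorted by arrival time.
    analyst_speed: dict analyst -> speed factor (assumed; >1 means faster).
    Returns (average analysis duration, average waiting time).
    """
    free_at = {name: 0.0 for name in analyst_speed}
    total_wait = 0.0
    total_service = 0.0
    for arrival, base_duration in alerts:
        idle = [a for a, t in free_at.items() if t <= arrival]
        if idle:
            # Fastest Server First: pick the fastest analyst who is idle.
            chosen = max(idle, key=lambda a: analyst_speed[a])
            start = arrival
        else:
            # Everyone is busy: the alert waits for the first analyst to finish.
            chosen = min(free_at, key=lambda a: free_at[a])
            start = free_at[chosen]
        service = base_duration / analyst_speed[chosen]
        free_at[chosen] = start + service
        total_wait += start - arrival  # waiting time = start time - arrival time
        total_service += service
    count = len(alerts)
    return total_service / count, total_wait / count


avg_duration, avg_wait = simulate_single_queue(
    [(0, 100), (0, 100), (10, 100)],
    {"fast": 2.0, "slow": 1.0},
)
```

In this toy run the third alert arrives while both analysts are busy, so it is the only one that contributes to the waiting time.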

Feedback Module
The feedback module first maps analysts' activity logs back to the investigation model to locate the analysis step. Then, it provides the predicted steps to analysts as hints for the possible solutions. Finally, it tracks the analysts' activities and provides notifications for missing steps.
The first algorithm for finding missing steps continuously maps analysts' incoming activity logs to the investigation model to see whether a step is missed. It starts checking the logs of one investigation from the first log. It checks every two successive (adjacent) logs in the database to see whether they are also successive in the model. If they are not adjacent in the model, there are one or more missing nodes between them. The second algorithm finds the shortest path between two nodes. Since it is possible to have several paths between two nodes, the shortest path, i.e., the one including the fewest nodes, is selected to report missing actions or steps. The first algorithm provides the accurate missing steps, while the second algorithm consumes less computation time and reports the fewest missing steps.
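The two ideas above, the adjacency check on successive logs and the shortest path between non-adjacent nodes, can be sketched as follows. The investigation model shown here is a made-up graph, and the breadth-first search is one common way to obtain a path with the fewest nodes; the source does not state which search procedure is actually used.

```python
from collections import deque

# Hypothetical investigation model: step -> steps that may directly follow it.
MODEL = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
}


def shortest_path(model, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in model.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def missing_steps(model, logged_steps):
    """Report intermediate nodes between successive logs that are not adjacent."""
    missing = []
    for current, following in zip(logged_steps, logged_steps[1:]):
        if following in model.get(current, []):
            continue  # adjacent in the model, nothing was skipped
        path = shortest_path(model, current, following)
        if path is not None:
            missing.extend(path[1:-1])  # drop the two logged endpoints
    return missing
```

For the logged sequence A, B, E, F, the pair (B, E) is not adjacent in this model, so the shortest connecting path B, C, E is used and step C is reported as missing.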
In order to report duration anomalies, as some analysts may not spend enough time on some steps, a dynamic duration standard is provided based on historical data for each specific step. The standard duration is taken as the real-time average of the historical data, with an allowed range of n percent on either side, that is, from Average × (1 − n) to Average × (1 + n).
For instance, by applying the above formula, if n is 20% and Average is 60 sec, the normal duration range is between 48 and 72 sec. The range percentage is configurable.
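The range check can be sketched in Python as follows, assuming the range is symmetric around the historical average, which is what the 48 to 72 sec example implies. The function names and the default tolerance are illustrative.

```python
def normal_duration_range(average_seconds, tolerance):
    """Return (lower, upper) bounds; tolerance is a fraction, e.g. 0.20 for 20%."""
    return average_seconds * (1 - tolerance), average_seconds * (1 + tolerance)


def is_duration_anomaly(duration_seconds, history, tolerance=0.20):
    """Flag a step duration that falls outside the range around the history average."""
    average = sum(history) / len(history)
    lower, upper = normal_duration_range(average, tolerance)
    return not (lower <= duration_seconds <= upper)


lower, upper = normal_duration_range(60, 0.20)
```

Because the average is recomputed from the history on each call, the standard follows the data as new durations are recorded.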

Data Processing and Case Studies
Three case studies are elaborated in this section to show the effectiveness of our designed system in improving the SOC performance. First, we go through the dataset pre-processing and provide different statistics of the dataset. Then we present the case studies.

Dataset Pre-processing
A pre-processing step is implemented to remove data-gathering mistakes resulting in out-of-range values, impossible data combinations, and missing values.
Out-of-range values are step durations that are too long compared to real durations and need to be normalized. One possible reason for these abnormal cases is that the TimeEnd of each action is taken as the TimeStart of the next action. Therefore, a gap between sequential actions (e.g., when an analyst takes a leave) increases the duration. Such a large duration between steps is considered noise in our dataset. Similar to the noise removal, we clean the data of missing attribute values, conflicting actions, and false positive or contextual events.
Activity logs from different analysts' machines are gathered and stored at the same time by the logging script. First, the activity logs of different analysts are separated from each other and ordered by time. Then, different investigations from the same analyst are distinguished from each other by a specific ActionID (#130), and each investigation is assigned a unique identifier, InvID. InvestigationTypeID and StepID values are mapped from numbers to predefined codes of the system in the pre-processing phase. For instance, in the text file, the value 1 of the InvestigationTypeID attribute identifies the Policy Violation (PV) investigation type. Consequently, those numbers are mapped to the abbreviated forms of their investigation type names. Similarly, for the StepID attribute, codes 1, 2, 3, ... are mapped to A, B, C, ..., respectively.
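The mapping step can be sketched as below. Only the mapping of InvestigationTypeID 1 to PV and of StepID 1, 2, 3 to A, B, C is taken from the text; the fallback to OTHERS for an unknown type ID and the function names are assumptions of this example.

```python
import string

# Only 1 -> "PV" is stated in the text; the remaining predefined codes of the
# system are not reproduced here.
INVESTIGATION_TYPE_CODES = {1: "PV"}


def map_step_id(step_id):
    """Map numeric step IDs 1, 2, 3, ... to letters A, B, C, ..."""
    if not 1 <= step_id <= len(string.ascii_uppercase):
        raise ValueError(f"unexpected StepID: {step_id}")
    return string.ascii_uppercase[step_id - 1]


def map_investigation_type(type_id):
    """Map a numeric type ID to its abbreviation (assumed fallback: OTHERS)."""
    return INVESTIGATION_TYPE_CODES.get(type_id, "OTHERS")
```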

Table 7. The dataset statistics
In Table 6, we can see that the values of the InvestigationTypeID attribute are zero and one. First, all values of this attribute are changed to one; then all values are mapped to PV, as we see in Figure 9. If all the values of InvestigationTypeID were zero, it would mean that the investigation type was not retrievable, and we would simply assign OTHERS to the InvestigationTypeID.

Statistics of Dataset
The dataset covers 57 days between June 2015 and August 2015. Six analysts perform alert analysis for 40 clients. The time duration format in Table 7 and all following tables is mm:ss. As shown, 40.7% of investigations result in creating or updating incidents, whereas 59.3% of all investigations end in closing alerts after an investigation. The average analysis duration of investigations ending in the creation of new incidents is 17:52, while the average durations of updating an incident and closing an alert are 06:41 and 05:16, respectively. Moreover, different analysts have different average analysis durations: the slowest analyst's average duration is 11:52, whereas the fastest analyst's is 06:22.
The average investigation durations of different investigation types are shown in Table 8 for the three investigation results. For each column, the first element is the number of related investigations, and the second is the average investigation duration. For all investigation types, the average duration of the creating a new incident category is longer than that of the updating the incident and closing the alert categories. EVS, MI, BFA, and PV investigation types take more time to gather indications to create an incident than AA, DOS, and OTHERS. For almost all investigation types, the average time of updating an incident is longer than that of closing an alert of the same type. An exception is the AA type, whose number of updated cases (#2) is not enough to be considered a counterexample. Although the EVS attack type takes more time to be confirmed as an incident, its average duration of getting closed is the shortest (03:27) among all types. The most time-consuming investigation type for updating an incident is PV, which is reasonable, as it mostly requires communicating with the client to clarify the situation.
Looking at the total dataset statistics, MI, BFA, and PV are the most time-consuming alert types in the SOC regardless of the investigation result. These attack types, together with EVS, also account for the highest numbers of received alerts, although EVS alerts are analyzed quickly. The MI type has the highest number of investigations (298) and the highest average analysis time (08:45).

Case Study I, Modifying the Duration of Steps
This case study assesses potential steps that could be automated to improve overall performance. Every change in a system needs to be assessed before going into production. After going through all the steps that analysts perform for an alert investigation, two possible automation options are recognized. One is checking clients' assets, and the other is checking the escalation grid. These two are considered potential improvements to the current investigation workflow.

Checking clients' assets (part of step C).
In this case study, we assess how the possible solution would affect the SOC performance.As we know, investigations can    Considering investigation classes, the summary of simulation results is shown in Table 13.
Checking escalation grid (step F). One analysis step is checking the escalation grid to find out how the related client should be informed in case of a new or updated incident. Each client's escalation grid is an informative document accessible through some mouse clicks in the SOC Console. The contacting approaches can differ for each client based on the severity of the incident and the client's preference.
A possible improvement for this step is to provide information about the required escalation method for an incident in the window for creating and updating incidents in the SOC Console. By correlating the related client's escalation grid with the related incident type, the proper contact approach can be fetched and shown to the analyst as a hint, which saves the analyst the time spent going through different buttons to find the required information.
In this scenario, we observe how frequent this step is and the average duration analysts spend on it. We found that 58 out of 194 (29.9%) investigations contain checking the escalation grid for the category of creating incidents. Besides, 18 out of 331 (5.44%) and 23 out of 765 (3.0%) investigations include checking the escalation grid for the categories of updating incidents and closing alerts, respectively. Since the number of involved investigations for the last two investigation classes is small (as expected), the simulation is only done for the creating new incidents category. Table 14 shows the detailed simulation results for the class of creating new incidents, where the related step is eliminated completely. Our simulation illustrates that the average investigation duration is decreased by just 1.3%. MI and AA attack types are the ones most often checked for the proper escalation method.
Combination of two possible improvements. By combining the above two improvement options, we assess how the averages would be affected. Correlating alerts with assets' vulnerabilities and automatically showing the escalation grid are simulated together to show the effect on the averages. Table 15 shows the simulation results. First, by removing the 41 investigations that were closed immediately after checking assets' vulnerabilities, new analysis averages are obtained and shown in the second column; they indicate an increase, which means that the removed investigations were among the short investigations. Then, by eliminating the steps related to checking assets and the escalation grid, simulated average analysis durations are calculated and shown in the fourth column. By employing both automation solutions, the total average analysis duration would be decreased by 7.36%. The most affected attack types are EVS and MI.
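The reduction ratios reported for these simulations follow directly from the mm:ss averages. As a worked check, the short sketch below recomputes two of them from the durations given in the captions of Table 12 (07:34 to 07:03) and Table 15 (8:09 to 7:33); the helper names are illustrative.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def reduction_ratio(before, after):
    """Percentage decrease from `before` to `after`, both given as mm:ss strings."""
    b, a = to_seconds(before), to_seconds(after)
    return (b - a) / b * 100


combined = round(reduction_ratio("8:09", "7:33"), 2)        # Table 15 caption
asset_check = round(reduction_ratio("07:34", "07:03"), 2)   # Table 12 caption
```

Here 8:09 is 489 sec and 7:33 is 453 sec, so the decrease is 36/489, which rounds to the reported 7.36%.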

Case Study II, A Different Alert Dispatching Method
In this case study, a different method of dispatching incoming alerts among analysts is simulated and assessed. As discussed in Section 5.3, the simulation follows a single-queue, multiple-servers methodology, where the routing strategy is assigning alerts to the fastest analysts.
The number of analysts working in the SOC differs from one work-shift to another. For weekdays (Monday to Friday), there are three work-shifts per day: the day-shift from 8:00 to 16:00, the evening-shift from 16:00 to 00:00, and the night-shift from 00:00 to 8:00.
In total, 33 day work-shifts are considered for the simulation, with a minimum of two analysts working per work-shift. For 17 work-shifts, two analysts are working per work-shift. For 15 work-shifts, three analysts are working, and for one work-shift, four analysts are working in the SOC in parallel. Table 16 shows statistics on the selected dataset for the 33 day work-shifts and the simulation results. Each column represents one investigation class, such as new incidents, in which the first sub-column is the dataset average analysis time, the second is the simulated average analysis time, and the third is the reduction ratio.
By comparing Table 7 with Table 16, we can see that the proportion of the closing alerts category increases from 59.3% to 94.12%, since all investigations of the work-shift are considered regardless of their duration.
The simulation result in Table 16 represents the effect of the simulated dispatching approach on the average analysis time and waiting time of the different investigation classes and of the total dataset. The average analysis duration and waiting time of the total dataset are improved by 4.42% and 2.18%, respectively.

Case Study III, The Feedback Module
Our study of the analysts' activity logs shows that the average investigation durations for different attack types usually differ from each other. The average durations of different analysts also usually differ, even for the same attack type, implying different efficiency. The different efficiency can be due to the analysts' different levels of knowledge regarding analysis approaches or of familiarity with clients' environments.
As discussed in Sections 4.1 and 5.4, one feature of the feedback module is showing previous investigation logs whose alert type is similar to the one the analyst is currently analyzing. In this case study, we evaluate how such a feedback module would affect the analysts' performance.
We consider one analyst as the senior analyst, who has better efficiency than the others. The other analysts, who may benefit from the senior analyst's knowledge and experience, are called junior analysts. We model the trend of the investigation durations of the senior analyst and partially apply the model to future investigation durations of the junior analysts, assuming that, since the junior analysts can see what the senior analyst performed through the feedback module, they can potentially improve their efficiency.
Linear regression analysis [31] is employed to model the senior analyst's investigation durations.
A realistic assumption here is that the junior analyst will partially benefit from the knowledge of the senior analyst by observing the latter's investigation details, but such benefit is not likely to be sufficient to enable the former to perform those analyses with exactly the same efficiency as the latter. In other words, knowledge transfer is partial instead of complete. Accordingly, we assign a percentage range by which a junior analyst can improve the efficiency of his/her investigations of the same type after observing the senior analyst's approach and results. It is assumed in this case study that a junior analyst can gain 10% to 60% of the senior analyst's knowledge to improve his/her investigations. Obtaining a more accurate estimation of such a range from real data is left as future work.
The average investigation duration of a junior analyst for EventID 101010 is considered as the default value for his/her future investigation durations in our simulation. This default value can be improved by learning from the senior analyst. For example, when we assume the junior analyst gains knowledge by 10%, his/her average investigation duration is calculated as the sum of 90% of the junior's own average (90% * Junior_Investigation_Average) and 10% of the senior analyst's investigation duration as predicted by the model (10% * Model_Investigation_Duration). As another example, when we assume the junior analyst gains knowledge by 60%, his/her average investigation duration is equal to 40% of his/her own duration plus 60% of the senior analyst's duration.
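The weighted combination described above can be written as a one-line function. In the sketch below, the junior average of 7:29 (449 sec) and the senior average of 2:37 (157 sec, the mean of the extrapolated data points) are the values reported in this case study; using a single senior average instead of the per-data-point model prediction is a simplification of this example, so the results only approximate the reported ones.

```python
def blended_duration(junior_average, senior_predicted, transfer_rate):
    """Junior's simulated duration after gaining `transfer_rate` of the senior's knowledge.

    transfer_rate is a fraction, e.g. 0.10 for 10%.
    """
    return (1 - transfer_rate) * junior_average + transfer_rate * senior_predicted


JUNIOR_AVERAGE = 7 * 60 + 29   # 7:29, AnalystID 5
SENIOR_AVERAGE = 2 * 60 + 37   # 2:37, simplification: mean of extrapolated points

low_transfer = blended_duration(JUNIOR_AVERAGE, SENIOR_AVERAGE, 0.10)
high_transfer = blended_duration(JUNIOR_AVERAGE, SENIOR_AVERAGE, 0.60)
```

With these inputs the sketch yields roughly 420 sec and 274 sec, close to the 6:59 and 4:33 reported for AnalystID 5.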
The reason for considering the average investigation duration of the senior and junior analysts in this case study, instead of modeling their investigation durations, is that we do not have sufficient data points to establish a model for them. Since the dataset does not provide enough investigations for any single EventID, the average investigation duration is considered for the junior analysts.
For the regression analysis, the X-axis represents the time series that orders investigations chronologically, and the Y-axis is the investigation duration of the data points. In practice, investigations might be performed on the same day or across different days in the period of the dataset, but the time distance between data points is limited to one day in this simulation. We aim to obtain the main trend of the investigation durations in chronological order, as either an increase or a decrease in the average investigation durations of the senior analyst during the dataset period.
There turn out to be many fluctuations in the 45 data points representing investigation durations for AnalystID 6. To smooth the curve, every five adjacent investigation durations are averaged and represent one data point. In the end, we obtain a model of the senior analyst's investigation durations as the exponential equation shown below.
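One way to obtain an exponential model with linear regression is to fit a straight line to the logarithm of the smoothed durations; whether this is the exact procedure used here is not stated, so the sketch below should be read as an assumption. It also assumes that the five-point averaging uses non-overlapping blocks, and all function names are illustrative.

```python
import math


def smooth(durations, window=5):
    """Average each block of `window` adjacent durations into one data point."""
    return [
        sum(durations[i:i + window]) / window
        for i in range(0, len(durations) - window + 1, window)
    ]


def fit_exponential(points):
    """Least-squares line through (x, ln y); returns (a, b) for y = a * exp(b * x).

    The x values are the chronological positions 1, 2, 3, ... of the data points.
    """
    xs = list(range(1, len(points) + 1))
    ys = [math.log(p) for p in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


def predict(a, b, x):
    """Extrapolate the fitted model to position x."""
    return a * math.exp(b * x)
```

A negative fitted b corresponds to a decreasing trend in the durations, and calling predict with positions beyond the last data point gives the extrapolated values.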
In Table 18, some of the data points are shown. The first 10 data points are averaged investigation durations from 45 investigations of the dataset for the senior analyst, and the next 10 data points are extrapolated from the model. The average duration of the dataset data points is 3:04, whereas the average duration of the extrapolated data points is 2:37, showing the decreasing trend of the senior analyst's model, which indicates that the analyst's efficiency for this type of investigation slowly improves over time. As discussed, in order to estimate the junior analyst's efficiency, a knowledge gain ranging from 10% to 60% is considered. We simulate 10 future investigation durations for the junior analysts by combining their own average investigation duration and the effect of the senior analyst's knowledge using the percentage.
Table 19 shows the simulation results, where the junior analyst is AnalystID 5, with a default average investigation duration of 7:29. Estimation results show that, if the junior AnalystID 5 gains 10% of the senior analyst's knowledge through the feedback module, the average investigation duration will change from 7:29 to 6:59, a decrease of 6.68%. If he/she gains 60% of the senior analyst's knowledge, his/her average investigation duration will change from 7:29 to 4:33, a decrease of 39.2%.
Table 20 shows the simulation results where the junior analyst is AnalystID 4, with a default average of 5:46. Simulation results show that, if the junior AnalystID 4 gains 10% of the senior analyst's knowledge, the average investigation duration will change from 5:46 to 5:26, a decrease of 5.78%. If he/she gains 60% of the senior analyst's knowledge, his/her average investigation duration will change from 5:46 to 3:53, a decrease of 32.66%.
In summary, in this case study, the impact of showing previous investigations to the analysts, which is one of the important features of the feedback module, is assessed based on some assumptions. AnalystID 6 is considered the senior analyst, and AnalystIDs 5 and 4 are juniors. Based on the simulation results, if the junior analysts gain 10% to 60% of the senior analyst's knowledge, the efficiency of AnalystID 5 can be improved by 6.68% to 39.2%, and the efficiency of AnalystID 4 can be improved by 5.78% to 32.66%.
Under the stated assumptions, these results demonstrate the potential benefit of the feedback module of the system. The summary of results is shown in Table 21.

Related Work
In this section, we first review the existing work on MSS, MSM, NSM, and SOC in Section 7.1; the literature on alert correlation techniques is then presented in Section 7.2. Finally, we discuss the studies related to call centers (CC) in Section 7.3.

MSS, MSM, NSM, SOC
Allen et al. [1] study different security services, e.g., network boundary protection services and vulnerability assessment, and provide guidelines related to MSS. Then, MSM is introduced as a network security solution of this century by Schneier [32]. McKeown et al. [33] conclude the monitoring scope for MSM, which mainly focuses on a client's network, and for MSS, which focuses on security products for the companies.
A correlation method is introduced by Zhu and Ghorbani [39] to recognize different attack scenarios without relying on experts' background knowledge. Multilayer perceptron and support vector machine are employed as neural network approaches to evaluate the correlation probability of each pair of alerts. Correlation probability estimation results are stored in an Alert Correlation Matrix, which is used to extract high-level attack scenarios.
Ramaki et al. [17] present a correlation framework to detect multi-step attack scenarios in real time as an Early Warning System (EWS). An EWS aims to identify hidden risky behavior of a system which might expose the system to threats [40]. Statistical and stream mining (sequence analysis) techniques are employed to design the correlation scheme.

Call Centers And Queuing Models
In both SOC and CC, humans serve different clients with various service requests in a queue. In the SOC, security analysts are the servers and incoming alerts are considered as different service requests, whereas in the CC, operators respond to different calls. In both cases, incoming service requests arrive in a queue, and the service needs to meet a certain service level agreement (SLA) specified between the clients and the service company.
Brown et al. [41] describe queuing-theoretic models for service systems based on a basic common queuing model, the M/M/N system or Erlang-C [42], which considers an arrival rate based on the Poisson process. Green et al. [18] study different queuing-theory methods for setting up a service system to serve clients whose request pattern is predictable during a day (i.e., how much demand arrives in which periods).
Excoffier et al. [19] propose a solution to the staffing problem, which determines the minimum number of servers that can conform to the SLA. The proposed solution is based on linear approximation and considers the arrival rate as random. Mattia et al. [20] present a robust solution guaranteeing that the proposed shift schedule with the minimum number of servers can conform to the SLA. It computes the solution by considering the probability distributions of uncertain parameters.
Other recent works related to queuing models of CC [21,43] focus on considering impatient customers as a new input parameter for queuing models. This new parameter introduces the uncertainty from the client side into the models.

Conclusion and Future Work
In this section, we conclude our work by summarizing the contributions in Section 8.1 and discussing the directions of future work in Section 8.2.

Conclusion
In this paper, by modeling the main workflow of an operational SOC, a system for improving the SOC's performance has been designed consisting of four modules: monitoring, measuring, simulation, and feedback. The integration of the first three modules provides SOC managers with a solution to evaluate the current performance and assess potential improvement options through simulations. The feedback module enables knowledge transfer among SOC analysts in their ongoing workflows to improve their performance.
By deploying a logging component inside the main SOC console, analysts' activity logs from a real production SOC were collected over 57 days from June to August 2015 into a dataset for evaluating our system. Three case studies have been conducted based on the dataset to study the designed system's effectiveness, namely, modifying the duration of steps, a different alert dispatching method, and the feedback module's impact.
In the case study of modifying the duration of steps, we provide two improvement scenarios for the SOC workflow. The simulation result of the combined scenarios demonstrates a performance improvement of 7.36%. In the case study of a different alert dispatching method, the results indicate that one important factor in improving efficiency with the employed approach is the combination of the selected analysts' expertise within one work-shift. If analysts with different expertise are chosen to work in the same work-shift, the proposed dispatching model would increase the efficiency of the SOC further. The simulation results for the 33 day work-shifts show a 4.42% improvement in the average investigation duration and a 2.18% improvement in the alert waiting time. In the case study of the feedback module's impact, knowledge transfer rates (10% to 60%) are considered for two junior analysts gaining knowledge from a senior analyst regarding a specific EventID. The average performance improvement for the two junior analysts ranges from 6.23% to 35.93%, depending on their knowledge transfer rate. In order to assess the improvement results, it should be considered that all improvement percentages refer to the reduction in the time duration of one single investigation.

Future Work
Future work will mainly proceed in two directions. First, we will apply data mining for automated analysis of investigation logs, e.g., by applying classification and association techniques. By using classification, we can label analysts' performance according to their performance evaluation. Association rules can be employed to find frequent patterns, such as the habits of analysts during an investigation, and to pre-filter EventIDs.
Second, we will also apply and simulate queuing theories on the dataset to show how different models affect the overall performance of the SOC, in order to find the optimum approach. Moreover, extending the case studies to a larger-scale dataset would also be an interesting future direction.

Figure 1. An example of a typical deployment model for Network Security Monitoring service representing an operational SOC and a client infrastructure

Figure 2. An Example of an Investigation Model

Monitoring and Improving Managed Security Services inside a Security Operation Center EAI Endorsed Transactions on Security and Safety 12 2018 -01 2019 | Volume 5 | Issue 18 | e1

Figure 5. The measuring module dashboard


Figure 6. The results of the simulation module alongside the measuring module

Figure 7. The feedback module notifications and reports

Figure 9. Rows of logs related to one investigation placed in the database

By the regression analysis, the mathematical function representing investigation durations is extracted from the dataset. The duration of each investigation completed by the senior analyst is considered as a data point for that analyst. The data points of the analyst are ordered chronologically, as the investigations are performed on different days. By extrapolating the established model, we can predict future investigation durations of the senior analyst based on his historical data.


Table 2. Step IDs and their description

Table 3. Steps with sub-steps as Action IDs and related description

Table 4. Database table attributes and description

Table 5. Log entries of one investigation related to MI investigation type

Table 6. Raw activity logs in the format of text file

Table 8. The dataset statistics regarding different investigation types. Here, average values are time durations in the format of mm:ss, and the other column (#) is the number of instances.
AA, DOS, and OTHERS. The most popular attack type is MI, with 47 distinct incidents in the dataset, whereas DOS has the smallest number of created incidents, 4. After MI, the other two popular attack types are BFA and PV. Since the number of cases for the OTHERS type is much larger than for the known categories, it is not mentioned in our comparisons of known investigation types.
Creating a new incident. Table 9 shows the results for the creating a new incident class, where the dataset average, the simulated average, and the reduction ratio are presented. Overall, 48.45% of investigations in this class contain an asset checking step, which is involved in this simulation. The total average of this investigation class is decreased by 7.55%. The most affected investigation types are AA, EVS, and MI. Table 10 depicts the simulation result of the updating incidents class, where 12.69% of investigations have the asset checking step. The total average of this class is decreased by 6.98%.

Table 12. Total dataset: 29.38%, 367 out of 1249 investigations, contain step C (41 investigations are removed from the 1290 total investigations, as discussed for the closed alerts category). By the simulation, the total average analysis duration decreases from 07:34 to 07:03, or 6.83%.

Table 13. The summary of simulation results for different investigation classes

Table 15. Combined effect: 41 investigations are removed, resulting in an increase in the average analysis durations shown in the third column; 377 out of 1249 investigations (30.18%) are affected by the combination of the two simulation scenarios, which decreases the total average analysis duration from 8:09 to 7:33, or 7.36%, besides saving four and a half man-hours.

Table 16. The alert dispatching method dataset statistics for the 33 day work-shifts, the simulation results, and the reduction ratios.

Table 17. Different analysts' statistics: the average investigation duration, the variance of investigation durations, and the number of investigations related to EventID 101010, which is a custom EventID for verifying DNS queries.

Table 18. The first 10 investigation durations represent data points from the dataset for the senior AnalystID 6 and EventID 101010, and the next 10 investigation durations are extrapolated under the model.

Table 19. The simulation results of the feedback module's impact for the junior AnalystID 5, with the default average investigation duration of 7:29 for EventID 101010. The first part simulates the junior's efficiency for 10 future investigations with a knowledge transfer percentage of 10%, and the second part simulates the junior's efficiency for 10 future investigations with a knowledge transfer percentage of 60%.
[16]oying attack graphs, which are predefined attack scenarios. By introducing a novel approach, Queue Graph, nested-loop-based correlation is solved, and it is possible to match alerts to related nodes of the attack graph. Zali et al. [16] present a correlation approach that pre-defines simple relations among minor attacks to identify attack scenarios in real time.