A Survey of System Level Power Management Schemes in the Dark-Silicon Era for Many-Core Architectures

Power consumption in Complementary Metal Oxide Semiconductor (CMOS) technology has escalated to a point where only a fraction of a many-core chip can be powered on at a time. Fortunately, this fraction can be increased at the expense of performance through the dark-silicon solution. However, with many-core integration heading towards thousands of cores, power consumption and temperature rise over time, meaning the number of active nodes must be reduced drastically. Therefore, optimized techniques are demanded for continuous advancement in technology. Existing efforts try to overcome this challenge by activating nodes from different parts of the chip at the expense of communication latency. Other efforts employ run-time power management techniques to manage the power performance of the cores, trading off performance for power. We found that, for a significant amount of power to be saved and high temperatures to be avoided, focus should be on reducing the power consumption of all the on-chip components, especially the memory hierarchy and the interconnect. Power consumption can be minimized by reducing the size of high-leakage power-dissipating elements, turning off idle resources and integrating power-saving materials.


Introduction
Aggressive transistor scaling with technology has fuelled an unprecedented growth in the number of Processing Elements (PEs) available in modern Systems-on-Chip (SoCs). However, due to the excess thermal issues caused by the breakdown of Dennard scaling, multi-core/many-core chips do not scale properly with die area. Continued scaling of the chip size aggravates the total power consumption, and hence, to meet the system's power budget, only a subset of nodes can be powered on while the rest are powered off (dark). To make things worse, researchers have estimated that in the near future, 50% of a chip at the 8 nanometer (nm) technology node will have to be powered off. This implies that only half of the applications that are currently being executed in many-core chips will be executable at a time in the future. The industrial approach to this problem is the fabrication of processor chips designed to work within a thermal design constraint to prevent possible overheating and permanent damage. Unfortunately, as a trade-off, this solution prevents operation at peak frequency levels, and thus novel techniques are needed to maximise the chip's performance.
One possible solution to this challenge is application mapping, where specific nodes are selected for incoming applications. Unfortunately, prior works only focus on distributing application tasks across different regions of the chip without considering the performance of the applications. Another alternative is Dynamic Thermal Management (DTM) techniques such as power-gating, Dynamic Voltage Frequency Scaling (DVFS) and task migration. Unfortunately, these also trade off application performance to satisfy the temperature threshold by scaling the Voltage/Frequency (V/F) levels of the cores. Nonetheless, although many works and techniques have been proposed through the dark-silicon solution, only a few consider the power consumption of on-chip components such as the memory hierarchy and the interconnect. Recently, Multi-Level Caches (MLC) and the Network-on-Chip (NoC) paradigm have replaced single-level caches and buses, respectively, as the standard components for future many-core chip designs [1-4]. However, these components increase the power consumption and impact heavily on the temperature of the chip. In fact, the Last-Level Caches and NoC in a 16-core machine [5] account for 33% of the total power consumption. Therefore, to avoid high temperatures, a reduction of the power consumption in these elements is essential.
The remainder of this paper is organised as follows. Section II discusses the causes of power consumption. Section III introduces techniques for Dark-Silicon Constraint Systems (DSCSs), while Section IV discusses the influence that the interconnect and memory sub-system have on the chip's total power. Section V summarises all the techniques presented, and finally, the conclusion is drawn in Section VI.

Background
For decades, Moore's and Dennard's theories were the embodiment of exa-scale technology. Dennard scaling revealed that, by reducing the size of transistors, they can be utilised at lower power and voltage while the power density remains the same, because power density depends on the square of the applied voltage, which scales down with feature size [6-8]. Consequently, by reducing the physical parameters of transistors, it has been possible to utilise them under lower power and voltage, enabling an advent in resource duplication for performance enhancement and resulting in multi-core/many-core technologies.
Figure 2 depicts the effects of transistor scaling. A chip at 8 nm with all its nodes activated will exhibit very high temperature because of high power consumption. To prevent this, the dark-silicon solution permits fractions of the chip to be powered off. By switching parts of the chip off, the leakage power consumption reduces as well as the dynamic power consumption. Power consumption in CMOS integrated circuits materialises as the sum of dynamic and leakage power. Until deep-submicron processes emerged, dynamic power was held accountable for the majority of the power consumption in CMOS technology. Unfortunately, transistor size reduction has caused a halt in voltage scaling and resulted in an increase in the amount of sub-threshold leakage, as well as the gate tunnelling leakage current caused by thinner gate oxides. Therefore, it has been reported that leakage power contributes a higher percentage of the chip's total power consumption at the deep-submicron level [8-10]. Consequently, high power density generates excess heat and increases the temperature of the chip. The consequences of such peak temperatures are overheating, permanent damage, transient faults and faster ageing [11,12]. Therefore, to reduce power consumption at the transistor level, designers adjust the terms of the dynamic and leakage power equations: dynamic power consumption can be reduced by scaling the V/F level and reducing switching activity, while leakage power consumption is reduced by utilizing low-power cells or reducing the number of active transistors. An example of such a technique is dark-silicon, where every chip is allocated a Thermal Design Power (TDP) to operate within.
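The dynamic and leakage terms referred to above follow the standard first-order CMOS power model (a textbook formulation, not taken from a specific reference in this survey):

```latex
P_{total} = P_{dynamic} + P_{leakage}, \qquad
P_{dynamic} = \alpha\, C\, V^{2} f, \qquad
P_{leakage} = V \cdot I_{leak}
```

where \(\alpha\) is the switching activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, \(f\) the clock frequency, and \(I_{leak}\) the total leakage current. DVFS attacks the \(\alpha C V^{2} f\) term, while power-gating idle transistors attacks the \(V \cdot I_{leak}\) term.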

Dark-Silicon: The Future For Emerging SoC Designs
Unfortunately, the TDP provided by industry only allows DSCSs to operate at a feasible power budget that keeps the thermal profile of the chip down, thus limiting them from operating at high V/F levels. Nonetheless, Intel's Turbo Boost [13] and AMD's Turbo CORE [14] violate the TDP constraint during short intervals by boosting the system for higher performance. When the threshold is violated, DTM techniques are used to cool down the chip. The result of such an action is performance degradation. Therefore, such techniques, particularly task migration and DVFS, have to be used appropriately.

Task Migration
Task migration is an optimized technique used to migrate tasks away from nodes that are dissipating high heat. For example, if a task causes a node to generate excess heat, thus raising the temperature, that task is migrated to a cooler node, typically in exchange for a task with a lower thermal footprint. This is done to avoid excess heat, which can have a negative effect on neighbouring tasks. Unfortunately, this technique only applies when some nodes are executing heavily loaded tasks. If all the nodes are executing heavily loaded tasks, task migration will not have any effect. Figure 3 depicts an image of task migration. The temperature of application 1 has been reduced because the task in node 1 has been migrated to node 7, which has a lower temperature. Unfortunately, in application 3, because all the nodes are executing heavily loaded tasks, there is no difference in the overall temperature of the chip when a task is migrated to a different node [15].
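The behaviour described above can be sketched as a simple swap heuristic. This is an illustrative sketch, not the policy of any surveyed work; the node/task structures and the temperature-gap threshold are assumptions.

```python
# Temperature-driven task migration sketch: swap the task on the hottest
# node with the task on the coolest node, provided the temperature gap
# justifies the move.

def migrate_hottest_task(node_temps, task_of_node, threshold=10.0):
    """node_temps: dict node -> temperature (Celsius).
    task_of_node: dict node -> task id.
    Returns the (possibly updated) task placement."""
    hottest = max(node_temps, key=node_temps.get)
    coolest = min(node_temps, key=node_temps.get)
    # If every node is similarly hot (all heavily loaded), migration has
    # no effect -- this mirrors the limitation noted in the text.
    if node_temps[hottest] - node_temps[coolest] < threshold:
        return task_of_node
    placement = dict(task_of_node)
    placement[hottest], placement[coolest] = placement[coolest], placement[hottest]
    return placement
```

Note that when all temperatures are within the threshold, the placement is returned unchanged, which reflects the case of application 3 in Figure 3.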

Dynamic Voltage Frequency Scaling
DVFS is used to dynamically vary the V/F levels of a node that exhibits high temperature. This technique is best used in a Run-Time Management (RTM) system because accommodating newly arrived applications can sometimes cause an overshoot. During this process, DVFS is used to vary the V/F levels until the application has been successfully mapped. Although this helps to reduce the temperature of the chip, the tasks in question suffer from performance loss. With such a technique, tasks are likely to be executed beyond their deadlines [15].
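A minimal DVFS control loop consistent with the description above can be sketched as follows; the operating-point table, threshold and margin are illustrative assumptions, not values from the cited works.

```python
# One-step DVFS controller sketch: when a node's temperature exceeds the
# threshold, step down to the next lower V/F operating point; when it is
# comfortably below the threshold, step back up.

VF_LEVELS = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5)]  # (volts, GHz)

def dvfs_step(level, temperature, threshold=85.0, margin=10.0):
    """Return the new V/F level index for one control period."""
    if temperature > threshold and level > 0:
        return level - 1            # overshoot: scale down, losing performance
    if temperature < threshold - margin and level < len(VF_LEVELS) - 1:
        return level + 1            # headroom: scale back up
    return level
```

The hysteresis margin prevents the controller from oscillating between two levels, which is the usual design choice in such feedback loops.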

Thermal Design Power Techniques
To address this, Pagani et al. [16] proposed Thermal Safe Power (TSP), a power budgeting technique that allows DSCSs to operate at their highest power. Unlike TDP, where all the cores are modelled with one power value, TSP allows different groups of cores to have different power constraints based on the incoming application. Therefore, for every floor-plan, a different power constraint is computed for the worst-case mappings. This is contrary to TDP architectures, where all the cores are assumed to be functioning at their worst-case V/F. The TSP worst-case mapping is computed based on the number of active cores, their positions, and the influence of neighbouring cores' temperatures. Therefore, TSP is considered the most optimized thermal constraint because the amount of power the chip can operate on is based on core alignment, which is determined by application mapping [17,18]. Similarly, Wang et al. [15] also introduced a new power budgeting technique for DSCSs. The proposed technique determines the number of cores that need to be activated, as well as selecting the maximum power that every core can consume based on the current thermal profile of the chip. In addition, the proposed technique uses a model prediction method to generate a power budget for the chip for future mappings.
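The core idea, a per-core budget that depends on how many active cores there are and how tightly they are packed, can be illustrated with a heavily simplified sketch. The coefficients and the linear neighbour penalty are assumptions for illustration only, not the thermal model of [16].

```python
# Simplified TSP-style budgeting sketch: each active core's budget drops
# as more of its mesh neighbours are also active, since neighbouring
# active cores heat each other.

def tsp_per_core_budget(active, base=4.0, neighbour_penalty=0.3):
    """active: set of (x, y) active-core coordinates on a 2D mesh.
    Returns a dict mapping each active core to its power budget (watts)."""
    budget = {}
    for (x, y) in active:
        neighbours = sum((x + dx, y + dy) in active
                         for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))
        # More active neighbours -> more mutual heating -> lower budget.
        budget[(x, y)] = base - neighbour_penalty * neighbours
    return budget
```

An isolated core keeps the full base budget, while a densely packed core is throttled, which is the qualitative behaviour TSP formalises.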

Run-Time Management Systems
The purpose of an RTM is to monitor the power budget, reserve idle cores and allocate them to applications. In case there is not enough power available for an incoming application, the application is halted until an executing application leaves the system. However, due to the dynamic nature of workloads, the number of cores available for applications may change, depending on the requirements of an application [19]. This can result in a change of layout, an increase in power consumption and longer deadlines for executing tasks. A biased RTM will result in impoverished resource allocation, limiting the maximum achievement of the system: in such a system, more stress is placed on regions where applications can be executed faster. Therefore, a run-time management system is needed which incorporates design factors such as the layout of the processor chip, heterogeneity, uncore components (NoC and cache), architecture (2D, 3D and wireless), temperature of the system and the TDP/TSP power budget [16]. In addition, the RTM needs to consider which DTM techniques to use for specific applications.
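The admission behaviour described above (admit if the power budget allows, otherwise halt until a running application departs) can be sketched as follows; the class shape and power estimates are assumptions for illustration.

```python
# RTM admission-control sketch: applications are admitted only if their
# estimated power demand fits the remaining chip budget, otherwise they
# wait until a running application leaves.

class RunTimeManager:
    def __init__(self, power_budget):
        self.budget = power_budget
        self.running = {}          # app name -> power it was granted
        self.waiting = []          # halted applications, in arrival order

    def used(self):
        return sum(self.running.values())

    def arrive(self, app, power_estimate):
        if self.used() + power_estimate <= self.budget:
            self.running[app] = power_estimate
            return True            # admitted immediately
        self.waiting.append((app, power_estimate))
        return False               # halted until power frees up

    def depart(self, app):
        self.running.pop(app, None)
        # Re-try waiting applications in arrival order; those that still
        # do not fit are re-queued by arrive().
        pending, self.waiting = self.waiting, []
        for name, power in pending:
            self.arrive(name, power)
```

A real RTM would additionally track core reservations and thermal state; only the power-budget gate is modelled here.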
Run-time Management System Algorithms. Rahmani et al. [20] propose dynamic power management with a multi-objective approach for NoC-based dark-silicon many-core platforms. The proposed management system utilises per-core power-gating and DVFS based on the following characteristics: workload, network congestion and the power performance of the cores. The management system incorporates the following components to measure these characteristics: an Application Power Calculator, an Application Processor Utilization Calculator, an Application Buffer Utilization Calculator, an Application Injection Rate Calculator, a TSP Lookup Table and a Proportional Integral Derivative Controller. There are four algorithms which can be activated in the system.
The first algorithm dynamically scales the V/F level by monitoring various feedback from the system. The second algorithm downscales the applications with the lowest priority when there is an overshoot in the system. Among the lowest-priority applications, the congested ones are chosen to be optimized first, since congested areas contribute to high power consumption. The third algorithm is used to scale up applications when there is an undershoot. In particular, priority is given to applications that were previously downscaled, are not congested and are non-intensive. Since a newly added application can push the power consumption above the TSP/TDP constraint, the fourth algorithm performs the following tasks: when a new application arrives, the algorithm checks the available power budget and determines if the new application can be mapped to nodes. After checking the new application, the algorithm predicts the power consumption of the system when the new application is executed.
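The victim-selection rules of the second and third algorithms can be sketched as simple ranking functions. The dictionary field names are assumptions for the sketch, not the data structures of [20].

```python
# Overshoot/undershoot victim selection sketch: on an overshoot the
# lowest-priority application is downscaled, preferring congested ones;
# on an undershoot, previously downscaled, uncongested applications are
# upscaled first.

def pick_downscale_victim(apps):
    """apps: list of dicts with 'name', 'priority', 'congested' keys.
    Lower 'priority' value = less important; congested apps are preferred
    among equals, since congestion adds power consumption."""
    return min(apps, key=lambda a: (a["priority"], not a["congested"]))["name"]

def pick_upscale_candidate(apps):
    """Prefer applications that were downscaled earlier and are not
    currently congested; return None if no application is eligible."""
    eligible = [a for a in apps if a["downscaled"] and not a["congested"]]
    return eligible[0]["name"] if eligible else None
```

In the full scheme these selections run inside a feedback loop against the TSP/TDP budget; only the ordering logic is shown here.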
If the application is likely to cause an overshoot, the currently running applications are scaled down to execute the application. However, if it does not exceed the TSP/TDP constraint, the new application is added without any scaling. The only disadvantage of the proposed RTM is how application tasks are mapped. The dark nodes which are used to cool down the chip are only used to separate applications. Although external heat generated from neighbouring application nodes is minimized, internal heat is ignored. The mapping algorithms could be further enhanced to contain dark nodes inside the selected region; in this way, internal heat would be minimized as well. Salehi et al. [21], on the other hand, propose a power-constrained reliability management system for dark-silicon chips (dsReliM) which considers the reliability of tasks. The model of the system is categorised into four parts: hardware architecture, application model, reliability model and power model. The hardware architecture of the system consists of heterogeneous cores which can operate at different V/F levels through DVFS. A reliability compiler is utilised in the application model to compile multiple code versions for each application task with properties such as reliability and execution time. The purpose of dsReliM is to execute applications with minimum reliability loss while meeting deadlines. Firstly, the code version with the highest reliability is chosen along with the maximum V/F level. If the selected code exceeds the TDP constraint, the V/F level is gradually adjusted. If the execution deadline is violated after adjusting the V/F, another version of the task which meets the deadline, albeit with lower reliability, is chosen. However, if there is no code version which meets the deadline, the code with the minimum execution time is chosen with a performance trade-off, and the V/F level of the selected code version is scaled down to meet the TDP constraint.
Rahmani et al. resolve the reliability problem by introducing a novel power controller unit [22]. The power controller contains an operating-mode selector which monitors the workload or intensity of the system and selects one of the following modes for the system to operate in: overboosting mode and reliability-aware mode. The overboosting mode is selected when there are highly intensive applications which require full system operation without considering the reliability performance. The reliability-aware mode, in contrast, is selected for applications with low priority. Unlike [21], in this mode the system operates at a feasible V/F level where thermal hotspots are considered as well as good performance.
Haghbayan et al. [23] also proposed a reliability-aware resource management scheme for many-core systems which prioritises younger cores over older cores. The proposed solution consists of two units: a Reliability Analysis Unit (RAU) and a Runtime Mapping Unit (RMU). The RAU monitors the ageing information/status of all the cores. The RMU, on the other hand, takes into account the ageing status of the cores provided by the RAU and the total power consumption of the system provided by a power monitor before mapping applications to cores. MapPro [24] is used to locate the first node; however, regions with busy cores are ignored during the application mapping stage. Furthermore, during application mapping, a reliability factor metric is applied to prioritize the selection of younger cores for performance enhancement.
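The core-ranking step can be sketched as below. The accumulated-stress counter is an illustrative stand-in for the RAU's reliability model, and the function name is an assumption.

```python
# Reliability-factor sketch: when selecting cores for a new application,
# rank the free cores so that younger (less aged) cores are preferred.

def select_young_cores(core_age, busy, n):
    """core_age: dict core -> accumulated ageing metric (higher = older).
    busy: set of cores currently executing tasks.
    Returns the n youngest free cores, or None if not enough are free."""
    free = [c for c in core_age if c not in busy]
    if len(free) < n:
        return None
    return sorted(free, key=core_age.get)[:n]
```

In the actual scheme this ranking is combined with the region search of MapPro; only the age-based prioritisation is modelled here.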
Similarly, Khan et al. [25] present a hierarchical budgeting scheme which distributes resource and power budgets to clusters based on the system workload. Firstly, the scheme determines the number of cores required for an application to be executed successfully. At the inter-cluster level, several factors are used to determine which cluster is allocated more power. One factor is the number of cores in a cluster; another is the history of an application. For example, an application with a history of requiring high power consumption is allocated more power at the next epoch. At the intra-cluster level, since different types of threads require different amounts of power for execution, in video applications, data tiles which consist of high-motion content are allocated more power. Therefore, within a cluster, cores are allocated power individually based on their data tiles.
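The inter-cluster half of this idea can be sketched as a proportional split. The multiplicative weighting of core count and power history is an illustrative assumption, not the policy of [25].

```python
# Hierarchical budgeting sketch: split the chip power budget across
# clusters in proportion to their core counts and their applications'
# recorded power draw from the previous epoch.

def split_cluster_budget(total, clusters):
    """clusters: dict name -> {'cores': int, 'history': last-epoch power}.
    Returns dict name -> power budget for the next epoch."""
    weight = {c: v["cores"] * v["history"] for c, v in clusters.items()}
    norm = sum(weight.values())
    return {c: total * w / norm for c, w in weight.items()}
```

A cluster with more cores, or whose applications drew more power last epoch, receives a proportionally larger share, and the shares always sum to the chip budget.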
Likewise, Yang et al. [26] proposed a run-time management system to handle a scalable hardware topology based on a quad-core cluster. The quad-core cluster is a tile-based architecture which consists of heterogeneous cores (High Performance (HP), General Purpose (GP), Power Saving (PA) and Low Energy (LE)) and a shared cache within each cluster. The purpose of having different cores is to utilise them based on the incoming application requirements. For example, the HP cores are used for high workloads and thus consume the most power, while the LE cores do the opposite. Only one core is activated in each cluster to keep the temperature at a minimum. Each core in a cluster supports various V/F levels which can meet an application's demand. In addition, idle cores are turned off to keep the temperature under the safe value. Furthermore, the active cores are physically decentralised to avoid concentrated heat dissipation.

Application Mapping
As previously stated, application mapping ensures that specific nodes are selected on the chip for mapping. This can be done in several ways [17,23,27-29]. Different mapping algorithms produce different results (temperature effects and power budget). The selection of the correct nodes enables more nodes to be activated to accommodate more applications and run tasks faster.
In DSCSs, application mapping proceeds in two stages. For clarification purposes, we refer to the first stage as region mapping and the second stage as task mapping. Region mapping is the process of finding a particular region on the chip with sufficient nodes available for task mapping. Task mapping, on the other hand, refers to the process of identifying and assigning tasks to preferred nodes from the pool of nodes found during region mapping. The most applied method for region mapping is to find an optimal node and then map tasks to surrounding nodes to form a rectangular/square-shaped mapping. In practice, MapPro is used by many [17,23] to automatically calculate and determine such regions.
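The two stages can be sketched as follows. This is an illustrative simplification: the first node is assumed given (e.g. by MapPro), the region is anchored at that corner, and grid handling is minimal.

```python
# Two-stage mapping sketch: region mapping claims the smallest square of
# nodes holding all tasks, then task mapping assigns tasks to its nodes.

import math

def square_region(first, n_tasks, grid_w, grid_h):
    """Return the list of (x, y) nodes forming the smallest square region
    with at least n_tasks nodes, anchored at first = (x, y), or None if
    the square would fall off the grid."""
    side = math.ceil(math.sqrt(n_tasks))
    x0, y0 = first
    if x0 + side > grid_w or y0 + side > grid_h:
        return None
    return [(x0 + dx, y0 + dy) for dy in range(side) for dx in range(side)]

def map_tasks(tasks, region):
    """Task mapping: assign each task to a node of the region in order."""
    return dict(zip(tasks, region))
```

Real schemes pick the anchor to minimise fragmentation and order the tasks by communication volume; both refinements are omitted here.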
Subsequently, existing application mapping algorithms can be categorized into two groups: contiguous mapping and non-contiguous mapping. Figure 4 shows the impact of contiguous and non-contiguous mapping. Contiguous mapping is the process of activating nodes in one region for an application to be mapped, to reduce communication overhead between tasks. Non-contiguous mapping, on the other hand, assigns application tasks to any available node.
In practice, some techniques monitor the temperature of nodes periodically in order to map incoming applications. These techniques predict and estimate the temperature and produce optimized mappings with minimum chip temperature [30].
J. Wang et al. [28] propose an Ant-Colony Optimization (ACO) based thermal-aware thread-to-core mapping. The ACO-based algorithm releases an ant colony into the system, and each ant conducts a thread-to-core mapping individually. After the mappings are conducted, the ant whose mapping best minimizes the Chip Multi-Processor (CMP) peak temperature is chosen. Similarly, Wang et al. [27] also proposed a thread-to-core mapping management system. However, this mapping system uses two different types of virtual mapping algorithms to estimate the performance of applications when different numbers of dark cores are used. Upon the arrival of a new application, the virtual mapping process is used to estimate the performance of the application with different numbers of dark cores. In addition, the mapping system consists of two different modes: computational and communication. Applications which are affected by their task computation performance are mapped as far away from each other as possible, while applications whose performance is affected by their communication volume are mapped closer to each other. With this algorithm, it is possible to migrate a task from core to core to harness the best performance.
Li et al. [30] propose a Mixed Integer Linear Programming (MILP) thermal model which monitors all nodes and maps applications while minimizing the temperature. The proposed algorithm works by predicting the temperature of the chip when applications are mapped. The approach sorts all applications into groups and executes them starting from their highest V/F down to their lowest V/F. During this process, MILP is used to find the best optimized mapping with the lowest temperature. In addition, an efficient algorithm is proposed to release some applications from the list in case they violate the temperature threshold.
Application Mapping Techniques. Contiguous mapping algorithms are the preferred choice for application mapping because non-contiguous mapping techniques do not consider the increase in communication latency between tasks which require inter-task communication. Contiguous mapping, on the other hand, ensures that tasks are mapped to cores located in the same region. However, because the nodes are aligned directly next to each other, the heat dissipated by each node affects its neighbouring nodes, gradually increasing the temperature of the system as applications are being executed. Furthermore, because contiguous mapping demands that an application be mapped in one region, an incoming application may be forced to wait when there are insufficient nodes available for it to be mapped in one region. In fact, in some cases, the application will be non-contiguously mapped to free nodes to satisfy the application deadline [31].
Nonetheless, to accommodate more applications on the chip, Ng et al. [31] propose an optimized technique called defragmentation. Defragmentation ensures that all the free nodes which are dispersed across the chip are gathered into one region so that an application can be mapped contiguously. Similarly, X. Wang et al. [32] introduced an application mapping algorithm which dynamically adjusts and shifts tasks onto different nodes for a contiguous mapping to take place. The proposed algorithm relocates tasks to different nodes to accommodate a new application in a square-shaped region.
Unfortunately, accommodating more applications means more hot regions on the chip, as shown in Figure 4. For this purpose, Kanduri et al. [17] presented an optimized application mapping and patterning algorithm for DSCSs based on MapPro, in which some nodes within the region are designated as dark nodes to cool their active counterparts. During task mapping, the task with the highest communication volume is mapped to the first node. This process is repeated until the last task is assigned. After this process, one node in the square region is left unoccupied and used as a dark node to avoid a hotspot.
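The patterning step can be sketched as below. The choice of the centre-most node as the dark node and the region/task shapes are assumptions for the sketch, not the exact placement rule of [17].

```python
# Dark-node patterning sketch: within a claimed region, assign tasks in
# decreasing order of communication volume and deliberately leave one
# node unoccupied as a dark node to break up the hotspot.

def map_with_dark_node(task_volumes, region):
    """task_volumes: dict task -> communication volume.
    region: list of nodes; must have one more node than there are tasks.
    Returns (placement, dark_node)."""
    assert len(region) == len(task_volumes) + 1
    # Highest-communication task is placed first.
    order = sorted(task_volumes, key=task_volumes.get, reverse=True)
    # Leave the centre-most node dark so heat is broken up inside the region.
    dark = region[len(region) // 2]
    nodes = [n for n in region if n != dark]
    return dict(zip(order, nodes)), dark
```

Leaving the dark node inside the region, rather than only at its border, addresses the internal-heat limitation noted earlier for separation-only schemes.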
Similarly, Aghaaliakbari et al. [33] propose a contiguous mapping algorithm which positions dark nodes in between application tasks to reduce heat. Rezaei et al. [34] also proposed a contiguous mapping algorithm, called Round Rotary mapping, which targets a hybrid wireless NoC virtually divided into regions. The proposed algorithm maps applications in a round-robin fashion to evenly distribute applications all over the chip.
Moreover, by placing dark nodes in contiguous mapping algorithms, the hot regions could be reduced, thus allowing more nodes to be activated. In addition, prioritising younger cores is also an essential technique, because ageing cores dissipate more heat when they are stressed. Furthermore, nodes are able to perform at peak V/F levels when dark nodes are activated near them. Therefore, it would be beneficial for contiguous mapping algorithms to incorporate dark-silicon patterning approaches for a trade-off between communication cost and hot regions by efficiently positioning dark nodes in between tasks. Another alternative would be to incorporate heterogeneity in such a way that different resources are used to perform different computations. Every application has its own requirements for executing tasks: compute-intensive applications require more power to execute, while communication-intensive tasks require close connection with other tasks. These various applications could be executed using different nodes.

Architectural Heterogeneity
It has been shown [35] that incorporating heterogeneity through diverse materials, which offer extra power savings in DSCSs, reduces the dynamic and leakage power consumption at the cost of a slight degradation in performance [36]. For this purpose, many techniques have been proposed in the literature that combine different materials, sizes, etc. to free more power for actual computation. Shafique et al. [19] conducted a survey of the challenges in dark-silicon trends. In the survey, Shafique addresses the challenges of dark-silicon by presenting factors which demand high emphasis when designing a system; in particular, high emphasis is placed on the importance of incorporating heterogeneity.
Heterogeneous Cores. Zhang et al. [36] demonstrated that employing diverse materials to form processors can lead to fewer dark areas on the chip. Zhang showed that by integrating High-K (big cores) and NEMS-CMOS (small cores), the processor can operate more efficiently than conventional CMPs formed from one material. In addition, because NEMS-CMOS consumes less power and generates less heat, the power density is smaller compared to CMPs formed from a single material.
Yang et al. [26] approached the use of heterogeneity from a different design aspect by introducing a quad-core cluster architecture which is organised not around the size of the cores but rather around the purpose of each core. The quad-core cluster architecture consists of four different types of heterogeneous cores: High Performance (HP), General Purpose (GP), Power Saving (PA) and Low Energy (LE). The integration of these cores allows different types of applications to be executed on different cores depending on the workload. For example, in this architecture, the HP cores are used for intensive workloads and consume the highest power, while the LE cores are used for workloads which consume the lowest power.
Power consumption is one major factor contributing to the heat dissipated by the on-chip resources. The amount of heat generated by the resources is proportional to the amount of power consumed by each resource. Incorporating components which consist of power-hungry elements results in high power consumption, which increases the amount of heat being generated. By incorporating heterogeneity, power-saving materials can be used to form low-power architectures.
Moreover, one common trait that all the techniques we have reviewed share is that, to reduce power consumption or temperature, they ignore uncore components such as the Last-Level Cache (LLC) and the routers of the NoC. Ignoring these components results in an increase in heat, since they contribute to the power consumption. Additionally, these components consume a significant amount of on-chip power and therefore impact heavily on the power budget. To ensure that the power budget allocated for a specific chip is sufficient for high-performance computation, we target reducing the power consumption in the NoC interconnect and the memory sub-system without performance degradation.

The Dominance of Uncore Components in Dark-Silicon Constraint Systems: The NoC Interconnect and Cache Architecture
The power dominance of uncore components (the memory hierarchy (L2/L3 caches), memory controllers (MCs) and the interconnect) is often ignored in DSCSs, with the majority of the power budgeting techniques found in the literature (V/F scaling, power-gating, dynamic cache resizing and pipeline reconfiguration) targeted either at the processor level or the chip level. Therefore, for more power to be available for executing applications, the power consumption of on-chip components needs to be addressed. As noted earlier, multi-level caches and the NoC have replaced the traditional buses and single-level caches. The NoC supplies a high level of parallelism through multiple working routers and links, clustered together with cores and caches to form nodes. With the introduction of these components in many-core systems, to which they scale proportionately, processing power is set to increase [37]. Figure 5 depicts a DSCS node comprised of 3 cores, a cache and a router. It is therefore important to address the power consumption of these uncore components as many-core systems dominate modern technology. Evaluation results obtained with McPAT [38] show that uncore components are responsible for nearly half of the chip's total power budget, with the LLC and NoC interconnect being the largest consumers.
Caches suffer from high leakage in their storage cells, caused by the size reduction in emerging chip resources. The NoC's power consumption, on the other hand, is down to its power-hungry elements. This problem becomes even worse when computational sprinting is applied to the cores [39-41]. This affects the total power budget and makes it hard for more power to be used for actual computation. This part of the survey presents techniques which can be applied to reduce the power consumption of these components; but first, we introduce background information on each component. The NoC scales along with the size of the architecture and therefore amplifies the throughput in line with the system performance. A typical NoC architecture is comprised of routers and links. In the many-core system, the NoC is used to form nodes, as depicted in Figure 5. Routers communicate with each other through the links, which establish multiple access and communication channels between a source and a destination. However, the switching activity of transistors during transmission causes an increase in dynamic power consumption, and with leakage power already dominating power consumption, the overall chip temperature rises [46-51]. Consequently, the router architecture has been identified by many as the main component responsible for the majority of the NoC's total power. However, with the continued shrinkage of technology, the power consumption in the links has also increased along with the workload. NoC routers are composed of power-consuming elements such as buffers, arbiters, crossbars, and input and output ports. As much as all these elements consume the power budget of the NoC, the buffers and the crossbars are identified as the main culprits for exceeding the power constraints [47,52]. With the increasing dark fractions in many-core systems, a significant portion of power can be saved through optimized algorithms and components. However, many factors must be considered before proposing schemes relating to the buffers. One thing to consider is that the absence of buffers provokes network congestion, leading to high latency, while on the other hand excessive use of buffers aggravates the chip's power consumption. Therefore, a balance is required.

Run-time Power Management Techniques
For NoC Architectures. Modarressi et al. [53] propose a NoC architecture in which packets can bypass the dark regions of the chip. The proposed algorithm takes the CTG of an input application and the number of active cores and establishes virtual long links among them. Furthermore, the router architecture of the dark nodes is optimized as follows: firstly, the short-cut path of an input port that allows the pipeline stages of a router to be bypassed is selected; secondly, incoming flits are buffered using a register along a virtual long link established between two active nodes; thirdly, the register indicates which output is part of the virtual link and which input port should be assigned to it. Bypassing the pipeline stages reduces power consumption, as power-hungry elements such as buffers and virtual channels are avoided. In dynamic system workloads, the number of available nodes changes with the arrival and departure of applications. In theory, active nodes can run at a higher frequency level if dark nodes are located nearby to dissipate heat, which ultimately helps leverage the temperature of the system. Unfortunately, the downside is the communication latency between active cores. The proposed design minimizes this latency, thus enhancing the performance of the system.
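The latency benefit of such virtual long links can be illustrated with a small model. The sketch below is not the authors' implementation: the XY routing, the per-hop cycle counts and the mesh layout are all illustrative assumptions.

```python
# Illustrative model of virtual long links over dark routers, in the spirit of
# Modarressi et al. [53]. Latency figures are assumed values, not measured ones.

FULL_PIPELINE_CYCLES = 4   # assumed cycles per hop through a full router pipeline
BYPASS_CYCLES = 1          # assumed cycles per hop through a dark router's bypass register

def hop_path(src, dst):
    """XY-route on a 2D mesh: list of hops from src to dst, excluding src."""
    x, y = src
    path = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def path_latency(src, dst, dark_nodes):
    """Packet latency: dark routers are traversed via the cheap bypass register."""
    return sum(BYPASS_CYCLES if hop in dark_nodes else FULL_PIPELINE_CYCLES
               for hop in hop_path(src, dst))

dark = {(1, 0), (2, 0)}                      # powered-off nodes between two active ones
print(path_latency((0, 0), (3, 0), dark))    # 1 + 1 + 4 = 6 cycles
print(path_latency((0, 0), (3, 0), set()))   # 12 cycles through full pipelines
```

Under these assumed numbers, a three-hop path whose two intermediate routers are dark costs 6 cycles instead of 12, which is the kind of saving the virtual long link is designed to capture.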
Bokhari et al. [54] propose the Malleable NoC for DSCS CMPs. In the proposed architecture, each node contains multiple heterogeneous routers whose frequencies and voltages can be altered. Depending on the behaviour of an application, a router from each node is selected to form a low-power NoC fabric while idle routers are switched off. Sharifi et al. [5] propose PEPON, a power budget distribution mechanism that shares the chip-wide power budget among the chip's resources (cores, caches and NoC) based on the workload, optimizing performance while respecting each resource's allocated budget.
Reducing Power Consumption in the NoC Router Architecture: Buffers. Input buffers account for the majority of power consumption in the router architecture [52,55]; reducing their power consumption therefore reduces the power consumption of the chip. The following techniques either avoid the use of input buffers at run time or activate and deactivate them depending on the workload. An effective way to reduce power consumption is to reduce the number of pipeline stages that packets traverse to reach their destination: fewer pipeline stages reduce dynamic power as well as workload latency [56,57].
Alternatively, virtual channels are employed in buffers to enable parallelism within one router. The Traffic-Based Virtual Channel Algorithm introduced in [58] organises the virtual channels of a switch port into various cells. By grouping them, some of these cells can be powered on or off based on the network traffic and congestion.
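The on/off grouping decision can be sketched as a simple occupancy-driven policy. This is an illustrative assumption of how such a controller might look, not the actual algorithm of [58]; the thresholds and group count are made-up parameters.

```python
# Illustrative traffic-based VC group gating: whole groups of virtual-channel
# cells are powered on or off depending on observed buffer occupancy.
# Thresholds and group count are assumed tuning parameters, not values from [58].

def active_vc_groups(occupancy, total_groups=4, low=0.25, high=0.75):
    """Return how many VC groups to keep powered on.

    occupancy: fraction of buffer slots currently in use (0.0 - 1.0).
    """
    if occupancy >= high:        # heavy traffic: enable every group
        return total_groups
    if occupancy <= low:         # light traffic: keep a single group awake
        return 1
    # in between, scale the number of powered groups with the load
    return max(1, round(occupancy * total_groups))

print(active_vc_groups(0.1))   # 1
print(active_vc_groups(0.5))   # 2
print(active_vc_groups(0.9))   # 4
```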
Virtual channels allow multiple flows to share one physical channel; however, they consume power. To reduce the power consumption of virtual channels, Zhan et al. [59] propose an algorithm which categorises the virtual channels of a switch port into different levels: a lower level designed with SRAM and another designed with STT-RAM. The use of STT-RAM trades off dynamic power for leakage power, which can be tolerated. In addition, the algorithm allows the SRAM level to be powered on, powered off or left in a drowsy state, and in case of heavy traffic the STT-RAM levels are activated. Nasirian et al. [60], on the other hand, employ a power-gating control unit to disable buffers when they have been inactive for a number of cycles. However, power-gating can cause a performance penalty, and system performance therefore needs to be considered. Firstly, constantly turning routers on and off incurs a non-negligible power overhead. Secondly, a switched-off router blocks all paths that intersect it, so arriving packets must wait for the router to be powered on before traversing to the next router. To mitigate this, Power Punch is presented in [61], which sends wake-up signals several hops ahead of a packet so that switched-off routers along its path are powered on before it arrives, or active routers are kept awake.
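An idle-counter power-gating control unit of the kind attributed to [60] can be sketched as follows. The idle threshold and the single-cycle wake-up penalty modeled here are illustrative assumptions, not parameters from the paper.

```python
# Minimal sketch of an idle-counter power-gating controller for a router buffer,
# loosely following the idea of Nasirian et al. [60]. The threshold is assumed,
# and the wake-up cost is simplified to a one-cycle stall.

IDLE_THRESHOLD = 8    # cycles a buffer may sit idle before it is gated (assumed)

class BufferPowerGate:
    def __init__(self):
        self.idle_cycles = 0
        self.gated = False

    def tick(self, has_traffic):
        """Advance one cycle; return True if the buffer is usable this cycle."""
        if has_traffic:
            self.idle_cycles = 0
            if self.gated:              # packet hit a gated buffer:
                self.gated = False      # wake it, stalling for this cycle
                return False
            return True
        self.idle_cycles += 1
        if self.idle_cycles >= IDLE_THRESHOLD:
            self.gated = True           # gate the buffer after a long idle run
        return not self.gated
```

The stall returned on wake-up is the performance penalty the text warns about: every gating decision saves leakage but risks delaying the next arriving packet.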
Another alternative to input buffers is the concept of bufferless routing. Bufferless routers [62][63][64][65] have emerged as one possible solution to leakage power consumption in routers. Unfortunately, due to the performance bottlenecks that occur in bufferless router architectures, this technique is often disregarded. Buffers are used as temporary storage for packets which cannot be transmitted immediately; in their absence, packets are deflected, which can lead to livelock and also increases power consumption. For this reason, some techniques introduce heterogeneous architectures comprising both buffered and bufferless routers.
Fang et al. [66] propose a heterogeneous NoC architecture comprising buffered and bufferless routers. Results show that the combined use of these routers reduces power consumption by 42%, and this reduction allows more nodes to be activated compared to a generic buffered design.
Naik et al. [67] introduce a heterogeneous NoC comprising circuit-switched buffered and bufferless routers. This heterogeneous approach reduces power consumption by 26% and area by 32%.
Kodi et al. [68], on the other hand, introduced a dual-function links architecture called iDeal which, unlike input buffers, does not consume a lot of power. iDeal uses dynamic router buffer allocation to assign incoming flits to any available buffer.
Similarly, DiTomaso et al. [52] propose a power-efficient architecture called QORE, which saves power through the use of multi-function channel buffers and enhances performance through reversible links. Li et al. [69], on the other hand, replace the traditional SRAM with 3T_N eDRAM, resulting in a 52% reduction in power consumption and a 43% reduction in area.

Reducing Power Consumption in the Cache Architecture
On-chip cache memories account for a significant portion [70][71][72][73][74][75][76][77] of power consumption in embedded devices. Therefore, for mobile devices that run on batteries, efficient power optimization techniques are in high demand as transistor sizes progressively decrease. The introduction of MCA trades off performance for power, expanding the fraction of chip area and on-chip power that caches account for [41,70,78]. This increase in chip area and power can lead to thermal and reliability issues; reducing cache power consumption avoids these and increases the power budget available for actual computation. While Last Level Caches (LLC) account for the majority of leakage power due to their relatively large sizes, the First Level Cache (FLC) dominates dynamic power. Accordingly, architectural techniques which target leakage power switch parts of the caches off, and those targeting dynamic power minimize transistor activity during cache accesses, both at the expense of performance. For lower-level caches it is practical to access the meta-data and data arrays sequentially to save energy, because very few accesses occur [79]; indeed, the lower levels are only accessed on a cache miss. For FLC caches, however, sequential access incurs a performance penalty. Figure 7 depicts a typical cache architecture in many-core CMPs: L1 is generally referred to as the FLC, and L2/L3 as the LLC.

Figure 7. Cache Architecture
Buffers and caches are similar in that both are used as temporary storage, and both consume a lot of power. Reducing the sizes of these two components can drastically affect the performance of the system. Chakraborty et al. [80] conducted a survey on caches and concluded that turning off cache banks trades performance for power. Their results show that decreasing the cache banks from 16 to 8 caused a massive degradation in performance, whereas with 12 banks the degradation is not as severe. Although power is reduced by shutting down cache banks, conflict misses increase. In this section of the paper, we present techniques for reducing power in both the FLC and the LLC.
First Level Cache Power Consumption. First-Level Caches (FLC) are generally optimised for performance with less emphasis on power efficiency, due to the impact and importance of high associativity; designers therefore trade off power consumption for high performance in FLCs. To improve memory performance, Set-Associative Caches (SAC) are employed, allowing a block to be stored in any way of its set, which reduces cache miss rates and improves system performance. To avoid the latency of comparing the tag against each way before reading data, the data ways are accessed in parallel with the tag lookup. Unfortunately, during this process, power is wasted reading meta-data and accessing all the ways when only one of them holds the requested data. Consider a 16-way set-associative cache: a significant amount of power is wasted accessing all 16 ways when the required data resides in only one.
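The 16-way example can be made concrete with a back-of-the-envelope energy model. The per-access energy units below are made-up illustrative values, not measurements; the point is only the ratio between parallel and sequential access.

```python
# Back-of-the-envelope model of why parallel way access wastes power in a
# set-associative FLC. Energy numbers are arbitrary illustrative units.

TAG_READ = 1.0     # assumed energy to read one tag
DATA_READ = 4.0    # assumed energy to read one data way

def parallel_access_energy(ways):
    """All tags and all data ways are read; only one data way is useful."""
    return ways * (TAG_READ + DATA_READ)

def sequential_access_energy(ways):
    """Tags first, then only the matching data way (at the cost of an extra cycle)."""
    return ways * TAG_READ + DATA_READ

print(parallel_access_energy(16))    # 80.0
print(sequential_access_energy(16))  # 20.0
```

Under these assumed numbers, sequential access cuts per-hit energy by 4x for a 16-way cache, which is exactly the saving the tag-lookup techniques below try to obtain without paying the extra cycle.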
To address the power loss in set-associative FLCs, several techniques have been proposed that trade power savings for design complexity or an increase in latency. These techniques can be classified into two categories: tag lookup and voltage scaling.
A. Tag Lookup: Some of the proposed techniques perform the tag lookup and the data access sequentially; unfortunately, this increases cache latency. Others store way information and retrieve it before the FLC is accessed, avoiding the need to access all ways. Performing tag lookup and reading tags introduce extra cycles, which consume more power [81]. Optimized techniques are therefore essential, because FLC performance plays an important role in processor efficiency: on a cache miss, the system suffers a performance penalty which further increases power consumption.
For this purpose, Zhang et al. [82] propose Early Tag Lookup (ETL) for FLC instruction caches. Unlike existing two-phase methods, the proposed algorithm determines the matching way one cycle earlier than the actual cache access, eliminating non-matching way accesses without sacrificing performance. The technique retains two instruction fetch addresses: the current fetch address, stored in the program counter, and the next, stored in the next program counter. The matching way is determined by looking up the tag array using the next program counter, so that when the address is loaded into the program counter the matching way is already known and can be accessed without touching the other ways.
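The two-phase flow can be sketched as below. This is a hypothetical toy model of the idea, not the authors' design: the cache geometry (64-byte lines, 64 sets, 4 ways) and the class interface are illustrative assumptions.

```python
# Hypothetical sketch of the Early Tag Lookup idea from Zhang et al. [82]:
# the tag array is probed with the *next* fetch address, so by the time the
# program counter reaches that address the matching way is already resolved.
# Cache geometry and interface are illustrative assumptions.

class ETLInstructionCache:
    def __init__(self, ways=4, sets=64, line_bytes=64):
        self.ways, self.sets, self.line_bytes = ways, sets, line_bytes
        self.tags = {}              # (set_index, way) -> tag
        self.predicted_way = None

    def lookup_way(self, addr):
        set_index = (addr // self.line_bytes) % self.sets
        tag = addr // (self.line_bytes * self.sets)
        for way in range(self.ways):
            if self.tags.get((set_index, way)) == tag:
                return way
        return None                 # miss

    def early_lookup(self, next_pc):
        """Phase 1 (one cycle early): resolve the matching way for next_pc."""
        self.predicted_way = self.lookup_way(next_pc)

    def fetch(self, pc):
        """Phase 2: touch only the pre-resolved way instead of all ways.

        Returns (way, ways_accessed); on a miss all ways must still be probed.
        """
        ways_accessed = 1 if self.predicted_way is not None else self.ways
        return self.predicted_way, ways_accessed

cache = ETLInstructionCache()
cache.tags[(0, 2)] = 1              # pretend address 0x1000 is resident in way 2
cache.early_lookup(0x1000)          # done while the previous fetch completes
print(cache.fetch(0x1000))          # (2, 1): one data way accessed, not four
```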
Similarly, Dai et al. [83] proposed an early tag access cache technique which determines the location of most memory instructions before the FLC data cache is accessed. The technique stores part of the physical address in the tag arrays while the translation between virtual and physical addresses is performed by the Translation Lookaside Buffer (TLB). This data is used to locate the destination of a memory instruction while it sits in the Load/Store Queue, before the FLC data cache is accessed, avoiding the need to access all ways.
Valls et al. [84], on the other hand, proposed the Tag Filter Cache which, unlike the first two techniques, can be applied to all levels of the cache hierarchy. The architecture filters the number of tags and data blocks to be checked when accessing the cache hierarchy by using the least significant bits of the tag part of the address to determine which ways to access, reducing power consumption by between 74% and 85.9%. Sembrant et al. [85] proposed an extended TLB which provides the location of cache lines in the data array by adding extra way index information (way location and location of cache lines). This reduces extra data array reads and avoids tag comparisons.
In contrast, Bardizbanyan et al. [81] argue that sequentially accessing ways by predicting the location of memory instructions affects performance by incurring extra cycles due to additional clock switching. They therefore propose load data dependency detection, a technique which decides when to access the FLC data sequentially based on the data dependency of the load. Similarly, Dayalan et al. [86] propose a technique which dynamically selects the best cache associativity during execution, employing shadow tags to monitor how the cache would have performed had it been operating in the other mode.
In a conventional MCA, the FLC data cache and the write buffer are accessed in parallel for the same data, and during a write/read miss both are updated. Lee et al. [70] proposed an architecture which functions in the opposite way: during write operations, only the write buffer is updated, and the FLC is updated only when data is retired or the write buffer is full.
B. Voltage Scaling: Alternatively, downscaling the supply voltage close to the transistors' threshold can effectively reduce power consumption in FLCs. However, as a result of operating below the safe margin, persistent faults occur, caused by voltage and temperature variations. Techniques which employ near-threshold scaling therefore utilize Error Correcting Codes (ECC) to overcome this challenge. The ECC encoder generates parity bits when a data line is updated; when the line is read, the decoder regenerates the parity bits to check for and correct any faults. This process requires extra cycles and consumes power, causing a performance penalty which FLCs cannot tolerate. This gets even worse when the fault rate is very high, which is common in near-threshold scaling [87,88].
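The encode/check/correct round trip described above can be demonstrated with the simplest single-error-correcting code. This is a textbook Hamming(7,4) toy, chosen only to illustrate the mechanism; it is not the SEC-MAEC code of [87,89].

```python
# Textbook Hamming(7,4) single-error correction, illustrating the ECC round
# trip the section describes: parity bits are generated on a write and
# regenerated on a read to locate and flip a faulty bit.

def hamming74_encode(d):                     # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]                  # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                  # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                  # covers positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):                     # c: 7-bit codeword, possibly corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3          # 1-based position of the flipped bit
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1                 # correct the single-bit fault
    return [c[2], c[4], c[5], c[6]]          # extract the data bits

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                                  # inject a single-bit error
print(hamming74_decode(code) == word)         # True
```

The extra XOR work in the decoder is the latency and power overhead the text refers to; SEC-MAEC-style codes restructure it so that correction fits in one cycle (or half a cycle in [87]).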
For this purpose, Reviriego et al. [89] proposed a Single Error Correction – Multiple Adjacent Error Correction code (SEC-MAEC) to correct faults in one cycle. Similarly, Yalcin et al. [87] proposed an improved version of SEC-MAEC which corrects faults in half a cycle, reducing the encoding and decoding latency by up to 80%.
Hijaz et al. [88] proposed a hybrid FLC architecture which can operate in two different modes (cache line disable, and correction and disable), switching between them to preserve system performance. Cache line disable allows the FLC to function at near-threshold voltages: cache lines which suffer from high bit error rates are shut down, limiting the cache capacity. If the capacity loss becomes too high, the FLC utilizes SEC-MAEC to correct the faulty cache lines and re-enable them. Saito et al. [90] proposed an FLC architecture which operates at different speeds through Dynamic Voltage and Frequency Scaling (DVFS), dynamically selecting the right speed depending on the performance required. Yan et al. [91] proposed two techniques which permit voltage scaling in FLCs (data and instruction caches) without an access-latency penalty: the first, Fault-Free Window (FFW), reduces the effect of defective words by permitting cache lines to store only the most likely accesses, while the second prevents the core from accessing defective words. Das et al. proposed a replacement policy which prioritises remote blocks to remain in the FLC to avoid latency, reducing power consumption by 14.85%.
Last Level Cache Power Consumption. As previously mentioned, LLCs occupy a large chip area and consume the majority of leakage power because of their size. To improve their power efficiency, several techniques have been put forward. These can be classified into two categories: hybrid architectures and cache performance.

A. Hybrid Architectures:
STT-RAM has been widely touted as a conventional material for LLC cache design. With features such as high density, low power consumption and good performance, STT-RAM is able to deliver performance close to that of SRAM. However, STT-RAM suffers extensively from dynamic power consumption during write accesses, and its read latency also becomes an issue when it is implemented in FLCs.
Komalan et al. [92] proposed an NVM FLC with a very wide buffer to mitigate the read latency. The proposed architecture offers more area with a reduction in power consumption. However, it is unclear how the architecture would perform under heavy workloads, as it was only evaluated under light ones.
Similarly, Wang et al. [93] proposed a hybrid STT-RAM and SRAM FLC. The design incorporates the MESI cache coherence protocol to manage block relocation between the SRAM and STT-RAM partitions. However, because system performance is closely tied to the FLC, to the best of our knowledge not much work has been done on designing FLCs out of STT-RAM. For LLC caches, the solution most designers consider optimal for MCA is to employ both SRAM and STT-RAM [94][95][96][97], combining the benefits of both technologies to overcome the challenges each brings. Li et al. [94], Kim et al. [98] and Safayenikoo et al. [99] all proposed architectures which incorporate STT-RAM and SRAM. Li's design focuses on sharing private STT-RAM groups with neighbouring nodes to reduce latency and power consumption. Kim's architecture includes an algorithm which decides in which region of the cache (STT-RAM or SRAM) data should be placed. Safayenikoo's cache architecture moves data to the SRAM blocks when the write energy in the STT-RAM increases.
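The common thread of these placement schemes is steering write-heavy blocks toward SRAM and read-mostly blocks toward STT-RAM. The sketch below is an illustrative policy in that spirit; the threshold is an assumed tuning parameter, not a value from [94], [98] or [99].

```python
# Illustrative placement policy for a hybrid SRAM/STT-RAM LLC: write-heavy
# blocks go to SRAM (cheap writes), read-mostly blocks to STT-RAM (low leakage,
# high density). The threshold is an assumed parameter.

WRITE_INTENSITY_THRESHOLD = 0.5   # fraction of accesses that are writes

def choose_region(reads, writes):
    """Pick the cache region for a block given its observed access mix."""
    total = reads + writes
    if total == 0:
        return "STT-RAM"                      # cold block: favour low leakage
    if writes / total > WRITE_INTENSITY_THRESHOLD:
        return "SRAM"                         # write-heavy: avoid costly STT-RAM writes
    return "STT-RAM"

print(choose_region(reads=90, writes=10))     # STT-RAM
print(choose_region(reads=10, writes=90))     # SRAM
```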
Asad et al. [100] introduced a heterogeneous cache memory hierarchy for CMPs in which each cache level is designed with a different memory technology (Static Random Access Memory (SRAM), Embedded Dynamic Random Access Memory (eDRAM) and STT-RAM). Similarly, Onsori et al. [101] proposed a hybrid memory system for DSCSs comprising NVM devices, in which STT-RAM memory banks are incorporated alongside SRAM memory banks.
B. Cache Performance: Alternatively, power-gating techniques are employed to disable idle parts of the cache under minimal workload. However, shutting down idle parts of a cache can incur performance penalties which can exacerbate the power being dissipated [102]. To ensure that power-gating does not significantly threaten performance, the least-used banks are powered off and their requests forwarded to neighbouring banks [102]. Other techniques power off cache ways instead of banks. Azad et al. [103], on the other hand, reduce power consumption by categorising cache blocks into different groups and applying ECC based on the level of protection required.
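The bank-gating-with-forwarding idea attributed to [102] can be sketched as a utilisation-ranked policy. This is a hypothetical illustration: the cutoff, the minimum number of live banks and the distance model are all assumptions.

```python
# Hypothetical LLC bank power-gating with request forwarding: the least-utilised
# banks are gated and their requests redirected to the nearest powered-on bank.
# Cutoff and floor are assumed parameters, not values from [102].

def gate_banks(utilisation, keep_at_least=2, cutoff=0.1):
    """Split bank ids into (powered_on, gated) from per-bank utilisation."""
    ranked = sorted(utilisation, key=utilisation.get, reverse=True)
    busy = [b for b in ranked if utilisation[b] >= cutoff]
    powered = ranked[:max(len(busy), keep_at_least)]   # never gate below the floor
    return powered, [b for b in ranked if b not in powered]

def forward(bank, powered_on, distance):
    """Redirect a request for a gated bank to the nearest powered-on bank."""
    if bank in powered_on:
        return bank
    return min(powered_on, key=lambda b: distance[bank][b])

util = {0: 0.6, 1: 0.4, 2: 0.05, 3: 0.02}
on, off = gate_banks(util)
print(on, off)    # [0, 1] [2, 3]
```

Keeping a floor of live banks bounds the conflict-miss increase that the survey of [80] observed when too many banks are shut down.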

Summary
The breakdown of Dennard scaling has made it a challenge for systems to maintain the same power performance while transistor counts in many-core/multi-core systems have quadrupled. For this reason, the dark-silicon phenomenon has become an interesting field: it allows only a subset of resources to be active, and with the application of the techniques presented, this subset of resources can deliver high performance.
Table 1 presents a summary of all the techniques which target many-core systems in DSCSs. Unlike other work found in the literature, we have targeted all of the on-chip components while considering both the performance of the chip and its temperature. It can be deduced from the table that, for good temperature efficiency, a power budgeting technique must consider several factors, one of which is checking the surroundings of the neighbouring nodes before allocating power. For high power efficiency, idle virtual channels can be shut down while alternative buffers are employed; however, this can degrade performance, so it is only wise to do so under minimal workload. Additionally, the implementation of hybrid STT-RAM and SRAM architectures offers high power efficiency, which in turn reduces high temperatures.
Unfortunately, with scaling set to go even deeper, dark silicon may yet become a constraint rather than a solution. Deep scaling already means that fractional parts of the chip must be shut down, and reducing the transistor size further will only increase the fraction of the chip which is dark [104].
Consequently, this has led many researchers to shift their focus to Near-Threshold Voltage Computing Constraint Systems (NTCCS) [105], [106]. In contrast to DSCSs, where transistors are under-utilised, NTC allows all the transistors to operate in the near-threshold region, providing a fluid balance between power and delay [107]. Since the entire chip can be utilised at the same time, multiple applications can be executed, although at the cost of performance degradation and reliability loss. Another alternative is a joint implementation of both dark silicon and NTC in future technologies, a combination which has already been shown to provide better performance [108].

Conclusion
This paper introduced techniques which can be implemented in DSCSs to reduce power consumption whilst considering performance and avoiding high temperatures. In particular, efficient application mapping techniques and heterogeneous architectures were presented. Using the right application mapping algorithm to distribute applications across the chip can effectively reduce thermal hotspots. In addition, we showed that by using resources made of power-saving materials, the high power consumption which increases power density can be reduced, keeping the temperature of the working resources low enough for the chip to function beyond its supplied thermal design constraints.
We also provided alternative thermal design constraints which can be implemented to allow systems to function beyond their restricted threshold, and discussed novel techniques which can be applied to the NoC interconnect and the cache architecture. Power consumption in the NoC interconnect can be reduced through the replacement of input buffers, while cache power consumption can be reduced by implementing hybrid STT-RAM and SRAM architectures. Based on these findings, it can be deduced that the thermal limitations of the dark-silicon era cannot be significantly reduced by application mapping alone. However, with the addition of architectural heterogeneity and consideration of the uncore components, the thermal profile of the chip can be kept at a minimum whilst still maintaining performance. With this in mind, optimization techniques for the NoC and the memory subsystem will be the focus of our future work.

Figure 3. An Example of Task Migration

Figure 4. Contiguous Mapping and Non-Contiguous Mapping

Table 1. A summary of DSCS techniques