Automated RTL Generator for Optimized Flop Repeater Network

In all Hard IP (HIP) designs, there is a need for signals to travel from its source to destination module. To meet timing, they are flop repeated based on the floorplan of the design. Different signals have unique repetition requirements based on their functionality/timing criticality. Also signals may have single destination or multiple destinations. The flop repeater structure for each signal should be optimized so that it has minimum number of flops and timing is also met, to achieve desired targets for timing, area and power. This paper will show how we are creating optimized tree structure based on mathematical techniques and generating automated RTL for such repeater flop module. This submission presents a new tool ART (Auto Repeater Tool) that completely eliminates the manual effort/time required to generate functional, ready to plug-in RTL.


Introduction
The Hard IP (HIP) designs, nowadays, are getting very competitive in terms of power and area. The basic element consuming power is the flip flop. The more the number of flops, the greater the power consumption. Modern designs are having thousands of signals which are flop repeated multiple times from source to destination. Most of the IPs have a dedicated module, let's call it "Repeater Channel", which flop repeats these signals and feeds them to their destination module. The functionality of such Repeater Channel module is very critical as it has to carefully flop repeat each signal based on multiple parameters like its functionality, timing criticality, number of endpoints (destination module) etc.
The important question is how to decide number of flop stages required for these signals to travel from source 'Misc Logic' partition to 'Data, Command, Control' partitions. There are multiple factors which are taken into account to calculate this.  We need to come up with the custom structure for each signal to have the optimized number of flops. This structure can be a flop tree for multiple endpoints or a serial repeater chain (SREP) for single endpoint.
Problem 2: Finding a tree structure with least number of flops is very cumbersome.
We won't be talking about the serial chains here as they don't need any optimization. Number of flop stages in SREP, which is calculated based on geometry and flop drive strength, is fixed for that geometry and it doesn't have any room for optimization. For example, if a signal requires 5 stages to travel from partition A to partition B, then in a SREP we cannot have 4 stages as geometry itself requires 5 stages. However, if a signal is less timing critical and we want to add more than 5 stages, we have an option for that also.
There is a lot of scope for optimizing number of flops in the tree structure. This is critical as there are multiple mathematical ways to create the tree. Finding a tree structure with least number of flops is very cumbersome hence here we have developed an algorithm to solve this.
Traditionally we manually calculate number of repeater flop stages needed & then hand code the RTL for it, which is error prone, time consuming and not really scalable. The process involves putting a series of flops from source module till destination module with each flop branching into maximum of 2 flops. This binary flop generation is repeated till all the endpoints are connected.
The disadvantages of this approach are • Increased Area & Power: More number of flops are used as the generated tree structure may not be optimal. This increased the area of the HIP. • Manual & error prone: Time taken for development is more as each signal has to be hand coded. We have developed an automated system which takes all the necessary inputs and does all the calculation and analysis. Our system comes up with an optimized tree structure and automatically generates a functional RTL module so the RTL we are getting is completely automated and also most optimized. On top of that, user also has flexibility to add additional flops for the signals that is not timing critical.

ART Framework
The system we have developed consists of 3 different utilities working together. Utility 2 Tree Optimizer: This is C based algorithm. Inputs are the number of flop stages and the degradation factors for various fanouts. Flop calculator then takes the optimum tree structure dumped by Tree Optimizer and then excel file is passed to the 'RTL Code Generator'.
Utility 3 RTL Code Generator: The RTL Code Generator generates a functional RTL code in .sv format and can be plugged in to create signal repetition between modules.

Floorplan based Flop Calculator
Below is the snapshot of the excel sheet in which we feed the required inputs and information to do the mathematical calculation.  Calculations are done automatically with click of a button embedded in Excel Sheet. These buttons will run the VBA macro to generate the data. Table1 consists of the dimensions of the partitions our IP is having. Table2 is the drive strength of a flop when it is driving a single flop. Table 3 indicates how much degradation the flop will have in its drive strength when it is driving 2 and 3 flops further down. Table 4 is the textual representation of our floorplan which indicates which partition is at what position. Button1 gives us the Table5 which is the midpoint distance between each partitions in floorplan.
We are using midpoint distance and not specific X/Y location of ports on the border as X/Y location of ports is very specific to Back-End and subject to change.
We want to have a little flexibility for the Back-End by not being too specific by having exact distance. Midpoint of partition gives us pessimistic distance which gives us extra margin in terms of distance. The ports interacting with North partitions are mostly located above the midpoint of the source partition boundary and those interacting with South partitions are mostly located below the midpoint of the source partition boundary.
There are 2 different cases for midpoint distance calculation.
Case 1: Distance between adjacent partitions (e.g. between Data5-Data4, Data5-Data3…Data5-Control). To calculate center to center distance between partition x and partition y, following is the formula we used in the VBA macro.
. (1) Where, X height = height of partition x Y height = height of partition y XY height = Sum of height of all partitions between x and y Case 2: Distance between source partition (Misc Logic) to end partitions (data, command and control). To calculate the center to center distance between Misc Logic and Partition Command1, we have used following formula.
. Button2 gives us the minimum number of flops required between each partition in Table6 which is based on Table5 and Table2. This will divide the distance we got in Table5 by the Table 2 numbers of 'For Critical' and 'Non Critical' Flop Drive Strength and round it up to the nearest ceiling value.
Button3 generates an optimized tree structure, which is an output of Tree Optimizer (.exe file), for all the predefined signal categories in Table 7. We are passing the .exe file to the excel macro running behind this button which runs the executable file within the macro and captures the output in Table 7.
Table7 consists of the initial optimized Tree structure given by our "C" based algorithm which takes .csv file as an input. We have divided all the input signals into some predefined categories based on the functionality of these signals. Tree structure for each of these categories will be different depending on the stages required and number of endpoints.

Tree Optimizer
Tree optimizer is a C based algorithm. It takes the distances between source and destination endpoints, flop drive strength and degradation factors as inputs and dumps the optimized tree structure for the given input. This is implemented based on the tree data structure. Here root of tree will be the source endpoint and all leaf nodes represent the destination endpoint. Each node in between root and tree is a flop. Tree Optimizer mainly has 3 phases.

Phase 1:
In phase 1 it finds the farthest endpoint (say x) from the source and creates a chain of flops to connect from the source module to x. In C, each flop is realized using a node of tree.
In Figure 4, MISC is the source module and we need to repeat the MISC signal to Data1, Data2, Data3, Data4, Data5 and Data6 modules. Farthest destination endpoint here is Data1 so in phase 1 we create a tree where the root is MISC and has only 1 leaf node -Data1. The number of levels of this tree will be equal to the flop drive length between MISC and Data1. The chain will look like below if the flop drive length is 6. In this phase each of the nodes (node1:6) are assigned its height position relative to the source endpoint.

Phase 2:
In phase 2 the algorithm looks at the farthest endpoint from the leaf node of the main chain. In the above example the farthest is Data 6 (from Data1). It then calculate the number of flops required to connect to each of the nodes. i.e., the number of flops needed to branch from node 1 to Data6, node 2 to Data6 etc.
In the above example, number of flops needed to reach Data6 • From node 1 is 5 (To match the delay between Data1 and Data6). • From node 2 is 4. • From node 3 is 3 etc. This calculation is done for all the destination endpoints. While doing the calculation there may be cases that from a node in main chain to Data6, it might take less flops than from that node to Data1. These nodes should not be considered as it will increase the delay for all the destination endpoints.
From the calculation, we find the node with least number of flops required to reach Data6. If this node has children less than the number of fanouts, then we add nodes till Data6, else next node, with least number of flops, is considered. If it takes least number of flops from node3, then the tree will look like below: This process is repeated for all the destination endpoint till the full tree is found.

Phase 3
In this phase we have a tree structure with root at MISC and 6 leaf nodes -Data1:6. Now we parse through the tree to calculate the distance from the MISC to each node with degradation in consideration for each node because of one flop driving multiple endpoints. This distance is compared with the actual distance from the floorplan. If any of the node's calculated distance (by parsing the tree) is less than the actual distance (from floorplan), then the number of flops needed to reach from MISC to that flop is increased and rerun the program from Phase 1.

Figure 6. Tree Optimizer Phase 3 Output
So at the end of the Phase 3, we will get the most optimal tree structure which is passed to Table 7 in excel. Then the excel file is passed to RTL Code Generator. Refer Figure 6 for sample Tree Optimizer output.

RTL Code Generator
The PERL based script takes the tree structure and serial repeater count as input from Table 7 along with another tab in the excel sheet which consists of signal level information as shown in below snapshot.  along with the intermediate logic declaration of the interconnect signals for tree. It will also assign the final stage tree outputs to the output ports. Below is the snapshot of RTL generated for signal (Command_Enable) mentioned in the excel sheet. Figure 8. Sample of RTL generated for single signal The beauty of this is whenever there is a change in the tree structure (because of updated flop drive strength or change in the position of one of the partition), the updated RTL will be generated automatically with a click of a button. Designer won't have to go through the hassle of breaking the existing connections to include updated repeater structure. If new signals are added then integration of those will be required.
This will be applicable to all the signals going from one partition to another. If the signal is travelling from source partition to destination directly without any flops, it will be a feedthrough signal. Such signal can be included in respective signal category and will be taken care by simple assign statement.

Results
Following tree diagrams and tables show the comparison in number of flops seen using traditional approach vs the new automated approach mentioned in the paper.

Summary
This complete system, consisting of all 3 utilities, can be used for multiple projects in Intel in which this kind of EAI Endorsed Transactions on Cloud Systems 11 2019 -05 2020 | Volume 6 | Issue 17 | e2 Rahul S Bhatt, Shweta Sharma and Shyam A 6 repeater flops are required. With our solution we have designed a tool that • Takes the floorplan information, distance from the source module to the destination and single flop drive strength as the input. • Algorithmically finds out a tree structure with minimal number of flops. • Generates the optimal functional RTL FUB that is ready to plug in.
The advantages of our solutions are: • Less Area: This approach generates the least number of flops needed for a given source endpoint to destination endpoint. Thus area and flop delay are optimized. • Automated: It needs minimal manual effort and it generates a functional RTL which is ready to plug-in. Thus the development is faster, easy to use and less prone to errors. • Scalable: Adding new signals just requires adding the corresponding fields in the input to the tool and rerun the program again. • Fast development: Using tool to generate automated RTL, aids quicker development & faster simulation readiness.
Optimized & automatic RTL, generated using the tool described in this document, is being intercepted in one of the Intel's IPs.