Скачати 95.23 Kb.
Using complex instructions in microprocessors
Paul Gonchar. Usage of copmlex instructions in microprocessors. The way of increasing the power of microprocessors by using the complex instructions has been proposed. The idea of complex instructions is in moving the reiterative code of a program to a special internal memory of microprocessor. The execution of moved code is initiated by special instruction. The operation code of this instruction is defined by user. The research of efficiency of the idea has been conducted on the simulation model. The results of the research has been brought also.
One of the ways of raising the productivity of microprocessors is pipelining of microprocessors' architecture. Let's consider the customary pipeline, in which each instruction passes 5 stages of processing (Fig. 1): fetch, first decode , second decode, execute and writeback. At the fetch stage the operation code of the necessary instruction is fetched from memory. Further, at the first decode stage, the instruction operation code is decoded and its remaining parts, such as an immediate operand or a branch address are fetched from memory. After that, at the second decode stage, all data necessary for execution of current instruction are fetched to the microprocessor's internal registers. At the execute stage the instruction is executed, and then the result of execution is written back to a specified destination by the writeback unit. The destination can be both register of the microprocessor and memory.
First decode unit
Second decode unit
Fig 1. 5-stage pipeline structure
Before entering the pipeline, the instructions, as a rule, are placed in a sequential instruction queue. This allows to reduce the pipeline stalls when fetching instructions from memory. If the microprocessor has an instruction cache memory or united instruction and data cache memory, the instruction queue allows to reduce the pipeline stalls at instruction cache memory miss. The diagram of the elementary pipeline is shown in Fig. 2. It is clear, that in the same moment in the pipeline can be up to five instructions. Thus, when executing a linear fragment of the program, the pipeline will execute one instruction each microprocessor cycle. Critical for the pipeline architecture are the branch instructions. The correct branch address will be determined only at the execution stage. For this time the devices of the previous pipeline stages , and instruction queue as well, will be filled by instructions, fetched from wrong addresses. In this case the information in these devices is flushed and they are filled by instructions at correct addresses. Thus, the time needed for recovery of the pipeline, will be
Fig. 2. The elementary pipeline work diagram
010 Instruction 1
011 Instruction 2
012 Instruction 3
013 Instruction 4
014 Branch to 100
015 Instruction 5
016 Instruction 6
017 Instruction 7
100 Instruction 8
101 Instruction 9
102 Instruction 10
Fig. 3. Flushing of tree pipeline stages afer branch instruction execution
directly proportional to its length, that is the quantity of pipeline stages. The essence said is illustrated in Fig. 3.
To reduce the influence of the branch instructions on continuity of the pipeline's work is possible with the help of branch prediction. Only from recent time, when the integration technologies have allowed, they have begun to be used in microprocessors. In Intel Corporation this device was named Branch Target Buffer - BTB). Each entry of the buffer corresponds to one branch instruction. The more quantity of entries in BTB, the more branches the microprocessor is able to predict For example, Intel Pentium processor had a BTB with 512 entries. If the information on the necessary branch is not in BTB, the static branch prediction mechanism comes into effect , working on the following rules:
The practice demonstrates, that the given mechanism is rather effective. However, with complicating of the pipeline architecture, the time necessary for its recovery after a mispredicted branch increases. So, well known P6 pipeline includes 14 stages. By then, when the microprocesor will determine that branch was mispredicted , half of the pipeline will be filled by "wrong" instructions. From the information of Intel Corporation , the Pentium Pro processor needs 4 clock cycles while correct path instruction enters the decoder. After that the P6 pipeline must wait for 15 cycles before flushing 'bad' micro-operations from the pipeline.
Thus, it is very difficult to increase the productivity of the microprocessors due to real tasks' branches. Using the cache memory reduces the pipeline stalls, but it does not allow to eliminate them absolutely.
2. Architectral additions of the microprocessor which uses complex instructions
One of possible methods of reduction of the pipeline stalls is to move some reiterative fragments of the program, or subroutines, to a special internal memory of the microprocessor. It is necessary to add one more memory unit into the microprocessor's architecture. This memory contains routines defined by user. Thus, user can build the program being aware, that often used fragments of the program code are inside the microprocessor, and executing of these fragments will not cause pipeline stall. The similar ideas already stated earlier . It was proposed to add into the RISC architecture microprocessor a changable ROM. This ROM contains CISC instructions, which are suitable for current task. The version with re-programmed CISC instructions memory was proposed also. In this case every CISC instruction is the sequence of the microinstructions realising some function. This causes difficulties in programming of such ROM. To program a new instruction user will have to write the sequence of the microinstructions specific to architecture of each microprocessor. The instruction sets of microprocessors of one series are similar, however microinstructions corresponding to these instructions may be completely different (for example, x86 instruction set is common for all Intel microprocessors from 8086 up to Pentium III, for AMD microprocessors (Am386, Am486, Am5x86, K6, K6-2, K6-III), Cyrix (386, 486, M1, M2, M3), IDT (Winchip 1, Winchip2) and all other microprocessors of x86 clone. However their microinstruction sets are different.). Therefore, to use the possibility of creating user's instructions for a new microprocessor, user will have to study the new microinstruction set.. Thus all previous codes for CISC instructions will appear unsuitable. Backward compatibility is not provided in this case. Knowing that new microprocessors are produced rather often makes such solution not reasonable.
The problem is resolved when microinstructions in internal memory have replaced with sequences of standard RISC instructions. The backward compatibility is saved and the programming of this memory is easied. Let's name such subroutines which are the sequences of standard RISC instructions as complex instructions. These instructions may have a nested structure, i.e. consist not only of RISC instructions, but also of the same complex instructions.. Each nesting level should use its own program counter (PC). So, if a microprocessor supports three nesting levels for complex instructions, it should involve three more registers - PC1, PC2, PC3. While executing the current instruction the microprocessor should increment one of four registers, depending on a current nesting level.
From the point of the architecture view, such solution can be realized as follows. In the architecture of the microprocessor four new devices should be added: the Copmlex Instructions Control Unit (CICU) , Complex Instructions Descriptors Memory (CIDM), Complex Instructions Bodies Memory (CIBM) and Complex Instructions Defragmentation Unit (CIDU) (Fig. 4). It is also necessary to add new instructions to the microprocessor instruction set. These instructions will operate with complex instructions. These can be operations of creation of a new complex instruction, erasing existing, changing the operation code of a complex instruction, temporary disabling and accordingly enabling of a complex instruction, flushing all set of complex instructions etc.
CIDM contains descriptors of format shown in Fig. 5. In a case, when the microprocessor fetches from the instruction flow a complex instruction CICU accesses the CIDM for a descriptor with the fetched operation code. For reduction the search time this memory needs to be with associative organization. If the necessary descriptor is retrieved, and existence bit and bit of availability are set to active level, CICU loads the address of the body of a complex instruction to the conforming program counter. Then it sequentially fetches the body of this complex instruction from CIBM, routing them to the microprocessor's fetch unit.. Otherwise the "unknown instruction code" exception occurs.
To fetch unit
Fig. 4. Additional hardware units for microprocessor using complex instructions
The availability bit allows to temporary disable a complex instruction without unloading its body from CIBM. The CIDU defragments CIBM after deleting the body of an existing complex instruction. CIDU works in parallel with the rest units of the microprocessor and uses the CIBM - CIDM bus (when CICU doesn't access it ) to defragment and update the complex instruction information if necessary.
Existance bit Availability bit Operation code Address Length
Fig. 5. The complex instruction descriptor
3. Theoretical estimation of the complex instructions efficiency
For an estimation of the efficiency of this solution it is necessary to use the probabilistic approach since quantity and sequence of instructions of definite group in the input stream depend on algorithm of a task and from input datas. Hence, consequently for the microprocessor these values have random nature. Let's designate probabilities depicting an input instruction stream :
- probability of finding an instruiction in instruction queue;
- probability of meeting a load instruction in input instruction steram;
- probability of meeting ariphmetic instructions with execution unit latency more than 1 in input instruction stream;
- cache hit probability;
- probability of meeting a branch instruction in input instruction steram;
- probability of meeting a complex instruction in input instruction steram at n-th nesting level;
At organization of the microprocessor pipeline as shown in a figure 6, the average execution time for a one instruction will be (disregarding branch instructions):
, where (1)
- one microprocessor cycle duration
- cache memory access time
- system memory access time
From system memory
Fig. 6. 5-stage pipeline structure with instruction cache and instruction queue
- ariphmetic operations execute unit latency
- The execution time for all arithmetic instructions requiring more than 1 clock cycle of activity of the execution unit for a simplicity is considered identical.
With taking into account the influence of branch instructions (flushing the pipeline), the average execution time for a one instruction will be:
Similarly, it is possible to compound the formulas for the microprocessor realising the concept of complex instructions. The fetch time parameters, first and second decoding, and also the writeback time parameter for a complex instruction are similar to a standard instruction's . The only difference is the execution time. The formula for definition of the complex instruction's average execution time at zero nesting level will look like:
, where (8)
- The fetch time for an instruction residing at 1-st nesting level, i.e. is the part of the complex instruction of zero nesting level.
- similarly, the first decoding time
- similarly, the second decoding time
- similarly, the execution unit instruction latency
- similarly, writeback latency
- the number of instructions in current zero nesting level complex instruction
The meaning of value is shown in Fig. 7.
It is considered that the complex instruction does not contain branch instructions. Otherwise the execution of all the complex instructions at all nesting levels will be terminated and the branch will be made. For implementation of branches inside bodies of complex instructions it is necessary to add the additional instruction set, which one works with different nesting levels.
Let's define the magnitudes in (8). Since the necessary instruction in complex instruction's body is always available, we will have:
The main instruction flow
CI body. Level 2
CI body. Level 1
CI body. Level 0
CI - Complex Instruction
Fig. 7. Different nesting levels of complex instructions.
Similarly, it is possible to compound the formulas for any nesting level. In the formula for a residence time of an instruction located at last nesting level in the execution unit there won't be a probability , because at this level usage of complex instructions is inadmissable. Otherwise the exception like "a complex instruction at last nesting level " is generated .
Having analyzed the formulas for an instruction average execution unit latency you may see that latency for the microprocessor realising the concept of complex instructions can be much greater than for the customary microprocessor. However and the functionality of such instruction is much greater (at the expense of all set of instructions in a complex instruction's body , including all highest nesting levels).
Also it is visible, that in the formulas for the first and second decoding latencies the
probability is absent, i.e. the instructions from an input stream are always accessible for the pipeline and are located in CIBM. Thus, the time of full decoding of an instruction is equal to 2 . It must be kept in view that this is truthful only for considered RISC microprocessor with the 5-stage pipeline.
The absence of the instruction queue references when executing a complex instruction allows to fill the instruction queue with instructions from the main instruction flow in parallel, that also reduces the pipeline's stall time upon returning from the body of a complex instruction.
The beginning of a complex instruction execution and its ending are not branches, therefore the pipeline is not flushed, unlike a subroutine execution and return from it. The sense said is figured in Fig. 8.
4. The estimation of the complex instructions efficiency using a simulation model
For estimation of these statements the simulation model of simple RISC microprocessor realising the concept of complex instructions, with the following structural characteristics was designed:
- the general purpose registers digit capacity - 64 bit
- the number of general purpose registers - 32
- the instruction queue length - 48 bytes
IQ is empty
IQ is empty
IQ is full
IQ - Instruction Queue
Fig. 8. The instruction queue status after executing a subroutine and a complex instruction
- the minimum instruction length (opcode only) - 3 bytes
- the maximum instruction length (with immediate operand) - 11 bytes
- the pipeline - 5-stage
- the cache memory - separated instruction and data
- the data cache size - 16 Кbytes
- the instruction cache size - 16 Кbytes
- the number of CIDM entries - 512
- the CIBM size - 64 Кbytes
A modification of this model also was built, which one has no instruction cache memory. With the help of this model it is possible to evaluate "clean" efficiency of the concept of complex instructions, without influence of the instruction cache.
On the built models three tests were launched. The estimation of a run time of each test was made in clock cycles of the simulated microprocessor. Thus, it is possible to exclude the binding to a clock rate. The execution latency for each instruction was borrowed from the MIPS R10000 RISC microprocessor documentation. This microprocessor was chosen because MIPS Technologies, Inc is considered founder of the RISC architecture. The instruction set of the simulated microprocessor is similar to assembler of microprocessors with the MIPSIII architecture .
The test 1 represents implementation of Simpson algorithm for integrals calculation. The algorithm is iterational, therefore accuracy of the obtained result depends on quantity of made iterations. For this method 1 and 800 iterations were made. The results of the tests are listed in table 1. The time spared of using the complex instructions increases with increasing of quantity of iterations. On one iteration using of complex instructions is not expedient and slows down the algorithm. It is connected with need to create necessary complex instructions and place their bodies to CIBM before starting the main algorithm. With increasing of quantity of iterations, quantity of wasted clock cycles decreases on the increasing value, and finally percentage scoring remains approximately identical. It is clear visible from the table, that gain from using the complex instructions in the test 1 is not large. It is connected with domination of arithmetic instructions in the test. Many of them are multiply and divide operations which are contained by an integrating function. To execute the multiply instruction in the format of 32 bits the execution unit requires 6 clock cycles, and for multiplying in the format of 64 bits - 10 clock cycles . Similar parameters for a divide instruction are 35 and 67 clock cycles accordingly. Therefore, even after cache memory miss during execution of such instruction the pipeline has time to get the necessary instrucion from system memory, and the gain gets minor.
The following test, the test 2, represents the logical test. In the test are processed 1, 3 and 10 8-byte units. The size of a unit is selected not incidentally but from the size of general purpose registers which is 64 bits (8 bytes). 8 bytes are fetched from memory, each byte is processed with the set of logical operations, and the 8-byte block is written back to the memory. The similar problems meet in computer graphics when processing an image pixels. The results of performing the test are listed in table 2. The 8-byte block processing procedures in the test are replaced by complex instructions. As You may see from the table, when processing only one block the clock cycles gain is negative, on the causes described above. However, when processing 10 units the gain gets considerable and makes about 50 % for the microprocessors with the instruction cache memory. For the microprocessors without the instruction cache the gain will be even greater.
Тable 1. The clock cycles necessary to perform the test 1.
Тable 2. The clock cycles necessary to perform the test 2 .
The test 3 represents an implementation of instruction of VAX 9000 CISC microprocessor using the simulating microprocessor's instruction set. The following instruction was implemented:
ADDL @(r1)+, @(r2)+, @(r3)+,
This instruction calculates the sum of two numbers from memoty with addresses, accordingly located on addresses stored in registers r1 and r2 (indirect addressing). The result is written according
to address obtained similarly, but with usage of r3 register. After fetching each operand, the microprocessor increments a used register. With the help of implementation of this instruction, in the test 3 two number arrays were summarized and the result was written to the third array. Three implementations of the instruction were madet: as a linear code, as a subroutine and as a complex instruction. The results are listed in table 3. The microprocessor operating complex instructions has appeared slower than the simple microprocessor which performed the test using a linear code, even if the time for creating complex instructions is excluded from the result. It is connected with cyclical structure of the test. After an instruction initiating a complex instruction executing, there is a branch
Table 3. The clock cycles necessary to perform the test 3 .
instruction, and the pipeline is flushed. By comparison with linear code, quantity of instructions in an test's loop body for the program using complex instructions is greater because of an instruction initiating the executing of a complex instruction (which requires fetching and decoding too). Therefore, on each iteration this instruction adds to a total clock cycles some more additional ones.
The gain is obvious when using complex instructions instead of the subroutines. Even if include the clock cycles necessary to create complex instructions to result , it will be approximately for 16 % faster.
When estimating the results of testing, it is necessary to keep in view that fact, that 16 Kbytes of instruction cache memory exceeds the size of any launched test. Therefore, the efficiency of complex instructions usage is a little lowered.
5. Conclusions and Perspectives
The obtained results may not represent a full view of usage of complex instructions. Before creating the simulation model the structural block diagrams were developed. But real implementation may be fairly different. The further researches will be conducted in this field.
Multiprocessor systems based on microprocessors supporting complex instructions will be examinated also. One of the main problems of creation of multiprocessor systems is a low transfer capacity of interprocessor channels and memory bus. To reduct the influence of buses' restrictions complex instructions can be used. Each microprocessor can contain it's own complex instruction set, and also exchange them with other microprocessors in system.
The possible usage of complex instructions is using them in control systems. Very often these systems are made separated. In this case the problem of transfering many information portions between separated parts of the system occurs. It is possible to arrange some permanent control procedures as complex instructions on a remote part of the system. In this case instead of a body of a control procedure it is possible to transmit only code of corresponding complex instruction, and therefore the traffic in the channel can be greatly reduced. It is possible to create the whole hierarchies of control systems. A standard RISC instruction for the microprocessor on higher hierarchical level can be the whole subroutine or complex instruction for the microprocessor on lower level. Each microprocessor will keep just a small part of a general control program. The high grade of parallelism will be reached in this case.
|Abstract preparation instructions||Abstract. The abstract body should be in Times New Roman, 10 pt Italic type. Do not cite references in the abstract. Key words|
|Abstract. The abstract body should be in Times New Roman, 10 pt Italic type. Do not cite references in the abstract. Key words||Abstract. The abstract body should be in Times New Roman, 10 pt Italic type. Do not cite references in the abstract. Key words|
|Abstract In this paper the instructions for preparing camera ready paper for International Conference memstech 2013 are given||Abstracts application form to the State abstract database “Ukrainika naukova” and Ukrainian abstract journal “Dzherelo”|
«ulrich periodicals directory», and national database «ukrainika naukova» («Dzherelo» abstract journal). Our journal has applied...
|International relations theory: realism, pluralism, globalism / Paul R. Viotti, Mark V. Kauppi. London, 1987|
Бжезінський З. Велика Шахівниця: американська першість та її імперативи. – Львів; Івано-Франківськ: Лілея-нв, 2000
|Title of the Abstract||Instructions for practical classes|