By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), August 10, 2016 8:44 pm
Room: Moderated Discussions
Heikki Kultala (hkultala.delete@this.iki.fi) on August 9, 2016 12:37 pm wrote:
>
> > A question that was raised, however, is about fixed length archs - can these architectures avoid use of BTB
> > entries for fixed-target branches by decoding such jumps early and redirecting fetch? That is, avoiding the
> > use of the BTB for branches whose targets are fixed in the
> > instruction, leaving the BTB resources for branches
> > which may actually vary (e.g., indirect jumps). Do any of the common fixed-length archs actually do this?
>
> Modern pipelines are way too long for this, the branch cannot be decoded
> early enough when just the i-cache access takes multipl clock cycles.
For instruction address relative jumps/branches using largish offsets/insets, the block of instructions can be predecoded and the fetch chunk index of a possibly taken jump/branch stored in a smaller, faster portion of Icache. If there is not a possibly taken jump/branch in a chunk of instructions, the bits would be taken from a common location in the chunk. (For MIPS and Alpha this would also provide Icache space for a "BTB entry" for indirect jumps; Alpha even architects some of the bits as a hint.) This requires at least one extra bit per chunk (to encode whether an instruction address is encoded or the bits are taken from the common location; obviously with more state bits other fast-access information choices might be made available, e.g., stack load offset).
With 16-bit offsets and 32-bit instructions, presumably a 2-cycle Icache could be converted to support one 16-bit fast-access portion per two instructions while providing single cycle latency.
This concept could be extended to short branch encodings (requiring additional metadata) and variable length instruction encodings (fetch chunk crossing branch instructions would be more difficult to handle, cache block crossing even more so).
This concept (which I had posted years ago on comp.arch and which someone mentioned had been considered by Alpha designers) is sort of between a BTB and an aggressively predecoded instruction cache. (It is not a BTB in the traditional sense because it is mostly using instruction storage, expanding storage as part of predecode is not unusual so it might be classified as a predecoded instruction cache.) It is also a use of subblock NUCA, where part of a cache block has lower latency (with another obvious use being for storing pointers or enough of a pointer to generate a cache address).
> And even the only 5-stage pipeline of most of the "original"
> RISC processors needed the one delay slot for this.
Except for MIPS jumps, the offset addition would take some time. Without branch direction prediction, one would still need to perform the branch evaluation. One also would need to sufficiently decode the instruction to know that it was a control flow instruction (and what type). If branches had been encoded as inset with implicit carry/borrow and metadata provided for quick branch recognition, then a delay slot might not have been necessary.
(Mitch Alsup once suggested on comp.arch using predecode to provide microarchitectural delayed branches.)
>
> > A question that was raised, however, is about fixed length archs - can these architectures avoid use of BTB
> > entries for fixed-target branches by decoding such jumps early and redirecting fetch? That is, avoiding the
> > use of the BTB for branches whose targets are fixed in the
> > instruction, leaving the BTB resources for branches
> > which may actually vary (e.g., indirect jumps). Do any of the common fixed-length archs actually do this?
>
> Modern pipelines are way too long for this, the branch cannot be decoded
> early enough when just the i-cache access takes multipl clock cycles.
For instruction address relative jumps/branches using largish offsets/insets, the block of instructions can be predecoded and the fetch chunk index of a possibly taken jump/branch stored in a smaller, faster portion of Icache. If there is not a possibly taken jump/branch in a chunk of instructions, the bits would be taken from a common location in the chunk. (For MIPS and Alpha this would also provide Icache space for a "BTB entry" for indirect jumps; Alpha even architects some of the bits as a hint.) This requires at least one extra bit per chunk (to encode whether an instruction address is encoded or the bits are taken from the common location; obviously with more state bits other fast-access information choices might be made available, e.g., stack load offset).
With 16-bit offsets and 32-bit instructions, presumably a 2-cycle Icache could be converted to support one 16-bit fast-access portion per two instructions while providing single cycle latency.
This concept could be extended to short branch encodings (requiring additional metadata) and variable length instruction encodings (fetch chunk crossing branch instructions would be more difficult to handle, cache block crossing even more so).
This concept (which I had posted years ago on comp.arch and which someone mentioned had been considered by Alpha designers) is sort of between a BTB and an aggressively predecoded instruction cache. (It is not a BTB in the traditional sense because it is mostly using instruction storage, expanding storage as part of predecode is not unusual so it might be classified as a predecoded instruction cache.) It is also a use of subblock NUCA, where part of a cache block has lower latency (with another obvious use being for storing pointers or enough of a pointer to generate a cache address).
> And even the only 5-stage pipeline of most of the "original"
> RISC processors needed the one delay slot for this.
Except for MIPS jumps, the offset addition would take some time. Without branch direction prediction, one would still need to perform the branch evaluation. One also would need to sufficiently decode the instruction to know that it was a control flow instruction (and what type). If branches had been encoded as inset with implicit carry/borrow and metadata provided for quick branch recognition, then a delay slot might not have been necessary.
(Mitch Alsup once suggested on comp.arch using predecode to provide microarchitectural delayed branches.)