The rest of our overhead comes from indirect branch checks and translations, the only instructions added to the code cache. We have spent a good deal of time optimizing our indirect branch handling, which is not easy on x86. Because we must translate return addresses, we cannot readily use the IA-32 return instruction, which thwarts the return stack buffer (RSB), the special return predictor in Pentium processors. This is a case of the processor being heavily optimized for common instruction patterns that our code cache cannot duplicate. Perhaps in the future, if runtime code manipulators become pervasive, hardware will be optimized for their usage patterns.
So where does the rest of the slowdown come from? Although we are executing inside the code cache, we may be spending more time there than the application instructions would incur natively. The hashtable used by our indirect branch lookup adds data cache pressure, which makes a significant difference in many of these benchmarks.
|Copyright © 2004 Derek Bruening|