iXBT Labs - Computer Hardware in Detail

RightMark Memory Analyzer - Universal CPU/Chipset/RAM Benchmark: Test Packet Description 

February 10, 2004



Before this test suite was created, there was no proper software for measuring vital low-level parameters of the CPU/Chipset/RAM system that provided stable, reproducible results and allowed test parameters to be varied over a wide range. Such characteristics include real RAM bandwidth and latency, the average/minimal latency and associativity of the different cache levels, the real L1-L2 cache bus bandwidth, and the structure of the TLB levels. Moreover, these aspects rarely receive sufficient attention in a product's technical documentation (CPU or chipset). A test suite combining a large set of subtests for measuring these objective characteristics is therefore much needed for estimating crucial platform parameters. It has been developed within the RightMark project, is named RightMark Memory Analyzer, and is available as open source code.

System requirements

Minimal system requirements:

  • Pentium MMX CPU or higher;
  • 32MB RAM available;
  • Windows 2000 or later.




General settings

This section contains settings common to all subtests implemented in RMMA. 

CPU Clock, CPU Count

Displays the CPU clock rate and the number of logical processors in the system. 

Cache Line Size

The effective cache line size, detected automatically when the application starts (detection takes only a few seconds). This parameter is very important for obtaining correct results in most of the implemented subtests, which is why its automatic detection is an integral part of RMMA.

Memory Allocation

Selects the method used to allocate the memory needed to run the tests.

Standard - the standard method of memory block allocation with malloc(), followed by VirtualLock() on the selected memory region, which guarantees that later accesses to this region won't cause page faults.

AWE - this method uses Address Windowing Extensions available in Windows 2000/XP/2003 Server. This memory allocation method is more reliable for some tests, such as cache associativity. Using AWE requires the Lock Pages in Memory privilege, which is not granted by default. To obtain this privilege, take the following steps:

  1. Log in to the system with administrator rights;
  2. Launch Local Security Policy from Administrative Tools;
  3. Select Security Settings -> Local Policies -> User Rights Assignment;
  4. Select the Lock pages in memory policy and add a user or group name (e.g., Administrators);
  5. Log off and log on again to apply the policy.

Data Set Size

The total amount of data read/written when measuring each point of the curve. Every point is measured four times, and the minimal result (in CPU clocks) is taken; this improves repeatability. So, if you measure memory bandwidth by reading 1MB blocks, a Data Set Size of 128 MB means 4 measurements with 32 read iterations each. A higher Data Set Size gives a more reliable result (smoother curves) but increases the test time accordingly. 

Thread Lock

In the general case every test runs in the main thread, which is given the highest (realtime) priority to prevent interference from other running processes. This is sufficient only on uniprocessor systems, though most user systems are such. On SMP or Hyper-Threading systems the additional processors can noticeably affect the test scores. This option locks the test thread to one processor on SMP systems to increase the precision and reliability of the measurements. At the same time, it is not recommended on Hyper-Threading systems, where it introduces a large effect of its own. The ideal test condition for Hyper-Threading systems is the minimal possible system load: all other applications, including those with the lowest priority, should be closed. 

Logarithmic Y Scale

Switches the Y axis of the graphs to a logarithmic scale; the scale is linear by default. 

White Background

Uses a white background (instead of the default black one) in the graphical representation of test results, both during test execution and in the report BMP file. This option makes printouts of test results more convenient. 

Create Test Report

This option determines whether a report will be created on completion of the test. The report includes two files with textual (MS Excel CSV) and graphic (BMP) representation of test results. 

Sequential test execution (Batch)

The tests can be executed sequentially, which is convenient, in particular, for mass testing of a large number of systems with the same test suite. RMMA supports the following operations on a batch:

  • Delete - deletes a test selection from the batch;
  • Clear - clears the whole test suite;
  • Load - loads a saved suite from a file;
  • Save - saves a current suite into a file.

Press Add to Batch to add individual tests to the batch, whichever subtest is currently selected. 

RMMA tests description

The RMMA suite includes 7 types of tests for estimating the key characteristics of the CPU/Chipset/RAM system. They measure:

  • Average and maximal memory bandwidth;
  • Average and minimal latency of L1/L2 data cache and RAM;
  • L1/L2 cache associativity;
  • Actual bandwidth of L1-L2 data cache bus;
  • Size and associativity of every D-TLB level;
  • Size (including the effective one) and associativity of L1 instructions cache;
  • Effectiveness of decoding of ALU/FPU/MMX instruction sets;
  • Size and associativity of every I-TLB level.

In each test you can either configure the settings yourself or select one of the presets. Presets make the test options more convenient to use and allow comparing systems of different classes under identical conditions. Once a preset is selected, the test parameters can't be changed. 

Benchmark #1: Memory BW




The first benchmark estimates the real bandwidth of the L1/L2/L3 data caches and RAM. It measures the time (in CPU clocks) of completely reading/writing/copying a data block of a certain size (which can vary or stay fixed) using a particular set of CPU registers (MMX, SSE, or SSE2). For reading and writing, the test also supports optimizations - Software Prefetch or Block Prefetch - intended to reach the maximal real read bandwidth. The scores are reported in bytes transferred to (or from) the CPU per clock, as well as in MB/s. The settings of the first benchmark are as follows. 

Variable Parameter

Selection of one of three test modes:

Block Size - dependence of an actual memory bandwidth on data block size;

PF Distance - dependence of an actual memory bandwidth on prefetch length in Software Prefetch method. This mode is developed for reaching the maximal real read bandwidth and it's recommended only for large data blocks (larger than the overall data cache size);

Block PF Size - dependence of an actual memory bandwidth on block prefetch size in one of two Block Prefetch methods. Similarly to PF Distance, it's recommended only for large data block sizes.

Minimal Block Size

Minimal Block Size, KB, in case of Variable Parameter = Block Size; block size in other cases.

Maximal Block Size

Maximal Block Size, KB, in case of Variable Parameter = Block Size.

Minimal PF Distance

Minimal Software Prefetch Distance, in bytes, in case of Variable Parameter = PF Distance; Software Prefetch Distance in other cases. 0 means that the Software Prefetch mode is disabled.

Maximal PF Distance

Maximal PF Distance (for Software Prefetch) in case of Variable Parameter = PF Distance.

Minimal Block PF Size

Minimal Block PF Size, KB, in case of Variable Parameter = Block PF Size; Block Prefetch size in other cases. This parameter makes sense only for the Block Prefetch methods (1, 2) described below.

Maximal Block PF

Maximal Block PF Size, KB, in case of Variable Parameter = Block PF Size.

Stride Size

Stride size, in bytes, of the operations that read data into the cache in the Block Prefetch methods (1, 2). For reliable results this parameter must match the cache line size; that is why in this and other subtests it is set to auto-detect by default, which means the cache line size automatically detected by the program at launch is used.

CPU Register Usage

CPU Register Usage - selection of registers for fulfilling read/write operations (64-bit MMX, 128-bit SSE and 128-bit SSE2).

Read Prefetch Type

Read Prefetch Type defines the type of instruction used for Software Prefetch (PREFETCHNTA/T0/T1/T2); it also enables one of the Block Prefetch modes needed for measurements with Variable Parameter = Block PF Size. Block Prefetch 1 reads lines from memory into a block prefetch of a certain size using MOV instructions and is recommended for the AMD K7 family (Athlon/Athlon XP/MP). In the Block Prefetch 2 method, data are instead read with one of the Software Prefetch instructions (PREFETCHNTA); this method is recommended by AMD for the K8 family (Opteron/Athlon 64/FX).

Non-Temporal Store

Non-Temporal Store - direct memory store (write combining protocol) on write operations. This access method writes data to memory without first reading the old data into the CPU cache hierarchy (i.e., without the write-allocate mode). It keeps the CPU cache free of unneeded data, which is particularly useful in copy operations.

Copy-to-Self Mode

The data block is copied to the same memory region where the source block is located, i.e. the memory content doesn't actually change. By default this option is disabled, and the data are copied to a region shifted by an offset equal to the transferred block size. Since in Copy-to-Self mode the write operations hit the cache completely, this variant tests the memory's ability to read data after writing (read around write). The cache is utilized to a greater degree, and the benchmark becomes much lighter for the memory subsystem. Note that the Non-Temporal Store and Copy-to-Self modes are incompatible.

Selected Tests

Selected Tests define the memory access ways.

Read Bandwidth - real memory bandwidth at reading;

Write Bandwidth - real memory bandwidth at writing;

Copy Bandwidth - real memory bandwidth at copying.

Benchmark #2: Latency/Associativity of L1/L2 Data Cache (D-Cache Lat)




The second benchmark estimates the average/minimal latency of L1/L2 data cache and memory, L2 cache line size and L1/L2 data cache associativity. Below are its parameters and modes of its operation. 

Variable Parameter

There are 4 types of this test:

Block Size defines the dependence of cache/memory latency on the block size. This test mode shows the latency of the various memory regions - the L1, L2, L3 (if present) caches or RAM. A dependent access chain is created in the allocated memory, with each element containing the address of the following one. During every full read iteration each chain element is addressed exactly once. The number of chain elements equals the block size divided by the Stride Size (see below). If the stride size matches the cache line length, the block size is a true measure of the amount of data read (because data are read from RAM to L2, or from L2 to L1, line by line).

Block sizes less than or equal to the L1 cache size allow estimating the load-use latency of L1 cache accesses; block sizes within the range (L1..L1+L2) or (L1..L2) estimate the L2 cache latency, depending on the cache architecture (exclusive or inclusive); and finally (since an L3 cache is rarely present), block sizes greater than L1+L2 estimate the latency of RAM accesses.

The order in which the chain elements are traversed depends on the test method (see below). The Forward Read Latency method starts from the first element and goes through all of them to the last one, which contains the first element's address, allowing the walk to be repeated multiple times. In Backward Read Latency the first element contains the last one's address, and reading goes from the last element to the first. Finally, the Random Read Latency test selects the elements randomly, but each element is still visited exactly once. Below you can see the principle of forward reading of a chain of 8 elements.




Stride Size - dependence of cache latency on the stride size. This test mode makes sense only for block sizes that fit into the L2 cache and allows estimating its line length. This is not the only method of estimating the cache line size: RMMA contains three such methods, and the others are described below.

Chains Count - dependence of cache latency on the number of sequential dependent access chains. It estimates the associativity of the L1/L2 data caches. The number of chains is really a conventional notion, because in reality there is only one dependent access chain which is executed several times. The only difference between the multi-chain variant and the single-chain one is that in the former, data are read from different memory regions (as many as there are "chains"), with the offset between them a multiple of the cache segment size. Have a look at the forward reading of an array that contains 4 chains.




To estimate such an important processor cache parameter as associativity, you gradually increase the number of dependent access chains while keeping the block size minimal. It turns out to be easy to "do harm" to the CPU cache - there is no need to fill it up with data: to make a "breach" in an n-way set associative cache, you just have to read n cache lines at addresses whose offsets are multiples of the cache segment size, and that is exactly what this test does. For example, to defeat the 512K L2 cache of the Pentium 4, with associativity n = 8 and a 64-byte line, one only has to read 8 x 64 = 512 bytes, i.e. less than 0.1% of its size(!). The minimal cache segment size in the current test version is 1MB. Such a large value guarantees that the test will correctly determine the L2/L3 cache associativity even on systems with a large cache (note that a 1MB cache segment corresponds to an 8MB L3 data cache with associativity 8).

NOP Count - dependence of the latency of the selected memory region (L2 cache or RAM) on the number of filler operations between two successive accesses to it. These operations, called NOPs, are unrelated to cache access; they simply introduce a fixed time gap between two successive accesses to different cache/memory lines. This unloads the L1-L2 or L2-RAM bus so that the latency of accessing the selected memory region becomes as low as possible. In the current RMMA version each NOP is the x86 ALU instruction or eax, edx (eax stores the chain element address, and edx is initialized to 0); this instruction suits a great deal of modern processors well.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; Block size in other cases.

Maximal Block Size, KB

Maximal Block Size, KB, in case of  Variable Parameter = Block Size.

Minimal NOP Count

Minimal NOP Count in case of Variable Parameter = NOP Count; NOP count in other cases.

Maximal NOP Count

Maximal NOP Count in case of Variable Parameter = NOP Count.

Minimal Chains Count

Minimal Chains Count - a minimal number of successive dependent access chains in case of Variable Parameter = Chains Count; the number of successive dependent access chains in other cases. The offset of every such dependent access chain from its neighbors is equal to the value which is a multiple of the maximum possible cache segment size.

Maximal Chains Count

Maximal Chains Count - a maximal number of successive dependent access chains in case of Variable Parameter = Chains Count.

Minimal Stride Size

Minimal Stride Size, bytes, in the dependent access chain (in each chain if they are more than one) in case of Variable Parameter = Stride Size; stride size in a dependent access chain in other cases.

Maximal Stride Size

Maximal Stride Size in a dependent access chain in case of  Variable Parameter = Stride Size.

Latency Measurement

Latency Measurement technique (these parameters can be configured only if Variable Parameter = NOP Count). Method 1 uses an ordinary dependent access chain with a varying number of NOPs (see above; edx = 0) to determine the minimal latency:

// loading of next chain element
mov eax, [eax]
// bus unloading, varying number of NOPs
or eax, edx
...
or eax, edx
Nevertheless, in some cases (when speculative loading works effectively) the minimal cache latency may not be reached this way. For such cases RMMA provides an alternative (Method 2), which uses a different chain read code (ebx = edx = 0):
// bus unloading, fixed number of NOPs
add ebx, edx
...
add ebx, edx
// loading of next chain element
mov eax, [eax+ebx]
and ebx, eax
// bus unloading, varying number of NOPs
add ebx, edx
...
add ebx, edx

Selected Tests

Selected Tests define the memory access ways when testing latency.

Forward Read Latency - forward sequential access latency;

Backward Read Latency - backward sequential access latency;

Random Read Latency.

Benchmark #3: Real L1/L2 Data Cache Bus Bandwidth (D-Cache BW)




This benchmark estimates the real bandwidth of the L1-L2 cache bus (or of the L2-RAM bus). It is the simplest RMMA test in terms of configuration. It is based on the same method as the real bandwidth test (Benchmark #1), but here the memory read/write operations are carried out line by line, i.e. with a stride equal to the cache line length, using the CPU's general-purpose (ALU) registers. Both forward and backward access modes are supported. Test parameters:

Variable Parameter

Variable Parameter defines one of two test modes:

Block Size - dependence of a real data bus bandwidth on the data block size;

Stride Size - dependence of a real L1-L2 or L2-RAM bus bandwidth on a stride size. This mode is the second way to calculate a cache line length.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; Block Size in other cases. A value less than about 1.5 times the L1 cache size will yield meaningless results: this test doesn't estimate the bandwidth of the L1-LSU-registers path, because loading data from L1 into the LSU (Load-Store Unit) and then into CPU registers is not performed line by line. To estimate the L1-LSU bandwidth it's better to run the first test (Memory BW) with block sizes that fit into the L1 cache.

Maximal Block Size, KB

A value lower than the L2 cache size (inclusive cache architecture) or L1+L2 (exclusive cache architecture) allows estimating a real L1-L2 bus bandwidth. In case of the Block Size values ranging from L1+L2 to some greater value this benchmark estimates the maximal real memory bandwidth at reading/writing of full cache lines, which in some cases turns out to be greater than the maximal real bandwidth in case of total data reading/writing.

Minimal Stride Size

Minimal Stride Size in cache accessing at reading/writing in case of Variable Parameter = Stride Size; Cache access Stride Size in other cases.

Selected Tests

Selected Tests define a type of measurements.

Forward Read Bandwidth - forward sequential cache line reading;

Backward Read Bandwidth - backward sequential cache line reading;

Forward Write Bandwidth - forward sequential cache line writing;

Backward Write Bandwidth - backward sequential cache line writing.

Benchmark #4: L1/L2 (D-Cache) Arrival




The fourth benchmark estimates implementation features of the L1-L2 bus (bus width, multiplexing) for some processors with an exclusive cache architecture, in particular AMD K7/K8 processors. This test actually measures the total latency of two accesses to the same cache line separated by a certain interval. The measurement method is identical to Method 2 above, except that the two consecutive chain elements are located in the same cache line. 




Besides, the fourth test can be used to calculate the L2 cache line size (this is the third way in RMMA to estimate it, and it's used for its estimation at the program startup). The fourth test parameters are as follows:

Variable Parameter

Variable Parameter defines one of five test types:

Block Size - dependence of the total latency on the block size.

NOP Count - dependence of the total latency on the number of NOPs between successive accesses to different cache lines.

SyncNOP Count - dependence of the total latency on the number of NOPs between successive accesses to the same cache line.

1st DW Offset - dependence of the total latency on the first word offset within the cache line.

2nd DW Offset - dependence of the total latency on the second word offset within the cache line.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; total Block Size in other cases.

Maximal Block Size, KB

Maximal Block Size, KB, in case of  Variable Parameter = Block Size.

Minimal NOP Count

Minimal NOP Count defines the minimal number of NOPs between two successive accesses to adjacent cache lines in case of Variable Parameter = NOP Count; the number of NOPs between two successive accesses to adjacent cache lines in other cases.

Maximal NOP Count

Maximal NOP Count defines the maximal number of NOPs between two successive accesses to adjacent cache lines in case of Variable Parameter = NOP Count.

Minimal SyncNOP Count

Minimal SyncNOP Count defines the minimal number of NOPs between two successive accesses to the same cache line in case of Variable Parameter = SyncNOP Count; the number of NOPs between two successive accesses to the same cache line in other cases.

Maximal SyncNOP Count

Maximal SyncNOP Count defines the maximal number of NOPs between two successive accesses to the same cache line in case of Variable Parameter = SyncNOP Count.

Stride Size

Stride Size, in bytes, in the dependent access chain between two successive accesses to consecutive cache lines.

Minimal 1st Dword Offset

Minimal 1st Dword Offset within the cache line, in bytes, in case of Variable Parameter = 1st DW Offset; 1st Dword Offset within the cache line in other cases.

Maximal 1st Dword Offset

Maximal 1st Dword Offset within the cache line in case of Variable Parameter = 1st DW Offset.

Minimal 2nd Dword Offset

Minimal 2nd Dword Offset within the cache line, in bytes, in case of Variable Parameter = 2nd DW Offset; 2nd Dword Offset within the cache line in other cases. The 2nd Dword Offset is calculated relative to the 1st Dword Offset modulo the stride size (cache line size):

2nd_Dword_Offset = (2nd_Dword_Offset + 1st_Dword_Offset) % Stride_Size

Maximal 2nd Dword Offset

Maximal 2nd Dword Offset within the cache line in case of Variable Parameter = 2nd DW Offset.

Selected Tests

Selected Tests define a way of testing the latency of the double access.

Forward Two-Dword Read Latency;

Backward Two-Dword Read Latency;

Random Two-Dword Read Latency.

Benchmark #5: Data Translation Lookaside Buffer Test (D-TLB)




The fifth test determines the size and associativity of the Data Translation Lookaside Buffer (L1/L2 D-TLB). Actually, it measures the latency of accessing the L1 cache provided that every next cache line is loaded from a new memory page rather than the same one.




(The memory page size in real operating systems is much greater, e.g. 4096 bytes, than in our scheme, which houses only 4 cache lines per page.)
So, if the number of pages used is less than the TLB size, the test measures the L1 cache's own latency (TLB hit); otherwise, it measures the L1 cache latency in the case of a TLB miss. Note that Maximal TLB Entries mustn't exceed the number of L1 cache lines, otherwise the graph will show a jump related to the transition from L1 to L2 rather than to the size of the D-TLB structure. This constraint can always be satisfied, because the overall size of the TLB levels is always less than the number of lines that fit into the L1 cache. Test settings:

Variable Parameter

Variable Parameter defines one of two test modes.

TLB Entries - dependence of latency when accessing the L1 cache on the number of memory pages used.

Chains Count - dependence of latency when accessing the L1 cache on the number of sequential access chains with a given number of pages used; it estimates the associativity of each D-TLB level. The principle of chain formation is identical to that of the latency test (Benchmark #2), but here the offset between the chains is additionally increased by the stride size used when accessing each next element (the cache line size). Here you can see the reading of four TLB elements in case of two access "chains".




Stride Size

Stride Size in the dependent access chain, in bytes.

Minimal TLB Entries

Minimal TLB Entries used for reading cache lines in case of Variable Parameter = TLB Entries; TLB Entries in other cases.

Maximal TLB Entries

Maximal TLB Entries in case of Variable Parameter = TLB Entries.

Minimal Chains Count

Minimal Chains Count defines the minimal number of sequential dependent access chains in case of Variable Parameter = Chains Count; the number of sequential dependent access chains in other cases. 

Maximal Chains Count

Maximal Chains Count defines the maximal number of sequential dependent access chains in case of Variable Parameter = Chains Count.

Selected Tests

Selected Tests define the ways of testing.

Forward Access;

Backward Access;

Random Access.

Benchmark #6: Instruction Cache Test (I-Cache)




The sixth test estimates the efficiency of decoding/executing certain simple CPU instructions (ALU/FPU/MMX), as well as the operation of the L1 instruction cache and its associativity. This test is of special interest for estimating the effective Trace Cache size of Pentium 4 processors when decoding/executing various kinds of instructions. Test parameters:

Variable Parameter

Defines one of three types of this benchmark:

Block Size - dependence of decode bandwidth on the code block size (decode bandwidth is the rate at which the CPU reads, decodes and executes a sequence of instructions). The test method consists in creating a code block of a certain size on the fly (at runtime) and measuring the number of CPU clocks taken to execute it. The last instruction in the code block is always the return instruction (RET).

Chains Count - dependence of decode bandwidth on the number of sequential access chains. As in Benchmark #2, this allows estimating the associativity of the L1 instruction cache. Methodologically, the transitions between neighboring access chains, which correspond to different cache segments, are carried out with an unconditional jmp instruction. Below you can see the code execution graph (red arrows) in case of two chains (transition operations are marked with green arrows).




Prefixes Count - dependence of decode bandwidth for [pref]nNOP instructions on the number of prefixes used (pref = 0x66, operand-size override prefix).

Minimal Block Size, KB

Minimal Code Block Size, KB, in case of Variable Parameter = Block Size; Code Block Size in other cases.

Maximal Block Size, KB

Maximal Code Block Size, KB, in case of Variable Parameter = Block Size.

Minimal Chains Count

Minimal Chains Count defines the minimal number of sequential access chains in case of Variable Parameter = Chains Count; the number of sequential access chains in other cases.

Maximal Chains Count

Maximal Chains Count defines the maximal number of sequential access chains in case of Variable Parameter = Chains Count.

Minimal Prefixes Count, Maximal Prefixes Count

Minimal Prefixes Count, Maximal Prefixes Count in case of Variable Parameter = Prefixes Count. Unavailable in other cases.

Stride Size

Stride Size is the minimal size of the code executed in each chain, including the jump to the neighboring chain. It's recommended to set the stride size equal to the instruction cache line size.

Instructions Type

Instructions Type is a type of decoded/executable instructions:

ALU - arithmetic and logic integer operations using general-purpose registers;

FPU - some elementary and computing operations carried out by the floating-point unit (FPU);

MMX - arithmetic and logic integer operations using the CPU's MMX block.

Instructions Subtype

Instructions Subtype is a subtype of decoded/executable instructions. It depends on an instruction type selected. An instruction size in bytes is given in parentheses.
 

Instruction type   Instruction subtype   Operation
-----------------  --------------------  ----------------------------------
ALU                NOP (1)               nop
                   LEA (2)               lea eax, [eax]
                   MOV (2)               mov eax, eax
                   ADD (2)               add eax, eax
                   SUB (2)               sub eax, eax
                   OR (2)                or eax, eax
                   XOR (2)               xor eax, eax
                   TEST (2)              test eax, eax
                   CMP (2)               cmp eax, eax
                   SHL (3)               shl eax, 0
                   ROL (3)               rol eax, 0
                   XOR/ADD (4)           xor eax, eax; add eax, eax
                   CMP-0 (4)             cmp ax, 0x00
                   CMP-0 (6)             cmp eax, 0x00000000
                   CMP-8 (6)             cmp eax, 0x0000007f
                   CMP-16 (6)            cmp eax, 0x00007fff
                   CMP-32 (6)            cmp eax, 0x7fffffff
                   CMP-0 (8)             [rep][addrovr]cmp eax, 0x00000000
                   CMP-8 (8)             [rep][addrovr]cmp eax, 0x0000007f
                   CMP-16 (8)            [rep][addrovr]cmp eax, 0x00007fff
                   CMP-32 (8)            [rep][addrovr]cmp eax, 0x7fffffff
FPU                WAIT (1)              wait
                   FADD (2)              fadd st(0), st(1)
                   FMUL (2)              fmul st(0), st(1)
                   FSUB (2)              fsub st(0), st(1)
                   FSUBR (2)             fsubr st(0), st(1)
                   FCHS (2)              fchs
                   FABS (2)              fabs
                   FTST (2)              ftst
                   FXAM (2)              fxam
                   FCOM (2)              fcom st(1)
                   FCOMI (2)             fcomi st(0), st(1)
                   FST (2)               fst st(0)
                   FXCH (2)              fxch
                   FDECSTP (2)           fdecstp
                   FINCSTP (2)           fincstp
                   FFREE (2)             ffree st(0)
                   FFREEP (2)            ffreep st(0)
MMX                EMMS (2)              emms
                   MOVQ (3)              movq mm0, mm0
                   POR (3)               por mm0, mm0
                   PXOR (3)              pxor mm0, mm0
                   PADDD (3)             paddd mm0, mm0
                   PSUBD (3)             psubd mm0, mm0
                   PCMPEQD (3)           pcmpeqd mm0, mm0
                   PUNPCKLDQ (3)         punpckldq mm0, mm0
                   PSLLD (4)             pslld mm0, 0

Benchmark #7: Instruction Translation Lookaside Buffer Test (I-TLB)




The last RMMA benchmark measures the size and associativity of the Instruction Translation Lookaside Buffer (L1/L2 I-TLB). The test settings are identical to those of Benchmark #5:

Variable Parameter

Variable Parameter defines one of two types of the test modes:

TLB Entries - dependence of latency when accessing the L1i cache on the number of memory pages used.

Chains Count - dependence of latency when accessing the L1i cache on the number of sequential access chains at a given number of pages used.

Stride Size

Stride Size in the dependent access chain, in bytes. Strides are made with an unconditional jump (jmp). Here you can see forward sequential reading of four I-TLB elements in case of two access chains.




The last element marked with cross contains the return instruction (ret).

Minimal TLB Entries

Minimal TLB Entries used for reading L1i cache lines in case of Variable Parameter = TLB Entries; TLB Entries in other cases.

Maximal TLB Entries

Maximal TLB Entries in case of Variable Parameter = TLB Entries.

Minimal Chains Count

Minimal Chains Count defines the minimal number of sequential dependent access chains in case of Variable Parameter = Chains Count; the number of sequential dependent access chains in other cases. 

Maximal Chains Count

Maximal Chains Count defines the maximal number of sequential dependent access chains in case of Variable Parameter = Chains Count.

Selected Tests

Selected Tests define the ways of testing.
 

Forward Access;

Backward Access;

Random Access.

The latency estimated in this test is actually the latency of executing the instruction pair
mov ecx, address_value
jmp ecx
at varying counts and relative positions. Nevertheless, this characteristic is adequate for determining the structure of the I-TLB levels and their associativity.

Appendix 1: RightMark Memory Analyzer 2.5
Changes and Additional Information

The first comparison tests of various platforms (AMD K7/K8, Intel Pentium 4, Intel Pentium III / Pentium M) revealed weak points of RightMark Memory Analyzer 2.4. They are addressed in the new version of the test suite and described in this appendix. 

General Test Settings




The key change in this section is the automatic detection of both the L1 and L2 Cache Line Sizes. Earlier only the L1 cache line size was displayed, though both were measured. In theory and in practice, the effective L1 and L2 cache lines can have different sizes on some processors. Take the Intel Pentium 4: its effective L2 cache line size is 128 bytes (such a line is called dual-sectored and consists of two 64-byte sectors), which means that 128 bytes are transferred at a time over the L2-RAM bus (or L2-L3 and L3-RAM on the Pentium 4 XE).

That is why the D-Cache Bandwidth test underestimated the real bandwidth of the L2-RAM bus (L2-L3, L3-RAM) if the stride size was set equal to the automatically detected L1 cache line size, and we had to set the stride size manually to get objective scores. This problem is now solved: you can select a Minimal Stride Size equal to the line size of either the L1 or the L2 cache (see the screenshot above).
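The arithmetic behind the underestimation is easy to see: with a 64-byte stride every 128-byte dual-sectored L2 line is accessed twice, but the second access hits a line already in flight, so only half of the accesses actually trigger a bus transfer. A hedged C sketch (the function is ours, purely illustrative):

```c
#include <stddef.h>

/* Count distinct cache lines of size `line` touched when stepping
   through a block of `block` bytes with step `stride`. */
size_t lines_touched(size_t block, size_t stride, size_t line)
{
    size_t count = 0, last = (size_t)-1;
    for (size_t off = 0; off < block; off += stride) {
        size_t idx = off / line;
        if (idx != last) { count++; last = idx; }
    }
    return count;
}
```

For a 1 MB block and 128-byte L2 lines, both a 64-byte and a 128-byte stride touch all 8192 lines, but the 64-byte stride performs 16384 accesses to do so, so the per-access figure reported as "bandwidth" comes out roughly half the true bus bandwidth.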

The general test settings got one new parameter, Active CPU Index, which indicates the index of the active CPU (physical or logical) on which the main test thread runs. This option is of little use on ordinary SMP (and HT) systems, but it can successfully be used for studying performance differences between accesses to "native" and "alien" memory on systems with a distributed memory architecture (for example, in dual-processor AMD K8 platforms each CPU has its own memory).

The last change is the choice of three Memory Allocation methods: Standard, VirtualLock, and AWE. The first one uses malloc() and is not recommended for ordinary platform tests; it is intended mostly for testing operation under memory managers other than the standard Windows memory manager that offer certain advantages (for example, support for large 4MB memory pages). AWE yields the most reliable results, which is why it is the default method. 

Test 1: Memory Bandwidth

The changes mostly concern the memory read/write/copy procedures in all access modes (MMX, SSE, SSE2), including the variants that use Software Prefetch. A large loop unrolling factor optimizes these procedures for small block sizes (fitting in the L1 D-cache) and increases the measured real bandwidth of this cache level. 
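The effect of a large unrolling factor can be sketched in scalar C (RMMA's real procedures use MMX/SSE/SSE2 loads; this illustrates only the technique): processing eight elements per iteration with split accumulators amortizes the loop overhead and exposes independent operations to the CPU.

```c
#include <stddef.h>

/* Read-bandwidth inner loop unrolled by 8 with 4 independent
   accumulators; n must be a multiple of 8. */
long sum_unrolled(const long *p, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 8) {
        s0 += p[i]     + p[i + 4];
        s1 += p[i + 1] + p[i + 5];
        s2 += p[i + 2] + p[i + 6];
        s3 += p[i + 3] + p[i + 7];
    }
    return s0 + s1 + s2 + s3;
}
```

With a rolled-up loop, branch and counter overhead per load is highest exactly when the block fits in the L1 D-cache and each load itself is cheap, which is why unrolling raises the measured L1 bandwidth the most.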

Test 2: D-Cache Latency




The new parameter here is Pseudo-Random Read Latency, a mode which reduces the measured memory random-access latency. In this mode the dependent access chain is walked in random order within every memory page, while the pages themselves are accessed in forward order.




The first fact minimizes Hardware Prefetch interference, and the second almost completely prevents D-TLB misses. That is why the pseudo-random latency is much lower than the purely random latency (which is accompanied by a great number of D-TLB misses) and can be considered an objective memory latency parameter.
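A minimal sketch of building such a chain, assuming 4 KB pages and 64-byte cache lines (the layout and helper names are our assumptions, not RMMA's code): elements within each page are linked in shuffled order, but the pages themselves are traversed forward.

```c
#include <stdlib.h>

#define PAGE 4096
#define LINE 64
#define EPP  (PAGE / LINE)   /* elements (one per cache line) per page */

/* chain[i] holds the index of the next element to visit;
   the result is a single dependent cycle over pages*EPP elements. */
void build_pseudo_random_chain(size_t *chain, size_t pages, unsigned seed)
{
    size_t n = pages * EPP;
    size_t *order = malloc(n * sizeof *order);
    size_t k = 0;
    srand(seed);
    for (size_t p = 0; p < pages; p++) {        /* pages in forward order */
        size_t perm[EPP];
        for (size_t i = 0; i < EPP; i++) perm[i] = i;
        for (size_t i = EPP - 1; i > 0; i--) {  /* shuffle within the page */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < EPP; i++)
            order[k++] = p * EPP + perm[i];
    }
    for (size_t i = 0; i < n; i++)              /* link the dependent chain */
        chain[order[i]] = order[(i + 1) % n];
    free(order);
}
```

Walking the chain (`idx = chain[idx]` in a loop) makes each load depend on the previous one, so the random in-page order defeats the hardware prefetcher, while the forward page order keeps each page's translation resident in the D-TLB for the duration of its 64 accesses.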

Another change in this test is the altered procedure for building the dependence of the latency of the selected memory subsystem level (L1/L2 cache, RAM) on the walk size (Variable Parameter = Walk Size). In the previous RMMA version the real number of walks over the dependent access chain was calculated as the block size divided by the (variable) walk size, so walks got fewer as the walk size grew. In the new version the number of walks is fixed irrespective of the walk size (whose limits are defined by Minimal Walk Size and Maximal Walk Size): it is calculated as the block size divided by the Stride Size, which by default equals the L1 cache line size. 
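The arithmetic of the change can be sketched as follows (function names are ours, for illustration only):

```c
#include <stddef.h>

/* RMMA 2.4: the walk count shrank as the walk size grew. */
size_t walks_old(size_t block, size_t walk)   { return block / walk;   }

/* RMMA 2.5: the walk count depends only on the (fixed) stride size,
   so all points of the latency-vs-walk-size curve use the same
   number of accesses and are directly comparable. */
size_t walks_new(size_t block, size_t stride) { return block / stride; }
```

For a 4 MB block and a 64-byte stride, the new scheme always performs 65536 walks, whereas the old one performed only 32768 at a 128-byte walk size and 16384 at 256 bytes, which skewed the measured averages for large walk sizes.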
 

 
 
 

Dmitry Besedin (dmitri_b@ixbt.com)
 

Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.