RightMark Memory Analyzer 3.0: a New Version of the Universal CPU/Chipset/RAM Test

It's been almost two months since the first version of the universal RightMark Memory Analyzer 2.4 test was officially announced. It proved to be perfectly suitable for detailed study of low-level platform parameters. But as time went by, hardware and software advanced, and so did RightMark Memory Analyzer. Soon after, a modified second version appeared — RightMark Memory Analyzer 2.5, where the main focus was not so much on the test development as on the updating of aspects not well enough realised previously. But we can't say that the changes were made to suit some specific CPUs, not at all — the universal principles remained the same. The RMMA 3.0 is different: there are much more real innovations than minor corrections concerning test realisations, which is certainly good. Well, it's high time we examined its main parameters.

CPU Information

First of all, look at the window that opens at the test launch.

Its customary view has changed: there are now fewer tabs, though none of the test parameters has disappeared, they have just undergone a certain reorganisation. The first tab (CPU Info / CPU) opens by default when the test is launched. It contains detailed information about the CPU (or CPUs, if there are more than one of them). The information is similar to that given by popular CPU-identification programs, such as WCPUID and CPU-Z. We won't dwell on it going instead to the control options

CPU Index is the number of the active CPU (physical or logical) if the system has several physical (SMP) or logical (Hyper-Threading) ones. If the CPU index is changed, the information about the chosen CPU is automatically refreshed. This parameter also specifies the number of the active CPU (physical or logical) that will be used for executing the main test flow.

Though the option is virtually of no use for standard SMP systems (to say nothing of HT), it can be applied for examining system's behaviour differences when the CPU accesses "its own" memory and an "alien" one. Of course, memory architecture must be separate (such as two-processor AMD K8 platforms (Opteron), with each CPU possessing its own memory).

Save — CPU's identification info dump (CPUID) is saved into a binary file. Because this information is used to identify the CPU in the database (described further), we call on users of our test to save this information as well as test results. It will enable us to extend our CPU database, and the users will, in turn, have the opportunity to receive its regular updates.

Refresh — information about the CPU selected in CPU Index is refreshed for this and following tabs (Cache/TLB).

Now we're going to the next tab — CPU Info / Cache.

It includes full information about different CPU cache levels. Most parameters have featured and actual values. The former are received by deciphering the CPUID values, the latter are inserted from our data base that is an integral part of the benchmark.

Now we will dwell on the database. Its functioning principle is simple. Each CPU model has a unique data set provided by the CPUID instruction with different input parameters. These data include, first of all, a 12-letter manufacturer's identificator (e.g. — GenuineIntel, AuthenticAMD). Another important parameter is the so-called CPU signature (numbers of family, model, and CPU stepping taken as a single 32-bit value). Then come CPU features (such as various SIMD extensions) and finally, cache and TLB features. These parameters are enough to identify any given CPU model (though not a particularl CPU). And it would be useless, or even harmful to supply this "set of fingerprints" with such specific information as CPU's serial number (Intel Pentium III and Transmeta Crusoe CPUs have it), or its "name" (e.g. — Brand String), which often contains CPU frequency. Considering all this, the searching principle is very simple: if all specified CPUID values of a CPU coincide with those of an Nth processor from the base, then it is the CPU we're looking for, and we can insert all actual values right from the base. And on the contrary, if a CPU is not found in the base, these values are unaccessible — except CPU's L1/L2 line sizes which are identified automatically by the test itself when the application is launched.

Coming back to the features of CPU cache levels, additional parameters provided here (and unavailable in CPUID) are latency of the access to the cache level, bus width and its hierarchy (exclusive or inclusive).

The next tab (CPU Info / TLB) is similar to the previous one, it contains information about TLB buffers of the CPU.

The tab is organised in a similar way: featured values are taken from processor's CPUID data, while actual ones — from our database. Data on TLB sizes and associativity that use large memory pages (LP Entries/Associativity) are now absent from the database. It will be recalled that today's OSs still use small 4-KB memory pages making impossible an objective evaluation of the mentioned parameters for large 4-MB pages (by means of direct measurement). At the same time, we can estimate a miss penalty value for each TLB level — the value of an increased latency of access to CPU's L1 cache if the corresponding TLB level misses. These are the values that are inserted from our database.

And finally, this is what the information about the database itself looks like.

Field names are understandable: database version (3.0 stands for the program version itself, the next figure — for the revision number that will increase with each new refreshing, and the last unit— for the number of entries in the database). The number of entries is also shown in a separate line, as is the current CPU status. The latter, in our case, was found in the database as it had been the first to be placed there (ID=0).

General Settings

This sector contains general settings of all subtests realized in RMMA 3.0.

Memory Allocation

Chooses a way to allocate memory needed for test executions.

Standard — a standard method of memory block selection "in heap" with the help of malloc. It is suitable for almost any OS.

VirtualLock — a method that works with the virtual memory using the procedures realised in Windows NT-class OSs. A memory block of a needed size is reserved by invoking VirtualAlloc. Then the VirtualLock function is invoked at the address of the selected memory region, which guarantees that all data from this region will be in the physical memory, that is, later accesses to this region won't cause page fault.

AWE— this method uses Address Windowing Extensions available in Windows 2000/XP/2003 Server. This memory allocation method is more reliable for some tests such as cache associativity. Usage of AWE requires Lock Pages in Memory privilege which is not available by default. To get this privilege make the following steps:

Enter the system at the administrative access level;
Launch Local Security Policy from Administrative Tools ;
Select Security Settings -> Local Policies -> User Rights Assignment ;
Select Lock pages in memory policy and add a user or group name (e.g., Administrators ).
Reenter the system to apply this policy.

Data Set Size

The total size of data to be read/recorded when measuring every next pixel. Every pixel is measured four times, then the minimal result (in CPU clocks) is chosen. It provides higher repeatability. So, if you measure the memory bandwidth by reading 1MB units the Data Set Size equal to 128 MB means that it takes 4 measurements with 32 reading iterations. A higher Data Set size provides a more reliable result (smoother lines), but respectively increases the test time.

Thread Lock

In general, every test is started in the main stream which has the highest priority (realtime) to prevent effects from other running processes. It concerns only uniprocessor systems, though most user systems are such. In case of SMP or Hyper-Threading systems additional processors can have a noticeable effect on the test scores. This option locks other processes in SMP systems to increase precision and reliability of the measurement data. At the same time, it's not recommended that you use this option in Hyper-Threading systems as it makes its own great effect. The ideal test condition for Hyper-Threading systems is the minimal possible system load. All applications including those with the lowest priority should be closed.

Logarithmic Y Scale

Applies a logarithm scale on the ordinate axis. A linear scale is set by default.

White Background

It allows using a white background in graphic representation of test results during the test execution and in the report BMP file, instead of the default black one. This option is needed for more convenient test results printout.

Create Test Report

This option determines whether a report will be created on completion of the test. The report includes two files with textual (MS Excel CSV) and graphic (BMP) representation of test results.

Sequential test execution (Batch)

The tests are executed sequentially as it's more convenient, in particular, for streaming testing of a great number of systems with the same test suite. RMMA supports the following operations with a batch:

Delete — deletes a test selection from the batch;
Clear — clears the whole test suite;
Load — loads a saved suite from a file;
Save — saves a current suite into a file.

Press Add to Batch to add individual tests whatever subtest is running.

Description of RMMA 3.0 Tests

Now we are going to describe tests for the most important platform features. Initially, tests of the project were supposed only to measure those features, and not to compare platforms with each other.

Measuring Tests

We can divide RMMA 3.0 tests into measuring and competitive. The first group includes almost all the tests you came across in previous versions of the project. They measure values that sometimes can't be compared with one another. These values are:

Average and maximal memory bandwidth;
Average and minimal latency of L1/L2 data cache and RAM;
L1/L2 cache associativity;
Actual bandwidth of L1-L2 data cache bus;
Size and associativity of every D-TLB level;
Size (including the effective one) and associativity of L1 instructions cache;
Effectiveness of decoding of ALU/FPU/MMX instruction sets;
Size and associativity of every I-TLB level.

It's very difficult to compare even such elemental values as maximal memory bandwidth and minimal memory latency, because different platforms use different memory types. But this is just half the trouble, the other one is that the values depend very much on the way this or that CPU microarchitecture is realised (the most important parameter here being the cache line size). Thus, different approaches are needed to measure the values.

To be precise, let's formulate the principles that refer the values to the measurable class, not to the comparable.

Depending on the instructions and algorithms used for measurements, the values of the parameters can be absolutely different.
The algorithms and instructions suitable for receiving the best possible test results of these parameters are nor frequently used for real sotfware programming. Besides, the "purely theoretical" parameters not always manifest themselves in real applications (for instance, high latencies "camouflage" in reality).
The algorithms and instructions suitable for receiving the best possible values on one CPU architecture are not always such for another one.

In this RMMA version the measuring tests were referred to "microarchitectural" tests. As a result, a new tab was added with a corresponding name. And this is where all the tests realised in previous versions are "concealed". It will be recalled that each test enables to specify user's settings or to select one of the presets. The latter were introduced to make viewing of test options more convenient and to allow comparing different-class systems in the same conditions. When you select a preset the option to change test parameters is blocked for safety reasons.

Benchmark #1: Memory BW

The first benchmark estimates an actual memory bandwidth of L1/L2/L3 data caches and RAM. This test measures time (in CPU clocks) during which a data block of a certain size (which can vary or stay fixed) was fully read/recorded/copied using some or other CPU registers (MMX, SSE, or SSE2).

The important loop unrolling factor allows to optimize operating modes concerning small block sizes (L1 data cache region) and to increase the values of the cache level's real bandwidth.

In case of reading and writing the test also allows for various optimisations - Software Prefetch or Block Prefetch - in order to reach the Maximal Real Read Bandwidth. The scores are calculated in bytes transferred to the CPU (from CPU) at one clock, as well as in MB/s. Below you can see settings of the first benchmark.

Variable Parameter

Selection of one of three test modes:

Block Size - dependence of an actual memory bandwidth on data block size;
PF Distance - dependence of an actual memory bandwidth on prefetch length in Software Prefetch method. This mode is developed for reaching the maximal real read bandwidth and it's recommended only for large data blocks (larger than the overall data cache size);
Block PF Size - dependence of an actual memory bandwidth on block prefetch size in one of two Block Prefetch methods. Similarly to PF Distance, it's recommended only for large data block sizes.

Minimal Block Size

Minimal Block Size, KB, in case of Variable Parameter = Block Size; block size in other cases.

Maximal Block Size

Maximal Block Size, KB, in case of Variable Parameter = Block Size.

Minimal PF Distance

Minimal Software Prefetch Distance, in bytes, in case of Variable Parameter = PF Distance; Software Prefetch Distance in other cases. 0 means that the Software Prefetch mode is disabled.

Maximal PF Distance

Maximal PF Distance (for Software Prefetch) in case of Variable Parameter = PF Distance.

Minimal Block PF Size

Minimal Block PF Size, KB, in case of Variable Parameter = Block PF Size; Block Prefetch size in other cases. This parameter makes sense only for the Block Prefetch methods (1, 2) described below.

Maximal Block PF Size

Maximal Block PF Size, KB, in case of Variable Parameter = Block PF Size.

Stride Size

Stride Size in operations of reading data into cache in Block Prefetch methods (1, 2), in bytes. For more reliable results this parameter must correspond to the cache line size. That is why in this and other subtests this parameter is selected by default, like L1 Cache Line, which means using L1 cache line width value either inserted from the database or automatically identified by the program at launch.

CPU Register Usage

CPU Register Usage - selection of registers for fulfilling read/write operations ( 64-bit MMX, 128-bit SSE and 128-bit SSE2 ).

Read Prefetch Type

Read Prefetch Type defines a type of instructions used for Software Prefetch (PREFETCHNTA/T0/T1/T2); also, it enables one of Block Prefetch modes needed for taking measurements at Variable Parameter = Block PF Size. Block Prefetch 1 uses line readsets from memory to block prefetch of a certain size using MOV instructions and is recommended for AMD K7 family (Athlon/Athlon XP/MP). At the same time, in the Block Prefetch 2 method data are read with one of the Software Prefetch instructions (PREFETCHNTA). This method is recommended by AMD for K8 family (Opteron/Athlon 64/FX).

Non-Temporal Store

Non-Temporal Store - direct memory access (write combining protocol) at writing. This access method writes data into memory without prereading of old data into the CPU cache levels system (without using the write allocate mode). It saves the CPU cache from unneeded data, in particular, in case of copy operations.

Copy-to-Self Mode

Data block is copied to the same memory region where the copy block is located, i.e. the memory content doesn't actually change. This option is not enabled by default, and data copied are shifted by the offset equal to the transferred data block size. Since in this mode write operations completely get into the cache, this benchmark tests memory's ability to read data after writing (read around write). In this case the cache memory is utilized to a greater degree and the benchmark turns out to be much lighter for the memory subsystem. Note that the Non-Temporal Store and Copy-to-Self modes are incompatible.

Selected Tests

Selected Tests define the memory access ways.

Read Bandwidth - real memory bandwidth at reading;
Write Bandwidth - real memory bandwidth at writing;
Copy Bandwidth - real memory bandwidth at copying.

Benchmark #2: Latency/Associativity of L1/L2 Data Cache (D-Cache Lat)

The second benchmark estimates the average/minimal latency of L1/L2 data cache and memory, L2 cache line size and L1/L2 data cache associativity. Below are its parameters and operating modes.

Variable Parameter

There are 4 types of this test:

Block Size defines dependence of cache/memory latency on a block size. This test mode demonstrates latency of various memory regions - L1, L2, L3 (if it exists) caches or RAM. A dependent access chain is created in the allocated memory, with each element containing an address of the following one. At every full read iteration stage every chain element will be addressed only once. The number of the chain elements is equal to the block size divided by the Stride Size (see below). If the stride size corresponds to the cache line length, the block size is a real characteristic of the number of data read (because data are read from RAM to L2 or from L2 to L1 line by line). The block sizes less or equal to the L1 cache allow estimating the load-use latency when accessing the L1 cache; the block size within the range (L1..L1+L2) or (L1..L2) estimates the L2 cache latency depending on the cache architecture (exclusive or inclusive), and finally (since an L3 cache is rarely used), the block size greater than L1+L2 estimates latency when accessing RAM. The chain elements execution order depends on the test method (see below).

Forward Read Latency method starts from the first element and goes through to the last one which contains the first element's addresses thus allowing to repeat the operation multiple times. For example, the picture below shows a forward reading of an 8-element chain.

In case of Backward Read Latency the first element contains the last one's address, and reading goes from the last element to the first one. Random Read Latency test selects elements on a random basis, but the condition of selecting one element once at a reading does not change. In versions since 2.5, a new memory reading mode was added to the test. It was called Pseudo-Random Read Latency and designed to reduce memory random-access latency.
In this mode the dependent access chain is read randomly within every memory page, but the memory pages are accessed in the forward manner.

The first fact minimizes the Hardware Prefetch interference, and the second nearly completely prevents D-TLB misses. That is why the pseudo-random latency is much lower than the random latency (which goes along with a great number of D-TLB misses) and can be considered an objective memory latency parameter.
Walk Step Size — a dependence of cache latency on the step size. The number of steps is fixed (irrespective of the walk size the borders of which are defined by Minimal Walk Size and Maximal Walk Size). It's calculated as the block size divided by the Stride Size which equals L1 line size by default. This procedure enables to identify the cache segment size (a parameter directly related to its associativity).
Segments Count — a dependence of cache latency on the number of successive L1/L2 data cache segments, that enables to estimate associativity of the L1/L2 data cache. This test variant executes reading of one and the same dependent access chain. The only difference between the multi-segment and the single-segment versions is that in the first case data are read from different memory regions (their number is equal to the number of segments), and the offset between them is a multiple of the cache segment size. Have a look at the forward reading of an array that contains 4 chains.

To estimate such important processor cache parameter as the associative level you should gradually increase the number of dependent access chains while maintaining the block size minimal. This fact proves that it's simple to "do harm" to the CPU cache, and it's not needed to pack it up with data. Actually, to make a "breach" in the n-way set associative cache, you just have to read n cache lines at the addresses having the offset being a multiple of the cache segment size. This is what makes this test. For example, to show inconsistency of the Pentium 4 L2 cache of 512K, with the associative level of n = 8 and a 64 bytes line one has to read only 8 x 64 = 512 bytes, i.e. it's needed to take less than 0.1% of its size(!). The minimal cache segment size in the current test version is equal to 1MB. Such a large value guarantees that the test will correct define the L2/L3 cache associative level even in systems having a large cache (note that the cache segment equal to 1MB corresponds to a 8MB L3 data cache with the associative level of 8).
NOP Count - dependence of latency of the memory region selected (L2 cache or memory) on the number of voids between two successive accesses to the region selected (L2 cache or memory). These operations called NOP are not related to the cache access but they bring a fixed time gap between two successive accesses to different cache/memory lines. It unloads the data bus between L1-L2 or L2-RAM to make the latency in accessing a selected memory area as low as possible. In the current RMMA version such NOPs are based on x86 ALU or eax, edx ( eax stores the chain element address, and edx is initialized with 0); this command suits well for testing a good deal of modern processors.

Stride Size

Stride Size, bytes, in the dependent access chain.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; Block size in other cases.

Maximal Block Size, KB

Maximal Block Size, KB, in case of Variable Parameter = Block Size.

Minimal NOP Count

Minimal NOP Count in case of Variable Parameter = NOP Count; NOP count in other cases.

Maximal NOP Count

Maximal NOP Count in case of Variable Parameter = NOP Count.

Minimal Segments Count

Min Segments Count — a minimal number of successive cache segments in case of Variable Parameter = Segments Count; the number of successive cache segments in other cases.

Maximal Segments Count

Max Segments Count — a maximal number of successive cache segments in case of Variable Parameter = Segments Count; the number of successive cache segments in other cases.

Minimal Walk Step Size

Minimal Walk Step Size, bytes, in the dependent access chain in case of Variable Parameter = Walk Step Size.

Maximal Walk Step Size

Maximal Walk Step Size, bytes, in the dependent access chain in the case of Variable Parameter = Walk Step Size.

Latency Measurement

Latency Measurement technique (the parameters can be configured only if Variable Parameter = NOP Count). In Method 1 an ordinary dependent access chain with a varying number of NOPs (see above ( edx = 0)) is used to determine the minimal latency:

// loading of next chain element
mov eax, [eax]
// bus unloading, varying number of NOPs
or eax, edx
...
or eax, edx

Nevertheless, in some cases (if the speculative loading works effectively) the minimal cache latency may not be achieved. For such cases there's an alternative RMMA method (Method 2) which uses a different chain read code (ebx = edx = 0):

// bus unloading, fixed number of NOPs
add ebx, edx
...
add ebx, edx
// loading of next chain element
mov eax, [eax+ebx]
and ebx, eax
// bus unloading, varying number of NOPs
add ebx, edx
...
add ebx, edx

Selected Tests

Selected Tests define the memory access ways when testing latency:

Forward Read Latency
Backward Read Latency
Random Read Latency
Pseudo-Random Read Latency

Benchmark #3: Real L1/L2 Data Cache Bus Bandwidth (D-Cache BW)

This benchmark estimates a real L1-L2 cache bus bandwidth (or L2-RAM bus bandwidth). It's the simplest test in RMMA regarding its configuration. It's based on the method used in the real L1/L2/RAM bus bandwidth test (Benchmark #1). But in this case memory read/write operations are carried out line by line, i.e. with the stride equal to the cache line size and with CPU's ALU registers. Both forward and backward access modes are supported. Test parameters are as follows:

Variable Parameter

Variable Parameter defines one of two test modes:

Block Size - dependence of a real data bus bandwidth on the data block size;
Stride Size - dependence of a real L1-L2 or L2-RAM bus bandwidth on a stride size. This mode is the second way to calculate a cache line length.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; Block Size in other cases. A value less than 1.5 times L1 cache will yield senseless results. This test doesn't estimate a bandwidth of the L1-LSU-registers tandem because loading of data from L1 into LSU (Load-Store Unit) and then to CPU registers is not fulfilled line by line. To estimate the L1-LSU bandwidth it's better to run the first test (Memory BW) within the range of block sizes which can get into the L1 cache.

Maximal Block Size, KB

A value lower than the L2 cache size (inclusive cache architecture) or L1+L2 (exclusive cache architecture) allows estimating a real L1-L2 bus bandwidth. In case of the Block Size values ranging from L1+L2 to some greater value this benchmark estimates the maximal real memory bandwidth at reading/writing of full cache lines, which in some cases turns out to be greater than the maximal real bandwidth in case of total data reading/writing.

Minimal Stride Size

Minimal Stride Size in cache accessing at reading/writing in case of Variable Parameter = Stride Size; Cache access Stride Size in other cases.

Maimal Stride Size

Minimal Stride Size in cache accessing at reading/writing in case of Variable Parameter = Stride Size;

Selected Tests

Selected Tests define a type of measurements.

Forward Read Bandwidth - forward sequential cache line reading;
Backward Read Bandwidth - backward sequential cache line reading;
Forward Write Bandwidth - forward sequential cache line writing;
Backward Write Bandwidth - backward sequential cache line writing;

Benchmark #4: L1/L2 (D-Cache) Arrival

The fourth benchmark estimates bus realisation features (bit capacity, multiplexing) of the CPU's L1-L2 data cache. This test actually measures the total latency of two accesses to the same cache line, which are separated by a certain value. The measurement method is identical to the one in Method #2 except the fact that two consecutive chain elements are located in the same cache line.

Besides, the fourth test can be used to estimate L1/L2 cache line size (this is the test that is used for its calculation at program launch if the CPU model hasn't been found in the database. These are the test parameters:

Variable Parameter

Variable Parameter defines one of five test types:

Block Size - dependence of the total latency on the block size.
NOP Count - dependence of the total latency on the number of NOPs between successive accesses to different cache lines.
SyncNOP Count - dependence of the total latency on the number of NOPs between successive accesses to the same cache line.
1st DW Offset - dependence of the total latency on the first word offset within the cache line.
2nd DW Offset - dependence of the total latency on the second word offset within the cache line.

Minimal Block Size, KB

Minimal Block Size, KB, in case of Variable Parameter = Block Size; total Block Size in other cases.

Maximal Block Size, KB

Maximal Block Size, KB, in case of Variable Parameter = Block Size.

Minimal NOP Count

Minimal NOP Count defines the minimal number of NOPs between two successive accesses to adjacent cache lines in case of Variable Parameter = NOP Count; the number of NOPs between two successive accesses to adjacent cache lines in other cases.

Maximal NOP Count

Maximal NOP Count defines the maximal number of NOPs between two successive accesses to adjacent cache lines in case of Variable Parameter = NOP Count.

Minimal SyncNOP Count

Minimal SyncNOP Count defines the minimal number of NOPs between two successive accesses to the same cache line in case of Variable Parameter = SyncNOP Count; the number of NOPs between two successive accesses to the same cache line in other cases.

Maximal SyncNOP Count

Maximal SyncNOP Count defines the maximal number of NOPs between two successive accesses to the same cache line in case of Variable Parameter = SyncNOP Count.

Stride Size

Minimal Stride Size, in bytes, in the dependent access chain between two successive accesses to consecutive cache lines.

Minimal 1st Dword Offset

Minimal 1st Dword Offset within the cache line, in bytes, in case of Variable Parameter = 1st DW Offset; 1st Dword Offset within the cache line in other cases.

Maximal 1st Dword Offset

Maximal 1st Dword Offset within the cache line in case of Variable Parameter = 1st DW Offset.

Minimal 2nd Dword Offset

Minimal 2nd Dword Offset within the cache line, in bytes, in case of Variable Parameter = 2st DW Offset; 2st Dword Offset within the cache line in other cases. The 2nd DW Offset is calculated relative to the 1st Dword offset modulo stride size (cache line size): 2nd_Dword_Offset = (2nd_Dword_Offset + 1st_Dword_Offset) % Stride_Size

Maximal 2nd Dword Offset

Maximal 2nd Dword Offset within the cache line in case of Variable Parameter = 2nd DW Offset.

Selected Tests

Selected Tests defines a way of testing the latency of the double access:

Forward Two-Dword Read Latency
Backward Two-Dword Read Latency
Random Two-Dword Read Latency
Pseudo-Random Two-Dword Read Latency

Benchmark #5: Pipeline and Instruction Cache Test
(Decode Bandwidth)

The fifth test estimates how effectively the CPU decodes/executes a set of simple instructions (ALU/FPU/MMX), as well as how effective the functioning of L1 instruction cache is. It also helps to estimate the effective size of Pentium 4 CPU's Trace Cache during instruction decodings/executions. Below are the test parameters.

Variable Parameter

Variable Parameter defines one of two types of this benchmark:

Block Size - dependence of decode bandwidth on the code block size (Decode Bandwidth is the speed of sequence of operations of reading, decoding and execution of instructions by the CPU). The test method includes on-the-fly creation of a code block of a certain size on the fly (in runtime) and measurement of CPU clocks taken for its execution. The last instruction in the code block in all cases is the return instruction (RET).
Prefixes Count - dependence of decode bandwidth for [pref] n NOP instructions on the number of prefixes used (pref = 0x66, operand-size override prefix).

Minimal Block Size, KB

Minimal Code Block Size, KB, in case of Variable Parameter = Block Size; Code Block Size in other cases.

Maximal Block Size, KB

Maximal Code Block Size, KB, in case of Variable Parameter = Block Size.

Minimal Prefixes Count, Maximal Prefixes Count

Min Prefixes Count, Max Prefixes Count are the minimal and maximal numbers of prefixes in the case of Variable Parameter = Prefixes Count.

Instructions Type

Type of the instructions being decoded/executed:

ALU - arithmetic and logic integer operations using general-purpose registers;
FPU - some elementary and computing operations carried out by the floating-point unit (FPU);
MMX - arithmetic and logic integer operations using the CPU's MMX block.

Instructions Subtype

Instructions Subtype is a subtype of decoded/executable instructions. It depends on the selected instruction type. An instruction size in bytes is given in parentheses.

Instruction type	Instruction subtype	Operation
ALU	NOP (1) LEA (2) MOV (2) ADD (2) SUB (2) OR (2) XOR (2) TEST (2) CMP (2) SHL (3) ROL (3) XOR/ADD (4) CMP-0 (4) CMP-0 (6) CMP-8 (6) CMP-16 (6) CMP-32 (6) CMP-0 (8) CMP-8 (8) CMP-16 (8) CMP-32 (8)	nop lea eax, [eax] mov eax, eax add eax, eax sub eax, eax or eax, eax xor eax, eax test eax, eax cmp eax, eax shl eax, 0 rol eax, 0 xor eax, eax; add eax, eax cmp ax, 0x00 cmp eax, 0x00000000 cmp eax, 0x0000007f cmp eax, 0x00007fff cmp eax, 0x7fffffff [rep][addrovr]cmp eax, 0x00000000 [rep][addrovr]cmp eax, 0x0000007f [rep][addrovr]cmp eax, 0x00007fff [rep][addrovr]cmp eax, 0x7fffffff
FPU	WAIT (1) FADD (2) FMUL (2) FSUB (2) FSUBR (2) FCHS (2) FABS (2) FTST (2) FXAM (2) FCOM (2) FCOMI (2) FST (2) FXCH (2) FDECSTP (2) FINCSTP (2) FFREE (2) FFREEP (2)	wait fadd st(0), st(1) fmul st(0), st(1) fsub st(0), st(1) fsubr st(0), st(1) fchs fabs ftst fxam fcom st(1) fcomi st(0), st(1) fst st(0) fxch fdecstp fincstp ffree st(0) ffreep st(0)
MMX	EMMS (2) MOVQ (3) POR (3) PXOR (3) PADDD (3) PSUBD (3) PCMPEQD (3) PUNPCKLDQ (3) PSLLD (4)	emms movq mm0, mm0 por mm0, mm0 pxor mm0, mm0 paddd mm0, mm0 psubd mm0, mm0 pcmpeqd mm0, mm0 punpckldq mm0, mm0 pslld mm0, 0

Benchmark #6: I-Cache Latency Test

The sixth test was only introduced into this RMMA 3.0 version. It measures latency of the access to L1i/L2 cache/RAM in the code execution condition, and it also estimates L1 I-cache and unified L2 I/D-cache associativity. The test works according to the following principle. In a selected memory block, a dependent chain is constructed, where each of the elements contains a jump-to-the-next instruction (jmp) with the stride specified by Stride Size. The last element contains a return instruction (ret). When an associativity test is run, reading in the "multi-segment" mode first goes through an Nth element of each segment and then jumps to the (N+1)th element of the main segment.

The scheme taken from the D-cache associativity test is also suitable for this case. The only difference is that jump instructions (jmp) are used instead of dependent data-loading ones (mov eax, [eax]). Here are the test parameters:

Variable Parameter

Variable Parameter selects one of the two test variants:

Block Size — dependence of jump-instruction execution latency on the block size;
Segments Count — dependence of jump-instruction execution latency on the number of successive cache segments, which enables to estimate its associativity.

Stride Size

Stride Size, bytes, in the dependent code-chain.

Minimal Block Size

Minimal code-block size, KB, in case of Variable Parameter = Block Size; code-block size in other cases.

Maximal Block Size

Maximal code-block size, KB, in the case of Variable Parameter = Block Size.

Minimal Segments Count

Min Segments Count — the minimal number of successive cache segments in the case of Variable Parameter = Segments Count; the number of successive cache segments in other cases.

Maximal Segments Count

Max Segments Count — the maximal number of successive cache segments in the case of Variable Parameter = Segments Count.

Jump Type

A type of the jump instruction:

Near — the nearer jump (jmp near +relative_offset);
Far — the farther jump (mov ecx, absolute_offset; jmp ecx).

Selected Tests

Selected Tests defines reading modes at latency testing:

Forward Read Latency
Backward Read Latency
Random Read Latency
Pseudo-Random Read Latency

Benchmark #7: Data Translation Lookaside Buffer Test (D-TLB)

The test defines the size and associative level of the Translation Lookaside Buffer (L1/L2 D-TLB). Actually, it measures latency when accessing the L1 cache provided that every next cache line is loaded from the next, not the same, memory page.

(The memory page size in real operating systems is much larger (e.g. 4096 bytes) than in our scheme which only houses 4 cache lines).

So, if the number of pages used is less than the TLB size the test calculates L1 cache's own latency (TLB hit). Otherwise, it measures the L1 cache latency in case of TLB miss. Note that the Maximal TLB Entries mustn't be greater than the number of L1 cache lines, otherwise the graph will have a jump related with the transition from L1 to L2, but not with the D-TLB structure size. But the overall size of TLB levels is always less than the number of cache lines which can be put into the L1 cache. Test settings:

Variable Parameter

Variable Parameter defines one of two test modes.

TLB Entries - dependence of latency when accessing the L1 cache on the number of memory pages used.
Segments Count — dependence of latency when accessing the L1 cache on the number of sequential "TLB segments" at a given number of pages used for estimation of the associative level of each D-TLB level. The principle of "multi-segment" dependent access chain formation is identical to the one used in the latency test (Benchmark #2), but in this case the value equal to the stride size when accessing each next element (cache line size by default) is added to the offset between the segments. Here you can see reading of four TLB elements in the "two-segment" access mode.

Stride Size

Stride Size in the dependent access chain, in bytes.

Minimal TLB Entries

Minimal TLB Entries used for reading cache lines in case of Variable Parameter = TLB Entries; TLB Entries in other cases.

Maximal TLB Entries

Maximal TLB Entries in case of Variable Parameter = TLB Entries.

Minimal Segments Count

Min Segments Count defines the minimal number of successive segments in the dependent access chain in the case of Variable Parameter = Segments Count; the number of segments in other cases.

Maximal Segments Count

Max Segments Count defines the maximal number of successive segments in the case of Variable Parameter = Segments Count.

Selected Tests

Selected Tests defines a way of testing:

Forward Access
Backward Access
Random Access

Benchmark #8: Instruction Translation Lookaside Buffer Test (I-TLB)

The eighth test measures the size and the associativity of instructions translation lookaside buffer (L1/L2 I-TLB). Like in the D-TLB case, the test measures L1 I-cache access latency provided that every next cache line is loaded from the next, not the same, memory page. All the parameters (except one additional) are the same as in Benchmark #7.

Variable Parameter

Variable Parameter defines one of two types of the test modes:

TLB Entries - dependence of latency when accessing the L1i cache on the number of memory pages used.
Segments Count - dependence of latency when accessing the L1i cache on the number of sequential segments at a given number of pages used.

Stride Size

Stride Size, bytes, in the dependent data chain. Strides are made with a jump (jmp). Here you can see forward sequential reading of four I-TLB elements in the two-segment access mode.

The last element, marked with a cross, contains the return instruction (ret).

Minimal TLB Entries

Minimal TLB Entries used for reading L1i cache lines in case of Variable Parameter = TLB Entries; TLB Entries in other cases.

Maximal TLB Entries

Maximal TLB Entries in case of Variable Parameter = TLB Entries.

Minimal Segments Count

Minimal Segments Count defines the minimal number of sequential segments in case of Variable Parameter = Segments Count; the number of sequential segments in other cases.

Maximal Segments Count

Maximal Segments Count defines the maximal number of sequential segments in case of Variable Parameter = Segments Count.

Jump Type

A type of the jump instruction:

Near — the nearer jump (jmp near +relative_offset);
Far — the farther jump (mov ecx, absolute_offset; jmp ecx).

Selected Tests

Selected Tests defines the ways of testing:

Forward Access
Backward Access
Random Access

Competitive Tests

Now we have come to the second test type — competitive tests that give out comparable values on different platforms. There doesn't seem to be much to say, as this type comprises an overwhelming majority of performance tests conducted on a system component (and RAM is no exception here). They also include the new project we're developing — CPU RightMark that we described in a separate article. Competitive memory tests are based on a single principle: a full independence of those "crucial low-level platform parameters" (and first of all, cache line size) that we measure in the first type of tests. From the point of view of code organisation, a competitive test must be something like a reference code of a typical streaming C/C++ task, written without assembler optimisations and compiled by a good compilor.

Benchmark #9: RAM Performance Test

In fact, there is only one RAM performance test for the moment. We decided not to complicate things and realised STREAM tests, popular, comprehensible and possible in the source code form. These are the most important parameters of the test:

Minimal Block Size

Min Block Size, KB, is the minimal size of the array.

Maximal Block Size

Max Block Size, KB, is the maximal size of the array. The real memory size used in the test is larger and depends on the measurement method (see below).

Selected Tests

Selected Tests define performance test variants:

Float Copy — copies the array of float-point numbers (the data type hereinafter is 64-bit double):
a[i] = b[i];
Float Scale — multiplies the element of the first array by the constant and saves the element in the second array:
a[i] = b[i] * k;
Float Add — sums two arrays elementwise and saves the element in the third array:
a[i] = b[i] + c[i];
Float Triad — Float Add and Float Scale, together. The element of the first array is added to the second one's multiplied by the constant, and is saved in the third array:
a[i] = b[i] + c[i] * k;

It is clear that two first tests (Float Copy, Float Scale) coincide with two last ones (Float Add and Float Triad) in terms of the load on the memory subsystem. Indeed, there are two accesses to the memory by the first test group (reading and writing), and three more by the second group (reading, reading, and writing). The test result is the so-called "full" real memory bandwidth, which was obtained considering the fact of several accesses. Another important thing is that test configurations specify only the array size, not the size of the whole memory block. That is, real memory utilisation increases twofold (Float Copy, Float Scale) and threefold (Float Add and Float Triad).

Benchmark #10: RAM Stability Test

The article on the new RMMA 3.0 version wouldn't be complete without mentioning the last tab entitled RAM Stability.

The test estimates stability of memory work in the given configuration (mainly concerning the timings). This is, probably, the simplest test of all realised in RMMA 3.0. Besides, it is nothing new: it just measures latency of a random access to a large memory block (its size is specified by Block Size, KB). It is different from a stndard "microarchitecture" latency test in the duration of a pretty heavy load on the memory subsystem as the test settings (Test Time, minutes) allow to select a long testing period.

Considering the fact that the address of each next element in the chain is accessed from RAM, any fault in its work will cause the wrong value to appear, which, in turn, will lead it to the point that the CPU will generate an exception. But this test will inform the user on it as exit/halt test completion is written down in the report. To make assurance double sure, the test also makes a write into a log file (Stability.log) at the interval specified by the user (Log File Interval, seconds). A value written in the file corresponds to random access latency measured at this particular moment.

Dmitri Besedin (dmitri_b@ixbt.com)

15.04.2004

Write a comment below. No registration needed!

Article navigation: