Finding Timer Resolution and Overhead
Measuring the resolution and overhead of your timer is important - it tells you
how many times to repeat an operation to get reliable timings. Increasingly, the
overhead of calling the timer is negligible, but resolution is far from it, and on
increasingly fast processors a significant amount of computation can be
performed in a single clock tick. This is not the hardware system clock that sends
out the drumbeat at whatever GHz rating your box has. Instead it is the clock
that you can actually call from a C/C++/Fortran code.
Measuring the Resolution of a Timer
To find timer resolution, use the computer equivalent of the Millikan Oil Drop
experiment.
Time an operation that takes some small amount of
time, and repeat it several times. If that operation is close to the resolution,
some timings will show zero while others will show up as taking a small
amount of time, which is an integer multiple of the clock resolution.
Actually, the difference between any two timings is an integer multiple of the
clock resolution, so even
if the timer is high resolution and no zero times occur, all of the timings
obtained will be multiples of the clock resolution. So take several measurements,
plot them, and look for them to line up on horizontal bands.
An example algorithm using an elapsed time clock and integer addition is:
initialize:
    nsamples = 3333333
    noperations = 7
iterate:
    for k = 1, ..., nsamples
        time_start = clock_time()
        sum = 0
        for i = 1, ..., noperations
            sum = sum + 1
        end for
        time_end = clock_time()
        time_required = time_end - time_start
        write out time_required
    end for
If roughly 50-90% of the times are nonzero, then the smallest positive time is
probably the clock resolution. A good value for noperations is machine dependent.
Start out with noperations = 100, but don't be surprised if a value like 100000 or larger is
needed. A fast machine with a low-resolution clock will require many more operations
to get the clock to tick over.
On some modern systems the clock has high enough resolution that even setting
noperations = 0 will not cause any zero timings to appear. If that
is the case, function call overhead costs dominate and life is good.
Even in this case the resolution can still be estimated using the measured timings.
Example 1:
For timings of
[ 1.00 2.00 1.75 0.50 1.25 3.25 ]
seconds,
the resolution is at least as small as 0.25 seconds because that is the smallest
value that all of the timings are integer multiples of. Resolution could
be smaller than 0.25, e.g.,
the timings shown are also integer multiples of 0.05. Without an additional timing
measurement that gives something like 6.10 seconds, it can only be stated that
0.25 is an upper bound on (and an integer multiple of) the timer resolution.
Example 2:
If the vector of measured timings is
[0 1.953125e-3 1.953125e-3 3.906250e-3 4.8828125e-3 0 0 0 ]
the resolution is 9.765625e-4. If this number looks strange, crank up Matlab or
some calculator and look at its
inverse 1/9.765625e-4. These numbers came from an old HP PC and were
not just concocted for pedagogy. How on earth was the number 9.765625e-4
extracted from those
timings? In Matlab look at the consecutive differences in the sorted timing
vector: d = diff(sort(timings)), and extract the smallest nonzero from it.
Doing so is not guaranteed to give the clock resolution.
Example 3:
Using Matlab's diff(sort()) on timings
[0.00 2.00 1.25 0.50 1.25 3.25 ]
shows that the smallest difference is 0.50, while the clock resolution is obviously
0.25 or less. However, other than cooked-up examples like this one, I've never
encountered a machine where taking a large number of measurements and using
diff() and sort() failed to yield the correct resolution. Sidenote:
it is possible to extract the 0.25 from the set of data ... try to
figure out how, and how to guarantee that it will work in general.
Example 4:
Here are some timings taken on a 3.2 GHz Intel Core i7.
[Figure: the raw timings, one blue + sign per datum]
[Figure: the timings sorted into non-decreasing order]
[Figure: the consecutive differences between elements of the sorted timing vector]
At this point, you should be able to read off the clock resolution from the above graphs.
Go no further until you figure it out and understand the plots. The remaining material
is of little or no use to you otherwise.
Warning 1: do not trust manual pages or vendors' claims about
clock resolution. They lie blatantly, often, and shamelessly, sometimes
to cover up weaknesses, and sometimes just to avoid headaches. E.g., "Posix
requires 1/100 second resolution, our clock has nanosecond resolution, so if we
just leave the documentation saying 1/100, then we're safe and won't have obnoxious
professors complaining that their measurements show a 2 nanosecond resolution."
Even in the rare event that a vendor tells the truth, the resolution you can actually
measure is a better guide and more useful than a theoretical number. There may be
some unavoidable and erratic costs that occur sporadically, e.g., ones related to
the OS accessing a hardware clock.
Warning 2: The examples have only a few timings shown, but in practice
use a gazillion timings (a gazillion means "a lot of" or "gobs and gobs of").
Nowadays (circa 2019 C.E.) use a million
or more, that is, nsamples = 1000000. Doing so on a 2.8 GHz Intel Core i7 system
takes less than 40 seconds. Also in less than 40 seconds, Matlab can slurp in, analyze,
and plot that much data.
Timer Overhead Measurement
How much does calling the timer itself perturb timing results? Calling a timer
is a function call, which involves some work: pushing the state of the process
onto a stack, loading registers, and reading data (the fields in memory where
the timer data is kept) which may in turn involve a page fault or cache miss.
Invoking a timer function usually requires a call to the operating system,
and process schedulers take advantage of that to swap in and run any processes
waiting for the CPU. So do not assume that the overhead of a timer function
is the same as any other function. Timer overhead can be found by calling the
timer many times in a single timing block, with the block sized according to the
clock resolution found earlier.
Applying knowledge about timing methodologies
- The Rule of 100:
To assure that timings are reliable, a basic rule of thumb is to make sure
that a timing block (the stuff that appears between two calls to a clock function)
takes at least 100*(resolution + overhead) seconds. That usually guarantees that
any perturbation caused by the timing itself is ≤ 1%.
- Corollary 1: When timing code chunks you may be tempted to ask
"How many repetitions do
I need to get reliable timings?" Resist that temptation.
All of the information needed to answer it is available
on this page, and the actual numbers required can be found by running a
few short tests to get clock resolution and overhead. Scientific computing
is an experimental field of research. You should raise questions
like that, but whenever the question can be answered by running a
quick experiment or two, do so!
- Corollary 2: if determining the values of resolution and overhead is not
practical, make sure the timing block takes one second or more. The Posix standard
requires that a clock have a resolution of 1/100 second or better. Doing this is OK,
but one-second-long timing blocks can take excruciatingly long when many
timings are needed. Some of the computational rate timings shown here took
over four days to run.
- Sub-corollary (or should that be corollary2?): keep a record of the
timing block size(s) actually used. It will help to prove or
disprove that the results are reliable.
- Other reasons to find these things experimentally: the numbers change
- as platforms evolve,
- depending on which timer function is used,
- with different compiler options (this alone can make the
numbers vary by three orders of magnitude).
As a strange and weird example,
with Intel's Fortran compiler the resolution varies depending
on whether 32-bit or 64-bit integers are passed to the
intrinsic function system_clock(count, count_rate).
This holds independently of the compiler options used.
- Stable, repeatable timings are not sufficient
assurance of reliability. Some operations have inherent wide variations.
As an example, disk I/O routinely has timings that vary by three or more
orders of magnitude.
- Knowing how to time things is going to crop up repeatedly in your career,
and not just in a scientific computing (or other CS) course. You will need
to judge timing results from other people: vendors, collaborators, enemies,
researchers whose paper or grant or project proposal you are reviewing.
- Code it up: All of these considerations suggest writing a code or codes that can
quasi-automate finding resolution and overhead. Do that now, or be condemned
to writing such codes over and over again.
- Trends: Over the past few years, timer overhead has become negligible
while some systems provide high resolution, on the order of a few nanoseconds.
[Maybe this is in support of real-time gaming and audio?].
For now (2019) you can assume the timer overhead is zero, and just make sure
the timing block is ≥ 100*resolution seconds in size. These
trends are not guaranteed to hold true for the ages. E.g., as clock cycles
become shorter, a nanosecond may be the equivalent of thousands or
millions of floating point operations.
- Cargo cult science:
In a paper, book, or report, when data seems quantized like the figures
plotted above, you should immediately suspect clock resolution issues.
The numbers typically will not lie clearly in horizontal bands as above.
Instead some derived quantity like a computational rate might be plotted.
In that case, look for curves splitting into multiple curves. Also, sometimes
timings may look quantized but are not. As one example, the computational rates for
computing three different vector norms look like
[Figure: computational rates for three vector norms]
Zoom in on part of the graph to get
[Figure: zoomed view of the rate curves]
Zoom in further and it is obvious the computational rates are following a
pattern, but it is not from timer discretization:
[Figure: further zoomed view showing the pattern]
Each timing block used in the last three plots was over 10 seconds, and Posix
standards require an OS to provide a system timer with at least 0.01 seconds
resolution.
- The 600 pound gorilla: Timing parts of code that take much
less than the clock resolution requires multiple repetitions in a single
timing block. However, the first execution of the small fragment will
load its data into the cache, and succeeding
repetitions will access the data from cache, not from main memory. This
problem can
be ameliorated by knowing the relative speed of memory accesses from
cache versus main memory. A ratio of 1:10 is common, but I have in the past
measured it on one machine as 1:50. Determining that ratio requires timing
carefully chosen code fragments. This seems like an infinite recursion,
but it is relatively easy to determine the ratio reliably,
as will be shown later.
- Last Modified: Fri 27 Sep 2019, 02:36 PM