Finding Timer Resolution and Overhead
Measuring the resolution and overhead of your timer is important - it tells you
how many times to repeat an operation to get reliable timings. Increasingly, the
overhead of calling the timer is negligible, but resolution is far from it, and on
increasingly fast processors a significant amount of computation can be
performed in a single clock tick. This is not the hardware system clock that sends
out the drumbeat at whatever GHz rating your box has. Instead it is the clock
that you can actually call from a C/C++/Fortran code.
Measuring the Resolution of a Timer
To find timer resolution, use the computer equivalent of the Millikan Oil Drop
experiment.
Time an operation that takes some small amount of
time, and repeat it several times. If that operation is close to the resolution,
some timings will show zero while others will show up as taking a small
amount of time, which is an integer multiple of the clock resolution.
Actually, the difference between any two timings is an integer multiple of the
clock resolution, so even
if the timer is high resolution and no zero times occur, all of the timings
obtained will be multiples of the clock resolution. So take several measurements,
plot them, and look for them to line up on horizontal bands.
An example algorithm using an elapsed time clock and integer addition is:
initialize:
    nsamples = 3333333
    noperations = 7
iterate:
    for k = 1, ..., nsamples
        time_start = clock_time()
        sum = 0
        for i = 1, ..., noperations
            sum = sum + 1
        end for
        time_end = clock_time()
        time_required = time_end - time_start
        write out time_required
    end for
If roughly 50-90% of the times are nonzero, then the smallest positive time is
probably the clock resolution. A good value for noperations is machine dependent.
Start out with noperations = 100, but don't be surprised if a value like 100000 or larger is
needed. A fast machine with a low-resolution clock will require many more operations
to get the clock to tick over.
On some modern systems the clock has high enough resolution that even setting
noperations = 0 will not cause any zero timings to appear. If that
is the case, function call overhead costs dominate and life is good.
Even in this case the resolution can still be estimated using the measured timings.
Example 1:
For timings of
[ 1.00 2.00 1.75 0.50 1.25 3.25 ]
seconds,
the resolution is at least as small as 0.25 seconds because that is the smallest
value that all of the timings are integer multiples of. Resolution could
be smaller than 0.25, e.g.,
the timings shown are also integer multiples of 0.05. Without an additional timing
measurement that gives something like 6.10 seconds, it can only be stated that
0.25 is an upper bound on (and an integer multiple of) the timer resolution.
Example 2:
If the vector of measured timings is
[0 1.953125e-3 1.953125e-3 3.906250e-3 4.8828125e-3 0 0 0 ]
the resolution is 9.765625e-4. If this number looks strange, crank up Matlab or
some calculator and look at its
inverse 1/9.765625e-4. These numbers came from an old HP PC and were
not just concocted for pedagogy. How on earth was the number 9.765625e-4
extracted from those
timings? In Matlab look at the consecutive differences in the sorted timing
vector: d = diff(sort(timings)), and extract the smallest nonzero from it.
Doing so is not guaranteed to give the clock resolution.
Example 3:
Using Matlab's diff(sort()) on timings
[0.00 2.00 1.25 0.50 1.25 3.25 ]
shows that the smallest difference is 0.50, while the clock resolution is obviously
0.25 or less. However, other than cooked-up examples like this one, I've never
encountered a machine where taking a large number of measurements and using
diff() and sort() failed to yield the correct resolution. Sidenote:
it is possible to extract the 0.25 from the set of data ... try to
figure out how, and how to guarantee that it will work in general.
Example 4:
Here are some timings taken on a 3.2 GHz Intel Core i7.
[Figure: the raw timings, one blue + sign per datum]
[Figure: the timings sorted into non-decreasing order]
[Figure: the consecutive differences between elements of the sorted timing vector]
At this point, you should be able to read off the clock resolution from the above graphs.
Go no further until you figure it out and understand the plots. The remaining material
is of little or no use to you otherwise.
Warning 1: do not trust manual pages or vendors' claims about
clock resolution. They lie blatantly, often, and shamelessly, sometimes
to cover up weaknesses, and sometimes just to avoid headaches. E.g., "Posix
requires 1/100 second resolution, our clock has nanosecond resolution, so if we
just leave the documentation saying 1/100, then we're safe and won't have obnoxious
professors complaining that their measurements show a 2 nanosecond resolution."
Even in the rare event that a vendor tells the truth, the resolution you can actually
measure is a better guide and more useful than a theoretical number. There may be
some unavoidable and erratic costs that occur sporadically, e.g., ones related to
the OS accessing a hardware clock.
Warning 2: The examples have only a few timings shown, but in practice
use a gazillion timings (a gazillion means "a lot of" or "gobs and gobs of").
Nowadays (circa 2019 C.E.) use a million
or more, that is, nsamples = 1000000. Doing so on a 2.8 GHz Intel Core i7 system
takes less than 40 seconds. Also in less than 40 seconds, Matlab can slurp in, analyze,
and plot that much data.
Timer Overhead Measurement
How much does calling the timer itself perturb timing results? Calling a timer
is a function call, which involves some work: pushing the state of the process
onto a stack, loading registers, and reading data (the fields in memory where
the timer data is kept) which may in turn involve a page fault or cache miss.
Invoking a timer function usually requires a call to the operating system,
and process schedulers take advantage of that to swap in and run any processes
waiting for the CPU. So do not assume that the overhead of a timer function
is the same as any other function. Timer overhead can be found by calling the
timer many times in a single timing block, with the block sized according to the
clock resolution found earlier.
Applying knowledge about timing methodologies
- The Rule of 100:
To assure that timings are reliable, a basic rule of thumb is to make sure
that a timing block (the stuff that appears between two calls to a clock function)
takes at least 100*(resolution + overhead) seconds. That usually guarantees that
any perturbation caused by the timing itself is ≤ 1%.
- Corollary 1: When timing code chunks you may be tempted to ask
"How many repetitions do
I need to get reliable timings?" Resist that temptation.
All of the information needed to answer it is available
on this page, and the actual numbers required can be found by running a
few short tests to get clock resolution and overhead. Scientific computing
is an experimental field of research. You should raise questions
like that, but whenever the question can be answered by running a
quick experiment or two, do so!
- Corollary 2: if determining the values of resolution and overhead is not
practical, make sure the timing block takes one second or more. The Posix standard
requires that a clock have a resolution of 1/100 second or better. Doing this is OK,
but one-second-long timing blocks can take excruciatingly long when many
timings are needed. Some of the computational rate timings shown here took
over four days to run.
- Sub-corollary (or should that be corollary2?): keep a record of the
timing block size(s) actually used. It will help to prove or
disprove that the results are reliable.
- Other reasons to find these things experimentally: the numbers change
- as platforms evolve,
- depending on which timer function is used,
- with different compiler options (this alone can make the
numbers vary by three orders of magnitude).
As a strange and weird example,
with Intel's Fortran compiler the resolution varies depending
on whether 32-bit or 64-bit integers are passed to the
intrinsic function system_clock(count, count_rate).
This holds independently of the compiler options used.
- Stable, repeatable timings are not sufficient
assurance of reliability. Some operations have inherent wide variations.
As an example, disk I/O routinely has timings that vary by three or more
orders of magnitude.
- Knowing how to time things is going to crop up repeatedly in your career,
and not just in a scientific computing (or other CS) course. You will need
to judge timing results from other people: vendors, collaborators, enemies,
researchers whose paper or grant or project proposal you are reviewing.
- Code it up: All of these considerations suggest writing a code or codes that can
quasi-automate finding resolution and overhead. Do that now, or be condemned
to writing such codes over and over again.
- Trends: Over the past few years, timer overhead has become negligible
while some systems provide high resolution, on the order of a few nanoseconds.
[Maybe this is in support of real-time gaming and audio?].
For now (2019) you can assume the timer overhead is zero, and just make sure
the timing block is ≥ 100*resolution seconds in size. These
trends are not guaranteed to hold true for the ages. E.g., as clock cycles
become shorter, a nanosecond may be the equivalent of thousands or
millions of floating point operations.
- Cargo cult science:
In a paper, book, or report, when data seems quantized like the figures
plotted above, you should immediately suspect clock resolution issues.
The numbers typically will not lie clearly in horizontal bands as above.
Instead some derived quantity like a computational rate might be plotted.
In that case, look for curves splitting into multiple curves. Also, sometimes
timings may look quantized but are not. As one example, the computational rates for
computing three different vector norms look like
[Figure: computational rates for three vector norms]
Zoom in on part of the graph to get
[Figure: zoomed view of the rate curves]
Zoom in further and it is obvious the computational rates are following a
pattern, but it is not from timer discretization:
[Figure: further zoomed view showing the pattern]
Each timing block used in the last three plots was over 10 seconds, and Posix
standards require an OS to provide a system timer with at least 0.01 seconds
resolution.
- The 600 pound gorilla: Timing parts of code that take much
less than the clock resolution requires multiple repetitions in a single
timing block. However, the first execution of the small fragment will
load its data into the cache, and succeeding
repetitions will access the data from cache, not from main memory. This
problem can
be ameliorated by knowing the relative speed of memory accesses from
cache versus main memory. A ratio of 1:10 is common, but I have in the past
measured it on one machine as 1:50. Determining that ratio requires timing
carefully chosen code fragments. This seems like an infinite recursion,
but it is relatively easy to determine the ratio reliably,
as will be shown later.
- Last Modified: Fri 27 Sep 2019, 02:36 PM