Variance is a measurement of the spread between numbers in a data set: it measures how far each number in the set lies from the mean.
The Empirical Rule is often used in statistics for forecasting, especially when the right data is difficult or impossible to obtain. The rule can give you a rough estimate of what your data collection might look like if you were able to survey the entire population.
Using a chart of the data set, we can observe the linear relationship of the various data points, or numbers. We do this by drawing a regression line, which attempts to minimize the distance of any individual data point from the line itself. In such a chart, the data points are the blue dots, the orange line is the regression line, and the red arrows are the distances between the observed data points and the regression line.
When we calculate a variance, we are asking: given the relationship of all these data points, how much distance do we expect on the next data point? This 'distance' is called the error term, and it is what variance measures.
By itself, variance is often not useful, because its unit is the square of the data's own unit, which makes it hard to interpret and compare. However, the square root of the variance is the standard deviation, which is expressed in the same units as the data and is therefore practical as a measurement.
Calculating Variance in Excel
Calculating variance in Excel is easy if you have the data set already entered into the software. In the example below, we will calculate the variance of 20 days of daily returns of the highly popular exchange-traded fund (ETF) named SPY, which tracks the S&P 500.
- The formula is =VAR.S(select data)
The reason you want to use VAR.S and not VAR.P (another formula Excel offers) is that often you don't have the entire population of data to measure. For example, if we had every return in the history of the SPY ETF in our table, we could use the population measurement VAR.P, but since we are only measuring the last 20 days to illustrate the concept, we will use VAR.S.
As you can see, the calculated variance value of 0.000018674 tells us little about the data set by itself. If we went on to take the square root of that value to get the standard deviation of returns, that would be more useful.
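The same computation is easy to reproduce outside Excel. Here is a short sketch in Scala (the return values are made-up illustrations, not actual SPY data):

```scala
// Sample variance with an n - 1 denominator (what VAR.S computes),
// and the standard deviation as its square root.
def sampleVariance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.length
  xs.map(x => math.pow(x - mean, 2)).sum / (xs.length - 1)  // n - 1: sample, not population
}

val dailyReturns = Seq(0.0012, -0.0034, 0.0051, -0.0008, 0.0027)  // made-up returns
val variance = sampleVariance(dailyReturns)  // hard to interpret on its own
val stdDev   = math.sqrt(variance)           // same units as the returns themselves
```

Dividing by n instead of n - 1 would give the population variance, matching VAR.P.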
Background
I'd like to estimate the big-oh performance of some methods in a library through benchmarks. I don't need precision -- it suffices to show that something is O(1), O(log n), O(n), O(n log n), O(n^2) or worse than that. Since big-oh means upper bound, estimating O(log n) for something that is O(log log n) is not a problem.
Right now, I'm thinking of finding, for each candidate big-oh, the constant multiplier k that best fits the data (while still topping all results), and then choosing the big-oh with the best fit.
Questions
- Are there better ways of doing it than what I'm thinking of? If so, what are they?
- Otherwise, can anyone point me to algorithms for estimating k for the best fit, and for comparing how well each curve fits the data?
Notes & Constraints
Given the comments so far, I need to make a few things clear:
- This needs to be automated. I can't 'look' at data and make a judgment call.
- I'm going to benchmark the methods with multiple n sizes. For each size n, I'm going to use a proven benchmark framework that provides reliable statistical results.
- I actually know beforehand the big-oh of most of the methods that will be tested. My main intention is to provide performance regression testing for them.
- The code will be written in Scala, and any free Java library can be used.
Example
Here's one example of the kind of stuff I want to measure. I have a method that, given an n, returns the nth element of a sequence. This method can have O(1), O(log n) or O(n) complexity depending on the existing implementations, and small changes can make it use a suboptimal implementation by mistake. Or, even more easily, some other method that depends on it could come to use a suboptimal version of it.
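For instance (my illustration, not the question's code), Scala's standard sequences already exhibit all three behaviours for apply:

```scala
val xs: List[Int]   = List.range(0, 1000000)    // apply(n) walks the list: O(n)
val vs: Vector[Int] = Vector.range(0, 1000000)  // shallow tree lookup: effectively O(log n)
val as: Array[Int]  = Array.range(0, 1000000)   // direct indexing: O(1)
xs(999999); vs(999999); as(999999)              // same call, three very different costs
```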
9 Answers
In order to get started, you have to make a few assumptions:
1. n is large compared to any constant terms.
2. You can effectively randomize your input data.
3. You can sample with sufficient density to get a good handle on the distribution of runtimes.
In particular, (3) is difficult to achieve in concert with (1). So you may get something with an exponential worst case, but never run into that worst case, and thus think your algorithm is much better than it is on average.
With that said, all you need is any standard curve fitting library. Apache Commons Math has a fully adequate one. You then either create a function with all the common terms that you want to test (e.g. constant, log n, n, n log n, n*n, n*n*n, e^n), or you take the log of your data and fit the exponent, and then if you get an exponent not close to an integer, see if throwing in a log n gives a better fit.
(In more detail, if you fit C*x^a for C and a, or more easily log C + a log x, you can get the exponent a; in the all-common-terms-at-once scheme, you'll get weights for each term, so if you have n*n + C*n*log(n) where C is large, you'll pick up that term as well.)
You'll want to vary the size by enough so that you can tell the different cases apart (might be hard with log terms, if you care about those), and safely more different sizes than you have parameters (probably 3x excess would start being okay, as long as you do at least a dozen or so runs total).
Edit: Here is Scala code that does all this for you. Rather than explain each little piece, I'll leave it to you to investigate; it implements the scheme above using the C*x^a fit, and returns ((a, C), (lower bound for a, upper bound for a)). The bounds are quite conservative, as you can see from running the thing a few times. The units of C are seconds (a is unitless), but don't trust that too much, as there is some looping overhead (and also some noise).
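(The original code is not preserved here. The following is a minimal sketch of the scheme it describes, assuming a Theil-Sen fit of log t = log C + a log n, with crude percentile bounds in place of the original's conservative ones; myMethod in the usage comment is a placeholder.)

```scala
// Sketch, not the original answer's code: robust power-law fit t ~ C * n^a.
object PowerFit {
  def fit(ns: Array[Int], times: Array[Double]): ((Double, Double), (Double, Double)) = {
    val xs = ns.map(n => math.log(n.toDouble))
    val ys = times.map(math.log)
    val slopes = (for {
      i <- xs.indices; j <- xs.indices if i < j && xs(i) != xs(j)
    } yield (ys(j) - ys(i)) / (xs(j) - xs(i))).sorted
    val a = slopes(slopes.length / 2)                     // Theil-Sen: median pairwise slope
    val residuals = xs.indices.map(i => ys(i) - a * xs(i)).sorted
    val c = math.exp(residuals(residuals.length / 2))     // median intercept, in seconds
    val lo = slopes((slopes.length * 0.05).toInt)         // crude lower bound for a
    val hi = slopes(math.min(slopes.length - 1, (slopes.length * 0.95).toInt))
    ((a, c), (lo, hi))
  }

  def timeOnce(body: => Unit): Double = {                 // one timing, in seconds
    val t0 = System.nanoTime; body; (System.nanoTime - t0) * 1e-9
  }
}

// val sizes = Array(100, 200, 400, 800, 1600, 3200)
// val ts = sizes.map(n => PowerFit.timeOnce { myMethod(n) })
// println(PowerFit.fit(sizes, ts))
```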
Note that the multibench method is expected to take about sqrt(2)*n*m*time to run, assuming that static initialization data is used and is relatively cheap compared to whatever you're running. (In the examples, parameters were chosen to take ~15s to run.)
Anyway, for the stated use case--where you are checking to make sure the order doesn't change--this is probably adequate, since you can play with the values a bit when setting up the test to make sure they give something sensible. One could also create heuristics that search for stability, but that's probably overkill.
(Incidentally, there is no explicit warmup step here; the robust fitting of the Theil-Sen estimator should make it unnecessary for sensibly large benchmarks. This also is why I don't use any other benching framework; any statistics that it does just loses power from this test.)
Edit again: if you replace the alpha method with something like the following, then you can get an estimate of the exponent when there's a log term also. Error estimates exist to pick whether or not the log term is the correct way to go, but it's up to you to make the call (i.e. I'm assuming you'll be supervising this initially and reading the numbers that come off):
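(The replacement method is likewise not preserved; this is a sketch of that idea, with names of my choosing: fit the logged data both with and without a log n factor and report an RMS residual for each, so the two fits can be compared from one set of timings.)

```scala
// Sketch only: least-squares fit of log t = log C + a log n, with and without
// an extra log n factor, returning (exponent, RMS residual) for each model.
def alphaWithLog(ns: Array[Double], ts: Array[Double]): ((Double, Double), (Double, Double)) = {
  def fit(xs: IndexedSeq[Double], ys: IndexedSeq[Double]): (Double, Double) = {
    val n = xs.length
    val mx = xs.sum / n
    val my = ys.sum / n
    val a = xs.indices.map(i => (xs(i) - mx) * (ys(i) - my)).sum /
            xs.map(x => (x - mx) * (x - mx)).sum
    val b = my - a * mx
    val rms = math.sqrt(xs.indices.map { i => val r = ys(i) - (a * xs(i) + b); r * r }.sum / n)
    (a, rms)
  }
  val lx = ns.toIndexedSeq.map(math.log)
  val ly = ts.toIndexedSeq.map(math.log)
  val plain  = fit(lx, ly)                                           // t = C * n^a
  val logged = fit(lx, ly.indices.map(i => ly(i) - math.log(lx(i)))) // t = C * n^a * log n
  (plain, logged)
}
```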
(Edit: fixed the RMS computation so it's actually the mean, plus demonstrated that you only need to do timings once and can then try both fits.)
Rex Kerr

I don't think your approach will work in general.
The problem is that 'big O' complexity is based on a limit as some scaling variable tends to infinity. For smaller values of that variable, the performance behavior can appear to fit a different curve entirely.
The problem is that with an empirical approach you can never know if the scaling variable is large enough for the limit to be apparent in the results.
Another problem is that if you implement this in Java / Scala, you have to go to considerable lengths to eliminate distortions and 'noise' in your timings due to things like JVM warmup (e.g. class loading, JIT compilation, heap resizing) and garbage collection.
Finally, nobody is going to place much trust in empirical estimates of complexity. Or at least, they wouldn't if they understood the mathematics of complexity analysis.
FOLLOWUP
In response to this comment:
"Your estimate's significance will improve drastically the more and larger samples you use."
This is true, though my point is that you (Daniel) haven't factored this in.
"Also, runtime functions typically have special characteristics which can be exploited; for example, algorithms tend to not change their behaviour at some huge n."
For simple cases, yes.
For complicated cases and real world cases, that is a dubious assumption. For example:
- Suppose some algorithm uses a hash table with a large but fixed-size primary hash array, and uses external lists to deal with collisions. For N (the number of entries) less than the size of the primary hash array, the behaviour of most operations will appear to be O(1); the true O(N) behaviour can only be detected by curve fitting when N gets much larger than that (see the sketch after these examples).
- Suppose that the algorithm uses a lot of memory or network bandwidth. Typically, it will work well until you hit the resource limit, and then performance will tail off badly. How do you account for this? If it is part of the 'empirical complexity', how do you make sure that you get to the transition point? If you want to exclude it, how do you do that?
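To make the first case concrete, here is an illustrative sketch (mine, not the answerer's) of a map whose lookups look O(1) while N is far below the fixed slot count, but degrade to O(N) once the collision chains grow:

```scala
// Fixed-slot hash map with external collision lists. get() scans a chain of
// roughly N/slots entries, so it is O(1)-like only while N << slots.
class FixedHashMap[V](slots: Int = 1024) {
  private val buckets = Array.fill(slots)(List.empty[(Int, V)])
  def put(k: Int, v: V): Unit = {
    val i = math.abs(k % slots)
    buckets(i) = (k, v) :: buckets(i).filterNot(_._1 == k)  // replace any old binding
  }
  def get(k: Int): Option[V] =
    buckets(math.abs(k % slots)).find(_._1 == k).map(_._2)  // chain scan: the hidden O(N)
}
```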
If you are happy to estimate this empirically, you can measure how long it takes to do exponentially increasing numbers of operations. From the ratio, you can estimate which function class it belongs to.
e.g. if the ratio of the time for 10,000 operations to the time for 1,000 operations (10x the input) is approximately one of the following (test the longer one first; a code sketch follows the list):
- 1x => O(1)
- 1.2x => O(ln ln n)
- ~ 2-5x => O(ln n)
- 10x => O(n)
- 20-50x => O(n ln n)
- 100x => O(n ^ 2)
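Here is a sketch of that heuristic (the threshold values are my rough cut-offs around the ratios above):

```scala
// Times `run` at n and at 10*n and maps the growth ratio to a rough class.
def ratioTest(run: Int => Unit, n: Int): String = {
  def time(k: Int): Double = {
    val t0 = System.nanoTime; run(k); (System.nanoTime - t0).toDouble
  }
  time(10 * n)                         // test the longer one first (doubles as warmup)
  val ratio = time(10 * n) / time(n)
  if (ratio < 1.1) "O(1)"
  else if (ratio < 1.5) "O(ln ln n)"
  else if (ratio < 6) "O(ln n)"
  else if (ratio < 15) "O(n)"
  else if (ratio < 60) "O(n ln n)"
  else "O(n^2) or worse"
}
```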
You need to do a realistic number of operations to see what the order is for the range you have. This is just an estimate, as time complexity is intended for an ideal machine; it is something that should be mathematically proven rather than measured.
e.g. many people tried to prove empirically that pi is a fraction. When they measured the ratio of circumference to diameter for circles they had made, it was always a fraction. Eventually, it was generally accepted that pi is not a fraction.
Peter Lawrey

What you are looking to achieve is impossible in general. Even the fact that an algorithm will ever stop cannot be proven in the general case (see the Halting Problem). And even if it does stop on your data, you still cannot deduce the complexity by running it. For instance, bubble sort has complexity O(n^2), yet on already sorted data it performs as if it were O(n). There is no way to select 'appropriate' data for an unknown algorithm to estimate its worst case.
Igor Korkhov

We have recently implemented a tool that does semi-automated average runtime analysis for JVM code. You do not even have to have access to the sources. It is not published yet (we are still ironing out some usability flaws), but it will be soon, I hope.
It is based on maximum-likelihood models of program execution [1]. In short, byte code is augmented with cost counters. The target algorithm is then run (distributed, if you want) on a bunch of inputs whose distribution you control. The aggregated counters are extrapolated to functions using involved heuristics (method of least squares on crack, sort of). From those, more science leads to an estimate for the average runtime asymptotics (3.576n - 1.23 log(n) + 1.7, for instance). For example, the method is able to reproduce rigorous classic analyses done by Knuth and Sedgewick with high precision.
The big advantage of this method compared to what others post is that you are independent of time estimates; that is, in particular, independent of machine, virtual machine and even programming language. You really get information about your algorithm, without all the noise.
And---probably the killer feature---it comes with a complete GUI that guides you through the whole process.
See my answer on cs.SE for a little more detail and further references. You can find a preliminary website (including a beta version of the tool and the papers published) here.
(Note that average runtime can be estimated that way, while worst-case runtime can never be, except when you know the worst case. If you do, you can use the average case for worst-case analysis: just feed the tool only worst-case instances. In general, runtime bounds cannot be decided, though.)
[1] U. Laube and M. E. Nebel, "Maximum likelihood analysis of algorithms and data structures" (2010). [preprint]
You should consider changing a critical aspect of your task.
Change the terminology that you are using to 'estimate the runtime of the algorithm' or 'set up performance regression testing'.
Can you estimate the runtime of the algorithm? Well, you propose to try different input sizes and measure either some critical operation or the time it takes. Then, for the series of input sizes, you plan to programmatically estimate whether the algorithm's runtime shows no growth, constant growth, exponential growth, etc.
So you have two problems: running the tests, and programmatically estimating the growth rate as your input set grows. This sounds like a reasonable task.
Brian C.

I'm not sure I get 100% what you want. But I understand that you test your own code, so you can modify it, e.g. inject observing statements. Otherwise you could use some form of aspect weaving?
How about adding resettable counters to your data structures and then increasing them each time a particular sub-function is invoked? You could make the counting @elidable so it will be gone in the deployed library.
Then, for a given method, say delete(x), you would test it with all sorts of automatically generated data sets, trying to give them some skew, etc., and gather the counts. While, as Igor points out, you cannot verify that the data structure won't ever violate a big-O bound, you will at least be able to assert that in the actual experiment a given limit count is never exceeded (e.g. going down a node in a tree is never done more than 4 * log(n) times) -- so you can detect some mistakes.
Of course, you would need certain assumptions, e.g. that calling a method is O(1) in your computer model.
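A minimal sketch of that idea (my own illustration; tree and delete below are placeholders):

```scala
import scala.annotation.elidable
import scala.annotation.elidable.FINE

// Calls to an @elidable(FINE) method are removed entirely when the library is
// compiled with -Xelide-below set above FINE, so deployed code pays nothing.
object OpCounter {
  private var steps = 0L
  def reset(): Unit = steps = 0
  @elidable(FINE) def count(): Unit = steps += 1
  def total: Long = steps
}

// In a test of delete(x) on an n-element tree, assert the experimental bound:
//   OpCounter.reset()
//   tree.delete(x)
//   assert(OpCounter.total <= 4 * math.log(n) / math.log(2), s"${OpCounter.total} descents")
```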
0__

"I actually know beforehand the big-oh of most of the methods that will be tested. My main intention is to provide performance regression testing for them."
This requirement is key. You want to detect outliers with minimal data (because testing should be fast, dammit), and in my experience fitting curves to numerical evaluations of complex recurrences, linear regression and the like will overfit. I think your initial idea is a good one.
What I would do to implement it is prepare a list of expected complexity functions g1, g2, ..., and for data f, test how close to constant f/gi + gi/f is for each i. With a least squares cost function, this is just computing the variance of that quantity for each i and reporting the smallest. Eyeball the variances at the end and manually inspect unusually poor fits.
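A sketch of that recipe (the names are mine): for each candidate g, compute the variance of f(n)/g(n) + g(n)/f(n) over the measurements and report the smallest.

```scala
// Pick the candidate complexity g for which t/g(n) + g(n)/t is closest to constant.
def bestFit(sizes: Seq[Double], times: Seq[Double],
            candidates: Map[String, Double => Double]): (String, Double) = {
  def variance(q: Seq[Double]): Double = {
    val m = q.sum / q.length
    q.map(x => (x - m) * (x - m)).sum / q.length
  }
  candidates.map { case (name, g) =>
    name -> variance(sizes.zip(times).map { case (n, t) => t / g(n) + g(n) / t })
  }.minBy(_._2)   // smallest variance = best fit; eyeball the rest for poor fits
}

// val gs = Map[String, Double => Double](
//   "O(1)"       -> (_ => 1.0),
//   "O(log n)"   -> (n => math.log(n)),
//   "O(n)"       -> (n => n),
//   "O(n log n)" -> (n => n * math.log(n)),
//   "O(n^2)"     -> (n => n * n))
```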
For an empirical analysis of the complexity of the program, what you would do is run (and time) the algorithm given 10, 50, 100, 500, 1000, etc. input elements. You can then graph the results and determine the best-fit function order from the most common basic types: constant, logarithmic, linear, n log n, quadratic, cubic, higher-polynomial, exponential. This is a normal part of load testing, which makes sure that the algorithm is, first, behaving as theorized, and second, that it meets real-world performance expectations despite its theoretical complexity (a logarithmic-time algorithm in which each step takes 5 minutes is going to lose all but the absolute highest-cardinality tests to a quadratic-complexity algorithm in which each step takes a few milliseconds).
EDIT: Breaking it down, the algorithm is very simple:
- Define a list, N, of various cardinalities for which you want to evaluate performance (10, 100, 1000, 10000, etc.).
- For each element X in N:
  - Create a suitable set of test data that has X elements.
  - Start a stopwatch, or determine and store the current system time.
  - Run the algorithm over the X-element test set.
  - Stop the stopwatch, or determine the system time again.
  - The difference between start and stop times is your algorithm's run time over X elements.
- Repeat for each X in N.
Plot the results: given X elements (x-axis), the algorithm takes T time (y-axis). The closest basic function governing the increase in T as X increases is your Big-Oh approximation. As Raphael stated, this approximation is exactly that, and will not get you very fine distinctions such as coefficients of N, which could make the difference between an N^2 algorithm and a 2N^2 algorithm (both are technically O(N^2), but given the same number of elements one will perform twice as fast).
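In code, the measurement loop might look like this (a sketch; the names are mine):

```scala
// Collect (X, seconds) pairs for plotting, following the steps above.
def measure(algorithm: Array[Int] => Unit, cardinalities: Seq[Int]): Seq[(Int, Double)] =
  cardinalities.map { x =>
    val data = Array.fill(x)(scala.util.Random.nextInt())  // suitable X-element test set
    val t0 = System.nanoTime                               // start the stopwatch
    algorithm(data)                                        // run over the X-element set
    val t1 = System.nanoTime                               // stop the stopwatch
    x -> (t1 - t0) * 1e-9                                  // run time over X elements, in seconds
  }

// e.g. measure(arr => scala.util.Sorting.quickSort(arr), Seq(10, 100, 1000, 10000))
```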
KeithS