Analysis of Algorithms

Recall, that an algorithm is a method for solving a problem or accomplishing some task expressed as a sequence of steps that are suitable for execution by a computer.

We are interested in designing "good" algorithms -- that is to say, algorithms that don't take forever to run, and don't completely exhaust the resources of our machines, specifically with regard to memory.

With regard to the first consideration, the running time of our algorithms, we note that running time typically increases with the input size (i.e., problem size), and is affected by the hardware and software environment in which the algorithm is executed.

As such, we focus on the relationship between the running time and the input size. In particular, we are interested in how fast the running time is growing as the input size is increasing -- as a faster growth rate means that we will run into long running times more quickly and -- if these running times become long enough -- may not be able to deal at all with problems past a certain size.

One way to analyze an algorithm is to do so experimentally. In such cases, we:

  1. Write a program that implements the algorithm.
  2. Run the program with inputs of various sizes and compositions.
  3. Use a method like System.currentTimeMillis() to get an accurate measure of the actual running time.
  4. Then, we plot the results, and see how the running time increases as a function of the problem size

Analyzing algorithms experimentally, of course, has its own set of problems:

  • We have to implement the algorithm, which may be difficult
  • In order to compare two algorithms, the same hardware and software environments must be used
  • Results may not be indicative of the running time on other inputs not included in the experiment

A better way to analyze an algorithm is to realize that the running time of a program is determined primarily by two factors:

  • The cost of executing each statement (which is a property of the computer, the Java compiler, and the operating system).
  • The frequency of execution of each statement (which is a property of the program and the input).

To help with the quantification of the above, we first make a definition: A primitive operation is defined to be an operation that corresponds to a low-level, basic computation with a constant execution time. Examples of primitive operations include evaluating an expression, assigning a value to a variable, indexing into an array, etc.

Then, to analyze an algorithm, we simply determine the frequency of primitive operations, and characterize this as a function of the input size. In doing so, we have the benefit of taking into account all possible inputs, and our result is independent of any hardware/software environment.

Given that an algorithm may run faster on some inputs than it does on others (even with the same input size), we have different ways of counting the frequency of primitive operations encountered. Specifically, we can look at:

  • The average case, where we take the average frequency over all possible inputs of the same size -- although this depends on the input distribution, which may complicate things. Additionally, we must be wary of misleading averages, as typified by the old joke: "A statistician who put his head in the oven and his feet in the refrigerator was heard to say "On average, I feel just fine!"

  • The best case scenario, where the frequency count is minimized -- although folks are more likely interested in where the algorithm is going to create a problem (by running too long).

  • The worst case scenario, which is generally easier to analyze and typically leads to better algorithms.

In counting the frequency of primitive operations in programs, it is useful to note that programs typically use loops to enumerate input data items. As such, we can often just count the number of operations or steps taken inside of the loops present in the program (where each statement within a loop is counted as a "step").

Let's take a look at some examples...

Example 1

sum = 0.0;
for (int i = 0; i < n; i++) {
   sum += array[i];  // <----- a primitive operation          

Ignoring the update to the loop variable $i$, we note the marked statement above is executed $n$ times, so the cost function is $C(n) = n$.

Example 2

sum = 0.0;
for (int i = 0; i < n; i += 2) {
    sum += array[i];    // <----- a primitive operation

Here however, since the counter variable $i$ is going up by $2$ with each pass through the loop, the number of times the marked statement above gets executed this time is halved -- leading to $C(n) = n/2$.

Example 3

for (int i = 0; i < n; i ++) {
   for (int j = 0; j < n; j ++) {
      int x = i*j;    // <----- a primitive operation             
      sum += x;       // <----- a primitive operation

When nested loops are encountered, notice that the two primitive operations each get executed $n^2$ times, yielding $C(n) = 2n^2$.

Example 4

for (int i = 0; i < n; i ++) {
   for (int j = i; j < n; j ++) {
      sum += i*j;      // <----- a primitive operation            

Here, the cost function is a little more complicated to calculate. Note that the primitive operation is executed $$n + (n-1) + (n-2) + \cdots + 3 + 2 + 1 \quad \textrm{times}$$ However, and identity from algebra will help consolidate this into $\displaystyle{C(n) = \frac{n(n+1)}{2}}$.

The cost function $C(n)$ can often become a complicated and length mathematical expression. However, recall that we really only care about how the cost increases with respect to the problem size, as opposed to the absolute cost. To this end, we often simplify things down to just what is important to us through the use of Tilde notation and Big-O notation.

When using tilde notation, we essentially ignore any insignificant terms in our cost function. To be precise, we write $$f(n) \sim g(n) \quad \textrm{if} \quad \lim_{n \rightarrow \infty} \frac{f(n)}{g(n)} = 1$$ Thus,

  • $2n+10 \sim 2n$
  • $3n^3 + 20n^2 + 5 \sim 3n^3$

Applying tilde notation to example 4 discussed above, we see that $$\displaystyle{C(n) = \frac{n(n+1)}{2} = \frac{n^2}{2}+\frac{n}{2} \sim \frac{n^2}{2}}$$

Big-O notation provides an even simpler approach. Given functions $f(n)$ and $g(n)$, we say that $f(n0$ is $O(g(n))$ if there are positive constants $c$ and $n_0$ such that $$f(n) \leq c g(n) \quad \textrm{when} \quad n \geq n_0$$

As an example, $2n + 10$ is $O(n)$, since $2n + 10 < 3n$ whenever $n>10$.

On the other hand, $n^2$ is not $O(n)$ as otherwise, we could find some constant $c$ where $n^2 \leq cn$ for sufficiently large $n$. This of course is impossible, as it would imply $n \leq c$ whenever $n$ is sufficiently large.

Defining the Big-O notation in this way, we see it provides an upper bound on the growth rate of a function. That is to say, the statement "$f(n)$ is $O(g(n))$"means that the growth rate of $f(n)$ is no more than the growth rate of $g(n)$. As such, we can use the Big-O notation to rank functions according to their growth rate (as shown in the list below, with higher growth rates appearing farther to the right): $$O(1), O(\log n), O(n \log n), O(n^2), O(n^3), O(2^n), O(3^n)$$

Returning to example 4 above, we see that its cost function $C(n)$ is $O(n^2)$.

Example 5

double product = 1.0;
for (int i = 1; i <= n; i *= 2) { 
   product *= i;     // <----- a primitive operation                  

Here, if we suppose that $n = 2^m$, then we have the primitive operation executed when $i = 1, 2, 4, 8, ..., 2^m$, and thus $m+1$ times. Since $m = \log_2 n$, we have a logarithmic cost function $C(n) = \log_2 n + 1 \sim \log_2 n$, which is $O(\log n)$.

If the switch to $\log n$ from $\log_2 n$ was confusing, note that changing the base requires only a multiplication by a constant, per the logarithm rule from high school algebra shown below. $$\log_b a = \frac{\log_x a}{\log_x b}$$

Example 6

for (int i=1; i<=N; ++i){
   for (int j=1; j<N; j*=2){

In this example, the inner loop contributes a logarithmic cost, while the other loop requires this be done $N$ times. Thus, we have $C(n) \sim n \log n$.