Archive for the 'Fundamentals' Category

Zero is a number

Friday, August 6th, 2010

I won’t bore you with the story of how long it took for people to recognize that zero is a number. Without zero it would be difficult to explain what is the value of, say, 3 minus 3; we’d be forced to say that it’s a “meaningless” expression. Funny huh? Yet some developers seem to be stuck to medieval thinking in this respect.

Have you ever seen code like this?

public List findAllEmployeesByDepartment(int departmentId) {
  String sql = "select * from employees where department_id = ?";
  ResultSet rs = select(sql, department_id);
  if (rs.size() == 0) {
    return null;
  } else {
    // ... convert the recordset to a List and return it
  }
}

This developer seems to think that an empty List is not a regular list, so he thinks he should return a special value like null to signal that the query returned no values. This is totally unnecessary. No, I take it back: this is totally wrong. You are forcing all callers of findAllEmployeesByDepartment to check for null. Not only that; this code seem to say that it’s a totally unnatural and unexpected thing for this query to return no rows. Soon developers will forget to check for null, and the application will throw NullPointerExceptions.

A related example is:

Foo[] foos = ...;
if (foos.length > 0) {
  for (int i=0; i < foos.length; i++) {
    // do something with foo[i]
  }
}

Here the developer thinks that they have to treat the case of an empty array separately. In fact the IF is totally unnecessary. If the array is empty, the loop would execute zero times anyway. Java (and C) arrays use asymmetric bounds, which make it easier to write code that does not need to treat a zero-size interval as a special case.

In conclusion: empty collections are perfectly valid collections, and empty arrays are perfectly valid arrays. It’s a good idea to write code that doesn’t treat “zero” as a special case.

This post is part of a series on development fundamentals.

Work with asymmetric bounds

Tuesday, December 22nd, 2009

Let’s talk about ranges of integers. Quick: how many numbers are in the range 16 ≤ x ≤ 37, that is in the set {16, 17, …, 37}?

If you answered 21 then you got it wrong: the correct answer is 22. When you express a range inclusive of both ends, the formula to compute the number of elements is

size of [lower, upper] = upper - lower + 1

The “+1″ in the above formula is the source of many “off-by-one” or fenceposts errors. Can we get rid of the “+1″? Yes, there is a way. Use asymmetric bounds, including the lower bound and excluding the upper bound. The formula becomes

size of [lower, upper) = upper - lower

I'm using square and round brackets as a shorthand notation for ranges with inclusive and exclusive bounds:

  [a, b] = { x | a ≤ x ≤ b }
  [a, b) = { x | a ≤ x < b }

Another reason in favor of asymmetric bounds: how do you express an empty range? With symmetric inclusive ranges, you can't: the smallest range you can express contains one element. For instance [3, 3] contains only the number 3. With asymmetric bounds, you can express an empty range with, for instance, [3, 3), which is empty.

Asymmetric bounds work very well in conjunction with counting from zero. Java and C arrays are indexed from 0, so that the index range is [0, N). The upper bound is now equal to the number of elements! Expressing it with symmetric bounds gives [0, N-1], which is ugly because of the "-1".

If you need to iterate on a C or Java array, the standard pattern is

  int[] array = new int[N];
  for (int i=0; i<N; i++) { ... }

You see how the body is executed when i==0, through i==N-1, but not when i==N, since we are using i<N as a boundary condition. How many times is the loop is executed? Exactly N! It's now easy to see that the iteration is performed the correct number of times.

Asymmetric bounds can be concatenated easily. Suppose you have a function that paginates results on a web page. You are on the page which contains results 30, 31, ... 39. If you express the range with two parameters from and to, the page url would look like


http://example.com/results?from=30&to=40

And the links to the next and previous page would be

  <a href="/results?from=20&to=30">previous page</a>
  <a href="/results?from=40&to=50">next page</a>

where the bounds for the next or previous page are easily computed by adding or subtracting the page size from the current bounds. Compare with


http://example.com/results?from=30&to=39

  <a href="/results?from=20&to=29">previous page</a>
  <a href="/results?from=40&to=49">next page</a>

it's more difficult to check that the bounds are correct here: you have to think where the "-1" must be applied.

So, if you define "+" as the operation of concatenation of ranges, you get this nice little law:

  [a, b) + [b, c) = [a, c)

It's this sort of properties that make it easy to combine software objects together. This helps towards the goal of programming with lego-like bricks.

In conclusion: express ranges with asymmetric bounds! You will avoid off-by-one errors, get rid of "+1"s from your code, and make your programs more modular.

References

The book C Traps and Pitfalls by Andrew Koenig has a good explanation of asymmetric bounds.

A database for every developer

Saturday, October 17th, 2009

A database for every developer. No, *two* databases for every developer.

This is a fundamental for project organization that many projects get wrong. Every development workstation should be equipped with a full local development environment, with a local copy of the database software, and a one-command way to recreate the databases from scratch.

Why *two* databases? Well, one is for exploratory testing of the application we’re building. The other one is for automatic unit tests.

Why *local*? Because whenever the database server is not local, it becomes difficult to add a new workstation, it’s impossible to work when you’re not in the office, and you must depend on other people to fix your database problems.

The software that we write should *not depend* on the data sources that live outside our development workstation. To this end it’s a good start to have simple scripts that allow you to rebuild your database, so that you know you can experiment, change everything, make mistakes, and you’re still able to get back to a known working situation in a flash.

Why it’s important that I can rebuild the databases with *one command*? Because if it takes more than one command, it’s too complicated and I’m likely to make mistakes. Because it’s too easy to fall in the trap of not knowing exactly which steps are needed to set up a new database instance. If you have a single script that does the job, that script is also a living, always up-to-date document that describes how to recreate the database from scratch.

The benefits are not just in development; when the time comes to release our software in production, you can see how helpful it is to have a script that is able to set up the database with no effort. In fact, all database maintenance operations should be automated. It’s one of the principles explained so well in The Pragmatic Programmer, a very good book.

For example, this is a typical script that I use in my non-Rails projects:

#!/bin/bash

src=src/main/sql
dbname=myapp_development
dbname_test=myapp_test
dbuser=myapp_user
dbpassword=myapp_password

# Usually no changes needed beyond this point

if [ ! -d "$src" ]; then
  echo "Run this script from the main directory"
  exit 1
fi
read -s -p "mysql root password? (type return for no password) " MYSQL_ROOT_PASSWORD

if [ "$MYSQL_ROOT_PASSWORD" != "" ]; then
    MYSQL_ROOT_PASSWORD=-p$MYSQL_ROOT_PASSWORD
fi

mysqladmin -uroot $MYSQL_ROOT_PASSWORD drop $dbname
mysqladmin -uroot $MYSQL_ROOT_PASSWORD --force drop $dbname_test
mysqladmin -uroot $MYSQL_ROOT_PASSWORD create $dbname
mysqladmin -uroot $MYSQL_ROOT_PASSWORD create $dbname_test
echo "$dbname created"
echo "grant all on $dbname.* to '$dbuser'@localhost identified by '$dbpassword';" \
     | mysql -uroot $MYSQL_ROOT_PASSWORD $dbname
echo "grant all on $dbname_test.* to '$dbuser'@localhost identified by '$dbpassword';" \
     | mysql -uroot $MYSQL_ROOT_PASSWORD $dbname_test
echo "$dbuser authorized"
cat $src/???_*.sql | mysql -u$dbuser -p$dbpassword $dbname
cat $src/???_*.sql | mysql -u$dbuser -p$dbpassword $dbname_test
echo "schema loaded"

This handy little script will create the development and test databases, and load all sql scripts. I like to name sql scripts like 001_create_foobar_table.sql and 002_add_frobniz_column_to_foobar.sql, so that they can be loaded in sequence. It’s a simple way to develop the database schema incrementally. I may talk about it in another post.

Two floats are never equal

Saturday, October 17th, 2009

While we are on the topic of floating point fundamentals, there is another thing to remember: it is always a mistake to compare two floating-point numbers for equality.

It all boils down to the simple fact that floating-point arithmetic is not exact. It is meant for approximate calculations with engineering or scientific measurements, which are inexact to begin with. In fact, floating-point arithmetics results are almost never equal to the “true” value you would get by using exact real arithmetic.

Therefore, wherever you see something like x == 0.0, you can be fairly sure that it’s a mistake. Whatever computation produces the value of x, it’s unlikely to ever produce exactly 0.0.

The proper way to compare floating points is equality within some tolerance. For instance:

boolean approximatelyEqual(double a, double b, double epsilon) {
  return Math.abs(a - b) <= epsilon;
}

The above code works for most applications. It does not take into account the case that the inputs are NaN or infinities. I’m no expert of floating point arithmetic, so I will not give advice about this. For reference I copy here the following code from JUnit:

static public void assertEquals(String message, double expected,
    double actual, double delta) {
  if (Double.compare(expected, actual) == 0)
    return;
  if (!(Math.abs(expected - actual) <= delta))
    failNotEquals(message, new Double(expected), new Double(actual));
}

The purpose of the compare call is to have the test pass when the two numbers are both NaN.

Money is not a float

Saturday, October 17th, 2009

One suggestion I took to heart is that in order to be great, you need to work on fundamentals. It’s no good to be up to date with the latest and greatest, be they Agile techniques or new technologies, if you’re weak on fundamentals.

So I’m starting a collection of fundamentals, that is certainly not going to be comprehensive. Rather, it’s a random collection of things that I think are fundamental, yet many experienced developers get wrong.

Let us start with a surprising discovery: did you know that the number 1/10 cannot be represented in a finite way in base 2? Yep, it turns out that in base 2 the number 1/10 is periodical, much like the number 1/3 has no finite decimal representation in base 10. But what is the implication for us?

The implication comes when we make the mistake of representing a money in a floating-point number. Suppose you encode the amount of “ten cents” in the floating-point number 0.10. And now look at this program, and guess what happens when it runs.

  public class MoneyIsNotAFloat {
    public static void main(String[] args) {
      double tenCents = 0.1;
      double sum = 0.0;
      for (int i=0; i<10; i++) {
        sum += tenCents;
        System.out.println("0.1 * " + (i+1) + " = " + sum);
      }
    }
  }

(Hint: 1.0 times 10 equals… 0.99999999999999).

And this is not a Java problem. The same happens with any language, for it’s a matter of floating point arithmetic.

The simple fact is that floating-point arithmetic is not exact, therefore it should not be used for representing money!

What to use then? One simple solution is to use a plain int to represent an amount of cents. Integer arithmetic is exact. A 32-bit int should be enough for most applications. If you’re worried about overflow, use a BigDecimal type. Java has one, and most modern languages do too. (Just a note: if you use a Java BigDecimal, remember that you should not compare them with “equals”, you must use “compare”. Go figure.)