Spark of Life

A while ago I gave a short presentation at JFall about Apache Spark (see this post) and I decided that I should also demonstrate a different type of job for Apache Spark. Whereas the JFall presentation showed a more typical data analysis type job the one in this example is a CPU bound long running process: solving the traveling salesman problem through a genetic algorithm.

It’s heavily inspired by a talk my friend and colleague Bas Knopper gave on the subject (my solver implementation is based on an earlier presentation). I won’t dive too deep into the algorithm (Bas does a much better job of explaining it) but it basically works by generating random routes between cities, finding the best ones and then recombining and mutating these best ones to find even better ones. This is done in a number of iterations.

The code can be found at this repository.

Introduction

What we need to do is parallelize our task in such a way that we benefit from the efficiency of multiple worker threads. To see how we might do this we can take a look at the Spark Pi example as provided by the Spark docs. This little piece of code uses multiple workers to perform a CPU bound task; approximating Pi:

JavaSparkContext sc = new JavaSparkContext("local[4]", "PiSample");

int slices = sc.defaultParallelism();
int n = 100000 * slices;
List<Integer> l = new ArrayList<>(n);
for (int i = 0; i < n; i++) {
    l.add(i);
}

JavaRDD<Integer> dataSet = sc.parallelize(l, slices);

int count = dataSet.map(i -> {
    double x = Math.random() * 2 - 1;
    double y = Math.random() * 2 - 1;
    return (x * x + y * y < 1) ? 1 : 0;
}).reduce((i1, i2) -> i1 + i2);

double pi = 4.0 * (double)count / (double)n;
System.out.println("Pi is roughly " + pi);

We first parallellize the RDD (the fork), then we do the actual CPU bound calculation in the map (solve) and then reduce the values (recombine) so we can print an approximation of Pi. Great! So now we know how to do a fork-solve-recombine we can also do it in a loop on a different type of calculation.

Our genetic algorithm works slightly different: the more time you let it spend the closer it will get to an optimal solution but it doesn’t know for sure if it is in fact the optimal solution. This is why you normally set some kind of cutoff where you decide a solution is ‘good enough’. For example you only want to spend a fixed amount of time or you just quit if it hasn’t found a better solution in a certain amount of iterations.

And then there is the issue that every now and then you need to join the results of the different workers: one of the workers might’ve gotten ‘lucky’ with a good route and you want to share that good path with the other workers.

So let’s take a look at our SparkFacade’s ‘solve’ method:

Work work = Work.forCities(cities, maxDuration);
double start = work.shortest().getDistance();

for(int i = 0; i < iterations;i++) {
    double iterStart = work.shortest().getDistance();
    JavaRDD<Work> dataSet = sc.parallelize(work.fork(sc.defaultParallelism()));

    work = dataSet.map(Work::solve).reduce(Work::combine);

    LOG.info("Iteration {} result: {}", i, formatDistance(work.shortest().getDistance()));
    if(iterStart == work.shortest().getDistance()) {
        LOG.info("No change; terminating.");
        break;
    }
}

As you can see it’s very similar to the Pi example with the only difference that it runs the parallelize-map-reduce code inside a for loop. You can see that I’ve created a Work class: this is a neat little package with a List of Routes that is basically a task for the Worker to solve. And this is exactly what happens: in the map function the solve method is called (which in turn lets the genetic algorithm loose) and it simple runs until it reaches a maximum in time, same values (meaning the algorithm found an optimum) or generations.

The loop ends when either the max number of iterations is reached or there is no improvement in distance between iterations.

If you play around with the length of the routes (the -c command line switch) you will soon notice it finds an optimal length very fast for short routes (10 cities) but it takes a lot longer for longer routes (50-150 routes).

Conclusion

I hope you enjoyed this short post on using Spark for long running algorithms. It was a lot of fun toying with it and getting it to work. If you have any questions or suggestions please poke me on github or via Twitter!

Running a Genetic Algorithm on Apache Spark

Introduction

Conclusion

Further reading / interesting links