March 22nd, 2010 @ 00:27

Last autumn I had to run some particularly CPU-time-consuming but completely independent computations on thousands of pictures (namely, computing SIFT keypoints on high-resolution images). On my laptop it would have taken hundreds of hours to complete (about 20 minutes per image), but it could easily be done in a distributed manner on dozens of machines.

I thus started writing a little cloud-computing-like framework to handle the distributed computations in a safe manner: the machines I intended to run my computations on were public-access machines and could get rebooted at any time, so a computation (let's call it a task from now on) started on one of them might never finish.

There’s no trick here: the user defines his tasks and an entry point which sets off the initial tasks, and that’s it. The tasks are stored in a workqueue handled by a master node, which assigns them to the other nodes (actually, the master server is split in two: a “heartbeat” server which monitors the slaves’ status, and a taskqueue server from which the slaves pull their tasks). If a node does not complete a task within a user-defined amount of time, the master reassigns it to another node. Still, if the initial node manages to complete the task before the second one, the task gets marked completed anyway (the first result wins, which prevents infinite task respawning in case of an unexpectedly long task).
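To make that reassignment behaviour concrete, here is a minimal Python sketch of a lease-based taskqueue along those lines; all the names are mine and this is not Cloudpy’s actual code, just the idea.

```python
import time

class TaskQueue:
    """Hands tasks out to slaves and re-queues them after a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout   # user-defined deadline, in seconds
        self.pending = []        # tasks waiting for a node
        self.leases = {}         # task -> time it was last assigned
        self.done = set()        # tasks whose first result has arrived

    def put(self, task):
        self.pending.append(task)

    def pull(self, now=None):
        """Called by a slave to fetch its next task."""
        now = time.time() if now is None else now
        # Re-queue tasks whose node missed the deadline (rebooted, etc.).
        for task, started in list(self.leases.items()):
            if now - started > self.timeout:
                del self.leases[task]
                self.pending.append(task)
        while self.pending:
            task = self.pending.pop(0)
            if task in self.done:
                continue         # already finished by the original node
            self.leases[task] = now
            return task
        return None

    def complete(self, task):
        """The first completion wins; duplicates are simply ignored."""
        if task not in self.done:
            self.done.add(task)
            self.leases.pop(task, None)
```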

The tasks may use data stored in either a file-like storage or a database. Since the whole thing is in Python without any runtime restriction, arbitrary code or binaries can be run through it, which feels like a big advantage over frameworks where only a few languages are supported. A nice feature is that each task can be associated with a file pack (holding e.g. the binaries the task will run) which is automatically deployed to the node when the task starts there, and cleaned up afterwards.
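The task API isn’t spelled out here, so purely as an illustration, a user-defined task with an attached file pack might look something like this (every name below is hypothetical):

```python
import subprocess

class SiftTask:
    # File pack deployed on the node before run() and cleaned afterwards;
    # here it would carry the sift binary itself.
    filepack = "sift-pack"

    def __init__(self, image_path):
        self.image_path = image_path

    def run(self):
        # Arbitrary binaries can be executed, not just Python code.
        subprocess.check_call(["./sift", self.image_path,
                               "-o", self.image_path + ".key"])

def entry_point(spawn):
    """The entry point sets off the initial tasks."""
    for i in range(1000):
        spawn(SiftTask("images/%04d.png" % i))
```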

Let’s also mention that the slaves are automatically deployed from a simple configuration file (through ssh/scp); that applications are easy to deploy as well (the initial files, the entry points and so on are defined in a single config file, from which the deploy tool automagically creates the file packs, pushes the pack data to the heartbeat server and plugs the entry point into the taskqueue); and that there are only a few dependencies for the slaves (python and python-yaml), and not many awful ones for the various servers (mostly add sqlalchemy to the slave deps).
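The actual config format isn’t shown in this post, but since python-yaml is the one extra slave dependency, a hypothetical application config could look like the YAML below, which the deploy tool would simply load:

```python
import yaml  # python-yaml, the only extra dependency on the slaves

CONFIG = """
application: sift-keypoints
entry_point: tasks.entry_point
packs:
  sift-pack:
    files: [bin/sift, tasks.py]
slaves:
  - {host: lab-pc-01.example.org, user: cloudpy}
  - {host: lab-pc-02.example.org, user: cloudpy}
"""

config = yaml.safe_load(CONFIG)
for slave in config["slaves"]:
    # The deploy tool would scp the packs over and start the slave.
    print("would deploy to %s@%s" % (slave["user"], slave["host"]))
```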

There are still some points I’d like to improve. First, I’d like to make the master node dynamic, not static like it is now. Nodes which may be elected master (i.e. those which can directly connect to the various storages) would use a distributed protocol to actually elect the master (I worked on this kind of problem last year during a complexity course; see my summary about the subject, in French: Calculabilité Distribuée), and then broadcast the winner of the election to the other nodes. If the current master became unreachable at any time, the electable slaves would start the election process right away, while still trying to reach the old master until the election completes.
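As a toy illustration of the idea, a bully-style election boils down to picking the highest-ranked electable node that is still reachable; the real thing would of course involve actual message passing between the nodes.

```python
def elect_master(electable_ids, reachable):
    """Every electable node proposes itself; the highest reachable id
    wins, and the winner is then broadcast to all the other nodes."""
    candidates = [n for n in electable_ids if reachable(n)]
    if not candidates:
        raise RuntimeError("no electable node is reachable")
    return max(candidates)

# e.g. the old master (node 3) just became unreachable:
print(elect_master([1, 2, 3, 4], lambda n: n != 3))   # -> 4
```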
I’d also like a nicer way to handle mixes of 32-bit and 64-bit architectures (i.e. a way to define architecture-specific initial packages).
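One hypothetical shape for such architecture-specific packs (again, nothing like this exists yet, it’s exactly the missing feature):

```python
import platform

# Hypothetical per-architecture variants of a file pack; the deploy
# tool would pick the right one for each slave.
PACKS = {
    "sift-pack": {
        "32bit": ["bin/sift-32", "tasks.py"],
        "64bit": ["bin/sift-64", "tasks.py"],
    },
}

bits = platform.architecture()[0]   # "32bit" or "64bit"
print("would deploy", PACKS["sift-pack"][bits])
```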
Security is another matter: I should probably chroot the runtime environment and secure the transmissions between the nodes (note that there’s absolutely no notion of authentication anywhere in the software at the moment).

The only problem is that when I finally finished the framework, I had no time left to write the little bit of script needed to actually test the whole thing, and a friend had already thrown the pictures at his high-end quad-core computer, so I never got the chance to try it in a real-life scenario. I’ll try someday, though. If you want to give the latest version I pushed a try (it lacks the database backend and documentation at the moment, though I intend to finish/test them and upload them someday soon), check the Cloudpy git repo.