March 26th, 2010 @ 18:12
[youtube]http://www.youtube.com/watch?v=ymuV4Cql7lY[/youtube]
The ramp between the new Gates Hillman Center and the Purnell Center for the Arts at night
March 26th, 2010 @ 03:43
Sense : this picture makes none
No clue where this image first appeared, so I won’t be able to give proper credit.

This image makes me think of the music video for Takin’ Back My Love by Enrique Iglesias and Ciara. I fail to see the logic behind the different attacks : she drops his paintings and clothes into the swimming pool, while he barely spills some milk on the floor. WTF ?

By the way, I had been wondering for a while how to make links which directly point to a specific time in a Youtube video. Well, now I know.
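(In case you were wondering too : if I recall correctly, you just append something like #t=1m30s to the video URL and playback starts at that offset.)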

March 26th, 2010 @ 01:28

After migrating a bunch of stuff from one (about to expire) server running lenny to a new one running squeeze, a friend’s blog, which is powered by Dotclear, appeared heavily broken. His posts showed up empty : they were still there and the titles were right, but nothing else showed (no URL, no author, no date, no content). After a little bit of investigation, I figured that the problem was that squeeze runs PHP 5.3, and that my friend’s version of Dotclear obviously didn’t support it. Checked the Dotclear website and found out that PHP 5.3 support is only planned for the upcoming Dotclear 2.2, so the latest release (Dotclear 2.1.6 — which my friend already had, actually) does not support it. Checked the PHP website for the 5.3 release date : 30 June 2009. Wow.

Looked a little deeper into the Dotclear forums, and I found a patch which is actually a workaround for the problem. This workaround has been available since the 20th of July, 2009, but the Dotclear developers won’t include it even in the 2.1 branch, because it’s a workaround and not a real fix :

Le patch n’est pas appliqué parce que, tu le dis toi-même, c’est une rustine. Ça peut paraître vieux jeu, mais nous préférons garder un code propre et régler les problèmes correctement.

Which translates to “The patch hasn’t been applied because, as you say yourself, it’s a stopgap. It might sound old-fashioned, but we’d rather keep the code clean and fix problems properly”. Well, it sounds like a great plan, which would be fine if it didn’t take them ages to produce that clean code :) (arguably, since the patch is readily available, my point is pretty much moot, but still, it’s not official — the average user grabbing the latest official release and installing it on a host that provides PHP 5.3 could easily get confused).

March 25th, 2010 @ 23:45

There’s an outstanding bug right now which makes the cvCanny edge detector function in OpenCV segfault on x86_64 systems. This post is a running log of my debugging process :)

  • Bug encountered. I know it’s x86_64 specific since I ran the same code on an i686 machine a few hours ago (with a home compiled OpenCV, though).
  • Googled it : found reports on both OpenCV and RedHat bugtrackers.
  • Installed debug symbols, ran under gdb : all values I may need are optimized out.
  • Fetched OpenCV source, compiled it in debug mode.
  • cvCanny works great in debug mode.
  • Recompiled in release (optimized) mode to check if it is a distro-specific bug (both reports are from Fedora users).
  • Whoa, release-mode compilation is so slow :( But the bug is confirmed : it segfaults again. Time to instrument the code.
  • Filled the cvCanny function with printfs and fflushes to trace the execution. Looks like it tries to access an element at index -514. Huh. What’s even more frightening is that it successfully does so on another array.
  • After running the same instrumented code on my i686 machine, it appears that the indexes are right and that the same indexes are accessed without any problem in optimized mode in the i686 build.
  • Reading the code tells me that the accesses at negative indexes are legit, since the array origin is shifted from the actual start of the allocated memory blob. Well, that’s good, since it explains why it works fine in debug mode or on i686 setups, but that’s pretty bad, because it’s going to be awful to narrow down.
  • OK, doing the access by hand (i.e. doing _map[-514] instead of _map[j - mapstep]) works. This is getting crazy. Doing k = j - mapstep and accessing _map[k] segfaults as well. Huh.
  • After an hour of heavy fprintfs, I figured that long k = j - mapstep; gave me a k which wasn’t the int value (-514) but rather the unsigned int value (4294966782), while doing int k = -514; long k2 = k; printf("%d %ld\n", k, k2); in a trivial standalone program prints -514 -514, even with -O3 or -O5 and all the options used for the OpenCV release build. Since we are working with 64-bit pointers (i.e. the size of a long integer), this is most likely the issue : when accessing _map[k], it dereferences the value at _map + k, which fails since it dereferences _map + 4294966782 instead of _map - 514 (see the short standalone example right after this list).
  • Doing volatile int k = j - mapstep; and accessing _map[k] works, and cvCanny now runs great. This isn’t a real fix though, just a workaround ; there’s most likely a compiler bug underneath.
  • Posted a summary of my findings and the workaround on the bug report on the OpenCV tracker.
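
I haven’t confirmed that this is exactly what happens inside the optimized cvCanny build (as said above, there may well be a compiler bug underneath), but the tiny standalone program below shows how easily a 32-bit unsigned intermediate turns -514 into 4294966782 once it gets widened to a 64-bit index on an LP64 system like x86_64. The unsigned mapstep here is an assumption made purely for the demonstration :

#include <cstdio>

int main()
{
    int j = 510;
    unsigned int mapstep = 1024;   // assume the stride ends up unsigned somewhere

    // j is converted to unsigned for the subtraction, which wraps around,
    // and the 32-bit result is then zero-extended when converted to long...
    long k = j - mapstep;                      // 4294966782

    // ...whereas a fully signed subtraction is sign-extended as expected.
    long k2 = j - static_cast<int>(mapstep);   // -514

    printf("%ld %ld\n", k, k2);                // prints : 4294966782 -514
    return 0;
}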

Patch against latest svn (it should apply nicely to the 2.0.0 release as well) :

Index: cvcanny.cpp
===================================================================
--- cvcanny.cpp	(révision 2908)
+++ cvcanny.cpp	(copie de travail)
@@ -239,7 +239,8 @@
                 {
                     if( m > _mag[j-1] && m >= _mag[j+1] )
                     {
-                        if( m > high && !prev_flag && _map[j-mapstep] != 2 )
+                        volatile int k = j - mapstep;
+                        if( m > high && !prev_flag && _map[k] != 2 )
                         {
                             CANNY_PUSH( _map + j );
                             prev_flag = 1;
@@ -253,7 +254,8 @@
                 {
                     if( m > _mag[j+magstep2] && m >= _mag[j+magstep1] )
                     {
-                        if( m > high && !prev_flag && _map[j-mapstep] != 2 )
+                        volatile int k = j - mapstep;
+                        if( m > high && !prev_flag && _map[k] != 2 )
                         {
                             CANNY_PUSH( _map + j );
                             prev_flag = 1;
@@ -268,7 +270,8 @@
                     s = s < 0 ? -1 : 1;
                     if( m > _mag[j+magstep2-s] && m > _mag[j+magstep1+s] )
                     {
-                        if( m > high && !prev_flag && _map[j-mapstep] != 2 )
+                        volatile int k = j - mapstep;
+                        if( m > high && !prev_flag && _map[k] != 2 )
                         {
                             CANNY_PUSH( _map + j );
                             prev_flag = 1;
March 25th, 2010 @ 01:46

Today I’ve been looking at opencv-python for a quick project (I’d like to practice OpenCV a little bit). Installed the opencv-python package on Fedora 13, headed to the samples directory (/usr/share/opencv/samples/python/), started running one of them and… boom, segfault. Tried another one (the inpainting one), and it worked. A third one… segfault. Most of the samples in there segfaulted, mostly with SWIG errors about wrong parameters, always mentioning int64 (I’m using an x86_64 kernel & distribution).

After half an hour of failing to get opencv.i686 to work alongside my x86_64 Python, I went back to the OpenCV website to check whether there were any known major problems with x86_64 systems and… I discovered this :

Starting with release 2.0, OpenCV has a new Python interface. (The previous Python interface is described in SwigPythonInterface.)

Wait wait wait, you mean that all this time I was running the OLD, pre-2.0 Python interface ? Why the hell does the opencv-python 2.0 package provide both the new and the old interfaces ? (well, I know the answer : backwards compatibility). Meh :( Anyway, I wish the old samples would get ported to the new interface… At the moment there’s no sample using it at all :/

March 24th, 2010 @ 00:53

Since Sunday I have been testing Picasa Web face recognition on my set of pictures. After an hour of initial processing, I was presented with an interface showing a list of clusters of faces, which I have to check, removing false positives and naming people. While it seemed to work great for the very first clusters (which correctly grouped about 50 to 100 faces of the same person), it quickly appeared that the whole thing was not that great.

Here are a few rants :

  • It seems to be heavily influenced by facial expression and angle (i.e. it’ll often make two clusters of faces of the same person depending on whether the face is tilted to the left or to the right).
  • It doesn’t reconsider the clustering after the initial processing : I’m pretty sure that after tagging a bunch of clusters of the same person, it could easily merge the remaining clusters into a single one.
  • It keeps giving me “communication errors”. I’m used to the “click and it’s immediately there” experience with Google services, and I have to say that this service is definitely not a good example of it. About two thirds of my actions result in such an error, which takes about 10 seconds to show up ; successful actions also take several seconds (5 to 10) to complete. That’s not really efficient when you have pictures of about 800 different people, which yields lots of clusters of one or two faces you just want to ignore, and it takes about 20 seconds per cluster to actually get one ignored.

I know this is still in development, so the actual recognition problems are ok, but meh, the communication errors are really really annoying…

Update (03/25/10) : I don’t know if they fixed the problem or if it was just pure luck, but I was able to tag 1500 faces without a single problem in about half an hour. Yippee !

March 22nd, 2010 @ 01:56

For a project midterm presentation, we were asked to produce a bunch of slides explaining our project architecture and implementation choices. Apart from the obvious things (libraries in use, network protocol…), I had no real clue what to put in, so I thought I’d just throw in some UML-like diagrams and that would be fine. The only detail was : how to produce these diagrams ?

Since the project code was written in Python, all the inheritance relations were already held by the code and could be introspected, so it was theoretically possible to produce the inheritance diagram automatically. And it actually is possible : it’s implemented by tools like Epydoc (a documentation generator for Python code), which handles both the parsing and the diagram generation. The only thing is that I wasn’t satisfied with the Epydoc diagrams, since they are limited to inheritance relationships, while I also wanted to include usage relationships and to display only the main methods and variables of my objects.

I thus wrote Umlpy, a UML-like class diagram generator for Python code, which depends on Epydoc (for the parser) and python-graphviz (for the graph generation ; it produces nicely spaced graphs and can output jpg, png or pdf files, and probably more). It handles the aforementioned requirements through docstring parsing and introspection. Check the Umlpy README file for more documentation on how to use it. It took me about 10 minutes to get to the result I was expecting (it’s basically a matter of adding a little docstring for each usage relationship, and of copy-pasting a docstring onto the methods or variables you want to see on the diagram).

This wouldn’t be complete without the mandatory screenshot, and this example code results in this diagram :
Umlpy result example

March 22nd, 2010 @ 00:27

Last autumn I had to run some particularly CPU-time-consuming but completely independent computations on thousands of pictures (namely, computing SIFT keypoints on high resolution images). Using my laptop, it would have taken hundreds of hours to complete (it took about 20 minutes per image), but it could easily be done in a distributed manner on dozens of machines.

I thus started writing a little cloud-computing-like framework to handle the distributed computations in a safe manner : the machines I was intending to run my computations on were public-access machines and could get rebooted at any time, so a computation (let’s call it a task from now on) started on one of them might never finish.

There’s no trick here : the user defines his tasks and an entry point which sets off the initial tasks, and that’s it. The tasks are stored in a workqueue handled by a master node, which assigns tasks to the other nodes (actually, the master server is split in two : a “heartbeat” server which monitors the slaves’ status, and a taskqueue server from which the slaves pull their tasks). If a node does not complete a task within a user-defined amount of time, the master assigns it to another node. Still, if the initial node manages to complete the task before the second one does, the task is considered completed anyway (this prevents infinite task respawning in case of an unexpectedly long task).

The tasks may use data stored in either a file-like storage or a database. Since the whole thing is in Python without any runtime restriction, arbitrary code or binaries can be run through it, which feels like a big advantage over frameworks where only a few languages are supported. A nice feature is that each task can be associated with a file pack (holding e.g. the binaries which will be run by the task) which is automatically deployed or cleaned up when the task is started on a node.

Let’s also mention that the slaves are automatically deployed based on a simple configuration file (through ssh/scp) ; that applications are easy to deploy as well (the initial files, the entry points and so on are defined in a single config file, from which the deploy tool automagically creates the file packs, pushes the pack data to the heartbeat server and plugs the entry point into the taskqueue) ; and that there are only a few dependencies for the slaves (python and python-yaml), and nothing too awful for the various servers (mostly add sqlalchemy to the slaves’ dependencies).

There are still some points I’d like to improve. I’d first like to make the master node dynamic, rather than static as it is now. Nodes which can be elected master (i.e. those which can directly connect to the various storages) would use a distributed protocol to actually elect the master (I worked on this kind of problem last year during a complexity course, see my (French) summary on the subject : Calculabilité Distribuée), and then broadcast the winner of the election to the other nodes. If the current master becomes unreachable at any time, the electable slaves would start the election process right away, but would keep trying to reach the master until the election is completed.
I’d also like to handle mixes of 32-bit and 64-bit architectures in a nicer way (i.e. provide a way to define 32-bit/64-bit specific initial packages).
Security is another matter, I guess I should probably add some chrooting to the runtime environment, and secure the transmissions between the nodes (note that there’s absolutely no notion of authentication in any part of the software at the moment).

The only problem is that when I reached the point where the framework was finished, I had no time left to write the little bit of script needed to actually test the whole thing, and a friend had already thrown the pictures at his high-end quad-core computer, so I had no chance to try it in a real-life scenario. I’ll try someday, though. If you want to give the latest version I pushed a try (it lacks the database backend and documentation at the moment, though I intend to finish/test them and upload them someday soon), check the Cloudpy git repo.

March 22nd, 2010 @ 00:13

A few days ago, I discovered I had a mini-USB-A (or something along those lines) plug on my beloved EOS 450D camera. Fun stuff : I could read pictures directly from the camera without having to swap my SDHC card back and forth, at a pretty decent transfer rate (though quite a bit slower than through my laptop’s card reader).

I then discovered libgphoto2, which is the library behind the Linux support for this feature. After a little bit of googling, I stumbled upon the gphoto2 website, where they mention that it is possible to trigger a shot from the computer and directly retrieve the resulting image on the computer. Immediately tried it and… it works great ! (though I haven’t figured out yet how to get the picture saved on the SD card instead of directly on the computer).
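For reference, the gphoto2 command-line tool exposes this through options such as --capture-image-and-download (from memory, so double-check the exact spelling against the gphoto2 man page).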

Not sure what I’m going to do with this, but I’m confident there are a bunch of hacking possibilities here. Maybe I should write a simple GUI on top of it someday, to begin with. Or I could try building a motorized base and remote-controlling the whole thing from my N900 (and attempt to do things like Paris 26 Gigapixels). Hmmmmm…

March 20th, 2010 @ 01:40

Last summer, I went to INRIA Sophia Antipolis (INRIA is, basically, the French research institute for computer science ; Sophia Antipolis is the name of an area covering a few villages near Nice and Antibes (on the Côte d’Azur) which hosts quite a lot of technology companies, among which HP and Amadeus) for a 3-month internship in the PLANETE project, under the advisory of Mathieu Lacage, a great software engineer.

Before any extra details, let’s point to the code, the report and the slides. There are also some slides somewhere in the wild which were made for a quick talk given by Mathieu at ns-3 workshop at SIMUtools 2010.

The goal of the internship was to enable the seamless use of multicore parallelization in ns-3 network simulator to speed up simulations.
The obvious idea behind this project is that, with the ongoing trend towards more and more cores even in personal computers, such a change would allow large speedups without having to throw a lot of extra hardware into the balance and without requiring complicated configuration from the user.

Our goal was to make these changes entirely transparent for both developers (i.e. those working on improving the ns-3 framework — routing stuff, device handling…) and users (i.e. those using ns-3 to simulate their protocols…).

Before going into the optimization details, let’s mention the basic idea of the algorithms we used to synchronize the execution units (the cores, in our case), algorithms which belong to the class of conservative synchronization algorithms (as opposed to the optimistic class).
We first tried a message-passing based approach, namely the Chandy-Misra-Bryant algorithm. The whole idea of this algorithm is that each node knows how far into the future it can advance without any risk of desynchronizing the simulation. This is based on knowing the physical transfer delays, plus the current date of the neighboring nodes ; dates which are exchanged through periodic messages. This way, under certain conditions (namely that no link has a null transfer time), the simulation safely advances.
We found out that this approach would lead to no performance gain, because of the sheer amount of messages which have to be exchanged compared to the actual amount of work done per time unit.
We then used a barrier-based synchronization algorithm : at each iteration, a global maximum date is determined, based on the current date of each node and the physical transfer delays (again). All the nodes are then allowed to process events up to this date, wait at the barrier, and move on to the next iteration once everyone has reached the barrier. The actual barrier implementation is a fun part in itself, which is described in the report : from a trivial semaphore implementation to a fully optimized, tree-shaped implementation, performance gains of over a hundredfold can be observed. A minimal sketch of the barrier idea follows.
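
To give an idea of what the simplest variant looks like, here is a minimal generation-counting barrier written with C++11 primitives. This is only a sketch, not the ns-3 implementation (which goes much further, up to the tree-shaped version mentioned above), and the helper names in the usage comment (ComputeGlobalSafeDate, ProcessEventsUpTo) are purely illustrative :

#include <condition_variable>
#include <mutex>

// Reusable barrier : every thread blocks in Wait() until `count` threads
// have arrived, then they are all released and the barrier can be reused
// for the next iteration (hence the generation counter).
class Barrier {
public:
    explicit Barrier(int count) : threshold_(count), count_(count), generation_(0) {}

    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        int gen = generation_;
        if (--count_ == 0) {          // last thread to arrive...
            ++generation_;            // ...opens the barrier for everyone
            count_ = threshold_;
            cond_.notify_all();
        } else {
            cond_.wait(lock, [this, gen] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    const int threshold_;
    int count_;
    int generation_;
};

// Each execution unit would then loop along these lines (illustrative names) :
//   SimTime safe = ComputeGlobalSafeDate();  // min of the current dates + transfer delays
//   ProcessEventsUpTo(safe);
//   barrier.Wait();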

One of the usual challenges in parallel computing is to properly partition the work between the execution units. For instance, when working on a network of nodes (which are computers, routers…), which nodes should be handled by which execution unit ? Since we don’t want to ask the user what he thinks would be best, and since topology clues alone aren’t enough to make a good partitioning (it also depends on the actual dynamics of the network, which we can’t predict), we had to take a dumb solution and make it smart. The solution we found was to have both dedicated and global nodes : a given percentage of the nodes is dedicated to each execution unit, while a global pool of nodes remains unassigned. When a unit has finished processing its own dedicated nodes, it starts picking nodes from the global pool and processes them. This way, fast units won’t waste their time waiting for units whose dedicated set happens to require more processing time.

Now you are going to ask : why not just use a global pool and let all the units pick nodes to process directly from there ? The whole problem is that this global list must be protected : two threads just can’t pick stuff from it at the same time, as this would lead to awful memory corruption, and probably worse. Consequently, we need either mutexes (i.e. locks) or atomic operations to modify the list. Atomic operations are cheap compared to locks, and yet they are still at least 10 times slower than normal operations. This justifies the whole dedicated + global pool approach (see the short sketch below).
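
For illustration, here is roughly what pulling work from the shared pool through a single atomic cursor can look like. This is a sketch using std::atomic, not the actual ns-3 scheduler code ; Node and ProcessNode are placeholders :

#include <atomic>
#include <cstddef>
#include <vector>

struct Node { /* per-node simulation state would live here */ };

void ProcessNode(Node&) { /* placeholder : run the node's pending events */ }

std::vector<Node*> global_pool;          // filled before the worker threads start
std::atomic<std::size_t> next_index(0);  // shared cursor into the pool

// Called by an execution unit once its dedicated nodes are done.
// fetch_add hands out every index exactly once, so two threads can never
// grab the same node ; the price is one atomic operation per node taken.
void DrainGlobalPool() {
    for (;;) {
        std::size_t i = next_index.fetch_add(1, std::memory_order_relaxed);
        if (i >= global_pool.size())
            break;
        ProcessNode(*global_pool[i]);
    }
}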

Now, there were more thread-safety problems. While most of them were easily defused at the cost of minor performance losses (such as disabling free lists or caches), there was still a major issue : reference counting. Indeed, in order to be as user-friendly as possible, ns-3 provides (and heavily uses) a reference-counting system to ease memory handling for users. The whole problem is that the counters might be accessed concurrently by two threads at once. What happens when a thread references an object microseconds before another thread unreferences it ?
A usual i++ operation is typically done by loading i from memory into a register, adding 1 to it and storing it back to memory. Now, if the original reference count was 1, we could have the first thread loading i = 1 from memory, the second one loading i = 1 from memory, subtracting 1, storing i = 0 back into memory, then figuring i == 0 and trashing the object, then the first thread adding 1 locally and putting i = 2 back into memo… wait, this memory has been deallocated. Game over.
Again, a possible solution is to use atomic operations. Oh right, they are awfully expensive. Who cares, let’s try, there can’t be that many ref/unrefs going around. And then… ow, we just caught a plain 100% runtime increase.
While we fixed some of the stupidities that led to such horrible increases, the problem was still there.
We tried using thread-local storage to maintain per-thread reference counts, but this did not really solve the problem, it only helped.
We finally tried using thread-local buffers of Ref/Unref operations, which would be batch-run by a single thread. This led to nice performance improvements over the previous solutions, but if I remember correctly, under certain unknown conditions the code would crash badly or eat all the memory. Urgh.
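
For reference, the straightforward atomic version of Ref/Unref mentioned above looks like the sketch below (written with C++11 std::atomic ; it is not the actual ns-3 code). It does fix the race described earlier, but every single Ref/Unref now pays for an atomic read-modify-write, which is why it hurt so much :

#include <atomic>

// Sketch of an atomically reference-counted base class.  fetch_add and
// fetch_sub make the increments and decrements indivisible, so the
// "two threads both read 1" scenario can no longer happen.
class RefCounted {
public:
    RefCounted() : count_(1) {}
    virtual ~RefCounted() {}

    void Ref() {
        count_.fetch_add(1, std::memory_order_relaxed);
    }

    void Unref() {
        // The last decrement must synchronize with every previous one
        // before the object gets destroyed, hence acq_rel.
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;
    }

private:
    std::atomic<int> count_;
};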

Still, under some conditions, I managed to get over 20% performance improvement with 8 threads (iirc).

The conclusion of all this is that multithreaded programming is definitely not a trivial problem, and that converting existing software to benefit from multithreading is not that easy. For more information, numbers and all, check the report.

Now, I have had no time since mid-September to work on the issue, but I know Mathieu has. It’d be awesome to have some news from him here… To be continued, hopefully.