Archive for the ‘Code’ Category

July 22nd, 2010 @ 01:18

A school friend, namely p4bl0, mentioned the idea of maintaining blog posts with git and a set of hooks which would produce the blog html from the contents of the repo. I loved the idea, but thought I could push it a little further : a blog engine which would use no other storage than git, with the post subject and contents being the commit message subject and contents. A post-commit or post-receive hook then produces the html. As simple as that !

You can find the source in BloGit git repo, and see an example at BloGit example. To use the source, you first have to pack it (using the pack script), which will merge the raw_post and raw_produce, producing a single post script (which I also included at the end of this post), which you can simply put in an empty directory and run it. It will unpack the other script (produce), initialize the git repo, and set the hooks. It’ll then prompt you for your post title and then open an editor for you to set your post contents. Save the file, and you’re done with your first post : check the index.html file which has been produced in the same directory. You can write your own stylesheet in the blogit-style.css file. Further posts can be done with the same post script.

Yet, the best way is probably just to use the usual git workflow. To initialize the repo and all, run post --unpack, and to post post --raw or git commit --allow-empty (when using git commit, leave a blank line between the subject line and the rest of the post). You can also amend existing commits (using git commit --amend), use the GIT_AUTHOR_* environment variables to change the author, and so on. Since merge commits are skipped by the html generator, it should work just great for multi author blogging !

PS : I know this is JUST a git log pretty printer, and that the whole thing is pretty much trivial. I also know that using versionned files to store the posts would allow a lot of extra bonuses (such as automatically adding “Updated on …” mentions based on the commit log of each single file). I just thought the idea was fun :p There are probably a lot of things to improve, or a lot of smart git features to use there that I overlooked. Feel free to leave a line :)

(more…)

April 21st, 2010 @ 20:54

After sorting out my clutter issues and finally producing a video of a clutter animation, I thought I’d use it on the initial goal, that animation I had written ages ago. What I had sadly forecast occurred, the video dumping awfully slowed down the animation.

The whole problem is that, for now at least, I don’t think it’s possible to run the animation frame per frame rather than time based. So I thought “let’s just defer the whole video generation to after the end of the animation, and bufferize the frames meanwhile”. Well, this worked… until oomkiller jumped in and killed my process. Urgh.

So, I can’t bufferize the whole video, but I can’t push the frames to gstreamer in real time directly from the animation either. Well, all I need is parallelism then ! Push frames to a queue which is consumed by another execution unit (by pushing the frames to gstreamer). And since threading pretty much sucks in Python (well, it would definitely since we need real parallelism), let’s use the new multiprocessing framework from Python 2.6. Using it is pretty straightforward : create some parallel structures (queues, pipes), spawn a new process with it’s own main function, push to the structures from one process, read from another, and you’re done. The only thing I’m still wondering is why there is a close() function on Queues when there is no obvious way to detect from the other end that the queue has been closed (which I worked around by pushing a None message).

Well, now I have a smooth animation and a smooth video dump, my two cores being nicely fully used :)

The code is available below, with the interesting parts being StageRecorder.create_pipeline, StageRecorder.dump_frame, StageRecorder.stop_recording and StageRecorder.process_runner.

(more…)

April 20th, 2010 @ 19:15

A while back, I used clutter (a very nice and simple animation toolkit that basically let’s you easily work in a 3D environment with 2D objects) to do a little photo slideshow with a lot of customisations, but I never even showed it to the person it was aimed at because the whole thing was not satisfying enough (it either took ages to start or was not smooth and it was not easy to put a decent soundtrack when you can’t synchronize video & audio).

A simple solution would have been to do the rendering once and then just do the postproduction. I had quickly looked for a way to use a direct output of the animation to gstreamer (since there is gstreamer input support for clutter, this pretty much made sense), but there was none. Another option would have been to use a capture software, like Xvidcap, but this stuff is too heavy for my poor laptop. Consequently, I just gave up back then.

What I had completely overlooked is that clutter uses OpenGL for the rendering, so that all I had to do was to dump each frame myself using glReadPixels or using things like Yukon to do the dirty stuff. After a quick googling, I found this clutter mailing list thread about capturing the clutter output to a video file, which mentions the clutter_stage_read_pixels function, which does all the glReadPixels magic and even puts it in a more standard format. It also points to gnome-shell’s recorder stuff, which does the glReadPixels stuff and outputs it to a gstreamer pipeline, plus some extra fancy things (since they are doing screencasts of gnome-shell features, they draw the mouse cursor on top of each frame). So all I have to do now is put things together :)

One of the bad things I figured is that clutter_stage_read_pixels calls clutter_stage_paint, so mixing the gnome-shell recorder approach with clutter_stage_read_pixels results in a bad infinite loop if you don’t protect the whole thing. Even though this means painting things twice, I guess this is a much easier approach than having to use python-opengl or something along the line.

Another bad thing I encountered was that the Python bindings for clutter_stage_read_pixels are broken at the moment (pyclutter 1.0.2). The first problem is that the argument parsing part seems to be broken. Changing the PyArg_ParseTupleAndKeywords to a simple PyArg_ParseTuple gets things “working”, and gdb indicates a segfault in a PyDict_Check of the keywords argument :

Program received signal SIGSEGV, Segmentation fault.
0x00000032d34ecd9c in _PyArg_ParseTupleAndKeywords_SizeT (args=(0, 0, 500, 200), keywords=, format=
0x7ffff000d9ac "dddd:ClutterStage.read_pixels", kwlist=0x7ffff022f6c0) at Python/getargs.c:1409
1409 (keywords != NULL && !PyDict_Check(keywords)) ||
(gdb) bt
#0 0x00000032d34ecd9c in _PyArg_ParseTupleAndKeywords_SizeT (args=(0, 0, 500, 200), keywords=
, format=
0x7ffff000d9ac "iiii:ClutterStage.read_pixels", kwlist=0x7ffff022f6c0) at Python/getargs.c:1409

After asking on #clutter, ebassi immediately caught the problem, there was a missing “kwargs” bit in the python binding override definition, so that the kwargs were never actually passed to the C wrapper which was getting garbage instead.

The other problem was that the returned data was empty. This was simply due to the fact that the buffer returned by the C function was interpreted as a NULL-terminated string, which is wrong when you get such binary data. The fix was simply to specify the length to read to fill the string.

Both issues are now fixed in pyclutter git, and should be available on the next stable release.

The remainder of the port was pretty straightforward. The only problem was that I had no experience with gstreamer, which wasted me quite a lot of time. Here are a few things I discovered :

  • The --gst-debug-level command line argument is really really useful, especially on levels 3 and 4, it outputs a lot of valuable information on what’s going on and what’s not working.
  • The whole caps story is really important. After spending an hour trying to figure why my pads wouldn’t negotiate their caps, I found out that they couldn’t because I had a wrong cap (the endianness one), and after a few more hours I figured that I had to set the caps on each buffer, and that I actually only had to set caps on buffers.
  • Timestamps are not magically inferred (at least not without extra gstreamer elements) and should be set by hand using the buffer.timestamp python property (this is not quite well documented in the Python bindings documentation imho).

Well, that’s pretty much it. I used a Clutter python demo from fedora-tour and here is the result : Clutter Stage Recorder demo. The whole source is available below :)

(more…)

March 31st, 2010 @ 20:01

Keyboard shortcuts are always a great matter of debate, and the whole problem is that most often they are chosen based on assumptions of the end user layout.

For instance, take this metacity commit : Change default cycle_group keybinding to Alt-grave. This change looks perfectly harmless, right ? Well not quite. It’s most likely based on the assumption that the end users has a qwerty keyboard layout (and it makes perfectly sense there). But let’s take an azerty layout. Grave is on the é/7 key, which is even farther from alt or tab than F6 is (well, not much I agree, but it might be even worse on other layouts). Is it really worth doing such a change then ?

Let’s also note that this also triggers a bad bug which gets alt+7 and alt+shift+7 to trigger the binding as well, while alt+grave is actually alt+altgr+7. This has been keeping me from nicely switching to my window n°7 in irssi for months (great thing that this window holds a really low traffic channel…).

All in all, I guess that the real problem is not that this change was made, but rather than we might need a system to have layout-dependant keybindings, or maybe hardware-location-based keybindings (i.e. that the key above the Tab key would trigger this keybinding independently of the layout).

Initially published on Mar 24, 2010 @ 8:22

Update : this change has been reverted for the GNOME 2.30 release. Even though I’m happy that the problem is “fixed”, it’s sad that the underlying problem (Alt+Shift+7 triggering Alt+`) is still there.

March 26th, 2010 @ 01:28

After migrating a bunch of stuff from one (about to expire) server which ran lenny to a new one running squeeze, a friend’s blog, which is powered by Dotclear, appeared heavily broken. His posts appeared empty, though they were still there and the titles were right, but nothing else (no url, no author, no date, no content). After a little bit of investigation, I figured that the problem was that squeeze is running PHP 5.3, and that my friend’s version of Dotclear obviously didn’t support it. Checked the Dotclear website, found out that since PHP 5.3 support is planned for the upcoming Dotclear 2.2, the latest version (Dotclear 2.1.6 — which my friend already had, actually) did not support it. Checked the PHP website to find the 5.3 release date : 30 June 2009. Wow.

Looked a little deeper in the Dotclear forums, and I found a patch which is actually a workaround for the problem. This workaround has been available since the 20th of July, 2009, and the Dotclear developers won’t include it even in the 2.1 branch because it’s a workaround and not a real fix :

Le patch n’est pas appliqué parce que, tu le dis toi-même, c’est une rustine. Ça peut paraître vieux jeu, mais nous préférons garder un code propre et régler les problèmes correctement.

Which translates to “The patch hasn’t been applied because it’s only a workaround. It might sound overaged, but we’d rather keep a clean code and fix the problems correctly”. Well, it sounds like a great plan, which would be fine if it did not took them ages to produce that clean code :) (arguably, since the patch is easily available, my point might is pretty much void, but still, it’s not official — the average user grabbing the latest official release and installing it on their hosting which provides php 5.3 might easily get confused).

March 25th, 2010 @ 23:45

There’s an outstanding bug right now which makes that cvCanny edge detector function in OpenCV currently segfaults on x86_64 systems. This post is an open attempt to track my debugging process :)

  • Bug encountered. I know it’s x86_64 specific since I ran the same code on an i686 machine a few hours ago (with a home compiled OpenCV, though).
  • Googled it : found reports on both OpenCV and RedHat bugtrackers.
  • Installed debug symbols, ran under gdb : all values I may need are optimized out.
  • Fetched OpenCV source, compiled it in debug mode.
  • CvCanny works great in debug mode.
  • Recompiled in release (optimized) mode to check if it is a distro-specific bug (both reports are from Fedora users).
  • Woha, release mode compilation is so slow :( But bug confirmed : it segfaults again. Time to instrument the code.
  • Filled cvCanny function with printfs and fflushs to track the function execution. Looks like it tries to access an element at index -514. Hugh. What’s even more frightening is that it successfully achieves that on another array.
  • After running the same instrumented code on my i686 machine, it appears that the indexes are right and that the same indexes are accessed without any problem in optimized mode in the i686 build.
  • Reading the code tells me that the accesses at negative indexes are legit since the array origin is shifted from the actual allocated memory blob start. Well, that’s good, since it explains why it works well in debug mode or on i686 setups, but that’s pretty bad because it’s going to be awful to narrow down.
  • Ok, doing the access by hand (i.e. doing _map[-514] instead of _map[j - mapstep]) works. This is getting crazy. Doing k = j – mapstep and accessing _map[k] segfaults as well. Huh.
  • After an hour of heavy fprintfs, I figured that long k = j – mapstep; gave me a k which wasn’t the int value (-514) but rather the unsigned int value (4294966782), while doing int k = -514; long k2 = k; printf (“%d %ld\n”, k, k2); in a very simple code gives out -514 -514, even with -O3 or -O5 and all the options used for OpenCV release build. Since we are working with 64 bits pointers (i.e. of the size of long integers), this is probably the issue : when accessing _map[k], it unreferences the value at _map + k, which fails since it unreferences _map + 4294966782 instead of _map – 514.
  • Doing volatile int k = j – mapstep; and accessing _map[k] works, and cvCanny runs great now. Though this isn’t a real fix, just a workaround. There’s most likely a compiler bug underneath.
  • Posted a summary of my findings and the workaround on the bug report on the OpenCV tracker.

Patch against latest svn (it should apply nicely to the 2.0.0 release as well) :

Index: cvcanny.cpp
===================================================================
--- cvcanny.cpp	(révision 2908)
+++ cvcanny.cpp	(copie de travail)
@@ -239,7 +239,8 @@
                 {
                     if( m > _mag[j-1] && m >= _mag[j+1] )
                     {
-                        if( m > high && !prev_flag && _map[j-mapstep] != 2 )
+                        volatile int k = j - mapstep;
+                        if( m > high && !prev_flag && _map[k] != 2 )
                         {
                             CANNY_PUSH( _map + j );
                             prev_flag = 1;
@@ -253,7 +254,8 @@
                 {
                     if( m > _mag[j+magstep2] && m >= _mag[j+magstep1] )
                     {
-                        if( m > high && !prev_flag && _map[j-mapstep] != 2 )
+                        volatile int k = j - mapstep;
+                        if( m > high && !prev_flag && _map[k] != 2 )
                         {
                             CANNY_PUSH( _map + j );
                             prev_flag = 1;
@@ -268,7 +270,8 @@
                     s = s < 0 ? -1 : 1;
                     if( m > _mag[j+magstep2-s] && m > _mag[j+magstep1+s] )
                     {
-                        if( m > high && !prev_flag && _map[j-mapstep] != 2 )
+                        volatile int k = j - mapstep;
+                        if( m > high && !prev_flag && _map[k] != 2 )
                         {
                             CANNY_PUSH( _map + j );
                             prev_flag = 1;
March 25th, 2010 @ 01:46

Today I’ve been looking at opencv-python for a quick project (I’d like to practice OpenCV a little bit). Installed the opencv-python package on Fedora 13, headed to the samples directory (/usr/share/opencv/samples/python/), started running one of them and… boum, segfault. Tried another one (the inpainting one), and it worked. A third one… segfault. Most of the samples in there segfaulted, mostly with SWIG errors about wrong parameters, always mentioning int64 (I’m using a x86_64 kernel & distribution).

After half an hour of failure on trying to get opencv.i686 work alongside my x86_64 python, I went back to the OpenCV website to check if there was some known heavy problems with x86_64 systems and… I discovered that :

Starting with release 2.0, OpenCV has a new Python interface. (The previous Python interface is described in SwigPythonInterface.)

Wait wait wait, you mean that during all this time I was running the OLD, pre-2.0 Python interface ? Why the hell does the opencv-python 2.0 package provides both the new and the old interfaces ? (well, I know the answer : backwards compatibility). Meh :( Anyway, I wish the old samples would get ported to the new interface… At the moment there’s no sample using it at all :/

March 22nd, 2010 @ 01:56

For a project midterm presentation, we were asked to produce a bunch of slides explaining our project architecture and implementation choices. Apart of the obvious things (libraries in use, network protocol…), I had no real clue on what I could put in, so I thought I’d just throw some UML-like diagrams and that it would be fine. The only detail was : how to produce these diagrams ?

Since the project code was written in Python, all the inheritance relations were already held by the code and could be introspected, so that it was theoretically possible to automatically produce the inheritance diagram. And it actually is, and is implemented by things like the Epydoc (a documentation generator for Python code) parser, as well as the diagram generation, which Epydoc also implements. The only thing is that I wasn’t satisfied by the Epydoc diagrams since they were limited to the inheritance relationships, while I was also willing to include usage relationships and display only the main methods and variables of my objects.

I thus wrote Umlpy, a UML-like class diagram generator for Python code, which depends on Epydoc (for the parser) and python-graphviz (for the graph generation, it produces nicely spaced graphs and can output jpg, png or pdf files, and probably more). It handles the aforementioned requirements through docstrings parsing and introspection. Check the Umlpy README file for more documentation on how to use it. It took me about 10 minutes to get to the result I was expecting (it’s basically about adding a little docstring for usage relationship, and copy-pasting a docstring on methods or variables you want to see on the diagram.

This wouldn’t be complete without the mandatory screenshot, and this example code results into this diagram :
Umlpy result example

March 22nd, 2010 @ 00:27

Last autumn I had to do some particularly cpu-time consuming but completely independant computations on thousands of pictures (namely computing SIFT keypoints on high resolution images). Using my laptop, it would have taken hundreds of hours to complete (it took about 20 minutes per image), but it could be easily done in a distributed manner on dozens of machines.

I thus started writing a little cloud-computing alike framework to handle the distributed computations in a safe manner : the machines I was intending to run my computations on were public-access machines and could get rebooted at any time, so the computation (let’s call it task from now on) that was started on it would never end.

There’s no trick here, the user defines his tasks, an entry point which sets off the initial tasks, and that’s it. The tasks are stored in a workqueue, which is handled by a master node, which assigns tasks to the other nodes (actually, the master server is split in two, the “heartbeat” server which monitors slaves status, and a taskqueue server from which the slaves pull their tasks). If a node does not complete a task within a user-defined amount of time, the master assigns it to another node. Yet, if the initial node manages to complete the task before the second one, the task gets completed anyway (this prevents infinite task respawning in case of an unexpectedly long task).

The tasks may use data stored in either a file-like storage or in a database. Since the whole thing is in Python without any runtime restriction, any arbitrary code or binary can be ran through this thing, which feels like a great pro versus frameworks where only a few languages are supported. A nice feature is that each task can be associated to a file pack (holding e.g. binaries which will be ran by the task) which is automatically deployed or cleaned when the task is started on the node.

Let’s also mention that the slaves are automatically deployed based on a simple configuration file (through ssh/scp) ; that applications are easy to deploy as well (the initial files, the entry points and al are defined in a single config file, form which the deploy tool automagically creates the file packs, updates the pack data to the heartbeat server and plugs the entry point into the taskqueue) and that there are only a few deps for the slaves (python and python-yaml), and not much awful ones for the various servers (mostly add sqlalchemy to the slaves deps).

There are still some points I’d like to improve. I’d first like to make the master node dynamic, and not static like it is now. Nodes which may elect as master (i.e. those who may directly connect to the various storages) would use a distributed protocol to actually elect the master (I’ve worked on such stuff last year during a complexity course, see my (French) summary about the subject : Calculabilité Distribuée), and then broadcast to the other nodes the winner of the election. If the current master is unreachable at any time, the electable slaves would start doing the election process right away, but would still try to attempt to reach the master until the election is completed.
I’d also like to add a way to handle in a nicer way mixes of 32 and 64 bits architectures (i.e. provide a way to define 32-bit/64-bit specific initial packages).
Security is another matter, I guess I should probably add some chrooting to the runtime environment, and secure the transmissions between the nodes (note that there’s absolutely no notion of authentication in any part of the software at the moment).

The only problem is that whan I got to the point I had finished the framework, I had no time to write the little bit of script to actually test the whole thing, and a friend had thrown the pictures to his high-end quadcore computer, so I had no chance to test it on a real life scenario. I’ll try someday, though. If you want to give a try to the latest version I pushed (it lacks the database backend and documentation at the moment, though I intend to finish/test them and upload them someday soon), check the Cloudpy git repo.

March 20th, 2010 @ 01:40

Last summer, I went to INRIA Sophia Antipolis (INRIA is the French research institute for computer science, basically ; Sophia Antipolis names a group of villages near Nice and Antibes (on the Côte d’Azur) on which are installed quite a lot of technology businesses, among which HP or Amadeus) for a 3 months internship in PLANETE project, under the advisory of Mathieu Lacage, a great software engineer.

Before any extra details, let’s point to the code, the report and the slides. There are also some slides somewhere in the wild which were made for a quick talk given by Mathieu at ns-3 workshop at SIMUtools 2010.

The goal of the internship was to enable the seamless use of multicore parallelization in ns-3 network simulator to speed up simulations.
The obvious idea behind this project is that with all the trends on going more and more towards highly multicore architectures even for personal computers, such a change would possibly allow large speedups without having to directly throw a lot of hardware in the balance and without requiring most likely complicated configurations from the user.

Our goal was to make theses changes entirely transparent for both developers (i.e. those working on improving the ns-3 framework — routing stuff, devices handling…) and users (i.e. those using ns-3 to simulate their protocols…).

Before going into the optimization details, let’s mention the basic idea of the algorithms that we used for the synchronisation of the execution units (the cores, in our case), algorithms which are part of the conservative synchronization algorithms class (as opposed to optimistic algorithms class).
We first tried a message-passing based approach, namely the Chandy, Misra, Bryant algorithm. The whole idea in this algorithm is that each node knows how far in the future it can avance without any risk of simulation desynchronization. This is based on knowing the physical transfer delays, plus the current date of the neighboring nodes ; date which is passed through recurrent messages. This way, under certain conditions (namely that no link has a null transfer time), the simulation will safely advance.
We found out that this approach would lead to no performance gain due to the amount of messages which have to be passed compared to the actual amount of stuff done per time unit.
We then used a barrier-based synchronization algorithm : at each iteration, a global maximum date is determined, based on the current date of each node and the physical transfer delays (again). All the node are thus allowed to process events up to this date, then wait at the barrier, and then go to the next iteration when everyone has reached the barrier. The actual barrier implementation is a fun part, which is described in the report. From a trivial semaphor implementation to a fully optimized, tree-shaped implementation, performance gains of over a 100 fold can be observed.

One of the usual challenges when doing parallel computing is to define a proper partition of what each execution unit does. For instance, when working on a network of nodes (which are computers, routers…), which nodes should be handled by which execution unit ? Since we don’t want to ask the user about what he thinks would be the best, and since we can’t just use topology clues to make a good partitionning (because it also depends on the actual dynamics of the network, which we can’t predict), we had to use a dumb solution and make it smart. The solution we found there was to have both dedicated and global nodes : a given percentage of nodes is dedicated to each execution unit, while a global pool of nodes remains undedicated. When an unit has finished processing it’s own dedicated nodes, it starts picking nodes from the global pool and process them. This way, these units won’t waste their time waiting for other units which have to handle more complicated nodes in their dedicated set, thus requiring more processing time just for the local set.

Now you are going to ask, why not just use a global pool and let all the units directly pick nodes to process from there ? The whole problem is that this global list must be protected, two threads just can’t pick stuff from it at the same time — this would lead to awful memory corruptions, and probably more. Consequently, we need to use either mutexes (i.e. locks) or atomic operations to modify the list. Atomic operations are cheap compared to locks, and yet they are still at least 10 times slower than normal operations. This justifies the whole dedicated + global pools approach.

Now, there were more thread safety problems. While most of them were easily disarmed for minor performances loss (such as disabling free list or caches), there was still a major issue : reference counting. Indeed, in order to be as user-friendly as possible, ns-3 provides (and heavily uses) a reference counting system to ease memory handling for users. The whole problem there is that the counters might be accessed concurrently by two threads at once. What happens when a thread references an objet at microseconds before another thread unreferences it ?
An usual i++ operation is usualy done by taking i from memory to a register, adding 1 to it and putting it back to memory. Now, if the original reference count was 1, we could have the first thread taking i = 1 from memory, the second one taking i = 1 from memory, remov 1, putting i = 0 back into memory, then figuring i == 0 and trashing the objet, then the first threading adding 1 locally, and putting i = 2 back into memo… wait, this memory has been unallocated. Game over.
Again, a possible solution is to use atomic operations. Oh right, they are awfully expensive. Don’t care, let’s try, there must not be that much ref/unrefs going around there. And then… ow, actually, we just catched a plain 100% runtime increase.
While we fixed some of the stupidities that lead to such horrible increases, the problem was still on.
We tried to use thread local storage to maintain a per-thread reference count, but this did not really solve the problem, just helped.
We finally tried to use thread local buffers of Ref/Unref operations, which would be batch-ran by a single thread. This lead to nice performance improvements over the previous solutions, but if I remember correctly under certain unknown conditions the code would badly crash or eat all the memory. Urgh.

Still, in some conditions, I still managed to get over 20% performance improvement with 8 threads (iirc).

The conclusion of all this is that multithreaded programming is definitely not a trivial problem and that converting an existing software to gain from multithreading is not that easy. For more information, numeric data and all, check the report.

Now, I have had no time since middle September to work on the issue, but I know Mathieu has. It’d be awesome to have some news from him here… To be continued, hopefully.