Best Malpractices
posted at 11:54 on 2009.04.08
"I learned very early the difference between knowing the name of something and knowing something." -- Richard Feynman
With that in mind, here's a well-reasoned rant against Best Practices. The IT world is replete with buffoons - programmers, managers, CS students, whatever - who toss around terms like AJAX, ORM, XP/Agile, Web 2.0, RDBMS, DRY, NIH, and OOP without ever pausing to ask the real questions. What do they mean? What do they do? Where do they succeed - and where do they fall short?

Consider this: Google does not follow Best Practices. They solve problems. Period.
Scripted Reality
posted at 14:06 on 2009.04.04
As promised, here are the scripts.
The Sound of Scraping
posted at 22:30 on 2009.04.03
After sitting on my CBC Radio 3 metadata for just over a week, I finally got around to throwing together a decent downloading script. Actually, the scraper/downloader is a loose federation of scripts, deliberately kept in separate modules so as to allow nice things like, say, running multiple copies thereof concurrently. I'll post a link to the source in the near future, along with a few words of explanation. Maybe I'll even write a README - after all, although CBC Radio 3 is afloat for now, there's no telling how long it will survive the budget axes of doom.

(And, to prevent the inevitable smartasses from chiming in with "you forgot wget, n00b" - nope, it's in there somewhere. That said, I think you'll find these scripts go a tiny bit further...)
If It Floats Like An Octopus
posted at 08:06 on 2009.04.02
"C++: an octopus made by nailing extra legs onto a dog."

C++ is often referred to as a statically-typed language - it aggressively checks types at compile time (and, in the case of anything STL-related, spews out monstrous-looking errors.) By comparison, many "scripting" languages (and I use that word cautiously, since these are proving to be powerful application development languages in their own right!) such as Python and PHP support what is known as duck typing. Many rail against this, arguing that it leads to unmaintainable code and nasty runtime errors.

That aside, C++ does have support for duck typing - in its template system. Consider:

#include <iostream>
using namespace std;
template <class T> void p(const T& t) { t.print(); }
struct A { void print() const { cout << "A!" << endl; } };
struct B { void print() const { cout << "B!" << endl; } };
struct C { };
int main() {
   A a; B b; C c;
   p(a);        // "A!"
   p(b);        // "B!"
   //p(c);      // compile-time error
}

The global function p() enforces no type restrictions on the template class T; all it requires is that T implement void T::print() const, as A and B do. Also, note that p(c) causes a compile-time error, not a runtime error! There is a tradeoff, naturally: the compiler creates a separate copy of p() for each type that it's invoked with (g++ -S for the gory details!), so extensive use of this technique can easily balloon your object files.
And the total is…
posted at 20:07 on 2009.03.26
76302. Of course, this is just the metadata - I wouldn't be so reckless as to hammer the CBC Radio 3 servers with 200 GB worth of download requests in a day! (Nor would my bandwidth permit me to suck down that much data in anything less than a couple of weeks. Oh well.) Next up: filter it down to a list of songs that I might actually want.
2017 Songs and Counting
posted at 07:45 on 2009.03.26
Your guess as to what this does:

#!/bin/bash
for i in `seq 0 25`; do
echo "http://radio3.cbc.ca/nmc/artists.aspx?offset=${i}"
done | tee -a artists.log |\
./url-dumper 1.0 |\
egrep -o "/bands/[^\"]*" | uniq |
while read line; do
b=`basename "$line"`
echo "/play/band/${b}"
done | uniq | tee -a bands.log | ./cbc3-get-music-info 1.0

(Yes, I've left out some details - like what exactly those scripts do under the covers. I'll post about that when it's finished!)

Better make that 3156 songs. And counting.
Miscellaneous Perverse Tree Tricks
posted at 13:01 on 2009.03.20
Unless you're Google, SQL-based relational databases are the de facto standard for storing web application data. Trouble is, MySQL et al. are great at storing relations - lists and sets, essentially - and are absolutely horrible for everything else, right?
Well, not quite. Tree structures are easily and efficiently stored using modified preorder traversal, which relies on nested intervals. This alone is enough to tackle many problems, like ACL-style permissions. I've seen the adjacency list method in use - if you like scalability, don't do it.

(For the truly masochistic: you can even pull off a decent directed acyclic graph implementation by computing the transitive closure.)
Yay for Efficiency!
posted at 07:14 on 2009.03.20
(Caveat: This post is somewhat C++-centric, but its main points translate to any language.)

I've been continuously optimizing my first project so that I can run ever-larger experiments. The project boils down to two main problems:
  • searching a list of high-dimensional sparse vectors for nearest-neighbour pairs; and
  • keeping track of the sets generated by merging those pairs.
In the process, I've cut the runtime for runs against the GALE 2008 Chinese-English corpus from something like millions of years to a couple of hours.

(If the preceding part made absolutely no sense, you might want to stop reading now.)

Lesson The, er, Zeroth. Take advantage of sparse data.

This should be obvious - if I can ignore the holes in my data rather than trudging through each one, it can only help efficiency.

Lesson The First. boost::tr1::unordered_map is not always preferable to map.

The latter beats the former hands-down in iteration, and the ordering property of a map permits more efficient linear sweeps for a variety of iteration-based algorithms - sparse-vector arithmetic, for example. In some cases, a sorted array might be better still.

Lesson The Second. Use the structure that fits the problem best.

The second problem above is a classic use case for the disjoint sets structure. Using anything else is a recipe for gross inefficiency - you can't beat the amortized inverse Ackermann cost for all operations.

Lesson The Third. You can make file output faster.

std::endl flushes the write buffer - unless you need the output immediately, use "\n" instead. Dump large structures in compact binary formats where possible. If you go this route, test that your read/write operations undo each other; this sanity check prevents most nasty surprises. If you're really in a pinch, use <cstdio> instead of <iostream>.

Lesson The Fourth. Consider using heuristics.

Do you really need an exact answer? Most of the time - especially when dealing with massive datasets - the answer is no. Look for small assumptions or shortcuts that work for your dataset. For instance, when performing the nearest-pair search, I only consider pairs that have non-zeros in the same coordinate at least once. This works well with the sparse representation, allowing me to search over a sub-quadratic number of pairs.
So You Think You Can Sort
posted at 18:54 on 2009.02.01
Selection sort. Insertion sort. Quicksort. Mergesort. All are standard fare in your average basic-level CS class; the first two are inefficient but conceptually simple, whereas the last two are decent general sorting algorithms. This is all good and well, but no one uses them for performance-critical applications:
  • Often, a sort is not required; what you really want in most large-scale cases are the top k results for some relatively small k.
  • Real-world data has patterns that general algorithms like your standard library quicksort fail to exploit.
  • Comparisons and swaps aren't always cheap.
For you C++ programmers, the first point means little more than the difference between sort and partial_sort. The second point, however, is slightly more subtle.

As a simple example, suppose you have a billion integers to sort, but you know that each one is between 1 and 100. With that restriction alone, you can easily sort them in O(n) time. Other examples:
  • The data are k-sorted (i.e. A[i] >= A[i-k].)
  • The data aren't quite sorted, but they are unique; furthermore, abs(A[i] - A[i-1]) < k.
The third point might seem like a dubious proposition, especially that part about swaps. How can swaps be expensive? If I want to swap complex structures, I use pointers. Simple, right?

A while ago, I wrote a DHTML table sorter (for fun, naturally!) For reference, here's a standard quicksort implementation. Ignoring the horrendous colour scheme for a moment, why is the first one so much faster? Simple - it avoids swapping DOM elements. (There's a reason it's called permutation quicksort!)

Fine. So we can guarantee O(n) swaps in exchange for linear memory overhead. How do you limit comparisons? Remember that second point - notice patterns.
At Last!
posted at 09:48 on 2009.02.01


Yeah, I know. Problem 117 is easy. IMHO, it's ranked "harder" than other problems simply because the requisite formulae for many of these questions are easily found using a combination of Wikipedia, MathWorld, and Sloane - which is not to say that the implementation part is always trivial!