The Other Knapsack Problem

One of the famous NP-complete problems in combinatorics and decision theory is the “knapsack problem”: given a set of things of different values and sizes, choose a subset that a) fits within a total size constraint, and b) maximizes the total value. This problem is presented somewhere in every CS department’s algorithms class and has led to some important and interesting research.
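For the curious, the classical problem can be sketched in a few lines of Python (a minimal dynamic-programming version; the function and variable names are mine):

```python
# A minimal 0/1 knapsack sketch: items are (value, size) pairs and
# "capacity" is the total size constraint.
def knapsack(items, capacity):
    # best[c] = best total value achievable within total size c
    best = [0] * (capacity + 1)
    for value, size in items:
        # iterate capacities downward so each item is used at most once
        for c in range(capacity, size - 1, -1):
            best[c] = max(best[c], best[c - size] + value)
    return best[capacity]

items = [(60, 10), (100, 20), (120, 30)]  # (value, size)
print(knapsack(items, 50))  # -> 220 (take the size-20 and size-30 items)
```

The pseudo-polynomial running time (items × capacity) is the practical way around the NP-completeness for modestly sized inputs.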

I was wondering what this had to do with packing real knapsacks (well, backpacks), and realized that there is another knapsack problem – an engineering problem.

How do we choose and design the elements of the set so that a minimum functional specification is met, while being as far as possible below the size constraints?

This probably should be called the ‘ultra-light’ knapsack problem, because its real-world analog is selecting the lightest, most functional equipment needed to survive and have a good time in the wilds. It’s also a problem in software engineering, where the issue of weight becomes code size and maintainability as well as the performance of the program’s components. These are all sort of related.
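Stated this way, the problem is a weighted set cover: each piece of gear provides some set of functions and has a weight, and we want the lightest subset that covers every required function. A brute-force sketch (gear names and functions are invented for illustration) is fine for a real gear list, even though the general problem is NP-hard:

```python
from itertools import combinations

# Find the lightest subset of gear covering every required function.
# gear: list of (name, weight, functions); required: set of functions.
def lightest_kit(gear, required):
    best = None
    for r in range(len(gear) + 1):
        for combo in combinations(gear, r):
            covered = set().union(*(g[2] for g in combo)) if combo else set()
            if required <= covered:
                weight = sum(g[1] for g in combo)
                if best is None or weight < best[0]:
                    best = (weight, [g[0] for g in combo])
    return best

gear = [
    ("tarptent", 2.0, {"shelter"}),
    ("supertent", 6.0, {"shelter", "winter"}),
    ("stove", 1.0, {"cooking"}),
]
print(lightest_kit(gear, {"shelter", "cooking"}))  # -> (3.0, ['tarptent', 'stove'])
```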

So how is this problem approached in practice?

Direct Minimization: The most obvious thing to do is to make everything lightweight. As long as expected conditions permit, choose the 2 lb tarptent instead of the 6 lb winter supertent. This can be tricky – for example, a ‘bivy sack’ and light tarp might weigh as much as a good single-layer tent – so it’s important to be sure of what you are using and why. What you choose must still do what you need it to do. If I design a software tool, I should choose the ‘best’ (by my performance criteria) algorithm, but I need to be sure that the algorithm and its implementation actually do what they are supposed to do.

Journaling: One very effective technique in backpacking is to keep track of what equipment is used and what isn’t; on the next trip, you don’t bring along the things you didn’t use. There is an exception to this – you do bring first aid supplies, some repair tools, and rain gear, because you really might need them. On my trips, my cooking gear has (d)evolved to a folding plastic plate, a thermix cosy, a small light pot, and plastic bags (see http://www.freezerbagcooking.com for my inspiration). I use Wendy’s chili spoons because they are light and cheap. In software engineering, keeping a journal of command usage can help focus efforts on the parts of the program that actually get used, and journaling locations in the program gives a performance profile that can help identify bottlenecks. Not skimping on emergency equipment is the same as remembering to put in software diagnostics and exception handlers: try… catch blocks can be a real pain, but they are important.
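The software journaling idea takes only a few lines: wrap each user-visible command in a counter, and at the end of a release cycle look at what was never invoked. A sketch (the decorator and the example commands are mine):

```python
from collections import Counter
from functools import wraps

# Journal of command usage: which functions actually got used?
usage_journal = Counter()

def journaled(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        usage_journal[fn.__name__] += 1  # record the "equipment" being used
        return fn(*args, **kwargs)
    return wrapper

@journaled
def plot_track(points):
    return len(points)

@journaled
def export_kml(points):
    return "kml"

plot_track([1, 2, 3])
plot_track([4])
# export_kml was never called -> a candidate for the next release's cut list
print(usage_journal)
```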

Multifunction: Why carry two single-function tools if one tool can do both things? There are good and bad aspects to this. Sometimes merging multiple functions is a great idea – a bandanna is one of the really useful things to take along, a lightweight multitool or Swiss Army knife is really good, and you can use a big tent stake to dig your catholes. The caution is that sometimes this doesn’t work. Sporks are my favorite trivial example – but seriously, there is a sleeping bag that doubles as a parka. This sounds great until you realize that the time you need a dry sleeping bag is when you are nearly hypothermic and your parka is soaked. Designing an overly complex class that handles two distinctly different logical ideas is lame. Similarly, in the world of user interfaces, arbitrary placement of unrelated things on one ‘uber-checkbox’ is not a good way to engineer usability.

Validate: Always make sure the new lightweight equipment actually works. I do this in the real world in two or three steps: first at home (for example, trying a stove on my driveway), then on a “base camp” or car camping trip, and finally on a short real trip. In software engineering this is the same as designing an acceptance suite and ensuring that new components actually pass it.
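The staged-validation idea maps directly onto code: the same acceptance suite runs against the new component at every stage before it is trusted. A toy sketch (the “lightweight” component and the staged tests are invented for illustration):

```python
# Staged validation of a new "lightweight" component: the same acceptance
# suite runs at every stage before the component replaces the old one.
def lightweight_sort(seq):
    # the shiny new replacement component under evaluation
    return sorted(seq)

def acceptance_suite(sort_fn):
    # "driveway test": trivial inputs
    assert sort_fn([]) == []
    # "base camp test": typical input
    assert sort_fn([3, 1, 2]) == [1, 2, 3]
    # "short real trip": messier, realistic input
    assert sort_fn([5, -1, 5, 0]) == [-1, 0, 5, 5]
    return True

print(acceptance_suite(lightweight_sort))  # -> True
```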

Distribute the Load: If you’re going in a group, you don’t all need to carry everything. Not everyone needs a stove; it’s all right to share a tent. Similarly, processes and threads can improve performance if you can implement them on appropriate hardware.
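The load-sharing analogy can be sketched with a worker pool: instead of one hiker carrying every item, a group shares them. (Threads shown here for simplicity; for CPU-bound work in CPython, processes on appropriate hardware are the better analog.)

```python
from concurrent.futures import ThreadPoolExecutor

# Distribute the load: a pool of workers shares the tasks.
def carry(item):
    return f"{item}: done"

items = ["tent", "stove", "food", "rope"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(carry, items))
print(results)
```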

Written by Rob in: engineering,laboratory practice |

Databases are only as good as the data (D’oh)

More on how you can learn CS from everyday problems:

One of the things I enjoy doing is wandering in the woods with a heavy weight on my back – and no, this is not about the knapsack problem (I’ll save that for a discussion on ultralight backpacking). The map below shows a trip I took with my son’s Boy Scout troop. For fun I used a GPS to track our path and plotted it on top of the topographic maps from the National Geographic program (not a bad program).

[Image: topo map of troop 550 trip]

You probably can’t see what is wrong with this – so I snipped a bit out with the Gimp.

[Image: detail of the topo map of troop 550 trip]

We walked in and out by the same path, but one time we appear to fly. (The GPS clocked us at 11 mph – which is pretty good for a trail runner ;->). The lack of reproducibility is a systematic error due to reflections from the sides of the valley – in this case, off a cliff face we walked beneath. But that’s not the serious error; that’s just goofy data which could be corrected by careful remeasurement. Notice the purple line – that’s supposed to be a road. It may have been there once, but it isn’t now. There are a few traces, but nothing I’d even want to take a 4WD off-road bike on, let alone my car.
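Goofy fixes like the 11 mph leg are easy to flag automatically: compute point-to-point speeds along the track and reject anything implausible for a hiker. A simplified sketch (coordinates are already in miles here; a real track uses lat/lon and needs a proper great-circle distance):

```python
import math

# Flag GPS track legs whose implied speed is implausible for a hiker.
def flag_bad_fixes(track, max_mph=5.0):
    # track: list of (x_miles, y_miles, hours) tuples
    bad = []
    for (x1, y1, t1), (x2, y2, t2) in zip(track, track[1:]):
        miles = math.hypot(x2 - x1, y2 - y1)
        hours = t2 - t1
        if hours > 0 and miles / hours > max_mph:
            bad.append(((x1, y1), (x2, y2)))
    return bad

track = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.1), (1.3, 0.0, 0.2)]  # last leg: 11 mph
print(flag_bad_fixes(track))  # flags only the 11 mph leg
```

This catches the reflection error; it does nothing for the out-of-date road database, which is the more serious problem below.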

The road database was last updated in the 1930s, when the fish hatchery was built with CCC money. It isn’t accurate because the conditions have changed. This is an example of a temporal or contextual error.

North Georgia and North Alabama maps are rife with these errors. In one section around Centre, Alabama there are at least three road databases, and two major ones are incorrect. The roads changed because of a little dam creating a big lake (Lake Weiss), and the farm tracks changed as the land shifted owners. Unfortunately the GPS databases haven’t been field checked, and will gladly direct you into impassable woods or onto the water. Google Earth synchronized their database with one of these, and the roads it draws now don’t follow the roads that are clearly visible in the images.

What’s going on here is pretty obvious: if the data are wrong, then the conclusions you deduce from them will be wrong (or worse, trivial and therefore meaningless). Part of the responsibility of maintaining a public database is the collation and removal of errors. The errors I’m showing in this post survived only because the areas are rural or wilderness, and there isn’t that much interest in surveying an area with only a few people (although a quick pass with image recognition software wouldn’t be a bad idea). I just hope the local rangers and sheriffs have correct maps.

A good example of collation can be seen with the protein structure database (PDB). The original system was simply a record-driven, FORTRANish file format where individual labs deposited what they thought was important. This worked well while there were only a few protein structures, but with nearly 50,000, data consistency becomes critical. While some of the PDB’s solutions (a proprietary data language, mmCIF, based on a precursor to XML, and an XML dialect that is awesomely awful in its detail) are less than ideal (IMHO), the work of collation has made it much easier to search the database automatically (as long as you use the ‘old’ record-based form).
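Part of why the ‘old’ record-based form is so easy to search automatically is that it is fixed-column: an ATOM record parses with simple string slices (columns per the wwPDB format specification; the parser below is a minimal sketch, not a full reader):

```python
# Parse the useful fields of a PDB ATOM record with fixed-column slices.
def parse_atom(line):
    return {
        "name": line[12:16].strip(),   # atom name, columns 13-16
        "res": line[17:20].strip(),    # residue name, columns 18-20
        "x": float(line[30:38]),       # orthogonal coordinates in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

record = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C"
atom = parse_atom(record)
print(atom["name"], atom["res"], atom["x"])  # -> CA ALA 11.104
```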



Just a short post.  One of the problems with running a lab that produces software artifacts as part of the process of computer science is the maintenance of those objects – especially after the students leave.   I know with my own program, AMMP, it’s a bit of a pain (posix batch, windoze graphical, unix graphical, various potential sets and accessory programs).  But at least I’m here.  What happens when a student graduates (which is a desirable event)?

Currently the code is orphaned.  This is not a good event.

So we’re exploring using subversion (SVN) to archive results.  In addition to helping maintain continuity, this will help train the students in modern software engineering methods.

Don’t know yet how it will work, but it has to be better than the alternative.

Written by Rob in: laboratory practice,pedagogy |

Anti-productivity tools (classes start again)

Not much time to write: between getting ready for class, reviewing papers, and the university’s annual report cycle, a lot of my mindspace is taken up. If there is a way to do something that is onerous and bassackward, GSU will find it. Now I know where the “pointy haired bosses” go to retire. The powers that be would love us to use a web interface (FIMS) to input our data, which is fine if you write a paper every other year and have no students.

The nasty-minded among us enjoy defeating such interfaces, but it is time consuming. Web interfaces should help you produce (WordPress is a good example), not interfere with your work. It takes effort to do this well, and hiring a cousin with a year’s Java programming is not the way to go.

There is an interesting commentary that backs up what my previous post was saying about languages:


I’m not sure I agree with everything they say (Ada is not my language of choice). I think there is a dichotomy of language requirements in computer science. If you are concerned with doing things (how to simulate, databases, machine learning, parallel programming, security), then you need to understand the interaction between the high-level constructions of your language and the hardware, because the implementation details will impact your ability to “get her done”. If you are concerned with more abstract things (like reducing the complexity of a graph algorithm from N^0.4N to N^0.35N), then this interaction is not critical. Nonetheless, as part of the general education of a student, it is critical to be aware of (and competent in) several programming paradigms.

Written by Rob in: Uncategorized |

Why Computer Scientists Need to Know How To Code

More thoughts on what it means to be a computer scientist….

One of the running discussions I have with my colleagues, and have seen on the net, is whether computer scientists need to be able to program. If I had a dollar for every time some student has said “Dr. Harrison, computer science isn’t programming”, I could fire the lot of them and start my own research institution. Equally short-sighted are the people who state “well, I can code – so I don’t need any of that theory cr*p”.

Many CS people are “computational mathematicians”. From their viewpoint, once you’ve stated a set of axioms and derived theorems from them then the problem is completely solved. The beauty of the proof, even if it cannot be reduced to actual practice, suffices for this school of thought. The alternative viewpoint tends to be an engineering viewpoint, where (with disservice to real engineers who care about theory) the fact that it seems to work is all that matters.

The answer lies somewhere between these extremes.

Computer scientists need to know how to program and they need to actually reduce their ideas to practice.

  • A formal system derived from a set of axioms is great, but what if the axioms do not completely reflect reality? There’s no way to prove an axiom. The only thing to do is to use real data and demonstrate, by experiment, that with some probability the axioms and derived structures are an accurate model of reality. This is why it’s computer science and not computer mathematics.
  • A computer scientist who cannot code is limited to running and describing the performance of existing applications. In machine learning – an area near and dear to me – the students who have difficulty coding cannot move beyond running packages like SVMlight or random forests. The quality of their work suffers because it cannot be highly original, simply because (almost) everyone else is doing the same thing.
  • Lousy coding means that you cannot exploit the power of your mathematics. One class of methods we are looking at is using fuzzy descriptions in machine learning. Very often part of the domain knowledge is a description of the accuracy of the observation. Can we use this to improve the reliability of the predictions? Yes. Can we use existing “fuzzy decision tree” programs to shortcut our effort? No – the programs only run in finite time for toy problems.
  • Slow and inefficient coding means a student takes forever to finish, and papers and ideas get “scooped”.
  • Interesting domains of research are shut off if you can’t program in different ways. For example, if you can program Matlab and a little Java (typically `public static void main(String[] args) { System.out.println("hello world"); }`), then you’ll have trouble dealing with anything in computer security. If all you know are streams in C++, then using the low-level POSIX interface to monitor hardware, and thus generate the data for your great machine learning algorithm, becomes impossible. Yes, I know that the Church–Turing thesis states that all computations can be done in any sufficiently powerful model – but some approaches are better for some tasks than others.
  • Computer languages are tools for writing representations of algorithms. Just as sometimes a well-designed figure or example problem simplifies the presentation of a difficult idea, the choice of a different language to represent an algorithm can result in insights into the structure and limitations of the underlying formal structure of the problem. It’s worth using a new language or better different class of language to examine the engineering of the problem. While I wouldn’t use Erlang to recode my molecular mechanics program AMMP, understanding the ability of the functional representation in Erlang to minimize side effects and thus improve parallel performance results in a better understanding of how to improve the parallel performance in AMMP.
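The fuzzy-observation point above can be sketched without any special package: if each observation comes with an accuracy estimate, weight it accordingly. The example below uses classical inverse-variance weighting of a mean as a stand-in for the idea (the names are mine, not from any fuzzy decision tree code):

```python
# Use per-observation accuracy (standard deviation) as domain knowledge:
# inverse-variance weighting lets precise observations dominate the estimate.
def weighted_estimate(observations):
    # observations: list of (value, sigma) pairs
    weights = [1.0 / (s * s) for _, s in observations]
    total = sum(weights)
    return sum(w * v for w, (v, _) in zip(weights, observations)) / total

obs = [(10.0, 0.1), (14.0, 1.0)]  # the second measurement is 10x noisier
print(weighted_estimate(obs))     # close to 10.0, the precise observation
```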

Some recommendations:

  1. Learn several languages. Don’t just focus on variants of C or C++, but look at other classes of languages like functional ones. Also look at both “high level” and “low level” languages.
  2. Do a little assembly coding. Really, but only a little. It’s worth seeing what the machine model you compile to looks like.
  3. Look at the assembly level output from your compiler. Do the constructs which improve the readability change the output? Usually not.
  4. Python is a good bridge language between the functional world (it has lists and functional aspects), the procedural world, and the object-oriented world. The “self” argument in objects is a clear reminder of the way an object is the merging of data structures with methods that manipulate them.
  5. Try Haskell or Erlang.
  6. Have fun with this.
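Recommendation 3 has a Python analog you can try without a C toolchain: readability aids like comments vanish before code generation. In CPython, the compiled bytecode of a function is identical with and without comments (`co_code` holds the raw bytecode; only the line-number table differs):

```python
# Python analog of "look at the assembly output": compare compiled bytecode.
plain = "def f(x):\n    return x + 1\n"
commented = "def f(x):\n    # add one, very readably\n    return x + 1\n"

def func_code(src):
    module = compile(src, "<src>", "exec")
    # the function's code object is stored among the module's constants
    return next(c for c in module.co_consts if hasattr(c, "co_code"))

print(func_code(plain).co_code == func_code(commented).co_code)  # -> True
```

The `dis` module will pretty-print those instructions if you want to read them, much as `gcc -S` shows the assembly for C.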
Written by Rob in: Uncategorized |
