Engineering Problem, or Tool Blinders?

Tony and I met with a former co-worker early this year to talk about his team’s use of MongoDB.  For some reason, they thought Tony was an authority on this.  I just came along for the lunch.

The Question

The gist of the discussion was this:  how best to store large binary objects in the database?  Was Mongo the best choice, or would PostgreSQL be better?  They were using both, but some devs were making a concerted effort to get over to MongoDB for new work.

After about 15 minutes of (kinda non-) discussion, I butted in: “I think you’re asking the wrong question.  Do those files even NEED to be in the database?”

Metadata about the files?  Absolutely.  But the files themselves, probably not.

The Actual Problem

Databases are good at a few things:

  • Finding things quickly (indexing, hashing, etc.)
  • Keeping things straight between many competing threads (ACID stuff)

You know what they’re usually bad at?  Sending or receiving large field values (I mean > 512KB or so), especially within a transaction.  Especially under heavy concurrent load.

I can hear the DBAs out there yelling at the page that of course it can be done.  I didn’t say it was impossible, it’s just soooo not what databases are good at.  While various database platform vendors have tried mightily over the years, storing large objects in your database is still possibly the worst physical database design decision you can make:

  • Worse than adding all of your query fields to the PK.
  • Even worse than getting involved in a land war in Asia.
  • Worse than (gasp) stored procedures?  Well, that might be a closer race, but let’s leave stored procedures for another day.

The GForge Approach (is nothing special)

I’ve always thought that GForge handled files in the clunkiest, lowest-tech way possible – on the file system.  It still does, and even the Big New Version will continue to do so.  But it wasn’t until this discussion that I actually became an advocate for this approach.  Why?  Because it’s only clunky until you compare it to all of the other approaches.

GForge stores all of the metadata (date, name, size, author, project, etc.) in the database, and we even add some keyword indexing using PostgreSQL’s excellent tsearch facility.  But when it’s time to upload or download one of these files, we use FTP or HTTP.  We get away from the DB when it’s time to do something that might run longer than a second, and we play to the design strengths of those other protocols.  And while nearly all of our customers serve web and file traffic from one server, GForge does allow you to configure different hosts to offload those tasks.
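To make that split concrete, here’s a minimal sketch of the pattern.  It is not GForge’s actual code: the file_metadata table, its columns, and the storage root are all made up for illustration.  The idea is just that the metadata row (including a tsvector for keyword search) goes into PostgreSQL, while the bytes themselves land on the file system where the web or FTP server can hand them out.

```python
import os
import shutil

STORAGE_ROOT = "/var/lib/files"  # hypothetical directory served directly by HTTP/FTP


def store_file(conn, project, author, upload_path, original_name):
    """Put the bytes on disk and record only metadata in PostgreSQL.

    `conn` is an open DB-API connection (e.g. psycopg2) created elsewhere.
    """
    dest = os.path.join(STORAGE_ROOT, project, original_name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(upload_path, dest)  # the large object never touches the database

    with conn.cursor() as cur:
        # file_metadata and its columns are illustrative, not GForge's real schema.
        cur.execute(
            """
            INSERT INTO file_metadata (name, size_bytes, author, project, storage_path, keywords)
            VALUES (%s, %s, %s, %s, %s, to_tsvector('english', %s))
            RETURNING id
            """,
            (original_name, os.path.getsize(dest), author, project, dest, original_name),
        )
        file_id = cur.fetchone()[0]
    conn.commit()
    return file_id


def search_files(conn, query):
    """Keyword search touches only the small metadata rows; downloads go over plain HTTP/FTP."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, storage_path FROM file_metadata "
            "WHERE keywords @@ plainto_tsquery('english', %s)",
            (query,),
        )
        return cur.fetchall()
```

The database transaction stays tiny (one small row), and the slow part, moving megabytes of file data, happens over protocols that were built for exactly that.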

This shouldn’t come as a surprise to anyone.  It didn’t really come as a surprise to our friends who asked the original question.  But I could tell they didn’t really want to hear that the best solution didn’t involve new, shiny stuff.  I understood that, and I still do.  The best coders are the ones who do it even when no one’s paying them, just to try things out and learn.  You don’t want to squash that initiative, but it can’t be your main criterion for choosing tools to solve business problems.

Have fun out there, just don’t let the shiny get in the way of the job.


Epilogue

I heard from my former co-worker a couple of weeks ago.  They went with S3 storage, indexed metadata in a (PostgreSQL) database, and links between the two.  They managed to find a solution that was both powerful and shiny.  Kudos to them!
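They didn’t walk me through the code, but the general shape of that solution is easy to sketch: push the bytes to S3, record the bucket and key next to the rest of the metadata in PostgreSQL, and hand clients a time-limited presigned URL so the download never passes through the database (or even your app server).  The bucket, table, and column names below are placeholders, not their actual setup.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-file-storage"  # placeholder bucket name


def store_in_s3(conn, local_path, key, name, author, project):
    """Upload the object to S3 and keep only its location and metadata in PostgreSQL.

    `conn` is an open DB-API connection (e.g. psycopg2) created elsewhere.
    """
    s3.upload_file(local_path, BUCKET, key)
    with conn.cursor() as cur:
        # file_metadata and its columns are illustrative only.
        cur.execute(
            "INSERT INTO file_metadata (name, author, project, s3_bucket, s3_key) "
            "VALUES (%s, %s, %s, %s, %s) RETURNING id",
            (name, author, project, BUCKET, key),
        )
        file_id = cur.fetchone()[0]
    conn.commit()
    return file_id


def download_link(bucket, key, expires=3600):
    """Return a time-limited presigned URL; S3 serves the bytes directly to the client."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```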