Why data doesn’t mean anything

October 7th, 2009

Today I finished a report that my boss wanted as soon as possible at the beginning of the month, or at least no later than end of business on Friday. Last Friday, I mean. He even told me to get it done ahead of several other things that he wanted immediately at the beginning of the month (or at least by the end of Friday).

It took so long–beside of course the fact that it wasn’t the only thing I was working on–because it require combining multiple incongruent data sets. The primary data concerned how often we changed the promise on orders. To get the different perspectives on this that my boss wanted, it was necessary to compare the primary data with both shipment records and open order records. And shipment records and open order records are different because an order can ship several different times (as partial shipments) but there is only one record for the open (balance of the) order. All that we wanted was permanent information about the order like what item was ordered on a particular line and where the order was shipping to, but we don’t store that information by itself (where I can access it) so I have to skim it off of open and shipped orders.

And we actually store the current month’s shipments in a different place than rest of the current year. So I had to combine month to date shipments with year to month shipments and combine those (which had multiple records per line) with open orders (with only one record per line).

And that is when I found out that in our system you can book a particular line, ship it completely, then book the exact same line number for a completly different item. Not to mention that you can change the shipping information at any point–even while lines are partially shipped. And you can move lines from one site to another.

So I had to use union queries to combine the shipments records (just the quasi-static fields such as order, line, item, and shipping address), then another union query to combine that union with the open order records, adding a “preference date” (date of shipment or date of open order snapshot). Then I had to pour that into a new table,  sorted by order, line, and preference date, so that I could query and group by order and line taking only the last record (by prefernce date) for the other fields. The idea of “last by date” doesn’t seem to work if you try to do it from a query that’s sorted; from my experience the records have to appear in order in the table for the concept to work (using Microsoft Access).

Some of the data never associated at all because a promise on and order had been changed and then later the order had been cancelled (or some other exception) so that while we had a record of changing the promise on the order, I from my sources did not have any record of the order ever existing.

And then after that I had to do the actual analysis my boss wanted, including comparing my results to analysis done by my bosses’ peer, who is considered less of an expert on the data extraction, and explain any differences. And that’s when I found out Access doesn’t have built-in functions for Mode and Median–only Mean.