first_page

Excel Returns Trumping OpenOffice.org Calc

I run two virtual machines, an Ubuntu box and a Windows 2003 Server. It was quite a relief to get the monstrosity that is Microsoft Office 2007 off of my real hardware and stuffing it into a virtual machine. While I was stuffing, I was trimming and I felt that I did not need Excel for day-to-day use. In fact for almost two years (maybe more), I was getting away with using Excel spreadsheets (in the *.``xls binary format)—until I ran into this ancient, tiny-but-huge problem. This problem is so ancient that it actually involves the Windows Registry. But first, let me rant about the role of the spreadsheet in the lives of “real” professional developers.

I think it’s interesting to search the famous CodingHorror.com site for the word “spreadsheet” just to see what role it plays for the young cats out there. These are the quips:

  • The Google spreadsheet is the “new black”—I just need to inform the thousands of people in my current W2 enterprise of this wonderful HIPAA-non-compliant fact.
  • An article like “Let’s Build a Grid” suggests that “real” developers need to make their version of the Google spreadsheet.
  • The spreadsheet is “ThemWare.” Buy this book at Amazon.com! Another article, “UsWare vs. ThemWare,” sets the psychological landscape for the spreadsheet in the lives of many “serious” developers. These are the same serious people that would accept an ASCII text file as a formal message to be validated and processed (either as XML or conventional text). To me, a spreadsheet is just another message format—and a file on disk is just another set of data in a primitive database. It is an error to regard a spreadsheet as a “toy database”—it is far healthier (and business-friendly) to regard a spreadsheet as a binary message format that needs to be validated and read just like an XML set. The great news about this “binary” format is that, from 2007 forward, it is just XML all over again… But the rest of this article assumes we are talking about the “legacy” binary *.xls format for the billions of spreadsheets in the world today.

Once the developer accepts this Excel situation as important and true, one very important issue must be recognized and addressed: Excel uses the registry to guess whether numbers in spreadsheets should be regarded as numbers or as text. And I am very serious about the word “guess”—the exact term Microsoft uses in the registry setting TypeGuessRows starring in “Data truncated to 255 characters with Excel ODBC driver.” This setting asks Excel to read ahead a certain number of rows, 0–16, in order to “guess” what type to set for the column (with possible nulls for values that fail whatever casting takes place). One clever developer may be tempted to set TypeGuessRows to zero which forces the system to read all the rows in the spreadsheet—but:

For performance reasons, setting the TypeGuessRows value to zero (0) is not recommended if your Excel table is very large. When this value is set to zero, Microsoft Excel will scan all records in your table to determine the type of data in each column.

My sad experience tells me to not even bother dealing with the Windows Registry. You can never tell when your changes will be recognized by the black-box of your concern—and you will tempted to reboot the system which is totally unacceptable to ‘real’ developers building applications for a ‘real’ enterprise.

When we use ODBC and the unadorned .NET Framework to connect with and read from an Excel spreadsheet, every column must be one data type. This can be a serious problem for a column that appears to contain numbers and text because null or empty strings will be returned for either the number-like text or the text-like—umm—text.

You can study the details of this somewhat archaic situation in “PRB: Excel Values Returned as NULL Using DAO OpenRecordset” because the plot thickens:

Insure that the data in Excel is entered as text. Just reformatting the Excel column to Text will not accomplish this. You must re-enter the existing values after reformatting the Excel column. In Excel, you can use F5 to re-enter existing values in the selected cell. [bold and strikethrough added by me]

That last sentence is struck off because the F5 key behaves differently by default on my copy of Excel 2007—and a ‘real’ developer’ should not have to “re-enter” data into a damn spreadsheet. When I have the clout to reject the spreadsheet and send it back to the business information worker, I will. But here, in my real world of a prestigious medical doctor’s cherished spreadsheet, I have to resort to this option suggested by a comment for Mr. Excel:

A tip of the MrExcel cap and a free Salt Lake City 2002 Olympic calendar to Mark R. who wrote in with the other solution to this problem that works very well: Highlight the entire column, select Data, Text to Columns, Finish.

What this tells me is that the Text to Columns command forces data “re-entry”—so when you see Number Stored as Text warnings/errors in Excel you can actually rejoice.

My version of OpenOffice.org Calc does not have a similar command (and I refuse to wonder why my version of Ubuntu has not upgraded OpenOffice.org for me). Now I have just seen (a few seconds before writing this paragraph) that in Calc version 2.4 and above that there is a Text to Columns command for OpenOffice.org but I have yet to test whether it can force data “re-entry” like Excel and ensure that a column marked as Text actually contains text. What is real right now in the article “OpenOffice Calc Is Evil”:

The problem with OpenOffice Calc is that in most situations (but not always) it will silently ignore numbers in “text” cells that are part of formulas or calculations and in the end produce wrong answers.

Rob van Gelder: Number Stored as Text When packages like Excel and Calc are regarded as “ThemWare” by developers, such smart folk will get bit in the ass and probably stuck for hours trying to figure out why data from some high-profile spreadsheet guru is not showing up in their huge, multi-tier, enterprise-wide system. Now the next thing I am going to write is a first for me: my Java-based way of reading Microsoft Excel spreadsheets is arguably better than the Microsoft “legacy” way.

I am very proud of my first (or second—because I wrote two small Java-based applications at the same time) Java console application, ExcelToXml that uses Apache POI to read Excel files. Apache POI treats binary Excel files truly as disconnected files (no external dependencies like some cross-platform equivalent of a registry or ODBC). Also, when POI thinks a value in a cell like 7779311 is a number, it renders it as 7779311``.0—for many situations you can use XSLT to run find-change operations to edit that trailing zero out… More on this later… I’m thinking of releasing this Java code as Open Source…

rasx()