Possibly I am just turning into a grumpy old man in my middle-age, but there are two words that when used together annoy me beyond almost all reason – yes, even more than the “p-word” that has featured in two of my previous posts: “unstructured” and “data.”
Despite what some vendors – and some commentators, who really should know better – would have you believe, there is nothing remotely formless or “unstructured” about “new” types of data, like image files, audio files, text-based documents, XML documents and so on. Of course for the most part these data hardly qualify as “new,” either, but don’t indulge my pedantry by getting me started down that road.
Data is merely information that has been encoded in some way and the only truly “unstructured data” is “noise”; random signals, representative of nothing much more than a system in equilibrium with its environment. A picture, a song, the complete works of Shakespeare – these are all forms of information and they are emphatically not “unstructured.”
To see the truth of this, take, for example, a GIF file (make sure that it is one that you don’t much care about, or a copy of one that you do) and open it with a text editor. Now mess with and/or delete some of the bytes at random, save the adulterated file and then try to open it with your normal picture editing or viewing software.
In fact a GIF file is highly structured and includes meta-data in the header that, for example, includes a colour table; the height and width, in pixels, of the image that follows; whether the image is animated or still; etc., etc. All this meta-data is then followed by an array of bytes that define the actual bitmap bits and an end-of-file marker. Monkey with this file structure and you risk reducing the value of the data that it contains to peanuts; monkey with the actual data payload and you likewise corrupt the file, either so that it can’t be read at all or so that it represents a different or a degraded image. Repeat this experiment with just about any multimedia file type and you will get the same result – either a corrupt file that cannot be read correctly or one that is no longer an accurate representation of the original object. These data are not only structured; the nature of that structure is critical to their correct interpretation.
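For the programmers among you, here is a minimal Python sketch of just how rigid the first few bytes of a GIF are. It reads only the signature and the "logical screen descriptor" that follows it, and makes no attempt to decode the image itself. The function name read_gif_header and the hand-built SAMPLE bytes are my own inventions for illustration; they are the opening bytes of a GIF, not a complete, viewable file.

```python
import struct

def read_gif_header(data: bytes) -> dict:
    """Parse the 6-byte signature and 7-byte logical screen descriptor."""
    if len(data) < 13:
        raise ValueError("too short to be a GIF")
    signature = data[:6]
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF: bad signature")
    # Little-endian: width, height (16 bits each), then three single bytes.
    width, height, packed, background_index, aspect = struct.unpack(
        "<HHBBB", data[6:13]
    )
    has_colour_table = bool(packed & 0x80)  # top bit: global colour table flag
    # Low three bits hold N, where the table has 2 ** (N + 1) entries.
    table_entries = 2 ** ((packed & 0x07) + 1) if has_colour_table else 0
    return {
        "version": signature[3:].decode("ascii"),
        "width": width,
        "height": height,
        "has_colour_table": has_colour_table,
        "colour_table_entries": table_entries,
    }

# Hypothetical sample: header for a 3x2 image with a two-entry colour table
# (white, black). Hand-built for illustration; no image data follows.
SAMPLE = (
    b"GIF89a"
    + struct.pack("<HHBBB", 3, 2, 0x80, 0, 0)
    + bytes([255, 255, 255, 0, 0, 0])
)

if __name__ == "__main__":
    print(read_gif_header(SAMPLE))

    # "Monkey with" one byte of the signature and the file is unreadable.
    damaged = bytearray(SAMPLE)
    damaged[0] = ord("X")
    try:
        read_gif_header(bytes(damaged))
    except ValueError as error:
        print("rejected:", error)
```

Change a single byte in the signature and the parser refuses the file outright; change one of the width or height bytes instead and it will cheerfully report an image of the wrong size. Either way, the meaning of every byte depends on where it sits.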
And of course it’s not just the “wrapper” that has structure; the structure of the data itself is critical. Most people would interpret the statement “Dave didn’t marry Sue because she was rich” as meaning that Dave and Sue were married, but that Dave’s motivation for their union was not financial. Conversely, the statement that “Dave didn’t marry Sue, because she was rich” would probably be interpreted as meaning that Dave and Sue did not marry and that it was the difference in their circumstances that got in the way. A single structural element – one comma – makes a big difference to our interpretation of the “same” data. Suppose that during their courtship Dave tells Sue “I love you”; the structure of this sentence is identical to the structure of the sentence “I want you” (subject-verb-object, I think, but if I am mistaken and there are any linguists out there reading this, please feel free to correct me), but the two statements may or may not be synonymous (although I hear that Dave is a good guy, so perhaps we should give him the benefit of the doubt).
In fact, even apparently random noise can convey meaning. Tune a radio telescope to the microwave range of the electromagnetic spectrum and you will hear a faint hum, directionally uniform to 1 part in 500. This is quite literally a distant reverberation of the “Big Bang” in which the Universe was created and which confirms that the Universe was indeed once hot-and-dense, as the Big Bang theory demands that it must have been. That’s important information, as historically there have been other theories of the origin of the Universe that don’t assume an explosive beginning.
From measurements of the cosmic microwave background radiation, as it is called, physicists and astronomers are able either to infer or to calculate directly many other essential truths about the Universe, including the speed at which our galaxy is moving (600 kilometres-per-second towards the constellation of Leo, in case this answer is one day all that stands between you and the “Who Wants to Be a Millionaire?” prize money). It turns out that there is an awful lot of important information encoded in that apparently random noise.
Back on Earth, less exotic, “new” types of data are increasingly interesting to the commercial and government organizations that most of us serve. We should probably call these “multimedia data,” “non-record-based data” or “non-relational data.” Actually, I’m not crazy about “non-relational” either; whilst these data are typically not relational in the accepted sense – the ordering of the bytes that define the bitmap in a GIF file is important, for example – they can, after all, be accommodated in tables in a relational database using BLOB and CLOB objects. So long as we regard these objects themselves as atomic, it seems to me these data are as relational as any other attribute of an entity. Things clearly get more complex if we want to examine or “query” the objects themselves (“select all of the pictures in which the sky is red”), but let’s not go there for now.
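To make the BLOB point concrete, here is a small sketch using the sqlite3 module from Python's standard library, chosen only because it needs no server; it is not a statement about how any particular commercial database handles large objects. The table name picture, its columns and the placeholder payload are all hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE picture ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " content BLOB NOT NULL)"
)

# Placeholder payload: a GIF signature padded with zero bytes, standing in
# for the contents of a real image file.
payload = b"GIF89a" + bytes(7)

# The image is stored as one opaque, atomic value in a single column.
conn.execute(
    "INSERT INTO picture (title, content) VALUES (?, ?)",
    ("sunset", payload),
)

# Ordinary relational access works on the row; the BLOB comes back intact.
row = conn.execute(
    "SELECT title, length(content), content FROM picture WHERE title = ?",
    ("sunset",),
).fetchone()

title, size_in_bytes, content = row
print(title, size_in_bytes, content == payload)
```

Note what the query can and cannot do: it can find the row by title and hand back the bytes unchanged, but it knows nothing about what is inside them. Asking for the pictures in which the sky is red would need something that understands the structure within the object.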
My recent travelling companion and the main attraction on the “CTO Road Show” that we took on tour across the EMEA region in June – Teradata CTO Stephen Brobst – refers to “non-traditional data types” versus “record-based” or “square” data. These are definitions that I can live with. And I’m sure that engineering PhD Stephen will sleep easier for knowing that the flunky from marketing considers his use of technical vocabulary to be correct and not in the least aggravating!