When discussing the importance of weaving together an organization’s diverse observation space (data), this question often comes up: “How does one relate structured data to unstructured data?”
My shortest answer is: “Shared features.” Let me explain.
Observations are only useful to the extent you can extract features from them. A pitch-black picture taken at night without a flash is nearly useless because it contains no extractable features. The cell phone call that suddenly goes bonkers and becomes all garbled is equally useless because you are unable to extract any useful features from the noise. On the other hand, a video from the parking garage – from which the license plate number, location and time features are easily extracted – is easily related to other data.
This point of view that “observations are only useful if features can be extracted from them” levels the playing field between different kinds of data sources, at least from a systems architecture point of view. Simplified architecture? Check.
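To make the “shared features” idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample records, the plate and timestamp formats, and the helper names extract_features and relate are hypothetical, and real extraction from video or audio is far harder than a pair of regular expressions.

```python
import re

# Hypothetical structured data, e.g. rows from a vehicle registration table.
registrations = [
    {"owner": "Pat Smith", "plate": "ABC1234"},
    {"owner": "Lee Jones", "plate": "XYZ9876"},
]

# Hypothetical unstructured observation, e.g. text derived from garage video.
garage_note = "Garage cam 3, 2011-05-04 21:17: vehicle with plate ABC1234 entered."

# Assumed formats, chosen only for this example.
PLATE = re.compile(r"\b[A-Z]{3}\d{4}\b")
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def extract_features(text):
    """Return whatever features can be pulled out of an observation."""
    features = {}
    plate = PLATE.search(text)
    if plate:
        features["plate"] = plate.group()
    timestamp = TIMESTAMP.search(text)
    if timestamp:
        features["time"] = timestamp.group()
    return features


def relate(features, records):
    """Return the records that share at least one feature value."""
    matches = []
    for record in records:
        shared = {k for k in features.keys() & record.keys()
                  if features[k] == record[k]}
        if shared:
            matches.append(record)
    return matches


features = extract_features(garage_note)
related = relate(features, registrations)
```

An observation with no extractable features, such as the pitch-black photo, yields an empty dictionary and therefore relates to nothing, no matter how it was stored.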
Unfortunately, getting high quality features extracted from unstructured observations – at scale without humans – is easier said than done.
Beneficiary vs. Victim
When I talk to the hard-working folks building and deploying feature/entity extraction algorithms, I often first say, “My context accumulating engines are beneficiaries of your work.” They typically respond with a grin as they hold their heads high. Then I say, “I am also the victim.” Buzz kill. No more grin, because they know exactly what I am talking about: the accuracy rates of feature/entity extraction leave much to be desired. A common counterclaim: it is better than doing nothing at all! So true. Today these feature extraction algorithms may be the only viable option, since there are not enough people to do this by hand anyway.
Eventually, I think breakthroughs in feature/entity extraction algorithms will mean no more need to manually map structured data (e.g., to common field names). Instead, we will just forklift in boatloads of structured data and it will auto-map – knowing the difference between first name vs. middle name fields and home address vs. work address fields merely by evidence found in the data.
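As a rough illustration of mapping fields “merely by evidence found in the data,” the sketch below guesses a common field name for each column from its values alone. It is a naive stand-in, not the kind of breakthrough described above: the detector patterns, the 0.8 threshold, the sample table, and the names guess_field and auto_map are all assumptions, and telling a first name from a middle name would need far richer evidence than regular expressions.

```python
import re

# Hypothetical detectors: a common field name and a pattern its values follow.
DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def guess_field(values, threshold=0.8):
    """Pick the field whose pattern matches the largest share of values."""
    best_name, best_score = None, 0.0
    for name, pattern in DETECTORS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v.strip()))
        score = hits / len(values) if values else 0.0
        if score > best_score:
            best_name, best_score = name, score
    # Too little evidence means no mapping at all.
    return best_name if best_score >= threshold else None


def auto_map(table):
    """Map each source column to a guessed common field name (or None)."""
    return {column: guess_field(values) for column, values in table.items()}


# Hypothetical incoming data with uninformative column names.
incoming = {
    "c1": ["pat@example.com", "lee@example.org", "sam@example.net"],
    "c2": ["555-867-5309", "(555) 123-4567", "555.222.1111"],
    "c3": ["2011-05-04", "2011-06-30", "2012-01-15"],
    "c4": ["blue", "green", "red"],
}

mapping = auto_map(incoming)
```

The threshold is the design choice worth noticing: a column is left unmapped rather than mapped on weak evidence, which trades coverage for fewer wrong guesses.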
As features find features, data will find data.
Whether the observation space is structured or unstructured is going to matter less and less.