Time for Semantic ETL?
Revenue-funded startup First Retail, whose principals Anne Jude Hunt and Simon G. Handley will be speaking at the upcoming Semantic Technology Conference in June, thinks the answer is semantic ETL.
Article by Jennifer Zaino on February 25, 2011 republished from SemanticWeb
Extract, transform, load (ETL) is a widely known concept in the well-charted terrain of the IT world: it means transforming a bunch of heterogeneous data so it can be unified within a data warehouse and put to use.
Semantic ETL, says Hunt, is driven by the fact that people now want to deal with growing loads of streaming data while it’s still streaming, and that “people want intelligent data, machine-readable tags, [they want] to slice and dice it for BI in lots of different ways, so the traditional data warehouse and relational database approach is just not working for people.” Cleansed and integrated semantic data loaded into distributed, scalable triple stores can come to the rescue.
Explaining the use cases further, Hunt and Handley point to companies’ need to get away from sitting on data for a few days while it’s processed for loading into a traditional data store. When, say, product catalogues are continually streaming in data – prices, attributes, new releases – from multiple parties in different formats, the labor-intensive and lengthy process of merging and dealing with it all is unsustainable. “A lot of big companies are doing that and it hurts,” Hunt says.
Take the example of an import/export business with an e-commerce site that has to categorize product data according to World Trade Organization categories to deal with different laws or taxes around the offerings. “They can’t tell customers what the price is with taxes included unless they know what category it’s in, but vendors flow in new products all the time,” she says. “If you have to sit on new product data for three days while you categorize it before you can sell it on the web site, that’s losing revenue.”
First Retail has a customer engagement side of the business where it’s getting lots of exposure to the difficulties companies have around dealing with big data (whether for e-commerce or internal information). It’s using that insight – and the funding it generates – to help propel the research side of its house that’s working on stealth semantic technology to further help relieve businesses’ pain points.
Today, it’s helping businesses by putting some components around the stream of raw semantic data. It’s using NLP, machine learning, entity extraction and hand-written semantic rules, for example, to categorize information in flow so that data is in order by the time it reaches a scalable triple store. It’s also making it cheaper and easier for human curators to propagate decisions they’ve made about data such as product descriptions – à la semantic inferencing – so that all similar products are described in the same way. Basically, as the principals describe it, it’s the use of a bucket of different intelligent technologies to do data categorization on the fly.
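As a rough illustration of categorizing data in flow before it reaches a triple store, here is a minimal Python sketch. It is not First Retail's technology, which the article does not detail. The keyword rules, the `product:` identifiers, the predicate names and the in-memory set standing in for a triple store are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical hand-written rules: a keyword found in the product name
# maps the product to a category. Real rules would be far richer.
CATEGORY_RULES: Dict[str, str] = {
    "laptop": "Electronics",
    "kettle": "Kitchenware",
    "sneaker": "Footwear",
}


def categorize(name: str) -> str:
    """Return the first matching category, or a fallback for a human curator."""
    text = name.lower()
    for keyword, category in CATEGORY_RULES.items():
        if keyword in text:
            return category
    return "Uncategorized"


def to_triples(record: dict) -> Iterator[Triple]:
    """Transform one incoming product record into subject-predicate-object triples."""
    subject = f"product:{record['id']}"
    yield (subject, "name", record["name"])
    yield (subject, "price", str(record["price"]))
    # The category is assigned while the record is in flow, not days later.
    yield (subject, "category", categorize(record["name"]))


def load(stream: Iterable[dict], store: Set[Triple]) -> None:
    """Consume records one at a time and load their triples into the store."""
    for record in stream:
        store.update(to_triples(record))


# A made-up vendor feed, used only to exercise the sketch.
feed = [
    {"id": "1", "name": "Ultralight Laptop 13", "price": 899.0},
    {"id": "2", "name": "Steel Kettle", "price": 24.5},
    {"id": "3", "name": "Garden Gnome", "price": 12.0},
]

store: Set[Triple] = set()
load(iter(feed), store)
categories = {s: o for (s, p, o) in store if p == "category"}
```

In this sketch the third product matches no rule and lands in "Uncategorized", which is where a human curator's decision would come in and could then be propagated to similar products.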
While it can’t provide specific details about its stealth semantic technology, ideas include helping determine the degree of similarity between any two objects at really large scale, to allow precise search and discovery of objects where data is coming from lots of different siloed data sets, and precise management of user searches and search results. On-the-fly category creation that uncovers and draws on similarities between clusters of objects could fast-forward ideas and implementations about how people organize e-commerce web sites, for instance. “The key behind the stealth technology is to have such operations at scale,” says Handley. “We can do these things, precision searching and so on, but customers want it in real-time with very large amounts of data.”
From a general viewpoint, the semantic wave, especially for e-commerce companies, is unavoidable, Hunt thinks. “Data is not only bigger but expected to be more semantic,” she says. “GoodRelations is awesome…. This is really the beginning of a huge wave where it’s no longer acceptable to companies to have their products out there non-semantically.”