Today’s enterprise organizations receive and process data from a variety of sources, including web and mobile applications, social media, artificial intelligence solutions, and IoT sensors. Yet processing this data efficiently at high volume remains a challenge for many organizations.
Typical challenges include integrating mainframe data with real-time IoT messages and hierarchical documents.
Another issue is that enterprise data is often not clean and may carry contradictory characteristics and interpretations. This complicates many processes, for example when integrating customer records from multiple source systems.
Data cleansing might seem like the obvious solution to this problem. But what if different cleansing rules must be applied to the same incoming data set? The basic assumption of “a single version of the truth” doesn’t hold in most enterprises: while one department might have a clear understanding of how the incoming data should be cleansed, another department, or an external party, might have a different understanding.
True, it would be desirable to have only one business rule for cleansing the data, but in an enterprise organization, the circumstances are much more complex than that. Just consider different tax laws and other regulations, leading vs. secondary source systems whose precedence depends on the context of the report, or external parties such as government or industry regulators that don’t care about internal definitions.
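The point above can be made concrete with a small sketch. The example below applies two hypothetical, department-specific rule sets to the same raw customer record; every function and field name here is illustrative, not part of any particular product or standard. The same raw data yields two different “truths”:

```python
# Hypothetical sketch: one raw customer record, two department-specific
# cleansing rule sets. Both interpretations are valid in their own context.

raw_customer = {"name": " Jane DOE ", "country": "Deutschland", "revenue": "1.234,50"}

def parse_german_decimal(text: str) -> float:
    # Convert "1.234,50" (German formatting) into a float.
    return float(text.replace(".", "").replace(",", "."))

def cleanse_for_sales(rec: dict) -> dict:
    # Sales keeps the local country name and title-cases the customer name.
    return {
        "name": rec["name"].strip().title(),
        "country": rec["country"],
        "revenue": parse_german_decimal(rec["revenue"]),
    }

def cleanse_for_finance(rec: dict) -> dict:
    # Finance maps countries to ISO codes and upper-cases names for matching.
    iso_codes = {"Deutschland": "DE", "Germany": "DE"}
    return {
        "name": rec["name"].strip().upper(),
        "country": iso_codes.get(rec["country"], rec["country"]),
        "revenue": parse_german_decimal(rec["revenue"]),
    }

print(cleanse_for_sales(raw_customer))
print(cleanse_for_finance(raw_customer))
```

If the raw data were cleansed once, up front, with only one of these rule sets, the other department’s interpretation would be lost for good.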
Other challenges facing today’s organizations include the auditability of data processing and GDPR compliance, with its security and privacy requirements such as the right to be forgotten. Still others struggle to provision data in usable formats for applications and analytical approaches such as data mining. In addition, migrating to the cloud, or at least leveraging it, remains a source of confusion for many organizations.
On top of that, data volumes continue to grow, with some industries facing massive data growth. The data to be processed is also becoming more volatile and more complex. Dealing with structured, semi-structured, and polymorphic data is now an everyday job for most, if not all, data professionals. Even structured data is becoming more complex due to richer data types, such as table-like structures, key-value pairs, text, geospatial data, and the nodes and edges used in graph processing.
This article starts a series that promises a solution to most of these challenges, as well as additional ones we see in the field of enterprise data warehousing that stem from the complexity of enterprise data: continuously changing source schemata, including new or changed attributes; difficult schema and data migration projects; and the unpopular fact that source data typically doesn’t follow the target schema of an analytical project.
The promised solution is based on an approach that overcomes these challenges while allowing enterprise organizations to gather insights from massive amounts of complex and changing document structures. It follows the Data Vault 2.0 System of Business Intelligence, a popular approach for building enterprise data warehouse (EDW) solutions on relational databases and, increasingly, on NoSQL databases and data lakes.
This approach distinguishes between raw data integration and the delivery of useful, actionable information.
The basic idea is to first break data and documents into smaller components, integrate and store them efficiently, and later re-assemble them into the desired target format. This approach is often applied to relations processed in relational databases or similar technologies such as Hive, but it can also be applied to documents.
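The break-apart-and-reassemble idea can be sketched in a few lines. The example below loosely borrows Data Vault terminology (a hub for the business key, a satellite for the descriptive attributes), but it is only an illustration under simplified assumptions; the actual Data Vault 2.0 modeling rules are considerably more involved. All function and field names are hypothetical:

```python
import hashlib

def hash_key(business_key: str) -> str:
    # A hash of the business key serves as the shared integration point
    # between the split-off components.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

def split_document(doc: dict, key_field: str):
    # Break the document into a hub entry (the business key) and a
    # satellite entry (all remaining descriptive attributes).
    hk = hash_key(str(doc[key_field]))
    hub = {"hash_key": hk, key_field: doc[key_field]}
    satellite = {"hash_key": hk,
                 **{k: v for k, v in doc.items() if k != key_field}}
    return hub, satellite

def reassemble(hub: dict, satellite: dict) -> dict:
    # Join hub and satellite back into the desired target shape.
    assert hub["hash_key"] == satellite["hash_key"]
    doc = {k: v for k, v in hub.items() if k != "hash_key"}
    doc.update({k: v for k, v in satellite.items() if k != "hash_key"})
    return doc

doc = {"customer_no": "C-1001", "name": "Jane Doe", "city": "Berlin"}
hub, sat = split_document(doc, "customer_no")
print(reassemble(hub, sat) == doc)  # the round trip preserves the document
```

Because the raw components are stored unmodified, different downstream rule sets (such as the department-specific cleansing rules discussed earlier) can each re-assemble their own version of the information without losing the original data.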
While this may seem complex at first glance, more details will follow in upcoming articles, so check back in on the series as we shed more light on the topic.