ETL 2.0 Data Integration Comes of Age
Robin Bloor, Ph.D., and Rebecca Jozwiak
WHITE PAPER
Executive Summary
In this white paper, we examine the evolution of ETL and conclude that a new generation of ETL products, which we call ETL 2.0, is putting a much-needed emphasis on the transformation aspect of ETL. The following points summarize the contents of the paper.
• Data movement has proliferated wildly since the advent of the data warehouse, necessitating the growth of a market for ETL products that help to automate such transfers.
• Few data centers have experienced a consistent use of ETL, with many such programs being hand coded or implemented using SQL utilities. As a consequence, the ETL environment is usually fragmented and poorly managed.
• Databases and data stores, in combination with data transfer activities, can be viewed as providing a data services layer to the organization. Ultimately, the goal of such data services is to provide any data needed by authorized IT and business users when they want it and in the form that they need it.
• The capabilities of the first generation of ETL products are now being stressed by:
  - The growth of new applications, particularly BI applications.
  - The growth of data volumes, the increasing variety of data, and the need for speed.
  - The increasing need to analyze very large pools of data, often including historical and social network data.
  - High-availability (24/7) requirements that have closed the batch windows in which ETL programs could run.
  - Rapid changes in technology.
• With respect to technology changes, we note the emergence of a whole new generation of databases that are purpose-designed to exploit current computer hardware, both to achieve better performance and scale and to manage very large collections of data. Similarly, we believe the second generation of ETL products will be capable of better performance and scalability, and will be better able to process very large volumes of data.
• We characterize the second generation of ETL products as having the following qualities:
  - Improved connectivity
  - Versatility of extracts, transformations, and loads
  - Breadth of application
  - Usability and collaboration
  - Economy of resource usage
  - Self-optimization
By leveraging an ETL tool that is versatile in both connectivity and scalability, businesses can overcome the challenges of large data volumes and improve the overall performance of data flows. The versatility of second-generation ETL tools also allows them to serve a wide variety of applications that address business needs, however complex. These products will improve the time to value for many applications that depend on data flows and provide a framework that fosters collaboration among developers, analysts, and business users. By virtue of software efficiency, these tools will require fewer hardware resources than previous tools, and because transformations are processed in memory, they will eliminate the need for workarounds, scheduling, and constant tuning.
In summary, it is our view that ETL tools with such capabilities become increasingly strategic because of their critical role in the provision of data services to applications and business users. Their inherently low development and maintenance costs can help businesses realize a significantly lower overall total cost of ownership (TCO).
The Big Data Landscape
The vision of a single database that could serve the needs of a whole organization was laid to rest long ago. That discarded ideal was superseded by a pragmatic acceptance that the data resources of an organization will involve many data stores threaded together by ad hoc data flows carrying data from one place to another. The corporate data resource is fragmented, and there is scant hope that this state of affairs will change any time soon. It is worth reflecting on why this is the case.
As the use of database technology grew, it soon became clear that the typical workload of transactional systems was incompatible with the heavy query workloads that provided data to reporting applications. This gave rise to the idea of replicating the data from operational systems into a single large database, a data warehouse, that could serve all reporting requirements. Initially, data migration from operational systems to data warehouses was served by programmers writing individual programs to feed data to the data warehouse. This task was time-consuming, carried with it an ongoing maintenance cost, and could be better automated through the use of purpose-built tools. Such tools quickly emerged and were called extract, transform, and load (ETL) tools.
Over time, the wide proliferation of business intelligence (BI) applications drove the increasing creation of data marts, each a subset of data focused on a single business area or function. This, in turn, meant more work for ETL tools. Eventually, the speed of this data transfer process came into question. The time required to move data from production systems to the data warehouse and then on to data marts was too long for some business needs. Therefore, organizations were forced to implement suboptimal workarounds to achieve the performance needed to support the business.
Meanwhile, data continued to grow exponentially. While Moore's Law increases computer power by a factor of 10 roughly every six years or so, big databases seemed to grow by 1,000 times in size during that period. That's Moore's Law cubed. That's Big Data, and that's mainly what is prompting the latest revolution.
There is no doubt that data integration has become increasingly complex and costly. Organizations can no longer rely solely on hardware and inefficient workarounds to overcome the Big Data challenges ahead. Clearly, a new approach is needed.
The Stressing of ETL 1.0
Aside from the growth in the number of business applications and the perennial growth in data volumes, there are four distinct factors that have placed increasing demands on ETL tools since they were originally introduced.
Timing Constraints
Fifteen years ago, interactive applications rarely ran for more than 12 hours a day. This left ample time for ETL tools to feed data to a data warehouse, for reporting tools to use the operational data directly, and for database backups. This convenient slack period was generally referred to as the batch window. ETL transfers tended to run on a weekly or even a monthly basis.
But the batch windows gradually began to shrink or vanish. Data marts proliferated as a much wider variety of BI and analytics applications emerged. Eventually, the demand for data warehouse updates shifted from nightly to hourly to near-real time as the need for timely information grew. ETL tools had to try to accommodate this new reality.
Technology Shifts
Computer technology gets faster all the time, but what Moore's Law provides, data growth takes away. To complicate the situation, hardware does not improve in a uniform way. By 2003, after several years of increasing CPU clock speed as a means to accelerate processing power, Intel and AMD began to produce multicore CPUs, packaging more than one processor on each chip. Most databases, because they had been built for high scalability and performance, soon benefited from these additional resources. However, few ETL tools were designed to exploit multiple processors and parallelize workloads. They were behind the curve.
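To make the parallelism point concrete, the short Python sketch below (our own illustration, not drawn from any particular ETL product) fans a CPU-bound record transformation out across worker processes so that additional cores are actually used; the transform_record logic, field names, and worker counts are hypothetical.

    from multiprocessing import Pool

    def transform_record(record):
        # Hypothetical per-record work: normalize a coded value and derive a field.
        record["country"] = record.get("country", "").strip().upper()
        record["total"] = record.get("qty", 0) * record.get("unit_price", 0.0)
        return record

    def transform_batch(records, workers=4, chunksize=1000):
        # Fan the transformation out across processes so every core stays busy.
        with Pool(processes=workers) as pool:
            return pool.map(transform_record, records, chunksize=chunksize)

    if __name__ == "__main__":
        batch = [{"country": " us ", "qty": 3, "unit_price": 9.99}] * 10000
        print(transform_batch(batch)[0])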
The Advent of Big Data
There have always been massive amounts of data that would be useful to analyze. What is relatively new is the increasing volume, complexity, and velocity of that data – referred to as Big Data. In general, the term "Big Data" involves collections of data measured in tens or hundreds of terabytes that require significant hardware and software resources in order to be analyzed. Large web businesses have such data. So do telecom companies, financial sector companies, and utilities of various kinds. And now that there are databases that cater to such data, many other businesses are discovering areas of opportunity where they can accumulate and analyze large volumes of data as well.
In practice, for many organizations Big Data means digging into previously archived or historical data. Similarly, large and small businesses alike are discovering the need to store and analyze data at a more granular level. Either way, the heaps of data that were previously considered inaccessible are fast becoming viable data assets. For ETL, Big Data translates into new and larger workloads.
Cloud Computing, Mobile Computing
Cloud computing adds complexity to the equation by extending the data center through the Internet, providing not only additional data sources and feeds, but also cloud-based applications like Salesforce.com, which may pose additional integration challenges. Moreover, cloud environments will most likely suffer from relatively low connection speeds and, possibly, data traffic limitations. Similarly, mobile computing adds new and different applications, many of which demand a very specific data service. The dramatic adoption of smartphones and other mobile devices ultimately increases the creation and velocity of data, two of the key characteristics of Big Data.
The Limitations of ETL 1.0
ETL evolved from a point-to-point data integration capability to a fundamental component of the entire corporate data infrastructure. We illustrate this in a simple way in Figure 1, which notionally partitions IT into a data services layer that manages and provides data and an applications layer that uses the data. However, reality is far more complex than shown in the illustration. There are a variety of ways that applications can be connected to data. Applications vary in size and can reside on almost any device, from a server to a mobile phone. Data flows can weave a complex web. Nevertheless, as the diagram suggests, data management software and ETL are complementary components that combine to deliver a data service. As such, they need to work hand in hand.
Figure 1. Applications and Data Services: an application layer of BI applications and other apps draws its data from a data services layer made up of DBMSs, files, and the ETL flows that connect them.
The problem with most ETL products, those we think of as ETL 1.0, is that they were never designed for such a role. They were designed to make it easy for IT users and programmers to specify data flows and to carry out some simple data transformations "in flight" so that data arrived in the right format. They included a scheduling capability so that they would fire off at the right time, and they usually included a good set of connectors to provide access to a wide variety of databases and data stores. They were very effective for specifying and scheduling point-to-point data flows.
What many of them lacked, however, was a sophisticated software architecture. They weren't designed to efficiently handle complex data transformations "in flight." Indeed, the T in ETL was largely absent. They weren't designed to use resources economically. They weren't designed for scalability or high-speed data transfers. They weren't designed to handle ever-increasing data volumes. In summary, they were not designed to globally manage data flows in a data services layer.
As data volumes increased, so did the challenge of accessing that data. In many situations, ETL tools simply were not fast enough or capable enough. Consequently, data transformation activity was often delegated to the database, with database administrators (DBAs) trying to manage performance through constant tuning. Developers resorted to hand coding or to using ETL tools just for scheduling. This inevitably led to spaghetti architectures, longer development cycles, and higher total cost of ownership. Strategic business objectives were not being met.
More and more companies find themselves in this situation. For instance, a leading telecommunications company spent over $15 million on additional database capacity just to get a 10% improvement in overall performance. More importantly, 80% of its database capacity was consumed by data transformations as opposed to analytics.
The Nature of ETL 2.0
Having described the failings and limitations of ETL 1.0, we can now describe what we believe to be the characteristics of ETL 2.0. Just as database technology is evolving to leverage Big Data, we should expect ETL products either to be re-engineered or to be superseded. ETL products are (or should be) complementary to databases and data stores, together delivering a data services layer that can provide a comprehensive data service to the business. Unlike ETL 1.0, this new approach would reduce the complexity, the cost, and the time to value of data integration. We list what we believe to be the qualities of an ETL 2.0 product in Figure 2 and describe them in detail below, in the order in which they are listed.
Versatility of Connectivity, Extract, and Load
Figure 2. Nature of ETL 2.0: the ETL 2.0 qualities are versatility of connectivity, extract, and load; versatility of transformations and scalability; breadth of application; usability and collaboration; economy of resource usage; and self-optimization.
ETL has always been about connectivity to some degree, with ETL tools providing as many connections as possible to the wide variety of databases, data stores, and applications that pervade the data center. As new databases and data stores emerge, ETL products need to accommodate them, and this includes the ability to connect to sources of unstructured data in addition to databases. It also means connecting to cloud data sources as well as those in the data center. Where ETL tools fail to provide a connection, hand coding, with all its painful overhead, becomes necessary.
Extracting data can be achieved in a variety of ways. The ETL product can simply use an SQL interface to a database to extract data, for example, but this is likely to be inefficient and it presents an extra workload to the database. Alternatively, it can make use of database log files or it can access the raw disk directly. ETL products need to provide such options. The same goes for loading data. The ETL tool may load the data into staging tables within the database in a convenient form or may simply deposit the data as a file for the database to load at its leisure. Ideally, ETL tools would be able to present data in a form that allows for the fastest ingest by the target database without violating constraints defined within the database's schema. Products that qualify as ETL 2.0 need to have as many extract and load options as possible to ensure the overall performance of any given data flow, while placing the least possible overhead on data sources and targets.
Versatility of connectivity is also about leveraging and extending the capabilities of the existing data integration environment, a concept commonly known as data integration acceleration.
This includes the ability to seamlessly accelerate existing data integration deployments without the need to rip and replace, as well as the ability to leverage and accelerate emerging technologies like Hadoop.
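As a concrete illustration of the extract and load options described above, the following sketch (ours, assuming a PostgreSQL source and target and the psycopg2 driver) contrasts a low-overhead, server-side-cursor extract with a bulk COPY-style load rather than row-by-row inserts; the table and query names are placeholders.

    import io
    import psycopg2

    def extract_rows(src_conn, query, batch_size=10000):
        # A named (server-side) cursor streams rows in batches instead of pulling
        # the whole result set at once, limiting the burden on the source database.
        with src_conn.cursor(name="etl_extract") as cur:
            cur.itersize = batch_size
            cur.execute(query)
            for row in cur:
                yield row

    def bulk_load(dst_conn, table, rows):
        # COPY-style ingest is usually far faster than issuing row-by-row INSERTs.
        buf = io.StringIO()
        for row in rows:
            buf.write("\t".join("" if v is None else str(v) for v in row) + "\n")
        buf.seek(0)
        with dst_conn.cursor() as cur:
            cur.copy_expert("COPY " + table + " FROM STDIN WITH (FORMAT text)", buf)
        dst_conn.commit()

    # Example usage (connection strings and names are placeholders):
    # src = psycopg2.connect("dbname=source"); dst = psycopg2.connect("dbname=warehouse")
    # bulk_load(dst, "staging.orders", extract_rows(src, "SELECT * FROM orders"))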
Versatility of Transformations and Scalability
All ETL products provide some transformations, but few are versatile. Useful transformations may involve translating data formats and coded values between the data sources and the target (if they are, or need to be, different). They may involve deriving calculated values, sorting data, aggregating data, or joining data. They may involve transposing data (from columns to rows) or transposing single columns into multiple columns. They may involve performing look-ups and substituting actual values with looked-up values, applying validations (and rejecting records that fail), and more. If the ETL tool cannot perform such transformations, they will have to be hand coded elsewhere, in the database or in an application. It is extremely useful if transformations can draw data from multiple sources and data joins can be performed between such sources "in flight," eliminating the need for costly and complex staging. Ideally, an ETL 2.0 product will be rich in transformation options, since its role is to eliminate the need to hand code any such data transformations.
Currently, ETL workloads beyond the multi-terabyte level are rare, although in the future they may be seen more frequently. Consequently, scalability needs to be inherent in the ETL 2.0 product architecture so that it can optimally and efficiently transfer and transform multiple terabytes of data when provided with sufficient hardware resources.
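For illustration only, the short Python sketch below (ours; the field names and reference data are invented) performs a few of the in-flight transformations listed above: a code-value look-up, a validation step that rejects bad records, a join across two sources without staging, and a simple aggregation.

    from collections import defaultdict

    COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}  # invented reference data

    def transform(orders, customers):
        # Join index built on the smaller source, so no staging table is needed.
        cust_by_id = {c["id"]: c for c in customers}
        out, rejects = [], []
        for o in orders:
            cust = cust_by_id.get(o["customer_id"])
            if cust is None or o.get("amount", 0) < 0:
                rejects.append(o)  # validation failure: reject rather than load
                continue
            out.append({
                "order_id": o["id"],
                "customer": cust["name"],
                "country": COUNTRY_NAMES.get(cust["country"], cust["country"]),  # look-up
                "amount": round(o["amount"], 2),
            })
        return out, rejects

    def totals_by_country(rows):
        # A simple in-flight aggregation of the transformed records.
        totals = defaultdict(float)
        for r in rows:
            totals[r["country"]] += r["amount"]
        return dict(totals)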
Breadth of Application
At its most basic, an ETL tool transfers data from a source to a target. The job is more complex when there are multiple sources and multiple targets. For example, ETL may supplement or replace data replication carried out by a database, which can mean the ETL tool needs to deliver data to multiple locations. This type of complexity can jeopardize speed and performance. An ETL 2.0 product must be able to transfer data swiftly and deftly, regardless of the number of sources and targets.
A store-and-forward mode of use is important. The availability of data from data sources may not exactly coincide with the availability of the target to ingest data, so the ETL tool needs to gather the data from data sources, carry out whatever transformations are necessary, and then store the data until the target database is ready to receive it.
Change data capture, whereby the ETL tool transfers only the data that has changed in the source, is a critically important option. It can reduce the ETL workload significantly and improve timing dramatically while keeping source and target databases in sync. The ability to stream data, so that the ingest process begins immediately when the data arrives at the target, is another function that reduces the overall time of a data transfer and achieves near real-time data movement. Such data transfers are often small batches of data transferred frequently. For similarly small amounts of data, it is important that real-time interfaces, such as web services, MQ/JMS, and HTTP, are supported.
This is also likely to be important for mobile data services. The ability to work within, or connected to, the cloud is swiftly becoming a necessity. This requires not only supporting data transfer to and from common software-as-a-service (SaaS) providers, such as Salesforce.com or NetSuite, but also accommodating the technical, contractual, or cost constraints imposed by any cloud service. An ETL 2.0 tool should be able to deliver on all these possibilities.
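The sketch below is a minimal, hypothetical illustration of the change data capture idea described above, using a last-modified timestamp as the high-water mark; log-based or trigger-based capture, which many products use, works differently. The table, column, and file names are assumptions.

    import json
    import pathlib

    STATE_FILE = pathlib.Path("cdc_state.json")  # hypothetical checkpoint location

    def load_watermark():
        if STATE_FILE.exists():
            return json.loads(STATE_FILE.read_text())["last_seen"]
        return "1970-01-01T00:00:00"

    def save_watermark(ts):
        STATE_FILE.write_text(json.dumps({"last_seen": ts}))

    def extract_changes(conn, watermark):
        # Only rows modified since the last run cross the wire. Assumes a DB-API
        # connection with %s parameter style (e.g., psycopg2) and an updated_at column.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload, updated_at FROM source_table "
                "WHERE updated_at > %s ORDER BY updated_at",
                (watermark,),
            )
            return cur.fetchall()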
Usability and Collaboration
An ETL 2.0 tool should be easy to use for both the IT developer and the business user. As a matter of course, the ETL tool should log and report on all its activity, including any exceptions that occur in any of its activities. Such information must be easily available to anyone who needs to analyze ETL activity for any purpose. Developers must be able to define complex data transfers involving many transformations and rules, specifying the usage mode and scheduling the data transfer in a codeless manner. Business users should be able to take advantage of the power of the ETL environment with a "self-service" interface matched to their role and technical proficiency.
Today's business user is more technology-savvy than ever and has the potential to bring more to the table than a request for a report. An ETL tool should enable and foster collaboration between business users, analysts, and developers by providing a framework that automatically adapts to each user's role. When the business user has a clear understanding of the data life cycle, and developers and analysts have a clear understanding of the business goals and objectives, a much closer working relationship can be achieved. In addition to bridging the proverbial gap between IT and the business, this type of ETL approach can result in faster time to production and, ultimately, increased business agility and lower costs. By eliminating the typical back-and-forth discussions, a collaborative effort during the planning stages can have a significant impact on the efficiency of the environment in which the ETL tool is used.
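As a rough sketch of what "codeless" specification plus full activity logging might look like, the snippet below is entirely our own illustration, not a real product interface: a data transfer is driven from a declarative spec, and every run is logged, including failures. All field names are invented.

    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

    PIPELINE_SPEC = {
        "name": "orders_to_mart",
        "source": {"type": "postgres", "query": "SELECT * FROM orders"},
        "transforms": ["uppercase:country", "derive:total=qty*unit_price"],
        "target": {"type": "postgres", "table": "mart.orders"},
        "schedule": "hourly",
    }

    def run_pipeline(spec, extract, transform, load):
        # Every run is logged, including exceptions, so ETL activity can be
        # reviewed by developers, analysts, or business users.
        start = time.time()
        logging.info("starting pipeline %s", spec["name"])
        try:
            rows = transform(extract(spec["source"]), spec["transforms"])
            load(spec["target"], rows)
            logging.info("pipeline %s finished in %.1fs", spec["name"], time.time() - start)
        except Exception:
            logging.exception("pipeline %s failed", spec["name"])
            raise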
Economy of Resource Usage
At a hardware level, the ETL 2.0 tool must identify available resources (CPU power, memory, disk, and network bandwidth) and take advantage of them in an economic fashion. Specifically, it should be capable of data compression to alleviate disk and network I/O – something particularly important for cloud environments – and of parallel operation, both for speed and for resource efficiency. With any ETL operation, I/O is almost always one of the biggest bottlenecks. The ETL tool should be able to dynamically understand and adapt to the file system and I/O bandwidth to ensure optimized operation. The ETL tool also needs to "clean up" after itself, freeing computing resources, including disk space (eliminating all temporary files) and memory, as soon as it no longer requires them.
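The following minimal sketch (ours) illustrates two of these habits in Python: compressing intermediate data to reduce disk and network I/O, and removing temporary files as soon as they have been consumed.

    import gzip
    import os
    import tempfile

    def stage_compressed(rows):
        # Write the intermediate result compressed; text-like ETL data typically
        # shrinks several-fold, easing disk and network I/O.
        fd, path = tempfile.mkstemp(suffix=".csv.gz")
        os.close(fd)
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for row in rows:
                f.write(",".join(str(v) for v in row) + "\n")
        return path

    def consume_and_clean_up(path):
        # Read the staged data, then remove the temporary file immediately so
        # disk space is freed as soon as it is no longer required.
        try:
            with gzip.open(path, "rt", encoding="utf-8") as f:
                return [line.rstrip("\n").split(",") for line in f]
        finally:
            os.remove(path)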
Self-Optimization
In our view, ETL 2.0 products should require very little tuning. The number of man-hours an organization can spend on the constant tuning of databases and data flows hinders business agility and eats up resources at an alarming rate. ETL tuning requires time and specific skills. Even when it is effective, the gains are usually marginal and may evaporate when data volumes increase or minor changes to requirements are implemented. Tuning is an expensive and perpetual activity that doesn't solve the problem; it just defers it.
ETL 2.0 products will optimize data transfer speeds in line with performance goals, all but eliminating manual tuning. They will embody an optimization capability that is aware of the computer resources available and is able to optimize its own operations in real time, without the need for human intervention beyond setting very basic parameters. The optimization capability will need to consider all the ETL activities (extracts, transforms, and loads), automatically adjusting data processing algorithms to optimize the data transfer activity irrespective of how complex it is.
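As a toy illustration of self-optimization (ours; the throughput thresholds are arbitrary, and a real product would consider far more signals), the sketch below measures throughput on each batch and adjusts the batch size toward whatever the target can absorb, removing the need for manual tuning of that one parameter.

    import time

    def adaptive_transfer(read_batch, write_batch, batch_size=5000,
                          min_size=500, max_size=100000):
        while True:
            rows = read_batch(batch_size)
            if not rows:
                break
            start = time.time()
            write_batch(rows)
            rate = len(rows) / max(time.time() - start, 1e-6)
            # Simple feedback rule: grow batches while throughput keeps up,
            # shrink them when the target slows down.
            if rate > 50000:
                batch_size = min(batch_size * 2, max_size)
            elif rate < 10000:
                batch_size = max(batch_size // 2, min_size)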
The Benefits of ETL 2.0
It is clear that the modern business and computing environment demands much more from ETL and data integration tools than they were designed to deliver. So it makes sense to discuss how much that matters. To most end users, ETL tools are little known and largely invisible until they perform badly or fail. As far as the business user is concerned, there is useful data, and they need access to it in a convenient form when and for whatever reason they need it. Their needs are simple to articulate but not so easy to satisfy.
Figure 3. The ETL Challenges: ETL sits between apps and data and the goal of any data delivered to authorized users when they want it and how they want it, while facing timing imperatives, technology changes, cloud computing, mobile computing, Big Data, and more applications and data.
Existing ETL products face many challenges, as summarized in Figure 3. First and foremost, they need to deliver a first-class data service to business users by ensuring, where possible, that the whole data services layer delivers data to those users when and how they want it. ETL products also need to accommodate timing imperatives, perennial growth in applications and data volumes, technology changes, Big Data, cloud computing, and mobile computing. And, ultimately, they need to deliver business benefit. The business benefits of effective ETL have two aspects: those that affect the operations of the business directly and those that impact the efficient management and operation of IT resources.
The Operations of the Business
The growth in the business use of data is unlikely to stall any time soon. At the leading edge of this is the explosion of Big Data, cloud computing, and mobile BI, which are in their infancy, but they won't be for long. A product page on Facebook, for example, can record certain behaviors of the users who "like" that page. Drawing on such information and matching it, perhaps, with specific information drawn from Twitter, the company can tailor its marketing message to specific categories of customers and potential customers. Such information is easy enough to gather, but not so easy to integrate with corporate data. The business needs to be able to exploit any opportunity it identifies in these and related areas as quickly as possible; continued competitiveness and revenue opportunities depend on it.
Assuming the data can be integrated, the information needs to be delivered to the entire business through an accurate and nimble data service, tailored to the various uses for that data. Delivering new data services quickly and effectively means providing user self-service where that is feasible and desirable, and enabling the fast development of new data services by IT where that is necessary. The ability to identify opportunities from the various sources of data, and to deliver the information with agility, is essential to continued competitiveness and revenue growth. If this can be achieved, then ETL and the associated databases and data stores that provide information services are doing their job.
The Operations of IT
Even when an effective ETL service is provided, its delivery may be costly. The expense of ETL is best viewed from a total cost of ownership perspective. A primary problem that eventually emerges from the deployment of outdated ETL products is entropy: the gradual deterioration of the data services layer, which results in escalating costs. In reality, the software license fees for the ETL tools are likely to be a very small percentage of the cost of ownership. The major benefits of ETL 2.0 for the operations of IT will come from:
• Low development costs: New data transfers can be built with very little effort.
• Low maintenance effort: The manual effort of maintaining data transfers will be low when changes to requirements emerge.
• Tunability/optimization: There will be little or no effort associated with ensuring adequate performance.
• Economy of resource usage: They will require fewer hardware resources than previous ETL products for any given workload.
• Fast development and user self-service: They will reduce the time to value for many applications that depend on data flows.
• Scalability: There will be no significant limits to moving data around, since every variety of data transfer will be possible and data volume growth will not require exponential management overhead.
• Manageability: Finally, all ETL activities will be visible and managed collectively rather than on a case-by-case basis. The major win comes from being able to plan for the future, avoiding unexpected costs and provisioning resources as the company needs them.
Clearly, ETL 2.0 benefits will differ from business to business. A capable ETL product will help organizations remain competitive and relevant in the marketplace. Those organizations that are under pressure from data growth or a highly fragmented ETL environment will see results immediately by putting their house in order. For example, a Fortune 500 company has been able to reduce its annual costs by more than $1 million by deploying an ETL 2.0 environment with a well-planned, scalable data flow architecture. It replaced most of its existing data transfer programs, eliminating all hand coding and significantly reducing its tuning and maintenance activity. Similarly, businesses that are pioneering in mobile computing and Big Data may see greater gains than others. As a rough rule of thumb, the more data transfers an organization performs, the more immediate the benefits of ETL 2.0 will be.
Conclusions
The first generation of ETL products, ETL 1.0, is becoming increasingly expensive and difficult to deploy and maintain. The details are different for each IT environment, but the same characteristics emerge. The resource management costs of ETL escalate. The amount of IT effort needed to sustain ETL increases, and the manageability of the whole environment deteriorates. What began as a series of point-to-point deployments becomes an ad hoc spaghetti architecture. The environment becomes saturated with a disparate set of transformations: some of them using the ETL tool itself, some of them in the database, and some of them hand coded. What's left is a data services layer that is impossible to manage, reuse, or govern. The IT department is faced with failing to deliver an adequate service to the business or paying a high price in order to do so. Such a situation is not sustainable in the long run.
As we've discussed, a new approach to data integration is needed to meet today's business needs. We call this approach ETL 2.0, and it is key to helping organizations remain competitive in the marketplace. The characteristics of ETL 2.0 include:
• Connectivity and versatility of extract and load
• Versatility of transformations and scalability
• Breadth of application
• Usability and collaboration
• Economy of resource usage
• Self-optimization
ETL products that provide the full range of capabilities described in this paper will almost certainly have a significant impact on both organizations and the data integration industry as a whole. The benefits of ETL 2.0 are threefold: the business receives the data service it needs to remain competitive and achieve strategic objectives; the ETL environment does not suffer from entropy and can quickly scale to accommodate new demands for information; and, most importantly, the total cost of owning, deploying, and maintaining the ETL environment is significantly lower than that of its predecessor.
A capable ETL product will reduce TCO simply by removing the need for additional personnel and hardware, but one that delivers really well will further increase ROI by providing businesses with the data they need to make game-changing decisions precisely when they need it, enabling organizations to maximize the opportunities of Big Data.
About The Bloor Group
The Bloor Group is a consulting, research, and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.TheVirtualCircle.com for more information. The Bloor Group is the sole copyright holder of this publication.
PO Box 200638, Austin, TX 78720 | Tel: 512-524-3689 | www.TheVirtualCircle.com | www.BloorGroup.com