
Lifecycle Management for Big Data Projects
Guaranteed success for all data projects

Gartner has reported for more than 30 years that a significant portion of big data projects, an estimated 85%, fail to deliver expected results. That persistent failure rate highlights the challenges businesses face in successfully implementing and leveraging big data initiatives.
The market is flooded with graphical ETL products, and for years the vendors of these products have pushed the notion that data projects fail primarily because of hand-coded ETL. If that belief were true, graphical tools should have solved the problem long ago, so we need to ask why the high failure rate has persisted for more than 30 years.
All ETL products are based on the same paradigm. The unit-of-work is a job, graph, drawing, or the like: it consumes one or more data sources; joins, cleanses, modifies, and restructures the source data; and finally produces one or more data targets. These products contribute to data project failure because that unit-of-work cannot accommodate or direct the entire lifecycle of a big data project. A data project produces a data application composed of many ETL units-of-work created by many project participants, and application development and application runtime must be a single endeavor. Something much larger is required.
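To make the paradigm concrete, here is a minimal sketch of one such unit-of-work in Python. The file names and fields are invented for illustration; no particular vendor's product is implied. It extracts two sources, joins and cleanses them, and loads one target:

    import csv

    def run_job(customers_path, orders_path, target_path):
        # Extract: read both sources.
        with open(customers_path, newline="") as f:
            customers = {row["customer_id"]: row for row in csv.DictReader(f)}
        with open(orders_path, newline="") as f:
            orders = list(csv.DictReader(f))

        # Transform: join, cleanse, and restructure.
        rows = []
        for order in orders:
            customer = customers.get(order["customer_id"])
            if customer is None:
                continue  # cleanse: drop orders with no matching customer
            rows.append({
                "order_id": order["order_id"],
                "customer_name": customer["name"].strip().title(),
                "amount": round(float(order["amount"]), 2),
            })

        # Load: write the single target.
        with open(target_path, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=["order_id", "customer_name", "amount"])
            writer.writeheader()
            writer.writerows(rows)

Notice what is absent: the job knows nothing about scheduling, recovery, load distribution, or the hundreds of sibling jobs that surround it in a real application.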
A big data project will employ many participants – data analysts, developers, testers, runtime operators – creating tens, hundreds, or even thousands of those ETL units-of-work concurrently, and those units-of-work collectively become the data application.
A solution is needed that guarantees success for all data projects by providing an environment that accommodates the entire data project lifecycle, from application requirements gathering, through object development and application deployment, to ongoing application maintenance. That solution is Mozart4Draw.io.
Many data projects begin with the acquisition of an ETL tool, which means developer training is required because these tools tend to be very complex. Imagine being an English speaker in the USA, learning Italian over a six-week period, and then being tasked with writing a successful Italian opera. This is how most data projects begin, and the projects that actually succeed are typically implemented by ETL vendor consultants at huge expense. Further, the applications they create cannot be easily maintained or modified by the novice Italian speakers, so you are now married to your ETL vendor's consultants whenever problems arise or modifications are needed as business requirements evolve. There is a better way.
Encapsulation principles provide a segregated context for every project role. Segregated, because a data analyst needs no knowledge of ETL development, testing, systems, or the runtime environment; a developer of an ETL object needs no knowledge of other ETL objects, testing, the overall application, or the application runtime environment; and production operators need no knowledge of the application's functionality or purpose, because load management, load distribution, and point-of-failure recovery are all automated.
Because objects resemble standalone black boxes whose contents are irrelevant to most project participants, different developers may use different ETL technologies to implement their objects' functionality. This means that if a project begins with three proficient Datastage developers and five proficient Oracle PL/SQL developers, it begins with the best object developers from day one, each working in a preferred technology. Further, because each object is self-contained, with defined interfaces and requirements, each object may be developed and tested anywhere in the world.
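As an illustration of the principle, here is one hypothetical way such an object contract could look. The interface below is invented for this page, not Mozart4Draw.io's actual API: each object declares only its inputs, its outputs, and an entry point, so how it is implemented internally is nobody else's concern.

    from abc import ABC, abstractmethod

    class DataObject(ABC):
        """An encapsulated unit: declared inputs, declared outputs, one entry point."""
        inputs: tuple[str, ...] = ()
        outputs: tuple[str, ...] = ()

        @abstractmethod
        def run(self) -> None:
            """Produce every declared output from the declared inputs."""

    class CleanseCustomers(DataObject):
        # One developer might implement this object as an Oracle PL/SQL call...
        inputs = ("raw.customers",)
        outputs = ("clean.customers",)
        def run(self) -> None:
            pass  # e.g., invoke a stored procedure

    class BuildOrderFacts(DataObject):
        # ...while another implements this one as a Datastage job invocation.
        inputs = ("clean.customers", "raw.orders")
        outputs = ("warehouse.order_facts",)
        def run(self) -> None:
            pass  # e.g., launch a Datastage job and wait for it to finish

Because the contract is all the rest of the project sees, the two objects above could be built and tested by different teams on different continents.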
Mozart4Draw.io is both a software plugin and a methodology created for the open-source Draw.io drawing application. It facilitates the decomposition of complex data application requirements into encapsulated, multi-level objects for distribution to developers and testers. When the developed and tested objects are plugged into the parent data application, the completed application is portable across processing environments, can be distributed within a single processing environment or across several, is point-of-failure recoverable, and is easily modified and maintained over time. You essentially draw a multi-level, object-oriented data diagram and run it on any processing topology. Mozart4Draw.io provides guided context for the entire data project lifecycle.
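Continuing the hypothetical sketch above (again, an illustration of the idea rather than Mozart4Draw.io's actual engine), the declared interfaces are enough for a parent application to derive a correct run order on its own, which is what makes a drawing of objects executable. This uses Python's standard-library graphlib (Python 3.9+):

    from graphlib import TopologicalSorter

    def run_application(objects):
        # Which object produces each named dataset?
        producer = {out: obj for obj in objects for out in obj.outputs}
        # An object depends on whichever objects produce its inputs.
        graph = {obj: {producer[i] for i in obj.inputs if i in producer}
                 for obj in objects}
        # Run every object exactly once, in dependency order.
        for obj in TopologicalSorter(graph).static_order():
            obj.run()

    run_application([BuildOrderFacts(), CleanseCustomers()])  # order is derived, not declared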
You will be amazed at how easy it is to manage any data project and produce successful, high-performing data applications.

The journey began in the mid-1990s, when I was a big data consultant working for the Informix Data Warehouse group. I was assigned to an enterprise data warehouse project for a large car rental company in the Midwest. It was the company's first data warehouse project and had 34 participants. The ETL product selected was Informix-4GL, a hand-coding product, and the ETL developers were hired from within and given very little training before the project began. I realized immediately that I could give a developer specifications defining the inputs, outputs, and mappings for an individual ETL job, but the developer would not be able to deal with things like application-wide point-of-failure recovery, load balancing and distribution, and application metadata. I wrote an engine, based on encapsulation principles, that managed the hundreds of ETL jobs in the production environment, and the project was a resounding success, delivering accurate data in a timely manner.
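As a toy illustration of that engine idea (the original was written in Informix-4GL; everything below is a simplified Python rendering with an invented checkpoint file, not the original code), point-of-failure recovery reduces to recording each completed job so a restarted run resumes where the failure occurred:

    import json, os

    STATE_FILE = "run_state.json"  # hypothetical checkpoint file

    def run_with_recovery(jobs):
        """jobs: a list of (name, callable) pairs, already in dependency order."""
        done = set()
        if os.path.exists(STATE_FILE):      # a previous run failed part-way
            with open(STATE_FILE) as f:
                done = set(json.load(f))
        for name, job in jobs:
            if name in done:
                continue                    # finished in an earlier attempt
            job()                           # if this raises, the state file survives
            done.add(name)
            with open(STATE_FILE, "w") as f:
                json.dump(sorted(done), f)  # checkpoint after every success
        os.remove(STATE_FILE)               # clean finish: next run starts fresh

With hundreds of jobs wrapped this way, operators could restart a failed nightly load without knowing anything about what the jobs actually did.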