Organize Your Data Lake

Author: Jorge Anicama | 7 min read | October 25, 2022

Data is like oxygen: It’s everywhere. It’s everything.

But, unlike oxygen, it’s not all the same, nor can it all be stored the same way. As every business is coming to learn, the vast volume and variety of data streaming into databases – Big Data – is growing exponentially, which can be good if that information is also correctly cleaned, categorized, integrated, and stored for easy access.

It can be bad, however, when none of those corrective actions are taken, and the raw data is left sitting idly in the back of an overlooked data storage container. When that happens, even mission-critical corporate data is beyond the reach of company analysts, who are then forced to make decisions based on insufficient or, worse, erroneous information.

The Growth of Big Data

The reality is that today’s variety and volume of data are immense and growing. Each day, more than 2.5 quintillion bytes of digital information are generated by the global community, the equivalent of roughly 2,500 petabytes (a petabyte is 10^15 bytes, or a ‘1’ with 15 zeros behind it). Moreover, the volume of enterprise data is expanding faster than consumer data, as more of it lands in large cloud-based storage facilities. (Interestingly, cloud-generated data isn’t growing as fast as cloud-stored data.)
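As a quick check of the unit arithmetic, the short sketch below converts 2.5 quintillion bytes into petabytes and exabytes. It assumes decimal (SI) units and the short-scale meaning of "quintillion" (10^18); the variable names are illustrative only.

```python
# Decimal (SI) units: 1 petabyte = 10**15 bytes, 1 exabyte = 10**18 bytes.
BYTES_PER_PETABYTE = 10**15
BYTES_PER_EXABYTE = 10**18

# 2.5 quintillion (short scale) = 2.5 * 10**18, written as an integer
# so the conversion below involves no floating-point rounding.
daily_bytes = 25 * 10**17

daily_petabytes = daily_bytes // BYTES_PER_PETABYTE
daily_exabytes = daily_bytes / BYTES_PER_EXABYTE

print(f"{daily_petabytes:,} PB per day")  # 2,500 PB per day
print(f"{daily_exabytes} EB per day")     # 2.5 EB per day
```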

The biggest challenge with all this information flying around is that it defies simple organization techniques. Data comes in many formats and types, both structured and unstructured, many of which don’t naturally connect with other styles, nor can they be stored the same way. Consequently, technology experts devised equally unique data storage vessels to contain and control these varying forms of data so they could eventually be usable.

The Database

As the first data storage vessel, the standard ‘database’ is widely used and well loved, despite its limitations. Typical databases are ‘relational,’ meaning they organize data into easily read rows and columns. Data that fits neatly into this organization includes financial records, sports scores, and IoT device readings. Other types of databases store ‘non-relational’ information in less rigid arrangements that accommodate the information’s diverse sizes and shapes. Images, audio files, graphics, and other ‘nonlinear’ formats are stored in non-relational databases. Databases are used primarily for operational and transactional activities.

The challenge with using only one or the other type of database for today’s corporate information streams is that neither can collect or store the entire volume of incoming information, and whatever is not collected or stored is often lost.

The Data Warehouse

Another data repository that has gained favor is the data warehouse. A data warehouse is an orchestrated system of databases optimized to facilitate analytics (as opposed to being used as an operational resource). Stored data can be raw or well curated for use. ‘Curated’ means the information has been ‘cleansed’ (treated to remove incorrect, wrongly formatted, duplicate, or incomplete data), ‘filtered’ (arranged to meet the needs of specific users), and ‘aggregated’ (summarized so that users can read the summary rather than parse the original data set) for relevance.
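The three curation steps can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular warehouse product works: the sample records, field names, and helper functions (`cleanse`, `filter_for`, `aggregate`) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw sales records; the field names are assumptions.
raw_records = [
    {"order_id": 1, "region": "EMEA", "amount": "120.50"},
    {"order_id": 2, "region": "LATAM", "amount": "80.00"},
    {"order_id": 2, "region": "LATAM", "amount": "80.00"},  # duplicate
    {"order_id": 3, "region": "EMEA", "amount": "n/a"},     # wrongly formatted
    {"order_id": 4, "region": None, "amount": "45.25"},     # incomplete
    {"order_id": 5, "region": "EMEA", "amount": "30.00"},
]

def cleanse(records):
    """Drop duplicate, incomplete, or wrongly formatted records."""
    seen, clean = set(), []
    for rec in records:
        if rec["order_id"] in seen or rec["region"] is None:
            continue
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # amount is not a number, so skip the record
        seen.add(rec["order_id"])
        clean.append({**rec, "amount": amount})
    return clean

def filter_for(records, region):
    """Keep only the records a specific group of users needs."""
    return [rec for rec in records if rec["region"] == region]

def aggregate(records):
    """Summarize amounts per region so users read totals, not rows."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)

curated = cleanse(raw_records)
emea_only = filter_for(curated, "EMEA")
totals_by_region = aggregate(curated)
print(totals_by_region)  # {'EMEA': 150.5, 'LATAM': 80.0}
```

Of the six raw records, only three survive cleansing; the aggregate then gives an analyst one number per region instead of a list of orders.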

The highly structured data produced by some companies lends itself well to a data warehouse storage configuration. However, companies that generate non-relational data will have a more difficult time finding and using all the value that information holds.

The Data Lake

The largest storage facility of the three, the data lake, can store vast quantities of unstructured and structured data, all in its raw form and all in one convenient location. Maintaining a complete data record in one place facilitates easier and faster analyses and eliminates the need to maintain multiple storage systems for numerous data types. Besides convenience, the benefit of using a data lake is that it allows analysis without requiring all data to be moved or transformed. As such, data lakes are very popular for supporting both artificial intelligence (AI) and machine learning (ML) programs that perform deep analysis of current and historical corporate information.

Despite their benefits, however, data lakes also present a significant challenge. All that data in all its forms still needs sorting and organizing if users are to glean its full informational value. The organizational process is especially challenging given the data lake’s overall complexity as a single ‘data mass.’ Further, the project must overcome the two biggest challenges in big data initiatives:

connecting workers to the data they need to accomplish their goals, and
ensuring they know how to use it to make appropriate business decisions.

These challenges are even more imposing if the information is streaming from numerous sources, through multiple and varied devices, and even flowing in across time zones.

Organizing the Data Lake

Fortunately, there are strategies to accomplish the goal of organizing all the disparate forms of data included in the lake, so they are all accessible by the enterprise and can provide the information its users need.

Sort the data by source.

Every company has operational data that is generated daily and provides critical information about daily business metrics.
Many companies also have machines that automatically generate information, such as IoT devices, sensors, and performance logs.
People also create information daily through emails, images, web content, etc. Communications with clients, customers, and supply chain participants also add data to the daily mix.
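One way to picture this first step is a simple lookup that tags each incoming item with one of the three source categories described above. The sketch below is only an illustration: the system names, file names, and the `sort_by_source` helper are hypothetical, and a real lake would typically derive the category from ingestion metadata.

```python
from collections import defaultdict

# Hypothetical mapping from an originating system to a source category.
SOURCE_CATEGORY = {
    "erp": "operational",
    "point_of_sale": "operational",
    "iot_sensor": "machine",
    "server_log": "machine",
    "email": "human",
    "web_content": "human",
}

def sort_by_source(items):
    """Group item names by source category; unknown systems are flagged."""
    buckets = defaultdict(list)
    for item in items:
        category = SOURCE_CATEGORY.get(item["system"], "unclassified")
        buckets[category].append(item["name"])
    return dict(buckets)

incoming = [
    {"name": "daily_sales.csv", "system": "point_of_sale"},
    {"name": "turbine_temps.json", "system": "iot_sensor"},
    {"name": "client_thread.eml", "system": "email"},
    {"name": "app_server.log", "system": "server_log"},
    {"name": "vendor_feed.xml", "system": "partner_portal"},
]

by_source = sort_by_source(incoming)
for category, names in by_source.items():
    print(category, names)
```

Keeping an explicit "unclassified" bucket makes it visible when a new feed arrives that nobody has assigned to a source yet.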

Layer the data by status.

Commonly used layers include:

Raw data – information that enters and is maintained in its original state.
Standardized data – data that has been formatted in preparation for cleansing. Both the formatting and the cleansing processes depend on the information itself and how it will be accessed and used.
Cleansed data – also called the ‘curated’ layer, this data is organized into consumable sets based on intended use. Users are typically authorized to access only this layer of information within the data lake.
Application data – information that is used directly by the application for which it is captured. This layer is frequently the most complicated to maintain because it is also where data security systems live, as well as ML and AI models.
Sandbox data – an optional layer for many companies. Access to this layer allows data scientists to explore and experiment with various models and analyses to find new insights or better ways of managing corporate affairs.
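The layers above are often reflected directly in how storage is laid out. The sketch below shows one possible convention, assuming a folder-per-layer structure and a linear promotion path from raw to application; the `/lake` root, the path scheme, and the helper names are assumptions, and the sandbox is treated as sitting outside the promotion path.

```python
from enum import Enum
from pathlib import PurePosixPath

class Layer(Enum):
    RAW = "raw"
    STANDARDIZED = "standardized"
    CLEANSED = "cleansed"
    APPLICATION = "application"
    SANDBOX = "sandbox"  # optional, used for experiments

# Assumed promotion order; the sandbox is deliberately excluded.
PROMOTION_ORDER = [
    Layer.RAW,
    Layer.STANDARDIZED,
    Layer.CLEANSED,
    Layer.APPLICATION,
]

def layer_path(root, layer, source, dataset):
    """Build a storage path of the form <root>/<layer>/<source>/<dataset>."""
    return PurePosixPath(root) / layer.value / source / dataset

def next_layer(layer):
    """Return the layer a dataset moves to next, or None if there is none."""
    if layer not in PROMOTION_ORDER:
        return None
    idx = PROMOTION_ORDER.index(layer)
    if idx + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[idx + 1]
    return None

raw_location = layer_path("/lake", Layer.RAW, "machine", "sensor_readings")
print(raw_location)                 # /lake/raw/machine/sensor_readings
print(next_layer(Layer.RAW).value)  # standardized
```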

Impose rules.

After completing the first two steps, the lake is ready for the rules that will manage its access and usage. These rules include those unique to the company as well as those required by outside entities.

Security, governance, and stewardship standards establish who gets access to which data and for which purpose.
Metadata, master data, and archived data libraries provide both historical insights and daily guidance for data usage.
Tools and rules for managing data offloaded from data warehouses, along with the rules governing the orchestration of the cleansing processes, are also often applied at this stage.
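At its simplest, an access rule is a mapping from a role to the layers that role may read. The sketch below is a hypothetical illustration of that idea only: the role names and the `ACCESS_RULES` table are assumptions, and production lakes normally rely on the access controls of their storage or catalog platform rather than hand-written checks.

```python
# Hypothetical role-to-layer rules. Analysts see only the cleansed layer,
# data scientists also get the sandbox, engineers can read every layer.
ACCESS_RULES = {
    "analyst": {"cleansed"},
    "data_scientist": {"cleansed", "sandbox"},
    "data_engineer": {
        "raw", "standardized", "cleansed", "application", "sandbox",
    },
}

def can_read(role, layer):
    """Return True if the role is allowed to read the layer.

    Unknown roles get no access at all (deny by default).
    """
    return layer in ACCESS_RULES.get(role, set())

print(can_read("analyst", "cleansed"))        # True
print(can_read("analyst", "raw"))             # False
print(can_read("data_scientist", "sandbox"))  # True
print(can_read("contractor", "cleansed"))     # False
```

Denying access to any role that is not listed keeps a forgotten rule from turning into an accidental grant.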

Once you have a well-organized data lake, you can begin harnessing innovative capabilities, such as Artificial Intelligence (AI) and Machine Learning (ML). Learn more about the data-driven possibilities in our white paper AI & ML: The Future Analyzed.

Blog Author

Jorge Anicama

Director of Oracle Analytics

Mr. Anicama is a data management and analytics professional with over 15 years of experience in all phases of managing, planning, and implementing Data Integration and BI/Data Warehousing Solutions for customers in North America, Latin America, and Europe. Mr. Anicama also helps companies define and execute enterprise-wide BI solutions from inception to delivery to ongoing growth. His experience ranges from Master Data Management and Data Governance to Data Warehousing, Advanced & Predictive Analytics, and Cloud Analytics solutions using platforms like Oracle Cloud, Amazon AWS, MSFT Azure, and Google Cloud. Mr. Anicama possesses a Master in Mathematical Sciences (Pontificia Universidad Catolica del Peru) and a Master in Computational Algebra (Mathematics Research Institute at Universiteit Utrecht, the Netherlands). He speaks Spanish, English, and German fluently, and some Portuguese.