Deep Lake — an architectural blueprint for managing Deep Learning data at scale — part I
By Assaf Pinhasi
Image by author using vqgan+clip (“underwater robotic world | trending on artstation”, 1000 iter.)
Introduction
In the past few years, machine learning data management practices have evolved dramatically, with the introduction of new design patterns and tools such as feature stores, data and model monitoring practices, and feature generation frameworks.
Most advances in data management for machine learning are focused on classical (feature-based) data, and cannot be applied as-is to unstructured data, leaving deep learning data management practices behind
The goal of this series is to describe Deep Lake — an architectural blueprint for managing deep learning data that can be viewed as a cross between a Data Lake, a feature store and an evaluation store for deep learning.
Part I of the series will highlight the unique challenges of managing data for deep learning applications, and introduce Deep Lake at a high level.
Part II of the series will deep dive into specific design patterns that this blueprint enables, as well as notes around implementation and technology stack choices.
Scope
This post covers deep learning use-cases where the raw data includes images, videos, audio or other signals, typically represented as large binary data (vs. deep learning for tabular data).
The examples in this post refer to deep learning tasks that can be represented as structured goals (e.g. classification, object detection, phoneme prediction, etc.), but can also be applied to generative models that output unstructured data, with some adaptation.
Caveat: the focus is on tasks that depend at least partially on annotated datasets (vs. massive self-supervision approaches).
Audience
This post assumes familiarity with machine learning and deep learning practices and terminology, as well as a working knowledge of data engineering design patterns and practices.
Deep Learning data lifecycle 101
The goal of this section is to describe, at a high level, the way in which data is produced and consumed during the lifecycle of a deep learning model/system.
Image by the author
Data acquisition
Teams often use multiple data sources to train deep learning models, where data arrives at different cadences and in different formats.
- It’s common to source datasets from external data sources — such as open datasets or one-time extractions from third parties.
- In some cases, data is also generated synthetically using simulations or generative algorithms — or by “proactive” activities (e.g. staging and capturing images of a person carrying a weapon) or scouting for data (e.g. Google street view or Tesla’s car fleet).
- Whenever possible, production data is used to create a virtuous cycle of model improvements — which can usually be established only after an initial model is released.
Data pipeline
Before the data can be used, we often need to transform, clean and enrich it.
This process involves handling large binary data, which makes it challenging.
- Transformation of the unstructured data can be as simple as scaling and cropping, sampling or encoding, or, in more complex cases, embedding the data in a vector space using a pre-trained model (which may or may not need to be retrained and validated periodically).
- Cleaning usually means flagging broken data such as empty data files or unreasonable annotations.
- Enrichment can include extracting interesting data from the raw bytes (e.g. “lighting conditions” per image) programmatically (see the sketch after this list).
- A different type of enrichment is to create labels using human annotators — the human annotation tasks may be as simple as classifying cats and dogs using a crowdsourcing platform, or require rigorous, regulated, multi-domain-expert labeling with an enforced protocol for resolving inter-expert disagreement.
- Finally, enrichments may include augmentation of the data prior to the training process, especially if the augmentation is a heavy process.
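To make the programmatic enrichment concrete, here is a minimal sketch of deriving a “lighting” dimension per image from mean pixel brightness; the threshold, helper names and field names are illustrative assumptions, not part of any particular system.

```python
# Illustrative enrichment step: derive a "lighting" dimension per image from
# its mean grayscale brightness. Threshold and field names are assumptions.
from pathlib import Path

import numpy as np
from PIL import Image


def lighting_condition(image_path: Path, night_threshold: float = 60.0) -> str:
    """Classify an image as 'day' or 'night' by its mean grayscale brightness."""
    with Image.open(image_path) as img:
        gray = np.asarray(img.convert("L"), dtype=np.float32)
    return "day" if gray.mean() >= night_threshold else "night"


def enrich(image_paths: list[Path]) -> list[dict]:
    """Produce one structured metadata row per raw image."""
    return [{"image_uri": str(p), "lighting": lighting_condition(p)} for p in image_paths]
```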
Building datasets
Often, teams have access to pools of data that cannot all be used in modeling — especially if the task requires labeling.
- Selecting data randomly is often not enough — we need to ensure datasets represent a broad enough distribution (e.g. variability in lighting conditions, backgrounds, genders, etc.).
- Datasets often must also contain a representation of rare and difficult edge-cases (especially common for mission critical applications).
- As a result, a common workflow is to narrow down the data and sample it using some form of stratification technique (see the sketch after this list), and only then have human annotators create labels.
- Sometimes, dataset creation is an iterative loop (e.g. when some labels are needed to bootstrap the dataset), and it generally follows a complex and domain-specific workflow (more on this in part II).
- Finally, there are often multiple datasets used in training and evaluation, to test the model’s behavior from different aspects.
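As a rough sketch of the stratification step, assuming the relevant dimensions already live in a structured metadata table, one could sample a fixed number of examples per combination of dimension values (column names are illustrative; the same idea can be expressed as a SQL query over the warehouse):

```python
# Illustrative stratified sampling over a structured metadata table.
# Column names such as "lighting" and "background" are assumptions.
import pandas as pd


def stratified_sample(metadata: pd.DataFrame,
                      strata: list[str],
                      per_stratum: int,
                      seed: int = 42) -> pd.DataFrame:
    """Sample up to `per_stratum` rows from every combination of strata values."""
    return (
        metadata
        .groupby(strata, group_keys=False)
        .apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=seed))
    )


# Example: candidates = stratified_sample(metadata, ["lighting", "background"], per_stratum=500)
```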
Training, validation and testing
These processes consume datasets, of course — but also produce data.
- Metrics reports on model performance
- Model prediction data, including explainability data, confidence scores etc.
- Experiment lineage, trained models, metrics etc. are usually stored in an experiment tracking system.
Production data capture and monitoring
Once a model is deployed to production, raw data and predictions need to be monitored to ensure the models are performing correctly.
- It’s common to monitor the distribution of labels or their properties (class distribution, bounding box sizes), as well as prediction confidence (a small sketch of such a check follows this list).
- Oftentimes, we can capture signals from the production system that may suggest how well our algorithm performed — e.g. by collecting explicit or implicit feedback from the users via the product itself.
- When possible, it is useful to capture this production data, as we can use it to improve the next version of the model.
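As a small illustration of the label-distribution monitoring mentioned above, here is one way to compare the class distribution of recent predictions against a reference window; the alert threshold is an arbitrary assumption.

```python
# Illustrative monitoring check: maximum per-class shift between a reference
# window and a recent window of predicted labels.
from collections import Counter


def class_distribution(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {cls: n / total for cls, n in counts.items()}


def max_class_shift(reference: list[str], recent: list[str]) -> float:
    ref, cur = class_distribution(reference), class_distribution(recent)
    return max((abs(ref.get(c, 0.0) - cur.get(c, 0.0)) for c in set(ref) | set(cur)), default=0.0)


# e.g. alert if max_class_shift(last_week_predictions, last_hour_predictions) > 0.1
```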
Unique challenges in Deep Learning data
At a high level, the lifecycle of a deep learning model is not too different from that of a classical ML model. However, there are certain differences that are unique to deep learning and pose a considerable challenge when it comes to data management:
Large unstructured payloads
In classical ML, a single payload is composed of structured features, and a single model example can usually be represented in under 1K of data — which makes it relatively straightforward to leverage a wide range of storage, query and visualization engines built specifically for structured data.
Unstructured data, on the other hand:
- Often cannot be or should not be stored in a database engine
- Does not come with a declarative query language (such as SQL)
- Cannot be “eyeballed” without dedicated viewers
This requires designing custom solutions for storing, querying and visualizing unstructured data
Data volumes
In classical ML we can often rely exclusively on data produced by the system itself (e.g. user clicks or purchases, transactions, etc.) — and store virtually all relevant data indefinitely.
In DL it’s not uncommon for production data to be prohibitively expensive to collect in full (think about storing all video captured by all of Tesla’s cars indefinitely).
Moreover, after some time, 99.9% of the data from the production system has little value for improving the model, and the remaining 0.1% becomes critical to obtain but hard to come by (e.g. rare edge cases related to safety).
In some deep learning systems, the data acquisition or data processing pipelines need to be able to identify and filter only interesting data for long term retention
Another challenge in data volumes is that training deep learning models requires high end, fast storage (and unique data formats) in order to optimize the data loading path from disk to GPU memory.
Since this storage class is very expensive and data volumes are high, we are forced to manage the “cold” data in cheaper long-term storage that is optimized for throughput and cost, and supports random access to individual examples (in order to be able to label and eyeball examples).
Deep learning data often requires multiple storage tiers with different cost/performance profiles, and needs to be stored in different storage formats for different consumers
Data labeling
In classical ML we often get the labels for all data examples “for free”, or can derive them programmatically from the rest of the data.
In deep learning, we often need to perform some annotation, often following a complex, expensive and long process — including by leveraging human annotators.
Anti-patterns in managing deep learning data
Managing a large and ever growing corpus of data for deep learning can be a daunting task. Here are a few common anti-patterns that DL teams sometimes follow:
Treat data preparation as an ad-hoc task
It’s not uncommon for teams to carry out the entire data preparation process, from raw data to training examples ready for the GPU, as part of a single model development or improvement effort.
The equivalent of this would be to start the development of a classical ML model version from a replica of the production DB (which is where the industry was 5–10 years ago).
Storage jungle
Often, datasets, labels, predictions and metrics from a cycle of model improvements are either lost, or stored as artifacts of specific experiments.
This makes data sharing between projects and teams a big challenge, since most data is not easily accessible, let alone discoverable.
Format spaghetti
Even if data is accessible in a central location, it’s not uncommon for every data source, and even every dataset, to be stored in a different format — both in terms of how the unstructured data is organized into folders, and in terms of how labels are represented.
This increases the effort of combining data from multiple sources and projects.
Dimension ignorance
Curating datasets that contain good representations of domain data requires teams to understand what’s in their (binary) data — for example, collect images with both day and night scenes.
But how can you know which image contains a day scene or a night scene, for terabytes of data?
One manual way is to eyeball enough examples and collect them in an ad-hoc manner. However, this is time-consuming, repetitive, and doesn’t work well for uncommon cases such as “hail storm at night”
The cost of doing this work manually each time you want to improve your model is prohibitive, and as a result teams often do not have sufficiently good distributions in their datasets.
State of the art
How are these challenges tackled by the leading tech giants?
Unfortunately, not much has been published on the topic.
Uber ATG
A blog post from Uber describes Petastorm — a library created by the autonomous driving group that helps store and represent datasets in different formats, useful for both queries and training.
source: https://eng.uber.com/petastorm/
The post also mentions Uber’s autonomous driving “data lake”, that unifies data from multiple sources, pre-processes it and stores it in an optimized way for queries and retrievals ahead of the dataset generation process.
Referring to the data lifecycle 101 diagram above, it seems that Uber’s internal platform includes:
- The data pipeline from raw data to a common/useful format
- A data lake with a data schema that enables easy access to the data (over HDFS and Parquet)
- A query engine on top of the data (Apache Spark) that supports efficient queries on the data by extracting some information into queryable columns, and modeling the storage to ease the retrieval of time-series data.
- Ability to easily convert data into training-efficient and framework-specific data formats (using the Petastorm format)
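For a flavor of the last point, here is a minimal sketch of what consuming a Petastorm-materialized dataset from PyTorch might look like, based on the library’s documented make_reader and DataLoader helpers; the dataset URL is a placeholder, and the exact signatures should be checked against the Petastorm docs for your version.

```python
# Sketch of reading a Petastorm dataset for training. The HDFS path is a
# placeholder; the batch contents depend on the dataset's schema fields.
from petastorm import make_reader
from petastorm.pytorch import DataLoader

with make_reader("hdfs:///datasets/frames_petastorm", num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=32)
    for batch in loader:
        pass  # batch is a dict-like structure of tensors keyed by schema field names
```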
One area that is not mentioned is whether and how data is accessed for eyeballing or labeling by humans — as the formats mentioned do not allow point-queries very easily.
Tesla
In talks from a recent AI day at Tesla, it seems that Tesla’s AI team built:
- A synthetic data generator with corpuses of characters and scenarios
- An automatic pre-processing tool that can create annotations for intersections and other areas (which may need to be eyeballed by a human labeler)
All signs suggest that these tasks are part of a larger common infrastructure that performs a lot of the heavy lifting in advance of any concrete modeling task
Related data management areas
- Some labeling tools offer functionality for visualizing and searching the unstructured data (e.g. using metadata or vector similarity)
- Vector databases help teams manage embedding data at scale — though typically for point-queries during vector similarity searches, vs. data exploration and analytics
- Further downstream, data version control tools help teams store and manage their datasets post data selection — although the way in which the data is stored doesn’t lend itself to exploration or analysis.
Applying data engineering first principles to DL data
Perform heavy-lifting upstream
Most data engineering architectural patterns are built around extracting the common, heavy-lifting data work ahead of the business task, freeing the downstream teams to focus on solving their unique problem.
Data lakes and warehouses, ETL pipelines and Feature stores operate based on this principle.
Note: this approach is viable when there are enough downstream consumers needing similar data processing to make it efficient to do this work once.
Store data in layers according to their level of processing
Since not all downstream use-cases can be predicted in advance, and data might need to be reprocessed from scratch, data lakes often store the data in tiers:
- Raw
- Lightly processed and cleaned
- Ready for common-use cases (sometimes in marts).
This is often referred to as “bronze, silver, gold” data layers design pattern.
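One hypothetical way to express these layers on object storage is a simple prefix convention; the bucket name and prefixes below are purely illustrative.

```python
# Illustrative "bronze / silver / gold" prefix convention on object storage.
LAYERS = {
    "bronze": "s3://deep-lake/bronze/{source}/{ingest_date}/",  # raw, as acquired
    "silver": "s3://deep-lake/silver/{modality}/{entity_id}/",  # cleaned and normalized
    "gold":   "s3://deep-lake/gold/{domain}/{dataset_name}/",   # ready for common use-cases
}


def layer_path(layer: str, **parts: str) -> str:
    """Resolve a concrete path, e.g. layer_path("bronze", source="fleet", ingest_date="2022-06-01")."""
    return LAYERS[layer].format(**parts)
```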
Create an explicit data model
Data modeling helps make sense of the data and informs the thinking around it.
As a start we need to design the structure in which we organize our binary data so that we can manage it — apply retention policies, inventories, move between storage tiers etc.
More importantly, as we saw, building datasets requires filtering our data by “dimensions” — e.g. by image resolution, lighting conditions, gender of a profile photo, etc.
It makes sense to define the data model explicitly — e.g.:
- Identify “domain entities”
- Identify “dimensions” on these entities
- Identify “events” that describe “things that happened to our entities”.
Once an explicit data model can be described, by definition it has a structure and can be stored in a storage engine for structured data (e.g. a database)
Create separate “materialized views” if access patterns differ
Generally we attempt to pre-compute and store data in useful formats.
In case we have multiple consumers that have very different data access patterns, it’s sometimes useful to materialize the data in different storage devices and formats to ease its consumption.
In our case, GPU data loading requires a very optimized storage device and format, which is often not useful to eyeballing the data or long term storage.
Maintain change logs vs. overwrite data
When possible, it’s usually better to append a new version of the data (raw, labels, etc.) than to overwrite the old version.
Deep Lake architectural pattern — overview
Generally this pattern follows modern data lake and warehouse patterns.
Image by the author
*The numbering refers to the diagram above.
(1) Acquire
We capture data from the data source, either in batch or in stream.
From external data sources, this may mean unstructured data and some structured data like labels.
If the source is the production system, we usually can obtain unstructured raw data, predictions, metadata, additional relevant events, or all of the above.
In case the (unstructured) data needs to be aggressively filtered (think IoT devices), the acquisition logic needs to handle at least part of the filtering.
(2) Land
The structured and unstructured information is landed in its raw form in a “landing zone”. Often, at this stage, both can be stored as files or objects.
(3) Ingestion pipeline (ETL/ELT)
From raw data to the common useful structure.
The “T” in “ETL/ELT” includes transforming unstructured data to useful formats, extracting common dimensions, cleaning broken data, and organizing the data in the data model.
This involves querying, filtering, and transforming both structured and unstructured data.
Some or all of the filtering of data which is deemed less useful to retain often takes place inside the ETL pipeline.
To handle data from multiple, disparate data sources, we would first normalize the data using per-datasource logic, before continuing the processing.
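A rough sketch of this per-datasource normalization: each source gets a small adapter that maps its raw payload into a common record shape, and the rest of the pipeline only ever sees that shape. All field and adapter names here are illustrative assumptions.

```python
# Illustrative per-datasource normalization into a common record shape.
from typing import Callable

CommonRecord = dict  # e.g. keys: "entity_id", "uri", "modality", "source", "captured_at"


def normalize_open_dataset(raw: dict) -> CommonRecord:
    return {"entity_id": raw["id"], "uri": raw["file"], "modality": "image",
            "source": "open_dataset", "captured_at": raw.get("timestamp")}


def normalize_fleet_capture(raw: dict) -> CommonRecord:
    return {"entity_id": raw["frame_uuid"], "uri": raw["s3_key"], "modality": "image",
            "source": "fleet", "captured_at": raw["captured_at"]}


NORMALIZERS: dict[str, Callable[[dict], CommonRecord]] = {
    "open_dataset": normalize_open_dataset,
    "fleet": normalize_fleet_capture,
}


def normalize(source: str, raw: dict) -> CommonRecord:
    return NORMALIZERS[source](raw)
```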
(4) Store in the common layer
Data in its useful form is stored in a central data lake that can serve multiple projects.
Data is modeled according to a well understood schema — both the layout (e.g. partitioning) of the unstructured data and especially the data model of the structured data.
It’s possible to unify the storage technology used for both structured and unstructured data (such as using HDFS), or use separate storage technologies for the structured and unstructured data (with a non-trivial tradeoff).
At this point, we can perform queries on the data to understand and characterize both the raw data properties, as well as the labels, predictions and other aspects of our data.
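For example, a characterization query over the common layer might look like the following; the table and column names are assumptions for illustration, not a prescribed schema.

```python
# Illustrative query against the structured part of the common layer:
# how many examples do we have per lighting condition and label class?
CHARACTERIZATION_QUERY = """
SELECT d.lighting,
       l.class,
       COUNT(*) AS num_examples
FROM common.raw_entities e
JOIN common.dimensions d USING (entity_id)
LEFT JOIN common.labels l USING (entity_id)
GROUP BY d.lighting, l.class
ORDER BY num_examples DESC
"""
# Run it with whatever engine backs the structured storage, e.g.:
# results = warehouse_connection.execute(CHARACTERIZATION_QUERY).fetchall()
```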
(5) Domain workflows and (6) warehouse
If the data is used for different purposes, it’s common for each domain to need a richer data model — such as additional dimensions on the data.
(7) Create datasets
Oftentimes, the task of creating a dataset requires a workflow:
- Select a pool of data by distribution using stratification
- Perform additional labeling only on the selected data
- Perform validation and splitting
- Prepare and store the data in a GPU-training-friendly format
For complex datasets with edge cases and complex distributions, we may need to repeat this loop multiple times until we end up with the dataset we need.
The same pipeline tools used to prepare larger data pools may be used to generate specific datasets.
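Condensed into code, this workflow might be orchestrated roughly as below; every function stands in for a real task in the pipeline tool and is a named placeholder rather than an implementation.

```python
# Placeholder steps for the dataset-creation workflow; each would map to a task
# in the DAG orchestrator. Names and signatures are illustrative only.
def select_stratified_pool(criteria: dict) -> list[str]:
    """Query the lake's dimensions and return candidate entity ids."""
    ...


def request_labels(entity_ids: list[str]) -> list[dict]:
    """Send only the selected entities to the annotation system and wait for results."""
    ...


def validate_split_and_materialize(labeled: list[dict], name: str) -> str:
    """Sanity-check labels, split train/val/test, write the GPU-friendly format."""
    ...


def build_dataset(criteria: dict, name: str) -> str:
    pool = select_stratified_pool(criteria)
    labeled = request_labels(pool)
    return validate_split_and_materialize(labeled, name)
```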
(8) Evaluation
Models that pass the initial bar for usefulness may need to be evaluated by performing inference on (multiple) datasets and applying pass/fail criteria.
It’s useful to ingest predictions back into the lake, as well as be able to pass them to human reviewers in the annotation system, for more rigorous validation and for active learning opportunities.
Performing large scale inference, storing data in the warehouse and interacting with the annotation system can be expressed as workflows using the pipeline tools.
Architectural views
Data model
The data model is usually somewhat domain and modality specific.
However, the approach for building it is usually similar:
Raw data (e.g. an image or an audio file) — as an “entity” we want to “know things about”
Derived unstructured data (e.g. embedding) — is also an entity, pointing to the “parent” (raw data)
Labels and predictions are “events”
Labels and predictions often share a common representation, and can be thought of as “An opinion of some actor about a question regarding the subject (raw data)”.
There can be any number of such “opinions” for every entity, including opinions derived from those of multiple actors (e.g. consensus between labelers)
Dimensions:
- Technical dimensions (resolution, FPS, encoding, etc.)
- Domain dimensions — things that help describe the content of the raw data but are not the main goal of the models (e.g. lighting conditions, number of objects, gender of person etc.)
- In some more advanced cases, the lower level data can be used to derive insights about other higher level entities. For example, we can perform aggregations and store summarized model performance data in the warehouse.
- An even more advanced use-case is to reify information about the objects on which we perform predictions — e.g. derive a real world high resolution map of an intersection from multiple predictions there.
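A minimal sketch of this data model as code, with the entity and the “opinion” (label/prediction) shapes expressed as dataclasses; the field names are illustrative rather than a fixed schema.

```python
# Illustrative entity / opinion shapes for the structured part of the lake.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class RawEntity:
    entity_id: str
    uri: str                                         # location of the binary payload
    parent_id: Optional[str] = None                  # set for derived data such as embeddings
    dimensions: dict = field(default_factory=dict)   # e.g. {"lighting": "night", "fps": 30}


@dataclass
class Opinion:
    """A label or a prediction: some actor's answer to a question about an entity."""
    entity_id: str
    actor: str                    # e.g. "annotator:42" or "model:detector-v3"
    question: str                 # e.g. "pedestrian_boxes"
    answer: dict                  # task-specific payload
    created_at: datetime
    confidence: Optional[float] = None
```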
Component view
Unstructured storage engine
Such as a centralized file system or object storage, partitioned in a logical way.
Used as the source of truth for the data including for access from viewers/labeling systems.
Structured storage
Such as a cloud data warehouse or a modest relational database.
Note: as per the Uber blog post, you might be able to combine these engines and the query engines in one infrastructure (with non-trivial tradeoffs)
Logical data layers — raw, common, and domain warehouses
Typically implemented as schemas or namespaces in the structured storage; since unstructured data is immutable, there’s no point in copying it around.
Pipeline tool
Typically a DAG orchestrator, with support for long-running steps (human labeling) and the ability to trigger the data processing infrastructure (see below).
Unstructured data processing infrastructure
- Logic expressed as imperative code
- Computation is executed over a cluster (e.g. serverless, k8s jobs, Spark jobs etc.)
- Process reads from the unstructured storage and writes output to some (potentially separate) storage device (including message busses, object store, etc.)
- Output is accessible by downstream pipeline steps and engineers.
Structured data processing and query tools
Ideally a declarative query language (that comes with the storage solution), used to retrieve data as well as to perform DDL/DML statements like CTAS (CREATE TABLE AS SELECT).
API
Read API
- Low level — for structured data and for unstructured data separately
- High level — for querying all raw data that matches a set of criteria and materializing it in a specified format (useful for labeling/eyeballing or training data preparation)
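As an illustration, the high-level read API might look roughly like the sketch below; this is an assumed interface, not an existing library.

```python
# Assumed interface sketch for the high-level read API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LakeQuery:
    filters: dict                 # e.g. {"lighting": "night", "contains_pedestrian": True}
    limit: Optional[int] = None


def materialize(query: LakeQuery, fmt: str, destination: str) -> str:
    """Resolve the query against the structured layer, fetch/convert the matching
    unstructured payloads, and return the destination URI.

    fmt could be e.g. "folder_of_files" for labeling/eyeballing,
    or "petastorm" for training data preparation.
    """
    raise NotImplementedError("interface sketch only")
```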
Write API
With a focus on ingesting new labels and predictions from different sources (labeling system, external data and model predictions)
Label format translation API
- To / from the annotation system label representation
- From model prediction data format
Summary and next steps
Data management for deep learning is a challenge, due to the unstructured, high-volume, and labor-intensive nature of the data.
In this post, we introduced “Deep Lake” — an architectural blueprint that draws on data engineering best practices and patterns, and adapts them to the domain of managing deep learning data.
In the next part of the series, we will describe design patterns used within the Deep Lake system, as well as review potential technology stack implementation alternatives.