
The Coding Guide for Implementing Zarr on Large Data Sets: Compression, Indexing and Visualization Techniques

Tech · By Gavin Wallace · 17/09/2025 · 8 Mins Read

This tutorial takes you on a journey through the features of Zarr, a library designed for efficient storage and manipulation of large, multidimensional arrays. We start with the basics: creating arrays, setting up chunking, and editing values on disk. We then move on to more complex operations, such as testing different chunk sizes and access patterns. We also apply multiple compression codecs to balance storage size against speed, and compare their performance on synthetic datasets. In addition, we build hierarchical data structures enriched with metadata, demonstrate realistic workflows using time-series and volumetric datasets, and show advanced indexing techniques for extracting meaningful subsets. See the FULL CODES here.

!pip install zarr numcodecs -q
import zarr
import numpy as np
import matplotlib.pyplot as plt
from numcodecs import Blosc, Delta, FixedScaleOffset
import tempfile
import shutil
import os
from pathlib import Path


print(f"Zarr version: {zarr.__version__}")
print(f"NumPy version: {np.__version__}")


print("=== BASIC ZARR OPERATIONS ===")

Our tutorial begins by installing Zarr and Numcodecs alongside NumPy and Matplotlib. After setting up the environment, we verify the installed versions and dive into basic Zarr operations. See the FULL CODES here.

tutorial_dir = Path(tempfile.mkdtemp(prefix="zarr_tutorial_"))
print(f"Working directory: {tutorial_dir}")


z1 = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f4",
               store=str(tutorial_dir / 'basic_array.zarr'), zarr_format=2)
z2 = zarr.ones((500, 500, 10), chunks=(100, 100, 5), dtype="i4",
              store=str(tutorial_dir / 'multi_dim.zarr'), zarr_format=2)


print(f"2D Array shape: {z1.shape}, chunks: {z1.chunks}, dtype: {z1.dtype}")
print(f"3D Array shape: {z2.shape}, chunks: {z2.chunks}, dtype: {z2.dtype}")


z1[100:200, 100:200] = np.random.random((100, 100)).astype('f4')
z2[:, :, 0] = np.arange(500*500).reshape(500, 500)


print(f"Memory usage estimate: {z1.nbytes_stored() / 1024**2:.2f} MB")

Then we create our working directory and initialize two Zarr arrays: a two-dimensional array of zeros and a three-dimensional array of ones. We fill in random values and sequences, then check shape, chunk size, dtype, and stored size. See the FULL CODES here.

print("\n=== ADVANCED CHUNKING ===")


time_steps, height, width = 365, 1000, 2000
time_series = zarr.zeros(
   (time_steps, height, width),
   chunks=(30, 250, 500),
   dtype="f4",
   store=str(tutorial_dir / 'time_series.zarr'),
   zarr_format=2
)


for t in range(0, time_steps, 30):
   end_t = min(t + 30, time_steps)
   seasonal = np.sin(2 * np.pi * np.arange(t, end_t) / 365)[:, None, None]
   spatial = np.random.normal(20, 5, (end_t - t, height, width))
   time_series[t:end_t] = (spatial + 10 * seasonal).astype('f4')


print(f"Time series created: {time_series.shape}")
print("Approximate chunks created")


import time
start = time.time()
temporal_slice = time_series[:, 500, 1000]
temporal_time = time.time() - start


start = time.time()
spatial_slice = time_series[100, :200, :200]
spatial_time = time.time() - start


print(f"Temporal access time: {temporal_time:.4f}s")
print(f"Spatial access time: {spatial_time:.4f}s")

In this step we simulate a year-long time-series dataset with chunking chosen to support both temporal and spatial access. We add seasonal patterns and spatial noise, then measure access speeds to see first-hand how chunking affects performance. See the FULL CODES here.
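The code above only prints a placeholder instead of an actual chunk count. The number of chunks Zarr creates can be derived directly from the array and chunk shapes; here is a small self-contained sketch with this tutorial's shapes hard-coded:

```python
import math

# Shapes used for the time-series array above
shape = (365, 1000, 2000)   # (time_steps, height, width)
chunks = (30, 250, 500)

# Zarr rounds up per dimension, so partial edge chunks still count
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)
print(chunks_per_dim, n_chunks)  # [13, 4, 4] 208
```

The temporal slice above touches all 13 time-chunks for one pixel, while the spatial slice touches a single time-chunk, which is why their access times differ.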

print("\n=== COMPRESSION AND CODECS ===")


data = np.random.randint(0, 1000, (1000, 1000), dtype='i4')


from zarr.codecs import BloscCodec, BytesCodec


z_none = zarr.array(data, chunks=(100, 100),
                  codecs=[BytesCodec()],
                  store=str(tutorial_dir / 'no_compress.zarr'))


z_lz4 = zarr.array(data, chunks=(100, 100),
                  codecs=[BytesCodec(), BloscCodec(cname="lz4", clevel=5)],
                  store=str(tutorial_dir / 'lz4_compress.zarr'))


z_zstd = zarr.array(data, chunks=(100, 100),
                   codecs=[BytesCodec(), BloscCodec(cname="zstd", clevel=9)],
                   store=str(tutorial_dir / 'zstd_compress.zarr'))


sequential_data = np.cumsum(np.random.randint(-5, 6, (1000, 1000)), axis=1)
z_delta = zarr.array(sequential_data, chunks=(100, 100),
                    codecs=[BytesCodec(), BloscCodec(cname="zstd", clevel=5)],
                    store=str(tutorial_dir / 'sequential_compress.zarr'))


sizes = {
   'No compression': z_none.nbytes_stored(),
   'LZ4': z_lz4.nbytes_stored(),
   'ZSTD': z_zstd.nbytes_stored(),
   'Sequential+ZSTD': z_delta.nbytes_stored()
}


print("Compression comparison:")
original_size = data.nbytes
for name, size in sizes.items():
   ratio = size / original_size
   print(f"{name}: {size/1024**2:.2f} MB (ratio: {ratio:.3f})")


print("\n=== HIERARCHICAL DATA ORGANIZATION ===")


root = zarr.open_group(str(tutorial_dir / 'experiment.zarr'), mode="w")


raw_data = root.create_group('raw_data')
processed = root.create_group('processed')
metadata = root.create_group('metadata')


raw_data.create_dataset('images', shape=(100, 512, 512), chunks=(10, 128, 128), dtype="u2")
raw_data.create_dataset('timestamps', shape=(100,), dtype="datetime64[ns]")


processed.create_dataset('normalized', shape=(100, 512, 512), chunks=(10, 128, 128), dtype="f4")
processed.create_dataset('features', shape=(100, 50), chunks=(20, 50), dtype="f4")


root.attrs['experiment_id'] = 'EXP_2024_001'
root.attrs['description'] = 'Advanced Zarr tutorial demonstration'
root.attrs['created'] = str(np.datetime64('2024-01-01'))


raw_data.attrs['instrument'] = 'Synthetic Camera'
raw_data.attrs['resolution'] = [512, 512]
processed.attrs['normalization'] = 'z-score'


timestamps = np.datetime64('2024-01-01') + np.arange(100) * np.timedelta64(1, 'h')
raw_data['timestamps'][:] = timestamps


for i in range(100):
   frame = np.random.poisson(100 + 50 * np.sin(2 * np.pi * i / 100), (512, 512)).astype('u2')
   raw_data['images'][i] = frame


print(f"Created hierarchical structure with {len(list(root.group_keys()))} groups")
print("Data arrays and groups created successfully")


print("\n=== ADVANCED INDEXING ===")


volume_data = zarr.zeros((50, 20, 256, 256), chunks=(5, 5, 64, 64), dtype="f4",
                       store=str(tutorial_dir / 'volume.zarr'), zarr_format=2)


for t in range(50):
    for z in range(20):
        y, x = np.ogrid[:256, :256]
       center_y, center_x = 128 + 20*np.sin(t*0.1), 128 + 20*np.cos(t*0.1)
       focus_quality = 1 - abs(z - 10) / 10
      
       signal = focus_quality * np.exp(-((y-center_y)**2 + (x-center_x)**2) / (50**2))
       noise = 0.1 * np.random.random((256, 256))
       volume_data[t, z] = (signal + noise).astype('f4')


print("Various slicing operations:")


max_projection = np.max(volume_data[:, 10], axis=0)
print(f"Max projection shape: {max_projection.shape}")


z_stack = volume_data[25, :, 100:156, 100:156]
print(f"Z-stack subset: {z_stack.shape}")


bright_pixels = volume_data[volume_data > 0.5]
print(f"Pixels above threshold: {len(bright_pixels)}")
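Beyond slicing and boolean masks, Zarr also distinguishes orthogonal indexing (`.oindex`, the outer product of the index lists) from coordinate indexing (`.vindex`, pointwise). The same distinction exists in plain NumPy, shown here as a self-contained sketch; `.oindex` and `.vindex` behave analogously on Zarr arrays:

```python
import numpy as np

a = np.arange(36).reshape(6, 6)
rows, cols = [0, 2, 4], [1, 3, 5]

# Pointwise (NumPy fancy indexing, like zarr's .vindex):
# picks elements (0,1), (2,3), (4,5) -> 3 values
pointwise = a[rows, cols]
print(pointwise)

# Orthogonal (like zarr's .oindex): the full 3x3 cross product
orthogonal = a[np.ix_(rows, cols)]
print(orthogonal.shape)  # (3, 3)
```

Orthogonal indexing is usually the better fit for chunked storage, since it selects whole rows and columns of chunks rather than scattered points.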

Comparing on-disk sizes with LZ4, ZSTD, and no compression shows the real-world savings. We then organize the experiment into a Zarr hierarchy, complete with images, timestamps, and rich attributes. Finally, we create a 4D volume and explore advanced indexing: sub-stacking, thresholding, and max projections. See the FULL CODES here.

print("\n=== PERFORMANCE OPTIMIZATION ===")


def process_chunk_serial(data, func):
   results = []
   for i in range(0, len(data), 100):
       chunk = data[i:i+100]
       results.append(func(chunk))
   return np.concatenate(results)


def gaussian_filter_1d(x, sigma=1.0):
   kernel_size = int(4 * sigma)
   if kernel_size == 0:
       kernel_size += 1
   kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size//2) / sigma)**2)
   kernel = kernel / kernel.sum()
   return np.convolve(x.astype(float), kernel, mode="same")
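The two helpers above, `process_chunk_serial` and `gaussian_filter_1d`, are defined but never exercised together in the tutorial. A self-contained sketch of how they compose (the helpers are restated here so the snippet runs on its own):

```python
import numpy as np

def gaussian_filter_1d(x, sigma=1.0):
    # Build a normalized Gaussian kernel of width ~4*sigma
    kernel_size = max(int(4 * sigma), 1)
    kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size // 2) / sigma) ** 2)
    kernel = kernel / kernel.sum()
    return np.convolve(x.astype(float), kernel, mode="same")

def process_chunk_serial(data, func, chunk=100):
    # Apply func to fixed-size chunks and stitch the results back together
    return np.concatenate([func(data[i:i + chunk]) for i in range(0, len(data), chunk)])

signal = np.sin(np.linspace(0, 10, 1000)) + 0.2 * np.random.random(1000)
smoothed = process_chunk_serial(signal, gaussian_filter_1d)
print(smoothed.shape)  # (1000,)
```

One caveat of per-chunk filtering: each chunk is convolved independently, so small edge artifacts appear at chunk boundaries unless you overlap chunks by the kernel width.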


large_array = zarr.array(np.random.random(10000), chunks=(1000,),
                         store=str(tutorial_dir / 'large.zarr'), zarr_format=2)


start_time = time.time()
chunk_size = 1000
filtered_data = []
for i in range(0, len(large_array), chunk_size):
   end_idx = min(i + chunk_size, len(large_array))
   chunk_data = large_array[i:end_idx]
   smoothed = np.convolve(chunk_data, np.ones(5)/5, mode="same")
   filtered_data.append(smoothed)


result = np.concatenate(filtered_data)
processing_time = time.time() - start_time


print(f"Chunk-aware processing time: {processing_time:.4f}s")
print(f"Processed {len(large_array):,} elements")


print("\n=== VISUALIZATION ===")


fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Advanced Zarr Tutorial - Data Visualization', fontsize=16)


axes[0,0].plot(temporal_slice)
axes[0,0].set_title('Temporal Evolution (Single Pixel)')
axes[0,0].set_xlabel('Day of Year')
axes[0,0].set_ylabel('Temperature')


im1 = axes[0,1].imshow(spatial_slice, cmap='viridis')
axes[0,1].set_title('Spatial Pattern (Day 100)')
plt.colorbar(im1, ax=axes[0,1])


methods = list(sizes.keys())
ratios = [sizes[m] / original_size for m in methods]
axes[0,2].bar(range(len(methods)), ratios)
axes[0,2].set_xticks(range(len(methods)))
axes[0,2].set_xticklabels(methods, rotation=45)
axes[0,2].set_title('Compression Ratios')
axes[0,2].set_ylabel('Size Ratio')


axes[1,0].imshow(max_projection, cmap='hot')
axes[1,0].set_title('Max Intensity Projection')


z_profile = np.mean(volume_data[25, :, 120:136, 120:136], axis=(1,2))
axes[1,1].plot(z_profile, 'o-')
axes[1,1].set_title('Z-Profile (Center Region)')
axes[1,1].set_xlabel('Z-slice')
axes[1,1].set_ylabel('Mean Intensity')


axes[1,2].plot(result[:1000])
axes[1,2].set_title('Processed Signal (First 1000 points)')
axes[1,2].set_xlabel('Sample')
axes[1,2].set_ylabel('Amplitude')


plt.tight_layout()
plt.show()

We improve performance by processing the data chunk by chunk, applying a simple smoothing filter without loading everything into memory. We then visualize the temporal patterns, spatial patterns, and compression effects, which lets us quickly see how chunking and compression have shaped the result. See the FULL CODES here.

print("\n=== TUTORIAL SUMMARY ===")
print("Zarr features demonstrated:")
print("✓ Multi-dimensional array creation and manipulation")
print("✓ Optimal chunking strategies for different access patterns")
print("✓ Advanced compression with multiple codecs")
print("✓ Hierarchical data organization with metadata")
print("✓ Advanced indexing and data views")
print("✓ Performance optimization techniques")
print("✓ Integration with visualization tools")


def show_tree(path, prefix="", max_depth=3, current_depth=0):
    if current_depth > max_depth:
        return
    items = sorted(path.iterdir())
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        if item.is_dir() and current_depth < max_depth:
            show_tree(item, prefix + ("    " if is_last else "│   "),
                      max_depth, current_depth + 1)


show_tree(tutorial_dir)

The tutorial concludes by reviewing everything we have covered: array creation, chunking, compression, hierarchical organization, and indexing. We also walk the files that were generated and verify the total disk usage.

We finish by moving beyond the basics to a holistic view of how Zarr fits into modern data workflows: it compresses data to optimize storage, organizes complex experiments into hierarchical structures, and gives easy access to large datasets. Performance techniques such as chunk-aware processing, together with visualization integration, show how the theory translates into practice.


Take a look at the FULL CODES here. Feel free to browse our GitHub Page for tutorials, codes, and notebooks, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur, Asif has a passion for harnessing Artificial Intelligence for social good. His latest venture, Marktechpost, is a media platform focused on Artificial Intelligence, known for in-depth coverage of machine learning and deep learning that is technically sound yet accessible to a broad audience. The platform draws over 2 million monthly views.
