
The Coding Guide for Implementing Zarr on Large Data Sets: Compression, Indexing and Visualization Techniques

Tech · By Gavin Wallace · 17/09/2025 · 8 Mins Read

This tutorial takes you on a journey through the features of Zarr, a library designed for efficient storage and manipulation of large, multidimensional arrays. We start with the basics: creating arrays, setting up chunking, and editing values on disk. We then move on to more complex operations, such as testing different chunk sizes and access patterns. We also apply multiple compression codecs to balance storage size against speed, and compare their performance on synthetic datasets. In addition, we build hierarchical data structures enriched with metadata, demonstrate realistic workflows using time-series and volumetric datasets, and show advanced indexing techniques for extracting meaningful subsets. See the FULL CODES here.

!pip install zarr numcodecs -q
import zarr
import numpy as np
import matplotlib.pyplot as plt
from numcodecs import Blosc, Delta, FixedScaleOffset
import tempfile
import shutil
import os
from pathlib import Path


print(f"Zarr version: {zarr.__version__}")
print(f"NumPy version: {np.__version__}")


print("=== BASIC ZARR OPERATIONS ===")

Our tutorial begins by installing Zarr and Numcodecs alongside NumPy and Matplotlib. After setting up the environment, we verify the installed versions and dive into basic Zarr operations. See the FULL CODES here.

tutorial_dir = Path(tempfile.mkdtemp(prefix="zarr_tutorial_"))
print(f"Working directory: {tutorial_dir}")


z1 = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f4",
               store=str(tutorial_dir / 'basic_array.zarr'), zarr_format=2)
z2 = zarr.ones((500, 500, 10), chunks=(100, 100, 5), dtype="i4",
              store=str(tutorial_dir / 'multi_dim.zarr'), zarr_format=2)


print(f"2D Array shape: {z1.shape}, chunks: {z1.chunks}, dtype: {z1.dtype}")
print(f"3D Array shape: {z2.shape}, chunks: {z2.chunks}, dtype: {z2.dtype}")


z1[100:200, 100:200] = np.random.random((100, 100)).astype('f4')
z2[:, :, 0] = np.arange(500*500).reshape(500, 500)


print(f"Memory usage estimate: {z1.nbytes_stored() / 1024**2:.2f} MB")

Then we create our working directory and initialize two Zarr arrays: a two-dimensional array of zeros and a three-dimensional array of ones. We fill in random values and sequences, then check shape, chunk size, dtype, and stored size. See the FULL CODES here.

print("\n=== ADVANCED CHUNKING ===")


time_steps, height, width = 365, 1000, 2000
time_series = zarr.zeros(
   (time_steps, height, width),
   chunks=(30, 250, 500),
   dtype="f4",
   store=str(tutorial_dir / 'time_series.zarr'),
   zarr_format=2
)


for t in range(0, time_steps, 30):
   end_t = min(t + 30, time_steps)
   seasonal = np.sin(2 * np.pi * np.arange(t, end_t) / 365)[:, None, None]
   spatial = np.random.normal(20, 5, (end_t - t, height, width))
   time_series[t:end_t] = (spatial + 10 * seasonal).astype('f4')


print(f"Time series created: {time_series.shape}")
print("Approximate chunks created")


import time
start = time.time()
temporal_slice = time_series[:, 500, 1000]
temporal_time = time.time() - start


start = time.time()
spatial_slice = time_series[100, :200, :200]
spatial_time = time.time() - start


print(f"Temporal access time: {temporal_time:.4f}s")
print(f"Spatial access time: {spatial_time:.4f}s")

In this step we simulate a year-long time-series dataset with chunking chosen to support both temporal and spatial access. We add seasonal patterns and spatial noise, then measure access speeds to see first-hand how chunking affects performance. See the FULL CODES here.
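The code above only prints a placeholder instead of an actual chunk count. The number of chunks Zarr creates can be derived directly from the array and chunk shapes; here is a small self-contained sketch with this tutorial's shapes hard-coded:

```python
import math

# Shapes used for the time-series array above
shape = (365, 1000, 2000)   # (time_steps, height, width)
chunks = (30, 250, 500)

# Zarr rounds up per dimension, so partial edge chunks still count
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)
print(chunks_per_dim, n_chunks)  # [13, 4, 4] 208
```

The temporal slice above touches all 13 time-chunks for one pixel, while the spatial slice touches a single time-chunk, which is why their access times differ.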

print("\n=== COMPRESSION AND CODECS ===")


data = np.random.randint(0, 1000, (1000, 1000), dtype='i4')


from zarr.codecs import BloscCodec, BytesCodec


z_none = zarr.array(data, chunks=(100, 100),
                  codecs=[BytesCodec()],
                  store=str(tutorial_dir / 'no_compress.zarr'))


z_lz4 = zarr.array(data, chunks=(100, 100),
                  codecs=[BytesCodec(), BloscCodec(cname="lz4", clevel=5)],
                  store=str(tutorial_dir / 'lz4_compress.zarr'))


z_zstd = zarr.array(data, chunks=(100, 100),
                   codecs=[BytesCodec(), BloscCodec(cname="zstd", clevel=9)],
                   store=str(tutorial_dir / 'zstd_compress.zarr'))


sequential_data = np.cumsum(np.random.randint(-5, 6, (1000, 1000)), axis=1)
z_delta = zarr.array(sequential_data, chunks=(100, 100),
                    codecs=[BytesCodec(), BloscCodec(cname="zstd", clevel=5)],
                    store=str(tutorial_dir / 'sequential_compress.zarr'))


sizes = {
   'No compression': z_none.nbytes_stored(),
   'LZ4': z_lz4.nbytes_stored(),
   'ZSTD': z_zstd.nbytes_stored(),
   'Sequential+ZSTD': z_delta.nbytes_stored()
}


print("Compression comparison:")
original_size = data.nbytes
for name, size in sizes.items():
   ratio = size / original_size
   print(f"{name}: {size/1024**2:.2f} MB (ratio: {ratio:.3f})")


print("\n=== HIERARCHICAL DATA ORGANIZATION ===")


root = zarr.open_group(str(tutorial_dir / 'experiment.zarr'), mode="w")


raw_data = root.create_group('raw_data')
processed = root.create_group('processed')
metadata = root.create_group('metadata')


raw_data.create_dataset('images', shape=(100, 512, 512), chunks=(10, 128, 128), dtype="u2")
raw_data.create_dataset('timestamps', shape=(100,), dtype="datetime64[ns]")


processed.create_dataset('normalized', shape=(100, 512, 512), chunks=(10, 128, 128), dtype="f4")
processed.create_dataset('features', shape=(100, 50), chunks=(20, 50), dtype="f4")


root.attrs['experiment_id'] = 'EXP_2024_001'
root.attrs['description'] = 'Advanced Zarr tutorial demonstration'
root.attrs['created'] = str(np.datetime64('2024-01-01'))


raw_data.attrs['instrument'] = 'Synthetic Camera'
raw_data.attrs['resolution'] = [512, 512]
processed.attrs['normalization'] = 'z-score'


timestamps = np.datetime64('2024-01-01') + np.arange(100) * np.timedelta64(1, 'h')
raw_data['timestamps'][:] = timestamps


for i in range(100):
   frame = np.random.poisson(100 + 50 * np.sin(2 * np.pi * i / 100), (512, 512)).astype('u2')
   raw_data['images'][i] = frame


print(f"Created hierarchical structure with {len(list(root.group_keys()))} groups")
print("Data arrays and groups created successfully")


print("\n=== ADVANCED INDEXING ===")


volume_data = zarr.zeros((50, 20, 256, 256), chunks=(5, 5, 64, 64), dtype="f4",
                       store=str(tutorial_dir / 'volume.zarr'), zarr_format=2)


for t in range(50):
    for z in range(20):
        y, x = np.ogrid[:256, :256]
       center_y, center_x = 128 + 20*np.sin(t*0.1), 128 + 20*np.cos(t*0.1)
       focus_quality = 1 - abs(z - 10) / 10
      
       signal = focus_quality * np.exp(-((y-center_y)**2 + (x-center_x)**2) / (50**2))
       noise = 0.1 * np.random.random((256, 256))
       volume_data[t, z] = (signal + noise).astype('f4')


print("Various slicing operations:")


max_projection = np.max(volume_data[:, 10], axis=0)
print(f"Max projection shape: {max_projection.shape}")


z_stack = volume_data[25, :, 100:156, 100:156]
print(f"Z-stack subset: {z_stack.shape}")


bright_pixels = volume_data[volume_data > 0.5]
print(f"Pixels above threshold: {len(bright_pixels)}")
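Beyond slicing and boolean masks, Zarr also distinguishes orthogonal indexing (`.oindex`, the outer product of the index lists) from coordinate indexing (`.vindex`, pointwise). The same distinction exists in plain NumPy, shown here as a self-contained sketch; `.oindex` and `.vindex` behave analogously on Zarr arrays:

```python
import numpy as np

a = np.arange(36).reshape(6, 6)
rows, cols = [0, 2, 4], [1, 3, 5]

# Pointwise (NumPy fancy indexing, like zarr's .vindex):
# picks elements (0,1), (2,3), (4,5) -> 3 values
pointwise = a[rows, cols]
print(pointwise)

# Orthogonal (like zarr's .oindex): the full 3x3 cross product
orthogonal = a[np.ix_(rows, cols)]
print(orthogonal.shape)  # (3, 3)
```

Orthogonal indexing is usually the better fit for chunked storage, since it selects whole rows and columns of chunks rather than scattered points.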

Comparing on-disk sizes with LZ4, ZSTD, and no compression shows the real-world savings. We then organize the experiment into a Zarr hierarchy, complete with images, timestamps, and rich attributes. Finally, we create a 4D volume and explore advanced indexing: sub-stacking, thresholding, and max projections. See the FULL CODES here.

print("\n=== PERFORMANCE OPTIMIZATION ===")


def process_chunk_serial(data, func):
   results = []
   for i in range(0, len(data), 100):
       chunk = data[i:i+100]
       results.append(func(chunk))
   return np.concatenate(results)


def gaussian_filter_1d(x, sigma=1.0):
   kernel_size = int(4 * sigma)
   if kernel_size == 0:
       kernel_size += 1
   kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size//2) / sigma)**2)
   kernel = kernel / kernel.sum()
   return np.convolve(x.astype(float), kernel, mode="same")
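The two helpers above, `process_chunk_serial` and `gaussian_filter_1d`, are defined but never exercised together in the tutorial. A self-contained sketch of how they compose (the helpers are restated here so the snippet runs on its own):

```python
import numpy as np

def gaussian_filter_1d(x, sigma=1.0):
    # Build a normalized Gaussian kernel of width ~4*sigma
    kernel_size = max(int(4 * sigma), 1)
    kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size // 2) / sigma) ** 2)
    kernel = kernel / kernel.sum()
    return np.convolve(x.astype(float), kernel, mode="same")

def process_chunk_serial(data, func, chunk=100):
    # Apply func to fixed-size chunks and stitch the results back together
    return np.concatenate([func(data[i:i + chunk]) for i in range(0, len(data), chunk)])

signal = np.sin(np.linspace(0, 10, 1000)) + 0.2 * np.random.random(1000)
smoothed = process_chunk_serial(signal, gaussian_filter_1d)
print(smoothed.shape)  # (1000,)
```

One caveat of per-chunk filtering: each chunk is convolved independently, so small edge artifacts appear at chunk boundaries unless you overlap chunks by the kernel width.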


large_array = zarr.array(np.random.random(10000), chunks=(1000,),
                         store=str(tutorial_dir / 'large.zarr'), zarr_format=2)


start_time = time.time()
chunk_size = 1000
filtered_data = []
for i in range(0, len(large_array), chunk_size):
   end_idx = min(i + chunk_size, len(large_array))
   chunk_data = large_array[i:end_idx]
   smoothed = np.convolve(chunk_data, np.ones(5)/5, mode="same")
   filtered_data.append(smoothed)


result = np.concatenate(filtered_data)
processing_time = time.time() - start_time


print(f"Chunk-aware processing time: {processing_time:.4f}s")
print(f"Processed {len(large_array):,} elements")


print("\n=== VISUALIZATION ===")


fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Advanced Zarr Tutorial - Data Visualization', fontsize=16)


axes[0,0].plot(temporal_slice)
axes[0,0].set_title('Temporal Evolution (Single Pixel)')
axes[0,0].set_xlabel('Day of Year')
axes[0,0].set_ylabel('Temperature')


im1 = axes[0,1].imshow(spatial_slice, cmap='viridis')
axes[0,1].set_title('Spatial Pattern (Day 100)')
plt.colorbar(im1, ax=axes[0,1])


methods = list(sizes.keys())
ratios = [sizes[m] / original_size for m in methods]
axes[0,2].bar(range(len(methods)), ratios)
axes[0,2].set_xticks(range(len(methods)))
axes[0,2].set_xticklabels(methods, rotation=45)
axes[0,2].set_title('Compression Ratios')
axes[0,2].set_ylabel('Size Ratio')


axes[1,0].imshow(max_projection, cmap='hot')
axes[1,0].set_title('Max Intensity Projection')


z_profile = np.mean(volume_data[25, :, 120:136, 120:136], axis=(1,2))
axes[1,1].plot(z_profile, 'o-')
axes[1,1].set_title('Z-Profile (Center Region)')
axes[1,1].set_xlabel('Z-slice')
axes[1,1].set_ylabel('Mean Intensity')


axes[1,2].plot(result[:1000])
axes[1,2].set_title('Processed Signal (First 1000 points)')
axes[1,2].set_xlabel('Sample')
axes[1,2].set_ylabel('Amplitude')


plt.tight_layout()
plt.show()

We improve performance by processing the data chunk by chunk, applying a simple smoothing filter without loading everything into memory. We then visualize the temporal patterns, spatial patterns, and compression effects, which lets us quickly see how chunking and compression have shaped the result. See the FULL CODES here.

print("\n=== TUTORIAL SUMMARY ===")
print("Zarr features demonstrated:")
print("✓ Multi-dimensional array creation and manipulation")
print("✓ Optimal chunking strategies for different access patterns")
print("✓ Advanced compression with multiple codecs")
print("✓ Hierarchical data organization with metadata")
print("✓ Advanced indexing and data views")
print("✓ Performance optimization techniques")
print("✓ Integration with visualization tools")


def show_tree(path, prefix="", max_depth=3, current_depth=0):
    if current_depth > max_depth:
        return
    items = sorted(path.iterdir())
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        if item.is_dir() and current_depth < max_depth:
            show_tree(item, prefix + ("    " if is_last else "│   "),
                      max_depth, current_depth + 1)


show_tree(tutorial_dir)

The tutorial concludes by reviewing everything we have covered: array creation, chunking, compression, hierarchical organization, and indexing. We also walk the files that were generated and verify the total disk usage.

We finish by moving beyond the basics to a holistic view of how Zarr fits into modern data workflows: it compresses data to optimize storage, organizes complex experiments into hierarchical structures, and gives easy access to large datasets. Performance techniques such as chunk-aware processing, together with visualization integration, show how the theory translates into practice.


Take a look at the FULL CODES here. Feel free to browse our GitHub Page for tutorials, codes, and notebooks, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur, Asif has a passion for harnessing Artificial Intelligence for social good. His latest venture, Marktechpost, is a media platform focused on Artificial Intelligence, known for in-depth coverage of machine learning and deep learning that is technically sound yet accessible to a broad audience. The platform draws over 2 million monthly views.
