BRIGHT: Bi-level Feature Representation of Image Collections using Groups of Hash Tables

Simon Fraser University
Concept of our method.

BRIGHT projects images into key codes and then uses these key codes to retrieve features from hash tables, instead of directly encoding images into features.
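As a rough illustration of this idea, here is a minimal PyTorch sketch (not the authors' implementation) of a continuous key code addressing a learnable table: linear interpolation between adjacent entries keeps the lookup differentiable with respect to both the key and the table entries. All sizes and names below are placeholders.

import torch
import torch.nn as nn

class ContinuousLookup(nn.Module):
    """Illustrative differentiable table lookup driven by a continuous key code."""
    def __init__(self, num_entries=512, feat_dim=32):   # illustrative sizes
        super().__init__()
        self.table = nn.Parameter(torch.randn(num_entries, feat_dim) * 0.01)

    def forward(self, key):                      # key in [0, 1], shape (B,)
        pos = key * (self.table.shape[0] - 1)    # continuous position in the table
        lo = pos.floor().long().clamp(max=self.table.shape[0] - 2)
        w = (pos - lo.float()).unsqueeze(-1)     # interpolation weight
        # blend two neighbouring entries, so gradients reach both the key and the table
        return (1 - w) * self.table[lo] + w * self.table[lo + 1]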

Uncurated samples generated by the diffusion model trained on our key codes.

Abstract

We present BRIGHT, a bi-level feature representation for an image collection, consisting of a per-image latent space on top of a multi-scale feature grid space. Our representation is learned by an autoencoder that encodes images into continuous key codes, which are used to retrieve features from groups of multi-resolution hash tables. The key codes and hash tables are trained together continuously with well-defined gradient flows, leading to high usage of the hash table entries and improved generative modeling compared to discrete Vector Quantization (VQ). Unlike existing continuous representations such as KL-regularized latent codes, our key codes are strictly bounded in scale and variance. Overall, feature encoding by BRIGHT is compact, efficient to train, and enables generative modeling over the image codes using state-of-the-art generators such as latent diffusion models (LDMs). Experimental results show that our method achieves reconstruction results comparable to VQ methods while having a smaller and more efficient decoder network. By applying an LDM over our key code space, we achieve state-of-the-art performance on image synthesis on the LSUN-Church and human-face datasets.
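To make the abstract's description more concrete, the following sketch reflects our reading of the method (not the released code): the encoder output is squashed with a sigmoid to obtain a strictly bounded key code, which queries a group of tables at different resolutions, and the retrieved features are concatenated for the decoder. The class name, table sizes, and the choice of sigmoid are assumptions for illustration.

import torch
import torch.nn as nn

class MultiResRetrieval(nn.Module):
    """Illustrative group of multi-resolution tables queried by one bounded key."""
    def __init__(self, sizes=(64, 256, 1024), feat_dim=16):   # illustrative sizes
        super().__init__()
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(s, feat_dim) * 0.01) for s in sizes]
        )

    def lookup(self, table, key):                # linear interpolation, as above
        pos = key * (table.shape[0] - 1)
        lo = pos.floor().long().clamp(max=table.shape[0] - 2)
        w = (pos - lo.float()).unsqueeze(-1)
        return (1 - w) * table[lo] + w * table[lo + 1]

    def forward(self, encoder_logits):           # raw encoder output, shape (B,)
        key = torch.sigmoid(encoder_logits)      # strictly bounded key code
        feats = [self.lookup(t, key) for t in self.tables]
        return torch.cat(feats, dim=-1)          # multi-scale feature for the decoder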

Video

Pipeline

Overall pipeline of our method in three parts: Encoding, Hash Retrieval and Decoding.
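A hedged sketch of how the three stages could be trained jointly end to end: a single reconstruction objective updates the encoder, the hash tables, and the decoder together, so table entries receive gradients directly rather than through VQ-style codebook losses. `make_encoder`, `make_decoder`, `dataloader`, and the plain MSE loss are placeholders; the actual objective likely also includes perceptual and adversarial terms.

import torch
import torch.nn.functional as F

encoder = make_encoder()                  # hypothetical image encoder producing key logits
retrieval = MultiResRetrieval()           # the retrieval sketch from above
decoder = make_decoder()                  # hypothetical feature-to-image decoder

params = list(encoder.parameters()) + list(retrieval.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=2e-4)   # learning rate is illustrative

for images in dataloader:                 # placeholder dataloader
    key_logits = encoder(images)          # Encoding: per-image continuous key codes
    feats = retrieval(key_logits)         # Hash Retrieval: differentiable table lookup
    recon = decoder(feats)                # Decoding: reconstruct the input image
    loss = F.mse_loss(recon, images)      # placeholder reconstruction loss
    opt.zero_grad()
    loss.backward()                       # gradients flow into the table entries as well
    opt.step()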

Results

Reconstruction

Our largest model (code size 16 × 16 × 16) has only 10M more parameters (≈ 24% more) than the corresponding VQGAN and RQ-VAE models. Nevertheless, it outperforms VQGAN and RQ-VAE across all metrics, with 55% and 41% improvements in LPIPS, respectively. In terms of computational cost, our method requires ≈ 50% fewer GFLOPs thanks to its smaller decoder and the efficiency of feature retrieval from the hash tables.
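For reference, this is one way the LPIPS numbers on a validation split could be computed with the lpips package; the paper's exact evaluation protocol and preprocessing may differ, and the module names reuse the placeholder sketches above.

import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')                  # standard AlexNet-based LPIPS
scores = []
for images in val_loader:                          # placeholder loader, images in [-1, 1]
    with torch.no_grad():
        recon = decoder(retrieval(encoder(images)))
        scores.append(loss_fn(recon, images).mean())
print(f"validation LPIPS: {torch.stack(scores).mean().item():.4f}")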

Reconstruction metrics on the validation splits of the FFHQ and LSUN-Church datasets.

Trainable parameters and computational load of the decoders. An * indicates the total number of parameters in the hash tables and the total computational cost of decoding plus feature retrieval from the hash tables.


Generation

The precision of our generated images is significantly higher than that of competing methods, indicating a substantial reduction of low-quality samples in our results, while our recall is almost the same as that of StyleGAN2. Our method also achieves competitive FID, CLIP-FID, and Inception Score results.
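As an illustration, FID and Inception Score can be computed with torchmetrics as below; the numbers in the tables were likely produced with other reference implementations, and precision/recall are computed separately from Inception features, so treat this only as a sketch of the evaluation setup. Loader names are placeholders.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

fid = FrechetInceptionDistance(feature=2048)       # Inception pool features
is_metric = InceptionScore()

for real in real_loader:                           # uint8 tensors of shape (N, 3, H, W)
    fid.update(real, real=True)
for fake in fake_loader:                           # samples from the trained LDM
    fid.update(fake, real=False)
    is_metric.update(fake)

print("FID:", fid.compute().item())
mean_is, std_is = is_metric.compute()              # InceptionScore returns (mean, std)
print("Inception Score:", mean_is.item())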

Quantitative results of generation on the FFHQ dataset. * denotes results computed with the publicly released checkpoint from the LDM authors on GitHub.

Quantitative results of generation on the LSUN-Church dataset.

Nearest Neighbour

Although our method achieves much higher precision scores than previous methods, we show nearest-neighbour searches by LPIPS to demonstrate that our generated samples are novel rather than retrievals from the training dataset. In the images below, the leftmost image in each row is generated by our method and the remaining images in that row are its nearest neighbours in the training set.
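A minimal sketch of such a search: for each generated sample, rank training images by LPIPS distance and keep the closest ones. The tensor names are placeholders, and in practice the training set would be processed in batches or pre-featurized for efficiency.

import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')

@torch.no_grad()
def nearest_neighbours(generated, train_images, k=5):
    # generated: (1, 3, H, W); train_images: (N, 3, H, W); both scaled to [-1, 1]
    dists = torch.cat([loss_fn(generated, img.unsqueeze(0)).flatten()
                       for img in train_images])
    return dists.topk(k, largest=False).indices    # indices of the k closest training images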

BibTeX

@misc{yang2023bright,
      title={BRIGHT: Bi-level Feature Representation of Image Collections using Groups of Hash Tables}, 
      author={Dingdong Yang and Yizhi Wang and Ali Mahdavi-Amiri and Hao Zhang},
      year={2023},
      eprint={2305.18601},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}