Many current AI models are too large and expensive to deploy in practical applications. In this project, you will explore one of the techniques below (or a combination of them) to make such models more efficient:
Speculative Decoding:
* Vision-Language Model Speculative Decoding
* Grammar-based Drafters with Token Tree Verification
* Speculative Decoding for Diffusion Language Models
Sparsification:
* Knowledge-Distillation-based LLM Structured Pruning
* Task-specific LLM Structured Pruning
* VLM/LLM Token Pruning
* Vocabulary Pruning
* Context pruning / prompt pruning / KV-cache pruning
Quantization:
* Extreme few-bit Quantization (ternary, 2-bit, etc.)
* Mixed precision / mixed bit-width quantization
* CNN quantization
* Vision-Language-Action model quantization
* Diffusion LM quantization
Note: This project is external and requires interviews with the company.
Joaquin Vanschoren