Kaoken optimizes CPU inference for small language models (under 500M parameters) such as BERT and T5. By enabling fast on-device processing, it addresses privacy concerns and cuts the cost associated with cloud GPU usage.
Kaoken targets applications that need prompt responses without depending on a remote model: on-device inference for real-time text completion or image segmentation software, especially when relying on larger foundation models (like Claude and Gemini) is limited or impractical due to cost or privacy concerns.
Why Kaoken?
Consider a bot running image segmentation on an embedded ARM device. Relying on a remote server for that processing hurts performance due to latency. Kaoken executes these workloads on-device, drastically improving speed and reliability while avoiding the operating costs of GPU inference, which typically range from $0.50 to $0.70 per hour.
Optimization Strategies
Kaoken leverages standard optimization techniques to speed up inference for smaller models on inexpensive CPUs. With the rise of Transformer-based architectures, most of these models share the same components, such as Attention and Layer Normalization. This makes them easy to dissect with frameworks like PyTorch, which we use to print and analyze a model's components before optimizing them (see the sketch below).
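As a rough illustration of that inspection step, a few lines of PyTorch are enough to enumerate a model's layers and weight shapes. This is a minimal sketch assuming the Hugging Face transformers package and the bert-base-uncased checkpoint; Kaoken's actual tooling may differ.

```python
# Minimal sketch: dissecting a Transformer with PyTorch to list its layers
# and weight shapes. Assumes the `transformers` package and the
# `bert-base-uncased` checkpoint; not Kaoken's actual inspection code.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Walk every submodule and report its type plus the shapes of its own weights.
for name, module in model.named_modules():
    shapes = {p_name: tuple(p.shape)
              for p_name, p in module.named_parameters(recurse=False)}
    if shapes:
        print(f"{name}: {type(module).__name__} {shapes}")
```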
Baked Models
The core innovation of Kaoken is the generation of baked models: the full model specification is compiled into specialized C code for optimized performance. Baking works because, at inference time, we already know:
- The specific layers involved
- The input and output dimensions for each layer
- The computations each layer performs, primarily matrix multiplication and addition.
For example, a simple PyTorch operation for matrix multiplication can be transformed into efficient C code:
output[0][0] = (a[0][0] * weights[0][0]) + (a[0][1] * weights[1][0]);
Because the shapes and weights are known when the model is baked, the multiply is fully unrolled and the weight accesses are fixed at compile time, which lets the C compiler optimize the result for faster execution.
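For reference, the PyTorch operation being baked can be as simple as a single matrix multiply; the tensors below are illustrative stand-ins chosen to match the unrolled C line above, not Kaoken's actual code.

```python
# Illustrative PyTorch source for the C line above: a (1 x 2) input times a
# (2 x 1) weight matrix produces the single element output[0][0]. Real layers
# are just larger versions of this pattern.
import torch

a = torch.tensor([[1.0, 2.0]])          # input, shape (1, 2)
weights = torch.tensor([[3.0], [4.0]])  # weights, shape (2, 1)

output = a @ weights                    # shape (1, 1)
# output[0][0] == a[0][0] * weights[0][0] + a[0][1] * weights[1][0]
print(output)
```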
Performance Validation
Kaoken's claims are backed by per-layer benchmarking. Validation confirms that the output of a baked model matches the corresponding PyTorch implementation, and each layer is stress-tested for both correctness and speed. Significant performance improvements have already been measured (a sketch of the correctness check follows the table):
| Layer | Input Shape | PyTorch (ms) | Kaoken (ms) |
|---|---|---|---|
| Layer Normalization | [1, 768] | 0.052 | 0.0052 |
| GELU Activation | [1, 768] | 0.174 | 0.084 |
| Linear | [768, 50257] | 22.4 | 2.378 |
| Attention | [1, 768] | 14.484 | 9.386 |
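The per-layer correctness check can be sketched as follows. The `kaoken.run_baked_layer` call is hypothetical and is stubbed here with PyTorch's functional LayerNorm so the example runs on its own; the real interface may differ.

```python
# Sketch of the per-layer validation: run the same input through PyTorch and
# through the baked layer, then require the outputs to agree within tolerance.
# The baked-layer call is hypothetical, not Kaoken's actual API.
import torch

torch.manual_seed(0)
layer = torch.nn.LayerNorm(768)
x = torch.randn(1, 768)

expected = layer(x)

# Hypothetical binding into the baked implementation:
# actual = kaoken.run_baked_layer("layer_norm", x)
actual = torch.nn.functional.layer_norm(x, (768,), layer.weight, layer.bias)

assert torch.allclose(expected, actual, atol=1e-5), "baked output diverges from PyTorch"
print("layer_norm: outputs match")
```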
Goals and Future Work
While Kaoken delivers notable gains in small-model inference efficiency on CPUs, challenges remain, particularly the cost of compiling the large C source files generated from baked models. Future work includes streamlining the code-generation process and ensuring that Kaoken can accommodate any model and its specific requirements.
Kaoken represents a step forward in making highly efficient, cost-effective inference for language models more accessible. By empowering developers with new tools for model deployment and optimization, Kaoken is reshaping the landscape of on-device AI.