Choosing Between Long-Context Embedding Models
Discover How to Select the Best Embeddings and Explore the Business Viability of Offering an Embeddings API
Let's dive into why embeddings are fundamental in today's tech landscape. In the realm of search and chat technologies, the goal extends beyond mere keyword hunting; you aim to uncover results that resonate with the deeper meaning of your queries. This is where embeddings shine. By transforming documents and user queries into vectors, you can compare and retrieve information based on semantic similarity, not just literal matches.
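To make "semantic similarity, not just literal matches" concrete, here is a minimal sketch using cosine similarity over toy 3-dimensional vectors. The vectors are made up for illustration; real embedding models produce hundreds or thousands of dimensions, but the comparison works the same way.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: ~1.0 means same direction (similar meaning)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real model outputs. A query like
# "How do I reset my password?" and a document like "Steps to recover
# account access" share almost no keywords, but a good embedding model
# places them close together in vector space.
query = np.array([0.9, 0.1, 0.2])
semantic_match = np.array([0.85, 0.15, 0.25])  # similar meaning
unrelated_doc = np.array([0.1, 0.9, 0.1])      # different topic

print(cosine_similarity(query, semantic_match))  # high, ~0.99
print(cosine_similarity(query, unrelated_doc))   # low, ~0.24
```

A vector search engine simply runs this comparison (or an optimized approximation of it) between the query vector and every document vector, returning the closest matches.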
Most retrieval-augmented generation (RAG) systems and search engines today use embeddings. Providing vector-based search for data that was previously accessible only through keywords opens up new markets for startups.
Moreover, new embedding models are coming out every week, making semantic search better and cheaper, continually expanding the potential applications.
How to Choose the Right Embedding Model
When choosing an embedding model, you should weigh the following factors:
Search quality: Look for use-case-specific benchmarks. But nothing beats testing on your data.
Number of dimensions: Affects both search quality and computational requirements. More dimensions typically capture finer distinctions but slow down search and increase storage and memory usage.
Size of the model: Larger models can offer more capability but are also more expensive to run. That said, you will often notice that newer, smaller models beat older, bigger ones.
Context length: Determines how much text the model can handle at a time. Ensure the embeddings you choose can handle the length of texts you intend to process.
Supported languages: Many models are trained mostly on English data. If your application requires handling multiple languages, look for multilingual models that can effectively process the specific languages you need.
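The dimensions-vs-resources tradeoff above is easy to quantify. This sketch estimates the RAM a flat float32 vector index needs; the corpus size and dimension counts are illustrative assumptions (3072 and 768 are common dimensionalities, e.g. OpenAI's large model vs. many open-source models), and compression techniques like quantization would shrink these numbers.

```python
def index_size_gb(n_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """RAM needed to hold n_docs float32 vectors of the given dimensionality."""
    return n_docs * dims * bytes_per_value / 1024**3

# 1M document chunks: a 3072-dim model vs. a 768-dim model
print(round(index_size_gb(1_000_000, 3072), 2))  # ~11.44 GB
print(round(index_size_gb(1_000_000, 768), 2))   # ~2.86 GB
```

A 4x difference in dimensions is a 4x difference in index size and, roughly, in brute-force search time, which is why dimension count matters as much as benchmark scores.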
Start by checking the MTEB Embeddings Leaderboard on Hugging Face. Unfortunately, the newest models are often missing from it, and it lacks multilingual performance metrics.
For the Tendery use case, we have to support all European languages, handle large contexts (at least 4,000, ideally 8,000+ tokens), deliver good search quality, and keep the cost of converting 100k+ documents under $50. With these requirements, very few models were available.
Nomic-1.5 looked promising on paper, but in my tests the search quality was underwhelming, especially for non-English texts.
BAAI/bge-m3, on the other hand, performed well on all metrics. After several optimizations (quantization, parallelization), I was able to process about 5 documents per second on the free Kaggle setup of 2 T4 GPUs. That performance was acceptable; the only problem is that there is no hosted API for this model. In an ideal world, I would run my offline workflows myself. However, when a user performs a search, their query has to be converted into embeddings in real time. For that, you either need to find an API or keep an instance running around the clock, which would make it very expensive.
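At 5 documents per second, the offline indexing pass is a quick back-of-the-envelope calculation (the document count and throughput are the figures from above):

```python
def hours_to_embed(n_docs: int, docs_per_second: float) -> float:
    """Wall-clock hours to embed a corpus at a given throughput."""
    return n_docs / docs_per_second / 3600

# 100k+ documents on the free 2x T4 Kaggle setup at ~5 docs/sec
print(round(hours_to_embed(100_000, 5), 1))  # ~5.6 hours
```

Roughly half a day of free GPU time is fine for a one-off batch job; it is the always-on real-time query path that drives the need for an API.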
All other long-context models were either too big to run more cheaply than OpenAI or were only available through APIs that were also more expensive. If you are reading this after April 2024, this information may no longer be up to date.
Given the need for an API, we opted for OpenAI's embedding service. To make it fit our requirements, we put extra effort into stripping non-essential data from our documents.
There are about 10 other companies providing embedding APIs, but none of them are competitive with OpenAI's pricing: $0.02 (small) and $0.13 (large) per million tokens. And the APIs that are comparable in price lack the required features.
Furthermore, OpenAI released a new Batch API that cuts costs by 50% if you are willing to wait hours for your embeddings.
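Plugging these prices into our document budget shows why OpenAI fit the under-$50 requirement. The average of 2,000 tokens per document is my illustrative assumption; the per-million prices and the 50% batch discount are as stated above.

```python
def embedding_cost(total_tokens: int, price_per_million: float,
                   batch: bool = False) -> float:
    """Embedding cost in dollars; the Batch API halves the price."""
    cost = total_tokens / 1_000_000 * price_per_million
    return cost / 2 if batch else cost

# Assumed corpus: 100k documents averaging 2,000 tokens each = 200M tokens
tokens = 100_000 * 2_000
print(embedding_cost(tokens, 0.02))              # small model: ~$4
print(embedding_cost(tokens, 0.13))              # large model: ~$26
print(embedding_cost(tokens, 0.13, batch=True))  # large via Batch API: ~$13
```

Even the large model lands well under the $50 budget once the documents are cleaned of non-essential data.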
On April 9, Google released a promising new embedding model through its API. They claim to outperform OpenAI while producing vectors half the size. I would be very happy to try it, but their APIs are not yet available in Germany.
What about cloud solutions?
Amazon has two relevant offerings: Kendra and SageMaker. Kendra's pricing starts at $800 per month, and GPU machines on AWS are much more expensive than on clouds specialized in GPUs, such as Paperspace.
Azure AI Search and other options are also more expensive for our use case than OpenAI.
Google Embeddings API: Although potentially a strong competitor in terms of technology, this service was not available in Germany at the time of our assessment.
Algolia: An out-of-the-box search solution. Unfortunately, vector search is only available on their enterprise plan, and multilingual search did not work well in my tests.
Embeddings as a SaaS Business
With all that in mind, the immediate question is: why not wrap BAAI/bge-m3 and a few alternatives into an API and sell it cheaper than the competition?
The short answer is that the margins are too small. In a SaaS company you aim for margins above 80%; this idea would produce margins well below 50%.
Let's do a back-of-the-napkin calculation together. We were able to process about 20,000 tokens per second on 2 T4s. I'll assume an A100 does about 5x that, i.e., 100k tokens per second at 100% utilization; since a real-time service normally targets around 50% utilization, call it 50k tokens per second per A100. Even with perfect utilization at $0.01 per million tokens, that yields about $1,300 per month, while one A100 costs around $1,000 per month.
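The napkin math above, spelled out. The throughput and prices are the assumptions from the text: 50k tokens/sec per A100 at 50% utilization, undercutting OpenAI at $0.01 per million tokens, and roughly $1,000/month to rent the GPU.

```python
TOKENS_PER_SECOND = 50_000    # assumed A100 throughput at ~50% utilization
PRICE_PER_MILLION = 0.01      # dollars, half of OpenAI's small-model price
GPU_COST_PER_MONTH = 1_000    # dollars, assumed A100 rental

seconds_per_month = 60 * 60 * 24 * 30
tokens_per_month = TOKENS_PER_SECOND * seconds_per_month
revenue = tokens_per_month / 1_000_000 * PRICE_PER_MILLION
margin = (revenue - GPU_COST_PER_MONTH) / revenue

print(round(revenue))           # ~$1,296/month per GPU
print(round(margin * 100))      # ~23% gross margin
```

A 23% gross margin, before any engineering, support, or sales costs, is a long way from the 80%+ a SaaS business targets.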
We could optimize many things to make this business more profitable, but the margins do not look promising. I expect this market to become commoditized.