Embedding

This service indexes document embeddings to support text similarity search.

  • The embedding service can be used for documents stored in this component or outside of it (e.g., FR insurance documents).
  • Current use cases include:
    • Documents (FR/BE insurance documents, global documents like retiree proofs, etc.)
    • Help center articles
    • Health guarantees labels
  • It supports fairly high volumes of documents (for instance, 2.5 million FR insurance documents are indexed).

How it works

During indexing, the text of each document is transformed into a vector using an embedding algorithm. These vectors are saved in our Postgres database using the PGVector extension.

During search, the query text is transformed into a vector (or the vector is fetched from the DB by document ID). The L2 (Euclidean) distance is then computed between this vector and the vectors of the indexed documents, and the documents with the smallest distance (that is, the highest similarity) are returned. You can restrict the search to a subset of documents by filtering on document metadata.
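The search step can be illustrated with a minimal in-memory sketch. It is not the service's implementation (which runs the comparison in Postgres); `IndexedDocument`, `l2_distance`, and `nearest_documents` are hypothetical names introduced only for this example.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexedDocument:
    # Hypothetical in-memory stand-in for one indexed row.
    doc_id: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_documents(
    query: list[float],
    documents: list[IndexedDocument],
    limit: int = 3,
    metadata_filter: Optional[dict[str, str]] = None,
) -> list[tuple[float, str]]:
    """Return (distance, doc_id) pairs, smallest distance first."""
    wanted = metadata_filter or {}
    # Keep only documents whose metadata matches every filter entry.
    candidates = [
        doc for doc in documents
        if all(doc.metadata.get(key) == value for key, value in wanted.items())
    ]
    scored = sorted((l2_distance(query, doc.vector), doc.doc_id) for doc in candidates)
    return scored[:limit]
```

Sorting in ascending order matters here: with L2 distance, the most similar documents are the ones with the smallest values. Applying the metadata filter before scoring means filtered-out documents are never compared.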

Adding a new use case

  • Define a new document type in the DocumentType enum.
  • Index the documents using the function index_document or index_documents.
  • Search for similar documents using the function find_similar_documents.
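The order of these steps can be sketched as follows. The real signatures of `index_documents` and `find_similar_documents` are not shown on this page, so the versions below are in-memory stand-ins with assumed parameters; the `faq_entry` member, the toy `embed` function, and `VOCABULARY` are likewise hypothetical.

```python
import math
from enum import Enum

# Hypothetical toy vocabulary; a real embedding model replaces `embed`.
VOCABULARY = ["dental", "optical", "hospital", "refund"]


class DocumentType(Enum):
    # Step 1: a hypothetical new member added for the new use case.
    faq_entry = "faq_entry"


def embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


# In-memory stand-in for the database: (document type, document ID) -> vector.
_INDEX: dict[tuple[DocumentType, str], list[float]] = {}


def index_documents(document_type: DocumentType, texts_by_id: dict[str, str]) -> None:
    """Step 2 (assumed signature): embed and store each document."""
    for doc_id, text in texts_by_id.items():
        _INDEX[(document_type, doc_id)] = embed(text)


def find_similar_documents(document_type: DocumentType, text: str, limit: int = 5) -> list[str]:
    """Step 3 (assumed signature): IDs of the closest documents of this type."""
    query = embed(text)
    scored = sorted(
        (math.dist(query, vector), doc_id)
        for (doc_type, doc_id), vector in _INDEX.items()
        if doc_type is document_type
    )
    return [doc_id for _, doc_id in scored[:limit]]
```

In this sketch, documents are searchable only after they have been indexed under the same document type, which is why indexing comes before the first search.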

Embedding algorithms

Different embedding algorithms can be used, each with pros and cons:

  • [OpenAI] text_embedding_3_large (recommended): Very good similarity search accuracy. Hosted on Azure, billed per request.
  • [OpenAI] text-embedding-ada-002 (legacy): Prefer using text_embedding_3_large for new use cases.
  • [Sentence Transformers] all-MiniLM-L6-v2: Fairly good similarity search accuracy, with faster search because the embedding vectors are much shorter. Hosted on AWS SageMaker.

Embedding algorithm migration

You can only compute the similarity between documents embedded with the same algorithm. If you want to change the algorithm, you need to re-index all your documents. Existing embeddings are kept in the database, because each algorithm's embeddings are saved in a separate DB column.
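A minimal sketch of this migration model, assuming one storage slot per algorithm on each document. `DocumentRecord`, `reindex`, and `searchable_with` are hypothetical names for illustration and do not reflect the actual database schema.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DocumentRecord:
    doc_id: str
    text: str
    # One slot per algorithm, mirroring "one DB column per algorithm".
    embeddings: dict[str, list[float]] = field(default_factory=dict)


def reindex(
    records: list[DocumentRecord],
    algorithm: str,
    embed: Callable[[str], list[float]],
) -> int:
    """Embed every record that lacks a vector for `algorithm`; return the count."""
    updated = 0
    for record in records:
        if algorithm not in record.embeddings:
            # Embeddings from other algorithms are left untouched.
            record.embeddings[algorithm] = embed(record.text)
            updated += 1
    return updated


def searchable_with(records: list[DocumentRecord], algorithm: str) -> list[str]:
    """IDs of documents that can be compared using `algorithm`."""
    return [record.doc_id for record in records if algorithm in record.embeddings]
```

Until `reindex` has run for the new algorithm, `searchable_with` returns nothing for it, which illustrates why a search with the new algorithm cannot reuse the old vectors.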