Skip to content

Extraction

The extraction step is the last step of the document parsing flow.

  • It consists of extracting structured information from the document content.
  • The expected structured information is defined in the document parsing configuration for each category.

How it works

  • The extraction uses a pydantic model to validate the extracted information.
  • The validation include basic JSON schema validation (similar to the one done in the manual parsing tool).
  • The extraction is validated using pydantic models for advanced validation that includes basic JSON schema validation done in the manual parsing tool.
  • It can go further by implementing safeguards to ensure that the output of the parser is valid.

Dynamic LLM Extractor

The dynamic LLM extractor is an LLM extractor using a dynamic few-shot prompting approach. When a document arrives for extraction (alongside its transcription and classification), we prepare a prompt for the LLM that includes:

  • Extraction instructions based on the document's classification
  • Few-shot examples of similar documents and their validated extraction outputs, retrieved from our reference dataset document pool using text similarity search on the transcriptions (using document embeddings)
  • The new document's Markdown transcription as the final user input
flowchart TB

    subgraph extraction ["Extraction (new document)"]
        direction TB
        A(New document arrives with transcription + classification) --> B(Prepare extraction instructions)
        A --> C(Compute embedding of transcription)
        C --> D(Query pgvector for N most similar documents by L2 distance)
        D --> E(For each similar doc: fetch markdown transcription + validated extraction output)
        E --> F(Build prompt)
        B --> F
    end


    subgraph prompt ["LLM prompt structure"]
        direction TB
        P1["SYSTEM: extraction instructions"]
        P2["SYSTEM (example_user): similar doc 1 transcription"]
        P3["SYSTEM (example_assistant): similar doc 1 extraction (JSON)"]
        P4["... more few-shot examples ..."]
        P5["USER: new document transcription"]
        P1 --- P2 --- P3 --- P4 --- P5
    end

    SE -.->|grows the example pool| D
    F --> prompt
    prompt --> G(Parse LLM response as structured extraction)

    G -.->|validated extraction feeds back into pool| V

    subgraph indexing ["Indexing (after validation)"]
        direction LR
        V(Document validated) --> CE(Compute embedding of transcription)
        CE --> SE(Store embedding + metadata in PostgreSQL with pgvector)
    end
Hold "Alt" / "Option" to enable pan & zoom