Extraction

The extraction step is the last step of the document parsing flow.

  • It consists of extracting structured information from the document content.
  • The expected structured information is defined in the document parsing configuration for each category.

How it works

  • The extracted information is validated with a pydantic model.
  • This validation includes basic JSON schema validation (similar to the validation done in the manual parsing tool).
  • The pydantic model can go further by implementing additional safeguards to ensure that the output of the parser is valid.
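
As an illustration of the points above, here is a minimal sketch of a pydantic model that validates an extraction result. The `InvoiceFields` schema, its field names, and the `validate_extraction` helper are hypothetical and only stand in for whatever a category's parsing configuration actually defines.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class InvoiceFields(BaseModel):
    """Hypothetical target schema for an 'invoice' category."""

    invoice_number: str = Field(min_length=1)
    issue_date: date
    # Safeguard beyond plain type checking: a total cannot be negative.
    total_amount: float = Field(ge=0)
    # Safeguard: expect a three-letter currency code such as "EUR".
    currency: str = Field(min_length=3, max_length=3)


def validate_extraction(raw: dict) -> Optional[InvoiceFields]:
    """Return the validated model, or None if the raw output is invalid."""
    try:
        return InvoiceFields(**raw)
    except ValidationError:
        return None


if __name__ == "__main__":
    raw_output = {
        "invoice_number": "INV-1",
        "issue_date": "2024-01-31",
        "total_amount": 120.5,
        "currency": "EUR",
    }
    print(validate_extraction(raw_output))
```

Type and required-field checks correspond to the basic schema validation, while the `Field` constraints show one possible way to add safeguards. A real configuration may rely on different rules.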

Dynamic LLM Extractor

The dynamic LLM extractor is an extractor that uses an LLM to extract structured information from a document. It provides the LLM with:

  • The extraction instruction.
  • The document transcription.
  • A set of similar examples, each with its input (transcription) and expected output (structured information). This set of similar examples is computed using document embeddings.
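
The three inputs listed above can be sketched as follows. This is only an illustration under stated assumptions: the `Example` class, the use of cosine similarity to rank examples, the prompt layout, and all function names are hypothetical, and the embeddings are assumed to be precomputed elsewhere. No LLM call is made here.

```python
import json
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    """Hypothetical stored example: input, expected output, and its embedding."""

    transcription: str
    expected_output: dict
    embedding: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_similar(query_embedding: Sequence[float],
                   examples: List[Example], k: int = 2) -> List[Example]:
    """Return the k examples whose embeddings are closest to the query."""
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_embedding, ex.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction: str, transcription: str,
                 examples: List[Example]) -> str:
    """Assemble instruction, similar examples, and the document to extract."""
    parts = [instruction, ""]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i} input:\n{ex.transcription}")
        parts.append(f"Example {i} output:\n{json.dumps(ex.expected_output)}")
        parts.append("")
    parts.append(f"Document:\n{transcription}")
    parts.append("Output:")
    return "\n".join(parts)
```

A caller would embed the new document, pass that vector to `select_similar`, and send the string returned by `build_prompt` to the LLM. The response could then be checked against the category's pydantic model.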