Transcription

The transcription step is the first step of the document parsing flow.

  • It converts the document content into a structured format that the classification and extraction steps can use.
  • Concretely, it transforms a document into transcribed text in Markdown format.
  • The Markdown format retains information about the structure of the document (headings, tables, etc.), which we use during extraction.

How it works

  • The transcription is done using AWS Textract.
  • The AWS Textract response consists of blocks of text with their positions (bounding boxes) and the relationships between blocks (e.g., blocks in the same table header).
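
To make the shape of the Textract response more concrete, here is a minimal sketch of how blocks could be turned into Markdown. The keys used (`BlockType`, `Geometry.BoundingBox`, `Relationships`, `RowIndex`, `ColumnIndex`) follow the Textract block format, but the helper names, the hand-written `SAMPLE_BLOCKS` data, the naive top-to-bottom ordering, and the choice to treat the first table row as the header are all assumptions for illustration, not the actual implementation of this step.

```python
from typing import Any

Block = dict[str, Any]


def child_ids(block: Block) -> list[str]:
    """Return the ids of a block's CHILD relationships, in listed order."""
    ids: list[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def cell_text(cell: Block, by_id: dict[str, Block]) -> str:
    """Join the WORD children of a CELL block into one string."""
    words = [by_id[i]["Text"] for i in child_ids(cell)
             if by_id[i]["BlockType"] == "WORD"]
    return " ".join(words)


def table_to_markdown(table: Block, by_id: dict[str, Block]) -> str:
    """Render a TABLE block as a Markdown table (first row assumed to be the header)."""
    cells: dict[tuple[int, int], str] = {}
    for cid in child_ids(table):
        cell = by_id[cid]
        if cell["BlockType"] == "CELL":
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = cell_text(cell, by_id)
    if not cells:
        return ""
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    rows = [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * n_cols) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


def lines_in_reading_order(blocks: list[Block]) -> list[str]:
    """Sort LINE blocks top-to-bottom, then left-to-right (single-column assumption)."""
    line_blocks = [b for b in blocks if b["BlockType"] == "LINE"]
    line_blocks.sort(key=lambda b: (b["Geometry"]["BoundingBox"]["Top"],
                                    b["Geometry"]["BoundingBox"]["Left"]))
    return [b["Text"] for b in line_blocks]


def _word(block_id: str, text: str) -> Block:
    return {"Id": block_id, "BlockType": "WORD", "Text": text}


def _cell(block_id: str, row: int, col: int, word_id: str) -> Block:
    return {"Id": block_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


def _line(block_id: str, text: str, top: float, left: float) -> Block:
    box = {"Top": top, "Left": left, "Width": 0.3, "Height": 0.05}
    return {"Id": block_id, "BlockType": "LINE", "Text": text,
            "Geometry": {"BoundingBox": box}}


# Hand-written sample data, not a real Textract response.
SAMPLE_BLOCKS: list[Block] = [
    _line("l2", "Total due", 0.30, 0.10),
    _line("l1", "Invoice", 0.10, 0.10),
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
    _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
    _word("w1", "Name"), _word("w2", "Amount"),
    _word("w3", "Rent"), _word("w4", "1200"),
]

BY_ID = {b["Id"]: b for b in SAMPLE_BLOCKS}
TABLE_MARKDOWN = table_to_markdown(BY_ID["t1"], BY_ID)
```

Real documents usually need more care than this sketch shows, for example multi-column layouts, merged cells, and tables that span pages.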

Storage

  • The transcription is versioned and stored in DocumentTranscriptionResult.
  • The raw AWS Textract response is stored in S3 and referenced in the document transcription result.
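
As a rough illustration of the storage model described above, the sketch below shows a versioned record that holds the Markdown and a reference to the raw response in S3. Only the name `DocumentTranscriptionResult` comes from this page; the field names, the version-increment rule, and the S3 key layout are hypothetical and may differ from the real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentTranscriptionResult:
    # All field names below are assumptions made for this sketch.
    document_id: str
    version: int
    markdown: str
    raw_response_s3_key: str  # points to the raw Textract response in S3


def new_transcription_result(
    document_id: str,
    markdown: str,
    previous: list[DocumentTranscriptionResult],
) -> DocumentTranscriptionResult:
    """Create the next version for a document, leaving earlier versions untouched."""
    version = max((r.version for r in previous), default=0) + 1
    # Hypothetical key layout; the real bucket structure is not described here.
    key = f"textract-raw/{document_id}/v{version}.json"
    return DocumentTranscriptionResult(document_id, version, markdown, key)


first = new_transcription_result("doc-123", "# Invoice", [])
second = new_transcription_result("doc-123", "# Invoice\n\nTotal due", [first])
```

Keeping each version as an immutable record, with the large raw response stored separately, means a document can be re-transcribed without losing the earlier result.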