
Configuring Transform Operations

Configure the [Transform] Parse document and [Transform] Chunk text operations.

Configure the Transform Parse Document Operation

The [Transform] Parse document operation parses a document supplied as raw binary or Base64-encoded content.

To configure the [Transform] Parse document operation:

  1. Select the operation on the Anypoint Code Builder or Studio canvas.

  2. In the General properties tab for the operation, enter these values:

    • Document binary

      Enter the raw binary or Base64-encoded content of the document to parse.

    • Document parser

      Enter the document parser to use, for example, text.

This is the XML for this operation:

<ms-vectors:transform-parse-document
  doc:name="[Transform] Parse document"
  doc:id="a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  config-ref="MuleSoft_Vectors_Connector_Document_config"
  documentBinary="#[payload.documentPath]"
  documentParser="text">
</ms-vectors:transform-parse-document>
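
The Document binary field accepts Base64-encoded content as well as raw binary. As a minimal sketch of how a client could produce the Base64 form before sending it to a flow, the following Python example encodes raw bytes and checks the round trip. The function names encode_document and decode_document are hypothetical helpers for illustration and are not part of the connector.

```python
import base64


def encode_document(raw: bytes) -> str:
    """Return the Base64 text form of raw document bytes."""
    return base64.b64encode(raw).decode("ascii")


def decode_document(encoded: str) -> bytes:
    """Decode Base64 text back to raw bytes, rejecting invalid characters."""
    return base64.b64decode(encoded, validate=True)


# Hypothetical document content; in practice this would be the file's bytes.
sample = b"Technology in Business"
encoded = encode_document(sample)

# The round trip returns the original bytes unchanged.
assert decode_document(encoded) == sample
print(encoded)
```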

Output Configuration

This operation responds with a JSON payload. This is an example response:

{
    "text": "In the modern world, technological advancements have become essential for businesses to remain competitive. E-commerce giants have redefined the retail landscape through innovative use of technology and data analytics.",
    "metadata": {
        "title": "Technology in Business",
        "author": "John Smith",
        "creationDate": "2024-01-15T10:30:00Z",
        "pageCount": 5,
        "wordCount": 1247,
        "fileSize": 245760,
        "documentType": "PDF"
    },
    "extractedAt": "2024-01-20T14:25:30Z",
    "success": true
}

The response contains these fields:

  • text: The complete extracted text content from the document.

  • metadata: Document properties and information.

    • title: Document title if available.

    • author: Document author if available.

    • creationDate: Document creation timestamp.

    • pageCount: Number of pages in the document.

    • wordCount: Total word count in the extracted text.

    • fileSize: Original file size in bytes.

    • documentType: Detected or specified document format.

  • extractedAt: Timestamp when the parsing operation completed.

  • success: Boolean indicating if the parsing completed successfully.
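
The fields above can be consumed as in the following sketch, which assumes the response arrives as a JSON string with the structure shown in the example. It checks the success flag before reading the extracted text. The function name extract_text and the trimmed sample response are illustrative assumptions, not connector output.

```python
import json


def extract_text(response_json: str) -> str:
    """Return the extracted text, or raise if parsing was not successful."""
    response = json.loads(response_json)
    # Treat a missing "success" field the same as a failed parse.
    if not response.get("success", False):
        raise ValueError("Parse document operation did not report success")
    return response["text"]


# Trimmed, hypothetical response that follows the documented field names.
sample_response = json.dumps({
    "text": "In the modern world, technological advancements matter.",
    "metadata": {"title": "Technology in Business", "pageCount": 5},
    "success": True,
})

print(extract_text(sample_response))
```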

Configure the Transform Chunk Text Operation

The [Transform] Chunk text operation splits the provided text into smaller segments according to the maximum segment size and maximum overlap size that you specify. The result is returned as a JSON document that contains the text segments and associated metadata.

To configure the [Transform] Chunk text operation:

  1. Select the operation on the Anypoint Code Builder or Studio canvas.

  2. In the General properties tab for the operation, enter these values:

    • Text

      Enter the text content to chunk.

    • Max Segment Size (Characters)

      Enter the maximum size of a segment in characters.

    • Max Overlap Size (Characters)

      Enter the maximum overlap between segments in characters.

This is the XML for this operation:

<ms-vectors:transform-chunk-text
  doc:name="[Transform] Chunk text"
  doc:id="b2c3d4e5-f6a7-8901-bcde-f23456789012"
  config-ref="MuleSoft_Vectors_Connector_Document_config"
  text="In the modern world, technological advancements have become essential for businesses to remain competitive. E-commerce giants have redefined the retail landscape through innovative use of technology and data analytics."
  maxSegmentSize="1000"
  maxOverlapSize="100">
</ms-vectors:transform-chunk-text>
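
To illustrate how a maximum segment size and a maximum overlap size interact, here is a minimal sketch of fixed-size character chunking with overlap. It is an assumption about the general technique, not the connector's actual algorithm, which may split on sentence or paragraph boundaries instead. The function name chunk_text is hypothetical.

```python
from typing import List


def chunk_text(text: str, max_segment_size: int, max_overlap_size: int) -> List[str]:
    """Split text into segments of at most max_segment_size characters,
    where each segment repeats the last max_overlap_size characters
    of the previous one."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if not 0 <= max_overlap_size < max_segment_size:
        raise ValueError("max_overlap_size must be in [0, max_segment_size)")

    step = max_segment_size - max_overlap_size
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_segment_size])
        # Stop once the current segment reaches the end of the text.
        if start + max_segment_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", 4, 1))  # ['abcd', 'defg', 'ghij']
```

With this approach, text shorter than the maximum segment size, such as the sample text in the XML above with maxSegmentSize="1000", yields a single segment.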

Output Configuration

This operation responds with a JSON payload. This is an example response:

{
  "chunks": [
    {
      "index": 0,
      "text": "In the modern world, technological advancements have become essential for businesses to remain competitive.",
      "startPosition": 0,
      "endPosition": 198,
      "characterCount": 198
    }
  ],
  "totalChunks": 1,
  "originalLength": 1247,
  "avgChunkSize": 1247,
  "processingTime": "0.125s"
}

The response contains these fields:

  • chunks: List of text segments created from the original text.

    • index: Sequential number of the chunk starting from 0.

    • text: The actual text content of the chunk.

    • startPosition: Character position where this chunk begins in the original text.

    • endPosition: Character position where this chunk ends in the original text.

    • characterCount: Number of characters in this specific chunk.

  • totalChunks: Total number of chunks created.

  • originalLength: Character count of the original input text.

  • avgChunkSize: Average character count across all chunks.

  • processingTime: Time taken to complete the chunking operation.
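
As a sketch of how a caller could read these fields, the following example recomputes the chunk count and average chunk size from the chunks list, assuming the response is a JSON string with the structure described above. The function name summarize_chunks and the sample values are hypothetical and are not real connector output.

```python
import json


def summarize_chunks(response_json: str) -> dict:
    """Recompute the chunk count, total characters, and average chunk size
    from the text of each chunk in the response."""
    response = json.loads(response_json)
    lengths = [len(chunk["text"]) for chunk in response["chunks"]]
    count = len(lengths)
    return {
        "count": count,
        "total_characters": sum(lengths),
        "average": sum(lengths) / count if count else 0.0,
    }


# Hypothetical two-chunk response using the documented field names.
sample_response = json.dumps({
    "chunks": [
        {"index": 0, "text": "abcd", "startPosition": 0, "endPosition": 4, "characterCount": 4},
        {"index": 1, "text": "defg", "startPosition": 3, "endPosition": 7, "characterCount": 4},
    ],
    "totalChunks": 2,
})

print(summarize_chunks(sample_response))
```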
