
Configuring Embedding Operations for Einstein AI Connector

Configure the Adhoc File Query Operation

The Embedding Adhoc File Query operation takes a document and a query (the prompt), and ingests the document into the vector database together with the query.

The output of this operation is a set of scores together with the complete content of the document that is most likely to answer the query. The vector database is used to compute the numeric representation of the content before the likelihood score is calculated.
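To make the idea of scoring content against a query concrete, here is a minimal sketch in Python. It is not the connector's implementation: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), and cosine similarity is assumed as the scoring measure because the source does not name one.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(query: str, segments: list[str]) -> list[tuple[float, str]]:
    """Score every segment against the query and return them best first."""
    q = toy_embed(query)
    scored = [(cosine_similarity(q, toy_embed(s)), s) for s in segments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_segments("when are refunds issued", segments)` returns each segment paired with its score, so the first entry is the content most similar to the query under this toy measure.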

  1. Select the operation on the Anypoint Code Builder or Studio canvas.

  2. In the General properties tab for the operation, enter these values:

    • Prompt

      Plain text for the prompt to send to the LLM

    • File path

      Full file path for the document to ingest into the embedding store. Ensure the file path is accessible.

      You can also use a DataWeave expression for this field, for example:

      mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf"
  3. In Additional Properties, select:

    • Model name

      Name of the API model that interacts with the LLM.

    • File type

      • Text

        Any type of text file (JSON, XML, TXT, CSV, and so on)

      • PDF

        System-generated PDF files

      • CSV

        CSV file with comma-separated values

      • URL

        A single URL.

    • Option type

      How to split the document prior to ingestion into the vector database
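The Option type setting controls how the document is divided before ingestion. The sketch below illustrates two plausible strategies in Python. The option names `"FULL"` and `"PARAGRAPH"` are hypothetical labels chosen for illustration, and the splitting rule (blank lines separate paragraphs) is an assumption, not the connector's documented behavior.

```python
import re


def split_document(text: str, option: str) -> list[str]:
    """Split text into segments before embedding.

    option is a hypothetical label: "FULL" keeps the document as one segment,
    "PARAGRAPH" splits on one or more blank lines.
    """
    if option == "FULL":
        stripped = text.strip()
        return [stripped] if stripped else []
    if option == "PARAGRAPH":
        # A paragraph boundary is a newline, optional whitespace, then another newline.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"Unsupported option: {option!r}")
```

Smaller segments generally give more focused matches for a query, while a single full-document segment preserves all context in one embedding.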

Configure the Generate From File Operation

The Embedding Generate From File operation takes a document and ingests it into the vector database. The output of this operation is a numeric representation of the content.

  1. Select the operation on the Anypoint Code Builder or Studio canvas.

  2. In the General properties tab for the operation, enter these values:

    • File path

      Full file path for the document to ingest into the embedding store. Ensure the file path is accessible.

      You can also use a DataWeave expression for this field, for example:

      mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf"
  3. In Additional Properties, select:

    • Model name

      Name of the API model that interacts with the LLM.

    • File type

      • Text

        Any type of text file (JSON, XML, TXT, CSV, and so on)

      • PDF

        System-generated PDF files

      • CSV

        CSV file with comma-separated values

      • URL

        A single URL.

    • Option type

      How to split the document prior to ingestion into the vector database
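The DataWeave expression shown for the File path field concatenates the Mule home directory, the `apps` folder, the application name, and the file name. The Python sketch below mirrors that concatenation and adds a simple readability check of the kind you might run while troubleshooting an inaccessible path. The function names and the sample values are hypothetical; the connector performs its own path resolution.

```python
import os
from pathlib import Path


def resolve_document_path(mule_home: str, app_name: str, file_name: str) -> Path:
    """Mirror mule.home ++ "/apps/" ++ app.name ++ "/" ++ file_name."""
    return Path(mule_home) / "apps" / app_name / file_name


def is_accessible(path: Path) -> bool:
    """True only if the path is an existing regular file the current user can read."""
    return path.is_file() and os.access(path, os.R_OK)
```

For example, `resolve_document_path("/opt/mule", "my-app", "customer-service.pdf")` yields `/opt/mule/apps/my-app/customer-service.pdf` on a POSIX system, where `/opt/mule` and `my-app` are placeholder values.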

How Data is Parsed

Data from files is parsed in a way that slightly changes the format of the generated embeddings (the content itself isn’t changed):

  • Whitespace, including blank lines, is removed from the beginning and end of the file content, for example:

    Before parsing:

    "
    
    Para 1
    Para 2
    
    
    
    "

    After parsing:

    "Para 1
    Para 2"
  • Extra lines between paragraphs are removed, for example:

    Before parsing:

    "
    Para 1
    
    
    Para 2
    "

    After parsing:

    "Para 1
    Para 2"
  • The connector applies filtering logic that removes some characters from the generated embedding, for example, non-breaking spaces (0xA0). These characters are removed to produce more accurate embeddings:

    Character  Hex Code  Decimal Code  Name                    Description
    \u0000     0x00      0             Null (NULL)             Marks the end of a string in C-like languages
    \u0001     0x01      1             Start of Heading (SOH)  Used to mark the beginning of a heading in data streams
    \u0007     0x07      7             Bell (BEL)              Triggers a beep or alert sound
    \u0008     0x08      8             Backspace (BS)          Moves the cursor one position backward
    \u0009     0x09      9             Horizontal Tab (TAB)    Inserts a tab space
    \u000A     0x0A      10            Line Feed (LF)          Moves to a new line (Unix newline)
    \u000D     0x0D      13            Carriage Return (CR)    Returns the cursor to the beginning of the line
    \u001B     0x1B      27            Escape (ESC)            Used to introduce escape sequences for control
    \u001F     0x1F      31            Unit Separator (US)     Separates units of information
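The parsing rules above can be sketched in Python as two separate steps. This is an illustration under stated assumptions, not the connector's code: the function names are hypothetical, the filtered set contains only the characters listed in this section plus the non-breaking space, and the order in which the connector applies these steps is not specified by the source.

```python
# Characters listed in this section, plus the non-breaking space (0xA0).
FILTERED_CHARS = {
    "\u0000", "\u0001", "\u0007", "\u0008", "\u0009",
    "\u000A", "\u000D", "\u001B", "\u001F", "\u00A0",
}


def trim_and_collapse(text: str) -> str:
    """Strip leading/trailing whitespace and drop blank lines between paragraphs."""
    lines = text.strip().split("\n")
    return "\n".join(line for line in lines if line.strip())


def remove_filtered_chars(text: str) -> str:
    """Remove every character that appears in FILTERED_CHARS."""
    return "".join(ch for ch in text if ch not in FILTERED_CHARS)
```

With these definitions, `trim_and_collapse("\nPara 1\n\n\nPara 2\n")` returns `"Para 1\nPara 2"`, matching the second example above. The two functions are kept separate because line feed is itself in the filtered set, so running `remove_filtered_chars` on the trimmed text would also remove the remaining line break.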

Configure the Generate From Text Operation

The Embedding Generate From Text operation takes text and ingests it into the vector database. The output of this operation is a numeric representation of the content.

  1. Select the operation on the Anypoint Code Builder or Studio canvas.

  2. In the General properties tab for the operation, enter these values:

    • Text

      Text to ingest into the vector database

  3. In Additional Properties, select:

    • Model name

      Name of the API model that interacts with the LLM
