Configuring Embedding Operations for Einstein AI Connector 1.2

Configure the Adhoc File Query Operation

The Embedding Adhoc File Query operation takes a document and a query (the prompt), ingests the document into the vector database, and evaluates the document against the query.

The output of this operation is a set of scores, together with the complete document content that most likely answers the query. The operation uses the vector database to obtain the numeric representation of the content before it calculates the likelihood score.
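
The scoring idea described above can be illustrated with a minimal sketch. It is not the connector's implementation: `toy_embed`, `cosine`, and `rank_chunks` are hypothetical names, and the bag-of-words vector is only a stand-in for the numeric representation that a real embedding model would produce.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): a word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(prompt: str, chunks: list[str]) -> list[tuple[float, str]]:
    """Score every chunk against the prompt, highest score first."""
    query_vec = toy_embed(prompt)
    scored = [(cosine(query_vec, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


chunks = [
    "Refunds are processed within five business days.",
    "Our support line is open on weekdays.",
]
ranked = rank_chunks("How long do refunds take?", chunks)
```

In this sketch, `ranked` lists each piece of content with its score, and the first entry is the content most similar to the prompt.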

Select the operation on the Anypoint Code Builder or Studio canvas.
In the General properties tab for the operation, enter these values:
- Prompt
  
  Plain text for the prompt to send to the LLM
- File path
  
  Full file path for the document to ingest into the embedding store. Ensure the file path is accessible.
  
  You can also use a DataWeave expression for this field, for example:
  mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf"
In Additional Properties, select:
- Model name
  
  Name of the API model that interacts with the LLM.
- File type
  - Text
    
    Any type of text file (JSON, XML, TXT, CSV, and so on)
  - PDF
    
    System-generated PDF files
  - CSV
    
    CSV file with comma-separated values
  - URL
    
    A single URL.
- Option type
  
  How to split the document prior to ingestion into the vector database

Configure the Generate From File Operation

The Embedding Generate From File operation takes a document and ingests it into the vector database. The output of this operation is a numeric representation of the content.

Select the operation on the Anypoint Code Builder or Studio canvas.
In the General properties tab for the operation, enter these values:
- File path
  
  Full file path for the document to ingest into the embedding store. Ensure the file path is accessible.
  
  You can also use a DataWeave expression for this field, for example:
  mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf"
In Additional Properties, select:
- Model name
  
  Name of the API model that interacts with the LLM.
- File type
  - Text
    
    Any type of text file (JSON, XML, TXT, CSV, and so on)
  - PDF
    
    System-generated PDF files
  - CSV
    
    CSV file with comma-separated values
  - URL
    
    A single URL.
- Option type
  
  How to split the document prior to ingestion into the vector database
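
The available option types are not listed here, so the following is only a sketch of what splitting a document before ingestion can look like. Both strategies and their names (`split_full`, `split_paragraphs`) are assumptions for illustration, not the connector's actual options.

```python
import re


def split_full(text: str) -> list[str]:
    """Hypothetical strategy: ingest the whole document as one segment."""
    return [text.strip()] if text.strip() else []


def split_paragraphs(text: str) -> list[str]:
    """Hypothetical strategy: one segment per paragraph.

    Paragraphs are assumed to be separated by one or more blank lines.
    """
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


document = "Intro paragraph.\n\nDetails paragraph.\n\n\n\nClosing paragraph."
whole = split_full(document)
paragraphs = split_paragraphs(document)
```

Smaller segments generally give finer-grained matches, while a single segment keeps the full context together.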

How Data is Parsed

Data from files is parsed in a way that slightly changes the format of the generated embeddings (the content itself isn’t changed):

Whitespace, including blank lines, is removed from the beginning and end of the file content, for example:

Before parsing:
```
"

Para 1
Para 2



"
```
After parsing:
```
"Para 1
Para 2"
```
Extra blank lines between paragraphs are removed, for example:

Before parsing:
```
"
Para 1


Para 2
"
```
After parsing:
```
"Para 1
Para 2"
```
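
The two whitespace rules shown above can be approximated with a short sketch. The function name `normalize_whitespace` is hypothetical, and the regular expression is an assumption about how blank lines are collapsed; it reproduces the before and after examples but is not the connector's code.

```python
import re


def normalize_whitespace(content: str) -> str:
    """Trim the content and drop blank lines between paragraphs."""
    # Rule 1: remove whitespace at the beginning and end of the content.
    trimmed = content.strip()
    # Rule 2: collapse any run of blank lines into a single line break.
    return re.sub(r"\n\s*\n", "\n", trimmed)


first = normalize_whitespace("\n\nPara 1\nPara 2\n\n\n\n")
second = normalize_whitespace("\nPara 1\n\n\nPara 2\n")
```

Both calls yield the same two-line result as the "After parsing" examples.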
The connector also provides filtering logic that removes some characters from the generated embedding, for example, non-breaking spaces (<0xa0>). Removing these characters produces more accurate embeddings:

| Character | Hex Code | Decimal Code | Name | Description |
| --- | --- | --- | --- | --- |
| \u0000 | 0x00 | 0 | Null (NULL) | Marks the end of a string in C-like languages |
| \u0001 | 0x01 | 1 | Start of Heading (SOH) | Used to mark the beginning of a heading in data streams |
| \u0007 | 0x07 | 7 | Bell (BEL) | Triggers a beep or alert sound |
| \u0008 | 0x08 | 8 | Backspace (BS) | Moves the cursor one position backward |
| \u0009 | 0x09 | 9 | Horizontal Tab (TAB) | Inserts a tab space |
| \u000A | 0x0A | 10 | Line Feed (LF) | Moves to a new line (Unix newline) |
| \u000D | 0x0D | 13 | Carriage Return (CR) | Returns the cursor to the beginning of the line |
| \u001B | 0x1B | 27 | Escape (ESC) | Used to introduce escape sequences for control |
| \u001F | 0x1F | 31 | Unit Separator (US) | Separates units of information |
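
As a rough illustration of this kind of filtering, the sketch below deletes the code points from the table, plus the non-breaking space, from a string. The names `FILTERED_CODE_POINTS` and `strip_filtered_characters` are hypothetical, and deleting the characters (rather than replacing them) is an assumption; the connector's actual logic may differ.

```python
# Code points taken from the table above, plus the non-breaking space (0xA0).
FILTERED_CODE_POINTS = [0x00, 0x01, 0x07, 0x08, 0x09, 0x0A, 0x0D, 0x1B, 0x1F, 0xA0]

# str.translate deletes every character whose code point maps to None.
REMOVAL_TABLE = {code_point: None for code_point in FILTERED_CODE_POINTS}


def strip_filtered_characters(text: str) -> str:
    """Remove the listed control characters and non-breaking spaces."""
    return text.translate(REMOVAL_TABLE)


cleaned = strip_filtered_characters("Order\u00a0ID:\t42\u0007")
```

Ordinary spaces and printable characters pass through unchanged; only the listed code points are dropped.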

Configure the Generate From Text Operation

The Embedding Generate From Text operation takes text and ingests it into the vector database. The output of this operation is a numeric representation of the content.

Select the operation on the Anypoint Code Builder or Studio canvas.
In the General properties tab for the operation, enter these values:
- Text
  
  Text to ingest into the vector database
In Additional Properties, select:
- Model name
  
  Name of the API model that interacts with the LLM