Configuring Embedding Operations for Einstein AI Connector
Configure the Adhoc File Query Operation
The Embedding Adhoc File Query operation takes a document and ingests it into the vector database along with its query.
The output of this operation is a set of scores, together with the complete content from the document that is most likely to answer the query. The vector database is used to obtain the numeric representation of the content before the likelihood scores are calculated.
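The scoring idea described above can be illustrated with a minimal sketch. It is not the connector's implementation: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), and both the split into segments and the use of cosine similarity as the score are assumptions made for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(query: str, segments: list[str]) -> list[tuple[float, str]]:
    """Score every segment against the query, highest score first."""
    query_vector = toy_embed(query)
    scored = [(cosine_similarity(query_vector, toy_embed(s)), s) for s in segments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Assumed example content; any document text would work here.
segments = [
    "Refunds are issued within ten business days.",
    "Our support team is available on weekdays.",
]
ranked = rank_segments("When are refunds issued?", segments)
for score, segment in ranked:
    print(f"{score:.3f}  {segment}")
```

A real embedding model captures meaning rather than exact word overlap, so its scores differ from this toy version, but the ranking step works the same way.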
-
Select the operation on the Anypoint Code Builder or Studio canvas.
-
In the General properties tab for the operation, enter these values:
-
Prompt
Plain text for the prompt to send to the LLM
-
File path
Full file path for the document to ingest into the embedding store. Ensure the file path is accessible.
You can also use a DataWeave expression for this field, for example:
mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf"
-
-
In Additional Properties, select:
-
Model name
Name of the API model that interacts with the LLM.
-
File type
-
Text
Any type of text file (JSON, XML, TXT, CSV, and so on)
-
PDF
System-generated PDF files
-
CSV
CSV file with comma-separated values
-
URL
A single URL.
-
-
Option type
How to split the document prior to ingestion into the vector database
-
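To make the idea of splitting a document before ingestion concrete, here is a minimal sketch. The option names `"FULL"` and `"PARAGRAPH"` are hypothetical placeholders, not confirmed values of the Option type field, and the paragraph rule (a blank line separates paragraphs) is an assumption.

```python
import re


def split_document(text: str, option: str = "PARAGRAPH") -> list[str]:
    """Split text into the pieces that would be embedded.

    The option names are hypothetical; check the connector for real values.
    """
    if option == "FULL":
        # Treat the whole document as a single piece.
        stripped = text.strip()
        return [stripped] if stripped else []
    if option == "PARAGRAPH":
        # Assume one or more blank lines separate paragraphs.
        parts = re.split(r"\n\s*\n", text)
        return [part.strip() for part in parts if part.strip()]
    raise ValueError(f"Unknown split option: {option}")


document = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
print(split_document(document))
print(split_document(document, "FULL"))
```

Smaller pieces generally make the most relevant passage easier to locate, while a single piece keeps all context together.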
Configure the Generate From File Operation
The Embedding Generate From File operation takes a document and ingests it into the vector database. The output of this operation is a numeric representation of the content.
-
Select the operation on the Anypoint Code Builder or Studio canvas.
-
In the General properties tab for the operation, enter these values:
-
File path
Full file path for the document to ingest into the embedding store. Ensure the file path is accessible.
You can also use a DataWeave expression for this field, for example:
mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf"
-
-
In Additional Properties, select:
-
Model name
Name of the API model that interacts with the LLM.
-
File type
-
Text
Any type of text file (JSON, XML, TXT, CSV, and so on)
-
PDF
System-generated PDF files
-
CSV
CSV file with comma-separated values
-
URL
A single URL.
-
-
Option type
How to split the document prior to ingestion into the vector database
-
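Before running the operation, it can help to confirm that the file path is accessible and to decide which File type to select. The sketch below mirrors the DataWeave path expression in plain Python. The helper names and the mapping from file extension to File type are assumptions for illustration and are not part of the connector.

```python
import os
import tempfile

# Hypothetical mapping from file extension to the File type field.
FILE_TYPE_BY_EXTENSION = {".pdf": "PDF", ".csv": "CSV"}


def build_file_path(mule_home: str, app_name: str, file_name: str) -> str:
    """Mirror mule.home ++ "/apps/" ++ app.name ++ "/<file>" from the docs."""
    return os.path.join(mule_home, "apps", app_name, file_name)


def preflight(path: str) -> str:
    """Check that the file exists and is readable, then suggest a File type."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No file at {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"File is not readable: {path}")
    extension = os.path.splitext(path)[1].lower()
    # Anything that is not PDF or CSV is treated as plain text here.
    return FILE_TYPE_BY_EXTENSION.get(extension, "Text")


if __name__ == "__main__":
    # Demonstrate with a throwaway directory standing in for mule.home.
    with tempfile.TemporaryDirectory() as fake_home:
        os.makedirs(os.path.join(fake_home, "apps", "demo-app"))
        path = build_file_path(fake_home, "demo-app", "customer-service.pdf")
        open(path, "wb").close()
        print(preflight(path))
```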
How Data is Parsed
Data from files is parsed in a way that slightly changes the format of the generated embeddings (the content itself isn’t changed):
-
Spaces are removed from the beginning and end of file content, for example:
Before parsing:
" Para 1 Para 2 "
After parsing:
"Para 1 Para 2"
-
Extra lines between paragraphs are removed, for example:
Before parsing:
" Para 1 Para 2 "
After parsing:
"Para 1 Para 2"
-
The connector provides filtering logic that removes some characters from the generated embedding, for example, non-breaking spaces (<0xa0>). These characters are removed to drive more accurate embeddings:
Character | Hex Code | Decimal Code | Name | Description |
---|---|---|---|---|
\u0000 | 0x00 | 0 | Null (NULL) | Marks the end of a string in C-like languages |
\u0001 | 0x01 | 1 | Start of Heading (SOH) | Used to mark the beginning of a heading in data streams |
\u0007 | 0x07 | 7 | Bell (BEL) | Triggers a beep or alert sound |
\u0008 | 0x08 | 8 | Backspace (BS) | Moves the cursor one position backward |
\u0009 | 0x09 | 9 | Horizontal Tab (TAB) | Inserts a tab space |
\u000A | 0x0A | 10 | Line Feed (LF) | Moves to a new line (Unix newline) |
\u000D | 0x0D | 13 | Carriage Return (CR) | Returns the cursor to the beginning of the line |
\u001B | 0x1B | 27 | Escape (ESC) | Used to introduce escape sequences for control |
\u001F | 0x1F | 31 | Unit Separator (US) | Separates units of information |
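The three parsing rules above can be approximated with the following sketch. It is an illustration under stated assumptions, not the connector's code: the function names are hypothetical, runs of blank lines are assumed to collapse to a single line break, and filtered characters are assumed to be deleted rather than replaced. The order in which the connector applies these steps is not documented here, so each step is shown as a separate function.

```python
import re

# Characters from the table above, plus the non-breaking space (0xA0)
# mentioned in the text.
FILTERED_CHARACTERS = (
    "\u0000\u0001\u0007\u0008\u0009\u000A\u000D\u001B\u001F\u00A0"
)
_FILTER_TABLE = str.maketrans("", "", FILTERED_CHARACTERS)


def trim_content(text: str) -> str:
    """Remove whitespace from the beginning and end of the content."""
    return text.strip()


def collapse_blank_lines(text: str) -> str:
    """Assumed rule: replace blank lines between paragraphs with one break."""
    return re.sub(r"\n\s*\n", "\n", text)


def filter_characters(text: str) -> str:
    """Delete every character listed in FILTERED_CHARACTERS."""
    return text.translate(_FILTER_TABLE)


print(repr(trim_content("  Para 1 Para 2  ")))
print(repr(collapse_blank_lines("Para 1\n\n\nPara 2")))
print(repr(filter_characters("Price:\u00a0100\u0007")))
```

Because deleting a line feed or tab joins the surrounding words, a real pipeline might substitute a space instead. That choice is left open in this sketch.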
Configure the Generate From Text Operation
The Embedding Generate From Text operation takes text and ingests it into the vector database. The output of this operation is a numeric representation of the content.
-
Select the operation on the Anypoint Code Builder or Studio canvas.
-
In the General properties tab for the operation, enter these values:
-
Text
Text to ingest into the vector database
-
-
In Additional Properties, select:
-
Model name
Name of the API model that interacts with the LLM
-