RAG simplified with Truto

by The Truto Team

Posted Feb 12, 2025

Introduction

Building a RAG system that ingests a URL or an uploaded file is straightforward. But does that approach truly meet your need for an AI chatbot that can access and answer questions using internal data from Confluence pages, Jira tickets, Notion pages, or other SaaS tools?

Probably not, so you may want to explore a RAG provider along with connectors. Here are a few key questions to ask while evaluating one:

  1. Do they support all the connectors your customers need?

  2. Can they quickly add new connectors before your lead loses momentum?

  3. Do they give you the flexibility to choose your own embedding model and vector database?

With Truto, building native integrations is effortless—we've perfected them. Now, we've also solved your RAG challenges, enabling seamless data syncing from virtually any SaaS tool. (Fun fact: We already support 350+ integrations, allowing instant data syncing to your vector database, and even enabling data write-backs based on user actions.)

Challenge #1 - File or Page Selection

You wouldn’t want your AI model—or any third-party provider—accessing sensitive internal data, so it's best to restrict syncing to specific files or pages. While apps like Google Drive, SharePoint, and Box provide native file pickers, what about integrations that don’t?

Truto solves this with RapidForm, allowing users to select exactly which files and pages to sync during the connection process. Plus, we support native file pickers for Google Drive and SharePoint (with more on the way!) to ensure a seamless user experience.

Check out our change log for a sneak peek at the native file pickers: Truto Change Log.

Challenge #2 - Content Retrieval

APIs can be unpredictable, each with its own quirks in handling responses. For example, the Confluence API delivers page content as a single record, the Notion API structures content in blocks requiring pagination, and the SharePoint API retrieves only the file itself—not its content.

Notion

At Truto, tackling these API challenges is what we do best—so you don’t have to. Take Notion, for example—its API delivers content in fragmented blocks, requiring pagination. We’ve solved this by spooling content in memory and seamlessly merging it into a single record, all thanks to RapidBridge (also known as Sync Job).

The code snippet below shows how our Notion Sync Job works: it fetches pages, recursively retrieves their content, spools the data, and merges everything into a unified record for each page.

{
   "type":"request",
   "name":"list-pages",
   "resource":"knowledge-base/pages",
   "method":"list",
   "integrated_account_id":"{{args.integrated_account_id}}"
},
{
   "type":"add_context",
   "name":"add-page-name",
   "depends_on":"list-pages",
   "config":{
      "expression":"{ \"page_id\": resources.`knowledge-base`.pages.id, \"page_title\": resources.`knowledge-base`.pages.title, \"page_url\": resources.`knowledge-base`.pages.urls[type=\"view\"].url }"
   }
},
{
   "type":"request",
   "name":"get-page-content",
   "resource":"knowledge-base/page-content",
   "method":"list",
   "depends_on":"list-pages",
   "query":{
      "page":{
         "id":"{{resources.knowledge-base.pages.id}}"
      }
   },
   "recurse":{
      "if":"{{resources.knowledge-base.page-content.has_children:bool}}",
      "config":{
         "query":{
            "page_content_id":"{{resources.knowledge-base.page-content.id}}"
         }
      }
   },
   "integrated_account_id":"{{args.integrated_account_id}}"
},
{
   "name":"remove-remote-data",
   "type":"transform",
   "config":{
      "expression":"resources.`knowledge-base`.`page-content`.$sift(function($v, $k) { $k != remote_data })"
   },
   "depends_on":"get-page-content"
},
{
   "name":"all-page-content",
   "type":"spool",
   "depends_on":"remove-remote-data"
},
{
   "name":"combine-page-content",
   "type":"transform",
   "config":{
      "expression":"{ \"file_id\": page_id, \"file_name\": page_title, \"content\": \"# \" & page_title & \"\n\n\" & $reduce($sortNodes(resources.`knowledge-base`.`page-content`, \"id\", \"parent.id\"), function($acc, $v) { $acc & $v.body.content }, \"\" ) }"
   },
   "depends_on":"all-page-content"
}
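
For comparison, here is roughly what that pagination and recursion look like if you call the Notion API directly yourself. This is a minimal TypeScript sketch, not what runs inside Truto; the token handling and the plain-text flattening are illustrative assumptions.

// Sketch: recursively fetch a Notion page's blocks and concatenate their plain
// text, handling both pagination (has_more) and nesting (has_children).
const NOTION_API = "https://api.notion.com/v1";

async function fetchBlockText(blockId: string, token: string): Promise<string> {
  let text = "";
  let cursor: string | undefined;

  do {
    const url = new URL(`${NOTION_API}/blocks/${blockId}/children`);
    url.searchParams.set("page_size", "100");
    if (cursor) url.searchParams.set("start_cursor", cursor);

    const res = await fetch(url, {
      headers: {
        Authorization: `Bearer ${token}`,
        "Notion-Version": "2022-06-28",
      },
    });
    if (!res.ok) throw new Error(`Notion API error: ${res.status}`);
    const body = await res.json();

    for (const block of body.results) {
      // Most block types keep their rich_text under a key named after the type,
      // e.g. block.paragraph.rich_text. (Simplified; tables, images, etc. differ.)
      const richText = block[block.type]?.rich_text ?? [];
      text += richText.map((t: { plain_text: string }) => t.plain_text).join("") + "\n";

      // Recurse into nested blocks, mirroring the "recurse" config above.
      if (block.has_children) {
        text += await fetchBlockText(block.id, token);
      }
    }

    cursor = body.has_more ? body.next_cursor : undefined;
  } while (cursor);

  return text;
}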

SharePoint

With SharePoint, the process is a bit different—we need to download and parse the file to extract its content.

{
   "type":"request",
   "name":"list-files",
   "resource":"file-storage/drive-items",
   "method":"get",
   "integrated_account_id":"{{args.integrated_account_id}}",
   "id":"{{drive_items.id}}",
   "query":{
      "drive":{
         "id":"{{drive_items.parentReference.driveId}}"
      },
      "workspace":{
         "id":"{{drive_items.parentReference.sharepointIds.siteId}}"
      }
   },
   "loop_on":"drive_items"
},
{
   "type":"add_context",
   "name":"add-file-name",
   "depends_on":"list-files",
   "config":{
      "expression":"{'file_name': resources.`file-storage`.`drive-items`.name, 'file_id': resources.`file-storage`.`drive-items`.id, 'file_type': resources.`file-storage`.`drive-items`.mime_type, 'file_size': resources.`file-storage`.`drive-items`.size, 'file_url': resources.`file-storage`.`drive-items`.urls[type=\"self\"].url, 'aws_file_path': $join([tenant_id,resources.`file-storage`.`drive-items`.workspace.id, resources.`file-storage`.`drive-items`.drive.id],\"/\")  }"
   }
},
{
   "type":"request",
   "name":"download-file",
   "resource":"file-storage/drive-items",
   "method":"download",
   "depends_on":"list-files",
   "query":"{'file_url':  resources.`file-storage`.`drive-items`.urls[type = 'download'].url, 'truto_response_format': 'stream'}",
   "integrated_account_id":"{{args.integrated_account_id}}"
},
{
   "name":"tee-stream",
   "type":"transform",
   "config":{
      "expression":"{ \"file_streams\": $teeStream(resources.`file-storage`.`drive-items`) }"
   },
   "depends_on":"download-file"
},
{
   "name":"transform-file-content",
   "type":"transform",
   "config":{
      "expression":"{ 'file_content': $parseDocument(resources.`file-storage`.`drive-items`[0].file_streams[0], file_type)}"
   },
   "depends_on":"tee-stream"
}
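
If you were assembling this yourself in Node.js, the stream teeing that $teeStream performs above might look roughly like the sketch below; parseDocument and uploadToBucket are hypothetical stand-ins for your own parser and object-storage upload, not Truto APIs.

import { PassThrough, type Readable } from "node:stream";

// Sketch: duplicate one download stream into two consumers, so the same bytes
// can be parsed for text extraction and archived to object storage in parallel.
function teeStream(source: Readable): [PassThrough, PassThrough] {
  const forParsing = new PassThrough();
  const forArchive = new PassThrough();
  source.pipe(forParsing);
  source.pipe(forArchive);
  return [forParsing, forArchive];
}

// Hypothetical usage, where `download` is the stream from the download step:
// const [forParsing, forArchive] = teeStream(download);
// const text = await parseDocument(forParsing, mimeType); // stand-in parser
// await uploadToBucket(forArchive, awsFilePath);          // stand-in upload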

Challenge #3 - Embeddings Generation

Once you’ve got the content, the next challenge is generating embeddings. Can you embed the entire document in one go? Usually not: embedding models have input limits, so you’ll need to split the content into chunks first. (Yeah, we get the pain.)

The following Sync Job node demonstrates how we chunk the page content using the recursiveCharacterTextSplitter method from langchain/textsplitters.

{
   "name":"file-content-chunk",
   "type":"add_context",
   "config":{
      "expression":"{'file_content_chunk': $recursiveCharacterTextSplitter(resources.`file-storage`.`drive-items`[0].file_content) }"
   },
   "depends_on":"transform-file-content"
}

Remember, you're not locked into this specific chunking method—you can easily swap it out for your own custom approach if needed.
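
For example, a drop-in custom chunker built directly on @langchain/textsplitters might look like this sketch; the chunkSize and chunkOverlap values are illustrative defaults, not necessarily what Truto uses internally.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Sketch: split extracted file content into overlapping chunks before embedding.
// The sizes below are assumptions for illustration; tune them for your model.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // maximum characters per chunk
  chunkOverlap: 200, // characters shared between adjacent chunks for context
});

async function chunkContent(fileContent: string): Promise<string[]> {
  return splitter.splitText(fileContent);
}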

Next, we generate embeddings for these chunks. The Sync Job node below calls the Cohere Embed API with the embed-multilingual-light-v3.0 model:

{
   "name":"generate-embeddings",
   "type":"transform",
   "config":{
      "expression":"{ 'content_embeddings' : $generateEmbeddingsCohere({'model': 'embed-multilingual-light-v3.0', 'input_type': 'search_document', 'embedding_types': ['float'], 'texts': file_content_chunk }, args.cohere_api_key)}"
   },
   "depends_on":"transform-file-content"
}

This approach is highly flexible—change the model attribute to use a different Cohere model or swap out the $generateEmbeddingsCohere() method to use another provider like OpenAI.
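
For reference, calling Cohere's embed endpoint yourself looks roughly like the sketch below (request and response shapes per Cohere's v1 embed API; the exact fields Truto sends under the hood may differ).

// Sketch: generate embeddings for a batch of chunks via Cohere's REST API.
async function embedChunks(chunks: string[], apiKey: string): Promise<number[][]> {
  const res = await fetch("https://api.cohere.com/v1/embed", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "embed-multilingual-light-v3.0",
      input_type: "search_document", // use "search_query" when embedding user queries
      embedding_types: ["float"],
      texts: chunks,
    }),
  });
  if (!res.ok) throw new Error(`Cohere embed failed: ${res.status}`);
  const data = await res.json();
  // With embedding_types: ["float"], vectors are returned under embeddings.float.
  return data.embeddings.float;
}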

Challenge #4 - Storing Embeddings

Now we reach the stage where the real magic unfolds: storing the embeddings in your own vector database. This is a crucial step, as it enables rapid similarity searches and efficient retrieval of your processed data.

In the example below, we form the payload in accordance with Qdrant’s API request body schema. The subsequent node then calls the upsertPoints method of the Qdrant datastore, ensuring your embeddings are correctly indexed and stored.

{
   "name":"qdrant-config",
   "type":"transform",
   "config":{
      "expression":"{ 'qdrant_config': ($texts:= resources.`file-storage`.`drive-items`[0].content_embeddings.texts; $embeddings:= resources.`file-storage`.`drive-items`[0].content_embeddings.embeddings.float; $file_id:= file_id; $tenant_id:= $toNumber(tenant_id); $file_url:= file_url; $site_id:= site_id; $aws_file_path:= aws_file_path; {'query': {'ordering': 'strong'}, 'body': {'points': $texts#$i.{ 'id': $uuid(), 'payload': {'content': $, 'chunk_number': $i, 'organization_user_file_id': $file_id, 'user_id': $tenant_id, 'external_url': $file_url, 'updated_at': $now(), 'site_id': $site_id, 'saved_filename':$aws_file_path  }, 'vector': $embeddings[$i]}}})}"
   },
   "depends_on":"generate-embeddings"
},
{
   "name":"qdrant-db",
   "type":"destination",
   "destination_type":"datastore",
   "method":"upsertPoints",
   "config":{
      "id":"{{args.qdrant_datastore_id}}",
      "config":{
         "query":"{{payload.records.0.qdrant_config.query:json}}",
         "body":"{{payload.records.0.qdrant_config.body:json}}"
      }
   },
   "run_if":"$exists(args.qdrant_datastore_id)",
   "resources_to_persist":[
      "qdrant-config"
   ]
}

The point payload also carries the chunk text itself, the chunk number, the file URL from the integration, and a reference to the original file in your object storage. This keeps your data organized, easily searchable, and scalable as it grows.
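
Outside of Truto's datastore destination, the equivalent call is a single points upsert against Qdrant's REST API, sketched below; the QDRANT_URL, API key, and "documents" collection name are placeholders, and the payload fields simply mirror the node above.

import { randomUUID } from "node:crypto";

// Sketch: upsert embedded chunks as Qdrant points. QDRANT_URL, QDRANT_API_KEY,
// and the "documents" collection are placeholders for your own setup.
async function upsertChunks(
  chunks: string[],
  embeddings: number[][],
  meta: { fileId: string; fileUrl: string; tenantId: number },
): Promise<void> {
  const points = chunks.map((content, i) => ({
    id: randomUUID(),
    vector: embeddings[i],
    payload: {
      content,
      chunk_number: i,
      organization_user_file_id: meta.fileId,
      external_url: meta.fileUrl,
      user_id: meta.tenantId,
      updated_at: new Date().toISOString(),
    },
  }));

  const res = await fetch(
    `${process.env.QDRANT_URL}/collections/documents/points?wait=true&ordering=strong`,
    {
      method: "PUT",
      headers: {
        "Content-Type": "application/json",
        "api-key": process.env.QDRANT_API_KEY ?? "",
      },
      body: JSON.stringify({ points }),
    },
  );
  if (!res.ok) throw new Error(`Qdrant upsert failed: ${res.status}`);
}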

Challenge #5 - Saving the Original File

It isn't done until it's done. Retaining the source file is crucial, especially if you plan to display it as part of your chatbot's response.

Truto makes this easy by supporting file storage in Google Cloud Storage and any S3-compatible object storage. Once you configure your datastore through the Truto console, the Sync Job node below handles uploading the file to the specified path in your storage system.

{
   "name":"s3-storage",
   "type":"destination",
   "destination_type":"datastore",
   "method":"uploadObject",
   "config":{
      "id":"{{args.s3_datastore_id}}",
      "config":{
         "path":"{{aws_file_path}}",
         "file_name":"{{file_name}}",
         "content":"{{payload.records.0.file_streams.1}}"
      }
   },
   "run_if":"$exists(args.s3_datastore_id)",
   "resources_to_persist":[
      "tee-stream"
   ]
}
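
If you prefer to push the original file to S3-compatible storage yourself, the streaming upload can be done with the AWS SDK's Upload helper, as in the sketch below; the bucket name, region, and key layout are illustrative and simply mirror the path and file_name used above.

import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import type { Readable } from "node:stream";

// Sketch: stream the second copy of the teed download into S3-compatible storage.
// Bucket, region, and key layout are placeholders for your own configuration.
const s3 = new S3Client({ region: "us-east-1" });

async function saveOriginalFile(stream: Readable, awsFilePath: string, fileName: string) {
  const upload = new Upload({
    client: s3,
    params: {
      Bucket: "my-rag-files",            // placeholder bucket
      Key: `${awsFilePath}/${fileName}`, // mirrors path + file_name in the node above
      Body: stream,
    },
  });
  await upload.done();
}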

Getting Started

By now, you might be convinced that Truto is your ultimate solution for integrations and RAG. Truto isn't just a single service—it’s a comprehensive package that supports every step of your workflow.

Still not convinced? Consider this: Truto’s Unified APIs empower you to perform virtually any action your users request. For example:

User Prompt: "Give me a list of Jira tickets assigned to me."

AI Agent Response:

  • Ticket #1 – Backlogged

  • Ticket #2 – Pending from engineering

  • Ticket #3 – Resolved … and so on.

User Prompt: "Leave a comment on all tickets pending from engineering, asking for follow-ups."

AI Agent Response: Calls the CREATE ticketing/comments API for all pending tickets.

Sounds amazing, right? Don’t wait any longer—experience the power of Truto for yourself and transform the way you manage integrations and data. Get started today!
