How to select and improve the right sources to train your AI

Learn the impact of good and bad training data on AI with our guidelines for optimal results.

Artificial Intelligence (AI) has become an integral part of various industries, revolutionizing the way we interact with technology. However, the effectiveness of AI systems relies heavily on the quality of the data used to train them.

Training your AI is as simple as adding your preferred training sources, and you can be ready to go in no time. Still, there are a few things to consider if you want better responses from the AI. As the saying goes, "garbage in, garbage out."

In this article, we explore why training data matters and provide best practices and guidelines for choosing good sources to achieve accurate and valuable AI outcomes. These tips apply both to optimizing your existing content and to creating new content.

Training source best practices

  1. The AI is only as good as the training sources you add.
  2. Review your content before adding it as a source.
    • Is it correct, clear, up-to-date, and comprehensive? Is it relevant?
    • This review can help identify gaps in your content as well as updates that might be needed.
  3. Have a scannable structure.
    • Use rich formatting. Make sure your content has headers, sections, numbered lists, bullet points, etc.
    • Have concise paragraphs. Break large chunks of text into smaller sections.
    • This is good practice in general. It makes your content more readable not just for the AI but for your customers as well.
  4. Explain key terminology and acronyms.
    • Explain any key terminology and acronyms, whether they are common in your industry or specific to your company or product. Once again, this is as good for your users as it is for the AI.
  5. Add written descriptions to accompany visual content.
    • While visual content can be helpful, especially in the case of walkthroughs and explainers, the AI cannot read screenshots, photos, or videos.
    • Include a breakdown of the content using rich formatting and a clear structure with step-by-step instructions.
    • Adding descriptions to visuals and transcripts to videos is also great for accessibility so it's a win-win!
  6. Do not use user-generated content.
    • See the "Avoid user-generated content" section below for details.
  7. Avoid duplicate content.
    • Have a single source of information as much as possible. This will ensure consistency in the answers and linked sources provided by the AI.
  8. Think twice before doing any bulk uploads.
    • As the saying goes, quality over quantity. Consider the quality of the content and its use case.
    • P.S. There might be a better approach (such as integrations), so do not hesitate to contact our support team.
  9. Always check that the source is indexed correctly.
    • For each source listed under your global training sources or within your segments, you can select Show indexed items from the Actions menu. You can then click View content for each indexed item to review the exact content that has been ingested by the AI. This is great for troubleshooting.
  10. Improve continuously.
    • Add your sources and test your AI. As you keep asking questions to test and improve the AI, you can identify any gaps.
    • Which questions is the AI not able to answer because there is no information on it? Which answers are outdated and need an update? What are the most frequently asked questions?
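
Point 7 (avoid duplicate content) is one of the checks you can run yourself before adding sources. The following is a minimal sketch, assuming you have the text of each candidate source available as a plain string; the function names and sample data are hypothetical and not part of the product. It only flags sources whose text is identical after whitespace and case are normalized, so near-duplicates with reworded sentences would still need a manual review.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(sources: dict[str, str]) -> list[list[str]]:
    """Group source names whose normalized text is identical."""
    groups = defaultdict(list)
    for name, text in sources.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sample data: two pages carry the same sentence.
sources = {
    "help/refunds": "Refunds are processed within 5 days.",
    "blog/refund-policy": "Refunds  are processed\nwithin 5 days.",
    "help/shipping": "We ship worldwide.",
}
duplicates = find_duplicates(sources)
print(duplicates)  # [['blog/refund-policy', 'help/refunds']]
```

Each group in the result is a set of pages that say the same thing; keeping one of them as the single source of information is usually enough.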

Types of training sources

To give the AI effective and comprehensive training material, we support a variety of source types, including website content, PDFs, and FAQs.

Data sources you can use to train your AI

Website content

Website content is an ideal resource for training AI because it tends to be clear, concise, and carefully written. The detailed explanations and descriptions on your website provide valuable information that can be used to feed machine learning models.

Knowledge bases that contain help articles can be a valuable resource for training purposes. Like website content, they are also well-written and thought-out. Furthermore, they can be updated regularly with new information, ensuring that the AI always has access to the latest and most accurate training material.

Using headings properly (H-tags in HTML) is crucial for creating structured and organized content. Headings segment your content into logical, easy-to-follow sections. This not only makes your content more readable but also improves the AI's ability to identify and extract the most relevant information. Headings are therefore an essential tool for optimizing your content for human readers, search engines, and the AI alike, so it is worth understanding the different heading levels and when to use each one.
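
To make the heading advice concrete, here is a small sketch that lists the headings of an HTML page and flags places where a level is skipped (for example, an H2 followed directly by an H4). It uses only Python's standard `html.parser` module. The class and function names and the sample HTML are illustrative assumptions, and the "no skipped levels" rule is a common structuring convention rather than a documented requirement of the AI.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every h1-h6 element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently being read
        self._buffer = []    # text fragments inside that heading

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None


def skipped_levels(headings):
    """Return headings that jump more than one level deeper than the previous heading."""
    problems = []
    previous = 0
    for level, text in headings:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems


# Hypothetical page: the H4 follows an H2 without an H3 in between.
html = "<h1>Billing</h1><p>Intro</p><h2>Invoices</h2><h4>Download a PDF</h4>"
collector = HeadingCollector()
collector.feed(html)
print(collector.headings)                  # [(1, 'Billing'), (2, 'Invoices'), (4, 'Download a PDF')]
print(skipped_levels(collector.headings))  # [(4, 'Download a PDF')]
```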

You can expand your training set by manually adding individual pages or by letting the AI crawl your entire website through the sitemap.xml file. Our include/exclude feature gives you control over which pages are included or excluded during crawling, so you can tailor the training data to your needs. Making these choices deliberately helps ensure that the AI is trained on the most relevant and accurate data, which can improve its performance and accuracy in the long run.
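
Before starting a crawl, it can help to preview which pages a set of include and exclude rules would leave you with. The sketch below is only an illustration of the idea: it reads a standard sitemap.xml and filters its URLs with shell-style wildcard patterns. The pattern syntax, function names, and example URLs are assumptions made for this example and may differ from how the include/exclude feature in the product is configured.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def select_urls(urls, include, exclude):
    """Keep URLs that match any include pattern and no exclude pattern."""
    return [
        url for url in urls
        if any(fnmatchcase(url, pattern) for pattern in include)
        and not any(fnmatchcase(url, pattern) for pattern in exclude)
    ]


# Hypothetical sitemap with one current help page, one archived page, and one unrelated page.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/help/billing</loc></url>
  <url><loc>https://example.com/help/archive/old-pricing</loc></url>
  <url><loc>https://example.com/careers</loc></url>
</urlset>"""

selected = select_urls(
    urls_from_sitemap(sitemap),
    include=["https://example.com/help/*"],
    exclude=["*/archive/*"],
)
print(selected)  # ['https://example.com/help/billing']
```

Here the archived pricing page and the careers page are left out, so only current help content would be used for training.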

Add a training source


PDFs

PDFs can be a valuable source of knowledge, similar to website content. Like web pages, PDFs should have appropriate headings and well-structured content for effective AI training. This means that the headings should be descriptive and accurately reflect the content that follows.

Moreover, it is important to ensure that the content is organized in a logical and coherent manner. This can be achieved through the use of subheadings, bullet points, and other formatting techniques to break up large chunks of text into more digestible pieces.

Keep in mind that images within PDFs will not be indexed, and tables with extensive data might not yield optimal results, as the AI thrives on written content with contextual meaning.

FAQ entries

Our system lets you manually add FAQ entries directly to the training set. You can add, update, and delete these entries in real time. Adding a list of your frequently asked questions is an excellent way to teach the AI to give specific answers to common queries.

The exact phrasing of the questions in the training data is not crucial, as the AI seeks to understand the context and meaning of the question rather than the specific words used.

FAQs can also be utilized to provide additional information or instructions, enhancing the accuracy and relevance of AI-generated responses.

Add an FAQ entry

FAQs have another valuable use: you can add temporary FAQ entries to the training set to override specific responses for a limited time.

By adding temporary FAQs, you gain the ability to address transient situations effectively. For example, if a particular feature is currently experiencing issues and users are frequently asking about it, you can create a temporary FAQ entry explaining the situation. The AI will then take this information into account and provide appropriate responses until the issue is resolved.
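
Temporary entries are easy to forget once the issue is resolved. One way to keep track of them is to maintain your own list with an expiry date per entry and review it regularly. The sketch below assumes such a self-maintained list; the `FaqEntry` class, its `expires_on` field, and the sample entries are hypothetical and do not describe a feature of the product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FaqEntry:
    question: str
    answer: str
    expires_on: Optional[date] = None  # None means the entry is permanent


def expired_entries(entries, today):
    """Return temporary entries whose expiry date has passed."""
    return [e for e in entries if e.expires_on is not None and e.expires_on < today]


# Hypothetical tracking list: one permanent entry and one temporary entry.
entries = [
    FaqEntry("How do I reset my password?", "Use the 'Forgot password' link."),
    FaqEntry(
        "Why is export failing?",
        "Export is temporarily unavailable; we are working on a fix.",
        expires_on=date(2024, 3, 1),
    ),
]
stale = expired_entries(entries, today=date(2024, 3, 15))
print([e.question for e in stale])  # ['Why is export failing?']
```

Any entry returned here is a candidate for deletion from the training set, so the AI stops repeating an explanation that no longer applies.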

Avoid user-generated content

While user-generated content, such as questions and answers from knowledge bases or help desks, may seem like valuable information to train the AI, it comes with some significant drawbacks.

  • User questions may contain irrelevant context, leading to inaccurate AI responses. For example, if a user asks about paying with their own Mastercard and the AI is trained on this data, it might incorrectly assume that Mastercard is the company's preferred credit card.
  • User-generated questions may remain online for extended periods, leading to outdated information.
  • Bug reports might not accurately represent the actual behavior of the system, potentially resulting in misleading AI responses.


The quality of training data significantly impacts the performance of AI systems. By carefully selecting and curating sources such as website content, PDFs, and well-structured FAQs, you can ensure that your AI is trained with accurate and relevant information.

Avoiding user-generated content helps prevent misinformation and outdated data from influencing AI responses. By following these guidelines, you can unleash the true potential of AI and enhance the user experience for your customers. Plus, as mentioned earlier, a lot of these recommendations are also just good practice in general, not just for the AI but also for your users.
