Supercharging AI Assistants: Unleash Human-Level Performance

In brief

  • Optimising the performance of AI assistants is complex but can significantly improve accuracy and relevance, while enhancing response speed and reducing energy footprint.
  • This is a comprehensive guide on strategies to enhance the performance of AI assistants. It is aimed at technology professionals who have some familiarity with generative AI.
  • Incorporate these approaches into your optimisation strategies to create state-of-the-art AI assistants that deliver human-level performance.

1. Unleashing Human-Level Performance

With new generative AI tools and solutions emerging almost daily, organisations are now turning to specialised AI assistants to generate leads, improve customer satisfaction and lift productivity.

These digital aides are becoming integral across various domains, from customer service to back-office operations, and their effectiveness hinges on their ability to deliver accurate, relevant, and timely responses. However, many organisations struggle to progress from prototype to live deployment due to the challenges in creating reliable AI assistants.


This is a comprehensive guide on strategies to enhance the performance of AI assistants. It is aimed at technology professionals who have some familiarity with generative AI. The strategies covered are product-agnostic and universally applicable across technology stacks.

2. The Performance Trade-Off

Let us begin by defining what we mean by "performance" in the context of AI assistants. Performance encompasses two primary dimensions:


Accuracy and Relevance: Ensuring that the AI assistant provides correct and contextually appropriate answers to user queries. This involves not only retrieving the right information but also presenting it in a coherent and understandable manner. High accuracy and relevance are vital for maintaining user trust and satisfaction. Adherence to safe AI practices is equally essential: the assistant's responses must not perpetuate stereotypes, misinformation, or harmful content, and should follow guidelines on fairness, accountability, and transparency.


Efficiency: Reducing the latency of responses and minimising the computational resources and energy consumption required to operate the AI assistant. Fast response times are crucial for maintaining user engagement and satisfaction, as users expect near-instantaneous answers. Efficient use of resources not only reduces operational costs but also contributes to environmental sustainability, which is increasingly important in the context of sustainable AI practices.


There is often a trade-off between accuracy and relevance on the one hand and efficiency on the other. For instance, more complex models may provide more accurate responses but at the cost of increased computational resources and slower response times. Conversely, optimising for speed and resource usage might compromise the quality of the responses. In this guide, we will address strategies to balance and optimise both aspects to create a more effective and sustainable AI assistant.


3. Understanding the Flow of Information in an AI Assistant

To effectively optimise an AI assistant, it is important to understand the typical flow of information within the system. A simple assistant, such as ChatGPT, interprets the user's query and answers directly from the knowledge encoded in its model. However, the workflow for an advanced assistant can be much more complex. Here is a high-level overview of how an advanced AI assistant processes information:


Step 1 - Understanding the User's Query: The assistant first parses and interprets the user's query to understand the intent and context. This involves the use of a Large Language Model (LLM) to extract relevant information and determine the appropriate response strategy.


Step 2 - Formulating a Step-by-Step Plan: Once the query is understood, the assistant formulates a plan to address it. This plan may involve multiple steps, such as retrieving content from documents, making API calls to other systems, or running code. The assistant may also need to break down complex queries into simpler sub-tasks that can be tackled sequentially or in parallel.


Step 3 - Generating a Response: After executing the necessary steps, the assistant synthesises the gathered information to generate a coherent and contextually appropriate response. Depending on the application, the assistant may involve the user in one or more steps of the workflow (human-in-the-loop), adjusting the plan as necessary based on user feedback.
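
To make this flow concrete, here is a minimal, framework-agnostic sketch in Python. The `call_llm` function and the two tools are hypothetical placeholders, not a real provider API.

```python
# Hypothetical sketch of the three-step flow above; `call_llm` and the tools
# are placeholders, not a real provider API.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM your stack uses."""
    raise NotImplementedError

TOOLS = {
    "search_documents": lambda q: f"[document chunks relevant to: {q}]",
    "call_crm_api": lambda q: f"[CRM records relevant to: {q}]",
}

def answer(query: str) -> str:
    # Step 1: understand the query - extract intent and decide the strategy.
    intent = call_llm(f"Name the single best tool in {list(TOOLS)} for: {query}")

    # Step 2: formulate and execute a (here, single-step) plan.
    tool = TOOLS.get(intent.strip(), TOOLS["search_documents"])
    evidence = tool(query)

    # Step 3: synthesise a response grounded in the gathered evidence.
    return call_llm(f"Question: {query}\nEvidence: {evidence}\nAnswer concisely.")
```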


Each of these steps represents an opportunity for optimisation. Lifting the performance of AI assistants involves a multi-faceted approach that can be broadly categorised into three themes: improving information retrieval, enhancing model performance, and optimising solution architecture.

Advanced Agent Flow | Credit: LlamaIndex

4. Optimising Information Retrieval

4.1. Introduction to Retrieval-Augmented Generation (RAG)

One of the significant limitations of large language models (LLMs) is the length of their context window. The context window refers to the maximum size of the information (instructions and query) that can be given to the LLM at one time. The latest models boast large context windows, such as GPT-4o with a context window of 128,000 tokens (roughly equivalent to 400 pages of text). However, most models are biased towards content at the start and end of their input and tend to overlook the middle, a phenomenon known as "lost in the middle" and commonly probed with "needle in a haystack" tests. Recent research has shown a significant discrepancy between claimed and effective context window lengths [1].


This limitation has led to the development of the Retrieval-Augmented Generation (RAG) approach, which aims to overcome the constraints of context windows by combining information retrieval with language generation. In a RAG system, the process typically involves two main stages:


Step 1 - Retrieval: The system retrieves relevant documents or information from an external knowledge base based on the user's query. This step leverages information retrieval techniques to identify the most pertinent data that can help answer the query.


Step 2 - Generation: The retrieved information is then fed into a language model to generate a coherent and contextually appropriate response. By augmenting the language model with external knowledge, RAG systems can provide more accurate and comprehensive answers, even for complex queries that exceed the context window limitations of the LLM.


The RAG approach effectively extends the capabilities of LLMs by allowing them to access a broader range of information. This makes RAG a powerful technique for building advanced AI assistants that can handle a wide variety of queries with high accuracy and relevance.

Retrieval Augmented Generation | Credit: Microsoft

4.2. Managing and Retrieving Information in a RAG System

In a RAG system, documents are loaded and indexed as vectors (numerical representations) using an embedding model. Here's a step-by-step overview of how information is managed and retrieved:


Step 1 - Document Loading and Indexing: Documents are first loaded into the system. Each document is then split into smaller chunks, which are converted into vector representations using an embedding model. These vectors capture the semantic meaning of the text and are stored in an index for efficient retrieval.


Step 2 - Semantic Search: When a user query is received, the system performs a semantic search to find the most relevant content chunks. The query is also converted into a vector representation, and the system searches the index to find chunks with similar vectors. This process leverages the semantic similarity between the query and the document chunks to identify the most relevant information.


Step 3 - Context Window Management: The retrieved content chunks are then loaded into the LLM's context window. Since the context window has a limited size, the system must carefully select and organise the chunks to ensure that the most relevant information is included. This step is crucial for generating accurate and contextually appropriate responses.
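
The following sketch ties these three steps together with a toy in-memory index and cosine similarity. The `embed` and `call_llm` functions are hypothetical placeholders for your embedding model and LLM; a production system would use a vector database rather than a Python list.

```python
# Minimal sketch of steps 1-3 above. `embed` and `call_llm` are hypothetical
# placeholders; a real system would use a vector database, not an in-memory list.
import numpy as np

def embed(text: str) -> np.ndarray: ...
def call_llm(prompt: str) -> str: ...

def chunk(document: str, size: int = 500) -> list[str]:
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_index(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    # Step 1: split documents into chunks and store (chunk, vector) pairs.
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(query: str, index: list[tuple[str, np.ndarray]], k: int = 4) -> list[str]:
    # Step 2: semantic search - rank chunks by cosine similarity to the query.
    q = embed(query)
    scored = [(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), c)
              for c, v in index]
    return [c for _, c in sorted(scored, reverse=True)[:k]]

def answer(query: str, index: list[tuple[str, np.ndarray]]) -> str:
    # Step 3: context window management - pass only the top-k chunks to the LLM.
    context = "\n\n".join(retrieve(query, index))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```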

4.3. Building Performant RAG Applications for Production

4.3.1. Basic Retrieval Strategies

While RAG is a powerful technique, making RAG applications robust and scalable can be challenging. Here are some general techniques to improve RAG workflows:


Decoupling Chunks for Retrieval vs. Synthesis: The optimal chunk representation for retrieval might differ from that used for synthesis. For instance, a raw text chunk might contain essential details for generating a detailed answer but also include filler words that could bias the embedding. By decoupling the chunks used for retrieval from those used for synthesis, you can optimise each process independently. This involves creating smaller, more focused chunks for retrieval and using larger, more context-rich chunks for synthesis.


Structured Retrieval for Larger Document Sets: As the number of documents scales, structured retrieval can help ensure more precise results. This involves tagging documents with metadata and using hierarchical retrieval strategies. For example, you might first retrieve relevant document summaries and then drill down into the specific sections of those documents. This hierarchical approach can significantly improve retrieval accuracy and efficiency.


Dynamic Retrieval Based on Task: Different queries may require different retrieval techniques. For example, fact-based questions might benefit from top-k similarity (finding the top k relevant chunks), while summarisation tasks might require a different approach. By dynamically adjusting the retrieval strategy based on the nature of the query, you can improve the relevance and accuracy of the responses.


Optimising Context Embeddings: Fine-tuning embedding models to better capture the salient properties of the data can significantly improve retrieval performance. Pre-trained models may not fully capture the nuances of your specific dataset, so fine-tuning can help create more meaningful embeddings that improve retrieval accuracy.


Hybrid Search: Combining semantic search (embedding similarity) with keyword search to enhance retrieval accuracy. Semantic search uses embeddings to find text chunks that are contextually similar to the query, while keyword search matches specific terms. By combining these approaches, hybrid search addresses the limitations of embeddings and ensures that text chunks with matching keywords are also considered during retrieval.
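
As an illustration of hybrid search, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion. The `keyword_search` and `semantic_search` functions are assumed to exist and to return chunk identifiers ordered by relevance.

```python
# A minimal sketch of hybrid search using reciprocal rank fusion (RRF) to merge
# a keyword ranking with a semantic ranking. The two search functions are
# assumed inputs (e.g. BM25 over raw text and cosine similarity over embeddings).

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Chunks ranked highly by either method accumulate a larger score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, keyword_search, semantic_search, top_k: int = 5) -> list[str]:
    keyword_hits = keyword_search(query)    # keyword/BM25 ranking
    semantic_hits = semantic_search(query)  # embedding-similarity ranking
    return reciprocal_rank_fusion([keyword_hits, semantic_hits])[:top_k]
```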

4.3.2. Advanced Retrieval Strategies

Reranking: Adjusting the order of retrieved documents based on additional criteria to improve relevance. For example, you might use a secondary model (a cross-encoder or another custom model) to rerank the initial search results based on their relevance to the query. This can help surface the most relevant documents at the top of the search results.
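
As a hedged example of reranking, the sketch below rescores first-stage candidates with a cross-encoder from the sentence-transformers library; the checkpoint name is illustrative and should be chosen to suit your domain.

```python
# Rerank first-stage retrieval results with a cross-encoder. The model
# checkpoint is an example; pick one appropriate to your data.
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # Score each (query, candidate) pair jointly - slower than embedding search,
    # so apply it only to the short list produced by the first-stage retriever.
    scores = model.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```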


Recursive Retrieval: Iteratively refining retrieval results to enhance accuracy. This involves running multiple rounds of retrieval, each time using the results of the previous round to refine the search. Recursive retrieval can be particularly effective for complex queries that require multiple steps to answer.


Embedded Tables: Using structured data to support more complex queries. For example, if your dataset includes tables with numerical data, you can use these tables to answer quantitative queries more accurately. Embedding tables in your retrieval system allows you to leverage structured data alongside unstructured text.


Small-to-Big Retrieval: Retrieving small, focused chunks first and then expanding to their larger parent chunks for synthesis. Because smaller chunks produce sharper embeddings, the initial search is more precise; once the best matches are found, the surrounding or parent context is passed to the LLM so that important information spanning chunk boundaries is not lost. This complements the decoupling of retrieval and synthesis chunks described earlier.


Query Transformations: Transforming user queries before processing to improve retrieval effectiveness. Techniques like Hypothetical Document Embeddings (HyDE) generate a hypothetical document based on the query, which is then used for embedding lookup [2]. This can help bridge the gap between the user's query and the available documents, improving retrieval accuracy. Multi-step query transformations can break down complex queries into manageable sub-questions, each of which can be answered more easily.

HyDE Method [2]
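
A minimal sketch of the HyDE idea is shown below: the LLM drafts a hypothetical answer, and that draft's embedding drives the vector search. The `call_llm`, `embed`, and `search_index` functions are hypothetical placeholders for your model and vector store.

```python
# Hypothetical sketch of HyDE-style retrieval; all three helper functions are
# placeholders for your own LLM, embedding model and vector store.
import numpy as np

def call_llm(prompt: str) -> str: ...
def embed(text: str) -> np.ndarray: ...
def search_index(vector: np.ndarray, k: int) -> list[str]: ...

def hyde_retrieve(query: str, k: int = 5) -> list[str]:
    # 1. Generate a plausible (possibly imperfect) answer to the query.
    hypothetical_doc = call_llm(
        f"Write a short passage that answers the question:\n{query}"
    )
    # 2. Embed the hypothetical document and search with that vector; real
    #    documents that answer the query tend to sit near it in embedding space.
    return search_index(embed(hypothetical_doc), k)
```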

4.3.3. Embeddings Model Selection

Choosing the Right Embedding Model: The choice of embedding model can significantly impact the performance of the AI assistant. Different models have different strengths and weaknesses, and selecting the right one for your specific use case is crucial. For instance, some models might be better suited for certain languages, while others might excel in specific domains.


Re-indexing Data: When changing embedding models, it's essential to re-index the data to maintain consistency. The embeddings generated by different models can vary significantly, and using inconsistent embeddings can lead to poor retrieval performance. By re-indexing the data, you ensure that the embeddings are consistent and aligned with the new model.

4.3.4. Chunk Sizes

Customising chunk sizes and overlaps can influence retrieval precision. Smaller chunks provide more precise embeddings, while larger chunks capture broader context. The default chunk size might not be optimal for all datasets, so experimenting with different sizes can help improve retrieval performance. Additionally, adjusting the overlap between chunks can help ensure that important context is not lost when splitting the text.
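
For illustration, the character-level splitter below shows how chunk size and overlap interact; real pipelines typically split on sentence or token boundaries instead.

```python
# A simple character-level sliding-window chunker illustrating the size and
# overlap trade-off; production splitters usually work on sentences or tokens.

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # The final `overlap` characters of each chunk are repeated at the start of
    # the next chunk, so context spanning a boundary is not lost.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

document = "..."  # your document text
precise_chunks = chunk_text(document, chunk_size=256, overlap=32)       # sharper embeddings
contextual_chunks = chunk_text(document, chunk_size=1024, overlap=128)  # broader context
```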


5. Enhancing Model Performance

Optimising the performance of AI assistants is not solely about improving information retrieval. Enhancing the underlying models themselves is equally crucial. This involves fine-tuning models and employing prompt engineering to ensure that the AI assistant can generate accurate, relevant, and appropriate responses from the context information.

5.1. Prompt Engineering

Customising prompts can significantly impact the performance of language models. Prompt engineering involves carefully crafting the input prompts to guide the model's responses and reduce hallucinations. This can include providing context, specifying the format of the desired output, and including examples of correct answers (few-shot examples). Advanced prompt engineering techniques might involve dynamically generating prompts based on the user's query or the retrieved documents.


Customising Prompts: Simple adjustments to the wording of prompts can have a substantial effect on the quality of the responses. For instance, specifying the desired format or including additional context can help the model generate more accurate and relevant answers.


Advanced Prompts: For more complex tasks, advanced prompt engineering techniques can be employed. This might involve dynamically injecting few-shot examples or processing the injected inputs to better align with the desired output. Providing the model with examples of the kind of responses you expect can significantly improve its reliability.
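
The sketch below shows a simple few-shot prompt template of the kind described above; the instruction text and example pairs are illustrative placeholders you would replace with your own.

```python
# A minimal few-shot prompt template: the system instruction pins down format
# and tone, and examples are injected dynamically. Example content is illustrative.

FEW_SHOT_EXAMPLES = [
    {"question": "What is our refund window?", "answer": "30 days from delivery."},
    {"question": "Do you ship internationally?", "answer": "Yes, to 40+ countries."},
]

def build_prompt(context: str, query: str) -> str:
    examples = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['answer']}" for ex in FEW_SHOT_EXAMPLES
    )
    return (
        "You are a customer-support assistant. Answer in at most two sentences, "
        "using only the provided context. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Examples:\n{examples}\n\n"
        f"Context:\n{context}\n\n"
        f"Q: {query}\nA:"
    )
```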

5.2. Fine-tuning

Fine-tuning involves taking a pre-trained model and further training it on a specific dataset that is representative of the target use case. This process allows the model to learn the nuances and specific patterns of the new data, improving its performance for the targeted tasks. Fine-tuning language models can help them learn specific styles, correct hallucinations, and distil knowledge from more advanced models.

Base Vs Fine-tuned Model Performance | Credit: Predibase

5.2.1. How Fine-tuning Works

Step 1 - Pre-trained Model Selection: The process begins by selecting a pre-trained model that serves as the base. This model has already been trained on a large and diverse dataset, providing a solid foundation of general knowledge and language understanding.


Step 2 - Dataset Preparation: The next step involves preparing a fine-tuning dataset that closely matches the specific use case. This dataset should include examples that reflect the types of queries and responses the AI assistant will encounter in the real world.


Step 3 - Fine-tuning Process: The pre-trained model is then further trained on the fine-tuning dataset. During this process, the model adjusts its weights based on the new data, learning to generate more accurate and contextually appropriate responses for the specific tasks. This training can be supervised (using labelled data) or unsupervised (using unlabelled data).


Step 4 - Evaluation and Iteration: After fine-tuning, the model is evaluated using a separate validation dataset to assess its performance. Based on the evaluation results, further adjustments and iterations may be necessary to optimise the model's performance.
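
As a hedged illustration of steps 1 to 4, the sketch below fine-tunes a small open model with the Hugging Face transformers Trainer. The base model, dataset file, and hyperparameters are assumptions chosen for brevity, not recommendations.

```python
# Hedged sketch of supervised fine-tuning with Hugging Face transformers.
# Base model, dataset path and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Step 1: select a pre-trained base model.
base = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Step 2: prepare a dataset that mirrors the assistant's real queries
# (assumed here to be a JSONL file with a "text" field per example).
dataset = load_dataset("json", data_files="assistant_examples.jsonl")["train"]
tokenised = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)
split = tokenised.train_test_split(test_size=0.1)

# Steps 3 and 4: fine-tune on the new data, then evaluate on a held-out split.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-assistant", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
print(trainer.evaluate())
```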

5.2.2. Benefits of Fine-tuning

Improved Performance for Specific Use Cases: A fine-tuned model focuses on a specific use case and can outperform a much larger general-purpose model. By tailoring the model to the specific requirements of the task, fine-tuning enhances its accuracy and relevance.


Faster Responses: Fine-tuned models often provide faster response times, because a smaller model fine-tuned for a narrow task can match the quality of a much larger general-purpose model while requiring less computation per query.


Reduced Energy Consumption: Fine-tuned models typically require less computational power and energy compared to larger, general-purpose models. This makes them more sustainable and cost-effective to deploy, especially in resource-constrained environments.

5.2.3. Fine-tuning Specific Components

Fine-tuning can be applied to various components of the AI assistant to enhance performance. Here are some key use cases:


Fine-tuning for Distillation: Using advanced models to generate training data for fine-tuning less capable models. This can help the less capable models produce higher-quality outputs by learning from the more advanced models.


Fine-tuning for Better Structured Outputs: Enhancing models to produce more accurate structured data outputs. This can involve training the model on datasets that include structured data, such as tables or JSON objects, to improve its ability to generate similar outputs.


Fine-tuning for Better Text-to-SQL: Training models on text-to-SQL datasets for improved structured analytics. This can help the model generate more accurate and relevant SQL queries based on natural language inputs.


Fine-tuning Embedding Models: Fine-tuning embedding models can lead to more meaningful representations, improving retrieval accuracy. This involves training the model on a specific dataset to capture the nuances of the data. By fine-tuning the embeddings, you can create more meaningful representations of the text, which can improve the accuracy and relevance of the retrieved documents.
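
As a hedged example of fine-tuning an embedding model, the sketch below trains a sentence-transformers checkpoint on in-domain (query, passage) pairs with a contrastive loss. The base model and example pairs are illustrative assumptions.

```python
# Hedged sketch of embedding fine-tuning with sentence-transformers; the base
# checkpoint and training pairs are illustrative placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each example pairs a realistic user query with the passage that answers it;
# other passages in the batch act as implicit negatives.
train_examples = [
    InputExample(texts=["How long is the refund window?",
                        "Customers may return items within 30 days of delivery."]),
    InputExample(texts=["Do you ship overseas?",
                        "We ship to more than 40 countries worldwide."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("domain-tuned-embedder")  # remember to re-index your corpus afterwards
```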


Fine-tuning Evaluators: Distilling the evaluation capabilities of advanced models into less capable ones can improve evaluation efficiency and reduce costs. This involves training a less capable model to perform evaluations based on the outputs of a more advanced model, helping it learn to assess the quality of the generated responses.


Fine-tuning Cross-Encoders for Re-Ranking: Fine-tuning cross-encoders can enhance re-ranking performance, ensuring that the most relevant results are prioritised. This involves training the cross-encoder on a dataset of query-result pairs to improve its ability to rank the results based on their relevance to the query.


Custom Rerankers: Training custom rerankers tailored to specific datasets can further improve retrieval accuracy. This involves creating a reranker that is specifically trained on your dataset, helping it better understand the nuances of the data and improve the relevance of the search results.


6. Adopting a Multi-Agent Strategy

A multi-agent architecture enables combining the capabilities of specialised agents, each using their own fine-tuned models and optimised RAG pipelines, to create a high-performing system that surpasses the capability of any single LLM. By leveraging multiple agents, each designed for specific tasks, an AI assistant can achieve a level of performance that is akin to having access to a hypothetical "GPT-5" level model. This approach allows for a more flexible, scalable, and efficient system that can handle a wide range of queries with high accuracy and relevance.

Multi-Agent Framework | Credit: Microsoft

6.1. Multi-Agent System Architecture

Utilising a modular approach to orchestrate and execute tasks is at the core of a multi-agent strategy. In this architecture, one agent manages the overall workflow, while other specialised agents handle individual steps. This separation of responsibilities allows for more efficient task management and execution:


Orchestrator Agent: The orchestrator agent manages the overall workflow, including state management (such as conversational memory), task creation, and task execution. It provides the high-level interface for users to interact with the system and coordinates the activities of specialised agents.


Executor Agents: These agents are responsible for executing individual steps within a task. Given an input step, an executor agent generates the next step. These agents can be initialised with specific parameters and act upon state information passed down from the orchestrator but do not inherently store state themselves.


There are different methods to configure the orchestrator agent, each suited to different use cases. The choice of configuration can significantly impact the performance and capabilities of the AI assistant. We will examine some of these methods below.
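
Before looking at those methods, the sketch below illustrates the orchestrator/executor split in a few lines of Python. The routing prompt, agent names, and `call_llm` placeholder are illustrative assumptions rather than any particular framework's API.

```python
# Minimal sketch of the orchestrator/executor split: the orchestrator owns
# conversational state and routing, while executor agents are stateless
# callables that act on the state passed to them.
from dataclasses import dataclass, field
from typing import Callable

def call_llm(prompt: str) -> str: ...

def general_agent(query: str, history: list[str]) -> str:
    return call_llm(f"Conversation so far: {history}\nAnswer: {query}")

def analytics_agent(query: str, history: list[str]) -> str:
    return call_llm(f"Write and explain a SQL query that answers: {query}")

@dataclass
class Orchestrator:
    executors: dict[str, Callable[[str, list[str]], str]]
    memory: list[str] = field(default_factory=list)  # conversational state lives here

    def handle(self, query: str) -> str:
        # Decide which specialised agent should take this step.
        route = call_llm(f"Pick one of {list(self.executors)} for: {query}").strip()
        agent = self.executors.get(route, self.executors["general"])
        result = agent(query, self.memory)  # state is passed down, not stored in the agent
        self.memory.append(f"user: {query} | assistant: {result}")
        return result

assistant = Orchestrator(executors={"general": general_agent, "analytics": analytics_agent})
```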

6.2. Advanced Agentic Strategies

6.2.1. Chain-of-Abstraction

This involves breaking down complex tasks into a series of steps and executing them in a structured manner. By prompting the LLM in a chain-of-thought format, the system can execute both simple and complex combinations of actions needed to complete a task [3].


This strategy is ideal for applications that require detailed procedural execution, such as technical support or complex data analysis tasks. Additionally, this approach involves planning and decomposing tasks into sub-tasks for more manageable execution, allowing the system to handle complex queries more efficiently and accurately.

Chain-of-Thought Prompting [3]
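
As a simple illustration, the template below embeds one worked example with explicit intermediate steps, nudging the model to decompose new questions the same way; the example content is illustrative.

```python
# A minimal chain-of-thought style prompt in the spirit of [3]: the worked
# example demonstrates intermediate reasoning before the final answer.

COT_PROMPT = """\
Q: A customer ordered 3 items at $12 each and has a $5 discount voucher.
   What do they pay?
A: Let's think step by step.
   1. 3 items at $12 each cost 3 * 12 = $36.
   2. Applying the $5 voucher gives 36 - 5 = $31.
   The customer pays $31.

Q: {question}
A: Let's think step by step.
"""

def build_cot_prompt(question: str) -> str:
    return COT_PROMPT.format(question=question)
```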

6.2.2. Introspective Agents

Introspective agents iteratively improve their responses by reflecting on their outputs and making corrections. This involves generating an initial response to a task and then iteratively executing reflection and correction cycles until a satisfactory result is achieved. The ReAct (Reasoning and Acting) agent model serves as an example of an introspective agent [4].


ReAct integrates reasoning and action in language models to enhance decision-making capabilities. It dynamically generates reasoning traces and actions, improving the accuracy and relevance of the responses. The benefits of the ReAct model include improved reasoning and decision-making, integration with external tools, greater synergy between reasoning traces and actions, adaptability and resilience through real-time self-correction, and transparency in decision-making, which enables human oversight. This approach is particularly useful for tasks that require high accuracy and quality, such as content moderation or sensitive data handling.

ReAct vs Other Methods [4]
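
The sketch below shows a much-simplified ReAct-style loop in which the model alternates between thoughts, tool actions, and observations until it produces a final answer. The parsing, prompt wording, and tool set are illustrative placeholders.

```python
# Simplified ReAct-style loop [4]: Thought -> Action -> Observation, repeated
# until a final answer is produced. `call_llm` and the tools are placeholders.

def call_llm(prompt: str) -> str: ...

TOOLS = {"search": lambda q: f"[search results for: {q}]"}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(
            "Continue with 'Thought: ...' then either 'Action: tool[input]' "
            "or 'Final Answer: ...'.\n" + transcript
        )
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Execute the requested tool, e.g. "Action: search[refund policy]",
            # and feed the observation back into the next iteration.
            call = step.split("Action:", 1)[1].strip()
            tool, _, arg = call.partition("[")
            observation = TOOLS.get(tool.strip(), lambda q: "unknown tool")(arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "No answer found within the step budget."
```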

6.2.3. Language Agent Tree Search (LATS)

This involves using a Monte Carlo tree search (MCTS) algorithm to explore different possible actions and select the most promising one. LATS allows for deliberate and adaptive problem-solving guided by external feedback and self-reflection [5]. This strategy is well-suited for applications that require adaptive problem-solving and strategic planning, such as game playing or strategic decision-making systems.

Language Agent Tree Search [5]


7. Evaluation

Effective evaluation is critical to optimising the performance of AI assistants. Evaluation can be broadly categorised into component-wise evaluation and end-to-end evaluation. Balancing both approaches ensures both holistic and granular performance insights.

7.1. Component-Wise Evaluation

This involves breaking down the AI assistant's workflow into individual components and assessing each part separately. Components can include the retrieval model, language model, orchestrator agent, and executor agents. This granular approach helps identify specific areas that may require optimisation.

7.1.1. Evaluating the RAG Pipeline

Retrieval performance can be evaluated using benchmarks like BEIR, which spans a diverse set of domains and helps determine how well a retrieval model generalises beyond its training distribution. After fine-tuning an embedding model on your own dataset, evaluating it on BEIR can reveal whether the fine-tuning has degraded its ability to generalise to out-of-domain queries [6].
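
For a concrete starting point, the sketch below computes two standard retrieval metrics, recall@k and mean reciprocal rank, from relevance judgements and ranked results; the data structures are illustrative, though benchmarks such as BEIR provide qrels in a similar form.

```python
# Compute recall@k and MRR from relevance judgements (qrels) and ranked results.
# Data structures are illustrative: qrels maps query IDs to sets of relevant
# document IDs; results maps query IDs to ranked lists of retrieved document IDs.

def recall_at_k(qrels: dict[str, set[str]], results: dict[str, list[str]], k: int = 10) -> float:
    scores = []
    for qid, relevant in qrels.items():
        retrieved = set(results.get(qid, [])[:k])
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

def mean_reciprocal_rank(qrels: dict[str, set[str]], results: dict[str, list[str]]) -> float:
    reciprocal_ranks = []
    for qid, relevant in qrels.items():
        rank = next((i + 1 for i, doc in enumerate(results.get(qid, [])) if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```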

7.1.2. Evaluating LLM Performance

Evaluating the performance of large language models (LLMs) is crucial for ensuring that they generate accurate, relevant, and contextually appropriate responses. Here are some key aspects to consider:


Accuracy: Measure how well the LLM provides correct answers to factual questions. This can be assessed using benchmarks like SQuAD (Stanford Question Answering Dataset) or other domain-specific datasets [7].


Relevance: Evaluate the relevance of the responses to the user's query. This involves assessing whether the generated responses are contextually appropriate and meet the user's needs. Metrics such as BLEU (Bilingual Evaluation Understudy) [8] and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9] can be useful for this purpose.


Coherence: Assess the coherence and fluency of the generated text. This involves evaluating whether the responses are logically structured and easy to understand. Human evaluation or automated metrics like perplexity can be used for this purpose.


Bias and Fairness: Ensure that the LLM's responses do not perpetuate harmful biases or stereotypes. This involves evaluating the model's outputs for fairness and inclusivity, using tools like the AI Fairness 360 toolkit [10].


Robustness: Test the LLM's ability to handle adversarial inputs or ambiguous queries. This involves evaluating how well the model can maintain performance under challenging conditions, for example by attempting to "jailbreak" the LLM with prompt instructions injected into the query.
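
As a small, self-contained example of the accuracy measures mentioned above, the sketch below implements exact match and token-level F1 in the style used for SQuAD evaluation [7]; the normalisation is deliberately simplified.

```python
# Exact match and token-level F1, simplified from SQuAD-style answer scoring.
import re
from collections import Counter

def normalise(text: str) -> list[str]:
    # Lowercase and strip punctuation before comparing tokens.
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalise(prediction) == normalise(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalise(prediction), normalise(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```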

7.1.3. Evaluating Orchestrator Agents

Assessing orchestrator agents involves evaluating how well they manage the overall workflow, including state management and task execution. This can be done by analysing their efficiency in coordinating the activities of executor agents and ensuring seamless task completion.

7.1.4. Evaluating Executor Agents

Executor agents are responsible for performing individual steps within a task. Evaluating these agents involves assessing their ability to generate accurate and contextually appropriate responses based on the input they receive from the orchestrator agent. The HotpotQA dataset, for example, is useful for testing multi-step queries that require multiple retrieval steps [11]. This helps ensure that the executor agents can handle complex queries effectively.

7.2. End-to-End Evaluation

7.2.1. Setting Up an Evaluation Set

Creating a diverse evaluation set is crucial for comprehensive end-to-end evaluation. This set should include a variety of queries that reflect the different types of interactions the AI assistant will encounter. Tools for automatically generating datasets based on documents can help streamline this process.


End-to-end testing can be expensive, so you may wish to explore the "metrics ensembling" method. This uses an ensemble of weaker signals (such as exact match, F1, ROUGE, BLEU, BERT-NLI, and BERT-similarity) to predict the output of more expensive evaluation methods, like human evaluation or GPT-4 assessments. This technique is useful for evaluating changes quickly and cheaply during development and for flagging outliers for further evaluation in production.
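
A minimal sketch of metrics ensembling is shown below: a simple regression model maps cheap signals to an expensive quality score, which can then be predicted for new outputs. The feature values and ratings are illustrative placeholders.

```python
# Metrics ensembling sketch: fit a simple model that maps cheap signals
# (e.g. exact match, token F1, ROUGE, embedding similarity) to an expensive
# score such as a human rating. All values below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge

# Rows: one evaluated response; columns: cheap metric values for that response.
cheap_signals = np.array([
    [1.0, 0.92, 0.81, 0.88],
    [0.0, 0.34, 0.40, 0.55],
    [0.0, 0.61, 0.58, 0.70],
])
human_scores = np.array([4.8, 2.1, 3.4])  # expensive ratings for the same responses

ensemble = Ridge(alpha=1.0).fit(cheap_signals, human_scores)

# During development, predict the expensive score from cheap metrics alone and
# flag low-scoring or unusual outputs for full evaluation.
new_signals = np.array([[1.0, 0.75, 0.66, 0.80]])
predicted_quality = ensemble.predict(new_signals)
```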

7.2.2. Spectrum of Evaluation Options

Evaluation options range from quantitative metrics to qualitative assessments:


Quantitative Evaluation: Useful for applications with clear correct answers, such as validating tool inputs or retrieving specific information.


Qualitative Evaluation: More appropriate for tasks that require long-form, helpful responses. Combining both approaches provides a balanced view of the AI assistant's performance.

7.2.3. Sensitivity Testing

Sensitivity testing helps identify which components of the workflow are affecting results. By systematically varying inputs and observing the impact on outputs, you can discover issues that may not be apparent through standard evaluation methods. This approach helps prioritise which components to test or tweak more thoroughly.
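
The sketch below illustrates one way to run such a test: re-score the same evaluation set while varying one setting at a time and compare each variant against a baseline. The `run_pipeline` and `score` functions are hypothetical stand-ins for your assistant and evaluation harness.

```python
# Sensitivity testing sketch: vary one pipeline setting at a time and measure
# the change in average score against a baseline. `run_pipeline` and `score`
# are hypothetical stand-ins for your assistant and evaluation function.

def run_pipeline(query: str, **settings) -> str: ...
def score(answer: str, reference: str) -> float: ...

def sensitivity_report(eval_set: list[tuple[str, str]],
                       variations: dict[str, dict]) -> dict[str, float]:
    def average(settings: dict) -> float:
        return sum(score(run_pipeline(q, **settings), ref) for q, ref in eval_set) / len(eval_set)

    baseline = average({})
    # Large deltas (positive or negative) flag the components the workflow is
    # most sensitive to, and therefore most worth tuning or testing further.
    return {name: average(settings) - baseline for name, settings in variations.items()}

variations = {
    "larger_chunks": {"chunk_size": 1024},
    "no_reranker": {"rerank": False},
    "top_k_2": {"top_k": 2},
}
```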


In summary

Optimising the performance of AI assistants is complex but can significantly improve accuracy and relevance, while enhancing response speed and reducing energy footprint.

Key Takeaways

  • Techniques such as RAG can significantly enhance the retrieval process, ensuring that the AI assistant provides accurate and relevant responses.
  • Fine-tuning models and employing prompt engineering can lead to significant performance gains.
  • Leveraging specialised agents for different tasks can create a high-performing system that surpasses the capability of any single LLM.
  • Systematic evaluation, both component-wise and end-to-end, is crucial for identifying areas for improvement and ensuring that the AI assistant meets performance standards.

As AI technology evolves, staying updated with the latest advancements and incorporating them into your optimisation strategies will be key to creating state-of-the-art AI assistants that deliver human-level performance.

Frequently Asked Questions

1. What is the main advantage of using Retrieval-Augmented Generation (RAG) in AI assistants?

The main advantage of using RAG is that it combines information retrieval with language generation, allowing AI assistants to access a broader range of information beyond their context window limitations. This results in more accurate and comprehensive answers, even for complex queries.

2. How can fine-tuning improve the performance of AI assistants?

Fine-tuning involves further training a pre-trained model on a specific dataset that closely matches the target use case. This process helps the model learn the nuances and specific patterns of the new data, improving its accuracy, relevance, and response speed for the specific tasks it will handle.

3. What are the key components of a multi-agent system architecture for AI assistants?

A multi-agent system architecture typically includes an orchestrator agent that manages the overall workflow and state management, and executor agents that handle individual steps within a task. This modular approach allows for more efficient task management and execution, leveraging specialised agents for specific tasks.

4. Why is it important to balance accuracy and efficiency in AI assistants?

Balancing accuracy and efficiency is crucial because more complex models may provide more accurate responses but at the cost of increased computational resources and slower response times. Conversely, optimising for speed and resource usage might compromise the quality of the responses. A balanced approach ensures that the AI assistant delivers high-quality answers while maintaining fast response times and sustainable resource usage.

5. How can evaluation methods ensure the effectiveness of AI assistants?

Evaluation methods, including component-wise and end-to-end evaluation, help identify specific areas for optimisation and ensure that all components work together seamlessly. Techniques like sensitivity testing and metrics ensembling can provide insights into the performance of different components and flag outliers for further evaluation, ensuring the AI assistant meets performance standards and user expectations.

6. What are some advanced retrieval strategies for improving the performance of RAG systems?

Advanced retrieval strategies for RAG systems include reranking, recursive retrieval, embedded tables, small-to-big retrieval, and query transformations. These strategies help refine the retrieval process, ensuring that the most relevant and accurate information is used to generate responses.

7. How does prompt engineering enhance the performance of language models in AI assistants?

Prompt engineering involves carefully crafting input prompts to guide the model's responses and reduce hallucinations. This can include providing context, specifying the format of the desired output, and including examples of correct answers. Advanced prompt engineering techniques might involve dynamically generating prompts based on the user's query or the retrieved documents.

8. What is the role of embeddings in the performance of AI assistants?

Embeddings capture the semantic meaning of text and are used to represent documents and queries as vectors for efficient retrieval. Fine-tuning embedding models can lead to more meaningful representations, improving retrieval accuracy. Customising chunk sizes and overlaps can also influence retrieval precision, with smaller chunks providing more precise embeddings and larger chunks capturing broader context.

9. What are introspective agents, and how do they improve AI assistant performance?

Introspective agents iteratively improve their responses by reflecting on their outputs and making corrections. This involves generating an initial response to a task and then iteratively executing reflection and correction cycles until a satisfactory result is achieved. Introspective agents, like the ReAct model, integrate reasoning and action to enhance decision-making capabilities and improve response quality.

10. Why is it important to consider both quantitative and qualitative evaluation methods for AI assistants?

Quantitative evaluation methods are useful for applications with clear correct answers, while qualitative evaluation methods are more appropriate for tasks requiring long-form, helpful responses. Combining both approaches provides a balanced view of the AI assistant's performance, ensuring it meets user expectations and delivers high-quality, contextually appropriate responses.


References

[1] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y. and Ginsburg, B. (n.d.). RULER: What’s the Real Context Size of Your Long-Context Language Models? [online] Available at: https://arxiv.org/pdf/2404.06654 [Accessed 5 Aug. 2024].


[2] Gao, L., Ma, X., Lin, J. and Callan, J. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv (Cornell University). doi:https://doi.org/10.18653/v1/2023.acl-long.99.


[3] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. and Zhou, D. (n.d.). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. [online] Available at: https://arxiv.org/pdf/2201.11903.


[4] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. and Cao, Y. (n.d.). ReAct: Synergizing Reasoning and Acting in Language Models. [online] Available at: https://arxiv.org/pdf/2210.03629.


[5] Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H. and Wang, Y.-X. (2023). Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models. [online] arXiv.org. Available at: https://arxiv.org/abs/2310.04406 [Accessed 5 Aug. 2024].


[6] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A. and Gurevych, I. (2021). BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663 [cs]. [online] Available at: https://arxiv.org/abs/2104.08663.


[7] Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P. (n.d.). SQuAD: 100,000+ Questions for Machine Comprehension of Text. [online] Available at: https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf.


[8] Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. [online] Available at: https://aclanthology.org/P02-1040.pdf.


[9] Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. [online] Available at: https://aclanthology.org/W04-1013.pdf.

[10] IBM Research (n.d.). AI Fairness 360. [online] Available at: https://aif360.res.ibm.com/.


[11] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., Salakhutdinov, R. and Manning, C.D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600 [cs]. [online] Available at: https://arxiv.org/abs/1809.09600.

About Cognis

Cognis helps organisations to transition to an AI-powered future.


We equip and enable you to harness the power of AI to create new revenue streams, reimagine customer experiences, and transform operations.
