The rapid adoption of generative AI is transforming industries worldwide. According to McKinsey's 2024 report, AI adoption has surged to 72% of organizations, with 65% of respondents reporting regular use of generative AI in at least one business function.
This rapid growth underscores the need for cost-efficient, customizable, and privacy-focused solutions such as small open-source large language models (LLMs). However, as highlighted in our benchmark results, these models often fall short of their larger counterparts in performance. To address this, we explored fine-tuning and prompt optimization as strategies to close the gap.
Small open-source LLMs are publicly accessible AI models with relatively few parameters (under 10 billion), making them deployable on standard hardware such as consumer-grade PCs. Their accessibility makes them an attractive option for organizations and individuals with limited technical resources.
In this article, we focus on optimizing three popular small open-source models: Qwen 2.5-7B (Alibaba), Llama 3.1-8B, and Llama 3.2-3B (Meta).
We assessed the models using Recursive’s proprietary Flow Benchmark Tools, designed to evaluate models on real-world tasks. The benchmark used a dataset of Japanese government documents paired with challenging questions to test two critical capabilities: document analysis and question answering with retrieval-augmented generation (RAG).
Flow Benchmark Tools evaluated model performance across English and Japanese, using a scoring system from 0 (worst) to 10 (perfect).
Prompt optimization involves refining how instructions or input text are presented to an AI model. Think of it as crafting a clear, well-structured directive to guide the model’s response. This low-cost strategy is often the first step to improving performance without altering the model itself.
In our experiments, the optimizations were focused on document analysis, as question-answering with RAG relies more heavily on retrieval quality, which is less influenced by prompt adjustments.
The position of the query within the prompt significantly influenced results. For example, placing the query after the document block for the Qwen 2.5-7B model improved its score from 6.35 to 6.88. This demonstrates the value of careful prompt design in enhancing performance.
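The exact benchmark prompts are not reproduced here; the sketch below only illustrates the structural change, with wording and labels chosen for illustration.

```python
# Illustrative sketch only: the actual benchmark prompts are not reproduced here.
# The two layouts show the structural change that mattered for Qwen 2.5-7B:
# moving the query from before the document block to after it.

def prompt_query_first(document: str, query: str) -> str:
    """Query placed before the document block."""
    return (
        "You are analyzing an official document.\n\n"
        f"Question: {query}\n\n"
        f"Document:\n{document}\n\n"
        "Answer using only information from the document."
    )

def prompt_query_last(document: str, query: str) -> str:
    """Query placed after the document block (the layout that scored higher in our tests)."""
    return (
        "You are analyzing an official document.\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {query}\n\n"
        "Answer using only information from the document."
    )
```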
Hallucinations occur when AI models generate incorrect or fabricated answers with confidence. For example, a model might invent URLs that do not exist or provide information absent from the input data. This issue is particularly problematic for tasks requiring factual accuracy, such as document analysis.
In our tests with Llama 3.1-8B, minor prompt adjustments were insufficient to mitigate hallucinations. However, using a structured data prompt—adapted from DSPy—significantly reduced hallucinations, improving the model’s performance score from 4 to 8. This template defines clear input fields and enforces strict output formats, ensuring the model produces accurate and consistent responses.
Structured data prompt template (adapted from DSPy):
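The exact template is not reproduced here; the sketch below follows the general DSPy field-and-format style, with field names and wording chosen for illustration.

```
Given the fields `document` and `question`, produce the field `answer`.

---

Follow the following format.

Document: the source text to analyze
Question: a question about the document
Answer: an answer that uses only information present in the document; if the
answer is not in the document, say so explicitly

---

Document: {document}
Question: {question}
Answer:
```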
Fine-tuning adapts pre-trained AI models by further training them on a specialized dataset aligned with specific tasks. Using LoRA (Low-Rank Adaptation), we trained the models efficiently by adding small trainable adapter modules to existing layers; these adapters learn task-specific adjustments without modifying the original parameters, which reduces memory usage and speeds up training.
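As a concrete sketch of this setup, the snippet below attaches LoRA adapters with the Hugging Face PEFT library; the base-model name and hyperparameters are illustrative assumptions, not the exact values used in our experiments.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Base model name and hyperparameters are illustrative assumptions.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model with small trainable adapter modules;
# only the adapter weights are updated during fine-tuning.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```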
To align the models with our benchmark tasks, we created a dataset modeled on the Flow Benchmark instruction style. Fine-tuning was applied to two models, Llama 3.1-8B and Llama 3.2-3B, and produced mixed results depending on the task.
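For illustration, a single record in the instruction style we targeted might look like the sketch below; the field names and contents are hypothetical, and the actual dataset is not reproduced here.

```python
# Hypothetical record illustrating an instruction-style fine-tuning sample.
# Placeholders stand in for real document text and reference answers.
example = {
    "instruction": "Answer the question using only the document provided.",
    "input": "Document:\n<Japanese government document text>\n\nQuestion:\n<question>",
    "output": "<reference answer in the expected format>",
}
```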
The limited improvement on the question-answering task stems from its reliance on information retrieval, a capability the models already handled well. Fine-tuning focused more on helping the models follow task-specific instructions than on introducing new knowledge. Since Llama 3.1-8B already performed comparably to larger models such as Llama 3.1-70B on this task, further fine-tuning provided minimal additional benefit.
Document analysis, in contrast, improved significantly: fine-tuning enhanced the models’ ability to understand and adapt to the task, leading to higher scores.
The results presented in this article highlight the importance of fine-tuning and prompt optimization in maximizing the potential of small open-source LLMs.
By combining these techniques, small open-source models can narrow the performance gap with larger alternatives while maintaining advantages in cost, customization, and privacy.
At Recursive, our commitment to open-source technologies reflects our vision of democratizing AI while empowering enterprises to build fairer, more sustainable solutions. Reach out to us at sbdm@recursiveai.co.jp to discuss how our tools can enhance your AI strategy.
Founded by a former senior research engineer at Google DeepMind, Recursive brings together world-class talent from across disciplines to engineer results where others can't.