小規模言語モデル（SLM）のファインチューニング：少ないリソースでのパフォーマンス向上

生成AIの急速な普及が、世界中の業界に変革をもたらしています。マッキンゼーの2024年のレポートによると、AIの導入率は企業の72%に達し、65%が少なくとも1つの業務において生成AIを定期的に利用していると報告しています。

このような急速な成長は、コスト効率が高く、カスタマイズ可能で、プライバシーを重視した、小規模なオープンソースの大規模言語モデル（LLM）のようなソリューションの必要性を浮き彫りにしています。しかし、弊社のベンチマークテストの結果でも明らかになったよう、これらのモデルは、大規模なモデルに比較して性能が劣ることが多いという課題があります。本日は、この課題に対処するため、ファインチューニングとプロンプト最適化に着目しました。

‍

オープンソースの小規模言語モデルとは？

オープンソースの小規模言語モデルは、パラメータ数が100億未満の公開されているAIモデルです。一般的なPCのような標準的なハードウェア上で展開することが可能です。この利便性により、技術リソースが限られた組織や個人にとって魅力的な選択肢となります。

主な利点

コスト効率: 一般的なシステム上で動作可能。リソース要件が最小限であるため、コスト削減
カスタマイズ性: 特定のタスクに合わせて微調整可能。カスタマイズされたソリューションを提供
プライバシー保証: ローカルに展開されるため、外部APIの必要がなく、データセキュリティが確保

本検証では、Qwen 2.5-7B (Alibaba)、Llama 3.1-8B、Llama 3.2-3B (Meta)の3つの一般的な小規模オープンソースモデルの最適化に焦点を当てます。

‍

小規模モデルのRAG性能評価

Recursive独自の評価ツールである Flow Benchmark Tools を用いて、実社会における実際のシナリオを反映するよう、モデル評価を実施しました。このベンチマークでは、日本政府が公表しているあらゆる文書と、難易度の高い質問を用いて、以下の2つの重要な機能を厳密に評価しました。

RAG（Retrieval-Augmented Generation）を用いた質問応答
文書全体の分析

Flow Benchmark Toolsは、英語と日本語の両言語に対応しており、評価システムは、0（最低）から10（最高）までの値を持つ平均意見評価を出力します。

‍

プロンプト最適化：低コストでの改善

AIモデルへの指示語であるプロンプトを工夫することで、モデルの生成する回答の質を向上させることを「プロンプト最適化」と言います。これは、モデル自体を変更することなく、より正確で有用な情報を引き出すための効果的な手法です。特に、モデルへの指示を具体的にしたり、質問の仕方を工夫したりすることで、モデルの理解度を高めることができます。

私たちの検証では、文書全体を分析するタスクに焦点を当て、プロンプト最適化の効果を検証しました。文書から必要な情報を抽出するRAGという技術は、検索の精度が非常に重要であり、プロンプトの調整による改善は限定的であると考えられるためです。

1. クエリの配置による精度向上

プロンプト内でのクエリの配置が、モデルの回答制度に大きな影響を与えることがわかりました。例えば、Qwen 2.5-7Bモデルにおいて、ドキュメントの後にクエリを配置したところ、スコアが6.35から6.88に改善されました。この結果、プロンプトを適切に設計することが、モデルのパフォーマンス向上に不可欠であることを示しています。

2. Addressing Hallucinations

Hallucinations occur when AI models generate incorrect or fabricated answers with confidence. For example, a model might invent URLs that do not exist or provide information absent from the input data. This issue is particularly problematic for tasks requiring factual accuracy, such as document analysis.

In our tests with Llama 3.1-8B, minor prompt adjustments were insufficient to mitigate hallucinations. However, using a structured data prompt—adapted from DSPy—significantly reduced hallucinations, improving the model’s performance score from 4 to 8. This template defines clear input fields and enforces strict output formats, ensuring the model produces accurate and consistent responses.

Structured data prompt template (adapted from DSPy):

‍

Fine-Tuning LLMs on a Custom Dataset

Fine-tuning adapts pre-trained AI models by further training them on a specialized dataset aligned with specific tasks. Using LoRA (Low-Rank Adaptation), we efficiently trained the models by adding small, trainable modules to existing layers, which adjust weights without changing the original parameters, reducing memory usage and speeding up training.

To align the models with our benchmark tasks, we created a dataset modeled on the Flow Benchmark instruction style. Fine-tuning was applied to two models, Llama 3.1-8B and Llama 3.2-3B, and produced mixed results depending on the task.

‍1. Question-Answering with RAG

Llama 3.1-8B showed negligible improvement.
Llama 3.2-3B achieved a 7% improvement.

The limited improvement stems from the task’s reliance on information retrieval—a capability the models already handled well. Fine-tuning focused more on helping the models follow task-specific instructions than introducing new knowledge. Since the Llama 3.1-8B model’s performance was already comparable to larger models like Llama 3.1-70B, further fine-tuning provided minimal additional benefits.

2. Whole Document Analysis

Llama 3.1-8B improved by 10%.
Llama 3.2-3B achieved a 26% boost.

These significant improvements resulted from the fine-tuning process, which enhanced the models’ ability to understand and adapt to document analysis tasks, leading to higher scores.

Key Takeaways

The results presented in this article highlight the importance of fine-tuning and prompt optimization in maximizing the potential of small open-source LLMs:

Prompt Optimization: While thoughtful prompt design can improve model performance in many cases, addressing hallucinations remains a complex challenge. Various approaches exist, but most lack general applicability. However, in our experiments, the structured data prompt proved highly effective. While this approach worked in our scenario, it should be seen as one of many potential methods rather than a universal solution.
Fine-Tuning: Aligning fine-tuning datasets with task-specific requirements is crucial. Although fine-tuning showed limited impact on question-answering with RAG, it delivered substantial improvements for complex tasks like document analysis, enabling more accurate and reliable outputs.

By combining these techniques, small open-source models can narrow the performance gap with larger alternatives while maintaining advantages in cost, customization, and privacy.

At Recursive, our commitment to open-source technologies reflects our vision of democratizing AI while empowering enterprises to build fairer, more sustainable solutions. Reach out to us at sbdm@recursiveai.co.jp to discuss how our tools can enhance your AI strategy.