Use case · Long-context RAG

Qwen3.5-35B-A3B for long context

Qwen3.5-35B-A3B is a 35B-parameter Mixture-of-Experts (MoE) model that activates only 3B parameters per token and has a 262K-token context window. Per-token compute is close to that of a 3B dense model, while knowledge capacity is closer to that of a 35B model, which suits RAG and long-document workflows. At $0.13 input / $1.00 output per 1M tokens, it has the lowest input price in our catalog.

Why it's a fit for RAG

262K context

Fits roughly a 500-page PDF or 200 code files in a single prompt. If the retrieved corpus fits in the window, you can skip aggressive chunking; single-shot RAG simplifies your pipeline.
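Whether a corpus "fits" can be checked before sending the request. The sketch below is a minimal illustration: the 262,144-token hard limit and the roughly 200K recall guideline come from the FAQ on this page, while the function name, the status labels, and the assumption that prompt and output tokens share one window are ours.

```python
HARD_LIMIT = 262_144   # published context limit quoted in the FAQ
SOFT_LIMIT = 200_000   # recall guideline quoted in the FAQ (approximate)

def context_status(prompt_tokens: int, max_output_tokens: int) -> str:
    """Classify a request by total size.

    Assumes prompt and output tokens count against the same window.
    """
    total = prompt_tokens + max_output_tokens
    if total > HARD_LIMIT:
        return "too-long"      # request would exceed the context window
    if total > SOFT_LIMIT:
        return "recall-risk"   # fits, but fine-grained lookup may degrade
    return "ok"

if __name__ == "__main__":
    print(context_status(180_000, 500))
```

A "too-long" result is the signal to fall back to retrieval or a summarize-then-answer pass.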

Low input cost

$0.13 per 1M input tokens means a 100K-token RAG prompt costs $0.013. DeepSeek V3 at $0.24/1M would cost $0.024 for the same prompt, about 85% more.

MoE speed

Only 3B parameters are active per token, so inference speed is closer to a 3B dense model than a 35B dense one. For long-input workflows, this shows up as noticeably lower per-request latency.

Quickstart: long-document QA

Python · openai SDK
from openai import OpenAI

client = OpenAI(
    base_url="https://api.quicksilverpro.io/v1",
    api_key="sk-qsp-...",
)

# Load a long document, e.g. a 500-page PDF already extracted to text (~180K tokens)
with open("annual-report.txt", encoding="utf-8") as f:
    document = f.read()

resp = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[
        {"role": "system", "content": "You answer questions using only the provided document."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: What was free cash flow in Q3?"},
    ],
    max_tokens=500,
)
print(resp.choices[0].message.content)
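# usage.cost is not part of the standard OpenAI usage object; fall back to list prices if absent
cost = getattr(resp.usage, "cost", None)
if cost is None:
    cost = (resp.usage.prompt_tokens * 0.13 + resp.usage.completion_tokens * 1.00) / 1_000_000
print(f"Input tokens: {resp.usage.prompt_tokens}, cost: ${cost:.4f}")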

A 180K-token document plus a 500-token answer costs $0.0234 + $0.0005 = ~$0.024 per query. The same query on DeepSeek V3, if it fit in V3's 131K context, would cost $0.0432 + $0.00035 = ~$0.044.
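The per-query arithmetic can be reproduced with a small helper. This is a sketch that uses only the list prices from the pricing table on this page; the `deepseek-v3` dictionary key is an illustrative label, not a confirmed model ID.

```python
# Prices in USD per 1M tokens, copied from the pricing table on this page.
PRICES = {
    "qwen3.5-35b": {"input": 0.13, "output": 1.00},
    "deepseek-v3": {"input": 0.24, "output": 0.70},  # key name is illustrative
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at list prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

if __name__ == "__main__":
    for name in PRICES:
        print(f"{name}: ${query_cost(name, 180_000, 500):.4f}")
```

Multiplying the result by your expected daily query volume gives a quick budget estimate.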

RAG pipeline pattern

Simple single-shot: if retrieved context fits in 262K tokens, skip reranking and hierarchical summarization — feed everything to Qwen3.5-35B-A3B in one call. Lower pipeline complexity, lower latency.

With retrieval: embed → top-K retrieve → concat into a 50-100K-token prompt → Qwen3.5-35B-A3B answer. Because input tokens are cheap, a larger top-K (more context) adds little cost.

Summarize-then-answer: for >262K corpora, first summarize by section with Qwen3.5-35B-A3B, then answer on the summaries. Two-pass; still cheaper than most alternatives.
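The summarize-then-answer pattern can be sketched as a two-pass loop. Everything here is illustrative: the splitting rule (blank-line paragraphs grouped up to a character budget), the prompt wording, and the `ask` callable, which stands in for whatever function sends one prompt to the model and returns its text.

```python
from typing import Callable, List

def split_into_sections(text: str, max_chars: int) -> List[str]:
    """Group blank-line-separated paragraphs into sections of up to max_chars.

    A single paragraph longer than max_chars becomes its own oversized section.
    """
    sections: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            sections.append(current)   # close the section before it overflows
            current = paragraph
        else:
            current = candidate
    if current:
        sections.append(current)
    return sections

def summarize_then_answer(text: str, question: str,
                          ask: Callable[[str], str], max_chars: int) -> str:
    """Pass 1: summarize each section. Pass 2: answer from the summaries."""
    summaries = [ask(f"Summarize this section:\n{section}")
                 for section in split_into_sections(text, max_chars)]
    combined = "\n\n".join(summaries)
    return ask(f"Summaries:\n{combined}\n\nQuestion: {question}")
```

In practice `ask` would wrap a `client.chat.completions.create` call like the one in the quickstart, and `max_chars` would be chosen so each section stays well inside the context window.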

Pricing

| Model | Input / 1M | Output / 1M | Context |
|---|---|---|---|
| Qwen3.5-35B-A3B | $0.13 | $1.00 | 262K |
| DeepSeek V3 (compare) | $0.24 | $0.70 | 131K |

With an input price 46% lower than V3's, Qwen3.5-35B-A3B is the default for prompt-heavy RAG. Its output price is about 43% higher ($1.00 vs. $0.70 per 1M), so short-prompt, output-heavy tasks still favor V3.
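The crossover point follows from the list prices. Qwen3.5-35B-A3B is cheaper when 0.13·I + 1.00·O < 0.24·I + 0.70·O, which rearranges to I/O > 0.30/0.11. The helper below is a sketch of that calculation; the function name is ours.

```python
def break_even_ratio(in_a: float, out_a: float, in_b: float, out_b: float) -> float:
    """Input:output token ratio above which model A is cheaper than model B.

    Assumes A has the lower input price and the higher output price.
    """
    return (out_a - out_b) / (in_b - in_a)

# A = Qwen3.5-35B-A3B, B = DeepSeek V3, prices in USD per 1M tokens
ratio = break_even_ratio(0.13, 1.00, 0.24, 0.70)
print(f"Qwen3.5-35B-A3B is cheaper when input tokens exceed {ratio:.2f}x output tokens")
```

At list prices the ratio is about 2.7, so a request with more than roughly three input tokens per output token is cheaper on Qwen3.5-35B-A3B.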

FAQ

Can I really use 262K tokens in one prompt?

Yes. The 262,144-token context is the published hard limit. Long-context performance (needle-in-a-haystack recall) is strong up to about 200K; past that, accuracy can degrade on fine-grained lookup tasks. For critical retrieval, combine with vector search to put the most relevant chunks near the top of the prompt.
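One way to put the most relevant chunks first is to sort retrieved chunks by similarity score and pack them until a token budget is reached. The sketch below makes two assumptions of its own: scores come from your vector search as floats where higher is better, and tokens are estimated at roughly four characters each rather than counted with a real tokenizer.

```python
from typing import List, Tuple

def estimate_tokens(text: str) -> int:
    """Rough estimate (~4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)

def pack_chunks(scored_chunks: List[Tuple[float, str]], budget_tokens: int) -> str:
    """Concatenate the highest-scoring chunks that fit within budget_tokens.

    Chunks appear in descending score order, so the most relevant text
    lands at the top of the prompt.
    """
    selected: List[str] = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, chunk in ranked:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; smaller ones may still fit
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```

For example, `pack_chunks(results, budget_tokens=100_000)` would build a prompt body in the 50-100K range described in the pipeline patterns, with `results` being your own list of `(score, text)` pairs.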

What's the "3B active MoE" thing?

Mixture-of-Experts routes each token through only a subset of the model's parameters. Qwen3.5-35B-A3B has 35B total parameters but activates only 3B per token. Compute per token is roughly that of a 3B dense model; knowledge capacity is closer to that of a 35B model. The result is faster and cheaper inference than a dense 35B model, which is why long-context workloads are a particularly good fit.
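The routing idea can be illustrated with a toy top-k gate. This is a generic sketch of MoE routing, not Qwen's actual router: the number of experts, the value of k, and the softmax-over-selected-experts weighting are all illustrative assumptions.

```python
import math
from typing import List, Tuple

def top_k_route(gate_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs; weights are a softmax over the
    selected experts only, so they sum to 1.
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    exps = [math.exp(gate_logits[i] - gate_logits[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

if __name__ == "__main__":
    # 8 toy experts, 2 active: only a quarter of expert parameters run per token.
    print(top_k_route([0.3, 1.9, -0.5, 0.0, 2.4, -1.2, 0.7, 0.1], k=2))
```

Only the selected experts' feed-forward weights are used for that token, which is where the gap between total and active parameters comes from.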

Does thinking mode affect cost?

Qwen3.5-35B-A3B ships with reasoning mode available. On QuickSilver Pro, reasoning mode is suppressed by default to keep output concise and predictable — you're not billed for unnecessary thinking tokens. This matches the behavior most RAG and summarization workloads expect.

Does Qwen support tool calling?

Yes, via the OpenAI tools API. Tool-call reliability is good for simple function signatures; for complex multi-tool agent loops, DeepSeek V3 tends to be more reliable. Benchmark both on your specific agent before committing.
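A minimal tool-calling sketch follows. The request shape is the standard OpenAI `tools` format; the `get_section` tool, its schema, and the placeholder section text are hypothetical and exist only for illustration. The `client` argument is assumed to be an `openai.OpenAI` instance configured as in the quickstart.

```python
import json

# Hypothetical tool for illustration; the name and schema are not from this page.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_section",
        "description": "Return the text of a named section of the loaded report.",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
}]

def get_section(title: str) -> str:
    """Stand-in lookup; a real version would read from your document store."""
    sections = {"cash flow": "Placeholder text for the cash flow section."}
    return sections.get(title.lower(), "section not found")

def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; the model supplies arguments as a JSON string."""
    handlers = {"get_section": get_section}
    if name not in handlers:
        return f"unknown tool: {name}"
    return handlers[name](**json.loads(arguments_json))

def ask_with_tools(client, question: str):
    """Send one request with the tool definitions attached."""
    return client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
```

If the model decides to call a tool, the calls appear in `resp.choices[0].message.tool_calls`; each has `function.name` and `function.arguments`, which can be passed to `run_tool_call`.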

Start your RAG on $1 free

262K context, OpenAI-compatible API, model="qwen3.5-35b".