<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Featherless.ai</title>
    <description>The latest articles on DEV Community by Featherless.ai (@featherlessai).</description>
    <link>https://dev.to/featherlessai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F10170%2F1abaa506-fbd6-42a6-a7ad-3b1165678a3f.png</url>
      <title>DEV Community: Featherless.ai</title>
      <link>https://dev.to/featherlessai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/featherlessai"/>
    <language>en</language>
    <item>
      <title>Experimental support for Kimi-K2 by Moonshot AI now available for premium users!</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Tue, 15 Jul 2025 09:27:47 +0000</pubDate>
      <link>https://dev.to/featherlessai/experimental-support-for-kimi-k2-by-moonshot-ai-now-available-for-premium-users-47gj</link>
      <guid>https://dev.to/featherlessai/experimental-support-for-kimi-k2-by-moonshot-ai-now-available-for-premium-users-47gj</guid>
      <description>&lt;p&gt;We now have experimental support for &lt;a href="https://featherless.ai/models/moonshotai/Kimi-K2-Instruct" rel="noopener noreferrer"&gt;Kimi-K2&lt;/a&gt; on Featherless for our premium subscribers.&lt;/p&gt;

&lt;p&gt;Hop on the Kimi-K2 train 🚂&lt;/p&gt;

&lt;p&gt;Kimi-K2 is a SOTA open-source model designed for autonomous problem-solving, achieving exceptional performance across coding, reasoning, and agentic tasks with its 1-trillion-parameter MoE architecture.&lt;/p&gt;

&lt;p&gt;Moonshot AI has delivered breakthrough performance in agentic intelligence while maintaining open-source accessibility, positioning themselves as a formidable challenger in the frontier model space.&lt;/p&gt;

&lt;p&gt;Some highlights of the Kimi-K2 release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Massive Scale&lt;/strong&gt;: 1T total parameters with 32B activated, using a mixture-of-experts architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Training Instability&lt;/strong&gt;: Achieved stable pre-training on 15.5T tokens with novel MuonClip optimizer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Excellence&lt;/strong&gt;: Specifically optimized for tool use, reasoning, and autonomous problem-solving&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Superior Performance&lt;/strong&gt;: Leading results on SWE-bench, LiveCodeBench, and tool-use benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moonshot AI's rise in the agentic AI space reflects their focused approach to two critical innovations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MuonClip Optimizer&lt;/strong&gt;: Applied the Muon optimizer at unprecedented scale with novel optimization techniques&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Specialization&lt;/strong&gt;: Purpose-built architecture for tool use and autonomous reasoning tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Era of Experience" Training&lt;/strong&gt;: Advanced RL system using self-judging mechanisms and MCP (Model Context Protocol) tools for real-world agentic scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kimi-K2 represents a major leap forward in open-source agentic capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimental support notice
&lt;/h2&gt;

&lt;p&gt;We're excited to offer experimental support for Kimi-K2 for our premium users. Given the substantial computational requirements of this 1T parameter model, we're closely monitoring usage patterns and operational costs. We may need to temporarily adjust or suspend availability. Try out Kimi-K2 and share your feedback to help us improve the experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experience Kimi-K2 on Featherless:
&lt;/h3&gt;

&lt;p&gt;🔔 &lt;a href="https://featherless.ai/subscription/change?plan_id=feather_pro_plus" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; to access Kimi-K2&lt;/p&gt;

&lt;p&gt;🦜 Chat with it in &lt;a href="https://phoenix.featherless.ai/" rel="noopener noreferrer"&gt;Phoenix&lt;/a&gt;, our chat interface&lt;/p&gt;

&lt;p&gt;⚡ Integrate via the Featherless API&lt;/p&gt;
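&lt;p&gt;As a minimal sketch of the API route (the endpoint URL and model ID below are assumptions based on Featherless's OpenAI-compatible API; check the docs for the authoritative values), a chat completion request can be built with nothing but the standard library:&lt;/p&gt;

```python
# Sketch of a direct chat-completion call to Kimi-K2 on Featherless.
# Assumptions: OpenAI-compatible endpoint at api.featherless.ai/v1 and
# the model ID "moonshotai/Kimi-K2-Instruct". Replace the placeholder key.
import json
import urllib.request

API_KEY = "your-api-key"  # placeholder

payload = {
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
}
req = urllib.request.Request(
    "https://api.featherless.ai/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# Uncomment once you have a real key:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

&lt;p&gt;Since the API follows the OpenAI request format, existing OpenAI SDK code should only need a new base URL and key.&lt;/p&gt;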

&lt;p&gt;🔍 Join our &lt;a href="https://discord.com/invite/featherlessai" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to give feedback and discuss the new model&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Featherless Becomes Hugging Face’s Largest LLM Inference Provider with 6,700+ Models</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Thu, 12 Jun 2025 14:03:09 +0000</pubDate>
      <link>https://dev.to/featherlessai/featherless-becomes-hugging-faces-largest-llm-inference-provider-with-6700-models-57bh</link>
      <guid>https://dev.to/featherlessai/featherless-becomes-hugging-faces-largest-llm-inference-provider-with-6700-models-57bh</guid>
      <description>&lt;p&gt;We’re excited to announce that Featherless is now the most extensive LLM inference provider on Hugging Face, serving over 6,700 open-weight models—and counting.&lt;/p&gt;

&lt;p&gt;This milestone means developers, researchers, and teams can now run thousands of the world’s models directly from Hugging Face, backed by Featherless's serverless infrastructure, flat pricing, and production-grade scalability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Featherless is the only Hugging Face Inference Endpoints provider supporting this scale.&lt;/p&gt;

&lt;p&gt;Any model with 100+ downloads is automatically onboarded to Featherless.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Reliable, Open AI — At Scale
&lt;/h2&gt;

&lt;p&gt;This collaboration brings together two shared commitments: accessibility and open source.&lt;/p&gt;

&lt;p&gt;With Featherless powering Hugging Face endpoints, users now get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;6,700+ Models, Instantly Available&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From DeepSeek, LLaMA, Mistral, and Qwen to new releases like Magistral and Devstral. All ready to deploy, fine-tune, or benchmark.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Serverless, Scalable Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Model cold-starts average under 250ms, letting users plan their usage by model and concurrent connections. No GPUs, no containers, no infrastructure.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automatic Model Onboarding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hugging Face models with 100+ downloads are auto-integrated with Featherless for access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unlimited Usage, Predictable Pricing (when subscribed to Featherless)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run any model—without usage caps, per-token math, or surprise bills.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Featherless AI is doing for inference what Hugging Face did for open-source model hosting, making it simple, accessible, and scalable. This partnership is a big step towards the future where anyone can have instant access to all the world’s collection of AI models.”&lt;/p&gt;

&lt;p&gt;— Eugene Cheah, Co-founder, Featherless AI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Two Ways to Use Featherless on Hugging Face
&lt;/h2&gt;

&lt;p&gt;Starting June 12, 2025, users can invoke Featherless inference directly inside the Hugging Face platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Routed Request&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Billed by Hugging Face. Just select Featherless AI from the Inference Endpoints dropdown and go.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Custom Key or Direct Calls&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use your own Featherless API key for direct access and flat-rate unlimited usage (requires a Featherless subscription).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
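
&lt;p&gt;As a rough sketch of the two modes (the router URL pattern, provider slug, and model ID below are assumptions; both requests use the same OpenAI-style chat format):&lt;/p&gt;

```python
# Build (but don't send) the two request styles. The router URL pattern and
# provider slug ("featherless-ai") are assumptions; tokens are placeholders.
import json
import urllib.request

def chat_request(url: str, token: str, model: str) -> urllib.request.Request:
    """Construct an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": "Hi!"}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# Routed request: authenticate with a Hugging Face token, billed by Hugging Face.
routed = chat_request(
    "https://router.huggingface.co/featherless-ai/v1/chat/completions",
    "hf_xxx", "deepseek-ai/DeepSeek-V3-0324",
)

# Direct call: your own Featherless key, flat-rate subscription billing.
direct = chat_request(
    "https://api.featherless.ai/v1/chat/completions",
    "your-featherless-key", "deepseek-ai/DeepSeek-V3-0324",
)
```

&lt;p&gt;The only differences are the host you call and the token you present; application code stays the same.&lt;/p&gt;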

&lt;p&gt;→ &lt;a href="https://featherless.ai/docs/hugging-face" rel="noopener noreferrer"&gt;Read the Docs&lt;/a&gt; &lt;br&gt;
→ &lt;a href="https://featherless.ai/#pricing" rel="noopener noreferrer"&gt;Explore Featherless Pricing&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://featherless.ai/docs/quickstart-guide" rel="noopener noreferrer"&gt;Run Your First Model&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future-Proofing AI Deployment
&lt;/h2&gt;

&lt;p&gt;As the world moves toward more personalized, specialized, and fine-tuned AI systems, Featherless is building the foundation.&lt;/p&gt;

&lt;p&gt;We are both a serverless inference platform and an AI research lab. Our contributions to attention-alternative architectures like RWKV help us scale models other platforms can’t. We reduce inference costs for all models by at least 10 times. And we’ve built the world’s most reliable agent for everyday use, outperforming Gemini, Claude, and GPT-4o.&lt;/p&gt;

&lt;p&gt;Together with Hugging Face, we’re making the long tail of models accessible, scalable, and production-ready.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6,700+ LLMs hosted today&lt;/li&gt;
&lt;li&gt;100% of Hugging Face public models targeted by EOY 2026 🤗&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  About Featherless
&lt;/h2&gt;

&lt;p&gt;Featherless is the fastest way to run reliable, open-source AI at scale. Featherless is an AI research lab and serverless platform that gives developers, researchers, and teams instant access to the world’s largest model catalog without managing infrastructure, token limits, or hidden costs. Whether you’re building prototypes, deploying applications, or scaling intelligent systems, Featherless helps you move faster with AI you can trust. Our mission is to make personalized AGI real: open, reliable, and built for everyone.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;Explore the Catalog&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://featherless.ai/#pricing" rel="noopener noreferrer"&gt;Subscribe to Featherless&lt;/a&gt;&lt;br&gt;
→ &lt;a href="//discord.gg/featherlessai"&gt;Join our Discord&lt;/a&gt;&lt;br&gt;
→ &lt;a href="//x.com/FeatherlessAI"&gt;Follow us on X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>webdev</category>
      <category>huggingface</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Context Isn’t Everything: Build Efficient LLM Apps with LlamaIndex + Featherless</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Mon, 02 Jun 2025 13:23:36 +0000</pubDate>
      <link>https://dev.to/featherlessai/context-isnt-everything-build-efficient-llm-apps-with-llamaindex-featherless-3ak</link>
      <guid>https://dev.to/featherlessai/context-isnt-everything-build-efficient-llm-apps-with-llamaindex-featherless-3ak</guid>
      <description>&lt;p&gt;We’re excited to announce that LlamaIndex has &lt;a href="https://docs.llamaindex.ai/en/stable/examples/llm/featherlessai/" rel="noopener noreferrer"&gt;official support&lt;/a&gt; for Featherless, bringing together two powerful tools for building production RAG applications. While everyone’s chasing longer context windows (100K, 1M tokens), we’ve noticed most production apps need something different: they need efficient retrieval that finds the right information, not all information.&lt;/p&gt;

&lt;p&gt;That’s why this integration matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LlamaIndex provides the RAG infrastructure: data loaders, chunking strategies, and vector search&lt;/li&gt;
&lt;li&gt;Featherless gives you access to 4,300+ open source models through a simple API&lt;/li&gt;
&lt;li&gt;Together, they let you build a RAG pipeline that switches between models instantly, optimizes for cost, and scales without infrastructure headaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s have a deeper look at what you can build with this new integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Retrieval Is Often a Better Solution for Your Problem
&lt;/h2&gt;

&lt;p&gt;Stuffing your entire knowledge base into a single prompt might work for a simple demo, but at scale it leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slower response times:&lt;/strong&gt; Processing 100k tokens takes time, even on fast hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More hallucinations&lt;/strong&gt;: Models struggle with needle-in-haystack problems in massive contexts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token overflow:&lt;/strong&gt; Eventually you will hit limits, forcing crude truncation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you actually want is precision: just the right information, fed to the model at the right time. That’s where RAG (Retrieval-Augmented Generation) shines, and LlamaIndex handles it beautifully.&lt;/p&gt;
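
&lt;p&gt;To make the precision idea concrete, here is a toy, standard-library-only sketch of retrieve-then-generate (real pipelines use learned embeddings and a vector index, as in the quickstart below): score each chunk against the query, keep only the best match, and build the prompt from that.&lt;/p&gt;

```python
# Toy retrieval: bag-of-words cosine similarity stands in for real embeddings.
from collections import Counter
import math

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "New hires complete onboarding during their first week.",
]

def score(a: str, b: str) -> float:
    """Cosine similarity over word-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

query = "how long do refunds take"
best = max(chunks, key=lambda c: score(query, c))  # only this chunk enters the prompt
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {query}"
print(best)
```

&lt;p&gt;Only the refund chunk reaches the model; the other documents never consume context tokens at all.&lt;/p&gt;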

&lt;h2&gt;
  
  
  What Featherless Brings to the Stack
&lt;/h2&gt;

&lt;p&gt;Featherless simplifies access to open source models. Instead of provisioning GPUs, managing infrastructure, dealing with model deployment, and worrying about usage costs, you get instant access to over 4,300 open source models, including DeepSeek, Llama, Qwen, Mistral, and many more. Everything runs through our API: with a simple monthly subscription you get unlimited tokens across our whole model catalog and can switch between models instantly, perfect for A/B testing different approaches without any infrastructure overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quickstart: Build a Local RAG Application
&lt;/h2&gt;

&lt;p&gt;Let’s walk through building a Q&amp;amp;A assistant that can answer questions about your local documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llama-index llama-index-llms-featherlessai llama-index-embeddings-huggingface
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Set up your environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.llms.featherlessai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FeatherlessLLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.embeddings.huggingface&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HuggingFaceEmbedding&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Set your Featherless API key
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Configure local embeddings
&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HuggingFaceEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Efficient, high-quality embeddings
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Alternative: Use Ollama for local embeddings
# from llama_index.embeddings.ollama import OllamaEmbedding
# embed_model = OllamaEmbedding(model_name="nomic-embed-text")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Load and Index Your Documents&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load all files from ./docs directory
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Configure Featherless as your LLM
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FeatherlessLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-32B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Or any model from featherless.ai
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Lower for more consistent retrieval
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Build your vector index with free embeddings
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Optimal for precise retrieval
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Query Your Knowledge Base&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a query engine
&lt;/span&gt;&lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Retrieve top 3 most relevant chunks
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Ask questions
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s our onboarding process?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You just built a RAG pipeline in under 30 lines of code, with zero infrastructure overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Features: Streaming and Chat
&lt;/h2&gt;

&lt;p&gt;Our Featherless LlamaIndex integration supports both streaming responses and multi-turn conversations:&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming Responses
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Stream for real-time output
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the key points of machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-turn Chat
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core.llms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful technical assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is RAG?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Stream chat responses
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Model Switching: A/B Test Without Rewriting Code
&lt;/h2&gt;

&lt;p&gt;One of Featherless’s strengths is instant model switching. Test different models for your use case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models_to_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Small-24B-Instruct-2501&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain our refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models_to_test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Example: Customer Support Bot
&lt;/h2&gt;

&lt;p&gt;Here's a complete example of a customer support bot that combines multiple best practices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.llms.featherlessai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FeatherlessLLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.embeddings.huggingface&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HuggingFaceEmbedding&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core.node_parser&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core.llms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Featherless
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Set up embeddings
&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HuggingFaceEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load different document types
&lt;/span&gt;&lt;span class="n"&gt;faq_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/faqs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;policy_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;product_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Tag documents with metadata
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;faq_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;policy_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Combine all documents
&lt;/span&gt;&lt;span class="n"&gt;all_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faq_docs&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;policy_docs&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;product_docs&lt;/span&gt;

&lt;span class="c1"&gt;# Create index with custom settings
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;all_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transformations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;SentenceSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Function to route queries to appropriate model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_llm_for_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FeatherlessLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="c1"&gt;# Use precise model for policy questions
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;FeatherlessLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-32B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;help&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tutorial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="c1"&gt;# Use helpful model for guidance
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;FeatherlessLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-ai/DeepSeek-R1-0528-Qwen3-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Default conversational model
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;FeatherlessLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Small-3.1-24B-Instruct-2503&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a support bot function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;support_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Select appropriate model
&lt;/span&gt;    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_llm_for_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create query engine with filters
&lt;/span&gt;    &lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Synthesize concise answers
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add chat context if available
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]])&lt;/span&gt;
        &lt;span class="n"&gt;full_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Previous conversation:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Current question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;full_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;

    &lt;span class="c1"&gt;# Get response
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;support_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s your refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;support_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I reset my password?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance and efficiency strategies
&lt;/h2&gt;

&lt;p&gt;As your RAG application scales, performance optimization becomes crucial. Start with embedding caching to avoid recomputing embeddings for documents you’ve already processed. LlamaIndex makes this straightforward with its storage context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cache embeddings to avoid recomputation
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StorageContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core.storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDocumentStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.embeddings.huggingface&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HuggingFaceEmbedding&lt;/span&gt;

&lt;span class="n"&gt;embed_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HuggingFaceEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;storage_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StorageContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_defaults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./storage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage_context&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Featherless’s monthly subscription, you have unlimited access to all models, which fundamentally changes how you approach optimization. Instead of minimizing token usage, you can experiment freely with different models to find the perfect fit for each use case. Don’t hesitate to use larger models for complex tasks where quality matters most. &lt;/p&gt;

&lt;p&gt;Focus your optimization efforts on reducing latency through query caching for common questions and implementing parallel processing for better throughput. Since you’re not counting tokens, you can run extensive A/B tests across multiple models simultaneously, gathering real performance data to make informed decisions about which models work best for different query types. This freedom to experiment without constraints means you can optimize for what really matters: response quality and user experience.&lt;/p&gt;
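Query caching for common questions needs nothing more than memoizing on a normalized query string, and parallel A/B tests across models are a natural fit for a thread pool since each model call is I/O-bound. A minimal, framework-agnostic sketch, where `cached_answer` and the `run_model` callable are stand-ins for a real call such as `query_engine.query`:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1024)
def cached_answer(normalized_query: str) -> str:
    # Stand-in for the expensive call (e.g. query_engine.query(...))
    return f"answer to: {normalized_query}"


def answer(query: str) -> str:
    # Normalize whitespace and case so near-duplicate questions share an entry
    return cached_answer(" ".join(query.lower().split()))


def ab_test(query: str, models: list[str], run_model) -> dict[str, str]:
    # Fan the same query out to several models concurrently
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(run_model, m, query) for m in models}
        return {m: f.result() for m, f in futures.items()}


results = ab_test("What's your refund policy?",
                  ["model-a", "model-b"],
                  lambda model, q: f"{model}: {answer(q)}")
print(results)
```

Swapping the stand-ins for real Featherless-backed query engines turns this into a side-by-side comparison harness with no extra infrastructure.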

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;You now have the foundation for building powerful RAG applications with LlamaIndex and the Featherless integration. Start by exploring the vast model selection at &lt;a href="http://featherless.ai" rel="noopener noreferrer"&gt;featherless.ai&lt;/a&gt;; you might discover specialized models perfect for your use case that you wouldn’t have considered before.&lt;/p&gt;

&lt;p&gt;As your application grows, consider adding persistence with vector databases to handle larger document collections. Implement evaluation metrics to measure your retrieval quality and iterate on your chunking strategies. The real power comes when you start building agents that combine RAG with tool use, enabling complex workflows that go beyond simple Q&amp;amp;A. &lt;/p&gt;
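A simple place to start with retrieval evaluation is hit rate: for each question paired with the document it should surface, check whether that document appears in the top-k results. A framework-agnostic sketch, where `toy_retrieve` is a hypothetical stand-in for your real retriever returning `(doc_id, score)` pairs:

```python
def hit_rate(eval_pairs, retrieve, k: int = 3) -> float:
    """Fraction of questions whose expected doc id shows up in the top-k results."""
    hits = 0
    for question, expected_id in eval_pairs:
        top_ids = [doc_id for doc_id, _score in retrieve(question)[:k]]
        hits += expected_id in top_ids
    return hits / len(eval_pairs)


# Toy retriever used only to illustrate the metric
def toy_retrieve(question: str):
    return [("policy-refunds", 0.91), ("faq-shipping", 0.62), ("product-specs", 0.40)]


pairs = [("How do refunds work?", "policy-refunds"),
         ("Where is my order?", "doc-not-indexed")]
print(hit_rate(pairs, toy_retrieve))  # 0.5
```

Tracking this number as you vary chunk size and overlap gives you a concrete signal for iterating on your chunking strategy.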

&lt;p&gt;Join our community on &lt;a href="//discord.gg/featherlessai"&gt;Discord&lt;/a&gt; to share your builds and learn from others who are pushing the boundaries of what’s possible with RAG.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>llm</category>
      <category>ai</category>
      <category>rag</category>
    </item>
    <item>
      <title>Building Production-Ready LLM Apps with LangChain &amp; Featherless Serverless Inference</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Fri, 23 May 2025 15:25:01 +0000</pubDate>
      <link>https://dev.to/featherlessai/building-production-ready-llm-apps-with-langchain-featherless-serverless-inference-3kp4</link>
      <guid>https://dev.to/featherlessai/building-production-ready-llm-apps-with-langchain-featherless-serverless-inference-3kp4</guid>
      <description>&lt;p&gt;As the open source AI ecosystem rapidly evolves, developers are faced with two growing challenges: managing infrastructure and evaluating the ever-expanding universe of models. By integrating with LangChain, Featherless now enables you to build and scale LLM-powered applications with zero infrastructure hassle and instant access to over 4,300 open source models. Following up on our previous post, “&lt;a href="https://featherless.ai/blog/zero-to-ai-deploying-language-models-without-the-infrastructure-headache" rel="noopener noreferrer"&gt;Zero to AI: Deploying Language Models without the Infrastructure Headache,&lt;/a&gt;” we’re thrilled to announce a significant leap forward: &lt;strong&gt;Featherless now has a native integration with LangChain!&lt;/strong&gt; You can find us on the &lt;a href="https://python.langchain.com/docs/integrations/chat/featherless_ai/" rel="noopener noreferrer"&gt;LangChain Python documentation.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Prototype to Production: Why Combining LangChain + Featherless is a Game-Changer
&lt;/h2&gt;

&lt;p&gt;While LangChain has pioneered how developers chain together LLM operations, the challenge of managing model infrastructure remains. With Featherless we hope to solve this piece of the puzzle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalable Infrastructure&lt;/strong&gt; - Deploy production-grade LLM applications without a single line of DevOps code. No GPU provisioning, no autoscaling headaches and no containers to manage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unlimited Model Flexibility&lt;/strong&gt; - Instant access to 4,300+ (and growing every day) open source models through a single consistent API. Swap between Mistral, Llama, DeepSeek, Qwen and thousands more by changing just one parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictable Pricing&lt;/strong&gt; - Featherless offers straightforward subscription-based pricing with no hidden costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Prototyping &amp;amp; Testing&lt;/strong&gt; - Evaluate different models for your use case in minutes, not days. Experiment with model parameters and find the perfect balance of performance and cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is for you to focus on your application logic while we handle the heavy lifting of inference infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quickstart: Launch your LangChain App with Featherless
&lt;/h2&gt;

&lt;p&gt;Getting started is incredibly straightforward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install necessary packages:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;langchain langchain-core langchain-featherless-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Note: &lt;code&gt;langchain-featherless-ai&lt;/code&gt; is the dedicated package for our native integration.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Initialize ChatFeatherlessAi as your LLM provider in LangChain:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_featherless_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatFeatherlessAi&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Featherless LLM
# Best practice: Set your API key as an environment variable (FEATHERLESS_API_KEY)
# Or, you can pass it directly:
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatFeatherlessAi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;featherless_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Replace with your actual key
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Small-24B-Instruct-2501&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Example model
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt; &lt;span class="c1"&gt;# Adjusted for a slogan
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define a prompt template
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is a creative slogan for a product called {product}?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define an output parser
&lt;/span&gt;&lt;span class="n"&gt;output_parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Construct the chain using LCEL's pipe (|) operator
&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;output_parser&lt;/span&gt;

&lt;span class="c1"&gt;# Invoke the chain
&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Featherless AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Slogan for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Key Change: We are now using &lt;code&gt;ChatFeatherlessAi&lt;/code&gt; directly from &lt;code&gt;langchain_featherless_ai&lt;/code&gt; instead of the OpenAI-compatible endpoint. The API key can be passed directly or set via the &lt;code&gt;FEATHERLESS_API_KEY&lt;/code&gt; environment variable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Done! You’ve just powered your LangChain application with a model from Featherless using our direct, native integration and modern LCEL syntax.&lt;/p&gt;
&lt;h2&gt;
  
  
  Example Use Case: Building a RAG App with Native Featherless Integration
&lt;/h2&gt;

&lt;p&gt;Let’s dig deeper into the power of this native integration by building a lightweight RAG (Retrieval-Augmented Generation) system. This is perfect for creating Q&amp;amp;A bots over your own documents.&lt;/p&gt;

&lt;p&gt;We'll use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ChatFeatherlessAi&lt;/code&gt; for LLM inference.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LangChain&lt;/code&gt; (LCEL, community packages) for orchestration and retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FAISS&lt;/code&gt; (from &lt;code&gt;langchain-community&lt;/code&gt;) as a simple in-memory vector store.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HuggingFaceEmbeddings&lt;/code&gt; (from &lt;code&gt;langchain-huggingface&lt;/code&gt;) for document embedding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Install additional packages for RAG:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;langchain-community langchain-huggingface langchain-text-splitters faiss-cpu sentence-transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Note: &lt;code&gt;faiss-cpu&lt;/code&gt; is for CPU-based FAISS, use &lt;code&gt;faiss-gpu&lt;/code&gt; if you have a GPU setup.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ingest and Index Your Documents&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_huggingface&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HuggingFaceEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="c1"&gt;# Create a dummy "your_document.txt" file in the same directory for this example:
# File content: "The Featherless API provides access to many LLMs. It's designed for ease of use and developer productivity."
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_document.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Featherless API provides access to many LLMs. It&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s designed for ease of use and developer productivity.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./your_document.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error preparing or loading document: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please ensure you can write to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_document.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or create it manually.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;split_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Using a common, reliable sentence transformer model
&lt;/span&gt;    &lt;span class="n"&gt;embeddings_model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-mpnet-base-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HuggingFaceEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings_model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; 
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error creating FAISS vector store or retriever: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This might be related to your PyTorch/Torchvision/FAISS setup.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure you followed Step 1 for installing PyTorch correctly.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No documents loaded, retriever will not be initialized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
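&lt;p&gt;The &lt;code&gt;chunk_size=500&lt;/code&gt; / &lt;code&gt;chunk_overlap=50&lt;/code&gt; settings above mean consecutive chunks share a 50-character boundary region. A simplified, dependency-free character-window sketch illustrates the idea (the real &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; additionally prefers to break on separators like paragraphs and sentences before falling back to fixed windows):&lt;/p&gt;

```python
def naive_split(text, chunk_size=500, chunk_overlap=50):
    # Simplified fixed-window splitter: each chunk starts chunk_size - chunk_overlap
    # characters after the previous one, so adjacent chunks overlap by chunk_overlap.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = naive_split("x" * 1200, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])  # → [500, 500, 300]
```

&lt;p&gt;The overlap keeps a sentence that straddles a chunk boundary retrievable from either side.&lt;/p&gt;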



&lt;p&gt;&lt;strong&gt;3. Set Up &lt;code&gt;ChatFeatherlessAi&lt;/code&gt; and Build the RAG Chain using LCEL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnableParallel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_featherless_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatFeatherlessAi&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Featherless LLM
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatFeatherlessAi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;featherless_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Replace
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Small-24B-Instruct-2501&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="c1"&gt;# Lower temperature for more factual RAG
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# RAG Prompt Template
&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question based only on the following context:
{context}

Question: {question}

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;rag_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Helper function to format retrieved documents
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Construct the RAG chain using LCEL
&lt;/span&gt;    &lt;span class="n"&gt;rag_chain_from_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;rag_prompt&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;rag_chain_with_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunnableParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rag_chain_from_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Example Invocation
&lt;/span&gt;    &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the Featherless API designed for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_chain_with_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (Metadata: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAG chain not created as retriever is unavailable.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
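&lt;p&gt;The &lt;code&gt;format_docs&lt;/code&gt; helper above simply concatenates the retrieved chunks into one context string. A standalone sketch (using a minimal &lt;code&gt;Doc&lt;/code&gt; stand-in for LangChain's &lt;code&gt;Document&lt;/code&gt;, so it runs without any dependencies) shows the exact behavior:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Minimal stand-in for LangChain's Document, keeping only page_content
    page_content: str

def format_docs(docs):
    # Join chunk texts with blank lines -- the same context format the chain uses
    return "\n\n".join(doc.page_content for doc in docs)

context = format_docs([
    Doc("Featherless is a serverless LLM API."),
    Doc("It hosts thousands of open models."),
])
print(context)
```

&lt;p&gt;The blank-line separator keeps chunk boundaries visible to the model without adding any markup it might echo back.&lt;/p&gt;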



&lt;p&gt;This example demonstrates a modern, LCEL-based approach to building a sophisticated RAG system, seamlessly powered by serverless inference from Featherless and orchestrated via LangChain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effortless Model Experimentation: Remember the &lt;code&gt;model&lt;/code&gt; Parameter
&lt;/h2&gt;

&lt;p&gt;Want to see if LLaMA 3 provides better answers for your use case? Or perhaps test DeepSeek's capabilities? With the native &lt;code&gt;ChatFeatherlessAi&lt;/code&gt; integration, switching models is as simple as updating the &lt;code&gt;model&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize with LLaMA 3
&lt;/span&gt;&lt;span class="n"&gt;llm_llama3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatFeatherlessAi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;featherless_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.3-70B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Example LLaMA 3 model from Featherless
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or try DeepSeek
&lt;/span&gt;&lt;span class="n"&gt;llm_deepseek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatFeatherlessAi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;featherless_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-ai/DeepSeek-V3-0324&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Example DeepSeek model from Featherless
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Then, you can plug these into your LCEL chains:
# new_chain = prompt | llm_llama3 | output_parser
# For the RAG chain, rebuild rag_chain_from_docs around the new model:
# rag_chain_deepseek = (
#     RunnablePassthrough.assign(context=(lambda x: format_docs(x["documents"])))
#     | rag_prompt | llm_deepseek | StrOutputParser()
# )
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This frictionless model evaluation is a game-changer for prompt tuning and finding the perfect LLM for your specific task, all within a familiar LangChain paradigm.&lt;/p&gt;
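&lt;p&gt;One way to make that evaluation systematic is a small comparison harness. This is a hypothetical sketch: &lt;code&gt;make_llm&lt;/code&gt; is assumed to be a factory you supply (e.g. &lt;code&gt;lambda m: ChatFeatherlessAi(featherless_api_key="...", model=m)&lt;/code&gt;), and &lt;code&gt;EchoLLM&lt;/code&gt; is a stub so the harness runs offline:&lt;/p&gt;

```python
# Candidate model IDs from the Featherless catalog (examples from this post)
candidate_models = [
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "meta-llama/Llama-3.3-70B-Instruct",
    "deepseek-ai/DeepSeek-V3-0324",
]

def compare(make_llm, prompt):
    # One answer per model ID, keyed by model, for side-by-side review
    return {m: make_llm(m).invoke(prompt) for m in candidate_models}

# Stub client standing in for ChatFeatherlessAi; swap in the real factory to test.
class EchoLLM:
    def __init__(self, model):
        self.model = model
    def invoke(self, prompt):
        return f"{self.model}: {prompt}"

results = compare(EchoLLM, "ping")
print(results["deepseek-ai/DeepSeek-V3-0324"])
```

&lt;p&gt;Because every model sits behind the same interface, the only thing that changes per run is the model string.&lt;/p&gt;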

&lt;h2&gt;
  
  
  How to Get Started with Featherless and LangChain:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create your free Featherless account:&lt;/strong&gt; Sign up at Featherless.ai&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grab your API key:&lt;/strong&gt; Find it on your Featherless dashboard. Set it as an environment variable &lt;code&gt;FEATHERLESS_API_KEY&lt;/code&gt; or pass it directly to &lt;code&gt;ChatFeatherlessAi&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore the Model Catalog:&lt;/strong&gt; Discover over 4,300 models ready for instant deployment. Check the latest list &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;here.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;langchain-featherless-ai&lt;/code&gt; and other necessary &lt;code&gt;langchain&lt;/code&gt; packages, then use &lt;code&gt;ChatFeatherlessAi&lt;/code&gt; as shown.&lt;/li&gt;
&lt;li&gt;Dive into the Docs:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://featherless.ai/docs/getting-started" rel="noopener noreferrer"&gt;Featherless API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/integrations/chat/featherless_ai/" rel="noopener noreferrer"&gt;Official LangChain Integration Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/concepts/lcel/" rel="noopener noreferrer"&gt;LangChain Expression Language (LCEL)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
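&lt;p&gt;The environment-variable approach from step 2 can be sketched as follows (the placeholder fallback is illustrative -- it lets example scripts construct, but real API calls will fail until it is replaced):&lt;/p&gt;

```python
import os

# Prefer the FEATHERLESS_API_KEY environment variable; fall back to a
# placeholder string so the example still runs without a key configured.
api_key = os.environ.get("FEATHERLESS_API_KEY", "YOUR_FEATHERLESS_API_KEY")
print(api_key != "YOUR_FEATHERLESS_API_KEY")  # True once the env var is set
```

&lt;p&gt;Keeping the key in the environment avoids committing it to version control alongside your chain code.&lt;/p&gt;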

&lt;h2&gt;
  
  
  Final Thoughts: Build Without Limits
&lt;/h2&gt;

&lt;p&gt;The native synergy between LangChain's powerful orchestration (especially with LCEL) and Featherless's &lt;code&gt;ChatFeatherlessAi&lt;/code&gt; component is set to redefine how developers build, test, and ship LLM-powered applications. By removing infrastructure bottlenecks and providing vast model choice through a dedicated integration, we're empowering you to focus solely on innovation. Cold starts, model hosting, and scaling headaches are now a thing of the past.&lt;/p&gt;

&lt;p&gt;Ready to build your next groundbreaking LLM app without the usual friction? Join our &lt;a href="https://discord.gg/7gybCMPjVA" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; today to get help building your first app!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Featherless with LangChain's native integration today!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Featherless: Open source LLMs. One API. Zero infrastructure.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>langchain</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Running OpenHands LM 32B with Featherless.ai: A Practical Guide</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Tue, 08 Apr 2025 08:49:40 +0000</pubDate>
      <link>https://dev.to/featherlessai/running-openhands-lm-32b-with-featherlessai-a-practical-guide-bpl</link>
      <guid>https://dev.to/featherlessai/running-openhands-lm-32b-with-featherlessai-a-practical-guide-bpl</guid>
      <description>&lt;p&gt;The landscape of AI-powered software development is evolving at breakneck speed, and the release of &lt;strong&gt;OpenHands LM 32B&lt;/strong&gt; marks a significant leap forward. This powerful, open-source coding model, boasting an impressive 37.2% resolve rate on SWE-Bench Verified, brings enterprise-grade AI assistance directly to your local environment. By pairing it with &lt;strong&gt;Featherless.ai&lt;/strong&gt;'s efficient model hosting, you create a potent yet accessible development setup. Whether you're a solo developer aiming to accelerate your workflow or part of a team seeking freedom from proprietary solutions, this combination offers compelling advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Choose Featherless.ai for Your OpenHands LM Deployment?
&lt;/h2&gt;

&lt;p&gt;Running large models like OpenHands LM 32B locally can be resource-intensive. Featherless.ai provides an elegant solution, standing out in the crowded AI inference market with its unique approach to model hosting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vast Model Catalog&lt;/strong&gt;: Easily access an extensive library of models, including the &lt;strong&gt;OpenHands LM 32B&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictable Subscription Pricing&lt;/strong&gt;: Enjoy straightforward costs with a subscription, avoiding volatile pay-per-token fees.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to get started? The first step is installing the OpenHands application itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing OpenHands
&lt;/h2&gt;

&lt;p&gt;To get started with OpenHands, you'll need to install the application first. The installation process varies depending on your operating system and device. I recommend following the official installation guide provided by All Hands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.all-hands.dev/modules/usage/installation" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenHands Installation Guide&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The documentation provides detailed instructions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing on macOS, Windows, and Linux&lt;/li&gt;
&lt;li&gt;System requirements&lt;/li&gt;
&lt;li&gt;Configuration options&lt;/li&gt;
&lt;li&gt;Troubleshooting common installation issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ensure you have OpenHands installed and running correctly before proceeding to connect it with the powerful LM hosted on Featherless.ai.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting OpenHands to Featherless.ai
&lt;/h2&gt;

&lt;p&gt;With the OpenHands application installed, it's time to connect it to the &lt;strong&gt;OpenHands LM 32B&lt;/strong&gt; model running efficiently on Featherless.ai. This integration unlocks the model's power without requiring local GPU resources. Here's how to set up the integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, you'll need to create an account on &lt;a href="https://featherless.ai/register" rel="noopener noreferrer"&gt;&lt;strong&gt;Featherless.ai&lt;/strong&gt;&lt;/a&gt; if you don't already have one.&lt;/li&gt;
&lt;li&gt;Once logged in, navigate to your account settings to obtain your API key. This key will authenticate your requests to the &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;&lt;strong&gt;Featherless.ai&lt;/strong&gt;&lt;/a&gt; API.&lt;/li&gt;
&lt;li&gt;Open the OpenHands application (usually running at &lt;strong&gt;localhost:3000&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Go to the settings page (gear icon, typically at the bottom left).&lt;/li&gt;
&lt;li&gt;Since Featherless.ai provides an OpenAI-compatible API endpoint, we can configure OpenHands to use it by setting the following options within the application:&lt;/li&gt;
&lt;li&gt;Enable Advanced options (toggle switch).&lt;/li&gt;
&lt;li&gt;Set the following:

&lt;ul&gt;
&lt;li&gt;Custom Model to &lt;code&gt;openai/all-hands/openhands-lm-32b-v0.1&lt;/code&gt;. The &lt;code&gt;openai/&lt;/code&gt; prefix tells OpenHands to use the OpenAI API format with the specified model available on Featherless.ai.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Base URL&lt;/code&gt; to &lt;a href="https://api.featherless.ai/v1" rel="noopener noreferrer"&gt;https://api.featherless.ai/v1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;API Key&lt;/code&gt; to your Featherless API Key&lt;/li&gt;
&lt;li&gt;Disable memory condensation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Fill in your Git Provider Settings if necessary (e.g., GitHub token).&lt;/li&gt;

&lt;li&gt;Save Changes!&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwechlbthits3h5k9g1wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwechlbthits3h5k9g1wt.png" alt="OpenHands LLM Settings" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all it takes! Your OpenHands application is now powered by the sophisticated &lt;strong&gt;OpenHands LM 32B&lt;/strong&gt; via Featherless.ai. Start experimenting! Try feeding it complex coding challenges, asking it to refactor code, or even resolving GitHub issues directly. You might be surprised by its problem-solving prowess. Furthermore, this same setup process works seamlessly with other cutting-edge models available in the extensive Featherless.ai catalog, like the recent &lt;strong&gt;DeepSeek V3&lt;/strong&gt;.&lt;/p&gt;
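&lt;p&gt;If the connection misbehaves, you can sanity-check your key and the endpoint outside OpenHands with a plain OpenAI-style HTTP request. In this sketch the request is built but not sent, so it runs offline; note the model ID drops the &lt;code&gt;openai/&lt;/code&gt; routing prefix, which is an OpenHands convention rather than part of the Featherless model name:&lt;/p&gt;

```python
import json
import urllib.request

# Standard OpenAI-compatible chat-completions payload, using the model and
# base URL from the OpenHands settings above.
payload = {
    "model": "all-hands/openhands-lm-32b-v0.1",
    "messages": [{"role": "user", "content": "Say hello"}],
}
req = urllib.request.Request(
    "https://api.featherless.ai/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_FEATHERLESS_API_KEY",  # replace with your key
        "Content-Type": "application/json",
    },
)
# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

&lt;p&gt;A successful response here confirms the key and model name are valid before you debug anything inside OpenHands itself.&lt;/p&gt;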

&lt;p&gt;Ready to start building? Head over to &lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;https://featherless.ai/&lt;/a&gt; to create an account. Our growing community of developers, enthusiasts, and AI practitioners is here to help you get the most out of Featherless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join our &lt;a href="https://discord.gg/bbvhdWmPHa" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; community to connect with other users&lt;/li&gt;
&lt;li&gt;Follow us on Twitter (&lt;a href="https://x.com/featherlessai" rel="noopener noreferrer"&gt;@FeatherlessAI&lt;/a&gt;) for the latest updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We look forward to seeing what you create and share with the community.&lt;/p&gt;

</description>
      <category>coding</category>
      <category>tooling</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Initial Support for Google's Gemma 3 27B Models Now Live on Featherless.ai!</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Mon, 07 Apr 2025 14:45:00 +0000</pubDate>
      <link>https://dev.to/featherlessai/initial-support-for-googles-gemma-3-27b-models-now-live-on-featherlessai-17h5</link>
      <guid>https://dev.to/featherlessai/initial-support-for-googles-gemma-3-27b-models-now-live-on-featherlessai-17h5</guid>
      <description>&lt;p&gt;We’re thrilled to announce that we’ve added initial support for &lt;a href="https://featherless.ai/model-families/gemma3" rel="noopener noreferrer"&gt;Google’s Gemma 3 27B&lt;/a&gt; models to the Featherless.ai serverless inference platform! After dedicated work from our team, &lt;strong&gt;both the instruct-tuned and pre-trained versions of the 27B parameter model&lt;/strong&gt; are now active and ready for use.&lt;/p&gt;

&lt;p&gt;This marks the first step in bringing Gemma 3, Google’s latest state-of-the-art open model family, to our users. We plan to start onboarding &lt;strong&gt;fine-tuned versions of Gemma 3 27B over the next week.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemma 3 27B on Featherless.ai: Powerful Inference Without the Complexity
&lt;/h2&gt;

&lt;p&gt;Access the impressive performance of Gemma 3 27B through our simple serverless API, letting you focus on building great applications instead of managing infrastructure. Built on the same research and technology behind Google’s Gemini models, Gemma 3 27B delivers cutting-edge AI capabilities with just an API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Updates &amp;amp; What’s Next
&lt;/h2&gt;

&lt;p&gt;Alongside this model release, we’ve also &lt;strong&gt;pulled in the latest version of vLLM.&lt;/strong&gt; This foundational update means exciting features like &lt;strong&gt;tool calling, custom grammar support, and vision pipelines for Gemma&lt;/strong&gt; are now solidly on our roadmap (and perhaps even closer than expected!). A big shoutout to our dedicated Inference team members for making these advancements possible!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Growing Spectrum of Open Models on Featherless.ai
&lt;/h2&gt;

&lt;p&gt;With the addition of Gemma 3 27B, our platform continues to offer a diverse range of powerful open models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1&lt;/strong&gt;: Pushing the boundaries of reasoning with 671B parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwerky-72B&lt;/strong&gt;: Built on the efficient RWKV architecture (sub-quadratic scaling) designed for significantly reduced inference costs (VRAM and compute).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QwQ-32B&lt;/strong&gt;: Strong reasoning in an efficient 32B parameter package.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 3 27B&lt;/strong&gt;: Google’s latest advancements available in a powerful 27B model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This selection provides developers with significant choice in matching model capabilities to their specific application needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Do with Gemma 3 27B
&lt;/h2&gt;

&lt;p&gt;With our serverless implementation of Gemma 3 27B, you can immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage model performance designed to be highly competitive, even against much larger models in preliminary evaluations.&lt;/li&gt;
&lt;li&gt;Build multilingual applications with its broad language support.&lt;/li&gt;
&lt;li&gt;Utilize the power of the 27B parameter model for demanding generative AI tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Featherless.ai users seeking powerful AI through a simple API, Gemma 3 27B presents an exciting new option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Gemma 3 27B on Featherless.ai Today!
&lt;/h2&gt;

&lt;p&gt;The Gemma 3 27B models are ready for immediate use through our platform. Whether you’re building applications that need sophisticated language understanding or exploring the capabilities of this new model, our serverless API provides the simplest path to integration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chat with it on &lt;a href="https://phoenix.featherless.ai/" rel="noopener noreferrer"&gt;Phoenix&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Integrate via the &lt;a href="https://featherless.ai/docs/getting-started" rel="noopener noreferrer"&gt;Featherless API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore our documentation: Check out &lt;a href="https://featherless.ai/blog/zero-to-ai-deploying-language-models-without-the-infrastructure-headache" rel="noopener noreferrer"&gt;our implementation guides&lt;/a&gt; and &lt;a href="https://github.com/featherlessai" rel="noopener noreferrer"&gt;example code&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Have questions about using Gemma 3 27B through our serverless platform? Reach out to us on &lt;a href="https://discord.com/invite/7gybCMPjVA" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or check our documentation for API references and best practices.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Supercharging Your Development Workflow: Integrating Featherless.ai with Aider and Cursor</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Thu, 03 Apr 2025 13:54:40 +0000</pubDate>
      <link>https://dev.to/featherlessai/supercharging-your-development-workflow-integrating-featherlessai-with-aider-and-cursor-39d6</link>
      <guid>https://dev.to/featherlessai/supercharging-your-development-workflow-integrating-featherlessai-with-aider-and-cursor-39d6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;So you’ve heard that the latest DeepSeek V3 or OpenHands LM model is good at coding, and now you’re wondering how to bring that power directly into your coding workflow. You’ve come to the right place: at &lt;a href="http://Featherless.ai" rel="noopener noreferrer"&gt;Featherless.ai&lt;/a&gt; we give you access not only to the latest DeepSeek but to any open model on Hugging Face, without the headache of managing infrastructure. &lt;/p&gt;

&lt;p&gt;This guide walks you through integrating &lt;a href="http://Featherless.ai" rel="noopener noreferrer"&gt;Featherless.ai&lt;/a&gt; with two popular AI-assisted coding tools: Aider and Cursor. Whether you’re pair programming with an AI assistant through Aider’s command-line interface or leveraging Cursor’s intelligent code completion and refactoring capabilities, Featherless.ai can significantly enhance your development experience by giving you access to the latest, and any future, open-source models you want to work with. &lt;/p&gt;

&lt;p&gt;At the end we’ll go over some of our favorite models for coding, but first let’s get started with the integration process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Featherless API Key
&lt;/h2&gt;

&lt;p&gt;You’ll need a couple of things before you start. First and foremost, you’ll need a Featherless API key to connect to any of the 4000+ open models in our catalog. Head over to &lt;a href="http://featherless.ai" rel="noopener noreferrer"&gt;featherless.ai&lt;/a&gt; and create an account if you haven’t already. Once logged in, navigate to the API section in your dashboard, where you can generate a new API key. This key is your secure passport to all the powerful open-source models we host, including DeepSeek V3. Keep it handy, as you’ll need to configure it in both Aider and Cursor in the following steps. Remember to treat your API key like a password: don’t share it publicly or commit it to version control systems.&lt;/p&gt;

&lt;p&gt;You will also need to choose a model in our &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt;. Let’s take the code-specific Qwen model called &lt;code&gt;Qwen/Qwen2.5-Coder-32B-Instruct&lt;/code&gt; as an example.&lt;/p&gt;
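&lt;p&gt;If you prefer to browse programmatically, you can also list models over the API. The sketch below assumes Featherless exposes the standard OpenAI-compatible &lt;code&gt;/models&lt;/code&gt; listing endpoint under the same base URL used later in this guide; check the API docs if the route differs.&lt;/p&gt;

```python
import os

import requests

BASE_URL = "https://api.featherless.ai/v1"

def build_models_request(api_key):
    """Assemble the URL and auth headers for an OpenAI-style model listing."""
    url = f"{BASE_URL}/models"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers

# Only fires when a key is configured, so the sketch stays import-safe.
if __name__ == "__main__" and "FEATHERLESS_API_KEY" in os.environ:
    url, headers = build_models_request(os.environ["FEATHERLESS_API_KEY"])
    data = requests.get(url, headers=headers).json()["data"]
    print([m["id"] for m in data][:10])
```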

&lt;h2&gt;
  
  
  Setting Up Aider with Featherless.ai
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;First, you'll need to install Aider if you haven't already. Aider is a command-line tool that lets you pair program with AI models directly from your terminal. Visit &lt;a href="https://aider.chat/docs/installation.html" rel="noopener noreferrer"&gt;Aider's official documentation&lt;/a&gt; for the most up-to-date installation instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating with Featherless.ai
&lt;/h3&gt;

&lt;p&gt;Once Aider is installed, integrating it with Featherless.ai requires just two configuration files in your project folder:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create a model settings file named &lt;code&gt;.aider.model.settings.yml&lt;/code&gt;:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cache_control&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;caches_by_default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;edit_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;whole&lt;/span&gt;
  &lt;span class="na"&gt;examples_as_sys_msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;extra_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;
  &lt;span class="na"&gt;lazy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/Qwen/Qwen2.5-Coder-32B-Instruct&lt;/span&gt;
  &lt;span class="na"&gt;reminder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user&lt;/span&gt;
  &lt;span class="na"&gt;send_undo_reply&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;streaming&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;use_repo_map&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;use_system_prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;use_temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Create a model metadata file named &lt;code&gt;.aider.model.metadata.json&lt;/code&gt;:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai/Qwen/Qwen2.5-Coder-32B-Instruct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"input_cost_per_token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"output_cost_per_token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"litellm_provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"support_vision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"support_function_calling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running Aider with Featherless.ai
&lt;/h3&gt;

&lt;p&gt;Now you can start Aider with Featherless.ai by running the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aider &lt;span class="nt"&gt;--openai-api-base&lt;/span&gt; &lt;span class="s1"&gt;'https://api.featherless.ai/v1'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--openai-api-key&lt;/span&gt; your_featherless_API_key &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s1"&gt;'openai/Qwen/Qwen2.5-Coder-32B-Instruct'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--map-tokens&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--model-metadata-file&lt;/span&gt; &lt;span class="s1"&gt;'/path/to/.aider.model.metadata.json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--model-settings-file&lt;/span&gt; &lt;span class="s1"&gt;'/path/to/.aider.model.settings.yml'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;your_featherless_API_key&lt;/code&gt; with the API key you obtained from the Featherless.ai dashboard&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;/path/to/&lt;/code&gt; to the actual path where you saved your configuration files&lt;/li&gt;
&lt;li&gt;Note that you can create separate configuration files for each model you want to use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it! You're now ready to use powerful models like Qwen2.5-Coder-32B through Aider, all powered by Featherless.ai's infrastructure.&lt;/p&gt;
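&lt;p&gt;Before launching Aider, it can help to sanity-check your key and model name directly against the endpoint the command above points at. Here is a minimal sketch of a standard OpenAI-style chat completions request; note that the &lt;code&gt;openai/&lt;/code&gt; prefix in the Aider config is routing for Aider’s model layer, while the API itself takes the bare model ID.&lt;/p&gt;

```python
import os

import requests

BASE_URL = "https://api.featherless.ai/v1"
MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"

def build_chat_request(api_key, model, prompt):
    """Assemble an OpenAI-style chat completions request for Featherless."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return url, headers, payload

# Only fires when a key is configured, so the sketch stays import-safe.
if __name__ == "__main__" and "FEATHERLESS_API_KEY" in os.environ:
    url, headers, payload = build_chat_request(
        os.environ["FEATHERLESS_API_KEY"], MODEL, "Say hello in one word."
    )
    resp = requests.post(url, headers=headers, json=payload)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```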

&lt;h2&gt;
  
  
  Setting Up Cursor with Featherless.ai
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuring Cursor to Use Featherless Models
&lt;/h3&gt;

&lt;p&gt;Cursor is a powerful AI-assisted code editor that can be enhanced with custom models from Featherless.ai. Here's how to set it up step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open Cursor Settings&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Launch the Cursor application&lt;/li&gt;
&lt;li&gt;Click on the gear icon in the top right corner, or use the keyboard shortcut &lt;code&gt;Ctrl+Shift+J&lt;/code&gt; (Windows/Linux) or &lt;code&gt;Cmd+Shift+J&lt;/code&gt; (Mac)&lt;/li&gt;
&lt;li&gt;Navigate to the "Models" section in the sidebar&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure Custom Models&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Uncheck all pre-selected models to start&lt;/li&gt;
&lt;li&gt;Click the "Add Model" button&lt;/li&gt;
&lt;li&gt;Enter &lt;code&gt;Qwen/Qwen2.5-Coder-32B-Instruct&lt;/code&gt; as the model name&lt;/li&gt;
&lt;li&gt;If you want to use other models later, you can add them the same way&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftj8vaf14ij1akp7bni1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftj8vaf14ij1akp7bni1.png" alt="Add Custom Model" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set Up Featherless API Connection&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Look for the "OpenAI API Key" section in the settings&lt;/li&gt;
&lt;li&gt;Enter your Featherless API key in the "OpenAI Key" field&lt;/li&gt;
&lt;li&gt;Find the "Override OpenAI Base URL" option and enter: &lt;code&gt;https://api.featherless.ai/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click the "Save and Verify" button to test your connection&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z3qa49s54qln9zqq9zs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z3qa49s54qln9zqq9zs.png" alt="Custom API Settings" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start Using Your Custom Model&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Open or create a project in Cursor&lt;/li&gt;
&lt;li&gt;Click on the chat/AI button in the sidebar&lt;/li&gt;
&lt;li&gt;In the model selector dropdown at the top of the chat panel, select &lt;code&gt;Qwen/Qwen2.5-Coder-32B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start coding with your Featherless-powered model!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Troubleshooting Tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection Failed?&lt;/strong&gt; Double-check your API key for typos and ensure the base URL is entered correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Not Appearing?&lt;/strong&gt; Try restarting Cursor after saving your settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow Responses?&lt;/strong&gt; Complex coding tasks might take a moment; the model is processing your entire codebase context&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommended Models for Coding
&lt;/h2&gt;

&lt;p&gt;Now that you’ve set up your tools with Featherless.ai, here are some of our top model recommendations:&lt;/p&gt;

&lt;h3&gt;
  
  
  Best All-Around Coding Models
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://featherless.ai/models/all-hands/openhands-lm-32b-v0.1" rel="noopener noreferrer"&gt;Openhands LM&lt;/a&gt; - The strongest 32B coding agent model, resolving 37.4% of issues on SWE-bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://featherless.ai/blog/deepseek-ai/DeepSeek-V3-0324" rel="noopener noreferrer"&gt;DeepSeek V3&lt;/a&gt; - Excellent balance of speed and accuracy for general coding tasks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://featherless.ai/blog/Qwen/Qwen2.5-Coder-32B-Instruct" rel="noopener noreferrer"&gt;Qwen2.5-Coder-32B&lt;/a&gt; - Best for complex projects and production-quality code&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://featherless.ai/blog/open-r1/OlympicCoder-32B" rel="noopener noreferrer"&gt;open-r1/OlympicCoder-32B&lt;/a&gt; - A code model that achieves very strong performance on competitive coding benchmarks such as LiveCodeBench and the 2024 International Olympiad in Informatics&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>cursor</category>
      <category>aider</category>
    </item>
    <item>
      <title>Building a PDF-to-Podcast Pipeline with Open-Source AI: From Text Extraction to Voice Synthesis</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Tue, 11 Mar 2025 08:59:08 +0000</pubDate>
      <link>https://dev.to/featherlessai/building-a-pdf-to-podcast-pipeline-with-open-source-ai-from-text-extraction-to-voice-synthesis-eo1</link>
      <guid>https://dev.to/featherlessai/building-a-pdf-to-podcast-pipeline-with-open-source-ai-from-text-extraction-to-voice-synthesis-eo1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine this: you’re jogging through the park, earbuds in, grinning as two lively voices chat about the latest AI research paper, just like it’s a podcast made just for you. Or picture a busy content creator with a pile of blog posts, dreaming of turning them into audio gold without spending hours recording. That’s where this AI-powered pipeline comes in. It takes static PDFs and transforms them into engaging, conversational podcasts using open-source tools. In this post, I’ll walk you through the whole process: extracting text, crafting fun scripts, and synthesizing natural audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Turn PDFs into Podcasts?
&lt;/h2&gt;

&lt;p&gt;PDFs are treasure troves of info, but let’s be real: they’re not exactly commute-friendly. Podcasts, though? They’re perfect for multitasking: driving, working out, or chilling out. The problem is that recording a podcast the old-school way (scripting, speaking, editing) is a time sink. This pipeline changes that. It automates the grind so you can focus on the content. Here’s who could use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Researchers&lt;/strong&gt;: Turn dense papers into listens for your morning run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Professionals&lt;/strong&gt;: Make industry reports your gym-session soundtrack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bloggers&lt;/strong&gt;: Repurpose old posts into fresh podcast episodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technologies Used&lt;/strong&gt;&lt;br&gt;
The pipeline leverages several powerful open-source technologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/pymupdf/PyMuPDF" rel="noopener noreferrer"&gt;PyMuPDF&lt;/a&gt;&lt;/strong&gt;: For extracting text content from PDFs while preserving structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;Featherless.ai&lt;/a&gt;&lt;/strong&gt; API: Access to all open-weight models on Hugging Face for text cleaning and creative podcast script generation by using roleplay finetunes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/hexgrad/kokoro" rel="noopener noreferrer"&gt;Kokoro TTS&lt;/a&gt;&lt;/strong&gt;: Converts text into natural-sounding audio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Libraries&lt;/strong&gt;: Tools like Pandas, NumPy, and PyDub handle data and audio processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Complete Pipeline Overview&lt;/strong&gt;&lt;br&gt;
This pipeline architecture consists of four main stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text Extraction and Cleaning&lt;/strong&gt;: Converting PDF to structured, readable text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Podcast Script Generation&lt;/strong&gt;: Transforming factual content into natural dialogue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS Optimization&lt;/strong&gt;: Formatting the script for speech synthesis compatibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Generation&lt;/strong&gt;: Creating and combining audio segments into a cohesive podcast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive into each stage in detail.&lt;/p&gt;

&lt;p&gt;The pipeline consists of four interconnected Jupyter notebooks, each handling a specific stage of the transformation process:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;PDF Document → Text Extraction → Script Generation → TTS Optimization → Audio Generation&lt;/code&gt;&lt;/p&gt;
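&lt;p&gt;At the orchestration level, the four stages compose like a chain of functions. Here is a minimal sketch with stand-in stage functions; the real implementations live in the notebooks and appear throughout this post:&lt;/p&gt;

```python
def run_pipeline(pdf_path, extract, write_script, optimize_for_tts, synthesize):
    """Chain the four stages; each stage is passed in as a function so the
    sketch stays independent of any single notebook's code."""
    text = extract(pdf_path)               # Stage 1: PDF to clean text
    script = write_script(text)            # Stage 2: text to dialogue
    tts_script = optimize_for_tts(script)  # Stage 3: TTS-ready format
    return synthesize(tts_script)          # Stage 4: audio segments

# Stand-in stages, just to show the data flow:
audio = run_pipeline(
    "paper.pdf",
    extract=lambda p: f"text from {p}",
    write_script=lambda t: f"SPEAKER 1: {t}",
    optimize_for_tts=lambda s: s,
    synthesize=lambda s: ["segment-1.wav"],
)
print(audio)
```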
&lt;h2&gt;
  
  
  Stage 1: Text Extraction and Cleaning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Extracting text from PDFs with PyMuPDF&lt;/strong&gt;&lt;br&gt;
The first challenge is to extract text from PDF documents while preserving its meaning and structure. PDFs are notoriously difficult to parse correctly, as they can contain multiple columns, images, headers, footers, and complex layouts. I chose PyMuPDF (via the pymupdf4llm wrapper) for its ability to handle these complexities. Here’s the core extraction function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_text_from_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;validate_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Convert PDF to markdown text
&lt;/span&gt;        &lt;span class="n"&gt;markdown_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pymupdf4llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Truncate if exceeds max_chars
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Truncating text to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;markdown_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;markdown_text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;nExtraction complete! Total characters: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;markdown_text&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An unexpected error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s happening here? The function checks that the PDF is valid, pulls the text out as Markdown to preserve structure (like headings), and truncates it if it’s very long. For non-coders: this is like a super-smart photocopier that grabs only the words you care about. Watch out, though: scanned PDFs or locked files might need some extra work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleaning and Structuring Content&lt;/strong&gt;&lt;br&gt;
Raw PDF text is often cluttered with page numbers, headers, footers, and other elements that don’t belong in a podcast script. Plus, academic and technical documents frequently contain notation that doesn’t translate well to speech. I used the Featherless.ai API to process and clean this text. This approach leverages large language models to understand the content and reformat it appropriately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_num&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Process a chunk of text using Featherless API&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYS_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text_chunk&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;processed_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;processed_text&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error processing chunk &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text_chunk&lt;/span&gt;  &lt;span class="c1"&gt;# Return original text in case of error
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt tells the model to keep the good stuff and ditch the rest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a world class text pre-processor, here is the raw data from a PDF, 
please parse and return it in a way that is crispy and usable to send to a 
podcast writer.
The raw data is messed up with new lines, LaTeX math and you will see fluff 
that we can remove completely. Basically take away any details that you think 
might be useless in a podcast author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s transcript.
Remember, the podcast could be on any topic whatsoever so the issues listed 
above are not exhaustive.
Please be smart with what you remove and be creative ok?
Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT 
AND RE-WRITING WHEN NEEDED.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Before: &lt;code&gt;#Intro\nPage 1\nData is key\\LaTeX{math}here.&lt;/code&gt;&lt;br&gt;
After: &lt;code&gt;Data is key&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Technical Challenges&lt;/strong&gt;&lt;br&gt;
Big PDFs bring big challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Limits&lt;/strong&gt;: Huge files can crash things, so I split the text into 1,000-character chunks, like this:
&lt;code&gt;Text → [Chunk 1 | Chunk 2 | Chunk 3] → Processed.&lt;/code&gt;
Each chunk gets cleaned, then reassembled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weird Layouts&lt;/strong&gt;: PyMuPDF and the LLM team up to straighten out columns and tables so the flow makes sense.&lt;/li&gt;
&lt;/ul&gt;
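
&lt;p&gt;The chunking itself is simple. Here is a sketch of a fixed-budget splitter; the notebook’s exact boundary handling isn’t shown here, so treat the hard 1,000-character cut as an assumption:&lt;/p&gt;

```python
def chunk_text(text, max_chars=1000):
    """Split text into consecutive chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

parts = chunk_text("a" * 2500)
print([len(p) for p in parts])  # prints [1000, 1000, 500]
```

&lt;p&gt;Each chunk then goes through &lt;code&gt;process_chunk&lt;/code&gt; before the cleaned pieces are joined back together.&lt;/p&gt;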

&lt;p&gt;The output of this stage is clean, well-structured text that captures the essential information from the PDF in a format suitable for conversion to podcast dialogue.&lt;/p&gt;
&lt;h2&gt;
  
  
  Stage 2: Podcast Script Generation
&lt;/h2&gt;

&lt;p&gt;Next, we transform the text into a dialogue between two speakers using the &lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;Featherless.ai API&lt;/a&gt; and a large language model (LLM) of choice. It creates a natural back-and-forth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speaker 1&lt;/strong&gt;: The explainer, dropping clear insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speaker 2&lt;/strong&gt;: The curious one, tossing in questions and quirks.
Here’s an example output:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPEAKER 1: Data is critical for AI—it’s what powers the system, much like fuel for an engine.
SPEAKER 2: So, if the data isn’t great, does that affect how well the AI performs?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The LLM adds natural phrasing to make it feel like a real conversation, not just a read-aloud.&lt;/p&gt;
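&lt;p&gt;Under the hood this is a standard OpenAI-compatible chat completion request. A minimal sketch of the payload (the &lt;code&gt;build_script_request&lt;/code&gt; helper, its prompt text, and the model name are illustrative, not the pipeline's exact code):&lt;/p&gt;

```python
import json

# Hypothetical stage-2 instruction; the pipeline's real prompt is more detailed.
SCRIPT_PROMPT = (
    "Rewrite the following text as a podcast dialogue between Speaker 1 "
    "(the explainer) and Speaker 2 (the curious one)."
)

def build_script_request(cleaned_text, model="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    """Build the JSON payload for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SCRIPT_PROMPT},
            {"role": "user", "content": cleaned_text},
        ],
    }

payload = build_script_request("Data is critical for AI.")
body = json.dumps(payload)  # POST this to https://api.featherless.ai/v1/chat/completions
```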
&lt;h2&gt;
  
  
  Stage 3: TTS Optimization
&lt;/h2&gt;

&lt;p&gt;While the previous stage generated a conversational podcast script, this stage takes a different approach focused specifically on Text-to-Speech (TTS) compatibility. Instead of further processing the output from stage 2, we revisit the raw extracted text and apply specialized prompt engineering to generate a script format optimized for voice synthesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge of TTS-Ready Scripts&lt;/strong&gt;&lt;br&gt;
Text-to-speech engines often struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural-sounding dialogue that maintains distinct speaker voices&lt;/li&gt;
&lt;li&gt;Appropriate pacing and pauses&lt;/li&gt;
&lt;li&gt;Handling emotional expressions and reactions&lt;/li&gt;
&lt;li&gt;Structured, predictable formats for programmatic processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal of this stage is to transform our basic script into a structured format that both preserves its conversational nature and ensures reliable TTS processing, while adding some flair to the conversation through a specialized roleplaying language model accessed via &lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;Featherless.ai&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SYSTEM_PROMPT = """
You are an international Oscar-winning screenwriter who has worked with 
multiple award-winning podcasters. Your job is to rewrite the provided podcast 
transcript for an AI Text-To-Speech pipeline. 
The original transcript was written by a less experienced AI, so you need 
to enhance it significantly.
Create an engaging dialogue between two speakers, each with distinct personalities:
- Speaker 1: A captivating teacher who leads the conversation, 
explains concepts with vivid analogies and personal anecdotes, 
and makes the topic accessible and memorable. They speak clearly and 
confidently, without using filler words like "umm" or "hmm."
- Speaker 2: A curious and enthusiastic learner who keeps the conversation 
on track by asking follow-up questions. They often get excited or confused, 
expressing their reactions verbally with phrases like "That's fascinating!", 
"Wait, I'm not sure I get that," or "Wow, that's like [analogy]."
[Additional instructions...]
Return the dialogue as a list of tuples, like this:
[
    ("Speaker 1", "Text here"),
    ("Speaker 2", "Text here"),
    ...
]
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt engineers several crucial elements for TTS success:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speaker-Specific Speech Patterns:&lt;/strong&gt; By assigning distinct personalities, the model creates natural variations in speech patterns that TTS systems can render more distinctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controlled Filler Usage:&lt;/strong&gt; Speaker 1 avoids filler words while Speaker 2 can use them, creating natural rhythm without overwhelming the TTS engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured Data Format:&lt;/strong&gt; The list of tuples creates a programming-friendly format that simplifies integration with TTS systems in the next stage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By generating the script in this structured format, we eliminate many common TTS issues before they occur. The next stage can directly process this optimized script without additional parsing or formatting, streamlining the pipeline from text to spoken audio.&lt;/p&gt;
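&lt;p&gt;Because the script arrives as a literal Python list of tuples, the next stage can parse it with the standard library alone. A minimal sketch (assuming well-formed model output; &lt;code&gt;parse_script&lt;/code&gt; is an illustrative helper, and production code would add error handling):&lt;/p&gt;

```python
import ast

def parse_script(raw):
    """Safely parse the LLM's list-of-tuples output into (speaker, text) pairs."""
    segments = ast.literal_eval(raw.strip())
    return [(speaker, text) for speaker, text in segments]

raw_output = '[("Speaker 1", "Data powers AI."), ("Speaker 2", "So bad data hurts?")]'
podcast_segments = parse_script(raw_output)
```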

&lt;h2&gt;
  
  
  Stage 4: Audio Generation with Kokoro
&lt;/h2&gt;

&lt;p&gt;The final stage transforms our TTS-optimized script into audio using Kokoro, an open-source text-to-speech library that provides high-quality voice synthesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice Selection and Configuration&lt;/strong&gt;&lt;br&gt;
Kokoro offers multiple voices with different characteristics. I selected distinct voices for each speaker to enhance the natural podcast feel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize separate pipelines for each speaker with different voices
# Using American English as the base language
&lt;/span&gt;&lt;span class="n"&gt;speaker1_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# American English
&lt;/span&gt;&lt;span class="n"&gt;speaker2_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# American English
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_speech_kokoro&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Select the appropriate pipeline and voice
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Use a female voice for Speaker 1
&lt;/span&gt;        &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;speaker1_pipeline&lt;/span&gt;
        &lt;span class="n"&gt;voice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;af_heart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Female voice
&lt;/span&gt;        &lt;span class="n"&gt;speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Use a male voice for Speaker 2
&lt;/span&gt;        &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;speaker2_pipeline&lt;/span&gt;
        &lt;span class="n"&gt;voice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;am_fenrir&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Male voice
&lt;/span&gt;        &lt;span class="n"&gt;speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# Slightly faster
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
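&lt;p&gt;Kokoro yields audio as floating-point sample arrays at 24 kHz, so each chunk needs converting to 16-bit PCM before it can be wrapped in a pydub &lt;code&gt;AudioSegment&lt;/code&gt;. A stdlib-only sketch of that conversion (illustrative; the pipeline may do the same with NumPy instead):&lt;/p&gt;

```python
import array

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz mono audio

def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM bytes."""
    ints = array.array("h")
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clip out-of-range values
        ints.append(int(s * 32767))
    return ints.tobytes()

pcm = float_to_pcm16([0.0, 0.5, -0.5, 1.0])
# Wrap as: AudioSegment(data=pcm, sample_width=2, frame_rate=SAMPLE_RATE, channels=1)
```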



&lt;p&gt;For our podcast, I chose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speaker 1: &lt;code&gt;af_heart&lt;/code&gt; - Female American English voice with excellent quality&lt;/li&gt;
&lt;li&gt;Speaker 2: &lt;code&gt;am_fenrir&lt;/code&gt; - Male American English voice with good quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These different voices create a clear distinction between speakers, making the podcast easier to follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combining Segments with Proper Timing&lt;/strong&gt;&lt;br&gt;
To create a cohesive podcast, we need to combine individual audio segments with appropriate spacing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate the podcast
&lt;/span&gt;&lt;span class="n"&gt;final_podcast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AudioSegment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;podcast_segments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generating podcast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;speaker_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Speaker 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Generate audio for this segment
&lt;/span&gt;    &lt;span class="n"&gt;audio_segment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_speech_kokoro&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speaker_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audio_segment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Add slight pause between segments
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;final_podcast&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;AudioSegment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 500ms pause
&lt;/span&gt;        &lt;span class="c1"&gt;# Add to podcast
&lt;/span&gt;        &lt;span class="n"&gt;final_podcast&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;audio_segment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code adds a half-second pause between speaker transitions, creating a natural rhythm in the conversation.&lt;/p&gt;
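&lt;p&gt;The same pause logic can be expressed without pydub, directly on sample arrays (a stdlib-only illustration; &lt;code&gt;join_with_pauses&lt;/code&gt; is not part of the pipeline, and the 24 kHz rate matches Kokoro's output):&lt;/p&gt;

```python
SAMPLE_RATE = 24000
PAUSE_MS = 500

def join_with_pauses(segments, sample_rate=SAMPLE_RATE, pause_ms=PAUSE_MS):
    """Concatenate audio sample lists, inserting silence between them."""
    silence = [0.0] * int(sample_rate * pause_ms / 1000)  # 12,000 zero samples
    out = []
    for i, seg in enumerate(segments):
        if i:  # pause before every segment except the first
            out.extend(silence)
        out.extend(seg)
    return out

combined = join_with_pauses([[0.1] * 100, [0.2] * 100])  # 100 + 12000 + 100 samples
```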

&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;Building this pipeline wasn’t without hurdles. Here are some key challenges and how I tackled them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Handling Complex PDF Layouts&lt;/strong&gt;: PDFs with multi-column formats, images, or tables can be tricky. PyMuPDF’s Markdown conversion preserved some structure, but additional cleaning via the Featherless.ai API removed artifacts like page numbers and headers intelligently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generating Natural Dialogue&lt;/strong&gt;: Turning static text into a dynamic conversation required careful prompt engineering. I guided the LLM to include interruptions, filler words, and personality-driven responses, making the script feel authentic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing for TTS&lt;/strong&gt;: Ensuring the script was TTS-friendly meant structuring it for easy synthesis. Using a tuple-based format and controlling filler usage prevented common TTS pitfalls, like unnatural pacing or mispronounced expressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;p&gt;The pipeline works well, but there’s room to grow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Language Support&lt;/strong&gt;: Adding support for PDFs and podcasts in multiple languages would broaden its reach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced TTS Features&lt;/strong&gt;: Integrating emotional tone adjustments or background music could make the podcasts more immersive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-Tuned Models&lt;/strong&gt;: Using LLMs fine-tuned for podcast script generation could enhance dialogue quality further.
Try languages first; it’s a fun, doable leap, and there are tons of fine-tuned models on Featherless.ai to assist you with that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This PDF-to-podcast pipeline demonstrates the remarkable potential of open-source AI when creatively combined. By bridging PyMuPDF’s extraction capabilities with &lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;Featherless.ai’s&lt;/a&gt; language models and Kokoro’s voice synthesis, we’ve created a system that transforms static documents into engaging audio experiences.&lt;/p&gt;

&lt;p&gt;The true power lies in the modular design. Each component can be independently improved or replaced as new models emerge. Want to try a different LLM? Swap the API endpoint. Prefer different voices? Modify the TTS configuration. This flexibility makes it perfect for experimentation and customization.&lt;/p&gt;
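&lt;p&gt;That modularity can be made concrete with a small configuration layer (a hypothetical sketch; &lt;code&gt;PIPELINE_CONFIG&lt;/code&gt; and &lt;code&gt;with_overrides&lt;/code&gt; are illustrative, not part of the repository):&lt;/p&gt;

```python
# Hypothetical settings object collecting the pipeline's swap points.
PIPELINE_CONFIG = {
    "llm_base_url": "https://api.featherless.ai/v1",
    "llm_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "tts_voices": {"speaker1": "af_heart", "speaker2": "am_fenrir"},
}

def with_overrides(config, **overrides):
    """Return a copy of the config with selected fields swapped out."""
    return {**config, **overrides}

# Trying a different LLM is a one-line change:
experiment = with_overrides(PIPELINE_CONFIG, llm_model="Qwen/QwQ-32B")
```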

&lt;p&gt;We encourage readers to fork the project and make it their own. You can listen to a sample podcast generated with this pipeline or grab the &lt;a href="https://github.com/featherlessai/featherless-podcast" rel="noopener noreferrer"&gt;full code on GitHub&lt;/a&gt; and start building your own. Try adding your own prompts, experiment with different voice combinations, or extend it to handle research papers and technical manuals. The future of content adaptation is open, accessible, and limited only by our imagination, happy podcasting!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>QwQ-32B Now Available on Featherless.ai</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Fri, 07 Mar 2025 10:56:33 +0000</pubDate>
      <link>https://dev.to/featherlessai/qwq-32b-now-available-on-featherlessai-8jo</link>
      <guid>https://dev.to/featherlessai/qwq-32b-now-available-on-featherlessai-8jo</guid>
      <description>&lt;h2&gt;
  
  
  QwQ-32B: A Powerful Lightweight in the Age of Reasoning Models
&lt;/h2&gt;

&lt;p&gt;AI development continues to advance through diverse approaches to model design and optimization. &lt;strong&gt;DeepSeek-R1&lt;/strong&gt;, with its impressive 671B parameters, has established itself as one of the most capable reasoning-focused models on the market. Its remarkable capabilities have set new benchmarks for what models in this space can achieve.&lt;/p&gt;

&lt;p&gt;Meanwhile, efficiency and adaptability continue opening new frontiers, and this is where QwQ-32B, Qwen's latest release, makes its mark.&lt;/p&gt;

&lt;h2&gt;
  
  
  QwQ-32B: Efficient Reasoning Power
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;QwQ-32B&lt;/strong&gt; delivers &lt;strong&gt;high-level reasoning, problem-solving, and strong coding/math capabilities&lt;/strong&gt; in a lightweight package. With just 32B parameters, early benchmarks show impressive performance, making it an attractive option for those looking for strong reasoning capabilities in a more efficient format.&lt;/p&gt;

&lt;p&gt;With AI applications diversifying, the demand for models that deliver &lt;strong&gt;excellent performance with a smaller resource footprint&lt;/strong&gt; continues to grow. &lt;strong&gt;QwQ-32B exemplifies how well-optimized models can achieve remarkable results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of Reasoning Models
&lt;/h2&gt;

&lt;p&gt;The AI field is evolving rapidly. While early models focused heavily on &lt;strong&gt;generative fluency and knowledge retrieval&lt;/strong&gt;, today's most exciting breakthroughs are in &lt;strong&gt;models that can reason, plan, and solve complex problems&lt;/strong&gt;. DeepSeek-R1 has been instrumental in this evolution, demonstrating the power of advanced reasoning capabilities.&lt;/p&gt;

&lt;p&gt;Now, Qwen is expanding possibilities further, showing that &lt;strong&gt;reasoning power can be delivered in different formats to meet diverse needs.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does QwQ-32B Perform?
&lt;/h2&gt;

&lt;p&gt;Testing indicates that QwQ-32B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Excels in logical reasoning tasks&lt;/strong&gt; with impressive structured problem-solving&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performs well in math and coding&lt;/strong&gt;, key benchmarks for reasoning capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offers strong efficiency&lt;/strong&gt;, delivering high performance with reduced compute requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For users seeking &lt;strong&gt;high-quality reasoning in an efficient package&lt;/strong&gt;, QwQ-32B presents an exciting option.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means For You
&lt;/h2&gt;

&lt;p&gt;With both QwQ-32B and DeepSeek-R1 available on &lt;u&gt;&lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;Featherless.ai&lt;/a&gt;&lt;/u&gt;, you now have multiple excellent options for advanced reasoning capabilities. Our ongoing optimization efforts ensure that you'll benefit from continuous improvements in both performance and functionality.&lt;/p&gt;

&lt;p&gt;We're committed to making advanced open AI models accessible and practical for everyone. Each model in our lineup offers unique advantages to suit different use cases and requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Our Models and Share Your Thoughts
&lt;/h2&gt;

&lt;p&gt;QwQ-32B is now available alongside DeepSeek-R1, and we want to hear about your experiences with both models. Each excels in reasoning tasks while offering different profiles in terms of scale and efficiency.&lt;/p&gt;

&lt;p&gt;Leave a review on the &lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;Featherless.ai&lt;/a&gt; model page: &lt;a href="https://featherless.ai/models/Qwen/QwQ-32B" rel="noopener noreferrer"&gt;QwQ-32B&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have questions about integrating these models into your workflow? Reach out to us on Discord or check our documentation for implementation guidelines and best practices.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>qwen</category>
      <category>ai</category>
    </item>
    <item>
      <title>Unlimited DeepSeek-R1 now available to Featherless premium subscribers!</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Mon, 03 Feb 2025 13:31:04 +0000</pubDate>
      <link>https://dev.to/featherlessai/unlimited-deepseek-r1-now-available-to-featherless-premium-subscribers-3ld1</link>
      <guid>https://dev.to/featherlessai/unlimited-deepseek-r1-now-available-to-featherless-premium-subscribers-3ld1</guid>
      <description>&lt;p&gt;We’re happy to announce DeepSeek-R1 support is up on Featherless for our premium subscribers! With our simple monthly subscription (no pay-per-token fees), you get unlimited access.&lt;/p&gt;

&lt;p&gt;DeepSeek has achieved exceptional performance with significantly lower costs and computational resources, challenging giants like OpenAI, Google and Meta.&lt;/p&gt;

&lt;p&gt;Some highlights of the DeepSeek-R1 release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance on par with OpenAI-o1&lt;/li&gt;
&lt;li&gt;MIT licensed (check out some of the distills in our model catalog!)&lt;/li&gt;
&lt;li&gt;Fully open-source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeepSeek’s rise to the forefront did not happen overnight; their success comes from a year of incremental, thoughtful specialization across two critical domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specialization in mixture-of-experts (MoE) architecture&lt;/li&gt;
&lt;li&gt;GPU compute optimization (forged under the constraints of hardware sanctions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read more about how &lt;a href="https://substack.tech-talk-cto.com/p/how-deepseek-disrupted-the-ai-giants" rel="noopener noreferrer"&gt;DeepSeek disrupted the billion-dollar budgets and GPU arsenals of Big Tech.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experience DeepSeek-R1 on Featherless:
&lt;/h2&gt;

&lt;p&gt;🦜Chat with it on &lt;a href="https://phoenix.featherless.ai/" rel="noopener noreferrer"&gt;Phoenix&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚡Integrate via the &lt;a href="https://featherless.ai/docs/getting-started" rel="noopener noreferrer"&gt;Featherless API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔍Explore distills in our &lt;a href="https://featherless.ai/models?query=deepseek" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>opensource</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>Zero to AI: Deploying Language Models Without the Infrastructure Headache</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Fri, 17 Jan 2025 14:02:16 +0000</pubDate>
      <link>https://dev.to/featherlessai/zero-to-ai-deploying-language-models-without-the-infrastructure-headache-1e8c</link>
      <guid>https://dev.to/featherlessai/zero-to-ai-deploying-language-models-without-the-infrastructure-headache-1e8c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You've finally found it! The perfect language model on Hugging Face, seemingly exactly what you need for your project. The model size is reasonable, the generation quality is amazing, the community feedback is great, and it handles your specific use case beautifully! You're excited to start building, but then reality hits: where do you actually deploy this model? You could spin up a GPU instance on AWS, manage your own infrastructure, and pray you don't burn through your budget before launching. Or you could try one of those specialized platforms that requires learning yet another set of tools and workflows. What started as excitement suddenly turns into an infrastructure nightmare.&lt;/p&gt;

&lt;p&gt;As more developers venture into AI, the gap between finding a great model and using it in production remains significant. While platforms like OpenAI and Anthropic have made inference-by-API seamless, the vast ecosystem of open-source models (often more niche and cost-effective for specific use cases) remains out of reach for many developers who just want to build applications.&lt;/p&gt;

&lt;p&gt;Whether you're a solo developer or part of a larger team, let's explore how Featherless can take you from discovering almost any Hugging Face model to running it in production without losing your sanity (or savings).&lt;/p&gt;

&lt;h2&gt;
  
  
  Hidden costs of model deployment
&lt;/h2&gt;

&lt;p&gt;Let's take a practical look at what deploying a model such as Llama 3.1 (8B) actually entails in different scenarios.&lt;br&gt;
At one end, platforms like RunPod offer raw GPU access starting at around $0.20 per hour for 16GB VRAM instances. This is just the beginning, however: you'll have to handle CUDA drivers, PyTorch dependencies, and quantization techniques yourself. At the other end, services like Hugging Face inference abstract away much of this complexity, though you're still fundamentally paying for dedicated GPU time. Then there's the challenge of scaling: how do you handle multiple concurrent requests? Load balancing? Suddenly, you need expertise in Docker, Kubernetes, and a spectrum of monitoring tools.&lt;/p&gt;

&lt;p&gt;Inference-as-a-service through providers like OpenRouter and AWS Bedrock offers attractive token prices with no configuration, but it comes with its own set of challenges: rigid pricing structures that can reach $0.50 per million tokens or more, limited model selections, and lock-in to the provider's ecosystem. As your usage scales, costs can become unpredictable and expensive, particularly if your application's usage doesn't map in a simple way to tokens.&lt;br&gt;
What started as a simple model deployment can quickly evolve into a full-time infrastructure management project.&lt;/p&gt;
&lt;h2&gt;
  
  
  Enter Featherless
&lt;/h2&gt;

&lt;p&gt;This is where we at Featherless step in. Instead of building and maintaining your own infrastructure or getting locked into expensive managed services, Featherless provides direct access to Hugging Face's vast ecosystem of models through a simple (OpenAI-compatible) API. As a serverless inference platform, we handle all the complex infrastructure orchestration behind the scenes while you maintain full control over your model selection and customization options.&lt;/p&gt;

&lt;p&gt;What makes this approach advantageous is that you can deploy almost any Hugging Face model in minutes, not days or weeks, without sacrificing performance or breaking your budget. We target an output inference of 10–40 tokens per second, depending on the model and prompt size while keeping your costs predictable. Whether you're experimenting with different models or just scaling your production workloads, Featherless enables quick iteration as you can switch models with just a simple configuration change.&lt;/p&gt;

&lt;p&gt;For developers who've worked with OpenAI's API, the transition is easy: we maintain API and SDK compatibility while opening up access to a huge catalog of open-source models, enabling you to leverage your existing codebase while gaining the freedom to choose and swap between any open-source model that fits your specific use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  From Zero to Hello: 5-minute model deployment
&lt;/h2&gt;

&lt;p&gt;Let's get into the practical implementation. The best way to understand the simplicity of Featherless is to see it in action, so in the following examples I'll walk you through how to set up basic API calls with Featherless. First, sign up for a &lt;a href="https://featherless.ai/register" rel="noopener noreferrer"&gt;Featherless account&lt;/a&gt; and choose a &lt;a href="https://featherless.ai/#pricing" rel="noopener noreferrer"&gt;subscription plan&lt;/a&gt; that fits your needs, after which you'll find your &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API key&lt;/a&gt; on your dashboard. Keep it close, as you'll need it in each of the examples below. If you're new to language model APIs altogether, don't worry: the examples are clean and straightforward, focusing only on the essential patterns to get you started.&lt;/p&gt;
&lt;h3&gt;
  
  
  Your first API call
&lt;/h3&gt;

&lt;p&gt;First, choose your model from our vast catalog of Hugging Face models. Then, depending on your application's use case, we offer two endpoints. The first and simplest is &lt;code&gt;/v1/chat/completions&lt;/code&gt;, which is designed for interactions where your application needs to maintain a clear user-assistant relationship and conversation flow (think ChatGPT). It accepts messages in a format that distinguishes between system instructions, user inputs, and assistant responses, making it ideal for chatbots, virtual assistants, or any application that requires contextual conversation management.&lt;/p&gt;

&lt;p&gt;Let's start with this simple chat completion example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example shows how to make a basic chat completion call
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Replace API key
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello! How are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m amazing, yourself?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Great! What are you up to?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We specify the model we want to use, then provide an array of messages.&lt;br&gt;
The &lt;code&gt;/v1/completions&lt;/code&gt; endpoint, on the other hand, offers a lower-level but more direct approach: it accepts a single text prompt and returns a completion, giving you complete control over the prompt format. This is useful for content and text generation, or any case where you want to implement a more custom conversation format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example shows how to make a text completion call
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Replace with your API key
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Once upon a time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the text completion endpoint takes a simpler input: a single prompt string instead of an array of messages, and allows the generation of &lt;em&gt;any&lt;/em&gt; text. This endpoint is key for using the LLM as a &lt;em&gt;reasoning&lt;/em&gt; engine; e.g. if you're using an LLM to extract structured data from a block of text, such as a list of email addresses, this is much simpler to do with text completions than with chat completions.&lt;br&gt;
We've also added &lt;code&gt;max_tokens&lt;/code&gt; as a parameter here to specify the maximum length, in tokens, of the response we want back from the model. A more elaborate overview of the different parameters you can provide to the endpoint is available in our documentation.&lt;/p&gt;
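&lt;p&gt;The other request-body fields follow the same OpenAI-style schema. As a sketch (the exact set of supported parameters is in our documentation; the values below are purely illustrative), a payload with common sampling controls might look like this:&lt;/p&gt;

```python
# Sketch: a completions payload with common OpenAI-style sampling parameters.
# Check the Featherless documentation for the exact set of supported fields;
# the values here are illustrative defaults, not recommendations.
def build_completion_payload(prompt, model="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 500,    # cap on the response length, in tokens
        "temperature": 0.7,   # higher values give more varied output
        "top_p": 0.9,         # nucleus sampling: keep the top 90% probability mass
        "stop": ["\n\n"],     # stop generating at the first blank line
    }

payload = build_completion_payload("Once upon a time")
```

&lt;p&gt;The payload is then sent exactly as in the example above, as the &lt;code&gt;json&lt;/code&gt; argument to &lt;code&gt;requests.post&lt;/code&gt;.&lt;/p&gt;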
&lt;h3&gt;
  
  
  OpenAI compatibility
&lt;/h3&gt;

&lt;p&gt;The widespread adoption of OpenAI's ecosystem has led to an implicit API standard for LLM integration. Featherless implements this standard, so any code or application designed to work with OpenAI's API can be easily reconfigured to work with Featherless instead. This compatibility extends across the ecosystem of applications and tools built for OpenAI, making the transition to Featherless straightforward for teams working with these tools. You can find a list of a few of those applications in our other blog post.&lt;/p&gt;

&lt;p&gt;Now let's have a look at how we can make use of the full range of open-source models by just adjusting the standard OpenAI SDK code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using OpenAI SDK 
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only changes needed here are the client's &lt;code&gt;base_url&lt;/code&gt;, the &lt;code&gt;api_key&lt;/code&gt; and the &lt;code&gt;model&lt;/code&gt; parameter. The rest of the code is unchanged. This compatibility means you can switch between models without having to rewrite any of your existing application logic.&lt;/p&gt;
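&lt;p&gt;Since only the base URL, the key, and the model name differ between providers, it can help to keep those settings in one place. A minimal sketch (the &lt;code&gt;PROVIDERS&lt;/code&gt; dict and helper below are illustrative conventions of our own, not part of the OpenAI SDK):&lt;/p&gt;

```python
import os

# Sketch: centralising provider settings so switching providers is a
# one-line change. PROVIDERS and client_kwargs are illustrative helpers,
# not part of the OpenAI SDK.
PROVIDERS = {
    "featherless": {
        "base_url": "https://api.featherless.ai/v1",
        "api_key_env": "FEATHERLESS_API_KEY",
    },
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY",
    },
}

def client_kwargs(provider):
    """Keyword arguments for OpenAI(**client_kwargs(provider))."""
    cfg = PROVIDERS[provider]
    return {
        "base_url": cfg["base_url"],
        "api_key": os.environ.get(cfg["api_key_env"], ""),
    }
```

&lt;p&gt;With this in place, &lt;code&gt;OpenAI(**client_kwargs("featherless"))&lt;/code&gt; and &lt;code&gt;OpenAI(**client_kwargs("openai"))&lt;/code&gt; become interchangeable.&lt;/p&gt;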

&lt;h3&gt;
  
  
  Comparing models
&lt;/h3&gt;

&lt;p&gt;Since switching between models is as easy as changing one line, we can compare different models' responses to the same prompt and quickly iterate toward the model that best fits your use case. The following example does exactly that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="c1"&gt;# Compare responses from different models with the same prompt
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="c1"&gt;# Add models from catalog
&lt;/span&gt;&lt;span class="n"&gt;models_to_compare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.3-70B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# The prompt you want to compare
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain AGI in simple terms.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models_to_compare&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
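&lt;p&gt;One caveat: the loop above assumes every request succeeds. When comparing several models, a single unavailable model shouldn't abort the whole run, so it's worth pulling the reply out of the response body defensively. A sketch (the helper name and error format are our own, not part of the API):&lt;/p&gt;

```python
# Sketch: defensively extracting the assistant's reply from a chat-completions
# response body, so one failing model doesn't abort a comparison run.
def extract_content(body):
    """Return choices[0].message.content, or an error marker if absent."""
    try:
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        err = body.get("error", "unexpected response shape") if isinstance(body, dict) else repr(body)
        return f"[no completion: {err}]"

extract_content({"choices": [{"message": {"content": "Hi!"}}]})  # "Hi!"
extract_content({"error": "model not found"})  # "[no completion: model not found]"
```

&lt;p&gt;Inside &lt;code&gt;compare_models&lt;/code&gt; you would then store &lt;code&gt;extract_content(response.json())&lt;/code&gt; instead of indexing into the body directly.&lt;/p&gt;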



&lt;h3&gt;
  
  
  What now?
&lt;/h3&gt;

&lt;p&gt;You've seen how straightforward it is to get started: just a few lines of code and you're up and running, chatting with your first models. Before we dive deeper into the next implementations, we invite you to join our growing community of developers and enthusiasts on Discord. Share your experiences and struggles, and connect with others who are building with Featherless.&lt;/p&gt;

&lt;p&gt;In the following sections we'll introduce some basic building blocks, such as using Featherless in LangChain, along with patterns that help you navigate the huge catalog of models and make the most of what this variety offers.&lt;/p&gt;

&lt;p&gt;Join us on &lt;a href="https://discord.com/invite/bbvhdWmPHa" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to continue the conversation. Now let's dive into how LangChain can extend everything we've already discussed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Basics: Integrating with LangChain
&lt;/h2&gt;

&lt;p&gt;Moving beyond basic (and individual) inference calls, let's explore how to use Featherless with more sophisticated libraries. LangChain, the most widely adopted of these libraries, provides developers with powerful tools and patterns for managing complex prompts and conversational state. Here's how you can power any LangChain application with Featherless.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Your Featherless API key
&lt;/span&gt;    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that translates English to French. Translate the user sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I love programming.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ai_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ai_msg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With LangChain's building blocks you can create more advanced applications, such as pipelines that summarize and analyze large documents by breaking them into chunks, or conversation patterns ranging from simple message history to more complex summary-based approaches that help you manage your context size.&lt;/p&gt;
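&lt;p&gt;LangChain ships dedicated text splitters (e.g. &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;) for that chunking step. Stripped down to plain Python, the core idea is a sliding window with overlap; the following is a sketch of the concept, not LangChain's actual implementation:&lt;/p&gt;

```python
# Sketch: the chunk-with-overlap idea behind LangChain's document splitters,
# in plain Python. This shows only the sliding-window core; the real
# splitters also try to break on separators like paragraphs and sentences.
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunk_size-character pieces that overlap by `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 2500, chunk_size=1000, overlap=100)
len(chunks)  # 3 chunks, each sharing 100 characters with its neighbour
```

&lt;p&gt;Each chunk can then be summarized individually with a model of your choice before combining the partial summaries.&lt;/p&gt;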

&lt;p&gt;The beauty of LangChain with Featherless is that you can experiment with different models for different parts of your application. Need a lighter model for classification but a more powerful one for generation? You can mix and match with the wide variety of models in our catalog while still maintaining a consistent and clean architecture. &lt;/p&gt;

&lt;p&gt;The following example briefly demonstrates the power and flexibility of the combination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;

&lt;span class="c1"&gt;# Define models for different tasks
&lt;/span&gt;&lt;span class="n"&gt;classification_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Meta-Llama-3.1-8B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;translation_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FEATHERLESS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.featherless.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Nemo-Instruct-2407&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define prompt templates
&lt;/span&gt;&lt;span class="n"&gt;translation_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translate the following sentence from English to French:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{input}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;classification_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify the sentiment of the following text as positive, negative, or neutral:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{input}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Translation task
&lt;/span&gt;&lt;span class="n"&gt;translation_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;translation_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;translation_prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])).&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;  &lt;span class="c1"&gt;# Extract content
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Classification task
&lt;/span&gt;&lt;span class="n"&gt;classification_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Passing the translated text
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classification_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;classification_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])).&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Chain the tasks together
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;translation_task&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;classification_task&lt;/span&gt;

&lt;span class="c1"&gt;# Run the workflow
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I love using Featherless.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By creating separate model instances for the classification and translation tasks, we can optimize our application's performance: for each task, we choose the specific model best suited to its particular nuances.&lt;/p&gt;

&lt;p&gt;What's particularly powerful about this approach is its extensibility. Need to add, say, a profanity filter before your translation? Simply find an appropriate model and create a new task to inject into the workflow. The architecture scales with your needs while keeping your infrastructure complexity manageable.&lt;/p&gt;
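As an illustration of that injection pattern, stripped of the LangChain machinery: if each task is a plain function from dict to dict, inserting a step is just re-composing the pipeline. The profanity filter and translator below are hypothetical stand-ins, not real model calls:

```python
from functools import reduce

def compose(*tasks):
    """Chain dict-to-dict tasks left to right, like the `|` operator above."""
    return lambda data: reduce(lambda acc, task: task(acc), tasks, data)

def profanity_filter(data):
    # Hypothetical stand-in for a moderation model call
    banned = {"darn"}
    words = [w for w in data["text"].split() if w.lower() not in banned]
    return {**data, "text": " ".join(words)}

def translate(data):
    # Stand-in for translation_llm.invoke(...); uppercasing fakes a translation
    return {**data, "translated_text": data["text"].upper()}

# Injecting the filter is just adding one more argument to compose()
workflow = compose(profanity_filter, translate)
result = workflow({"text": "I love using Featherless"})
```

The real workflow keeps the same shape: swap each stand-in for a `RunnableLambda` that calls the model of your choice.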

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Throughout this guide, we've hopefully equipped you to run inference on any Hugging Face model, from prototype to production, without worrying about the complexity of infrastructure management or the cost of running GPUs directly. This, however, is just the beginning. As you've seen with the LangChain implementation, the ability to seamlessly access any Hugging Face model opens up countless possibilities for your applications - whether you're building a specialized chatbot, implementing domain-specific analysis, or creating the next Duolingo. We'll be coming back with more advanced examples in future blog posts, so make sure to keep an eye out.&lt;/p&gt;

&lt;p&gt;Ready to start building? Head over to &lt;a href="https://featherless.ai/" rel="noopener noreferrer"&gt;https://featherless.ai/&lt;/a&gt; to create an account. Our growing community of developers, enthusiasts, and AI practitioners is here to help you get the most out of Featherless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join our &lt;a href="https://discord.com/invite/bbvhdWmPHa" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; community to connect with other users&lt;/li&gt;
&lt;li&gt;Follow us on X (&lt;a href="https://x.com/FeatherlessAI" rel="noopener noreferrer"&gt;@FeatherlessAI&lt;/a&gt;) for the latest updates&lt;/li&gt;
&lt;li&gt;Star our &lt;a href="https://github.com/featherlessai/featherless-cookbook" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository to stay updated on new examples and tutorials&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll be looking forward to seeing what you all create and share with the community.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
    <item>
      <title>Running Open Source LLMs in Popular AI Clients with Featherless: A Complete Guide</title>
      <dc:creator>Darin Verheijke</dc:creator>
      <pubDate>Fri, 10 Jan 2025 15:54:24 +0000</pubDate>
      <link>https://dev.to/featherlessai/running-open-source-llms-in-popular-ai-clients-with-featherless-a-complete-guide-2deh</link>
      <guid>https://dev.to/featherlessai/running-open-source-llms-in-popular-ai-clients-with-featherless-a-complete-guide-2deh</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Getting started with Featherless&lt;/li&gt;
&lt;li&gt;
Role-playing client integration

&lt;ol&gt;
&lt;li&gt;SillyTavern&lt;/li&gt;
&lt;li&gt;WyvernChat&lt;/li&gt;
&lt;li&gt;VenusAI&lt;/li&gt;
&lt;li&gt;JanitorAI&lt;/li&gt;
&lt;li&gt;anime.gf&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;

Frontend chat client integration

&lt;ol&gt;
&lt;li&gt;Featherless Phoenix&lt;/li&gt;
&lt;li&gt;Typing Mind&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Security best practices&lt;/li&gt;

&lt;li&gt;Community &amp;amp; Support&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a id="introduction"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Remember the excitement of discovering powerful open-source language models like the latest Qwen or Llama 3 models, only to face the daunting challenge of actually running them? &lt;/p&gt;

&lt;p&gt;You’re not alone. While these models offer amazing capabilities, deploying them efficiently has been a significant hurdle for developers and enthusiasts alike. Today, we’re changing that with Featherless. In this guide, you’ll learn how to integrate the latest cutting-edge AI models into your favorite chat clients - whether you’re building an AI agent or a coding assistant, roleplaying, or setting up a general chat interface. &lt;/p&gt;

&lt;p&gt;I’ll walk you through practical, step-by-step instructions for popular platforms like SillyTavern, WyvernChat, and Typing Mind, showing you how to leverage our serverless infrastructure. No GPU or server management required, no complex deployments - just powerful language models ready to use in the tools you already know and love.&lt;/p&gt;

&lt;p&gt;&lt;a id="featherless"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started with Featherless
&lt;/h2&gt;

&lt;p&gt;Featherless is your gateway to using powerful open-source language models in your favorite applications. As a serverless inference platform, we make it simple to access the latest models without managing any infrastructure. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our API is fully OpenAI-compatible. This means any application that works with OpenAI can be easily reconfigured to use Featherless. &lt;/p&gt;
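Since any OpenAI-style client can be pointed at the Featherless endpoint, reconfiguration amounts to swapping the base URL and key. A minimal sketch using only Python's standard library (the model name is an example, and nothing is sent until you call `urlopen` with a real key):

```python
import json
import urllib.request

BASE_URL = "https://api.featherless.ai/v1"
API_KEY = "YOUR_FEATHERLESS_API_KEY"  # replace with a key from your dashboard

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Assemble an OpenAI-compatible chat completion request."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Hello from Featherless!"}],
)
# urllib.request.urlopen(req) would send it once a real key is in place
```

The official `openai` Python package works the same way: pass `base_url="https://api.featherless.ai/v1"` and your Featherless key when constructing the client.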

&lt;p&gt;&lt;strong&gt;Quick Setup API&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for a &lt;a href="https://featherless.ai/register" rel="noopener noreferrer"&gt;Featherless account&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Subscribe to a plan that fits your needs&lt;/li&gt;
&lt;li&gt;Navigate to the &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API Keys&lt;/a&gt; section in your dashboard&lt;/li&gt;
&lt;li&gt;Create a new API Key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it! You’re ready to use this API Key in your preferred chat client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose your plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Select the &lt;a href="https://featherless.ai/#pricing" rel="noopener noreferrer"&gt;plan&lt;/a&gt; that fits your needs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🌱 Featherless Basic ($10/month)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All models up to 15B parameters&lt;/li&gt;
&lt;li&gt;2 concurrent requests&lt;/li&gt;
&lt;li&gt;Unlimited monthly usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;⭐ Featherless Premium ($25/month)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All models up to 72B parameters&lt;/li&gt;
&lt;li&gt;Everything in Basic&lt;/li&gt;
&lt;li&gt;Perfect for power users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🚀 Featherless Scale ($75 per unit/month)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything in Premium&lt;/li&gt;
&lt;li&gt;2x Premium or 6x Basic model concurrency per unit&lt;/li&gt;
&lt;li&gt;Host private models from Hugging Face&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🔒 Privacy First: We never log chats, prompts, or completions. Your conversations stay private.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Ready to begin?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Head to the specific integration guide for your preferred client, or join our &lt;a href="https://discord.gg/bbvhdWmPHa" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; for support.&lt;/p&gt;

&lt;p&gt;&lt;a id="roleplay"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Role-playing client integration
&lt;/h2&gt;

&lt;p&gt;Having spent countless hours lost in AI roleplay conversations, let me tell you - there’s nothing quite like that moment when your character truly comes alive, whether it’s the witty jokes that feel spontaneous or those surprisingly deep exchanges that make you forget you’re talking to an AI.&lt;/p&gt;

&lt;p&gt;From creating complex characters with deep lore to vibrant anime personalities, what I’ve learned is that finding the right model isn’t just a technical choice - it’s about finding the perfect way to show off the identity and personality behind your character. That’s why I’m so excited about our growing catalog of open-source models: each one brings something special and different to your character interactions and creative writing, and I’ve seen incredible roleplay scenes emerge from unexpected model choices. &lt;/p&gt;

&lt;p&gt;Let me now walk you through integrating Featherless with some of your favorite roleplay clients.&lt;/p&gt;

&lt;p&gt;&lt;a id="sillytavern"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SillyTavern
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://sillytavern.app/" rel="noopener noreferrer"&gt;SillyTavern&lt;/a&gt; it’s pretty easy to create a connection to Featherless. Simply click on the plug icon at the top and make the following selections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API: Chat Completion&lt;/li&gt;
&lt;li&gt;Chat Completion source: Custom (OpenAI-compatible)&lt;/li&gt;
&lt;li&gt;Custom Endpoint: &lt;a href="https://api.featherless.ai/v1" rel="noopener noreferrer"&gt;https://api.featherless.ai/v1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Custom API Key: Your Featherless &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API key&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Enter a Model ID: A model chosen from our &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; 
(e.g. meta-llama/Meta-Llama-3.1-8B-Instruct)&lt;/li&gt;
&lt;li&gt;Connect!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g17sjawoh3su5ulmlx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g17sjawoh3su5ulmlx4.png" alt="Instructions SillyTavern" width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once connected, you’ll see a green status indicator, after which you can send a message to your characters to make sure everything is working properly. You should receive a response within seconds.&lt;/p&gt;

&lt;p&gt;&lt;a id="wyvern"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  WyvernChat
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://app.wyvern.chat/" rel="noopener noreferrer"&gt;WyvernChat&lt;/a&gt; offers native Featherless integration right out of the box. The built-in support means you’ll spend less time configuring and more time chatting. WyvernChat provides a streamlined setup process. Whether you’re new to AI chat platforms or migrating from another client, you’ll appreciate how seamlessly Featherless meshes with WyvernChat’s clean interface. All you need is your Featherless API Key and  following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Head on over to &lt;a href="https://app.wyvern.chat/" rel="noopener noreferrer"&gt;https://app.wyvern.chat/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;At the bottom right of the page click on the plug icon&lt;/li&gt;
&lt;li&gt;Click on ‘+ Add connection’ &lt;/li&gt;
&lt;li&gt;Select ‘Featherless’ from the Type dropdown&lt;/li&gt;
&lt;li&gt;Password (API Key)*: Your Featherless &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API key&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Select a model from the list at the bottom&lt;/li&gt;
&lt;li&gt;Scroll down and press ‘Create’&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xxkyid75rvhhkopdsit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xxkyid75rvhhkopdsit.png" alt="WyvernChat" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once your connection is created, head over to any of your character chats; then, at the bottom right under ‘Settings’, deselect ‘Free Queue’ (this ensures you’re using your Featherless connection). Next, simply choose your model from the Connection dropdown. Switching between models is as simple as creating an extra connection. You’ll know everything is working when your character responds using your chosen model - typically within a few seconds.&lt;/p&gt;

&lt;p&gt;&lt;a id="venus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Venus AI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://venuschat.ai/#" rel="noopener noreferrer"&gt;VenusChat&lt;/a&gt; supports OpenAI-compatible APIs out of the box, making our connection process quick and straightforward. &lt;/p&gt;

&lt;p&gt;To connect your VenusChat with Featherless, head over to any character chat:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on the Gear icon on the top right&lt;/li&gt;
&lt;li&gt;Go to “AI Model Settings”&lt;/li&gt;
&lt;li&gt;Choose ‘Open AI’ under ‘Select an AI Model’&lt;/li&gt;
&lt;li&gt;Select “Reverse Proxy”&lt;/li&gt;
&lt;li&gt;Pick a model from our &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; and copy the complete URL&lt;/li&gt;
&lt;li&gt;Paste the URL under ‘Open API Reverse Proxy’ 
(e.g. &lt;a href="https://featherless.ai/models/mistralai/Mistral-Nemo-Instruct-2407" rel="noopener noreferrer"&gt;https://featherless.ai/models/mistralai/Mistral-Nemo-Instruct-2407&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Enter your Featherless &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API Key&lt;/a&gt; under ‘Reverse Proxy Key’&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnsops0muj66tdtq3mtd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnsops0muj66tdtq3mtd.png" alt="venus" width="800" height="1053"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a id="janitor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  JanitorAI
&lt;/h3&gt;

&lt;p&gt;Head over to chat on any character on &lt;a href="https://janitorai.com/" rel="noopener noreferrer"&gt;JanitorAI&lt;/a&gt; and let’s get you set up with just a few steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the dropdown in the top-right corner and select ‘API Settings’&lt;/li&gt;
&lt;li&gt;Go to ‘Proxy’&lt;/li&gt;
&lt;li&gt;Pick a model from our &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Under ‘Model’ choose ‘Custom’ and enter the model’s name 
(e.g. meta-llama/Meta-Llama-3.1-8B-Instruct)&lt;/li&gt;
&lt;li&gt;Other API/proxy URL: &lt;a href="https://api.featherless.ai/v1/chat/completions" rel="noopener noreferrer"&gt;https://api.featherless.ai/v1/chat/completions&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;API Key: Your Featherless &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API Key&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Check API Key/Model to confirm everything is working&lt;/li&gt;
&lt;li&gt;Scroll down and Save Settings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd0x87iptgvhowe2jdsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd0x87iptgvhowe2jdsm.png" alt="JanitorAI API instructions" width="379" height="859"&gt;&lt;/a&gt;&lt;br&gt;
Once you’ve saved your settings, congratulations - you can now chat with your character using any of our compatible models, and switching between them is as simple as repeating steps 3-4. Feel free to experiment with different models to find the perfect fit for each character.&lt;/p&gt;

&lt;p&gt;&lt;a id="anime"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  anime.gf
&lt;/h3&gt;

&lt;p&gt;Bringing your favorite &lt;a href="https://www.anime.gf/" rel="noopener noreferrer"&gt;anime.gf&lt;/a&gt; characters to life with Featherless is straightforward. Enhance your interactions by making use of our diverse model catalog. Let’s get you set up in just a few simple steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on the cog in the top right of your screen&lt;/li&gt;
&lt;li&gt;Head on over to ‘A.I. Settings’ and click on ‘provider’&lt;/li&gt;
&lt;li&gt;Under API Provider select ‘Proxy’&lt;/li&gt;
&lt;li&gt;API Key: Your Featherless &lt;a href="https://featherless.ai/account/api-keys" rel="noopener noreferrer"&gt;API Key&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Base URL: &lt;a href="https://api.featherless.ai/v1/" rel="noopener noreferrer"&gt;https://api.featherless.ai/v1/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Model: Choose a model from our &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; (e.g. anthracite-org/magnum-v4-72b)&lt;/li&gt;
&lt;li&gt;Save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhun67kf2ij6zr78k3fb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhun67kf2ij6zr78k3fb4.png" alt="animepart1 API instructions" width="800" height="457"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonu8hkqm1csvmmojsgpx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonu8hkqm1csvmmojsgpx.png" alt="animepart2 API instructions" width="800" height="1339"&gt;&lt;/a&gt;&lt;br&gt;
Great! Your &lt;a href="http://anime.gf" rel="noopener noreferrer"&gt;anime.gf&lt;/a&gt; character is now powered by Featherless. Try sending a message to see the integration in action. Feel free to experiment with different models from our catalog to find the perfect match for each character - switching models is as easy as changing the model in your settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With your favorite roleplaying client connected to Featherless you’re ready to experiment with a variety of models from our catalog, join our &lt;a href="https://discord.gg/bbvhdWmPHa" rel="noopener noreferrer"&gt;discord&lt;/a&gt; to share your experiences and get model recommendations!&lt;/p&gt;

&lt;p&gt;&lt;a id="chat"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend chat client integration
&lt;/h2&gt;

&lt;p&gt;I’ve always found something satisfying about a clean chat interface - it’s like having a dedicated thinking space where you can have a focused conversation. Whether I’m exploring new concepts, deep in a coding session, or just brainstorming ideas, platforms like Typing Mind and our own Phoenix have become essential companions. Let me show you how to set up these tools, which give you access to our entire model catalog - I think you’ll find them as invaluable as I do.&lt;/p&gt;

&lt;p&gt;&lt;a id="phoenix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Featherless Phoenix
&lt;/h3&gt;

&lt;p&gt;If you’re looking for the most straightforward way to start chatting with our models, look no further than &lt;a href="https://phoenix.featherless.ai/" rel="noopener noreferrer"&gt;Featherless Phoenix&lt;/a&gt;. As our native chat interface, it requires zero additional setup - simply log in with your Featherless account, choose your preferred model from the menu in the top-left corner, and start chatting. It’s the perfect starting point for exploring our model catalog and finding the right language model for your needs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftizud67ghj1r50895dql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftizud67ghj1r50895dql.png" alt="Phoenix Featherless" width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a id="typing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Typing Mind
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.typingmind.com/" rel="noopener noreferrer"&gt;Typing mind&lt;/a&gt;, as a Chat UI frontend allows you to use AI models from our whole catalog. Featherless integration is as easy as going to ‘Models’, then clicking on ‘+ Add Custom Model’ on the top right of your screen followed by these quick steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose ‘OpenAI Compatible API’ as the API Type&lt;/li&gt;
&lt;li&gt;Endpoint: &lt;a href="https://api.featherless.ai/v1/chat/completions" rel="noopener noreferrer"&gt;https://api.featherless.ai/v1/chat/completions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Model ID: Pick a model from our &lt;a href="https://featherless.ai/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; (e.g. mistralai/Mistral-Nemo-Instruct-2407)&lt;/li&gt;
&lt;li&gt;Choose a context length (which you can find on the model’s page)&lt;/li&gt;
&lt;li&gt;Add a custom header with the key ‘&lt;strong&gt;authorization&lt;/strong&gt;’ and the value &lt;strong&gt;&lt;code&gt;Bearer &amp;lt;YOUR_FEATHERLESS_API_KEY&amp;gt;&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add a custom body param with the number param ‘&lt;strong&gt;max_tokens&lt;/strong&gt;’ set to any amount up to your context length. This caps the length of each response.&lt;/li&gt;
&lt;li&gt;Lastly, press &lt;strong&gt;Test&lt;/strong&gt; and, if everything went well, you can now '&lt;strong&gt;Add Model&lt;/strong&gt;'
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26d3ewp80lybmlnpo5md.png" alt="Typing Mind API instructions" width="566" height="889"&gt;
That’s it - your Typing Mind interface is now connected to Featherless! Try chatting to verify that everything is working properly. Response times will vary by model, but you should typically see results within a few seconds. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a id="security"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Security best practices
&lt;/h2&gt;

&lt;p&gt;Your Featherless API Key is your secure gateway to our services and all the models you love - protecting it is crucial for your account’s security and data privacy. Never share your keys publicly; we also recommend creating separate API keys for different applications. If you suspect your key might have been exposed, rotate it immediately through your Featherless dashboard.&lt;/p&gt;
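One simple way to keep a key out of your source files (and out of version control) is to read it from an environment variable. The variable name `FEATHERLESS_API_KEY` here is just a convention, not something the platform requires:

```python
import os

def load_api_key(var: str = "FEATHERLESS_API_KEY") -> str:
    """Fetch the API key from the environment rather than hardcoding it."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before running this application")
    return key
```

Pair this with a per-application variable name (one key for your chat client, another for scripts) so a leaked key can be rotated without touching everything at once.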

&lt;p&gt;&lt;a id="community"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Community &amp;amp; Support
&lt;/h2&gt;

&lt;p&gt;Our growing community of developers, enthusiasts, and AI practitioners is here to help you get the most out of Featherless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join our &lt;a href="https://discord.gg/bbvhdWmPHa" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; community to connect with other users&lt;/li&gt;
&lt;li&gt;Share your experiences with us!&lt;/li&gt;
&lt;li&gt;Get model recommendations for your specific use case&lt;/li&gt;
&lt;li&gt;Stay updated on the latest models that get added&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The world of AI is evolving rapidly, and Featherless is committed to evolving with it. As new models emerge and capabilities expand, we’re working to ensure you have seamless access to all the latest advancements. Our mission to make all AI models available for serverless inference remains unchanged. &lt;/p&gt;

&lt;p&gt;We’re excited to see what you all create with Featherless, whether you’re building engaging characters for roleplay or exploring new applications we haven’t even imagined yet. Ready to get started? Head over to &lt;a href="https://featherless.ai/register" rel="noopener noreferrer"&gt;https://featherless.ai/&lt;/a&gt; to create an account, or join our &lt;a href="https://discord.gg/bbvhdWmPHa" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; community to connect with other enthusiasts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
