We are building a document continuation engine that reads a partial markdown file and generates the next section in the same voice and structure. It helps technical writers and developers who draft long-form content and need to maintain momentum without switching contexts. Because Oxlo.ai charges a flat rate per request rather than per token, feeding multi-thousand-character drafts into the model is economical for iterative writing workflows. See https://oxlo.ai/pricing for plan details.
What you'll need
- Python 3.10 or newer.
- An Oxlo.ai API key from https://portal.oxlo.ai.
- The OpenAI SDK: pip install openai.
Step 1: Configure the Oxlo.ai client
I always start by confirming the API handshake so I do not debug auth issues later in the pipeline. The Oxlo.ai endpoint is OpenAI-compatible, so the standard OpenAI client works once base_url points at it, and the setup is one line.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say 'Connection OK' and nothing else."},
    ],
)
print(response.choices[0].message.content)
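Hardcoding the key is fine for a first handshake, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead follows; the variable name OXLO_API_KEY and the helper load_api_key are assumptions made for illustration, not names defined by Oxlo.ai.

```python
import os


def load_api_key(env=None, var_name="OXLO_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    OXLO_API_KEY is an assumed variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key
```

The result can then be passed as api_key when constructing the client, in place of the "YOUR_OXLO_API_KEY" placeholder.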
Step 2: Design the continuation system prompt
The system prompt is the only place where we teach the model our style constraints. I keep it strict and explicit so the output stays on rails. This is the editable core of the engine.
SYSTEM_PROMPT = """You are a document continuation engine. Your job is to analyze the partial document provided by the user and generate the next section.
Rules:
- Match the existing tone, tense, and formatting.
- Do not repeat the input text.
- Output only the continuation, with no preamble or meta commentary.
- If the input ends mid-sentence, complete that sentence naturally before starting a new paragraph.
- Use Markdown syntax consistent with the input."""
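The "Do not repeat the input text" rule is only a request, and a model may still echo the last sentence of the draft before continuing. One possible safeguard, sketched here with a hypothetical helper named strip_overlap, is to drop any leading part of the output that duplicates the tail of the input. The 20-character minimum is an arbitrary assumption meant to avoid trimming short coincidental matches.

```python
def strip_overlap(context: str, continuation: str, min_overlap: int = 20) -> str:
    """Drop a leading chunk of `continuation` that merely repeats the tail of `context`."""
    max_len = min(len(context), len(continuation))
    # Try the longest candidate first so the largest echoed span is removed.
    for size in range(max_len, min_overlap - 1, -1):
        if context.endswith(continuation[:size]):
            return continuation[size:].lstrip()
    # No echo of at least min_overlap characters was found.
    return continuation
```

This only catches exact echoes at the boundary; a paraphrased repeat would pass through untouched.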
Step 3: Build the document preprocessor
Long drafts can exceed the context window, so I trim them. I keep the first 1,000 characters as an anchor, fill the rest of the budget with the most recent content, and put a marker between the two so the model knows text was dropped. Because Oxlo.ai uses flat request-based pricing, sending a few thousand characters of context costs the same as a minimal ping, which makes long-context iteration practical.
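def prepare_context(file_path: str, max_chars: int = 6000) -> str:
    """Return the draft text, trimmed to roughly max_chars if it is too long."""
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    if len(text) <= max_chars:
        return text
    # Keep the opening 1,000 characters as an anchor and spend the rest
    # of the budget on the most recent text. Assumes max_chars > 1000.
    oldest = text[:1000]
    newest = text[-(max_chars - 1000):]
    return oldest + "\n\n[... earlier content omitted ...]\n\n" + newest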
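The character slices above can cut a sentence or a Markdown table in half. A possible refinement, sketched here under the assumption that paragraphs are separated by blank lines, snaps both cuts to the nearest paragraph break. The function name trim_on_paragraphs and the head_chars parameter are hypothetical, and the sketch assumes max_chars is larger than head_chars.

```python
def trim_on_paragraphs(text: str, max_chars: int = 6000, head_chars: int = 1000) -> str:
    """Keep an opening anchor and the most recent text, cutting at blank lines where possible."""
    if len(text) <= max_chars:
        return text
    marker = "\n\n[... earlier content omitted ...]\n\n"

    # End the opening anchor on the last paragraph break inside the head window.
    head = text[:head_chars]
    cut = head.rfind("\n\n")
    if cut > 0:
        head = head[:cut]

    # Start the recent window on the first paragraph break, unless that
    # would throw away more than half of the window.
    tail = text[len(text) - (max_chars - head_chars):]
    cut = tail.find("\n\n")
    if 0 <= cut < len(tail) // 2:
        tail = tail[cut + 2:]

    return head + marker + tail
```

If the draft has no blank lines at all, both searches fail and the function falls back to the same raw character slices as prepare_context.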
Step 4: Implement the generation loop
I wrap the API call in a small function so I can tune temperature and max tokens in one place. I set temperature low because continuation tasks need consistency more than creativity.
def continue_document(context: str, extra_instruction: str = "") -> str:
    """Send the draft (plus optional guidance) to the model and return the continuation."""
    user_message = context
    if extra_instruction:
        user_message += f"\n\nInstruction: {extra_instruction}"
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,  # low, to favor consistency over creativity
        max_tokens=1024,
    )
    # content can be None on an empty reply, so fall back to an empty string.
    return (response.choices[0].message.content or "").strip()
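Because the call caps output at max_tokens=1024, a long section can stop mid-sentence. In the OpenAI chat completions response, each choice carries a finish_reason field that is "length" when the cap was hit. Whether Oxlo.ai populates that field the same way is an assumption here, and hit_token_limit is a hypothetical helper, so treat this as a sketch.

```python
from types import SimpleNamespace


def hit_token_limit(response) -> bool:
    """True when the first choice stopped because it ran out of max_tokens."""
    return response.choices[0].finish_reason == "length"


# Stand-in object shaped like a chat completion, so the check runs offline.
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(hit_token_limit(fake))
```

When the check returns True, one option is to warn the user or request a follow-up continuation rather than appending a truncated paragraph to the draft.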
Step 5: Wire up the CLI runner
I use argparse so the script feels like a real shipped tool. The runner passes the file through the preprocessor, requests a continuation, prints it, and appends it back to the file if the user passes a flag.
import argparse


def main():
    parser = argparse.ArgumentParser(description="Continue a markdown document.")
    parser.add_argument("file", help="Path to the partial markdown file.")
    parser.add_argument("--instruction", "-i", default="", help="Optional guidance for the next section.")
    parser.add_argument("--append", "-a", action="store_true", help="Append the output to the file.")
    args = parser.parse_args()
    context = prepare_context(args.file)
    continuation = continue_document(context, args.instruction)
    print("\n--- Generated Continuation ---\n")
    print(continuation)
    if args.append:
        with open(args.file, "a", encoding="utf-8") as f:
            f.write("\n\n" + continuation + "\n")
        print("\nAppended to", args.file)


if __name__ == "__main__":
    main()
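If the argument handling grows, it can help to build the parser in its own function so the flags can be exercised without touching the API. The sketch below uses the same three arguments as the runner; the build_parser name is an assumption made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create the CLI parser with the same arguments the runner uses."""
    parser = argparse.ArgumentParser(description="Continue a markdown document.")
    parser.add_argument("file", help="Path to the partial markdown file.")
    parser.add_argument("--instruction", "-i", default="", help="Optional guidance for the next section.")
    parser.add_argument("--append", "-a", action="store_true", help="Append the output to the file.")
    return parser


# Parse an explicit argument list instead of sys.argv to check the wiring.
args = build_parser().parse_args(["draft.md", "-i", "Add a paragraph about retries.", "--append"])
print(args.file, args.instruction, args.append)
```

Passing a list to parse_args keeps the check offline and repeatable, which is useful before spending requests on real drafts.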
Run it
Save the code from the steps above in a single file named continuer.py, then create a file named draft.md with some starter text:
## Architecture Overview
The ingestion pipeline is built around three core primitives: collectors, buffers, and sinks. Collectors pull data from upstream REST endpoints and normalize each payload into a canonical JSON schema. Buffers stage the normalized records in memory, batching until either a size threshold or a timeout fires. Once a buffer flushes, it hands the batch to a sink, which writes to the downstream warehouse.
Call the engine:
python continuer.py draft.md --instruction "Add a paragraph about failure handling and retries." --append
Example output:
Sinks are responsible for graceful degradation when the warehouse is unavailable. Each sink maintains an exponential backoff policy with jitter, starting at 200 ms and capping at 30 seconds. If a batch fails after five retries, it is written to a dead-letter queue on S3 for later inspection. This ensures that transient network blips do not stall the entire pipeline while preserving data durability.
Wrap-up
A concrete next step is to add a sliding-window summarizer using a model like deepseek-v3.2 to compress older context when drafts grow past tens of thousands of characters. Another is to expose the engine as a Git pre-commit hook so writers can generate stubs for empty section headers before pushing.
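For the pre-commit idea, the first piece would be detecting which section headers have no body yet. A minimal sketch follows, assuming ATX-style headers (lines starting with one to six # characters) and using a hypothetical function name, find_empty_sections. Note that it also reports a parent header that is followed directly by a subheader, which may or may not be what a hook should flag.

```python
import re

# ATX header: one to six '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def find_empty_sections(markdown: str) -> list[str]:
    """Return titles of headers with no body text before the next header or end of file."""
    empty = []
    current = None      # title of the section being scanned, if any
    has_body = False
    in_fence = False    # True while inside a fenced code block
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            has_body = True
            continue
        match = None if in_fence else HEADER.match(line)
        if match:
            if current is not None and not has_body:
                empty.append(current)
            current, has_body = match.group(2), False
        elif line.strip():
            has_body = True
    if current is not None and not has_body:
        empty.append(current)
    return empty
```

Each returned title could then be passed to the engine as an instruction, for example "Write the section titled Failure Handling."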