slick phantom

Posted on Feb 5 • Edited on Feb 10

Telegem's first public plugin

#news #opensource #ruby

Extract PDFs, JSON & HTML in Your Telegram Bot with Telegem's New Plugin

Building a Telegram bot that processes documents? Stop wrestling with file parsing. Telegem's new FileExtractor plugin lets you extract text from PDFs, parse JSON, and process HTML files in 3 lines of Ruby.

🚀 What This Solves

Before: Your users send PDFs/JSON files → You write 50+ lines of parsing code + install dependencies + handle edge cases.

After: Your users send files → You call one method → Get clean extracted text.

📦 Installation

# In your Gemfile
gem 'telegem'

# Optional dependency, needed only for PDF support
gem 'pdf-reader'

🎯 Real-World Examples

PDF Invoice Processor

bot.command('invoice') do |ctx|
  if ctx.message.document&.mime_type == 'application/pdf'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )

    result = extractor.extract

    if result[:success]
      # Find amounts in invoice text
      amounts = result[:content].scan(/\$\d+\.\d{2}/)
      ctx.reply("📊 Found #{amounts.size} payment amounts")
    end
  end
end
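
To see what the scan in this handler picks up, here is a small standalone sketch run against a made-up invoice string. The sample text and the `invoice_total_cents` helper are assumptions for illustration and are not part of Telegem. Note that the pattern only matches amounts written like `$12.34`, so a value with a thousands separator such as `$1,250.00` is missed.

```ruby
# Same pattern as in the handler: a dollar sign, digits, a dot, two decimals.
AMOUNT_PATTERN = /\$\d+\.\d{2}/

# Hypothetical helper: sums all matched amounts as integer cents,
# which avoids floating-point rounding on currency values.
def invoice_total_cents(text)
  text.scan(AMOUNT_PATTERN).sum { |amount| amount.delete('$.').to_i }
end

# Made-up sample standing in for result[:content]
sample = "Item A $19.99\nItem B $5.01\nShipping $1,250.00"

amounts = sample.scan(AMOUNT_PATTERN)
total = invoice_total_cents(sample)

puts "Found #{amounts.size} payment amounts"
puts format('Total: $%d.%02d', total / 100, total % 100)
```

If your invoices use thousands separators or other currencies, the pattern would need to be widened accordingly.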

JSON Config Validator

bot.on(:message, document: true) do |ctx|
  # file_name is optional in the Bot API, so guard against nil
  if ctx.message.document.file_name&.end_with?('.json')
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )

    config = extractor.extract

    if config[:success]
      ctx.reply("✅ Valid JSON with #{config[:content].keys.size} keys")
    else
      ctx.reply("❌ Invalid JSON: #{config[:error]}")
    end
  end
end
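
One caveat: valid JSON does not have to be an object. If the plugin returns the parsed value (as `config[:content].keys` implies), a top-level array or scalar would raise `NoMethodError` on `.keys`. Below is a minimal sketch of a safer summary using only Ruby's standard `json` library; `describe_json` is a hypothetical helper name, not a Telegem method.

```ruby
require 'json'

# Hypothetical helper: describes a parsed JSON value without assuming it is a Hash.
def describe_json(value)
  case value
  when Hash  then "object with #{value.size} keys"
  when Array then "array with #{value.size} items"
  else            'scalar value'
  end
end

puts describe_json(JSON.parse('{"name": "bot", "debug": true}'))
puts describe_json(JSON.parse('[1, 2, 3]'))
```

In the handler, the reply could then become `ctx.reply("✅ Valid JSON: #{describe_json(config[:content])}")`.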

HTML Text Extractor

bot.command('html') do |ctx|
  if ctx.message.document&.mime_type == 'text/html'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )

    html = extractor.extract

    if html[:success]
      # Convert HTML to plain text (simplified)
      text = html[:content]
      ctx.reply("📝 Extracted #{text.length} characters")
    end
  end
end
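
The handler only reports the length of the extracted content. If that content still contains markup (an assumption, since the post does not say whether the plugin strips tags), a rough plain-text conversion could look like the sketch below. `html_to_text` is a hypothetical helper that uses naive regexes and a handful of entities; a real HTML parser such as Nokogiri is the more robust choice.

```ruby
# A few common entities only; this is deliberately not exhaustive.
ENTITIES = {
  '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
  '&quot;' => '"', '&#39;' => "'", '&nbsp;' => ' '
}.freeze

# Hypothetical helper: naive HTML-to-text conversion.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop script/style blocks
  text = text.gsub(/<[^>]+>/, ' ')                        # drop remaining tags
  text = text.gsub(Regexp.union(ENTITIES.keys), ENTITIES) # decode entities in one pass
  text.gsub(/\s+/, ' ').strip                             # squeeze whitespace
end

puts html_to_text('<h1>Report</h1><p>Tom &amp; Jerry</p><script>alert(1)</script>')
```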

🔧 How It Works Under the Hood

The plugin handles the tedious parts for you:

· Downloads the file from Telegram's servers
· Auto-detects the file type
· Processes it with the appropriate library (PDF::Reader for PDFs, JSON.parse for JSON)
· Cleans up temp files automatically
· Returns a consistent hash format:

{
  success: true,
  content: "Extracted text here...",
  pages: 3,           # PDF only
  file_size: 45210    # All file types
}
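
The steps listed in this section can be pictured with a simplified, local-only sketch. This is not Telegem's source: it skips the Telegram download and PDF handling, and `detect_type` and `extract_local` are hypothetical names. It only illustrates dispatching by file type and returning one consistent hash shape, including a failure case shaped like the `result[:error]` used in the examples.

```ruby
require 'json'
require 'tempfile'

# Hypothetical: guess the type from the file extension.
def detect_type(path)
  case File.extname(path).downcase
  when '.json'         then :json
  when '.html', '.htm' then :html
  else                      :text
  end
end

# Hypothetical: parse a local file and always return the same hash shape.
def extract_local(path)
  raw = File.read(path)
  content = detect_type(path) == :json ? JSON.parse(raw) : raw
  { success: true, content: content, file_size: File.size(path) }
rescue JSON::ParserError, SystemCallError => e
  { success: false, error: e.message }
end

# Tempfile.create with a block deletes the file when the block exits,
# which mirrors the "cleans up temp files" step.
Tempfile.create(['config', '.json']) do |file|
  file.write('{"token": "abc", "debug": false}')
  file.flush
  p extract_local(file.path)
end
```

Returning a hash instead of raising keeps handler code to a single `if result[:success]` branch, which matches how the examples in this post consume the result.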

⚠️ Important Security Notes

# ✅ SAFE - Use only Telegram-generated file_ids
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  ctx.message.document.file_id  # From Telegram context
)

# ❌ DANGEROUS - Never pass untrusted input as a file_id
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  params[:user_input]  # Attacker-controlled value, not issued by Telegram
)
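
Beyond sticking to Telegram-supplied file_ids, it can also help to check a document's metadata before extracting it. The sketch below is illustrative and not part of the plugin: `safe_document?`, the allowed MIME list, and the `Document` struct (a stand-in for `ctx.message.document`) are all hypothetical. The 20 MB ceiling reflects the download limit for bots on the standard cloud Bot API.

```ruby
# Stand-in for ctx.message.document; the real object comes from Telegram.
Document = Struct.new(:file_id, :file_name, :mime_type, :file_size, keyword_init: true)

ALLOWED_MIME_TYPES = %w[application/pdf application/json text/html].freeze
MAX_FILE_SIZE = 20 * 1024 * 1024 # bytes

# Hypothetical guard: accept only known MIME types with a reported size within the limit.
def safe_document?(document)
  return false if document.nil?
  return false unless ALLOWED_MIME_TYPES.include?(document.mime_type)

  size = document.file_size
  !size.nil? && size <= MAX_FILE_SIZE
end

invoice = Document.new(file_id: 'abc', file_name: 'invoice.pdf',
                       mime_type: 'application/pdf', file_size: 45_210)
puts safe_document?(invoice)
```

In a handler this would sit in front of the extractor, for example `next unless safe_document?(ctx.message.document)`, assuming the context's document object exposes `mime_type` and `file_size`.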

🎨 Advanced: Processing Replies

# Extract from replied-to PDFs
bot.command('extract') do |ctx|
  if ctx.message.reply_to_message&.document
    file_id = ctx.message.reply_to_message.document.file_id

    extractor = Telegem::Plugins::FileExtractor.new(bot, file_id, file_type: :pdf)
    result = extractor.extract_pdf

    ctx.reply(result[:success] ? "✅ Done" : "❌ Failed: #{result[:error]}")
  end
end

📈 Why This Matters

Most bot frameworks make you handle file parsing manually. Telegem's approach:

· Reduces boilerplate from 50+ lines to 3
· Handles edge cases (encrypted PDFs, malformed JSON)
· Auto-cleans temp files (no leftover files on disk)
· Works seamlessly with Telegem's async architecture

🚀 Get Started

# Install the gem
gem install telegem

# Check out the full example
git clone https://gitlab.com/ruby-telegem/telegem-examples

💬 Your Turn

What document processing tasks are you building with Telegram bots? Have you tried Telegem's new plugin? Share your use cases below!

Telegem is a modern, async-first Telegram Bot framework for Ruby. Built with ❤️ by @slick_phantom.
