Extract PDFs, JSON & HTML in Your Telegram Bot with Telegem's New Plugin
Building a Telegram bot that processes documents? Stop wrestling with file parsing. Telegem's new FileExtractor plugin lets you extract text from PDFs, parse JSON, and read HTML files with one constructor call and one extract method.
🚀 What This Solves
Before: Your users send PDFs/JSON files → You write 50+ lines of parsing code + install dependencies + handle edge cases.
After: Your users send files → You call one method → Get clean extracted text.
📦 Installation
# In your Gemfile
gem 'telegem'
# Install the optional dependency for PDF support
gem 'pdf-reader'
🎯 Real-World Examples
- PDF Invoice Processor
bot.command('invoice') do |ctx|
  if ctx.message.document&.mime_type == 'application/pdf'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id,
      file_type: :pdf
    )
    result = extractor.extract_pdf
    if result[:success]
      # Find amounts in invoice text
      amounts = result[:content].scan(/\$\d+\.\d{2}/)
      ctx.reply("📊 Found #{amounts.size} payment amounts")
    end
  end
end
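The pattern in the example above matches plain amounts such as $19.99, but it does not match amounts written with thousands separators such as $1,250.00. A minimal sketch of a more forgiving scan follows. The helper names `invoice_amounts_in_cents` and `format_cents` are hypothetical and not part of Telegem, and amounts are kept as integer cents to avoid floating-point rounding.

```ruby
# Hypothetical helpers (not part of Telegem): find dollar amounts in
# extracted invoice text, allowing thousands separators like "$1,250.00".
AMOUNT_PATTERN = /\$(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})/

# Returns every matched amount as an integer number of cents.
def invoice_amounts_in_cents(text)
  text.scan(AMOUNT_PATTERN).map do |whole, cents|
    whole.delete(',').to_i * 100 + cents.to_i
  end
end

# Formats an integer number of cents back into a dollar string.
def format_cents(total)
  format('$%d.%02d', total / 100, total % 100)
end
```

Inside the handler you could then reply with something like `format_cents(invoice_amounts_in_cents(result[:content]).sum)` to report a total rather than just a count.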
- JSON Config Validator
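bot.on(:message, document: true) do |ctx|
  # file_name is optional, so guard against nil before checking the extension
  if ctx.message.document.file_name&.end_with?('.json')
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id,
      file_type: :json
    )
    config = extractor.extract_json
    if config[:success]
      data = config[:content]
      # Valid JSON can also be an array or scalar, so only count keys on a Hash
      summary = data.is_a?(Hash) ? "#{data.keys.size} keys" : "a top-level #{data.class}"
      ctx.reply("✅ Valid JSON with #{summary}")
    else
      ctx.reply("❌ Invalid JSON: #{config[:error]}")
    end
  end
end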
- HTML to Plain Text Converter
bot.command('html') do |ctx|
  if ctx.message.document&.mime_type == 'text/html'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id,
      file_type: :html
    )
    html = extractor.extract_html
    if html[:success]
      # Convert HTML to plain text (simplified)
      text = html[:content].gsub(/<[^>]*>/, '').strip
      ctx.reply("📝 Extracted #{text.length} characters")
    end
  end
end
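The single `gsub` in the example above leaves script and style contents in the output and keeps entities such as `&amp;` encoded. Here is a slightly more careful sketch using only Ruby's standard library. The helper name `html_to_text` is hypothetical and not part of Telegem, and the approach is still regex-based, so a real HTML parser such as Nokogiri is likely the better choice for messy markup.

```ruby
require 'cgi'

# Hypothetical helper (not part of Telegem): turn an HTML string into
# readable plain text without any third-party gems.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}mi, ' ') # drop script/style blocks
  text = text.gsub(/<[^>]*>/, ' ')                              # strip remaining tags
  text = CGI.unescapeHTML(text)                                 # decode &amp;, &lt;, &gt;, &quot;
  text.gsub(/\s+/, ' ').strip                                   # collapse runs of whitespace
end
```

In the handler, `html_to_text(html[:content])` would replace the inline `gsub` call.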
🔧 How It Works Under the Hood
The plugin handles the tedious parts for you:
- Downloads the file from Telegram's servers
- Processes it with the appropriate library (PDF::Reader for PDFs, JSON.parse for JSON)
- Cleans up temp files automatically
- Returns a consistent hash format:
{
  success: true,
  content: "Extracted text here...",
  pages: 3,          # PDF only
  file_size: 45210   # All file types
}
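To make the steps above concrete, here is an illustrative sketch of the JSON path: write the downloaded bytes to a temp file, parse it, clean up, and return the hash. This is not Telegem's actual implementation. The function name `extract_json_from_bytes` is hypothetical, the download step is left out, and the failure shape `{ success: false, error: ... }` is an assumption inferred from the `[:error]` key used in the examples.

```ruby
require 'json'
require 'tempfile'

# Illustrative only, NOT Telegem's real code: mirrors the listed steps for
# the JSON case, starting from bytes that were already downloaded.
def extract_json_from_bytes(bytes)
  # Tempfile.create with a block deletes the file when the block exits,
  # even if parsing raises.
  Tempfile.create(['upload', '.json']) do |file|
    file.binmode
    file.write(bytes)
    file.flush
    content = JSON.parse(File.read(file.path))
    { success: true, content: content, file_size: bytes.bytesize }
  end
rescue JSON::ParserError => e
  # Assumed failure shape, based on the :error key used in the examples
  { success: false, error: e.message }
end
```

Returning a hash instead of raising keeps handler code to a single `if result[:success]` branch, which is the pattern every example in this post follows.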
⚠️ Important Security Notes
# ✅ SAFE - Use only file_ids that Telegram delivered with the message
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  ctx.message.document.file_id, # From Telegram context
  file_type: :pdf
)
# ❌ DANGEROUS - Never pass raw user input as a file_id
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  params[:user_input], # Untrusted string, not a file_id Telegram attached to this message
  file_type: :pdf
)
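Alongside trusting only Telegram-supplied file_ids, it can help to check a document's declared type and size before downloading it at all. The guard below is a sketch under stated assumptions: `extractable?`, `ALLOWED_MIME_TYPES`, and `MAX_FILE_BYTES` are hypothetical names, the 5 MB limit is arbitrary, and it assumes the document object exposes a `file_size` reader next to the `mime_type` reader used in the examples.

```ruby
# Hypothetical guard (not part of Telegem): decide whether an incoming
# document is worth downloading, based on metadata sent with the message.
ALLOWED_MIME_TYPES = %w[application/pdf application/json text/html].freeze
MAX_FILE_BYTES = 5 * 1024 * 1024 # assumed limit; tune for your bot

def extractable?(document)
  return false if document.nil?
  return false unless ALLOWED_MIME_TYPES.include?(document.mime_type)

  # file_size may be missing, so treat an unknown size as not extractable
  size = document.file_size
  !size.nil? && size <= MAX_FILE_BYTES
end
```

A handler could start with `next unless extractable?(ctx.message.document)` so oversized or unexpected files never reach the extractor.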
🎨 Advanced: Processing Replies
# Extract from replied-to PDFs
bot.command('extract') do |ctx|
  if ctx.message.reply_to_message&.document
    file_id = ctx.message.reply_to_message.document.file_id
    extractor = Telegem::Plugins::FileExtractor.new(bot, file_id, file_type: :pdf)
    result = extractor.extract_pdf
    ctx.reply(result[:success] ? "✅ Done" : "❌ Failed: #{result[:error]}")
  end
end
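The reply handler above always treats the replied-to document as a PDF. One way to support all three formats is a small lookup from MIME type to the `file_type` symbol and extract method shown in the earlier examples. The `EXTRACTORS` table and `extractor_for` helper are hypothetical and not part of Telegem.

```ruby
# Hypothetical lookup (not part of Telegem): map a document's MIME type to
# the file_type symbol and extract method used in the examples in this post.
EXTRACTORS = {
  'application/pdf'  => { file_type: :pdf,  method: :extract_pdf },
  'application/json' => { file_type: :json, method: :extract_json },
  'text/html'        => { file_type: :html, method: :extract_html }
}.freeze

# Returns the matching entry, or nil for unsupported types.
def extractor_for(mime_type)
  EXTRACTORS[mime_type]
end

# Possible use inside the handler (sketch, relies on the API shown above):
#   plan = extractor_for(document.mime_type)
#   next ctx.reply("❌ Unsupported file type") if plan.nil?
#   extractor = Telegem::Plugins::FileExtractor.new(bot, document.file_id, file_type: plan[:file_type])
#   result = extractor.public_send(plan[:method])
```

Keeping the mapping in one frozen hash means adding a new format is a one-line change rather than another `if` branch in every handler.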
📈 Why This Matters
Most bot frameworks make you handle file parsing manually. Telegem's approach:
· Reduces 50+ lines of parsing boilerplate to a constructor call and one extract method
· Handles edge cases (encrypted PDFs, malformed JSON)
· Auto-cleans temp files (nothing left behind on disk)
· Works seamlessly with Telegem's async architecture
🚀 Get Started
# Install the gem
gem install telegem
# Check out the full example
git clone https://gitlab.com/ruby-telegem/telegem-examples
💬 Your Turn
What document processing tasks are you building with Telegram bots? Have you tried Telegem's new plugin? Share your use cases below!
Telegem is a modern, async-first Telegram Bot framework for Ruby. Built with ❤️ by @slick_phantom.