Handling unstructured data is one of the classic headaches in software engineering. Recently, while building JobFit AI, an AI resume builder designed to bypass modern ATS (Applicant Tracking Systems), I hit a major roadblock: how do you reliably extract structured data from screenshots of Job Descriptions (JDs) or poorly formatted PDF resumes?
Traditional regex-based parsers are fragile. They break the moment a candidate uses a creative layout or a recruiter formats a JD as an image. Here is a breakdown of how I combined OCR (Optical Character Recognition) with Large Language Models (LLMs) to build a robust, error-tolerant parsing pipeline.
The Ingestion Layer: Taming the Chaos with OCR
Users upload data in various formats: pasted text, PDFs, or even screenshots from LinkedIn. For images and scanned PDFs, a standard text extraction library isn't enough, because there is no text layer to read. You need an OCR engine.
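To make the routing concrete, here is a minimal sketch of an ingestion function. It is not the actual JobFit AI code: the `ingest` name, the `OcrFn` type, and the list of accepted suffixes are assumptions for illustration, and the OCR engine is passed in as a plain callable so any backend can be plugged in.

```python
from pathlib import Path
from typing import Callable

# Hypothetical type: any function that turns raw image/PDF bytes into text.
OcrFn = Callable[[bytes], str]

# Assumed set of screenshot formats; adjust to whatever uploads you accept.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def ingest(filename: str, payload: bytes, ocr: OcrFn) -> str:
    """Return raw text for pasted text, a PDF, or a screenshot."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".txt", ""}:
        # Pasted text needs no OCR, only decoding.
        return payload.decode("utf-8", errors="replace")
    if suffix == ".pdf" or suffix in IMAGE_SUFFIXES:
        # PDFs and screenshots are handed to the OCR engine.
        return ocr(payload)
    raise ValueError(f"unsupported upload type: {suffix!r}")
```

Keeping the OCR call behind a callable means the rest of the pipeline does not care which engine produced the text, which also makes the function easy to test with a stub.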
Open-source tools like Tesseract are a good starting point, but for production-grade accuracy (especially with complex multi-column resume layouts), routing the image through a cloud OCR API (like Google Cloud Vision or AWS Textract) provides a much cleaner raw text string.
The Extraction Layer: LLMs as Reasoning Parsers
Once we have the raw, messy text from the OCR layer, regex is out the window. Instead, we use an LLM (like GPT-4 or Claude 3) to structure the data. The trick here isn't just sending the text; it's enforcing a strict JSON output schema.
Here is a conceptual example of the system prompt:
You are an expert HR data extraction API.
Analyze the following raw OCR text extracted from a Job Description.
Extract the core requirements into a strict JSON format with the following keys:
"job_title", "required_hard_skills" (array), "years_of_experience" (integer), and "key_responsibilities" (array).
Do not include any markdown formatting outside the JSON object.
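A prompt alone does not guarantee the model sticks to the schema, so it helps to validate the reply before anything downstream touches it. The sketch below is one way to do that with only the standard library. The function name `parse_jd_response` is made up for this example; the keys and types are the ones listed in the prompt above.

```python
import json

# Keys and types taken from the prompt above.
EXPECTED_TYPES = {
    "job_title": str,
    "required_hard_skills": list,
    "years_of_experience": int,
    "key_responsibilities": list,
}


def parse_jd_response(raw: str) -> dict:
    """Parse the model's reply and raise ValueError if it drifts from the schema."""
    # Models sometimes add chatter or a markdown fence around the JSON,
    # so keep only the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected):
            raise ValueError(f"wrong type for {key}")
    return data
```

On a ValueError, a reasonable next step is to retry the LLM call once or twice before surfacing the failure to the user.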
The Matching Logic (The Fun Part)
Once the JD is parsed into structured JSON, and the user's base resume is parsed into a similar JSON schema, calculating the gap becomes a straightforward programmatic task. We can map the required skills against the user's existing skills and calculate a baseline "Match Score".
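As a rough illustration of that baseline score, the sketch below compares the two skill lists after lowercasing and trimming them. The `match_score` name and the scoring rule (fraction of required skills found on the resume) are assumptions for this example, not necessarily how JobFit AI computes it. Exact string matching will also miss synonyms such as "JS" versus "JavaScript".

```python
def match_score(required: list[str], candidate: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of required skills covered, list of missing skills)."""

    def norm(skill: str) -> str:
        # Case and surrounding whitespace should not affect a match.
        return skill.strip().lower()

    have = {norm(s) for s in candidate}
    # Deduplicate the required skills while keeping their original order.
    wanted = list(dict.fromkeys(norm(s) for s in required))
    if not wanted:
        # A JD with no listed skills is treated as fully matched.
        return 1.0, []
    missing = [s for s in wanted if s not in have]
    return (len(wanted) - len(missing)) / len(wanted), missing


score, gaps = match_score(
    ["Python", "SQL", "Docker", "AWS"],
    ["python", " sql ", "Git"],
)
print(score, gaps)  # 0.5 ['docker', 'aws']
```

Returning the missing skills alongside the score gives the resume tailoring step something concrete to act on.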
Takeaways
Combining deterministic tools (OCR) with probabilistic engines (LLMs) allows us to handle unstructured real-world data gracefully. If you want to see this exact pipeline in action, feel free to try out JobFit AI to automatically tailor your resume to any job description.
Have you built any interesting pipelines combining OCR and AI? Let me know in the comments!