Originally published on AIdeazz — cross-posted here with canonical link.
Most job search "AI" tools are glorified keyword matchers with a ChatGPT wrapper. After building production multi-agent systems that handle everything from CRM workflows to document processing, I decided to prototype what an actual autonomous job search AI would look like — one that could discover opportunities, evaluate fit, and navigate application processes without constant human oversight. What emerged wasn't just a technical challenge but a case study in where automation should and shouldn't replace human judgment.
The Discovery Problem: Beyond RSS Feeds and API Limits
Job discovery seems straightforward until you try to build it. LinkedIn's API is locked down tighter than their recommendation algorithm. Indeed's scraping detection will flag you faster than you can spell "rate limit." AngelList changes their DOM structure like they're running A/B tests on bot developers.
My initial approach used a distributed scraper running on Oracle Cloud instances across multiple regions. Different IPs, randomized user agents, human-like browsing patterns — the whole defensive programming playbook. It worked for about 72 hours before the first CAPTCHA appeared.
The sustainable solution required a hybrid approach:
```python
class JobDiscoveryOrchestrator:
    def __init__(self):
        self.sources = {
            'api_based': ['greenhouse', 'lever', 'workable'],
            'rss_feeds': ['remoteok', 'weworkremotely'],
            'email_parsers': ['company_career_pages'],
            'social_monitors': ['twitter_job_accounts', 'slack_communities']
        }
        self.fallback_chain = ['api', 'rss', 'email', 'manual_queue']
```
The architecture routes through official APIs where available, falls back to RSS feeds and email subscriptions, and maintains a manual review queue for high-value sources that resist automation. Each source has its own agent with specific parsing logic and rate limiting rules.
Oracle's compartmentalized infrastructure helped here — each scraping agent runs in its own compute instance with dedicated IP allocation. When one gets flagged, it doesn't affect the others. The coordination happens through Oracle Streaming Service, which handles the message passing between discovery agents and the central orchestrator.
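The fallback chain can be sketched as a simple priority loop. This is an illustrative sketch, not the production code: the fetcher functions and the `SourceUnavailable` exception are assumptions.

```python
# Hypothetical sketch of the fallback chain (api -> rss -> email -> manual);
# fetcher names and the SourceUnavailable exception are illustrative.
class SourceUnavailable(Exception):
    """Raised when a source is rate-limited, blocked, or down."""

def fetch_with_fallback(job_board, fetchers):
    """Try each (name, fetch) pair in priority order."""
    for name, fetch in fetchers:
        try:
            return name, fetch(job_board)
        except SourceUnavailable:
            continue  # move down the chain
    # Nothing automated worked; a human takes over from the review queue
    return 'manual_queue', []

def api_fetch(board):   # stands in for an official API client
    raise SourceUnavailable('rate limited')

def rss_fetch(board):   # stands in for an RSS poller
    return [{'title': 'Backend Engineer', 'source': 'rss'}]
```

The point of the shape: a flagged source degrades gracefully to the next tier instead of failing the whole discovery run.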
Scoring and Matching: The Multi-Model Approach
Job matching isn't just keyword overlap. A senior backend role at a 10-person startup and a senior backend role at Google might share 90% of the listed requirements but represent completely different careers.
I built a three-stage scoring pipeline:
Stage 1: Hard Filters
Simple boolean logic on non-negotiables. Location requirements, visa sponsorship, specific certifications. This runs locally — no need to burn API calls on jobs that require a PMP certification I don't have.
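As a concrete sketch of what these local checks look like (field names and the non-negotiables are assumptions for illustration):

```python
# Hypothetical Stage 1 hard filters; keys are illustrative, not the real schema.
NON_NEGOTIABLES = {
    'remote_required': True,
    'needs_visa_sponsorship': False,
    'certifications_held': set(),   # certifications actually held
}

def passes_hard_filters(job):
    """Stage 1: cheap boolean checks, run locally before any model call."""
    if NON_NEGOTIABLES['remote_required'] and not job.get('remote_ok', False):
        return False
    if NON_NEGOTIABLES['needs_visa_sponsorship'] and not job.get('visa_sponsorship', False):
        return False
    required = set(job.get('required_certifications', []))
    # Reject anything demanding a certification not actually held (e.g. PMP)
    return required <= NON_NEGOTIABLES['certifications_held']
```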
Stage 2: Semantic Matching
This is where it gets interesting. I route to different models based on the analysis needed:
- Groq with Llama 70B for initial relevance scoring (fast, cheap, good enough for filtering)
- Claude 3 Opus for detailed role analysis (understands nuanced requirements)
- GPT-4 for company culture assessment (trained on more corporate communication styles)
The routing logic considers both cost and accuracy needs:
```python
def route_analysis_request(self, job_desc, analysis_type):
    if analysis_type == 'quick_filter':
        return self.groq_client.analyze(job_desc, timeout=2)
    elif analysis_type == 'technical_depth':
        return self.claude_client.analyze(job_desc, max_tokens=4000)
    elif len(job_desc) > 8000:  # long descriptions get summarized first
        return self.gpt4_client.analyze(
            self.summarize_first(job_desc),
            context_window='32k'
        )
    return self.gpt4_client.analyze(job_desc)  # default path
```
Stage 3: Historical Pattern Matching
This is where autonomous job search AI shows its value. The system learns from past applications — which roles led to interviews, which companies never responded, which job titles oversell the actual work.
I store embedding vectors for every job applied to, along with outcome data. New opportunities get compared against this historical dataset:
```python
def calculate_outcome_probability(self, new_job_embedding):
    # Find the 20 most similar past applications with a known outcome
    similar_past_applications = self.vector_db.search(
        new_job_embedding,
        k=20,
        filter={'has_outcome': True}
    )
    weights = self.calculate_similarity_weights(similar_past_applications)
    # Similarity-weighted average of past outcomes
    outcome_score = sum(
        app['outcome_success'] * weight
        for app, weight in zip(similar_past_applications, weights)
    )
    return outcome_score
```
The tricky part: avoiding feedback loops. If the system only applies to jobs similar to past successes, it'll never explore new opportunity types. I added controlled randomization — 15% of applications go to roles outside the typical pattern, tagged for learning purposes.
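That controlled randomization is essentially an epsilon-greedy policy. A minimal sketch, using the 15% rate from above:

```python
import random

EXPLORATION_RATE = 0.15  # 15% of applications go outside the learned pattern

def select_application_mode(rng=random):
    """Decide whether this application exploits past patterns or explores."""
    if rng.random() < EXPLORATION_RATE:
        return 'explore'   # tagged for learning; outcome feeds back into the model
    return 'exploit'       # pick from high-similarity matches
```

Exploration applications get tagged so their outcomes can later widen the pattern, not just confirm it.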
ATS Navigation: The Parsing Arms Race
Applicant Tracking Systems are where good user experience goes to die. Every ATS vendor seems to compete on who can create the most baroque form structures. Workday wants your entire work history in their specific format. Greenhouse has custom questions that change per company. Lever somehow makes uploading a PDF feel like solving a CAPTCHA.
My first instinct was to build form-specific parsers for each major ATS. After the fifth Workday update broke my carefully crafted XPath selectors, I switched to a more robust approach:
```python
class ATSAdapter:
    def __init__(self, browser_instance, ats_type):
        self.browser = browser_instance
        self.ats_type = ats_type  # e.g. 'workday', 'greenhouse', 'lever'
        self.vision_model = self.load_vision_model()  # GPT-4V

    def fill_application(self, ats_page, applicant_data):
        # Take screenshot of current form section
        screenshot = self.browser.screenshot()
        # Use vision model to identify form fields
        field_analysis = self.vision_model.analyze_form(
            screenshot,
            expected_fields=self.get_common_fields(self.ats_type)
        )
        # Map applicant data to identified fields
        for field in field_analysis.fields:
            if field.confidence > 0.8:
                self.smart_fill(field, applicant_data)
            else:
                self.flag_for_manual_review(field)
```
Using GPT-4V to understand form layouts proved more resilient than HTML parsing. When it fails, the system captures the problematic section and adds it to a manual review queue — accessible through our Telegram bot interface.
The infrastructure runs on Oracle Cloud's Container Instances with headless Chrome. Each application attempt gets its own container to avoid session contamination. Costs about $0.03 per application in compute resources, which beats the therapy costs of manually filling out Workday forms.
The Ethics Layer: Hard Boundaries in Code
Here's where autonomous job search AI gets complicated. The technical capability to apply to 1,000 jobs per day exists. Should you? Absolutely not.
I implemented hard limits throughout the system:
Application Rate Limiting
Maximum 10 applications per day, with a mandatory two-hour gap between applications to different roles at the same company. This isn't just about avoiding detection — it's about maintaining quality and giving each application appropriate consideration.
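A minimal sketch of how such limits can be enforced (the class and its API are illustrative, not the production implementation):

```python
import time
from collections import deque, defaultdict

class ApplicationRateLimiter:
    """Illustrative enforcement of the hard limits described above:
    10 applications per day, two-hour gap per company."""
    DAILY_CAP = 10
    SAME_COMPANY_GAP = 2 * 60 * 60   # seconds

    def __init__(self, clock=time.time):
        self.clock = clock
        self.today = deque()                      # timestamps in the last 24h
        self.last_by_company = defaultdict(float)

    def allow(self, company):
        now = self.clock()
        # Drop timestamps older than 24 hours
        while self.today and now - self.today[0] > 24 * 60 * 60:
            self.today.popleft()
        if len(self.today) >= self.DAILY_CAP:
            return False
        if now - self.last_by_company[company] < self.SAME_COMPANY_GAP:
            return False
        self.today.append(now)
        self.last_by_company[company] = now
        return True
```

Injecting the clock keeps the limiter testable without waiting out real two-hour windows.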
Mandatory Human Checkpoints
Certain triggers force human review:
- Salary below historical minimum
- Company size under 10 people (startup volatility)
- Role requirements that match less than 60%
- Any position involving relocation
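The triggers above can be sketched as a single check that returns every reason a posting needs human eyes. Field names and the salary placeholder are assumptions:

```python
# Hedged sketch of the mandatory-review triggers; thresholds mirror the list above.
HISTORICAL_MIN_SALARY = 120_000  # placeholder value, not from the article

def review_triggers(job):
    """Return the reasons a job must go to the human review queue (empty = none)."""
    reasons = []
    if job.get('salary', 0) < HISTORICAL_MIN_SALARY:
        reasons.append('salary_below_minimum')
    if job.get('company_size', 0) < 10:
        reasons.append('startup_volatility')
    if job.get('requirements_match', 1.0) < 0.60:
        reasons.append('weak_requirements_match')
    if job.get('requires_relocation', False):
        reasons.append('relocation')
    return reasons
```

Returning all reasons at once, rather than short-circuiting, makes the review queue entry self-explanatory.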
Truthfulness Constraints
The system can optimize how experiences are presented but cannot fabricate. Every claim must map to a source document:
```python
def validate_claim(self, claim_text, source_documents):
    # Embed the claim alongside every source document
    embeddings = self.generate_embeddings([claim_text] + source_documents)
    # Highest similarity between the claim and any single source
    max_similarity = max(
        cosine_similarity(embeddings[0], embeddings[i])
        for i in range(1, len(embeddings))
    )
    if max_similarity < 0.7:
        raise ValidationError(f"Claim '{claim_text}' lacks supporting evidence")
```
Outcome Transparency
Every application gets logged with full reasoning chains. When the system decides not to apply somewhere, it documents why. This audit trail isn't just for debugging — it's for understanding and correcting biases in the decision-making process.
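A minimal JSON Lines logger along these lines (the path and record shape are illustrative):

```python
import json
import time

def log_decision(job_id, decision, reasoning_chain, path='decisions.jsonl'):
    """Append one audit record per apply/skip decision (JSON Lines format)."""
    record = {
        'job_id': job_id,
        'decision': decision,           # 'applied' or 'skipped'
        'reasoning': reasoning_chain,   # ordered list of scoring steps
        'timestamp': time.time(),
    }
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record
```

Append-only JSON Lines keeps the audit trail greppable and trivially replayable when checking for bias.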
Real Constraints and Production Reality
Running this in production surfaced constraints I hadn't anticipated:
Cost Management
A full analysis pipeline (discovery → scoring → application) costs approximately:
- Groq quick filtering: $0.001 per job
- Claude detailed analysis: $0.03 per job
- GPT-4V form filling: $0.02 per application
- Oracle compute: $0.03 per application
At 10 applications per day after analyzing ~200 jobs, daily cost runs about $8. Not breaking the bank, but not negligible for a personal tool.
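As a back-of-the-envelope check, assuming every analyzed job reaches the Claude stage:

```python
# Rough daily cost from the per-unit figures above (worst case: every
# analyzed job gets the full Claude pass)
JOBS_ANALYZED = 200
APPLICATIONS = 10

daily_cost = (
    JOBS_ANALYZED * 0.001    # Groq quick filtering
    + JOBS_ANALYZED * 0.03   # Claude detailed analysis
    + APPLICATIONS * 0.02    # GPT-4V form filling
    + APPLICATIONS * 0.03    # Oracle compute
)
```

That lands around $6.70, leaving headroom for retries and re-analysis that push the real total toward $8.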
Maintenance Overhead
ATS interfaces change. API rate limits adjust. New CAPTCHAs appear. I spend 3-4 hours per week maintaining adapters and handling edge cases. The system is autonomous, not automatic.
Legal Gray Areas
Some sites' Terms of Service explicitly prohibit automated applications. Others are vague. I maintain a blacklist of platforms with explicit prohibitions and accept that some opportunities will require manual handling.
The Telegram Interface: Practical Control
All system interaction happens through a Telegram bot. No web UI to maintain, no complex authentication, just messages:
```
/status              - Current pipeline status
/pause               - Halt all applications
/review              - Show manual review queue
/stats week          - Application outcomes for past week
/apply [job_url]     - Manually trigger an application
/blacklist [company] - Never apply to this company
```
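Dispatching these commands needs nothing fancier than string matching. A toy sketch (handler bodies are stand-ins for the real pipeline calls):

```python
# Illustrative command dispatch; state mutation stands in for pipeline control.
def handle_command(text, state):
    parts = text.strip().split(maxsplit=1)
    cmd = parts[0]
    arg = parts[1] if len(parts) > 1 else None
    if cmd == '/pause':
        state['paused'] = True
        return 'All applications halted.'
    if cmd == '/blacklist' and arg:
        state['blacklist'].add(arg.lower())
        return f'Will never apply to {arg}.'
    if cmd == '/status':
        return 'paused' if state.get('paused') else 'running'
    return 'Unknown command.'
```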
Push notifications for important events:
- Applications requiring manual review
- Successful application confirmations
- Weekly outcome summaries
This interface pattern — which we use across our Oracle Cloud agents — provides just enough control without encouraging micro-management. The bot runs on Oracle Functions, costs essentially nothing, and gives me application control from anywhere.
Beyond MVP: What Actually Matters
After running this system for three months, clear patterns emerged:
Quality over quantity wins. My callback rate is higher with 10 carefully selected applications per day than when I was manually spray-and-praying.
Historical learning compounds. The system now predicts with ~75% accuracy whether I'll get a response based on job posting patterns.
Partial automation is better than full automation. Human judgment at key decision points improves outcomes significantly.
Infrastructure flexibility matters. Oracle Cloud's compartmentalized structure made it easy to isolate different components and scale specific bottlenecks.
The autonomous job search AI doesn't replace human job searching — it augments it. It handles the mechanical work of discovery and initial filtering while preserving human judgment for decisions that matter. It respects both technical constraints (rate limits, CAPTCHAs) and ethical boundaries (application quality, truthfulness).
Most importantly, it recognizes that job searching isn't just a matching problem. It's a human process with human consequences. The automation should make that process more efficient, not more dehumanizing.
Building this taught me that the hard part of autonomous systems isn't making them work — it's deciding when they shouldn't. Every job application represents a potential relationship between humans. Respecting that relationship while still leveraging automation's efficiency is the real engineering challenge.