Altug Tatlisu

Posted on Jun 1

Analyzing Your Own LinkedIn Posts With NLP and ML

#ai #programming #webdev #productivity

LinkedIn doesn't give you much. You can see a like count and a rough demographic breakdown if your post does well. You can't export your engagement history in a useful format, you can't query your own content programmatically, and you certainly can't train a model on it.

I wanted to understand what actually drives engagement on my posts - not just "this one did well," but why, and what that implies about what to write next. So I built LinkForge, a local content intelligence platform that ingests your LinkedIn post history, runs NLP analysis over the content and comments, trains a per-profile engagement predictor, and generates data-driven recommendations for future posts.

This article covers how it works, why I made certain architectural choices, and what I learned.

The data problem

The first instinct is to use the LinkedIn API. The problem: the API is designed for app integrations, not personal analytics. Endpoints for your own content exist, but engagement data at the granularity you'd want for ML is either unavailable or requires a Marketing Developer Platform review that takes weeks and isn't guaranteed.

The second instinct is scraping. The problem there is LinkedIn's enforcement posture - aggressive rate limiting, bot detection, and account restrictions.

The cleanest approach: LinkedIn's own data export. Under Settings > Data Privacy > Get a copy of your data, you can request your post history. You get a ZIP with CSVs containing your posts, dates, and engagement numbers. No scraping, no API negotiation, no risk.
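
To make the export path concrete, here is a minimal sketch of parsing such a CSV into typed records using only the standard library. The column names (`date`, `text`, `reactions`, `comments`) and the timestamp format are assumptions for illustration, not the documented export schema; check the header row of your own file and adjust.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    posted_at: datetime
    text: str
    reactions: int
    comments: int


def parse_posts(csv_file) -> list[Post]:
    """Parse post rows from an open CSV file object.

    Column names and the date format are hypothetical; adapt them
    to the headers in your own export.
    """
    posts = []
    for row in csv.DictReader(csv_file):
        text = (row.get("text") or "").strip()
        if not text:
            continue  # skip rows with no body text
        posts.append(
            Post(
                posted_at=datetime.strptime(row["date"], "%Y-%m-%d %H:%M:%S"),
                text=text,
                # Treat missing or empty counts as zero.
                reactions=int(row.get("reactions") or 0),
                comments=int(row.get("comments") or 0),
            )
        )
    return posts


# In-memory sample standing in for a real export file.
sample = io.StringIO(
    "date,text,reactions,comments\n"
    "2024-05-01 09:30:00,Why I stopped using ORMs,120,14\n"
    "2024-05-08 10:00:00,,3,0\n"
    "2024-05-15 08:45:00,Postgres is enough,80,\n"
)
posts = parse_posts(sample)
print(len(posts), posts[0].reactions, posts[1].comments)
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.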

For LinkForge I also implemented a Playwright-based scraper, but it is in the codebase for completeness only. For any production use I'd recommend the export path.

Architecture

The system has three layers: ingestion, analysis, and presentation.

Ingestion parses the LinkedIn export (or optionally scrapes) and persists profiles and posts to PostgreSQL. Each post gets a 384-dimension sentence embedding via all-MiniLM-L6-v2 from sentence-transformers, stored in a pgvector column. This enables similarity queries later - find posts thematically similar to a given input, or cluster your content by topic.
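
As a rough illustration of what a similarity query computes, the sketch below ranks posts by cosine similarity in plain Python. The three-dimensional vectors and post IDs are made up for the example; in practice the vectors would be the 384-dimension embeddings, and the ranking would run inside PostgreSQL rather than in application code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], embeddings: dict[str, list[float]], top_k: int = 2):
    """Return the top_k (post_id, similarity) pairs, most similar first."""
    scored = [
        (post_id, cosine_similarity(query, vec))
        for post_id, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real sentence embeddings.
embeddings = {
    "post-a": [0.9, 0.1, 0.0],
    "post-b": [0.8, 0.2, 0.1],
    "post-c": [0.0, 0.1, 0.9],
}
ranked = most_similar([1.0, 0.0, 0.0], embeddings)
print([post_id for post_id, _ in ranked])
```

pgvector exposes the same idea through its cosine distance operator (`<=>`), so the equivalent query is an `ORDER BY` on that distance with a `LIMIT`.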

Analysis runs four things over the post corpus:

Sentiment scoring with VADER, extended with three custom dimensions: pragmatic/balanced tone, tribalism score, and technical depth. Standard VADER only measures how positive or negative a text is, which misses most of what distinguishes one technical LinkedIn post from another.
Theme detection across a fixed taxonomy: technical deep dive, personal story, critique, pragmatic balance, and a few others. Each post gets a theme and a confidence score.
Polarization scoring derived from comment sentiment distribution. A post with highly polarized comments - some very positive, some very negative - scores differently than one with uniformly neutral reactions.
Engagement prediction via a scikit-learn random forest trained per request on the profile's own history. The model produces an engagement estimate and a success probability relative to the profile's own distribution, not a global baseline.
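
The polarization idea can be sketched with a simple spread measure. The example below assumes each comment already has a sentiment score in [-1, 1] (the range of VADER's compound score) and uses the population standard deviation as the polarization value. That formula is an assumption chosen for illustration, not necessarily the one LinkForge uses.

```python
from statistics import pstdev


def polarization_score(comment_sentiments: list[float]) -> float:
    """Spread of comment sentiment scores, from 0 (uniform) to 1 (fully split).

    Assumes each score lies in [-1, 1]. With fewer than two comments
    there is no distribution to measure, so the score is 0.
    """
    if len(comment_sentiments) < 2:
        return 0.0
    # For values in [-1, 1] the standard deviation cannot exceed 1;
    # min() only guards against out-of-range input.
    return min(1.0, pstdev(comment_sentiments))


split_comments = [0.9, 0.8, -0.85, -0.9]     # strongly for and against
neutral_comments = [0.05, 0.0, -0.05, 0.1]   # uniformly mild

print(polarization_score(split_comments))
print(polarization_score(neutral_comments))
```

A standard deviation cannot tell a genuinely two-sided thread from one that is merely noisy, so a bimodality measure may be a better fit if that distinction matters.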

The per-profile training is a deliberate design choice. What works for a security researcher is not what works for a frontend developer. Training on your own distribution avoids that conflation entirely.
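
A small sketch shows why the baseline matters. It is not the random forest itself: it only places an engagement number within a profile's own history as a percentile, and the two histories are invented for the example.

```python
from bisect import bisect_right


def percentile_in_history(value: float, history: list[float]) -> float:
    """Fraction of past posts whose engagement is at or below `value`."""
    if not history:
        raise ValueError("history must not be empty")
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)


# Hypothetical engagement histories for two very different profiles.
security_researcher = [20, 35, 40, 55, 60]
frontend_developer = [150, 200, 320, 400, 900]

# The same predicted engagement reads very differently per profile.
print(percentile_in_history(100, security_researcher))
print(percentile_in_history(100, frontend_developer))
```

A predicted engagement of 100 would be the best post the first profile has ever had and the worst for the second, which is the conflation a global baseline would introduce.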

Recommendations are generated by identifying the highest-performing content patterns in the profile's history - theme, tone, structure, hook type - and constructing a suggested next-post plan from those patterns.
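
One way to picture the pattern-mining step is a plain group-by over post attributes. The sketch below ranks (theme, tone) combinations by mean engagement. The field names, the sample history, and the `min_posts` cutoff are all assumptions made for the example rather than LinkForge's actual schema or logic.

```python
from collections import defaultdict
from statistics import mean


def rank_patterns(posts: list[dict], keys=("theme", "tone"), min_posts: int = 2):
    """Rank attribute combinations by mean engagement, best first.

    Combinations seen in fewer than `min_posts` posts are dropped so a
    single outlier cannot dominate the recommendation.
    """
    groups = defaultdict(list)
    for post in posts:
        pattern = tuple(post[k] for k in keys)
        groups[pattern].append(post["engagement"])
    ranked = [
        (pattern, mean(values), len(values))
        for pattern, values in groups.items()
        if len(values) >= min_posts
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Invented history for illustration.
history = [
    {"theme": "technical deep dive", "tone": "pragmatic", "engagement": 120},
    {"theme": "technical deep dive", "tone": "pragmatic", "engagement": 180},
    {"theme": "personal story", "tone": "pragmatic", "engagement": 90},
    {"theme": "personal story", "tone": "pragmatic", "engagement": 70},
    {"theme": "critique", "tone": "tribal", "engagement": 400},
]
best = rank_patterns(history)
print(best[0])
```

The single high-scoring critique post is excluded here because one data point is not a pattern; the cutoff is a judgment call that depends on how much history you have.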

The backend is FastAPI with SQLAlchemy 2.0 async. The dashboard is Streamlit. PostgreSQL with pgvector handles all persistence.

What the analysis actually surfaces

A few things I didn't expect:

More technical depth does not simply mean more engagement. Very high technical depth often underperforms moderate depth: posts that explain something technical but remain accessible to a broader technical audience tend to outperform posts aimed only at deep specialists.

The tribalism score - which measures how much a post positions an in-group against an out-group - is a strong positive predictor of engagement. That's not a recommendation to write tribal content. It's a finding worth knowing.

Comment polarization is a better engagement signal than raw reaction count on posts that spark debate. A post with 40 reactions and 30 polarized comments often outperforms a post with 200 reactions and no comments in terms of actual reach amplification.

Running it

git clone https://github.com/ChronoCoders/linkforge
cd linkforge
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
playwright install --with-deps chromium
cp .env.example .env

docker compose up -d db
alembic upgrade head

# Run the next two commands in separate terminals; each stays in the foreground.
uvicorn app.main:app --reload
streamlit run streamlit_app/app.py --server.port 8501

For the data export path, place your LinkedIn CSV export in the project directory and use the seed script or the import endpoint. No cookies required.

The full API is documented at /docs when the backend is running.

What's next

The engagement predictor is a random forest trained on whatever history you have, so its predictions improve as that history grows. The next step I'm working on is a proper time-series component - engagement patterns shift over time, and the current model treats all historical posts equally regardless of when they were written.
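
To illustrate the recency problem, here is a minimal sketch of one simple stopgap: exponential decay weights, where a post's influence halves every `half_life_days`. This is not the planned time-series component, and the function name and the 180-day half-life are assumptions for the example.

```python
from datetime import date


def recency_weights(
    post_dates: list[date],
    today: date,
    half_life_days: float = 180.0,
) -> list[float]:
    """Weight each post by age: 1.0 for today, halving every half-life."""
    weights = []
    for posted in post_dates:
        # Clamp at zero so a future-dated row never gets a weight above 1.
        age_days = max(0, (today - posted).days)
        weights.append(0.5 ** (age_days / half_life_days))
    return weights


weights = recency_weights(
    [date(2024, 6, 1), date(2023, 12, 4), date(2023, 6, 7)],
    today=date(2024, 6, 1),
)
print(weights)
```

Weights like these can be passed as `sample_weight` to a scikit-learn estimator's `fit` method, so older posts still contribute to training but count for less.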

The recommendation engine currently generates suggestions at the post level. I want to extend it to content calendar suggestions: given your posting frequency and content mix, what the next two weeks should look like.

Disclaimer

Automated scraping of LinkedIn is prohibited by LinkedIn's User Agreement and Professional Community Policies. The scraping components in this codebase are provided for educational and research purposes only. If you use them, you assume full responsibility for compliance with LinkedIn's terms of service, applicable laws, and data protection regulations including GDPR and CCPA. Account restriction or termination is a real consequence. The recommended approach is LinkedIn's official data export, which carries no such risk.

DEV Community