DEV Community: Sushrut Mishra

Software design documents in 2026, who writes them now?

Sushrut Mishra — Wed, 08 Apr 2026 04:46:00 +0000

AI coding agents generate production code, write test suites, and review pull requests with enough accuracy that most engineering teams trust the output for a meaningful share of their implementation work.

The downstream side of the development lifecycle, everything after the decision of what to build and how, got dramatically faster this year. The upstream side barely moved.

Software design documents, the artifacts where feasibility analysis, service boundaries, dependency mapping, and implementation approach get captured, still depend entirely on one or two senior engineers who carry the system's mental model. That work consumes 60 to 70% of their time, and the entire team queues behind them before a single line of code gets written.

The pre-coding bottleneck that AI skipped over

Every new feature follows the same sequence. Someone creates a Jira ticket. A senior engineer or architect reads it and starts answering the hard questions.

Can this be built on top of our current service topology?
Which downstream consumers will break if we change this data flow?
What implementation approach fits the patterns we already have in production?
Where did we try something similar before and what went wrong?

Those answers require system level context that spans repositories, past incidents, architectural decisions made years ago, and tribal knowledge that lives in very few heads.

Writing a software design document that captures all of this takes a full day or more, and the person qualified to write it is the same person every other team is waiting on for their feature too.

The coding agent sitting downstream is ready to generate code the moment it receives a clear, grounded input. The bottleneck is producing that input. The bottleneck is the software design document.

What makes this bottleneck so persistent

Coding agents operate well within a bounded scope. Give an agent a file, a function signature, a clear specification, and it produces strong output.

Pre-coding work requires the opposite of bounded scope. It requires understanding how dozens of services interact, what operational patterns have failed before, which APIs are deprecated, and what constraints the business layer imposes on the technical design.

That understanding lives in the heads of senior engineers who built or maintained the system over years. When those engineers are available, they produce the software design document and the team moves forward. When they are pulled into another project, on vacation, or have left the company, the team either waits or builds on assumptions that surface as rework three weeks into implementation.

The persistence of this bottleneck comes from the nature of the knowledge itself. System level context resists documentation because it changes constantly. Services get added, dependencies shift, incident patterns evolve.

A software design document written six months ago may reference an architecture that no longer exists. The person writing the document needs live, current understanding of the system, and that understanding refreshes through daily exposure to the codebase and its operational behavior.

Why this matters more now than it did two years ago

Two years ago, the pre-coding phase and the coding phase moved at roughly the same speed. Both were manual, both depended on senior engineering time, and both took days. The ratio between planning and execution was close to balanced.

AI compressed the coding phase from days to hours. The planning phase stayed the same. That compression exposed a ratio that was always there but never felt urgent. When coding took a week and planning took three days, the planning overhead felt proportional. When coding takes an afternoon and planning still takes three days, the planning phase becomes the dominant cost of every feature.

Senior engineers now spend the vast majority of their time on work that happens before the coding agent receives its first prompt. Feasibility analysis, technical design, cross-repo impact assessment, and scope breakdown. The software design document that captures this work is the single highest leverage artifact in the entire workflow, because it determines the quality of everything the coding agent produces downstream.

For teams looking to structure this artifact well, Bito published a comprehensive guide to software design documents that walks through the sections, templates, and practices that make a design doc usable for both engineers and AI agents.

The upstream layer is starting to move

One approach I have seen work is AI Architect in Jira from Bito, which generates software design documents directly from epics and stories before anyone writes a line of code. It builds a knowledge graph of the entire codebase and operational history from past Jira tickets, then runs feasibility analysis, technical design, impact assessment, and scope breakdown against it. The output posts as a structured planning artifact directly in the Jira ticket.

The senior engineer who used to spend a full day producing that artifact now reviews a grounded first draft instead of starting from scratch. Their time shifts from context gathering to the architectural decisions that actually require their judgment.

AI made the coding layer fast. The quality of that speed depends entirely on the quality of the thinking that precedes it. Software design documents capture that thinking, and they are finally getting the same AI treatment that code generation and code review received over the last two years.

Jira tickets were never built for AI coding agents, until now

Sushrut Mishra — Tue, 07 Apr 2026 04:46:16 +0000

84% of developers use AI coding tools. 41% of all code pushed to production is AI generated. Code generation, test writing, PR reviews, the downstream layer moved fast.

The upstream layer did not move at all.

IDC's 2024 survey found engineers spend only 16% of their time on actual application development. The rest goes to operational tasks, CI/CD, security, and coordination.

Microsoft Research found something adjacent. Developers want to spend more time architecting and designing systems. They consistently cannot because that work depends on tribal knowledge held by very few people.

The Jira ticket sits at the center of this problem.

A product manager creates an epic. Two sentences, a vague acceptance criteria, maybe a Figma link. That ticket becomes the input for the entire downstream workflow.

In a human workflow, context gaps got filled through conversation. The developer walked over, asked the senior engineer what the ticket meant, clarified service boundaries, and proceeded.

AI coding agents cannot walk over and ask.

What a thin ticket costs downstream

The agent receives a ticket that says "add retry logic to the payment service." It generates a retry implementation that compiles and passes unit tests.

Then the PR review happens.

A senior engineer reads the diff and spots the problem. The retry logic conflicts with the circuit breaker pattern already on that service. The retry intervals collide with timeout configs on two downstream consumers.

Worse, the payment service publishes events to a queue that three other services subscribe to. The retry behavior will produce duplicate events that none of those consumers are idempotent against.
The rework takes longer than the original implementation.

Research on developer time allocation confirms the pattern. Engineers spend the majority of their time reading and understanding existing code rather than writing new code. AI compressed the writing. The understanding still lives in human heads.

Every rework cycle traces back to the same root cause. Insufficient input at the ticket level.

What the ticket needs to contain now

A Jira ticket feeding an AI agent needs to capture what a senior engineer would explain verbally before coding starts.

Feasibility analysis against the live codebase. Which services, APIs, and data flows does this change touch. What patterns exist for similar functionality. Which deprecated approaches should the agent avoid. What downstream consumers will be affected.

That context lives in the heads of one or two architects. Writing it down manually takes hours, and those are the same hours the team needs from those engineers on three other features.

One approach I have seen work is AI Architect in Jira from Bito. It reads the epic against a knowledge graph of the entire codebase and operational history from past Jira tickets.

It runs feasibility analysis, technical design, impact assessment, and scope breakdown. Then posts the output as a structured planning artifact directly in the ticket comment.

The ticket transforms from a two sentence prompt into a grounded design artifact that both engineers and coding agents can act on immediately.

The senior engineer who used to spend half a day producing that context now reviews a first draft grounded in actual service topology. Their time moves from context gathering to architectural decisions.

The compounding implication

The quality ceiling for AI generated code is set by the Jira ticket that triggered it.

A two sentence ticket produces a best guess implementation. A ticket enriched with feasibility, dependency mapping, and scoped stories produces output that lands close to production ready on the first pass.

The gap between rework and one shot implementation has always been a context gap. Closing it at the ticket level, before the agent receives its first prompt, is where the highest leverage sits.

Claude Code got the architecture wrong (so we ran a controlled experiment to find out why)

Sushrut Mishra — Tue, 17 Mar 2026 16:25:57 +0000

If you have used Claude Code on a large codebase, you have probably felt this. The output compiles. The tests pass. But something feels off. The API surface is parallel to something that already exists. The approach is a workaround dressed up as an implementation. A senior engineer on your team would have done it differently. The instinct is to blame the model. The actual problem is something else entirely.

Coding agents explore large codebases through trial and error

Claude Code, Cursor, and every other coding agent navigates your codebase the same way: grep, glob, read files, repeat. On a small codebase this works well enough. On a codebase with millions of lines across thousands of files, this process produces a systematically incomplete picture.

The agent makes the best architectural decision it can with what it found. When extension points exist but the agent never found them, it creates a parallel implementation instead of extending what already exists. When conventions exist but the agent never saw them, it writes code that a senior engineer would reject in review. The model is capable. The context is incomplete.

The experiment

We ran a controlled test to measure exactly how much this matters. Same agent (Claude Code with Opus 4.6), same task, same codebase (Elasticsearch, 3.85 million lines of Java across 29,000+ files), one variable: whether Bito's AI Architect, a codebase intelligence layer that builds a knowledge graph of your entire system and exposes it to coding agents via MCP, was providing context or not.

The task was implementing deterministic terms aggregation using the TPUT algorithm, a multi-phase distributed coordination problem that requires changes across Elasticsearch's entire search pipeline.

Without AI Architect

Claude Code concluded the framework could not support multi-round shard communication and built a workaround instead. It created a separate aggregation type that forces every shard to return all unique terms in a single pass. Technically functional. Severe memory risk on high-cardinality fields. Zero multi-shard tests. 6 files changed. And critically, not actually TPUT.

With AI Architect

Claude Code understood exactly where to extend the pipeline, identified the correct integration point, followed Elasticsearch's own API conventions, and implemented genuine multi-phase TPUT with threshold computation, refinement rounds, and gap resolution. 27 files changed. Full test coverage across all coordination layers.
Same agent. Same codebase. Completely different architecture.

Why the gap exists

The agent without context made a reasonable conclusion based on incomplete information. It explored a 3.85-million-line codebase without a map and missed the extension points entirely. That is not a failure of reasoning. It is a failure of information.

AI Architect builds a knowledge graph of your entire codebase, mapping architecture, extension points, conventions, dependencies, and call graphs, and delivers that context to your coding agent before it writes a single line of code. The agent stops guessing and starts reasoning about your actual system.

The difference in output reflects the difference in understanding. A 6-file workaround versus a 27-file production-grade implementation. Both came from the same model on the same day.

What this means for your team

Most engineering teams accept the 6-file version because they never saw the 27-file version was possible. The architectural shortcuts your coding agents take today are a direct reflection of what they understand about your codebase, and most of them understand very little.

If your team is running Claude Code or Cursor on a large codebase, this experiment is worth reading in full. We published the complete side-by-side comparison, the layer-by-layer breakdown of what TPUT actually required, and links to both pull requests so you can read the code yourself.

Read the full experiment: The TPUT implementation Claude Code got wrong and AI Architect got right

If you want to connect AI Architect to your coding agent and run it on your own codebase, get started at bito.ai.

How I got Cursor to write code that could actually ship

Sushrut Mishra — Sun, 30 Nov 2025 08:34:12 +0000

AI code generators write more code than ever, and they do it fast. Sahil Lavingia said it well in a recent tweet. If developers now produce five to ten times more code with AI, we need something that can review five to ten times more code too.

That is the real problem. Tools like Cursor generate code that looks right but often contains small mistakes that only show up when the project runs. Wrong Prisma fields, missing checks, loose validation, repeated logic, or scripts that fail silently. None of this feels dramatic. It is just enough to slow you down or break production.

I hit all of this while building an Express TypeScript API. Cursor helped me create the project fast, then I spent real time fixing the parts it got wrong. That is when the pattern became clear. AI code generation is not the bottleneck anymore. Code review is.

The only way to keep up is to bring AI into the review step, not just the generation step.

The rest of this post walks through how that played out in my project, what went wrong, and how AI code review solved the gap Cursor left. I will talk about the tool I used later in the post.

The project I used to test this

I used an Express TypeScript API that handled auth, user management, Prisma, JWT, validation middleware, and a few automation scripts. It had routes, controllers, types, a Prisma schema, and a small shell toolchain for creating pull requests through the GitHub API.

Everything tied together.

The DTOs shaped the request data.
The controllers expected those shapes.
The Prisma model enforced its own contract.
The middleware ran checks before the handlers.
The scripts needed correct error handling and clean parsing.

That structure made it clear when something went wrong. If Cursor guessed a field name, the query failed. If it added logic in the wrong layer, the flow broke. If it skipped error checks in the scripts, automation stopped working.

This project gave me a clear view into where AI generated code drifts from the actual codebase.

Where Cursor started breaking things

When I asked Cursor to generate code for this project, the gaps showed up fast.

It used wrong field names in Prisma queries

Cursor produced queries with fields that did not exist in the Prisma schema. The handlers failed as soon as they hit the database.

It placed validation in the wrong layer

I already had middleware for input checks. Cursor added the same checks inside controllers, which created duplication and forced updates in two places.

It drifted from the actual DTOs

Some controllers expected request shapes that did not match the DTO definitions. TypeScript let a few of these slip through because the structures looked close enough.

It skipped error handling in bash scripts

The pull request scripts used curl and git commands with no exit code checks. If an API call failed, the script continued as if nothing happened.

It parsed JSON in fragile ways

Cursor relied on grep and cut to extract fields from GitHub responses. That approach broke as soon as the response shape changed.

It exposed the GitHub token

One script printed the token to the terminal. That leaked it to shell history and anyone watching the screen.

...and many more.

All of this came from a simple pattern. Cursor generated each file on its own, while the project depended on consistent behavior across routes, controllers, models, middleware, and scripts. Once the pieces drifted, problems stacked up.

How AI code review filled the gap

Cursor generated large chunks of this project fast, but it introduced inconsistencies across files. To keep the codebase stable, I needed something that reviewed the entire diff, not just the file I was currently editing.

I used Bito for this. The review on PR #1 in the repo showed exactly why this layer is required. Here are the concrete issues Bito surfaced in that pull request.

1. Incorrect assumptions about project flow

Bito traced the request path across files and produced an interaction diagram. It showed the exact flow:

Client → Express App → Route Handler → ValidationMiddleware → AuthController → Database → Response.

Cursor did not maintain this order. Bito caught cases where Cursor placed logic in the wrong layer or duplicated it.

2. Duplicate validation across middleware and controllers

In authRoutes.ts, Cursor generated:

router.post('/register', register);
router.post('/register', validateRegister, register);

Same for login.
Bito flagged these as duplicate routes and pointed out that the controller still performed its own validation.
It specifically referenced the sections in validation.middleware.ts (lines 10–50) and authController.ts (lines 11–26).
This was not a style preference. It broke the request flow and doubled maintenance work.

3. Prisma field and type drift

Cursor produced controller logic that assumed certain fields existed. In the PR review, Bito cross-checked controller usage with schema.prisma and flagged mismatches in user lookup and update logic.
Cursor did not verify the fields across files. Bito did.

4. Missing branch checks and error handling in bash scripts

The automation scripts were the worst offenders.
For example, in push-and-create-prs.sh:

git push -u origin main

Cursor wrote this with zero checks.
Bito flagged:

no error handling
no branch existence check
no exit status fallback

It suggested explicit guards, such as verifying refs/heads/branch-a before checkout and returning a non-zero exit code if the push fails.

This tightened the entire CI path.

5. Unreliable JSON parsing

Cursor parsed GitHub API responses with grep and cut, for example:

echo "$RESPONSE_A" | grep -o '"html_url":"[^"]*' | cut -d'"' -f4

This breaks on any format change.
Bito called this out directly in the PR and recommended using structured parsing (jq) or capturing HTTP status codes first.

6. Token exposure in interactive scripts

In create-prs-now.sh, Cursor prompted for token input like this:

read -p "Enter your GitHub token: " GITHUB_TOKEN

This prints the token in plain text and stores it in shell history.
Bito flagged this as a security issue and suggested:

using read -s to hide input

validating token prefix

avoiding echo statements that leak secrets

7. Missing exit codes for automation reliability

Cursor ended the script with:

echo "Done!"

No exit status.
CI cannot detect failures without exit codes.
Bito explained why exit codes matter and gave an explicit fix:

if [ ! -z "$PR_URL_A" ] && [ ! -z "$PR_URL_B" ] && [ ! -z "$PR_URL_C" ]; then
exit 0
else
exit 1
fi

This alone prevents silent deployment failures.

8. Repeated code that violated DRY

In create-prs-automated.sh, Cursor duplicated the entire PR creation block three times.
Bito highlighted the repetition and suggested extracting a reusable function.
Cursor rarely attempts global refactors on its own.
Bito is designed to detect them because it reviews the diff holistically.

9. Cross-file consistency issues

This was the biggest win.
Bito reviewed:

authRoutes.ts
validation.middleware.ts
authController.ts
userRoutes.ts
userController.ts
all four bash scripts

Cursor did not track how these pieces interacted.
Bito cross-referenced them.
This prevented logic drift across layers.

All of this came from one insight. Cursor can write code fast. It cannot guarantee consistency across the entire project. AI code review filled that gap in a measurable, concrete way, and the PR made that clear.

The workflow that produced code I could ship

Once I saw how Cursor and Bito behaved on the same project, I locked in a workflow that kept speed and removed breakage. The sequence is simple, but every step matters.

1. Generate code with the Cursor in small steps

I stopped asking Cursor to create large chunks at once. I asked for one file or one change at a time. This reduced cross-file drift and made each review cycle easier to reason about.

2. Run Bito inside the IDE immediately after each change

Bito reviewed the updated file in the context of the full codebase.
If a controller referenced the wrong field, Bito pointed back to the Prisma model.
If the change introduced duplicated validation or unnecessary checks, Bito flagged both spots.
If a script skipped exit code checks, the review highlighted the exact line.

3. Fix the issues while the context is fresh

I applied the suggestions right away.
If the issue required refactoring, I asked Cursor for a targeted fix. For example, if Bito flagged repeated PR creation blocks, I asked Cursor to extract a function and replaced the duplicates.

4. Open a pull request and run Bito again on the full diff

This is where most AI generated code breaks. Single-file checks miss cross-file inconsistencies.
In the PR review for this project, Bito identified duplicate register and login routes, repeated validation in controllers, missing error handling across scripts, fragile JSON parsing, and token exposure.
The second review pass ensured nothing slipped into the main.

5. Merge only when the review produced zero high-risk issues

Once Bito showed a clean review, I merged the PR.
This kept the repo stable while still letting Cursor generate large amounts of code.

This loop removed the guesswork. Cursor handled generation. Bito handled cross-file review. Together, they gave me a development flow where the output moved fast and stayed consistent.

Conclusion

AI helps you write more code, but it also adds more chances to slip. I saw that every time Cursor moved faster than the rest of the project could keep up. Once I added AI code review, the whole workflow settled. The code lined up across files, the scripts stopped failing in silence, and I spent less time chasing small errors.

If you use AI to generate code, pair it with an AI reviewer. Do not rely on one without the other. I use Bito for this, and it has saved me a lot of cleanup time. Try any AI code review tool you trust, but use one. It makes the entire process far more stable.

Code Smells Explained: Common Patterns to Watch Out For

Sushrut Mishra — Thu, 25 Sep 2025 12:00:01 +0000

I spend a lot of time at Bito running AI code reviews, both on my own projects and with the team.

Over time, I started noticing a pattern. The reviews weren’t just pointing out syntax issues or missing tests. They kept surfacing something subtler — code smells.

At first, I didn’t think much of it. If the code compiles and the feature works, isn’t that good enough? But the more I worked with smelly code, the more I realized how quickly it drags you down.

You open a file and suddenly it takes 20 minutes just to understand what’s happening. You go to fix a bug, and you’re scared to touch anything because one change might break five other things. You hand it off to a teammate, and they look at you like you just cursed them.

That’s when it clicked for me: code smells are not bugs, but they’re warning signs. Ignore them, and you end up buried in technical debt. Spot them early, and you save yourself a lot of pain later.

Since I keep running into this during AI code reviews, I thought I’d put together some thoughts on what code smells are, the ones that show up the most, and a few practical ways to deal with them.

Why Code Smells Matter

Code smells are easy to ignore in the beginning. The program runs, tests pass, and everything looks fine on the surface. But over time, those small issues start to grow.

A long function becomes harder to follow each time someone adds a new condition. A giant class turns into a dumping ground where nobody remembers what belongs where. Copy-pasted logic means fixing one bug in three different places.

The problem isn’t that the code stops working. The problem is that it slowly becomes harder to read, harder to test, and harder to change without breaking something else. This is how technical debt sneaks in.

I’ve seen it firsthand during reviews. A piece of code that seemed “good enough” in the beginning later took twice as long to debug because the design had rotted. If you work in a team, it also slows everyone else down, because they need to untangle the mess before they can even start adding new features.

That is why code smells matter. They are not just about clean code for the sake of clean code. They are about saving yourself and your team from pain later.

The Most Common Code Smell Patterns

Over time you start to notice the same kinds of smells showing up again and again. They look different in every codebase, but the patterns are surprisingly common. Here are a few that stand out the most.

Long Methods

Long methods make it hard to follow logic, and hard to test. Break them into small functions that do one thing.

Smell:

function processOrder(order) {
  // validate
  if (!order.id || !order.items) throw new Error('invalid')
  // calculate totals
  let subtotal = 0
  for (const item of order.items) {
    subtotal += item.price * item.qty
  }
  let tax = subtotal * 0.08
  // apply discounts
  if (order.coupon) {
    subtotal -= order.coupon.amount
  }
  // build payload
  const payload = { id: order.id, total: subtotal + tax }
  // send to billing
  sendToBilling(payload)
  // notify user
  sendEmail(order.userEmail, 'order processed')
}

Refactor:

function validateOrder(order) {
  if (!order.id || !order.items) throw new Error('invalid')
}

function calculateTotal(items, coupon) {
  let subtotal = 0
  for (const item of items) subtotal += item.price * item.qty
  if (coupon) subtotal -= coupon.amount
  const tax = subtotal * 0.08
  return subtotal + tax
}

function processOrder(order) {
  validateOrder(order)
  const total = calculateTotal(order.items, order.coupon)
  sendToBilling({ id: order.id, total })
  sendEmail(order.userEmail, 'order processed')
}

Duplicate Code

Duplicate logic causes multiple fixes. Extract shared code once, then reuse it.

Smell:

function createAdminUser(data) {
  const user = {
    name: data.name,
    email: data.email,
    role: 'admin',
    createdAt: new Date()
  }
  saveUser(user)
}

function createGuestUser(data) {
  const user = {
    name: data.name,
    email: data.email,
    role: 'guest',
    createdAt: new Date()
  }
  saveUser(user)
}

Refactor:

function buildUser(data, role) {
  return {
    name: data.name,
    email: data.email,
    role,
    createdAt: new Date()
  }
}

function createAdminUser(data) {
  saveUser(buildUser(data, 'admin'))
}

function createGuestUser(data) {
  saveUser(buildUser(data, 'guest'))
}

Large Classes (God Objects)

A class with many responsibilities becomes hard to change. Split responsibilities into focused classes.

Smell:

class OrderService {
  constructor(db) { this.db = db }

  createOrder(data) { /* validate, calculate, save, notify */ }
  calculateTotals(items) { /* lots of logic */ }
  sendInvoice(order) { /* email logic */ }
  exportOrdersCsv() { /* file logic */ }
}

Refactor:

class OrderCalculator {
  calculate(items, coupon) { /* totals logic */ }
}

class OrderRepository {
  constructor(db) { this.db = db }
  save(order) { /* db save */ }
}

class OrderNotifier {
  sendInvoice(order) { /* email logic */ }
}

class OrderService {
  constructor(calc, repo, notifier) {
    this.calc = calc
    this.repo = repo
    this.notifier = notifier
  }

  createOrder(data) {
    const total = this.calc.calculate(data.items, data.coupon)
    const order = { ...data, total }
    this.repo.save(order)
    this.notifier.sendInvoice(order)
  }
}

Primitive Obsession

Passing raw primitives hides intent, and it makes validation and behavior scattered. Create small types or objects.

Smell:

function createUser(name, email, addressLine1, addressLine2, city, zip) {
  const user = { name, email, addressLine1, addressLine2, city, zip }
  saveUser(user)
}

Refactor:

function buildAddress(line1, line2, city, zip) {
  return { line1, line2, city, zip }
}

function createUser(name, email, address) {
  const user = { name, email, address }
  saveUser(user)
}

// usage
const addr = buildAddress('123 St', '', 'Pune', '411001')
createUser('Asha', 'asha@example.com', addr)

Feature Envy

When a method reaches into another object to pull data, the logic likely belongs closer to that data. Move behavior to the right place.

Smell:

class OrderFormatter {
  format(order) {
    return `${order.user.firstName} ${order.user.lastName} placed order ${order.id}`
  }
}

Refactor:

class User {
  constructor(firstName, lastName) {
    this.firstName = firstName
    this.lastName = lastName
  }

  fullName() {
    return `${this.firstName} ${this.lastName}`
  }
}

class OrderFormatter {
  format(order) {
    return `${order.user.fullName()} placed order ${order.id}`
  }
}

Data Clumps

If the same group of values travels together, pack them into an object. This reduces errors and clarifies intent.

Smell:

function scheduleMeeting(title, startDate, endDate, organizerName, organizerEmail) {
  // lots of params passed around
}

Refactor:

function scheduleMeeting(title, range, organizer) {
  // range is { startDate, endDate }
  // organizer is { name, email }
}

const range = { startDate: '2025-09-01', endDate: '2025-09-01' }
const organizer = { name: 'Sam', email: 'sam@example.com' }
scheduleMeeting('sync', range, organizer)

How to Spot Code Smells in Practice

Code smells are tricky because they do not break your build or throw an error. The code still runs, which makes it easy to miss them. But once you know what to look for, they start standing out everywhere.

Peer reviews and pair programming

Having another set of eyes on your code helps a lot. A teammate who has not been staring at the same file for hours will quickly notice when a method is too long, a class is too heavy, or logic feels out of place.

Automated tools

Linters and static analysis tools can catch certain smells, like duplicate code or unused variables. AI code review tools go a step further and point out design-level issues that humans might overlook during a busy sprint. This is something I see every day using Bito. Our blog on code smell detection goes deeper into how AI reviews help spot these patterns early.

Your gut feeling as a developer

Sometimes you just know something smells off. If you have to scroll too much, pass around too many parameters, or write the same code twice, that is usually a sign. Trust that feeling and take a closer look.

The goal is not to obsess over every small thing, but to develop awareness. Once you can recognize these signals, you can choose which ones are worth fixing now and which can wait.

When to Fix vs When to Let Go

One of the hardest parts of dealing with code smells is knowing when to act. Not every smell deserves your attention right away. Some are harmless quirks, while others will slow your team down if you leave them alone.

Fix right away

If a code smell is blocking readability, slowing down debugging, or creating duplicate logic, it is usually worth fixing on the spot. For example, a long method that you are already editing is the perfect candidate for a quick cleanup. Small changes made in context are the easiest wins.

Let it go (for now)

If the code works, is rarely touched, and nobody is struggling with it, you might not need to refactor immediately. A messy utility function that runs once a month is not as urgent as a controller that every developer touches daily.

The Boy Scout Rule

A good way to balance this is to follow the Boy Scout Rule: leave the code cleaner than you found it. If you touch a file for a new feature or a bug fix, take a moment to clean up the worst smells while you are in there. Over time, the whole codebase improves without big refactor projects.

The real skill is not fixing everything, but knowing what to fix now and what to leave alone. That discipline saves you from wasting time while still keeping the codebase healthy.

Will AI Replace Developers?

Sushrut Mishra — Mon, 18 Aug 2025 04:08:35 +0000

I wrote code full time for two years, then switched to marketing because I liked telling the story behind the product more than shipping the product itself. I’m not a developer anymore, but a question I come across every day is: “Will AI replace developers?”

Since ChatGPT and Copilot blew up, every other message in my inbox is the same question. It is a fair worry. Tools can now spit out a React component or a Python script in a single prompt. That feels radical if you last checked in on AI back when autocomplete meant guessing the next three letters.

Why you should listen to me?

Because I have lived on both sides of the screen. I have shipped features, and I have sold them. The short answer from where I sit is no, AI is not walking into your stand-up and taking your laptop.

The longer answer is that the job is changing in real time. The keyboard work that used to eat a morning now takes minutes, and the skills that keep you valuable are moving up the stack toward architecture, review, and product sense.

This post breaks down why that shift is happening, what the big studies and forums really say, and what you should do now to stay ahead. Plus a quick note on how an AI code review layer like Bito fits into the workflow.

Will AI really replace software engineers? What the numbers show:

I skimmed the biggest studies and news pieces to see if the chatter lines up with reality. The Coursera deep dive on the question “Will AI replace developers” lands on a clear answer:

AI tools handle routine chores, yet they still lean on human skill for design, security, and new ideas. A full handoff is not coming any time soon.

Another signal comes from the GitHub Copilot lab trial. Developers who used the tool finished a standard JavaScript task 55.8% faster, yet nothing in the paper hints at removing the human role.

Fear still shows up in surveys. An Evans Data poll highlighted by Computerworld found that 29% of engineers worry they could be replaced by AI one day. That number is real, but the same article notes broader concerns about platforms going obsolete, which reminds me that tech anxiety is nothing new.

The bottom line so far: research shows AI boosts throughput, articles warn about limits, and a slice of developers stays nervous, but none of the evidence says the job itself is disappearing.

Developer tasks that AI already handles well

AI tools shine in the repetitive layer of software development. Automated code generation, static analysis, and predictive analytics are no longer science fiction.

Here is where they save the most time right now:

1. Boilerplate and docstring generation

Paste a prompt, get a clean class with constructors, getters, setters, and clear docstrings. Tools like Cursor or Copilot cut the grunt work so you focus on the core logic.

2. AI code review and quick bug fixes

AI code review tools today scan your code in seconds. I’m biased here but Bito does this with a private local index, leaves inline suggestions on style, security, and logic, and links to docs for a quick fix.

Bito’s AI Code Review Agent plugs into GitHub, GitLab, Bitbucket, and your IDE (coming soon), posts comments like a real teammate, and learns from each review.

3. Timeline estimation from commit history

Feed past commits into a model and you get shipping dates that beat gut instinct. The algorithm maps similar tickets, crunches cycle times, and offers a realistic delivery window, boosting project planning accuracy and developer productivity.

I may be thinking way ahead, but this is already a reality not too far from 2025.

Where AI still falls short

Large language models feel impressive in a demo, yet they still miss key parts of real software development work.

1. New algorithms and greenfield design

Ask the model to combine two known patterns and it shines, ask it to build a data structure for an unseen edge case and it stalls. Creative problem-solving still sits with the engineer who understands both the codebase and the customer need.

2. Hallucinated code that compiles but breaks in production

The model predicts tokens, it does not reason about runtime state. I have seen a neat looking fix that passes the tests, only to leak memory on day one in prod. Someone must read the output, trace the path, and prove it safe.

3. Security, IP, and data leaks

Coursera flags a risk most hype posts skip. Models can repeat licensed snippets or suggest logic that opens a door for attackers. Teams have to run checks, scrub prompts, and own the final call on what ships.

Real numbers on productivity

A controlled experiment on GitHub Copilot (arXiv) asked professional developers to build an HTTP server in JavaScript. Those with Copilot finished the task 55.8% faster than the control group. Speed jumped, but every participant still wrote tests, reviewed diffs, and approved the merge.

Upgrade kit for today’s coder

I no longer push code to production, yet I still speak with dev teams every week. The fastest teams I see have three habits in common. If you write software for a living, stack these on top of whatever you already do.

1. Sharpen your prompt craft

Save prompt templates the same way you save bash aliases. Each template holds three parts: short context, exact task, and required output format. The clearer the prompt, the fewer edits you make later.

Test each prompt at least a dozen times. Use Cursor, Windsurf, or Copilot and see what works for you. I bet you use VS Code as your IDE. But my recommendation is to find an AI IDE alternative for VS Code. Use AI, code fast, grow faster.

2. Keep computer-science basics tight

Data structures, algorithmic thinking, and solid design patterns let you judge AI output on sight. When a model suggests a quadratic loop on a hot path, you swap in a hashmap without blinking.

3. Run an AI code-review layer before peer review

Bito’s AI Code Review Agent drops straight into GitHub, GitLab, and Bitbucket (coming to IDE soon). One click and it indexes the whole repository with abstract syntax trees and vector embeddings, so every comment arrives in proper context.

It posts a pull-request summary, flags security issues, suggests test cases, and offers one-click fixes. It also offers incremental reviews. That means Bito scans only new commits, and its changelist view highlights the files you really need to open.

Bito is SOC 2 Type II certified, it doesn’t store your code, and offers secure code reviews. You can run it in the cloud or on-prem.

My favorite feature is Custom Review Guidelines. This feature is built for teams with specific code review standards. Whether you follow internal naming conventions, prefer a certain formatting style, or want the agent to avoid flagging certain patterns, you can now set all that yourself. 

You can add general rules or rules for specific languages. You can use a template or write everything from scratch.

Conclusion

I wrote code for two years before I crossed the hall into marketing, and that switch taught me something useful for anyone still in the editor. Your real worth was never the lines you typed per day, it was the way you solve problems and guide a product from idea to release.

The new wave of AI tools, from Copilot to Bito’s AI code review agent, just makes that truth louder. They sweep up boilerplate, spot bugs, and keep a steady eye on security, which means you have more room for architecture decisions, performance trade-offs, and, yes, the occasional late night inspiration that a model cannot fake.

So the next time someone asks, Will AI replace developers, tell them it is already replacing the boring parts. The thinking parts, the bits that need context and judgment, are still yours.

Use the tools, direct them with clear prompts, and keep your fundamentals sharp. That is how you stay ahead and how the craft moves forward.

AI Code Reviews: My 150-Day Experience

Sushrut Mishra — Tue, 08 Jul 2025 08:04:46 +0000

In 2021, I was deep in Salesforce development, reviewing pull requests, fixing edge cases, and trying to ship clean code. AI code reviews weren’t a thing back then.

A few years ago, code reviews meant reading every line closely, juggling context across files, bugging other developers for explainers, and dropping comments that often went unnoticed.

Then in 2025, I joined Bito.ai. It’s been more than 5 months, and I’ve been using Bito’s AI Code Review Agent on real pull requests.

Why I chose to work with Bito

I came across Amar Goel, Bito’s cofounder and CEO, while I was working in technical writing and developer marketing. We started talking, and he shared what they were building. An AI agent that reviews pull requests in GitHub without storing your code.

That felt so cool. I kept thinking about 2021.

Back then, every pull request I opened meant asking teammates to review logic, naming, structure, and edge cases. It took time, added friction, and sometimes, things slipped through.

The idea of getting instant, contextual suggestions inside a PR, without giving up your code, felt like something I would have wanted as a developer. It made sense, felt practical. It felt like something made for how developers actually work.

I joined Bito shortly after.

This post is the result of 150 days working with Bito and using the AI Code Review Agent on my own code. Every example I share here comes from that experience.

AI in pull requests: What changed for me?

Back in 2021, most of my pull request reviews used to follow the same pattern. I would open the diff, scroll through my changes, and try to catch anything that felt off.

Sometimes I would miss things. Sometimes I would forget what I was thinking when I wrote the logic in the first place.

Once I started using Bito’s AI Code Review Agent, I noticed the difference right away.

The AI-generated comments showed up directly in the pull request.

They pointed out specific lines and explained why something could be improved.

The suggestions were clear. If a function was too long or a condition could be simplified, the agent said so.

If I reused a pattern that could be abstracted, it highlighted that too.

I did not need to change how I reviewed pull requests. I just had more context in the same place. That saved time and made the feedback loop tighter.

The inline suggestions inside my PRs were the easiest part of the experience to adopt. I wrote more about that experience in this post on how I started chatting with my AI code reviewer.

Next, I’ll walk through what Bito is doing now, what new features we are working on, and why I’m betting on it.

AI code reviews: What we’re doing at Bito?

Over the past 150 days, Bito kept adding features that made pull request reviews smoother and more aligned with real-world development. And I tried them all.

We’ve already listed out all the major features that Bito’s AI Code Review Agent offers in the product page. Here’s a quick overview:

Reviews pull requests
Gives inline, contextual suggestions
Catches code smells, logic issues, and style problems
Learns from your feedback and adapts
Doesn’t store your code or use it for training … and much more.

In this section, I specifically want to talk about the major updates Bito released and things I did, since I joined:

1/ Custom code review rules

This is one of the latest and coolest updates Bito dropped after I joined. Bito’s AI Code Review Agent lets you enforce your own coding standards directly in PRs:

Automatic learning from feedback: When I mark a suggestion as irrelevant, the AI learns not to offer that again. Once you mark a similar suggestion as irrelevant three time, Bito creates a custom rule. Read about it here.
Custom rule uploads: This is the latest update! You can now create a custom guideline directly within your Bito dashboard. I created a video walkthrough. Watch it on YouTube here.

More on this is detailed in How I Personalized Bito’s AI Code Review Suggestions, which is actually about defining rules. The feature works as expected!

2/ Multi-product seat based billing

Earlier in June, Bito also rolled out a redesigned dashboard for managing users and billing. This was a big deal for teams.

The new member management view shows exactly how many seats your workspace has, how they’re distributed between IDE users and pull request reviewers, and who’s using what.

It’s now seat-based by product, which makes scaling much easier to track.

There’s also an auto-assignment option. Meaning: new dev joins and they get a seat automatically. Or turn it off and do it manually. That’s up to the admin.

I also created a video walkthrough to explain this update. Watch it on YouTube here.

3/ Chatting with the AI Code Review Agent

This feature was launched in April. It gives you the ability to chat directly with Bito’s AI Code Review Agent. You can ask the agent follow-up questions on its suggestions. Things like:

“Why is this a problem?”
“Can you suggest a cleaner way to do this?”
“What’s an alternative approach?”

I tried this in one of my own PRs and wrote about it in Chatting With My AI Code Review Agent.

Also worth noting: Bito supports over 20 human languages in this chat experience. I mostly use English, but I tested a few queries in Hindi just to see. Works just fine.

4/ Agentic code reviews

This was a game-changer. Bito’s AI Code Review Agent is now fully agentic. That means it no longer follows a fixed chain-of-thought pipeline.

Instead, it dynamically figures out what context matters, explores the code more freely, and generates suggestions based on real structure and patterns.

Read the doc.

5/ Multilingual code reviews

Bito now supports over 20 human languages in review comments. That includes English, Hindi, Chinese, and Spanish.

If you’re working with global teams or reviewing code with non-English speakers, this is just one less thing to worry about.

Read the doc for the steps.

6/ Amazon Nova Lite 1.0 in Bito

With the release of Bito’s free tier, Nova Lite became a key model powering the Code Review Agent for individual developers. 

What that means for developers? You get free AI code reviews for everyday tasks that don’t need deep reasoning. The experience still feels tight and helpful.

Amazon even featured this in a case study with Bito. You can read it here: Amazon Nova + Bito Case Study

Other Updates

Bito rolls out updates weekly. You can find all release notes and documentation on our official site.

Most of the features I used during these 120 days came from regular product iterations that improved my workflow quietly in the background.

See the full release changelog and docs here: Bito documentation.

Bito, its competitors, and benchmarking

If you’re a marketer in tech, you can’t just skim docs and pitch jargons. You need to understand the product deeply. And that means using it.

For me, that also meant trying out the competition. Because if I’m going to talk about Bito, I need to know exactly how it stacks up.

So I did what any curious developer-marketer would do. I opened real pull requests and used Bito side by side with the other tools. A few examples:

Bito vs Coderabbit

This one was a full comparison. I took the same PR and ran it through both tools. Bito gave sharper, more relevant suggestions. It caught things I actually cared about.

Coderabbit left more noise than value. Less signal, more cleanup. I documented everything here: Bito vs Coderabbit

Bito vs GitHub Copilot

Then came Copilot. It’s great for writing code in the IDE, sure. But when it comes to reviewing code in pull requests, it’s just not built for that. No inline feedback, no context, no real review flow.

Bito, on the other hand, lives inside your PR. It gives you comments where they matter, grounded in the actual diff. You can see my detailed comparison here: How is Bito Different from GitHub Copilot?

Benchmarking AI code review tools

After using Bito for a few months, I also wanted to see how it stacked up in more structured comparisons. The team at Bito has actually built a proper benchmarking setup for this.

We test the tool on a known set of code issues across multiple languages, and compare it with other tools in the space. Based on the benchmarking:

Bito performed well. It had the highest issue coverage and consistently caught more high-severity bugs in languages like TypeScript, Python, JavaScript, Go, and Java. That kind of data gave me more confidence in what I was seeing in my own pull requests.

You can check out the whole thing here: Benchmarking the Best AI Code Review Tool

Final thoughts

150 days ago, I joined Bito to understand how AI fits into real code reviews. Since then, I have used Bito on live PRs, tried every feature it shipped, and compared it directly with other tools.

As someone in developer marketing, I believe in understanding the product deeply. That means using it the way real developers do.

After five months, I see why Bito is different. It provides codebase aware PR suggestions, keeps your code private, respects your coding standards, and makes PR reviews more efficient.

Security matters. Many AI (or non-AI) tools still store your code or use it to train their models. That should not happen in a secure code review process. I wrote about that here: Secure Code Review Process

If you are curious, try Bito on a real pull request. That is the only way to see what it can actually do:

Code Security for Developers: How to Write & Review Code Securely

Sushrut Mishra — Thu, 05 Jun 2025 19:06:13 +0000

When people say code security, most developers think about things like SQL injection, broken auth, or insecure APIs. All of these matter, no doubt. But that is not why I'm writing this like.

This is about something simpler. What happens to your code when you send it for review? Where it goes? Who sees it? And whether it gets stored somewhere you never agreed to.

When I first started writing code in 2021, I knew nothing about code security or code review. For me, development and review meant: You push your changes, someone leaves comments, you clean things up. That was it.

What I did not think about was where that code might end up. A lot of tools, especially the AI ones, take your code and send it out of your dev environment. Some of them store it. Some of them train on it. And most of the time, they do not tell you that clearly.

That is a problem. Because if you care about code security, you have to care about the review process too.

What is code security

Code security is one of those terms that gets thrown around a lot. People usually connect it to OWASP lists or app vulnerabilities or something that the security team deals with once the code is already written.

That is not wrong. But it is only part of the picture.

At the most basic level, code security means writing code that does not put your users or systems at risk.

That includes avoiding things like SQL injection, hardcoded secrets, or logic that can be abused. It also means following secure coding practices like input validation, access control, and safe error handling.

But there is another part that does not get talked about enough. And that is keeping your actual source code protected while you are working on it. That includes the way you share it, review it, and run it through tools.

Because the code itself is valuable. It holds your product logic, your architecture, and sometimes things like internal tokens or data structures that should not leave your environment.

So code security is not just about writing safe code. It is also about protecting that code during the entire development process.

That includes reviews.
That includes automation.
And that includes AI tools.

How secure code gets exposed during review

When you hear the word "code review," you probably think of inline comments, feedback on naming, maybe some logic suggestions. That is what most of us learn. The whole process is meant to improve code quality. And that part still matters.

But here is what most devs never get told.

A lot of tools that help with code review are doing more than just reading your code. They are moving it.

Some of them upload it to their own servers. Some keep logs. Some save it for product analytics. And a few even use it to train large language models. All of this can happen silently.

The problem is, you usually do not know. You install a plugin, connect your IDE, and suddenly your pull request is part of someone else's training data.

And unless you are reading every privacy policy and security FAQ, you probably will not catch it. I'm talking so much about this because once your code leaves your machine, you do not control what happens next.

Even if it is encrypted in transit.
Even if it is deleted later.
There is still a window where something could go wrong.

This is how secure code gets exposed. Only because you trusted a tool that moved your code somewhere it should not have gone.

How to actually review code securely

Writing secure code is only half the job. You also need to make sure the way you review it does not open up new risks. That starts with being a little more curious about the tools you use.

The first thing to check is where the tool runs. If it runs locally, inside your IDE, that is a good sign. It means your code stays on your machine.

If the tool needs to connect to the internet or send your code to a server for processing, that is where things get risky.

The second thing is storage.

Some tools store code temporarily. Some keep it longer. Some train models on it. You need to check for that.

Read the docs. Look for clear statements about how your code is handled. If you cannot find that, assume the worst. (Paranoid, maybe?)

The third thing is access.

Who can see your code once it leaves your laptop? Is it going to a third-party cloud service? Is it getting logged somewhere? Even tools that say your data is secure might still be copying it around for “product improvement.”

That is not safe.

So here is a simple way to review code securely:

Use tools that process code locally
Avoid tools that store your code or send it to external servers
Check for clear data handling policies
Ask your team who has access to your source code during reviews
Keep sensitive logic out of third-party tools if you are not sure how they work

You should not have to give up convenience or automation. You just need tools that were built with security in mind from the start.

The good news is those tools exist. And they are getting better.

A safe way to review code with AI

There is nothing wrong with using AI to speed up your code reviews. In fact, it helps a lot. You catch issues faster. You get help refactoring. You waste less time on obvious stuff.

The problem only starts when those tools take your code and store it somewhere you do not control.

That is where Bito fits in.

Bito runs inside your IDE. It reviews your code locally. It gives you real-time suggestions without uploading anything to the cloud. There is no storage, no data training or tracking.

Your code stays where it should stay — on your machine.

It can still flag security risks like SQL injection, hardcoded secrets, or unsafe input handling. It can help simplify functions, reduce complexity, and make reviews less painful.

You get all of that, but without giving up your code.

For teams that work on sensitive logic or internal systems, this matters even more. You cannot risk your source code showing up in some model later. Bito keeps that from happening by never storing or sharing any part of it.

If you want to go deeper into why this matters, I wrote more here: Secure Code Review Process: A gap in Code Security

Secure code is not just about what you write. It is also about how you review it. Bito helps you keep both parts tight.

Final thoughts

If you are learning how to write secure code, do not stop at the code itself. Think about where it goes. Think about how it is reviewed. The tools you use are part of your security story, whether you realize it or not.

You do not need to ditch AI or slow down your workflow. You just need to stay in control of your code. That means knowing what your tools are doing behind the scenes and picking ones that respect your boundaries.

Start simple. Keep your reviews local. Ask questions when something feels off. And use tools that were actually built for developers who care about security.

That is how you build secure habits early. That is how you ship safe code later.

Cursor AI : A Code Editor That Refused to Be Acquired

Sushrut Mishra — Thu, 29 May 2025 16:37:43 +0000

If you are wondering what is Cursor, it is an AI-powered code editor built on top of VS Code. Cursor does not just drop suggestions into your file. It actually understands your codebase. It reads your logic. It figures out what you are trying to build and helps you get there faster.

Cursor was built by a team of engineers from MIT. They launched it through a company called Anysphere. The tool spread fast through dev circles. It felt different from other AI tools. It slowly became the main editor for a lot of people.

Then things got interesting. OpenAI wanted to buy Cursor. This was around the time they were also looking at another tool called Windsurf (not too long ago).

In the end, OpenAI picked Windsurf. Cursor said no to the offer. There is a thread about this on r/LocalLLaMA where people broke down what happened. Cursor turned it down because the team has bigger plans. They want to build a fullstack developer platform. Something more than a tool that just sits in your workflow.

This article walks through the whole thing. From the start of Cursor to how it works today. Why it matters. Why the team stayed independent. And which other tools are out there if you want to compare.

Who built Cursor and why

Cursor came out of a company called Anysphere. The team behind it started at MIT. Four engineers. They were not trying to raise hype or chase funding. They were just tired of fighting their editors.

They were dealing with large codebases. The kind where grep starts to fall apart. Context switching eats your focus. Every time you jump between files or revisit old logic, you lose time. They built Cursor to fix that.

The editor started as a side project. They dropped it quietly, with almost no noise. A few devs picked it up. Then word spread. People were sharing screenshots, test results, even full feature breakdowns in Discord and Reddit threads. Cursor started doing something other AI tools could not. It followed your train of thought.

Instead of just spitting out code from prompts, it looked at the whole repo. It followed function calls. It answered questions with actual context. And it let you edit code the way you think about it—one layer at a time.

The team shipped fast. Every week there was something new. Better context depth. Faster responses. Real codebase indexing. It felt like they were building for themselves, not for show.

At this point, Cursor was still free. They had not launched some giant pricing plan. There were no flashy launches. Just updates and a growing user base of actual developers.

That early momentum is what caught OpenAI’s attention later on. We will get into that in a bit. First, let’s talk about what Cursor actually does differently.

What makes Cursor different

Most AI coding tools feel like helpers. Cursor feels like another dev on your team. Not some floating text box. Not a one-off reply generator. It sits inside your editor, watches how you code, and helps without getting in the way.

You can select a block of code and ask it to explain what it does. You can ask it to refactor. You can even ask why a test might be failing, and it will trace back through your code to give a real answer. It looks at the full picture.

The biggest difference is how it handles context. Cursor pulls in actual code files, not just lines around your cursor. It understands project structure. It knows what your functions connect to. It follows imports. It knows what changed between commits.

You do not need to write long prompts either. You can say something simple like "make this cleaner" or "convert this to use hooks" and it gets the job done.

There is also a built-in chat. It sits in the sidebar, like a teammate you can talk to. You can ask how a file works. You can ask it to find where something breaks. The best part is, you never leave your flow. No browser. No copy and paste. No API juggling.

Cursor also handles search better than most editors. You can ask it stuff like "where is this function used across the app" or "what causes this error" and it returns actual answers, not just file hits.

It still works with all your VS Code extensions. You do not have to rebuild your whole setup. You just get smarter behavior inside something you already use.

This is why people started switching full time. Cursor is not just helping write code. It is helping understand it. That changes everything.

Why Cursor said no to OpenAI

Sometime in early 2024, OpenAI started looking for a serious AI coding assistant. They were already behind tools like Cursor, Replit Ghostwriter, and even their own Copilot collab with GitHub. They wanted something better. Something that could go deep inside the code. Something with momentum.

Cursor was on that list.

From what leaked out in dev threads, including a long one on r slash LocalLLaMA, OpenAI made an offer to acquire Cursor. That never happened. Cursor said no.

OpenAI moved on and ended up acquiring another tool called Windsurf. The deal went quiet for a while, but once Windsurf showed up in OpenAI’s ecosystem, the dev community started putting the pieces together.

So why did Cursor walk away?

The short answer is focus. The team at Anysphere is not trying to build a feature.

They are not trying to get absorbed into someone else's stack.
They want to turn software into something more than just apps and frameworks.
They are thinking about software itself as a platform. A place where devs can work with AI at every step. Not just when typing code.

From the outside, it looked like a wild move. Who says no to OpenAI?

From the inside, it makes some sense. Cursor is growing fast. Devs are switching full time. Feedback is rolling in from real projects. They did not need to sell. (though i still can't fathom how can someone refuse a multi-billion dollar deal?)

Anyway, that decision set the tone. Cursor is not a side project anymore. It is playing long game.

Cursor is great, but not flawless

I have used Cursor enough to know where it shines. I have also run into stuff that made me want to shut it down and switch back.

The biggest problem is that it sometimes guesses too much. When you ask it to fix something, it tries to rewrite more than you expected. It might change logic that did not need to change. Sometimes it over-explains. Other times, it gets stuck in vague answers that look smart but do not help.

The chat can get annoying if you are deep in a complex bug. It starts losing track of what you asked a few messages ago. You have to rephrase or start over. When it works, it works really well. When it breaks, it feels like trying to debug with someone who did not read the code properly.

There are also limits on context. Cursor pulls in more than Copilot, but it still misses edge cases in larger repos. Especially in monorepos or projects with nested services, it can lose track of cross-references.

And yeah, it is not lightweight. Cursor feels heavier than plain VS Code. Startup time is slower. The AI features take some time to kick in, especially on cold starts. You feel that lag when you are jumping into a new project or switching branches.

Pricing is also starting to become a question. It was free at first, which helped it spread. Now it is starting to cost real money. For some people, it is worth it. For others, especially folks who only use the AI once in a while, it might feel like overkill.

I am not saying Cursor is bad. I still use it. I just do not want to pretend it solves everything. It is a great tool, not a perfect one.

If Cursor does not click, try these

Not every dev wants to switch their whole editor. Some people want lightweight. Some want in-browser. Some want to keep VS Code but add smarter suggestions. Cursor is not the only option out there. I have tried a bunch of others, and here is what I found.

For a full breakdown, you can check this list I put together earlier: Best Cursor Alternatives — it goes deeper into features, pricing, and which tools work best for what kind of workflow.

GitHub Copilot

Copilot is the most well-known. It is built into VS Code and backed by OpenAI. It gives solid autocomplete suggestions as you type. For small functions or boilerplate, it works fine. It does not understand full context very well. You can’t ask questions like you would in Cursor. It also gets stuck in loops sometimes and repeats itself. Good for fast typing, not great for deep debugging.

Bito

Bito is built for devs who want AI help without leaving their flow. It works inside your IDE and helps with code explanations, suggestions, documentation, and test cases. It is faster than most browser-based tools and does not require tons of setup. You can also use it with private codebases safely, which is a big plus if you care about keeping things in-house. If you are looking for a no-nonsense AI pair dev that fits inside your setup, Bito is worth checking out.

Cody by Sourcegraph

Cody is somewhere between Copilot and Cursor. It comes with a built-in chat, can look at your whole repo, and ties into Sourcegraph’s code intelligence system. If you are already using Sourcegraph, Cody makes sense. It is more mature than Cursor in some areas, like search, but a bit clunky when it comes to editing. Still improving.

Replit Ghostwriter

Ghostwriter lives inside the Replit IDE. If you are coding in the browser or working on quick prototypes, this is one of the best setups. It helps with both code and docs. You can ask it to build something and it gives you a full layout. Problem is, you have to work in Replit. It is not for local dev or long-term projects unless your whole team is on it.

Continue

Continue is an open-source extension that brings AI chat to VS Code. It is not as polished as Cursor, but it is lightweight and flexible. You can bring your own model. You can use local setups. If you are tired of closed tools and want something you can tweak, Continue is worth trying. Just expect more setup and less magic out of the box.

CodeWhisperer by AWS

If you are deep in the AWS ecosystem, CodeWhisperer makes sense. It is focused more on security, cloud integration, and dev environments tied to AWS services. For general-purpose coding, it feels behind the others. It is slower. Suggestions are not as sharp. But if your day is full of Lambda functions and IAM configs, it might help.

The impasse of SQL performance optimizing

Sushrut Mishra — Tue, 19 Sep 2023 14:04:52 +0000

Many big data calculations are implemented in SQL. When running slowly, we have to optimize SQL, but we often encounter situations that we can do nothing about it.
For example, there are three statements in the stored procedure, which are roughly like this, and execute very slowly:

select a,b,sum(x) from T group by a,b where …; select c,d,max(y) from T group by c,d where …; select a,c,avg(y),min(z) from T group by a,c where …;

T is a huge table with hundreds of millions of rows. It needs to be grouped by three methods, and the grouped result sets are not large.

The grouping operation needs to traverse the data table. These three SQL statements will traverse the huge table three times. It takes a long time to traverse hundreds of millions of rows of data once, not to mention three times.

In this grouping operation, the CPU calculation time is almost negligible relative to the time of traversing the hard disk. If we can calculate multiple group aggregations in one traversal, although the amount of CPU calculation is not reduced, it can greatly reduce the amount of data read from the hard disk and double the speed.

If SQL could support syntax like this:

from T select a,b,sum(x) group by a,b where … -- the first grouping in the traversal select c,d,max(y) group by c,d where … -- the second grouping in the traversal select a,c,avg(y),min(z) group by a,c where …; -- the third grouping in the traversal

It would be able to return multiple result sets in one traversal, and the performance can be greatly improved.
Unfortunately, SQL does not have this syntax and cannot code like this. We can only use an alternative method, that is, use group a,b,c,d to calculate a more detailed grouping result set first, but first save it into a temporary table before we can further calculate the target results with SQL. The SQL statements are rough as follows:

create table T_temp as select a,b,c,d, sum(case when … then x else 0 end) sumx, max(case when … then y else null end) maxy, sum(case when … then y else 0 end) sumy, count(case when … then 1 else null end) county, min(case when … then z else null end) minz group by a,b,c,d; select a,b,sum(sumx) from T_temp group by a,b where …; select c,d,max(maxy) from T_temp group by c,d where …; select a,c,sum(sumy)/sum(county),min(minz) from T_temp group by a,c where …;

In this way, we only need to traverse once, but we have to transfer different where conditions to the previous case when, the code is much more complex and the amount of calculation will be increased. Moreover, when calculating the temporary table, the number of grouping fields becomes large, and the result set may be large. The temporary table is traversed many times, and the calculation performance is not good. Large result set grouping calculation needs hard disk buffer, and its performance is also very poor.

We can also use the database cursor of the stored procedure to fetch the data one by one, but we have to implement the actions of where and group by ourselves. It's too cumbersome to code, and the performance of the database cursor traversing the data will only be worse!

We can do nothing about it!
TopN operation will also encounter this helpless situation. For example, top5 written in Oracle SQL is rough as follows:

select * from (select x from T order by x desc) where rownum<=5

There are 1 billion pieces of data in table T. As can be seen from the SQL statement, the way to get the top five is to sort all the data and then get the first five, and the remaining sorting results are useless! The cost of large sorting is very high. The amount of data is too large to be loaded into memory. There will be multiple hard disk data buffering, and the computing performance will be very poor!

It is not difficult to avoid large sorting. Keep a small set of 5 records in memory. When traversing the data, save the top 5 calculated data in this small set. If the new data is larger than the current fifth, insert it and discard the current fifth. If it is smaller than the current fifth, no action will be taken. In this way, we only need to traverse 1 billion pieces of data once, and the memory occupation is very small, and the computing performance will be greatly improved.

In essence, this algorithm regards TopN as the same aggregate operation as sum and count, but returns a set rather than a single value. If the SQL could be written like this: select top (x, 5) from T, it would have been able to avoid large sorting.
Unfortunately, SQL does not have an explicit set data type. Aggregate functions can only return a single value and cannot write such statements!

However, fortunately, the TopN of the whole set is relatively simple. Although the SQL is written like that, the database can usually do some optimization in practice, and the above method is adopted to avoid large sorting. As a result, Oracle is not slow to calculate that SQL statement.

However, if the situation of TopN is complex, the optimization engine usually doesn't work when it is used in subqueries or mixed with join. For example, to calculate the TopN of each group after grouping, it is a little difficult to write it in SQL. The SQL of Oracle is written as follows:

select * from (select y,x,row_number() over (partition by y order by x desc) rn from T) where rn<=5

In this case, the database optimization engine will faint and will no longer use the above method of understanding TopN as an aggregate operation. It can only do the big sorting, and the operation speed drops sharply!
If only the SQL statement could be written as follows:

select y,top(x,5) from T group by y

Considering top as an aggregate function like sum, it would have been not only easier to read, but also easy to calculate at high speed.
Unfortunately, No.
We still can do nothing about it!

Join calculation is also very common. Take the filtering calculation after the order table is associated with multiple tables as an example. The SQL is basically like this:

select o.oid,o.orderdate,o.amount from orders o left join city ci on o.cityid = ci.cityid left join shipper sh on o.shid=sh.shid left join employee e on o.eid=e.eid left join supplier su on o.suid=su.suid where ci.state='New York' and e.title = 'manager' and ...

There are tens of millions of data in the order table, and the data in city, shipper, employee, supplier and other tables are not large. The filter condition fields may come from these tables, and the parameters are given from the front end and will change dynamically.

Generally, SQL uses the hash join algorithm to implement these associations. The hash values will be calculated and compared. Only one join can be resolved at a time, and the same action will have to be performed n times if there are n joins. After each join, the intermediate results need to be kept for the next round. The calculation process is complex, the data will be traversed many times, and the calculation performance is poor.

Usually, these associated tables are small and can be read into memory first. If each associated field in the order table is serialized in advance, for example, convert the employee id field value to the sequence number of the corresponding employee table record, when calculating, we can use the employee id field value (that is, the sequence number of employee table) to directly get the record at the corresponding position of the employee table in memory. The performance is much faster than hash join, and we only need to traverse the order table once, and the speed will be greatly improved!

That is, the SQL should be written as follows:

select o.oid,o.orderdate,o.amount from orders o left join city c on o.cid = c.# left join shipper sh on o.shid=sh.# left join employee e on o.eid=e.# left join supplier su on o.suid=su.# where ci.state='New York' and e.title = 'manager' and ...

Unfortunately, SQL uses the concept of unordered sets. Even if these ids have been serialized, the database can't take advantage of this feature. It can't use the mechanism of rapid sequence number positioning on these unordered sets of the corresponding associated tables. It can only use index search. Moreover, the database doesn't know that the ids have been serialized, and it still calculates hash values and makes comparisons, and the performance is still very poor!

Although there are good methods, they cannot be implemented. And we can still do nothing about it!
There are also highly concurrent account queries. This operation is very simple:

select id,amt,tdate,… from T where id='10100' and tdate>= to_date('2021-01-10', 'yyyy-MM-dd') and tdate<to_date('2021-01-25', 'yyyy-MM-dd') and …

In the hundreds of millions of historical data in the T table, quickly find several to thousands of details of an account. It is not complicated to code with SQL. The difficulty is that the response time should reach the second level or even faster in case of large concurrency. In order to improve the query response speed, the ID field of the T table is generally indexed:

create index index_T_1 on T(id)

In the database, the speed of using the index to find a single account is very fast, but it will be significantly slower in the case of large concurrency. The reason is also the theoretical unordered basis of SQL mentioned above. The total amount of data is huge and cannot be totally read into memory, and the database cannot ensure that the data of the same account is physically stored continuously.

The hard disk has the smallest reading unit. When reading discontinuous data, many irrelevant contents will be fetched, and the query will be slow. If each query under high concurrency is a little slower, the overall performance will be very poor. Who dares to let users wait for more than ten seconds at a time when user experience is so very important?!

An easy way to think of is to sort hundreds of millions of data according to accounts in advance and ensure the continuous storage of data of the same account. In this way, almost all of the data blocks read out from the hard disk during query are target values, and the performance will be greatly improved.

However, the relational database using SQL system does not have this awareness and will not force the physical order of data storage! This problem is not caused by SQL syntax, but is related to the theoretical basis of SQL. It is still impossible to implement these algorithms in a relational database.

Now, what can we do? Can we do anything about it?
We can no longer use SQL and relational databases. We need to use other computing engines.
Based on the innovative theoretical basis, the open-source esProc SPL supports more data types and operations and can describe the new algorithms in the above scenarios. To code with simple and convenient SPL can greatly improve the computing performance in a short time!
The code examples of the above tasks written in SPL are as follows:

Calculate multiple groupings in one traversal

Calculate top5 by aggregation method Top5 of the total set (multithreaded parallel computing)

Top5 of each group (multithreaded parallel computing)

Join： System initialization

Query

High concurrency account query： Data preprocessing and orderly storage

Account query

In addition to these simple examples, SPL can also implement more high-performance algorithms, such as orderly merging for the association between orders and details, pre-association technology for multi-layer dimension table association in multidimensional analysis, bit storage technology for the statistics of thousands of tags, Boolean set technology to speed up the query of multiple enumeration values filtering conditions, timing grouping technology for complex funnel analysis and so on.

Checkout the esProc_SPL GitHub for more info - https://github.com/SPLWare/esProc

Why a SQL Statement Often Consists of Hundreds of Lines, Measured by KBs？

Sushrut Mishra — Sat, 11 Mar 2023 04:39:40 +0000

Obviously, one of the original purposes of SQL is to make data query processing easy. The language uses many English-like terms and syntax in an effort to make it easy to learn, particularly for non-IT people. Simple SQL statements read like English, and even people without any programming experience can write them.

However, the language becomes clumsy as query needs become even slightly more complicated. It often needs hundreds of rows of multilevel nested statements to achieve a computing task. Even professional programmers often find it hard to write, let alone the non-IT people. As a result, such computing tasks become the popular main question in programmer recruitment tests of many software companies. In real-world business situations, the size of SQL code for report queries is usually measured by KBs. The several-line SQL statements only exist in programming textbooks and training courses.

SQL problem analysis

What is the reason behind the sheer bulk of the code? Let’s try to find the answer, that is, SQL’s weaknesses, through an example.

Suppose we have sales performance table sales_amount consisting of three fields (date information is omitted to make the analysis simpler):

We are trying to find salespeople whose sales amounts rank in top 10 in terms of both air conditioners and TV sets.

The task is not difficult. It is easy for us to think of the following natural computing process:

Sort the sales performance table by sales amount of air conditioners and get the top 10;
Sort the sales performance table by sales amount of TV sets and get the top 10;
Perform intersection operation on step 1 and step 2’s result sets to get the final result.

Let's try to solve it in SQL.

Early SQL did not support stepwise coding. The first two steps had to be written in subqueries, and the whole process looked a little complicated:

select *from (select top 10 sales from sales_amount where product='AC' order by amount desc) intersect (select top 10 sales from sales_amount where product='TV' order by amount desc)

The language later identified the issue and specifically offers CTE syntax, which uses WITH keyword to name an intermediate result set that can be referenced in subsequent parts of the computation:

withAas select top 10 sales from sales_amount where product='AC' order by amount desc Bas select top 10 sales from sales_amount where product='TV' order by amount desc select *fromA intersect B

The statement is still long but becomes clearer.

Now, we make the task a little harder. We will find salespeople whose sales amounts for all products rank in top 10. It’s easy to think up the following algorithm according to the above solution:

List all products;
Find salespeople whose sales amounts rank in top 10 for each product and store them separately;
Calculate the intersection between all top-10 result sets.

The problem is that CTE only works when the number of intermediate results is already known. In this case, we do not know the number of products. This means that the number of clauses under WITH keyword is indefinite and we are not able to write the statement.

Let’s try a different approach:

Group the original table by product, sort each group, and find top 10 records meeting the specified condition in each group;
Calculate intersection of all top 10 records.

But it requires to store the grouping result in step 1. The intermediate result is a table where one field will store the top 10 of members in each group, which means the field values will be sets. As SQL does not support set-type values, the solution becomes infeasible.

If we have window functions at hand, we can switch to another route. It will group the original table by product, calculate the number of appearances of every salesperson in the top 10 sales amounts of each group, and find those whose total appearances are equal to the number of products – they are the ones whose sales amounts rank in top 10 for all products.

select sales from( select sales, from( select sales, rank()over(partition by product order by amount desc ) ranking from sales_amount) where ranking <=10) group by sales having count(*)=(select count(distinct product) from sales_amount)

This way we are able to accomplish the computing task in SQL. But such a complicated SQL statement is beyond most users.

As the first two simple algorithms cannot be implemented in SQL, we have to adopt the roundabout third one. This reveals one important weakness of SQL - insufficient set-orientation.

Though SQL has the concept of sets, it does not offer them as a basic data type. A variable or field in the SQL context cannot have set type values. The only set type SQL object is table. This results in roundabout algorithms and complicated code for the large number of set-oriented calculations.

The keyword top is used in the above SQL sample programs. Actually, there isn’t such an operator in relational algebra (but it can be constructed using a series of other operations), and the code is not standard SQL.

Let me show you how difficult it is when the top keyword is not available for finding top N.

Here’s the general way of thinking: For each member, get the number of members where the sales amounts are greater than the current amount, define ranking of the current salesperson according to the number, and get members whose rankings are not greater than 10. Below is the SQL query:

select sales from( select A.sales sales,A.product product, (select count(*)+1 from sales_amount where A.product=product ANDA.amount<=amount) ranking from sales_amount A) where product='AC'AND ranking<=10

select sales from( select A.sales sales,A.product product,count(*)+1 ranking from sales_amount A, sales_amount B where A.sales=B.sales and A.product=B.product ANDA.amount<=B.amount group by A.sales,A.product ) where product='AC'AND ranking<=10

Even professional programmers find it hard to write. The code is too complicated for such a simple top 10 computation.

Even if SQL supports the keyword top, it can only solve top N problem conveniently. If the task becomes a bit more complex, such as getting members/values from the 6th to the 10th and finding salespeople whose sales amounts are 10% higher than their directly next, the above problems still exist and we have to resort to a roundabout way if we still trying to achieve it in SQL.

This is due to SQL’s another key weakness – lack of order-based syntax. SQL inherits mathematical unordered sets, which is the direct cause of difficulties in handling order-based calculations that are prevalent in real-world business situations (such as calculating link relative ratio, YOY, top 20%, and rankings).

SQL2003 standard adds window functions to try to improve the computing ability for dealing with order-based calculations. They have enabled simpler solutions to the above computing tasks and helped mitigate this SQL problem. However, the use of window functions is usually accompanied by nested queries, and the inability to let users access members of a set directly according to their positions leaves many order-based calculations hard to solve.

Suppose we are trying to find the gender ratio among the above top salespeople by calculating the number of females and that of males. Generally, the gender information of salespeople is recorded in employee table instead of the sales performance table, as shown below:

As the list of top salespeople is available, our first thought might be finding their genders from the employee table and then count the numbers. To achieve this cross-table query, SQL needs a table join. So, the SQL code following the above top 10 task is:

select employee.gender,count(*) from employee, (( select top 10 sales from sales_amount where product='AC' order by amount desc ) intersect ( select top 10 sales from sales_amount where product='TV' order by amount desc ))A where A.sales=employee.name group by employee.gender

Only one table join has already made the code complicated enough. In fact, related information is, on many occasions, stored in multiple tables and often of multilevel structure. For instance, salespeople have their departments and the latter has managers, and we might want to know the managers under whom those top salespeople work. A three-table join is needed to accomplish this, and it is not easy to write smooth and clear WHERE and GROUP for this join.

Now we find out the next SQL weakness – lack of object reference mechanism. In relational algebra, the relationship between objects is maintained purely by foreign keys match. This results in slow data searching and the inability to treat the member record in the related table pointed by the foreign key directly as an attribute of the current record. Try rewriting the above SQL as follows:

select sales.gender,count(*) from(…)// … is the SQL statement for getting the top 10 records of salespeople group by sales.gender

Apparently, this query is clearer and will be executed more efficiently (as there are no joins).

The several SQL key weaknesses shown through a simple example are causes of hard to write and lengthy SQL statements. The process of solving business problems based on a certain computational system is one that expressing an algorithm with the syntax of a formalized language (like solving word problems in primary school by transforming them into formalized four arithmetic operations). The SQL defects are great obstacles to translation of solutions computing problems. In extreme cases, the strangest thing happens – the process of converting algorithms to syntax of a formalized language turns out to be much harder and more complicated than finding a solution.

In other words, using SQL to compute data is like using an assembly language to accomplish four arithmetic operations – which might be easier to understand for programmers. A simple formula like 3+5*7 will become as follows if it is written in an assembly language, say X86:

mov ax,3 mov bx,5 mul bx,7 add ax,bx

Compared with the simple formula 3+5*7, the above code is complicated to write and hard to read (it is even more difficult when fractions are involved). Though it may be not a big deal for veteran programmers, it is almost unintelligible for most business people. In this regard, FORTRAN is a great invention.

Our examples are simple because I want you to understand my point easily. But real-world computing tasks are far more complicated, and users will face various SQL difficulties. Several more lines here and a few more lines there, it is therefore no wonder that SQL generates multilevel nested statements of hundreds of lines for a slightly complicated task. What’s worse, often the hundreds of lines of code are a single statement, making it hard to debug in terms of engineering aspect and increasing difficulty in handling complex queries.

More examples

Let’s look at SQL problems through more examples.

In order to simplify the SQL statement as much as possible, the above sample programs use many window functions and thus the Oracle syntax that supports window functions well. Syntax of the other databases will only make the SQL statement more complicated.

Even for these simple tasks that are common in daily analytic work, SQL is already sufficiently hard to use.

Unordered sets

Order-based calculations are prevalent in batch data processing (such as getting top 3 or record/value in 3rd position, and calculating link relative ratio). SQL switches to an unusual way of thinking and take a circuitous route because it cannot perform such a calculation directly thanks to its inheritance of the concept of mathematical unordered sets.

Task 1: Find employees whose ages are equal to the median.

select name, birthday from(select name, birthday,row_number()over(order by birthday) ranking from employee ) where ranking=(select floor((count(*)+1)/2) from employee)

Median calculation is common, and the process is simple. We just need to sort the original set and get the member at the middle position. SQL’s unordered-sets-based computational mechanism does not offer position-based member access method. It will invent a field of sequence number and select the eligible members through a conditional query, where subqueries are unavoidable.

Task 2: Find the largest number days when a stock rises consecutively.

select max(consecutive_day) from(select count(*)(consecutive_day from(select sum(rise_mark)over(order by trade_date) days_no_gain from(select trade_date, case when closing_price>lag(closing_price)over(order by trade_date) then 0else1END rise_mark from stock_price)) group by days_no_gain)

Unordered sets also lead to tortuous ways of solving problems.

Here is the general way of doing the task. Set a temporary variable to record the number of consecutive rising days with the initial value as 0, compare the current closing price with the previous one, reset the variable’s the current value as 0 if the price does not rise and add 1 to it if the price rises, and get the largest number when the loop is over.

SQL cannot express the algorithm and it gives an alternative, which first counts the non-rising frequency for each date from the initial one to the current one. The dates that have the same frequency contain prices rising consecutively. Then it groups these dates to get continuously rising intervals, counts members in each, and finds the largest number. It is extremely difficult to understand and even more hard to express.

Insufficient set-orientation

There is no doubt that sets are the basis of batch data processing. SQL is a set-oriented language, but it can only express simple result sets and does not make it a basic data type to extend its application.

Task 3: Find employees whose birthdays are on the same date.

select * from employee where to_char(birthday, ‘MMDD’)in ( select to_char(birthday,'MMDD') from employee group by to_char(birthday,'MMDD') having count(*)>1)

The original purpose of grouping a set is to divide it into multiple subsets, so a grouping operation should have returned a set of subsets. However, SQL cannot express such a “set of sets” and thus cannot help forcing an aggregate operation on the subsets to return a regular result set.

At times, what we need isn’t aggregate values but the subsets themselves. To do this, SQL will query the original set again according to the grouping condition, which unavoidably results in a nested query.

Task 4: Find students whose scores of all subjects rank in top 10.

select name from(select name from(select name, rank()over(partition by subject order by score DESC) ranking from score_table) where ranking<=10) group by name having count(*)=(select count(distinct subject) from score_table)

The set-oriented solution is to group data by subject, sort each subset by score, select top 10 from each subset, and calculate intersection between the subsets. As SQL’s inability to phrase “a set of sets” and support intersection operations on an indefinite number of sets, the language takes an unusual route to achieve the task. It finds top 10 scores in terms of subjects using a window function, group the result set by student, and find the group where the number of students is equal to the number of subjects. The process is hard to understand.

Lack of object reference method

A SQL reference relationship between data tables is maintained through matching foreign key values. Records pointed by these values cannot be used directly as an attribute of the corresponding records in the other table. Data query needs a multi-table join or a subquery, which is complicated to code and inefficient to run.

Task 5: Find male employees whose managers are female.

Through multi-table join:

select A.* from employee A, department B, employee C where A.department=B.department and B.manager=C.name and A.gender='male' and C.gender='female'

Through subquery:

select * from employee where gender='male' and department in (select department from department where manager in (select name from employee where gender='female'))

If the department field of the employee table is the foreign key pointing to records of the department table and the manager field of the department table is the foreign key that points to records of the employee table, the query condition can be written in the following simple, intuitive and efficient way:

where gender='male' and department.manager.gender='female'

SQL can only use a multi-table join or a subquery to generate difficult to understand statements.

Task 6: Find the companies where employees obtained their first jobs.

Through multi-table join:

select name, company, first_company from(select employee.name name, resume.company company, row_number()over(partition by resume. name order by resume.start_date) work_seq from employee, resume where employee.name = resume.name) where work_seq=1

Through subquery:

select name, (select company from resume where name=A.name and start date=(select min(start_date) from resume where name=A.name)) first_company from employee A

SQL is also unable to treat the sub table as an attribute (field) of the primary table because it lacks object reference method and has inadequate set-orientation. A query on the sub table uses either a multi-table join, which makes the statement particularly complex and needs to align the result set to records of the primary table in a one-to-one relationship through a filtering or grouping operation (since records of the joining result set has such a relationship with the sub table), or a subquery that calculates ad hoc the subset of records of the sub table related to each record in the primary table one by one – which increases amount of computations (the subquery cannot use the WITH subclause) and coding difficulty.

SPL as the solution

SQL problems need to have a solution.

Actually, the above analysis implies a way out. That is, designing a new language that gets rid of those SQL weaknesses.

And this is the original intention of creating SPL.

SPL is the abbreviation for Structured Process Language while SQL’s full name is Structured Query Language. It is an open-source programming language intended to facilitate structured data computations. SPL emphasizes orderliness and supports object reference method to achieve complete set-orientation, sharply reducing the difficulty of “algorithm translation”.

Here we just present SPL code of the 6 tasks in the previous section, giving you a glance of the language’s elegance and conciseness.

Task 1

Task 2

It is easy for SPL to code an intuitive and direct algorithm.

Task 3

SPL keeps result set of the grouping operation to further process it as it handles a regular set.

Task 4

SPL writes the code smoothly as the intuitive algorithm unfolds.

Task 5

With the support of object reference, it is convenient for SPL to access a field of the record pointed by the foreign key as it gets one of its original fields.

Task 6

SPL allows treating a set of records of the sub table as a field of the primary table and accesses it in the same way of getting its other fields, avoiding repeated computations on the sub table.

SPL has an intuitive IDE that offers convenient debug functionalities to track each step for processing a query, making coding even easier.

For a computation within an application, SPL offers the standard JDBC driver to be integrated with the application, such as JAVA, as SQL does:

… Class.forName("com.esproc.jdbc.InternalDriver"); Connection conn =DriverManager.getConnection("jdbc:esproc:local://"); Statement st = connection.(); CallableStatement st = conn.prepareCall("{call xxxx(?,?)}"); st.setObject(1,3000); st.setObject(2,5000); ResultSet result=st.execute(); ...

SPL: a database language featuring easy writing and fast running

Sushrut Mishra — Mon, 13 Feb 2023 10:28:05 +0000

Objective of database language

To clarify this objective, we need to first understand what the database does.

When it comes to database, it always makes people think that it is primarily for storage since its name has a part “base”. But in fact, it is not the case, database can achieve two important functions: calculation and transaction, which are what we often call OLAP and OLTP. The storage of database is intended for these two functions, and just serving as a storing role is not the objective of database.

As we know, SQL is currently the mainstream database language. So, is it convenient to do such two things in SQL?

The transaction function is mainly to solve the consistency of data during writing and reading. Although it is hard to achieve, its interface is very simple for applications, and the code for manipulating the reading and writing of database is also very simple. If it is assumed that the current logical storage scheme of relational database was reasonable (that is, using the data tables and records to store data. Whether it is reasonable or not is another complicated issue, which will not be discussed in detail here), then it would not be a big problem for SQL to describe the transaction function, because there is no need to describe complex action, and the complexity is already solved in the database.

As for the calculation function, however, the situation will be different.

The calculation we are talking about here is a broader concept. It is not just simple addition and subtraction, the search and association can all be regarded as some calculation.

So here comes a question, what kind of computing system is good?

Two characteristics needed: easy in writing, fast in running.

Easy in writing is easy to understand, which is to allow programmers to write code quickly so that more work can be done per unit of time; while for fast in running, it is easier to understand since we definitely hope to get the calculation results in a shorter time.

Actually, the Q in SQL represents query. The original purpose of inventing SQL is to query (i.e., calculation), which is the main goal of SQL. However, it is hard to say that SQL is very competent when describing computing tasks.

Why SQL is not competent

Let’s start with easy in writing.

The code written in SQL is very much like English, and some queries can be read and written in English (there are so many examples on the Internet, so we won’t give examples here). This should be regarded as satisfying the requirement of easy in writing.

Wait a minute! The code written in SQL we see in textbooks often has only two or three lines, it is indeed simple, but what if we try to solve some slightly more complicated problems?

Here is an example that is actually not very complicated: calculate the maximum consecutive days that a stock keeps rising. Write it in SQL is like this:


select max (consecutive_day)
from (select count(*) (consecutive_day
      from (select sum(rise_mark) over(order by trade_date) days_no_gain
            from (select trade_date,
                         case when closing_price>lag(closing_price) over(order by trade_date)
                              then 0 else 1 END rise_mark
                  from stock_price ) )
      group by days_no_gain)

The working principle of this statement won’t be explained here, it's a little confusing anyway. You can try it yourself.

This is a recruitment test of Raqsoft company, with a pass rate of less than 20%; Because it is too difficult, it is later changed to another testing way: ask the candidate to explain what the written SQL statement is, but unfortunately, the pass rate is still not high.

What does it reveal? It reveals that the situation is slightly complicated, and SQL becomes difficult to both understand and write!

Let's look at the issue of fast in running, and take the simple task that is often used as an example: take the top 10 out of 100 million pieces of data. This task is not complicated to write in SQL:

SELECT TOP 10 x FROM T ORDER BY x DESC

However, the execution logic corresponding to this statement is to perform the big sorting for all the data first, and then take the top 10, and discard the remaining data. As we all know that sorting is a very slow action, and will traverse the data many times. If the amount of data is too large to be loaded into memory, it also needs to buffer the data in external storage, resulting in a further sharp decrease in performance. If the logic embodied in this statement is strictly followed, the operation will not run fast anyway. Fortunately, many programmers know that this operation does not need the big sorting, nor the external storage to buffer since it can be done by traversing only once and only occupying a little memory space, it means a higher performance algorithm exists. Regrettably, such algorithm can't be implemented in SQL. We can only hope the database optimizer is smart enough to convert this SQL statement to a high-performance algorithm to execute, but the database optimizer may not be reliable when the situation is complicated.

It seems that SQL is not doing well in both aspects. Although these two examples are not very complicated, SQL does not perform well in either example. In reality, the difficult-to-write and slow running situation abounds in SQL codes with thousands of lines.

Then, why these two aspects cannot be well achieved in SQL?

To answer this question, we need to analyze what exactly the implementation of calculation with program code does.

Essentially, the process of programming is the process of translating problem-solving idea into a precise formal language executable by the computer. For example, just like solving an applied problem by a primary school student, the student also needs to list an expression relating to four basic arithmetic operations after analyzing the problem and coming up with a solution. Likewise, for the calculation with the program, not only does the solution need to be come up with, but it also needs to translate the solution into actions that can be understood and executed by the computer.

For the formal language used to describe calculation method, its core lies in the algebraic system adopted. To put it simply, the so-called algebraic system includes two key elements: data types and corresponding operation rules. For instance, the key elements of arithmetic we learned in primary school is the integer and the operations including the addition, subtraction, multiplication and division. Once we get both key elements, we can write the operation we want with the symbols stipulated in the algebraic system to something, i.e., the code, and then the computer can execute.

If an algebraic system is not well designed, causing the provided data types and operations to be inconvenient, it will make it very difficult to describe the algorithm. In this case, a strange phenomenon will occur: the difficulty of translating the solution into the code is far more than solving the problem itself.

For example, we learned to use Arabic numerals for daily calculation in our childhood, and using such numerals is very convenient to do addition, subtraction, multiplication and division, and hence everyone naturally believes that numerical operation should be like this. Are all numerical operations so convenient? Not necessarily! It is estimated that many people know there is another numeral called Roman numeral. Do you know how to add, subtract, multiply and divide with Roman numerals? And how did the ancient Romans go to the streets for shopping?

*The reason why coding is difficult is largely due to algebra.
*

Let's look at the reason for not running fast.

Software cannot change the performance of hardware; the speed of CPU and hard disk depends on their own configuration. However, we can design an algorithm of low complexity, that is, an algorithm with a smaller amount of calculation, so that the computer executes less actions, and thus the running speed will be faster naturally. Yet, just working out the algorithm is not enough, we also need to program the algorithm in some formal language, otherwise the computer won't execute. Moreover, it needs to be relatively simple to code. If the code in a certain formal language is very long, it will be very troublesome and no one will use such formal language. Therefore, for the program, easy in writing and fast in running are actually the same problem, behind which is the algebra adopted by the formal language. If the algebra is not good, it will make it difficult or even impossible to implement high-performance algorithm, as a result, there is no way to run fast. As mentioned above, our desired algorithm of occupying a little memory space and traversing only once cannot be implemented in SQL. Consequently, if you want it to run fast, you can only place hope on the optimizer.

Let's make another analogy:

Students who have gone to primary school probably know the story of Gauss calculating 1+2+3+…+100. Ordinary students adopted the most primitive method, which was to add 100 times step by step, while little Gauss was very smart, he found that 1+100=101, 2+99=101,…,50+51=101, from which he multiplied 50 by 101, and hence quickly figured out the result and then headed home for lunch.

After hearing this story, we all felt that Gauss was so clever that he thought of such an ingenious solution, which is simple and fast. Yes, that’s right, but it is easy to overlook one point: in the days of Gauss, multiplication already existed in the human arithmetic system (also an algebra)! As mentioned earlier, since we learned four arithmetic operations in our childhood, and hence we would take it for granted that multiplication should be used. But it is not, actually! Multiplication was invented after addition. If multiplication had not yet been invented in the days of Gauss, he wouldn’t have found a way to solve this problem quickly no matter how clever Gauss was.

At present, the mainstream database is the relational database, and the reason why it is called this way is because its mathematical basis is called relational algebra. SQL is exactly a formal language developed from the theory of relational algebra.

Now we can answer why SQL is not competent in both aspects we expect. The problem lies in relational algebra, and the relational algebra is just like an arithmetic system with only addition and no multiplication. Therefore, it is inevitable that many things cannot be done well.

Relational algebra has been invented for fifty years. The difference between the application requirements and hardware environments of fifty years ago and today is very huge. Continuing to apply the theory of fifty years ago to solve today's problems, does it sound too outdated? However, this is the reality. Due to the large number of existing users and the lack of mature new technologies, SQL, based on relational algebra, is still the most important database language today. Although some improvements have been made in recent decades, the foundation has not changed. In the face of contemporary complex requirements and hardware environments, it is reasonable that SQL is incompetent.

And, unfortunately, this problem is at the theoretical level, and it won't help no matter how optimized it is in practice, it can only be improved in a limited way, not eradicated. Regrettably, most database developers do not think of this level, or, in order to take care of the compatibility of existing users, they do not intend to think about this level. As a result, the mainstream database industry has been going around in circles in this limited space.

Why SPL is competent

Now then, how to make the calculation Easier in Writing and Faster in Running?

Invent new algebra! An algebra with “multiplication”, and then design a new language based on the new algebra.

This is where SPL comes from. Its theoretical basis is no longer the relational algebra, but something called discrete dataset. The formal language designed based on this new algebra is named SPL (structured process language).

Innovations against the shortcomings of SQL have been made to SPL (more precisely, innovations against various deficiencies of relational algebra have been made to the discrete dataset). SPL redefines and extends many operations of structured data, specifically, it adds the discreteness, enhances ordered computation, implements a thorough set orientation, supports object references, and advocates stepwise operation.

Recoding the previous problems in SPL will give you a direct feeling.

Calculate the maximum consecutive days that a stock keeps rising:

stock_price.sort(trade_date).group@i(closing_price<closing_price[-1]).max(~.len())

Although the calculation idea is the same as the previous SQL, it is much easier to express and no longer confusing, because of the introduction of ordering characteristic.

Take the top 10 out of 100 million pieces of data:


T.groups(;top(-10,x))

SPL has richer set data types, it is easy to describe the efficient algorithm that implements simple aggregation on a single traversal, without involving big sorting action.

Due to space limitations, we will not introduce SPL (discrete dataset) in an all-round way here, but will list some differential improvements of SPL (discrete dataset) against SQL (relational algebra):

Discrete records

The records in the discrete dataset are a basic data type that can exist independently of the data table. The data table is a set constituted by records, and the records that make up a certain data table can also be used to make up other data tables. For example, the filtering operation is to use the records that meet the conditions in original data table to make up a new data table, in this way, it has more advantages in both space occupation and operation performance.

The relational algebra has no computable data type to represent the record. A single record is actually a data table with only one row, and records in different data tables must not be same. For example, during the filtering operation, new records will be duplicated to form a new data table, it will result in an increase in the costs of space and time.

In particular, because there are discrete records, the discrete dataset allows record’s field value to be a certain record, which makes it easier to implement foreign key join.

Ordering characteristic

Relational algebra is designed based on unordered sets, and the set members do not have the concept of sequence number. Moreover, it doesn’t provide the mechanism of positioning calculation and adjacent reference. In practice, SQL has made some partial improvements, allowing the modern SQL to easily do some ordered operations.

On the contrary, the sets in discrete dataset are ordered, and all set members have the concept of sequence number and can be accessed with sequence number. Moreover, the discrete dataset defines the positioning operation so as to return the sequence number of members in the set. Also, the discrete dataset provides symbols to implement adjacent reference in set operation, and supports the calculation according to the position of a certain sequence number in the set.

Ordered operation is very common, but it has always been a difficult for SQL. The implementation of ordered operation in SQL is still very cumbersome even with window functions available. SPL has greatly improved this situation, which can be illustrated by the previous example of stock rising.

Discreteness and set orientation

The relational algebra defines rich set operations, that is, it can take the set as a whole to participate in operations such as aggregation and grouping. This is where SQL is more convenient than high-level programming languages like Java.

However, the relational algebra has very poor discreteness and no discrete records, while high-level programming languages such as Java have no problem in this regard.

As for the discrete dataset, it is equivalent to combining discreteness with set orientation, which means that it has not only the set data type and related operations, but also the set members that separate out of the set to do independent operation or form other sets. Therefore, it can be said that SPL integrates the advantages of both SQL and Java.

Ordered operation is a typical scenario that combines discreteness with set orientation. The concept of order is meaningful only for a set and meaningless for a single member, which reflects the set orientation; the ordered operation needs to calculate a certain member and its adjacent members, which requires discreteness.

Only with the support of discreteness can we obtain more thorough set orientation, and solve problems like ordered operation.

In short, the discrete dataset is an algebraic system with both discreteness and set orientation, while relational algebra has only set orientation.

Understanding of grouping

The original intention of the grouping operation is to split a large set into several subsets according to some rules. In relational algebra, since there is no data type that can represent the set of sets, it has to do the aggregation operation after grouping.

Conversely, the discrete dataset allows the set of sets, it can represent reasonable grouping operation result. The grouping operation and the aggregation operation after grouping are split into two-step independent operations. In this way, more complex operation can be performed on the grouped subsets.

In relational algebra, there is only one kind of equivalence grouping, that is, the sets are divided according to the grouping key value. The equivalence grouping is a complete division.

For the discrete dataset, however, it thinks that any method of splitting a large set is a grouping operation. In addition to the conventional equivalence grouping, it also provides the ordered grouping combined with ordering characteristic, as well as the aligned grouping that may get incomplete division result.

Understanding of aggregation

There is no explicit set data type in relational algebra. The results of aggregation calculation are all a single value, so does the aggregation operation after grouping, only including SUM, COUNT, MAX, MIN etc. Particularly, the relational algebra cannot regard TOPN operation as an aggregation. The TOPN operation performed on the whole set can only take the first N items following the sorting when outputting the result set. However, for the grouped subsets, it is difficult to implement TOPN, in this case, it needs to change idea and work out the sequence number to achieve.

The discrete dataset advocates the universal set, and the aggregation operation result is not necessarily a single value, but may still be a set. In discrete dataset, the TOPN operation has the same status as SUM and COUNT, etc., that is, it can be performed on the whole set or on the grouped subsets.

After SPL regards TOPN as the aggregation operation, it can also avoid the sorting of all data in practice, hereby obtaining high performance. However, the TOPN in SQL is always accompanied by ORDER BY action. In theory, it can only be implemented by big sorting, in this case, you need to place hope on the optimization of database in practice.

High performance supported by ordering characteristic

The discrete dataset places special emphasis on ordered set, and can implement many high-performance algorithms using ordered characteristic. This cannot be implemented by relational algebra based on unordered sets, and you can only hope for optimization in practice.

The following are some low-complexity operations that can be implemented using ordering characteristic:

1) The data table is ordered by the primary key, which is equivalent to having an index naturally. Filtering of key fields can often be quickly located to reduce the traversal amount in external storage. When fetching the data randomly by key value, the binary search can also be used for positioning; the index information can be reused in case of data-fetching by multiple key values at the same time.

2) Usually, the grouping operation is implemented with the HASH algorithm. If we are sure that the data are ordered by the grouping key value, we only need to do the adjacent comparison, hereby avoiding the calculation of HASH value, in this way, there will be no HASH conflict problem, and it is very easy to perform parallel computing.

3) The data table is ordered by the key, the merge algorithm with higher performance can be used for the alignment join between two large tables. In this way, we only need to traverse the data once, without the need to buffer the data, and thus it makes the memory occupation less. Yet, not only is the conventional HASH value partitioning method relatively complicated, requiring more memory and data buffering in external storage, but it also may cause secondary HASH and re-buffering due to improper hash function.

4) The join of large table as foreign key table. When the fact table is small, the foreign key table can be used to order, from which the data corresponding to the associated key value are quickly taken out to achieve join, and there is no need to do HASH partitioning. When the fact table is also large, we can divide the foreign key table into multiple logical segments by quantile, and then partition the fact table by logical segment. In this way, only one table needs to be partitioned, and the secondary partitioning, which may occur during HASH partitioning, will not occur in the process of partitioning, and thus the computational complexity greatly decreases.

Items 3)and 4) above exploit the modification of the join operation in discrete dataset. If we continue to use the definition in relational algebra (which may produce many-to-many), it is difficult to implement such low-complexity algorithms.

In addition to theoretical differences, SPL has many engineering-level advantages such as: easier to write parallel computing code, large memory pre-association to improve foreign key join performance, unique column storage mechanism to support arbitrary segmentation and parallel computing, etc.

In the era of big data, we are often interested in high performance computation. Here are some big data algorithms implemented in SPL:

Performance optimization skill: Multi-purpose traversa

Performance optimization skill: TopN

Performance optimization skill: Pre-Joining

Performance optimization skill: Numberizing Foreign Key

Performance optimization skill: Attached Table

Performance optimization skill: One-side Partition

......

And some high performance cases:

Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times

Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency

Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times

Open-source SPL turns pre-association of query on bank mobile account into real-time association

Open-source SPL speeds up batch operating of bank loan agreements by 10+ times

Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.raqsoft.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc