Fedor Pasynkov

Posted on Jun 12

A Troubleshooting Log: Proxy Routing, QUIC, Docker State, and Ansible

#networking #linux #docker #ansible

This is not a "how to set up a proxy" tutorial.

I run a small self-hosted stack: a Linux client, a proxy client, a VPS, a Dockerized control panel, Xray-style routing, and Ansible automation. This is a log of real problems I hit along the way — not polished after the fact, just written down as I went.

The value isn't in any specific tool or final config. It's in the failures:

- an app ignored system proxy settings;
- a video service felt broken because of transport behavior;
- fixing the video service broke my messenger;
- a routing rule existed but never matched;
- I confused server-side direct with client-side direct;
- Ansible said success, but the service wasn't working;
- a copied database preserved routing but broke panel access;
- SQLite edits looked right but login still failed;
- browser cookies produced fake-looking backend errors.

Each section follows the same structure:

- Problem
- Symptoms
- Cause
- Fix
- How to check next time
- What I learned

Problem 1: The app showed a location/API error even though the terminal looked fine

Symptoms

A desktop developer tool kept failing with an API or location error.

Meanwhile, the basic checks came back clean:

```bash
curl ipinfo.io
curl -4 ipinfo.io
curl -6 ipinfo.io
```

The browser worked. The terminal worked. The app didn't. That's what made it confusing — the obvious first checks were all green.

Cause

The client was in system proxy mode.

That works for programs that actually respect system proxy settings. A lot of them don't. Desktop apps, IDEs, background agents, internal CLI helpers — some of them just bypass it entirely and use the network directly.

So the browser and terminal looked fine, while the app quietly took a different path.
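To test the "this app ignores system proxy" theory, it helps to see what the running process was actually given. A minimal sketch, assuming Linux and assuming "system proxy" reaches apps as the usual `*_proxy` environment variables; desktop environments can also keep proxy settings elsewhere (for example GNOME's `org.gnome.system.proxy` gsettings schema), which this does not check. The function name is illustrative.

```bash
# proxy_env_of_pid: list the *_proxy environment variables a running process
# started with. Linux-only: reads /proc/<pid>/environ (NUL-separated).
# Reading another user's process usually requires root.
proxy_env_of_pid() {
  local pid="$1"
  if [[ ! -r "/proc/$pid/environ" ]]; then
    echo "cannot read /proc/$pid/environ" >&2
    return 1
  fi
  # Turn NULs into newlines, keep http_proxy, HTTPS_PROXY, all_proxy, no_proxy, ...
  tr '\0' '\n' < "/proc/$pid/environ" | grep -i '_proxy=' || true
}
```

Usage (`some-app` is a placeholder): `proxy_env_of_pid "$(pgrep -n -f some-app)"`. Empty output means the app never even saw proxy variables; non-empty output still doesn't prove it honors them, so the real test is always the app itself.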

Fix

Switching to TUN/network-level mode fixed it.

The key insight wasn't "change a server rule." It was making sure the app's traffic was actually entering the right network layer in the first place.

How to check next time

[ ] Does the browser work?
[ ] Does curl work?
[ ] Does the specific failing app work?
[ ] Is the client in system proxy mode or TUN/VPN mode?
[ ] Could this app be ignoring system proxy settings?
[ ] Are IPv4 and IPv6 both tested?
[ ] Was the app restarted after changing network mode?

What I learned

Browser and terminal checks only prove that some traffic is routed correctly. They say nothing about every app.

Problem 2: TUN mode worked, but needed elevated permissions on Linux

Symptoms

Network-level mode failed to start, or complained about permissions. The error looked like a low-level interface issue rather than anything in the app itself.

Cause

Creating and configuring a TUN interface on Linux requires elevated privileges, specifically the CAP_NET_ADMIN capability (which root has). That's expected behavior, not a bug.

Fix

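Set up the client to request elevated privileges properly. For a desktop workflow, a launcher that goes through pkexec is more practical than starting things from a terminal by hand every time. One catch: pkexec starts the program with a minimal environment, so a GUI client may also need DISPLAY and XAUTHORITY passed through explicitly.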
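Before building a launcher, it's worth confirming the failure really is about privileges. A minimal sketch, assuming Linux: the capability involved is `CAP_NET_ADMIN` (number 12 in the kernel's capability bitmasks), and a process's effective set appears on the `CapEff:` line of `/proc/<pid>/status`. The function names are hypothetical.

```bash
# Succeed if a hex capability mask (as printed in /proc/<pid>/status)
# contains CAP_NET_ADMIN, which is capability number 12.
cap_mask_has_net_admin() {
  local mask_hex="$1"
  [[ -n "$mask_hex" ]] || return 1
  (( (16#$mask_hex >> 12) & 1 ))
}

# Print the effective capability mask of a process (default: this shell).
cap_eff_of_pid() {
  awk '/^CapEff:/ { print $2 }' "/proc/${1:-$$}/status"
}

# Report whether this shell could set up TUN, and whether the device node exists.
tun_preflight() {
  if cap_mask_has_net_admin "$(cap_eff_of_pid)"; then
    echo "CAP_NET_ADMIN: yes"
  else
    echo "CAP_NET_ADMIN: no (expect TUN setup to fail without elevation)"
  fi
  if [[ -c /dev/net/tun ]]; then
    echo "/dev/net/tun: present"
  else
    echo "/dev/net/tun: missing"
  fi
}
```

Run `tun_preflight` as your normal user and again under elevation to see the difference. For the launcher itself, one common pattern is `pkexec env DISPLAY="$DISPLAY" XAUTHORITY="$XAUTHORITY" /path/to/client`, where the client path is a placeholder.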

How to check next time

[ ] Is the error about creating/configuring a TUN interface?
[ ] Does it work when started with the right privileges?
[ ] Is the desktop launcher passing DISPLAY/XAUTHORITY correctly?
[ ] Is this client trusted enough to run elevated?

What I learned

System proxy mode is easier and requires less trust, but it doesn't catch everything. TUN mode catches more, but it demands more from the OS — and from you.

Problem 3: The video service was slow even though the proxy worked fine

Symptoms

The service loaded, technically. But it felt bad:

- previews loaded slowly;
- the feed was choppy;
- video playback was inconsistent;
- speeds were lower than expected;
- latency under load was noticeably worse than idle.

The proxy wasn't broken. This was a quality and stability problem.

Cause

It wasn't just raw bandwidth.

Modern video platforms often use HTTP/3 over QUIC, which runs over UDP. On some routes, UDP gets worse treatment than TCP — packet loss, jitter, shaping somewhere along the path. The "fast" transport turned out to be the unstable one.

Fix

Test whether routing the affected traffic away from the problematic UDP path helps.

The important thing: do it narrowly. A broad rule covering all UDP on a common port will catch unrelated apps. A tight rule scoped to the problematic traffic category is much safer.
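As a sketch of what "narrow" can look like in Xray-style routing: the first rule only catches QUIC-shaped traffic (UDP to port 443) for one domain and its subdomains, and sends it to an outbound tagged `block`, so the client falls back to HTTP over TCP. This assumes your config defines a blackhole outbound with that tag, that domain matching for UDP works in your setup (it depends on sniffing), and that blocking is acceptable; it's one way of steering traffic off the UDP path, not the only one. The domain used below is a placeholder.

```bash
# Emit one Xray-style routing rule (JSON) matching only UDP/443 traffic for
# the given domain and its subdomains ("domain:" prefix).
# Assumption: an outbound with tag $2 (default "block", e.g. a blackhole) exists.
quic_rule_for_domain() {
  local domain="$1" tag="${2:-block}"
  cat <<EOF
{
  "type": "field",
  "domain": ["domain:${domain}"],
  "network": "udp",
  "port": "443",
  "outboundTag": "${tag}"
}
EOF
}

# The broad version, for contrast: no "domain" field, so it matches ALL
# UDP/443 traffic from every app on the device.
broad_quic_rule() {
  cat <<'EOF'
{
  "type": "field",
  "network": "udp",
  "port": "443",
  "outboundTag": "block"
}
EOF
}
```

For example, `quic_rule_for_domain example-video.test` (hypothetical domain) prints an object for the `routing.rules` array. Rules are evaluated top to bottom, so the narrow one has to sit above anything broader.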

How to check next time

[ ] Is the service slow or fully unreachable?
[ ] Is the problem specific to video/previews/feed?
[ ] What does latency under load look like?
[ ] Is QUIC/UDP involved?
[ ] Does bypassing that transport path actually help?
[ ] Is the rule narrow, or is it catching other apps too?
[ ] Was the core service restarted after the routing change?

What I learned

Don't debug streaming issues with Mbps alone. A connection with decent throughput but jittery latency can feel worse than something slower but stable.

Problem 4: Fixing the video service broke the messenger

Symptoms

After the video service started behaving, the messenger didn't:

- file transfers became unreliable;
- uploads and downloads failed intermittently;
- calls or media could be affected;
- the timing lined up exactly with the routing change.

Cause

The fix was too broad.

The routing rule I added caught not just the video service, but also other apps with similar transport characteristics. Classic mistake — solving one problem with a rule that matches too much.

Fix

Tighten the rule.

Instead of a global protocol/port rule, scope it to the specific traffic that actually needs it. Then retest both services:

[ ] Video service still works better
[ ] Messenger file transfers still work
[ ] Calls are unaffected
[ ] Nothing else is obviously broken

How to check next time

[ ] What changed right before the messenger broke?
[ ] Is there a global transport rule in place?
[ ] Can it be scoped to a specific domain or category?
[ ] Does disabling the rule temporarily restore the messenger?
[ ] Does a narrower rule preserve both?

What I learned

A global network rule is a hammer. It might fix one nail and split the table.

Problem 5: A routing rule existed in the UI but didn't actually do anything

Symptoms

Created a rule. The UI showed it. It looked correct. Traffic still went the wrong way.

Cause

Several things could cause this:

1. A broader rule sits above it and matches first.
2. The service wasn't restarted after saving.
3. The routing engine can't actually see the domain or protocol.
4. Sniffing/detection is missing or incomplete.
5. The traffic doesn't match what I assumed.
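Cause 1 is the most common, and it's easy to model. A toy sketch, not Xray's actual matcher: each input line is `<domain-suffix> <outbound>`, `*` is a catch-all, and the first matching line wins, the same top-to-bottom evaluation an ordered routing list uses. The rule syntax and function name are invented for illustration.

```bash
# Toy first-match router. Reads rules from stdin, one per line:
#   <domain-suffix> <outbound>     ("*" matches any domain)
# Prints the outbound of the FIRST rule that matches $1 and stops there.
first_match() {
  local domain="$1" suffix outbound
  while read -r suffix outbound; do
    # Skip blank lines and comments.
    [[ -z "$suffix" || "$suffix" == \#* ]] && continue
    if [[ "$suffix" == "*" || "$domain" == "$suffix" || "$domain" == *".$suffix" ]]; then
      printf '%s\n' "$outbound"
      return 0
    fi
  done
  return 1   # no rule matched
}
```

With `* proxy` on the first line, `example.test direct` below it never gets a chance: the rule is present, but a broader one above it always answers first.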

Fix

Work through the list:

[ ] Put specific rules above general/fallback rules
[ ] Enable protocol and domain detection where needed
[ ] Use route-only sniffing if available
[ ] Restart the actual service — not just the web UI
[ ] Test from the real client again

How to check next time

[ ] Is this rule above the fallback/default?
[ ] Is there a broader rule that's matching first?
[ ] Can the router even see the domain?
[ ] Is the relevant protocol included in sniffing?
[ ] Was the service restarted?
[ ] Is there a log showing the rule actually matched?

What I learned

A rule in the UI is not the same as a rule that's active, matched, and applied.

Problem 6: I confused server-side `direct` with client-side `direct`

Symptoms

I expected:

If I set direct on the server, the destination will see my local network.

It didn't work that way. The destination still saw the VPS IP.

Cause

Server-side routing happens after traffic reaches the VPS.

So server-side direct means the VPS sends traffic directly to the internet — from the VPS. It doesn't mean the client skips the VPS and uses the local network.

If a site or service needs to see the client's local IP, that decision has to be made on the client side.

Fix

Move the logic to client-side routing:

Local-sensitive traffic → direct locally
Everything else → the selected route

The exact implementation depends on the client, but the decision point has to be in the right place.

How to check next time

[ ] Does the destination need to see the VPS IP or the local IP?
[ ] Is the routing decision being made on the client or server?
[ ] Does server-side direct actually solve this case?
[ ] Does the client support rule-based routing?
[ ] Can rules be applied by domain, app, or category?

What I learned

Server routing and client routing solve different problems. Using the wrong one creates a fix that looks logical on paper but can't work.

Problem 7: Ansible finished with `failed=0`, but the panel wasn't on the expected port

Symptoms

The playbook completed cleanly. It even printed the expected panel URL.

The browser couldn't open it. A server check showed the port wasn't listening. The logs showed the app had started on a different default port.

Cause

The playbook did its tasks. But the application's runtime state didn't end up where expected — the database or template didn't fully control the web panel settings, or the app overrode part of that state on startup.

Fix

Add an explicit post-start step to set panel credentials and port using the app's own supported command. Then restart the container and verify what's actually listening.
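For the "which port is actually listening" question, `ss` on the server is the ground truth, not the URL the playbook printed. A minimal sketch, assuming iproute2's `ss`; the parser reads `ss -Hltn` output from stdin so it can be checked against canned lines. Function names are illustrative.

```bash
# Succeed if the given TCP port appears as a listening local port in
# `ss -Hltn` output read from stdin. With -H there is no header, and with
# only -t selected the columns are: State Recv-Q Send-Q Local:Port Peer:Port.
port_in_ss_output() {
  local port="$1"
  awk -v p="$port" '
    { n = split($4, parts, ":"); if (parts[n] == p) found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Convenience wrapper for the live system.
port_is_listening() {
  ss -Hltn | port_in_ss_output "$1"
}
```

On the VPS, `port_is_listening 2053 && echo listening` (2053 is only an example port). If the container publishes ports instead of using host networking, also compare against `docker port <container>`.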

How to check next time

[ ] Did Ansible finish?
[ ] Is the container running?
[ ] What do the logs say?
[ ] Which port is actually listening?
[ ] Is that port exposed?
[ ] Does the browser reach it?
[ ] Does login work?

What I learned

failed=0 means the automation ran. It doesn't mean the service works.

Problem 8: A database template preserved routing but broke panel access

Symptoms

A copied database carried useful routing configuration. But the panel didn't behave:

- wrong port;
- missing web settings;
- login problems;
- unexpected default values everywhere.

Cause

The database had useful runtime state, but not everything was portable. Some settings get generated, stored differently, or are expected to be configured through the app — not by copying a file.

Fix

Use the database template for what it's good at. Then, after the container starts, set panel access explicitly through the app's supported command.

Correct order:

1. Put the database template in place
2. Start the container
3. Set username/password/port through the app command
4. Restart the container
5. Verify port, login, and runtime settings
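The order above can be written as a small script. A sketch under stated assumptions: the container already exists (created by `docker create` or compose) and is stopped, the destination path is the one bind-mounted into it, and the trailing arguments are whatever credential/port command your panel documents. This log doesn't name that command, so anything like `app-cli set-access` is hypothetical. In a real deployment each step would be an Ansible task; plain shell just makes the order visible.

```bash
# Apply the "template first, then app command, then restart" order.
#   $1  container name (already created, currently stopped)
#   $2  database template on the host
#   $3  destination path that the container bind-mounts
#   $@  (rest) the app's own credential/port command (hypothetical), run in the container
deploy_panel_access() {
  local container="$1" template_db="$2" dest_db="$3"
  shift 3
  # 1. Put the template in place, but never overwrite an existing runtime DB.
  if [[ -e "$dest_db" ]]; then
    echo "refusing to overwrite $dest_db; back it up and move it first" >&2
    return 1
  fi
  install -D -m 600 "$template_db" "$dest_db" || return 1
  # 2. Start the container so the app initializes on top of the template.
  docker start "$container" || return 1
  # 3. Set username/password/port through the app's supported command.
  docker exec "$container" "$@" || return 1
  # 4. Restart so the app comes up with the new panel settings.
  docker restart "$container" || return 1
  # 5. Verification (port, login, routing) is deliberately left to a separate step.
}
```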

How to check next time

[ ] Does the DB template exist before container start?
[ ] Is the container mounting the expected DB path?
[ ] Are routing settings there?
[ ] Are panel settings there?
[ ] Was the app command used for credentials and port?
[ ] Was the container restarted afterward?

What I learned

A runtime database can work as a template, but it's not a clean declarative config. Don't treat it like one.

Problem 9: SQLite edits looked correct but login still failed

Symptoms

The database showed the right-looking values for the username and password fields.

The panel still rejected every login attempt:

invalid credentials

Cause

The app almost certainly doesn't store or validate passwords as plain text. There's probably hashing, internal secrets, migrations, derived values, or additional state that a direct SQL edit doesn't touch.

Changing the visible value doesn't mean you've changed what the app actually checks.
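One way to approach "does the app hash passwords?" without changing anything is to look at the shape of the stored value. A rough heuristic sketch: it recognizes bcrypt's `$2a$`/`$2b$`/`$2y$` encoding, Argon2's `$argon2id$`-style prefix, and a 64-character hex string such as a SHA-256 digest. A match strongly suggests a plain-text SQL edit of that column won't work; no match proves nothing, since the app can still derive or check the value in other ways.

```bash
# Classify a stored credential value by its shape only. Heuristic, not proof.
credential_shape() {
  local v="$1"
  local bcrypt_re='^\$2[aby]\$[0-9]{2}\$[./A-Za-z0-9]{53}$'
  local argon2_re='^\$argon2(id|i|d)\$'
  local hex64_re='^[0-9a-fA-F]{64}$'
  if [[ "$v" =~ $bcrypt_re ]]; then
    echo bcrypt
  elif [[ "$v" =~ $argon2_re ]]; then
    echo argon2
  elif [[ "$v" =~ $hex64_re ]]; then
    echo hex64-digest
  else
    echo "unknown/plain"
  fi
}
```

To read the value without risking a write, open the database read-only, for example `sqlite3 -readonly panel.db "SELECT password FROM users LIMIT 1;"`. The file, table, and column names there are guesses and will differ per app.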

Fix

Don't update credentials through SQLite unless the app explicitly documents that workflow.

Use the built-in CLI or admin interface.

How to check next time

[ ] Does the app hash passwords?
[ ] Is there an official command for changing credentials?
[ ] Did the direct edits touch all the required fields?
[ ] Are the logs showing invalid credentials?
[ ] Is the browser holding onto an old session cookie?

What I learned

If the app gives you a command for changing credentials, use it. Direct database edits are a last resort, not a deployment workflow.

Problem 10: The binary path inside the container wasn't where I thought

Symptoms

A command like:

```bash
docker exec container_name app_command ...
```

didn't work. The container returned help text or behaved as if the arguments were wrong. The guessed binary path didn't exist.

Cause

The actual binary path inside the image was different from what I assumed. The image had changed, or it just didn't match examples from a different environment.

Fix

Find the binary inside the container instead of guessing it. Then use the real path in the Ansible task.
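Finding the binary can be done from the host in one command. A minimal sketch, assuming the image ships `sh` and `find`; very minimal or distroless images may not, in which case `docker image inspect` (to read the entrypoint) is a reasonable fallback. The binary name you pass in is whatever you're looking for; the function name is illustrative.

```bash
# Look for an executable by name inside a running container.
#   $1 container   $2 binary name   $3 search root (default "/")
# Prints the PATH hit (if any), then every owner-executable file with that name.
find_in_container() {
  local container="$1" name="$2" root="${3:-/}"
  docker exec "$container" sh -c '
    command -v "$1" 2>/dev/null
    find "$2" -xdev -type f -name "$1" -perm -100 2>/dev/null
  ' sh "$name" "$root"
}
```

Once `docker exec <container> <real-path> --help` works by hand, store that path in an Ansible variable instead of hard-coding it in the task.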

How to check next time

[ ] Does the binary actually exist where I think it does?
[ ] What does find show inside the running container?
[ ] Does the command work manually before automating it?
[ ] Did the Docker image version change?
[ ] Should the binary path be a variable?

What I learned

Never automate a guessed path. Verify it inside the running container first.

Problem 11: Resetting the database fixed login but wiped the routing config

Symptoms

Deleting the database and letting the app recreate it made the panel accessible again.

But the routing settings were gone.

Cause

The routing config lived in the old database. Resetting the database solved the panel access problem by throwing away the state that also held the useful configuration.

Fix

Don't delete runtime databases casually. Back up first.

If the database has useful routing state, keep it and fix credentials on top of it using the app's supported command — don't nuke the whole thing.
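A backup before any reset can be this small. A sketch under assumptions: the runtime DB is a SQLite file on a bind-mounted host path, and briefly stopping the container is acceptable, so the file and any `-wal`/`-shm` sidecar files are copied in a consistent state. If downtime isn't acceptable, the `sqlite3` CLI's `.backup` command is the usual online alternative. The function name is illustrative.

```bash
# Stop the container, copy the SQLite DB (plus -wal/-shm files if present)
# into timestamped backups, start the container again, and print the path
# of the main backup file.
#   $1 container   $2 host path of the DB   $3 backup directory
backup_runtime_db() {
  local container="$1" db="$2" dest_dir="$3" stamp f rc=0
  [[ -f "$db" ]] || { echo "no such DB: $db" >&2; return 1; }
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$dest_dir" || return 1
  docker stop "$container" >/dev/null || return 1
  for f in "$db" "$db-wal" "$db-shm"; do
    if [[ -e "$f" ]]; then
      cp -p -- "$f" "$dest_dir/$(basename -- "$f").$stamp" || rc=1
    fi
  done
  # Bring the app back even if a copy failed.
  docker start "$container" >/dev/null || rc=1
  if (( rc == 0 )); then
    printf '%s\n' "$dest_dir/$(basename -- "$db").$stamp"
  fi
  return "$rc"
}
```

Restoring is the same copy in reverse, again with the container stopped.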

How to check next time

[ ] Did I back up the database?
[ ] What important state lives in there?
[ ] Am I solving one problem by deleting another config I need?
[ ] Can I fix credentials without resetting the DB?
[ ] Can I export the important settings separately?

What I learned

"Reset everything" is not a fix if what you're deleting is the configuration you need.

Problem 12: Browser cookies created confusing errors after replacing the database or secrets

Symptoms

After swapping the database or secrets, the panel threw session or cookie-related errors. It looked like a backend problem, but only in the existing browser session.

Cause

The browser still had an old cookie signed with the old secret. After the change, that cookie was invalid — and the panel rejected it.

Fix

Open the panel in a private window, or clear site data for that host. Then log in fresh.

How to check next time

[ ] Did I replace the DB, secrets, or session config?
[ ] Does the problem happen in incognito/private mode?
[ ] Does clearing site data fix it?
[ ] Are the logs complaining about session or cookie validation?

What I learned

After changing secrets, always suspect a stale browser session before going deeper into the backend.

Problem 13: The Docker volume explained why the container used unexpected state

Symptoms

Container behavior didn't match a fresh image. Old settings survived container recreation. The app started with state that came from the host directory.

Cause

The container had a mounted directory for persistent data. That mount supplied the application state. Deleting and recreating the container didn't touch the mounted data on the host.

Fix

Check the mounted paths and the host-side files. Treat the mounted database/config directory as the real source of runtime state — not the image.
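Listing the mounts is the fastest way to find where "unexpected" state comes from. A sketch using `docker inspect`'s Go-template output (`.Mounts` entries with `.Type`, `.Source`, `.Destination`), plus a scan of each host-side directory for DB-looking files. It assumes mount paths without spaces; reading named-volume directories under `/var/lib/docker` usually needs root. The function name is illustrative.

```bash
# Show every mount of a container, then any DB-like files on the host side.
show_state_mounts() {
  local container="$1" type src dst
  docker inspect -f '{{range .Mounts}}{{.Type}} {{.Source}} {{.Destination}}{{"\n"}}{{end}}' "$container" |
    while read -r type src dst; do
      [[ -z "$type" ]] && continue
      echo "$type $src -> $dst"
      # Bind mounts and named volumes both have a host directory we can look in.
      if [[ -d "$src" ]]; then
        find "$src" -maxdepth 2 -type f \( -name '*.db' -o -name '*.sqlite*' \) -print |
          sed 's|^|  host file: |'
      fi
    done
}
```

If an old `*.db` shows up under a bind mount, that file, not the image, is what the app will start from; `ls -l` on it shows how old it is.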

How to check next time

[ ] What volumes or bind mounts are configured?
[ ] What host path is mounted into the container?
[ ] Does the host path have an old DB or config?
[ ] Does deleting the container actually delete state?
[ ] Is the intended template copied before container startup?

What I learned

A Dockerized app isn't just the image. It's:

image + env + command + mounts + existing host state

All of it matters.

Problem 14: Using `latest` made deployments unpredictable

Symptoms

Documentation, examples, and old commands didn't always match the running container. Paths and behavior could silently differ between versions.

Cause

The deployment depended on a moving tag. When the image changes, your assumptions can become wrong overnight without any warning.

Fix

Pin the image version. Upgrade intentionally:

1. Test on a non-critical server first
2. Check the logs
3. Check panel access
4. Check runtime config
5. Check routing
6. Check client behavior
7. Then upgrade the main server
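Whether a deployment is pinned can be checked mechanically. A sketch that treats an image reference as pinned if it carries a digest (`@sha256:...`) or an explicit tag other than `latest`; a reference with no tag counts as unpinned because Docker then assumes `latest`. A version tag can still be re-pushed upstream, so a digest is the stricter form. The function name is illustrative.

```bash
# Succeed if an image reference is pinned to a digest or a non-"latest" tag.
image_ref_is_pinned() {
  local ref="$1" name
  [[ "$ref" == *@sha256:* ]] && return 0   # digest-pinned
  name="${ref##*/}"                        # drop registry host and namespace
  [[ "$name" == *:* ]] || return 1         # no tag: Docker falls back to :latest
  [[ "${name##*:}" != latest ]]
}
```

Against a running container (the name `panel` is a placeholder): `image_ref_is_pinned "$(docker inspect -f '{{.Config.Image}}' panel)" || echo "unpinned image"`.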

How to check next time

[ ] Is the image version pinned?
[ ] Did behavior change after an image update?
[ ] Does the documentation match the exact version I'm running?
[ ] Is the binary path version-dependent?
[ ] Do I have a rollback plan?

What I learned

latest is fine for experiments. It's not a foundation for reproducible infrastructure.

Problem 15: The deployment copied secrets and runtime state too casually

Symptoms

The easiest working solution involved copying a real runtime database. That database could contain sensitive stuff:

- users;
- credentials;
- UUIDs;
- keys;
- panel settings;
- routing config;
- internal service details;
- session secrets.

Cause

Runtime databases are convenient because they contain everything. That convenience is exactly the risk.

Fix

Don't publish real runtime databases. For automation, prefer:

- sanitized templates;
- generated credentials;
- variables stored outside Git;
- private backups;
- explicit post-deploy commands;
- verification steps.
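A cheap guard for "is this file safe to commit?" is to look for SQLite databases among tracked files by content rather than by name, since a runtime DB can have any extension. A sketch: SQLite 3 files start with the 16-byte header `SQLite format 3` followed by a NUL byte, so comparing the first 15 bytes is enough for a warning. It assumes bash 4.4+ (for `mapfile -d`); function names are illustrative.

```bash
# Print each argument that is a SQLite 3 database, judged by its file header.
sqlite_files() {
  local f
  for f in "$@"; do
    [[ -f "$f" ]] || continue
    if [[ "$(head -c 15 -- "$f" 2>/dev/null)" == "SQLite format 3" ]]; then
      printf '%s\n' "$f"
    fi
  done
}

# Check everything git tracks or has staged (run inside the repository).
tracked_sqlite_files() {
  local files=()
  mapfile -d '' files < <(git ls-files -z)
  (( ${#files[@]} == 0 )) && return 0
  sqlite_files "${files[@]}"
}
```

Any output from `tracked_sqlite_files` is worth investigating before a push; a `.gitignore` entry like `*.db` covers the common case but not databases with unusual names.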

How to check next time

[ ] Is this file safe to commit?
[ ] Could it contain tokens, UUIDs, passwords, keys, or routing internals?
[ ] Is it a real DB from a working server?
[ ] Can I generate this state instead of copying it?
[ ] Is .gitignore protecting runtime files?

What I learned

If a file makes deployment magically easy, it probably contains sensitive state.

Problem 16: The original checklist was too optimistic

Symptoms

The early deployment checklist looked something like:

[ ] Ansible completed
[ ] Panel URL printed
[ ] Container started

That wasn't enough. The system could pass all three and still be broken.

A more honest checklist

[ ] Container is running
[ ] Logs show the expected service startup
[ ] Expected port is actually listening
[ ] Browser can reach the panel
[ ] Login works
[ ] Runtime DB exists
[ ] Important settings survived
[ ] Credentials were set through the app command, not raw SQL
[ ] No broad routing rule is breaking unrelated apps
[ ] Specific rules sit above fallback rules
[ ] Sniffing/protocol detection is enabled where needed
[ ] Service was restarted after config changes
[ ] Client-side behavior tested from the actual device or app
[ ] Browser cookies cleared if secrets changed
[ ] Runtime DB backed up before any destructive actions
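Much of the honest checklist can be automated. A minimal sketch of a runner that executes each check, prints PASS/FAIL, and counts failures. The example checks in the comments use a hypothetical container `panel`, port `2053`, and URL, and assume `ss` and `curl` exist on the host.

```bash
# Tiny behavior-check runner: run a command, report PASS/FAIL, count failures.
CHECK_FAILURES=0
check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$name"
  else
    printf 'FAIL  %s\n' "$name"
    CHECK_FAILURES=$((CHECK_FAILURES + 1))
  fi
}

# Example checks (hypothetical names; uncomment and adapt):
# check "container running" test "$(docker inspect -f '{{.State.Running}}' panel)" = true
# check "port 2053 listening" sh -c 'ss -Hltn | grep -q ":2053 "'
# check "panel reachable" curl -fsS -o /dev/null https://panel.example.test:2053/
# (( CHECK_FAILURES == 0 )) || exit 1
```

Login, surviving settings, and client-side behavior still need a human or app-specific checks; the runner only makes the "did it actually work" part impossible to skip.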

What I learned

A good checklist verifies behavior, not just deployment steps.

Final debugging order

If something breaks again, I'd work through it like this:

1. Identify the exact failing app.
2. Check whether only this app fails or everything does.
3. Check whether the app uses system proxy or bypasses it.
4. Look at client-side routing before server-side routing.
5. Test IPv4 and IPv6 separately.
6. Check whether the issue is transport-specific.
7. Avoid broad protocol/port rules.
8. Check whether domain rules can actually match.
9. Check rule order.
10. Restart the real service after changes.
11. Inspect Docker mounts and host-side state.
12. Verify actual listening ports.
13. Use app-supported commands for credentials.
14. Clear cookies after DB or secret changes.
15. Back up the runtime DB before resetting anything.
16. Pin versions before expecting reproducible deploys.

Final thought

The biggest lesson wasn't about any specific proxy, panel, VPS, or Ansible role.

It was that almost every bug here was a "wrong layer" bug.

I tried to fix client-side behavior by changing something on the server.

I trusted terminal checks for an app that ignored system proxy.

I treated a runtime database like a clean config file.

I trusted Ansible's success output without checking what was actually running.

I changed a broad network rule and then wondered why something else broke.

There was no magic config. The fix was learning to ask better questions:

- Which layer is making this decision?
- What traffic is actually affected?
- Is this rule too broad?
- Is the config saved, or is it running?
- Is this state from the image or from a mounted database?
- Does the app support this kind of credential change?
- Did I verify actual behavior, or just the deployment output?

That's the part worth carrying into the next project.
