
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Drawing Technical Boundaries in Network Consulting in 3 Steps

Few things are as frustrating as designing and delivering a network topology, only to have the phone ring with someone saying, "Mustafa Bey, the internet is down." The biggest lesson 20 years of field experience has taught me is this: any network whose boundaries you don't define with technical rules and clear protocols will eventually suffocate you. If you don't establish where you stand at the beginning of a project, you'll find yourself adjusting the vSwitch settings of a virtualization server you have nothing to do with at 3:00 AM on a Sunday.

In this post, I explain how I draw the line between "where does my responsibility end, and where does the client's system administrator's or developer's responsibility begin" in infrastructure projects, in three fundamental steps and with concrete technical scenarios. My goal is to protect my own sanity and ensure the sustainability of the work I deliver.

Defining the Responsibility Matrix at the L2/L3 Level

The most frequent conflicts in network projects occur in the transition zone between physical switches and the virtual layer (hypervisor). I consider my job done after defining VLANs 10, 20, and 30 on the core switch and preparing the 802.1Q trunk port. However, when the system administrator on the other end misconfigures the port groups on the virtualization server, the blame immediately falls on me with a "network is down" complaint.

To prevent this confusion, I draw the boundary of responsibility at the exit of the physical switch port. If a packet leaves my switch with a tag and reaches the other side, my task is officially complete. The inability of the operating system within the virtual machine to get an IP or resolve the VLAN tag is entirely the responsibility of the system team.

The moment I deliver a Cisco IOS configuration like the one below, the vSwitch settings of the server on the other end of the line become the client's responsibility:

interface GigabitEthernet1/0/24
 description TO_ESXI_HOST_01
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 10,20,30
 spanning-tree portfast trunk
!

If the show interfaces trunk command confirms that this port is trunking with 802.1Q and that VLANs 10, 20, and 30 are allowed and forwarding on it, that link is clean for me. I once worked on a production ERP project where, because I hadn't clearly defined this boundary, a wrong virtual switch configuration by the server team stopped the entire factory's shipping line for 4 hours and put the contract at risk. Since that day, I have never compromised on this rule.

⚠️ The Virtual Switch Trap

Setting "VLAN ID: 0 (None)" on a port group in a virtualization environment tells the vSwitch to expect untagged frames, so tagged packets arriving from the physical switch trunk never reach the virtual machine. This is a system configuration error, not a network failure.
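One way to make this handover boundary checkable is to compare the VLAN list allowed on the trunk with the VLAN IDs the system team plans to use on their port groups. The sketch below is only an illustration: `vlans_missing` is a hypothetical helper, not part of any switch or hypervisor tooling, and it assumes both lists are typed in by hand as comma-separated values.

```shell
# Hypothetical helper: print the VLAN IDs from the "required" list that are
# not covered by the trunk's "allowed" list. Both arguments are
# comma-separated; the allowed list may contain ranges such as 10-30.
# The body runs in a subshell so the IFS change does not leak out.
vlans_missing() (
  allowed=$1
  required=$2
  missing=""
  IFS=,
  for want in $required; do
    found=no
    for item in $allowed; do
      case $item in
        *-*)
          lo=${item%-*}
          hi=${item#*-}
          if [ "$want" -ge "$lo" ] && [ "$want" -le "$hi" ]; then found=yes; fi
          ;;
        *)
          if [ "$want" -eq "$item" ]; then found=yes; fi
          ;;
      esac
    done
    if [ "$found" = no ]; then missing="${missing:+$missing,}$want"; fi
  done
  printf '%s\n' "$missing"
)
```

Calling `vlans_missing "10,20,30" "10,20,40"` prints the IDs that the trunk would not carry, which gives both teams a concrete list to settle before go-live.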

"It Distributed an IP, the Rest is Not My Problem": DHCP and DNS Boundaries

The biggest nightmare for a network administrator is client devices failing to get an IP or performing incorrect DNS resolution. In these services, which operate according to RFC 2131 (DHCP) and RFC 1035 (DNS) standards, I always draw my boundary at the gateway device. The scope of the DHCP server being full or the DNS server performing negative caching is a system services problem, not a network problem.

In one project, users were complaining that "some sites are not opening." My analysis showed that the local DNS server was returning an NXDOMAIN response for addresses it couldn't resolve, and clients were caching this response for 15 minutes. There was no packet loss on the network, routing tables were clean; the problem was entirely with the forwarder settings of the local DNS server.

In such situations, to show where my responsibility ends, I run the following dig query directly on the client and demonstrate that the problem is independent of the network:

dig @192.168.10.1 mustafaerbay.com.tr +trace

If the query leaves my gateway device without loss but gets stuck at the internal Active Directory DNS server, I tell the system administrator, "Fix your DNS records," and step away. In the infrastructure of my own side products, I guard against such insidious errors by setting up external, redundant DNS resolvers.
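To turn that comparison into something repeatable, the response code can be pulled out of dig's header line and compared across resolvers. This is a minimal sketch under assumptions: `dig_status` and `resolver_status` are hypothetical helper names, and the parsing relies on the `status:` field that dig prints in its `->>HEADER<<-` line.

```shell
# Print the response code (NOERROR, NXDOMAIN, SERVFAIL, ...) found in the
# header of dig output read from standard input.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Hypothetical wrapper: ask one resolver for one name and print its status.
# Needs the dig binary and network access, so it is not exercised here.
resolver_status() {
  dig "@$1" "$2" +time=2 +tries=1 | dig_status
}
```

If `resolver_status 192.168.10.1 <name>` reports NXDOMAIN while the same name resolves through an external resolver of your choice, the evidence points at the internal DNS service and not at the path through the gateway.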

"Rule Book" Method in Firewall and Access Policies

Developers often blame firewall rules first when a service isn't working. Every time I hear, "Mustafa Abi, the port is probably closed," I stay calm and ask for the telnet or nc (netcat) test results. Most of the time, the firewall isn't blocking the port at all; the application is listening on 127.0.0.1 instead of 0.0.0.0.

For drawing boundaries in firewall management, I apply the "Rule Book" method. Before granting any access permission, I obtain written approval from the requesting team containing the source IP, destination IP, protocol, and port information. I do not activate any rules without this approval and never allow temporary rules.
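The four fields in that written approval lend themselves to a quick sanity check before a request is even read by a human. The following is a sketch only: `is_ipv4` and `rule_request_ok` are hypothetical helpers, and the check assumes IPv4 addresses, TCP or UDP, and a single port per request.

```shell
# Succeed only if the argument looks like a dotted-quad IPv4 address.
# Runs in a subshell so the IFS change stays local.
is_ipv4() (
  case $1 in
    ''|*[!0-9.]*|.*|*.|*..*) exit 1 ;;
  esac
  IFS=.
  set -- $1
  [ $# -eq 4 ] || exit 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || exit 1
  done
)

# Hypothetical check for one access request:
#   rule_request_ok SOURCE_IP DESTINATION_IP PROTOCOL PORT
# Exit status 0 means all four fields are present and plausible.
rule_request_ok() {
  src=$1 dst=$2 proto=$3 port=$4
  is_ipv4 "$src" || return 1
  is_ipv4 "$dst" || return 1
  case $proto in tcp|udp) ;; *) return 1 ;; esac
  case $port in ''|*[!0-9]*) return 1 ;; esac
  [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}
```

A request such as `rule_request_ok 10.0.10.5 10.0.20.8 tcp 8080` passes, while one with a missing port or a destination of "any" is rejected and goes back to the requesting team.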

To verify if a server's port is actually listening, I ask the system team for the output of this command:

ss -tulpn | grep ':8080'

If the output shows only localhost, like the following, there's no point in opening a rule on the firewall because the application is closed to the outside world:

tcp   LISTEN 0      128      127.0.0.1:8080      0.0.0.0:*    users:(("node",pid=1234,fd=18))

We once argued for 4 hours with a team that blamed the network for what turned out to be PostgreSQL connection pool exhaustion. Once I proved to them with this method that the database server had reached its socket limit and was not accepting new connections, they had to fix their software architecture.
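The loopback check on the ss output can be reduced to a small classifier, which is handy when the system team sends a long listing. This is a sketch under assumptions: `bind_scope` is a hypothetical helper that takes the "Local Address:Port" column exactly as ss prints it.

```shell
# Classify one "Local Address:Port" value from ss output.
# Prints "loopback" when the socket only accepts local connections,
# otherwise "external" (0.0.0.0, *, [::] or a specific interface address).
bind_scope() {
  addr=${1%:*}   # strip the trailing :port, keeping brackets for IPv6
  case $addr in
    127.*|'[::1]'|localhost) echo loopback ;;
    *) echo external ;;
  esac
}
```

For the sample line above, `bind_scope 127.0.0.1:8080` prints `loopback`, so no firewall rule would help. On hosts with a recent iproute2, the address column can probably be extracted with something like `ss -Htln 'sport = :8080' | awk '{print $4}'`, though the column position is worth confirming on your own ss version.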

Client Boundary in VPN and ZTNA Integrations

With the rise of remote work models, VPN and Zero Trust Network Access (ZTNA) projects have become central to our lives. However, there's a significant boundary problem here too. Securing the company network is my job, but fixing the Wi-Fi channel interference on a user's 10-year-old ADSL modem is not.

In VPN tunnels, my boundary is the client software reporting "Connected" and being able to ping the internal DNS server. If the user's Internet Service Provider (ISP) places them behind CGNAT, or their home equipment has an MTU (Maximum Transmission Unit) mismatch, that cannot be the responsibility of the network consultant.

For insidious problems like web pages loading only partially because of MTU/MSS mismatches, I apply TCP MSS clamping on the firewall, so the part I control is handled at the network layer:

iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

While migrating the VPS infrastructure for my own side product, I experienced a similar MTU mismatch. If you don't size the packets passing through the tunnel correctly, oversized packets get fragmented or dropped along the way, which looks like packet loss. It's important to rule this out first so that you can tell the client with confidence, "The problem is with your computer or modem."
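The arithmetic behind MSS clamping and path MTU testing is simple enough to keep in a helper. This sketch assumes IPv4 with no IP or TCP options (20 bytes each) and an 8-byte ICMP echo header; the function names are hypothetical.

```shell
# TCP MSS for a given path MTU: subtract the IPv4 header (20 bytes) and
# the TCP header (20 bytes), both without options.
mss_for_mtu() {
  echo $(( $1 - 40 ))
}

# Largest ICMP echo payload that fits in one packet for a given MTU:
# subtract the IPv4 header (20 bytes) and the ICMP header (8 bytes).
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))
}
```

On Linux, a test such as `ping -M do -s "$(ping_payload_for_mtu 1400)" <host>` sends packets with the Don't Fragment bit set, so a tunnel whose effective MTU is below 1400 fails immediately instead of silently fragmenting.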

Bandwidth and QoS Conflicts: Whose Traffic is It?

"The internet is slow" is a complaint that everyone claims expertise in, even without technical depth. As a network consultant, I determine the line's capacity and how this capacity will be divided (QoS - Quality of Service). However, a user hogging all the bandwidth by streaming 4K video in the background is not related to network design but to the company's administrative policies.

In such situations, I draw my boundary with traffic shaping rules. I always prioritize critical business applications (e.g., ERP or VoIP voice packets). I then limit general internet traffic with hard limits.

Here's a simple set of rules showing how I control bandwidth using tc (traffic control) on a Linux-based gateway:

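# Create the root HTB qdisc; unclassified traffic falls into class 1:30
tc qdisc add dev eth0 root handle 1: htb default 30

# Parent class for the whole link (20 Mbit/s assumed here); child classes
# can only borrow up to their ceil from a shared parent
tc class add dev eth0 parent 1: classid 1:1 htb rate 20mbit

# Guaranteed 10 Mbit/s for critical traffic, may borrow up to 20 Mbit/s
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 10mbit ceil 20mbit

# 2 Mbit/s for general internet traffic, may borrow up to 5 Mbit/s
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 2mbit ceil 5mbit

# Example classifier (assumption): send DSCP EF (VoIP) packets to 1:10
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip tos 0xb8 0xfc flowid 1:10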

After implementing these rules, I directly show the QoS metrics to anyone complaining, "The internet is slow." If the packets for critical systems are going through without loss, my network is doing its job. The remaining administrative aspects are the concern of HR or general management.

ℹ️ QoS and DSCP

To achieve end-to-end QoS, you must ensure that DSCP (Differentiated Services Code Point) markings are honored not only by your own switches but also by the ISP beyond your company's egress router. Otherwise, your packets will be treated as ordinary traffic the moment they leave for the internet.
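When checking whether markings survive a hop, it helps to remember that capture tools and classifiers usually show the full ToS/Traffic Class byte, in which the DSCP value occupies the upper six bits. A minimal conversion sketch follows; `dscp_to_tos` is a hypothetical helper name.

```shell
# Convert a DSCP value (0-63) to the ToS byte as it appears on the wire:
# DSCP sits in the upper six bits, so shift left by two.
dscp_to_tos() {
  printf '0x%02x\n' $(( $1 << 2 ))
}
```

For example, DSCP 46 (Expedited Forwarding, commonly used for VoIP) becomes `0xb8`, which is the value to look for in a packet capture on the far side of the ISP link.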

Proof of "System is Running": Metrics and Monitoring Infrastructure

The only way to draw all these boundaries and defend yourself is to have concrete technical data (metrics) in hand. As a network consultant, I always integrate an independent monitoring system into every project I deliver. ICMP latency, packet loss, traffic intensity on switch ports, and CPU/Memory usage must be recorded in real-time.

Just as the entire system locks up when we don't monitor Docker container disk usage, if you don't foresee network anomalies, you'll always be blamed. I generally use the Prometheus and SNMP exporter duo to continuously monitor critical devices.

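The following Prometheus alert rule warns me when an interface drops more than 10 received packets per second for 2 minutes straight, allowing me to find the source of the problem before the client even notices. The metric shown here is the receive-drop counter exposed by node_exporter on Linux hosts: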

groups:
  - name: network_alerts
    rules:
      - alert: HighPacketLoss
        expr: rate(node_network_receive_drop_total[5m]) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High packet loss on network interface: {{ $labels.device }}"

With these metrics, when the client says, "There's a network issue," I can present them with the uptime and packet loss graphs for the last 30 days. I close the discussion by saying, "Look, packet loss is 0.01%, and latency is 2ms. The problem isn't with the network; it's with your server's disk I/O bottleneck."
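For a quick spot check alongside the long-term graphs, the loss figure can be pulled straight out of a ping summary. This is a sketch that assumes the usual "N% packet loss" summary line printed by common ping implementations; `ping_loss_percent` is a hypothetical helper.

```shell
# Read ping output on standard input and print the packet loss percentage
# from its summary line (digits only, without the percent sign).
ping_loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p' | head -n 1
}
```

Something like `ping -c 20 <gateway> | ping_loss_percent` gives a number that can be quoted in the same conversation as the 30-day graphs.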

In the next step, we will cover how to perform in-depth traffic analysis on the network and set up flow-based monitoring (NetFlow/sFlow) systems to detect subtle DDoS attacks in advance. Protect your network with rules, so your network can protect you.
