GWLB in Production: 9 Pitfalls That Break Your Firewall Architecture
As a Cloud Engineer, I have frequently implemented solutions that enhance both network and application security in client infrastructures. One of the most common choices is the Palo Alto VM-Series firewall, built specifically for public clouds. Implementing VM-Series, however, isn’t as straightforward as it sounds in theory. To achieve a truly functional infrastructure, many other resources must be deployed around the firewalls themselves. Take AWS, for example. One of the most popular patterns uses a Gateway Load Balancer (in fact, inline appliance insertion is one of the reasons AWS introduced this load balancer type). Choosing GWLB, however, brings dependencies of its own: Gateway Load Balancer Endpoints, which should live in dedicated subnets, each of which in turn needs a correctly configured route table. Ultimately, it turns out that it’s best to encapsulate the security portion of the infrastructure within a dedicated VPC. But separate VPCs need to be connected to the other Virtual Private Clouds somehow, so that traffic is actually filtered and inspected by the firewalls. This is where Transit Gateway comes in.
As you can see, simply gathering dependencies is no easy task, let alone configuring them. In this article, I’d like to focus on a few key aspects that can save you time if you choose this architecture. I’ve implemented this solution numerous times for clients across various industries. As I walk through the configuration process, I’ll describe some not-so-typical issues, but ones that might give you a few extra gray hairs.
The architecture
Before diving into the pitfalls, here’s the centralized inspection architecture this article is about:

1. Asymmetric traffic forwarding without TGW Appliance Mode
We’re considering a scenario where we implement our solution in a centralized architecture. The Transit Gateway is responsible for sending traffic between VPCs. Now let’s imagine this situation (let’s trace the packet flow together).
A virtual machine (let’s call it app_vm) in Spoke VPC attempts to send a packet to a second virtual machine in another Spoke VPC (let’s call it db_vm). app_vm is located in AZ A, db_vm is located in AZ B. Here’s what happens:
- app_vm initiates a connection. It checks the routing table in its subnet, which states that every packet destined for the 172.16.0.0/16 subnet is sent to the Transit Gateway.
- Transit Gateway receives the packet from the VPC where app_vm is located. It checks the routing table associated with that VPC. The routing table clearly states: send this packet to Security VPC.
- Transit Gateway forwards the packet to Security VPC. And here’s a very important point that will have consequences later. Due to AZ affinity (TGW’s default behavior - it sends traffic to the same AZ the packet originated from), the packet is sent to the Transit Gateway Attachment subnet in AZ A.
- The Transit Gateway Attachment subnet in AZ A receives the packet and forwards it to the Gateway Load Balancer Endpoint, also in AZ A.
- The packet reaches the Gateway Load Balancer and is then forwarded to the VM-Series in AZ A.
- Policies configured on the firewall allow the packet to pass through, so the packet is sent to the Gateway Load Balancer Endpoint subnet (AZ A) and then to the Transit Gateway.
- The Transit Gateway receives the packet from the Security VPC and forwards it based on the routing table to the Spoke VPC where db_vm is located. The packet reaches the destination machine.
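The hops above translate into two kinds of route entries. Here is a Terraform sketch of both; the resource names and the 172.16.0.0/16 summary CIDR are illustrative assumptions, not values from the diagram:

```hcl
# Spoke subnet: send inter-VPC traffic to the Transit Gateway
resource "aws_route" "spoke_to_tgw" {
  route_table_id         = aws_route_table.spoke_app.id
  destination_cidr_block = "172.16.0.0/16"
  transit_gateway_id     = aws_ec2_transit_gateway.inspection.id
}

# Security VPC, TGW attachment subnet (AZ A): hand packets to the GWLB endpoint
resource "aws_route" "tgw_attach_to_gwlbe_az_a" {
  route_table_id         = aws_route_table.tgw_attach_az_a.id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = aws_vpc_endpoint.gwlbe_az_a.id
}
```

Note that a route pointing at a Gateway Load Balancer Endpoint uses `vpc_endpoint_id` as its target, not a gateway or ENI ID.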
Sounds good, right? Now let’s trace the return traffic.
- db_vm responds to the request received from app_vm. It checks the routing table in its subnet, which says that a packet destined for 192.168.0.0/24 should be sent to the Transit Gateway. It does so.
- The Transit Gateway receives this packet, checks the routing table, and forwards it to the Security VPC.
- This is the key moment. Due to the same AZ affinity mechanism, the Transit Gateway sends this packet to the Transit Gateway Attachment subnet in AZ B - because db_vm is in AZ B. This is not random - TGW deterministically picks the AZ based on where the packet entered.
- The packet is forwarded to the Gateway Load Balancer Endpoint in the subnet in AZ B. The packet is then forwarded to the Gateway Load Balancer, which forwards it to the VM-Series in AZ B.
- The VM-Series in AZ B receives the packet and thinks, “What is this? I have no idea what this session is about.”
- DROP.
Fortunately, solving this problem is incredibly simple (but only if you understand the problem). In the Transit Gateway VPC attachment configuration, simply enable the Appliance Mode option. This changes TGW’s forwarding logic from AZ affinity to a flow hash based on the 4-tuple (source IP, destination IP, source port, destination port) - ensuring both directions of a flow are always delivered to the same AZ in the Security VPC. This option is not enabled by default.
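In Terraform this is a single argument on the VPC attachment. A minimal sketch (resource names assumed for illustration):

```hcl
resource "aws_ec2_transit_gateway_vpc_attachment" "security" {
  transit_gateway_id = aws_ec2_transit_gateway.inspection.id
  vpc_id             = aws_vpc.security.id
  subnet_ids         = aws_subnet.tgw_attach[*].id

  # Default is "disable" - this one line prevents the asymmetric drop above
  appliance_mode_support = "enable"
}
```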
Sources: AWS Docs: Transit Gateway Appliance Mode, AWS Prescriptive Guidance: Transit Gateway asymmetric routing
2. Fail-open when all targets are unhealthy
Imagine an extremely rare, but still possible, situation. All your firewalls in all AZs in your Security VPC become inoperable for some reason. The Target Group associated with GWLB sees them all as unhealthy. What comes to mind first? That GWLB will drop the traffic rather than forward it to unhealthy instances. That seems logical - unfortunately, it’s not true.
GWLB will go into fail-open mode. What does this mean for you? It depends. If the firewall is in a crashed or terminated status, the traffic will indeed stop at the firewall and be dropped. However, if the firewall is in an up state but health checks fail (e.g., due to a CPU spike, a license expiry, or a bad Panorama push), the firewall can let this traffic through without inspection. This is a real security bypass.
How can you protect against this? There are several options.
- Configuring a CloudWatch alarm on `UnHealthyHostCount` is a must - so you’re at least aware that there might be a threat.
- Setting `target_failover.on_unhealthy` to `rebalance` will rehash flows to healthy targets. Note that this helps when only some targets are unhealthy - if all targets are down, there’s nowhere to rebalance to.
- A great, though slightly more advanced, solution is a Lambda-based kill switch: when this situation occurs, the function modifies the route tables to blackhole the traffic.
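The first two mitigations can be sketched in Terraform. The names, the health check port, and the alarm thresholds are assumptions for illustration:

```hcl
# GENEVE target group that rebalances flows away from unhealthy firewalls
resource "aws_lb_target_group" "fw" {
  name     = "vmseries-geneve"
  port     = 6081
  protocol = "GENEVE"
  vpc_id   = aws_vpc.security.id

  target_failover {
    on_deregistration = "rebalance"
    on_unhealthy      = "rebalance"
  }

  health_check {
    port     = 443
    protocol = "TCP"
  }
}

# Alarm as soon as any firewall goes unhealthy
resource "aws_cloudwatch_metric_alarm" "fw_unhealthy" {
  alarm_name          = "gwlb-unhealthy-hosts"
  namespace           = "AWS/GatewayELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  dimensions = {
    LoadBalancer = aws_lb.gwlb.arn_suffix
    TargetGroup  = aws_lb_target_group.fw.arn_suffix
  }
}
```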
Sources: AWS Docs: Health checks for GWLB target groups, AWS Whitepaper: GWLB with TGW for centralized security
3. The real cost stack
It’s generally accepted that GWLB costs around $0.014 per AZ per hour. That’s true - but it’s only the GWLB itself. The table below lists the actual cost stack:
| Component | Cost basis | 3-AZ, 2 FW/AZ, 1TB/mo |
|---|---|---|
| GWLB hourly | $0.014/AZ-hour | $31 |
| GWLB usage (GLCU) | $0.004/GLCU-hour | ~$50 |
| GWLBE hourly (PrivateLink) | $0.011/hour per endpoint | $24 |
| GWLBE data processing | $0.01/GB | $10 |
| Cross-AZ data transfer | $0.01/GB each direction | $20 |
| TGW attachment | $0.07/hour per attachment | $153 |
| TGW data processing | $0.02/GB | $20 |
| EC2 instances (6x c5n.xlarge) | ~$0.34/h per instance | $1,489 |
| Subtotal (infra only) | | ~$1,797/mo |
| VM-Series PAYG license | $1.71/h per instance | $7,490 |
| Total with PAYG | | ~$9,287/mo |
| VM-Series BYOL license (amortized) | varies | ~$2,400-3,600 |
| Total with BYOL | | ~$4,197-5,397/mo |
As you can see, your monthly invoice doesn’t include just the GWLB itself. You budgeted around $500, and at the end of the month, you receive an invoice for ~$9,000 (depending on the region). Consider an alternative - perhaps a native AWS firewall will suffice for your needs, costing around $750 per month. (But of course, this also cuts out many features - I described this in more detail in this article.)
And another “pleasant” surprise: if you enable cross-zone load balancing on GWLB, remember that you pay $0.01/GB for each cross-AZ hop. Factor this in when planning your HA architecture.
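For GWLB, cross-zone load balancing is an attribute of the load balancer itself. A minimal Terraform sketch (names assumed):

```hcl
resource "aws_lb" "gwlb" {
  name               = "security-gwlb"
  load_balancer_type = "gateway"
  subnets            = aws_subnet.gwlb[*].id

  # Every GB that crosses an AZ boundary is billed at $0.01 in each direction
  enable_cross_zone_load_balancing = true
}
```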
Sources: AWS ELB Pricing, AWS PrivateLink Pricing
4. Palo Alto overlay routing - not a silver bullet
Overlay Routing in VM-Series can be a great solution: you don’t need separate native AWS NAT Gateways, because internet-bound traffic exits directly through the firewall’s public interface. That’s all great - but this configuration only works for outbound traffic.
What about inbound traffic? The firewall will inspect the packet, apply overlay routing, and instead of returning the packet back through the GWLB endpoint, it will send it out through its public interface. The result - asymmetric routing and dropped connections.
East-west traffic (VPC-to-VPC) in a centralized TGW architecture is a different story - it actually works fine with overlay routing. The packets have private destination IPs, so the firewall’s L3 lookup routes them back via the GENEVE interface, not out the public interface.
But there are solutions for combined traffic too.
First and foremost, consider whether you really need overlay routing. If it’s only going to inspect outbound traffic, then yes, it’s a shame not to take advantage of this option.
If you need inbound traffic handling but don’t want to give up overlay routing, don’t worry. You’ll need to spend a bit more time on configuring subinterfaces and virtual routers, but it can be done while maintaining full functionality.
One more thing worth mentioning - there was a confirmed bug (PAN-229985, fixed in PAN-OS 11.1.3) where GWLB overlay routing packets were re-encapsulated with an incorrect flow cookie in the GENEVE header. Some of the issues reported on LIVEcommunity may have been caused by this bug rather than an architectural limitation. Make sure you’re running a version with this fix.
Finally, before you decide to deploy this solution to production, test it in a test environment.
Sources: Palo Alto: Enable Overlay Routing for VM-Series on AWS, LIVEcommunity: Overlay Routing with GWLB for Combined Model (SOLVED), LIVEcommunity: Issues with Overlay Routing and GWLB
5. PAN-OS version roulette
Remember - there’s no operating system in the world that’s bug-free. PAN-OS is no exception. Some versions of PAN-OS have problems coexisting with GWLB, particularly when overlay routing is enabled:
| PAN-OS Version | GWLB Status |
|---|---|
| 10.1.5-h5 | Working |
| 10.1.6 | Broken (fix in 10.1.6-h6) |
| 10.1.7 | Working |
| 10.2.2 | Broken |
| 10.2.3-h2 | Issues reported |
| 11.0.0 (EOL) | Issues reported |
We usually assume that the newer version will be better than the previous one. We decide to upgrade (because who would test anyway…). Well, we updated our version to the latest one and… something’s not right. Gateway Load Balancer Endpoints don’t work, but they don’t show any errors either.
The solution is brutally simple, but many users seem to forget this. TEST the new PAN-OS version in a non-production environment. Don’t go straight to production with untested software. When you buy new running shoes, do you immediately wear them in the most important race of your life, or do you test them during training sessions to make sure they really suit you?
Sources: LIVEcommunity: Overlay Routing + GWLB issues, LIVEcommunity: GWLB VPC Endpoint broken post-upgrade
6. NAT on the firewall breaks traffic
Are you an administrator managing firewalls at your on-prem location and have been tasked with deploying VM-Series in the cloud? I’d bet your intuition (and probably rightly so) tells you that one of the most important configurations will be the correct NAT settings on the firewalls. You apply the same pattern to the Cloud Firewall with GWLB and… it doesn’t work? No wonder.
GWLB validates the 5-tuple of return packets against its connection state table. If you’ve set up DNAT on the firewall, the 5-tuple no longer matches, so GWLB will drop the packet. But don’t make it too easy - you won’t get a clear error (and forget about the logs).
When using GWLB, you don’t need to NAT on the VM-Series. If you carefully examine the architecture (the one at the beginning of the article), you’ll notice that using a NAT Gateway is enough to handle outbound traffic. Unless you’re using overlay routing (see section 4), in which case the firewall handles outbound NAT directly.
Sources: AWS re:Post: NAT on Palo FW with GWLB, AWS Best practices for GWLB
7. The debugging nightmare
Gateway Load Balancer is a brilliant AWS solution… but not for debugging traffic problems.
To put it bluntly, even VPC Flow Logs won’t help much here. The problem is that GWLB encapsulates traffic in the GENEVE protocol on UDP port 6081. Instead of the actual source and destination addresses, you’ll see internal addressing that tells you nothing.
Make one mistake in any routing table and you’re in… a black hole. Look at the architecture diagram to see how many routing tables appear in the VPC itself (and add the corresponding routing tables in TGW, in the Spoke VPCs). You have to be careful, and honestly, I don’t have a silver bullet.
What can help?
- Flow Logs on the Gateway Load Balancer Endpoint interface with custom fields: `${pkt-srcaddr}`, `${pkt-dstaddr}`, `${flow-direction}`, `${tcp-flags}`
- Logs directly on the VM-Series
- AWS Reachability Analyzer
- Simultaneous tcpdump on client, server, and firewall interfaces
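An endpoint-ENI Flow Log with that custom format can be set up like this in Terraform; the `$$` escaping keeps Terraform from treating the fields as its own interpolation, and all names are illustrative assumptions:

```hcl
resource "aws_flow_log" "gwlbe" {
  eni_id               = data.aws_network_interface.gwlbe_az_a.id
  traffic_type         = "ALL"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.gwlbe.arn
  iam_role_arn         = aws_iam_role.flow_logs.arn

  # pkt-srcaddr/pkt-dstaddr show the real endpoints, not the GENEVE outer header
  log_format = "$${pkt-srcaddr} $${pkt-dstaddr} $${flow-direction} $${tcp-flags}"
}
```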
8. One Security VPC or two?
If you need to inspect both east-west (VPC-to-VPC) and north-south (internet ingress/egress) traffic, you might wonder whether one Security VPC is enough.
The good news - a single Security VPC with Appliance Mode ON works for both traffic types. North-south traffic is not broken by Appliance Mode. For internet-bound traffic (where the destination has no AZ), TGW with Appliance Mode selects the ENI in the source AZ anyway - so it behaves almost identically to the default AZ affinity.
So why do some AWS guides recommend two separate Security VPCs? The answer is resilience, not cost (TGW cross-AZ data transfer has been free since April 2022). With Appliance Mode ON, TGW uses a flow hash that can send traffic from a healthy AZ to appliances in an impaired AZ. With Appliance Mode OFF, AZ affinity isolates the blast radius - if AZ1 goes down, AZ2 traffic continues unaffected.
In practice, there are three options:
- One Security VPC with Appliance Mode ON - works for both E-W and N-S. Simpler to manage, accepts the resilience trade-off. This is what most deployments use.
- Two Security VPCs - one for E-W (Appliance Mode ON), one for N-S (Appliance Mode OFF). Maximum AZ isolation, but double the infrastructure and operational overhead.
- One Security VPC with Appliance Mode OFF - breaks east-west inspection. Don’t do this.
One more thing to keep in mind: in multi-account setups, AZ names map to different physical zones per account - use AZ IDs (e.g., use1-az1), not names.
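In Terraform you can pin a subnet to a physical zone with `availability_zone_id` instead of the account-relative name. A sketch, with the VPC name and CIDR assumed:

```hcl
resource "aws_subnet" "gwlbe_az1" {
  vpc_id     = aws_vpc.security.id
  cidr_block = "10.0.1.0/24"

  # Same physical zone in every account, unlike the name "us-east-1a"
  availability_zone_id = "use1-az1"
}
```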
Sources: AWS Whitepaper: GWLB with TGW for centralized security, AWS APN Blog: Centralized traffic inspection with GWLB
9. IMDSv2 and bootstrap - check your PAN-OS version
Not all versions of the PAN-OS VM-Series support IMDSv2. When I first encountered this problem, I thought I was going to lose all my hair. The process was standard: set the bootstrap in userdata, everything looked perfect, and… nothing bootstrapped. I scoured the internet for the problem, which turned out to be a single small checkbox in the virtual machine configuration - “Enable IMDSv2.” I unchecked it, redeployed it with the same bootstrap - eureka! Everything is working as it should.
That was on an older PAN-OS version. The good news is that Palo Alto has been supporting IMDSv2 since 2022:
- BYOL: PAN-OS 10.2.0+ with VM-Series Plugin 3.0.0+
- PAYG: PAN-OS 10.2.5+ with Plugin 3.0.0+
- Panorama: PAN-OS 10.2.3+
The only thing you need to set is the EC2 instance metadata options (here in Terraform):

```hcl
metadata_options {
  http_endpoint = "enabled"
  http_tokens   = "required"
}
```
Keep this in mind if your bootstrap refuses to work for no apparent reason.
Sources: Palo Alto KB: IMDSv2 support for VM firewall and Panorama in AWS, VM-Series Plugin 3.0.0 Release Notes
So what should you do with all this information?
Generally, do what you feel is right, but I suggest answering a few important questions before implementing:
Is your environment truly sensitive enough to require centralized traffic inspection? Is the data stored in your environment highly sensitive? If you answered yes to both questions, then you need this solution. If you have any doubts, reconsider - maybe a native AWS firewall will suffice?
Do you have experience configuring Palo Alto hardware? Without it, it will be difficult to navigate the initial process without wading through reams of documentation. It’s not just the VM-Series configuration itself, but also the AWS configuration at both the network and resource levels. You can always ask Palo Alto for a dedicated specialist, who will handle this for you… But you’ll also pay for that.
Consider whether you can afford this solution. It’s not a small amount. Go through section 3 again and judge for yourself.
Remember that simply implementing VM-Series in production can be risky. It’s good to have at least a minimal test environment to test your configuration before rolling it out to production, as you could shut down your business and not know why.
If you have no doubts and can meet all of these requirements, go for it - this solution is for you.
Building a centralized Security VPC on AWS with GWLB? I’ve deployed this architecture for enterprise clients and know where the bodies are buried. Let’s talk.