One of the most fascinating challenges in the modern energy sector is that the price of electricity changes every single hour.
At 3:00 AM, when wind turbines are spinning but everyone is asleep, power is incredibly cheap—sometimes dropping to 6 cents per kilowatt-hour. But at 6:00 PM, when the sun goes down and everyone turns on their ovens and air conditioners, the grid undergoes immense stress. Prices can violently spike to 60 cents or even $2.50 per kilowatt-hour during a crisis.
If you own a large industrial battery system, this volatility is an opportunity known as Energy Arbitrage: you buy electricity when it is cheap, store it, and sell it back to the grid (or use it to power your building) when prices spike.
But there is a catch: You do not know the future.
To explore how Artificial Intelligence solves this, I built the Smart Grid Energy Optimizer. In this interactive simulation, we pit a Deep Reinforcement Learning agent against simple human heuristics and a mathematically perfect algorithm to see how well AI can trade energy under uncertainty.
1. The Heuristic Fallacy
If you asked a software engineer to write a script to manage this battery, they would likely write a simple Rule-Based Heuristic. In code, it looks something like this:

    def heuristic_trader(current_price, daily_average_price):
        if current_price < (daily_average_price * 0.75):
            return "CHARGE"
        elif current_price > (daily_average_price * 1.25):
            return "DISCHARGE"
        else:
            return "IDLE"

This works reasonably well, but it is deeply flawed because it lacks nuance. A static rule doesn't know if a small price spike at 2:00 PM is the peak of the day, or if an absolutely massive spike is coming at 6:00 PM. Furthermore, if you add solar panels to the roof of your building, the logic becomes overwhelmingly complex. If the sun is shining, should you use that free solar energy to charge the battery, or should you push it directly into the building to offset demand?
Hard-coding rules for every possible combination of price, time, solar generation, and building load is a nightmare.
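To make the first failure mode concrete, here is a small sketch that replays the threshold rule over a 24-hour price curve. The prices are invented for illustration (they are not data from the simulator): cheap overnight, a modest bump at 14:00, and the real peak at 18:00.

```python
def heuristic_trader(current_price, daily_average_price):
    """Same threshold rule as above, repeated so this block runs on its own."""
    if current_price < daily_average_price * 0.75:
        return "CHARGE"
    elif current_price > daily_average_price * 1.25:
        return "DISCHARGE"
    return "IDLE"

# Hypothetical prices in $/kWh for hours 0..23.
prices = [0.06] * 6 + [0.12] * 8 + [0.20] + [0.12] * 3 + [0.60] + [0.12] * 5
daily_average = sum(prices) / len(prices)

decisions = [heuristic_trader(p, daily_average) for p in prices]
first_discharge_hour = decisions.index("DISCHARGE")
print(first_discharge_hour, decisions[18])
```

The rule fires at 14:00, when the price is only 20 cents. If the battery can be emptied within that hour (an assumption about its size), nothing is left to sell at the 60 cent peak four hours later, even though the rule still says "DISCHARGE" at 18:00.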
2. The Theoretical Ceiling: Dynamic Programming
Before we train an AI to solve this, we need a baseline. How do we know if our AI is actually doing a good job? We need to calculate the mathematical upper bound of the problem.
To do this, we use an elegant computer science technique called Dynamic Programming (DP).
If we assume that we have perfect foresight—meaning we magically know the exact electricity price, solar output, and building load for every hour of the upcoming day—we can use DP to calculate the absolute optimal charging schedule.
The trick to Dynamic Programming is Backward Induction via the Bellman Equation:
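$V_t(s) = \max_a \left[ \text{Reward}(s, a) + V_{t+1}(s') \right]$, where $s'$ is the state reached from $s$ by taking action $a$ at hour $t$.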
We don't start at 12:00 AM. We start at the very end of the day (Hour 24) and work backwards, calculating the exact value of every possible state:

    import numpy as np

    def solve_dp(env, n_soc_levels):
        # possible_actions, simulate_action and get_index are helpers that are
        # not shown in this excerpt.
        # Create a grid of all possible battery charge levels
        soc_grid = np.linspace(0, env.cap, n_soc_levels)
        # V[t, si] = best achievable profit from hour t onward at charge level si;
        # row 24 is the end of the day, where no further profit is possible
        V = np.zeros((25, n_soc_levels))
        # Work backwards from the end of the day to the beginning
        for t in reversed(range(24)):
            for si, soc in enumerate(soc_grid):
                best_val = -np.inf
                # Test every possible action (-3kW to +3kW)
                for action in possible_actions:
                    reward, next_soc = simulate_action(action, t, soc)
                    # The total value is the immediate reward + the known future value
                    total = reward + V[t + 1, get_index(next_soc)]
                    if total > best_val:
                        best_val = total
                V[t, si] = best_val
        # V[0, si] is the maximum possible profit for a day that starts at level si
        return V

Because backward induction evaluates every state on the grid, starting from the end of the day, the schedule it produces is guaranteed to be optimal for that grid and that forecast.
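The value table on its own is not yet a schedule. One common way to recover the schedule is to store the best action for each state during the backward pass and then walk forward from the starting charge level. The sketch below does this on a deliberately tiny problem. Everything in it is an assumption made for illustration: the name `solve_and_extract`, the integer charge grid, the lossless battery, and the sign convention that positive actions charge.

```python
import numpy as np

def solve_and_extract(prices, capacity=3, actions=(-3, -2, -1, 0, 1, 2, 3)):
    """Backward induction on an integer SoC grid, then a forward pass
    that reads the optimal action sequence out of the stored argmax."""
    horizon = len(prices)
    n_levels = capacity + 1            # SoC levels 0..capacity kWh
    V = np.zeros((horizon + 1, n_levels))
    best_action = np.zeros((horizon, n_levels), dtype=int)

    for t in reversed(range(horizon)):
        for soc in range(n_levels):
            best_val = -np.inf
            for a in actions:          # a > 0 charges, a < 0 discharges
                next_soc = soc + a
                if not 0 <= next_soc <= capacity:
                    continue           # would over- or under-fill the battery
                total = -prices[t] * a + V[t + 1, next_soc]
                if total > best_val:
                    best_val, best_action[t, soc] = total, a
            V[t, soc] = best_val

    # Forward pass: start empty and follow the stored best actions.
    soc, schedule = 0, []
    for t in range(horizon):
        a = int(best_action[t, soc])
        schedule.append(a)
        soc += a
    return V[0, 0], schedule

profit, schedule = solve_and_extract([0.06, 0.12, 0.20, 0.60])
print(round(profit, 2), schedule)
```

On this four-hour toy curve the recovered schedule charges fully in the cheapest hour, holds, and sells everything in the most expensive hour, which is what you would expect from a lossless battery with perfect foresight.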
The Curse of Dimensionality
If DP is mathematically perfect, why don't we use it for everything?
The answer is the Curse of Dimensionality. In our simulation, we only have 50 battery charge levels and 24 hours. The DP solver checks $50 \times 24 \times 7 \text{ actions} = 8{,}400$ combinations, which takes milliseconds. But imagine managing a factory with 10 independent batteries, 100 machines, and stochastic weather predictions. Ten batteries at 50 charge levels each already give $50^{10} \approx 10^{17}$ joint battery states per hour, before the machines or the weather are counted. Exhaustive DP breaks down because the table can no longer be filled in any practical amount of time.
Furthermore, DP requires perfect clairvoyance. In the real world, you cannot predict the exact solar output 12 hours in advance. That rules out perfect-foresight DP as a live controller, but it gives us a beautiful "Theoretical Ceiling" to grade our AI against.
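To put numbers on the curse of dimensionality, the short calculation below counts the work for the single-battery simulator and for a hypothetical site with ten independent batteries. It assumes each extra battery uses the same 50-level grid and the same 7 actions, and it ignores the machines and the weather entirely, so it is a lower bound on the real problem.

```python
soc_levels, hours, n_actions = 50, 24, 7

# One battery: every (hour, charge level, action) triple is evaluated once.
single_battery_evaluations = soc_levels * hours * n_actions

# Ten independent batteries: the joint state is one charge level per battery,
# and the joint action is one action per battery.
n_batteries = 10
joint_states_per_hour = soc_levels ** n_batteries
joint_actions = n_actions ** n_batteries

print(single_battery_evaluations)
print(f"{joint_states_per_hour:.2e} joint states per hour")
print(f"{joint_actions:.2e} joint actions per state")
```

The state count grows exponentially with the number of batteries, and so does the number of joint actions that must be tried in each state.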
3. Acting Under Uncertainty: Deep Q-Networks
In the real world, we must make decisions right now based only on what we currently know. This is where Deep Q-Networks (DQN) shine.
Unlike the DP solver, the DQN agent does not get to see the future. Instead, at every hour, it receives an 8-dimensional observation snapshot of the grid. It passes this snapshot through a neural network, which outputs the "Q-Value" (expected future profit) for 7 discrete actions: [-3kW, -2kW, -1kW, 0kW, +1kW, +2kW, +3kW].
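The step from seven Q-values to a battery command can be sketched as follows. The article does not say how the agent explores, so the epsilon-greedy rule here is an assumption (it is the usual choice for DQN), and the Q-values are made-up numbers standing in for the network's output.

```python
import numpy as np

# Action index -> battery power in kW, matching the 7 discrete actions above.
ACTIONS_KW = np.array([-3, -2, -1, 0, 1, 2, 3])

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else take the argmax."""
    if rng.random() < epsilon:
        index = int(rng.integers(len(ACTIONS_KW)))
    else:
        index = int(np.argmax(q_values))
    return index, int(ACTIONS_KW[index])

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.4, 0.2, 0.0, 0.3, 0.9, 0.5])  # made-up network output
index, power_kw = select_action(q_values, epsilon=0.0, rng=rng)
print(index, power_kw)
```

With `epsilon=0.0` the agent is purely greedy and picks the action with the highest Q-value. During training, a larger epsilon would make it try other actions occasionally.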
To learn, the DQN essentially tries to approximate the exact same Bellman Equation that the DP solver uses, but without knowing the future. Here is what the core learning loop looks like in PyTorch:
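
    import torch
    import torch.nn.functional as F

    # 1. Ask the neural network what the current state is worth
    current_q_values = q_network(state)                      # shape [batch, 7]
    # action must be an int64 tensor of shape [batch, 1]
    q_value_of_action_taken = current_q_values.gather(1, action).squeeze(1)

    # 2. Ask the Target Network to predict the value of the NEXT state
    with torch.no_grad():
        next_q_values = target_network(next_state)
        max_next_q_value = next_q_values.max(1)[0]           # shape [batch]

    # 3. Calculate the Bellman Target (Immediate Reward + Future Value)
    # Assumption: `done` is a float tensor (1.0 at the last hour of the day)
    # so that no future value is added after the episode ends
    expected_q_value = reward + gamma * max_next_q_value * (1.0 - done)

    # 4. Measure the error between prediction and target (both shape [batch])
    loss = F.mse_loss(q_value_of_action_taken, expected_q_value)

    # 5. Update the network; `optimizer` is assumed to wrap q_network.parameters()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
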
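Two details are easy to get wrong in an update of this form: the target for the last step of an episode, and the shapes of the two tensors being compared. The NumPy sketch below works through both with made-up numbers. The `done` flag is an assumption (it marks transitions that end the day), and PyTorch follows the same broadcasting rules as NumPy, so the shape behavior carries over.

```python
import numpy as np

gamma = 0.99
reward = np.array([0.5, -0.2, 1.0])                      # made-up per-step profits
next_q = np.array([[0.1, 0.7], [0.3, 0.2], [0.9, 0.4]])  # 2 actions for brevity
done = np.array([0.0, 0.0, 1.0])                         # last transition ends the day

# Bellman target: no future value is added once the episode is over.
target = reward + gamma * next_q.max(axis=1) * (1.0 - done)

predicted = np.zeros((3, 1))                     # shape of a gather(1, action) result
wrong = (predicted - target).shape               # column minus vector broadcasts
right = (predicted.squeeze(1) - target).shape    # shapes match element by element
print(target, wrong, right)
```

Subtracting a `(3,)` vector from a `(3, 1)` column silently produces a `(3, 3)` matrix, so a loss computed on it averages over pairs that should never be compared. Squeezing the prediction to `(3,)` keeps the comparison element by element.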
Feature Engineering: The Cyclic Time Trick
One of the most fascinating engineering challenges in RL is how you represent the "State" to the neural network.
For example, how do you tell the AI what time it is? If you feed the network the raw integer hour = 23 (11 PM), the next step will be hour = 0 (Midnight). To a neural network, jumping from 23 to 0 looks like a massive, disruptive mathematical anomaly.
To fix this, we use Cyclic Encoding. We map the 24-hour clock onto a circle using sine and cosine functions:

    sin_time = np.sin(hour * np.pi / 12)
    cos_time = np.cos(hour * np.pi / 12)

Now, Hour 23 and Hour 0 are mathematically right next to each other on the circle. The network smoothly understands the passage of time without any jarring jumps.
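A quick way to check this claim is to measure distances between encoded hours. The helper names `encode_hour` and `distance` are made up for this sketch.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = hour * np.pi / 12
    return np.array([np.sin(angle), np.cos(angle)])

def distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))

print(distance(23, 0), distance(0, 1), distance(0, 12))
```

The step from 23:00 to midnight is exactly as long as the step from midnight to 1:00, while midnight and noon sit on opposite sides of the circle. With the raw integer, the same midnight step would look 23 times larger than any other hourly step.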
Thermodynamics: Learning the "Spread"
The environment enforces a physical reality: the battery has a 92% round-trip efficiency. If you put 1 kWh into the battery, you lose 8% to heat and get only 0.92 kWh back out.
Because of this efficiency loss, buying at 10 cents and selling at 10.5 cents actually loses money: you pay 10 cents for 1 kWh but earn only 0.92 × 10.5 ≈ 9.7 cents when you sell what comes back out. You never have to write code telling the AI about thermodynamics. The loss shows up in the reward, and the DQN figures out on its own that it should only trade when the price "spread" is wide enough to cover the efficiency tax, which means a sell price more than about 8.7% above the buy price (a factor of 1/0.92).
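The arithmetic behind the spread can be written as a two-function sketch. It assumes the whole 8% loss is applied as a single factor on the round trip. How the simulator splits the loss between charging and discharging is not stated, and the function names are made up.

```python
ROUND_TRIP_EFFICIENCY = 0.92

def round_trip_profit(buy_price, sell_price, energy_kwh=1.0):
    """Profit from buying energy_kwh, storing it, and selling what comes back out."""
    cost = buy_price * energy_kwh
    revenue = sell_price * energy_kwh * ROUND_TRIP_EFFICIENCY
    return revenue - cost

def break_even_sell_price(buy_price):
    """Lowest sell price at which a round trip stops losing money."""
    return buy_price / ROUND_TRIP_EFFICIENCY

print(round_trip_profit(10.0, 10.5))   # prices in cents; negative means a loss
print(break_even_sell_price(10.0))
```

Buying at 10 cents only pays off once the sell price clears roughly 10.87 cents. Any trade inside that band is a net loss, which is the pattern the agent has to discover from rewards alone.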
4. Preventing Overfitting via Noise Injection
If you train a neural network on the exact same 24-hour price curve for a million episodes, it doesn't actually become intelligent. It just becomes a clock. It memorizes "charge at step 4, discharge at step 16".
To prevent this, the environment uses Stochastic Training.
During training, every single price is perturbed with 12% Gaussian noise, and the solar generation is multiplied by random "cloud factors".
Because the prices are never the same twice, the AI cannot rely on the clock. It has to learn how its inputs (Price Trend, Solar Output, Time to Peak) relate to the reward, which pushes the DQN toward a more robust, general trading strategy that has a better chance of surviving the chaos of real-world markets.
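A minimal sketch of this kind of perturbation is shown below. Several details are assumptions, because the article does not spell them out: the 12% is read as the standard deviation of a multiplicative factor, prices are clipped at zero, and the cloud factor is drawn uniformly from a made-up range. The function name `perturb_day` is also hypothetical.

```python
import numpy as np

def perturb_day(prices, solar, rng, price_noise=0.12, cloud_range=(0.5, 1.0)):
    """Return a noisy copy of one day's price and solar curves."""
    prices = np.asarray(prices, dtype=float)
    solar = np.asarray(solar, dtype=float)
    # Multiplicative Gaussian noise: each price is scaled by a draw from N(1, price_noise).
    noisy_prices = prices * rng.normal(1.0, price_noise, size=prices.shape)
    noisy_prices = np.clip(noisy_prices, 0.0, None)   # keep prices non-negative
    # One random cloud factor per hour, drawn uniformly from cloud_range.
    cloud = rng.uniform(cloud_range[0], cloud_range[1], size=solar.shape)
    return noisy_prices, solar * cloud

rng = np.random.default_rng(42)
base_prices = np.full(24, 0.12)
base_solar = np.concatenate([np.zeros(6), np.full(12, 5.0), np.zeros(6)])
day_a = perturb_day(base_prices, base_solar, rng)
day_b = perturb_day(base_prices, base_solar, rng)
print(day_a[0][:3], day_b[0][:3])
```

Calling the function once per episode gives the agent a different day each time, so memorizing a fixed sequence of actions stops paying off.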
5. Measuring the "Intelligence Gap"
In the interactive dashboard, we can run a Benchmark that races three strategies on the exact same day:
- The Rule-Based Heuristic (The human attempt)
- The DQN Agent (The AI acting under uncertainty)
- The DP Solver (The theoretical maximum with perfect foresight)
When you look at the final cumulative revenue chart, the DP Solver is always at the top. The Rule-Based system is usually at the bottom. The DQN agent sits in the middle.
The financial gap between the DQN and the DP Solver combines two things: how much the AI has left to learn, and the unavoidable cost of not knowing the future. The chart shows their sum as a single dollar amount, which makes it a clear visual summary of AI performance.
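One way to turn the chart into a single number is to express each strategy's shortfall as a fraction of the DP revenue. The sketch below does that with invented revenues. They are placeholders for illustration, not results from the simulator, and `optimality_gap` is a made-up name.

```python
def optimality_gap(agent_revenue, dp_revenue):
    """Fraction of the perfect-foresight revenue the agent failed to capture."""
    if dp_revenue <= 0:
        raise ValueError("DP revenue must be positive for the ratio to be meaningful")
    return (dp_revenue - agent_revenue) / dp_revenue

# Hypothetical daily revenues in dollars, for illustration only.
results = {"heuristic": 2.10, "dqn": 3.40, "dp": 4.00}
gaps = {name: optimality_gap(rev, results["dp"]) for name, rev in results.items()}
print({name: f"{gap:.1%}" for name, gap in gaps.items()})
```

A ratio like this makes runs on different scenarios comparable, since a one-dollar gap means something different on a calm day than on a day with a price spike.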
🧪 Try It Yourself
To truly understand how this works, open up the Smart Grid Simulator and run these experiments:
- The Benchmark Race: Go to the Benchmark tab. Select the Summer Peak scenario. Click Run Benchmark. Look at the bar chart comparing the total revenue. How much money did the DQN leave on the table compared to the DP optimal?
- The Solar Impact: Go to the Dispatch tab. Run the DQN on the Summer Peak scenario (which has a 5kW solar array). Watch how it behaves at 1:00 PM. Then switch the scenario to No Solar and run it again. You will see the agent completely change its strategy, forced to buy expensive grid power earlier in the day because it no longer has free solar power to rely on.
- Train the Brain: Go to the Training Lab. Set the steps to 20,000 and watch the live reward curve climb as the agent learns to ignore the immediate cost of charging the battery in favor of the massive delayed gratification of the 6:00 PM discharge.
Wrapping Up
Energy arbitrage perfectly encapsulates the beauty of Reinforcement Learning. It forces an AI to balance short-term costs against long-term gains in a highly volatile, unpredictable environment. By comparing the AI against the mathematical perfection of Dynamic Programming, we stop guessing if our algorithm is "good" and instead mathematically measure its intuition.
This is the fourth of 12 interactive RL projects I am building to bridge the gap between academic math and real-world intuition. If this deep dive helped you understand how AI manages uncertainty, I would be incredibly grateful if you checked out the source code and dropped a star on the full repository:
⭐ Reinforcement Learning Portfolio on GitHub
Dash10107/rl-portfolio
End-to-end reinforcement learning projects — Q-Learning, DQN, PPO, SAC, A2C, IPPO, MBRL, HMM, RLHF, and Multi-Armed Bandits — each deployed as an interactive Gradio app on Hugging Face Spaces.
Let me know in the comments: If you had a giant battery in your garage, would you trust an AI to trade electricity for you while you slept?

