At this point I think this series has evolved into an endless stream of lessons and Aha moments. It seems like every day something new happens that's interesting, or frustrating, or even both.
Lesson - Test the Stable Parts
I am not going to debate the value of unit tests or any other part of the test pyramid. I will say that I have developed the habit of asking the agent to generate new unit tests, or to compare what's been implemented against the existing unit tests and cover any gaps. The value of that practice has been inconsistent.
The majority of the generated unit tests are valuable, and they pass the first time. But there are certain classes of unit tests where, no matter how I prompt, the generated tests don't work on the first try. For example, unit tests for a DAO function that executes multiple statements within a transaction: I have several such tests that I was able to get running correctly after some tweaks. Now when I need to create similar tests, I ask the agent to follow the same pattern as test X (tagging that file in the context), and the generated code is usually mostly correct. But it hasn't been correct from the first prompt yet.
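To make that concrete, here's a minimal sketch of the shape of test I'm talking about, assuming a Node/TypeScript backend tested with Jest (the accountDao module, transferFunds function, and db.transaction helper are hypothetical stand-ins, not my actual code):

```typescript
// accountDao.test.ts - hypothetical example of testing a DAO function
// that runs multiple statements inside one transaction.
import { transferFunds } from "./accountDao";
import { db } from "./db";

// Replace the real db module with an auto-mocked version.
jest.mock("./db");

describe("transferFunds", () => {
  it("executes both statements within a single transaction", async () => {
    // Fake transaction handle whose calls we can inspect afterwards.
    const tx = { execute: jest.fn().mockResolvedValue({ rowCount: 1 }) };

    // db.transaction(callback) invokes the callback with the fake handle
    // and resolves to whatever the callback returns.
    (db.transaction as jest.Mock).mockImplementation(
      async (cb: (t: typeof tx) => unknown) => cb(tx)
    );

    await transferFunds("acct-1", "acct-2", 100);

    // One transaction, two statements - the behavior I actually care about.
    expect(db.transaction).toHaveBeenCalledTimes(1);
    expect(tx.execute).toHaveBeenCalledTimes(2);
  });
});
```

Once one of these finally works, it becomes the "test X" I tag in the context when I need the agent to generate the next one.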
Overall I've found that generating tests for the backend has been the most valuable. It's the most "stable" part, in that I'm not iterating on it nearly as quickly as the front end. The front end, on the other hand ... that's been a completely different story. I have effectively given up on unit tests of any sort there, for the time being. Once I achieve a level of functionality that I consider stable, I'll go back and implement unit and integration tests. But right now there's no point in testing every iteration of a particular screen when that screen may go through 15 iterations. That was just slowing me down.
I had one screen that, in my opinion, wasn't particularly complicated. Once I got the workflow where I wanted it, I asked the agent to generate relevant unit tests for that screen. When I ran them, they all failed. That became a very frustrating experience - what started as an attempt to test some core logic and input-validation boundary conditions became a meandering journey through npm dependency hell, Jest configuration, crazy mocking chains, and racy behavior caused by callbacks. All of this was with the agent set to "Auto". I never could get the tests to run. After almost three hours, I threw out the test fixture altogether, changed the agent to Claude 3.7, and started over. The resulting code was a little different, but after another hour, none of the tests were passing. I had spent an entire half-day trying to get tests to work for one particular screen, and had utterly failed.
Part of it is my lack of experience with the stack. If this were Java or Rust, I'm sure I could have figured it out. So for now, I've abandoned unit tests in the front end. Maybe when that codebase settles down a bit, I'll give one of the Claude 4 models a shot.
Lesson - Use Wireframes to Generate New Screens
One thing that I've learned is that Cursor, even with the agent set to Auto, is pretty adept at producing functional screens/pages when provided a wireframe. So when I need to start on a new screen, this is my workflow:
Ask ChatGPT to generate a wireframe for the screen, taking into account my high-level requirements. The result, at least in my case, is a simple monochrome image.
I then copy that image and paste it into the Cursor chat window. I ask Cursor to generate a new screen based on the image, give it a high-level set of rules, and tell it how to integrate the screen into the navigation flow.
Most of the time the output is 90% of what I need, and I tweak from there. I think this is one area where it saves me the most time. Granted, my app isn't particularly pretty, or even user-friendly, at the moment. But I'll worry about polish once I achieve a certain level of functional breadth.
Lesson - Explicitly Ask Cursor to Implement/Fix
I'm on the Cursor $20/month plan, and I ran out of fast requests in like a week. I'm between jobs at the moment, so I'm basically doing this coding thing full time. But I was really surprised that I ran out of fast requests so soon; now all of my requests are in the slow pool.
Then I noticed a pattern - I will ask Cursor to perform some task; Cursor will analyze it, tell me what it will do, and then ask whether I want it to do what it recommends. Another example - sometimes a test will fail, and I will add some lines from the console to the chat context and inform Cursor that the test failed. Cursor will respond by explaining why it thinks the test failed and give me recommendations on how to fix it.
In both cases, if I confirm or ask Cursor to implement the recommended changes, that's just another series of back-and-forth requests with the backing agent. So I strongly suspect that I'm using 2x the number of requests I should be. I have now developed the habit of explicitly asking Cursor to implement the changes without asking for confirmation, or explicitly asking Cursor to fix a test or compiler problem rather than just reporting that it failed.
Lesson - Try Different Models
I am on an extremely tight budget, so I am not going to use Max or metered pricing. Therefore I've mostly just left the agent set to Auto. I did try Claude 3.7 Sonnet (which now seems to be unavailable), and my experience was similar to what others have reported online; sometimes the results were too complicated, or the agent eagerly implemented things that were not requested.
I also tried gpt-4.1, and to be honest I was completely underwhelmed. Its output seemed extremely ... naive, and I had to be much more explicit with instructions to get anything useful out of it.
Models are constantly changing, and it's likely that different models are better suited to different tasks. But the result is that you will likely end up with code in the same repo that all looks slightly different. In my case, different agents generate code with different tabbing/spacing, which I haven't gotten around to normalizing yet.
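The likely fix, when I do get around to it, is to let a formatter own the whitespace rather than the agents - for the front end that probably means Prettier. A minimal sketch, assuming a typical npm project: check in a .prettierrc like the one below (these particular settings are just an example, not what my project uses) and run npx prettier --write . once to reformat everything.

```json
{
  "tabWidth": 2,
  "useTabs": false,
  "singleQuote": true,
  "semi": true
}
```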
Things Are Improving
When I go back and re-read my posts so far, I think I come across as too negative. That is not my intent. This exercise can be frustrating at times, but it's still all very exciting. And things are definitely improving; in the past week I was able to implement twice as many features as in the previous week, and they were more complex.
There are still lots of things I can do to improve further; I want to leverage claude-task-master, I want to find a set of Cursor rules that really makes sense for my setup, and I 100% have to polish the front end. But it still feels like all the YouTube bros out there are getting different results than I am, or playing by a different set of rules.