Does AI-Powered Test Design Work?
- Istvan Forgacs
- Sep 20
- 3 min read
AI is now embedded in nearly every aspect of software development. At large companies, AI generates a substantial portion of code, and AI-driven tools are increasingly automating test execution. But the heart of testing lies in test design. You can have the fastest test code—but if the tests are poorly designed, they won’t catch the defects.
So, we set out to explore both specialized tools and general-purpose large language models (LLMs) to see how well they perform in test design. After a thorough investigation, we found only four tools capable of generating test cases from requirements. Since this isn’t a “best tools for test design” comparison, I’ll refer to them as Company 1, 2, 3, and 4.
🧪 Trials That Weren’t
Companies 1 and 2 are large vendors that advertise free trials—but only in theory. I attempted to access their trials multiple times, without success.
Company 1 responded with: “A product expert will contact you within 48 business hours to get your trial started.” I tried several times, but no one ever followed up.
Company 2 used a different tactic. After receiving a free license and setting up a project, I reached the AI-powered test design feature—only to be told I needed a second license: “We have received your request for the AI license. Our team will get back to you.” They didn’t. Worse, I was notified that my “free” trial was about to expire.
In my opinion, these tools are currently unusable. I provided my company email and completed all required steps, yet I was blocked from trying them.
✅ Smaller Tools, Bigger Access
In contrast, I was able to test the AI-powered design features from two smaller companies. Both tools were intuitive—I could generate test cases within seconds, without reading the user manual. However, they shared two major flaws:
Requirement Validation: The tools failed to detect contradictory requirements and generated redundant test cases. For example:
Access Rights Requirements
R1: The user can create a file
R2: The creator has admin rights
R3: At least one user has admin rights
R4: Admins can assign admin, editor, and reader rights
R5: Admins can delete any access right, including their own, even if no other admin exists
R3 and R5 contradict each other: if the only remaining admin deletes their own admin right, which R5 explicitly permits, no user holds admin rights and R3 is violated. A test design automation tool should flag this.
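To make the conflict concrete, here is a minimal sketch (not taken from any of the tools tested) that models the rights as data and checks the R3 invariant after exercising R5; the user names and data structure are hypothetical:

```python
# Minimal sketch: model the access-rights rules as data and check the R3
# invariant after exercising R5. User names and the data model are
# hypothetical, chosen only to illustrate the contradiction.

def admins(rights):
    """Return the set of users that currently hold admin rights."""
    return {user for user, role in rights.items() if role == "admin"}

# R1/R2: a user creates a file and the creator receives admin rights.
rights = {"alice": "admin", "bob": "reader"}

# R5: an admin may delete any access right, including their own,
# even if no other admin exists, so Alice removes her own right.
del rights["alice"]

# R3 as an invariant: at least one user must always hold admin rights.
if not admins(rights):
    print("R3 violated: no admin remains, so R3 and R5 contradict each other")
```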
Test Design Technique Application: Proper test design requires applying structured techniques. For instance, domain testing, recently added to the Advanced Level Test Analyst (CTAL-TA) syllabus, requires four test points for a single condition: ON, OFF, IN, and OUT.
Example: For x ≥ 2 with x an integer, the test points should be (see the sketch after this list):
ON: x = 2
OFF: x = 1
IN: x = 10
OUT: x = 0
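As a quick illustration, the four domain-testing points can be written directly as a parametrized test; this is a minimal sketch, and the function under test (`accepts`) is hypothetical:

```python
import pytest

def accepts(x: int) -> bool:
    """Hypothetical implementation of the condition under test: x >= 2."""
    return x >= 2

# Domain testing for "x >= 2, x integer": exactly one test point of each kind.
@pytest.mark.parametrize("kind, x, expected", [
    ("ON",  2,  True),   # on the boundary
    ("OFF", 1,  False),  # just off the boundary, outside the domain
    ("IN",  10, True),   # an interior point of the domain
    ("OUT", 0,  False),  # a point further outside the domain
])
def test_domain_points(kind, x, expected):
    assert accepts(x) == expected
```

Running this with pytest executes exactly the four required cases; anything beyond that adds no coverage for such a simple condition.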
Company 3 generated 24 test cases instead of 4—far too many for a simple condition.
Company 4 missed the OFF point and detected only 83% of the defects in a price calculation scenario. That may sound decent, but control-flow defects like these should be detectable at a 100% rate. The test cases were so far off that I didn’t bother testing more complex requirements.
🤖 LLMs’ Surprising Performance
I also tested Google Gemini and GPT-5. Surprisingly, they outperformed the dedicated tools. For the complex Car Rental specification, they applied multiple test design techniques:
Model A used:
Requirement-based testing
Boundary Value Analysis (BVA)
Equivalence Partitioning
Statement Coverage
Error Guessing
Cause-Effect Graphing
Model B used:
Requirement-based testing
BVA
Equivalence Partitioning
Edge-case and Exception Coverage
Positive/Negative Testing
State-Transition Testing
Minimality/Efficiency
Using state-transition testing was an excellent choice—though action-state testing would be even better. Most testers avoid state-transition testing due to its complexity. Selecting states and merging test cases manually is difficult without tooling. In this case, the LLMs arguably performed better than some human testers. (Not you, dear reader, of course.)
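To show why tooling helps here, the following is a minimal sketch of state-transition testing for a simplified rental flow; the states, events, and transitions are my own illustration, not the actual Car Rental specification:

```python
# Minimal sketch of state-transition testing for a hypothetical rental flow.
# States, events, and transitions are illustrative, not the real specification.

TRANSITIONS = {
    ("Available", "reserve"): "Reserved",
    ("Reserved",  "pick_up"): "Rented",
    ("Reserved",  "cancel"):  "Available",
    ("Rented",    "return"):  "Available",
}

def step(state: str, event: str) -> str:
    """Apply an event; invalid events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# One test case per transition (all-transitions coverage), each expressed as
# an event sequence from the initial state plus the expected final state.
test_cases = [
    (["reserve"],                      "Reserved"),
    (["reserve", "cancel"],            "Available"),
    (["reserve", "pick_up"],           "Rented"),
    (["reserve", "pick_up", "return"], "Available"),
    (["pick_up"],                      "Available"),  # negative case: invalid event is ignored
]

for events, expected in test_cases:
    state = "Available"
    for event in events:
        state = step(state, event)
    assert state == expected, f"{events} ended in {state}, expected {expected}"
print("All state-transition test cases passed")
```

Once the transitions are written down as a table like this, selecting states and deriving transition-covering sequences becomes mechanical, which is exactly the part testers struggle with when working by hand.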
📉 But Results Fell Short
Despite their promising approach, the results weren’t ideal. With requirement-based testing alone, 60% of the seeded defects were found; the LLMs detected one or two additional bugs, but at the cost of an excessive number of test cases. They applied 3-value BVA correctly but didn’t incorporate our latest 4-value BVA method.
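For context, 3-value BVA tests each boundary together with its two neighbours. Here is a minimal sketch for a hypothetical rental-duration rule of 1 to 30 days (the rule and values are assumptions for illustration, and the 4-value variant mentioned above is not reproduced here):

```python
# Minimal sketch of 3-value boundary value analysis for a hypothetical
# rental-duration rule: a booking is valid for 1 to 30 days inclusive.
# The rule and its limits are assumptions, not the real specification.

def booking_valid(days: int) -> bool:
    return 1 <= days <= 30

# 3-value BVA: for each boundary, test the boundary itself and both neighbours.
bva_cases = {
    0: False, 1: True, 2: True,     # around the lower boundary (1)
    29: True, 30: True, 31: False,  # around the upper boundary (30)
}

for days, expected in bva_cases.items():
    assert booking_valid(days) == expected, f"{days} days"
print("3-value BVA cases passed")
```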
They also failed to recognize the flawed ’Access Rights’ specification above. Instead, they generated 20 and 16 test cases, respectively, for the contradictory requirements.
🧠 Final Verdict
In summary, neither pretrained LLMs nor specialized test design automation tools currently generate efficient test cases. The only exception is Harmony, which integrates the latest test design techniques and uncovers nearly all potential defects.
In future posts, I’ll show how we trained an LLM to validate requirements and generate reliable test cases. If you're curious about Harmony, feel free to request a demo.