
Does AI-Powered Test Design Work?


AI is now embedded in nearly every aspect of software development. At large companies, AI generates a substantial portion of code, and AI-driven tools are increasingly automating test execution. But the heart of testing lies in test design. You can have the fastest test code—but if the tests are poorly designed, they won’t catch the defects.

So, we set out to explore both specialized tools and general-purpose large language models (LLMs) to see how well they perform in test design. After a thorough investigation, we found only four tools capable of generating test cases from requirements. Since this isn’t a “best tools for test design” comparison, I’ll refer to them as Company 1, 2, 3, and 4.

🧪 Trials That Weren’t

Companies 1 and 2 are large vendors that advertise free trials—but only in theory. I attempted to access their trials multiple times, without success.

  • Company 1 responded with: “A product expert will contact you within 48 business hours to get your trial started.” I tried several times, but no one ever followed up.

  • Company 2 used a different tactic. After receiving a free license and setting up a project, I reached the AI-powered test design feature—only to be told I needed a second license: “We have received your request for the AI license. Our team will get back to you.” They didn’t. Worse, I was notified that my “free” trial was about to expire.

In my opinion, these tools are currently unusable. I provided my company email and completed all required steps, yet I was blocked from trying them.

✅ Smaller Tools, Bigger Access

In contrast, I was able to test the AI-powered design features from two smaller companies. Both tools were intuitive—I could generate test cases within seconds, without reading the user manual. However, they shared two major flaws:

  1. Requirement Validation: The tools failed to detect contradictory requirements and generated redundant test cases. For example:

Access Rights Requirements

  • R1: The user can create a file

  • R2: The creator has admin rights

  • R3: At least one user has admin rights

  • R4: Admins can assign admin, editor, and reader rights

  • R5: Admins can delete any access right, including their own, even if no other admin exists

R3 and R5 contradict each other. A test design automation tool should flag this.
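
To make the conflict concrete, here is a minimal sketch of the kind of check a test design tool could derive. The AccessControl class and its methods are my own illustrative assumptions, not taken from any tool I tested; the point is that any implementation honouring R5 can be driven into a state where the R3 assertion fails.

```python
# Minimal sketch of the R3/R5 conflict. "AccessControl" is hypothetical:
# a real implementation would have to violate either R5 or this R3 check.

class AccessControl:
    def __init__(self):
        self.rights = {}                      # user -> set of rights

    def create_file(self, user):              # R1: a user can create a file
        self.rights[user] = {"admin"}         # R2: the creator gets admin rights

    def delete_right(self, user, right):      # R5: no guard, even for the last admin
        self.rights[user].discard(right)

    def admins(self):
        return [u for u, r in self.rights.items() if "admin" in r]


def test_r5_breaks_r3():
    ac = AccessControl()
    ac.create_file("alice")                   # alice is now the only admin
    ac.delete_right("alice", "admin")         # permitted by R5 ...
    assert len(ac.admins()) >= 1              # ... but this R3 assertion now fails
```

The test can never pass against an R5-compliant implementation, and that impossibility is exactly what a requirement-validation step should report.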

  2. Test Design Technique Application: Proper test design requires applying structured techniques. For instance, domain testing, recently added to the Advanced Level Test Analyst (CTAL-TA) syllabus, requires four test points for a single condition: ON, OFF, IN, and OUT (a worked sketch follows the results below).


    Example: For x ≥ 2, x integer, test points should be:

    • ON: x = 2

    • OFF: x = 1

    • IN: x = 10

    • OUT: x = 0

    • Company 3 generated 24 test cases instead of 4—far too many for a simple condition.

    • Company 4 missed the OFF point and detected only 83% of defects in a price calculation scenario. That may sound decent, but control-flow defects should be detected at a 100% rate. The test cases were so far off that I didn’t bother testing more complex requirements.
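
For comparison, the sketch below shows what the minimal domain-testing set for the x ≥ 2 condition looks like as executable tests. The check_eligibility function is a made-up stand-in for whatever behaviour the requirement guards; none of this is output from the tools above.

```python
import pytest

# Stand-in for the condition under test: "x >= 2, x integer" (illustrative only).
def check_eligibility(x: int) -> bool:
    return x >= 2

# The four domain-testing points for one closed boundary.
@pytest.mark.parametrize("x, expected", [
    (2, True),    # ON:  on the boundary, inside the domain
    (1, False),   # OFF: nearest value outside the domain
    (10, True),   # IN:  a typical interior value
    (0, False),   # OUT: a value clearly outside the domain
])
def test_x_at_least_two(x, expected):
    assert check_eligibility(x) is expected
```

Four points, four test cases: that is the size Company 3’s 24 cases should have collapsed to, and the OFF point is exactly the one Company 4 missed.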

🤖 LLMs’ Surprising Performance

I also tested Google Gemini and GPT-5; below I refer to them simply as Model A and Model B. Surprisingly, they outperformed the dedicated tools. For the complex Car Rental specification, they applied multiple test design techniques:

Model A used:

  1. Requirement-based testing

  2. Boundary Value Analysis (BVA)

  3. Equivalence Partitioning

  4. Statement Coverage

  5. Error Guessing

  6. Cause-Effect Graphing

Model B used:

  1. Requirement-based testing

  2. BVA

  3. Equivalence Partitioning

  4. Edge-case and Exception Coverage

  5. Positive/Negative Testing

  6. State-Transition Testing

  7. Minimality/Efficiency

Using state-transition testing was an excellent choice—though action-state testing would be even better. Most testers avoid state-transition testing due to its complexity. Selecting states and merging test cases manually is difficult without tooling. In this case, the LLMs arguably performed better than some human testers. (Not you, dear reader, of course.)
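
To show why tool support matters here, the sketch below derives 0-switch (one test per transition) cases from a deliberately simplified rental lifecycle. The states, events, and transitions are my own assumptions for illustration, not the actual Car Rental specification.

```python
# Toy state-transition model of a simplified rental lifecycle (assumed states/events).
TRANSITIONS = {
    ("available", "reserve"): "reserved",
    ("reserved", "pick_up"): "rented",
    ("reserved", "cancel"): "available",
    ("rented", "return"): "available",
}

def next_state(state, event):
    """Return the next state, or None if the event is invalid in this state."""
    return TRANSITIONS.get((state, event))

def all_transition_tests():
    """0-switch coverage: one test case per valid transition."""
    return [{"start": s, "event": e, "expected": t} for (s, e), t in TRANSITIONS.items()]

def test_every_valid_transition():
    for case in all_transition_tests():
        assert next_state(case["start"], case["event"]) == case["expected"]

def test_invalid_event_is_rejected():
    # Negative case: returning a car that was never picked up.
    assert next_state("reserved", "return") is None
```

Enumerating every valid transition and probing the invalid ones is precisely the bookkeeping that makes manual state-transition testing tedious without tooling.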

📉 But Results Fell Short

Despite their promising approach, the results weren’t ideal. With requirement-based testing alone, they found 60% of the seeded defects; the other techniques added only one or two more bugs, while the number of test cases was excessive. They applied 3-value BVA correctly but didn’t incorporate our latest 4-value BVA method.
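
For reference, 3-value BVA as both models applied it tests a boundary value together with its two neighbours. The minimum-age rule and the value 18 below are illustrative assumptions, not part of the Car Rental specification.

```python
import pytest

# Hypothetical rule for illustration: the renter must be at least 18 years old.
def may_rent(age: int) -> bool:
    return age >= 18

# 3-value BVA around the boundary 18: the value below, on, and above it.
@pytest.mark.parametrize("age, expected", [
    (17, False),   # boundary - 1
    (18, True),    # boundary
    (19, True),    # boundary + 1
])
def test_minimum_age_boundary(age, expected):
    assert may_rent(age) is expected
```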

They also failed to recognize the flawed ’Access Rights’ specification above; instead, they generated 20 and 16 test cases, respectively, for the contradictory requirements.

 

🧠 Final Verdict

In summary, neither pretrained LLMs nor specialized test design automation tools currently generate efficient test cases. The only exception is Harmony, which integrates the latest test design techniques and uncovers nearly all potential defects.

In future posts, I’ll show how we trained an LLM to validate requirements and generate reliable test cases. If you're curious about Harmony, feel free to request a demo.
