E-commerce GenAI is good at making pretty images, but it is not reliable at preserving exact product identity (shape, color, texture, logo) across generations, which makes it hard to turn into a scalable pipeline. The missing piece is automated verification: today, most teams still rely on manual QC, which kills scale and trust.
Verified GenAI: an agentic LLM+VLM quality-control layer that sits on top of image/video generation and automatically checks whether outputs are acceptable for production.
Core idea:
Generate candidate visuals (image and optional short video)
Run VLM/CLIP-based checks for product identity + artifact detection
Iterate or reject automatically (agentic loop)
Produce a final output with a QC report (scores + reasons)
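As a rough illustration, the loop could be structured like this (Python sketch; `generate`, `run_qc_checks`, and `propose_fix` are hypothetical helpers, not the final implementation):

```python
# Illustrative generate -> verify -> iterate loop; helper functions are placeholders.
MAX_ATTEMPTS = 4          # retry budget per product
ACCEPT_THRESHOLD = 0.85   # minimum aggregate QC score to accept

def verified_generation(reference_image, prompt, params):
    best = None
    for attempt in range(MAX_ATTEMPTS):
        candidate = generate(reference_image, prompt, params)   # ComfyUI/SDXL generation call
        report = run_qc_checks(reference_image, candidate)      # identity / color / artifact scores
        if best is None or report["score"] > best["report"]["score"]:
            best = {"image": candidate, "report": report, "attempt": attempt}
        if report["score"] >= ACCEPT_THRESHOLD:
            return {"status": "accepted", **best}
        params = propose_fix(report, params)                    # LLM controller picks the next change
    return {"status": "rejected", **best}                       # reject, but keep best output + report
```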
1) Generation (baseline)
High-consistency product visualization using ComfyUI workflows:
SDXL + IP-Adapter/Refiner + ControlNet (depth/canny/lineart)
Image-to-video (I2V) and compositing where needed
Output is constrained by a reference product image + optional mask.
2) Verification (the funded “researchy” layer)
VLM/embedding checks to ensure:
Identity consistency (product remains the same object)
Color fidelity (prevent drift)
Artifact detection (extra parts, broken geometry, wrong branding)
An agent controller (LLM) decides what to change next: prompt edits, strength, control settings, re-run, or stop.
3) Evaluation harness
A reproducible benchmark-style harness:
acceptance rate
failure taxonomy
quality vs cost (GPU time per accepted output)
regression tests for pipeline changes
Open repo (or shareable private repo if required) with:
agentic orchestration code (FastAPI + worker queue)
ComfyUI workflows + configs
verification module + scoring outputs
Demo service: API endpoint that takes a product image URL and returns:
best output + QC report + trace
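A hedged sketch of what that endpoint contract might look like (FastAPI; field names and the `run_verified_generation` call are illustrative assumptions):

```python
# Illustrative FastAPI contract; request/response fields and the worker call are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel, HttpUrl

app = FastAPI()

class GenerateRequest(BaseModel):
    product_image_url: HttpUrl
    scenario_prompt: str

class GenerateResponse(BaseModel):
    status: str                    # "accepted" or "rejected"
    output_url: str | None = None  # best generated asset
    qc_report: dict                # per-check scores + fail reasons
    trace: list[dict]              # agent decisions per attempt

@app.post("/generate", response_model=GenerateResponse)
async def generate_asset(req: GenerateRequest) -> GenerateResponse:
    # Enqueue the job on the GPU worker queue and await the verified result (hypothetical helper).
    result = await run_verified_generation(str(req.product_image_url), req.scenario_prompt)
    return GenerateResponse(**result)
```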
Technical report: methods, metrics, ablations, and measured improvements (baseline vs verified loop)
Week 1–2: baseline pipeline + dataset of failure cases + eval metrics definition
Week 3–4: verification + agentic iteration loop + automated reporting
Week 5–6: hardening, regression tests, cost/latency tuning, public demo
GPU compute for controlled experiments (acceptance-rate improvement requires many runs)
Evaluation runs + ablations (quality vs cost curves)
Hosting for public demo + logging/monitoring
Optional: small dataset curation and annotation for failure categories
PhD AI/ML Research Scientist (AI since 2014), academic + industry, production-grade delivery.
Remote U.S. collaboration: SMT (North Carolina) – sports video analytics for PFL (demo):
https://drive.google.com/file/d/14RYbf63byBfrIr_9N-F0B9MjdaWrGA3k/view?usp=sharing
Publications:
Edge devices object detection: https://www.researchgate.net/publication/376783175_Efficient_Object_Detection_Model_for_Edge_Devices
Transformer/BERT NILM paper: https://www.mdpi.com/1996-1073/14/15/4649
Public demos (test links):
Jewelry try-on / product visualization: https://renderfy-ai-lightbox.hf.space
Fashion try-on (image-to-video + compositing): https://renderfy-fitsuite-ai.hf.space
LinkedIn: https://www.linkedin.com/in/vahit-feryad-19517256/
A practical, measurable step toward trustworthy, scalable GenAI for product visuals: fewer manual reviews, fewer bad outputs shipped, and a reusable QC framework that generalizes beyond Try-On.
Goal: outputs preserve the same product identity (shape, texture, logo) and don’t drift in color or introduce artifacts.
Success metric: higher auto-acceptance rate at a fixed quality bar (vs. baseline generation without verification).
Goal: replace human review loops with an agentic verification loop that retries, fixes, or rejects automatically.
Success metric: fewer manual reviews per accepted asset; predictable cost per accepted output.
Goal: a benchmark-like harness to measure quality, failure modes, and regressions across model/prompt/workflow changes.
Success metric: clear metrics dashboard + regression tests + ablation results.
Build strong reference-guided generation:
SDXL + IP-Adapter/Refiner
ControlNet (depth/canny/lineart) to constrain geometry
optional segmentation masks for clean compositing
Expose via GPU-backed FastAPI so it’s testable and reproducible.
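For illustration, the tunable generation knobs that the later agentic loop adjusts could be grouped into one parameter object (names and default values below are assumptions, not final settings):

```python
# Illustrative parameter object for a single generation attempt; names/defaults are assumptions.
from dataclasses import dataclass

@dataclass
class GenParams:
    prompt: str
    negative_prompt: str = "deformed, extra parts, wrong logo"
    ip_adapter_weight: float = 0.7   # how strongly the reference product image guides generation
    controlnet_type: str = "depth"   # "depth" | "canny" | "lineart"
    controlnet_weight: float = 0.8   # geometry constraint strength
    denoise_strength: float = 0.55   # img2img strength
    seed: int = 0
    use_mask: bool = True            # segmentation mask for clean compositing
```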
Automated checks run on each candidate output:
Identity consistency: embedding similarity between reference product and generated product crop
Color fidelity: color-difference checks on the product region (prevent drift)
Artifact detection: detect extra parts/warping/logo corruption via VLM judgments + heuristics
Output: a QC report with scores + fail reasons.
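A minimal sketch of the first two checks, assuming a CLIP image encoder from `transformers` and CIEDE2000 from `scikit-image` (the real module would add product-region cropping and VLM judgments):

```python
# Sketch: identity similarity (CLIP embeddings) + color drift (mean CIEDE2000) on the product region.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from skimage.color import rgb2lab, deltaE_ciede2000

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def identity_similarity(ref_crop: Image.Image, gen_crop: Image.Image) -> float:
    inputs = processor(images=[ref_crop, gen_crop], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])   # cosine similarity in [-1, 1]

def color_drift(ref_crop: Image.Image, gen_crop: Image.Image) -> float:
    size = (256, 256)
    ref = rgb2lab(np.asarray(ref_crop.convert("RGB").resize(size)) / 255.0)
    gen = rgb2lab(np.asarray(gen_crop.convert("RGB").resize(size)) / 255.0)
    return float(deltaE_ciede2000(ref, gen).mean())   # lower = closer colors
```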
An LLM-based controller reads the QC report and chooses the next action:
adjust prompt/negative prompt
change img2img strength / denoise
tweak ControlNet conditioning / weights
rerun with different seed
stop and reject if quality can’t be achieved within budget
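A simplified, rule-style sketch of that decision step (in practice the LLM reads the full QC report; the thresholds and field names below are illustrative, and `params` follows the `GenParams` sketch above):

```python
# Illustrative controller policy; an LLM would choose among the same actions from the QC report.
def next_action(report: dict, params: GenParams, attempt: int, max_attempts: int = 4):
    if attempt >= max_attempts:
        return "reject", params                              # retry budget exhausted
    if report["identity_similarity"] < 0.80:
        params.ip_adapter_weight = min(1.0, params.ip_adapter_weight + 0.1)
        params.denoise_strength = max(0.3, params.denoise_strength - 0.1)
        return "rerun", params                               # pull output closer to the reference
    if report["color_drift"] > 6.0:                          # CIEDE2000 units
        params.negative_prompt += ", color shift, tinted lighting"
        return "rerun", params
    if report["artifact_flags"]:
        params.controlnet_weight = min(1.0, params.controlnet_weight + 0.1)
        params.seed += 1
        return "rerun", params                               # tighten geometry, try a new seed
    return "accept", params
```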
Curate a test set of products + scenarios (including hard cases).
Track:
acceptance rate
failure taxonomy
GPU time per accepted output
quality vs cost trade-off
Run ablations to prove what improves results (verification alone vs agentic loop vs parameter changes).
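As a sketch, the headline numbers could be computed from per-run logs like this (record fields are assumptions):

```python
# Sketch: acceptance rate, GPU cost per accepted output, and a failure-reason tally from run records.
def summarize(runs: list[dict]) -> dict:
    accepted = [r for r in runs if r["status"] == "accepted"]
    total_gpu_s = sum(r["gpu_seconds"] for r in runs)
    reasons = [x for r in runs for x in r.get("fail_reasons", [])]
    return {
        "acceptance_rate": len(accepted) / len(runs) if runs else 0.0,
        "gpu_seconds_per_accepted": total_gpu_s / len(accepted) if accepted else float("inf"),
        "failure_counts": {reason: reasons.count(reason) for reason in set(reasons)},
    }
```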
Working API: input product image → outputs best asset + QC report + trace
Open/shareable repo with workflows + verification + evaluation harness
A short technical report with metrics and comparisons (baseline vs Verified GenAI)
How the funding will be used
Running many generations per product is required to measure and improve:
acceptance rate
quality vs cost curves
ablations (with/without verification, different workflows, settings)
This is the main cost driver because the agentic loop intentionally does multiple retries until a strict QC threshold is met.
A GPU-backed service (FastAPI + worker queue) with:
logging/tracing of agent decisions
storage for inputs/outputs and QC reports
basic monitoring (uptime, latency, error rates)
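A sketch of the per-attempt record such a service might log and store (schema is an assumption):

```python
# Illustrative per-attempt trace record stored alongside inputs, outputs, and QC reports.
from pydantic import BaseModel

class AttemptRecord(BaseModel):
    run_id: str
    attempt: int
    params: dict                     # generation settings used for this attempt
    qc_scores: dict                  # identity / color / artifact scores
    fail_reasons: list[str] = []
    action: str                      # controller decision: "rerun", "accept", or "reject"
    gpu_seconds: float
    output_path: str | None = None
```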
Building a small, representative benchmark set:
product reference images
scenario prompts/backgrounds
failure-case collection and labeling (artifact categories, identity drift, color drift)
This can be done mostly by me, with optional small paid annotation support if needed.
CI/regression tests to prevent quality regressions when changing:
ComfyUI workflows
model versions
verification thresholds
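A minimal pytest-style sketch of such a regression gate (the summary file path, field names, and pinned baselines are illustrative):

```python
# Sketch: CI regression test that fails if key metrics drop below pinned floors.
import json

BASELINE = {"acceptance_rate": 0.70, "mean_identity_similarity": 0.85}  # pinned from a reference run

def test_no_quality_regression():
    with open("eval/latest_summary.json") as f:   # written by the evaluation harness
        summary = json.load(f)
    assert summary["acceptance_rate"] >= BASELINE["acceptance_rate"] - 0.02
    assert summary["mean_identity_similarity"] >= BASELINE["mean_identity_similarity"] - 0.01
```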
Packaging and documentation so others can reproduce results.
I’m intentionally focusing spend on compute + minimal infra + evaluation, avoiding unnecessary overhead. The goal is a measurable, reproducible system rather than a flashy demo.
Team
Vahit FERYAD (PhD) – AI/ML Research Scientist (AI since 2014), based in Istanbul.
I will lead the project end-to-end: modeling choices, agent design, evaluation, and production deployment (API + GPU infra).
Optional support (only if budget allows): part-time annotation / QA help for labeling failure categories on a small evaluation set. This is not required to start and can be added later if it improves evaluation speed.
Built and deployed GPU-backed GenAI systems using ComfyUI + SDXL stacks, served behind FastAPI (async, batching, health checks).
Focus areas: high-consistency visual generation, workflow hardening, and automated verification components (CLIP/BLIP-style checks).
Public demos:
AI LightBox · Jewelry Virtual Try-On (high-consistency product visualization):
https://renderfy-ai-lightbox.hf.space
FitSuite AI · Fashion Virtual Try-On (image-to-video + compositing):
https://renderfy-fitsuite-ai.hf.space
Worked remotely with SMT (North Carolina) on Professional Fighters League (PFL) multi-camera video analytics, including real-time CV modeling for punch/kick speed analysis.
Demo video: https://drive.google.com/file/d/14RYbf63byBfrIr_9N-F0B9MjdaWrGA3k/view?usp=sharing
Efficient object detection for edge devices:
https://www.researchgate.net/publication/376783175_Efficient_Object_Detection_Model_for_Edge_Devices
Transformer-based NILM model using BERT (MDPI Energies):
https://www.mdpi.com/1996-1073/14/15/4649
This project is not “just prompts.” It needs:
evaluation design + rigorous metrics
agentic iteration logic
production-grade deployment discipline
That combination is exactly where I’ve repeatedly delivered.
Cause: VLM/CLIP-style similarity can miss subtle identity drift (small logo changes, minor geometry shifts) or over-reject valid outputs.
Outcome: low acceptance rate, too many false positives/negatives, weak improvement over baseline.
Cause: the agentic loop may need multiple retries to pass strict QC, driving GPU spend up.
Outcome: the system works technically but is not economically viable for production.
Cause: methods tuned for product visuals / try-on may not generalize to other categories or lighting/background conditions.
Outcome: results look good on a narrow demo set but don’t scale across varied products.
Cause: benchmark set is too small or biased; failure taxonomy incomplete.
Outcome: “improvements” don’t hold in real use, regression risk remains.
Cause: ComfyUI workflows + model versions + infra can be brittle; changes can silently degrade output quality.
Outcome: hard-to-reproduce results; maintenance burden increases.
Even in a “failure” scenario, we still produce useful assets:
A reproducible evaluation harness for product-accuracy in GenAI (baseline + metrics + failure taxonomy).
A set of verified baselines showing what does and doesn’t work (ablation results).
A production-ready API wrapper + logging/tracing around generation workflows.
Clear evidence on whether current VLM/CLIP methods are sufficient for strict product identity verification, and what gaps remain.
So the worst case is not “nothing works”; the worst case is “verification doesn’t meet a strict bar,” but we still generate a solid, publishable engineering/research package and a benchmark others can build on.
I have not raised any funding in the last 12 months (no grants, investors, or institutional funding).