From naive TAM search to an economic rationale engine
In Part 2, we go deep on Moncho's Market Sizing Engine V2: how we moved from a search-only TAM estimator to an economic rationale engine backed by a Statistical Memory Layer that ingests real statistics from BANBEIS, BBS, WHO, GSMA, and similar sources.
TL;DR
- V1 problem: Our original market sizing agent leaned heavily on web search to guess TAM. It could find impressive-looking numbers, but struggled with consistency and ground truth when the real data lived in PDF annex tables and statistical yearbooks.
- V2 shift: We rebuilt the system as an economic rationale engine sitting on top of a Statistical Memory Layer (SML). Agents now ask: "What do we already know?" before ever hitting the open web.
- How it works: Offline pipelines harvest atomic facts (student counts, pharmacy outlets, penetration rates, prices) into SML; online agents for customers, penetration, price, and reconciliation query SML first, then fall back to archetypes and search.
- Why it matters: Market sizing becomes repeatable and auditable. When an analyst corrects a number once, it becomes the new default in subsequent runs; PDFs and annex tables turn into structured memory instead of one-off LLM prompts.
In Part 1 we described how Moncho discovers and scores organizations and products using an agentic workflow: discovery agents, LLM-as-judge scoring, validation tools, and asset enrichment. That pipeline answers the question: who exists in this market and what do they offer?
This post focuses on the next question: how big is this market, economically, and what would revenue or unit economics look like if we serve it? That requires more than search: it needs a theory of the market, a memory of past measurements, and a way to reconcile contradictory statistics.
What went wrong with V1: search-only TAM
Our first-generation market sizing agent (V1) looked powerful on paper. It could:
- Search for "Bangladesh edtech market size" or "number of pharmacies in Dhaka" across the web.
- Read a handful of pages, pull out numbers, and compute a top-down TAM.
- Return a nicely formatted narrative with citations.
In practice, three problems kept showing up:
- Epistemic mismatch: The real numbers we cared about — students by level, facilities by type, coverage by district — typically lived in PDFs and annex tables from BANBEIS, BBS, World Bank, WHO, and similar institutions. Search results and blog posts were often secondary or stale.
- No memory: Each run started from scratch. Even when an analyst manually found the right table in a BANBEIS yearbook, the agent would not "remember" that fact the next time. We were burning tokens to rediscover the same numbers.
- Unstable rationale: Because every answer was built ad hoc from whatever the search API returned that day, two runs for the same segment could produce different TAMs and very different stories. That is unacceptable for planning and investment decisions.
The conclusion was clear: we needed a measurement engine, not just a search agent.
Design goal: an economic rationale engine
Instead of asking an LLM to "search for TAM", we reframed the problem as:
Given a market segment, country, and timeframe, construct a defensible economic rationale for market size: who the customers are, how many there are, how much they spend, and how confident we are.
That rationale is built from three layers:
- Phase −1: Statistical Memory Layer (SML) — atomic facts like `students_total`, `pharmacy_outlets`, `penetration_rate`, and `median_price`, each with country, year, unit, dimensions, and source.
- Phase 0: Archetype Knowledge Base (AKB) — economic priors and archetypes (e.g., what "high adoption" vs "low adoption" looks like, typical price bands by GDP band).
- Phase 1–2: Agentic reasoning — customer, penetration, and price agents that pull from SML and AKB, then a reconciliation agent that checks consistency and unit economics.
```mermaid
flowchart TB
    SML[("Statistical Memory Layer")]
    AKB[("Archetype Knowledge Base")]
    subgraph Agents[Market Sizing Engine V2]
        C[Customer Sizing Agent]
        P[Penetration Agent]
        R[Price Intelligence Agent]
        REC[Reconciliation Engine]
    end
    SML --> C
    SML --> P
    SML --> R
    AKB --> P
    AKB --> R
    C --> REC
    P --> REC
    R --> REC
    REC --> OUT[Market size + narrative]
    style SML fill:#fef9c3,stroke:#ca8a04
    style AKB fill:#e0f2fe,stroke:#0ea5e9
    style Agents fill:#f1f5f9,stroke:#64748b
    style REC fill:#dbeafe,stroke:#0ea5e9
```

SML = measured facts, AKB = archetypes, Agents = reasoning layer.
Phase −1: Statistical Memory Layer (SML)
SML is a dedicated Postgres table (`market_facts`) that stores atomic observations such as:

- `metric_key` — e.g. `students_total`, `customers_total`, `penetration_rate`, `median_price_usd`.
- `country`, `iso_code`, `year` — where and when this fact applies.
- `value`, `unit` — the measured number and its unit (count, %, USD/month, etc.).
- `dimensions` — JSON for qualifiers like `{ "education_level": "pre_primary", "geo_scope": "national" }`.
- `source_name`, `source_url`, `publication_year`, `confidence` — provenance and quality.
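One row can be sketched as a TypeScript type. The field names mirror the `market_facts` columns above; the concrete types and the example values are illustrative assumptions, not real BANBEIS figures.

```typescript
// Sketch of one SML row; types and example values are illustrative.
interface MarketFact {
  metricKey: string;                  // e.g. "students_total"
  country: string;                    // e.g. "Bangladesh"
  isoCode: string;                    // e.g. "BD"
  year: number;                       // year the fact applies to
  value: number;                      // the measured number
  unit: string;                       // "count", "%", "USD/month", ...
  dimensions: Record<string, string>; // qualifiers, stored as JSON
  sourceName: string;                 // provenance
  sourceUrl: string;
  publicationYear: number;
  confidence: number;                 // 0..1 quality score
}

// Example fact: national pre-primary enrolment.
const exampleFact: MarketFact = {
  metricKey: "students_total",
  country: "Bangladesh",
  isoCode: "BD",
  year: 2025,
  value: 3_500_000, // hypothetical number, not a real BANBEIS figure
  unit: "count",
  dimensions: { education_level: "pre_primary", geo_scope: "national" },
  sourceName: "BANBEIS Education Statistics",
  sourceUrl: "https://banbeis.gov.bd",
  publicationYear: 2025,
  confidence: 0.9,
};
```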
Offline agents and scripts fill SML via a PDF ingestion pipeline:
- Locate reports: Identify BANBEIS yearbooks, BBS HIES, WHO health expenditure tables, GSMA connectivity reports, etc. for a country.
- Extract tables: Use a PDF extraction kit (e.g., MinerU) to convert layout + tables + OCR into machine-readable tables.
- Normalize: Run report-specific normalizers (e.g., BANBEIS pre-primary enrolment) that map table cells into `NewMarketFactInput` records.
- Upsert into SML: Write facts into `market_facts` with `ON CONFLICT` so corrections and updated editions override older values.
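As a sketch, the upsert behind the last step might look like the statement below. The column names follow the `market_facts` fields described in this post; the conflict target (`metric_key`, `country`, `year`, `dimensions`) is our assumption about the table's unique key, not a confirmed schema.

```typescript
// Build the upsert statement for one fact. The conflict target is an
// assumed unique key; the update clause is what makes corrections and
// newer report editions override older stored values.
function buildUpsertSql(table: string): string {
  return [
    `INSERT INTO ${table}`,
    `  (metric_key, country, iso_code, year, value, unit, dimensions,`,
    `   source_name, source_url, publication_year, confidence)`,
    `VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)`,
    `ON CONFLICT (metric_key, country, year, dimensions)`,
    `DO UPDATE SET value = EXCLUDED.value,`,
    `              source_name = EXCLUDED.source_name,`,
    `              publication_year = EXCLUDED.publication_year,`,
    `              confidence = EXCLUDED.confidence;`,
  ].join("\n");
}
```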
Crucially, this all happens offline. When the Market Sizing Engine runs, it does not try to parse a 300-page PDF on the fly. It simply asks SML: "What is the best fact we have for `students_total` in Bangladesh, pre-primary, national, 2025?"
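A minimal sketch of that lookup, assuming facts have already been loaded into memory. `getBestFact` is the name used in the phase table later in this post; the filtering and the tie-break rule (newest publication, then highest confidence) are our assumptions about its behavior, not the real implementation.

```typescript
// Pick the best stored fact for a metric, filtered on dimensions.
interface Fact {
  metricKey: string;
  country: string;
  value: number;
  dimensions: Record<string, string>;
  publicationYear: number;
  confidence: number;
}

function getBestFact(
  facts: Fact[],
  metricKey: string,
  country: string,
  dims: Record<string, string>,
): Fact | undefined {
  return facts
    .filter(
      (f) =>
        f.metricKey === metricKey &&
        f.country === country &&
        // every requested dimension must match the stored qualifier
        Object.entries(dims).every(([k, v]) => f.dimensions[k] === v),
    )
    .sort(
      // prefer the newest publication, then the highest confidence
      (a, b) =>
        b.publicationYear - a.publicationYear || b.confidence - a.confidence,
    )[0];
}
```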
Online: Market sizing agents on top of SML
On top of SML and AKB, we run four main agents for V2:
- Customer Sizing Agent: Answers "how many potential customers are there?" by first querying SML for `customers_total` or related metrics, then falling back to demographic priors and search if SML is empty.
- Penetration Agent: Finds or infers the fraction of those customers who already adopt the product/service. It reads `penetration_rate` from SML when available, then uses KB benchmarks and analog inference as a fallback.
- Price Intelligence Agent: Normalizes prices to meaningful units (per visit, per unit, per month) and country macros. It prefers structured price distributions from SML (median, urban/rural splits) before looking at web price comparators.
- Reconciliation Engine: Checks whether customers × penetration × price align with known macros (GDP, total spend, plausible per capita spend) and flags inconsistencies for human review.
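The memory-first fallback chain shared by these agents can be sketched as below. The step names mirror the decision priority described in this post; the function shapes are illustrative, not the real agent API.

```typescript
// Each estimate records which layer produced it, so runs stay auditable.
type Estimate = { value: number; source: "sml" | "akb" | "search" };

// Try the Statistical Memory Layer first, then archetype priors, and
// only hit web search when both memory layers come up empty.
function estimateMetric(
  fromSml: () => number | undefined,  // stored fact lookup
  fromAkb: () => number | undefined,  // archetype / benchmark prior
  fromSearch: () => number,           // LLM web search, last resort
): Estimate {
  const sml = fromSml();
  if (sml !== undefined) return { value: sml, source: "sml" };
  const akb = fromAkb();
  if (akb !== undefined) return { value: akb, source: "akb" };
  return { value: fromSearch(), source: "search" };
}
```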
```mermaid
flowchart LR
    subgraph Offline[Offline harvesting]
        H1["PDF locators<br/>(BANBEIS, BBS, WHO, GSMA)"]
        H2["PDF extraction<br/>(tables + OCR)"]
        H3["Normalizers<br/>(per report family)"]
        H4["Upsert into<br/>market_facts"]
    end
    subgraph Online[Online market sizing run]
        CS[Customer Sizing Agent]
        PS[Penetration Agent]
        PR[Price Intelligence Agent]
        RC[Reconciliation Agent]
    end
    H1 --> H2 --> H3 --> H4 --> SML[("SML: market_facts")]
    SML --> CS
    SML --> PS
    SML --> PR
    CS --> RC
    PS --> RC
    PR --> RC
    RC --> RES["Segment TAM, SAM, SOM + narratives"]
    style Offline fill:#f9fafb,stroke:#cbd5f5
    style Online fill:#eff6ff,stroke:#3b82f6
    style SML fill:#fef9c3,stroke:#ca8a04
    style RC fill:#dbeafe,stroke:#0ea5e9
```

Offline harvesting feeds SML; online agents read from SML first and only then hit the open web.
Selective harvesting: smart PDF extraction, not a data dump
A report like the BBS Bangladesh Statistical Yearbook runs to 500+ pages with hundreds of tables — district-level crop yields, gender ratios for minor age bands, administrative breakdowns of government staff. None of that is useful for market sizing. Ingesting it all would inflate SML with noise and make the smart-selection logic slower and less precise.
The solution is a two-step filter that runs before any row touches `market_facts`:

- TOC and heading scan: When MinerU processes the PDF, it produces layout data including large-font headings and section titles. We build a rough section tree from these, so every table gets tagged with its nearest heading path — for example `Part III → Health Statistics → Facilities → Table 7.4`.
- Profile-based selector filter: Each report family has a report profile config (e.g. `scripts/harvesting/report_profiles/banbeis_edu_stats_v2025.json`) that declares:
  - `target_metrics` — the exact `metric_key` values this report can contribute (e.g. `students_total`, `institutions_total`).
  - `table_selectors` — caption/heading patterns that mark a table as in-scope (e.g. `"number of students by level"`, `"institutions by type"`).
  - `exclude_patterns` — patterns to skip regardless (e.g. `"district-wise"`, `"sub-district"` for granularity we do not use).
Only tables that match a selector and do not match an exclude pattern are emitted as `PdfTable` objects. Everything else is discarded before it ever reaches a normalizer. In practice a 500-page BBS report might yield 8–12 in-scope tables, not 400.
Every run logs `tables_scanned`, `tables_selected`, and `facts_upserted` so we can tune selectors without guessing.
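The selector filter itself can be sketched in a few lines. The profile shape mirrors the `target_metrics` / `table_selectors` / `exclude_patterns` fields above; matching on lowercase substrings is a simplification of the real patterns.

```typescript
// A report profile declares what a report family may contribute to SML.
interface ReportProfile {
  targetMetrics: string[];   // metric_key values this report can emit
  tableSelectors: string[];  // caption/heading patterns marking in-scope tables
  excludePatterns: string[]; // patterns that discard a table regardless
}

// A table is kept only if some selector matches and no exclude pattern does.
function isInScope(captionOrHeading: string, profile: ReportProfile): boolean {
  const text = captionOrHeading.toLowerCase();
  const selected = profile.tableSelectors.some((s) => text.includes(s));
  const excluded = profile.excludePatterns.some((p) => text.includes(p));
  return selected && !excluded;
}

// Illustrative profile, echoing the BANBEIS example above.
const banbeisProfile: ReportProfile = {
  targetMetrics: ["students_total", "institutions_total"],
  tableSelectors: ["number of students by level", "institutions by type"],
  excludePatterns: ["district-wise", "sub-district"],
};
```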
Memory-first, search-second
The most important behavioral change in V2 is this rule:
Always try SML / MBS lookup first. Use web search only as a fallback when the memory is empty or stale.
That changes how the agents think:
- If a BANBEIS table has already been ingested into SML, the agent does not try to "rediscover" those enrolment numbers via web search — it simply uses the stored fact.
- If an analyst corrects the number of retail pharmacies in Dhaka and saves it through our internal tools, all future runs automatically use that corrected fact via SML upsert semantics.
- Analog inference is grounded in stored facts: instead of saying "Bangladesh lags India by 5–15%" as a vague rule of thumb, we have actual measured gaps in SML for specific metrics and years.
This makes the engine both more conservative (it refuses to guess when memory and primary sources are empty) and more consistent (runs converge toward the same numbers as SML improves).
All phases at a glance
Here is every phase in V2, what it does, what it uses, and where LLMs are actually involved.
| Phase | Name | What it does | Tools / APIs | LLM? |
|---|---|---|---|---|
| −1 | Statistical Memory Layer (SML) | Offline: scan PDF TOC, filter tables by report profile selectors, normalize matching tables into atomic facts (`market_facts` table). Only in-scope metrics are stored — never a full data dump. | MinerU (Layout+Tables+OCR), Python extractor, TS orchestrator, report profile configs, Supabase upsert | No — rule-based normalizers only |
| G | Pre-Flight Scoping (HITL) | Clarify the segment definition precisely (e.g. "pharmacy retail" vs "drug shops") before spending tokens. Human approves scope before the engine continues. | Internal taxonomy tools, AKB priors | Yes — LLM drafts scope proposal; human approves |
| 0 | Archetype Knowledge Base (AKB) + MBS | Static economic priors: B2C/B2B magnitude bands, penetration archetypes, 12 frequency archetypes, country macros. MBS provides dynamic benchmarks backed by SML. | TypeScript config files (git-versioned), Supabase `market_facts` for SML-backed MBS | No — pure lookup |
| 1 | Segment Classifier | Reads the approved scope and emits a Research Blueprint: B2C/B2B/B2G classification, expected customer magnitude, penetration band, frequency archetype, price band, and max plausible TAM ceiling. | AKB sector taxonomy, magnitude priors, country macro constraints | Yes — GPT-4o classifies segment and selects archetypes |
| 2a | Customer Sizing Agent | SML lookup → KB lookup → web search → analog inference. Returns addressable customer count with confidence and band check. | SML (`getBestFact` for `customers_total`), Exa search, BBS/BANBEIS/WHO report links | Only if SML/KB miss — GPT-4o for search and reasoning |
| 2b | Penetration Rate Agent | SML lookup → KB benchmarks → analog inference using stored SML gaps as evidence → LLM web search. Returns penetration fraction 0–1 with source. | SML (`getBestFact` for `penetration_rate`), GSMA/World Bank search, country benchmark library | Last resort only — GPT-4o via Exa search |
| 2c | Purchase Frequency Agent | Archetype-first: classifies segment into one of 12 frequency archetypes before any search. Getting this wrong causes order-of-magnitude errors (e.g. pharmacies at durable-long = $23M; at monthly = $2.3B). | Frequency archetype library (Phase 0), Exa search for behavioral studies | GPT-4o for ambiguous segments; archetype covers most cases without LLM |
| 2d | Price Intelligence Agent | SML lookup for stored price distributions (median, urban/rural) → web price comparators → currency normalization (BDT→USD) → transaction unit normalization (per visit, per unit, per month). | SML (`getBestFact` for `median_price_usd`), Numbeo/e-commerce search, country macro (currency) | Only if SML/KB miss — GPT-4o for search and normalization |
| 3 | Reconciliation Engine | Runs 7 deterministic sanity checks across all four agent outputs: plausibility vs GDP, per-capita spend, penetration band, price band, frequency band, customer magnitude, and top-down vs bottom-up cross-check. Emits PASS / PASS_WITH_FLAGS / INCOMPLETE / FAIL. | Country macro constraints (AKB), blueprint bands, all Phase 2 outputs | GPT-4o for narrative only — sanity math is fully deterministic |
Decision priority on every online run: SML fact → AKB/MBS archetype → analog inference → web search (last resort). The more SML is populated offline, the fewer LLM calls Phase 2 needs — directly reducing cost and improving consistency.
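The arithmetic behind those Phase 2 outputs and the first reconciliation check can be written out directly. This is a sketch: the function names and the 5% of GDP cap are ours, not the engine's (the source only says the checks compare TAM against GDP and per-capita spend).

```typescript
// Bottom-up TAM identity: each factor comes from one Phase 2 agent.
function bottomUpTam(
  customers: number,       // Customer Sizing Agent (2a)
  penetration: number,     // Penetration Rate Agent (2b), fraction 0-1
  annualFrequency: number, // Purchase Frequency Agent (2c), purchases/year
  unitPriceUsd: number,    // Price Intelligence Agent (2d)
): number {
  return customers * penetration * annualFrequency * unitPriceUsd;
}

// One deterministic sanity check: a TAM larger than a few percent of GDP
// is almost certainly an archetype or frequency error, not a real market.
// The 5% threshold is illustrative.
function gdpSanityCheck(tamUsd: number, gdpUsd: number, maxShare = 0.05): string {
  return tamUsd <= gdpUsd * maxShare ? "PASS" : "FAIL";
}
```

Note how the frequency term carries the order-of-magnitude risk the table flags for Phase 2c: holding customers, penetration, and price fixed, moving from roughly one purchase a year to one a month multiplies TAM by 12.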
Human loop and governance
As with our discovery workflow, humans stay in the loop where it matters:
- Analysts decide which reports to harvest and which metrics to prioritize for a given country or sector.
- They review low-confidence facts and reconciliation flags, then either approve, correct, or discard them.
- They can annotate facts with notes (e.g., "BANBEIS 2025 enrolment table excludes kindergarten") that downstream agents see.
The goal is not to automate judgment away, but to automate everything around judgment: extraction, normalization, storage, and retrieval.
What this unlocks next
By treating market sizing as an economic rationale engine sitting on a Statistical Memory Layer, we open up a few powerful next steps:
- Scenario analysis: Re-run the same segment under different pricing, adoption, or policy assumptions while keeping the underlying facts constant.
- Cross-country comparison: Compare Bangladesh vs India vs Vietnam on the same metric definitions and years, using a shared fact base.
- Incremental refresh: When BANBEIS or BBS releases a new edition, only the relevant SML facts change; the agentic logic stays the same.
In future posts, we'll dig into how we expose this capability to analysts and partners: custom blueprints, interactive reconciliation views, and the unit economics of running these agents at scale.